Data Acquisition

Data Acquisition: Data Ingestion or Data Intake

Alexander Mikhalev published on

For the Reference Architecture for AI, we used the COVID-19 Open Research Dataset (CORD-19) from Kaggle. Kaggle describes it as follows: "CORD-19 is a resource of over 1,000,000 scholarly articles, including over 400,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease."

Ingest documents

The example script parses each document, extracts the text from its body_text field, and saves it under paragraphs in a Redis cluster.
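
The ingestion step can be sketched roughly as follows. This is a minimal illustration, not the project's actual script. It assumes the CORD-19 JSON layout, in which body_text is a list of objects that each carry a text field. The key pattern paragraphs:<paper_id>, the function names, and the InMemoryStore stand-in are hypothetical and exist only so the sketch runs without a Redis cluster.

```python
import json


def extract_paragraphs(doc):
    """Return the non-empty paragraph texts from a CORD-19 style document."""
    return [
        entry["text"].strip()
        for entry in doc.get("body_text", [])
        if entry.get("text", "").strip()
    ]


def ingest_document(doc, client):
    """Store one document's paragraphs and return the key that was written.

    The 'paragraphs:<paper_id>' key layout is an assumption for illustration.
    `client` is any object with a set(key, value) method.
    """
    key = f"paragraphs:{doc['paper_id']}"
    client.set(key, " ".join(extract_paragraphs(doc)))
    return key


class InMemoryStore:
    """Hypothetical stand-in exposing only set/get, so no Redis is needed."""

    def __init__(self):
        self.data = {}

    def set(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)


# A tiny made-up document in the CORD-19 shape (not real dataset content).
sample = json.loads("""
{
  "paper_id": "abc123",
  "body_text": [
    {"text": "First paragraph.", "section": "Introduction"},
    {"text": "   "},
    {"text": "Second paragraph.", "section": "Methods"}
  ]
}
""")

store = InMemoryStore()
key = ingest_document(sample, store)
print(key, "->", store.get(key))
```

Because ingest_document only calls set(key, value), the stand-in could be replaced by a real cluster client, for example redis-py's RedisCluster, without changing the parsing code.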