Data Acquisition: Data Ingestion or Data Intake
For the Reference Architecture for AI, we used the Kaggle CORD-19 dataset: "COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 1,000,000 scholarly articles, including over 400,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease."
The example script parses each document, extracts its body_text, and saves the text under paragraphs in a Redis cluster.
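
The parsing and storage step could look roughly like the sketch below. It is not the project's actual script. It assumes each CORD-19 document is a JSON file with a paper_id field and a body_text list whose entries carry a text field. The key layout paragraphs:<paper_id>:<index>, the function names, and the host and port passed to connect_cluster are hypothetical choices made for illustration. The client only needs a set method, so a redis-py RedisCluster instance (redis-py 4.1 or later) should work.

```python
import json
from pathlib import Path


def extract_paragraphs(doc):
    """Return the non-empty paragraph texts from a document's body_text list."""
    return [
        entry["text"].strip()
        for entry in doc.get("body_text", [])
        if entry.get("text", "").strip()
    ]


def store_paragraphs(client, paper_id, paragraphs):
    """Write each paragraph under a hypothetical key 'paragraphs:<paper_id>:<index>'."""
    keys = []
    for index, text in enumerate(paragraphs):
        key = f"paragraphs:{paper_id}:{index}"
        client.set(key, text)
        keys.append(key)
    return keys


def ingest_directory(client, json_dir):
    """Parse every *.json file in json_dir and return the number of paragraphs stored."""
    total = 0
    for path in sorted(Path(json_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            doc = json.load(handle)
        paragraphs = extract_paragraphs(doc)
        total += len(store_paragraphs(client, doc["paper_id"], paragraphs))
    return total


def connect_cluster(host="127.0.0.1", port=30001):
    """Connect to a Redis cluster node (assumed address; requires redis-py >= 4.1)."""
    from redis.cluster import RedisCluster  # imported lazily so parsing works without redis

    return RedisCluster(host=host, port=port, decode_responses=True)
```

With a cluster running, ingest_directory(connect_cluster(), "document_parses/pdf_json") would load one folder of parsed papers. The folder name is an assumption about where the dataset was unpacked.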