CovidPubGraph: a FAIR knowledge graph on COVID-19 publications

RDF data model design

The ontology behind our knowledge graph was derived from the source it was extracted from, i.e., the full texts of the publications provided as part of the CORD-19 dataset. The ontology was designed to support search, question answering, and machine learning. At the time of writing, our dataset is based on CORD-19 version 2021-11-08 (https://www.semanticscholar.org/cord19/download). Our conversion process is implemented in Python 3.6 with RDFLib 5.0.0 (https://github.com/RDFLib/rdflib). We make our source code publicly available (https://github.com/dice-group/COVID19DS) to ensure the reproducibility of our results and to allow rapid conversion of new versions of CORD-19. A version of the generated RDF dataset can be found on Zenodo22.
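As a rough illustration of this conversion, the sketch below reads one CORD-19 full-text JSON file and emits a few triples with RDFLib. It is not the project's actual code: the cvdo namespace IRI, the cvdo:Paper class name, and the output file name are assumptions; only the resource base IRI and the CORD-19 JSON fields (paper_id, metadata.title) follow the text and the public CORD-19 schema.

```python
import json
from pathlib import Path

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DCTERMS, RDF

CVDO = Namespace("https://covid-19ds.data.dice-research.org/ontology/")  # assumed ontology IRI
RES = Namespace("https://covid-19ds.data.dice-research.org/resource/")

def convert_paper(json_path: Path, graph: Graph) -> None:
    """Convert one CORD-19 full-text JSON file into a small set of RDF triples."""
    record = json.loads(json_path.read_text(encoding="utf-8"))
    paper = RES[record["paper_id"]]              # URI minted from the CORD-19 paper id
    graph.add((paper, RDF.type, CVDO.Paper))     # cvdo:Paper is an assumed class name
    graph.add((paper, DCTERMS.title, Literal(record["metadata"]["title"])))

g = Graph()
for path in Path("document_parses/pdf_json").glob("*.json"):  # CORD-19 directory layout
    convert_paper(path, g)
g.serialize(destination="covidpubgraph.ttl", format="turtle")
```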

RDF namespaces

To facilitate the reuse of our knowledge graph, we represent our data in widely used vocabularies and namespaces, as shown in Listing 1.
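In RDFLib, such prefix bindings can be reproduced as sketched below; the exact prefix set and IRIs should be taken from Listing 1, and the cvdo IRI shown here is an assumption.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import DCTERMS, FOAF

g = Graph()
# Commonly used namespace IRIs; the cvdo IRI is an assumption, not taken from Listing 1.
g.bind("foaf", FOAF)
g.bind("dcterms", DCTERMS)
g.bind("bibo", Namespace("http://purl.org/ontology/bibo/"))
g.bind("fabio", Namespace("http://purl.org/spar/fabio/"))
g.bind("nif", Namespace("http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"))
g.bind("its", Namespace("http://www.w3.org/2005/11/its/rdf#"))
g.bind("cvdo", Namespace("https://covid-19ds.data.dice-research.org/ontology/"))
```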

RDF data model

Figure 1 shows the most important classes (e.g., articles, authors, sections, bibliographic entries, and named entities) as well as predicates (e.g., first name, last name, license).

Fig. 1

UML class diagram of the CovidPubGraph ontology.

Papers

We represent the bibliographic information of articles using four vocabularies: bibo, bibtex, fabio, and plan (see the namespaces above). Important attributes include the title, PMID, DOI, publication date, publisher, publisher URI, license, and authors. For each paper, we store provenance information; in particular, our code records a reference to the original CORD-19 raw files as well as the time at which the resource was generated. The URIs of our generated Paper resources follow the format https://covid-19ds.data.dice-research.org/resource/<Paper ID>, where <Paper ID> is the unique identifier of the paper in the CORD-19 dataset. An example resource is given in Listing 2.
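The sketch below illustrates how such a Paper resource could be minted and populated with RDFLib. The class choice, the property selection, and all identifier values are illustrative assumptions; only the resource base IRI follows the format given above.

```python
from datetime import datetime, timezone

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF, XSD

RES = Namespace("https://covid-19ds.data.dice-research.org/resource/")
BIBO = Namespace("http://purl.org/ontology/bibo/")
FABIO = Namespace("http://purl.org/spar/fabio/")
PROV = Namespace("http://www.w3.org/ns/prov#")

g = Graph()
paper = RES["0a00a6df"]  # hypothetical CORD-19 paper id
g.add((paper, RDF.type, FABIO.ResearchPaper))  # class choice is an assumption
g.add((paper, DCTERMS.title, Literal("A hypothetical COVID-19 study")))
g.add((paper, BIBO.doi, Literal("10.0000/example")))  # placeholder DOI
g.add((paper, BIBO.pmid, Literal("00000000")))        # placeholder PMID
# Provenance: the raw CORD-19 file the resource came from and the conversion time.
g.add((paper, PROV.wasDerivedFrom,
       URIRef("file:///cord19/document_parses/pdf_json/0a00a6df.json")))
g.add((paper, PROV.generatedAtTime,
       Literal(datetime.now(timezone.utc).isoformat(), datatype=XSD.dateTime)))
```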

Authors

Authors are represented in FOAF (http://xmlns.com/foaf/spec/). Important attributes include first, middle, and last names as well as email addresses and institutions.
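A minimal RDFLib sketch of an author resource follows; the author URI scheme and the attribute values are hypothetical, while foaf:Person, foaf:firstName, foaf:lastName, and foaf:mbox are standard FOAF terms.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import FOAF, RDF

RES = Namespace("https://covid-19ds.data.dice-research.org/resource/")

g = Graph()
author = RES["author_0001"]  # hypothetical author URI
g.add((author, RDF.type, FOAF.Person))
g.add((author, FOAF.firstName, Literal("Ada")))
g.add((author, FOAF.lastName, Literal("Lovelace")))
g.add((author, FOAF.mbox, URIRef("mailto:ada@example.org")))  # placeholder address
```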

Sections

The articles are further subdivided into sections, and the corresponding information is expressed with the SALT ontology23. We keep track of a set of predefined sections, including Abstract, Introduction, Background, Related Work, Preliminaries, Conclusion, Experiment, and Discussion. If another section title appears in a document, we assign it to the default section Body. We then attach each section using cvdo:hasSection. An example is given in Listing 3.
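The fallback to the default Body section can be sketched as follows. The SALT namespace IRI, the salt:Section class, the cvdo IRI, and the section URI scheme are assumptions; the list of predefined titles follows the text above.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS

CVDO = Namespace("https://covid-19ds.data.dice-research.org/ontology/")  # assumed IRI
SALT = Namespace("http://salt.semanticauthoring.org/ontologies/sdo#")    # assumed IRI

# Predefined section titles; any other title is mapped to the generic Body section.
KNOWN_SECTIONS = {"abstract", "introduction", "background", "related work",
                  "preliminaries", "conclusion", "experiment", "discussion"}

def add_section(graph: Graph, paper: URIRef, title: str, index: int) -> URIRef:
    """Create a section resource and attach it to its paper via cvdo:hasSection."""
    label = title if title.lower() in KNOWN_SECTIONS else "Body"
    section = URIRef(f"{paper}_section_{index}")  # hypothetical URI scheme
    graph.add((section, RDF.type, SALT.Section))
    graph.add((section, RDFS.label, Literal(label)))
    graph.add((paper, CVDO.hasSection, section))
    return section
```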

References

References to other sections, figures, and tables in the text are resolved and represented in RDF. Important attributes are the reference’s anchor (for example, the number of the section, figure, or table), the context in which it occurs (nif:referenceContext), its position in the text (nif:beginIndex, nif:endIndex), and the referenced object (its:taIdentRef), which can be a paper (BibEntry), a figure (Figure), or a table (Table).
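Assuming the properties listed above come from the NIF core and ITS RDF namespaces, a single in-text reference could be produced roughly as follows; all resource URIs and offsets are hypothetical.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import XSD

NIF = Namespace("http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#")
ITS = Namespace("http://www.w3.org/2005/11/its/rdf#")
RES = Namespace("https://covid-19ds.data.dice-research.org/resource/")

g = Graph()
section = RES["paper123_section_2"]  # hypothetical section resource (the reference context)
ref = RES["paper123_ref_7"]          # hypothetical reference resource
g.add((ref, NIF.anchorOf, Literal("Figure 1A")))  # the anchor string as it appears in the text
g.add((ref, NIF.beginIndex, Literal(412, datatype=XSD.nonNegativeInteger)))
g.add((ref, NIF.endIndex, Literal(421, datatype=XSD.nonNegativeInteger)))
g.add((ref, NIF.referenceContext, section))
g.add((ref, ITS.taIdentRef, RES["paper123_figure_1"]))  # the referenced Figure resource
```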

Named entities

As machine learning and question answering often rely on named entities and their locations in texts, we annotate CORD-19 articles accordingly and represent this information with the NIF 2.0 core ontology (https://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core/nif-core.html). Further details on our entity linking process are described in the Linking section.

Examples of RDF resources

Listing 2 provides an example of an article represented as an RDF resource. Listing 3 shows an example of a section resource. Each section is linked to its text string via nif:isString and to its title via bibtex:hasTitle. If a section includes references to other documents, figures, or tables (for example, (1-3), (4,5), Figure 1A, Fig. 1, etc.), we represent such a reference in RDF as follows: the anchor of the reference (for example, the number of a figure) is given by nif:anchorOf, the start position of the reference by nif:beginIndex, the end position by nif:endIndex, the source section by nif:referenceContext, and the referenced target (for example, a bibliography entry, a figure, or a table) by its:taIdentRef. An example is shown in Listing 4. Listing 5 shows an example of provenance information.

Linking

We link our dataset to other data sources to ensure its reusability and integrability as well as to enhance its use for research, question answering, and structured machine learning. We generate links from our article and author resources to related publicly available knowledge bases. Additionally, we extract named entities related to diseases, genes and cells from all converted articles and link them to three external knowledge bases.

Linking publications, authors, and institutes

We connect the publications in our knowledge graph to six other datasets using the owl:sameAs and rdfs:seeAlso predicates (see the first six rows of Table 2). To our knowledge, these six datasets are the most relevant RDF datasets that deal with the same publication data. We leave it to future work to relate our dataset to non-RDF datasets such as Covid19-KG12 and Wikidata Scholia24.

CORD19-NEKG and our dataset use the same CORD-19 paper IDs, which makes the linking process straightforward. For LitCovid, we use the PubMed Central ID (PMC-ID), which is provided as part of CORD-19. For Covid-19-Literature and CORD-19-on-FHIR, we employ the SHA hash values of CORD-19. Additionally, we link our dataset to the JSON files of the publications in CORD-19-on-FHIR with the predicate rdfs:seeAlso. Listing 6 shows an example of linking publications from our CovidPubGraph dataset to CORD19-NEKG and LitCovid.
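A minimal sketch of how such identifier-based links could be emitted with RDFLib is shown below; the helper function and the external target URIs are hypothetical and only illustrate the owl:sameAs/rdfs:seeAlso distinction described above.

```python
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import OWL, RDFS

RES = Namespace("https://covid-19ds.data.dice-research.org/resource/")
g = Graph()

def link_paper(paper_id: str, external_uri: str, see_also: bool = False) -> None:
    """Link one of our papers to an external resource that shares the same identifier."""
    predicate = RDFS.seeAlso if see_also else OWL.sameAs
    g.add((RES[paper_id], predicate, URIRef(external_uri)))

# Hypothetical target URIs; the actual CORD19-NEKG and CORD-19-on-FHIR URI schemes differ.
link_paper("0a00a6df", "https://example.org/cord19-nekg/paper/0a00a6df")
link_paper("0a00a6df", "https://example.org/cord-19-on-fhir/0a00a6df.json", see_also=True)
```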

We connect both our author and our institute resources to the Microsoft Academic Knowledge Graph (MAKG)25 using the latest version of our link discovery framework LIMES26. To link the authors, LIMES is configured to discover owl:sameAs links between our foaf:Person instances and MAKG’s makg:Author instances. To link the institutes, we look for links between instances of type dbo:EducationalInstitution from our knowledge graph and MAKG resources of type makg:Affiliation. The LIMES configuration files for linking authors and institutes are available in our source code (https://github.com/dice-group/COVID19DS).
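LIMES itself is driven by the XML configuration files mentioned above. The sketch below only shows how its accepted links could be merged back into the knowledge graph, under the assumption that they are exported as N-Triples; the file names are placeholders.

```python
from rdflib import Graph

# Merge the author/institute links accepted by LIMES back into the knowledge graph.
kg = Graph()
kg.parse("covidpubgraph.ttl", format="turtle")        # placeholder file name

links = Graph()
links.parse("limes_accepted_links.nt", format="nt")   # assumed N-Triples export of owl:sameAs links

kg += links
kg.serialize(destination="covidpubgraph_linked.ttl", format="turtle")
```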

Linking named entities

We apply entity linking to connect entities derived from article sections to other knowledge bases. This process consists of two steps: (1) entity extraction and (2) entity linking. For the extraction step, we use scispaCy27 version 0.2.4 in conjunction with the model en_ner_bionlp13cg_md (https://github.com/allenai/scispacy), which allows the extraction of biomedical entities such as diseases, genes, and cells. scispaCy is a specialized NLP library built on top of the spaCy library (https://spacy.io/). The NER model in spaCy is a transition-based chunking model that represents tokens as hashed, embedded representations of the prefix, suffix, shape, and lemmatized features of individual words27.
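A minimal extraction sketch with scispaCy is given below; the input sentence is made up, and the model must first be installed from the scispaCy release page.

```python
import spacy

# Assumes `pip install scispacy` and the en_ner_bionlp13cg_md model installed from the
# scispaCy releases (https://github.com/allenai/scispacy).
nlp = spacy.load("en_ner_bionlp13cg_md")

doc = nlp("Folic acid supplementation was studied in SARS-CoV-2 infected cell lines.")
for ent in doc.ents:
    # ent.label_ is a BioNLP13CG category such as GENE_OR_GENE_PRODUCT, CELL, or SIMPLE_CHEMICAL.
    print(ent.text, ent.label_, ent.start_char, ent.end_char)
```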

For the linking step, we adapt the MAG entity linking framework28 to link our extracted resources to the three knowledge bases SIDER19, KEGG20, and DrugBank18, using their RDF versions provided by the Bio2RDF project (https://bio2rdf.org/). We scale MAG by creating a search index for each of the external knowledge bases and running MAG once per knowledge base. The output is a set of entities in the NLP Interchange Format (NIF) (https://persistence.uni-leipzig.org/nlp2rdf/). In Listing 7, we provide an example for the named entity “folic acid”.
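The resulting NIF annotation for a linked mention could look roughly as follows; the mention URI, the offsets, and the Bio2RDF DrugBank target are illustrative and not taken from Listing 7.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

NIF = Namespace("http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#")
ITS = Namespace("http://www.w3.org/2005/11/its/rdf#")
RES = Namespace("https://covid-19ds.data.dice-research.org/resource/")

g = Graph()
mention = RES["paper123_entity_42"]  # hypothetical entity mention resource
g.add((mention, RDF.type, NIF.Phrase))
g.add((mention, NIF.anchorOf, Literal("folic acid")))
g.add((mention, NIF.beginIndex, Literal(0, datatype=XSD.nonNegativeInteger)))
g.add((mention, NIF.endIndex, Literal(10, datatype=XSD.nonNegativeInteger)))
g.add((mention, NIF.referenceContext, RES["paper123_section_2"]))
g.add((mention, ITS.taIdentRef, URIRef("http://bio2rdf.org/drugbank:DB00158")))  # illustrative target
```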

Automated CovidPubGraph generation

CORD-19 released new data almost every day during the second half of 2020. Because of this, we need to automate the process of updating our knowledge graph. To this end, we have developed a pipeline that automates the whole process, shown in Fig. 2; a minimal orchestration sketch is given after the list. This pipeline contains several stages:

  1. Crawling. We start by downloading the most recent version as a zip file from the CORD-19 website, which includes a CSV metadata file and the parsed JSON full texts of scientific articles on the coronavirus.

  2. RDF conversion. Next, we convert the CORD-19 data into an RDF knowledge graph with a Python script using the RDFLib library (https://github.com/RDFLib/rdflib).

  3. Linking. We integrate the AGDISTIS library (https://github.com/dice-group/AGDISTIS) into the build process to extract and link named entities from the abstracts of scientific papers. Additionally, we link publications and authors to other datasets using the link discovery framework LIMES (https://github.com/dice-group/LIMES).

  4. KG update. We upload the new version of CovidPubGraph to the HOBBIT server (https://hobbitdata.informatik.uni-leipzig.de/COVID19DS/archive/) as well as to a Virtuoso triple store (https://hub.docker.com/r/openlink/virtuoso-opensource-7).
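The four stages above can be orchestrated as sketched below. This is a skeleton with placeholder stubs, not the actual COVID19DS module layout; all function names are assumptions.

```python
from pathlib import Path

def crawl_cord19(target_dir: Path) -> Path:
    """Stage 1: download and unpack the latest CORD-19 release (metadata CSV + parsed JSON)."""

def convert_to_rdf(cord19_dir: Path, output_ttl: Path) -> None:
    """Stage 2: convert the CORD-19 files into an RDF knowledge graph with RDFLib."""

def link(output_ttl: Path) -> None:
    """Stage 3: run entity linking (AGDISTIS/MAG) and link discovery (LIMES), merging the links."""

def publish(output_ttl: Path) -> None:
    """Stage 4: upload the new version to the HOBBIT server and load it into Virtuoso."""

if __name__ == "__main__":
    cord19_dir = crawl_cord19(Path("downloads"))
    ttl = Path("covidpubgraph.ttl")
    convert_to_rdf(cord19_dir, ttl)
    link(ttl)
    publish(ttl)
```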

Fig. 2

The automated pipeline for generating CovidPubGraph.

Since 2021, CORD-19 has only released new data every two weeks. Therefore, we keep our KG up to date by retrieving the new version of the CORD-19 dataset every two weeks and then following the KG creation procedure shown in Fig. 2. Since the dataset is not yet too large, we refresh the full dataset every two weeks; an automatic incremental update is part of our future plans.
