tmCOVID: a text mining tool for extracting and summarizing bioconcepts (diseases, chemicals, genes, species, cell lines, and mutations) in COVID-19 scientific literature


The number of scientific publications related to COVID-19 has grown exponentially since December 2019. On March 16, 2020, the White House issued a call to action asking the data science community to develop literature mining tools that can help the scientific community answer high-priority questions related to COVID-19. tmCOVID is an interactive web-based tool that extracts and summarizes the bioconcepts (genes, chemicals, drugs, mutations, cell lines, species, and diseases) in the COVID-19 scientific literature. Our ongoing work includes adding support for the CORD-19 dataset and generating full-text summaries by detecting the most relevant sentences using network centrality methods. Automated summarization of biomedical text will enhance access to information and help identify patterns within the text. Furthermore, it will allow biomedical researchers and the general public to find information related to risk factors of COVID-19, including pregnancy, smoking, and comorbidities.

Sample queries:

SARS-CoV-2 and COVID-19

COVID-19 and 'risk factors'

COVID-19 and smoking

COVID-19 and pregnancy

Options for selecting databases and publication types:

The user can query PubMed abstracts or PMC full-text articles. Additional filters include restricting the search to only journal articles or case reports.
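
As an illustration of how a database choice and a publication-type filter could translate into a search request, here is a minimal Python sketch that builds an NCBI E-utilities ESearch URL. It is not tmCOVID's actual code: the function name, the filter mapping, and the default result limit are assumptions made for this example, and the publication-type tags shown are PubMed's (PMC may need different ones).

```python
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Hypothetical mapping from a UI filter name to a PubMed publication-type tag.
PUB_TYPE_FILTERS = {
    "journal article": '"journal article"[pt]',
    "case report": '"case reports"[pt]',
}


def build_esearch_url(query, db="pubmed", pub_type=None, retmax=100):
    """Return an ESearch URL for `query` against PubMed or PMC."""
    if db not in ("pubmed", "pmc"):
        raise ValueError("db must be 'pubmed' or 'pmc'")
    term = query
    if pub_type is not None:
        # Wrap the user query so the filter applies to the whole expression.
        term = f"({query}) AND {PUB_TYPE_FILTERS[pub_type]}"
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_esearch_url("COVID-19 and pregnancy", pub_type="case report"))
```

Fetching the returned URL would yield a list of matching record IDs; the sketch stops at URL construction so that it runs without network access.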

Summarization options:

The 'Bioconcept frequency in all documents' option generates a table with the frequency of bioconcept IDs aggregated across all documents.

The 'Bioconcept frequency in each document' option generates a table with the frequency of bioconcept IDs in each document.
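
The difference between the two tables can be sketched in a few lines of Python. This is an illustration only, not the tool's implementation: the annotation tuples, document names, and bioconcept IDs below are made up, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# Made-up annotations: (document ID, bioconcept type, bioconcept ID).
ANNOTATIONS = [
    ("doc1", "Disease", "EXAMPLE:D1"),
    ("doc1", "Disease", "EXAMPLE:D1"),
    ("doc1", "Gene", "EXAMPLE:G1"),
    ("doc2", "Disease", "EXAMPLE:D1"),
    ("doc2", "Chemical", "EXAMPLE:C1"),
]


def frequency_all_documents(annotations):
    """Count each (type, ID) pair across every document."""
    return Counter((ctype, cid) for _, ctype, cid in annotations)


def frequency_each_document(annotations):
    """Count each (type, ID) pair separately for every document."""
    per_doc = defaultdict(Counter)
    for doc, ctype, cid in annotations:
        per_doc[doc][(ctype, cid)] += 1
    return dict(per_doc)


if __name__ == "__main__":
    print(frequency_all_documents(ANNOTATIONS))
    print(frequency_each_document(ANNOTATIONS))
```

In this toy data the disease ID occurs three times overall, which the per-document view splits into two occurrences in doc1 and one in doc2.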

Additional graphical and textual summarization options are currently under development and will be released soon.

Searching the results table:

The user can sort the results by bioconcept type and search for bioconcepts of interest in the results table.

Word Cloud:

The word cloud provides a visual summary of the 30 most frequent bioconcepts found in the documents matching the user query. The size of each word correlates with its frequency of occurrence.
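
One possible way to pick the words and their sizes is sketched below in Python. The text above only states that size correlates with frequency, so the linear scaling, the font-size range, and the function name are all assumptions of this example rather than a description of how tmCOVID renders its cloud.

```python
from collections import Counter


def word_cloud_sizes(frequencies, top_n=30, min_size=10, max_size=60):
    """Map the `top_n` most frequent terms to font sizes (linear scaling)."""
    top = Counter(frequencies).most_common(top_n)
    if not top:
        return {}
    highest = top[0][1]
    lowest = top[-1][1]
    spread = highest - lowest
    sizes = {}
    for term, count in top:
        if spread == 0:
            # Every kept term is equally frequent, so use a single size.
            size = max_size
        else:
            size = min_size + (count - lowest) * (max_size - min_size) / spread
        sizes[term] = round(size)
    return sizes


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(word_cloud_sizes({"COVID-19": 50, "SARS-CoV-2": 30, "ACE2": 10}))
```

A logarithmic scale is a common alternative when a few terms dominate the counts, since it keeps the rarer terms legible.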

Overview of methodology:

tmCOVID uses NCBI Entrez to retrieve PubMed IDs based on the input query. PubTator is used to extract bioconcepts (genes, chemicals, diseases, species, and mutations) from each PubMed abstract or PMC full-text article. Summary tables are generated with the frequency of occurrence of bioconcepts at the document level or across all hits. All data is stored in a SQLite database accessed through RSQLite.
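
The extraction and storage steps can be sketched as follows. This is an illustration under stated assumptions, not the tool's code: the sample record is made up (including its placeholder IDs), the table schema and function names are hypothetical, and Python's built-in sqlite3 module stands in for RSQLite. The input follows the tab-delimited annotation lines of the PubTator text format (PMID, start, end, mention, type, identifier).

```python
import sqlite3

# Made-up record in PubTator text format: title, abstract, then annotations.
SAMPLE = (
    "1|t|Example title about COVID-19\n"
    "1|a|Example abstract mentioning ACE2 and COVID-19.\n"
    "1\t20\t28\tCOVID-19\tDisease\tEXAMPLE:D1\n"
    "1\t57\t61\tACE2\tGene\tEXAMPLE:G1\n"
    "1\t66\t74\tCOVID-19\tDisease\tEXAMPLE:D1\n"
)


def parse_annotations(text):
    """Return (pmid, type, concept_id, mention) for each annotation line."""
    rows = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) == 6:  # title and abstract lines contain no tabs
            pmid, _start, _end, mention, ctype, concept_id = fields
            rows.append((pmid, ctype, concept_id, mention))
    return rows


def store_and_count(rows):
    """Store annotations in an in-memory SQLite table and count per concept."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE annotation "
        "(pmid TEXT, type TEXT, concept_id TEXT, mention TEXT)"
    )
    conn.executemany("INSERT INTO annotation VALUES (?, ?, ?, ?)", rows)
    counts = conn.execute(
        "SELECT type, concept_id, COUNT(*) AS n FROM annotation "
        "GROUP BY type, concept_id ORDER BY n DESC, concept_id"
    ).fetchall()
    conn.close()
    return counts


if __name__ == "__main__":
    print(store_and_count(parse_annotations(SAMPLE)))
```

Adding pmid to the GROUP BY clause would give the document-level counts instead of the counts across all hits.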