SemRep

Knowledge Graph Exploration with NLM SemRep

Title: A Study of Drug-to-Cardiovascular Disease (CVD) Association with SemRep and Deep Learning


Description: Starting with well-defined oxidative stress categories (e.g., Initiation, Regulation, and Outcome of Oxidative Stress) and a list of drugs in cardiovascular disease (CVD), we will use SemRep to extract all relevant subject-predicate-object (SPO) triplets. We will then build knowledge graphs from these triplets and prepare a multi-order association matrix to represent the graph data structure. Using this graph structure, we will build a sequence prediction model for drug-to-CVD association. This project will provide a detailed analysis of drug-to-CVD associations with both qualitative evidence and quantitative scores.
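
As a minimal sketch of the graph step described above, the snippet below builds an adjacency matrix from a few SPO triplets and computes its successive powers. It assumes that "multi-order" means the k-th power of the adjacency matrix (entry [i][j] counts directed paths of length k from node i to node j); the project may define it differently. The triplets and all node names (`drug_a`, `molecule_x`, `cvd_y`) are hypothetical placeholders, not actual SemRep output.

```python
from typing import List, Tuple

# Hypothetical SPO triplets (subject, predicate, object); placeholders only.
triplets = [
    ("drug_a", "INHIBITS", "molecule_x"),
    ("molecule_x", "CAUSES", "oxidative_stress"),
    ("oxidative_stress", "ASSOCIATED_WITH", "cvd_y"),
]


def build_adjacency(spo: List[Tuple[str, str, str]]):
    """Return (sorted node list, 0/1 adjacency matrix) for directed SPO edges."""
    nodes = sorted({s for s, _, _ in spo} | {o for _, _, o in spo})
    index = {name: i for i, name in enumerate(nodes)}
    size = len(nodes)
    adj = [[0] * size for _ in range(size)]
    for s, _predicate, o in spo:
        adj[index[s]][index[o]] = 1
    return nodes, adj


def matmul(a, b):
    """Plain-Python product of two square matrices of equal size."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def multi_order(adj, max_order: int):
    """Return [A^1, A^2, ..., A^max_order] (assumed meaning of 'multi-order')."""
    powers = [adj]
    for _ in range(max_order - 1):
        powers.append(matmul(powers[-1], adj))
    return powers


nodes, adj = build_adjacency(triplets)
orders = multi_order(adj, 3)

# Lowest order at which drug_a reaches cvd_y, or None if it never does.
i, j = nodes.index("drug_a"), nodes.index("cvd_y")
first_hit = next((k + 1 for k, m in enumerate(orders) if m[i][j] > 0), None)
print(first_hit)
```

In this toy graph the drug reaches the disease only through two intermediate nodes, so the association first appears in the third-order matrix. For larger graphs a numerical library would be a more practical choice than nested lists.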

Leaders/Instructors: Dr. Dibakar Sigdel & Dr. David Liem (Mr. Vincent Kyi for technical support)

Participants

Vladimir Guevara
Maya Gupta
Alex Zhang*
Ethan Tran*
Aaliyah

Note: Participants marked with * are also involved in other projects in Project 1 (A, B, and C)

Project walk-through

Get familiar with NLM SemRep for biomedical documents (https://semrep.nlm.nih.gov/). Learn to extract drug information from the DrugBank API (https://www.drugbank.ca/) and study a curated list of oxidative stress categories and associated molecules. Extract knowledge graph triplets (SPO) for drug and CVD associations with the SemRep tool and create a graph data structure (association matrix). Build an RNN (LSTM) model to predict, classify, or partition drugs and molecules by oxidative stress category, with associated information and analysis. Organize the code and prepare a project documentation site under the PingLab Intern GitHub account. Give a final presentation at the lab meeting.
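
One way to picture the triplet extraction step in the walk-through is as a filter over already-parsed triplets. The sketch below assumes the SemRep output has been parsed into (subject, predicate, object) records elsewhere; the `Triplet` type, the `drug_cvd_evidence` helper, and all entity names are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Set


class Triplet(NamedTuple):
    """One subject-predicate-object record (hypothetical container type)."""
    subject: str
    predicate: str
    obj: str


def drug_cvd_evidence(triplets: Iterable[Triplet],
                      drugs: Set[str],
                      cvd_terms: Set[str]) -> Counter:
    """Count triplets whose subject is a known drug and whose object is a CVD term.

    Matching is a case-insensitive exact string match, a simplifying
    assumption; real entity matching would more likely rely on identifiers.
    """
    counts: Counter = Counter()
    for t in triplets:
        subject, obj = t.subject.lower(), t.obj.lower()
        if subject in drugs and obj in cvd_terms:
            counts[(subject, t.predicate, obj)] += 1
    return counts


# Placeholder records, not real extraction results.
records = [
    Triplet("Drug_A", "TREATS", "CVD_Y"),
    Triplet("drug_a", "TREATS", "cvd_y"),
    Triplet("drug_a", "INHIBITS", "molecule_x"),  # object is not a CVD term
    Triplet("drug_b", "PREVENTS", "cvd_y"),
]

evidence = drug_cvd_evidence(records, drugs={"drug_a", "drug_b"}, cvd_terms={"cvd_y"})
for key, n in sorted(evidence.items()):
    print(key, n)
```

The counts give a simple quantitative score per drug, predicate, and disease, while the matched records themselves can serve as qualitative evidence.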

Education goals: The students will learn how to work with innovative text mining tools (e.g., SemRep, CaseOLAP, Neo4j) for biomedical documents and with machine learning approaches (RNN, LSTM) for model development and implementation to answer important biomedical questions.

Scientific goals: The students will explore knowledge graphs for drug and CVD associations with a focus on oxidative stress categories (e.g., Initiation, Regulation, and Outcome) and the underlying molecular mechanisms.

Preparing Foundation

Create an NLM/UMLS account
Get familiar with the following:
- MeSH tree: Medical Subject Headings
- ICD codes: International Classification of Diseases
- UniProt: Database of proteins organized by protein ID, synonyms, abbreviations, and associated genes and biomolecules
- Reactome: Database of pathways and associated documents in a graph structure
- DrugBank: Database of drugs and associated documents
Install Python (Anaconda)
Install Neo4j and take the available tutorials
Get familiar with the NLTK NLP package in Python
Get an introductory idea of machine learning from scikit-learn and TensorFlow (especially LSTM)

Project Detail

Step 1: Prepare the required CVD entities (e.g., UMLS dictionary, MeSH tree, ICD tree, proteins, genes, RNA-Seq, drugs)
Step 2: Explore SemRep for CVD cases with the CVD entities
Step 3: Prepare a CVD knowledge graph (KG) from the available SPO triplets
Step 4: Create a graph association matrix and artifacts
Step 5: Build and test LSTM models
Step 6: Analyse the results
Step 7: Present the results
Step 8: Organize the documentation with MkDocs
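
To connect the knowledge graph and LSTM steps above, the sketch below shows one possible way to turn graph paths into fixed-length integer sequences, the usual input form for a sequence model. It assumes that training sequences are simple paths from a drug to a CVD node with predicates kept as tokens; this is an illustrative choice, not a method stated by the project. The helper names `find_paths` and `encode_paths` and the sample triplets are hypothetical.

```python
from typing import Dict, List, Tuple

# Placeholder SPO triplets, not real extraction results.
triplets = [
    ("drug_a", "INHIBITS", "molecule_x"),
    ("molecule_x", "CAUSES", "cvd_y"),
    ("drug_a", "TREATS", "cvd_y"),
]


def find_paths(spo: List[Tuple[str, str, str]], start: str, goal: str,
               max_hops: int) -> List[List[str]]:
    """Depth-first search for simple paths of at most max_hops edges.

    Each path alternates node and predicate tokens:
    [node, predicate, node, predicate, node, ...].
    """
    outgoing: Dict[str, List[Tuple[str, str]]] = {}
    for s, p, o in spo:
        outgoing.setdefault(s, []).append((p, o))

    paths: List[List[str]] = []

    def walk(node: str, path: List[str], visited: frozenset) -> None:
        if node == goal:
            paths.append(path)
            return
        if (len(path) - 1) // 2 >= max_hops:  # hop limit reached
            return
        for predicate, nxt in outgoing.get(node, []):
            if nxt not in visited:  # keep paths simple (no repeated nodes)
                walk(nxt, path + [predicate, nxt], visited | {nxt})

    walk(start, [start], frozenset({start}))
    return paths


def encode_paths(paths: List[List[str]]):
    """Map tokens to integer IDs and right-pad with 0 to a common length."""
    vocab = {"<pad>": 0}
    for path in paths:
        for token in path:
            vocab.setdefault(token, len(vocab))
    width = max((len(p) for p in paths), default=0)
    encoded = [[vocab[t] for t in p] + [0] * (width - len(p)) for p in paths]
    return vocab, encoded


paths = find_paths(triplets, "drug_a", "cvd_y", max_hops=2)
vocab, encoded = encode_paths(paths)
for row in encoded:
    print(row)
```

Padded ID sequences of this kind could then be passed to an embedding layer followed by an LSTM layer in a framework such as TensorFlow. ID 0 is reserved for padding so that a model can be configured to ignore it.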

References

GitHub for SemRep (https://github.com/CaseOLAP/SemRep)
SemRep Paper (https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8574025)
SemRep NLM (https://semrep.nlm.nih.gov/)
Oxidative Stress (https://en.wikipedia.org/wiki/Oxidative_stress)