Motivations
In the era of Big Data, a huge amount of biological data related to different entities, such as proteins, genes, non-coding RNA, diseases, and functional associations, has been made available.
These resources are typically stored in several bioinformatics databases, each one implementing its own data model and user interface.
However, many bioinformatics scenarios require more than one resource.
For a bioinformatician, this means extra effort in switching from one service to another and lost working time in transferring data and intermediate results between resources, sometimes while also disambiguating aliases and accession IDs.
For these reasons, a single bioinformatics platform that integrates many biological resources and services is a fundamental need.
Methods
BioGraphDB is an integrated graph database that collects and links heterogeneous bioinformatics resources. It is implemented on top of OrientDB.
Compared with traditional SQL databases, graph databases offer greater scalability and query efficiency as the size of the data grows.
Each component database has been downloaded from its original site and processed with customized Extract-Transform-Load (ETL) modules so that it can be assembled into a graph architecture. Each biological entity and its properties have been mapped to a vertex and its attributes, respectively, and each relationship between two biological entities has been mapped to an edge. The whole assembled graph can be traversed using suitable query languages, such as Gremlin. Each graph traversal represents a set of queries sufficient to address several bioinformatics scenarios.
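The vertex and edge mapping described above can be illustrated with a small in-memory sketch. This is not BioGraphDB's ETL code or the OrientDB API: the `PropertyGraph` class, its methods, and the sample records are hypothetical and only mirror the idea that entities become labelled vertices with attributes and relationships become labelled, directed edges.

```python
from collections import defaultdict


class PropertyGraph:
    """Toy property graph: labelled vertices with attributes, labelled directed edges."""

    def __init__(self):
        self.vertices = {}                   # vertex id -> {"label": ..., attributes}
        self.out_edges = defaultdict(list)   # source id -> [(edge label, target id)]
        self.in_edges = defaultdict(list)    # target id -> [(edge label, source id)]

    def add_vertex(self, vid, label, **props):
        # A biological entity becomes a vertex; its properties become attributes.
        self.vertices[vid] = {"label": label, **props}

    def add_edge(self, src, label, dst):
        # A relationship between two entities becomes a directed, labelled edge.
        self.out_edges[src].append((label, dst))
        self.in_edges[dst].append((label, src))

    def out(self, vids, label):
        """Follow outgoing edges with the given label (analogous to Gremlin's out())."""
        return [dst for v in vids for (lbl, dst) in self.out_edges[v] if lbl == label]

    def in_(self, vids, label):
        """Follow incoming edges with the given label (analogous to Gremlin's in())."""
        return [src for v in vids for (lbl, src) in self.in_edges[v] if lbl == label]


# Made-up records, for illustration only.
graph = PropertyGraph()
graph.add_vertex("g1", "Gene", nomenclatureAuthoritySymbol="GENE_A")
graph.add_vertex("p1", "Protein", name="PROT_A")
graph.add_vertex("pw1", "Pathway", name="Example pathway")
graph.add_edge("g1", "CODING", "p1")      # Gene -CODING-> Protein
graph.add_edge("pw1", "CONTAINS", "p1")   # Pathway -CONTAINS-> Protein

# Gene -> Protein -> Pathway, i.e. out('CODING') followed by in('CONTAINS').
pathways = graph.in_(graph.out(["g1"], "CODING"), "CONTAINS")
pathway_names = sorted(graph.vertices[v]["name"] for v in pathways)
print(pathway_names)
```

Keeping separate outgoing and incoming adjacency lists is what makes both traversal directions cheap in this sketch; a real graph database handles this internally.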
Nodes | Properties |
---|---|
Gene | geneId (String), locusTag (String), chromosome (String), mapLocation (String), description (String), type (String), nomenclatureAuthoritySymbol (String), nomenclatureAuthorityFullName (String), nomenclatureStatus (String), otherDesignations (String) |
GeneName | symbol (String) |
Go | goId (String), name (String), namespace (String), definition (String), obsolete (String), comment (String) |
Protein | name (String), fullName (String), alternativeName (String), gene (String), sequence (String), sequenceLenght (Int), sequenceMass (Int) |
ProteinName | name (String) |
Pathway | pathwayId (String), name (String), summation (String) |
Cancer | name (String) |
MiRNA | accession (String), name (String), description (String), comment (String), sequence (String) |
MiRNAmature | ... location (String), sequence (String) |
MiRNASNP | SNPid (String), miRNA (String), chr (String), miRstart (Int), miRend (Int), lostNum (Int), gainNum (Int) |
Interaction | transcriptId (String), extTranscriptId (String), mirAlignment (String), alignment (String), geneAlignment (String), mirStart (Int), mirEnd (Int), geneStart (Int), geneEnd (Int), genomeCoordinates (String), conservation (Double), alignScore (Int), seedCat (Int), energy (Double), mirSvrScore (String), mirTarBaseId (String), experiments (String), supportType (String), snpEnergy (Double), basePair (String), geneAve (Double), mirnaAve (Double), database (String) |
Relations | Properties |
---|---|
ANNOTATES | evidence (String), qualifier (String), category (String) |
SYNONYM_OF | - |
CODING | - |
CONTAINS | - |
REFERS_TO | - |
CANCER2MIRNA | profile (String) |
PRECURSOR_OF | - |
HAS_SNP | - |
INTERACTING_GENE | - |
INTERACTING_MIRNA | - |
INTERACTING_SNP | - |
Search for Genes that are associated with a particular Gene Ontology (GO) annotation.
g.V().hasLabel('Go').has('name', goTerm ).
out('ANNOTATES').hasLabel('Gene').order().by('description')
For a given pathway, show all genes associated with it.
g.V().hasLabel('Pathway').has('name', pathwayName ).
out('CONTAINS').in('CODING').order().by('description')
For a given pathway, show all proteins.
g.V().hasLabel('Pathway').has('name', pathwayName ).
out('CONTAINS').order().by('name')
Search for GO annotations for a particular gene.
g.V().hasLabel('Gene').has('nomenclatureAuthoritySymbol', symbol ).
in('ANNOTATES').order().by('name')
For a given Gene, show any associated Pathway.
g.V().hasLabel('Gene').has('nomenclatureAuthoritySymbol', symbol ).
out('CODING').in('CONTAINS').order().by('name')
For a given Protein, returns the associated Gene Ontology (GO) terms.
g.V().hasLabel('Protein').has('name', proteinName ).
in('ANNOTATES').order().by('name')
For a given miRNA, returns the associated cancers from miRCancer.
g.V().hasLabel('MiRNA').has('name', mirnaName ).
in('CANCER2MIRNA').dedup().order().by('name')
For a given mature miRNA, returns the genes reached through all validated (miRTarBase) interactions.
g.V().hasLabel('MiRNAmature').has('product', mirnaName ).
in('INTERACTING_MIRNA').has('database','miRTarBase').out('INTERACTING_GENE').
dedup().order().by('nomenclatureAuthoritySymbol')
This query investigates the functional role of miRNAs in cancer pathology.
Wild-type differentially expressed (DE) miRNAs in a specific cancer are examined as regulatory elements of their gene targets through interaction analysis.
An energy filter is then applied, based on the free energy score of the binding site predicted by miRanda.
This keeps only the miRNA-target interactions that are strongly bound.
The resulting targets are then analyzed through GO enrichment to find the functional annotations that link these molecules to the selected cancer.
g.V().hasLabel('Cancer').has('name', cancerName ).
out('CANCER2MIRNA').dedup().out('PRECURSOR_OF').in('INTERACTING_MIRNA').
has('database','miRanda').has('energy',lt( energy )).
out('INTERACTING_GENE').dedup().in('ANNOTATES').dedup()
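The energy filter in the traversal above, `has('energy', lt( energy ))`, keeps only interactions whose predicted free energy is below the chosen threshold. A minimal Python sketch of that single step follows. The records, gene symbols, and threshold are invented for illustration, and the sketch assumes that a more negative free energy corresponds to a more strongly bound miRNA-target pair.

```python
# Hypothetical Interaction records; property names follow the Interaction node above.
interactions = [
    {"database": "miRanda", "energy": -25.3, "gene": "GENE_A"},
    {"database": "miRanda", "energy": -12.0, "gene": "GENE_B"},
    {"database": "miRTarBase", "energy": None, "gene": "GENE_C"},
]


def strongly_bound(records, threshold):
    """Keep miRanda predictions whose free energy is strictly below the threshold."""
    return [
        r for r in records
        if r["database"] == "miRanda"
        and r["energy"] is not None
        and r["energy"] < threshold
    ]


# Deduplicate target genes, mirroring the dedup() step after out('INTERACTING_GENE').
targets = sorted({r["gene"] for r in strongly_bound(interactions, -20.0)})
print(targets)
```

Choosing a lower (more negative) threshold narrows the result to fewer, more strongly bound interactions.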
This query highlights the functional significance of miRNA single nucleotide polymorphisms (SNPs) in cancer pathology.
Starting from a specific cancer type, the miRNA SNPs linked to that cancer are selected and looked up among the miRNA-target interactions, with a free energy filter applied.
The resulting list is used to retrieve the GO annotations related to DE miRNA SNPs and the cancer.
g.V().hasLabel('Cancer').has('name', cancerName ).
out('CANCER2MIRNA').dedup().out('PRECURSOR_OF').out('HAS_SNP').
in('INTERACTING_SNP').has('snpEnergy',lt( snpEnergy )).
out('INTERACTING_GENE').dedup().in('ANNOTATES').dedup()
Starting from a specific pathway, finds the up-regulated miRNAs involved in a specific cancer scenario.
g.V().hasLabel('Pathway').has('name', pathwayName ).out('CONTAINS').in('CODING').
in('INTERACTING_GENE').has('database', 'miRanda').has('energy',lt( energy )).
out('INTERACTING_MIRNA').dedup().in('PRECURSOR_OF').inE('CANCER2MIRNA').has('profile', profile ).outV().dedup()
Starting from two given genes, finds all pathways associated with them. If common pathways are found, the user can focus on them immediately.
g.V().hasLabel('Gene').choose(values('nomenclatureAuthoritySymbol')).
option( gene1 ,__.as('a')).option( gene2 ,__.as('b')).
out('CODING').in('CONTAINS')
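Once the pathways reachable from each gene are known, the shared ones are simply a set intersection. The sketch below shows that post-processing step in Python on hypothetical data. It is not part of the Gremlin traversal above, and the gene symbols, pathway names, and the `common_pathways` helper are invented for illustration.

```python
# Hypothetical result of the traversal, grouped by starting gene.
pathways_by_gene = {
    "GENE_A": {"Pathway 1", "Pathway 2"},
    "GENE_B": {"Pathway 2", "Pathway 3"},
}


def common_pathways(mapping, gene1, gene2):
    """Return the pathways shared by both genes, sorted by name."""
    return sorted(mapping.get(gene1, set()) & mapping.get(gene2, set()))


shared = common_pathways(pathways_by_gene, "GENE_A", "GENE_B")
print(shared)
```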
To do...