Data integration of ONGene database

The primary aim of the database is to support oncogene research by maintaining a high-quality oncogene resource that is comprehensive, fully classified, and richly and accurately annotated, with extensive cross-references and querying interfaces freely accessible to the scientific community.

Data sources for curation

We performed a systematic search of the primary research literature indexed in PubMed on Dec 29th 2015 using the query: oncogene [title] or oncogenic [title] or oncoprotein [title] or proto-oncogene [title]. This returned 17,033 abstracts. In addition, 2,976 short statements associated with 1,912 PubMed abstracts were extracted from the GeneRIF (Gene Reference Into Function) database using the keywords "oncogene" and "oncoprotein". To further improve the sensitivity of the literature search for lncRNAs, we repeated the search on Feb 25th 2016 using the terms "long non-coding RNA" and "oncogene". This search returned 435 new references. For oncogenic miRNAs, we integrated the curated oncogenic miRNAs from oncomirdb (mapped to 144 Entrez genes) and miRCancer (mapped to 34 Entrez genes).
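For readers who want to reproduce the first title-restricted search, the query string can be assembled as in the minimal sketch below. The four terms come from the text above; the helper name `build_title_query` is hypothetical, and the sketch only builds the string (it does not contact PubMed), using PubMed's `[Title]` field tag and `OR` operator.

```python
# Title-restricted terms quoted from the search description above.
TITLE_TERMS = ["oncogene", "oncogenic", "oncoprotein", "proto-oncogene"]


def build_title_query(terms):
    """Join terms into a PubMed query that matches any term in the title."""
    return " OR ".join(f"{term}[Title]" for term in terms)


query = build_title_query(TITLE_TERMS)
print(query)
```

The resulting string can be pasted into the PubMed search box. Because PubMed is updated continuously, running it today will not return the same 17,033 abstracts unless a matching publication date limit is added.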

Literature curation

After removing the redundant PubMed results from the various data sources, a total of 18,670 PubMed reports remained for follow-up. First, we downloaded all 18,670 abstracts and grouped them according to their semantic similarity. Next, the sentences containing "oncogene" and "oncoprotein" were used to manually map gene names to the official gene symbols from the Entrez Gene database. Some genes, for example the lncRNA gene "AB073614", are not recorded in the Entrez Gene database. To avoid low-quality gene assignments, we decided not to include them in this version.
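The redundancy removal mentioned at the start of this section amounts to taking the union of the PubMed identifiers from each source. The sketch below illustrates the idea only; the function name and the toy identifiers are invented for illustration and are not taken from the ONGene pipeline.

```python
def merge_pmids(*sources):
    """Return the union of PMIDs from several sources, sorted numerically."""
    merged = set()
    for source in sources:
        # Normalise to stripped strings so "102" and " 102" count as one record.
        merged.update(str(pmid).strip() for pmid in source)
    return sorted(merged, key=int)


# Toy identifiers, invented for illustration only.
title_search = ["101", "102", "103"]
generif = ["102", "104"]
lncrna_search = ["103 ", "105"]

unique_pmids = merge_pmids(title_search, generif, lncrna_search)
print(len(unique_pmids), unique_pmids)
```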

Information for oncogenes

Information is presented on six types of pages: the general information view, literature highlight view, gene expression view, co-expressed lncRNA view, gene mutation view, and homologous gene view.

The general information page appears as follows:

On this page, users can find the data source and our curated descriptions of oncogenes from the literature. Users can switch to other annotations by clicking the hyperlinks at the top of the page.

On the literature highlight page, users can find the details of the literature with keywords highlighted. The keywords "oncogene", "oncoprotein", "proto-oncogene", and "cancer" are marked in red; keywords such as "pathway" are marked in green; and keywords such as "mutation" and "expression" are marked in black, as shown below.
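The colour scheme described above can be sketched with a single regular expression. This is only an illustration of the idea: the keyword groups come from the text, while the HTML `<span>` markup, the bold styling, and the function name `highlight` are assumptions rather than the site's actual implementation.

```python
import html
import re

# Colour groups taken from the text; the markup below is an assumption.
KEYWORD_COLOURS = {
    "red": ["proto-oncogene", "oncoprotein", "oncogene", "cancer"],
    "green": ["pathway"],
    "black": ["mutation", "expression"],
}


def highlight(text):
    """Wrap each known keyword in a <span> carrying its colour."""
    colour_of = {kw: colour for colour, kws in KEYWORD_COLOURS.items() for kw in kws}
    # Longest keywords first so "proto-oncogene" is tried before "oncogene".
    alternatives = sorted(colour_of, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(kw) for kw in alternatives), re.IGNORECASE)

    def wrap(match):
        colour = colour_of[match.group(0).lower()]
        return f'<span style="color:{colour};font-weight:bold">{match.group(0)}</span>'

    # Escape the abstract first so its own characters cannot inject markup.
    return pattern.sub(wrap, html.escape(text))


print(highlight("A proto-oncogene in this pathway shows altered expression."))
```

Matching is by substring and case-insensitive, so "Cancers" is highlighted through its "Cancer" prefix. A stricter version could add word boundaries.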

The gene expression page appears as follows:

On this page, users can find gene expression profiles from 184 human tumor samples and 84 normal tissue samples from BioGPS. Users can view all the sample information by clicking the hyperlinks in the profile images. Some genes have multiple probes; to provide an unbiased view, we present the expression values from all probes without any modification.

Users can obtain all the sample information by clicking on the expression images.

The co-expressed lncRNA page appears as follows:

The protein-coding gene expression data for 11 cancer types were downloaded from the pan-cancer gene expression analysis, as described for our LnCaNet database. The eleven cancer types are BLCA (bladder urothelial carcinoma), BRCA (breast invasive carcinoma), COAD (colon adenocarcinoma), HNSC (head and neck squamous cell carcinoma), KIRC (kidney renal clear cell carcinoma), LUAD (lung adenocarcinoma), LAML (acute myeloid leukemia), LUSC (lung squamous cell carcinoma), OV (high-grade serous ovarian cancer), READ (rectum adenocarcinoma), and UCEC (uterine corpus endometrial carcinoma). To explore the co-expression network of lncRNAs, we estimated the expression correlation between the human oncogenes and all the lncRNAs from MiTranscriptome, using the correlation method implemented in the R language (version 2.14.0) to calculate expression correlation scores and the corresponding p-values. Within each cancer type, the p-values were adjusted for multiple testing by controlling the false discovery rate (FDR). We retained only the oncogene-lncRNA pairs with an expression correlation score greater than 0.5 and an FDR-adjusted p-value less than 0.01.
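The filtering step can be illustrated with the sketch below. The text does not name the FDR procedure, so the Benjamini-Hochberg adjustment used here is an assumption. The function names and the toy correlation values are likewise invented for illustration, and the correlation scores and p-values are taken as already computed. Only the two thresholds (score above 0.5, adjusted p-value below 0.01) come from the text.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_pairs(pairs, min_r=0.5, max_fdr=0.01):
    """Keep (gene, lncRNA, r, p) tuples that pass both thresholds."""
    fdr = benjamini_hochberg([p for _, _, _, p in pairs])
    return [
        (gene, lnc, r, q)
        for (gene, lnc, r, _), q in zip(pairs, fdr)
        if r > min_r and q < max_fdr
    ]


# Toy values for one cancer type, invented for illustration only.
pairs = [
    ("geneA", "lnc1", 0.82, 0.0001),
    ("geneA", "lnc2", 0.61, 0.004),
    ("geneB", "lnc3", 0.30, 0.0002),  # fails the correlation threshold
    ("geneB", "lnc4", 0.55, 0.2),     # fails the FDR threshold
]
selected = select_pairs(pairs)
print([(gene, lnc) for gene, lnc, _, _ in selected])
```

In line with the text, the adjustment would be run once per cancer type, over all oncogene-lncRNA pairs tested in that type.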

The gene mutation page appears as follows:

All the cancer-related mutations were collected from the COSMIC database.

The gene homolog page appears as follows:

All the homologs were collected from the public data portal of the NCBI HomoloGene database.

Query and sequence search against database

All the oncogenes and their annotations in our database are searchable.

Both text search and sequence-based BLAST search are provided.

Browsing and text search of various annotations in our database

Users can search for an oncogene by typing its name, accession IDs, or characteristics, including genomic location, biological pathway, and disease. In total, we provide four different search forms. Among them, "Gene General Information Search", "Literature Search", and "Annotation Search" allow users to access general information, literature-based information, and other annotation information, respectively.

A search is performed by typing keywords into a single field or into several fields simultaneously in the query forms. In general, a text search in each form involves three steps. Take the basic information query as an example:

  • Select a specific annotation or field from the dropdown menu in the basic gene information and mutation query forms.

  • Enter your keyword of interest.

  • In addition, the basic gene information and mutation query forms support the logical 'And,' 'Or,' and 'Not' operators to combine multiple keywords.

The search result lists the matched oncogenes, each linked to its detailed gene information page.
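To make the effect of the 'And', 'Or', and 'Not' operators concrete, the sketch below filters a small set of records by keyword. It is a simplified model, not the query engine behind the forms: the function names, the way the three operator groups are combined, and the toy records are all assumptions made for illustration.

```python
def matches(text, all_of=(), any_of=(), none_of=()):
    """Return True if text satisfies the And / Or / Not keyword groups."""
    lowered = text.lower()

    def has(keyword):
        return keyword.lower() in lowered

    return (
        all(has(k) for k in all_of)                      # And: every keyword present
        and (not any_of or any(has(k) for k in any_of))  # Or: at least one present
        and not any(has(k) for k in none_of)             # Not: none present
    )


def search(records, **criteria):
    """Return the sorted names of records whose annotation text matches."""
    return sorted(name for name, text in records.items() if matches(text, **criteria))


# Hypothetical records, invented for illustration only.
records = {
    "geneA": "kinase; signalling pathway; chromosome 7",
    "geneB": "transcription factor; chromosome 8",
    "geneC": "kinase; chromosome 8",
}
print(search(records, all_of=["kinase"], none_of=["chromosome 8"]))
```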

BLAST search against all gene sequences in our database

In the BLAST menu, users can search the ONGene database with their own input sequences. Oncogenes with high similarity to the input sequences are listed on the BLAST result page. On the input page, users can set sequence alignment options such as the E-value and identity thresholds. The matched sequence signatures are visualized on the query sequence.
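The effect of the E-value and identity options can be illustrated offline with the sketch below. It assumes hits exported from standalone BLAST+ in the default 12-column tabular format (`-outfmt 6`). Whether the ONGene web page offers such an export is not stated here. The default thresholds, the function name, and the toy hits are hypothetical.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(tabular_text, max_evalue=1e-5, min_identity=90.0):
    """Return (subject, identity, evalue) for hits passing both thresholds."""
    reader = csv.DictReader(io.StringIO(tabular_text), fieldnames=COLUMNS,
                            delimiter="\t")
    hits = []
    for row in reader:
        identity, evalue = float(row["pident"]), float(row["evalue"])
        if evalue <= max_evalue and identity >= min_identity:
            hits.append((row["sseqid"], identity, evalue))
    # Most significant hits (smallest E-value) first.
    return sorted(hits, key=lambda hit: hit[2])


# Toy hits, invented for illustration only.
rows = [
    ["query1", "geneA", "98.5", "200", "3", "0", "1", "200", "11", "210", "1e-100", "370"],
    ["query1", "geneB", "75.0", "120", "30", "2", "5", "124", "40", "159", "0.002", "60.1"],
]
example = "\n".join("\t".join(row) for row in rows)
print(filter_hits(example))
```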

To run a sequence-based search against all the oncogenes, please go to the BLAST page.

The output of BLAST appears as follows:

By clicking the hyperlinks on the BLAST result page, users can access the corresponding oncogenes in our database.

Data analysis and download

Users can freely download all the oncogenes for advanced integrative analysis. In addition, we provide bulk downloadable files for the co-expressed lncRNAs, differentially expressed oncogenes in the TCGA pan-cancer cohort, homologous genes, gene-gene interactions, and somatic mutations. Please go to the Download page, shown below.

If you have suggestions for adding comments to records in the current ONGene, or for correcting wrong information, please email us directly.