Ge Gao, Ph.D.

Computational Genomics

Assistant Professor, Peking University



As biology increasingly becomes a data-rich science, the massive data generated by high-throughput technologies pose both opportunities and serious challenges. We focus primarily on developing novel computational technologies to analyze and integrate these “big data” effectively and efficiently, and on applying them to decipher the function and evolution of gene regulatory systems.

Handle Biological “BIG DATA” Effectively and Efficiently. Powerful bioinformatics infrastructure is critical to store, manage, and analyze these data, and ultimately to extract novel knowledge from them. Supported by a grant from the Chinese Ministry of Science and Technology, we developed an online bioinformatics platform, Weblab, which helps users analyze biological data with 260+ integrated tools and share their data, results, and even whole workflows with each other. As the largest online bioinformatics platform in China, Weblab receives 54+ million hits annually from 5,000+ registered active users worldwide. In response to the rapid increase in the amount of sequencing data, we further developed ABrowse, a customizable genome visualization framework that enables more effective access to heterogeneous “-omics” data. As the first general-purpose framework with full support for interactive browsing, open data access, and collaborative teamwork at the genome-wide scale, ABrowse has been downloaded 2,095 times in total (as of Apr 2015), an average of 1.41 downloads per day.

Decipher the Function and Evolution of Gene Regulatory System. Building on this infrastructure, we study the functionality and evolutionary dynamics of development-related regulatory systems in various model organisms. One of our long-term goals is to determine the regulatory roles of novel (i.e., evolutionarily young) regulators, as well as how they are “wired” into the existing regulatory network.

Transcription factors (TFs) are key elements in gene expression regulation circuits and play essential roles in plant development and stress response. Benefiting from continuous improvements in data quality and analysis methodologies over the past decade, our Plant Transcription Factor Database, PlantTFDB, is becoming the most comprehensive data portal for plant transcription factors, with 10+ million hits from users worldwide annually. The current PlantTFDB 3.0 contains 129,288 TFs from 83 species, covering all major lineages of green plants. This wide coverage enables us to investigate the evolutionary dynamics of the plant transcriptional regulatory system globally. Unexpectedly, we found a statistically significant connection between the binding specificity and the wiring preference of novel TFs, suggesting that novel regulators can modify regulatory circuits by introducing holistic (but highly specialized) modules.

Long non-coding RNAs (lncRNAs, operationally defined as noncoding transcripts longer than 200 nt) have only recently been recognized as important regulators. Using a machine learning approach, we developed CPC, the first online tool to identify non-coding transcripts based solely on sequence features. CPC is now widely used by the noncoding RNA community, and the CPC online server receives 42 million hits annually. The rapid evolutionary rate and flexible target adaptation of noncoding RNAs make them an attractive source of evolutionary novelty. By integrating functional and evolutionary genomics data across multiple clades, we found that a selection-driven process, rather than a purely neutral mutation-driven mechanism, contributes to the origin and maintenance of intergenic noncoding RNAs in both the fruit fly and human genomes. This finding highlights the putative roles of novel long noncoding RNAs in early development and their possible connection with lineage-specific evolutionary novelty.

1. Cheng SJ, Shi FY, Liu H, Ding Y, Jiang S, Liang N, Gao G. (2017) Accurately annotate compound effects of genetic variants using a context-sensitive framework. Nucleic Acids Res., 45: e82.

2. Jin J, Tian F, Yang DC, Kong L, Gao G. (2017) PlantTFDB 4.0: towards a central hub for transcription factors and regulatory interactions in plants. Nucleic Acids Res., 45: D1040-D1045.

3. Hou M, Tian F, Jiang S, Kong L, Yang DC, Gao G. (2016) LocExpress: a web server for efficiently estimating expression of novel transcripts. BMC Genomics, 17: 175-179.

4. Hu B, Jin J, Guo A, Zhang H, Luo J, Gao G. (2014) GSDS 2.0: an upgraded gene feature visualization server. Bioinformatics, 31: 1296-1297.

5. Zhao Y, Tang L, Li Z, Jin J, Luo J, Gao G. (2015) Identification and analysis of unitary loss of long-established protein-coding genes in Poaceae shows evidences for biased gene loss and putatively functional transcription of relics. BMC Evol. Biol., 15: 66.

6. Jin J, He K, Tang X, Li Z, Lv L, Zhao Y, Luo J, Gao G. (2015) An Arabidopsis transcriptional regulatory map reveals distinct functional and evolutionary features of novel transcription factors. Mol. Biol. Evol., 32: 1767-1773.

7. Jin J, Zhang H, Kong L, Gao G, Luo J. (2014) PlantTFDB 3.0: a portal for the functional and evolutionary study of plant transcription factors. Nucleic Acids Res., 42: D1182-D1187.

8. Xiao A, Cheng Z, Kong L, Zhu Z, Lin S, Gao G, Zhang B. (2014) CasOT: a genome-wide Cas9/gRNA off-target searching tool. Bioinformatics, 30: 1180-1182.

9. Gao G, Vibranovski MD, Zhang L, Li Z, Liu M, Zhang Y, Li XM, Zhang WX, Fan Q, Long M, Wei L. (2014) A long term demasculinization of X-linked intergenic noncoding RNAs in Drosophila melanogaster. Genome Res., 24: 629-638.

10. Kong L, Wang J, Zhao S, Gu X, Luo J, Gao G. (2012) ABrowse - a customizable next-generation genome browser framework. BMC Bioinformatics, 13: 2.

Li-Chen Ren, Lei Kong, Wen Tang, Yuan Lin, Jin-Pu Jin, Qian Wang, Nan Liang, Yang Ding, Yu-Qi Meng, Si-Jin Cheng, Yu-Jian Kang, Shuai Jiang, Fang-Yuan Shi, Lan Ke, Feng Tian, Huai-Yuan Sun, Jing-Yi Li, Zhi-Jie Cao, Zheng-Yang Wen, Yu Wang, De-Chang Yang