Research

My research focuses on three opportunities in the field of computational biology. First, novel machine learning methodology is needed to apply structured models incorporating biological prior knowledge to very large, dense feature spaces. Every experimental result captures a rich snapshot of cellular biological activity, and tens of thousands of such results can be mined to infer protein function, to predict new cellular pathways and regulatory networks, to design future experiments, and to target research and treatments to the causes of genetic disorders. Second, it is critical to apply these methods to human population data in a public health context; genome-wide association studies opened the door to personalized genomic medicine, but a tremendous amount of work remains to integrate available data and to understand the molecular mechanisms of genomic variation as they contribute to disease. Finally, microbial populations are of particular interest to me, as very few computational tools exist to describe the biological functionality of these communities at a systems level. Low-cost sequencing is just beginning to provide data on microfloral communities, and computational techniques are necessary to augment these metagenomic profiles with functional data from model organisms and to summarize community complexity in terms of overarching systemic functions and host interactions.


Areas of interest

The primary focus of my current research is the development of systems for large-scale biological data mining. This is particularly applicable to higher organisms and metagenomic communities, where large (effective) genomes and complex regulatory and signaling networks can be untangled by the integration of many diverse datasets. In tandem with these large-scale methods, my lab has developed targeted mechanistic modeling techniques that leverage specific, non-high-throughput experimental results to understand, for example, the regulation of cellular growth. In all cases, we have implemented these systems in practical software available to the bioinformatic and biological communities.

Human functional genomics

I have developed and implemented a probabilistic system integrating ~30,000 publicly available experimental results (>100 GB of data) to predict protein function, functional relationships, cross-talk among pathways and processes, and disease involvement in humans. This required new methods for the exploration and statistical analysis of large, dense, weighted graphs, in addition to solving the machine learning and software engineering challenges of efficiently processing this amount of data. The resulting process of functional mapping can be applied in a variety of biological settings to direct experimenters to under-annotated functional areas or to discover functional similarities among genomic datasets. A system incorporating these algorithms, called HEFalMp (Human Experimental/Functional Mapper), provides a web site through which biologists can query genes, processes, and diseases of specific interest. In collaboration with Hilary Coller (Princeton Mol. Bio.), we have confirmed the resulting predicted involvements of several proteins in the process of autophagy in human fibroblasts.
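
As a rough illustration of what probabilistic integration of many datasets can look like, the sketch below combines evidence about one gene pair under a naive Bayes assumption (datasets treated as conditionally independent). The text above does not specify the model, so the naive Bayes form, the function name, the dataset names, and all probabilities here are hypothetical and are not taken from HEFalMp.

```python
import math

def posterior_related(prior, likelihoods, observations):
    """Posterior probability that a gene pair is functionally related.

    prior        -- assumed prior probability of a functional relationship
    likelihoods  -- likelihoods[dataset][value] = (P(value | related),
                                                   P(value | unrelated))
    observations -- observations[dataset] = discretized value seen for this pair

    Naive Bayes: each dataset multiplies the odds by its likelihood ratio,
    which is done here in log space for numerical stability.
    """
    log_odds = math.log(prior / (1.0 - prior))
    for dataset, value in observations.items():
        p_related, p_unrelated = likelihoods[dataset][value]
        log_odds += math.log(p_related / p_unrelated)
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical toy inputs: two datasets, each discretized to "high" or "low".
toy_likelihoods = {
    "coexpression_A": {"high": (0.6, 0.2), "low": (0.4, 0.8)},
    "interaction_B": {"high": (0.5, 0.25), "low": (0.5, 0.75)},
}
toy_observations = {"coexpression_A": "high", "interaction_B": "high"}

if __name__ == "__main__":
    print(posterior_related(0.1, toy_likelihoods, toy_observations))
```

Each dataset contributes only a pair of conditional probabilities per observed value, which is one reason a scheme of this kind can scale to very large collections of data.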

Linear models of gene expression

I have developed a statistical linear model that describes the S. cerevisiae transcriptional response to changes in cellular growth rate, in collaboration with Edo Airoldi (Harvard Statistics) and David Botstein (Princeton Mol. Bio.). In addition to describing which portions of the genome are regulated with respect to growth rate, this model can be applied to new microarray data to predict the growth rate of the originating culture. This allows the inference of growth rates at instantaneous time scales not measurable by standard experimental techniques, and the model is robust to changes in growth conditions, microarray platform, and organism, as I have also successfully applied it to S. bayanus and Schz. pombe. I am currently working with Maitreya Dunham (U. Washington Genome Sci.) to create a more sophisticated computational model to capture the changes in gene regulation induced by aneuploidy.
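
The general idea of fitting a per-gene linear response to growth rate and then inverting it for a new sample can be sketched as follows. This is a minimal illustration under simplifying assumptions (ordinary least squares, one slope and intercept per gene, no noise model); the gene names and numbers are invented, and it is not the published model.

```python
from statistics import mean

def fit_gene(rates, expr):
    """Least-squares fit of expr = intercept + slope * rate for one gene."""
    r_bar, e_bar = mean(rates), mean(expr)
    sxx = sum((r - r_bar) ** 2 for r in rates)
    sxy = sum((r - r_bar) * (e - e_bar) for r, e in zip(rates, expr))
    slope = sxy / sxx
    return slope, e_bar - slope * r_bar

def fit_model(rates, expression):
    """expression maps gene -> expression values, one per training culture."""
    return {gene: fit_gene(rates, values) for gene, values in expression.items()}

def predict_rate(model, sample):
    """Growth rate minimizing the squared error across all genes in the sample.

    Setting the derivative of sum_g (x_g - a_g - b_g * r)^2 to zero gives
    r = sum_g b_g * (x_g - a_g) / sum_g b_g^2, so genes with near-zero
    slope (not growth regulated) contribute almost nothing.
    """
    shared = [g for g in model if g in sample]
    numerator = sum(model[g][0] * (sample[g] - model[g][1]) for g in shared)
    denominator = sum(model[g][0] ** 2 for g in shared)
    return numerator / denominator

# Hypothetical training data: four cultures with known growth rates.
toy_rates = [0.05, 0.10, 0.20, 0.30]
toy_expression = {
    "GENE_UP": [1.2, 1.4, 1.8, 2.2],      # rises with growth rate
    "GENE_DOWN": [1.85, 1.7, 1.4, 1.1],   # falls with growth rate
    "GENE_FLAT": [0.5, 0.5, 0.5, 0.5],    # unresponsive to growth rate
}

if __name__ == "__main__":
    model = fit_model(toy_rates, toy_expression)
    new_sample = {"GENE_UP": 1.6, "GENE_DOWN": 1.55, "GENE_FLAT": 0.5}
    print(predict_rate(model, new_sample))
```

The fitted slopes indicate which genes respond to growth rate, and the same parameters are reused to estimate the growth rate of a culture from its expression profile alone.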

Integrating computation and experimentation

One of the major current opportunities in bioinformatics is the closer integration of functional predictions with rigorous experimental follow-up; many computational predictions are made, but only a small fraction of them are definitively confirmed in the laboratory. In collaboration with Amy Caudy (Princeton Lewis-Sigler), Chad Myers (U. Minnesota Comp. Sci.), Matthew Hibbs (Jackson Labs), David Hess (Princeton Lewis-Sigler), and others, we have integrated a collection of computational function prediction methods and experimentally verified the involvement of nearly 100 new proteins in the process of mitochondrial inheritance in yeast. My group specifically performed a study of the implications these results have for the field of computational protein function prediction. Based on the success of this study and the reliability with which we applied computational results to laboratory investigations, I am eager to continue this work in other organisms and biological areas through targeted experimental collaborations.

Software for functional genomics

My lab has developed and documented the Sleipnir library for functional genomics, which is currently the only publicly available software specifically allowing efficient manipulation and machine learning from very large collections of genomic data. This C++ library, comprising over 60,000 lines of code, includes both fundamental computer science concepts (parallelization, database management, generative and discriminative machine learning, etc.) and biological modeling (representations of genes, gene sets, interactions, functional catalogs, expression data, etc.) with a focus on integrating and learning from large, diverse biological datasets. In less than three months since publication, the Sleipnir web site has been viewed over 10,000 times, and the library has been downloaded by hundreds of visitors.