
Send email or ask me about it at ISMB 2009.
My lab in the Biostatistics Department at the Harvard School of Public Health focuses on systems biology in the context of high-throughput computational methods tightly integrated with wet-lab biology. What that means without so much jargon is that we process very large quantities of experimental data in order to make predictions about how genes work, and we work hard to make sure that the results are realistic and useful to biologists and physicians.
My work is particularly aimed at tackling two aspects of this problem:
- Functional genomics in higher organisms and its application to human disease.
Multicellular organisms (and human beings in particular) represent a unique opportunity for bioinformatic analysis. Paradoxically, although we tend to know more details about simple model organisms, the vast majority of the experimental data out there deals with metazoans. However, the same complexities that make us human - hundreds of distinct tissue types, combinatorial intra- and intercellular signaling, and development from a single cell to over ten trillion - make it difficult to translate results from unicellular models into higher organisms. A small selection of the challenges involved includes:
- Practical. Bacteria and most unicellular organisms have in the neighborhood of 5,000 genes, and as you scale up genome size to humans' ~25,000 (or oak's ~100,000), data management and analysis become difficult.
- Experimental. You can't grow most organisms in petri dishes, so there's just plain not as much detailed information available, and function predictions are harder to validate in a meaningful way.
- Evolutionary. After genomes have been duplicated and rearranged across millions of years, how can you tell when sequence conservation equates to functional conservation? What does it mean when one gene in a model organism is homologous to a whole set of human genes (especially if each of them is expressed only in a different tissue type)?
- Theoretical. How do you best search multiple aligned interaction networks for dense subgraphs? In other words, if you know (or at least guess) what every gene is doing in several related organisms, how do you pick out the interesting bits?
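To make the "theoretical" question a little more concrete, here is a minimal sketch of one textbook approach, not the method used in my lab. It assumes the networks have already been aligned, so that each gene carries a shared ortholog identifier, keeps only the interactions present in every organism, and then applies greedy peeling (repeatedly dropping the lowest-degree gene) to find a subgraph with a high edges-per-node density. The function names `conserved_edges` and `densest_subgraph` and the toy gene labels are hypothetical.

```python
from collections import defaultdict


def conserved_edges(networks):
    """Return the interactions found in every network.

    Assumes each network is an iterable of (gene, gene) pairs whose gene
    names are already mapped to shared ortholog identifiers.
    """
    edge_sets = [{frozenset(e) for e in net} for net in networks]
    return set.intersection(*edge_sets)


def densest_subgraph(edges):
    """Greedy peeling: drop the lowest-degree node at each step and
    remember the node set with the highest edges-per-node density."""
    # Normalize to unordered pairs and ignore self-interactions.
    edges = {frozenset(e) for e in edges if len(set(e)) == 2}
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    nodes = set(adj)
    n_edges = len(edges)
    best_nodes = set(nodes)
    best_density = n_edges / len(nodes) if nodes else 0.0

    while len(nodes) > 1:
        # Lowest degree first; ties are broken by name for determinism.
        u = min(nodes, key=lambda n: (len(adj[n]), n))
        n_edges -= len(adj[u])
        for v in adj[u]:
            adj[v].discard(u)
        del adj[u]
        nodes.remove(u)

        density = n_edges / len(nodes)
        if density > best_density:
            best_nodes, best_density = set(nodes), density

    return best_nodes, best_density


# Toy example: two "organisms" sharing a four-gene clique.
net1 = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"),
        ("B", "D"), ("C", "D"), ("D", "E")]
net2 = [("B", "A"), ("A", "C"), ("A", "D"), ("B", "C"),
        ("B", "D"), ("C", "D"), ("E", "F")]

module, density = densest_subgraph(conserved_edges([net1, net2]))
```

Requiring an interaction to appear in every network is a deliberately strict simplification; real data is noisy, so a practical version would more likely weight edges by how many organisms (and how much evidence) support them.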
My work includes initial efforts to broaden our functional analysis techniques to encompass a collection of higher organisms, with the particular goal of using computation to advance the state of genomic medicine and our mechanistic understanding of human disease. This includes not only making computational predictions in multicellular organisms, but also comparing and analyzing their commonalities and differences and using better studied model systems to enhance our knowledge of less well understood ones.
- Storing, retrieving, and analyzing massive datasets efficiently.
The "practical" problems that arise during analysis of complex organisms can be of such a magnitude that they become computer systems research questions in their own right. Given the amazing rate at which genomic information is currently becoming available, even incompletely studied organisms can be associated with hundreds of gigabytes of data. This must be stored in a manner that is reasonably efficient (hard drives are cheap, but not that cheap) yet rapidly accessible for complex machine learning algorithms.
Conversely, presentation of this data to an end user need not be quite so speedy (humans are so slow in comparison to computers!), but it is difficult to design a user interface capable of summarizing billions of data points drawn from hundreds of experiments across dozens of organisms. Each experimental result must be archived in such a manner that it can be summarized concisely or recalled in excruciating detail when a user focuses on it. Results might be more or less reliable in different organisms, environments, or biological contexts - and computational predictions can themselves feed back to provide additional "experimental outcomes" in such a collection!
Thus, simply managing, archiving, and presenting genomic data can be a challenge in itself. Particularly when this presentation allows experimental biologists to rapidly find genes of interest, view them in context, and analyze existing experimental and computational results in the area, it becomes a research problem whose resolution immediately advances the state of the art in integrated computational biology.
Of course, like everyone else in academia, I work on plenty of other projects as well. I also love teaching, and I believe it's just as vital to the field as is research. I've now had the privilege of instructing at Johns Hopkins CTY for several summers, working with the novel Integrated Science course at Princeton, and mentoring students new to research in the PSURE program. All of these have been a blast for me, and hopefully for the students as well. Education and training are too often downplayed in academia, paradoxically enough, and I feel strongly about bringing quality and commitment into science classrooms at every academic level. Besides, it's more fun that way, for students and faculty alike!