
My lab in the Biostatistics Department at the Harvard School of Public Health focuses on computational methods for systems biology using data mining in large genomic data collections. What that means without so much jargon is that we find ways to use all available experimental results - which can mean billions of data points - to answer specific biomedical questions, and we work hard to make sure that the outcome is realistic and useful to biologists and physicians.
The computational methodology required for large-scale data mining is a mix of machine learning and efficient algorithms. If you're interested in computer science, this touches a number of practical areas:
- How do you deal with learning spaces that are large in both dimensions (features and records), dense (all or nearly all feature values are present), and noisy?
- How do you perform your machine learning efficiently in both time and space? Parallelism and online learning are both areas of interest for my lab.
- How do you build models that can incorporate prior knowledge and domain-specific metadata (such as biological or experimental context - keep reading below!) without becoming computationally inefficient?
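To make the online learning question above concrete, here is a minimal sketch, assuming a simple logistic regression updated by stochastic gradient descent. It is an illustration only, not a description of any method used in the lab; the class `OnlineLogisticRegression`, the generator `synthetic_stream`, and the toy data are all hypothetical. The point is that each record is seen once and then discarded, so memory use depends on the number of features rather than the number of records.

```python
import math
import random


def sigmoid(z):
    # Logistic function, written to avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


class OnlineLogisticRegression:
    """Hypothetical helper: logistic regression trained one record at a time."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, features):
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return sigmoid(z)

    def update(self, features, label):
        # One stochastic gradient step on the log loss for a single record.
        error = self.predict_proba(features) - label
        for i, x in enumerate(features):
            self.weights[i] -= self.learning_rate * error * x
        self.bias -= self.learning_rate * error


def synthetic_stream(n_records, seed=0):
    # Toy noisy data (an assumption for illustration): the label is 1 when
    # the first feature exceeds the second, with 10% of labels flipped.
    rng = random.Random(seed)
    for _ in range(n_records):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        label = 1 if x[0] > x[1] else 0
        if rng.random() < 0.1:
            label = 1 - label
        yield x, label


model = OnlineLogisticRegression(n_features=2)
for features, label in synthetic_stream(5000):
    model.update(features, label)  # the record is not stored after this step
```

Because the stream is a generator, the full dataset never needs to fit in memory, which is the property that matters when the records number in the billions.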
Although the scale of available data is large, these data mining techniques turn out to be applicable at a wide range of biological scales. Two areas that I'm particularly interested in are:
Computational models for functional genomics in higher organisms, particularly for human health and populations. Multicellular organisms (and human beings in particular) represent a unique opportunity for bioinformatic analysis. Paradoxically, although we tend to know more details about simple model organisms, the vast majority of the experimental data out there deals with metazoans. However, the same complexities that make us human - hundreds of distinct tissue types, combinatorial intra- and intercellular signaling, and development from a single cell to over ten trillion - make it difficult to translate results from unicellular models into higher organisms. A selection of the challenges involved includes:
- Practical. Bacteria and most unicellular organisms have in the neighborhood of 5,000 genes, and as you scale up genome size to humans' ~25,000, data management and analysis become difficult.
- Experimental. You can't grow most multicellular organisms in petri dishes, so there's just plain not as much detailed information available, and computational predictions are harder to validate in a meaningful way.
- Evolutionary. After genomes have been duplicated and rearranged across millions of years, how can you tell when sequence conservation equates to functional conservation? What does it mean when one gene in a model organism is homologous to a whole set of human genes (especially if they're expressed in different tissue types or at different times)?
- Theoretical. How do you best search multiple aligned interaction networks for dense subgraphs? In other words, if you know (or at least guess) what every gene is doing in several related organisms, how do you pick out the interesting bits?
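As a toy illustration of the dense-subgraph question above, the sketch below assumes the networks have already been aligned, so that corresponding genes share one identifier. It keeps only the edges conserved in every network and then applies greedy peeling, a standard approximation for the densest-subgraph problem. This is a swapped-in textbook technique chosen for brevity, not a method described on this page; the function names and the gene labels `g1` to `g7` are hypothetical.

```python
def conserved_edges(networks):
    """Return the edges present in every network.

    Assumes nodes are already aligned to shared IDs and that edges are
    undirected pairs with no self-loops.
    """
    edge_sets = [{frozenset(edge) for edge in net} for net in networks]
    return set.intersection(*edge_sets)


def densest_subgraph(edges):
    """Greedy peeling: repeatedly drop a minimum-degree node and keep the
    node set with the highest edges-per-node density seen along the way."""
    adjacency = {}
    for edge in edges:
        u, v = tuple(edge)
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    n_edges = len(edges)
    best_density, best_nodes = 0.0, set()
    while adjacency:
        density = n_edges / len(adjacency)
        if density > best_density:
            best_density, best_nodes = density, set(adjacency)
        node = min(adjacency, key=lambda n: len(adjacency[n]))
        for neighbor in adjacency.pop(node):
            adjacency[neighbor].discard(node)
            n_edges -= 1
    return best_nodes, best_density


# Two toy interaction networks over aligned gene IDs (made-up data).
network_a = [("g1", "g2"), ("g1", "g3"), ("g1", "g4"), ("g2", "g3"),
             ("g2", "g4"), ("g3", "g4"), ("g4", "g5"), ("g5", "g6")]
network_b = [("g2", "g1"), ("g3", "g1"), ("g4", "g1"), ("g2", "g3"),
             ("g4", "g2"), ("g3", "g4"), ("g4", "g5"), ("g6", "g7")]

shared = conserved_edges([network_a, network_b])
nodes, density = densest_subgraph(shared)
```

Requiring an edge to appear in every network is the strictest possible notion of conservation; with noisy data one would more plausibly score edges by how many networks support them, which this sketch does not attempt.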
My work includes initial efforts to broaden our functional analysis techniques for higher organisms, with the particular goal of using computation to advance the state of genomic medicine and our mechanistic understanding of human disease. This includes not only making computational predictions in multicellular organisms, but also comparing and analyzing their commonalities and differences and using better studied model systems to enhance our knowledge of less well understood ones.
Computational models for functional genomics in microbial communities and metagenomes run into many of the same problems. Model bacteria like E. coli have been very closely studied, but there are upwards of two pounds of bacteria in your gut right now, and we don't know what they're doing there. Or rather, we know that they're helping to digest your food, to keep your immune system in good shape, and probably to fight off metabolic disorders like diabetes and obesity - but we don't yet know how. Much the same can be said of pathogen communities: it's not known where or how tuberculosis replicates in human hosts, studies are just beginning to characterize functional variation in viruses ranging from the flu to HIV, and the interactions between the malaria parasite and its hosts have only been unraveled in the last few years. And if you scoop up the water we drink or the soil we grow our crops in, either one contains thousands of species of microbes, almost all uncharacterized.
The little pieces of these big puzzles that I'm involved with deal with some of the same systems and functional biology questions that can be tackled in multicellular organisms using large-scale data mining:
- How can we use large-scale data integration to predict protein function and biological networks in uncharacterized microbial populations? Sequence homology doesn't cut it in most microorganisms; horizontal gene transfer and sub/neo-functionalization wipe out a lot of the expected signal.
- How can we integrate data across species in a microbial community in order to predict the overall behavior of the system? This requires a different perspective on biological function, since there's a difference between what one bug in your gut is doing to survive and what they're all doing together to help eat your food.
- Where does "you" stop and your microbial community begin? There are more microbial cells in you than human cells, and they're just as beneficial to your health as your kidneys or lungs (unless they're pathogens, in which case they're not as nice). I'm interested in using large-scale data mining to assemble a joint picture of microbial and host interactions as a single biological system.
Like everyone else in academia, I work on plenty of other projects as well. I also love teaching, which is just as vital to the field as research is. I've had the privilege of instructing at Johns Hopkins CTY for several summers, working with the novel Integrated Science course at Princeton, and mentoring students new to research in the PSURE program. All of these have been a blast for me, and hopefully for the students as well. Education and training are too often downplayed in academia, paradoxically enough, and I feel strongly about bringing quality and commitment into science classrooms at every academic level. It's more fun that way, on both sides of the lectern!
