HUMAnN: The HMP Unified Metabolic Analysis Network
HUMAnN is a pipeline for efficiently and accurately determining the presence/absence and abundance of microbial pathways in a community from metagenomic data. Sequencing a metagenome typically produces millions of short DNA/RNA reads. HUMAnN takes these reads as inputs and produces gene and pathway summaries as outputs:
- The abundance of each orthologous gene family in the community. Orthologous families are groups of genes that perform roughly the same biological roles. HUMAnN uses the KEGG Orthology (KO) by default, but any catalog of orthologs (COG, NOG, etc.) can be employed with minor changes.
- The presence/absence of each pathway in the community. HUMAnN refers to pathway presence/absence as "coverage," and defines a pathway as a set of two or more genes. HUMAnN uses KEGG pathways and modules by default, but again can easily be modified to use GO terms or other gene sets.
- The abundance of each pathway in the community, i.e. how many "copies" of that pathway are present.
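To make the three outputs concrete, here is a minimal Python sketch of how per-pathway coverage and abundance could be derived from gene family abundances. It is a toy illustration only and does not reproduce HUMAnN's actual calculations: the function names, the gene and pathway identifiers, the "fraction of genes detected" coverage score, and the "mean of member genes" abundance score are all assumptions made for this example.

```python
# Toy illustration only; NOT HUMAnN's actual coverage or abundance algorithm.
# All identifiers and numbers below are hypothetical.


def pathway_coverage(gene_abundance, pathway_genes):
    """Fraction of the pathway's gene families detected (abundance > 0)."""
    present = sum(1 for gene in pathway_genes if gene_abundance.get(gene, 0.0) > 0)
    return present / len(pathway_genes)


def pathway_abundance(gene_abundance, pathway_genes):
    """Mean abundance over the pathway's gene families (missing genes count as 0)."""
    total = sum(gene_abundance.get(gene, 0.0) for gene in pathway_genes)
    return total / len(pathway_genes)


# Hypothetical gene family abundances for one community.
gene_abundance = {"KO_A": 4.0, "KO_B": 2.0, "KO_C": 0.0}

# Hypothetical pathways, each defined as a set of two or more gene families.
pathways = {
    "pathway_1": ["KO_A", "KO_B"],
    "pathway_2": ["KO_B", "KO_C", "KO_D"],
}

for name, genes in pathways.items():
    coverage = pathway_coverage(gene_abundance, genes)
    abundance = pathway_abundance(gene_abundance, genes)
    print(name, round(coverage, 3), round(abundance, 3))
```

In this sketch a pathway whose genes are all detected gets coverage 1.0, while a pathway with only one of three genes detected gets roughly 0.33, regardless of how abundant that one gene is.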
HUMAnN can be used in tandem with any translated BLAST program, with out-of-the-box support for NCBI BLAST, USEARCH, MBLASTX, and MAPX. The pipeline converts sequence reads into coverage and abundance tables summarizing the gene families and pathways in one or more microbial communities. This lets you analyze a collection of metagenomes as a matrix of gene/pathway abundances, just as you might analyze a collection of microarrays.
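As a rough illustration of the "matrix of abundances" view, the following Python sketch combines per-sample abundance tables into a single feature-by-sample matrix. The function name `build_abundance_matrix`, the sample names, and the feature identifiers are hypothetical, and the in-memory dictionaries stand in for whatever output files you actually load; this is not HUMAnN's own merging code.

```python
# Hypothetical sketch: merge per-sample abundance tables into one matrix.
# Features absent from a sample are filled in with 0.0.


def build_abundance_matrix(samples):
    """Return (feature_names, sample_names, rows).

    rows[i][j] is the abundance of feature i in sample j.
    """
    sample_names = sorted(samples)
    feature_names = sorted({f for table in samples.values() for f in table})
    rows = [
        [samples[sample].get(feature, 0.0) for sample in sample_names]
        for feature in feature_names
    ]
    return feature_names, sample_names, rows


# Hypothetical per-sample gene family abundances.
samples = {
    "sample_1": {"KO_A": 3.0, "KO_B": 1.0},
    "sample_2": {"KO_B": 2.0, "KO_C": 5.0},
}

features, names, matrix = build_abundance_matrix(samples)

# Print as a tab-delimited table: one row per feature, one column per sample.
print("\t".join(["feature"] + names))
for feature, row in zip(features, matrix):
    print("\t".join([feature] + [str(value) for value in row]))
```

Once the data are in this shape, the same row-and-column tools used for expression matrices (normalization, clustering, differential abundance tests) can be applied directly.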
We are aware that KEGG is now commercial, and we have updated HUMAnN accordingly. In brief, we include the derived files and information needed for normal HUMAnN operation, but creation and evaluation of synthetic metagenomes is impeded without a KEGG license. Please contact the KEGG developers if this is an inconvenience for you, or contact us at the HUMAnN Google Group for assistance in evaluating HUMAnN output if necessary.
Many thanks to the NIH and to the entire Human Microbiome Project team for making the HMP possible, and to the many collaborators who helped to make HUMAnN a reality. Sahar Abubucker and Makedonka Mitreva (Washington University) co-led the Metabolic Reconstruction group, and Nicola Segata (Harvard School of Public Health) performed many HMP-specific analyses. The pipeline incorporates software from Yuzhen Ye (Indiana University), Beltran Rodriguez-Mueller (SDSU), and Pat Schloss (University of Michigan). Specific contributors include Alyx Schubert (University of Michigan), Jeremy Zucker (Broad Institute), Brandi Cantarel (UMD), Qiandong Zeng (Broad Institute), Johannes Goll (JCVI), and many others.
An overview of HUMAnN
Metabolic modules differentially abundant in one or more body sites of the human microbiome
Synthetic mock communities for validation
We generated four synthetic metagenomes to aid in evaluating HUMAnN's predictive accuracy: two high-complexity (HC, 100 organisms) metagenomes called HC1 and HC2, and two low-complexity (LC, 20 organisms) metagenomes called LC1 and LC2. HC1 and LC1 have even distributions (all organisms present at equal abundance), while HC2 and LC2 have staggered distributions (organisms have random, log-normally distributed abundances). Organisms included in the LC metagenomes were manually selected from KEGG v54-curated reference genomes associated with the human microbiome, while organisms included in the HC metagenomes were randomly selected from all manually curated bacterial genomes.
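The difference between the even and staggered designs can be sketched in a few lines of Python. This is an illustrative reconstruction, not the script used to build the published metagenomes: the function names, the log-normal parameters (`mu=0.0`, `sigma=1.0`), and the fixed seed are assumptions, since the text above does not state them.

```python
import random

# Illustrative sketch only; log-normal parameters and seed are assumptions.


def even_abundances(n_organisms):
    """Every organism present at the same relative abundance."""
    return [1.0 / n_organisms] * n_organisms


def staggered_abundances(n_organisms, mu=0.0, sigma=1.0, seed=0):
    """Random log-normally distributed abundances, normalized to sum to 1."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    raw = [rng.lognormvariate(mu, sigma) for _ in range(n_organisms)]
    total = sum(raw)
    return [value / total for value in raw]


# The four designs described above: HC = 100 organisms, LC = 20 organisms.
designs = {
    "HC1": even_abundances(100),
    "HC2": staggered_abundances(100),
    "LC1": even_abundances(20),
    "LC2": staggered_abundances(20),
}

for name, abundances in designs.items():
    # The max/min ratio is 1 for even designs and larger for staggered ones.
    ratio = max(abundances) / min(abundances)
    print(name, len(abundances), round(ratio, 2))
```

Staggered communities are the harder test case, because pathways carried only by low-abundance organisms are supported by far fewer reads.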
- Completely revamped pathway coverage calculation, much more accurate for low-abundance events (thanks to Kat Huang, Sean Sykes!)
- Fixed hits2metarep.py handling of empty hits files (thanks to Pavan Kumar!)
- Fixed missing KO gloss annotations in merged 01b*.txt per-gene tabular abundance quantification output files.
- Added tab-delimited input file formats.
- Added GraPhlAn tree output files to enable visualization of abundance overlays on KEGG hierarchies (thanks to Jovian Yu, Morgan Paull!).
- Added preliminary organism-specific output generation.
- Allow module2modulec.py to remove unusual duplicate enzymes from KEGG's files
- Allow input filenames to contain underscores
- Fix module size calculation in filter.py
- Fix a bug in hits2enzymes.py to allow a wider range of KEGG gene name detection
- Fix missing exclude.py (thanks to Brandi Cantarel!)
- Add several internal evaluation pipelines in response to initial reviews
- Fix hits2*.py handling of zero/very small e-values (thanks to Fah Sathira!)
- MAJOR CHANGE: KEGG is now commercial, and HUMAnN has been updated accordingly
  - KEGG derived information needed for normal operation is included
  - KEGG files needed for synthetic metagenome construction are _not_ included
  - "Frozen" synthetic metagenome evaluation is still possible
  - Please contact us directly for more information if needed
- Add documentation on potential maq issues (thanks to Shinichi Sunagawa!)
- Fix a typo in fastq2fasta.py formatting (thanks to Shinichi Sunagawa!)
- Fix a typo in module2modulec.py formatting (thanks to Kathryn Iverson!)
- Fix a typo in eco.py for overly sparse input files (thanks to Jeffrey Werner!)
- Work around Mac OS X zcat issues (thanks to Jeffrey Werner!)
- Fix a typo in hits2enzymes.py (only affected unused filter option)
- Add complete parameter evaluation process to HMP pipeline