Bayesian data integration

Basics

The rough goal of Bayesian data integration is to input genomic data and known functional relationships and to output predicted functional relationships. Although there are endless variations on this task, in inputs, ouputs, goals, techniques, and data, the important steps we tend to use are:

  • Input data can be any type of experimental or computational result that's been digested down to a pairwise score. Many experimental results are inherently pairwise - for example, a two-hybrid assay tells you which gene pairs bind and which don't. Data types that aren't inherently pairwise can generally be made so, and this often increases their informativity; for example, a microarray dataset provides one or more measurements per gene, but these can be made into pairwise correlations that are often easier to interpret than the measurements themselves. Practically speaking, all input data should be in the quantized pair/score format discussed here
  • Input gold standard functional relationships can include any means of finding gene pairs known to be functionally related or unrelated in a particular organism. In general, we form these gold standards from the Gene Ontology in the manner discussed in 2006. Details and refinements of this process will be discussed below.
  • Output functional relationships are a predicted functional relationship network indicating how likely every pair of genes is to interact. If one considers this network as a connection matrix, it's really just a collection of gene pair scores like your input data, each score representing a probability of functional relationship between the appropriate genes.

The basic idea of Bayesian integration is discussed in Troyanskaya 2003, but we've developed the technique fairly substantially since then. We generally use a naive Bayesian classifier, since more complex models slow down the learning and inference processes tremendously with little gain in performance. The "class" being predicted is whether two genes are functionally related (and is thus binary), and each dataset is represented by a single node in the classifier. Probability distributions over the values in these datasets are learned from the examples in the gold standard, allowing the prediction of new functional relationships for any other gene pairs with sufficient data. For an overview of this machine learning process, see this presentation.

Details

Input Data

Start with one or more datasets, each providing scores for one or more gene pairs. Boil these down to DAT/DAB files and accompanying QUANT files as described here.

Gold Standard

You want a single, global answer file that contains 0 or 1 scores for some number of gene pairs; it's fine if some fraction of the genome, possibly the majority, is neither related (1) nor unrelated (0). These unknown gene pairs will simply not be used for training (although they can, of course, be predicted later if you have data for them).

This answer file is usually generated from sets of known functionally related genes; these are in turn usually drawn from catalogs such as GO, KEGG, or MIPS. this is a good list of "interesting" GO terms, each of which represents a set of functionally related genes, and this is the resulting gold standard.

To generate positive (related) gene pairs (gi, gj) from sets of known related genes G = {G1, G2, ..., GN}, consider any gene pair to be related if there exists Gk in G such that gi in Gk and gj in Gk. That is, any two genes coannotated to some term of interest are related.

To generate negative (unrelated) gene pairs, consider gi and gj to be unrelated if for Gk and Gm in G with k != m, gi in Gk and gj in Gm. Additionally, the p-value of hypergeometric overlap between Gk and Gm must be greater than 0.05 (HG(|Gk intersect Gm|, |Gk|, |Gm|, |genome|) > 0.05). In other words, two genes are unrelated if we know what they do (i.e. they're annotated somewhere), we know those things are different (i.e. they're not coannotated anywhere), and the difference is significant (i.e. the two terms aren't too similar).

Alternately, negative pairs randomly selected from the genome (minus the positive pairs) are actually pretty good. Make sure you get the right ratio of positive to negative pairs; about ten times as many negatives as positives is a good, biologically plausible rule of thumb.

Learning

Bayesian learning is the process of determining probability distributions over each dataset's possible values given the gold standard. Briefly, this entails:

  • Counting up the total number of negative and positive pairs and dividing to get percentages.
  • For each dataset, counting up the number of times each possible value occurs separately for negative pairs and for positive pairs. See the presentation for some examples.

Since a naive Bayesian classifier assumes independence between all datasets, you can do this in isolation for each dataset. If you're being context-specific (see below), different probability distributions (i.e. a different Bayesian network) is learned for each context of interest.

Evaluation

In the absence of a gold standard, the probability of any given gene pair's functional relationship can be predicted from its data. Any gene pair with no available data is assigned the prior probability of functional relationship (that is, the proportion of related gene pairs in the standard). Any datasets that contribute a value for the gene pair in question are taken into account by the Bayesian classifier and modify this probability; see Huttenhower 2006 for details.

Any valid machine learning technique can be used to evaluate the performance of a learned classifier. It's most common to do cross validation using one or more holdout sets and display the precision/recall or ROC curve for the held out data. Predictions for gene pairs not in the standard can only really be validated by experimental work or outside verification - which is harder, but great for collaborations! For some examples of these processes, see Myers 2006 or Huttenhower 2007.

Context-Specificity

You can think of the Bayesian learning process as determining how to "weight" each dataset. Equivalently, it answers the question, "If I'm trying to predict whether two genes are related and dataset X contains the result Y, how much should that influence the probability of functional relationship?" A problem (or opportunity) arises in that different datasets can be more or less informative in different biological contexts: a microarray in which high correlation tells you nothing about nitrogen metabolism might be very informative regarding sporulation genes.

To take advantage of this type of variation, separate Bayesian classifiers can be learned for each context of interest. In most cases, the contexts are identical to the sets of known related genes used to make the gold standard; that is, is you use the genes annotated to the GO terms "alcohol metabolism", "ribosome biogenesis", and "sporulation" to generate your gold standard, then you'd learn three separate classifiers. This requires no modification of the datasets or global gold standard, but each classifier is learned using only a subset of the global standard.

To determine whether a gene pair (gi, gj) in the global gold standard should be used for learning a classifier for context Gk:

  • If gi in Gk and gj in Gk, then the pair should be used (and it's a positive pair). That is, if both genes are in the term of interest, the pair is relevant.
  • If gi in Gk or gj in Gk and the pair is unrelated (negative), it should be used. That is, if one gene is annotated to the term of interest, the other gene isn't, and they're not coannotated to any other term, the pair is relevant.
  • Any other pair should not be used.

You can think of the gene pairs related some context as the ones "in" (coannotated to) or "touching" that context, except "confusing" pairs that touch the context but are related (coannotated) elsewhere. For more information, see Huttenhower 2006 and Myers 2007.

Tools

Data Manipulation

All of the following tools are part of the Sleipnir library for large scale functional genomics. For practical purposes, they live in the shared disk space under /r01/tergeo/logrus, and each executable prints a short help message when given the -h command line flag.

KNNImputer

KNNImputer implements the knnimpute algorithm for missing value estimation in microarrays. Given a PCL file, KNNImputer performs two important tasks:

  • It removes any row (gene) with less than some percentage of columns (experiments) present, 70% by default.
  • It estimates all remaining missing values for each gene g by finding the k nearest neighbors to g (based on a selectable similarity measure) and calculating their weighted average.

Basic usage is:

KNNImputer < <input.pcl> > <output.pcl>

Or, equivalently:

KNNImputer -i <input.pcl> -o <output.pcl>

Important command line arguments include:

Flag Default Type Description
-s 2 Integer Non-data columns to skip after the ID column. For most PCL files, this should be two (the NAME and GWEIGHT columns).
-k 10 Integer Number of nearest neighbors to use for imputation.
-d euclidean Enum: euclidean, pearson, spearman, kendalls, etc. Similarity measure to use for defining nearest neighbors.
-m 0.7 Double Minimum fraction of columns that must be present to retain row.

Dat2Dab

Dat2Dab converts text-based DAT data files to equivalent binary DAB files, and vice versa. In other words, if you want to dump a DAB as text, this is what you're after. Conversely, if you have data that's ready to be put into a Bayes net learner as a DAB, run it through here.

Basic usage for converting a DAT to a DAB is one of:

Dat2Dab -i <input.dat> -o <output.dab>
Dat2Dab -o <output.dab> < <input.dat>

Usage for converting a DAB to a DAT is:

Dat2Dab -i <input.dab> -o <output.dat>
Dat2Dab -i <input.dab> > <output.dat>

Note that DAB files can't be piped, since they're binary, but DAT files can. Interesting command line flags include:

Flag Default Type Description
-m off Flag Memory map input/output DABs. This will only work if you're not modifying the DAB, since the memory mapping is read-only, but if you're just dumping a DAB, it'll speed it up a ton.
-l, -L none Strings One (-l) or two (-l and -L) gene names to look up. Combined with -m, this is a quick and easy way to retrieve one gene pair's score or all pairs involving one gene.
-n off Flag Normalize all scores to the range [0, 1] during output creation.
-z off Flag Normalize all scores using z-scoring during output creation.
-e off Flag Replace all missing scores with zeros.
-c none Double Output only pairs with scores above the given value.

Distancer

Convert a microarray PCL into a pairwise DAT/DAB file indicating correlations (or other similarity measures) between each gene pair. Default usage is one of:

Distancer -o <output.dab> < <input.pcl>
Distancer -o <output.dab> -i <input.pcl>

Command line flags include:

Flag Default Type Description
-d pearnorm Enum: pearnorm, pearson, euclidean, spearman, kendalls, etc. Similarity measure to use for pairwise calculations. Default (pearnorm) is Fisher z-scored Pearson correlation.
-s 2 Integer Non-data columns to skip after the ID column. For most PCL files, this should be two (the NAME and GWEIGHT columns).
-z on Flag Z-score pairwise scores before output. The default is on, so giving the -z flag will turn it off. The default (-d pearnorm, -z on) corresponds to the "recommended" microarray processing.

Combiner

Combiner serves dual (at least) purposes: it combines multiple PCLs into one by keying off of gene IDs, and it combines multiple DAT/DABs into one by averaging (or otherwise combining) corresponding pairwise scores.

When combining PCLs, Combiner might take two inputs like:

ID	C1	C2
A	1	2
B		3
C	4	5
ID	C3	C4
B	6	7
C	8	9

and produce an output like:

ID	C1	C2	C3	C4
A	1	2
B		3	6	7
C	4	5	8	9

When combining DAT/DABs, Combiner might take two inputs like:

A	B	1
A	C	2
B	C	3
A	B	4
B	C	5

and produce an output like:

A	B	2.5
A	C	2
B	C	4

This being said, basic usage for PCLs is one of:

Combiner -o <output.pcl> <input1.pcl> <input2.pcl> ...
Combiner <input1.pcl> <input2.pcl> ... > <output.pcl>

Usage for DAT/DABs is one of:

Combiner -o <output.dab> <input1.dab> <input2.dab> ...
Combiner <input1.dab> <input2.dab> ... > <output.dat>

Command line arguments of interest are:

Flag Default Type Description
-t pcl Enum: pcl, dat, dab Type of combination to do. DAB does not combine multiple DABs! I mean, it does, but it produces a DAD rather than an averaged DAB. Never use dab here; use pcl (the default) to combine PCLs and dat to average DATs or DABs.
-m mean Enum: mean, gmean, hmean, min, max Method to use when combining DAT/DABs. Standard average by default; geometric and harmonic means are also available for dealing with outliers, and min and max are useful for combining answer files (since max will work like an OR on positives).
-n off Flag Normalize (z-score) inputs before combining; this is very useful when averaging multiple context-specific functional relationship networks into a single global network.
-p off Flag Memory map input DABs. This is only possible when -n is off (if -n is on, -p will be ignored).

Dab2Dad

Dab2Dad combines multiple DAB files into a single DAD file, possibly in combination with an answer file (also a DAB) and/or a Bayesian network (X/DSL) for rapid, repeated learning from the same set of data and network structure. This is mainly useful for repeated processing of non-naive Bayesian networks, which can be slow; since we've moved almost exclusively to naive classifiers, Dab2Dad isn't really necessary any more. If you happen to have a DAD file around, Dab2Dad can also dump it to text.

If you have a Bayes net that you want to train, with answers and some data, the basic usage is:

Dab2Dad -n <network.xdsl> -o <data.dad> -w <answers.dab> -d 

where the data directory houses all of your DAB/QUANT files. If you just want to combine some DAB files, you can do:

Dab2Dad -o <data.dad> <input1.dab> <input2.dab> ...

And if you want to dump a DAD file as text, try:

Dab2Dad -i <data.dad>

Command line arguments include:

Flag Default Type Description
-e off Flag If working with an answer file, include all gene pairs, even those not in the gold standard (which are normally excluded).
-l, -L none Strings Look up the data values for a single gene pair (-l and -L) or all pairs containing a specific gene (-l). This is useful in that it can work either from an input DAD file or from a set of input DABs. This is a quick way to dump all of the available data for a particular gene pair.
-m off Flag Memory map input. This will make your life much more bearable when combined with -l and/or -L.

OntoShell

OntoShell is a fairly orthogonal tool meant to work with gene functional catalogs (GO/MIPS/KEGG) rather than genomic data. It mimics a system shell (hence the name) in which one explores a functional ontology rather than a file system. When you start it up, it loads one or more ontologies for an organism of choice and provides a command prompt. From here, you can "cd" to different terms in the ontologies, "ls" the genes there, "cat" individual genes' annotations, and "find" functional enrichments. Ok, so it's a little gimmicky, but it's useful!

Because of the potentially large number of input files (separate ontology and annotation files for GO, MIPS, KEGG, SGD features, etc.), OntoShell usually works with an ini file that stores default command line parameters. For example, an OntoShell.ini file to read in GO, KEGG, and MIPS for yeast on Windows might look like:

kegg		"c:\stuff\ko"
kegg_org	"SCE"
go_onto		"c:\stuff\gene_ontology.obo"
go_anno		"c:\stuff\gene_association.sgd"
mips_onto	"c:\stuff\funcat-2.0_scheme"
mips_anno	"c:\stuff\funcat-2.0_data_18052006"
features	"c:\stuff\SGD_features.tab"

Lines in this ini file can be commented out using hash marks, so if I wanted to "turn on" human annotations and "turn off" yeast, I might change it to:

kegg		"c:\stuff\ko"
#kegg_org	"SCE"
kegg_org	"HSA"
go_onto		"c:\stuff\gene_ontology.obo"
#go_anno	"c:\stuff\gene_association.sgd"
go_anno		"c:\stuff\gene_association.goa_human"
#mips_onto	"c:\stuff\funcat-2.0_scheme"
#mips_anno	"c:\stuff\funcat-2.0_data_18052006"
#features	"c:\stuff\SGD_features.tab"

If an appropriate OntoShell.ini is in the current directory, you can start OntoShell up for yeast just by running:

OntoShell

To start it up for humans, assuming you have an HGNC-mapped GO annotation file (it'll probably be gene_association.hgnc instead of gene_association.goa_human), you need to tell OntoShell to look at slightly different gene IDs:

OntoShell -l

After you do that, you'll get a command prompt. A transcript might look something like:

/> ls

- ROOT
O KEGG  6691
O GOBP  13596
O GOMF  14673
O GOCC  13857
O MIPS  0
O MIPSP 0

/> cd GOBP

GOBP> ls

- GOBP  13596
C GO:0008150         309   13287 biological_process

GOBP> cd GO:0008150

GO:0008150> ls -g

- GO:0008150         309   13287 biological_process
C GO:0000003         3     546   reproduction
C GO:0001906         0     16    cell killing
...
 AKAP8L         AMMECR1        ANKRA2         ANKRD30B       APOBEC3B(9582)
 ARID3B         ASAHL(27163)   ATXN2L         AUTS2          BAIAP2L1
...

GO:0008150> cat AKAP8L

AKAP8L
AKAP8L, NAKAP95, HRIHFB2018: A-kinase anchor protein 8-like
GOBP: GO:0008150         biological_process
GOCC: GO:0005694         chromosome
      GO:0005634         nucleus
GOMF: GO:0017151         DEAD/H-box RNA helicase binding
      GO:0003677         DNA binding
      GO:0008270         zinc ion binding

GO:0008150> find my_gene_list.txt

KEGG:
ko04140            1.19913e-007  3    3    34   6691 Genetic Information Pro...
 *ATG12(9140)      *ATG5(9474)       *ATG7(10533)      ATG3(64422)
 ATG4A(115201)     ATG4B(23192)      ATG4C(84938)      ATG4D(84971)
 BECN1(8678)       GABARAP(11337)    GABARAPL1(23710)  IFNA1(3439)
 IFNA10(3446)      IFNA13(3447)      IFNA14(3448)      IFNA16(3449)
 IFNA17(3451)      IFNA2(3440)       IFNA21(3452)      IFNA4(3441)
 IFNA5(3442)       IFNA6(3443)       IFNA7(3444)       IFNA8(3445)
 IFNG(3458)        INS(3630)         LOC441925(441925) PIK3C3(5289)
 PIK3R4(30849)     PRKAA1(5562)      PRKAA2(5563)      ULK1(8408)
 ULK2(9706)        ULK3(25989)
GOBP:
GO:0006914         2.08547e-006  3    4    24   13596 autophagy
 *ATG12(9140)  *ATG5(9474)   *ATG7(10533)  ASIP(434)     ASPA(443)
 ASPM          ATG10         ATG16L1       ATG3(64422)   ATG4A(115201)
 ATG4B(23192)  ATG4C(84938)  ATG4D(84971)  ATG9A         ATG9B
 BECN1(8678)   CLN3          MAP1LC3A      MAP1LC3B      MAP1LC3C
 RAB24(53917)  RGS19         ROPN1L        WIPI1
GOMF:
GO:0004839         0.0136291     1    4    2    14673 ubiquitin activating e...
 *ATG7(10533)     SAE1
GO:0050811         0.0340623     1    4    5    14673 GABA receptor binding
 *GABARAPL2       GABARAP(11337)   GABARAPL1(23710) JAKMIP1
 TRAK2
GO:0019807         0.0408705     1    4    6    14673 aspartoacylase activit...
 *ATG5(9474)      ACY3(91703)      ASIP(434)        ASPA(443)
 ASPM             ROPN1L

Here, "my_gene_list.txt" is a text file containing one gene ID per line, and the results are a Bonferroni-corrected TermFinder run against all available ontologies. *s indicate genes that were in the input query.

Tab completion works for most commands (it'll search term IDs, filenames, etc.), as do the up/down arrows and buffer editing using readline (just like a shell prompt). You can get help at any time using the "help" command, which will also provide information on in-program command line flags (e.g. "help ls" will tell you about "ls -g" as shown above).

The only real command line flag of interest is -x, which will execute a given string and dump the results to standard output. This is extremely useful for doing TermFinder enrichments:

OntoShell -x "find my_gene_list.txt" > my_enriched_terms.txt

This will work for any combination of ontologies in any organism, and there are in-program flags for modifying the Bonferroni correction, background, ways of listing genes and terms, etc. etc.

Gold Standard Generation

BNFunc

BNFunc's job is to take as input a list of "fringe" or "interesting" ontology terms (as per GRIFn and turn them into sets of genes. It uses the same ontology parsing code as OntoShell, and it can thus deal with any organisms or combinations of GO/KEGG/MIPS. Like OntoShell, you should use an ini file (BNFunc.ini) to specify the locations of the appropriate ontology files. All of the examples below assume that you have a proper BNFunc.ini in the current directory.

BNFunc has two modes, one in which it generates an answer DAB file directly and one in which it simply outputs files containing lists of genes. The latter is generally more useful, since the Answerer tool provides a more flexible way of turning these gene sets into a gold standard. Basic usage for generating a bunch of gene set files from BNFunc is:

BNFunc -i <term_list.txt> -d <output_directory>

Be careful about gene IDs! Like OntoShell, you should have an appropriately mapped annotation file (i.e. one using the type of IDs you're interested in, ORFs, HGNC symbols, etc.) and make sure that BNFunc reads the right ones. Options to tweak this are -s, which will use the first "synonym" in the annotation file (useful for yeast, where these are generally the ORF IDs), and -b, which will use the database ID instead of the standard name (useful for worm/fly for getting WormBase/FlyBase annotations). -l will output all ID types, which you can use as a last resort and post-process them down to the correct thing after the fact.

Oh, and be careful to specify an appropriate -d parameter. The default is the current directory, and if your list of interesting terms is big, you can end up populating your working directory with a bunch of tiny text files. Be warned!

Answerer

Given sets of known related genes, Answerer generates a gold standard DAB file. Answerer optionally takes two types of related gene sets as inputs: positive sets, which are the usual lists of related genes that get turned into positive pairs, and negative sets. Negative sets are sets of "minimally related" genes such that gene pairs not coannotated to one of them are definitely unrelated.

For example, suppose Answerer got the following two positive sets:

A
B
C

and

A
D
E

and one negative set:

A
B
C
E

Then it would generate the answer file:

A	B	1
A	C	1
A	D	1
A	E	1
B	C	1
B	D	0
C	D	0
D	E	1

Note that the pairs B E and C E are missing from this answer file: they are neither positive nor negative, since B, C, and E are all coannotated to a negative set.

In any case, the basic usage for Answerer is:

Answerer -p <positives_dir> [-n <negatives_dir>] -o <answers.dab>

Command line arguments of interest are:

Flag Default Type Description
-l 0 Double Hypergeometric p-value of overlap below which overlapping gene sets will not generate negative pairs. That is, if the genes annotated to term X and to term Y overlap with p less than the given value, no pair of genes x in X and y in Y will be considered negative. They might be related (for example, if y is annotated to both X and Y), but -l allows you to mark terms as "too similar" to allow confident negatives. A good value is -l 0.05.
-i none String (filename) Rather than calculating positive pairs from sets of related genes, Answerer can incorporate a pre-existing DAB file of related gene pairs. This is useful for merging in answers derived from non-annotation sources (e.g. Reactome, in which relationships are specified directly as pairs).
-x none Integer Expected interactions per gene. If given, Answerer will generate an appropriate number of random unrelated gene pairs (excluding any given positive pairs, of course).

Bayesian Learning/Inference

When using DAT or DAB files with any of the Bayesian network tools, you must have accompanying QUANT files. Since all of our Bayes nets are discrete (experiments with continuous networks way back when showed that they don't buy us anything), the QUANT files are automatically loaded to bin continuous values from a DAT or DAB. Recall that "accompanying" means that a data file named data.dab should have a file data.quant in the same location, consisting of a single tab-delimited line of numbers, one per bin edge.

BNCreator

BNCreator is the most straightforward of the Bayes net learning and inference tools. Given some data and a gold standard, it'll learn a naive classifier; given some data and a naive classifier, it'll output the complete predicted functional relationship network. In other words, if you just want to learn a Bayes net and use it to predict gene interactions, this is what you're after. Basic usage for learning is:

BNCreator -w  -o    ...

For evaluation, normal syntax is:

BNCreator -i  -o  -d 

You almost certainly want to use both the -Z and -m flags during both learning and evaluation; check out their table entries below. Note the difference in describing the input data between the learning and evaluation cases: the former requires individual data files on the command line, the latter a directory containing the data files given with -d. This is due to the fact that building a naive classifier requires knowing exactly which datasets (nodes) should be included in the network. When evaluating a classifier, the names (but not paths) of the datasets to look for are already included in the (X)DSL file, so the program just needs to know what directory contains them.

Command line flags include:

Flag Default Type Description
-Z none String (filename) The file given to the -Z flag should be a tab-delimited text file containing two columns, one listing node IDs (i.e. dataset names without the extension, such as blah for a dataset stored as blah.dab) and one listing bin numbers (starting with zero). Any given dataset/value pair is used as a default value for pairs not otherwise present in that dataset. This is the recommended way of providing defaults for a particular data type; for example, two_hybrid_results.dab might only contain pairs of the form A B 1 indicating which gene pairs bound in the experiment. But zeros.txt given to the -Z flag might include a line two_hybrid_results 0, which will make the Bayes net treat every other value as 0. Missing pairs in datasets without default values (e.g. microarrays) are ignored during both learning and evaluation.
-z off Flag Assume that 0 is the default value for every missing pair in every dataset. This is rarely the right thing to do, especially if your input data includes microarrays, but it can be useful for testing. Overrides the -Z file.
-m off Flag Memory map input. Can greatly speed things up, as usual.
-c none String (filename) Text file containing a list of genes. This gene list is the "context" for context-specific learning; if given, only gene pairs related to this gene list (as per the MEFIT description) will be used for calculating Bayesian probabilities.
-b none String (filename) Default or fallback (X)DSL file for context-specific learning. Because context-specific learning can greatly reduce the effective size of the gold standard, sometimes there are too few gene pairs in a particular condition to learn a reasonable probability distribution. In this case, if a default file is given with -b, the distribution from that network will be used instead. Typically, BNCreator is used without -c to learn a global (X)DSL first, and then any context-specific learning is performed using that global file with -b.

BNWeaver

BNWeaver is essentially a multi-threaded version of BNCreator made specifically for learning context-specific networks. It does exactly the same thing as running BNCreator a bunch of times with different -c (context) parameters. Parallelizing a bunch of BNWeavers across a cluster, each with multiple threads, is a great way to learn a whole lot of Bayes nets very quickly.

Minimal usage is:

BNWeaver -w <answers.dab> -o <classifier_dir> -d <data_dir> <context1.txt> <context2.txt> ...

A more normal usage might be:

BNWeaver -w <answers.dab> -o <classifier_dir> -d <data_dir> -Z <zeros.txt> -b <default.xdsl> -m -t 4 <context1.txt> <context2.txt> ...

Each of the context files given as filenames without flags on the command line is a gene set of interest, just as would be provided to the -c flag of BNCreator. So if given contexts 1 through 5 in the last example, BNWeaver would produce five XDSL files in <classifier_dir>, each learned using all of the DAB files in <data_dir> and only the genes relevant to one of the five contexts, with fallback probability distributions coming from <default.xdsl>. It would do it in two passes, four threads in parallel to learn the first four networks and one thread afterwards to learn the fifth. -t 5 would learn them all in parallel, at least on a machine with five or more cores. The -Z and -m flags (and a few others) behave identically to those in BNCreator.

BNUnraveler

BNUnraveler is the evil twin of BNWeaver: while BNWeaver takes answers and data to produce a Bayesian classifier, BNUnraveler takes a classifier and data to produce a predicted functional relationship network. In other words, BNUnraveler and BNWeaver together are multithreaded versions of the two functions of BNCreator, classifier learning and interaction network inference.

As with BNWeaver, minimal usage might be:

BNUnraveler -i <classifier_dir> -o <fr_dir> <context1.txt> <context2.txt> ...

But more realistic usage might look like:

BNUnraveler -i <classifier_dir> -o <fr_dir> -Z <zeros.txt> -t 5 -m -e <context1.txt> <context2.txt> ...

A quick rundown of command line flags:

Flag Default Type Description
-e off Flag The "everything" flag, if present, produces predictions for every possible gene pair, including those not relevant to the context of interest. This is usually desirable, since non-context gene pairs can be screened out later during evaluation, if they need to be screen out at all.
-t -1 Integer Number of threads to use; default is to use as many as there are contexts, which can swamp machines without enough cores!
-w none String (filename) Answer file; if given, only gene pairs with answers will be included in the predicted functional relationship network. This can speed things up if you're not planning on doing anything with the predicted network other than a precision/recall or ROC curve.
-g none String (filename) If given, filter input data to include only genes from the given text file (one gene ID per line).
-G none String (filename) If given, filter output predictions to include only genes from the given text file (one gene ID per line). -g and -G will generally have identical effects.
-Z none String (filename) Text file of default values for any datasets that need them, as per BNCreator.
-z off Flag Default every dataset to 0, as per BNCreator. Overrides -Z.
-m off Flag Memory map input. Can greatly speed things up, as usual.

BNConverter

BNConverter is the oldest Bayes net tool of the bunch, and its name reflects it; it was originally written to convert between PNL and SMILE formatted Bayes nets, and its primary function now doesn't really have anything to do with that. Specifically, its remaining purpose in life is to learn non-naive Bayes networks, much like BNCreator does for naive classifiers; evaluation is performed with BNTester.

Because of its age, BNConverter has even more cruft than the rest of these tools. First of all, PNL support is no longer included by default, no matter what the command line help says. It's _there_ if anyone needs to learn/evaluate continuous Bayes nets for any reason, but it's slow, PNL's huge and clunky, and it generally takes special coddling. Nevertheless, we do have the ability to look at continuous (Gaussian, etc.) Bayes nets if necessary.

Normal usage of BNConverter for learning is:

BNConverter -i <input.xdsl> -o <learned.xdsl> -w <answers.dab> -d <data_dir>

There are many command line flags, but the most interesting ones are:

Flag Default Type Description
-D none String (filename) Replaces -d and -w; the given filename is a DAD file constructed from the network, answer file, and data files of interest. Using -D in place of -d and -w can greatly speed up repeated learning with the same Bayes net structure, e.g. when learning context-specific networks over the same data and answers but with different subsets of the answer file.
-p off Flag Use the PNL library instead of SMILE; see caveat above.
-f off Flag Use function-fitting Bayes nets. These are essentially a custom implementation of continuous naive networks able to work with any parameterizable, maximum-likelihood fittable function. This includes Gaussians and a few other simple distributions. I've never gotten them to do anything particularly interesting, but they're there.
-a off Flag Start probability tables from a random point before EM learning. Usually a good idea.
-s 20 Integer EM iterations to perform. 20 is pretty high; usually 5-10 is plenty.
-c none String (filename) List of genes for context-specific learning, as per BNCreator.
-z off Flag Default missing pairs in all datasets to 0. BNConverter has no -Z flag simply because it's never been added; if it would be helpful, it's easy to implement.
-l off Flag Use Extended Logistic Regression rather than Expectation Maximization for probability learning. EM is a generative model, ELR is a discriminative model that's essentially gradient descent over Bayes net parameters. Interesting but slow and without any real benefit in the tests we've run before.
-b none String (filename) Fallback Bayes net to use for probability distributions when the available data is too sparse; generally used with context-specific learning as per BNCreator.

BNTester

What BNUnraveler is to BNWeaver, BNTester is to BNConverter. Once you've learned a non-naive Bayesian network's parameters with BNConverter, you can infer a functional relationship network from it using BNTester. Most of the syntax and concepts are the same, with normal usage resembling:

BNTester -d <data_dir> -i <learned.xdsl> -o <fr.dab>

The command line arguments look generally familiar:

Flag Default Type Description
-y off Flag Evaluate every pair, even those without answers; analogous to the -e flag in BNUnraveler. This should only matter in conjunction with -D, not -d.
-D none String (filename) Replace the data directory -d with the DAD file given with -D. Useful for repeated context-specific evaluation as per BNConverter.
-f off Flag Use function-fitting networks; necessary if you're evaluating a network constructed with function-fitting by BNConverter.
-p off Flag Use PNL instead of SMILE; same deal as all the PNL warnings in BNConverter.
-z off Flag Replace all missing pairs with 0. Again, there's no -Z, but it would be easy to add.

Evaluation

DChecker

Given an inferred functional relationship network and an answer file, DChecker will generate a ROC and/or precision/recall curve. More specifically, it can order, bin, and evaluate predicted scores (i.e. edge weights in an interaction network) relative to a gold standard (i.e. an answer file with known related/unrelated pairs). The output will also include a handful of descriptive statistics, gene counts, and an AUC.

Depending on the range and character of the scores you're evaluating, there are a few different usage paradigms. The most general will bin any range of scores into quantiles and evaluate them against a binary (0/1) gold standard:

DChecker -i <fr.dab> -w <answers.dab> -b 1000 > <bins.txt>

If you have finitely many scores rather than a continuous distribution (e.g. you're evaluating a single discretized dataset), DChecker can be made to process those scores more precisely (and efficiently):

DChecker -i <fr.dab> -w <answers.dab> -f > <bins.txt>

If you have scores in a finite [0, 1] (or you request pre-normalization of your scores to that range), you can bin by monotonic steps rather than quantiles:

DChecker -i <fr.dab> -w <answers.dab> -e 0.01 > <bins.txt>

All of these usages can be modified by various command line parameters:

Flag Default Type Description
-n off Flag Normalize all input scores to the range [0, 1] before evaluation.
-t off Flag Use 1-score in place of score; useful in combination with -n, since the normalization will happen first.
-m none Double If using monotonic steps rather than quantiles, start at this minimum scores.
-M none Double If using monotonic steps rather than quantiles, end at this maximum score.
-e 0.01 Double If using monotonic steps rather than quantiles, the size of the steps to take (between -m and -M).
-b none Integer If given, use the given number of quantiles rather than monotonic steps.
-f off Flag If given, count finitely many input values (rather than using quantiles or monotonic steps).
-c none String (filename) If given, evaluate only pairs relevant to the given context (set of genes).
-p off Flag Memory map input; cannot be used with anything that modifies the input scores, e.g. -n or -t.

Post-Hoc Analysis

BNTruster

BNTruster is a program with spectacularly simple usage and fairly subtle behavior. You run it on any number of XDSL files that share the same structure:

BNTruster <learned1.xdsl> <learned2.xdsl> ... >  <trusts.txt>

It produces a table of "trusts" or informativity scores for each dataset (node in the structure) in each context (XDSL). This will have the form:

Context	Dataset1	Dataset2	...
Context1	0.1	0.2	...
Context2	0.3	0.4	...
...

Each score represents the weighted average difference between the posterior and the prior over all values from that dataset. Confused? Well, our Mediawiki isn't set up to handle Latex, but if it was, the formula would be: T(D,C) = \sum_{d \in D} P(D = d|C) * |P(FR|C) - P(FR|D = d, C)|. Here, the trust T of dataset D in context C is the sum over each value d of d's prior probability in C times the difference between C's prior and the posterior given D = d.

BNTruster's only command line flag gives a few other ways of calculating trust scores, but none of them are nearly as good:

Flag Default Type Description
-t posteriors Enum: posteriors, sums, ratios "posteriors" calculates trusts as described above. "sums" calculates the sum of differences between FR and !FR cases for all d in D, and "ratios" calculates the product of their ratios.

Dat2Graph

Dat2Graph provides one of the most graphical ways to view a DAT/DAB, specifically dumping it as a DOT file that can be visualized by Graphviz. There's also some rudimentary support for other graph visualization packages, but I've never gotten any of them to work as well. Basic usage is:

Dat2Graph -i <input.dab> -e <cutoff> > <output.dot>

Trust me, you rarely want to run this without a high cutoff (e.g. 0.9 if you're using correlations). At least run wc on your DOT file to sanity check before sending Graphviz off on an impossible rendering task. Interesting command line flags include:

Flag Default Type Description
-t dot Enum: dot, gdf, net, matisse, list, dat, correl Produce output in the given format rather than DOT; support is more or less lacking for every other format. DAT is present to produce just that, a plain DAT file, but with the appropriate filtering performed beforehand (most useful for things like bioPIXIE queries with -q).
-e none Double Only output edges with scores above the given value. A must-have for any large graph!
-f none String (filename) SGD features file for annotating graphs with common names.
-m off Flag Memory map input file; incompatible with modifications like -n.
-n off Flag Normalize edge weights to the range [0, 1] before processing. Useful for coloring edges sanely and calibrating -e.
-q none String (filename) Output only the subgraph that results from a bioPIXIE query with the given genes. The neighborhood size can be modified with -k (default 40).
-w none String (filename) DAT/DAB file of edges to exclude; any edge that's present and nonzero in the given DAT/DAB will be excluded from the output DOT.
-l none String (filename) Input text file containing colors for the graph nodes, one per line, in the range [0, 1]. Line order must correspond to node (gene) order in the input DAT/DAB (and thus output DOT).
none String (filename) Input text file containing border widths for the graph nodes, one per line, in floating point values corresponding to pixel counts. Default for all borders is one pixel; again, order must correspond to gene order in the input DAT/DAB.

Hubber

Nominally, Hubber finds information about the "hubbiness" of genes and gene sets in a given DAT/DAB representing a predicted functional relationship network. In reality, what this means is that Hubber will predict gene function and detect poorly annotated GO terms, both of which are fairly important tasks. It evolved to this point gradually, so the usage is a little oblique; the two basic modes are:

Hubber -i <fr.dab> -g 100 <context1.txt> <context2.txt> ... > <hubs.txt>
Hubber -i <fr.dab> -g -1 <context1.txt> <context2.txt> > <hubs.txt>

When -g is zero or greater, Hubber produces one row of output per context, each indicating the hubbiness, cliquiness, and deviations of these values in each input context in the given interaction network. "Hubbiness" here means average out connectivity, i.e. the average weight of an edge going from a gene in the context to any gene in the genome. "Cliquiness" means the average in connectivity, i.e. the average weight of an edge connecting two genes in the context. The standard deviations are thus over all such edges.

A gene's strength of connection to a context is thus the ratio of its in-connectivity to that context over its out-connectivity, i.e. its average connection strength to a gene in the context divided by its average connection strength. When -g is greater than zero, the -g genes with the highest connection strengths are output for each context.

When -g is -1, the output is formatted as a table in which each column represents a gene, each row a context, and the cells are connection strengths. This isn't actually very convenient for context-specific predictions (you have to run Hubber multiple times, once for each context's functional relationship DAB, and then merge the resulting tables), but it's great for getting a genome-wide prediction set from a global interaction network.

The only other relevant command line options are the usuals:

Flag Default Type Description
-m off Flag Memory map input files; incompatible with -n.
-n off Flag Normalize weights in the input DAT/DAB to the range [0, 1]. Hubber can handle any positive edge weights, but negatives will break it, so use -n if your input might have negative scores in it.

Overlapper

Given two DAT/DABs, usually answer files, Overlapper produces a confusion matrix between the two. That is, given a.dab and b.dab, Overlapper will tell you for how many pairs A and B both say 0, A and B both say 1, A says 0 and B says 1, and so forth, including pairs that are in A but not B and vice versa. So you might run:

Overlapper -i <one.dab> -I <two.dab> -m

And get output like:

one.dab: 4708 genes
two.dab: 6691 genes

one.dab +
467269	35732	2409
1333762		10376795	two.dab -
84958   1518076 7098651
one.dab -

This means that there are:

  • 467269 pairs positive in both standards
  • 35732 positive in one but not present in two
  • 2409 positive in one but negative in two
  • 1333762 positive in two but not present in one
  • 10376795 negative in two but not present in one
  • 84958 positive in two but negative in one
  • 1518076 negative in one but not present in two
  • 7098651 pairs negative in both standards

Overlapper will also output some rough statistics about how overlappy two standards are, but they're not particularly meaningful. For once, there aren't any other command line flags, although it's strongly recommended that you use -m! Otherwise, you're loading two (probably) very large files completely into memory both at once.