Genomic data formats

External Formats

Gene Identifiers

Identifying genes is much trickier than it sounds, as conventions vary wildly from organism to organism (particularly in higher organisms where alternative splicing becomes an issue). Some quick guidelines:

Microarray Probes

Don't confuse microarray probe identifiers with gene identifiers! Probe identifiers are usually specific to a manufacturer (e.g. Affymetrix, Agilent, etc.) rather than an organism. More importantly, multiple probes might correspond to a single gene, so be careful! There are a variety of mechanisms for mapping from multiple probes to a single gene's expression level (which this margin is too narrow to contain). Some common probe identifier families are:

Yeast

ORF names (e.g. YAL012W) are the standard for uniquely identifying genes in a computational system. '''Do not''' use them when presenting data to biologists! This is a social gaffe that will label you forever as a computer scientist with no grasp of biology (well, maybe not forever). Instead, use the standard gene name (e.g. AIM2) for presentation and user interface purposes. In many cases, you'll also want to watch out for capitalization, italicization, and other conventions for indicating genes, proteins, knockouts, and so forth.

Human

HGNC symbols (e.g. BRCA1) are a useful compromise for both uniquely identifying and presenting human genes. Human gene identification is ''extremely'' difficult due to multiple versions of the reference sequence, different schemes for genes/transcripts/proteins, splice variants, a variety of conflicting gene databases, and so forth. Some common formats are:

Mouse

MGI IDs (e.g. MGI:104537) are a reasonable standard for uniquely identifying genes during analysis. Again, standard names (e.g. Brca1) should be used for presentation. Mouse genes are also often identified by many of the same systems as human genes.

Worm

WormBase sequence IDs (e.g. F18E2.2) are a good standard for uniquely identifying genes. WormBase gene IDs (e.g. WBGene00006512) can also be used. In either case, common names (e.g. abcf-1) should be used for presentation.

Fly

FlyBase gene IDs (e.g. FBgn0000043) are a good standard for uniquely identifying genes. Sequence annotations (e.g. CG12051) are also reasonable. In either case, common names (e.g. Act42A) should be used for presentation, even though the Drosophila community is crazy.

PCL Files

PreCLustered microarray data is a tab-delimited text file consisting of some headers, gene identifiers, and numerical values (generally representing microarray intensity log-ratios). It looks like this:

The important pieces are:

  • The first row is a header, indicating what's in each column. The first few columns are themselves headers, indicating what's in each row (gene IDs, names, and weights). The remaining columns each represent a single microarray, and the column headers describe the conditions under which those microarrays were run.
  • The second row, EWEIGHT, is sometimes omitted. Its original purpose was to indicate how much to weight each experiment (column) during clustering (e.g. to downweight a pair of duplicated arrays), but it's no longer used with any frequency.
  • The remaining rows each represent one gene's expression values in the one or more microarray conditions, along with appropriate identifying information.
  • The first column, often marked UID, GID, ORF, ID, or some such, is a unique identifier for each gene in the experiment(s). In yeast, this is almost always the ORF identifier (e.g. YAL012W).
  • The second column, usually called NAME, contains some amount of additional human-readable information about the gene. This can range from a simple copy of the ID (which isn't that useful), to the standard gene name, to a semi-formatted list of gene name and functional information (e.g. "AIM2 || mitochondrial organization || etc. etc.").
  • The third column, GWEIGHT, is a per-row analog of the per-column EWEIGHT. Generally not used any more.
  • Missing data values can occur for a variety of reasons during microarray generation. They appear as blank cells, NA, null, and other markers, and in many cases should be filled in (imputed) using KNNImputer before analysis.
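
As a rough illustration of the layout described above, here is a minimal Python sketch of a PCL reader. The function name `read_pcl`, the set of missing-value markers, the assumption that exactly three leading columns (ID, NAME, GWEIGHT) are present, and the inline sample data are all illustrative assumptions, not part of any existing tool.

```python
import csv
import io
import math

# Markers treated as missing; this set is an assumption, extend as needed.
MISSING = {"", "NA", "NAN", "NULL"}

def read_pcl(handle):
    """Parse a PCL file into (condition names, {gene ID: (name, values)})."""
    rows = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(rows)
    conditions = header[3:]              # columns after ID, NAME, GWEIGHT
    genes = {}
    for row in rows:
        if not row or not row[0]:
            continue                     # skip blank lines
        if row[0].upper() == "EWEIGHT":  # optional second row
            continue
        # Pad short rows so trailing blank cells become missing values.
        row = row + [""] * (len(header) - len(row))
        values = [
            math.nan if cell.strip().upper() in MISSING else float(cell)
            for cell in row[3:]
        ]
        genes[row[0]] = (row[1], values)
    return conditions, genes

# Hypothetical two-gene, three-array file; all values are made up.
sample = (
    "ID\tNAME\tGWEIGHT\tcond1\tcond2\tcond3\n"
    "EWEIGHT\t\t\t1\t1\t1\n"
    "Gene_A\tGene_A\t1\t0.5\t\t-1.2\n"
    "Gene_B\tGene_B\t1\tNA\t0.1\t0.3\n"
)
conditions, genes = read_pcl(io.StringIO(sample))
print(conditions)
print(genes["Gene_A"])
```

Missing cells come back as NaN rather than being dropped, so that every gene keeps one value per condition and imputation can be done as a separate step.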

CDT, GTR, and ATR Files

Clustered Data Table files are similar to PCLs, but represent microarray data that's been clustered in some way (usually hierarchically). CDTs differ from PCLs mainly in the first column, which is generally a re-identification of the genes in a cluster-specific manner (sensitive to ordering, etc.). This is often accompanied by a Gene TRee and/or Array TRee file indicating the hierarchies themselves.

SOFT Files

SOFT files are used by the GEO database to store microarray data. They're more or less similar to PCLs (that is, tab delimited text) with the addition of comments, headers, and so forth. The major tasks in transforming a SOFT file into a PCL are:

  • Moving the condition names from the SOFT-specific header to the appropriate tab-delimited header row.
  • Moving the array probe IDs from the SOFT-specific header (sometimes available in an accompanying platform description file) to the appropriate tab-delimited ID column.
  • Mapping multiple probes down to individual genes.
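
The last step can be done in many ways. The sketch below shows just one simple option, averaging all probes that map to the same gene; it is not necessarily what any particular tool does. The function name `collapse_probes`, the probe IDs, and the values are hypothetical, and missing values are not handled.

```python
from collections import defaultdict
from statistics import mean

def collapse_probes(probe_values, probe_to_gene):
    """Average per-condition values over all probes mapping to each gene."""
    grouped = defaultdict(list)
    for probe, values in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:             # drop probes with no known gene
            grouped[gene].append(values)
    # zip(*rows) walks the conditions (columns) across a gene's probes.
    return {
        gene: [mean(column) for column in zip(*rows)]
        for gene, rows in grouped.items()
    }

# Hypothetical probe IDs and values, for illustration only.
probe_values = {
    "probe_1": [1.0, 2.0],
    "probe_2": [3.0, 4.0],
    "probe_3": [0.5, 0.5],
    "probe_4": [9.0, 9.0],
}
probe_to_gene = {"probe_1": "Gene_A", "probe_2": "Gene_A", "probe_3": "Gene_B"}
collapsed = collapse_probes(probe_values, probe_to_gene)
print(collapsed)
```

Here `probe_4` has no gene mapping and is silently discarded; whether to drop or keep such probes is a choice that depends on the analysis.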

(X)DSL Files

The GeNIe and SMILE tools for Bayes net manipulation store graphical models in the Decision Systems Lab format (DSL) and an XML version of this structure (XDSL). These files are best edited with GeNIe, a GUI program for Bayes net construction, and the SMILE library is able to perform inference and so forth on the resulting models.

Internal Formats

Most of these formats are used with the lab's Bayesian integration tools.

DAT Files

Most of our data and functional relationships are analyzed in terms of gene pairs. This results in a fundamental DATa file format containing three tab delimited columns:

Gene_A Gene_B Number
Gene_A Gene_C Number
Gene_B Gene_C Number
...
  • "Gene_A" through "Gene_C" are unique gene identifiers as previously discussed.
  • "Number" is any numerical score appropriate to the data. This might be 1 or 0 to indicate binding/nonbinding in an experiment, 1 or 0 to indicate relatedness/unrelatedness in a gold standard, a continuous value between -1 and 1 to indicate correlation in a microarray, and so forth.
  • The gene identifiers are unordered (i.e. A B is equivalent to B A).
  • Gene pairs must be distinct; that is, a gene cannot be paired with itself. This means that A A, B B, and so forth are illegal.
  • Each gene pair should appear at most once per DAT file. Gene pairs need not appear, so in a genome containing genes A through C, a DAT file might contain anywhere from one to three lines.
  • Gene pairs do not have to be ordered, although they usually are in these examples for visual clarity.

A DAT file is thus equivalent to a symmetric matrix or a half matrix in which the diagonal (and potentially additional values) is missing. The DAT file:

A	B	0
A	C	1
B	D	-1
C	D	0

is equivalent to the half matrix:

    A   B   C   D
A       0   1   -
B           -  -1
C               0
D
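
The rules above can be sketched in Python as follows. This is a minimal illustration, not the lab's implementation: the function names `read_dat` and `half_matrix` are hypothetical, and storing each pair as a `frozenset` is just one convenient way to make A B equivalent to B A.

```python
import io

def read_dat(handle):
    """Read a DAT file into {frozenset({gene1, gene2}): score}."""
    scores = {}
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue
        gene_a, gene_b, number = line.split("\t")
        if gene_a == gene_b:
            raise ValueError(f"gene paired with itself: {gene_a}")
        pair = frozenset((gene_a, gene_b))   # unordered: A B equals B A
        if pair in scores:
            raise ValueError(f"duplicate pair: {gene_a} {gene_b}")
        scores[pair] = float(number)
    return scores

def half_matrix(scores):
    """Return (sorted genes, rows); row i holds scores against genes i+1 onward."""
    genes = sorted(set().union(*scores))
    rows = [
        [scores.get(frozenset((a, b))) for b in genes[i + 1:]]
        for i, a in enumerate(genes)
    ]
    return genes, rows

# The four-line DAT file from the example above.
dat = io.StringIO("A\tB\t0\nA\tC\t1\nB\tD\t-1\nC\tD\t0\n")
genes, rows = half_matrix(read_dat(dat))
for gene, row in zip(genes, rows):
    print(gene, ["-" if value is None else value for value in row])
```

Pairs absent from the file come back as `None`, which plays the role of the "-" entries in the half matrix.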

DAB Files

Since DATs are text based and repeat information ("YAL012W" might be spelled out 6000 times in the yeast genome!), they can be big and slow. To solve this problem, we use DAta Binary files, which contain a single list of gene identifiers paired with 32-bit floating point values. They contain exactly the same information as DAT files, and in most cases they are much smaller and can be processed much more quickly. On disk, the format is:

  • A four-byte unsigned word indicating the total number of genes in the DAB (N).
  • N null-terminated Unicode strings (two bytes per character) indicating the gene identifiers.
  • N*(N-1)/2 four-byte floating point values representing the scores for each gene pair.
  • Missing values are stored as a floating point NaN or infinity.
  • The order of the gene pairs is: (1 2), (1 3), ..., (1 N), (2 3), (2 4), ..., (2 N), (3 4), ..., (N-1 N). This means that exactly one value is present for each pair, and the structure is analogous (although not identical) to the half-matrix shown above.
    • Another way of visualizing the order of floating point values on disk is:

      1,2	1,3	1,4	...	1,N
      2,3	2,4	...	2,N
      3,4	...	3,N
      ...
      N-2,N-1	N-2,N
      N-1,N

      Note that there is thus no ''N''th "row", and the ''i''th row always has N-''i'' "columns" (or N-''i''-1 if you start from zero). Of course, there are no delimiters on disk, so they're not really rows or columns, but that's what they act like conceptually.
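
To make the layout concrete, here is a minimal Python sketch that writes and reads this structure and computes where a given pair lives in the list of values. Two things are assumptions not stated above: little-endian byte order and UTF-16LE as the two-byte "Unicode" encoding. The function names are hypothetical.

```python
import io
import math
import struct

def pair_index(i, j, n):
    """Zero-based position of the pair (i, j) among the on-disk values."""
    if i == j:
        raise ValueError("a gene cannot be paired with itself")
    if i > j:
        i, j = j, i
    # Rows 0..i-1 hold (n-1) + (n-2) + ... entries, then offset within row i.
    return i * n - i * (i + 1) // 2 + (j - i - 1)

def write_dab(handle, genes, values):
    """Write the gene count, the gene names, then N*(N-1)/2 floats."""
    n = len(genes)
    assert len(values) == n * (n - 1) // 2
    handle.write(struct.pack("<I", n))            # assumed little-endian
    for gene in genes:
        handle.write(gene.encode("utf-16-le") + b"\x00\x00")
    handle.write(struct.pack(f"<{len(values)}f", *values))

def read_dab(handle):
    """Inverse of write_dab; returns (genes, flat list of values)."""
    (n,) = struct.unpack("<I", handle.read(4))
    genes = []
    for _ in range(n):
        chars = bytearray()
        while (unit := handle.read(2)) not in (b"\x00\x00", b""):
            chars += unit
        genes.append(chars.decode("utf-16-le"))
    count = n * (n - 1) // 2
    values = list(struct.unpack(f"<{count}f", handle.read(4 * count)))
    return genes, values

# Same data as the DAT example: (A,D) and (B,C) are missing, stored as NaN.
nan = math.nan
buffer = io.BytesIO()
write_dab(buffer, ["A", "B", "C", "D"], [0.0, 1.0, nan, nan, -1.0, 0.0])
buffer.seek(0)
genes, values = read_dab(buffer)
score_bd = values[pair_index(genes.index("B"), genes.index("D"), len(genes))]
print(genes, score_bd)
```

The closed-form `pair_index` is what makes random access cheap: no scan is needed to find the score for an arbitrary pair.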

QUANT Files

In general, Bayes nets deal with discretized data - that is, a finite number of bins, not a continuous range of values. It's useful to always keep the raw data around (e.g. in DAT or DAB files), but we generally want to think of the data as binned in some way when we're performing Bayesian analysis. To do this, each DAT or DAB file is often paired with a QUANTization file. This is a single line of tab-delimited text indicating the '''bin edges''' for discretization of the gene pair values. For example, suppose we have the following DAT:

A	B	0.2
A	C	0.9
B	C	0.6

If this is blah.dat, we can pair it with a file blah.quant containing:

0.3	0.6	0.9

These two files together are equivalent to the binning:

A	B	0
A	C	2
B	C	1

Note that:

  • The number of columns in the QUANT is equal to the number of bins in the resulting discretization.
  • The last bin edge is ignored. This is something of a historical artifact, but it ensures that all possible values are quantizable (i.e. there's no upper limit on input values) and that the number of columns equals the number of bins.
  • Each bin edge (except the last) represents an inclusive upper bound. That is, given a value, it falls into the first bin where it's less than or equal to the edge. In interval notation, this means the QUANT above is equivalent to (-infinity, 0.3], (0.3, 0.6], (0.6, infinity).
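
The binning rule can be written in a few lines of Python. This is a minimal sketch; the function name `quantize` is hypothetical, and `bisect_left` is used because it returns the index of the first edge that is greater than or equal to the value, which matches the inclusive upper bound described above.

```python
from bisect import bisect_left

def quantize(value, edges):
    """Return the zero-based bin for value, given the QUANT bin edges."""
    # Drop the last edge: the final bin has no upper limit.
    return bisect_left(edges[:-1], value)

edges = [0.3, 0.6, 0.9]                      # contents of blah.quant
scores = {("A", "B"): 0.2, ("A", "C"): 0.9, ("B", "C"): 0.6}
binned = {pair: quantize(value, edges) for pair, value in scores.items()}
print(binned)
```

Values above the last edge, such as 5.0, still land in the last bin, since that edge is never consulted.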

DAD Files

Since we're often interested in putting lots of pre-quantized datasets together, a DAta Dataset is a single file efficiently incorporating a whole collection of DAT/QUANT pairs (or equivalently, DAB/QUANT pairs). Rather than using 32-bit floating point values to store numbers, DADs use exactly the minimum number of bits necessary to store the discretized value (or a missing value marker).

Confused? Consider the DAT/QUANT pair above. If we encode those three pairs as a DAB, their values will take up 3*32 bits = 12 bytes. However, the QUANT discretizes those floating point values down to one of four possibilities: 0, 1, 2, or "missing" (although we don't have any missing values in this particular example). This means that we really only ''need'' two bits, not 32. Thus, the values could be stored in only 3*2 bits = 3/4 of a byte. That's a big difference!
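
The same arithmetic in Python, as a small sketch. The helper name `bits_per_value` is hypothetical, and the calculation covers only the pair values themselves, ignoring headers and any padding.

```python
def bits_per_value(bins):
    """Smallest number of bits that can hold `bins` values plus a missing marker."""
    codes = bins + 1                     # one extra code for "missing"
    return (codes - 1).bit_length()      # equals ceil(log2(codes)) for codes >= 2

pairs = 3                                # A-B, A-C, B-C
bins = 3                                 # from the three-column QUANT
dab_bits = pairs * 32
dad_bits = pairs * bits_per_value(bins)
print(dab_bits / 8, "bytes as 32-bit floats")
print(dad_bits / 8, "bytes as discretized values")
```

Adding a fourth bin would push the count to five codes and therefore three bits per value, so the savings depend on how finely the data is binned.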

DADs are stored on disk as:

  • Four-byte boolean value, zero indicating discretized data and anything else indicating continuous. Continuous DADs are rarely useful, so this should almost always be zero.
  • Four-byte unsigned integer D, the original number of datasets. This will only differ from the actual number of datasets when used with a Bayes net with hidden nodes, which rarely happens.
  • D four-byte unsigned integers, each identifying the actual dataset in that slot. When no hidden nodes are present, the ''i''th integer will always be ''i''. A value of -1 (0xFFFFFFFF) indicates a hidden node/missing dataset.
  • Four-byte unsigned integer G, the number of genes.
  • Four-byte unsigned integer, the total number of characters plus null terminators in all gene identifiers.
  • G null-terminated Unicode gene identifiers, two bytes per character.
  • Four-byte unsigned integer N, the number of datasets.
  • N one-byte unsigned integers, indicating the number of discrete bins in each dataset.
  • Four-byte unsigned integer N, the number of datasets (it's duplicated to allow for different formats of DAD files).
  • N compact half matrices, each of the form:
    • Four-byte unsigned integer X, the number of elements (genes) in the matrix.
    • One-byte unsigned integer C, the number of bits occupied by each matrix entry.
    • Four-byte unsigned integer, the total number of bytes occupied by the matrix's data elements (should be X*(X-1)/2*C/8).
    • A data matrix in DAB format, save that each element is a C-bit unsigned integer. The matrix is padded to a multiple of 64 bits to make memory mapping easy.

DAC and DAS Files

These are some rarely-used variations on DATs and DABs. A DAC is essentially one dataset's worth of a DAD, i.e. DAta Compressed; it's stored in a pre-quantized compact half matrix. A DAS is a sparse DAB encoded using a list format rather than a matrix format, although it's only actually smaller for very sparse matrices. You'll almost certainly never see either of these.