3.2 Data

The length of the sequences can be given in one of two ways. It can be listed on the first line of the file, as the second number on that line. (The first number is conventionally the number of species; cold does not use this number, but assumes that it is present, for compatibility with other packages which do require it.) Alternatively, the length of each sequence can be listed on the line with that sequence. It is possible to input sequences of different lengths, aligned at the start of the first sequence. However, cold assumes that all sequences are at least as long as the first sequence, so in this case the first species in the tree must be the one with the shortest sequence.
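The length rule can be checked before running the program. The following Python sketch is not part of cold: the function name `shorter_than_first` and the list-of-pairs input are assumptions made for illustration. It reports the species whose sequences are shorter than the first one.

```python
def shorter_than_first(sequences):
    """Return the names of sequences shorter than the first sequence.

    `sequences` is a list of (name, sequence) pairs, with the first
    species in the tree listed first (a hypothetical layout chosen
    for this sketch).
    """
    if not sequences:
        return []
    first_length = len(sequences[0][1])
    # Every later sequence must be at least as long as the first.
    return [name for name, seq in sequences[1:] if len(seq) < first_length]


if __name__ == "__main__":
    data = [("a", "ACG"), ("b", "ACGT"), ("c", "AC")]
    print(shorter_than_first(data))
```

An empty result means the input satisfies the assumption; any names returned identify sequences that would have to be lengthened, or the species order changed so that the shortest sequence comes first.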

The data should be listed as nucleotide sequences. Each line should consist of the name of the species, followed by one or more spaces, then the length of the sequence (if necessary), then the sequence. Any character other than ‘A’, ‘C’, ‘G’, ‘T’ or a space is treated as an unknown nucleotide. This means that error checking is limited: for instance, no error would be produced by the sequence ‘ACTGTGTTTCSC’. The program would simply interpret the ‘S’ as an unknown nucleotide, although a typo is the more likely explanation.
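As an illustration of the line format described above, here is a minimal Python sketch of a parser for one data line. It is not cold's own code. It assumes that species names contain no spaces, that the optional length is a separate decimal number, and that spaces inside the sequence are skipped; the name `parse_line` and the use of ‘N’ to mark an unknown nucleotide are choices made for this example only.

```python
KNOWN = set("ACGT")


def parse_line(line):
    """Split one data line into (name, declared_length, sequence).

    declared_length is None when the line carries no length field.
    """
    parts = line.split()
    if len(parts) < 2:
        raise ValueError("expected a species name followed by a sequence")
    name, rest = parts[0], parts[1:]
    declared = None
    if rest[0].isdecimal():
        # The optional per-line sequence length.
        declared = int(rest[0])
        rest = rest[1:]
    # Joining the remaining fields drops spaces inside the sequence;
    # any character outside ACGT is marked as unknown ('N' in this sketch).
    sequence = "".join(c if c in KNOWN else "N" for c in "".join(rest))
    return name, declared, sequence


if __name__ == "__main__":
    print(parse_line("human 12 ACTGTGTTTCSC"))
```

Under these assumptions, the ‘S’ in the example sequence is silently turned into an unknown nucleotide, which mirrors the limited error checking described above. A stricter tool could instead report such characters to the user.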

[Currently there is no way to specify partial information about a nucleotide. This feature may be added in later versions.]

Each sequence should be on a separate line. The program looks up the data by the species names in the tree, so it will work as long as the file contains the sequences for all species in the tree; it doesn't matter if the file contains extra data. The program will give an error if it can't find the data for one of the species in the tree.
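The lookup behaviour described above can be sketched in Python as follows. This is an illustration under stated assumptions, not cold's implementation: the function name `select_sequences`, the dictionary result, and the use of `KeyError` for a missing species are all hypothetical.

```python
def select_sequences(tree_species, data_lines):
    """Return {species: rest_of_line} for every species named in the tree.

    Lines for species that are not in the tree are ignored. A species
    with no matching line raises KeyError.
    """
    table = {}
    for line in data_lines:
        fields = line.split(None, 1)
        if len(fields) == 2:
            # First field is the name; the rest is kept unparsed.
            table[fields[0]] = fields[1].strip()
    missing = [name for name in tree_species if name not in table]
    if missing:
        raise KeyError("no data for species: " + ", ".join(missing))
    return {name: table[name] for name in tree_species}


if __name__ == "__main__":
    lines = ["3 4", "human ACGT", "chimp ACGA", "mouse ACGG", "rat ACCC"]
    print(select_sequences(["human", "mouse"], lines))
```

In this sketch the header line and the lines for unused species are simply stored and never requested, so extra data in the file is harmless.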

See the file testdata for examples.