Codon Optimal Likelihood Discoverer


Data Files for Cold

Cold requires several input files. Their formats and uses are described below, together with the example files included in the package. The example files can be found in either the "Data" or the "Models" directory, depending on the use of the file.

Trees:

Generally Newick formatted trees should work. The semicolon at the end is optional. Multiple trees can be given in a single file, each on a separate line; no more than one tree should be listed on a single line. It is not necessary to include blank lines between different trees, but it will improve readability.

Clades should be placed in brackets, with the length of the branch joining the clade to the rest of the tree listed after the clade, separated by a colon. The species names can be anything not containing the characters '(', ',', ')', ':', or newline. However, the space character ' ' should be avoided, because the data file format does not allow spaces in species names, so the sequence data for such a species cannot be read. An example of a valid tree is:

(ABC100:5,((DEF_37:0.004,GHI:1):0.3,JKL32:1.003,(MNO23:0.6,(MNO32:0.12,PQR:0.33):0.7):0.56):0.4)

See the file testtrees for more examples.
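
As a rough illustration of the tree format described above, the following Python sketch pulls the species names out of a tree line. It is not part of Cold: the function names `read_trees` and `species_names` are made up for this example, and the parsing rule (a name starts after '(' or ',' and stops at the first reserved character) is an assumption based on the description above.

```python
import re


def read_trees(text: str) -> list[str]:
    """Return one tree string per non-blank line (one tree per line)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def species_names(tree: str) -> list[str]:
    """Return the leaf names of one tree, in order of appearance."""
    text = tree.strip().rstrip(";")  # the trailing semicolon is optional
    if text.count("(") != text.count(")"):
        raise ValueError("unbalanced brackets in tree")
    # A leaf name follows '(' or ',' and contains none of '(', ',', ')', ':'.
    return re.findall(r"(?<=[(,])[^(),:\s]+", text)


example = ("(ABC100:5,((DEF_37:0.004,GHI:1):0.3,JKL32:1.003,"
           "(MNO23:0.6,(MNO32:0.12,PQR:0.33):0.7):0.56):0.4)")
print(species_names(example))
```

A list of names like this is what would have to be matched against the species names in the data file.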

Data:

The length of the sequences can be given in one of two ways. It can be listed on the first line of the file, as the second number on that line. (The first number is usually the number of species. Cold does not use this number, but assumes that it is present, for compatibility with other packages which do require it.) Alternatively, the length of each sequence can be listed on the line with that sequence. It is possible to input sequences of different lengths, aligned at the start of the first sequence. However, cold assumes that all sequences are at least as long as the first sequence, so in this case the first species in the tree must be the one with the shortest sequence.

The data should be listed as nucleotide sequences. Each line should consist of the name of the species, followed by a space (or multiple spaces), then the length of the sequence (if necessary), then the sequence. [Note that this format does not allow spaces in species names. It is possible that such a feature may be incorporated into later versions.] Any character other than ACGT or a space is treated as an unknown nucleotide. (This means there is limited error checking, for instance no error would be produced by the sequence ACTGTGTTTCSC - the program would simply interpret the S as an unknown nucleotide, instead of a typo, which is a more likely explanation.)

[Currently there is no way to specify partial information about a nucleotide. This feature may be added in later versions.]

Each sequence should be on a separate line. The program can search the tree for species names, so, as long as the sequences for all species in the tree are contained in the file, it will work - it doesn't matter if the file contains extra data. The program will give an error if it can't find the data for any of the species in the tree.

See the file testdata for examples.
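
The following Python sketch shows one way the data format above could be read. It is an illustration only, not Cold's own reader. The function name `parse_sequences` is hypothetical, and several details are assumptions: a first line is treated as a header only if it consists of exactly two integers, a numeric second field on a sequence line is taken as that sequence's length, and unknown nucleotides are written as '?'.

```python
def parse_sequences(text: str) -> dict[str, str]:
    """Map each species name to its sequence, with unknowns shown as '?'."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return {}
    default_length = None
    first = lines[0].split()
    if len(first) == 2 and all(tok.isdigit() for tok in first):
        default_length = int(first[1])  # second number: sequence length
        lines = lines[1:]
    sequences = {}
    for line in lines:
        name, *rest = line.split()
        length = default_length
        if rest and rest[0].isdigit():  # per-line length, if present
            length = int(rest[0])
            rest = rest[1:]
        raw = "".join(rest)  # spaces inside the sequence are dropped
        if length is not None:
            raw = raw[:length]
        # Anything other than A, C, G or T is an unknown nucleotide.
        sequences[name] = "".join(c if c in "ACGT" else "?" for c in raw)
    return sequences


data = "2 12\nhuman ACTGTGTTTCSC\nmouse ACTG TGTT TCAC\n"
print(parse_sequences(data))
```

Note how the S in the first sequence is silently turned into an unknown nucleotide, as described above.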

Matrices:

Matrices should be listed in the usual format, one row per line. Entries should be separated by the space character; multiple spaces are OK. Lines are separated by newline characters. The size of the matrix is determined by the length of the first line, so it is important that the first line is the right length. The lengths of the other lines do not matter (though the file will be easier to read if all lines are the right length).

Separate matrices in the file do not [at least, should not - I haven't done a lot of testing of obscure input formats] need to be separated by newlines, but doing so will improve readability of the file.
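
A minimal Python sketch of a reader for this layout is given here. It is not Cold's code, and it rests on assumptions that the text above does not confirm: the matrices are taken to be square, the entries are read as one continuous stream (which is one reading of "other lines don't matter"), and the file has no leading count line. The function name `read_matrices` is hypothetical.

```python
def read_matrices(text: str) -> list[list[list[float]]]:
    """Split a stream of entries into square matrices."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows:
        return []
    size = len(rows[0])  # the first line fixes the matrix size
    values = [float(tok) for row in rows for tok in row]
    per_matrix = size * size
    if len(values) % per_matrix:
        raise ValueError("entry count is not a multiple of size*size")
    matrices = []
    for start in range(0, len(values), per_matrix):
        block = values[start:start + per_matrix]
        matrices.append([block[r * size:(r + 1) * size] for r in range(size)])
    return matrices


print(read_matrices("1 2\n3 4\n5 6\n7 8\n"))
```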

The following files contain matrices:

  • ECMq.txt
  • masks
  • parametermatrices
  • standardmodelmatrices

For the first two, the first line can be used to indicate the number of matrices in the file (and also whether any matrix should be used as a mask). The format of the first line is

number [mask] number

where the numbers are the number of matrices before and after the mask matrix.

InitialParameters:

The initial parameters file contains a list of the Pi values, followed by initial values for all the parameters that are to be estimated.

See the file M0_example_initpars for an example.

Variables:

The variable file consists of lines, each of which specifies a command line option. The lines have the following format:

name commandlineoption type value

The name doesn't really matter for most options. The commandlineoption is what you would type at the commandline to invoke this option. The type is the type of the argument (see the following list for the required values). The value is the argument for the option.

The following options can be invoked from a file. The type value should be as given in this list.

    --model               string
    --modelfile           string
    --numpars             string
    --nomask
    --parameterselection  string
    --usematrix           string
    --justbl              string
    --mask                string
    --maskfile            string
    --mixture             string
    --mixfile             string
    --mixstring           string
    --empirical           string
    --fixedprobs          string
    --setfixedpars        string
    --treenumber          string
    --initpars            string
    --variables           string
    --noparsimony
    --path                path
    -q
    -i                    string
    -D                    string
    --showeverysite       int
    --printsitelikes
    --hessian
    --state               string
    --recover             string
    --noautobackup
    -P                    string
    -b                    string
    -T                    string
    -v
    --testderivs

For an example, see the file .variables. This file is automatically read before any other arguments. This is important because the path variable needs to be set before other options are processed.
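
To make the line format concrete, here is a small Python sketch that turns variable-file lines into an equivalent list of command line arguments. It is only an illustration: the function name `variables_to_argv` is made up, and the handling of options without an argument (lines with fewer than four fields) is an assumption, since the layout of such lines is not spelled out above.

```python
def variables_to_argv(text: str) -> list[str]:
    """Translate 'name commandlineoption type value' lines into arguments."""
    argv = []
    for line in text.splitlines():
        # Split into at most four fields so the value may contain spaces.
        fields = line.strip().split(None, 3)
        if len(fields) < 2:
            continue  # blank or incomplete line
        argv.append(fields[1])  # the command line option itself
        if len(fields) == 4:
            argv.append(fields[3])  # the option's argument
    return argv


example = "searchpath --path path Models:Data\nquiet -q\n"
print(variables_to_argv(example))
```

The name field is ignored here, in line with the remark that it doesn't really matter for most options.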

Mixture:

Each row represents a parameter for the components. The row starts with the number of the parameter (starting with 0, which is the rate). It then has a list of classes, separated by colons if the parameters must have equal values, and by commas if they can have separate values. The following shorthand is available:

  • A dot between two values indicates a comma-separated list of all values between them that have not already been listed.
  • A dash between two values indicates a colon-separated list in the same way.
  • If there is no first value, zero is assumed. If there is no second value, the number of components minus one is assumed (so that all later components are included).
  • If the parameter number in a row is a dot, all rows between the one above and the one below that do not already have patterns follow this pattern (or the pattern of the row above, if there is no pattern).
  • If the parameter number is a |, the groups in the rows above and below are merged.

Rows can be separated with a semicolon instead of an end of line. This is useful for inputting a very simple mixture as a string, rather than writing a separate file.

[Some aspects of this don't work quite as described yet. This shouldn't be a problem for usual mixture files, only if the mixture file is specified in a strange way. If you list the rows and entries in increasing order, there should be no problems. Things like 5 7-5,3.6,0:2;|;.;2, however, are not guaranteed to work correctly.]

For examples, see the file mixture. See also the mixture strings in the file models.
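
The Python sketch below expands the simpler parts of this notation into explicit groups of classes, where classes in the same group share a parameter value. It is an interpretation of the description above, not Cold's parser. The names `expand_item`, `parse_row` and `parse_mixture` are hypothetical, rows whose parameter number is a dot or a | are not handled, and skipping already-listed classes in dash ranges as well as dot ranges is an assumption.

```python
import re


def expand_item(item: str, n_components: int, seen: set[int]) -> list[list[int]]:
    """Expand one comma-delimited item into groups of tied classes."""
    for sep, tied in (("-", True), (".", False)):
        if sep in item:
            low, high = item.split(sep, 1)
            start = int(low) if low else 0  # missing first value: zero
            stop = int(high) if high else n_components - 1  # missing second value
            members = [c for c in range(start, stop + 1) if c not in seen]
            return [members] if tied else [[c] for c in members]
    return [[int(c) for c in item.split(":")]]  # explicit colon-tied classes


def parse_row(row: str, n_components: int) -> tuple[int, list[list[int]]]:
    """Return (parameter number, groups) for one row."""
    number, pattern = row.split(None, 1)
    groups, seen = [], set()
    for item in pattern.split(","):
        item = item.strip()
        if not item:
            continue
        for group in expand_item(item, n_components, seen):
            if group:
                groups.append(group)
                seen.update(group)
    return int(number), groups


def parse_mixture(text: str, n_components: int) -> dict[int, list[list[int]]]:
    """Parse rows separated by newlines or semicolons."""
    rows = [r.strip() for r in re.split(r"[;\n]", text) if r.strip()]
    return dict(parse_row(r, n_components) for r in rows)


print(parse_mixture("0 -;1 0:1,2", 3))
```

In this example, the rate (parameter 0) is tied across all three components, while parameter 1 is shared by components 0 and 1 and free for component 2.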

Models:

The modelfile consists of a number of models. Each model consists of the model name, then any parameters as a comma-separated list between parentheses, e.g. (a,b,c), then the model specification enclosed between braces { and }. [Currently, if there are no parameters, there needs to be a space between the model name and the opening brace. Hopefully this bug will be sorted out soon.]

The model specification consists of a collection of assignments of the form

FIELDNAME=VALUE

then a closing brace. The parameters are substituted directly into the VALUE part when they occur (separated by spaces). For example, if there is a parameter called a, whose value is testparameters, then a line of the form

PARAMETERS=myparamfile. a

would set the parameter file to

myparamfile. testparameters

Note the space produced in the middle of the filename. This is probably not what you want. The easiest way to avoid it is by separating the text from the parameter with a pair of quotation marks. For example, the line above would be rewritten:

PARAMETERS=myparamfile.""a

Currently, arithmetic expressions are not available in the modelfile. I plan to add support for them in a later version. They will start with the # character. If you want to avoid a character having its normal meaning, you can either precede it with a backslash, or place the text in quotation marks.

Note that the \ character has a special meaning: it removes any special significance from the following character. For example, if your file name has quotation marks in it, you can use \" to get them. This means that if your filename has a \ character in it (which is only likely to be a problem for Windows users) you need to use \\.
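
The substitution and quoting rules above can be sketched in Python as follows. This is a guess at the behaviour from the description, not the actual implementation: the function name `substitute` is hypothetical, and the exact word-splitting rule (words end at spaces, quotation marks and escaped characters) is an assumption.

```python
def substitute(value: str, params: dict[str, str]) -> str:
    """Replace parameter names in VALUE, honouring quotes and backslashes."""
    out, word = [], []

    def flush() -> None:
        # Emit the current unquoted word, substituted if it is a parameter.
        text = "".join(word)
        word.clear()
        out.append(params.get(text, text))

    i, quoted = 0, False
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value):
            flush()
            out.append(value[i + 1])  # escaped character is taken literally
            i += 2
            continue
        if ch == '"':
            flush()
            quoted = not quoted  # quotation marks themselves are dropped
        elif quoted:
            out.append(ch)  # quoted text is never substituted
        elif ch == " ":
            flush()
            out.append(ch)
        else:
            word.append(ch)
        i += 1
    flush()
    return "".join(out)


params = {"a": "testparameters"}
print(substitute("myparamfile. a", params))
print(substitute('myparamfile.""a', params))
```

The empty pair of quotation marks ends the word myparamfile. without adding a space, so the parameter a is recognised and joined directly onto it.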

The modelfile simply substitutes the values of the fields into commandline arguments. It can therefore do anything that the corresponding commandline options can do. (And conversely, it can only do what the commandline options can do.)

The following aspects of the model can be set in the model file. Each is given with the equivalent commandline option, which can be looked up in cold.info if more information is required.

    MIXTURE             --mixture             the number of sets of parameters in the mixture model.
    MIXFILE             --mixfile             a file that describes how the parameters vary (see above).
    MIXSTRING           --mixstring           like mixfile, but gives the mixture as a string, rather than loading it from a separate file.
    PARAMETERS          -p                    the file from which parameters are read.
    PARAMETERSELECTION  --parameterselection  the parameters from this file to be used in the model.
    INITIALMATRIX       -m                    the matrix used to choose the initial parameters.
    NUMPARS             --numpars             the number of parameters to be read from the parameter file.
    MASKFILE            --maskfile            the file containing the mask to be used.
    MASK                --mask                the mask to be used (if more than one is available).
    USEMATRIX           --usematrix           use a constant matrix for the Q matrix.
    INITIALPARS         --initpars            the initial values for parameters.
    JUSTBRANCHLENGTHS   --justbl              only optimise the branch lengths. [Doesn't yet work with mixed models]
    EMPIRICAL           --empirical           sets empirical pi values.
    FIXEDPROBS          --fixedprobs          fixes mixing probabilities as the given values.
    FIXEDPARS           --setfixedpars        sets the values of fixed parameters.

See the file models for examples.
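
Since each field simply stands for a commandline option, the correspondence can be written as a lookup table. The Python sketch below uses the field and option names listed above; the function name `model_to_argv` and the idea of passing the assignments in as a dictionary are illustrative assumptions, not part of Cold.

```python
FIELD_TO_OPTION = {
    "MIXTURE": "--mixture",
    "MIXFILE": "--mixfile",
    "MIXSTRING": "--mixstring",
    "PARAMETERS": "-p",
    "PARAMETERSELECTION": "--parameterselection",
    "INITIALMATRIX": "-m",
    "NUMPARS": "--numpars",
    "MASKFILE": "--maskfile",
    "MASK": "--mask",
    "USEMATRIX": "--usematrix",
    "INITIALPARS": "--initpars",
    "JUSTBRANCHLENGTHS": "--justbl",
    "EMPIRICAL": "--empirical",
    "FIXEDPROBS": "--fixedprobs",
    "FIXEDPARS": "--setfixedpars",
}


def model_to_argv(assignments: dict[str, str]) -> list[str]:
    """Turn FIELDNAME=VALUE assignments into option/argument pairs."""
    argv = []
    for field, value in assignments.items():
        if field not in FIELD_TO_OPTION:
            raise KeyError(f"unknown model field: {field}")
        argv += [FIELD_TO_OPTION[field], value]
    return argv


print(model_to_argv({"MIXTURE": "2", "PARAMETERS": "myparamfile"}))
```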

State file

This file should not be edited by the user. The only reason to understand its layout is for debugging in the event of an error. The statefile is generated either in the event of a fatal error, or as a regular backup for long runs. Its purpose is to allow the program to resume running from the point at which it was interrupted. This is done using the --recover option.

The statefile is, however, in a human-readable format, so it can be read in an attempt to determine the cause of the fatal error. Its format is as follows:

  • Line 1 is a text string identifying the current stage of program execution. The options are:
    newx        The program has just calculated the next set of parameters and branch lengths.
    hessian     The program was in the middle of calculating the hessian.
    endhessian  The program had just finished calculating the derivatives, but had not yet sorted the values out into a matrix and applied the Newton-Raphson method.
  • Line 2 gives abbreviations for the types of variables stored in the file. It will normally be of the form LiLiLiLiL. This line is to allow the same code to be usable for other programs which might save different types of variables.
  • Line 3 is the current set of parameter values. The parameters are orthonormalised at the start of the program, to improve convergence. Therefore, this line will not give a direct insight into the parameter values. [You could find out the actual parameter values by running with the options --recover and -n0.]
  • The remaining lines come in pairs, the first of which contains the number of the site on which the program was working when it was interrupted, and the second contains a list of cumulative values for the log likelihood and its derivatives, on the sites already calculated. There should be one pair for each thread of the program. The current cumulative log-likelihood and its derivatives can be obtained by summing the relevant values for all these pairs. The sites that remain to be done are the site numbers listed in the file, and any other numbers higher than all those site numbers.
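
For debugging, the summation over threads described in the last item could be done with a short script such as the Python sketch below. It assumes that the values on each line are separated by whitespace and that each site line holds a single integer; neither detail is confirmed above. The function name `summarise_state` is made up, and the sample text is invented data, not output from COLD.

```python
def summarise_state(text: str) -> tuple[str, list[int], list[float]]:
    """Return the stage, the per-thread site numbers and the summed values."""
    lines = text.splitlines()
    stage = lines[0].strip()  # newx, hessian or endhessian
    # lines[1] holds the type codes and lines[2] the orthonormalised
    # parameters; neither is needed for the totals.
    rest = [line for line in lines[3:] if line.strip()]
    sites, totals = [], []
    for site_line, values_line in zip(rest[0::2], rest[1::2]):
        sites.append(int(site_line))
        values = [float(v) for v in values_line.split()]
        if not totals:
            totals = values
        else:
            totals = [t + v for t, v in zip(totals, values)]
    return stage, sites, totals


sample = "newx\nLiLiLiLiL\n0.1 0.2\n10\n-5.0 1.0\n12\n-3.0 0.5\n"
print(summarise_state(sample))
```

The first entry of the summed list would then be the cumulative log likelihood, and the remaining entries its derivatives.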

File searching

COLD searches in a number of places for files. There are three things that affect where it searches:

  • The current directory (i.e. the directory from which the program was executed).
  • The main search directory [controlled by the ROOTSEARCHDIR variable, which is set at compile time (installation) by setting either the ROOTSEARCHDIR or mainsearchdir variable when running make and make install]. This is the directory into which the default files are installed, so this is where COLD will search for model files, parameter files, etc.
  • The --path commandline option. This is automatically set from the .variables file (which can be configured at installation, or manually after installation). It can be changed by the commandline options, or another variable file. The default path includes directories Models and Data. These are necessary for finding several default files, so if you want to modify the path, you should probably include these directories.

The file searching algorithm searches in all of the following places, in this order:

  • The current directory (i.e. the directory from which the program was executed).
  • The main search directory. By default this is the subdirectory cold of your home directory. It is the directory in which COLD installed its default files.
  • Subdirectories of the current directory indicated by the search path (input from the command line using the --path option, or from some variable file).
  • Subdirectories of the main search directory indicated by the search path (input from the command line using the --path option, or from some variable file).

Actually, the last two are interleaved: COLD goes through each directory in the search path, and searches in that subdirectory of the current directory, then that subdirectory of the main search directory.

COLD opens the first match for the file in question, so for example, by putting a file in the current directory, you can effectively override the file in a later search directory.
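
The search order described above, including the interleaving, can be sketched in Python like this. It illustrates the ordering only and is not COLD's code; the function names `search_directories` and `find_file` and the directory names in the example are hypothetical.

```python
from pathlib import Path
from typing import Optional


def search_directories(current: str, main: str, search_path: list[str]) -> list[Path]:
    """List the directories to search, in the order described above."""
    dirs = [Path(current), Path(main)]
    for entry in search_path:
        # Interleaved: the subdirectory of the current directory first,
        # then the same subdirectory of the main search directory.
        dirs.append(Path(current) / entry)
        dirs.append(Path(main) / entry)
    return dirs


def find_file(name: str, current: str, main: str, search_path: list[str]) -> Optional[Path]:
    """Return the first existing match, or None if there is none."""
    for directory in search_directories(current, main, search_path):
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None


for d in search_directories("/work", "/home/user/cold", ["Models", "Data"]):
    print(d)
```

Because the first match wins, a file placed in the current directory hides any file of the same name further down this list.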

Changing the search path

Why change the search path?

It might be that you have all your data stored in a particular directory. In such a case, it would probably be a good idea to add that directory (and any subdirectories which you need to search) to the search path. Similarly, if you are studying your own models, you could put them all in one directory, and add that directory to the search path. [You should probably make these directories absolute, so that COLD can find the files wherever you run it from.]

How can I change the search path?

There are three ways:

  • Modify the default variable file .variables
  • Create a new variable file, add it to the default .variables file, and add the new search path to it.
  • Use the --path command line option.

The first option will change the search path every time you run the program. The second will also change the search path every time you run the program, but the options in your personal variable file can easily be all added, or all removed, together. So if you have a set of options that you use for some program runs, but not others, you could put these in a variable file and use the --variables commandline option whenever you want to use those options. The third option will set the search path just for the current run.

How should I format the search path?

The search path just consists of a list of directories, separated by the colon character :. If the directory name contains a colon, you can precede the colon with a backslash \. This means that in order to have a backslash in your directory name, you need to precede it by a backslash as well (so it would be \\). [Note that if you specify the path variable on the command line, your shell may give special significance to certain characters, which may therefore need to be escaped (often by preceding them with a backslash, or by putting them in quotation marks). For example, on linux, if you want to specify a search path on the command line, and one of the directory names contains a colon, you would have to precede that colon with a backslash, but the shell would remove the backslash, so you would need to precede it by another backslash.] Directory names containing newline characters are not supported, and probably never will be.
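
The escaping rule for the search path can be illustrated with the following Python sketch, which splits a path string on unescaped colons. The function name `split_search_path` is hypothetical, and the treatment of a lone backslash at the very end of the string (kept as it is) is an assumption.

```python
def split_search_path(path: str) -> list[str]:
    """Split on ':' while treating a backslash as an escape character."""
    dirs, current = [], []
    i = 0
    while i < len(path):
        ch = path[i]
        if ch == "\\" and i + 1 < len(path):
            current.append(path[i + 1])  # escaped character, taken literally
            i += 2
            continue
        if ch == ":":
            dirs.append("".join(current))  # an unescaped colon ends a directory
            current = []
        else:
            current.append(ch)
        i += 1
    dirs.append("".join(current))
    return dirs


print(split_search_path("Models:Data"))
print(split_search_path(r"odd\:name:C\\data"))
```

The second call shows a directory name containing a colon and another containing a backslash, both written with the escapes described above.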

Even if I'm only using files in the search directories for some runs, couldn't I just leave all the interesting directories in the search path for every run?

This is probably OK. There are a few issues.

  • This can make the program slower, since it now has to search many more directories. This is unlikely to be significant.
  • There is a danger, if the program is searching a lot of directories, that it will find a different file from the one you mean, especially if the names are not very descriptive.

The -f option

When using the -f option, the search algorithm works slightly differently. It still searches directories in the order above. For each directory, it first searches for the file as typed. Then it searches for the file with each of the allowed file extensions. [The defaults are .tree, .tree.1.txt, .tree.txt, and .txt for tree files, and .seq, .nuc.txt, .dat, and .txt for sequence files. Currently, these can only be changed by modifying the fileExtensions.h file before compiling the program. I expect that later versions of COLD will allow it to be set more easily.]

It then compiles a list of all the possibilities, in the order listed above. Once it has the lists for data and tree files, it considers using the first possible tree file as a tree file, and checks whether this would allow it to find a data file. If it would, then it opens this tree file and opens the first remaining data file. Otherwise, it tries the next tree file, which should allow it to find a data file. If not, it will keep trying (but in this case, it should not manage to find a tree and data file).

[Note that the program does not make any checks on the contents of the files. It simply chooses the file names. If it tries to open a data file as a tree file or vice-versa, it will almost certainly abort with an error message. Hopefully, later versions of COLD will have more sophisticated methods for checking file type, and will be able to resolve more cases like this.]
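
As an illustration of the first step only (building the list of names to try), here is a short Python sketch using the default extensions quoted above. The function names `candidate_names` and `candidate_paths` are made up, and the later step of pairing a tree file with a data file is not attempted here.

```python
from pathlib import Path

TREE_EXTENSIONS = [".tree", ".tree.1.txt", ".tree.txt", ".txt"]
SEQUENCE_EXTENSIONS = [".seq", ".nuc.txt", ".dat", ".txt"]


def candidate_names(name: str, extensions: list[str]) -> list[str]:
    """The name as typed, then the name with each allowed extension."""
    return [name] + [name + ext for ext in extensions]


def candidate_paths(name: str, directories: list[str], extensions: list[str]) -> list[Path]:
    """All possibilities, directory by directory, in search order."""
    return [Path(d) / c for d in directories for c in candidate_names(name, extensions)]


print(candidate_names("primates", TREE_EXTENSIONS))
print(candidate_names("primates", SEQUENCE_EXTENSIONS))
```

Note that primates and primates.txt appear in both lists, which is exactly the kind of ambiguity between tree and data files that the pairing step has to resolve.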

