Codon Optimal Likelihood Discoverer


Running Cold

cold calculates and maximises log-likelihoods for user-defined codon models in phylogeny.

This manual is for version 1.0.0 of cold.

cold has a number of command line options as detailed below. A number of the options read input from files. The required formats of these files are described in files.html. The options have been divided into the following categories:

  • model options, which control the model that the program fits to the data;
  • data options, which control the data and tree on which the program fits the model;
  • starting options that control the initial estimate - these should not affect the final answer but could speed up the program, or avoid local maxima;
  • output options, which control what the program outputs;
  • running options, which control various aspects of how the program runs;
  • debugging and status options, which control what the program prints on the screen while running (status updates, debugging information, etc.);
  • technical options, which control the internal workings of the program and are mostly useful for debugging (only the -T option is likely to be regularly useful for ordinary users).

    Model Options:

  • --model modelname
  • loads the arguments from the chosen model, which must be described in the model file. The model file models is provided with the package, and contains details for various standard models. If these are not sufficient, the user can supply their own model file for describing their own models.

  • --modelfile filename
  • instructs the program to use the specified model file instead of the default one.

  • --numpars number
  • tells the program to read only the first number parameter matrices (thereby selecting a submodel of the model described in the parameter file).

  • --nomask
  • tells the program to ignore the mask matrix (which determines which codon changes are possible in a single step - e.g. only single nucleotide change, ...)

  • -p filename
  • Parameter file - the program loads the parameters from the given text file. The files parametermatrices and standardmodelmatrices are provided with the package. Alternatively, the user can write their own parameter files.

  • --parameterselection list
  • allows a selected subset of the parameters to be used. The list is a comma-separated sequence of parameter numbers and ranges, for example 1,3-7,9-11,36. If the list contains spaces, it will probably need to be enclosed in quotes.
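
The list format can be illustrated with a small parser. This is only a sketch of how such a list might be expanded, not cold's own parsing code; the function name parse_selection and its handling of spaces and empty items are assumptions.

```python
def parse_selection(text):
    """Expand a list such as '1,3-7,9-11,36' into sorted parameter numbers."""
    selected = set()
    for item in text.split(","):
        item = item.strip()          # tolerate spaces around items
        if not item:
            continue                 # skip empty items such as a trailing comma
        if "-" in item:
            low, high = (int(part) for part in item.split("-", 1))
            if low > high:
                raise ValueError(f"descending range: {item!r}")
            selected.update(range(low, high + 1))
        else:
            selected.add(int(item))
    return sorted(selected)
```

For example, parse_selection("1,3-7,9-11,36") expands to the ten parameter numbers 1, 3, 4, 5, 6, 7, 9, 10, 11 and 36.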

  • --usematrix filename
  • specifies that parameters don't vary, and that the specified transition matrix should be used.

  • --justbl
  • instructs the program to fix the parameter values, and only optimise the branch lengths.

  • --mask number
  • selects a mask to use from the maskfile. (This allows the user to specify no multiple-nucleotide changes for example). The input is the number of the chosen mask within the file. The default mask file masks supplied with the package only contains one mask at present, so this value should always be 1 unless using a user-supplied mask file.
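
To illustrate what a mask expresses, the following sketch builds a 61 x 61 matrix that allows only single-nucleotide changes between sense codons. It does not reproduce the format of the masks file (see files.html for that); the codon ordering, the use of the standard genetic code and all names here are assumptions made for the example.

```python
from itertools import product

BASES = "TCAG"
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (an assumption)

# The 61 sense codons, in TCAG order (the ordering is an assumption).
CODONS = [c for c in ("".join(p) for p in product(BASES, repeat=3))
          if c not in STOP_CODONS]


def differences(a, b):
    """Number of nucleotide positions at which two codons differ."""
    return sum(x != y for x, y in zip(a, b))


# MASK[i][j] is 1 when codon i can change to codon j in a single step,
# i.e. when the two codons differ at exactly one nucleotide position.
MASK = [[1 if differences(a, b) == 1 else 0 for b in CODONS] for a in CODONS]
```

With this mask a change such as AAA to AAC is allowed, while AAA to CCC is not.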

  • --maskfile filename
  • instructs the program to use the file specified when reading masks, instead of the default mask file.

  • --mixture number
  • Tells the program to use a mixed model, and indicates the number of separate distributions to be mixed. Parameters can be further controlled by the option --mixfile or --mixstring.

  • --mixfile filename
  • Defines the parameters in a mixed model (some parameters may be estimated for each component of the mixture; others may be the same for all components; while others may be fixed in certain components). See files.html for details of the format of this file.

  • --mixstring string
  • Like mixfile, except that the information is input on the command line as a string instead of being read from a file. The format is the same as for the mixture file.

  • --empirical style
  • Instructs the program to estimate the Pi parameters empirically in the manner described. There are currently four options:

    F61 Empirically estimates all codon frequencies.
    F3x4 Empirically estimates nucleotide frequencies in each position within the codon.
    F1x4 Empirically estimates nucleotide frequencies.
    Fequal Sets all codon frequencies to 1/61.
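
The four styles can be sketched as follows. This uses the conventional definitions of these estimators (products of nucleotide frequencies, renormalised over the 61 sense codons); whether cold computes them in exactly this way is an assumption, as are the function names and the use of the standard genetic code.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
STOPS = {"TAA", "TAG", "TGA"}  # standard genetic code (an assumption)
CODONS = [c for c in ("".join(p) for p in product(BASES, repeat=3))
          if c not in STOPS]


def normalise(weights):
    """Scale a mapping of non-negative weights so that it sums to 1."""
    total = sum(weights.values())
    return {key: value / total for key, value in weights.items()}


def codon_frequencies(sequences, style):
    """Return a dict codon -> frequency for style F61, F3x4, F1x4 or Fequal."""
    codons = [seq[i:i + 3] for seq in sequences
              for i in range(0, len(seq) - 2, 3)]
    codons = [c for c in codons if c in CODONS]  # drop stops, gaps, ambiguities
    if style == "Fequal":
        return {c: 1 / 61 for c in CODONS}
    if style == "F61":
        counts = Counter(codons)
        return normalise({c: counts[c] for c in CODONS})
    if style == "F3x4":
        # One nucleotide distribution per codon position.
        position = [normalise(Counter(c[k] for c in codons)) for k in range(3)]
    elif style == "F1x4":
        # One nucleotide distribution shared by all three positions.
        shared = normalise(Counter(base for c in codons for base in c))
        position = [shared] * 3
    else:
        raise ValueError(f"unknown style: {style}")
    # Product of nucleotide frequencies, renormalised over the sense codons.
    return normalise({c: position[0].get(c[0], 0.0)
                      * position[1].get(c[1], 0.0)
                      * position[2].get(c[2], 0.0) for c in CODONS})
```

F61 needs 60 free frequencies, F3x4 needs 9 and F1x4 needs 3, so the simpler styles are less sensitive to sampling noise in short alignments.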

  • --fixedprobs values
  • Instructs the program to use the given values for mixing probabilities, rather than estimating them. This can be useful for discrete approximations to fixed distributions.
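
As an illustration of how mixing probabilities enter the likelihood, the sketch below combines per-component site log-likelihoods using the standard finite-mixture formula, log(sum over k of p_k * L_k). The manual does not state cold's internal formula, so this is an assumption, and the function name is hypothetical.

```python
import math


def mixture_site_loglik(component_logliks, probs):
    """log( sum_k probs[k] * exp(component_logliks[k]) ), computed stably."""
    if len(component_logliks) != len(probs):
        raise ValueError("need one mixing probability per component")
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("mixing probabilities must sum to 1")
    # Work in log space; components with zero probability contribute nothing.
    terms = [math.log(p) + loglik
             for p, loglik in zip(probs, component_logliks) if p > 0]
    biggest = max(terms)
    # Log-sum-exp: factor out the largest term to avoid underflow.
    return biggest + math.log(sum(math.exp(t - biggest) for t in terms))
```

A discrete approximation to a fixed distribution would pass fixed values for probs (for example equal weights for equal-probability categories) and let only the component parameters vary.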

  • --setfixedpars values
  • Sets the values of fixed parameters.

    Data Options:

  • -t filename
  • Tree file - the program loads the tree from the given text file.

  • --treenumber number
  • Selects a tree from a file containing multiple trees.

  • -d filename
  • Data file - the program loads the sequence data from the given text file.

  • -f basefilename
  • family - both the tree and sequence file have the same name, but a different file extension. The program attempts to locate both of them, by testing the known extensions. [The possibilities are set at compile time from fileExtensions.h. Hopefully later versions of cold will allow the chosen file extensions to be set at runtime. Also, hopefully, later versions will have improved detection routines to help find the right files. For full details on the file searching algorithms, see this note.]

    Starting Options:

  • --initpars filename
  • gives a file from which the initial set of parameters are read. If this option is not given, the initial parameters are estimated from an initial matrix.

  • -m filename
  • Matrix file - the program will base its initial estimate of the parameters on the Q matrix read from this file. It chooses the parameters that best approximate the Q matrix loaded from this file, as its starting value.

  • --variables filename
  • reads additional command-line arguments from the specified file. By default the file .variables is read for extra command-line arguments. Editing this file can be used to change the default values of variables or, in environments without a command line, to supply the command-line arguments.

  • --noparsimony
  • By default the program uses a parsimony method to estimate the initial branch lengths. This command tells the program to use the branch lengths indicated on the tree.

    Output Options:

  • --output filename
  • Saves the final output (hopefully the parameters that give maximum likelihood) to the file named filename. Without this option, it outputs this information to standard output.

  • --printsitelikes
  • causes the program to print the likelihood of the data at each site, provided the Newton-Raphson method converges (so that the maximum likelihood is found). If the optimisation does not converge, this option does nothing.

  • --tstats
  • causes the program to print t-statistics for all variables at the MLE (assuming it converges). These can be used to decide which variables are more likely to be important.

  • --observedinformation
  • Causes the observed information matrix to be printed out in the final output. This matrix can be used to estimate asymptotic covariance.
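
A sketch of how the observed information matrix might be used: under the usual asymptotic theory, the inverse of the observed information approximates the covariance matrix of the estimates, and each t-statistic is an estimate divided by its standard error. Whether cold computes its own t-statistics in exactly this way is an assumption, and the function names are hypothetical.

```python
import math


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("information matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def wald_summary(estimates, information, z=1.96):
    """Per parameter: (estimate, standard error, t-statistic, approx. 95% interval)."""
    covariance = invert(information)   # asymptotic covariance estimate
    rows = []
    for i, estimate in enumerate(estimates):
        se = math.sqrt(covariance[i][i])
        rows.append((estimate, se, estimate / se,
                     (estimate - z * se, estimate + z * se)))
    return rows
```

The value z=1.96 gives approximate 95% intervals under a normal approximation; it is a conventional choice, not something taken from cold.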

  • --blstdev
  • Outputs the estimated standard deviations of the branch lengths. These can be used to obtain confidence intervals for the branch length estimates.

    Running Options:

  • -n number
  • Number of iterations - this is the maximum number of iterations of Newton-Raphson to try while seeking to maximise likelihood. If the algorithm hasn't converged by the end of n iterations, it reports the current position. Unless the quiet flag is set, the output should be enough to indicate whether continuing would be likely to lead to convergence (i.e. if the value of n had been set too small).

  • --path searchpath
  • sets the searchpath for all other files. This is where cold searches for files. This allows the user to store necessary data files in non-standard locations. The default searchpath is stored in the .variables file, and so can be changed. [Later versions may allow this option to add to the search path, rather than replacing it.]

  • --simulate [number,]length[:seed]
  • Simulates data: starting from the root of the tree, it simulates random evolution according to the model specified (using the other options, same as for maximising likelihood). The first number is the number of files to produce, and the second is the length of the sequences in each file (in nucleotides). This length should therefore be a multiple of 3 [there is currently no error checking for this]. If given, the seed is used to seed the random number generator. This makes it possible to reproduce the results. [The simulation uses the default random number generator provided by the compiler. This may not be good enough for some purposes.]
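
Because cold does not currently check that the length is a multiple of 3, it can be worth validating the argument before running. The sketch below parses the [number,]length[:seed] form; it is not cold's own parser, and it assumes the seed is a non-negative integer, which the manual does not state.

```python
import re

# Optional "number," prefix, mandatory length, optional ":seed" suffix.
SIMULATE_PATTERN = re.compile(r"^(?:(\d+),)?(\d+)(?::(\d+))?$")


def parse_simulate(argument):
    """Split '[number,]length[:seed]' into (number, length, seed).

    number and seed are returned as None when they are omitted.
    """
    match = SIMULATE_PATTERN.match(argument.strip())
    if match is None:
        raise ValueError(f"expected [number,]length[:seed], got {argument!r}")
    number, length, seed = (int(g) if g is not None else None
                            for g in match.groups())
    if length % 3 != 0:
        raise ValueError(f"length {length} is not a multiple of 3")
    return number, length, seed
```

For instance, "10,300:42" asks for 10 files of 300 nucleotides (100 codons) each, with seed 42.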

  • --variableselect
  • Automatically runs a crude variable selection algorithm. It checks the t-statistics after estimating the parameters. Any that are not significant are removed, and a new model is fitted on the remaining variables. If this model is significantly worse, the previous model is output. Otherwise, it looks at the t-statistics in the new model, and tries removing ones that are not significant.

    This variable selection is rather crude, and it would probably be better to do the variable selection manually for individual data sets. This option is useful when selecting variables for many simulated data sets.
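
    The procedure can be sketched as a backward-elimination loop. This is an illustration under stated assumptions, not cold's implementation: the manual does not say how "significant" is defined, so the threshold |t| < 1.96 and the 5% likelihood-ratio test used here are assumptions, and fit is a hypothetical callback standing in for a full model fit.

```python
# 5% critical values of the chi-square distribution, indexed by degrees of freedom.
CHI2_5PCT = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070,
             6: 12.592, 7: 14.067, 8: 15.507, 9: 16.919, 10: 18.307}


def select_variables(fit, variables, t_threshold=1.96):
    """Backward elimination driven by t-statistics and a likelihood-ratio check.

    fit(variables) is a hypothetical callback returning
    (log_likelihood, {variable: t_statistic}).
    """
    current = list(variables)
    loglik, tstats = fit(current)
    while True:
        weak = [v for v in current if abs(tstats[v]) < t_threshold]
        if not weak:
            return current                  # everything left is significant
        if len(weak) not in CHI2_5PCT:
            raise ValueError("too many variables dropped for the built-in table")
        reduced = [v for v in current if v not in weak]
        new_loglik, new_tstats = fit(reduced)
        # Likelihood-ratio statistic for dropping len(weak) variables at once.
        statistic = 2.0 * (loglik - new_loglik)
        if statistic > CHI2_5PCT[len(weak)]:
            return current                  # reduced model is significantly worse
        current, loglik, tstats = reduced, new_loglik, new_tstats
```

    Each pass removes at least one variable or stops, so the loop always terminates.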

  • --state filename
  • gives the name of the file to save the current state to if the program is interrupted, or has certain kinds of error. The file output is plain text, so it can be read by the user in the case of error, to help with debugging. The default filename is statefile.

  • --recover filename
  • tells the program to attempt to recover its last state from a file, which should have been written using the --state option on a previous run. This saves repeating a lot of calculation.

  • --noautobackup
  • By default, the program automatically backs up its current position after every Newton-Raphson iteration, allowing some recovery in the event of an unstoppable interruption. This option disables automatic backups.

    Debugging and Status Options:

  • -i level
  • Interactivity - sets the level of interactivity for the program. Options are as follows:

    -1 LOGFILE send all messages to a log file.
    0 AWAY display messages, make default choices, don't ask for user input.
    1 VITAL only request user input when absolutely necessary.
    2 PRESENT request user input whenever it may be helpful to do so.
    3 ALL request user input on all irregularities, even when no input would usually be expected.
    START [Not yet implemented] Perform all file loading and parsing operations at the start of execution, and prompt for user input at this time. Do not prompt for user input once optimisation has started.
  • -D level
  • Debug level - controls the amount of debugging information output. The following options are available:

    -1 SILENCE No Debugging or even status updates displayed.
    0 NODEBUG Appropriate for most users. This provides basic status updates indicating the progress of the program. This should be sufficient to indicate whether the program is progressing normally. If a problem occurs, the program can be rerun with a higher debugging level to identify the problem.
    1 BASICDEBUG Gives basic debugging output. May be helpful if something goes wrong. Might identify file format problems or similar things that may be fixed easily by the user.
    2 FULLDEBUG Mostly useful only for developing the package, or for users who want to extend the package with their own code. If you want to send a bug report, please include the output from the command run with -D2.

    Numbers greater than 2 default to 2, so running with -D3 is also OK, and will allow for any later extensions to the debugging options. See the table at the bottom of the page for details of what output is available.

  • -q
  • [Not yet implemented] quiet - does not print any information about the status of the program - just outputs the desired information at the end. This can be achieved with -D -1.

  • -v
  • [Not yet implemented] verbose - provides a lot of information about the current state of the program. Useful for debugging. This can be achieved with -D 2.

  • --showeverysite number
  • controls the status display. The option should be a number. When calculating a hessian, the program will print information about which site it is currently working on. This option controls how often the program prints this message.

    Technical Options (these should only be used if understood; many are useful only for debugging or for working around any bugs that might be present):

  • -P number
  • [Not yet implemented] precision - controls how close to zero numbers have to be before being treated as zero. Too small a precision can lead to unstable results.

  • -b number
  • [Not yet implemented] buffer size - controls sizes of internal buffers for loading data from files. There should be little reason to change this, but could be useful if there seem to be problems with saved data being corrupted.

  • -T number
  • Number of threads. In order to make more efficient use of multiple processors, computations for different sites are performed by different threads, which can be run simultaneously on different processors. The number of threads should usually be equal to the number of 'cores' on the machine (or twice the number if hyperthreading is possible). This should be set as the default when compiling. However, if other processes are also running on the machine, it may be better to use fewer threads. Single threaded mode is less likely to have bugs (though the program has been well tested for threading bugs and seems OK). [Note that for large models or trees, each thread can take up a lot of memory, so it may be faster to run fewer threads than there are processors available in this case.] For more information on choosing the number of threads, see this note.

  • -o operation
  • Operation - the default operation, 1, is to maximise the likelihood. Setting this option to 0 makes the program attempt to minimise the likelihood instead.

  • --testderivs
  • For debugging. Tests whether the hessian and derivatives calculated by the program are correct (by comparing to values approximated by comparing two nearby points).

  • -e number
  • For use with the --testderivs option. This sets the distance between points for testing derivatives. The default is 1e-6. If the derivatives are small relative to the values, this can cause rounding errors to make the test inaccurate. In such a case, setting a larger value can improve the testing.
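
The kind of check that --testderivs performs can be sketched as below: an analytic gradient is compared with a finite-difference approximation built from nearby points a distance eps apart. The use of central differences, the relative-error measure and the tolerance are assumptions; the manual only says that two nearby points are compared.

```python
def central_difference_gradient(f, point, eps=1e-6):
    """Approximate the gradient of f at point from pairs of nearby points."""
    gradient = []
    for i in range(len(point)):
        up = list(point)
        up[i] += eps
        down = list(point)
        down[i] -= eps
        gradient.append((f(up) - f(down)) / (2.0 * eps))
    return gradient


def check_gradient(f, analytic_gradient, point, eps=1e-6, tolerance=1e-4):
    """Return (ok, worst relative error) for analytic versus numerical gradients."""
    numeric = central_difference_gradient(f, point, eps)
    analytic = analytic_gradient(point)
    worst = max(abs(a - n) / max(1.0, abs(a), abs(n))
                for a, n in zip(analytic, numeric))
    return worst <= tolerance, worst
```

The trade-off described above is visible here: the difference f(up) - f(down) loses precision to rounding when eps is very small relative to the size of the function values, which is why a larger eps can give a more reliable test.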

    Program Output

    The program output can be divided into two sorts - status and debugging information during execution; and final output.

    Final Output

  • If the --variableselect option is used, the first line of the output indicates which parameters have been selected.
  • The next line (the first line of the final output when --variableselect is not used) states whether the Newton-Raphson method converged. If it converged, then the output is likely to be the maximum likelihood. If not, then the program did not find the maximum likelihood [but can be told to continue, using the state file]. In either case, the program then outputs either the MLE, or the point it has currently reached in its search.
  • The next line of output is the inferred branch lengths in Newick format, but without the terminating semicolon.
  • If the --blstdev option is used, the next output is another tree in Newick format, with the estimated standard deviations for the estimated branch lengths. These can be used to give confidence intervals for branch lengths, but are not very reliable [often they are too large to be of much use].
  • Next is a table, listing for each parameter the coefficient in the cold model (...), and the exponential of the coefficient (which is the parameter in some models).
  • If the --printsitelikes option is used, the next output is a table of the likelihoods for each site.
  • If the --tstats option is used, the next output is a table of the t-statistics for each variable.
  • If the --observedinformation option is given, then the final output is the observed information matrix. This can be used to give confidence intervals for parameter estimates.
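
Since the branch-length tree is printed without the terminating semicolon, other Newick readers may reject it as it stands. A minimal sketch of completing the line before passing it on; the function name is hypothetical and the tree shown in the note below is an invented example, not cold output.

```python
def complete_newick(tree_line):
    """Append the terminating semicolon that standard Newick readers expect."""
    tree = tree_line.strip()
    return tree if tree.endswith(";") else tree + ";"
```

For example, the line ((A:0.1,B:0.2):0.05,C:0.3) becomes ((A:0.1,B:0.2):0.05,C:0.3); and a line that already ends in a semicolon is returned unchanged.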

    Status and Debugging Information

    Minimum Debug Level | Statement | Meaning
    0 | Started step number n | Currently working on the nth iteration.
    0 | Value | Log-likelihood for the current parameters.
    0 | Expected Improvement | The amount by which the Newton-Raphson method predicts that the next point will be better than the current one (assuming that the step size is not changed by other factors). A positive value suggests the current point is near a local maximum; a negative value suggests it is near a local minimum.
    0 | Step size | The modulus of the difference between the current parameters and the next parameters.
    2 | Producing n threads for computation | The site likelihoods are computed separately. Computing them in different threads can improve speed. However, it increases memory usage, and there is more danger of bugs.
    2 | Angle between steepest and actual ascent | This is a normalised dot product rather than an angle. It gives the dot product between the direction suggested by Newton-Raphson and the direction of steepest ascent. Generally, values close to 1 or -1 indicate circular contours, which should lead to good convergence. Values close to 0 indicate elongated contours, which can indicate ill-conditioned parameters. Positive values indicate the program is converging towards a maximum; negative values indicate it is moving away from a minimum.
    2 | Modulus | The sum of squares of all current parameter values. This gives an idea of the large-scale movement of the algorithm.
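
    The quantities in this table can be illustrated on a two-parameter problem. The sketch below uses the textbook Newton-Raphson formulas (step = -H^{-1} g, expected improvement from the local quadratic model); that cold computes its reported values in exactly this way is an assumption, and the function name is hypothetical.

```python
import math


def newton_diagnostics(gradient, hessian):
    """Diagnostics for one Newton-Raphson step on a two-parameter problem."""
    (a, b), (c, d) = hessian
    det = a * d - b * c
    if det == 0:
        raise ValueError("singular Hessian")
    g0, g1 = gradient
    # Newton step: delta = -H^{-1} g, using the explicit 2x2 inverse.
    delta = (-(d * g0 - b * g1) / det, -(-c * g0 + a * g1) / det)
    # Change predicted by the quadratic model: g.delta + 0.5 * delta' H delta.
    h_delta = (a * delta[0] + b * delta[1], c * delta[0] + d * delta[1])
    g_dot_delta = g0 * delta[0] + g1 * delta[1]
    expected = g_dot_delta + 0.5 * (delta[0] * h_delta[0] + delta[1] * h_delta[1])
    step_size = math.hypot(*delta)
    # Normalised dot product between the gradient and the Newton direction.
    cosine = g_dot_delta / (math.hypot(g0, g1) * step_size)
    return {"step": delta, "expected_improvement": expected,
            "step_size": step_size, "cosine": cosine}
```

    With a negative-definite Hessian (near a maximum) the expected improvement and the normalised dot product are both positive; with a positive-definite Hessian (near a minimum) both are negative, matching the signs described in the table.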
