COLD

cold calculates and maximises log-likelihoods for user-defined codon models in phylogeny.

cold has a number of command line options as detailed below. A number of the options read input from files. The required formats of these files are described in files.html. The options have been divided into the following categories:

Model options:

--model modelname

loads the arguments from the chosen model, which must be described in the model file. The model file models is provided with the package, and contains details for various standard models. If these are not sufficient, the user can supply their own model file for describing their own models.

--modelfile filename

instructs the program to use the specified model file instead of the default one.

--numpars number

tells the program to read only the first number parameter matrices (thereby selecting a submodel of the model described in the parameter file).

--nomask

tells the program to ignore the mask matrix, which determines which codon changes are possible in a single step (e.g. only single-nucleotide changes).
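
As an illustration of what such a mask encodes, the following Python sketch builds a 0/1 matrix that allows only single-nucleotide changes between sense codons. It is not cold's code: the codon ordering, the use of the standard genetic code, and the names differs_at_one_position and single_step_mask are assumptions made for this example, and the real mask file format is the one described in files.html.

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)
SENSE_CODONS = [c for c in ("".join(p) for p in product("TCAG", repeat=3))
                if c not in STOP_CODONS]


def differs_at_one_position(a, b):
    """True when codons a and b differ at exactly one nucleotide."""
    return sum(x != y for x, y in zip(a, b)) == 1


def single_step_mask(codons=SENSE_CODONS):
    """Return a 0/1 matrix with 1 where a change is allowed in one step."""
    return [[1 if differs_at_one_position(a, b) else 0 for b in codons]
            for a in codons]


mask = single_step_mask()
```

With --nomask, the equivalent matrix would allow every off-diagonal change.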

-p filename

Parameter file - the program loads the parameters from the given text file. The files parametermatrices and standardmodelmatrices are provided with the package. Alternatively, users can write their own parameter files.

--parameterselection list

allows a selected subset of the parameters to be used. The list is a comma-separated list of numbers and ranges, for example 1,3-7,9-11,36. If the list contains spaces, it will probably need to be enclosed in quotes.
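
The list format can be illustrated with a short Python sketch that expands such a string into individual parameter numbers. This is only a reading of the format described above, not cold's own parser; the function name parse_selection and its error handling are assumptions.

```python
def parse_selection(text):
    """Expand a list such as '1,3-7,9-11,36' into sorted parameter numbers."""
    selected = set()
    for item in text.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate stray commas
        if "-" in item:
            first, last = (int(part) for part in item.split("-", 1))
            if first > last:
                raise ValueError(f"empty range: {item!r}")
            selected.update(range(first, last + 1))
        else:
            selected.add(int(item))
    return sorted(selected)


print(parse_selection("1,3-7,9-11,36"))
```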

--usematrix filename

specifies that parameters don't vary, and that the specified transition matrix should be used.

--justbl

instructs the program to fix the parameter values, and only optimise the branch lengths.

--mask number

selects a mask to use from the maskfile. (This allows the user to specify no multiple-nucleotide changes for example). The input is the number of the chosen mask within the file. The default mask file masks supplied with the package only contains one mask at present, so this value should always be 1 unless using a user-supplied mask file.

--maskfile filename

instructs the program to use the file specified when reading masks, instead of the default mask file.

--mixture number

Tells the program to use a mixed model, and indicates the number of separate distributions to be mixed. Parameters can be further controlled by the options --mixfile or --mixstring.

--mixfile filename

Defines the parameters in a mixed model (some parameters may be estimated for each component of the mixture, others may be the same for all components, and others may be fixed in certain components). See files.html for details of the format of this file.

--mixstring string

Like mixfile, except that the information is input on the command line as a string instead of being read from a file. The format is the same as for the mixture file.

--empirical style

Instructs the program to estimate the Pi parameters empirically in the manner described. There are currently four options:

F61	Empirically estimates all codon frequencies.
F3x4	Empirically estimates nucleotide frequencies in each position within the codon.
F1x4	Empirically estimates nucleotide frequencies.
Fequal	Sets all codon frequencies to 1/61.
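
The four styles can be sketched in Python as follows. This is an illustration under stated assumptions rather than cold's implementation: it assumes the standard genetic code (61 sense codons), skips any codon containing gaps or ambiguity codes, and for F1x4 and F3x4 renormalises the product of nucleotide frequencies over the sense codons, which is a common convention but is not confirmed here for cold. The function name codon_frequencies is hypothetical.

```python
from collections import Counter
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)
SENSE_CODONS = [c for c in ("".join(p) for p in product("TCAG", repeat=3))
                if c not in STOP_CODONS]


def codon_frequencies(sequences, style):
    """Estimate the 61 codon frequencies from aligned coding sequences."""
    if style == "Fequal":
        return {c: 1.0 / len(SENSE_CODONS) for c in SENSE_CODONS}
    # Cut the sequences into codons; gaps, ambiguity codes and stops are skipped.
    codons = [seq[i:i + 3].upper()
              for seq in sequences for i in range(0, len(seq) - 2, 3)]
    codons = [c for c in codons if c in SENSE_CODONS]
    if style == "F61":
        counts = Counter(codons)
        return {c: counts[c] / len(codons) for c in SENSE_CODONS}
    if style == "F1x4":
        pooled = Counter("".join(codons))
        per_position = [pooled, pooled, pooled]
    elif style == "F3x4":
        per_position = [Counter(c[k] for c in codons) for k in range(3)]
    else:
        raise ValueError(f"unknown style: {style!r}")
    # Product of nucleotide frequencies, renormalised over the sense codons.
    raw = {}
    for c in SENSE_CODONS:
        value = 1.0
        for k in range(3):
            value *= per_position[k][c[k]] / sum(per_position[k].values())
        raw[c] = value
    total = sum(raw.values())
    return {c: v / total for c, v in raw.items()}
```

Note that F61 assigns frequency zero to any codon that does not occur in the data, which is one reason the lower-dimensional styles are often preferred for short alignments.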

--fixedprobs values

Instructs the program to use the given values for mixing probabilities, rather than estimating them. This can be useful for discrete approximations to fixed distributions.

--setfixedpars values

Sets the values of fixed parameters.

Data Options:

-t filename

Tree file - the program loads the tree from the given text file.

--treenumber number

Selects a tree from a file containing multiple trees.

-d filename

Data file - the program loads the sequence data from the given text file.

-f basefilename

Family - the tree file and the sequence file share the same base name, but have different file extensions. The program attempts to locate both of them by testing the known extensions. [The possibilities are set at compile time from fileExtensions.h. Hopefully later versions of cold will allow the chosen file extensions to be set at runtime. Also, hopefully, later versions will have improved detection routines to help find the right files. For full details on the file searching algorithms, see this note.]
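
The general idea of locating both files from one base name can be sketched in Python as below. The extension lists here are purely hypothetical placeholders (the real ones are whatever fileExtensions.h defined at compile time), and the function names are likewise invented for this example; the sketch also ignores the search path.

```python
from pathlib import Path

# Hypothetical extension lists; cold's real ones come from fileExtensions.h.
TREE_EXTENSIONS = (".tree", ".tre", ".nwk")
DATA_EXTENSIONS = (".seq", ".phy", ".dat")


def find_with_extension(base, extensions):
    """Return the first existing file named base + extension, or None."""
    for ext in extensions:
        candidate = Path(str(base) + ext)
        if candidate.is_file():
            return candidate
    return None


def locate_family(base):
    """Find the tree file and the data file that share the base name."""
    tree = find_with_extension(base, TREE_EXTENSIONS)
    data = find_with_extension(base, DATA_EXTENSIONS)
    if tree is None or data is None:
        raise FileNotFoundError(f"no tree/data pair found for {base!r}")
    return tree, data
```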

Starting Options:

--initpars filename

gives a file from which the initial set of parameters is read. If this option is not given, the initial parameters are estimated from an initial matrix.

-m filename

Matrix file - the program bases its initial estimate of the parameters on the Q matrix read from this file. As its starting value, it chooses the parameters that best approximate that Q matrix.

--variables filename

reads additional command-line arguments from the specified file. By default the file .variables is read for extra command-line arguments. Editing this file can be used to change the default values of variables or, in environments without a command line, to supply the command-line arguments.

--noparsimony

By default the program uses a parsimony method to estimate the initial branch lengths. This command tells the program to use the branch lengths indicated on the tree.

Output Options

--output filename

Saves the final output (hopefully the parameters that give maximum likelihood) to the file named filename. Without this option, it outputs this information to standard output.

--printsitelikes

causes the program to print the likelihood of the data at each site, provided the Newton-Raphson method converges (so that the maximum likelihood is found). If the optimisation does not converge, this option does nothing.

--tstats

causes the program to print t-statistics for all variables at the MLE (assuming it converges). These can be used to decide which variables are more likely to be important.

--observedinformation

Causes the observed information matrix to be printed out in the final output. This matrix can be used to estimate asymptotic covariance.
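
As a sketch of how these outputs relate, the asymptotic covariance matrix is usually taken to be the inverse of the observed information matrix, the standard errors are the square roots of its diagonal, and a t-statistic is an estimate divided by its standard error. The Python below illustrates that textbook relationship; whether cold computes its t-statistics in exactly this way is an assumption, and all function names are hypothetical.

```python
import math


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def standard_errors(observed_information):
    """Square roots of the diagonal of the inverse information matrix."""
    covariance = invert(observed_information)
    return [math.sqrt(covariance[i][i]) for i in range(len(covariance))]


def t_statistics(estimates, observed_information):
    """Estimate divided by standard error, for each parameter."""
    errors = standard_errors(observed_information)
    return [est / se for est, se in zip(estimates, errors)]
```

An approximate 95% confidence interval for a parameter is then the estimate plus or minus 1.96 standard errors, under the usual asymptotic normality assumption.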

--blstdev

Outputs the estimated standard deviations of the branch lengths. These can be used to obtain confidence intervals for the branch length estimates.
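
A minimal Python sketch of turning one estimate and its standard deviation into an approximate interval is shown below. It assumes a normal approximation (z = 1.96 for roughly 95% coverage) and truncates the lower bound at zero, since branch lengths cannot be negative; both choices are assumptions of this example and not something cold does for you.

```python
def branch_length_interval(estimate, stdev, z=1.96):
    """Approximate confidence interval for a branch length, truncated at 0."""
    lower = max(0.0, estimate - z * stdev)
    upper = estimate + z * stdev
    return lower, upper


# Hypothetical values: a branch of length 0.10 with standard deviation 0.02.
print(branch_length_interval(0.10, 0.02))
```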

Running Options:

-n number

Number of iterations - this is the maximum number of iterations of Newton-Raphson to try while seeking to maximise likelihood. If the algorithm hasn't converged by the end of n iterations, it reports the current position. Unless the quiet flag is set, the output should be enough to indicate whether continuing would be likely to lead to convergence (i.e. if the value of n had been set too small).
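
The role of the iteration limit can be seen in a one-dimensional Python sketch of Newton-Raphson. cold optimises many parameters at once using a gradient and Hessian, so this is only a toy illustration; the function name newton_maximise, the tolerance, and the example function are all assumptions.

```python
def newton_maximise(gradient, hessian, x0, max_iterations, tolerance=1e-10):
    """One-dimensional Newton-Raphson; returns (x, converged, iterations)."""
    x = x0
    for iteration in range(1, max_iterations + 1):
        step = -gradient(x) / hessian(x)
        x += step
        if abs(step) < tolerance:
            return x, True, iteration
    # Iteration limit reached: report the current position instead.
    return x, False, max_iterations


# Toy example: f(x) = log(x) - x/3 has its maximum at x = 3.
grad = lambda x: 1.0 / x - 1.0 / 3.0
hess = lambda x: -1.0 / x ** 2
print(newton_maximise(grad, hess, 1.0, max_iterations=50))
print(newton_maximise(grad, hess, 1.0, max_iterations=3))
```

In the second call the limit is too small, so the function returns the point reached so far together with a flag saying that it did not converge.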

--path searchpath

sets the searchpath for all other files. This is where cold searches for files. This allows the user to store necessary data files in non-standard locations. The default searchpath is stored in the .variables file, and so can be changed. [Later versions may allow this option to add to the search path, rather than replacing it.]

--simulate [number,]length[:seed]

Simulates data: starting from the root of the tree, it simulates random evolution according to the model specified (using the other options, same as for maximising likelihood). The first number is the number of files to produce, and the second is the length of the sequences in each file (in nucleotides). This length should therefore be a multiple of 3 [there is currently no error checking for this]. If given, the seed is used to seed the random number generator. This makes it possible to reproduce the results. [The simulation uses the default random number generator provided by the compiler. This may not be good enough for some purposes.]
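
Because cold does not currently check the length itself, it can be worth validating the argument before a long run. The Python sketch below parses a string of the form [number,]length[:seed] and rejects lengths that are not a multiple of 3. It is a helper written for this page, not part of cold; the name parse_simulate and the default of one file when the number is omitted are assumptions.

```python
def parse_simulate(spec):
    """Parse '[number,]length[:seed]' into (number, length, seed)."""
    seed = None
    if ":" in spec:
        spec, seed_text = spec.split(":", 1)
        seed = int(seed_text)
    if "," in spec:
        number_text, length_text = spec.split(",", 1)
        number = int(number_text)
    else:
        number, length_text = 1, spec  # assumed default: a single file
    length = int(length_text)
    if number < 1:
        raise ValueError("number of files must be at least 1")
    if length <= 0 or length % 3 != 0:
        raise ValueError("length must be a positive multiple of 3 nucleotides")
    return number, length, seed


print(parse_simulate("10,300:42"))
```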

--variableselect

Automatically runs a crude variable selection algorithm. It checks the t-statistics after estimating the parameters. Any that are not significant are removed, and a new model is fitted on the remaining variables. If this model is significantly worse, the previous model is output. Otherwise, it looks at the t-statistics in the new model, and tries removing ones that are not significant.

This variable selection is rather crude, and it would probably be better to do the variable selection manually for individual data sets. This option is useful when selecting variables for many simulated data sets.
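
The procedure described above can be sketched in Python as a simple backward-elimination loop. This is an interpretation of the description, not cold's code: the fit callable (which returns a log-likelihood and a t-statistic per variable), the t-statistic threshold of 1.96, and the user-supplied is_significantly_worse test are all assumptions of this example.

```python
def select_variables(variables, fit, is_significantly_worse, t_threshold=1.96):
    """Backward elimination driven by t-statistics.

    fit(variables) must return (log_likelihood, {variable: t_statistic}).
    is_significantly_worse(old_loglik, new_loglik, n_dropped) decides whether
    the smaller model should be rejected.
    """
    current = list(variables)
    loglik, tstats = fit(current)
    while True:
        keep = [v for v in current if abs(tstats[v]) >= t_threshold]
        if len(keep) == len(current):
            return current, loglik  # nothing left to remove
        new_loglik, new_tstats = fit(keep)
        if is_significantly_worse(loglik, new_loglik, len(current) - len(keep)):
            return current, loglik  # keep the previous, larger model
        current, loglik, tstats = keep, new_loglik, new_tstats
```

Passing the significance test in as a function keeps the sketch neutral about which criterion (for example a likelihood-ratio test) is applied.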

--state filename

gives the name of the file to which the current state is saved if the program is interrupted or encounters certain kinds of error. The file is plain text, so the user can read it after an error to help with debugging. The default filename is statefile.

--recover filename

tells the program to attempt to recover its last state from a file, which should have been set using the --state option on a previous run, saving the need to repeat a lot of calculation.

--noautobackup

By default, the program automatically backs up its current position after every Newton-Raphson iteration, allowing some recovery in the event of an unstoppable interruption. This option disables automatic backups.

Debugging and Status Options

-i level

Interactivity - sets the level of interactivity for the program. Options are as follows:

-1	LOGFILE	send all messages to a log file.
0	AWAY	display messages, make default choices, don't ask for user input.
1	VITAL	only request user input when absolutely necessary.
2	PRESENT	request user input whenever it may be helpful to do so.
3	ALL	request user input on all irregularities, even when no input would usually be expected.
	START	[Not yet implemented] Perform all file loading and parsing operations at the start of execution, and prompt for user input at this time. Do not prompt for user input once optimisation has started.

-D level

Debug level - controls the amount of debugging information output. The following options are available:

-1	SILENCE	No Debugging or even status updates displayed.
0	NODEBUG	Appropriate for most users. This provides basic status updates indicating the progress of the program. This should be sufficient to indicate whether the program is progressing normally. If a problem occurs, the program can be rerun with a higher debugging level to identify the problem.
1	BASICDEBUG	Gives basic debugging output. May be helpful if something goes wrong. Might identify file format problems or similar things that may be fixed easily by the user.
2	FULLDEBUG	Mostly only useful for developing the package, or for users who want to extend the package with their own code. If you want to send a bug report, please include the output from the command run with -D2.

Numbers greater than 2 default to 2, so running with -D3 is also OK, and will allow for any later extensions to the debugging options. See the table at the bottom of the page for details of what output is available.

-q

[Not yet implemented] quiet - does not print any information about the status of the program - just outputs the desired information at the end. This can be achieved with -D -1.

-v

[Not yet implemented] verbose - provides a lot of information about the current state of the program. Useful for debugging. This can be achieved with -D 2.

--showeverysite number

controls the status display. The option should be a number. When calculating a Hessian, the program prints information about which site it is currently working on. This option controls how often the program prints this message.

Technical Options (These should only be used if understood; many are only useful for debugging or for dealing with any bugs that might be present.):

-P number

[Not yet implemented] precision - controls how close to zero numbers have to be before being treated as zero. Too small a precision can lead to unstable results.

-b number

[Not yet implemented] buffer size - controls sizes of internal buffers for loading data from files. There should be little reason to change this, but could be useful if there seem to be problems with saved data being corrupted.

-T number

Number of threads. In order to make more efficient use of multiple processors, computations for different sites are performed by different threads, which can be run simultaneously on different processors. The number of threads should usually be equal to the number of 'cores' on the machine (or twice the number if hyperthreading is possible). This should be set as the default when compiling. However, if other processes are also running on the machine, it may be better to use fewer threads. Single threaded mode is less likely to have bugs (though the program has been well tested for threading bugs and seems OK). [Note that for large models or trees, each thread can take up a lot of memory, so it may be faster to run fewer threads than there are processors available in this case.] For more information on choosing the number of threads, see this note.

-o operation

Operation - the default operation, 1, is to maximise the likelihood. Setting this option to 0 makes the program attempt to minimise the likelihood instead.

--testderivs

For debugging. Tests whether the Hessian and derivatives calculated by the program are correct (by comparing them to values approximated from two nearby points).

-e number

For use with the --testderivs option. This sets the distance between points for testing derivatives. The default is 1e-6. If the derivatives are small relative to the values, this can cause rounding errors to make the test inaccurate. In such a case, setting a larger value can improve the testing.
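
The kind of check performed can be sketched in Python with a forward-difference approximation. This is a generic illustration of comparing analytic derivatives with values from two nearby points; cold's exact differencing scheme and tolerances are not documented here, so the function name check_gradient, the forward-difference form, and the tolerance are assumptions.

```python
def check_gradient(f, gradient, point, epsilon=1e-6, tolerance=1e-4):
    """Compare an analytic gradient with forward differences at `point`.

    Returns one tuple (index, analytic, numeric, ok) per coordinate.
    """
    base = f(point)
    analytic = gradient(point)
    report = []
    for i in range(len(point)):
        shifted = list(point)
        shifted[i] += epsilon  # the second, nearby point
        numeric = (f(shifted) - base) / epsilon
        ok = abs(analytic[i] - numeric) <= tolerance * max(1.0, abs(analytic[i]))
        report.append((i, analytic[i], numeric, ok))
    return report


# Toy function with a known gradient.
f = lambda p: -(p[0] - 1) ** 2 - 2 * (p[1] + 0.5) ** 2 + p[0] * p[1]
grad = lambda p: [-2 * (p[0] - 1) + p[1], -4 * (p[1] + 0.5) + p[0]]
for row in check_gradient(f, grad, [0.3, 0.2]):
    print(row)
```

The numeric value is a difference of two nearly equal numbers divided by epsilon, so when the derivative is small relative to the function value, rounding error in that difference can dominate; a larger epsilon reduces this effect at the cost of a larger truncation error.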

Program Output

The program output can be divided into two sorts - status and debugging information during execution; and final output.

Final Output

If the --variableselect option is used, the first line of the output indicates which parameters have been selected.

The next line of final output (the first line if --variableselect is not used) states whether the Newton-Raphson method converged. If it converged, then the output is likely to be the maximum likelihood estimate. If not, then the program did not find the maximum likelihood [but can be told to continue, using the state file]. In either case, the program then outputs either the MLE, or the point that the search has currently reached.

The next line of output is the inferred branch lengths in Newick format, but without the terminating semicolon.

If the --blstdev option is used, the next output is another tree in Newick format, with the estimated standard deviations for the estimated branch lengths. These can be used to give confidence intervals for branch lengths, but are not very reliable [often they are too large to be of much use].

Next is a table, listing for each parameter the coefficient in the cold model (...), and the exponential of the coefficient (which is the parameter in some models).

If the --printsitelikes option is used, the next output is a table of the likelihoods for each site.

If the --tstats option is used, the next output is a table of the t-statistics for each variable.

If the --observedinformation option is given, then the final output is the observed information matrix. This can be used to give confidence intervals for parameter estimates.

Minimum Debug Level	Statement	Meaning
0	Started step number `n`	Currently working on `n`th iteration.
0	Value	Log Likelihood for current parameters.
0	Expected Improvement	The amount by which the Newton-Raphson method predicts that the next point will be better than the current one (Assuming that the step-size is not changed by other factors). If the value is positive it means that it seems to be near a local maximum. If the value is negative, it is near a local minimum.
0	Step size	The modulus of the difference between the current parameters and the next parameters.
2	Producing `n` threads for computation	The site likelihoods are computed separately. Computing them in different threads can improve speed. However, it increases memory usage, and there is more danger of bugs.
2	Angle between steepest and actual ascent	This is a normalised dot product rather than an angle. It gives the dot product between the direction suggested by Newton-Raphson and the direction of steepest ascent. Generally, values close to 1 or -1 indicate circular contours, which should lead to good convergence. Values close to 0 indicate elongated contours, which can indicate ill-conditioned parameters. Positive values indicate the program is converging towards a maximum; negative values indicate it is moving away from a minimum.
2	Modulus	The sum of squares of all current parameter values. This gives an idea of the large-scale movement of the algorithm.
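
The step-related quantities in the table can be written out in a few lines of Python, following the definitions given there: the step size is the modulus of the difference between successive parameter vectors, the "angle" is the normalised dot product of the step with the gradient (the direction of steepest ascent), and the modulus is the sum of squares of the current parameters. The function name step_diagnostics and the example vectors are assumptions made for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def step_diagnostics(current, proposed, gradient):
    """Diagnostics for one step; assumes the step and gradient are non-zero."""
    step = [p - c for p, c in zip(proposed, current)]
    step_size = math.sqrt(dot(step, step))
    gradient_norm = math.sqrt(dot(gradient, gradient))
    return {
        "step_size": step_size,
        "normalised_dot": dot(step, gradient) / (step_size * gradient_norm),
        "modulus": dot(current, current),  # sum of squares, as in the table
    }


print(step_diagnostics([1.0, 2.0], [2.0, 2.0], [3.0, 0.0]))
```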

Codon Optimal Likelihood Discoverer

Running Cold