PHYML User's guide (command-line interface)

Overview

PHYML is a program implementing a new method for building phylogenies from DNA and protein sequences using maximum likelihood. Data sets can be analyzed under several models of evolution (JC69, K80, F81, F84, HKY85, TN93 and GTR for nucleotides and Dayhoff, JTT, mtREV, WAG, DCMut, RtREV, CpREV, VT, Blosum62 and MtMam for amino acids). A discrete-gamma model (Yang, 1994) is implemented to accommodate rate variation among sites. Invariable sites can also be taken into account. PHYML has been compared to several other programs using extensive simulations. The results indicate that its topological accuracy is at least as high as that of fastDNAml, and that it is much faster.

The command-line interface

Download the binary files ; you can then run PHYML by typing "./phyml" followed by a list of parameters. Type 13 parameters for DNA sequences :
./phyml <sequences file> <data type> <sequence format> <nb data sets> <nb bootstrapped data sets> <substitution model> <ts/tv ratio> <prop. invariable sites> <nb categories> <gamma parameter> <starting tree> <optimise topology> <optimise branch lengths and rate parameters>
Example :
./phyml seqs1 0 i 2 0 HKY 4.0 e 1 1.0 BIONJ y y

Type 12 parameters for amino-acid sequences :
./phyml <sequences file> <data type> <sequence format> <nb data sets> <nb bootstrapped data sets> <substitution model> <prop. invariable sites> <nb categories> <gamma parameter> <starting tree> <optimise topology> <optimise branch lengths and rate parameters>
Example :
./phyml seqs2 1 i 1 0 JTT 0.0 4 1.0 BIONJ n n
For complete details type './phyml -h' or see the 'command-line' specific comments in the 'Options' section below.
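
The two usage lines above can be checked programmatically before a run is launched. The following is a minimal sketch, not part of PHYML: the helper name build_command and the parameter lists are illustrative assumptions taken from the usage lines, and the values are those of the two examples.

```python
import shlex

# Positional parameter names, in the order given in the usage lines above.
DNA_PARAMS = [
    "sequences file", "data type", "sequence format", "nb data sets",
    "nb bootstrapped data sets", "substitution model", "ts/tv ratio",
    "prop. invariable sites", "nb categories", "gamma parameter",
    "starting tree", "optimise topology",
    "optimise branch lengths and rate parameters",
]
# Amino-acid runs use the same list without the ts/tv ratio.
AA_PARAMS = [name for name in DNA_PARAMS if name != "ts/tv ratio"]


def build_command(values, protein=False, binary="./phyml"):
    """Return the argument list for one run (hypothetical helper)."""
    expected = AA_PARAMS if protein else DNA_PARAMS
    if len(values) != len(expected):
        raise ValueError(
            f"expected {len(expected)} parameters, got {len(values)}"
        )
    return [binary] + [str(value) for value in values]


dna_cmd = build_command(
    ["seqs1", 0, "i", 2, 0, "HKY", 4.0, "e", 1, 1.0, "BIONJ", "y", "y"]
)
aa_cmd = build_command(
    ["seqs2", 1, "i", 1, 0, "JTT", 0.0, 4, 1.0, "BIONJ", "n", "n"],
    protein=True,
)
print(shlex.join(dna_cmd))
print(shlex.join(aa_cmd))
```

The returned list can be passed as-is to subprocess.run if the binary is present in the current directory.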

PHYML can analyze one or several data sets in conjunction with one or several starting trees.
PHYML produces several result files :
  • <sequence file name>_phyml_lk.txt : likelihood value(s)
  • <sequence file name>_phyml_tree.txt : inferred tree(s)
  • <sequence file name>_phyml_stat.txt : detailed execution stats
  • <sequence file name>_phyml_boot_trees.txt : bootstrap trees (special case)
  • <sequence file name>_phyml_boot_stats.txt : bootstrap statistics (special case)
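
As a small illustration of the naming scheme listed above, the sketch below derives the expected result file names from a sequence file name. The helper result_files is hypothetical, and it assumes the suffixes are simply appended to the sequence file name as the list suggests.

```python
def result_files(sequence_file, bootstrap=False):
    """List the result files expected for one run (hypothetical helper).

    Assumes each suffix is appended directly to the sequence file name.
    """
    suffixes = ["_phyml_lk.txt", "_phyml_tree.txt", "_phyml_stat.txt"]
    if bootstrap:
        # The two bootstrap files are only produced in the special case.
        suffixes += ["_phyml_boot_trees.txt", "_phyml_boot_stats.txt"]
    return [sequence_file + suffix for suffix in suffixes]


for name in result_files("seqs1", bootstrap=True):
    print(name)
```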

Here are the possible uses of PHYML :

  • One data set, one starting tree
    Standard analysis under a given substitution model ; PHYML returns the inferred tree. In addition, a special option performs a non-parametric bootstrap analysis on the original data set. PHYML then returns the bootstrap tree with branch lengths and bootstrap values, in standard NEWICK format (an option gives the pseudo trees in a *_boot_trees.txt file).

  • Several data sets, one starting tree
    Several standard analyses start from the same initial tree with different data sets, without the bootstrap option.
    The results are given in the order of the data sets.
    This can be used to process multiple genes in a supertree approach.

  • One data set, several starting trees
    Several standard analyses of the same data set start from different starting trees, without the bootstrap option.
    All results are given in the order of the trees. Moreover, the most likely tree is provided in the *_best_stat.txt and *_best_tree.txt files.
    This should be used to avoid being trapped in local optima and thus to obtain better trees. Fast parsimony methods can be used to obtain a set of starting trees.

  • Several data sets, several starting trees
    Several standard runs, where each data set is analysed with the corresponding starting tree, without the bootstrap option.
    The results are given in the order of the data sets.
    This can be used to compare the likelihoods of various trees on different data sets.
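
The four situations above, and the constraints stated elsewhere in this guide (bootstrap only with a single data set, matching numbers of trees and data sets), can be summarised in a few lines. This is a sketch under those stated rules ; the function describe_run is a hypothetical helper, not part of PHYML.

```python
def describe_run(n_data_sets, n_trees, n_bootstrap=0):
    """Classify a run into one of the four uses (hypothetical helper)."""
    if n_data_sets < 1 or n_trees < 1:
        raise ValueError("at least one data set and one starting tree")
    if n_bootstrap > 0 and (n_data_sets != 1 or n_trees != 1):
        # The bootstrap option is only described for the first use.
        raise ValueError("bootstrap needs one data set and one starting tree")
    if n_data_sets > 1 and n_trees > 1 and n_data_sets != n_trees:
        # Each data set is analysed with the corresponding starting tree.
        raise ValueError("number of trees must match number of data sets")
    data = "one data set" if n_data_sets == 1 else "several data sets"
    trees = "one starting tree" if n_trees == 1 else "several starting trees"
    return f"{data}, {trees}"


print(describe_run(1, 1, n_bootstrap=100))
print(describe_run(3, 3))
```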

Options


  • Sequences
    The input sequence file is a standard PHYLIP file of aligned DNA or amino-acid sequences. It should look like this in interleaved format :
    5 60
    Tax1        CCATCTCACGGTCGGTACGATACACCTGCTTTTGGCAG
    Tax2        CCATCTCACGGTCAGTAAGATACACCTGCTTTTGGCGG
    Tax3        CCATCTCCCGCTCAGTAAGATACCCCTGCTGTTGGCGG
    Tax4        TCATCTCATGGTCAATAAGATACTCCTGCTTTTGGCGG
    Tax5        CCATCTCACGGTCGGTAAGATACACCTGCTTTTGGCGG
    
    GAAATGGTCAATATTACAAGGT
    GAAATGGTCAACATTAAAAGAT
    GAAATCGTCAATATTAAAAGGT
    GAAATGGTCAATCTTAAAAGGT
    GAAATGGTCAATATTAAAAGGT
    
    The same data set in sequential format:
    5 60
    Tax1        CCATCTCACGGTCGGTACGATACACCTGCTTTTGGCAGGAAATGGTCAATATTACAAGGT
    Tax2        CCATCTCACGGTCAGTAAGATACACCTGCTTTTGGCGGGAAATGGTCAACATTAAAAGAT
    Tax3        CCATCTCCCGCTCAGTAAGATACCCCTGCTGTTGGCGGGAAATCGTCAATATTAAAAGGT
    Tax4        TCATCTCATGGTCAATAAGATACTCCTGCTTTTGGCGGGAAATGGTCAATCTTAAAAGGT
    Tax5        CCATCTCACGGTCGGTAAGATACACCTGCTTTTGGCGGGAAATGGTCAATATTAAAAGGT
    

    On the first line is the number of taxa, a space, then the number of characters for each taxon.

    A species name MUST NOT exceed 50 characters. Blanks within the species name are NOT allowed. However, blanks (one or more) MUST appear at the end of each species name.

    In a sequence, three special characters '.', '-', and '?' may be used: a dot '.' means the same character as in the first sequence, a dash '-' means an alignment gap and a question mark '?' means an undetermined nucleotide. Sites at which one or more sequences contain '-' are NOT excluded from the analysis. Instead, gaps are treated as unknown characters (like '?') on the grounds that ''we don't know what would be there if something were there'' (J. Felsenstein, PHYLIP documentation). Finally, standard ambiguity characters for nucleotides are accepted (Table 1).
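
The format rules above can be illustrated with a short reader for one interleaved data set. This is only a sketch of the rules as stated here (header line, names ended by blanks, 50-character limit, '.' copied from the first sequence) and not PHYML's own parser ; the function parse_interleaved and the toy data set are assumptions made for the example.

```python
def parse_interleaved(text):
    """Parse one interleaved PHYLIP data set into {name: sequence}."""
    lines = [line for line in text.splitlines() if line.strip()]
    n_taxa, n_chars = (int(token) for token in lines[0].split())
    names, seqs = [], []
    for i, line in enumerate(lines[1:]):
        if i < n_taxa:
            # First block: the name, one or more blanks, then the sequence.
            name, chunk = line.split(None, 1)
            if len(name) > 50:
                raise ValueError(f"species name too long: {name}")
            names.append(name)
            seqs.append("")
        else:
            # Later blocks carry sequence characters only.
            chunk = line
        seqs[i % n_taxa] += "".join(chunk.split())
    for name, seq in zip(names, seqs):
        if len(seq) != n_chars:
            raise ValueError(f"{name}: expected {n_chars} characters")
    # A dot means "same character as in the first sequence".
    first = seqs[0]
    expanded = [first] + [
        "".join(f if c == "." else c for c, f in zip(seq, first))
        for seq in seqs[1:]
    ]
    return dict(zip(names, expanded))


EXAMPLE = """3 8
TaxA   ACGT
TaxB   .C.A
TaxC   AC-T

TTGA
..G?
T.GA
"""
alignment = parse_interleaved(EXAMPLE)
for name, seq in alignment.items():
    print(name, seq)
```

A sequential file with one line per taxon is read correctly by the same function, since it is then a single block.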

    Table 1 - Nucleotide character coding
    Character      Nucleotide
    A              Adenine
    G              Guanine
    C              Cytosine
    T              Thymine
    U              Uracil
    M              A or C
    R              A or G
    W              A or T
    S              C or G
    Y              C or T
    K              G or T
    B              C or G or T
    D              A or G or T
    H              A or C or T
    V              A or C or G
    N or X or ?    unknown

    Table 2 - Amino-acid character coding
    Character      Amino acid
    A              Alanine
    R              Arginine
    N or B         Asparagine
    D              Aspartic acid
    C              Cysteine
    Q or Z         Glutamine
    E              Glutamic acid
    G              Glycine
    H              Histidine
    I              Isoleucine
    L              Leucine
    K              Lysine
    M              Methionine
    F              Phenylalanine
    P              Proline
    S              Serine
    T              Threonine
    W              Tryptophan
    Y              Tyrosine
    V              Valine
    X or ?         unknown
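
The nucleotide codes of Table 1 can be written as a lookup table, which is handy for checking a file before analysis. The sketch below is illustrative only: the names NUCLEOTIDE_CODES, possible_states and invalid_characters are hypothetical, gaps are mapped to "unknown" as described above, and treating U like T is an assumption of this example.

```python
# Each character maps to the set of nucleotides it may stand for.
NUCLEOTIDE_CODES = {
    "A": "A", "G": "G", "C": "C", "T": "T",
    "U": "T",  # assumption: uracil handled like thymine
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT", "X": "ACGT", "?": "ACGT",
    "-": "ACGT",  # gaps are treated as unknown characters
}


def possible_states(character):
    """Return the set of nucleotides compatible with one character."""
    return set(NUCLEOTIDE_CODES[character.upper()])


def invalid_characters(sequence):
    """Return the characters of a sequence that Table 1 does not cover."""
    allowed = set(NUCLEOTIDE_CODES) | {"."}  # '.' repeats the first sequence
    return sorted(set(sequence.upper()) - allowed)


print(sorted(possible_states("R")))
print(invalid_characters("ACGTRYJ-?"))
```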

    command-line : type the sequence filename with its path from the current directory

  • Data type
    This indicates whether the sequence file contains DNA or amino-acid sequences. The default choice is to analyze DNA sequences.
    command-line : type '0' for DNA or '1' for amino acids, as in the examples above (a DNA run takes 13 parameters in total, an amino-acid run 12)

  • Sequence format
    The input sequences can be either in interleaved (default) or sequential format, see "Sequences" above.
    command-line : type 'i' or 's'

  • Number of data sets
    Multiple data sets are allowed, e.g. to perform bootstrap analysis using SEQBOOT (from the PHYLIP package). In this case, the data sets are given one after the other, in the formats explained above. For example (with three data sets):
    5 60
    Tax1        CCATCTCACGGTCGGTACGATACACCTGCTTTTGGCAGGAAATGGTCAATATTACAAGGT
    Tax2        CCATCTCACGGTCAGTAAGATACACCTGCTTTTGGCGGGAAATGGTCAACATTAAAAGAT
    Tax3        CCATCTCCCGCTCAGTAAGATACCCCTGCTGTTGGCGGGAAATCGTCAATATTAAAAGGT
    Tax4        TCATCTCATGGTCAATAAGATACTCCTGCTTTTGGCGGGAAATGGTCAATCTTAAAAGGT
    Tax5        CCATCTCACGGTCGGTAAGATACACCTGCTTTTGGCGGGAAATGGTCAATATTAAAAGGT
    
    5 60
    Tax1        CCATCTCACGGTCGGTACGATACACCTGCTTTTGGCAGGAAATGGTCAATATTACAAGGT
    Tax2        CCATCTCACGGTCAGTAAGATACACCTGCTTTTGGCGGGAAATGGTCAACATTAAAAGAT
    Tax3        CCATCTCCCGCTCAGTAAGATACCCCTGCTGTTGGCGGGAAATCGTCAATATTAAAAGGT
    Tax4        TCATCTCATGGTCAATAAGATACTCCTGCTTTTGGCGGGAAATGGTCAATCTTAAAAGGT
    Tax5        CCATCTCACGGTCGGTAAGATACACCTGCTTTTGGCGGGAAATGGTCAATATTAAAAGGT
    
    5 60
    Tax1        CCATCTCACGGTCGGTACGATACACCTGCTTTTGGCAGGAAATGGTCAATATTACAAGGT
    Tax2        CCATCTCACGGTCAGTAAGATACACCTGCTTTTGGCGGGAAATGGTCAACATTAAAAGAT
    Tax3        CCATCTCCCGCTCAGTAAGATACCCCTGCTGTTGGCGGGAAATCGTCAATATTAAAAGGT
    Tax4        TCATCTCATGGTCAATAAGATACTCCTGCTTTTGGCGGGAAATGGTCAATCTTAAAAGGT
    Tax5        CCATCTCACGGTCGGTAAGATACACCTGCTTTTGGCGGGAAATGGTCAATATTAAAAGGT
    

    command-line : type the value of this parameter
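
The value of this parameter has to agree with the content of the file. A possible way to count the data sets, sketched below, is to look for header lines made of exactly two integers. The helper split_data_sets is hypothetical, and the sketch assumes that no sequence line consists of two integers.

```python
import re

# A header line holds the number of taxa and the number of characters.
HEADER = re.compile(r"^\s*\d+\s+\d+\s*$")


def split_data_sets(text):
    """Split a multi-data-set PHYLIP file into one string per data set."""
    data_sets = []
    for line in text.splitlines():
        if HEADER.match(line):
            data_sets.append([line])        # a new data set starts here
        elif line.strip() and data_sets:
            data_sets[-1].append(line)      # sequence line of the current set
    return ["\n".join(block) for block in data_sets]


EXAMPLE = """2 4
Tax1   ACGT
Tax2   ACGA

2 4
Tax1   ACGT
Tax2   ACCA

2 4
Tax1   AGGT
Tax2   ACGA
"""
print(len(split_data_sets(EXAMPLE)))
```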

  • Number of bootstrapped data sets
    When there is only one data set, you can ask PHYML to generate bootstrapped pseudo data sets from this original data set. PHYML returns the bootstrap tree with branch lengths and bootstrap values, in standard NEWICK format. The pseudo trees are given in a *_boot_trees.txt file.
    command-line : type the value of this parameter
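
For readers unfamiliar with the technique, non-parametric bootstrapping builds each pseudo data set by drawing alignment columns with replacement until the original length is reached. The sketch below shows that general idea only ; it is not PHYML's internal code, and the function bootstrap_alignment and the toy alignment are assumptions of the example.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one pseudo data set by resampling columns with replacement."""
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    # Draw n_sites column indices, each uniformly among the original sites.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {
        name: "".join(alignment[name][c] for c in columns) for name in names
    }


alignment = {
    "Tax1": "CCATCTCACG",
    "Tax2": "CCATCTCATG",
    "Tax3": "TCATCTCCCG",
}
replicate = bootstrap_alignment(alignment, random.Random(1))
for name, seq in replicate.items():
    print(name, seq)
```

The same columns are drawn for every taxon, so each site of the pseudo data set is a complete site of the original alignment.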

  • Substitution model
    A nucleotide or amino-acid substitution model. For DNA sequences, the default choice is HKY85 (Hasegawa et al., 1985). This model is analogous to K80 (Kimura, 1980), but allows for different base frequencies. The other models are JC69 (Jukes and Cantor, 1969), K80 (Kimura, 1980), F81 (Felsenstein, 1981), F84 (Felsenstein, 1989), TN93 (Tamura and Nei, 1993) and GTR (e.g., Lanave et al. 1984, Tavaré 1986, Rodriguez et al. 1990). The rate matrices of these models are given in Swofford et al. (1996). For amino-acid sequences, the default choice is JTT (Jones, Taylor and Thornton, 1992). The other models are Dayhoff (Dayhoff et al., 1978), mtREV (as implemented in Yang's PAML), WAG (Whelan and Goldman, 2001), DCMut (Kosiol and Goldman, 2005), RtREV (Dimmic et al., 2002), CpREV (Adachi et al., 2000), VT (Muller and Vingron, 2000), Blosum62 (Henikoff and Henikoff, 1992) and MtMam (Cao, 1998).
    command-line : type the name of the model

  • Transition / transversion ratio
    With DNA sequences, it is possible to set the transition/transversion ratio, except for the JC69 and F81 models, or to estimate its value by maximizing the likelihood of the phylogeny. The latter makes the program slower. The default value is 4.0. The definition of the transition/transversion ratio is the same as in PAML (Yang, 1994). In PHYLIP, the ''transition/transversion rate ratio'' is used instead. 4.0 in PHYML roughly corresponds to 2.0 in PHYLIP.
    command-line : type the value of this parameter or type 'e' to estimate it
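
The rough correspondence between 4.0 and 2.0 can be reproduced with a standard relation for HKY85-type models, in which the expected ratio of transitions to transversions is the PAML-style parameter (often written kappa) weighted by the base frequencies. That this is the conversion intended here is an assumption ; the function expected_ts_tv is a hypothetical helper, and the relation does not apply as such to F84.

```python
def expected_ts_tv(kappa, freqs):
    """Expected transitions/transversions ratio under an HKY85-type model.

    kappa : PAML-style transition/transversion parameter
    freqs : base frequencies as a dict with keys A, C, G, T
    """
    a, c, g, t = (freqs[base] for base in "ACGT")
    purines, pyrimidines = a + g, c + t
    # Transitions: A<->G and C<->T; transversions: purine<->pyrimidine.
    return kappa * (a * g + c * t) / (purines * pyrimidines)


equal = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(expected_ts_tv(4.0, equal))
```

With equal base frequencies the result is kappa / 2, which matches the correspondence quoted above.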

  • Proportion of invariable sites
    The default is to consider that the data set does not contain invariable sites (0.0). However, this proportion can be set to any value in the 0.0-1.0 range. This parameter can also be estimated by maximizing the likelihood of the phylogeny. The latter makes the program slower.
    command-line : type the value of this parameter or type 'e' to estimate it

  • Number of substitution rate categories
    By default, all sites evolve at the same rate, so there is a single substitution rate category. A discrete-gamma distribution can be used to account for variable substitution rates among sites, in which case the user supplies the number of categories that defines this distribution. The higher this number, the better the discrete distribution approximates the continuous one. When a discrete-gamma distribution is used, the default is four categories ; the likelihood of the phylogeny at one site is then averaged over four conditional likelihoods corresponding to four rates, and computing the likelihood is four times slower than with a single rate. Fewer than four or more than eight categories are not recommended. In the first case, the discrete distribution is a poor approximation of the continuous one. In the second case, the computational burden becomes high and a higher number of categories is not likely to improve the accuracy of phylogeny estimation.
    command-line : type the value of this parameter
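
The averaging described above can be written in a few lines. The sketch assumes equally weighted categories and takes the conditional likelihoods as given numbers ; computing them on a tree is outside its scope, and the function names are hypothetical.

```python
import math


def site_likelihood(conditional_likelihoods):
    """Average the likelihoods of one site over the rate categories."""
    return sum(conditional_likelihoods) / len(conditional_likelihoods)


def log_likelihood(per_site_conditionals):
    """Sum the log of the averaged likelihood over all sites."""
    return sum(math.log(site_likelihood(c)) for c in per_site_conditionals)


# Two sites, four rate categories each (made-up values).
sites = [
    [0.10, 0.20, 0.30, 0.40],
    [0.02, 0.02, 0.02, 0.02],
]
print(log_likelihood(sites))
```

The cost grows linearly with the number of categories because each category needs its own conditional likelihood at every site.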

  • Gamma distribution parameter
    The shape of a gamma distribution is defined by this numerical parameter. The higher its value, the lower the variation of substitution rates among sites (this option is used when there is more than one substitution rate category). The default value is 1.0, which corresponds to a moderate variation. Values below about 0.7 correspond to high variation. Values between 0.7 and 1.5 correspond to moderate variation. Higher values correspond to low variation. This value can be fixed by the user. It can also be estimated by maximizing the likelihood of the phylogeny.
    command-line : type the value of this parameter or type 'e' to estimate it
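
To see how the shape parameter and the number of categories translate into rates, the sketch below computes the mean rate of each equal-probability category of a gamma distribution with mean 1, in the spirit of the discrete-gamma model of Yang (1994). It is an independent illustration using only the standard library, not PHYML's implementation ; all function names are hypothetical.

```python
import math


def reg_lower_gamma(shape, x):
    """Regularized lower incomplete gamma function P(shape, x), by series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / shape
    total = term
    for n in range(1, 100000):
        term *= x / (shape + n)
        total += term
        if term < total * 1e-15:
            break
    return total * math.exp(-x + shape * math.log(x) - math.lgamma(shape))


def gamma_quantile(shape, p):
    """Quantile of a Gamma(shape, scale=1) distribution, by bisection."""
    lo, hi = 0.0, 1.0
    while reg_lower_gamma(shape, hi) < p:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if reg_lower_gamma(shape, mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def discrete_gamma_rates(alpha, n_categories):
    """Mean rate of each equal-probability category (overall mean is 1)."""
    k = n_categories
    # Category boundaries, expressed on the Gamma(alpha, 1) scale.
    bounds = [0.0] + [gamma_quantile(alpha, i / k) for i in range(1, k)]
    # Cumulative share of the mean below each boundary; the last is 1.
    cumulative = [reg_lower_gamma(alpha + 1.0, b) for b in bounds] + [1.0]
    return [k * (cumulative[i + 1] - cumulative[i]) for i in range(k)]


for alpha in (0.5, 1.0, 2.0):
    rates = discrete_gamma_rates(alpha, 4)
    print(alpha, [round(r, 3) for r in rates])
```

Lower shape values spread the four rates further apart, which is the "high variation" case described above.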

  • Starting tree(s)
    Used as the starting tree(s) to be refined by the maximum likelihood algorithm. The default is to use a BIONJ distance-based tree. It is also possible to supply a file containing one or several trees, one per line, written in the standard parenthesis representation (NEWICK format) ; the branch lengths must be given, and the tree(s) must be unrooted. Labels on branches (such as bootstrap proportions) are supported. For example, a tree with four taxa named A, B, C, and D and a bootstrap value of 90 on its internal branch should look like this:
    (A:0.02,B:0.004,(C:0.1,D:0.04)90:0.05);
    If you give several trees and analyse several data sets, the number of trees must match the number of data sets.
    command-line : type the tree filename with its path from the current directory or type 'BIONJ'
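
The requirements on a user-supplied tree (final semicolon, balanced parentheses, unrooted, branch lengths on every leaf) can be checked before a run. The sketch below is a simplified check, not a full NEWICK parser: the function check_starting_tree is hypothetical, and quoted labels or comments in square brackets are assumed not to occur.

```python
import re


def check_starting_tree(newick):
    """Check one NEWICK line and return its leaf names."""
    tree = newick.strip()
    if not tree.endswith(";"):
        raise ValueError("tree must end with ';'")
    depth, top_level_children = 0, 1
    for ch in tree:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced parentheses")
        elif ch == "," and depth == 1:
            top_level_children += 1
    if depth != 0:
        raise ValueError("unbalanced parentheses")
    if top_level_children < 3:
        # A rooted binary tree has only two subtrees at the top level.
        raise ValueError("tree must be unrooted")
    # A leaf is a label that directly follows '(' or ','.
    leaves = re.findall(r"[(,]\s*([^(),:;\s]+)\s*(:[\d.eE+-]+)?", tree)
    for name, length in leaves:
        if not length:
            raise ValueError(f"missing branch length for {name}")
    return [name for name, _ in leaves]


print(check_starting_tree("(A:0.02,B:0.004,(C:0.1,D:0.04)90:0.05);"))
```

Applied to each line of a tree file, the same function also gives the number of trees, which can then be compared with the number of data sets.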

  • Optimise starting tree(s) options
    You can optimise the starting tree(s) in three ways :
    - You can optimise the topology, the branch lengths and rate parameters (transition/transversion ratio, proportion of invariant sites, gamma distribution parameter),
    command-line : type 'y' and 'y'
    - You can keep the topology and optimise the branch lengths and rate parameters (it is not possible to optimise the tree topology and keep the branch lengths and rate parameters),
    command-line : type 'n' and 'y'
    - You can ask for no optimisation ; PHYML then just computes the likelihood of the starting tree(s).
    command-line : type 'n' and 'n'
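
The three allowed combinations of the last two parameters can be encoded as follows. This is a small sketch of the rule stated above ; the function optimisation_flags is a hypothetical helper.

```python
def optimisation_flags(optimise_topology, optimise_lengths_and_rates):
    """Return the last two command-line parameters as 'y'/'n' strings."""
    if optimise_topology and not optimise_lengths_and_rates:
        # Optimising the topology alone is not possible.
        raise ValueError(
            "cannot optimise the topology while keeping branch lengths "
            "and rate parameters fixed"
        )
    return (
        "y" if optimise_topology else "n",
        "y" if optimise_lengths_and_rates else "n",
    )


print(optimisation_flags(True, True))
print(optimisation_flags(False, True))
print(optimisation_flags(False, False))
```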

References

  • Z. Yang (1994) J. Mol. Evol. 39, 306-14.
  • S. Ota & W.-H. Li (2001) Mol. Biol. Evol.  18, 1983-1992.
  • N. Saitou & M. Nei (1987) Mol. Biol. Evol.  4(4), 406-425.
  • W. Bruno, N. D. Socci, & A. L. Halpern (2000) Mol. Biol. Evol. 17, 189-197.
  • J. Felsenstein (1989) Cladistics 5, 164-166.
  • G. J. Olsen, H. Matsuda, R. Hagstrom, & R. Overbeek (1994) CABIOS 10, 41-48.
  • N. Goldman (1993) J. Mol. Evol. 36, 182-198.
  • M. Kimura (1980) J. Mol. Evol. 16, 111-120.
  • T. H. Jukes & C. R. Cantor (1969) in Mammalian Protein Metabolism, ed. H. N. Munro. (Academic Press, New York) Vol. III, pp. 21-132.
  • M. Hasegawa, H. Kishino, & T. Yano (1985) J. Mol. Evol.  22, 160-174.
  • J. Felsenstein (1981) J. Mol. Evol. 17, 368-376.
  • David L. Swofford, Gary J. Olsen, Peter J. Waddell, & David M. Hillis (1996) in Molecular Systematics, eds. David M. Hillis, Craig Moritz, & Barbara K. Mable. (Sinauer Associates, Inc., Sunderland, Massachusetts, USA).
  • K. Tamura & M. Nei (1993) Mol. Biol. Evol. 10, 512-526.
  • Lanave, C., G. Preparata, C. Saccone, and G. Serio. (1984). A new method for calculating evolutionary substitution rates. J. Mol. Evol. 20, 86-93.
  • Dayhoff, M. O., R. M. Schwartz, and B. C. Orcutt. (1978). A model of evolutionary change in proteins. In: Dayhoff, M. O. (ed.) Atlas of Protein Sequence and Structure, Vol. 5, Suppl. 3. National Biomedical Research Foundation, Washington DC, pp. 345-352.
  • Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. The rapid generation of mutation data matrices from protein sequences. CABIOS 8: 275-282.
  • S. Whelan and N. Goldman. (2001). A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol. Biol. Evol. 18, 691-699.
  • Dimmic M.W., J.S. Rest, D.P. Mindell, and D. Goldstein. 2002. rtREV: An amino acid substitution matrix for inference of retrovirus and reverse transcriptase phylogeny. Journal of Molecular Evolution 55: 65-73.
  • Adachi, J., P. Waddell, W. Martin, and M. Hasegawa. 2000. Plastid genome phylogeny and a model of amino acid substitution for proteins encoded by chloroplast DNA. Journal of Molecular Evolution 50:348-358.
  • Muller, T., and M. Vingron. 2000. Modeling amino acid replacement. Journal of Computational Biology 7:761-776.
  • Henikoff, S., and J. G. Henikoff. 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci., U.S.A. 89:10915-10919.
  • Cao, Y. et al. 1998 Conflict amongst individual mitochondrial proteins in resolving the phylogeny of eutherian orders. Journal of Molecular Evolution 15:1600-1611.