Up to the Phylogenetics main page

Goals

The goal of this lab is to gain experience simulating DNA sequence data, which can be useful in testing null hypotheses of interest (parametric bootstrapping) as well as testing the robustness of models to violations of their assumptions and testing the correctness of software and algorithms. The workhorse for DNA simulations in phylogenetics is Andrew Rambaut’s program seq-gen, which still available (and still as useful as it always was!). I will also show you how to perform simulation experiments with PAUP*, which we will use to demonstrate long branch attraction.

As always, start by logging into the Xanadu cluster and grabbing a node on the cluster that is free:

srun --qos=mcbstudent --partition=mcbstudent --pty bash

Using Seq-Gen to simulate sequence data

Compiling seq-gen

There is no module for seq-gen on the Xanadu cluster, so you will need to download the source code and compile it yourself. In the IQ-TREE lab, you learned how to download and unpack an executable file, but seq-gen is different in that what you will download is the source code (written in the computing language C), which needs to be compiled into an executable file before it can be used.

First, create a directory to use for this lab and navigate into that directory:

mkdir simlab
cd simlab

Download the seq-gen source code from GitHub using the following curl command:

curl -LO https://github.com/rambaut/Seq-Gen/archive/refs/tags/1.3.4.tar.gz

Unpack the downloaded “tape archive” file 1.3.4.tar.gz as follows:

tar zxvf 1.3.4.tar.gz

You should now have a directory named Seq-Gen-1.3.4 inside your simlab directory. Navigate into the source subdirectory of the new Seq-Gen-1.3.4 directory:

cd Seq-Gen-1.3.4/source

There is a file inside this directory named Makefile that contains instructions for building seq-gen from the C source files (which are the files with names that end in either .c or .h). All you need to do now is type make to process these instructions:

make

The C compiler will “compile” (note the -c on the command line) all the files ending in .c (including, as necessary, the files ending in .h), producing files with the same prefix but ending in .o. You can ignore the warning that is issued along the way (errors are harder to ignore, but fortunately there should be no errors). These “object” files are then linked together (note the -o rather than -c this time) in the final step to create the executable file, which is named seq-gen. Move the executable file back to the simlab directory so that it is easier to use for this lab:

mv seq-gen ~/simlab
cd ~/simlab

Viewing the seq-gen documentation

The documentation is in the form of an HTML file, Seq-Gen.Manual.html, which is located in the Seq-Gen-1.3.4/documentation directory. It is not that convenient to view web pages on the cluster, so you may wish to download the zip file version to your local laptop in order to get the file Seq-Gen.Manual.html where you can open it in a web browser.

That said, you can read this file while on the cluster using the lynx command:

cd Seq-Gen-1.3.4/documentation
lynx Seq-Gen.Manual.html

Type Q to quit when you’ve seen enough.

Using seq-gen

Return to the simlab directory now, where your seq-gen executable is located.

Using nano, create a file named tree.txt that contains the following single line:

(A:1.0,B:1.0,((C:1.0,D:1.0):1.0,(E:1.0,F:1.0):1.0):1.0)

Now, create a file named sg.sh containing the following:

$HOME/simlab/seq-gen -mHKY -l10000 -n1 -p1 -t2.0 -on < tree.txt > simdata.nex

# -mHKY                       use the HKY substitution model
#                             (HKY, F84, GTR, JTT, WAG, PAM, BLOSUM, MTREV, CPREV or GENERAL)
# -l10000                     sequence length (number of sites to simulate)
# -n1                         number of data sets to simulate from this gene tree
# -p1                         number of data partitions
# -f 0.25 0.25 0.25 0.25      nucleotide frequencies
# -t2.0                       transition/transversion ratio (not the rate ratio!)
# -r 1.0 4.0 1.0 1.0 4.0 1.0  GTR exchangeabilities)
# -a0.5                       shape parameter of gamma among-site rate heterogeneity
# -i0.5                       proportion of invariable sites
# -z123579                    random number seed
# -on                         output format (on=nexus, op=phylip, or=relaxed phylip)
# -s1.0                       branch length scaling factor (1.0 says to not scale branch lengths at all)
#
# tree.txt
#
# (A:1.0,B:1.0,((C:1.0,D:1.0):1.0,(E:1.0,F:1.0):1.0):1.0)

Most of the contents of this file are comments (lines starting with #). I included those as a sort of cheat-sheet for using seq-gen to save having to look some things up in the manual.

You will need to execute the command at the top of this file, and the easiest way to run Linux commands that are stored in a file is to “source” the file:

. sg.sh

The initial dot tells the bash interpreter to simply issue the commands found in the following “shell script” file as if you typed them into the console. You could skip the creation of sg.sh in the first place by simply issuing the command on the command line, like this:

$HOME/simlab/seq-gen -mHKY -l10000 -n1 -p1 -t2.0 -a0.5 -on < tree.txt > simdata.nex

I showed you how to store the command in a file because that provides a record of what you did and allows you to easily modify the command and run it again.

:thinking: Take a look at the file seq-gen generated. Can you explain why nearly every site show evidence of substitution? (hint1: look at the branch lengths specified in the true tree)

Modify your sg.sh specifying a branch length scaling factor of 0.0001 and rerun it.

:thinking: Take a look at the file seq-gen generated. How many sites would you expect to look at before seeing one that shows evidence of substitution? (hint: I’m not asking you to count constant sites! You can answer this using the true tree branch lengths and scaling factor)

Using PAUP* to perform simulation experiments

We will be using cutting-edge features in PAUP* – so cutting edge that you will not be able to find any information about these features anywhere online or by using the help command in PAUP*! So don’t get confused when you try to look up some of the components of the NEXUS file you will be using.

Simulation Template

Create an empty text file named paupsim.nex:

#nexus

begin trees;
    tree 1 = ((A:0.1,B:0.1):0.1,(C:0.1,D:0.1):0);
end;

begin dnasim;
    simdata nchar=10000;
    lset nst=1 basefreq=equal rates=equal pinvar=0;
    truetree source=memory treenum=1 showtruetree=brlens;
    beginsim nreps=100 seed=12345 monitor=no resultsfile=(name=results.txt replace output=allreps);
        [parsimony]
            set criterion=parsimony;
            alltrees;
            tally parsimony;
        [likelihood]
            set criterion=likelihood;
            alltrees;
            tally 'ML-JC';
    endsim; 
end;

begin paup;
    set nowarntsave;
    quit;
end;

The trees block contains the description of the true tree that we will use to simulate data. By default, trees are considered unrooted.

The beginsim…endsim loop in the dnasim block tells PAUP* to simulate 100 nucleotide data sets (nreps=100).

:thinking: What substitution model will PAUP* use? (hint: the lset command specifies the model.)

For each of the 100 data sets simulated, two analyses will be performed: (1) an exhaustive search using parsimony and (2) and exhaustive search using maximum likelihood. I’ve added the [parsimony] and [likelihood] comments along with indentation to show you where these two searches are defined.

The tally commands keep track of how many times parsimony and ML infer a tree identical to the true tree used for simulation, and the tallied information is stored in the file results.txt. The name supplied after the tally command will end up being the column name in the file. Note that I had to surround the ML-JC with single quotes because of the embedded hyphen. (The quotes would also be necessary if you wanted a space inside the column name.)

For both parsimony and ML, tally calculates the following quantities (where NAME is either “parsimony” or “ML-JC”):

  • NAME_Ntrees, the number of trees tied for being best (ideally 1)
  • NAME_P, the fraction of splits in the true tree that were found in the inferred tree (averaged over number of trees inferred)
  • NAME_correct, same as NAME_P if no incorrect splits are found in the inferred tree, otherwise 0 (averaged over number of trees inferred)

The final paup block sets nowarntsave, which means PAUP will not warn you if you quit without saving stored trees, then quits.

Execute the NEXUS File

Log on to the cluster, start a session using

srun --partition=mcbstudent --qos=mcbstudent --pty bash

then load the current module of PAUP*

module load paup/4.0a-166

Start PAUP*, and execute your NEXUS file:

paup paupsim.nex

Note that PAUP* quits after performing the simulations (because we told it to quit in that final paup block). You can also view the results.txt file directly in your terminal by typing

column -t results.txt

The -t makes the columns align.

:thinking: Did both optimality criteria get the tree correct most of the time?

Enter the Felsenstein Zone

As you’ve learned in lecture, parsimony is particularly prone to long branch attraction while maximum likelihood is able to resist joining long edges if the model is correct in the important details. Copy your NEXUS file to create a file named paupsimFZ.nex. Edit the new file and change two edge lengths to 1.0 in order to create a true tree that falls in the Felsenstein zone. Also change the name of the results.txt file to resultsFZ.txt so that you will not overwrite your previous results.

Execute paupsimFZ.nex, then cat the new resultsFZ.txt file and consider the results of your simulations.

:thinking: Which optimality criterion performed best at recovering the true tree?

Change the simdata nchar=10000; line to simdata nchar=(100 1000 10000); and change output=allreps to output=meansonly. Now PAUP* will simulate data sets of 3 different sequence lengths and summarize the results rather than spitting out a line for every simulation replicate.

:thinking: Which (parsimony or ML) appears to be statistically consistent? Why?

Add substantial rate HETEROgeneity (e.g. gamma shape = 0.01) to the simulated data and analyze the data under both parsimony and ML (using a model that assumes rate HOMOgeneity).

:thinking: How did you modify your paupsimFZ.nex file in order to accomplish this?

:thinking: Is ML statistically consistent when the model is violated in this way? Why? (hint: think about which data are hard to model accurately with high heterogeneity)

Saving Simulated Data

Can you figure out how to change your orignal paupsim.nex file (with no rate heterogenity) so that PAUP* simulates one data set and exports it to a file? Start PAUP* and use

export ?

to figure out how to use the export command to save the simulated data to a file. You will want to delete both the parsimony and ML analysis code between the beginsim and endsim lines and replace it with your export command. You should also add a cstatus command to help answer the questions about proportion of constant sites.

You will also probably want to make these other changes to your file:

  • specify only 1 sequence length (e.g. nchar=1000)
  • specify only 1 simulation replicate (i.e. nreps=1)
  • specify that you want to see output (i.e. monitor=yes)

:thinking: Make all branches in the true tree long (e.g. 10). What is the proportion of constant sites?

:thinking: How many substitutions are simulated, on average, per site over the entire tree?

:thinking: Make all branches in the true tree short (e.g. 0.001). What is the proportion of constant sites?

:thinking: How many substitutions are simulated, on average, per site over the entire tree?

:thinking: Make all branches in the true tree 10 but add significant rate heterogeneity (gamma shape 0.01). What about the proportion of constant sites now?

:thinking: How many substitutions are simulated, on average, per site over the entire tree?

:thinking: To which of the previous 2 simulated data sets is this data set most similar in terms of the proportion of constant sites?

Start PAUP* (without specifying a data file) and use the gammaplot shape=0.01 command to examine the rate means for the four categories of rates.

:thinking: What fraction of sites are essentially invariable?

Strimmer and Rambaut (2002) Study

Download this paper, which I’ll refer to as SR from now on:

Strimmer K., and Rambaut A. 2002. Inferring confidence sets of possibly misspecified gene trees. Proc. Biol. Sci. 269:137–142.

SR Fig. 1
SR Fig. 1

SR simulated data on the tree shown in Figure 1 of their paper and expected the SH (Shimodaira and Hasegawa, 1999) test to reveal that all three possible resolutions of the polytomy were equally supported by the data. Makes sense, doesn’t it? What they found instead was that (unless they simulated 4000 sites or more) all 15 (rooted) trees for the four taxa A, B, C, and D were considered equal by the SH test. They concluded that the SH test has a bias making it overly conservative and this bias dissipates as sequence lengths increase. This result motivated Shimodaira to create the AU (Approximately Unbiased) test (Shimodaira, 2002).

Can you recreate SR’s results for 500 and 5000 sites (see their Table 3)?

To do this, start with the original paupsim.nex file shown above in the “Simulation template” section. You will need to make your true tree equal the tree SR show in their Figure 1, and you need to make the simulation model equal to the one they used (see the bottom right part of SR p. 140). You can delete all the commands and comments inside the beginsim…endsim loop, replacing these with

set maxtrees=105;
generatetrees all model=equiprobable;
lscores all / shtest autest RELL bootreps=1000;

which generates all 105 possible trees and tests them all using the SH test. To see the output, you’ll need to say monitor=yes in your beginsim command, and you can delete the resultsfile statement.

(Note: look at the column labeled SH, not the column labeled wtd-SH.)

:thinking: How many of the 105 trees were not significant using the SH test for 500 sites? 5000 sites?

:thinking: Does the AU test produce a different result?

Literature Cited

H Shimodaira and M Hasegawa. 1999. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Molelcular Biology and Evolution 16:1114–1116.

H Shimodaira. 2002. An Approximately Unbiased Test of Phylogenetic Tree Selection. Systematic Biology 51:492–508.

Acknowledgements

Kevin Keegan contributed to a previous version of this lab exercise