Simulation Lab
Up to the Phylogenetics main page
Goals
The goal of this lab is to gain experience simulating DNA sequence data, which can be useful in testing null hypotheses of interest (parametric bootstrapping) as well as testing the robustness of models to violations of their assumptions and testing the correctness of software and algorithms. The workhorse for DNA simulations in phylogenetics is Andrew Rambaut’s program seq-gen, which still available (and still as useful as it always was!), but today we will use PAUP* to perform a simulation experiment to demonstrate the phenomenon known as long branch attraction.
As always, start by logging into the Xanadu cluster and grabbing a node on the cluster that is free:
srun --qos=mcbstudent --partition=mcbstudent --pty bash
Create a directory for this exercise
First, create a directory to use for this lab and navigate into that directory:
mkdir simlab
cd simlab
Template
Here is a template with the questions asked below. Copy this into a text file and turn it in with your answers after the lab.
What substitution model will PAUP* use?
answer:
Did both optimality criteria get the tree correct most of the time?
answer:
Which optimality criterion performed best at recovering the true tree?
answer:
Which (parsimony or ML) appears to be statistically consistent? Why?
answer:
How did you modify your paupsimFZ.nex file in order to accomplish this?
answer:
Is ML statistically consistent when the model is violated in this way? Why?
answer:
Make all branches in the true tree long (e.g. 10).
What is the proportion of constant sites?
answer:
How many substitutions are simulated, on average, per
site over the entire tree?
answer:
Make all branches in the true tree short (e.g. 0.001).
What is the proportion of constant sites?
answer:
How many substitutions are simulated, on average, per
site over the entire tree?
answer:
Make all branches in the true tree 10 but add significant
rate heterogeneity (gamma shape 0.01). What about the
proportion of constant sites now?
answer:
How many substitutions are simulated, on average, per site
over the entire tree?
answer:
To which of the previous 2 simulated data sets is this data
set most similar in terms of the proportion of constant sites?
answer:
What fraction of sites are essentially invariable?
answer:
Using PAUP* to perform simulation experiments
We will be using cutting-edge features in PAUP* – so cutting edge that you will not be able to find any information about these features anywhere online or by using the help command in PAUP*! So don’t get confused when you try to look up some of the components of the NEXUS file you will be using.
Simulation Template
Create an empty text file named paupsim.nex:
#nexus
begin trees;
tree 1 = ((A:0.1,B:0.1):0.1,(C:0.1,D:0.1):0);
end;
begin dnasim;
simdata nchar=10000;
lset nst=1 basefreq=equal rates=equal pinvar=0;
truetree source=memory treenum=1 showtruetree=brlens;
beginsim nreps=100 seed=12345 monitor=no resultsfile=(name=results.txt replace output=allreps);
[parsimony]
set criterion=parsimony;
alltrees;
tally parsimony;
[likelihood]
set criterion=likelihood;
alltrees;
tally 'ML-JC';
endsim;
end;
begin paup;
set nowarntsave;
quit;
end;
The trees block contains the description of the true tree that we will use to simulate data. By default, trees are considered unrooted.
The beginsim…endsim loop in the dnasim block tells PAUP* to simulate 100 nucleotide data sets (nreps=100).
What substitution model will PAUP* use? (hint: the lset command specifies the model.)
For each of the 100 data sets simulated, two analyses will be performed: (1) an exhaustive search using parsimony and (2) and exhaustive search using maximum likelihood. I’ve added the [parsimony] and [likelihood] comments along with indentation to show you where these two searches are defined.
The tally commands keep track of how many times parsimony and ML infer a tree identical to the true tree used for simulation, and the tallied information is stored in the file results.txt. The name supplied after the tally command will end up being the column name in the file. Note that I had to surround the ML-JC with single quotes because of the embedded hyphen. (The quotes would also be necessary if you wanted a space inside the column name.)
For both parsimony and ML, tally calculates the following quantities (where NAME is either “parsimony” or “ML-JC”):
- NAME_Ntrees, the number of trees tied for being best (ideally 1)
- NAME_P, the fraction of splits in the true tree that were found in the inferred tree (averaged over number of trees inferred)
- NAME_correct, same as NAME_P if no incorrect splits are found in the inferred tree, otherwise 0 (averaged over number of trees inferred)
The final paup block sets nowarntsave, which means PAUP will not warn you if you quit without saving stored trees, then quits.
Execute the NEXUS File
Log on to the cluster, start a session using
srun --partition=mcbstudent --qos=mcbstudent --pty bash
then load the current module of PAUP*
module load paup/4.0a-166
Start PAUP*, and execute your NEXUS file:
paup paupsim.nex
Note that PAUP* quits after performing the simulations (because we told it to quit in that final paup block). You can also view the results.txt file directly in your terminal by typing
column -t results.txt
The -t makes the columns align.
Did both optimality criteria get the tree correct most of the time?
Enter the Felsenstein Zone
As you’ve learned in lecture, parsimony is particularly prone to long branch attraction while maximum likelihood is able to resist joining long edges if the model is correct in the important details. Copy your NEXUS file to create a file named paupsimFZ.nex. Edit the new file and change two edge lengths to 1.0 in order to create a true tree that falls in the Felsenstein zone. Also change the name of the results.txt file to resultsFZ.txt so that you will not overwrite your previous results.
Execute paupsimFZ.nex, then cat
the new resultsFZ.txt file and consider the results of your simulations.
Which optimality criterion performed best at recovering the true tree?
Change the simdata nchar=10000; line to simdata nchar=(100 1000 10000); and change output=allreps to output=meansonly. Now PAUP* will simulate data sets of 3 different sequence lengths and summarize the results rather than spitting out a line for every simulation replicate.
Which (parsimony or ML) appears to be statistically consistent? Why?
Add substantial rate HETEROgeneity (e.g. gamma shape = 0.01) to the simulated data and analyze the data under both parsimony and ML (using a model that assumes rate HOMOgeneity).
How did you modify your paupsimFZ.nex file in order to accomplish this?
Is ML statistically consistent when the model is violated in this way? Why? (hint: think about which data are hard to model accurately with high heterogeneity)
Saving Simulated Data
Can you figure out how to change your orignal paupsim.nex file (with no rate heterogenity) so that PAUP* simulates one data set and exports it to a file? Start PAUP* and use
export ?
to figure out how to use the export command to save the simulated data to a file. You will want to delete both the parsimony and ML analysis code between the beginsim and endsim lines and replace it with your export command. You should also add a cstatus command to help answer the questions about proportion of constant sites.
You will also probably want to make these other changes to your file:
- specify only 1 sequence length (e.g. nchar=1000)
- specify only 1 simulation replicate (i.e. nreps=1)
- specify that you want to see output (i.e. monitor=yes)
Make all branches in the true tree long (e.g. 10). What is the proportion of constant sites?
How many substitutions are simulated, on average, per site over the entire tree?
Make all branches in the true tree short (e.g. 0.001). What is the proportion of constant sites?
How many substitutions are simulated, on average, per site over the entire tree?
Make all branches in the true tree 10 but add significant rate heterogeneity (gamma shape 0.01). What about the proportion of constant sites now?
How many substitutions are simulated, on average, per site over the entire tree?
To which of the previous 2 simulated data sets is this data set most similar in terms of the proportion of constant sites?
Start PAUP* (without specifying a data file) and use the gammaplot shape=0.01 command to examine the rate means for the four categories of rates.
What fraction of sites are essentially invariable?
Acknowledgements
Kevin Keegan contributed to a previous version of this lab exercise