Simulation Lab

Goals

The goal of this lab is to gain experience simulating DNA sequence data, which can be useful in testing null hypotheses of interest (parametric bootstrapping) as well as testing the robustness of models to violations of their assumptions and testing the correctness of software and algorithms. The workhorse for DNA simulations in phylogenetics is Andrew Rambaut’s program seq-gen, which still available (and still as useful as it always was!), but today we will use PAUP* to perform a simulation experiment to demonstrate the phenomenon known as long branch attraction.

As always, start by logging into the Xanadu cluster and grabbing a node on the cluster that is free:

srun --qos=mcbstudent --partition=mcbstudent --pty bash

Create a directory for this exercise

First, create a directory to use for this lab and navigate into that directory:

mkdir simlab
cd simlab

Template

Here is a template with the questions asked below. Copy this into a text file and turn it in with your answers after the lab.

What substitution model will PAUP* use?
answer:

Did both optimality criteria get the tree correct most of the time?
answer: 

Which optimality criterion performed best at recovering the true tree?
answer: 

Which (parsimony or ML) appears to be statistically consistent? Why?
answer: 

How did you modify your paupsimFZ.nex file in order to accomplish this?
answer: 

Is ML statistically consistent when the model is violated in this way? Why? 
answer: 

Make all branches in the true tree long (e.g. 10). 
What is the proportion of constant sites? 
answer: 

How many substitutions are simulated, on average, per 
site over the entire tree?
answer: 

Make all branches in the true tree short (e.g. 0.001). 
What is the proportion of constant sites? 
answer: 

How many substitutions are simulated, on average, per 
site over the entire tree?
answer: 

Make all branches in the true tree 10 but add significant 
rate heterogeneity (gamma shape 0.01). What about the
proportion of constant sites now? 
answer: 

How many substitutions are simulated, on average, per site 
over the entire tree? 
answer: 

To which of the previous 2 simulated data sets is this data 
set most similar in terms of the proportion of constant sites? 
answer: 

What fraction of sites are essentially invariable?
answer: 

Using PAUP* to perform simulation experiments

We will be using cutting-edge features in PAUP* – so cutting edge that you will not be able to find any information about these features anywhere online or by using the help command in PAUP*! So don’t get confused when you try to look up some of the components of the NEXUS file you will be using.

Simulation Template

Create an empty text file named paupsim.nex:

#nexus

begin trees;
    tree 1 = ((A:0.1,B:0.1):0.1,(C:0.1,D:0.1):0);
end;

begin dnasim;
    simdata nchar=10000;
    lset nst=1 basefreq=equal rates=equal pinvar=0;
    truetree source=memory treenum=1 showtruetree=brlens;
    beginsim nreps=100 seed=12345 monitor=no resultsfile=(name=results.txt replace output=allreps);
        [parsimony]
            set criterion=parsimony;
            alltrees;
            tally parsimony;
        [likelihood]
            set criterion=likelihood;
            alltrees;
            tally 'ML-JC';
    endsim; 
end;

begin paup;
    set nowarntsave;
    quit;
end;

The trees block contains the description of the true tree that we will use to simulate data. By default, trees are considered unrooted.

The beginsim…endsim loop in the dnasim block tells PAUP* to simulate 100 nucleotide data sets (nreps=100).

What substitution model will PAUP* use? (hint: the lset command specifies the model.)

For each of the 100 data sets simulated, two analyses will be performed: (1) an exhaustive search using parsimony and (2) and exhaustive search using maximum likelihood. I’ve added the [parsimony] and [likelihood] comments along with indentation to show you where these two searches are defined.

The tally commands keep track of how many times parsimony and ML infer a tree identical to the true tree used for simulation, and the tallied information is stored in the file results.txt. The name supplied after the tally command will end up being the column name in the file. Note that I had to surround the ML-JC with single quotes because of the embedded hyphen. (The quotes would also be necessary if you wanted a space inside the column name.)

For both parsimony and ML, tally calculates the following quantities (where NAME is either “parsimony” or “ML-JC”):

NAME_Ntrees, the number of trees tied for being best (ideally 1)
NAME_P, the fraction of splits in the true tree that were found in the inferred tree (averaged over number of trees inferred)
NAME_correct, same as NAME_P if no incorrect splits are found in the inferred tree, otherwise 0 (averaged over number of trees inferred)

The final paup block sets nowarntsave, which means PAUP will not warn you if you quit without saving stored trees, then quits.

Execute the NEXUS File

Log on to the cluster, start a session using

srun --partition=mcbstudent --qos=mcbstudent --pty bash

then load the current module of PAUP*

module load paup/4.0a-166

Start PAUP*, and execute your NEXUS file:

paup paupsim.nex

Note that PAUP* quits after performing the simulations (because we told it to quit in that final paup block). You can also view the results.txt file directly in your terminal by typing

column -t results.txt

The -t makes the columns align.

Did both optimality criteria get the tree correct most of the time?

Enter the Felsenstein Zone

As you’ve learned in lecture, parsimony is particularly prone to long branch attraction while maximum likelihood is able to resist joining long edges if the model is correct in the important details. Copy your NEXUS file to create a file named paupsimFZ.nex. Edit the new file and change two edge lengths to 1.0 in order to create a true tree that falls in the Felsenstein zone. Also change the name of the results.txt file to resultsFZ.txt so that you will not overwrite your previous results.

Execute paupsimFZ.nex, then cat the new resultsFZ.txt file and consider the results of your simulations.

Which optimality criterion performed best at recovering the true tree?

Change the simdata nchar=10000; line to simdata nchar=(100 1000 10000); and change output=allreps to output=meansonly. Now PAUP* will simulate data sets of 3 different sequence lengths and summarize the results rather than spitting out a line for every simulation replicate.

Which (parsimony or ML) appears to be statistically consistent? Why?

Add substantial rate HETEROgeneity (e.g. gamma shape = 0.01) to the simulated data and analyze the data under both parsimony and ML (using a model that assumes rate HOMOgeneity).

How did you modify your paupsimFZ.nex file in order to accomplish this?

Is ML statistically consistent when the model is violated in this way? Why? (hint: think about which data are hard to model accurately with high heterogeneity)

Saving Simulated Data

Can you figure out how to change your orignal paupsim.nex file (with no rate heterogenity) so that PAUP* simulates one data set and exports it to a file? Start PAUP* and use

export ?

to figure out how to use the export command to save the simulated data to a file. You will want to delete both the parsimony and ML analysis code between the beginsim and endsim lines and replace it with your export command. You should also add a cstatus command to help answer the questions about proportion of constant sites.

You will also probably want to make these other changes to your file:

specify only 1 sequence length (e.g. nchar=1000)
specify only 1 simulation replicate (i.e. nreps=1)
specify that you want to see output (i.e. monitor=yes)

Make all branches in the true tree long (e.g. 10). What is the proportion of constant sites?

How many substitutions are simulated, on average, per site over the entire tree?

Make all branches in the true tree short (e.g. 0.001). What is the proportion of constant sites?

How many substitutions are simulated, on average, per site over the entire tree?

Make all branches in the true tree 10 but add significant rate heterogeneity (gamma shape 0.01). What about the proportion of constant sites now?

How many substitutions are simulated, on average, per site over the entire tree?

To which of the previous 2 simulated data sets is this data set most similar in terms of the proportion of constant sites?

Start PAUP* (without specifying a data file) and use the gammaplot shape=0.01 command to examine the rate means for the four categories of rates.

What fraction of sites are essentially invariable?

Acknowledgements

Kevin Keegan contributed to a previous version of this lab exercise