Simulation Lab
Up to the Phylogenetics main page
Goals
The goal of this lab is to gain experience simulating DNA sequence data, which can be useful in testing null hypotheses of interest (parametric bootstrapping) as well as testing the robustness of models to violations of their assumptions and testing the correctness of software and algorithms. The workhorse for DNA simulations in phylogenetics is Andrew Rambaut’s program seq-gen, which still available (and still as useful as it always was!)
I use a
emoji below to indicate something you should do. Hopefully this will make it easier for you to see what to do next (as opposed to commentary on what you just did or notes).
Template
Here is a template with the questions asked below. Copy this into a text file and turn it in with your answers after the lab.
Can you explain why nearly every site shows evidence of substitution?
answer:
How many sites would you expect to look at before seeing one that shows evidence of substitution?
answer:
How many data sets did seq-gen simulate?
answer:
What part of the seq-gen command told it to generate this many data sets?
answer:
How many trees obtained using the parsimony criterion have a topology identical to the true tree?
answer:
How many trees obtained using the likelihood criterion have a topology identical to the true tree?
answer:
What split characterizes the true tree: AB|CD, AC|BC, or AD|BC?
answer:
How many likelihood trees were AB|CD? AC|BD? AD|BC?
answer:
How many parsimony trees were AB|CD? AC|BD? AD|BC?
answer:
Explain why this phenomenon is called long branch attraction.
answer:
How many likelihood trees were AB|CD? AC|BD? AD|BC when rate heterogeneity was present?
answer:
Is likelihood susceptible to LBA if the model is incorrect in an important way?
answer:
Getting started
Login to your account on the Storrs HPC cluster and start an interactive slurm session:
ssh hpc
gensrun
If gensrun fails, you can create the gensrun alias by opening the .bashrc file in your home directory:
nano ~/.bashrc
and typing the following (you can make it the first line of the file):
alias gensrun='srun -p general -q general --pty bash'
Save the .bashrc file and enter the following command (from within your home directory) to load the .bashrc file (and thus load the gensrun alias):
. .bashrc
The initial . means “read the following file”. There should be a space between the initial . and .bashrc.
Create a directory for this exercise
Create a directory to use for this lab and navigate into that directory:
mkdir simlab
cd simlab
Using Seq-Gen to simulate sequence data
Compiling seq-gen
The seq-gen program is not installed on the cluster, so you will need to download the source code and compile it yourself. What you will download is the source code (written in the computing language C), which needs to be compiled into an executable file before it can be used.
Download the seq-gen source code from GitHub using the following curl command:
curl -LO https://github.com/rambaut/Seq-Gen/archive/refs/tags/1.3.4.tar.gz
Unpack the downloaded “tape archive” file 1.3.4.tar.gz as follows:
tar zxvf 1.3.4.tar.gz
You should now have a directory named Seq-Gen-1.3.4 inside your simlab directory. You can now delete the 1.3.4.tar.gz file:
rm 1.3.4.tar.gz
Navigate into the source subdirectory of the new Seq-Gen-1.3.4 directory:
cd Seq-Gen-1.3.4/source
There is a file inside this directory named Makefile that contains instructions for building seq-gen from the C source files (which are the files with names that end in either .c or .h).
All you need to do now is type make to process these instructions:
make
The C compiler will compile (note the -c on the command line) all the files ending in .c (including, as necessary, the files ending in .h), producing object files with the same prefix but ending in .o. You can ignore the warning that is issued along the way (errors are harder to ignore, but fortunately there should be no errors). These object files are then linked together (note the -o rather than -c this time) in the final step to create the executable file, which is named seq-gen.
It is customary to store executable files in a directory named bin (bin is short for binary because an executable file is just a sequence of 1s and 0s). You should already have a bin directory (you earlier copied the paup executable file there).
Move the seq-gen executable file into your ~/bin directory so that it is easier to use for this lab:
mv seq-gen ~/bin
You can now delete the Seq-Gen-1.3.4 directory:
cd ~/simlab
rm -rf Seq-Gen-1.3.4
The -rf part of the remove (rm) command says to delete files recursively (-r) and forcibly (-f). The “forcibly” part means that rm will not ask if it is OK to delete any file it finds, it will just delete it. The recursive part is necessary because this is a directory and rm will not do anything with a directory unless you specify that the removal is recursive.
Viewing the seq-gen documentation
The documentation is in the form of an HTML file, Seq-Gen.Manual.html, which is located in the Seq-Gen-1.3.4/documentation directory. It is not that convenient to view web pages on the cluster, so you may wish to download the zip file version to your local laptop in order to get the file Seq-Gen.Manual.html where you can open it in a web browser.
Using seq-gen
Return to the simlab directory now:
cd ~/simlab
Using nano, create a file named tree.txt that contains the following single line:
(A:1.0,B:1.0,((C:1.0,D:1.0):1.0,(E:1.0,F:1.0):1.0):1.0)
Now, create a file named sg.sh containing the following:
seq-gen -mHKY -l10000 -n1 -p1 -on < tree.txt > simdata.nex
# -mHKY use the HKY substitution model
# (HKY, F84, GTR, JTT, WAG, PAM, BLOSUM, MTREV, CPREV or GENERAL)
# -l10000 sequence length (number of sites to simulate)
# -n1 number of data sets to simulate from this gene tree
# -p1 number of data partitions
# -f 0.25 0.25 0.25 0.25 nucleotide frequencies
# -t2.0 transition/transversion ratio (not the rate ratio!)
# -r 1.0 4.0 1.0 1.0 4.0 1.0 GTR exchangeabilities)
# -a0.5 shape parameter of gamma among-site rate heterogeneity
# -i0.5 proportion of invariable sites
# -z123579 random number seed
# -on output format (on=nexus, op=phylip, or=relaxed phylip)
# -x paupblock.txt insert contents of the file paupblock.txt after each simulated data set
# -s1.0 branch length scaling factor (1.0 says to not scale branch lengths at all)
#
# tree.txt should look something like this:
# (A:1.0,B:1.0,((C:1.0,D:1.0):1.0,(E:1.0,F:1.0):1.0):1.0)
#
# paupblock.txt, if used, should look something like this:
# begin paup;
# set crit=like;
# lset nst=1 basefreq=equal rates=equal;
# hsearch;
# end;
Most of the contents of this file are comments (lines starting with #). I included those as a sort of cheat-sheet for using seq-gen to save having to look everything up in the manual.
You will need to execute the command on the first line of the sg.sh file, and the easiest way to run Linux commands that are stored in a file is to “source” the file:
. sg.sh
The initial dot tells the bash interpreter to simply issue the commands found in the following “shell script” file as if you typed them into the console.
You could skip the creation of sg.sh in the first place by simply issuing the command on the command line, like this:
seq-gen -mHKY -l10000 -n1 -p1 -on < tree.txt > simdata.nex
I showed you how to store the command in a file because that provides a record of what you did and allows you to easily modify the command and run it again.
Note: if the sg.sh file fails to run, it may be because the system cannot find the seq-gen executable file. Try replacing seq-gen in the sg.sh script with $HOME/bin/seq-gen. If this works, it means that something went wrong when you worked through the section on Creating your own bin directory in the first lab of the semester. Feel free to ask one of us for help getting your bin directory set up correctly.
Take a look at the file seq-gen generated. Can you explain why nearly every site shows evidence of substitution? (hint1: look at the branch lengths specified in the true tree)
Modify your sg.sh specifying a branch length scaling factor of 0.0001 and rerun it.
seq-gen -mHKY -l10000 -n1 -p1 -on -s.0001 < tree.txt > simdata.nex
Take a look at the file seq-gen generated. How many sites would you expect to look at before seeing one that shows evidence of substitution? (hint: I’m not asking you to count constant sites! You can answer this using the true tree branch lengths and scaling factor)
Analyzing multiple simulated data sets
Ordinarily, simulation studies involve analyzing hundreds if not thousands of simulated data sets to make overall trends discoverable. Let’s use seg-gen to generate several simulated data sets and analyze each with PAUP* under the parsimony and likelihood criteria.
Using nano, replace the contents of your tree.txt file with the following single line:
(A:0.05,B:0.05,(C:0.05,D:0.05):0.05)
Now, replace the first line of your sg.sh file with this (note: only the first, uncommented line matters):
seq-gen -mHKY -l1000 -n10 -p1 -on -x paupblock.txt < tree.txt > simdata.nex
Finally, use nano to create a file named paupblock.txt with these contents:
begin paup;
set crit=parsimony;
hsearch;
savetrees file=parsimony-results.tre format=newick brlens append;
set crit=likelihood;
hsearch;
savetrees file=likelihood-results.tre format=newick brlens append;
end;
Run sg.sh as before:
. sg.sh
Use the tail command to see the last 30 lines of the file simdata.nex:
tail -n 30 simdata.nex
How many data sets did seq-gen simulate? (Hint: seq-gen reports this number in a comment after “Begin DATA;”, but you may need to scroll up a bit to see it)
What part of the seq-gen command told it to generate this many data sets?
Run simdata.nex in PAUP*:
paup -L logfile.txt simdata.nex
You will need to type q at PAUP’s prompt to get out of PAUP.
Type paup --help to see what the command line switch -L does.
Download the two files parsimony-results.tre and likelihood-results.tre to your local laptop using either Cyberduck or the scp command. The scp command would look something like this (type this into a terminal on your local laptop):
scp hpc:simlab/parsimony-results.tre .
scp hpc:simlab/likelihood-results.tre .
How did parsimony and likelihood do?
Open parsimony-results.tre in FigTree on your local laptop and flip through the trees.
How many trees obtained using the parsimony criterion have a topology identical to the true tree?
Open likelihood-results.tre in FigTree on your local laptop and flip through the trees.
How many trees obtained using the likelihood criterion have a topology identical to the true tree?
Long branch attraction and the Felsenstein zone
In addition to inventing the field of likelihood-based phylogenetics (Felsenstein 1981), Joe Felsenstein published an earlier paper in 1978 that created quite a stir for quite some time titled “Cases in which parsimony or compatibility methods will be positively misleading.” This is the paper that first considered the phenomenon of long branch attraction (LBA).
Let’s demonstrate LBA using simulation.
Using nano, replace the contents of your tree.txt file with the following single line:
(A:0.5,B:0.05,(C:0.05,D:0.5):0.05)
Before going further, copy this tree description and paste it into FigTree to see what this tree looks like.
The key is that this tree has two unrelated edges that are an order of magnitude longer than all other edges in the tree.
What split characterizes the true tree: AB|CD, AC|BC, or AD|BC?
Replace the contents of your sg.sh file with this (I’ve left out the comments this time):
seq-gen -mHKY -l1000 -n1000 -p1 -on -x paupblock.txt -z12345 < tree.txt > simdata.nex
Note that I’ve set -n1000 instead of -n10 this time and have specified a random number seed -z12345.
Replace the random number seed above with one of your choosing (try to think of one that no one else will think of) so that we all get different results. We can increase our sample size from 1000 to N*1000 by combining results from N people.
Run seq-gen to regenerate simdata.nex:
. sg.sh
Before running simdata.nex in PAUP*, delete your logfile.txt, likelihood-results.tre and parsimony-results.tre files so that we don’t mix results from two different analyses:
rm logfile.txt likelihood-results.tre parsimony-results.tre
Run PAUP* as before:
paup -L logfile.txt simdata.nex
How did likelihood fare?
This time, let’s avoid the tedium of actually counting how many trees were estimated correctly by using PAUP*’s treedist command.
You should still be in PAUP* (evidenced by the prompt paup>). If you quit paup, start it up again without specifying a data file by typing just paup.
Load the trees from likelihood-results.tre into PAUP*:
paup> gettrees file=likelihood-results.tre
PAUP* will probably ask you some questions at this point:
- Respond
yto the question about increasingmaxtrees - Respond to the question about a new values for
maxtreesby accepting the default (just hit Enter) - Respond to the question about the action to take if this limit is hit by typing
2
PAUP* will respond by saying 1000 trees read from file.
Using the contree command to summarize splits
Start PAUP*, read in the trees from the likelihood-results.tre file, and compute a majority-rule consensus tree (majrule) while turning off calculation of the default strict concensus (nostrict):
paup
paup> gettrees file=likelihood-results.tre
paup> contree all / nostrict majrule
Note the table at the bottom of the output that summarizes all splits. Since these are 4-taxon trees, there is only one internal split per tree, so the numbers in this table equate to the number of trees having a particular topology.
How many likelihood trees were AB|CD? AC|BD? AD|BC? (Make sure your numbers add up to the total number of trees.)
How did parsimony fare?
There are probably more than 1000 trees in parsimony-results.tre because of ties (some simulated data sets will probably result in two or more trees with the same parsimony score).
To prevent PAUP* from asking whether we want to increase maxtrees, let’s change some settings:
paup> set maxtrees=1000 increase=auto autoinc=200
Load the trees from parsimony-results.tre into PAUP*:
paup> gettrees file=parsimony-results.tre
PAUP* should respond by saying something like
Keeping: trees from file (replacing any trees already in memory)
1012 trees read from file
(although your results may differ slightly).
How many parsimony trees were AB|CD? AC|BD? AD|BC? (Use the consensus tree approach again)
Explain why this phenomenon is called long branch attraction.
Is likelihood immune from LBA?
I don’t want to leave you with the notion that likelihood is immune from problems such as LBA.
Revise your sg.sh file to contain this seq-gen command:
seq-gen -mHKY -l1000 -n1000 -p1 -t2.0 -on -x paupblock.txt -z12345 -a0.1 < tree.txt > simdata.nex
The only thing I’ve added is -a0.1 which adds a considerable amount of among-site rate heterogeneity to the model used to simulate the data. This means that many sites will evolve very slowly but some sites will evolve very quickly (the overall mean rate is the same). We will not tell the likelihood model used by PAUP* about this rate heterogeneity, so the analysis model will assume that every site evolves under the same rate.
Run sg.sh to generate simdata.nex anew.
Be sure to delete or rename your logfile.txt, likelihood-results.tre, and parsimony-results.tre files before running PAUP* and analyzing the results.
How many likelihood trees were AB|CD? AC|BD? AD|BC when rate heterogeneity was present?
Is likelihood susceptible to LBA if the model is incorrect in an important way?
Literature Cited
Felsenstein, J. 1978. Cases in which parsimony or compatibility methods will be positively misleading. Systematic Biology 27:401-410.
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution 17:368-376.