Searching through treespace
Up to the Phylogenetics main page
This lab explores different search strategies under the parsimony criterion.
What to turn in
Please copy the questions in the template below into a plain text file (i.e., use Notepad++ on Windows or BBEdit on Mac). The questions below correspond to the sections in the lab tutorial starting with the “thinking” emoji. As you work through the lab, replace the underscores with your answers to these questions. After the lab, send your text file to Analisa via a Slack DM.
____ How many separate tree topologies did PAUP* examine?
____ What is the parsimony treelength of the best tree?
____ The worst tree?
____ How many steps separate the best tree from the next best?
____ Which tree was the original tree?
____ Which trees correspond to NNI rearrangments of which internal edges on the original tree?
____ How many tree islands were found?
____ What are the scores of the trees on the islands with the best parsimony score?
____ How long did the search take?
____ How many rearrangements were tried?
____ How many tree islands were found?
____ What are the scores of the trees in each island?
____ How long did the search take this time?
____ How many rearrangements were tried?
____ How many tree islands were found?
____ What are the scores of the trees in each island?
____ How long did the search take this time?
____ How many rearrangements were tried?
____ How many trees are currently in memory?
____ Has PAUP saved trees from all islands discovered during this search?
____ Explain why PAUP saved the number of trees it did.
____ Of the three types of branch swapping (NNI, SPR, TBR), which is most thorough?
____ Least thorough?
____ What is the minimum evolution score for the NJ tree?
____ What is the minimum evolution score for the tree found by heuristic search starting with the NJ tree?
____ What is wrong with this picture?
____ Why is the minimum evolution score of the heuristic search worse than that of the starting tree?
Getting started
Log into your account on the Health Center (Xanadu) cluster:
ssh username@xanadu-submit-ext.cam.uchc.edu
Then type:
srun --partition=mcbstudent --qos=mcbstudent --pty bash
This asks the scheduler to find a node (computer) in the cluster that is currently not busy. It should transfer your session from the head node to a different node. The reason we are using srun
today is that some of the analyses we are going to run take more than a few seconds to complete. If all of us ran long jobs on the head node simultaneously, users would notice significant slowdown in response time, which is very annoying to other users. Thus, we will start developing good habits and will use srun
to perform our interactive analyses on a node that no one else is currently using.
If the srun
command above fails to find a node for you, it may be because there is another class running at the same time and that class is occupying all the nodes assigned to the partition mcbstudent
. If that happens, try using --partition=general --qos=general
instead.
Once you see the prompt, type
module load paup/4.0a-166
This will make the most recent installed version available to you. Without this line, typing paup
may fail to do anything or may start a version of PAUP* that is (perhaps) years old!
Download a copy of the data file
If you have not already downloaded the angio35.nex file, you can recreate it in your current directory as follows:
curl -Ok https://gnetum.eeb.uconn.edu/courses/phylogenetics/lab/angio35.nex
Create a command file
Now start the nano editor (do not specify a file name because you will be creating a new file) and enter the following text, saving the file as run.nex
:
#nexus
begin paup;
log file=output.txt start replace;
execute angio35.nex;
delete 6-.;
alltrees fd=barchart;
quit;
end;
While it is possible to put a paup block directly into a data file such as angio35.nex, there are at least a couple of advantages to creating a separate NEXUS file like run.nex
that contains the paup block. First, executing run.nex
automatically starts a log file so that you will have a record of what you did. Second, it is not a great idea to put commands inside of the data file (angio35.nex) itself. Doing so may seem convenient at the time, but you will later forget that executing angio35.nex involves starting a possibly long-running analysis that also may immediately overwrite output files that took a long time to create (if the analysis is indeed long-running).
Note that because we used the replace
keyword in the log
command, the file output.txt
will be overwritten without warning if it exists. Yes, this is called living dangerously but saves some frustration if a run must be restarted.
The delete command
delete 6-.
causes PAUP* to ignore all taxa except Ephedrasinica, Gnetum_gnemJS, WelwitschiaJS, Ginkgo_biloba, and Pinus_ellCH.
The alltrees
command conducts an exhaustive search under the parsimony criterion (parsimony is the default optimality criterion).
The quit
command causes PAUP* to quit.
Start PAUP*, specifying run.nex
as the file to execute:
paup run.nex
You will find that you need to type y
in order to answer the question Warning: there are unsaved trees. Do you want to quit anyway?. You can avoid having to answer that question each time you execute run.nex by placing the following just above the line containing the quit
command:
set nowarntsave;
This analysis should finish very quickly because you now have only 5 taxa. The fd=barchart
setting tells PAUP to output a bar chart showing the distribution parsimony scores.
How many separate tree topologies did PAUP* examine? What is the parsimony treelength of the best tree? The worst tree? How many steps separate the best tree from the next best? (consult the bar chart to determine the answer)
Determine NNI rearrangements
Because we performed an exhaustive enumeration, we now know which tree is the globally most parsimonious tree. We are thus guaranteed to never find a better tree were we to start an heuristic search with this tree.
Let’s do an experiment: perform an NNI heuristic search, starting with the best tree, and have PAUP* save all the trees it encounters in this search. In the end, PAUP* will have 5 trees in memory: the starting tree and the 4 trees corresponding to all possible NNI rearrangements of that starting tree.
Exercise to turn in before the end of lab
Before you start the NNI search, use the showtrees
command to show you the tree obtained from the exhaustive enumeration. (Note that the quit
command in your run.nex
file caused PAUP* to quit, so you will need to modify the run.nex
file, adding the showtrees
command just before the quit
command, and run PAUP* again.)
Draw this tree as an unrooted tree on a piece of paper, abbreviating the taxa as E for Ephedra, P for Pinus, W for Welwitschia, Gg for Gnetum gnemon, and Gb for Ginkgo biloba.
Draw the 4 possible NNI rearrangements (refer to the description of NNI in your lecture notes (slide 13) if you’ve forgotten) and label them with the tree number from the PAUP* output (which you shall obtain shortly).
Perform an NNI search
Add an hsearch
and describe
command to your run.nex
file, which should afterwards look like this:
#nexus
begin paup;
log file=output.txt start replace;
execute angio35.nex;
delete 6-.;
[! *** Conduct an exhaustive search]
alltrees fd=barchart;
[! *** Here is the best tree from the exhaustive search]
showtree;
[! *** Conduct NNI heuristic search starting with best tree]
hsearch start=1 swap=nni nbest=15;
[! *** Describe all trees]
describe all;
[! *** Quit]
set nowarntsave;
quit;
end;
As you can see, I’ve added some
[!comments that begin with an exclamation point]
NEXUS comments that begin with ! are printable comments: they will appear in the output.txt log file, which provides a nice way to document what you did both in the command file as well as the output. The *** at the beginning of each comment is not necessary, but I find it helps to locate my annotations in the output file later.
The hsearch
command is broken down as follows:
-
start=1
starts the search from the tree currently in memory (i.e., the best tree resulting from your exhaustive search using the parsimony criterion) -
swap=nni
causes the Nearest-Neighbor Interchange (NNI) method to be used for branch swapping -
nbest=15
saves the 15 best trees found during the search. Thus, were PAUP to examine every possible tree, we would end up saving all of them in memory. The reason this command is needed is that PAUP ordinarily does not save trees that are worse than the best one it has seen thus far. Here, we are interested in seeing the trees that are examined during the course of the search, even if they are not as good as the starting tree.
The describe all
command plots the 5 trees currently in memory. The reason we are using the describe
command rather than the showtrees
command is because we want PAUP to show us the numbers it has assigned to the internal nodes, something that showtrees
doesn’t do.
Which tree was the original tree? Which trees correspond to NNI rearrangments of which internal edges on the original tree? (Specify edges on the original tree using pairs of node numbers: e.g. 36-38)
Find the most parsimonious tree for all 35 taxa
Modify your run.nex file to conduct a heuristic search on all 35 taxa having the following characteristics:
- The starting trees are each generated by the stepwise addition method, using random addition of sequences (you will employ the
addseq
andstart
keywords for this) - Swap using NNI branch swapping (you will employ the
swap
keyword for this) - Set the
nbest
option toall
because we want to be saving just the best trees, not suboptimal trees (yes, this option is a little confusing). - Set the random number seed to 5555 using the
rseed
option (this determines the sequence of pseudorandom numbers used for the random additions; ordinarily you would not need to set the random number seed, but we will do this here to ensure that we all get the same results) - Do 500 replicate searches; each replicate represents an independent search starting from a different random-addition tree (you will use the
nreps
keyword for this).
Use the following command to get PAUP to list the options for hsearch:
hsearch ?
Remember you can comment out portions of your Nexus file if you don’t want to lose them: e.g.,
#nexus
begin paup;
log file=output.txt start replace;
execute angio35.nex;
[delete 6-.;]
[alltrees fd=barchart;]
[showtree;]
[! *** Conduct NNI heuristic search starting with random stepwise addition]
hsearch [put your new options here];
[describe all;]
[! *** Quit]
set nowarntsave;
quit;
end;
How many tree islands were found? What are the scores of the trees on the islands with the best parsimony score? How long did the search take? How many rearrangements were tried?
Conduct a second search with SPR swapping
Construct another heuristic search using SPR branch swapping. Be sure to set the random number seed to 5555.
How many tree islands were found? What are the scores of the trees in each island? How long did the search take this time? How many rearrangements were tried?
Now conduct a third search with TBR swapping
How many tree islands were found? What are the scores of the trees in each island? How long did the search take this time? How many rearrangements were tried? How many trees are currently in memory (look for “Number of trees retained” in the output)? Has PAUP saved trees from all islands discovered during this search? (Hint: compare “Number of trees retained” to the sum of the “Size” column in the Tree-island profile.) Explain why PAUP saved the number of trees it did.
Wondering about this warning?
Multiple hits on islands of unsaved trees may in fact represent different islands.
When PAUP encounters a new island, it will find all trees composing that particular island in the process of branch swapping. If, in a new search (remember, you are performing 500 searches in each of these runs), it encounters any trees already stored in memory, it knows that it has hit an island that it found previously. Note that it would be pointless to continue on this tack, because it will only find all the trees on that island again. For trees retained in memory, PAUP can keep track of which island they belong to (remember that it is possible for trees with the same parsimony score to be in different tree islands!). But for trees that are not retained in memory, PAUP only knows that it has encountered an island of trees having score X; it has no way of finding out how many islands are actually represented amongst the trees having score X.
Of the three types of branch swapping (NNI, SPR, TBR), which is most thorough (i.e. examines the largest number of tree topologies)? Least thorough?
Compare NJ with a comparable heuristic search
In this exercise, you will conduct a Neighbor-joining (NJ) analysis using HKY distances and compare this with an heuristic search using the minimum evolution criterion. The goal of this section is to demonstrate that it is possible for heuristic searching to find a better tree than NJ, even using the same optimality criterion.
Create a new file using nano containing the following lines. Note that we are again using the angio35.nex file:
#nexus
begin paup;
execute angio35.nex;
dset distance=hky objective=me;
nj;
dscores 1; [!*** NJ score above ***]
hsearch start=1;
dscores 1; [!*** Heuristic search score above ***]
end;
(Reminder: the text surrounded by square brackets is a comment, and the initial exclamation point ! tells PAUP that you would like this comment to appear in the output.) Run this file in paup by typing the following at the linux prompt:
paup <filename>
(Of course, replace <filename>
with the actual name of the file you just created.)
What is the minimum evolution score for the NJ tree? (scroll down from the beginning of the PAUP* output looking for the phrase “ME-score” right above the point where the comment
*** NJ score above ***
was printed)
What is the minimum evolution score for the tree found by heuristic search starting with the NJ tree? (Look just above the comment
*** Heuristic search score above ***
)
What is wrong with this picture? Why is the minimum evolution score of the heuristic search worse than that of the starting tree? (Hint: take a look at the “Heuristic search settings” section of the output and, in particular, the “Optimality criterion” line.)
Once you have figured out what is going on (ask us for help if you are stumped), fix your paup block and re-execute the file. You may need to get PAUP to help you with the criterion setting; type the following to get PAUP to spit out the current settings, then look for criterion near the top of the list:
set ?
In your reanalysis, you should find that the heuristic search starting with the NJ tree found a better tree using the same optimality criterion (minimum evolution) being used by NJ. That is not to say that NJ does not find a good tree, and it might even find the true tree, but it usually will not find the best tree. NJ is an excellent way to obtain a starting tree for an heuristic search, however.