IQ-TREE Lab

Goals

Today you will continue learning about maximum likelihood inference and will learn to use a program (IQ-TREE) known for the diversity of models it offers and its ability to handle large-scale phylogenetic analyses.

Objectives

After completing this lab, you will be able to…

estimate a phylogeny using IQ-TREE
obtain Ultrafast Bootstrap support values for the splits in the tree
use IQ-TREE to find the optimal data partitioning scheme
use BIC values to choose the best model (where model includes the tree as well as the substitution model)
use IQ-TREE to find loci that conflict with other loci, suggesting paralogy, horizontal transfer, etc.

Getting started

srun --qos=mcbstudent --partition=mcbstudent --pty bash

Ordinarily, we would use the command “module avail” to see a list of all modules and, finding iqtree/2.2.2 in the list, we would load iqtree using the module load command. Unfortunately, we need version IQ-TREE version 2.2.2.6 for this lab! So, instead of loading the module, we’ll download the latest version of IQ-TREE to our iqlab directory and run it from the downloaded executable.

First, download the latest version of IQ-TREE for “64-bit multicore Linux” (I got the link from the downloads web page):

curl -LO https://github.com/iqtree/iqtree2/releases/download/v2.2.2.6/iqtree-2.2.2.6-Linux.tar.gz

The L is important in this case because there is a redirect involved (i.e. the url does not point directly to the file we need, so L tells curl to follow redirects until it gets to the final destination).

This should result in a file named iqtree-2.2.2.6-Linux.tar.gz, which is a so-called “gzipped tar file”. The gz part of the name tells you that the file is compressed using the gzip algorithm, and the tar part of the name stands for Tape ARchive, which is an old term used for a single file that is actually a bunch of tiles concatenated together (for example, an entire directory of files). The tape part is because we used to use magnetic tape rather than hard drives to store computer data.

How do you extract a tar file? Using the tar command of course!

tar zxvf iqtree-2.2.2.6-Linux.tar.gz

The zxvf options say to gunzip the file (z), then extract the files therein (x), be verbose (v) and the file to process follows (f). This will create a directory named iqtree-2.2.2.6-Linux.

Navigate into that directory and then further navigate into the bin directory. We will copy the iqtree2 file (which is the actual program) to our iqlab directory so that it is easier to access.

Starting in the iqlab directory…

cd iqtree-2.2.2.6-Linux/bin
mv iqtree2 ../..
cd ../..

You can now delete the tar file and the directory that was created when we extracted it:

rm -rf iqtree-2.2.2.6-Linux
rm -f iqtree-2.2.2.6-Linux.tar.gz

(The -rf part tells the rm command to delete recursively - i.e. go into each subdirectory and get rid of everything, with the f part meaning “force”, which just tells rm to do its job without asking permission to delete every file it finds (which gets very tedious).

About IQ-TREE

IQ-TREE is known for its speed and ability to handle large nucleotide or protein datasets, and for the diversity of models it offers. Its main competitor is RAxML, which is also an excellent choice. We don’t have time to introduce you to every program that does phylogenetics; we’ve chosen IQ-TREE over RAxML mainly because it has truly excellent documentation. The explanation of options for this software is far better than for most other programs you will use this semester, and you can learn a lot about phylogenetics just looking up how to do things with IQ-TREE!

Start the tutorial

The tutorial below is a shortened version of the one written by Bui Minh (the primary developer of IQ-TREE) for the 2023 Woods Hole Workshop in Molecular Evolution.

Create a directory for today

Create a new directory named iqlab to store data and output files for today’s lab. Navigate into the iqlab directory using the cd command.

Copy the template

Copy the text below into a text file to record you answers to questions posed in this lab.

1. How many taxa are represented in the turtle.fa file?
answer: 

2. How many loci are represented in the turtle.fa file?
answer: 

3. How many substitution models were tested in the unpartitioned analysis?
answer:

4. What was the best model according to the Bayesian Information Criterion (BIC)?
answer:

5. Which of the 3 hypotheses (BC, BT, or CT) is supported by the ML tree for the unpartitioned analysis? 
answer:

6. What is the BIC score for the unpartitioned ML tree?
answer:

7. What is the ultrafast bootstrap support for this hypothesis in the unpartitioned ML tree?
answer:

8. Which of the 3 hypotheses (BT, BC, or CT) is supported by the partitioned ML tree?
answer:

9. What is the ultrafast bootstrap support for this hypothesis in the partitioned ML tree?
answer:

10. What is the BIC score for the partitioned ML tree? Is it better (smaller) or worse (larger) than the BIC for the unpartitioned ML tree?
answer:

11. Is Tree1 the tree from the unpart or fullpart analysis?
answer:

12. Which 2 loci (counting over starting at 1 for the leftmost bar in the plot) stand out as saying something very different than the others?
answer:

13. What are these loci? (i.e. look them up in the turtles.nex file you downloaded at the start of the lab)
answer:

14. Which hypothesis of turtle relationships do these 2 loci support?
answer:

15. Which hypothesis of turtle relationships do 20 of the remaining 27 loci support?
answer:

16. Which hypothesis is now supported by each analysis and what is the ultrafast support and BIC score?
(fill out the table below):

---------------------------------------------------
              Hypothesis     ufboot             BIC 
---------------------------------------------------
unpart     
---------------------------------------------------
fullpart                                           
---------------------------------------------------
bestpart
---------------------------------------------------

17. Did merging partition subsets improve the fit of the model? What evidence did you use to decide this?
answer:

18. Did removing the paralogous loci improve confidence in one hypothesis over the others? What evidence did you use to decide this?
answer:

Download the data

Start by downloading the sequence data file (turtle.fa) and partition file (turtle.nex) that we will be analyzing today:

curl -O http://www.iqtree.org/workshop/data/turtle.fa
curl -O http://www.iqtree.org/workshop/data/turtle.nex

The file turtle.fa contains the actual sequence data. It is in FASTA format, which is very simple: the data for each taxon begins with a separate line beginning with a greater-than symbol (>) followed by the taxon name and sometimes other information. The sequence data for that taxon begins on the next line and continues over possibly several lines until the end of the file or another > is encountered.

How many taxa are represented in the turtle.fa file?

The file turtle.nex is a NEXUS format file but does not contain sequence data! Instead this file contains only a sets block with charset statements showing the range of sites corresponding to each locus.

How many loci are represented in the turtle.fa file?

Background

We will use these data to infer relationships between three familiar groups: turtles (T), crocodiles (C), and birds (B). There are three possible rooted trees for the three taxa, shown in the figure on the right (image credit: Phylopic). I will refer to these three hypotheses simply by the taxa that are sister to each other: i.e. BC, BT, CT.

These data represent a small subset of the 248-locus, 16-taxon data from Chiari et al. 2012, who concluded that their data “…unambiguously support turtles as a sister group to birds and crocodiles” (i.e. BC) however they found that CT was supported by “more simplistic” models and attributed this artifactual grouping to substitutional saturation at 3rd codon positions.

During the lab today, you will discover that a couple of loci tell a different story than the others and are responsible for the tendency to get the CT grouping under common models. The fact that these loci are paralogs was discovered by Jeremy Brown and Robert Thomson in 2017.

Let’s try to figure out which two loci are the paralogs. Start by reminding yourself of the difference between paralogs and orthologs and why unknowingly mixing paralogs and orthologs can be misleading (see slides 38 and 40 in the first lecture).

Unpartitioned analysis

Create a directory inside iqlab named unpart. Use the cd command to navigate into the unpart directory.

Use this command to estimate the tree using maximum likelihood criterion:

../iqtree2 -s ../turtle.fa -B 1000 -nt AUTO --prefix unpart

Here is a key to the settings we chose on the command line:

-s specifies that the sequence data is in the file turtle.fa
-B says to perform 1000 ultrafast bootstraps
-nt AUTO says to figure out automatically how many threads to use
--prefix unpart says to begin the name of every output file with unpart

Note that we had to specify ../turtle.fa rather than turtle.fa because this file is located in the parent directory (iqlab) of the unpart directory in which you are now located.

A node in a cluster is a machine that may have multiple cores, where a core is a single processor. A thread is a part of a program that can be run independently of other parts. A single program can use multiple threads of execution to perform calculations in parallel, saving time. Ideally, each thread operates on a different core, otherwise threads compete with one another for time on the processor.

By specifying -nt AUTO (“nt = number of threads”), IQ-TREE will use as many cores as possible on the node that you were assigned using the srun command (and will use one thread per core). If you leave off the -nt specification, IQ-TREE will tell you in its output how many threads it could have used, but will actually just use 1 of them.

The -B setting tells IQ-TREE to assess support for individual edges (splits) using ultrafast bootstrapping. We will discuss what bootstrapping is in lecture (and how ultrafast bootstrapping differs from standard bootstrapping), but, for now, just know that this option will end up assigning a value to each edge in the summary consensus tree that tells you how plausible that edge is. Edges with ultrafast bootstraps > 95 are strongly supported, values smaller than that are more questionable.

This analysis will take awhile (most of the time is spent figuring out which substitution model is best), so you might use the time to read on past the next couple of questions to get some background. Once the unpartitioned analysis finishes, view the output file unpart.iqtree in your favorite text editor (nano, BBEdit, Notepad++, etc.) and answer these questions:

How many substitution models were tested in the unpartitioned analysis?

What was the best model according to the Bayesian Information Criterion (BIC)?

An F in a model name means “empirical nucleotide frequencies”, an I in the name means “proportion of invariable sites”, and an R refers to the “free rates” model of among-site rate heterogeneity.

BIC will make more sense once you’ve had the introduction to Bayesian statistics, but, for now, just remember that models with smaller BIC values are better. BIC and the Akaike Information Criterion (AIC) are alternatives to using likelihood ratio tests and have the advantage that the models compared do not need to be nested. We will exclusively use BIC in this lab, but AIC would probably make the same model choices. The two approaches differ only in how much they penalize models for complexity (i.e. number of estimated parameters), with BIC imposing a larger penalty for each extra parameter.

Considering that C comprises alligator and caiman, B comprises Taeniopygia and Gallus, and T comprises Emys, Chelonoidis, Caretta, and Phrynops…

Which of the 3 hypotheses (BC, BT, or CT) is supported by the ML tree for the unpartitioned analysis?

What is the BIC score for the unpartitioned ML tree?

You can answer the following question from looking at the text-based depiction of the tree in unpart.iqtree or you can download unpart.treefile to your laptop and open it in FigTree for a graphical depiction. To see the bootstrap support, type ufboot when it asks for label upon opening the file, then check the “Branch labels” check box, expand that tab, and specify ufboot for “Display”. When FigTree opens, if there is some information about nodes or edges that is encoded in the tree description, it asks you what you want to call it. I told you to call it ufboot but you could have typed something else. Whatever you type will show up in the “Display” dropdown control.

What is the ultrafast bootstrap support for this hypothesis in the unpartitioned ML tree?

Partitioned analysis

Let’s now analyze the data when every locus has its own distinct relative rate. This means adding 28 relative rate parameters, one for each locus except for one (that final rate can be obtained from the others because the average relative rate must equal 1).

Create a directory named fullpart under iqlab that is at the same level as your unpart directory and navigate into the new fullpart directory.

Enter this command to run IQ-TREE:

../iqtree2 -s ../turtle.fa -p ../turtle.nex -B 1000 -nt AUTO --prefix fullpart

Some of these options you saw before, but here is a key to all of them anyway:

the -s specifies the sequence data file
the -p specifies the partition file (which shows which sites are associated with each locus)
the -B specifies the number of ultrafast bootstraps
the -nt AUTO means “determine the number of threads to use automatically” with files created earlier
the --prefix fullpart means that each output file will start with fullpart

Which of the 3 hypotheses (BT, BC, or CT) is supported by the partitioned ML tree?

What is the ultrafast bootstrap support for this hypothesis in the partitioned ML tree?

What is the BIC score for the partitioned ML tree? Is it better (smaller) or worse (larger) than the BIC for the unpartitioned ML tree?

Looking for outlier loci

Hopefully you discovered that partitioning makes a difference in which hypothesis of turtle relationships is favored. The question we’ll ask in this section is: “Do all gene loci agree, or is there conflict among loci as to the best topology”.

We’ll do this by computing the log likelihood for each locus on each of the two trees. The log likelihood is a measure of how well the model (which includes the tree topology and edge lengths) fits the data for a particular locus. If the log likelihood for a locus is higher on tree 1 than it is on tree 2, then the data for that locus fit tree 1 better (i.e. we are less surprised at the data for that locus if we assume tree 1 was the true tree).

Create a directory named locuscomp (short for “locus comparison”) at the same level as fullpart and unpart and navigate into it.

First, we need to get the two trees we want to compare into the same file. You can use the cat (“concatenate”) command to merge the two files unpart.treefile and fullpart.treefile into a new file named turtle.trees:

cat ../unpart/unpart.treefile ../fullpart/fullpart.treefile > turtle.trees

Second, run IQ-TREE to generate a file of locus-log-likelihoods:

../iqtree2 -s ../turtle.fa -p ../fullpart/fullpart.best_scheme.nex -z turtle.trees -n 0 -wpl --prefix locuscomp

Here is a key to the settings:

the -s specifies the sequence data file
the -p specifies the partition file (we’ll use the partitioning scheme from the fullpart analysis)
the -z specifies the file containing the trees to compare
the -n 0 means “do zero iterations to improve the likelihood” (we just want the likelihood of the tree as it is)
the -wpl invokes the “writing partition log-likelihoods” functionality
the --prefix locuscomp makes all output file names begin with locuscomp so that we don’t overwrite files created earlier

When the analysis is finished, you should find a file named locuscomp.partlh in the directory. Make a copy of this file so that we can modify it without destroying the original:

cp locuscomp.partlh locuscomp.R

The copy ends in the filenname extension .R. This is because we are going to convert this file (i.e. edit it in a text editor) into a file containing commands that can be executed by the programming language R.

Open the file locuscomp.R in a text editor (e.g. nano). It should look like this:

2 29
Tree1      -5082.67 -3414.03 -3626.37 -4298.16 ... -4394.24
Tree2      -5086.25 -3415.55 -3609.85 -4288.83 ... -4389.8

I’ve used elipses (...) to stand for 24 values left out. The 29 numbers on the 2nd and 3rd lines of this file are the log likelihoods for the loci specified in the ../fullpart/fullpart.best_scheme.nex file for one of the two trees.

Is Tree1 the tree from the unpart or fullpart analysis?

Delete the first line (2 29) of this file and edit the two remaining lines to look like this (except your version will not have any ellipses!):

Tree1 <- c(-5082.67, -3414.03, -3626.37, -4298.16, ..., -4394.24)
Tree2 <- c(-5086.25, -3415.55, -3609.85, -4288.83, ..., -4389.8)

(Note the commas separating each value - don’t forget to add those.)

When R reads these lines, it will make two vectors of values corresponding to the log likelihoods of the 29 loci for Tree1 and Tree2

Now add a command at the end of the file to compute a vector of differences between the log likelihoods for Tree1 and Tree2 for each locus:

diffs <- Tree1 - Tree2

In R, subtracting one 29-element vector from another 29-element vector creates a new vector of length 29 containing the differences between each element.

Finally, add one more line telling R to create a bar plot showing these 29 differences graphically:

barplot(diffs)

Your final locuscomp.R file should now look like this (but, again, with no ellipses):

Tree1 <- c(-5082.67, -3414.03, -3626.37, -4298.16, ..., -4394.24)
Tree2 <- c(-5086.25, -3415.55, -3609.85, -4288.83, ..., -4389.8)
diffs <- Tree1 - Tree2
barplot(diffs)

Load the R module so that you can use R to run this file:

module load R/3.6.1

Now execute your locuscomp.R file in R:

R --no-save < locuscomp.R

This should generate a file named Rplots.pdf. Pull this file back down to your laptop (there is no easy way to view it on the Xanadu cluster) using Cyberduck or scp and open it.

Which 2 loci (counting over starting at 1 for the leftmost bar in the plot) stand out as saying something very different than the others?

What are these loci? (i.e. look them up in the turtles.nex file you downloaded at the start of the lab)

You should find that these two loci are the ones identified by Brown and Thomson 2019 as paralogs (see first paragraph of p. 522).

Which hypothesis of turtle relationships do these 2 loci support?

Which hypothesis of turtle relationships do 20 of the remaining 27 loci support?

Excluding the outliers

Let’s rerun both the unpart and fullpart analyses without these 2 loci and see if they now agree on turtle relationships.

Create a new directory final at the same level as unpart, fullpart, and locuscomp and navigate into it.

Copy the turtle.nex file from the parent directory into your new final directory:

cp ../turtle.nex finalpart.nex
cp ../turtle.nex finalunpart.nex

Edit the finalpart.nex, removing the two lines corresponding to the two paralogous loci.

Edit the finalunpart.nex file to remove the two problematic loci and pool sites from all other loci into a single big subset. Your finalunpart.nex file should look like this:

#nexus
begin sets;
    charset allorthologs = 1-9822 10369-12795 13243-20820;
end;

Run IQ-TREE to estimate an ML tree for the unpartitioned data:

../iqtree2 -s ../turtle.fa -p finalunpart.nex -B 1000 -nt AUTO --prefix unpart

Run IQ-TREE to estimate an ML tree for the partitioned data:

../iqtree2 -s ../turtle.fa -p finalpart.nex -B 1000 -nt AUTO --prefix fullpart

Finally, ask IQ-TREE to use the partition-finder algorithm to merge loci that are evolving at a rate that is similar enough that they can be modeled using the same relative rate:

../iqtree2 -s ../turtle.fa -p finalpart.nex -B 1000 -nt AUTO -m MFP+MERGE -rcluster 10 --prefix bestpart

The new options we haven’t seen before are:

the -m MFP+MERGE invokes the relaxed-clustering partition-finder algorithm
the -rcluster 10 says to ony examine the top 10% of partition schemes

Which hypothesis is now supported by each analysis and what is the ultrafast support and BIC score? (fill out the table below)

---------------------------------------------------
              Hypothesis     ufboot             BIC 
---------------------------------------------------
unpart     
---------------------------------------------------
fullpart                                           
---------------------------------------------------
bestpart
---------------------------------------------------

Did merging partition subsets improve the fit of the model? What evidence did you use to decide this?

Did removing the paralogous loci improve confidence in one hypothesis over the others? What evidence did you use to decide this?