Lab 9 - Sequence Alignment, Phylogenetic Analysis and Tree Visualization

Acknowledgements

This lab is a slighlty modified update of Stephen Turner’s ggtree tutorial. Many of the examples here were modified from the ggtree vignettes.

Learning Objectives

Multiple Sequence Alignment
Phylogenetic Inference
Tree visualization using ggtree

Bioconductor

Bioconductor is an open source, open development software project for genomic data analysis. There are thousands of software packages available and an active user community. R/Bioconductor is one of the primary mechanisms for publishing new genomic data analysis tools. We will use Bioconductor in the this and the following weeks labs.

Installing Bioconductor, ggtree and Biostrings

Bioconductor is already installed on Unity.

Install Bioconductor locally on your computer or Unity by typing in the R console.

R code

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.19")

You may get an error message that BioConductor is already installed.

Individual packages that are not part of the BioConductor core such as ggtree and Biostrings can be installed using

R code

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("ggtree")

R code

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("Biostrings")

The `ggtree` Package

ggtree is an R package that extends ggplot2 for visualizating and annotating phylogenetic trees. It is available from Bioconductor. Bioconductor is a project to provide tools for analyzing and annotating various kinds of genomic data. You can search and browse Bioconductor packages here.

ggtree Bioconductor page: bioconductor.org/packages/ggtree.
ggtree homepage: guangchuangyu.github.io/ggtree (contains more information about the package, more documentation, a gallery of beautiful images, and links to related resources).
ggtree publication: Yu, Guangchuang Data Integration, Manipulation and Visualization of Phylogenetic Trees

Bioconductor packages usually have great documentation in the form of vignettes. Take a look at the landing page for ggtree – about halfway down the page under the “Documentation” heading there are multiple walkthrough tutorials directed to different applications and functionalities of ggtree, chock full of runnable examples and explanations.

Just like R packages from CRAN, you only need to install Bioconductor packages once, then load them every time you start a new R session. Load the tidyverse package first.

R code

# You may need to install the treeio, rphylopic and ggimage packages
library(tidyverse)
library(ggtree)
library(treeio)
library(rphylopic)
library(ggimage)

A note on masked functions: Take a second and look through some of the output that you see when you load ggtree after tidyverse. When you first installed ggtree it may have taken a while, because ggtree depends on a number of other R packages. Each of these, in turn, may depend on other packages. These are all loaded into your working environment when you load ggtree. Also notice the lines that start with The following objects are masked from 'package:.... One example of this is the collapse() function from dplyr. When ggtree was loaded, it loaded it’s own function called collapse(). Now, if you wanted to use dplyr’s collapse function, you’ll have to call it explicitly using this kind of syntax: dplyr::collapse(). See this Q&A thread for more.

Connecting to JupyterLab on Unity

In addition to RStudio, today we are going to connect to Unity using the JupyterLab interface

Once you open JupyterLab you can navigate to our course site by clicking on the folder icon just to the left of your /home

Than click on work > pi_bio678_umass_edu > data > lab9

You can now open these files by clicking on them. These are the files we will work with today.

***You will need to change the path to the files in the examples below by adding /work/pi_bio678_umass_edu/data/lab9/

Sequence Alignment Background

Sequence Alignment is one of the most fundamental operations of bioinformatics. Sequence alignments are used to decide if genes or proteins are related functionally, structurally and evolutionarily. It is also at the core of sequence search methods such as BLAST.

There are three principal methods of pair-wise sequence alignment

Dot matrix analysis.
The dynamic programming (or DP) algorithm.
Word or k-tuple methods (such as those used in BLAST and FASTA - which we will discuss later)

Dot matrix analysis

This method involves constructing a matrix with one of the sequences to be compared running horizontally across the bottom, and the other running vertically along the left-hand side. Each entry of the matrix is a measure of similarity of those two residues on the horizontal and vertical sequence. The simplest scoring system, distinguishes only between identical (dots) and non-identical (blank) residues.

The dot plot displays graphically any possible sequence alignments as diagonals on the matrix. Insertions, deletions, direct repeats and inverted repeats can be visually detected on a dot plot. Dot plot can also detect regions of RNA that are self-complementary and thus might form secondary structure.

Dot plot

Ref:

This dot plot represents an alignment of two different strains of Helicobacter pylori. Forward matches are shown in red, while reverse matches are shown in green. This alignment clearly shows a major inversion event centered around the origin of replication.UM

The dynamic programming algorithm

“Dynamic programming was formalized by mathematician Richard Bellman, who was working at RNAD corporation on optimal decision processes. He wanted to concoct an impressive name that would shield his work from US Secretary of Defense Charles Wilson, a man know to be hostile to mathematics research. His work involved times series and planning - thus ‘dynamic’ and ‘programming’ (note, nothing particularly to do with computer programming). Bellman especially liked ‘dynamic’ because”it’s impossible to use the work dynamic in the pejorative sense”; he figured dynamic programming was “something not even a Congressman could object to” (Bellman 1984).” (Eddy 2004)

The dynamic programming algorithm is at the heart of many bioinformatics programs including BLAST (sequence searching), FASTA (sequence searching), CLUSTALW (multiple sequence alignmnet), HMMER (profile searching), GENSCAN (gene finding), MFOLD (RNA folding), and PHYLIP (phylogenetic analysis). Dynamic programing was first used for global alignment of protein sequences by Needleman and Wunnsch (1970) and to find local alignments by Smith and Waterson (1981).

  MLIGKVIGSVVSTRKN-LNVGKFMIVEPLK
  |     |  |  ||   || ||||     |        Global Alignment
  MVAARLI-VWANTRKTSLNLGKFMLAEIIK
  
              TRKN-LNVGKFM
              ||   || ||||               Local Alignment
              TRKTDLNLGKFML

Subsitution penalties differ for particular nucleotides or aminoacids

Amino acids are put together into proteins based on their three nucleotide codons, and most mutational events usually only change one nucleotide at time. For example, starting with the Alanine codon GCU and looking at all possible changes on a http://en.wikipedia.org/wiki/Genetic_code#RNA_codon_table

GCU -> UCU = Alanine (A) -> Serine (S)
GCU -> ACU = Alanine (A) -> Threonine (T)
GCU -> CCU = Alanine (A) -> Proline (P)
GCU -> GUU = Alanine (A) -> Valine (V)
GCU -> GAU = Alanine (A) -> Aspartic Acid (D)
GCU -> GGU = Alanine (A) -> Glycine (G)
GCU -> GCA = Alanine (A) -> Alanine (A)
GCU -> GCC = Alanine (A) -> Alanine (A)
GCU -> GCG = Alanine (A) -> Alanine (A)

Thus, from the Alanine codon GCU with a single substitution we can only get to 6 of the 19 other amino acids and 33% of the changes are going to result in keeping Alanine at the site in the protein, the other six amino acids would result 11% of the time. Even if we look at all codons for Alanine, there are still 12 amino acids that can not be made by single substitution. Thus, the frequency of changing Alanine to another amino acid by a single nucleotide change would be

       A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
    A .33  0   0  .06  0   0  .06 .11  0   0   0   0   0   0  .11 .11 .11  0   0  .11

Origin of protein substitution matrices and bioinformatics

Protein substitution matrices were first developed by Margaret Dayhoff, one of the founders of the field of bioinformatics. She was particularly interested in the possibility of deducing the evolutionary relationships of organisms from protein sequences. Toward these ends she collected all the known protein sequences and, as a service to the scientific community, made them available to others in 1965 in a small book, the first Atlas of Protein Sequence and Structure.

Ref: http://en.wikipedia.org/wiki/Margaret_Oakley_Dayhoff

From these sequences, she and her coworkers developed a model of protein evolution which resulted in the development of a set of widely used substitution matrices. These are frequently called Dayhoff or PAM (Percent Accepted Mutation) matrices and similar matrices are used in many areas of bioinformatics including: sequence similarity searches (e.g. BLAST), multiple sequence alignment tools (e.g ClustalW), phylogenetics and identifying functional regions of proteins.

Natural selection governs which amino acid changes are observed

Just like mutation mucks with the frequency matrix so does Natural Selection. Changes that occur between amino acids that have different biochemical properties are likely to affect the function of the protein. Therefore, a substitution is more likely to occur between amino acids with similar biochemical properties. In the above example with the Alanine codon a substitution that yields an amino acid change would result in mostly changes to other neutral amino acids, whereas the frequency of change to Aspartic Acid would probably be much lower. Amazingly, the genetic code has evolved to minimize changes between amino acids with different biochemical properties.

Amino Acid          Side chain polarity Side chain acidity or basicity
 Alanine              nonpolar            neutral
 Arginine             polar                  strongly basic
 Asparagine           polar                  neutral
 Aspartic acid        polar                  acidic
 Cysteine             polar                  neutral
 Glutamic acid        polar                  acidic
 Glutamine            polar                  neutral
 Glycine              nonpolar               neutral
 Histidine            polar                  weakly basic
 Isoleucine           nonpolar               neutral
 Leucine              nonpolar               neutral
 Lysine               polar                  basic
 Methionine           nonpolar               neutral
 Phenylalanine        nonpolar               neutral
 Proline              nonpolar               neutral
 Serine               polar                  neutral
 Threonine            polar                  neutral
 Tryptophan           nonpolar               neutral
 Tyrosine             polar                  neutral
 Valine               nonpolar               neutral

Accounting for multiple substitutions

As time goes on and sequence divergence gets larger it gets harder to account for multiple substitutions at the same amino acid position. PAM protein matrices are based on global alignments of closely related proteins. However, sequence changes over long evolutionary time scales are not well approximated by compounding small changes that occur over short time scales. For comparing more distantly related sequence other types of matrices are used. One of the most common is the The BLOSUM (BLOck SUbstitution Matrix) series of matrices created by Steve Henikoff and colleagues. Both the PAM and BLOSUM are not expressed as transformation frequencies but the probabilities of transformation are expressed by log-odds scores as shown below for a BLOSUM62 matrix.

BLOSUM62

Ref: http://en.wikipedia.org/wiki/Substitution_matrix

Multiple Sequence Alignment

Why is multiple seqence alignment difficult?

If sequences are all the same length, alignment is trivial:

    KNMITGAAQMDGAILVVAATDGPMPQTREHVLLARQVEVP
    KNMITGAAQMDGAILVVSATDGAMPQTKEHILLARQVGVP
    KNMITGAAQMDGAILVVSATDGAMPQTKEHILLARQVGVP
    KNMITGAAQMDGAILVVSAADGPMPQTREHILLARQVGVP

This sequence alignment is unambiguous because there is no length variation among the sequences. No indels are needed to make the alignment, and the ungapped sequences can simply be arranged together. However, if the sequenes are of various lengths, problem is more complex, potentially very complex:

    RGSALKAVEAPNDPNHEA......YKPIQELLDAMDN.....YIPDPQRDVDKPFL
    RGSALKALEGDAAYIEKVR..........ELMQAVDD.....NIPTPEREIDKPFL
    RGSALKALE.....IEKVR..........ELMQAGDAAYVDDNIPTPEREIDKPFL
    RGSALLALEEMHKNPKTKRGENEWVDKIWELLDAIDE.....YIPTPVRDVDKPFL
    RGSALLALEQMHRNPKTRRGENEWVDKIWELLDAIDE.....YIPTPVRDVDKPFL
    KGSALQALEALQANPKTARGEDKWVDRIWELLDAVDS.....YIPTPERATDKTFL
    RGTARNALESPSKDIN....APEY.KCILELMNAVDE.....YIPTPQRAVDQPFL
    KGSALQALE....NAE....DEEKTKCIWELLQAMDD.....YIPAPERDIDKPFL
    KGSAFGAMS....NPE....DPESTKCVKELLESMDN.....YFDLPERDIDKPFL
    RGSAFAAMS....KPD....DPAATKCLDELLDTMDK.....YFVIPERALDKPFL

In many cases the best position to place an indel is ambiguous. Ideally, one would know the phylogeny for the sequences; this would help infer the sequence of indels. Unfortunately one normally needs a multiple sequence alignment to make inferences about how the sequences are related. Most alignment algorithms make a quick approximation of phylogeny, and then base alignments on these. Sound circular? You are right and this is a challenging problem that is at the forefront of research in phylogenetics…the joint estimation of the alignment and phylogeny. For this class we will stick to the traditional method of first aligning sequences followed by phylogenetic analysis.

Progressive alignment methods are efficient enough to implement on a large scale for many (100s to 1000s) sequences. Progressive alignment services are commonly available on publicly accessible web servers so users need not locally install the applications of interest. The most popular progressive alignment method has been the Clustal family Different portals or implementations can vary in user interface and make different parameters accessible to the user. Clustal Omega is used extensively for phylogenetic tree construction.

The basic steps in Clustal are:

Calculate all possible pairwise alignments, record the score for each pair
Calculate a guide tree based on the pairwise distances (algorithm: Neighbor Joining)
Find the two most closely related sequences
Align these sequences (algorithm: Smith-Waterman).
1. Calculate a consensus of this alignment
2. Replace the two sequences with the consensus
3. Find the two next-most closely related sequences (one of these could be a previously determined consensus sequence).
4. Iterate until all sequences have been aligned
Expand the consensus sequences with the (gapped) original sequences
Report the multiple sequence alignment

Software for sequence alignment

There are many tools available for sequence alignment. The common tools are hosting at the European Bioinformatics Institute.

The most commonly used are Clustal, Muscle and MAFFT. MAFFT is commonly implemented for working with large data sets and where speed is important (e.g. web servers)

On the Computer

Phylogenetic analysis

Carl Woese sequenced isolated and sequenced ribosomal RNA to discover a new domain of life, the Archaea. His discovery was dependent on using phylogenetic methods to determine the relationship of his microbial sequence to other sequences in the database. Since his discovery, progress sequencing specfic genes from genomic DNA has greatly simplified the process of understanding microbial phylogenies. In this project you will be given a DNA sequence, then construct a multiple sequence alignment and do a phylogenetic analysis to show the relationships among the taxa represented by the DNA sequences.

From a practical perspective, phylogenetic analysis is broken up into methods related to tree building and tree visualization. I have assembled a core group DNA sequences of small subunit ribosomal gene from our national DNA database Genbank to provide an example Tree_of_Life_Core_Sequences.txt. This file us in our Unity course directory /work/pi_bio678_umass_edu/data/lab9/ and on our course GitHub site directory contains data for the following species.

Thermoplasma volcanium
Halalkalicoccus jeotgali
Candidatus Korarchaeum cryptofilum
Nanoarchaeum equitans
Chlorobium chlorochromatii
Burkholderia cenocepacia
Rhizobium leguminosarum
Escherichia coli
Archaeoglobus fulgidus
Methanocaldococcus jannaschii
Pyrococcus abyssi
Oryza sativa nuclear
Oryza sativa mitochondrion
Oryza sativa chloroplast
Saccharomyces cerevisiae nuclear
Homo sapies nuclear
Drosophila yakuba nuclear
Amphidinium carterae nuclear
Thermotoga lettingae
Prochlorococcus marinus
Trypanosoma cruzi

NGPhylogeny.fr - A quick and easy working flow including multiple sequence alignment, phylogenetic analysis and tree visualization

This web server is great place to learn the basic workflow for phylogenetic analysis and I use it on occasion when I just need a quick see of the relationships among a small set of sequences. Doing a phylogenetic analysis involves (1) assembling a group of sequences to evaluate, (2) aligning the sequences so that a distance can be calculate between the sequences, (3) determining the relationships among the sequences, and (4) visualizing the relationships. Step 1 was accomplished above by getting representative archaeal, bacterial and eukaryotic sequences from GenBank. Steps 2-4 will be done on web server that has been set up to run these steps at the same time. Go to the site Robust Phylogenetic Analysis For The Non-Specialist. Scroll down and under Phylogenetic Analysis select “One Click”. Upload or paste in your DNA sequences from the file in Moodle. Then click submit. This will start the analysis process. The numbers of the tree represent statistical support for the relationship. The more robust the relationship the closer the value will be to 1. The scale at the bottom represents sequence distance (e.g. 0.2 is 20%).

The newick tree file was added to the /work/pi_bio678_umass_edu/data/lab9/ directory

Phylogenetic tree visualization

The tree visualizations on NGPhylogeny.fr are ok for some purposes, but often for publications more tree editing and visualization features are needed. There are a number of packages available to bring trees to like including Figtree, Dendroscope and iTOL. ETE Toolkit is a framework for analyzing and visualizing trees in Python. For a more complete list see Wikipedia.

Historically most phylogenetic analysis and tree visualization has been done with independent software tools. There has been a growing move towards using R for phylogenetic analyses. The package ape: Analysis of Phylogenetics and Evolution can access popular multiple sequence alignment and phylogenetic analysis methods. ggtree extends the ‘ggplot2’ for visualization and annotation of phylogenetic trees with their annotation data. The author, Guangchuang Yu, has recently published an R book Data Integration, Manipulation and Visualization of Phylogenetic Trees.

ggtree

Tree Import

From the ggtree landing page take a look at the Tree Data Import vignette. There are many different software packages for creating phylogenetic trees from different types of data, and there are many formats for storing the resulting phylogenetic trees they produce.

Most tree viewer software (including R packages) focus on Newick and Nexus file formats, and other evolution analysis software might also contain supporting evidence within the file that are ready for annotating a phylogenetic tree. ggtree supports several file formats, including:

and software output from:

The ggtree package implement several parser functions, including:

read.tree for reading Newick files.
read.phylip for reading Phylip files.
read.jplace for reading Jplace files.
read.nhx for reading NHX files.
read.beast for parsing output of BEAST
read.codeml for parsing output of CODEML (rst and mlc files)
read.codeml_mlc for parsing mlc file (output of CODEML)
read.hyphy for parsing output of HYPHY
read.jplace for parsing jplace file including output from EPA and pplacer
read.nhx for parsing NHX file including output from PHYLODOG and RevBayes
read.paml_rst for parsing rst file (output of BASEML and CODEML)
read.r8s for parsing output of r8s
read.raxml for parsing output of RAxML

Basic trees

Let’s first import our tree data. We’re going to work with a made-up phylogeny with 13 taxa tree_newick.nwk. Let’s load the libraries you’ll need if you haven’t already, and then import the tree using read.tree(). Displaying the object itself really isn’t useful. The output just tells you a little bit about the tree itself.

R code

tree <- read.tree("data/lab9/tree_newick.nwk")
# tree <- read.tree("/work/pi_bio678_umass_edu/data/lab9/tree_newick.nwk")
tree

Just like with ggplot2 we created a basic canvas with ggplot(...) and added layers with +geom_???(), we can do the same here. The ggtree package gives us a geom_tree() function. Because ggtree is built on top of ggplot2, you get ggplot2’s default gray theme with white lines. You can override this with a theme from the ggtree package.

Because you’ll almost always want to add a tree geom and remove the default background and axes, the ggtree() function is essentially a shortcut for ggplot(...) + geom_tree() + theme_tree().

R code

# build a ggplot with a geom_tree
ggplot(tree) + geom_tree() + theme_tree()
# This is convenient shorthand
ggtree(tree)

There’s also the treescale geom, which adds a scale bar, or alternatively, you can change the default ggtree() theme to theme_tree2(), which adds a scale on the x-axis. The horizontal dimension in this plot shows the amount of genetic change, and the branches and represent evolutionary lineages changing over time. The longer the branch in the horizonal dimension, the larger the amount of change, and the scale tells you this. The units of branch length are usually nucleotide substitutions per site – that is, the number of changes or substitutions divided by the length of the sequence (alternatively, it could represent the percent change, i.e., the number of changes per 100 bases).

R code

# add a scale
ggtree(tree) + geom_treescale()
# or add the entire scale to the x axis with theme_tree2()
ggtree(tree) + theme_tree2()

The default is to plot a phylogram, where the x-axis shows the genetic change / evolutionary distance. If you want to disable scaling and produce a cladogram instead, set the branch.length="none" option inside the ggtree() call. See ?ggtree for more.

R code

ggtree(tree, branch.length="none")

The ... option in the help for ?ggtree represents additional options that are further passed to ggplot(). You can use this to change aesthetics of the plot. Let’s draw a cladogram (no branch scaling) using thick blue dotted lines (note that I’m not mapping these aesthetics to features of the data with aes() – we’ll get to that later).

R code

ggtree(tree, branch.length="none", color="blue", size=2, linetype=3)

Exercise 1

Look at the help again for ?ggtree, specifically at the layout= option. By default, it produces a rectangular layout.

Create a slanted phylogenetic tree.
Create a circular phylogenetic tree.
Create a circular unscaled cladogram with thick red lines.

Other tree geoms

Let’s add additional layers. We can create a plot object, e.g., p, to store the basic layout of a ggplot, and add more layers to it as we desire. Let’s add node and tip points. Let’s finally label the tips.

R code

# create the basic plot
p <- ggtree(tree)
# add node points
p + geom_nodepoint()
# add tip points
p + geom_tippoint()
# Label the tips
p + geom_tiplab()

Exercise 2

Similar to how we change the aesthetics for the tree inside the ggtree() call, we can also change the aesthetics of the points themselves by passing graphical parameters inside the geom_nodepoint() or geom_tippoint() calls. Create a phylogeny with the following aesthetic characteristics:

tips labeled in purple
purple-colored diamond-shape tip points (hint: Google search “R point characters”)
large semitransparent yellow node points (hint: alpha=)
Add a title with + ggtitle(...)

Tree annotation

The geom_tiplab() function adds some very rudimentary annotation. Let’s take annotation a bit further.

Internal node number

Before we can go further we need to understand how ggtree is handling the tree structure internally. Some of the functions in ggtree for annotating clades need a parameter specifying the internal node number. To get the internal node number, user can use geom_text to display it, where the label is an aesthetic mapping to the “node variable” stored inside the tree object (think of this like the continent variable inside the gapminder object). We also supply the hjust option so that the labels aren’t sitting right on top of the nodes.

R code

ggtree(tree) + geom_text(aes(label=node), hjust=-.3)

Another way to get the internal node number is using MRCA() function by providing a vector of taxa names (created using c("taxon1", "taxon2")).. The function will return node number of input taxa’s most recent commond ancestor (MRCA). First, re-create the plot so you can choose which taxa you want to grab the MRCA from.

R code

ggtree(tree) + geom_tiplab()

Let’s grab the most recent common ancestor for taxa C+E, and taxa G+H. We can use MRCA() to get the internal node numbers. Go back to the node-labeled plot from before to confirm this.

R code

MRCA(tree, c("C", "E"))

[1] 17

R code

MRCA(tree, c("G", "H"))

[1] 21

R code

# MRCA(tree, tip=c("C", "E"))
# MRCA(tree, tip=c("G", "H"))

Labeling clades

We can use geom_cladelabel() to add another geom layer to annotate a selected clade with a bar indicating the clade with a corresponding label. You select the clades using the internal node number for the node that connects all the taxa in that clade.

Let’s annotate the clade with the most recent common ancestor between taxa C and E (internal node 17). Let’s make the annotation red. See ?geom_cladelabel help for more.

R code

ggtree(tree) + 
  geom_cladelabel(node=17, label="Some random clade", color="red")

Let’s add back in the tip labels. Notice how now the clade label is too close to the tip labels. Let’s add an offset to adjust the position. You might have to fiddle with this number to get it looking right.

R code

ggtree(tree) + 
  geom_tiplab() + 
  geom_cladelabel(node=17, label="Some random clade", 
                  color="red2", offset=.8)

Now let’s add another label for the clade connecting taxa G and H (internal node 21).

R code

ggtree(tree) + 
  geom_tiplab() + 
  geom_cladelabel(node=17, label="Some random clade", 
                  color="red2", offset=.8) + 
  geom_cladelabel(node=21, label="A different clade", 
                  color="blue", offset=.8)

Uh oh. Now we have two problems. First, the labels would look better if they were aligned. That’s simple. Pass align=TRUE to geom_cladelabel() (see ?geom_cladelabel help for more). But now, the labels are falling off the edge of the plot. That’s because geom_cladelabel() is just adding it this layer onto the end of the existing canvas that was originally layed out in the ggtree call. This default layout tried to optimize by plotting the entire tree over the entire region of the plot. Here’s how we’ll fix this.

First create the generic layout of the plot with ggtree(tree).
Add some tip labels.
Add each clade label.
Remember theme_tree2()? We used it way back to add a scale to the x-axis showing the genetic distance. This is the unit of the x-axis. We need to set the limits on the x-axis. Google around for something like “ggplot2 x axis limits” and you’ll wind up on this StackOverflow page that tells you exactly how to solve it – just add on a + xlim(..., ...) layer. Here let’s extend out the axis a bit further to the right.
Finally, if we want, we can either comment out the theme_tree2() segment of the code, or we could just add another theme layer on top of the plot altogether, which will override the theme that was set before. theme_tree() doesn’t have the scale.

R code

ggtree(tree) + 
  geom_tiplab() + 
  geom_cladelabel(node=17, label="Some random clade", 
                  color="red2", offset=.8, align=TRUE) + 
  geom_cladelabel(node=21, label="A different clade", 
                  color="blue", offset=.8, align=TRUE) + 
  theme_tree2() + 
  xlim(0, 70) + 
  theme_tree()

Alternatively, we could highlight the entire clade with geom_hilight(). See the help for options to tweak.

R code

ggtree(tree) + 
  geom_tiplab() + 
  geom_hilight(node=17, fill="gold") + 
  geom_hilight(node=21, fill="purple")

Connecting taxa

Some evolutionary events (e.g. reassortment, horizontal gene transfer) can be visualized with some simple annotations on a tree. The geom_taxalink() layer draws straight or curved lines between any of two nodes in the tree, allow it to show evolutionary events by connecting taxa.

R code

ggtree(tree) + 
  geom_tiplab() + 
  geom_taxalink("E", "H", color="blue3") +
  geom_taxalink("C", "G", color="orange2", curvature=-.9)

Exercise 3

Produce the figure below.

First, find what the MRCA is for taxa B+C, and taxa L+J. You can do this in one of two ways:
1. Easiest: use MRCA(tree, tip=c("taxon1", "taxon2")) for B/C and L/J separately.
2. Alternatively: use ggtree(tree) + geom_text(aes(label=node), hjust=-.3) to see what the node labels are on the plot. You might also add tip labels here too.
Draw the tree with ggtree(tree).
Add tip labels.
Highlight these clades with separate colors.
Add a clade label to the larger superclade (node=17) that we saw before that includes A, B, C, D, and E. You’ll probably need an offset to get this looking right.
Link taxa C to E, and G to J with a dashed gray line (hint: get the geom working first, then try changing the aesthetics. You’ll need linetype=2 somewhere in the geom_taxalink()).
Add a scale bar to the bottom by changing the theme.
Add a title.
Optionally, go back to the original ggtree(tree, ...) call and change the layout to "circular".

Advanced tree annotation

Let’s use a previously published dataset from this paper:

Liang et al. “Expansion of genotypic diversity and establishment of 2009 H1N1 pandemic-origin internal genes in pigs in China.” Journal of virology (2014): 88(18):10864-74.

This data was reanalyzed in the ggtree paper.

The subset of the data used here contains 76 H3 hemagglutinin gene sequences of a lineage containing both swine and human influenza A viruses. The sequence data set was re-analyzed by using BEAST (available at http://beast.bio.ed.ac.uk/). BEAST (Bayesian Evolutionary Analysis Sampling Trees) can give you rooted, time-measured phylogenies inferred using molecular clock models.

For this you’ll need the flu_tree_beast.tree output file from BEAST and the flu_aasequence.fasta FASTA file with the multiple sequence alignment. These are both available on the data downloads page. First let’s read in the tree with read.beast() (instead of the read.tree() we used before). Let’s add a scale bar with theme_tree2(). This gives you genetic distance. But, we have time measured here with molecular clock models. We’ve only estimated the relative time between branching events, so if we want to actually see dates on the x-axis, we need to supply the most recent sampling date to the ggtree() call. Do this by setting mrsd="YYYY-MM-DD" inside ggtree().

Finally, let’s add some tip labels. We’ll want to right-align them, and by default the dotted line is a little too thick. Let’s reduce the linesize a bit. Now, some of the labels might be falling off the margin. Set the xlim to limit the axis to show between 1990 and 2020. You could get MRCAs and node numbers and do all the annotations that we did before the same way here.

R code

# Read the data
tree <- read.beast("data/lab9/flu_tree_beast.tree")
# tree <- read.tree("/work/pi_bio678_umass_edu/data/lab9/flu_tree_beast.tree")
# supply a most recent sampling date so you get the dates
# and add a scale bar
ggtree(tree, mrsd="2013-01-01") + 
  theme_tree2() 
# Finally, add tip labels and adjust axis
ggtree(tree, mrsd="2013-01-01") + 
  theme_tree2() + 
  geom_tiplab(align=TRUE, linesize=.5) + 
  xlim(1990, 2020)

Finally, let’s look at ?msaplot. This puts the multiple sequence alignment and the tree side-by-side. The function takes a tree object (produced with ggtree()) and the path to the FASTA multiple sequence alignment. You can do it with the entire MSA, or you could restrict to just a window. Want something interesting-looking, but maybe not all that useful? Try changing the coordinate system of the plot itself by passing + coord_polar(theta="y") to the end of the command!

R code

# msaplot(p=ggtree(tree), fasta="/work/pi_bio678_umass_edu/data/lab9/flu_aasequence.fasta", window=c(150, 175))

msaplot(p=ggtree(tree), fasta="data/lab9/flu_aasequence.fasta", window=c(150, 175))

Bonus!

See the ggtree vignettes for more details on how these work.

Many trees

ggtree will let you plot many trees at once, and you can facet them the normal ggplot2 way. Let’s generate 3 replicates each of 4 random trees with 10, 25, 50, and 100 tips, plotting them all.

R code

set.seed(42)
trees <- lapply(rep(c(10, 25, 50, 100), 3), rtree)
class(trees) <- "multiPhylo"
ggtree(trees) + facet_wrap(~.id, scale="free", ncol=4) + ggtitle("Many trees. Such phylogenetics. Wow.")

Plotting trees with other data

For showing a phylogenetic tree alongside other panels with your own data, the facet_plot() function accepts a input data.frame and a geom function to draw the input data.

R code

# Generate a random tree with 30 tips
tree <- rtree(30)
# Make the original plot
p <- ggtree(tree)
# generate some random values for each tip label in the data
d1 <- data.frame(id=tree$tip.label, val=rnorm(30, sd=3))
# Make a second plot with the original, naming the new plot "dot", 
# using the data you just created, with a point geom.
p2 <- facet_plot(p, panel="dot", data=d1, geom=geom_point, aes(x=val), color='red3')
# Make some more data with another random value.
d2 <- data.frame(id=tree$tip.label, value = abs(rnorm(30, mean=100, sd=50)))
# Now add to that second plot, this time using the new d2 data above, 
# This time showing a bar segment, size 3, colored blue.
p3 <- facet_plot(p2, panel='bar', data=d2, geom=geom_segment, 
           aes(x=0, xend=value, y=y, yend=y), size=3, color='blue4') 
# Show all three plots with a scale
p3 + theme_tree2()

Overlay organism silouhettes

phylopic.org hosts free silhouette images of animals, plants, and other life forms, all under Creative Commons or Public Domain. You can use ggtree to overlay a phylopic image on your plot at a node of your choosing.

R code

# get phylopic 

newick <- "((Pongo_abelii,(Gorilla_gorilla_gorilla,(Pan_paniscus,Pan_troglodytes)Pan,Homo_sapiens)Homininae)Hominidae,Nomascus_leucogenys)Hominoidea;"

tree <- read.tree(text=newick)

d <- ggimage::phylopic_uid(tree$tip.label)
d$body_mass = c(52, 114, 47, 45, 58, 6)

p <- ggtree(tree) %<+% d + 
  geom_tiplab(aes(image=uid, colour=body_mass), geom="phylopic", offset=2.5) +
  geom_tiplab(aes(label=label), offset = .2) + xlim(NA, 7) +
  scale_color_viridis_c()
p

Tips

If you the tree is not fitting in the plot space you can modify the xlim

R code

# Add to your 
ggtree(tree) +
  xlim(0,2) # This worked from my rectangular trees
  xlim(0.36) # This worked for my circular trees

To show the bootstrap values

R code

ggtree(tree) +  
  geom_nodelab()

To change text size and position

R code

ggtree(tree) +
  geom_tiplab(hjust=-.1, size = 3)

To change the size of the highlighted clade you can use extend

R code

ggtree(tree) +
  geom_hilight(node=9, fill="gold", extend = 0.5)

You can order the layers when highlighting for certain effects

Exercises

For the lab turn in a report with the above examples. We will work with the NEON data next week.

Acknowledgements

Learning Objectives

Bioconductor

Installing Bioconductor, ggtree and Biostrings

The ggtree Package

Connecting to JupyterLab on Unity

Sequence Alignment Background

Dot matrix analysis

The dynamic programming algorithm

Subsitution penalties differ for particular nucleotides or aminoacids

Origin of protein substitution matrices and bioinformatics

Natural selection governs which amino acid changes are observed

Accounting for multiple substitutions

Multiple Sequence Alignment

Software for sequence alignment

On the Computer

Phylogenetic analysis

NGPhylogeny.fr - A quick and easy working flow including multiple sequence alignment, phylogenetic analysis and tree visualization

Phylogenetic tree visualization

ggtree

Tree Import

Basic trees

Exercise 1

Other tree geoms

Exercise 2

Tree annotation

Internal node number

Labeling clades

Connecting taxa

Exercise 3

Advanced tree annotation

Bonus!

Many trees

Plotting trees with other data

Overlay organism silouhettes

Tips

Exercises

The `ggtree` Package