Phylogenies
What is a phylogeny?
A phylogeny is a branching diagram showing the evolutionary relationships among species. Each tip represents a living (or extinct) species. Each internal node represents a common ancestor. The branches connecting them represent time or evolutionary change.
Think of the phylogeny as a family tree of life. Just as your grandparents had children who had children, species have common ancestors. The phylogeny is a hypothesis about how all these ancestors and descendants are related.
Below is an interactive phylogeny of 15 vertebrate species. Hover over a tip to highlight that species along with its entire ancestral lineage.
Reading a phylogeny
A key insight: the order in which the tips are drawn in a phylogeny does not matter. You can rotate the branches around any internal node and the topology stays exactly the same. Below are three different "rotations" of the same six-species tree. They look different, but they show the same relationships.
What matters in a phylogeny is which species are nested within which clades. The visual layout is irrelevant. A and B can be on the left or right, at the top or bottom. What matters is that they share a most recent common ancestor that is not shared with C, D, E, or F.
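One way to make this concrete is to compare trees by the clades they contain rather than by how they are drawn. The sketch below is only an illustration: the nested-tuple representation, the function names, and the exact six-species topology are assumptions, since the text specifies only that A and B form a clade that excludes C, D, E, and F.

```python
# A tip is a string; an internal node is a tuple of child subtrees.
def clades(tree):
    """Return (tips under this node, set of every clade in this subtree)."""
    if isinstance(tree, str):
        tips = frozenset([tree])
        return tips, {tips}
    tips = frozenset()
    found = set()
    for child in tree:
        child_tips, child_clades = clades(child)
        tips |= child_tips
        found |= child_clades
    found.add(tips)
    return tips, found

def same_topology(t1, t2):
    """Two rooted trees show the same relationships if their clade sets match."""
    return clades(t1)[1] == clades(t2)[1]

# Hypothetical six-species tree and two redrawings of it.
original = ((("A", "B"), "C"), (("D", "E"), "F"))
rotated = (("F", ("E", "D")), ("C", ("B", "A")))     # branches rotated only
regrouped = ((("A", "C"), "B"), (("D", "E"), "F"))   # A now groups with C

print(same_topology(original, rotated))    # True
print(same_topology(original, regrouped))  # False
```

Rotating branches reorders the children of a node but never changes which tips sit beneath it, so the clade sets are identical. Moving A next to C, by contrast, destroys the clade {A, B} and creates a different hypothesis.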
Branch lengths
In some phylogenies, branch lengths encode evolutionary time or the amount of change. A long branch means that more time has passed, or that more genetic change has accumulated, along that lineage. A short branch means little time or little change. When branch lengths represent time, short branches separating two tips mean that those lineages diverged recently.
The type of diagram tells you how to read the branches. A cladogram shows topology only, and its branch lengths carry no meaning. A chronogram scales branch lengths to time. A phylogram scales them to the amount of evolutionary change.
In the chronogram above, the horizontal axis represents time. The branch leading to Fish is the longest because it diverged from the other vertebrates around 500 million years ago. Humans and Chimpanzees share a much more recent common ancestor, so the branches separating them are shorter.
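The difference between a chronogram and a phylogram can be checked numerically. In a chronogram whose tips are all living species, every tip lies at the same distance from the root, a property called ultrametricity. The sketch below assumes a simple (content, branch_length) tuple representation. Only the 500-million-year figure comes from the text; every other branch length is a placeholder chosen for illustration, not an estimate.

```python
import math

# A node is (content, branch_length).
# content is either a tip name or a list of child nodes.
def root_to_tip(node, depth=0.0):
    """Return {tip name: total branch length from the root to that tip}."""
    content, length = node
    depth += length
    if isinstance(content, str):
        return {content: depth}
    distances = {}
    for child in content:
        distances.update(root_to_tip(child, depth))
    return distances

def is_ultrametric(tree, tol=1e-9):
    """True if all tips are equally far from the root."""
    d = list(root_to_tip(tree).values())
    return all(math.isclose(x, d[0], abs_tol=tol) for x in d)

# Chronogram: lengths in millions of years (7.0 and 493.0 are placeholders).
chronogram = ([("Fish", 500.0),
               ([("Human", 7.0), ("Chimpanzee", 7.0)], 493.0)], 0.0)

# Phylogram: same topology, lengths as made-up amounts of change.
phylogram = ([("Fish", 0.9),
              ([("Human", 0.010), ("Chimpanzee", 0.012)], 0.6)], 0.0)

print(root_to_tip(chronogram))    # every tip is 500.0 from the root
print(is_ultrametric(chronogram)) # True
print(is_ultrametric(phylogram))  # False
```

A phylogram is usually not ultrametric because lineages accumulate change at different rates, so the tips end at different distances from the root.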
Why phylogenies matter for statistics
Here is a critical problem that motivates phylogenetic comparative methods: if you study 1000 species of beetles, you do not have 1000 independent observations. If 500 of those species belong to a single clade (they all descend from a common ancestor that lived 2 million years ago), then those 500 species are not independent. They inherited many traits from that common ancestor.
This is the problem of phylogenetic non-independence. When you analyze traits across species, you must account for the fact that species are related by descent. Otherwise, you violate the assumption of independence that underlies standard statistical tests.
Imagine you want to test whether body size predicts home range across 100 carnivore species. If you use ordinary linear regression, you assume the 100 data points are independent. But suppose 50 of those species are close relatives of the lion that diverged from one another recently. They inherited similar body sizes and home ranges from their common ancestor, so they contribute far fewer than 50 independent data points. Your effective sample size is much smaller than 100.
The phylogeny gives you the structure to correct for this non-independence. By accounting for the shared ancestry among species, phylogenetic comparative methods let you separate genuine evolutionary associations between traits from similarity that simply reflects common descent.
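To see how a tree supplies that structure, consider one common modelling choice, which is an assumption here because the text does not name a model: traits evolve by Brownian motion, so the expected covariance between two species is proportional to the branch length they share between the root and their most recent common ancestor. The sketch below builds that shared-path matrix for a hypothetical four-species tree. The tree, its branch lengths, and the function name are all illustrative.

```python
# A node is (content, branch_length).
# content is either a tip name or a list of child nodes.
def shared_path_matrix(tree):
    """Return (tips, matrix): matrix[i][j] is the branch length shared by
    tips i and j on their paths from the root."""
    shared = {}

    def walk(node, depth):
        content, length = node
        depth += length  # distance from the root to the end of this branch
        if isinstance(content, str):
            shared[(content, content)] = depth  # a tip shares its whole path with itself
            return [content]
        groups = [walk(child, depth) for child in content]
        # Tips in different child subtrees split at this node,
        # so they share exactly the path down to it.
        for i, left in enumerate(groups):
            for right in groups[i + 1:]:
                for a in left:
                    for b in right:
                        shared[(a, b)] = shared[(b, a)] = depth
        return [tip for group in groups for tip in group]

    tips = walk(tree, 0.0)
    return tips, [[shared[(a, b)] for b in tips] for a in tips]

# Hypothetical tree: A and B split recently, C and D split long ago.
tree = ([([("A", 1.0), ("B", 1.0)], 9.0),
         ([("C", 5.0), ("D", 5.0)], 5.0)], 0.0)

tips, cov = shared_path_matrix(tree)
for name, row in zip(tips, cov):
    print(name, row)
# A [10.0, 9.0, 0.0, 0.0]
# B [9.0, 10.0, 0.0, 0.0]
# C [0.0, 0.0, 10.0, 5.0]
# D [0.0, 0.0, 5.0, 10.0]
```

A and B share 9 of their 10 units of history, so under this model they behave almost like a single observation. An ordinary regression implicitly treats this matrix as diagonal, with no shared history between any pair of species. Many comparative methods, such as phylogenetic generalized least squares, replace that diagonal assumption with a matrix of this kind.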
How phylogenies are built
Phylogenies are not observed directly. They are inferred from genetic data (DNA sequences), morphological characters (body shape, skeletal features), or a combination of both. A widely used method is maximum likelihood (ML), which searches for the tree under which the observed data are most probable, given a statistical model of evolution.
Another popular approach is Bayesian inference, which uses Bayes' theorem to estimate a posterior probability distribution over trees rather than a single best tree. Both approaches can quantify uncertainty in the tree topology: ML analyses typically do so through bootstrap resampling, and Bayesian analyses do so through the posterior distribution itself.
For the purposes of comparative methods, the key insight is this: the phylogeny you use is an estimate, not a fact. It comes with uncertainty. Some internal branches may be poorly supported. Good practice includes checking the support values (bootstrap percentages in ML, posterior probabilities in Bayesian methods) and sometimes using methods that account for topological uncertainty.
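As a rough illustration of what a support value summarizes, the sketch below counts how often a clade appears across a set of replicate trees. In a real bootstrap analysis each replicate would be inferred from a resampled dataset; here the four replicate trees are written by hand, and the nested-tuple representation and function names are assumptions made for the example.

```python
# A tip is a string; an internal node is a tuple of child subtrees.
def clade_set(tree):
    """Collect every clade in a tree as a frozenset of tip names."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            tips = frozenset([node])
        else:
            tips = frozenset().union(*(walk(child) for child in node))
        found.add(tips)
        return tips

    walk(tree)
    return found

def support(clade, replicate_trees):
    """Percentage of replicate trees that contain the given clade."""
    target = frozenset(clade)
    hits = sum(target in clade_set(t) for t in replicate_trees)
    return 100.0 * hits / len(replicate_trees)

# Hypothetical replicate trees for four species.
replicates = [
    ((("A", "B"), "C"), "D"),
    ((("A", "B"), "C"), "D"),
    ((("A", "C"), "B"), "D"),
    (("A", "B"), ("C", "D")),
]

print(support({"A", "B"}, replicates))  # 75.0
print(support({"A", "C"}, replicates))  # 25.0
```

The same counting applies to a sample of trees from a Bayesian analysis, where the frequency of a clade in the sample estimates its posterior probability. A clade that appears in only a minority of replicates is one that downstream comparative analyses should not rely on too heavily.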