The Core Problem

Maximum likelihood estimation of discrete trait models comes with a well-known identifiability problem. When transition rates become very high relative to tree depth, something surprising happens: trait states get "scrambled" across the phylogeny. Every lineage rapidly visits every state many times, and the phylogenetic signal gets washed out.

In this limit, the expected distribution of tip states converges on the stationary distribution of the Markov chain. This distribution is determined only by the ratio of rates, not their absolute magnitude. Consider a simple example: on a tree deep enough for both models to approach equilibrium, q01 = 0.1 and q10 = 0.1 (slow rates) predict nearly the same 50/50 distribution of tip states as q01 = 100 and q10 = 100 (very fast rates).

This creates a likelihood ridge: a long plateau in parameter space extending from biologically realistic rate values (say, 0.001–1 events per unit branch length) out to arbitrarily high values. Along this ridge, the likelihood changes only minimally. The ridge runs along the direction of proportional rate scaling—both rates increase together while maintaining their ratio.

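Mathematical intuition: In the symmetric two-state model, the probability that a lineage is in the same state at both ends of a branch is P(same) = 0.5 + 0.5 × exp(−2q×t), where q is the rate and t is the branch length. When qt is large, exp(−2qt) ≈ 0, so on a branch of length 1 the result is ≈ 0.5 whether q = 100 or q = 1000 (at q = 0.1 the same branch still retains most of its signal). In the general two-state model, the stationary frequency of state 0, q10/(q01 + q10), is unaffected by scaling both rates together.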
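The saturation described above can be checked numerically. The following is a minimal sketch, assuming the standard closed-form transition probability for a two-state continuous-time Markov chain; the function name p_state0 is illustrative and does not come from any package.

```python
import math

def p_state0(q01, q10, t, start_state=0):
    """Probability of being in state 0 after time t (two-state Markov model)."""
    total = q01 + q10
    pi0 = q10 / total                 # stationary frequency of state 0
    memory = math.exp(-total * t)     # how much of the starting state survives
    start = 1.0 if start_state == 0 else 0.0
    return pi0 + (start - pi0) * memory

# Symmetric rates on a branch of length 1: signal is retained at q = 0.1
# and erased once q * t is large.
for q in (0.1, 1.0, 100.0, 1000.0):
    print(q, p_state0(q, q, t=1.0))

# Scaling both rates by the same factor leaves the long-run value unchanged.
print(p_state0(0.2, 0.6, t=50.0), p_state0(20.0, 60.0, t=50.0))
```

Once the memory term has decayed, only the ratio of the rates matters, which is why the likelihood cannot distinguish points along the ridge.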

Why This Matters

  • Unrealistic estimates: Maximum likelihood optimizers (like optim() in R) can wander up the ridge and return estimates of q = 50 or q = 500 events per unit branch length when the true value is 0.5. All of these values fit the data similarly well, but only one is biologically plausible.
  • Slow convergence: The optimizer can get stuck in the ridge region, requiring many more function evaluations to converge. This wastes computational time and can trigger premature termination warnings.
  • Unbounded confidence intervals: Confidence intervals estimated via likelihood profiles will extend far up the ridge, with the upper bound appearing infinite or unreasonably large.
  • Worst-case scenarios: The ridge problem is most severe when:
    • Tip states are close to a 50/50 distribution (maximum ridging for two-state models)
    • Trees are small (fewer tips = weaker constraint on estimates)
    • There is little phylogenetic signal (related species are no more likely to share states than distant ones)
Red flag: If your ML estimates return rates much larger than 1 event per unit branch length, suspect ridge wandering. Check your data: are states distributed roughly equally across the tree? If so, the high rates are probably riding the ridge.

Interactive Likelihood Ridge Visualization

The plot below shows a simulated likelihood surface for a simple two-state equal-rates model (q₀₁ = q₁₀ = q). The x-axis is the rate q on a logarithmic scale; the y-axis is log-likelihood. Use the slider to change the proportion of tips in state 0, and watch how the likelihood ridge forms and shifts.

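[Interactive plot: log-likelihood against rate q (log scale). Slider: proportion of tips in state 0 (shown at 0.50). Legend: likelihood curve; ridge region (ΔlogL < 2); MLE position.]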
How to read this: The green vertical line marks the maximum likelihood estimate (MLE). The shaded teal region shows the likelihood ridge—where the log-likelihood stays within 2 units of the peak. When the proportion is near 0.5, notice how the ridge extends far to the right (high q values). When the proportion moves away from 0.5, the ridge narrows and the likelihood becomes more peaked. This is why balanced trait distributions are problematic: the ridge becomes pronounced and high rates remain plausible.
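The "within 2 log-likelihood units" ridge region can be computed directly. The sketch below is a rough stand-in under a deliberately simplified assumption, and it is not the code behind the plot: instead of a full tree, the data are independent sister pairs separated by path length t, with n_same pairs sharing a state and n_diff pairs differing. The names log_lik and ridge_region are hypothetical.

```python
import math

def log_lik(q, n_same, n_diff, t=1.0):
    """Log-likelihood of rate q for independent sister pairs (toy model).

    Under the symmetric two-state model a pair shares its state with
    probability 0.5 + 0.5 * exp(-2 * q * t). Constant terms are dropped.
    """
    p_same = 0.5 + 0.5 * math.exp(-2.0 * q * t)
    p_diff = -0.5 * math.expm1(-2.0 * q * t)   # 0.5 - 0.5 * exp(-2qt), computed stably
    return n_same * math.log(p_same) + n_diff * math.log(p_diff)

def ridge_region(n_same, n_diff, n_grid=301):
    """Return (q at the grid maximum, lowest q, highest q within 2 logL units)."""
    grid = [10 ** (-3 + 6 * i / (n_grid - 1)) for i in range(n_grid)]  # 0.001 to 1000
    logl = [log_lik(q, n_same, n_diff) for q in grid]
    best = max(logl)
    inside = [q for q, ll in zip(grid, logl) if best - ll < 2.0]
    return grid[logl.index(best)], min(inside), max(inside)

print(ridge_region(12, 8))   # weak signal: pairs match only slightly more than half the time
print(ridge_region(19, 1))   # strong signal: almost all pairs match
```

With weak signal the region runs all the way to the top of the grid, so arbitrarily high rates stay plausible. With strong signal the region closes well below q = 1.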

Solutions to Ridge Wandering

1. MCMC with Priors

Bayesian MCMC with an exponential or lognormal prior on rates is the most principled solution. A prior naturally penalizes high values that have little data support. The posterior distribution will concentrate near biologically reasonable values unless the data strongly support high rates.

This is why tools like chromePlus and ChromEvol offer MCMC alongside maximum likelihood. The prior acts as a regularizer, pulling estimates away from the ridge toward plausible values. For example, an exponential prior with mean 0.5 has density 2 × exp(−2q), so it gives q = 100 a prior density about exp(−199) ≈ 10⁻⁸⁶ times that of q = 0.5, even if both fit the likelihood equally well.
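To make the regularizing effect concrete, here is a minimal sketch of a generic random-walk Metropolis sampler with an exponential prior. It is not the sampler used by chromePlus or ChromEvol. The likelihood is a toy assumption (independent sister pairs under the symmetric two-state model), and all function names are hypothetical.

```python
import math
import random
import statistics

def log_lik(q, n_same, n_diff, t=1.0):
    """Toy log-likelihood: independent sister pairs, symmetric two-state model."""
    p_same = 0.5 + 0.5 * math.exp(-2.0 * q * t)
    p_diff = -0.5 * math.expm1(-2.0 * q * t)
    return n_same * math.log(p_same) + n_diff * math.log(p_diff)

def log_prior(q, mean=0.5):
    """Log density of an exponential prior with the given mean."""
    return -math.log(mean) - q / mean

def metropolis(n_same, n_diff, n_steps=20000, step_sd=0.5, seed=1):
    """Random-walk Metropolis on log(q); returns samples of q after burn-in."""
    rng = random.Random(seed)

    def log_target(q):
        # Posterior density of log(q): likelihood * prior * Jacobian (q).
        return log_lik(q, n_same, n_diff) + log_prior(q) + math.log(q)

    q = 1.0
    current = log_target(q)
    samples = []
    for _ in range(n_steps):
        proposal = q * math.exp(rng.gauss(0.0, step_sd))
        candidate = log_target(proposal)
        if rng.random() < math.exp(min(0.0, candidate - current)):
            q, current = proposal, candidate
        samples.append(q)
    return samples[n_steps // 4:]   # discard the first quarter as burn-in

# Prior penalty for q = 100 relative to q = 0.5, in log units.
print(log_prior(100.0) - log_prior(0.5))

# 10 matching and 10 differing pairs: the toy likelihood keeps rising with q,
# yet the posterior stays at moderate rates.
samples = metropolis(10, 10)
print(statistics.median(samples))
```

The posterior here is shaped mostly by the prior, which is the point: when the data cannot bound the rate from above, the prior does.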

2. Multiple Starting Points

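Running ML optimization from many random starting points samples different regions of parameter space. Compare both the converged rates and their log-likelihoods across runs. If most runs cluster near a low-rate solution while a few end at much higher rates with nearly identical log-likelihoods (within a fraction of a unit), the high-rate runs have wandered up the ridge and the low-rate cluster is the more defensible estimate. If one run reaches a clearly higher log-likelihood, it is the better ML solution wherever it sits. Estimates that vary wildly between runs are a sign of ridge effects.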
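The comparison across starts can be sketched as follows. This is an illustration under stated assumptions, not a recipe from any package: the likelihood is a toy one (independent sister pairs, symmetric two-state model), and local_search is a simple step-halving search standing in for a real optimizer such as optim().

```python
import math
import random

def log_lik(q, n_same, n_diff, t=1.0):
    """Toy log-likelihood: independent sister pairs, symmetric two-state model."""
    p_same = 0.5 + 0.5 * math.exp(-2.0 * q * t)
    p_diff = -0.5 * math.expm1(-2.0 * q * t)
    return n_same * math.log(p_same) + n_diff * math.log(p_diff)

def local_search(f, x0, step=1.0, tol=1e-8, min_step=1e-6):
    """Maximize f by moving +/- step while f improves by more than tol."""
    x, fx = x0, f(x0)
    while step >= min_step:
        for cand in (x + step, x - step):
            fc = f(cand)
            if fc > fx + tol:
                x, fx = cand, fc
                break
        else:
            step /= 2.0               # no improving move: refine the step
    return x, fx

def multi_start(n_same, n_diff, n_starts=50, seed=0):
    """Run the local search from random log-uniform starts in [0.001, 1000]."""
    rng = random.Random(seed)
    f = lambda x: log_lik(math.exp(x), n_same, n_diff)   # optimize over log(q)
    runs = []
    for _ in range(n_starts):
        x, fx = local_search(f, rng.uniform(math.log(1e-3), math.log(1e3)))
        runs.append((math.exp(x), fx))
    return sorted(runs, key=lambda r: -r[1])             # best log-likelihood first

for label, data in (("strong signal", (19, 1)), ("balanced", (10, 10))):
    runs = multi_start(*data)
    near_best = [q for q, ll in runs if runs[0][1] - ll < 0.01]
    print(label, "best q:", runs[0][0], "range of near-best q:", min(near_best), max(near_best))
```

In the balanced case many runs stop at very different rates with essentially the same log-likelihood, which is the scatter that signals a ridge. In the strong-signal case, starts far out on the flat region can stall there, but their log-likelihoods are clearly worse, so ranking by log-likelihood still recovers the peak.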

3. Penalized Likelihood

Adding a regularization term (e.g., an L2 penalty proportional to the squared rate) creates a unique maximum that concentrates near realistic values. The penalized log-likelihood is:

penalized.logL = logL(q) − λ × q²

The penalty weight λ controls how much you trust the prior belief that rates should be small. Cross-validation or information criteria can guide the choice of λ. This is less formal than a full Bayesian approach but faster and often sufficient.
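As a minimal sketch of the penalized objective, the code below maximizes logL(q) − λ × q² on a log-spaced grid. The likelihood is a toy assumption (independent sister pairs, symmetric two-state model), the value λ = 0.1 is arbitrary, and penalized_estimate is a hypothetical name.

```python
import math

def log_lik(q, n_same, n_diff, t=1.0):
    """Toy log-likelihood: independent sister pairs, symmetric two-state model."""
    p_same = 0.5 + 0.5 * math.exp(-2.0 * q * t)
    p_diff = -0.5 * math.expm1(-2.0 * q * t)
    return n_same * math.log(p_same) + n_diff * math.log(p_diff)

def penalized_estimate(n_same, n_diff, lam, n_grid=601):
    """Grid maximizer of logL(q) - lam * q**2 over q in [0.001, 1000]."""
    grid = [10 ** (-3 + 6 * i / (n_grid - 1)) for i in range(n_grid)]
    return max(grid, key=lambda q: log_lik(q, n_same, n_diff) - lam * q * q)

# Balanced toy data: without a penalty the maximizer sits far out on the plateau;
# with a penalty it is pulled back to a moderate rate.
print(penalized_estimate(10, 10, lam=0.0))
print(penalized_estimate(10, 10, lam=0.1))
```

A grid is used only to keep the example short. Any optimizer can be applied to the same objective, and the estimate will move with λ, so it is worth reporting results for more than one value.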

4. Rate Bounds

The simplest (if somewhat arbitrary) approach is to set upper bounds on the rate parameter during optimization. For example, you might constrain q ≤ 10 events per unit branch length. This stops the optimizer from wandering arbitrarily far up the ridge, although on a flat ridge the estimate can simply end up sitting at the bound. The downside is that the bound is ad hoc and can artificially truncate the likelihood surface if the true value happens to exceed it. Still, if you have strong biological priors on reasonable rate ranges, this is practical.
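A bounded search should also report when the estimate lands on the bound, since that usually means the bound, not the data, chose the value. The sketch below assumes a toy likelihood (independent sister pairs, symmetric two-state model) and uses a plain golden-section search over log(q), which is valid only because this toy likelihood has a single peak. The names golden_max and bounded_mle are hypothetical.

```python
import math

def log_lik(q, n_same, n_diff, t=1.0):
    """Toy log-likelihood: independent sister pairs, symmetric two-state model."""
    p_same = 0.5 + 0.5 * math.exp(-2.0 * q * t)
    p_diff = -0.5 * math.expm1(-2.0 * q * t)
    return n_same * math.log(p_same) + n_diff * math.log(p_diff)

def golden_max(f, lo, hi, tol=1e-5):
    """Golden-section search for the maximum of a single-peaked f on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
        if f(c) > f(d):
            hi = d                    # maximum lies in [lo, d]
        else:
            lo = c                    # maximum lies in [c, hi]
    return 0.5 * (lo + hi)

def bounded_mle(n_same, n_diff, lower=1e-3, upper=10.0):
    """Maximize the toy likelihood for q in [lower, upper]; flag estimates at the bound."""
    f = lambda x: log_lik(math.exp(x), n_same, n_diff)   # search over log(q)
    q_hat = math.exp(golden_max(f, math.log(lower), math.log(upper)))
    return q_hat, q_hat > 0.99 * upper

print(bounded_mle(19, 1))    # interior estimate, flag is False
print(bounded_mle(8, 12))    # likelihood keeps rising, estimate sits at the bound
```

When the flag is set, changing the bound will change the estimate, so the reported rate should be read as "at least this high under this bound" and not as a point estimate.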

chromePlus and ChromEvol: Best Practices

The chromePlus package (developed by the Blackmon Lab) and ChromEvol both allow MCMC inference with user-specified priors precisely because of the ridge problem. When rates inferred by maximum likelihood seem unreasonably high (e.g., > 1 event per unit branch length), this is a diagnostic sign of ridge wandering.

To protect yourself:

  • Always check biological plausibility. Does a rate of 50 transitions per unit branch length make sense given what you know about your organism and trait?
  • Compare ML and MCMC estimates. If MCMC (with a reasonable prior) gives q = 0.3 but ML gives q = 50, ridge wandering is likely. The MCMC posterior is more trustworthy in this case.
  • Try multiple starting values. Run the optimizer from 10–50 random starting points. If they converge to a tight cluster, you have confidence. If they scatter wildly, investigate further.
  • Visualize the likelihood profile. Plot likelihood against rate for your real data. If you see a long, flat plateau, you are on the ridge. A sharp peak indicates identifiability.

See the discrete trait evolution guide for more on model selection and the Mk framework. For the history of chromosome evolution methods, including the development of these tools, consult the methods review. Explore karyotype databases for real data examples.

Summary

Likelihood ridges are an inherent feature of discrete trait inference: not a bug, but a signature of weak identifiability. When tip states are close to equilibrium proportions, high rates become nearly indistinguishable from low rates. This is a genuine inference problem: the data alone may not constrain rates well.

The solutions—priors, multiple starts, penalization, and bounds—all aim to break the ridge by adding external information or constraints. In practice, Bayesian MCMC with informative priors is the gold standard because it explicitly encodes what you know (or want to assume) about plausible rate ranges. When in doubt, compare results across methods and always sanity-check your rate estimates.