Imputed MK Test

MKado implements the imputed MK test from Murga-Moreno et al. (2022), which corrects for slightly deleterious mutations by imputing their count from the synonymous frequency spectrum rather than discarding low-frequency polymorphisms.

Background

The standard MK test is biased by slightly deleterious mutations that inflate nonsynonymous polymorphism (Pn) without contributing to divergence (Dn). Several approaches exist to address this:

  • Asymptotic MK test: Fits a curve across the full frequency spectrum and extrapolates to x=1

  • Standard MK with frequency filtering: Discards all low-frequency polymorphisms below a cutoff (e.g., --min-freq, --no-singletons)

The imputed MK test takes a different approach: instead of discarding low-frequency data, it estimates how many low-frequency nonsynonymous polymorphisms are slightly deleterious, using the synonymous frequency spectrum as a neutral reference.

The Key Insight

Under neutrality, nonsynonymous and synonymous polymorphisms should have the same ratio of low-frequency to high-frequency variants. An excess of low-frequency nonsynonymous polymorphisms (relative to synonymous) indicates segregating slightly deleterious mutations.

By imputing this excess, the test:

  • Retains more data than methods that discard all low-frequency polymorphisms

  • Increases statistical power at the gene level

  • Decomposes the distribution of fitness effects (DFE) into interpretable fractions

The Imputation Formula

Polymorphisms are split at a derived allele frequency (DAF) cutoff (default 15%):

  1. Count low-frequency (DAF <= cutoff) and high-frequency (DAF > cutoff) variants separately for nonsynonymous (Pn) and synonymous (Ps) classes

  2. Compute the neutral ratio from synonymous polymorphisms:

    \[r = \frac{P_{s,low}}{P_{s,high}}\]
  3. Impute the number of weakly deleterious nonsynonymous polymorphisms:

    \[P_{wd} = P_{n,low} - P_{n,high} \times r\]

    This is clamped to >= 0.

  4. Compute neutral nonsynonymous polymorphisms:

    \[P_{n,neutral} = P_n - P_{wd}\]
  5. Calculate corrected alpha:

    \[\alpha = 1 - \frac{P_{n,neutral}}{P_s} \times \frac{D_s}{D_n}\]
  6. Significance is assessed with Fisher’s exact test on the corrected 2x2 table.

DFE Fractions

When the number of synonymous (m0) and nonsynonymous (mi) sites are provided, the test decomposes the DFE into four fractions:

  • alpha (a): Fraction of adaptive substitutions

  • f: Fraction of effectively neutral nonsynonymous mutations

  • b: Fraction of weakly deleterious mutations

  • d: Fraction of strongly deleterious mutations (d = 1 - f - b)

These fractions sum to 1 and describe the full distribution of fitness effects for new nonsynonymous mutations.

Usage

Single Gene Analysis

# Basic imputed MK test (default 15% DAF cutoff)
mkado test alignment.fa -i ingroup -o outgroup --imputed

# With custom DAF cutoff (10%)
mkado test alignment.fa -i ingroup -o outgroup --imputed --min-freq 0.10

# Separate ingroup/outgroup files
mkado test ingroup.fa outgroup.fa --imputed

Note

The --min-freq option is reused as the DAF cutoff when --imputed is set. If --min-freq is not specified, the default cutoff of 0.15 (15%) is used.

Batch Analysis (Aggregated)

For multi-gene analyses, pooling data across genes increases power:

# Pool polymorphisms and divergence across all genes
mkado batch alignments/ -i ingroup -o outgroup --imputed

Batch Analysis (Per-Gene)

# Run imputed test separately for each gene
mkado batch alignments/ -i ingroup -o outgroup --imputed --per-gene

Bootstrap Confidence Intervals

When --bootstrap N is set (with N > 0), the imputed test runs an additional case-resampling bootstrap to produce a 95% CI on alpha. Each replicate resamples the polymorphism list with replacement and re-runs the imputed MK algorithm; the 2.5/97.5 percentiles of the resulting alpha distribution are reported as alpha_CI_low / alpha_CI_high. The omega decomposition CIs are derived from the alpha CI by analytical scaling (omega_a_ci = alpha_ci * omega), mirroring the asymptotic test.

# Imputed test with 500-replicate bootstrap
mkado test alignment.fa -i ingroup -o outgroup --imputed --bootstrap 500

# Aggregated imputed batch with bootstrap
mkado batch alignments/ -i ingroup -o outgroup --imputed --bootstrap 200

The legacy default (--bootstrap 100) computes the bootstrap; pass --bootstrap 0 to disable CI computation entirely.

Output

The imputed test reports:

  • alpha: Corrected proportion of adaptive substitutions

  • p_value: Fisher’s exact test on the corrected contingency table

  • Pwd: Imputed count of weakly deleterious nonsynonymous polymorphisms

  • Pn_neutral: Nonsynonymous polymorphisms after removing imputed slightly deleterious mutations

  • Dn, Ds, Pn, Ps: Raw counts

  • cutoff: The DAF cutoff used

  • Ln, Ls: Nei-Gojobori non-synonymous and synonymous site totals

  • omega, omega_a, omega_na: dN/dS and the adaptive/non-adaptive decomposition (omega_a = alpha * omega) following Gossmann, Keightley & Eyre-Walker 2012, applied to MK counts by Coronado-Zamora et al. 2019. See Omega Decomposition (ω, ω_a, ω_na) for the decomposition formula.

  • alpha_CI_low, alpha_CI_high: 95% bootstrap CI on alpha (when --bootstrap > 0)

  • omega_a_CI_low/high, omega_na_CI_low/high: 95% CIs on the omega decomposition, derived from the alpha CI scaled by omega

  • ci_method: "bootstrap" when CI was computed, None otherwise

Example output (pretty format):

Imputed MK Test Results:
  Divergence:    Dn=6, Ds=8
  Polymorphism:  Pn=11, Ps=17
  DAF cutoff:    0.15
  Imputed Pwd:   4.82
  Pn (neutral):  6.18
  Alpha:         0.5154
  p-value:       0.0891
  Alpha 95% CI [bootstrap]: (0.3210, 0.7050)
  Sites:         Ln=576.00, Ls=195.00
  omega:         0.2539 (omega_a=0.1308, omega_na=0.1230)
    omega_a 95% CI:  (0.0815, 0.1790)
    omega_na 95% CI: (0.0749, 0.1724)

Comparison with Other Methods

Method

Approach

Best used when

Standard MK

No correction

Quick assessment; comparing specific genes

Asymptotic MK

Curve fitting across frequency spectrum

Genome-wide analyses with many polymorphisms

Standard MK with frequency filtering

Discards low-frequency polymorphisms (e.g., --min-freq, --no-singletons)

Simple correction, but loses data

Imputed MK

Imputes slightly deleterious count from synonymous spectrum

Gene-level analyses; maximizing statistical power

The imputed test is particularly useful when:

  • You want gene-level significance (p-values) rather than only genome-wide estimates

  • You want to retain as much data as possible

  • You want to decompose the DFE into interpretable fractions

When to Use the Imputed MK Test

Use imputed MK when:

  • Analyzing individual genes or small gene sets where power matters

  • You want per-gene corrected alpha with p-values

  • You want DFE decomposition (with site count information)

  • The asymptotic test lacks sufficient data for curve fitting

Consider alternatives when:

  • You have genome-wide data with thousands of polymorphisms (asymptotic MK may be more robust)

  • You want frequency-bin visualization of alpha(x) (use asymptotic MK with --plot-asymptotic)

  • You need unbiased multi-gene weighting without frequency modeling (use alpha_TG)

Interpreting Results

  • α > 0: Evidence for positive selection (proportion of adaptive substitutions)

  • α ≈ 0: Consistent with neutral evolution

  • α < 0: Excess of nonsynonymous polymorphism relative to divergence, suggesting segregating weakly deleterious mutations remain even after correction

  • Pwd > 0: Low-frequency nonsynonymous excess detected and corrected

  • Pwd = 0: No evidence for segregating slightly deleterious mutations (or all polymorphisms are high-frequency)

Choosing the DAF Cutoff

The default cutoff of 15% follows the recommendation of Murga-Moreno et al. (2022). The cutoff defines the boundary between “low-frequency” and “high-frequency” polymorphisms:

  • Lower cutoff (e.g., 5%): Only the rarest variants are considered low-frequency. More conservative imputation.

  • Higher cutoff (e.g., 25%): More variants classified as low-frequency. More aggressive imputation.

The optimal cutoff depends on the effective population size and strength of selection in your system.

Reference

Murga-Moreno J, Coronado-Zamora M, Casillas S, Barbadilla A (2022) impMKT: the imputed McDonald and Kreitman test, a straightforward correction that significantly increases the evidence of positive selection of the McDonald and Kreitman test at the gene level. G3: Genes, Genomes, Genetics 12(10):jkac206. https://doi.org/10.1093/g3journal/jkac206

Gossmann TI, Keightley PD, Eyre-Walker A (2012) The effect of variation in the effective population size on the rate of adaptive molecular evolution in eukaryotes. Genome Biology and Evolution 4(5):658-667. https://doi.org/10.1093/gbe/evs027

Coronado-Zamora M, Salvador-Martínez I, Castellano D, Barbadilla A, Salazar-Ciudad I (2019) Adaptation and conservation throughout the Drosophila melanogaster life-cycle. Genome Biology and Evolution 11(5):1463-1482. https://doi.org/10.1093/gbe/evz046