Imputed MK Test

MKado implements the imputed MK test from Murga-Moreno et al. (2022), which corrects for slightly deleterious mutations by imputing their count from the synonymous frequency spectrum rather than discarding low-frequency polymorphisms.

Background

The standard MK test is biased by slightly deleterious mutations that inflate nonsynonymous polymorphism (Pn) without contributing to divergence (Dn). Several approaches exist to address this:

Asymptotic MK test: Fits a curve across the full frequency spectrum and extrapolates to x=1
Standard MK with frequency filtering: Discards all low-frequency polymorphisms below a cutoff (e.g., --min-freq, --no-singletons)

The imputed MK test takes a different approach: instead of discarding low-frequency data, it estimates how many low-frequency nonsynonymous polymorphisms are slightly deleterious, using the synonymous frequency spectrum as a neutral reference.

The Key Insight

Under neutrality, nonsynonymous and synonymous polymorphisms should have the same ratio of low-frequency to high-frequency variants. An excess of low-frequency nonsynonymous polymorphisms (relative to synonymous) indicates segregating slightly deleterious mutations.

By imputing this excess, the test:

Retains more data than methods that discard all low-frequency polymorphisms
Increases statistical power at the gene level
Decomposes the distribution of fitness effects (DFE) into interpretable fractions

The Imputation Formula

Polymorphisms are split at a derived allele frequency (DAF) cutoff (default 15%):

Count low-frequency (DAF <= cutoff) and high-frequency (DAF > cutoff) variants separately for nonsynonymous (P_n) and synonymous (P_s) classes
Compute the neutral ratio from synonymous polymorphisms:

\[r = \frac{P_{s,low}}{P_{s,high}}\]
Impute the number of weakly deleterious nonsynonymous polymorphisms:

\[P_{wd} = P_{n,low} - P_{n,high} \times r\]

This is clamped to >= 0.
Compute neutral nonsynonymous polymorphisms:

\[P_{n,neutral} = P_n - P_{wd}\]
Calculate corrected alpha:

\[\alpha = 1 - \frac{P_{n,neutral}}{P_s} \times \frac{D_s}{D_n}\]
Significance is assessed with Fisher’s exact test on the corrected 2x2 table.

DFE Fractions

When the number of synonymous (m₀) and nonsynonymous (m_i) sites are provided, the test decomposes the DFE into four fractions:

alpha (a): Fraction of adaptive substitutions
f: Fraction of effectively neutral nonsynonymous mutations
b: Fraction of weakly deleterious mutations
d: Fraction of strongly deleterious mutations (d = 1 - f - b)

These fractions sum to 1 and describe the full distribution of fitness effects for new nonsynonymous mutations.

Usage

Single Gene Analysis

# Basic imputed MK test (default 15% DAF cutoff)
mkado test alignment.fa -i ingroup -o outgroup --imputed

# With custom DAF cutoff (10%)
mkado test alignment.fa -i ingroup -o outgroup --imputed --min-freq 0.10

# Separate ingroup/outgroup files
mkado test ingroup.fa outgroup.fa --imputed

Note

The --min-freq option is reused as the DAF cutoff when --imputed is set. If --min-freq is not specified, the default cutoff of 0.15 (15%) is used.

Batch Analysis (Aggregated)

For multi-gene analyses, pooling data across genes increases power:

# Pool polymorphisms and divergence across all genes
mkado batch alignments/ -i ingroup -o outgroup --imputed

Batch Analysis (Per-Gene)

# Run imputed test separately for each gene
mkado batch alignments/ -i ingroup -o outgroup --imputed --per-gene

Bootstrap Confidence Intervals

When --bootstrap N is set (with N > 0), the imputed test runs an additional case-resampling bootstrap to produce a 95% CI on alpha. Each replicate resamples the polymorphism list with replacement and re-runs the imputed MK algorithm; the 2.5/97.5 percentiles of the resulting alpha distribution are reported as alpha_CI_low / alpha_CI_high. The omega decomposition CIs are derived from the alpha CI by analytical scaling (omega_a_ci = alpha_ci * omega), mirroring the asymptotic test.

# Imputed test with 500-replicate bootstrap
mkado test alignment.fa -i ingroup -o outgroup --imputed --bootstrap 500

# Aggregated imputed batch with bootstrap
mkado batch alignments/ -i ingroup -o outgroup --imputed --bootstrap 200

The legacy default (--bootstrap 100) computes the bootstrap; pass --bootstrap 0 to disable CI computation entirely.

Output

The imputed test reports:

alpha: Corrected proportion of adaptive substitutions
p_value: Fisher’s exact test on the corrected contingency table
Pwd: Imputed count of weakly deleterious nonsynonymous polymorphisms
Pn_neutral: Nonsynonymous polymorphisms after removing imputed slightly deleterious mutations
Dn, Ds, Pn, Ps: Raw counts
cutoff: The DAF cutoff used
Ln, Ls: Nei-Gojobori non-synonymous and synonymous site totals
omega, omega_a, omega_na: dN/dS and the adaptive/non-adaptive decomposition (omega_a = alpha * omega) following Gossmann, Keightley & Eyre-Walker 2012, applied to MK counts by Coronado-Zamora et al. 2019. See Omega Decomposition (ω, ω_a, ω_na) for the decomposition formula.
alpha_CI_low, alpha_CI_high: 95% bootstrap CI on alpha (when --bootstrap > 0)
omega_a_CI_low/high, omega_na_CI_low/high: 95% CIs on the omega decomposition, derived from the alpha CI scaled by omega
ci_method: "bootstrap" when CI was computed, None otherwise

Example output (pretty format):

Imputed MK Test Results:
  Divergence:    Dn=6, Ds=8
  Polymorphism:  Pn=11, Ps=17
  DAF cutoff:    0.15
  Imputed Pwd:   4.82
  Pn (neutral):  6.18
  Alpha:         0.5154
  p-value:       0.0891
  Alpha 95% CI [bootstrap]: (0.3210, 0.7050)
  Sites:         Ln=576.00, Ls=195.00
  omega:         0.2539 (omega_a=0.1308, omega_na=0.1230)
    omega_a 95% CI:  (0.0815, 0.1790)
    omega_na 95% CI: (0.0749, 0.1724)

Comparison with Other Methods

Method	Approach	Best used when
Standard MK	No correction	Quick assessment; comparing specific genes
Asymptotic MK	Curve fitting across frequency spectrum	Genome-wide analyses with many polymorphisms
Standard MK with frequency filtering	Discards low-frequency polymorphisms (e.g., `--min-freq`, `--no-singletons`)	Simple correction, but loses data
Imputed MK	Imputes slightly deleterious count from synonymous spectrum	Gene-level analyses; maximizing statistical power

The imputed test is particularly useful when:

You want gene-level significance (p-values) rather than only genome-wide estimates
You want to retain as much data as possible
You want to decompose the DFE into interpretable fractions

When to Use the Imputed MK Test

Use imputed MK when:

Analyzing individual genes or small gene sets where power matters
You want per-gene corrected alpha with p-values
You want DFE decomposition (with site count information)
The asymptotic test lacks sufficient data for curve fitting

Consider alternatives when:

You have genome-wide data with thousands of polymorphisms (asymptotic MK may be more robust)
You want frequency-bin visualization of alpha(x) (use asymptotic MK with --plot-asymptotic)
You need unbiased multi-gene weighting without frequency modeling (use alpha_TG)

Interpreting Results

α > 0: Evidence for positive selection (proportion of adaptive substitutions)
α ≈ 0: Consistent with neutral evolution
α < 0: Excess of nonsynonymous polymorphism relative to divergence, suggesting segregating weakly deleterious mutations remain even after correction
Pwd > 0: Low-frequency nonsynonymous excess detected and corrected
Pwd = 0: No evidence for segregating slightly deleterious mutations (or all polymorphisms are high-frequency)

Choosing the DAF Cutoff

The default cutoff of 15% follows the recommendation of Murga-Moreno et al. (2022). The cutoff defines the boundary between “low-frequency” and “high-frequency” polymorphisms:

Lower cutoff (e.g., 5%): Only the rarest variants are considered low-frequency. More conservative imputation.
Higher cutoff (e.g., 25%): More variants classified as low-frequency. More aggressive imputation.

The optimal cutoff depends on the effective population size and strength of selection in your system.

Reference

Murga-Moreno J, Coronado-Zamora M, Casillas S, Barbadilla A (2022) impMKT: the imputed McDonald and Kreitman test, a straightforward correction that significantly increases the evidence of positive selection of the McDonald and Kreitman test at the gene level. G3: Genes, Genomes, Genetics 12(10):jkac206. https://doi.org/10.1093/g3journal/jkac206

Gossmann TI, Keightley PD, Eyre-Walker A (2012) The effect of variation in the effective population size on the rate of adaptive molecular evolution in eukaryotes. Genome Biology and Evolution 4(5):658-667. https://doi.org/10.1093/gbe/evs027

Coronado-Zamora M, Salvador-Martínez I, Castellano D, Barbadilla A, Salazar-Ciudad I (2019) Adaptation and conservation throughout the Drosophila melanogaster life-cycle. Genome Biology and Evolution 11(5):1463-1482. https://doi.org/10.1093/gbe/evz046