Imputed MK Test
===============

MKado implements the imputed MK test from `Murga-Moreno et al. (2022)`_, which corrects for slightly deleterious mutations by imputing their count from the synonymous frequency spectrum rather than discarding low-frequency polymorphisms.

Background
----------

The standard MK test is biased by **slightly deleterious mutations** that inflate nonsynonymous polymorphism (Pn) without contributing to divergence (Dn). Several approaches exist to address this:

- **Asymptotic MK test**: Fits a curve across the full frequency spectrum and extrapolates to x=1
- **Standard MK with frequency filtering**: Discards all low-frequency polymorphisms below a cutoff (e.g., ``--min-freq``, ``--no-singletons``)

The imputed MK test takes a different approach: instead of discarding low-frequency data, it **estimates how many** low-frequency nonsynonymous polymorphisms are slightly deleterious, using the synonymous frequency spectrum as a neutral reference.

The Key Insight
---------------

Under neutrality, nonsynonymous and synonymous polymorphisms should have the same ratio of low-frequency to high-frequency variants. An excess of low-frequency nonsynonymous polymorphisms (relative to synonymous) indicates segregating slightly deleterious mutations.

By imputing this excess, the test:

- Retains more data than methods that discard all low-frequency polymorphisms
- Increases statistical power at the gene level
- Decomposes the distribution of fitness effects (DFE) into interpretable fractions

The Imputation Formula
----------------------

Polymorphisms are split at a derived allele frequency (DAF) cutoff (default 15%):

1. Count low-frequency (DAF <= cutoff) and high-frequency (DAF > cutoff) variants separately for nonsynonymous (P\ :sub:`n`) and synonymous (P\ :sub:`s`) classes

2. Compute the neutral ratio from synonymous polymorphisms:

   .. math::

      r = \frac{P_{s,low}}{P_{s,high}}

3. Impute the number of weakly deleterious nonsynonymous polymorphisms:

   .. math::

      P_{wd} = P_{n,low} - P_{n,high} \times r

   This is clamped to >= 0.

4. Compute neutral nonsynonymous polymorphisms:

   .. math::

      P_{n,neutral} = P_n - P_{wd}

5. Calculate corrected alpha:

   .. math::

      \alpha = 1 - \frac{P_{n,neutral}}{P_s} \times \frac{D_s}{D_n}

6. Significance is assessed with Fisher's exact test on the corrected 2x2 table.

DFE Fractions
-------------

When the number of synonymous (m\ :sub:`0`) and nonsynonymous (m\ :sub:`i`) sites are provided, the test decomposes the DFE into four fractions:

- **alpha (a)**: Fraction of adaptive substitutions
- **f**: Fraction of effectively neutral nonsynonymous mutations
- **b**: Fraction of weakly deleterious mutations
- **d**: Fraction of strongly deleterious mutations (d = 1 - f - b)

These fractions sum to 1 and describe the full distribution of fitness effects for new nonsynonymous mutations.

Usage
-----

Single Gene Analysis
^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # Basic imputed MK test (default 15% DAF cutoff)
   mkado test alignment.fa -i ingroup -o outgroup --imputed

   # With custom DAF cutoff (10%)
   mkado test alignment.fa -i ingroup -o outgroup --imputed --min-freq 0.10

   # Separate ingroup/outgroup files
   mkado test ingroup.fa outgroup.fa --imputed

.. note::

   The ``--min-freq`` option is reused as the DAF cutoff when ``--imputed`` is set. If ``--min-freq`` is not specified, the default cutoff of 0.15 (15%) is used.

Batch Analysis (Aggregated)
^^^^^^^^^^^^^^^^^^^^^^^^^^^

For multi-gene analyses, pooling data across genes increases power:

.. code-block:: bash

   # Pool polymorphisms and divergence across all genes
   mkado batch alignments/ -i ingroup -o outgroup --imputed

Batch Analysis (Per-Gene)
^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # Run imputed test separately for each gene
   mkado batch alignments/ -i ingroup -o outgroup --imputed --per-gene

Bootstrap Confidence Intervals
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When ``--bootstrap N`` is set (with N > 0), the imputed test runs an
additional case-resampling bootstrap to produce a 95% CI on alpha. Each
replicate resamples the polymorphism list with replacement and re-runs
the imputed MK algorithm; the 2.5/97.5 percentiles of the resulting
alpha distribution are reported as ``alpha_CI_low`` / ``alpha_CI_high``.
The omega decomposition CIs are derived from the alpha CI by analytical
scaling (``omega_a_ci = alpha_ci * omega``), mirroring the asymptotic
test.

.. code-block:: bash

   # Imputed test with 500-replicate bootstrap
   mkado test alignment.fa -i ingroup -o outgroup --imputed --bootstrap 500

   # Aggregated imputed batch with bootstrap
   mkado batch alignments/ -i ingroup -o outgroup --imputed --bootstrap 200

The legacy default (``--bootstrap 100``) computes the bootstrap; pass
``--bootstrap 0`` to disable CI computation entirely.

Output
------

The imputed test reports:

- **alpha**: Corrected proportion of adaptive substitutions
- **p_value**: Fisher's exact test on the corrected contingency table
- **Pwd**: Imputed count of weakly deleterious nonsynonymous polymorphisms
- **Pn_neutral**: Nonsynonymous polymorphisms after removing imputed slightly deleterious mutations
- **Dn, Ds, Pn, Ps**: Raw counts
- **cutoff**: The DAF cutoff used
- **Ln, Ls**: Nei-Gojobori non-synonymous and synonymous site totals
- **omega, omega_a, omega_na**: dN/dS and the adaptive/non-adaptive
  decomposition (``omega_a = alpha * omega``) following
  `Gossmann, Keightley & Eyre-Walker 2012`_, applied to MK counts by
  `Coronado-Zamora et al. 2019`_. See :doc:`omega` for the decomposition
  formula.
- **alpha_CI_low, alpha_CI_high**: 95% bootstrap CI on alpha (when
  ``--bootstrap > 0``)
- **omega_a_CI_low/high, omega_na_CI_low/high**: 95% CIs on the omega
  decomposition, derived from the alpha CI scaled by omega
- **ci_method**: ``"bootstrap"`` when CI was computed, ``None`` otherwise

Example output (pretty format):

.. code-block:: text

   Imputed MK Test Results:
     Divergence:    Dn=6, Ds=8
     Polymorphism:  Pn=11, Ps=17
     DAF cutoff:    0.15
     Imputed Pwd:   4.82
     Pn (neutral):  6.18
     Alpha:         0.5154
     p-value:       0.0891
     Alpha 95% CI [bootstrap]: (0.3210, 0.7050)
     Sites:         Ln=576.00, Ls=195.00
     omega:         0.2539 (omega_a=0.1308, omega_na=0.1230)
       omega_a 95% CI:  (0.0815, 0.1790)
       omega_na 95% CI: (0.0749, 0.1724)

Comparison with Other Methods
-----------------------------

.. list-table::
   :header-rows: 1
   :widths: 25 25 50

   * - Method
     - Approach
     - Best used when
   * - Standard MK
     - No correction
     - Quick assessment; comparing specific genes
   * - Asymptotic MK
     - Curve fitting across frequency spectrum
     - Genome-wide analyses with many polymorphisms
   * - Standard MK with frequency filtering
     - Discards low-frequency polymorphisms (e.g., ``--min-freq``, ``--no-singletons``)
     - Simple correction, but loses data
   * - **Imputed MK**
     - Imputes slightly deleterious count from synonymous spectrum
     - Gene-level analyses; maximizing statistical power

The imputed test is particularly useful when:

- You want gene-level significance (p-values) rather than only genome-wide estimates
- You want to retain as much data as possible
- You want to decompose the DFE into interpretable fractions

When to Use the Imputed MK Test
-------------------------------

**Use imputed MK when:**

- Analyzing individual genes or small gene sets where power matters
- You want per-gene corrected alpha with p-values
- You want DFE decomposition (with site count information)
- The asymptotic test lacks sufficient data for curve fitting

**Consider alternatives when:**

- You have genome-wide data with thousands of polymorphisms (asymptotic MK may be more robust)
- You want frequency-bin visualization of alpha(x) (use asymptotic MK with ``--plot-asymptotic``)
- You need unbiased multi-gene weighting without frequency modeling (use alpha_TG)

Interpreting Results
--------------------

- **α > 0**: Evidence for positive selection (proportion of adaptive substitutions)
- **α ≈ 0**: Consistent with neutral evolution
- **α < 0**: Excess of nonsynonymous polymorphism relative to divergence, suggesting segregating weakly deleterious mutations remain even after correction
- **Pwd > 0**: Low-frequency nonsynonymous excess detected and corrected
- **Pwd = 0**: No evidence for segregating slightly deleterious mutations (or all polymorphisms are high-frequency)

Choosing the DAF Cutoff
-----------------------

The default cutoff of 15% follows the recommendation of `Murga-Moreno et al. (2022)`_. The cutoff defines the boundary between "low-frequency" and "high-frequency" polymorphisms:

- **Lower cutoff** (e.g., 5%): Only the rarest variants are considered low-frequency. More conservative imputation.
- **Higher cutoff** (e.g., 25%): More variants classified as low-frequency. More aggressive imputation.

The optimal cutoff depends on the effective population size and strength of selection in your system.

Reference
---------

.. _Murga-Moreno et al. (2022): https://doi.org/10.1093/g3journal/jkac206
.. _Gossmann, Keightley & Eyre-Walker 2012: https://doi.org/10.1093/gbe/evs027
.. _Coronado-Zamora et al. 2019: https://doi.org/10.1093/gbe/evz046

Murga-Moreno J, Coronado-Zamora M, Casillas S, Barbadilla A (2022) impMKT: the imputed McDonald and Kreitman test, a straightforward correction that significantly increases the evidence of positive selection of the McDonald and Kreitman test at the gene level. *G3: Genes, Genomes, Genetics* 12(10):jkac206. https://doi.org/10.1093/g3journal/jkac206

Gossmann TI, Keightley PD, Eyre-Walker A (2012) The effect of variation in the effective population size on the rate of adaptive molecular evolution in eukaryotes. *Genome Biology and Evolution* 4(5):658-667. https://doi.org/10.1093/gbe/evs027

Coronado-Zamora M, Salvador-Martínez I, Castellano D, Barbadilla A, Salazar-Ciudad I (2019) Adaptation and conservation throughout the *Drosophila melanogaster* life-cycle. *Genome Biology and Evolution* 11(5):1463-1482. https://doi.org/10.1093/gbe/evz046