Getting Started Tutorial

This tutorial walks through using MKado for McDonald-Kreitman analysis, from running your first test to interpreting results.

Prerequisites

Install MKado (see Installation)
Have aligned coding sequences in FASTA format

The McDonald-Kreitman Test

The MK test (McDonald & Kreitman 1991) compares the ratio of non-synonymous to synonymous changes within species (polymorphism) versus between species (divergence). Under neutral evolution, these ratios should be equal.

The test produces a 2x2 contingency table:


Divergence	Dn	Ds
Polymorphism	Pn	Ps

Your First MK Test

Let’s run a basic MK test using the example data.

Step 1: Examine your data

# Get information about an alignment file
mkado info examples/anopheles_batch/AGAP000078.fa

This shows the number of sequences, sequence lengths, and sequence names.

Step 2: Run the MK test

# Standard MK test: gamb (ingroup) vs afun (outgroup)
mkado test examples/anopheles_batch/AGAP000078.fa -i gamb -o afun

Step 3: Interpret the output

The output shows:

Dn, Ds: Fixed non-synonymous and synonymous differences between species
Pn, Ps: Polymorphic non-synonymous and synonymous sites within species
Fisher’s p-value: Statistical significance of the departure from neutrality
Neutrality Index (NI): (Pn/Ps) / (Dn/Ds) - ratio of ratios
Alpha: Proportion of substitutions driven by positive selection

Interpreting Results

Neutrality Index (NI)

The Neutrality Index (Rand & Kann 1996) measures the direction and degree of departure from neutral evolution:

NI = 1: Consistent with neutral evolution
NI > 1: Excess polymorphism (segregating weakly deleterious variants)
NI < 1: Excess divergence (positive selection)

Alpha

Alpha, the proportion of adaptive substitutions (Smith & Eyre-Walker 2002):

Alpha = 0: No adaptive substitutions
Alpha > 0: Proportion of fixed differences due to positive selection
Alpha < 0: Excess polymorphism relative to divergence

Asymptotic MK Test

The standard MK test can be biased by slightly deleterious mutations that segregate as polymorphisms but rarely reach fixation. These inflate Pn relative to Dn, causing alpha to be underestimated. The asymptotic MK test (Messer & Petrov 2013) addresses this by examining how alpha varies with derived allele frequency in the ingroup.

The key insight is that weakly deleterious mutations are more common at low frequencies (where they haven’t yet been purged by selection) and rare at high frequencies. By extrapolating alpha to a derived frequency of 1.0, we estimate what alpha would be if all deleterious polymorphisms were removed.

Determining Derived Allele Frequency

To calculate derived allele frequency, we need to identify which allele is ancestral (the original state) versus derived (the new mutation). MKado uses the outgroup to polarize mutations:

At each polymorphic codon position in the ingroup, examine which codons are present
Compare with the outgroup codon at the same position
The allele shared between ingroup and outgroup is assumed to be ancestral
Derived frequency = 1.0 − (frequency of ancestral allele in ingroup)

For example, if a codon position has allele A at 80% and allele B at 20% in the ingroup, and the outgroup has allele A, then A is ancestral and the derived allele frequency is 0.20.

Note

Sites where no ingroup allele matches the outgroup cannot be polarized and are excluded from the frequency spectrum analysis.

The Asymptotic Method

Once derived frequencies are computed, the test proceeds as follows:

Bin polymorphisms by derived frequency — Group Pn and Ps into frequency bins (e.g., 0.0–0.05, 0.05–0.10, etc.)
Calculate alpha at each bin — α(x) = 1 − (Ds/Dn) × (Pn(x)/Ps(x))
Fit a curve — Fit a model to the per-bin alpha estimates (see Model Selection below)
Extrapolate to x = 1 — The asymptotic alpha is the fitted value at derived frequency = 1.0

Model Selection

MKado fits two candidate models to the α(x) curve and automatically selects the best one:

Exponential model: α(x) = a + b·exp(−c·x) — 3 parameters
Linear model: α(x) = a + b·x — 2 parameters

The exponential model captures the expected shape when weakly deleterious mutations cause alpha to increase with frequency. However, when data are noisy or sparse, the exponential fit can be unstable.

Model selection follows the asymptoticMK R package convention (Haller & Messer 2017):

If the exponential fit produces a confidence interval width > 100, it is considered unstable and the linear model is used
Otherwise, both models are compared using AIC (Akaike Information Criterion), which balances fit quality against model complexity
The model with the lower AIC is selected
If both fits fail, the highest-frequency bin’s alpha value is reported as a fallback

The output reports which model was selected (exponential or linear) along with the fitted parameters.

# Run asymptotic MK test
mkado test examples/anopheles_batch/AGAP000078.fa -i gamb -o afun -a

# Customize number of frequency bins
mkado test examples/anopheles_batch/AGAP000078.fa -i gamb -o afun -a -b 20

# Switch to the Uricchio et al. 2019 cumulative SFS (more sample-size-stable)
mkado test examples/anopheles_batch/AGAP000078.fa -i gamb -o afun -a --sfs-mode above

The output includes:

Alpha asymptotic: Extrapolated alpha value at derived frequency = 1
95% CI: Confidence interval for alpha asymptotic
ci_method: "monte-carlo" (default — parametric MVN sampling from the fit covariance) or "bootstrap" (case-resampling). Switch with --ci-method bootstrap. See Asymptotic MK Test for the trade-offs.
sfs_mode: "at" (default, per-bin SFS) or "above" (cumulative SFS as in Uricchio et al. 2019). See Asymptotic MK Test for the difference.
Model type: Whether exponential or linear model was selected
Per-bin alpha values: Alpha estimates at each frequency class
omega, omega_a, omega_na: dN/dS and its adaptive / non-adaptive decomposition, with 95% CIs on omega_a and omega_na derived by scaling the alpha CI. See Omega Decomposition (ω, ω_a, ω_na) for details.

Allele Polarization

Polarization is the process of determining which allele is ancestral (the original state) versus derived (a new mutation). This distinction is important for several reasons:

Calculating derived allele frequency for the asymptotic MK test
Filtering polymorphisms by frequency (--min-freq, --no-singletons)
Assigning mutations to specific lineages (polarized MK test)

MKado uses outgroup sequences to polarize alleles. The method differs slightly depending on the test type.

Polarization in Standard MK and Asymptotic Tests

In the standard and asymptotic MK tests, MKado uses a single outgroup to identify ancestral alleles:

For each polymorphic codon position in the ingroup, MKado identifies which codons are present and their frequencies
Compare with the outgroup — find codons that are shared between the ingroup and outgroup
The shared allele is ancestral — if multiple ingroup alleles match outgroup alleles, the most frequent one is chosen as ancestral
Calculate derived frequency — derived_freq = 1.0 − frequency(ancestral)

Example:

Consider a codon position where the ingroup has:

Codon AAA at 75% frequency
Codon AAG at 25% frequency

If the outgroup has codon AAA, then:

AAA is the ancestral allele (shared with outgroup)
AAG is the derived allele (new mutation in ingroup)
Derived allele frequency = 0.25

Unpolarizable sites:

Some polymorphic sites cannot be polarized:

No shared allele: If no ingroup codon matches any outgroup codon, we cannot determine which is ancestral. This can happen when the outgroup is distantly related or multiple mutations have occurred.
Ambiguous data: Codons containing ambiguous bases (N) or gaps (-) are excluded from comparison.

Unpolarizable sites are excluded from frequency-based analyses (asymptotic test, --min-freq filtering) but are still counted in the standard MK test’s Pn/Ps totals.

Polarized MK Test

The standard MK test counts fixed differences between ingroup and outgroup but cannot determine which lineage the mutation occurred on. Did the ingroup change from the ancestral state, or did the outgroup?

The polarized MK test uses a second outgroup (typically a more distantly related species) to answer this question:

# Use a second outgroup for polarization
mkado test alignment.fa -i ingroup -o outgroup1 --polarize-match outgroup2

Polarizing Fixed Differences

For each fixed difference between ingroup and outgroup1, MKado applies a majority rule using outgroup2:

Scenario	Ancestral State	Interpretation
outgroup1 = outgroup2	The shared outgroup codon	Mutation on ingroup lineage
ingroup = outgroup2	The ingroup codon	Mutation on outgroup1 lineage
All three differ	Cannot determine	Unpolarizable — excluded from lineage-specific counts

Example:

Consider a codon position with:

Ingroup: GCT (Ala)
Outgroup1: GGT (Gly)
Outgroup2: GGT (Gly)

Both outgroups share GGT, so GGT is likely ancestral. The GCT→GGT change occurred on the ingroup lineage.

Now consider:

Ingroup: GCT (Ala)
Outgroup1: GGT (Gly)
Outgroup2: GCT (Ala)

The ingroup matches the more distant outgroup2, so GCT is likely ancestral. The GCT→GGT change occurred on the outgroup1 lineage.

Polarizing Polymorphisms

Polymorphisms in the ingroup are also polarized to determine if they arose on the ingroup lineage. The logic mirrors fixed difference polarization:

Check for ancestral polymorphism: If outgroup2 contains ALL the ingroup alleles, the polymorphism existed at the common ancestor and cannot be attributed to the ingroup lineage → excluded
Determine ancestral state: If outgroup1 and outgroup2 share an allele, that allele is ancestral
Verify derived allele: If the ingroup has alleles not present in outgroup2, those are derived on the ingroup lineage → counted as ingroup polymorphism
Cannot polarize: If no clear ancestral state can be determined → excluded

Example:

Consider a polymorphic site where:

Ingroup has: AAA (60%), AAG (40%)
Outgroup1 has: AAA
Outgroup2 has: AAA

Outgroup1 and outgroup2 both have AAA, so AAA is ancestral. The AAG allele (40%) is derived on the ingroup lineage. This is an ingroup polymorphism.

Now consider:

Ingroup has: AAA (60%), AAG (40%)
Outgroup1 has: AAA
Outgroup2 has: AAA, AAG

Outgroup2 has BOTH ingroup alleles. This polymorphism existed at the common ancestor — it’s not derived on the ingroup lineage. This polymorphism is excluded from the ingroup count.

Interpreting Polarized Results

The polarized MK test reports separate statistics for each lineage:

Ingroup lineage: Dn, Ds, Pn, Ps, alpha — mutations that occurred on the ingroup branch
Outgroup lineage: Dn, Ds — mutations that occurred on the outgroup1 branch
Unpolarized: Dn, Ds, Pn, Ps — changes that could not be assigned to a lineage

Unpolarized polymorphisms include:

Ancestral polymorphisms (present in outgroup2)
Polymorphisms where the ancestral state cannot be determined

Only the ingroup lineage statistics are used to calculate alpha. This provides a cleaner estimate of selection on the ingroup lineage by excluding ancestral variation.

Why Use Polarization?

Lineage-specific analysis is useful when:

You want to test for selection specifically on the ingroup lineage
The outgroup1 may have experienced different selective pressures
You need to distinguish ancestral polymorphism from derived changes

Phylogenetic requirements:

outgroup2 ─────────┐
                   ├─── root
outgroup1 ─────┐   │
               ├───┘
ingroup ───────┘

Outgroup2 should be more distantly related than outgroup1. The more distant outgroup2 helps establish the ancestral state at the root.

Frequency Filtering Options

MKado provides two different frequency-related options that serve distinct purposes:

`--min-freq` (Standard MK, Polarized MK, and α_TG)

This option excludes rare polymorphisms below a frequency threshold from the Pn/Ps counts.

Applies to: Standard MK test, Polarized MK test, α_TG estimator
Purpose: Filter out singletons or very rare variants that may be sequencing errors or very recent mutations
How it works: Polymorphisms with derived allele frequency < min_freq are not counted

# Exclude polymorphisms below 5% frequency
mkado test alignment.fa -i dmel -o dsim --min-freq 0.05

# Also works with alpha-tg
mkado batch alignments/ -i dmel -o dsim --alpha-tg --min-freq 0.05

Note

--min-freq cannot be used with --asymptotic. The asymptotic test uses --freq-cutoffs for frequency filtering.

`--no-singletons` (Standard MK, Polarized MK, and α_TG)

A convenience option that automatically sets --min-freq to exclude singletons (variants appearing only once).

Applies to: Standard MK test, Polarized MK test, α_TG estimator
How it works: Calculates 1/n where n is the sample size, and uses that as the minimum frequency threshold
Sample size: Uses ingroup count, or ingroup + outgroup if --pool-polymorphisms is enabled

# Exclude singletons automatically
mkado test alignment.fa -i dmel -o dsim --no-singletons

# For batch processing (threshold calculated per gene)
mkado batch alignments/ -i dmel -o dsim --no-singletons

# Also works with alpha-tg
mkado batch alignments/ -i dmel -o dsim --alpha-tg --no-singletons

Note

--no-singletons cannot be used with --min-freq (they are mutually exclusive) or --asymptotic.

`--freq-cutoffs` (Asymptotic MK test)

This option defines the frequency range used for curve fitting in the asymptotic test — it does not exclude data.

Applies to: Aggregated asymptotic MK test (mkado batch -a)
Default: 0.1,0.9
Purpose: Avoid fitting to extreme frequency bins where data may be sparse or noisy
How it works:
1. All polymorphisms are counted and binned by derived frequency
2. Only bins within the [low, high] range are used to fit the exponential/linear model
3. The curve is still extrapolated to frequency = 1.0

# Fit model using only bins between 15% and 85% frequency
mkado batch alignments/ -i dmel -o dsim -a --freq-cutoffs 0.15,0.85

Key Differences

Aspect	`--min-freq`	`--freq-cutoffs`
Effect	Excludes polymorphisms from counts	Restricts which bins are used for fitting
Scope	Per-polymorphism filtering	Curve-fitting range only
Data	Polymorphisms below threshold not counted	All polymorphisms counted; fitting uses subset
Analysis types	Standard MK, Polarized MK, α_TG	Asymptotic MK (aggregated batch mode)

Genetic Code Tables

By default, MKado uses the standard genetic code. For analyses involving mitochondrial or non-standard nuclear genomes, use --code-table to specify an alternate code by name or NCBI table ID:

# By name
mkado test alignment.fa -i species1 -o species2 --code-table vertebrate-mito
mkado batch alignments/ -i species1 -o species2 --code-table invertebrate-mito

# By NCBI table ID
mkado test alignment.fa -i species1 -o species2 --code-table 2

# List all available codes
mkado codes

The genetic code affects codon translation, synonymous site counting, and the classification of changes as synonymous or non-synonymous.

Available Codes

ID	Name	`--code-table` alias
1	Standard	`standard`
2	Vertebrate Mitochondrial	`vertebrate-mito`
3	Yeast Mitochondrial	`yeast-mito`
4	Mold / Protozoan / Coelenterate Mitochondrial; Mycoplasma; Spiroplasma	`mold-mito`, `protozoan-mito`, `mycoplasma`
5	Invertebrate Mitochondrial	`invertebrate-mito`
6	Ciliate, Dasycladacean and Hexamita Nuclear	`ciliate`
9	Echinoderm and Flatworm Mitochondrial	`echinoderm-mito`, `flatworm-mito`
10	Euplotid Nuclear	`euplotid`
11	Bacterial, Archaeal and Plant Plastid	`bacterial`, `archaeal`, `plant-plastid`
12	Alternative Yeast Nuclear	`alt-yeast`
13	Ascidian Mitochondrial	`ascidian-mito`
14	Alternative Flatworm Mitochondrial	`alt-flatworm-mito`
16	Chlorophycean Mitochondrial	`chlorophycean-mito`
21	Trematode Mitochondrial	`trematode-mito`
22	Scenedesmus obliquus Mitochondrial	`scenedesmus-mito`
23	Thraustochytrium Mitochondrial	`thraustochytrium-mito`
24	Rhabdopleuridae Mitochondrial	`rhabdopleuridae-mito`
25	Candidate Division SR1 and Gracilibacteria	`sr1`, `gracilibacteria`
26	Pachysolen tannophilus Nuclear	`pachysolen`
27	Karyorelictea Nuclear	`karyorelictea`
29	Mesodinium Nuclear	`mesodinium`
30	Peritrich Nuclear	`peritrich`
31	Blastocrithidia Nuclear	`blastocrithidia`
33	Cephalodiscidae Mitochondrial UAA-Tyr	`cephalodiscidae-mito`

See the full NCBI genetic code reference for codon assignment details.

Output Formats

MKado supports multiple output formats:

# Pretty-printed output (default)
mkado test alignment.fa -i species1 -o species2

# Tab-separated values
mkado test alignment.fa -i species1 -o species2 -f tsv

# JSON format
mkado test alignment.fa -i species1 -o species2 -f json

Working with VCF Data

If your starting point is a variant calling pipeline (e.g., GATK, bcftools) rather than pre-aligned coding sequences, you can skip FASTA alignment entirely and use mkado vcf. This mode takes an ingroup VCF, an outgroup VCF, a reference genome, and a GFF3 annotation, and supports all the same analyses described above: standard MK, asymptotic, imputed, and alpha-TG.

mkado vcf \
    --vcf population.vcf.gz \
    --outgroup-vcf outgroup.vcf.gz \
    --ref reference.fa \
    --gff annotation.gff3

See VCF Input Mode for the full guide, including data preparation, gene filtering, plotting, and all available options.

Working with Ortholog Alignments (OrthoMaM-style projection)

A common scenario when working with non-model organisms is that one has codon-aligned ortholog sets from a database such as OrthoMaM (mammals), Ensembl Compara, or a custom alignment, plus a separate population-genetic VCF for the ingroup species. To use these together with MKado, the polymorphism data must be projected onto the alignment columns. The procedure below mirrors the one we used to assemble the 12,437-gene human benchmark dataset and is general enough to apply to any codon-level ortholog alignment plus a coordinate-resolved VCF.

For each gene, do the following:

Reconstruct the canonical CDS for the ingroup species from the reference genome (FASTA) and exon set (GTF/GFF3), so you have a position-by-position mapping from CDS index to genomic coordinate. For minus-strand genes, reverse-complement as you build the CDS.
Match the alignment to the CDS. Strip gaps from the ingroup row of the ortholog alignment, then look for it as a substring of the reconstructed canonical CDS. The position of the match gives you the alignment-column to CDS-position correspondence. Genes whose alignments do not match the reference CDS exactly (commonly the result of independent curation between alignment database and reference) should be skipped.
Extract overlapping variants from the VCF using the genomic coordinates from step 1 for the alignment-retained CDS positions. Verify that the VCF reference allele matches the genomic reference; for minus-strand genes, complement the VCF alleles to match the CDS orientation.
Build per-haplotype sequences. From the phased VCF genotypes, write out one CDS sequence per haplotype with variants substituted in. This step yields 2N (or N, for haploid samples) haplotype CDS sequences over the alignment- retained positions.
Re-gap to the alignment. Insert the gap structure of the ortholog alignment into each haplotype sequence, so that all haplotypes are codon-aligned to one another and to the outgroup sequences from the alignment. Outgroup sequences are taken directly from the alignment.

The output of this pipeline is a per-gene multi-species FASTA file whose sequences are codon-aligned, with the ingroup haplotypes filterable by sample-name pattern and the outgroup species filterable separately—i.e., exactly the input format mkado batch expects.

Indels in the alignment are preserved (they remain as gap characters in the output FASTA). Multi-transcript genes are reduced to the canonical CDS chosen by the alignment database; alternative transcripts are not analysed separately. The script we used for the human benchmark dataset (data/build_mk_dataset.py in the source repository) can serve as a starting point for adapting this to other species.

Next Steps

Learn about Batch Processing Workflow for processing multiple genes from FASTA
Use VCF Input Mode for VCF-based MK analysis
Explore Asymptotic MK Test for the full asymptotic MK methodology
Understand Tarone-Greenland Alpha (α_TG) for weighted multi-gene estimates
Review File Formats and Input Requirements for input requirements
Explore the Python API Reference for programmatic access

Getting Started Tutorial

Prerequisites

The McDonald-Kreitman Test

Your First MK Test

Interpreting Results

Neutrality Index (NI)

Alpha

Asymptotic MK Test

Determining Derived Allele Frequency

The Asymptotic Method

Model Selection

Allele Polarization

Polarization in Standard MK and Asymptotic Tests

Polarized MK Test

Polarizing Fixed Differences

Polarizing Polymorphisms

Interpreting Polarized Results

Why Use Polarization?

Frequency Filtering Options

--min-freq (Standard MK, Polarized MK, and α_TG)

--no-singletons (Standard MK, Polarized MK, and α_TG)

--freq-cutoffs (Asymptotic MK test)

Key Differences

Genetic Code Tables

Available Codes

Output Formats

Working with VCF Data

Working with Ortholog Alignments (OrthoMaM-style projection)

Next Steps

References

`--min-freq` (Standard MK, Polarized MK, and α_TG)

`--no-singletons` (Standard MK, Polarized MK, and α_TG)

`--freq-cutoffs` (Asymptotic MK test)