Python API Reference

MKado provides a Python API for programmatic access to MK test functionality.

Quick Start

from mkado import mk_test, asymptotic_mk_test, SequenceSet

# Run standard MK test
result = mk_test("ingroup.fa", "outgroup.fa")
print(f"Alpha: {result.alpha}")
print(f"P-value: {result.p_value}")

# Run asymptotic MK test
result = asymptotic_mk_test("ingroup.fa", "outgroup.fa")
print(f"Asymptotic Alpha: {result.alpha_asymptotic}")

Core Classes

SequenceSet

Sequence and SequenceSet classes for handling aligned sequences.

class mkado.core.sequences.Sequence(name, sequence)[source]

Bases: object

Represents a single biological sequence.

Parameters:

name (str)
sequence (str)

name: str

sequence: str

get_codon(codon_index, reading_frame=1)[source]

Get the codon at a given index.

Parameters:

codon_index (int) – Zero-based codon index
reading_frame (int) – Reading frame (1, 2, or 3)

Return type:

str

Returns:

Three-letter codon string

num_codons(reading_frame=1)[source]

Get the number of complete codons in the sequence.

Return type:: int
Parameters:: reading_frame (int)

__init__(name, sequence)

Parameters:

name (str)
sequence (str)

Return type:

None

class mkado.core.sequences.SequenceSet(sequences=<factory>, reading_frame=1, genetic_code=<factory>)[source]

Bases: object

Represents a set of aligned sequences (e.g., from one species).

Parameters:

sequences (list[Sequence])
reading_frame (int)
genetic_code (GeneticCode)

sequences: list[Sequence]

reading_frame: int = 1

genetic_code: GeneticCode

classmethod from_fasta(path, reading_frame=1, genetic_code=None)[source]

Load sequences from a FASTA file.

Parameters:

path (str | Path) – Path to FASTA file
reading_frame (int) – Reading frame (1, 2, or 3)
genetic_code (GeneticCode | None) – Genetic code for translation

Return type:

SequenceSet

Returns:

SequenceSet containing all sequences from the file

property num_codons: int: Number of complete codons (based on first sequence).

property alignment_length: int: Length of the alignment (based on first sequence).

codon_set(codon_index)[source]

Get the set of unique codons at a position across all sequences.

Parameters:: codon_index (int) – Zero-based codon index
Return type:: set[str]
Returns:: Set of unique codon strings

codon_set_clean(codon_index)[source]

Get unique codons at a position, excluding those with N or gaps.

Result is cached per codon_index for the lifetime of the SequenceSet. Callers must not mutate the returned set.

Parameters:: codon_index (int) – Zero-based codon index
Return type:: set[str]
Returns:: Set of unique valid codon strings

amino_set(codon_index)[source]

Get the set of unique amino acids at a codon position.

Parameters:: codon_index (int) – Zero-based codon index
Return type:: set[str]
Returns:: Set of unique amino acid characters

amino_set_clean(codon_index)[source]

Get unique amino acids at a position, excluding ambiguous codons.

Parameters:: codon_index (int) – Zero-based codon index
Return type:: set[str]
Returns:: Set of unique amino acid characters

is_polymorphic(codon_index)[source]

Check if a codon position is polymorphic (has >1 unique codon).

Parameters:: codon_index (int) – Zero-based codon index
Return type:: bool
Returns:: True if polymorphic

site_set(site_index)[source]

Get the set of unique nucleotides at a site across all sequences.

Parameters:: site_index (int) – Zero-based nucleotide position
Return type:: set[str]
Returns:: Set of unique nucleotide characters

site_set_clean(site_index)[source]

Get unique nucleotides at a site, excluding N and gaps.

Parameters:: site_index (int) – Zero-based nucleotide position
Return type:: set[str]
Returns:: Set of unique valid nucleotide characters

codon_array()[source]

Get a 2D array of codons (n_sequences x n_codons).

Return type:: ndarray
Returns:: NumPy array of codon strings

polymorphic_codons()[source]

Get indices of polymorphic codon positions.

Return type:: list[int]
Returns:: List of codon indices that are polymorphic

site_frequency_spectrum(codon_index)[source]

Calculate allele frequencies at a codon position.

Parameters:: codon_index (int) – Zero-based codon index
Return type:: dict[str, float]
Returns:: Dict mapping codon strings to their frequencies (0-1)

derived_allele_frequency(codon_index, ancestral_codon)[source]

Calculate the derived allele frequency at a codon position.

Parameters:

codon_index (int) – Zero-based codon index
ancestral_codon (str) – The ancestral (outgroup) codon

Return type:

float | None

Returns:

Frequency of derived alleles (0-1), or None if not computable

__init__(sequences=<factory>, reading_frame=1, genetic_code=<factory>)

Parameters:

sequences (list[Sequence])
reading_frame (int)
genetic_code (GeneticCode)

Return type:

None

filter_by_name(pattern)[source]

Filter sequences by name pattern (substring match).

Parameters:: pattern (str) – Substring to match in sequence names
Return type:: SequenceSet
Returns:: New SequenceSet containing only matching sequences

GeneticCode

Genetic code and codon utilities.

class mkado.core.codons.GeneticCode(code=None, *, table_id=None)[source]

Bases: object

Represents a genetic code for translation and codon path computation.

Parameters:

code (dict[str, str] | None)
table_id (int | None)

__init__(code=None, *, table_id=None)[source]

Initialize with a codon-to-amino-acid mapping.

Parameters:

code (dict[str, str] | None) – Dict mapping codons to single-letter amino acids. Must be a complete mapping of all 64 codons. Uses standard code if neither code nor table_id is provided.
table_id (int | None) – NCBI genetic code table ID (1-33). If provided, overrides code.

translate(codon)[source]

Translate a codon to its amino acid.

Parameters:: codon (str) – Three-letter codon string
Return type:: str
Returns:: Single-letter amino acid code, or ‘X’ for unknown

translate_sequence(sequence, reading_frame=1)[source]

Translate a nucleotide sequence to amino acids.

Parameters:

sequence (str) – Nucleotide sequence string
reading_frame (int) – Reading frame (1, 2, or 3)

Return type:

str

Returns:

Amino acid sequence string

get_path(codon1, codon2)[source]

Get the shortest mutational path between two codons.

Parameters:

codon1 (str) – Starting codon
codon2 (str) – Ending codon

Return type:

list[tuple[str, int]]

Returns:

List of (change_type, position) tuples where change_type is ‘R’ for replacement (non-synonymous) or ‘S’ for synonymous, and position is the codon position (0, 1, or 2).

count_synonymous_sites(codon)[source]

Count the number of synonymous sites in a codon.

Uses Nei-Gojobori method: for each site, calculate fraction of possible changes that are synonymous.

Parameters:: codon (str) – Three-letter codon string
Return type:: float
Returns:: Number of synonymous sites (0-3)

is_synonymous_change(codon1, codon2)[source]

Check if a single-nucleotide change is synonymous.

Parameters:

codon1 (str) – Original codon
codon2 (str) – Changed codon

Return type:

bool | None

Returns:

True if synonymous, False if non-synonymous, None if not a single-nucleotide change or invalid codons

AlignedPair

Alignment comparison utilities for MK test.

class mkado.core.alignment.AlignedPair(ingroup, outgroup, genetic_code=<factory>)[source]

Bases: object

Represents a pair of aligned sequence sets (ingroup and outgroup).

Parameters:

ingroup (SequenceSet)
outgroup (SequenceSet)
genetic_code (GeneticCode)

ingroup: SequenceSet

outgroup: SequenceSet

genetic_code: GeneticCode

property num_codons: int: Number of codons in the alignment.

combined_codon_set(codon_index)[source]

Get all unique codons at a position from both groups.

Parameters:: codon_index (int) – Zero-based codon index
Return type:: set[str]
Returns:: Set of unique codon strings from both ingroup and outgroup

combined_codon_set_clean(codon_index)[source]

Get all clean unique codons at a position from both groups.

Parameters:: codon_index (int) – Zero-based codon index
Return type:: set[str]
Returns:: Set of unique valid codon strings

is_fixed_between(codon_index)[source]

Check if a codon position is fixed differently between groups.

A position is a fixed difference if: - The ingroup has a single codon (or all share the same) - The outgroup has a single codon (or all share the same) - The ingroup and outgroup codons are different

Parameters:: codon_index (int) – Zero-based codon index
Return type:: bool
Returns:: True if this is a fixed difference between groups

is_polymorphic_within_ingroup(codon_index)[source]

Check if a codon position is polymorphic within the ingroup only.

Parameters:: codon_index (int) – Zero-based codon index
Return type:: bool
Returns:: True if polymorphic within ingroup but not a fixed difference

fixed_differences()[source]

Get all codon indices with fixed differences between groups.

Return type:: list[int]
Returns:: List of codon indices

polymorphic_sites_ingroup()[source]

Get all codon indices polymorphic within the ingroup.

Return type:: list[int]
Returns:: List of codon indices

polymorphic_sites_outgroup()[source]

Get all codon indices polymorphic within the outgroup.

Return type:: list[int]
Returns:: List of codon indices

polymorphic_sites_pooled()[source]

Get all codon indices polymorphic in either population (union).

This follows the libsequence convention of pooling polymorphisms from both populations.

Return type:: list[int]
Returns:: List of unique codon indices (sorted)

count_total_sites()[source]

Aggregate total non-synonymous (Ln) and synonymous (Ls) sites.

For each codon position with at least one clean codon in each group, averages the Nei-Gojobori per-codon synonymous site counts within each group, then averages the two group means. Sums across positions and returns (Ln, Ls) with Ln = 3 * n_codons - Ls.

Return type:: tuple[float, float]

classify_fixed_difference(codon_index)[source]

Classify a fixed difference as synonymous/non-synonymous.

Uses the shortest mutational path to count the minimum number of synonymous and non-synonymous changes.

Parameters:: codon_index (int) – Zero-based codon index
Return type:: tuple[int, int] | None
Returns:: Tuple of (non_synonymous_count, synonymous_count), or None if not a valid fixed difference

classify_polymorphism(codon_index)[source]

Classify a polymorphism as synonymous/non-synonymous.

Parameters:: codon_index (int) – Zero-based codon index
Return type:: tuple[int, int] | None
Returns:: Tuple of (non_synonymous_count, synonymous_count), or None if not a valid polymorphism

classify_polymorphism_pooled(codon_index)[source]

Classify a polymorphism using codons from both populations.

This follows the libsequence convention of using all unique codons from both ingroup and outgroup when classifying polymorphisms.

Parameters:: codon_index (int) – Zero-based codon index
Return type:: tuple[int, int] | None
Returns:: Tuple of (non_synonymous_count, synonymous_count), or None if not a valid polymorphism

__init__(ingroup, outgroup, genetic_code=<factory>)

Parameters:

ingroup (SequenceSet)
outgroup (SequenceSet)
genetic_code (GeneticCode)

Return type:

None

class mkado.core.alignment.PolarizedAlignedPair(ingroup, outgroup, genetic_code=<factory>, outgroup2=None)[source]

Bases: AlignedPair

Aligned pair with a second outgroup for polarization.

Parameters:

ingroup (SequenceSet)
outgroup (SequenceSet)
genetic_code (GeneticCode)
outgroup2 (SequenceSet | None)

outgroup2: SequenceSet | None = None

polarize_fixed_difference(codon_index)[source]

Polarize a fixed difference to determine which lineage changed.

Uses the second outgroup to determine the ancestral state.

Parameters:: codon_index (int) – Zero-based codon index
Return type:: tuple[str, tuple[int, int]] | None
Returns:: Tuple of (lineage, (nonsyn, syn)) where lineage is ‘ingroup’ or ‘outgroup’, or None if cannot be polarized

polarize_ingroup_polymorphism(codon_index)[source]

Polarize and classify an ingroup polymorphism.

Uses outgroup2 to determine if the polymorphism arose on the ingroup lineage. Following the convention from mkTest.rb:

If outgroup2 has ALL ingroup alleles → ancestral polymorphism, cannot attribute to ingroup lineage
If outgroup2 shares allele with outgroup1 → the shared allele is ancestral, ingroup has derived allele(s) → ingroup polymorphism
If outgroup2 shares some (but not all) alleles with ingroup → shared allele is ancestral → ingroup polymorphism
Otherwise → cannot polarize

Parameters:: codon_index (int) – Zero-based codon index
Return type:: tuple[int, int] | None
Returns:: Tuple of (non_synonymous_count, synonymous_count) if the polymorphism is derived in ingroup, None if cannot polarize

__init__(ingroup, outgroup, genetic_code=<factory>, outgroup2=None)

Parameters:

ingroup (SequenceSet)
outgroup (SequenceSet)
genetic_code (GeneticCode)
outgroup2 (SequenceSet | None)

Return type:

None

Analysis Functions

Standard MK Test

Standard McDonald-Kreitman test implementation.

class mkado.analysis.mk_test.MKResult(dn, ds, pn, ps, p_value, ni, alpha, dos, ln=None, ls=None, omega=None)[source]

Bases: object

Results from a McDonald-Kreitman test.

Site counts follow the standard MK definitions:

Dn / Ds: number of fixed differences between ingroup and outgroup that change (Dn) or do not change (Ds) the encoded amino acid under the supplied genetic code.
Pn / Ps: number of ingroup polymorphic sites whose derived allele causes a non-synonymous (Pn) or synonymous (Ps) substitution. Sites are filtered by min_frequency (default 0.0); the --no-singletons CLI flag sets this threshold to 1/n. When pool_polymorphisms is true, polymorphic sites in the outgroup are pooled into the ingroup counts.

Parameters:

dn (int)
ds (int)
pn (int)
ps (int)
p_value (float)
ni (float | None)
alpha (float | None)
dos (float | None)
ln (float | None)
ls (float | None)
omega (float | None)

dn: int: Non-synonymous fixed differences.

ds: int: Synonymous fixed differences.

pn: int: Non-synonymous polymorphisms in the ingroup (after frequency filtering).

ps: int: Synonymous polymorphisms in the ingroup (after frequency filtering).

p_value: float: Fisher’s exact test p-value on the 2x2 contingency table.

ni: float | None: Neutrality Index (Pn/Ps) / (Dn/Ds); None if any count is zero.

alpha: float | None: Proportion of adaptive substitutions 1 - NI.

__init__(dn, ds, pn, ps, p_value, ni, alpha, dos, ln=None, ls=None, omega=None)

Parameters:

dn (int)
ds (int)
pn (int)
ps (int)
p_value (float)
ni (float | None)
alpha (float | None)
dos (float | None)
ln (float | None)
ls (float | None)
omega (float | None)

Return type:

None

dos: float | None: Direction of Selection Dn/(Dn+Ds) - Pn/(Pn+Ps) (Stoletzki & Eyre-Walker 2011).

ln: float | None = None: Total non-synonymous sites (Nei-Gojobori) over analyzed codons.

ls: float | None = None: Total synonymous sites (Nei-Gojobori) over analyzed codons.

omega: float | None = None

(Dn/Ds) * (Ls/Ln) — dN/dS ratio. None when undefined.

Note: this result class deliberately omits omega_a / omega_na; the per-gene Smith & Eyre-Walker alpha is too noisy to give a useful rate decomposition. See docs/omega.rst for the rationale.

to_dict()[source]

Convert results to a dictionary.

Return type:: dict

mkado.analysis.mk_test.mk_test(ingroup, outgroup, reading_frame=1, genetic_code=None, pool_polymorphisms=False, min_frequency=0.0)[source]

Perform the standard McDonald-Kreitman test.

The MK test compares the ratio of non-synonymous to synonymous changes within species (polymorphism) to the ratio between species (divergence).

Under neutrality, these ratios should be equal. Deviations suggest selection: - Excess divergence relative to polymorphism suggests positive selection - Excess polymorphism relative to divergence suggests segregating weakly deleterious polymorphisms

Parameters:

ingroup (SequenceSet | str | Path) – SequenceSet or path to FASTA file for ingroup sequences
outgroup (SequenceSet | str | Path) – SequenceSet or path to FASTA file for outgroup sequences
reading_frame (int) – Reading frame (1, 2, or 3)
genetic_code (GeneticCode | None) – Genetic code for translation (uses standard if None)
pool_polymorphisms (bool) – If True, count polymorphisms from both ingroup and outgroup (libsequence convention). If False (default), count only ingroup polymorphisms (DnaSP/original MK convention).
min_frequency (float) – Minimum derived allele frequency for a polymorphism to be counted (default 0.0, i.e., include all polymorphisms). Useful for filtering out rare variants that may be slightly deleterious.

Return type:

MKResult

Returns:

MKResult containing test statistics

mkado.analysis.mk_test.mk_test_from_counts(dn, ds, pn, ps, ln=None, ls=None)[source]

Create MK test results from pre-computed counts.

Useful for testing or when counts are already available.

Parameters:

dn (int) – Non-synonymous divergence
ds (int) – Synonymous divergence
pn (int) – Non-synonymous polymorphisms
ps (int) – Synonymous polymorphisms
ln (float | None) – Total non-synonymous sites (optional, enables omega computation)
ls (float | None) – Total synonymous sites (optional, enables omega computation)

Return type:

MKResult

Returns:

MKResult with calculated statistics

Asymptotic MK Test

Asymptotic McDonald-Kreitman test implementation.

Based on Messer & Petrov (2013) PNAS.

class mkado.analysis.asymptotic.PolymorphismData(polymorphisms=<factory>, dn=0, ds=0, gene_id='', ln=None, ls=None)[source]

Bases: object

Intermediate data from a single gene for aggregation.

Parameters:

polymorphisms (list[tuple[float, str]])
dn (int)
ds (int)
gene_id (str)
ln (float | None)
ls (float | None)

polymorphisms: list[tuple[float, str]]: List of (derived_frequency, type) where type is ‘N’ or ‘S’.

dn: int = 0

ds: int = 0

gene_id: str = ''

ln: float | None = None: Total non-synonymous sites (Nei-Gojobori) over analyzed codons.

ls: float | None = None: Total synonymous sites (Nei-Gojobori) over analyzed codons.

__init__(polymorphisms=<factory>, dn=0, ds=0, gene_id='', ln=None, ls=None)

Parameters:

polymorphisms (list[tuple[float, str]])
dn (int)
ds (int)
gene_id (str)
ln (float | None)
ls (float | None)

Return type:

None

mkado.analysis.asymptotic.sum_site_totals(gene_data)[source]

Sum per-gene Nei-Gojobori (Ln, Ls); return (None, None) if any gene lacks them.

Return type:: tuple[float | None, float | None]
Parameters:: gene_data (list[PolymorphismData])

class mkado.analysis.asymptotic.AggregatedSFS(bin_edges=<factory>, pn_counts=<factory>, ps_counts=<factory>, dn_total=0, ds_total=0, num_genes=0, ln_total=None, ls_total=None, pn_total=0, ps_total=0, sfs_mode='at')[source]

Bases: object

Aggregated site frequency spectrum across genes.

Parameters:

bin_edges (ndarray)
pn_counts (ndarray)
ps_counts (ndarray)
dn_total (int)
ds_total (int)
num_genes (int)
ln_total (float | None)
ls_total (float | None)
pn_total (int)
ps_total (int)
sfs_mode (str)

bin_edges: ndarray

pn_counts: ndarray

per-bin or cumulative tail).

Type:: Pn per bin (semantics depend on sfs_mode

ps_counts: ndarray

per-bin or cumulative tail).

Type:: Ps per bin (semantics depend on sfs_mode

dn_total: int = 0

ds_total: int = 0

num_genes: int = 0

ln_total: float | None = None: Sum of per-gene Nei-Gojobori non-synonymous sites.

ls_total: float | None = None: Sum of per-gene Nei-Gojobori synonymous sites.

pn_total: int = 0: Total non-synonymous polymorphisms (raw count, mode-invariant).

ps_total: int = 0: Total synonymous polymorphisms (raw count, mode-invariant).

sfs_mode: str = 'at': SFS construction mode used to build pn_counts/ps_counts.

__init__(bin_edges=<factory>, pn_counts=<factory>, ps_counts=<factory>, dn_total=0, ds_total=0, num_genes=0, ln_total=None, ls_total=None, pn_total=0, ps_total=0, sfs_mode='at')

Parameters:

bin_edges (ndarray)
pn_counts (ndarray)
ps_counts (ndarray)
dn_total (int)
ds_total (int)
num_genes (int)
ln_total (float | None)
ls_total (float | None)
pn_total (int)
ps_total (int)
sfs_mode (str)

Return type:

None

class mkado.analysis.asymptotic.AsymptoticMKResult(frequency_bins=<factory>, alpha_by_freq=<factory>, alpha_x_values=<factory>, alpha_asymptotic=0.0, ci_low=0.0, ci_high=0.0, fit_a=0.0, fit_b=0.0, fit_c=0.0, dn=0, ds=0, num_genes=0, model_type='exponential', pn_total=0, ps_total=0, ln=None, ls=None, omega=None, omega_a=None, omega_na=None, ci_method='monte-carlo', sfs_mode='at')[source]

Bases: object

Results from an asymptotic McDonald-Kreitman test.

Pn_total and Ps_total follow the standard MK definition: ingroup polymorphic sites whose derived allele causes a non-synonymous (Pn) or synonymous (Ps) substitution. The asymptotic test uses the full SFS with no min_frequency filter; frequency cutoffs apply only to the alpha(x) curve fit, not to the SFS counts themselves.

Parameters:

frequency_bins (list[float])
alpha_by_freq (list[float])
alpha_x_values (list[float])
alpha_asymptotic (float)
ci_low (float)
ci_high (float)
fit_a (float)
fit_b (float)
fit_c (float)
dn (int)
ds (int)
num_genes (int)
model_type (str)
pn_total (int)
ps_total (int)
ln (float | None)
ls (float | None)
omega (float | None)
omega_a (float | None)
omega_na (float | None)
ci_method (str)
sfs_mode (str)

frequency_bins: list[float]: All frequency bin centers.

alpha_by_freq: list[float]: Alpha values for bins with valid data (same length as alpha_x_values).

alpha_x_values: list[float]: X values (frequency bin centers) corresponding to alpha_by_freq.

alpha_asymptotic: float = 0.0

ci_low: float = 0.0

ci_high: float = 0.0

fit_a: float = 0.0

fit_b: float = 0.0

fit_c: float = 0.0

dn: int = 0

ds: int = 0

num_genes: int = 0

model_type: str = 'exponential'

pn_total: int = 0: Total non-synonymous polymorphisms (sum across all SFS bins).

ps_total: int = 0: Total synonymous polymorphisms (sum across all SFS bins).

ln: float | None = None: Total non-synonymous sites (Nei-Gojobori).

ls: float | None = None: Total synonymous sites (Nei-Gojobori).

omega: float | None = None: (Dn/Ds) * (Ls/Ln) — dN/dS ratio.

omega_a: float | None = None: Adaptive component alpha_asymptotic * omega.

omega_na: float | None = None: Non-adaptive component (1 - alpha_asymptotic) * omega.

ci_method: str = 'monte-carlo'

"monte-carlo" samples parameters from the curve-fit covariance matrix; "bootstrap" resamples polymorphisms with replacement and refits per replicate.

Type:: Method used to compute ci_low/ci_high

sfs_mode: str = 'at'

"at" (Messer & Petrov 2013, count per bin) or "above" (Uricchio et al. 2019, inclusive right-tail cumulative count).

Type:: SFS construction mode

property omega_a_ci_low: float | None: Lower 95% CI for omega_a = ci_low * omega.

__init__(frequency_bins=<factory>, alpha_by_freq=<factory>, alpha_x_values=<factory>, alpha_asymptotic=0.0, ci_low=0.0, ci_high=0.0, fit_a=0.0, fit_b=0.0, fit_c=0.0, dn=0, ds=0, num_genes=0, model_type='exponential', pn_total=0, ps_total=0, ln=None, ls=None, omega=None, omega_a=None, omega_na=None, ci_method='monte-carlo', sfs_mode='at')

Parameters:

frequency_bins (list[float])
alpha_by_freq (list[float])
alpha_x_values (list[float])
alpha_asymptotic (float)
ci_low (float)
ci_high (float)
fit_a (float)
fit_b (float)
fit_c (float)
dn (int)
ds (int)
num_genes (int)
model_type (str)
pn_total (int)
ps_total (int)
ln (float | None)
ls (float | None)
omega (float | None)
omega_a (float | None)
omega_na (float | None)
ci_method (str)
sfs_mode (str)

Return type:

None

property omega_a_ci_high: float | None: Upper 95% CI for omega_a = ci_high * omega.

property omega_na_ci_low: float | None

Lower 95% CI for omega_na = (1 - ci_high) * omega.

Percentiles flip because (1 - alpha) is monotonically decreasing.

property omega_na_ci_high: float | None: Upper 95% CI for omega_na = (1 - ci_low) * omega.

to_dict()[source]

Convert results to a dictionary.

Return type:: dict

mkado.analysis.asymptotic.extract_polymorphism_data(ingroup, outgroup, reading_frame=1, genetic_code=None, pool_polymorphisms=False, gene_id='', min_frequency=0.0)[source]

Extract polymorphism and divergence data without curve fitting.

This function extracts the raw data needed for asymptotic MK analysis, allowing multiple genes to be processed and aggregated before fitting.

Parameters:

ingroup (SequenceSet | str | Path) – SequenceSet or path to FASTA file for ingroup sequences
outgroup (SequenceSet | str | Path) – SequenceSet or path to FASTA file for outgroup sequences
reading_frame (int) – Reading frame (1, 2, or 3)
genetic_code (GeneticCode | None) – Genetic code for translation (uses standard if None)
pool_polymorphisms (bool) – If True, consider sites polymorphic in either population (libsequence convention). Frequencies are still calculated from ingroup only.
gene_id (str) – Identifier for this gene
min_frequency (float) – Minimum derived allele frequency for polymorphisms (polymorphisms below this threshold are excluded)

Return type:

PolymorphismData

Returns:

PolymorphismData containing polymorphisms with frequencies and divergence counts

mkado.analysis.asymptotic.aggregate_polymorphism_data(gene_data, num_bins=20, sfs_mode='at')[source]

Aggregate polymorphism data from multiple genes into SFS bins.

Parameters:

gene_data (list[PolymorphismData]) – List of PolymorphismData from individual genes
num_bins (int) – Number of frequency bins
sfs_mode (str) – "at" (default, Messer & Petrov 2013) returns per-bin counts. "above" (Uricchio et al. 2019) returns the inclusive right-tail cumulative SFS — bin i holds the count of polymorphisms with frequency >= bin_edges[i].

Return type:

AggregatedSFS

Returns:

AggregatedSFS with per-bin Pn/Ps counts and total divergence

mkado.analysis.asymptotic.asymptotic_mk_test_aggregated(gene_data, num_bins=20, ci_replicates=10000, frequency_cutoffs=(0.1, 0.9), ci_method='monte-carlo', seed=42, sfs_mode='at', workers=1)[source]

Genome-wide asymptotic MK test on aggregated data.

This follows the approach of Messer & Petrov (2013) and the asymptoticMK R package: aggregate polymorphism SFS data and divergence counts from many genes, then fit a single exponential curve to the aggregated data.

Parameters:

gene_data (list[PolymorphismData]) – List of PolymorphismData from individual genes
num_bins (int) – Number of frequency bins (default 20)
ci_replicates (int) – Number of CI replicates. For ci_method="monte-carlo" (default), parameters are sampled from the curve-fit covariance — pass ~10000. For ci_method="bootstrap", each replicate refits the curve so 100-500 is typical.
frequency_cutoffs (tuple[float, float]) – (low, high) frequency range for fitting (default 0.1-0.9)
ci_method (str) – "monte-carlo" (default, parametric MVN sampling from the curve-fit covariance) or "bootstrap" (case-resampling of the pooled polymorphism list with replacement, refit per replicate).
seed (int) – Random seed for reproducibility (default 42).
sfs_mode (str)
workers (int)

Return type:

AsymptoticMKResult

Returns:

AsymptoticMKResult with aggregated analysis

mkado.analysis.asymptotic.asymptotic_mk_test(ingroup, outgroup, reading_frame=1, genetic_code=None, num_bins=10, bootstrap_replicates=100, pool_polymorphisms=False, sfs_mode='at', workers=1)[source]

Perform the asymptotic McDonald-Kreitman test.

This method accounts for slightly deleterious mutations by examining how alpha changes across the frequency spectrum and extrapolating to the asymptotic value at x=1.

Based on Messer & Petrov (2013) PNAS.

Parameters:

ingroup (SequenceSet | str | Path) – SequenceSet or path to FASTA file for ingroup sequences
outgroup (SequenceSet | str | Path) – SequenceSet or path to FASTA file for outgroup sequences
reading_frame (int) – Reading frame (1, 2, or 3)
genetic_code (GeneticCode | None) – Genetic code for translation (uses standard if None)
num_bins (int) – Number of frequency bins (default 10)
bootstrap_replicates (int) – Number of bootstrap replicates for CI (default 100)
pool_polymorphisms (bool) – If True, consider sites polymorphic in either population (libsequence convention). Frequencies are still calculated from ingroup only. If False (default), only consider ingroup polymorphisms (DnaSP/original MK convention).
sfs_mode (str)
workers (int)

Return type:

AsymptoticMKResult

Returns:

AsymptoticMKResult containing test statistics

Imputed MK Test

Imputed McDonald-Kreitman test implementation.

Based on Murga-Moreno et al. (2022) G3: Genes|Genomes|Genetics. https://doi.org/10.1093/g3journal/jkac206

The impMKT corrects for slightly deleterious mutations by imputing the number of weakly deleterious nonsynonymous polymorphisms segregating at low frequency, rather than discarding all low-frequency variants. This retains more data and increases power to detect positive selection at the gene level.

class mkado.analysis.imputed.ImputedMKResult(alpha, p_value, pn_neutral, pwd, dn, ds, pn_total, ps_total, d=None, b=None, f=None, cutoff=0.15, ln=None, ls=None, omega=None, omega_a=None, omega_na=None, alpha_ci_low=None, alpha_ci_high=None, ci_method=None)[source]

Bases: object

Results from an imputed McDonald-Kreitman test.

The imputed test partitions Pn into a weakly-deleterious fraction (Pwd) — imputed from the synonymous SFS scaled by the observed Pn/Ps ratio at low derived allele frequencies — and a neutral fraction (Pn_neutral = Pn_total - Pwd). alpha is computed against Pn_neutral rather than the raw Pn count, removing the downward bias caused by slightly deleterious nonsynonymous variants.

Dn and Ds follow the standard MK definitions. Pn_total and Ps_total are the unfiltered ingroup polymorphism counts (the imputed test uses the entire SFS, including singletons).

Parameters:

alpha (float | None)
p_value (float)
pn_neutral (float)
pwd (float)
dn (int)
ds (int)
pn_total (int)
ps_total (int)
d (float | None)
b (float | None)
f (float | None)
cutoff (float)
ln (float | None)
ls (float | None)
omega (float | None)
omega_a (float | None)
omega_na (float | None)
alpha_ci_low (float | None)
alpha_ci_high (float | None)
ci_method (str | None)

alpha: float | None: Proportion of adaptive substitutions, computed against Pn_neutral.

p_value: float: Fisher’s exact test p-value on the imputed (Dn, Ds, Pn_neutral, Ps_total) table.

pn_neutral: float: Estimated neutral nonsynonymous polymorphism count (Pn_total - Pwd).

pwd: float: Imputed count of weakly-deleterious nonsynonymous polymorphisms.

dn: int: Non-synonymous fixed differences.

ds: int: Synonymous fixed differences.

pn_total: int: Total non-synonymous ingroup polymorphisms (no frequency filter applied).

ps_total: int: Total synonymous ingroup polymorphisms (no frequency filter applied).

d: float | None = None: Estimated DFE deleterious fraction (Murga-Moreno et al. 2022, eq. 3).

b: float | None = None: Estimated DFE weakly-deleterious fraction.

f: float | None = None: Estimated DFE neutral fraction.

cutoff: float = 0.15: Derived allele frequency below which polymorphisms are treated as the imputation pool.

ln: float | None = None: Total non-synonymous sites (Nei-Gojobori) over analyzed codons.

ls: float | None = None: Total synonymous sites (Nei-Gojobori) over analyzed codons.

omega: float | None = None: (Dn/Ds) * (Ls/Ln) — dN/dS ratio.

__init__(alpha, p_value, pn_neutral, pwd, dn, ds, pn_total, ps_total, d=None, b=None, f=None, cutoff=0.15, ln=None, ls=None, omega=None, omega_a=None, omega_na=None, alpha_ci_low=None, alpha_ci_high=None, ci_method=None)

Parameters:

alpha (float | None)
p_value (float)
pn_neutral (float)
pwd (float)
dn (int)
ds (int)
pn_total (int)
ps_total (int)
d (float | None)
b (float | None)
f (float | None)
cutoff (float)
ln (float | None)
ls (float | None)
omega (float | None)
omega_a (float | None)
omega_na (float | None)
alpha_ci_low (float | None)
alpha_ci_high (float | None)
ci_method (str | None)

Return type:

None

omega_a: float | None = None: Adaptive rate alpha * omega with imputed alpha (Gossmann, Keightley & Eyre-Walker 2012).

omega_na: float | None = None: Non-adaptive component (1 - alpha) * omega.

alpha_ci_low: float | None = None: Lower 95% bootstrap CI for alpha. None when n_bootstrap=0.

alpha_ci_high: float | None = None: Upper 95% bootstrap CI for alpha.

ci_method: str | None = None: "bootstrap" when CI was computed, None when n_bootstrap=0.

property omega_a_ci_low: float | None: Lower 95% CI for omega_a = alpha_ci_low * omega.

property omega_a_ci_high: float | None: Upper 95% CI for omega_a = alpha_ci_high * omega.

property omega_na_ci_low: float | None: Lower 95% CI for omega_na = (1 - alpha_ci_high) * omega (flipped).

property omega_na_ci_high: float | None: Upper 95% CI for omega_na = (1 - alpha_ci_low) * omega.

to_dict()[source]

Convert results to a dictionary.

Return type:: dict

mkado.analysis.imputed.imputed_mk_test(gene_data, cutoff=0.15, num_synonymous_sites=None, num_nonsynonymous_sites=None, n_bootstrap=0, seed=42)[source]

Perform the imputed McDonald-Kreitman test on a single gene.

The impMKT corrects for slightly deleterious mutations by imputing the count of weakly deleterious nonsynonymous polymorphisms from the ratio of low- to high-frequency synonymous polymorphisms.

Based on Murga-Moreno et al. (2022) G3. https://doi.org/10.1093/g3journal/jkac206

Parameters:

gene_data (PolymorphismData) – Polymorphism and divergence data from extract_polymorphism_data()
cutoff (float) – DAF cutoff separating low and high frequency (default 0.15)
num_synonymous_sites (float | None) – Number of synonymous sites (m0) for DFE fractions
num_nonsynonymous_sites (float | None) – Number of nonsynonymous sites (mi) for DFE fractions
n_bootstrap (int) – Number of bootstrap replicates for the alpha CI. 0 (default) disables CI computation and preserves byte-identical legacy output.
seed (int) – Random seed for the bootstrap (default 42).

Return type:

ImputedMKResult

Returns:

ImputedMKResult with corrected alpha and optionally DFE fractions

mkado.analysis.imputed.imputed_mk_test_multi(gene_data, cutoff=0.15, num_synonymous_sites=None, num_nonsynonymous_sites=None, n_bootstrap=0, seed=42)[source]

Perform the imputed MK test on pooled data from multiple genes.

Pools all polymorphisms and divergence counts across genes, then runs the imputed MK algorithm on the combined data.

Parameters:

gene_data (list[PolymorphismData]) – List of PolymorphismData from individual genes
cutoff (float) – DAF cutoff separating low and high frequency (default 0.15)
num_synonymous_sites (float | None) – Number of synonymous sites (m0) for DFE fractions
num_nonsynonymous_sites (float | None) – Number of nonsynonymous sites (mi) for DFE fractions
n_bootstrap (int) – Number of bootstrap replicates for the alpha CI. 0 (default) disables CI computation and preserves byte-identical legacy output.
seed (int) – Random seed for the bootstrap (default 42).

Return type:

ImputedMKResult

Returns:

ImputedMKResult with corrected alpha from pooled data

Polarized MK Test

Polarized McDonald-Kreitman test implementation.

class mkado.analysis.polarized.PolarizedMKResult(dn_ingroup, ds_ingroup, pn_ingroup, ps_ingroup, dn_outgroup, ds_outgroup, dn_unpolarized, ds_unpolarized, pn_unpolarized, ps_unpolarized, p_value_ingroup, ni_ingroup, alpha_ingroup, dos_ingroup, ln=None, ls=None, omega=None)[source]

Bases: object

Results from a polarized McDonald-Kreitman test.

The polarized test uses a second outgroup (outgroup2) to assign each fixed difference and each ingroup polymorphism to one of two lineages — the ingroup or the first outgroup — by majority rule on the inferred ancestral state. Sites where the ancestral state cannot be determined are reported under *_unpolarized.

Pn and Ps count derived alleles segregating in the ingroup (subject to min_frequency filtering, identical to the standard MK test) and are split by lineage assignment.

Parameters:

dn_ingroup (int)
ds_ingroup (int)
pn_ingroup (int)
ps_ingroup (int)
dn_outgroup (int)
ds_outgroup (int)
dn_unpolarized (int)
ds_unpolarized (int)
pn_unpolarized (int)
ps_unpolarized (int)
p_value_ingroup (float)
ni_ingroup (float | None)
alpha_ingroup (float | None)
dos_ingroup (float | None)
ln (float | None)
ls (float | None)
omega (float | None)

dn_ingroup: int: Non-synonymous fixed differences assigned to the ingroup lineage.

ds_ingroup: int: Synonymous fixed differences assigned to the ingroup lineage.

pn_ingroup: int: Non-synonymous polymorphisms whose derived allele arose on the ingroup lineage.

ps_ingroup: int: Synonymous polymorphisms whose derived allele arose on the ingroup lineage.

dn_outgroup: int: Non-synonymous fixed differences assigned to the first outgroup lineage.

ds_outgroup: int: Synonymous fixed differences assigned to the first outgroup lineage.

dn_unpolarized: int: Non-synonymous fixed differences where lineage could not be determined.

ds_unpolarized: int: Synonymous fixed differences where lineage could not be determined.

pn_unpolarized: int: Non-synonymous polymorphisms where ancestral state could not be determined.

ps_unpolarized: int: Synonymous polymorphisms where ancestral state could not be determined.

__init__(dn_ingroup, ds_ingroup, pn_ingroup, ps_ingroup, dn_outgroup, ds_outgroup, dn_unpolarized, ds_unpolarized, pn_unpolarized, ps_unpolarized, p_value_ingroup, ni_ingroup, alpha_ingroup, dos_ingroup, ln=None, ls=None, omega=None)

Parameters:

dn_ingroup (int)
ds_ingroup (int)
pn_ingroup (int)
ps_ingroup (int)
dn_outgroup (int)
ds_outgroup (int)
dn_unpolarized (int)
ds_unpolarized (int)
pn_unpolarized (int)
ps_unpolarized (int)
p_value_ingroup (float)
ni_ingroup (float | None)
alpha_ingroup (float | None)
dos_ingroup (float | None)
ln (float | None)
ls (float | None)
omega (float | None)

Return type:

None

p_value_ingroup: float: Fisher’s exact test p-value on the ingroup-lineage 2x2 table.

ni_ingroup: float | None: Neutrality Index for the ingroup lineage.

alpha_ingroup: float | None: Proportion of adaptive substitutions on the ingroup lineage.

dos_ingroup: float | None: Direction of Selection for the ingroup lineage.

ln: float | None = None: Total non-synonymous sites (Nei-Gojobori) over analyzed codons.

ls: float | None = None: Total synonymous sites (Nei-Gojobori) over analyzed codons.

omega: float | None = None

(Dn_ingroup/Ds_ingroup) * (Ls/Ln) — ingroup-lineage dN/dS.

Note: this result class deliberately omits omega_a / omega_na; per-gene polarized alpha is too noisy to give a useful rate decomposition. See docs/omega.rst for the rationale.

to_dict()[source]

Convert results to a dictionary.

Return type:: dict

mkado.analysis.polarized.polarized_mk_test(ingroup, outgroup1, outgroup2, reading_frame=1, genetic_code=None, pool_polymorphisms=False, min_frequency=0.0)[source]

Perform a polarized McDonald-Kreitman test.

Uses a second outgroup to determine the ancestral state and assign mutations to specific lineages.

Parameters:

ingroup (SequenceSet | str | Path) – SequenceSet or path to FASTA file for ingroup sequences
outgroup1 (SequenceSet | str | Path) – SequenceSet or path to FASTA file for first outgroup
outgroup2 (SequenceSet | str | Path) – SequenceSet or path to FASTA file for second outgroup (used for polarization)
reading_frame (int) – Reading frame (1, 2, or 3)
genetic_code (GeneticCode | None) – Genetic code for translation (uses standard if None)
pool_polymorphisms (bool) – If True, count polymorphisms from both ingroup and outgroup1 (libsequence convention). If False (default), count only ingroup polymorphisms (DnaSP/original MK convention).
min_frequency (float) – Minimum derived allele frequency for a polymorphism to be counted (default 0.0, i.e., include all polymorphisms). Useful for filtering out rare variants that may be slightly deleterious.

Return type:

PolarizedMKResult

Returns:

PolarizedMKResult containing test statistics

Usage Examples

Working with SequenceSet

from mkado import SequenceSet

# Load sequences from FASTA
seqs = SequenceSet.from_fasta("alignment.fa")

# Filter by name pattern
ingroup = seqs.filter_by_name("dmel")
outgroup = seqs.filter_by_name("dsim")

# Get sequence information
print(f"Number of sequences: {len(seqs)}")
print(f"Alignment length: {seqs.alignment_length}")

Running MK Tests

from mkado import mk_test, SequenceSet

# From file paths
result = mk_test("ingroup.fa", "outgroup.fa")

# From SequenceSet objects
all_seqs = SequenceSet.from_fasta("combined.fa")
ingroup = all_seqs.filter_by_name("species1")
outgroup = all_seqs.filter_by_name("species2")
result = mk_test(ingroup, outgroup)

# Access results
print(f"Dn={result.Dn}, Ds={result.Ds}")
print(f"Pn={result.Pn}, Ps={result.Ps}")
print(f"P-value: {result.p_value}")
print(f"NI: {result.NI}")
print(f"Alpha: {result.alpha}")

Asymptotic Analysis

from mkado import asymptotic_mk_test

result = asymptotic_mk_test("ingroup.fa", "outgroup.fa", n_bins=20)

print(f"Asymptotic Alpha: {result.alpha_asymptotic}")
print(f"95% CI: [{result.ci_low}, {result.ci_high}]")

# Access per-bin data
for freq, alpha in zip(result.frequencies, result.alphas):
    print(f"Frequency {freq:.2f}: Alpha = {alpha:.3f}")

Batch Processing

from pathlib import Path
from mkado import mk_test, SequenceSet

# Process multiple files
results = {}
for fasta_file in Path("alignments/").glob("*.fa"):
    seqs = SequenceSet.from_fasta(fasta_file)
    ingroup = seqs.filter_by_name("species1")
    outgroup = seqs.filter_by_name("species2")

    if len(ingroup) > 0 and len(outgroup) > 0:
        results[fasta_file.stem] = mk_test(ingroup, outgroup)

# Summarize results
for gene, result in results.items():
    print(f"{gene}: alpha={result.alpha:.3f}, p={result.p_value:.3f}")

Generating Volcano Plots

from pathlib import Path
from mkado import mk_test, SequenceSet
from mkado.io.plotting import create_volcano_plot

# Collect batch results
results = []
for fasta_file in Path("alignments/").glob("*.fa"):
    seqs = SequenceSet.from_fasta(fasta_file)
    ingroup = seqs.filter_by_name("species1")
    outgroup = seqs.filter_by_name("species2")

    if len(ingroup) > 0 and len(outgroup) > 0:
        result = mk_test(ingroup, outgroup)
        results.append((fasta_file.stem, result))

# Generate volcano plot
create_volcano_plot(results, Path("volcano.png"))

Generating Asymptotic Plots

from pathlib import Path
from mkado import asymptotic_mk_test, SequenceSet
from mkado.io.plotting import create_asymptotic_plot

# Run asymptotic MK test
seqs = SequenceSet.from_fasta("alignment.fa")
ingroup = seqs.filter_by_name("species1")
outgroup = seqs.filter_by_name("species2")
result = asymptotic_mk_test(ingroup, outgroup)

# Generate alpha(x) vs frequency plot
create_asymptotic_plot(result, Path("asymptotic_fit.png"))

Plotting Functions

Plotting functions for MK test results.

mkado.io.plotting.create_volcano_plot(results, output_path, alpha=0.05)[source]

Create a volcano plot from batch MK test results.

The volcano plot shows -log10(NI) on the X-axis and -log10(p-value) on the Y-axis. A horizontal line indicates the Bonferroni-corrected significance threshold.

Parameters:

results (list[tuple[str, MKResult | PolarizedMKResult]]) – List of (gene_name, result) tuples from batch MK tests
output_path (Path) – Path to save the plot (PNG, PDF, or SVG)
alpha (float) – Significance level for Bonferroni correction (default 0.05)

Return type:

None

mkado.io.plotting.create_asymptotic_plot(result, output_path)[source]

Create an asymptotic MK test plot showing alpha(x) vs derived allele frequency.

This plot follows the style of Messer & Petrov (2013), showing: - Scatter points of alpha at each frequency bin - The fitted curve (exponential or linear) - A horizontal line at alpha_asymptotic with confidence interval band

Parameters:

result (AsymptoticMKResult) – AsymptoticMKResult from asymptotic MK test
output_path (Path) – Path to save the plot (PNG, PDF, or SVG)

Return type:

None