File Formats and Input Requirements

MKado supports two input modes: FASTA (aligned coding sequences for mkado test and mkado batch) and VCF (variant calls for mkado vcf). This page describes the input requirements for both.

FASTA Format

MKado reads standard FASTA format:

>sequence_name1 optional description
ATGCATGCATGC...
>sequence_name2
ATGCATGCATGC...

Requirements:

Standard FASTA format with > headers
Sequences can span multiple lines
Any valid sequence characters (ACGT, gaps, ambiguous codes)
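To make the multi-line rule concrete, here is a minimal sketch of a FASTA reader that joins wrapped sequence lines. It is not MKado's parser; the function name parse_fasta is hypothetical, and treating the first whitespace-delimited token of the header as the sequence name is an assumption.

```python
def parse_fasta(text):
    """Parse FASTA text into {name: sequence}, joining wrapped lines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the name is the first whitespace-delimited token;
            # the rest of the header is a free-text description.
            fields = line[1:].split(None, 1)
            name = fields[0] if fields else ""
            if name in records:
                raise ValueError(f"duplicate sequence name: {name!r}")
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


example = """>sequence_name1 optional description
ATGCAT
GCATGC
>sequence_name2
ATGCATGCATGC
"""
print(parse_fasta(example))
```

Both records in the example come out as the same 12-base sequence, even though the first one is wrapped over two lines.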

Alignment Requirements

Sequences must be pre-aligned:

All sequences must be the same length
Gaps should be represented as -
Alignment should be in-frame (codon-aligned)

Codon Alignment

MKado analyzes coding sequences at the codon level. Your alignment should:

Start at the first position of a codon
Be a multiple of 3 nucleotides (excluding gaps)
Maintain reading frame throughout

Example of proper codon alignment:

>seq1
ATGGCC---TAAACT
>seq2
ATGGCCTGATAAACT

If your alignment isn’t codon-aligned, use the -r option to specify the reading frame (1, 2, or 3).
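The following sketch shows what reading an alignment in a given frame means in practice. It assumes that frame N skips the first N-1 alignment columns and that a trailing partial codon is dropped; MKado's actual handling of -r may differ, and split_codons is a hypothetical helper.

```python
def split_codons(seq, frame=1):
    """Split an aligned sequence into codons for reading frame 1, 2, or 3."""
    if frame not in (1, 2, 3):
        raise ValueError("frame must be 1, 2, or 3")
    start = frame - 1             # assumption: frame N skips N-1 leading columns
    usable = len(seq) - start
    usable -= usable % 3          # drop a trailing partial codon, if any
    return [seq[i:i + 3] for i in range(start, start + usable, 3)]


# The codon-aligned example from above, read in frame 1.
print(split_codons("ATGGCC---TAAACT"))

# The same sequence with one extra leading base needs frame 2.
print(split_codons("CATGGCC---TAAACT", frame=2))
```

Both calls yield the same five codons, with the gap falling on a whole codon (---) rather than straddling two.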

Sequence Naming Conventions

For combined file mode, sequence names should contain identifiable patterns for filtering:

>speciesA_gene1_sample1
ATGCATGC...
>speciesA_gene1_sample2
ATGCATGC...
>speciesB_gene1_sample1
ATGCATGC...

Then filter with:

mkado test alignment.fa -i "speciesA" -o "speciesB"

The pattern matching is case-sensitive substring matching.
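As an illustration of case-sensitive substring matching, the sketch below selects names the same way in plain Python. The helper select_by_pattern is hypothetical and only mirrors the behavior described here; it is not MKado code.

```python
def select_by_pattern(names, pattern):
    """Return the names that contain `pattern` as a case-sensitive substring."""
    return [name for name in names if pattern in name]


names = [
    "speciesA_gene1_sample1",
    "speciesA_gene1_sample2",
    "speciesB_gene1_sample1",
]

print(select_by_pattern(names, "speciesA"))  # the two speciesA samples
print(select_by_pattern(names, "speciesB"))  # the single speciesB sample
print(select_by_pattern(names, "speciesa"))  # [] because case matters
print(select_by_pattern(names, "species"))   # all three names
```

The last call shows why patterns should be specific: a substring shared by both groups would match ingroup and outgroup sequences alike.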

Ingroup and Outgroup

Definitions

Ingroup: The species of primary interest (typically polymorphic population samples)
Outgroup: A closely related species used to identify fixed differences

Selection Guidelines

The outgroup should be close enough to the ingroup to align reliably, but distant enough to have accumulated fixed differences
Multiple outgroup sequences can be used (consensus is taken)

For polarized tests:

Second outgroup: A more distant species to determine mutation direction
Should be divergent from both ingroup and primary outgroup

Handling Special Cases

Gaps

Codons containing gaps (---) are excluded from analysis
Partial gaps within codons are handled conservatively

Stop Codons

Internal stop codons are flagged but not automatically excluded
Check your alignments if you see unexpected results
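To check an alignment for internal stop codons before running MKado, a sketch along these lines may help. It assumes the standard genetic code and an alignment that starts in frame; internal_stop_positions is a hypothetical helper, not part of MKado.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def internal_stop_positions(seq):
    """Return 0-based codon indices of stop codons before the final codon."""
    seq = seq.upper()
    end = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, end, 3)]
    # The final codon is skipped because a terminal stop is expected there.
    return [i for i, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]


print(internal_stop_positions("ATGGCCTGGTAA"))     # only a terminal stop
print(internal_stop_positions("ATGTGAGCCTGGTAA"))  # TGA at codon index 1
```

A non-empty result for many sequences at once often points to a frame problem rather than real nonsense mutations.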

Ambiguous Bases

N and other IUPAC ambiguity codes are supported
Codons with ambiguous bases may be excluded from certain calculations
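The sketch below labels each codon so you can see how many would be affected by the gap and ambiguity rules. The category names are illustrative assumptions; how MKado treats partial gaps and ambiguous codons in each calculation is not specified here, so the code only counts them.

```python
from collections import Counter


def classify_codon(codon):
    """Label a codon as 'complete', 'gap', 'partial_gap', or 'ambiguous'."""
    codon = codon.upper()
    if len(codon) != 3:
        raise ValueError("a codon must have exactly 3 characters")
    if codon == "---":
        return "gap"
    if "-" in codon:
        return "partial_gap"
    if set(codon) <= set("ACGT"):
        return "complete"
    return "ambiguous"  # N or another IUPAC ambiguity code is present


codons = ["ATG", "GCC", "---", "A-G", "ANG", "TTR"]
summary = Counter(classify_codon(c) for c in codons)
print(summary)
```

A high share of partial_gap codons usually means the alignment was built at the nucleotide level rather than codon by codon.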

Common Problems

Alignment Not in Frame

Symptom: Unexpected results, many excluded codons

Solution: Use mkado info to check the alignment properties, then specify the reading frame with -r

mkado info alignment.fa
mkado test alignment.fa -i sp1 -o sp2 -r 2  # Try reading frame 2

Wrong Species Pattern

Symptom: “No sequences found” error

Solution: Check sequence names and pattern

mkado info alignment.fa  # Lists all sequence names
mkado test alignment.fa -i "correct_pattern" -o "outgroup"

Unequal Sequence Lengths

Symptom: Error about alignment length

Solution: Re-align sequences or check for truncated sequences
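To find which sequences cause a length error, a small check like the following can be run on the parsed alignment. It is a sketch that takes a name-to-sequence dictionary; length_outliers is a hypothetical helper, and treating the most common length as the expected one is an assumption.

```python
from collections import Counter


def length_outliers(records):
    """Return {name: length} for sequences that differ from the most common length."""
    if not records:
        return {}
    lengths = {name: len(seq) for name, seq in records.items()}
    # Assumption: the most common length is the intended alignment length.
    expected, _count = Counter(lengths.values()).most_common(1)[0]
    return {name: n for name, n in lengths.items() if n != expected}


records = {
    "speciesA_sample1": "ATGGCC---TAAACT",
    "speciesA_sample2": "ATGGCCTGATAAACT",
    "speciesB_sample1": "ATGGCCTGATAA",  # truncated by three columns
}
print(length_outliers(records))
```

An empty dictionary means all sequences share one length, which is what MKado expects.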

VCF Mode Input Files

The mkado vcf command requires an ingroup VCF, an outgroup VCF, a reference FASTA, and a GFF3 annotation. See VCF Input Mode for the full guide.

VCF Files

Both the ingroup (--vcf) and outgroup (--outgroup-vcf) inputs are standard VCF files:

Bgzipped, tabix-indexed files are recommended for performance, but uncompressed VCF also works
The ingroup VCF should be a multi-sample population VCF with genotype fields
The outgroup VCF should be a single-sample VCF called against the same reference genome
Multi-allelic sites should be decomposed beforehand, for example with bcftools norm -m-
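To see whether a VCF still contains multi-allelic records, the sketch below scans the ALT column for comma-separated alleles. It relies only on the standard VCF column order (ALT is the fifth column); multiallelic_sites is a hypothetical helper, not an MKado function.

```python
def multiallelic_sites(vcf_lines):
    """Yield (chrom, pos, alt) for records whose ALT column lists several alleles."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, alt = fields[0], int(fields[1]), fields[4]
        if "," in alt:
            yield chrom, pos, alt


vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t.\tPASS\t.",
    "chr1\t250\t.\tC\tT,G\t.\tPASS\t.",
]
print(list(multiallelic_sites(vcf_lines)))
```

For a file on disk, the same function accepts an open text handle; a bgzipped VCF can be read with gzip.open(path, "rt") from the standard library.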

Reference FASTA

The genome assembly both VCFs were called against (--ref):

Must be indexed with samtools faidx (creates a .fai file)
Both plain FASTA and bgzipped FASTA (with .gzi index) are supported
Only bgzip compression is supported, not plain gzip (random access requires BGZF block structure)
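Plain gzip and bgzip files share the .gz extension, so it can be useful to check which one you have. The sketch below inspects the first bytes of a file for the BGZF extra field. It assumes the BC subfield comes first in the extra field, as in files written by bgzip; is_bgzf and file_is_bgzf are hypothetical helpers.

```python
import gzip


def is_bgzf(header):
    """Return True if the leading bytes look like a BGZF (bgzip) block header."""
    return bool(
        len(header) >= 14
        and header[0:2] == b"\x1f\x8b"  # gzip magic number
        and header[2] == 8              # DEFLATE compression method
        and header[3] & 0x04            # FEXTRA flag: an extra field is present
        and header[12:14] == b"BC"      # assumption: BGZF subfield is listed first
    )


def file_is_bgzf(path):
    """Check the first bytes of the file at `path`."""
    with open(path, "rb") as handle:
        return is_bgzf(handle.read(18))


# The 28-byte empty BGZF block that bgzip appends as an end-of-file marker.
BGZF_EOF = bytes.fromhex(
    "1f8b08040000000000ff0600424302001b0003000000000000000000"
)
print(is_bgzf(BGZF_EOF))               # BGZF block
print(is_bgzf(gzip.compress(b"ACGT"))) # plain gzip stream
```

A reference that fails this check can be decompressed and recompressed with bgzip before indexing it with samtools faidx.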

GFF3 Annotation

A GFF3 file defining gene models (--gff):

Must contain gene, mRNA/transcript, and CDS features linked by Parent attributes
Both plain text and gzip-compressed (.gff3.gz) files are supported
MKado selects the longest transcript per gene automatically
Genes where the total CDS length is not divisible by 3 are skipped
GTF format is not supported: MKado requires GFF3 (key=value attributes, not GTF’s key "value" style). Convert with gffread annotation.gtf -o annotation.gff3 if needed.
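To anticipate which transcripts would fail the divisible-by-3 rule, the sketch below sums CDS lengths per Parent. It is an illustration under stated assumptions, not MKado's implementation: cds_length_by_transcript is a hypothetical helper, and it does not reproduce how MKado chooses the longest transcript per gene.

```python
from collections import defaultdict


def cds_length_by_transcript(gff_lines):
    """Sum CDS lengths per parent transcript from GFF3 lines."""
    totals = defaultdict(int)
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        start, end = int(cols[3]), int(cols[4])
        # GFF3 attributes are semicolon-separated key=value pairs.
        attrs = dict(item.split("=", 1) for item in cols[8].split(";") if "=" in item)
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                totals[parent] += end - start + 1  # coordinates are 1-based, inclusive
    return dict(totals)


gff_lines = [
    "##gff-version 3",
    "chr1\tsrc\tCDS\t1\t90\t.\t+\t0\tID=cds1;Parent=tx1",
    "chr1\tsrc\tCDS\t201\t260\t.\t+\t0\tID=cds2;Parent=tx1",
    "chr1\tsrc\tCDS\t1\t100\t.\t+\t0\tID=cds3;Parent=tx2",
]
totals = cds_length_by_transcript(gff_lines)
not_divisible = [tx for tx, length in totals.items() if length % 3 != 0]
print(totals)
print(not_divisible)
```

In this example tx1 has 150 coding bases and passes, while tx2 has 100 and would fall under the skip rule.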

Example Data

The examples/ directory contains properly formatted FASTA example data:

# Examine example file
mkado info examples/anopheles_batch/AGAP000078.fa

# Run test on example
mkado test examples/anopheles_batch/AGAP000078.fa -i gamb -o afun