File Formats and Input Requirements

MKado supports two input modes: FASTA (aligned coding sequences for mkado test and mkado batch) and VCF (variant calls for mkado vcf). This page describes the input requirements for both.

FASTA Format

MKado reads standard FASTA format:

>sequence_name1 optional description
ATGCATGCATGC...
>sequence_name2
ATGCATGCATGC...

Requirements:

  • Standard FASTA format with > headers

  • Sequences can span multiple lines

  • Any valid nucleotide characters are accepted (A, C, G, T, gaps, and IUPAC ambiguity codes)
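As a rough illustration of the format rules above, the following Python sketch parses FASTA text into a dictionary. It is not MKado's parser: the function name read_fasta is hypothetical, and treating the first word of the header as the sequence name is an assumption.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {name: sequence}.

    Assumption: the name is the header text up to the first whitespace.
    """
    records: Dict[str, str] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: keep the first word, drop the optional description
            parts = line[1:].split(None, 1)
            name = parts[0] if parts else ""
            records[name] = ""
        elif name is not None:
            # Sequence data may span several lines; concatenate them
            records[name] += line
    return records


example = """>sequence_name1 optional description
ATGCAT
GCATGC
>sequence_name2
ATGCATGCATGC
"""
records = read_fasta(example.splitlines())
for name, seq in records.items():
    print(name, len(seq))
```

Both records end up with the same 12-base sequence, even though the first one is wrapped over two lines.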

Alignment Requirements

Sequences must be pre-aligned:

  • All sequences must be the same length

  • Gaps should be represented as -

  • Alignment should be in-frame (codon-aligned)
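The equal-length requirement can be checked before running MKado. The sketch below groups sequence names by length so that outliers are easy to spot. The helper name group_by_length and the example records are illustrative only.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_length(records: Dict[str, str]) -> Dict[int, List[str]]:
    """Map each observed sequence length to the names that have it."""
    groups = defaultdict(list)
    for name, seq in records.items():
        groups[len(seq)].append(name)
    return dict(groups)


# Hypothetical records: seq3 is truncated relative to the others
records = {
    "seq1": "ATGGCC---AAA",
    "seq2": "ATGGCCTGGAAA",
    "seq3": "ATGGCCTGG",
}
groups = group_by_length(records)
if len(groups) > 1:
    # More than one length means the file is not a valid alignment
    for length, names in sorted(groups.items()):
        print(f"{length} columns: {', '.join(names)}")
```

A properly aligned file produces a single group.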

Codon Alignment

MKado analyzes coding sequences at the codon level. Your alignment should:

  • Start at the first position of a codon

  • Be a multiple of 3 nucleotides (excluding gaps)

  • Maintain reading frame throughout

Example of proper codon alignment:

>seq1
ATGGCC---AAAACT
>seq2
ATGGCCTGGAAAACT

If your alignment does not start at the first position of a codon, use the -r option to specify the reading frame (1, 2, or 3).
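To make the idea of a reading frame concrete, here is a minimal Python sketch that splits an aligned sequence into codons for a given frame. It assumes that frame 1 starts at the first column, frame 2 skips one column, frame 3 skips two, and that a trailing incomplete codon is dropped. Whether -r behaves exactly this way inside MKado is not confirmed here, and split_codons is a hypothetical name.

```python
from typing import List


def split_codons(seq: str, frame: int = 1) -> List[str]:
    """Split an aligned sequence into triplets, starting at the given frame."""
    if frame not in (1, 2, 3):
        raise ValueError("frame must be 1, 2, or 3")
    start = frame - 1
    usable = len(seq) - start
    end = start + usable - usable % 3  # drop a trailing incomplete codon
    return [seq[i:i + 3] for i in range(start, end, 3)]


aligned = "ATGGCC---AAAACT"
print(split_codons(aligned, frame=1))  # gaps fall in one whole codon
print(split_codons(aligned, frame=2))  # gaps are split across codons
```

In frame 1 the gap occupies exactly one codon, while in frame 2 it is split across two codons, which is a typical sign of an out-of-frame alignment.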

Sequence Naming Conventions

In combined file mode, where ingroup and outgroup sequences share a single file, sequence names should contain identifiable patterns for filtering:

>speciesA_gene1_sample1
ATGCATGC...
>speciesA_gene1_sample2
ATGCATGC...
>speciesB_gene1_sample1
ATGCATGC...

Then filter with:

mkado test alignment.fa -i "speciesA" -o "speciesB"

Patterns are matched as case-sensitive substrings of the sequence name.
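The effect of case-sensitive substring matching can be previewed in Python before running a test. The helper below, select_by_pattern, is a hypothetical stand-in for the filtering step and only mirrors the rule described above.

```python
from typing import List


def select_by_pattern(names: List[str], pattern: str) -> List[str]:
    """Return the names that contain the pattern as a case-sensitive substring."""
    return [name for name in names if pattern in name]


names = [
    "speciesA_gene1_sample1",
    "speciesA_gene1_sample2",
    "speciesB_gene1_sample1",
]
ingroup = select_by_pattern(names, "speciesA")
outgroup = select_by_pattern(names, "speciesB")
wrong_case = select_by_pattern(names, "speciesa")  # lowercase "a" matches nothing

print(len(ingroup), len(outgroup), len(wrong_case))
```

Because any substring counts as a match, a short pattern such as sp1 would also select names containing sp10, so choose patterns that are unique to each group.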

Ingroup and Outgroup

Definitions

  • Ingroup: The species of primary interest (typically polymorphic population samples)

  • Outgroup: A closely related species used to identify fixed differences

Selection Guidelines

  • The outgroup should be close enough to align reliably, but distant enough to have accumulated fixed differences

  • Multiple outgroup sequences can be used (consensus is taken)
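How MKado builds the outgroup consensus is not specified on this page. As an illustration of the general idea only, the sketch below takes a column-wise majority vote, which is one common way to form a consensus. The function name majority_consensus and the tie-breaking behavior are assumptions, not MKado's documented method.

```python
from collections import Counter
from typing import List


def majority_consensus(seqs: List[str]) -> str:
    """Column-wise majority vote over aligned sequences (illustrative only)."""
    if not seqs:
        raise ValueError("need at least one sequence")
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    # zip(*seqs) yields one tuple of characters per alignment column;
    # on a tie, the character seen first in the column wins
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


outgroup = ["ATGGCC", "ATGGCA", "ATGGCC"]
print(majority_consensus(outgroup))
```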

For polarized tests:

  • Second outgroup: A more distant species to determine mutation direction

  • Should be divergent from both ingroup and primary outgroup

Handling Special Cases

Gaps

  • Codons containing gaps (---) are excluded from analysis

  • Partial gaps within codons are handled conservatively

Stop Codons

  • Internal stop codons are flagged but not automatically excluded

  • Check your alignments if you see unexpected results

Ambiguous Bases

  • N and other IUPAC ambiguity codes are supported

  • Codons with ambiguous bases may be excluded from certain calculations
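To see how many codons in a sequence fall into each special case, a small classifier can help. This sketch assumes the standard genetic code for stop codons and treats any character other than A, C, G, T, or - as ambiguous. The categories are illustrative and do not describe MKado's internal logic.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def classify_codon(codon: str) -> str:
    """Label a codon as 'gap', 'ambiguous', 'stop', or 'standard'."""
    codon = codon.upper()
    if "-" in codon:
        return "gap"  # full or partial gap
    if any(base not in "ACGT" for base in codon):
        return "ambiguous"  # N or another IUPAC ambiguity code
    if codon in STOP_CODONS:
        return "stop"
    return "standard"


seq = "ATGGCN---TGAAAA"
codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
tally = Counter(classify_codon(c) for c in codons)
print(dict(tally))
```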

Common Problems

Alignment Not in Frame

Symptom: Unexpected results, many excluded codons

Solution: Use mkado info to check alignment properties, then specify the reading frame with -r

mkado info alignment.fa
mkado test alignment.fa -i sp1 -o sp2 -r 2  # Try reading frame 2
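When it is unclear which reading frame to try, counting in-frame stop codons in each frame is a common heuristic: the correct frame of a coding sequence usually has none before the final codon. The sketch below is a diagnostic aid under that assumption, not part of MKado. The name internal_stops_by_frame is hypothetical, and frames are numbered as for -r.

```python
from typing import Dict

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def internal_stops_by_frame(seq: str) -> Dict[int, int]:
    """Count stop codons in frames 1-3, ignoring the last codon of each frame."""
    seq = seq.upper()
    counts = {}
    for frame in (1, 2, 3):
        codons = [seq[i:i + 3] for i in range(frame - 1, len(seq) - 2, 3)]
        # The final codon may legitimately be a stop, so leave it out
        counts[frame] = sum(codon in STOP_CODONS for codon in codons[:-1])
    return counts


# Hypothetical sequence with one extra base before the start codon
seq = "GATGCTAACTGATAGC"
counts = internal_stops_by_frame(seq)
best = min(counts, key=counts.get)
print(counts, "-> try frame", best)
```

For this sequence only frame 2 is free of internal stops, which suggests running the test with -r 2.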

Wrong Species Pattern

Symptom: “No sequences found” error

Solution: Check sequence names and pattern

mkado info alignment.fa  # Lists all sequence names
mkado test alignment.fa -i "correct_pattern" -o "outgroup"

Unequal Sequence Lengths

Symptom: Error about alignment length

Solution: Re-align sequences or check for truncated sequences

VCF Mode Input Files

The mkado vcf command requires an ingroup VCF, an outgroup VCF, a reference FASTA, and a GFF3 annotation. See VCF Input Mode for the full guide.

VCF Files

Both the ingroup (--vcf) and outgroup (--outgroup-vcf) inputs are standard VCF files:

  • Bgzip-compressed, tabix-indexed files are recommended for performance, but uncompressed VCF also works

  • The ingroup VCF should be a multi-sample population VCF with genotype fields

  • The outgroup VCF should be a single-sample VCF called against the same reference genome

  • Multi-allelic sites should be decomposed beforehand, for example with bcftools norm -m-
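To check whether a VCF still contains multi-allelic records, look for a comma in the ALT column (the fifth tab-separated field). The following sketch does this with plain text parsing. The function name multiallelic_sites and the example records are made up for illustration.

```python
from typing import Iterable, List, Tuple


def multiallelic_sites(vcf_lines: Iterable[str]) -> List[Tuple[str, int, str]]:
    """Return (CHROM, POS, ALT) for records listing more than one ALT allele."""
    sites = []
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta lines, and the column header
        fields = line.rstrip("\n").split("\t")
        chrom, pos, alt = fields[0], int(fields[1]), fields[4]
        if "," in alt:  # several ALT alleles are comma-separated
            sites.append((chrom, pos, alt))
    return sites


# Hypothetical VCF content, built with explicit tabs
vcf_lines = [
    "##fileformat=VCFv4.2",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]),
    "\t".join(["chr1", "100", ".", "A", "G", ".", "PASS", "."]),
    "\t".join(["chr1", "250", ".", "C", "T,G", ".", "PASS", "."]),
]
sites = multiallelic_sites(vcf_lines)
print(sites)
```

An empty result means no decomposition is needed. For a bgzipped VCF, the lines can be read with Python's gzip.open(path, "rt"), since BGZF files are valid gzip streams.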

Reference FASTA

The genome assembly both VCFs were called against (--ref):

  • Must be indexed with samtools faidx (creates a .fai file)

  • Both plain FASTA and bgzipped FASTA (with .gzi index) are supported

  • Only bgzip compression is supported, not plain gzip (random access requires BGZF block structure)
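A bgzipped file can be told apart from a plain gzip file by its header. Per the BGZF specification, each block is a gzip member with the FEXTRA flag set and an extra subfield identified by the bytes "BC" with a payload length of 2. The sketch below applies that test to the first bytes of a file. The function name is_bgzf is hypothetical.

```python
import gzip
import struct


def is_bgzf(header: bytes) -> bool:
    """Return True if the bytes start with a BGZF block header."""
    # gzip magic (1f 8b), deflate method (08), and FEXTRA flag (bit 2)
    if len(header) < 12 or header[:3] != b"\x1f\x8b\x08" or not header[3] & 0x04:
        return False
    (xlen,) = struct.unpack("<H", header[10:12])
    extra = header[12:12 + xlen]
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack("<BBH", extra[pos:pos + 4])
        if (si1, si2, slen) == (66, 67, 2):  # 'B', 'C', 2-byte block size
            return True
        pos += 4 + slen  # move on to the next extra subfield
    return False


# The 28-byte empty block that bgzip writes at the end of every file
bgzf_block = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00"
    " 1b 00 03 00 00 00 00 00 00 00 00 00"
)
plain_gzip = gzip.compress(b">chr1\nACGT\n")

print(is_bgzf(bgzf_block), is_bgzf(plain_gzip))
```

To test a real file, read its first few dozen bytes in binary mode and pass them to the function.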

GFF3 Annotation

A GFF3 file defining gene models (--gff):

  • Must contain gene, mRNA/transcript, and CDS features linked by Parent attributes

  • Both plain text and gzip-compressed (.gff3.gz) files are supported

  • MKado selects the longest transcript per gene automatically

  • Genes where the total CDS length is not divisible by 3 are skipped

  • GTF format is not supported. MKado requires GFF3, which uses key=value attributes rather than GTF’s key "value" style. If needed, convert with gffread annotation.gtf -o annotation.gff3.
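To anticipate which gene models might be skipped, the total CDS length per transcript can be computed directly from the annotation. The sketch below sums CDS feature lengths by Parent attribute, using the fact that GFF3 coordinates are 1-based and inclusive. It is a simplified illustration with made-up records. The name cds_length_by_parent is hypothetical, and MKado's own transcript selection is not reproduced.

```python
from collections import defaultdict
from typing import Dict, Iterable


def cds_length_by_parent(gff_lines: Iterable[str]) -> Dict[str, int]:
    """Sum CDS lengths for each Parent ID found in GFF3 lines."""
    totals = defaultdict(int)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        start, end = int(cols[3]), int(cols[4])
        # Column 9 holds key=value pairs separated by semicolons
        attrs = dict(kv.strip().split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                totals[parent] += end - start + 1  # 1-based, inclusive
    return dict(totals)


# Hypothetical annotation: tx1 has two CDS segments, tx2 has one
gff_lines = [
    "##gff-version 3",
    "\t".join(["chr1", "src", "CDS", "1", "90", ".", "+", "0", "ID=cds1;Parent=tx1"]),
    "\t".join(["chr1", "src", "CDS", "201", "260", ".", "+", "0", "ID=cds2;Parent=tx1"]),
    "\t".join(["chr1", "src", "CDS", "501", "600", ".", "+", "0", "ID=cds3;Parent=tx2"]),
]
totals = cds_length_by_parent(gff_lines)
not_divisible = [tx for tx, length in totals.items() if length % 3 != 0]
print(totals, not_divisible)
```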

Example Data

The examples/ directory contains properly formatted FASTA example data:

# Examine example file
mkado info examples/anopheles_batch/AGAP000078.fa

# Run test on example
mkado test examples/anopheles_batch/AGAP000078.fa -i gamb -o afun