File Formats and Input Requirements
====================================

MKado supports two input modes: **FASTA** (aligned coding sequences for ``mkado test`` and ``mkado batch``) and **VCF** (variant calls for ``mkado vcf``). This page describes the input requirements for both.

FASTA Format
------------

MKado reads standard FASTA format:

.. code-block:: text

   >sequence_name1 optional description
   ATGCATGCATGC...
   >sequence_name2
   ATGCATGCATGC...

Requirements:

- Standard FASTA format with ``>`` headers
- Sequences can span multiple lines
- Any valid sequence characters (ACGT, gaps, ambiguous codes)

Alignment Requirements
----------------------

Sequences must be pre-aligned:

- All sequences must be the same length
- Gaps should be represented as ``-``
- Alignment should be in-frame (codon-aligned)

Codon Alignment
^^^^^^^^^^^^^^^

MKado analyzes coding sequences at the codon level. Your alignment should:

- Start at the first position of a codon
- Be a multiple of 3 nucleotides (excluding gaps)
- Maintain reading frame throughout

Example of proper codon alignment:

.. code-block:: text

   >seq1
   ATGGCC---TAAACT
   >seq2
   ATGGCCTGATAAACT

If your alignment isn't codon-aligned, use the ``-r`` option to specify the reading frame (1, 2, or 3).

Sequence Naming Conventions
---------------------------

For combined file mode, sequence names should contain identifiable patterns for filtering:

.. code-block:: text

   >speciesA_gene1_sample1
   ATGCATGC...
   >speciesA_gene1_sample2
   ATGCATGC...
   >speciesB_gene1_sample1
   ATGCATGC...

Then filter with:

.. code-block:: bash

   mkado test alignment.fa -i "speciesA" -o "speciesB"

The pattern matching is case-sensitive substring matching.

Ingroup and Outgroup
--------------------

Definitions
^^^^^^^^^^^

- **Ingroup**: The species of primary interest (typically polymorphic population samples)
- **Outgroup**: A closely related species used to identify fixed differences

Selection Guidelines
^^^^^^^^^^^^^^^^^^^^

- Outgroup should be close enough to have reliable alignments
- But distant enough to have accumulated fixed differences
- Multiple outgroup sequences can be used (consensus is taken)

For polarized tests:

- **Second outgroup**: A more distant species to determine mutation direction
- Should be divergent from both ingroup and primary outgroup

Handling Special Cases
----------------------

Gaps
^^^^

- Codons containing gaps (``---``) are excluded from analysis
- Partial gaps within codons are handled conservatively

Stop Codons
^^^^^^^^^^^

- Internal stop codons are flagged but not automatically excluded
- Check your alignments if you see unexpected results

Ambiguous Bases
^^^^^^^^^^^^^^^

- ``N`` and other IUPAC ambiguity codes are supported
- Codons with ambiguous bases may be excluded from certain calculations

Common Problems
---------------

Alignment Not in Frame
^^^^^^^^^^^^^^^^^^^^^^

**Symptom**: Unexpected results, many excluded codons

**Solution**: Use ``mkado info`` to check alignment properties, specify ``-r`` for reading frame

.. code-block:: bash

   mkado info alignment.fa
   mkado test alignment.fa -i sp1 -o sp2 -r 2  # Try reading frame 2

Wrong Species Pattern
^^^^^^^^^^^^^^^^^^^^^

**Symptom**: "No sequences found" error

**Solution**: Check sequence names and pattern

.. code-block:: bash

   mkado info alignment.fa  # Lists all sequence names
   mkado test alignment.fa -i "correct_pattern" -o "outgroup"

Unequal Sequence Lengths
^^^^^^^^^^^^^^^^^^^^^^^^

**Symptom**: Error about alignment length

**Solution**: Re-align sequences or check for truncated sequences

VCF Mode Input Files
--------------------

The ``mkado vcf`` command requires three file types in addition to a GFF3 annotation. See :doc:`vcf-input` for the full guide.

VCF Files
^^^^^^^^^

Both the ingroup (``--vcf``) and outgroup (``--outgroup-vcf``) inputs are standard VCF files:

- Bgzipped + tabix-indexed is recommended for performance, but uncompressed VCF also works
- The ingroup VCF should be a multi-sample population VCF with genotype fields
- The outgroup VCF should be a single-sample VCF called against the **same reference genome**
- Multi-allelic sites should be decomposed beforehand with for example ``bcftools norm -m-``

Reference FASTA
^^^^^^^^^^^^^^^

The genome assembly both VCFs were called against (``--ref``):

- Must be indexed with ``samtools faidx`` (creates a ``.fai`` file)
- Both plain FASTA and bgzipped FASTA (with ``.gzi`` index) are supported
- Only **bgzip** compression is supported, not plain gzip (random access requires BGZF block structure)

GFF3 Annotation
^^^^^^^^^^^^^^^

A GFF3 file defining gene models (``--gff``):

- Must contain ``gene``, ``mRNA``/``transcript``, and ``CDS`` features linked by ``Parent`` attributes
- Both plain text and gzip-compressed (``.gff3.gz``) files are supported
- MKado selects the longest transcript per gene automatically
- Genes where the total CDS length is not divisible by 3 are skipped
- **GTF format is not supported** — MKado requires GFF3 (``key=value`` attributes, not GTF's ``key "value"`` style). Convert with ``gffread annotation.gtf -o annotation.gff3`` if needed.

Example Data
------------

The ``examples/`` directory contains properly formatted FASTA example data:

.. code-block:: bash

   # Examine example file
   mkado info examples/anopheles_batch/AGAP000078.fa

   # Run test on example
   mkado test examples/anopheles_batch/AGAP000078.fa -i gamb -o afun