last-train
==========

This script tries to find suitable score parameters (substitution and
gap scores) for aligning some given sequences.

It (probabilistically) aligns the sequences using some initial score
parameters, then estimates better score parameters based on the
alignments, and repeats this procedure until the parameters stop
changing.

The usage is like this::

  lastdb mydb reference.fasta
  last-train mydb queries.fasta

last-train prints a summary of each alignment step, followed by the
final score parameters, in a format that can be read by `lastal's -p
option <lastal.html#score-options>`_.

You can also pipe sequences into last-train, for example::

  zcat queries.fasta.gz | last-train mydb

Options
-------

  -h, --help
         Show a help message, with default option values, and exit.
  -v, --verbose
         Show more details of intermediate steps.

Training options
~~~~~~~~~~~~~~~~

  --revsym
         Force the substitution scores to have reverse-complement
         symmetry, e.g. score(A→G) = score(T→C).  This is often
         appropriate, if neither strand is "special".
  --matsym
         Force the substitution scores to have directional symmetry,
         e.g. score(A→G) = score(G→A).
  --gapsym
         Force the insertion costs to equal the deletion costs.
  --pid=PID
         Ignore alignments with > PID% identity.  This aims to
         optimize the parameters for low-similarity alignments,
         similarly to the BLOSUM matrices.
  --sample-number=N
         Use N randomly-chosen chunks of the query sequences.  The
         queries are chopped into fixed-length chunks (as if they were
         first concatenated into one long sequence).  If there are ≤ N
         chunks, all are picked.  Otherwise, if the final chunk is
         shorter, it is never picked.  0 means use everything.
  --sample-length=L
         Use randomly-chosen chunks of length L.

All options below this point are passed to lastal to do the
alignments: they are described in more detail at `<lastal.html>`_.

Initial parameter options
~~~~~~~~~~~~~~~~~~~~~~~~~

  -r SCORE   Initial match score.
  -q COST    Initial mismatch cost.
  -p NAME    Initial match/mismatch score matrix.
  -a COST    Initial gap existence cost.
  -b COST    Initial gap extension cost.
  -A COST    Initial insertion existence cost.
  -B COST    Initial insertion extension cost.

Alignment options
~~~~~~~~~~~~~~~~~

  -D LENGTH  Query letters per random alignment.  (See `here
             <last-evalues.html>`_.)
  -E EG2     Maximum expected alignments per square giga.  (See `here
             <last-evalues.html>`_.)
  -s NUMBER  Which query strand to use: 0=reverse, 1=forward, 2=both.
  -S NUMBER  Score matrix applies to forward strand of: 0=reference,
             1=query.  This matters only if the scores lack
             reverse-complement symmetry.
  -C COUNT   Before extending gapped alignments, discard any gapless
             alignment whose query range lies in COUNT other gapless
             alignments with higher score-per-length.  This aims to
             reduce run time.
  -T NUMBER  Type of alignment: 0=local, 1=overlap.
  -m COUNT   Maximum number of initial matches per query position.
  -k STEP    Look for initial matches starting only at every STEP-th
             position in each query.
  -P COUNT   Number of parallel threads.
  -Q NUMBER  Query sequence format: 0=fasta, 1=fastq-sanger.
             **Important:** if you use option -Q, last-train will only
             train the gap scores, and leave the substitution scores
             at their initial values.

Bugs
----

* last-train assumes that gap lengths roughly follow a geometric
  distribution.  If they do not (which is often the case), the results
  may be poor.

* last-train can fail for various reasons, e.g. if the sequences are
  too dissimilar.  If it fails to find any alignments, you could try
  `reducing the alignment significance threshold
  <last-tutorial.html#example-6-find-very-short-dna-alignments>`_ with
  option -D.
