FSA is a probabilistic multiple sequence alignment algorithm which uses
a "distance-based" approach to aligning homologous protein, RNA or DNA
sequences. Much as distance-based phylogenetic reconstruction methods
like Neighbor-Joining build a phylogeny using only pairwise divergence
estimates, FSA builds a multiple alignment using only pairwise
estimations of homology. This is made possible by the sequence
annealing technique for constructing a multiple alignment from pairwise
comparisons, developed by Ariel Schwartz in
"Posterior Decoding Methods for Optimization and Control of Multiple
Alignments."
FSA brings the high accuracies previously available only for small-scale analyses of proteins or RNAs
to large-scale problems such as aligning thousands of sequences or megabase-long sequences.
FSA introduces several novel methods for constructing better alignments:
FSA uses machine-learning techniques to estimate gap and
substitution parameters on the fly for each set of input sequences.
This "query-specific learning" alignment method makes FSA very robust: it
can produce superior alignments of sets of homologous sequences
which are subject to very different evolutionary constraints.
FSA is capable of aligning hundreds or even thousands of sequences
using a randomized inference algorithm to reduce the computational
cost of multiple alignment. This randomized inference can be over
ten times faster than a direct approach with little loss of
accuracy.
FSA can quickly align very long sequences using the "anchor
annealing" technique for resolving anchors and projecting them with
transitive anchoring. It then stitches together the alignment
between the anchors using the methods described above.
The included GUI, MAD (Multiple Alignment Display), can display
the intermediate alignments produced by FSA, where each character
is colored according to the probability that it is correctly
aligned.
FSA is built and installed by running the following commands:
tar xvzf fsa-X.X.X.tar.gz
cd fsa-X.X.X
./configure
make
make install
(Replace fsa-X.X.X.tar.gz with the name of the file
that you downloaded.)
The FSA executables can then be found in your system's standard
binary directory (e.g., /usr/local/bin). Alternatively, you may
just run FSA from the src/main subdirectory in which it is built
(which does not require running the make install step).
If you wish to install the FSA binaries in a location other than
your system's standard directories (which usually requires root
permissions), specify the top-level installation directory with
the --prefix option to configure. For example,
./configure --prefix=$HOME
specifies that binaries should be installed in $HOME/bin, libraries in
$HOME/lib, etc.
If you wish to align long sequences, then you must download and install MUMmer,
which FSA calls to get candidate anchors between sequences.
When running ./configure, either have the MUMmer executable in your path
or specify the executable with the --with-mummer option to ./configure.
FSA can also call exonerate
to obtain anchors. If you wish to use exonerate, then as with MUMmer,
when running ./configure, you must either have the exonerate executable in your path
or specify the executable with the --with-exonerate option to ./configure.
FSA can align thousands or tens of thousands of sequences.
Try running FSA with the --fast option. You can get finer-grained control with
the --alignment-number <int> option, which controls the total number of pairwise
comparisons which FSA uses to build a multiple alignment. If you want to align
N sequences, then you can set --alignment-number to as low as (N - 1) or as
high as (N choose 2) == (N * (N - 1) / 2).
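The bounds on --alignment-number can be checked with a few lines of Python. This is only an illustrative sketch of the arithmetic stated above; the helper name alignment_number_bounds is hypothetical and is not part of FSA.

```python
from math import comb


def alignment_number_bounds(n):
    """Return (lowest, highest) values of --alignment-number for n sequences.

    Hypothetical helper, not part of FSA: the lower bound is N - 1 and
    the upper bound is N choose 2, as stated in the text above.
    """
    if n < 2:
        raise ValueError("need at least two sequences to align")
    return n - 1, comb(n, 2)


low, high = alignment_number_bounds(1000)
print(low, high)  # 999 499500
```

For 1000 sequences the permitted range therefore runs from 999 to 499500 pairwise comparisons, which shows how much work a low setting can save.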
FSA can align long sequences (megabases or tens of megabases) with the
"anchor annealing" technique. It uses the program MUMmer to find maximal
unique matches between pairs of sequences to be aligned, resolves
inconsistencies with anchor annealing, and then pieces together the alignment
between anchored regions using its standard inference method.
FSA can also use the program exonerate to detect remote homology.
Please use the --with-mummer and --with-exonerate options to ./configure
before compilation as explained in "How do I run FSA?".
You can read about the MUMmer and exonerate programs in:
If the genomes which you want to align have few rearrangements, then
you can run FSA directly on them. If they have rearrangements, then
you must first use a program such as Colin Dewey's Mercator
to construct a homology map for the genomes and then run FSA on the
homologous segments. FSA can directly use the constraint information
produced by Mercator to inform its multiple alignment
(use the --mercator option to specify the Mercator constraint file).
FSA has an extensive logging system. Try running with the --log 7 option
to see progress of the DP algorithm, and --log 6 to see progress of anchoring
(when aligning long sequences).
Log levels from 0 to 10 are permitted,
where lower numbers are more verbose.
Most alignment programs attempt to maximize sensitivity, even at
the expense of specificity, leading to over-alignment (alignment of
non-homologous sequence). FSA, in contrast, maximizes the expected
accuracy of the alignment with a measure which rewards sensitivity
but penalizes over-alignment. If FSA cannot reliably detect homology,
then it will leave characters unaligned (gapped).
By default FSA stops aligning characters when the probability that
a character is aligned is equal to the probability that it is gapped.
Use the --maxsn option for maximum sensitivity.
You can get finer-grained control with the --gapfactor <int> option.
By default FSA runs at --gapfactor 1.
Use --gapfactor 0 for highest sensitivity (this is equivalent to --maxsn)
and gap factors > 1 for higher specificity.
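One way to picture the effect of the gap factor is as a threshold on the ratio between the two probabilities. The sketch below is a deliberate simplification and an assumption, not FSA's actual scoring code: it treats a character as alignable while its probability of being aligned exceeds the gap factor times its probability of being gapped. That reproduces the default stopping point described above at gap factor 1 and maximum sensitivity at gap factor 0. The function name should_align and the example probabilities are hypothetical.

```python
def should_align(p_aligned, p_gapped, gapfactor=1):
    """Simplified stopping rule (an assumption, not FSA's implementation).

    Align a character only while its probability of being aligned
    exceeds gapfactor times its probability of being gapped.
    """
    if gapfactor < 0:
        raise ValueError("gapfactor must be non-negative")
    return p_aligned > gapfactor * p_gapped


# Made-up (P(aligned), P(gapped)) values for three characters.
candidates = [(0.9, 0.1), (0.6, 0.4), (0.3, 0.7)]
for g in (0, 1, 2):
    kept = [c for c in candidates if should_align(*c, gapfactor=g)]
    print(g, len(kept))
```

In this toy setting, raising the gap factor leaves more characters unaligned, which is the trade of sensitivity for specificity that the option is meant to control.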
Run FSA with the command-line option --gui. If the input alignment file
is myseqs.fasta, then FSA will write the files myseqs.fasta.gui and
myseqs.fasta.probs. Invoke the MAD (Multiple Alignment Display) GUI
as
java -jar display/mad.jar myseqs.fasta
Please be patient if it takes a while to load.
Characters in the multiple alignment are colored according to the
probability that they are correctly aligned.
You can see what a typical FSA alignment looks like by running
on one of the provided example alignments, such as
The accuracy estimates produced by FSA and the GUI are useful for
downstream analyses, for example allowing biologists to restrict
their analyses to the most reliable portions of the alignment
or edit unreliable parts of the alignment by hand.
The accuracy measures also allow you to visualize how FSA works.
If you switch to the "Specificity" coloring and watch the animation from the beginning,
you will see that FSA first aligns characters whose homology it is most sure of (red),
and only later aligns characters of unclear homology (blue).
Similarly, you can visualize the sensitivity/specificity tradeoff: Near the beginning
of the alignment the specificity is very high, but the sensitivity is low.
The specificity decreases and the sensitivity increases as the alignment progresses.
The GUI displays five different accuracy measures for each position in the multiple alignment.
These are:
Accuracy: What characters or gaps of the multiple alignment are the most accurate?
Accuracy is the per-character estimated Alignment Metric Accuracy,
which measures the fidelity of both aligned characters and unaligned characters (gaps).
It can be thought of as a single measure encompassing the sensitivity/specificity tradeoff.
Sensitivity: What characters are aligned with the greatest sensitivity?
Sensitivity is the estimated number of correctly-aligned character pairs divided by the
true number of aligned character pairs.
Sensitivity is defined as the expectation of (True positives) / (True positives + False negatives).
This definition is equivalent to recall as used in classification problems.
Specificity: What characters are aligned with the greatest specificity?
Specificity is the estimated fraction of character pairs which are aligned correctly.
Specificity is defined as the expectation of (True positives) / (True positives + False positives).
This definition is equivalent to precision as used in classification problems.
It is also frequently called Positive Predictive Value in the literature.
Certainty: Was there a better place to align this character?
Certainty measures whether a character or gap is aligned correctly.
Consistency: What parts of the multiple alignment are optimal on a pairwise level?
Consistency measures the extent to which the posterior probabilities from pairwise comparisons
are optimized by the multiple alignment. If a multiple alignment is perfectly consistent,
then each pairwise alignment implied by the multiple alignment corresponds perfectly to the pairwise alignment
which you would obtain by aligning only those two sequences.
Please see the manuscript for precise definitions of these reliability measures.
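The sensitivity and specificity formulas above can be made concrete with a small Python sketch. It computes the two ratios against a known reference alignment, whereas FSA reports their expectations under its statistical model, so this illustrates the definitions only and is not FSA's own estimator. Representing an alignment as a set of (i, j) index pairs, and the function name sensitivity_specificity, are assumptions made for the example.

```python
def sensitivity_specificity(predicted, reference):
    """Compare two pairwise alignments given as sets of aligned pairs.

    Each pair (i, j) is a hypothetical encoding meaning that position i
    of the first sequence is aligned to position j of the second.
    Returns (sensitivity, specificity).
    """
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)  # pairs aligned in both alignments
    fp = len(predicted - reference)  # pairs aligned only in the prediction
    fn = len(reference - predicted)  # pairs aligned only in the reference
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity


reference = {(0, 0), (1, 1), (2, 2), (3, 3)}
predicted = {(0, 0), (1, 1), (2, 3)}
sn, sp = sensitivity_specificity(predicted, reference)
print(sn, sp)
```

Here two of the four reference pairs are recovered (sensitivity 0.5) and two of the three predicted pairs are correct (specificity 2/3). A prediction that aligned fewer characters could raise specificity while lowering sensitivity.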
Notice that the accuracy scores tend to
decrease near gaps, reflecting the difficulty of precisely
resolving gap boundaries.
Use the --stockholm output option to tell FSA
to produce a Stockholm-format alignment. The alignment is marked up with
a per-column accuracy annotation which is identical to the values
reported by the GUI.
FSA includes tools for working with Stockholm alignments, such as
prettify.pl for making Stockholm-format alignments human-readable,
in the perl/ directory.
The included script cmpalign.pl will compare two alignments
and report accuracy measures including Accuracy (AMA), Sensitivity and Specificity.
It can parse Stockholm, multi-FASTA, MSF and CLUSTAL format alignments.
This occurs when FSA is unable to find good anchors between your sequences
to restrict the complexity of inference. This can occur if you use the option
--noanchored to prevent anchoring, if your sequences are very diverged,
or if they have many simple repeats. Use the --maxram option to
prevent FSA from attempting to perform exhaustive inference when it can't
find good anchors. FSA will then leave unaligned any sequence for which it
cannot find sufficiently good anchors.
Bradley RK, Roberts A, Smoot M, Juvekar S, Do J, Dewey C, Holmes I, Pachter L (2009) Fast Statistical Alignment. PLoS Computational Biology. 5:e1000392.
The FSA manuscript can also be found in the doc/
directory of the FSA source code distribution.
FSA is designed to be modular and there are many aspects of the program that can be improved;
we welcome your help! The code is under Git version control.
Please contact us at fsa@math.berkeley.edu for information.
The source code is set up for use with Doxygen,
a system for automated building of documentation.
Source code in seq/ and util/ is from Ian Holmes's DART library [1],
which is used for input and output routines.
FSA's DP code was generated by HMMoC by Gerton Lunter [2]. The
aligner example distributed with HMMoC, which implements a
learning procedure for gap parameters, was an inspiration for FSA's
learning strategies. FSA's banding code is taken directly from the
aligner example.
The sequence annealing technique for constructing a multiple
alignment from pairwise comparisons was developed by Ariel Schwartz.
The implementation of sequence annealing in FSA is a modified version
of the original implementation in AMAP by Ariel Schwartz and Lior Pachter [3,4].
The anchor annealing approach used in FSA is modeled after the recursive
anchoring strategy used in MAVID by Nicolas Bray and Lior Pachter [5].
The MAD GUI for FSA was written by Adam Roberts based on a preliminary
version developed by Michael Smoot.