FSA Frequently Asked Questions

Introduction

What is FSA?
What does FSA stand for?
How do I download and install FSA?
How do I run FSA?
Is there a webserver for FSA?
How do I get a help message to explain the options?

Alignments

How do I align many sequences?
How do I align long sequences?
How do I align genomes?
Why are there so many gaps in my alignment?
How do I control the sensitivity/specificity tradeoff of my alignment?

Visualization

How do I visualize the alignments produced by FSA?
How do I see what parts of my alignment are the most reliable?

Troubleshooting

My alignment is taking a long time. How do I see what FSA is doing?
I'm getting memory errors when using the GUI.
I get out-of-memory (bad alloc) errors when I try to align long sequences.

More information

How do I contact you?
How do I cite FSA?
Under what license is FSA distributed?
How do I become an FSA developer?
What other programs influenced the development of FSA?

Introduction

What is FSA?

FSA is a probabilistic multiple sequence alignment algorithm which uses a "distance-based" approach to aligning homologous protein, RNA or DNA sequences. Much as distance-based phylogenetic reconstruction methods like Neighbor-Joining build a phylogeny using only pairwise divergence estimates, FSA builds a multiple alignment using only pairwise estimations of homology. This is made possible by the sequence annealing technique for constructing a multiple alignment from pairwise comparisons, developed by Ariel Schwartz in "Posterior Decoding Methods for Optimization and Control of Multiple Alignments."

FSA brings the high accuracies previously available only for small-scale analyses of proteins or RNAs to large-scale problems such as aligning thousands of sequences or megabase-long sequences. FSA introduces several novel methods for constructing better alignments:

FSA uses machine-learning techniques to estimate gap and substitution parameters on the fly for each set of input sequences. This "query-specific learning" alignment method makes FSA very robust: it can produce superior alignments of sets of homologous sequences which are subject to very different evolutionary constraints.
FSA is capable of aligning hundreds or even thousands of sequences using a randomized inference algorithm to reduce the computational cost of multiple alignment. This randomized inference can be over ten times faster than a direct approach with little loss of accuracy.
FSA can quickly align very long sequences using the "anchor annealing" technique for resolving anchors and projecting them with transitive anchoring. It then stitches together the alignment between the anchors using the methods described above.
The included GUI, MAD (Multiple Alignment Display), can display the intermediate alignments produced by FSA, where each character is colored according to the probability that it is correctly aligned.

What does FSA stand for?

Fast statistical alignment: We use machine-learning techniques to quickly re-estimate parameters for each alignment problem.

Fast sequence annealing: We build a multiple alignment from pairwise comparisons with the sequence annealing technique.

Functional statistical alignment: We implicitly use functional information when constructing alignments.

Full Speed Ahead

...and more...

How do I download and install FSA?

FSA is hosted by SourceForge. You can download the latest version from the SourceForge project page.

FSA is built and installed by running the following commands:

tar xvzf fsa-X.X.X.tar.gz
cd fsa-X.X.X
./configure
make
make install

(Substitute fsa-X.X.X.tar.gz with the name of the file that you downloaded.)

The FSA executables can then be found in your system's standard binary directory (e.g., /usr/local/bin). Alternatively, you may just run FSA from the src/main subdirectory in which it is built (which does not require running the make install step). If you wish to install the FSA binaries in a location other than your system's standard directories (which usually requires root permissions), specify the top-level installation directory with the --prefix option to configure. For example,

./configure --prefix=$HOME

specifies that binaries should be installed in $HOME/bin, libraries in $HOME/lib, etc.

If you wish to align long sequences, then you must download and install MUMmer, which FSA calls to get candidate anchors between sequences. When running ./configure, either have the MUMmer executable in your path or specify the executable with the --with-mummer option to ./configure.

FSA can also call exonerate to obtain anchors. If you wish to use exonerate, then as with MUMmer, when running ./configure, you must either have the exonerate executable in your path or specify the executable with the --with-exonerate option to ./configure.

See the README file and How do I align long sequences? for more information.

Please contact us if you have any build problems.

How do I run FSA?

FSA accepts FASTA-format input files and outputs multi-FASTA alignments by default. The most basic usage is:

fsa <mysequences.fa> >myalignedsequences.mfa

To produce a Stockholm-format alignment instead, pass the --stockholm option:

fsa --stockholm <mysequences.fa> >myalignedsequences.stk
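If you want to call FSA from a script, the basic multi-FASTA invocation can be wrapped as follows. This is a minimal sketch, assuming the fsa executable is on your PATH; the helper names run_fsa and parse_multifasta are hypothetical and are not part of the FSA distribution.

```python
import subprocess


def parse_multifasta(text):
    """Parse multi-FASTA text into a dict mapping sequence name to aligned sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(chunks) for n, chunks in records.items()}


def run_fsa(fasta_path):
    """Run `fsa <fasta_path>` and return the parsed alignment.

    Assumes the fsa executable is installed and on the PATH.
    """
    result = subprocess.run(
        ["fsa", fasta_path], capture_output=True, text=True, check=True
    )
    return parse_multifasta(result.stdout)
```

Calling run_fsa("mysequences.fa") would then return a dictionary of gapped sequences keyed by sequence name.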

Is there a webserver for FSA?

There is a webserver hosted here to which you can submit alignment jobs. You will be emailed when the alignment is complete.

How do I get a help message to explain the options?

Run

fsa --help

Are there example sequence files and alignments?

Yes. Please see the examples/ directory.

Alignments

How do I align many sequences?

FSA can align thousands or tens of thousands of sequences. Try running FSA with the --fast option. You can get finer-grained control with the --alignment-number <int> option, which controls the total number of pairwise comparisons which FSA uses to build a multiple alignment. If you want to align N sequences, then you can set --alignment-number to as low as (N - 1) or as high as (N choose 2) == (N * (N - 1) / 2).
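The permitted range for --alignment-number follows directly from the formulas above. The sketch below only does that arithmetic; the function name alignment_number_bounds is hypothetical and is not part of FSA.

```python
def alignment_number_bounds(n_sequences):
    """Return (minimum, maximum) values of --alignment-number for N sequences."""
    if n_sequences < 2:
        raise ValueError("need at least two sequences to align")
    minimum = n_sequences - 1                       # N - 1
    maximum = n_sequences * (n_sequences - 1) // 2  # N choose 2
    return minimum, maximum


print(alignment_number_bounds(1000))  # (999, 499500)
```

For 1000 sequences, the exhaustive setting uses roughly 500 times as many pairwise comparisons as the minimum, which is why a lower setting can greatly reduce the running time.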

How do I align long sequences?

FSA can align long sequences (megabases or tens of megabases) with the "anchor annealing" technique. It uses the program MUMmer to find maximal unique matches between pairs of sequences to be aligned, resolves inconsistencies with anchor annealing, and then pieces together the alignment between anchored regions using its standard inference method.

FSA can also use the program exonerate to detect remote homology.

Please pass the --with-mummer and --with-exonerate options to ./configure before compilation, as explained in How do I download and install FSA?.

You can read about the MUMmer and exonerate programs in:

S. Kurtz, A. Phillippy, A.L. Delcher, M. Smoot, M. Shumway, C. Antonescu, and S.L. Salzberg. Versatile and open software for comparing large genomes. Genome Biology. 2004, 5:R12.

G. S. Slater and E. Birney. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics. 2005, 6:31.

How do I align genomes?

If the genomes which you want to align have few rearrangements, then you can run FSA directly on them. If they have rearrangements, then you must first use a program such as Colin Dewey's Mercator to construct a homology map for the genomes and then run FSA on the homologous segments. FSA can directly use the constraint information produced by Mercator to inform its multiple alignment (use the --mercator option to specify the Mercator constraint file).

You can read about Mercator in:

C. Dewey. Whole-genome alignments and polytopes for comparative genomics. Ph.D. thesis, University of California, Berkeley. 2006.

My alignment is taking a long time. How do I see what FSA is doing?

FSA has an extensive logging system. Try running with the --log 7 option to see progress of the DP algorithm, and --log 6 to see progress of anchoring (when aligning long sequences). Log levels from 0 to 10 are permitted, where lower numbers are more verbose.

Why are there so many gaps in my alignment?

Most alignment programs attempt to maximize sensitivity, even at the expense of specificity, leading to over-alignment (alignment of non-homologous sequence). FSA, in contrast, maximizes the expected accuracy of the alignment with a measure which rewards sensitivity but penalizes over-alignment. If FSA cannot reliably detect homology, then it will leave characters unaligned (gapped).

How do I control the sensitivity/specificity tradeoff of my alignment?

By default FSA stops aligning characters when the probability that a character is aligned is equal to the probability that it is gapped. Use the --maxsn option for maximum sensitivity.

You can get finer-grained control with the --gapfactor <int> option. By default FSA runs at --gapfactor 1. Use --gapfactor 0 for highest sensitivity (this is equivalent to --maxsn) and gap factors > 1 for higher specificity.
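The effect of the gap factor can be pictured with a toy decision rule. This is an illustrative simplification that is merely consistent with the description above; it is not FSA's actual objective function, and the function should_align and its threshold rule are assumptions made for this sketch.

```python
def should_align(p_aligned, p_gapped, gap_factor=1.0):
    """Toy rule (an assumption, not FSA's code): align a character only if
    its alignment probability exceeds gap_factor times its gap probability."""
    return p_aligned > gap_factor * p_gapped


# Default gap factor of 1: align only when alignment is more probable than a gap.
default_choice = should_align(0.3, 0.7)

# Gap factor of 0: any nonzero alignment probability is accepted (maximum sensitivity).
sensitive_choice = should_align(0.3, 0.7, gap_factor=0)

# Gap factor above 1: a stricter threshold, favoring specificity.
strict_choice = should_align(0.6, 0.4, gap_factor=2)
```

Under this toy rule, default_choice is False, sensitive_choice is True, and strict_choice is False, even though the last character would be aligned at the default setting.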

Visualization

How do I visualize the alignments produced by FSA?

Run FSA with the command-line option --gui. If the input sequence file is myseqs.fasta, then FSA will write the files myseqs.fasta.gui and myseqs.fasta.probs. Invoke the MAD (Multiple Alignment Display) GUI as

java -jar display/mad.jar myseqs.fasta

Please be patient if it takes a while to load.

Characters in the multiple alignment are colored according to the probability that they are correctly aligned.

You can see what a typical FSA alignment looks like by running on one of the provided example alignments, such as

java -jar display/mad.jar examples/tRNA.aln1.fasta

You can also use the GUI to compare an FSA alignment with another alignment, such as one which you have edited by hand. Invoke the GUI as

java -jar display/mad.jar examples/tRNA.aln1.fasta myalignment.mfa

The alternate alignment myalignment.mfa must be in multi-FASTA format.

How do I see what parts of my alignment are the most reliable?

The accuracy estimates produced by FSA and the GUI are useful for downstream analyses, for example allowing biologists to restrict their analyses to the most reliable portions of the alignment or edit unreliable parts of the alignment by hand.

The accuracy measures also allow you to visualize how FSA works. If you switch to the "Specificity" coloring and watch the animation from the beginning, you will see FSA first aligns characters whose homology it is most sure of (red), and only later aligns characters of unclear homology (blue). Similarly, you can visualize the sensitivity/specificity tradeoff: Near the beginning of the alignment the specificity is very high, but the sensitivity is low. The specificity decreases and the sensitivity increases as the alignment progresses.

The GUI displays five different accuracy measures for each position in the multiple alignment. These are:

Accuracy: What characters or gaps of the multiple alignment are the most accurate?
Accuracy is the per-character estimated Alignment Metric Accuracy, which measures the fidelity of both aligned characters and unaligned characters (gaps). It can be thought of as a single measure encompassing the sensitivity/specificity tradeoff.
Sensitivity: What characters are aligned with the greatest sensitivity?
Sensitivity is the estimated number of correctly-aligned character pairs divided by the true number of aligned character pairs.
Sensitivity is defined as the expectation of (True positives) / (True positives + False negatives). This definition is equivalent to recall as used in classification problems.
Specificity: What characters are aligned with the greatest specificity?
Specificity is the estimated fraction of character pairs which are aligned correctly.
Specificity is defined as the expectation of (True positives) / (True positives + False positives). This definition is equivalent to precision as used in classification problems. It is also frequently called Positive Predictive Value in the literature.
Certainty: Was there a better place to align this character?
Certainty measures whether a character or gap is aligned correctly.
Consistency: What parts of the multiple alignment are optimal on a pairwise level?
Consistency measures the extent to which the posterior probabilities from pairwise comparisons are optimized by the multiple alignment. If a multiple alignment is perfectly consistent, then each pairwise alignment implied by the multiple alignment corresponds perfectly to the pairwise alignment which you would obtain by aligning only those two sequences.

Please see the manuscript for precise definitions of these reliability measures.
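The sensitivity and specificity formulas above can be made concrete when a reference alignment is available. The sketch below computes the two ratios from sets of aligned character pairs; note that FSA itself reports expected values estimated without a reference, so this only illustrates the definitions. The function name and the pair representation ((sequence, position), (sequence, position)) are assumptions made for this example.

```python
def sensitivity_specificity(predicted_pairs, true_pairs):
    """Compute (sensitivity, specificity) of predicted aligned pairs
    against a reference set of truly homologous pairs."""
    predicted = set(predicted_pairs)
    truth = set(true_pairs)
    true_positives = len(predicted & truth)
    false_negatives = len(truth - predicted)
    false_positives = len(predicted - truth)
    # Sensitivity (recall): TP / (TP + FN)
    total_true = true_positives + false_negatives
    sensitivity = true_positives / total_true if total_true else 0.0
    # Specificity as defined here (precision, PPV): TP / (TP + FP)
    total_predicted = true_positives + false_positives
    specificity = true_positives / total_predicted if total_predicted else 0.0
    return sensitivity, specificity


# Hypothetical example: four true pairs, three predicted pairs, two of them correct.
reference = {(("A", i), ("B", i)) for i in range(1, 5)}
prediction = {(("A", 1), ("B", 1)), (("A", 2), ("B", 2)), (("A", 3), ("B", 4))}
sn, sp = sensitivity_specificity(prediction, reference)
```

Here sn is 2/4 and sp is 2/3: leaving two true pairs unaligned lowers the sensitivity, while the one incorrectly aligned pair lowers the specificity.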

Notice that the accuracy scores tend to decrease near gaps, reflecting the difficulty of precisely resolving gap boundaries.

Output formats and tools

How do I parse Stockholm alignments?

Use the --stockholm output option to tell FSA to produce a Stockholm-format alignment. The alignment is marked up with a per-column accuracy annotation which is identical to the values reported by the GUI. FSA includes tools for working with Stockholm alignments, such as prettify.pl for making Stockholm-format alignments human-readable, in the perl/ directory.
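If you would rather read the Stockholm output in your own script, the format is simple enough to parse by hand. The following is a minimal sketch based on the general Stockholm convention that per-column annotations appear on "#=GC <feature> <annotation>" lines; it collects every such line rather than assuming a particular feature name. The function parse_stockholm is hypothetical, and the "Accuracy" tag in the sample is a made-up placeholder, not necessarily the tag which FSA writes.

```python
def parse_stockholm(text):
    """Return (sequences, column_annotations) parsed from Stockholm-format text."""
    sequences = {}
    column_annotations = {}
    for line in text.splitlines():
        line = line.rstrip()
        if not line or line == "//":
            continue  # blank lines and the end-of-alignment marker
        if line.startswith("#=GC"):
            # Per-column annotation: "#=GC <feature> <annotation string>"
            parts = line.split(None, 2)
            if len(parts) == 3:
                feature, annotation = parts[1], parts[2].strip()
                column_annotations[feature] = (
                    column_annotations.get(feature, "") + annotation
                )
        elif line.startswith("#"):
            continue  # header and other markup (#=GF, #=GS, #=GR) are skipped here
        else:
            parts = line.split(None, 1)
            if len(parts) == 2:
                name, chunk = parts
                # Concatenate chunks so interleaved alignments are handled too.
                sequences[name] = sequences.get(name, "") + chunk.strip()
    return sequences, column_annotations


# Hypothetical sample; the "Accuracy" feature name is a placeholder.
sample = "# STOCKHOLM 1.0\nseq1 AC-GT\nseq2 ACGGT\n#=GC Accuracy 99599\n//\n"
seqs, annotations = parse_stockholm(sample)
```

Inspect the keys of the returned annotation dictionary on a real FSA output file to see which per-column features are present.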

How do I compare alignments?

The included script cmpalign.pl will compare two alignments and report accuracy measures including Accuracy (AMA), Sensitivity and Specificity. It can parse Stockholm, multi-FASTA, MSF and CLUSTAL format alignments.

Troubleshooting

I'm getting memory errors when using the GUI.

If you're getting errors which say something like

Exception in thread ... java.lang.OutOfMemoryError: Java heap space

then try increasing the memory allowed with the -Xmx option, e.g.,

java -Xmx256m -jar display/mad.jar examples/tRNA.aln1.fasta

You can increase it up to the maximum allowed by your machine.

I get out-of-memory (bad alloc) errors when I try to align long sequences.

This occurs when FSA is unable to find good anchors between your sequences to restrict the complexity of inference. This can occur if you use the option --noanchored to prevent anchoring, if your sequences are very diverged, or if they have many simple repeats. Use the --maxram option to prevent FSA from attempting to perform exhaustive inference when it can't find good anchors. It will leave sequence for which it can't find sufficiently-good anchors unaligned.

More information

How do I contact you?

You can reach the FSA team at fsa@math.berkeley.edu with any questions, feedback, etc.

How do I cite FSA?

Bradley RK, Roberts A, Smoot M, Juvekar S, Do J, Dewey C, Holmes I, Pachter L (2009) Fast Statistical Alignment. PLoS Computational Biology. 5:e1000392.

The FSA manuscript can also be found in the doc/ directory of the FSA source code distribution.

Under what license is FSA distributed?

FSA is licensed under version 3 of the GNU General Public License. Please see the files LICENSE and COPYING for further information.

How do I become an FSA developer?

FSA is designed to be modular and there are many aspects of the program that can be improved; we welcome your help! The code is under Git version control. Please contact us at fsa@math.berkeley.edu for information. The source code is set up for use with Doxygen, a system for automated building of documentation.

What other programs influenced the development of FSA?

Source code in seq/ and util/ is from Ian Holmes's DART library [1], which is used for input and output routines.

FSA's DP code was generated by HMMoC by Gerton Lunter [2]. The aligner example distributed with HMMoC, which implements a learning procedure for gap parameters, was an inspiration for FSA's learning strategies. FSA's banding code is taken directly from the aligner example.

The sequence annealing technique for constructing a multiple alignment from pairwise comparisons was developed by Ariel Schwartz. The implementation of sequence annealing in FSA is a modified version of the original implementation in AMAP by Ariel Schwartz and Lior Pachter [3,4].

The anchor annealing approach used in FSA is modeled after the recursive anchoring strategy used in MAVID by Nicolas Bray and Lior Pachter [5].

The MAD GUI for FSA was written by Adam Roberts based on a preliminary version developed by Michael Smoot.

Please see:

[1] I. Holmes and R. Durbin. Dynamic Programming Alignment Accuracy. Journal of Computational Biology. 1998, 5 (3):493-504.

[2] G.A. Lunter. HMMoC - a Compiler for Hidden Markov Models. Bioinformatics. 2007, 23 (18):2485-2487.

[3] A.S. Schwartz. Posterior Decoding Methods for Optimization and Control of Multiple Alignments. Ph.D. Thesis, UC Berkeley. 2007.

[4] A.S. Schwartz and L. Pachter. Multiple Alignment by Sequence Annealing. Bioinformatics. 2007, 23 (2):e24-e29.

[5] N. Bray and L. Pachter. MAVID: Constrained Ancestral Alignment of Multiple Sequences. Genome Research. 2004, 14:693-699.