FSA: Fast Statistical Alignment

See a movie of this alignment.

Introduction

FSA is a probabilistic multiple sequence alignment algorithm which uses a "distance-based" approach to aligning homologous protein, RNA or DNA sequences. Much as distance-based phylogenetic reconstruction methods like Neighbor-Joining build a phylogeny using only pairwise divergence estimates, FSA builds a multiple alignment using only pairwise estimations of homology. This is made possible by the sequence annealing technique for constructing a multiple alignment from pairwise comparisons, developed by Ariel Schwartz in "Posterior Decoding Methods for Optimization and Control of Multiple Alignments."

FSA brings the high accuracies previously available only for small-scale analyses of proteins or RNAs to large-scale problems such as aligning thousands of sequences or megabase-long sequences. FSA introduces several novel methods for constructing better alignments:

FSA uses machine-learning techniques to estimate gap and substitution parameters on the fly for each set of input sequences. This "query-specific learning" alignment method makes FSA very robust: it can produce superior alignments of sets of homologous sequences which are subject to very different evolutionary constraints.
FSA is capable of aligning hundreds or even thousands of sequences using a randomized inference algorithm to reduce the computational cost of multiple alignment. This randomized inference can be over ten times faster than a direct approach with little loss of accuracy.
FSA can quickly align very long sequences using the "anchor annealing" technique for resolving anchors and projecting them with transitive anchoring. It then stitches together the alignment between the anchors using the methods described above.
The included GUI, MAD (Multiple Alignment Display), can display the intermediate alignments produced by FSA, where each character is colored according to the probability that it is correctly aligned (see the picture and movie at the top of the page).

You can see more information on the FAQ.

Download and Installation

FSA is an open-source project hosted by SourceForge. You can download the latest version from the SourceForge project page.

FSA is built and installed by running the following commands:

tar xvzf fsa-X.X.X.tar.gz cd fsa-X.X.X ./configure make make install

(Substitute fsa-X.X.X.tar.gz with the name of the file that you downloaded.) The FSA executables can then be found in your system's standard binary directory (e.g., /usr/local/bin). To install to other locations, see the FAQ. Alternatively, you may just run FSA from the src/main subdirectory in which it is built (which does not require running the make install step)

If you wish to align long sequences, then you must download and install MUMmer, which FSA calls to get candidate anchors between sequences. When running ./configure, either have the MUMmer executable in your path or specify the executable with the --with-mummer option to configure. See the included README and FAQ for more information.

Please contact us if you have any build problems.

Webserver

You can submit alignment jobs to the FSA webserver. Be aware that the webserver may reject alignment jobs which contain many (> 100) sequences due to computational limitations. If you wish to align many sequences, then please download and install FSA in order to run the alignment on your personal computer.

Applications

FSA can be used for all alignment problems, including:

Detailed analysis of a single family of proteins or RNAs.
Large-scale alignment of thousands of sequences (use the --fast option).
Genome alignment of megabases of orthologous sequences.

Citation

Please cite:
Bradley RK, Roberts A, Smoot M, Juvekar S, Do J, Dewey C, Holmes I, Pachter L (2009) Fast Statistical Alignment. PLoS Computational Biology. 5:e1000392.

The FSA manuscript can also be found in the doc/ directory of the FSA source code distribution.

Contact

Please contact us at fsa@math.berkeley.edu with any questions or feedback.

References

Please see:

[1] I. Holmes and R. Durbin. Dynamic Programming Alignment Accuracy. Journal of Computational Biology. 1998, 5 (3):493-504.

[2] G.A. Lunter. HMMoC - a Compiler for Hidden Markov Models. Bioinformatics. 2007, 23 (18):2485-2487.

[3] A.S. Schwartz. Posterior Decoding Methods for Optimization and Control of Multiple Alignments. Ph.D. Thesis, UC Berkeley. 2007.

[4] A.S. Schwartz and L. Pachter. Multiple Alignment by Sequence Annealing. Bioinformatics. 2007, 23 (2):e24-e29.

[5] N. Bray and L. Pachter. MAVID: Constrained Ancestral Alignment of Multiple Sequences. Genome Research. 2004, 14:693-699.

The SIV sequence data in the image and movie is from:

[6] B. D. Redelings and M. A. Suchard. Incorporating indel information into phylogeny estimation for rapidly emerging pathogens. BMC Evolutionary Biology. 2007, 7:40.

Acknowledgements

FSA was created by Robert Bradley. It was developed by Robert Bradley, Colin Dewey, Jaeyoung Do, Sudeep Juvekar, Lior Pachter, Adam Roberts, and Michael Smoot, along with assistance from many other people. All have made intellectual contributions and contributed code.

We give our heartfelt thanks to SourceForge for hosting this project.