LEAFF User's Guide
Utility for sequence indexing, manipulation and retrieval.
Described in the publication: B. Walenz and L. Florea (2011) Sim4db and leaff: Utilities for fast batch spliced alignment and sequence indexing, Bioinformatics, in press.
What is LEAFF?
The kmer project uses a compressed binary format to represent fasta sequences, called seqStore, to speed up sequence access and retrieval. Most kmer tools can read sequences from multiple formats, including multi-fasta and seqStore files. LEAFF (Let's Extract Anything From Fasta) is a utility program for working with such sequences. It provides random access to sequences at base level, as well as several analysis functions.
Command line usage
LEAFF can be used as a stand-alone program, and also as a library of routines that can be incorporated into other programs. This page describes the uses of LEAFF as a stand-alone sequence manipulation tool. An example of how you can develop your own program using the LEAFF library is here.
Basic sequence indexing and retrieval
Usage: leaff [-f fasta-file] [options]

SOURCE FILES
   -f file:     use sequence in 'file' (-F is also allowed for historical reasons)
   -A file:     read actions from 'file'

SOURCE FILE EXAMINATION
   -d:          print the number of sequences in the fasta
   -i name:     print an index, labelling the source 'name'

OUTPUT OPTIONS
   -6 <#>:      insert a newline every 60 letters
                (if the next arg is a number, newlines are inserted every
                n letters, e.g., -6 80. Disable line breaks with -6 0,
                or just don't use -6!)
   -e beg end:  Print only the bases from position 'beg' to position 'end'
                (space based, relative to the FORWARD sequence!) If
                beg == end, then the entire sequence is printed. It is an
                error to specify beg > end, or beg > len, or end > len.
   -ends n      Print n bases from each end of the sequence. One input
                sequence generates two output sequences, with '_5' or '_3'
                appended to the ID. If 2n >= length of the sequence, the
                sequence itself is printed, no ends are extracted (they
                overlap).
   -C:          complement the sequences
   -H:          DON'T print the defline
   -h:          Use the next word as the defline ("-H -H" will reset to the
                original defline)
   -R:          reverse the sequences
   -u:          uppercase all bases

SEQUENCE SELECTION
   -G n s l:    print n randomly generated sequences, 0 < s <= length <= l
   -L s l:      print all sequences such that s <= length < l
   -N l h:      print all sequences such that l <= % N composition < h
                (NOTE 0.0 <= l < h < 100.0)
                (NOTE that you cannot print sequences with 100% N.
                This is a useful bug.)
   -q file:     print sequences from the seqid list in 'file'
   -r num:      print 'num' randomly picked sequences
   -s seqid:    print the single sequence 'seqid'
   -S f l:      print all the sequences from ID 'f' to 'l' (inclusive)
   -W:          print all sequences (do the whole file)

LONGER HELP
   -help analysis
   -help examples
Options are ORDER DEPENDENT. Sequences are printed whenever a SEQUENCE SELECTION option occurs on the command line. OUTPUT OPTIONS are not reset when a sequence is printed.
SEQUENCES are numbered starting at ZERO, not one.
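The space-based coordinates used by -e, and the overlap rule of -ends, can be illustrated with a short Python sketch. This is only a model of the behaviour described in the usage text, not LEAFF's own code: the function names extract and ends are hypothetical, and returning the '_3' end in forward orientation is an assumption.

```python
def extract(seq: str, beg: int, end: int) -> str:
    """Model of '-e beg end' with space-based coordinates.

    Space-based positions count the gaps between bases, so 'beg' and
    'end' behave like Python slice bounds.  beg == end selects the
    whole sequence, as the usage text states.
    """
    if beg > end or beg > len(seq) or end > len(seq):
        raise ValueError("need beg <= end, beg <= len and end <= len")
    if beg == end:
        return seq
    return seq[beg:end]


def ends(seq_id: str, seq: str, n: int) -> list:
    """Model of '-ends n': returns a list of (id, bases) pairs.

    ASSUMPTION: the '_3' piece is returned in forward orientation.
    """
    if 2 * n >= len(seq):
        # The two ends would overlap, so the sequence is kept whole.
        return [(seq_id, seq)]
    return [(seq_id + "_5", seq[:n]),
            (seq_id + "_3", seq[len(seq) - n:])]
```

Under this model, extract(seq, 0, 10) yields the first 10 bases, which matches the "-e 0 10" examples in this guide.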
- Print the first 10 bases of the fourth sequence in file 'genes':
-f genes -e 0 10 -s 3
- Print the first 10 bases of the fourth and fifth sequences:
-f genes -e 0 10 -s 3 -s 4
- Print the fourth and fifth sequences reverse complemented, and the sixth sequence forward. The second set of -R -C toggles off reverse-complement:
-f genes -R -C -s 3 -s 4 -R -C -s 5
- Convert file 'genes' to a seqStore 'genes.seqStore'. The seqStore provides better performance with the kmer tools.
-f genes --seqstore genes.seqStore
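The order dependence shown in the examples above can be modelled in a few lines of Python. The sketch below is a toy interpreter, not LEAFF's implementation: the function name run is hypothetical, only -e, -R, -C and -s are handled, and the -f file is replaced by an in-memory list of sequences. It assumes that -e is applied to the forward sequence before any reverse or complement step, which is one reading of "relative to the FORWARD sequence".

```python
# Complement table for the four bases, preserving case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def run(args, seqs):
    """Process leaff-style arguments left to right (toy model).

    'seqs' is a list of sequence strings indexed from zero.
    Returns the list of sequences that would be printed.
    """
    reverse = complement = False
    beg = end = 0                      # beg == end means "whole sequence"
    printed = []
    i = 0
    while i < len(args):
        opt = args[i]
        if opt == "-R":
            reverse = not reverse      # a second -R toggles it off
        elif opt == "-C":
            complement = not complement
        elif opt == "-e":
            beg, end = int(args[i + 1]), int(args[i + 2])
            i += 2
        elif opt == "-s":
            s = seqs[int(args[i + 1])]
            i += 1
            if beg != end:
                s = s[beg:end]         # ASSUMPTION: cut the forward sequence first
            if complement:
                s = s.translate(COMPLEMENT)
            if reverse:
                s = s[::-1]
            printed.append(s)          # output options are NOT reset here
        else:
            raise ValueError(f"option not modelled: {opt}")
        i += 1
    return printed
```

Calling run("-R -C -s 3 -s 4 -R -C -s 5".split(), seqs) returns the fourth and fifth sequences reverse complemented and the sixth one forward, because the output state persists between the -s options until it is toggled again.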
In addition to sequence access and base-level manipulation, LEAFF performs several sequence analysis functions. Among the most useful are: i) partition the sequences into roughly equal size pieces (option --partition); ii) segment a multi-fasta file into multiple smaller files, each with an equal number of sequences, except perhaps for the last (option --segment); iii) generate new sequences from a multi-fasta file by incorporating simulated sequencing errors, for use in testing alignment programs (option --errors); and iv) convert an input multi-fasta file into a seqStore file (option --seqstore).
The command line options for these and other functions are given below:
--findduplicates a.fasta
   Reports sequences that are present more than once. Output is a list
   of pairs of deflines, separated by a newline.

--mapduplicates a.fasta b.fasta
   Builds a map of IIDs from a.fasta and b.fasta that have identical
   sequences. Format is "IIDa <-> IIDb"

--md5 a.fasta:
   Don't print the sequence, but print the md5 checksum (of the entire
   sequence) followed by the entire defline.

--partition prefix [ n[gmk]bp | n ] a.fasta
--partitionmap [ n[gmk]bp | n ] a.fasta
   Partition the sequences into roughly equal size pieces of size nbp,
   nkbp, nmbp or ngbp; or into n roughly equal sized partitions.
   Sequences larger than the partition size are in a partition by
   themselves. --partitionmap writes a description of the partition to
   stdout; --partition creates a fasta file 'prefix-###.fasta' for each
   partition.
   Example: -F some.fasta --partition parts 130mbp
            -F some.fasta --partition parts 16

--segment prefix n a.fasta
   Splits the sequences into n files, prefix-###.fasta. Sequences are
   not reordered; the first n sequences are in the first file, the next
   n in the second file, etc.

--gccontent a.fasta
   Reports the GC content over a sliding window of 3, 5, 11, 51, 101,
   201, 501, 1001, 2001 bp.

--testindex a.fasta
   Test the index of 'file'. If index is up-to-date, leaff exits
   successfully, else, leaff exits with code 1. If an index file is
   supplied, that one is tested, otherwise, the default index file name
   is used.

--dumpblocks a.fasta
   Generates a list of the blocks of N and non-N. Output format is
   'base seq# beg end len'. 'N 84 483 485 2' means that a block of 2 N's
   starts at space-based position 483 in sequence ordinal 84. A '.' is
   the end of sequence marker.

--errors L N C P a.fasta
   For every sequence in the input file, generate new sequences
   including simulated sequencing errors.
   L -- length of the new sequence. If zero, the length of the
        original sequence will be used.
   N -- number of subsequences to generate. If L=0, all subsequences
        will be the same, and you should use C instead.
   C -- number of copies to generate. Each of the N subsequences will
        have C copies, each with different errors.
   P -- probability of an error.

   HINT: to simulate ESTs from genes, use L=500, N=10, C=10
         -- make C=10 sequencer runs of N=10 EST sequences
            of length 500bp each.
         to simulate mRNA from genes, use L=0, N=10, C=10
         to simulate reads from genomes, use L=800, N=10, C=1
         -- of course, N= should be increased to give the
            appropriate depth of coverage

--stats a.fasta
   Reports size statistics; number, N50, sum, largest.

--seqstore out.seqStore
   Converts the input file (-f) to a seqStore file.
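For readers unfamiliar with the N50 value reported by --stats, the Python sketch below computes the four size statistics from a list of sequence lengths. It uses the common definition of N50 (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of all bases). Whether LEAFF resolves ties and rounding in exactly this way is an assumption, and the function name size_stats is hypothetical.

```python
def size_stats(lengths):
    """Return (number, N50, sum, largest) for a list of sequence lengths.

    ASSUMPTION: N50 follows the common definition; LEAFF's exact
    convention is not stated in this guide.
    """
    if not lengths:
        return (0, 0, 0, 0)
    total = sum(lengths)
    running = 0
    n50 = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            n50 = length
            break
    return (len(lengths), n50, total, max(lengths))
```

For lengths 2, 3, 4, 5 and 6 the total is 20 bases; the two longest sequences (6 and 5) already cover 11 of them, so the N50 under this definition is 5.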
This document is up to date, as of revision 1813.