LEAFF User's Guide

Utility for sequence indexing, manipulation and retrieval.

Described in the publication: B. Walenz and L. Florea (2011) Sim4db and leaff: Utilities for fast batch spliced alignment and sequence indexing, Bioinformatics, in press.

What is LEAFF?

The kmer project uses a compressed binary format to represent fasta sequences, called seqStore, to speed up sequence access and retrieval. Most kmer tools can read sequences from multiple formats, including multi-fasta and seqStore files. LEAFF (Let's Extract Anything From Fasta) is a utility program for working with such sequences. It provides random access to sequences at base level, as well as several analysis functions.

Command line usage

LEAFF can be used as a stand-alone program, and also as a library of routines that can be incorporated into other programs. This page describes the uses of LEAFF as a stand-alone sequence manipulation tool. An example of how you can develop your own program using the LEAFF library is here.

Basic sequence indexing and retrieval

Usage: leaff [-f fasta-file] [options]

   -f file:     use sequence in 'file' (-F is also allowed for historical reasons)
   -A file:     read actions from 'file'

   -d:          print the number of sequences in the fasta
   -i name:     print an index, labelling the source 'name'

   -6 <#>:      insert a newline every 60 letters
                  (if the next arg is a number, newlines are inserted every
                  n letters, e.g., -6 80.  Disable line breaks with -6 0,
                  or just don't use -6!)
   -e beg end:  Print only the bases from position 'beg' to position 'end'
                  (space based, relative to the FORWARD sequence!)  If
                  beg == end, then the entire sequence is printed.  It is an
                  error to specify beg > end, or beg > len, or end > len.
   -ends n:     Print n bases from each end of the sequence.  One input
                  sequence generates two output sequences, with '_5' or '_3'
                  appended to the ID.  If 2n >= length of the sequence, the
                  sequence itself is printed, no ends are extracted (they
                  overlap).
   -C:          complement the sequences
   -H:          DON'T print the defline
   -h:          Use the next word as the defline ("-H -H" will reset to the
                  original defline).
   -R:          reverse the sequences
   -u:          uppercase all bases

   -G n s l:    print n randomly generated sequences, 0 < s <= length <= l
   -L s l:      print all sequences such that s <= length < l
   -N l h:      print all sequences such that l <= % N composition < h
                  (NOTE 0.0 <= l < h < 100.0)
                  (NOTE that you cannot print sequences with 100% N
                   This is a useful bug).
   -q file:     print sequences from the seqid list in 'file'
   -r num:      print 'num' randomly picked sequences
   -s seqid:    print the single sequence 'seqid'
   -S f l:      print all the sequences from ID 'f' to 'l' (inclusive)
   -W:          print all sequences (do the whole file)

   -help analysis
   -help examples

Options are ORDER DEPENDENT. Sequences are printed whenever a SEQUENCE SELECTION option occurs on the command line. OUTPUT OPTIONS are not reset when a sequence is printed.

SEQUENCES are numbered starting at ZERO, not one.
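
The coordinate rules for -e can be illustrated with a short Python sketch. It is not LEAFF's code; the function name `extract` and the sample sequences are hypothetical, and the sketch assumes that "space based" means positions count the gaps between bases, so the interval from beg to end holds end - beg bases (the same convention as a Python slice).

```python
def extract(seq: str, beg: int, end: int) -> str:
    """Return the bases in the space-based interval [beg, end) of the forward sequence."""
    # The help text calls beg > end, beg > len and end > len errors.
    if beg > end or beg > len(seq) or end > len(seq):
        raise ValueError("need beg <= end <= len(seq)")
    # Documented special case: beg == end prints the entire sequence.
    if beg == end:
        return seq
    return seq[beg:end]


# Hypothetical two-sequence file; sequences are numbered from zero.
sequences = ["ACGTACGTACGTAC", "TTTTGGGGCCCCAAAA"]

# First 10 bases of the second sequence (index 1), like "-e 0 10 -s 1".
print(extract(sequences[1], 0, 10))  # TTTTGGGGCC
```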


  • Print the first 10 bases of the fourth sequence in file 'genes':
-f genes -e 0 10 -s 3
  • Print the first 10 bases of the fourth and fifth sequences:
-f genes -e 0 10 -s 3 -s 4
  • Print the fourth and fifth sequences reverse complemented, and the sixth sequence forward. The second set of -R -C toggle off reverse-complement:
-f genes -R -C -s 3 -s 4 -R -C -s 5
  • Convert file 'genes' to a seqStore 'genes.seqStore'. The seqStore provides better performance with the kmer tools.
-f genes --seqstore genes.seqStore
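
The order-dependent behaviour in the third example can be modelled with a small Python sketch. This is only a model of the documented behaviour, not LEAFF's implementation: the function `run` and the sample sequences are hypothetical, only -R, -C and -s are handled, and -s is assumed to take a zero-based sequence number as in the examples above.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def run(args, sequences):
    """Process options left to right; output options persist until toggled again."""
    reverse = False
    complement = False
    printed = []
    i = 0
    while i < len(args):
        opt = args[i]
        if opt == "-R":
            reverse = not reverse          # toggle, as in the third example
        elif opt == "-C":
            complement = not complement    # toggle
        elif opt == "-s":
            i += 1
            seq = sequences[int(args[i])]  # sequences are numbered from zero
            if complement:
                seq = seq.translate(COMPLEMENT)
            if reverse:
                seq = seq[::-1]
            printed.append(seq)            # a selection option prints immediately
        else:
            raise ValueError(f"option not handled in this sketch: {opt}")
        i += 1
    return printed


genes = ["AAAA", "CCCC", "GGGG", "AACC", "ACGT", "AAGG"]
print(run("-R -C -s 3 -s 4 -R -C -s 5".split(), genes))
# ['GGTT', 'ACGT', 'AAGG']
```

The fourth and fifth sequences come out reverse complemented, and the sixth comes out unchanged because the second -R -C pair switches both settings off again.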

Analysis Functions

In addition to sequence access and base-level manipulation, LEAFF performs several sequence analysis functions. Among the most useful are: i) partition a sequence file into pieces of roughly equal size (option --partition); ii) segment a multi-fasta file into multiple smaller files, each with an equal number of sequences, except perhaps for the last (option --segment); iii) generate new sequences from a multi-fasta file by incorporating simulated sequencing errors, for use in testing alignment programs (option --errors); and iv) convert an input multi-fasta file into a seqStore file (option --seqstore).
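
As a rough illustration of size-based partitioning, the Python sketch below groups sequences by length so that no partition exceeds a size limit, with any sequence longer than the limit placed in a partition of its own. The first-fit strategy and the name `partition_by_size` are assumptions made for the example; the algorithm LEAFF actually uses for --partition is not described on this page and may differ.

```python
def partition_by_size(lengths, max_bp):
    """Group sequence indices into partitions of at most max_bp bases (first-fit)."""
    partitions = []  # list of lists of sequence indices
    totals = []      # total bases currently in each partition
    for idx, n in enumerate(lengths):
        if n > max_bp:
            # A sequence larger than the limit gets a partition to itself.
            partitions.append([idx])
            totals.append(n)
            continue
        for p, total in enumerate(totals):
            if total + n <= max_bp:
                partitions[p].append(idx)
                totals[p] += n
                break
        else:
            # No existing partition has room, so open a new one.
            partitions.append([idx])
            totals.append(n)
    return partitions


# Five hypothetical sequences with these lengths, limit of 100 bp per partition.
print(partition_by_size([50, 120, 30, 60, 40], 100))
# [[0, 2], [1], [3, 4]]
```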

The command line options for these and other functions are given below:

   --findduplicates a.fasta
                Reports sequences that are present more than once.  Output
                is a list of pairs of deflines, separated by a newline.

   --mapduplicates a.fasta b.fasta
                Builds a map of IIDs from a.fasta and b.fasta that have
                identical sequences.  Format is "IIDa <-> IIDb"

   --md5 a.fasta:
                Don't print the sequence, but print the md5 checksum
                (of the entire sequence) followed by the entire defline.

   --partition     prefix [ n[gmk]bp | n ] a.fasta
   --partitionmap         [ n[gmk]bp | n ] a.fasta
                Partition the sequences into roughly equal size pieces of
                size nbp, nkbp, nmbp or ngbp; or into n roughly equal sized
                partitions.  Sequences larger than the partition size are
                in a partition by themselves.  --partitionmap writes a
                description of the partition to stdout; --partition creates
                a fasta file 'prefix-###.fasta' for each partition.
                Example: -F some.fasta --partition parts 130mbp
                         -F some.fasta --partition parts 16

   --segment prefix n a.fasta
                Splits the sequences into n files, prefix-###.fasta.
                Sequences are not reordered; the first n sequences are in
                the first file, the next n in the second file, etc.

   --gccontent a.fasta
                Reports the GC content over a sliding window of
                3, 5, 11, 51, 101, 201, 501, 1001, 2001 bp.

   --testindex a.fasta
                Test the index of 'file'.  If index is up-to-date, leaff
                exits successfully, else, leaff exits with code 1.  If an
                index file is supplied, that one is tested, otherwise, the
                default index file name is used.

   --dumpblocks a.fasta
                Generates a list of the blocks of N and non-N.  Output
                format is 'base seq# beg end len'.  'N 84 483 485 2' means
                that a block of 2 N's starts at space-based position 483
                in sequence ordinal 84.  A '.' marks the end of a sequence.

   --errors L N C P a.fasta
                For every sequence in the input file, generate new
                sequences including simulated sequencing errors.
                L -- length of the new sequence.  If zero, the length
                     of the original sequence will be used.
                N -- number of subsequences to generate.  If L=0, all
                     subsequences will be the same, and you should use
                     C instead.
                C -- number of copies to generate.  Each of the N
                     subsequences will have C copies, each with different
                     errors.
                P -- probability of an error.

                HINT: to simulate ESTs from genes, use L=500, N=10, C=10
                         -- make C=10 sequencer runs of N=10 EST sequences
                            of length 500bp each.
                      to simulate mRNA from genes, use L=0, N=10, C=10
                      to simulate reads from genomes, use L=800, N=10, C=1
                         -- of course, N should be increased to give the
                            appropriate depth of coverage

   --stats a.fasta
                Reports size statistics; number, N50, sum, largest.

   --seqstore out.seqStore
                Converts the input file (-f) to a seqStore file.
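
The 'base seq# beg end len' format documented for --dumpblocks can be illustrated with a short Python sketch that reports only the blocks of N. It is not LEAFF's code: the function name `n_blocks` is hypothetical, lowercase 'n' is assumed to count as N, and the lines for non-N blocks and the '.' end-of-sequence marker are left out because their exact layout is not given here.

```python
import re


def n_blocks(seq_index, seq):
    """Return one 'N seq# beg end len' line per run of N, using space-based positions."""
    lines = []
    for match in re.finditer(r"[Nn]+", seq):
        beg, end = match.start(), match.end()
        lines.append(f"N {seq_index} {beg} {end} {end - beg}")
    return lines


# Hypothetical sequence with ordinal 84 containing two runs of N.
for line in n_blocks(84, "ACGTNNACNNNG"):
    print(line)
# N 84 4 6 2
# N 84 8 11 3
```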

This document is up to date, as of revision 1813.