Getting Started with Sim4db

From kmer
Jump to: navigation, search

Batch spliced alignment of cDNA (EST, mRNA) sequences to a target genome, of the same or a related species.

Described in the publication: B. Walenz, L. Florea (2011) Sim4db and leaff: Utilities for fast batch spliced alignment and sequence indexing, Bioinformatics, in press.


What is sim4db?

Sim4db performs fast batch alignment of large cDNA (EST, mRNA) sequence sets to a set of eukaryotic genomic regions. It uses the sim4 and sim4cc algorithms to determine the alignments, but incorporates a fast sequence indexing and retrieval mechanism, implemented in the sister package leaff, to speedily process large volumes of sequences.


While sim4db produces alignments in the same way as sim4 or sim4cc, it has additional features to make it more amenable for use with whole-genome annotation pipelines. A script file can be used to group pairings between cDNAs and their corresponding genomic regions, to be aligned as one run and using the same set of parameters. Sim4db also optionally reports more than one alignments for the same cDNA within a genomic region, as long as they meet user-defined criteria such as minimum length, percentage sequence identity or coverage. This feature is instrumental in finding all alignments of a gene family at one locus. Lastly, the output is presented either as custom sim4db alignments or as GFF3 gene features.

Command line usage

A simple command line invocation:

sim4db -genomic g.fasta -cdna c.fasta -scr script -output o.sim4db

where:
   - 'c.fasta' and 'g.fasta' are the multi-fasta cDNA and genome sequence files
   - 'script' is a script file indicating individual alignments to be computed
   - output in sim4db format will be sent to the file 'o.sim4db' ('-' for standard output)

A more complex invocation:

sim4db -genomic g.fasta -cdna c.fasta -output o.sim4db [options]

Salient options:
       -cdna         use these cDNA sequences (multi-fasta file)
       -genomic      use these genomic sequences (multi-fasta file)
       -script       use this script file
       -pairwise     sequentially align pairs of sequences

                     If none of the '-script' and '-pairwise' options
                     is specified, sim4db performs all-against-all
                     alignments between pairs of cDNA and genomic sequences.

       -output       write output to this file
       -gff3         report output in GFF3 format
       -interspecies use sim4cc for inter-species alignments (default sim4)

Filter options:
       -mincoverage  iteratively find all exon models with the specified
                     minimum PERCENT COVERAGE
       -minidentity  iteratively find all exon models with the specified
                     minimum PERCENT EXON IDENTITY
       -minlength    iteratively find all exon models with the specified
                     minimum ABSOLUTE COVERAGE (number of bp aligned)
                     (default 0)
       -alwaysreport always report <number> exon models, even if they
                     are below the quality thresholds
  • If no mincoverage or minidentity or minlength is given, only the best exon model is returned. This is the DEFAULT operation.
  • You will probably want to specify ALL THREE of mincoverage, minidentity and minlength! Don't assume the default values are what you want!
  • You will DEFINITELY want to specify at least one of mincoverage, minidentity and minlength with '-alwaysreport'! If you don't, mincoverage will be set to 90 and minidentity to 95 -- to reduce the number of spurious matches when a good match is found.
Auxiliary options:
       -nodeflines   don't include the deflines in the sim4db output
       -alignments   print alignments

       -polytails    DON'T mask poly-A and poly-T tails
       -cut          trim marginal exons if A/T % > x (poly-AT tails)

       -noncanonical don't force canonical splice sites
       -splicemodel  use the following splice model: 0 - original sim4;
                     1 - GeneSplicer; 2 - Glimmer;  options 1 and 2 are
                     only available with '-interspecies'.
                     Default for sim4 is 0, and for sim4cc is 1.

       -forcestrand  Force the strand prediction to always be
                     one of 'forward' or 'reverse'

Execution options:
       -threads      Use n threads.
       -touch        create this file when the program finishes execution

Debugging options:
       -v            print status to stderr while running
       -V            print script lines (stderr) as they are being processed

Developer options:
       -Z            set the spaced seed pattern

       -H            set the relink weight factor (H=1000 recommended for mRNAs)
       -K            set the first MSP threshold
       -C            set the second MSP threshold
       -Ma           set the limit of the number of MSPs allowed
       -Mp           same, as percentage of bases in cDNA
                     NOTE:  If used, both -Ma and -Mp must be specified!

Input/Output

For a typical run, sim4db takes as input two multi-fasta files containing the cDNAs and the genome, respectively, and optionally a script describing a set of pairings among the sequences. Alignments are determined using the program sim4 (default) for same-species comparisons, or sim4cc when the '-interspecies' option is set.


The output is reported in the compact sim4db format (default), or in GFF3 format with the '-gff3' option. Utilities for filtering, merging, sorting and processing polishes in these formats, and for converting between the two formats (lossy), are included with the package and described below.

The input script file format

[-f|-r] -e ESTidx -D GENidx GENlo GENhi
where:
cDNAidx        internal index of the cDNA in the input cDNA fasta file
               (0..#cDNAseqs-1)
GENidx         internal index of the genomic sequence in the input genome
               file (0..#GENseqs-1)
-f             use the cDNA sequence as is
-r             use the reverse complement of the cDNA sequence
GENlo, GENhi   begin and end coordinates of the target genomic region;
               coordinates are 0-based

Example:

 -f -e 61728 -D 0 2370482 2375224
 -r -e 61730 -D 0 6723331 6757701
 -r -e 61734 -D 1 8428517 8432981
 -f -e 61736 -D 3 4600260 4637694

For best performance, the script should be sorted by genomic sequence index.


The sim4db output format

sim4begin
cDNAidx[cDNAlen-pA-pT] GENidx[GENoff-GENlen] <M-N-O-P-S>
edef=cDNA defline
ddef=genomic defline
cDNAbgn1-cDNAend1 (GENbgn1-GENend1) <M-N-P> intronOri
cDNAbgn2-cDNAend2 (GENbgn2-GENend2) <M-N-P> intronOri
...
cDNAbgnn-cDNAendn (GENbgnn-GENendn) <M-N-P> intronOri
cDNA alignment sequence for exon #1
genomic alignment sequence for exon #1
cDNA alignment sequence for exon #2
genomic alignment sequence for exon #2
...
cDNA alignment sequence for exon #n
genomic alignment sequence for exon #n
sim4end

where:
cDNAidx     internal index of the cDNA in the input cDNA fasta file
cDNAlen     length of the cDNA sequence
pA(T)wi     length of polyA(T) tail detected and masked
GENidx      internal index of the genomic sequence in the genome fasta file
GENoff      offset to the beginning of the genomic region containing the signal
GENlen      length of the genomic region containing the signal
M           number of nucleotide matches in the alignment
N           number of matching N's in the alignment
P           percent sequence identity of the alignment
O           match orientation:


    * forward    - the cDNA sequence aligns to the genomic sequence directly
    * complement - the reverse complement of the cDNA sequence matches
                   the genomic sequence; this is the equivalent of the
                   sim4 '(complement)' output line

S           strand predicted based on the splice signals and alignment quality:

    * forward    - predicted forward strand
    * reverse    - predicted reverse strand
    * unknown    - strand unknown (because of low alignment quality,
                   single exon match, or weak splice signals)

cDNAbgni    start position of exon i in the cDNA sequence
cDNAendi    end position of exon i in the cDNA sequence
GENbgni     start position of exon i in the genomic sequence (interval GENlo-GENhi)
GENendi     end position of exon i in the genomic sequence (interval GENlo-GENhi)
M           number of nucleotide matches in the alignment
N           number of matching N's in the alignment
P           percent sequence identity of the alignment
intronOri   predicted orientation of the intron:

    * -> forward (i.e., GT-AG-like splice signals)
    * <- reverse (i.e., CT-AC-like splice signals)
    * -- ambiguous
    * == gap (unaligned portion) in the cDNA sequence

Exon coordinates are nucleotide based, starting from 1. Genomic coordinates are always in the original sequence, while the cDNA coordinates will refer to positions in the reverse complement of the sequence if the match orientation is indicated as 'complement'.


Lowercase letters in the alignment lines indicate positions with matching nucleotides, '-' indicate gaps in the corresponding sequence, and uppercase letters mark either substitutions, or gaps in the other sequence. Alignments are OPTIONAL.


Example:

sim4begin
61728[685-0-21] 0[2370482-4742] <651-0-97-forward-reverse>
edef=gb|CA807305 D. melanogaster cDNA 3' similar to CT12127, mRNA sequence
ddef=arm_2L
22-337 (2372455-2372770) <313-0-99> <-
338-584 (2372830-2373076) <238-0-95> <-
585-685 (2373134-2373234) <100-0-99>
gtaaaaaTttctgtttatta...gggcgaccagaagtcaatcag
gtaaaaaGttctgtttatta...gggcgaccagaagtcaatcag
ggtaacttgtccttGggtgc...ccacaccgGctccca-ttcgcgtAtc
ggtaacttgtccttTggtgc...ccacaccgCctcccaGttcgcgtTtc
tgcaagcggtcgacatgagg...cttaaAgcgctggta
tgcaagcggtcgacatgagg...cttaaCgcgctggta
sim4end

The GFF3 output format

The same example as before:

0:arm_2L        sim4db  mRNA    2372455 2373234 97      -       .       ID=sim4db10;Name=61728:gb|CA807305;Target=61728:gb|CA807305 22 685 +;pA=0;pT=21;genRegion=2370482-2375224
0:arm_2L        sim4db  exon    2372455 2372770 99      -       .       Parent=sim4db10;Target=61728:gb|CA807305 22 337 +;Gap=M316;nMatches=313;intron=<-
0:arm_2L        sim4db  exon    2372830 2373076 95      -       .       Parent=sim4db10;Target=61728:gb|CA807305 338 584 +;Gap=M74 D1 M2 I1 M160 D1 M10;nMatches=238;intron=<-
0:arm_2L        sim4db  exon    2373134 2373234 99      -       .       Parent=sim4db10;Target=61728:gb|CA807305 585 685 +;Gap=M101;nMatches=100

(Columns are tab-separated.)

An example

Here is a small sample data set for testing sim4db:

  1. Download the (mouse) mRNA sequences, and gunzip
  2. Download the (human) genomic regions, and gunzip
  3. Dowload the script file
  4. Run sim4db:
   sim4db -cdna sim4db.example.cdna.fasta \
          -gen sim4db.example.genomic.fasta \
          -scr sim4db.example.script -interspecies -output out.sim4db

or, to align pairs of sequences (10 pairs) in the order in which they are listed in the fasta files:

   sim4db -cdna sim4db.example.cdna.fasta \
          -gen sim4db.example.genomic.fasta \
          -pairwise -interspecies -output out.sim4db

Done!


Interpretation

The commands above should produce ten alignments, one for each pair of sequences in the script.

The first of these alignments is:

sim4begin
0[1287-12-0] 0[25000-75000] <974-0-84-complement-reverse>
edef=gi|74316016|ref|NM_026831.1| Mus musculus myosin binding protein H-like (Mybphl), mRNA
ddef=NM_001010985.1, start=109636510, end=109651186
143-331 (52744-52932) <165-0-87> <-
332-468 (53869-54005) <123-0-90> <-
469-628 (54418-54577) <132-0-83> <-
629-768 (54685-54824) <126-0-90> <-
769-964 (55057-55252) <166-0-85> <-
965-1053 (55804-55892) <82-0-92> <-
1054-1287 (64482-64725) <180-0-72>
sim4end

Header line:

This sim4db record indicates that the reverse complement (complement) of the first cDNA sequence in the file (cDNA index 0) matches the first genomic sequence (genomic index 0), within the genomic interval 25,001-100,001 (base coordinates starting at 1). The cDNA sequence is 1,287 bp long, of which the last 12 bp are likely polyA tail bases. The alignment has 974 nucleotide matches, and 84% percent sequence identity. The gene is predicted to be on the reverse strand, based on the observed splice signals.

Fasta header lines:

The edef and ddef lines show the fasta headers of the cDNA and genomic sequences, respectively.

Exon lines:

The alignment contains seven exons. The first (3'-most) exon, for instance, starts at position 52,744 and ends at position 52,932 of the (full) genomic sequence. Because the match is with the reverse complement of the cDNA, the matching cDNA region is 1,287-331+1=957 through 1,287-143+1=1,145. The exonic alignment has 87% sequence identity, with 165 nucleotide matches. All introns are pointing in the reverse orientation (<-), suppporting the reverse strand prediction above.

Affiliated tools

The sim4dbutils package contains a range of utilities to work with sim4db-generated alignment files, of particular note being:


  • convertPolishes - convert between the two formats. With GFF3->sim4db conversion, alignments will be lost.
  • filterPolishes - filter alignments based on minimum percentage sequence identity, coverage and length.
  • mergePolishes - merge alignments from multiple files (also concatenates the cDNA fasta files)
  • sortPolishes - sort alignments by cDNA or genomic sequence index, using a limited amount of memory if needed.


More information can be found at Sim4db Files.