Getting Started with Sim4db
Batch spliced alignment of cDNA (EST, mRNA) sequences to a target genome, of the same or a related species.
Described in the publication: B. Walenz, L. Florea (2011) Sim4db and leaff: Utilities for fast batch spliced alignment and sequence indexing, Bioinformatics, in press.
What is sim4db?
Sim4db performs fast batch alignment of large cDNA (EST, mRNA) sequence sets to a set of eukaryotic genomic regions. It uses the sim4 and sim4cc algorithms to determine the alignments, but incorporates a fast sequence indexing and retrieval mechanism, implemented in the sister package leaff, to speedily process large volumes of sequences.
While sim4db produces alignments in the same way as sim4 or sim4cc, it has additional features to make it more amenable for use with whole-genome annotation pipelines. A script file can be used to group pairings between cDNAs and their corresponding genomic regions, to be aligned as one run and using the same set of parameters. Sim4db also optionally reports more than one alignments for the same cDNA within a genomic region, as long as they meet user-defined criteria such as minimum length, percentage sequence identity or coverage. This feature is instrumental in finding all alignments of a gene family at one locus. Lastly, the output is presented either as custom sim4db alignments or as GFF3 gene features.
Command line usage
A simple command line invocation:
sim4db -genomic g.fasta -cdna c.fasta -scr script -output o.sim4db where: - 'c.fasta' and 'g.fasta' are the multi-fasta cDNA and genome sequence files - 'script' is a script file indicating individual alignments to be computed - output in sim4db format will be sent to the file 'o.sim4db' ('-' for standard output)
A more complex invocation:
sim4db -genomic g.fasta -cdna c.fasta -output o.sim4db [options] Salient options: -cdna use these cDNA sequences (multi-fasta file) -genomic use these genomic sequences (multi-fasta file) -script use this script file -pairwise sequentially align pairs of sequences If none of the '-script' and '-pairwise' options is specified, sim4db performs all-against-all alignments between pairs of cDNA and genomic sequences. -output write output to this file -gff3 report output in GFF3 format -interspecies use sim4cc for inter-species alignments (default sim4) Filter options: -mincoverage iteratively find all exon models with the specified minimum PERCENT COVERAGE -minidentity iteratively find all exon models with the specified minimum PERCENT EXON IDENTITY -minlength iteratively find all exon models with the specified minimum ABSOLUTE COVERAGE (number of bp aligned) (default 0) -alwaysreport always report <number> exon models, even if they are below the quality thresholds
- If no mincoverage or minidentity or minlength is given, only the best exon model is returned. This is the DEFAULT operation.
- You will probably want to specify ALL THREE of mincoverage, minidentity and minlength! Don't assume the default values are what you want!
- You will DEFINITELY want to specify at least one of mincoverage, minidentity and minlength with '-alwaysreport'! If you don't, mincoverage will be set to 90 and minidentity to 95 -- to reduce the number of spurious matches when a good match is found.
Auxiliary options: -nodeflines don't include the deflines in the sim4db output -alignments print alignments -polytails DON'T mask poly-A and poly-T tails -cut trim marginal exons if A/T % > x (poly-AT tails) -noncanonical don't force canonical splice sites -splicemodel use the following splice model: 0 - original sim4; 1 - GeneSplicer; 2 - Glimmer; options 1 and 2 are only available with '-interspecies'. Default for sim4 is 0, and for sim4cc is 1. -forcestrand Force the strand prediction to always be one of 'forward' or 'reverse' Execution options: -threads Use n threads. -touch create this file when the program finishes execution Debugging options: -v print status to stderr while running -V print script lines (stderr) as they are being processed Developer options: -Z set the spaced seed pattern -H set the relink weight factor (H=1000 recommended for mRNAs) -K set the first MSP threshold -C set the second MSP threshold -Ma set the limit of the number of MSPs allowed -Mp same, as percentage of bases in cDNA NOTE: If used, both -Ma and -Mp must be specified!
For a typical run, sim4db takes as input two multi-fasta files containing the cDNAs and the genome, respectively, and optionally a script describing a set of pairings among the sequences. Alignments are determined using the program sim4 (default) for same-species comparisons, or sim4cc when the '-interspecies' option is set.
The output is reported in the compact sim4db format (default), or in GFF3 format with the '-gff3' option. Utilities for filtering, merging, sorting and processing polishes in these formats, and for converting between the two formats (lossy), are included with the package and described below.
The input script file format
[-f|-r] -e ESTidx -D GENidx GENlo GENhi where: cDNAidx internal index of the cDNA in the input cDNA fasta file (0..#cDNAseqs-1) GENidx internal index of the genomic sequence in the input genome file (0..#GENseqs-1) -f use the cDNA sequence as is -r use the reverse complement of the cDNA sequence GENlo, GENhi begin and end coordinates of the target genomic region; coordinates are 0-based
-f -e 61728 -D 0 2370482 2375224 -r -e 61730 -D 0 6723331 6757701 -r -e 61734 -D 1 8428517 8432981 -f -e 61736 -D 3 4600260 4637694
For best performance, the script should be sorted by genomic sequence index.
The sim4db output format
sim4begin cDNAidx[cDNAlen-pA-pT] GENidx[GENoff-GENlen] <M-N-O-P-S> edef=cDNA defline ddef=genomic defline cDNAbgn1-cDNAend1 (GENbgn1-GENend1) <M-N-P> intronOri cDNAbgn2-cDNAend2 (GENbgn2-GENend2) <M-N-P> intronOri ... cDNAbgnn-cDNAendn (GENbgnn-GENendn) <M-N-P> intronOri cDNA alignment sequence for exon #1 genomic alignment sequence for exon #1 cDNA alignment sequence for exon #2 genomic alignment sequence for exon #2 ... cDNA alignment sequence for exon #n genomic alignment sequence for exon #n sim4end where: cDNAidx internal index of the cDNA in the input cDNA fasta file cDNAlen length of the cDNA sequence pA(T)wi length of polyA(T) tail detected and masked GENidx internal index of the genomic sequence in the genome fasta file GENoff offset to the beginning of the genomic region containing the signal GENlen length of the genomic region containing the signal M number of nucleotide matches in the alignment N number of matching N's in the alignment P percent sequence identity of the alignment O match orientation: * forward - the cDNA sequence aligns to the genomic sequence directly * complement - the reverse complement of the cDNA sequence matches the genomic sequence; this is the equivalent of the sim4 '(complement)' output line S strand predicted based on the splice signals and alignment quality: * forward - predicted forward strand * reverse - predicted reverse strand * unknown - strand unknown (because of low alignment quality, single exon match, or weak splice signals) cDNAbgni start position of exon i in the cDNA sequence cDNAendi end position of exon i in the cDNA sequence GENbgni start position of exon i in the genomic sequence (interval GENlo-GENhi) GENendi end position of exon i in the genomic sequence (interval GENlo-GENhi) M number of nucleotide matches in the alignment N number of matching N's in the alignment P percent sequence identity of the alignment intronOri predicted orientation of the intron: * -> forward (i.e., GT-AG-like splice signals) * <- reverse (i.e., CT-AC-like splice signals) * -- ambiguous * == gap (unaligned portion) in the cDNA sequence
Exon coordinates are nucleotide based, starting from 1. Genomic coordinates are always in the original sequence, while the cDNA coordinates will refer to positions in the reverse complement of the sequence if the match orientation is indicated as 'complement'.
Lowercase letters in the alignment lines indicate positions with matching nucleotides, '-' indicate gaps in the corresponding sequence, and uppercase letters mark either substitutions, or gaps in the other sequence. Alignments are OPTIONAL.
sim4begin 61728[685-0-21] 0[2370482-4742] <651-0-97-forward-reverse> edef=gb|CA807305 D. melanogaster cDNA 3' similar to CT12127, mRNA sequence ddef=arm_2L 22-337 (2372455-2372770) <313-0-99> <- 338-584 (2372830-2373076) <238-0-95> <- 585-685 (2373134-2373234) <100-0-99> gtaaaaaTttctgtttatta...gggcgaccagaagtcaatcag gtaaaaaGttctgtttatta...gggcgaccagaagtcaatcag ggtaacttgtccttGggtgc...ccacaccgGctccca-ttcgcgtAtc ggtaacttgtccttTggtgc...ccacaccgCctcccaGttcgcgtTtc tgcaagcggtcgacatgagg...cttaaAgcgctggta tgcaagcggtcgacatgagg...cttaaCgcgctggta sim4end
The GFF3 output format
The same example as before:
0:arm_2L sim4db mRNA 2372455 2373234 97 - . ID=sim4db10;Name=61728:gb|CA807305;Target=61728:gb|CA807305 22 685 +;pA=0;pT=21;genRegion=2370482-2375224 0:arm_2L sim4db exon 2372455 2372770 99 - . Parent=sim4db10;Target=61728:gb|CA807305 22 337 +;Gap=M316;nMatches=313;intron=<- 0:arm_2L sim4db exon 2372830 2373076 95 - . Parent=sim4db10;Target=61728:gb|CA807305 338 584 +;Gap=M74 D1 M2 I1 M160 D1 M10;nMatches=238;intron=<- 0:arm_2L sim4db exon 2373134 2373234 99 - . Parent=sim4db10;Target=61728:gb|CA807305 585 685 +;Gap=M101;nMatches=100
(Columns are tab-separated.)
Here is a small sample data set for testing sim4db:
- Download the (mouse) mRNA sequences, and gunzip
- Download the (human) genomic regions, and gunzip
- Dowload the script file
- Run sim4db:
sim4db -cdna sim4db.example.cdna.fasta \ -gen sim4db.example.genomic.fasta \ -scr sim4db.example.script -interspecies -output out.sim4db
or, to align pairs of sequences (10 pairs) in the order in which they are listed in the fasta files:
sim4db -cdna sim4db.example.cdna.fasta \ -gen sim4db.example.genomic.fasta \ -pairwise -interspecies -output out.sim4db
The commands above should produce ten alignments, one for each pair of sequences in the script.
The first of these alignments is:
sim4begin 0[1287-12-0] 0[25000-75000] <974-0-84-complement-reverse> edef=gi|74316016|ref|NM_026831.1| Mus musculus myosin binding protein H-like (Mybphl), mRNA ddef=NM_001010985.1, start=109636510, end=109651186 143-331 (52744-52932) <165-0-87> <- 332-468 (53869-54005) <123-0-90> <- 469-628 (54418-54577) <132-0-83> <- 629-768 (54685-54824) <126-0-90> <- 769-964 (55057-55252) <166-0-85> <- 965-1053 (55804-55892) <82-0-92> <- 1054-1287 (64482-64725) <180-0-72> sim4end
This sim4db record indicates that the reverse complement (complement) of the first cDNA sequence in the file (cDNA index 0) matches the first genomic sequence (genomic index 0), within the genomic interval 25,001-100,001 (base coordinates starting at 1). The cDNA sequence is 1,287 bp long, of which the last 12 bp are likely polyA tail bases. The alignment has 974 nucleotide matches, and 84% percent sequence identity. The gene is predicted to be on the reverse strand, based on the observed splice signals.
Fasta header lines:
The edef and ddef lines show the fasta headers of the cDNA and genomic sequences, respectively.
The alignment contains seven exons. The first (3'-most) exon, for instance, starts at position 52,744 and ends at position 52,932 of the (full) genomic sequence. Because the match is with the reverse complement of the cDNA, the matching cDNA region is 1,287-331+1=957 through 1,287-143+1=1,145. The exonic alignment has 87% sequence identity, with 165 nucleotide matches. All introns are pointing in the reverse orientation (<-), suppporting the reverse strand prediction above.
The sim4dbutils package contains a range of utilities to work with sim4db-generated alignment files, of particular note being:
- convertPolishes - convert between the two formats. With GFF3->sim4db conversion, alignments will be lost.
- filterPolishes - filter alignments based on minimum percentage sequence identity, coverage and length.
- mergePolishes - merge alignments from multiple files (also concatenates the cDNA fasta files)
- sortPolishes - sort alignments by cDNA or genomic sequence index, using a limited amount of memory if needed.
More information can be found at Sim4db Files.