ESTmapper Documentation

From kmer
Jump to: navigation, search

ESTmapper Documentation v1.0


What is ESTmapper?

ESTmapper is a software package for the high-throughput alignment of large cDNA (EST, mRNA) sequence sets to a large eukaryotic genome of the same species. It can also be used to map relatively short DNA segments, such as sequencing reads or SNP flanking sequences, to the genome.

How does ESTmapper work?

For each cDNA in the input set, ESTmapper detects one or several occurrences of the cDNA in the target genome. ESTmapper uses a three-stage process to locate a cDNA sequence in the genome:

Stage 0
Configuration examine the genomic sequences and pre-compute search indices. This needs to be done once per genome.
Stage 1
Signal finding uses an efficient sequence similarity search to identify regions on the genome potentially containing the cDNA (signals).
Stage 2
Signal filtering discards regions containing weak signals based on the extent of the cDNA matched and the number of candidate regions.
Stage 3
Signal polishing generates a spliced alignment between the cDNA and each of the genomic regions selected earlier.

Basic usage


The first step is to pre-process the genomic sequences, using the program.

There are two required parameters:

-genome g.fasta
the genome to map to
-genomedir genome-directory
the directory to save the configuration in

And numerous optional parameters:

-mersize m
use m-mers (default 20)
-merskip s
skip s m-mers between mers (default 0, use all mers)
-memory M
use M MB memory for the search processes (default 1000MB)
-segment S
use S search processed (default, decided on memory size)
compute the configuration on the grid; args are passed to qsub
sge job name (default 'EMconfig')
compute the configuration right now (the default)

This will build an index into the genomic sequences g.fasta allowing fast access to each sequence. It will segment the sequences into equal sized pieces, either sized so that the search processes are expected to run in MMB of memory, or into S pieces. A rule of thumb is that each base will need 10 bytes of memory; mamallian genomes comfortably fit in S=32 pieces.

The mersize m and merskip s roughly control the space/time/sensitivity of the search process. Smaller mersizes require more time to find signals, generate more false positives (more space to store results, much more time to polish). Larger merskips quickly reduce memory requirements ( s=1 decreases the number of mers by 50%) which can reduce sensitivity, but doesn't impact search time much.

Configuration Example

Configure the Apis mellifera genome for ESTmapper, using the default m=20 mersize and s=0 merskip. Request four segments, instead of limiting each segment to a specific memory size. Perform the configuration computations using SGE, passing some SGE-specific options to the submission command. \
  -genome /genomedir/ame_ref.fasta \
  -genomedir /scratch/apis \
  -mersize 20 \
  -merskip 0 \
  -segments 4 \
  -sge "-pe thread 2 -A estmapper"

Basic usage (mapping)

Once a genome is configured, any number of ESTmapper mappings can be performed, on any input cDNA, EST or short genomic region input set.

Required Parameters

mapping-directory, the output and temporary files will go here.
genome-directory, as configured with
mapping style; -mapest, -mapmrna or -mapsnp.

There are numerous optional parameters, grouped by ESTmapper phase:

Global Parameters

prepare the computations (search, filter, polish) but do not actually compute them. Instead, tell the user what to do.

SGE Parameters

-sge N
run ESTmapper using SGE, with name N.
-sgeoptions "O"
"O" is supplied to all SGE submit commands.
-sgesearch "O"
"O" is supplied to search phase SGE submit commands.
-sgefilter "O"
"O" is supplied to filter phase SGE submit commands.
-sgepolish "O"
"O" is supplied to polish phase SGE submit commands.
-sgefinish "O"
"O" is supplied to finish phase SGE submit commands.

A note about SGE options: It is very important to quote the options to these parameters, otherwise, ESTmapper will attempt to interpret them. For example: -sgepolish "-q idle.q -p -10".

Search Stage Parameters

-localsearches N
compute the search phase locally, on this machine, running N processes at the same time.
-searchthreads T
use T compute threads per search process.
-mermaskfile M.fasta
read M.fasta , and ignore any mer present there when searching for signal.
-merignore n
ignore any mers that occur n times or more in the genome.

Filter Stage Parameters

-hitsortmemory M
use M MB memory to sort hits.
do no filtering of hits.

Polishing Stage Paramteres

-min[sim4]coverage C
report [make sim4 look for] alignments that cover at least C% of the query. Default C=50.
-min[sim4]identity I
report [make sim4 look for] alignments that are at least I% identity. Default I=95.
-min[sim4]length L
report [make sim4 look for] alignments that are at least L bases long. Default L=0.
-relink R
Sim4 'relink' constant, to allow for large introns. The default values are 500 for ESTs and 1000 for mRNAs.
-alwaysprint A
always report A alignments per query, regardless of quality.
-batchsize B
run polishing in batches of B signals each.
-numbatches N
run polishing using N batches of signals.
-localpolishes N
run polishing locally, on this machine, running N processes at the same time.
attempt to align between closely-related species.
[do not] save alignments in the output.
abort alignments that look suspicions and are expensive.
save the answer for each signal processed (for debugging).

Usually, the [sim4] flavor of the -min* options are unnecessary. By default, sim4 will look for alignments 5% below what is specified, e.g., "-mincoverage 50" implies "-minsim4coverage 45".

If any -min option is supplied, it is better to supply all three. Supplying just "-minlength 100" will leave the default values of "-mincoverage 50" and "-minidentity 95"; especially -mincoverage is probably not desired in this case.

Termination Stage Parameters

[do not] attempt to cleanup the results, removing small spurious exons, among other things
save all the gory intermediate files


Using the configured Apis mellifera genome from the last section, loosely map human mRNA. \
  -outputdir apismap \
  -genomedir /scratch/apis \
  -mapmrna /home/work/SEQUENCE/HUMREFSEQ1/HUMREFSEQ1.fasta \
  -minidentity 80 \
  -mincoverage 50 \
  -minlength 0 \
  -verbose \
  -stats \
  -sge emApis \
  -sgeoptions "-pe thread 2 -A estmapper" \
  -nocleanup \

This found alignments for (the number of alignments is the first number):

1        gi|11038618|ref|NM_001614.2| actin, gamma 1 (ACTG1), mRNA
19       gi|24371439|ref|NR_000037.1| tRNA glutamine 1 (TRQ1) on chr 17
5        gi|32526874|ref|NR_001449.1| tRNA lysine 1 (TRK1) on chr 17
3        gi|5016088|ref|NM_001101.2| actin, beta (ACTB), mRNA
4        gi|58293775|ref|NR_002213.1| Transfer RNA arginine (TRR) on chr 3
3        gi|62990120|ref|NM_001017421.1| actin-like protein (FKSG30), mRNA
1        gi|68299771|ref|NM_001069.2| tubulin, beta 2A (TUBB2A), mRNA
9        gi|73760420|ref|NR_002457.1| tRNA proline 2 (TRP2) on chr 14


The input consists of two large multi-fasta files, containing the cDNA and the genomic sequences, respectively.

The output consists primarily of the final alignment and summary files, located in the work directory:

all cDNA-genomic alignments that meet the specified quality criteria
for each EST, only the 'best' genomic alignment among those that meet the specified quality criteria (if multiple alignments with the same 'best' characteristics exist, then all are reported)
summary mapping statistics for the run

The following associated sequence files will also be created:

multi-fasta file of input sequences that were successfully mapped
multi-fasta file of input sequences that had signals, but did not produce a valid match
multi-fasta file of input sequences that did not have any signals

ESTmapper output format

cDNAidx[cDNAlen-pA-pT] GENidx[GENoff-GENlen] <M-N-O-P-S>
edef=cDNA defline
ddef=genomic defline
cDNAbgn1-cDNAend1 (GENbgn1-GENend1) <M-N-P> intronOrientation
cDNAbgn2-cDNAend2 (GENbgn2-GENend2) <M-N-P> intronOrientation
cDNAbgnn-cDNAendn (GENbgnn-GENendn) <M-N-P> intronOrientation
cDNA alignment sequence for exon #1
genomic alignment sequence for exon #1
cDNA alignment sequence for exon #2
genomic alignment sequence for exon #2
cDNA alignment sequence for exon #n
genomic alignment sequence for exon #n


cDNAidx internal index of the cDNA in the input cDNA fasta file
cDNAlen length of the cDNA sequence
pA(T) length of polyA(T) tail detected and masked
GENidx internal index of the genomic sequence in the genome fasta file
GENoff offset to the beginning of the genomic region containing the signal
GENlen length of the genomic region containing the signal
M number of nucleotide matches in the alignment
N number of matching N's in the alignment
P percent sequence identity of the alignment
O match orientation:
  • forward the cDNA sequence aligns to the genomic sequence directly
  • complement the reverse complement of the cDNA sequence matches the genomic sequence; this is the equivalent of the Sim4 '(complement)' output line
S strand predicted based on the splice signals and alignment quality:
  • forward high alignment quality, predicted forward strand
  • reverse high alignment quality, predicted reverse strand
  • unknown low alignment quality or weak splice signals
cDNAbgni start position of exon iin the cDNA sequence
cDNAendi end position of exon iin the cDNA sequence
GENbgni start position of exon iin the genomic sequence (interval GENlo-GENhi)
GENendi end position of exon iin the genomic sequence (interval GENlo-GENhi)
M number of nucleotide matches in the alignment
N number of matching N's in the alignment
P percent sequence identity of the alignment
intronOrientation predicted orientation of the intron:
  • -> forward (i.e., GT-AG-like splice signals)
  • <- reverse (i.e., CT-AC-like splice signals)
  • -- ambiguous
  • == gap (unaligned portion) in the cDNA sequence

Exon coordinates are nucleotide based, starting from 1. Genomic coordinates are always in the original sequence, while the cDNA coordinates will refer to positions in the reverse complement of the sequence if the match orientation is indicated as 'complement'.

Lowercase letters in the alignment lines indicate positions with matching nucleotides, '-' indicate gaps in the corresponding sequence, and uppercase letters mark either substitutions, or gaps in the other sequence.

Output Format Examples

A minimal match description. The cDNA and genomic sequence deflines and alignment lines need not be present:

2[2472-0-0] 0[239821074-51250] <2472-0-100-complement-reverse>
1-1542 (238923075-239824616) <1542-0-100> <-
1543-1794 (239827671-239827922) <251-0-99> <-
1795-1954 (239834345-239834504) <160-0-100> <-
1955-2472 (239869807-239870324) <518-0-100>

The record describes a cDNA-genomic alignment between the cDNA with index number 2 (the third sequence in the cDNA multi-fasta file) and the genomic segment between positions 239821075-23987233. The header line contains 3 tokens that provide information about the cDNA (2472 bp long, no polyA or polyT tails identified), the genomic sequence (index 0, and range), and the match (number of nucleotide identity matches -- 2472, percent sequence identity -- 100%, match orientation -- complement, and likely gene location, on the reverse strand).

The alignment contains 4 exons. Since the cDNA and the genomic sequences match in 'complement', the cDNA coordinates are given in the reverse complement of the cDNA sequence (e.g., exon 1, 1-1542, refers to the segment between positions 2472-1542+1 and 2472-1+1 in the original sequence).

A more detailed match description, including alignments and deflines.

2[2472-0-0] 0[239821074-51250] <2471-0-100-complement-reverse>
edef=>gi|71999139|ref|NM_001030012.1| Homo sapiens opsin 3(encephalopsin, panopsin) (OPN3), transcript variant 3, mRNA
ddef=>chr1 /len=247249719 /nonNlen=224999719 /org=H.sapiens(hg18)
1-1542 (238923075-239824616) <1542-0-100> <-
1543-1794 (239827671-239827922) <251-0-99> <-
1795-1954 (239834345-239834504) <160-0-100> <-
1955-2472 (239869807-239870324) <518-0-100>

This is the same match as in example 1. The edef line now characterizes the cDNA sequence (i.e., the RefSeq sequence corresponding to the opsin 3 gene mRNA), and the ddef line characterizes the genomic sequence (i.e., human chromosome 1). The alignment contains one substitution (C-A) at position 3 in exon 2.

Intermeditate Data Files

ESTmapper creates several intermediate data files located in the designated working directory as follows:

symbolic links to input files, indices and other temporary files
temporary directory for Stage I (signal finding)
temporary directory for Stage II (signal filtering)
temporary directory for Stage III (alignment and validation)

Software and hardware requirements

ESTmapper is written in a combination of Perl and C++, and should run on any Posix compliant unix-like platform. For large tasks, such as mapping 6M EST sequences to the human genome, at least 1GB RAM and up to 80GB disk for intermediate results is required; the output will be approximately 12 GB (9 GB for 'polishes-good', 3 GB for 'polishes-best'.

ESTmapper was designed, developed and implemented by Brian Walenz and Liliana Florea.