Sim4db Files

sim4db Utilities

Many utilities are provided for working with sim4db files. Most of them also work with GFF3 files.

Top Tier

These tools are probably the most useful ones here.

Convert between sim4db and GFF3 format. Alignment information is lost when converting from GFF3 to sim4db.
Filter alignments based on minimum percent identity, percent coverage and/or length.
Merge alignments from multiple files, and concatenate source cDNA files to maintain the internal index of each sequence.
Sort alignments by cDNA or genomic sequence index, or sequence name.
Convert from sim4db format to a single-line tab-delimited format.
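
As an illustration of the filtering step described above, here is a minimal Python sketch. It does not read sim4db files and is not the filter tool itself: the `Alignment` record, its field names, and `filter_alignments` are hypothetical, and treating the thresholds as inclusive lower bounds is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    # Hypothetical in-memory record; field names are illustrative only.
    cdna_index: int
    percent_identity: float
    percent_coverage: float
    length: int


def filter_alignments(alignments, min_identity=0.0, min_coverage=0.0, min_length=0):
    """Keep alignments that meet every threshold that was supplied.

    Leaving a threshold at its default disables that test, which gives
    the "and/or" behaviour: any combination of the three may be used.
    Thresholds are assumed to be inclusive.
    """
    return [
        a for a in alignments
        if a.percent_identity >= min_identity
        and a.percent_coverage >= min_coverage
        and a.length >= min_length
    ]


if __name__ == "__main__":
    sample = [
        Alignment(0, 99.0, 95.0, 1200),
        Alignment(0, 91.0, 40.0, 300),
        Alignment(1, 96.0, 80.0, 150),
    ]
    for a in filter_alignments(sample, min_identity=95.0, min_coverage=50.0):
        print(a)
```

Calling it with only `min_length` set filters on length alone; the other two tests are then always satisfied.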

Occasionally Useful

These tools are sometimes useful. They are probably not production ready.

Updates a sim4db format file to use the sequence index of a specific fasta file. Can also be used to extract a subset of the alignments from the sim4db file.
Outputs a tab-delimited histogram of the depth of polishes at various window sizes. Assumes there is exactly one genomic sequence.
Very simple utility to print the first N alignments in a file, similar to the UNIX 'head' command.
Uses an unspecified heuristic to report only the 'best' alignment for each cDNA. In the case of ties, all alignments with the same 'best' score are reported.
Reports alignments where there is a clear single best alignment for each cDNA. A 'clear best' alignment is one whose percent identity exceeds that of every other alignment for the same cDNA by more than some threshold.
Generates (1) a histogram of the percent identity, (2) a histogram of the percent coverage, and (3) a list of percent identity and coverage (for use in a scatter plot).
Filters out all alignments for cDNAs with multiple alignments (-uniq) or with a single alignment (-dupl). Similar to the UNIX 'uniq' command.
Recomputes the alignments listed in a sim4db file, and does some cleanup of the alignments.
Generates a histogram of the types of errors in a set of alignments.
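
The per-cDNA selection rules in the list above ('best' with ties, 'clear best', and -uniq/-dupl) can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Aln` record and all function names are hypothetical, percent identity stands in for the unspecified 'best' heuristic, and a cDNA with a single alignment is assumed to count as having a clear best.

```python
from collections import defaultdict
from typing import NamedTuple


class Aln(NamedTuple):
    # Hypothetical record; not the sim4db on-disk format.
    cdna: int
    identity: float
    coverage: float


def group_by_cdna(alns):
    """Group alignments by cDNA index, preserving first-seen order."""
    groups = defaultdict(list)
    for a in alns:
        groups[a.cdna].append(a)
    return groups


def best_with_ties(alns, score=lambda a: a.identity):
    """Report every alignment that shares the top score for its cDNA.

    The real heuristic is unspecified; identity is a stand-in score.
    """
    out = []
    for group in group_by_cdna(alns).values():
        top = max(score(a) for a in group)
        out.extend(a for a in group if score(a) == top)
    return out


def clear_best(alns, margin):
    """Report the top alignment only if it beats the runner-up by more than margin.

    Assumption: a cDNA with exactly one alignment has a clear best.
    """
    out = []
    for group in group_by_cdna(alns).values():
        ranked = sorted(group, key=lambda a: a.identity, reverse=True)
        if len(ranked) == 1 or ranked[0].identity - ranked[1].identity > margin:
            out.append(ranked[0])
    return out


def uniq_or_dupl(alns, keep_unique=True):
    """keep_unique=True mimics -uniq (drop multiply-aligned cDNAs); False mimics -dupl."""
    out = []
    for group in group_by_cdna(alns).values():
        if (len(group) == 1) == keep_unique:
            out.extend(group)
    return out
```

All three functions share the same grouping step, so the only difference between them is the rule applied within each cDNA's group.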

Unsupported and Deprecated

These tools were constructed during development, or for some special-purpose analysis. They're not guaranteed to work, and are listed for completeness. If a tool no longer exists in the current software, then we finally decided to remove it.

Attempts to correlate alignments in two files, for example, to compare the effect of different parameters on the same data.
Converts from sim4db format to ATAC format.
Examines alignments for sequences that might be chimeric.
Reports the amount of the query sequence (EST, cDNA) that is covered by alignments. Very stale.
Analyzes alignments for SNPs. Expects a very specific fasta header format. Was used years ago for mapping dbSNP to human.
Searches the input for duplicate alignments.
Generates a Venn diagram for multiple sim4db files. Stale.
(marked as broken)

Used internally by ESTmapper

In addition to sortPolishes, filterPolishes and pickBestPolish above, these are used internally by ESTmapper.