Sim4db Files

sim4db Utilities

Many utilities are provided for working with sim4db files. Most of them also work with GFF3 files.

Top Tier

These are probably the most useful tools.

convertPolishes
Convert between sim4db and GFF3 format. Alignment information is lost when converting from GFF3 to sim4db.
filterPolishes
Filter alignments based on minimum percent identity, percent coverage and/or length.
mergePolishes
Merge alignments from multiple files, and concatenate source cDNA files to maintain the internal index of each sequence.
sortPolishes
Sort alignments by cDNA or genomic sequence index, or sequence name.
convertToExtent
Convert from sim4db format to a single-line tab-delimited format.
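
As a rough illustration of the kind of filtering filterPolishes is described as doing, the Python sketch below keeps only alignments that meet minimum percent identity, percent coverage, and length thresholds. The Polish record, its field names, and the filter_polishes function are hypothetical and exist only for this example; they do not reflect the sim4db file format or the tool's actual options.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Polish:
    """Hypothetical, simplified alignment record (NOT the sim4db format)."""
    cdna_id: int           # index of the cDNA (query) sequence
    genomic_id: int        # index of the genomic sequence
    percent_identity: int  # 0..100
    percent_coverage: int  # 0..100
    length: int            # assumed here to be the aligned length in bases


def filter_polishes(polishes, min_identity=0, min_coverage=0, min_length=0):
    """Keep alignments that satisfy every threshold.

    A threshold left at its default of 0 has no effect, so the criteria
    can be used alone or in combination.
    """
    return [
        p for p in polishes
        if p.percent_identity >= min_identity
        and p.percent_coverage >= min_coverage
        and p.length >= min_length
    ]
```

Leaving a threshold at zero disables it, which is one simple way to model the "and/or" combination of criteria mentioned above.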


Occasionally Useful

These tools are sometimes useful. They are probably not production ready.

fixPolishesIID
Updates a sim4db-format file to use the sequence index of a specific FASTA file. Can also be used to extract a subset of the alignments from the sim4db file.
depthOfPolishes
Outputs a tab-delimited histogram of the depth of polishes at various window sizes. Assumes there is exactly one genomic sequence.
headPolishes
Very simple utility to print the first N alignments in a file, similar to the UNIX 'head' command.
pickBestPolish
Uses an unspecified heuristic to report only the 'best' alignment for each cDNA. In the case of ties, all alignments with the same 'best' score are reported.
pickUniquePolish
Reports only those alignments that are the clear single best alignment for their cDNA. A 'clear best' alignment is one whose percent identity exceeds that of every other alignment for the same cDNA by more than some margin.
plotCoverageVsIdentity
Generates (1) a histogram of the percent identity, (2) a histogram of the percent coverage, and (3) a list of percent identity and coverage (for use in a scatter plot).
uniqPolishes
Filters out all alignments for cDNAs with multiple alignments (-uniq) or with a single alignment (-dupl). Similar to the UNIX 'uniq' command.
realignPolishes
Recompute the alignments listed in a sim4db file. Does some cleanup of the alignments.
reportAlignmentDifferences
Generates a histogram of the types of errors in a set of alignments.
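
The difference between the pickBestPolish and pickUniquePolish behaviors described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the actual scoring heuristic is unspecified, so percent identity stands in as the score; the Polish record, the function names, and the margin parameter are hypothetical; and treating a cDNA with a single alignment as having a clear best is also an assumption.

```python
from collections import defaultdict
from typing import NamedTuple


class Polish(NamedTuple):
    """Hypothetical, simplified alignment record (NOT the sim4db format)."""
    cdna_id: int
    genomic_id: int
    percent_identity: float  # assumed stand-in for the unspecified score


def group_by_cdna(polishes):
    """Group alignments by cDNA index, preserving input order."""
    groups = defaultdict(list)
    for p in polishes:
        groups[p.cdna_id].append(p)
    return groups


def pick_best(polishes):
    """For each cDNA, report every alignment tied for the top score."""
    best = []
    for group in group_by_cdna(polishes).values():
        top = max(p.percent_identity for p in group)
        best.extend(p for p in group if p.percent_identity == top)
    return best


def pick_unique(polishes, margin=1.0):
    """For each cDNA, report the top alignment only if it is a clear best.

    'Clear best' here means the top score exceeds the runner-up by more
    than `margin` percentage points. A lone alignment is assumed to
    qualify.
    """
    unique = []
    for group in group_by_cdna(polishes).values():
        ranked = sorted(group, key=lambda p: p.percent_identity, reverse=True)
        if len(ranked) == 1 or (
            ranked[0].percent_identity - ranked[1].percent_identity > margin
        ):
            unique.append(ranked[0])
    return unique
```

In this sketch, a cDNA whose two top alignments tie yields both alignments from pick_best but nothing from pick_unique.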


Unsupported and Deprecated

These tools were written during development or for some special-purpose analysis. They are not guaranteed to work and are listed only for completeness. If a tool no longer exists in the current software, then we finally decided to remove it.

comparePolishes
Attempts to correlate alignments in two files, for example, to compare the effect of different parameters on the same data.
convertToAtac
Converts from sim4db format to ATAC format.
detectChimera
Examines alignments for sequences that might be chimeric.
mappedCoverage
Reports the amount of the query sequence (EST, cDNA) that is covered by alignments. Very stale.
parseSNP
Analyzes alignments for SNPs. Expects a very specific FASTA header format. Was used years ago for mapping dbSNP to human.
removeDuplicate
Searches the input for duplicate alignments.
vennPolishes
Generates a Venn diagram for multiple sim4db files. Stale.
removeRedundant
(marked as broken)


Used internally by ESTmapper

In addition to sortPolishes, filterPolishes and pickBestPolish above, these are used internally by ESTmapper.

cleanPolishes
summarizePolishes