Getting Started with kmer-mask
For every input sequence, kmer-mask computes the fraction of it that is covered by kmers in a supplied kmer database. Bases covered by kmers are, by default, masked with 'n'. Sequences are placed in one of three output categories based on the amount of sequence masked: 'clean' (not much masking), 'murky' (moderate masking), or 'match' (lots of masking). Thresholds for each category are settable from the command line.
Support for paired reads is included. By default, pairing is not disturvbed. Pairs with different levels of masking are moved to a special 'mixed' output category. Other options are to use the cleaner of the two labels for both, to use the dirtier label, or to break the mate pairing and put each read in a different class.
To demonstrate kmer-mask, we'll make a mix of E. coli and P. gingivalis reads, and mask them using the reference sequences. It's not perfect -- a real subtraction project would use sequencing reads as the maksing database and have to deal with errors -- but it'll show the process.
Grab the Data
We're going to use the Escherichia coli and Porphyromonas gingivalis reference genomes for the kmer database(s) and use fastqSimulate from the Canu assembler to generate reads. I couldn't find any command-line-downloadable links to download the data from NCBI, so you'll need to manually click-through to save the files. Once downloaded generate reads with:
fastqSimulate -f NC_000913.3.fasta -l 250 -n 10000 -se -o ecoli-reads fastqSimulate -f NC_002950.2.fasta -l 250 -n 10000 -se -o pging-reads cat ecoli-reads.s.fastq pging-reads.s.fastq > reads.s.fastq
Or, just grab the ready made data:
curl -O http://kmer.sourceforge.net/kmer-mask-example.tar.xz xz -dc kmer-mask-example.tar.xz | tar -xf - cat ecoli-reads.s.fastq pging-reads.s.fastq > reads.s.fastq
Prepare the kmer databases
We'll mask the reads using 15-mers. kmer-mask checks both the forward and reverse kmer against these databases, so we can reduce the size a bit and count only canonical kmers. Instead of counting the forward and reverse-complement mers at each position individually, a canonical mer is the smaller of the two.
Note that the meryl algorithm cannot work with compressed inputs. The *.fastaidx files are created by the overly-sophisticated (for what meryl needs) sequence reader used in meryl.
meryl -B -C -m 15 -s NC_000913.3.fasta -o ecoli.ms15 meryl -B -C -m 15 -s NC_002950.2.fasta -o pging.ms15 rm *.fastaidx
in a real subtraction project, the database would typically come from sequencing reads from the growth media, and would thus contain sequencing errors. We can use meryl to remove the uncommon kmers. Below, we're removing kmers with count less than 5. With kmers from a reference, however, nearly every kmer will be unique, and this step would just retain the repetitive kmers.
# Don't run this on reference genomes! meryl -M greaterthanorequal 5 -s pging.ms15 -o pging.ms15.noerrors
With multiple kmer databases, we can do fancy operations like combine them (add), or remove one from the other (subtract), or more. Lets remove the kmers from one reference from the other.
meryl -M sub -s pging.ms15 -s ecoli.ms15 -o pging-minus-ecoli.ms15 meryl -M sub -s ecoli.ms15 -s pging.ms15 -o ecoli-minus-pging.ms15
There aren't that many common kmers between E. coli and P. gingivalis. You could do '-M and' to find them. We'll just view counts. This reveals that most of the shared kmers are single copy, interesting.
meryl -Dc -s ecoli.ms15 Found 4641638 mers. Found 4462229 distinct mers. Found 4357730 unique mers. meryl -Dc -s ecoli-minus-pging.ms15 Found 4611450 mers. Found 4434340 distinct mers. Found 4330629 unique mers.
meryl -Dc -s pging.ms15 Found 2343462 mers. Found 2182804 distinct mers. Found 2123270 unique mers. meryl -Dc -s pging-minus-ecoli.ms15 Found 2313274 mers. Found 2154550 distinct mers. Found 2095956 unique mers.
Subtracting P. gingivalis from E. coli resulted in 27,101 unique (single copy) mers being removed. A handful more were removed the other way around: 27,314 mers were removed from P. gingivalis by subtracting out E. coli.
Ansai, et al., found similarity too, way back in 1995.
The masking process checks every kmer in the subject sequence against the kmer database. If the kmer is present in the database, the bases covered by the kmer in the subject sequence are masked to lowercase 'n' (by default) or left as is, but counted as being masked (with option -nomasking).
The output of the process is four FASTQ files (eight if mated reads are input). The files hold reads that are 'clean' (not masked), 'match' (masked) or 'murky' (in between). Mated reads could be placed in a 'mixed' file, for reads that have different classifications. Parameters -clean and -match specify the cutoffs, as fraction of the read masked. Settings of "-clean 0.0 -match 1.0" would output only reads with no kmer hits in the 'clean' file, only reads with 100% kmer hit coverage in the 'match' file, and all other reads, with some bases masked and some not, in the 'murky' file. Likewise, '-clean 0.1' would allow up to 10% of the read to be masked, and still appear in the 'clean' output.
Two parameters attempt to recover from noise in either the subject sequence or the database:
The '-m' option will remove isolated kmers, presumabely caused by chance. The default, '-m 0', will do no removal; '-m 1' will remove a single isolated kmer hit; '-m 2' will remove two adjacent kmer hits, etc. For example, if our database contains only the kmer GTA, the subject sequence CGTACGTAC contains two isolated kmers. The kmers in that sequence are CGT GTA TAC ACG CGT GTA TAC. Both GTA kmers are between non-matching kmers, making them isolated. If the database contained two kmers, AGT and GTA, The first GTA kmer is isolated, while the second GTA is not. The second GTA kmer is adjacent to AGT.
The '-e' option performs the complementary operation. It will join regions covered by masked kmers as long as they are not far away. With '-e 1' and only GTA in the database, CGTACGTAC is initially masked to CgtaCgtaC (using lowercase masking). Because the two masked regions have at most one base between them, the '-e' option will extend the masking across the region. This operation can overcome errors in the reads.
in our example, we're masking against a target genome, so a 'clean' hit is one that does not match the genome, and a 'match' hit is one does match.
First, lets mask against the reference the reads came from.
kmer-mask -t 12 -mdb ecoli.ms15 -ms 15 -1 ecoli-reads.s.fastq -o junk -clean 0.1 -match 0.5 aClean 0 aMurky 0 aMatch 10000 aMixed 0 kmer-mask -t 12 -mdb pging.ms15 -ms 15 -1 pging-reads.s.fastq -o junk -clean 0.1 -match 0.5 aClean 0 aMurky 0 aMatch 10000 aMixed 0
They both match at 100%, that's good! Let's match the reads against the other reference.
kmer-mask -t 12 -mdb pging.ms15 -ms 15 -1 ecoli-reads.s.fastq -o junk -clean 0.1 -match 0.5 -h pe.histogram aClean 7363 aMurky 2636 aMatch 1 aMixed 0 kmer-mask -t 12 -mdb ecoli.ms15 -ms 15 -1 pging-reads.s.fastq -o junk -clean 0.1 -match 0.5 aClean 4121 aMurky 5877 aMatch 2 aMixed 0
The 'a' prefix indicates this is for the first file. If mated reads were used, we'd get a matrix of numbers, showing how each pair ended up. We'll cheat and pretend the two read files we have are mated (notice also the use of -unlink to prevent picking one of the classifications for both reads).
kmer-mask -t 12 -mdb pging.ms15 -ms 15 -1 pging-reads.s.fastq -2 ecoli-reads.s.fastq -o junk -clean 0.1 -match 0.5 -unlink bClean bMurky bMatch bMixed aClean 0 0 0 0 aMurky 0 0 0 0 aMatch 7363 2636 1 0 aMixed 0 0 0 0
It's clear that E. coli and P. gingivalis share something in common. Had this been our subtraction task - say, removing the P. ginvgivalis reads from these two read sets combined - we would have matched 100% of the P. gingivalis reads against the P. gingivalis genome, but also falsely matched one E. coli read fully, and 2636 reads partially.
Looking at the histogram of amount matching (that's the '-h pe' option in the first kmer-mask command above), we see that there are a lot of E. coli reads with about 12% of the bases matching the P. gingivalis genome. Curiously, there are also peaks at 6%, 18% and 24%.
> head -n 51 pe.histogram # fraction-masked number-of-sequences 0.0000 13514 0.0600 2630 0.0640 735 0.0680 230 0.0720 89 0.0760 35 0.0800 29 0.0840 23 0.0880 24 0.0920 27 0.0960 27 0.1000 21 0.1040 19 0.1080 16 0.1120 18 0.1160 16 0.1200 821 0.1240 502 0.1280 198 0.1320 83 0.1360 47 0.1400 42 0.1440 21 0.1480 28 0.1520 24 0.1560 16 0.1600 8 0.1640 22 0.1680 16 0.1720 19 0.1760 13 0.1800 201 0.1840 140 0.1880 70 0.1920 37 0.1960 14 0.2000 15 0.2040 11 0.2080 7 0.2120 13 0.2160 11 0.2200 10 0.2240 6 0.2280 1 0.2320 8 0.2360 1 0.2400 24 0.2440 30 0.2480 21 0.2520 8
Ah! But six percent is just kmer-size (15) / read-length (250). We can do two things to clean this up:
- build new subtraction databases of the kmers present in only one assembly (we did this already, by subtracting one set of kmers from the other set).
- tell kmer-mask to skip isolated kmer matches.
For just the cleaned up database:
kmer-mask -t 12 -mdb pging-minus-ecoli.ms15 -ms 15 -1 ecoli-reads.s.fastq -o junk -clean 0.1 -match 0.5 aClean 9532 aMurky 468 aMatch 0 aMixed 0 kmer-mask -t 12 -mdb ecoli-minus-pging.ms15 -ms 15 -1 pging-reads.s.fastq -o junk -clean 0.1 -match 0.5 aClean 8501 aMurky 1499 aMatch 0 aMixed 0
For just the isolate kmer skipping:
kmer-mask -t 12 -mdb pging.ms15 -ms 15 -1 ecoli-reads.s.fastq -o junk -clean 0.1 -match 0.5 -m 2 aClean 9656 aMurky 344 aMatch 0 aMixed 0 kmer-mask -t 12 -mdb ecoli.ms15 -ms 15 -1 pging-reads.s.fastq -o junk -clean 0.1 -match 0.5 -m 3 aClean 9820 aMurky 180 aMatch 0 aMixed 0
kmer-mask -t 12 -mdb pging-minus-ecoli.ms15 -ms 15 -1 ecoli-reads.s.fastq -o junk -clean 0.1 -match 0.5 -m 2 aClean 9967 aMurky 33 aMatch 0 aMixed 0 kmer-mask -t 12 -mdb ecoli-minus-pging.ms15 -ms 15 -1 pging-reads.s.fastq -o junk -clean 0.1 -match 0.5 -m 3 aClean 9958 aMurky 42 aMatch 0 aMixed 0
Not so bad! Only about 0.5% of the reads have a weak match, all others are 'clean'. Notice that the first command used '-m 2', to ignore isolated kmer hits. The second command used '-m 3' to ignore isolated hits formed by two adjacent kmers (there were more 'murky' hits with '-m 2' for some reason). The latter would be the same as increasing the kmer size by one and using '-m 2'.
If we build a set of reads from both E. coli and P. gingivalis, we should get nice seperation when masking to the cleaned up databases:
cat ecoli-reads.s.fastq pging-reads.s.fastq > all-reads.s.fastq kmer-mask -t 12 -mdb pging-minus-ecoli.ms15 -ms 15 -1 all-reads.s.fastq -o junk -clean 0.1 -match 0.5 -m 2 aClean 9967 aMurky 34 aMatch 9999 aMixed 0 kmer-mask -t 12 -mdb ecoli-minus-pging.ms15 -ms 15 -1 all-reads.s.fastq -o junk -clean 0.1 -match 0.5 -m 3 aClean 9958 aMurky 43 aMatch 9999 aMixed 0
You might remember that we had matched the reads against the references at 100% success, so why is there one read missing now? We're using the subtracted databases! We have apparently found one read that shared kmers between the two genomes.
kmer-mask -t 12 -mdb ecoli-minus-pging.ms15 -ms 15 -1 ecoli-reads.s.fastq -o junk -clean 0.1 -match 0.5 -m 3 aClean 0 aMurky 1 aMatch 9999 aMixed 0 cat junk.murky.1.fastq @SE_3450_0@3829546-3829796#1 fractionMasked=0.396 TTCCAGCGnnnnnnnnnnnnnnnnnnnnnAGCGGnnnnnnnnnnnnnnnnnnnnnnnnnnnnGTAATAGAGCGGCGGTGGCACCAGCAATCAGTCTTGCGTCATAGCnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnCGCCAGCGCGTATCGGGCCGCCAGTCGCAATTACCGGTGACGGCCGACGCGTAAGTAAGGCTTACGTTGACGGTTTAAGCAGGATAATTACGCnnnnnnnnnnnnnnnnnn + HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH)HHHHHHHH)%HHHHHHHHHHHHHHHHHHH)HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH)HHHH)HHHHHHHH%HHHHHHHHHHH)HHHH%HHHH%HHHHHHHHHHHHHH%HHHHHHHHHHHHHH%HHHHHHHHHHHHHH)H%HHHHHHHHHHHHHHHHHH
(Remember, this is a simulated read, the weird QVs are where the errors are.)
Precompute Masking Table
The data structure used to hold the kmers can be precomputed. This reduces startup times by allowing the data to be loaded directly with no additional processing. The '-edb option enables this.
Once the 'edb' file exists, the original kmer counts in the 'mdb' files aren't used. If supplied, they are ignored. You can safely omit the '-mdb' parameter in this case.
To build the 'edb', run kmer-mask with no input reads:
kmer-mask -mdb ecoli.ms15 -ms 15 -edb ecoli.ms15.edb
You can also supply the 'edb' option on a normal run. This will build the data structure using the 'mdb' files, output the 'edb' file, and continue processing the reads. Future runs will load the data directly from the 'edb' file; the 'mdb' file isn't used at all.
kmer-mask -t 12 -mdb ecoli.ms15 -ms 15 -edb pging.ms15.edb -1 ecoli-reads.s.fastq -o junk -clean 0.1 -match 0.5 aClean 0 aMurky 0 aMatch 10000 aMixed 0
To be finished....
usage: kmer-mask -mdb mer-database -ms mer-size ... INPUTS: -mdb mer-database load masking kmers from meryl 'mer-database' -ms mer-size the mer size used to generate the meryl database -edb exist-database save masking kmers to 'exist-database', and reuse on future runs (optional) -1 in.1.fastq input reads - fastq, fastq.gz, fastq.bz2 or fastq.xz -2 in.2.fastq - optional, for mated reads -l length maximum length of input read (512) If too small, program will fail. If too large, program will use excessive memory. THRESHOLDS and OPTIONS: A read with fewer than 'c' bases masked is 'clean', while a read with more than 'd' bases masked is 'match'ed. Reads in between are 'murky'. See OUTPUTS. -clean c mark reads with less than this fraction masked as 'clean' (0.3333) -match d mark reads with more than this fraction masked as 'match' (0.6667) -m min-size ignore database hits below this many consecutive kmers (0) -e extend-size extend database hits across this many missing kmers (0) For mate pairs, how to handle reads with different classifications: -cleaner use the cleaner classification of the two reads -dirtier use the dirtier classification of the two reads -discard discard pairs with conflicting classifications (DEFAULT) -unlink leave conflicting status alone, output reads to different files NOTE: mate pairing will be broken -nomasking do not trim masked sequence; output the original read OUTPUTS: -o prefix output reads: prefix.clean..fastq - clean (unmasked) reads prefix.murky..fastq - reads in between prefix.match..fastq - matching (masked) reads prefix.mixed..fastq - reads with conflicting status (for mated reads) -h histogram write a histogram of the amount of sequence RETAINED COMPUTE CONFIGURATION and LOGGING -t t use 't' compute threads -v show progress