Getting Started with kmer-mask

From kmer
Jump to: navigation, search


For every input sequence, kmer-mask computes the fraction of it that is covered by kmers in a supplied kmer database. Bases covered by kmers are, by default, masked with 'n'. Sequences are placed in one of three output categories based on the amount of sequence masked: 'clean' (not much masking), 'murky' (moderate masking), or 'match' (lots of masking). Thresholds for each category are settable from the command line.

Support for paired reads is included. By default, pairing is not disturvbed. Pairs with different levels of masking are moved to a special 'mixed' output category. Other options are to use the cleaner of the two labels for both, to use the dirtier label, or to break the mate pairing and put each read in a different class.

An Example

To demonstrate kmer-mask, we'll make a mix of E. coli and P. gingivalis reads, and mask them using the reference sequences. It's not perfect -- a real subtraction project would use sequencing reads as the maksing database and have to deal with errors -- but it'll show the process.

Grab the Data

We're going to use the Escherichia coli and Porphyromonas gingivalis reference genomes for the kmer database(s) and use fastqSimulate from the Canu assembler to generate reads. I couldn't find any command-line-downloadable links to download the data from NCBI, so you'll need to manually click-through to save the files. Once downloaded generate reads with:

fastqSimulate -f NC_000913.3.fasta -l 250 -n 10000 -se -o ecoli-reads
fastqSimulate -f NC_002950.2.fasta -l 250 -n 10000 -se -o pging-reads
cat ecoli-reads.s.fastq pging-reads.s.fastq > reads.s.fastq

Or, just grab the ready made data:

curl -O http://kmer.sourceforge.net/kmer-mask-example.tar.xz
xz -dc kmer-mask-example.tar.xz | tar -xf -
cat ecoli-reads.s.fastq pging-reads.s.fastq > reads.s.fastq

Prepare the kmer databases

We'll mask the reads using 15-mers. kmer-mask checks both the forward and reverse kmer against these databases, so we can reduce the size a bit and count only canonical kmers. Instead of counting the forward and reverse-complement mers at each position individually, a canonical mer is the smaller of the two.

Note that the meryl algorithm cannot work with compressed inputs. The *.fastaidx files are created by the overly-sophisticated (for what meryl needs) sequence reader used in meryl.

meryl -B -C -m 15 -s NC_000913.3.fasta -o ecoli.ms15
meryl -B -C -m 15 -s NC_002950.2.fasta -o pging.ms15
rm *.fastaidx

in a real subtraction project, the database would typically come from sequencing reads from the growth media, and would thus contain sequencing errors. We can use meryl to remove the uncommon kmers. Below, we're removing kmers with count less than 5. With kmers from a reference, however, nearly every kmer will be unique, and this step would just retain the repetitive kmers.

#  Don't run this on reference genomes!
meryl -M greaterthanorequal 5 -s pging.ms15 -o pging.ms15.noerrors

With multiple kmer databases, we can do fancy operations like combine them (add), or remove one from the other (subtract), or more. Lets remove the kmers from one reference from the other.

meryl -M sub -s pging.ms15 -s ecoli.ms15 -o pging-minus-ecoli.ms15
meryl -M sub -s ecoli.ms15 -s pging.ms15 -o ecoli-minus-pging.ms15

There aren't that many common kmers between E. coli and P. gingivalis. You could do '-M and' to find them. We'll just view counts. This reveals that most of the shared kmers are single copy, interesting.

meryl -Dc -s ecoli.ms15
Found 4641638 mers.
Found 4462229 distinct mers.
Found 4357730 unique mers.

meryl -Dc -s ecoli-minus-pging.ms15
Found 4611450 mers.
Found 4434340 distinct mers.
Found 4330629 unique mers.

and

meryl -Dc -s pging.ms15
Found 2343462 mers.
Found 2182804 distinct mers.
Found 2123270 unique mers.

meryl -Dc -s pging-minus-ecoli.ms15
Found 2313274 mers.
Found 2154550 distinct mers.
Found 2095956 unique mers.

Subtracting P. gingivalis from E. coli resulted in 27,101 unique (single copy) mers being removed. A handful more were removed the other way around: 27,314 mers were removed from P. gingivalis by subtracting out E. coli.

Ansai, et al., found similarity too, way back in 1995.

Mask!

The masking process checks every kmer in the subject sequence against the kmer database. If the kmer is present in the database, the bases covered by the kmer in the subject sequence are masked to lowercase 'n' (by default) or left as is, but counted as being masked (with option -nomasking).

The output of the process is four FASTQ files (eight if mated reads are input). The files hold reads that are 'clean' (not masked), 'match' (masked) or 'murky' (in between). Mated reads could be placed in a 'mixed' file, for reads that have different classifications. Parameters -clean and -match specify the cutoffs, as fraction of the read masked. Settings of "-clean 0.0 -match 1.0" would output only reads with no kmer hits in the 'clean' file, only reads with 100% kmer hit coverage in the 'match' file, and all other reads, with some bases masked and some not, in the 'murky' file. Likewise, '-clean 0.1' would allow up to 10% of the read to be masked, and still appear in the 'clean' output.

Two parameters attempt to recover from noise in either the subject sequence or the database:

The '-m' option will remove isolated kmers, presumabely caused by chance. The default, '-m 0', will do no removal; '-m 1' will remove a single isolated kmer hit; '-m 2' will remove two adjacent kmer hits, etc. For example, if our database contains only the kmer GTA, the subject sequence CGTACGTAC contains two isolated kmers. The kmers in that sequence are CGT GTA TAC ACG CGT GTA TAC. Both GTA kmers are between non-matching kmers, making them isolated. If the database contained two kmers, AGT and GTA, The first GTA kmer is isolated, while the second GTA is not. The second GTA kmer is adjacent to AGT.

The '-e' option performs the complementary operation. It will join regions covered by masked kmers as long as they are not far away. With '-e 1' and only GTA in the database, CGTACGTAC is initially masked to CgtaCgtaC (using lowercase masking). Because the two masked regions have at most one base between them, the '-e' option will extend the masking across the region. This operation can overcome errors in the reads.

in our example, we're masking against a target genome, so a 'clean' hit is one that does not match the genome, and a 'match' hit is one does match.

First, lets mask against the reference the reads came from.

kmer-mask -t 12 -mdb ecoli.ms15 -ms 15 -1 ecoli-reads.s.fastq -o junk -clean 0.1 -match 0.5
aClean        0
aMurky        0
aMatch    10000
aMixed        0

kmer-mask -t 12 -mdb pging.ms15 -ms 15 -1 pging-reads.s.fastq -o junk -clean 0.1 -match 0.5
aClean        0
aMurky        0
aMatch    10000
aMixed        0

They both match at 100%, that's good! Let's match the reads against the other reference.

kmer-mask -t 12 -mdb pging.ms15 -ms 15 -1 ecoli-reads.s.fastq -o junk -clean 0.1 -match 0.5 -h pe.histogram
aClean     7363
aMurky     2636
aMatch        1
aMixed        0

kmer-mask -t 12 -mdb ecoli.ms15 -ms 15 -1 pging-reads.s.fastq -o junk -clean 0.1 -match 0.5
aClean     4121
aMurky     5877
aMatch        2
aMixed        0

The 'a' prefix indicates this is for the first file. If mated reads were used, we'd get a matrix of numbers, showing how each pair ended up. We'll cheat and pretend the two read files we have are mated (notice also the use of -unlink to prevent picking one of the classifications for both reads).

kmer-mask -t 12 -mdb pging.ms15 -ms 15 -1 pging-reads.s.fastq -2 ecoli-reads.s.fastq -o junk -clean 0.1 -match 0.5 -unlink
         bClean    bMurky   bMatch     bMixed
aClean        0         0         0         0
aMurky        0         0         0         0
aMatch     7363      2636         1         0
aMixed        0         0         0         0

It's clear that E. coli and P. gingivalis share something in common. Had this been our subtraction task - say, removing the P. ginvgivalis reads from these two read sets combined - we would have matched 100% of the P. gingivalis reads against the P. gingivalis genome, but also falsely matched one E. coli read fully, and 2636 reads partially.

Looking at the histogram of amount matching (that's the '-h pe' option in the first kmer-mask command above), we see that there are a lot of E. coli reads with about 12% of the bases matching the P. gingivalis genome. Curiously, there are also peaks at 6%, 18% and 24%.

> head -n 51 pe.histogram
# fraction-masked number-of-sequences
0.0000  13514
0.0600  2630
0.0640  735
0.0680  230
0.0720  89
0.0760  35
0.0800  29
0.0840  23
0.0880  24
0.0920  27
0.0960  27
0.1000  21
0.1040  19
0.1080  16
0.1120  18
0.1160  16
0.1200  821
0.1240  502
0.1280  198
0.1320  83
0.1360  47
0.1400  42
0.1440  21
0.1480  28
0.1520  24
0.1560  16
0.1600  8
0.1640  22
0.1680  16
0.1720  19
0.1760  13
0.1800  201
0.1840  140
0.1880  70
0.1920  37
0.1960  14
0.2000  15
0.2040  11
0.2080  7
0.2120  13
0.2160  11
0.2200  10
0.2240  6
0.2280  1
0.2320  8
0.2360  1
0.2400  24
0.2440  30
0.2480  21
0.2520  8

Ah! But six percent is just kmer-size (15) / read-length (250). We can do two things to clean this up:

  1. build new subtraction databases of the kmers present in only one assembly (we did this already, by subtracting one set of kmers from the other set).
  2. tell kmer-mask to skip isolated kmer matches.

For just the cleaned up database:

kmer-mask -t 12 -mdb pging-minus-ecoli.ms15 -ms 15 -1 ecoli-reads.s.fastq -o junk -clean 0.1 -match 0.5
aClean     9532
aMurky      468
aMatch        0
aMixed        0

kmer-mask -t 12 -mdb ecoli-minus-pging.ms15 -ms 15 -1 pging-reads.s.fastq -o junk -clean 0.1 -match 0.5
aClean     8501
aMurky     1499
aMatch        0
aMixed        0

For just the isolate kmer skipping:

kmer-mask -t 12 -mdb pging.ms15 -ms 15 -1 ecoli-reads.s.fastq -o junk -clean 0.1 -match 0.5 -m 2
aClean     9656
aMurky      344
aMatch        0
aMixed        0

kmer-mask -t 12 -mdb ecoli.ms15 -ms 15 -1 pging-reads.s.fastq -o junk -clean 0.1 -match 0.5 -m 3
aClean     9820
aMurky      180
aMatch        0
aMixed        0

Combining both:

kmer-mask -t 12 -mdb pging-minus-ecoli.ms15 -ms 15 -1 ecoli-reads.s.fastq -o junk -clean 0.1 -match 0.5 -m 2
aClean     9967
aMurky       33
aMatch        0
aMixed        0

kmer-mask -t 12 -mdb ecoli-minus-pging.ms15 -ms 15 -1 pging-reads.s.fastq -o junk -clean 0.1 -match 0.5 -m 3
aClean     9958
aMurky       42
aMatch        0
aMixed        0

Not so bad! Only about 0.5% of the reads have a weak match, all others are 'clean'. Notice that the first command used '-m 2', to ignore isolated kmer hits. The second command used '-m 3' to ignore isolated hits formed by two adjacent kmers (there were more 'murky' hits with '-m 2' for some reason). The latter would be the same as increasing the kmer size by one and using '-m 2'.

If we build a set of reads from both E. coli and P. gingivalis, we should get nice seperation when masking to the cleaned up databases:

cat ecoli-reads.s.fastq pging-reads.s.fastq > all-reads.s.fastq
kmer-mask -t 12 -mdb pging-minus-ecoli.ms15 -ms 15 -1 all-reads.s.fastq -o junk -clean 0.1 -match 0.5 -m 2
aClean     9967
aMurky       34
aMatch     9999
aMixed        0

kmer-mask -t 12 -mdb ecoli-minus-pging.ms15 -ms 15 -1 all-reads.s.fastq -o junk -clean 0.1 -match 0.5 -m 3
aClean     9958
aMurky       43
aMatch     9999
aMixed        0

You might remember that we had matched the reads against the references at 100% success, so why is there one read missing now? We're using the subtracted databases! We have apparently found one read that shared kmers between the two genomes.

kmer-mask -t 12 -mdb ecoli-minus-pging.ms15 -ms 15 -1 ecoli-reads.s.fastq -o junk -clean 0.1 -match 0.5 -m 3
aClean        0
aMurky        1
aMatch     9999
aMixed        0

cat junk.murky.1.fastq
@SE_3450_0@3829546-3829796#1 fractionMasked=0.396
TTCCAGCGnnnnnnnnnnnnnnnnnnnnnAGCGGnnnnnnnnnnnnnnnnnnnnnnnnnnnnGTAATAGAGCGGCGGTGGCACCAGCAATCAGTCTTGCGTCATAGCnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnCGCCAGCGCGTATCGGGCCGCCAGTCGCAATTACCGGTGACGGCCGACGCGTAAGTAAGGCTTACGTTGACGGTTTAAGCAGGATAATTACGCnnnnnnnnnnnnnnnnnn
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH)HHHHHHHH)%HHHHHHHHHHHHHHHHHHH)HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH)HHHH)HHHHHHHH%HHHHHHHHHHH)HHHH%HHHH%HHHHHHHHHHHHHH%HHHHHHHHHHHHHH%HHHHHHHHHHHHHH)H%HHHHHHHHHHHHHHHHHH

(Remember, this is a simulated read, the weird QVs are where the errors are.)

Precompute Masking Table

The data structure used to hold the kmers can be precomputed. This reduces startup times by allowing the data to be loaded directly with no additional processing. The '-edb option enables this.

Once the 'edb' file exists, the original kmer counts in the 'mdb' files aren't used. If supplied, they are ignored. You can safely omit the '-mdb' parameter in this case.

To build the 'edb', run kmer-mask with no input reads:

kmer-mask -mdb ecoli.ms15 -ms 15 -edb ecoli.ms15.edb

You can also supply the 'edb' option on a normal run. This will build the data structure using the 'mdb' files, output the 'edb' file, and continue processing the reads. Future runs will load the data directly from the 'edb' file; the 'mdb' file isn't used at all.

kmer-mask -t 12 -mdb ecoli.ms15 -ms 15 -edb pging.ms15.edb -1 ecoli-reads.s.fastq -o junk -clean 0.1 -match 0.5

aClean        0
aMurky        0
aMatch    10000
aMixed        0

Reference

To be finished....

usage: kmer-mask -mdb mer-database -ms mer-size ...

INPUTS:

  -mdb mer-database      load masking kmers from meryl 'mer-database'
  -ms  mer-size          the mer size used to generate the meryl database
  -edb exist-database    save masking kmers to 'exist-database', and reuse on
                           future runs (optional)

  -1 in.1.fastq          input reads - fastq, fastq.gz, fastq.bz2 or fastq.xz
  -2 in.2.fastq                      - optional, for mated reads

  -l length              maximum length of input read (512)
                           If too small, program will fail.
                           If too large, program will use excessive memory.

THRESHOLDS and OPTIONS:

  A read with fewer than 'c' bases masked is 'clean', while a read with more than 'd' bases
  masked is 'match'ed.  Reads in between are 'murky'.  See OUTPUTS.

  -clean c               mark reads with less than this fraction masked as 'clean' (0.3333)
  -match d               mark reads with more than this fraction masked as 'match' (0.6667)

  -m min-size            ignore database hits below this many consecutive kmers (0)
  -e extend-size         extend database hits across this many missing kmers (0)

  For mate pairs, how to handle reads with different classifications:

  -cleaner               use the cleaner classification of the two reads
  -dirtier               use the dirtier classification of the two reads
  -discard               discard pairs with conflicting classifications (DEFAULT)
  -unlink                leave conflicting status alone, output reads to different files
                           NOTE: mate pairing will be broken

  -nomasking             do not trim masked sequence; output the original read

OUTPUTS:

  -o  prefix                output reads:
                            prefix.clean.[12].fastq  - clean (unmasked) reads
                            prefix.murky.[12].fastq  - reads in between
                            prefix.match.[12].fastq  - matching (masked) reads
                            prefix.mixed.[12].fastq  - reads with conflicting status (for mated reads)
  -h histogram           write a histogram of the amount of sequence RETAINED

COMPUTE CONFIGURATION and LOGGING

  -t t                   use 't' compute threads
  -v                     show progress