Getting Started with Meryl

meryl is a multi-threaded, multi-process, out-of-core k-mer counter.

A k-mer is a short contiguous block of sequence; for example, TGAAC is a 5-mer. k-mers are widely used in bioinformatics, from seeding sequence search and alignment algorithms (blast, sim4), to correcting errors in raw sequencing reads (quake), to estimating genome sizes.

A k-mer counter analyzes an input sequence file and builds a table of the number of times each specific k-mer is present in the input. The sequence TGAACTGAAC has 6 total 5-mers, with the k-mer TGAAC occurring twice. The 5-mer count table for this sequence would be:

2 TGAAC
1 GAACT
1 AACTG
1 ACTGA
1 CTGAA

meryl is capable of generating the k-mer count table and of performing simple operations on multiple tables (adding counts, subtracting counts, logical operations) along with reporting statistics on individual tables (histograms).

General meryl usage

Meryl has several modes of operation; the mode is selected with a command line option:

-P
Estimate memory usage for building a table
-B
Build a table
-M
Modify tables (math or logic)
-D
Dump a table

Meryl reads sequence from either a fasta or a fastq file. It writes k-mers to a 'meryl database': a set of two files, one ending in '.mcidx' and the other in '.mcdat'. When telling meryl which database to write to (or read from), supply only the name; leave off the '.mcidx' or '.mcdat' extension.

NOTE: The first time meryl runs on a sequence file, it scans the entire file to build an index that allows random access to bases in the file. This index is stored in the *.fastaidx or *.fastqidx file. Later runs of meryl on the same file load the (usually much smaller) index directly.
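
For example (the file names here are hypothetical), building a table from reads.fasta into a database named 'reads' leaves behind the two database files plus the sequence index:

> meryl -v -B -C -m 22 -s reads.fasta -o reads
> ls
reads.fasta     reads.fastaidx  reads.mcdat     reads.mcidx

Later commands, such as the histogram dump shown below (meryl -Dh -s reads), refer to the database simply as 'reads'.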

Estimating memory usage for building a k-mer count table

The '-P' mode will compute the amount of memory a sequential build process will require. If a sequence file is supplied (-s), the exact number of k-mers is counted by scanning the entire file. It is significantly faster to instead supply an estimate of the number of k-mers in the file (the number of bases is a good estimate) with the -n option.

-m #          (size of a mer; required)
-c #          (homopolymer compression; optional)
-p            (enable positions)
-s seq.fasta  (seq.fasta is scanned to determine the number of mers)
-n #          (compute params assuming file with this many mers in it)

> ls -l sequence.fasta
-rw-r--r--   1 bri  bri  10000015000 Jan  6 07:51 sequence.fasta

> time meryl -P -v -m 22 -n 10000015000
10000015000 22-mers can be computed using 22785MB memory.
0.000u 0.002s 0:00.00 0.0%      0+0k 0+0io 0pf+0w

> time meryl -P -v -m 22 -s sequence.fasta
Counting mers in 'sequence.fasta'
 9999.98 Mmers -- 21.85 Mmers/second
9999979000 22-mers can be computed using 22785MB memory.
432.130u 23.091s 7:37.64 99.4%  361+1169k 0+0io 0pf+0w

Note that no support is available for estimating the memory usage of threaded or segmented operation. A reasonable estimate is to divide the sequential memory usage by the number of segments, then add the size of the fastaidx file for each thread. With 16 segments, for example, the estimate would be 22785 MB / 16, roughly 1424 MB per segment; the segmented build below actually allocates 1415 + 120 + 128 = 1663 MB per segment.

Building a k-mer count table

Meryl allows you to select the k-mer size (-m option), the strand to count (-f, -r and -C options), and homopolymer compression (-c option), and it can filter out rare or common k-mers (-L and -U options). A short example follows the option list below.

-s seq.fasta  (sequence to build the table for)
-o tblprefix  (output table prefix)

-m #          (size of a mer; required)
-c #          (homopolymer compression; optional)

-f            (only build for the forward strand)
-r            (only build for the reverse strand)
-C            (use canonical mers, assumes both strands)

-L #          (DON'T save mers that occur less than # times)
-U #          (DON'T save mers that occur more than # times)

-p            (enable positions)

-v            (entertain the user)
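
As a small sketch (the input file name and the -L threshold are hypothetical choices), counting canonical 22-mers while discarding those seen only once would look like:

> meryl -v -B -C -m 22 -L 2 -s reads.fasta -o reads.ge2

Only k-mers occurring at least twice are saved, a common way to drop k-mers that arise from sequencing errors.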

By default, the computation is done as one large sequential process. On large datasets, this can use too much memory, can take too long, or both.

The size of a meryl job can be controlled either by breaking the computation into a fixed number of pieces, or by limiting the memory used by each piece.

The '-segments S' option will split the computation into S equal sized pieces.

The '-memory M' option is similar, but instead of splitting into a given number of pieces, the number of pieces is computed so that each piece uses no more than M MB of memory.

By default, these pieces are computed sequentially. The '-threads T' option will compute T pieces concurrently. For example, '-segments 16 -threads 4' will break the job into 16 pieces and compute 4 of them concurrently.

Examples

These examples use a large sequence file with 1000 sequences of 10 Mbp each.

> ls -l
-rw-r--r--   1 bri  bri  10000015000 Jan  6 07:51 sequence.fasta

Sequential build

> meryl -v -B -m 22 -s sequence.fasta -o output
Have NO LIMITS!: mersPerBatch=9999979001 segmentLimit=1
basesPerBatch = 10000000000
Computing 1 segments using AS MUCH MEMORY AS NEEDED.
  numMersActual      = 9999979001
  mersPerBatch       = 9999979001
  basesPerBatch      = 10000000000
  numBuckets         = 33554432 (25 bits)
  bucketPointerWidth = 34
  merDataWidth       = 19
 Allocating 22649MB for mer storage (19 bits wide).
 Allocating 136MB for bucket pointer table (34 bits wide).
 Allocating 128MB for counting the size of each bucket.
Counting mers in buckets:   10.49 Mmers --  6.67 Mmers/second
...

This will use 22649 + 136 + 128 = 22913 MB, slightly more than the estimate given by the -P option above.

Sequential segmented build

> meryl -v -B -m 22 -segments 16 -s sequence.fasta -o output
Have a segment limit: mersPerBatch=624998688 segmentLimit=16
basesPerBatch = 625000000
Computing 16 segments using AS MUCH MEMORY AS NEEDED.
  numMersActual      = 9999979001
  mersPerBatch       = 624998688
  basesPerBatch      = 625000000
  numBuckets         = 33554432 (25 bits)
  bucketPointerWidth = 30
  merDataWidth       = 19
Computing segment 1 of 16.
 Allocating 1415MB for mer storage (19 bits wide).
 Allocating 120MB for bucket pointer table (30 bits wide).
 Allocating 128MB for counting the size of each bucket.
Counting mers in buckets:   54.53 Mmers --  6.70 Mmers/second
...

When breaking the computation into pieces, each piece still has some constant-sized tables, but the primary data size is reduced from 22649 MB to 1415 MB, 16 times smaller.

Threaded segmented build

> meryl -v -B -m 22 -segments 16 -threads 4 -s sequence.fasta -o output
Have a segment limit: mersPerBatch=624998688 segmentLimit=16
basesPerBatch = 625000000
Computing 16 segments using AS MUCH MEMORY AS NEEDED.
  numMersActual      = 9999979001
  mersPerBatch       = 624998688
  basesPerBatch      = 625000000
  numBuckets         = 33554432 (25 bits)
  bucketPointerWidth = 30
  merDataWidth       = 19
Computing segment 1 of 16.
 Allocating 1415MB for mer storage (19 bits wide).
 Allocating 120MB for bucket pointer table (30 bits wide).
 Allocating 128MB for counting the size of each bucket.
Computing segment 2 of 16.
 Allocating 1415MB for mer storage (19 bits wide).
 Allocating 120MB for bucket pointer table (30 bits wide).
 Allocating 128MB for counting the size of each bucket.
Computing segment 3 of 16.
 Allocating 1415MB for mer storage (19 bits wide).
 Allocating 120MB for bucket pointer table (30 bits wide).
 Allocating 128MB for counting the size of each bucket.
Computing segment 4 of 16.
 Allocating 1415MB for mer storage (19 bits wide).
 Allocating 120MB for bucket pointer table (30 bits wide).
 Allocating 128MB for counting the size of each bucket.
Counting mers in buckets:   29.36 Mmers --  5.34 Mmers/second
...

The primary data size is still 1415 MB since we didn't change the number of segments, but each thread allocates an additional 120 + 128 MB for local storage. Only four segments are computed at a time.

Memory-limited build

> meryl -v -B -m 22 -memory 1500 -s sequence.fasta -o output
Can fit 609277844 mers into table with prefix of 25 bits, using 1500.000MB (   0.000MB for positions)
Have a memory limit: mersPerBatch=609277844 segmentLimit=17
basesPerBatch = 588235295
Computing 17 segments using 1500MB memory each.
  numMersActual      = 9999979001
  mersPerBatch       = 609277844
  basesPerBatch      = 588235295
  numBuckets         = 33554432 (25 bits)
  bucketPointerWidth = 30
  merDataWidth       = 19
Computing segment 1 of 17.
 Allocating 1332MB for mer storage (19 bits wide).
 Allocating 120MB for bucket pointer table (30 bits wide).
 Allocating 128MB for counting the size of each bucket.
Counting mers in buckets:    6.29 Mmers --  6.85 Mmers/second

meryl computes the number of k-mers that can be counted in a fixed amount of memory. We requested that each segment use no more than 1500 MB; it allocated 1332 + 120 + 128 = 1580 MB, slightly over the request. Using multiple threads will allocate 1580 MB per thread, as was done in the 'threaded segmented build' above.
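
If memory allows, the memory limit can be combined with threading; a hypothetical run computing four segments concurrently, each limited to 1500 MB (so roughly 4 x 1580 MB in total), would be:

> meryl -v -B -m 22 -memory 1500 -threads 4 -s sequence.fasta -o output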

SGE support

Meryl supports SGE operation; see the full command line help for details. The '-sge' option will submit meryl jobs to SGE for execution. A sketch of the batched workflow follows the help excerpt below.

Segmented, batched operation: Same as sequential, except this allows
each segment to be manually executed in parallel.
Only one of -memory and -segments is needed.
   -memory mMB     (use at most m MB of memory per segment)
   -segments n     (use n segments)
   -configbatch    (create the batches)
   -countbatch n   (run batch number n)
   -mergebatch     (merge the batches)
Initialize the compute with -configbatch, which needs all the build options.
Execute all -countbatch jobs, then -mergebatch to complete.
  meryl -configbatch -B [options] -o file
  meryl -countbatch 0 -o file
  meryl -countbatch 1 -o file
  ...
  meryl -countbatch N -o file
  meryl -mergebatch N -o file
Batched mode can run on the grid.
   -sge        jobname      unique job name for this execution.  Meryl will submit
                            jobs with name mpjobname, ncjobname, nmjobname, for
                            phases prepare, count and merge.
   -sgebuild "options"    any additional options to sge, e.g.,
   -sgemerge "options"    "-p -153 -pe thread 2 -A merylaccount"
                            N.B. - -N will be ignored
                            N.B. - be sure to quote the options
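
As a concrete sketch of the batched workflow (the file name and segment count are hypothetical), a four-segment count executed by hand might look like:

> meryl -configbatch -B -m 22 -C -segments 4 -s sequence.fasta -o output
> meryl -countbatch 0 -o output
> meryl -countbatch 1 -o output
> meryl -countbatch 2 -o output
> meryl -countbatch 3 -o output

The -countbatch jobs are independent and can be run concurrently, by hand or on the grid; the -mergebatch step shown in the usage summary above then combines them into the final table.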

Modify tables

Meryl can perform simple operations on tables. All operations read k-mer counts from some number of input tables (-s) and write the result to a single output table (-o).

-s tblprefix  (use tblprefix as an input database)
-o tblprefix  (create this output)
-v            (entertain the user)

Example: meryl -M lessthan 100 -s input -o output

Multiple tables are specified with multiple -s switches; e.g.: meryl -M add -s 1 -s 2 -s 3 -s 4 -o all

It is NOT possible to specify more than one operation; e.g., this is invalid: meryl -M add -s 1 -s 2 -M sub -s 3
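
To combine operations, write each intermediate result to its own table and feed it into the next step. As a sketch (the table names are hypothetical):

> meryl -M add -s 1 -s 2 -o sum12
> meryl -M sub -s sum12 -s 3 -o result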

Math operations

min
count is the minimum count for all databases. If the mer does NOT exist in all databases, the mer has a zero count, and is NOT in the output.
minexist
count is the minimum count for all databases that contain the mer
max
count is the maximum count for all databases
add
count is sum of the counts for all databases
sub
count is the first minus the second (binary only)
abs
count is the absolute value of the first minus the second (binary only)
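
For example (the database names are hypothetical), the minimum count across two read sets, keeping only k-mers present in both, could be computed with:

> meryl -M min -s readsA -s readsB -o shared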

Logical operations

and
outputs mer iff it exists in all databases
nand
outputs mer iff it exists in at least one, but not all, databases
or
outputs mer iff it exists in at least one database
xor
outputs mer iff it exists in an odd number of databases
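
A hypothetical example, reporting k-mers present in at least one of three libraries:

> meryl -M or -s lib1 -s lib2 -s lib3 -o anylib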

Threshold operations

Threshold operations work on exactly one database.

lessthan x
outputs mer iff it has count < x
lessthanorequal x
outputs mer iff it has count <= x
greaterthan x
outputs mer iff it has count > x
greaterthanorequal x
outputs mer iff it has count >= x
equal x
outputs mer iff it has count == x
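
For example (the threshold and table names are hypothetical), singleton k-mers can be removed from a table of read k-mers with:

> meryl -M greaterthanorequal 2 -s reads -o reads.ge2

This produces the same table as building with '-L 2' in the first place.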

Dumping a table

We'll use P. gingivalis (AE015924.fasta) as an example. First, build the table:

> meryl -v -B -m 22 -C -s AE015924.fasta -o AE015924
Have NO LIMITS!: mersPerBatch=2343456 segmentLimit=1
basesPerBatch = 2343476
Computing 1 segments using AS MUCH MEMORY AS NEEDED.
  numMersActual      = 2343456
  mersPerBatch       = 2343456
  basesPerBatch      = 2343476
  numBuckets         = 131072 (17 bits)
  bucketPointerWidth = 22
  merDataWidth       = 27
 Allocating 7MB for mer storage (27 bits wide).
 Allocating 0MB for bucket pointer table (22 bits wide).
 Allocating 0MB for counting the size of each bucket.
 Counting mers in buckets:    2.34 Mmers -- 24.29 Mmers/second
 Creating bucket pointers.
 Releasing 0MB from counting the size of each bucket.
 Filling mers into list:      2.34 Mmers --  9.86 Mmers/second
 Writing output:              2.34 Mmers --  7.95 Mmers/second
Segment 0 finished.

Histogram

A histogram of the k-mer counts can be generated with the -Dh dump.

> meryl -Dh -s AE015924 > histogram
Found 2343455 mers.
Found 2208734 distinct mers.
Found 2164083 unique mers.
Largest mercount is 81; 0 mers are too big for histogram.

> head histogram
1       2164083 0.9798  0.9235
2       25265   0.9912  0.9450
3       3046    0.9926  0.9489
4       7469    0.9960  0.9617
5       3039    0.9974  0.9682
6       971     0.9978  0.9706
7       396     0.9980  0.9718
8       511     0.9982  0.9736
9       293     0.9983  0.9747
10      1416    0.9990  0.9807

The columns are:

  1. The k-mer count.
  2. The number of k-mers with this count.
  3. The fraction of distinct k-mers with at most this count.
  4. The fraction of all k-mers in the input with at most this count.

Column 3 treats all k-mers as a (mathematical) set; a k-mer with count=10 contributes to this set once. Column 4 treats every k-mer occurrence individually; a k-mer with count=10 contributes ten times. For example, at count 1, column 3 is 2164083 / 2208734 distinct mers = 0.9798, and column 4 is 2164083 / 2343455 total mers = 0.9235.

Threshold

All k-mers with count at least some threshold can be output in fasta format, with the count reported on the defline. The k-mers are lexicographically sorted.

meryl -Dt -n <threshold> -s <database> > <output.fasta>

> meryl -Dt -n 71 -s AE015924
>81
AATCGCGTCCTGCATCGTGCAG
>81
ACAATCGCGTCCTGCATCGTGC
>71
ACGATGCAGGACGCGATTGTCA
>81
ATCGCGTCCTGCATCGTGCAGG
>81
CAATCGCGTCCTGCATCGTGCA
>81
CACGATGCAGGACGCGATTGTC