Overview of the ATAC process

ATAC is a computational process for comparative mapping between two genome assemblies, or between two different genomes. The outcome of this process is a genome-wide list of assembly-to-assembly anchors and blocks.

Assembly-to-assembly anchors (matches) are bi-directionally unique sequences (found only once in each assembly) that are at least 95% identical between the two assemblies, with no insertions or deletions.

Assembly-to-assembly blocks (runs) are chains of ordered and oriented anchors with intervening gaps.

These data allow comparison and navigation between assemblies. Particularly important to these comparisons is the mapping of annotated features, such as genes, from one assembly to the other.

Assembly-to-assembly anchors are constructed by an all-against-all alignment of the two sequences. The comparison is symmetric, treating each assembly equally.

Anchors and blocks are computed in four steps:

Local Match Finding
Local match finding is an “all-against-all” search of one assembly against the other using 20-mer frequency counting. This step first finds the set of distinct 20-mers within each assembly and then retains only the 20-mers common to both assemblies. All 20-mers that occur more than once in either assembly are removed from the set to avoid repetitive and low-complexity regions. Next, where possible, the matching 20-mers are coalesced into longer exact (100% identical) matches between the assemblies. Any coalesced exact match shorter than 40 bp is discarded. Those that survive are preliminary anchors.
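The local match finding step above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the ATAC implementation: it works on two in-memory strings, considers the forward strand only, and the names `unique_kmers` and `preliminary_anchors` are hypothetical. The constants 20 and 40 come from the text.

```python
from collections import Counter

K = 20          # k-mer size stated in the text
MIN_MATCH = 40  # minimum exact-match length stated in the text


def unique_kmers(seq, k=K):
    """Map each k-mer that occurs exactly once in seq to its start position."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {seq[i:i + k]: i
            for i in range(len(seq) - k + 1)
            if counts[seq[i:i + k]] == 1}


def preliminary_anchors(a, b, k=K, min_len=MIN_MATCH):
    """Return exact matches as (start_a, start_b, length) tuples.

    Assumption: forward strand only; reverse-complement matches are ignored.
    """
    ua, ub = unique_kmers(a, k), unique_kmers(b, k)
    # Seed hits: k-mers unique in both sequences, sorted by diagonal
    # (start_a - start_b) and then by position in a.
    hits = sorted((ua[m] - ub[m], ua[m], ub[m]) for m in ua.keys() & ub.keys())
    matches = []  # entries are (diagonal, start_a, start_b, length)
    for diag, pa, pb in hits:
        if matches:
            d, sa, sb, length = matches[-1]
            # Same diagonal and overlapping or adjacent: extend the last match.
            if diag == d and pa <= sa + length:
                matches[-1] = (d, sa, sb, pa + k - sa)
                continue
        matches.append((diag, pa, pb, k))
    # Discard coalesced matches shorter than the minimum length.
    return sorted((sa, sb, n) for _, sa, sb, n in matches if n >= min_len)
```

Two seed hits on the same diagonal that overlap or abut necessarily cover one contiguous exact match, which is why extending the previous match is safe here.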
Signal to Noise Filtering
The signal to noise filtering step eliminates preliminary anchors that occur purely by chance in both assemblies. Only a match between the two assemblies of at least 40 bp with 100% identity is retained and becomes a preliminary assembly-to-assembly anchor.
Chaining to Find a Global Map
The resulting preliminary anchors are chained to find the global map and create preliminary blocks. Preliminary anchors are sorted by their chromosome assignment on the first assembly, and then grouped by their chromosome assignment on the other assembly. Anchor groups that have a consecutively ascending or descending order on both assemblies are named ATA blocks. A block is broken into two blocks if there is a change in the ascending or descending order of one anchor on either assembly or in the orientation of an anchor. In addition, a block will also be broken into two blocks if the distance between two consecutive anchors is greater than 100,000 bp. A small block with fewer than 100 bp in anchors is discarded as noise. After the noise blocks (and their preliminary anchors) are discarded, the final blocks are recomputed to create the global map.
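The chaining rules above can be sketched as follows. This is an illustrative approximation, not the ATAC code: the `Anchor` record and the `chain` and `global_map` functions are hypothetical names, "consecutive on both assemblies" is interpreted as neighbouring anchors in the first assembly also being neighbours (rank difference of +1 or -1) in the second, and the gap limit is applied on both assemblies. The thresholds 100,000 bp and 100 bp come from the text.

```python
from dataclasses import dataclass

MAX_GAP = 100_000    # maximum distance between consecutive anchors (from the text)
MIN_BLOCK_BP = 100   # minimum total anchor length per block (from the text)


@dataclass(frozen=True)
class Anchor:
    chrom_a: str
    start_a: int
    chrom_b: str
    start_b: int
    length: int
    forward: bool = True  # orientation of the match


def chain(anchors, max_gap=MAX_GAP, min_block_bp=MIN_BLOCK_BP):
    """Group anchors into blocks and drop blocks with too few anchored bases."""
    by_b = sorted(anchors, key=lambda x: (x.chrom_b, x.start_b))
    rank_b = {x: i for i, x in enumerate(by_b)}
    by_a = sorted(anchors, key=lambda x: (x.chrom_a, x.start_a))

    blocks, current, step = [], [], 0
    for anc in by_a:
        if current:
            prev = current[-1]
            d = rank_b[anc] - rank_b[prev]  # +1 ascending, -1 descending
            gap_a = anc.start_a - (prev.start_a + prev.length)
            if d > 0:
                gap_b = anc.start_b - (prev.start_b + prev.length)
            else:
                gap_b = prev.start_b - (anc.start_b + anc.length)
            same_block = (
                anc.chrom_a == prev.chrom_a
                and anc.chrom_b == prev.chrom_b
                and anc.forward == prev.forward
                and abs(d) == 1
                and (step == 0 or d == step)  # order direction must not flip
                and gap_a <= max_gap
                and gap_b <= max_gap
            )
            if same_block:
                current.append(anc)
                step = d
                continue
            blocks.append(current)
        current, step = [anc], 0
    if current:
        blocks.append(current)
    return [blk for blk in blocks
            if sum(x.length for x in blk) >= min_block_bp]


def global_map(anchors):
    """Discard noise blocks, then recompute blocks from the surviving anchors."""
    survivors = [x for blk in chain(anchors) for x in blk]
    return chain(survivors)
```

The second pass in `global_map` matters because removing a noise anchor can make two previously separated blocks consecutive, so they merge when the blocks are recomputed.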
Alignment Between Anchors in the Global Map
The last step is polishing, primarily filling the gaps between consecutive anchors of the same block. The anchors are extended into intra-block gaps to include isolated single-nucleotide substitutions flanked by at least 20 bp of exactly matching sequence. After this extension step, preliminary anchors become final assembly-to-assembly anchors. In addition, we close gaps with 5 or fewer substitutions (but no insertions or deletions) when the resulting anchor retains 95% or higher sequence identity.
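The gap-closing rule described above can be illustrated with a short sketch. It is a simplified reading of the text, not the ATAC implementation: the function name `close_gap` is hypothetical, anchors are `(start_a, start_b, length)` tuples on the forward strand, and a gap is taken to be free of insertions and deletions only if it has the same length in both assemblies. The limits of 5 substitutions and 95% identity come from the text.

```python
MAX_SUBS = 5         # maximum substitutions allowed in a gap (from the text)
MIN_IDENTITY = 0.95  # minimum identity of the merged anchor (from the text)


def close_gap(a, b, left, right, max_subs=MAX_SUBS, min_identity=MIN_IDENTITY):
    """Merge two consecutive anchors if the gap between them holds only
    a few substitutions. Returns the merged anchor, or None if not merged."""
    la, lb, ll = left
    ra, rb, rl = right
    gap_a = ra - (la + ll)
    gap_b = rb - (lb + ll)
    # Unequal gap lengths imply an insertion or deletion: do not close.
    if gap_a != gap_b or gap_a < 0:
        return None
    # Count substitutions inside the gap itself.
    gap_subs = sum(x != y for x, y in zip(a[la + ll:ra], b[lb + ll:rb]))
    if gap_subs > max_subs:
        return None
    # Check identity over the whole merged span, anchors included.
    total = ra + rl - la
    span_subs = sum(x != y for x, y in zip(a[la:la + total], b[lb:lb + total]))
    if (total - span_subs) / total < min_identity:
        return None
    return (la, lb, total)
```

Note that the two conditions are independent: a gap with 5 substitutions between two 40 bp anchors passes the substitution limit but fails the identity check, since 80 matching bases out of 85 is below 95%.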