License: Creative Commons Attribution 4.0 International license (CC BY 4.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.WABI.2022.14
URN: urn:nbn:de:0030-drops-170481
URL: https://drops.dagstuhl.de/opus/volltexte/2022/17048/
Shibuya, Yoshihiro ;
Belazzougui, Djamal ;
Kucherov, Gregory
Efficient Reconciliation of Genomic Datasets of High Similarity
Abstract
We apply Invertible Bloom Lookup Tables (IBLTs) to the comparison of k-mer sets originated from large DNA sequence datasets. We show that for similar datasets, IBLTs provide a more space-efficient and, at the same time, more accurate method for estimating Jaccard similarity of underlying k-mer sets, compared to MinHash which is a go-to sketching technique for efficient pairwise similarity estimation. This is achieved by combining IBLTs with k-mer sampling based on syncmers, which constitute a context-independent alternative to minimizers and provide an unbiased estimator of Jaccard similarity. A key property of our method is that involved data structures require space proportional to the difference of k-mer sets and are independent of the size of sets themselves. As another application, we show how our ideas can be applied in order to efficiently compute (an approximation of) k-mers that differ between two datasets, still using space only proportional to their number. We experimentally illustrate our results on both simulated and real data (SARS-CoV-2 and Streptococcus Pneumoniae genomes).
BibTeX - Entry
@InProceedings{shibuya_et_al:LIPIcs.WABI.2022.14,
author = {Shibuya, Yoshihiro and Belazzougui, Djamal and Kucherov, Gregory},
title = {{Efficient Reconciliation of Genomic Datasets of High Similarity}},
booktitle = {22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)},
pages = {14:1--14:14},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {978-3-95977-243-3},
ISSN = {1868-8969},
year = {2022},
volume = {242},
editor = {Boucher, Christina and Rahmann, Sven},
publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/opus/volltexte/2022/17048},
URN = {urn:nbn:de:0030-drops-170481},
doi = {10.4230/LIPIcs.WABI.2022.14},
annote = {Keywords: k-mers, sketching, Invertible Bloom Lookup Tables, IBLT, MinHash, syncmers, minimizers}
}
Keywords: |
|
k-mers, sketching, Invertible Bloom Lookup Tables, IBLT, MinHash, syncmers, minimizers |
Collection: |
|
22nd International Workshop on Algorithms in Bioinformatics (WABI 2022) |
Issue Date: |
|
2022 |
Date of publication: |
|
26.08.2022 |
Supplementary Material: |
|
Software (Source Code): https://github.com/yhhshb/km-peeler.git |