Skip to Main content Skip to Navigation
Conference papers

Set-Min sketch: a probabilistic map for power-law distributions with application to k-mer annotation

Abstract : Motivation: In many bioinformatics pipelines, k-mer counting is often a required step, with existing methods focusing on optimizing time or memory usage. These methods usually produce very large count tables explicitly representing k-mers themselves. Solutions avoiding explicit representation of k-mers include Minimal Perfect Hash Functions (MPHFs) or Count-Min sketches. The former is only applicable to static maps not subject to updates, while the latter suffers from potentially very large point-query errors, making it unsuitable when counters are required to be highly accurate. Results: We introduce Set-Min sketch, a sketching technique inspired by Count-Min sketch, for representing associative maps, more specifically, k-mer count tables. We show that Set-Min sketch provides a very low error rate, both in terms of the probability and the size of errors, much lower than a Count-Min sketch of similar dimensions. On the other hand, Set-Min sketches are shown to take up to an order of magnitude less space than MPHF-based solutions, especially for large values of k. Space-efficiency of Set-min takes advantage of the power-law distribution of k-mer counts in genomic datasets.
Document type :
Conference papers
Complete list of metadata
Contributor : Gregory Kucherov Connect in order to contact the contributor
Submitted on : Wednesday, December 9, 2020 - 7:39:07 PM
Last modification on : Monday, February 14, 2022 - 3:52:19 AM
Long-term archiving on: : Wednesday, March 10, 2021 - 8:05:54 PM


Files produced by the author(s)




Yoshihiro Shibuya, Gregory Kucherov. Set-Min sketch: a probabilistic map for power-law distributions with application to k-mer annotation. Research in Computational Molecular Biology - 25th Annual International Conference, RECOMB 2021, Padua, Italy, August 29-September 1, 2021, Proceedings, Aug 2021, Padova, Italy. ⟨10.1101/2020.11.14.382713⟩. ⟨hal-03049389⟩



Record views


Files downloads