Skip to Main content Skip to Navigation
Preprints, Working Papers, ...

Set-Min sketch: a probabilistic map for power-law distributions with application to k-mer annotation

Abstract : Motivation: In many bioinformatics pipelines, k-mer counting is often a required step, with existing methods focusing on optimizing time or memory usage. These methods usually produce very large count tables explicitly representing k-mers themselves. Solutions avoiding explicit representation of k-mers include Minimal Perfect Hash Functions (MPHFs) or Count-Min sketches. The former is only applicable to static maps not subject to updates, while the latter suffers from potentially very large point-query errors, making it unsuitable when counters are required to be highly accurate. Results: We introduce Set-Min sketch, a sketching technique inspired by Count-Min sketch, for representing associative maps, more specifically, k-mer count tables. We show that Set-Min sketch provides a very low error rate, both in terms of the probability and the size of errors, much lower than a Count-Min sketch of similar dimensions. On the other hand, Set-Min sketches are shown to take up to an order of magnitude less space than MPHF-based solutions, especially for large values of k. Space-efficiency of Set-min takes advantage of the power-law distribution of k-mer counts in genomic datasets.
Document type :
Preprints, Working Papers, ...
Complete list of metadatas

https://hal-upec-upem.archives-ouvertes.fr/hal-03049389
Contributor : Gregory Kucherov <>
Submitted on : Wednesday, December 9, 2020 - 7:39:07 PM
Last modification on : Wednesday, December 30, 2020 - 1:08:06 PM

File

2020.11.14.382713v1.full.pdf
Files produced by the author(s)

Identifiers

Collections

Citation

Yoshihiro Shibuya, Gregory Kucherov. Set-Min sketch: a probabilistic map for power-law distributions with application to k-mer annotation. 2020. ⟨hal-03049389⟩

Share

Metrics

Record views

14

Files downloads

14