Set-Min sketch: a probabilistic map for power-law distributions with application to k-mer annotation - Archive ouverte HAL Accéder directement au contenu
Communication Dans Un Congrès Année : 2021

Set-Min sketch: a probabilistic map for power-law distributions with application to k-mer annotation

Résumé

Motivation: In many bioinformatics pipelines, k-mer counting is often a required step, with existing methods focusing on optimizing time or memory usage. These methods usually produce very large count tables explicitly representing k-mers themselves. Solutions avoiding explicit representation of k-mers include Minimal Perfect Hash Functions (MPHFs) or Count-Min sketches. The former is only applicable to static maps not subject to updates, while the latter suffers from potentially very large point-query errors, making it unsuitable when counters are required to be highly accurate. Results: We introduce Set-Min sketch, a sketching technique inspired by Count-Min sketch, for representing associative maps, more specifically, k-mer count tables. We show that Set-Min sketch provides a very low error rate, both in terms of the probability and the size of errors, much lower than a Count-Min sketch of similar dimensions. On the other hand, Set-Min sketches are shown to take up to an order of magnitude less space than MPHF-based solutions, especially for large values of k. Space-efficiency of Set-min takes advantage of the power-law distribution of k-mer counts in genomic datasets.
Fichier principal
Vignette du fichier
2020.11.14.382713v1.full.pdf (481.04 Ko) Télécharger le fichier
Origine : Fichiers produits par l'(les) auteur(s)

Dates et versions

hal-03049389 , version 1 (09-12-2020)

Identifiants

Citer

Yoshihiro Shibuya, Gregory Kucherov. Set-Min sketch: a probabilistic map for power-law distributions with application to k-mer annotation. Research in Computational Molecular Biology - 25th Annual International Conference, RECOMB 2021, Padua, Italy, August 29-September 1, 2021, Proceedings, Aug 2021, Padova, Italy. ⟨10.1101/2020.11.14.382713⟩. ⟨hal-03049389⟩
44 Consultations
87 Téléchargements

Altmetric

Partager

Gmail Facebook X LinkedIn More