Lossless Filter for Long Multiple Repetitions with Edit Distance

Abstract : Identifying local similarity between two or more sequences, or identifying repe- titions occurring at least twice in a sequence, is an essential part in the analysis of biologi- cal sequences and of their phylogenetic relationship. Finding fragments that are conserved among several given sequences, or inside a unique sequence, while allowing for a certain number of insertions, deletions, and substitutions, is however known to be a computation- ally expensive task, and consequently exact methods can usually not be applied in practice. The filter we introduce in this paper, called Ed'Nimbus, provides a possible solution to this problem. It can be used as a preprocessing step to any multiple alignment method, eliminating an important fraction of the input that is guaranteed not to contain any ap- proximate repetition. It consists in the verification of a strong necessary condition. This condition concerns the number and order of exactly repeated words shared by the approx- imate repetitions. The efficiency of the filter is due to this condition, that we show how to check in a fast way. The speed of the method is achieved thanks also to the use of a simple and efficient data structure, that we describe in this paper, as well as its linear time and space construction. Our results show that using Ed'Nimbus allows us to sensibly reduce the time spent in a local multiple alignment. The Ed'Nimbus software is freely available on the web: http://igm.univ-mlv.fr/∼peterlon/officiel/ednimbus/
Complete list of metadatas

Cited literature [32 references]  Display  Hide  Download

https://hal-upec-upem.archives-ouvertes.fr/hal-00627831
Contributor : Pierre Peterlongo <>
Submitted on : Thursday, September 29, 2011 - 4:28:21 PM
Last modification on : Thursday, November 21, 2019 - 1:14:17 PM
Long-term archiving on : Friday, December 30, 2011 - 2:30:43 AM

File

Hal.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-00627831, version 1

Collections

Citation

Pierre Peterlongo, Nadia Pisanti, Alair Peirera Do Lago, Marie-France Sagot. Lossless Filter for Long Multiple Repetitions with Edit Distance. 2006. ⟨hal-00627831⟩

Share

Metrics

Record views

308

Files downloads

108