Multiple sequence comparison - A peptide matching approach

Marie-France Sagot; Alain Viari; H Soldano

doi:10.1016/S0304-3975(96)00137-5

Article Dans Une Revue Theoretical Computer Science Année : 1997

Multiple sequence comparison - A peptide matching approach

(1) , ,

Marie-France Sagot

Fonction : Auteur
PersonId : 170068
IdHAL : marie-france-sagot
IdRef : 103537562

Laboratoire d'Informatique Gaspard-Monge

Alain Viari

Fonction : Auteur
PersonId : 18170
IdHAL : alain-viari
ORCID : 0000-0002-1329-7777
IdRef : 081478151

H Soldano

Fonction : Auteur

Résumé

We present in this paper a peptide matching approach to the multiple comparison of a set of protein sequences. This approach consists in looking for all the words that are common to q of these sequences, where q is a parameter. The comparison between words is done by using as reference an object called a model, In the case of proteins, a model is a product of subsets of the alphabet Sigma of the amino acids. These subsets belong to a cover of Sigma, that is, their union covers all of Sigma. A word is said to be an instance of a model if it belongs to the model, A further flexibility is introduced in the comparison by allowing for up to e errors in the comparison between a word and a model. These errors may concern gaps or substitutions not allowed by the cover. A word is said to be this time an occurrence of a model if the Levenshtein distance between it and an instance of the model is inferior or equal to e. This corresponds to what we call a Set-Levenshtein distance between the occurrences and the model itself. Two words are said to be similar if there is at least one model of which both are occurrences. In the special case where e = 0, the occurrences of a model are simply its instances, If a model M has occurrences in at least q of the sequences of the set, M is said to occur in the set. The algorithm presented here is an efficient and exact way of looking for all the models, of a fixed length k or of the greatest possible length k(max), that occur in a set of sequences. It is linear in the total length n of the sequences and proportional to (e + 2)(2e + 1)k(e + 1)p(e + 1)g(k) where k much less than n is a small value in all practical situations, p is the number of sets in the cover and g is related to the latter's degree of nontransitivity. Models are closely related to what is called a consensus in the biocomputing area, and covers are a good way of representing complex relationships between the amino acids.

Domaines

Automatique

Marie-France Sagot : Connectez-vous pour contacter le contributeur

https://hal.science/hal-00694274

Soumis le : jeudi 3 mai 2012-23:57:04

Dernière modification le : jeudi 28 mars 2024-03:28:16

Dates et versions

hal-00694274 , version 1 (03-05-2012)

Identifiants

HAL Id : hal-00694274 , version 1
DOI : 10.1016/S0304-3975(96)00137-5

Citer

Marie-France Sagot, Alain Viari, H Soldano. Multiple sequence comparison - A peptide matching approach. Theoretical Computer Science, 1997, 180 (1-2), pp.115--137. ⟨10.1016/S0304-3975(96)00137-5⟩. ⟨hal-00694274⟩

Exporter

BibTeX XML-TEI Dublin Core DC Terms EndNote DataCite

Collections

ENPC CNRS UNIV-MLV LIGM_ALGO PARISTECH LIGM UNIV-EIFFEL JSE2024

90 Consultations

0 Téléchargements

Multiple sequence comparison - A peptide matching approach

Résumé

Domaines

Dates et versions

Identifiants

Citer

Exporter

Collections

Altmetric

Partager