Correction de données de séquençage de troisième génération

Abstract : The aims of this thesis are part of the vast problematic of high-throughput sequencing data analysis. More specifically, this thesis deals with long reads from third-generation sequencing technologies. The aspects tackled in this topic mainly focus on error correction, and on its impact on downstream analyses such a de novo assembly. As a first step, one of the objectives of this thesis is to evaluate and compare the quality of the error correction provided by the state-of-the-art tools, whether they employ a hybrid (using complementary short reads) or a self-correction (relying only on the information contained in the long reads sequences) strategy. Such an evaluation allows to easily identify which method is best tailored for a given case, according to the genome complexity, the sequencing depth, or the error rate of the reads. Moreover, developpers can thus identify the limiting factors of the existing methods, in order to guide their work and propose new solutions allowing to overcome these limitations. A new evaluation tool, providing a wide variety of metrics, compared to the only tool previously available, was thus developped. This tool combines a multiple sequence alignment approach and a segmentation strategy, thus allowing to drastically reduce the evaluation runtime. With the help of this tool, we present a benchmark of all the state-of-the-art error correction methods, on various datasets from several organisms, spanning from the A. baylyi bacteria to the human. This benchmark allowed to spot two major limiting factors of the existing tools: the reads displaying error rates above 30%, and the reads reaching more than 50 000 base pairs. The second objective of this thesis is thus the error correction of highly noisy long reads. To this aim, a hybrid error correction tool, combining different strategies from the state-of-the-art, was developped, in order to overcome the limiting factors of existing methods. More precisely, this tool combines a short reads alignmentstrategy to the use of a variable-order de Bruijn graph. This graph is used in order to link the aligned short reads, and thus correct the uncovered regions of the long reads. This method allows to process reads displaying error rates as high as 44%, and scales better to larger genomes, while allowing to reduce the runtime of the error correction, compared to the most efficient state-of-the-art tools.Finally, the third objectif of this thesis is the error correction of extremely long reads. To this aim, aself-correction tool was developed, by combining, once again, different methologies from the state-of-the-art. More precisely, an overlapping strategy, and a two phases error correction process, using multiple sequence alignement and local de Bruijn graphs, are used. In order to allow this method to scale to extremely long reads, the aforementioned segmentation strategy was generalized. This self-correction methods allows to process reads reaching up to 340 000 base pairs, and manages to scale very well to complex organisms such as the human genome.
Document type :
Theses
Complete list of metadatas

Cited literature [161 references]  Display  Hide  Download

https://tel.archives-ouvertes.fr/tel-02320413
Contributor : Abes Star <>
Submitted on : Friday, October 18, 2019 - 4:39:07 PM
Last modification on : Thursday, November 28, 2019 - 4:06:27 AM

File

pierremorisse2.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-02320413, version 1

Citation

Pierre Morisse. Correction de données de séquençage de troisième génération. Bio-informatique [q-bio.QM]. Normandie Université, 2019. Français. ⟨NNT : 2019NORMR043⟩. ⟨tel-02320413⟩

Share

Metrics

Record views

202

Files downloads

74