Normalizing speech transcriptions for Natural Language Processing

Abstract : Researchers working on spoken text processing face specific problems that stem from the nature of the data. In particular, spoken texts are full of disfluencies, which pose practical problems for automatic analysis. Using a corpus of almost 500,000 words from the Valibel textual data bank of spontaneous spoken French (http://www.uclouvain.be/valibel.html), we study four types of disfluency: repetitions, word fragments, immediate self-corrections, and the word euh, known as a "filled pause". In this paper, we show how these four types of disfluency were automatically preprocessed in texts. The principle was to annotate the part of the disfluency called the reparandum (following the terminology of Shriberg 1994), so that only the repair part is kept.
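The principle stated in the abstract (mark the reparandum, keep the repair) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `annotate` and `normalize`, the use of a trailing "/" to mark word fragments, and whitespace tokenization are all assumptions made for illustration. It handles only adjacent repetitions, marked fragments and the filled pause euh; immediate self-corrections need more context than this sketch uses and are left out.

```python
FILLED_PAUSE = "euh"   # the filled pause named in the abstract
FRAGMENT_MARK = "/"    # hypothetical marker for truncated words


def annotate(tokens):
    """Return (token, is_reparandum) pairs for a list of tokens."""
    annotated = []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        is_reparandum = (
            tok.lower() == FILLED_PAUSE                         # filled pause
            or tok.endswith(FRAGMENT_MARK)                      # word fragment
            or (nxt is not None and tok.lower() == nxt.lower())  # repetition
        )
        annotated.append((tok, is_reparandum))
    return annotated


def normalize(text):
    """Drop every token annotated as reparandum and keep the repair."""
    tokens = text.split()
    return " ".join(tok for tok, is_rep in annotate(tokens) if not is_rep)


print(normalize("je je euh vais par/ partir"))  # prints "je vais partir"
```

Keeping the annotation step separate from the deletion step means the original transcription stays recoverable, which matches the idea of annotating rather than erasing the reparandum.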
Document type :
Conference papers

https://hal-upec-upem.archives-ouvertes.fr/hal-00866252
Contributor : Matthieu Constant
Submitted on : Thursday, September 26, 2013 - 2:06:47 PM
Last modification on : Tuesday, June 5, 2018 - 10:10:04 AM

Identifiers

  • HAL Id : hal-00866252, version 1

Citation

Anne Dister, Matthieu Constant, Gérald Prunelle. Normalizing speech transcriptions for Natural Language Processing. 3rd International Conference on Spoken Communication (GSCP'09), Università degli Studi di Napoli L'Orientale, Feb 2009, Naples, Italy. pp. 507-520. ⟨hal-00866252⟩
