Evaluating Deep Learning Methods for Word Segmentation of Scripta Continua Texts in Old French and Latin - Histoire et Sources des Mondes antiques
Journal article, Journal of Data Mining and Digital Humanities, Year: 2020

Evaluating Deep Learning Methods for Word Segmentation of Scripta Continua Texts in Old French and Latin

French title: Évaluer les méthodes de deep learning pour la segmentation des mots de textes en scripta continua en ancien français et en latin

Abstract

Tokenization of modern and old Western European languages seems fairly simple, as it relies mostly on the presence of markers such as spaces and punctuation. However, when dealing with old sources such as manuscripts written in scripta continua, ancient epigraphy, or medieval manuscripts, (1) such markers are mostly absent and (2) spelling variation and rich morphology make dictionary-based approaches difficult. Applying convolutional encoding to characters, followed by linear classification of each character as either a word boundary or part of an in-word sequence, is shown to be effective at tokenizing such inputs. Additionally, the software is released with a simple interface for tokenizing a corpus or generating a training set.
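The per-character framing described in the abstract can be illustrated with a minimal Python sketch. It only covers the data side: deriving an unspaced input and its labels from a spaced line, and rebuilding spaced text from labels. The function names, the 0/1 label values, and the choice to mark the last character of each word as the boundary are assumptions made for illustration, not the released software's actual interface or encoding. In a real pipeline the labels passed to `apply_labels` would come from the trained classifier.

```python
from typing import List, Tuple

# Hypothetical label values, chosen for this sketch only.
BOUNDARY = 1  # character is the last one of a word
IN_WORD = 0   # character is followed by more characters of the same word


def make_training_pair(spaced: str) -> Tuple[str, List[int]]:
    """Turn a space-separated line into (unspaced text, per-character labels)."""
    chars: List[str] = []
    labels: List[int] = []
    for word in spaced.split():
        for i, ch in enumerate(word):
            chars.append(ch)
            # Only the final character of each word is tagged as a boundary.
            labels.append(BOUNDARY if i == len(word) - 1 else IN_WORD)
    return "".join(chars), labels


def apply_labels(text: str, labels: List[int]) -> str:
    """Re-insert spaces after every character labelled as a boundary."""
    if len(text) != len(labels):
        raise ValueError("text and labels must have the same length")
    out: List[str] = []
    for ch, label in zip(text, labels):
        out.append(ch)
        if label == BOUNDARY:
            out.append(" ")
    # Drop the trailing space added after the final word.
    return "".join(out).strip()


if __name__ == "__main__":
    text, labels = make_training_pair("in principio erat verbum")
    print(text)
    print(labels)
    print(apply_labels(text, labels))
```

Framing segmentation this way avoids any dictionary lookup, which is the property the abstract relies on when spelling varies from one source to the next.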
Main file
article.pdf (1.03 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-02154122, version 1 (12-06-2019)
hal-02154122, version 2 (05-04-2020)

License

Attribution - ShareAlike

Identifiers

HAL Id: hal-02154122
DOI: 10.46298/jdmdh.5581

Cite

Thibault Clérice. Evaluating Deep Learning Methods for Word Segmentation of Scripta Continua Texts in Old French and Latin. Journal of Data Mining and Digital Humanities, 2020, 2020, ⟨10.46298/jdmdh.5581⟩. ⟨hal-02154122v2⟩
698 views
1521 downloads
