Real-time text extraction based on the page layout analysis system

Mahmoud Soua; Alae Benchekroun; Rostom Kachouri; Mohamed Akil

doi:10.1117/12.2262364

Conference paper, Year: 2017


Mahmoud Soua

Role: Author
PersonId: 777295
IdRef: 202644111

Laboratoire d'Informatique Gaspard-Monge

Alae Benchekroun

Role: Author

Laboratoire d'Informatique Gaspard-Monge

Rostom Kachouri

Role: Author
PersonId: 7786
IdHAL: rostom-kachouri
IdRef: 148346499

Laboratoire d'Informatique Gaspard-Monge

Mohamed Akil

Role: Author
PersonId: 172163
IdHAL: mohamed-akil
ORCID: 0000-0001-9029-2163
IdRef: 118787039

Laboratoire d'Informatique Gaspard-Monge

Abstract

Several approaches have been proposed to extract text from scanned documents. However, text extraction in heterogeneous documents remains a real challenge. Indeed, text extraction in this context is a difficult task because of the variation of the text due to differences in size, style and orientation, as well as the complexity of the document region background. Recently, we proposed the improved hybrid binarization based on K-means method (I-HBK) [5] to suitably extract the text from heterogeneous documents. In this method, the Page Layout Analysis (PLA), part of the Tesseract OCR engine, is used to identify text and image regions. Afterwards, our hybrid binarization is applied separately to each kind of region. On the one hand, gamma correction is applied to image regions before they are processed. On the other hand, binarization is performed directly on text regions. Then, a foreground and background color study is performed to correct inverted region colors. Finally, characters are located in the binarized regions based on the PLA algorithm. In this work, we extend the integration of the PLA algorithm within the I-HBK method. In addition, to speed up the text and image separation step, we employ an efficient GPU acceleration. Through the performed experiments, we demonstrate the high accuracy of the PLA algorithm, whose F-measure reaches 95% on the LRDE dataset. In addition, we compare the sequential and parallel versions of the PLA. The obtained results show a speedup of 3.7x for the parallel PLA implementation on a GTX 660 GPU over the CPU version.
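
The per-region steps named in the abstract (gamma correction on image regions, K-means binarization, correction of inverted colors) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' I-HBK implementation: the function names, the default gamma of 2.0, the 1-D two-cluster K-means on gray levels, and the rule that text covers less area than background are all hypothetical choices. The Tesseract PLA region detection and the GPU acceleration are left out.

```python
import numpy as np


def gamma_correct(gray, gamma=2.0):
    """Apply power-law (gamma) correction to an 8-bit grayscale array."""
    normalized = gray.astype(np.float64) / 255.0
    return np.rint(255.0 * normalized ** gamma).astype(np.uint8)


def kmeans_binarize(gray, iterations=20):
    """Two-cluster 1-D K-means on gray levels; True marks the darker cluster."""
    values = gray.astype(np.float64).ravel()
    low, high = values.min(), values.max()  # initial centroids
    if low == high:
        # Flat region: nothing to separate.
        return np.zeros(gray.shape, dtype=bool)
    for _ in range(iterations):
        threshold = (low + high) / 2.0
        new_low = values[values <= threshold].mean()
        new_high = values[values > threshold].mean()
        if new_low == low and new_high == high:
            break  # centroids stopped moving
        low, high = new_low, new_high
    return gray <= (low + high) / 2.0


def fix_polarity(text_mask):
    """Assumption: text covers less area than background; flip otherwise."""
    return ~text_mask if text_mask.mean() > 0.5 else text_mask


def binarize_region(gray, is_image_region, gamma=2.0):
    """Gamma-correct image regions only, then binarize and fix polarity."""
    if is_image_region:
        gray = gamma_correct(gray, gamma)
    return fix_polarity(kmeans_binarize(gray))


# Synthetic 8x8 region: light background with a dark 2x4 "text" patch.
region = np.full((8, 8), 200, dtype=np.uint8)
region[2:4, 2:6] = 30
text_mask = binarize_region(region, is_image_region=False)
inverted_mask = binarize_region(255 - region, is_image_region=False)
```

In this sketch the same eight pixels are marked as text for both the original and the color-inverted region, which is the role the abstract assigns to the foreground and background color study.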

Keywords

Text extraction; Heterogeneous documents; Tesseract; Layout analysis; PLA; I-HBK; GPU

Domains

Hardware Architecture [cs.AR]; Embedded Systems; Document and Text Processing

Main file

pla_v2.pdf (982.92 KB)

Origin: Files produced by the author(s)

Contributor: Rostom Kachouri

https://hal.science/hal-01525503

Submitted on: Sunday, May 21, 2017, 14:02:34

Last modified on: Thursday, March 28, 2024, 03:28:50

Long-term archiving on: Wednesday, August 23, 2017, 10:58:56

Dates and versions

hal-01525503, version 1 (21-05-2017)

Identifiers

HAL Id: hal-01525503, version 1
DOI: 10.1117/12.2262364

Cite

Mahmoud Soua, Alae Benchekroun, Rostom Kachouri, Mohamed Akil. Real-time text extraction based on the page layout analysis system. SPIE Conference on Real-Time Image and Video Processing, Apr 2017, Anaheim, CA, United States. ⟨10.1117/12.2262364⟩. ⟨hal-01525503⟩


Collections

ENPC; CNRS; LIGM_A3SI; PARISTECH; LIGM; ESIEE-PARIS; UNIV-EIFFEL; JSE2024

377 views

2605 downloads
