Conference paper, Year: 2015

Improved Hybrid Binarization based on Kmeans for Heterogeneous document processing

Abstract

Nowadays, more and more scanned documents are converted into editable electronic representations. This process relies on the Optical Character Recognition (OCR) tool-chain. Generally, an OCR system depends on an important binarization step that separates character strokes from the document background. In this context, one of the more robust binarization methods is the recently proposed Hybrid Binarization based on Kmeans (HBK). It effectively handles scanned documents that include text on a simple background. Nevertheless, in heterogeneous documents, HBK runs into issues when extracting foreground text from complex background images. Moreover, HBK assumes a dark foreground against a light background; otherwise, it fails to render correct binarization colors. In this paper, we propose to improve the HBK method so that it handles heterogeneous documents efficiently. Our proposal, Improved HBK (I-HBK), employs a layout analysis process that classifies document regions into text and image. Image regions are enhanced with Gamma Correction (GC) before HBK binarization. Text regions are processed directly with HBK, preserving its effectiveness on text with a homogeneous background. To ensure robust color rendering in the binarized documents, independent of the input polarity, we control the labeling polarity of text and background through a pixel density-based technique. Our experiments on the LRDE and ICDAR datasets demonstrate that I-HBK outperforms HBK on heterogeneous documents in both F-measure and OCR accuracy.
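The processing order described in the abstract (optional gamma correction for image regions, clustering-based binarization, then polarity control) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' I-HBK implementation: the binarizer is a plain 1-D k-means with k = 2 on gray levels rather than the hybrid HBK method, the `gamma` value of 0.5 is arbitrary, the `is_image_region` flag stands in for the layout analysis step, and the polarity rule (text is assumed to cover fewer pixels than background) is only one possible reading of "pixel density-based". All function names are hypothetical.

```python
def gamma_correct(image, gamma):
    """Apply power-law (gamma) correction to 8-bit gray levels."""
    return [[round(255 * (p / 255) ** gamma) for p in row] for row in image]


def two_means_threshold(image, iterations=20):
    """Cluster gray levels into two groups with 1-D k-means (k = 2).

    Returns the midpoint between the two centroids, used as a global threshold.
    """
    pixels = [p for row in image for p in row]
    lo, hi = float(min(pixels)), float(max(pixels))  # initial centroids
    for _ in range(iterations):
        dark = [p for p in pixels if abs(p - lo) <= abs(p - hi)]
        bright = [p for p in pixels if abs(p - lo) > abs(p - hi)]
        if not dark or not bright:
            break  # uniform region: nothing to separate
        new_lo = sum(dark) / len(dark)
        new_hi = sum(bright) / len(bright)
        if (new_lo, new_hi) == (lo, hi):
            break  # centroids stopped moving
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2


def binarize(image, threshold):
    """Label pixels darker than the threshold as 1, all others as 0."""
    return [[1 if p < threshold else 0 for p in row] for row in image]


def normalize_polarity(mask):
    """Swap labels when label 1 covers more than half of the pixels.

    Assumption: text occupies fewer pixels than background, so the
    minority label is treated as text (1) regardless of input polarity.
    """
    total = sum(len(row) for row in mask)
    ones = sum(sum(row) for row in mask)
    if ones * 2 > total:
        return [[1 - v for v in row] for row in mask]
    return mask


def binarize_region(image, is_image_region=False, gamma=0.5):
    """Binarize one region; image regions are gamma-corrected first."""
    if is_image_region:
        image = gamma_correct(image, gamma)
    threshold = two_means_threshold(image)
    return normalize_polarity(binarize(image, threshold))


# Dark text on a light background, and the inverted case.
dark_on_light = [[250, 250, 250, 250],
                 [250, 20, 30, 250],
                 [250, 250, 250, 250]]
light_on_dark = [[10, 10, 10, 10],
                 [10, 240, 230, 10],
                 [10, 10, 10, 10]]

mask_a = binarize_region(dark_on_light)
mask_b = binarize_region(light_on_dark)
```

In this sketch both inputs yield the same mask, with the two stroke pixels labeled 1, which is the kind of polarity-independent output the abstract aims for. A real pipeline would replace the flag with an actual text/image classifier and the global threshold with the HBK procedure from the paper.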
Main file
ISPA15_IHBK(accepté).pdf (7.01 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-01309993 , version 1 (01-05-2016)

Identifiers

Cite

Mahmoud Soua, Rostom Kachouri, Mohamed Akil. Improved Hybrid Binarization based on Kmeans for Heterogeneous document processing. 9th International Symposium on Image and Signal Processing and Analysis, ISPA'15, Sep 2015, Zagreb, Croatia. pp.210-215, ⟨10.1109/ISPA.2015.7306060⟩. ⟨hal-01309993⟩