Improved Hybrid Binarization based on Kmeans for Heterogeneous document processing

Abstract : A growing number of scanned documents are converted into editable electronic form. This conversion relies on the Optical Character Recognition (OCR) tool chain. An OCR system generally depends on a binarization step that separates character strokes from the document background. In this context, one of the more robust binarization methods is the recently proposed Hybrid Binarization based on Kmeans (HBK). It effectively handles scanned documents that contain text on a simple background. In heterogeneous documents, however, HBK has difficulty extracting foreground text from complex background images. Moreover, HBK assumes a dark foreground on a light background; when this assumption does not hold, it fails to render the correct binarization colors. In this paper, we propose an improved HBK method (I-HBK) that handles heterogeneous documents efficiently. Our proposal employs a layout analysis process that classifies document regions into text and image. Image regions are enhanced with Gamma Correction (GC) before HBK binarization. Text regions are processed directly with HBK, which preserves its effectiveness on text with a homogeneous background. To ensure robust color rendering in the binarized documents, independent of the input polarity, we control the labeling polarity of text and background through a pixel density-based technique. Our experiments on the LRDE and ICDAR datasets show that I-HBK outperforms HBK on heterogeneous documents in both F-measure and OCR accuracy.
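The abstract does not give implementation details, so the following is only a minimal sketch of two of the steps it names: gamma correction and density-based polarity control. It is not the authors' method. A plain two-cluster k-means on gray levels stands in for HBK, the function names are hypothetical, and the polarity rule (the class with fewer pixels is taken to be text) is an assumption made for illustration.

```python
def gamma_correct(pixels, gamma):
    """Apply power-law gamma correction to 8-bit grayscale values."""
    return [round(255 * (p / 255) ** gamma) for p in pixels]


def two_means_labels(pixels, iterations=20):
    """Split gray levels into two clusters with plain 1-D k-means (k=2).

    This is a stand-in for HBK, not HBK itself.
    Returns 0 for the darker cluster and 1 for the brighter one.
    """
    lo, hi = float(min(pixels)), float(max(pixels))
    for _ in range(iterations):
        low = [p for p in pixels if abs(p - lo) <= abs(p - hi)]
        high = [p for p in pixels if abs(p - lo) > abs(p - hi)]
        if not low or not high:
            break  # degenerate case: only one gray level present
        lo, hi = sum(low) / len(low), sum(high) / len(high)
    return [0 if abs(p - lo) <= abs(p - hi) else 1 for p in pixels]


def binarize_with_polarity(pixels):
    """Binarize so that text is always black (0) on white (255).

    Assumption: text covers fewer pixels than background, so the
    minority cluster is labeled as foreground.
    """
    labels = two_means_labels(pixels)
    ones = sum(labels)
    minority = 1 if ones <= len(labels) - ones else 0
    return [0 if label == minority else 255 for label in labels]


# Dark text on a light background and the inverse case give the same polarity.
dark_on_light = [250] * 8 + [10] * 2
light_on_dark = [10] * 8 + [250] * 2
print(binarize_with_polarity(dark_on_light))
print(binarize_with_polarity(light_on_dark))
```

In this sketch an image region would first go through `gamma_correct` and then `binarize_with_polarity`, while a text region would skip the gamma step. The choice of gamma value is left open here because the abstract does not state one.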

https://hal-upec-upem.archives-ouvertes.fr/hal-01309993
Contributor : Rostom Kachouri
Submitted on : Sunday, May 1, 2016 - 1:57:55 PM
Last modification on : Tuesday, September 10, 2019 - 2:16:01 PM
Long-term archiving on : Tuesday, May 24, 2016 - 4:25:34 PM

File

ISPA15_IHBK(accepté).pdf
Files produced by the author(s)

Citation

Mahmoud Soua, Rostom Kachouri, Mohamed Akil. Improved Hybrid Binarization based on Kmeans for Heterogeneous document processing. 9th International Symposium on Image and Signal Processing and Analysis, ISPA'15, Sep 2015, Zagreb, Croatia. pp.210-215, ⟨10.1109/ISPA.2015.7306060⟩. ⟨hal-01309993⟩