Conference papers

Improved Hybrid Binarization based on Kmeans for Heterogeneous document processing

Abstract: Scanned documents are increasingly converted into editable electronic representations. This process relies on the Optical Character Recognition (OCR) tool-chain. An OCR system generally depends on a binarization step that separates character strokes from the document background. In this context, one of the more robust binarization methods is the recently proposed Hybrid Binarization based on Kmeans (HBK). It effectively handles scanned documents that contain text on a simple background. Nevertheless, on heterogeneous documents, HBK runs into problems when extracting foreground text from complex background images. Moreover, HBK assumes a dark foreground against a light background; otherwise, it fails to render correct binarization colors. In this paper, we propose an improved HBK method (I-HBK) that handles heterogeneous documents efficiently. Our proposal employs a layout analysis process that classifies document regions into text and image. Image regions are enhanced with Gamma Correction (GC) before HBK binarization. Text regions are processed directly with HBK, preserving its effectiveness on text with a homogeneous background. To ensure robust, polarity-independent color rendering in the binarized documents, we control the labeling polarity of text and background through a pixel density-based technique. Our experiments on the LRDE and ICDAR datasets show that I-HBK outperforms HBK on heterogeneous documents in both F-measure and OCR accuracy.
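The pipeline described in the abstract (gamma correction for image regions, clustering-based binarization, density-based polarity control) can be illustrated with a minimal sketch. This is not the authors' HBK or I-HBK implementation: a plain two-cluster k-means on pixel intensities stands in for HBK, the layout analysis result is passed in as a flag, and the gamma value, the function names, and the rule "the less dense cluster is text" are all assumptions made for illustration.

```python
import numpy as np


def gamma_correct(gray, gamma=2.0):
    """Power-law gamma correction of an 8-bit grayscale image (gamma value is an assumption)."""
    normalized = gray.astype(np.float64) / 255.0
    return np.rint(255.0 * normalized ** gamma).astype(np.uint8)


def two_means_threshold(gray, iterations=50):
    """1-D k-means with k=2 on intensities; returns the midpoint of the two centroids.

    Stand-in for HBK, whose details are not given in the abstract.
    """
    values = gray.astype(np.float64).ravel()
    low, high = values.min(), values.max()
    if low == high:  # flat image: nothing to separate
        return low
    for _ in range(iterations):
        threshold = (low + high) / 2.0
        new_low = values[values <= threshold].mean()
        new_high = values[values > threshold].mean()
        if new_low == low and new_high == high:  # centroids converged
            break
        low, high = new_low, new_high
    return (low + high) / 2.0


def binarize_region(gray, is_image_region=False, gamma=2.0):
    """Binarize one region so that text is always 0 (black) and background 255 (white)."""
    if is_image_region:
        # Image regions are enhanced before clustering, as in the abstract.
        gray = gamma_correct(gray, gamma)
    dark_mask = gray <= two_means_threshold(gray)
    # Assumed density rule: text occupies fewer pixels than background,
    # so the minority cluster is labeled as text regardless of its brightness.
    text_mask = dark_mask if dark_mask.mean() <= 0.5 else ~dark_mask
    return np.where(text_mask, 0, 255).astype(np.uint8)


# Dark stroke on a light page, and the inverted case (light stroke on a dark page).
page = np.full((20, 20), 200, dtype=np.uint8)
page[8:12, 2:18] = 30
inverted = np.full((20, 20), 30, dtype=np.uint8)
inverted[8:12, 2:18] = 220

normal_result = binarize_region(page)
inverted_result = binarize_region(inverted)
```

Under these assumptions both inputs yield the same output polarity: the stroke is rendered black on white whether the source had dark-on-light or light-on-dark text. The minority-cluster rule would fail on regions where text covers more than half of the pixels, so it should be read only as one possible density criterion.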

Cited literature: 18 references
Contributor: Rostom Kachouri
Submitted on: Sunday, May 1, 2016 - 1:57:55 PM
Last modification on: Saturday, January 15, 2022 - 3:56:07 AM
Long-term archiving on: Tuesday, May 24, 2016 - 4:25:34 PM


Mahmoud Soua, Rostom Kachouri, Mohamed Akil. Improved Hybrid Binarization based on Kmeans for Heterogeneous document processing. 9th International Symposium on Image and Signal Processing and Analysis, ISPA'15, Sep 2015, Zagreb, Croatia. pp.210-215, ⟨10.1109/ISPA.2015.7306060⟩. ⟨hal-01309993⟩


