[tesseract-ocr] Tesseract confused between a character and a digit which look-alike

'Yash Mistry' via tesseract-ocr Tue, 07 Jun 2022 00:50:44 -0700


I am having trouble getting Tesseract to extract the correct character when a 
letter and a digit look alike, e.g. 5 & S, I & 1, 8 & S.


I applied image pre-processing techniques (blurring, erosion, dilation, 
noise normalisation, removal of unnecessary components) and ran text 
detection on the input image, but even after this much pre-processing 
Tesseract OCR isn't giving the correct result.

Please check the attached images:

*Original Image*


*[image: image.png]*

*Pre-processed Image*

[image: image (1).png]

*Detected Text*


*[image: image (2).png]*


*[image: image (3).png]*

*Tesseract Configuration*

-l eng --oem 1 --psm 7 -c 
tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n" 
load_system_dawg=false load_freq_dawg=false
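
For anyone reproducing this from Python, the configuration above can be sketched roughly as follows. This assumes pytesseract and Pillow are installed; `build_config` and `run_ocr` are hypothetical helper names, not part of any library. The sketch passes each variable with its own `-c` flag (the Tesseract CLI accepts `-c VAR=VALUE` multiple times) and leaves the literal `\n` out of the whitelist; both are choices made for this sketch and may differ from the exact command used above.

```python
import shlex

# Whitelist from the post, without the trailing "\n" (an assumption of this sketch).
WHITELIST = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"


def build_config(whitelist: str = WHITELIST) -> str:
    """Build a Tesseract option string with one -c flag per variable."""
    options = [
        "--oem", "1",   # LSTM engine
        "--psm", "7",   # treat the image as a single text line
        "-c", f"tessedit_char_whitelist={whitelist}",
        "-c", "load_system_dawg=false",
        "-c", "load_freq_dawg=false",
    ]
    return " ".join(shlex.quote(opt) for opt in options)


def run_ocr(image_path: str) -> str:
    """Run OCR on one image; image_path is a hypothetical file name."""
    # Imported lazily so build_config() can be used without these packages.
    import pytesseract
    from PIL import Image

    text = pytesseract.image_to_string(
        Image.open(image_path), lang="eng", config=build_config()
    )
    return text.strip()
```

Calling `run_ocr("preprocessed.png")` (a placeholder path) would return the recognised line as a string.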

*Result of OCR*: TITLENUMBER 81003716

As we can see, the OCR reads S as 8 even after pre-processing and text 
detection.

Is there any way to overcome this problem?
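
One workaround that might apply, if the number has a fixed layout, is to correct look-alike characters after OCR based on their position. The sketch below is only an illustration: the pattern `A9999999` (one letter followed by seven digits) is an assumption inferred from the single result above, and `correct_lookalikes` is a hypothetical helper. Only the pairs listed in this post (5 & S, I & 1, 8 & S) are handled.

```python
# Look-alike pairs from the post, split by direction of correction.
DIGIT_TO_LETTER = {"5": "S", "8": "S", "1": "I"}
# "S" could stand for 5 or 8; picking 5 here is arbitrary for this sketch.
LETTER_TO_DIGIT = {"I": "1", "S": "5"}


def correct_lookalikes(text: str, pattern: str) -> str:
    """Swap look-alike characters where they contradict the expected pattern.

    pattern uses "A" for a position expected to hold a letter and "9" for a
    position expected to hold a digit. Text of a different length than the
    pattern is returned unchanged.
    """
    if len(text) != len(pattern):
        return text
    corrected = []
    for ch, kind in zip(text, pattern):
        if kind == "A" and ch.isdigit():
            corrected.append(DIGIT_TO_LETTER.get(ch, ch))
        elif kind == "9" and ch.isalpha():
            corrected.append(LETTER_TO_DIGIT.get(ch, ch))
        else:
            corrected.append(ch)
    return "".join(corrected)


# The number from the OCR result above, under the assumed pattern.
print(correct_lookalikes("81003716", "A9999999"))  # prints S1003716
```

This only helps when the expected format is known in advance; it does not change what Tesseract itself recognises.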

*Tesseract Version*: tesseract 5.1.0-32-gf36c0

Note: I asked the same question in the pytesseract GitHub repo and was 
advised to post it here.

