[tesseract-ocr] Tesseract mistakes letters for numbers

Eric Hodges Wed, 21 Jul 2021 11:07:14 -0700

I need some help. I have a bunch of images of text like this:

[image: sample_si.jpg]
They are all 200 dpi black-and-white images. In over 50% of cases, 
Tesseract reads the "SI" at the front as digits: most often "51", 
but sometimes "81" or "31".


I've tried tweaking all of the settings I can find, but none of them 
improve the results. I'm currently using a config file like this:

tessedit_char_whitelist ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789
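
For reference, the same whitelist can also be passed on the command line 
with -c instead of a config file. A minimal sketch of building that 
invocation; the helper name build_tesseract_command is hypothetical, and the 
image name is simply the sample attached above:

```python
import string
from typing import List

# Same character set as the config file: uppercase letters, then digits.
WHITELIST = string.ascii_uppercase + string.digits


def build_tesseract_command(image_path: str, whitelist: str = WHITELIST) -> List[str]:
    """Return the argv for a tesseract CLI call with the whitelist set via -c."""
    return [
        "tesseract",
        image_path,
        "stdout",  # output base "stdout" prints the recognized text
        "-c",
        f"tessedit_char_whitelist={whitelist}",
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so no Tesseract install is needed.
    print(" ".join(build_tesseract_command("sample_si.jpg")))
```

The returned list can be handed to subprocess.run as-is once Tesseract is 
on the PATH.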

Interesting fact: if I crop out the digits and send only the letters to 
Tesseract, it recognizes them correctly. Is there something in Tesseract 
that makes it less likely to mix letters and digits in a single word?

Any suggestions?

