[tesseract-ocr] Guidance in OCR of screenshot with small font

Kristofer Johansson Fri, 18 Nov 2016 07:35:09 -0800

Hello!

I have spent a couple of days getting familiar with Tesseract, but the
more I learn, the more I realize how much there is to it. I am posting
here in the hope that more experienced users/devs can point me in the
right direction, so I don't spend days barking up the wrong tree.

What I want to do:
*OCR exactly one word, with no spaces, from a screenshot.

The good:
*I know exactly where this word will appear so I can feed a bitmap to
Tesseract with the word centered and as much free space around it as
desired. There is of course no skew or slant, just plain, straight
text.
*The word is actually not a real word but just 1-5 random capital letters
used to ID goods containers. The format is therefore known.
*I can easily create a dictionary of all acceptable "words" in this
"language". (I have an Excel file of all container IDs.)
*The font is simple: just "even lines" with no serifs or fancy stuff.
However, it is not monospaced.
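
For illustration, the known format and the ID list could serve as a check
on whatever the OCR returns. The sketch below is only an assumption about
how that might look: KNOWN_IDS is a made-up sample standing in for the
Excel export, match_container_id is a hypothetical helper, and the 0.6
cutoff is an arbitrary starting value, not a tuned one.

```python
import difflib
import re

# Hypothetical sample; the real list would come from the Excel export.
KNOWN_IDS = ["AB", "ABCD", "KQZ", "MTRX", "ZZTOP"]

# The stated format: 1-5 capital letters, nothing else.
ID_PATTERN = re.compile(r"[A-Z]{1,5}")


def normalize(raw):
    """Strip surrounding whitespace and upper-case the OCR output."""
    return raw.strip().upper()


def match_container_id(raw, known_ids=KNOWN_IDS, cutoff=0.6):
    """Return a known ID for the OCR text, or None if nothing fits."""
    text = normalize(raw)
    if not ID_PATTERN.fullmatch(text):
        return None  # wrong length or non-letter characters
    if text in known_ids:
        return text  # exact hit
    # Otherwise fall back to the single most similar known ID, if any.
    close = difflib.get_close_matches(text, known_ids, n=1, cutoff=cutoff)
    return close[0] if close else None
```

Snapping to the nearest ID can silently pick the wrong container when two
IDs differ by one letter, so rejecting anything that is not an exact match
may be the safer default.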

The bad:
*The font is small. Only about 8px in height.

What I have done so far:
*Using the Leptonica utility provided with Capture2Text
<http://capture2text.sourceforge.net/>, I have pre-processed the BMP with
pretty much the default values that Capture2Text uses: scaling it 3.5x,
inverting the colors and converting it to black and white. (The original
is white on dark grey.)
*I have created a conf file with only English capital letters to use as a
whitelist.
*I pass the resulting TIF to Tesseract using the whitelist and -psm 8
(single word).
*I have NOT yet applied a dictionary since I want to try out the other
parameters first to optimize them and then put in the dictionary last.
*I don't specify a language, so I think Tesseract is using the default
(English).
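
The steps above can be sketched roughly as follows. This is a pure-Python
stand-in for illustration, not what Leptonica or Capture2Text actually do:
the nearest-neighbour scaling, the threshold of 128 and all function names
are assumptions. The whitelist is passed with -c
tessedit_char_whitelist=... instead of a conf file, and newer Tesseract
versions (4.x) spell the page segmentation option --psm rather than -psm.

```python
def upscale(pixels, factor):
    """Nearest-neighbour upscale of a 2D list of grey values (0-255)."""
    height, width = len(pixels), len(pixels[0])
    out_h, out_w = int(height * factor), int(width * factor)
    return [
        [pixels[int(y / factor)][int(x / factor)] for x in range(out_w)]
        for y in range(out_h)
    ]


def invert(pixels):
    """Turn white-on-dark into dark-on-white."""
    return [[255 - value for value in row] for row in pixels]


def binarize(pixels, threshold=128):
    """Map every grey value to pure black (0) or pure white (255)."""
    return [[255 if value >= threshold else 0 for value in row]
            for row in pixels]


def preprocess(pixels, factor=3.5):
    """Scale, invert, then binarize, in the order described above."""
    return binarize(invert(upscale(pixels, factor)))


def tesseract_command(image_path, whitelist):
    """Build (but do not run) a Tesseract 3.x single-word command line."""
    return [
        "tesseract", image_path, "stdout",
        "-psm", "8",
        "-c", "tessedit_char_whitelist=" + whitelist,
    ]
```

The returned list could be handed to subprocess.run once the preprocessed
image has been written to disk with an imaging library of choice.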

With these steps I get a pretty good but not perfect result. What I find
especially interesting is that Leptonica seems to handle the preprocessing
differently depending on whether the original BMP is shifted a pixel left
or right, even though there is plenty of space around the word. That seems
strange to me, and it makes the results inconsistent.
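
One possible contributor to this shift sensitivity (a guess, not something
verified against Leptonica) is the non-integer 3.5x factor: source pixels
cannot all map to the same number of output pixels, so which columns come
out wider depends on where the glyph sits. The sketch below shows the
effect with plain nearest-neighbour scaling of a single row; upscale_row
and stroke_width are hypothetical helpers, and the factor is written as
the fraction 7/2 to keep the arithmetic exact.

```python
def upscale_row(row, num, den):
    """Nearest-neighbour upscale of one pixel row by the factor num/den."""
    out_len = len(row) * num // den
    # Output pixel x takes its value from source pixel floor(x * den / num).
    return [row[x * den // num] for x in range(out_len)]


def stroke_width(position, row_len=8, num=7, den=2):
    """Width, after scaling, of a 1 px wide stroke placed at `position`."""
    row = [0] * row_len
    row[position] = 1
    return sum(upscale_row(row, num, den))


if __name__ == "__main__":
    for pos in (2, 3):
        print(pos, stroke_width(pos), stroke_width(pos, num=4, den=1))
```

Under these assumptions a stroke in column 2 becomes 4 px wide at 3.5x
while the same stroke in column 3 becomes 3 px wide, whereas an integer
factor of 4 gives 4 px in both positions. An interpolating scaler followed
by a threshold may show a milder version of the same dependence, so trying
an integer factor would be a cheap experiment.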

So, could anyone with experience or thoughts on screen captured OCR comment
on this and send me off in the right direction to further optimize this?
*Training Tesseract, somehow?
*Should I use ImageMagick instead of Leptonica for more consistent results?
Use different parameters/functions with Leptonica? Recommendations?
*Other things to consider?
*Most of the information I have found applies primarily to people
scanning tons of documents. Since this application is somewhat different,
with different problems, I would like to hear from anyone here who has
done something similar and how they got it to work.

FYI:
The input comes from another program running on a computer, and I have no
way of accessing the text programmatically, copying it to the clipboard or
similar, so let's focus on a solution with OCR. I can't control how it is
rendered on screen either.

Thank you for your time!
/Kris
