Hi Chris,

I should preface my answer by saying that I haven't done anything
similar, so some of what I say will be guesswork based on experience.

I'll answer your questions inline, and not in the order you wrote
them ;)

On Tue, Jun 12, 2012 at 09:16:52PM -0700, Chris wrote:
> For a project I want to recognize the text taken from screenshots from 
> programs and games.

What game? Sounds interesting. Why aren't you "just" locating the
area in memory where it stores the information?

> I have a lot of assumed knowledge which should help me with the recognition:
> 
> - The font used is usually arial 12pt, plus one or two others.
> - Background is (usually) black
> - Font can be different colors, including white
> - Vocabulary is quite limited and definitely exhaustive (subset of english).
> - Aside from interpuntion, no non-english characters are used.
> 
> Out of the box this does not perform anywhere near acceptable accuracy so I 
> need to manually train.

I suspect training won't help much here, whereas preprocessing
will, as you're recognising a common font. Training doesn't encode
colour or things like that; it primarily encodes the shape of the
letters, so the English training should be fine.

The exception to this is the vocabulary. If many of the words which
will come up are unlikely to be in the dictionary, it would probably
be worth using a user_words file with the extra words. Indeed, if
there isn't much overlap between your vocabulary and the English
dictionary, I would probably unpack the English training, remove the
word lists, then pack it back up. But leave these until you have
recognition working reasonably; word lists are helpful, but they
won't suddenly turn garbage into sense.
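
A minimal sketch of collecting the extra words, assuming the word list
is plain text with one word per line. The function name, the sample
strings and the output file name are made up for illustration; where
the file has to live and how it is enabled depends on the Tesseract
version, so check that separately.

```python
import re


def build_user_words(texts):
    """Collect the distinct words seen in sample texts."""
    words = set()
    for text in texts:
        # Alphabetic runs, allowing an internal apostrophe or hyphen.
        words.update(re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", text))
    # Sort case-insensitively so the file is stable between runs.
    return sorted(words, key=lambda w: (w.lower(), w))


if __name__ == "__main__":
    # Hypothetical in-game strings; replace with the real vocabulary.
    samples = ["Quest accepted: Slay the wyvern", "The wyvern drops loot"]
    with open("game.user-words", "w", encoding="ascii") as f:
        f.write("\n".join(build_user_words(samples)) + "\n")
```

Since the vocabulary is small and known in advance, the list could
equally be written by hand; the script only saves retyping.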

> Do I need to preprocess the images to black/white or should color be fine? 
> Note that anti-aliased screenshots dont look well when flattened to 
> black/white.

I think you should pre-process it to black-and-white, yes. Another
pre-processing step which will likely have a big impact is resizing
the screenshot to be much bigger; Tesseract works best at around
300 DPI, and a screenshot is usually well below that.
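
A rough sketch of those two steps in plain Python, working on a list
of rows of (R, G, B) tuples. Everything here is an assumption rather
than a tested recipe: the function names are made up, the factor of 3
assumes a screen at roughly 96 DPI, the threshold of 128 is a guess,
and the output is inverted to black text on white on the assumption
that this is the safer input. In practice an imaging library would
replace the loops. Scaling up smoothly before thresholding may also
soften the anti-aliasing problem, since the threshold then cuts
through a finer edge.

```python
def to_gray(rgb_rows):
    """Brightness as max(R, G, B), so coloured text stays bright."""
    return [[max(px) for px in row] for row in rgb_rows]


def upscale(gray, factor):
    """Bilinear upscaling of a 2D list of grey values by an integer factor."""
    h, w = len(gray), len(gray[0])
    out = []
    for y in range(h * factor):
        # Map the output pixel centre back into source coordinates.
        sy = min(max((y + 0.5) / factor - 0.5, 0.0), h - 1)
        y0 = int(sy)
        y1 = min(y0 + 1, h - 1)
        fy = sy - y0
        row = []
        for x in range(w * factor):
            sx = min(max((x + 0.5) / factor - 0.5, 0.0), w - 1)
            x0 = int(sx)
            x1 = min(x0 + 1, w - 1)
            fx = sx - x0
            top = gray[y0][x0] * (1 - fx) + gray[y0][x1] * fx
            bottom = gray[y1][x0] * (1 - fx) + gray[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


def binarise(gray, threshold=128):
    """Bright (text) pixels become 0 (black), the rest 255 (white)."""
    return [[0 if v >= threshold else 255 for v in row] for row in gray]


def preprocess(rgb_rows, factor=3, threshold=128):
    """Grey-scale, enlarge, then threshold a screenshot region."""
    return binarise(upscale(to_gray(rgb_rows), factor), threshold)
```

Taking the maximum channel rather than a weighted luminance is a
deliberate choice for coloured text on black: pure blue or red text
would otherwise come out dark and could fall below the threshold.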

> Should I be worried about images, lines, borders and other non-text 
> elements in my screenshots? Will tesseract gracefully skip these?

Try and see. I would guess you should try to remove them, but I
don't have experience of this myself.

Nick

