On Wed, Jun 13, 2012 at 6:16 AM, Chris <[email protected]> wrote:

> For a project I want to recognize the text taken from screenshots from
> programs and games.
>
> I have a lot of assumed knowledge which should help me with the
> recognition:
>
> - The font used is usually Arial 12pt, plus one or two others.
> - Background is (usually) black
> - Font can be different colors, including white
> - Vocabulary is quite limited and definitely exhaustive (subset of
> English).
> - Aside from punctuation, no non-English characters are used.
>
> Out of the box, accuracy is nowhere near acceptable, so I need to
> train it manually.
>
> How should I train this?
> Should I be using the actual screenshots, where the text is quite small,
> pixelly and not well spaced (font size 12 is not big) or should I make my
> own training images in every possible color on black background in a bigger
> font size with larger spacing? If I attempt to generate a box file it is
> absolute rubbish.
> Do I need to preprocess the images to black/white or should color be fine?
> Note that anti-aliased screenshots don't look good when flattened to
> black/white.
> Should I be worried about images, lines, borders and other non-text
> elements in my screenshots? Will tesseract gracefully skip these?
>
> Any advice specifically to my situation would help.
>
>
As always: if you really need help, then provide an example image.
Written descriptions of problems are useless without an image/test
case.
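
In the meantime, one thing that is commonly tried for light or colored
text on a black background is to turn it into dark text on a light
background and enlarge it before running OCR. The sketch below only
illustrates that idea and is not a tested recipe for these screenshots.
Everything in it is an assumption: the function names are hypothetical,
the image is a plain list of rows of (R, G, B) tuples, and the
nearest-neighbour scaling stands in for the smoother resampling an image
library would normally provide. It keeps the gray levels instead of
forcing a hard black/white threshold, so anti-aliased edges survive.

```python
# Illustrative sketch only; names and data layout are assumptions.
# An image is a list of rows, each row a list of (R, G, B) tuples (0-255).

def to_dark_on_light(rgb_rows):
    """Convert colored-on-black pixels to dark-on-light grayscale.

    The brightest channel is used as the intensity, so saturated red or
    blue text ends up as dark as white text after inversion.
    """
    return [[255 - max(pixel) for pixel in row] for row in rgb_rows]


def upscale(gray_rows, factor):
    """Enlarge a grayscale image by an integer factor (nearest neighbour)."""
    out = []
    for row in gray_rows:
        # Repeat each pixel horizontally, then repeat the row vertically.
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def prepare(rgb_rows, factor=3):
    """Invert to dark-on-light, then enlarge; gray levels are preserved."""
    return upscale(to_dark_on_light(rgb_rows), factor)
```

Whether this helps at all depends on the actual images, which is why a
sample screenshot is still needed.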

-- 
Zdenko

