On Wed, Jun 13, 2012 at 6:16 AM, Chris wrote:
> For a project I want to recognize text in screenshots taken from
> programs and games.
>
> I have a lot of prior knowledge which should help with the
> recognition:
>
> - The font used is usually Arial 12pt, plus one or two others.
> - Background is (usually) black.
> - Font can be different colors, including white.
> - Vocabulary is quite limited and definitely exhaustive (a subset of
>   English).
> - Aside from punctuation, no non-English characters are used.
>
> Out of the box Tesseract does not reach anywhere near acceptable
> accuracy, so I need to train it manually.
>
> How should I train this?
> Should I be using the actual screenshots, where the text is quite small,
> pixelated and not well spaced (font size 12 is not big), or should I make
> my own training images in every possible color on a black background in a
> bigger font size with larger spacing? If I attempt to generate a box file,
> it is absolute rubbish.
> Do I need to preprocess the images to black/white, or should color be fine?
> Note that anti-aliased screenshots don't look good when flattened to
> black/white.
> Should I be worried about images, lines, borders and other non-text
> elements in my screenshots? Will Tesseract gracefully skip these?
>
> Any advice specific to my situation would help.

As always: if you really need help, then provide an example image. Written
descriptions of problems are useless without an image or test case.

--
Zdenko
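For the preprocessing question (colored, anti-aliased text on a black background), here is a minimal sketch of one option to try. It is not a confirmed fix. It takes the brightest channel of each pixel so that text of any color becomes equally bright, inverts the result to dark-on-light, and keeps the gray levels instead of flattening to pure black/white. It then enlarges the image by pixel replication. The function names are hypothetical, the image is assumed to be a nested list of (R, G, B) tuples, and whether any of this improves recognition is an assumption to check against real screenshots.

```python
# Sketch only: hypothetical helpers, standard library only.
# An image is a list of rows; each row is a list of (R, G, B) tuples, 0..255.

def to_dark_on_light(rgb_rows):
    """Convert colored-on-black pixels to grayscale dark-on-light.

    max(R, G, B) treats red, green or white text as equally bright;
    subtracting from 255 inverts it. Anti-aliased gray levels are kept.
    """
    return [[255 - max(r, g, b) for (r, g, b) in row] for row in rgb_rows]


def upscale(gray_rows, factor):
    """Enlarge a grayscale image by an integer factor (pixel replication)."""
    out = []
    for row in gray_rows:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


# Tiny 2x2 example: black background, red text, gray anti-aliasing, white text.
screenshot = [
    [(0, 0, 0), (255, 0, 0)],
    [(128, 128, 128), (255, 255, 255)],
]
gray = to_dark_on_light(screenshot)
big = upscale(gray, 2)
```

In practice an image library with smoother interpolation would likely be a better choice than pixel replication. The sketch only shows the order of operations.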

