Hi Chris,

I should preface my answer by saying that I haven't done anything similar, so some of what I say will be guesswork based on experience.
I'll answer your questions inline, and not in the order you wrote them ;)

On Tue, Jun 12, 2012 at 09:16:52PM -0700, Chris wrote:
> For a project I want to recognize the text taken from screenshots from
> programs and games.

What game? Sounds interesting. Why aren't you "just" locating the area in memory where it stores the information?

> I have a lot of assumed knowledge which should help me with the recognition:
>
> - The font used is usually arial 12pt, plus one or two others.
> - Background is (usually) black
> - Font can be different colors, including white
> - Vocabulary is quite limited and definitely exhaustive (subset of english).
> - Aside from punctuation, no non-english characters are used.
>
> Out of the box this does not perform anywhere near acceptable accuracy so I
> need to manually train.

I suspect training won't help much here, whereas preprocessing will. You're recognising a common font, and training doesn't encode colour or things like that; it primarily encodes the shape of the letters, so the English training should be fine.

The exception to this is the vocabulary. If many of the words which will come up are unlikely to be in the dictionary, it would probably be worth using a user_words file with the extra words. Indeed, if there wasn't much overlap between the vocabulary used and the English dictionary, I would probably unpack the English training, remove the word lists, then pack it back up. But leave these until you have the basic recognition working reasonably; word lists are helpful, but they won't suddenly turn garbage into sense.

> Do I need to preprocess the images to black/white or should color be fine?
> Note that anti-aliased screenshots don't look well when flattened to
> black/white.

I think you should pre-process it to black-and-white, yes. Another pre-processing step which will likely have a big impact is resizing the screenshot to be much bigger; Tesseract works best with text at around 300 DPI.
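To make the two preprocessing steps concrete, here is a rough sketch in plain Python, operating on an image held as rows of (R, G, B) tuples. Everything specific in it is an assumption rather than something tested against Tesseract: the 96 DPI figure for a typical screen, the threshold of 128, taking the strongest colour channel as brightness (so that coloured text on black is treated like white text), and flipping the result to black text on white. In practice you would do the same thing with an image library, ideally enlarging with smooth interpolation before thresholding, which may also help with the anti-aliasing problem.

```python
# Sketch only: the pixel layout, DPI figures and threshold are assumptions.

def scale_factor(source_dpi=96, target_dpi=300):
    """How much to enlarge a screenshot so its text is at roughly target_dpi."""
    return target_dpi / source_dpi


def brightness(pixel):
    """Brightness of an (R, G, B) pixel as its strongest channel, 0..255.

    max() keeps saturated coloured text (e.g. pure blue) as bright as
    white text, which a weighted luminance would not.
    """
    return max(pixel)


def binarize(rows, threshold=128):
    """Turn light-on-dark RGB rows into 0 (text) / 255 (background)."""
    return [[0 if brightness(p) >= threshold else 255 for p in row]
            for row in rows]


def upscale(rows, factor):
    """Nearest-neighbour enlargement by an integer factor."""
    out = []
    for row in rows:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


# Tiny example: one black background pixel, one blue "text" pixel.
screenshot = [[(0, 0, 0), (0, 0, 255)]]
factor = round(scale_factor())          # 300 / 96 is about 3
prepared = upscale(binarize(screenshot), factor)
```

The `prepared` rows would then be written out as an image and passed to Tesseract; whether 128 is a sensible threshold for your screenshots is something to find by experiment.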
> Should I be worried about images, lines, borders and other non-text
> elements in my screenshots? Will tesseract gracefully skip these?

Try and see. I would guess you should try to remove them, but I don't have experience of this myself.

Nick

