I've just started messing around with OCR applications and have found Tesseract to be pretty awesome and useful. Recently, I've tried to use it to read text in an odd font, specifically the one used in the menus in Diablo 3. The font is heavily stylized, which is throwing Tesseract off quite a bit. For example, the "O"s in the font have a cross in them, which Tesseract converts into "¤".
Here's a similar font, for reference: <http://www.fontshark.com/map/b00a9cf5a09501844bba7dacee6dbfff.font>

And here's some sample output from running Tesseract:

00m¤N¤0N 000m;; 320 12,000,0000 1s,000,000• 1h17m KILLER causmaas 313 9,999,9990 15,000,0000 7h40m WANDER Dssmuce 293 15,000,0000 l5,000,0000 11h 29m FI LIVING sruunsas 290 900,0000 15,000,0000 22h 15m

It's pulling in the numbers just fine, but the words are pretty garbled. I've played with different settings and tried a few different things (such as enlarging the TIFF source image), but in general nothing has helped all that much.

I've only started looking into "training" Tesseract, and my initial impression is that training is for adding new languages or a custom set of words, and not so much for teaching it how to read a "weird" font in normal English.

Is there any easy way to simply tell Tesseract "this is what all the letters look like in this font", so that it can know that an O with a + inside is really just an O? Is there a flag or setting where I can specify "only normal A-Z characters and numbers 0-9"?

I'm looking into how I can use training to do this, and also considering some post-processing on the text (like replacing ¤ with O), but I think improving Tesseract's initial parse would be the better way to go about it.

Any tips or tricks would be very helpful, and much appreciated.
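For reference, the post-processing fallback I mentioned could look roughly like the sketch below. It is only an illustration: `GLYPH_FIXES`, `ALLOWED`, and `clean_ocr_text` are names I made up, the ¤ to O mapping is the only substitution I've actually observed, and the set of punctuation I keep is an assumption about what the menus contain.

```python
import string

# Substitutions for glyphs Tesseract misreads in this font.
# Only "¤" -> "O" comes from the output above; add more as they turn up.
GLYPH_FIXES = {"¤": "O"}

# Characters to keep after substitution (assumed: letters, digits, and a
# little punctuation that shows up in the numbers and times).
ALLOWED = set(string.ascii_letters + string.digits + ",.:")


def clean_ocr_text(raw: str) -> str:
    """Apply known glyph fixes, then drop anything outside ALLOWED.

    Whitespace is always kept so the column layout survives.
    """
    fixed = raw.translate(str.maketrans(GLYPH_FIXES))
    return "".join(ch for ch in fixed if ch in ALLOWED or ch.isspace())


if __name__ == "__main__":
    print(clean_ocr_text("00m¤N¤0N 000m;; 320 12,000,0000 1s,000,000•"))
```

This only cleans up symbols after the fact. It can't recover letters that were misread as other valid letters (for example "causmaas"), which is why I'd still prefer to fix the recognition step itself.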

