Tesseract with weird/stylized fonts?

Joseph Moskie Sat, 02 Jun 2012 20:45:54 -0700

I've just started messing around with OCR applications and have found
Tesseract to be pretty awesome and useful. Recently, I've tried to use it
to read text in an odd font, specifically the one used in the menus of
Diablo 3. The font is heavily stylized, which is messing up Tesseract
quite a bit. For example, the "O"s in the font have a cross in them, which
Tesseract converts into "¤".

Here's a similar font, for reference:

<http://www.fontshark.com/map/b00a9cf5a09501844bba7dacee6dbfff.font>

And here's some sample output from running Tesseract:

00m¤N¤0N 000m;; 320 12,000,0000 1s,000,000• 1h17m
KILLER causmaas 313 9,999,9990 15,000,0000 7h40m  
WANDER Dssmuce 293 15,000,0000 l5,000,0000 11h 29m FI
LIVING sruunsas 290 900,0000 15,000,0000 22h 15m

It's pulling in the numbers just fine, but the words are pretty garbled.

I've played with different settings and tried a few different things (such
as enlarging the source TIFF image), but in general nothing has helped all
that much.

I've only just started looking into "training" Tesseract, and my initial
impression is that training is meant for adding new languages or a custom
set of words, rather than for teaching it to read a "weird" font in normal
English.

Is there any easy way to simply tell Tesseract "this is what all the 
letters look like in this font", so that it can know that an O with a + 
inside is really just an O?

Is there a flag or setting where I can specify "only normal A-Z characters 
and numbers 0-9"?
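
The setting asked about here most closely matches Tesseract's
tessedit_char_whitelist variable. Below is a minimal sketch of passing it
on the command line from Python. It assumes a Tesseract build whose CLI
accepts "-c VAR=VALUE" (older builds may need the variable in a config file
instead), and the function names and file names are hypothetical.

```python
import string
import subprocess

# Characters Tesseract is allowed to output: A-Z and 0-9 only.
DEFAULT_WHITELIST = string.ascii_uppercase + string.digits


def build_tesseract_command(image_path, output_base, whitelist=DEFAULT_WHITELIST):
    """Return the argument list for a Tesseract run restricted to `whitelist`."""
    return [
        "tesseract",
        image_path,
        output_base,  # Tesseract writes its text to <output_base>.txt
        "-c",
        f"tessedit_char_whitelist={whitelist}",
    ]


def run_tesseract(image_path, output_base, whitelist=DEFAULT_WHITELIST):
    """Run Tesseract; assumes the `tesseract` binary is on PATH."""
    subprocess.run(build_tesseract_command(image_path, output_base, whitelist), check=True)
```

With the sample data above, the whitelist would also need "h", "m", and ","
added so that values like 7h40m and 15,000,000 survive. A whitelist only
limits which characters can be output; it does not teach Tesseract the
shapes of the font.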
I'm looking into how I can use training to do this, and also considering
some post-processing on the text (like replacing ¤ with O), but I think
getting Tesseract to produce a better initial parse would be the better
way to go.
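
For reference, the post-processing idea could be sketched in Python as
follows. The substitution table is a hypothetical starting point containing
only the one mapping mentioned above (¤ to O); other misread glyphs would
have to be added by hand after inspecting the output.

```python
# Hypothetical table of glyphs Tesseract emits for stylized letters.
SUBSTITUTIONS = {"¤": "O"}


def clean_ocr_text(text, substitutions=SUBSTITUTIONS):
    """Replace each known misread character with the letter it stands for."""
    return text.translate(str.maketrans(substitutions))


print(clean_ocr_text("00m¤N¤0N"))
```

This only fixes one-to-one character swaps. It would not repair words such
as "causmaas", where whole letters were misrecognized.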

Any tips or tricks would be very helpful, and much appreciated.
