In watching Bryan Tarpley's Franken+ presentation (http://emop.tamu.edu/node/54), it's pretty obvious from the example that there are (at least) two clusters of glyphs for the letter 'o': a tall skinny glyph and a round glyph.
<https://lh3.googleusercontent.com/-ToHeDSJQWeM/UqN1FLyrbLI/AAAAAAAAAlI/A_rOElvihYM/s1600/franken-ocr-os.PNG>

Attempting to extract a single set of features for a classifier to use is likely to be problematic. I don't know whether Tesseract enforces a strict 1:1 mapping between glyphs and Unicode code points, but if it does, perhaps one workaround would be to train "skinny o" and "normal o" to two different code points. Of course, that just kicks the problem down the pipeline a bit, because now all the lexical letter-frequency data will be wrong and need adjusting. One could instead train them as different fonts, but then you'd run afoul of the rules about how unlikely a font is to change mid-word.

Anyone have other ideas? It seems like this task requires fundamentally different approaches to training and recognition, because it violates a whole set of (very reasonable) assumptions that a modern OCR engine has built into it. Is anyone attacking this problem at a more fundamental level than just tweaking Tesseract training? Are there other groups doing research in this area besides eMOP and IMPACT?

Tom

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to tesseract-ocr@googlegroups.com
To unsubscribe from this group, send email to tesseract-ocr+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en
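To make the two-code-point idea concrete, here is a minimal sketch (my own illustration, not anything Tesseract provides) of the post-processing step it would require: assuming the skinny variant were trained to some private-use code point (U+E000 is an arbitrary choice here), you'd fold it back to a plain 'o' after recognition so downstream dictionary and letter-frequency tools see ordinary text.

```python
# Hypothetical cleanup for the two-code-point workaround.
# Assumption: "skinny o" was trained to the private-use code
# point U+E000; "normal o" stays as ordinary 'o'.

SKINNY_O = "\ue000"  # assumed private-use code point for the skinny glyph

# Translation table folding glyph-variant code points back to canonical letters.
GLYPH_MAP = str.maketrans({SKINNY_O: "o"})

def normalize_ocr_text(text: str) -> str:
    """Collapse glyph-variant code points back to their canonical letters."""
    return text.translate(GLYPH_MAP)
```

This only patches up the final text, of course; the language model inside the engine would still be scoring words that contain the private-use code point, which is exactly the "kicks the problem down the pipeline" issue above.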