In watching Bryan Tarpley's Franken+ presentation (http://emop.tamu.edu/node/54), it's pretty obvious from the example that there are (at least) two clusters of glyphs for the letter 'o': a tall skinny glyph and a round glyph.
<https://lh3.googleusercontent.com/-ToHeDSJQWeM/UqN1FLyrbLI/AAAAAAAAAlI/A_rOElvihYM/s1600/franken-ocr-os.PNG>

Attempting to extract a single set of features for a classifier to use is likely to be problematic. I don't know whether Tesseract enforces a strict 1:1 mapping between glyphs and Unicode code points, but if it does, perhaps one workaround would be to train "skinny o" and "normal o" to two different code points. Of course, that just kicks the problem down the pipeline a bit, because now all the lexical letter-frequency data will be wrong and need adjusting. One could instead train them as different fonts, but then you'd run afoul of the rules about how unlikely a font is to change mid-word.

Anyone have other ideas? It seems like this task requires fundamentally different approaches to training and recognition, because it violates a whole set of (very reasonable) assumptions that a modern OCR engine has built into it. Is anyone attacking this problem at a more fundamental level than just tweaking Tesseract training? Are there other groups doing research in this area besides eMOP and IMPACT?

Tom

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to tesseract-ocr@googlegroups.com
To unsubscribe from this group, send email to tesseract-ocr+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en
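To make the two-code-point idea concrete, here is a minimal sketch (my own illustration, not anything Tesseract provides) of the post-processing step it would require: assuming the skinny variant were trained to some private-use code point (U+E000 is an arbitrary choice here), you'd fold it back to a plain 'o' after recognition so downstream dictionary and letter-frequency tools see ordinary text.

```python
# Hypothetical cleanup for the two-code-point workaround.
# Assumption: "skinny o" was trained to the private-use code
# point U+E000; "normal o" stays as ordinary 'o'.

SKINNY_O = "\ue000"  # assumed private-use code point for the skinny glyph

# Translation table folding glyph-variant code points back to canonical letters.
GLYPH_MAP = str.maketrans({SKINNY_O: "o"})

def normalize_ocr_text(text: str) -> str:
    """Collapse glyph-variant code points back to their canonical letters."""
    return text.translate(GLYPH_MAP)
```

This only patches up the final text, of course; the language model inside the engine would still be scoring words that contain the private-use code point, which is exactly the "kicks the problem down the pipeline" issue above.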