Here is an example of glyphs from one font. The upper case i is OCR'd as lower case L, and the lower case L is OCR'd as a vertical bar '|':
<https://lh6.googleusercontent.com/-q3kSpzpaOfg/VFO9HAwqqUI/AAAAAAAAAAk/y70T5yE_x7g/s1600/FPDGJB%2BDKFrutiger-Bold80HL.tiff>

In an earlier post [1] it was recommended to repeat the string, but this rarely, if ever, improved the results, and it was not worth the added CPU time.

I guess #3 is really my biggest concern. #1 and #2 are not huge, but #3 is very annoying. Here is an image where the upper case M gets OCR'd as "|\/|":

<https://lh3.googleusercontent.com/-DkaktR6xDfo/VFO_o3jkNnI/AAAAAAAAAAw/4OWW3vs6soY/s1600/FPDGJA%2BDekaFrutiger45Light.tiff>

As for full-page OCR, I've been using VietOCR.Net for testing, and I confirmed that full-page OCR does not break up the M. But of course the processing time is orders of magnitude longer. What I would really like to do is skip the whole image analysis step: I already have the glyph paths in vector form, so I don't want tesseract to chop the glyphs.

[1] https://groups.google.com/forum/#!searchin/tesseract-ocr/from$3Ame/tesseract-ocr/K_CHA_DGO-Y/8l7qLOtua7EJ
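For reference, here is a minimal sketch of the kind of invocation I have in mind, with the caveat that I have not confirmed it fixes the split M. It only builds the command line and does not run anything. The helper name `build_tesseract_command` and the file name are made up for illustration. It assumes page segmentation mode 10 (treat the image as a single character) and the legacy-engine variable `chop_enable`, which turns the chopper off when set to 0. Older 3.x builds expect `-psm` instead of `--psm`, and I am assuming the installed build accepts `stdout` as the output base.

```python
def build_tesseract_command(image_path, psm=10, disable_chopper=True,
                            psm_flag="--psm", output_base="stdout"):
    """Build (but do not run) a tesseract command line for one glyph image.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    # Basic form: tesseract <image> <outputbase> [options...]
    cmd = ["tesseract", image_path, output_base, psm_flag, str(psm)]
    if disable_chopper:
        # chop_enable is a legacy-engine variable; 0 disables the chopper.
        cmd += ["-c", "chop_enable=0"]
    return cmd


if __name__ == "__main__":
    # Print the command so it can be inspected or pasted into a shell.
    print(" ".join(build_tesseract_command("glyph_M.tiff")))
```

The resulting list can be passed to `subprocess.run` once the flags are confirmed against the installed version.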

