Here is an example of glyphs from one font.

The upper case I is OCR'd as a lower case L, and the lower case L is OCR'd 
as a vertical bar '|'.

<https://lh6.googleusercontent.com/-q3kSpzpaOfg/VFO9HAwqqUI/AAAAAAAAAAk/y70T5yE_x7g/s1600/FPDGJB%2BDKFrutiger-Bold80HL.tiff>

An earlier post [1] recommended repeating the string, but this rarely, if 
ever, improved the results, and it was not worth the added CPU time.

Really, #3 is my biggest concern. #1 and #2 are not huge, but #3 is 
very annoying. Here is an image where the upper case M gets OCR'd as 
"|\/|".

<https://lh3.googleusercontent.com/-DkaktR6xDfo/VFO_o3jkNnI/AAAAAAAAAAw/4OWW3vs6soY/s1600/FPDGJA%2BDekaFrutiger45Light.tiff>

As for full-page OCR, I've been using VietOCR.Net for testing, and 
confirmed that full-page OCR does not break up the M. But of course the 
processing time is orders of magnitude longer.

What I would really like to do is skip the whole image analysis part. Since 
I already have the glyph paths in vector form, each image holds exactly one 
glyph, so I don't want Tesseract to chop it.
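
One possible way to try this, as an untested sketch rather than a confirmed fix: run the command-line tool in single-character page segmentation mode (10) and switch off the chopper through the `chop_enable` variable. The helper name `build_tesseract_args` and the file names are made up for illustration, and whether `chop_enable=0` actually stops the M from being split is an assumption. Older 3.x builds spell the flag `-psm`, newer ones `--psm`.

```python
def build_tesseract_args(image_path, output_base, psm=10,
                         disable_chop=True, legacy_flags=True):
    """Build an argument list for the tesseract CLI for single-glyph OCR.

    Hypothetical helper: it only assembles the command and does not run it.
    """
    # Tesseract 3.x uses "-psm"; later releases use "--psm".
    psm_flag = "-psm" if legacy_flags else "--psm"
    args = ["tesseract", image_path, output_base, psm_flag, str(psm)]
    if disable_chop:
        # Assumption: disabling the chopper keeps a single glyph from
        # being split into several characters (for example M into |\/|).
        args += ["-c", "chop_enable=0"]
    return args


if __name__ == "__main__":
    # Print the command so it can be inspected before running it.
    print(" ".join(build_tesseract_args("glyph.tiff", "glyph_out")))
```

The resulting list can be handed to `subprocess.run(args, check=True)` on a machine where tesseract is installed.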

[1] https://groups.google.com/forum/#!searchin/tesseract-ocr/from$3Ame/tesseract-ocr/K_CHA_DGO-Y/8l7qLOtua7EJ
