Re: [tesseract-ocr] Passing glyph vector data directly to tesseract

Ryan Dev Fri, 31 Oct 2014 10:10:16 -0700

Here is an example of glyphs from one font.

The upper-case I is OCR'd as a lower-case L, and the lower-case L is OCR'd
as a vertical bar '|'.

<https://lh6.googleusercontent.com/-q3kSpzpaOfg/VFO9HAwqqUI/AAAAAAAAAAk/y70T5yE_x7g/s1600/FPDGJB%2BDKFrutiger-Bold80HL.tiff>

In an earlier post [1] it was recommended to repeat the string, but this
rarely, if ever, improved the results, and was not worth the added CPU time.

Really, #3 is my biggest concern. #1 and #2 are minor, but #3 is
very annoying. Here is an image where the upper-case M gets OCR'd as
"|\/|".

<https://lh3.googleusercontent.com/-DkaktR6xDfo/VFO_o3jkNnI/AAAAAAAAAAw/4OWW3vs6soY/s1600/FPDGJA%2BDekaFrutiger45Light.tiff>

As for full-page OCR, I've been using VietOCR.Net for testing, and
confirmed that full-page OCR does not break up the M. But of course the
processing time is orders of magnitude longer.

What I would really like to do is skip the whole image analysis part, since
I already have the glyph paths in vector form, so I don't want tesseract to
chop.
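
The closest thing I know of that can be driven from the command line is
single-character page segmentation (mode 10) with the chopper variable
chop_enable set to 0. Below is a minimal sketch of that invocation. It assumes
a tesseract build that accepts "-c var=value", and it assumes the older "-psm"
spelling of the flag (newer releases spell it "--psm"). The helper names
build_tesseract_cmd and ocr_glyph are made up for illustration, and I have not
verified that this keeps the M from being split.

```python
import subprocess


def build_tesseract_cmd(image_path, output_base, psm=10, chop=False,
                        legacy_psm_flag=True):
    """Build the argv list for one single-glyph tesseract run.

    Hypothetical helper, not part of tesseract itself.
    """
    # Older releases use "-psm"; newer ones use "--psm".
    psm_flag = "-psm" if legacy_psm_flag else "--psm"
    cmd = ["tesseract", image_path, output_base, psm_flag, str(psm)]
    # chop_enable is the variable that switches the chopper on or off.
    cmd += ["-c", "chop_enable=%d" % (1 if chop else 0)]
    return cmd


def ocr_glyph(image_path, output_base):
    """Run tesseract on one rendered glyph image and return the text.

    Assumes the tesseract binary is on PATH and writes <output_base>.txt.
    """
    subprocess.check_call(build_tesseract_cmd(image_path, output_base))
    with open(output_base + ".txt") as f:
        return f.read().strip()
```

Calling ocr_glyph("glyph.tiff", "glyph_out") would run one glyph image through
tesseract. This still rasterizes the glyph first, so it only reduces the
segmentation work; it does not pass vector paths to the recognizer.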

[1]
https://groups.google.com/forum/#!searchin/tesseract-ocr/from$3Ame/tesseract-ocr/K_CHA_DGO-Y/8l7qLOtua7EJ
