[tesseract-ocr] Passing glyph vector data directly to tesseract

Ryan Dev Fri, 24 Oct 2014 00:20:08 -0700

Hi, I have what I think is a unique situation, and I was hoping I could get 
some hints on how to proceed.

I have problematic font files whose Unicode mappings I want to fix. I also
have PDF files that use these fonts, so contextual semantics are available
as well.

Currently I draw all the glyphs to an image and run OCR on it. However,
issues show up in just about every test.

The most common problems are:
1. Lower-case and upper-case Latin o's being mixed up with zero.
2. Upper-case Latin i, lower-case Latin L, and the number one being mixed up.
3. Characters "randomly" getting broken up. So instead of an upper-case
Latin H, I get two vertical bars and a hyphen.
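
To make problems 1 and 2 concrete, here is a minimal Python sketch of how
the OCR results could be split into trusted and ambiguous ones, so that only
the ambiguous glyphs need a second look using the PDF context. The names
(CONFUSABLE_CLASSES, flag_ambiguous) and the glyph-id keys are hypothetical
and not part of tesseract; the classes only cover the confusions listed above.

```python
# Hypothetical confusion classes, taken from problems 1 and 2 above.
CONFUSABLE_CLASSES = [
    frozenset("oO0"),  # Latin o / O and the digit zero
    frozenset("Il1"),  # upper-case i, lower-case L and the digit one
]


def confusion_class(ch):
    """Return the confusion class containing ch, or None if there is none."""
    for cls in CONFUSABLE_CLASSES:
        if ch in cls:
            return cls
    return None


def flag_ambiguous(ocr_results):
    """Split {glyph_id: recognized_char} into trusted and ambiguous results.

    Ambiguous glyphs map to the sorted list of candidate characters,
    which would then have to be resolved some other way.
    """
    trusted, ambiguous = {}, {}
    for glyph_id, ch in ocr_results.items():
        cls = confusion_class(ch)
        if cls is None:
            trusted[glyph_id] = ch
        else:
            ambiguous[glyph_id] = sorted(cls)
    return trusted, ambiguous
```

This does not fix anything by itself; it only narrows down which glyphs
cannot be trusted from a single-glyph OCR result.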

Performance is very important, so I would like to avoid running OCR on
full pages or runs of text (such as paragraphs or words), and instead work
with just the font itself.

One approach I was considering is to skip the image rasterization step
entirely, since I already have vector data. Would it not be beneficial to
hook into tesseract and pass my vector data directly to some later stage
(features?) in tesseract?
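
For illustration, this is roughly the kind of data I could hand over: a
glyph contour flattened into a polyline. It is a minimal Python sketch that
assumes TrueType-style quadratic Bezier segments; flatten_quadratic and its
steps parameter are hypothetical names, and whether any stage of tesseract
can accept outlines like this is exactly what I am asking.

```python
def flatten_quadratic(p0, p1, p2, steps=8):
    """Approximate a quadratic Bezier curve with straight line segments.

    p0 and p2 are the on-curve end points, p1 is the off-curve control
    point, each given as an (x, y) tuple. Returns steps + 1 points.
    """
    points = []
    for i in range(steps + 1):
        t = i / steps
        u = 1.0 - t
        # Standard quadratic Bezier: B(t) = u^2*p0 + 2*u*t*p1 + t^2*p2
        x = u * u * p0[0] + 2 * u * t * p1[0] + t * t * p2[0]
        y = u * u * p0[1] + 2 * u * t * p1[1] + t * t * p2[1]
        points.append((x, y))
    return points
```

A full contour would be the concatenation of such polylines, one per curve
segment, with straight segments passed through unchanged.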

I am comfortable with C++, so please feel free to point me to the source
code I should be looking at.

Thanks!

