Hi, I have what I think is a unique situation, and I was hoping I could get some hints on how to proceed.
I have some problematic font files whose Unicode mappings I want to fix. I also have PDF files that use these fonts, so contextual semantics are available too. Currently I draw all the glyphs to an image and run OCR on it. However, there are issues in just about every test. The most common problems are:

1. Lowercase and uppercase Latin "o" being mixed up with zero.
2. Uppercase Latin "I", lowercase Latin "l", and the digit one being mixed up.
3. Characters "randomly" getting broken up: instead of an uppercase Latin "H", I get two vertical bars and a hyphen.

Performance is very important, so I would like to avoid running OCR on full pages or running text (paragraphs, words) and instead work with just the font itself.

One approach I was considering is to skip the image rasterization step entirely, since I already have vector data. Would it not be beneficial to hook into Tesseract and pass my vector data directly to some later stage (features?) of its pipeline? I am comfortable with C++, so please feel free to point me to the source code I should look at.

Thanks!
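To make the "contextual semantics" idea concrete, here is a rough sketch of how neighbouring characters from the PDF text could narrow down the confusable cases (o/O/0 and I/l/1) after OCR. It is only an illustration under assumptions: the function name, the confusion groups, and the `(previous_char, next_char)` context format are all hypothetical, none of it is Tesseract API, and it assumes the neighbouring characters are already mapped reliably. It only separates digits from letters; telling "o" from "O" would still need something else, such as glyph height.

```python
# Hypothetical confusion groups, taken from the problems listed above.
CONFUSABLE_GROUPS = [frozenset("oO0"), frozenset("Il1")]


def narrow_candidates(ocr_guess, contexts):
    """Narrow an OCR guess for one glyph using its neighbours in the PDF text.

    ocr_guess: the character OCR returned for the glyph.
    contexts:  (previous_char, next_char) pairs, one per occurrence of the
               glyph in the PDFs; use "" where there is no neighbour.
    Returns the set of characters that remain plausible.
    """
    group = next((g for g in CONFUSABLE_GROUPS if ocr_guess in g), None)
    if group is None:
        return {ocr_guess}  # not a known confusable: keep the OCR result

    # Count how often the glyph sits next to digits versus letters.
    digit_votes = letter_votes = 0
    for prev_ch, next_ch in contexts:
        for neighbor in (prev_ch, next_ch):
            if neighbor.isdigit():
                digit_votes += 1
            elif neighbor.isalpha():
                letter_votes += 1

    if digit_votes > letter_votes:
        return {c for c in group if c.isdigit()}
    if letter_votes > digit_votes:
        return {c for c in group if c.isalpha()}
    return set(group)  # tie or no usable context: cannot narrow further
```

For example, a glyph that OCR reads as "O" but that only ever appears between digits would be narrowed to "0", while a glyph with mixed or no context is left undecided.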

