On Fri, Oct 24, 2014 at 1:45 AM, Ryan Dev <[email protected]> wrote:
> Hi, I have what I think is a unique situation, and I was hoping I could
> get some hints on how to proceed.
>
> I have problem font files, for which I want to fix the Unicode mappings.
> I also have PDF files with these fonts, so I also have contextual
> semantics available.
>
> Currently I draw all the glyphs to an image, and run OCR on them. However,
> there are always issues in just about every test.
>
> The most common problems are
> 1. lower case and upper case latin o's being mixed up with zero
> 2. upper case latin i and lower case latin L, and number one being mixed
> up
> 3. Characters "randomly" getting broken up. So instead of latin upper case
> H, I get two vertical bars and a hyphen.
>

IMO these (1. and 2.) are general (not only OCR) problems: these letters
are difficult to distinguish in some fonts. You can increase the chance of
identifying them correctly by putting them in some context (e.g. words). But
as I understand it, you are trying to avoid that. Maybe you can post some
example image, so

> Performance is very important, so I would like to avoid having to do OCR
> on full page/text (such as paragraphs, words), and instead just work with
> the font itself.
>
> One approach I was thinking, is skipping the whole image raster steps,
> since I already have vector data. Would it not be beneficial to simply hook
> in to tesseract and pass my vector data directly to some later stage
> (features?) in tesseract.
>
> I am comfortable with C++, etc, so please feel free to point me to source
> code I should be interested in.
>
> Thanks!
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/4540d666-3110-46d5-8f31-208ebc475de0%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
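To make the point about context a bit more concrete: since you already have
the PDFs, one option might be to use the glyph sequences from the PDF text
itself as the context, without running OCR on whole words or pages. The
sketch below is only an illustration of that idea and is not Tesseract code.
All names in it (resolve_glyphs, classify_context, the glyph ids and the
candidate tables) are hypothetical. It assumes you already have, per
problematic glyph, the set of confusable characters from your single-glyph
OCR pass, plus a set of glyphs that were recognized without ambiguity.

```python
from collections import Counter


def classify_context(neighbors):
    """Return 'digit', 'upper' or 'lower', whichever is most common among
    the given characters, or None if none of them fall in those classes."""
    counts = Counter()
    for ch in neighbors:
        if ch.isdigit():
            counts["digit"] += 1
        elif ch.isupper():
            counts["upper"] += 1
        elif ch.islower():
            counts["lower"] += 1
    return counts.most_common(1)[0][0] if counts else None


def resolve_glyphs(words, known, candidates):
    """Vote on ambiguous glyphs using the words they occur in.

    words      -- list of words, each a list of glyph ids (from the PDF text)
    known      -- glyph id -> character, for glyphs resolved unambiguously
    candidates -- glyph id -> {'digit': ..., 'upper': ..., 'lower': ...},
                  the confusable options reported for that glyph
    """
    votes = {gid: Counter() for gid in candidates}
    for word in words:
        for i, gid in enumerate(word):
            if gid not in candidates:
                continue
            # Only neighbors that are already resolved contribute context.
            neighbors = [known[g] for j, g in enumerate(word)
                         if j != i and g in known]
            cls = classify_context(neighbors)
            if cls in candidates[gid]:
                votes[gid][candidates[gid][cls]] += 1
    # Glyphs that never received a vote are left out of the result.
    return {gid: c.most_common(1)[0][0] for gid, c in votes.items() if c}


# Made-up example data: glyph 10 is O/o/0, glyph 11 is I/l/1.
known = {1: "2", 2: "4", 3: "h", 4: "e", 5: "5"}
candidates = {
    10: {"digit": "0", "upper": "O", "lower": "o"},
    11: {"digit": "1", "upper": "I", "lower": "l"},
}
words = [[1, 10, 2], [5, 10, 10, 1], [3, 4, 11, 11]]

result = resolve_glyphs(words, known, candidates)
print(result)  # {10: '0', 11: 'l'}
```

This is a crude heuristic. A capital I at the start of an otherwise
lower-case word would be voted as lower-case L, for example, so it only
becomes useful when the votes are collected over many words. It does not
help with your problem 3 (characters being broken up) at all.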

