On Fri, Oct 24, 2014 at 1:45 AM, Ryan Dev <[email protected]> wrote:

> Hi, I have what I think is a unique situation, and I was hoping I could
> get some hints on how to proceed.
>
> I have problem font files for which I want to fix the Unicode mappings.
> I also have PDF files that use these fonts, so contextual semantics are
> also available.
>
> Currently I draw all the glyphs to an image, and run OCR on them. However,
> there are always issues in just about every test.
>
> The most common problems are:
> 1. Lower-case and upper-case Latin o's being mixed up with zero.
> 2. Upper-case Latin I, lower-case Latin L, and the number one being mixed
> up.
> 3. Characters "randomly" getting broken up. So instead of Latin upper-case
> H, I get two vertical bars and a hyphen.
>


> IMO these (1. and 2.) are general problems, not specific to OCR: these
> letters are difficult to distinguish in some fonts. You can increase the
> chance of identifying them correctly by putting them in some context (e.g.
> words). But as I understand it, you are trying to avoid that. Maybe you can
> post some example image, so
>
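To illustrate what "context" could look like without running OCR on whole pages, here is a minimal Python sketch. It assumes you can already recover word-sized strings (for example from the glyph runs in your PDFs) and only re-labels the ambiguous characters from problems 1 and 2 by majority vote of the unambiguous characters in the same word. The names `DIGIT_FOR`, `LETTER_FOR` and `resolve_confusables` are made up for this example and are not part of tesseract.

```python
# Hypothetical confusion tables, based only on problems 1 and 2 above.
DIGIT_FOR = {"O": "0", "o": "0", "I": "1", "l": "1"}   # letter -> look-alike digit
LETTER_FOR = {"0": "O", "1": "l"}                       # digit -> look-alike letter


def resolve_confusables(word: str) -> str:
    """Re-label ambiguous characters using the rest of the word as context.

    Only characters that are NOT in a confusion table are counted as evidence.
    If the evidence is tied (or absent), the word is returned unchanged.
    """
    digits = sum(1 for c in word if c.isdigit() and c not in LETTER_FOR)
    letters = sum(1 for c in word if c.isalpha() and c not in DIGIT_FOR)
    if digits > letters:
        table = DIGIT_FOR      # mostly digits: treat look-alike letters as digits
    elif letters > digits:
        table = LETTER_FOR     # mostly letters: treat look-alike digits as letters
    else:
        return word
    return "".join(table.get(c, c) for c in word)


if __name__ == "__main__":
    for sample in ("2O14", "he1lo", "O0"):
        print(sample, "->", resolve_confusables(sample))
```

This is deliberately crude: it cannot tell upper-case O from lower-case o, or I from l, when it maps a digit back to a letter, so a dictionary or the glyph metrics would still be needed for that last step.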


> Performance is very important, so I would like to avoid having to do OCR
> on full pages or text (such as paragraphs or words), and instead just work
> with the font itself.
>
> One approach I was thinking of is skipping the whole image raster step,
> since I already have vector data. Would it not be beneficial to simply hook
> into tesseract and pass my vector data directly to some later stage
> (features?) in tesseract?
>
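I can't point to a supported entry point in tesseract that accepts vector outlines, so take the following only as a sketch of the step that would have to come first: turning a curved glyph outline into a plain polygon. It assumes TrueType-style quadratic Bezier segments (CFF/Type 1 fonts use cubic curves and would need a different formula). The function name `flatten_quadratic` and the `steps` parameter are invented for this example.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def flatten_quadratic(p0: Point, p1: Point, p2: Point, steps: int = 8) -> List[Point]:
    """Sample a quadratic Bezier segment into polyline vertices.

    p0 is the start point, p1 the off-curve control point, p2 the end point.
    Returns steps + 1 points, including both end points.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    points: List[Point] = []
    for i in range(steps + 1):
        t = i / steps
        # Bernstein weights of a degree-2 Bezier curve.
        a = (1 - t) ** 2
        b = 2 * (1 - t) * t
        c = t * t
        points.append((a * p0[0] + b * p1[0] + c * p2[0],
                       a * p0[1] + b * p1[1] + c * p2[1]))
    return points


if __name__ == "__main__":
    for x, y in flatten_quadratic((0, 0), (1, 2), (2, 0), steps=4):
        print(f"{x:.3f} {y:.3f}")
```

Whether feeding such polygons into a later stage actually beats rasterizing and running the normal pipeline is something you would have to measure.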
> I am comfortable with C++, etc, so please feel free to point me to source
> code I should be interested in.
>
> Thanks!
>
>
>  --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/4540d666-3110-46d5-8f31-208ebc475de0%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/4540d666-3110-46d5-8f31-208ebc475de0%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>
