Re: Latin (Roman antiquity!) alphabet training

Falke Thu, 24 May 2012 06:19:06 -0700

> > Is it possible to add the macron glyphs to the already-existing
> > eng.traineddata?  (the Ā, ā, Ē, ē, Ō, ō, Ū, ū)
>
> No, it is not possible (AFAIK).
>
> But you can try training only the missing glyphs and then use (in 3.02)
> "-l eng+missing_glyphs".
>


As far as I've observed, multilingual recognition switches between
languages at the word level, not the glyph level. Tesseract guesses
which language an entire word belongs to and does NOT mix glyphs from
different languages within a single word. In other words, the final
recognized word always consists of glyphs from only ONE language, even
with the "-l lang1+lang2" option.

That would mean "missing_glyphs" has to include the WHOLE alphabet
rather than just the missing diacritics (please correct me if I'm
wrong).
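
To illustrate what I mean, here is a toy Python model of that behaviour. It is NOT Tesseract's code: the alphabets, the scoring rule and the function names are all made up for illustration, and it only assumes that my observation (one language chosen per word) is correct.

```python
import unicodedata

# Toy model only: this is NOT how Tesseract is implemented.
# Each "language" is reduced to the set of glyphs it was trained on.
ENG = set("abcdefghijklmnopqrstuvwxyz")
# Hypothetical "missing_glyphs" language trained on the macron vowels only.
MACRONS = set(unicodedata.normalize("NFC", "āēōū"))


def best_reading(word, alphabet):
    """Read a word with ONE alphabet; unknown glyphs come out as '?'."""
    word = unicodedata.normalize("NFC", word)
    reading = "".join(ch if ch in alphabet else "?" for ch in word)
    score = sum(ch in alphabet for ch in word)  # number of glyphs covered
    return reading, score


def recognize_word(word, alphabets):
    """Choose a single alphabet for the whole word (no per-glyph mixing)."""
    readings = [best_reading(word, alphabet) for alphabet in alphabets]
    return max(readings, key=lambda pair: pair[1])[0]


# Diacritics-only second language: the macrons are lost.
print(recognize_word("rōmānus", [ENG, MACRONS]))        # r?m?nus
# Second language contains the whole alphabet: the word survives.
print(recognize_word("rōmānus", [ENG, ENG | MACRONS]))  # rōmānus
```

In this sketch the diacritics-only set never wins a whole word, because any real word contains more plain letters than macron vowels.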

(I wish there *was* a way to merge a small subset...)

So, how much training is needed for good results? Would I need to
train for normal, bold and italic? A variety of fonts (serif, sans
serif, etc.)? Any recommendations?

Thanks.

