> > Is it possible to add the macron glyphs to the already-existing
> > eng.traineddata? (the Ā, ā, Ē, ē, Ō, ō, Ū, ū)
>
> No, it is not possible (AFAIK).
>
> But you can try training only the missing glyphs and use (in 3.02)
> "-l eng+missing_glyphs"

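For concreteness, here is a rough sketch of what that suggested invocation might look like when driven from a script. It only builds the command line and does not run Tesseract. The file names "page.png" and "page" and the helper build_tesseract_command are made up for illustration, and "missing_glyphs" is the hypothetical traineddata name from the suggestion above, which would have to be trained first.

```python
def build_tesseract_command(image_path, output_base, languages):
    """Return the argv list for a multi-language Tesseract run.

    Hypothetical helper: it only assembles arguments in the form
    'tesseract imagename outputbase -l lang1+lang2'.
    """
    if not languages:
        raise ValueError("at least one language is required")
    # Several languages are passed to -l as one string joined with '+'.
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


# "missing_glyphs" is an assumed name for a separately trained data file.
command = build_tesseract_command("page.png", "page", ["eng", "missing_glyphs"])
print(" ".join(command))
```

The resulting list could be handed to subprocess.run(command, check=True) on a machine where Tesseract and both traineddata files are installed.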
As far as I've observed, multilingual recognition switches between languages at the word level, not the glyph level. Tesseract guesses which language an entire word belongs to and does NOT mix glyphs from different languages within a single word. In other words, the final recognized word always consists of glyphs from only ONE language, even with the "-l lang1+lang2" option.

That would mean "missing_glyphs" has to include the WHOLE alphabet rather than just the missing diacritics (please correct me if I'm wrong). I wish there *was* a way to merge a small subset...

So, how much training is needed for good results? Would I need to train for normal, bold and italic? A variety of fonts (serif, sans serif, etc.)? Any recommendations?

Thanks.

