Hi all, I'm working on training Tesseract to recognize Georgian, and right now it works pretty well. However, many documents in Georgian may also contain text in Cyrillic and/or Latin scripts, so I'm wondering what the best approach to solving this is.
Option 1: I see that I can incorporate the Cyrillic and Latin training files into my own training files and create a Georgian+Cyrillic+Latin recognizer, but I'm worried about reducing recognition accuracy, especially for Latin/Cyrillic lookalikes such as н.

Option 2: As an alternative, is it possible to access Tesseract's internal state as it tries to recognize characters? Then I could write a wrapper that would try alternative languages for characters with low confidence and pick the one that gives the highest confidence as its best guess.

Option 3: Create four models (Georgian only, Georgian+Russian, Georgian+English, and Georgian+English+Russian) and use the appropriate one for each document. This is my fallback option, since it seems the most likely to work while maintaining maximum accuracy.

Any advice would be much appreciated. Thanks!

Derek
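
P.S. To make Option 2 a bit more concrete, here is a rough Python sketch of just the selection step I have in mind. It does not call Tesseract at all. It assumes I have somehow already obtained, for each language model, a list of (text, confidence) pairs that line up position by position, which is exactly the part I don't know how to get. The function name merge_by_confidence, the threshold of 60, and the language labels are all made up for illustration. It also assumes that confidences from different models are directly comparable, which may well not hold.

```python
from typing import Dict, List, Tuple

# (recognized text, confidence on a 0..100 scale); the scale is an assumption.
Token = Tuple[str, float]


def merge_by_confidence(
    primary: List[Token],
    alternatives: Dict[str, List[Token]],
    threshold: float = 60.0,
) -> List[Tuple[str, str, float]]:
    """Pick, per position, the highest-confidence reading.

    `primary` holds the results of the main (Georgian) model.
    `alternatives` maps a hypothetical language label to that model's
    results for the same positions. Alternatives are consulted only when
    the primary confidence falls below `threshold`.
    Returns (source label, text, confidence) for each position.
    """
    merged = []
    for i, (text, conf) in enumerate(primary):
        best = ("primary", text, conf)
        if conf < threshold:
            for lang, tokens in alternatives.items():
                alt_text, alt_conf = tokens[i]
                # Strict comparison: on a tie, keep the earlier choice.
                if alt_conf > best[2]:
                    best = (lang, alt_text, alt_conf)
        merged.append(best)
    return merged
```

The open question is still whether Tesseract exposes per-character (or at least per-word) confidences in a form that would let me build the aligned lists this sketch takes as input.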

