Best multiple language option?

Derek Dohler Thu, 02 Jun 2011 23:42:58 -0700

Hi all, 

I'm working on training Tesseract to recognize Georgian, and right now it works 
pretty well. However, many documents in Georgian may also contain text in 
Cyrillic and/or Latin scripts, so I'm wondering what the best way to handle 
such mixed-script documents is.


Option 1: I see that I can incorporate the Cyrillic and Latin training files 
into my own training files, and create a Georgian+Cyrillic+Latin recognizer, 
but I'm worried about reducing the accuracy of recognition, especially for 
Latin/Cyrillic lookalikes such as Cyrillic н and Latin H.

Option 2: As an alternative, is it possible to access Tesseract's internal 
state as it tries to recognize characters? Then I could write a wrapper that 
would try alternative languages for characters with low confidence and pick the 
one that gives the highest confidence as its best guess.
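
To make Option 2 concrete, here is a rough sketch of the wrapper logic I have 
in mind. It does not call Tesseract at all: `recognize` is a hypothetical 
callable that takes a word image and a language code and returns a 
(text, confidence) pair, and the language codes "kat", "rus", "eng" and the 
threshold of 70 are assumptions for illustration only.

```python
from typing import Any, Callable, Sequence, Tuple

# Hypothetical recognizer signature (NOT a real Tesseract API):
# (word_image, language_code) -> (recognized_text, confidence in 0..100)
Recognizer = Callable[[Any, str], Tuple[str, float]]


def best_guess(
    word_image: Any,
    recognize: Recognizer,
    primary: str = "kat",                      # assumed code for Georgian
    fallbacks: Sequence[str] = ("rus", "eng"),  # assumed fallback languages
    threshold: float = 70.0,                   # assumed confidence cutoff
) -> Tuple[str, str, float]:
    """Return (language, text, confidence) for the most confident reading."""
    text, conf = recognize(word_image, primary)
    # Accept the primary language outright when it is confident enough.
    if conf >= threshold:
        return primary, text, conf

    # Otherwise retry with each fallback and keep the highest confidence.
    best = (primary, text, conf)
    for lang in fallbacks:
        alt_text, alt_conf = recognize(word_image, lang)
        if alt_conf > best[2]:
            best = (lang, alt_text, alt_conf)
    return best
```

One caveat with this approach: confidences coming from different language 
models may not be directly comparable, so the cutoff would probably need 
tuning per language.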

Option 3: Create four models: Georgian only, Georgian+Russian, 
Georgian+English, Georgian+English+Russian, and use the appropriate one. This 
is my fallback option since it seems the most likely to work while maintaining 
maximum accuracy.

Any advice would be much appreciated. Thanks!

Derek
