Good day.
For a few years our group has been developing an open-source OCR
(optical character recognition) and translation system. We now have
the first solid results and would be happy to share this system and
our knowledge with you. The key features of the OCR system include:

1. Stream OCR processing
During the first stage of the project, we recognized 300,000 pages of
the Tibetan Canon in Tibetan for the TBRC Digital Library
(www.tbrc.org). We used a Mac Pro stream server that processed all 280
volumes with one OCR set.

2. Tibetan spell checker and online dictionary with 250,000 words and
a 6.5 mln word list.

3. Multilingual support
At present, the main focus of the project is Tibetan and Sanskrit
OCR. However, its main algorithm can learn a new language in about two
months.

4. High accuracy
The system uses dictionary control at all stages of OCR processing.
Its Grammar Corrector can use a statistical dictionary containing 20-30
mln phrases (the Tibetan dictionary now includes 8.5 mln). For Tibetan
books, the current recognition rate is 1 error per 1000 characters.
Here you can see a screenshot:
http://www.buddism.ru///ocrlib/OCRLib21_07_2010.png
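
For readers who want to relate the figure above to their own tests: 1
error per 1000 characters corresponds to a character error rate of 0.1%.
The Python sketch below shows one common way such a rate is measured,
as the edit distance between OCR output and a reference text divided by
the reference length. It is an illustration only: the function names
and the sample data are hypothetical, they are not part of OCRLib or
Tesseract, and the message does not say how the figure above was
measured or whether "character" means a code point or a Tibetan
syllable.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(ocr_text: str, reference: str) -> float:
    """Edit errors divided by the number of reference code points."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(ocr_text, reference) / len(reference)


# Hypothetical data: a 1000-character reference with one wrong character.
reference = "a" * 1000
ocr_text = "a" * 499 + "b" + "a" * 500
print(char_error_rate(ocr_text, reference))  # 0.001
```

Note that Python counts Unicode code points here, so a stacked Tibetan
letter made of several code points would count as several characters.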

All these features can be integrated into the Tesseract project.

We believe that we can help you in your research and projects, and
that you may be able to help us continue developing the OCR system
and start a Tibetan translation program. We look forward to hearing
from you and will be happy to answer your questions!

Best regards,
Alexander Stroganov,
[email protected]

Rime Center Russia
OCR Project Web pages:
http://sourceforge.net/projects/ocrlib/
www.buddism.ru/ocrlib
