Good day.

For a few years our group has been developing an open-source OCR (optical character recognition) and translation system. We now have our first solid results and would be happy to share the system and our knowledge with you. The key features of the OCR system are:

1. Stream OCR processing
During the first stage of the project, we recognized 300,000 pages of the Tibetan Canon in Tibetan for the TBRC Digital Library (www.tbrc.org). We used a Mac Pro stream server, which processed all 280 volumes with a single OCR set.

2. Tibetan spell checker and online dictionary
The dictionary covers 250,000 words, with a word list of 6.5 million entries.

3. Multilingual support
At present, the main focus of the project is Tibetan and Sanskrit OCR. However, the core algorithm can learn a new language in about two months.

4. High accuracy
The system uses dictionary control at all stages of OCR processing. Its Grammar Corrector can use a statistical dictionary containing 20-30 million phrases (the Tibetan dictionary currently includes 8.5 million). For Tibetan books, the current recognition result is 1 error per 1000 characters.

Here you can see a screenshot:
http://www.buddism.ru///ocrlib/OCRLib21_07_2010.png

All these features can be integrated into the Tesseract project.

We believe that we can help you in your research and projects, and perhaps you can help us continue developing the OCR system and start a Tibetan translation program.

We look forward to hearing from you and will be happy to answer your questions!

Best regards,
Alexander Stroganov, [email protected]
Rime Center Russia OCR Project

Web pages:
http://sourceforge.net/projects/ocrlib/
www.buddism.ru/ocrlib

