On 30 July 2010 00:29, Moscow Rime Dharma Centre <[email protected]> wrote:
> Good day.
> For a few years our group has been developing OCR (optical character
> recognition) and translation system with Open Source code. Now we have
> the first solid results and will be happy to share this system and our
> knowledge with you. The key features of the OCR system include:
>
> 1. Stream OCR processing
> During the first stage of the project, we recognized 300 000 pages of
> Tibetan Canon in Tibetan for TBRC Digital Library (www.tbrc.org). We
> used MacPro stream server that has processed all 280 volumes with one
> OCR set.
>
> 2. Tibetan spell checker and online dictionary on 250000 words and 6.5
> mln wordlist.
>
> 3. Multilingual support
> At present, the key direction of the project is Tibetan and Sanskrit
> OCR. However, its main algorithm can study one language per two
> months.
>
Could you elaborate on that? If there are any papers published about
your work, I'd be interested in reading them (I can read Russian, but I
don't speak it well :)

> 4. High accuracy
> The system uses dictionary control at all stages of OCR processing.
> Its Grammar Corrector can use a statistical dictionary containing 20-30
> mln phrases (the Tibetan dictionary now includes 8.5 mln). For Tibetan
> books, the current recognition results are 1 error per 1000
> characters. Here you can see a screenshot:
> http://www.buddism.ru///ocrlib/OCRLib21_07_2010.png
>

Interesting. How much of this accuracy do you attribute to correction?

> All these features can be integrated in Tesseract project.
>

Probably not directly - for one thing, the licences are incompatible -
but indirect reuse of ideas should not be a problem.

> We believe that we may help you in your research and projects. And
> probably you may help us to continue the development of the OCR system
> and start Tibetan translation program. We are looking forward to
> hearing from you and will be happy to answer your questions!
>

MT is an interest of mine; I'd be interested in the details (from what
I remember, there are some difficulties with Tibetan).

-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.

