On 30 July 2010 00:29, Moscow Rime Dharma Centre
<[email protected]> wrote:
> Good day.
> For a few years our group has been developing an open-source OCR
> (optical character recognition) and translation system. We now have
> the first solid results and would be happy to share this system and our
> knowledge with you. The key features of the OCR system include:
>
> 1. Stream OCR processing
> During the first stage of the project, we recognized 300,000 pages of
> the Tibetan Canon in Tibetan for the TBRC Digital Library (www.tbrc.org).
> We used a Mac Pro stream server that processed all 280 volumes with one
> OCR set.
>
> 2. Tibetan spell checker and online dictionary with 250,000 words and a
> wordlist of 6.5 million entries.
>
> 3. Multilingual support
> At present, the main focus of the project is Tibetan and Sanskrit
> OCR. However, its main algorithm can learn a new language in about two
> months.
>

Could you elaborate on that? If there are any papers published about
your work, I'd be interested in reading them (I can read Russian,
but I don't speak it well :)

> 4. High accuracy
> The system uses dictionary control at all stages of OCR processing.
> Its Grammar Corrector can use a statistical dictionary containing 20-30
> million phrases (the Tibetan dictionary now includes 8.5 million). For
> Tibetan books, the current error rate is 1 error per 1000
> characters. Here you can see a screenshot:
> http://www.buddism.ru///ocrlib/OCRLib21_07_2010.png
>

Interesting. How much of this accuracy do you attribute to correction?

> All these features can be integrated into the Tesseract project.
>

Probably not directly - for one thing, the licences are incompatible -
but indirect reuse of ideas should not be a problem.

> We believe that we can help you with your research and projects, and
> perhaps you can help us continue developing the OCR system
> and start a Tibetan translation program. We look forward to
> hearing from you and will be happy to answer your questions!
>

MT is an interest of mine; I'd be interested in the details (from what
I remember, there are some difficulties with Tibetan).

-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.
