Re: Open Source OCR system

Rakesh Achanta Sat, 31 Jul 2010 23:42:59 -0700

Very interesting.
A few friends and I are currently working on Indian languages. As Tibetan
script, like Devanagari, belongs to the Brahmi family of scripts and is
written as an abugida, your work will be very helpful for us.
I am especially interested in details such as how you account for
sandhis/joins in Sanskrit, e.g. sah + aham = soham.
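
To make the sandhi question concrete, here is a minimal sketch of the one
rule behind the example above, written on romanized (IAST) text. The function
names are made up for illustration and this is not how your system or any
existing library does it; real Sanskrit sandhi has many more rules. It only
shows that a recognizer's dictionary check has to be able to undo a join
before it can look the words up.

```python
import unicodedata
from typing import List, Tuple

VISARGA = "\u1e25"  # the romanized visarga, written as h with a dot below

def join_visarga(first: str, second: str) -> str:
    """Join two words, applying only the rule  -a + visarga, then a-  ->  -o'-."""
    first = unicodedata.normalize("NFC", first)
    second = unicodedata.normalize("NFC", second)
    if first.endswith("a" + VISARGA) and second.startswith("a"):
        # Final "a + visarga" becomes "o"; the following initial "a" is
        # dropped and marked with an apostrophe (avagraha).
        return first[:-2] + "o'" + second[1:]
    # No rule matched: leave the words separate.
    return first + " " + second

def split_candidates(joined: str) -> List[Tuple[str, str]]:
    """Undo the same rule: propose (first, second) for every "o'" in the text."""
    out = []
    idx = joined.find("o'")
    while idx != -1:
        out.append((joined[:idx] + "a" + VISARGA, "a" + joined[idx + 2:]))
        idx = joined.find("o'", idx + 1)
    return out
```

A dictionary-controlled system would presumably run something like
split_candidates on each unknown word and keep only the splits whose parts
are both in the word list.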


Complexity in Sanskrit-like languages arises primarily from two things:
1) Writing in syllables takes the number of symbols to a thousand or so
(compare English's 80 or so).
2) The number of words in Sanskrit is limitless, as one can keep combining
them.
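
As a rough illustration of point 1, the sketch below groups a Unicode string
into written syllables (aksharas), so that the distinct ones in a corpus can
be counted. It is a simplification under my own assumptions: marks attach to
the preceding letter, a virama glues the next letter onto the cluster, and
joiners such as ZWJ/ZWNJ are ignored. The function name is hypothetical.

```python
import unicodedata
from typing import List

def aksharas(word: str) -> List[str]:
    """Split one word into approximate orthographic syllables."""
    clusters: List[str] = []
    join_next = False  # True right after a virama
    for ch in word:
        # Vowel signs, anusvara, virama etc. are combining marks (Mn/Mc).
        is_mark = unicodedata.category(ch) in ("Mn", "Mc")
        if clusters and (is_mark or join_next):
            clusters[-1] += ch
        else:
            clusters.append(ch)
        # Canonical combining class 9 is the Unicode "virama" class, so the
        # same test covers Devanagari, Telugu and other Indic scripts.
        join_next = unicodedata.combining(ch) == 9
    return clusters

# The word "Sanskrit" in Devanagari, spelled out as code points.
word = "\u0938\u0902\u0938\u094d\u0915\u0943\u0924"
print(len(aksharas(word)), "syllables from", len(word), "code points")
```

Collecting set(aksharas(w)) over every word of a large text gives the symbol
inventory an OCR engine has to distinguish if it classifies whole syllables
rather than individual strokes or letters.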

I would be interested in reading any notes that detail how you are able to
cope with the above two.

Also, as you said your system can learn new languages, it must be very easy
for it to learn Indian languages that have the same writing concept as
Tibetan. If you want a list of all possible combos for, say, Telugu, with the
tiff image and the Unicode string for each, I can give them to you.
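
In case it helps to picture the data, here is a minimal sketch of how such
image/string pairs could be read. The layout (one pair per line, file name
and Unicode string separated by a tab) and the function name are only
assumptions for illustration; I can export whatever format suits your system.

```python
import unicodedata
from typing import Iterable, List, Tuple

def read_pairs(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse 'image<TAB>text' lines into (image file name, NFC text) pairs."""
    pairs = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip lines that are not exactly one image and one string
        image, text = fields
        if not image.lower().endswith((".tif", ".tiff")):
            continue  # keep only TIFF images
        # Normalize so the same syllable is always the same code point sequence.
        pairs.append((image, unicodedata.normalize("NFC", text)))
    return pairs

# Two Telugu entries (given as code points) and one malformed line.
sample = ["ka.tif\t\u0c15\n", "broken line\n", "ksha.tiff\t\u0c15\u0c4d\u0c37\n"]
for image, text in read_pairs(sample):
    print(image, [hex(ord(c)) for c in text])
```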

Regards
Rakesh

On 30 July 2010 04:59, Moscow Rime Dharma Centre <[email protected]> wrote:

> Good day.
> For a few years our group has been developing an open-source OCR
> (optical character recognition) and translation system. Now we have
> the first solid results and will be happy to share this system and our
> knowledge with you. The key features of the OCR system include:
>
> 1. Stream OCR processing
> During the first stage of the project, we recognized 300 000 pages of
> the Tibetan Canon in Tibetan for the TBRC Digital Library
> (www.tbrc.org). We used a Mac Pro stream server that processed all 280
> volumes with one OCR set.
>
> 2. Tibetan spell checker and online dictionary with 250000 words and a
> 6.5 mln wordlist.
>
> 3. Multilingual support
> At present, the key direction of the project is Tibetan and Sanskrit
> OCR. However, its main algorithm can learn a new language in about two
> months.
>
> 4. High accuracy
> The system uses dictionary control at all stages of OCR processing.
> Its Grammar Corrector can use a statistical dictionary containing 20-30
> mln phrases (the Tibetan dictionary now includes 8.5 mln). For Tibetan
> books, the current recognition error rate is 1 error per 1000
> characters. Here you can see a screenshot:
> http://www.buddism.ru///ocrlib/OCRLib21_07_2010.png
>
> All these features can be integrated into the Tesseract project.
>
> We believe that we may help you in your research and projects, and
> perhaps you may help us to continue the development of the OCR system
> and start a Tibetan translation program. We are looking forward to
> hearing from you and will be happy to answer your questions!
>
> Best regards,
> Alexander Stroganov,
> [email protected]
>
> Rime Center Russia
> OCR Project Web pages:
> http://sourceforge.net/projects/ocrlib/
> www.buddism.ru/ocrlib
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en.
>
>

