Re: Open Source OCR system

Sriranga(77yrsold) Sun, 01 Aug 2010 00:05:16 -0700

Dear Rakesh,
Really interesting. Please don't forget me. I would like to join you in
developing OCR for Indian languages under your leadership.
Yes, the complexity is real, and so is the fundamental grammar that Indian
languages share, which is based on Sanskrit.
I can also contribute Kannada TIFF images, with their text converted to
Unicode, for experimental purposes.
I would like to have the software for hands-on experience, beta testing, and
feedback.
Wishing you the best of luck,
-sriranga(77yrsold)


On Sun, Aug 1, 2010 at 12:12 PM, Rakesh Achanta <[email protected]> wrote:

> Very interesting.
> A bunch of friends and I are currently working with Indian languages. As the
> Tibetan script is, like Devanagari, a Brahmic script written as an abugida,
> your work will be very helpful for us.
> I would like details such as how you account for sandhi (joins) in
> Sanskrit, e.g. sah + aham = soham.
>
> Complexity in Sanskrit-like languages arises primarily from two things:
> 1) Writing in syllables takes the number of symbols to a thousand or so
> (compare English's 80 or so).
> 2) The number of words in Sanskrit is limitless, as one can keep combining
> them.
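
To make point 1 above concrete, here is a rough sketch that enumerates
candidate Telugu syllables from the Unicode Telugu block. It is only an
illustration and not how either project builds its symbol set: the helper
names are made up, the code point range for consonants is an assumption taken
from the Unicode charts, and the totals are upper bounds because many of the
generated combinations never occur in real text. Conjuncts of three or more
consonants are left out.

```python
import unicodedata

def has_name(cp, prefix):
    """True if the code point is assigned and its Unicode name starts with prefix."""
    return unicodedata.name(chr(cp), "").startswith(prefix)

# Assumption: Telugu consonants occupy U+0C15..U+0C39; unassigned slots are skipped.
consonants = [chr(cp) for cp in range(0x0C15, 0x0C3A)
              if has_name(cp, "TELUGU LETTER ")]
# Dependent vowel signs anywhere in the Telugu block (U+0C00..U+0C7F).
vowel_signs = [chr(cp) for cp in range(0x0C00, 0x0C80)
               if has_name(cp, "TELUGU VOWEL SIGN ")]
VIRAMA = "\u0C4D"  # TELUGU SIGN VIRAMA, joins consonants into a conjunct

def simple_syllables():
    """Consonant with its inherent vowel, or consonant + one vowel sign."""
    for c in consonants:
        yield c
        for v in vowel_signs:
            yield c + v

def conjunct_syllables():
    """Two-consonant conjuncts: consonant + virama + simple syllable."""
    for c in consonants:
        for s in simple_syllables():
            yield c + VIRAMA + s

n_simple = sum(1 for _ in simple_syllables())
n_conjunct = sum(1 for _ in conjunct_syllables())
print("consonants:", len(consonants), "vowel signs:", len(vowel_signs))
print("simple syllables:", n_simple)
print("two-consonant conjunct syllables (upper bound):", n_conjunct)
```

Each generated string could serve as the Unicode half of a (TIFF image,
Unicode string) pair; the image half would still have to be rendered or
scanned separately.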
>
> I would be interested in reading any notes that detail how you are able to
> cope with the above two.
>
> Also, since you said your system can learn new languages, it should be easy
> for it to learn Indian languages that share the same writing concept as
> Tibetan. If you want a list of all possible combos for, say, Telugu, with the
> TIFF image and the Unicode string for each, I can give it to you.
>
> Regards
> Rakesh
>
> On 30 July 2010 04:59, Moscow Rime Dharma Centre
> <[email protected]> wrote:
>
>> Good day.
>> For a few years our group has been developing an open-source OCR (optical
>> character recognition) and translation system. We now have the first solid
>> results and will be happy to share this system and our knowledge with you.
>> The key features of the OCR system include:
>>
>> 1. Stream OCR processing
>> During the first stage of the project, we recognized 300,000 pages of the
>> Tibetan Canon in Tibetan for the TBRC Digital Library (www.tbrc.org). We
>> used a Mac Pro stream server that processed all 280 volumes with one
>> OCR set.
>>
>> 2. Tibetan spell checker and online dictionary with 250,000 words and a
>> word list of 6.5 million words.
>>
>> 3. Multilingual support
>> At present, the main focus of the project is Tibetan and Sanskrit
>> OCR. However, its main algorithm can learn one new language in two
>> months.
>>
>> 4. High accuracy
>> The system uses dictionary control at all stages of OCR processing.
>> Its Grammar Corrector can use a statistical dictionary containing 20-30
>> million phrases (the Tibetan dictionary now includes 8.5 million). For
>> Tibetan books, the current recognition rate is 1 error per 1000
>> characters. Here you can see a screenshot:
>> http://www.buddism.ru///ocrlib/OCRLib21_07_2010.png
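
For anyone who wants to check a figure like "1 error per 1000 characters" on
their own pages, here is a minimal sketch of a character error rate
calculation. The announcement does not say how errors were counted, so this
assumes the common definition: Levenshtein edit distance between the OCR
output and a hand-checked reference, divided by the reference length. The
function names are illustrative only.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def char_error_rate(reference, hypothesis):
    """Errors per reference character (assumed definition, see note above)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)

# Hypothetical check: one wrong character in a 1000-character reference.
reference = "a" * 1000
hypothesis = "a" * 499 + "b" + "a" * 500
print(char_error_rate(reference, hypothesis))
```

Note that for Tibetan or Indic text the result depends on whether you count
Unicode code points, as this sketch does, or whole syllables.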
>>
>> All these features can be integrated into the Tesseract project.
>>
>> We believe that we can help you in your research and projects, and
>> that you may be able to help us continue the development of the OCR system
>> and start a Tibetan translation program. We look forward to
>> hearing from you and will be happy to answer your questions!
>>
>> Best regards,
>> Alexander Stroganov,
>> [email protected]
>>
>> Rime Center Russia
>> OCR Project Web pages:
>> http://sourceforge.net/projects/ocrlib/
>> www.buddism.ru/ocrlib
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To post to this group, send email to [email protected].
>> To unsubscribe from this group, send email to
>> [email protected].
>> For more options, visit this group at
>> http://groups.google.com/group/tesseract-ocr?hl=en.
>>
>>
>

