Dear Rakesh,

Really interesting. Please don't forget me: I would like to join you in developing OCR for Indian languages under your leadership. Yes, the complexity exists, as does the fundamental grammar of Indian languages, which is based on Sanskrit. I can also contribute Kannada TIFF images with their text converted to Unicode for experimental purposes. I would like to have the software for hands-on experience, beta testing, and feedback.

Wishing you the best of luck,
-sriranga (77 years old)

On Sun, Aug 1, 2010 at 12:12 PM, Rakesh Achanta <[email protected]> wrote:
> Very interesting.
> Me and a bunch of friends are currently dealing with Indian languages. As
> Tibetan is also based on the Devanagari system of writing, and is written as
> an abugida, your work will be very helpful for us.
> I am interested in details such as how you account for sandhis/joins in
> Sanskrit, e.g. sah+aham = soham.
>
> Complexity in Sanskrit-like languages arises primarily from two things:
> 1) Writing in syllables takes the number of symbols to a thousand or so
> (compare English's 80 or so).
> 2) The number of words in Sanskrit is limitless, as one can keep combining
> them.
>
> I would be interested in reading any notes that detail how you are able to
> cope with the above two.
>
> Also, as you said your system can learn new languages, it must be very easy
> for it to learn Indian languages that have the same writing concept as
> Tibetan. If you want a list of all possible combos for, say, Telugu, with the
> TIFF image and the Unicode string, I can give them to you.
>
> Regards
> Rakesh
>
> On 30 July 2010 04:59, Moscow Rime Dharma Centre
> <[email protected]> wrote:
>
>> Good day.
>> For a few years our group has been developing an OCR (optical character
>> recognition) and translation system with Open Source code. Now we have
>> the first solid results and will be happy to share this system and our
>> knowledge with you. The key features of the OCR system include:
>>
>> 1. Stream OCR processing
>> During the first stage of the project, we recognized 300 000 pages of
>> the Tibetan Canon in Tibetan for the TBRC Digital Library (www.tbrc.org).
>> We used a Mac Pro stream server that processed all 280 volumes with one
>> OCR set.
>>
>> 2. Tibetan spell checker and online dictionary of 250000 words and a 6.5
>> mln wordlist.
>>
>> 3. Multilingual support
>> At present, the key direction of the project is Tibetan and Sanskrit
>> OCR. However, its main algorithm can learn one language per two
>> months.
>>
>> 4. High accuracy
>> The system uses dictionary control at all stages of OCR processing.
>> Its Grammar Corrector can use a statistical dictionary containing 20-30
>> mln phrases (the Tibetan dictionary now includes 8.5 mln). For Tibetan
>> books, the current recognition results are 1 error per 1000
>> characters. Here you can see a screenshot:
>> http://www.buddism.ru///ocrlib/OCRLib21_07_2010.png
>>
>> All these features can be integrated into the Tesseract project.
>>
>> We believe that we may help you in your research and projects. And
>> probably you may help us to continue the development of the OCR system
>> and start a Tibetan translation program. We are looking forward to
>> hearing from you and will be happy to answer your questions!
>>
>> Best regards,
>> Alexander Stroganov,
>> [email protected]
>>
>> Rime Center Russia
>> OCR Project Web pages:
>> http://sourceforge.net/projects/ocrlib/
>> www.buddism.ru/ocrlib
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To post to this group, send email to [email protected].
>> To unsubscribe from this group, send email to
>> [email protected]<tesseract-ocr%[email protected]>
>> .
>> For more options, visit this group at
>> http://groups.google.com/group/tesseract-ocr?hl=en.

