On May 1, 12:51 pm, Rajesh Pandey <[email protected]> wrote:
> The hindi language tesseract data files should work. While I was working in
> 2007-2008, Hindi language data files were not available. A bengali guy
> called debayanin tried hard to use hindi / devanagari.
> Today the hindi language data files (tessdata) are available. I haven't
> tested it. But I am sure it should work.
> The question has been answered. Nepali Language should be able to use the
> hindi data files. It all depends on how much accurate the results for Hindi
> are. If Hindi is detected flawlessly, it should work similarly with Nepali.

Except for the dictionary, as I mentioned above. The Nepali dictionary is definitely different from the Hindi dictionary, and the difference would probably show up in accuracy and/or speed. AFAIK, the dictionary is instrumental in the algorithms. (Someone, correct me if I'm wrong.)

The above, of course, raises the question: can you just swap out the dictionary component of the traineddata file? I am assuming one can, so as not to have to retrain from scratch.

> There is a slight difference in Nepali that some characters from Hindi are
> not used. However they are in the devanagari chart. Its good for Nepali
> that Nepali does not use those characters. If it had been the reverse, we
> should train again to incorporate those characters.

Just out of curiosity: what bearing does this have on Sanskrit? Are there certain Sanskrit glyphs that are missing from the current Tesseract Hindi set?

Thanks
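
P.S. One rough way to approach the "missing glyphs" question is to compare the Devanagari characters in a sample text against the characters a model was trained on. The sketch below is only an illustration: the function names are made up for this example, `model_chars` is a deliberately tiny stand-in rather than the real Hindi character set, and it looks only at the basic Devanagari block (U+0900 to U+097F), so it ignores conjunct shapes and any Vedic signs encoded elsewhere.

```python
import unicodedata


def devanagari_chars(text):
    """Return the set of characters in text from the Devanagari block U+0900-U+097F."""
    return {ch for ch in text if "\u0900" <= ch <= "\u097f"}


def missing_from_model(sample_text, model_chars):
    """Return, in code point order, the Devanagari characters the sample uses
    that are absent from the model's character set."""
    return sorted(devanagari_chars(sample_text) - set(model_chars))


# Hypothetical toy character set; a real check would load the characters
# the Hindi data files were actually trained on.
model_chars = "कखगनमरलस" + "ािे्"

# Hypothetical sample text standing in for a Sanskrit or Nepali page.
sample = "नमस्ते ॐ"

for ch in missing_from_model(sample, model_chars):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Run against a real Sanskrit or Nepali corpus and the real character list, an empty result would suggest the existing Hindi set covers the script at the code point level, though it says nothing about how well the individual shapes are recognized.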

