Re: Nepali Tesseract OCR data files for tesseract ocr

Falke Tue, 01 May 2012 19:13:16 -0700


On May 1, 12:51 pm, Rajesh Pandey <[email protected]> wrote:
> The Hindi language Tesseract data files should work. While I was working in
> 2007-2008, Hindi language data files were not available. A Bengali guy
> called debayanin tried hard to use Hindi / Devanagari.
> Today the Hindi language data files (tessdata) are available. I haven't
> tested them. But I am sure they should work.
> The question has been answered. The Nepali language should be able to use the
> Hindi data files. It all depends on how accurate the results for Hindi
> are. If Hindi is detected flawlessly, it should work similarly with Nepali.


Except for the dictionary, as I mentioned above.  The Nepali dictionary is
definitely different from the Hindi dictionary, and the difference would
probably show up in accuracy and/or speed.  AFAIK, the recognizer relies
on the dictionary when choosing between candidate words. (Someone, correct
me if I'm wrong.)

The above, of course, raises the question:  Can you just swap out
the dictionary component of the traineddata file?  I am assuming one can,
so as not to have to retrain from scratch.
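
For what it's worth, here is a minimal sketch of how such a swap might be scripted. It assumes the Tesseract 3.x training tools combine_tessdata and wordlist2dawg are installed, and it only builds the command lines rather than running them. The language code "nep", the wordlist file name, and the directory names are placeholders I made up. Whether the result actually recognizes Nepali well is untested.

```python
from typing import List


def dictionary_swap_commands(src_lang: str, dst_lang: str, wordlist: str,
                             tessdata_dir: str = "tessdata",
                             work_dir: str = "work") -> List[List[str]]:
    """Build (but do not run) the commands for a dictionary swap.

    Assumption: the Tesseract 3.x tools are on PATH and work_dir exists.
    All file and directory names here are hypothetical placeholders.
    """
    src = f"{tessdata_dir}/{src_lang}.traineddata"
    dst = f"{tessdata_dir}/{dst_lang}.traineddata"
    prefix = f"{work_dir}/{dst_lang}."  # unpacked components get this prefix
    return [
        # 1. Start from a copy of the existing traineddata file.
        ["cp", src, dst],
        # 2. Unpack its components (unicharset, word-dawg, ...) into work_dir.
        ["combine_tessdata", "-u", dst, prefix],
        # 3. Compile the new wordlist into a DAWG against the unpacked unicharset.
        ["wordlist2dawg", wordlist, prefix + "word-dawg", prefix + "unicharset"],
        # 4. Overwrite only the word-dawg component inside the copy.
        ["combine_tessdata", "-o", dst, prefix + "word-dawg"],
    ]


if __name__ == "__main__":
    for cmd in dictionary_swap_commands("hin", "nep", "nepali_words.txt"):
        print(" ".join(cmd))
```

Printing the commands first makes it easy to check them by hand before running anything. Other dictionary components (such as the frequent-word DAWG) would presumably need the same treatment.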

> There is a slight difference in Nepali: some characters from Hindi are
> not used. However, they are in the Devanagari chart. It's good for Nepali
> that Nepali does not use those characters. If it had been the reverse, we
> would have to train again to incorporate those characters.

Just out of curiosity -- what bearing does this have on Sanskrit?  Are
there certain Sanskrit glyphs that are missing from the current
Tesseract Hindi set?
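
One rough way to check this yourself, as a sketch only: unpack the unicharset from the Hindi traineddata and compare it against a sample of Sanskrit text. The code below assumes the usual unicharset layout (first line is the entry count, each following line starts with the character or character cluster, and "NULL" stands in for the space entry). The function names are my own, and the check is per code point, so it says nothing about whether particular conjuncts were trained.

```python
from typing import List, Set


def unicharset_codepoints(unicharset_text: str) -> Set[str]:
    """Collect every code point that appears in any unicharset entry."""
    covered: Set[str] = set()
    # Skip the first line, which is assumed to hold the number of entries.
    for line in unicharset_text.splitlines()[1:]:
        if not line.strip():
            continue
        unichar = line.split(" ", 1)[0]  # first field is the character(s)
        if unichar == "NULL":            # placeholder entry, not a real glyph
            continue
        covered.update(unichar)          # entries may span several code points
    return covered


def missing_characters(sample_text: str, unicharset_text: str) -> List[str]:
    """Return the non-space characters of sample_text absent from the unicharset."""
    covered = unicharset_codepoints(unicharset_text)
    return sorted({ch for ch in sample_text
                   if not ch.isspace() and ch not in covered})


if __name__ == "__main__":
    # Toy unicharset, invented for illustration; not real Tesseract data.
    toy_unicharset = "4\nNULL 0 NULL 0\nक 1\nा 0\nक्ष 1\n"
    print(missing_characters("कषा ऽ", toy_unicharset))
```

Run against a real unpacked unicharset and a reasonably large Sanskrit sample, anything it reports would be a candidate for characters the Hindi set does not cover.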

Thanks

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
