Re: Nepali Tesseract OCR data files for tesseract ocr

Rajesh Pandey Wed, 02 May 2012 09:25:42 -0700

On Tue, May 1, 2012 at 11:59 PM, Falke <[email protected]> wrote:

>
>
> On May 1, 12:51 pm, Rajesh Pandey <[email protected]> wrote:
> > The Hindi language Tesseract data files should work. While I was working
> > in 2007-2008, Hindi language data files were not available. A guy
> > called debayanin tried hard to use Hindi / Devanagari.
> > Today the Hindi language data files (tessdata) are available. I haven't
> > tested them, but I am sure they should work.
> > The question has been answered. Nepali should be able to use the
> > Hindi data files. It all depends on how accurate the results for Hindi
> > are. If Hindi is detected flawlessly, it should work similarly with
> > Nepali.
>
> Except for the dictionary, as I mentioned above.  Nepali dictionary is
> definitely different from Hindi dictionary.


Yes, the dictionary is a bit different. However, a lot of words are similar;
in particular, words derived from Sanskrit are mostly common to Hindi and
Nepali. Nouns are approximately 80% similar and adjectives maybe 50% similar,
while verbs are somewhat different, and the suffixes and prefixes attached to
verbs, nouns and adjectives are mostly different.
So there is a chance that even the Hindi dictionary files could be used to
some extent. But that's just a guess; I haven't actually tried it.
E.g., here is a list of nouns common to both languages that I compiled for
a different project:
http://code.google.com/p/nepaliwikipediatranslator/source/browse/trunk/NepaliWikiPediaTranslator/bin/Debug/NounsCommonInBothLanguage.txt
Other word lists from the same project:
adjectives: http://code.google.com/p/nepaliwikipediatranslator/source/browse/trunk/NepaliWikiPediaTranslator/bin/Debug/adjectivelist.txt
verbs: http://code.google.com/p/nepaliwikipediatranslator/source/browse/trunk/NepaliWikiPediaTranslator/bin/Debug/verblist.txt
nouns: http://code.google.com/p/nepaliwikipediatranslator/source/browse/trunk/NepaliWikiPediaTranslator/bin/Debug/nounlist.txt
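
A rough way to check how far a Hindi word list might cover Nepali would be
to measure what fraction of Nepali words also appear in it. The Python
sketch below is only an illustration: the function names and the tiny
sample lists are made up for this example and are not part of Tesseract or
of the lists linked above, and the resulting number says nothing about
real overlap.

```python
def load_words(lines):
    """Return the set of non-empty, whitespace-stripped words from an iterable of lines."""
    return {line.strip() for line in lines if line.strip()}


def overlap_ratio(source_words, target_words):
    """Fraction of target_words that also occur in source_words (0.0 if target is empty)."""
    if not target_words:
        return 0.0
    return len(source_words & target_words) / len(target_words)


# Placeholder samples for illustration only; real word lists would be much larger.
hindi_sample = load_words(["घर", "पानी", "नाम", "लड़का"])
nepali_sample = load_words(["घर", "पानी", "नाम", "केटो"])

ratio = overlap_ratio(hindi_sample, nepali_sample)
print(f"Share of sample Nepali words found in the Hindi sample: {ratio:.0%}")
```

With real data, load_words could be given an open file instead of a list,
e.g. open(path, encoding="utf-8"), assuming one word per line.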

I am still a newbie, so I don't know much about the dictionary files and
unicharset and probably should not be writing about them. When I trained
earlier, I didn't use a dictionary; I just used zero-sized (empty) files in
their place to make Tesseract work.


> The difference would
> probably be reflected in the accuracy and/or speed.  AFAIK, the
> dictionary is instrumental in the algorithms. (Someone, correct me if
> I'm wrong.)
>
>


> The above, of course, would beg the question:  Can you just swap out
> the dictionary component of traineddata?  I am assuming one can. (So
> as not to have to retrain from scratch)
>
> > There is a slight difference in Nepali: some characters from Hindi are
> > not used, although they are in the Devanagari chart. It's good for Nepali
> > that Nepali does not use those characters. If it had been the reverse, we
> > would have to train again to incorporate those characters.
>
> Just out of curiosity -- what bearing does this have on Sanskrit?  Are
> there certain Sanskrit glyphs that are missing from the current
> tesseract Hindi set?
>
Well, someone who knows Sanskrit better would know more about this.


> Thanks
>
> --
>
>


-- 
Rajesh Pandey

