On Apr 20, 1:57 am, Rajesh Pandey <[email protected]> wrote:
> Hi
>
> Has anyone tried to create Nepali language data for tesseract ?
>
> I think Hindi/Sanskrit data files can also be used for tesseract.

I think it should work with the current "-l hin" option (tesseract's Hindi language traineddata). Have YOU tried it yourself? I got some errors, but have not played with the resolution, etc., to try to reduce them.

> I don't know which place is it to discuss about this : tesseract ocr forum
> or fossnepal.
>

I don't see tesseract explicitly among the applications on the fossnepal front page (if that's any indication).

> Any suggestions on this ?
>
> Art is a librarian at the University of Windsor and have been working on
> using open source OCR for newspaper collections. He was asked about Nepali
> by a friend and became curious but he doesn't have a specific project for
> the language at this point. He opts
> tesseract<http://code.google.com/p/tesseract-ocr/>for this and wants
> to use it for newspaper pages in batch.
>
> Earlier I was interested in creating a Nepali OCR but I am these days more

You were going to write the whole engine, from scratch? Wow.

> into creating Nepali Translator [Hindi or English to Nepali text
> translator<http://code.google.com/p/nepaliwikipediatranslator>
> ]
> I read tesseract-ocr threads daily but still I prefer to be called a noob
> in this regards.

Have you tried tesseract with "-l hin" on Nepali images? Let us know your accuracy (and perhaps some idea of the resolution and quality of your scan, etc.).

I believe accuracy has increased with the most recent (3.02) tesseract version. You need to compile 3.02 from svn. Read INSTALL.SVN, and don't forget "make install-langs" at the end.

The correspondence you quote, I believe, predates the recent improvements and additions in 3.02. Also, the assertion in it that tesseract does not recognize conjoined characters is wrong.

I believe it is AND WAS wrong in general, even before 3.02. Conceptually, that is: I don't think tesseract is **aware** of conjuncts, per se, as an object or algorithm. It simply stores the conjunct's prototype image (like any other glyph image) in its data set, where that image is mapped to its UTF-8 representation, however many bytes that representation might take (though there *IS* a limit).

As I understand it, the challenge of improving recognition accuracy for this (and other) scripts has a lot to do with the training (although perhaps not exclusively).

Regarding handwritten documents: that seems daunting to me, but I may not know enough of tesseract's internals to assess how much harder handwritten images would be than typeset ones. I know that tesseract makes multiple recognition passes: it builds certain assumptions and confidences on its first pass, to be used in the second. So if handwritten documents have a certain consistency which tesseract can algorithmify (for pass 2+), then that's a plus. But it seems it would take very stylistically consistent handwriting for that to take effect.
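
To make the point about conjuncts and their UTF-8 representation concrete, here is a small Python sketch. It only looks at the text side of things (how one conjunct is spelled as a sequence of code points) and says nothing about how tesseract stores its prototypes internally. The choice of the conjunct kṣa as the example and the helper name `describe` are mine, purely for illustration.

```python
import unicodedata

# Devanagari conjunct "ksha": KA + VIRAMA + SSA, usually drawn as a single glyph.
conjunct = "\u0915\u094d\u0937"

def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

for code_point, name in describe(conjunct):
    print(code_point, name)

# One visual glyph, but several code points and even more bytes.
print("code points:", len(conjunct))
print("UTF-8 bytes:", len(conjunct.encode("utf-8")))
```

This one glyph is three code points and nine UTF-8 bytes, which is the kind of multi-byte string a single trained shape would have to map to.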

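P.S. For anyone who wants to report an accuracy figure for "-l hin" on Nepali scans, here is a rough Python sketch. It assumes the tesseract binary is on your PATH, that the Hindi traineddata is installed, and that you have typed up a reference transcription by hand. The function names and file names are hypothetical, and the score is just difflib's similarity ratio, not a standard character error rate.

```python
import difflib
import subprocess

def tesseract_command(image_path, output_base, lang="hin"):
    """Build the command line: tesseract <image> <output base> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", lang]

def character_accuracy(ocr_text, truth_text):
    """Similarity ratio in [0, 1] between OCR output and a reference text."""
    matcher = difflib.SequenceMatcher(None, ocr_text, truth_text, autojunk=False)
    return matcher.ratio()

def run_and_score(image_path, truth_path, output_base="ocr_out"):
    """Run tesseract on one image and score the result against truth_path."""
    # tesseract writes its text output to <output_base>.txt
    subprocess.run(tesseract_command(image_path, output_base), check=True)
    with open(output_base + ".txt", encoding="utf-8") as f:
        ocr_text = f.read()
    with open(truth_path, encoding="utf-8") as f:
        truth_text = f.read()
    return character_accuracy(ocr_text.strip(), truth_text.strip())
```

Something like `run_and_score("page.tif", "page_truth.txt")` would then give one number per page, which is easier to compare across scan resolutions than "I got some errors".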
