Re: Nepali Tesseract OCR data files for tesseract ocr

Falke Mon, 23 Apr 2012 02:42:26 -0700

On Apr 20, 1:57 am, Rajesh Pandey <[email protected]> wrote:
> Hi
>
> Has anyone tried to create Nepali language data for tesseract ?
>
> I think Hindi/Sanskrit data files can also be used for tesseract.


I think it should work with the current "-l hin" option (tesseract's
Hindi traineddata).

Have YOU tried it yourself?

I got some errors, but have not played with the resolution, etc., to
try to reduce the errors.
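
For anyone who wants to try this in batch, here is a minimal sketch of driving the command-line tool from Python. It assumes the `tesseract` binary is on your PATH and that the Hindi traineddata is installed; the function names and the file names in the comment are made up for illustration.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang="hin"):
    """Return the argument list for: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="hin"):
    """Run tesseract on one image and return the path of the text file it writes.

    Assumes the tesseract binary is installed; tesseract appends ".txt"
    to the output base name.
    """
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    return output_base + ".txt"


# Hypothetical usage (file names are placeholders):
# run_ocr("nepali_page.tif", "nepali_page")
```

Rescanning at a different resolution and rerunning the same call is one way to see whether the error count changes.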


> I don't know which place is it to discuss about this : tesseract ocr forum
> or fossnepal.
>

I don't see tesseract explicitly among the applications on the fossnepal
front page (if that's any indication).

> Any suggestions on this ?
>
> Art is a librarian at the University of Windsor and have been working on
> using open source OCR for newspaper collections. He was asked about Nepali
> by a friend and became curious but he doesn't have a specific project for
> the language at this point. He opts
> tesseract <http://code.google.com/p/tesseract-ocr/> for this and wants
> to use it for newspaper pages in batch.
>
> Earlier I was interested in creating a Nepali OCR but I am these days more

You were going to write the whole engine, from scratch?  Wow.

> into creating Nepali Translator [Hindi or English to Nepali text
> translator<http://code.google.com/p/nepaliwikipediatranslator>
> ]
> I read tesseract-ocr threads daily but still I prefer to be called a noob
> in this regards.

Have you tried tesseract with "-l hin" on Nepali images?

Let us know your accuracy (and perhaps some idea of the resolution and
quality of your scan, etc.)
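
One possible way to put a number on accuracy, sketched below, is a character-level edit distance between a hand-typed ground truth and the OCR output. This metric is my assumption, not something tesseract reports itself, and the function names are illustrative. It compares Unicode code points, so a Devanagari vowel sign or virama counts as its own character.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, counted in code points."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_accuracy(ground_truth, ocr_text):
    """Return 1 - (edit distance / ground-truth length), floored at 0."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    errors = edit_distance(ground_truth, ocr_text)
    return max(0.0, 1.0 - errors / len(ground_truth))
```

Reporting this figure together with the scan resolution would make results from different people easier to compare.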

I believe accuracy has increased with the most recent (3.02) tesseract
version.

You need to compile 3.02 from svn.  Read the INSTALL.SVN, and don't
forget "make install-langs" at the end.

The correspondence you quote, I believe, predates the recent
improvements and additions in 3.02. Also, the assertion in it that
tesseract does not recognize conjoined characters is wrong.

I believe it is AND WAS wrong, in general, even before 3.02.

Conceptually, that is: I don't think tesseract is **aware** of
conjuncts, per se, as an object or algorithm -- it simply stores the
conjunct's prototype image (like any other glyph image) in its data
set, where that image is mapped to its UTF-8 representation
(however many bytes that representation might take, though there
*IS* a limit).
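
To make the encoding side of that concrete, here is a small sketch showing that one Devanagari conjunct is several code points and several UTF-8 bytes. It illustrates only the text encoding, not tesseract's internal data format; the function name is made up, and the conjunct (ka + virama + ssa) is just an example I picked.

```python
def describe(text):
    """Return the code points of `text` and the length of its UTF-8 encoding."""
    code_points = ["U+%04X" % ord(ch) for ch in text]
    return code_points, len(text.encode("utf-8"))


# One visual glyph: ka (U+0915) + virama (U+094D) + ssa (U+0937).
conjunct = "\u0915\u094D\u0937"
code_points, byte_count = describe(conjunct)
print(code_points, byte_count)
```

So a single glyph image in the training set can map to a multi-code-point, multi-byte string.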

As I understand it -- the challenge of improving this (and other
scripts') recognition accuracy has a lot to do with the training
(although perhaps not exclusively).

Regarding handwritten documents:  that seems daunting to me, but I may
not know enough of tesseract's internals to assess how much harder
handwritten images would be, than typeset ones.  I know that tesseract
makes multiple recognition passes, as it builds certain assumptions
and confidences on its first pass, to be used in the second.  Well, if
handwritten documents have a certain consistency which tesseract can
algorithmify (for pass 2+) then that's a plus.  But it seems it would
take very stylistically consistent handwriting for that to take
effect.

