Re: Defects of Tesseract 2.03 on Debian/Ubuntu?

udippel Mon, 16 Mar 2009 06:22:10 -0700

On Mar 16, 10:11 am, lab <[email protected]> wrote:

> There is a summary (probably a bit out of date, but still usable) of
> the algorithmic aspects of tesseract in Ray Smith "An Overview of the
> Tesseract OCR Engine".

Thanks for the tip.

> I think it can explain the 'A' problem, because
> letter features are normalized to be size independent (this is a good
> thing in usual cases)

I can't agree here, and probably few would. Running OCR on something
2 pixels high is crazy, and can never result in anything sane.
Second argument against it: the dots are some tens of blanks away from
any other character. No layout contains an isolated character that is
only a few pixels high and can still be usefully recognized.

I still consider the OCR buggy, since tens of blanks should show as
tens of blanks, not be artificially concatenated. (Which is also why I
didn't notice the dots before: I had searched for dots in the area
right after 'MY'.)

>  % tesseract fax000000095.tif output batch.nochop makebox && cat
> output.txt

Oh, thanks for the debug command! Until now, I had looked for it in
vain.

> I can confirm that the spurious 'A' is filtered out if there is no
> other text in the image.
[...]
> % tesseract dots.tif output -l eng && cat output.txt
> Tesseract Open Source OCR Engine
>

... and so it ought to be with text in the same line, see above or
compare to ocrad.

> Here is another experiment: I have edited the file fax000000095.pbm
> with Gimp and I have erased the last part "MY" of the email address
> only. The new image is called fax000000095_no_MY.pbm
>
> % convert  fax000000095_no_MY.pbm fax000000095_no_MY.tif
> % tesseract fax000000095_no_MY.tif output -l eng && cat output.txt
> Tesseract Open Source OCR Engine
> AZLAN AT UNITEN DOT EDU DOT “
>
> As you can see, the three dots are no longer recognized as 'A', but as
> some other unicode symbol with the same English dictionary (of course
> my terminal font doesn't display it correctly, and I also checked that
> there are no leftover pixels near the location where MY used to be).
> This experiment shows that tesseract's adaptive classifier is playing
> a role here, not just the static character classifier (see Ray's paper
> referred to earlier for details).

I'd love to have more options to pass to tesseract, e.g. a minimum
height for characters to be recognized, or a choice between ASCII and
UTF-8 output.
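
Lacking such an option, one workaround might be to post-filter the box
output from the makebox command quoted above. The following is only a
sketch: it assumes each box line reads "<char> <left> <bottom> <right>
<top>" with the origin at the bottom left, and MIN_HEIGHT and both
function names are made up for illustration, not part of tesseract.

```python
# Sketch: drop very small boxes from tesseract "makebox" output.
# Assumed line format: <char> <left> <bottom> <right> <top> [extra fields]
# MIN_HEIGHT is a hypothetical threshold, not a tesseract option.

MIN_HEIGHT = 5  # pixels; choose to suit the scan resolution


def parse_box_line(line):
    """Return (char, left, bottom, right, top) for one box line."""
    parts = line.split()
    char = parts[0]
    left, bottom, right, top = (int(p) for p in parts[1:5])
    return char, left, bottom, right, top


def filter_small_boxes(lines, min_height=MIN_HEIGHT):
    """Keep only the boxes that are at least min_height pixels tall."""
    kept = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        box = parse_box_line(line)
        height = box[4] - box[2]  # top minus bottom
        if height >= min_height:
            kept.append(box)
    return kept
```

Reading the box file line by line and passing the lines to
filter_small_boxes() would leave only the boxes tall enough to be real
characters. It does not change what tesseract itself recognizes, so
the adaptive classifier effect described above would still occur.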

Thanks again for the great explanations and your efforts!

Uwe

