On Mar 16, 10:11 am, lab <[email protected]> wrote:
> There is a summary (probably a bit out of date, but still usable) of
> the algorithmic aspects of tesseract in Ray Smith "An Overview of the
> Tesseract OCR Engine".
Thanks for the tip.
> I think it can explain the 'A' problem, because
> letter features are normalized to be size independent (this is a good
> thing in usual cases)
I can't agree here, and I suspect few would. Attempting OCR on a glyph
only 2 pixels high is crazy and can never yield anything sane.
Second argument against it: the dots are some tens of blanks away from
any other character. No layout contains an isolated character that is
only a few pixels high and can still be recognized reliably.
I still consider the OCR buggy, since tens of blanks should show up as
tens of blanks and not be artificially collapsed. (That is also why I
didn't notice the dots before: I had searched for them in the
surroundings right after 'MY'.)
> % tesseract fax000000095.tif output batch.nochop makebox && cat
> output.txt
Oh, thanks for the debug command! Until now, I had looked for it in
vain.
> I can confirm that the spurious 'A' is filtered out if there is no
> other text in the image.
[...]
> % tesseract dots.tif output -l eng && cat output.txt
> Tesseract Open Source OCR Engine
>
... and that is how it ought to behave with text on the same line, too;
see above, or compare with ocrad.
> Here is another experiment: I have edited the file fax000000095.pbm
> with Gimp and I have erased the last part "MY" of the email address
> only. The new image is called fax000000095_no_MY.pbm
>
> % convert fax000000095_no_MY.pbm fax000000095_no_MY.tif
> % tesseract fax000000095_no_MY.tif output -l eng && cat output.txt
> Tesseract Open Source OCR Engine
> AZLAN AT UNITEN DOT EDU DOT “
>
> As you can see, the three dots are no longer recognized as 'A', but as
> some other unicode symbol with the same English dictionary (of course
> my terminal font doesn't display it correctly, and I also checked that
> there are no leftover pixels near the location where MY used to be).
> This experiment shows that tesseract's adaptive classifier is playing
> a role here, not just the static character classifier (see Ray's paper
> referred to earlier for details).
I'd love to have more options to pass to tesseract, e.g. a minimum
height for characters to be recognized, or a choice between ASCII and
UTF-8 output.
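
Until such options exist, a preprocessing and postprocessing step could
approximate both wishes. Here is a minimal sketch in Python, assuming
the page is already available as rows of 0/1 pixels (1 = ink). The
names remove_small_blobs and ascii_only are my own and not part of
tesseract; nothing here reflects how tesseract works internally.

```python
from collections import deque


def remove_small_blobs(image, min_height):
    """Return a copy of `image` with every connected blob whose bounding
    box is shorter than `min_height` rows erased.

    `image` is a list of equally long rows holding 0 (paper) or 1 (ink).
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    result = [row[:] for row in image]
    seen = [[False] * cols for _ in range(rows)]

    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Flood-fill one 8-connected blob and collect its pixels.
            blob = []
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                blob.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1
                                and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Height of the blob's bounding box in pixel rows.
            height = max(y for y, _ in blob) - min(y for y, _ in blob) + 1
            if height < min_height:
                for y, x in blob:
                    result[y][x] = 0
    return result


def ascii_only(text):
    """Drop every non-ASCII character from OCR output."""
    return text.encode("ascii", "ignore").decode("ascii")
```

One caveat: a plain height threshold also erases genuine small marks
such as periods or the dot on an 'i', so a real filter would probably
also have to look at how far a blob is from its neighbours.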
Thanks again for the great explanations and your efforts!
Uwe