Re: Defects of Tesseract 2.03 on Debian/Ubuntu?

udippel Sat, 14 Mar 2009 05:54:56 -0700

On Mar 14, 6:22 pm, lab <[email protected]> wrote:
> Here are some further manipulations, perhaps they are useful.

Interesting at least! Thanks.

> % tesseract fax95_cut_bw.tif output && cat output.txt
> Tesseract Open Source OCR Engine
> AZLAN AT UNITEN DOT EDU DOT MY

So converting the image back and forth produced a proper OCR result?
Maybe GIMP's default TIFF format is simply not one that tesseract
handles well?
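
One way to check the TIFF-format guess without a build environment is
to look at the header tags of both files and compare them. Below is a
minimal sketch in plain Python (standard library only) that reads the
first image directory of a TIFF and reports size, bit depth and
compression. The function name tiff_summary is made up for this
example; the tag and compression numbers are the ones from the TIFF
6.0 baseline. It is based on the assumption, as far as I recall, that
Tesseract 2.x reads only uncompressed TIFF unless built with libtiff,
so a fax-compressed (Group 3/4) or LZW file would be a suspect.

```python
import struct

# Tag numbers from the TIFF 6.0 baseline specification.
TAGS = {
    256: "ImageWidth",
    257: "ImageLength",
    258: "BitsPerSample",
    259: "Compression",
    262: "PhotometricInterpretation",
    277: "SamplesPerPixel",
}

# Common values of the Compression tag (259).
COMPRESSION = {
    1: "none",
    2: "CCITT RLE",
    3: "CCITT Group 3 (fax)",
    4: "CCITT Group 4 (fax)",
    5: "LZW",
    32773: "PackBits",
}


def tiff_summary(data):
    """Return selected tags from the first IFD of a TIFF byte string."""
    if data[:2] == b"II":
        endian = "<"  # little-endian ("Intel") byte order
    elif data[:2] == b"MM":
        endian = ">"  # big-endian ("Motorola") byte order
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a TIFF file")

    (count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    info = {}
    for i in range(count):
        entry = ifd_offset + 2 + 12 * i  # each IFD entry is 12 bytes
        tag, ftype, n = struct.unpack_from(endian + "HHI", data, entry)
        # Only single-valued tags are handled; e.g. the three-valued
        # BitsPerSample of an RGB image is skipped.
        if tag not in TAGS or n != 1:
            continue
        if ftype == 3:  # SHORT, stored inline in the entry
            (value,) = struct.unpack_from(endian + "H", data, entry + 8)
        elif ftype == 4:  # LONG, stored inline in the entry
            (value,) = struct.unpack_from(endian + "I", data, entry + 8)
        else:
            continue
        info[TAGS[tag]] = value

    if "Compression" in info:
        code = info["Compression"]
        info["Compression"] = COMPRESSION.get(code, "unknown (%d)" % code)
    return info
```

Calling tiff_summary(open("fax000000095.tif", "rb").read()) and the
same for fax95_cut_bw.tif, and then comparing the two results, should
show whether the files differ in compression or bit depth.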

> % tesseract fax000000095.tif output -l eng && cat output.txt
> Tesseract Open Source OCR Engine
> AZLAN AT UNITEN DOT EDU DOT MY A
>
> % tesseract fax000000095.tif output -l fra && cat output.txt
> Tesseract Open Source OCR Engine
> AZLAN AT UNITEN DOT EDU DOT MY *

Which was to be expected: French has no single 'a' in its vocabulary,
English has. To debug tesseract, it would be good to see what it
actually 'sees' at the 'A'.
Also, one might try converting fax000000095.tif back and forth as
well. Did you try that?
The misery here is that we run Debian on a slow embedded system, and I
have no build environment.

To add from my side: I uninstalled the Debian (lenny) package and
installed the old Etch package, Tesseract 1.02. With that version,
fax000000095.tif resolves to
"... MY                      M"
So it remains, IMHO, an imaging/layout problem. And this might well be
the reason for the other frequent 'bad result' posts that we have seen
here.
Again, I wonder whether it is possible to 'see' the image as it is
directly before character recognition. I am pretty sure that some
artifacts are introduced at that stage, so that the beauty and
correctness of the engine itself are compromised.

Uwe

