Re: Defects of Tesseract 2.03 on Debian/Ubuntu?

lab Sat, 14 Mar 2009 16:58:36 -0700

Uwe,

Some more experiments :)


% convert fax000000095.tif fax000000095.pbm
% convert fax000000095.pbm fax000000095_bw.tif

% tesseract fax000000095_bw.tif output -l eng && cat output.txt
Tesseract Open Source OCR Engine
AZLAN AT UNITEN DOT EDU DOT MY A

% xloadimage -identify fax000000095_bw.tif
fax000000095_bw.tif is a 3456x4677 single-plane white-on-black
standard TIFF image Titled "fax000000095_bw.tif"

% pnmcrop fax000000095.pbm | xloadimage -identify stdin
stdin is a 2964x4376 RawBits PBM image

% pbmclean fax000000095.pbm | pnmcrop > fax000000095_c1.pbm
% xloadimage -identify fax000000095_c1.pbm
fax000000095_c1.pbm is a 2200x4376 RawBits PBM image

% convert fax000000095_c1.pbm fax000000095_c1.tif
% tesseract fax000000095_c1.tif output -l eng && cat output.txt
Tesseract Open Source OCR Engine
AZLAN AT UNITEN DOT EDU DOT MY A

% pbmclean -m 3 fax000000095.pbm | pnmcrop > fax000000095_c3.pbm
% xloadimage -identify fax000000095_c3.pbm
fax000000095_c3.pbm is a 1203x42 RawBits PBM image

% convert fax000000095_c3.pbm fax000000095_c3.tif
% tesseract fax000000095_c3.tif output -l eng && cat output.txt
Tesseract Open Source OCR Engine
AZLAN AT UNITEN DOT EDU DOT MY

% pnmpad -white -width 2200 -height 4376 fax000000095_c3.pbm > fax000000095_c3_pad.pbm
% xloadimage -identify fax000000095_c3_pad.pbm
fax000000095_c3_pad.pbm is a 2200x4376 RawBits PBM image

% convert fax000000095_c3_pad.pbm fax000000095_c3_pad.tif
% tesseract fax000000095_c3_pad.tif output -l eng && cat output.txt
Tesseract Open Source OCR Engine
AZLAN AT UNITEN DOT EDU DOT MY
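
The clean-and-crop steps above could be wrapped in a small loop, to see at
which -m value the cropped size collapses to the single text line. This is
only a rough sketch: pbm_dims and sweep_pbmclean are made-up helper names,
and pbm_dims assumes the width and height sit together on the line after
the P1/P4 magic number (as they do in the files written here).

```shell
# Print WIDTHxHEIGHT from the header of a PBM file (P1 or P4).
# Assumption: width and height share one line after the magic number.
pbm_dims() {
    {
        read -r pbm_magic || return 1
        case "$pbm_magic" in
            P1|P4) ;;
            *) return 1 ;;
        esac
        while read -r pbm_w pbm_h; do
            # Skip header comment lines.
            case "$pbm_w" in '#'*) continue ;; esac
            echo "${pbm_w}x${pbm_h}"
            return 0
        done
        return 1
    } < "$1"
}

# For each -m value given, clean and crop the source PBM (same commands
# as in the transcript) and report the size of the cropped result.
sweep_pbmclean() {
    sweep_src=$1
    shift
    for sweep_m in "$@"; do
        sweep_out="${sweep_src%.pbm}_c${sweep_m}.pbm"
        pbmclean -m "$sweep_m" "$sweep_src" | pnmcrop > "$sweep_out"
        echo "m=$sweep_m $(pbm_dims "$sweep_out")"
    done
}

# Possible usage (not run here):
#   sweep_pbmclean fax000000095.pbm 1 2 3
```

Each resulting _cN.pbm can then go through convert and tesseract exactly
as above.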

Laird.

On Mar 14, 10:54 pm, udippel <[email protected]> wrote:
> On Mar 14, 6:22 pm, lab <[email protected]> wrote:
>
> > Here are some further manipulations; perhaps they are useful.
>
> Interesting at least! Thanks.
>
> > % tesseract fax95_cut_bw.tif output && cat output.txt
> > Tesseract Open Source OCR Engine
> > AZLAN AT UNITEN DOT EDU DOT MY
>
> So the conversion back and forth produced a proper OCR result? Maybe
> the default TIFF format of GIMP was simply not suitable for
> tesseract?
>
> > % tesseract fax000000095.tif output -l eng && cat output.txt
> > Tesseract Open Source OCR Engine
> > AZLAN AT UNITEN DOT EDU DOT MY A
>
> > % tesseract fax000000095.tif output -l fra && cat output.txt
> > Tesseract Open Source OCR Engine
> > AZLAN AT UNITEN DOT EDU DOT MY *
>
> Which was to be expected: French has no single 'a' in its vocabulary,
> English has. To debug tesseract, it would be good to see what it
> actually 'sees' at the 'A'.
> Also, one might try to convert fax000000095.tif back and forth.
> Did you try that?
> The trouble here is that we run Debian on an embedded system and I
> have no build environment; and it is slow.
>
> To be added from my side: I uninstalled the Debian package (lenny),
> and added the old Etch-Tesseract 1.02. And then the fax000000095.tif
> resolves to
> "... MY                      M"
> So it remains - IMHO - an imaging/layout problem. And this might as
> well be the reason for the frequent other 'bad result' posts that we
> have seen here.
> Again, I wonder if it is possible to 'see' the image directly before
> the character recognition. I am pretty sure that some artifacts are
> introduced, so that the beauty and correctness of the engine itself
> are compromised.
>
> Uwe
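
To tabulate what each run appends after the expected text (the 'A', '*'
or 'M' seen in this thread), a small helper along these lines might do.
It is a sketch only: trailing_extra is a made-up name, and it simply
assumes the OCR line starts with the expected string.

```shell
# Print whatever follows the expected text in an OCR result line,
# with leading whitespace removed. Returns 1 if the line does not
# start with the expected text at all.
trailing_extra() {
    te_expected=$1
    te_actual=$2
    case "$te_actual" in
        "$te_expected"*)
            te_rest=${te_actual#"$te_expected"}
            # Strip the run of whitespace at the start of the remainder.
            te_rest=${te_rest#"${te_rest%%[![:space:]]*}"}
            printf '%s\n' "$te_rest"
            ;;
        *)
            return 1
            ;;
    esac
}

# Possible usage, once tesseract has written output.txt:
#   trailing_extra "AZLAN AT UNITEN DOT EDU DOT MY" "$(head -n 1 output.txt)"
```

An empty line from the helper means the run produced no stray character.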