Re: Defects of Tesseract 2.03 on Debian/Ubuntu?

udippel Sun, 15 Mar 2009 03:11:42 -0700

On Mar 15, 7:58 am, lab <[email protected]> wrote:

> Some more experiments :)

Good!

> % convert fax000000095.tif fax000000095.pbm
> % convert fax000000095.pbm fax000000095_bw.tif

My line is different, simpler:
tifftopnm fax000000095.tif > fax000000095.pnm
ocrad fax000000095.pnm
AZLAN AT UNITEN DOT EDU DOT MY

We were at this point a year ago. ocrad is really beautiful: small
footprint, very well supported.
But our test samples back then showed better overall accuracy with
tesseract.

I should add that we scanned the same page three times on the same
machine. Two of the (fax) images go through tesseract cleanly and give
the expected result. Only one, the one that I uploaded, produces the
artifacts.

> % pnmcrop fax000000095.pbm | xloadimage -identify stdin
> stdin is a 2964x4376 RawBits PBM image
>
> % pbmclean fax000000095.pbm | pnmcrop > fax000000095_c1.pbm
> % xloadimage -identify fax000000095_c1.pbm
> fax000000095_c1.pbm is a 2200x4376 RawBits PBM image
>
> % convert fax000000095_c1.pbm fax000000095_c1.tif
> % tesseract fax000000095_c1.tif output -l eng && cat output.txt
> Tesseract Open Source OCR Engine
> AZLAN AT UNITEN DOT EDU DOT MY A

This branch is interesting. The cropped and cleaned image still yields
OCR output for something that is - I assume - not visible. Could you
check with an image viewer?
The ocrad log for the very first image above is unambiguous:

# Ocr Results File. Created by GNU Ocrad version 0.17
source file fax000000095.pnm
total text blocks 1
text block 1 0 0 3456 4677
lines 1
line 1 chars 30 height 39
1136 2392 41 39; 1, 'A'0
1182 2392 30 39; 2, 'Z'1, 'z'0
1221 2392 34 39; 1, 'L'0
1257 2392 41 39; 1, 'A'0
1299 2392 38 39; 1, 'N'0
1337 2392 40 39; 1, ' '0
1377 2393 40 38; 1, 'A'0
1420 2393 35 38; 1, 'T'0
1455 2392 44 39; 1, ' '0
1499 2393 38 39; 2, 'U'1, 'u'0
1539 2393 39 39; 1, 'N'0
1585 2393 27 39; 1, 'I'0
1621 2393 36 39; 1, 'T'0
1662 2393 33 39; 1, 'E'0
1700 2393 38 40; 1, 'N'0
1738 2393 43 39; 1, ' '0
1781 2394 35 38; 1, 'D'0
1821 2393 35 40; 2, 'O'1, 'o'0
1861 2394 36 39; 1, 'T'0
1897 2393 45 39; 1, ' '0
1942 2394 33 39; 1, 'E'0
1981 2394 36 39; 1, 'D'0
2021 2394 38 40; 2, 'U'1, 'u'0
2059 2393 42 39; 1, ' '0
2101 2394 36 39; 1, 'D'0
2142 2394 35 40; 2, 'O'1, 'o'0
2183 2394 36 40; 1, 'T'0
2219 2393 43 39; 1, ' '0
2262 2395 37 39; 1, 'M'0
2302 2395 36 38; 1, 'Y'0

Where does the extra 'A' (or 'M') come from? And why in only one out of
3 scans made with the automatic feed of a copy/fax machine?
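
To go through such a log without reading it by eye, here is a minimal
Python sketch. It assumes every character line has the shape seen in the
log above (left top width height; number of guesses, then 'char'value
pairs). The function names are made up for illustration, and guesses that
are themselves quote or backslash characters are not handled.

```python
import re

# Assumed shape of a character line, taken from the log excerpt:
#   left top width height; n, 'c'v[, 'c'v ...]
CHAR_LINE = re.compile(r"^(\d+) (\d+) (\d+) (\d+); (\d+)(.*)$")
GUESS = re.compile(r", '(.)'(\d+)")


def parse_orf_chars(text):
    """Return one dict per character line; header lines are skipped."""
    chars = []
    for raw in text.splitlines():
        match = CHAR_LINE.match(raw.strip())
        if match is None:
            continue  # e.g. "source file ...", "lines 1", "line 1 chars ..."
        left, top, width, height = (int(match.group(i)) for i in range(1, 5))
        guesses = [(c, int(v)) for c, v in GUESS.findall(match.group(6))]
        chars.append({"box": (left, top, width, height), "guesses": guesses})
    return chars


def first_guess_text(chars):
    """Join the first guess of every character (hypothetical helper)."""
    return "".join(c["guesses"][0][0] for c in chars if c["guesses"])


def ambiguous(chars):
    """Characters for which more than one guess was recorded."""
    return [c for c in chars if len(c["guesses"]) > 1]


# First few lines of the log above, used as sample input.
SAMPLE = """\
line 1 chars 30 height 39
1136 2392 41 39; 1, 'A'0
1182 2392 30 39; 2, 'Z'1, 'z'0
1221 2392 34 39; 1, 'L'0
1257 2392 41 39; 1, 'A'0
1299 2392 38 39; 1, 'N'0
"""

if __name__ == "__main__":
    parsed = parse_orf_chars(SAMPLE)
    print(first_guess_text(parsed))
    print([c["guesses"] for c in ambiguous(parsed)])
```

Run over the full log, the only multi-guess entries should be the
upper/lower case pairs (Z/z, U/u, O/o), and nothing follows the 'Y'.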

> % pbmclean -m 3 fax000000095.pbm | pnmcrop > fax000000095_c3.pbm
> % xloadimage -identify fax000000095_c3.pbm
> fax000000095_c3.pbm is a 1203x42 RawBits PBM image
>
> % convert fax000000095_c3.pbm fax000000095_c3.tif
> % tesseract fax000000095_c3.tif output -l eng && cat output.txt
> Tesseract Open Source OCR Engine
> AZLAN AT UNITEN DOT EDU DOT MY
>
> % pnmpad -white -width 2200 -height 4376 fax000000095_c3.pbm > fax000000095_c3_pad.pbm
> % xloadimage -identify fax000000095_c3_pad.pbm
> fax000000095_c3_pad.pbm is a 2200x4376 RawBits PBM image
>
> % convert fax000000095_c3_pad.pbm fax000000095_c3_pad.tif
> % tesseract fax000000095_c3_pad.tif output -l eng && cat output.txt
> Tesseract Open Source OCR Engine
> AZLAN AT UNITEN DOT EDU DOT MY

There is nothing unexpected here, white is white.
One of the major advantages we saw in using tesseract was that the
default output was pretty much free of the random dots and scratches
picked up during scanning. (I had written an extensive filter before
ocrad was at that level.)
According to the output, the artifacts always show up in exactly the
same line that contains the information. And when I look at the image,
there is nothing. Nothing at all. So our first assumption is some kind
of aliasing. This is supported by the strange OCR results that others
have seen, as presented here.
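
One way to check the "nothing visible" part is to cut out only the strip
of the page around that text line and view it enlarged. A minimal Python
sketch, assuming the ocrad boxes are left/top/width/height in pixels of
the uncropped 3456x4677 image (the "text block" line of the ocrad log).
The function name, the margin and the output file name are made up, and
the printed command assumes pnmcut's -left/-top/-width/-height options.

```python
def line_strip(boxes, page_width, page_height, margin=50):
    """Full-width strip (left, top, width, height) around a text line.

    boxes: (left, top, width, height) tuples of the line's characters.
    The strip spans the whole page width so that anything to the left or
    right of the text on the same line stays inside the cut.
    """
    tops = [top for _, top, _, _ in boxes]
    bottoms = [top + height for _, top, _, height in boxes]
    top = max(0, min(tops) - margin)
    bottom = min(page_height, max(bottoms) + margin)
    return (0, top, page_width, bottom - top)


if __name__ == "__main__":
    # First and last character boxes from the ocrad log ('A' ... 'Y').
    boxes = [(1136, 2392, 41, 39), (2302, 2395, 36, 38)]
    left, top, width, height = line_strip(boxes, 3456, 4677)
    # Assumed pnmcut invocation; strip.pnm is a hypothetical output name.
    print(
        f"pnmcut -left {left} -top {top} -width {width} -height {height} "
        "fax000000095.pnm > strip.pnm"
    )
```

The coordinates only apply to the image ocrad was run on; for the
pnmcrop'ed files the offsets are different.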

Still curious how to get debug results. Any clue?

Uwe
