Uwe,

Some more experiments :)

% convert fax000000095.tif fax000000095.pbm
% convert fax000000095.pbm fax000000095_bw.tif
% tesseract fax000000095_bw.tif output -l eng && cat output.txt
Tesseract Open Source OCR Engine
AZLAN AT UNITEN DOT EDU DOT MY A

% xloadimage -identify fax000000095_bw.tif
fax000000095_bw.tif is a 3456x4677 single-plane white-on-black standard TIFF image
Titled "fax000000095_bw.tif"

% pnmcrop fax000000095.pbm | xloadimage -identify stdin
stdin is a 2964x4376 RawBits PBM image

% pbmclean fax000000095.pbm | pnmcrop > fax000000095_c1.pbm
% xloadimage -identify fax000000095_c1.pbm
fax000000095_c1.pbm is a 2200x4376 RawBits PBM image
% convert fax000000095_c1.pbm fax000000095_c1.tif
% tesseract fax000000095_c1.tif output -l eng && cat output.txt
Tesseract Open Source OCR Engine
AZLAN AT UNITEN DOT EDU DOT MY A

% pbmclean -m 3 fax000000095.pbm | pnmcrop > fax000000095_c3.pbm
% xloadimage -identify fax000000095_c3.pbm
fax000000095_c3.pbm is a 1203x42 RawBits PBM image
% convert fax000000095_c3.pbm fax000000095_c3.tif
% tesseract fax000000095_c3.tif output -l eng && cat output.txt
Tesseract Open Source OCR Engine
AZLAN AT UNITEN DOT EDU DOT MY

% pnmpad -white -width 2200 -height 4376 fax000000095_c3.pbm > fax000000095_c3_pad.pbm
% xloadimage -identify fax000000095_c3_pad.pbm
fax000000095_c3_pad.pbm is a 2200x4376 RawBits PBM image
% convert fax000000095_c3_pad.pbm fax000000095_c3_pad.tif
% tesseract fax000000095_c3_pad.tif output -l eng && cat output.txt
Tesseract Open Source OCR Engine
AZLAN AT UNITEN DOT EDU DOT MY

Laird.

On Mar 14, 10:54 pm, udippel <[email protected]> wrote:
> On Mar 14, 6:22 pm, lab <[email protected]> wrote:
>
> > Here are some further manipulations, perhaps they are useful.
>
> Interesting at least! Thanks.
>
> > % tesseract fax95_cut_bw.tif output && cat output.txt
> > Tesseract Open Source OCR Engine
> > AZLAN AT UNITEN DOT EDU DOT MY
>
> So the conversion forth and back brought up a proper OCR result?
> Maybe the default TIFF-format of GIMP was simply not conducive for
> tesseract?
>
> > % tesseract fax000000095.tif output -l eng && cat output.txt
> > Tesseract Open Source OCR Engine
> > AZLAN AT UNITEN DOT EDU DOT MY A
>
> > % tesseract fax000000095.tif output -l fra && cat output.txt
> > Tesseract Open Source OCR Engine
> > AZLAN AT UNITEN DOT EDU DOT MY *
>
> Which was to be expected: French has no single 'a' in its vocabulary,
> English has. It would be good, to debug tesseract, to see what it
> actually 'sees' at the 'A'.
> Also, one might try to convert the fax000000095.tif forth and back.
> Did you try that?
> The misery here is, that we run Debian on an embedded system and I
> have no build environment; and it is slow.
>
> To be added from my side: I uninstalled the Debian package (lenny),
> and added the old Etch-Tesseract 1.02. And then the fax000000095.tif
> resolves to
> "... MY M"
> So it remains - IMHO - an imaging/layout problem. And this might as
> well be the reason for the frequent other 'bad result' posts that we
> have seen here.
> Again, I wonder if it is possible to 'see' the image directly before
> the character recognition. I am pretty sure, that some artifacts are
> introduced, so that the beauty and correctness of the engine itself
> are compromised.
>
> Uwe
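
For anyone who wants to repeat the clean/crop/pad experiment from the top of this message on other faxes, it could be wrapped up roughly as below. This is only a sketch under assumptions: the function name ocr_cleaned_fax and the "_out" output name are made up here, the defaults (3, 2200, 4376) are simply the numbers from the session rather than recommended settings, and it assumes ImageMagick's convert, the netpbm tools and tesseract are all on the PATH.

```shell
#!/bin/sh
# Sketch: despeckle, crop, pad back to a page-sized canvas, then OCR.
# Usage: ocr_cleaned_fax INPUT.tif [MIN_NEIGHBOURS] [WIDTH] [HEIGHT]
# The function name and the defaults are illustrative assumptions; the
# defaults are just the values used in the session above.
ocr_cleaned_fax() {
    in=$1
    min=${2:-3}
    width=${3:-2200}
    height=${4:-4376}
    base=${in%.tif}

    # TIFF -> PBM so the netpbm tools can read it
    convert "$in" "${base}.pbm" || return 1

    # drop isolated pixels, then crop away the empty border
    pbmclean -m "$min" "${base}.pbm" | pnmcrop > "${base}_c${min}.pbm" || return 1

    # pad the cropped strip back out to a white page-sized canvas
    pnmpad -white -width "$width" -height "$height" \
        "${base}_c${min}.pbm" > "${base}_c${min}_pad.pbm" || return 1

    # PBM -> TIFF for tesseract, then OCR and print the recognised text
    convert "${base}_c${min}_pad.pbm" "${base}_c${min}_pad.tif" || return 1
    tesseract "${base}_c${min}_pad.tif" "${base}_out" -l eng && cat "${base}_out.txt"
}
```

Calling "ocr_cleaned_fax fax000000095.tif" should then run the same steps as the _c3_pad experiment; passing 1 as the second argument would presumably be close to the plain pbmclean run, though that depends on the pbmclean default.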

