Uwe,

There is a summary (probably a bit out of date, but still usable) of the algorithmic aspects of Tesseract in Ray Smith's "An Overview of the Tesseract OCR Engine". I think it can explain the 'A' problem: letter features are normalized to be size-independent (a good thing in the usual cases), so a very small blob can still be matched against a full-size letter.
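To illustrate why size normalization could let a tiny blob pass for a letter, here is a toy Python sketch. It is not Tesseract's code: the `normalize` helper and the coordinates are made up for illustration, and the real feature extraction is more involved (see Ray's paper). It only shows that once shapes are scaled to a common size, a 4-pixel blob and a 40-pixel glyph with the same outline become indistinguishable.

```python
def normalize(points):
    """Scale a blob's pixel coordinates into the unit square.

    Hypothetical helper for illustration only; Tesseract's real
    normalization is not reproduced here.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    min_x, min_y = min(xs), min(ys)
    # Guard against zero-width or zero-height blobs.
    width = (max(xs) - min_x) or 1
    height = (max(ys) - min_y) or 1
    return sorted(((x - min_x) / width, (y - min_y) / height) for x, y in points)


# Three dots in a rough triangle, about 4 pixels across (toy coordinates).
small_blob = [(0, 0), (4, 0), (2, 4)]
# The same arrangement ten times larger, like the extremes of a 40-pixel 'A'.
large_blob = [(0, 0), (40, 0), (20, 40)]

# After normalization the absolute size is gone, so the two compare equal.
same_shape = normalize(small_blob) == normalize(large_blob)
print(same_shape)
```

Any size-based rejection therefore has to happen before or outside this kind of normalization, for example by comparing a blob's bounding-box height with the height of the rest of the line.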
% tesseract fax000000095.tif output batch.nochop makebox && cat output.txt
Tesseract Open Source OCR Engine
A 1135 2243 1178 2285
Z 1182 2243 1213 2285
L 1220 2243 1256 2284
@ 1256 2244 1337 2286
A 1376 2244 1419 2285
T 1419 2245 1456 2285
U 1499 2242 1539 2284
N 1539 2243 1579 2284
I 1585 2242 1613 2284
T 1621 2243 1658 2284
E 1661 2242 1696 2284
N 1699 2242 1739 2284
D 1780 2243 1817 2284
O 1820 2242 1857 2284
T 1860 2242 1898 2283
E 1942 2243 1976 2284
D 1981 2243 2018 2284
U 2020 2242 2060 2284
D 2100 2241 2138 2284
O 2141 2241 2178 2284
T 2182 2241 2220 2284
M 2261 2241 2300 2282
Y 2302 2242 2339 2282
A 3332 2263 3336 2267

I can confirm that the spurious 'A' is filtered out if there is no other text in the image.

% pnmcut -right 2500 fax000000095.pbm > no_dots.pbm
% convert no_dots.pbm no_dots.tif
% pnmcut -left 2500 fax000000095.pbm > dots.pbm
% convert dots.pbm dots.tif
% tesseract dots.tif output -l eng && cat output.txt
Tesseract Open Source OCR Engine
% tesseract no_dots.tif output -l eng && cat output.txt
Tesseract Open Source OCR Engine
AZLAN AT UNITEN DOT EDU DOT MY

> Question to the developers: Does Tesseract filter (==remove)
> 'characters' within a line containing recognizable characters, that
> are way below possible heights of recognition? I would definitively
> hope so. Like in this line, the character height is >30, with some
> sparse, isolated dots of a height of <3. Does Tesseract attempt to
> discard those? I think it should.

I think in this case the problem is that by chance the blob (three dots) exists exactly in the middle of the recognized line, so tesseract tries to read it, and the 'A' interpretation depends on the context.

Here is another experiment: I have edited the file fax000000095.pbm with Gimp and I have erased the last part "MY" of the email address only.
The new image is called fax000000095_no_MY.pbm.

% convert fax000000095_no_MY.pbm fax000000095_no_MY.tif
% tesseract fax000000095_no_MY.tif output -l eng && cat output.txt
Tesseract Open Source OCR Engine
AZLAN AT UNITEN DOT EDU DOT “

As you can see, the three dots are no longer recognized as 'A' but as some other Unicode symbol, even though the same English language data was used. (My terminal font doesn't display the symbol correctly. I also checked that there are no leftover pixels near the location where "MY" used to be.)

This experiment shows that tesseract's adaptive classifier is playing a role here, not just the static character classifier (see Ray's paper referred to earlier for details).

Laird.

