On Mar 15, 6:11 pm, udippel <[email protected]> wrote:
> > % pbmclean -m 3 fax000000095.pbm | pnmcrop > fax000000095_c3.pbm
> > % xloadimage -identify fax000000095_c3.pbm
> > fax000000095_c3.pbm is a 1203x42 RawBits PBM image
>
> > % convert fax000000095_c3.pbm fax000000095_c3.tif
> > % tesseract fax000000095_c3.tif output -l eng && cat output.txt
> > Tesseract Open Source OCR Engine
> > AZLAN AT UNITEN DOT EDU DOT MY
Okay, I get this same result once I have removed those three tiny
dots that form a vertically flipped 'L':
x
xx
towards the end of the line. pbmclean -m 3 did the same. I guess
that if I put the same dots into another line, tesseract will not come
up with any OCR output. Actually, again, this is where we found
tesseract to be much better than Ocrad: since we work with faxes, we
see a lot of lines with a dot or scratch, which tesseract would always
remove. To me the behaviour with fax000000095 is a bug: no OCR is
supposed to 'recognize' an isolated (!) letter with a height of 2.
Judging from the position, 1.02 seems to do a much better job than
2.03, since those tiny dots are several dozen blanks away from the
'MY'. So the 'MY A' is wrong by any measure.
Question to the developers: does Tesseract filter out (i.e. remove)
'characters' in a line of recognizable characters when they are far
below any plausible character height? I would definitely hope so. In
this line, for example, the character height is >30, with a few
sparse, isolated dots of height <3. Does Tesseract attempt to discard
those? I think it should.
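
To make concrete what kind of filtering I mean, here is a minimal
sketch in Python. It is not Tesseract's code and I do not claim it
matches anything Tesseract does internally; the function name
remove_small_blobs and the min_height parameter are made up for this
illustration. It assumes the line image is already available as a list
of rows of 0/1 pixels, and it drops every 8-connected blob whose
bounding-box height is below a threshold:

```python
from collections import deque


def remove_small_blobs(bitmap, min_height=3):
    """Return a copy of bitmap with short connected blobs cleared.

    bitmap: list of rows, each a list of 0/1 values (1 = black pixel).
    min_height: hypothetical threshold; blobs whose bounding box is
    lower than this many pixels are treated as noise and removed.
    """
    rows = len(bitmap)
    cols = len(bitmap[0]) if rows else 0
    out = [row[:] for row in bitmap]
    seen = [[False] * cols for _ in range(rows)]

    for r in range(rows):
        for c in range(cols):
            if not bitmap[r][c] or seen[r][c]:
                continue
            # Flood-fill one 8-connected component starting at (r, c).
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and bitmap[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Height of the component's bounding box, in pixels.
            ys = [y for y, _ in pixels]
            height = max(ys) - min(ys) + 1
            if height < min_height:
                for y, x in pixels:
                    out[y][x] = 0
    return out


# A 5-pixel-high stroke at the left, plus three stray dots
# (height 2) further to the right.
line = [
    [1, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 1],
]
cleaned = remove_small_blobs(line, min_height=3)
```

In this toy example the tall stroke survives and the three stray dots
are cleared. A fixed threshold is of course too crude for real use: it
would have to be chosen relative to the character height of the line,
so that the dots on 'i' and 'j' and punctuation are not thrown away.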
Uwe