I looked at your image.

If this is what the raw image from the scanner looks like, I do not
have an answer.

If you are preprocessing the image before passing it to Tesseract, it
looks like you are converting too many gray or white pixels to
black.  If the raw image has white or gray pixels between the digits,
you can zoom by 2x using nearest-neighbor interpolation to widen
the gaps between the digits.  Then you can zoom further with a
smoothing interpolation (bilinear or bicubic, for example) if
necessary.  This generally leaves you with a grayscale image.
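
As a rough illustration of the two zoom steps, here is a minimal
pure-Python sketch.  It assumes the image is a list of rows of gray
levels (0 = black, 255 = white); the function names zoom_nearest and
zoom_bilinear are made up for this example, and bilinear is only one
possible choice for the second, smoothing step.  In practice you would
use your imaging library's resize call instead.

```python
def zoom_nearest(pixels, factor=2):
    """Upscale by pixel replication (nearest neighbor).

    No gray levels are blended, so a light gap between two dark
    strokes stays fully light and becomes `factor` times wider.
    """
    out = []
    for row in pixels:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def zoom_bilinear(pixels, factor=2):
    """Upscale with bilinear interpolation (hypothetical second step).

    Neighboring pixels are blended, so the result is grayscale.
    """
    h, w = len(pixels), len(pixels[0])
    out = []
    for y in range(h * factor):
        # Map the output pixel center back into source coordinates.
        sy = min(max((y + 0.5) / factor - 0.5, 0.0), h - 1)
        y0 = int(sy)
        y1 = min(y0 + 1, h - 1)
        fy = sy - y0
        row = []
        for x in range(w * factor):
            sx = min(max((x + 0.5) / factor - 0.5, 0.0), w - 1)
            x0 = int(sx)
            x1 = min(x0 + 1, w - 1)
            fx = sx - x0
            top = pixels[y0][x0] * (1 - fx) + pixels[y0][x1] * fx
            bottom = pixels[y1][x0] * (1 - fx) + pixels[y1][x1] * fx
            row.append(round(top * (1 - fy) + bottom * fy))
        out.append(row)
    return out


# Two dark strokes separated by a single light pixel.
strip = [[0, 255, 0]]
doubled = zoom_nearest(strip)         # the one-pixel gap is now two pixels wide
enlarged = zoom_bilinear(doubled, 2)  # further zoom; edges become gray
```

Doing the nearest-neighbor step first keeps the gap at full
brightness before any blending happens, which is presumably why it
helps with touching digits.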

I have had more success feeding such images to Tesseract as grayscale
than thresholding them to black and white first.

On Aug 12, 2:11 am, "Jimmy O'Regan" wrote:
> On 12 August 2010 08:01, patrickq wrote:
>
> > See http://www.scanbizcards.com/touchingdigits.jpg
> > Includes a tel number where "OO" appear twice with no spacing, i.e.
> > touching. Tesseract fails on both sets, returning:
> > (65)81W6W instead of (65)8100 6002
> > ("00" -> "W" and "002" -> "W")
>
> > I have not seen Tesseract do well with hardly any situation where two
> > letters were touching - yet ironically I have seen plenty of examples
> > where a letter got chopped up in 2 or 3 pieces, for example:
> > |\| instead of N
>
> > Any idea what's going on and why Tesseract doesn't attempt to
> > recognize "00" as two 0's?
>
> It's something Google have said they're working on (primarily to
> support Arabic, where all characters are joined). As is, you could
> just train frequent instances as ligatures.
