On 12 August 2010 08:01, patrickq <[email protected]> wrote:
> See http://www.scanbizcards.com/touchingdigits.jpg
> Includes a tel number where "00" appears twice with no spacing, i.e.
> touching. Tesseract fails on both sets, returning:
> (65)81W6W instead of (65)8100 6002
> ("00" -> "W" and "002" -> "W")
>
> I have hardly ever seen Tesseract do well when two letters were
> touching - yet ironically I have seen plenty of examples where a
> letter got chopped up into 2 or 3 pieces, for example:
> |\| instead of N
>
> Any idea what's going on and why Tesseract doesn't attempt to
> recognize "00" as two 0's?

It's something Google have said they're working on (primarily to
support Arabic, where all characters are joined). As things stand, you
could just train the frequent touching pairs (such as "00") as
ligatures, i.e. as single units.
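
A minimal sketch of that ligature workaround, assuming the Tesseract 3.x
box-file layout "<text> <left> <bottom> <right> <top> <page>". It merges the
boxes of two touching glyphs into a single entry labelled with both
characters. The helper names and all coordinates are hypothetical, and the
result would still need to go through the normal training steps.

```python
# Sketch only: merge the box-file entries of two touching glyphs into one
# "ligature" entry. Assumes the box format
# "<text> <left> <bottom> <right> <top> <page>"; coordinates are made up.

def parse_box(line):
    """Split one box-file line into (text, left, bottom, right, top, page)."""
    text, left, bottom, right, top, page = line.split()
    return text, int(left), int(bottom), int(right), int(top), int(page)


def merge_boxes(a, b):
    """Return one box covering both a and b, labelled with both texts."""
    if a[5] != b[5]:
        raise ValueError("boxes are on different pages")
    return (
        a[0] + b[0],      # combined label, e.g. "0" + "0" -> "00"
        min(a[1], b[1]),  # left
        min(a[2], b[2]),  # bottom
        max(a[3], b[3]),  # right
        max(a[4], b[4]),  # top
        a[5],             # page
    )


def format_box(box):
    """Turn a box tuple back into a box-file line."""
    return " ".join(str(field) for field in box)


# Two hypothetical, slightly overlapping boxes for a touching "00" pair.
first = parse_box("0 120 40 143 92 0")
second = parse_box("0 142 40 166 92 0")
ligature_line = format_box(merge_boxes(first, second))
print(ligature_line)
```

The merged line replaces the two original lines in the box file, so the
touching pair is trained as one unit rather than as two separate zeros.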

-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.

