You could probably improve its ability to recognize "00" as two 0's by
training it on such paired symbols.
Mind you, I have also been surprised by cases where a perfectly clear
and flawless symbol gets subdivided, like an N becoming |\| or an H
becoming I-I. This indicates that Tesseract has code to subdivide blobs
on some basis other than there being "space" between them. However, that
code seems to behave erratically.
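
To make the idea of splitting a blob without any white space concrete, here
is a minimal Python sketch. It is NOT Tesseract's code and I am not claiming
Tesseract works this way; it is only a generic illustration that cuts a blob
at the column containing the least ink. The names column_ink and
split_touching, the margin parameter and the tiny 5x5 "0" bitmaps are all
made up for this example.

```python
def column_ink(bitmap):
    """Count the ink pixels (1s) in each column of a binary bitmap."""
    return [sum(col) for col in zip(*bitmap)]


def split_touching(bitmap, margin=2):
    """Return the interior column with the least ink, or None if too narrow.

    `margin` columns at each edge are ignored so a thin sliver is never cut
    off. This is a hypothetical helper, not part of any OCR library.
    """
    ink = column_ink(bitmap)
    width = len(ink)
    if width <= 2 * margin:
        return None
    interior = range(margin, width - margin)
    return min(interior, key=lambda x: ink[x])


# Two crude 5x5 "0" glyphs joined by a one-pixel bridge in column 5.
ZERO = ["01110",
        "10001",
        "10001",
        "10001",
        "01110"]
BRIDGE = ["0", "0", "1", "0", "0"]
rows = [z + b + z for z, b in zip(ZERO, BRIDGE)]
bitmap = [[int(c) for c in row] for row in rows]

cut = split_touching(bitmap)                 # column of the bridge
left = [row[:cut] for row in bitmap]         # first "0"
right = [row[cut + 1:] for row in bitmap]    # second "0"
```

Note that the columns inside each "0" hold only one more ink pixel than the
bridge does, so a slightly noisier image could put the cut in the middle of a
glyph instead. That may be one way a splitter ends up chopping an intact
letter into pieces, though this sketch cannot show whether that is what
happens inside Tesseract.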
patrickq wrote, On 2010-08-12 02:01:
See http://www.scanbizcards.com/touchingdigits.jpg
It includes a telephone number in which "00" appears twice with no spacing,
i.e. the digits are touching. Tesseract fails on both sets, returning:
(65)81W6W instead of (65)8100 6002
("00" -> "W" and "002" -> "W")
I have hardly ever seen Tesseract do well in situations where two
letters are touching. Yet, ironically, I have seen plenty of examples
where a letter got chopped up into 2 or 3 pieces, for example:
|\| instead of N
Any idea what's going on and why Tesseract doesn't attempt to
recognize "00" as two 0's?