Hi Francesco,

Tesseract 3.0 actually recognizes all the digits in your sample image
just fine. I processed your image with the ScanBizCards iPhone
application (which uses Tesseract 3.0) and you can see screenshots at:
http://www.scanbizcards.com/boxes.jpg
http://www.scanbizcards.com/results.jpg

The first screenshot was taken during processing and shows, in red,
the boxes found by Tesseract during layout analysis. The second
screenshot is the text result, where you can see that all digits were
recognized properly.

We convert the image to grayscale (using unequal weights for the three
RGB components) before submitting it to Tesseract, so it's possible
that this makes the difference (but I doubt it). Note, by the way, how
Tesseract returns several imaginative matches for many of the '*'
characters. I'm not sure why, but you should be able to ignore these
in your code, for example by searching for consecutive sequences of
digits.
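
A minimal sketch of that digit filter, assuming the OCR result is
already available as a plain string. It is in Python for brevity; the
function name and the min_len parameter are made up for illustration
and are not part of Tesseract or TessNet2:

```python
import re

def digit_runs(ocr_text, min_len=1):
    """Return every run of consecutive ASCII digits in ocr_text.

    Runs shorter than min_len are dropped, which helps discard stray
    single digits produced by noise around the '*' characters.
    """
    return [run for run in re.findall(r"[0-9]+", ocr_text)
            if len(run) >= min_len]

# Junk characters around the digits are simply skipped.
print(digit_runs("*~^* 1234 *'* 567"))  # ['1234', '567']
```

Raising min_len is one way to ignore very short runs if the noise
occasionally contains a lone digit.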

Regarding your issue in general: you are right that Tesseract may do a
better job when processing an entire image, because it can draw
conclusions about text size (for example). In some cases, though, that
behavior is actually a bad thing, for example where each line is in a
totally different font and size! This is the case for what I scan, and
I have asked this forum for the Tesseract variable to turn such
adaptive learning OFF, but got no replies. Anyone out there with the
answer? I just want Tesseract to analyze each line separately,
"forgetting" anything it may have learned from other lines. I think
that means disabling the adaptive classifier, but I am not certain.

If you are having better luck scanning the entire image, I suggest
that instead of using blacklist / whitelist on sub-images, you may
want to do this:
- use a regular expression describing, for example, a number
- in that regular expression, don't just look for a sequence of
digits; use something like "[\\dIlOZ&]+", which means "accept a run
of digits, uppercase I, lowercase l, uppercase O, uppercase Z or &"
(because these characters look similar to digits)
- then, in the string matched by the regexp, just replace occurrences
of O with 0, Z with 2, & with 8, etc.
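
The steps above can be sketched roughly as follows, in Python for
brevity. The substitution table, the function name and the min_digits
guard are my own assumptions for illustration. The guard leaves alone
any match that contains no real digit, so that ordinary words such as
"Importo" are not mangled. The pattern uses + rather than * so that
empty matches are skipped, and [0-9] rather than \d so that only
ASCII digits count:

```python
import re

# Look-alike characters mapped to the digit they resemble.
# Hypothetical table: extend or trim it to match your own OCR errors.
CONFUSABLE = {"I": "1", "l": "1", "O": "0", "Z": "2", "&": "8"}
TRANSLATION = str.maketrans(CONFUSABLE)

# A run of digits and digit look-alikes.
NUMBER_LIKE = re.compile(r"[0-9IlOZ&]+")

def normalize_numbers(text, min_digits=1):
    """Replace look-alike characters inside number-like runs.

    A run is rewritten only if it already contains at least
    min_digits real digits; other runs are returned unchanged.
    """
    def fix(match):
        token = match.group(0)
        if sum(ch.isdigit() for ch in token) < min_digits:
            return token
        return token.translate(TRANSLATION)

    return NUMBER_LIKE.sub(fix, text)

print(normalize_numbers("Importo: 1O5Z&"))  # Importo: 10528
```

If a field is known to contain only numbers, min_digits=0 would
rewrite every matched run, at the cost of also rewriting runs that
were really letters.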

Patrick

On Feb 2, 1:40 pm, Francesco <[email protected]> wrote:
> Hi everybody,
> I'm writing an application to automatically scan tons of postal
> orders, using TessNet2 library from C#. Tesseract is great and
> recognizes about everything on the postal order. But, because some
> fields contain only numbers and some others only letters, I want to
> process single subimages from the whole picture, adjusting
> tessedit_char_blacklist and tessedit_char_whitelist variables for each
> of these.
> But while processing the entire picture gives great results (still
> with some letters recognized as numbers like '0' instead of O),
> processing a single subimage, particularly this one, gives no results
> at all: http://www.francescovannini.com/pub/importo.jpg
> The library detects only a tilde in this image, strangely with a
> confidence of 100/255. Unfortunately this is the only part of the
> postal order image that I can publish, because of sensitive data
> concerns.
> Is there something that I can tune? Surely processing the entire
> picture gives Tesseract some more information about font features than
> processing this subimage. That's the only reason why it seems possible
> to me. But how can I process a subimage setting a particular whitelist
> while achieving the same accuracy that processing the entire picture
> gives?
> Thank you in advance.

