Re: Different results on subimages

Faisal Shafait Thu, 04 Feb 2010 03:25:57 -0800

Hi,
Tesseract 3.0 has an individual text-line recognition mode. When running in
that mode, I think the adaptive classifier does not adapt to other text lines
on the page.


Cheers,
Faisal

On Wed, Feb 3, 2010 at 4:46 PM, patrickq <[email protected]> wrote:

> Hi Francesco,
>
> Tesseract 3.0 actually recognizes all the digits in your sample image
> just great. I have processed your image using the ScanBizCards iPhone
> application (which uses Tesseract 3.0) and you can see screenshots on:
> http://www.scanbizcards.com/boxes.jpg
> http://www.scanbizcards.com/results.jpg
>
> The first screenshot is taken during processing and shows you in red
> the boxes found by Tesseract during the layout analysis, the 2nd
> screenshot is the text result where you can see that all digits were
> recognized properly.
> We convert the image to a grayscale (using non-equal weights for the 3
> RGB components) before submitting the image to Tesseract so it's
> possible that this makes the difference (but I doubt it). Note by the
> way how Tesseract returns several imaginative matches for many of the
> '*' characters - not sure why - but you should be able to ignore these
> in your code, for example by searching for consecutive sequences of
> digits.
>
> Regarding your issue in general: you are right that Tesseract may do a
> better job when processing an entire image, because it can draw
> conclusions on text size (for example) but in some cases, that
> algorithm is actually a bad thing, for example where each line is in a
> totally different font and size! This is the case for what I scan and
> I have asked this forum for the Tesseract variable to turn such
> adaptive learning OFF - but got no replies. Anyone out there with the
> answer? I just want Tesseract to analyze each line separately,
> "forgetting" anything it may have learned from other lines. I think
> that means disabling the adaptive classifier, but I am not certain.
>
> If you are having better luck scanning the entire image, I suggest
> that instead of using blacklist / whitelist on sub-images, you may
> want to do this:
> - use a regular expression describing, for example, a number
> - in that regular expression, don't just look for a sequence of
> digits, do something like "[\\dIlOZ&]*", which means "accept a digit
> or uppercase I or lowercase l or uppercase O or uppercase Z or &"
> (because these characters look similar to digits)
> - then in the string matched by the regexp, just replace occurrences of
> O with 0, Z with 2, & with 8 etc.
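
A minimal Python sketch of the post-processing steps above. The names
extract_numbers, LOOKALIKES and NUMBER_LIKE are hypothetical. Mapping I and l
to 1, and dropping matches that contain no real digit, are assumptions that
go beyond what the message spells out.

```python
import re

# Characters OCR tends to confuse with digits, mapped to the digit they most
# likely stand for. O, Z and & come from the message; I/l -> 1 is an assumption.
LOOKALIKES = str.maketrans({"O": "0", "Z": "2", "&": "8", "I": "1", "l": "1"})

# One or more digits or digit lookalikes. "+" is used instead of "*" so that
# empty matches are never produced.
NUMBER_LIKE = re.compile(r"[\dIlOZ&]+")


def extract_numbers(ocr_text):
    """Return number-like runs from ocr_text with lookalikes replaced by digits."""
    numbers = []
    for match in NUMBER_LIKE.finditer(ocr_text):
        candidate = match.group()
        # Assumption: require at least one real digit, so that ordinary
        # letters such as a lone "l" or "I" are not turned into numbers.
        if any(ch.isdigit() for ch in candidate):
            numbers.append(candidate.translate(LOOKALIKES))
    return numbers


print(extract_numbers("** 1O5 ** Z&4"))  # ['105', '284']
```

Runs of '*' and other non-matching characters are skipped, which also covers
the spurious matches returned for the '*' characters.
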
>
> Patrick
>
> On Feb 2, 1:40 pm, Francesco <[email protected]> wrote:
> > Hi everybody,
> > I'm writing an application to automatically scan tons of postal
> > orders, using TessNet2 library from C#. Tesseract is great and
> > recognizes about everything on the postal order. But, because some
> > fields contain only numbers and some others only letters, I want to
> > process single subimages from the whole picture, adjusting
> > tessedit_char_blacklist and tessedit_char_whitelist variables for each
> > of these.
> > But while processing the entire picture gives great results (still
> > with some letters recognized as numbers like '0' instead of O),
> > processing a single subimage, particularly this one, gives no results
> > at all: http://www.francescovannini.com/pub/importo.jpg
> > The library detects only a tilde in this image, strangely with a
> > confidence of 100/255. Unfortunately this is the only part of the
> > postal order image that I can publish, because of sensitive data
> > concerns.
> > Is there something that I can tune? Surely processing the entire
> > picture gives Tesseract some more information about font features than
> > processing this subimage. That's the only reason why it seems possible
> > to me. But how can I process a subimage setting a particular whitelist
> > while achieving the same accuracy that processing the entire picture
> > gives?
> > Thank you in advance.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en.
>
>

