Re: Different results on subimages

patrickq Thu, 04 Feb 2010 04:46:25 -0800

Hi Faisal,

Here is the image by the way: http://scanbizcards.com/twolines.jpg
Could not be simpler ... two lines:
John Doe
[email protected]


and yet, scanned as an entire image, John Doe is recognized with 4
mistakes out of 7 letters ("JOhfl DO6")! See 
http://scanbizcards.com/resultstwolines.jpg

No problems if I split as two images.

Truly bizarre ... what's even weirder is that the line that's messed
up is the first line, as if this was a result of scanning the 2nd line
(even though that line comes after and I think Tesseract recognizes
top down).

I think you mean the SetPageSegMode API, which takes one of:
    PSM_AUTO,           // Fully automatic page segmentation.
    PSM_SINGLE_COLUMN,  // Assume a single column of text of variable sizes.
    PSM_SINGLE_BLOCK,   // Assume a single uniform block of text. (Default.)
    PSM_SINGLE_LINE,    // Treat the image as a single text line.
    PSM_SINGLE_WORD,    // Treat the image as a single word.
    PSM_SINGLE_CHAR,    // Treat the image as a single character.

Setting the mode to PSM_SINGLE_COLUMN would seem to be the one I need
- unfortunately, I tried it and it doesn't seem to help in the case in
question.

Patrick

On Feb 4, 6:25 am, Faisal Shafait <[email protected]> wrote:
> Hi,
> Tesseract 3.0 has an individual text-line recognition mode. When running in
> that mode, I think the adaptive classifier does not adapt to other text-lines
> in the page.
>
> Cheers,
> Faisal
>
> On Wed, Feb 3, 2010 at 4:46 PM, patrickq <[email protected]> wrote:
>
> > Hi Francesco,
>
> > Tesseract 3.0 actually recognizes all the digits in your sample image
> > just great. I have processed your image using the ScanBizCards iPhone
> > application (which uses Tesseract 3.0) and you can see screenshots on:
> > http://www.scanbizcards.com/boxes.jpg
> > http://www.scanbizcards.com/results.jpg
>
> > The first screenshot is taken during processing and shows you in red
> > the boxes found by Tesseract during the layout analysis, the 2nd
> > screenshot is the text result where you can see that all digits were
> > recognized properly.
> > We convert the image to a grayscale (using non-equal weights for the 3
> > RGB components) before submitting the image to Tesseract so it's
> > possible that this makes the difference (but I doubt it). Note by the
> > way how Tesseract returns several imaginative matches for many of the
> > '*' characters - not sure why - but you should be able to ignore these
> > in your code, for example by searching for consecutive sequences of
> > digits.
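
A minimal sketch of the "search for consecutive sequences of digits" idea
above, written in Python purely for illustration (the thread itself uses
Tesseract from C# and from an iPhone app). The function name `digit_runs` and
the `min_len` parameter are hypothetical, and the sample string is made up
rather than real Tesseract output.

```python
import re

def digit_runs(ocr_text, min_len=2):
    """Return every run of at least min_len consecutive ASCII digits."""
    # [0-9]{n,} matches n or more digits in a row; stray symbols such as
    # the spurious matches around '*' characters are simply skipped.
    pattern = r"[0-9]{%d,}" % min_len
    return re.findall(pattern, ocr_text)

print(digit_runs("~*' 1250 ;* 00 '"))  # ['1250', '00']
```

Raising `min_len` is one way to drop isolated digits that come from noise, at
the cost of missing genuinely short fields.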
>
> > Regarding your issue in general: you are right that Tesseract may do a
> > better job when processing an entire image, because it can draw
> > conclusions on text size (for example) but in some cases, that
> > algorithm is actually a bad thing, for example where each line is in a
> > totally different font and size! This is the case for what I scan and
> > I have asked this forum for the Tesseract variable to turn such
> > adaptive learning OFF - but got no replies. Anyone out there with the
> > answer? I just want Tesseract to analyze each line separately,
> > "forgetting" anything it may have learned from other lines. I think
> > that means disabling the adaptive classifier but I am not certain.
>
> > If you are having better luck scanning the entire image, I suggest
> > that instead of using blacklist / whitelist on sub-images, you may
> > want to do this:
> > - use a regular expression describing, for example, a number
> > - in that regular expression, don't just look for a sequence of
> > digits, do something like "[\\dIlOZ&]*", which means "accept a digit
> > or uppercase I or lowercase l or uppercase O or uppercase Z or &"
> > (because these characters look similar to digits)
> > - then in the string matched by the regexp, just replace occurrences of
> > O with 0, Z with 2, & with 8 etc
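
The steps above can be sketched roughly as follows, in Python for
illustration. Everything named here (`LOOKALIKES`, `CANDIDATE`,
`normalize_numbers`, `min_len`) is hypothetical. The I -> 1 and l -> 1
mappings are my assumption for the "etc"; only O, Z and & are spelled out
above. The sketch also uses `+` rather than `*` to avoid empty matches, and
it only rewrites runs that contain at least one real digit, so ordinary words
are left alone.

```python
import re

# Assumed look-alike table: O, Z and & come from the list above,
# I and l are added as a guess. Extend it to match the confusions you see.
LOOKALIKES = str.maketrans({"O": "0", "Z": "2", "&": "8", "I": "1", "l": "1"})

# One or more digits or look-alike characters.
CANDIDATE = re.compile(r"[0-9IlOZ&]+")

def normalize_numbers(ocr_text, min_len=3):
    """Replace look-alike characters inside digit-like runs.

    A run is rewritten only if it has at least min_len characters and
    contains at least one real digit; other runs are returned unchanged.
    """
    def fix(match):
        run = match.group(0)
        has_digit = any(c in "0123456789" for c in run)
        if len(run) < min_len or not has_digit:
            return run
        return run.translate(LOOKALIKES)

    return CANDIDATE.sub(fix, ocr_text)

print(normalize_numbers("Total: 1O5Z&"))  # Total: 10528
```

The "at least one real digit" guard is a design choice, not something
Tesseract provides: without it, a word such as "ZOO" would be turned into
"200".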
>
> > Patrick
>
> > On Feb 2, 1:40 pm, Francesco <[email protected]> wrote:
> > > Hi everybody,
> > > I'm writing an application to automatically scan tons of postal
> > > orders, using TessNet2 library from C#. Tesseract is great and
> > > recognizes about everything on the postal order. But, because some
> > > fields contain only numbers and some others only letters, I want to
> > > process single subimages from the whole picture, adjusting
> > > tessedit_char_blacklist and tessedit_char_whitelist variables for each
> > > of these.
> > > But while processing the entire picture gives great results (still
> > > with some letters recognized as numbers like '0' instead of O),
> > > processing a single subimage, particularly this one, gives no results
> > > at all: http://www.francescovannini.com/pub/importo.jpg
> > > The library detects only a tilde in this image, strangely with a
> > > confidence of 100/255. Unfortunately this is the only part of the
> > > postal order image that I can publish, because of sensitive data
> > > concerns.
> > > Is there something that I can tune? Surely processing the entire
> > > picture gives Tesseract some more information about font features than
> > > processing this subimage. That's the only reason why it seems possible
> > > to me. But how can I process a subimage setting a particular whitelist
> > > while achieving the same accuracy that processing the entire picture
> > > gives?
> > > Thank you in advance.
>
> > --
> > You received this message because you are subscribed to the Google Groups
> > "tesseract-ocr" group.
> > To post to this group, send email to [email protected].
> > To unsubscribe from this group, send email to
> > [email protected]<tesseract-ocr%[email protected]>
> > .
> > For more options, visit this group at
> >http://groups.google.com/group/tesseract-ocr?hl=en.

