Hi,

Tesseract 3.0 has an individual text-line recognition mode. When running in that mode, I think the adaptive classifier does not adapt to the other text lines on the page.
Cheers,
Faisal

On Wed, Feb 3, 2010 at 4:46 PM, patrickq <[email protected]> wrote:
> Hi Francesco,
>
> Tesseract 3.0 actually recognizes all the digits in your sample image
> just great. I have processed your image using the ScanBizCards iPhone
> application (which uses Tesseract 3.0) and you can see screenshots on:
> http://www.scanbizcards.com/boxes.jpg
> http://www.scanbizcards.com/results.jpg
>
> The first screenshot is taken during processing and shows you in red
> the boxes found by Tesseract during the layout analysis, the 2nd
> screenshot is the text result where you can see that all digits were
> recognized properly.
> We convert the image to grayscale (using non-equal weights for the 3
> RGB components) before submitting the image to Tesseract so it's
> possible that this makes the difference (but I doubt it). Note by the
> way how Tesseract returns several imaginative matches for many of the
> '*' characters - not sure why - but you should be able to ignore these
> in your code, for example by searching for consecutive sequences of
> digits.
>
> Regarding your issue in general: you are right that Tesseract may do a
> better job when processing an entire image, because it can draw
> conclusions on text size (for example) but in some cases, that
> algorithm is actually a bad thing, for example where each line is in a
> totally different font and size! This is the case for what I scan and
> I have asked this forum for the Tesseract variable to turn such
> adaptive learning OFF - but got no replies. Anyone out there with the
> answer? I just want Tesseract to analyze each line separately,
> "forgetting" anything it may have learned from other lines. I think
> that means disabling the adaptive classifier but I am not certain.
>
> If you are having better luck scanning the entire image, I suggest
> that instead of using blacklist / whitelist on sub-images, you may
> want to do this:
> - use a regular expression describing, for example, a number
> - in that regular expression, don't just look for a sequence of
> digits, do something like "[\\dIlOZ&]*", which means "accept a digit
> or uppercase I or lowercase l or uppercase O or uppercase Z or &"
> (because these characters look similar to digits)
> - then in the string matched by the regexp, just replace occurrences of
> O with 0, Z with 2, & with 8 etc
>
> Patrick
>
> On Feb 2, 1:40 pm, Francesco <[email protected]> wrote:
> > Hi everybody,
> > I'm writing an application to automatically scan tons of postal
> > orders, using the TessNet2 library from C#. Tesseract is great and
> > recognizes about everything on the postal order. But, because some
> > fields contain only numbers and some others only letters, I want to
> > process single subimages from the whole picture, adjusting the
> > tessedit_char_blacklist and tessedit_char_whitelist variables for each
> > of these.
> > But while processing the entire picture gives great results (still
> > with some letters recognized as numbers like '0' instead of O),
> > processing a single subimage, particularly this one, gives no results
> > at all: http://www.francescovannini.com/pub/importo.jpg
> > The library detects only a tilde in this image, strangely with a
> > confidence of 100/255. Unfortunately this is the only part of the
> > postal order image that I can publish, because of sensitive data
> > concerns.
> > Is there something that I can tune? Surely processing the entire
> > picture gives Tesseract some more information about font features than
> > processing this subimage. That's the only explanation that seems
> > plausible to me. But how can I process a subimage setting a particular
> > whitelist while achieving the same accuracy that processing the entire
> > picture gives?
> > Thank you in advance.
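A minimal sketch of the regular-expression post-processing Patrick describes in the quoted message, in case it helps. It is written in Python purely for illustration (the original poster works in C#), and the names `LOOKALIKES`, `NUMBER_LIKE` and `normalize_numbers` are hypothetical. Two things here are assumptions that go beyond the message: mapping I and l to 1, and using `+` rather than `*` so that empty matches are skipped.

```python
import re

# Characters that OCR commonly confuses with digits, mapped to the digit
# they most likely stand for. The O, Z and & entries come from the quoted
# message; the I and l entries are an added assumption.
LOOKALIKES = str.maketrans({"O": "0", "Z": "2", "&": "8", "I": "1", "l": "1"})

# One or more digits or look-alike characters. "+" is used instead of the
# "*" in the quoted message so that empty matches are not returned.
NUMBER_LIKE = re.compile(r"[\dIlOZ&]+")


def normalize_numbers(text):
    """Return number-like runs found in text, with look-alikes mapped to digits."""
    results = []
    for match in NUMBER_LIKE.finditer(text):
        candidate = match.group()
        # Require at least one real digit so that runs made only of
        # look-alike letters (e.g. "ZOO") are not turned into numbers.
        if any(ch.isdigit() for ch in candidate):
            results.append(candidate.translate(LOOKALIKES))
    return results


if __name__ == "__main__":
    print(normalize_numbers("Importo: 1O5Z&, ref ZOO"))  # prints ['10528']
```

The "at least one real digit" check is a design choice, not part of the original suggestion; for a field known to contain only numbers it could be dropped so that every matched run is converted.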
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected]<tesseract-ocr%[email protected]>.
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en.

