Hi Patrick,

This result ("JOhfl DO6") looks like the product of a bad line-height estimate. Tesseract rescales every line to a standard height, and when this step goes wrong the classifier is helpless. Letters may then be recognized as taller letters (o->O, n->fl, ...). I think the line height is estimated during the layout analysis stage, before the actual recognition starts. It is quite possible (I'm not sure) that Tesseract enforces the same line height on all lines in a column.
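If the line-height guess is right, one workaround is the one Patrick reports working by hand: give Tesseract one text line at a time. Below is a minimal Python sketch of finding the text-line bands with a horizontal projection profile. The function name `split_lines`, the `min_ink` parameter and the list-of-rows image representation are assumptions made for illustration; none of this is Tesseract's own code.

```python
from typing import List, Tuple


def split_lines(image: List[List[int]], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges, bottom exclusive, one per text line.

    `image` is a binarized page given as a list of rows (1 = ink, 0 = background).
    A row counts as part of a text line when it holds at least `min_ink` ink pixels.
    """
    bands = []
    start = None  # top row of the band currently being scanned, if any
    for y, row in enumerate(image):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                  # a new band begins here
        elif not has_ink and start is not None:
            bands.append((start, y))   # the band ended on the previous row
            start = None
    if start is not None:              # a band that runs to the bottom edge
        bands.append((start, len(image)))
    return bands


# Toy page (hypothetical data): two "lines" of ink separated by an empty row.
page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
bands = split_lines(page)
crops = [page[top:bottom] for top, bottom in bands]  # one sub-image per line
```

Each crop could then be passed to Tesseract on its own (for example with PSM_SINGLE_LINE), which should keep a height estimate made on one line from affecting another.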
Ilya

On Feb 4, 1:46 pm, patrickq <[email protected]> wrote:
> Hi Faisal,
>
> Here is the image by the way: http://scanbizcards.com/twolines.jpg
> Could not be simpler ... two lines:
> John Doe
> [email protected]
>
> and yet, scanned as an entire image, John Doe is recognized with 4
> mistakes out of 7 letters ("JOhfl DO6")!
> See http://scanbizcards.com/resultstwolines.jpg
>
> No problems if I split as two images.
>
> Truly bizarre ... what's even weirder is that the line that's messed
> up is the first line, as if this was a result of scanning the 2nd line
> (even though that line comes after and I think Tesseract recognizes
> top down).
>
> I think you mean the SetPageSegMode API, which takes one of:
>   PSM_AUTO,           // Fully automatic page segmentation.
>   PSM_SINGLE_COLUMN,  // Assume a single column of text of variable sizes.
>   PSM_SINGLE_BLOCK,   // Assume a single uniform block of text. (Default.)
>   PSM_SINGLE_LINE,    // Treat the image as a single text line.
>   PSM_SINGLE_WORD,    // Treat the image as a single word.
>   PSM_SINGLE_CHAR,    // Treat the image as a single character.
>
> Setting the mode to PSM_SINGLE_COLUMN would seem to be the one I need
> - unfortunately, I tried it and it doesn't seem to help in the case in
> question.
>
> Patrick
>
> On Feb 4, 6:25 am, Faisal Shafait <[email protected]> wrote:
>
> > Hi,
> > Tesseract 3.0 has an individual text-line recognition mode. When running in
> > that mode, I think adaptive classifier does not adapt to other text-lines in
> > the page.
> >
> > Cheers,
> > Faisal
> >
> > On Wed, Feb 3, 2010 at 4:46 PM, patrickq <[email protected]> wrote:
> >
> > > Hi Francesco,
> > >
> > > Tesseract 3.0 actually recognizes all the digits in your sample image
> > > just great.
> > > I have processed your image using the ScanBizCards iPhone
> > > application (which uses Tesseract 3.0) and you can see screenshots on:
> > > http://www.scanbizcards.com/boxes.jpg
> > > http://www.scanbizcards.com/results.jpg
> > >
> > > The first screenshot is taken during processing and shows you in red
> > > the boxes found by Tesseract during the layout analysis; the 2nd
> > > screenshot is the text result where you can see that all digits were
> > > recognized properly.
> > > We convert the image to grayscale (using non-equal weights for the 3
> > > RGB components) before submitting the image to Tesseract so it's
> > > possible that this makes the difference (but I doubt it). Note by the
> > > way how Tesseract returns several imaginative matches for many of the
> > > '*' characters - not sure why - but you should be able to ignore these
> > > in your code, for example by searching for consecutive sequences of
> > > digits.
> > >
> > > Regarding your issue in general: you are right that Tesseract may do a
> > > better job when processing an entire image, because it can draw
> > > conclusions on text size (for example) but in some cases, that
> > > algorithm is actually a bad thing, for example where each line is in a
> > > totally different font and size! This is the case for what I scan and
> > > I have asked this forum for the Tesseract variable to turn such
> > > adaptive learning OFF - but got no replies. Anyone out there with the
> > > answer? I just want Tesseract to analyze each line separately,
> > > "forgetting" anything it may have learned from other lines. I think
> > > that means disabling the adaptive classifier but not certain.
> > >
> > > If you are having better luck scanning the entire image, I suggest
> > > that instead of using blacklist / whitelist on sub-images, you may
> > > want to do this:
> > > - use a regular expression describing, for example, a number
> > > - in that regular expression, don't just look for a sequence of
> > > digits, do something like "[\\dIlOZ&]*", which means "accept a digit
> > > or uppercase I or lowercase l or uppercase O or uppercase Z or &"
> > > (because these letters look similar to digits)
> > > - then in the string matched by the regexp, just replace occurrences of
> > > O with 0, Z with 2, & with 8 etc
> > >
> > > Patrick
> > >
> > > On Feb 2, 1:40 pm, Francesco <[email protected]> wrote:
> > > > Hi everybody,
> > > > I'm writing an application to automatically scan tons of postal
> > > > orders, using TessNet2 library from C#. Tesseract is great and
> > > > recognizes about everything on the postal order. But, because some
> > > > fields contain only numbers and some others only letters, I want to
> > > > process single subimages from the whole picture, adjusting
> > > > tessedit_char_blacklist and tessedit_char_whitelist variables for each
> > > > of these.
> > > > But while processing the entire picture gives great results (still
> > > > with some letters recognized as numbers like '0' instead of O),
> > > > processing a single subimage, particularly this one, gives no results
> > > > at all: http://www.francescovannini.com/pub/importo.jpg
> > > > The library detects only a tilde in this image, strangely with a
> > > > confidence of 100/255. Unfortunately this is the only part of the
> > > > postal order image that I can publish, because of sensitive data
> > > > concerns.
> > > > Is there something that I can tune? Surely processing the entire
> > > > picture gives Tesseract some more information about font features than
> > > > processing this subimage. That's the only reason why it seems possible
> > > > to me.
> > > > But how can I process a subimage setting a particular whitelist
> > > > while achieving the same accuracy that processing the entire picture
> > > > gives?
> > > > Thank you in advance.
> > >
> > > --
> > > You received this message because you are subscribed to the Google Groups
> > > "tesseract-ocr" group.
> > > To post to this group, send email to [email protected].
> > > To unsubscribe from this group, send email to
> > > [email protected]<tesseract-ocr%[email protected]>
> > > .
> > > For more options, visit this group at
> > > http://groups.google.com/group/tesseract-ocr?hl=en.
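
Patrick's regular-expression suggestion quoted above (accept look-alike characters inside a number, then map them back to digits) could be sketched in Python roughly as follows. The O->0, Z->2 and &->8 replacements come from his message. Mapping I and l to 1, requiring at least one real digit in each run, and the names `LOOKALIKES`, `NUMBER_RE` and `fix_numbers` are assumptions added for illustration. The pattern uses `+` rather than `*` so that it never matches an empty string.

```python
import re

# Look-alike characters mapped back to digits.
# O, Z and & follow the suggestion in the thread; I and l -> 1 are assumptions.
LOOKALIKES = str.maketrans({"O": "0", "Z": "2", "&": "8", "I": "1", "l": "1"})

# A candidate number: a run of digits and look-alike characters.
NUMBER_RE = re.compile(r"[\dIlOZ&]+")


def fix_numbers(text: str) -> str:
    """Replace look-alike letters with digits inside runs that contain a digit."""

    def repl(match: "re.Match[str]") -> str:
        token = match.group(0)
        # Assumption: a run with no real digit is ordinary text, so leave it alone.
        if not any(ch.isdigit() for ch in token):
            return token
        return token.translate(LOOKALIKES)

    return NUMBER_RE.sub(repl, text)


print(fix_numbers("ZIP 9O21O"))
```

The at-least-one-digit check is a crude guard against rewriting ordinary words; a field known to hold only numbers could skip it and translate every match.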

