Thanks Andrei. You are right, of course, but I have not yet mastered the art of thresholding, so the only thresholding done is by Tesseract itself, which I believe is simplistic: a single threshold value applied to the entire image (i.e. not adaptive). I also don't do noise reduction yet.
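For reference, the kind of noise reduction discussed here could be sketched as a pre-OCR speck filter: label connected components in the already-binarized image and erase any whose bounding box is shorter than a minimum height. This is only a rough sketch under stated assumptions, not anything Tesseract provides: the function name `remove_small_components`, the `min_height` parameter, and the list-of-rows image format (1 = ink, 0 = background) are all made up for illustration.

```python
from collections import deque


def remove_small_components(img, min_height):
    """Return a copy of a binary image with short blobs erased.

    img        -- list of rows; 1 = foreground (ink), 0 = background
    min_height -- blobs whose bounding box is shorter than this are removed
                  (hypothetical parameter, chosen by the caller)
    """
    rows = len(img)
    cols = len(img[0]) if img else 0
    out = [row[:] for row in img]            # leave the input untouched
    seen = [[False] * cols for _ in range(rows)]

    for r in range(rows):
        for c in range(cols):
            if img[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search over one 8-connected component.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            top = bottom = r
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                top = min(top, y)
                bottom = max(bottom, y)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and img[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Erase the component if its bounding box is too short.
            if bottom - top + 1 < min_height:
                for y, x in pixels:
                    out[y][x] = 0
    return out
```

Note that a per-blob filter like this would also erase legitimate small marks such as periods and the dots on "i", which is why a row-level minimum height would still be preferable.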
In any case, telling Tesseract not to waste time on sizes known to be too small seems like a must do - I just need someone to let me know the name of that elusive variable ... come on guys, I'll offer a cash prize for that name :-)!

On Jan 21, 4:25 am, andrei_c wrote:
> Not sure if I'm being helpful, but it sounds like either your input
> image is noisy or thresholding algorithm incorrectly separated
> foreground from background. If it's former, noise reduction of
> original image would help. If latter, you probably need to choose
> thresholding algorithm more appropriate for your input image.
>
> That said, I don't know how to suppress small rows efficiently.
>
> Andrei
>
> On Jan 17, 11:55 am, patrickq wrote:
>
> > I am scanning images with large, clear text but on a grainy background
> > and although I get the text fine, I also get myriads of irrelevant
> > letters with a size of 3 or 5 pixels (way below a size at which
> > anything could be recognized accurately). I could eliminate them based
> > on size post-OCR but meanwhile Tesseract spent minutes recognizing
> > these characters. Could someone please point me to the right variable
> > (s) to tell Tesseract to not attempt recognition (and ideally not
> > return boxes at the layout analysis phase) below a certain size?
>
> > I assume that the variable in question regards the min expected height
> > of a row (rather than of individual characters) since a dot ('.') for
> > example can be quite small even within a row with normal sized
> > letters.
>
> > Thanks!

