Thanks, Patrick. How does one go about changing the threshold for minimum text size? I see the wiki references a file that can contain configuration options, but I don't see anything detailing what those options could be. I don't mind poking around the code, but a tip on where to start would be helpful!
Thanks,
Jason

On Tue, Nov 15, 2011 at 7:14 AM, patrickq <[email protected]> wrote:
> Anything is possible with Tesseract since there are gazillion settings
> but in my opinion a setting that returns only words in the dictionary
> would be useless to 99.9% of application usages. For one thing, since
> the Tesseract dictionary doesn't contain all the words in the English
> language, it would strip out countless words from virtually any
> document, including my own comment here (because Tesseract is not in
> the dictionary and 99.9% isn't either).
>
> And, yes, these spurious letters are likely the result of noise -
> setting a high threshold for minimum text size can help filter them
> out.
>
> Patrick
>
> On Nov 15, 1:29 am, Jason Funk <[email protected]> wrote:
>> Does Tesseract make any attempts to filter out things that aren't
>> words? For example, I processed an image and it returned this:
>>
>> "This is a slide about a muffin's magical
>> powers. !%i
>> Muffin Power
>> HI K
>> Q55
>> iii‘
>>
>> E!!!
>> iU_
>> ‘gm
>> !"
>>
>> All of the words that it found are right, but everything else isn't. I
>> don't know where it's coming from? Maybe the background or whatever. I
>> thought that tesseract had a dictionary that it used to know that
>> "iU_" wasn't a valid word. Or maybe I don't have it turned on
>> correctly? Or configured right? Any pointers would be great.
>
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>
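
A minimal sketch of the configuration-file route asked about above, not a confirmed answer. It assumes that a Tesseract config file is plain text with one "name value" pair per line and that the file is passed as a trailing argument after the output base. The parameter name textord_min_xheight and the value 20 are my assumptions for a "minimum text size" setting and should be checked against the source of the version in use; the helper names and file names are hypothetical.

```python
import tempfile
from pathlib import Path


def write_config(path, settings):
    """Write one 'name value' pair per line (assumed config-file layout)."""
    text = "".join(f"{name} {value}\n" for name, value in settings.items())
    Path(path).write_text(text)
    return Path(path)


def build_command(image, output_base, config_path):
    """Build the argv list: tesseract <image> <outputbase> <configfile>."""
    return ["tesseract", str(image), str(output_base), str(config_path)]


if __name__ == "__main__":
    # Assumed parameter name and value; verify both before relying on them.
    settings = {"textord_min_xheight": 20}
    with tempfile.TemporaryDirectory() as tmp:
        cfg = write_config(Path(tmp) / "bigtext", settings)
        print(cfg.read_text(), end="")
        # Only prints the command; run it yourself (e.g. via subprocess.run)
        # once Tesseract is installed and the image exists.
        print(" ".join(build_command("slide.png", "slide", cfg)))
```

The sketch only builds the file and the command line, so it can be tried without Tesseract installed. Where Tesseract looks for the named config file may depend on the version.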

