Anything is possible with Tesseract, since there are a gazillion settings, but in my opinion a setting that returns only words in the dictionary would be useless for 99.9% of applications. For one thing, since the Tesseract dictionary doesn't contain every word in the English language, it would strip out countless words from virtually any document, including my own comment here (because "Tesseract" is not in the dictionary and "99.9%" isn't either).
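To make that concrete, here is a rough sketch of a dictionary-only filter applied to text after OCR. It is plain Python post-processing, not a Tesseract setting; the word list and the function name `keep_dictionary_words` are made up for illustration, and a real dictionary would be far larger but still incomplete.

```python
# Hypothetical stand-in for an OCR dictionary (an assumption for illustration).
DICTIONARY = {
    "this", "is", "a", "slide", "about", "magical", "powers",
    "muffin", "power", "not", "in", "the", "dictionary", "and", "either",
}

def keep_dictionary_words(text, dictionary=DICTIONARY):
    """Keep only whitespace-separated tokens found in the dictionary.

    Surrounding punctuation is stripped and case is ignored for the lookup;
    the original token is what gets returned.
    """
    kept = []
    for token in text.split():
        core = token.strip(".,;:!?\"'()").lower()
        if core in dictionary:
            kept.append(token)
    return kept

# The filter does remove the noise tokens from the OCR output...
noisy = "Muffin Power HI K Q55 iU_"
print(keep_dictionary_words(noisy))   # ['Muffin', 'Power']

# ...but it also removes legitimate text that is simply not in the word list.
prose = "Tesseract is not in the dictionary and 99.9% is not either"
print(keep_dictionary_words(prose))
# ['is', 'not', 'in', 'the', 'dictionary', 'and', 'is', 'not', 'either']
```

Note that "Tesseract" and "99.9%" vanish from the second result, which is the problem: proper nouns, numbers, and abbreviations get thrown away along with the noise.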
And, yes, these spurious letters are likely the result of noise. Setting a high threshold for minimum text size can help filter them out.

Patrick

On Nov 15, 1:29 am, Jason Funk <[email protected]> wrote:
> Does Tesseract make any attempts to filter out things that aren't
> words? For example, I processed an image and it returned this:
>
> "This is a slide about a muffin's magical
> powers. !%i
> Muffin Power
> HI K
> Q55
> iii‘
>
> E!!!
> iU_
> ‘gm
> !"
>
> All of the words that it found are right, but everything else isn't. I
> don't know where it's coming from? Maybe the background or whatever. I
> thought that tesseract had a dictionary that it used to know that
> "iU_" wasn't a valid word. Or maybe I don't have it turned on
> correctly? Or configured right? Any pointers would be great.

