Re: English Word Filtering

patrickq Tue, 15 Nov 2011 05:24:35 -0800

Anything is possible with Tesseract since there are a gazillion settings,
but in my opinion a setting that returns only words in the dictionary
would be useless for 99.9% of applications. For one thing, since
the Tesseract dictionary doesn't contain every word in the English
language, it would strip out countless words from virtually any
document, including my own comment here (because "Tesseract" is not in
the dictionary and "99.9%" isn't either).
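
To make that concrete, here is a rough sketch of what a dictionary-only
filter would do to ordinary text. The small word set and the function name
keep_dictionary_words are hypothetical stand-ins for illustration; this is
not Tesseract's actual dictionary, nor an existing Tesseract setting.

```python
# Hypothetical stand-in for a dictionary; a real word list is far larger
# but still incomplete, which is the point being illustrated.
DICTIONARY = {
    "this", "is", "a", "slide", "about", "magical", "powers", "power",
    "the", "in", "not", "dictionary", "and", "either",
}

def keep_dictionary_words(text, dictionary=DICTIONARY):
    """Return only the whitespace-separated tokens found in the dictionary."""
    kept = []
    for token in text.split():
        # Strip surrounding punctuation so "powers." still matches "powers".
        core = token.strip(".,;:!?\"'()")
        if core.lower() in dictionary:
            kept.append(token)
    return kept

# Noise is dropped, but so are legitimate tokens such as a product name,
# a percentage, and a contraction.
print(keep_dictionary_words("Tesseract is not in the dictionary and 99.9% isn't either"))
print(keep_dictionary_words("HI K Q55 iU_"))
```

The filter does remove the junk strings, but it also silently drops
"Tesseract", "99.9%" and "isn't" from a perfectly valid sentence.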


And, yes, these spurious letters are likely the result of noise -
setting a high threshold for minimum text size can help filter them
out.
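
One way to approximate a minimum-text-size threshold is to post-filter the
recognized words by bounding-box height. The sketch below assumes word data
in the column-oriented shape that pytesseract's image_to_data produces with
Output.DICT (parallel "text" and "height" lists). The function name, the
20-pixel threshold and the sample values are made up for illustration, and
this post-processing step may not be the same thing as the Tesseract setting
I have in mind above.

```python
def drop_small_words(data, min_height=20):
    """Keep words whose bounding box is at least min_height pixels tall.

    `data` is assumed to be a dict of parallel lists with "text" and
    "height" keys, one entry per detected element.
    """
    kept = []
    for text, height in zip(data["text"], data["height"]):
        if not text.strip():
            continue  # skip empty entries (non-word rows)
        if int(height) >= min_height:
            kept.append(text)
    return kept

# Made-up sample in the assumed shape; not real OCR output.
sample = {
    "text":   ["Muffin", "Power", "iU_", "Q55", ""],
    "height": [48, 47, 9, 11, 0],
}
print(drop_small_words(sample))
```

The right threshold depends on the image resolution and the size of the
real text, so it is worth checking the heights of a few known-good words
before picking a value.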

Patrick

On Nov 15, 1:29 am, Jason Funk <[email protected]> wrote:
> Does Tesseract make any attempts to filter out things that aren't
> words? For example, I processed an image and it returned this:
>
> "This is a slide about a mufﬁn's magical
> powers. !%i
> Mufﬁn Power
> HI K
> Q55
> iii‘
>
> E!!!
> iU_
> ‘gm
> !"
>
> All of the words that it found are right, but everything else isn't. I
> don't know where it's coming from? Maybe the background or whatever. I
> thought that tesseract had a dictionary that it used to know that
> "iU_" wasn't a valid word. Or maybe I don't have it turned on
> correctly? Or configured right? Any pointers would be great.

