Thanks Patrick. How does one go about changing the threshold for
minimum text size? I see the wiki reference a file that can contain
configuration options, but I don't see anything detailing what they
could be. I don't mind poking around the code, but a tip on where to
start might be helpful!
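
In the meantime, one workaround that doesn't depend on any Tesseract
setting is to post-filter the text after OCR. Below is a minimal sketch
in Python, assuming the noise mostly shows up as tokens containing
symbols or digits, or as stray single letters. The name drop_noise and
its min_len/keep parameters are made up for illustration; none of this
is part of Tesseract.

```python
import re

# Hypothetical helper, not part of Tesseract: a token counts as "word-like"
# if it is made of ASCII letters, optionally joined by ' or - (e.g. muffin's).
_WORD_RE = re.compile(r"^[A-Za-z]+(?:['-][A-Za-z]+)*$")


def drop_noise(text, min_len=2, keep=("a", "I")):
    """Return text with tokens that do not look like words removed."""
    kept_lines = []
    for line in text.splitlines():
        tokens = []
        for raw in line.split():
            # Ignore surrounding punctuation when judging the token.
            core = raw.strip(".,;:!?\"'()")
            if not _WORD_RE.match(core):
                continue  # contains symbols or digits, or is empty
            if len(core) < min_len and core not in keep:
                continue  # stray single letter
            tokens.append(raw)
        if tokens:  # lines with nothing left are dropped entirely
            kept_lines.append(" ".join(tokens))
    return "\n".join(kept_lines)
```

On the output quoted below, this keeps the real words and drops
"!%i", "K", "Q55", "E!!!", "iU_" and the like, but "HI" still gets
through because it is all letters. It would also throw away legitimate
tokens such as "99.9%", so it has the same drawback Patrick describes
for a dictionary-only setting.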

Thanks,

Jason

On Tue, Nov 15, 2011 at 7:14 AM, patrickq <[email protected]> wrote:
> Anything is possible with Tesseract since there are a gazillion settings,
> but in my opinion a setting that returns only words in the dictionary
> would be useless to 99.9% of application usages. For one thing, since
> the Tesseract dictionary doesn't contain all the words in the English
> language, it would strip out countless words from virtually any
> document, including my own comment here (because Tesseract is not in
> the dictionary and 99.9% isn't either).
>
> And, yes, these spurious letters are likely the result of noise -
> setting a high threshold for minimum text size can help filter them
> out.
>
> Patrick
>
> On Nov 15, 1:29 am, Jason Funk <[email protected]> wrote:
>> Does Tesseract make any attempts to filter out things that aren't
>> words? For example, I processed an image and it returned this:
>>
>> "This is a slide about a muffin's magical
>> powers. !%i
>> Muffin Power
>> HI K
>> Q55
>> iii‘
>>
>> E!!!
>> iU_
>> ‘gm
>> !"
>>
>> All of the words that it found are right, but everything else isn't. I
>> don't know where it's coming from. Maybe the background or whatever. I
>> thought that Tesseract had a dictionary that it used to know that
>> "iU_" wasn't a valid word. Or maybe I don't have it turned on
>> correctly? Or configured right? Any pointers would be great.
>
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>
