On 30 July 2010 20:12, patrickq <[email protected]> wrote:
> Hi Sven,
>
> Not only did I read these posts, but I was the one to whom Jimmy
> kindly responded. Here is one quote:
>
> "At any point, if you ask Tesseract what the 'word' it sees is, it
> will simply give you a string composed of the highest-confidence
> characters: the word structure also keeps an array of possible
> characters along with the confidence from the recogniser. The weight
> from a dictionary can add extra weight to a set of characters, but
> only if the set of characters that word is composed from is among the
> set of choices (some other steps can add or remove characters...
> etc)."
>

I think I managed to miss mentioning it completely, but there's
nothing that *forces* a word to be recognised as a dictionary word;
it's just used to establish character confidences. Really, where you
see the difference is across a longer piece of text, when the adaptive
classifier has seen enough examples to know "hey, this thing I thought
was an 'f' might actually be a 't'". In short texts, there's not much
to adapt to. Making a bunch of training images, drawing boxfiles,
etc., only goes so far, so tess uses the dictionary as an
approximation; a low-confidence equivalent of training pages.
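
To make the "only if it's among the choices" point concrete, here is a
toy sketch in Python. It is not Tesseract's code: best_word, the
confidence numbers and the flat bonus are all made up for illustration,
and the real scoring is more involved. It only shows the shape of the
idea: the dictionary can tip the balance between candidates the
recogniser already produced, but it cannot conjure up a character that
isn't in the choice list.

```python
from itertools import product


def best_word(choices, dictionary, bonus=0.2):
    """Pick the best-scoring string from per-position character choices.

    choices: one list per character position, each holding
             (character, confidence) pairs from a hypothetical recogniser.
    dictionary: set of lower-case words that earn the flat bonus.
    """
    best_text, best_score = None, float("-inf")
    for combo in product(*choices):
        text = "".join(ch for ch, _ in combo)
        # Base score: mean recogniser confidence over the characters.
        score = sum(conf for _, conf in combo) / len(combo)
        # The dictionary only adds weight; it never injects characters.
        if text.lower() in dictionary:
            score += bonus
        if score > best_score:
            best_text, best_score = text, score
    return best_text


# Invented confidences: 'f' narrowly beats 't' in the last position.
sure = [[("C", 0.9)], [("h", 0.9)], [("e", 0.9)], [("s", 0.9)]]
with_t = sure + [[("f", 0.6), ("t", 0.5)]]
without_t = sure + [[("f", 0.6)]]

print(best_word(with_t, set()))         # no dictionary
print(best_word(with_t, {"chest"}))     # dictionary, 't' among the choices
print(best_word(without_t, {"chest"}))  # dictionary, 't' never proposed
```

With these numbers the three calls print Chesf, Chest and Chesf: the
dictionary entry rescues the word only in the middle case, where 't'
was already one of the alternatives.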

On the plus side, it turns out that there are functions buried in the
code to serialise/deserialise the classifier state, so it might be
useful to run a whole corpus of short images through tess in one
batch, save the state, and load that at startup.

> Although I did not debug to inspect the alternative choices for the
> mistaken 'f' and 'i', it's a reasonable expectation that 't' and 'l'
> would be next in line in these two cases respectively, because these
> ARE the letters clearly appearing in this image and these are known
> frequent mistakes. I'd say 'i' instead of 'l' is the most common
> mistake. So I think it's reasonable that I would be disappointed.
>
> If I missed something else that would indicate how I can make it work,
> please clarify!
>
> Thanks,
> Patrick
>
> On Jul 30, 1:55 pm, Sven Pedersen <[email protected]> wrote:
>> Patrick,
>> This is a known issue which has been discussed in the last three days.
>> Please look in the archives or check the emails you've received from
>> the list for the last few days.
>> --Sven
>>
>>
>>
>> On Fri, Jul 30, 2010 at 8:04 AM, patrickq <[email protected]> wrote:
>> > This is what I did:
>>
>> > 1. Created a text file called eng.user-words, containing:
>> > Chest
>> > Chestnut
>> > Floor
>> > Vice
>>
>> > 2. Placed the file in the tessdata folder (next to eng.traineddata)
>>
>> > 3. Ran recognition on an image returning "Chesf" instead of "Chest"
>> > and "Fioor" instead of "Floor". Both mistaken "f" and "i" look quite
>> > right visually so I can only assume their confidence level would be
>> > low (but I didn't check).
>>
>> > No effect whatsoever - zip. I can only assume that a variable must be
>> > set or a function needs to be called to turn this on (even though
>> > there is no mention of needing to set anything in the documentation)
>> > or (most likely) I just don't understand how this works and the
>> > dictionary kicks in only on the day of the summer solstice and when
>> > there is a full moon or something.
>>
>> > Patrick
>>
>> > --
>> > You received this message because you are subscribed to the Google Groups 
>> > "tesseract-ocr" group.
>> > To post to this group, send email to [email protected].
>> > To unsubscribe from this group, send email to 
>> > [email protected].
>> > For more options, visit this group at
>> > http://groups.google.com/group/tesseract-ocr?hl=en.
>>
>> --
>> ``All that is gold does not glitter,
>>   not all those who wander are lost;
>> the old that is strong does not wither,
>>   deep roots are not reached by the frost.
>> From the ashes a fire shall be woken,
>>   a light from the shadows shall spring;
>> renewed shall be blade that was broken,
>>   the crownless again shall be king.”
>



-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.

