On 30 July 2010 20:12, patrickq <[email protected]> wrote:
> Hi Sven,
>
> Not only did I read these posts, but I was the one to which Jimmy
> kindly responded. Here is one quote:
>
> "At any point, if you ask Tesseract what the 'word' it sees is, it
> will simply give you a string composed of the highest-confidence
> characters: the word structure also keeps an array of possible
> characters along with the confidence from the recogniser. The weight
> from a dictionary can add extra weight to a set of characters, but
> only if the set of characters that word is composed from is among the
> set of choices (some other steps can add or remove characters...
> etc)."
>

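To make the quoted description a bit more concrete, here is a toy sketch in Python. It is not Tesseract's code or API: the function names (top_choice_word, rescore_with_dictionary), the per-position dicts, the bonus value and the confidences are all made up for illustration, and the real recogniser does a good deal more (the "other steps" in the quote). It only shows the two ideas: the plain answer is the string of highest-confidence characters, and a dictionary word can only gain weight if each of its characters is already among the choices.

```python
# Toy model only: NOT Tesseract's implementation or API.
from typing import Dict, Iterable, List

# One dict per character position: candidate character -> confidence.
Choices = List[Dict[str, float]]


def top_choice_word(choices: Choices) -> str:
    """String made of the highest-confidence character at each position."""
    return "".join(max(c, key=c.get) for c in choices)


def rescore_with_dictionary(choices: Choices, words: Iterable[str],
                            bonus: float = 0.2) -> str:
    """Pick the best-scoring string, where a dictionary word earns a
    (hypothetical) bonus only if all its characters are among the choices."""
    best = top_choice_word(choices)
    best_score = sum(max(c.values()) for c in choices)
    for word in words:
        if len(word) != len(choices):
            continue  # the toy model cannot add or remove characters
        if any(ch not in c for ch, c in zip(word, choices)):
            continue  # some character was never offered by the recogniser
        score = sum(c[ch] for ch, c in zip(word, choices)) + bonus
        if score > best_score:
            best, best_score = word, score
    return best


# Invented confidences for an image of the word "Chest".
example: Choices = [
    {"C": 0.9}, {"h": 0.9}, {"e": 0.9}, {"s": 0.9},
    {"f": 0.55, "t": 0.50},
]
print(top_choice_word(example))
print(rescore_with_dictionary(example, ["Chest", "Chestnut", "Floor", "Vice"]))
```

In this toy model the dictionary wins only when 't' is among the choices and its confidence is close enough to that of 'f'; if 't' was never offered, or scored far lower, the output stays "Chesf".
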
I think I managed to miss mentioning it completely, but there's nothing
that *forces* a word to be recognised as a dictionary word; the
dictionary is just used to establish character confidences. Really,
where you see the difference is across a longer piece of text, when the
adaptive classifier has seen enough examples to know "hey, this thing I
thought was an 'f' might actually be a 't'". In short texts, there's not
much to adapt to.

Making a bunch of training images, drawing boxfiles, etc., only goes so
far, so tess uses the dictionary as an approximation: a low-confidence
equivalent of training pages.

On the plus side, it turns out that there are functions buried in the
code to serialise/deserialise the classifier state, so it might be
useful to run a whole corpus of short images through tess in one batch,
save the state, and load that at startup.

> Although I did not debug to inspect the alternative choices for the
> mistaken 'f' and 'i', it's a reasonable expectation that 't' and 'l'
> would be next in line in these two cases respectively, because these
> ARE the letters clearly appearing in this image and these are known
> frequent mistakes. I'd say 'i' instead of 'l' is the most common
> mistake. So I think it's reasonable that I would be disappointed.
>
> If I missed something else that would indicate how I can make it work,
> please clarify!
>
> Thanks,
> Patrick
>
> On Jul 30, 1:55 pm, Sven Pedersen <[email protected]> wrote:
>> Patrick,
>> This is a known issue which has been discussed in the last three days.
>> Please look in the archives or check the emails you've received from
>> the list for the last few days.
>> --Sven
>>
>> On Fri, Jul 30, 2010 at 8:04 AM, patrickq <[email protected]>
>> wrote:
>> > This is what I did:
>>
>> > 1. Created a text file called eng.user-words, containing:
>> > Chest
>> > Chestnut
>> > Floor
>> > Vice
>>
>> > 2. Placed the file in the tessdata folder (next to eng.traineddata)
>>
>> > 3. Ran recognition on an image returning "Chesf" instead of "Chest"
>> > and "Fioor" instead of "Floor". Both mistaken "f" and "i" look quite
>> > right visually so I can only assume their confidence level would be
>> > low (but I didn't check).
>>
>> > No effect whatsoever - zip. I can only assume that a variable must be
>> > set or a function needs to be called to turn this on (even though
>> > there is no mention of needing to set anything in the documentation)
>> > or (most likely) I just don't understand how this works and the
>> > dictionary kicks in only on the day of the summer solstice and when
>> > there is a full moon or something.
>>
>> > Patrick
>>
>> > --
>> > You received this message because you are subscribed to the Google Groups
>> > "tesseract-ocr" group.
>> > To post to this group, send email to [email protected].
>> > To unsubscribe from this group, send email to
>> > [email protected].
>> > For more options, visit this group at
>> > http://groups.google.com/group/tesseract-ocr?hl=en.
>>
>> --
>> ``All that is gold does not glitter,
>> not all those who wander are lost;
>> the old that is strong does not wither,
>> deep roots are not reached by the frost.
>> From the ashes a fire shall be woken,
>> a light from the shadows shall spring;
>> renewed shall be blade that was broken,
>> the crownless again shall be king.”

--
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.

