According to a conversation between Philip Pemberton and Jimmy on the 27th, the user wordlist may not work in Tesseract 3. You may need to call the file 'eng.' or $LANG. and put it in the traindata folder. It sounds like Jimmy is planning to improve the situation eventually. In the meantime you may have to train Tesseract yourself with your corpus (and font) to improve results, or do image manipulations (resize/adjust) to improve the input at runtime.
--Sven
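For what it's worth, here is a rough sketch of how I read Jimmy's explanation, quoted in Patrick's message further down this thread. It is a toy model only, not Tesseract's actual data structures or scoring: the function names, the per-position confidence dictionaries and the `bonus` value are all made up for illustration. It just shows why a dictionary word can only win if each of its characters is already among the recogniser's choices for that position.

```python
# Toy model of dictionary re-ranking. This is NOT Tesseract's real code;
# all names, confidences and the bonus value are hypothetical.

def best_raw_word(choices):
    """Pick the highest-confidence character at each position."""
    return "".join(max(pos, key=pos.get) for pos in choices)


def rerank_with_dictionary(choices, user_words, bonus=0.3):
    """Return the best-scoring word.

    A dictionary word gets a bonus, but only if every one of its
    characters is among the recogniser's choices for that position.
    """
    raw = best_raw_word(choices)
    best_word = raw
    best_score = sum(pos[ch] for pos, ch in zip(choices, raw))
    for word in user_words:
        if len(word) != len(choices):
            continue  # wrong length, cannot match position by position
        if not all(ch in pos for pos, ch in zip(choices, word)):
            continue  # a needed character was never proposed
        score = sum(pos[ch] for pos, ch in zip(choices, word)) + bonus
        if score > best_score:
            best_word, best_score = word, score
    return best_word


user_words = ["Chest", "Chestnut", "Floor", "Vice"]

# Case 1: 't' is a lower-confidence alternative to 'f' in the last position.
with_t = [{"C": 0.9}, {"h": 0.9}, {"e": 0.9}, {"s": 0.9}, {"f": 0.6, "t": 0.5}]
# Case 2: 't' was never proposed, so the dictionary cannot help.
without_t = [{"C": 0.9}, {"h": 0.9}, {"e": 0.9}, {"s": 0.9}, {"f": 0.6}]

print(best_raw_word(with_t))                         # Chesf
print(rerank_with_dictionary(with_t, user_words))    # Chest
print(rerank_with_dictionary(without_t, user_words)) # Chesf
```

If this model is roughly right, the behaviour Patrick saw has two possible explanations: either the wordlist is not being loaded at all, or 't' and 'l' were never among the alternatives for those positions. Inspecting the alternative choices would tell the two apart.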
On Fri, Jul 30, 2010 at 2:12 PM, patrickq <[email protected]> wrote:
> Hi Sven,
>
> Not only did I read these posts, but I was the one to whom Jimmy
> kindly responded. Here is one quote:
>
> "At any point, if you ask Tesseract what the 'word' it sees is, it
> will simply give you a string composed of the highest-confidence
> characters: the word structure also keeps an array of possible
> characters along with the confidence from the recogniser. The weight
> from a dictionary can add extra weight to a set of characters, but
> only if the set of characters that word is composed from is among the
> set of choices (some other steps can add or remove characters...
> etc)."
>
> Although I did not debug to inspect the alternative choices for the
> mistaken 'f' and 'i', it's a reasonable expectation that 't' and 'l'
> would be next in line in these two cases respectively, because these
> ARE the letters clearly appearing in this image and these are known
> frequent mistakes. I'd say 'i' instead of 'l' is the most common
> mistake. So I think it's reasonable that I would be disappointed.
>
> If I missed something else that would indicate how I can make it work,
> please clarify!
>
> Thanks,
> Patrick
>
> On Jul 30, 1:55 pm, Sven Pedersen <[email protected]> wrote:
>> Patrick,
>> This is a known issue which has been discussed in the last three days.
>> Please look in the archives or check the emails you've received from
>> the list for the last few days.
>> --Sven
>>
>> On Fri, Jul 30, 2010 at 8:04 AM, patrickq <[email protected]>
>> wrote:
>> > This is what I did:
>> >
>> > 1. Created a text file called eng.user-words, containing:
>> > Chest
>> > Chestnut
>> > Floor
>> > Vice
>> >
>> > 2. Placed the file in the tessdata folder (next to eng.traineddata)
>> >
>> > 3. Ran recognition on an image returning "Chesf" instead of "Chest"
>> > and "Fioor" instead of "Floor". Both mistaken "f" and "i" look quite
>> > right visually so I can only assume their confidence level would be
>> > low (but I didn't check).
>> >
>> > No effect whatsoever - zip. I can only assume that a variable must be
>> > set or a function needs to be called to turn this on (even though
>> > there is no mention of needing to set anything in the documentation)
>> > or (most likely) I just don't understand how this works and the
>> > dictionary kicks in only on the day of the summer solstice and when
>> > there is a full moon or something.
>> >
>> > Patrick

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to [email protected].
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en.

