The number of edges in the DAWG is not equivalent to the number of words in your dictionary. Here's some information about DAWGs: http://tesseract-ocr.repairfaq.org/allaboutdawg.html
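To make the edges-versus-words point concrete, here is a minimal sketch that counts edges in a plain, unminimized trie. It is not Tesseract's own data structure or edge accounting: the function name `count_trie_edges` and the sample words are made up for illustration. A minimized DAWG also shares suffixes, so it would need fewer edges than this. It only shows that each word contributes roughly one edge per letter that is not already covered by a shared prefix.

```python
def count_trie_edges(words):
    """Count edges in a simple prefix trie (hypothetical helper, not Tesseract code)."""
    root = {}
    edges = 0
    for word in words:
        node = root
        for ch in word:
            # A new edge is needed only when this prefix has not been seen before.
            if ch not in node:
                node[ch] = {}
                edges += 1
            node = node[ch]
    return edges


# Illustrative sample words, not taken from any real user-words file.
sample_words = ["aspirina", "aspirinetta", "tachipirina"]
print(len(sample_words), "words ->", count_trie_edges(sample_words), "edges")
```

With an average word length of around eight letters and little prefix sharing, a list of a few thousand words can plausibly need tens of thousands of edges in a structure like this, which is why a 7000-word list may run into a limit that is expressed in edges.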
That upper bound might actually be the root of your problem. If you have already compiled Tesseract yourself, try a larger value for kMaxUserDawgEdges. If you have not, you could either reduce the number of words in your dictionary or add the dictionary during training.

Regards,
Paul

On Tuesday, 17 June 2014 11:43:06 UTC+2, elena bresciani wrote:
>
> I tried using bazaar with my user-words and the results are much better.
> Working on image pre-processing also helped to improve the output.
>
> I have another issue now: I expanded my list of user-words to about 7000
> words, but I get this error:
>
> >>Error: word '......' not in DAWG after adding it
> >>Error: failed to load /usr/local/share/tessdata/ita.user-words
>
> I found a report of the problem here:
> https://code.google.com/p/tesseract-ocr/issues/detail?id=1020
> but I still don't know how to solve it. Reading through the source code
> (in dict.h) I found, as in the report:
>
> static const int kMaxUserDawgEdges = 50000;
>
> Is this what causes the error? But my list has 7000 words, which is much
> less than 50000...
> I don't understand.
>
> Thank you very much.
>
> Elena
>
>
> On Saturday, 14 June 2014 16:11:58 UTC+2, Paul wrote:
>>
>> Could you show us an example image that gives you bad results?
>>
>> It would probably be useful to use a different technique for image
>> binarization.
>> Tesseract uses Otsu's method. I would suggest using a method like this
>> one <http://www.imlab.jp/cbdar2007/proceedings/papers/O1-1.pdf> by Kasar
>> et al.
>> It can be helpful with colored imagery and white-on-black or
>> white-on-color text.
>>
>> Your idea to add a drug dictionary could also be beneficial. You don't
>> necessarily need to start a new training, though.
>> Using bazaar with your own "eng.user-words" file might be enough
>> (see
>> http://tesseract-ocr.googlecode.com/svn-history/r1116/trunk/doc/tesseract.1.html
>> ).
>>
>>
>> On Wednesday, 11 June 2014 12:49:34 UTC+2, elena bresciani wrote:
>>>
>>> Hello everybody,
>>>
>>> For the project I'm working on, I need to automatically recognize a drug
>>> from an image of its package.
>>> I tried Tesseract, but the results were not very good. In particular,
>>> certain words (especially the drug names) are sometimes badly
>>> misrecognized, and other words (even ones printed in big fonts) are
>>> missing.
>>>
>>> How can I resolve these issues?
>>> Maybe I have to train Tesseract with a "drug dictionary"?
>>> And how can I resolve the problem of completely missing words?
>>>
>>> Thank you in advance
>>>
>>> Cheers
>>> Elena
>>>

