I tried using the bazaar config with my user-words and the results are much better; working on image pre-processing also helped improve the output.
I have another issue now: I expanded my list of user-words to about 7000 words, but I get this error:

>>Error: word '......' not in DAWG after adding it
>>Error: failed to load /usr/local/share/tessdata/ita.user-words

I found a report of the problem here: https://code.google.com/p/tesseract-ocr/issues/detail?id=1020 but I still don't know how to solve it.
Reading through the source code (in dict.h) I found, as in the report:

static const int kMaxUserDawgEdges = 50000;

Is this what causes the error? But my list has 7000 words, which is far fewer than 50000... I don't understand.

Thank you very much.
Elena

On Saturday, 14 June 2014 16:11:58 UTC+2, Paul wrote:
>
> Could you perhaps show us an example image that gives you bad results?
>
> It would probably be useful to use another technique for image
> binarization.
> Tesseract uses Otsu's method. I would suggest using a method like this
> one <http://www.imlab.jp/cbdar2007/proceedings/papers/O1-1.pdf> by Kasar
> et al.
> It can be helpful with colored imagery and white-on-black/color text.
>
> Your idea to add a drug dictionary could also be beneficial. You don't
> necessarily need to start a new training, though.
> Maybe using bazaar with your own "eng.user-words" file might be enough
> (see
> http://tesseract-ocr.googlecode.com/svn-history/r1116/trunk/doc/tesseract.1.html
> ).
>
>
> On Wednesday, 11 June 2014 12:49:34 UTC+2, elena bresciani wrote:
>>
>> Hello everybody,
>>
>> for the project I'm working on I need to automatically recognize a drug
>> from an image of its package.
>> I tried tesseract, but the results were not very good. In particular,
>> certain words (especially the drug names) are sometimes completely misread,
>> and other words (even ones printed in big fonts) are missing.
>>
>> How can I resolve my issues?
>> Maybe I have to train tesseract with a "drug-dictionary"?
>> And how can I resolve the problem of completely missing words?
>>
>> Thank you in advance
>>
>> Cheers
>> Elena
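
On the kMaxUserDawgEdges question: the constant is named for edges, not words, so a 7000-word list could in principle still exceed it. A rough way to check is to count the edges a plain prefix trie would need for the list, which is one edge per distinct non-empty prefix. This is only a sketch: that Tesseract counts edges this way is an assumption (its internal structure may use more or fewer edges per character), and the function names below are made up for illustration.

```python
# Rough estimate of how many edges a plain prefix trie needs for a word list.
# ASSUMPTION: Tesseract's user-word limit is compared against something like
# this count. The real implementation may count edges differently.

K_MAX_USER_DAWG_EDGES = 50000  # value quoted from dict.h in the post above


def count_trie_edges(words):
    """Return the number of distinct non-empty prefixes (one trie edge each)."""
    prefixes = set()
    for word in words:
        word = word.strip()  # drop newlines when reading straight from a file
        for end in range(1, len(word) + 1):
            prefixes.add(word[:end])
    return len(prefixes)


def check_word_list(words, limit=K_MAX_USER_DAWG_EDGES):
    """Return (estimated_edges, exceeds_limit) for an iterable of words."""
    edges = count_trie_edges(words)
    return edges, edges > limit


def check_word_file(path, limit=K_MAX_USER_DAWG_EDGES):
    """Same as check_word_list, reading one word per line from a UTF-8 file."""
    with open(path, encoding="utf-8") as handle:
        return check_word_list(handle, limit)
```

For example, `check_word_file("/usr/local/share/tessdata/ita.user-words")` returns the estimated edge count and whether it is above 50000. With 7000 words of eight or more letters and few shared prefixes, the estimate can pass 50000 even though the word count is far below it.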

