[tesseract-ocr] Re: Pharmaceutics OCR recognition project

elena bresciani Tue, 17 Jun 2014 03:09:08 -0700

I tried using bazaar with my user-words and the results are much better; 
working on image pre-processing also helped to improve the output.


I have another issue now: I expanded my list of user-words to about 7000 
words, but I get this error:

 >>Error: word '......' not in DAWG after adding it
 >>Error: failed to load /usr/local/share/tessdata/ita.user-words

I found a report of the problem here: 
https://code.google.com/p/tesseract-ocr/issues/detail?id=1020
but I still don't know how to solve it. Reading through the source code (in 
dict.h) I found, as in the report:

   static const int kMaxUserDawgEdges = 50000;

Is this what causes the error? But my list has only about 7000 words, which 
is far fewer than 50000...
I don't understand.
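
One thing that may be worth checking (an assumption, not a confirmed 
diagnosis): the constant counts edges, not words. If the user words are 
stored in a trie-like structure with roughly one edge per character that is 
not already shared with another word's prefix, a 7000-word list could need 
far more than 7000 edges. The sketch below counts edges in a plain Python 
trie. `count_trie_edges` is a hypothetical helper, not Tesseract code, and 
Tesseract's real edge accounting may differ.

```python
def count_trie_edges(words):
    """Count the edges of a simple prefix trie built from `words`.

    Illustration only: this is NOT Tesseract's DAWG code, and its actual
    edge accounting may differ from this simple model.
    """
    root = {}
    edges = 0
    for word in words:
        node = root
        for ch in word.strip():
            if ch not in node:
                # A new character transition costs one edge.
                node[ch] = {}
                edges += 1
            node = node[ch]
    return edges


if __name__ == "__main__":
    # Hypothetical sample words; shared prefixes are only counted once.
    sample = ["aspirina", "aspirinetta", "tachipirina"]
    print(count_trie_edges(sample))
```

Passing the open user-words file (one word per line) to the function gives a 
rough edge estimate for the real list, which can then be compared with 50000.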

Thank you very much.

Elena


On Saturday, 14 June 2014 16:11:58 UTC+2, Paul wrote:
>
> Could you show us an example image that gives you bad results?
>
> It might be useful to try another technique for image binarization.
> Tesseract uses Otsu's method. I would suggest using a method like this 
> one <http://www.imlab.jp/cbdar2007/proceedings/papers/O1-1.pdf> by Kasar 
> et al.
> It can help with colored imagery and with white text on a black or 
> colored background.
>
> Your idea to add a drug dictionary could also be beneficial. You don't 
> necessarily need to start a new training, though.
> Maybe using bazaar with your own "eng.user-words" file might be enough 
> (see 
> http://tesseract-ocr.googlecode.com/svn-history/r1116/trunk/doc/tesseract.1.html
> ).
>
>
> On Wednesday, 11 June 2014 12:49:34 UTC+2, elena bresciani wrote:
>>
>> Hello to everybody,
>>
>> for the project I'm working on I need to automatically recognize a drug 
>> from an image of its package. 
>> I tried tesseract, but the results were not very good. In particular, 
>> certain words (especially the drug names) are sometimes badly 
>> misrecognized, and other words (even ones printed in large fonts) are 
>> missing entirely.
>>
>> How can I resolve these issues?
>> Do I perhaps need to train tesseract with a "drug dictionary"?
>> And how can I fix the problem of completely missing words?
>>
>> Thank you in advance
>>
>> Cheers
>> Elena
>>
>
