Re: [tesseract-ocr] OCR failing on simple and clear text codes

Dmitri Silaev Wed, 20 May 2015 03:29:41 -0700

One simple thing to try is turning off all dictionaries and using your own
custom "user-patterns" file. Since you mentioned "your application", I
assume you can program, so take a look at the comment preceding the
read_pattern_list() declaration in "dict/trie.h" for more details.


It seems all your strings have the same format, two uppercase letters
followed by ten digits:
\A\A\d\d\d\d\d\d\d\d\d\d
(Tess understands only a very limited pattern syntax.)
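
As a rough sketch of that setup, the script below writes a patterns file and
a matching config file. The parameter names load_system_dawg, load_freq_dawg
and user_patterns_suffix are Tesseract 3.x config variables; the file
locations (tessdata/eng.user-patterns and tessdata/configs/codes), the
config name "codes" and the helper name write_tess_files are assumptions
made for illustration, so check them against your installation.

```python
from pathlib import Path

# Two uppercase letters followed by ten digits, in Tesseract's pattern syntax.
PATTERN = r"\A\A" + r"\d" * 10


def write_tess_files(tessdata_dir, lang="eng", suffix="user-patterns",
                     config_name="codes"):
    """Write <lang>.<suffix> and a config file under tessdata_dir.

    The directory layout and config_name are assumptions, not something
    Tesseract requires in exactly this form.
    """
    tessdata = Path(tessdata_dir)
    configs = tessdata / "configs"
    configs.mkdir(parents=True, exist_ok=True)

    # One pattern per line.
    patterns_path = tessdata / f"{lang}.{suffix}"
    patterns_path.write_text(PATTERN + "\n")

    # Disable the built-in dictionaries and point Tesseract at the patterns.
    config_path = configs / config_name
    config_path.write_text(
        "load_system_dawg 0\n"
        "load_freq_dawg 0\n"
        f"user_patterns_suffix {suffix}\n"
        "tessedit_char_whitelist 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ\n"
        "tessedit_create_boxfile 1\n"
    )
    return patterns_path, config_path
```

With those files in place, the command line from the original post would
become something like "tesseract.exe in.png out codes". Whether the patterns
are actually picked up may depend on the Tesseract version.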

But if accuracy is critical in your app, in the long run I would avoid
using any part of Tesseract except the character classifier. That is, crop
every single character out of your source image and run Tess in the
single-character PSM (page segmentation mode). I think this should be easy
as long as the location of every character is fairly stable across your
source images. ImageMagick and shell scripts would suffice.
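
A minimal sketch of that per-character approach, assuming the characters sit
in a fixed-pitch row so each one occupies an equal-width cell. The cell
geometry, the output file names and both helper names are hypothetical. The
script only builds the ImageMagick and Tesseract command lines; run them
with subprocess or paste them into a shell script. The single-character
mode is PSM 10; 3.x builds spell the flag "-psm", newer ones "--psm".

```python
def char_boxes(x0, y0, cell_w, cell_h, n_chars):
    """Return (x, y, w, h) crop boxes for n_chars equal-width cells."""
    return [(x0 + i * cell_w, y0, cell_w, cell_h) for i in range(n_chars)]


def build_commands(image, boxes, psm_flag="-psm"):
    """Build one crop command and one OCR command per character box."""
    cmds = []
    for i, (x, y, w, h) in enumerate(boxes):
        crop = f"char_{i:02d}.png"
        out_base = f"char_{i:02d}"  # tesseract appends .txt itself
        # ImageMagick geometry is WxH+X+Y; +repage resets the canvas offset.
        cmds.append(["convert", image, "-crop", f"{w}x{h}+{x}+{y}",
                     "+repage", crop])
        # PSM 10 treats the whole image as a single character.
        cmds.append(["tesseract", crop, out_base, psm_flag, "10"])
    return cmds


if __name__ == "__main__":
    # Hypothetical geometry: 12 characters, each 20x30 px, starting at (10, 5).
    for cmd in build_commands("in.png", char_boxes(10, 5, 20, 30, 12)):
        print(" ".join(cmd))
```

Because each crop holds exactly one character, segmentation errors such as
merged or split boxes cannot occur; only the classifier's decision remains.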

Best regards,
Dmitri Silaev
www.CustomOCR.com





On Wed, May 20, 2015 at 12:52 PM, Yoann Nicod <[email protected]> wrote:

> Hello,
>
> Being a beginner with Tesseract, I'm facing a problem that I hope
> experienced Tesseract users can offer a simple/obvious solution to.
> I am running Tesseract on codes I want to read. I run tesseract.exe with
> this command line : "tesseract.exe in.png out configfile"
> Here is the content of my configfile :
>
>    tessedit_create_boxfile 1
>    tessedit_char_whitelist 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ
>
> I run it on images that look like this one :
>
>
> <https://lh3.googleusercontent.com/-qXBW5r3VHIE/VVxNt8fmU8I/AAAAAAAAAEQ/Rv2rVqds_1I/s1600/in.png>
> Most of the time, the characters read and the boxes are OK. But I
> identified 3 different issues that happen from time to time.
>
>       *I - Wrong character read, confusion between '0', 'O' and 'D'.*
>
> For example, for this image :
>
>
> <https://lh3.googleusercontent.com/-ZB-l2e22ckQ/VVxPUpA18VI/AAAAAAAAAEc/xVxNVoVQsPs/s1600/in.png>
> Tesseract gives me : "UFO05D424091"
> I am aware that training would improve recognition, but for reasons I
> don't want to explain here I cannot do that, and I was hoping the
> recognition engine would work well on such a simple font. Are there any
> parameters to set in order to improve the results? I should add that since
> D, 0 and O are all likely to appear in the codes, I can't exclude D and O
> with the whitelist.
>
>       *II - Threshold artifacts disturb the recognition.*
>
> When my threshold operation leaves some black pixels, as in this picture:
>
>
> <https://lh3.googleusercontent.com/-E-Oo3W5hWYo/VVxTPJPR9BI/AAAAAAAAAEo/wSQu5Pc70SA/s1600/in.png>
> The resulting boxes are :
>
>
> <https://lh3.googleusercontent.com/-LH_MjIy3KJQ/VVxTXnEw6dI/AAAAAAAAAEw/tejkRAmdqOg/s1600/fu.bmp>
> The recognized code is right, but the fact that the box is wrong is very
> problematic in my application. I know I could improve my pre-processing,
> with a morphological operation for example, but I want to know if there is
> a setting that could make tesseract ignore these black pixels. It is
> strange that one character of a word being much bigger than the others
> does not bother tesseract.
>
>       *III - Wrong character segmentation.*
>
> While the first 2 problems are understandable, I don't get how this one
> can happen.
> Let's take the first example :
>
>
> <https://lh3.googleusercontent.com/-IQUU1rSiobE/VVxUe_F2rII/AAAAAAAAAE8/wqKFrjaUenE/s1600/in.png>
> it leads to these boxes :
>
>
> <https://lh3.googleusercontent.com/-Diwn4F_w8AY/VVxUlaEtCxI/AAAAAAAAAFE/LQQOFT5dDKM/s1600/fu.bmp>
> and the following recognised code : UM050409017.
> Here is the second example :
>
>
> <https://lh3.googleusercontent.com/-YJ4AIRY0Zh0/VVxUuQk_c7I/AAAAAAAAAFM/ZIPUN77n1fE/s1600/in.png>
> leading to :
>
>
> <https://lh3.googleusercontent.com/-7ArW5UY5Lrk/VVxUyibRcSI/AAAAAAAAAFU/pGQi_6vBF3U/s1600/fu.bmp>
> and the code is : UAZZO51717151.
> How is this possible? The input images are perfectly clear, I don't see
> the problem. Again, is there a setting that would avoid this?
>
> I hope I am missing something obvious, for at least 1 of my problems. I
> have to admit that the list of all the possible parameters (that I found
> here : http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version) is
> hard to master, and since I am a beginner I don't know what to do now.
> Thanks in advance for your help, I attached an archive containing all the
> images.
>
> Regards
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/ba001838-4465-4bea-ab83-782af58c2c01%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/ba001838-4465-4bea-ab83-782af58c2c01%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>

