One straightforward thing to try is turning off all dictionaries and using your own custom "user-patterns" file. Since you mentioned "your application", I assume you can program, so take a look at the comment preceding the read_pattern_list() declaration in "dict/trie.h" for more details.
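A minimal sketch of that setup, assuming the codes are two uppercase letters followed by ten digits. The parameter names load_system_dawg, load_freq_dawg and user_patterns_suffix are Tesseract 3.x config variables as I remember them; the helper write_pattern_setup() and the file locations are made up for illustration, so check them against your version before relying on this.

```python
from pathlib import Path

# Two uppercase letters followed by ten digits, in the pattern syntax
# described in dict/trie.h (\A = uppercase letter, \d = digit).
CODE_PATTERN = r"\A\A" + r"\d" * 10


def write_pattern_setup(tessdata_dir, work_dir, lang="eng", suffix="user-patterns"):
    """Write <lang>.<suffix> into tessdata_dir and a config file into work_dir.

    Hypothetical helper: Tesseract itself only needs the two files to exist.
    """
    patterns_path = Path(tessdata_dir) / f"{lang}.{suffix}"
    patterns_path.write_text(CODE_PATTERN + "\n")

    config_lines = [
        "load_system_dawg 0",               # assumption: disables the main dictionary
        "load_freq_dawg 0",                 # assumption: disables the frequent-words dictionary
        f"user_patterns_suffix {suffix}",   # assumption: enables <lang>.<suffix>
        "tessedit_create_boxfile 1",        # from the original configfile
        "tessedit_char_whitelist 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ",
    ]
    config_path = Path(work_dir) / "configfile"
    config_path.write_text("\n".join(config_lines) + "\n")
    return patterns_path, config_path
```

After that, the original command line ("tesseract.exe in.png out configfile") should pick up the new settings. The dictionary and pattern variables are read at initialization, so they probably need to be in the config file passed on the command line rather than set later.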
It seems all your strings have the same format: \A\A\d\d\d\d\d\d\d\d\d\d (Tess understands only a very limited pattern syntax).

But if accuracy is critical in your app, in the long run I would avoid using any part of Tesseract except the character classifier. That is, crop every single character out of your source image and run Tess in the single-character PSM. I think it should be easy as long as the location of every character is fairly stable across your source images. ImageMagick/shell scripts would suffice.

Best regards,
Dmitri Silaev
www.CustomOCR.com

On Wed, May 20, 2015 at 12:52 PM, Yoann Nicod <[email protected]> wrote:
> Hello,
>
> Being a beginner with Tesseract, I'm facing a problem I hope experienced
> Tesseract users will have a simple/obvious solution to.
> I am running Tesseract on codes I want to read. I run tesseract.exe with
> this command line: "tesseract.exe in.png out configfile"
> Here is the content of my configfile:
>
> tessedit_create_boxfile 1
> tessedit_char_whitelist 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ
>
> I run it on images that look like this one:
>
> <https://lh3.googleusercontent.com/-qXBW5r3VHIE/VVxNt8fmU8I/AAAAAAAAAEQ/Rv2rVqds_1I/s1600/in.png>
>
> Most of the time, the characters read and the boxes are OK. But I
> identified 3 different issues that happen from time to time.
>
> *I - Wrong character read, confusion between '0', 'O' and 'D'.*
>
> For example, for this image:
>
> <https://lh3.googleusercontent.com/-ZB-l2e22ckQ/VVxPUpA18VI/AAAAAAAAAEc/xVxNVoVQsPs/s1600/in.png>
>
> Tesseract gives me: "UFO05D424091"
> I am aware that training would improve recognition, but for reasons
> I don't want to explain here, I cannot do that, and I was hoping the
> recognition engine would work well on such a simple font. Are there any
> parameters to set in order to improve the results? I should add that since
> D, 0 and O are likely to appear in the codes, I can't exclude D and O with
> the whitelist.
>
> *II - Threshold artifacts disturb the recognition.*
>
> When my threshold operation leaves some black pixels, like in this picture:
>
> <https://lh3.googleusercontent.com/-E-Oo3W5hWYo/VVxTPJPR9BI/AAAAAAAAAEo/wSQu5Pc70SA/s1600/in.png>
>
> The resulting boxes are:
>
> <https://lh3.googleusercontent.com/-LH_MjIy3KJQ/VVxTXnEw6dI/AAAAAAAAAEw/tejkRAmdqOg/s1600/fu.bmp>
>
> The recognized code is right, but the fact that the box is wrong is very
> problematic in my application. I know I could improve my pre-processing,
> with a morphological operation for example, but I want to know if there is a
> setting that could make Tesseract ignore these black pixels. It is strange
> that one character of a word being much bigger than the others does
> not bother Tesseract.
>
> *III - Wrong character segmentation.*
>
> Whereas the first 2 problems are understandable, I don't get how this one
> can happen.
> Let's take the first example:
>
> <https://lh3.googleusercontent.com/-IQUU1rSiobE/VVxUe_F2rII/AAAAAAAAAE8/wqKFrjaUenE/s1600/in.png>
>
> It leads to these boxes:
>
> <https://lh3.googleusercontent.com/-Diwn4F_w8AY/VVxUlaEtCxI/AAAAAAAAAFE/LQQOFT5dDKM/s1600/fu.bmp>
>
> and the following recognised code: UM050409017.
> Here is the second example:
>
> <https://lh3.googleusercontent.com/-YJ4AIRY0Zh0/VVxUuQk_c7I/AAAAAAAAAFM/ZIPUN77n1fE/s1600/in.png>
>
> leading to:
>
> <https://lh3.googleusercontent.com/-7ArW5UY5Lrk/VVxUyibRcSI/AAAAAAAAAFU/pGQi_6vBF3U/s1600/fu.bmp>
>
> and the code is: UAZZO51717151.
> How is this possible? The input images are perfectly clear, I don't see
> the problem. Again, is there a setting to avoid this?
>
> I hope I am missing something obvious, for at least 1 of my problems. I
> have to admit that the list of all the possible parameters (that I found
> here: http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version) is
> hard to master, and since I am a beginner I don't know what to do now.
> Thanks in advance for your help, I attached an archive containing all the
> images.
>
> Regards
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/ba001838-4465-4bea-ab83-782af58c2c01%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
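
Coming back to the per-character suggestion in the reply: a rough sketch of how the crop-and-recognize commands could be generated, assuming the characters sit on a fixed horizontal pitch. The geometry values, file names and both helper functions are hypothetical. The page segmentation flag is "-psm" in Tesseract 3.x ("--psm" in later versions), and mode 10 is the single-character mode. Restricting the whitelist by position (letters for the first two characters, digits for the rest) is an extra assumption based on the two-letters-plus-ten-digits format, and it would sidestep the 0/O/D confusion if that format really holds.

```python
LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
DIGITS = "0123456789"


def char_boxes(x0, y0, char_w, char_h, pitch, count=12):
    """Return (x, y, w, h) crop boxes for `count` evenly spaced characters."""
    return [(x0 + i * pitch, y0, char_w, char_h) for i in range(count)]


def crop_and_ocr_commands(image, boxes, psm_flag="-psm"):
    """Build ImageMagick and Tesseract argument lists, one pair per character.

    Nothing is executed here; pass each list to subprocess.run() to run it.
    """
    commands = []
    for i, (x, y, w, h) in enumerate(boxes):
        crop_png = f"char_{i:02d}.png"
        out_base = f"char_{i:02d}"
        # Assumption: positions 0-1 are letters, the remaining ones are digits.
        whitelist = LETTERS if i < 2 else DIGITS
        commands.append(
            ["convert", image, "-crop", f"{w}x{h}+{x}+{y}", "+repage", crop_png]
        )
        commands.append(
            ["tesseract", crop_png, out_base, psm_flag, "10",
             "-c", f"tessedit_char_whitelist={whitelist}"]
        )
    return commands
```

For example, crop_and_ocr_commands("in.png", char_boxes(10, 5, 20, 30, 24)) yields 24 commands, and each character's result ends up in its own char_NN.txt file. The box coordinates have to be measured on the real images, and this only works if the character positions are as stable as the reply assumes.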

