Thanks for your reply. I can't really use predefined patterns, since the code pattern and font can change over time. I like the idea of segmenting the characters myself before giving them to Tesseract one by one, but it looks time-consuming (coding it, I mean). Isn't there any other suitable method? In particular for the 3rd issue, which I think must be easy to solve.
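
For reference, the segmentation step may be less work than it sounds. Below is a minimal sketch, assuming the image is already thresholded into rows of 0 (background) and 1 (ink) and that the characters do not touch each other. It splits the line wherever a column contains no ink and drops runs that are too narrow to be a character. The names `column_runs`, `crop_columns` and `min_width` are made up for this illustration and are not part of Tesseract.

```python
def column_runs(image, min_width=2):
    """Return (start, end) column ranges that contain ink.

    `image` is a list of rows, each a sequence of 0 (background) and 1 (ink).
    `end` is exclusive. Runs narrower than `min_width` columns are dropped,
    which discards isolated specks left by thresholding.
    """
    if not image:
        return []
    width = len(image[0])
    # Vertical projection: number of ink pixels in each column.
    profile = [sum(row[x] for row in image) for x in range(width)]

    runs = []
    start = None
    for x, count in enumerate(profile):
        if count and start is None:
            start = x                  # a run of inked columns begins
        elif not count and start is not None:
            runs.append((start, x))    # an empty column closes the run
            start = None
    if start is not None:
        runs.append((start, width))    # run reaches the right edge
    return [(a, b) for a, b in runs if b - a >= min_width]


def crop_columns(image, start, end):
    """Cut the columns [start, end) out of every row."""
    return [list(row[start:end]) for row in image]


# Toy 3x9 bitmap: two glyphs separated by blank columns, plus a 1-pixel speck.
toy = [
    [1, 1, 0, 0, 1, 1, 1, 0, 1],
    [1, 0, 0, 0, 1, 0, 1, 0, 0],
    [1, 1, 0, 0, 1, 1, 1, 0, 0],
]
boxes = column_runs(toy, min_width=2)
glyphs = [crop_columns(toy, a, b) for a, b in boxes]
```

Each crop could then be saved as its own image and passed to Tesseract in single-character page segmentation mode (PSM 10). This only works when blank columns separate the characters; touching characters would need a different split rule.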
On Wednesday, May 20, 2015 at 12:29:08 PM UTC+2, Dmitri Silaev wrote:
>
> One no-brainer method to try out would be turning off all dictionaries and
> using your own custom "user-patterns" file. Since you mentioned "your
> application" I suppose you can program. So you can take a look at the
> comment preceding the read_pattern_list() declaration in "dict/trie.h" for
> more details.
>
> It seems all your strings are of the same format:
> \A\A\d\d\d\d\d\d\d\d\d\d
> (Tess understands very limited pattern syntax).
>
> But if accuracy is critical in your app, in the long run I would
> absolutely avoid using any parts of Tesseract except the char classifier,
> i.e. crop every single char out of your source image and run Tess in the
> single-char PSM. I think it should be easy as long as the location of every
> character is quite stable among your source images. ImageMagick/shell
> scripts would suffice.
>
> Best regards,
> Dmitri Silaev
> www.CustomOCR.com
>
> On Wed, May 20, 2015 at 12:52 PM, Yoann Nicod <[email protected]> wrote:
>
>> Hello,
>>
>> Being a beginner with Tesseract, I'm facing a problem that I hope
>> experienced Tesseract users will have a simple/obvious solution to.
>> I am running Tesseract on codes I want to read. I run tesseract.exe with
>> this command line: "tesseract.exe in.png out configfile"
>> Here is the content of my configfile:
>>
>> tessedit_create_boxfile 1
>> tessedit_char_whitelist 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ
>>
>> I run it on images that look like this one:
>>
>> <https://lh3.googleusercontent.com/-qXBW5r3VHIE/VVxNt8fmU8I/AAAAAAAAAEQ/Rv2rVqds_1I/s1600/in.png>
>>
>> Most of the time, the characters read and the boxes are OK. But I
>> identified 3 different issues that happen from time to time.
>>
>> *I - Wrong character read, confusion between '0', 'O' and 'D'.*
>>
>> For example, for this image:
>>
>> <https://lh3.googleusercontent.com/-ZB-l2e22ckQ/VVxPUpA18VI/AAAAAAAAAEc/xVxNVoVQsPs/s1600/in.png>
>>
>> Tesseract gives me: "UFO05D424091"
>> I am aware that training would improve recognition, but for reasons
>> I don't want to explain here I cannot do that, and I was hoping the
>> recognition engine would work well on such a simple font. Are there any
>> parameters to set in order to improve the results? I should add that since
>> D, 0 and O are all likely to appear in the codes, I can't exclude D and O
>> with the whitelist.
>>
>> *II - Threshold artifacts disturb the recognition.*
>>
>> When my threshold operation leaves some black pixels, like in this
>> picture:
>>
>> <https://lh3.googleusercontent.com/-E-Oo3W5hWYo/VVxTPJPR9BI/AAAAAAAAAEo/wSQu5Pc70SA/s1600/in.png>
>>
>> The resulting boxes are:
>>
>> <https://lh3.googleusercontent.com/-LH_MjIy3KJQ/VVxTXnEw6dI/AAAAAAAAAEw/tejkRAmdqOg/s1600/fu.bmp>
>>
>> The recognized code is right, but the fact that the box is wrong is very
>> problematic in my application. I know I could improve my pre-processing,
>> with a morphological operation for example, but I want to know if there is
>> a setting that could make Tesseract ignore these black pixels. It is
>> strange that one character of a word being far bigger than the others does
>> not bother Tesseract.
>>
>> *III - Wrong character segmentation.*
>>
>> Whereas the first 2 problems are understandable, I don't get how this one
>> can happen.
>> Let's take the first example:
>>
>> <https://lh3.googleusercontent.com/-IQUU1rSiobE/VVxUe_F2rII/AAAAAAAAAE8/wqKFrjaUenE/s1600/in.png>
>>
>> It leads to these boxes:
>>
>> <https://lh3.googleusercontent.com/-Diwn4F_w8AY/VVxUlaEtCxI/AAAAAAAAAFE/LQQOFT5dDKM/s1600/fu.bmp>
>>
>> and the following recognised code: UM050409017.
>> Here is the second example:
>>
>> <https://lh3.googleusercontent.com/-YJ4AIRY0Zh0/VVxUuQk_c7I/AAAAAAAAAFM/ZIPUN77n1fE/s1600/in.png>
>>
>> leading to:
>>
>> <https://lh3.googleusercontent.com/-7ArW5UY5Lrk/VVxUyibRcSI/AAAAAAAAAFU/pGQi_6vBF3U/s1600/fu.bmp>
>>
>> and the code is: UAZZO51717151.
>> How is this possible? The input images are perfectly clear, I don't see
>> the problem. Again, is there a setting that would avoid this?
>>
>> I hope I am missing something obvious, for at least 1 of my problems. I
>> have to admit that the list of all the possible parameters (that I found
>> here: http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version) is
>> hard to master, and since I am a beginner I don't know what to do now.
>> Thanks in advance for your help, I attached an archive containing all the
>> images.
>>
>> Regards
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/ba001838-4465-4bea-ab83-782af58c2c01%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/877f5620-b346-4429-a18f-0921ae60fb65%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.

