Re: [tesseract-ocr] OCR failing on simple and clear text codes

Yoann Nicod Wed, 20 May 2015 04:51:37 -0700

Thanks for your reply,

I can't really use predefined patterns, since the code pattern and font can 
change over time.
I like the idea of segmenting the characters myself before giving them to 
Tesseract one by one, but it looks time-consuming (coding it, I mean).
Is there any other suitable method? In particular for the 3rd issue, 
which I think must be easy to solve.
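
For well-separated characters like the ones in the sample images, the manual segmentation discussed here may take less code than it seems. Below is a minimal sketch of a vertical projection split, assuming the image has already been thresholded into rows of 0/1 values (1 = ink). The names `split_columns` and `min_width` are hypothetical and not part of Tesseract.

```python
def split_columns(binary, min_width=2):
    """Return (start, end) column ranges that contain ink.

    binary: list of equal-length rows, each a list of 0/1 (1 = ink).
    Runs narrower than min_width columns are dropped as threshold specks.
    The end index of each range is exclusive.
    """
    if not binary:
        return []
    width = len(binary[0])
    # Vertical projection: True for every column with at least one ink pixel.
    has_ink = [any(row[x] for row in binary) for x in range(width)]

    ranges = []
    start = None
    # The trailing False is a sentinel that closes a run touching the right edge.
    for x, ink in enumerate(has_ink + [False]):
        if ink and start is None:
            start = x                      # a new run of inked columns begins
        elif not ink and start is not None:
            if x - start >= min_width:     # keep only runs wide enough to be a glyph
                ranges.append((start, x))
            start = None
    return ranges


# Tiny synthetic example: two 3-pixel-wide glyphs and a 1-pixel speck.
image = [
    [0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0],
]
boxes = split_columns(image)
print(boxes)
```

Each returned range could then be cropped out and passed to Tesseract in its single-character page segmentation mode, as suggested in the quoted reply. The width filter only discards specks that sit in their own columns; a speck directly above or below a character would still need a row-wise check or a morphological clean-up.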




On Wednesday, May 20, 2015 at 12:29:08 PM UTC+2, Dmitri Silaev wrote:
>
> One no-brainer method to try would be turning off all dictionaries and 
> using your own custom "user-patterns" file. Since you mentioned "your 
> application", I suppose you can program, so you can take a look at the 
> comment preceding the read_pattern_list() declaration in "dict/trie.h" for 
> more details.
>
> It seems all your strings are of the same format:
> \A\A\d\d\d\d\d\d\d\d\d\d
> (Tess understands only a very limited pattern syntax).
>
> But if accuracy is critical in your app, in the long run I would 
> absolutely avoid using any part of Tesseract except the character 
> classifier, i.e. crop every single character out of your source image and 
> run Tess in the single-char PSM. I think it should be easy as long as the 
> location of every character is quite stable among your source images. 
> ImageMagick/shell scripts would suffice.
>
> Best regards,
> Dmitri Silaev
> www.CustomOCR.com
>
> On Wed, May 20, 2015 at 12:52 PM, Yoann Nicod <[email protected]> wrote:
>
>> Hello,
>>
>> I am a beginner with Tesseract and I am facing a problem that I hope 
>> experienced Tesseract users can offer a simple/obvious solution to.
>> I am running Tesseract on codes I want to read. I run tesseract.exe with 
>> this command line: "tesseract.exe in.png out configfile"
>> Here is the content of my configfile:
>>
>>    tessedit_create_boxfile 1
>>    tessedit_char_whitelist 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ
>>
>> I run it on images that look like this one:
>>
>>
>> <https://lh3.googleusercontent.com/-qXBW5r3VHIE/VVxNt8fmU8I/AAAAAAAAAEQ/Rv2rVqds_1I/s1600/in.png>
>> Most of the time, the characters read and the boxes are OK. But I 
>> identified 3 different issues that happen from time to time.
>>
>>       *I - Wrong character read, confusion between '0', 'O' and 'D'.*
>>
>> For example, for this image:
>>
>>
>> <https://lh3.googleusercontent.com/-ZB-l2e22ckQ/VVxPUpA18VI/AAAAAAAAAEc/xVxNVoVQsPs/s1600/in.png>
>> Tesseract gives me: "UFO05D424091"
>> I am aware that training would improve recognition, but for reasons I 
>> don't want to explain here I cannot do that, and I was hoping the 
>> recognition engine would work well on such a simple font. Are there any 
>> parameters to set in order to improve the results? I should add that since 
>> D, 0 and O are all likely to appear in the codes, I can't exclude D and O 
>> with the whitelist.
>>
>>       *II - Threshold artifacts disturb the recognition.*
>>
>> When my threshold operation leaves some stray black pixels, as in this 
>> picture:
>>
>>
>> <https://lh3.googleusercontent.com/-E-Oo3W5hWYo/VVxTPJPR9BI/AAAAAAAAAEo/wSQu5Pc70SA/s1600/in.png>
>> The resulting boxes are:
>>
>>
>> <https://lh3.googleusercontent.com/-LH_MjIy3KJQ/VVxTXnEw6dI/AAAAAAAAAEw/tejkRAmdqOg/s1600/fu.bmp>
>> The recognized code is right, but the fact that the box is wrong is very 
>> problematic in my application. I know I could improve my pre-processing, 
>> for example with a morphological operation, but I want to know if there is 
>> a setting that could make Tesseract ignore these black pixels. It seems 
>> strange that Tesseract is not bothered by one character of a word being 
>> much bigger than the others.
>>
>>       *III - Wrong character segmentation.*
>>
>> While the first two problems are understandable, I don't understand how 
>> this one can happen.
>> Let's take the first example:
>>
>>
>> <https://lh3.googleusercontent.com/-IQUU1rSiobE/VVxUe_F2rII/AAAAAAAAAE8/wqKFrjaUenE/s1600/in.png>
>> It leads to these boxes:
>>
>>
>> <https://lh3.googleusercontent.com/-Diwn4F_w8AY/VVxUlaEtCxI/AAAAAAAAAFE/LQQOFT5dDKM/s1600/fu.bmp>
>> and the following recognised code: UM050409017.
>> Here is the second example:
>>
>>
>> <https://lh3.googleusercontent.com/-YJ4AIRY0Zh0/VVxUuQk_c7I/AAAAAAAAAFM/ZIPUN77n1fE/s1600/in.png>
>> leading to:
>>
>>
>> <https://lh3.googleusercontent.com/-7ArW5UY5Lrk/VVxUyibRcSI/AAAAAAAAAFU/pGQi_6vBF3U/s1600/fu.bmp>
>> and the code is: UAZZO51717151.
>> How is this possible? The input images are perfectly clear, so I don't see 
>> the problem. Again, is there a setting that would avoid this?
>>
>> I hope I am missing something obvious for at least one of my problems. I 
>> have to admit that the list of all the possible parameters (which I found 
>> here: http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version) is 
>> hard to master, and since I am a beginner I don't know what to do next.
>> Thanks in advance for your help. I have attached an archive containing all 
>> the images.
>>
>> Regards 
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/ba001838-4465-4bea-ab83-782af58c2c01%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>

