Re: [tesseract-ocr] Tesseract configuration for alphanumeric strings: mixes up 2, Z, 6 and G

Timothy Korse Mon, 27 Jun 2016 02:54:08 -0700

I'm afraid I misunderstood `upsampling`. I just upsized the ROI. But I'll
definitely look into upsampling!
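
For reference, a minimal sketch of the two resizing variants discussed in this thread. It assumes that "upsizing without sampling" means nearest-neighbour scaling (hard edges) and that "upsampling" means an interpolated resize such as bicubic; the thread does not define either term precisely. The class name `RoiScaler` and the 40x20 test image are made up for illustration. The factor 5 mirrors the 20 px to 100 px character height mentioned below.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of Tesseract or Tess4J.
class RoiScaler {

    /** Scales src by an integer factor using the given interpolation hint. */
    static BufferedImage scale(BufferedImage src, int factor, Object interpolationHint) {
        int w = src.getWidth() * factor;
        int h = src.getHeight() * factor;
        // Assumption: a grayscale result is acceptable for the OCR input.
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = dst.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION, interpolationHint);
            g.drawImage(src, 0, 0, w, h, null);
        } finally {
            g.dispose();
        }
        return dst;
    }

    public static void main(String[] args) {
        // Stand-in for a real ROI with characters about 20 px high.
        BufferedImage roi = new BufferedImage(40, 20, BufferedImage.TYPE_BYTE_GRAY);

        // Hard edges: every source pixel becomes a 5x5 block.
        BufferedImage hard = scale(roi, 5, RenderingHints.VALUE_INTERPOLATION_NEAREST_NEIGHBOR);
        // Smooth edges: new pixels are interpolated from their neighbours.
        BufferedImage smooth = scale(roi, 5, RenderingHints.VALUE_INTERPOLATION_BICUBIC);

        System.out.println(hard.getWidth() + "x" + hard.getHeight());     // prints 200x100
        System.out.println(smooth.getWidth() + "x" + smooth.getHeight()); // prints 200x100
    }
}
```

Both variants produce the same dimensions; only the edge quality differs, so it may be worth running both through the same Tesseract configuration and comparing the results.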


2016-06-27 11:50 GMT+02:00 Allistair <[email protected]>:

> Have you perhaps tried upsizing without sampling? What that will do is
> give you harsher edges on the larger image size which may allow Tesseract
> to fit its classifications better.
>
> On 27 June 2016 at 10:43, Timothy Korse <[email protected]> wrote:
>
>> Yes, these images are actually upsampled. The height of the characters
>> from the input source is about 20 pixels. Now they are about 100 pixels, and
>> I can see the difference between the 2 and the Z, for instance, quite
>> clearly. So I am sure that Tesseract can too.
>>
>> Please let me know if someone needs more information in order to help me
>> out.
>>
>> I really appreciate your help!
>>
>> On Monday, 27 June 2016 at 10:52:58 UTC+2, Allistair C wrote:
>>>
>>> Have you tried the generally useful approach of increasing your image
>>> sizes until it works? I'm not sure whether the samples you posted were the
>>> actual size, but I have read that this problem *can* lessen with larger
>>> image sizes, even artificially upsampled ones.
>>>
>>> On 27 June 2016 at 09:37, Timothy Korse <[email protected]> wrote:
>>>
>>>> Hi Allistair,
>>>>
>>>> Thank you for your response. Yes, I actually tried that, without luck. I
>>>> think unicharambigs is useful when using dictionaries, which I do not use.
>>>> I can't simply replace a 2 with a Z, because it might just as well be a 2.
>>>>
>>>> I tried the following format:
>>>>
>>>> v1
>>>> 1 Z 1 2 x
>>>> 1 2 1 Z x
>>>> 1 G 1 6 x
>>>> 1 6 1 G x
>>>> 1 M 1 H x
>>>> 1 H 1 M x
>>>>
>>>> Where x is, of course, the mode. For this setting I tried 0, 1 and 3.
>>>> Unfortunately, modes other than 0 and 1 are not documented. Looking at
>>>> the Tesseract source code, I thought that 3 might do the trick, but it
>>>> didn't.
>>>>
>>>> Am I doing something wrong?
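
A minimal sketch of generating such a file programmatically, which rules out hand-editing mistakes. It assumes the v1 layout as I read it in the Tesseract 3.x training documentation: a `v1` header line, then tab-separated fields (source length, source characters, target length, target characters, type), with the file ending in a newline. The example quoted above uses spaces, which may just be how the mail client rendered tabs. The class name `UnicharAmbigs` is hypothetical, and type 0 is only one of the values tried in the thread.

```java
// Hypothetical helper, not part of Tesseract or Tess4J.
class UnicharAmbigs {

    /** Builds v1 content; each pair is emitted in both directions with the given type. */
    static String build(String[][] pairs, int type) {
        StringBuilder sb = new StringBuilder("v1\n");
        for (String[] pair : pairs) {
            sb.append(line(pair[0], pair[1], type));
            sb.append(line(pair[1], pair[0], type));
        }
        return sb.toString();
    }

    /** One entry. Assumes fields are tab-separated and characters within a field space-separated. */
    static String line(String source, String target, int type) {
        return source.length() + "\t" + spaced(source) + "\t"
                + target.length() + "\t" + spaced(target) + "\t"
                + type + "\n";
    }

    /** "rn" -> "r n"; assumes characters in the Basic Multilingual Plane. */
    static String spaced(String s) {
        return String.join(" ", s.split(""));
    }

    public static void main(String[] args) {
        String[][] pairs = { {"Z", "2"}, {"G", "6"}, {"M", "H"} };
        // Write this text as UTF-8 to the unicharambigs file used during training.
        System.out.print(build(pairs, 0));
    }
}
```

As far as I understand the documentation, type 0 entries act as hints that matter when a substitution turns a non-dictionary word into a dictionary word, which is consistent with the observation above that the file has little effect when no dictionaries are loaded.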
>>>>
>>>>
>>>> On Sunday, 26 June 2016 at 22:49:09 UTC+2, Allistair C wrote:
>>>>>
>>>>> Did you ever look at incorporating the unicharambigs file into your
>>>>> training?
>>>>>
>>>>>
>>>>> http://www.resolveradiologic.com/blog/2013/01/16/more-on-training-tesseract/
>>>>>
>>>>> On 26 June 2016 at 15:09, Timothy Korse <[email protected]> wrote:
>>>>>
>>>>>> I'm trying to configure Tesseract to recognize *alphanumeric
>>>>>> strings* that are 10 characters long (all uppercase).
>>>>>>
>>>>>>
>>>>>> This works pretty well, except that it seems to mix up the following
>>>>>> characters quite often:
>>>>>>
>>>>>>    - 2 and Z
>>>>>>    - 6 and G
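
A minimal sketch of checking a recognized string against the constraints stated above: exactly 10 uppercase alphanumeric characters, with the positions holding one of the confusable characters (2, Z, 6, G) flagged for closer inspection. The class name `OcrResultCheck` and its methods are hypothetical; this does not change what Tesseract recognizes, it only helps measure how often the problem occurs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tesseract or Tess4J.
class OcrResultCheck {

    // Exactly 10 uppercase letters or digits, as described in the post.
    private static final Pattern EXPECTED = Pattern.compile("[A-Z0-9]{10}");
    // The characters reported as being mixed up.
    private static final String AMBIGUOUS = "2Z6G";

    /** True if the trimmed text has the expected shape. */
    static boolean hasExpectedShape(String text) {
        return EXPECTED.matcher(text.trim()).matches();
    }

    /** Zero-based positions in the trimmed text that hold a confusable character. */
    static List<Integer> ambiguousPositions(String text) {
        String trimmed = text.trim();
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < trimmed.length(); i++) {
            if (AMBIGUOUS.indexOf(trimmed.charAt(i)) >= 0) {
                positions.add(i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        // Made-up sample; OCR output may carry trailing whitespace, hence the trim.
        String recognized = "X2JR6XK6VG\n";
        System.out.println(hasExpectedShape(recognized));   // prints true
        System.out.println(ambiguousPositions(recognized)); // prints [1, 4, 7, 9]
    }
}
```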
>>>>>>
>>>>>>
>>>>>> Examples of images are:
>>>>>>
>>>>>>
>>>>>> <https://lh3.googleusercontent.com/-20dr7dBmT9c/V2_eMKE7TtI/AAAAAAAAAKw/ENcZMZogPws1elcz7BV0WRsE4B8M22IWgCKgB/s1600/X2JR6XK6VGMQP2L5.jpg>
>>>>>>
>>>>>>
>>>>>> <https://lh3.googleusercontent.com/-MysZA6TlqI0/V2_eQyVCOzI/AAAAAAAAAKw/LgUKmhGzsvcfod1bHLEIRfBtKO7-dCodQCKgB/s1600/X2LHV6KHPJ5TFTDK.jpg>
>>>>>>
>>>>>>
>>>>>> <https://lh3.googleusercontent.com/-s6QuiuY_GK8/V2_eUtSCvBI/AAAAAAAAAKw/nM-vnz9SCvQ2OWPuwytKJirJMCS4kIGqgCKgB/s1600/X3K9V5XKQV3Z5QT5.jpg>
>>>>>>
>>>>>>
>>>>>> <https://lh3.googleusercontent.com/-QVLjGd9Lcik/V2_eYvEDsJI/AAAAAAAAAKw/c_s5sYdtE0AbFZX8OqNiEAAvrnooYD6pwCKgB/s1600/X3P92TR7Q93F2G9F.jpg>
>>>>>>
>>>>>>
>>>>>> <https://lh3.googleusercontent.com/-wfH5bpBqC5E/V2_egk0Sj3I/AAAAAAAAAKw/-da1JPAT_hUF5CEn6c9FkkZqANu3TDtngCKgB/s1600/X4NT7CFMH2GR7HXZ.jpg>
>>>>>>
>>>>>>
>>>>>> <https://lh3.googleusercontent.com/-KHssFqw1XyE/V2_emEmR4yI/AAAAAAAAAK0/kftsbb0E65os-rdIlkHxpqT8Ip7gkWWbwCKgB/s1600/X4QGN9XQ3KP69YZX.jpg>
>>>>>>
>>>>>> These are preprocessed. I think the preprocessing was done
>>>>>> successfully; I'd be glad to hear otherwise.
>>>>>>
>>>>>>
>>>>>> This is how I run Tesseract:
>>>>>>
>>>>>>
>>>>>> tesseract = new Tesseract();
>>>>>> // Use only the legacy Tesseract engine
>>>>>> tesseract.setOcrEngineMode(TessAPI.TessOcrEngineMode.OEM_TESSERACT_ONLY);
>>>>>> // Page segmentation mode 7: treat the image as a single text line
>>>>>> tesseract.setPageSegMode(7);
>>>>>> // Do not load any of the dictionaries (DAWGs)
>>>>>> tesseract.setTessVariable("load_system_dawg", "0");
>>>>>> tesseract.setTessVariable("load_freq_dawg", "0");
>>>>>> tesseract.setTessVariable("load_punc_dawg", "0");
>>>>>> tesseract.setTessVariable("load_number_dawg", "0");
>>>>>> tesseract.setTessVariable("load_unambig_dawg", "0");
>>>>>> tesseract.setTessVariable("load_bigram_dawg", "0");
>>>>>> tesseract.setTessVariable("load_fixed_length_dawgs", "0");
>>>>>>
>>>>>> // Turn off adaptive learning and the adaptive matcher
>>>>>> tesseract.setTessVariable("classify_enable_learning", "0");
>>>>>> tesseract.setTessVariable("classify_enable_adaptive_matcher", "0");
>>>>>>
>>>>>> // Set the dictionary-related segmentation penalties to zero
>>>>>> tesseract.setTessVariable("segment_penalty_garbage", "0");
>>>>>> tesseract.setTessVariable("segment_penalty_dict_nonword", "0");
>>>>>> tesseract.setTessVariable("segment_penalty_dict_frequent_word", "0");
>>>>>> tesseract.setTessVariable("segment_penalty_dict_case_ok", "0");
>>>>>> tesseract.setTessVariable("segment_penalty_dict_case_bad", "0");
>>>>>>
>>>>>>
>>>>>> *Note that this is Java code, but my question is not limited to Java.*
>>>>>>
>>>>>> I am not really experienced with Tesseract and find the
>>>>>> documentation very unclear. I hope someone else can help me out.
>>>>>> ------------------------------
>>>>>>
>>>>>> To give some more context:
>>>>>>
>>>>>>
>>>>>> *How do I train Tesseract?*
>>>>>>
>>>>>>
>>>>>> I train Tesseract by combining over 200 images into one image. Every
>>>>>> image contains 10 alphanumeric characters. Also, I am sure the box file
>>>>>> is correct.
>>>>>>
>>>>>>
>>>>>> I build the final language by executing the following batch script:
>>>>>>
>>>>>> tesseract qwe.combined.jpg qwe.combined.box nobatch box.train
>>>>>>
>>>>>> echo combined 1 0 0 0 0 > font_properties
>>>>>>
>>>>>> unicharset_extractor qwe.combined.box
>>>>>>
>>>>>> shapeclustering -F font_properties -U unicharset qwe.combined.box.tr
>>>>>>
>>>>>> mftraining -F font_properties -U unicharset -O qwe.unicharset qwe.combined.box.tr
>>>>>>
>>>>>> cntraining qwe.combined.box.tr
>>>>>>
>>>>>> copy inttemp qwe.inttemp
>>>>>> copy normproto qwe.normproto
>>>>>> copy pffmtable qwe.pffmtable
>>>>>> copy shapetable qwe.shapetable
>>>>>>
>>>>>> combine_tessdata qwe.
>>>>>>
>>>>>> ------------------------------
>>>>>>
>>>>>> How can I make Tesseract discriminate better between the 2, Z, 6 and
>>>>>> G?
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to [email protected].
>>>>>> To post to this group, send email to [email protected].
>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/bba1f122-6bb2-43f6-9a7d-9daa75f5323e%40googlegroups.com
>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/bba1f122-6bb2-43f6-9a7d-9daa75f5323e%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>> .
>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>

