Re: [tesseract-ocr] Tesseract configuration for alphanumeric strings: mixes up 2, Z, 6 and G

Allistair Mon, 27 Jun 2016 02:51:13 -0700

Have you perhaps tried upsizing without resampling, that is, without any
smoothing interpolation? That gives you harsher edges at the larger image
size, which may allow Tesseract to fit its classifications better.
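
In case it helps, here is a minimal sketch of what I mean, assuming that
"without sampling" amounts to plain pixel replication (nearest-neighbour
scaling by a whole-number factor). The class name NearestNeighborUpscale and
the method upscale are made up for illustration; they are not part of
Tesseract or of any Java wrapper for it.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of Tesseract or any wrapper library.
final class NearestNeighborUpscale {

    /**
     * Enlarges src by an integer factor by repeating every source pixel
     * factor x factor times. No interpolation is applied, so edges stay hard.
     */
    static BufferedImage upscale(BufferedImage src, int factor) {
        if (factor < 1) {
            throw new IllegalArgumentException("factor must be >= 1");
        }
        int width = src.getWidth() * factor;
        int height = src.getHeight() * factor;
        BufferedImage out = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                // Integer division maps each target pixel back to its source pixel.
                out.setRGB(x, y, src.getRGB(x / factor, y / factor));
            }
        }
        return out;
    }
}
```

For characters that are about 20 pixels high, upscale(image, 5) would bring
them to about 100 pixels. Whether the harder edges actually help recognition
is something you would have to test on your own images.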


On 27 June 2016 at 10:43, Timothy Korse <[email protected]> wrote:

> Yes, these images are actually upsampled. The height of the characters in
> the input source is about 20 pixels; now they are about 100 pixels. I can
> clearly see the difference between the 2 and the Z, for instance, so I
> am sure that Tesseract can too.
>
> Please let me know if someone needs more information in order to help me
> out.
>
> I really appreciate your help!
>
> On Monday, 27 June 2016 at 10:52:58 UTC+2, Allistair C wrote:
>>
>> Have you tried the generally useful approach of increasing your image
>> sizes until it works? I am not sure whether the samples you posted were the
>> actual size, but in the past I have read that this problem *can* lessen with
>> larger image sizes, even with artificially upsampled images.
>>
>> On 27 June 2016 at 09:37, Timothy Korse <[email protected]> wrote:
>>
>>> Hi Allistair,
>>>
>>> Thank you for your response. Yes, I actually tried that, without luck. I
>>> think unicharambigs is useful when using dictionaries, which I do not use.
>>> I can't simply replace a 2 with a Z, because it might really be a 2.
>>>
>>> I tried the following format:
>>>
>>> v1
>>> 1 Z 1 2 x
>>> 1 2 1 Z x
>>> 1 G 1 6 x
>>> 1 6 1 G x
>>> 1 M 1 H x
>>> 1 H 1 M x
>>>
>>> Where x is of course the mode. For this setting I tried 0, 1 and 3.
>>> Unfortunately, modes other than 0 and 1 are not documented. From looking
>>> at the Tesseract source code, I thought that 3 might do the trick, but it
>>> didn't.
>>>
>>> Am I doing something wrong?
>>>
>>>
>>> On Sunday, 26 June 2016 at 22:49:09 UTC+2, Allistair C wrote:
>>>>
>>>> Did you ever look at incorporating the unicharambigs file into your
>>>> training?
>>>>
>>>>
>>>> http://www.resolveradiologic.com/blog/2013/01/16/more-on-training-tesseract/
>>>>
>>>> On 26 June 2016 at 15:09, Timothy Korse <[email protected]> wrote:
>>>>
>>>>> I'm trying to configure Tesseract to recognize *alphanumeric
>>>>> strings* that are 10 characters long (all uppercase).
>>>>>
>>>>>
>>>>> This works pretty well, except that it seems to mix up the following
>>>>> characters quite often:
>>>>>
>>>>>    - 2 and Z
>>>>>    - 6 and G
>>>>>
>>>>>
>>>>> Examples of images are:
>>>>>
>>>>>
>>>>> <https://lh3.googleusercontent.com/-20dr7dBmT9c/V2_eMKE7TtI/AAAAAAAAAKw/ENcZMZogPws1elcz7BV0WRsE4B8M22IWgCKgB/s1600/X2JR6XK6VGMQP2L5.jpg>
>>>>>
>>>>>
>>>>> <https://lh3.googleusercontent.com/-MysZA6TlqI0/V2_eQyVCOzI/AAAAAAAAAKw/LgUKmhGzsvcfod1bHLEIRfBtKO7-dCodQCKgB/s1600/X2LHV6KHPJ5TFTDK.jpg>
>>>>>
>>>>>
>>>>> <https://lh3.googleusercontent.com/-s6QuiuY_GK8/V2_eUtSCvBI/AAAAAAAAAKw/nM-vnz9SCvQ2OWPuwytKJirJMCS4kIGqgCKgB/s1600/X3K9V5XKQV3Z5QT5.jpg>
>>>>>
>>>>>
>>>>> <https://lh3.googleusercontent.com/-QVLjGd9Lcik/V2_eYvEDsJI/AAAAAAAAAKw/c_s5sYdtE0AbFZX8OqNiEAAvrnooYD6pwCKgB/s1600/X3P92TR7Q93F2G9F.jpg>
>>>>>
>>>>>
>>>>> <https://lh3.googleusercontent.com/-wfH5bpBqC5E/V2_egk0Sj3I/AAAAAAAAAKw/-da1JPAT_hUF5CEn6c9FkkZqANu3TDtngCKgB/s1600/X4NT7CFMH2GR7HXZ.jpg>
>>>>>
>>>>>
>>>>> <https://lh3.googleusercontent.com/-KHssFqw1XyE/V2_emEmR4yI/AAAAAAAAAK0/kftsbb0E65os-rdIlkHxpqT8Ip7gkWWbwCKgB/s1600/X4QGN9XQ3KP69YZX.jpg>
>>>>>
>>>>> These are preprocessed. I think the preprocessing was done successfully,
>>>>> but I'd be glad to hear otherwise.
>>>>>
>>>>>
>>>>> This is how I run Tesseract:
>>>>>
>>>>>
>>>>> tesseract = new Tesseract();
>>>>> // Use only the original Tesseract engine
>>>>> tesseract.setOcrEngineMode(TessAPI.TessOcrEngineMode.OEM_TESSERACT_ONLY);
>>>>> // Page segmentation mode 7: treat the image as a single text line
>>>>> tesseract.setPageSegMode(7);
>>>>> // Do not load any of the dictionaries (DAWGs)
>>>>> tesseract.setTessVariable("load_system_dawg", "0");
>>>>> tesseract.setTessVariable("load_freq_dawg", "0");
>>>>> tesseract.setTessVariable("load_punc_dawg", "0");
>>>>> tesseract.setTessVariable("load_number_dawg", "0");
>>>>> tesseract.setTessVariable("load_unambig_dawg", "0");
>>>>> tesseract.setTessVariable("load_bigram_dawg", "0");
>>>>> tesseract.setTessVariable("load_fixed_length_dawgs", "0");
>>>>>
>>>>> // Turn off adaptive learning and the adaptive matcher
>>>>> tesseract.setTessVariable("classify_enable_learning", "0");
>>>>> tesseract.setTessVariable("classify_enable_adaptive_matcher", "0");
>>>>>
>>>>> // Set the dictionary-related segmentation penalties to zero
>>>>> tesseract.setTessVariable("segment_penalty_garbage", "0");
>>>>> tesseract.setTessVariable("segment_penalty_dict_nonword", "0");
>>>>> tesseract.setTessVariable("segment_penalty_dict_frequent_word", "0");
>>>>> tesseract.setTessVariable("segment_penalty_dict_case_ok", "0");
>>>>> tesseract.setTessVariable("segment_penalty_dict_case_bad", "0");
>>>>>
>>>>>
>>>>> *Note that this is Java code, but my question is not limited to Java.*
>>>>>
>>>>> I am not very experienced with Tesseract and I find the
>>>>> documentation very unclear. I hope someone can help me out.
>>>>> ------------------------------
>>>>>
>>>>> To give some more context:
>>>>>
>>>>>
>>>>> *How do I train Tesseract?*
>>>>>
>>>>>
>>>>> I train Tesseract by combining over 200 images into one image. Every
>>>>> image contains 10 alphanumeric characters. Also, I am sure the box file is
>>>>> correct.
>>>>>
>>>>>
>>>>> I build the final language by executing the following batch script:
>>>>>
>>>>> tesseract qwe.combined.jpg qwe.combined.box nobatch box.train
>>>>>
>>>>> echo combined 1 0 0 0 0 > font_properties
>>>>>
>>>>> unicharset_extractor qwe.combined.box
>>>>>
>>>>> shapeclustering -F font_properties -U unicharset qwe.combined.box.tr
>>>>>
>>>>> mftraining -F font_properties -U unicharset -O qwe.unicharset qwe.combined.box.tr
>>>>>
>>>>> cntraining qwe.combined.box.tr
>>>>>
>>>>> copy inttemp qwe.inttemp
>>>>> copy normproto qwe.normproto
>>>>> copy pffmtable qwe.pffmtable
>>>>> copy shapetable qwe.shapetable
>>>>>
>>>>> combine_tessdata qwe.
>>>>>
>>>>> ------------------------------
>>>>>
>>>>> How can I make Tesseract discriminate better between the 2, Z, 6 and G?
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To post to this group, send email to [email protected].
>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/bba1f122-6bb2-43f6-9a7d-9daa75f5323e%40googlegroups.com
>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/bba1f122-6bb2-43f6-9a7d-9daa75f5323e%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>>
>>
>

