I'm afraid I misunderstood `upsampling`. I just upsized the ROI. But I'll definitely look into upsampling!
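
For reference, here is a minimal sketch of the two resizing variants being discussed. It assumes that "upsizing without sampling" means nearest-neighbour pixel replication (hard edges) and that "upsampling" means interpolated resizing (smooth edges); that reading is my assumption, not something confirmed in the thread. The class name `RoiScaler` and its method names are hypothetical, and only standard `java.awt` calls are used.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper: enlarges a region of interest before it is handed to Tesseract.
final class RoiScaler {
    private RoiScaler() {}

    // Scales src by an integer factor using the given interpolation hint.
    static BufferedImage scale(BufferedImage src, int factor, Object interpolationHint) {
        int w = src.getWidth() * factor;
        int h = src.getHeight() * factor;
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION, interpolationHint);
            g.drawImage(src, 0, 0, w, h, null);
        } finally {
            g.dispose();
        }
        return dst;
    }

    // Assumed meaning of "upsizing without sampling": every source pixel
    // is replicated, so no intermediate grey values are introduced.
    static BufferedImage upsizeHardEdges(BufferedImage src, int factor) {
        return scale(src, factor, RenderingHints.VALUE_INTERPOLATION_NEAREST_NEIGHBOR);
    }

    // Assumed meaning of "upsampling": bicubic interpolation, which
    // produces smoother edges between dark and light pixels.
    static BufferedImage upsampleSmooth(BufferedImage src, int factor) {
        return scale(src, factor, RenderingHints.VALUE_INTERPOLATION_BICUBIC);
    }
}
```

Going from characters of about 20 pixels to about 100 pixels corresponds to a factor of 5, so either `RoiScaler.upsizeHardEdges(roi, 5)` or `RoiScaler.upsampleSmooth(roi, 5)` could be applied to the ROI before recognition. Which of the two helps with the 2/Z and 6/G confusion would have to be tested.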

2016-06-27 11:50 GMT+02:00 Allistair <[email protected]>:

> Have you perhaps tried upsizing without sampling? That will give you
> harsher edges at the larger image size, which may allow Tesseract to fit
> its classifications better.
>
> On 27 June 2016 at 10:43, Timothy Korse <[email protected]> wrote:
>
>> Yes, these images are actually upsampled. The height of the characters
>> in the input source is about 20 pixels. Now they are about 100 pixels.
>> I can see the difference between, for instance, the 2 and the Z quite
>> clearly, so I am sure that Tesseract can too.
>>
>> Please let me know if someone needs more information in order to help
>> me out.
>>
>> I really appreciate your help!
>>
>> On Monday, 27 June 2016 at 10:52:58 UTC+2, Allistair C wrote:
>>>
>>> Have you tried the generally useful approach of increasing your image
>>> sizes until it works? I'm not sure whether the samples you posted were
>>> the actual size, but I have read that this problem *can* lessen with
>>> larger image sizes, even artificially upsampled ones.
>>>
>>> On 27 June 2016 at 09:37, Timothy Korse <[email protected]> wrote:
>>>
>>>> Hi Alistair,
>>>>
>>>> Thank you for your response. Yes, I actually tried that, without luck.
>>>> I think unicharambigs is useful when using dictionaries, which I do
>>>> not use. I simply can't replace a 2 with a Z, because it might really
>>>> be a 2.
>>>>
>>>> I tried the following format:
>>>>
>>>> v1
>>>> 1 Z 1 2 x
>>>> 1 2 1 Z x
>>>> 1 G 1 6 x
>>>> 1 6 1 G x
>>>> 1 M 1 H x
>>>> 1 H 1 M x
>>>>
>>>> Where x is of course the mode. For this setting I tried 0, 1 and 3.
>>>> Unfortunately, modes other than 0 and 1 are not documented. From
>>>> looking at the Tesseract source code I thought that 3 might do the
>>>> trick, but it didn't.
>>>>
>>>> Am I doing something wrong?
>>>>
>>>> On Sunday, 26 June 2016 at 22:49:09 UTC+2, Allistair C wrote:
>>>>>
>>>>> Did you ever look at incorporating the unicharambigs file into your
>>>>> training?
>>>>>
>>>>> http://www.resolveradiologic.com/blog/2013/01/16/more-on-training-tesseract/
>>>>>
>>>>> On 26 June 2016 at 15:09, Timothy Korse <[email protected]> wrote:
>>>>>
>>>>>> I'm trying to configure Tesseract to recognize *alphanumeric
>>>>>> strings* that are 10 characters long (all uppercase).
>>>>>>
>>>>>> This works pretty well, except that it seems to mix up the
>>>>>> following characters quite often:
>>>>>>
>>>>>> - 2 and Z
>>>>>> - 6 and G
>>>>>>
>>>>>> Examples of images are:
>>>>>>
>>>>>> <https://lh3.googleusercontent.com/-20dr7dBmT9c/V2_eMKE7TtI/AAAAAAAAAKw/ENcZMZogPws1elcz7BV0WRsE4B8M22IWgCKgB/s1600/X2JR6XK6VGMQP2L5.jpg>
>>>>>> <https://lh3.googleusercontent.com/-MysZA6TlqI0/V2_eQyVCOzI/AAAAAAAAAKw/LgUKmhGzsvcfod1bHLEIRfBtKO7-dCodQCKgB/s1600/X2LHV6KHPJ5TFTDK.jpg>
>>>>>> <https://lh3.googleusercontent.com/-s6QuiuY_GK8/V2_eUtSCvBI/AAAAAAAAAKw/nM-vnz9SCvQ2OWPuwytKJirJMCS4kIGqgCKgB/s1600/X3K9V5XKQV3Z5QT5.jpg>
>>>>>> <https://lh3.googleusercontent.com/-QVLjGd9Lcik/V2_eYvEDsJI/AAAAAAAAAKw/c_s5sYdtE0AbFZX8OqNiEAAvrnooYD6pwCKgB/s1600/X3P92TR7Q93F2G9F.jpg>
>>>>>> <https://lh3.googleusercontent.com/-wfH5bpBqC5E/V2_egk0Sj3I/AAAAAAAAAKw/-da1JPAT_hUF5CEn6c9FkkZqANu3TDtngCKgB/s1600/X4NT7CFMH2GR7HXZ.jpg>
>>>>>> <https://lh3.googleusercontent.com/-KHssFqw1XyE/V2_emEmR4yI/AAAAAAAAAK0/kftsbb0E65os-rdIlkHxpqT8Ip7gkWWbwCKgB/s1600/X4QGN9XQ3KP69YZX.jpg>
>>>>>>
>>>>>> These are preprocessed. I think the preprocessing was successful,
>>>>>> but I'll be glad to hear otherwise.
>>>>>>
>>>>>> This is how I run Tesseract:
>>>>>>
>>>>>> tesseract = new Tesseract();
>>>>>> tesseract.setOcrEngineMode(TessAPI.TessOcrEngineMode.OEM_TESSERACT_ONLY);
>>>>>> tesseract.setPageSegMode(7);
>>>>>> tesseract.setTessVariable("load_system_dawg", "0");
>>>>>> tesseract.setTessVariable("load_freq_dawg", "0");
>>>>>> tesseract.setTessVariable("load_punc_dawg", "0");
>>>>>> tesseract.setTessVariable("load_number_dawg", "0");
>>>>>> tesseract.setTessVariable("load_unambig_dawg", "0");
>>>>>> tesseract.setTessVariable("load_bigram_dawg", "0");
>>>>>> tesseract.setTessVariable("load_fixed_length_dawgs", "0");
>>>>>>
>>>>>> tesseract.setTessVariable("classify_enable_learning", "0");
>>>>>> tesseract.setTessVariable("classify_enable_adaptive_matcher", "0");
>>>>>>
>>>>>> tesseract.setTessVariable("segment_penalty_garbage", "0");
>>>>>> tesseract.setTessVariable("segment_penalty_dict_nonword", "0");
>>>>>> tesseract.setTessVariable("segment_penalty_dict_frequent_word", "0");
>>>>>> tesseract.setTessVariable("segment_penalty_dict_case_ok", "0");
>>>>>> tesseract.setTessVariable("segment_penalty_dict_case_bad", "0");
>>>>>>
>>>>>> *Note that this is Java code, but my question is not limited to
>>>>>> Java.*
>>>>>>
>>>>>> I am not very experienced with Tesseract and find the documentation
>>>>>> very unclear. I hope someone can help me out.
>>>>>>
>>>>>> ------------------------------
>>>>>>
>>>>>> To give some more context:
>>>>>>
>>>>>> *How do I train Tesseract?*
>>>>>>
>>>>>> I train Tesseract by combining over 200 images into one image.
>>>>>> Every image contains 10 alphanumeric characters. Also, I am sure
>>>>>> the box file is correct.
>>>>>>
>>>>>> I build the final language by executing the following batch script:
>>>>>>
>>>>>> tesseract qwe.combined.jpg qwe.combined.box nobatch box.train
>>>>>>
>>>>>> echo combined 1 0 0 0 0 > font_properties
>>>>>>
>>>>>> unicharset_extractor qwe.combined.box
>>>>>>
>>>>>> shapeclustering -F font_properties -U unicharset qwe.combined.box.tr
>>>>>>
>>>>>> mftraining -F font_properties -U unicharset -O qwe.unicharset qwe.combined.box.tr
>>>>>>
>>>>>> cntraining qwe.combined.box.tr
>>>>>>
>>>>>> copy inttemp qwe.inttemp
>>>>>> copy normproto qwe.normproto
>>>>>> copy pffmtable qwe.pffmtable
>>>>>> copy shapetable qwe.shapetable
>>>>>>
>>>>>> combine_tessdata qwe.
>>>>>>
>>>>>> ------------------------------
>>>>>>
>>>>>> How can I make Tesseract discriminate better between the 2, Z, 6
>>>>>> and G?

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG7i5K2c-DiWcTTwzdSBVmXhWiwB9SXVSHxKex8cYBn6MSCf6w%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

