See https://github.com/Shreeshrii/tessdata_MICR
I have uploaded my files there. https://github.com/Shreeshrii/tessdata_MICR/blob/master/MICR.sh is the bash script that runs the training; you can modify it as needed. Please note that it targets the legacy (base) Tesseract engine, --oem 0.

On Fri, Jun 14, 2019 at 1:26 PM ElGato ElMago <[email protected]> wrote:

> Thanks a lot, shree. It seems you know everything.
>
> I tried the MICR0.traineddata and the first two mcr.traineddata. The last one was blocked by the browser. Each of the traineddata files had mixed results. All of them read the symbols fairly well but insert spaces randomly and read some numbers wrong.
>
> MICR0 seems the best among them. Did you suggest that you'd be able to update it? It very often reads a triple D where there's only one, and so on.
>
> Also, I tried to fine tune from MICR0 but found that I need to change language-specific.sh, which specifies some parameters for each language. Do you have any guidance for it?
>
> On Friday, June 14, 2019 at 1:48:40 UTC+9, shree wrote:
>>
>> see
>> http://www.devscope.net/Content/ocrchecks.aspx
>> https://github.com/BigPino67/Tesseract-MICR-OCR
>> https://groups.google.com/d/msg/tesseract-ocr/obWI4cz8rXg/6l82hEySgOgJ
>>
>> On Mon, Jun 10, 2019 at 11:21 AM ElGato ElMago <[email protected]> wrote:
>>
>>> It would be nice if there were traineddata out there, but I didn't find any. I see free fonts and commercial OCR software but not traineddata. The tessdata repository obviously doesn't have one, either.
>>>
>>> On Saturday, June 8, 2019 at 1:52:10 UTC+9, shree wrote:
>>>>
>>>> Please also search for existing MICR traineddata files.
>>>>
>>>> On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago <[email protected]> wrote:
>>>>
>>>>> So I did several tests from scratch. In the last attempt, I made a training text with 4,000 lines in the following format,
>>>>>
>>>>> 110004310510< <02 :4002=0181:801= 0008752 <00039 ;0000001000;
>>>>>
>>>>> and combined it with eng.digits.training_text in which symbols are converted to E13B symbols. This makes about 12,000 lines of training text. It's amazing that this thing generates a good reader out of nowhere. But then it is not very good. For example:
>>>>>
>>>>> <01 :1901=1386:021= 1111001<10001< ;0000090134;
>>>>>
>>>>> is a result on the image attached. It's close, but the last '<' in the result text doesn't exist on the image. It's a small failure but it causes greater trouble in parsing.
>>>>>
>>>>> What would you suggest from here to increase accuracy?
>>>>>
>>>>> - Increase the number of lines in the training text
>>>>> - Mix up more variations in the training text
>>>>> - Increase the number of iterations
>>>>> - Investigate wrong reads one by one
>>>>> - Or else?
>>>>>
>>>>> Also, I referred to engrestrict*.* and could generate a similar result with the fine-tuning-from-full method. It seems a bit faster to get to the same level, but it also stops at a 'good' level. I can go either way if it takes me to the bright future.
>>>>>
>>>>> Regards,
>>>>> ElMagoElGato
>>>>>
>>>>> On Thursday, May 30, 2019 at 15:56:02 UTC+9, ElGato ElMago wrote:
>>>>>>
>>>>>> Thanks a lot, Shree. I'll look into it.
>>>>>>
>>>>>> On Thursday, May 30, 2019 at 14:39:52 UTC+9, shree wrote:
>>>>>>>
>>>>>>> See https://github.com/Shreeshrii/tessdata_shreetest
>>>>>>>
>>>>>>> Look at the files engrestrict*.* and also
>>>>>>> https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text
>>>>>>>
>>>>>>> Create training text of about 100 lines and finetune for 400 lines
>>>>>>>
>>>>>>> On Thu, May 30, 2019 at 9:38 AM ElGato ElMago <[email protected]> wrote:
>>>>>>>
>>>>>>>> I had about 14 lines as attached. How many lines would you recommend?
>>>>>>>>
>>>>>>>> Fine tuning gives much better results, but it tends to pick characters outside E13B, which has only 14 characters: 0 through 9 and 4 symbols. I thought training from scratch would eliminate such confusion.
>>>>>>>>
>>>>>>>> On Thursday, May 30, 2019 at 10:43:08 UTC+9, shree wrote:
>>>>>>>>>
>>>>>>>>> For training from scratch a large training text and hundreds of thousands of iterations are recommended.
>>>>>>>>>
>>>>>>>>> If you are just fine tuning for a font, try to follow the instructions for training for Impact, with your font.
>>>>>>>>>
>>>>>>>>> On Thu, 30 May 2019, 06:05 ElGato ElMago, <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks, Shree.
>>>>>>>>>>
>>>>>>>>>> Yes, I saw the instructions. The steps I took are as follows:
>>>>>>>>>>
>>>>>>>>>> Using tesstrain.sh:
>>>>>>>>>> src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
>>>>>>>>>>   --noextract_font_properties --langdata_dir ../langdata \
>>>>>>>>>>   --tessdata_dir ./tessdata \
>>>>>>>>>>   --fontlist "E13Bnsd" --output_dir ~/tesstutorial/e13beval \
>>>>>>>>>>   --training_text ../langdata/eng/eng.training_e13b_text
>>>>>>>>>>
>>>>>>>>>> Training from scratch:
>>>>>>>>>> mkdir -p ~/tesstutorial/e13boutput
>>>>>>>>>> src/training/lstmtraining --debug_interval 100 \
>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>   --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/base --learning_rate 20e-4 \
>>>>>>>>>>   --train_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>   --max_iterations 5000 &>~/tesstutorial/e13boutput/basetrain.log
>>>>>>>>>>
>>>>>>>>>> Test with base_checkpoint:
>>>>>>>>>> src/training/lstmeval --model ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt
>>>>>>>>>>
>>>>>>>>>> Combining output files:
>>>>>>>>>> src/training/lstmtraining --stop_training \
>>>>>>>>>>   --continue_from ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/eng.traineddata
>>>>>>>>>>
>>>>>>>>>> Test with eng.traineddata:
>>>>>>>>>> tesseract e13b.png out --tessdata-dir /home/koichi/tesstutorial/e13boutput
>>>>>>>>>>
>>>>>>>>>> The training from scratch ended as:
>>>>>>>>>>
>>>>>>>>>> At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char train=0%, word train=0%, skip ratio=0%, New best char error = 0 wrote best model:/home/koichi/tesstutorial/e13boutput/base0_561.checkpoint wrote checkpoint.
>>>>>>>>>>
>>>>>>>>>> The test with base_checkpoint returns nothing but:
>>>>>>>>>>
>>>>>>>>>> At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0
>>>>>>>>>>
>>>>>>>>>> The test with eng.traineddata and e13b.png returns out.txt. Both files are attached.
>>>>>>>>>>
>>>>>>>>>> Training seems to have worked fine. I don't know how to interpret the test result from base_checkpoint. The generated eng.traineddata obviously doesn't work well. I suspect the choice of --traineddata in combining output files is bad, but I have no clue.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> ElMagoElGato
>>>>>>>>>>
>>>>>>>>>> BTW, I referred to your tess4training in the process. It helped a lot.
>>>>>>>>>>
>>>>>>>>>> On Wednesday, May 29, 2019 at 19:14:08 UTC+9, shree wrote:
>>>>>>>>>>>
>>>>>>>>>>> see
>>>>>>>>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files
>>>>>>>>>>>
>>>>>>>>>>> On Wed, May 29, 2019 at 3:18 PM ElGato ElMago <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I wish to make traineddata for the E13B font.
>>>>>>>>>>>>
>>>>>>>>>>>> I read the training tutorial and made a base_checkpoint file according to the method in Training From Scratch. Now, how can I make traineddata from the base_checkpoint file?
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
>>>>>>>>>>>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>>>>>>>>>>>> To post to this group, send email to [email protected].
>>>>>>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>>>>>>> To view this discussion on the web visit
>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/4848cfa5-ae2b-4be3-a771-686aa0fec702%40googlegroups.com
>>>>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/4848cfa5-ae2b-4be3-a771-686aa0fec702%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> ____________________________________________________________
>>>>>>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
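
On the training text mentioned in the quoted thread (thousands of lines in the format of the sample line): a small generator could produce such a file. This is only a minimal sketch under stated assumptions, not the MICR.sh script from the repository. The function name e13b_training_text and its optional seed argument are made up for illustration, and it assumes that the four E13B symbols are typed as the ASCII stand-ins '<', ':', '=' and ';' and that every line uses the field widths of the quoted sample.

```shell
#!/bin/sh
# Sketch: print random E13B-style lines laid out like the sample in the thread:
#   110004310510< <02 :4002=0181:801= 0008752 <00039 ;0000001000;
# Assumption: '<', ':', '=' and ';' stand in for the four E13B symbols.
# Usage: e13b_training_text [COUNT] [SEED]   (both arguments are hypothetical)
e13b_training_text() {
  awk -v n="${1:-100}" -v seed="${2:-}" '
    # Return a string of k random decimal digits.
    function digits(k,    s) {
      s = ""
      while (k-- > 0) s = s int(rand() * 10)
      return s
    }
    BEGIN {
      # A fixed seed gives repeatable output; otherwise seed from the clock.
      if (seed != "") srand(seed); else srand()
      for (i = 0; i < n; i++)
        printf "%s< <%s :%s=%s:%s= %s <%s ;%s;\n",
          digits(12), digits(2), digits(4), digits(4),
          digits(3), digits(7), digits(5), digits(10)
    }'
}
```

For example, `e13b_training_text 4000 > e13b.training_text` would write 4,000 lines; the output file name is arbitrary. Every line here has identical field widths, so varying them would be one way to try the "mix up more variations" idea from the thread.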

