Sorry to interrupt your conversation. I have a similar problem: the .traineddata file I exported from a checkpoint file does not recognize any characters at all, although training showed very good results. From what I understand of your conversation, is this because I used Training From Scratch? Is fine-tuning an existing model all I need to do to get better results? I am also quite confused about why the results with the checkpoint file are so different from those with the .traineddata file, and I would appreciate it if someone could explain the reason.
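The export and comparison steps discussed in this thread can be sketched as below. This is only a sketch under assumptions: the function names and all paths are placeholders, the flags are the ones used in the commands quoted later in this message, and passing a .traineddata file directly to lstmeval --model (without --traineddata) is my understanding of the tool rather than something shown in the thread.

```shell
#!/usr/bin/env bash
# Sketch only: function names and every path passed in are placeholders.

# Evaluate a training checkpoint on a list of .lstmf files.
# $1: checkpoint  $2: starter traineddata used for training  $3: list file
eval_checkpoint() {
  lstmeval --model "$1" --traineddata "$2" --eval_listfile "$3"
}

# Convert a checkpoint into a .traineddata file.
# $1: checkpoint  $2: starter traineddata used for training  $3: output file
export_traineddata() {
  lstmtraining --stop_training \
    --continue_from "$1" \
    --traineddata "$2" \
    --model_output "$3"
}

# Evaluate the exported file on the same list.
# Assumption: lstmeval accepts a .traineddata file through --model.
# $1: exported traineddata  $2: list file
eval_traineddata() {
  lstmeval --model "$1" --eval_listfile "$2"
}

# Example calls (placeholder paths):
#   eval_checkpoint    out/base_checkpoint start/eng.traineddata list.txt
#   export_traineddata out/base_checkpoint start/eng.traineddata out/eng.traineddata
#   eval_traineddata   out/eng.traineddata list.txt
```

Running eval_checkpoint and eval_traineddata on the same list file should show whether the error rate changes during the export itself or only when the tesseract command is run on real images. Since tesseract loads eng by default, an exported file with any other name also needs the -l option.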
For more information about my case, please see my post here:
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/tesseract-ocr/74xMXlYX6T0

Thank you and have a nice day.

On Friday, June 14, 2019 at 7:58:49 PM UTC+9, shree wrote:
>
> See https://github.com/Shreeshrii/tessdata_MICR
>
> I have uploaded my files there.
>
> https://github.com/Shreeshrii/tessdata_MICR/blob/master/MICR.sh
> is the bash script that runs the training.
>
> You can modify as needed. Please note this is for legacy/base tesseract
> --oem 0.
>
> On Fri, Jun 14, 2019 at 1:26 PM ElGato ElMago <[email protected]> wrote:
>
>> Thanks a lot, shree. It seems you know everything.
>>
>> I tried the MICR0.traineddata and the first two mcr.traineddata. The
>> last one was blocked by the browser. Each of the traineddata had mixed
>> results. All of them are getting symbols fairly good but getting spaces
>> randomly and reading some numbers wrong.
>>
>> MICR0 seems the best among them. Did you suggest that you'd be able to
>> update it? It gets triple D very often where there's only one, and so on.
>>
>> Also, I tried to fine tune from MICR0 but I found that I need to change
>> the language-specific.sh. It specifies some parameters for each language.
>> Do you have any guidance for it?
>>
>> On Friday, June 14, 2019 at 1:48:40 AM UTC+9, shree wrote:
>>>
>>> see
>>> http://www.devscope.net/Content/ocrchecks.aspx
>>> https://github.com/BigPino67/Tesseract-MICR-OCR
>>> https://groups.google.com/d/msg/tesseract-ocr/obWI4cz8rXg/6l82hEySgOgJ
>>>
>>> On Mon, Jun 10, 2019 at 11:21 AM ElGato ElMago <[email protected]> wrote:
>>>
>>>> That'll be nice if there's traineddata out there but I didn't find
>>>> any. I see free fonts and commercial OCR software but not traineddata.
>>>> Tessdata repository obviously doesn't have one, either.
>>>>
>>>> On Saturday, June 8, 2019 at 1:52:10 AM UTC+9, shree wrote:
>>>>>
>>>>> Please also search for existing MICR traineddata files.
>>>>>
>>>>> On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago <[email protected]> wrote:
>>>>>
>>>>>> So I did several tests from scratch. In the last attempt, I made a
>>>>>> training text with 4,000 lines in the following format,
>>>>>>
>>>>>> 110004310510< <02 :4002=0181:801= 0008752 <00039 ;0000001000;
>>>>>>
>>>>>> and combined it with eng.digits.training_text in which symbols are
>>>>>> converted to E13B symbols. This makes about 12,000 lines of training
>>>>>> text. It's amazing that this thing generates a good reader out of
>>>>>> nowhere. But then it is not very good. For example:
>>>>>>
>>>>>> <01 :1901=1386:021= 1111001<10001< ;0000090134;
>>>>>>
>>>>>> is a result on the image attached. It's close but the last '<' in
>>>>>> the result text doesn't exist on the image. It's a small failure but it
>>>>>> causes greater trouble in parsing.
>>>>>>
>>>>>> What would you suggest from here to increase accuracy?
>>>>>>
>>>>>> - Increase the number of lines in the training text
>>>>>> - Mix up more variations in the training text
>>>>>> - Increase the number of iterations
>>>>>> - Investigate wrong reads one by one
>>>>>> - Or else?
>>>>>>
>>>>>> Also, I referred to engrestrict*.* and could generate a similar result
>>>>>> with the fine-tuning-from-full method. It seems a bit faster to get to
>>>>>> the same level but it also stops at a 'good' level. I can go with either
>>>>>> way if it takes me to the bright future.
>>>>>>
>>>>>> Regards,
>>>>>> ElMagoElGato
>>>>>>
>>>>>> On Thursday, May 30, 2019 at 3:56:02 PM UTC+9, ElGato ElMago wrote:
>>>>>>>
>>>>>>> Thanks a lot, Shree. I'll look into it.
>>>>>>>
>>>>>>> On Thursday, May 30, 2019 at 2:39:52 PM UTC+9, shree wrote:
>>>>>>>>
>>>>>>>> See https://github.com/Shreeshrii/tessdata_shreetest
>>>>>>>>
>>>>>>>> Look at the files engrestrict*.* and also
>>>>>>>> https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text
>>>>>>>>
>>>>>>>> Create training text of about 100 lines and finetune for 400 lines
>>>>>>>>
>>>>>>>> On Thu, May 30, 2019 at 9:38 AM ElGato ElMago <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I had about 14 lines as attached. How many lines would you
>>>>>>>>> recommend?
>>>>>>>>>
>>>>>>>>> Fine tuning gives a much better result but it tends to pick characters
>>>>>>>>> other than those in E13B, which only has 14 characters, 0 through 9 and
>>>>>>>>> 4 symbols. I thought training from scratch would eliminate such
>>>>>>>>> confusion.
>>>>>>>>>
>>>>>>>>> On Thursday, May 30, 2019 at 10:43:08 AM UTC+9, shree wrote:
>>>>>>>>>>
>>>>>>>>>> For training from scratch a large training text and hundreds of
>>>>>>>>>> thousands of iterations are recommended.
>>>>>>>>>>
>>>>>>>>>> If you are just fine tuning for a font try to follow instructions
>>>>>>>>>> for training for impact, with your font.
>>>>>>>>>>
>>>>>>>>>> On Thu, 30 May 2019, 06:05 ElGato ElMago, <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks, Shree.
>>>>>>>>>>>
>>>>>>>>>>> Yes, I saw the instruction.
>>>>>>>>>>> The steps I took are as follows:
>>>>>>>>>>>
>>>>>>>>>>> Using tesstrain.sh:
>>>>>>>>>>> src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
>>>>>>>>>>>   --noextract_font_properties --langdata_dir ../langdata \
>>>>>>>>>>>   --tessdata_dir ./tessdata \
>>>>>>>>>>>   --fontlist "E13Bnsd" --output_dir ~/tesstutorial/e13beval \
>>>>>>>>>>>   --training_text ../langdata/eng/eng.training_e13b_text
>>>>>>>>>>>
>>>>>>>>>>> Training from scratch:
>>>>>>>>>>> mkdir -p ~/tesstutorial/e13boutput
>>>>>>>>>>> src/training/lstmtraining --debug_interval 100 \
>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>   --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/base --learning_rate 20e-4 \
>>>>>>>>>>>   --train_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>   --max_iterations 5000 &>~/tesstutorial/e13boutput/basetrain.log
>>>>>>>>>>>
>>>>>>>>>>> Test with base_checkpoint:
>>>>>>>>>>> src/training/lstmeval --model ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt
>>>>>>>>>>>
>>>>>>>>>>> Combining output files:
>>>>>>>>>>> src/training/lstmtraining --stop_training \
>>>>>>>>>>>   --continue_from ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/eng.traineddata
>>>>>>>>>>>
>>>>>>>>>>> Test with eng.traineddata:
>>>>>>>>>>> tesseract e13b.png out --tessdata-dir /home/koichi/tesstutorial/e13boutput
>>>>>>>>>>>
>>>>>>>>>>> The training from scratch ended as:
>>>>>>>>>>>
>>>>>>>>>>> At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char train=0%, word train=0%, skip ratio=0%, New best char error = 0 wrote best model:/home/koichi/tesstutorial/e13boutput/base0_561.checkpoint wrote checkpoint.
>>>>>>>>>>>
>>>>>>>>>>> The test with base_checkpoint returns nothing as:
>>>>>>>>>>>
>>>>>>>>>>> At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0
>>>>>>>>>>>
>>>>>>>>>>> The test with eng.traineddata and e13b.png returns out.txt.
>>>>>>>>>>> Both files are attached.
>>>>>>>>>>>
>>>>>>>>>>> Training seems to have worked fine. I don't know how to
>>>>>>>>>>> interpret the test result from base_checkpoint. The generated
>>>>>>>>>>> eng.traineddata obviously doesn't work well. I suspect the choice
>>>>>>>>>>> of --traineddata in combining output files is bad but I have no clue.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>
>>>>>>>>>>> BTW, I referred to your tess4training in the process. It helped
>>>>>>>>>>> a lot.
>>>>>>>>>>>
>>>>>>>>>>> On Wednesday, May 29, 2019 at 7:14:08 PM UTC+9, shree wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> see
>>>>>>>>>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, May 29, 2019 at 3:18 PM ElGato ElMago <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I wish to make trained data for the E13B font.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I read the training tutorial and made a base_checkpoint file
>>>>>>>>>>>>> according to the method in Training From Scratch. Now, how can I
>>>>>>>>>>>>> make trained data from the base_checkpoint file?
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
>>>>>>>>>>>>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>>>>>>>>>>>>> To post to this group, send email to [email protected].
>>>>>>>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>>>>>>>> To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/4848cfa5-ae2b-4be3-a771-686aa0fec702%40googlegroups.com.
>>>>>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> ____________________________________________________________
>>>>>>>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/acbb787c-2e00-419e-b5b1-3daa6df1e1d7%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.

