Phantom characters here for me too: https://github.com/tesseract-ocr/tesseract/issues/1778
Are you using 4.1? Bounding boxes were fixed in 4.1, so maybe this was improved as well.

I wrote some code that uses the symbol iterator to discard symbols that are clearly duplicates: too small, overlapping, etc. It was not easy to make it work decently, though, and it is not 100% reliable: it produces both false negatives and false positives. I cannot share the code, and it is quite ugly anyway.

Here is another MRZ model with training data:
https://github.com/DoubangoTelecom/tesseractMRZ

Lorenzo

On Wed, 17 Jul 2019 at 11:26, Claudiu <[email protected]> wrote:

> I'm getting the "phantom character" issue as well, using the OCRB that
> Shree trained on MRZ lines. For example, for a 0 it will sometimes add both
> a 0 and an O to the output, thus outputting 45 characters in total instead
> of 44. I haven't looked at the bounding box output yet, but I suspect a
> phantom thin character is added somewhere that I can discard, or maybe two
> characters will have the same bounding box. If anyone else has fixed this
> issue further up (e.g. so the output doesn't contain the phantom characters
> in the first place), I'd be interested.
>
> On Wed, Jul 17, 2019 at 10:01 AM ElGato ElMago <[email protected]> wrote:
>
>> Hi,
>>
>> I'll go back to more training later. Before doing so, I'd like to
>> investigate the results a little. The hocr and lstmbox options give some
>> details of character positions. The results show positions that
>> correspond perfectly to the letters in the image. But the text output
>> contains a character that obviously does not exist.
>>
>> Then I found a config file, 'lstmdebug', that generates far more
>> information. I hope it explains what happened with each character. I have
>> yet to read the debug output, but I'd appreciate it if someone could tell
>> me how to read it, because it's really complex.
>>
>> Regards,
>> ElMagoElGato
>>
>> On Friday, June 14, 2019 at 19:58:49 UTC+9, shree wrote:
>>
>>> See https://github.com/Shreeshrii/tessdata_MICR
>>>
>>> I have uploaded my files there.
>>>
>>> https://github.com/Shreeshrii/tessdata_MICR/blob/master/MICR.sh
>>> is the bash script that runs the training.
>>>
>>> You can modify it as needed. Please note this is for legacy/base
>>> tesseract --oem 0.
>>>
>>> On Fri, Jun 14, 2019 at 1:26 PM ElGato ElMago <[email protected]> wrote:
>>>
>>>> Thanks a lot, shree. It seems you know everything.
>>>>
>>>> I tried MICR0.traineddata and the first two mcr.traineddata files. The
>>>> last one was blocked by the browser. Each traineddata gave mixed
>>>> results. All of them read the symbols fairly well but insert spaces
>>>> randomly and read some numbers wrong.
>>>>
>>>> MICR0 seems the best among them. Did you suggest that you'd be able to
>>>> update it? It very often gets a triple D where there's only one, and so
>>>> on.
>>>>
>>>> Also, I tried to fine-tune from MICR0, but I found that I need to change
>>>> language-specific.sh. It specifies some parameters for each language.
>>>> Do you have any guidance for it?
>>>>
>>>> On Friday, June 14, 2019 at 1:48:40 UTC+9, shree wrote:
>>>>>
>>>>> See
>>>>> http://www.devscope.net/Content/ocrchecks.aspx
>>>>> https://github.com/BigPino67/Tesseract-MICR-OCR
>>>>> https://groups.google.com/d/msg/tesseract-ocr/obWI4cz8rXg/6l82hEySgOgJ
>>>>>
>>>>> On Mon, Jun 10, 2019 at 11:21 AM ElGato ElMago <[email protected]> wrote:
>>>>>
>>>>>> It would be nice if there were traineddata out there, but I didn't
>>>>>> find any. I see free fonts and commercial OCR software but no
>>>>>> traineddata. The tessdata repository obviously doesn't have one,
>>>>>> either.
>>>>>>
>>>>>> On Saturday, June 8, 2019 at 1:52:10 UTC+9, shree wrote:
>>>>>>>
>>>>>>> Please also search for existing MICR traineddata files.
>>>>>>>
>>>>>>> On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago <[email protected]> wrote:
>>>>>>>
>>>>>>>> So I did several tests from scratch.
>>>>>>>> In the last attempt, I made a training text with 4,000 lines in
>>>>>>>> the following format,
>>>>>>>>
>>>>>>>> 110004310510< <02 :4002=0181:801= 0008752 <00039 ;0000001000;
>>>>>>>>
>>>>>>>> and combined it with eng.digits.training_text, in which symbols
>>>>>>>> are converted to E13B symbols. This makes about 12,000 lines of
>>>>>>>> training text. It's amazing that this thing generates a good
>>>>>>>> reader out of nowhere. But then it is not very good. For example:
>>>>>>>>
>>>>>>>> <01 :1901=1386:021= 1111001<10001< ;0000090134;
>>>>>>>>
>>>>>>>> is the result for the attached image. It's close, but the last '<'
>>>>>>>> in the result text doesn't exist in the image. It's a small
>>>>>>>> failure, but it causes bigger trouble in parsing.
>>>>>>>>
>>>>>>>> What would you suggest from here to increase accuracy?
>>>>>>>>
>>>>>>>> - Increase the number of lines in the training text
>>>>>>>> - Mix more variations into the training text
>>>>>>>> - Increase the number of iterations
>>>>>>>> - Investigate wrong reads one by one
>>>>>>>> - Or something else?
>>>>>>>>
>>>>>>>> Also, I referred to engrestrict*.* and could generate a similar
>>>>>>>> result with the fine-tuning-from-full method. It seems a bit
>>>>>>>> faster to get to the same level, but it also stops at a 'good'
>>>>>>>> level. I can go either way if it takes me to a bright future.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> ElMagoElGato
>>>>>>>>
>>>>>>>> On Thursday, May 30, 2019 at 15:56:02 UTC+9, ElGato ElMago wrote:
>>>>>>>>>
>>>>>>>>> Thanks a lot, Shree. I'll look into it.
>>>>>>>>>
>>>>>>>>> On Thursday, May 30, 2019 at 14:39:52 UTC+9, shree wrote:
>>>>>>>>>>
>>>>>>>>>> See https://github.com/Shreeshrii/tessdata_shreetest
>>>>>>>>>>
>>>>>>>>>> Look at the files engrestrict*.* and also
>>>>>>>>>> https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text
>>>>>>>>>>
>>>>>>>>>> Create training text of about 100 lines and finetune for 400
>>>>>>>>>> lines
>>>>>>>>>>
>>>>>>>>>> On Thu, May 30, 2019 at 9:38 AM ElGato ElMago <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> I had about 14 lines, as attached. How many lines would you
>>>>>>>>>>> recommend?
>>>>>>>>>>>
>>>>>>>>>>> Fine-tuning gives much better results, but it tends to pick
>>>>>>>>>>> characters outside E13B, which has only 14 characters: 0
>>>>>>>>>>> through 9 and 4 symbols. I thought training from scratch
>>>>>>>>>>> would eliminate such confusion.
>>>>>>>>>>>
>>>>>>>>>>> On Thursday, May 30, 2019 at 10:43:08 UTC+9, shree wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> For training from scratch, a large training text and
>>>>>>>>>>>> hundreds of thousands of iterations are recommended.
>>>>>>>>>>>>
>>>>>>>>>>>> If you are just fine-tuning for a font, try to follow the
>>>>>>>>>>>> instructions for training for Impact, with your font.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, 30 May 2019, 06:05 ElGato ElMago, <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks, Shree.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, I saw the instruction.
>>>>>>>>>>>>> The steps I took are as follows:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Using tesstrain.sh:
>>>>>>>>>>>>> src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
>>>>>>>>>>>>>   --noextract_font_properties --langdata_dir ../langdata \
>>>>>>>>>>>>>   --tessdata_dir ./tessdata \
>>>>>>>>>>>>>   --fontlist "E13Bnsd" --output_dir ~/tesstutorial/e13beval \
>>>>>>>>>>>>>   --training_text ../langdata/eng/eng.training_e13b_text
>>>>>>>>>>>>>
>>>>>>>>>>>>> Training from scratch:
>>>>>>>>>>>>> mkdir -p ~/tesstutorial/e13boutput
>>>>>>>>>>>>> src/training/lstmtraining --debug_interval 100 \
>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>   --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
>>>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/base --learning_rate 20e-4 \
>>>>>>>>>>>>>   --train_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>>>   --max_iterations 5000 &>~/tesstutorial/e13boutput/basetrain.log
>>>>>>>>>>>>>
>>>>>>>>>>>>> Test with base_checkpoint:
>>>>>>>>>>>>> src/training/lstmeval --model ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt
>>>>>>>>>>>>>
>>>>>>>>>>>>> Combining output files:
>>>>>>>>>>>>> src/training/lstmtraining --stop_training \
>>>>>>>>>>>>>   --continue_from ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/eng.traineddata
>>>>>>>>>>>>>
>>>>>>>>>>>>> Test with eng.traineddata:
>>>>>>>>>>>>> tesseract e13b.png out --tessdata-dir /home/koichi/tesstutorial/e13boutput
>>>>>>>>>>>>>
>>>>>>>>>>>>> The training from scratch ended as:
>>>>>>>>>>>>>
>>>>>>>>>>>>> At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char train=0%, word train=0%, skip ratio=0%, New best char error = 0 wrote best model:/home/koichi/tesstutorial/e13boutput/base0_561.checkpoint wrote checkpoint.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The test with base_checkpoint returns nothing but:
>>>>>>>>>>>>>
>>>>>>>>>>>>> At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0
>>>>>>>>>>>>>
>>>>>>>>>>>>> The test with eng.traineddata and e13b.png returns out.txt.
>>>>>>>>>>>>> Both files are attached.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Training seems to have worked fine. I don't know how to
>>>>>>>>>>>>> interpret the test result from base_checkpoint. The
>>>>>>>>>>>>> generated eng.traineddata obviously doesn't work well. I
>>>>>>>>>>>>> suspect the choice of --traineddata when combining the
>>>>>>>>>>>>> output files is bad, but I have no clue.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>
>>>>>>>>>>>>> BTW, I referred to your tess4training in the process. It
>>>>>>>>>>>>> helped a lot.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wednesday, May 29, 2019 at 19:14:08 UTC+9, shree wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> See
>>>>>>>>>>>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, May 29, 2019 at 3:18 PM ElGato ElMago <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I wish to make traineddata for the E13B font.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I read the training tutorial and made a base_checkpoint
>>>>>>>>>>>>>>> file according to the method in Training From Scratch.
>>>>>>>>>>>>>>> Now, how can I make traineddata from the base_checkpoint
>>>>>>>>>>>>>>> file?
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAMgOLLyO4hvNf9izPXsTxMw-0Vs%2B5LPim9e7u%3DZeVARmzOjfGA%40mail.gmail.com. For more options, visit https://groups.google.com/d/optout.
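
For anyone who wants to experiment with the bounding-box filtering mentioned at the top of this message (discarding symbols that are too small or that overlap a neighbour), here is a minimal Python sketch. It is not the code referred to above. It assumes you have already extracted one record per symbol (text, box, confidence) from the symbol iterator or from the hocr/lstmbox output; the `Symbol` class, the function names, and both thresholds are hypothetical and would need tuning on real data.

```python
from dataclasses import dataclass


@dataclass
class Symbol:
    """One recognized symbol with its pixel bounding box and confidence.

    Hypothetical container: fill it from whatever per-symbol output you have.
    """
    text: str
    left: int
    top: int
    right: int
    bottom: int
    conf: float

    @property
    def width(self) -> int:
        return self.right - self.left


def horizontal_overlap(a: Symbol, b: Symbol) -> float:
    """Fraction of the narrower box's width covered by the intersection."""
    inter = min(a.right, b.right) - max(a.left, b.left)
    narrower = min(a.width, b.width)
    if inter <= 0 or narrower <= 0:
        return 0.0
    return inter / narrower


def drop_phantoms(symbols, min_width_ratio=0.4, max_overlap=0.5):
    """Return the symbols left to right, without suspected phantoms.

    A symbol is dropped when it is much thinner than the median symbol
    width, or when it overlaps the previously kept symbol by more than
    max_overlap; in the second case the higher-confidence one is kept.
    Both thresholds are guesses, not values taken from Tesseract.
    """
    if not symbols:
        return []
    widths = sorted(s.width for s in symbols)
    median_width = widths[len(widths) // 2]
    kept = []
    for sym in sorted(symbols, key=lambda s: s.left):
        if sym.width < min_width_ratio * median_width:
            continue  # too thin to be a real character in a fixed-pitch font
        if kept and horizontal_overlap(kept[-1], sym) > max_overlap:
            if sym.conf > kept[-1].conf:
                kept[-1] = sym  # same position: keep the more confident read
            continue
        kept.append(sym)
    return kept


if __name__ == "__main__":
    line = [
        Symbol("4", 0, 0, 20, 30, 95.0),
        Symbol("0", 22, 0, 42, 30, 90.0),
        Symbol("O", 23, 0, 43, 30, 60.0),  # duplicate read of the same glyph
        Symbol("<", 45, 0, 65, 30, 92.0),
        Symbol("1", 66, 0, 69, 30, 40.0),  # thin phantom
    ]
    print("".join(s.text for s in drop_phantoms(line)))  # 40<
```

For MRZ lines, a cheap sanity check after filtering is to compare the line length against the expected 44 characters; as noted above, a filter like this will still produce some false positives and false negatives.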

