Hi, Let's call them phantom characters then.
Was psm 7 the solution for issue 1778? None of the psm options solved my problem, though I do see different output. I mostly use tesseract 5.0-alpha, but 4.1 showed the same results anyway. How did you get a bounding box for each character? Alto and lstmbox only show a bbox for a group of characters.

ElMagoElGato

On Wednesday, July 17, 2019 at 18:58:31 UTC+9, Lorenzo Blz wrote:
> Phantom characters here for me too:
>
> https://github.com/tesseract-ocr/tesseract/issues/1778
>
> Are you using 4.1? Bounding boxes were fixed in 4.1; maybe this was also
> improved.
>
> I wrote some code that uses the symbol iterator to discard symbols that are
> clearly duplicated: too small, overlapping, etc. But it was not easy to
> make it work decently and it is not 100% reliable, with false negatives and
> positives. I cannot share the code and it is quite ugly anyway.
>
> Here is another MRZ model with training data:
>
> https://github.com/DoubangoTelecom/tesseractMRZ
>
>
> Lorenzo
>
>
> On Wed, Jul 17, 2019 at 11:26 AM Claudiu <[email protected]> wrote:
>
>> I’m getting the “phantom character” issue as well using the OCRB that
>> Shree trained on MRZ lines. For example, for a 0 it will sometimes add both
>> a 0 and an O to the output, thus outputting 45 characters total instead of
>> 44. I haven’t looked at the bounding box output yet but I suspect a phantom
>> thin character is added somewhere that I can discard... or maybe two chars
>> will have the same bounding box. If anyone else has fixed this issue
>> further up (e.g. so the output doesn’t contain the phantom characters in the
>> first place) I’d be interested.
>>
>> On Wed, Jul 17, 2019 at 10:01 AM ElGato ElMago <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I'll go back to more training later. Before doing so, I'd like to
>>> investigate results a little bit. The hocr and lstmbox options give some
>>> details of positions of characters. The results show positions that
>>> perfectly correspond to letters in the image. But the text output contains
>>> a character that obviously does not exist.
>>>
>>> Then I found a config file 'lstmdebug' that generates far more
>>> information. I hope it explains what happened with each character. I'm
>>> yet to read the debug output but I'd appreciate it if someone could tell me
>>> how to read it because it's really complex.
>>>
>>> Regards,
>>> ElMagoElGato
>>>
>>> On Friday, June 14, 2019 at 19:58:49 UTC+9, shree wrote:
>>>
>>>> See https://github.com/Shreeshrii/tessdata_MICR
>>>>
>>>> I have uploaded my files there.
>>>>
>>>> https://github.com/Shreeshrii/tessdata_MICR/blob/master/MICR.sh
>>>> is the bash script that runs the training.
>>>>
>>>> You can modify as needed. Please note this is for legacy/base tesseract
>>>> --oem 0.
>>>>
>>>> On Fri, Jun 14, 2019 at 1:26 PM ElGato ElMago <[email protected]>
>>>> wrote:
>>>>
>>>>> Thanks a lot, shree. It seems you know everything.
>>>>>
>>>>> I tried the MICR0.traineddata and the first two mcr.traineddata. The
>>>>> last one was blocked by the browser. Each of the traineddata had mixed
>>>>> results. All of them are getting symbols fairly well but getting spaces
>>>>> randomly and reading some numbers wrong.
>>>>>
>>>>> MICR0 seems the best among them. Did you suggest that you'd be able
>>>>> to update it? It gets triple D very often where there's only one, and
>>>>> so on.
>>>>>
>>>>> Also, I tried to fine tune from MICR0 but I found that I need to
>>>>> change the language-specific.sh. It specifies some parameters for each
>>>>> language. Do you have any guidance for it?
>>>>>
>>>>> On Friday, June 14, 2019 at 1:48:40 UTC+9, shree wrote:
>>>>>>
>>>>>> see
>>>>>> http://www.devscope.net/Content/ocrchecks.aspx
>>>>>> https://github.com/BigPino67/Tesseract-MICR-OCR
>>>>>> https://groups.google.com/d/msg/tesseract-ocr/obWI4cz8rXg/6l82hEySgOgJ
>>>>>>
>>>>>>
>>>>>> On Mon, Jun 10, 2019 at 11:21 AM ElGato ElMago <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> That'll be nice if there's traineddata out there but I didn't find
>>>>>>> any. I see free fonts and commercial OCR software but not traineddata.
>>>>>>>
>>>>>>> Tessdata repository obviously doesn't have one, either.
>>>>>>>
>>>>>>> On Saturday, June 8, 2019 at 1:52:10 UTC+9, shree wrote:
>>>>>>>>
>>>>>>>> Please also search for existing MICR traineddata files.
>>>>>>>>
>>>>>>>> On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> So I did several tests from scratch. In the last attempt, I made
>>>>>>>>> a training text with 4,000 lines in the following format,
>>>>>>>>>
>>>>>>>>> 110004310510< <02 :4002=0181:801= 0008752 <00039 ;0000001000;
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> and combined it with eng.digits.training_text in which symbols are
>>>>>>>>> converted to E13B symbols. This makes about 12,000 lines of training
>>>>>>>>> text. It's amazing that this thing generates a good reader out of
>>>>>>>>> nowhere. But then it is not very good. For example:
>>>>>>>>>
>>>>>>>>> <01 :1901=1386:021= 1111001<10001< ;0000090134;
>>>>>>>>>
>>>>>>>>> is a result on the image attached. It's close but the last '<' in
>>>>>>>>> the result text doesn't exist on the image. It's a small failure but
>>>>>>>>> it causes greater trouble in parsing.
>>>>>>>>>
>>>>>>>>> What would you suggest from here to increase accuracy?
>>>>>>>>>
>>>>>>>>> - Increase the number of lines in the training text
>>>>>>>>> - Mix up more variations in the training text
>>>>>>>>> - Increase the number of iterations
>>>>>>>>> - Investigate wrong reads one by one
>>>>>>>>> - Or else?
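The quoted message above describes a training text of 4,000 lines over the E13B symbol set. A minimal sketch of how such a text might be generated follows. It is not the generator actually used in the thread. The 14-character alphabet (digits plus the ASCII stand-ins `<`, `:`, `=`, `;`) is inferred from the sample line, and the field layout, length limits, and the names `make_line` and `make_training_text` are hypothetical.

```python
import random

# Assumption: the four E13B symbols are written with these ASCII stand-ins,
# as they appear in the sample line quoted in the thread.
E13B_DIGITS = "0123456789"
E13B_SYMBOLS = "<:=;"


def make_line(rng, min_fields=4, max_fields=6):
    """Build one line of space-separated fields over the assumed alphabet."""
    fields = []
    for _ in range(rng.randint(min_fields, max_fields)):
        length = rng.randint(2, 12)
        body = "".join(rng.choice(E13B_DIGITS) for _ in range(length))
        # Choosing " " means "no symbol on this side"; strip() turns it into "".
        opener = rng.choice(E13B_SYMBOLS + " ").strip()
        closer = rng.choice(E13B_SYMBOLS + " ").strip()
        fields.append(opener + body + closer)
    return " ".join(fields)


def make_training_text(n_lines, seed=0):
    """Return n_lines random lines; a fixed seed makes the text reproducible."""
    rng = random.Random(seed)
    return [make_line(rng) for _ in range(n_lines)]
```

Joining the result of `make_training_text(4000)` with newlines would give a file comparable in size to the one described. Whether purely random fields train better than lines modelled on real check layouts is an open question in this thread.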
>>>>>>>>>
>>>>>>>>> Also, I referred to engrestrict*.* and could generate a similar
>>>>>>>>> result with the fine-tuning-from-full method. It seems a bit faster
>>>>>>>>> to get to the same level but it also stops at a 'good' level. I can
>>>>>>>>> go either way if it takes me to the bright future.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> ElMagoElGato
>>>>>>>>>
>>>>>>>>> On Thursday, May 30, 2019 at 15:56:02 UTC+9, ElGato ElMago wrote:
>>>>>>>>>>
>>>>>>>>>> Thanks a lot, Shree. I'll look into it.
>>>>>>>>>>
>>>>>>>>>> On Thursday, May 30, 2019 at 14:39:52 UTC+9, shree wrote:
>>>>>>>>>>>
>>>>>>>>>>> See https://github.com/Shreeshrii/tessdata_shreetest
>>>>>>>>>>>
>>>>>>>>>>> Look at the files engrestrict*.* and also
>>>>>>>>>>> https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text
>>>>>>>>>>>
>>>>>>>>>>> Create training text of about 100 lines and finetune for 400
>>>>>>>>>>> lines
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, May 30, 2019 at 9:38 AM ElGato ElMago <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I had about 14 lines as attached. How many lines would you
>>>>>>>>>>>> recommend?
>>>>>>>>>>>>
>>>>>>>>>>>> Fine tuning gives a much better result but it tends to pick
>>>>>>>>>>>> characters other than those in E13B, which only has 14 characters:
>>>>>>>>>>>> 0 through 9 and 4 symbols. I thought training from scratch would
>>>>>>>>>>>> eliminate such confusion.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thursday, May 30, 2019 at 10:43:08 UTC+9, shree wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> For training from scratch a large training text and hundreds
>>>>>>>>>>>>> of thousands of iterations are recommended.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If you are just fine tuning for a font, try to follow the
>>>>>>>>>>>>> instructions for training for Impact, with your font.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, 30 May 2019, 06:05 ElGato ElMago, <[email protected]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks, Shree.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, I saw the instruction. The steps I made are as follows:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Using tesstrain.sh:
>>>>>>>>>>>>>> src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
>>>>>>>>>>>>>>   --noextract_font_properties --langdata_dir ../langdata \
>>>>>>>>>>>>>>   --tessdata_dir ./tessdata \
>>>>>>>>>>>>>>   --fontlist "E13Bnsd" --output_dir ~/tesstutorial/e13beval \
>>>>>>>>>>>>>>   --training_text ../langdata/eng/eng.training_e13b_text
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Training from scratch:
>>>>>>>>>>>>>> mkdir -p ~/tesstutorial/e13boutput
>>>>>>>>>>>>>> src/training/lstmtraining --debug_interval 100 \
>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>   --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
>>>>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/base --learning_rate 20e-4 \
>>>>>>>>>>>>>>   --train_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>>>>   --max_iterations 5000 &>~/tesstutorial/e13boutput/basetrain.log
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Test with base_checkpoint:
>>>>>>>>>>>>>> src/training/lstmeval --model ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Combining output files:
>>>>>>>>>>>>>> src/training/lstmtraining --stop_training \
>>>>>>>>>>>>>>   --continue_from ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/eng.traineddata
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Test with eng.traineddata:
>>>>>>>>>>>>>> tesseract e13b.png out --tessdata-dir /home/koichi/tesstutorial/e13boutput
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The training from scratch ended as:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char
>>>>>>>>>>>>>> train=0%, word train=0%, skip ratio=0%, New best char error = 0
>>>>>>>>>>>>>> wrote best model:/home/koichi/tesstutorial/e13boutput/base0_561.checkpoint
>>>>>>>>>>>>>> wrote checkpoint.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The test with base_checkpoint returns nothing as:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The test with eng.traineddata and e13b.png returns out.txt.
>>>>>>>>>>>>>> Both files are attached.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Training seems to have worked fine. I don't know how to
>>>>>>>>>>>>>> interpret the test result from base_checkpoint. The generated
>>>>>>>>>>>>>> eng.traineddata obviously doesn't work well. I suspect the
>>>>>>>>>>>>>> choice of --traineddata in combining output files is bad but I
>>>>>>>>>>>>>> have no clue.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> BTW, I referred to your tess4training in the process. It
>>>>>>>>>>>>>> helped a lot.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wednesday, May 29, 2019 at 19:14:08 UTC+9, shree wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> see
>>>>>>>>>>>>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, May 29, 2019 at 3:18 PM ElGato ElMago <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I wish to make a trained data for E13B font.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I read the training tutorial and made a base_checkpoint
>>>>>>>>>>>>>>>> file according to the method in Training From Scratch. Now,
>>>>>>>>>>>>>>>> how can I make a trained data from the base_checkpoint file?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> You received this message because you are subscribed to the
>>>>>>>>>>>>>>>> Google Groups "tesseract-ocr" group.
>>>>>>>>>>>>>>>> To unsubscribe from this group and stop receiving emails
>>>>>>>>>>>>>>>> from it, send an email to [email protected].
>>>>>>>>>>>>>>>> To post to this group, send email to
>>>>>>>>>>>>>>>> [email protected].
>>>>>>>>>>>>>>>> Visit this group at
>>>>>>>>>>>>>>>> https://groups.google.com/group/tesseract-ocr.
>>>>>>>>>>>>>>>> To view this discussion on the web visit
>>>>>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/4848cfa5-ae2b-4be3-a771-686aa0fec702%40googlegroups.com
>>>>>>>>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> ____________________________________________________________
>>>>>>>>>>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/71f7d6bd-b8a7-4057-b1bf-ab02db544579%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
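Lorenzo's quoted message describes discarding symbols that are "too small, overlapping, etc." based on per-symbol boxes. His code was not shared, so the following is only a rough sketch of that idea, assuming per-symbol horizontal extents and confidences are already available (for example from a symbol-level iterator). The `Symbol` type, the function names, and both thresholds are hypothetical, and, as he notes, this kind of filter can produce false positives and negatives.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Symbol:
    text: str     # recognized character
    left: int     # left edge in pixels
    right: int    # right edge in pixels
    conf: float   # recognition confidence, higher is better


def overlap_ratio(a, b):
    """Horizontal overlap as a fraction of the narrower box's width."""
    inter = min(a.right, b.right) - max(a.left, b.left)
    narrower = min(a.right - a.left, b.right - b.left)
    return max(inter, 0) / narrower if narrower > 0 else 1.0


def drop_phantoms(symbols, min_width_ratio=0.3, max_overlap=0.5):
    """Drop very thin symbols and the weaker of two heavily overlapping ones."""
    if not symbols:
        return []
    typical = median(s.right - s.left for s in symbols)
    kept = []
    for sym in sorted(symbols, key=lambda s: s.left):
        if sym.right - sym.left < min_width_ratio * typical:
            continue  # much thinner than a typical glyph: likely a phantom
        if kept and overlap_ratio(kept[-1], sym) > max_overlap:
            if sym.conf > kept[-1].conf:
                kept[-1] = sym  # keep the more confident of the two
            continue
        kept.append(sym)
    return kept
```

For Claudiu's case of a 0 and an O reported at nearly the same position, the lower-confidence one of the pair would be dropped, which should bring a 45-character MRZ line back to 44. The thresholds would need tuning per font and resolution.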

