I’m also getting the “phantom character” issue using the OCRB that Shree
trained on MRZ lines. For example, for a 0 it will sometimes add both a 0
and an O to the output, producing 45 characters in total instead of 44.
I haven’t looked at the bounding box output yet, but I suspect that either
a phantom thin character is added somewhere, which I could discard, or two
characters share the same bounding box. If anyone has fixed this issue
further upstream (e.g. so that the output doesn’t contain the phantom
characters in the first place), I’d be interested.
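
A rough sketch of the post-filter I have in mind, untested against real
output. It assumes box-file style lines (symbol, left, bottom, right, top,
page, separated by spaces), that the line contains no space symbols (MRZ
uses '<' as filler), and that a phantom either repeats the previous box or
is very narrow. The name filter_phantoms and the MIN_WIDTH threshold are
made up for illustration.

```shell
# Hypothetical helper: reads box lines on stdin and drops a symbol when its
# box equals the previously kept symbol's box, or when it is narrower than
# MIN_WIDTH pixels (an assumed threshold, default 3).
filter_phantoms() {
  awk -v min_width="${MIN_WIDTH:-3}" '
    {
      box = $2 " " $3 " " $4 " " $5   # left bottom right top
      width = $4 - $2                 # right minus left
      if (box == prev_box || width < min_width) next
      prev_box = box
      print
    }
  '
}

# Usage: the 0 and the O share one box, so only the first is kept.
printf '%s\n' 'P 10 5 30 40 0' '0 35 5 55 40 0' 'O 35 5 55 40 0' '< 60 5 80 40 0' |
  filter_phantoms | awk '{ printf "%s", $1 } END { print "" }'
```

Keeping the first symbol of a duplicated box is an arbitrary choice; for
fields known to be numeric it may be better to prefer the digit.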

On Wed, Jul 17, 2019 at 10:01 AM ElGato ElMago <[email protected]>
wrote:

> Hi,
>
> I'll go back to more training later.  Before doing so, I'd like to
> investigate the results a little.  The hocr and lstmbox options give some
> details of the positions of characters.  The results show positions that
> perfectly correspond to letters in the image.  But the text output contains
> a character that obviously does not exist.
>
> Then I found a config file 'lstmdebug' that generates far more
> information.  I hope it explains what happened with each character.  I
> have yet to read the debug output but I'd appreciate it if someone could
> tell me how to read it because it's really complex.
>
> Regards,
> ElMagoElGato
>
> On Friday, June 14, 2019 at 19:58:49 UTC+9, shree wrote:
>
>> See https://github.com/Shreeshrii/tessdata_MICR
>>
>> I have uploaded my files there.
>>
>> https://github.com/Shreeshrii/tessdata_MICR/blob/master/MICR.sh
>> is the bash script that runs the training.
>>
>> You can modify as needed. Please note this is for legacy/base tesseract
>> --oem 0.
>>
>> On Fri, Jun 14, 2019 at 1:26 PM ElGato ElMago <[email protected]>
>> wrote:
>>
>>> Thanks a lot, shree.  It seems you know everything.
>>>
>>> I tried the MICR0.traineddata and the first two mcr.traineddata.  The
>>> last one was blocked by the browser.  Each of the traineddata had mixed
>>> results.  All of them are getting symbols fairly well but getting spaces
>>> randomly and reading some numbers wrong.
>>>
>>> MICR0 seems the best among them.  Did you suggest that you'd be able to
>>> update it?  It gets triple D very often where there's only one, and so on.
>>>
>>> Also, I tried to fine tune from MICR0 but I found that I need to change
>>> the language-specific.sh.  It specifies some parameters for each language.
>>> Do you have any guidance for it?
>>>
>>> On Friday, June 14, 2019 at 1:48:40 UTC+9, shree wrote:
>>>>
>>>> see
>>>> http://www.devscope.net/Content/ocrchecks.aspx
>>>> https://github.com/BigPino67/Tesseract-MICR-OCR
>>>> https://groups.google.com/d/msg/tesseract-ocr/obWI4cz8rXg/6l82hEySgOgJ
>>>>
>>>> On Mon, Jun 10, 2019 at 11:21 AM ElGato ElMago <[email protected]>
>>>> wrote:
>>>>
>>>>> It would be nice if there were traineddata out there but I didn't find
>>>>> any.  I see free fonts and commercial OCR software but not traineddata.
>>>>> The tessdata repository obviously doesn't have one, either.
>>>>>
>>>>> On Saturday, June 8, 2019 at 1:52:10 UTC+9, shree wrote:
>>>>>>
>>>>>> Please also search for existing MICR traineddata files.
>>>>>>
>>>>>> On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> So I did several tests from scratch.  In the last attempt, I made a
>>>>>>> training text with 4,000 lines in the following format,
>>>>>>>
>>>>>>> 110004310510<   <02 :4002=0181:801= 0008752 <00039 ;0000001000;
>>>>>>>
>>>>>>>
>>>>>>> and combined it with eng.digits.training_text in which symbols are
>>>>>>> converted to E13B symbols.  This makes about 12,000 lines of training
>>>>>>> text.  It's amazing that this thing generates a good reader out of
>>>>>>> nowhere.  But then it is not very good.  For example:
>>>>>>>
>>>>>>> <01 :1901=1386:021= 1111001<10001< ;0000090134;
>>>>>>>
>>>>>>> is a result on the image attached.  It's close but the last '<' in
>>>>>>> the result text doesn't exist on the image.  It's a small failure but it
>>>>>>> causes greater trouble in parsing.
>>>>>>>
>>>>>>> What would you suggest from here to increase accuracy?
>>>>>>>
>>>>>>>    - Increase the number of lines in the training text
>>>>>>>    - Mix up more variations in the training text
>>>>>>>    - Increase the number of iterations
>>>>>>>    - Investigate wrong reads one by one
>>>>>>>    - Or else?
>>>>>>>
>>>>>>> Also, I referred to engrestrict*.* and could generate a similar result
>>>>>>> with the fine-tuning-from-full method.  It seems a bit faster to get to
>>>>>>> the same level but it also stops at a 'good' level.  I can go either
>>>>>>> way if it takes me to the bright future.
>>>>>>>
>>>>>>> Regards,
>>>>>>> ElMagoElGato
>>>>>>>
>>>>>>> On Thursday, May 30, 2019 at 15:56:02 UTC+9, ElGato ElMago wrote:
>>>>>>>>
>>>>>>>> Thanks a lot, Shree. I'll look into it.
>>>>>>>>
>>>>>>>> On Thursday, May 30, 2019 at 14:39:52 UTC+9, shree wrote:
>>>>>>>>>
>>>>>>>>> See https://github.com/Shreeshrii/tessdata_shreetest
>>>>>>>>>
>>>>>>>>> Look at the files engrestrict*.* and also
>>>>>>>>> https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text
>>>>>>>>>
>>>>>>>>> Create training text of about 100 lines and finetune for 400 lines
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, May 30, 2019 at 9:38 AM ElGato ElMago <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I had about 14 lines as attached.  How many lines would you
>>>>>>>>>> recommend?
>>>>>>>>>>
>>>>>>>>>> Fine tuning gives a much better result, but it tends to pick
>>>>>>>>>> characters outside E13B, which has only 14 characters: 0 through 9
>>>>>>>>>> and 4 symbols.  I thought training from scratch would eliminate
>>>>>>>>>> such confusion.
>>>>>>>>>>
>>>>>>>>>> On Thursday, May 30, 2019 at 10:43:08 UTC+9, shree wrote:
>>>>>>>>>>>
>>>>>>>>>>> For training from scratch a large training text and hundreds of
>>>>>>>>>>> thousands of iterations are recommended.
>>>>>>>>>>>
>>>>>>>>>>> If you are just fine tuning for a font try to follow
>>>>>>>>>>> instructions for training for impact, with your font.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 30 May 2019, 06:05 ElGato ElMago, <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks, Shree.
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, I saw the instruction.  The steps I made are as follows:
>>>>>>>>>>>>
>>>>>>>>>>>> Using tesstrain.sh:
>>>>>>>>>>>> src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng \
>>>>>>>>>>>>   --linedata_only \
>>>>>>>>>>>>   --noextract_font_properties --langdata_dir ../langdata \
>>>>>>>>>>>>   --tessdata_dir ./tessdata \
>>>>>>>>>>>>   --fontlist "E13Bnsd" --output_dir ~/tesstutorial/e13beval \
>>>>>>>>>>>>   --training_text ../langdata/eng/eng.training_e13b_text
>>>>>>>>>>>>
>>>>>>>>>>>> Training from scratch:
>>>>>>>>>>>> mkdir -p ~/tesstutorial/e13boutput
>>>>>>>>>>>> src/training/lstmtraining --debug_interval 100 \
>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>   --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
>>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/base --learning_rate 20e-4 \
>>>>>>>>>>>>   --train_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>>   --max_iterations 5000 &>~/tesstutorial/e13boutput/basetrain.log
>>>>>>>>>>>>
>>>>>>>>>>>> Test with base_checkpoint:
>>>>>>>>>>>> src/training/lstmeval --model ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt
>>>>>>>>>>>>
>>>>>>>>>>>> Combining output files:
>>>>>>>>>>>> src/training/lstmtraining --stop_training \
>>>>>>>>>>>>   --continue_from ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/eng.traineddata
>>>>>>>>>>>>
>>>>>>>>>>>> Test with eng.traineddata:
>>>>>>>>>>>> tesseract e13b.png out --tessdata-dir /home/koichi/tesstutorial/e13boutput
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The training from scratch ended as:
>>>>>>>>>>>>
>>>>>>>>>>>> At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char
>>>>>>>>>>>> train=0%, word train=0%, skip ratio=0%,  New best char error = 0
>>>>>>>>>>>> wrote best model:/home/koichi/tesstutorial/e13boutput/base0_561.checkpoint
>>>>>>>>>>>> wrote checkpoint.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The test with base_checkpoint returns nothing as:
>>>>>>>>>>>>
>>>>>>>>>>>> At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The test with eng.traineddata and e13b.png returns out.txt.
>>>>>>>>>>>> Both files are attached.
>>>>>>>>>>>>
>>>>>>>>>>>> Training seems to have worked fine.  I don't know how to
>>>>>>>>>>>> interpret the test result from base_checkpoint.  The generated
>>>>>>>>>>>> eng.traineddata obviously doesn't work well. I suspect the choice
>>>>>>>>>>>> of --traineddata in combining output files is bad but I have no clue.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>
>>>>>>>>>>>> BTW, I referred to your tess4training in the process.  It
>>>>>>>>>>>> helped a lot.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wednesday, May 29, 2019 at 19:14:08 UTC+9, shree wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> see
>>>>>>>>>>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, May 29, 2019 at 3:18 PM ElGato ElMago <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I wish to make traineddata for the E13B font.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I read the training tutorial and made a base_checkpoint file
>>>>>>>>>>>>>> according to the method in Training From Scratch.  Now, how can
>>>>>>>>>>>>>> I make traineddata from the base_checkpoint file?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> You received this message because you are subscribed to the
>>>>>>>>>>>>>> Google Groups "tesseract-ocr" group.
>>>>>>>>>>>>>> To unsubscribe from this group and stop receiving emails from
>>>>>>>>>>>>>> it, send an email to [email protected].
>>>>>>>>>>>>>> To post to this group, send email to
>>>>>>>>>>>>>> [email protected].
>>>>>>>>>>>>>> Visit this group at
>>>>>>>>>>>>>> https://groups.google.com/group/tesseract-ocr.
>>>>>>>>>>>>>> To view this discussion on the web visit
>>>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/4848cfa5-ae2b-4be3-a771-686aa0fec702%40googlegroups.com
>>>>>>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/4848cfa5-ae2b-4be3-a771-686aa0fec702%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>>>>>>> .
>>>>>>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>>
>>>>>>>>>>>>> ____________________________________________________________
>>>>>>>>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>>>>>>>>>
