Re: [tesseract-ocr] Trained data for E13B font

Lorenzo Bolzani Wed, 17 Jul 2019 02:58:40 -0700

Phantom characters here for me too:

https://github.com/tesseract-ocr/tesseract/issues/1778


Are you using 4.1? Bounding boxes were fixed in 4.1, so maybe this was
improved as well.

I wrote some code that uses the symbol iterator to discard symbols that are
clearly duplicates: too small, overlapping, etc. It was not easy to make it
work decently, and it is not 100% reliable: it still produces both false
negatives and false positives. I cannot share the code, and it is quite ugly
anyway.
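
For anyone who wants a starting point, here is a minimal sketch of that kind of filter. It is not the code mentioned above: the Symbol tuple, drop_phantoms, and the min_width and max_overlap thresholds are all made up for illustration, and it assumes you have already collected each symbol's text, bounding box and confidence from the symbol-level iterator.

```python
from typing import List, NamedTuple


class Symbol(NamedTuple):
    """One recognized symbol (hypothetical container, not a Tesseract type)."""
    text: str
    left: int
    top: int
    right: int
    bottom: int
    conf: float  # recognition confidence, higher is better


def overlap_ratio(a: Symbol, b: Symbol) -> float:
    """Horizontal overlap as a fraction of the narrower symbol's width."""
    inter = min(a.right, b.right) - max(a.left, b.left)
    narrower = min(a.right - a.left, b.right - b.left)
    return max(inter, 0) / narrower if narrower > 0 else 1.0


def drop_phantoms(symbols: List[Symbol], min_width: int = 3,
                  max_overlap: float = 0.5) -> List[Symbol]:
    """Drop symbols that are too thin or mostly overlap the last kept one."""
    kept: List[Symbol] = []
    for sym in sorted(symbols, key=lambda s: s.left):
        if sym.right - sym.left < min_width:
            continue  # too thin to be a real glyph
        if kept and overlap_ratio(kept[-1], sym) > max_overlap:
            # Two candidates for the same glyph: keep the more confident one.
            if sym.conf > kept[-1].conf:
                kept[-1] = sym
            continue
        kept.append(sym)
    return kept
```

When a 0 is reported as both 0 and O with nearly the same box, this keeps only the higher-confidence reading. The thresholds are guesses and would need tuning per font and resolution.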

Here is another MRZ model with training data:

https://github.com/DoubangoTelecom/tesseractMRZ




Lorenzo


On Wed, 17 Jul 2019 at 11:26, Claudiu <[email protected]> wrote:

> I’m getting the “phantom character” issue as well, using the OCRB model that
> Shree trained on MRZ lines. For example, for a 0 it will sometimes add both
> a 0 and an O to the output, thus outputting 45 characters in total instead of
> 44. I haven’t looked at the bounding box output yet, but I suspect a phantom
> thin character is added somewhere that I can discard, or maybe two characters
> will have the same bounding box. If anyone else has fixed this issue
> further up (e.g. so the output doesn’t contain the phantom characters in the
> first place), I’d be interested.
>
> On Wed, Jul 17, 2019 at 10:01 AM ElGato ElMago <[email protected]>
> wrote:
>
>> Hi,
>>
>> I'll go back to more training later.  Before doing so, I'd like to
>> investigate the results a little.  The hocr and lstmbox options give some
>> details about character positions.  The results show positions that
>> correspond perfectly to the letters in the image.  But the text output contains
>> a character that obviously does not exist.
>>
>> Then I found a config file, 'lstmdebug', that generates far more
>> information.  I hope it explains what happened with each character.  I have
>> yet to read the debug output, but I'd appreciate it if someone could tell me
>> how to read it because it's really complex.
>>
>> Regards,
>> ElMagoElGato
>>
>> On Fri, Jun 14, 2019 at 19:58:49 UTC+9, shree wrote:
>>
>>> See https://github.com/Shreeshrii/tessdata_MICR
>>>
>>> I have uploaded my files there.
>>>
>>> https://github.com/Shreeshrii/tessdata_MICR/blob/master/MICR.sh
>>> is the bash script that runs the training.
>>>
>>> You can modify as needed. Please note this is for legacy/base tesseract
>>> --oem 0.
>>>
>>> On Fri, Jun 14, 2019 at 1:26 PM ElGato ElMago <[email protected]>
>>> wrote:
>>>
>>>> Thanks a lot, shree.  It seems you know everything.
>>>>
>>>> I tried the MICR0.traineddata and the first two mcr.traineddata.  The
>>>> last one was blocked by the browser.  Each traineddata gave mixed
>>>> results.  All of them read the symbols fairly well but insert spaces
>>>> randomly and read some numbers wrong.
>>>>
>>>> MICR0 seems the best among them.  Did you suggest that you'd be able to
>>>> update it?  It very often outputs a triple D where there's only one, and so on.
>>>>
>>>> Also, I tried to fine-tune from MICR0, but I found that I need to change
>>>> language-specific.sh.  It specifies some parameters for each language.
>>>> Do you have any guidance for it?
>>>>
>>>> On Fri, Jun 14, 2019 at 01:48:40 UTC+9, shree wrote:
>>>>>
>>>>> see
>>>>> http://www.devscope.net/Content/ocrchecks.aspx
>>>>> https://github.com/BigPino67/Tesseract-MICR-OCR
>>>>> https://groups.google.com/d/msg/tesseract-ocr/obWI4cz8rXg/6l82hEySgOgJ
>>>>>
>>>>>
>>>>> On Mon, Jun 10, 2019 at 11:21 AM ElGato ElMago <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> It would be nice if there were traineddata out there, but I didn't find
>>>>>> any.  I see free fonts and commercial OCR software but no traineddata.
>>>>>> The tessdata repository obviously doesn't have one, either.
>>>>>>
>>>>>> On Sat, Jun 8, 2019 at 01:52:10 UTC+9, shree wrote:
>>>>>>>
>>>>>>> Please also search for existing MICR traineddata files.
>>>>>>>
>>>>>>> On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> So I did several tests from scratch.  In the last attempt, I made a
>>>>>>>> training text with 4,000 lines in the following format,
>>>>>>>>
>>>>>>>> 110004310510<   <02 :4002=0181:801= 0008752 <00039 ;0000001000;
>>>>>>>>
>>>>>>>>
>>>>>>>> and combined it with eng.digits.training_text in which symbols are
>>>>>>>> converted to E13B symbols.  This makes about 12,000 lines of training
>>>>>>>> text.  It's amazing that this thing generates a good reader out of
>>>>>>>> nowhere.  But then it is not very good.  For example:
>>>>>>>>
>>>>>>>> <01 :1901=1386:021= 1111001<10001< ;0000090134;
>>>>>>>>
>>>>>>>> is a result on the image attached.  It's close, but the last '<' in
>>>>>>>> the result text doesn't exist in the image.  It's a small failure, but
>>>>>>>> it causes greater trouble in parsing.
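
A training text in the format quoted above could be produced along these lines. This is only a sketch, not the script actually used in this thread: it takes the single sample line as a fixed template and replaces every digit with a random one, so the symbol and space layout never varies. The names SAMPLE, random_line and training_text are invented for illustration.

```python
import random

# The one sample line quoted in this thread, used as a layout template.
SAMPLE = "110004310510<   <02 :4002=0181:801= 0008752 <00039 ;0000001000;"


def random_line(rng: random.Random, template: str = SAMPLE) -> str:
    """Replace every digit in the template with a random digit."""
    return "".join(rng.choice("0123456789") if ch.isdigit() else ch
                   for ch in template)


def training_text(n_lines: int, seed: int = 0) -> str:
    """Return n_lines random lines, newline-terminated, reproducible by seed."""
    rng = random.Random(seed)
    return "\n".join(random_line(rng) for _ in range(n_lines)) + "\n"


if __name__ == "__main__":
    print(training_text(3), end="")
```

Because every line shares one layout, a text generated this way has little variety in where symbols and spaces occur, which may matter for the "mix up more variations" question.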
>>>>>>>>
>>>>>>>> What would you suggest from here to increase accuracy?
>>>>>>>>
>>>>>>>>    - Increase the number of lines in the training text
>>>>>>>>    - Mix up more variations in the training text
>>>>>>>>    - Increase the number of iterations
>>>>>>>>    - Investigate wrong reads one by one
>>>>>>>>    - Or else?
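
For investigating wrong reads one by one, a small diff against the expected text can point at the exact position of an extra character. The sketch below uses Python's standard difflib; the expected string is an assumption, obtained by removing the last '<' from the result quoted above, and extra_characters is a made-up helper name.

```python
import difflib
from typing import List, Tuple


def extra_characters(expected: str, actual: str) -> List[Tuple[int, str]]:
    """Return (index_in_actual, text) for runs the OCR output added."""
    matcher = difflib.SequenceMatcher(None, expected, actual, autojunk=False)
    return [(j1, actual[j1:j2])
            for tag, _i1, _i2, j1, j2 in matcher.get_opcodes()
            if tag == "insert"]


if __name__ == "__main__":
    # Assumed ground truth: the quoted result without its last '<'.
    expected = "<01 :1901=1386:021= 1111001<10001 ;0000090134;"
    actual = "<01 :1901=1386:021= 1111001<10001< ;0000090134;"
    for index, text in extra_characters(expected, actual):
        print(f"extra {text!r} at position {index}")
```

Substitutions and missing characters show up under the "replace" and "delete" tags of get_opcodes(), which this sketch deliberately ignores.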
>>>>>>>>
>>>>>>>> Also, I referred to engrestrict*.* and could generate a similar
>>>>>>>> result with the fine-tuning-from-full method.  It seems a bit faster
>>>>>>>> to get to the same level, but it also stops at a 'good' level.  I can go
>>>>>>>> either way if it takes me to the bright future.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> ElMagoElGato
>>>>>>>>
>>>>>>>> On Thu, May 30, 2019 at 15:56:02 UTC+9, ElGato ElMago wrote:
>>>>>>>>>
>>>>>>>>> Thanks a lot, Shree. I'll look into it.
>>>>>>>>>
>>>>>>>>> On Thu, May 30, 2019 at 14:39:52 UTC+9, shree wrote:
>>>>>>>>>>
>>>>>>>>>> See https://github.com/Shreeshrii/tessdata_shreetest
>>>>>>>>>>
>>>>>>>>>> Look at the files engrestrict*.* and also
>>>>>>>>>> https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text
>>>>>>>>>>
>>>>>>>>>> Create training text of about 100 lines and finetune for 400
>>>>>>>>>> lines
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, May 30, 2019 at 9:38 AM ElGato ElMago <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> I had about 14 lines as attached.  How many lines would you
>>>>>>>>>>> recommend?
>>>>>>>>>>>
>>>>>>>>>>> Fine tuning gives much better results, but it tends to pick
>>>>>>>>>>> characters outside E13B, which has only 14 characters: 0 through 9
>>>>>>>>>>> and 4 symbols.  I thought training from scratch would eliminate such
>>>>>>>>>>> confusion.
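
One way to measure that confusion is to list every output character that falls outside the 14-character set. This is a sketch under an assumption: that the four E13B symbols are represented by the ASCII stand-ins '<', ':', '=' and ';', as they appear in the training text elsewhere in this thread. E13B_ALPHABET and foreign_characters are illustrative names.

```python
from typing import List

# Ten digits plus four ASCII stand-ins for the E13B symbols (assumed mapping).
E13B_ALPHABET = set("0123456789<:=;")


def foreign_characters(text: str) -> List[str]:
    """Return the sorted non-space characters that are not in the E13B set."""
    return sorted({ch for ch in text
                   if not ch.isspace() and ch not in E13B_ALPHABET})


if __name__ == "__main__":
    # A letter O read in place of a zero is reported as foreign.
    print(foreign_characters("<01 :19O1=1386:"))
```

Run over a batch of OCR outputs, this gives a quick count of how often a fine-tuned model leaves the E13B character set.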
>>>>>>>>>>>
>>>>>>>>>>> On Thu, May 30, 2019 at 10:43:08 UTC+9, shree wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> For training from scratch a large training text and hundreds of
>>>>>>>>>>>> thousands of iterations are recommended.
>>>>>>>>>>>>
>>>>>>>>>>>> If you are just fine tuning for a font try to follow
>>>>>>>>>>>> instructions for training for impact, with your font.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, 30 May 2019, 06:05 ElGato ElMago, <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks, Shree.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, I saw the instructions.  The steps I took are as follows:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Using tesstrain.sh:
>>>>>>>>>>>>> src/training/tesstrain.sh --fonts_dir /usr/share/fonts \
>>>>>>>>>>>>>   --lang eng --linedata_only \
>>>>>>>>>>>>>   --noextract_font_properties --langdata_dir ../langdata \
>>>>>>>>>>>>>   --tessdata_dir ./tessdata \
>>>>>>>>>>>>>   --fontlist "E13Bnsd" --output_dir ~/tesstutorial/e13beval \
>>>>>>>>>>>>>   --training_text ../langdata/eng/eng.training_e13b_text
>>>>>>>>>>>>>
>>>>>>>>>>>>> Training from scratch:
>>>>>>>>>>>>> mkdir -p ~/tesstutorial/e13boutput
>>>>>>>>>>>>> src/training/lstmtraining --debug_interval 100 \
>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>   --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
>>>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/base \
>>>>>>>>>>>>>   --learning_rate 20e-4 \
>>>>>>>>>>>>>   --train_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>>>   --max_iterations 5000 \
>>>>>>>>>>>>>   &>~/tesstutorial/e13boutput/basetrain.log
>>>>>>>>>>>>>
>>>>>>>>>>>>> Test with base_checkpoint:
>>>>>>>>>>>>> src/training/lstmeval \
>>>>>>>>>>>>>   --model ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt
>>>>>>>>>>>>>
>>>>>>>>>>>>> Combining output files:
>>>>>>>>>>>>> src/training/lstmtraining --stop_training \
>>>>>>>>>>>>>   --continue_from ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/eng.traineddata
>>>>>>>>>>>>>
>>>>>>>>>>>>> Test with eng.traineddata:
>>>>>>>>>>>>> tesseract e13b.png out \
>>>>>>>>>>>>>   --tessdata-dir /home/koichi/tesstutorial/e13boutput
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> The training from scratch ended as:
>>>>>>>>>>>>>
>>>>>>>>>>>>> At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char train=0%, word train=0%, skip ratio=0%,  New best char error = 0 wrote best model:/home/koichi/tesstutorial/e13boutput/base0_561.checkpoint wrote checkpoint.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> The test with base_checkpoint returns nothing as:
>>>>>>>>>>>>>
>>>>>>>>>>>>> At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> The test with eng.traineddata and e13b.png returns out.txt.
>>>>>>>>>>>>> Both files are attached.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Training seems to have worked fine.  I don't know how to
>>>>>>>>>>>>> interpret the test result from base_checkpoint.  The generated
>>>>>>>>>>>>> eng.traineddata obviously doesn't work well.  I suspect the choice
>>>>>>>>>>>>> of --traineddata when combining the output files is bad, but I have
>>>>>>>>>>>>> no clue.
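
To follow such a run over time, the progress lines in basetrain.log can be pulled apart with a regular expression. The pattern below is a sketch built only from the single log line quoted above; other Tesseract versions may format the line differently, and parse_progress and the dictionary keys are made-up names. The three iteration counters are returned as a plain tuple without interpreting them.

```python
import re
from typing import Dict, Optional

# Pattern derived from the one progress line quoted in this thread.
PATTERN = re.compile(
    r"At iteration (\d+)/(\d+)/(\d+), Mean rms=([\d.]+)%, delta=([\d.]+)%, "
    r"char train=([\d.]+)%, word train=([\d.]+)%, skip ratio=([\d.]+)%"
)


def parse_progress(line: str) -> Optional[Dict[str, object]]:
    """Extract the counters and percentages from one progress line."""
    match = PATTERN.search(line)
    if match is None:
        return None
    groups = match.groups()
    rms, delta, char_train, word_train, skip = (float(g) for g in groups[3:])
    return {
        "iterations": tuple(int(g) for g in groups[:3]),
        "rms": rms,
        "delta": delta,
        "char_train": char_train,
        "word_train": word_train,
        "skip_ratio": skip,
    }


if __name__ == "__main__":
    line = ("At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, "
            "char train=0%, word train=0%, skip ratio=0%,")
    print(parse_progress(line))
```

Note that the commands above pass the same eng.training_files.txt as both --train_listfile and --eval_listfile, so a 0% error in these lines only shows that the training lines themselves are reproduced; it says little about unseen images.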
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>
>>>>>>>>>>>>> BTW, I referred to your tess4training in the process.  It
>>>>>>>>>>>>> helped a lot.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, May 29, 2019 at 19:14:08 UTC+9, shree wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> see
>>>>>>>>>>>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, May 29, 2019 at 3:18 PM ElGato ElMago <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I would like to create traineddata for the E13B font.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I read the training tutorial and made a base_checkpoint file
>>>>>>>>>>>>>>> following the method in Training From Scratch.  Now, how can I
>>>>>>>>>>>>>>> create traineddata from the base_checkpoint file?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> You received this message because you are subscribed to the
>>>>>>>>>>>>>>> Google Groups "tesseract-ocr" group.
>>>>>>>>>>>>>>> To unsubscribe from this group and stop receiving emails
>>>>>>>>>>>>>>> from it, send an email to [email protected].
>>>>>>>>>>>>>>> To post to this group, send email to
>>>>>>>>>>>>>>> [email protected].
>>>>>>>>>>>>>>> Visit this group at
>>>>>>>>>>>>>>> https://groups.google.com/group/tesseract-ocr.
>>>>>>>>>>>>>>> To view this discussion on the web visit
>>>>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/4848cfa5-ae2b-4be3-a771-686aa0fec702%40googlegroups.com
>>>>>>>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/4848cfa5-ae2b-4be3-a771-686aa0fec702%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ____________________________________________________________
>>>>>>>>>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>>>>>>>>>>

