Re: [tesseract-ocr] Trained data for E13B font

Shree Devi Kumar Fri, 14 Jun 2019 03:58:53 -0700

See https://github.com/Shreeshrii/tessdata_MICR


I have uploaded my files there.

https://github.com/Shreeshrii/tessdata_MICR/blob/master/MICR.sh
is the bash script that runs the training.

You can modify it as needed. Please note that this is for the legacy (base)
Tesseract engine, --oem 0.
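
As a rough sketch of how a legacy-engine model could be invoked: the model name MICR, the ./tessdata directory, and --psm 7 (treat the image as a single text line) are assumptions for illustration, not something fixed by the repository. The helper name run_micr is hypothetical. By default it only prints the command so it can be checked first.

```shell
# Sketch: call Tesseract with the legacy engine (--oem 0) and a MICR model.
# Assumed for illustration: the model file is ./tessdata/MICR.traineddata and
# the image holds one line of text (--psm 7).
run_micr() {
  image="$1"
  outbase="$2"
  tessdata_dir="${3:-./tessdata}"
  # Rebuild the positional parameters as the full tesseract command line.
  set -- tesseract "$image" "$outbase" \
    --tessdata-dir "$tessdata_dir" -l MICR --oem 0 --psm 7
  if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "$*"   # dry run: show the command only
  else
    "$@"        # real run: requires tesseract and the model to be installed
  fi
}

run_micr e13b.png out
```

Setting DRY_RUN=0 in the environment makes the same call run tesseract and write out.txt.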

On Fri, Jun 14, 2019 at 1:26 PM ElGato ElMago <[email protected]>
wrote:

> Thanks a lot, shree.  It seems you know everything.
>
> I tried MICR0.traineddata and the first two mcr.traineddata files.  The
> last one was blocked by the browser.  Each traineddata gave mixed
> results.  All of them read the symbols fairly well, but they insert spaces
> at random and misread some digits.
>
> MICR0 seems the best among them.  Did you suggest that you'd be able to
> update it?  It very often reads a triple D where there's only one, and so on.
>
> Also, I tried to fine-tune from MICR0 but found that I need to change
> language-specific.sh.  It specifies some parameters for each language.
> Do you have any guidance on that?
>
> On Friday, June 14, 2019 at 1:48:40 UTC+9, shree wrote:
>>
>> see
>> http://www.devscope.net/Content/ocrchecks.aspx
>> https://github.com/BigPino67/Tesseract-MICR-OCR
>> https://groups.google.com/d/msg/tesseract-ocr/obWI4cz8rXg/6l82hEySgOgJ
>>
>> On Mon, Jun 10, 2019 at 11:21 AM ElGato ElMago <[email protected]>
>> wrote:
>>
>>> It would be nice if there were traineddata out there, but I didn't find
>>> any.  I see free fonts and commercial OCR software but no traineddata.
>>> The tessdata repository obviously doesn't have one, either.
>>>
>>> On Saturday, June 8, 2019 at 1:52:10 UTC+9, shree wrote:
>>>>
>>>> Please also search for existing MICR traineddata files.
>>>>
>>>> On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago <[email protected]>
>>>> wrote:
>>>>
>>>>> So I did several tests from scratch.  In the last attempt, I made a
>>>>> training text with 4,000 lines in the following format,
>>>>>
>>>>> 110004310510<   <02 :4002=0181:801= 0008752 <00039 ;0000001000;
>>>>>
>>>>>
>>>>> and combined it with eng.digits.training_text in which symbols are
>>>>> converted to E13B symbols.  This makes about 12,000 lines of training
>>>>> text.  It's amazing that this thing generates a good reader out of
>>>>> nowhere.  But then it is not very good.  For example:
>>>>>
>>>>> <01 :1901=1386:021= 1111001<10001< ;0000090134;
>>>>>
>>>>> is the result for the attached image.  It's close, but the last '<' in
>>>>> the result text doesn't exist in the image.  It's a small failure but it
>>>>> causes greater trouble in parsing.
>>>>>
>>>>> What would you suggest from here to increase accuracy?
>>>>>
>>>>>    - Increase the number of lines in the training text
>>>>>    - Mix up more variations in the training text
>>>>>    - Increase the number of iterations
>>>>>    - Investigate wrong reads one by one
>>>>>    - Or else?
>>>>>
>>>>> Also, I referred to engrestrict*.* and could generate a similar result
>>>>> with the fine-tuning-from-full method.  It seems a bit faster to get to
>>>>> the same level but it also stops at a 'good' level.  I can go either way
>>>>> if it takes me to the bright future.
>>>>>
>>>>> Regards,
>>>>> ElMagoElGato
>>>>>
>>>>> On Thursday, May 30, 2019 at 15:56:02 UTC+9, ElGato ElMago wrote:
>>>>>>
>>>>>> Thanks a lot, Shree. I'll look into it.
>>>>>>
>>>>>> On Thursday, May 30, 2019 at 14:39:52 UTC+9, shree wrote:
>>>>>>>
>>>>>>> See https://github.com/Shreeshrii/tessdata_shreetest
>>>>>>>
>>>>>>> Look at the files engrestrict*.* and also
>>>>>>> https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text
>>>>>>>
>>>>>>> Create a training text of about 100 lines and fine-tune for 400 iterations.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, May 30, 2019 at 9:38 AM ElGato ElMago <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I had about 14 lines as attached.  How many lines would you
>>>>>>>> recommend?
>>>>>>>>
>>>>>>>> Fine-tuning gives a much better result, but it tends to pick
>>>>>>>> characters outside E13B, which has only 14 characters: 0 through 9
>>>>>>>> and 4 symbols.  I thought training from scratch would eliminate such
>>>>>>>> confusion.
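
A minimal sketch for flagging such out-of-alphabet reads before parsing. It assumes the 14 characters are encoded as in the sample lines in this thread: the digits 0-9 plus the ASCII stand-ins < : ; = for the four symbols, with spaces allowed between fields. The function name e13b_charset_ok is hypothetical.

```shell
# Succeeds only when the line is non-empty and contains nothing but the
# 14 E13B characters (as encoded in this thread) and spaces.
e13b_charset_ok() {
  printf '%s\n' "$1" | grep -Eq '^[0-9<:;= ]+$'
}

# Example: a result line quoted in this thread, and a line with a stray letter.
for line in '<01 :1901=1386:021= 1111001<10001< ;0000090134;' '12D34'; do
  if e13b_charset_ok "$line"; then
    echo "ok:     $line"
  else
    echo "reject: $line"
  fi
done
```

This only catches characters outside the alphabet. A spurious but valid symbol, such as an extra '<', still passes and needs a layout-level check.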
>>>>>>>>
>>>>>>>> On Thursday, May 30, 2019 at 10:43:08 UTC+9, shree wrote:
>>>>>>>>>
>>>>>>>>> For training from scratch a large training text and hundreds of
>>>>>>>>> thousands of iterations are recommended.
>>>>>>>>>
>>>>>>>>> If you are just fine-tuning for a font, try following the
>>>>>>>>> tutorial's instructions on fine-tuning for Impact, using your font.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, 30 May 2019, 06:05 ElGato ElMago, <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks, Shree.
>>>>>>>>>>
>>>>>>>>>> Yes, I saw the instruction.  The steps I made are as follows:
>>>>>>>>>>
>>>>>>>>>> Using tesstrain.sh:
>>>>>>>>>> src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng \
>>>>>>>>>>   --linedata_only \
>>>>>>>>>>   --noextract_font_properties --langdata_dir ../langdata \
>>>>>>>>>>   --tessdata_dir ./tessdata \
>>>>>>>>>>   --fontlist "E13Bnsd" --output_dir ~/tesstutorial/e13beval \
>>>>>>>>>>   --training_text ../langdata/eng/eng.training_e13b_text
>>>>>>>>>>
>>>>>>>>>> Training from scratch:
>>>>>>>>>> mkdir -p ~/tesstutorial/e13boutput
>>>>>>>>>> src/training/lstmtraining --debug_interval 100 \
>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>   --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/base --learning_rate 20e-4 \
>>>>>>>>>>   --train_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>   --max_iterations 5000 &>~/tesstutorial/e13boutput/basetrain.log
>>>>>>>>>>
>>>>>>>>>> Test with base_checkpoint:
>>>>>>>>>> src/training/lstmeval --model ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt
>>>>>>>>>>
>>>>>>>>>> Combining output files:
>>>>>>>>>> src/training/lstmtraining --stop_training \
>>>>>>>>>>   --continue_from ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/eng.traineddata
>>>>>>>>>>
>>>>>>>>>> Test with eng.traineddata:
>>>>>>>>>> tesseract e13b.png out --tessdata-dir /home/koichi/tesstutorial/e13boutput
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The training from scratch ended as:
>>>>>>>>>>
>>>>>>>>>> At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char train=0%,
>>>>>>>>>> word train=0%, skip ratio=0%,  New best char error = 0 wrote best
>>>>>>>>>> model:/home/koichi/tesstutorial/e13boutput/base0_561.checkpoint wrote
>>>>>>>>>> checkpoint.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The test with base_checkpoint returns nothing as:
>>>>>>>>>>
>>>>>>>>>> At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The test with eng.traineddata and e13b.png returns out.txt.  Both
>>>>>>>>>> files are attached.
>>>>>>>>>>
>>>>>>>>>> Training seems to have worked fine.  I don't know how to
>>>>>>>>>> interpret the test result from base_checkpoint.  The generated
>>>>>>>>>> eng.traineddata obviously doesn't work well. I suspect the choice of
>>>>>>>>>> --traineddata when combining the output files is bad, but I have no clue.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> ElMagoElGato
>>>>>>>>>>
>>>>>>>>>> BTW, I referred to your tess4training in the process.  It helped
>>>>>>>>>> a lot.
>>>>>>>>>>
>>>>>>>>>> On Wednesday, May 29, 2019 at 19:14:08 UTC+9, shree wrote:
>>>>>>>>>>>
>>>>>>>>>>> see
>>>>>>>>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files
>>>>>>>>>>>
>>>>>>>>>>> On Wed, May 29, 2019 at 3:18 PM ElGato ElMago <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I would like to create traineddata for the E13B font.
>>>>>>>>>>>>
>>>>>>>>>>>> I read the training tutorial and made a base_checkpoint file
>>>>>>>>>>>> according to the method in Training From Scratch.  Now, how can I
>>>>>>>>>>>> make a traineddata file from the base_checkpoint file?
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> You received this message because you are subscribed to the
>>>>>>>>>>>> Google Groups "tesseract-ocr" group.
>>>>>>>>>>>> To unsubscribe from this group and stop receiving emails from
>>>>>>>>>>>> it, send an email to [email protected].
>>>>>>>>>>>> To post to this group, send email to [email protected]
>>>>>>>>>>>> .
>>>>>>>>>>>> Visit this group at
>>>>>>>>>>>> https://groups.google.com/group/tesseract-ocr.
>>>>>>>>>>>> To view this discussion on the web visit
>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/4848cfa5-ae2b-4be3-a771-686aa0fec702%40googlegroups.com
>>>>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/4848cfa5-ae2b-4be3-a771-686aa0fec702%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>>>>> .
>>>>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>>
>>>>>>>>>>> ____________________________________________________________
>>>>>>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>>>>>>>


-- 

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

