Re: [tesseract-ocr] Trained data for E13B font

Phuc Sun, 16 Jun 2019 21:40:41 -0700

Sorry to interrupt your conversation.
I have a similar problem: the .traineddata I exported from a checkpoint 
file did not recognize any characters at all, although my training showed 
very good results.
From what I understand of your conversation, is this because I trained 
from scratch? Would fine-tuning an existing model be all I need to get 
better results?
Also, I am confused about why the results from the checkpoint file are so 
different from those of the .traineddata, and I would appreciate it if 
someone could explain why.


For more information about my case, please refer to my post here: 
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/tesseract-ocr/74xMXlYX6T0
Thank you and have a nice day.

On Friday, June 14, 2019 at 7:58:49 PM UTC+9, shree wrote:
>
> See https://github.com/Shreeshrii/tessdata_MICR
>
> I have uploaded my files there. 
>
> https://github.com/Shreeshrii/tessdata_MICR/blob/master/MICR.sh
> is the bash script that runs the training.
>
> You can modify as needed. Please note this is for legacy/base tesseract 
> --oem 0.
>
> On Fri, Jun 14, 2019 at 1:26 PM ElGato ElMago <[email protected]> wrote:
>
>> Thanks a lot, shree.  It seems you know everything.
>>
>> I tried the MICR0.traineddata and the first two mcr.traineddata.  The 
>> last one was blocked by the browser.  Each of the traineddata had mixed 
>> results.  All of them read the symbols fairly well but insert spaces 
>> randomly and read some numbers wrong.
>>
>> MICR0 seems the best among them.  Did you suggest that you'd be able to 
>> update it?  It gets a triple D very often where there's only one, and so on.
>>
>> Also, I tried to fine tune from MICR0 but I found that I need to change 
>> the language-specific.sh.  It specifies some parameters for each language.  
>> Do you have any guidance for it?
>>
>> On Friday, June 14, 2019 at 1:48:40 AM UTC+9, shree wrote:
>>>
>>> see 
>>> http://www.devscope.net/Content/ocrchecks.aspx 
>>> https://github.com/BigPino67/Tesseract-MICR-OCR
>>> https://groups.google.com/d/msg/tesseract-ocr/obWI4cz8rXg/6l82hEySgOgJ 
>>>
>>> On Mon, Jun 10, 2019 at 11:21 AM ElGato ElMago <[email protected]> 
>>> wrote:
>>>
>>>> It would be nice if there were traineddata out there, but I didn't find 
>>>> any.  I see free fonts and commercial OCR software but not traineddata.  
>>>> The tessdata repository obviously doesn't have one, either.
>>>>
>>>> On Saturday, June 8, 2019 at 1:52:10 AM UTC+9, shree wrote:
>>>>>
>>>>> Please also search for existing MICR traineddata files.
>>>>>
>>>>> On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago <[email protected]> 
>>>>> wrote:
>>>>>
>>>>>> So I did several tests from scratch.  In the last attempt, I made a 
>>>>>> training text with 4,000 lines in the following format,
>>>>>>
>>>>>> 110004310510<   <02 :4002=0181:801= 0008752 <00039 ;0000001000;
>>>>>>
>>>>>>
>>>>>> and combined it with eng.digits.training_text in which symbols are 
>>>>>> converted to E13B symbols.  This makes about 12,000 lines of training 
>>>>>> text.  It's amazing that this thing generates a good reader out of 
>>>>>> nowhere.  But then it is not very good.  For example:
>>>>>>
>>>>>> <01 :1901=1386:021= 1111001<10001< ;0000090134;
>>>>>>
>>>>>> is the result for the attached image.  It's close, but the last '<' in 
>>>>>> the result text doesn't exist in the image.  It's a small failure but it 
>>>>>> causes greater trouble in parsing.
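
A minimal sketch of a sanity check that could run on each recognized line before parsing. It assumes the four E13B symbols are transcribed as < : = ; as in the sample lines quoted here; the function names are hypothetical. It cannot detect a spurious '<' by itself, since '<' is a legal character. It only rejects characters outside the set and reports per-symbol counts that a parser could compare against whatever layout it expects.

```shell
# Assumption: the four E13B symbols are transcribed as < : = ; (taken from the
# sample lines in this thread). Both function names are hypothetical.

# Succeed only if the line is non-empty and made of E13B characters and spaces.
is_e13b_line() {
  printf '%s\n' "$1" | grep -Eq '^[0-9<:=; ]+$'
}

# Print how many times the single character $2 occurs in the text $1.
count_char() {
  printf '%s' "$1" | tr -cd "$2" | wc -c | tr -d ' '
}

line='<01 :1901=1386:021= 1111001<10001< ;0000090134;'
if is_e13b_line "$line"; then
  echo "charset ok, '<' count: $(count_char "$line" '<')"
else
  echo "unexpected character in: $line"
fi
```

A parser that knows how many of each symbol a valid line should contain could refuse lines whose counts differ instead of mis-splitting the fields.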
>>>>>>
>>>>>> What would you suggest from here to increase accuracy?  
>>>>>>
>>>>>>    - Increase the number of lines in the training text
>>>>>>    - Mix up more variations in the training text
>>>>>>    - Increase the number of iterations
>>>>>>    - Investigate wrong reads one by one
>>>>>>    - Or else?
>>>>>>
>>>>>> Also, I referred to engrestrict*.* and could generate a similar result 
>>>>>> with the fine-tuning-from-full method.  It seems a bit faster to get to 
>>>>>> the same level but it also stops at a 'good' level.  I can go either 
>>>>>> way if it takes me to a bright future.
>>>>>>
>>>>>> Regards,
>>>>>> ElMagoElGato
>>>>>>
>>>>>> On Thursday, May 30, 2019 at 3:56:02 PM UTC+9, ElGato ElMago wrote:
>>>>>>>
>>>>>>> Thanks a lot, Shree. I'll look into it.
>>>>>>>
>>>>>>> On Thursday, May 30, 2019 at 2:39:52 PM UTC+9, shree wrote:
>>>>>>>>
>>>>>>>> See https://github.com/Shreeshrii/tessdata_shreetest
>>>>>>>>
>>>>>>>> Look at the files engrestrict*.* and also 
>>>>>>>> https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text
>>>>>>>>
>>>>>>>> Create training text of about 100 lines and finetune for 400 lines 
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, May 30, 2019 at 9:38 AM ElGato ElMago <[email protected]> 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I had about 14 lines as attached.  How many lines would you 
>>>>>>>>> recommend?
>>>>>>>>>
>>>>>>>>> Fine tuning gives much better results but it tends to pick 
>>>>>>>>> characters outside E13B, which has only 14 characters: 0 through 9 
>>>>>>>>> and 4 symbols.  I thought training from scratch would eliminate such 
>>>>>>>>> confusion.
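
One way to check that a training text really stays within those 14 characters is to list the distinct characters it uses. This is a rough sketch: the function name is hypothetical, it handles single-byte characters only, and it assumes (not confirmed in this thread) that the character set of a from-scratch model is derived from the training text.

```shell
# Hypothetical helper: read text on stdin and print every distinct
# non-whitespace character once, sorted by byte value, on a single line.
# Only reliable for single-byte (ASCII) text, because fold -w1 splits by column.
distinct_chars() {
  tr -d ' \n\r\t' | fold -w1 | LC_ALL=C sort -u | tr -d '\n'
  echo
}

# Example with two short lines in the style of the training text.
printf '<01 :1901=\n;00;\n' | distinct_chars
```

Running it over the whole training text file (for example `distinct_chars < eng.training_e13b_text`) should print at most the ten digits and the four symbol stand-ins; anything else points to a stray character in the text.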
>>>>>>>>>
>>>>>>>>> On Thursday, May 30, 2019 at 10:43:08 AM UTC+9, shree wrote:
>>>>>>>>>>
>>>>>>>>>> For training from scratch a large training text and hundreds of 
>>>>>>>>>> thousands of iterations are recommended. 
>>>>>>>>>>
>>>>>>>>>> If you are just fine tuning for a font, try to follow the 
>>>>>>>>>> instructions for training for Impact, with your font.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, 30 May 2019, 06:05 ElGato ElMago, <[email protected]> 
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks, Shree.
>>>>>>>>>>>
>>>>>>>>>>> Yes, I saw the instructions.  The steps I took are as follows:
>>>>>>>>>>>
>>>>>>>>>>> Using tesstrain.sh:
>>>>>>>>>>> src/training/tesstrain.sh --fonts_dir /usr/share/fonts \
>>>>>>>>>>>   --lang eng --linedata_only \
>>>>>>>>>>>   --noextract_font_properties --langdata_dir ../langdata \
>>>>>>>>>>>   --tessdata_dir ./tessdata \
>>>>>>>>>>>   --fontlist "E13Bnsd" --output_dir ~/tesstutorial/e13beval \
>>>>>>>>>>>   --training_text ../langdata/eng/eng.training_e13b_text
>>>>>>>>>>>
>>>>>>>>>>> Training from scratch:
>>>>>>>>>>> mkdir -p ~/tesstutorial/e13boutput
>>>>>>>>>>> src/training/lstmtraining --debug_interval 100 \
>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>   --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/base \
>>>>>>>>>>>   --learning_rate 20e-4 \
>>>>>>>>>>>   --train_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>   --max_iterations 5000 &>~/tesstutorial/e13boutput/basetrain.log
>>>>>>>>>>>
>>>>>>>>>>> Test with base_checkpoint:
>>>>>>>>>>> src/training/lstmeval \
>>>>>>>>>>>   --model ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt
>>>>>>>>>>>
>>>>>>>>>>> Combining output files:
>>>>>>>>>>> src/training/lstmtraining --stop_training \
>>>>>>>>>>>   --continue_from ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/eng.traineddata
>>>>>>>>>>>
>>>>>>>>>>> Test with eng.traineddata:
>>>>>>>>>>> tesseract e13b.png out \
>>>>>>>>>>>   --tessdata-dir /home/koichi/tesstutorial/e13boutput
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> The training from scratch ended as:
>>>>>>>>>>>
>>>>>>>>>>> At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char train=0%, word train=0%, skip ratio=0%,  New best char error = 0 wrote best model:/home/koichi/tesstutorial/e13boutput/base0_561.checkpoint wrote checkpoint.
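
To follow how the error rate evolves rather than reading only the last line, the "char train" figure can be pulled out of the training log. A small sketch with a hypothetical function name, assuming the progress lines have the format shown in this message. Note that the lstmtraining command in this message passes the same file to --train_listfile and --eval_listfile, so a 0% figure only says the model fits lines it has already seen.

```shell
# Hypothetical helper: read lstmtraining progress lines on stdin and print
# the "char train" percentage found on each line that has one.
char_train_rates() {
  sed -n 's/.*char train=\([0-9.]*\)%.*/\1/p'
}

# Example on a progress line in the format quoted in this message.
printf '%s\n' 'At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char train=0%, word train=0%, skip ratio=0%,' \
  | char_train_rates
```

On the real log this could be used as `char_train_rates < ~/tesstutorial/e13boutput/basetrain.log | tail -n 1` to get the most recent value.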
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> The test with base_checkpoint returns nothing as:
>>>>>>>>>>>
>>>>>>>>>>> At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> The test with eng.traineddata and e13b.png returns out.txt.  
>>>>>>>>>>> Both files are attached.
>>>>>>>>>>>
>>>>>>>>>>> Training seems to have worked fine.  I don't know how to 
>>>>>>>>>>> interpret the test result from base_checkpoint.  The generated 
>>>>>>>>>>> eng.traineddata obviously doesn't work well.  I suspect the choice 
>>>>>>>>>>> of --traineddata in combining the output files is bad but I have 
>>>>>>>>>>> no clue.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>
>>>>>>>>>>> BTW, I referred to your tess4training in the process.  It helped 
>>>>>>>>>>> a lot.
>>>>>>>>>>>
>>>>>>>>>>> On Wednesday, May 29, 2019 at 7:14:08 PM UTC+9, shree wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> see 
>>>>>>>>>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, May 29, 2019 at 3:18 PM ElGato ElMago <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I wish to make traineddata for the E13B font.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I read the training tutorial and made a base_checkpoint file 
>>>>>>>>>>>>> according to the method in Training From Scratch.  Now, how can I 
>>>>>>>>>>>>> make traineddata from the base_checkpoint file?
>>>>>>>>>>>>>
>>>>>>>>>>>>> -- 
>>>>>>>>>>>>> You received this message because you are subscribed to the 
>>>>>>>>>>>>> Google Groups "tesseract-ocr" group.
>>>>>>>>>>>>> To unsubscribe from this group and stop receiving emails from 
>>>>>>>>>>>>> it, send an email to [email protected].
>>>>>>>>>>>>> To post to this group, send email to 
>>>>>>>>>>>>> [email protected].
>>>>>>>>>>>>> Visit this group at 
>>>>>>>>>>>>> https://groups.google.com/group/tesseract-ocr.
>>>>>>>>>>>>> To view this discussion on the web visit 
>>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/4848cfa5-ae2b-4be3-a771-686aa0fec702%40googlegroups.com
>>>>>>>>>>>>>  
>>>>>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/4848cfa5-ae2b-4be3-a771-686aa0fec702%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>>>>>> .
>>>>>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> -- 
>>>>>>>>>>>>
>>>>>>>>>>>> ____________________________________________________________
>>>>>>>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>>>>>>>>

