Re: [tesseract-ocr] Trained data for E13B font

ElGato ElMago Mon, 17 Jun 2019 18:05:38 -0700

I guess the content of the training text is important when you add new 
characters.  I had the same issue at first, and then shree suggested 
a larger text and more iterations.  I thought variation in the text would 
matter as well.  I'm getting good results now that I have prepared a good 
training text.
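
To make the point about size and variation concrete, here is a minimal Python sketch that generates randomized lines over the 14 E13B characters.  It assumes the ASCII stand-ins '<', ':', '=' and ';' seen in the sample lines quoted further down; the function names, field lengths and field layouts are purely illustrative and are not the generator behind the results above.

```python
import random

# The 14 E13B glyphs as typed in this thread's sample lines: ten digits plus
# four ASCII stand-ins for the MICR symbols (assumed; the mapping depends on the font).
DIGITS = "0123456789"
SYMBOLS = "<:=;"


def random_field(rng, min_len=2, max_len=12):
    """Return a run of digits with a symbol before, after, around, or not at all."""
    length = rng.randint(min_len, max_len)
    digits = "".join(rng.choice(DIGITS) for _ in range(length))
    sym = rng.choice(SYMBOLS)
    style = rng.randrange(4)
    if style == 0:
        return digits
    if style == 1:
        return sym + digits
    if style == 2:
        return digits + sym
    return sym + digits + sym


def make_training_text(n_lines, seed=0, fields_per_line=(3, 6)):
    """Build n_lines of space-separated random fields, reproducible via seed."""
    rng = random.Random(seed)
    lines = []
    for _ in range(n_lines):
        count = rng.randint(*fields_per_line)
        lines.append(" ".join(random_field(rng) for _ in range(count)))
    return lines


if __name__ == "__main__":
    for line in make_training_text(5):
        print(line)
```

Redirecting the output to a file such as eng.training_e13b_text gives a text that tesstrain.sh can render through --training_text.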


Now, both training from scratch and fine tuning are giving decent results.  
I'm working on the E13B font, which the existing eng.traineddata never reads, 
so this proves the training really works.  My issue now is to bring the 
accuracy to a higher level.  I have yet to try the last suggestion from shree, 
but I know there is a long way to go to reach extreme accuracy.

On Monday, June 17, 2019 at 1:40:10 PM UTC+9, Phuc wrote:

> Sorry if I interrupted your conversation.
> I have a similar problem: the .traineddata I exported from the checkpoint 
> file did not recognize any characters at all, although my training showed 
> very good results.
> As I understand from your conversation, is this because of training from 
> scratch? Is fine-tuning a model all I need to do to get better results?
> Also, I am quite confused about why the result from the checkpoint file is 
> so different from the .traineddata, and I would appreciate it if someone 
> could explain the reason.
>
> For more information about my case, you can refer to my post here: 
> https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/tesseract-ocr/74xMXlYX6T0
> Thank you and have a nice day
>
> On Friday, June 14, 2019 at 7:58:49 PM UTC+9, shree wrote:
>>
>> See https://github.com/Shreeshrii/tessdata_MICR
>>
>> I have uploaded my files there. 
>>
>> https://github.com/Shreeshrii/tessdata_MICR/blob/master/MICR.sh
>> is the bash script that runs the training.
>>
>> You can modify as needed. Please note this is for legacy/base tesseract 
>> --oem 0.
>>
>> On Fri, Jun 14, 2019 at 1:26 PM ElGato ElMago <[email protected]> 
>> wrote:
>>
>>> Thanks a lot, shree.  It seems you know everything.
>>>
>>> I tried MICR0.traineddata and the first two mcr.traineddata files.  The 
>>> last one was blocked by the browser.  Each traineddata gave mixed 
>>> results.  All of them read the symbols fairly well but insert spaces 
>>> randomly and read some numbers wrong.
>>>
>>> MICR0 seems to be the best among them.  Did you suggest that you'd be 
>>> able to update it?  It very often reads a triple D where there's only 
>>> one, and so on.
>>>
>>> Also, I tried to fine tune from MICR0 but I found that I need to change 
>>> the language-specific.sh.  It specifies some parameters for each language.  
>>> Do you have any guidance for it?
>>>
>>> On Friday, June 14, 2019 at 1:48:40 AM UTC+9, shree wrote:
>>>>
>>>> see 
>>>> http://www.devscope.net/Content/ocrchecks.aspx 
>>>> https://github.com/BigPino67/Tesseract-MICR-OCR
>>>> https://groups.google.com/d/msg/tesseract-ocr/obWI4cz8rXg/6l82hEySgOgJ 
>>>>
>>>> On Mon, Jun 10, 2019 at 11:21 AM ElGato ElMago <[email protected]> 
>>>> wrote:
>>>>
>>>>> It would be nice if there were traineddata out there, but I didn't 
>>>>> find any.  I see free fonts and commercial OCR software but no 
>>>>> traineddata.  The tessdata repository obviously doesn't have one, either.
>>>>>
>>>>> On Saturday, June 8, 2019 at 1:52:10 AM UTC+9, shree wrote:
>>>>>>
>>>>>> Please also search for existing MICR traineddata files.
>>>>>>
>>>>>> On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago <[email protected]> 
>>>>>> wrote:
>>>>>>
>>>>>>> So I did several tests from scratch.  In the last attempt, I made a 
>>>>>>> training text with 4,000 lines in the following format,
>>>>>>>
>>>>>>> 110004310510<   <02 :4002=0181:801= 0008752 <00039 ;0000001000;
>>>>>>>
>>>>>>>
>>>>>>> and combined it with eng.digits.training_text in which symbols are 
>>>>>>> converted to E13B symbols.  This makes about 12,000 lines of training 
>>>>>>> text.  It's amazing that this thing generates a good reader out of 
>>>>>>> nowhere.  But then it is not very good.  For example:
>>>>>>>
>>>>>>> <01 :1901=1386:021= 1111001<10001< ;0000090134;
>>>>>>>
>>>>>>> is a result on the image attached.  It's close, but the last '<' in 
>>>>>>> the result text doesn't exist on the image.  It's a small failure, 
>>>>>>> but it causes greater trouble in parsing.
>>>>>>>
>>>>>>> What would you suggest from here to increase accuracy?  
>>>>>>>
>>>>>>>    - Increase the number of lines in the training text
>>>>>>>    - Mix up more variations in the training text
>>>>>>>    - Increase the number of iterations
>>>>>>>    - Investigate wrong reads one by one
>>>>>>>    - Or else?
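
For the "investigate wrong reads one by one" option, a minimal Python sketch using the standard difflib module: it lists where an OCR result departs from a ground-truth line.  The ground-truth string below is an assumption, reconstructed by dropping the extra '<' described above, and diff_reads is a hypothetical helper name.

```python
import difflib


def diff_reads(expected, actual):
    """Return (operation, expected_piece, actual_piece, position) per mismatch.

    operation is 'insert', 'delete' or 'replace' as reported by difflib;
    position is the index in the expected string where the mismatch starts.
    """
    matcher = difflib.SequenceMatcher(None, expected, actual, autojunk=False)
    return [
        (tag, expected[i1:i2], actual[j1:j2], i1)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    # Assumed ground truth: the OCR result minus the spurious '<'.
    expected = "<01 :1901=1386:021= 1111001<10001 ;0000090134;"
    # OCR result as posted in the thread.
    actual = "<01 :1901=1386:021= 1111001<10001< ;0000090134;"
    for mismatch in diff_reads(expected, actual):
        print(mismatch)
```

Running this over every evaluation line and counting the mismatches per character would show whether the errors cluster around particular symbols or around spacing.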
>>>>>>>
>>>>>>> Also, I referred to engrestrict*.* and could generate a similar 
>>>>>>> result with the fine-tuning-from-full method.  It seems to reach the 
>>>>>>> same level a bit faster, but it also stops at a 'good' level.  I can 
>>>>>>> go either way if it takes me to the bright future.
>>>>>>>
>>>>>>> Regards,
>>>>>>> ElMagoElGato
>>>>>>>
>>>>>>> On Thursday, May 30, 2019 at 3:56:02 PM UTC+9, ElGato ElMago wrote:
>>>>>>>>
>>>>>>>> Thanks a lot, Shree. I'll look into it.
>>>>>>>>
>>>>>>>> On Thursday, May 30, 2019 at 2:39:52 PM UTC+9, shree wrote:
>>>>>>>>>
>>>>>>>>> See https://github.com/Shreeshrii/tessdata_shreetest
>>>>>>>>>
>>>>>>>>> Look at the files engrestrict*.* and also 
>>>>>>>>> https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text
>>>>>>>>>
>>>>>>>>> Create a training text of about 100 lines and fine-tune for 400 iterations.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, May 30, 2019 at 9:38 AM ElGato ElMago <[email protected]> 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I had about 14 lines as attached.  How many lines would you 
>>>>>>>>>> recommend?
>>>>>>>>>>
>>>>>>>>>> Fine tuning gives much better results, but it tends to pick 
>>>>>>>>>> characters outside E13B, which has only 14 characters: 0 through 9 
>>>>>>>>>> and 4 symbols.  I thought training from scratch would eliminate such 
>>>>>>>>>> confusion.
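
Whichever way the model is trained, stray characters are easy to flag after the fact.  A minimal Python sketch, assuming the four symbols are typed as '<', ':', '=' and ';' as in the sample lines elsewhere in this thread; foreign_characters is a hypothetical helper name.

```python
# The 14-character E13B set: ten digits plus four ASCII stand-ins for the
# MICR symbols (assumed; the mapping depends on the font).
E13B_CHARS = frozenset("0123456789<:=;")


def foreign_characters(text, allowed=E13B_CHARS):
    """Return (index, char) for every non-space character outside the allowed set."""
    return [
        (index, char)
        for index, char in enumerate(text)
        if not char.isspace() and char not in allowed
    ]


if __name__ == "__main__":
    # Hypothetical OCR output in which a letter O was read instead of a zero.
    print(foreign_characters("<01 :19O1="))
```

An empty list means the line stayed within the E13B set; anything else points at the positions where the model fell back to its original alphabet.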
>>>>>>>>>>
>>>>>>>>>> On Thursday, May 30, 2019 at 10:43:08 AM UTC+9, shree wrote:
>>>>>>>>>>>
>>>>>>>>>>> For training from scratch a large training text and hundreds of 
>>>>>>>>>>> thousands of iterations are recommended. 
>>>>>>>>>>>
>>>>>>>>>>> If you are just fine tuning for a font, try to follow the 
>>>>>>>>>>> instructions for fine tuning for Impact, using your font.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 30 May 2019, 06:05 ElGato ElMago, <[email protected]> 
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks, Shree.
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, I saw the instructions.  The steps I took are as follows:
>>>>>>>>>>>>
>>>>>>>>>>>> Using tesstrain.sh:
>>>>>>>>>>>> src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
>>>>>>>>>>>>   --noextract_font_properties --langdata_dir ../langdata \
>>>>>>>>>>>>   --tessdata_dir ./tessdata \
>>>>>>>>>>>>   --fontlist "E13Bnsd" --output_dir ~/tesstutorial/e13beval \
>>>>>>>>>>>>   --training_text ../langdata/eng/eng.training_e13b_text
>>>>>>>>>>>>
>>>>>>>>>>>> Training from scratch:
>>>>>>>>>>>> mkdir -p ~/tesstutorial/e13boutput
>>>>>>>>>>>> src/training/lstmtraining --debug_interval 100 \
>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>   --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
>>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/base --learning_rate 20e-4 \
>>>>>>>>>>>>   --train_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>>   --max_iterations 5000 &>~/tesstutorial/e13boutput/basetrain.log
>>>>>>>>>>>>
>>>>>>>>>>>> Test with base_checkpoint:
>>>>>>>>>>>> src/training/lstmeval --model ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt
>>>>>>>>>>>>
>>>>>>>>>>>> Combining output files:
>>>>>>>>>>>> src/training/lstmtraining --stop_training \
>>>>>>>>>>>>   --continue_from ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/eng.traineddata
>>>>>>>>>>>>
>>>>>>>>>>>> Test with eng.traineddata:
>>>>>>>>>>>> tesseract e13b.png out --tessdata-dir /home/koichi/tesstutorial/e13boutput
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The training from scratch ended as:
>>>>>>>>>>>>
>>>>>>>>>>>> At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char train=0%, word train=0%, skip ratio=0%,  New best char error = 0 wrote best model:/home/koichi/tesstutorial/e13boutput/base0_561.checkpoint wrote checkpoint.
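
When running more iterations, it helps to follow these progress lines over the whole log rather than only the last one.  A minimal Python sketch, assuming the log keeps the exact wording of the line above; parse_progress and the dictionary keys are hypothetical names, and the three iteration counters are kept as a plain tuple without interpreting them.

```python
import re

NUMBER = r"(\d+(?:\.\d+)?)"

# Pattern modelled on the progress line quoted in this thread.
PROGRESS_RE = re.compile(
    r"At iteration (\d+)/(\d+)/(\d+), "
    r"Mean rms=" + NUMBER + r"%, delta=" + NUMBER + r"%, "
    r"char train=" + NUMBER + r"%, word train=" + NUMBER + r"%"
)


def parse_progress(log_text):
    """Return one dict per progress line found in the training log text."""
    rows = []
    for match in PROGRESS_RE.finditer(log_text):
        rows.append({
            "iterations": tuple(int(match.group(i)) for i in (1, 2, 3)),
            "mean_rms": float(match.group(4)),
            "delta": float(match.group(5)),
            "char_train": float(match.group(6)),
            "word_train": float(match.group(7)),
        })
    return rows


if __name__ == "__main__":
    sample = (
        "At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, "
        "char train=0%, word train=0%, skip ratio=0%,  New best char error = 0"
    )
    for row in parse_progress(sample):
        print(row)
```

Feeding it the contents of basetrain.log gives a list that can be plotted or scanned for the point where the character error stops improving.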
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The test with base_checkpoint returns only this:
>>>>>>>>>>>>
>>>>>>>>>>>> At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The test with eng.traineddata and e13b.png returns out.txt.  
>>>>>>>>>>>> Both files are attached.
>>>>>>>>>>>>
>>>>>>>>>>>> Training seems to have worked fine.  I don't know how to 
>>>>>>>>>>>> interpret the test result from base_checkpoint.  The generated 
>>>>>>>>>>>> eng.traineddata obviously doesn't work well.  I suspect the choice 
>>>>>>>>>>>> of --traineddata when combining the output files is bad, but I 
>>>>>>>>>>>> have no clue.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>
>>>>>>>>>>>> BTW, I referred to your tess4training in the process.  It 
>>>>>>>>>>>> helped a lot.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wednesday, May 29, 2019 at 7:14:08 PM UTC+9, shree wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> see 
>>>>>>>>>>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, May 29, 2019 at 3:18 PM ElGato ElMago <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I wish to make trained data for the E13B font.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I read the training tutorial and made a base_checkpoint file 
>>>>>>>>>>>>>> according to the method in Training From Scratch.  Now, how can 
>>>>>>>>>>>>>> I make a traineddata file from the base_checkpoint file?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>> You received this message because you are subscribed to the 
>>>>>>>>>>>>>> Google Groups "tesseract-ocr" group.
>>>>>>>>>>>>>> To unsubscribe from this group and stop receiving emails from 
>>>>>>>>>>>>>> it, send an email to [email protected].
>>>>>>>>>>>>>> To post to this group, send email to 
>>>>>>>>>>>>>> [email protected].
>>>>>>>>>>>>>> Visit this group at 
>>>>>>>>>>>>>> https://groups.google.com/group/tesseract-ocr.
>>>>>>>>>>>>>> To view this discussion on the web visit 
>>>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/4848cfa5-ae2b-4be3-a771-686aa0fec702%40googlegroups.com
>>>>>>>>>>>>>>  
>>>>>>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/4848cfa5-ae2b-4be3-a771-686aa0fec702%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>>>>>>> .
>>>>>>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>
>>>>>>>>>>>>> ____________________________________________________________
>>>>>>>>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>>>>>>>>>

