Re: [tesseract-ocr] Making custom traineddata

shree Tue, 09 Apr 2019 02:16:23 -0700


See https://github.com/Shreeshrii/tessdata_ocrb



I retrained the model to add the missing X, using 3 fonts at 3 exposures and 
a larger training text than in the previous version.
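
A gap like the missing X can be spotted before training by checking which 
characters the training text actually contains. The following is only a 
minimal sketch in Python, assuming the MRZ alphabet is A-Z, 0-9 and the 
filler '<'; the function names are illustrative and are not part of the 
repository. The sample line is one of the fake ID lines quoted in this thread.

```python
import string
from collections import Counter

# Assumption: MRZ lines use only upper-case letters, digits and the '<' filler.
MRZ_ALPHABET = string.ascii_uppercase + string.digits + "<"


def character_coverage(lines):
    """Count how often each MRZ character occurs in the given text lines."""
    counts = Counter()
    for line in lines:
        counts.update(ch for ch in line.strip() if ch in MRZ_ALPHABET)
    return counts


def missing_characters(lines):
    """Return the MRZ characters that never occur in the training text."""
    counts = character_coverage(lines)
    return [ch for ch in MRZ_ALPHABET if counts[ch] == 0]


if __name__ == "__main__":
    sample = ["I<NLDIS2KX87214<<<<<<<<<<<<<<<"]
    print("missing:", "".join(missing_characters(sample)))
```

For a real check, the lines could come from the training text file itself, 
for example missing_characters(open("eng.MRZ.training_text", encoding="utf-8")).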

Both float/best and integer/fast versions are provided.

   - Download best version 
   <https://github.com/Shreeshrii/tessdata_ocrb/raw/master/ocrb.traineddata> - 
   11.1 MB. Use with -l ocrb.
   - Download fast version 
   <https://github.com/Shreeshrii/tessdata_ocrb/raw/master/ocrb_int.traineddata> - 
   1.66 MB. Use with -l ocrb_int.
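
As a rough sketch of how either file might be called from a script (Python 
here): the image name, the tessdata directory and the helper names below are 
placeholders, not something from the repository. It assumes the downloaded 
file has been saved locally and that tesseract 4.x is on the PATH. The value 
given to -l has to match the file name without the .traineddata extension, 
and --psm 6 is only a guess at a sensible page segmentation mode for a block 
of MRZ lines.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image, traineddata, psm=6):
    """Return the argv list for running tesseract with a downloaded model.

    `traineddata` is the path to the .traineddata file; the value given to
    -l is that file's name without the extension.
    """
    model = Path(traineddata)
    return [
        "tesseract",
        str(image),
        "stdout",                              # print recognized text to stdout
        "--tessdata-dir", str(model.parent),   # directory holding the model
        "-l", model.stem,                      # ocrb.traineddata -> ocrb
        "--psm", str(psm),                     # assumption: 6 = uniform text block
    ]


def recognize(image, traineddata):
    """Run tesseract and return the recognized text (needs tesseract installed)."""
    result = subprocess.run(
        build_tesseract_command(image, traineddata),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

For example, recognize("mrz.png", "tessdata/ocrb_int.traineddata") would run 
the fast model on a hypothetical image named mrz.png.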

I would appreciate feedback. If this is useful, we can add it to 
https://github.com/tesseract-ocr/tessdata_contrib


On Monday, April 8, 2019 at 10:45:29 PM UTC+5:30, shree wrote:
>
> If you can provide another 40-50 lines of training data (text file), I will 
> rerun the training.
>
>
> On Mon, 8 Apr 2019, 22:11 Jankees Korstanje wrote:
>
>> Hi Shree,
>>
>> We have tried your traineddata file for MRZ and noticed that it does not 
>> detect the character X.
>>
>> Looking at 
>> https://github.com/Shreeshrii/tessdata_ocrb/blob/master/eng.MRZ.training_text
>> we see that it contains no X.
>>
>> In addition, it might be good to add a couple of lines that are specific 
>> to IDs (starting with I). Note that they are all fake:
>>
>> IDESPANH186495123456789X<<<<<<
>> IXESPE002561410<0233181G<<<<<
>> I<NLDIS2KX87214<<<<<<<<<<<<<<<
>>
>>
>>
>>
>>
>>
>>
>> On Wednesday, 5 September 2018 18:03:41 UTC+2, shree wrote:
>>>
>>> See https://github.com/Shreeshrii/tessdata_ocrb
>>> for the files and traineddata.
>>>
>>>
>>> On Wed, Sep 5, 2018 at 8:51 PM, Shree Devi Kumar <[email protected]> 
>>> wrote:
>>>
>>>> I think fine-tuning will be a better option than training from scratch.
>>>>
>>>> Using a small training/test text - 40 lines, I get
>>>>
>>>> --------------------------------- 
>>>>
>>>> + lstmeval --verbosity 0 \
>>>>     --model /home/ubuntu/tessdata_best/script/Latin.traineddata \
>>>>     --eval_listfile /home/ubuntu/tesstutorial/ocrb/eng.training_files.txt
>>>> Loaded 40/40 pages (1-40) of document /home/ubuntu/tesstutorial/ocrb/eng.OCR-B_10_BT.exp0.lstmf
>>>> Loaded 40/40 pages (1-40) of document /home/ubuntu/tesstutorial/ocrb/eng.OCR_B_MT.exp0.lstmf
>>>> Warning: LSTMTrainer deserialized an LSTMRecognizer!
>>>> At iteration 0, stage 0, *Eval Char error rate=0.73106061*, *Word error rate=13.75*
>>>>
>>>> ---------------------------------
>>>>
>>>> + lstmeval --verbosity 0 \
>>>>     --model /home/ubuntu/tessdata_best/eng.traineddata \
>>>>     --eval_listfile /home/ubuntu/tesstutorial/ocrb/eng.training_files.txt
>>>> Loaded 40/40 pages (1-40) of document /home/ubuntu/tesstutorial/ocrb/eng.OCR-B_10_BT.exp0.lstmf
>>>> Loaded 40/40 pages (1-40) of document /home/ubuntu/tesstutorial/ocrb/eng.OCR_B_MT.exp0.lstmf
>>>> Warning: LSTMTrainer deserialized an LSTMRecognizer!
>>>> At iteration 0, stage 0, *Eval Char error rate=47.444889, Word error rate=92.5*
>>>>
>>>>
>>>> ---------------------------------
>>>>
>>>> *At iteration 16/410/410, Mean rms=0.236%, delta=0.131%, char train=0.448%, word train=3.659%, skip ratio=0%, New best char error = 0.448 wrote checkpoint.*
>>>>
>>>> *Finished! Error rate = 0.448*
>>>>
>>>>
>>>> ---------------------------------
>>>>
>>>>
>>>> + lstmeval \
>>>>     --model /home/ubuntu/tesstutorial/ocrb_from_full/ocrb_plus_checkpoint \
>>>>     --traineddata /home/ubuntu/tesstutorial/ocrb/eng/eng.traineddata \
>>>>     --eval_listfile /home/ubuntu/tesstutorial/ocrb/eng.training_files.txt
>>>> /home/ubuntu/tesstutorial/ocrb_from_full/ocrb_plus_checkpoint is not a recognition model, trying training checkpoint...
>>>> Loaded 40/40 pages (1-40) of document /home/ubuntu/tesstutorial/ocrb/eng.OCR-B_10_BT.exp0.lstmf
>>>> Loaded 40/40 pages (1-40) of document /home/ubuntu/tesstutorial/ocrb/eng.OCR_B_MT.exp0.lstmf
>>>> At iteration 0, stage 0, *Eval Char error rate=0, Word error rate=0*
>>>>
>>>> --------------------------------- 
>>>>
>>>> On Wed, Sep 5, 2018 at 1:55 PM, <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> (I might butcher English grammar; you have been warned!)
>>>>>
>>>>> For some time I have been trying to teach tesseract to read MRZ 
>>>>> codes. Unfortunately, it is not going very well. I'm using the latest 
>>>>> version of tesseract (4.0), so I'm trying to train it with the LSTM 
>>>>> method. I managed to pull it off and got some custom traineddata 
>>>>> samples, but the results of using them are... let's say slightly 
>>>>> unsatisfying. As a matter of fact, they are not even remotely close to 
>>>>> the eng traineddata. I know that there was an mrz traineddata in the 
>>>>> previous version of tesseract.
>>>>>
>>>>> I'm out of ideas on how to improve accuracy, so I'll need your help, guys.
>>>>>
>>>>> At first I thought I could use images, .txt files containing the 
>>>>> already-read data, and font data to somehow make box files (basically, 
>>>>> you have an image and a .txt containing everything read from the image). 
>>>>> I was disappointed when I realized that, without manual correction of 
>>>>> the boxes, tesseract won't know how to apply them correctly. Of course, 
>>>>> I need an automated method to apply the boxes (I can't use a GUI or 
>>>>> anything like that).
>>>>>
>>>>> At the moment I'm only using .txt files, and these are the steps I'm 
>>>>> following (it's also worth mentioning that I'm training from scratch):
>>>>> - Using the .txt files and a font (OcrB) to create .tiff and box files 
>>>>> with text2image
>>>>> - Creating a unicharset from all the box files
>>>>> - (Optional, but for the sake of it) applying unicharset properties
>>>>> - Building the traineddata from the unicharset and langdata, with a 
>>>>> custom language as the parameter
>>>>> - Creating the lstmf files by running "tesseract some.tiff output lstm.train"
>>>>> - Creating the list of files to train on
>>>>> - Running LSTM training with net spec [1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 
>>>>> Lfx96 Lrx96 Lfx256 O1c111] and learning rate 20e-4
>>>>> - At the end, using the last checkpoint to create the traineddata for 
>>>>> actual use
>>>>>
>>>>> Currently the initial .txt files are randomly generated by my own 
>>>>> program in the form of MRZ codes (samples included). I also tried 
>>>>> generating files with a mixed alphabet to get more character variety. I 
>>>>> used about 1000 samples for training, and the result doesn't differ from 
>>>>> using 100 samples.
>>>>>
>>>>> Also, I disabled dictionary in the OCR process to prevent tesseract 
>>>>> from treating whole MRZ code as a word.
>>>>>
>>>>> I might not understand some things despite reading a lot about this 
>>>>> topic, but I'm pretty sure that I'm doing the training process 
>>>>> correctly. Do you have any tips on how to improve it? Please point out 
>>>>> even the dumbest things I could have forgotten.
>>>>>
>>>>> -- 
>>>>> You received this message because you are subscribed to the Google 
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>>> an email to [email protected].
>>>>> To post to this group, send email to [email protected].
>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit 
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/b3b86804-5d86-4fac-a780-88a2ef4f2ba2%40googlegroups.com
>>>>>  
>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/b3b86804-5d86-4fac-a780-88a2ef4f2ba2%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>>>
>>>>
>>>> -- 
>>>>
>>>> ____________________________________________________________
>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>
>>>
>>>
>>>
>>>
>>
>

