See https://github.com/Shreeshrii/tessdata_ocrb
Retrained to add the missing X, using 3 fonts at 3 exposures and a larger training text than the previous version. Both float/best and integer/fast versions are provided.

- Download best version <https://github.com/Shreeshrii/tessdata_ocrb/raw/master/ocrb.traineddata> - 11.1 MB. Use with -l ocrb
- Download fast version <https://github.com/Shreeshrii/tessdata_ocrb/raw/master/ocrb_int.traineddata> - 1.66 MB. Use with -l ocrb_int (the -l name must match the traineddata file name).

I would appreciate feedback. If this is useful, we can add it to https://github.com/tesseract-ocr/tessdata_contrib

On Monday, April 8, 2019 at 10:45:29 PM UTC+5:30, shree wrote:
>
> If you can provide another 40-50 lines of training data (text file) I will
> rerun the training
>
>
> On Mon, 8 Apr 2019, 22:11 Jankees Korstanje wrote:
>
>> Hi Shree,
>>
>> We have tried your traineddata file for MRZ and noticed that it does not
>> detect the character X.
>>
>> Looking at
>> https://github.com/Shreeshrii/tessdata_ocrb/blob/master/eng.MRZ.training_text
>>
>> We see that there are no X in there.
>>
>> In addition it might be good to add a couple of lines that are specific
>> for IDs (starting with I). Note that they are all fake:
>>
>> IDESPANH186495123456789X<<<<<<
>> IXESPE002561410<0233181G<<<<<
>> I<NLDIS2KX87214<<<<<<<<<<<<<<<
>>
>> On Wednesday, 5 September 2018 18:03:41 UTC+2, shree wrote:
>>>
>>> See https://github.com/Shreeshrii/tessdata_ocrb
>>> for the files and traineddata.
>>>
>>>
>>> On Wed, Sep 5, 2018 at 8:51 PM, Shree Devi Kumar <[email protected]>
>>> wrote:
>>>
>>>> I think finetune will be a better option than training from scratch.
>>>>
>>>> Using a small training/test text - 40 lines, I get
>>>>
>>>> ---------------------------------
>>>>
>>>> + lstmeval --verbosity 0 --model
>>>> /home/ubuntu/tessdata_best/script/Latin.traineddata --eval_listfile
>>>> /home/ubuntu/tesstutorial/ocrb/eng.training_files.txt
>>>> Loaded 40/40 pages (1-40) of document
>>>> /home/ubuntu/tesstutorial/ocrb/eng.OCR-B_10_BT.exp0.lstmf
>>>> Loaded 40/40 pages (1-40) of document
>>>> /home/ubuntu/tesstutorial/ocrb/eng.OCR_B_MT.exp0.lstmf
>>>> Warning: LSTMTrainer deserialized an LSTMRecognizer!
>>>> At iteration 0, stage 0, *Eval Char error rate=0.73106061*, *Word error rate=13.75*
>>>>
>>>> ---------------------------------
>>>>
>>>> + lstmeval --verbosity 0 --model
>>>> /home/ubuntu/tessdata_best/eng.traineddata --eval_listfile
>>>> /home/ubuntu/tesstutorial/ocrb/eng.training_files.txt
>>>> Loaded 40/40 pages (1-40) of document
>>>> /home/ubuntu/tesstutorial/ocrb/eng.OCR-B_10_BT.exp0.lstmf
>>>> Loaded 40/40 pages (1-40) of document
>>>> /home/ubuntu/tesstutorial/ocrb/eng.OCR_B_MT.exp0.lstmf
>>>> Warning: LSTMTrainer deserialized an LSTMRecognizer!
>>>> At iteration 0, stage 0, *Eval Char error rate=47.444889, Word error rate=92.5*
>>>>
>>>> ---------------------------------
>>>>
>>>> *At iteration 16/410/410, Mean rms=0.236%, delta=0.131%, char
>>>> train=0.448%, word train=3.659%, skip ratio=0%, New best char error =
>>>> 0.448 wrote checkpoint.*
>>>>
>>>> *Finished! Error rate = 0.448*
>>>>
>>>> ---------------------------------
>>>>
>>>> + lstmeval --model
>>>> /home/ubuntu/tesstutorial/ocrb_from_full/ocrb_plus_checkpoint
>>>> --traineddata /home/ubuntu/tesstutorial/ocrb/eng/eng.traineddata
>>>> --eval_listfile /home/ubuntu/tesstutorial/ocrb/eng.training_files.txt
>>>> /home/ubuntu/tesstutorial/ocrb_from_full/ocrb_plus_checkpoint is not a
>>>> recognition model, trying training checkpoint...
>>>> Loaded 40/40 pages (1-40) of document
>>>> /home/ubuntu/tesstutorial/ocrb/eng.OCR-B_10_BT.exp0.lstmf
>>>> Loaded 40/40 pages (1-40) of document
>>>> /home/ubuntu/tesstutorial/ocrb/eng.OCR_B_MT.exp0.lstmf
>>>> At iteration 0, stage 0, *Eval Char error rate=0, Word error rate=0*
>>>>
>>>> ---------------------------------
>>>>
>>>> On Wed, Sep 5, 2018 at 1:55 PM, <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> (I might butcher English grammar - you have been warned!)
>>>>>
>>>>> For some time I've been trying to teach tesseract to read MRZ
>>>>> codes. Unfortunately it's not going very well. I'm using the latest
>>>>> version of tesseract (4.0), so I'm trying to train it by the lstm
>>>>> method. I've managed to pull it off and got some custom traineddata
>>>>> samples, but the effects of using them are... let's say slightly
>>>>> unsatisfying. As a matter of fact they are not even remotely close to
>>>>> eng traineddata. I know that there was mrz traineddata in the previous
>>>>> version of tesseract.
>>>>>
>>>>> I'm out of ideas how to improve accuracy, so I'll need your help guys.
>>>>>
>>>>> At first I thought I could use images, .txt files containing already
>>>>> read data and font data to somehow make box files (basically you have
>>>>> image and .txt containing everything read from the image). I was
>>>>> disappointed when I realized that without manual correction of boxes
>>>>> tesseract won't know how to apply them correctly. Of course I need an
>>>>> automated method to apply boxes (I can't use any GUI or something).
>>>>>
>>>>> At the moment I'm only using .txt files and these are the steps I'm
>>>>> doing (it's also good to mention that I'm trying to make it from scratch):
>>>>> - Using .txt and font (OcrB) to create .tiff and box files using the
>>>>> text2image method
>>>>> - Creating unicharset from all box files
>>>>> - (it's optional but for the sake of it) I'm applying unicharset
>>>>> properties
>>>>> - Getting traineddata from unicharset, langdata and using custom
>>>>> language as parameter
>>>>> - Creating lstmf file with: tesseract some.tiff output lstm.train
>>>>> - Creating list of files to train
>>>>> - Running lstm training with net spec [1,36,0,1 Ct3,3,16 Mp3,3 Lfys48
>>>>> Lfx96 Lrx96 Lfx256 O1c111] and learning rate 20e-4
>>>>> - At the end I'm using the last checkpoint to create traineddata for
>>>>> usage.
>>>>>
>>>>> Currently the initial .txt files are randomly generated by me in a
>>>>> program in the form of mrz code (samples included). I also tried to
>>>>> generate files in the form of a mixed alphabet to get more character
>>>>> variety. I was using about 1000 samples to train it and it doesn't
>>>>> differ from using 100 samples.
>>>>>
>>>>> Also, I disabled the dictionary in the OCR process to prevent tesseract
>>>>> from treating the whole MRZ code as a word.
>>>>>
>>>>> I might not understand some things despite reading a lot about this
>>>>> topic, but I'm pretty sure that I'm doing the training process correctly.
>>>>> Do you have any tips how to improve the training process? Consider
>>>>> pointing out even the dumbest things I could have forgotten about.
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To post to this group, send email to [email protected].
>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/b3b86804-5d86-4fac-a780-88a2ef4f2ba2%40googlegroups.com
>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/b3b86804-5d86-4fac-a780-88a2ef4f2ba2%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>>> --
>>>>
>>>> ____________________________________________________________
>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>
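
For anyone putting together the extra 40-50 lines of training text asked for above, here is a minimal Python sketch that generates fake MRZ-like lines and forces at least one X into each of them. The 30-character length (TD1 ID-card lines), the prefix list and the function names are my own assumptions for illustration. The output is random filler, not valid MRZ data with check digits.

```python
import random
import string

# Characters allowed in an MRZ: A-Z, 0-9 and the filler '<'.
MRZ_ALPHABET = string.ascii_uppercase + string.digits + "<"


def fake_mrz_line(rng, length=30, prefix="I"):
    """Return one random MRZ-like line; the content is meaningless on purpose."""
    body_len = length - len(prefix)
    # Pad the tail with '<' fillers, as the sample lines in this thread do.
    filler_len = rng.randint(0, body_len // 2)
    chars = [rng.choice(MRZ_ALPHABET[:-1]) for _ in range(body_len - filler_len)]
    # Force at least one 'X' so the character cannot be missing from the text.
    chars[rng.randrange(len(chars))] = "X"
    return prefix + "".join(chars) + "<" * filler_len


def fake_training_text(n_lines=50, seed=0):
    """Return n_lines fake lines; the seed makes the output reproducible."""
    rng = random.Random(seed)
    prefixes = ["I", "ID", "I<", "P<"]  # illustrative document-type prefixes
    return [fake_mrz_line(rng, prefix=rng.choice(prefixes)) for _ in range(n_lines)]


if __name__ == "__main__":
    print("\n".join(fake_training_text()))
```

Redirecting the output to a text file gives something that can be appended to the training text. Whether purely random lines help as much as realistic ones is an open question.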

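
The fine-tuning route suggested in the quoted reply can be sketched roughly as below. The exact commands behind the numbers in this thread are not shown, so this is only an assumption of what a plain fine-tune looks like: the paths, the "ocrb" output name and the iteration count are hypothetical. The helper only builds the command lines and does not run anything. If the character set changes, the training tools also need the old traineddata (--old_traineddata), which this sketch leaves out.

```python
import shlex


def finetune_commands(base_traineddata, work_dir, listfile, max_iterations=400):
    """Return argv lists for a fine-tuning run; nothing is executed here."""
    lstm = f"{work_dir}/base.lstm"
    # 1. Extract the LSTM model from the existing traineddata.
    extract = ["combine_tessdata", "-e", base_traineddata, lstm]
    # 2. Continue training from that model on the new .lstmf files.
    train = [
        "lstmtraining",
        "--continue_from", lstm,
        "--traineddata", base_traineddata,
        "--train_listfile", listfile,
        "--model_output", f"{work_dir}/ocrb",
        "--max_iterations", str(max_iterations),
    ]
    # 3. Convert the last checkpoint into a usable traineddata file.
    finish = [
        "lstmtraining", "--stop_training",
        "--continue_from", f"{work_dir}/ocrb_checkpoint",
        "--traineddata", base_traineddata,
        "--model_output", f"{work_dir}/ocrb.traineddata",
    ]
    return [extract, train, finish]


if __name__ == "__main__":
    # Hypothetical paths, printed as shell commands for inspection.
    for cmd in finetune_commands("eng.traineddata", "out", "eng.training_files.txt"):
        print(shlex.join(cmd))
```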

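
To try the downloaded file, a small Python wrapper around the tesseract command line could look like this. It assumes tesseract 4 is on PATH and that ocrb.traineddata was saved into a local directory. The directory, the image name and the choice of --psm 6 (single uniform block of text) are assumptions, not something tested in this thread.

```python
import subprocess


def ocr_command(image_path, tessdata_dir, lang="ocrb", psm=6):
    """Build the tesseract command line; lang must match <lang>.traineddata."""
    return [
        "tesseract", image_path, "stdout",
        "--tessdata-dir", tessdata_dir,
        "-l", lang,
        "--psm", str(psm),
    ]


def run_ocr(image_path, tessdata_dir, lang="ocrb"):
    """Run tesseract and return the recognized text (requires tesseract installed)."""
    result = subprocess.run(
        ocr_command(image_path, tessdata_dir, lang),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical file names; only prints the command that would be run.
    print(" ".join(ocr_command("mrz_crop.png", "./tessdata")))
```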