*I got this thing while trying to make starter training data:*

Rendered page 31 to file /tmp/ben-2019-05-29.K90/ben.SolaimanLipi.exp0.tif
Stripped 1 unrenderable words
Rendered page 31 to file /tmp/ben-2019-05-29.K90/ben.Nikosh.exp0.tif
Stripped 1 unrenderable words
Rendered page 37 to file /tmp/ben-2019-05-29.K90/ben.Mukti_Narrow.exp0.tif
Stripped 1 unrenderable words
Rendered page 38 to file /tmp/ben-2019-05-29.K90/ben.Lohit_Bengali.exp0.tif
Stripped 2 unrenderable words
Rendered page 32 to file /tmp/ben-2019-05-29.K90/ben.SolaimanLipi.exp0.tif
Stripped 6 unrenderable words
Rendered page 32 to file /tmp/ben-2019-05-29.K90/ben.Nikosh.exp0.tif
Stripped 1 unrenderable words
Rendered page 38 to file /tmp/ben-2019-05-29.K90/ben.Mukti_Narrow.exp0.tif
Stripped 1 unrenderable words
Rendered page 39 to file /tmp/ben-2019-05-29.K90/ben.Lohit_Bengali.exp0.tif
Rendered page 33 to file /tmp/ben-2019-05-29.K90/ben.SolaimanLipi.exp0.tif
Stripped 5 unrenderable words
Rendered page 33 to file /tmp/ben-2019-05-29.K90/ben.Nikosh.exp0.tif
Stripped 1 unrenderable words
Rendered page 39 to file /tmp/ben-2019-05-29.K90/ben.Mukti_Narrow.exp0.tif
Stripped 1 unrenderable words
Rendered page 40 to file /tmp/ben-2019-05-29.K90/ben.Lohit_Bengali.exp0.tif
Rendered page 34 to file /tmp/ben-2019-05-29.K90/ben.SolaimanLipi.exp0.tif
Rendered page 34 to file /tmp/ben-2019-05-29.K90/ben.Nikosh.exp0.tif
Rendered page 40 to file /tmp/ben-2019-05-29.K90/ben.Mukti_Narrow.exp0.tif
Stripped 1 unrenderable words

...... *and then* .......
Invalid start of grapheme sequence:M=0x9c7
Normalization failed for string 'পাে'
Invalid start of grapheme sequence:M=0x9c7
Normalization failed for string 'জাে'
Invalid start of grapheme sequence:M=0x9bf
Normalization failed for string 'গাি'
Invalid start of grapheme sequence:M=0x9bf
Normalization failed for string 'রীি'
Invalid start of grapheme sequence:M=0x9c7
Normalization failed for string 'ভাে'
Invalid start of grapheme sequence:M=0x9bf
Normalization failed for string 'জাি'
Invalid start of grapheme sequence:M=0x9c7
Normalization failed for string 'থাে'
Invalid start of grapheme sequence:M=0x9c7
Normalization failed for string 'হাে'
Invalid start of grapheme sequence:M=0x9c7
Normalization failed for string 'পুে'
Invalid start of grapheme sequence:M=0x9bf
Normalization failed for string 'পুি'
Invalid start of grapheme sequence:H=0x9cd
Normalization failed for string 'অ্যা'
Invalid start of grapheme sequence:M=0x9c7
Normalization failed for string 'খাে'
Invalid start of grapheme sequence:M=0x9c7
Normalization failed for string 'চুে'
Invalid start of grapheme sequence:M=0x9bf
Normalization failed for string 'ঢাি'
Invalid start of grapheme sequence:M=0x9c7
Normalization failed for string 'তাে'
Invalid start of grapheme sequence:M=0x9c7
Normalization failed for string 'উে'
Invalid start of grapheme sequence:M=0x9bf
Normalization failed for string 'উি'
Invalid start of grapheme sequence:M=0x9c7
Normalization failed for string 'থাে'
Invalid start of grapheme sequence:M=0x9bf
Normalization failed for string 'তাি'
Invalid start of grapheme sequence:M=0x9bf
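
Each "Invalid start of grapheme sequence" message above names the code point that was rejected: 0x9c7 and 0x9bf are the Bengali vowel signs E and I, and 0x9cd is the virama. In the quoted strings that mark follows another vowel sign or an independent vowel instead of a consonant. A rough way to look for such sequences in a training text before rendering is sketched below in Python. It applies one simplified rule (a vowel sign or virama must directly follow a consonant, optionally with a nukta in between). It is not Tesseract's own validator and may disagree with it, and the name `find_suspect_marks` is made up for this illustration.

```python
# Simplified check, NOT Tesseract's validator: flag Bengali vowel signs and
# viramas that do not directly follow a consonant (optionally plus a nukta).

VOWEL_SIGNS = {0x09BE, 0x09BF, 0x09C0, 0x09C1, 0x09C2, 0x09C3, 0x09C4,
               0x09C7, 0x09C8, 0x09CB, 0x09CC}
VIRAMA = 0x09CD
NUKTA = 0x09BC
# KA..HA plus RRA, RHA, YYA and the two Assamese letters.
CONSONANTS = set(range(0x0995, 0x09BA)) | {0x09DC, 0x09DD, 0x09DF, 0x09F0, 0x09F1}


def find_suspect_marks(text):
    """Return (index, code point) pairs for marks that lack a consonant base."""
    suspects = []
    for i, ch in enumerate(text):
        cp = ord(ch)
        if cp not in VOWEL_SIGNS and cp != VIRAMA:
            continue
        j = i - 1
        # Step over a nukta so that consonant + nukta counts as a valid base.
        if j >= 0 and ord(text[j]) == NUKTA:
            j -= 1
        if j < 0 or ord(text[j]) not in CONSONANTS:
            suspects.append((i, cp))
    return suspects


if __name__ == "__main__":
    samples = (
        "\u09aa\u09be\u09c7",        # pa + AA sign + E sign, the pattern in the log
        "\u0985\u09cd\u09af\u09be",  # independent A + virama + ya + AA sign
        "\u09aa\u09cb",              # pa + O sign, a well-formed syllable
    )
    for sample in samples:
        hits = [(i, hex(cp)) for i, cp in find_suspect_marks(sample)]
        print(repr(sample), hits)
```

Running a check like this over the training text would show which words are likely to be dropped, so they can be corrected or removed before the text is rendered.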
*But finally I got:*

=== Moving lstmf files for training data ===
Moving /tmp/ben-2019-05-29.K90/ben.Bangla_Medium.exp0.lstmf to /home/guest/tesstutorial/train_wa/Eval_wa
Moving /tmp/ben-2019-05-29.K90/ben.Lohit_Bengali.exp0.lstmf to /home/guest/tesstutorial/train_wa/Eval_wa
Moving /tmp/ben-2019-05-29.K90/ben.Mukti_Narrow.exp0.lstmf to /home/guest/tesstutorial/train_wa/Eval_wa
Moving /tmp/ben-2019-05-29.K90/ben.Nikosh.exp0.lstmf to /home/guest/tesstutorial/train_wa/Eval_wa
Moving /tmp/ben-2019-05-29.K90/ben.SolaimanLipi.exp0.lstmf to /home/guest/tesstutorial/train_wa/Eval_wa

Created starter traineddata for LSTM training of language 'ben'

Run 'lstmtraining' command to continue LSTM training for language 'ben'

*There was no error at the end. Will this training data be good? I am asking because many things do not seem to be happening the way they should; for example, the log says "Normalization failed" and "unrenderable".*

On Tue, May 28, 2019 at 6:27 PM Jennil Thiyam <[email protected]> wrote:

> Okay, now I understand. Thank you, Shree.
>
> On Tue, May 28, 2019 at 6:22 PM Shree Devi Kumar <[email protected]> wrote:
>
>> It is using a different set of fonts. So training is being done on one set of fonts and eval on others.
>>
>> Alternatively, you can use a smaller text file for eval and use the same set of fonts.
>>
>> It all depends on what you want to accomplish with training.
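
The second suggestion quoted above, a smaller text file for eval rendered with the same fonts, could be prepared by holding out a slice of the training text. The Python sketch below is one hypothetical way to do that. The function name `split_lines`, the 10% eval fraction and the fixed seed are assumptions made for illustration and are not part of tesstrain.sh.

```python
import random


def split_lines(lines, eval_fraction=0.1, seed=0):
    """Shuffle the non-empty lines and return (train_lines, eval_lines)."""
    kept = [ln for ln in lines if ln.strip()]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(kept)
    # Hold out at least one line whenever there is any text at all.
    n_eval = max(1, int(len(kept) * eval_fraction)) if kept else 0
    return kept[n_eval:], kept[:n_eval]


if __name__ == "__main__":
    sample = [f"line {i}" for i in range(20)]
    train, held_out = split_lines(sample)
    print(len(train), "training lines,", len(held_out), "eval lines")
```

Each list would then be written to its own text file and passed to tesstrain.sh through `--training_text`, once for the training run and once for the eval run.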
>>
>> On Tue, May 28, 2019 at 5:59 PM Jennil Thiyam <[email protected]> wrote:
>>
>>> training/tesstrain.sh \
>>> --fonts_dir /c/Windows/Fonts \
>>> --tessdata_dir ./tessdata \
>>> --training_text ../langdata/ara/ara.training_text \
>>> --langdata_dir ../langdata \
>>> --lang ara \
>>> --linedata_only \
>>> --noextract_font_properties \
>>> --exposures "0" \
>>> --fontlist "Arial" \
>>> --output_dir ~/tesstutorial/aratest
>>>
>>> training/tesstrain.sh \
>>> --fonts_dir /c/Windows/Fonts \
>>> --tessdata_dir ./tessdata \
>>> --training_text ../langdata/ara/ara.training_text \
>>> --langdata_dir ../langdata \
>>> --lang ara \
>>> --linedata_only \
>>> --noextract_font_properties \
>>> --exposures "0" \
>>> --fontlist "Arial" \
>>> "Arial Unicode MS" \
>>> "Calibri" \
>>> "Courier New" \
>>> --output_dir ~/tesstutorial/araeval
>>>
>>> Can anyone tell me why we need to create this eval data? I mean, it is also going to be the same as the training data.
>>>
>>> On Tue, May 28, 2019 at 10:46 AM Jennil Thiyam <[email protected]> wrote:
>>>
>>>> Okay, thank you.
>>>>
>>>> On Tue, May 28, 2019 at 10:30 AM Shree Devi Kumar <[email protected]> wrote:
>>>>
>>>>> The old traineddata and the lstm file need to be in sync. So you should extract the lstm file after downloading the traineddata and use those files. The rest of the files don't need to be regenerated.
>>>>>
>>>>> On Tue, May 28, 2019 at 10:26 AM Jennil Thiyam <[email protected]> wrote:
>>>>>
>>>>>> Do you mean I should change only the path of this old traineddata (in the command, which I underlined) to the path of the ben.traineddata (that I am going to download from tessdata_best)? Or do I need to perform the whole process with this (to be downloaded) ben.traineddata?
>>>>>>
>>>>>> lstmtraining --model_output /model \
>>>>>> --continue_from /ben_extract/ben.lstm \
>>>>>> --traineddata /tesstutorial_output/ben/ben.traineddata \
>>>>>> *--old_traineddata /usr/share/tesseract-ocr/4.00/tessdata/ben.traineddata \*
>>>>>> --train_listfile /tesstutorial_output/ben.training_files.txt \
>>>>>> --max_iterations 1500
>>>>>>
>>>>>> Do you have any idea about the estimated time it will take for 1500 iterations?
>>>>>>
>>>>>> Thank you
>>>>>>
>>>>>> On Mon, May 27, 2019 at 10:20 PM Shree Devi Kumar <[email protected]> wrote:
>>>>>>
>>>>>>> You can download ben.traineddata from tessdata_best to a different location and use that as part of the lstmtraining command.
>>>>>>>
>>>>>>> On Mon, May 27, 2019 at 6:24 PM Jennil Thiyam <[email protected]> wrote:
>>>>>>>
>>>>>>>> I installed it using the command on Ubuntu 18, so I didn't install from the git repository. If I had installed from the git repository, would this work?
>>>>>>>>
>>>>>>>> On Mon 27 May, 2019, 5:43 PM Shree Devi Kumar <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Is /usr/share/tesseract-ocr/4.00/tessdata/ben.traineddata from the tessdata_best repo? Only those models can be used for finetuning.
>>>>>>>>>
>>>>>>>>> On Mon, May 27, 2019 at 4:25 PM Jennil Thiyam <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Yes, I extracted it with the combine_tessdata command.
>>>>>>>>>>
>>>>>>>>>> On Mon 27 May, 2019, 4:23 PM Shree Devi Kumar <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Has /ben_extract/ben.lstm been extracted from /usr/share/tesseract-ocr/4.00/tessdata/ben.traineddata?
>>>>>>>>>>>
>>>>>>>>>>> On Mon, May 27, 2019 at 2:55 PM Jennil Thiyam <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I got an error while trying to perform fine-tuning. The command I used is below:
>>>>>>>>>>>>
>>>>>>>>>>>> lstmtraining --model_output /model \
>>>>>>>>>>>> --continue_from /ben_extract/ben.lstm \
>>>>>>>>>>>> --traineddata /tesstutorial_output/ben/ben.traineddata \
>>>>>>>>>>>> --old_traineddata /usr/share/tesseract-ocr/4.00/tessdata/ben.traineddata \
>>>>>>>>>>>> --train_listfile /tesstutorial_output/ben.training_files.txt \
>>>>>>>>>>>> --max_iterations 1500
>>>>>>>>>>>>
>>>>>>>>>>>> I have read the discussion about the same error, but the solutions provided there were all about changing the path, and I am sure my path is right. Please help me out.
>>>>>>>>>>>> --
>>>>>>>>>>>> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
>>>>>>>>>>>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>>>>>>>>>>>> To post to this group, send email to [email protected].
>>>>>>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>>>>>>> To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/0958d266-6f2f-4d10-9104-ee8145a4f005%40googlegroups.com
>>>>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/0958d266-6f2f-4d10-9104-ee8145a4f005%40googlegroups.com?utm_medium=email&utm_source=footer>.
>>>>>>>>>>>> For more options, visit https://groups.google.com/d/optout.

