https://github.com/tesseract-ocr/tessdoc/blob/master/TrainingTesseract-4.00.md#introduction
On Sun, Mar 29, 2020 at 12:53 PM Essam Zaky wrote:
> Thanks @shreeshrii
>
> While preparing the training text, what are the recommendations for this
> step? Is there any tutorial that shows how to prepare the training text?
>
> For example:
> What is the recommended text size?
> How many instances of each character should appear in the training set?
> What about ligatures: how should they be handled, and how are they added
> to the unicharset?
> ....
>
> On Sunday, March 29, 2020 at 7:50:54 AM UTC+2, shree wrote:
>>
>> The unicharset is based on the training text you use. Please make sure
>> you have all required characters in the text.
>>
>> Fine-tune for impact works with the unicharset of the best traineddata
>> file, but then you can't add any characters to it.
>>
>> On Sun, Mar 29, 2020, 11:08 Essam Zaky wrote:
>>
>>> Hi @shreeshrii
>>> Attached is the bash script as described on the following page:
>>>
>>> https://github.com/tesseract-ocr/tesseract/issues/2695#issuecomment-539412948
>>>
>>> When I change line #51 from
>>>
>>> --traineddata ~/tesstutorial/tesseract/tessdata/best/ara.traineddata \
>>>
>>> to
>>>
>>> --traineddata ~/tesstutorial/araeval/ara/ara.traineddata
>>>
>>> it works without error.
>>> But I have another question: the character set in the best traineddata
>>> has 85 entries, while the newly generated character set contains only 74.
>>> How can I keep the unicharset at 85 entries, as in best?
>>>
>>>
>>> On Sunday, March 29, 2020 at 5:06:16 AM UTC+2, shree wrote:
>>>>
>>>> See
>>>> https://github.com/Shreeshrii/tess4training/blob/master/6-plusminus.sh
>>>>
>>>> lstmtraining --model_output ../tesstutorial/trainplusminus/plusminus \
>>>> --continue_from ../tesstutorial/trainplusminus/eng.lstm \
>>>> --traineddata ../tesstutorial/trainplusminus/eng/eng.traineddata \
>>>> --old_traineddata tessdata/best/eng.traineddata \
>>>> --train_listfile ../tesstutorial/trainplusminus/eng.training_files.txt \
>>>> --max_iterations 3600
>>>>
>>>> ...
>>>>
>>>>
>>>> lstmtraining \
>>>> --stop_training \
>>>> --continue_from ../tesstutorial/trainplusminus/plusminus_checkpoint \
>>>> --traineddata ../tesstutorial/trainplusminus/eng/eng.traineddata \
>>>> --model_output ../tesstutorial/trainplusminus/eng_plusminus.traineddata
>>>>
>>>> --traineddata needs to be the same in both commands.
>>>>
>>>> On Sun, Mar 29, 2020 at 6:45 AM Shree Devi Kumar wrote:
>>>>
>>>>> Please check that you have used the correct path for the traineddata
>>>>> file.
>>>>>
>>>>> Please share the lstmtraining command that you used before this for
>>>>> training.
>>>>>
>>>>> On Sat, Mar 28, 2020, 22:56 Essam Zaky wrote:
>>>>>
>>>>>> Dear @Shreeshrii
>>>>>> I followed your bash script to add the Andalus font to the Arabic
>>>>>> language. Here is the script URL:
>>>>>>
>>>>>> https://github.com/tesseract-ocr/tesseract/issues/2695#issuecomment-539412948
>>>>>>
>>>>>> All steps work except the last one, which generates the
>>>>>> traineddata. Here is the error:
>>>>>>
>>>>>> osboxes@osboxes:~/tesstutorial/tesseract$ time lstmtraining \
>>>>>> > --stop_training \
>>>>>> > --continue_from ~/tesstutorial/ara_from_full/PLUS_checkpoint \
>>>>>> > --traineddata ~/tesstutorial/tesseract/tessdata/best/ara.traineddata \
>>>>>> > --model_output ~/tesstutorial/ara_from_full/ara.Andalus.PLUS.traineddata
>>>>>> Loaded file /home/osboxes/tesstutorial/ara_from_full/PLUS_checkpoint, unpacking...
>>>>>> Code range changed from 74 to 85!
>>>>>> Must supply the old traineddata for code conversion!
>>>>>> Failed to read continue from: /home/osboxes/tesstutorial/ara_from_full/PLUS_checkpoint
>>>>>>
>>>>>>
>>>>>> Best Regards
>>>>>> Essam
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to [email protected].
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/0c9123f5-8e80-447c-9bf1-2c6ec9831238%40googlegroups.com
>>>>>>
>>>>
>>>> --
>>>> ____________________________________________________________
>>>> Bhajan - Kirtan - Aarti @ http://bhajans.ramparivar.com
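
The rule stated in the thread, that --traineddata needs to be the same in both lstmtraining commands, could be sketched as a small script that keeps the path in one variable. This is only an illustration: the paths and flags are the ones quoted in the thread, while the variable names (BASE, NEW_TRAINEDDATA, OLD_TRAINEDDATA) and the run() dry-run helper are assumptions made for this example. By default the script only prints the commands and does not call lstmtraining.

```shell
#!/bin/sh
# Sketch only: paths and flags are copied from the thread; the variable
# names and the run() helper are hypothetical and exist for illustration.
set -eu

BASE=../tesstutorial/trainplusminus
# Starter traineddata built from the new training text (new unicharset).
NEW_TRAINEDDATA="$BASE/eng/eng.traineddata"
# Traineddata the starting model came from; the error in the thread
# ("Must supply the old traineddata for code conversion!") asks for this.
OLD_TRAINEDDATA=tessdata/best/eng.traineddata

# Print the command by default; set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Step 1: continue training from the extracted LSTM model.
run lstmtraining \
  --model_output "$BASE/plusminus" \
  --continue_from "$BASE/eng.lstm" \
  --traineddata "$NEW_TRAINEDDATA" \
  --old_traineddata "$OLD_TRAINEDDATA" \
  --train_listfile "$BASE/eng.training_files.txt" \
  --max_iterations 3600

# Step 2: convert the checkpoint to a traineddata file, reusing the
# same --traineddata value as in step 1.
run lstmtraining \
  --stop_training \
  --continue_from "$BASE/plusminus_checkpoint" \
  --traineddata "$NEW_TRAINEDDATA" \
  --model_output "$BASE/eng_plusminus.traineddata"
```

In the failing command quoted in the thread, the second step pointed --traineddata at tessdata/best/ara.traineddata while the checkpoint had been trained with a different traineddata, which matches the "Code range changed from 74 to 85!" message. Holding the path in a single variable is one way to avoid that mismatch.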

