I read this page, but I still need more information about how to build the training data set. Say I want to train the engine to recognize a field containing 15 digits. Is it enough to provide a small text file containing the 10 digits from 0 to 9, or should I prepare the training text to contain every 15-digit combination? That would mean 10^15 entries, which is a huge amount of data.
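
To make the question concrete: the middle ground between those two extremes would be to sample a limited number of random 15-digit lines, so that every digit shows up many times and in varied neighbourhoods. Below is a minimal sketch of that idea. The helper name gen_digit_lines, the output file name, and the count of 1000 lines are all my own assumptions for illustration, and whether such a sample is actually sufficient for training is exactly what I am unsure about.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tesseract): print COUNT lines,
# each made of WIDTH random digits (WIDTH defaults to 15).
# Usage: gen_digit_lines COUNT [WIDTH]
gen_digit_lines() {
  awk -v n="$1" -v w="${2:-15}" 'BEGIN {
    srand()   # seed from the current time; two runs within the same second repeat
    for (i = 0; i < n; i++) {
      line = ""
      for (j = 0; j < w; j++) {
        line = line int(rand() * 10)   # append one digit in the range 0..9
      }
      print line
    }
  }'
}

# Example call (file name and count are arbitrary assumptions):
#   gen_digit_lines 1000 > digits.training_text
```

With 1000 lines of 15 digits, each digit would appear roughly 1500 times on average, which is far from the 10^15 lines a full enumeration would need.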

On Sunday, March 29, 2020 at 11:45:01 AM UTC+2, shree wrote:
>
> https://github.com/tesseract-ocr/tessdoc/blob/master/TrainingTesseract-4.00.md#introduction
>
> On Sun, Mar 29, 2020 at 12:53 PM Essam Zaky <[email protected]> wrote:
>
>> Thanks @shreeshrii,
>>
>> While preparing the training text, what are the recommendations for this step?
>> Is there any tutorial that shows how to prepare the training text?
>>
>> For example:
>> What is the recommended text size?
>> How many instances of each character should appear in the training set?
>> What about ligatures: how should they be handled, and how are they added to the unicharset?
>> ....
>>
>> On Sunday, March 29, 2020 at 7:50:54 AM UTC+2, shree wrote:
>>>
>>> The unicharset is based on the training text you use. Please make sure
>>> you have all required characters in the text.
>>>
>>> Fine-tune for impact works with the unicharset of the best traineddata
>>> file, but then you can't add any characters to it.
>>>
>>> On Sun, Mar 29, 2020, 11:08 Essam Zaky <[email protected]> wrote:
>>>
>>>> Hi @shreeshrii,
>>>> Attached is the bash script as described on the following page:
>>>>
>>>> https://github.com/tesseract-ocr/tesseract/issues/2695#issuecomment-539412948
>>>>
>>>> When I change line #51 from
>>>>
>>>> --traineddata ~/tesstutorial/tesseract/tessdata/best/ara.traineddata \
>>>>
>>>> to
>>>>
>>>> --traineddata ~/tesstutorial/araeval/ara/ara.traineddata
>>>>
>>>> it works fine without error.
>>>> But I have another question: the character set in the best traineddata
>>>> has 85 entries, while the newly generated character set contains only 74.
>>>> How can I keep the unicharset at 85 entries, as in best?
>>>>
>>>> On Sunday, March 29, 2020 at 5:06:16 AM UTC+2, shree wrote:
>>>>>
>>>>> See https://github.com/Shreeshrii/tess4training/blob/master/6-plusminus.sh
>>>>>
>>>>> lstmtraining --model_output ../tesstutorial/trainplusminus/plusminus \
>>>>>   --continue_from ../tesstutorial/trainplusminus/eng.lstm \
>>>>>   --traineddata ../tesstutorial/trainplusminus/eng/eng.traineddata \
>>>>>   --old_traineddata tessdata/best/eng.traineddata \
>>>>>   --train_listfile ../tesstutorial/trainplusminus/eng.training_files.txt \
>>>>>   --max_iterations 3600
>>>>>
>>>>> ...
>>>>>
>>>>> lstmtraining \
>>>>>   --stop_training \
>>>>>   --continue_from ../tesstutorial/trainplusminus/plusminus_checkpoint \
>>>>>   --traineddata ../tesstutorial/trainplusminus/eng/eng.traineddata \
>>>>>   --model_output ../tesstutorial/trainplusminus/eng_plusminus.traineddata
>>>>>
>>>>> --traineddata needs to be the same in both commands.
>>>>>
>>>>> On Sun, Mar 29, 2020 at 6:45 AM Shree Devi Kumar <[email protected]> wrote:
>>>>>
>>>>>> Please check that you have used the correct path for the traineddata
>>>>>> file.
>>>>>>
>>>>>> Please share the lstmtraining command that you used before this for
>>>>>> training.
>>>>>>
>>>>>> On Sat, Mar 28, 2020, 22:56 Essam Zaky <[email protected]> wrote:
>>>>>>
>>>>>>> Dear @Shreeshrii,
>>>>>>> I followed your bash script to add the Andalus font to the Arabic
>>>>>>> language. Here is the script URL:
>>>>>>>
>>>>>>> https://github.com/tesseract-ocr/tesseract/issues/2695#issuecomment-539412948
>>>>>>>
>>>>>>> All steps work except the last one, which generates the
>>>>>>> traineddata. Here is the error:
>>>>>>>
>>>>>>> osboxes@osboxes:~/tesstutorial/tesseract$ time lstmtraining \
>>>>>>> > --stop_training \
>>>>>>> > --continue_from ~/tesstutorial/ara_from_full/PLUS_checkpoint \
>>>>>>> > --traineddata ~/tesstutorial/tesseract/tessdata/best/ara.traineddata \
>>>>>>> > --model_output ~/tesstutorial/ara_from_full/ara.Andalus.PLUS.traineddata
>>>>>>> Loaded file /home/osboxes/tesstutorial/ara_from_full/PLUS_checkpoint, unpacking...
>>>>>>> Code range changed from 74 to 85!
>>>>>>> Must supply the old traineddata for code conversion!
>>>>>>> Failed to read continue from: /home/osboxes/tesstutorial/ara_from_full/PLUS_checkpoint
>>>>>>>
>>>>>>> Best Regards
>>>>>>> Essam
>>>>>>>
>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "tesseract-ocr" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to [email protected].
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/0c9123f5-8e80-447c-9bf1-2c6ec9831238%40googlegroups.com

