I read this page but still need more information about how to build a
training data set.
Say I want to train the engine to recognize a field containing 15 digits.
Is it enough to provide a small text file containing the 10 digits from 0 to 9,
or should I prepare the training text to contain every 15-digit combination?
That would mean 10^15 strings, which is a huge amount of data.
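
For a sense of scale: 10^15 lines of 15 digits plus a newline are 16 bytes each, about 1.6 x 10^16 bytes (roughly 16 petabytes), so full enumeration is not practical. One possible alternative, which is an assumption here and not something confirmed in this thread, is a random sample of 15-digit lines in which every digit appears many times next to varied neighbours. A minimal sketch follows; gen_digit_lines, the output file name, and the sample size are all hypothetical.

```shell
#!/bin/sh
# Sketch only: print N lines, each a random 15-digit string.
# gen_digit_lines is a hypothetical helper, not part of Tesseract.
gen_digit_lines() {
  awk -v n="$1" 'BEGIN {
    srand()                          # seeded from the clock; same second gives the same output
    for (i = 0; i < n; i++) {
      s = ""
      for (j = 0; j < 15; j++)
        s = s int(rand() * 10)       # rand() is in [0,1), so this appends one digit 0..9
      print s
    }
  }'
}

# Example (hypothetical file name and sample size):
# gen_digit_lines 5000 > digits.training_text
```

Whether a sample like this is large or varied enough depends on the fonts and images used, so it would still need to be checked against real evaluation data.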

On Sunday, March 29, 2020 at 11:45:01 AM UTC+2, shree wrote:
>
>
> https://github.com/tesseract-ocr/tessdoc/blob/master/TrainingTesseract-4.00.md#introduction
>   
>
> On Sun, Mar 29, 2020 at 12:53 PM Essam Zaky <[email protected]> wrote:
>
>> Thanks @shreeshrii,
>>
>> What are the recommendations for preparing the training text?
>>
>> Is there any tutorial that shows how to prepare the training text?
>>
>> For example:
>> What is the recommended text size?
>> How many instances of each character should the training set contain?
>> What about ligatures: how should they be handled, and how are they added
>> to the unicharset?
>> ....
>>
>> On Sunday, March 29, 2020 at 7:50:54 AM UTC+2, shree wrote:
>>>
>>> The unicharset is based on the training text you use. Please make sure 
>>> you have all required characters in the text.
>>>
>>> Fine-tune for impact works with the unicharset of the best traineddata 
>>> file, but then you can't add any characters to it.
>>>
>>> On Sun, Mar 29, 2020, 11:08 Essam Zaky <[email protected]> wrote:
>>>
>>>> Hi @shreeshrii,
>>>> Attached is the bash script described on the following page:
>>>>
>>>> https://github.com/tesseract-ocr/tesseract/issues/2695#issuecomment-539412948
>>>>
>>>> When I change line 51 from
>>>>
>>>> --traineddata ~/tesstutorial/tesseract/tessdata/best/ara.traineddata \
>>>>
>>>> to
>>>>
>>>> --traineddata ~/tesstutorial/araeval/ara/ara.traineddata
>>>>
>>>> it works without error.
>>>> But I have another question:
>>>> the unicharset of the best traineddata has 85 entries, while the newly
>>>> generated unicharset has only 74.
>>>> How can I keep the unicharset size at 85, as in best?
>>>>
>>>>
>>>> On Sunday, March 29, 2020 at 5:06:16 AM UTC+2, shree wrote:
>>>>>
>>>>> See 
>>>>> https://github.com/Shreeshrii/tess4training/blob/master/6-plusminus.sh
>>>>>
>>>>> lstmtraining --model_output ../tesstutorial/trainplusminus/plusminus \
>>>>>   --continue_from ../tesstutorial/trainplusminus/eng.lstm \
>>>>>   --traineddata ../tesstutorial/trainplusminus/eng/eng.traineddata \
>>>>>   --old_traineddata tessdata/best/eng.traineddata \
>>>>>   --train_listfile ../tesstutorial/trainplusminus/eng.training_files.txt \
>>>>>   --max_iterations 3600
>>>>>
>>>>> ...
>>>>>
>>>>>
>>>>> lstmtraining \
>>>>>   --stop_training \
>>>>>   --continue_from ../tesstutorial/trainplusminus/plusminus_checkpoint \
>>>>>   --traineddata ../tesstutorial/trainplusminus/eng/eng.traineddata \
>>>>>   --model_output ../tesstutorial/trainplusminus/eng_plusminus.traineddata
>>>>>
>>>>> --traineddata needs to be the same in both commands.
>>>>>
>>>>> On Sun, Mar 29, 2020 at 6:45 AM Shree Devi Kumar <[email protected]> 
>>>>> wrote:
>>>>>
>>>>>> Please check that you have used the correct path for the traineddata 
>>>>>> file.
>>>>>>
>>>>>> Please share the lstmtraining command that you used before this for 
>>>>>> training.
>>>>>>
>>>>>> On Sat, Mar 28, 2020, 22:56 Essam Zaky <[email protected]> wrote:
>>>>>>
>>>>>>> Dear @Shreeshrii
>>>>>>> I followed your bash script to add the Andalus font to the Arabic
>>>>>>> language. Here is the script URL:
>>>>>>>
>>>>>>> https://github.com/tesseract-ocr/tesseract/issues/2695#issuecomment-539412948
>>>>>>>
>>>>>>> All steps work except the last one, which generates the
>>>>>>> traineddata. Here is the error:
>>>>>>>
>>>>>>> osboxes@osboxes:~/tesstutorial/tesseract$ time lstmtraining \
>>>>>>> >   --stop_training \
>>>>>>> >   --continue_from ~/tesstutorial/ara_from_full/PLUS_checkpoint \
>>>>>>> >   --traineddata ~/tesstutorial/tesseract/tessdata/best/ara.traineddata \
>>>>>>> >   --model_output ~/tesstutorial/ara_from_full/ara.Andalus.PLUS.traineddata
>>>>>>> Loaded file /home/osboxes/tesstutorial/ara_from_full/PLUS_checkpoint, unpacking...
>>>>>>> Code range changed from 74 to 85!
>>>>>>> Must supply the old traineddata for code conversion!
>>>>>>> Failed to read continue from: /home/osboxes/tesstutorial/ara_from_full/PLUS_checkpoint
>>>>>>>
>>>>>>>
>>>>>>> Best Regards
>>>>>>> Essam
>>>>>>>
>>>>>>> -- 
>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>> Groups "tesseract-ocr" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>> send an email to [email protected].
>>>>>>> To view this discussion on the web visit 
>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/0c9123f5-8e80-447c-9bf1-2c6ec9831238%40googlegroups.com
>>>>>>>  
>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/0c9123f5-8e80-447c-9bf1-2c6ec9831238%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>> .
>>>>>>>
>>>>>>
>>>>>
>>>>> -- 
>>>>>
>>>>> ____________________________________________________________
>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>
>
>

