https://github.com/tesseract-ocr/tessdoc/blob/master/TrainingTesseract-4.00.md#introduction
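
On the 85 vs. 74 question quoted below: since the unicharset follows the training text, it may help to list which entries of the best unicharset are absent from the new one, and then add text containing them. A minimal sketch, assuming both unicharsets have already been extracted to plain files (for example with combine_tessdata -u) and that each line after the first count line starts with the character itself. The function name and the paths in the example call are hypothetical.

```shell
# Hypothetical helper: print entries found in the first ("best") unicharset
# but missing from the second (newly generated) one.
# Assumed layout: line 1 is the entry count; every later line begins with
# the character, followed by a space and its properties.
missing_chars() {
  best_list=$(mktemp) || return 1
  new_list=$(mktemp) || { rm -f "$best_list"; return 1; }
  # LC_ALL=C makes awk/sort/comm work on raw bytes, so UTF-8 entries
  # are ordered the same way in both lists.
  LC_ALL=C awk 'NR > 1 { print $1 }' "$1" | LC_ALL=C sort -u > "$best_list"
  LC_ALL=C awk 'NR > 1 { print $1 }' "$2" | LC_ALL=C sort -u > "$new_list"
  # comm -23 keeps lines that appear only in the first file.
  LC_ALL=C comm -23 "$best_list" "$new_list"
  rm -f "$best_list" "$new_list"
}

# Example call (placeholder paths):
# missing_chars best/ara.lstm-unicharset new/ara.lstm-unicharset
```

Each line printed is a character that the new training text does not yet cover.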


On Sun, Mar 29, 2020 at 12:53 PM Essam Zaky <[email protected]> wrote:

> Thanks @shreeshrii
>
> When preparing the training text, what are the recommendations for this
> step?
>
> Is there any tutorial that shows how to prepare the training text?
>
> For example:
> What is the recommended text size?
> How many instances of each character should appear in the training set?
> What about ligatures: how should they be handled, and how are they added
> to the unicharset?
> ....
>
> On Sunday, March 29, 2020 at 7:50:54 AM UTC+2, shree wrote:
>>
>> The unicharset is based on the training text you use. Please make sure
>> you have all required characters in the text.
>>
>> Fine-tune for impact works with the unicharset of the best traineddata
>> file, but then you can't add any characters to it.
>>
>> On Sun, Mar 29, 2020, 11:08 Essam Zaky <[email protected]> wrote:
>>
>>> Hi @shreeshrii
>>> Attached is the bash script described on the following page:
>>>
>>> https://github.com/tesseract-ocr/tesseract/issues/2695#issuecomment-539412948
>>>
>>> When I change line #51 from
>>>
>>> --traineddata ~/tesstutorial/tesseract/tessdata/best/ara.traineddata \
>>>
>>> to
>>>
>>> --traineddata ~/tesstutorial/araeval/ara/ara.traineddata
>>>
>>> it works without error.
>>> But I have another question:
>>> the unicharset in the best traineddata has 85 characters, while the
>>> newly generated one contains only 74.
>>> How can I keep the unicharset at 85, as in best?
>>>
>>>
>>> On Sunday, March 29, 2020 at 5:06:16 AM UTC+2, shree wrote:
>>>>
>>>> See
>>>> https://github.com/Shreeshrii/tess4training/blob/master/6-plusminus.sh
>>>>
>>>> lstmtraining --model_output ../tesstutorial/trainplusminus/plusminus \
>>>>   --continue_from ../tesstutorial/trainplusminus/eng.lstm \
>>>>   --traineddata ../tesstutorial/trainplusminus/eng/eng.traineddata \
>>>>   --old_traineddata tessdata/best/eng.traineddata \
>>>>   --train_listfile ../tesstutorial/trainplusminus/eng.training_files.txt \
>>>>   --max_iterations 3600
>>>>
>>>> ...
>>>>
>>>>
>>>> lstmtraining \
>>>>   --stop_training \
>>>>   --continue_from ../tesstutorial/trainplusminus/plusminus_checkpoint \
>>>>   --traineddata ../tesstutorial/trainplusminus/eng/eng.traineddata \
>>>>   --model_output ../tesstutorial/trainplusminus/eng_plusminus.traineddata
>>>>
>>>> --traineddata needs to be the same in both commands.
>>>>
>>>> On Sun, Mar 29, 2020 at 6:45 AM Shree Devi Kumar <[email protected]>
>>>> wrote:
>>>>
>>>>> Please check that you have used the correct path for the traineddata
>>>>> file.
>>>>>
>>>>> Please share the lstmtraining command that you used before this for
>>>>> training.
>>>>>
>>>>> On Sat, Mar 28, 2020, 22:56 Essam Zaky <[email protected]> wrote:
>>>>>
>>>>>> Dear @Shreeshrii
>>>>>> I followed your bash script to add the Andalus font to the Arabic
>>>>>> language. Here is the script URL:
>>>>>>
>>>>>> https://github.com/tesseract-ocr/tesseract/issues/2695#issuecomment-539412948
>>>>>>
>>>>>> All steps work except the last one, which generates the
>>>>>> traineddata. Here is the error:
>>>>>>
>>>>>> osboxes@osboxes:~/tesstutorial/tesseract$ time lstmtraining \
>>>>>> >   --stop_training \
>>>>>> >   --continue_from ~/tesstutorial/ara_from_full/PLUS_checkpoint \
>>>>>> >   --traineddata ~/tesstutorial/tesseract/tessdata/best/ara.traineddata \
>>>>>> >   --model_output ~/tesstutorial/ara_from_full/ara.Andalus.PLUS.traineddata
>>>>>> Loaded file /home/osboxes/tesstutorial/ara_from_full/PLUS_checkpoint, unpacking...
>>>>>> Code range changed from 74 to 85!
>>>>>> Must supply the old traineddata for code conversion!
>>>>>> Failed to read continue from: /home/osboxes/tesstutorial/ara_from_full/PLUS_checkpoint
>>>>>>
>>>>>>
>>>>>> Best Regards
>>>>>> Essam
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to [email protected].
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/0c9123f5-8e80-447c-9bf1-2c6ec9831238%40googlegroups.com
>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/0c9123f5-8e80-447c-9bf1-2c6ec9831238%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>> .
>>>>>>
>>>>>
>>>>
>>>> --
>>>>
>>>> ____________________________________________________________
>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>
>
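
To make the point quoted above concrete (--traineddata needs to be the same in the training command and the --stop_training command), here is a small sketch that takes the path from one variable and only prints both commands. The paths are the ones from the quoted plusminus example; the variable and function names are hypothetical, and nothing is actually run while the echo is in place.

```shell
# One variable feeds --traineddata in both steps, so they cannot drift apart.
TRAINEDDATA=../tesstutorial/trainplusminus/eng/eng.traineddata
OUT_DIR=../tesstutorial/trainplusminus

# Print (do not run) the fine-tuning command.
train_cmd() {
  echo lstmtraining \
    --model_output "$OUT_DIR/plusminus" \
    --continue_from "$OUT_DIR/eng.lstm" \
    --traineddata "$TRAINEDDATA" \
    --old_traineddata tessdata/best/eng.traineddata \
    --train_listfile "$OUT_DIR/eng.training_files.txt" \
    --max_iterations 3600
}

# Print (do not run) the command that turns the checkpoint into a traineddata file.
stop_cmd() {
  echo lstmtraining \
    --stop_training \
    --continue_from "$OUT_DIR/plusminus_checkpoint" \
    --traineddata "$TRAINEDDATA" \
    --model_output "$OUT_DIR/eng_plusminus.traineddata"
}

train_cmd
stop_cmd
```

Once the printed commands look right, dropping the echo from each function runs them for real.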


-- 

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduWug_3v%3Dzr4_6PszBFq-kgcjJF1bEAFLP%2BYvYcKYkMQ2g%40mail.gmail.com.
