Re: [tesseract-ocr] Re: Retrain tesseract 4 model from real image (not from text file and tesstrain.sh)

Seokbong Choi Fri, 19 Oct 2018 19:03:02 -0700

Can you share the contents of the "eng.training_files.txt" file that the
--train_listfile argument refers to?
Thanks.
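
For comparison, the file given to --train_listfile is expected to be plain text with one .lstmf path per line. A minimal sketch for generating it, assuming all the .lstmf files sit in a single directory (make_listfile and its two arguments are made-up names, not part of Tesseract):

```shell
# make_listfile DIR OUT: write the absolute path of every .lstmf file
# directly inside DIR to OUT, one per line. Hypothetical helper.
make_listfile() {
    dir=$1
    out=$2
    # Absolute paths keep the list valid if lstmtraining is started elsewhere.
    find "$(cd "$dir" && pwd)" -maxdepth 1 -type f -name '*.lstmf' | sort > "$out"
    # An empty list means there is nothing to train on, so report it.
    [ -s "$out" ] || { echo "no .lstmf files found in $dir" >&2; return 1; }
}

# Example (directory name is an assumption):
# make_listfile ./train_data eng.training_files.txt
```

If the generated list differs from the hand-written one, the difference is a good place to start looking.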


On Fri, Oct 19, 2018 at 1:59 PM tu tonquang <[email protected]> wrote:

> I want my application to be able to recognize characters like 'Φ'.
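
For a character such as 'Φ' to be learnable, the unicharset built from the box files should contain an entry for it. A rough check, assuming the usual unicharset layout of a count on the first line followed by one entry per line that starts with the character (has_unichar is a made-up helper name):

```shell
# has_unichar CHAR FILE: succeed if some entry line of FILE has CHAR as
# its first whitespace-separated field. Assumes line 1 holds the count.
has_unichar() {
    awk -v c="$1" 'NR > 1 && $1 == c { found = 1 } END { exit !found }' "$2"
}

# Example: has_unichar 'Φ' unicharset && echo "Φ is in the unicharset"
```
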
>
> On Saturday, 20 October 2018 at 00:56:01 UTC+7, tu tonquang wrote:
>>
>> Hi,
>>
>> *I get an error when I follow this tutorial to retrain Tesseract:*
>>
>> I am following this link to retrain Tesseract with my own image dataset
>> (real images, not text rendered via tesstrain.sh):
>>
>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#creating-starter-traineddata
>>
>> These are my steps to retrain the Tesseract LSTM model:
>>
>>
>> *Step 1: I created my training data (.tif image + .box file) from my images.*
>> I generated the box files with this command: tesseract
>> [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop
>> makebox
>>
>>
>> *Step 2: I corrected the box files manually with Qt-box-editor*, following
>> this page:
>> https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-%E2%80%93-Make-Box-Files
>> So now I have these files:
>> .tif file
>> .box file
>> .lstmf file (generated by the command: tesseract
>> [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] lstm.train)
>> unicharset file
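
Since each training image needs its companion files, a quick consistency check over the directory can rule out a missing or empty .box or .lstmf. A minimal sketch, assuming all files share a base name and live in one directory (check_training_set is a made-up helper name):

```shell
# check_training_set DIR: report every .tif in DIR whose matching .box or
# .lstmf file is missing or empty; return non-zero if any problem is found.
check_training_set() {
    dir=$1
    missing=0
    for tif in "$dir"/*.tif; do
        [ -e "$tif" ] || continue        # the glob matched nothing
        base=${tif%.tif}
        for ext in box lstmf; do
            if [ ! -s "$base.$ext" ]; then
                echo "missing or empty: $base.$ext" >&2
                missing=1
            fi
        done
    done
    return "$missing"
}

# Example (directory name is an assumption): check_training_set ./train_data
```
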
>>
>>
>> *Step 3: I created the starter .traineddata with this command:*
>> combine_lang_model --input_unicharset unicharset --script_dir langdata
>> --output_dir output --lang "eng"
>> I downloaded langdata from here:
>> https://github.com/tesseract-ocr/langdata
>>
>>
>> *Step 4: I extracted the LSTM model from the existing traineddata with this command:*
>> combine_tessdata -e
>> /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata eng.lstm
>>
>>
>> *Step 5: I retrained Tesseract* (Fine Tuning for ± a few characters:
>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters)
>> with this command:
>> lstmtraining --model_output output_model --continue_from eng.lstm
>> --traineddata output_basic/eng/eng.traineddata --old_traineddata
>> /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata --train_listfile
>> eng.training_files.txt --debug_interval -1 --max_iterations 400
>>
>>    - This is the format of my eng.training_files.txt:
>>    path/to/lstmf
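
Before launching the Step 5 command it can help to confirm that every input file it names actually exists. Note that Step 3 uses --output_dir output while Step 5 reads output_basic/eng/eng.traineddata; whether those point to the same place can't be told from the post, and a check like this would flag it if they don't. A minimal sketch (require_files is a made-up helper name; the paths in the usage comment are copied from the steps above):

```shell
# require_files FILE...: return non-zero if any listed file is missing or empty.
require_files() {
    status=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty input: $f" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage with the paths from Step 5 (not executed here):
# require_files eng.lstm output_basic/eng/eng.traineddata \
#     /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata \
#     eng.training_files.txt \
#   && lstmtraining --model_output output_model --continue_from eng.lstm \
#        --traineddata output_basic/eng/eng.traineddata \
#        --old_traineddata /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata \
#        --train_listfile eng.training_files.txt \
#        --debug_interval -1 --max_iterations 400
```
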
>>
>> *I get an error like the following:*
>>
>> [image: Screenshot from 2018-10-19 21-49-00.png]
>> *This is an example of my training images:*
>> [image: eng.centurygothic.exp0.png]
>>
>>
>>
>>
>>
>> *I am trying to retrain Tesseract from real images (not from a text file via
>> tesstrain.sh).*
>>
>> Please share any ideas you have for fixing this.
>>
>>
>> Thank you in advance!
>>
>>
>> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/d08df2e0-ccc3-49bc-90ab-6588f9ab6ef3%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/d08df2e0-ccc3-49bc-90ab-6588f9ab6ef3%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>

