Re: [tesseract-ocr] The Accuracy improvement of training the chi_sim.traineddata model

robertyoung0511 Tue, 19 Sep 2017 01:59:18 -0700

Does fine-tuning update all the parameters in all of the layers?

We need to add many mathematical symbols and some other special symbols. 
Maybe we should train from scratch instead?


What char error rate and iteration count should we expect if we train 
chi_sim (Simplified Chinese) from scratch?



On Tuesday, September 19, 2017 at 4:49:30 PM UTC+8, shree wrote:
>
> As per Ray's comments, for fine-tuning, or for adding or removing a few
> characters, the number of iterations should be limited to 3000 or so.
>
> It probably won't get to 0.2% char error, but you might get better results.
>
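A short fine-tuning run along those lines might look like the sketch below. It reuses the paths and flags from the original command further down; the `DIR` variable, the `finetune_cmd` helper, and the `RUN` switch are illustrative names only, and 3000 is just the rough cap mentioned above, not an official setting.

```shell
# Sketch only: assemble a short fine-tuning command for review.
# DIR, finetune_cmd and RUN are hypothetical names; paths come from the original command.
DIR="${DIR:-$HOME/tesstutorial/trainspecial}"

finetune_cmd() {
  # $1: iteration cap (defaults to the ~3000 suggested for fine-tuning)
  max_iter="${1:-3000}"
  printf '%s ' training/lstmtraining \
    --model_output "$DIR/special" \
    --continue_from "$DIR/chi_sim.lstm" \
    --traineddata "$DIR/chi_sim/chi_sim.traineddata" \
    --old_traineddata tessdata/best/chi_sim.traineddata \
    --train_listfile "$DIR/chi_sim.training_files.txt" \
    --max_iterations "$max_iter"
  printf '\n'
}

# Print the command; execute it only when RUN=1 is set explicitly.
finetune_cmd 3000
if [ "${RUN:-0}" = 1 ]; then
  eval "$(finetune_cmd 3000)"
fi
```

Printing first makes it easy to check the paths before committing to a run.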
> ShreeDevi
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Tue, Sep 19, 2017 at 2:00 PM, <[email protected]> wrote:
>
>> Hello,
>>
>> I am training my own traineddata model for the chi_sim language by 
>> fine-tuning. My training data contains some mathematical symbols, 
>> such as "∞", "β", "△" and so on, which the official 
>> chi_sim.traineddata model cannot recognize.
>>
>> So we changed the content of the chi_sim.training_text file and filled the 
>> file with our training data.
>>
>>
>> We then ran the training command:
>> training/lstmtraining --model_output ~/tesstutorial/trainspecial/special \
>>   --continue_from ~/tesstutorial/trainspecial/chi_sim.lstm \
>>   --traineddata ~/tesstutorial/trainspecial/chi_sim/chi_sim.traineddata \
>>   --old_traineddata tessdata/best/chi_sim.traineddata \
>>   --train_listfile ~/tesstutorial/trainspecial/chi_sim.training_files.txt \
>>   --max_iterations 400000
>>
>> With this command, after 400000 iterations the char error is about 
>> 0.2% and the word error is about 4.2%. 
>> The error rate has started to oscillate and no longer goes down, so 
>> we stopped training and exported the traineddata model.
>>
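For reference, exporting a standalone model from a checkpoint is normally done with `lstmtraining --stop_training`. The sketch below assumes the checkpoint was written as `special_checkpoint` (the `--model_output` prefix plus `_checkpoint`) and that the output name `special.traineddata` is acceptable; `DIR` and `export_cmd` are illustrative names.

```shell
# Sketch only: build the export command from the latest checkpoint.
# Assumes the checkpoint is <model_output>_checkpoint, i.e. special_checkpoint.
DIR="${DIR:-$HOME/tesstutorial/trainspecial}"

export_cmd() {
  printf '%s ' training/lstmtraining --stop_training \
    --continue_from "$DIR/special_checkpoint" \
    --traineddata "$DIR/chi_sim/chi_sim.traineddata" \
    --model_output "$DIR/special.traineddata"
  printf '\n'
}

# Prints the command so the paths can be checked before running it.
export_cmd
```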
>> When we tested the exported traineddata model, its accuracy was not 
>> satisfactory: it is lower than that of the official model (from the 
>> tesseract GitHub site).
>>
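One way to make this comparison concrete is to score both models on the same held-out list with `lstmeval`. The sketch below is only an illustration: the evaluation list `chi_sim.eval_files.txt` is hypothetical (the thread only mentions a training list), as are the `DIR`, `EVAL_LIST`, and `eval_cmd` names. It omits `--traineddata` when the model is already a `.traineddata` file, which is how the official model is usually evaluated.

```shell
# Sketch only: build lstmeval commands for a checkpoint and for the official model.
# EVAL_LIST points at a hypothetical held-out list of .lstmf files.
DIR="${DIR:-$HOME/tesstutorial/trainspecial}"
EVAL_LIST="${EVAL_LIST:-$DIR/chi_sim.eval_files.txt}"

eval_cmd() {
  # $1: model to score, either a training checkpoint or a .traineddata file
  case "$1" in
    *.traineddata)
      printf '%s ' training/lstmeval --model "$1" \
        --eval_listfile "$EVAL_LIST" ;;
    *)
      # A checkpoint also needs the traineddata it was trained against.
      printf '%s ' training/lstmeval --model "$1" \
        --traineddata "$DIR/chi_sim/chi_sim.traineddata" \
        --eval_listfile "$EVAL_LIST" ;;
  esac
  printf '\n'
}

eval_cmd "$DIR/special_checkpoint"
eval_cmd tessdata/best/chi_sim.traineddata
```

Evaluating on pages that were not used for training gives a fairer picture than the error rate reported during training.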
>> We hope the trained model's recognition accuracy can match that of the 
>> official model. How can we further improve the accuracy of the model?
>>
>> Does anyone know the details of how the official language models were 
>> trained, such as the number of iterations, the lowest char error and word 
>> error, the learning_rate value, and so on?
>>
>> If you know any of this information, please share some tips.
>>
>>
>> Thank you.
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/a9a25aeb-2182-41d5-9a69-aef34a92eb27%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>

