Re: [tesseract-ocr] The Accuracy improvement of training the chi_sim.traineddata model

ShreeDevi Kumar Tue, 19 Sep 2017 02:10:19 -0700

If you unpack the traineddata file, the version string usually has the
network spec used for building the traineddata.


For chi_sim, I think Ray has also mentioned it in the wiki on the training
page.
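
A rough sketch of how that check might look, assuming the Tesseract
training tools are installed and chi_sim.traineddata is in the current
directory. The extract_netspec helper and the sample version string are
hypothetical, for illustration only. The bracketed spec shown is the example
from the training tutorial, not necessarily the one used for chi_sim.

```shell
# Unpack every component of the traineddata into files named chi_sim.<component>.
# This step is skipped when the tool or the file is not available.
if command -v combine_tessdata >/dev/null 2>&1 && [ -f chi_sim.traineddata ]; then
  combine_tessdata -u chi_sim.traineddata chi_sim.
  # The version component should be among the unpacked files.
  cat chi_sim.version
fi

# Hypothetical helper: print the bracketed network spec found in a version
# string, or nothing if the string contains no bracketed part.
extract_netspec() {
  printf '%s\n' "$1" | sed -n 's/.*\(\[.*\]\).*/\1/p'
}

# Illustrative version string only (assumed layout, not read from a real model).
version='4.00.00alpha:xxx:[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]'
extract_netspec "$version"
```

If the real version file has a different layout, only the sed pattern should
need adjusting.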

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Sep 19, 2017 at 2:28 PM, <[email protected]> wrote:

> Does fine-tuning update all the parameters in all of the layers?
>
> We need to add lots of mathematical symbols and some other special
> symbols. Maybe we should train from scratch instead?
>
> What char error rate and how many iterations should we expect if we train
> chi_sim (Simplified Chinese) from scratch?
>
>
>
> On Tuesday, September 19, 2017 at 4:49:30 PM UTC+8, shree wrote:
>>
>> As per comments by Ray, for fine-tuning, or for adding or removing a few
>> characters, the number of iterations should be limited to 3000 or so.
>>
>> It probably won't get to a 0.2% error rate, but you might get better
>> results.
>>
>> ShreeDevi
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Tue, Sep 19, 2017 at 2:00 PM, <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> I am training my own traineddata model for the chi_sim language by
>>> fine-tuning. My training data contains some mathematical symbols, such as
>>> "∞", "β", "△" and so on, which the official chi_sim.traineddata model
>>> cannot recognize.
>>>
>>> So we changed the content of the chi_sim.training_text file, filling it
>>> with our training data.
>>>
>>>
>>> Then we ran the training command:
>>> training/lstmtraining --model_output ~/tesstutorial/trainspecial/special \
>>>   --continue_from ~/tesstutorial/trainspecial/chi_sim.lstm \
>>>   --traineddata ~/tesstutorial/trainspecial/chi_sim/chi_sim.traineddata \
>>>   --old_traineddata tessdata/best/chi_sim.traineddata \
>>>   --train_listfile ~/tesstutorial/trainspecial/chi_sim.training_files.txt \
>>>   --max_iterations 400000
>>>
>>> With this command, after 400000 iterations the char error is about 0.2%
>>> and the word error is about 4.2%.
>>> The error rate has started to oscillate and no longer goes down, so we
>>> stopped training and exported the traineddata model.
>>>
>>> After testing the exported traineddata model, we found its accuracy is
>>> not satisfactory: it is lower than that of the official model (from the
>>> tesseract GitHub site).
>>>
>>> We hope our trained model's recognition accuracy can match that of the
>>> official model. How can we further improve the accuracy of the model?
>>>
>>> Does anyone know the details of how the official language model was
>>> trained, such as the number of iterations, the lowest char error and word
>>> error, the value of learning_rate, and so on?
>>>
>>> If you know any of this information, please give us some tips.
>>>
>>>
>>> Thank you.
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit https://groups.google.com/d/ms
>>> gid/tesseract-ocr/a9a25aeb-2182-41d5-9a69-aef34a92eb27%40goo
>>> glegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/a9a25aeb-2182-41d5-9a69-aef34a92eb27%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>> For more options, visit https://groups.google.com/d/optout.
>>>

