Re: [tesseract-ocr] The problems about training eng+chinese

Shree Devi Kumar Tue, 19 Mar 2019 23:21:05 -0700

On Wed, Mar 20, 2019 at 9:57 AM 易鑫 <[email protected]> wrote:


> Thank you very much for your reply, your result is pretty good.
>
> You are right, I want to limit my unicharset.
> I want to ask you a few questions:
>
> 1. What pre-processing did you do? Only binarisation, rotation and
> deskewing?
>

I used IrfanView interactively: rotated to straighten the lines, converted
to a 2-color image and changed the dpi to 300.
I didn't test with the original image. Tesseract also does binarization.
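
For anyone who would rather script these steps than do them interactively,
the following is a rough sketch using ImageMagick as a substitute for
IrfanView. It is not the workflow described above. The function name, the
40% deskew setting and the 50% threshold are assumptions to adjust for your
own scans, and -deskew only corrects small rotation angles.

```shell
# preprocess_for_ocr SRC DST
# Hypothetical helper: deskew, reduce to 2 colors, and tag the image as 300 dpi.
# Uses ImageMagick's `convert` (on ImageMagick 7 the command is `magick`).
#   -deskew 40%       straighten slightly rotated text lines (assumed setting)
#   -colorspace Gray  drop color before thresholding
#   -threshold 50%    binarize to black and white (assumed cut-off)
#   -units/-density   record 300 pixels per inch in the output metadata
preprocess_for_ocr() {
  convert "$1" \
    -deskew 40% \
    -colorspace Gray \
    -threshold 50% \
    -units PixelsPerInch -density 300 \
    "$2"
}

# Example (file names are placeholders):
# preprocess_for_ocr scan.png scan_clean.png
```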

>
> 2. Your result, chi_sim_tuned.txt, also contains some characters that
> are not in the train_text file, such as "二" and "》:". Why?
>

I don't know. Probably they are there in the tessdata_best model and don't
get fully overwritten in finetuning.
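
One way to test this guess is to unpack both models and look the character
up in each unicharset. A minimal sketch, assuming combine_tessdata -u has
already written the components to disk; in_unicharset is a hypothetical
helper and all paths are placeholders.

```shell
# Unpack the components first (output directories must exist; paths are placeholders):
#   combine_tessdata -u ~/tessdata_best/chi_sim.traineddata /tmp/best/chi_sim.
#   combine_tessdata -u ~/tessdata_best/chi_sim_tuned.traineddata /tmp/tuned/chi_sim_tuned.
# Each unpack should leave an LSTM unicharset, e.g. /tmp/best/chi_sim.lstm-unicharset.

# in_unicharset CHAR FILE
# Succeeds if CHAR is the first field of any line in a unicharset file.
in_unicharset() {
  awk -v c="$1" '$1 == c { found = 1 } END { exit !found }' "$2"
}

# Example: report whether "二" is present in each model.
# in_unicharset "二" /tmp/best/chi_sim.lstm-unicharset && echo "in tessdata_best"
# in_unicharset "二" /tmp/tuned/chi_sim_tuned.lstm-unicharset && echo "in tuned model"
```

If the character shows up in both files, it was most likely carried over
from the base model rather than learned from the training text.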

>
> 3. How should I choose the "max_iterations" value? I usually start with a
> large number such as 10000 so that the model overfits, then reduce the
> value gradually until the model performs well.
>   Is there a good method for choosing max_iterations?
>

Ray's recommendation for fine-tuning for a font is 400 iterations; for
plus-minus tuning to add a character it is 3600. You should check an eval set
(different from the training set) around these numbers to find where the
error is lowest.
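
A sketch of that check, assuming you have saved checkpoints at several
iteration counts and built a separate eval list file that was not used for
training. eval_checkpoint and pick_min are hypothetical helpers; the
iteration/error pairs are read off the lstmeval output by hand.

```shell
# eval_checkpoint CHECKPOINT TRAINEDDATA EVAL_LISTFILE
# Runs lstmeval on one checkpoint against a held-out list file.
eval_checkpoint() {
  lstmeval \
    --model "$1" \
    --traineddata "$2" \
    --eval_listfile "$3"
}

# pick_min
# Reads "iterations error_rate" pairs on stdin and prints the iteration
# count with the lowest error rate (the first one wins on a tie).
pick_min() {
  awk 'NR == 1 || $2 + 0 < min { min = $2 + 0; best = $1 } END { print best }'
}

# Example with made-up numbers, for illustration only:
# printf '400 2.1\n1200 1.4\n3600 1.8\n' | pick_min
```

The iteration count it prints is the value to try for --max_iterations.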

>
> Shree Devi Kumar <[email protected]> wrote on Wed, Mar 20, 2019 at 11:18 AM:
>
>>
>> ~/tesseract/src/training/tesstrain.sh \
>> --fonts_dir ~/.fonts \
>> --training_text ~/langdata/chi_sim/chi_sim_tuned.txt \
>> --langdata_dir ~/langdata \
>> --tessdata_dir ~/tessdata \
>> --lang chi_sim --linedata_only \
>> --noextract_font_properties  \
>> --exposures "0" \
>> --workspace_dir ~/tmp \
>> --save_box_tiff \
>> --fontlist  \
>> "NSimSun" \
>> "Arial Unicode MS" \
>> "SimSun" \
>> "Merchant Copy" \
>> "Merchant Copy Doublesize" \
>> "Noto Sans CJK SC" \
>> "Noto Sans Mono CJK SC" \
>> --output_dir ~/tesstutorial/chi_sim_trainnew
>>
>>
>> mkdir -p ~/tesstutorial/chi_sim_tuned_from_chi_sim
>>
>> combine_tessdata -e ~/tessdata_best/chi_sim.traineddata \
>> ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim.lstm
>>
>> ~/tesseract/bin/src/training/lstmtraining \
>> --model_output ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim_tuned \
>> --continue_from ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim.lstm \
>> --traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata \
>> --old_traineddata ~/tessdata_best/chi_sim.traineddata \
>> --train_listfile ~/tesstutorial/chi_sim_train/chi_sim.training_files.txt \
>> --debug_interval -1 \
>> --max_iterations 3600
>>
>> ~/tesseract/bin/src/training/lstmtraining \
>> --stop_training \
>> --continue_from ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim_tuned_checkpoint \
>> --traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata \
>> --model_output ~/tessdata_best/chi_sim_tuned.traineddata
>>
>>
>> On Wed, Mar 20, 2019 at 8:46 AM Shree Devi Kumar <[email protected]>
>> wrote:
>>
>>> Also, 10000 iterations for finetuning will lead to overfitting.
>>>
>>> I tried by using fewer fonts and adding a couple of English only fonts
>>> that match the typeface of the image you shared. The output is improved
>>> compared to tessdata_best. I assume that you want to limit your unicharset
>>> based on your training_text (numbers, some English letters and some
>>> Simplified Chinese characters). The image was pre-processed to B&W and
>>> deskewed.
>>>
>>> I found that --psm 6 gives worse results for both tessdata_best and the
>>> finetuned model. The default (--psm 3) gives better accuracy, though it
>>> outputs multiple blank lines for the extra columns it identifies.
>>>
>>> See attached:
>>>
>>>
>>>
>>
>> --
>>
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduUeONc98a%3DMiGE1Y1PGKK-Jb5vinDTPnEF%2BMvPUkT0nmw%40mail.gmail.com
>> <https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduUeONc98a%3DMiGE1Y1PGKK-Jb5vinDTPnEF%2BMvPUkT0nmw%40mail.gmail.com?utm_medium=email&utm_source=footer>
>> .
>> For more options, visit https://groups.google.com/d/optout.
>>
>


-- 

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

