Thank you very much.

On Wed, Mar 20, 2019 at 2:20 PM, Shree Devi Kumar <[email protected]> wrote:

> On Wed, Mar 20, 2019 at 9:57 AM 易鑫 <[email protected]> wrote:
>
>> Thank you very much for your reply, your result is pretty good.
>>
>> You are right, I want to limit my unicharset.
>> I want to ask you a few questions:
>>
>> 1. What pre-processing did you do? Only binarization, rotation and
>> deskewing?
>>
>
> I used IrfanView interactively: rotated to straighten the lines, converted
> to a 2-color image and changed the DPI to 300.
> I didn't test with the original image. Tesseract also does binarization.
>
>>
>> 2. Your result, chi_sim_tuned.txt, also contains some characters that
>> are not in the train_text file, such as "二" and "》:". Why?
>>
>
> I don't know. Probably they are there in the tessdata_best model and don't
> get fully overwritten in finetuning.
>
>>
>> 3. How should I choose the "max_iterations" value? I usually start with a
>> large number, such as 10000, so that the model overfits, then reduce the
>> value gradually until the model is good.
>>   Is there a better method to choose max_iterations?
>>
>
> Ray's recommendation for finetuning for a font is 400 iterations; for
> plus-minus tuning to add a character it is 3600. You should check an eval
> set (different from the training set) around these numbers to find the
> minimum error.
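
Finding that minimum can be sketched as follows, assuming several
checkpoints have already been evaluated on a held-out list (for example
with lstmeval) and the results written one "<iterations> <char error rate>"
pair per line. The function name pick_min_error and the two-column input
format are illustrative assumptions, not part of Tesseract:

```shell
# pick_min_error: hypothetical helper, not a Tesseract tool.
# Reads "<iterations> <char_error_rate>" pairs from stdin, one per line,
# and prints the iteration count whose error rate is lowest.
# On a tie, the pair that appears first in the input wins.
pick_min_error() {
  awk 'NF >= 2 && (best == "" || $2 + 0 < min) {
         min = $2 + 0   # lowest error rate seen so far
         best = $1      # iteration count that produced it
       }
       END { if (best != "") print best }'
}
```

It could be called as "pick_min_error < eval_results.txt", where
eval_results.txt is a made-up file name, and the printed value used for
--max_iterations.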
>
>>
>> On Wed, Mar 20, 2019 at 11:18 AM, Shree Devi Kumar <[email protected]> wrote:
>>
>>>
>>> ~/tesseract/src/training/tesstrain.sh \
>>> --fonts_dir ~/.fonts \
>>> --training_text ~/langdata/chi_sim/chi_sim_tuned.txt \
>>> --langdata_dir ~/langdata \
>>> --tessdata_dir ~/tessdata \
>>> --lang chi_sim --linedata_only \
>>> --noextract_font_properties  \
>>> --exposures "0" \
>>> --workspace_dir ~/tmp \
>>> --save_box_tiff \
>>> --fontlist  \
>>> "NSimSun" \
>>> "Arial Unicode MS" \
>>> "SimSun" \
>>> "Merchant Copy" \
>>> "Merchant Copy Doublesize" \
>>> "Noto Sans CJK SC" \
>>> "Noto Sans Mono CJK SC" \
>>> --output_dir ~/tesstutorial/chi_sim_trainnew
>>>
>>>
>>> mkdir -p ~/tesstutorial/chi_sim_tuned_from_chi_sim
>>>
>>> combine_tessdata -e ~/tessdata_best/chi_sim.traineddata \
>>> ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim.lstm
>>>
>>> ~/tesseract/bin/src/training/lstmtraining \
>>> --model_output ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim_tuned \
>>> --continue_from ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim.lstm \
>>> --traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata \
>>> --old_traineddata ~/tessdata_best/chi_sim.traineddata \
>>> --train_listfile ~/tesstutorial/chi_sim_train/chi_sim.training_files.txt \
>>> --debug_interval -1 \
>>> --max_iterations 3600
>>>
>>> ~/tesseract/bin/src/training/lstmtraining \
>>> --stop_training \
>>> --continue_from ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim_tuned_checkpoint \
>>> --traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata \
>>> --model_output ~/tessdata_best/chi_sim_tuned.traineddata
>>>
>>>
>>> On Wed, Mar 20, 2019 at 8:46 AM Shree Devi Kumar <[email protected]>
>>> wrote:
>>>
>>>> Also, 10000 iterations for finetuning will lead to overfitting.
>>>>
>>>> I tried using fewer fonts and adding a couple of English-only fonts
>>>> that match the typeface of the image you shared. The output is improved
>>>> compared to tessdata_best. I assume that you want to limit your unicharset
>>>> based on your training_text (numbers, some English letters and some
>>>> Simplified Chinese characters). The image was pre-processed to B&W and
>>>> deskewed.
>>>>
>>>> I found that --psm 6 gives worse results for both tessdata_best and the
>>>> finetuned model. The default psm (--psm 3) gives better accuracy, though
>>>> it outputs multiple blank lines for the extra columns it identifies.
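
To reproduce this kind of comparison, something like the following could
run one image through both modes. This is only a sketch: compare_psm, the
output base names and the TESSERACT_BIN override are assumptions made for
illustration. Only the tesseract options (--tessdata-dir, -l, --psm) are
real.

```shell
# compare_psm: hypothetical helper that OCRs one image with --psm 3 and
# --psm 6 so the two text outputs can be compared side by side.
#   $1 = image file, $2 = language/model name, $3 = tessdata directory
# TESSERACT_BIN is an assumed override (e.g. "echo" for a dry run);
# it defaults to the real tesseract binary.
compare_psm() {
  img=$1
  lang=$2
  datadir=$3
  for psm in 3 6; do
    # Writes out_psm3.txt and out_psm6.txt in the current directory.
    "${TESSERACT_BIN:-tesseract}" "$img" "out_psm${psm}" \
      --tessdata-dir "$datadir" -l "$lang" --psm "$psm" || return 1
  done
}
```

For the model built above, the call might look like
"compare_psm page.png chi_sim_tuned ~/tessdata_best", where page.png is a
placeholder for your own image, followed by a diff of the two output files.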
>>>>
>>>> See attached:
>>>>
>>>>
>>>>
>>>
>>> --
>>>
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduUeONc98a%3DMiGE1Y1PGKK-Jb5vinDTPnEF%2BMvPUkT0nmw%40mail.gmail.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduUeONc98a%3DMiGE1Y1PGKK-Jb5vinDTPnEF%2BMvPUkT0nmw%40mail.gmail.com?utm_medium=email&utm_source=footer>
>>> .
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>
>

