Thank you very much.

On Wed, Mar 20, 2019 at 2:20 PM Shree Devi Kumar <[email protected]> wrote:
> On Wed, Mar 20, 2019 at 9:57 AM 易鑫 <[email protected]> wrote:
>
>> Thank you very much for your reply, your result is pretty good.
>>
>> You are right, I want to limit my unicharset.
>> I want to ask you a few questions:
>>
>> 1. What pre-processing have you done? Only binarisation, rotation and
>> deskewing?
>
> I used IrfanView interactively. Rotated to straighten the lines, converted
> to a 2-color image and changed the dpi to 300.
> I didn't test with the original image. Tesseract also does binarization.
>
>> 2. Your result, chi_sim_tuned.txt, also contains some characters that
>> are not in the train_text file, such as "二" and "》:". Why?
>
> I don't know. Probably they are there in the tessdata_best model and don't
> get fully overwritten in finetuning.
>
>> 3. How do I choose the "max_iterations" value? I usually start with a
>> large number such as 10000 so that the model overfits, then reduce the
>> value gradually until the model is good.
>> Is there a good method to choose max_iterations?
>
> Ray's recommendation for finetuning for a font is 400 iterations; for
> plus-minus tuning to add a character it is 3600. You should check an eval
> set (different from the training set) around these numbers to find where
> the error is lowest.
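The eval-set check suggested above could be scripted roughly as follows. This is only a sketch, not something from the thread: it assumes lstmtraining left intermediate checkpoints with a .checkpoint suffix in the model output directory, and that a separate eval list file was generated from text that is not in the training set. The function name eval_checkpoints and the LSTMEVAL variable are made up for illustration; the lstmeval options (--model, --traineddata, --eval_listfile) are the ones from the Tesseract training tutorial.

```shell
#!/usr/bin/env bash
# Sketch: run lstmeval on every saved checkpoint so the error rates on a
# held-out eval set can be compared by hand.
# LSTMEVAL is a hypothetical override for the path to the lstmeval binary.
LSTMEVAL="${LSTMEVAL:-lstmeval}"

eval_checkpoints() {
  # $1: directory with *.checkpoint files (assumed naming)
  # $2: the .traineddata file used for training
  # $3: list file of eval .lstmf files, different from the training list
  local model_dir="$1" traineddata="$2" eval_list="$3" ckpt
  for ckpt in "$model_dir"/*.checkpoint; do
    [ -e "$ckpt" ] || continue   # the glob matched nothing
    echo "== $ckpt"
    # lstmeval reports its error rates on stderr, so merge it into stdout.
    "$LSTMEVAL" --model "$ckpt" \
      --traineddata "$traineddata" \
      --eval_listfile "$eval_list" 2>&1
  done
}
```

It would be called with something like: eval_checkpoints ~/tesstutorial/chi_sim_tuned_from_chi_sim ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata ~/tesstutorial/chi_sim_eval/chi_sim.training_files.txt, where the chi_sim_eval path is hypothetical. The checkpoint with the lowest eval error near 400 or 3600 iterations is the one to keep.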
>>
>> On Wed, Mar 20, 2019 at 11:18 AM Shree Devi Kumar <[email protected]> wrote:
>>
>>> # Generate line training data for the selected fonts.
>>> ~/tesseract/src/training/tesstrain.sh \
>>>   --fonts_dir ~/.fonts \
>>>   --training_text ~/langdata/chi_sim/chi_sim_tuned.txt \
>>>   --langdata_dir ~/langdata \
>>>   --tessdata_dir ~/tessdata \
>>>   --lang chi_sim --linedata_only \
>>>   --noextract_font_properties \
>>>   --exposures "0" \
>>>   --workspace_dir ~/tmp \
>>>   --save_box_tiff \
>>>   --fontlist \
>>>   "NSimSun" \
>>>   "Arial Unicode MS" \
>>>   "SimSun" \
>>>   "Merchant Copy" \
>>>   "Merchant Copy Doublesize" \
>>>   "Noto Sans CJK SC" \
>>>   "Noto Sans Mono CJK SC" \
>>>   --output_dir ~/tesstutorial/chi_sim_trainnew
>>>
>>> mkdir -p ~/tesstutorial/chi_sim_tuned_from_chi_sim
>>>
>>> # Extract the LSTM model from the tessdata_best traineddata.
>>> combine_tessdata -e ~/tessdata_best/chi_sim.traineddata \
>>>   ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim.lstm
>>>
>>> # Finetune from the extracted model.
>>> ~/tesseract/bin/src/training/lstmtraining \
>>>   --model_output ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim_tuned \
>>>   --continue_from ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim.lstm \
>>>   --traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata \
>>>   --old_traineddata ~/tessdata_best/chi_sim.traineddata \
>>>   --train_listfile ~/tesstutorial/chi_sim_train/chi_sim.training_files.txt \
>>>   --debug_interval -1 \
>>>   --max_iterations 3600
>>>
>>> # Convert the final checkpoint into a traineddata file.
>>> ~/tesseract/bin/src/training/lstmtraining \
>>>   --stop_training \
>>>   --continue_from ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim_tuned_checkpoint \
>>>   --traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata \
>>>   --model_output ~/tessdata_best/chi_sim_tuned.traineddata
>>>
>>> On Wed, Mar 20, 2019 at 8:46 AM Shree Devi Kumar <[email protected]> wrote:
>>>
>>>> Also, 10000 iterations for finetuning will lead to overfitting.
>>>>
>>>> I tried using fewer fonts and adding a couple of English-only fonts
>>>> that match the typeface of the image you shared.
>>>> The output is improved compared to tessdata_best. I assume that you want
>>>> to limit your unicharset based on your training_text (numbers, some
>>>> English letters and some Simplified Chinese characters). The image was
>>>> pre-processed to B&W and deskewed.
>>>>
>>>> I found that --psm 6 gives worse results for both tessdata_best and the
>>>> finetuned model. The default psm (--psm 3) gives better accuracy, though
>>>> it identifies extra columns and outputs multiple blank lines for them.
>>>>
>>>> See attached:
>>>
>>> --
>>>
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduUeONc98a%3DMiGE1Y1PGKK-Jb5vinDTPnEF%2BMvPUkT0nmw%40mail.gmail.com
>>> For more options, visit https://groups.google.com/d/optout.
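The --psm comparison described in the quoted message could be repeated with a small loop like the one below. It is a sketch under assumptions: the function name compare_psm and the TESSERACT and TESSDATA_DIR variables are made up for illustration, and it assumes both chi_sim.traineddata and the finetuned chi_sim_tuned.traineddata sit in ~/tessdata_best, which is where the lstmtraining --stop_training command in the thread writes the tuned model.

```shell
#!/usr/bin/env bash
# Sketch: OCR one image with each model and each page segmentation mode so
# the text outputs can be compared side by side.
# TESSERACT and TESSDATA_DIR are hypothetical overrides.
TESSERACT="${TESSERACT:-tesseract}"
TESSDATA_DIR="${TESSDATA_DIR:-$HOME/tessdata_best}"

compare_psm() {
  # $1: path to the pre-processed image
  local image="$1" base lang psm
  base="${image%.*}"   # strip the file extension for the output base name
  for lang in chi_sim chi_sim_tuned; do
    for psm in 3 6; do
      # Writes <base>_<lang>_psm<N>.txt next to the image.
      "$TESSERACT" "$image" "${base}_${lang}_psm${psm}" \
        --tessdata-dir "$TESSDATA_DIR" -l "$lang" --psm "$psm"
    done
  done
}
```

Running compare_psm page.png (page.png being a placeholder file name) would leave four text files, for example page_chi_sim_tuned_psm3.txt, which can then be diffed against each other.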

