Update:

1. When a smaller training_text is used for chi_sim plus training, the unicharset gets restricted, so merge the lstm-unicharset with it.

2. The unicharset built for chi_sim from langdata differs from the one extracted from tessdata_best, so using the training_text from langdata will add more characters.

3. The fonts used for LSTM training are listed in okfonts.txt in langdata_lstm. Plus training should use the same fonts; otherwise the model also has to learn new typefaces.

4. Another user tried to fine-tune chi_sim to add the theta sign (check old forum posts). If I remember correctly, plus-type training did not work for it; replacing the top layer was the better option.

5. I am training with the following fonts:

 "Adobe Heiti Std" \
 "Adobe Kaiti Std" \
 "Arial Unicode MS" \
 "Bitstream CyberCJK" \
 "Microsoft YaHei UI" \
 "Microsoft YaHei" \
 "NSimSun" \
 "Noto Sans CJK SC" \
 "Noto Sans Mono CJK SC" \
 "STXihei" \
 "SimSun" \
 "WenQuanYi Zen Hei Medium" \
 "WenQuanYi Zen Hei Mono Medium" \
 "WenQuanYi Zen Hei Sharp Medium" \

At iteration 1046/1100/1100, Mean rms=0.704%, delta=1.445%, char train=4.888%, word train=46.842%, skip ratio=0%, New best char error = 4.888 wrote best model:/home/ubuntu/tesstutorial/chi_sim_plus/chi_sim_plus4.888_1046.checkpoint wrote checkpoint.

On Wed, Jun 19, 2019 at 12:36 AM Jingjing Lin <[email protected]> wrote:

> Can you please test on arrows (↑ <https://en.wikipedia.org/wiki/%E2%86%91_(disambiguation)> or ↓ <https://en.wikipedia.org/wiki/%E2%86%93_(disambiguation)>) instead of ±, if it's not inconvenient for you?
>
> On Tuesday, June 18, 2019 at 2:21:18 PM UTC-4, shree wrote:
>>
>> I will test tomorrow and let you know.
>>
>> On Tue, 18 Jun 2019, 23:47 Jingjing Lin, <[email protected]> wrote:
>>
>>> It still didn't work after I increased the number of ± to about 100, and the error rate after 2000 iterations is about 11. This is a pretty high error rate compared to what we get when adding a few characters to eng. With such a high error rate, I would not be surprised that it couldn't recognize some special characters like ±. Is this it for chi_sim? Or can I increase the iterations to make the error rate smaller?
>>> Thanks for your help.
>>>
>>> On Tuesday, June 18, 2019 at 10:32:37 AM UTC-4, shree wrote:
>>>>
>>>> Increase the number of ± to about 100.
>>>>
>>>> On Tue, Jun 18, 2019 at 7:39 PM Jingjing Lin <[email protected]> wrote:
>>>>
>>>>> Sorry to bother you again and again.
>>>>> I reduced the training text to about 450 lines, with about 30 ± in it. I used two fonts and 1000 iterations. But it looks like ± is still not picked up in the BEST OCR TEXT at all; it is always recognized as something else. What is happening here? Should I increase the number of ±? Or do I need to increase the number of fonts? I'm trying to increase the iterations.
>>>>>
>>>>> On Tuesday, June 18, 2019 at 12:28:25 AM UTC-4, shree wrote:
>>>>>>
>>>>>> If you increase the iterations, the plus type of training will not give good results, i.e. the other letters will lose accuracy.
>>>>>>
>>>>>> You can try to reduce the training text size while still keeping all the characters that you need as part of the training text.
>>>>>>
>>>>>> On Tue, Jun 18, 2019 at 2:24 AM Jingjing Lin <[email protected]> wrote:
>>>>>>
>>>>>>> I was only using two different fonts, and the lowest error rate it achieved after training was 11.271. Does this mean I really need to increase the iterations?
>>>>>>>
>>>>>>> On Monday, June 17, 2019 at 2:16:31 PM UTC-4, shree wrote:
>>>>>>>>
>>>>>>>> How big was your training text? How many iterations? Did the fonts you used for training support the plus-minus sign?
>>>>>>>>
>>>>>>>> You can run training with --debug_interval -1 so that you can see in the console messages whether the plus-minus sign is being picked up for training.
>>>>>>>>
>>>>>>>> On Mon, 17 Jun 2019, 23:29 Jingjing Lin, <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Thanks. It works. The new character I added was there.
>>>>>>>>>
>>>>>>>>> Do you have any idea why, after fine tuning, tesseract still couldn't recognize the new character I added? When I tried to add '±' to eng it worked, but when I tried to add '±' to chi_sim it didn't (explained below). Is there anything we need to pay attention to when fine tuning languages other than eng?
>>>>>>>>>
>>>>>>>>> I used
>>>>>>>>>
>>>>>>>>> lstmeval --model ~/tesstutorial/trainplusminus/plusminus_checkpoint \
>>>>>>>>>   --traineddata ~/tesstutorial/trainplusminus/chi_sim/chi_sim.traineddata \
>>>>>>>>>   --eval_listfile ~/tesstutorial/evalplusminus/chi_sim.training_files.txt 2>&1 |
>>>>>>>>>   grep ±
>>>>>>>>>
>>>>>>>>> to check, and ± only shows up in Truth but not in OCR.
>>>>>>>>>
>>>>>>>>> On Monday, June 17, 2019 at 11:31:24 AM UTC-4, shree wrote:
>>>>>>>>>>
>>>>>>>>>> combine_tessdata -u new.traineddata new.
>>>>>>>>>>
>>>>>>>>>> will unpack the traineddata file. Check new.lstm-unicharset in it.
>>>>>>>>>>
>>>>>>>>>> On Monday, June 17, 2019 at 8:20:24 PM UTC+5:30, Jingjing Lin wrote:
>>>>>>>>>>>
>>>>>>>>>>> I tried to fine tune the model and add a new character via training, but it seems the new traineddata still can't recognize this character. To debug, I want to check whether the new character is in the .unicharset in the generated traineddata. Is there a way to do this?
--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduWkq8Qw032B7qS-nmnrTBN5uKJamkONYa8xwr3sYFvF4g%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
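Two checks come up repeatedly in this thread: whether a character such as ± actually made it into the unpacked lstm-unicharset, and how many times it occurs in the training text. A minimal sketch of both follows. It assumes the unicharset is a plain-text file whose first line is the entry count and whose remaining lines each begin with the character, and it relies on `grep -o`, which GNU and BSD grep provide. The helper names in_unicharset and count_char are hypothetical and not part of tesseract.

```shell
#!/bin/sh
# Hypothetical helpers; they only read plain-text files and do not call tesseract.

# in_unicharset CHAR FILE
# Succeeds if CHAR is the first field of some entry in FILE.
# Line 1 is skipped because it is assumed to hold the entry count.
in_unicharset() {
    awk -v c="$1" 'NR > 1 && ($1 "") == (c "") { found = 1 } END { exit !found }' "$2"
}

# count_char CHAR FILE
# Prints how many times CHAR occurs in FILE (for example a training text).
count_char() {
    { grep -o -F -- "$1" "$2" || true; } | wc -l | tr -d ' '
}

# Example usage, after unpacking as described in the thread:
#   combine_tessdata -u new.traineddata new.
#   in_unicharset '±' new.lstm-unicharset && echo "± is in the unicharset"
#   count_char '±' chi_sim.training_text    # example file name
```

If the character is missing from the unicharset, or appears only a handful of times in the training text, that is worth fixing before spending more iterations on training.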

