It still didn't work after I increased the number of ± to about 100, and the error rate after 2000 iterations is about 11. That is a pretty high error rate compared to what we get when adding a few characters to eng. With such a high error rate, I would not be surprised that it couldn't recognize some special characters like ±. Is this as good as it gets for chi_sim? Or can I increase the iterations to make the error rate smaller? Thanks for your help.
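For anyone following along, the two sanity checks discussed in this thread (how many ± the training text really contains, and whether ± made it into the unpacked unicharset) can be sketched roughly as below. This is only a minimal sketch, not part of the Tesseract tooling: the function names `count_char` and `in_unicharset` are made up for illustration, and the second one assumes that each entry line of the unpacked `.lstm-unicharset` file starts with the character followed by a space, with the first line holding the entry count.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; they are not Tesseract commands.

# Print how many times a character occurs in a text file.
# grep -o emits one line per match, so counting lines counts occurrences.
count_char() {
  local ch="$1" file="$2"
  grep -oF -- "$ch" "$file" | wc -l | tr -d ' '
}

# Succeed (exit status 0) if the character has an entry in an unpacked
# unicharset file, e.g. the new.lstm-unicharset written by
# `combine_tessdata -u new.traineddata new.`
# Assumption: line 1 is the entry count, and every later line begins with
# the character itself followed by a space and its properties.
in_unicharset() {
  local ch="$1" file="$2"
  awk -v c="$ch" 'NR > 1 && $1 == c { found = 1 } END { exit !found }' "$file"
}

# Example usage (the training text path is a placeholder):
#   count_char '±' path/to/training_text.txt
#   in_unicharset '±' new.lstm-unicharset && echo "± is in the unicharset"
```

If the count is as expected and the character is in the unicharset, the remaining suspects are the ones already raised in the thread: font support for ± and the number of iterations.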
On Tuesday, June 18, 2019 at 10:32:37 AM UTC-4, shree wrote:
>
> increase the number of ± to about 100
>
> On Tue, Jun 18, 2019 at 7:39 PM Jingjing Lin <[email protected]> wrote:
>
>> Sorry to bother you again and again.
>> I reduced the training text to about 450 lines, with about 30 ± in it. I
>> used two fonts and 1000 iterations. But it looks like ± is still not
>> picked up in the BEST OCR TEXT at all; it is always recognized as
>> something else. What is happening here? Should I increase the number of ±?
>> Or do I need to increase the number of fonts? I'm trying to increase the
>> iterations.
>>
>> On Tuesday, June 18, 2019 at 12:28:25 AM UTC-4, shree wrote:
>>>
>>> If you increase the iterations then the plus type of training will not
>>> give a good result, i.e. the other letters will lose accuracy.
>>>
>>> You can try to reduce the training text size while still keeping all the
>>> characters that you need as part of the training text.
>>>
>>> On Tue, Jun 18, 2019 at 2:24 AM Jingjing Lin <[email protected]> wrote:
>>>
>>>> I was only using two different fonts and it only achieved a lowest error
>>>> rate of 11.271 after the training. Does this mean I really need to
>>>> increase the iterations?
>>>>
>>>> On Monday, June 17, 2019 at 2:16:31 PM UTC-4, shree wrote:
>>>>>
>>>>> How big was your training text? How many iterations? Did the fonts you
>>>>> use for training support the plus minus sign?
>>>>>
>>>>> You can run training with -- debug-level of -1 so that you can see
>>>>> whether the plus minus is being picked for training in the console
>>>>> messages.
>>>>>
>>>>> On Mon, 17 Jun 2019, 23:29 Jingjing Lin, <[email protected]> wrote:
>>>>>
>>>>>> Thanks. It works. The new character I added was there.
>>>>>>
>>>>>> Do you have any idea why, after fine tuning, tesseract still couldn't
>>>>>> recognize the new character I added? When I tried to add '±' to eng it
>>>>>> worked, but when I tried to add '±' to chi_sim, it didn't work
>>>>>> (explained below).
>>>>>> Is there anything we need to pay attention to when fine tuning
>>>>>> langs other than eng?
>>>>>>
>>>>>> I used
>>>>>>
>>>>>> lstmeval --model ~/tesstutorial/trainplusminus/plusminus_checkpoint \
>>>>>>   --traineddata ~/tesstutorial/trainplusminus/chi_sim/chi_sim.traineddata \
>>>>>>   --eval_listfile ~/tesstutorial/evalplusminus/chi_sim.training_files.txt 2>&1 |
>>>>>>   grep ±
>>>>>>
>>>>>> to check, and ± only shows up in Truth but not in OCR.
>>>>>>
>>>>>> On Monday, June 17, 2019 at 11:31:24 AM UTC-4, shree wrote:
>>>>>>>
>>>>>>> combine_tessdata -u new.traineddata new.
>>>>>>>
>>>>>>> will unpack the traineddata file. Check new.lstm-unicharset in it.
>>>>>>>
>>>>>>> On Monday, June 17, 2019 at 8:20:24 PM UTC+5:30, Jingjing Lin wrote:
>>>>>>>>
>>>>>>>> I tried to fine tune the model and add a new character via
>>>>>>>> training, but it seems it still couldn't recognize this new
>>>>>>>> character using the newly generated traineddata. To debug, I want
>>>>>>>> to check whether this new character is in the .unicharset in the
>>>>>>>> newly generated traineddata. Is there a way to do this?
>>>>>>>>
>>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to [email protected].
>>>>>> To post to this group, send email to [email protected].
>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/d251e677-5f9d-4f8f-b41a-aa015538ca47%40googlegroups.com
>>>>>> For more options, visit https://groups.google.com/d/optout.
>>> --
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected]. Visit this group at https://groups.google.com/group/tesseract-ocr. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/d5d4c267-c6e4-41e6-b0ab-01391a1b666d%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.

