I see. I will spend some time learning the structure of tesseract's network and give it a try.
Thanks for the help!

On Tue, Jan 9, 2018 at 1:17 PM, ShreeDevi Kumar <[email protected]> wrote:

> Fine-tune plus-minus will work for a few character changes.
>
> You want to delete thousands of characters.
>
> Maybe you need the "replace top layer" type of training.
>
> On 09-Jan-2018 7:32 AM, "Yang Yu" <[email protected]> wrote:
>
>> Thanks for your reply!
>>
>> The #iterations I have used are 2000/3000/5000/10000. Is that reasonable?
>>
>> I also tried extracting the dawg from HanS.traineddata, converting it to a
>> wordlist, and using it to generate the base traineddata for fine-tuning. I
>> have confirmed that the new model's dawg->wordlist contains the words that
>> consist of my limited unicharset, but the problem still exists.
>>
>> To give more background, my scenario is recognizing plate numbers from
>> vehicle licenses. The target image is something like "one Chinese character
>> + several English letters or digits" (see one example image below). So the
>> results are by design not meaningful words. My training data has 5000
>> such plate numbers, one per line as text. The reason I want to retrain is
>> that the number of possible Chinese characters at position 0 is limited
>> to ~30.
>>
>> Am I doing anything wrong?
>>
>> [image: Inline image 1]
>>
>> On Mon, Jan 8, 2018 at 11:36 PM, ShreeDevi Kumar <[email protected]> wrote:
>>
>>> How many iterations did you use for training?
>>>
>>> You can unpack HanS.traineddata and then run the dawg2word program to get
>>> the wordlists used in it. Try using these for langdata in addition to your
>>> training text.
>>>
>>> ShreeDevi
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> On Mon, Jan 8, 2018 at 6:30 PM, Yang Yu <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> These days I have been working on fine-tuning a Chinese tesseract model
>>>> based on 4.0 LSTM, and it worked great when the unicharset was not
>>>> changed. But I found a problem when I applied it to a different scenario.
>>>>
>>>> Basically, in my new scenario the target characters are very limited -
>>>> I only need to recognize fewer than 100 Chinese characters instead of
>>>> thousands. I found this
>>>> <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters>
>>>> link about how to use a different unicharset to achieve this.
>>>> Concretely, what I did is:
>>>> 1. Prepare some text with only the characters I need
>>>> 2. Run tesstrain.sh to generate images, and unicharset +
>>>>    traineddata + lstmf files (here I use chi_sim as langdata dir)
>>>> 3. Run fine tuning: continue from HanS.lstm, which is extracted
>>>>    from HanS.traineddata, use the generated chi_sim.traineddata as base
>>>>    traineddata, and use HanS.traineddata as old_traineddata
>>>>
>>>> The training process is smooth. But when I applied this new model to my
>>>> evaluation set, I found that for some of my test cases it worked better,
>>>> but for the rest the model just outputs an empty string. By comparison,
>>>> if I directly use a fine-tuned model based on HanS.traineddata without
>>>> changing the unicharset (say, just adding some new lstmf files to fine
>>>> tune), EVERY test case outputs something (whether it is correct or not).
>>>>
>>>> Personally I don't think it is related to overfitting, because even a
>>>> bad model should output something wrong. I'm not sure if it is related to
>>>> chi_sim under langdata - it seems that langdata for 4.0 is not released
>>>> yet, so chi_sim is the only thing I can use to fine-tune the
>>>> HanS.traineddata model.
>>>>
>>>> Any help will be appreciated.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CADEYF-f9HVMccHWtHqGQLKP5UUkFHr8-cf6nWSH4w9orfi1wwQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
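
P.S. Since the thread hinges on a restricted unicharset, here is a minimal sketch of how the training text could be sanity-checked before running tesstrain.sh. It is not part of tesseract: the function name `check_training_text`, the regular expression, and the sample plates are all assumptions made for illustration. It assumes each line is "one Chinese character + one or more ASCII letters or digits", optionally restricts the first character to a known set, and reports the characters actually used plus any lines that do not fit.

```python
import re

# Assumed line format (not taken from tesseract): one CJK ideograph
# followed by one or more ASCII letters or digits.
PLATE_RE = re.compile(r"[\u4e00-\u9fff][A-Za-z0-9]+")


def check_training_text(lines, allowed_first=None):
    """Return (charset, bad_lines) for an iterable of training-text lines.

    charset   -- set of characters found in lines that match the format
    bad_lines -- list of (line_number, text) for lines that do not match
    """
    charset = set()
    bad_lines = []
    for number, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text:
            continue  # ignore blank lines
        ok = PLATE_RE.fullmatch(text) is not None
        if ok and allowed_first is not None:
            # Optionally restrict the leading Chinese character.
            ok = text[0] in allowed_first
        if ok:
            charset.update(text)
        else:
            bad_lines.append((number, text))
    return charset, bad_lines


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    sample = ["京A12345", "沪B6C789", "hello world", ""]
    charset, bad = check_training_text(sample, allowed_first={"京", "沪"})
    print("characters used:", "".join(sorted(charset)))
    print("lines to review:", bad)
```

Comparing the reported character set against the generated unicharset is one way to spot characters that slipped into (or dropped out of) the training text; it does not by itself explain the empty-string outputs.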

