Plus-minus fine-tuning works when only a few characters change. You want to delete thousands of characters.
Maybe you need the "replace top layer" type of training instead.

On 09-Jan-2018 7:32 AM, "Yang Yu" <[email protected]> wrote:

> Thanks for your reply!
>
> The iteration counts I have used are 2000, 3000, 5000, and 10000. Is that
> reasonable?
>
> I also tried extracting the dawg from HanS.traineddata, converting it to a
> wordlist, and using that to generate the base traineddata for fine-tuning.
> I have confirmed that the new model's dawg->wordlist contains the words
> made up of my limited unicharset, but the problem still exists.
>
> To give more background, my scenario is recognizing plate numbers from
> vehicle licenses. The target image is something like "one Chinese character
> + several English letters or digits" (see one example image below), so the
> results are by design not meaningful words. My training data has 5000
> such plate numbers as text, one per line. The reason I want to retrain is
> that the number of possible Chinese characters at position 0 is limited
> to ~30.
>
> Am I doing anything wrong?
>
> [image: Inline image 1]
>
> On Mon, Jan 8, 2018 at 11:36 PM, ShreeDevi Kumar <[email protected]>
> wrote:
>
>> How many iterations did you use for training?
>>
>> You can unpack HanS.traineddata and then run the dawg2wordlist program to
>> get the wordlists used in it. Try using these for langdata in addition to
>> your training text.
>>
>> ShreeDevi
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Mon, Jan 8, 2018 at 6:30 PM, Yang Yu <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> These days I have been working on fine-tuning a Chinese tesseract model
>>> based on 4.0 LSTM, and it worked great when the unicharset was not
>>> changed. But I found a problem when I applied it to a different scenario.
>>>
>>> In my new scenario the target characters are very limited: I only need
>>> to recognize fewer than 100 Chinese characters instead of thousands.
>>> I found this link
>>> <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters>
>>> about how to use a different unicharset to achieve this.
>>> Concretely, what I did is:
>>>    1. Prepare some text with only the characters I need
>>>    2. Run tesstrain.sh to generate images, and unicharset + traineddata
>>>    + lstmf files (here I use chi_sim as langdata dir)
>>>    3. Run fine-tuning: continue from HanS.lstm, which is extracted from
>>>    HanS.traineddata, use the generated chi_sim.traineddata as the base
>>>    traineddata, and use HanS.traineddata as old_traineddata
>>>
>>> The training process is smooth. But when I applied this new model to my
>>> evaluation set, I found that it worked better for some of my test cases,
>>> but for the rest the model just output an empty string. By comparison, if
>>> I directly use a model fine-tuned from HanS.traineddata without changing
>>> the unicharset (say, just adding some new lstmf files to fine-tune), EVERY
>>> test case outputs something (whether or not it is correct).
>>>
>>> Personally I don't think it is related to overfitting, because even a
>>> bad model should output something wrong. I'm not sure if it is related to
>>> chi_sim under langdata. It seems that langdata for 4.0 is not released
>>> yet, so chi_sim is the only thing I can use to fine-tune the
>>> HanS.traineddata model.
>>>
>>> Any help will be appreciated.
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/52093984-1415-4256-a2cd-268ed4141531%40googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.
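
For reference, the training text described in the thread (one Chinese character followed by several letters or digits, one plate per line) could be generated with something like the sketch below. Everything specific in it is my assumption rather than something from the thread: the short `PROVINCES` sample, the tail length of six, the character pool in `TAIL_CHARS`, and the helper names `make_plate` and `make_training_text` are all hypothetical.

```python
import random
import string

# Assumption: a small illustrative sample of leading characters. The thread
# only says that about 30 Chinese characters can appear at position 0.
PROVINCES = "京津沪渝冀豫云辽黑湘"

# Assumption: the remaining positions use any uppercase letter or digit.
TAIL_CHARS = string.ascii_uppercase + string.digits


def make_plate(rng, tail_len=6):
    """Return one plate-like string: one Chinese character plus tail_len letters/digits."""
    head = rng.choice(PROVINCES)
    tail = "".join(rng.choice(TAIL_CHARS) for _ in range(tail_len))
    return head + tail


def make_training_text(n_lines=5000, seed=0):
    """Return n_lines plate-like strings joined by newlines, reproducible for a given seed."""
    rng = random.Random(seed)
    return "\n".join(make_plate(rng) for _ in range(n_lines))
```

Writing the result of `make_training_text()` to a UTF-8 file would give a training text in the same shape as the 5000-line file mentioned above. Sampling uniformly keeps every allowed leading character represented, which may or may not match the distribution of real plates.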

