Thanks for the additional insight! Sorry for one more question: does tesseract have any "dropping" logic? For example, if the certainty of a recognized character is below some threshold, is the result discarded so that an empty string is returned?
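To make the question concrete, here is a minimal Python sketch of the behaviour I am asking about. It is not tesseract code: the function name `drop_low_certainty`, the (character, certainty) pairs, and the threshold value are all hypothetical, and only illustrate the kind of logic that would turn a low-confidence result into an empty string.

```python
from typing import List, Tuple


# Hypothetical illustration only; none of these names come from tesseract.
def drop_low_certainty(chars: List[Tuple[str, float]], threshold: float) -> str:
    """Return the recognized text, or "" if any character is below threshold.

    `chars` holds (character, certainty) pairs; a higher certainty means a
    more confident recognition.
    """
    if any(certainty < threshold for _, certainty in chars):
        # The whole line is rejected, which would look like "no output".
        return ""
    return "".join(ch for ch, _ in chars)


# Made-up certainty values, chosen only to exercise both branches.
confident = [("A", -0.5), ("1", -1.0)]
doubtful = [("A", -0.5), ("1", -9.0)]
print(repr(drop_low_certainty(confident, threshold=-5.0)))
print(repr(drop_low_certainty(doubtful, threshold=-5.0)))
```

If something like this exists, it could explain why some of my test cases return an empty string instead of a wrong result.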
On Tuesday, January 9, 2018 at 7:26:33 PM UTC+8, shree wrote:
>
> Another suggestion, maybe it will help in your particular case of "one
> Chinese character + several English letters or digits"
>
> You could modify the numbers wordlist in langdata to have samples of this
> format - with all 30 chinese characters at start. If the English characters
> follow some pattern you can use that too.
>
> something like
>
> 支ABC...
> 部ABC...
> 支GME...
> 部GME...
> 支XYZ...
> 部XYZ...
>
> The ... indicate the portion used by digits. The number of spaces indicate
> the number of digits. Please look at langdata/eng/eng.numbers as a sample.
>
> ShreeDevi
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Tue, Jan 9, 2018 at 12:14 PM, Yang Yu <[email protected]> wrote:
>
>> I see. I will spend some time learning the structure of tesseract's
>> network and give it a try.
>>
>> Thanks for the help!
>>
>> On Tue, Jan 9, 2018 at 1:17 PM, ShreeDevi Kumar <[email protected]> wrote:
>>
>>> Fine-tune plus-minus will work for few character changes.
>>>
>>> You want to delete thousands of characters.
>>>
>>> Maybe you need replace top layer type of training.
>>>
>>> On 09-Jan-2018 7:32 AM, "Yang Yu" <[email protected]> wrote:
>>>
>>>> Thanks for your reply!
>>>>
>>>> The #iterations I always used is 2000/3000/5000/10000. Is it reasonable?
>>>>
>>>> I also try to extract dawg from HanS.traineddata and convert it to
>>>> wordlist, and use it to generate base traineddata to fine-tune. I have
>>>> confirmed that the new model's dawg->wordlist has the words that consist
>>>> of my limited unicharset, but the problem still exists.
>>>>
>>>> To give more background, my scenario is to recognize plate number from
>>>> vehicle license. The target image is something like "one Chinese character
>>>> + several English letters or digits" (see one example image below). So the
>>>> results are by design not some meaningful words. My training data has 5000
>>>> such plate numbers, one line for each as text. The reason why I want to
>>>> retrain is the fact that the number of possible Chinese character at
>>>> position 0 is limited to ~30.
>>>>
>>>> Am I doing anything wrong?
>>>>
>>>> [image: Inline image 1]
>>>>
>>>> On Mon, Jan 8, 2018 at 11:36 PM, ShreeDevi Kumar <[email protected]> wrote:
>>>>
>>>>> How many iterations did you use for training?
>>>>>
>>>>> You can unpack HanS.traineddata and then run dawg2word program to get
>>>>> the wordlists used in it. Try using these for langdata in addition to
>>>>> your training text.
>>>>>
>>>>> ShreeDevi
>>>>> ____________________________________________________________
>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>
>>>>> On Mon, Jan 8, 2018 at 6:30 PM, Yang Yu <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> These days I was working on fine-tuning a Chinese tesseract model
>>>>>> based on 4.0 LSTM, and it worked great when the unicharset is not
>>>>>> changed. But I found a problem when I applied it to a different
>>>>>> scenario.
>>>>>>
>>>>>> Basically in my new scenario, the target characters are very limited
>>>>>> - I only need to recognize less than 100 Chinese characters instead of
>>>>>> thousands. I find this
>>>>>> <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters>
>>>>>> link about how to use a different set of unicharset to achieve this.
>>>>>> Concretely, what I did is:
>>>>>> 1. Prepare some text with only the characters I need
>>>>>> 2. Run tesstrain.sh to generate images, and unicharset +
>>>>>>    traineddata + lstmf files (here I use chi_sim as langdata dir)
>>>>>> 3. Run fine tuning: continued from HanS.lstm which is extracted
>>>>>>    from HanS.traineddata, use the generated chi_sim.traineddata as base
>>>>>>    traineddata, and use HanS.traineddata as old_traineddata
>>>>>>
>>>>>> The training process is smooth. But when I applied this new model to
>>>>>> my evaluation set, I found that for some of my test cases, it worked
>>>>>> better; but for the rest, the model just output empty string. As
>>>>>> comparison, if I directly use a fine-tuned model based on
>>>>>> HanS.traineddata without changing the unicharset (say, just adding
>>>>>> some new lstmf files to fine tune), EVERY test cases can output
>>>>>> something (no matter it is correct or not).
>>>>>>
>>>>>> Personally I don't think it is related to overfitting, because even a
>>>>>> bad model should output something wrong. I'm not sure if it is related
>>>>>> to chi_sim under langdata - it seems that langdata for 4.0 is not
>>>>>> released yet, so chi_sim is the only thing I can use to fine-tune
>>>>>> HanS.traineddata model.
>>>>>>
>>>>>> Any help will be appreciated.
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to [email protected].
>>>>>> To post to this group, send email to [email protected].
>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/52093984-1415-4256-a2cd-268ed4141531%40googlegroups.com.
>>>>>> For more options, visit https://groups.google.com/d/optout.
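Regarding the wordlist suggestion quoted above: the pattern lines could be generated rather than typed by hand. Below is a rough Python sketch, not a tested recipe. The function name `build_number_patterns` is made up, the two Chinese characters and three letter groups are just the samples from the quoted reply, and writing one space per digit position is my reading of that reply, which I have not checked against langdata/eng/eng.numbers.

```python
from itertools import product
from typing import Iterable, List


def build_number_patterns(prefixes: Iterable[str],
                          letter_groups: Iterable[str],
                          digit_count: int) -> List[str]:
    """Combine every prefix with every letter group.

    Each digit position is written as one space (assumed placeholder
    convention, taken from the quoted suggestion).
    """
    placeholder = " " * digit_count
    return [f"{prefix}{letters}{placeholder}"
            for prefix, letters in product(prefixes, letter_groups)]


# Sample values from the thread; a real list would hold all ~30 characters.
patterns = build_number_patterns(["支", "部"], ["ABC", "GME", "XYZ"], digit_count=3)
for line in patterns:
    print(repr(line))  # repr() makes the trailing spaces visible
```

The resulting lines would then go into the numbers wordlist for the language before regenerating the traineddata.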

