Re: [tesseract-ocr] Tesseract 4.0 outputs empty string after fine-tuning on different unicharset

Yang Yu Tue, 09 Jan 2018 06:27:35 -0800

Thanks for giving more insight!

Sorry for another question: is there any "dropping" logic in Tesseract 
(say, if the certainty of a recognized character is below some threshold, 
the result is discarded and an empty string is returned)?



On Tuesday, January 9, 2018 at 7:26:33 PM UTC+8, shree wrote:
>
> Another suggestion, maybe it will help in your particular case of "one 
> Chinese character + several English letters or digits"
>
> You could modify the numbers wordlist in langdata to include samples of this 
> format, with each of the ~30 Chinese characters at the start. If the English 
> characters follow some pattern, you can use that too.
>
> Something like:
>
> 支ABC...
> 部ABC...
> 支GME...
> 部GME...
> 支XYZ...
> 部XYZ...
>
> The ... marks the portion occupied by digits. In the actual file the digits 
> are written as spaces, so the number of spaces indicates the number of 
> digits. Please look at langdata/eng/eng.numbers as a sample.
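
A minimal sketch of how such pattern lines could be generated, assuming each digit position is written as one space as described above. The function name, the letter prefixes and the digit count are hypothetical and only mirror the example lines; this is not an official langdata tool.

```python
def plate_number_patterns(first_chars, letter_prefixes, digit_count):
    """Build one pattern line per (letter prefix, Chinese character) pair.

    Each digit position is written as a single space, following the
    convention described for langdata/eng/eng.numbers.
    """
    digits = " " * digit_count
    return [
        f"{ch}{prefix}{digits}"
        for prefix in letter_prefixes
        for ch in first_chars
    ]


# Hypothetical inputs taken from the example lines in the message.
patterns = plate_number_patterns(["支", "部"], ["ABC", "GME", "XYZ"], digit_count=3)

# The joined text is what would be saved as the numbers wordlist.
wordlist_text = "\n".join(patterns) + "\n"
```

When saving `wordlist_text`, make sure the editor or tooling does not strip trailing whitespace, since the trailing spaces carry the digit count.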
>
>
>
> ShreeDevi
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Tue, Jan 9, 2018 at 12:14 PM, Yang Yu <[email protected]> wrote:
>
>> I see. I will spend some time learning the structure of tesseract's 
>> network and give it a try.
>>
>> Thanks for the help!
>>
>> On Tue, Jan 9, 2018 at 1:17 PM, ShreeDevi Kumar <[email protected]> wrote:
>>
>>> Plus-minus fine-tuning works when only a few characters change.
>>>
>>> You want to delete thousands of characters.
>>>
>>> You may need the "replace top layer" type of training instead.
>>>
>>>
>>>
>>> On 09-Jan-2018 7:32 AM, "Yang Yu" <[email protected]> wrote:
>>>
>>>> Thanks for your reply!
>>>>
>>>> The numbers of iterations I have tried are 2000, 3000, 5000 and 10000. 
>>>> Is that reasonable?
>>>>
>>>> I also tried extracting the dawg from HanS.traineddata, converting it to 
>>>> a wordlist, and using that to generate the base traineddata for 
>>>> fine-tuning. I have confirmed that the wordlist recovered from the new 
>>>> model's dawg contains the words made up of characters in my limited 
>>>> unicharset, but the problem still exists.
>>>>
>>>> To give more background, my scenario is recognizing plate numbers from 
>>>> vehicle licenses. The target image is something like "one Chinese 
>>>> character + several English letters or digits" (see one example image 
>>>> below), so the results are by design not meaningful words. My training 
>>>> data has 5000 such plate numbers as text, one per line. The reason I want 
>>>> to retrain is that the number of possible Chinese characters at position 0 
>>>> is limited to ~30.
>>>>
>>>> Am I doing anything wrong?
>>>>
>>>> [image: Inline image 1]
>>>>
>>>>
>>>>
>>>> On Mon, Jan 8, 2018 at 11:36 PM, ShreeDevi Kumar <[email protected]> wrote:
>>>>
>>>>> How many iterations did you use for training?
>>>>>
>>>>> You can unpack HanS.traineddata and then run the dawg2wordlist program 
>>>>> to get the wordlists used in it. Try using these for langdata in 
>>>>> addition to your training text.
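
A rough sketch of the two commands involved, built as argument lists so they can be inspected before being passed to `subprocess.run`. The component file names (`HanS.lstm-unicharset`, `HanS.lstm-word-dawg`) and the output name are assumptions about what unpacking produces for this model; check the files actually written before running the second step.

```python
def unpack_command(traineddata, out_prefix):
    # combine_tessdata -u <traineddata> <output path prefix>
    return ["combine_tessdata", "-u", traineddata, out_prefix]


def dawg_to_wordlist_command(unicharset, dawg, wordlist):
    # dawg2wordlist <unicharset> <dawg file> <wordlist output file>
    return ["dawg2wordlist", unicharset, dawg, wordlist]


# Hypothetical file names; the unpacked component names are an assumption.
unpack_cmd = unpack_command("HanS.traineddata", "HanS.")
wordlist_cmd = dawg_to_wordlist_command(
    "HanS.lstm-unicharset",
    "HanS.lstm-word-dawg",
    "HanS.wordlist",
)

for cmd in (unpack_cmd, wordlist_cmd):
    print(" ".join(cmd))
```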
>>>>>
>>>>>
>>>>> On Mon, Jan 8, 2018 at 6:30 PM, Yang Yu <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have recently been fine-tuning a Chinese Tesseract model based on 
>>>>>> 4.0 LSTM, and it worked great as long as the unicharset was not 
>>>>>> changed. But I ran into a problem when I applied it to a different 
>>>>>> scenario.
>>>>>>
>>>>>> In my new scenario the target characters are very limited: I only need 
>>>>>> to recognize fewer than 100 Chinese characters instead of thousands. I 
>>>>>> found this link 
>>>>>> <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters>
>>>>>> about how to use a different unicharset to achieve this. Concretely, 
>>>>>> what I did is:
>>>>>>     1. Prepare some text containing only the characters I need.
>>>>>>     2. Run tesstrain.sh to generate images, plus the unicharset, 
>>>>>>        traineddata and lstmf files (here I use chi_sim as the 
>>>>>>        langdata dir).
>>>>>>     3. Run fine-tuning: continue from HanS.lstm, which is extracted 
>>>>>>        from HanS.traineddata, use the generated chi_sim.traineddata as 
>>>>>>        the base traineddata, and use HanS.traineddata as 
>>>>>>        old_traineddata.
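
Step 3 above can be sketched as the two commands below, built as argument lists rather than executed. The flags follow the fine-tuning example in the linked wiki section; every path, the output name and the iteration count are hypothetical placeholders, not values taken from this thread's actual setup.

```python
def extract_lstm_command(traineddata, lstm_out):
    # combine_tessdata -e <traineddata> <path for the extracted component>
    return ["combine_tessdata", "-e", traineddata, lstm_out]


def fine_tune_command(continue_from, new_traineddata, old_traineddata,
                      train_listfile, model_output, max_iterations):
    # new_traineddata carries the reduced unicharset; old_traineddata is the
    # model the LSTM was extracted from, so characters can be mapped across.
    return [
        "lstmtraining",
        "--continue_from", continue_from,
        "--traineddata", new_traineddata,
        "--old_traineddata", old_traineddata,
        "--train_listfile", train_listfile,
        "--model_output", model_output,
        "--max_iterations", str(max_iterations),
    ]


# Hypothetical paths and iteration count, for illustration only.
extract_cmd = extract_lstm_command("HanS.traineddata", "HanS.lstm")
train_cmd = fine_tune_command(
    continue_from="HanS.lstm",
    new_traineddata="train/chi_sim/chi_sim.traineddata",
    old_traineddata="HanS.traineddata",
    train_listfile="train/chi_sim.training_files.txt",
    model_output="output/plates",
    max_iterations=3000,
)

print(" ".join(extract_cmd))
print(" ".join(train_cmd))
```

Each list can be handed to `subprocess.run(cmd, check=True)` once the paths match the local layout.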
>>>>>>
>>>>>> The training process ran smoothly. But when I applied the new model to 
>>>>>> my evaluation set, I found that it worked better for some of my test 
>>>>>> cases, while for the rest the model just output an empty string. By 
>>>>>> comparison, if I directly use a model fine-tuned from HanS.traineddata 
>>>>>> without changing the unicharset (say, just adding some new lstmf files 
>>>>>> for fine-tuning), EVERY test case outputs something (whether it is 
>>>>>> correct or not).
>>>>>>
>>>>>> Personally I don't think it is related to overfitting, because even a 
>>>>>> bad model should output something, even if it is wrong. I'm not sure 
>>>>>> whether it is related to chi_sim under langdata: it seems that langdata 
>>>>>> for 4.0 has not been released yet, so chi_sim is the only thing I can 
>>>>>> use to fine-tune the HanS.traineddata model.
>>>>>>
>>>>>> Any help will be appreciated.
>>>>>>
>>>>>> -- 
>>>>>> You received this message because you are subscribed to the Google 
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>> send an email to [email protected].
>>>>>> To post to this group, send email to [email protected].
>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>> To view this discussion on the web visit 
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/52093984-1415-4256-a2cd-268ed4141531%40googlegroups.com.
>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>
>>>>>
>>>>
>>
>
>

