Re: [tesseract-ocr] Tesseract 4.0 outputs empty string after fine-tuning on different unicharset

ShreeDevi Kumar Tue, 09 Jan 2018 03:27:07 -0800

Another suggestion that may help in your particular case of "one
Chinese character + several English letters or digits":


You could modify the numbers wordlist in langdata to include samples of
this format, each starting with one of the 30 Chinese characters. If the
English characters follow some pattern, you can encode that too.

Something like:

支ABC...
部ABC...
支GME...
部GME...
支XYZ...
部XYZ...

The ... marks the portion taken up by digits; the number of spaces indicates
the number of digits. Please look at langdata/eng/eng.numbers as a sample.
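
As a rough illustration of the idea above, here is a minimal Python sketch
that generates such pattern lines. It assumes, as described in this message,
that each digit position is written as one space. The function name, the
two lead characters, the letter groups, and the digit count are hypothetical
examples, not values taken from any langdata file.

```python
from itertools import product


def number_patterns(lead_chars, letter_groups, digit_count):
    """Build one pattern line per (lead character, letter group) pair.

    Assumption: every digit position is represented by a single space,
    so digit_count spaces are appended after the letters.
    """
    placeholder = " " * digit_count
    return [
        lead + letters + placeholder
        for lead, letters in product(lead_chars, letter_groups)
    ]


if __name__ == "__main__":
    # Hypothetical inputs: extend lead_chars to all ~30 plate characters.
    lead_chars = ["支", "部"]
    letter_groups = ["ABC", "GME", "XYZ"]
    for line in number_patterns(lead_chars, letter_groups, digit_count=3):
        # repr() makes the trailing spaces visible when printed.
        print(repr(line))
```

The resulting lines could then be appended to the numbers wordlist for the
language being trained, after checking the format against
langdata/eng/eng.numbers.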



ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Jan 9, 2018 at 12:14 PM, Yang Yu <[email protected]> wrote:

> I see. I will spend some time learning the structure of tesseract's
> network and give it a try.
>
> Thanks for the help!
>
> On Tue, Jan 9, 2018 at 1:17 PM, ShreeDevi Kumar <[email protected]>
> wrote:
>
>> Plus-minus fine-tuning works when only a few characters change.
>>
>> You want to delete thousands of characters.
>>
>> You may need "replace top layer" training instead.
>>
>>
>>
>> On 09-Jan-2018 7:32 AM, "Yang Yu" <[email protected]> wrote:
>>
>>> Thanks for your reply!
>>>
>>> The iteration counts I have used are 2000, 3000, 5000, and 10000. Is that
>>> reasonable?
>>>
>>> I also tried extracting the dawg from HanS.traineddata, converting it to a
>>> wordlist, and using it to generate the base traineddata for fine-tuning. I
>>> have confirmed that the new model's dawg->wordlist contains the words made
>>> up of my limited unicharset, but the problem still exists.
>>>
>>> To give more background, my scenario is recognizing the plate number on a
>>> vehicle license. The target image is something like "one Chinese character
>>> + several English letters or digits" (see one example image below), so the
>>> results are by design not meaningful words. My training data has 5000
>>> such plate numbers as text, one per line. The reason I want to retrain is
>>> that the number of possible Chinese characters at position 0 is limited
>>> to ~30.
>>>
>>> Am I doing anything wrong?
>>>
>>> [image: Inline image 1]
>>>
>>>
>>>
>>> On Mon, Jan 8, 2018 at 11:36 PM, ShreeDevi Kumar <[email protected]>
>>> wrote:
>>>
>>>> How many iterations did you use for training?
>>>>
>>>> You can unpack HanS.traineddata and then run the dawg2wordlist program to
>>>> get the wordlists used in it. Try using these for langdata in addition to
>>>> your training text.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> ShreeDevi
>>>> ____________________________________________________________
>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>
>>>> On Mon, Jan 8, 2018 at 6:30 PM, Yang Yu <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> These days I have been working on fine-tuning a Chinese Tesseract model
>>>>> based on the 4.0 LSTM, and it worked great when the unicharset was not
>>>>> changed. But I found a problem when I applied it to a different scenario.
>>>>>
>>>>> In my new scenario the target characters are very limited:
>>>>> I only need to recognize fewer than 100 Chinese characters instead of
>>>>> thousands. I found this
>>>>> <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters>
>>>>> link about how to use a different unicharset to achieve this.
>>>>> Concretely, what I did is:
>>>>>     1. Prepare some text with only the characters I need
>>>>>     2. Run tesstrain.sh to generate images, plus unicharset,
>>>>> traineddata, and lstmf files (here I use chi_sim as the langdata dir)
>>>>>     3. Run fine-tuning: continue from HanS.lstm, which is extracted
>>>>> from HanS.traineddata, use the generated chi_sim.traineddata as the base
>>>>> traineddata, and use HanS.traineddata as old_traineddata
>>>>>
>>>>> The training process went smoothly. But when I applied the new model to
>>>>> my evaluation set, it worked better on some test cases, while for the
>>>>> rest the model just output an empty string. By comparison, if I directly
>>>>> use a model fine-tuned from HanS.traineddata without changing the
>>>>> unicharset (say, just adding some new lstmf files to fine-tune), EVERY
>>>>> test case outputs something (whether or not it is correct).
>>>>>
>>>>> Personally I don't think it is related to overfitting, because even a
>>>>> bad model should output something, even if wrong. I'm not sure if it is
>>>>> related to chi_sim under langdata - it seems that langdata for 4.0 is not
>>>>> released yet, so chi_sim is the only thing I can use to fine-tune the
>>>>> HanS.traineddata model.
>>>>>
>>>>> Any help will be appreciated.
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To post to this group, send email to [email protected].
>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/52093984-141
>>>>> 5-4256-a2cd-268ed4141531%40googlegroups.com
>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/52093984-1415-4256-a2cd-268ed4141531%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>>
>

