Re: [tesseract-ocr] Tesseract 4.0 outputs empty string after fine-tuning on different unicharset

Yang Yu Mon, 08 Jan 2018 22:45:07 -0800

I see. I will spend some time learning the structure of tesseract's network
and give it a try.


Thanks for the help!

On Tue, Jan 9, 2018 at 1:17 PM, ShreeDevi Kumar <[email protected]>
wrote:

> Plus-minus fine-tuning works when only a few characters change.
>
> You want to delete thousands of characters.
>
> You may need the replace-top-layer type of training instead.
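
A minimal sketch of what a replace-top-layer run might look like, written as a Python helper that only builds the `lstmtraining` command line and does not execute it. The flags are standard `lstmtraining` options. The file names, the `append_index` of 5 and the `net_spec` string are placeholder assumptions in the style of the training wiki's English example, and they have not been checked against the HanS network.

```python
import shlex


def replace_top_layer_cmd(continue_from, traineddata, train_listfile,
                          model_output, append_index, net_spec,
                          max_iterations):
    """Build (but do not run) an lstmtraining command that cuts the network
    at append_index and appends the layers described by net_spec."""
    return [
        "lstmtraining",
        "--continue_from", continue_from,
        "--traineddata", traineddata,
        "--append_index", str(append_index),
        "--net_spec", net_spec,
        "--model_output", model_output,
        "--train_listfile", train_listfile,
        "--max_iterations", str(max_iterations),
    ]


cmd = replace_top_layer_cmd(
    continue_from="HanS.lstm",                     # extracted from HanS.traineddata
    traineddata="chi_sim/chi_sim.traineddata",     # assumption: new base traineddata
    train_listfile="chi_sim.training_files.txt",   # assumption: list of lstmf files
    model_output="plates_layer",                   # hypothetical checkpoint prefix
    append_index=5,                                # assumption: must match the real network
    net_spec="[Lfx256 O1c111]",                    # assumption: placeholder spec
    max_iterations=5000,
)
print(shlex.join(cmd))
```

The right cut point depends on the layer layout of the starting model, so the values above would need to be adapted after inspecting the network.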
>
>
>
> On 09-Jan-2018 7:32 AM, "Yang Yu" <[email protected]> wrote:
>
>> Thanks for your reply!
>>
>> I have tried 2000, 3000, 5000, and 10000 iterations. Is that reasonable?
>>
>> I also tried extracting the dawg from HanS.traineddata, converting it to a
>> wordlist, and using that to generate the base traineddata for fine-tuning. I
>> have confirmed that the new model's dawg->wordlist contains the words made up
>> of my limited unicharset, but the problem persists.
>>
>> To give more background, my scenario is recognizing plate numbers on
>> vehicle licenses. The target image is something like "one Chinese character
>> + several English letters or digits" (see the example image below). So the
>> results are, by design, not meaningful words. My training data has 5000
>> such plate numbers as text, one per line. The reason I want to retrain is
>> that the number of possible Chinese characters at position 0 is limited
>> to ~30.
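
For illustration, training text of the shape described above (one Chinese character followed by letters and digits, one plate per line) could be generated along these lines. This is only a sketch: the ten characters in `PROVINCE_CHARS` are a hypothetical sample rather than the real set of ~30, and the fixed tail length of six and the uniform sampling are assumptions.

```python
import random
import string

# Hypothetical sample of leading characters; the real set (~30) comes from the task.
PROVINCE_CHARS = "京津沪渝冀豫云辽黑湘"
# Assumption: the tail uses uppercase ASCII letters and digits only.
ALNUM = string.ascii_uppercase + string.digits


def make_plate(rng, tail_len=6):
    """Return one synthetic plate: a Chinese character plus tail_len letters/digits."""
    tail = "".join(rng.choice(ALNUM) for _ in range(tail_len))
    return rng.choice(PROVINCE_CHARS) + tail


def make_training_text(n_lines, seed=0):
    """Return n_lines synthetic plates, reproducible for a given seed."""
    rng = random.Random(seed)
    return [make_plate(rng) for _ in range(n_lines)]


def charset(lines):
    """Return the set of distinct characters used across all lines."""
    return set("".join(lines))


lines = make_training_text(5000)
print(len(lines), "lines,", len(charset(lines)), "distinct characters")
```

Checking `charset(lines)` against the intended character list is a cheap way to confirm that the training text really is restricted to the limited set before generating images from it.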
>>
>> Am I doing anything wrong?
>>
>> [image: Inline image 1]
>>
>>
>>
>> On Mon, Jan 8, 2018 at 11:36 PM, ShreeDevi Kumar <[email protected]>
>> wrote:
>>
>>> How many iterations did you use for training?
>>>
>>> You can unpack HanS.traineddata and then run the dawg2wordlist program to
>>> get the wordlists used in it. Try using these for langdata in addition to
>>> your training text.
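
The unpack-and-extract step suggested above could be scripted roughly as follows. The sketch only builds the two command lines (`combine_tessdata -u` to unpack, then Tesseract's `dawg2wordlist` tool to convert a dawg to a wordlist) and does not run them. The output prefix and the component names `lstm-unicharset` and `lstm-word-dawg` are assumptions about what the unpacked HanS.traineddata contains.

```python
import shlex


def unpack_and_wordlist_cmds(traineddata, prefix, wordlist_out):
    """Build commands to unpack a traineddata file and convert its LSTM
    word dawg into a plain-text wordlist. Nothing is executed here."""
    unpack = ["combine_tessdata", "-u", traineddata, prefix]
    to_wordlist = [
        "dawg2wordlist",
        prefix + "lstm-unicharset",  # assumption: unicharset matching the dawg
        prefix + "lstm-word-dawg",   # assumption: component present in the file
        wordlist_out,
    ]
    return [unpack, to_wordlist]


cmds = unpack_and_wordlist_cmds(
    traineddata="HanS.traineddata",
    prefix="unpacked/HanS.",          # hypothetical output prefix
    wordlist_out="HanS.wordlist.txt", # hypothetical output file
)
for c in cmds:
    print(shlex.join(c))
```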
>>>
>>>
>>>
>>>
>>>
>>> ShreeDevi
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> On Mon, Jan 8, 2018 at 6:30 PM, Yang Yu <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> Recently I have been fine-tuning a Chinese Tesseract model based on the
>>>> 4.0 LSTM engine, and it works great when the unicharset is unchanged. But
>>>> I ran into a problem when I applied it to a different scenario.
>>>>
>>>> In my new scenario, the target characters are very limited:
>>>> I only need to recognize fewer than 100 Chinese characters instead of
>>>> thousands. I found this
>>>> <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters>
>>>> link about how to use a different unicharset to achieve this.
>>>> Concretely, what I did was:
>>>>     1. Prepare some text with only the characters I need
>>>>     2. Run tesstrain.sh to generate images, plus the unicharset,
>>>> traineddata, and lstmf files (here I use chi_sim as the langdata dir)
>>>>     3. Run fine-tuning: continue from HanS.lstm, which is extracted
>>>> from HanS.traineddata; use the generated chi_sim.traineddata as the base
>>>> traineddata and HanS.traineddata as old_traineddata
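
Step 3 above can be sketched as two command lines: one to extract the LSTM component and one to fine-tune with a changed unicharset. The helper below only assembles the arguments and does not run anything. The flags are standard `combine_tessdata` and `lstmtraining` options, while the paths, the list-file name and the iteration count are assumptions for illustration.

```python
import shlex


def extract_lstm_cmd(traineddata, lstm_out):
    """Build the command that extracts the LSTM model from a traineddata file."""
    return ["combine_tessdata", "-e", traineddata, lstm_out]


def finetune_cmd(continue_from, traineddata, old_traineddata,
                 train_listfile, model_output, max_iterations):
    """Build an lstmtraining command for fine-tuning with a new unicharset:
    --traineddata carries the new charset, --old_traineddata the original."""
    return [
        "lstmtraining",
        "--continue_from", continue_from,
        "--traineddata", traineddata,
        "--old_traineddata", old_traineddata,
        "--train_listfile", train_listfile,
        "--model_output", model_output,
        "--max_iterations", str(max_iterations),
    ]


extract = extract_lstm_cmd("HanS.traineddata", "HanS.lstm")
train = finetune_cmd(
    continue_from="HanS.lstm",
    traineddata="chi_sim/chi_sim.traineddata",    # generated in step 2
    old_traineddata="HanS.traineddata",
    train_listfile="chi_sim.training_files.txt",  # assumption: list of lstmf files
    model_output="plates_plusminus",              # hypothetical checkpoint prefix
    max_iterations=3000,                          # one of the values tried in the thread
)
print(shlex.join(extract))
print(shlex.join(train))
```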
>>>>
>>>> Training ran smoothly. But when I applied the new model to my
>>>> evaluation set, it worked better on some test cases, while on the
>>>> rest the model just output an empty string. By comparison, if I
>>>> directly fine-tune HanS.traineddata without changing the unicharset
>>>> (say, just adding some new lstmf files), every test case outputs
>>>> something, whether or not it is correct.
>>>>
>>>> Personally I don't think it is related to overfitting, because even a
>>>> bad model should output something, even if wrong. I'm not sure whether it
>>>> is related to chi_sim under langdata: it seems that langdata for 4.0 is
>>>> not released yet, so chi_sim is the only thing I can use to fine-tune the
>>>> HanS.traineddata model.
>>>>
>>>> Any help will be appreciated.
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit https://groups.google.com/d/ms
>>>> gid/tesseract-ocr/52093984-1415-4256-a2cd-268ed4141531%40goo
>>>> glegroups.com
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/52093984-1415-4256-a2cd-268ed4141531%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>

