Re: [tesseract-ocr] Tesseract 4.0 outputs empty string after fine-tuning on different unicharset

ShreeDevi Kumar Mon, 08 Jan 2018 21:17:24 -0800

Plus-minus fine-tuning works when only a few characters change.

You want to remove thousands of characters.
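
One way to gauge how far apart the two character sets are is to diff the two unicharset files. This is a minimal Python sketch, assuming the usual unicharset layout (the first line holds the entry count and each later line starts with the unichar, followed by a space and its properties). The function names are made up for illustration and are not part of any Tesseract tool.

```python
def read_unichars(unicharset_text):
    """Return the set of unichars listed in the text of a unicharset file."""
    lines = unicharset_text.splitlines()
    # Assumption: line 1 is the entry count, every later line begins
    # with the unichar itself, followed by a space and its properties.
    return {line.split(" ", 1)[0] for line in lines[1:] if line.strip()}


def charset_delta(old_text, new_text):
    """Report which unichars were added to or removed from the old set."""
    old = read_unichars(old_text)
    new = read_unichars(new_text)
    return {"added": new - old, "removed": old - new}


if __name__ == "__main__":
    # Tiny made-up unicharsets standing in for the real files.
    old = "3\nNULL 0 NULL 0\nA 5 0,255 Latin 1\nB 5 0,255 Latin 2\n"
    new = "2\nNULL 0 NULL 0\nA 5 0,255 Latin 1\n"
    delta = charset_delta(old, new)
    print(len(delta["added"]), "added,", len(delta["removed"]), "removed")
```

A handful of entries under "added" or "removed" is the plus-minus case; thousands under "removed" is the situation described here.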


You may need the replace-top-layer type of training instead.
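
For reference, replace-top-layer training in the 4.00 training wiki cuts the existing network at a layer index and appends a new top, using lstmtraining's --append_index and --net_spec options. The Python sketch below only assembles such a command line and does not run it. All paths are placeholders, and the index 5 and the spec '[Lfx256 O1c111]' are copied from the wiki's example as assumptions, not checked against the HanS network.

```python
import shlex


def replace_top_layer_cmd(continue_from, traineddata, model_output,
                          train_listfile, append_index=5,
                          net_spec="[Lfx256 O1c111]", max_iterations=3000):
    """Build an lstmtraining argument list for replace-top-layer training.

    append_index and net_spec defaults come from the wiki example and are
    assumptions; the right values depend on the network being cut.
    """
    return [
        "lstmtraining",
        "--continue_from", continue_from,    # .lstm extracted from the old model
        "--traineddata", traineddata,        # starter traineddata with the new unicharset
        "--append_index", str(append_index), # keep layers below this index
        "--net_spec", net_spec,              # new layers to put on top
        "--model_output", model_output,
        "--train_listfile", train_listfile,
        "--max_iterations", str(max_iterations),
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = replace_top_layer_cmd(
        "HanS.lstm",
        "plate/plate.traineddata",
        "output/plate_base",
        "plate.training_files.txt",
    )
    print(shlex.join(cmd))
```

Unlike the plus-minus recipe, this command does not pass --old_traineddata, since the old output layer is discarded rather than remapped.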



On 09-Jan-2018 7:32 AM, "Yang Yu" wrote:

> Thanks for your reply!
>
> I have tried 2000, 3000, 5000, and 10000 iterations. Is that reasonable?
>
> I also tried extracting the dawg from HanS.traineddata, converting it to
> a wordlist, and using it to generate the base traineddata for fine-tuning.
> I have confirmed that the new model's dawg->wordlist contains the words
> made up of my limited unicharset, but the problem still exists.
>
> To give more background, my scenario is recognizing plate numbers from
> vehicle licenses. The target image is something like "one Chinese
> character + several English letters or digits" (see the example image
> below), so the results are by design not meaningful words. My training
> data has 5000 such plate numbers as text, one per line. The reason I want
> to retrain is that the number of possible Chinese characters at position
> 0 is limited to ~30.
>
> Am I doing anything wrong?
>
> [image: Inline image 1]
>
>
>
> On Mon, Jan 8, 2018 at 11:36 PM, ShreeDevi Kumar wrote:
>
>> How many iterations did you use for training?
>>
>> You can unpack HanS.traineddata and then run the dawg2wordlist program
>> to get the wordlists used in it. Try using these for langdata in
>> addition to your training text.
>>
>> ShreeDevi
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Mon, Jan 8, 2018 at 6:30 PM, Yang Yu wrote:
>>
>>> Hi,
>>>
>>> I have been fine-tuning a Chinese Tesseract model based on the 4.0
>>> LSTM engine, and it worked great when the unicharset was not changed.
>>> But I found a problem when I applied it to a different scenario.
>>>
>>> In my new scenario, the target characters are very limited: I only
>>> need to recognize fewer than 100 Chinese characters instead of
>>> thousands. I found this
>>> <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters>
>>> link about how to use a different unicharset to achieve this.
>>> Concretely, what I did is:
>>>     1. Prepare some text containing only the characters I need
>>>     2. Run tesstrain.sh to generate images plus the unicharset,
>>> traineddata, and lstmf files (here I use chi_sim as the langdata dir)
>>>     3. Run fine-tuning: continue from HanS.lstm (extracted from
>>> HanS.traineddata), use the generated chi_sim.traineddata as the base
>>> traineddata, and use HanS.traineddata as old_traineddata
>>>
>>> Training ran smoothly. But when I applied the new model to my
>>> evaluation set, it worked better for some test cases, while for the
>>> rest the model just output an empty string. By comparison, if I
>>> directly use a model fine-tuned from HanS.traineddata without changing
>>> the unicharset (say, just adding some new lstmf files to fine-tune),
>>> EVERY test case outputs something (whether or not it is correct).
>>>
>>> Personally I don't think it is related to overfitting, because even a
>>> bad model should output something, even if it is wrong. I'm not sure
>>> whether it is related to chi_sim under langdata: it seems that langdata
>>> for 4.0 is not released yet, so chi_sim is the only thing I can use to
>>> fine-tune the HanS.traineddata model.
>>>
>>> Any help will be appreciated.
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit https://groups.google.com/d/ms
>>> gid/tesseract-ocr/52093984-1415-4256-a2cd-268ed4141531%40goo
>>> glegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/52093984-1415-4256-a2cd-268ed4141531%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>

