[tesseract-ocr] Tesseract 4.0 outputs empty string after fine-tuning on different unicharset

Yang Yu Mon, 08 Jan 2018 05:01:01 -0800

Hi,

I have been fine-tuning a Chinese Tesseract model based on the 4.0 LSTM
engine, and it works well as long as the unicharset is unchanged. However, I
ran into a problem when I applied it to a different scenario.

In my new scenario the set of target characters is very limited: I only
need to recognize fewer than 100 Chinese characters instead of thousands. I
found this wiki section on fine-tuning with a different unicharset:
<https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters>

Concretely, what I did was:
1. Prepare some text containing only the characters I need.
2. Run tesstrain.sh to generate the images, unicharset, traineddata and
lstmf files (using chi_sim as the langdata dir).
3. Run fine-tuning: continue from HanS.lstm, which is extracted from
HanS.traineddata, with the generated chi_sim.traineddata as the base
traineddata and HanS.traineddata as old_traineddata.
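
A sketch of the commands behind step 3, following the wiki section linked
above. All paths, the output prefix and the iteration count are placeholders
(assumptions), not values from this thread. The script only builds and prints
the command lines; it does not run them.

```python
import shlex


def extract_lstm_cmd(old_traineddata, lstm_out):
    """Command that extracts the LSTM component from a traineddata file."""
    return ["combine_tessdata", "-e", old_traineddata, lstm_out]


def fine_tune_cmd(lstm, new_traineddata, old_traineddata,
                  listfile, model_output, max_iterations):
    """Command that fine-tunes an existing LSTM onto a new unicharset."""
    return [
        "lstmtraining",
        "--continue_from", lstm,                # network to start from
        "--traineddata", new_traineddata,       # carries the NEW unicharset
        "--old_traineddata", old_traineddata,   # carries the OLD unicharset
        "--train_listfile", listfile,           # list of .lstmf files
        "--model_output", model_output,         # checkpoint prefix
        "--max_iterations", str(max_iterations),
    ]


# Placeholder paths (assumptions), chosen only to mirror the names in the steps.
commands = [
    extract_lstm_cmd("tessdata/HanS.traineddata", "work/HanS.lstm"),
    fine_tune_cmd(
        lstm="work/HanS.lstm",
        new_traineddata="work/chi_sim/chi_sim.traineddata",
        old_traineddata="tessdata/HanS.traineddata",
        listfile="work/chi_sim.training_files.txt",
        model_output="work/output/limited",
        max_iterations=3000,  # arbitrary placeholder
    ),
]

for cmd in commands:
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Passing both --traineddata and --old_traineddata is what lets lstmtraining
map the old output layer onto the new, smaller character set.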

Training ran smoothly. But when I applied the new model to my evaluation
set, it worked better on some test cases, while on the rest it simply
output an empty string. By comparison, if I fine-tune from HanS.traineddata
without changing the unicharset (say, just adding some new lstmf files),
EVERY test case produces some output, whether correct or not.
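
To put a number on this comparison, the share of empty outputs per model can
be tallied as in the sketch below. The function name and the sample strings
are hypothetical, made up for illustration; they are not results from this
evaluation.

```python
def empty_output_rate(outputs):
    """Return the fraction of OCR outputs that are empty or whitespace-only."""
    if not outputs:
        return 0.0
    empty = sum(1 for text in outputs if not text.strip())
    return empty / len(outputs)


# Made-up recognition results, one string per test image.
new_unicharset_outputs = ["甲乙", "", "丙", "\n"]
same_unicharset_outputs = ["甲乙", "丁", "丙", "戊"]

print("new unicharset:", empty_output_rate(new_unicharset_outputs))
print("same unicharset:", empty_output_rate(same_unicharset_outputs))
```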

Personally I don't think this is related to overfitting, because even a bad
model should output something, even if it is wrong. I'm not sure whether it
is related to chi_sim under langdata: the langdata for 4.0 does not seem to
have been released yet, so chi_sim is the only thing I can use to fine-tune
the HanS.traineddata model.

Any help will be appreciated.

