Hi group,

I'm trying to retrain the top layers of the chi_sim tessdata_best model using 
Tesseract 4.0.0. combine_tessdata reports this about the network:

Version string:4.00.00alpha:chi_sim:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]

I noticed the O1c1 at the end, which declares just 1 output class. Yet the 
unpacked unicharset contains 4022 characters. Why doesn't the unicharset 
match the number of outputs?

I'm adding characters as well as providing new fonts. When I retrain with 
'--append_index 5 --net_spec [Lfx512 O1c1]', the training tool complains 
about the output class count:

 Appending a new network to an old one!!Warning: given outputs 1 not equal 
to unicharset of 5077.

It then built a different structure from the one I requested:

 Built network:[1,48,0,1[C3,3Ft16]Mp3,3Lfys64Lfx96Lrx96Lfx512Fc5077] from request [Lfx512 O1c1].
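
To make the mismatch concrete, here is a minimal Python sketch that reads the class count off the last layer of a net spec. The helper name output_classes is my own and not part of Tesseract, and the pattern assumes the spec ends in either an O<dim><type><n> layer or an F<type><n> layer, which holds for the two specs quoted in this post but may not cover every VGSL string.

```python
import re

# Matches the final layer of a VGSL spec, e.g. "O1c1" or "Fc5077".
# Assumption: the last layer is an output layer (O<dim><type><n>) or a
# fully connected layer (F<type><n>), as in the two specs in this post.
_LAST_LAYER = re.compile(r"(?:O[012]|F)[a-z](\d+)\]\s*$")


def output_classes(net_spec: str) -> int:
    """Return the class count declared by the last layer of a VGSL spec."""
    match = _LAST_LAYER.search(net_spec)
    if match is None:
        raise ValueError(f"no recognisable output layer in {net_spec!r}")
    return int(match.group(1))


requested = output_classes("[Lfx512 O1c1]")
built = output_classes("[1,48,0,1[C3,3Ft16]Mp3,3Lfys64Lfx96Lrx96Lfx512Fc5077]")
print(requested, built)  # prints "1 5077"
```

So the spec I requested declares 1 class, while the network the tool actually built has 5077, the same number the warning gives for my unicharset.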

The starter traineddata is created this way:

combine_lang_model \
  --input_unicharset model/custom/custom.lstm-unicharset \
  --script_dir data/langdata_lstm \
  --words data/langdata_lstm/chi_sim/chi_sim.wordlist \
  --puncs data/langdata_lstm/chi_sim/chi_sim.punc \
  --numbers data/langdata_lstm/chi_sim/chi_sim.numbers \
  --output_dir model \
  --lang chi_sim \
  --pass_through_recoder

The .lstm-unicharset file is generated from box files with 
'unicharset_extractor --norm_mode 1'.

Where did I go wrong?

Thanks in advance,
He Shiming
