[tesseract-ocr] Corrupt eng.traineddata output file?

Adam Funk Wed, 25 Sep 2019 02:35:34 -0700

Hi again,

I've succeeded in generating *.lstmf files from *.tif and *.box files,
and producing the train and eval file lists.  Then I run this:

combine_lang_model \
  --input_unicharset "${UNICHARSET_FILE}" \
  --script_dir "${TESSDATA_PREFIX}" \
  --output_dir "${OUTPUT_DIR}" \
  --pass_through_recoder \
  --lang "${LANG_CODE}"

which produces the "starter" ${OUTPUT_DIR}/eng/eng.traineddata file,
then I do this:

lstmtraining --traineddata "${TRAINED_DATA_FILE}" \
  --net_spec "[1,40,0,1 Ct5,5,64 Mp3,3 Lfys128 Lbx256 Lbx256 O1c$num_classes]" \
  --model_output "${OUTPUT_DIR}" \
  --train_listfile "${LIST_TRAIN}" \
  --eval_listfile "${LIST_EVAL}"

That command appears to work, although it doesn't like some of the input
data:

Begin lstmtraining ...
Num outputs,weights in Series:
  1,40,0,1:1, 0
Num outputs,weights in Series:
  C5,5:25, 0
  Ft64:64, 1664
Total weights = 1664
  [C5,5Ft64]:64, 1664
  Mp3,3:64, 0
  Lfys128:128, 98816
  Lbx256:512, 788480
  Lbx256:512, 1574912
  Fc113:113, 57969
Total weights = 2521841
Built network:[1,40,0,1[C5,5Ft64]Mp3,3Lfys128Lbx256Lbx256Fc113] from request [1,40,0,1 Ct5,5,64 Mp3,3 Lfys128 Lbx256 Lbx256 O1c113]
Training parameters:
  Debug interval = 0, weights = 0.1, learning rate = 0.001, momentum=0.5
null char=2
Deserialize header failed: /data/training/20190923-183956-000002541.lstmf
Deserialize header failed: /data/training/20190923-183713-000001499.lstmf
Deserialize header failed: /data/training/20190923-184103-000002958.lstmf
Deserialize header failed: /data/training/20190923-183629-000001195.lstmf
Load of page 0 failed!
Load of images failed!!
Loaded 1/1 pages (1-1) of document /data/training/20190923-183643-000001284.lstmf
Deserialize header failed: /data/training/20190923-183932-000002375.lstmf
Loaded 1/1 pages (1-1) of document /data/training/20190923-183724-000001575.lstmf
Loaded 1/1 pages (1-1) of document /data/training/20190923-183541-000000875.lstmf
Loaded 1/1 pages (1-1) of document /data/training/20190923-183850-000002113.lstmf
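
To see exactly which inputs are being rejected, the failing paths can be pulled out of the training output. This is only a minimal sketch: it assumes the lstmtraining output was saved to a file (train.log is a hypothetical name) and that the failure lines keep the "Deserialize header failed: <path>" form shown above. The function name failed_lstmf is also made up for illustration.

```shell
# List the .lstmf files that lstmtraining reported as unreadable.
# Assumption: each failure is logged on its own line as
#   Deserialize header failed: <path>
failed_lstmf() {
  # $1: path to a saved training log (hypothetical, e.g. train.log)
  # Print only the path part of matching lines, de-duplicated.
  sed -n 's/^Deserialize header failed: //p' "$1" | sort -u
}
```

Something like `failed_lstmf train.log > failed.txt` would then give a list of files to regenerate or to leave out of the train and eval lists.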


But I think the resulting eng.traineddata file is corrupt, because when
I try to run the tesseract command with --tessdata-dir ${OUTPUT_DIR}/eng
I get the "couldn't load any languages" error.

tesseract --tessdata-dir /data/output/eng --list-langs
Failed loading language 'eng'
Tesseract couldn't load any languages!
List of available languages (1):
eng


I would be grateful for debugging suggestions.

Thanks,
Adam

