Hi again, I've succeeded in generating *.lstmf files from the *.tif and *.box files, and in producing the train and eval file lists. Then I run this:

    combine_lang_model \
      --input_unicharset "${UNICHARSET_FILE}" \
      --script_dir "${TESSDATA_PREFIX}" \
      --output_dir "${OUTPUT_DIR}" \
      --pass_through_recoder \
      --lang "${LANG_CODE}"

which produces the "starter" ${OUTPUT_DIR}/eng/eng.traineddata file, then I do this:

    lstmtraining \
      --traineddata "${TRAINED_DATA_FILE}" \
      --net_spec "[1,40,0,1 Ct5,5,64 Mp3,3 Lfys128 Lbx256 Lbx256 O1c$num_classes]" \
      --model_output "${OUTPUT_DIR}" \
      --train_listfile "${LIST_TRAIN}" \
      --eval_listfile "${LIST_EVAL}"

That command appears to work, although it doesn't like some of the input data:

    Begin lstmtraining ...
    Num outputs,weights in Series:
      1,40,0,1:1, 0
    Num outputs,weights in Series:
      C5,5:25, 0
      Ft64:64, 1664
    Total weights = 1664
      [C5,5Ft64]:64, 1664
      Mp3,3:64, 0
      Lfys128:128, 98816
      Lbx256:512, 788480
      Lbx256:512, 1574912
      Fc113:113, 57969
    Total weights = 2521841
    Built network:[1,40,0,1[C5,5Ft64]Mp3,3Lfys128Lbx256Lbx256Fc113] from request [1,40,0,1 Ct5,5,64 Mp3,3 Lfys128 Lbx256 Lbx256 O1c113]
    Training parameters:
      Debug interval = 0, weights = 0.1, learning rate = 0.001, momentum=0.5
    null char=2
    Deserialize header failed: /data/training/20190923-183956-000002541.lstmf
    Deserialize header failed: /data/training/20190923-183713-000001499.lstmf
    Deserialize header failed: /data/training/20190923-184103-000002958.lstmf
    Deserialize header failed: /data/training/20190923-183629-000001195.lstmf
    Load of page 0 failed!
    Load of images failed!!
    Loaded 1/1 pages (1-1) of document /data/training/20190923-183643-000001284.lstmf
    Deserialize header failed: /data/training/20190923-183932-000002375.lstmf
    Loaded 1/1 pages (1-1) of document /data/training/20190923-183724-000001575.lstmf
    Loaded 1/1 pages (1-1) of document /data/training/20190923-183541-000000875.lstmf
    Loaded 1/1 pages (1-1) of document /data/training/20190923-183850-000002113.lstmf

But I think the resulting eng.traineddata file is corrupt, because when I try to run the tesseract command with --tessdata-dir ${OUTPUT_DIR}/eng I get the "couldn't load any languages" error:

    tesseract --tessdata-dir /data/output/eng --list-langs
    Failed loading language 'eng'
    Tesseract couldn't load any languages!
    List of available languages (1):
    eng

I would be grateful for debugging suggestions.

Thanks,
Adam