[tesseract-ocr] Retraining from the best eng model leads to garbage results

Samuel Bell Mon, 26 Aug 2019 11:10:22 -0700

Hi All,

I'm trying to fine-tune Tesseract for my dataset, which is a collection of 
fairly low-resolution scans of books from the 1930s.  The text is in 
English, and I have built training and test sets with ground-truth text.  
Training from scratch on this dataset, I've produced a model that's nearly 
as good as the original best eng model.  Where I'm struggling is retraining 
from the best eng model.


When I do this, the character error rate starts very high, usually more 
than 5.0 (depending on the learning rate I specify).  It slowly comes down 
with lots of iterations, but the end results when I test them are still 
garbage.  What am I doing wrong?

I'm downloading the best eng model to the from_full directory I've created 
for this with:
wget https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata

I'm then making my .lstm file with:
combine_tessdata -e from_full/eng.traineddata from_full/eng.lstm

Finally, I'm running the retraining with:

lstmtraining \
  --continue_from from_full/eng.lstm \
  --traineddata from_full/eng.traineddata \
  --train_listfile data/list.train \
  --learning_rate 1e-3 \
  --model_output from_full/checkpoints/retrain400 \
  --max_iterations 400
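
A checkpoint from a run like this can be scored on held-out data with 
lstmeval.  The following is a minimal sketch, not a definitive recipe.  It 
assumes that lstmtraining wrote its checkpoint next to the --model_output 
prefix with a _checkpoint suffix, and that data/list.eval is a list of 
held-out .lstmf files; that file name is hypothetical and does not appear 
in the commands above.

```shell
# Assumed paths: MODEL_OUTPUT mirrors --model_output from the training command.
# EVAL_LIST is a hypothetical list file of held-out .lstmf files.
MODEL_OUTPUT=from_full/checkpoints/retrain400
CHECKPOINT="${MODEL_OUTPUT}_checkpoint"
TRAINEDDATA=from_full/eng.traineddata
EVAL_LIST=data/list.eval

if command -v lstmeval >/dev/null 2>&1 && [ -f "$CHECKPOINT" ]; then
  # Evaluate the checkpoint against the held-out list and print error rates.
  lstmeval \
    --model "$CHECKPOINT" \
    --traineddata "$TRAINEDDATA" \
    --eval_listfile "$EVAL_LIST"
else
  # Nothing to evaluate yet: the tool or the checkpoint file is missing.
  echo "skipping: lstmeval or $CHECKPOINT not available"
fi
```

Running the same command with --model from_full/eng.lstm before any 
retraining gives a baseline on the same held-out list to compare against.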

How do I make this work?  Where am I going wrong?

Thanks!

--Sam
