Hi All,

I'm trying to use the training tools to optimize Tesseract for my dataset, which is a bunch of not particularly high-resolution scans of books from the 1930s. The text is in English, and I have successfully built a training set and a test set with ground-truth text. Training from scratch, I've produced a model that is nearly as good on this dataset as the original best eng model. Where I'm struggling is retraining starting from the best eng model.
When I do this, the character error rate starts very high, usually more than 5.0 (depending on the learning rate I specify). It slowly comes down over many iterations, but the resulting models still produce garbage when I test them. What am I doing wrong?

I'm downloading the best eng model to the from_full directory I've created for this with:

    wget https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata

I'm then making my .lstm file with:

    combine_tessdata -e from_full/eng.traineddata from_full/eng.lstm

Finally, I'm running the retraining with:

    lstmtraining \
      --continue_from from_full/eng.lstm \
      --traineddata from_full/eng.traineddata \
      --train_listfile data/list.train \
      --learning_rate 1e-3 \
      --model_output from_full/checkpoints/retrain400 \
      --max_iterations 400

How do I make this work? Where am I going wrong?

Thanks!
--Sam
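
P.S. To pin down what "character error rate" means when comparing OCR output against ground-truth text, here is a minimal sketch. It assumes the rate is the Levenshtein edit distance divided by the length of the reference text, expressed as a percentage. This is not Tesseract's own code, and lstmtraining may compute its reported figure differently; the function names levenshtein and char_error_rate are made up for illustration.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn ref into hyp."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(ref: str, hyp: str) -> float:
    """Edit distance as a percentage of the reference length (assumed definition)."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return 100.0 * levenshtein(ref, hyp) / len(ref)
```

Under that assumption, char_error_rate(ground_truth_line, ocr_output_line) gives a per-line figure that can be averaged over a test set, so a from-scratch model and a retrained one can be compared on the same footing.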

