Of course, as soon as I posted this, I found my mistake: I was getting the evaluation command wrong.
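For anyone who lands here with the same symptoms, here is a rough sketch of how a checkpoint from the retraining run quoted below might be evaluated with lstmeval. This is not necessarily the command that was wrong. The eval list path data/list.eval is an assumed name, and the checkpoint name assumes lstmtraining appended _checkpoint to the --model_output prefix. When the model is a checkpoint rather than a finished .traineddata file, lstmeval also needs --traineddata.

```shell
#!/bin/sh
# Sketch only: paths mirror the quoted lstmtraining command.
# EVAL_LIST is a hypothetical file name; substitute your own eval list.
MODEL="from_full/checkpoints/retrain400_checkpoint"
TRAINEDDATA="from_full/eng.traineddata"
EVAL_LIST="data/list.eval"

if command -v lstmeval >/dev/null 2>&1 && [ -f "$MODEL" ] && [ -f "$EVAL_LIST" ]; then
  # Evaluate the checkpoint against the held-out list.
  lstmeval --model "$MODEL" --traineddata "$TRAINEDDATA" --eval_listfile "$EVAL_LIST"
else
  # Tools or files are missing: just show the command that would run.
  echo "would run: lstmeval --model $MODEL --traineddata $TRAINEDDATA --eval_listfile $EVAL_LIST"
fi
```

The traineddata passed here should be the same one used for --traineddata during training, so that the checkpoint and the unicharset agree.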
On Monday, August 26, 2019 at 2:10:12 PM UTC-4, Samuel Bell wrote:
>
> Hi All,
>
> I'm trying to use the training to optimize tesseract for my dataset, which
> is a bunch of not particularly high-resolution scans of books from the
> 1930s. The text is in English, and I have successfully made a training and
> test set of true text. I've successfully trained a model that's nearly as
> good as the original best eng model from this dataset. But that's using
> training from scratch. Where I'm struggling is on retraining from the best
> eng model.
>
> When I do this, the character error rate starts very high, usually more
> than 5.0 (depending on the learning rate I specify). It slowly comes down
> with lots of iterations, but the end results when I test them are still
> garbage. What am I doing wrong?
>
> I'm downloading the best eng model to the from_full directory I've created
> for this with:
> wget https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata
>
> I'm then making my .lstm file with:
> combine_tessdata -e from_full/eng.traineddata from_full/eng.lstm
>
> Finally, I'm running the retraining with:
>
> lstmtraining \
>   --continue_from from_full/eng.lstm \
>   --traineddata from_full/eng.traineddata \
>   --train_listfile data/list.train \
>   --learning_rate 1e-3 \
>   --model_output from_full/checkpoints/retrain400 \
>   --max_iterations 400
>
> How do I make this work? Where am I going wrong?
>
> Thanks!
>
> --Sam

