[tesseract-ocr] Re: Retraining from the best eng model leads to garbage results

Samuel Bell Mon, 26 Aug 2019 11:41:37 -0700

Of course, as soon as I posted this, I found my error: I had made a 
mistake in the evaluation command.
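
For anyone who ends up here with the same symptom, below is a rough sketch of 
how a checkpoint from the retraining run can be evaluated with lstmeval. It is 
not the exact command from this thread. The eval list name (data/list.eval) 
and the checkpoint file name are assumptions; only the from_full paths come 
from the quoted message. As far as I understand, the --traineddata given to 
lstmeval should be the same file that was passed to lstmtraining.

```shell
#!/usr/bin/env bash
# Sketch only. Paths mirror the quoted message where possible;
# EVAL_LIST and the checkpoint file name are assumptions.
TRAINEDDATA=from_full/eng.traineddata
CHECKPOINT=from_full/checkpoints/retrain400_checkpoint
EVAL_LIST=data/list.eval

# Evaluate a training checkpoint directly against the held-out list.
# The traineddata supplies the unicharset and recoder for the checkpoint,
# so it should match the one used during training.
eval_checkpoint() {
  lstmeval \
    --model "$CHECKPOINT" \
    --traineddata "$TRAINEDDATA" \
    --eval_listfile "$EVAL_LIST"
}
```

Calling eval_checkpoint from the directory that contains from_full/ and data/ 
runs the evaluation. Keeping the paths in variables makes it harder to mix up 
the model and traineddata arguments.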


On Monday, August 26, 2019 at 2:10:12 PM UTC-4, Samuel Bell wrote:
>
> Hi All,
>
> I'm trying to use the training to optimize tesseract for my dataset, which 
> is a bunch of not particularly high-resolution scans of books from the 
> 1930s.  The text is in English, and I have successfully made a training and 
> test set of true text.  I've successfully trained a model that's nearly as 
> good as the original best eng model from this dataset.  But that's using 
> training from scratch.  Where I'm struggling is on retraining from the best 
> eng model.
>
> When I do this, the character error rate starts very high, usually more 
> than 5.0 (depending on the learning rate I specify).  It slowly comes down 
> with lots of iterations, but the end results when I test them are still 
> garbage.  What am I doing wrong?
>
> I'm downloading the best eng model to the from_full directory I've created 
> for this with:
> wget https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata
>
> I'm then making my .lstm file with:
> combine_tessdata -e from_full/eng.traineddata from_full/eng.lstm
>
> Finally, I'm running the retraining with:
>
> lstmtraining \
>   --continue_from from_full/eng.lstm \
>   --traineddata from_full/eng.traineddata \
>   --train_listfile data/list.train \
>   --learning_rate 1e-3 \
>   --model_output from_full/checkpoints/retrain400 \
>   --max_iterations 400
>
> How do I make this work?  Where am I going wrong?
>
> Thanks!
>
> --Sam
>
>
