[tesseract-ocr] Fine-tuning via tesstrain repo gives me poorer results than built-in eng model

Gradalajage Fri, 18 Sep 2020 22:31:58 -0700

I have 395 PNG files depicting numbers with commas. The images are 130x54 
pixels and are black text on white background. Here is an example of an 
image showing the number 638,997:
[image: 638,997.png]
I would like to use Tesseract to perform reliable OCR on these images and 
others like them. Out-of-the-box, Tesseract correctly extracts text for 344 
of these images, and fails in some manner on 51 of them. I am using the 
following command line for each image:

> tesseract --psm 7 --oem 1 -c tessedit_char_whitelist=',0123456789' {filename}.png out

I run that command on each image, substituting {filename} as needed. Each
invocation of that command produces the following output:

Tesseract Open Source OCR Engine v5.0.0-alpha.20200328 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.

344/395 is an 87% success rate, but I want to try for better. So, I am
attempting to "fine-tune" Tesseract by running through the instructions for
tesstrain at https://github.com/tesseract-ocr/tesstrain. Each PNG file's
name encodes its ground truth, and I have a little script that generates
ground-truth TXT files from the PNG file names. I
have chosen "swtor" as the model name. I can then run this command from the
tesstrain root directory:

$ make training MODEL_NAME=swtor START_MODEL=eng PSM=7

This command runs, prints lots of info, and eventually produces the
following output, just before it ends:

Finished! Error rate = 2.739
lstmtraining \
--stop_training \
--continue_from data/swtor/checkpoints/swtor_checkpoint \
--traineddata data/swtor/swtor.traineddata \
--model_output data/swtor.traineddata
Loaded file data/swtor/checkpoints/swtor_checkpoint, unpacking...
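
For reference, the ground-truth generation step can be sketched roughly as
follows. This is a minimal sketch, not the exact script used: it assumes the
file name without ".png" is exactly the text shown in the image (for example
"638,997.png"), and it writes the ".gt.txt" files that the tesstrain README
asks for next to each image. The function name write_ground_truth is
hypothetical.

```python
from pathlib import Path


def write_ground_truth(image_dir: str) -> int:
    """Write one <stem>.gt.txt next to every PNG in image_dir; return the count."""
    count = 0
    for png in sorted(Path(image_dir).glob("*.png")):
        # Assumption: the file name without ".png" is the exact text in the image.
        gt_path = png.with_name(png.stem + ".gt.txt")
        # tesstrain expects a single line of plain text per transcription file.
        gt_path.write_text(png.stem, encoding="utf-8")
        count += 1
    return count
```

With MODEL_NAME=swtor, the tesstrain default would be to call this on
"data/swtor-ground-truth", assuming the PNG files have been copied there.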

I can then take the resulting swtor.traineddata file, copy it to my
tessdata directory, and then re-run my experiment from earlier, with a
command line that looks like this:

> tesseract -l swtor --psm 7 --oem 1 -c tessedit_char_whitelist=',0123456789' {filename}.png out

With the new swtor model, Tesseract correctly extracts text for 64 of these
images, and fails in some manner on 331 of them.
64/395 is a 16% success rate, down from 87% for the eng model.
So the swtor model I trained does far worse, which I find surprising. I
suspect I am doing something wrong, but I do not know what steps to take
next to troubleshoot this. I am posting here in the hope of getting help
from someone knowledgeable about the training process.
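
The success counts above come from comparing Tesseract's output with the file
name of each image. A rough sketch of that comparison is below, assuming the
tesseract binary is on the PATH and that the file name stem is the ground
truth. It uses "stdout" as the output base so the text can be captured
directly instead of being read back from out.txt. The names run_tesseract and
score are hypothetical, and the list of failures is only there to make the
failure modes easier to inspect.

```python
import subprocess
from pathlib import Path


def run_tesseract(png: Path, lang: str = "eng") -> str:
    """Run an equivalent of the command line above, capturing the text from stdout."""
    result = subprocess.run(
        ["tesseract", str(png), "stdout", "-l", lang, "--psm", "7", "--oem", "1",
         "-c", "tessedit_char_whitelist=,0123456789"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def score(image_dir: str, recognize=run_tesseract):
    """Return (correct, total, failures) for all PNGs in image_dir."""
    correct, failures = 0, []
    pngs = sorted(Path(image_dir).glob("*.png"))
    for png in pngs:
        text = recognize(png)
        # Assumption: the file name without ".png" is the expected text.
        if text == png.stem:
            correct += 1
        else:
            failures.append((png.name, text))
    return correct, len(pngs), failures
```

To score the fine-tuned model instead of eng, pass something like
recognize=lambda p: run_tesseract(p, lang="swtor").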

I can post the contents of the "data" directory in my tesstrain repo root
directory if that is helpful for anyone (I'd have to remove the
checkpoints).

