Hi,

I'm trying to train a new model that recognizes ids/dates (Indic digits + space + /). I've generated 10K single-line images at 300 DPI (random combinations of the above characters) using the font found on the documents I need to process, in 2 different font sizes. I've split the images into 2 sets, one for training and one for testing.

I've run the training with different network specifications, but so far I'm getting at best a 50-80% success rate when testing on my generated data (I compare the full string and run the test on both sets). Text with digits only is recognized better, but adding / degrades the results, even though the training data contains /. I've used training parameters equivalent to the standard tesseract training scripts (changing only the network specification): learning_rate 0.002, target_error_rate 0.01 (no max iterations).

I'd appreciate some pointers on how to get better results from my training: any network specification that would give better results, or any training approach/parameters.
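For reference, this is roughly the kind of generator I use to produce the line text before rendering it to images. A minimal sketch; the function name `random_line` and the choice of Eastern Arabic-Indic digits (U+0660-U+0669) are my own assumptions, since "Indic digits" can also mean other scripts:

```python
import random

# Assumed character set: Eastern Arabic-Indic digits (U+0660-U+0669).
# Adjust DIGITS if the target documents use a different digit script.
DIGITS = "٠١٢٣٤٥٦٧٨٩"
CHARSET = DIGITS + " /"

def random_line(min_len=5, max_len=20, rng=random):
    """Build one training line: digits with occasional space or slash.
    Lines never start or end with a separator and never contain two
    separators in a row, so they resemble real ids/dates."""
    length = rng.randint(min_len, max_len)
    chars = [rng.choice(DIGITS)]
    while len(chars) < length:
        c = rng.choice(CHARSET)
        # Skip a separator if the previous character was already one.
        if c in " /" and chars[-1] in " /":
            continue
        chars.append(c)
    # Make sure the line ends on a digit, not a separator.
    if chars[-1] in " /":
        chars[-1] = rng.choice(DIGITS)
    return "".join(chars)

# Generate a small batch of synthetic line texts.
lines = [random_line() for _ in range(10)]
```

Each generated string can then be fed to text2image (or similar) to render the actual training images.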
Once I'm satisfied with the results of my synthetic training, I'd like to add real-life data to the set. How clean does the training data need to be? I'd expect it should be similar to what I'll be passing in at recognition time.

How important are box coordinates? I.e., for LSTM training, do I need to make sure my boxes are tight and contain only the text, or is a full-image box for the cutout of a single line of text OK, so I don't need to be precise?

I've seen others mix real-life and synthetic data for training. What ratio and ordering of the data would give the best results? Should such data be fed to training in a specific sequence, or randomly?

Additionally, is there any tool that extracts the LSTM network specification from a ready traineddata file? I know some have the specification in the version string, but many do not.

Regards,
Bht

