Hi,

I'm trying to train a new model that recognizes ids/dates (Indic digits + space + /). I've generated 10K single-line images at 300 DPI (random combinations of the above characters) using the font found on the documents I need to process, in 2 different font sizes. I've split the images into 2 sets, one for training and one for testing.

I've run the training with different network specifications, but so far I'm getting at best a 50-80% success rate when testing on my generated data (I compare the full string and run the test on both sets). Text with digits only is recognized better, but adding / degrades the results, even though the training data contains /. I've used training parameters equivalent to the standard tesseract training scripts (changing only the network specification): learning_rate 0.002, target_error_rate 0.01 (no max iterations).

I'd appreciate some pointers on how to get better results from my training: any network specification that would give better results, or any training approach/parameters.
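For reference, this is roughly the kind of generator I use to produce the line text before rendering it to images. A minimal sketch; the function name `random_line` and the choice of Eastern Arabic-Indic digits (U+0660-U+0669) are my own assumptions, since "Indic digits" can also mean other scripts:

```python
import random

# Assumed character set: Eastern Arabic-Indic digits (U+0660-U+0669).
# Adjust DIGITS if the target documents use a different digit script.
DIGITS = "٠١٢٣٤٥٦٧٨٩"
CHARSET = DIGITS + " /"

def random_line(min_len=5, max_len=20, rng=random):
    """Build one training line: digits with occasional space or slash.
    Lines never start or end with a separator and never contain two
    separators in a row, so they resemble real ids/dates."""
    length = rng.randint(min_len, max_len)
    chars = [rng.choice(DIGITS)]
    while len(chars) < length:
        c = rng.choice(CHARSET)
        # Skip a separator if the previous character was already one.
        if c in " /" and chars[-1] in " /":
            continue
        chars.append(c)
    # Make sure the line ends on a digit, not a separator.
    if chars[-1] in " /":
        chars[-1] = rng.choice(DIGITS)
    return "".join(chars)

# Generate a small batch of synthetic line texts.
lines = [random_line() for _ in range(10)]
```

Each generated string can then be fed to text2image (or similar) to render the actual training images.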
Once I'm satisfied with the results of my synthetic training, I'd like to add real-life data to the set. How clean does the training data need to be? I'd expect it should be similar to what I'll be passing in at recognition time.

How important are box coordinates? I.e., for LSTM training, do I need to make sure my boxes are tight and contain only the text, or is a full-image box for the cutout of a single line of text OK, so I don't need to be precise?

I've seen others mix real-life and synthetic data for training. What ratio and ordering of the data would give the best results? Should such data be fed to training in a specific sequence, or randomly?

Additionally, is there any tool that extracts the LSTM network specification from a ready traineddata file? I know some have the specification in the version string, but many do not.

Regards,
Bht

