Thanks, Shree, for your reply. I can see that you are very busy answering a lot of questions here, so thanks again for taking some time for me.

> Your files have prefix of jpn, so I assume you are training for Japanese,
> but the image in question has only numbers in it.

I forgot to mention that my model only needs to recognize digits, not all Japanese characters. I used the jpn prefix only because I am working with Japanese documents.

Anyway, from your answer I understand that I am most likely dealing with an overfitting problem, not a problem with how I convert the checkpoint file to a .traineddata file. Am I right? If so, I guess the first thing I should try is to fine-tune your digits model, which I found in the repository you shared on GitHub (https://github.com/Shreeshrii/tessdata_shreetest). Correct me if I am wrong.
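A minimal sketch of the fine-tuning route mentioned above, assuming the base model is an LSTM model and that the Tesseract training tools `combine_tessdata` and `lstmtraining` are installed. The file names (`digits.traineddata`, `list.train`, `out`, `digits_ft`) are placeholders and not taken from the thread. The function only builds the command lines so they can be inspected before anything is run; keeping `max_iterations` small is one way to limit overfitting on a small sample.

```python
from pathlib import Path


def finetune_commands(base_traineddata, train_list, out_dir, max_iterations=400):
    """Build (but do not run) the commands for fine-tuning an existing model.

    All paths are hypothetical placeholders supplied by the caller.
    """
    base = Path(base_traineddata)
    out = Path(out_dir)
    lstm = out / (base.stem + ".lstm")

    # 1. Extract the LSTM network from the base .traineddata file.
    extract = ["combine_tessdata", "-e", str(base), str(lstm)]

    # 2. Continue training from that network on the new line samples.
    train = [
        "lstmtraining",
        "--continue_from", str(lstm),
        "--traineddata", str(base),
        "--train_listfile", str(train_list),
        "--model_output", str(out / "digits_ft"),
        "--max_iterations", str(max_iterations),
    ]

    # 3. Convert the last checkpoint into a .traineddata file.
    convert = [
        "lstmtraining",
        "--stop_training",
        "--continue_from", str(out / "digits_ft_checkpoint"),
        "--traineddata", str(base),
        "--model_output", str(out / "digits_ft.traineddata"),
    ]
    return [extract, train, convert]


if __name__ == "__main__":
    for cmd in finetune_commands("digits.traineddata", "list.train", "out"):
        print(" ".join(cmd))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.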
By the way, I have two more questions:

1. About how I generate the training data: since I could not find the right font for my documents, I cropped digit images from the data I have and randomly picked cropped digits to compose each training image. Do you think this is a reasonable way to do data augmentation?
2. I generated 2000 samples for training. Is that enough?

On Mon, Jun 17, 2019 at 5:19 PM shree <[email protected]> wrote:

> Your files have prefix of jpn, so I assume you are training for Japanese,
> but the image in question has only numbers in it.
>
> Getting good results on eval data but bad results on OCR could be the
> result of overfitting the model, if you have used a small sample and
> trained for large number of iterations.
>
> On Friday, June 14, 2019 at 8:35:40 AM UTC+5:30, Phuc wrote:
>>
>> Hi,
>> I am training a model using Tesseract's lstmtraining and am confused
>> about the result I get. I wonder if I did anything wrong in the steps
>> below:
>>
>> - I created the training data (.box and .tif) following
>> https://github.com/tesseract-ocr/tesseract/issues/2357. Note that each
>> (.box, .tif) pair includes multiple text lines.
>> - I ran the training process using https://github.com/OCR-D/ocrd-train.
>> Since I already have the .box files, I simply commented out the
>> `generate_line_box.py` line inside the Makefile.
>> - After training, I used lstmeval to evaluate the model on an
>> evaluation dataset and got an error rate that is not so bad:
>>
>> [image: 図1.png]
>>
>> - But when I take the exact same images from the evaluation dataset and
>> run prediction using the .traineddata file, the result seems to be
>> totally different.
>>
>> I have also attached some files from my training data and the visualized
>> results in case anyone wants to take a look.
>>
>> I would appreciate it if someone could tell me what I did wrong.
>>
>> Thanks
>>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/a6090eb0-6803-4242-b2e9-9cf27ca65126%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
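For reference, the data generation described in question 1 (cropping digits and randomly combining them into training lines) can be sketched roughly as follows. This is an illustration under assumptions, not the script actually used: crops are plain 2D lists of grey values so the sketch has no dependencies, and the function name `compose_line` and its `gap` and `background` parameters are hypothetical. In practice the crops would come from an image library and the result would be saved as a .tif with a matching ground-truth text.

```python
import random


def compose_line(crops_by_digit, length, rng, gap=2, background=255):
    """Build one synthetic text line from randomly chosen digit crops.

    crops_by_digit maps a digit character to a list of crops; each crop is a
    2D list (rows of grey values). Returns (image_rows, ground_truth_text).
    """
    # Pick the digit sequence first, then one random crop per digit.
    digits = [rng.choice(sorted(crops_by_digit)) for _ in range(length)]
    crops = [rng.choice(crops_by_digit[d]) for d in digits]

    height = max(len(c) for c in crops)
    rows = [[] for _ in range(height)]
    for i, crop in enumerate(crops):
        width = len(crop[0])
        top = height - len(crop)  # align shorter crops on the bottom edge
        for y in range(height):
            # Pad above a short crop with background pixels.
            src = crop[y - top] if y >= top else [background] * width
            rows[y].extend(src)
            if i < len(crops) - 1:
                rows[y].extend([background] * gap)  # spacing between digits
    return rows, "".join(digits)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the generated set is reproducible
    toy_crops = {"1": [[[0], [0]]], "7": [[[0, 0]]]}
    image, text = compose_line(toy_crops, 4, rng)
    print(text, len(image), len(image[0]))
```

If every line is drawn from the same small pool of crops, many samples may end up near-duplicates, so it may be worth keeping the crops used for evaluation lines separate from those used for training lines.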

