I don't think you need training to improve the results. Pre-process the image first: straighten (deskew) it. Then use a separate tool to locate each cell of data, and OCR each cell on its own. That approach should give you the best results.
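To make the "identify each cell" step concrete, here is a minimal sketch in plain Python. It assumes the page has already been deskewed and binarized into a list of rows where 1 is a dark pixel, and that the table has ruling lines running across its full width and height. The names `find_cells`, `spans_between`, `demo_grid` and the `min_fill` threshold are made up for illustration and are not part of Tesseract or any library.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive


def spans_between(lines: List[int], size: int) -> List[Tuple[int, int]]:
    """Return (start, end) spans of indices in range(size) that are not ruling lines."""
    line_set = set(lines)
    spans = []
    start = None
    for i in range(size):
        if i in line_set:
            if start is not None:
                spans.append((start, i))
                start = None
        elif start is None:
            start = i
    if start is not None:
        spans.append((start, size))
    return spans


def find_cells(image: List[List[int]], min_fill: float = 0.9) -> List[Box]:
    """Locate table cells as the regions between full-length ruling lines.

    A row (or column) counts as a ruling line when at least `min_fill`
    of its pixels are dark. This threshold is an assumption and only
    works on a deskewed image.
    """
    height, width = len(image), len(image[0])
    row_counts = [sum(row) for row in image]
    col_counts = [sum(image[y][x] for y in range(height)) for x in range(width)]
    h_lines = [y for y, c in enumerate(row_counts) if c >= min_fill * width]
    v_lines = [x for x, c in enumerate(col_counts) if c >= min_fill * height]
    rows = spans_between(h_lines, height)
    cols = spans_between(v_lines, width)
    # Row-major order: left to right, then top to bottom.
    return [(x0, y0, x1, y1) for (y0, y1) in rows for (x0, x1) in cols]


def demo_grid() -> List[List[int]]:
    """Build a tiny 7x9 image of a 2x2 table with one 'ink' pixel in the first cell."""
    height, width = 7, 9
    img = [[0] * width for _ in range(height)]
    for y in (0, 3, 6):          # horizontal ruling lines
        img[y] = [1] * width
    for x in (0, 4, 8):          # vertical ruling lines
        for y in range(height):
            img[y][x] = 1
    img[1][1] = 1                # a bit of cell content
    return img


cells = find_cells(demo_grid())
```

Each box would then be cropped from the page image and passed to Tesseract separately, for example with a single-line page segmentation mode such as `--psm 7`. On real scans the ruling lines are rarely perfect, so a line-detection tool from an image-processing library is likely to be more robust than this projection-based sketch.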

On Mon, Jun 17, 2019 at 6:07 PM [email protected] wrote:

> Thanks shree for your reply. I see that you are very busy answering a lot
> of questions here. Thanks again for taking some time for me.
>
>> Your files have prefix of jpn, so I assume you are training for Japanese,
>> but the image in question has only numbers in it.
>
> Well, I forgot to mention that my model only needs to recognize digits, not
> all Japanese characters. I just used the prefix jpn because I am working
> with Japanese documents.
> Anyway, from your answer I understand that there is a high chance I am
> dealing with an overfitting problem, not a problem with how to convert the
> checkpoint file to a .traineddata file. Am I right? If so, I guess the first
> thing I should try is to fine-tune your digits model (I found the one you
> shared on GitHub: https://github.com/Shreeshrii/tessdata_shreetest).
> Correct me if I am wrong.
>
> Btw, I have 2 more questions:
> 1. About how I generate the training data. Since I could not find the
>    right font for my document, I cropped digit images from the data I have
>    and randomly picked cropped digits to generate training images. Do you
>    think this is the right way to do the data augmentation?
> 2. I generated 2000 samples for the training. Is that enough or not?
>
> On Mon, Jun 17, 2019 at 5:19 PM shree <[email protected]> wrote:
>
>> Your files have prefix of jpn, so I assume you are training for Japanese,
>> but the image in question has only numbers in it.
>>
>> Getting good results on eval data but bad results on OCR could be the
>> result of overfitting the model, if you have used a small sample and
>> trained for a large number of iterations.
>>
>> On Friday, June 14, 2019 at 8:35:40 AM UTC+5:30, Phuc wrote:
>>>
>>> Hi
>>> I am training a model using Tesseract's lstmtraining and am confused
>>> about the result I get.
>>> I wonder if I did anything wrong in the steps below:
>>>
>>> - I created the training data (.box and .tif) following
>>>   https://github.com/tesseract-ocr/tesseract/issues/2357. Note that each
>>>   (.box, .tif) pair includes multiple text lines.
>>> - I ran the training process using https://github.com/OCR-D/ocrd-train.
>>>   Since I already have the .box files, I simply commented out the
>>>   `generate_line_box.py` line inside the Makefile.
>>> - After training, I used lstmeval to evaluate the model on an
>>>   evaluation dataset and got an error that is not so bad.
>>>
>>> [image: 図1.png]
>>>
>>> - But when I take the exact same image from the evaluation dataset and
>>>   run the prediction using the .traineddata, the result seems to be
>>>   totally different.
>>>
>>> I also attach some files of my training data and the visualized result
>>> in case anyone wants to take a look.
>>>
>>> I would appreciate it if someone could tell me what I did wrong.
>>>
>>> Thanks

--
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduWKGfVyTDcawio63iCHcwe4TcVYd6vDjj9upKdOdoRoMA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

