Lorenzo,

Thank you so much for your help. I followed everything step by step and got a very good result. I think what helped me most was upscaling the images. My code is in Python; here it is in case anyone else is following the thread:

    import PIL
    from PIL import Image

    # baseheight is the target image height in pixels and must be set beforehand
    img = Image.open(imagepath)
    hpercent = baseheight / float(img.size[1])
    wsize = int(float(img.size[0]) * hpercent)
    # newer Pillow releases name this filter Image.LANCZOS
    img = img.resize((wsize, baseheight), PIL.Image.ANTIALIAS)

I'm a real newbie in bash, so I didn't use your scripts; I kept getting a permission error.

Thank you again, Lorenzo!

On Thursday, July 26, 2018 at 5:46:44 AM UTC-5, Lorenzo Blz wrote:
>
> First, read this: "Fine Tuning for ± a few characters"
> <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters>
>
> Then check the data/unicharset file to see if everything is OK and that it
> contains all the characters you want.
>
> Then, 15000 iterations are way too many and 300 samples are really too
> few. If you train too much you'll get worse results.
>
> I usually get the best fine-tuning results from 400 to 2000 iterations. I
> can do more, up to 20k iterations, only when I have many sample images: a
> few thousand with multiple words.
>
> I do it like this (this is not a complete guide, just to give you the
> general idea):
>
> - clean the data and data/checkpoints folders (do NOT add -rf, you do not
>   want to wipe out the training data)
>
>   rm data/*
>   rm data/checkpoints/*
>
>   (do this only once, when you start a new training session, not after each
>   training step)
>
> - go into the Makefile and fix this (in the "data/list.eval" block, remove
>   the + before $$no):
>
>   tail -n "$$no" $(ALL_LSTMF) > "$@"
>
>   then add somewhere at the top:
>
>   ITERATIONS=100
>
>   and change the max_iterations line to this (do not change the tabs/spaces
>   at the beginning, just replace the number):
>
>   --max_iterations $(ITERATIONS)
>
> - now run the training as normal like this:
>
>   make training ITERATIONS=100
>
> - when it finishes run this:
>
>   lstmeval --model data/YOUR_MODEL.traineddata --eval_listfile data/list.eval
>
>   In the last line you'll get something like this:
>
>   At iteration 0, stage 0, Eval Char error rate=0.96153846, Word error
>   rate=3.8461538
>
>   These are the only values that matter. Take note of these values and the
>   iteration numbers.
>
>   Make a backup of the model:
>
>   cp data/YOUR_MODEL.traineddata data/YOUR_MODEL.traineddata_100
>
> - Now start the training again with ITERATIONS=200, it will resume from
>   the previous iteration up to 200:
>
>   make training ITERATIONS=200
>
> - Run lstmeval again, take note, backup and so on, 300, 400, 500....
>
> You should see that the error rate will go down for a while then it will
> slow down and then will start to get worse. Use the model where you got the
> best score.
>
> You can try this, but 300 samples are likely way too few for this to be
> meaningful.
>
> I'm attaching my training scripts, they should work but double check
> everything.
>
> About thresholding, probably you do not need it, just increase the
> contrast a little, do not go binary. Probably you do not need that either.
> And do the same processing to the training data that you will do on your
> real data.
>
> Two important things, for training and recognition. Use PSM=13
> (PSM.RAW_LINE). Trim all the white borders, upscale the image so that the
> text is 30-50 pixels tall.
>
> Again, train with the same processing you'll use for recognition.
>
> Bye
>
> Lorenzo
>
> 2018-07-25 16:49 GMT+02:00 Emiliano Isaza Villamizar <[email protected]>:
>
>> Hello,
>>
>> I'm trying to train tesseract to accurately extract information from a
>> table. Initially, when running pytesseract like this:
>>
>> pytesseract.image_to_string(img, lang='eng', config='--psm 11 --oem 1 -c tessedit_char_whitelist=0123456789')
>>
>> I get these results:
>>
>> ground truth    Tesseract
>> CN¥6.94         CN#6.94
>> ¥31660.90       ¥31660.90
>> Ltd             Lid
>>
>> I retrained tesseract with OCR-D. I extracted each cell and wrote the
>> ground truth for 3 tables that add up to 300 cells (300 labeled images). I
>> ran it for 15000 iterations and got an error of 0.5%. But now I get worse
>> results. Tesseract doesn't seem to read numbers and basic acronyms. Attached
>> you may find an example of an image used for training.
>>
>> ground truth    New tesseract
>> 000426.China    ooo426.cin
>>
>> How can I improve tesseract to read these weird characters? I already
>> tried to improve the image quality by transforming the image using cv2;
>> this is an example:
>>
>> img_grey = cv2.cvtColor(atable, cv2.COLOR_BGR2GRAY)
>> th3 = cv2.adaptiveThreshold(img_grey,255,cv2.ADAPTIVE_THRESH_GAUSSIAN_C,cv2.THRESH_BINARY,11,2)
>>
>> Thanks!!
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/1b05ace0-4ca6-4caf-94a8-d53f7c0bec35%40googlegroups.com
>> For more options, visit https://groups.google.com/d/optout.
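
For anyone following the thread, here is a minimal sketch of the two preprocessing steps suggested in the quoted reply: trimming the white borders and upscaling. It assumes Pillow is installed. The helper names, the near-white cutoff of 250, and the default height of 40 are illustrative choices of mine, not something taken from Lorenzo's scripts. It scales the height of the cropped image, which only approximates the text height when the crop contains a single line of text.

```python
from PIL import Image


def trim_white_border(img, threshold=250):
    """Crop away near-white margins (threshold is an assumed cutoff)."""
    gray = img.convert("L")
    # Mark every pixel darker than the cutoff as 255 so that getbbox()
    # returns the bounding box of the non-white content.
    mask = gray.point(lambda p: 255 if p < threshold else 0)
    bbox = mask.getbbox()
    # A fully white image has no bounding box; return it unchanged.
    return img.crop(bbox) if bbox else img


def upscale_to_height(img, target_height=40):
    """Resize to target_height pixels tall, keeping the aspect ratio."""
    scale = target_height / float(img.size[1])
    new_width = max(1, int(round(img.size[0] * scale)))
    return img.resize((new_width, target_height), Image.LANCZOS)


def preprocess(img, target_height=40):
    """Trim the borders first, then scale the remaining content."""
    return upscale_to_height(trim_white_border(img), target_height)
```

The result could then be passed to pytesseract with `--psm 13`, for example `pytesseract.image_to_string(preprocess(img), config='--psm 13')`. As the reply stresses, the same function should be applied to the training images as well.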

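The train, evaluate, back up loop from the quoted reply produces one lstmeval output per ITERATIONS value. Below is a small sketch for picking the best one. It assumes the output contains an error-rate line in the format shown in the reply, and the function names are hypothetical.

```python
import re

# Matches the "Eval Char error rate=..., Word error rate=..." summary line.
# \s+ tolerates a line wrap between "Word error" and "rate".
EVAL_RE = re.compile(
    r"Eval Char error rate=([0-9]+(?:\.[0-9]+)?),\s+"
    r"Word error\s+rate=([0-9]+(?:\.[0-9]+)?)"
)


def parse_lstmeval(output):
    """Return (char_error, word_error) from the last matching line."""
    matches = EVAL_RE.findall(output)
    if not matches:
        raise ValueError("no error-rate line found in lstmeval output")
    char_err, word_err = matches[-1]
    return float(char_err), float(word_err)


def best_checkpoint(results):
    """results maps an ITERATIONS value to the lstmeval output text.

    Returns (iterations, (char_error, word_error)) for the run with the
    lowest character error rate, using word error rate to break ties.
    """
    scored = {it: parse_lstmeval(text) for it, text in results.items()}
    return min(scored.items(), key=lambda kv: kv[1])
```

To use it, save the text printed by each lstmeval run, build a dict such as `{100: text_100, 200: text_200}`, and keep the backup whose suffix matches the iteration count that `best_checkpoint` returns.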
