Tesseract is an OCR library, i.e. the user is responsible for image preprocessing.
Zdenko

On Sun, 24 Mar 2019 at 4:12, <[email protected]> wrote:

> Hi, I am confused about why upscaling works. Tesseract itself also
> has a step that prescales the image to a height of 36 pixels.
>
> On Monday, July 30, 2018 at 11:19:23 PM UTC+8, Emiliano Isaza Villamizar wrote:
>>
>> Lorenzo, thank you so much for your help. I did everything step by step
>> and got a very good result. I think what helped me most was upscaling the
>> images. The code I used is in Python and is the following, in case anyone
>> is following the thread:
>>
>>     from PIL import Image
>>
>>     # baseheight is the target image height in pixels
>>     # (its value was not given in the original post)
>>     img = Image.open(imagepath)
>>     hpercent = baseheight / float(img.size[1])
>>     wsize = int(float(img.size[0]) * hpercent)
>>     # Image.LANCZOS is the same filter as the older ANTIALIAS name,
>>     # which newer Pillow releases no longer provide
>>     img = img.resize((wsize, baseheight), Image.LANCZOS)
>>
>> I'm a real newbie in bash, so I didn't use your scripts; I kept getting a
>> permission error. Thank you again, Lorenzo!
>>
>> On Thursday, July 26, 2018 at 5:46:44 AM UTC-5, Lorenzo Blz wrote:
>>>
>>> First, read this: "Fine Tuning for ± a few characters"
>>> <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters>
>>>
>>> Then check the data/unicharset file to see if everything is OK and
>>> all the characters you want are there.
>>>
>>> Then, 15000 iterations are way too many and 300 samples are really too
>>> few. If you train too much you'll get worse results.
>>>
>>> I usually get the best fine-tuning results from 400 to 2000 iterations.
>>> I go higher, up to 20k iterations, only when I have many sample images: a
>>> few thousand with multiple words.
>>>
>>> I do it like this (this is not a complete guide, just to give you the
>>> general idea):
>>>
>>> - Clean the data and data/checkpoints folders (do NOT add -rf, you do not
>>>   want to wipe out the training data):
>>>
>>>       rm data/*
>>>       rm data/checkpoints/*
>>>
>>>   (Do this only once, when you start a new training session, not after
>>>   each training step.)
>>>
>>> - Go into the Makefile and fix this (in the "data/list.eval" block, remove
>>>   the + before $$no):
>>>
>>>       tail -n "$$no" $(ALL_LSTMF) > "$@"
>>>
>>>   Then add somewhere at the top:
>>>
>>>       ITERATIONS=100
>>>
>>>   and change the max_iterations line to this (do not change the
>>>   tabs/spaces at the beginning, just replace the number):
>>>
>>>       --max_iterations $(ITERATIONS)
>>>
>>> - Now run the training as normal, like this:
>>>
>>>       make training ITERATIONS=100
>>>
>>> - When it finishes, run this:
>>>
>>>       lstmeval --model data/YOUR_MODEL.traineddata --eval_listfile data/list.eval
>>>
>>>   In the last line you'll get something like this:
>>>
>>>       At iteration 0, stage 0, Eval Char error rate=0.96153846, Word error rate=3.8461538
>>>
>>>   These are the only values that matter. Take note of these values and the
>>>   iteration numbers.
>>>
>>>   Make a backup of the model:
>>>
>>>       cp data/YOUR_MODEL.traineddata data/YOUR_MODEL.traineddata_100
>>>
>>> - Now start the training again with ITERATIONS=200; it will resume from
>>>   the previous iteration and run up to 200:
>>>
>>>       make training ITERATIONS=200
>>>
>>> - Run lstmeval again, take note, back up, and so on: 300, 400, 500...
>>>
>>> You should see that the error rate goes down for a while, then slows
>>> down, and then starts to get worse. Use the model where you got the
>>> best score.
>>>
>>> You can try this, but 300 samples are likely way too few for this to be
>>> meaningful.
>>>
>>> I'm attaching my training scripts; they should work, but double check
>>> everything.
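
The train, evaluate, back up loop quoted above could be scripted roughly as follows. This is only a sketch under assumptions: the regular expression is modelled on the single lstmeval summary line quoted in the thread (other versions may print it differently), the helper names parse_lstmeval, train_and_eval and best_iteration are hypothetical, and the make target and paths are the ones from Lorenzo's message.

```python
import re
import shutil
import subprocess

# Modelled on the one summary line quoted in the thread (assumption:
# the installed lstmeval prints the same format).
EVAL_LINE = re.compile(
    r"Eval Char error rate=(?P<cer>\d+(?:\.\d+)?),\s*"
    r"Word error\s+rate=(?P<wer>\d+(?:\.\d+)?)"
)


def parse_lstmeval(output):
    """Return (char_error, word_error) from the last summary line, or None."""
    matches = list(EVAL_LINE.finditer(output))
    if not matches:
        return None
    last = matches[-1]
    return float(last.group("cer")), float(last.group("wer"))


def train_and_eval(iterations, model="data/YOUR_MODEL.traineddata"):
    """Train up to `iterations`, evaluate, and back up the model."""
    subprocess.run(["make", "training", f"ITERATIONS={iterations}"], check=True)
    result = subprocess.run(
        ["lstmeval", "--model", model, "--eval_listfile", "data/list.eval"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # capture both streams to be safe
        text=True,
        check=True,
    )
    # Same backup naming as in the thread: YOUR_MODEL.traineddata_100, ...
    shutil.copy(model, f"{model}_{iterations}")
    return parse_lstmeval(result.stdout)


def best_iteration(scores):
    """scores maps iteration count -> (char_error, word_error)."""
    # Lowest character error wins; word error breaks ties.
    return min(scores, key=lambda n: scores[n])
```

One way to use it would be to collect scores = {n: train_and_eval(n) for n in range(100, 2100, 100)} and then keep the backup named by best_iteration(scores). The step size and upper bound here are illustrative, not a recommendation from the thread.
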
>>>
>>> About thresholding: you probably do not need it. Just increase the
>>> contrast a little, do not go binary. Probably you do not need that either.
>>> And do the same processing to the training data that you will do on your
>>> real data.
>>>
>>> Two important things, for both training and recognition. Use PSM=13
>>> (PSM.RAW_LINE). Trim all the white borders and upscale the image so that
>>> the text is 30-50 pixels tall.
>>>
>>> Again, train with the same processing you'll use for recognition.
>>>
>>> Bye
>>>
>>> Lorenzo
>>>
>>> 2018-07-25 16:49 GMT+02:00 Emiliano Isaza Villamizar <[email protected]>:
>>>
>>>> Hello,
>>>>
>>>> I'm trying to train Tesseract to accurately extract information from a
>>>> table. Initially, when running pytesseract like this:
>>>>
>>>>     pytesseract.image_to_string(img, lang='eng', config='--psm 11 --oem 1 -c tessedit_char_whitelist=0123456789')
>>>>
>>>> I get these results:
>>>>
>>>>     ground truth    Tesseract
>>>>     CN¥6.94         CN#6.94
>>>>     ¥31660.90       ¥31660.90
>>>>     Ltd             Lid
>>>>
>>>> I retrained Tesseract with OCR-D: I extracted each cell and wrote the
>>>> ground truth for 3 tables that add up to 300 cells (300 labeled images). I
>>>> ran it for 15000 iterations and got an error of 0.5%. But now I get worse
>>>> results. Tesseract doesn't seem to read numbers and basic acronyms.
>>>> Attached you may find an example of an image used for training.
>>>>
>>>>     ground truth    New Tesseract
>>>>     000426.China    ooo426.cin
>>>>
>>>> How can I improve Tesseract to read these weird characters? I already
>>>> tried to improve the image quality by transforming the image using cv2.
>>>> This is an example:
>>>>
>>>>     # convert to greyscale first, then apply the adaptive threshold
>>>>     img_grey = cv2.cvtColor(atable, cv2.COLOR_BGR2GRAY)
>>>>     th3 = cv2.adaptiveThreshold(img_grey, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2)
>>>>
>>>> Thanks!!
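
The preprocessing advice quoted above (trim the white borders, then upscale so the text is 30-50 pixels tall) might look like this with Pillow. It is a minimal sketch, not Lorenzo's actual scripts: the function names, the tolerance value and the 40-pixel default are assumptions, and it treats the height of the trimmed single-line crop as the text height.

```python
from PIL import Image


def trim_white_border(img, tolerance=10):
    """Crop away near-white margins around the text."""
    gray = img.convert("L")
    # Mark every pixel that is clearly darker than white as "ink".
    mask = gray.point(lambda p: 255 if p < 255 - tolerance else 0)
    bbox = mask.getbbox()  # bounding box of the non-zero (ink) pixels
    return img.crop(bbox) if bbox else img


def scale_to_height(img, target_height=40):
    """Resize so the image is target_height pixels tall, keeping aspect ratio."""
    ratio = target_height / img.height
    new_width = max(1, round(img.width * ratio))
    return img.resize((new_width, target_height), Image.LANCZOS)


def preprocess_line(img, target_height=40):
    """Trim, then rescale; apply the same steps to training and real images."""
    return scale_to_height(trim_white_border(img), target_height)
```

The result can then be passed to pytesseract.image_to_string with config='--psm 13', matching the page segmentation mode recommended in the thread.
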
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/1b05ace0-4ca6-4caf-94a8-d53f7c0bec35%40googlegroups.com.
>>>> For more options, visit https://groups.google.com/d/optout.

