Hello, I'm trying to train Tesseract to accurately extract information from a table. Initially, running pytesseract like this:

    pytesseract.image_to_string(img, lang='eng', config='--psm 11 --oem 1 -c tessedit_char_whitelist=0123456789')

I get these results:

    ground truth    Tesseract
    CN¥6.94         CN#6.94
    ¥31660.90       ¥31660.90
    Ltd             Lid

I retrained Tesseract with OCR-D: I extracted each cell and wrote the ground truth for 3 tables, which adds up to 300 cells (300 labeled images). I ran the training for 15000 iterations and got an error rate of 0.5%. But now I get worse results: Tesseract doesn't seem to read numbers and basic acronyms. Attached you can find an example of an image used for training.

    ground truth    New Tesseract
    000426.China    ooo426.cin

How can I improve Tesseract so it reads these characters correctly? I already tried to improve the image quality by transforming the image with OpenCV (cv2), for example (note that the grayscale conversion has to come before the thresholding, since cv2.adaptiveThreshold only accepts single-channel images):

    img_grey = cv2.cvtColor(atable, cv2.COLOR_BGR2GRAY)
    th3 = cv2.adaptiveThreshold(img_grey, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2)

Thanks!!
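To make the preprocessing step above concrete, here is a minimal sketch of the corrected pipeline, with grayscale conversion done before adaptive thresholding. The function name, the 2x upscaling step, and the threshold parameters are my own assumptions for illustration, not values from the original setup; they would need tuning on the real table images.

    import cv2
    import numpy as np

    def preprocess_for_ocr(img_bgr):
        """Turn a BGR table-cell crop into a clean binary image for Tesseract.

        Grayscale conversion must happen first: cv2.adaptiveThreshold only
        accepts single-channel 8-bit input.
        """
        grey = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
        # Upscaling small cell crops often helps the LSTM engine; the 2x
        # factor here is a guess to tune, not a fixed recommendation.
        grey = cv2.resize(grey, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
        binary = cv2.adaptiveThreshold(
            grey, 255,
            cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
            cv2.THRESH_BINARY,
            11, 2)
        return binary

The result can then be passed to pytesseract.image_to_string(); for single-line cell crops, --psm 7 (treat the image as a single text line) may be a better fit than --psm 11.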

