Hello,

I'm trying to train Tesseract to accurately extract information from a 
table. Initially, I ran pytesseract with this call:

pytesseract.image_to_string(img, lang='eng', config='--psm 11 --oem 1 -c tessedit_char_whitelist=0123456789')

I get these results:

ground truth                            Tesseract

CN¥6.94                                 CN#6.94
¥31660.90                               ¥31660.90
Ltd                                     Lid

I then retrained Tesseract with OCR-D. I extracted each cell and wrote the 
ground truth for 3 tables, which add up to 300 cells (300 labeled images). I 
ran the training for 15000 iterations and got an error rate of 0.5%. But now 
I get worse results: Tesseract doesn't seem to read numbers and basic 
acronyms. Attached you may find an example of an image used for training.

ground truth                              New tesseract

000426.China                            ooo426.cin
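
To put a number on the mismatches listed above, here is a minimal sketch of a character error rate (CER) computed per cell. It assumes CER is defined as the Levenshtein edit distance divided by the length of the ground truth, which is not necessarily how the training tool computes the error rate it reports. The function names `levenshtein` and `cer` are hypothetical helpers written for this illustration; the string pairs are the ones quoted in this post.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn ref into hyp."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from ref
                curr[j - 1] + 1,     # insert h into ref
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate relative to the ground truth (assumed definition)."""
    if not ref:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ref, hyp) / len(ref)


# (ground truth, OCR output) pairs quoted in this post.
pairs = [
    ("CN¥6.94", "CN#6.94"),
    ("Ltd", "Lid"),
    ("000426.China", "ooo426.cin"),
]

for ref, hyp in pairs:
    print(f"{ref!r} -> {hyp!r}: CER = {cer(ref, hyp):.2f}")
```

Running the same comparison over all 300 labeled cells, before and after retraining, would show whether the retrained model is really worse overall or only on certain kinds of cells.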

How can I improve Tesseract so that it reads these weird characters? I 
already tried to improve the image quality by preprocessing the image with 
OpenCV (cv2). This is an example:


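import cv2

# atable is the table image in BGR order; convert it to greyscale first,
# because adaptiveThreshold expects a single-channel 8-bit image.
img_grey = cv2.cvtColor(atable, cv2.COLOR_BGR2GRAY)
# Gaussian adaptive threshold: 11x11 neighbourhood, constant 2 subtracted.
th3 = cv2.adaptiveThreshold(img_grey, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2)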


Thanks!!
