[tesseract-ocr] Not getting results with numbers and currency symbols in tables

Emiliano Isaza Villamizar Wed, 25 Jul 2018 07:49:47 -0700

Hello,

I'm trying to train Tesseract to accurately extract information from a
table. Initially, I ran pytesseract with this call:

pytesseract.image_to_string(img, lang='eng', config='--psm 11 --oem 1 -c tessedit_char_whitelist=0123456789')

I get these results:

Ground truth    Tesseract
CN¥6.94         CN#6.94
¥31660.90       ¥31660.90
Ltd             Lid
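
One thing that may be worth checking: the whitelist in the call above contains only digits, while the ground truth also contains letters, a period, and the ¥ sign. The following is a minimal sketch, not a verified fix, that derives a whitelist from ground-truth strings and builds the config string. The helper names `build_whitelist` and `build_config` are hypothetical, and whether `tessedit_char_whitelist` is honored with `--oem 1` depends on the Tesseract version in use.

```python
def build_whitelist(samples):
    """Return the sorted, de-duplicated non-whitespace characters in samples."""
    chars = {ch for text in samples for ch in text if not ch.isspace()}
    return "".join(sorted(chars))


def build_config(samples, psm=11, oem=1):
    """Build a pytesseract config string (hypothetical helper).

    Assumes the whitelist contains no quotes or other characters that
    would need escaping when the config string is split into arguments.
    """
    whitelist = build_whitelist(samples)
    return f"--psm {psm} --oem {oem} -c tessedit_char_whitelist={whitelist}"


# Ground-truth strings taken from the table above.
ground_truth = ["CN¥6.94", "¥31660.90", "Ltd"]
config = build_config(ground_truth)
print(config)
```

The resulting string could then be passed as `config=config` to `pytesseract.image_to_string` in place of the digits-only whitelist.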

I retrained Tesseract with OCR-D: I extracted each cell and wrote the
ground truth for 3 tables that add up to 300 cells (300 labeled images). I
ran the training for 15000 iterations and got an error of 0.5%. But now I get
worse results: Tesseract doesn't seem to read numbers and basic acronyms.
Attached you may find an example of an image used for training.

Ground truth    New Tesseract
000426.China    ooo426.cin
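
To put a number on the mismatches listed above, a per-cell character error rate (CER) can be computed from the edit distance between ground truth and OCR output. This is a minimal sketch in plain Python; the function names `levenshtein` and `cer` are my own, and the error figure reported by the training run may be defined differently from this per-cell measure.

```python
def levenshtein(ref, hyp):
    """Edit distance with insert, delete and substitute each costing 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from ref
                            curr[j - 1] + 1,      # insert h into ref
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by ground-truth length."""
    if not ref:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ref, hyp) / len(ref)


# (ground truth, OCR output) pairs taken from the tables in this message.
pairs = [
    ("CN¥6.94", "CN#6.94"),
    ("¥31660.90", "¥31660.90"),
    ("Ltd", "Lid"),
    ("000426.China", "ooo426.cin"),
]

for ref, hyp in pairs:
    print(f"{ref!r} -> {hyp!r}: CER = {cer(ref, hyp):.2f}")
```

Running the same comparison over all 300 labeled cells, before and after retraining, would show whether the retrained model is worse overall or only on certain character classes.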

How can I improve Tesseract so that it reads these unusual characters? I
already tried to improve the image quality by transforming the image with
OpenCV (cv2). This is an example:

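import cv2

# Convert the table image (atable, BGR) to grayscale before thresholding.
img_grey = cv2.cvtColor(atable, cv2.COLOR_BGR2GRAY)

# Adaptive Gaussian threshold with block size 11 and constant 2.
th3 = cv2.adaptiveThreshold(img_grey, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                            cv2.THRESH_BINARY, 11, 2)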

Thanks!!


₩276,077.30
