We are trying to extract text content from regular PDFs and scanned
(image) PDFs using tesseract-ocr.
We have observed the following issues with PDFs that contain tables: the
table contents are not extracted properly.
1. The contents of some cells (rows/columns) are missing from the output.
Sometimes the table heading is missing as well.
2. When a table contains numbers, not all of the numbers are extracted.
3. Some letters are recognized incorrectly, e.g. i is misread as l.
4. The column order gets interchanged because the text is parsed
horizontally.
5. Some extra characters are extracted along with the expected ones.
We have tried image_to_string, image_to_data, and an OpenCV-based approach.
The sample code used is:
from PIL import Image
import pytesseract
from pytesseract import image_to_string, image_to_boxes

# Run OCR on the table image and print the recognized text
text = pytesseract.image_to_string(Image.open('table_number.jpg'))
print(text)
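For reference, the image_to_data attempt can be sketched roughly as below. This is only a minimal sketch under assumptions, not a verified fix: group_words_into_rows, row_tolerance and min_conf are hypothetical names of our own, and the input is assumed to be the dict that pytesseract.image_to_data returns with output_type=pytesseract.Output.DICT. It groups word boxes into rows by their top coordinate and sorts each row by the left coordinate, which is aimed at issue 4 (column order).

```python
def group_words_into_rows(data, row_tolerance=10, min_conf=0.0):
    """Group OCR word boxes into rows, each sorted left to right.

    `data` is a dict of parallel lists with the keys "text", "conf",
    "left" and "top", e.g. the result of (assumed call):
        pytesseract.image_to_data(
            img, config="--psm 6", output_type=pytesseract.Output.DICT)
    `row_tolerance` is a hypothetical pixel threshold to tune per document.
    """
    words = []
    for text, conf, left, top in zip(
        data["text"], data["conf"], data["left"], data["top"]
    ):
        # Skip empty strings and non-word entries (reported with conf -1)
        if not text.strip() or float(conf) < min_conf:
            continue
        words.append((top, left, text))

    rows = []
    for top, left, text in sorted(words):
        # Same row if the word is vertically close to the row's first word
        if rows and abs(top - rows[-1]["top"]) <= row_tolerance:
            rows[-1]["cells"].append((left, text))
        else:
            rows.append({"top": top, "cells": [(left, text)]})

    # Order the words of each row by their horizontal position
    return [[text for _, text in sorted(row["cells"])] for row in rows]
```

Note that this returns one entry per recognized word, so a cell containing several words is not merged into a single cell; doing that would need an additional assumption about column boundaries.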
It should extract the rows and columns properly, which it currently does
not. Kindly suggest a function or method to improve the results for table
content extraction using Tesseract.