We are trying to extract text from both regular (text-based) PDFs and 
scanned (image-only) PDFs using tesseract-ocr.

We have observed the following issues with PDFs that contain tables: the 
table contents are not extracted properly.

   1. The contents of some cells (rows/columns) are missing. Sometimes the 
   table heading is missing as well.
   2. When a table contains numbers, not all of them are extracted.
   3. Some letters are recognized incorrectly, e.g. "i" is misread as "l".
   4. The column order gets mixed up because the text is parsed horizontally.
   5. Some extra characters are extracted along with the expected ones.

We have tried image_to_string, image_to_data, and an OpenCV-based approach.

Sample code used is:

from PIL import Image
import pytesseract

# Run OCR on the table image and print the recognized text
text = pytesseract.image_to_string(Image.open('table_number.jpg'))
print(text)
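
For reference, the image_to_data attempt can be sketched roughly as below. This is only a minimal sketch under stated assumptions, not a definitive approach: the helper names rows_from_ocr_data and ocr_table_rows, the 10-pixel row tolerance, the confidence cutoff, and the "--psm 6" page segmentation setting are all illustrative choices of ours, not anything recommended by tesseract. The idea is to take the word boxes that image_to_data returns and rebuild rows from their positions, instead of relying on the reading order of image_to_string.

```python
def rows_from_ocr_data(data, min_conf=0.0, row_tolerance=10):
    """Group OCR word boxes into rows, each row sorted left to right.

    `data` is a dict shaped like the output of pytesseract.image_to_data
    with output_type=Output.DICT (keys: text, conf, left, top, height).
    `min_conf` and `row_tolerance` (pixels) are assumed values for illustration.
    """
    words = []
    for text, conf, left, top, height in zip(
        data["text"], data["conf"], data["left"], data["top"], data["height"]
    ):
        # Skip empty entries (page/block/line records) and low-confidence words
        if not text.strip() or float(conf) < min_conf:
            continue
        # Use the vertical center of the box to decide which row a word is in
        words.append((top + height / 2.0, left, text))

    words.sort()  # top to bottom
    rows = []
    for y, left, text in words:
        # Same row if the center is close to the first word of the current row
        if rows and abs(y - rows[-1]["y"]) <= row_tolerance:
            rows[-1]["cells"].append((left, text))
        else:
            rows.append({"y": y, "cells": [(left, text)]})

    # Order the words in each row by their left edge
    return [[text for _, text in sorted(row["cells"])] for row in rows]


def ocr_table_rows(path, psm=6):
    """Hypothetical wrapper: run tesseract on an image and return grouped rows."""
    # Imported here so the grouping helper above can be used without tesseract
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(path),
        config=f"--psm {psm}",  # assumption: treat the page as one uniform block
        output_type=pytesseract.Output.DICT,
    )
    return rows_from_ocr_data(data)
```

Calling ocr_table_rows('table_number.jpg') would return a list of rows, each a list of words ordered by x position. It does not split words into table cells, so multi-word cells would still need to be merged by column position.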


The rows and columns should be extracted properly, which is not happening 
at present. Please suggest a function or method to improve table content 
extraction with tesseract.
