I am also facing the same issue with scanned PDFs, especially those with 
multiple columns and text mixed with numbers. Please share any input here 
if anyone has tried tesseract or other APIs.

On Friday, May 31, 2019 at 3:55:08 PM UTC+5:30, Sayali begampure wrote:
>
> We are trying to extract text content from regular PDFs and scanned 
> (image) PDFs using tesseract-ocr.
>
> We have observed the following issues with PDFs that contain tables: the 
> table contents are not extracted properly.
>
>    1. Contents of a few cells (rows/columns) are missing. Sometimes the 
>    table heading is missing as well.
>    2. If the table contains numbers, not all of them are extracted.
>    3. Some letters are recognized incorrectly, e.g. "i" is misread as "l".
>    4. The column sequence gets interchanged because the text is parsed 
>    horizontally.
>    5. Some extra characters are extracted along with the correct ones.
>
> We tried image_to_string, image_to_data, and an OpenCV approach.
>
> The sample code used is:
>
> from PIL import Image
> import pytesseract
>
> # Run OCR on the table image and print the recognized text
> text = pytesseract.image_to_string(Image.open('table_number.jpg'))
> print(text)
>
>
> It should extract the rows and columns properly, which it does not do as 
> of now. Kindly suggest a function or method to improve the results of 
> table content extraction using tesseract.
>
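
For the column-order problem (item 4 above), one thing that may be worth 
trying is to take the word boxes from image_to_data and rebuild the rows 
from their coordinates instead of relying on the plain-text order. The 
following is only a rough sketch under several assumptions: the helper names 
group_words_into_rows and ocr_table_rows, the 10-pixel row tolerance, the 
confidence cutoff, and the choice of "--psm 6" are all illustrative and not 
taken from the original post, and the values would need tuning per document.

```python
def group_words_into_rows(data, row_tolerance=10, min_conf=0):
    """Rebuild table rows from word boxes.

    `data` is a dict shaped like the result of
    pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT).
    row_tolerance (pixels) and min_conf are assumed values; tune them.
    """
    words = []
    for text, conf, left, top, height in zip(
        data["text"], data["conf"], data["left"], data["top"], data["height"]
    ):
        # Skip empty entries and low-confidence boxes (non-word levels have conf -1)
        if not text.strip() or float(conf) < min_conf:
            continue
        words.append((top + height / 2, left, text))

    # Sort by vertical center so words on the same visual line are adjacent
    words.sort()

    rows = []
    for center, left, text in words:
        if rows and abs(center - rows[-1]["center"]) <= row_tolerance:
            rows[-1]["cells"].append((left, text))
        else:
            rows.append({"center": center, "cells": [(left, text)]})

    # Within each row, order the words left to right
    return [[text for _, text in sorted(row["cells"])] for row in rows]


def ocr_table_rows(image_path, psm=6):
    """Run Tesseract on an image and return words grouped into rows.

    Imports are local so the grouping helper works without Tesseract installed.
    """
    from PIL import Image
    import pytesseract

    data = pytesseract.image_to_data(
        Image.open(image_path),
        config=f"--psm {psm}",  # psm 6: assume a single uniform block of text
        output_type=pytesseract.Output.DICT,
    )
    return group_words_into_rows(data)
```

Calling ocr_table_rows('table_number.jpg') would return a list of rows, each 
a list of words ordered left to right. This only restores the reading order 
within each row; it does not detect cell borders, so words from neighbouring 
cells are not merged or separated by column.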
