Re: [tesseract-ocr] Not able to extract table contents for Scanned pdf's and Normal Pdf's using Tesseract-ocr?

Manasi sarode Fri, 31 May 2019 03:28:36 -0700

That's fair enough.

On Fri, May 31, 2019, 3:55 PM Sayali begampure <[email protected]>
wrote:


> We are trying to extract text from normal PDFs and scanned (image) PDFs
> using tesseract-ocr.
>
> We have observed the following issues with PDFs that contain tables: the
> table contents are not extracted properly.
>
>    1. The contents of a few cells (rows/columns) are missing. Sometimes
>    the table heading is missing as well.
>    2. If the table contains numbers, not all of them are extracted.
>    3. Some letters are recognized incorrectly, e.g. "i" is misread as "l".
>    4. The column order gets interchanged because the text is parsed
>    horizontally.
>    5. Some extra characters are extracted along with the expected ones.
>
> We tried image_to_string, image_to_data, and an OpenCV-based approach.
>
> The sample code used is:
>
> from PIL import Image
>
> import pytesseract
> from pytesseract import image_to_string
> from pytesseract import image_to_boxes
>
> # Run OCR on the table image and print the recognized text
> text = pytesseract.image_to_string(Image.open('table_number.jpg'))
> print(text)
>
>
> We expect the rows and columns to be extracted properly, which is not
> happening at present. Please suggest a function or method to improve
> table content extraction with Tesseract.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/6ddfa19c-8025-40f8-8f17-a393e5b5b2cc%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/6ddfa19c-8025-40f8-8f17-a393e5b5b2cc%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>
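
For the column-order problem (item 4 in the quoted list), one possible approach is to take the word boxes from image_to_data and rebuild the rows from their coordinates rather than relying on the plain-text order. The following is only a minimal sketch under stated assumptions: the helper names group_words_into_rows and ocr_table_rows, the 10-pixel row_tolerance, the min_conf cut-off, and the "--psm 6" page segmentation setting are illustrative choices, not something taken from the original post, and they will likely need tuning per document.

```python
def group_words_into_rows(data, row_tolerance=10, min_conf=0):
    """Group OCR word boxes into rows, each row ordered left to right.

    `data` is a dict of parallel lists with the keys "text", "left", "top",
    "height" and "conf", i.e. the shape pytesseract.image_to_data returns
    with output_type=Output.DICT. row_tolerance (pixels) is an assumed value.
    """
    words = []
    for text, left, top, height, conf in zip(
        data["text"], data["left"], data["top"], data["height"], data["conf"]
    ):
        # Skip empty entries (page/block/line records) and low-confidence words.
        if not str(text).strip() or float(conf) < min_conf:
            continue
        words.append((top + height / 2, left, text))

    words.sort()  # top to bottom by vertical centre, then left to right

    rows, current, current_y = [], [], None
    for y, left, text in words:
        # Start a new row when the word is too far below the row's first word.
        if current and abs(y - current_y) > row_tolerance:
            rows.append(current)
            current = []
        if not current:
            current_y = y
        current.append((left, text))
    if current:
        rows.append(current)

    # Within each row, order the words by their left edge.
    return [[text for _, text in sorted(row)] for row in rows]


def ocr_table_rows(image_path, row_tolerance=10):
    """Hypothetical wrapper: run Tesseract on an image and rebuild its rows."""
    # Imported here so the grouping helper can be used without Tesseract.
    from PIL import Image
    import pytesseract

    data = pytesseract.image_to_data(
        Image.open(image_path),
        config="--psm 6",  # assumption: treat the page as one uniform block
        output_type=pytesseract.Output.DICT,
    )
    return group_words_into_rows(data, row_tolerance=row_tolerance)
```

Calling ocr_table_rows('table_number.jpg') would then return a list of rows, each a list of words in left-to-right order. This only restores reading order; it does not recover missing cells or fix misread characters (items 1 to 3 and 5), which usually depend on image quality and preprocessing.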

