Re: [tesseract-ocr] Not able to extract table contents for Scanned pdf's and Normal Pdf's using Tesseract-ocr?

Amulya Kali Fri, 31 May 2019 03:31:00 -0700

Did you try changing the page segmentation mode (psm)?
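
In case it helps, here is a minimal sketch of what changing the psm could look like with pytesseract. The helper names psm_config and ocr_with_psm are made up for this illustration and are not part of pytesseract; the only library calls used are Image.open and pytesseract.image_to_string with its config argument. Which mode works best for a given table is something to find by experiment, so treat the values below as starting points rather than a recommendation.

```python
def psm_config(psm: int) -> str:
    """Build the Tesseract config string for a page segmentation mode.

    Hypothetical helper for illustration. Tesseract 4 documents modes 0-13.
    """
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    return f"--psm {psm}"


def ocr_with_psm(image_path: str, psm: int) -> str:
    """Run Tesseract on one image with the given page segmentation mode.

    Hypothetical helper for illustration; requires Pillow, pytesseract,
    and a Tesseract installation.
    """
    # Imported here so psm_config can be used without the OCR dependencies.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        # The config string is passed through to the tesseract command line.
        return pytesseract.image_to_string(img, config=psm_config(psm))
```

For example, ocr_with_psm('table_number.jpg', 6) treats the page as a single uniform block of text, and ocr_with_psm('table_number.jpg', 4) treats it as a single column of variable-size text. Comparing the output of a few modes on the same image is usually the quickest way to see whether segmentation is the cause of the missing cells.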

On Fri, May 31, 2019, 15:57 Manasi sarode <[email protected]>
wrote:


> That's fair enough.
>
> On Fri, May 31, 2019, 3:55 PM Sayali begampure <[email protected]>
> wrote:
>
>> We are trying to extract text from both normal (text-based) PDFs and
>> scanned (image) PDFs using tesseract-ocr.
>>
>> For PDFs that contain tables, the table contents are not extracted
>> properly. We have observed the following issues:
>>
>>    1. Contents of a few cells (rows/columns) are missing. Sometimes the
>>    table heading is missing as well.
>>    2. If the table contains numbers, not all of them are extracted.
>>    3. Some letters are extracted incorrectly, e.g. "i" is misread as "l".
>>    4. The column order gets mixed up because the text is parsed
>>    horizontally.
>>    5. Some extra characters are extracted along with the expected ones.
>>
>> We tried image_to_string, image_to_data, and an OpenCV-based approach.
>>
>> Sample code used is:
>>
>> from PIL import Image
>> import pytesseract
>>
>> # Run OCR on the table image and print the recognized text
>> text = pytesseract.image_to_string(Image.open('table_number.jpg'))
>> print(text)
>>
>>
>> We expect the rows and columns to be extracted correctly, which is not
>> happening at the moment. Kindly suggest a function or method to improve
>> table content extraction with Tesseract.
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/6ddfa19c-8025-40f8-8f17-a393e5b5b2cc%40googlegroups.com
>> <https://groups.google.com/d/msgid/tesseract-ocr/6ddfa19c-8025-40f8-8f17-a393e5b5b2cc%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>> For more options, visit https://groups.google.com/d/optout.
>>

