Re: [tesseract-ocr] Not able to extract table contents for Scanned pdf's and Normal Pdf's using Tesseract-ocr?

Sayali begampure Fri, 31 May 2019 03:44:42 -0700

I used psm for two-column documents, and it shows the results perfectly.
Can you send a link or pointers on how to use it for table content
extraction from a scanned PDF?
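
For context, this is roughly what I mean by using psm together with
image_to_data for a table. It is only a minimal sketch, not a confirmed
solution: the helper names group_words_into_rows and ocr_table, the row
tolerance of 10 pixels, and the choice of --psm 6 (treat the image as a single
uniform block of text) are my own assumptions for illustration. The idea is to
take the word boxes, group words with a similar vertical position into one
row, and sort each row left to right so the column order is kept.

```python
def group_words_into_rows(data, row_tolerance=10, min_conf=0.0):
    """Group word boxes (pytesseract Output.DICT layout) into table rows.

    Hypothetical helper: words whose vertical centers lie within
    row_tolerance pixels of the first word in a row are put in that row.
    """
    words = []
    for i, text in enumerate(data["text"]):
        if not text.strip():
            continue  # skip empty entries (block/line level records)
        if float(data["conf"][i]) < min_conf:
            continue  # skip low-confidence words
        center_y = data["top"][i] + data["height"][i] / 2
        words.append((center_y, data["left"][i], text))

    words.sort()  # top to bottom
    rows = []  # each entry: (center of first word, [(left, text), ...])
    for center_y, left, text in words:
        if rows and abs(center_y - rows[-1][0]) <= row_tolerance:
            rows[-1][1].append((left, text))
        else:
            rows.append((center_y, [(left, text)]))

    # Sort every row left to right and keep only the text
    return [[text for _, text in sorted(cells)] for _, cells in rows]


def ocr_table(path, psm=6):
    """Run Tesseract on an image file and return its words grouped by row."""
    # Imported here so the grouping helper can be tried without Tesseract
    from PIL import Image
    import pytesseract

    data = pytesseract.image_to_data(
        Image.open(path),
        config=f"--psm {psm}",
        output_type=pytesseract.Output.DICT,
    )
    return group_words_into_rows(data)
```

With this, ocr_table('table_number.jpg') would return one list of words per
row. The tolerance most likely needs tuning for the scan resolution, and cells
that contain several words would still need to be merged by their horizontal
gaps.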


Thanks

On Friday, 31 May 2019 16:00:36 UTC+5:30, Amulya Kali wrote:
>
> Did you try changing psm?
>
> On Fri, May 31, 2019, 15:57 Manasi sarode <[email protected]> wrote:
>
>> That's fair enough.
>>
>> On Fri, May 31, 2019, 3:55 PM Sayali begampure <[email protected]> wrote:
>>
>>> We are trying to extract text content from normal PDFs and scanned PDFs
>>> (images) using tesseract-ocr.
>>>
>>> We have observed the following issues for PDFs that contain tables: the
>>> table contents are not extracted properly.
>>>
>>>    1. The contents of a few cells (rows/columns) are missing. Sometimes
>>>    the heading of the table is missing.
>>>    2. If there are numbers inside the table, not all of the numbers are
>>>    extracted.
>>>    3. Some letters are extracted incorrectly, e.g. "i" is misread as "l".
>>>    4. The column sequence gets interchanged because the page is parsed
>>>    horizontally.
>>>    5. Some extra characters are extracted along with the normal ones.
>>>
>>> We tried image_to_string, image_to_data, and an OpenCV approach.
>>>
>>> Sample code used is:
>>>
>>> from PIL import Image
>>> import pytesseract
>>>
>>> # Run OCR on the table image and print the recognized text
>>> text = pytesseract.image_to_string(Image.open('table_number.jpg'))
>>> print(text)
>>>
>>>
>>> It should extract the rows and columns properly, which it does not do
>>> as of now. Kindly suggest a function or method to improve the results
>>> for table content extraction using Tesseract.
>>>
>>> -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit 
>>> https://groups.google.com/d/msgid/tesseract-ocr/6ddfa19c-8025-40f8-8f17-a393e5b5b2cc%40googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>

