Re: [tesseract-ocr] Not able to extract table contents for Scanned pdf's and Normal Pdf's using Tesseract-ocr?

Saylee Begampure Thu, 06 Jun 2019 20:46:15 -0700

Thanks a lot! Yes, I am familiar with the ML part. I will implement it and try 
to get results.


On Thursday, 6 June 2019 19:38:17 UTC+5:30, Krishna Prasad wrote:
>
> Hi Sayali,
>      I'm dealing with a similar problem. Detecting table contents 
> accurately has never been easy with Tesseract. I would suggest building 
> your own pipeline for detecting tables and complex layouts. There are many 
> public datasets available. I'm trying to use DeepLab v3+ by Google.
> Model: https://github.com/tensorflow/models/tree/master/research/deeplab
> Dataset: https://www.primaresearch.org/datasets/Layout_Analysis
> DeepLab is well documented and really good at its job. If you are 
> familiar with ML, this should be a piece of cake for you. 
>
> Hope this helps. 😃
>
> Regards,
> Krishna Prasad A S
>
> On Thu, Jun 6, 2019 at 8:41 AM Sayali begampure <[email protected]> wrote:
>
>> Hello, I tried both psm 6 and psm 3, but there is still a problem 
>> detecting the table contents. Numbers do not appear in the output, and 
>> sometimes only the heading appears.
>> Is there any other change I can make in Tesseract, or anything I can do to 
>> improve image quality?
>>
>> TIA
>>
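
On the image-quality question above, a minimal preprocessing sketch, assuming Pillow is installed and the scan resolution is known or can be guessed. Tesseract's documentation recommends input at roughly 300 DPI. The helper names `upscale_factor` and `preprocess_for_ocr`, the 150 DPI default, and the fixed threshold of 160 are illustrative assumptions, not settings anyone in this thread reported using.

```python
def upscale_factor(current_dpi, target_dpi=300):
    """Return how much to enlarge a scan to reach target_dpi (never shrinks)."""
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    return max(1.0, target_dpi / current_dpi)


def preprocess_for_ocr(image_path, current_dpi=150, threshold=160):
    """Grayscale, upscale, stretch contrast, and binarize a scanned page."""
    # Imported lazily so upscale_factor can be used without Pillow installed.
    from PIL import Image, ImageOps

    img = Image.open(image_path).convert("L")  # convert to grayscale
    factor = upscale_factor(current_dpi)
    if factor > 1.0:
        new_size = (round(img.width * factor), round(img.height * factor))
        img = img.resize(new_size, Image.LANCZOS)
    img = ImageOps.autocontrast(img)  # stretch the intensity histogram
    # Assumed fixed threshold; it usually needs tuning per document.
    return img.point(lambda p: 255 if p > threshold else 0)
```

The returned image can be passed to pytesseract.image_to_string in place of the raw file. Whether binarization helps depends on the scan, so it is worth comparing the output with and without it.
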
>> On Friday, 31 May 2019 19:41:50 UTC+5:30, Amulya Kali wrote:
>>>
>>> Hi, I'm not sure which psm mode you used. You can try psm 6 
>>> for tables. 
>>>
>>> Something like this: 
>>> pytesseract.image_to_string(image, lang='eng', config='--psm 6')
>>>
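
Rather than guessing a page segmentation mode, one option is to run several and compare the confidences Tesseract reports. The sketch below assumes pytesseract and Pillow are installed. The helper names `mean_confidence` and `best_psm` and the candidate modes (3, 4, 6) are illustrative choices, not something suggested in the thread.

```python
def mean_confidence(data):
    """Average confidence of recognized words in an image_to_data-style dict."""
    confs = [
        float(conf)
        for conf, text in zip(data["conf"], data["text"])
        if text.strip() and float(conf) >= 0  # -1 marks non-word boxes
    ]
    return sum(confs) / len(confs) if confs else 0.0


def best_psm(image_path, modes=(3, 4, 6)):
    """Run OCR once per page segmentation mode; return (best_mode, scores)."""
    # Imported lazily so mean_confidence works without the OCR stack installed.
    from PIL import Image
    import pytesseract

    image = Image.open(image_path)
    scores = {}
    for psm in modes:
        data = pytesseract.image_to_data(
            image,
            lang="eng",
            config=f"--psm {psm}",
            output_type=pytesseract.Output.DICT,
        )
        scores[psm] = mean_confidence(data)
    return max(scores, key=scores.get), scores
```

A high mean confidence does not guarantee that every cell was found, so the winning mode should still be checked by eye against the table.
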
>>> On Fri, May 31, 2019, 16:14 Sayali begampure <[email protected]> 
>>> wrote:
>>>
>>>> I used psm for 2-column documents. It's showing results perfectly.
>>>> Can you send a link or pointers on how to use it for table content 
>>>> extraction from a scanned PDF?
>>>>
>>>> Thanks
>>>>
>>>> On Friday, 31 May 2019 16:00:36 UTC+5:30, Amulya Kali wrote:
>>>>>
>>>>> Did you try changing psm?
>>>>>
>>>>> On Fri, May 31, 2019, 15:57 Manasi sarode <[email protected]> 
>>>>> wrote:
>>>>>
>>>>>> That's fair enough.
>>>>>>
>>>>>> On Fri, May 31, 2019, 3:55 PM Sayali begampure <[email protected]> 
>>>>>> wrote:
>>>>>>
>>>>>>> We are trying to extract text content from normal PDFs and scanned 
>>>>>>> PDFs (images) using tesseract-ocr.
>>>>>>>
>>>>>>> We have observed the following issues for PDFs with tables, where 
>>>>>>> the table contents are not extracted properly.
>>>>>>>
>>>>>>>    1. Contents of a few cells (rows/columns) are missing. 
>>>>>>>    Sometimes the heading of the table is missing.
>>>>>>>    2. If there are numbers inside the table, not all of the numbers 
>>>>>>>    are extracted.
>>>>>>>    3. Some letters are extracted wrongly, e.g. i is misinterpreted 
>>>>>>>    as l.
>>>>>>>    4. The column sequence gets interchanged because it parses 
>>>>>>>    horizontally.
>>>>>>>    5. Some extra characters are extracted along with the normal 
>>>>>>>    ones.
>>>>>>>
>>>>>>> We tried image_to_string, image_to_data, and an OpenCV approach.
>>>>>>>
>>>>>>> Sample code used is:
>>>>>>>
>>>>>>> from PIL import Image
>>>>>>> import pytesseract
>>>>>>>
>>>>>>> # Run OCR on the table image and print the recognized text
>>>>>>> text = pytesseract.image_to_string(Image.open('table_number.jpg'))
>>>>>>> print(text)
>>>>>>>
>>>>>>>
>>>>>>> It should extract rows and columns properly, which it does not do 
>>>>>>> as of now. Kindly suggest a function or method to improve the 
>>>>>>> results for table content extraction using Tesseract.
>>>>>>>
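
For the row and column ordering problem described above, one possible approach is to take word boxes from image_to_data and regroup them by vertical position, so that each table row is read left to right. This is a sketch under assumptions: the helper names `group_words_into_rows` and `ocr_rows` and the 10-pixel `row_tolerance` are hypothetical, and the tolerance will need tuning to the scan's row height.

```python
def group_words_into_rows(data, row_tolerance=10, min_conf=0.0):
    """Cluster word boxes by vertical centre, then sort each row left to right."""
    words = []
    for i, text in enumerate(data["text"]):
        if not text.strip():
            continue  # skip empty entries (page/block/line-level boxes)
        if float(data["conf"][i]) < min_conf:
            continue  # skip low-confidence words
        y_center = data["top"][i] + data["height"][i] / 2
        words.append((y_center, data["left"][i], text))
    words.sort()

    rows = []  # each entry: [y of first word in the row, [(left, text), ...]]
    for y, left, text in words:
        if rows and abs(y - rows[-1][0]) <= row_tolerance:
            rows[-1][1].append((left, text))
        else:
            rows.append([y, [(left, text)]])
    return [[text for _, text in sorted(cells)] for _, cells in rows]


def ocr_rows(image_path, psm=6):
    """OCR an image and return its words grouped into rows."""
    # Imported lazily so the grouping logic can be tested without Tesseract.
    from PIL import Image
    import pytesseract

    data = pytesseract.image_to_data(
        Image.open(image_path),
        lang="eng",
        config=f"--psm {psm}",
        output_type=pytesseract.Output.DICT,
    )
    return group_words_into_rows(data)
```

This only restores reading order. Cells that Tesseract fails to detect at all will still be missing, and multi-word cells are returned as separate words rather than merged per column.
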
>>>>>>> -- 
>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>> Groups "tesseract-ocr" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>> send an email to [email protected].
>>>>>>> To post to this group, send email to [email protected].
>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>> To view this discussion on the web visit 
>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/6ddfa19c-8025-40f8-8f17-a393e5b5b2cc%40googlegroups.com.
>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>
>
