Hi Sayali,
I'm dealing with a similar problem. Detecting table contents
accurately has never been easy with Tesseract. I would suggest building
your own pipeline for detecting tables and complex layouts. There are many
public datasets available. I'm trying to use DeepLab v3+ by Google:
Model : https://github.com/tensorflow/models/tree/master/research/deeplab
Dataset : https://www.primaresearch.org/datasets/Layout_Analysis
DeepLab is well documented and really good at its job. If you are
familiar with ML, this should be a piece of cake for you.
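
Before training a model, it may be worth checking how far the word boxes
from image_to_data get you with the psm 6 setting suggested earlier in
the thread. Below is a minimal sketch, not a tested solution for your PDFs:
group_words_into_rows and ocr_table_rows are names I made up, and the
grouping assumes Tesseract puts each table row on a single text line, which
may not hold for every layout. Words are ordered left to right inside a row,
which should help when the column order gets mixed up.

```python
from collections import defaultdict


def group_words_into_rows(data, min_conf=0.0):
    """Group word boxes from an image_to_data-style dict into text rows.

    `data` is expected to hold parallel lists under the keys 'text', 'conf',
    'left', 'top', 'block_num', 'par_num' and 'line_num'.
    """
    lines = defaultdict(list)
    for i, text in enumerate(data["text"]):
        word = text.strip()
        # Entries for pages, blocks and lines carry empty text and conf -1.
        if not word or float(data["conf"][i]) < min_conf:
            continue
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines[key].append((data["left"][i], data["top"][i], word))

    rows = []
    for words in lines.values():
        words.sort()  # left to right by x coordinate
        top = min(t for _, t, _ in words)
        rows.append((top, [w for _, _, w in words]))
    rows.sort(key=lambda r: r[0])  # top to bottom
    return [cells for _, cells in rows]


def ocr_table_rows(image_path, min_conf=0.0):
    """Run Tesseract with --psm 6 and return the words grouped into rows."""
    # Imported lazily so the grouping helper works without Tesseract installed.
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path),
        lang="eng",
        config="--psm 6",
        output_type=pytesseract.Output.DICT,
    )
    return group_words_into_rows(data, min_conf)
```

Calling ocr_table_rows('table_number.jpg') would return a list of rows, each
a list of words. Raising min_conf drops low-confidence words, which may remove
some of the stray characters, at the risk of losing real cell values too.
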
Hope this helps. 😃
Regards,
Krishna Prasad A S
On Thu, Jun 6, 2019 at 8:41 AM Sayali begampure <[email protected]>
wrote:
> Hello, I tried both psm 6 and psm 3, but there is still a problem detecting
> the table contents. Numbers are missing from the output, and sometimes only
> the heading is extracted.
> Is there any other change I can make in Tesseract, or to improve image quality?
>
> TIA
>
> On Friday, 31 May 2019 19:41:50 UTC+5:30, Amulya Kali wrote:
>>
>> Hi, I'm not sure which psm mode you have used. You can try psm 6 for
>> tables.
>>
>> Something like this:
>> pytesseract.image_to_string(image, lang='eng', config='--psm 6')
>>
>> On Fri, May 31, 2019, 16:14 Sayali begampure <[email protected]>
>> wrote:
>>
>>> I used psm for two-column documents. It shows the results perfectly.
>>> Can you send a link or pointers on how to use it for table content extraction
>>> from a scanned PDF?
>>>
>>> Thanks
>>>
>>> On Friday, 31 May 2019 16:00:36 UTC+5:30, Amulya Kali wrote:
>>>>
>>>> Did you try changing psm?
>>>>
>>>> On Fri, May 31, 2019, 15:57 Manasi sarode <[email protected]> wrote:
>>>>
>>>>> That's fair enough.
>>>>>
>>>>> On Fri, May 31, 2019, 3:55 PM Sayali begampure <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> We are trying to extract text content from normal PDFs and scanned PDFs
>>>>>> (images) using tesseract-ocr.
>>>>>>
>>>>>> We have observed the following issues with PDFs that contain tables: the
>>>>>> table contents are not extracted properly.
>>>>>>
>>>>>> 1. The contents of a few cells (rows/columns) are missing.
>>>>>> Sometimes the heading of the table is missing.
>>>>>> 2. If there are numbers inside the table, not all of them are
>>>>>> extracted.
>>>>>> 3. Some letters are extracted wrongly, e.g. i is misinterpreted
>>>>>> as l.
>>>>>> 4. The column sequence gets interchanged because the table is parsed
>>>>>> horizontally.
>>>>>> 5. Some extra characters are extracted along with the normal
>>>>>> ones.
>>>>>>
>>>>>> We tried image_to_string, image_to_data, and an OpenCV approach.
>>>>>>
>>>>>> Sample code used is:
>>>>>>
>>>>>> from PIL import Image
>>>>>>
>>>>>> import pytesseract
>>>>>> from pytesseract import image_to_string, image_to_boxes
>>>>>>
>>>>>> # Open the scanned table image and run OCR with the default settings.
>>>>>> text = pytesseract.image_to_string(Image.open('table_number.jpg'))
>>>>>> print(text)
>>>>>>
>>>>>>
>>>>>> It should extract rows and columns properly, which it is not
>>>>>> doing as of now. Kindly suggest a function or method to improve the
>>>>>> results for table content extraction using Tesseract.
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to [email protected].
>>>>>> To post to this group, send email to [email protected].
>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/6ddfa19c-8025-40f8-8f17-a393e5b5b2cc%40googlegroups.com
>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/6ddfa19c-8025-40f8-8f17-a393e5b5b2cc%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>> .
>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>