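Hi, I'm not sure which psm (page segmentation mode) you used. You can try psm 6 for tables; it treats the page as a single uniform block of text, so rows tend to be read straight across.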
Something like this:

    pytesseract.image_to_string(image, lang='eng', config='--psm 6')

On Fri, May 31, 2019, 16:14 Sayali begampure <[email protected]> wrote:

> Used psm for 2 column documents. It's showing results perfectly.
> Can you send a link or pointers on how to use it for table content
> extraction from scanned pdf?
>
> Thanks
>
> On Friday, 31 May 2019 16:00:36 UTC+5:30, Amulya Kali wrote:
>>
>> Did you try changing psm?
>>
>> On Fri, May 31, 2019, 15:57 Manasi sarode <[email protected]> wrote:
>>
>>> That's fair enough.
>>>
>>> On Fri, May 31, 2019, 3:55 PM Sayali begampure <[email protected]>
>>> wrote:
>>>
>>>> We are trying to extract text content from normal pdf and scanned pdf
>>>> (image) using tesseract-ocr.
>>>>
>>>> We have observed the following issues for pdfs with tables, as table
>>>> contents are not getting extracted properly.
>>>>
>>>> 1. Contents from a few cells (rows/columns) are not visible. Sometimes
>>>>    the heading of the table is missing.
>>>> 2. If there are numbers inside the table, not all of the numbers are
>>>>    getting extracted.
>>>> 3. Some letters are extracted wrongly, e.g. i is misinterpreted as l.
>>>> 4. Column sequence is getting interchanged as it is parsing
>>>>    horizontally.
>>>> 5. Some extra characters are getting extracted along with the normal
>>>>    ones.
>>>>
>>>> Tried image_to_string, image_to_data, opencv approach.
>>>>
>>>> Sample code used is:
>>>>
>>>> from PIL import Image
>>>>
>>>> import pytesseract
>>>> from pytesseract import image_to_string
>>>> from pytesseract import image_to_boxes
>>>>
>>>> image=(pytesseract.image_to_string(Image.open('table_number.jpg')))
>>>> print(image)
>>>>
>>>> It should extract rows and columns properly, which it is not doing
>>>> as of now. Kindly suggest a function or method to enhance the results
>>>> for table content extraction using tesseract.
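For the column order problem (point 4 in the quoted message), one option that may help is to skip the plain string output and rebuild the rows from the word boxes that `image_to_data` returns. The following is only a sketch under some assumptions: `words_to_rows` and `ocr_table` are made-up helper names, not part of pytesseract, and the 10 pixel row tolerance is a guess that would need tuning for a real scan. It groups single words, so a cell containing several words comes back as several entries.

```python
def words_to_rows(data, row_tolerance=10, min_conf=0):
    """Group word boxes into rows, each row sorted left to right.

    `data` is the dict returned by
    pytesseract.image_to_data(..., output_type=Output.DICT).
    `row_tolerance` (pixels) is an assumed value, tune it per document.
    """
    words = []
    for text, left, top, conf in zip(
        data["text"], data["left"], data["top"], data["conf"]
    ):
        text = str(text).strip()
        # Non-word entries (page, block, line) carry conf -1 and empty text.
        if text and float(conf) >= min_conf:
            words.append((top, left, text))

    rows = []
    for top, left, text in sorted(words):
        # A word joins the current row if its top edge is close to the
        # top edge of the first word in that row; otherwise it starts a new row.
        if rows and abs(top - rows[-1]["top"]) <= row_tolerance:
            rows[-1]["cells"].append((left, text))
        else:
            rows.append({"top": top, "cells": [(left, text)]})

    # Within each row, order the words by their left edge.
    return [[text for _, text in sorted(row["cells"])] for row in rows]


def ocr_table(path, psm=6):
    """Run Tesseract on an image file and return its words grouped into rows."""
    # Imported here so words_to_rows can be tried without Tesseract installed.
    from PIL import Image
    import pytesseract
    from pytesseract import Output

    data = pytesseract.image_to_data(
        Image.open(path),
        lang="eng",
        config=f"--psm {psm}",
        output_type=Output.DICT,
    )
    return words_to_rows(data)
```

With the file from the quoted sample this would be called as `rows = ocr_table('table_number.jpg')`, and each entry in `rows` is a list of words in left to right order.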
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAFJOL1X5fxZ4%2BMOAiXXkW-ALRPy4U%2BhXY1qqoDSZJD3Jf0eVYg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

