I used psm for two-column documents, and it shows the results perfectly. Could you send a link or some pointers on how to use it for table content extraction from a scanned PDF?
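For reference, a minimal sketch of one way to try this from pytesseract, assuming the page segmentation mode is passed through the `config` string and the word boxes from `image_to_data` are regrouped by position. The helper names `words_to_rows` and `ocr_table`, the `row_tolerance` value, and the choice of `--psm 6` are assumptions for illustration, not a confirmed fix for tables:

```python
def words_to_rows(data, row_tolerance=10):
    """Group OCR word boxes into rows, each row sorted left to right.

    `data` is a dict shaped like the result of
    pytesseract.image_to_data(..., output_type=Output.DICT).
    `row_tolerance` (pixels) is a hypothetical tuning knob: a word joins
    the current row if its top edge is within this distance of the
    row's first word.
    """
    words = []
    for text, left, top, conf in zip(
        data["text"], data["left"], data["top"], data["conf"]
    ):
        # Tesseract reports conf -1 for non-word entries (blocks, lines).
        if text.strip() and float(conf) >= 0:
            words.append((top, left, text.strip()))
    words.sort()  # top to bottom, then left to right

    rows = []
    for top, left, text in words:
        if rows and abs(top - rows[-1]["top"]) <= row_tolerance:
            rows[-1]["cells"].append((left, text))
        else:
            rows.append({"top": top, "cells": [(left, text)]})
    # Sort each row by x position so the column order is preserved.
    return [[text for _, text in sorted(row["cells"])] for row in rows]


def ocr_table(image_path, psm=6):
    """Run Tesseract with the given psm and return words grouped by row."""
    # Imported here so words_to_rows can be used without Tesseract installed.
    from PIL import Image
    import pytesseract

    data = pytesseract.image_to_data(
        Image.open(image_path),
        config=f"--psm {psm}",  # 6: assume a single uniform block of text
        output_type=pytesseract.Output.DICT,
    )
    return words_to_rows(data)
```

Calling `ocr_table('table_number.jpg')` would return one list of words per detected row. Note that this splits on words, not on table cells, so a cell containing several words comes back as several entries; merging them into columns would need an extra step based on the gaps between `left` values.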
Thanks

On Friday, 31 May 2019 16:00:36 UTC+5:30, Amulya Kali wrote:
>
> Did you try changing psm?
>
> On Fri, May 31, 2019, 15:57 Manasi sarode <[email protected]> wrote:
>
>> That's fair enough.
>>
>> On Fri, May 31, 2019, 3:55 PM Sayali begampure <[email protected]> wrote:
>>
>>> We are trying to extract text content from normal pdf and scanned pdf
>>> (image) using tesseract-ocr.
>>>
>>> We have observed following issues for the pdf's with table as table
>>> Contents are not getting extracted properly.
>>>
>>> 1. Contents from few cells (rows/columns) are not visible. Sometimes
>>>    heading of the table is missing.
>>> 2. If numbers are there inside table, all the numbers are not
>>>    getting extracted.
>>> 3. Some letters are extracted wrongly, e.g. i is misinterpreted as l.
>>> 4. Column sequence is getting interchanged as it is parsing
>>>    horizontally.
>>> 5. Some extra characters are getting extracted along with normal one.
>>>
>>> Tried image_to_string, image_to_data, opencv approach
>>>
>>> Sample code used is:
>>>
>>> from PIL import Image
>>>
>>> import pytesseract
>>> from pytesseract import image_to_string
>>> from pytesseract import image_to_boxes
>>>
>>> image=(pytesseract.image_to_string(Image.open('table_number.jpg')))
>>> print(image)
>>>
>>>
>>> It should extract rows and columns properly which it is not extracting
>>> as of now. Kindly suggest function or method to enhance the results for
>>> table content extraction using tesseract.
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/6ddfa19c-8025-40f8-8f17-a393e5b5b2cc%40googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.

