Did you try changing the page segmentation mode (the --psm option)?

On Fri, May 31, 2019, 15:57 Manasi sarode <[email protected]> wrote:
> That's fair enough.
>
> On Fri, May 31, 2019, 3:55 PM Sayali begampure <[email protected]> wrote:
>
>> We are trying to extract text content from normal PDFs and scanned PDFs
>> (images) using tesseract-ocr.
>>
>> For PDFs that contain tables, the table contents are not extracted
>> properly. We have observed the following issues:
>>
>> 1. The contents of a few cells (rows/columns) are missing. Sometimes the
>>    heading of the table is missing.
>> 2. If there are numbers inside the table, not all of them are extracted.
>> 3. Some letters are extracted wrongly, e.g. "i" is misinterpreted as "l".
>> 4. The column sequence gets interchanged because the page is parsed
>>    horizontally.
>> 5. Some extra characters are extracted along with the expected ones.
>>
>> We tried image_to_string, image_to_data, and an OpenCV approach.
>>
>> Sample code used is:
>>
>> from PIL import Image
>>
>> import pytesseract
>> from pytesseract import image_to_string
>> from pytesseract import image_to_boxes
>>
>> image=(pytesseract.image_to_string(Image.open('table_number.jpg')))
>> print(image)
>>
>> It should extract rows and columns properly, which it does not do as of
>> now. Kindly suggest a function or method to improve the results for table
>> content extraction using tesseract.
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/6ddfa19c-8025-40f8-8f17-a393e5b5b2cc%40googlegroups.com
>> <https://groups.google.com/d/msgid/tesseract-ocr/6ddfa19c-8025-40f8-8f17-a393e5b5b2cc%40googlegroups.com?utm_medium=email&utm_source=footer>.
>> For more options, visit https://groups.google.com/d/optout.
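In case it helps, here is a rough sketch of what changing psm could look like with pytesseract. It passes the mode through the config argument and then regroups the words returned by image_to_data into rows by their top coordinate, sorted left to right, which may help with the column-order issue. The helper names (psm_config, rows_from_data, ocr_table), the candidate psm values and the 10-pixel row tolerance are my own assumptions for illustration, not something from this thread or a recommended setting; the right values depend on your scans.

```python
# Sketch only: helper names, candidate modes and the tolerance are assumptions.

# Page segmentation modes that may be worth trying on tables:
# 4 = single column of variable-size text, 6 = single uniform block,
# 11 = sparse text. Tesseract's default is 3 (fully automatic).
CANDIDATE_PSMS = (4, 6, 11)


def psm_config(psm):
    """Build the config string handed to pytesseract, e.g. '--psm 6'."""
    return f"--psm {int(psm)}"


def rows_from_data(data, tolerance=10):
    """Group words from an image_to_data dict into rows, left to right.

    Words whose 'top' is within `tolerance` pixels of the first word of the
    current row are treated as belonging to that row.
    """
    words = [
        (data["top"][i], data["left"][i], text)
        for i, text in enumerate(data["text"])
        if text.strip()  # image_to_data also emits empty non-word entries
    ]
    words.sort()  # top to bottom, then left to right

    rows = []  # each entry: (top of first word in row, [(left, text), ...])
    for top, left, text in words:
        if rows and abs(top - rows[-1][0]) <= tolerance:
            rows[-1][1].append((left, text))
        else:
            rows.append((top, [(left, text)]))
    return [[text for _, text in sorted(cells)] for _, cells in rows]


def ocr_table(path, psm=6):
    """Run Tesseract on an image file and return the words grouped by row."""
    # Imported lazily so the helpers above can be used without Tesseract.
    from PIL import Image
    import pytesseract

    data = pytesseract.image_to_data(
        Image.open(path),
        config=psm_config(psm),
        output_type=pytesseract.Output.DICT,
    )
    return rows_from_data(data)
```

You could loop over CANDIDATE_PSMS, call ocr_table('table_number.jpg', psm) for each, and compare the rows by eye to see which mode suits your tables best.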

