Thanks a lot! Yes, I am familiar with the ML part. I will implement it and try to get results.
On Thursday, 6 June 2019 19:38:17 UTC+5:30, Krishna Prasad wrote:
>
> Hi Sayali,
> I'm dealing with a similar problem. Detecting table contents
> accurately has never been easy with tesseract. I would suggest building
> your own pipeline for detecting tables and complex layouts. There are many
> public datasets available. I'm trying to use Deeplab V3+ by Google.
> Model : https://github.com/tensorflow/models/tree/master/research/deeplab
> Dataset : https://www.primaresearch.org/datasets/Layout_Analysis
> Deeplab is properly documented and really good at its job. If you are
> familiar with ML, this would be a piece of cake for you.
>
> Hope this helps. 😃
>
> Regards,
> Krishna Prasad A S
>
> On Thu, Jun 6, 2019 at 8:41 AM Sayali begampure <[email protected]> wrote:
>
>> Hello, I tried with both psm 6 and psm 3, but there is still a problem
>> detecting the table contents. Numbers are not visible, and sometimes only
>> the heading is visible.
>> Is there any other change I can make in tesseract, or for image quality
>> improvement?
>>
>> TIA
>>
>> On Friday, 31 May 2019 19:41:50 UTC+5:30, Amulya Kali wrote:
>>>
>>> Hi, I'm not sure about the psm mode you have used. You can try psm 6
>>> for tables.
>>>
>>> Something like this:
>>> pytesseract.image_to_string(image, lang='eng', config='--psm 6')
>>>
>>> On Fri, May 31, 2019, 16:14 Sayali begampure <[email protected]>
>>> wrote:
>>>
>>>> Used psm for 2 column documents. It's showing results perfectly.
>>>> Can you send a link or pointers on how to use it for table content
>>>> extraction from scanned PDFs?
>>>>
>>>> Thanks
>>>>
>>>> On Friday, 31 May 2019 16:00:36 UTC+5:30, Amulya Kali wrote:
>>>>>
>>>>> Did you try changing psm?
>>>>>
>>>>> On Fri, May 31, 2019, 15:57 Manasi sarode <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> That's fair enough.
>>>>>>
>>>>>> On Fri, May 31, 2019, 3:55 PM Sayali begampure <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> We are trying to extract text content from normal PDFs and scanned
>>>>>>> PDFs (images) using tesseract-ocr.
>>>>>>>
>>>>>>> We have observed the following issues for PDFs with tables, as table
>>>>>>> contents are not getting extracted properly.
>>>>>>>
>>>>>>> 1. Contents from a few cells (rows/columns) are not
>>>>>>> visible. Sometimes the heading of the table is missing.
>>>>>>> 2. If numbers are there inside the table, all the numbers are not
>>>>>>> getting extracted.
>>>>>>> 3. Some letters are extracted wrongly, e.g. i is misinterpreted
>>>>>>> as l.
>>>>>>> 4. Column sequence is getting interchanged as it is parsing
>>>>>>> horizontally.
>>>>>>> 5. Some extra characters are getting extracted along with the normal
>>>>>>> ones.
>>>>>>>
>>>>>>> Tried image_to_string, image_to_data, and an OpenCV approach.
>>>>>>>
>>>>>>> Sample code used is:
>>>>>>>
>>>>>>> from PIL import Image
>>>>>>>
>>>>>>> import pytesseract
>>>>>>> from pytesseract import image_to_string
>>>>>>> from pytesseract import image_to_boxes
>>>>>>>
>>>>>>> image=(pytesseract.image_to_string(Image.open('table_number.jpg')))
>>>>>>> print(image)
>>>>>>>
>>>>>>> It should extract rows and columns properly, which it is not
>>>>>>> doing as of now. Kindly suggest a function or method to enhance the
>>>>>>> results for table content extraction using tesseract.
>>>>>>>
>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "tesseract-ocr" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to [email protected].
>>>>>>> To post to this group, send email to [email protected].
>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/6ddfa19c-8025-40f8-8f17-a393e5b5b2cc%40googlegroups.com
>>>>>>> For more options, visit https://groups.google.com/d/optout.
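One possible way to work around the interchanged column sequence mentioned in the thread, without leaving tesseract, is to take the word boxes from image_to_data and rebuild the rows yourself. The sketch below is only an illustration, not something tested on the poster's PDFs. It assumes the table rows are roughly horizontal, that a fixed pixel tolerance (y_tolerance, a made-up default) is enough to tell rows apart, and that '--psm 6' is a reasonable mode for the page. The function names group_words_into_rows and ocr_table_rows are hypothetical helpers, not part of pytesseract.

```python
def group_words_into_rows(data, min_conf=0.0, y_tolerance=10):
    """Group OCR word boxes into rows, ordered top to bottom, left to right.

    `data` is a dict of parallel lists in the shape returned by
    pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT);
    only the 'text', 'conf', 'left', 'top' and 'height' keys are used.
    `y_tolerance` (pixels) is an assumed value and needs tuning per scan.
    """
    words = []
    for text, conf, left, top, height in zip(
        data["text"], data["conf"], data["left"], data["top"], data["height"]
    ):
        # Skip empty entries (layout-only records) and low-confidence words.
        if not text.strip() or float(conf) < min_conf:
            continue
        words.append((top + height / 2.0, left, text))

    words.sort()  # by vertical centre first, then by left edge

    rows = []
    current = []
    row_y = None
    for center_y, left, text in words:
        # A jump larger than the tolerance starts a new row.
        if current and center_y - row_y > y_tolerance:
            rows.append([t for _, t in sorted(current)])
            current = []
        if not current:
            row_y = center_y
        current.append((left, text))
    if current:
        rows.append([t for _, t in sorted(current)])
    return rows


def ocr_table_rows(image_path):
    """Run tesseract on one image and return its words grouped into rows."""
    # Imported here so the grouping helper can be used without Tesseract.
    from PIL import Image
    import pytesseract

    data = pytesseract.image_to_data(
        Image.open(image_path),
        lang="eng",
        config="--psm 6",
        output_type=pytesseract.Output.DICT,
    )
    return group_words_into_rows(data)
```

Usage would be something like rows = ocr_table_rows('table_number.jpg'), followed by printing each row. This only reorders what tesseract already recognised, so missing numbers or misread letters would still need better input images or a separate table detection step such as the one Krishna suggested.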

