Hello and thank you for the useful suggestion. Would you happen to know the reason why numbers printed within boxes cannot be parsed and are ignored?
I am working on scenarios where numbers within closed boxes are very common, and removing the horizontal lines has various side effects on other pieces of text in my images. Is there a reason for this, and is there perhaps another way to make Tesseract detect the numbers printed within boxes (for example, by passing a parameter)? Thank you in advance for your answer.

On Saturday, August 27, 2016 at 7:34:10 PM UTC+3, Quan Nguyen wrote:
>
> Deskew, grayscale, remove lines, binarize produced the image:
>
> <https://lh3.googleusercontent.com/-k4IAE2W2W7M/V8HAYJhIP5I/AAAAAAAAAqg/C85uxC7JDOMikMfAX_whlGB8UBU2Y1BiACLcB/s1600/Capture4.PNG>
>
> and OCRed text:
>
> l4|0|0l2|1l1>°l0|7l
>
> So if you could remove the vertical lines, it would improve further.
>
> On Saturday, August 27, 2016 at 10:29:52 AM UTC-5, shripad shirsat wrote:
>>
>> I am facing to issue to recognize the numbers from pdf which are printed
>> within the boxes. I have used tesseract in C# for my project. Kindly some
>> one help me out with any clue or hint or a snippet to how to go about to
>> find the solution for the same. Please find the attached pdf
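
For reference, here is a rough sketch of what the "remove lines" step quoted above could look like. It is only an illustration under my own assumptions, not the code Quan Nguyen actually ran: the image is assumed to be already binarized into a list of rows of 0 (background) and 1 (ink), and the function name `remove_long_runs` and the `min_run` threshold are hypothetical. The idea is that a box border is a much longer unbroken run of ink than any stroke of a digit, so runs at or above a chosen length are erased in both directions.

```python
def remove_long_runs(rows, min_run):
    """Return a copy of a binary image (list of rows of 0/1) in which every
    horizontal or vertical run of ink (1) of at least min_run pixels is
    cleared to background (0). The input is not modified."""
    height = len(rows)
    width = len(rows[0]) if rows else 0
    to_clear = set()  # (y, x) positions that belong to a long run

    # Scan each row for long horizontal runs.
    for y in range(height):
        x = 0
        while x < width:
            if rows[y][x] == 1:
                start = x
                while x < width and rows[y][x] == 1:
                    x += 1
                if x - start >= min_run:
                    to_clear.update((y, i) for i in range(start, x))
            else:
                x += 1

    # Scan each column for long vertical runs.
    for x in range(width):
        y = 0
        while y < height:
            if rows[y][x] == 1:
                start = y
                while y < height and rows[y][x] == 1:
                    y += 1
                if y - start >= min_run:
                    to_clear.update((j, x) for j in range(start, y))
            else:
                y += 1

    # Runs are detected on the original image and erased afterwards, so
    # corner pixels shared by a horizontal and a vertical line are handled.
    return [
        [0 if (y, x) in to_clear else rows[y][x] for x in range(width)]
        for y in range(height)
    ]


# A 7x7 box with a three-pixel vertical stroke inside it.
box = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 0, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1],
]

cleaned = remove_long_runs(box, min_run=5)
for row in cleaned:
    print("".join("#" if pixel else "." for pixel in row))
```

With `min_run=5` the border is erased and only the inner stroke remains. The threshold has to be larger than the tallest and widest glyph stroke, otherwise parts of characters such as "1" or "7" would be erased too, and any stroke that touches a border can still be damaged. On real scans this kind of cleanup would normally be done with an imaging library's morphological operations rather than by hand.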

