Hello Dmitri Silaev,
Thanks for your response.
On Thursday, 28 May 2015 20:14:30 UTC+5:30, Dmitri Silaev wrote:
>
> I see you have a publication on document image processing, therefore I
> suppose you're in the know of many techniques.
>
> These images require a bit different approaches. In general, in both cases
> Tess requires some help with layout analysis and table border or frame
> removal.
>
> 4.png
> -------
> - Binarize. I think Otsu would suffice.
> - Remove table borders. Use either CC analysis (filter by CC size, nesting
>   level, etc.), or Hough transform to detect long straight lines (if table
>   borders touch characters).
> - Isolate rotated text at the right. Tess can't recognize such text.
>   Unrotate and OCR separately. Probably also would need upscaling, say by 3x.
> - Isolate regions with dense text and OCR separately one by one. Tess is
>   bad at recognition of sparse text, let alone so different in size.
>
> 82.png
> ---------
> - Binarize. Otsu.
> - Remove the frame. I suppose the easiest is filter CCs by pixel count.
> - Upper word. Isolate and OCR separately. Needs prior blurring (to make
>   characters more "fleshy") and upscaling (to provide more stroke details to
>   Tess). Instead of blurring you may use dilation.
> - Lower word. Isolate and OCR separately. May require erosion (as Tess's
>   stock traineddata might not work well for such bold font).
>
> Locating dense text regions, vertical text and so on can be done by NN
> chain analysis.
>
> It seems you have used all the above mentioned methods as I read in your
> article's abstract. Tesseract is no miracle, you have to do many things
> manually. All above is easier to do by programming but might be done by
> means of ImageMagick/shell scripts also.
>
> Best regards,
> Dmitri Silaev
> www.CustomOCR.com
>
> On Thu, May 28, 2015 at 2:47 PM, supriya Das <[email protected]> wrote:
>
>> Hello Dmitri Siaev,
>> Thanks for your response. Please tell me the complex processing logic.
>> Thanks in advance.
>>
>> On Thursday, 28 May 2015 15:59:22 UTC+5:30, Dmitri Silaev wrote:
>>>
>>> You won't get any improvement just by changing a few params. A more
>>> complex processing is required. Let me know if you're interested in more
>>> details.
>>>
>>> Best regards,
>>> Dmitri Silaev
>>> www.CustomOCR.com
>>>
>>> On Thu, May 28, 2015 at 8:50 AM, supriya Das <[email protected]>
>>> wrote:
>>>
>>>> Hello Everybody,
>>>>
>>>> I am not getting proper output for couple of image. What kind of
>>>> parameter should be set for getting proper output?
>>>> and is it possible to set SetPageSegMode with multiple enum at a
>>>> time? Some problem images are as follow. Thanks in Advance.
>>>>
>>>> In the bellow images i am not getting any kind of output. i also tried
>>>> to change ppi to 300 but not getting result.
>>>>
>>>> <https://lh3.googleusercontent.com/-XlFRIZfDN-k/VWasN7JC1FI/AAAAAAAAAPU/y77aOoveOhk/s1600/4.png>
>>>>
>>>> <https://lh3.googleusercontent.com/-jW3aDb_4lZE/VWargKvFZsI/AAAAAAAAAPM/Y26kenYq93U/s1600/82.png>
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/7431b25c-47ae-46d1-af90-e2ec80a7b7ca%40googlegroups.com.
>>>> For more options, visit https://groups.google.com/d/optout.
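For reference, the first two steps suggested above for 82.png (Otsu binarization, then frame removal by filtering connected components on pixel count) might look roughly like the sketch below. It is only an illustration, not Dmitri's code: it uses plain Python lists instead of an imaging library, assumes dark text on a light background and 8-connectivity, and the names otsu_threshold, binarize, remove_large_components and the max_pixels cutoff are made up for this example. The cutoff would have to be tuned per image.

```python
from collections import deque


def otsu_threshold(gray):
    """Otsu threshold for a 2D list of grey values in 0..255.

    Picks the t that maximises the between-class variance of the two
    classes {v <= t} and {v > t}.
    """
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # no pixels in the dark class yet
        w1 = total - w0
        if w1 == 0:
            break             # every pixel is already in the dark class
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(gray, t):
    """Mark ink as 1 (value <= t) and background as 0."""
    return [[1 if v <= t else 0 for v in row] for row in gray]


def remove_large_components(binary, max_pixels):
    """Erase every 8-connected ink component with more than max_pixels pixels.

    Assumption: the frame is one large component and does not touch the
    characters, so characters stay below the (hypothetical) cutoff.
    """
    h, w = len(binary), len(binary[0])
    out = [row[:] for row in binary]
    seen = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill to collect one component.
            comp = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                comp.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] == 1
                                and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if len(comp) > max_pixels:
                for cy, cx in comp:
                    out[cy][cx] = 0
    return out
```

The cleaned image would then be cropped into the upper and lower word regions and each region passed to Tesseract separately, as described in the reply. If the frame touches the characters, this filter would erase them too, which is the case where the Hough transform mentioned for 4.png is the better fit.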

