On Tuesday, June 23, 2015 at 3:56:31 AM UTC-4, Gunasekaran Velu wrote:
> I have increased the DPI also, but some words are still missing (attached output image).
>
> I have attached the image properties. The file compression type is CCITT and the bit depth is 1.
>
> Does the OCR process depend on the compression type and bit depth?
CCITT T.4 (i.e. G3 fax) compression is lossless, so the compression type has no impact on OCR. The low spatial resolution will have a negative impact, though. Although the OCR algorithm operates on bitonal images, the fact that the image is already binarized removes any flexibility to adjust the binarization process. That said, fax machines tend to be pretty good at binarization because (a) it's the mode they're designed to operate in and (b) they have a very controlled scanning environment.

Art's suggestion to remove lines is a good one, but if you have only a single form to deal with, you could just scan an empty form and then subtract that template from your submitted form (after deskewing and registering using the corner marks). Dealing with dropouts where the characters intersect preprinted form elements is going to be problematic with either approach, doubly so because of the low resolution.

Tom
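To make the template-subtraction idea concrete, here is a minimal sketch in plain Python. It is not Tesseract or Leptonica code. The function name `subtract_template` and the `tolerance` parameter are made up for illustration. It assumes both images are already deskewed, registered, the same size, and held as lists of rows with 1 for ink and 0 for background.

```python
def subtract_template(form, template, tolerance=1):
    """Clear form pixels that lie on or near ink in the blank template.

    form, template: lists of rows, 1 = black (ink), 0 = white.
    tolerance: hypothetical knob; a form pixel is cleared when any template
    pixel within this many pixels (in x and y) is ink, which absorbs small
    registration errors.
    """
    height = len(form)
    width = len(form[0]) if height else 0
    if len(template) != height or any(len(row) != width for row in template):
        raise ValueError("form and template must have the same dimensions")

    def template_ink_near(y, x):
        # Scan the (2 * tolerance + 1)-pixel square around (y, x), clipped
        # to the image bounds.
        for yy in range(max(0, y - tolerance), min(height, y + tolerance + 1)):
            for xx in range(max(0, x - tolerance), min(width, x + tolerance + 1)):
                if template[yy][xx]:
                    return True
        return False

    return [
        [1 if form[y][x] and not template_ink_near(y, x) else 0
         for x in range(width)]
        for y in range(height)
    ]
```

Note that the sketch also shows the dropout problem: wherever a handwritten or typed stroke crosses a preprinted line, the stroke pixels are cleared along with the line, and a larger `tolerance` widens that gap. At low resolution the gap can be a large fraction of the stroke.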

