Re: [tesseract-ocr] Re: Improve OCR accuracy

Tom Morris Tue, 23 Jun 2015 08:59:20 -0700


On Tuesday, June 23, 2015 at 3:56:31 AM UTC-4, Gunasekaran Velu wrote:
>
>
> I have increased the DPI as well, but some words are still missing; see the 
> attached output image.
>
> I have also attached the image properties. The file's compression type is 
> CCITT and its bit depth is 1.
>
> Does the OCR process depend on the compression type and bit depth?
>


CCITT T.4 (i.e. G3 fax) compression is lossless, so the compression type 
has no impact.  The low spatial resolution will have a negative impact.  
Although the OCR algorithm operates on bitonal images, the fact that the 
image is already binarized removes the flexibility to adjust the 
binarization process yourself (although fax machines tend to be pretty 
good at this because a) it's the mode they're designed to operate in and 
b) they have a very controlled scanning environment).
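
To get a rough feel for why resolution matters, it can help to work out how 
many pixels tall a line of type ends up being.  The sketch below is only 
arithmetic (1 point = 1/72 inch); the 10 pt size and the DPI values are 
assumptions picked for illustration, not properties of the attached image, 
and text_height_px is a made-up helper name.

```python
def text_height_px(point_size, dpi):
    """Nominal height in pixels of type set at `point_size` points.

    Uses 1 pt = 1/72 inch. This is the nominal (em) height; the actual
    letter bodies are somewhat smaller than this.
    """
    return point_size * dpi / 72.0


# Hypothetical values: 10 pt type scanned at three different resolutions.
for dpi in (100, 200, 300):
    print(dpi, "dpi ->", round(text_height_px(10, dpi), 1), "px")
```

The fewer pixels each character gets, the less detail there is for the 
recognizer to work with, and the more damage a single dropped or merged 
pixel does.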

Art's suggestion to remove lines is a good one, but if you have only a 
single form to deal with, you could just scan an empty form and then 
subtract that template from your submitted form (after deskewing & 
registering using the corner marks).  Dealing with dropouts where the 
characters intersect preprinted form elements is going to be problematic 
with either approach, doubly so because of the low resolution.
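
The template subtraction idea can be sketched in a few lines.  This is a 
minimal illustration, not a tested pipeline: it assumes both images are 
already deskewed, registered and the same size, and it represents them as 
plain lists of rows with 1 = black and 0 = white.  The function names 
(dilate, subtract_template) and the tiny sample images are hypothetical.

```python
def dilate(img, radius=1):
    """Grow black regions by `radius` pixels in every direction.

    Used to thicken the template so small registration errors are tolerated.
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x]:
                for ny in range(max(0, y - radius), min(h, y + radius + 1)):
                    for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[ny][nx] = 1
    return out


def subtract_template(filled, template, radius=0):
    """Keep ink that is in the filled form but not in the (dilated) template."""
    mask = dilate(template, radius) if radius else template
    return [
        [f & (1 - m) for f, m in zip(filled_row, mask_row)]
        for filled_row, mask_row in zip(filled, mask)
    ]


# Hypothetical 3x6 example: a preprinted horizontal rule on the middle row,
# and a vertical pen stroke in column 2 that crosses it.
template = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
]
filled = [
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
]

result = subtract_template(filled, template)
for row in result:
    print(row)
```

In the output the rule is gone, but so is the pixel where the stroke 
crossed it, which is exactly the dropout problem mentioned above.  Passing 
radius=1 makes the subtraction more tolerant of misregistration, at the 
cost of erasing even more of the stroke; at low resolution that trade-off 
gets worse because each pixel covers more of the character.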

Tom
