On Thursday, October 16, 2014 11:04:58 AM UTC-4, Zunair Fayaz wrote:
>
> I need a best practice for running OCR on documents that carry + signs 
> used to align the documents.
>

Those are typically referred to as "registration marks".
 

> Any known practice?
>

I would have thought they'd be pretty easy to detect using a simple 
histogram of black versus white pixels for each row of pixels near the top 
of the page. Have you tried that? What other approaches have you tried?
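
To make the row-histogram idea concrete, here is a minimal sketch in plain Python. It assumes the scan has already been loaded as a grayscale image stored as a list of rows of 0-255 values (loading is left out). The names row_counts, column_counts and find_runs, the threshold of 128 and the min_dark cutoff are all hypothetical choices for illustration, not part of Tesseract or any other library.

```python
def row_counts(image, threshold=128):
    """Count the dark pixels in every row of a grayscale image."""
    return [sum(px < threshold for px in row) for row in image]


def column_counts(image, top, bottom, threshold=128):
    """Count the dark pixels in every column, looking only at rows top..bottom (inclusive)."""
    width = len(image[0])
    return [
        sum(image[y][x] < threshold for y in range(top, bottom + 1))
        for x in range(width)
    ]


def find_runs(counts, min_dark=1):
    """Group consecutive entries with at least min_dark dark pixels.

    Returns a list of (first, last) index pairs, both inclusive.
    """
    runs = []
    start = None
    for i, count in enumerate(counts):
        if count >= min_dark:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(counts) - 1))
    return runs
```

Running find_runs on row_counts(image) gives the vertical bands that contain ink. Running it again on column_counts for the band that holds the marks gives their horizontal extents, which is one way to get the positions of the + signs in pixels. On a real scan min_dark would probably need to be larger than 1 so that isolated specks do not form bands of their own.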
 

>
> See attached file that I'm trying to OCR and get perfect results.
> Currently, I'm cropping at the top with 18 percent height of the 
> document... and if needed remove the border using accusoft scanfix.
> Then OCR just that, so I get some blank lines and then + + then numbers...
>
> My problem right now is that when all chars are used, 1 becomes i because 
> of that speck in this document. (I can de-speckle if there is no other way 
> to improve)
> If I use only digits.. only 0-9, then I get a weird result, I get an extra 
> 5 just below the speck.
>
> Is there a simple way to find this line and use a constant height to OCR 
> this line, so the speck will not be in that rectangle?
> Is there a way to get the positions of those + signs in pixels?
>

If it's always just a single isolated line, it should be pretty easy to 
detect even without the registration marks.
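
As a rough sketch of that, the following plain Python picks the tallest band of inked rows as the text line and returns a fixed-height crop centred on it. It assumes the same list-of-rows grayscale image as input, and that a speck is shorter than min_band_rows and separated from the line by at least one clean row. The function name find_line_crop and all of its parameters are hypothetical.

```python
def find_line_crop(image, crop_height, threshold=128, min_band_rows=3):
    """Return (top, bottom) of a fixed-height crop around the tallest inked band.

    top is inclusive and bottom is exclusive, so image[top:bottom] is the crop.
    Bands shorter than min_band_rows are treated as specks and skipped.
    Returns None when no band is tall enough.
    """
    height = len(image)
    inked = [any(px < threshold for px in row) for row in image]

    best = None   # (first_row, last_row) of the tallest band seen so far
    start = None  # first row of the band currently being scanned
    for y in range(height + 1):
        row_inked = y < height and inked[y]
        if row_inked and start is None:
            start = y
        elif not row_inked and start is not None:
            rows = y - start
            if rows >= min_band_rows and (
                best is None or rows > best[1] - best[0] + 1
            ):
                best = (start, y - 1)
            start = None

    if best is None:
        return None

    # Centre the constant-height window on the band, clamped to the page.
    centre = (best[0] + best[1]) // 2
    top = max(0, min(centre - crop_height // 2, height - crop_height))
    bottom = min(height, top + crop_height)
    return top, bottom
```

The returned rectangle can then be cropped and passed to OCR on its own. Whether it keeps the speck out depends on how close the speck sits to the line, so crop_height would need tuning against real scans.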

Tom
