On Thursday, October 16, 2014 11:04:58 AM UTC-4, Zunair Fayaz wrote:
>
> need best practice to OCR on documents with + sign that helps align the
> documents.
>

Those are typically referred to as "registration marks".

> Any known practice?
>

I would have thought they'd be pretty easy to detect using a simple
black/white histogram on the scanned rows of pixels at the top of the
page. Have you tried that? What approaches have you tried?

>
> See attached file that I'm trying to OCR and get perfect results.
> Currently, I'm cropping at the top with 18 percent height of the
> document... and if needed remove the border using Accusoft ScanFix.
> Then OCR just that, so I get some blank lines and then + + then numbers...
>
> My problem right now is that when all chars are used, 1 becomes i because
> of that speck in this document. (I can de-speckle if there is no other way
> to improve)
> If I use only digits.. only 0-9, then I get a weird result, I get an extra
> 5 just below the speck.
>
> Is there a simple way to find this line and use a constant height to OCR
> this line? so the speck will not be in that rectangle?
> Is there a way to get the positions of those + signs in pixels?
>

If it's always just a single isolated line, it should be pretty easy to
detect even without the registration marks.

Tom
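
For illustration, here is a minimal sketch of the row-histogram idea in pure Python. It assumes the cropped top strip has already been thresholded into a grid of 0 (white) and 1 (black) pixels; how that grid is produced from the scan is left out. The names `row_profile`, `column_profile`, `find_runs`, `locate_line` and `make_demo_image` are made up for this example and are not part of Tesseract or any other library. The demo image is synthetic (an isolated speck above one line holding two plus-shaped marks and two solid blocks standing in for digits), not the attached scan.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]  # rows of pixels: 0 = white, 1 = black
Run = Tuple[int, int]    # (start, end) with end exclusive


def row_profile(img: Image) -> List[int]:
    """Count black pixels in every row (horizontal projection)."""
    return [sum(row) for row in img]


def column_profile(img: Image, top: int, bottom: int) -> List[int]:
    """Count black pixels in every column, restricted to rows top..bottom-1."""
    return [sum(img[y][x] for y in range(top, bottom)) for x in range(len(img[0]))]


def find_runs(profile: List[int], min_ink: int = 1, min_len: int = 1) -> List[Run]:
    """Return consecutive index ranges whose counts are >= min_ink.

    Ranges shorter than min_len are dropped; that is what filters out specks.
    Assumes min_ink >= 1 so the trailing sentinel always closes the last run.
    """
    runs: List[Run] = []
    start = None
    for i, count in enumerate(profile + [0]):  # sentinel closes a trailing run
        if count >= min_ink:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs


def locate_line(img: Image, min_height: int = 2) -> Optional[Tuple[Run, Run, Run]]:
    """Find the tallest inked band of rows, then the outermost blobs inside it.

    Returns (row band, leftmost column run, rightmost column run), or None
    when no band of at least min_height rows exists. Treating the outermost
    blobs as the registration marks is an assumption about the page layout.
    """
    bands = find_runs(row_profile(img), min_len=min_height)
    if not bands:
        return None
    top, bottom = max(bands, key=lambda b: b[1] - b[0])
    runs = find_runs(column_profile(img, top, bottom))
    return (top, bottom), runs[0], runs[-1]


def make_demo_image() -> Image:
    """Synthetic 12x20 strip: a speck on row 1, one text line on rows 4-6."""
    img = [[0] * 20 for _ in range(12)]
    img[1][5] = 1                      # isolated speck above the line
    for left in (1, 16):               # two 3x3 plus-shaped marks
        for d in range(3):
            img[5][left + d] = 1       # horizontal bar
            img[4 + d][left + 1] = 1   # vertical bar
    for left in (6, 10):               # two solid 3x3 blocks standing in for digits
        for y in range(4, 7):
            for x in range(left, left + 3):
                img[y][x] = 1
    return img


if __name__ == "__main__":
    print(locate_line(make_demo_image()))
```

The row band gives a rectangle of fixed height that excludes the speck, and the two column runs give the pixel extents of the marks at either end. On a real scan, `min_ink` and `min_height` would probably need tuning against noise and skew.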

