[tesseract-ocr] Preprocessing - detailed cropping

Mirko P Wed, 08 Apr 2015 07:59:24 -0700


Hello,

I'm trying to recognize the machine-readable part of a passport (see the
last line in this picture: http://s.hswstatic.com/gif/passport-11.jpg ).

I'm using Tesseract on Android (tess-two) and take the picture with a 5
Mpix mobile camera. Unfortunately, the accuracy is not satisfactory.
To improve recognition, I have tried cropping the picture and
retraining Tesseract for the font used in passports (OCR-B). Both raise
accuracy, but still not to an acceptable level.
Here is a typical cropped picture I hand to Tesseract to perform OCR:

<https://lh3.googleusercontent.com/-DjwyoGe0dYQ/VSU_mcxzkMI/AAAAAAAAAAM/3HpmT04hzBM/s1600/croppic6.gif>

The binarized picture created by Tesseract for the actual recognition looks
like this:

<https://lh3.googleusercontent.com/-DwGxUTaDcK0/VSU_5W9pnpI/AAAAAAAAAAU/ffmiLw6yuLo/s1600/tessinput6.tif>

This is what Tesseract recognizes:

* 09 1 M 1 907 1 8 F8 F857<4 < W<B<O <UME QVWBBENO W JMGHJ <RBP6W9BQR ED*

I figured out that the thin line at the bottom is extremely distracting to
Tesseract. If I cut off the line manually and perform OCR, the results are
perfectly fine and all characters are recognized.
My question is: how can I find and remove that line automatically when it
is present in the cropped picture? This has to be done on an Android phone.
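
To make the question concrete, here is a minimal sketch of one possible
approach, not a tested solution: a horizontal projection profile over the
binarized image. It assumes the image is available as rows of 0/1 values
(1 = dark pixel) and that the line is nearly horizontal, so that its rows are
almost completely filled while text rows are not. The function names and the
thresholds (min_fill, bottom_fraction) are hypothetical and would need tuning;
a skewed photo would probably require deskewing first. It is written in plain
Python for brevity, but the loops should port directly to Java on Android.

```python
def find_rule_rows(image, min_fill=0.8):
    """Return indices of rows whose fraction of dark pixels is >= min_fill.

    image is a list of equal-length rows holding 0 (background) or 1 (ink).
    min_fill is an assumed threshold, not a value taken from Tesseract.
    """
    if not image or not image[0]:
        return []
    width = len(image[0])
    return [y for y, row in enumerate(image) if sum(row) / width >= min_fill]


def crop_above_bottom_rule(image, min_fill=0.8, bottom_fraction=0.5):
    """Drop the first line-like row in the lower part and everything below it.

    Only rows in the lower part of the image (below bottom_fraction of the
    height) are considered, so a dense row near the top is left alone.
    If no such row is found, the image is returned unchanged.
    """
    height = len(image)
    rule_rows = [y for y in find_rule_rows(image, min_fill)
                 if y >= height * bottom_fraction]
    if not rule_rows:
        return image
    return image[:min(rule_rows)]


if __name__ == "__main__":
    # Tiny synthetic example: two sparse "text" rows and a solid line below.
    sample = [
        [0, 0, 0, 0, 0, 0],
        [1, 0, 1, 0, 1, 0],
        [1, 0, 1, 0, 1, 0],
        [0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1],
        [0, 0, 0, 0, 0, 0],
    ]
    for row in crop_above_bottom_rule(sample):
        print("".join(str(p) for p in row))
```

The cropped result would then be handed to Tesseract in place of the original
picture.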

Any help will be appreciated!
Mirko

