[tesseract-ocr] Configure for single character recognition


Hello,

I am trying to recognize single characters written with the Gotham Bold
font. I have trained Tesseract by following Michael Jay Lissner's guide
"Adding New Fonts to Tesseract 3 OCR Engine"
<http://michaeljaylissner.com/posts/2012/02/11/adding-new-fonts-to-tesseract-3-ocr-engine/>.

I trained it on the text of a newspaper article, after removing all
characters that I am not interested in and converting the rest to upper
case, since I am not going to match lower case characters.

I run Tesseract with my custom language and with the page segmentation mode
set to 10, which treats the image as a single character.
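A minimal sketch of such an invocation, built as an argument list in
Python. The language name "gotham", the file names, and the helper name
build_tesseract_command are hypothetical. The sketch assumes the Tesseract 3
command line, where the option is spelled "-psm" (later versions use
"--psm"), and a build that accepts "-c" for setting the
tessedit_char_whitelist variable.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang, psm=10,
                            whitelist=None, legacy_flags=True):
    """Build the argument list for a single-character Tesseract run."""
    # Tesseract 3.x spells the option "-psm"; 4.x and later use "--psm".
    psm_flag = "-psm" if legacy_flags else "--psm"
    cmd = ["tesseract", image_path, output_base, "-l", lang,
           psm_flag, str(psm)]
    if whitelist:
        # Restrict the characters Tesseract is allowed to output.
        cmd += ["-c", "tessedit_char_whitelist=" + whitelist]
    return cmd


def run_tesseract(cmd):
    """Run the command; Tesseract writes its result to <output_base>.txt."""
    return subprocess.run(cmd, check=True)
```

For example, build_tesseract_command("0-4.png", "0-4", "gotham",
whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZ") produces the arguments for an
upper-case-only run that can then be passed to run_tesseract.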

While most of the matches are fine, I am getting a lot of incorrect
matches. For example, the image of the letter "B" below is matched as an
"X". I cannot figure out why this is.

<https://lh4.googleusercontent.com/-AOLPnD7nXJY/VGYC58I-roI/AAAAAAAAASQ/kTJq9eSNMy4/s1600/0-4.png>

The "B" below, which looks the same as the one above but is in fact a
different image, is not matched to anything. Tesseract does not recognize
anything in the image.

<https://lh4.googleusercontent.com/-b0kMaAzcN-Y/VGYFI6NOzjI/AAAAAAAAASk/c9EfpR8CjWI/s1600/1-7.png.png>

The "C" below is not matched to anything either. Tesseract cannot figure
out what is in the image.

<https://lh5.googleusercontent.com/-ZKl8jE2Orto/VGYEs2xzGlI/AAAAAAAAASc/2xTXomhIkWI/s1600/0-8.png>

The same goes for the "U" below.

<https://lh5.googleusercontent.com/-fciIyBe9bDw/VGYFRh3YBNI/AAAAAAAAASs/29WZQUHqPmE/s1600/1-8.png>

Tesseract also reads the "E" below as a "K".

<https://lh4.googleusercontent.com/-ZZFkr77drgM/VGYFcDydDXI/AAAAAAAAAS0/RQ1UO8U3rOY/s1600/1-9.png>

The above errors are just examples. There are others, but I think these
examples illustrate the quirks I'm currently dealing with.

I manually slice the image below into images of single characters like the
ones above. Maybe a completely different approach is better?

<https://lh4.googleusercontent.com/-TfwZnXosqB0/VGYFjLppJ9I/AAAAAAAAAS8/Oun76IHLwks/s1600/prepared_image.png>
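For comparison with manual slicing, here is a minimal sketch of one
possible automatic approach: finding runs of columns that contain dark
pixels. It assumes dark characters on a light background, characters that
do not touch each other, and an image already loaded as a list of rows of
0-255 grayscale values. The function name split_columns and the threshold
of 128 are illustrative choices.

```python
def split_columns(pixels, ink_threshold=128):
    """Return (start, end) column ranges, end exclusive, that contain ink.

    pixels is a list of rows of 0-255 grayscale values, dark on light.
    """
    if not pixels:
        return []
    width = len(pixels[0])
    # A column has ink if any pixel in it is darker than the threshold.
    has_ink = [any(row[x] < ink_threshold for row in pixels)
               for x in range(width)]
    ranges = []
    start = None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                  # a character begins here
        elif not ink and start is not None:
            ranges.append((start, x))  # the character ended before x
            start = None
    if start is not None:
        ranges.append((start, width))  # character runs to the right edge
    return ranges
```

Each returned range can then be cropped out and passed to Tesseract as a
single-character image.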
Does anyone know how I can improve the recognition of single characters?
I'd like the above examples to match correctly, but in general the
recognition is just not good enough, and I'd like to know if there is any
way to improve it.
Should I train differently? Should I pass different configuration options,
or should I preprocess the images before trying to recognize the
characters?
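As an illustration of what such preprocessing could look like, here is a
minimal sketch of two commonly suggested steps: binarizing the image and
adding a white border around a tightly cropped character. It has not been
tested against the images above. It assumes the image is a list of rows of
0-255 grayscale values; the function names, the threshold of 128, and the
10-pixel border are illustrative choices.

```python
def binarize(pixels, threshold=128):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [[0 if value < threshold else 255 for value in row]
            for row in pixels]


def add_border(pixels, border=10, fill=255):
    """Surround the image with `border` pixels of `fill` on every side."""
    width = len(pixels[0]) if pixels else 0
    blank_row = [fill] * (width + 2 * border)
    # Blank rows on top.
    padded = [list(blank_row) for _ in range(border)]
    # Original rows with blank margins on the left and right.
    for row in pixels:
        padded.append([fill] * border + list(row) + [fill] * border)
    # Blank rows at the bottom.
    padded.extend(list(blank_row) for _ in range(border))
    return padded
```

Applying add_border(binarize(pixels)) to each sliced character before
recognition gives Tesseract some whitespace around the glyph.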

Best regards,
Simon B. Støvring
