[tesseract-ocr] Letters split in multiple parts

Lorenzo Bolzani Thu, 05 Jul 2018 09:59:36 -0700

Hi,
I have a small problem with some letters that are recognized as multiple
letters.

This is a sample (I can reproduce the problem with this image and eng
"_best"):

output is: 17AE4L4

The 4 is seen as three different letters. Maybe the shape of the 4 is not
so common and this is creating the problem.

This is how tesseract sees the image (data is taken from the bounding boxes
returned by the iterator; a red dot marks the beginning of a symbol):

I'm wondering if there is anything I can do to fix this other than training
a custom model on this font (it is part of an MRZ, btw).

Even a small edit to the image, like cropping, makes the problem appear or
disappear. The output for the other sample is: 17AESL

Are there any parameters, like a minimum box size or a split threshold, or
something I can ask the iterator, that might help? Or is everything part of
the LSTM?

I tried a quick fix based on the box sizes and confidence, but there are
several variations and it is not so easy to get right.
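
For illustration, a box-size and confidence heuristic of the kind described
above might look like the sketch below. It is not the code from the original
attempt. It assumes the symbol boxes and confidences have already been read
from the iterator into plain tuples, and it relies on MRZ text being
fixed-pitch, so a box much narrower than the typical one is suspicious. The
names SymbolBox, flag_fragments and merge_fragment_runs, the 0.6 width ratio,
the confidence cutoff of 80 and the sample numbers are all hypothetical.

```python
from statistics import median
from typing import List, NamedTuple


class SymbolBox(NamedTuple):
    """One symbol as reported by the OCR iterator (hypothetical container)."""
    text: str    # recognized character
    x0: int      # left edge in pixels
    x1: int      # right edge in pixels
    conf: float  # confidence, 0-100


def flag_fragments(boxes: List[SymbolBox],
                   min_width_ratio: float = 0.6,
                   max_conf: float = 80.0) -> List[int]:
    """Return indices of boxes that look like pieces of a split glyph.

    A box is flagged when it is both narrow compared with the median box
    width and low in confidence. Both thresholds are guesses to be tuned.
    """
    if not boxes:
        return []
    typical = median(b.x1 - b.x0 for b in boxes)
    return [i for i, b in enumerate(boxes)
            if (b.x1 - b.x0) < min_width_ratio * typical and b.conf < max_conf]


def merge_fragment_runs(boxes: List[SymbolBox],
                        min_width_ratio: float = 0.6,
                        max_conf: float = 80.0) -> List[SymbolBox]:
    """Merge each run of consecutive flagged boxes into a single box.

    The merged box keeps the concatenated text (so it is easy to spot) and
    the lowest confidence of the run; its span can be cropped and
    recognized again as a single character.
    """
    flagged = set(flag_fragments(boxes, min_width_ratio, max_conf))
    merged: List[SymbolBox] = []
    prev_flagged = False
    for i, box in enumerate(boxes):
        is_fragment = i in flagged
        if is_fragment and prev_flagged:
            last = merged[-1]
            merged[-1] = SymbolBox(last.text + box.text, last.x0, box.x1,
                                   min(last.conf, box.conf))
        else:
            merged.append(box)
        prev_flagged = is_fragment
    return merged


# Made-up coordinates and confidences that mimic the "17AE4L4" output.
sample = [
    SymbolBox("1", 0, 20, 95.0),
    SymbolBox("7", 22, 42, 96.0),
    SymbolBox("A", 44, 64, 94.0),
    SymbolBox("E", 66, 86, 93.0),
    SymbolBox("4", 88, 95, 40.0),
    SymbolBox("L", 96, 101, 35.0),
    SymbolBox("4", 102, 108, 42.0),
]

print([b.text for b in merge_fragment_runs(sample)])
```

The median is used as the reference width so that a few fragments do not
shift it much; it stops being reliable if most of the boxes on a line are
fragments, and two neighbouring glyphs that are both split would be merged
into one box.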

I'm using:

tesseract 4.0.0-beta.3-56-g5fda
leptonica-1.76.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff
4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found SSE

Thanks, bye

Lorenzo

