[tesseract-ocr] Question about underlined text

Jurgis Pasukonis Mon, 27 Oct 2014 02:12:13 -0700

Hi all,

I have a task to recognize printed timetables. I started experimenting with 
tesseract-OCR a week ago and I managed to train it to recognize the 
following kind of pictures perfectly:

<https://lh6.googleusercontent.com/-CDrXlacVBDc/VEzOf2zyizI/AAAAAAABTX8/09a0VDg2pKI/s1600/Capture.png>

For training I just used a single image containing all the different digits,
since the font is always the same.

Now, what I'm having problems with is the following - exactly the same
font, just red and underlined.

<https://lh6.googleusercontent.com/-MSTNhlK_8R0/VEzOZn6v3hI/AAAAAAABTX0/l6tFqpjiwnw/s1600/Capture2.PNG>

What happens is that tesseract treats the whole word "06:05", including the
underline, as a *single blob*, and then of course it can't tell which symbol
it is. Oddly, in some rare cases it does succeed (it ignores the underline,
marks each symbol as a separate blob, and recognizes them correctly), and I
can't figure out what that depends on. It seems to depend on the context: if
I change the layout while keeping the text exactly the same, it sometimes
recognizes the text correctly and sometimes doesn't.
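
The single-blob effect can be reproduced outside tesseract with plain
connected-component counting. The sketch below is only an illustration, not
tesseract's actual segmentation code. It assumes the underline touches the
bottoms of the glyphs, uses a toy grid of '#' (ink) and '.' (background) in
place of a binarized image, and the name count_blobs is made up for the
example.

```python
from collections import deque


def count_blobs(rows):
    """Count 8-connected components of '#' pixels in equal-length strings."""
    height, width = len(rows), len(rows[0])
    seen = [[False] * width for _ in range(height)]
    blobs = 0
    for y in range(height):
        for x in range(width):
            if rows[y][x] != "#" or seen[y][x]:
                continue
            # New unvisited ink pixel: flood-fill everything connected to it.
            blobs += 1
            seen[y][x] = True
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and rows[ny][nx] == "#" and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return blobs


# Three toy "glyphs" (vertical bars), without and with a touching underline.
PLAIN = ["#.#.#", "#.#.#", "#.#.#", "....."]
UNDERLINED = ["#.#.#", "#.#.#", "#.#.#", "#####"]

print(count_blobs(PLAIN), count_blobs(UNDERLINED))  # prints: 3 1
```

In this toy model, an underline separated from the glyphs by a blank row is
counted as its own blob instead, so the glyphs stay separate.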

Perhaps some experts here could advise me on how to go about solving this.
Most importantly, how can I *debug* what's going on? My thoughts, and what
I've been trying:

- I tried including the red/underlined example in the training data as a
"different font", but that doesn't help.
- I've tried running with the options "-psm 5" and "-psm 6", and it does
change the behaviour significantly, but neither works as it should. In any
case, this suggests to me that the problem is in the way tesseract splits
the text into blobs, not in the actual symbol recognition, and that the
underline is what confuses it.
- I've tried playing around with the underline recognition settings
(e.g. textord_underline_threshold), but it made absolutely no difference.
- I've tried to dig deeper into the architecture of tesseract - page
segmentation, blob recognition, chopping - because it seems that the
problem is in one of those steps, but I couldn't yet find a good way to
debug it. I tried using this
(https://code.google.com/p/tesseract-ocr/wiki/ViewerDebugging),
but it only tells me that the underlined text ends up as a single blob.
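
One preprocessing workaround that might be worth trying alongside the steps
above is to erase the underline before the image reaches tesseract. This is a
minimal sketch, assuming the image has already been binarized and cropped to
a single word, so that the underline shows up as a pixel row that is almost
entirely ink. The name strip_long_rows and the min_fill threshold of 0.9 are
hypothetical choices for illustration, not tesseract settings.

```python
def strip_long_rows(rows, min_fill=0.9):
    """Blank every pixel row whose fraction of '#' ink is at least min_fill.

    rows is a list of equal-length strings, '#' for ink and '.' for
    background (a stand-in for a binarized, word-cropped image).
    """
    cleaned = []
    for row in rows:
        fill = row.count("#") / len(row)
        # A row that is nearly all ink is assumed to be the underline.
        cleaned.append("." * len(row) if fill >= min_fill else row)
    return cleaned


# Three toy "glyphs" joined by an underline in the last row.
word = ["#.#.#", "#.#.#", "#####"]

for row in strip_long_rows(word):
    print(row)
# prints:
# #.#.#
# #.#.#
# .....
```

Note that this also erases any glyph pixels that lie in the underline row, so
on real scans the threshold and the crop would need tuning.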

Thanks a lot for any suggestions,
Jurgis

