Hi all,

I have a task to recognize printed timetables. I started experimenting with 
tesseract-OCR a week ago and I managed to train it to recognize the 
following kind of pictures perfectly:

<https://lh6.googleusercontent.com/-CDrXlacVBDc/VEzOf2zyizI/AAAAAAABTX8/09a0VDg2pKI/s1600/Capture.png>

I just used a single training image containing all the different digits, 
since the font is always the same.

Now, what I'm having problems with is the following - exactly the same 
font, just red and underlined.

<https://lh6.googleusercontent.com/-MSTNhlK_8R0/VEzOZn6v3hI/AAAAAAABTX0/l6tFqpjiwnw/s1600/Capture2.PNG>

What happens is that tesseract recognizes the whole word "06:05" including 
the underline as a *single blob*, and then of course it can't recognize 
what symbol it is. Funny thing, in some rare cases it does succeed (it 
ignores the underline, and then marks each symbol as a blob, and recognizes 
them correctly), and I can't figure out what it depends on. It somehow 
depends on the context - if I change the layout, keeping the text exactly 
the same, it would sometimes recognize it correctly, and sometimes not.
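
To make the "single blob" symptom concrete, here is a toy sketch of 
connected-component counting on a tiny hand-made bitmap. It is only an 
illustration of the general idea, not Tesseract's actual segmentation code; 
the function name count_blobs and the three bitmaps are made up for this 
example. It shows that an underline touching the glyphs fuses them into one 
component, while a one-pixel gap keeps them separate. Whether that is what 
happens with the real images is an assumption that would need checking.

```python
from collections import deque


def count_blobs(rows):
    """Count 4-connected components of '#' pixels in equal-length strings."""
    height = len(rows)
    width = len(rows[0]) if rows else 0
    seen = [[False] * width for _ in range(height)]
    blobs = 0
    for y in range(height):
        for x in range(width):
            if rows[y][x] != "#" or seen[y][x]:
                continue
            # Found an unvisited foreground pixel: flood-fill its component.
            blobs += 1
            seen[y][x] = True
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and rows[ny][nx] == "#" and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return blobs


# Three glyph-like shapes separated by blank columns (hypothetical data).
PLAIN = [
    "##.##.##",
    "#..#...#",
    "##.##.##",
]
# Same shapes with an underline directly below, touching every glyph.
TOUCHING = PLAIN + ["########"]
# Same shapes with one blank row between the glyphs and the underline.
GAPPED = PLAIN + ["........", "########"]

if __name__ == "__main__":
    for name, bitmap in (("plain", PLAIN), ("touching", TOUCHING), ("gapped", GAPPED)):
        print(name, count_blobs(bitmap))
```

If the real underline touches the digits after binarization, a result like 
the "touching" case would be expected, and small layout changes could 
plausibly flip it to the "gapped" case.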

Perhaps some experts here could give some advice on how to go about solving 
this. Most importantly, how can I *debug* what's going on? My thoughts, and 
what I've been trying:


   - I tried including the red/underlined example in the training data as a 
   "different font", but that doesn't help.
   - I've tried running with the options "-psm 5" and "-psm 6", and it does 
   change the behaviour significantly, but neither works as it should. In any 
   case, this suggests to me that the problem is in the way tesseract splits 
   the text into blobs, not in the actual symbol recognition, and that the 
   underlined text confuses it.
   - I've tried playing around with the underline recognition settings 
   (e.g. textord_underline_threshold), but it made absolutely no difference.
   - I've tried to dig deeper into the architecture of tesseract (page 
   segmentation, blob recognition, chopping) because it seems that the 
   problem is in one of those steps, but I couldn't yet find a good way to 
   debug it. I tried the viewer debugging described at 
   https://code.google.com/p/tesseract-ocr/wiki/ViewerDebugging, 
   but it only tells me that the underlined text ends up as a single blob.
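
For anyone wanting to repeat the experiments in the list above, a minimal 
sketch of how the combinations could be generated systematically. It assumes 
the Tesseract 3.x command-line syntax ("-psm N" and "-c name=value"); the 
helper name build_tesseract_cmd, the output base names and the threshold 
values are arbitrary choices for illustration, not recommended settings.

```python
def build_tesseract_cmd(image, output_base, psm=None, variables=None):
    """Build an argument list for the tesseract command-line tool.

    Assumes the 3.x syntax: tesseract image outputbase [-psm N] [-c var=value].
    """
    cmd = ["tesseract", image, output_base]
    if psm is not None:
        cmd += ["-psm", str(psm)]
    # Sort the variables so the generated command is deterministic.
    for name, value in sorted((variables or {}).items()):
        cmd += ["-c", "{}={}".format(name, value)]
    return cmd


if __name__ == "__main__":
    # Print one command per combination; the threshold values are arbitrary.
    for psm in (5, 6):
        for threshold in (0.2, 0.5, 0.8):
            out = "out_psm{}_thr{}".format(psm, threshold)
            cmd = build_tesseract_cmd(
                "Capture2.PNG", out, psm=psm,
                variables={"textord_underline_threshold": threshold},
            )
            print(" ".join(cmd))
```

Each printed line can be pasted into a shell, or the list passed to 
subprocess.call, so that the outputs can be compared side by side.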

Thanks a lot for any suggestions,
Jurgis
