Hi all, I have a task to recognize printed timetables. I started experimenting with tesseract-OCR a week ago and I managed to train it to recognize the following kind of pictures perfectly:
<https://lh6.googleusercontent.com/-CDrXlacVBDc/VEzOf2zyizI/AAAAAAABTX8/09a0VDg2pKI/s1600/Capture.png>

I just used a single training image containing all the different digits, since the font is always the same.

Now, what I'm having problems with is the following: exactly the same font, just red and underlined.

<https://lh6.googleusercontent.com/-MSTNhlK_8R0/VEzOZn6v3hI/AAAAAAABTX0/l6tFqpjiwnw/s1600/Capture2.PNG>

What happens is that tesseract recognizes the whole word "06:05", including the underline, as a *single blob*, and then of course it can't recognize what symbol it is. Funny thing: in some rare cases it does succeed (it ignores the underline, marks each symbol as a separate blob, and recognizes them correctly), and I can't figure out what that depends on. It somehow depends on the context: if I change the layout while keeping the text exactly the same, it sometimes recognizes it correctly and sometimes doesn't.

Perhaps some experts here could give some advice on how to go about solving this. Most importantly, how do I *debug* what's going on?

My thoughts, and what I've been trying:

- I tried including the red/underlined example in the training data as a "different font", but that doesn't help.
- I've tried running with the options "-psm 5" and "-psm 6", and it does change the behaviour significantly, but neither works as it should. In any case, this suggests to me that the problem is in the way tesseract splits the text into blobs, not in the actual symbol recognition, and that the underlined text confuses it.
- I've tried playing around with the underline recognition settings (e.g. textord_underline_threshold), but it made absolutely no difference.
- I've tried to dig deeper into the architecture of tesseract (page segmentation, blob recognition, chopping), because it seems that the problem is in one of those steps, but I couldn't yet find a good way to debug it.
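
For concreteness, the page segmentation and underline threshold experiments described above could be scripted roughly as follows. This is only a sketch under assumptions: it assumes a tesseract build that accepts `-c name=value` on the command line and the old single-dash `-psm` flag; the helper names, the local file name `Capture2.png`, the output base names, and the threshold values are all hypothetical and chosen purely for illustration. The script only prints the command lines and does not run tesseract itself.

```python
import shlex


def build_tesseract_cmd(image, out_base, psm, variables=None, psm_flag="-psm"):
    """Assemble one tesseract command line as a list of arguments.

    Hypothetical helper, not part of tesseract. `variables` maps config
    variable names to values, each passed as `-c name=value`.
    """
    cmd = ["tesseract", image, out_base, psm_flag, str(psm)]
    for name, value in sorted((variables or {}).items()):
        cmd += ["-c", "{}={}".format(name, value)]
    return cmd


def sweep_commands(image, psm_modes, thresholds):
    """Build one command per (psm mode, underline threshold) combination."""
    cmds = []
    for psm in psm_modes:
        for threshold in thresholds:
            # Hypothetical naming scheme so each run writes its own .txt file.
            out_base = "out_psm{}_t{}".format(psm, threshold)
            cmds.append(
                build_tesseract_cmd(
                    image,
                    out_base,
                    psm,
                    {"textord_underline_threshold": threshold},
                )
            )
    return cmds


if __name__ == "__main__":
    # Threshold values are arbitrary sample points, not recommendations.
    for cmd in sweep_commands("Capture2.png", [5, 6], [0.2, 0.5, 0.8]):
        print(" ".join(shlex.quote(part) for part in cmd))
```

Each printed line can be pasted into a shell, and the resulting output text files compared to see which combinations, if any, split the word into separate digits. Newer tesseract releases spell the flag `--psm`, which can be selected by passing `psm_flag="--psm"`.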
I also tried using this (https://code.google.com/p/tesseract-ocr/wiki/ViewerDebugging), but it's only telling me that the underlined text ends up as a single blob.

Thanks a lot for any suggestions,
Jurgis