I have a PDF with a yellow background, black text, and an e-mail address that is underlined and in blue. I used convert (from the ImageMagick project) to save it as a TIFF file, changing the image to monochrome. The resulting TXT file was dirty, but the underlined text was recognized. Maybe it's not a solution, but it is a workaround.
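As a rough illustration of why the monochrome step can help, here is a minimal Python sketch that thresholds colored pixels to pure black and white. It is only a stand-in for the idea, not what ImageMagick does internally; the function name `to_monochrome`, the threshold of 128, and the sample colors are assumptions made for this example.

```python
def to_monochrome(pixels, threshold=128):
    """Map rows of (R, G, B) tuples to rows of 0 (black) or 255 (white).

    Hypothetical helper for illustration only; not ImageMagick's algorithm.
    """
    out = []
    for row in pixels:
        out_row = []
        for r, g, b in row:
            # Weighted brightness (ITU-R BT.601 luma coefficients).
            luma = 0.299 * r + 0.587 * g + 0.114 * b
            out_row.append(255 if luma >= threshold else 0)
        out.append(out_row)
    return out


# Assumed sample colors matching the PDF described above.
YELLOW = (255, 255, 0)   # background
BLACK = (0, 0, 0)        # body text
BLUE = (0, 0, 255)       # underlined e-mail address

sample = [
    [YELLOW, BLACK, YELLOW],
    [YELLOW, BLUE, BLUE],
]
mono = to_monochrome(sample)
```

With these colors, the yellow background ends up white while both the black text and the blue underlined text end up black, so the OCR engine no longer sees a colored foreground on a colored background.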

On Sunday, October 26, 2014, 6:48:16 (UTC-4), Jurgis Pasukonis wrote:
>
> Hi all,
>
> I have a task to recognize printed timetables. I started experimenting
> with tesseract-OCR a week ago and I managed to train it to recognize the
> following kind of pictures perfectly:
>
> <https://lh6.googleusercontent.com/-CDrXlacVBDc/VEzOf2zyizI/AAAAAAABTX8/09a0VDg2pKI/s1600/Capture.png>
>
> I just used a single image containing all different digits, since the font
> is always the same.
>
> Now, what I'm having problems with is the following - exactly the same
> font, just red and underlined.
>
> <https://lh6.googleusercontent.com/-MSTNhlK_8R0/VEzOZn6v3hI/AAAAAAABTX0/l6tFqpjiwnw/s1600/Capture2.PNG>
>
> What happens is that tesseract recognizes the whole word "06:05" including
> the underline as a *single blob*, and then of course it can't recognize
> what symbol it is. Funny thing, in some rare cases it does succeed (it
> ignores the underline, and then marks each symbol as a blob, and recognizes
> them correctly), and I can't figure out what it depends on. It somehow
> depends on the context - if I change the layout, keeping the text exactly
> the same, it would sometimes recognize it correctly, and sometimes not.
>
> Perhaps some experts here could give some advice on how to go about solving
> this. Most importantly, how to *debug* what's going on? My thoughts, and
> what I've been trying:
>
> - I tried including the red/underlined example in the training data as
>   a "different font", but that doesn't help.
> - I've tried running with the options "psm -5", "psm -6" and it does
>   change the behaviour significantly, but none works as it should. In any
>   case, this suggests to me that the problem is in the way tesseract splits
>   the text into blobs, not with the actual symbol recognition. And the
>   underlined text confuses it.
> - I've tried playing around with the underline recognition settings
>   (e.g. textord_underline_threshold), but it made absolutely no difference.
> - I've tried to dig deeper into the architecture of tesseract - page
>   segmentation, blob recognition, chopping - because it seems that the
>   problem is in one of those steps, but couldn't yet find a good way to
>   debug it. Tried using this
>   (https://code.google.com/p/tesseract-ocr/wiki/ViewerDebugging), but
>   it's only telling me that the underlined text ends up as a single blob.
>
> Thanks a lot for any suggestions,
> Jurgis

