I have a PDF with a yellow background, black text, and an e-mail address 
that is underlined and blue.
I used convert (from the ImageMagick project) to save it as a TIFF file, 
changing the image to monochrome.
The result was a dirty TXT file, but it did recognize the underlined text. 
Maybe it's not a solution, but it is a workaround.
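
To illustrate why going to monochrome can help with colored text, here is a 
rough sketch of a plain luminance threshold in Python. It is not what 
ImageMagick does internally (its conversion may dither or use a different 
cutoff). The color values, the 128 threshold, and the function names are 
assumptions made up for the example.

```python
# Illustrative sketch only: a plain luminance threshold, not ImageMagick's
# actual algorithm. The colour values and the 128 cutoff are assumptions.

def luminance(rgb):
    """Approximate perceived brightness (Rec. 601 weights), range 0..255."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def to_monochrome(pixels, threshold=128):
    """Map each RGB pixel to 1 (white) or 0 (black)."""
    return [[1 if luminance(p) >= threshold else 0 for p in row]
            for row in pixels]


YELLOW = (255, 255, 0)  # assumed page background
BLACK = (0, 0, 0)       # body text
BLUE = (0, 0, 255)      # assumed colour of the underlined e-mail address

# A tiny made-up "page": two rows of five pixels.
page = [
    [YELLOW, BLACK, YELLOW, BLUE, YELLOW],
    [YELLOW, YELLOW, YELLOW, BLUE, YELLOW],
]

mono = to_monochrome(page)
for row in mono:
    # '#' marks black (ink), '.' marks white (background)
    print("".join("." if v else "#" for v in row))
```

With these assumed colors the yellow background lands on the white side of 
the threshold, while the blue and black pixels both become black, so the 
colored text ends up looking like ordinary text to the OCR step.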



On Sunday, October 26, 2014 at 6:48:16 (UTC-4), Jurgis Pasukonis 
wrote:
>
> Hi all,
>
> I have a task to recognize printed timetables. I started experimenting 
> with tesseract-OCR a week ago and I managed to train it to recognize the 
> following kind of pictures perfectly:
>
>
> <https://lh6.googleusercontent.com/-CDrXlacVBDc/VEzOf2zyizI/AAAAAAABTX8/09a0VDg2pKI/s1600/Capture.png>
>
> I just used a single image containing all different digits, since the font 
> is always the same.
>
> Now, what I'm having problems with is the following - exactly the same 
> font, just red and underlined.
>
>
> <https://lh6.googleusercontent.com/-MSTNhlK_8R0/VEzOZn6v3hI/AAAAAAABTX0/l6tFqpjiwnw/s1600/Capture2.PNG>
>
> What happens is that tesseract recognizes the whole word "06:05" including 
> the underline as a *single blob*, and then of course it can't recognize 
> what symbol it is. Funny thing, in some rare cases it does succeed (it 
> ignores the underline, and then marks each symbol as a blob, and recognizes 
> them correctly), and I can't figure out what it depends on. It somehow 
> depends on the context - if I change the layout, keeping the text exactly 
> the same, it would sometimes recognize it correctly, and sometimes not.
>
> Perhaps some experts here could give some advice on how to go about 
> solving this. Most importantly, how can I *debug* what's going on? My 
> thoughts, and what I've been trying:
>
>
>    - I tried including the red/underlined example in the training data as 
>    a "different font", but that doesn't help.
>    - I've tried running with the options "-psm 5" and "-psm 6", and it 
>    does change the behaviour significantly, but neither works as it 
>    should. In any case, this suggests to me that the problem is in the 
>    way tesseract splits the text into blobs, not in the actual symbol 
>    recognition, and that the underlined text confuses it.
>    - I've tried playing around with the underline recognition settings 
>    (e.g. textord_underline_threshold), but it made absolutely no 
>    difference.
>    - I've tried to dig deeper into the architecture of tesseract (page 
>    segmentation, blob recognition, chopping) because it seems that the 
>    problem is in one of those steps, but I couldn't yet find a good way 
>    to debug it. I tried using this (
>    https://code.google.com/p/tesseract-ocr/wiki/ViewerDebugging), but 
>    it's only telling me that the underlined text ends up as a single blob.
>
> Thanks a lot for any suggestions,
> Jurgis
>
