In a preprocessing step you could run a connected component analysis
(https://en.wikipedia.org/wiki/Connected_component_labeling)
and then filter out all blobs whose bounding box has an aspect ratio larger
than, say, 20 to 1. That should be quite efficient as long as the
lines are not skewed. Since Tesseract already uses Leptonica, you probably
also want to use that library to find the connected components
(see conncomp.c).
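
To make the idea concrete, here is a minimal sketch in plain Python rather than Leptonica. It assumes the page is already binarized into a list of rows of 0/1 values. The function name remove_line_blobs and the max_aspect parameter are made up for illustration and are not part of Tesseract or Leptonica.

```python
from collections import deque


def remove_line_blobs(image, max_aspect=20.0):
    """Return a copy of a binary image (list of rows of 0/1) in which
    connected components with a very elongated bounding box are erased.

    Hypothetical helper for illustration; the 20:1 default is only the
    rough threshold suggested above and will need tuning.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    result = [row[:] for row in image]

    for y0 in range(height):
        for x0 in range(width):
            if not image[y0][x0] or seen[y0][x0]:
                continue
            # Flood-fill one 8-connected component and collect its pixels.
            queue = deque([(y0, x0)])
            seen[y0][x0] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Aspect ratio of the bounding box, long side over short side.
            ys = [p[0] for p in pixels]
            xs = [p[1] for p in pixels]
            box_h = max(ys) - min(ys) + 1
            box_w = max(xs) - min(xs) + 1
            if max(box_w, box_h) / min(box_w, box_h) > max_aspect:
                for y, x in pixels:
                    result[y][x] = 0
    return result
```

Note that this only works when the text does not touch the rule: a character that overlaps the line ends up in the same component and would be erased with it. Skewed lines also have a taller bounding box, which lowers their aspect ratio.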

On Tuesday, June 17, 2014 at 22:06:54 UTC+2, Glen Rubin wrote:
>
> Tesseract is failing to OCR text on my page that sits between 2 horizontal
> lines. For example, it would miss the following text:
>
>
> ___________________________________________________________
>
>        This text is missed by Tesseract
> ____________________________________________________________
>
> Any suggestions on how to overcome this? I was looking at ImageMagick
> scripts to get rid of the lines, but that seems rather involved.
>
