Hi all,

I am fairly new to Tesseract. I have played around with training new
fonts, loading config files, etc., but I have an issue with the images I
am trying to OCR.
In many cases, there is a dotted horizontal line about 5-10 pixels above
the text. Tesseract mistakenly assumes this is a part of the text and puts
the box around both the character and the line above it.
An example is below:

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _     Example of 
lines above text

1. text to read
2. text to read
3. text to read
4. text to read
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

It reads lines 2 and 3 almost perfectly; however, lines 1 and 4 are
inconsistent and can vary. Most of the time it's gibberish. It makes it hard
to train Tesseract properly as the box files have been produced

I was wondering if there is a parameter or configuration where I could set
the maximum font size or maximum box size to prevent it from including the
lines above the text?
I would do some morphological operations on the lines to get rid of them,
but the lines are about the same thickness as the font strokes and I worry
that it would degrade the text.
I know Tesseract requires a minimum font size of 10 to get acceptable
results, so I was wondering if there is a way to set the max font size.

The font size should be fairly even across the images (obviously camera
distortion may result in an offset of a pixel or two, but it is roughly the
same).

I am aware I could segment the image and pull out the regions in between
the lines. I guess I am just seeing if there is a quick configuration or
parameter I could pass to satisfy this requirement?
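
For reference, the segmentation fallback I have in mind is roughly the
sketch below. It is only an illustration under some assumptions: the image
is already binarized and deskewed, it is given as a list of rows with 1 for
ink and 0 for background, and there is at least one blank row between each
dotted line and the text. The function names and the max_rule_height
threshold are hypothetical, not anything from Tesseract.

```python
def find_row_bands(image):
    """Return (top, bottom) row ranges, bottom exclusive, that contain ink."""
    bands = []
    start = None
    for y, row in enumerate(image):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                      # a new band of inked rows begins
        elif not has_ink and start is not None:
            bands.append((start, y))       # a blank row closes the band
            start = None
    if start is not None:
        bands.append((start, len(image)))  # band runs to the last row
    return bands


def text_bands(image, max_rule_height=3):
    """Drop bands no taller than max_rule_height (assumed to be rules)."""
    return [(top, bottom) for top, bottom in find_row_bands(image)
            if bottom - top > max_rule_height]
```

Each (top, bottom) pair could then be used to crop image[top:bottom] so
that only the text rows are handed to Tesseract. The threshold would need to
sit between the line thickness and the text height, which should be workable
here because the font size is roughly constant.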

Can anyone help me?
Is pre-processing the only way to solve this?

Thanks,
Elan
