Hi all, I am fairly new to Tesseract. I have played around with training new fonts, loading config files, and so on. I have an issue with the images I am trying to OCR. In many cases, there is a dotted horizontal line about 5-10 pixels above the text. Tesseract mistakenly assumes this is a part of the text and draws the box around both the character and the line above it. An example is below:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Example of lines above text
1. text to read
2. text to read
3. text to read
4. text to read
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

It reads lines 2 and 3 almost perfectly; however, lines 1 and 4 are inconsistent and the output varies. Most of the time it's gibberish. It also makes it hard to train Tesseract properly, because the box files that get produced include the dotted line as well.

I was wondering if there is a parameter or configuration where I could set the maximum font size or maximum box size, to stop it from including the lines above the text?

I would do some morphological operations on the lines to get rid of them, but the lines are about the same thickness as the font strokes and I worry that would degrade the text. I know Tesseract needs a font size of at least 10 to give acceptable results, so I was wondering if there is a way to set a maximum font size as well. The font size should be fairly even across the images (camera distortion may cause an offset of a pixel or two, but it is roughly the same).

I am aware I could segment the image and pull out the regions in between the lines. I guess I am just checking whether there is a quick configuration or parameter I could pass to satisfy this requirement.

Can anyone help me? Is pre-processing the only way to solve this?

Thanks,
Elan
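
P.S. In case it helps frame the question, here is a rough sketch of the segmentation fallback I have in mind. It is plain Python working on a binarized image held as a list of rows (1 = ink, 0 = background). The function names and the `min_text_height` threshold are my own placeholders, not Tesseract parameters. It assumes the image is not skewed and that at least one blank pixel row separates each dotted line from the text, which should hold given the 5-10 pixel gap.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of pixels, 1 = ink, 0 = background


def find_ink_bands(image: Image) -> List[Tuple[int, int]]:
    """Return (top, bottom_exclusive) ranges of consecutive rows that contain ink."""
    bands = []
    start = None
    for y, row in enumerate(image):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                      # a new band begins on this row
        elif not has_ink and start is not None:
            bands.append((start, y))       # a blank row closes the band
            start = None
    if start is not None:                  # band runs to the bottom edge
        bands.append((start, len(image)))
    return bands


def text_bands(image: Image, min_text_height: int) -> List[Tuple[int, int]]:
    """Keep only bands tall enough to be text; thin bands are treated as ruling lines.

    min_text_height is a hypothetical threshold chosen by hand for the images.
    """
    return [(top, bottom) for top, bottom in find_ink_bands(image)
            if bottom - top >= min_text_height]


def crop_rows(image: Image, band: Tuple[int, int]) -> Image:
    """Return the rows of the image covered by one band."""
    top, bottom = band
    return image[top:bottom]
```

Each band that survives would then be cropped and OCRed on its own, so the dotted lines never reach Tesseract. The obvious weakness is that any skew merges a dotted line and the neighbouring text into one band, which is why I would prefer a built-in setting if one exists.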

