So, I decided to manually remove the underline from the image and OCR it. The new image is attached.
$ tesseract test2.png stdout -l eng
REQUEST FOR INDEPENDENT NIEDICAL REVIEW

$ tesseract test2.png stdout -l eng use-userdict
REQUEST FOR INDEPENDENT IVIEDICAL REVIEW

Having specified the user dictionary, I would have expected the output to be correct. Could someone please elaborate on why there is a difference?

I have also observed that Tesseract correctly handles underlines in other places, so I am unclear on what is required here. What are the rules for handling text with underlines?

Thanks
- viraf

On Thursday, February 18, 2016 at 1:08:37 AM UTC-5, viraf wrote:
>
> I am facing challenges with the accuracy of the OCR, and was hoping that
> someone could guide me through the process of debugging the problem so that
> I can apply these techniques to other OCR related issues that I face.
> Attached is a snippet of a document that is not correctly OCR'd. The
> output that I get is:
>
> RE U'EST FO DICAL
>
> The following config entries were added to *configs/use-userdict*
> load_system_dawg F
> load_freq_dawg F
> load_punc_dawg F
> load_number_dawg F
> load_unambig_dawg F
> load_bigram_dawg F
> load_fixed_length_dawgs F
> user_words_suffix user-words
> tessedit_write_images T
> tessedit_dump_pageseg_images T
>
> and *eng.user-words* has the following entries
> REQUEST
> FOR
> INDEPENDENT
> MEDICAL
> REVIEW
>
> The following command line was used
>
> tesseract test.png stdout -l eng use-userdict
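
For anyone trying to reproduce this, the setup quoted above can be sketched as a small shell script. This is only a sketch under assumptions: the `tessdata-repro` directory and the `WORKDIR` variable are hypothetical names used for illustration, and I am assuming Tesseract picks up `configs/use-userdict` and `eng.user-words` from its tessdata directory, which may differ between versions and installs. The file contents are copied from the original message.

```shell
#!/bin/sh
# Recreate the config and user-words files from the original message.
# NOTE: "tessdata-repro" is a hypothetical scratch directory; copy the
# results into your real tessdata directory (location varies per install).
set -eu

workdir="${WORKDIR:-./tessdata-repro}"
mkdir -p "$workdir/configs"

# Config file: disable the built-in dictionaries and point at the user words.
cat > "$workdir/configs/use-userdict" <<'EOF'
load_system_dawg F
load_freq_dawg F
load_punc_dawg F
load_number_dawg F
load_unambig_dawg F
load_bigram_dawg F
load_fixed_length_dawgs F
user_words_suffix user-words
tessedit_write_images T
tessedit_dump_pageseg_images T
EOF

# User word list: one word per line.
cat > "$workdir/eng.user-words" <<'EOF'
REQUEST
FOR
INDEPENDENT
MEDICAL
REVIEW
EOF

# Command from the post, to be run once the files are in tessdata:
echo "tesseract test2.png stdout -l eng use-userdict"
```

After copying both files into the tessdata directory, the command printed at the end is the one I ran against the attached image.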

