We receive PDF files that appear to contain scanning artifacts, and these severely hurt OCR recognition. Specifically, under magnification you can see regularly spaced "notches" and corresponding "bumps", which are especially noticeable on vertical lines.
Currently I'm using Ghostscript to convert the files to TIFF for processing. Are there any Python-based alternatives out there? Ultimately I would like to do all the cleaning and converting in Python, with "Pytesser" doing the OCR. Any suggestions on cleaning up the files to improve recognition rates? I'd also like to look into "training" the OCR on the notched characters, but the links I've found on doing so seem incomplete.

Any recommendations would be appreciated!

Thanks!
Kevin
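P.S. For context, here is a minimal sketch of how the Ghostscript conversion step could be driven from Python with `subprocess`. The `tiffg4` device, the 300 dpi resolution, the output naming pattern, and the helper names `ghostscript_tiff_command` and `pdf_to_tiff` are all assumptions for illustration, not the exact settings in use.

```python
import subprocess
from pathlib import Path


def ghostscript_tiff_command(pdf_path, out_pattern, dpi=300, device="tiffg4"):
    """Build the Ghostscript argument list for a PDF -> TIFF conversion.

    Assumes the executable is called "gs" (on Windows it is usually
    "gswin64c" or "gswin32c" instead).
    """
    return [
        "gs",
        "-dNOPAUSE", "-dBATCH", "-dSAFER", "-q",  # non-interactive, quiet run
        f"-sDEVICE={device}",                     # assumed: bilevel G4 TIFF
        f"-r{dpi}",                               # assumed: rendering resolution
        f"-sOutputFile={out_pattern}",            # %03d expands to the page number
        str(pdf_path),
    ]


def pdf_to_tiff(pdf_path, out_dir, dpi=300):
    """Render each page of pdf_path to its own TIFF file inside out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(pdf_path).stem
    out_pattern = out_dir / f"{stem}-%03d.tif"
    cmd = ghostscript_tiff_command(pdf_path, out_pattern, dpi=dpi)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return sorted(out_dir.glob(f"{stem}-*.tif"))
```

Something like `pdf_to_tiff("scan.pdf", "tiff_pages")` would then return the list of per-page TIFF paths, ready to be cleaned up and passed to the OCR step.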

