Thanks, I will look into that. BTW, I did find a Python wrapper to do the raster conversion, although using GS may be overkill for what I need it to do:
http://pypi.python.org/pypi/ghostscript/0.3

In any case, it looks like cleaning up the artifacts is going to be the biggest challenge. A repeating pattern of scan lines in the original PDF file is "shifted" maybe one pixel to the left, in 3 or 4 places in each character, and is most noticeable at 300% magnification and above. The rows of rasterized text are spaced differently than the artifacts, so each character could have multiple variations of the shifted-pixel effect.

On 9/24/10, Eugene Reimer <[email protected]> wrote:
> Ghostscript is good for working with PDFs containing text; yours likely
> have images but no text. Using something like pdfimages to extract
> the raster-images from a PDF will give you what you want, without any
> unwanted rescaling.
>
>
> Kevin Carlson wrote, On 2010-09-24 12:37:
>> We receive PDF files which appear to contain scanning artifacts which
>> severely impact recognition. Specifically, under magnification you
>> can see regularly spaced "notches" and corresponding "bumps",
>> especially noticeable with vertical lines.
>>
>> Currently I'm using Ghostscript to convert the files to TIFF for
>> processing, any Python-based alternatives out there? Ultimately would
>> like to do all cleaning and converting using Python, with "Pytesser"
>> to do the OCR.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en.
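
For the shifted scan lines described above, here is a minimal pure-Python sketch of one possible cleanup step. It is only an illustration under some assumptions: the page has already been thresholded to rows of 0/1 pixels, each bad scan line is isolated (the rows directly above and below it are not shifted), and the shift is at most one pixel. The names `shifted`, `mismatch`, and `realign_rows` are made up for this example and do not come from Ghostscript, Pytesser, or any other library.

```python
def shifted(row, shift):
    """Return a copy of row moved right by `shift` pixels (left if negative).

    Pixels that move in from outside the row are filled with 0 (background).
    """
    width = len(row)
    return [row[x - shift] if 0 <= x - shift < width else 0
            for x in range(width)]


def mismatch(a, b):
    """Count the positions where two equally long rows differ."""
    return sum(1 for p, q in zip(a, b) if p != q)


def realign_rows(image, max_shift=1):
    """Undo small horizontal shifts of isolated rows in a 0/1 image.

    A row is moved only if the shift strictly improves its agreement with
    BOTH the row above and the row below. Requiring both neighbours is a
    guard so that genuinely slanted strokes are left alone.
    The first and last rows are never changed.
    """
    fixed = [list(row) for row in image]
    for y in range(1, len(image) - 1):
        above, row, below = image[y - 1], image[y], image[y + 1]
        base_above = mismatch(row, above)
        base_below = mismatch(row, below)
        best_shift = 0
        best_cost = base_above + base_below
        for s in range(-max_shift, max_shift + 1):
            if s == 0:
                continue
            candidate = shifted(row, s)
            cost_above = mismatch(candidate, above)
            cost_below = mismatch(candidate, below)
            if (cost_above < base_above and cost_below < base_below
                    and cost_above + cost_below < best_cost):
                best_shift = s
                best_cost = cost_above + cost_below
        fixed[y] = shifted(row, best_shift)
    return fixed
```

This compares each row against the original (uncorrected) neighbours, so it will not help where two adjacent scan lines are both shifted. Getting real pages into this 0/1 row format (for example from the TIFFs produced by Ghostscript or pdfimages) is a separate step not shown here.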

