Re: Extracting text from scanned PDF docs

Kevin Carlson Fri, 24 Sep 2010 20:30:21 -0700

Thanks, I will look into that. BTW, I did find a Python wrapper for
Ghostscript that can do the raster conversion, although using GS may
be overkill for what I need it to do:


http://pypi.python.org/pypi/ghostscript/0.3
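
For the pdfimages route suggested in the quoted reply, something like the
following might be enough to drive it from Python. This is only a sketch:
it assumes poppler's pdfimages binary is on the PATH, and the function
names, the "page" prefix and the output directory are my own placeholders.

```python
import subprocess
from pathlib import Path


def pdfimages_command(pdf_path, out_dir, prefix="page"):
    """Build the argument list for pdfimages (usage: pdfimages PDF-file image-root)."""
    image_root = Path(out_dir) / prefix
    return ["pdfimages", str(pdf_path), str(image_root)]


def extract_images(pdf_path, out_dir, prefix="page"):
    """Run pdfimages and return the files it wrote, sorted by name.

    Assumes the pdfimages executable is installed; raises
    subprocess.CalledProcessError if the tool exits with an error.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdfimages_command(pdf_path, out, prefix), check=True)
    # pdfimages names its output <image-root>-NNN.<ext>
    return sorted(out.glob(prefix + "-*"))
```

Calling extract_images("scan.pdf", "out") should leave the embedded page
images in out/ at their stored resolution, with no resampling step in
between.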

In any case, it looks like cleaning up the artifacts is going to be
the biggest challenge. A repeating pattern of scan lines in the
original PDF file is "shifted" maybe one pixel to the left, at 3 or 4
places in each character, and is most noticeable at 300% magnification
and above.

The rows of rasterized text are spaced differently from the artifact
pattern, so the same character can show several different variations
of the shifted-pixel effect...
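
One way the cleanup might be approached, as a rough sketch only: treat the
page as a bitmap (a list of rows of 0/1 pixels, 1 = ink), and for each row
check whether nudging it a pixel left or right makes it agree better with
the rows directly above and below. The function names are made up for
illustration, and this assumes the shifted scan lines are isolated single
rows; it has not been tried on the real files.

```python
def shift_row(row, s):
    """Shift a row by s pixels (positive = right), padding with background (0)."""
    n = len(row)
    if s > 0:
        return [0] * s + list(row[:n - s])
    if s < 0:
        return list(row[-s:]) + [0] * (-s)
    return list(row)


def mismatch(a, b):
    """Count the pixel positions where two rows differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def fix_shifted_rows(img, max_shift=1):
    """Undo isolated horizontal row shifts of up to max_shift pixels.

    Each interior row is compared against its ORIGINAL neighbours; a shift
    is applied only if it strictly reduces the disagreement with them.
    """
    out = [list(r) for r in img]
    for y in range(1, len(img) - 1):
        above, below = img[y - 1], img[y + 1]
        best_shift = 0
        best_cost = mismatch(img[y], above) + mismatch(img[y], below)
        for s in range(-max_shift, max_shift + 1):
            if s == 0:
                continue
            cand = shift_row(img[y], s)
            cost = mismatch(cand, above) + mismatch(cand, below)
            if cost < best_cost:  # ties keep the row as it is
                best_shift, best_cost = s, cost
        out[y] = shift_row(img[y], best_shift)
    return out
```

On a vertical stroke with one row pushed a pixel to the left, this puts the
row back in line. Strokes that really are slanted or stepped could be
altered by the same rule, so the output would need checking against the OCR
results before trusting it.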


On 9/24/10, Eugene Reimer <[email protected]> wrote:
> Ghostscript is good for working with PDFs containing text; yours likely
> have images but no text.  Using something like pdfimages to extract
> the raster-images from a PDF will give you what you want, without any
> unwanted rescaling.
>
>
> Kevin Carlson wrote, On 2010-09-24 12:37:
>> We receive PDF files which appear to contain scanning artifacts which
>> severely impact recognition. Specifically, under magnification you
>> can see regularly spaced "notches" and corresponding "bumps",
>> especially noticeable with vertical lines.
>>
>> Currently I'm using Ghostscript to convert the files to TIFF for
>> processing, any Python-based alternatives out there? Ultimately would
>> like to do all cleaning and converting using Python, with "Pytesser"
>> to do the OCR.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en.
>
>
