Re: Extracting text from scanned PDF docs

Jimmy O'Regan Fri, 24 Sep 2010 14:18:22 -0700

On 24 September 2010 18:37, Kevin Carlson <[email protected]> wrote:
> We receive PDF files which appear to contain scanning artifacts which
> severely impact recognition.  Specifically, under magnification you
> can see regularly spaced "notches" and corresponding "bumps",
> especially noticeable with vertical lines.
>
> Currently I'm using Ghostscript to convert the files to TIFF for
> processing. Are there any Python-based alternatives out there?
> Ultimately I would like to do all the cleaning and converting in
> Python, with "Pytesser" to do the OCR.
>


Unlikely. Ghostscript isn't designed to work as a library, so there's
nothing to write a Python wrapper around. PostScript is a whole
programming language -- I find it hard to imagine that someone would
be masochistic enough to write anything more than a toy implementation
in a slow, memory-hungry language like Python.
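
That said, the conversion step can still be driven from a Python script
by running the Ghostscript executable as a subprocess. A minimal sketch,
assuming the executable is on the PATH as "gs" (on Windows it is usually
named differently) and that grayscale TIFF at 300 dpi suits the scans.
The function names, file names, resolution and output device are
illustrative assumptions, not anything from the original setup:

```python
import subprocess


def build_gs_command(pdf_path, out_pattern, dpi=300, device="tiffgray"):
    """Build the Ghostscript argument list for a PDF -> TIFF conversion.

    The defaults (300 dpi, grayscale TIFF) are assumptions; adjust them
    to whatever the OCR step works best with.
    """
    return [
        "gs",                            # assumed executable name
        "-q",                            # suppress startup messages
        "-dNOPAUSE",                     # do not pause between pages
        "-dBATCH",                       # exit when the input is done
        "-dSAFER",                       # restrict file system access
        f"-sDEVICE={device}",            # output device, e.g. tiffgray
        f"-r{dpi}",                      # rendering resolution in dpi
        f"-sOutputFile={out_pattern}",   # %03d expands to the page number
        pdf_path,
    ]


def pdf_to_tiff(pdf_path, out_pattern, dpi=300, device="tiffgray"):
    """Run Ghostscript; raises CalledProcessError if it fails."""
    subprocess.run(
        build_gs_command(pdf_path, out_pattern, dpi, device),
        check=True,
    )
```

Calling pdf_to_tiff("scan.pdf", "page-%03d.tif") would write one TIFF
per page, which can then be cleaned up and passed on to the OCR step.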


-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.
