Re: Extracting text from PDF

Jan Schenkel Thu, 11 Jan 2007 12:42:37 -0800

--- Richard Gaskin <[EMAIL PROTECTED]> wrote:
> Anyone here have an efficient algo for extracting
> text from PDFs?
> 
> -- 
>   Richard Gaskin
>   Fourth World Media Corporation
>


Well, one would hope I know a thing or two about PDF
files ;-)
There are a couple of things that make this a
challenge: text strings can be stored either in a
Latin encoding or as Unicode in UTF-16 (Big Endian).
You can check for the byte order mark (BOM) at the
start to figure out if a piece of text is Latin or
Unicode.
But PDF files can also be compressed and/or encrypted,
making it nearly impossible to read from Revolution.
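
A minimal sketch of that BOM check, in Python rather than
Revolution. The function name is made up for illustration,
and it assumes you already have the raw, uncompressed and
unencrypted bytes of a text string. Decoding the non-BOM
case as Latin-1 is an approximation, not an exact match
for PDF's own single-byte encoding.

```python
import codecs


def decode_pdf_text_string(raw: bytes) -> str:
    """Decode the raw bytes of a PDF text string (hypothetical helper)."""
    # UTF-16 Big Endian strings start with the BOM bytes 0xFE 0xFF.
    if raw.startswith(codecs.BOM_UTF16_BE):
        # Skip the two BOM bytes and decode the rest as UTF-16BE.
        return raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    # No BOM: treat the bytes as single-byte Latin text.
    # Assumption: Latin-1 is close enough for this sketch.
    return raw.decode("latin-1")


if __name__ == "__main__":
    print(decode_pdf_text_string(b"\xfe\xff\x00H\x00i"))
    print(decode_pdf_text_string(b"Hi"))
```

Both calls in the example should give the same two-letter
string, one from the Unicode form and one from the Latin
form.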

If this is Mac-only, you might be able to AppleScript
another application to get this information -
Preview.app doesn't seem to be scriptable, but perhaps
another application could do the trick.
Some googling turned up the pdftotext command line
tool, which is open-source:
<http://www.glyphandcog.com/textext.html>
There's also a build for Mac OS X, which you can
download at:
<http://www.bluem.net/downloads/pdftotext_en/>
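
For what it's worth, here is a rough sketch of driving
pdftotext from a script, written in Python. It assumes
the pdftotext binary is installed and on the PATH; the
helper names and the file name are made up for the
example. Passing "-" as the output file asks pdftotext to
write the text to stdout, and "-enc UTF-8" asks for UTF-8
output.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path: str) -> list:
    """Build the pdftotext command line (hypothetical helper)."""
    # "-enc UTF-8" selects the output encoding;
    # the trailing "-" sends the extracted text to stdout.
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def extract_text(pdf_path: str) -> str:
    """Run pdftotext on pdf_path and return the extracted text."""
    # Fail early with a clear message if the tool is missing.
    if shutil.which("pdftotext") is None:
        raise FileNotFoundError("pdftotext was not found on the PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True,
        check=True,  # raise CalledProcessError on a non-zero exit
    )
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # "report.pdf" is a placeholder file name.
    print(extract_text("report.pdf"))
```

From Revolution itself, the same command line could
presumably be handed to the shell() function and the
result read back from there.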

Hope this helped,

Jan Schenkel.

Quartam Reports for Revolution
<http://www.quartam.com>

=====
"As we grow older, we grow both wiser and more foolish
at the same time." (La Rochefoucauld)

_______________________________________________
use-revolution mailing list
[email protected]
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution