--- Richard Gaskin <[EMAIL PROTECTED]> wrote:
> Anyone here have an efficient algo for extracting
> text from PDFs?
>
> --
> Richard Gaskin
> Fourth World Media Corporation
>
Well, one would hope I know a thing or two about PDF files ;-)

There are a couple of things that make this a challenge. Text can be stored either in a single-byte Latin encoding or in Unicode (UTF-16, big-endian); you can check for the BOM (byte order mark, the bytes FE FF) at the start of a piece of text to tell which one it is. But PDF files can also be compressed and/or encrypted, which makes it nearly impossible to read the text directly from Revolution.

If this is Mac-only, you might be able to AppleScript another application to get this information. Preview.app doesn't seem to be scriptable, but perhaps another application could do the trick.

Some googling turned up the pdftotext command-line tool, which is open source:
<http://www.glyphandcog.com/textext.html>

There's also a build for Mac OS X, which you can download at:
<http://www.bluem.net/downloads/pdftotext_en/>

Hope this helped,

Jan Schenkel.

Quartam Reports for Revolution
<http://www.quartam.com>

=====
"As we grow older, we grow both wiser and more foolish at the same time." (La Rochefoucauld)

_______________________________________________
use-revolution mailing list
[email protected]
Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution
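
To illustrate the BOM check mentioned above, here is a minimal Python sketch (not Revolution code). The function name decode_pdf_text_string is made up for this example, and treating BOM-less text as Latin-1 is an assumption: PDF's own single-byte encoding differs from Latin-1 for some characters. It also only applies to strings that have already been pulled out of an uncompressed, unencrypted file.

```python
import codecs


def decode_pdf_text_string(raw: bytes) -> str:
    """Decode a raw text string, using a leading BOM to pick the encoding.

    Hypothetical helper for illustration only.
    """
    if raw.startswith(codecs.BOM_UTF16_BE):
        # The string starts with FE FF: skip the 2-byte BOM and decode
        # the remainder as big-endian UTF-16.
        return raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    # No BOM: assume a single-byte Latin encoding. Latin-1 is used here as
    # an approximation; it is not an exact match for PDF's own encoding.
    return raw.decode("latin-1")
```

Calling it with b"\xfe\xff\x00H\x00i" takes the UTF-16 branch, while a plain byte string such as b"Hello" falls through to the Latin branch.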
