On 2007-03-20, Alex Mandel wrote:
> Anyone know a way to tap into a pdf programmatically to tell if it
> contains text vs was scanned as an image?
>
> I basically just want to sort a directory with many thousands of pdfs.
> I figured there must be something in the header or in the file info that
> either says that it's an image or it has text, or to be more complicated
> gives you a quick percentage of document is text, which I could use to
> set a sort threshold.
>
> Alternately if it can be done more easily on a ps file there's no reason
> why I can't do a pdf2ps on it and then decide how to sort.
> It's really a one time deal so I'll take the overhead on that operation.

What about converting the PDF files to PostScript and then running ps2ascii?

--
Henry House
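
A rough sketch of how that suggestion might be scripted, in Python. It assumes the `pdf2ps` and `ps2ascii` wrappers from Ghostscript are on the PATH. The 50-character threshold and the function names are made up for illustration and are not from the thread. A scanned page may still yield a few stray characters, so the cutoff would need tuning against real files.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical cutoff: fewer non-whitespace characters than this and
# the file is treated as a scanned image. Tune it for your documents.
MIN_TEXT_CHARS = 50


def count_text_chars(text: str) -> int:
    """Count the non-whitespace characters in the extracted text."""
    return sum(1 for ch in text if not ch.isspace())


def classify(text: str, min_chars: int = MIN_TEXT_CHARS) -> str:
    """Return "text" or "image" depending on how much text came out."""
    return "text" if count_text_chars(text) >= min_chars else "image"


def extract_text(pdf_path: Path) -> str:
    """Convert the PDF to PostScript, then run ps2ascii on the result."""
    with tempfile.TemporaryDirectory() as tmp:
        ps_path = Path(tmp) / "out.ps"
        subprocess.run(["pdf2ps", str(pdf_path), str(ps_path)], check=True)
        # With only an input file given, ps2ascii writes to stdout.
        result = subprocess.run(
            ["ps2ascii", str(ps_path)],
            check=True, capture_output=True, text=True, errors="replace",
        )
    return result.stdout


if __name__ == "__main__":
    # Usage: python sort_pdfs.py /path/to/pdf/directory
    for pdf in sorted(Path(sys.argv[1]).glob("*.pdf")):
        try:
            label = classify(extract_text(pdf))
        except subprocess.CalledProcessError:
            label = "error"  # Ghostscript could not process this file
        print(f"{label}\t{pdf}")
```

The script only prints a label per file. Moving files into separate directories is left out so that a bad threshold cannot misplace anything on the first run.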
_______________________________________________
vox-tech mailing list
http://lists.lugod.org/mailman/listinfo/vox-tech
