On Tuesday 20 March 2007 21:30, Alex Mandel wrote:
> Anyone know a way to tap into a pdf programmatically to tell if it
> contains text vs was scanned as an image?
>
> I basically just want to sort a directory with many thousands of
> pdfs. I figured there must be something in the header or in the file
> info that either says that it's an image or it has text, or to be
> more complicated gives you a quick percentage of document is text,
> which I could use to set a sort threshold.
>
> Alternately if it can be done more easily on a ps file there's no
> reason why I can't do a pdf2ps on it and then decide how to sort.
> It's really a one time deal so I'll take the overhead on that
> operation.

I think pdfedit (http://pdfedit.petricek.net/) can tell you what you
want to know, but it looks insanely hard to give you a quick percentage
for one document, let alone thousands. At any rate, maybe someone else
on vox-tech will find it useful to know about.

--Ken

-- 
Ken Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu/~kbloom1/
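
For the "quick percentage" part of the question, here is a rough sketch of
one possible approach. It is not something pdfedit does: it assumes the
pdftotext tool from xpdf/poppler is installed and on the PATH, and it
relies on pdftotext ending each page of output with a form feed. The
function names and the 20-character cutoff are made up for illustration,
and a scanned PDF that carries an OCR text layer will count as text.

```python
import subprocess
from pathlib import Path

# Assumed cutoff: a page with fewer non-space characters than this is
# treated as "no text" (e.g. a scanned image). Tune for your collection.
MIN_CHARS_PER_PAGE = 20


def text_page_fraction(extracted: str, min_chars: int = MIN_CHARS_PER_PAGE) -> float:
    """Return the fraction of pages that carry at least min_chars of text.

    Pages are assumed to be delimited by form feeds, as in pdftotext output.
    """
    pages = extracted.split("\f")
    # Each page ends with a form feed, so the last piece is an empty leftover.
    if pages and not pages[-1].strip():
        pages.pop()
    if not pages:
        return 0.0
    # Count non-whitespace characters only, so blank lines do not inflate a page.
    with_text = sum(1 for page in pages if len("".join(page.split())) >= min_chars)
    return with_text / len(pages)


def extract_text(pdf: Path) -> str:
    """Run `pdftotext FILE -` and return whatever it writes to stdout."""
    result = subprocess.run(
        ["pdftotext", str(pdf), "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=False,  # a broken PDF simply yields empty output here
    )
    return result.stdout


def report(directory: str) -> None:
    """Print 'fraction<TAB>filename' for every *.pdf in directory."""
    for pdf in sorted(Path(directory).glob("*.pdf")):
        print(f"{text_page_fraction(extract_text(pdf)):.2f}\t{pdf.name}")
```

Calling report() on the directory and piping the output through sort -n
would put the image-only files first; any threshold for the actual sort
is a judgment call for the collection at hand.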

_______________________________________________
vox-tech mailing list
http://lists.lugod.org/mailman/listinfo/vox-tech
