Does anyone know a way to inspect a PDF programmatically to tell whether it contains text or was scanned as an image?

I basically just want to sort a directory with many thousands of PDFs.
I figured there must be something in the header or in the file info that says whether it's an image or has text or, to get more complicated, gives a quick percentage of how much of the document is text, which I could use to set a sort threshold.

Alternatively, if it can be done more easily on a PS file, there's no reason I can't run pdf2ps on it and then decide how to sort.
It's really a one-time deal, so I'll take the overhead of that operation.
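
For illustration, here is a rough sketch of the "percentage of text" threshold idea. It is only one possible approach, not a definitive answer. It assumes the pdftotext tool from poppler-utils is installed and that it separates pages with form-feed characters; the cutoffs (MIN_CHARS_PER_PAGE, TEXT_PAGE_THRESHOLD) and the "text"/"scanned" folder names are made-up values for the example.

```python
import shutil
import subprocess
import sys
from pathlib import Path

# Assumed cutoffs for this sketch; tune them on a sample of your own files.
MIN_CHARS_PER_PAGE = 25    # fewer non-whitespace chars than this counts as "no text"
TEXT_PAGE_THRESHOLD = 0.5  # fraction of text pages needed to call the file "text"


def text_page_fraction(pages, min_chars=MIN_CHARS_PER_PAGE):
    """Return the fraction of pages holding at least min_chars non-whitespace characters."""
    if not pages:
        return 0.0
    with_text = sum(1 for page in pages if len("".join(page.split())) >= min_chars)
    return with_text / len(pages)


def classify(pages, threshold=TEXT_PAGE_THRESHOLD):
    """Label a document 'text' or 'scanned' from its per-page extracted text."""
    return "text" if text_page_fraction(pages) >= threshold else "scanned"


def extract_pages(pdf_path):
    """Run pdftotext (assumed to be on PATH) and split its output into pages."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],  # "-" sends the text to stdout
        capture_output=True, encoding="utf-8", errors="replace", check=True,
    )
    pages = result.stdout.split("\f")
    # Each page normally ends with a form feed, leaving one empty trailing piece.
    if pages and not pages[-1].strip():
        pages.pop()
    return pages


def sort_directory(directory):
    """Move each PDF in directory into a 'text' or 'scanned' subfolder."""
    for pdf in sorted(directory.glob("*.pdf")):
        try:
            label = classify(extract_pages(pdf))
        except subprocess.CalledProcessError:
            print(f"skipped (pdftotext failed): {pdf.name}")
            continue
        target = directory / label
        target.mkdir(exist_ok=True)
        shutil.move(str(pdf), str(target / pdf.name))


if __name__ == "__main__" and len(sys.argv) == 2:
    sort_directory(Path(sys.argv[1]))
```

Saved under a hypothetical name such as sortpdfs.py, it would be run as "python sortpdfs.py /path/to/pdfs". One caveat: a scan that has been through OCR carries a hidden text layer, so this check would file it under "text".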

Alex
_______________________________________________
vox-tech mailing list
[email protected]
http://lists.lugod.org/mailman/listinfo/vox-tech