On 2007-03-20, Alex Mandel wrote:
> Anyone know a way to tap into a pdf programmatically to tell if it 
> contains text vs was scanned as an image?
> 
> I basically just want to sort a directory with many thousands of pdfs.
> I figured there must be something in the header or in the file info that 
> either says that it's an image or it has text, or to be more complicated 
> gives you a quick percentage of how much of the document is text, which 
> I could use to set a sort threshold.
> 
> Alternately if it can be done more easily on a ps file there's no reason 
> why I can't do a pdf2ps on it and then decide how to sort.
> It's really a one time deal so I'll take the overhead on that operation.

What about converting the PDF files to PostScript and then running ps2ascii?
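
A rough, untested-on-your-files sketch of that idea in Python, assuming
Ghostscript's pdf2ps and ps2ascii are installed and on the PATH. The function
names and the 50-character threshold are made up for illustration; a scanned
PDF should produce little or no text, but the right cutoff depends on your
files, so try it on a few known ones first.

```python
import subprocess
import tempfile
from pathlib import Path


def count_text_chars(text: str) -> int:
    """Count the non-whitespace characters in the extracted text."""
    return sum(1 for ch in text if not ch.isspace())


def classify(text: str, min_chars: int = 50) -> str:
    """Label a document "text" or "image" from its extracted text.

    min_chars is an arbitrary, assumed threshold, not a known-good value.
    """
    return "text" if count_text_chars(text) >= min_chars else "image"


def extract_text(pdf_path: Path) -> str:
    """Convert the PDF to PostScript, then return what ps2ascii prints."""
    with tempfile.TemporaryDirectory() as tmp:
        ps_path = Path(tmp) / "out.ps"
        # pdf2ps input.pdf output.ps
        subprocess.run(["pdf2ps", str(pdf_path), str(ps_path)], check=True)
        # With one argument, ps2ascii writes the text to standard output.
        result = subprocess.run(
            ["ps2ascii", str(ps_path)],
            check=True, capture_output=True, text=True, errors="replace",
        )
    return result.stdout


def sort_directory(directory: Path, min_chars: int = 50) -> dict:
    """Group the PDFs in a directory under "text", "image", or "error"."""
    groups = {"text": [], "image": [], "error": []}
    for pdf_path in sorted(directory.glob("*.pdf")):
        try:
            label = classify(extract_text(pdf_path), min_chars)
        except (subprocess.CalledProcessError, OSError):
            # Conversion failed or the tools are missing; set the file aside.
            label = "error"
        groups[label].append(pdf_path)
    return groups
```

Calling sort_directory(Path("/some/dir")) returns the three lists, which you
could then use to move files around. Going through PostScript is slow for
thousands of files, but for a one-time job that is probably acceptable.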

-- 
Henry House
+1 530 753 3361 ext. 13
Please don't send me HTML mail! My mail system frequently rejects it.
The unintelligible text that may follow is a digital signature.
See <http://hajhouse.org/pgp> to find out how to use it.
My OpenPGP key: <http://hajhouse.org/hajhouse.asc>.


_______________________________________________
vox-tech mailing list
[email protected]
http://lists.lugod.org/mailman/listinfo/vox-tech
