hajhouse wrote:
On 2007-03-20, Alex Mandel wrote:
Anyone know a way to tap into a pdf programmatically to tell if it contains text vs was scanned as an image?

I basically just want to sort a directory with many thousands of pdfs.
I figured there must be something in the header or in the file info that says whether it's an image or has text, or, to be more complicated, gives a quick percentage of how much of the document is text, which I could use to set a sort threshold.

Alternately if it can be done more easily on a ps file there's no reason why I can't do a pdf2ps on it and then decide how to sort.
It's really a one time deal so I'll take the overhead on that operation.

What about converting the PDF files to postscript then running ps2ascii?



Well, I don't actually need the text, I just need to know if it is text.
The idea is that once I separate them, all the ones that are images can then be run through OCR to produce text versions. So my idea was either a yes/no answer or something like: if the document is more than 20% (arbitrary) text, consider it text.

So far pdffonts tells me what fonts I have, and if it's an image I get nothing after the header lines. So that might work if I write a program that saves the pdffonts output to a temp file and checks whether it's longer than just the headers.
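
A minimal sketch of that check, in Python. It assumes pdffonts (from xpdf or poppler) is on the PATH and that its output starts with exactly two header lines (a column-title line and a dashed rule); that count is an assumption worth verifying against the installed version. The names lists_fonts and classify are made up for illustration. Instead of writing a temp file, it reads the output directly.

```python
import subprocess

# Assumption: pdffonts prints a column-title line and a dashed rule
# before any font rows. Adjust if your version prints something else.
HEADER_LINES = 2


def lists_fonts(pdffonts_output: str) -> bool:
    """Return True if the output has any non-blank rows beyond the headers."""
    rows = [line for line in pdffonts_output.splitlines() if line.strip()]
    return len(rows) > HEADER_LINES


def classify(pdf_path: str) -> str:
    """Run pdffonts on one file and label it 'text' or 'image'."""
    result = subprocess.run(
        ["pdffonts", pdf_path],
        capture_output=True,
        text=True,
        check=True,  # raise if pdffonts cannot read the file
    )
    return "text" if lists_fonts(result.stdout) else "image"
```

Calling classify() on each file in the directory gives the yes/no answer. It only detects whether any fonts are present, so it does not give a percentage, and a scan that already has an OCR text layer would count as "text".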

I guess I should clarify: when I say image, I'm talking about PDFs that were made by scanning a document straight to TIFF with no OCR. I know none of them have pictures, since it's all legal docs at a law firm.

Alex
_______________________________________________
vox-tech mailing list
[email protected]
http://lists.lugod.org/mailman/listinfo/vox-tech
