Re: [vox-tech] How to tell if a pdf is text or image?

Alex Mandel Tue, 20 Mar 2007 21:36:35 -0800

hajhouse wrote:

On 2007-03-20, Alex Mandel wrote:
Anyone know a way to tap into a pdf programmatically to tell if it contains text vs was scanned as an image?
I basically just want to sort a directory with many thousands of pdfs.
I figured there must be something in the header or in the file info that either says that it's an image or it has text, or, to be more complicated, gives you a quick percentage of how much of the document is text, which I could use to set a sort threshold.
Alternately, if it can be done more easily on a ps file, there's no reason why I can't do a pdf2ps on it and then decide how to sort.
It's really a one time deal so I'll take the overhead on that operation.
What about converting the PDF files to postscript then running ps2ascii?

Well, I don't actually need the text, I just need to know if it is text.

The idea is that once I separate them, all the ones that are images can then be run through OCR to get text versions. So my idea was either a yes/no answer or a rule like: if the document is more than 20% (arbitrary) text, consider it text.
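
One possible reading of the "more than 20% text" rule is "more than 20% of the pages yield any extractable text". The sketch below assumes the text has already been extracted by a tool such as pdftotext, which ends each page with a form feed character; the function names and the page-based definition of "percent text" are illustrative assumptions, not something settled in this thread.

```python
def text_page_fraction(extracted: str) -> float:
    """Fraction of pages that contain any non-whitespace text.

    Assumes pages are separated by form feeds, as pdftotext emits them.
    """
    pages = extracted.split("\f")
    # Each page is assumed to end with a form feed, so drop the empty tail.
    if pages and not pages[-1].strip():
        pages.pop()
    if not pages:
        return 0.0
    with_text = sum(1 for page in pages if page.strip())
    return with_text / len(pages)


def is_text_pdf(extracted: str, threshold: float = 0.20) -> bool:
    """Apply the (arbitrary) 20% rule: strictly more than threshold is text."""
    return text_page_fraction(extracted) > threshold
```

A document where only a cover page has text would fall under the threshold and land in the OCR pile, which seems to be the intent of using a percentage instead of a plain yes/no.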

So far pdffonts tells me what fonts I have, and if it's an image I get nothing after the header lines. So that might work if I write a program that saves the pdffonts output to a temp file and checks whether it's longer than just the headers.
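
A minimal sketch of that check, assuming the pdffonts command-line tool is on the PATH and that its listing always starts with two header lines (the column names and a dashed rule), as described above. The function names are made up for illustration, and the output is read directly from the process so no temp file is needed.

```python
import subprocess
import sys

HEADER_LINES = 2  # column names plus the dashed rule (assumed layout)


def has_fonts(pdffonts_output: str) -> bool:
    """Return True if the listing has any rows after the header lines."""
    lines = [ln for ln in pdffonts_output.splitlines() if ln.strip()]
    return len(lines) > HEADER_LINES


def pdf_has_text(path: str) -> bool:
    """Run pdffonts on one file and report whether it lists any fonts."""
    result = subprocess.run(
        ["pdffonts", path], capture_output=True, text=True, check=True
    )
    return has_fonts(result.stdout)


if __name__ == "__main__":
    # Print one "text" or "image" label per file given on the command line.
    for path in sys.argv[1:]:
        label = "text" if pdf_has_text(path) else "image"
        print(label + "\t" + path)
```

Saved under a hypothetical name such as sortpdf.py, it could be run as `python sortpdf.py *.pdf` and the labels used to move files into two directories. A scan that already carries a hidden OCR text layer would list fonts and be counted as text, which is fine here since none of these files have been through OCR.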

I guess I should clarify: when I say image, I'm talking about PDFs that were made by scanning a document straight to TIFF with no OCR. I know none of them have pictures, since it's all legal docs at a law firm.


Alex
_______________________________________________
vox-tech mailing list
[email protected]
http://lists.lugod.org/mailman/listinfo/vox-tech
