Re: Fwd: Re: [vox-tech] How to tell if a pdf is text or image?

Ken Herron Thu, 22 Mar 2007 18:09:59 -0800

Well, I don't actually need the text, I just need to know if it is text.
The idea is that once I separate them, all the ones that are images can then be OCR-corrected to text versions. So my idea was either a yes/no answer or to say something like, if the document is more than 20% (arbitrary) text, consider it text.

PDF is a scripting language. You can look at the raw PDF with a text editor and you'll see plain-text PDF operators interspersed with possibly binary data. In principle PDF is a programming language and the only way to tell what it produces is to run it. But in practice, PDF code is all machine-written, and you could probably learn to distinguish font-using PDFs from pure-image PDFs by examining the raw PDF file.

You could look for the font embedding operators. A document consisting only of scanned page images probably won't have any fonts embedded in it. Or, if the scanned-paper PDFs are all made by a particular program, you might be able to identify particular PDF operator sequences that it uses.
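
As a rough illustration of the font check, here is a minimal Python sketch that scans the raw bytes of a PDF for font dictionaries and image objects. The function names and the two byte patterns are my own assumptions for this example, not anything taken from a particular tool. It will be fooled by PDFs that keep their object dictionaries inside compressed streams (it then answers "unknown"), and by scanned PDFs that already carry a hidden OCR text layer (those contain fonts, so it answers "text"). Treat it as a first-pass sorter, not a definitive test.

```python
import re

# Heuristic byte patterns (assumptions for illustration, not a real PDF parser).
# A font dictionary normally contains "/Type /Font"; an embedded raster image
# is normally an XObject with "/Subtype /Image".
FONT_RE = re.compile(rb"/Type\s*/Font\b")
IMAGE_RE = re.compile(rb"/Subtype\s*/Image\b")


def classify_pdf_bytes(data: bytes) -> str:
    """Guess 'text', 'image', or 'unknown' from the raw bytes of a PDF."""
    if FONT_RE.search(data):
        # Any font dictionary suggests the file draws text somewhere.
        return "text"
    if IMAGE_RE.search(data):
        # Images but no fonts: probably scanned pages.
        return "image"
    # Neither marker is visible, e.g. because the dictionaries are compressed.
    return "unknown"


def classify_pdf_file(path: str) -> str:
    """Read a file from disk and classify it with classify_pdf_bytes."""
    with open(path, "rb") as f:
        return classify_pdf_bytes(f.read())
```

Calling classify_pdf_file() on each file in a directory would give a quick split into the three groups; anything that comes back "unknown" would need a real PDF library or a look by hand.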

_______________________________________________
vox-tech mailing list
[email protected]
http://lists.lugod.org/mailman/listinfo/vox-tech
