[vox-tech] Re: How to tell if a pdf is text or image?

hajhouse Fri, 23 Mar 2007 02:42:24 -0800

[...]
> PDF is a scripting language. You can look at the raw PDF with a text 
> editor and you'll see plain-text PDF operators interspersed with 
> possibly binary data. In principle PDF is a programming language and the 
> only way to tell what it produces is to run it. But in practice, PDF 
> code is all machine-written, and you could probably learn to distinguish 
> font-using PDFs from pure-image PDFs by examining the raw PDF file.
> 
> You could look for the font embedding operators. A document consisting 
> only of scanned page images probably won't have any fonts embedded in 
> it. Or, if the scanned-paper PDFs are all made by a particular program, 
> you might be able to identify particular PDF operator sequences that it 
> uses.


In that vein, I ask: does Alex need a general method that works for
any possible PDF file, or do the PDF files in this particular problem
form a limited set created by one or a few programs?
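
If the files do come from a limited set of programs, the font check
suggested above could be sketched roughly as follows. This is only a
heuristic under stated assumptions, not a general method: the function
name classify_pdf_bytes is made up for illustration, and it assumes the
font and image dictionaries are stored uncompressed so that their names
show up in the raw bytes. A scanned PDF with an OCR text layer would
also be reported as "text", since it does carry fonts.

```python
import re

# Assumption: dictionaries are stored uncompressed, so these names are
# visible in the raw bytes. PDFs that pack their objects into compressed
# object streams (PDF 1.5 and later) may hide them from this scan.
FONT_RE = re.compile(rb"/Type\s*/Font\b")
IMAGE_RE = re.compile(rb"/Subtype\s*/Image\b")


def classify_pdf_bytes(data: bytes) -> str:
    """Guess from raw PDF bytes whether the file uses fonts or only images."""
    has_font = FONT_RE.search(data) is not None
    has_image = IMAGE_RE.search(data) is not None
    if has_font:
        # Fonts are present; the file may still contain images as well.
        return "text"
    if has_image:
        # Image XObjects but no font dictionaries: probably scanned pages.
        return "image-only"
    # Neither marker found, e.g. because the objects are compressed.
    return "unknown"
```

To try it on a file, read the file in binary mode and pass the bytes,
for example classify_pdf_bytes(open(path, "rb").read()). An answer of
"unknown" means the scan could not see either marker, not that the
file is empty.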

-- 
Henry House
+1 530 753 3361 ext. 13
Please don't send me HTML mail! My mail system frequently rejects it.
The unintelligible text that may follow is a digital signature.
See <http://hajhouse.org/pgp> to find out how to use it.
My OpenPGP key: <http://hajhouse.org/hajhouse.asc>.


_______________________________________________
vox-tech mailing list
http://lists.lugod.org/mailman/listinfo/vox-tech
