Illegible decoding in some pdf documents

Thomas Fischer Tue, 11 May 2010 05:14:25 -0700

Hello,

I sent this note last week and didn't receive any response, so here is an 
updated version with some additional information.
To explain the context a little: I tried to extract text from 5091 mathematical 
PDF files. While I got some messages like "You do not have permission to 
extract text", "Error: Header doesn't contain versioning" or "Error: 
End-of-File, expected line", the majority of the files were transformed without 
an error message.
Unfortunately, some of these supposedly correctly transformed files are 
illegible. In those files, usually all characters are somehow encoded, and I 
could distinguish at least 3 kinds of encoding. In those papers all characters 
look like the following examples:


1. x57x65x69x65x72x73x74x72x61xffx2dx49x6ex73x74x69x74x75x74 (about 20 cases)
        created using e.g. 
                TeX output 2009.02.18:0900
                dvipdfm 0.13.2c, Copyright © 1998, by Mark A. Wicks

2. a0a2a1a4a3a6a5a8a7a10a9a12a11a14a13a15a9a16a13a15a11 (about 200 cases)
        created using some version of Ghostscript or pdfTeX

3. BYCXD2CPD2CRCXCPD0 BWCTD6CXDACPD8CXDACTD7 (about 600 cases)
        created using e.g.
                some version of dvips(k) (5.83 (MiKTeX 1.20b), 5.95a by
                Radical Eye Software)
                some version of Acrobat Distiller
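
The first kind looks like a run of hexadecimal byte values, each prefixed with "x". A minimal sketch for reading such a run back, assuming the bytes can be interpreted as Latin-1; the function name `decode_hex_escapes` is made up for illustration and is not part of any extraction tool:

```python
import re


def decode_hex_escapes(text: str, encoding: str = "latin-1") -> str:
    """Read every 'xNN' group as one byte and decode the resulting bytes.

    The Latin-1 default is an assumption; the font's real encoding may differ.
    """
    hex_pairs = re.findall(r"x([0-9a-fA-F]{2})", text)
    return bytes(int(pair, 16) for pair in hex_pairs).decode(encoding)


sample = "x57x65x69x65x72x73x74x72x61xffx2dx49x6ex73x74x69x74x75x74"
decoded = decode_hex_escapes(sample)
print(decoded)  # Weierstraÿ-Institut
```

The byte 0xff comes out as "ÿ" under Latin-1. In the TeX T1 (Cork) font encoding that position holds "ß", which would give "Weierstraß-Institut"; this is an assumption about the font used, not something the extraction output confirms.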


Using Apple's PDF kit, I obtain readable results for the first and second 
cases. In the third case, only characters from Unicode's "Private Plane" are 
shown.

In some cases, only part of the document is encoded this way, probably because 
the file was put together from different sources:

Figure 1: Hypothetical Log Quasi-Likelihood
a0 a1a3a2a5a4a7a6a9a8 a10a12a11a14a13 a15a17a16a19a18
a20
a21a17a22a24a23a26a25 a21a5a22a24a23a28a27 a21a5a22 a21a5a22a30a29a31a27 
a21a5a22a30a29a32a25
section.

Can anybody tell me what this means? Is there a way to improve the results?
Is there a way to find out whether the transformation yielded any readable 
results?
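
One possible way to flag suspicious output after extraction is a pattern check on the text itself. The following is only a rough sketch based on the three kinds of output listed above. The regular expressions, the names `garbled_fraction` and `looks_garbled`, and the 0.5 threshold are all assumptions chosen for illustration, not features of any PDF library:

```python
import re

# One pattern per kind of illegible output; each must match a whole token.
GARBLE_PATTERNS = [
    re.compile(r"(?:x[0-9a-fA-F]{2}){4,}"),  # kind 1: x57x65x69...
    re.compile(r"(?:a\d{1,3}){4,}"),         # kind 2: a0a2a1a4...
    re.compile(r"(?:[A-Z][A-Z0-9]){4,}"),    # kind 3: BYCXD2CP...
]


def garbled_fraction(text: str) -> float:
    """Return the share of whitespace-separated tokens that look garbled."""
    tokens = text.split()
    if not tokens:
        return 0.0
    garbled = sum(
        1 for token in tokens
        if any(pattern.fullmatch(token) for pattern in GARBLE_PATTERNS)
    )
    return garbled / len(tokens)


def looks_garbled(text: str, threshold: float = 0.5) -> bool:
    """Flag a text when at least `threshold` of its tokens look garbled."""
    return garbled_fraction(text) >= threshold


caption = "Figure 1: Hypothetical Log Quasi-Likelihood"
body = "a0 a1a3a2a5a4a7a6a9a8 a10a12a11a14a13 a15a17a16a19a18"
print(looks_garbled(caption), looks_garbled(body))
```

Running the check per page or per paragraph rather than per file would also catch documents where only part of the text is affected. The third pattern also matches ordinary all-caps words with an even number of letters (8 or more), so some false positives are to be expected.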

Best regards
Thomas Fischer
