Hello,

I sent this note last week and did not receive a response; here is an updated version with some additional information. To explain the context a little: I tried to extract text from 5091 mathematical PDF files. While I got some messages like "You do not have permission to extract text", "Error: Header doesn't contain versioning" or "Error: End-of-File, expected line", the majority of the files were transformed without an error message. Unfortunately, some of these supposedly correctly transformed files are illegible. In those files, usually all characters are encoded in some way, and I could distinguish at least three kinds of encoding. In those papers all characters look like the following examples:
1. x57x65x69x65x72x73x74x72x61xffx2dx49x6ex73x74x69x74x75x74 (about 20 cases)
created using e.g.
TeX output 2009.02.18:0900
dvipdfm 0.13.2c, Copyright © 1998, by Mark A. Wicks
2. a0a2a1a4a3a6a5a8a7a10a9a12a11a14a13a15a9a16a13a15a11 (about 200 cases)
created using some version of Ghostscript or pdfTeX
3. BYCXD2CPD2CRCXCPD0 BWCTD6CXDACPD8CXDACTD7 (about 600 cases)
created using e.g.
some version of dvips(k) (5.83 (MiKTeX 1.20b), 5.95a by Radical
Eye Software)
some version of Acrobat Distiller
Using Apple's PDF kit, I obtain readable results for the first and second
cases. In the third case, only characters from Unicode's "Private Plane" are
shown.
In some cases, only part of the document is encoded this way, probably because
the file was put together from different sources:
Figure 1: Hypothetical Log Quasi-Likelihood
a0 a1a3a2a5a4a7a6a9a8 a10a12a11a14a13 a15a17a16a19a18
a20
a21a17a22a24a23a26a25 a21a5a22a24a23a28a27 a21a5a22 a21a5a22a30a29a31a27
a21a5a22a30a29a32a25
section.
Can anybody tell me what this means, and whether there is a way to improve
the results? Is there a way to find out whether the transformation yielded
any readable results?
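
To make the last question concrete, here is a minimal sketch in Python of the kind of check I have in mind. The regular expressions are modelled only on the three samples quoted earlier in this note, and the function names and the 0.5 threshold are my own assumptions; this is a heuristic and says nothing about how the extraction tool works internally.

```python
import re

# Heuristic patterns modelled on the three kinds of illegible output
# quoted in this note. They are assumptions, not a specification.
HEX_RUN = re.compile(r"(?:x[0-9a-f]{2}){4,}")        # e.g. x57x65x69x65...
A_NUM_RUN = re.compile(r"(?:a\d{1,3}){4,}")          # e.g. a0a2a1a4a3a6...
PAIR_RUN = re.compile(r"\b(?:[A-D][A-Z0-9]){4,}\b")  # e.g. BYCXD2CPD2CR...


def garbled_fraction(text: str) -> float:
    """Return the share of non-space characters covered by the patterns."""
    total = sum(1 for ch in text if not ch.isspace())
    if total == 0:
        return 0.0
    covered = 0
    for pattern in (HEX_RUN, A_NUM_RUN, PAIR_RUN):
        covered += sum(len(m.group(0)) for m in pattern.finditer(text))
    # Guard against double counting if two patterns ever overlap.
    return min(1.0, covered / total)


def looks_garbled(text: str, threshold: float = 0.5) -> bool:
    """Flag a text whose garbled share reaches the (arbitrary) threshold."""
    return garbled_fraction(text) >= threshold
```

Running garbled_fraction over each extracted text file would at least let me sort the 5091 results by how suspicious they are, including the documents that are only partly affected.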
Best regards
Thomas Fischer