Hi Thomas,

----- Original Message -----
Subject: Illegible decoding in some pdf documents
Sent: Tue, 11 May 2010
From: Thomas Fischer<[email protected]>

> Hello,
>
> I sent this note last week and didn't receive any response, here is an
> updated version with some additional information.
> To explain the context a little: I tried to extract text from 5091
> mathematical PDF files. While I got some messages like "You do not have
> permission to extract text", "Error: Header doesn't contain versioning" or
> "Error: End-of-File, expected line", the majority of the files were
> transformed without an error message.
> Unfortunately, some of these supposedly correctly transformed files are
> illegible. In those files, usually all characters are somehow decoded; and I
> could distinguish at least 3 kinds of decoding. In those papers all
> characters look like the following examples:
>
> 1. x57x65x69x65x72x73x74x72x61xffx2dx49x6ex73x74x69x74x75x74 (about 20
> cases)
> created using e.g.
> TeX output 2009.02.18:0900
> dvipdfm 0.13.2c, Copyright © 1998, by Mark A. Wicks
>
> 2. a0a2a1a4a3a6a5a8a7a10a9a12a11a14a13a15a9a16a13a15a11 (about 200 cases)
> created using some version of Ghostscript or pdfTeX
>
> 3. BYCXD2CPD2CRCXCPD0 BWCTD6CXDACPD8CXDACTD7 (about 600 cases)
> created using e.g.
> some version of dvips(k) (5.83 (MiKTeX 1.20b), 5.95a by Radical
> Eye
> Software)
> some version of Acrobat Distiller
>
>
> Using Apple's PDF kit, I obtain readable results for the first and second
> cases. In the third case, only characters from Unicode's "Private Plane" are
> shown.
>
> In some cases, only part of the document is encoded this way, probably
> because the file was put together from different sources:
>
> Figure 1: Hypothetical Log Quasi-Likelihood
> a0 a1a3a2a5a4a7a6a9a8 a10a12a11a14a13 a15a17a16a19a18
> a20
> a21a17a22a24a23a26a25 a21a5a22a24a23a28a27 a21a5a22 a21a5a22a30a29a31a27
> a21a5a22a30a29a32a25
> section.
>
> Can anybody tell me what this means, is there a way to improve the results?
> Is there a way to obtain information whether the transformation yielded any
> readable results?

I'm sorry for the late answer. Without having a look at the documents it's only a guess, but I'm sure it is an encoding issue. In your case it seems to be a TeX-related issue, probably similar to the one described in PDFBOX-534 [1].
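
On the question of finding out whether an extraction yielded readable text: one rough option is to post-process the extracted text and flag files in which most tokens look like the samples quoted above. The Python sketch below is only a heuristic and not part of PDFBox. The regular expressions are modelled solely on the three examples in this thread, and the names `garbled_ratio` and `looks_readable` as well as the 0.3 threshold are assumptions made up for illustration.

```python
import re

# Patterns modelled only on the three samples quoted in this thread;
# other producers may emit different debris (assumption, not a PDFBox rule).
GARBLED_PATTERNS = [
    re.compile(r"(?:x[0-9a-f]{2}){4,}"),   # case 1: x57x65x69...
    re.compile(r"(?:a\d{1,3}){4,}"),       # case 2: a0a2a1a4...
    re.compile(r"(?:[A-D][A-Z0-9]){4,}"),  # case 3: BYCXD2CP...
]


def garbled_ratio(text):
    """Return the fraction of whitespace-separated tokens that look garbled."""
    tokens = text.split()
    if not tokens:
        return 0.0
    garbled = 0
    for token in tokens:
        # Characters from the Unicode Private Use Area (U+E000..U+F8FF)
        # are also treated as a sign of a missing character mapping.
        has_private_use = any("\ue000" <= ch <= "\uf8ff" for ch in token)
        if has_private_use or any(p.fullmatch(token) for p in GARBLED_PATTERNS):
            garbled += 1
    return garbled / len(tokens)


def looks_readable(text, threshold=0.3):
    """Hypothetical cut-off: fewer than 30% garbled tokens counts as readable."""
    return garbled_ratio(text) < threshold
```

Running `looks_readable` over the text extracted from each file would give a list of candidates for manual inspection. Partly garbled documents, like the figure example above, may need a lower threshold or a per-page check.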

BR
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX-534

