Re: Illegible decoding in some pdf documents

Andreas Lehmkühler Wed, 12 May 2010 00:16:53 -0700

Hi Thomas,

----- Original message --------
Subject: Illegible decoding in some pdf documents
Sent: Tue, 11 May 2010
From: Thomas Fischer<[email protected]>
> Hello,
> 
> I sent this note last week and didn't receive any response, here is an
> updated version with some additional information.
> To explain the context a little: I tried to extract text from 5091
> mathematical PDF files. While I got some messages like "You do not have
> permission to extract text", "Error: Header doesn't contain versioning" or
> "Error: End-of-File, expected line", the majority of the files were
> transformed without an error message.
> Unfortunately, some of these supposedly correctly transformed files are
> illegible. In those files, usually all characters are somehow encoded, and I
> could distinguish at least 3 kinds of encoding. In those papers all
> characters look like the following examples:
> 
> 1. x57x65x69x65x72x73x74x72x61xffx2dx49x6ex73x74x69x74x75x74 (about 20
> cases)
>       created using e.g. 
>               TeX output 2009.02.18:0900
>               dvipdfm 0.13.2c, Copyright © 1998, by Mark A. Wicks
> 
> 2. a0a2a1a4a3a6a5a8a7a10a9a12a11a14a13a15a9a16a13a15a11 (about 200 cases)
>       created using some version of Ghostscript or pdfTeX
> 
> 3. BYCXD2CPD2CRCXCPD0 BWCTD6CXDACPD8CXDACTD7 (about 600 cases)
>       created using e.g.
>               some version of dvips(k) (5.83 (MiKTeX 1.20b), 5.95a by Radical Eye Software)
>               some version of Acrobat Distiller
> 
> 
> Using Apple's PDF kit, I obtain readable results for the first and second
> cases. In the third case, only characters from Unicode's "Private Plane" are
> shown.
> 
> In some cases, only part of the document is encoded this way, probably
> because the file was put together from different sources:
> 
> Figure 1: Hypothetical Log Quasi-Likelihood
> a0 a1a3a2a5a4a7a6a9a8 a10a12a11a14a13 a15a17a16a19a18
> a20
> a21a17a22a24a23a26a25 a21a5a22a24a23a28a27 a21a5a22 a21a5a22a30a29a31a27
> a21a5a22a30a29a32a25
> section.
> 
> Can anybody tell me what this means? Is there a way to improve the results?
> Is there a way to find out whether the transformation yielded any
> readable results?
I'm sorry for the late answer. Without having a look at the documents it's
only a guess, but I'm sure it is an encoding issue. In your case it seems to
be a TeX-related issue, probably similar to the one described in
PDFBOX-534 [1].
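
On the question of whether one can tell if an extraction yielded readable
text: one rough option is to scan the extracted text for the three patterns
quoted above. The following Python sketch is an illustration only and not part
of PDFBox. The regular expressions and the function names garbled_fraction and
decode_hex_escapes are assumptions modelled on the quoted samples, so they may
misfire on other documents (for example, kind 3 also matches ordinary
all-uppercase words of even length).

```python
import re

# Hypothetical patterns, modelled only on the three samples quoted above.
HEX_ESCAPES = re.compile(r"(?:x[0-9a-f]{2}){4,}")       # kind 1: x57x65x69...
A_NUMBERS = re.compile(r"(?:a\d{1,3}){4,}")              # kind 2: a0a2a1a4...
UPPER_PAIRS = re.compile(r"\b(?:[A-Z][A-Z0-9]){4,}\b")   # kind 3: BYCXD2CP...


def garbled_fraction(text):
    """Fraction of non-whitespace characters covered by one of the patterns."""
    total = sum(1 for c in text if not c.isspace())
    if total == 0:
        return 0.0
    covered = 0
    for pattern in (HEX_ESCAPES, A_NUMBERS, UPPER_PAIRS):
        covered += sum(len(m.group(0)) for m in pattern.finditer(text))
    return min(covered / total, 1.0)


def decode_hex_escapes(text):
    """Decode runs of kind 1 (x57x65...) into one character per byte value."""
    def decode_run(match):
        codes = re.findall(r"x([0-9a-f]{2})", match.group(0))
        return "".join(chr(int(code, 16)) for code in codes)
    return HEX_ESCAPES.sub(decode_run, text)


if __name__ == "__main__":
    sample = "x57x65x69x65x72x73x74x72x61xffx2dx49x6ex73x74x69x74x75x74"
    print(garbled_fraction(sample))
    print(decode_hex_escapes(sample))
```

A file could then be flagged for manual review when garbled_fraction exceeds
some threshold (the value would have to be tuned on real data). For kind 1 the
bytes decode to plain ASCII apart from 0xff, which is probably the sharp s of
the TeX T1 font encoding rather than the Latin-1 character at that position.
Kinds 2 and 3 cannot be decoded this way, since the original character codes
are not present in the output.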


BR
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX-534
