Re: Re: Illegible decoding in some pdf documents

Andreas Lehmkuehler Sat, 15 May 2010 09:26:55 -0700

Hi

Thomas Fischer wrote:

Hi Andreas,
yes, I assume it's TeX-related, but all my other files are TeX-created as well.
I don't suppose that the choice of TeX encoding "\usepackage[T1]{fontenc}" mentioned in [1] really matters, since the encoding usepackage really just calls a preprocessor to retranslate characters like "ü" to the standard "\"u" of TeX (I hope I get this right). So if you just remove this line, you will expose non-ASCII characters to the TeX engine, and I suppose that they will just be skipped.

I guess it is about 15 years since I last used TeX/LaTeX, so I can't
follow the given workaround, but it is obviously a TeX-related issue.

I'll gladly post the three different problems (and a couple of others if people
are willing to work on them) to the pdfbox jira system, but won't go through
this process unless somebody is interested - I don't know the importance of
TeX files for the pdfbox community.

IMHO we should try to support every pdf, and TeX is, AFAIK, quite popular when
it comes to writing technical documentation and similar docs.

So, if you can share your docs, please attach them to the mentioned issue
PDFBOX-534. It is better to have too many than too few examples. :-)

Thanks in advance
Andreas Lehmkühler

The problem described in [1] looks like my case 2, and as I mentioned, these 
tend to be accessible through Apple's PDF kit, though with the usual quirks 
that made me decide to use pdfbox in the first place. But this shows that the 
file can be transformed.
But I'm not enough of an expert in either Java or the PDF format to really dig
into the pdfbox code, so I can't be of much help there.

All the best
Thomas Fischer


On 12.05.2010 at 09:16, Andreas Lehmkühler wrote:

Hi Thomas,

----- Original message --------
Subject: Illegible decoding in some pdf documents
Sent: Tue, 11 May 2010
From: Thomas Fischer<[email protected]>

Hello,

I sent this note last week and didn't receive any response, so here is an
updated version with some additional information.
To explain the context a little: I tried to extract text from 5091
mathematical PDF files. While I got some messages like "You do not have
permission to extract text", "Error: Header doesn't contain versioning" or
"Error: End-of-File, expected line", the majority of the files were
transformed without an error message.
Unfortunately, some of these supposedly correctly transformed files are
illegible. In those files, usually all characters are somehow decoded; and I
could distinguish at least 3 kinds of decoding. In those papers all
characters look like the following examples:

1. x57x65x69x65x72x73x74x72x61xffx2dx49x6ex73x74x69x74x75x74 (about 20 cases)
        created using e.g.
                TeX output 2009.02.18:0900
                dvipdfm 0.13.2c, Copyright © 1998, by Mark A. Wicks

2. a0a2a1a4a3a6a5a8a7a10a9a12a11a14a13a15a9a16a13a15a11 (about 200 cases)
        created using some version of Ghostscript or pdfTeX

3. BYCXD2CPD2CRCXCPD0 BWCTD6CXDACPD8CXDACTD7 (about 600 cases)
        created using e.g.
                some version of dvips(k) (5.83 (MiKTeX 1.20b), 5.95a by Radical Eye Software)
                some version of Acrobat Distiller


Using Apple's PDF kit, I obtain readable results for the first and second
cases. In the third case, only characters from Unicode's "Private Plane" are
shown.
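
The first pattern looks like a plain dump of character codes as "xNN" hex escapes. Below is a minimal Java sketch of decoding such a run. It rests on an assumption the thread does not confirm: that the font uses the T1 (Cork) encoding, in which code 0xFF is "ß". The class and method names are hypothetical, and no pdfbox API is involved.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HexEscapeDecoder {

    // One "xNN" escape, where NN is a two-digit hexadecimal character code.
    private static final Pattern HEX_ESCAPE = Pattern.compile("x([0-9a-fA-F]{2})");

    /**
     * Decodes a run of "xNN" escapes into text.
     * Assumptions: codes 0x20..0x7E are treated as ASCII (only approximately
     * true for T1), 0xFF is mapped to 'ß' as in the T1 (Cork) encoding, and
     * every other code becomes '?' because its mapping is not known here.
     */
    static String decode(String garbled) {
        StringBuilder out = new StringBuilder();
        Matcher m = HEX_ESCAPE.matcher(garbled);
        while (m.find()) {
            int code = Integer.parseInt(m.group(1), 16);
            if (code >= 0x20 && code <= 0x7E) {
                out.append((char) code);
            } else if (code == 0xFF) {
                out.append('\u00DF'); // 'ß' under the T1 assumption
            } else {
                out.append('?');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(
            decode("x57x65x69x65x72x73x74x72x61xffx2dx49x6ex73x74x69x74x75x74"));
    }
}
```

Under that assumption, the sample string from case 1 decodes to "Weierstraß-Institut", which suggests the text is intact in the file and only the mapping from character codes to Unicode is missing.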

In some cases, only part of the document is encoded this way, probably
because the file was put together from different sources:

Figure 1: Hypothetical Log Quasi-Likelihood
a0 a1a3a2a5a4a7a6a9a8 a10a12a11a14a13 a15a17a16a19a18
a20
a21a17a22a24a23a26a25 a21a5a22a24a23a28a27 a21a5a22 a21a5a22a30a29a31a27
a21a5a22a30a29a32a25
section.

Can anybody tell me what this means, and whether there is a way to improve the
results? Is there a way to find out whether the transformation yielded any
readable results?
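
One way to approach the last question is to run a heuristic over the extracted text and flag files in which most tokens match one of the three patterns listed above. The following Java sketch does that with regular expressions. It is not a pdfbox feature; the class name, the method names and the patterns are assumptions derived only from the samples in this message, so it may miss other kinds of garbling or flag legitimate text.

```java
import java.util.regex.Pattern;

class GarbledTextCheck {

    // Case 1: a token made up only of "xNN" hex escapes.
    private static final Pattern HEX_RUN = Pattern.compile("(?:x[0-9a-fA-F]{2}){3,}");
    // Case 2: a token made up only of "aN" names, e.g. "a0a2a1a4".
    private static final Pattern GLYPH_RUN = Pattern.compile("(?:a\\d{1,3}){2,}");
    // Case 3: a token made up only of pairs like "BY", "D2", "CP".
    private static final Pattern PAIR_RUN = Pattern.compile("(?:[A-Z][A-Z0-9]){4,}");

    /** Returns true if a single whitespace-delimited token matches one of the patterns. */
    static boolean isGarbledToken(String token) {
        if (HEX_RUN.matcher(token).matches() || GLYPH_RUN.matcher(token).matches()) {
            return true;
        }
        // Require a digit so that ordinary all-caps words are not flagged.
        return PAIR_RUN.matcher(token).matches()
                && token.chars().anyMatch(Character::isDigit);
    }

    /** Returns the fraction of tokens (0.0 to 1.0) that look garbled. */
    static double garbledRatio(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0.0;
        }
        String[] tokens = trimmed.split("\\s+");
        int garbled = 0;
        for (String token : tokens) {
            if (isGarbledToken(token)) {
                garbled++;
            }
        }
        return (double) garbled / tokens.length;
    }

    public static void main(String[] args) {
        System.out.println(garbledRatio("BYCXD2CPD2CRCXCPD0 BWCTD6CXDACPD8CXDACTD7"));
        System.out.println(garbledRatio("Figure 1: Hypothetical Log Quasi-Likelihood"));
    }
}
```

A file could then be flagged for manual inspection when the ratio exceeds some threshold; the right cut-off would have to be tuned on real data. Working per token rather than per file also helps with partially affected documents like the figure example above.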

I'm sorry for the late answer. Without having a look at the documents it's only a guess, but I'm sure it is an encoding issue. In your case it seems to be a TeX-related issue, probably similar to the issue described in PDFBOX-534 [1].


BR
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX-534
