Hi Andreas,

yes, I assume it's TeX-related, but all my other files are TeX-created as well.

I don't suppose that the choice of TeX encoding "\usepackage[T1]{fontenc}" 
mentioned in [1] really matters, since, as far as I understand it, the encoding 
package just calls a preprocessor that translates characters like “ü” into TeX's 
standard “\"u” (I hope I get this right). So if you just remove this line, you 
will expose non-ASCII characters to the TeX engine, and I suppose that they will 
simply be skipped.

I'll gladly post the three different problems (and a couple of others if people 
are willing to work on them) to the pdfbox jira system, but won't go through 
this process unless somebody is interested - I don't know the importance of TeX 
files for the pdfbox community.

The problem described in [1] looks like my case 2, and as I mentioned, these 
files tend to be readable through Apple's PDF kit, though with the usual quirks 
that made me decide to use pdfbox in the first place. Still, this shows that the 
file can be transformed.
However, I'm not enough of an expert in either Java or the PDF format to really 
dig into the pdfbox code, so I can't be of much help there.
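
On the question of how to tell whether an extraction produced readable text, 
here is a rough Python sketch that flags the three patterns from my original 
message quoted below and decodes a case-1 run. The regular expressions, the 
threshold of four units and all function names are my own assumptions, not 
anything pdfbox provides. The mapping of byte 0xFF to "ß" assumes the font uses 
TeX's T1 (Cork) encoding; read as plain Latin-1, that byte would be "ÿ".

```python
import re

# Patterns modelled on the three samples quoted below. The minimum run length
# of four units is an arbitrary threshold (assumption), chosen to skip stray hits.
HEX_RUN = re.compile(r"(?:x[0-9a-f]{2}){4,}")         # case 1: x57x65x69...
A_RUN = re.compile(r"(?:a\d{1,3}){4,}")               # case 2: a0a2a1a4...
PAIR_RUN = re.compile(r"\b(?:[A-Z][A-Z0-9]){4,}\b")   # case 3: BYCXD2CP...


def garbled_fraction(text):
    """Return the fraction of non-whitespace characters inside a suspicious run."""
    total = sum(1 for ch in text if not ch.isspace())
    if total == 0:
        return 0.0
    covered = sum(
        len(match.group(0))
        for pattern in (HEX_RUN, A_RUN, PAIR_RUN)
        for match in pattern.finditer(text)
    )
    return min(1.0, covered / total)


def decode_hex_run(run, overrides=None):
    """Turn a case-1 run such as 'x57x65' back into text.

    Each byte is read as Latin-1 unless `overrides` maps it to something else.
    """
    overrides = overrides or {}
    codes = [int(h, 16) for h in re.findall(r"x([0-9a-f]{2})", run)]
    return "".join(overrides.get(code, chr(code)) for code in codes)


# Assumption: T1 (Cork) font encoding, where position 0xFF holds "ß".
T1_OVERRIDES = {0xFF: "ß"}

sample = "x57x65x69x65x72x73x74x72x61xffx2dx49x6ex73x74x69x74x75x74"
print(garbled_fraction(sample))              # 1.0
print(decode_hex_run(sample, T1_OVERRIDES))  # Weierstraß-Institut
```

A file whose garbled_fraction is close to 1 is probably illegible. The case-3 
pattern also matches ordinary words in capitals of even length (eight letters 
or more), so a low but non-zero value does not prove anything.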

All the best
Thomas Fischer


On 12 May 2010 at 09:16, Andreas Lehmkühler wrote:

> Hi Thomas,
> 
> ----- Original Message --------
> Subject: Illegible decoding in some pdf documents
> Sent: Tue, 11 May 2010
> From: Thomas Fischer <[email protected]>
>> Hello,
>> 
>> I sent this note last week and didn't receive any response, so here is an
>> updated version with some additional information.
>> To explain the context a little: I tried to extract text from 5091
>> mathematical PDF files. While I got some messages like "You do not have
>> permission to extract text", "Error: Header doesn't contain versioning" or
>> "Error: End-of-File, expected line", the majority of the files were
>> transformed without an error message.
>> Unfortunately, some of these supposedly correctly transformed files are
>> illegible. In those files, usually all characters come out encoded in some
>> way, and I could distinguish at least 3 kinds of encoding. In those papers
>> all characters look like the following examples:
>> 
>> 1. x57x65x69x65x72x73x74x72x61xffx2dx49x6ex73x74x69x74x75x74 (about 20
>> cases)
>>      created using e.g. 
>>              TeX output 2009.02.18:0900
>>              dvipdfm 0.13.2c, Copyright © 1998, by Mark A. Wicks
>> 
>> 2. a0a2a1a4a3a6a5a8a7a10a9a12a11a14a13a15a9a16a13a15a11 (about 200 cases)
>>      created using some version of Ghostscript or pdfTeX
>> 
>> 3. BYCXD2CPD2CRCXCPD0 BWCTD6CXDACPD8CXDACTD7 (about 600 cases)
>>      created using e.g.
>>              some version of dvips(k) (5.83 (MiKTeX 1.20b), 5.95a by Radical Eye Software)
>>              some version of Acrobat Distiller
>> 
>> 
>> Using Apple's PDF kit, I obtain readable results for the first and second
>> cases. In the third case, only characters from Unicode's "Private Plane" are
>> shown.
>> 
>> In some cases, only part of the document is encoded this way, probably
>> because the file was put together from different sources:
>> 
>> Figure 1: Hypothetical Log Quasi-Likelihood
>> a0 a1a3a2a5a4a7a6a9a8 a10a12a11a14a13 a15a17a16a19a18
>> a20
>> a21a17a22a24a23a26a25 a21a5a22a24a23a28a27 a21a5a22 a21a5a22a30a29a31a27
>> a21a5a22a30a29a32a25
>> section.
>> 
>> Can anybody tell me what this means, and whether there is a way to improve
>> the results? Is there a way to find out whether the transformation yielded
>> any readable results?
> I'm sorry for the late answer. Without having a look at the documents it's 
> only a guess, but I'm sure it is an encoding issue. In your case it seems to 
> be a TeX-related issue, probably similar to the one described in 
> PDFBOX-534 [1].
> 
> BR
> Andreas Lehmkühler
> 
> [1] https://issues.apache.org/jira/browse/PDFBOX-534
