Re: Illegible decoding in some pdf documents

Thomas Fischer Sun, 16 May 2010 06:17:56 -0700

Hello Andreas,

I added some comments and files to 
https://issues.apache.org/jira/browse/PDFBOX-534
and created three new issues
https://issues.apache.org/jira/browse/PDFBOX-727 to -729
which I suppose are different from the one described in PDFBOX-534:
TeX remnants, hex-decoding and unreadable text of a different kind, all 
TeX-related.
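
For the hex-decoding case, the sample quoted further down in this thread 
(x57x65x69...) can be turned back into text by hand. Here is a minimal sketch 
in Python; the function name and the `overrides` parameter are made up for 
illustration and are not part of PDFBox. It assumes each `xNN` group is one 
character code, and that code 0xFF stands for "ß" as in TeX's T1 font encoding.

```python
import re
from typing import Dict, Optional


def decode_hex_escapes(text: str, overrides: Optional[Dict[int, str]] = None) -> str:
    """Replace every xNN group (two hex digits) by the character it encodes.

    `overrides` maps character codes to replacement strings, for codes whose
    meaning depends on the font encoding rather than on Latin-1/Unicode.
    Only apply this to runs already known to be garbled: ordinary text that
    happens to contain "x" followed by two hex digits would be altered too.
    """
    table = overrides or {}

    def replace(match):
        code = int(match.group(1), 16)
        return table.get(code, chr(code))

    return re.sub(r"x([0-9a-fA-F]{2})", replace, text)


sample = "x57x65x69x65x72x73x74x72x61xffx2dx49x6ex73x74x69x74x75x74"
# Assumption: in the T1 font encoding, code 0xFF is the German sharp s.
print(decode_hex_escapes(sample, {0xFF: "ß"}))
```

With the override this gives "Weierstraß-Institut", which suggests that the 
font's own encoding is being emitted instead of Unicode.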


There are different methods to create PDF documents from TeX (actually, usually 
LaTeX these days):

The classical method is to create a DVI file, and from that a PostScript file 
for printing and dissemination; this is why older files on preprint servers 
like arxiv.org will usually be in PostScript. Later on, people started to 
convert these files to PDF for better acceptance, using Acrobat Distiller or 
some version of ps2pdf, available on all Linux systems and on Mac OS X.
In my experience, this path tends to be problematic for PDFBox.

The newer method uses pdflatex (probably with the hyperref package to create 
tables of contents and hyperlinks) to transform (La)TeX files directly into 
PDF. These files seem to be easier to handle, but not always successfully: the 
original example of PDFBOX-534 was created by pdfTeX-1.40.3.

Either way, these TeX-created documents seem to present specific challenges for 
PDFBox. Since we need to make these files available for full-text search, we 
would be very happy if their text extraction could be improved. I'm ready to 
help with tests and examples; I am afraid my lack of experience in Java limits 
my direct help in the development of the code.

All the best
Thomas

On 15.05.2010 at 18:26, Andreas Lehmkuehler wrote:

> Hi
> 
> Thomas Fischer wrote:
>> Hi Andreas,
>> yes, I assume it's TeX-related, but all my other files are TeX-created as 
>> well.
>> I don't suppose that the choice of TeX encoding "\usepackage[T1]{fontenc}" 
>> mentioned in [1] really matters, since the encoding usepackage really just 
>> calls a preprocessor to retranslate characters like "ü" to the standard 
>> "\"u" of TeX (I hope I get this right). So if you just remove this line, you 
>> will expose non-ASCII-characters to the TeX engine, and I suppose that they 
>> just will be skipped.
> I guess it is about 15 years since I last used TeX/LaTeX, so
> I can't follow the given workaround, but it is obviously a TeX-related issue.
> 
>> I'll gladly post the three different problems (and a couple of others if 
>> people are willing to work on them)
>> to the pdfbox jira system, but won't go through this process unless somebody 
>> is interested - I don't know the importance of TeX files for the pdfbox 
>> community.
> IMHO we should try to support every PDF, and TeX is, AFAIK, quite popular when it
> comes to writing technical documentation and similar docs.
> 
> So, if you can share your docs, please attach them to the mentioned issue
> PDFBOX-534. It is better to have too many than too few examples. :-)
> 
> Thanks in advance
> Andreas Lehmkühler
> 
>> The problem described in [1] looks like my case 2, and as I mentioned, these 
>> tend to be accessible through Apple's PDF kit, though with the usual quirks 
>> that made me decide to use pdfbox in the first place. But this shows that 
>> the file can be transformed.
>> But I'm not enough of an expert in either Java or the PDF format to really 
>> dig into the pdfbox code, so I can't be of much help there.
>> All the best
>> Thomas Fischer
>> On 12.05.2010 at 09:16, Andreas Lehmkühler wrote:
>>> Hi Thomas,
>>> 
>>> ----- Original message --------
>>> Subject: Illegible decoding in some pdf documents
>>> Sent: Tue, 11 May 2010
>>> From: Thomas Fischer<[email protected]>
>>>> Hello,
>>>> 
>>>> I sent this note last week and didn't receive any response, here is an
>>>> updated version with some additional information.
>>>> To explain the context a little: I tried to extract text from 5091
>>>> mathematical PDF files. While I got some messages like "You do not have
>>>> permission to extract text", "Error: Header doesn't contain versioning" or
>>>> "Error: End-of-File, expected line", the majority of the files were
>>>> transformed without an error message.
>>>> Unfortunately, some of these supposedly correctly transformed files are
>>>> illegible. In those files, usually all characters are somehow decoded, and I
>>>> could distinguish at least 3 kinds of decoding. In those papers all
>>>> characters look like the following examples:
>>>> 
>>>> 1. x57x65x69x65x72x73x74x72x61xffx2dx49x6ex73x74x69x74x75x74 (about 20
>>>> cases)
>>>>    created using e.g.              TeX output 2009.02.18:0900
>>>>            dvipdfm 0.13.2c, Copyright © 1998, by Mark A. Wicks
>>>> 
>>>> 2. a0a2a1a4a3a6a5a8a7a10a9a12a11a14a13a15a9a16a13a15a11 (about 200 cases)
>>>>    created using some version of Ghostscript or pdfTeX
>>>> 
>>>> 3. BYCXD2CPD2CRCXCPD0 BWCTD6CXDACPD8CXDACTD7 (about 600 cases)
>>>>    created using e.g.
>>>>            some version of dvips(k) (5.83 (MiKTeX 1.20b), 5.95a by Radical Eye
>>>>            Software)
>>>>            some version of Acrobat Distiller
>>>> 
>>>> 
>>>> Using Apple's PDF kit, I obtain readable results for the first and second
>>>> cases. In the third case, only characters from Unicode's "Private Plane" are
>>>> shown.
>>>> 
>>>> In some cases, only part of the document is encoded this way, probably
>>>> because the file was put together from different sources:
>>>> 
>>>> Figure 1: Hypothetical Log Quasi-Likelihood
>>>> a0 a1a3a2a5a4a7a6a9a8 a10a12a11a14a13 a15a17a16a19a18
>>>> a20
>>>> a21a17a22a24a23a26a25 a21a5a22a24a23a28a27 a21a5a22 a21a5a22a30a29a31a27
>>>> a21a5a22a30a29a32a25
>>>> section.
>>>> 
>>>> Can anybody tell me what this means, and whether there is a way to improve the
>>>> results? Is there a way to find out whether the transformation yielded any
>>>> readable results?
>>> I'm sorry for the late answer. Without having a look at the documents it's 
>>> only a guess, but I'm sure it is an encoding issue. In your case it seems 
>>> to be a TeX-related issue, probably similar to the issue described in 
>>> PDFBOX-534 [1].
>>> 
>>> BR
>>> Andreas Lehmkühler
>>> 
>>> [1] https://issues.apache.org/jira/browse/PDFBOX-534
>
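
On the question in the original message of how to tell whether an extraction 
yielded readable text: one possible approach is to scan the extracted text for 
runs that look like the three samples quoted there. The following is a minimal 
sketch in Python, not a PDFBox feature. The patterns, the function names and 
the 0.5 threshold are assumptions derived only from those samples and would 
need tuning on real files.

```python
import re

# Heuristic patterns modelled on the three samples quoted in this thread.
# They are assumptions, not anything defined by PDFBox or the PDF format.
HEX_ESCAPES = re.compile(r"(?:x[0-9a-f]{2}){4,}")        # case 1: x57x65x69...
GLYPH_INDICES = re.compile(r"(?:a\d{1,3}){4,}")          # case 2: a0a2a1a4...
LETTER_PAIRS = re.compile(r"\b(?:[A-Z][A-Z0-9]){4,}\b")  # case 3: BYCXD2CP...


def garbled_fraction(text: str) -> float:
    """Return the share of non-whitespace characters inside suspicious runs."""
    total = sum(1 for ch in text if not ch.isspace())
    if total == 0:
        return 0.0
    covered = 0
    for pattern in (HEX_ESCAPES, GLYPH_INDICES, LETTER_PAIRS):
        covered += sum(len(m.group(0)) for m in pattern.finditer(text))
    return min(covered / total, 1.0)


def is_probably_garbled(text: str, threshold: float = 0.5) -> bool:
    """Flag a text when more than `threshold` of it matches the patterns."""
    return garbled_fraction(text) > threshold


case1 = "x57x65x69x65x72x73x74x72x61xffx2dx49x6ex73x74x69x74x75x74"
case2 = "a0a2a1a4a3a6a5a8a7a10a9a12a11a14a13a15a9a16a13a15a11"
case3 = "BYCXD2CPD2CRCXCPD0 BWCTD6CXDACPD8CXDACTD7"
readable = "Figure 1: Hypothetical Log Quasi-Likelihood"

for label, text in [("case 1", case1), ("case 2", case2),
                    ("case 3", case3), ("readable", readable)]:
    print(label, is_probably_garbled(text))
```

The case 3 pattern also matches ordinary words in capitals of even length 
(eight letters or more), so a fraction over the whole document is likely more 
useful than a single match. Partly garbled documents, like the "Figure 1" 
example above, end up with a fraction between 0 and 1.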
