Problem solved, see

https://issues.apache.org/jira/browse/PDFBOX-2048


Tilman



On 28.04.2014 21:17, Tilman Hausherr wrote:
Hi,

I'm afraid we won't be able to research this any deeper without the PDF. Normally, one possibility would be to decompress the PDF and alter the data so that personal stuff is removed, but you said that the problem goes away when decompressing the PDF with a 3rd party product :-(

It is obvious that the PDF is somehow corrupted... you could use an editor like Notepad++ to look at the stream length values and compare them with the actual lengths. (See the PDF spec for details, but it is rather obvious when looking in the editor anyway).

/Length nnnn/......>>stream
.....nnnn bytes of data....
endstream
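
As a rough sketch of that manual check in code form (this is not PDFBox code; the class name StreamLengthCheck and the regular expression are assumptions made for illustration), the following compares each declared /Length with the number of bytes actually found before the next "endstream". It only handles a direct /Length value in a dictionary without nested << >>, skips indirect lengths such as "/Length 12 0 R", and would be fooled by binary data that happens to contain the word "endstream".

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StreamLengthCheck {
    // Direct "/Length nnnn" (not "nnnn g R"), then the rest of the dictionary, ">>", "stream", EOL.
    // The possessive \d++ stops the digits from being split to dodge the lookahead.
    private static final Pattern HEADER = Pattern.compile(
            "/Length\\s+(\\d++)(?!\\s+\\d+\\s+R)[^>]*>>\\s*stream\\r?\\n");

    /** Offset of the stream data, the declared /Length, and the bytes found before "endstream". */
    static final class Finding {
        final int offset;
        final long declared;
        final long actual;

        Finding(int offset, long declared, long actual) {
            this.offset = offset;
            this.declared = declared;
            this.actual = actual;
        }

        boolean consistent() {
            return declared == actual;
        }
    }

    static List<Finding> check(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so string indices equal byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        List<Finding> findings = new ArrayList<>();
        Matcher m = HEADER.matcher(text);
        while (m.find()) {
            long declared = Long.parseLong(m.group(1));
            int start = m.end();
            int end = text.indexOf("endstream", start);
            if (end < 0) {
                break; // no closing keyword at all; nothing left to measure
            }
            long actual = end - start;
            // An EOL before "endstream" is not part of the data; drop it only if there is a surplus.
            if (actual > declared && text.charAt(start + (int) actual - 1) == '\n') {
                actual--;
            }
            if (actual > declared && text.charAt(start + (int) actual - 1) == '\r') {
                actual--;
            }
            findings.add(new Finding(start, declared, actual));
        }
        return findings;
    }

    public static void main(String[] args) throws IOException {
        for (Finding f : check(Files.readAllBytes(Paths.get(args[0])))) {
            System.out.printf("offset %d: /Length %d, found %d bytes%s%n",
                    f.offset, f.declared, f.actual, f.consistent() ? "" : "  <-- MISMATCH");
        }
    }
}
```

Run it with the path of the PDF as the only argument; every line marked MISMATCH is a stream whose /Length does not agree with what is in the file.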

But I think this isn't the only problem in that PDF.

Tilman



On 28.04.2014 20:56, Jonas Karlsson wrote:
Hi Tilman,
Thanks for trying to help!

With both the 1.8.5 and the 2.0.0 snapshots, WriteDecodedDoc and
ExtractText now only give me the error

org.apache.pdfbox.pdfparser.NonSequentialPDFParser validateStreamLength

SEVERE: The end of the stream doesn't point to the correct offset, using workaround to read the stream

I'm not seeing the StreamCorruptedException anymore. However, I'm still
only getting empty text, and WriteDecodedDoc returns a PDF with blank pages.

_jonas




On Mon, Apr 28, 2014 at 2:28 PM, Tilman Hausherr wrote:

Yes, but does WriteDecodedDoc now work correctly, or does it still
produce that LZW error?

About the stream issue: the error status is somewhat misleading; it
should rather be a warning, because there is a "plan B", which is to
disregard the length parameter and read the PDF until "endstream". If
that failed too, there would be another error message, "Error reading
stream using length value". So I wonder if there is another problem.
Sometimes people transfer PDF files in ASCII mode from an FTP server.
Could you try the text extraction feature of the PDFBox app 2.0?

https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.0-SNAPSHOT/

command:

java -jar pdfbox-app-2.0.0-SNAPSHOT.jar ExtractText -nonSeq PDF.pdf
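
For readers wondering what the "plan B" mentioned above amounts to, here is a minimal sketch of the idea. It is not the actual NonSequentialPDFParser code; the class name StreamReadFallback, the method names and the exception text are all made up for illustration. It trusts /Length only if "endstream" really follows at that offset, and otherwise scans forward for the keyword.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class StreamReadFallback {
    private static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.US_ASCII);

    /** Returns the stream data starting at dataStart, preferring the declared length. */
    static byte[] readStream(byte[] pdf, int dataStart, long declaredLength) throws IOException {
        long declaredEnd = dataStart + declaredLength;
        // Plan A: trust /Length if "endstream" (after an optional EOL) sits at that offset.
        if (declaredLength >= 0 && declaredEnd <= pdf.length
                && startsWithMarker(pdf, skipEol(pdf, (int) declaredEnd))) {
            return Arrays.copyOfRange(pdf, dataStart, (int) declaredEnd);
        }
        // Plan B: ignore /Length and scan forward for the next "endstream" keyword.
        int end = indexOfMarker(pdf, dataStart);
        if (end < 0) {
            // Hypothetical message, not the one PDFBox logs.
            throw new IOException("no endstream keyword found after offset " + dataStart);
        }
        // Drop the single EOL that conventionally precedes "endstream".
        if (end > dataStart && pdf[end - 1] == '\n') {
            end--;
        }
        if (end > dataStart && pdf[end - 1] == '\r') {
            end--;
        }
        return Arrays.copyOfRange(pdf, dataStart, end);
    }

    // Skips at most one CR, LF or CRLF starting at pos.
    private static int skipEol(byte[] pdf, int pos) {
        if (pos < pdf.length && pdf[pos] == '\r') {
            pos++;
        }
        if (pos < pdf.length && pdf[pos] == '\n') {
            pos++;
        }
        return pos;
    }

    private static boolean startsWithMarker(byte[] pdf, int pos) {
        if (pos + ENDSTREAM.length > pdf.length) {
            return false;
        }
        for (int i = 0; i < ENDSTREAM.length; i++) {
            if (pdf[pos + i] != ENDSTREAM[i]) {
                return false;
            }
        }
        return true;
    }

    private static int indexOfMarker(byte[] pdf, int from) {
        for (int i = Math.max(from, 0); i + ENDSTREAM.length <= pdf.length; i++) {
            if (startsWithMarker(pdf, i)) {
                return i;
            }
        }
        return -1;
    }
}
```

Note that the fallback only recovers the raw bytes. If those bytes were damaged in transit, the filter that decodes them afterwards can still fail or produce nothing.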


Tilman


On 28.04.2014 18:21, Jonas Karlsson wrote:

Hi Tilman,
I tried the 1.8.5-SNAPSHOT and got the same result as before: no text, and

Apr 28, 2014 12:20:48 PM org.apache.pdfbox.pdfparser.NonSequentialPDFParser validateStreamLength

SEVERE: The end of the stream doesn't point to the correct offset, using workaround to read the stream

_jonas

On Mon, Apr 28, 2014 at 11:04 AM, Tilman Hausherr wrote:
There was a (recently fixed) bug in the LZW decoder. Please try the
current snapshot and tell us what happens:
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox/1.8.5-SNAPSHOT/

Tilman

On 28.04.2014 17:00, Jonas Karlsson wrote:

   java.io.StreamCorruptedException: Error: data is null

    at org.apache.pdfbox.filter.LZWFilter.decode(LZWFilter.java:82)



