Problem solved, see
https://issues.apache.org/jira/browse/PDFBOX-2048
Tilman
On 28.04.2014 21:17, Tilman Hausherr wrote:
Hi,
I'm afraid we won't be able to research this deeper without the PDF. Normally,
one possibility would be to decompress the PDF and alter the data so
that personal stuff is removed, but you said that the problem goes
away when decompressing the PDF with a 3rd party product :-(
It is obvious that the PDF is somehow corrupted... you could use an
editor like Notepad++ to look at the stream length values and then
compare them with the actual lengths. (See the PDF spec for details,
but it is rather obvious when looking in the editor anyway).
/Length nnnn/......>>stream
.....nnnn bytes of data....
endstream
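The manual check described above can also be scripted. The following is only a rough sketch, not PDFBox code: the class name `StreamLengthCheck` and the method `findMismatches` are made up for illustration. It assumes that `/Length` is written as a direct number (indirect references such as `/Length 12 0 R` are skipped) and that the stream data does not itself contain the text "endstream".

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StreamLengthCheck {

    // A direct "/Length nnnn" entry; the lookaheads reject indirect references ("/Length 12 0 R").
    private static final Pattern DIRECT_LENGTH =
            Pattern.compile("/Length\\s+(\\d{1,9}+)(?!\\d)(?!\\s+\\d+\\s+R)");

    /** Returns one message per stream whose declared /Length disagrees with the data found. */
    public static List<String> findMismatches(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so string indexes equal byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        List<String> problems = new ArrayList<>();
        Matcher m = DIRECT_LENGTH.matcher(text);
        int pos = 0;
        while (m.find(pos)) {
            int declared = Integer.parseInt(m.group(1));
            int keyword = text.indexOf("stream", m.end());
            if (keyword < 0) {
                break;
            }
            if (keyword >= 3 && text.startsWith("end", keyword - 3)) {
                // We hit an "endstream", so this /Length had no stream of its own.
                pos = keyword + 6;
                continue;
            }
            // The data starts after the end-of-line marker that follows "stream".
            int dataStart = keyword + 6;
            if (text.startsWith("\r\n", dataStart)) {
                dataStart += 2;
            } else if (text.startsWith("\n", dataStart)) {
                dataStart += 1;
            }
            int end = text.indexOf("endstream", dataStart);
            if (end < 0) {
                problems.add("offset " + keyword + ": no endstream found");
                break;
            }
            int gap = end - dataStart;
            // An end-of-line marker just before "endstream" is not counted in /Length.
            int eol = 0;
            if (gap >= 2 && text.startsWith("\r\n", end - 2)) {
                eol = 2;
            } else if (gap >= 1 && (text.charAt(end - 1) == '\n' || text.charAt(end - 1) == '\r')) {
                eol = 1;
            }
            if (declared > gap || declared < gap - eol) {
                problems.add("offset " + keyword + ": /Length says " + declared
                        + " but endstream is " + gap + " bytes after the data start");
            }
            pos = end + 9;
        }
        return problems;
    }

    public static void main(String[] args) throws IOException {
        for (String line : findMismatches(Files.readAllBytes(Paths.get(args[0])))) {
            System.out.println(line);
        }
    }
}
```

Run it as `java StreamLengthCheck PDF.pdf`; every printed line is a stream worth looking at in the editor. An empty result does not prove that the file is fine, it only means that the direct length values are consistent.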
But I think this isn't the only problem in that PDF.
Tilman
On 28.04.2014 20:56, Jonas Karlsson wrote:
Hi Tilman,
Thanks for trying to help!
With both the 1.8.5 and the 2.0.0 SNAPSHOTs, WriteDecodedDoc and
ExtractText now only give me the error
org.apache.pdfbox.pdfparser.NonSequentialPDFParser validateStreamLength
SEVERE: The end of the stream doesn't point to the correct offset, using workaround to read the stream
I'm not seeing the StreamCorruptedException anymore. However, I'm still
only getting empty text, and WriteDecodedDoc returns a PDF with blank
pages.
_jonas
On Mon, Apr 28, 2014 at 2:28 PM, Tilman Hausherr <[email protected]> wrote:
Yes, but does WriteDecodedDoc now work correctly, or does it still
produce that LZW error?
About the streams issue: the error status is somewhat misleading, it
should rather be a warning, because there is a "plan B", which is to
disregard the length parameter and to read the PDF until "endstream".
If that one failed too, then there would be a new error message "Error
reading stream using length value". So I wonder if there is another
problem.
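To illustrate the "plan B" idea, here is a minimal sketch. It is not the actual NonSequentialPDFParser code; the class `StreamFallback` and the method `readStream` are hypothetical names, and it assumes the caller already knows the offset where the stream data begins. The declared length is trusted only if "endstream" really follows at that offset; otherwise the data is read up to the next "endstream".

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class StreamFallback {

    private static final byte[] END = "endstream".getBytes(StandardCharsets.US_ASCII);

    /** Reads stream data starting at dataStart, using the declared length only if it checks out. */
    public static byte[] readStream(byte[] pdf, int dataStart, int declaredLength) {
        long expectedEnd = (long) dataStart + declaredLength;
        if (declaredLength >= 0 && expectedEnd <= pdf.length
                && endstreamFollows(pdf, (int) expectedEnd)) {
            // Plan A: the length value points at "endstream", so use it.
            return Arrays.copyOfRange(pdf, dataStart, (int) expectedEnd);
        }
        // Plan B: ignore the length value and search for the "endstream" keyword.
        int end = indexOf(pdf, END, dataStart);
        if (end < 0) {
            throw new IllegalStateException("no endstream found after offset " + dataStart);
        }
        // Drop the end-of-line marker that separates the data from "endstream".
        if (end > dataStart && pdf[end - 1] == '\n') {
            end--;
        }
        if (end > dataStart && pdf[end - 1] == '\r') {
            end--;
        }
        return Arrays.copyOfRange(pdf, dataStart, end);
    }

    // True if "endstream" starts at pos, optionally preceded by CR, LF or CR LF.
    private static boolean endstreamFollows(byte[] pdf, int pos) {
        if (pos < pdf.length && pdf[pos] == '\r') {
            pos++;
        }
        if (pos < pdf.length && pdf[pos] == '\n') {
            pos++;
        }
        return matchesAt(pdf, END, pos);
    }

    private static boolean matchesAt(byte[] data, byte[] pattern, int pos) {
        if (pos < 0 || pos + pattern.length > data.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (data[pos + i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    private static int indexOf(byte[] data, byte[] pattern, int from) {
        for (int i = Math.max(from, 0); i + pattern.length <= data.length; i++) {
            if (matchesAt(data, pattern, i)) {
                return i;
            }
        }
        return -1;
    }
}
```

The weak point of such a fallback is binary data that happens to contain the bytes "endstream"; in that case the result is cut short, which is one reason why a correct length value is still preferable.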
Sometimes people transfer PDF files in ASCII mode from an FTP server.
Could you try the text decode feature of the pdfbox app 2.0?
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.0-SNAPSHOT/
command:
java -jar pdfbox-app-2.0.0-SNAPSHOT.jar ExtractText -nonSeq PDF.pdf
Tilman
On 28.04.2014 18:21, Jonas Karlsson wrote:
Hi Tilman,
I tried the 1.8.5-SNAPSHOT and get the same result as before. No text and
Apr 28, 2014 12:20:48 PM org.apache.pdfbox.pdfparser.NonSequentialPDFParser validateStreamLength
SEVERE: The end of the stream doesn't point to the correct offset, using workaround to read the stream
_jonas
On Mon, Apr 28, 2014 at 11:04 AM, Tilman Hausherr <[email protected]> wrote:
There was a (recently fixed) bug with the LZW decoder, please try the
current snapshot and tell us what happens
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox/1.8.5-SNAPSHOT/
Tilman
On 28.04.2014 17:00, Jonas Karlsson wrote:
java.io.StreamCorruptedException: Error: data is null
at org.apache.pdfbox.filter.LZWFilter.decode(LZWFilter.java:82)