Re: TextExtraction only working after uncompressing with pdftk

Tilman Hausherr Mon, 28 Apr 2014 23:03:08 -0700

Problem solved, see

https://issues.apache.org/jira/browse/PDFBOX-2048



Tilman



On 28.04.2014 21:17, Tilman Hausherr wrote:

Hi,
I'm afraid we won't be able to research this any deeper without the PDF. Normally, one possibility would be to decompress the PDF and alter the data so that personal stuff is removed, but you said that the problem goes away when decompressing the PDF with a 3rd party product :-(
It is obvious that the PDF is somehow corrupted... you could use an editor like NOTEPAD++ to look at the stream length values and then compare them with the actual lengths. (See the PDF spec for details, but it is rather obvious when looking in the editor anyway.)
/Length nnnn/......>>stream
.....nnnn bytes of data....
endstream

But I think this isn't the only problem in that PDF.
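The manual check described above (compare each /Length value with the number of bytes actually sitting between "stream" and "endstream") can also be automated. The following is a minimal sketch, not PDFBox code: the class name StreamLengthCheck and the method findMismatches are made up for illustration, it only handles direct /Length values (indirect references such as "/Length 12 0 R" are skipped), and it is a plain text heuristic that can be fooled by binary data that happens to contain the keyword "endstream".

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
class StreamLengthCheck {

    // A direct "/Length nnnn" (not "nnnn g R"), then the end of the
    // dictionary, the "stream" keyword and its end-of-line marker.
    private static final Pattern HEADER = Pattern.compile(
            "/Length\\s+(\\d++)(?!\\s+\\d+\\s+R).*?>>\\s*stream\\r?\\n",
            Pattern.DOTALL);

    /** Returns one message per stream whose /Length disagrees with the data. */
    static List<String> findMismatches(byte[] pdf) {
        // ISO-8859-1 maps every byte to one char, so offsets stay byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        List<String> problems = new ArrayList<>();
        Matcher m = HEADER.matcher(text);
        int from = 0;
        while (m.find(from)) {
            int dataStart = m.end();
            long declared = Long.parseLong(m.group(1));
            int endKeyword = text.indexOf("endstream", dataStart);
            if (endKeyword < 0) {
                problems.add("offset " + dataStart + ": no endstream found");
                break;
            }
            // Bytes up to the keyword, with and without the EOL before it.
            int raw = endKeyword - dataStart;
            int stripped = raw;
            if (stripped > 0 && text.charAt(dataStart + stripped - 1) == '\n') {
                stripped--;
            }
            if (stripped > 0 && text.charAt(dataStart + stripped - 1) == '\r') {
                stripped--;
            }
            // Tolerate the ambiguity of the trailing EOL; flag everything else.
            if (declared < stripped || declared > raw) {
                problems.add("offset " + dataStart + ": /Length says " + declared
                        + ", but endstream follows after " + stripped
                        + ".." + raw + " bytes");
            }
            from = endKeyword + "endstream".length();
        }
        return problems;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java StreamLengthCheck file.pdf");
            return;
        }
        for (String p : findMismatches(Files.readAllBytes(Paths.get(args[0])))) {
            System.out.println(p);
        }
    }
}
```

An empty result only means the direct length values look plausible; it says nothing about streams with indirect lengths or about other kinds of damage.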

Tilman



On 28.04.2014 20:56, Jonas Karlsson wrote:
Hi Tilman,
Thanks for trying to help!

With both the 1.8.5 and the 2.0.0 SNAPSHOTs, WriteDecodedDoc and
ExtractText I now only get the error

org.apache.pdfbox.pdfparser.NonSequentialPDFParser validateStreamLength

SEVERE: The end of the stream doesn't point to the correct offset, using
workaround to read the stream

I'm not seeing the StreamCorruptedException anymore. However, I'm still
only getting empty text, and WriteDecodedDoc returns a PDF with blank pages.

_jonas
On Mon, Apr 28, 2014 at 2:28 PM, Tilman Hausherr <[email protected]> wrote:
Yes, but does WriteDecodedDoc now work correctly, or does it still produce
that LZW error?

About the streams issue: the error status is somewhat misleading. It
should rather be a warning, because there is a "plan B", which is to
disregard the length parameter and to read the PDF until "endstream". If
that one failed too, then there would be a new error message "Error reading
stream using length value". So I wonder if there is another problem.
Sometimes people transfer PDF files in ASCII mode from an FTP server. Could
you try the text extraction feature of the pdfbox app 2.0?

https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.0-SNAPSHOT/

command:

java -jar pdfbox-app-2.0.0-SNAPSHOT.jar ExtractText -nonSeq PDF.pdf
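To illustrate the "plan B" mentioned above, here is a minimal sketch of the idea: trust /Length only if it really lands on "endstream", otherwise ignore it and scan forward for the keyword. This is not the PDFBox implementation; the class StreamReadFallback and its method names are made up for illustration, and the sketch assumes the caller already knows the offset of the first data byte.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration, not the NonSequentialPDFParser code.
class StreamReadFallback {

    private static final byte[] END =
            "endstream".getBytes(StandardCharsets.US_ASCII);

    // True if the bytes of key appear in data at position pos.
    static boolean startsWith(byte[] data, int pos, byte[] key) {
        if (pos < 0 || pos + key.length > data.length) {
            return false;
        }
        for (int i = 0; i < key.length; i++) {
            if (data[pos + i] != key[i]) {
                return false;
            }
        }
        return true;
    }

    // Skips one optional end-of-line marker (CR, LF or CR LF).
    static int skipEol(byte[] data, int pos) {
        if (pos < data.length && data[pos] == '\r') {
            pos++;
        }
        if (pos < data.length && data[pos] == '\n') {
            pos++;
        }
        return pos;
    }

    /** Reads stream data starting at dataStart (assumed to be a valid offset). */
    static byte[] readStream(byte[] data, int dataStart, int declaredLength) {
        long declaredEnd = (long) dataStart + declaredLength;
        // Plan A: the declared length points right in front of "endstream".
        if (declaredLength >= 0 && declaredEnd <= data.length
                && startsWith(data, skipEol(data, (int) declaredEnd), END)) {
            return Arrays.copyOfRange(data, dataStart, (int) declaredEnd);
        }
        // Plan B: disregard the length and read until "endstream".
        for (int pos = dataStart; pos + END.length <= data.length; pos++) {
            if (startsWith(data, pos, END)) {
                int end = pos;
                // Drop the end-of-line marker that precedes the keyword.
                if (end > dataStart && data[end - 1] == '\n') {
                    end--;
                }
                if (end > dataStart && data[end - 1] == '\r') {
                    end--;
                }
                return Arrays.copyOfRange(data, dataStart, end);
            }
        }
        // Unchecked exception to keep the sketch short.
        throw new IllegalStateException(
                "no endstream keyword found after offset " + dataStart);
    }

    public static void main(String[] args) {
        byte[] data = "stream\nhello\nendstream\n"
                .getBytes(StandardCharsets.US_ASCII);
        // The declared length 50 is wrong, so the fallback is used.
        byte[] content = readStream(data, 7, 50);
        System.out.println(new String(content, StandardCharsets.US_ASCII));
    }
}
```

Note that plan B can only recover the bytes that are in the file. If the data itself was altered, for example by a line-ending conversion during an ASCII-mode transfer, the stream is found but decoding can still fail or produce nothing useful.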


Tilman


On 28.04.2014 18:21, Jonas Karlsson wrote:

Hi Tilman,
I tried the 1.8.5-SNAPSHOT and get the same result as before. No text and
Apr 28, 2014 12:20:48 PM org.apache.pdfbox.pdfparser.NonSequentialPDFParser validateStreamLength
SEVERE: The end of the stream doesn't point to the correct offset, using
workaround to read the stream

_jonas
On Mon, Apr 28, 2014 at 11:04 AM, Tilman Hausherr <[email protected]> wrote:
There was a (recently fixed) bug with the LZW decoder, please try the
current snapshot and tell us what happens:
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox/1.8.5-SNAPSHOT/

Tilman

On 28.04.2014 17:00, Jonas Karlsson wrote:

   java.io.StreamCorruptedException: Error: data is null
    at org.apache.pdfbox.filter.LZWFilter.decode(LZWFilter.java:82)
