Re: TextExtraction only working after uncompressing with pdftk

Tilman Hausherr Mon, 28 Apr 2014 11:29:41 -0700

Yes, but does WriteDecodedDoc now work correctly, or does it still produce that LZW error?

About the streams issue: the error status is somewhat misleading; it should rather be a warning, because there is a "plan B", which is to disregard the length parameter and to read the PDF until "endstream". If that one failed too, there would be a new error message, "Error reading stream using length value". So I wonder if there is another problem. Sometimes people transfer PDF files in ASCII mode from an FTP server. Could you try the text extraction feature of the pdfbox app 2.0?


https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.0-SNAPSHOT/

command:

java -jar pdfbox-app-2.0.0-SNAPSHOT.jar ExtractText -nonSeq PDF.pdf
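
To illustrate the "plan B" mentioned above, here is a rough sketch of the idea: trust the declared length only if "endstream" really follows it, otherwise scan forward for the keyword. This is not the actual PDFBox code; the class and method names (StreamFallbackSketch, readStream) are made up for the example, and it works on an in-memory byte array for simplicity.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper (not part of PDFBox) showing the length-then-scan fallback. */
final class StreamFallbackSketch {

    private static final byte[] ENDSTREAM =
            "endstream".getBytes(StandardCharsets.ISO_8859_1);

    /** True if needle occurs in data exactly at position pos. */
    static boolean matchesAt(byte[] data, byte[] needle, int pos) {
        if (pos < 0 || pos > data.length - needle.length) {
            return false;
        }
        for (int j = 0; j < needle.length; j++) {
            if (data[pos + j] != needle[j]) {
                return false;
            }
        }
        return true;
    }

    /** Index of the first occurrence of needle at or after from, or -1. */
    static int indexOf(byte[] data, byte[] needle, int from) {
        for (int i = Math.max(from, 0); i <= data.length - needle.length; i++) {
            if (matchesAt(data, needle, i)) {
                return i;
            }
        }
        return -1;
    }

    /** True if "endstream" follows pos, allowing for end-of-line bytes in between. */
    static boolean endstreamFollows(byte[] data, int pos) {
        if (pos < 0 || pos > data.length) {
            return false;
        }
        int p = pos;
        while (p < data.length && (data[p] == '\r' || data[p] == '\n')) {
            p++;
        }
        return matchesAt(data, ENDSTREAM, p);
    }

    /**
     * Returns the stream bytes beginning at start. The declared length is used
     * when it points at "endstream"; otherwise it is ignored and the data is
     * read up to the next "endstream" keyword.
     */
    static byte[] readStream(byte[] data, int start, int declaredLength) {
        int end = start + declaredLength;
        if (declaredLength >= 0 && end >= start && endstreamFollows(data, end)) {
            return Arrays.copyOfRange(data, start, end);
        }
        // Plan B: disregard the length and search for the keyword.
        // Caveat: binary data that happens to contain "endstream" would fool this.
        int marker = indexOf(data, ENDSTREAM, start);
        if (marker < 0) {
            throw new IllegalStateException("no endstream keyword found");
        }
        int stop = marker;
        // Drop the single end-of-line (LF or CR LF) that precedes the keyword.
        if (stop > start && data[stop - 1] == '\n') {
            stop--;
        }
        if (stop > start && data[stop - 1] == '\r') {
            stop--;
        }
        return Arrays.copyOfRange(data, start, stop);
    }
}
```

With a wrong length, readStream still returns the full stream content as long as the "endstream" keyword is intact. A file damaged in transfer (for example by line-ending conversion) can still yield bytes that fail to decode afterwards.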


Tilman


On 28.04.2014 18:21, Jonas Karlsson wrote:

Hi Tilman,

I tried the 1.8.5-SNAPSHOT and got the same result as before: no text, and this log output:

Apr 28, 2014 12:20:48 PM org.apache.pdfbox.pdfparser.NonSequentialPDFParser validateStreamLength

SEVERE: The end of the stream doesn't point to the correct offset, using workaround to read the stream

_jonas

On Mon, Apr 28, 2014 at 11:04 AM, Tilman Hausherr <[email protected]> wrote:

There was a (recently fixed) bug with the LZW decoder. Please try the
current snapshot and tell us what happens:
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox/1.8.5-SNAPSHOT/

Tilman

On 28.04.2014 17:00, Jonas Karlsson wrote:

  java.io.StreamCorruptedException: Error: data is null

   at org.apache.pdfbox.filter.LZWFilter.decode(LZWFilter.java:82)
