Great! I will check it out when the new snapshot is available, thanks!

_jonas

On Tue, Apr 29, 2014 at 2:02 AM, Tilman Hausherr <[email protected]> wrote:

> Problem solved, see
>
> https://issues.apache.org/jira/browse/PDFBOX-2048
>
> Tilman
>
> On 28.04.2014 at 21:17, Tilman Hausherr wrote:
>
>> Hi,
>>
>> I'm afraid we won't be able to research this any deeper without the
>> PDF. Normally, one possibility would be to decompress the PDF and
>> alter the data so that personal stuff is removed, but you said that
>> the problem goes away when decompressing the PDF with a 3rd party
>> product :-(
>>
>> It is obvious that the PDF is somehow corrupted... you could use an
>> editor like Notepad++ to look at the stream length values and then
>> compare them with the actual lengths. (See the PDF spec for details,
>> but it is rather obvious when looking in the editor anyway.)
>>
>> /Length nnnn/......>>stream
>> .....nnnn bytes of data....
>> endstream
>>
>> But I think this isn't the only problem in that PDF.
>>
>> Tilman
>>
>> On 28.04.2014 at 20:56, Jonas Karlsson wrote:
>>
>>> Hi Tilman,
>>> Thanks for trying to help!
>>>
>>> With both the 1.8.5 and the 2.0.0 SNAPSHOTs, WriteDecodedDoc and
>>> ExtractText now only give me the error
>>>
>>> org.apache.pdfbox.pdfparser.NonSequentialPDFParser validateStreamLength
>>> SEVERE: The end of the stream doesn't point to the correct offset,
>>> using workaround to read the stream
>>>
>>> I'm not seeing the StreamCorruptedException anymore. However, I'm
>>> still only getting empty text, and WriteDecodedDoc returns a PDF
>>> with blank pages.
>>>
>>> _jonas
>>>
>>> On Mon, Apr 28, 2014 at 2:28 PM, Tilman Hausherr <[email protected]> wrote:
>>>
>>>> Yes, but does WriteDecodedDoc now work correctly, or does it still
>>>> produce that LZW error?
>>>>
>>>> About the streams issue: the error status is somewhat misleading,
>>>> it should rather be a warning, because there is a "plan B", which
>>>> is to disregard the length parameter and to read the PDF until
>>>> "endstream". If that one failed too, then there would be a new
>>>> error message "Error reading stream using length value". So I
>>>> wonder if there is another problem. Sometimes people transfer PDF
>>>> files in ASCII mode from an FTP server. Could you try the text
>>>> decode feature of the pdfbox app 2.0?
>>>>
>>>> https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.0-SNAPSHOT/
>>>>
>>>> command:
>>>>
>>>> java -jar pdfbox-app-2.0.0-SNAPSHOT.jar ExtractText -nonSeq PDF.pdf
>>>>
>>>> Tilman
>>>>
>>>> On 28.04.2014 at 18:21, Jonas Karlsson wrote:
>>>>
>>>>> Hi Tilman,
>>>>>
>>>>> I tried the 1.8.5-SNAPSHOT and get the same result as before. No
>>>>> text and
>>>>>
>>>>> Apr 28, 2014 12:20:48 PM org.apache.pdfbox.pdfparser.NonSequentialPDFParser validateStreamLength
>>>>> SEVERE: The end of the stream doesn't point to the correct
>>>>> offset, using workaround to read the stream
>>>>>
>>>>> _jonas
>>>>>
>>>>> On Mon, Apr 28, 2014 at 11:04 AM, Tilman Hausherr <[email protected]> wrote:
>>>>>
>>>>>> There was a (recently fixed) bug with the LZW decoder, please
>>>>>> try the current snapshot and tell us what happens
>>>>>>
>>>>>> https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox/1.8.5-SNAPSHOT/
>>>>>>
>>>>>> Tilman
>>>>>>
>>>>>> On 28.04.2014 at 17:00, Jonas Karlsson wrote:
>>>>>>
>>>>>>> java.io.StreamCorruptedException: Error: data is null
>>>>>>> at org.apache.pdfbox.filter.LZWFilter.decode(LZWFilter.java:82)
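
For anyone following along: the manual check described in the quoted message (compare each stream's /Length value with where "endstream" actually sits) can also be scripted. The following is only a rough sketch under stated assumptions, not how PDFBox itself validates streams. The class name StreamLengthCheck and the method findMismatches are made up for this illustration, it only handles direct /Length values (indirect references such as "/Length 12 0 R" are skipped), and it assumes no ">" character appears in the dictionary between /Length and the closing ">>".

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
class StreamLengthCheck {

    // A direct "/Length nnnn" entry, the rest of the dictionary, then the
    // "stream" keyword and its end-of-line marker. The possessive \d++ plus
    // the negative lookahead skips indirect lengths like "/Length 12 0 R".
    private static final Pattern STREAM_START = Pattern.compile(
            "/Length\\s+(\\d++)(?!\\s+\\d+\\s+R)[^>]*>>\\s*stream(?:\\r\\n|\\n)");

    // Returns one message per stream whose declared length does not end
    // right before an "endstream" keyword.
    static List<String> findMismatches(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so string
        // indexes are the same as byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        List<String> problems = new ArrayList<>();
        Matcher m = STREAM_START.matcher(text);
        while (m.find()) {
            long declared = Long.parseLong(m.group(1));
            int dataStart = m.end();
            if (!endstreamFollows(text, dataStart + declared)) {
                int actualEnd = text.indexOf("endstream", dataStart);
                long distance = actualEnd < 0 ? -1 : actualEnd - dataStart;
                problems.add(String.format(
                        "stream data at byte %d: /Length is %d, but the first "
                        + "\"endstream\" keyword is %d bytes after the data start",
                        dataStart, declared, distance));
            }
        }
        return problems;
    }

    // True if "endstream" appears at pos, allowing one optional
    // end-of-line marker in front of it.
    private static boolean endstreamFollows(String text, long pos) {
        if (pos > text.length()) {
            return false;
        }
        int p = (int) pos;
        if (text.startsWith("\r\n", p)) {
            p += 2;
        } else if (text.startsWith("\n", p) || text.startsWith("\r", p)) {
            p += 1;
        }
        return text.startsWith("endstream", p);
    }

    public static void main(String[] args) throws IOException {
        byte[] pdf = Files.readAllBytes(Paths.get(args[0]));
        for (String problem : findMismatches(pdf)) {
            System.out.println(problem);
        }
    }
}
```

Run it with the path to the PDF as the only argument. Every line it prints is a stream whose /Length does not line up with "endstream", which is the situation the validateStreamLength message complains about. An empty output does not prove the file is intact, since streams with indirect lengths are not checked.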

