Re: TextExtraction only working after uncompressing with pdftk

Jonas Karlsson Tue, 29 Apr 2014 06:07:27 -0700

Great! I will check it out when the new snapshot is available,

thanks!
_jonas



On Tue, Apr 29, 2014 at 2:02 AM, Tilman Hausherr <[email protected]> wrote:

> Problem solved, see
>
> https://issues.apache.org/jira/browse/PDFBOX-2048
>
>
> Tilman
>
>
>
> On 28.04.2014 21:17, Tilman Hausherr wrote:
>
>> Hi,
>>
>> I'm afraid we won't be able to research this any deeper without the PDF.
>> Normally, one possibility would be to decompress the PDF and alter the data
>> so that personal stuff is removed, but you said that the problem goes away
>> when decompressing the PDF with a 3rd party product :-(
>>
>> It is obvious that the PDF is somehow corrupted... you could use an
>> editor like Notepad++ to look at the stream length values and compare them
>> with the actual stream lengths. (See the PDF spec for details, but it is
>> rather obvious when looking in the editor anyway).
>>
>> /Length nnnn/......>>stream
>> .....nnnn bytes of data....
>> endstream
>>
>> But I think this isn't the only problem in that PDF.
>>
>> Tilman
>>
>>
>>
>> On 28.04.2014 20:56, Jonas Karlsson wrote:
>>
>>> Hi Tilman,
>>> Thanks for trying to help!
>>>
>>> With both the 1.8.5 and the 2.0.0 SNAPSHOTs, WriteDecodedDoc and
>>> ExtractText now only give me the error
>>>
>>> org.apache.pdfbox.pdfparser.NonSequentialPDFParser validateStreamLength
>>> SEVERE: The end of the stream doesn't point to the correct offset, using workaround to read the stream
>>>
>>> I'm not seeing the StreamCorruptedException anymore. However, I'm still
>>> only getting empty text, and WriteDecodedDoc returns a PDF with blank
>>> pages.
>>>
>>> _jonas
>>>
>>>
>>>
>>>
>>> On Mon, Apr 28, 2014 at 2:28 PM, Tilman Hausherr <[email protected]> wrote:
>>>
>>>> Yes, but does WriteDecodedDoc now work correctly, or does it still raise
>>>> that LZW error?
>>>>
>>>> About the streams issue: the error status is somewhat misleading; it
>>>> should rather be a warning, because there is a "plan B", which is to
>>>> disregard the length parameter and to read the PDF until "endstream". If
>>>> that one failed too, then there would be a new error message "Error
>>>> reading stream using length value". So I wonder if there is another
>>>> problem. Sometimes people transfer PDF files in ASCII mode from an FTP
>>>> server. Could you try the text decode feature of the pdfbox app 2.0?
>>>>
>>>> https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.0-SNAPSHOT/
>>>>
>>>> command:
>>>>
>>>> java -jar pdfbox-app-2.0.0-SNAPSHOT.jar ExtractText -nonSeq PDF.pdf
>>>>
>>>>
>>>> Tilman
>>>>
>>>>
>>>> On 28.04.2014 18:21, Jonas Karlsson wrote:
>>>>
>>>>> Hi Tilman,
>>>>>
>>>>> I tried the 1.8.5-SNAPSHOT and get the same result as before: no text,
>>>>> and
>>>>>
>>>>> Apr 28, 2014 12:20:48 PM org.apache.pdfbox.pdfparser.NonSequentialPDFParser validateStreamLength
>>>>> SEVERE: The end of the stream doesn't point to the correct offset, using workaround to read the stream
>>>>>
>>>>>
>>>>> _jonas
>>>>>
>>>>> On Mon, Apr 28, 2014 at 11:04 AM, Tilman Hausherr <[email protected]> wrote:
>>>>>>
>>>>>> There was a (recently fixed) bug with the LZW decoder; please try the
>>>>>> current snapshot and tell us what happens:
>>>>>> https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox/1.8.5-SNAPSHOT/
>>>>>>
>>>>>> Tilman
>>>>>>
>>>>>> On 28.04.2014 17:00, Jonas Karlsson wrote:
>>>>>>
>>>>>>> java.io.StreamCorruptedException: Error: data is null
>>>>>>>     at org.apache.pdfbox.filter.LZWFilter.decode(LZWFilter.java:82)
>>>>>>>
>>>>>>>
>>>>>>>
>>
>
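
For anyone who would rather not compare the /Length values by hand in an editor, the manual check described in the thread could be automated roughly as follows. This is only a sketch and does not use PDFBox: the class name StreamLengthCheck and its methods are made up for illustration, it handles only direct lengths in simple stream dictionaries, and it measures the "actual" length the same way the "plan B" does, by reading until "endstream".

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: compares each declared /Length
// with the number of bytes actually found before "endstream".
final class StreamLengthCheck {

    /** One stream: where its data starts, what /Length claims, what was found. */
    static final class Result {
        final int dataOffset;
        final int declared;
        final int actual;

        Result(int dataOffset, int declared, int actual) {
            this.dataOffset = dataOffset;
            this.declared = declared;
            this.actual = actual;
        }

        boolean matches() {
            return declared == actual;
        }
    }

    // "/Length nnnn ... >> stream<EOL>". Indirect lengths ("/Length 12 0 R") and
    // dictionaries with a nested "<< >>" after /Length are skipped for simplicity.
    private static final Pattern HEADER = Pattern.compile(
            "/Length\\s+(\\d+)\\b(?!\\s+\\d+\\s+R)[^>]*>>\\s*stream\\r?\\n");

    static List<Result> check(byte[] pdf) {
        // ISO-8859-1 maps every byte to one char, so string indexes equal byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        List<Result> results = new ArrayList<>();
        Matcher m = HEADER.matcher(text);
        while (m.find()) {
            int start = m.end();
            int end = text.indexOf("endstream", start);
            if (end < 0) {
                break; // no closing keyword, nothing more to measure
            }
            // Heuristic: an end-of-line marker right before "endstream" is not data.
            if (end > start && text.charAt(end - 1) == '\n') {
                end--;
            }
            if (end > start && text.charAt(end - 1) == '\r') {
                end--;
            }
            results.add(new Result(start, Integer.parseInt(m.group(1)), end - start));
        }
        return results;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java StreamLengthCheck.java file.pdf");
            return;
        }
        for (Result r : check(Files.readAllBytes(Paths.get(args[0])))) {
            System.out.printf("data at offset %d: /Length %d, actual %d%s%n",
                    r.dataOffset, r.declared, r.actual,
                    r.matches() ? "" : "  (mismatch)");
        }
    }
}
```

A mismatch reported here would correspond to the "end of the stream doesn't point to the correct offset" message. A stream whose data happens to end in a line break can be off by one or two bytes with this heuristic, so small differences are not proof of corruption.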
