[jira] [Commented] (PDFBOX-4781) PDF files with invalid compressed streams cannot be rendered

2020-03-24 Thread Arnaud Jeansen (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065906#comment-17065906
 ] 

Arnaud Jeansen commented on PDFBOX-4781:


[~tilman] Thanks again for your time.

> PDF files with invalid compressed streams cannot be rendered
> 
>
> Key: PDFBOX-4781
> URL: https://issues.apache.org/jira/browse/PDFBOX-4781
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 2.0.18
> Reporter: Arnaud Jeansen
> Priority: Major
>
> I am using pdfbox 2.0.18 to generate thumbnails of PDF files.
> The code is basically as follows
> {code:java}
> byte[] pdfFile = ...;
> float dpi = 72f;
> try (PDDocument pdfDocument = PDDocument.load(new ByteArrayInputStream(pdfFile))) {
>   PDFRenderer pdfRenderer = new PDFRenderer(pdfDocument);
>   return pdfRenderer.renderImageWithDPI(0, dpi, ImageType.RGB);
> } catch (IOException e) {
>   // Error handling
> }
> {code}
> This works fine except for a few PDF files with an invalid compressed stream.
> Note: These PDF files open fine with a variety of PDF readers and Java 
> libraries. Only pdfbox seems to fail on them.
> For those files, I get an error log "FlateFilter: stop reading corrupt stream 
> due to a DataFormatException" *and* an `IOException` with the following stack trace:
> {noformat}
> Caused by: java.io.IOException: java.util.zip.DataFormatException: invalid distance too far back
>   at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:58)
>   at org.apache.pdfbox.filter.Filter.decode(Filter.java:87)
>   at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:84)
>   at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:175)
>   at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:163)
>   at org.apache.pdfbox.pdmodel.PDPage.getContents(PDPage.java:170)
>   at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:92)
>   at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:499)
>   at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:483)
>   at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:156)
>   at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:269)
>   at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:321)
>   at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:243)
>   at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:229)
>   at com.foocompany.service.PdfImageService.convertFromPdfBinaryToJpegBinary(PdfImageService.java:167)
>   ... 68 common frames omitted
> Caused by: java.util.zip.DataFormatException: invalid distance too far back
>   at java.util.zip.Inflater.inflateBytes(Native Method)
>   at java.util.zip.Inflater.inflate(Inflater.java:259)
>   at java.util.zip.Inflater.inflate(Inflater.java:280)
>   at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:83)
>   at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:50)
>   ... 82 common frames omitted
> {noformat}
> Looking further into `org.apache.pdfbox.filter.FlateFilter`:
> * The underlying `DataFormatException` (i.e. broken content that cannot be 
> decompressed when reading the stream) is forwarded up *only* if nothing could 
> be read from this stream (see `FlateFilter#decompress`).
> * The `DataFormatException` is wrapped unconditionally into an `IOException` 
> (see `FlateFilter#decode`).
> As a hack, swallowing `DataFormatException` in `FlateFilter#decode` makes 
> things work: I get an error log, but a thumbnail is correctly generated.
> I am not sure how to proceed from here. I am willing to write a patch, but I 
> am not sure how to address this issue.
> I can also privately provide a PDF file that exhibits the problem.
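
The "keep what could be read, drop the rest" behaviour discussed in the description can be illustrated with the JDK alone. The following is a minimal sketch and not PDFBox's actual code: `LenientInflate`, `inflateLeniently` and `deflate` are hypothetical names, and the only assumption is that the stream is plain zlib data as handled by `java.util.zip.Inflater`. It returns whatever was decoded before the `DataFormatException` instead of propagating the error.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper for illustration only; it is not part of PDFBox.
class LenientInflate {

    // Inflates a zlib stream and returns the bytes decoded so far, even when
    // the input turns out to be corrupt or truncated.
    static byte[] inflateLeniently(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[2048];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0) {
                    break; // no progress possible: input exhausted or dictionary needed
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            // Swallowed on purpose: log it and keep the partial output.
            System.err.println("stop reading corrupt stream: " + e.getMessage());
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Compresses data with the default zlib settings; used to build test input.
    static byte[] deflate(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[2048];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }
}
```

The trade-off is that a caller can no longer tell a complete stream from a damaged one unless the error is reported some other way, so a helper like this is probably only suitable for best-effort uses such as thumbnails.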



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Commented] (PDFBOX-4781) PDF files with invalid compressed streams cannot be rendered

2020-03-24 Thread Arnaud Jeansen (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065499#comment-17065499
 ] 

Arnaud Jeansen commented on PDFBOX-4781:


[~tilman] Oh yes, it is indeed incredibly broken. We see several of those *every 
day* on our platform, all of them invoices from Orange.
Based on their metadata, they seem to be doing very funky stuff with iText to 
generate them.

It happens for some of their invoices, not all.

Anyway, thanks for having a look and confirming that "hacking" it on our side 
is the best option.




[jira] [Commented] (PDFBOX-4781) PDF files with invalid compressed streams cannot be rendered

2020-03-24 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065489#comment-17065489
 ] 

Tilman Hausherr commented on PDFBOX-4781:
-

Wow, that PDF is really broken, here are the errors from PDF.js:
{noformat}
Warning: Indexing all PDF objects
Warning: Invalid stream: "FormatError: Bad FCHECK in flate stream: 120, 194"
Warning: Invalid stream: "FormatError: Bad FCHECK in flate stream: 72, 195"
Warning: Native JPEG decoding failed -- trying to recover: Error during JPEG 
image loading
Warning: Unable to decode image: JpegError: JPEG error: SOI not found
...
{noformat}
I suspect that this file was opened with an ordinary editor, modified, and then 
saved. Object 9 declares a length of 37, but its actual length is 53.
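
For context on the "Bad FCHECK" warnings: RFC 1950 requires that the first two bytes of a zlib stream, CMF and FLG, read as a 16-bit big-endian number, be a multiple of 31. The sketch below assumes the two numbers in each PDF.js warning are those two bytes; `ZlibHeaderCheck` is a hypothetical name used only for illustration.

```java
// Hypothetical helper for illustration only.
class ZlibHeaderCheck {

    // Remainder of the RFC 1950 header check; 0 means the header is valid.
    static int remainder(int cmf, int flg) {
        return ((cmf & 0xFF) * 256 + (flg & 0xFF)) % 31;
    }

    static boolean hasValidFcheck(int cmf, int flg) {
        return remainder(cmf, flg) == 0;
    }

    public static void main(String[] args) {
        // 120, 156 (0x78 0x9C) is the usual header written by zlib at default settings.
        System.out.println(hasValidFcheck(120, 156)); // true
        // The two pairs reported by PDF.js for this file.
        System.out.println(hasValidFcheck(120, 194)); // false, remainder 7
        System.out.println(hasValidFcheck(72, 195));  // false, remainder 27
    }
}
```

Both reported pairs fail the check, which is consistent with the stream bytes having been altered after compression.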

There are a lot of other errors, e.g. when opening the images. This is a 
telecom invoice. But PDF.js and Chrome (I didn't try Adobe) do not show the 
company logo (have you ever seen an invoice without company logo?). The third 
page is also not shown, although the text indicates there is one.

Changing the Flate filter code so that it returns an empty result would mean 
that incorrect streams would not be detected by preflight, our PDF/A-1b checker.

I'd prefer that you "hack" PDFBox on your own for that application 
(thumbnails), or refuse to create thumbnails for a broken PDF, i.e. create an 
"X" instead, maybe with a text such as "thumbnail could not be created, the PDF 
may be corrupt or incomplete". The hack would be OK as long as you don't use 
the jar for anything else.
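
The "X" placeholder idea could look roughly like the sketch below. It uses only `java.awt` from the JDK; `PlaceholderThumbnail` and `create` are hypothetical names, the colours and stroke width are arbitrary choices, and the explanatory text is left out so that the example does not depend on installed fonts.

```java
import java.awt.BasicStroke;
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

// Hypothetical helper for illustration only; not part of PDFBox.
class PlaceholderThumbnail {

    // Returns a white image with a red "X" drawn from corner to corner.
    static BufferedImage create(int width, int height) {
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = image.createGraphics();
        try {
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, width, height);
            g.setColor(Color.RED);
            g.setStroke(new BasicStroke(3f));
            g.drawLine(0, 0, width - 1, height - 1); // top-left to bottom-right
            g.drawLine(0, height - 1, width - 1, 0); // bottom-left to top-right
        } finally {
            g.dispose();
        }
        return image;
    }
}
```

In the reporter's code, a call such as `PlaceholderThumbnail.create(101, 101)` could be returned from the `catch (IOException e)` branch instead of a rendered page.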


[jira] [Commented] (PDFBOX-4781) PDF files with invalid compressed streams cannot be rendered

2020-03-23 Thread Arnaud Jeansen (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064673#comment-17064673
 ] 

Arnaud Jeansen commented on PDFBOX-4781:


[~tilman] Thanks, I sent it to you privately.
As far as I remember, I also tested loading it from a byte array directly and 
got the same exception, since that path also goes through the FlateFilter.






[jira] [Commented] (PDFBOX-4781) PDF files with invalid compressed streams cannot be rendered

2020-03-21 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063973#comment-17063973
 ] 

Tilman Hausherr commented on PDFBOX-4781:
-

You can send me the PDF to tilman at snafu dot de. Btw you can open byte arrays 
directly.

And the current version is 2.0.19.




[jira] [Commented] (PDFBOX-4781) PDF files with invalid compressed streams cannot be rendered

2020-03-21 Thread Andreas Lehmkühler (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063914#comment-17063914
 ] 

Andreas Lehmkühler commented on PDFBOX-4781:


Please attach a sample PDF.
