[jira] [Updated] (PDFBOX-4151) FlateFilter, LZWFilter causes double memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-4151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-4151: Attachment: PDFJS-9581-hugeimage.pdf

Key: PDFBOX-4151
URL: https://issues.apache.org/jira/browse/PDFBOX-4151
Project: PDFBox
Issue Type: Bug
Affects Versions: 2.0.8
Reporter: Marek Pribula
Priority: Major
Labels: Predictor
Attachments: ModifiedFilters.png, OriginalFilters.png, PDFBOX-2554-cmykrasterobjecttypes.pdf, PDFJS-9581-hugeimage.pdf, TEST.pdf, gs-bugzilla690022.pdf, pop-bugzilla93476.pdf, predictor_stream.patch, predictor_stream_rev2.patch

The problem occurred in our production environment while processing a 400 kB file. The file was generated by a scanner at a resolution of 5960 x 8430 pixels with 8 bits per pixel (unfortunately, we have no control over the files we must process). Our analysis showed that the problem is in FlateFilter.decode, where the uncompressed data is written into a ByteArrayOutputStream. Since the final size of the data is unknown to the stream, its internal buffer keeps growing through internal Arrays.copyOf calls. By the end of processing, this leads to memory usage of twice the decompressed size. What we tried, and what helped in our case, was a slight modification of the FlateFilter and LZWFilter decode method implementations.
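As a rough illustration (not PDFBox code), the doubling described above can be reproduced in isolation: once `toByteArray()` is called, the data exists both in the stream's internal buffer and in the returned copy. The class and method names here are made up for the demo.

```java
import java.io.ByteArrayOutputStream;

public class DoubleMemoryDemo {
    // Returns the total number of bytes alive right after toByteArray():
    // the stream's internal buffer plus the fresh copy it returns.
    static int bytesHeldAfterCopy(int size) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int written = 0;
        while (written < size) {
            baos.write(chunk, 0, chunk.length);
            written += chunk.length;
        }
        // toByteArray() returns a defensive copy; the data now exists twice
        byte[] copy = baos.toByteArray();
        return baos.size() + copy.length;
    }

    public static void main(String[] args) {
        int size = 1 << 20; // 1 MiB of "decoded" data
        System.out.println("bytes held: " + bytesHeldAfterCopy(size));
    }
}
```

For the reporter's scan (5960 x 8430 pixels at 8 bpp, roughly 50 MB decoded), this means on the order of 100 MB held at the moment `toByteArray()` returns, before the predictor pass even starts.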
Here is a snippet of the original method body:

{code:java}
@Override
public DecodeResult decode(InputStream encoded, OutputStream decoded,
        COSDictionary parameters, int index) throws IOException
{
    int predictor = -1;
    final COSDictionary decodeParams = getDecodeParams(parameters, index);
    if (decodeParams != null)
    {
        predictor = decodeParams.getInt(COSName.PREDICTOR);
    }
    try
    {
        if (predictor > 1)
        {
            int colors = Math.min(decodeParams.getInt(COSName.COLORS, 1), 32);
            int bitsPerPixel = decodeParams.getInt(COSName.BITS_PER_COMPONENT, 8);
            int columns = decodeParams.getInt(COSName.COLUMNS, 1);
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            decompress(encoded, baos);
            ByteArrayInputStream bais = new ByteArrayInputStream(baos.toByteArray());
            Predictor.decodePredictor(predictor, colors, bitsPerPixel, columns, bais, decoded);
            decoded.flush();
            baos.reset();
            bais.reset();
        }
        else
        {
            decompress(encoded, decoded);
        }
    }
    catch (DataFormatException e)
    {
        // if the stream is corrupt a DataFormatException may occur
        LOG.error("FlateFilter: stop reading corrupt stream due to a DataFormatException");
        // re-throw the exception
        throw new IOException(e);
    }
    return new DecodeResult(parameters);
}
{code}

and here is our implementation:

{code:java}
@Override
public DecodeResult decode(InputStream encoded, OutputStream decoded,
        COSDictionary parameters, int index) throws IOException
{
    final COSDictionary decodeParams = getDecodeParams(parameters, index);
    int predictor = decodeParams.getInt(COSName.PREDICTOR);
    try
    {
        if (predictor > 1)
        {
            File tempFile = null;
            FileOutputStream fos = null;
            FileInputStream fis = null;
            try
            {
                int colors = Math.min(decodeParams.getInt(COSName.COLORS, 1), 32);
                int bitsPerPixel = decodeParams.getInt(COSName.BITS_PER_COMPONENT, 8);
                int columns = decodeParams.getInt(COSName.COLUMNS, 1);
                tempFile = File.createTempFile("tmpPdf", null);
                fos = new FileOutputStream(tempFile);
                decompress(encoded, fos);
                fos.close();
                fis = new FileInputStream(tempFile);
                Predictor.decodePredictor(predictor, colors, bitsPerPixel, columns, fis, decoded);
                decoded.flush();
            }
            finally
            {
                IOUtils.closeQuietly(fos);
                IOUtils.closeQuietly(fis);
                try
                {
                    // try to delete but don't care if it fails
                    tempFile.delete();
                }
                catch (Exception e)
                {
                    LOG.error("Could not delete temp data file", e);
                }
            }
        }
        else
        {
            decompress(encoded, decoded);
        }
    }
    catch (DataFormatException e)
    {
        // if the stream is corrupt a DataFormatException may occur
        LOG.error("FlateFilter: stop reading corrupt stream due to a DataFormatException");
        // re-throw the exception
        throw new IOException(e);
    }
    return new DecodeResult(parameters);
}
{code}
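As a side note, the spool-to-temp-file idea in the patch above can be written more compactly with try-with-resources (Java 7+), which replaces the explicit closeQuietly calls. The helper below is hypothetical, not PDFBox code: plain copy loops stand in for decompress and Predictor.decodePredictor, and only one small buffer is held in memory at a time.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class TempFileSketch {
    // Hypothetical helper mirroring the patch's idea: spool the decoded
    // bytes to a temp file, then stream them back out, so the full decoded
    // data never has to sit in memory.
    static void copyThroughTempFile(InputStream in, OutputStream out) throws IOException {
        Path tmp = Files.createTempFile("tmpPdf", null);
        try {
            byte[] buf = new byte[8192];
            int n;
            try (OutputStream fos = Files.newOutputStream(tmp)) {
                while ((n = in.read(buf)) != -1) {
                    fos.write(buf, 0, n); // stand-in for decompress(encoded, fos)
                }
            }
            try (InputStream fis = Files.newInputStream(tmp)) {
                while ((n = fis.read(buf)) != -1) {
                    out.write(buf, 0, n); // stand-in for Predictor.decodePredictor(...)
                }
            }
            out.flush();
        } finally {
            Files.deleteIfExists(tmp); // best-effort cleanup, as in the patch
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[20000];
        java.io.ByteArrayOutputStream sink = new java.io.ByteArrayOutputStream();
        copyThroughTempFile(new java.io.ByteArrayInputStream(data), sink);
        System.out.println("round-tripped " + sink.size() + " bytes");
    }
}
```

The trade-off is extra disk I/O per predictor stream, which is why a threshold (memory below some size, temp file above it) might be worth considering.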
[jira] [Updated] (PDFBOX-4151) FlateFilter, LZWFilter causes double memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-4151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Itai Shaked updated PDFBOX-4151: Attachment: predictor_stream_rev2.patch
[jira] [Updated] (PDFBOX-4151) FlateFilter, LZWFilter causes double memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-4151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-4151: Attachment: PDFBOX-2554-cmykrasterobjecttypes.pdf
[jira] [Updated] (PDFBOX-4151) FlateFilter, LZWFilter causes double memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-4151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-4151: Attachment: pop-bugzilla93476.pdf
[jira] [Updated] (PDFBOX-4151) FlateFilter, LZWFilter causes double memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-4151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-4151: Attachment: gs-bugzilla690022.pdf
[jira] [Updated] (PDFBOX-4151) FlateFilter, LZWFilter causes double memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-4151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-4151: Attachment: (was: bugzilla886049.pdf)
[jira] [Updated] (PDFBOX-4151) FlateFilter, LZWFilter causes double memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-4151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-4151: Attachment: bugzilla886049.pdf
[jira] [Updated] (PDFBOX-4151) FlateFilter, LZWFilter causes double memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-4151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Itai Shaked updated PDFBOX-4151: Attachment: predictor_stream.patch
[jira] [Updated] (PDFBOX-4151) FlateFilter, LZWFilter causes double memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-4151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marek Pribula updated PDFBOX-4151: -- Attachment: OriginalFilters.png, ModifiedFilters.png
[jira] [Updated] (PDFBOX-4151) FlateFilter, LZWFilter causes double memory usage
[ https://issues.apache.org/jira/browse/PDFBOX-4151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marek Pribula updated PDFBOX-4151: -- Attachment: TEST.pdf