Re: Re[10]: PDFRenderer, PDDocument memory issue
> On 1 Jul 2015, at 05:15, Alex Sviridov wrote:
>
> Ok. Thank you again. I just don't understand one thing. What is the reason to
> keep so much data if I only need to take page images and, most importantly,
> I DO IT PAGE BY PAGE?
>
> Is there no way not to keep data for previous pages if I need only data for
> page N?

Try profiling PDFBox to see what that data actually is. We don't cache page
resources anymore. It could be cached stream data, or fonts, perhaps.

-- John
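
John's suggestion to profile can start with something cruder than a full profiler. The following is a minimal sketch, not anything from PDFBox: it uses the JDK's MemoryMXBean to report how much the used heap grows across one step. The class name HeapProbe, its two methods, and the 8 MiB stand-in allocation are all hypothetical; in the real application the step would be the page render call. It only shows how much memory stays reachable, so a heap dump or a profiler is still needed to see which objects hold it.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.util.ArrayList;
import java.util.List;

public class HeapProbe {
    private static final MemoryMXBean MEMORY = ManagementFactory.getMemoryMXBean();

    /** Returns the heap currently in use, in bytes, after requesting a GC. */
    static long usedHeapBytes() {
        MEMORY.gc(); // a request only; the JVM may ignore it, so treat results as rough
        return MEMORY.getHeapMemoryUsage().getUsed();
    }

    /** Runs one step and returns how much the used heap changed (can be negative). */
    static long heapGrowth(Runnable step) {
        long before = usedHeapBytes();
        step.run();
        return usedHeapBytes() - before;
    }

    public static void main(String[] args) {
        List<byte[]> retained = new ArrayList<>();
        // Hypothetical stand-in for "render page N": allocates 8 MiB and keeps it reachable.
        long growth = heapGrowth(() -> retained.add(new byte[8 * 1024 * 1024]));
        System.out.printf("heap changed by about %d KiB%n", growth / 1024);
    }
}
```

Calling heapGrowth around each page render and logging the result would show whether memory climbs steadily with every page or jumps on particular pages.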
Re: Re[10]: PDFRenderer, PDDocument memory issue
> Alex Sviridov wrote on 1 July 2015 at 14:15:
>
> Ok. Thank you again. I just don't understand one thing. What is the reason to
> keep so much data if I only need to take page images and, most importantly,
> I DO IT PAGE BY PAGE?

PDFBox doesn't know that you are doing it page by page.

> Is there no way not to keep data for previous pages if I need only data for
> page N?

As I said, we don't have a read-on-demand mechanism yet. It is on our agenda,
but it will take a while: the PDF format isn't easy to work with, so the code
that has to be extended is fairly complex.
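
To make "read on demand, remove after usage" concrete, here is a minimal sketch of the general idea using only the JDK. It is not PDFBox code and says nothing about how PDFBox would implement the feature: OnDemandPageCache, its loader function, and the maxPages limit are hypothetical names chosen for illustration. Data for a page is loaded the first time it is requested, and the least recently used page is dropped once the limit is exceeded.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntFunction;

public class OnDemandPageCache<T> {
    private final IntFunction<T> loader; // hypothetical: reads the data for one page
    private final Map<Integer, T> cache;

    public OnDemandPageCache(IntFunction<T> loader, int maxPages) {
        this.loader = loader;
        // accessOrder=true keeps entries ordered from least to most recently used.
        this.cache = new LinkedHashMap<Integer, T>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, T> eldest) {
                return size() > maxPages; // drop the least recently used page
            }
        };
    }

    /** Returns the data for a page, loading it only if it is not cached. */
    public synchronized T get(int pageIndex) {
        T data = cache.get(pageIndex);
        if (data == null) {
            data = loader.apply(pageIndex);
            cache.put(pageIndex, data);
        }
        return data;
    }

    public synchronized boolean isCached(int pageIndex) {
        return cache.containsKey(pageIndex);
    }

    public static void main(String[] args) {
        OnDemandPageCache<byte[]> pages = new OnDemandPageCache<>(i -> new byte[1024], 1);
        pages.get(0);
        pages.get(1); // exceeds the limit of 1, so page 0 is evicted
        System.out.println(pages.isCached(0)); // false
        System.out.println(pages.isCached(1)); // true
    }
}
```

With a limit of one page, memory use is bounded by a single page's data, at the cost of reloading a page whenever the user goes back to it. That is the same speed-versus-memory trade-off discussed earlier in the thread.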
Re[10]: PDFRenderer, PDDocument memory issue
Ok. Thank you again. I just don't understand one thing. What is the reason to
keep so much data if I only need to take page images and, most importantly,
I DO IT PAGE BY PAGE?

Is there no way not to keep data for previous pages if I need only data for
page N?

Wednesday, 1 July 2015, 14:08 +02:00, from Andreas Lehmkühler:
>
>> Alex Sviridov < ooo_satu...@mail.ru > wrote on 1 July 2015 at 13:59:
>>
>> Ok. Thank you very much for the explanation. Could you say where this
>> scratch file is located on Linux/Windows?
>
> java.io.File.createTempFile is used to create that file. It uses the default
> temp directory. It's "/tmp" on Linux. I'm not sure about Windows, as different
> environment variables (TMP, TEMP, USERPROFILE, ) are used to search for such
> a directory.
>
> You may define your own temp directory using the following parameter when
> starting your application:
>
> -Djava.io.tmpdir=PATH-TO-YOUR-TEMP
>
>> Wednesday, 1 July 2015, 13:54 +02:00, from Andreas Lehmkühler < andr...@lehmi.de >:
>> >
>> >> Alex Sviridov < ooo_satu...@mail.ru > wrote on 1 July 2015 at 13:38:
>> >>
>> >> The file is here https://yadi.sk/i/Y0fTuvHmhbZiE
>> >
>> > Ah, that explains a lot. The PDF is a scanned document; every page holds a
>> > color image, which consumes a lot of memory when processed.
>> >
>> >> I tried with load (fileName,true). The result - now I don't have memory
>> >> problems. However now I have 2 problems:
>> >>
>> >> 1) All the thumbnail images are loaded. However, the speed is VERY SLOW.
>> >> One thumbnail image takes about 4 seconds to load!
>> >
>> > When it comes to huge PDFs, you have to accept one trade-off or the other.
>> > Either you provide enough memory to do everything in memory (fast), or you
>> > use a scratch file to save memory (slow).
>> >
>> > And yes, there is room for improvement in the memory handling (read on
>> > demand, remove after usage) in PDFBox, but that is a future feature.
>> > Patches are welcome.
>> >
>> >> 2) Besides, as you see, thumbnail images are loaded in a separate thread.
>> >> While this thread is running and I try to get the big image for the main
>> >> content using
>> >> BufferedImage bi=pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);
>> >> I get the following exception:
>> >>
>> >> java.io.IOException: java.util.zip.DataFormatException: unknown compression method
>> >>     at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:83)
>> >>     at org.apache.pdfbox.cos.COSStream.attemptDecode(COSStream.java:422)
>> >>     at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:398)
>> >>     at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:335)
>> >>     at org.apache.pdfbox.cos.COSStream.checkUnfilteredBuffer(COSStream.java:265)
>> >>     at org.apache.pdfbox.cos.COSStream.getUnfilteredRandomAccess(COSStream.java:239)
>> >>     at org.apache.pdfbox.pdfparser.BaseParser.<init>(BaseParser.java:146)
>> >>     at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:78)
>> >>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:451)
>> >>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:438)
>> >>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
>> >>     at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:180)
>> >>     at org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:205)
>> >>     at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:136)
>> >>     at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:95)
>> >>
>> >>     at javafx.concurrent.Task$TaskCallable.call(Task.java:1423)
>> >>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> >>     at java.lang.Thread.run(Thread.java:745)
>> >> Caused by: java.util.zip.DataFormatException: unknown compression method
>> >>     at java.util.zip.Inflater.inflateBytes(Native Method)
>> >>     at java.util.zip.Inflater.inflate(Inflater.java:259)
>> >>     at java.util.zip.Inflater.inflate(Inflater.java:280)
>> >>     at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:101)
>> >>     at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:74)
>> >>     ... 20 more
>> >>
>> >> How to solve these problems?
>> >
>> > PDFBox isn't supposed to be thread safe.
>> >
>> >> Wednesday, 1 July 2015, 13:17 +02:00, from Andreas Lehmkühler < andr...@lehmi.de >:
>> >> >
>> >> >> Alex Sviridov < ooo_satu...@mail.ru > wrote on 1 July 2015 at 13:09:
>> >> >>
>> >> >> I decided to show all the code. I also send the pdf file - some file
>> >> >> from internet I use for testing.
>> >> >
>> >> > The attachment didn't make it due to some restrictions to the mailing
>> >> > list. Please post a link to th
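
Since PDFBox isn't supposed to be thread safe, one way to avoid the exception above is to make sure the thumbnail thread and the main-view render never touch the document at the same time. The sketch below shows one possible approach under that assumption, using only the JDK: every render job is submitted to a single worker thread, so jobs run strictly one after another. SerializedRenderer and peakConcurrency are hypothetical names, and the fake jobs stand in for calls such as pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB) from the message above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class SerializedRenderer implements AutoCloseable {
    // One worker thread: submitted jobs never overlap.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    /** Queues a render job; in a real application the job would call the PDF renderer. */
    public <T> Future<T> submit(Callable<T> renderJob) {
        return worker.submit(renderJob);
    }

    @Override
    public void close() {
        worker.shutdown(); // already queued jobs still finish
    }

    /** Demo helper: runs fake render jobs and reports how many ran at the same time. */
    static int peakConcurrency(int jobs) {
        AtomicInteger active = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        try (SerializedRenderer renderer = new SerializedRenderer()) {
            List<Future<Integer>> results = new ArrayList<>();
            for (int i = 0; i < jobs; i++) {
                final int page = i;
                results.add(renderer.submit(() -> {
                    peak.accumulateAndGet(active.incrementAndGet(), Math::max);
                    Thread.sleep(5); // pretend to render the page
                    active.decrementAndGet();
                    return page;
                }));
            }
            for (Future<Integer> result : results) {
                result.get(); // wait until every job has finished
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("peak concurrent renders: " + peakConcurrency(8)); // prints 1
    }
}
```

The cost of this design is that a 300 DPI render requested by the user waits behind any thumbnail jobs already queued. Whether that is acceptable, or whether thumbnails should be cancelled or reordered, depends on the application.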