Re: PDFRenderer, PDDocument memory issue
On 1 Jul 2015, at 23:29, Andreas Lehmkühler andr...@lehmi.de wrote:

> John Hewson j...@jahewson.com wrote on 2 July 2015 at 06:10:
>
>> On 1 Jul 2015, at 07:52, Tilman Hausherr thaush...@t-online.de wrote:
>>
>>> On 1 July 2015 at 10:16, Alex Sviridov wrote:
>>>
>>>> In my application I have real-time memory graphs, and they show that memory fills up very quickly. When there is no more free memory, getPageThumbImage hangs: no exception, nothing, but the code stops. When I set pdfDocument=null, pdfRenderer=null I get about 400 MB of free memory back. How can I solve this problem?
>>>
>>> If you're building from source, try this: in PDImageXObject.java, remove the line cachedImage = image;. This will consume less space if you have large PDFs with many images.
>>>
>>> Tilman
>>
>> We don't retain XObjects across pages (anymore), so that shouldn't be the cause of his gradual memory increase?
>
> IMHO, it's quite simple to explain. During the initial parse all streams are read and all the data is stored in COSStream (see COSParser#parseCOSStream). That isn't new behaviour, and I'm working on a better solution (it's my last TODO in PDFBOX-2301).
>
> BR
> Andreas

So it's cached data in COSStream? That wouldn't be affected by "cachedImage = image;", but it would certainly explain the increasing heap usage.

Glad to hear that you have an improvement underway!

-- John

To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org
Re: PDFRenderer, PDDocument memory issue
John Hewson j...@jahewson.com wrote on 2 July 2015 at 06:10:

> On 1 Jul 2015, at 07:52, Tilman Hausherr thaush...@t-online.de wrote:
>
>> On 1 July 2015 at 10:16, Alex Sviridov wrote:
>>
>>> In my application I have real-time memory graphs, and they show that memory fills up very quickly. When there is no more free memory, getPageThumbImage hangs: no exception, nothing, but the code stops. When I set pdfDocument=null, pdfRenderer=null I get about 400 MB of free memory back. How can I solve this problem?
>>
>> If you're building from source, try this: in PDImageXObject.java, remove the line cachedImage = image;. This will consume less space if you have large PDFs with many images.
>>
>> Tilman
>
> We don't retain XObjects across pages (anymore), so that shouldn't be the cause of his gradual memory increase?

IMHO, it's quite simple to explain. During the initial parse all streams are read and all the data is stored in COSStream (see COSParser#parseCOSStream). That isn't new behaviour, and I'm working on a better solution (it's my last TODO in PDFBOX-2301).

BR
Andreas
Re: Re[10]: PDFRenderer, PDDocument memory issue
On 1 Jul 2015, at 05:15, Alex Sviridov ooo_satu...@mail.ru wrote:

> OK. Thank you again. I just don't understand one thing. What is the reason for keeping so much data if I only need to take page images and, most importantly, I DO IT PAGE BY PAGE? Is there no way to avoid keeping data for previous pages if I only need the data for page N?

Try profiling PDFBox to see what that data actually is. We don't cache page resources anymore. It could be cached stream data, or fonts, perhaps.

-- John
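As a first step before attaching a full profiler, the heap can be sampled around each rendering call. The following is only a minimal sketch using the standard java.lang.management API; the class name HeapLogger and the 16 MiB stand-in allocation are hypothetical and not part of PDFBox. It shows how much the heap grows, not what the retained data is; for that, a heap dump or a profiler is still needed.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

/** Hypothetical helper (not part of PDFBox) that reports JVM heap usage. */
public class HeapLogger {

    /** Returns the number of heap bytes currently in use. */
    public static long usedHeapBytes() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed();
    }

    /** Formats a byte count as whole mebibytes. */
    public static String toMiB(long bytes) {
        return (bytes / (1024 * 1024)) + " MiB";
    }

    public static void main(String[] args) {
        long before = usedHeapBytes();
        // Stand-in for the real work, e.g. one renderImageWithDPI call per page.
        byte[] block = new byte[16 * 1024 * 1024];
        long after = usedHeapBytes();
        System.out.println("before: " + toMiB(before)
                + ", after: " + toMiB(after)
                + ", stand-in allocation: " + toMiB(block.length));
    }
}
```

Logging usedHeapBytes() once per page in the thumbnail loop would show whether the growth is steady per page or jumps on particular pages.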
Re: Re[8]: PDFRenderer, PDDocument memory issue
Alex Sviridov ooo_satu...@mail.ru wrote on 1 July 2015 at 13:59:

> OK. Thank you very much for the explanation. Could you say where this scratch file is located on Linux/Windows?

java.io.File.createTempFile is used to create that file. It uses the default temp directory. That is /tmp on Linux. I'm not sure about Windows, as different environment variables (such as TMP, TEMP and USERPROFILE) are used to search for such a directory.

You may define your own temp directory by passing the following parameter when starting your application:

    -Djava.io.tmpdir=PATH-TO-YOUR-TEMP
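To see which directory a given JVM will actually use, the java.io.tmpdir property can be printed and a temp file created the same way the reply describes. This is a minimal sketch using only the standard library; the class name TempDirCheck and the "scratch"/".tmp" file name parts are illustrative assumptions, not the names PDFBox uses.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Illustrative check of where java.io.File.createTempFile puts its files. */
public class TempDirCheck {

    /** Returns the directory the JVM uses for temporary files. */
    public static String tempDir() {
        return System.getProperty("java.io.tmpdir");
    }

    /** Creates a temp file in the default temp directory; the name parts are made up. */
    public static File createScratchLikeFile() {
        try {
            File f = File.createTempFile("scratch", ".tmp");
            f.deleteOnExit(); // clean up when the JVM exits
            return f;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("java.io.tmpdir = " + tempDir());
        System.out.println("temp file created at "
                + createScratchLikeFile().getAbsolutePath());
    }
}
```

Running it once normally and once with -Djava.io.tmpdir=PATH-TO-YOUR-TEMP shows the effect of the parameter.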
Re: Re[6]: PDFRenderer, PDDocument memory issue
Alex Sviridov ooo_satu...@mail.ru wrote on 1 July 2015 at 13:38:

> The file is here: https://yadi.sk/i/Y0fTuvHmhbZiE

Ah, that explains a lot. The PDF is a scanned document; every page holds a color image, which consumes a lot of memory when processed.

> I tried load(fileName, true). The result: now I don't have memory problems. However, now I have two problems:
>
> 1) All the thumbnail images are loaded. However, the speed is VERY SLOW. One thumbnail image takes about 4 seconds to load!

When it comes to huge PDFs, you have to pick your poison. Either you provide enough memory to do everything in memory (fast), or you use a scratch file to save memory (slow). And yes, there is room for improvement in PDFBox's memory handling (read on demand, remove after usage), but that is a future feature. Patches are welcome.

> 2) Besides, as you can see, the thumbnail images are loaded in a separate thread. While this thread is running, I try to get a big image for the main content using
>
>     BufferedImage bi=pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);
>
> and I get the following exception:
>
>     java.io.IOException: java.util.zip.DataFormatException: unknown compression method
>         at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:83)
>         at org.apache.pdfbox.cos.COSStream.attemptDecode(COSStream.java:422)
>         at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:398)
>         at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:335)
>         at org.apache.pdfbox.cos.COSStream.checkUnfilteredBuffer(COSStream.java:265)
>         at org.apache.pdfbox.cos.COSStream.getUnfilteredRandomAccess(COSStream.java:239)
>         at org.apache.pdfbox.pdfparser.BaseParser.<init>(BaseParser.java:146)
>         at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:78)
>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:451)
>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:438)
>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
>         at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:180)
>         at org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:205)
>         at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:136)
>         at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:95)
>         at javafx.concurrent.Task$TaskCallable.call(Task.java:1423)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.lang.Thread.run(Thread.java:745)
>     Caused by: java.util.zip.DataFormatException: unknown compression method
>         at java.util.zip.Inflater.inflateBytes(Native Method)
>         at java.util.zip.Inflater.inflate(Inflater.java:259)
>         at java.util.zip.Inflater.inflate(Inflater.java:280)
>         at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:101)
>         at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:74)
>         ... 20 more
>
> How can I solve these problems?

PDFBox isn't supposed to be thread safe.
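Since PDFBox is not meant to be used from several threads at once, one way to avoid the concurrent-rendering exception is to send every rendering request, thumbnails and full-size pages alike, through a single worker thread. The sketch below is an assumption about how an application could do this with the standard java.util.concurrent classes; SerializedRenderer is a hypothetical wrapper, and the lambdas stand in for the real pdfRenderer.renderImageWithDPI calls.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical wrapper: funnels every rendering job through one worker thread. */
public class SerializedRenderer implements AutoCloseable {

    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    /** Queues a job; jobs run one at a time, in submission order. */
    public <T> Future<T> submit(Callable<T> job) {
        return worker.submit(job);
    }

    /** Stops accepting new jobs; already queued jobs still finish. */
    @Override
    public void close() {
        worker.shutdown();
    }

    public static void main(String[] args) throws Exception {
        try (SerializedRenderer renderer = new SerializedRenderer()) {
            // Stand-ins for renderImageWithDPI(page, 12, ...) and (page, 300, ...).
            Future<String> thumb = renderer.submit(() -> "thumbnail of page 0");
            Future<String> full = renderer.submit(() -> "full-size page 0");
            System.out.println(thumb.get());
            System.out.println(full.get());
        }
    }
}
```

The trade-off is that a full-size render has to wait behind any thumbnails already queued. In a JavaFX application the results would still need to be handed to the UI thread, for example with Platform.runLater as in the posted code.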
Re: Re[10]: PDFRenderer, PDDocument memory issue
Alex Sviridov ooo_satu...@mail.ru wrote on 1 July 2015 at 14:15:

> OK. Thank you again. I just don't understand one thing. What is the reason for keeping so much data if I only need to take page images and, most importantly, I DO IT PAGE BY PAGE?

PDFBox doesn't know that you are doing it page by page.

> Is there no way to avoid keeping data for previous pages if I only need the data for page N?

As I said, we don't have a read-on-demand mechanism yet. It is in our focus, but it will take a while, as the PDF format isn't that easy to work with and the code to be extended is therefore rather complex.
Re: PDFRenderer, PDDocument memory issue
Alex Sviridov ooo_satu...@mail.ru wrote on 1 July 2015 at 10:16:

> I want to display all page thumbnails. However, I ran into a memory problem with PDFRenderer or PDDocument; I don't know which one. I have the following code:
>
>     private PDDocument pdfDocument;
>     private PDFRenderer pdfRenderer;
>
>     public WritableImage getPageThumbImage(int page){
>         WritableImage result=null;
>         try {
>             BufferedImage bi=pdfRenderer.renderImageWithDPI(page, 12, ImageType.RGB);
>             result=SwingFXUtils.toFXImage(bi, null);
>         } catch (IOException ex) {
>         }
>         return result;
>     }
>
> I run getPageThumbImage in a for loop for every page. I set the Java heap to 500 MB, and I can get about 30 images using getPageThumbImage (if I set more memory, I get more). In my application I have real-time memory graphs, and they show that memory fills up very quickly. When there is no more free memory, getPageThumbImage hangs: no exception, nothing, but the code stops. When I set pdfDocument=null, pdfRenderer=null I get about 400 MB of free memory back. How can I solve this problem?

There are two possible issues, and maybe both are relevant.

1. PDFBox consumes more or less memory to load a PDF depending on the size and the content of the PDF.
   - Are you using the latest 2.0.0-SNAPSHOT? There were some improvements concerning the memory footprint lately.
   - Try to use a scratch file (there are load methods with a boolean switch to activate that).

2. Your own implementation consumes more or less memory to process those thumbnails.
   - Check that you are releasing all the resources you use during your process (especially the images you're creating).

HTH,
Andreas
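To judge how much of the heap the rendered images themselves can account for, a rough size estimate helps. The sketch below is an approximation under stated assumptions: an A4-sized page of about 595 x 842 points (the thread does not give the real page size), 4 bytes per pixel for an uncompressed RGB raster, and simple rounding of the pixel dimensions. ImageMemoryEstimate is a hypothetical helper, not a PDFBox class.

```java
/** Hypothetical back-of-the-envelope estimate of rendered image size. */
public class ImageMemoryEstimate {

    /** Pixels along one axis for a length in PDF points (1/72 inch) at the given DPI. */
    static long pixels(double points, double dpi) {
        return Math.round(points * dpi / 72.0);
    }

    /** Approximate bytes for an uncompressed raster, assuming 4 bytes per pixel. */
    public static long estimateBytes(double widthPt, double heightPt, double dpi) {
        return pixels(widthPt, dpi) * pixels(heightPt, dpi) * 4L;
    }

    public static void main(String[] args) {
        // The A4 page size below is an assumption, not taken from the posted file.
        System.out.println("12 DPI:  " + estimateBytes(595, 842, 12) + " bytes");
        System.out.println("300 DPI: " + estimateBytes(595, 842, 300) + " bytes");
    }
}
```

Under these assumptions a 12 DPI thumbnail is 99 x 140 pixels, about 55 KB, while a 300 DPI render of the same page is about 35 MB. Thirty thumbnails of that size would come to less than 2 MB, which suggests that the thumbnails alone would not fill a 500 MB heap.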
Re[2]: PDFRenderer, PDDocument memory issue
Thank you for the answer. I tried pdfbox-app-2.0.0-20150630.220424-1464.jar; the result is the same. When I create images, I add them to a JavaFX FlowPane. However, the problem is not in the images because, I repeat, I get 400 MB back when I set pdfDocument=null, pdfRenderer=null. Besides, when I do pdfDocument = PDDocument.load(new File(fileName)) I don't have any problems with memory. I get the memory problem when I run getPageThumbImage in a for loop. I am sure that the problem is in PDFBox. Please help me.

--
Alex Sviridov
Re: Re[4]: PDFRenderer, PDDocument memory issue
Alex Sviridov ooo_satu...@mail.ru wrote on 1 July 2015 at 13:09:

> I decided to show all the code. I am also sending the PDF file, a file from the internet that I use for testing.

The attachment didn't make it due to some restrictions on the mailing list. Please post a link to the original source or another place where we can download the PDF in question.

>     Task task = new Task() {
>         @Override
>         protected Integer call() throws Exception {
>             for (int i=0;i<model.getTotalPages();i++){
>                 System.out.println("Point a:"+i);
>                 WritableImage writableImage=model.getPageThumbImage(i);
>                 System.out.println("Point b:"+i);
>                 ImageView imageView=new ImageView(writableImage);
>                 System.out.println("Point c:"+i);
>                 Label label=new Label(Integer.toString(i+1));
>                 System.out.println("Point d:"+i);
>                 VBox vBox=new VBox(imageView,label);
>                 System.out.println("Point e:"+i);
>                 vBox.setAlignment(Pos.CENTER);
>                 vBox.setStyle("-fx-padding:5px 5px 5px 5px;-fx-background-color:red");
>                 System.out.println("Point f:"+i);
>                 Platform.runLater(new Runnable() {
>                     @Override
>                     public void run() {
>                         thumbFlowPane.getChildren().add(vBox);
>                     }
>                 });
>             }
>             return null;
>         }
>     };
>     new Thread(task).start();
>
> And here is the tail of the output:
>
>     Point a:30
>     Point b:30
>     Point c:30
>     Point d:30
>     Point e:30
>     Point f:30
>     Point a:31
>
> What is a scratch file? Sorry, I don't understand you.

PDFBox holds a lot of temporary data in memory. To reduce the memory footprint, one can choose to use a scratch file instead, so that some or most of that data is held in a file. To do so, simply use another load method, e.g. load(File file, boolean useScratchFiles).

BR
Andreas
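One possible explanation for the loop stopping after "Point a:31" with "no exception, nothing" (this is an assumption; the thread does not confirm it) is that something other than an IOException, for example an OutOfMemoryError, is thrown inside call(). A javafx.concurrent.Task runs as a java.util.concurrent.FutureTask, which captures whatever its body throws instead of printing it. The sketch below demonstrates that behaviour with a plain FutureTask and a simulated failure; SwallowedFailureDemo and the exception message are made up for illustration.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;

/** Illustration: a failure inside a FutureTask stays silent until it is asked for. */
public class SwallowedFailureDemo {

    /** Runs a failing task on its own thread and returns the captured cause, or null. */
    public static Throwable runAndCollectFailure() {
        FutureTask<Integer> task = new FutureTask<>(() -> {
            // Stand-in for a rendering call that fails part-way through the loop.
            throw new IllegalStateException("simulated failure on page 31");
        });
        Thread t = new Thread(task);
        t.start();
        try {
            t.join();   // the thread ends quietly; nothing is printed
            task.get(); // the failure only surfaces here
            return null;
        } catch (ExecutionException e) {
            return e.getCause();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return e;
        }
    }

    public static void main(String[] args) {
        System.out.println("captured: " + runAndCollectFailure());
    }
}
```

With a JavaFX Task, the captured failure can be inspected through getException(), typically from a handler registered with setOnFailed, which would show whether the thumbnail loop ended with an error rather than a true hang.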
Re[8]: PDFRenderer, PDDocument memory issue
Ok. Thank you very much for explanation. Could you say where this scratch file is located linux/windows? Среда, 1 июля 2015, 13:54 +02:00 от Andreas Lehmkühler andr...@lehmi.de: Alex Sviridov ooo_satu...@mail.ru hat am 1. Juli 2015 um 13:38 geschrieben: The file is here https://yadi.sk/i/Y0fTuvHmhbZiE Ah, that explains a lot. The pdf is a scanned document, every page holds a color image, consuming a lot of memory when processed I tried with load (fileName,true). The result - now I don't have memory problems. However now I have 2 problems: 1) All the thumbnail images are loaded. However, the speed is VERY SLOW. One thumbnail image is loaded about 4 seconds! If it comes to huge pdfs, you have to die one death. Either you provide enough memory to do all the stuff in memory (fast) or you use a scratch file to save memory (slow) And yes, there is room for an improvement of the memory handling (read on demand, remove after usage) in PDFBox, but that is some future feature. Patches are welcome. 2) Besides, as you see thumbnail images are loaded in separate thread. 
While this thread is running and I try to get big image for main content using BufferedImage bi=pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB); I get the following exception: java.io.IOException: java.util.zip.DataFormatException: unknown compression method at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:83) at org.apache.pdfbox.cos.COSStream.attemptDecode(COSStream.java:422) at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:398) at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:335) at org.apache.pdfbox.cos.COSStream.checkUnfilteredBuffer(COSStream.java:265) at org.apache.pdfbox.cos.COSStream.getUnfilteredRandomAccess(COSStream.java:239) at org.apache.pdfbox.pdfparser.BaseParser.init(BaseParser.java:146) at org.apache.pdfbox.pdfparser.PDFStreamParser.init(PDFStreamParser.java:78) at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:451) at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:438) at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149) at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:180) at org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:205) at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:136) at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:95) at javafx.concurrent.Task$TaskCallable.call(Task.java:1423) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.lang.Thread.run(Thread.java:745) Caused by: java.util.zip.DataFormatException: unknown compression method at java.util.zip.Inflater.inflateBytes(Native Method) at java.util.zip.Inflater.inflate(Inflater.java:259) at java.util.zip.Inflater.inflate(Inflater.java:280) at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:101) at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:74) ... 20 more How to solve these problems? 
PDFBox isn't supposed to be thread safe.

Wednesday, 1 July 2015, 13:17 +02:00 from Andreas Lehmkühler andr...@lehmi.de:

Alex Sviridov ooo_satu...@mail.ru wrote on 1 July 2015 at 13:09:

I decided to show all the code. I am also sending the PDF file, a file from the internet that I use for testing.

The attachment didn't make it through because of mailing list restrictions. Please post a link to the original source or another place where we can download the PDF in question.

    Task<Integer> task = new Task<Integer>() {
        @Override
        protected Integer call() throws Exception {
            for (int i = 0; i < model.getTotalPages(); i++) {
                System.out.println("Point a:" + i);
                WritableImage writableImage = model.getPageThumbImage(i);
                System.out.println("Point b:" + i);
                ImageView imageView = new ImageView(writableImage);
                System.out.println("Point c:" + i);
                Label label = new Label(Integer.toString(i + 1));
                System.out.println("Point d:" + i);
                VBox vBox = new VBox(imageView, label);
                System.out.println("Point e:" + i);
                vBox.setAlignment(Pos.CENTER);
                vBox.setStyle("-fx-padding:5px 5px 5px 5px;-fx-background-color:red");
                System.out.println("Point f:" + i);
                // UI changes must happen on the JavaFX application thread
                Platform.runLater(new Runnable() {
                    @Override
                    public void run() {
                        thumbFlowPane.getChildren().add(vBox);
                    }
                });
            }
            return null;
        }
    };
    new Thread(task).start();

And here is the tail of the output:

    Point a:30
    Point b:30
    Point c:30
    Point d:30
    Point e:30
    Point f:30
    Point a:31

What is
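Since PDFBox isn't meant to be thread safe, one way to keep the thumbnail thread and the main-content rendering from touching the same document at the same time is to send every render call through a single worker thread. The following is only a minimal sketch using java.util.concurrent; SerializedRenderer and its submit method are hypothetical names, and the jobs stand in for calls such as pdfRenderer.renderImageWithDPI(...). Nothing here is PDFBox API.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper: runs all render jobs one at a time on a single thread. */
final class SerializedRenderer implements AutoCloseable {
    // A single worker thread means two jobs can never overlap.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    /**
     * Queues a job and returns a Future for its result. In a real application
     * the job would wrap a call like pdfRenderer.renderImageWithDPI(page, dpi, type).
     */
    <T> Future<T> submit(Callable<T> renderJob) {
        return worker.submit(renderJob);
    }

    /** Stops accepting new jobs; jobs that are already queued still run. */
    @Override
    public void close() {
        worker.shutdown();
    }
}
```

If both the thumbnail loop and the 300 DPI render submit to the same SerializedRenderer instance, they queue up instead of running concurrently. The price is that a large render has to wait for thumbnails that are already queued.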
Re: Re[2]: PDFRenderer, PDDocument memory issue
Alex Sviridov ooo_satu...@mail.ru wrote on 1 July 2015 at 12:58:

Thank you for the answer. I tried pdfbox-app-2.0.0-20150630.220424-1464.jar and the result is the same. When I create images I add them to a JavaFX FlowPane. However, the problem is not in the images because, I repeat, I get 400 MB back when I do pdfDocument=null, pdfRenderer=null. Besides, when I do pdfDocument = PDDocument.load(new File(fileName)) I don't have any problems with memory. I get the memory problem when I run getPageThumbImage in a for loop. I am sure that the problem is in PDFBox. Please help me.

Maybe, but I'm not sure at all. Try to use the scratch file.

Wednesday, 1 July 2015, 12:48 +02:00 from Andreas Lehmkühler andr...@lehmi.de:

Alex Sviridov ooo_satu...@mail.ru wrote on 1 July 2015 at 10:16:

I want to display all page thumbnails. However, I ran into a memory problem with PDFRenderer or PDDocument, I don't know which one. I have the following code:

    private PDDocument pdfDocument;
    private PDFRenderer pdfRenderer;

    public WritableImage getPageThumbImage(int page) {
        WritableImage result = null;
        try {
            BufferedImage bi = pdfRenderer.renderImageWithDPI(page, 12, ImageType.RGB);
            result = SwingFXUtils.toFXImage(bi, null);
        } catch (IOException ex) {
            // the exception is silently ignored here
        }
        return result;
    }

I run the method getPageThumbImage in a for loop for every page. I set the Java heap to 500 MB, and I can get about 30 images using getPageThumbImage (if I allow more memory I get more). In my application I have real-time memory graphs, and they show that memory fills up very fast. When there is no more free memory, getPageThumbImage hangs: no exception, nothing. But the code stops. When I do pdfDocument=null, pdfRenderer=null I get about 400 MB of free memory. How can I solve this problem?

There are 2 possible issues, and maybe both are relevant.

1. PDFBox consumes more or less memory to load a PDF depending on the size and the content of the PDF.
   - Are you using the latest 2.0.0-SNAPSHOT? There were some improvements concerning the memory footprint lately.
   - Try using a scratch file (there are load methods with a boolean switch to activate that).
2. Your own implementation consumes more or less memory to process those thumbnails.
   - Check whether you are releasing all resources you use during your process (especially the images you're creating).

HTH,
Andreas

--
Alex Sviridov

BR
Andreas
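To judge how much of the heap the rendered images themselves could account for, a rough estimate helps. The sketch below is a back-of-the-envelope calculation, not PDFBox code: it assumes the render scale is dpi / 72 (PDF user space has 72 units per inch), an A4-sized page of 595 x 842 points, and 4 bytes per pixel as in BufferedImage.TYPE_INT_RGB. The class and method names are made up for illustration, and it says nothing about what PDFBox keeps internally while rendering.

```java
/** Hypothetical helper for estimating the size of a rendered page image. */
final class RenderedImageSize {
    /** Estimates the pixel buffer size in bytes for a page rendered at the given DPI. */
    static long estimateBytes(double widthPt, double heightPt, double dpi) {
        // PDF user space has 72 units per inch, so the render scale is dpi / 72.
        double scale = dpi / 72.0;
        long widthPx = Math.round(widthPt * scale);
        long heightPx = Math.round(heightPt * scale);
        // Assumption: 4 bytes per pixel, as in BufferedImage.TYPE_INT_RGB.
        return widthPx * heightPx * 4L;
    }

    public static void main(String[] args) {
        // Assumption: an A4-sized page of 595 x 842 points.
        System.out.println("12 DPI:  " + estimateBytes(595, 842, 12) + " bytes");
        System.out.println("300 DPI: " + estimateBytes(595, 842, 300) + " bytes");
    }
}
```

Under these assumptions a 12 DPI thumbnail comes to about 55 KB and a single 300 DPI image to about 35 MB. That would fit the observation that the thumbnails kept in the FlowPane are not what fills a 500 MB heap.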
Re[4]: PDFRenderer, PDDocument memory issue
I decided to show all the code. I am also sending the PDF file, a file from the internet that I use for testing.

    Task<Integer> task = new Task<Integer>() {
        @Override
        protected Integer call() throws Exception {
            for (int i = 0; i < model.getTotalPages(); i++) {
                System.out.println("Point a:" + i);
                WritableImage writableImage = model.getPageThumbImage(i);
                System.out.println("Point b:" + i);
                ImageView imageView = new ImageView(writableImage);
                System.out.println("Point c:" + i);
                Label label = new Label(Integer.toString(i + 1));
                System.out.println("Point d:" + i);
                VBox vBox = new VBox(imageView, label);
                System.out.println("Point e:" + i);
                vBox.setAlignment(Pos.CENTER);
                vBox.setStyle("-fx-padding:5px 5px 5px 5px;-fx-background-color:red");
                System.out.println("Point f:" + i);
                // UI changes must happen on the JavaFX application thread
                Platform.runLater(new Runnable() {
                    @Override
                    public void run() {
                        thumbFlowPane.getChildren().add(vBox);
                    }
                });
            }
            return null;
        }
    };
    new Thread(task).start();

And here is the tail of the output:

    Point a:30
    Point b:30
    Point c:30
    Point d:30
    Point e:30
    Point f:30
    Point a:31

What is a scratch file? Sorry, I don't understand you.

Wednesday, 1 July 2015, 13:04 +02:00 from Andreas Lehmkühler andr...@lehmi.de:

Alex Sviridov ooo_satu...@mail.ru wrote on 1 July 2015 at 12:58:

Thank you for the answer. I tried pdfbox-app-2.0.0-20150630.220424-1464.jar and the result is the same. When I create images I add them to a JavaFX FlowPane. However, the problem is not in the images because, I repeat, I get 400 MB back when I do pdfDocument=null, pdfRenderer=null. Besides, when I do pdfDocument = PDDocument.load(new File(fileName)) I don't have any problems with memory. I get the memory problem when I run getPageThumbImage in a for loop. I am sure that the problem is in PDFBox. Please help me.

Maybe, but I'm not sure at all. Try to use the scratch file.

Wednesday, 1 July 2015, 12:48 +02:00 from Andreas Lehmkühler andr...@lehmi.de:

Alex Sviridov ooo_satu...@mail.ru wrote on 1 July 2015 at 10:16:

I want to display all page thumbnails. However, I ran into a memory problem with PDFRenderer or PDDocument, I don't know which one. I have the following code:

    private PDDocument pdfDocument;
    private PDFRenderer pdfRenderer;

    public WritableImage getPageThumbImage(int page) {
        WritableImage result = null;
        try {
            BufferedImage bi = pdfRenderer.renderImageWithDPI(page, 12, ImageType.RGB);
            result = SwingFXUtils.toFXImage(bi, null);
        } catch (IOException ex) {
            // the exception is silently ignored here
        }
        return result;
    }

I run the method getPageThumbImage in a for loop for every page. I set the Java heap to 500 MB, and I can get about 30 images using getPageThumbImage (if I allow more memory I get more). In my application I have real-time memory graphs, and they show that memory fills up very fast. When there is no more free memory, getPageThumbImage hangs: no exception, nothing. But the code stops. When I do pdfDocument=null, pdfRenderer=null I get about 400 MB of free memory. How can I solve this problem?

There are 2 possible issues, and maybe both are relevant.

1. PDFBox consumes more or less memory to load a PDF depending on the size and the content of the PDF.
   - Are you using the latest 2.0.0-SNAPSHOT? There were some improvements concerning the memory footprint lately.
   - Try using a scratch file (there are load methods with a boolean switch to activate that).
2. Your own implementation consumes more or less memory to process those thumbnails.
   - Check whether you are releasing all resources you use during your process (especially the images you're creating).

HTH,
Andreas

--
Alex Sviridov

BR
Andreas

--
Alex Sviridov
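One possible explanation for the "hangs, no exception, nothing" symptom, offered as an assumption and not confirmed anywhere in this thread: javafx.concurrent.Task is a FutureTask, and a FutureTask captures whatever call() throws, including an OutOfMemoryError, instead of printing it. The sketch below shows that behaviour with plain java.util.concurrent. The simulated error and the class name SilentFailureDemo are made up for illustration.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;

/** Hypothetical demo: a failure inside a FutureTask stays invisible until get() is called. */
final class SilentFailureDemo {
    /** Runs a failing task on its own thread and returns the captured cause, or null. */
    static Throwable runAndCollectFailure() throws InterruptedException {
        FutureTask<Integer> task = new FutureTask<Integer>(() -> {
            // Simulated failure; a real render could run out of heap at this point.
            throw new OutOfMemoryError("simulated: Java heap space");
        });
        Thread thread = new Thread(task);
        thread.start();
        thread.join(); // the thread ends quietly, nothing is printed

        try {
            task.get();
            return null;
        } catch (ExecutionException e) {
            return e.getCause(); // the original Throwable only becomes visible here
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("Captured: " + runAndCollectFailure());
    }
}
```

In a JavaFX application the same information should be reachable through task.setOnFailed(...) together with task.getException(), so logging it there would show whether the loop really stopped on an error after "Point a:31".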
Re[10]: PDFRenderer, PDDocument memory issue
Ok. Thank you again. There is just one thing I don't understand. What is the reason for keeping so much data if I only need page images and, most importantly, I DO IT PAGE BY PAGE? Is there no way to avoid keeping data for previous pages if I only need the data for page N?

Wednesday, 1 July 2015, 14:08 +02:00 from Andreas Lehmkühler andr...@lehmi.de:

Alex Sviridov ooo_satu...@mail.ru wrote on 1 July 2015 at 13:59:

Ok. Thank you very much for the explanation. Could you tell me where this scratch file is located on Linux/Windows?

java.io.File.createTempFile is used to create that file. It uses the default temp directory, which is /tmp on Linux. I'm not sure about Windows, as different environment variables (TMP, TEMP, USERPROFILE, ...) are used to search for such a directory. You can define your own temp directory by passing the following parameter when starting your application:

    -Djava.io.tmpdir=PATH-TO-YOUR-TEMP

Wednesday, 1 July 2015, 13:54 +02:00 from Andreas Lehmkühler andr...@lehmi.de:

Alex Sviridov ooo_satu...@mail.ru wrote on 1 July 2015 at 13:38:

The file is here: https://yadi.sk/i/Y0fTuvHmhbZiE

Ah, that explains a lot. The PDF is a scanned document; every page holds a color image, which consumes a lot of memory when processed.

I tried with load(fileName, true). The result: now I don't have memory problems. However, now I have 2 problems:

1) All the thumbnail images are loaded. However, the speed is VERY SLOW. One thumbnail image takes about 4 seconds to load!

When it comes to huge PDFs, you have to pick one of two evils: either you provide enough memory to do everything in memory (fast), or you use a scratch file to save memory (slow). And yes, there is room for improvement in the memory handling of PDFBox (read on demand, remove after usage), but that is a future feature. Patches are welcome.

2) Besides, as you can see, the thumbnail images are loaded in a separate thread.
While this thread is running, I try to render a large image for the main content using

    BufferedImage bi = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);

and I get the following exception:

java.io.IOException: java.util.zip.DataFormatException: unknown compression method
    at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:83)
    at org.apache.pdfbox.cos.COSStream.attemptDecode(COSStream.java:422)
    at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:398)
    at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:335)
    at org.apache.pdfbox.cos.COSStream.checkUnfilteredBuffer(COSStream.java:265)
    at org.apache.pdfbox.cos.COSStream.getUnfilteredRandomAccess(COSStream.java:239)
    at org.apache.pdfbox.pdfparser.BaseParser.<init>(BaseParser.java:146)
    at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:78)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:451)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:438)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
    at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:180)
    at org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:205)
    at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:136)
    at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:95)
    at javafx.concurrent.Task$TaskCallable.call(Task.java:1423)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.zip.DataFormatException: unknown compression method
    at java.util.zip.Inflater.inflateBytes(Native Method)
    at java.util.zip.Inflater.inflate(Inflater.java:259)
    at java.util.zip.Inflater.inflate(Inflater.java:280)
    at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:101)
    at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:74)
    ... 20 more

How can I solve these problems?
PDFBox isn't supposed to be thread safe.

Wednesday, 1 July 2015, 13:17 +02:00 from Andreas Lehmkühler andr...@lehmi.de:

Alex Sviridov ooo_satu...@mail.ru wrote on 1 July 2015 at 13:09:

I decided to show all the code. I am also sending the PDF file, a file from the internet that I use for testing.

The attachment didn't make it through because of mailing list restrictions. Please post a link to the original source or another place where we can download the PDF in question.

    Task<Integer> task = new Task<Integer>() {
        @Override
        protected Integer call() throws Exception {
            for (int i = 0; i < model.getTotalPages(); i++) {
                System.out.println("Point a:" + i);
                WritableImage writableImage = model.getPageThumbImage(i);
                System.out.println("Point b:" + i);
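To see where a scratch file would end up on a given machine, the two JDK facilities mentioned in this message (java.io.File.createTempFile and the java.io.tmpdir property) can be queried directly. This is a small sketch that uses only java.io and does not call PDFBox; the class name TempDirCheck and the "scratch-demo" prefix are made up for illustration.

```java
import java.io.File;
import java.io.IOException;

/** Hypothetical helper: reports the directory that File.createTempFile uses by default. */
final class TempDirCheck {
    /** Creates a throwaway temp file, notes its directory, and deletes the file again. */
    static File defaultTempDirectory() throws IOException {
        File probe = File.createTempFile("scratch-demo", ".tmp");
        try {
            return probe.getAbsoluteFile().getParentFile();
        } finally {
            probe.delete(); // best-effort cleanup of the probe file
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("java.io.tmpdir   = " + System.getProperty("java.io.tmpdir"));
        System.out.println("temp files go to = " + defaultTempDirectory());
    }
}
```

Starting the JVM with -Djava.io.tmpdir=PATH-TO-YOUR-TEMP should make both lines report that directory, which is a quick way to check the setting before loading a large PDF with the scratch file enabled.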
Re: PDFRenderer, PDDocument memory issue
On 01.07.2015 at 10:16, Alex Sviridov wrote:

In my application I have real-time memory graphs, and they show that memory fills up very fast. When there is no more free memory, getPageThumbImage hangs: no exception, nothing. But the code stops. When I do pdfDocument=null, pdfRenderer=null I get about 400 MB of free memory. How can I solve this problem?

If you're building from source, try this: in PDImageXObject.java, remove the line "cachedImage = image;". This will consume less space if you have large PDFs with many images.

Tilman
Re: PDFRenderer, PDDocument memory issue
On 1 Jul 2015, at 07:52, Tilman Hausherr thaush...@t-online.de wrote:

On 01.07.2015 at 10:16, Alex Sviridov wrote:

In my application I have real-time memory graphs, and they show that memory fills up very fast. When there is no more free memory, getPageThumbImage hangs: no exception, nothing. But the code stops. When I do pdfDocument=null, pdfRenderer=null I get about 400 MB of free memory. How can I solve this problem?

If you're building from source, try this: in PDImageXObject.java, remove the line "cachedImage = image;". This will consume less space if you have large PDFs with many images.

We don't retain XObjects across pages (anymore), so that shouldn't be the cause of his gradual memory increase?

Tilman