Re: Re[6]: PDFRenderer, PDDocument memory issue

Andreas Lehmkühler Wed, 01 Jul 2015 04:55:25 -0700

> Alex Sviridov <ooo_satu...@mail.ru> wrote on 1 July 2015 at 13:38:
> 
> 
>  The file is here  https://yadi.sk/i/Y0fTuvHmhbZiE
Ah, that explains a lot. The PDF is a scanned document: every page holds a color
image, which consumes a lot of memory when processed.


> I tried load(fileName, true). As a result, I no longer have memory
> problems. However, now I have 2 other problems:
>
> 1) All the thumbnail images are loaded, but VERY SLOWLY. Loading one
> thumbnail image takes about 4 seconds!
With huge PDFs you have to pick your poison: either you provide enough memory
to do all the work in memory (fast), or you use a scratch file to save memory
(slow).

And yes, there is room to improve the memory handling in PDFBox (read on
demand, remove after usage), but that is a future feature. Patches are
welcome.
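
As a rough illustration of that trade-off, here is a minimal sketch of what a
scratch file does in principle. It is not PDFBox's implementation; the
ScratchBuffer class and its methods are made-up names for this example. Blocks
of data are parked in a temporary file and read back on demand, so the heap
stays small, but every access costs a disk round trip.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

/** Toy scratch buffer (hypothetical): data lives in a temp file, not on the heap. */
final class ScratchBuffer implements AutoCloseable {
    private final File file;
    private final RandomAccessFile raf;

    ScratchBuffer() throws IOException {
        file = File.createTempFile("scratch", ".tmp");
        file.deleteOnExit();
        raf = new RandomAccessFile(file, "rw");
    }

    /** Appends a length-prefixed block to the file and returns its offset. */
    long write(byte[] data) throws IOException {
        long offset = raf.length();
        raf.seek(offset);
        raf.writeInt(data.length);
        raf.write(data);
        return offset;
    }

    /** Reads a block back from disk; this round trip is the slow part. */
    byte[] read(long offset) throws IOException {
        raf.seek(offset);
        byte[] data = new byte[raf.readInt()];
        raf.readFully(data);
        return data;
    }

    @Override
    public void close() throws IOException {
        raf.close();
        file.delete();
    }
}
```

Keeping the same blocks in a List<byte[]> would be faster, but every block
would then stay on the heap until the list is released.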

> 2) Besides, as you can see, the thumbnail images are loaded in a separate
> thread. While this thread is running, I try to get a big image for the main
> content using
> BufferedImage bi=pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);
> and I get the following exception:
> 
> java.io.IOException: java.util.zip.DataFormatException: unknown compression method
>     at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:83)
>     at org.apache.pdfbox.cos.COSStream.attemptDecode(COSStream.java:422)
>     at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:398)
>     at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:335)
>     at org.apache.pdfbox.cos.COSStream.checkUnfilteredBuffer(COSStream.java:265)
>     at org.apache.pdfbox.cos.COSStream.getUnfilteredRandomAccess(COSStream.java:239)
>     at org.apache.pdfbox.pdfparser.BaseParser.<init>(BaseParser.java:146)
>     at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:78)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:451)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:438)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
>     at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:180)
>     at org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:205)
>     at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:136)
>     at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:95)
>   ....
>     at javafx.concurrent.Task$TaskCallable.call(Task.java:1423)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.zip.DataFormatException: unknown compression method
>     at java.util.zip.Inflater.inflateBytes(Native Method)
>     at java.util.zip.Inflater.inflate(Inflater.java:259)
>     at java.util.zip.Inflater.inflate(Inflater.java:280)
>     at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:101)
>     at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:74)
>     ... 20 more
> 
> How to solve these problems?
PDFBox isn't designed to be thread safe, so using the same document from two
threads at once can fail like this.
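
One way to work around this is to make sure only one thread at a time touches
the document, for instance by routing every render call through a shared lock.
A minimal sketch, assuming all rendering goes through this wrapper;
LockedRenderer is a made-up name, and the function passed in is a stand-in for
a call such as pdfRenderer.renderImageWithDPI(page, dpi, ImageType.RGB):

```java
import java.util.function.BiFunction;

/** Hypothetical wrapper: lets only one thread at a time enter the render call. */
final class LockedRenderer<T> {
    private final Object lock = new Object();

    // Stand-in for the real render call (an assumption for this sketch).
    // A call that throws a checked IOException would have to be wrapped
    // before it fits into a BiFunction.
    private final BiFunction<Integer, Float, T> renderCall;

    LockedRenderer(BiFunction<Integer, Float, T> renderCall) {
        this.renderCall = renderCall;
    }

    /** Renders one page; concurrent callers block until the lock is free. */
    T render(int page, float dpi) {
        synchronized (lock) {
            return renderCall.apply(page, dpi);
        }
    }
}
```

The thumbnail thread and the main-content request would both call render(...)
on the same LockedRenderer instance; the big image then waits until the
thumbnail currently being rendered is finished.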

> 
> 
> Wednesday, 1 July 2015, 13:17 +02:00 from Andreas Lehmkühler <andr...@lehmi.de>:
> >
> >
> >> Alex Sviridov < ooo_satu...@mail.ru > wrote on 1 July 2015 at 13:09:
> >> 
> >> 
> >>  I decided to show all the code. I am also sending the PDF file, a file from
> >> the internet that I use for testing.
> >The attachment didn't make it through due to restrictions on the mailing list.
> >Please post a link to the original source or another place where we can
> >download the PDF in question.
> >
> >> 
> >> Task<Integer> task = new Task<Integer>() {
> >>     @Override protected Integer call() throws Exception {
> >>         for (int i=0;i<model.getTotalPages();i++){
> >>             System.out.println("Point a:"+i);
> >>             WritableImage writableImage=model.getPageThumbImage(i);
> >>             System.out.println("Point b:"+i);
> >>             ImageView imageView=new ImageView(writableImage);
> >>             System.out.println("Point c:"+i);
> >>             Label label=new Label(Integer.toString(i+1));
> >>             System.out.println("Point d:"+i);
> >>             VBox vBox=new VBox(imageView,label);
> >>             System.out.println("Point e:"+i);
> >>             vBox.setAlignment(Pos.CENTER);
> >>             vBox.setStyle("-fx-padding:5px 5px 5px 5px;-fx-background-color:red");
> >>             System.out.println("Point f:"+i);
> >>             Platform.runLater(new Runnable() {
> >>                 @Override
> >>                 public void run() {
> >>                      thumbFlowPane.getChildren().add(vBox);
> >>                 }
> >>             });
> >>         }
> >>         return null;
> >>     }
> >> };
> >> new Thread(task).start();
> >> 
> >> And here is the tail of the output
> >> ....
> >> Point a:30
> >> Point b:30
> >> Point c:30
> >> Point d:30
> >> Point e:30
> >> Point f:30
> >> Point a:31
> >> 
> >> What is a scratch file? Sorry, I don't understand you.
> >
> >PDFBox holds a lot of temporary data in memory. To reduce the memory
> >footprint, one can choose to use a scratch file instead, so that some or most
> >of that data is held in a file.
> >
> >To do so, simply use another load method, e.g. 
> >
> >load(File file, boolean useScratchFiles)
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> Wednesday, 1 July 2015, 13:04 +02:00 from Andreas Lehmkühler < andr...@lehmi.de >:
> >> >
> >> >
> >> >> Alex Sviridov <  ooo_satu...@mail.ru > wrote on 1 July 2015 at 12:58:
> >> >> 
> >> >> 
> >> >>  Thank you for the answer. I tried pdfbox-app-2.0.0-20150630.220424-1464.jar;
> >> >> the result is the same.
> >> >> 
> >> >> When I create images, I add them to a JavaFX FlowPane. However, the problem
> >> >> is not in the images because, I repeat, I get 400mb free when I do
> >> >> pdfDocument=null,pdfRenderer=null.
> >> >> 
> >> >> Besides, when I do pdfDocument = PDDocument.load(new File(fileName)) I
> >> >> don't have any problems with memory.
> >> >> 
> >> >> I get the memory problem when I run getPageThumbImage in a for loop.
> >> >> 
> >> >> I am sure that the problem is in PdfBox. Please, help me.
> >> >Maybe, but I'm not sure at all.
> >> >
> >> >Try to use the scratch file.
> >> >
> >> >> Wednesday, 1 July 2015, 12:48 +02:00 from Andreas Lehmkühler <  andr...@lehmi.de >:
> >> >> >
> >> >> >
> >> >> >> Alex Sviridov <  ooo_satu...@mail.ru > wrote on 1 July 2015 at 10:16:
> >> >> >> 
> >> >> >> 
> >> >> >>  I want to display all page thumbnails. However, I ran into a memory
> >> >> >> problem with PDFRenderer or PDDocument - I don't know which one.
> >> >> >> 
> >> >> >> I have the following code:
> >> >> >>    ....
> >> >> >>     private PDDocument pdfDocument;
> >> >> >>     
> >> >> >>     private PDFRenderer pdfRenderer;
> >> >> >> 
> >> >> >>     public WritableImage getPageThumbImage(int page){
> >> >> >>         WritableImage result=null;
> >> >> >>         try {
> >> >> >>             BufferedImage bi=pdfRenderer.renderImageWithDPI(page, 12,
> >> >> >> ImageType.RGB);
> >> >> >>             result=SwingFXUtils.toFXImage(bi, null);
> >> >> >>         } catch (IOException ex) {
> >> >> >>              ....
> >> >> >>         }
> >> >> >>         return result;
> >> >> >>     }
> >> >> >>  .....
> >> >> >> I run the method getPageThumbImage in a for loop for every page. I set the
> >> >> >> Java heap size to 500mb.
> >> >> >> I can get about 30 images using getPageThumbImage (if I set more memory, I
> >> >> >> get more).
> >> >> >> In my application I have real-time memory graphs, and they show that memory
> >> >> >> fills up very fast.
> >> >> >> When there is no more free memory, getPageThumbImage hangs - no exception,
> >> >> >> nothing. But the code stops.
> >> >> >> When I do pdfDocument=null,pdfRenderer=null I get about 400mb of free
> >> >> >> memory. How can I solve this problem?
> >> >> >There are 2 possible issues, and maybe both are relevant.
> >> >> >
> >> >> >1. PDFBox consumes more or less memory to load a PDF depending on the size
> >> >> >and the content of the PDF.
> >> >> >
> >> >> >- Are you using the latest 2.0.0-SNAPSHOT? There were some improvements
> >> >> >concerning the memory footprint lately.
> >> >> >- Try to use a scratch file (there are load methods including a boolean
> >> >> >switch to activate that).
> >> >> >
> >> >> >2. Your own implementation consumes more or less memory to process those
> >> >> >thumbnails.
> >> >> >
> >> >> >- Check that you are releasing all the resources you use during your process
> >> >> >(especially those images you're creating).
> >> >> >
> >> >> >HTH,
> >> >> >Andreas
> >> >> >
> >> >> >---------------------------------------------------------------------
> >> >> >To unsubscribe, e-mail:  users-unsubscr...@pdfbox.apache.org
> >> >> >For additional commands, e-mail:  users-h...@pdfbox.apache.org
> >> >> >
> >> >> 
> >> >> 
> >> >> -- 
> >> >> Alex Sviridov
> >> >
> >> >BR
> >> >Andreas
> >> >
> >> >
> >> 
> >> 
> >> -- 
> >> Alex Sviridov
> >> 
> >
> >
> >BR
> >Andreas
> >
> >
> 
> 
> -- 
> Alex Sviridov

