Re: Re[10]: PDFRenderer, PDDocument memory issue

2015-07-01 Thread John Hewson


> On 1 Jul 2015, at 05:15, Alex Sviridov  wrote:
> 
> OK, thank you again. There is just one thing I don't understand: why keep
> so much data if I only need the page images and, most importantly,
> I DO IT PAGE BY PAGE?
> 
> Is there no way to avoid keeping data for previous pages if I only need
> data for page N?

Try profiling PDFBox to see what that data actually is. We don't cache page 
resources anymore. It could be cached stream data, or fonts, perhaps.
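
A rough first step, before reaching for a full profiler, could be to log how much heap each page render leaves behind. The sketch below uses only the JDK; HeapProbe and heapDelta are hypothetical names, and the Runnable in main is a stand-in for a real call such as pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB). The numbers are approximate because the GC request is only a hint.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

/** Hypothetical helper: reports heap growth around one unit of work. */
final class HeapProbe {
    private static final MemoryMXBean MEMORY = ManagementFactory.getMemoryMXBean();

    /** Heap currently in use, in bytes, after suggesting a GC. */
    static long usedHeapBytes() {
        MEMORY.gc(); // only a hint; the JVM may ignore it
        return MEMORY.getHeapMemoryUsage().getUsed();
    }

    /** Runs the work and returns the change in used heap, in bytes. */
    static long heapDelta(Runnable work) {
        long before = usedHeapBytes();
        work.run();
        return usedHeapBytes() - before;
    }

    public static void main(String[] args) {
        // Stand-in for rendering one page; replace with the real render call.
        byte[][] retained = new byte[1][];
        long delta = heapDelta(() -> retained[0] = new byte[8 * 1024 * 1024]);
        System.out.println("Heap delta: " + delta / (1024 * 1024) + " MiB");
    }
}
```

If the delta keeps growing from page to page, a heap dump (for example from jmap or VisualVM) would show which classes are holding the data.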

-- John

> Wednesday, 1 July 2015, 14:08 +02:00, from Andreas Lehmkühler:
>> 
>> 
>>> Alex Sviridov <ooo_satu...@mail.ru> wrote on 1 July 2015 at 13:59:
>>> 
>>> 
>>> OK, thank you very much for the explanation. Could you tell me where this
>>> scratch file is located on Linux and on Windows?
>> java.io.File.createTempFile is used to create that file. It uses the default
>> temp directory, which is "/tmp" on Linux. I'm not sure about Windows, as
>> different environment variables (such as TMP, TEMP, USERPROFILE) are used to
>> look up that directory.
>> 
>> You may define your own temp directory using the following parameter when
>> starting your application
>> 
>> -Djava.io.tmpdir=PATH-TO-YOUR-TEMP
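
To check which directory a given JVM actually uses, a small probe along these lines may help. It is a sketch using only the JDK: TempDirCheck and defaultTempDir are hypothetical names, and the "scratch-probe" prefix is arbitrary. It creates a temp file the same way the quoted reply describes, reports its parent directory, and deletes the file again.

```java
import java.io.File;
import java.io.IOException;

/** Hypothetical helper: shows where File.createTempFile puts its files. */
final class TempDirCheck {
    /** Creates a temp file in the default temp directory and returns that directory. */
    static File defaultTempDir() throws IOException {
        File probe = File.createTempFile("scratch-probe", ".tmp");
        try {
            return probe.getParentFile();
        } finally {
            probe.delete(); // clean up the probe file
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("java.io.tmpdir = " + System.getProperty("java.io.tmpdir"));
        System.out.println("temp files go to " + defaultTempDir());
    }
}
```

Running it once without and once with -Djava.io.tmpdir=PATH-TO-YOUR-TEMP should show whether the parameter is being picked up.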
>> 
>> 
>>> 
>>> 
>>> Wednesday, 1 July 2015, 13:54 +02:00, from Andreas Lehmkühler <andr...@lehmi.de>:
>>>>> Alex Sviridov <ooo_satu...@mail.ru> wrote on 1 July 2015 at 13:38:
>>>>>
>>>>>
>>>>> The file is here: https://yadi.sk/i/Y0fTuvHmhbZiE
>>>> Ah, that explains a lot. The PDF is a scanned document; every page holds a
>>>> color image, which consumes a lot of memory when processed.
>>>>
>>>>> I tried load(fileName, true). As a result, I no longer have memory
>>>>> problems. However, I now have two other problems:
>>>>>
>>>>> 1) All the thumbnail images are loaded, but the speed is VERY SLOW:
>>>>> one thumbnail image takes about 4 seconds to load!
>>>> With huge PDFs you have to pick your poison: either you provide enough
>>>> memory to do everything in memory (fast), or you use a scratch file to
>>>> save memory (slow).
>>>>
>>>> And yes, there is room to improve the memory handling in PDFBox (read on
>>>> demand, remove after usage), but that is a future feature. Patches are
>>>> welcome.
>>>>
>>>>> 2) Besides, as you can see, the thumbnail images are loaded in a separate
>>>>> thread. While that thread is running, I try to get a big image for the
>>>>> main content using
>>>>> BufferedImage bi = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);
>>>>> and I get the following exception:
>>>>>
>>>>> java.io.IOException: java.util.zip.DataFormatException: unknown compression method
>>>>>     at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:83)
>>>>>     at org.apache.pdfbox.cos.COSStream.attemptDecode(COSStream.java:422)
>>>>>     at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:398)
>>>>>     at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:335)
>>>>>     at org.apache.pdfbox.cos.COSStream.checkUnfilteredBuffer(COSStream.java:265)
>>>>>     at org.apache.pdfbox.cos.COSStream.getUnfilteredRandomAccess(COSStream.java:239)
>>>>>     at org.apache.pdfbox.pdfparser.BaseParser.<init>(BaseParser.java:146)
>>>>>     at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:78)
>>>>>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:451)
>>>>>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:438)
>>>>>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
>>>>>     at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:180)
>>>>>     at org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:205)
>>>>>     at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:136)
>>>>>     at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:95)
>>>>>
>>>>>     at javafx.concurrent.Task$TaskCallable.call(Task.java:1423)
>>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>> Caused by: java.util.zip.DataFormatException: unknown compression method
>>>>>     at java.util.zip.Inflater.inflateBytes(Native Method)
>>>>>     at java.util.zip.Inflater.inflate(Inflater.java:259)
>>>>>     at java.util.zip.Inflater.inflate(Inflater.java:280)
>>>>>     at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:101)
>>>>>     at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:74)
>>>>>     ... 20 more
>>>>>
>>>>> How can I solve these problems?
>>>> PDFBox isn't meant to be thread-safe.
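
Assuming the exception above comes from the thumbnail thread and the main render touching the same document at the same time, one possible workaround is to funnel every call that uses that document through a single worker thread. The sketch below uses only the JDK; SerialRenderQueue is a hypothetical name, and the lambdas in main are stand-ins for real calls such as pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB).

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical wrapper: runs every task that touches one document on one thread. */
final class SerialRenderQueue implements AutoCloseable {
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    /** Queues a task; tasks run one at a time, in submission order. */
    <T> Future<T> submit(Callable<T> task) {
        return worker.submit(task);
    }

    /** Stops accepting new tasks; already queued tasks still run. */
    @Override
    public void close() {
        worker.shutdown();
    }

    public static void main(String[] args) throws Exception {
        try (SerialRenderQueue queue = new SerialRenderQueue()) {
            // Stand-ins for a thumbnail render and a full-size render of the same document.
            Future<String> thumbnail = queue.submit(() -> "thumbnail of page 0");
            Future<String> fullSize = queue.submit(() -> "300 DPI image of page 0");
            System.out.println(thumbnail.get());
            System.out.println(fullSize.get());
        }
    }
}
```

The trade-off is that a large render waits behind queued thumbnails. Giving each thread its own PDDocument instance may be an alternative, at the cost of more memory.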
 
>>>>>
>>>>>
>>>>> Wednesday, 1 July 2015, 13:17 +02:00, from Andreas Lehmkühler <andr...@lehmi.de>:
>>>>>>
>>>>>>
>>>>>>> Alex Sviridov <ooo_satu...@mail.ru> wrote on 1 July 2015 at 13:09

Re: Re[10]: PDFRenderer, PDDocument memory issue

2015-07-01 Thread Andreas Lehmkühler


> Alex Sviridov wrote on 1 July 2015 at 14:15:
> 
> 
> OK, thank you again. There is just one thing I don't understand: why keep
> so much data if I only need the page images and, most importantly,
> I DO IT PAGE BY PAGE?
PDFBox doesn't know that you are doing it page by page.

> 
> Is there no way to avoid keeping data for previous pages if I only need
> data for page N?
As I said, we don't have a read-on-demand mechanism yet. It is on our radar, but
it will take a while: the PDF format isn't that easy to work with, so the code
that needs to be extended is fairly complex.


Re[10]: PDFRenderer, PDDocument memory issue

2015-07-01 Thread Alex Sviridov
OK, thank you again. There is just one thing I don't understand: why keep
so much data if I only need the page images and, most importantly,
I DO IT PAGE BY PAGE?

Is there no way to avoid keeping data for previous pages if I only need
data for page N?


>> >> Wednesday, 1 July 2015, 13:17 +02:00, from Andreas Lehmkühler <andr...@lehmi.de>:
>> >> >
>> >> >
>> >> >> Alex Sviridov <ooo_satu...@mail.ru> wrote on 1 July 2015 at 13:09:
>> >> >> 
>> >> >> 
>> >> >> I decided to show all the code. I am also sending the PDF file, a file
>> >> >> from the internet that I use for testing.
>> >> >The attachment didn't make it through because of restrictions on the
>> >> >mailing list. Please post a link to th