Re: performance issue in v 2.0.17: PDImageXObject.getImage() + MemoryUsageSetting.setupTempFileOnly

Tilman Hausherr Fri, 18 Oct 2019 11:08:31 -0700

Hi,

Could you try getting the current 2.0.18 from svn? Alternatively, usethis code:


                long fileLen = raf.length();

long expectedFileLen = ((long)pageCount -inMemoryMaxPageCount) * PAGE_SIZE;


                if (expectedFileLen != fileLen)
                {

throw new IOException("Expected scratch file sizeof " + expectedFileLen + " but found " + fileLen + "in file " + file);

                }

                // enlarge if we do not int overflow
                if (pageCount + ENLARGE_PAGE_COUNT > pageCount)
                {
                    if (LOG.isDebugEnabled())
                    {
                        LOG.debug("file: " + file);

LOG.debug("fileLen before: " + fileLen + ", raflength: " + raf.length() +

                                  ", file length: " + file.length());
                    }
                    fileLen += ENLARGE_PAGE_COUNT * PAGE_SIZE;

                    raf.setLength(fileLen);
                    if (LOG.isDebugEnabled())
                    {

LOG.debug("fileLen after1: " + fileLen + ", raflength: " + raf.length() +

                                  ", file length: " + file.length());
                    }

if (fileLen != raf.length() || fileLen !=file.length())

// PDFBOX-4601 possible AWS lambda bug thatsetLength() doesn't throw

                        // if not enough space
                        long origFilePointer = raf.getFilePointer();
                        raf.seek(fileLen - 1);
                        raf.write(0);
                        raf.seek(origFilePointer);

LOG.debug("fileLen after2: " + fileLen + ",raf length: " + raf.length() + ", file length: " + file.length());

freePages.set(pageCount, pageCount +ENLARGE_PAGE_COUNT);

or just remove the log messages from the 2.0.17 source. Now there isstill a length() operation but that is "raf.length()" so maybe it isdifferent.


Sadly we can't go without that, until Amazon has fixed their systems.

Re the difference in processing time between 16 and 17 - I don't knowbecause you can't share your file. If you have a lot of time, try withdifferent versions from the repostory, i.e. a "bisect" strategy. Ittakes hours, and is a work for patient people.


Tilman

Am 18.10.2019 um 18:45 schrieb Esteban R:

Yes, it is. I have reverted ScratchFile.java and processing now takes less than 
2 secs.

I have noticed that the problem may be related to the call to file.length(): 
even if I add only the LOGs (or even more: if I only add a line: file.length(); 
to ScratchFile.java) then performance drops.

I have also tested with another +48000 pdf files: I have not found another pdf 
file with a so big processing time. These are the results for the most 
time-consuming pdfs. I have noticed that processing time in v2.0.17 seems to be 
bigger for these files even if I don't use temp files (maybe the machine was 
more loaded during the tests?)

Columns:
v2.0.16_with_tempfile v2.0.16_without_tempfile  v2.0.17_with_tempfile 
v2_0_17_without_tempfile
1635 1410 4304 2581
1642 1528 4304 2585
1652 1373 4483 2662
1652 1461 4556 2727
1684 1441 4626 2835
1833 1339 4362 2610
1863 1641 4436 2689
4961 4350 18605 4296


Esteban




________________________________
De: Tilman Hausherr <thaush...@t-online.de>
Enviado: jueves, 26 de septiembre de 2019 19:45
Para: users@pdfbox.apache.org <users@pdfbox.apache.org>
Asunto: Re: performance issue in v 2.0.17: PDImageXObject.getImage() + 
MemoryUsageSetting.setupTempFileOnly

Ouch... Might be the workaround implemented for this issue:
https://issues.apache.org/jira/browse/PDFBOX-4601

Tilman

Am 26.09.2019 um 21:39 schrieb Esteban R:

Hello. I'm getting a timeout in one of my tests after upgrading to v
2.0.17: PDImageXObject.getImage() takes more than 1:10 minutes instead
of less than 2 seconds with previous release 2.0.16.

I cannot provide the sample PDF because it contains sensitive
information. I have tried to simplify it but the issue (almost)
dissappears even if I save the file without changing anything.

Some more facts:

   * The issue happens if the load is done with
     MemoryUsageSetting.setupTempFileOnly (without that flag the issue
     doesn't happen)
   * The performance is OK for v 2.0.16
   * Please find attached the stack trace of the timeout of my original
     test
   * PDF structure is quite simple and contains a single image (find
     attached the relevant data from pdfdebugger)
   * I have created a sample program to demonstrate the issue: it
     simply loads the PDF file with the setupTempFileOnly flag and does
     getImage for all the images (only one in the PDF). It then does
     the same thing without the flag and after that it does the same
     thing with another PDF: it is simply the same file saved by PDFBOX
     with another name.

Output with pdfbox 2.0.16:
java -cp "pdfbox-2.0.16.jar;commons-logging-1.2.jar;."
TestExtractImage sample.pdf
With    temp file
     Before getImage: Thu Sep 26 16:32:46 ART 2019
     After  getImage: Thu Sep 26 16:32:46 ART 2019
Without temp file
     Before getImage: Thu Sep 26 16:32:46 ART 2019
     After  getImage: Thu Sep 26 16:32:46 ART 2019
With    temp file after saving with another name
     Before getImage: Thu Sep 26 16:32:46 ART 2019
     After  getImage: Thu Sep 26 16:32:47 ART 2019
(i.e.: less than 2 seconds in all cases)

Output with pdfbox 2.0.17:
java -cp "pdfbox-2.0.17.jar;commons-logging-1.2.jar;."
TestExtractImage sample.pdf
With    temp file
     Before getImage: Thu Sep 26 16:31:09 ART 2019
     After  getImage: Thu Sep 26 16:32:30 ART 2019  => more than 1'20"
Without temp file
     Before getImage: Thu Sep 26 16:32:30 ART 2019
     After  getImage: Thu Sep 26 16:32:30 ART 2019
With    temp file after saving with another name
     Before getImage: Thu Sep 26 16:32:30 ART 2019
     After  getImage: Thu Sep 26 16:32:34 ART 2019 => more than 2"

And here the source code for TestExtractImage.java :

import java.io.File;
import java.io.IOException;
import java.util.Date;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import org.apache.pdfbox.io.MemoryUsageSetting;

public class TestExtractImage {
     public static void main(String[] args) throws IOException {
         System.out.println("With    temp file");
         PDDocument d = PDDocument.load(new File(args[0]),
MemoryUsageSetting.setupTempFileOnly() );
         getImage(d);
         d.close();

         System.out.println("Without temp file");
         d = PDDocument.load(new File(args[0]));
         getImage(d);
         d.save("other.pdf");
         d.close();
         System.out.println("With    temp file after saving with
another name");
         d = PDDocument.load(new File("other.pdf"),
MemoryUsageSetting.setupTempFileOnly() );
         getImage(d);
     }

     static void getImage(PDDocument d) throws IOException {
         PDResources res = d.getPage(0).getResources();
         for (COSName n: res.getXObjectNames()) {
             PDXObject o = res.getXObject(n);
             if (o instanceof PDImageXObject){
                 PDImageXObject i = (PDImageXObject) o;
                 if ("png".equals(i.getSuffix())){
                     System.out.println("    Before getImage: "+new
Date());
                     i.getImage();
                     System.out.println("    After  getImage: "+new
Date());
                 }
             }
         }
     }
}




---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org

Re: performance issue in v 2.0.17: PDImageXObject.getImage() + MemoryUsageSetting.setupTempFileOnly

Reply via email to