Re: Improving Image Extraction Speed

Maruan Sahyoun Mon, 10 Feb 2020 08:48:58 -0800

Would you know what type of images you are extracting? Could you upload the PDF
you are working with to a file hosting service so we can take a look?


BR
Maruan
 
> Thanks Maruan,
> 
> I have tried that too. The time taken for extracting images using *PDFBOX*
> was larger than using the Linux command *pdfimages*.
> For example, if extracting the images from a PDF takes 300 milliseconds with
> PDFBOX, it takes 10 milliseconds with pdfimages.
> On checking the code, most of the time was spent converting the
> PDImageXObject to a BufferedImage.
> 
> Is there anything I'm missing here? Can anything be improved in converting
> the image?
> 
> Thanks
> Aravinth.
> 
> 
> 
> On Mon, Feb 10, 2020 at 3:19 PM Maruan Sahyoun <[email protected]>
> wrote:
> 
> > Hi,
> > 
> > take a look at the ExtractImages.java source code in
> > /org/apache/pdfbox/tools/ for cases where you can take the image data
> > directly and write that out directly.
> > 
> > BR
> > Maruan
> > 
> > 
> > > Hi all,
> > > 
> > > I have a use case where I need to extract the images and the text content
> > > from PDF documents.
> > > Comparing image extraction and text extraction speed, the time taken for
> > > image extraction is much larger.
> > > 
> > > Furthermore, we compared the image extraction speed with the Linux command
> > > *pdfimages*; it was much faster than PDFBox.
> > > 
> > > Is there anything I'm missing? I have included the snippet I used for
> > > image extraction here.
> > > 
> > > Thanks
> > > Aravinth
> > > 
> > > 
> > > > long time = System.currentTimeMillis();
> > > > PDDocument pdDocument = PDDocument.load(new File("test.pdf"));
> > > > for (PDPage pdPage : pdDocument.getPages())
> > > > {
> > > >     PDResources resources = pdPage.getResources();
> > > >     Iterable<COSName> xObjectNames = resources.getXObjectNames();
> > > >     for (COSName cosName : xObjectNames)
> > > >     {
> > > >         PDXObject xObject = resources.getXObject(cosName);
> > > >         if (xObject instanceof PDImageXObject)
> > > >         {
> > > >             PDImageXObject pdImageXObject = (PDImageXObject) xObject;
> > > >             long start = System.currentTimeMillis();
> > > >             // Decode the image stream into a BufferedImage
> > > >             BufferedImage image = pdImageXObject.getImage();
> > > >             String nameName = cosName.getName();
> > > >             System.out.println("Time taken for PDF image object "
> > > >                     + nameName + " " + (System.currentTimeMillis() - start));
> > > >             BufferedOutputStream output = new BufferedOutputStream(
> > > >                     new FileOutputStream(nameName + "." + pdImageXObject.getSuffix()));
> > > >             start = System.currentTimeMillis();
> > > >             // Re-encode the BufferedImage and write it to the file
> > > >             ImageIOUtil.writeImage(image, pdImageXObject.getSuffix(), output);
> > > >             output.close();
> > > >             System.out.println("Time taken for write to file object "
> > > >                     + nameName + " " + (System.currentTimeMillis() - start));
> > > >         }
> > > >     }
> > > > }
> > > > pdDocument.close();
> > > > System.err.println("Time taken for extracting images "
> > > >         + (System.currentTimeMillis() - time));
> > > > 
> > > 
> > > The PDF image extraction using pdfimages:
> > > 
> > > > long start = System.currentTimeMillis();
> > > > ProcessBuilder processBuilder = new ProcessBuilder("pdfimages", "-j", "test.pdf", "out");
> > > > processBuilder.start();
> > > > 
> > > > System.out.println("Time taken for extracting images "
> > > >         + (System.currentTimeMillis() - start));
> > > > 
> > --
> > Maruan Sahyoun
> > 
> > FileAffairs GmbH
> > Josef-Schappe-Straße 21
> > 40882 Ratingen
> > 
> > Tel: +49 (2102) 89497 88
> > Fax: +49 (2102) 89497 91
> > [email protected]
> > www.fileaffairs.de
> > 
> > Geschäftsführer: Maruan Sahyoun
> > Handelsregister: AG Düsseldorf, HRB 53837
> > UST.-ID: DE248275827
> > 
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
> > 
> > 

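A note on the pdfimages measurement quoted above: ProcessBuilder.start()
returns as soon as the child process has been launched, so the elapsed time
printed there may cover only process start-up and not the extraction itself.
Below is a minimal sketch of timing an external command until it exits,
assuming Java 9 or later. The class name TimedCommand and its run() helper
are hypothetical; the pdfimages arguments are the ones from the quoted
snippet.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class TimedCommand {
    /** Exit code and wall-clock duration of a finished command. */
    public static final class Result {
        public final int exitCode;
        public final long elapsedMillis;

        Result(int exitCode, long elapsedMillis) {
            this.exitCode = exitCode;
            this.elapsedMillis = elapsedMillis;
        }
    }

    /** Starts the command, waits for it to exit, and reports how long that took. */
    public static Result run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Discard the child's output so it cannot block on a full pipe buffer.
        builder.redirectErrorStream(true);
        builder.redirectOutput(ProcessBuilder.Redirect.DISCARD);

        long start = System.nanoTime();
        Process process = builder.start();
        int exitCode = process.waitFor(); // blocks until the child has finished
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return new Result(exitCode, elapsedMillis);
    }

    public static void main(String[] args) throws Exception {
        Result r = run(Arrays.asList("pdfimages", "-j", "test.pdf", "out"));
        System.out.println("pdfimages exit code " + r.exitCode
                + ", time taken " + r.elapsedMillis + " ms");
    }
}
```

Measured this way, the figure also includes process start-up, so it is only
roughly comparable with an in-process PDFBox timing.
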

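The per-image timings in the quoted Java snippet are single measurements. On
the JVM the first call often includes one-time class loading and JIT
compilation, which may inflate a figure such as the 300 milliseconds
mentioned above. Below is a rough sketch of repeating a measurement after a
few warm-up runs and reporting the median. The class name MedianTimer and its
parameters are hypothetical, and the sleeping task is only a placeholder for
the real work (for example the getImage() call).

```java
import java.util.Arrays;
import java.util.concurrent.Callable;

public class MedianTimer {
    /**
     * Runs the task `warmups` times untimed, then `runs` times timed, and
     * returns the median sample in nanoseconds (upper middle for even counts).
     */
    public static long medianNanos(Callable<?> task, int warmups, int runs) throws Exception {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be >= 1");
        }
        for (int i = 0; i < warmups; i++) {
            task.call(); // untimed: lets class loading and JIT settle
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.call();
            samples[i] = System.nanoTime() - start;
        }
        Arrays.sort(samples);
        return samples[runs / 2];
    }

    public static void main(String[] args) throws Exception {
        // Placeholder workload standing in for the image decoding step.
        long median = medianNanos(() -> { Thread.sleep(5); return null; }, 2, 5);
        System.out.println("median ms: " + median / 1_000_000.0);
    }
}
```
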