Re: Improving Image Extraction Speed

Maruan Sahyoun Mon, 10 Feb 2020 09:03:00 -0800

Sorry, attachments don't work on the mailing list. Could you upload the file
to a shared location?


BR
Maruan
 
> I have attached my PDF here; please take a look.
> 
> Thanks
> Aravinth
> 
> On Mon, Feb 10, 2020 at 10:18 PM Maruan Sahyoun <sahy...@fileaffairs.de> 
> wrote:
> > Would you know what type of image you are extracting? Could you upload the
> > PDF you are working with to a shared host so we can take a look?
> > 
> > BR
> > Maruan
> > 
> > > Thanks Maruan,
> > > 
> > > I have tried that too. Extracting images with *PDFBox* still took longer
> > > than with the Linux command *pdfimages*. For example, where extracting an
> > > image from the PDF takes 300 milliseconds with PDFBox, it takes 10
> > > milliseconds with pdfimages. On checking the code, most of the time was
> > > spent converting the PDImageXObject to a BufferedImage.
> > > 
> > > Is there anything I'm missing here? Can anything be improved in
> > > converting the image?
> > > 
> > > Thanks
> > > Aravinth.
> > > 
> > > 
> > > 
> > > On Mon, Feb 10, 2020 at 3:19 PM Maruan Sahyoun <sahy...@fileaffairs.de>
> > > wrote:
> > > 
> > > > Hi,
> > > > 
> > > > Take a look at the ExtractImages.java source code in
> > > > /org/apache/pdfbox/tools/ for cases where you can take the image data
> > > > and write it out directly.
> > > > 
> > > > BR
> > > > Maruan
> > > > 
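A minimal sketch of why writing the stored image data directly can be much
cheaper than going through a BufferedImage. It uses only the JDK, not PDFBox:
the in-memory JPEG built by sampleJpeg stands in for an image stream that is
already stored as JPEG inside a PDF, and the class and method names are
hypothetical. copyRaw corresponds to writing the encoded bytes out unchanged;
decodeAndReencode corresponds to the getImage() plus ImageIOUtil.writeImage()
path used in the snippet from the original question.

```java
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Compares copying already-encoded JPEG bytes with a decode/re-encode round trip. */
class RawCopyVersusReencode {

    static {
        // Keep ImageIO in memory so the sketch does not depend on a temp directory.
        ImageIO.setUseCache(false);
    }

    /** Builds a JPEG in memory; a stand-in for an image stream stored in a PDF. */
    static byte[] sampleJpeg(int width, int height) {
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                image.setRGB(x, y, (x * 31 + y * 17) & 0xFFFFFF);
            }
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            if (!ImageIO.write(image, "jpg", out)) {
                throw new IllegalStateException("no JPEG writer available");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Fast path: the bytes already form a valid JPEG file, so they are only copied. */
    static byte[] copyRaw(byte[] encoded) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(encoded.length);
        out.write(encoded, 0, encoded.length);
        return out.toByteArray();
    }

    /** Slow path: decode to a BufferedImage, then encode as JPEG again. */
    static byte[] decodeAndReencode(byte[] encoded) {
        try {
            BufferedImage decoded = ImageIO.read(new ByteArrayInputStream(encoded));
            if (decoded == null) {
                throw new IllegalStateException("input could not be decoded");
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            ImageIO.write(decoded, "jpg", out);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] jpeg = sampleJpeg(1024, 768);

        long t0 = System.nanoTime();
        copyRaw(jpeg);
        long t1 = System.nanoTime();
        decodeAndReencode(jpeg);
        long t2 = System.nanoTime();

        System.out.println("raw copy (microseconds):         " + (t1 - t0) / 1_000);
        System.out.println("decode + re-encode (microseconds): " + (t2 - t1) / 1_000);
    }
}
```

Timings vary by machine and image, so no numbers are claimed here. Which PDF
images qualify for the direct path depends on how they are stored in the file;
ExtractImages.java is the place to check which cases PDFBox handles that way.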
> > > > 
> > > > > Hi all,
> > > > > 
> > > > > I have a use case where I need to extract the images and the text
> > > > > content from PDF documents. Compared with text extraction, image
> > > > > extraction takes far too long.
> > > > > 
> > > > > Furthermore, we compared the image extraction speed with the Linux
> > > > > command *pdfimages*, which was much faster than PDFBox.
> > > > > 
> > > > > Is there anything I'm missing? I have included the snippet I used for
> > > > > image extraction here.
> > > > > 
> > > > > Thanks
> > > > > Aravinth
> > > > > 
> > > > > 
> > > > > > // Overall start time for the whole extraction run.
> > > > > > long time = System.currentTimeMillis();
> > > > > > PDDocument pdDocument = PDDocument.load(new File("test.pdf"));
> > > > > > for (PDPage pdPage : pdDocument.getPages())
> > > > > > {
> > > > > >     PDResources resources = pdPage.getResources();
> > > > > >     Iterable<COSName> xObjectNames = resources.getXObjectNames();
> > > > > >     for (COSName cosName : xObjectNames)
> > > > > >     {
> > > > > >         PDXObject xObject = resources.getXObject(cosName);
> > > > > >         if (xObject instanceof PDImageXObject)
> > > > > >         {
> > > > > >             PDImageXObject pdImageXObject = (PDImageXObject) xObject;
> > > > > >             String nameName = cosName.getName();
> > > > > >             String suffix = pdImageXObject.getSuffix();
> > > > > > 
> > > > > >             // Time the decode to a BufferedImage.
> > > > > >             long start = System.currentTimeMillis();
> > > > > >             BufferedImage image = pdImageXObject.getImage();
> > > > > >             System.out.println("Time taken for PDF image object "
> > > > > >                     + nameName + " " + (System.currentTimeMillis() - start));
> > > > > > 
> > > > > >             // Time the re-encode and the write to disk.
> > > > > >             BufferedOutputStream output = new BufferedOutputStream(
> > > > > >                     new FileOutputStream(nameName + "." + suffix));
> > > > > >             start = System.currentTimeMillis();
> > > > > >             ImageIOUtil.writeImage(image, suffix, output);
> > > > > >             output.close();
> > > > > >             System.out.println("Time taken for write to file object "
> > > > > >                     + nameName + " " + (System.currentTimeMillis() - start));
> > > > > >         }
> > > > > >     }
> > > > > > }
> > > > > > pdDocument.close();
> > > > > > System.err.println("Time taken for extracting images "
> > > > > >         + (System.currentTimeMillis() - time));
> > > > > > 
> > > > > 
> > > > > The PDF image extraction using pdfimages:
> > > > > 
> > > > > > long start = System.currentTimeMillis();
> > > > > > ProcessBuilder processBuilder =
> > > > > >         new ProcessBuilder("pdfimages", "-j", "test.pdf", "out");
> > > > > > // start() only launches pdfimages; it does not wait for it to finish.
> > > > > > processBuilder.start();
> > > > > > 
> > > > > > System.out.println("Time taken for extracting images "
> > > > > >         + (System.currentTimeMillis() - start));
> > > > > > 
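One thing to check when comparing these numbers: ProcessBuilder.start()
returns as soon as the process has been launched, so the elapsed time printed
by the pdfimages snippet does not necessarily include the extraction itself.
A sketch of a timing helper that waits for the command to exit, assuming
Java 9 or later (for Redirect.DISCARD); the class and method names are
hypothetical:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;

/** Times an external command from launch until it has exited. */
class ExternalCommandTimer {

    /** Runs the command, blocks until it exits, and returns elapsed wall-clock milliseconds. */
    static long timeCommandMillis(List<String> command) {
        ProcessBuilder builder = new ProcessBuilder(command)
                .redirectErrorStream(true)
                // Discard output so the child cannot block on a full pipe.
                .redirectOutput(ProcessBuilder.Redirect.DISCARD);

        long start = System.nanoTime();
        try {
            Process process = builder.start();
            // waitFor() is the step that makes the measurement cover the whole run.
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IllegalStateException("command failed with exit code " + exitCode);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the command", e);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Defaults to the command line used in the thread; requires pdfimages and test.pdf.
        List<String> command = args.length > 0
                ? Arrays.asList(args)
                : Arrays.asList("pdfimages", "-j", "test.pdf", "out");
        System.out.println("Time taken for extracting images "
                + timeCommandMillis(command) + " ms");
    }
}
```

Measured this way, the figure includes process start-up as well as the
extraction, which is worth keeping in mind when comparing it with in-process
PDFBox timings.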
> > > > --
> > > > Maruan Sahyoun
> > > > 
> > > > FileAffairs GmbH
> > > > Josef-Schappe-Straße 21
> > > > 40882 Ratingen
> > > > 
> > > > Tel: +49 (2102) 89497 88
> > > > Fax: +49 (2102) 89497 91
> > > > sahy...@fileaffairs.de
> > > > www.fileaffairs.de
> > > > 
> > > > Geschäftsführer: Maruan Sahyoun
> > > > Handelsregister: AG Düsseldorf, HRB 53837
> > > > UST.-ID: DE248275827
> > > > 
> > > > 
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> > > > For additional commands, e-mail: users-h...@pdfbox.apache.org
> > > > 
> > > > 
