Would you know what type of images you are extracting? Could you upload the PDF you are working with to a file-sharing site so we can take a look?
BR
Maruan

> Thanks Maruan,
>
> I have tried that too. Extracting images with *PDFBox* still took longer
> than with the Linux command *pdfimages*.
> For example, where extracting an image from the PDF takes 300 milliseconds
> with PDFBox, it takes 10 milliseconds with pdfimages.
> On checking the code, most of the time was spent converting the
> PDImageXObject to a BufferedImage.
>
> Is there anything I'm missing here? Can anything be improved in converting
> the image?
>
> Thanks
> Aravinth.
>
> On Mon, Feb 10, 2020 at 3:19 PM Maruan Sahyoun <sahy...@fileaffairs.de>
> wrote:
>
> > Hi,
> >
> > take a look at the ExtractImages.java source code in
> > /org/apache/pdfbox/tools/ for cases where you can take the image data
> > directly and write that out directly.
> >
> > BR
> > Maruan
> >
> > > Hi all,
> > >
> > > I have a use case where I need to extract the images and the text content
> > > from PDF documents.
> > > Comparing image extraction with text extraction, the time taken for
> > > image extraction is much larger.
> > >
> > > Furthermore, we compared the image extraction speed with the Linux
> > > command *pdfimages*; it was much faster than PDFBox.
> > >
> > > Is there anything I'm missing? I have included the snippet I used for
> > > image extraction here.
> > >
> > > Thanks
> > > Aravinth
> > >
> > > > // Overall start time ("time" was used below but never declared in the posted snippet)
> > > > long time = System.currentTimeMillis();
> > > > PDDocument pdDocument = PDDocument.load(new File("test.pdf"));
> > > > for (PDPage pdPage : pdDocument.getPages())
> > > > {
> > > >     PDResources resources = pdPage.getResources();
> > > >     Iterable<COSName> xObjectNames = resources.getXObjectNames();
> > > >     for (COSName cosName : xObjectNames)
> > > >     {
> > > >         PDXObject xObject = resources.getXObject(cosName);
> > > >         if (xObject instanceof PDImageXObject)
> > > >         {
> > > >             PDImageXObject pdImageXObject = (PDImageXObject) xObject;
> > > >             // Time the decoding of the image into a BufferedImage
> > > >             long start = System.currentTimeMillis();
> > > >             BufferedImage image = pdImageXObject.getImage();
> > > >             String nameName = cosName.getName();
> > > >             System.out.println("Time taken for PDF image object " + nameName + " " + (System.currentTimeMillis() - start));
> > > >             BufferedOutputStream output = new BufferedOutputStream(new FileOutputStream(nameName + "." + pdImageXObject.getSuffix()));
> > > >             // Time the re-encoding and writing of the image file
> > > >             start = System.currentTimeMillis();
> > > >             ImageIOUtil.writeImage(image, pdImageXObject.getSuffix(), output);
> > > >             output.close();
> > > >             System.out.println("Time taken for write to file object " + nameName + " " + (System.currentTimeMillis() - start));
> > > >         }
> > > >     }
> > > > }
> > > > pdDocument.close();
> > > > System.err.println("Time taken for extracting images " + (System.currentTimeMillis() - time));
> > >
> > > The PDF image extraction using pdfimages:
> > >
> > > > long start = System.currentTimeMillis();
> > > > ProcessBuilder processBuilder = new ProcessBuilder("pdfimages", "-j", "test.pdf", "out");
> > > > // Note: start() only launches the process; it does not wait for pdfimages to finish
> > > > processBuilder.start();
> > > > System.out.println("Time taken for extracting images " + (System.currentTimeMillis() - start));
> >
> > --
> > Maruan Sahyoun
> >
> > FileAffairs GmbH
> > Josef-Schappe-Straße 21
> > 40882 Ratingen
> >
> > Tel: +49 (2102) 89497 88
> > Fax: +49 (2102) 89497 91
> > sahy...@fileaffairs.de
> > www.fileaffairs.de
> >
> > Managing Director: Maruan Sahyoun
> > Commercial Register: AG Düsseldorf, HRB 53837
> > VAT ID: DE248275827
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> > For additional commands, e-mail: users-h...@pdfbox.apache.org
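
A note on the pdfimages timing quoted above: ProcessBuilder.start() returns as soon as the child process has been launched, so that measurement may cover only the process start-up and not the extraction itself. Below is a minimal sketch of a measurement that waits for the command to exit. It uses only the JDK; the class and method names (TimedCommand, runAndTimeMillis) are made up for illustration and are not part of PDFBox or of the original post, and it assumes the command is available on the PATH.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of PDFBox: times an external command end to end.
final class TimedCommand {

    /**
     * Starts the command, blocks until it exits, and returns the elapsed
     * wall-clock time in milliseconds. Throws IOException on a non-zero exit code.
     */
    static long runAndTimeMillis(List<String> command)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Forward the child's output to this process so a full pipe cannot stall it.
        builder.inheritIO();

        long start = System.nanoTime();
        Process process = builder.start();
        // Unlike start() alone, waitFor() blocks until the child has finished.
        int exitCode = process.waitFor();
        long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);

        if (exitCode != 0) {
            throw new IOException("Command exited with code " + exitCode);
        }
        return elapsedMillis;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java TimedCommand <command> [args...]");
            return;
        }
        long millis = runAndTimeMillis(Arrays.asList(args));
        System.out.println("Time taken for extracting images " + millis);
    }
}
```

For the case in this thread it could be invoked as `java TimedCommand pdfimages -j test.pdf out`. The figure still includes process start-up, so it is not a like-for-like comparison with the in-process PDFBox loop, but it does at least cover the whole extraction.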