Hi, take a look at the ExtractImages.java source code in /org/apache/pdfbox/tools/ for cases where you can take the encoded image data and write it out directly.
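As a rough illustration of that idea (a sketch only, not the actual ExtractImages code): when the image is stored as a JPEG in a plain gray or RGB color space, the still-encoded bytes can be copied to the output file instead of decoding to a BufferedImage and re-encoding. The class and method names below (DirectImageCopy, canCopyRaw, copyRaw) are hypothetical, and the sketch uses only the JDK. It assumes the suffix comes from pdImageXObject.getSuffix() and the color space name from pdImageXObject.getColorSpace().getName().

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class DirectImageCopy {

    /**
     * Hypothetical check: is it safe to write the encoded stream unchanged?
     * Assumption: only JPEG data in DeviceGray or DeviceRGB is copied as-is;
     * everything else (e.g. CMYK JPEGs) still goes through getImage().
     */
    public static boolean canCopyRaw(String suffix, String colorSpaceName) {
        boolean isJpeg = "jpg".equals(suffix);
        boolean isPlainColorSpace = "DeviceGray".equals(colorSpaceName)
                || "DeviceRGB".equals(colorSpaceName);
        return isJpeg && isPlainColorSpace;
    }

    /** Copies all bytes from in to out without decoding them; returns the byte count. */
    public static long copyRaw(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            total += read;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in bytes for an encoded image stream (not a real JPEG).
        byte[] encoded = {(byte) 0xFF, (byte) 0xD8, 1, 2, 3, (byte) 0xFF, (byte) 0xD9};
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (canCopyRaw("jpg", "DeviceRGB")) {
            long copied = copyRaw(new ByteArrayInputStream(encoded), out);
            System.out.println("Copied " + copied + " bytes without decoding");
        }
    }
}
```

In the loop from the snippet below, the input stream would, as far as I know, come from pdImageXObject.createInputStream(Arrays.asList(COSName.DCT_DECODE.getName(), COSName.DCT_DECODE_ABBREVIATION.getName())) in PDFBox 2.x, which stops decoding before the DCT filter. Please check this against the ExtractImages source for your version.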
BR
Maruan

> Hi all,
>
> I have a use case where I need to extract the images and the text content
> from PDF documents.
> Comparing image extraction and text extraction speed, the time taken for
> image extraction is too large.
>
> Furthermore, we compared the image extraction speed with the Linux
> command *pdfimages*; it was much faster than PDFBox.
>
> Is there anything I'm missing? I have included the snippet I used for
> image extraction here.
>
> Thanks
> Aravinth
>
> > PDDocument pdDocument = PDDocument.load(new File("test.pdf"));
> > for (PDPage pdPage : pdDocument.getPages())
> > {
> >     PDResources resources = pdPage.getResources();
> >     Iterable<COSName> xObjectNames = resources.getXObjectNames();
> >     for (COSName cosName : xObjectNames)
> >     {
> >         PDXObject xObject = resources.getXObject(cosName);
> >         if (xObject instanceof PDImageXObject)
> >         {
> >             PDImageXObject pdImageXObject = (PDImageXObject) xObject;
> >             long start = System.currentTimeMillis();
> >             BufferedImage image = pdImageXObject.getImage();
> >             String nameName = cosName.getName();
> >             System.out.println("Time taken for PDF image object " + nameName + " " + (System.currentTimeMillis() - start));
> >             BufferedOutputStream output = new BufferedOutputStream(new FileOutputStream(nameName + "." + pdImageXObject.getSuffix()));
> >             start = System.currentTimeMillis();
> >             ImageIOUtil.writeImage(image, pdImageXObject.getSuffix(), output);
> >             output.close();
> >             System.out.println("Time taken for write to file object " + nameName + " " + (System.currentTimeMillis() - start));
> >         }
> >     }
> > }
> > pdDocument.close();
> > System.err.println("Time taken for extracting for images " + (System.currentTimeMillis() - time));
> >
> > The PDF image extraction using pdfimages:
> >
> > long start = System.currentTimeMillis();
> > ProcessBuilder processBuilder = new ProcessBuilder("pdfimages", "-j", "test.pdf", "out");
> > processBuilder.start();
> >
> > System.out.println("Time taken for extracting images " + (System.currentTimeMillis() - start));

--
Maruan Sahyoun
FileAffairs GmbH