Hi all,

I have a use case where I need to extract both the images and the text content from PDF documents. Comparing the two, image extraction takes far longer than text extraction.
Furthermore, we compared the image extraction speed with the Linux command-line tool *pdfimages*, and it was much faster than PDFBox. Is there anything I'm missing? I have included the snippet I used for image extraction here.

Thanks,
Aravinth

    import java.awt.image.BufferedImage;
    import java.io.BufferedOutputStream;
    import java.io.File;
    import java.io.FileOutputStream;
    import org.apache.pdfbox.cos.COSName;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.PDResources;
    import org.apache.pdfbox.pdmodel.graphics.PDXObject;
    import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
    import org.apache.pdfbox.tools.imageio.ImageIOUtil;

    // Start of the overall measurement
    long time = System.currentTimeMillis();
    PDDocument pdDocument = PDDocument.load(new File("test.pdf"));
    for (PDPage pdPage : pdDocument.getPages())
    {
        PDResources resources = pdPage.getResources();
        Iterable<COSName> xObjectNames = resources.getXObjectNames();
        for (COSName cosName : xObjectNames)
        {
            PDXObject xObject = resources.getXObject(cosName);
            if (xObject instanceof PDImageXObject)
            {
                PDImageXObject pdImageXObject = (PDImageXObject) xObject;

                // Time the decoding of the image into a BufferedImage
                long start = System.currentTimeMillis();
                BufferedImage image = pdImageXObject.getImage();
                String nameName = cosName.getName();
                System.out.println("Time taken for PDF image object " + nameName + " "
                        + (System.currentTimeMillis() - start));

                // Time the encoding and writing of the image to a file
                BufferedOutputStream output = new BufferedOutputStream(
                        new FileOutputStream(nameName + "." + pdImageXObject.getSuffix()));
                start = System.currentTimeMillis();
                ImageIOUtil.writeImage(image, pdImageXObject.getSuffix(), output);
                output.close();
                System.out.println("Time taken for write to file object " + nameName + " "
                        + (System.currentTimeMillis() - start));
            }
        }
    }
    pdDocument.close();
    System.err.println("Time taken for extracting images "
            + (System.currentTimeMillis() - time));

The PDF image extraction using pdfimages:

    long start = System.currentTimeMillis();
    ProcessBuilder processBuilder = new ProcessBuilder("pdfimages", "-j", "test.pdf", "out");
    processBuilder.start();
    System.out.println("Time taken for extracting images "
            + (System.currentTimeMillis() - start));
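
One caveat on the pdfimages timing above: ProcessBuilder.start() returns as soon as the process has been launched, so the elapsed time printed there does not include the extraction itself. Below is a minimal sketch of a fairer measurement, assuming pdfimages is available on the PATH. The class name ExternalCommandTimer and the helper timeCommand are hypothetical names introduced only for illustration.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class ExternalCommandTimer {

    /**
     * Hypothetical helper: runs the given command, blocks until the process
     * exits, and returns the elapsed wall-clock time in milliseconds.
     */
    static long timeCommand(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder processBuilder = new ProcessBuilder(command);
        // Merge stderr into stdout and discard both, so the child process
        // cannot stall on a full, unread output pipe.
        processBuilder.redirectErrorStream(true);
        processBuilder.redirectOutput(ProcessBuilder.Redirect.DISCARD);

        long start = System.nanoTime();
        Process process = processBuilder.start();
        // waitFor() blocks until the external process has finished.
        int exitCode = process.waitFor();
        long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);

        if (exitCode != 0) {
            throw new IOException("Command exited with code " + exitCode + ": " + command);
        }
        return elapsedMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        try {
            // Same command line as in the snippet above.
            long elapsed = timeCommand(List.of("pdfimages", "-j", "test.pdf", "out"));
            System.out.println("Time taken for extracting images " + elapsed + " ms");
        } catch (IOException e) {
            // Raised when pdfimages is missing or exits with a non-zero code.
            System.err.println("Could not run pdfimages: " + e.getMessage());
        }
    }
}
```

Even with the process awaited, some gap may remain: with -j, pdfimages saves DCT-encoded images as JPEG files as they are, whereas the PDFBox snippet decodes every image into a BufferedImage and then encodes it again when writing.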