I have attached my PDF here; please take a look.

Thanks
Aravinth
On Mon, Feb 10, 2020 at 10:18 PM Maruan Sahyoun <sahy...@fileaffairs.de> wrote:

> Would you know what type of image you are extracting? Could you upload the
> PDF you are working with to a shared hoster so we can take a look?
>
> BR
> Maruan
>
> > Thanks Maruan,
> >
> > I have tried that too. Extracting images with *PDFBox* still took longer
> > than with the Linux command *pdfimages*.
> > For example, where extracting an image from the PDF takes 300 milliseconds
> > with PDFBox, it takes 10 milliseconds with pdfimages.
> > On checking the code, most of the time was spent converting the
> > PDImageXObject to a BufferedImage.
> >
> > Is there anything I'm missing here? Can anything be improved in
> > converting the image?
> >
> > Thanks
> > Aravinth.
> >
> > On Mon, Feb 10, 2020 at 3:19 PM Maruan Sahyoun <sahy...@fileaffairs.de>
> > wrote:
> >
> > > Hi,
> > >
> > > take a look at the ExtractImages.java source code in
> > > /org/apache/pdfbox/tools/ for cases where you can take the image data
> > > directly and write that out directly.
> > >
> > > BR
> > > Maruan
> > >
> > > > Hi all,
> > > >
> > > > I have a use case where I need to extract the images and the text
> > > > content from PDF documents.
> > > > Comparing image extraction with text extraction, the time taken for
> > > > image extraction is much larger.
> > > >
> > > > Furthermore, we compared the image extraction speed with the Linux
> > > > command *pdfimages*; it was much faster than PDFBox.
> > > >
> > > > Is there anything I'm missing? I have included the snippet I used for
> > > > image extraction here.
> > > >
> > > > Thanks
> > > > Aravinth
> > > >
> > > > > long time = System.currentTimeMillis();
> > > > > PDDocument pdDocument = PDDocument.load(new File("test.pdf"));
> > > > > for (PDPage pdPage : pdDocument.getPages())
> > > > > {
> > > > >     PDResources resources = pdPage.getResources();
> > > > >     Iterable<COSName> xObjectNames = resources.getXObjectNames();
> > > > >     for (COSName cosName : xObjectNames)
> > > > >     {
> > > > >         PDXObject xObject = resources.getXObject(cosName);
> > > > >         if (xObject instanceof PDImageXObject)
> > > > >         {
> > > > >             PDImageXObject pdImageXObject = (PDImageXObject) xObject;
> > > > >             long start = System.currentTimeMillis();
> > > > >             BufferedImage image = pdImageXObject.getImage();
> > > > >             String nameName = cosName.getName();
> > > > >             System.out.println("Time taken for PDF image object " + nameName + " " + (System.currentTimeMillis() - start));
> > > > >             BufferedOutputStream output = new BufferedOutputStream(new FileOutputStream(nameName + "." + pdImageXObject.getSuffix()));
> > > > >             start = System.currentTimeMillis();
> > > > >             ImageIOUtil.writeImage(image, pdImageXObject.getSuffix(), output);
> > > > >             output.close();
> > > > >             System.out.println("Time taken for write to file object " + nameName + " " + (System.currentTimeMillis() - start));
> > > > >         }
> > > > >     }
> > > > > }
> > > > > pdDocument.close();
> > > > > System.err.println("Time taken for extracting images " + (System.currentTimeMillis() - time));
> > > >
> > > > The PDF image extraction using pdfimages:
> > > >
> > > > > long start = System.currentTimeMillis();
> > > > > ProcessBuilder processBuilder = new ProcessBuilder("pdfimages", "-j", "test.pdf", "out");
> > > > > processBuilder.start();
> > > > >
> > > > > System.out.println("Time taken for extracting images " + (System.currentTimeMillis() - start));
> > >
> > > --
> > > Maruan Sahyoun
> > >
> > > FileAffairs GmbH
> > > Josef-Schappe-Straße 21
> > > 40882 Ratingen
> > >
> > > Tel: +49 (2102) 89497 88
> > > Fax: +49 (2102) 89497 91
> > > sahy...@fileaffairs.de
> > > www.fileaffairs.de
> > >
> > > Geschäftsführer: Maruan Sahyoun
> > > Handelsregister: AG Düsseldorf, HRB 53837
> > > UST.-ID: DE248275827
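
A note on the pdfimages measurement quoted above: ProcessBuilder.start() returns as soon as the process has been launched, so reading the clock right after it may capture only the launch time and not the time pdfimages needs to finish. Below is a minimal sketch of a measurement that waits for the process to exit. It assumes Java 9 or later (for ProcessBuilder.Redirect.DISCARD); the names TimedCommand, Result and runAndTime are made up for illustration and are not part of PDFBox or pdfimages.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;

public class TimedCommand {

    /** Exit code and wall-clock duration of one finished external command. */
    public static final class Result {
        public final int exitCode;
        public final long elapsedMillis;

        Result(int exitCode, long elapsedMillis) {
            this.exitCode = exitCode;
            this.elapsedMillis = elapsedMillis;
        }
    }

    /** Starts the command, blocks until it exits, and reports how long that took. */
    public static Result runAndTime(List<String> command) {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Merge stderr into stdout and discard both (Java 9+), so the child
        // cannot stall on a full output pipe while we wait for it.
        builder.redirectErrorStream(true);
        builder.redirectOutput(ProcessBuilder.Redirect.DISCARD);

        long start = System.nanoTime();
        try {
            Process process = builder.start();  // returns once the process is launched
            int exitCode = process.waitFor();   // blocks until the process has exited
            long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
            return new Result(exitCode, elapsedMillis);
        } catch (IOException e) {
            throw new UncheckedIOException("Could not start " + command, e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for " + command, e);
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("usage: java TimedCommand <command> [args...]");
            return;
        }
        Result result = runAndTime(Arrays.asList(args));
        System.out.println("Exit code " + result.exitCode
                + ", time taken " + result.elapsedMillis + " ms");
    }
}
```

Running it as "java TimedCommand pdfimages -j test.pdf out" would time the same command as in the quoted snippet, including the time pdfimages spends writing its output files. The figure also includes process start-up, so it is only a rough comparison with the in-process PDFBox timings.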
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org