Hi,

take a look at the ExtractImages.java source code in /org/apache/pdfbox/tools/
for the cases where you can take the image data directly and write it out
as-is. That avoids decoding each image with getImage() and re-encoding it
with ImageIOUtil.writeImage().
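
A rough sketch of what such a direct path can look like. This is not the
actual ExtractImages code. It assumes PDFBox 2.x with the pdfbox-tools module
on the classpath, and it assumes that copying the raw stream is only safe for
JPEG images in DeviceGray or DeviceRGB without masks. The class name
DirectImageWriter and the method write are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Collections;
import java.util.List;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.io.IOUtils;
import org.apache.pdfbox.pdmodel.graphics.color.PDDeviceGray;
import org.apache.pdfbox.pdmodel.graphics.color.PDDeviceRGB;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import org.apache.pdfbox.tools.imageio.ImageIOUtil;

/** Hypothetical helper: copies embedded JPEG data as-is when that looks safe. */
public final class DirectImageWriter {

    // Stop decoding before the DCT (JPEG) filter so the raw JPEG bytes are returned.
    private static final List<String> JPEG_STOP_FILTERS =
            Collections.singletonList(COSName.DCT_DECODE.getName());

    private DirectImageWriter() {
    }

    /**
     * Writes the image to out and returns the file suffix that matches the
     * written data: "jpg" for a direct copy, "png" for the decode fallback.
     */
    public static String write(PDImageXObject image, OutputStream out) throws IOException {
        if ("jpg".equals(image.getSuffix()) && hasPlainColorSpace(image) && !hasMask(image)) {
            // Fast path: no BufferedImage is created, the JPEG stream is copied unchanged.
            try (InputStream raw = image.createInputStream(JPEG_STOP_FILTERS)) {
                IOUtils.copy(raw, out);
            }
            return "jpg";
        }
        // Slow path (assumption: everything else is decoded and re-encoded as PNG).
        ImageIOUtil.writeImage(image.getImage(), "png", out);
        return "png";
    }

    private static boolean hasPlainColorSpace(PDImageXObject image) throws IOException {
        String name = image.getColorSpace().getName();
        return PDDeviceGray.INSTANCE.getName().equals(name)
                || PDDeviceRGB.INSTANCE.getName().equals(name);
    }

    private static boolean hasMask(PDImageXObject image) throws IOException {
        return image.getMask() != null || image.getSoftMask() != null;
    }
}
```

In your loop you would call DirectImageWriter.write(pdImageXObject, output)
instead of getImage() followed by ImageIOUtil.writeImage(), and use the
returned suffix for the file name.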

BR
Maruan

   
> Hi all,
> 
> I have a use case where I need to extract the images and the text content
> from PDF documents.
> Compared with text extraction, image extraction takes far too long.
> 
> Furthermore, we compared the image extraction speed with the Linux
> command-line tool *pdfimages*, which was much faster than PDFBox.
> 
> Is there anything I'm missing? I have included the snippet I used for
> image extraction here.
> 
> Thanks
> Aravinth
> 
> 
>     long time = System.currentTimeMillis(); // start of overall timing
>     PDDocument pdDocument = PDDocument.load(new File("test.pdf"));
>     for (PDPage pdPage : pdDocument.getPages())
>     {
>         PDResources resources = pdPage.getResources();
>         Iterable<COSName> xObjectNames = resources.getXObjectNames();
>         for (COSName cosName : xObjectNames)
>         {
>             PDXObject xObject = resources.getXObject(cosName);
>             if (xObject instanceof PDImageXObject)
>             {
>                 PDImageXObject pdImageXObject = (PDImageXObject) xObject;
>                 long start = System.currentTimeMillis();
>                 // decodes the embedded image data into a BufferedImage
>                 BufferedImage image = pdImageXObject.getImage();
>                 String nameName = cosName.getName();
>                 System.out.println("Time taken for PDF image object "
>                         + nameName + " " + (System.currentTimeMillis() - start));
>                 BufferedOutputStream output = new BufferedOutputStream(
>                         new FileOutputStream(nameName + "." + pdImageXObject.getSuffix()));
>                 start = System.currentTimeMillis();
>                 // re-encodes the decoded image in the target format
>                 ImageIOUtil.writeImage(image, pdImageXObject.getSuffix(), output);
>                 output.close();
>                 System.out.println("Time taken for write to file object "
>                         + nameName + " " + (System.currentTimeMillis() - start));
>             }
>         }
>     }
>     pdDocument.close();
>     System.err.println("Time taken for extracting for images "
>             + (System.currentTimeMillis() - time));
> 
> The PDF image extraction using pdfimages:
> 
>     long start = System.currentTimeMillis();
>     ProcessBuilder processBuilder = new ProcessBuilder("pdfimages", "-j", "test.pdf", "out");
>     processBuilder.start();
>     System.out.println("Time taken for extracting images "
>             + (System.currentTimeMillis() - start));
> > 
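
One note on the pdfimages timing quoted above: ProcessBuilder.start() returns
as soon as the process has been launched, so the printed time may cover little
more than the process startup rather than the extraction itself. A minimal
sketch of a measurement that waits for the external tool to finish could look
like this. The class name ProcessTimer and the method timeMillis are made up
for illustration.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: measures how long an external command runs until it exits. */
public final class ProcessTimer {

    private ProcessTimer() {
    }

    /** Starts the command, waits for it to exit and returns the elapsed milliseconds. */
    public static long timeMillis(String... command) throws IOException, InterruptedException {
        long start = System.nanoTime();
        Process process = new ProcessBuilder(command)
                .inheritIO() // forward the tool's output so it cannot block on a full pipe
                .start();
        int exitCode = process.waitFor(); // block until the external tool has finished
        long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        if (exitCode != 0) {
            throw new IOException("Command exited with code " + exitCode);
        }
        return elapsedMillis;
    }
}
```

Calling ProcessTimer.timeMillis("pdfimages", "-j", "test.pdf", "out") would
then give a number that can be compared with the PDFBox timing.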
-- 
Maruan Sahyoun

FileAffairs GmbH
Josef-Schappe-Straße 21
40882 Ratingen

Tel: +49 (2102) 89497 88
Fax: +49 (2102) 89497 91
sahy...@fileaffairs.de
www.fileaffairs.de

Geschäftsführer: Maruan Sahyoun
Handelsregister: AG Düsseldorf, HRB 53837
UST.-ID: DE248275827

