Improving Image Extraction Speed

aravinth thangasami Mon, 10 Feb 2020 01:37:56 -0800

Hi all,

I have a use case where I need to extract the images and the text content
from PDF documents.
Compared with text extraction, image extraction takes far too long.


Furthermore, we compared the image extraction speed with the Linux
command-line tool *pdfimages*, and it was much faster than PDFBox.

Is there anything I'm missing? I have included the snippet I used for
image extraction here.

Thanks
Aravinth


    // Overall timer for the whole extraction run.
    long time = System.currentTimeMillis();
    PDDocument pdDocument = PDDocument.load(new File("test.pdf"));
    for (PDPage pdPage : pdDocument.getPages())
    {
        PDResources resources = pdPage.getResources();
        Iterable<COSName> xObjectNames = resources.getXObjectNames();
        for (COSName cosName : xObjectNames)
        {
            PDXObject xObject = resources.getXObject(cosName);
            if (xObject instanceof PDImageXObject)
            {
                PDImageXObject pdImageXObject = (PDImageXObject) xObject;
                // Time the decoding of the image into a BufferedImage.
                long start = System.currentTimeMillis();
                BufferedImage image = pdImageXObject.getImage();
                String nameName = cosName.getName();
                System.out.println("Time taken for PDF image object " + nameName + " "
                        + (System.currentTimeMillis() - start));
                BufferedOutputStream output = new BufferedOutputStream(
                        new FileOutputStream(nameName + "." + pdImageXObject.getSuffix()));
                // Time the re-encoding and writing of the image to disk.
                start = System.currentTimeMillis();
                ImageIOUtil.writeImage(image, pdImageXObject.getSuffix(), output);
                output.close();
                System.out.println("Time taken for write to file object " + nameName + " "
                        + (System.currentTimeMillis() - start));
            }
        }
    }
    pdDocument.close();
    System.err.println("Time taken for extracting images "
            + (System.currentTimeMillis() - time));

The image extraction using pdfimages:

    long start = System.currentTimeMillis();
    ProcessBuilder processBuilder = new ProcessBuilder("pdfimages", "-j", "test.pdf", "out");
    processBuilder.start();
    System.out.println("Time taken for extracting images "
            + (System.currentTimeMillis() - start));
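
One thing worth checking in the pdfimages comparison: ProcessBuilder.start()
returns as soon as the child process has been launched, so the timing above
probably covers only the launch and not the extraction itself. Below is a
minimal sketch of a measurement that waits for the process to exit. The class
and method names (PdfImagesTiming, timeCommand) are hypothetical, and it
assumes Java 9 or later (for Redirect.DISCARD) and that pdfimages is on the
PATH.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

final class PdfImagesTiming {

    /**
     * Runs an external command, blocks until it exits, and returns the
     * elapsed wall-clock time in milliseconds.
     */
    public static long timeCommand(String... command)
            throws IOException, InterruptedException {
        long start = System.nanoTime();
        Process process = new ProcessBuilder(command)
                // Discard the child's output so a full pipe buffer cannot stall it.
                .redirectErrorStream(true)
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                .start();
        // Without waitFor(), only the process launch would be timed.
        int exitCode = process.waitFor();
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        if (exitCode != 0) {
            throw new IOException("Command failed with exit code " + exitCode);
        }
        return elapsedMs;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java PdfImagesTiming <command> [args...]");
            return;
        }
        System.out.println("Time taken: " + timeCommand(args) + " ms");
    }
}
```

For the case in this post it could be run as
"java PdfImagesTiming pdfimages -j test.pdf out". Even with a fair
measurement, pdfimages may still come out ahead: as far as I understand, its
-j option saves DCT-encoded images as JPEG files without re-encoding them,
while the PDFBox snippet decodes every image to a BufferedImage and encodes it
again.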
