Re: Improving Image Extraction Speed

aravinth thangasami Mon, 10 Feb 2020 08:56:47 -0800

I have attached my PDF here; please take a look.

Thanks
Aravinth


On Mon, Feb 10, 2020 at 10:18 PM Maruan Sahyoun <sahy...@fileaffairs.de>
wrote:

> Would you know what type of image you are extracting? Could you upload the
> PDF you are working with to a file-sharing host so we can take a look?
>
> BR
> Maruan
>
> > Thanks Maruan,
> >
> > I have tried that too. Extracting images with *PDFBox* still took longer
> > than with the Linux command *pdfimages*.
> > For example, if extracting an image from a PDF takes 300 milliseconds
> > with PDFBox, it takes 10 milliseconds with pdfimages.
> > When I checked the code, most of the time was spent converting the
> > PDImageXObject to a BufferedImage.
> >
> > Is there anything I'm missing here? Can anything be improved in
> > converting the image?
> >
> > Thanks
> > Aravinth.
> >
> >
> >
> > On Mon, Feb 10, 2020 at 3:19 PM Maruan Sahyoun <sahy...@fileaffairs.de>
> > wrote:
> >
> > > Hi,
> > >
> > > Take a look at the ExtractImages.java source code in
> > > /org/apache/pdfbox/tools/ for the cases where you can take the image
> > > data directly and write it out as is.
> > >
> > > BR
> > > Maruan
> > >
> > >
> > > > Hi all,
> > > >
> > > > I have a use case where I need to extract the images and the text
> > > > content from PDF documents.
> > > > Compared with text extraction, image extraction takes far too long.
> > > >
> > > > Furthermore, we compared the image extraction speed with the Linux
> > > > command *pdfimages*, and it was much faster than PDFBox.
> > > >
> > > > Is there anything I'm missing? I have included the snippet I used
> > > > for image extraction here.
> > > >
> > > > Thanks
> > > > Aravinth
> > > >
> > > >
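> > > > > // Imports needed by this snippet (assuming PDFBox 2.x; ImageIOUtil is in pdfbox-tools).
> > > > > import java.awt.image.BufferedImage;
> > > > > import java.io.BufferedOutputStream;
> > > > > import java.io.File;
> > > > > import java.io.FileOutputStream;
> > > > > import org.apache.pdfbox.cos.COSName;
> > > > > import org.apache.pdfbox.pdmodel.PDDocument;
> > > > > import org.apache.pdfbox.pdmodel.PDPage;
> > > > > import org.apache.pdfbox.pdmodel.PDResources;
> > > > > import org.apache.pdfbox.pdmodel.graphics.PDXObject;
> > > > > import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
> > > > > import org.apache.pdfbox.tools.imageio.ImageIOUtil;
> > > > >
> > > > > // Start of the whole run (this variable was missing from the snippet as posted).
> > > > > long time = System.currentTimeMillis();
> > > > > PDDocument pdDocument = PDDocument.load(new File("test.pdf"));
> > > > > for (PDPage pdPage : pdDocument.getPages())
> > > > > {
> > > > >     PDResources resources = pdPage.getResources();
> > > > >     Iterable<COSName> xObjectNames = resources.getXObjectNames();
> > > > >     for (COSName cosName : xObjectNames)
> > > > >     {
> > > > >         PDXObject xObject = resources.getXObject(cosName);
> > > > >         if (xObject instanceof PDImageXObject)
> > > > >         {
> > > > >             PDImageXObject pdImageXObject = (PDImageXObject) xObject;
> > > > >             // Time the decoding of the image XObject into a BufferedImage.
> > > > >             long start = System.currentTimeMillis();
> > > > >             BufferedImage image = pdImageXObject.getImage();
> > > > >             String nameName = cosName.getName();
> > > > >             System.out.println("Time taken for PDF image object "
> > > > >                     + nameName + " " + (System.currentTimeMillis() - start));
> > > > >             BufferedOutputStream output = new BufferedOutputStream(
> > > > >                     new FileOutputStream(nameName + "." + pdImageXObject.getSuffix()));
> > > > >             // Time the re-encoding and the write to disk.
> > > > >             start = System.currentTimeMillis();
> > > > >             ImageIOUtil.writeImage(image, pdImageXObject.getSuffix(), output);
> > > > >             output.close();
> > > > >             System.out.println("Time taken for write to file object "
> > > > >                     + nameName + " " + (System.currentTimeMillis() - start));
> > > > >         }
> > > > >     }
> > > > > }
> > > > > pdDocument.close();
> > > > > System.err.println("Time taken for extracting for images "
> > > > >         + (System.currentTimeMillis() - time));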
> > > > >
> > > >
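> > > > The PDF image extraction using pdfimages:
> > > >
> > > > > long start = System.currentTimeMillis();
> > > > > ProcessBuilder processBuilder =
> > > > >         new ProcessBuilder("pdfimages", "-j", "test.pdf", "out");
> > > > > Process process = processBuilder.start();
> > > > > // start() returns as soon as the process is launched; wait for
> > > > > // pdfimages to exit so the extraction itself is included in the timing.
> > > > > process.waitFor();
> > > > >
> > > > > System.out.println("Time taken for extracting images "
> > > > >         + (System.currentTimeMillis() - start));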
> > > > >
> > > --
> > > Maruan Sahyoun
> > >
> > > FileAffairs GmbH
> > > Josef-Schappe-Straße 21
> > > 40882 Ratingen
> > >
> > > Tel: +49 (2102) 89497 88
> > > Fax: +49 (2102) 89497 91
> > > sahy...@fileaffairs.de
> > > www.fileaffairs.de
> > >
> > > Geschäftsführer: Maruan Sahyoun
> > > Handelsregister: AG Düsseldorf, HRB 53837
> > > UST.-ID: DE248275827
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> > > For additional commands, e-mail: users-h...@pdfbox.apache.org
> > >
> > >
>
>
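
A note on why the BufferedImage step discussed above tends to dominate: decoding an embedded image to pixels and then encoding it again does far more work than copying the already-encoded bytes through untouched, which is the shortcut Maruan points to in ExtractImages.java. The following is only a minimal sketch of that difference using the JDK's javax.imageio, not PDFBox itself. The class and method names (RawCopyVsReencode, copyRaw, decodeAndReencode, sampleEncodedImage) are hypothetical, and an in-memory PNG stands in for an image stream embedded in a PDF.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import javax.imageio.ImageIO;

/** Hypothetical demo class; none of these names come from PDFBox. */
final class RawCopyVsReencode {

    /** Slow path: decode the encoded bytes to pixels, then encode them again. */
    static byte[] decodeAndReencode(byte[] encoded, String format) throws IOException {
        BufferedImage pixels = ImageIO.read(new ByteArrayInputStream(encoded));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ImageIO.write(pixels, format, out);
        return out.toByteArray();
    }

    /** Fast path: copy the already-encoded bytes unchanged; returns the byte count. */
    static long copyRaw(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            total += read;
        }
        return total;
    }

    /** Builds an in-memory PNG that stands in for an embedded image stream. */
    static byte[] sampleEncodedImage(int width, int height) throws IOException {
        BufferedImage img = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                img.setRGB(x, y, (x * 31 + y * 17) & 0xFFFFFF);
            }
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ImageIO.write(img, "png", out);
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] encoded = sampleEncodedImage(1000, 1000);

        long t0 = System.nanoTime();
        byte[] reencoded = decodeAndReencode(encoded, "png");
        long t1 = System.nanoTime();
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copyRaw(new ByteArrayInputStream(encoded), sink);
        long t2 = System.nanoTime();

        System.out.println("decode + re-encode: " + (t1 - t0) / 1_000_000
                + " ms, " + reencoded.length + " bytes");
        System.out.println("raw copy:           " + (t2 - t1) / 1_000_000
                + " ms, " + copied + " bytes");
    }
}
```

The raw copy only applies when the stored encoding is already a usable file format (for example a JPEG stream, which is what pdfimages -j writes out as is). As far as I know, PDFBox 2.x exposes the still-encoded bytes through PDImageXObject.createInputStream(List<String> stopFilters), but please check ExtractImages.java in your version for the exact conditions it applies.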

