Re: Improving Image Extraction Speed

aravinth thangasami Mon, 10 Feb 2020 09:13:58 -0800

@Maruan
The file is uploaded here for your reference:
https://www.dropbox.com/s/m2p1861q4pthgue/testpdfwithImages.pdf


@Tilman Thanks for your reply. We are using Tika for text extraction,
so we thought of using it for image extraction as well.

Thanks
Aravinth

On Mon, Feb 10, 2020 at 10:36 PM Tilman Hausherr <thaush...@t-online.de>
wrote:

> On 10.02.2020 at 17:13, aravinth thangasami wrote:
> > Thanks Maruan,
> >
> > I have tried that too. Extracting images with *PDFBox* took longer
> > than with the Linux command *pdfimages*.
> > For example, where PDFBox takes 300 milliseconds to extract an image
> > from a PDF, pdfimages takes 10 milliseconds.
> > On checking the code, most of the time went into converting the
> > PDImageXObject to a BufferedImage.
>
>
> pdfimages is from poppler, which is C/C++. So of course it is faster.
> Why do you want to use PDFBox? Poppler works nicely.
>
> Tilman
>
>
>
> >
> > Is there anything I'm missing here? Can anything be improved in
> > converting the image?
> >
> > Thanks
> > Aravinth.
> >
> >
> >
> > On Mon, Feb 10, 2020 at 3:19 PM Maruan Sahyoun <sahy...@fileaffairs.de>
> > wrote:
> >
> >> Hi,
> >>
> >> take a look at the ExtractImages.java source code in
> >> /org/apache/pdfbox/tools/ for cases where you can take the image data
> >> directly and write that out directly.
> >>
> >> BR
> >> Maruan
> >>
> >>
> >>> Hi all,
> >>>
> >>> I have a use case where I need to extract the images and the text
> >>> content from PDF documents.
> >>> Comparing image extraction with text extraction, the time taken for
> >>> image extraction is much larger.
> >>>
> >>> Furthermore, we compared the image extraction speed with the Linux
> >>> command *pdfimages*, which was much faster than PDFBox.
> >>>
> >>> Is there anything I'm missing? I have included the snippet I used
> >>> for image extraction here.
> >>>
> >>> Thanks
> >>> Aravinth
> >>>
> >>>
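> >>>> import java.awt.image.BufferedImage;
> >>>> import java.io.BufferedOutputStream;
> >>>> import java.io.File;
> >>>> import java.io.FileOutputStream;
> >>>> import org.apache.pdfbox.cos.COSName;
> >>>> import org.apache.pdfbox.pdmodel.PDDocument;
> >>>> import org.apache.pdfbox.pdmodel.PDPage;
> >>>> import org.apache.pdfbox.pdmodel.PDResources;
> >>>> import org.apache.pdfbox.pdmodel.graphics.PDXObject;
> >>>> import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
> >>>> import org.apache.pdfbox.tools.imageio.ImageIOUtil;
> >>>>
> >>>> // Overall timer for the whole extraction run.
> >>>> long time = System.currentTimeMillis();
> >>>> PDDocument pdDocument = PDDocument.load(new File("test.pdf"));
> >>>> for (PDPage pdPage : pdDocument.getPages())
> >>>> {
> >>>>     PDResources resources = pdPage.getResources();
> >>>>     Iterable<COSName> xObjectNames = resources.getXObjectNames();
> >>>>     for (COSName cosName : xObjectNames)
> >>>>     {
> >>>>         PDXObject xObject = resources.getXObject(cosName);
> >>>>         if (xObject instanceof PDImageXObject)
> >>>>         {
> >>>>             PDImageXObject pdImageXObject = (PDImageXObject) xObject;
> >>>>             String nameName = cosName.getName();
> >>>>             // Time the decode of the image XObject into a BufferedImage.
> >>>>             long start = System.currentTimeMillis();
> >>>>             BufferedImage image = pdImageXObject.getImage();
> >>>>             System.out.println("Time taken for PDF image object "
> >>>>                     + nameName + " " + (System.currentTimeMillis() - start));
> >>>>             BufferedOutputStream output = new BufferedOutputStream(
> >>>>                     new FileOutputStream(nameName + "." + pdImageXObject.getSuffix()));
> >>>>             // Time re-encoding the image and writing it to the file.
> >>>>             start = System.currentTimeMillis();
> >>>>             ImageIOUtil.writeImage(image, pdImageXObject.getSuffix(), output);
> >>>>             output.close();
> >>>>             System.out.println("Time taken for write to file object "
> >>>>                     + nameName + " " + (System.currentTimeMillis() - start));
> >>>>         }
> >>>>     }
> >>>> }
> >>>> pdDocument.close();
> >>>> System.err.println("Time taken for extracting images "
> >>>>         + (System.currentTimeMillis() - time));
> >>>>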
> >>> The PDF image extraction using pdfimages:
> >>>
> >>>> long start = System.currentTimeMillis();
> >>>> ProcessBuilder processBuilder =
> >>>>         new ProcessBuilder("pdfimages", "-j", "test.pdf", "out");
> >>>> processBuilder.start();
> >>>> System.out.println("Time taken for extracting images "
> >>>>         + (System.currentTimeMillis() - start));
> >>>>
> >> --
> >> Maruan Sahyoun
> >>
> >> FileAffairs GmbH
> >> Josef-Schappe-Straße 21
> >> 40882 Ratingen
> >>
> >> Tel: +49 (2102) 89497 88
> >> Fax: +49 (2102) 89497 91
> >> sahy...@fileaffairs.de
> >> www.fileaffairs.de
> >>
> >> Managing Director: Maruan Sahyoun
> >> Commercial Register: AG Düsseldorf, HRB 53837
> >> VAT ID: DE248275827
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> >> For additional commands, e-mail: users-h...@pdfbox.apache.org
> >>
> >>
>
>
>
>
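Maruan's suggestion in the thread above (take the image data directly and write it out directly) avoids the step Aravinth identified as the slowest: decoding to a BufferedImage and then encoding again. The sketch below illustrates the difference between the two paths using only javax.imageio on an in-memory JPEG. It does not use PDFBox, and the class and method names (RawCopyVsDecode, rawCopy, decodeAndReencode, sampleJpeg) are hypothetical. The raw path only applies where the embedded data is already in a standalone file format such as JPEG.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.imageio.ImageIO;

// Hypothetical illustration class; not part of PDFBox.
class RawCopyVsDecode {

    /** Slow path: decode the encoded bytes to pixels, then encode them again. */
    public static byte[] decodeAndReencode(byte[] encoded, String format) throws IOException {
        BufferedImage image = ImageIO.read(new ByteArrayInputStream(encoded));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ImageIO.write(image, format, out);
        return out.toByteArray();
    }

    /** Fast path: pass the already-encoded bytes through unchanged. */
    public static byte[] rawCopy(byte[] encoded) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = new ByteArrayInputStream(encoded)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }
        return out.toByteArray();
    }

    /** Builds an in-memory JPEG so the sketch needs no input file. */
    public static byte[] sampleJpeg(int width, int height) throws IOException {
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                image.setRGB(x, y, (x * 31 + y * 17) & 0xFFFFFF);
            }
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ImageIO.write(image, "jpg", out);
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] jpeg = sampleJpeg(2000, 2000);

        long t0 = System.nanoTime();
        rawCopy(jpeg);
        long t1 = System.nanoTime();
        decodeAndReencode(jpeg, "jpg");
        long t2 = System.nanoTime();

        System.out.println("Raw copy:          " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("Decode + reencode: " + (t2 - t1) / 1_000_000 + " ms");
    }
}
```

The raw copy returns the bytes unchanged, so it also avoids the quality loss that comes from re-encoding a JPEG. The actual timings depend on the machine and image size.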

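One caveat about the pdfimages timing in the quoted snippet: ProcessBuilder.start() returns as soon as the process has been launched, so a timer stopped right after it measures the launch, not the extraction. A minimal sketch of a measurement that waits for the process to exit could look like this. The class and method names (ExternalCommandTimer, timeCommand) are hypothetical, and the pdfimages arguments are the ones from the quoted snippet.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for timing an external command to completion.
class ExternalCommandTimer {

    /** Runs the command, waits for it to exit, and returns the elapsed wall-clock milliseconds. */
    public static long timeCommand(List<String> command)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Share the parent's stdin/stdout/stderr so the child cannot block on a full pipe.
        builder.inheritIO();

        long start = System.nanoTime();
        Process process = builder.start();
        int exitCode = process.waitFor(); // blocks until the process has finished
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        if (exitCode != 0) {
            throw new IOException("Command failed with exit code " + exitCode);
        }
        return elapsedMs;
    }

    public static void main(String[] args) throws Exception {
        // Requires pdfimages (poppler) on the PATH and test.pdf in the working directory.
        long ms = timeCommand(Arrays.asList("pdfimages", "-j", "test.pdf", "out"));
        System.out.println("Time taken for extracting images " + ms);
    }
}
```

Measured this way, the figure includes process startup as well as the extraction itself, which makes it comparable to the in-process PDFBox timing.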