Re: Improving Image Extraction Speed

Tilman Hausherr Mon, 10 Feb 2020 10:09:59 -0800

On 10.02.2020 at 18:20, aravinth thangasami wrote:

@Tilman the performance gap between poppler and PDFBox is very large, so I
thought something could be improved here too.
One thing I observed is that converting the PDImageXObject to a Java
BufferedImage, including the decoding, takes some time.


It looks like reducing the operations there could save some time.

You could try to use the twelvemonkeys image library. Maybe it is a bit faster.
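
TwelveMonkeys plugs into the standard javax.imageio service registry, so adding its JARs to the classpath is normally all that is needed. A minimal sketch for checking which readers are registered for a format, using only JDK calls (the class and method names are hypothetical, chosen for illustration):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;

final class ImageIoReaders {

    /** Returns the class names of all ImageIO readers registered for the format. */
    static List<String> readerClassNames(String formatName) {
        // Pick up plugins that were added to the classpath after startup.
        ImageIO.scanForPlugins();
        List<String> names = new ArrayList<>();
        Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName(formatName);
        while (readers.hasNext()) {
            names.add(readers.next().getClass().getName());
        }
        return names;
    }

    public static void main(String[] args) {
        // Prints one line per registered JPEG reader.
        for (String name : readerClassNames("jpeg")) {
            System.out.println(name);
        }
    }
}
```

If a third-party plugin is on the classpath, its reader class should appear in this list next to the JDK's built-in one. Whether it is actually faster for a given PDF has to be measured.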

ExtractImages does use a shortcut for the jpeg image when possible, or when the "-directJPEG" option is set. This may be faster than your code. But the other ones will still use BufferedImage.
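
The idea behind that shortcut can be sketched without PDFBox: when the embedded data is already a JPEG, the encoded bytes are copied to the output unchanged instead of being decoded into a BufferedImage and encoded again. The sketch below uses only the JDK; the in-memory byte array stands in for the encoded stream of an image XObject, and all class and method names are hypothetical. In PDFBox 2.x the encoded bytes can be obtained with `PDImageXObject.createInputStream(List<String> stopFilters)`, passing the DCTDecode filter name; that call is mentioned for orientation only and is not exercised here.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import javax.imageio.ImageIO;

final class JpegShortcutSketch {

    /** Returns true if the data starts with the JPEG start-of-image marker (FF D8). */
    static boolean looksLikeJpeg(byte[] data) {
        return data.length >= 2
                && (data[0] & 0xFF) == 0xFF
                && (data[1] & 0xFF) == 0xD8;
    }

    /** Fast path: copy the encoded bytes unchanged; no BufferedImage is created. */
    static void copyEncoded(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }

    /** Slow path: decode to a BufferedImage, then encode again as JPEG. */
    static void decodeAndReencode(InputStream in, OutputStream out) throws IOException {
        BufferedImage image = ImageIO.read(in);
        if (image == null) {
            throw new IOException("no ImageIO reader could decode this data");
        }
        ImageIO.write(image, "jpg", out);
    }

    /** Creates a small JPEG in memory as a stand-in for an embedded image stream. */
    static byte[] sampleJpeg() throws IOException {
        BufferedImage image = new BufferedImage(64, 64, BufferedImage.TYPE_INT_RGB);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ImageIO.write(image, "jpg", out);
        return out.toByteArray();
    }
}
```

The fast path produces a byte-for-byte copy of the embedded data, while the slow path pays for a full decode and a lossy re-encode. For images stored with other filters there is no encoded file to copy, which is why those still go through BufferedImage.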


Tilman


Thanks
Aravinth

On Mon, Feb 10, 2020 at 10:43 PM aravinth thangasami <
aravinththangas...@gmail.com> wrote:

@Maruan
The file is uploaded here, for your reference
https://www.dropbox.com/s/m2p1861q4pthgue/testpdfwithImages.pdf

@Tilman Thanks for your reply. We are using Tika for text extraction,
and we thought of using the same for extracting images.

Thanks
Aravinth

On Mon, Feb 10, 2020 at 10:36 PM Tilman Hausherr <thaush...@t-online.de>
wrote:

On 10.02.2020 at 17:13, aravinth thangasami wrote:

Thanks Maruan,

I have tried that too. The time taken for extracting images using *PDFBox*
was larger than with the Linux command *pdfimages*.
For example, if extracting the images from a PDF takes 300 milliseconds
using PDFBox, it takes 10 milliseconds using pdfimages.
On checking the code, most of the time was spent converting the
PDImageXObject to a BufferedImage.


pdfimages is from poppler, which is C/C++. So of course it is faster.
Why do you want to use PDFBox? Poppler works nicely.

Tilman

Is there anything I'm missing here? Can anything be improved in converting
the image?

Thanks
Aravinth.



On Mon, Feb 10, 2020 at 3:19 PM Maruan Sahyoun <sahy...@fileaffairs.de>
wrote:

Hi,

take a look at the ExtractImages.java source code in
/org/apache/pdfbox/tools/ for cases where you can take the image data
directly and write that out directly.

BR
Maruan

Hi all,

I have a use case where I need to extract the images and the text content
from PDF documents.
Comparing the image extraction and text extraction speed, the time taken
for image extraction is much larger.

Furthermore, we compared the image extraction speed with the Linux
command *pdfimages*, and it was much faster than PDFBox.

Is there anything I'm missing? I have included the snippet I used for
image extraction here.

Thanks
Aravinth


              // Start of the overall timing (the original snippet did not show where "time" was set)
              long time = System.currentTimeMillis();
              PDDocument pdDocument = PDDocument.load(new File("test.pdf"));
              for (PDPage pdPage : pdDocument.getPages())
              {
                  PDResources resources = pdPage.getResources();
                  Iterable<COSName> xObjectNames = resources.getXObjectNames();
                  for (COSName cosName : xObjectNames)
                  {
                      PDXObject xObject = resources.getXObject(cosName);
                      if (xObject instanceof PDImageXObject)
                      {
                          PDImageXObject pdImageXObject = (PDImageXObject) xObject;
                          long start = System.currentTimeMillis();
                          // Decodes the image stream into a BufferedImage
                          BufferedImage image = pdImageXObject.getImage();
                          String nameName = cosName.getName();
                          System.out.println("Time taken for PDF image object "
                                  + nameName + " " + (System.currentTimeMillis() - start));
                          BufferedOutputStream output = new BufferedOutputStream(
                                  new FileOutputStream(nameName + "." + pdImageXObject.getSuffix()));
                          start = System.currentTimeMillis();
                          // Re-encodes the BufferedImage and writes it to the file
                          ImageIOUtil.writeImage(image, pdImageXObject.getSuffix(), output);
                          output.close();
                          System.out.println("Time taken for write to file object "
                                  + nameName + " " + (System.currentTimeMillis() - start));
                      }
                  }
              }
              pdDocument.close();
              System.err.println("Time taken for extracting for images "
                      + (System.currentTimeMillis() - time));

The PDF image extraction using pdfimages:

   long start = System.currentTimeMillis();
   ProcessBuilder processBuilder = new ProcessBuilder("pdfimages", "-j", "test.pdf", "out");
   // start() returns as soon as the process is launched; waitFor() blocks
   // until pdfimages has exited, so the whole extraction is timed
   processBuilder.start().waitFor();
   System.out.println("Time taken for extracting images "
           + (System.currentTimeMillis() - start));
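
One caveat when timing an external tool from Java: `ProcessBuilder.start()` returns once the process has been launched, not when it has finished, so the elapsed time is only meaningful if the caller waits for the process to exit. A minimal sketch of a timing helper, assuming Java 9 or later (the class and method names are hypothetical):

```java
import java.io.IOException;
import java.util.List;

final class ExternalCommandTimer {

    /**
     * Runs the command, waits for it to exit, and returns the elapsed
     * wall-clock time in milliseconds. Output is discarded so the child
     * process cannot block on a full pipe.
     */
    static long runAndTimeMillis(List<String> command)
            throws IOException, InterruptedException {
        long start = System.nanoTime();
        Process process = new ProcessBuilder(command)
                .redirectErrorStream(true)
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                .start();
        int exitCode = process.waitFor(); // blocks until the process has exited
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        if (exitCode != 0) {
            throw new IOException("command failed with exit code " + exitCode);
        }
        return elapsedMillis;
    }
}
```

For the comparison in this thread the call would be `ExternalCommandTimer.runAndTimeMillis(List.of("pdfimages", "-j", "test.pdf", "out"))`. The result includes process start-up time, and the PDFBox figure includes JVM warm-up on the first run, so both sides are best measured over several runs.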



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org

