Hi.

I am using PDFBox for rendering PDF files into images. There is a certain file 
I am using as benchmark for any PDF library and PDF Box has some problems with 
it (please note that almost all 3rd party PDF engines have issues with this 
file):

https://archive.org/details/AlfaWaffenkatalog1911

Good news: PDFBox renders the file perfectly!
Bad news: it takes forever to do so (the first page takes 16 seconds in PDFDebugger on my 
machine).
I asked myself why this is, identified and "fixed" a few things, and got the time 
down to 6 seconds.
I started fixing these issues earlier this year; I can't work on it all the 
time. (I noticed PDFBOX-5145, which was a good start but misses some things.)

The problem lies in the optimized nature of this file: it stores the white 
of the background, the blackness of the text, an image mask for the text, and 
the drawings separately. This is nothing new; I have a scan of a very old 
magazine that was optimized from 90 MB to 9 MB in a similar way (but with slight 
differences, so it loads in a second).

What you have is basically a low-res picture of white soup, a low-res picture 
of black soup, a very, very high-res single-bit image mask (say 
10000*10000 pixels), and a bunch of normal-res images for the drawings.

The difference from the fast PDF is that the image mask is applied to the black-soup 
image as a mask (the fast PDF renders it directly), and that the image mask 
is stored as JBIG2 instead of CCITTFax.
Since this happens without the final target image resolution in mind, apply 
mask works on the full 10000*10000 pixels.
(Memory requirements: 12 MB for the bitmask, 100 MB for the 8-bit mask (luckily, 
single-bit masks get expanded to only 8 bit; anything else turns into RGB), 
400 MB for the picture, plus one extra 400 MB because there is a pointless 
in-between image.)
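Those numbers follow directly from the pixel count; here is a quick back-of-envelope check (my own sketch, assuming a 10000*10000 mask, 1 byte per pixel for 8-bit gray, and 4 bytes per pixel for an int-backed RGB image):

```java
public class MaskMemory {
    public static void main(String[] args) {
        long pixels = 10_000L * 10_000L;      // 100 million pixels
        long oneBit = pixels / 8;             // packed 1-bit mask
        long eightBit = pixels;               // expanded to 8-bit gray
        long rgb = pixels * 4;                // 4 bytes/pixel, int-backed RGB
        System.out.printf("1-bit: %d MB%n", oneBit / 1_000_000);   // 12 MB
        System.out.printf("8-bit: %d MB%n", eightBit / 1_000_000); // 100 MB
        System.out.printf("RGB:   %d MB%n", rgb / 1_000_000);      // 400 MB
    }
}
```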

Things seen in apply mask:

  *   Scaling the image to the mask is very, very slow if you have a 10x scaling 
factor per axis, a large target, and bicubic interpolation. Bilinear should be used 
in these cases (I used an area enlargement factor of 16 as the threshold, but the 
absolute number of pixels should probably also be taken into account). This is a major 
performance gain (around 2 seconds instead of many more). Nearest neighbor is 
even faster (no time at all) but of course not an option.
  *   There is some wasteful image allocation happening (400 MB).
  *   PDFBOX-5145 bulk copy works in a roundabout way that slows it down.
  *   It's possible to use direct alpha copying, which is even faster 
(optional).
  *   The softmask code could use integer math, which is twice as fast with negligible 
error (0.001%) compared to float (this is a bonus optimization).
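To illustrate the scaling point: a minimal sketch of picking the interpolation hint by enlargement factor. The threshold constant, class, and method name are mine, not PDFBox code:

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

public class MaskScaler {
    // Hypothetical threshold: beyond a 16x area enlargement, bicubic cost
    // explodes while its visual benefit on a mask is negligible.
    private static final long AREA_THRESHOLD = 16;

    static BufferedImage scaleToMask(BufferedImage src, int dstW, int dstH) {
        long srcArea = (long) src.getWidth() * src.getHeight();
        long dstArea = (long) dstW * dstH;
        Object hint = (dstArea / srcArea >= AREA_THRESHOLD)
                ? RenderingHints.VALUE_INTERPOLATION_BILINEAR
                : RenderingHints.VALUE_INTERPOLATION_BICUBIC;

        // TYPE_CUSTOM (0) cannot be passed to the BufferedImage constructor.
        int type = src.getType() == BufferedImage.TYPE_CUSTOM
                ? BufferedImage.TYPE_INT_ARGB : src.getType();
        BufferedImage dst = new BufferedImage(dstW, dstH, type);
        Graphics2D g = dst.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION, hint);
        g.drawImage(src, 0, 0, dstW, dstH, null);
        g.dispose();
        return dst;
    }
}
```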
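And for the softmask point: one way to drop the per-pixel float multiply is the classic divide-by-255 shift trick. This is a sketch of the idea, not the actual PDFBox code (and this particular variant happens to be exact rather than 0.001% off):

```java
public class SoftmaskBlend {
    // Float reference: scale an 8-bit value by an 8-bit alpha.
    static int blendFloat(int value, int alpha) {
        return Math.round(value * (alpha / 255f));
    }

    // Integer version: rounded division by 255 via the well-known
    // (t + (t >> 8)) >> 8 identity; no floating point in the inner loop.
    static int blendInt(int value, int alpha) {
        int t = value * alpha + 128;
        return (t + (t >> 8)) >> 8;
    }
}
```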

With this alone I shaved off almost half the time. I also looked at the mask 
reading part:

  *   from1bit() could be optimized a bit (it also fails to issue a warning and 
break the loop if subsampling is enabled)
  *   reading the jbig2 image in the JBIG2 library is very slow.


I understand that JBIG2 is much more complex than CCITTFax, but careful 
investigation showed that of 2 seconds, 0.5 was spent decoding the image 
itself (depending on page complexity this number can be lower or higher) and 1.5 
on converting the bitmap into a BufferedImage. I optimized that 1.5 seconds 
down to a few milliseconds.
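The conversion speedup boils down to bulk-copying the decoder's packed 1-bit rows straight into a TYPE_BYTE_BINARY raster instead of setting pixels one at a time. A sketch under the assumption that the decoder hands out MSB-first packed rows (method and parameter names are mine):

```java
import java.awt.image.BufferedImage;
import java.awt.image.DataBufferByte;
import java.awt.image.MultiPixelPackedSampleModel;

public class BitmapToImage {
    // Wrap packed 1-bit rows (MSB first, rowStride bytes per source row)
    // into a TYPE_BYTE_BINARY image by bulk-copying whole rows.
    // Note: the default palette maps 0 -> black, 1 -> white; JBIG2 usually
    // means 1 = black, so the caller may need to invert the bits first.
    static BufferedImage fromPackedBits(byte[] packed, int width, int height,
                                        int rowStride) {
        BufferedImage img = new BufferedImage(width, height,
                BufferedImage.TYPE_BYTE_BINARY);
        byte[] dst = ((DataBufferByte) img.getRaster().getDataBuffer()).getData();
        int dstStride = ((MultiPixelPackedSampleModel) img.getSampleModel())
                .getScanlineStride();
        int copyLen = (width + 7) / 8;          // bytes of real pixel data per row
        for (int y = 0; y < height; y++) {
            System.arraycopy(packed, y * rowStride, dst, y * dstStride, copyLen);
        }
        return img;
    }
}
```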

If you are interested in any of this, I can clone the git repo and 
"implement" my changes there so you can pull back into the main repo 
whatever seems worth it.

(What I can already say is that it's probably not going to be 100% 
formatting-style compliant: no leading tabs is one thing, but I can't guarantee 
the whitespace on curly-bracket lines or avoiding single-line if statements.)

Gunnar
