Yes, I was just about to start playing with the image expansion.  I’ll let you 
know.

Skipping image processing if no rotation is necessary might be an easy first 
step.  But I still think the two options should be separate.

From: Tim Allison <[email protected]>
Sent: Tuesday, January 12, 2021 10:23 AM
To: [email protected]
Subject: Re: Image processing timings

Peter,
  Thank you for this info!  I'm grateful that you are exercising this part of 
the code base and sharing your results.  Are you running ImageMagick with the 
default 9x image expansion?  Can you time a run with 1x image expansion if you 
haven't?  In my experience with our one unit test file, the 9x expansion took 
minutes...before I killed the process...
  I'm not sure how we can cleanly allow this option, but skipping image 
processing if no rotation is necessary might be useful?
  Thank you, again.

      Best,

            Tim

On Tue, Jan 12, 2021 at 10:14 AM Peter Kronenberg <[email protected]> wrote:
Since I posted some incorrect timing information the other day, due to my 
misunderstanding of how EnableImageProcessing worked, I wanted to post some 
correct information.

I had a 3-page test document, which was a non-searchable, image-based PDF.  1 
image per page.  Scanned digitally, so very clean.  No rotation, no smudges, 
high-contrast, etc.

Running it through Tika with OCR enabled, but EnableImageProcessing=false and 
ApplyRotation=false took just under 8 seconds.

If I set EnableImageProcessing to True, the time increased by 1780% to about 
145 seconds, or over 18x.

If I then set ApplyRotation to True, that increased it by about 9% to 158 
seconds.

It seems to me that since the Rotation is an easy thing to test for, if you 
allowed ApplyRotation to be set even if EnableImageProcessing was not, and then 
we didn’t bother to call ImageMagick if the current rotation was less than some 
minimal amount (say, ½ degree), there would hardly be any extra overhead if the 
majority of the documents didn’t require any rotation.
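
A rough sketch of that check, in Java since that is what Tika is written in. 
This is not Tika's actual code: the class name, the method names, and the 
0.5 degree threshold are all made up for illustration, and the sign of the 
angle handed to ImageMagick is an assumption.  The only real thing used is 
ImageMagick's "convert <in> -rotate <degrees> <out>" command line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Tika: decides whether ImageMagick
// needs to be called at all for a detected skew angle.
final class RotationGate {

    // Assumed threshold, from the "say, 1/2 degree" suggestion.
    static final double MIN_ROTATION_DEGREES = 0.5;

    // True if the detected angle is large enough to be worth correcting.
    static boolean needsRotation(double angleDegrees) {
        // Normalize to (-180, 180] so that 359.8 is treated like -0.2.
        double a = angleDegrees % 360.0;
        if (a > 180.0) {
            a -= 360.0;
        } else if (a <= -180.0) {
            a += 360.0;
        }
        return Math.abs(a) >= MIN_ROTATION_DEGREES;
    }

    // Builds the ImageMagick command line, or returns an empty list when
    // the image can go to OCR untouched and no external process is needed.
    static List<String> buildCommand(String in, String out, double angleDegrees) {
        List<String> cmd = new ArrayList<>();
        if (!needsRotation(angleDegrees)) {
            return cmd;
        }
        cmd.add("convert");
        cmd.add(in);
        cmd.add("-rotate");
        // Assumption: the angle is passed through as detected; the real
        // sign convention would have to match the skew detector.
        cmd.add(String.format(Locale.ROOT, "%.2f", angleDegrees));
        cmd.add(out);
        return cmd;
    }
}
```

With something like this, a caller would only start the ImageMagick process 
when buildCommand() returns a non-empty list, so clean scans would pay only 
for the angle detection.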

I wonder if there is any other checking that could be done to more finely tune 
some of the arguments to ImageMagick, since that seems to be the bottleneck.

Even now, it looks like Tika gets the current angle of rotation and just passes 
that to ImageMagick, without any regard to the amount.  Perhaps if the angle 
were already close to zero, we could leave out the rotate option on the call 
to ImageMagick.  It’s hard to tell at this point which of the ImageMagick 
options is producing the highest overhead.

Peter
