Since I posted some incorrect timing information the other day, due to my 
misunderstanding of how EnableImageProcessing worked, I wanted to post some 
correct information.

I had a 3-page test document, a non-searchable, image-based PDF with one 
image per page.  It was scanned digitally, so it is very clean: no rotation, 
no smudges, high contrast, etc.

Running it through Tika with OCR enabled, but EnableImageProcessing=false and 
ApplyRotation=false took just under 8 seconds.

If I set EnableImageProcessing to true, the time increased by about 1780%, to 
about 145 seconds (over 18x).

If I then set ApplyRotation to True, that increased it by about 9% to 158 
seconds.

It seems to me that, since rotation is an easy thing to test for, ApplyRotation 
could be allowed even when EnableImageProcessing is not set.  If we then didn't 
bother to call ImageMagick when the current rotation was less than some minimal 
amount (say, ½ degree), there would hardly be any extra overhead when the 
majority of the documents don't require any rotation.
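
To make the threshold idea concrete, here is a minimal sketch of the check I 
have in mind.  It is not Tika code: the class name RotationCheck, its methods, 
and the 0.5 degree default are all hypothetical, and it assumes the detected 
angle is already available in degrees.

```java
// Hypothetical helper, not part of Tika: decides whether a detected skew
// angle is large enough to justify a rotation step at all.
public final class RotationCheck {

    // Assumed default, taken from the "say, 1/2 degree" suggestion above.
    public static final double DEFAULT_MIN_ANGLE_DEGREES = 0.5;

    private RotationCheck() {
    }

    // Folds any angle into the range [-180, 180), so 359.8 is treated as -0.2.
    public static double normalize(double angleDegrees) {
        double a = angleDegrees % 360.0;
        if (a >= 180.0) {
            a -= 360.0;
        }
        if (a < -180.0) {
            a += 360.0;
        }
        return a;
    }

    // True only when the normalized angle is at or above the threshold.
    public static boolean needsRotation(double angleDegrees, double minAngleDegrees) {
        return Math.abs(normalize(angleDegrees)) >= minAngleDegrees;
    }
}
```

With something like this, a clean scan that reports an angle of 0.2 degrees 
would skip the rotation step, while a page scanned sideways (90 degrees) would 
still be rotated.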

I wonder if there is any other checking that could be done to more finely tune 
some of the arguments to ImageMagick, since that seems to be the bottleneck.

Even now, it looks like Tika gets the current angle of rotation and just passes 
that to ImageMagick, without any regard to the amount.  Perhaps if the angle 
were already close to zero, we could leave out the rotate option on the call 
to ImageMagick.  It's hard to tell at this point which of the ImageMagick 
options is producing the highest overhead.
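
As a rough illustration of leaving the rotate option out, the sketch below 
builds an ImageMagick convert argument list and only appends -rotate when the 
angle is at or above a threshold.  The class ConvertArgs and its build method 
are hypothetical, and I am assuming a command line of the form "convert input 
[options] output"; the arguments Tika actually passes may well differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch, not Tika's implementation: assembles a "convert"
// command line and omits -rotate when the angle is negligible.
public final class ConvertArgs {

    private ConvertArgs() {
    }

    public static List<String> build(String input, String output,
                                     List<String> otherOptions,
                                     double angleDegrees, double minAngleDegrees) {
        List<String> args = new ArrayList<>();
        args.add("convert");
        args.add(input);
        // Options unrelated to rotation are passed through unchanged.
        args.addAll(otherOptions);
        // Only ask ImageMagick to rotate when the skew is worth correcting.
        if (Math.abs(angleDegrees) >= minAngleDegrees) {
            args.add("-rotate");
            args.add(String.format(Locale.ROOT, "%.2f", angleDegrees));
        }
        args.add(output);
        return args;
    }
}
```

Building the list one option at a time like this would also make it easier to 
time runs with individual options switched off, to see which one is costing 
the most.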

Peter
