Since I posted some incorrect timing information the other day, due to my misunderstanding of how EnableImageProcessing worked, I wanted to post some correct information.
I had a 3-page test document: a non-searchable, image-based PDF with 1 image per page. It was scanned digitally, so it is very clean: no rotation, no smudges, high contrast, etc.

Running it through Tika with OCR enabled, but with EnableImageProcessing=false and ApplyRotation=false, took just under 8 seconds. Setting EnableImageProcessing to true increased that by about 1780%, to about 145 seconds (over 18x). Then setting ApplyRotation to true as well added about another 9%, for 158 seconds.

It seems to me that, since rotation is an easy thing to test for, ApplyRotation could be allowed even when EnableImageProcessing is not set, and ImageMagick would not be called at all if the detected rotation was less than some minimal amount (say, ½ degree). Then there would be hardly any extra overhead when the majority of the documents don't require any rotation.

I wonder if there is any other checking that could be done to tune the arguments to ImageMagick more finely, since that seems to be the bottleneck. Even now, it looks like Tika gets the current angle of rotation and just passes it to ImageMagick, regardless of the amount. Perhaps if the angle were already close to zero, we could leave the rotate option out of the call to ImageMagick. At this point it's hard to tell which of the ImageMagick options is producing the highest overhead.

Peter
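
P.S. Here is a rough sketch of the threshold idea, in Java since that is what Tika is written in. The class, method, and constant names are made up for illustration and are not Tika's actual code; the only real piece is ImageMagick's -rotate option on the convert command. The ½ degree cutoff is just the guess from above, not a measured value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Tika: decides whether a rotation step is worth running.
final class RotationGate {

    // Assumed cutoff in degrees; angles smaller than this are treated as "already straight".
    static final double MIN_ROTATION_DEGREES = 0.5;

    // True only when the detected angle is large enough to justify rotating.
    static boolean needsRotation(double angleDegrees) {
        return Math.abs(angleDegrees) >= MIN_ROTATION_DEGREES;
    }

    // True when ImageMagick has to be invoked at all. With image processing off
    // and only rotation requested, a near-zero angle means the call is skipped.
    static boolean shouldCallImageMagick(boolean enableImageProcessing,
                                         boolean applyRotation,
                                         double angleDegrees) {
        return enableImageProcessing || (applyRotation && needsRotation(angleDegrees));
    }

    // Builds a minimal "convert" argument list, adding -rotate only when needed.
    // The angle is passed through unchanged; any sign convention is the caller's business.
    static List<String> buildConvertArgs(String input, String output, double angleDegrees) {
        List<String> args = new ArrayList<>();
        args.add("convert");
        args.add(input);
        if (needsRotation(angleDegrees)) {
            args.add("-rotate");
            args.add(String.format(Locale.ROOT, "%.2f", angleDegrees));
        }
        args.add(output);
        return args;
    }
}
```

With something like this, a clean scan with a detected angle of 0.2 degrees would never reach ImageMagick when only ApplyRotation is on, and would get a convert call without -rotate when EnableImageProcessing is on.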
