Yes, I was just about to start playing with the image expansion. I’ll let you know.
Skipping image processing if no rotation is necessary might be an easy first step. But I still think the two options should be separate.

From: Tim Allison <[email protected]>
Sent: Tuesday, January 12, 2021 10:23 AM
To: [email protected]
Subject: Re: Image processing timings

Peter,

Thank you for this info! I'm grateful that you are exercising this part of the code base and sharing your results.

Are you running ImageMagick with the default 9x image expansion? Can you time a run with 1x image expansion if you haven't? In my experience with our one unit test file, the 9x expansion took minutes...before I killed the process...

I'm not sure how we can cleanly allow this option, but skipping image processing when no rotation is necessary might be useful?

Thank you, again.

Best,
Tim

On Tue, Jan 12, 2021 at 10:14 AM Peter Kronenberg <[email protected]> wrote:

Since I posted some incorrect timing information the other day, due to my misunderstanding of how EnableImageProcessing worked, I wanted to post some correct information.

I had a 3-page test document, which was a non-searchable, image-based PDF with 1 image per page. It was scanned digitally, so very clean: no rotation, no smudges, high contrast, etc.

Running it through Tika with OCR enabled, but with EnableImageProcessing=false and ApplyRotation=false, took just under 8 seconds. When I set EnableImageProcessing to True, the time increased by 1780% (over 18x) to about 145 seconds. When I then set ApplyRotation to True, it increased by a further 9% or so, to 158 seconds.

It seems to me that Rotation is an easy thing to test for. If you allowed ApplyRotation to be set even when EnableImageProcessing was not, and we didn’t bother to call ImageMagick when the current rotation was less than some minimal amount (say, ½ degree), there would hardly be any extra overhead when the majority of the documents didn’t require any rotation.
I wonder if there is any other checking that could be done to more finely tune some of the arguments to ImageMagick, since that seems to be the bottleneck. Even now, it looks like Tika gets the current angle of rotation and just passes that to ImageMagick, without any regard to the amount. Perhaps if the angle were already close to zero, we could leave out the rotate option on the call to ImageMagick. It’s hard to tell at this point which of the ImageMagick options is producing the highest overhead.

Peter
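
For illustration, here is a minimal Java sketch of the "skip rotation when the angle is close to zero" idea. It is not Tika's actual code: the class RotationGate, its method names, the otherProcessing flag and the 0.5 degree cutoff are all hypothetical, and the -colorspace option only stands in for whatever else image processing would normally pass to ImageMagick. The only real pieces assumed are ImageMagick's convert command and its -rotate option.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper, not part of Tika: decides whether rotation justifies an ImageMagick call. */
final class RotationGate {

    /** Assumed cutoff in degrees, taken from the "say, 1/2 degree" suggestion in the thread. */
    static final double MIN_ROTATION_DEGREES = 0.5;

    private RotationGate() {
    }

    /** True when the detected angle is large enough to be worth rotating. */
    static boolean needsRotation(double angleDegrees) {
        return Math.abs(angleDegrees) >= MIN_ROTATION_DEGREES;
    }

    /**
     * Builds the argument list for ImageMagick's convert command.
     * Returns an empty list when there is nothing to do, so the caller can
     * skip the external process and hand the original image straight to OCR.
     */
    static List<String> buildConvertArgs(String input, String output,
                                         double angleDegrees, boolean otherProcessing) {
        List<String> args = new ArrayList<>();
        boolean rotate = needsRotation(angleDegrees);
        if (!rotate && !otherProcessing) {
            return args;
        }
        args.add("convert");
        args.add(input);
        if (otherProcessing) {
            // Stand-in for the other image processing options (assumption).
            args.add("-colorspace");
            args.add("gray");
        }
        if (rotate) {
            // The angle is passed through unchanged; only tiny angles are dropped.
            args.add("-rotate");
            args.add(String.format(Locale.ROOT, "%.2f", angleDegrees));
        }
        args.add(output);
        return args;
    }
}
```

With this shape, a clean scan with a detected angle of 0.1 degrees and no other processing yields an empty argument list, so no ImageMagick process would be started at all. A 2 degree skew yields a convert call with -rotate only. Whether rotation alone is cheap enough to be worth it would still need to be timed.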
