Peter,
  Thank you for this info!  I'm grateful that you are exercising this part
of the code base and sharing your results.  Are you running ImageMagick
with the default 9x image expansion?  Can you time a run with 1x image
expansion if you haven't?  In my experience with our one unit test file,
the 9x expansion took minutes...before I killed the process...
  I'm not sure how we can cleanly allow this option, but skipping image
processing when no rotation is necessary might be useful?
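
  A rough sketch of what I mean, assuming a hypothetical helper (none of
these names are Tika's actual code, and the half-degree threshold is just
the value suggested below). It returns an empty command when there is
nothing to do, so the caller can skip the external process entirely. The
-resize and -rotate flags are real ImageMagick options; the rest of our
current argument list is left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: decides whether ImageMagick
// needs to be called at all and, if so, which options to pass.
final class RotationGate {

    // Assumed threshold in degrees; skew smaller than this is ignored.
    static final double MIN_ROTATION_DEGREES = 0.5;

    /** True when the detected skew is large enough to be worth correcting. */
    static boolean needsRotation(double angleDegrees, double minDegrees) {
        return Math.abs(angleDegrees) >= minDegrees;
    }

    /**
     * Builds the ImageMagick argument list. Returns an empty list when
     * neither image processing nor a meaningful rotation is requested,
     * which signals that the external call can be skipped.
     * angleDegrees is assumed to already be the correction to apply.
     */
    static List<String> buildCommand(String input, String output,
                                     boolean enableImageProcessing,
                                     boolean applyRotation,
                                     double angleDegrees,
                                     int resizePercent) {
        boolean rotate = applyRotation
                && needsRotation(angleDegrees, MIN_ROTATION_DEGREES);
        if (!enableImageProcessing && !rotate) {
            return new ArrayList<>();
        }
        List<String> cmd = new ArrayList<>();
        cmd.add("convert"); // legacy ImageMagick entry point; "magick" on v7
        cmd.add(input);
        if (enableImageProcessing) {
            cmd.add("-resize");
            cmd.add(resizePercent + "%");
        }
        if (rotate) {
            cmd.add("-rotate");
            cmd.add(String.valueOf(angleDegrees));
        }
        cmd.add(output);
        return cmd;
    }

    public static void main(String[] args) {
        // Clean scan, rotation requested but skew is negligible: prints []
        System.out.println(buildCommand("in.png", "out.png", false, true, 0.1, 900));
        // Noticeably skewed scan: only the rotate option is passed
        System.out.println(buildCommand("in.png", "out.png", false, true, 3.0, 900));
    }
}
```

  With something like this, ApplyRotation could be honored on its own, and
documents that are already straight would never reach ImageMagick.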
  Thank you, again.

      Best,

            Tim

On Tue, Jan 12, 2021 at 10:14 AM Peter Kronenberg <[email protected]>
wrote:

> Since I posted some incorrect timing information the other day, due to my
> misunderstanding of how EnableImageProcessing worked, I wanted to post some
> correct information.
>
>
>
> I had a 3-page test document, which was a non-searchable, image-based PDF.
> 1 image per page.  Scanned digitally, so very clean.  No rotation, no
> smudges, high-contrast, etc.
>
>
>
> Running it through Tika with OCR enabled, but EnableImageProcessing=false
> and ApplyRotation=false took just under 8 seconds.
>
>
>
> If I set EnableImageProcessing to True, the time increased by 1780% to
> about 145 seconds, or over 18x.
>
>
>
> If I then set ApplyRotation to True, that increased it by about 9% to 158
> seconds.
>
>
>
> It seems to me that since the Rotation is an easy thing to test for, if
> you allowed ApplyRotation to be set even if EnableImageProcessing was not,
> and then we didn’t bother to call ImageMagick if the current rotation was
> less than some minimal amount (say, ½ degree), there would hardly be any
> extra overhead if the majority of the documents didn’t require any rotation.
>
>
>
> I wonder if there is any other checking that could be done to more finely
> tune some of the arguments to ImageMagick, since that seems to be the
> bottleneck.
>
>
>
> Even now, it looks like Tika gets the current angle of rotation and just
> passes that to ImageMagick, without any regard to the amount.  Perhaps if
> the angle were already close to zero, we could leave out the rotate option
> on the call to ImageMagick.  It’s hard to tell at this point which of the
> ImageMagick options is producing the highest overhead.
>
>
>
> Peter
>
