I haven’t measured this, but it isn’t a surprise. To confirm: when you say
“perfectly clean non-searchable PDF”, you mean an image-only PDF with a
single clean image per page?

Right. The image rotation is only applied if image preprocessing is
enabled. rotation.py only calculates the rotation angle; the actual
rotation is done by ImageMagick.
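
To make that gating concrete, here is a rough Java sketch of the flow described above. It is not Tika's actual code: the class name, method name, and parameters are hypothetical, and the angle is assumed to have already been computed by the rotation script. The only real external detail is ImageMagick's `convert` command with its `-rotate` option.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hypothetical names, not Tika's actual classes or methods.
public class PreprocessSketch {

    // Builds the ImageMagick command for one page image. Returns an empty
    // list when preprocessing is disabled, meaning the image would be passed
    // to OCR untouched and the rotation flag is never consulted.
    static List<String> buildCommand(boolean enablePreprocessing,
                                     boolean applyRotation,
                                     double detectedAngle, // assumed output of the rotation script
                                     String in, String out) {
        List<String> cmd = new ArrayList<>();
        if (!enablePreprocessing) {
            return cmd;
        }
        cmd.add("convert");
        cmd.add(in);
        // Rotation is only added inside the preprocessing branch.
        if (applyRotation && detectedAngle != 0.0) {
            cmd.add("-rotate");
            cmd.add(String.valueOf(detectedAngle));
        }
        cmd.add(out);
        return cmd;
    }

    public static void main(String[] args) {
        // Preprocessing off: no command is built, even with rotation requested.
        System.out.println(buildCommand(false, true, 90.0, "in.png", "out.png"));
        // Preprocessing on: the rotation step is included.
        System.out.println(buildCommand(true, true, 90.0, "in.png", "out.png"));
    }
}
```

The point of the sketch is only the nesting: the rotation flag has no effect unless the preprocessing flag is also on.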

On Mon, Jan 11, 2021 at 6:19 PM Peter Kronenberg <[email protected]>
wrote:

> Now that I understand better what this is, I can interpret my results more
> accurately.  And I see that this actually adds **a lot** of overhead,
> even for a perfectly clean non-searchable PDF.  Does this match your
> expectation?
>
>
>
> Looking at the code, it appears that ApplyRotation is **only** done if
> enableImageProcessing is True.  Is this correct?
>
>
>
> *From:* Peter Kronenberg <[email protected]>
> *Sent:* Monday, January 11, 2021 5:30 PM
> *To:* [email protected]; [email protected]
> *Subject:* RE: Turning off ImageProcessing
>
>
>
> I see the Tika default for this is off or FALSE.  How much overhead is
> involved with this?  Would it make sense to default to TRUE?
>
>
>
> *From:* Peter Kronenberg <[email protected]>
> *Sent:* Monday, January 11, 2021 5:25 PM
> *To:* [email protected]; [email protected]
> *Subject:* RE: Turning off ImageProcessing
>
>
>
> Ah, ok.  That’s a bit confusing.  Perhaps for 2.0?
>
> So what exactly does it mean if it’s off?  There is already a flag for
> ApplyRotation.
>
> Exactly what pre-processing is done?
>
>
>
> *From:* Tim Allison <[email protected]>
> *Sent:* Monday, January 11, 2021 3:38 PM
> *To:* [email protected]
> *Subject:* Re: Turning off ImageProcessing
>
>
>
> We should change "enableImageProcessing" to "enableImagePreprocessing".
> That flag covers the rotation.py and ImageMagick preprocessing, NOT OCR.
>
>
>
> As for the warnings...I'm trying to figure out how to push those further
> towards the time of executing tesseract so that people who run tesseract
> get the warning, but those whose files never go down that path don't get
> the warning.
>
>
>
> On Mon, Jan 11, 2021 at 12:50 PM Peter Kronenberg <
> [email protected]> wrote:
>
> Is the EnableImageProcessing flag in TesseractOCRConfig honored?  It seems
> to always do OCR.  And in fact, as long as it finds Tesseract on the path,
> I get this message:
>
> [main] WARN org.apache.tika.parser.ocr.TesseractOCRParser - Tesseract OCR
> is installed and will be automatically applied to image files unless
> you've excluded the TesseractOCRParser from the default parser.
> Tesseract may dramatically slow down content extraction (TIKA-2359).
> As of Tika 1.15 (and prior versions), Tesseract is automatically called.
> In future versions of Tika, users may need to turn the TesseractOCRParser
> on via TikaConfig.
>
>
>
> Is the only way to turn off image processing to remove the OCR parser?
> Can I enable/disable it programmatically?
>
> The easiest way I found to disable it is to provide a bogus path (thanks
> to the hint in TesseractOCRParser#checkInitialization), but that still
> issues the above message (not sure why it can’t check first whether the
> path is valid).
>
>
>
> Is there a better way to do this?
>
>
