Hello, we are long-time Tika users who have recently started using
Tesseract.  We would like to be able to enable/disable Tesseract per
extraction, with Tesseract disabled until we choose to enable it.

Use case 1: when we are extracting text/metadata from documents we want it
to run as it did before we started using Tesseract. Very few of the images
we encounter in a document will contain extractable text, therefore it is
not worth the substantial performance hit.

Use case 2: when we are extracting metadata from an image we also do not
want OCR... if a batch of images is known to come from a digital camera it
is unlikely they will contain extractable text.  Again, not worth the hit.

Use case 3: when we are dealing with images known to be scans of documents,
faxes, etc. we definitely want OCR.

Currently I have found no way to satisfy all of these at runtime.
Therefore I have "deactivated" Tika's Tesseract parser by passing it a bogus
path in the config.  I have cloned that parser into another package and have
to call it directly (not through Tika) when I want OCR extraction.
Obviously this is undesirable long term.

I have been reading various Tika tickets on blacklisting and whatnot but it
is not clear to me if anything is available today?  Even if I could just
substitute my own version/subclass of the TesseractOCRParser, that would
be a start (I'd have my version of the parser not run if there was no
TesseractOCRConfig on the context; see below).
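
To make the subclass idea concrete, here is a minimal sketch of the opt-in
check.  It deliberately does not use Tika's real classes: SketchContext,
SketchOcrConfig, SketchOcrParser and OptInOcrParser are hypothetical
stand-ins I made up so the example compiles on its own, and a real version
would presumably do the same check against the actual parse context.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a per-call parse context (a bag of typed objects). */
final class SketchContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    <T> T get(Class<T> key) {
        // Class.cast passes null through, so a missing entry yields null.
        return key.cast(entries.get(key));
    }
}

/** Hypothetical OCR config; its presence on the context opts a call in to OCR. */
final class SketchOcrConfig {
    final String language;

    SketchOcrConfig(String language) {
        this.language = language;
    }
}

/** Hypothetical stand-in for the stock OCR parser, which always runs. */
class SketchOcrParser {
    String parse(byte[] image, SketchContext context) {
        // A real parser would invoke the external OCR process here.
        return "ocr-text(" + image.length + " bytes)";
    }
}

/** Subclass that skips OCR unless the caller put a config on the context. */
final class OptInOcrParser extends SketchOcrParser {
    @Override
    String parse(byte[] image, SketchContext context) {
        if (context.get(SketchOcrConfig.class) == null) {
            return ""; // no config on the context: do not run OCR at all
        }
        return super.parse(image, context);
    }
}
```

With something like this, existing call sites that never set a config keep
their old no-OCR behaviour, and only the scanned-document code path has to
add one object to the context.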

We invoke Tika from various places in our app and it would be nice if OCR
could be "opt in" vs. "opt out" so the traditional no-OCR-needed
invocations would work like they always worked w/o extra coding.  While the
Tika-Tesseract integration is great, the performance penalty (up to several
seconds per run) is severe enough that you don't want it running and
invoking external processes except when absolutely necessary... to me this
helps build the case for opt-in vs. opt-out.  Or whitelist vs. blacklist
for external parsers such as this.

So I mainly wanted to (a) see if there is anything better I could do today
and (b) throw my use cases out there for whatever they might be worth.

Thank you,
Brian
