Hello, we are long-time Tika users who have recently started using Tesseract. We would like to be able to enable/disable Tesseract per extraction, with Tesseract disabled until we choose to enable it.
Use case 1: When we are extracting text/metadata from documents, we want extraction to run as it did before we started using Tesseract. Very few of the images we encounter in a document contain extractable text, so OCR is not worth the substantial performance hit.

Use case 2: When we are extracting metadata from an image, we also do not want OCR. If a batch of images is known to come from a digital camera, it is unlikely they will contain extractable text. Again, not worth the hit.

Use case 3: When we are dealing with images known to be scans of documents, faxes, etc., we definitely want OCR.

Currently I have found no way to satisfy all of these at runtime. Therefore I have "deactivated" Tika's Tesseract parser by passing it a bogus path in the config. I have cloned the parser into another package and have to call it directly (not through Tika) when I want OCR extraction. Obviously this is undesirable long term.

I have been reading various Tika tickets on blacklisting and whatnot, but it is not clear to me whether anything is available today. Even if I could just substitute my own version/subclass of the TesseractOCRParser, that would be a start (I'd have my version of the parser not run if there was no TesseractOCRConfig on the context; see below).

We invoke Tika from various places in our app, and it would be nice if OCR could be "opt in" vs. "opt out" so the traditional no-OCR-needed invocations would work like they always worked, without extra coding. While the Tika-Tesseract integration is great, the performance penalty (up to several seconds per run) is severe enough that you don't want it running and invoking external processes except when absolutely necessary. To me this helps build the case for opt-in vs. opt-out, or whitelist vs. blacklist, for external parsers such as this.

So I mainly wanted to (a) see if there is anything better I could do today and (b) throw my use cases out there for whatever they might be worth.

Thank you,
Brian
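
P.S. To make the opt-in idea concrete, here is a rough sketch of the behavior I have in mind. It is not Tika code: Context, OcrConfig, OcrEngine and OptInOcrParser are hypothetical stand-ins so the example runs on its own. In Tika itself I assume the equivalent check would be something like context.get(TesseractOCRConfig.class) == null on the ParseContext passed to the parser.

```java
import java.util.HashMap;
import java.util.Map;

// Self-contained sketch: every type here is a hypothetical stand-in, not a Tika class.
class OptInOcrSketch {

    /** Stand-in for a per-call parse context: a type-keyed bag of settings. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            // Returns null when nothing was registered for this type.
            return key.cast(entries.get(key));
        }
    }

    /** Stand-in for an OCR config object; its presence alone is the opt-in signal. */
    static final class OcrConfig {
        final String language;

        OcrConfig(String language) {
            this.language = language;
        }
    }

    /** Stand-in for the expensive step that invokes an external process. */
    interface OcrEngine {
        String recognize(byte[] image, OcrConfig config);
    }

    /** Runs OCR only when the caller has put an OcrConfig on the context. */
    static final class OptInOcrParser {
        private final OcrEngine engine;

        OptInOcrParser(OcrEngine engine) {
            this.engine = engine;
        }

        String parse(byte[] image, Context context) {
            OcrConfig config = context.get(OcrConfig.class);
            if (config == null) {
                // No opt-in: skip OCR entirely; the engine is never invoked.
                return "";
            }
            return engine.recognize(image, config);
        }
    }

    public static void main(String[] args) {
        int[] engineCalls = {0};
        OptInOcrParser parser = new OptInOcrParser((image, config) -> {
            engineCalls[0]++;
            return "text(" + config.language + ")";
        });

        // Use cases 1 and 2: nothing on the context, so OCR does not run.
        String plain = parser.parse(new byte[0], new Context());

        // Use case 3: the caller opts in by supplying a config.
        Context scans = new Context();
        scans.set(OcrConfig.class, new OcrConfig("eng"));
        String ocr = parser.parse(new byte[0], scans);

        System.out.println("[" + plain + "] [" + ocr + "] engine calls: " + engineCalls[0]);
    }
}
```

The point of keying on the presence of the config is that existing call sites, which never set one, keep their old no-OCR behavior without any changes.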
