Hi Nick,

Thanks, this is very helpful. It sounds like I could still use tika-server out of the box and load a .jar containing an external parser, similar to the way 3rd party plugins are loaded:
https://cwiki.apache.org/confluence/display/TIKA/3rd+party+parser+plugins

The goal is to avoid forking Tika just to modify this parser. Does this sound about right?

Thanks,
Cristi

On Tue, Aug 8, 2023 at 5:17 PM Nick Burch <[email protected]> wrote:

> On Thu, 3 Aug 2023, Cristian Zamfir wrote:
> > I am interested in trying out Tika with a different OCR engine and
> > wondering how Tesseract is integrated.
>
> Largely as "just another parser", but IIRC with a bit of logic to allow
> the "normal" image parsers to also have a go at the file to grab metadata
>
> It's all in tika-parser-ocr-module:
>
> https://github.com/apache/tika/tree/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-module
>
> > Is it possible to write a plugin to call a different engine?
>
> Largely would be a case of writing your own parser, registering it for the
> appropriate mime types, and disabling the Tesseract one if you have the
> tesseract binary on your path
>
> > for scanned PDFs, I assume there is some bi-directional communication
> > between Tika and Tesseract to detect inline images. Is that correct?
>
> Nope, the PDF parser will detect any embedded resources (eg images), and
> if enabled will call the appropriate parser for each one
>
> Nick
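
For reference, the "disable the Tesseract one and register your own" step Nick describes could probably be expressed in a tika-config.xml along the lines below. This is only a sketch under assumptions: com.example.ocr.MyOcrParser is a hypothetical class name standing in for the custom parser, and it assumes the jar containing that class is already on the classpath used to start tika-server. Which mime types the parser handles would still be declared by the parser itself.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Keep all the default parsers, but leave out the Tesseract OCR parser -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
    <!-- Hypothetical custom OCR parser; the class name is an assumption -->
    <parser class="com.example.ocr.MyOcrParser"/>
  </parsers>
</properties>
```

With a config like this there should be no need to fork Tika: the stock parsers stay in place and only the OCR step is swapped out.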
