Hi Nick,

Thanks, this is very helpful. It sounds like I could still use tika-server
out of the box and load a .jar containing an external parser, similar to
the way 3rd party plugins are loaded:
https://cwiki.apache.org/confluence/display/TIKA/3rd+party+parser+plugins
The goal is to avoid forking Tika just to modify this parser. Does this
sound about right?
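
To check my understanding, here is a rough, untested sketch of the
tika-config.xml I would expect to pass to tika-server. The class name
com.example.MyOcrParser is a made-up placeholder for my own parser, and I am
assuming that parser-exclude is the right way to disable the Tesseract one:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Keep all the default parsers, but leave out the Tesseract OCR parser -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
    <!-- Hypothetical custom OCR parser, shipped in a separate .jar -->
    <parser class="com.example.MyOcrParser"/>
  </parsers>
</properties>
```

My understanding is that the .jar would also need to be on the server's
classpath, and that listing the class in
META-INF/services/org.apache.tika.parser.Parser should let Tika discover it
even without the explicit <parser> entry.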

Thanks,
Cristi


On Tue, Aug 8, 2023 at 5:17 PM Nick Burch <[email protected]> wrote:

> On Thu, 3 Aug 2023, Cristian Zamfir wrote:
> > I am interested in trying out Tika with a different OCR engine and
> > wondering how Tesseract is integrated.
>
> Largely as "just another parser", but IIRC with a bit of logic to allow
> the "normal" image parsers to also have a go at the file to grab metadata
>
> It's all in tika-parser-ocr-module:
>
> https://github.com/apache/tika/tree/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-module
>
> > Is it possible to write a plugin to call a different engine?
>
> Largely would be a case of writing your own parser, registering it for the
> appropriate mime types, and disabling the Tesseract one if you have the
> tesseract binary on your path
>
> > for scanned PDFs, I assume there is some bi-directional communication
> > between Tika and Tesseract to detect inline images. Is that correct?
>
> Nope, the PDF parser will detect any embedded resources (eg images), and
> if enabled will call the appropriate parser for each one
>
> Nick
>
