Concur with Nick.  And, yes, I'd frankly copy the TesseractOCRParser into a
new module, rename it, modify it to call your OCR engine, build the jar,
and add the dependency to your tika bin directory (if you're using Docker?).
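
For the "modify it to call your OCR engine" step, here is a rough sketch of
the one piece that really changes. It assumes your engine is a command-line
tool that takes an image path as its last argument and prints the recognized
text to stdout; ExternalOcrRunner and the "my-ocr" command in the comments
are made-up names for illustration, not anything that exists in Tika.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: runs a command-line OCR engine on one image file. */
final class ExternalOcrRunner {

    // Command without the image argument, e.g. ["my-ocr", "--lang", "eng"] (made-up CLI).
    private final List<String> baseCommand;
    private final long timeoutSeconds;

    ExternalOcrRunner(List<String> baseCommand, long timeoutSeconds) {
        this.baseCommand = List.copyOf(baseCommand);
        this.timeoutSeconds = timeoutSeconds;
    }

    /** Appends the image path to the command and returns what the engine wrote to stdout. */
    String recognize(Path image) throws IOException, InterruptedException {
        List<String> command = new ArrayList<>(baseCommand);
        command.add(image.toAbsolutePath().toString());

        // Send stdout to a temp file so a chatty engine cannot block on a full pipe
        // and the timeout below is actually enforced.
        Path out = Files.createTempFile("ocr-", ".txt");
        try {
            Process process = new ProcessBuilder(command)
                    .redirectOutput(out.toFile())
                    .redirectError(ProcessBuilder.Redirect.DISCARD)
                    .start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                throw new IOException("OCR engine timed out after " + timeoutSeconds + "s");
            }
            if (process.exitValue() != 0) {
                throw new IOException("OCR engine exited with code " + process.exitValue());
            }
            return Files.readString(out, StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(out);
        }
    }
}
```

In the copied parser, the parse method would spool the incoming stream to a
temp file, call something like recognize() on it, and write the returned
text to the content handler. If your engine is a library or an HTTP service
rather than a binary, only the body of recognize() would differ.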

On Thu, Aug 10, 2023 at 3:45 AM Cristian Zamfir <[email protected]>
wrote:

> Hi Nick,
>
> Thanks, this is very helpful. This looks good and sounds like I could
> still use tika-server out of the box and load a .jar with an external
> plugin similar to the way 3rd party plugins are loaded
> https://cwiki.apache.org/confluence/display/TIKA/3rd+party+parser+plugins
> --- the goal is to not have to fork Tika just to modify this parser. Does
> this sound about right?
>
> Thanks,
> Cristi
>
>
> On Tue, Aug 8, 2023 at 5:17 PM Nick Burch <[email protected]> wrote:
>
>> On Thu, 3 Aug 2023, Cristian Zamfir wrote:
>> > I am interested in trying out Tika with a different OCR engine and
>> > wondering how Tesseract is integrated.
>>
>> Largely as "just another parser", but IIRC with a bit of logic to allow
>> the "normal" image parsers to also have a go at the file to grab metadata
>>
>> It's all in tika-parser-ocr-module:
>>
>> https://github.com/apache/tika/tree/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-module
>>
>> > Is it possible to write a plugin to call a different engine?
>>
>> Largely would be a case of writing your own parser, registering it for
>> the appropriate mime types, and disabling the Tesseract one if you have
>> the tesseract binary on your path
>>
>> > for scanned PDFs, I assume there is some bi-directional communication
>> > between Tika and Tesseract to detect inline images. Is that correct?
>>
>> Nope, the PDF parser will detect any embedded resources (eg images), and
>> if enabled will call the appropriate parser for each one
>>
>> Nick
>>
>
