Concur with Nick. And yes, frankly, I'd copy the TesseractOCRParser into a
new module, rename it, modify it to call your OCR engine, build the jar,
and add it as a dependency in your tika bin directory (if you're using Docker?).
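
A rough sketch of the piece you'd be swapping out, i.e. the call to the external engine. This is plain Java with no Tika classes in it, and it is not how TesseractOCRParser itself is written. The `my-ocr` binary, its `--lang` flag, and the `ExternalOcrSketch` class are all made-up names for illustration; it also assumes your engine writes the recognized text to stdout.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Tika: wraps a command-line OCR engine.
final class ExternalOcrSketch {

    // Builds the argument list for an ASSUMED CLI shape:
    //   <binary> --lang <language> <image>
    // Adjust to whatever your engine actually accepts.
    static List<String> buildCommand(String binary, Path image, String language) {
        return List.of(binary, "--lang", language, image.toString());
    }

    // Runs the command, captures stdout in a temp file, and returns it as text.
    // Writing to a file (rather than reading the pipe) lets the timeout apply
    // even if the engine hangs without closing its output.
    static String run(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        Path outFile = Files.createTempFile("ocr-", ".txt");
        try {
            Process process = new ProcessBuilder(command)
                    .redirectOutput(outFile.toFile())
                    .redirectError(ProcessBuilder.Redirect.DISCARD)
                    .start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                throw new IOException("OCR process timed out");
            }
            if (process.exitValue() != 0) {
                throw new IOException("OCR process exited with code " + process.exitValue());
            }
            return Files.readString(outFile, StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(outFile);
        }
    }
}
```

In the renamed parser you'd spool the incoming stream to a temp file, call something like `run(buildCommand("my-ocr", tmpImage, "eng"), 120)`, and write the returned text to the content handler.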
On Thu, Aug 10, 2023 at 3:45 AM Cristian Zamfir wrote:
Hi Nick,
Thanks, this is very helpful. This looks good, and it sounds like I could still
use tika-server out of the box and load a .jar with an external plugin,
similar to the way 3rd-party plugins are loaded:
https://cwiki.apache.org/confluence/display/TIKA/3rd+party+parser+plugins
--- the goal is to
On Thu, 3 Aug 2023, Cristian Zamfir wrote:
I am interested in trying out Tika with a different OCR engine and
wondering how Tesseract is integrated.
Largely as "just another parser", but IIRC with a bit of logic to allow
the "normal" image parsers to also have a go at the file to grab
with another OCR engine
Hello,
I am interested in trying out Tika with a different OCR engine and
wondering how Tesseract is integrated. Is it possible to write a plugin to
call a different engine? While for images it is much easier, can just
detect the file type and use an OCR engine instead, for scanned PDFs, I
assume