Re: Using Tika with another OCR engine

2023-08-14 Thread Tim Allison
Concur with Nick. And yes, I'd frankly copy the TesseractOCRParser into a new module, rename it, modify it to call your OCR engine, build the jar, and add the dependency to your tika bin directory (if you're using Docker?).

On Thu, Aug 10, 2023 at 3:45 AM Cristian Zamfir wrote:
> Hi Nick,
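For anyone following Tim's suggestion, the part of a copied TesseractOCRParser that would change is the call out to the engine. Below is a minimal sketch of that piece only, assuming the replacement engine is a command-line tool. It is not a Tika parser and not Tika API: the class name `ExternalOcrRunner`, the executable name `my-ocr`, and the `<input> <output>` argument order are all hypothetical and would need to match the real engine.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper that shells out to an OCR command-line tool; not part of Tika. */
final class ExternalOcrRunner {
    private final String executable;    // hypothetical CLI name, e.g. "my-ocr"
    private final long timeoutSeconds;  // give up on the engine after this many seconds

    ExternalOcrRunner(String executable, long timeoutSeconds) {
        this.executable = executable;
        this.timeoutSeconds = timeoutSeconds;
    }

    /** Builds the command line. The "<input> <output>" argument order is an assumption. */
    List<String> buildCommand(Path image, Path textOut) {
        return List.of(executable, image.toString(), textOut.toString());
    }

    /** Runs the engine on one image file and returns the text it wrote to a temp file. */
    String recognize(Path image) throws IOException, InterruptedException {
        Path textOut = Files.createTempFile("ocr-", ".txt");
        try {
            Process process = new ProcessBuilder(buildCommand(image, textOut))
                    .redirectErrorStream(true)                        // merge stderr into stdout
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)  // avoid blocking on a full pipe
                    .start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                throw new IOException("OCR engine timed out after " + timeoutSeconds + "s");
            }
            if (process.exitValue() != 0) {
                throw new IOException("OCR engine exited with code " + process.exitValue());
            }
            return Files.readString(textOut, StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(textOut);  // always clean up the temp file
        }
    }
}
```

In a renamed copy of the parser, the text returned by something like `recognize(...)` would be handed to the content handler where the Tesseract output was handled before. An engine exposed over HTTP rather than as a CLI would need a different helper.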

Re: Using Tika with another OCR engine

2023-08-10 Thread Cristian Zamfir
Hi Nick, Thanks, this is very helpful. This looks good, and it sounds like I could still use tika-server out of the box and load a .jar with an external plugin, similar to the way 3rd party plugins are loaded (https://cwiki.apache.org/confluence/display/TIKA/3rd+party+parser+plugins). The goal is to

Re: Using Tika with another OCR engine

2023-08-08 Thread Nick Burch
On Thu, 3 Aug 2023, Cristian Zamfir wrote:
> I am interested in trying out Tika with a different OCR engine and wondering how Tesseract is integrated.

Largely as "just another parser", but IIRC with a bit of logic to allow the "normal" image parsers to also have a go at the file to grab

RE: [External] Using Tika with another OCR engine

2023-08-03 Thread Sandeep Kulkarni via user
Hello, I am interested in trying out Tika with a different OCR engine and wondering how Tesseract is integrated. Is it possible to write a plugin to call a different engine? While images are much easier (you can just detect the file type and use an OCR engine instead

Using Tika with another OCR engine

2023-08-03 Thread Cristian Zamfir
Hello, I am interested in trying out Tika with a different OCR engine and wondering how Tesseract is integrated. Is it possible to write a plugin to call a different engine? While images are much easier (you can just detect the file type and use an OCR engine instead), for scanned PDFs, I assume