Hello Tim,

Thanks for your feedback. Yes, I understand now that Tika is for text and metadata extraction, so it makes no sense to pollute the project with other functionality such as the generation of new file formats.

I have built a Docker image with tika-server, and Tika did a great job! (https://github.com/imixs/imixs-docker/tree/master/tika)

On the other hand, tesseract seems to support PDF output as a general feature. I took a look at the TikaOCRParser and, as far as I understand it, the code simply passes parameters to the tesseract module. The Tika Server module already has an extension for language support, which is also a parameter for the tesseract module. The new Tika header parameter is called 'X-Tika-OCRLanguage'.

https://jira.apache.org/jira/browse/TIKA-1477
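
To make the header mechanism concrete, here is a minimal sketch of a client that sets 'X-Tika-OCRLanguage' on a request to tika-server. It assumes the server runs locally on its default port 9998 and that the image is sent with PUT to the /tika endpoint; the function names are only illustrative.

```python
import urllib.request

# Assumption: tika-server runs locally on its default port 9998.
TIKA_URL = "http://localhost:9998/tika"


def build_ocr_request(image_bytes, language, url=TIKA_URL):
    """Build (but do not send) a PUT request that asks Tika for plain text
    and selects the tesseract language via the X-Tika-OCRLanguage header."""
    return urllib.request.Request(
        url,
        data=image_bytes,
        method="PUT",
        headers={
            "Accept": "text/plain",
            "X-Tika-OCRLanguage": language,
        },
    )


def extract_text(image_path, language="eng"):
    """Send the image to the server and return the extracted text."""
    with open(image_path, "rb") as fh:
        request = build_ocr_request(fh.read(), language)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling extract_text("scan.tif", "deu") would then return the OCR text for a German document, provided the server has tesseract and the matching language data installed.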

I've been thinking about it and asked myself whether it would be a good idea to allow passing parameters in general via HTTP headers.

Maybe this is what the header param X_TIKA_OCR_HEADER_PREFIX does?

https://github.com/apache/tika/blob/master/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java
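
What I have in mind on the client side could look roughly like the sketch below. It assumes the prefix is the string "X-Tika-OCR", which would fit the existing 'X-Tika-OCRLanguage' header but is not verified against TikaResource.java here. Any option name other than "Language" is purely hypothetical.

```python
# Assumption: the prefix constant holds "X-Tika-OCR", matching the existing
# "X-Tika-OCRLanguage" header. This is a guess, not confirmed from the source.
OCR_HEADER_PREFIX = "X-Tika-OCR"


def ocr_headers(options):
    """Map generic OCR options to prefixed HTTP headers.

    {"Language": "deu"} becomes {"X-Tika-OCRLanguage": "deu"}.
    Only "Language" is known from the thread; other keys are hypothetical.
    """
    return {OCR_HEADER_PREFIX + name: str(value)
            for name, value in options.items()}
```

The resulting dictionary could be passed as the headers of any HTTP client request to the server. Whether the server accepts more than the language option this way is exactly my question.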

I do not know enough about the implementation, so maybe this is not possible or too complex?

What do you think about the idea?


Best regards

Ralph



On 24.04.19 18:15, Tim Allison wrote:
The other vaguely related project that comes to mind is
https://www.pandoc.org/index.html but I don't know if that has hooks
to tesseract or a Rest API...  Sorry!

On Wed, Apr 24, 2019 at 10:08 AM Tim Allison <[email protected]> wrote:
Maybe ?

https://github.com/tleyden/open-ocr



On Wed, Apr 24, 2019 at 9:58 AM Tim Allison <[email protected]> wrote:
The goal of Tika is text and metadata extraction.  Our basic output is .txt, 
xhtml or json. We don’t currently support generation of other formats. Could 
you use DropWizard or similar to wrap tesseract if you need it to be RESTful?

On Wed, Apr 24, 2019 at 8:21 AM Ralph Soika <[email protected]> wrote:
Hi,

I have a question about the Tesseract OCR Parser which is part of Tika:
Is it possible to set the output of tesseract to PDF format? I think 
tesseract supports this option to convert an image file (e.g. tif) into a 
searchable PDF file:

$ tesseract --tessdata-dir ./ ./testing/eurotext.png ./testing/eurotext-eng -l eng pdf
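
For anyone who wants to script the same call, the following is a minimal sketch that builds the argument list and runs tesseract from Python. It assumes the tesseract binary is on the PATH; the function names are illustrative and not part of Tika or tesseract.

```python
import subprocess


def tesseract_pdf_command(image, output_base, language="eng", tessdata_dir=None):
    """Return the argument list for a tesseract run with the 'pdf' config."""
    cmd = ["tesseract"]
    if tessdata_dir is not None:
        cmd += ["--tessdata-dir", tessdata_dir]
    # tesseract appends ".pdf" to the output base name itself
    cmd += [image, output_base, "-l", language, "pdf"]
    return cmd


def image_to_searchable_pdf(image, output_base, language="eng", tessdata_dir=None):
    """Run tesseract and return the path of the PDF it should have written."""
    cmd = tesseract_pdf_command(image, output_base, language, tessdata_dir)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return output_base + ".pdf"
```

With the paths from the command above, image_to_searchable_pdf("./testing/eurotext.png", "./testing/eurotext-eng", tessdata_dir="./") should leave the result in ./testing/eurotext-eng.pdf.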

I use the Tika REST API and I wonder how I can tell the Tika Server to 
create a PDF output file?


Thanks for any help


Ralph


--

*Imixs Software Solutions GmbH*
*Web:* www.imixs.com <http://www.imixs.com> *Phone:* +49 (0)89-452136 16
*Office:* Agnes-Pockels-Bogen 1, 80992 München
Registergericht: Amtsgericht Muenchen, HRB 136045
Geschaeftsführer: Gaby Heinle u. Ralph Soika

*Imixs* is an open source company, read more: www.imixs.org <http://www.imixs.org>
