Re: Tika-Server - Tesseract - Output to PDF

Ralph Soika Wed, 24 Apr 2019 10:27:00 -0700

Hello Tim,

thanks for your feedback. Yes, I also understand now that Tika is for text and metadata extraction, so it makes no sense to pollute the project with other functionality such as the generation of new file formats.

I have written a Docker image with tika-server, and Tika did a great job! (https://github.com/imixs/imixs-docker/tree/master/tika)

On the other hand, tesseract seems to support PDF output as a general feature. I took a look into the TikaOCRParser and, as far as I understand, the code simply passes parameters to the tesseract module. In the Tika Server module there is already an extension for the language support, which is also a parameter for the tesseract module. The new Tika header param is called 'X-Tika-OCRLanguage'.


https://jira.apache.org/jira/browse/TIKA-1477
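
To illustrate how such a header is passed, here is a minimal Python sketch that builds a request for tika-server with the 'X-Tika-OCRLanguage' header set. The host and port (localhost:9998), the /tika path and the helper name build_ocr_request are assumptions for illustration only; just the header name comes from the text above.

```python
import urllib.request


def build_ocr_request(server_url, image_bytes, language):
    """Build (but do not send) a PUT request that asks tika-server to
    extract text from an image, passing the OCR language as a header.

    Assumptions: the server exposes a /tika endpoint and honours the
    'X-Tika-OCRLanguage' header mentioned in this thread.
    """
    return urllib.request.Request(
        url=server_url.rstrip("/") + "/tika",
        data=image_bytes,
        method="PUT",
        headers={
            "Accept": "text/plain",          # ask for plain text output
            "X-Tika-OCRLanguage": language,  # forwarded to tesseract
        },
    )


# Sending the request would look roughly like this (needs a running server):
#   with open("eurotext.png", "rb") as f:
#       req = build_ocr_request("http://localhost:9998", f.read(), "eng")
#   with urllib.request.urlopen(req) as resp:
#       print(resp.read().decode("utf-8"))
```

Note that this only returns the extracted text; it does not produce a PDF.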

I've been thinking about it and asked myself whether it would be an idea to allow passing parameters in general via the HTTP header.


Maybe this is what the header param X_TIKA_OCR_HEADER_PREFIX does?

https://github.com/apache/tika/blob/master/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java
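
The idea of generic parameter passing could be sketched as follows in Python. This is not Tika's actual code; the function name extract_prefixed_params and the prefix value "X-Tika-OCR" are assumptions used only to show how every header starting with a common prefix might be turned into a parameter.

```python
def extract_prefixed_params(headers, prefix="X-Tika-OCR"):
    """Collect all headers that start with `prefix` and return them as a
    dict keyed by the remainder of the header name.

    HTTP header names are case-insensitive, so the prefix is matched
    without regard to case. The prefix value is an assumption.
    """
    params = {}
    for name, value in headers.items():
        matches = name.lower().startswith(prefix.lower())
        if matches and len(name) > len(prefix):
            # Strip the prefix: 'X-Tika-OCRLanguage' -> 'Language'
            params[name[len(prefix):]] = value
    return params


# Example: only the prefixed header is picked up.
example = extract_prefixed_params(
    {"Accept": "text/plain", "X-Tika-OCRLanguage": "deu"}
)
```

A server could then hand the resulting dict to whatever configures tesseract, without needing a dedicated header constant per parameter.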

I do not know enough about the implementation, so maybe this is not possible or too complex?


What do you think about the idea?


Best regards

Ralph



On 24.04.19 18:15, Tim Allison wrote:

The other vaguely related project that comes to mind is
https://www.pandoc.org/index.html but I don't know if that has hooks
to tesseract or a Rest API...  Sorry!

On Wed, Apr 24, 2019 at 10:08 AM Tim Allison <[email protected]> wrote:

Maybe?

https://github.com/tleyden/open-ocr



On Wed, Apr 24, 2019 at 9:58 AM Tim Allison <[email protected]> wrote:

The goal of Tika is text and metadata extraction.  Our basic output is .txt, 
xhtml or json. We don’t currently support generation of other formats. Could 
you use DropWizard or similar to wrap tesseract if you need it to be RESTful?

On Wed, Apr 24, 2019 at 8:21 AM Ralph Soika <[email protected]> wrote:

Hi,

I have a question about the Tesseract OCR Parser which is part of Tika:
Is it possible to set the output of tesseract to PDF format? I think 
tesseract supports this option to convert an image file (e.g. tif) into a 
searchable PDF file:

$ tesseract --tessdata-dir ./ ./testing/eurotext.png ./testing/eurotext-eng -l eng pdf
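
For calling this from a program, a minimal Python sketch could assemble the same command line and run it with subprocess. The function names are hypothetical, and the sketch assumes the tesseract binary is on the PATH and writes its result to <output base>.pdf.

```python
import subprocess


def tesseract_pdf_command(image_path, output_base, language="eng",
                          tessdata_dir=None):
    """Return the argument list for the tesseract call shown above."""
    cmd = ["tesseract"]
    if tessdata_dir is not None:
        cmd += ["--tessdata-dir", tessdata_dir]
    # 'pdf' at the end selects tesseract's PDF output
    cmd += [image_path, output_base, "-l", language, "pdf"]
    return cmd


def image_to_searchable_pdf(image_path, output_base, language="eng",
                            tessdata_dir=None):
    """Run tesseract and return the expected path of the generated PDF.

    Raises subprocess.CalledProcessError if tesseract exits with an error.
    """
    cmd = tesseract_pdf_command(image_path, output_base, language,
                                tessdata_dir)
    subprocess.run(cmd, check=True)
    return output_base + ".pdf"
```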

I use the Tika REST API and I wonder how I can tell the Tika Server to 
create a PDF output file?


Thanks for any help


Ralph

--

*Imixs Software Solutions GmbH*
*Web:* www.imixs.com <http://www.imixs.com> *Phone:* +49 (0)89-452136 16
*Office:* Agnes-Pockels-Bogen 1, 80992 München
Registergericht: Amtsgericht Muenchen, HRB 136045
Geschaeftsführer: Gaby Heinle u. Ralph Soika

*Imixs* is an open source company, read more: www.imixs.org<http://www.imixs.org>
