Re: Tika-Server - Tesseract - Output to PDF

Steven Van Ingelgem Wed, 24 Apr 2019 10:49:20 -0700

As far as I understand, you can call the setXXX() functions of the
TesseractOCRConfig class through the header prefix you just mentioned.
I use this to change the timeout dynamically (with the X-Tika-OCRTimeout
HTTP header).
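
A minimal sketch of how I understand that mechanism: the prefix is stripped
from the header name and the remainder is used to look up a matching setter
by reflection. HeaderConfigDemo, DemoOcrConfig and the "X-Tika-OCR" prefix
value are stand-ins made up for this example (the prefix is only inferred
from the header names in this thread); this is not Tika's actual code, so
check TikaResource.java for the real behaviour.

```java
import java.lang.reflect.Method;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; none of these names are taken from Tika.
class HeaderConfigDemo {

    // Assumed prefix, inferred from header names such as X-Tika-OCRTimeout.
    static final String PREFIX = "X-Tika-OCR";

    // Stand-in for an OCR config class; NOT Tika's TesseractOCRConfig.
    public static class DemoOcrConfig {
        private int timeout = 120;
        private String language = "eng";

        public void setTimeout(int timeout) { this.timeout = timeout; }
        public void setLanguage(String language) { this.language = language; }
        public int getTimeout() { return timeout; }
        public String getLanguage() { return language; }
    }

    // Applies every header that starts with PREFIX to the matching setXxx() method.
    static void apply(Map<String, String> headers, Object config) {
        for (Map.Entry<String, String> header : headers.entrySet()) {
            String name = header.getKey();
            if (!name.startsWith(PREFIX)) {
                continue; // unrelated header, ignore it
            }
            String setter = "set" + name.substring(PREFIX.length());
            if (!invokeSetter(config, setter, header.getValue())) {
                throw new IllegalArgumentException("No setter for header " + name);
            }
        }
    }

    // Finds a one-argument setter taking int or String and calls it.
    private static boolean invokeSetter(Object config, String setter, String value) {
        for (Method m : config.getClass().getMethods()) {
            if (!m.getName().equals(setter) || m.getParameterCount() != 1) {
                continue;
            }
            Class<?> type = m.getParameterTypes()[0];
            try {
                if (type == int.class) {
                    m.invoke(config, Integer.parseInt(value));
                    return true;
                }
                if (type == String.class) {
                    m.invoke(config, value);
                    return true;
                }
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Could not call " + setter, e);
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("X-Tika-OCRTimeout", "300");
        headers.put("Accept", "text/plain");
        DemoOcrConfig config = new DemoOcrConfig();
        apply(headers, config);
        System.out.println("timeout = " + config.getTimeout());
    }
}
```

The point of the design is that a new setter on the config class becomes
reachable from a header without touching the server code.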


What you would need is a new outputType, I think: OUTPUT_TYPE.PDF, plus an
implementation that passes the matching parameter to tesseract?
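
A rough sketch of that idea, under the assumption that each output type only
needs to carry the config name that is appended to the tesseract command line
(the trailing "pdf" in the command quoted further down). OutputTypeDemo, the
OutputType enum and buildCommand() are hypothetical names for illustration
and are not Tika's current code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names are illustrative and not taken from Tika.
class OutputTypeDemo {

    // Each output type maps to the tesseract config name passed as the last argument.
    enum OutputType {
        TXT("txt"),
        HOCR("hocr"),
        PDF("pdf");

        final String tesseractConfig;

        OutputType(String tesseractConfig) {
            this.tesseractConfig = tesseractConfig;
        }
    }

    // Builds the argument list: tesseract <input> <outputBase> -l <language> <config>
    static List<String> buildCommand(String input, String outputBase,
                                     String language, OutputType type) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(input);
        cmd.add(outputBase);
        cmd.add("-l");
        cmd.add(language);
        cmd.add(type.tesseractConfig);
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = buildCommand("eurotext.png", "eurotext-eng", "eng", OutputType.PDF);
        System.out.println(String.join(" ", cmd));
    }
}
```

The resulting list could be handed to a ProcessBuilder. The harder part is
what the server should then return, since the result is a binary file rather
than extracted text.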


Grtz,
Steven

On Wed, 24 Apr 2019 at 19:26, Ralph Soika <[email protected]> wrote:

> Hello Tim,
>
> thanks for your feedback. Yes, I also understand now that tika is for text
> and metadata extraction. And so it makes no sense to pollute the project
> with other functionality - such as the generation of new file formats.
>
> I have written a Docker Image with tika-server. And tika did a great job! (
> https://github.com/imixs/imixs-docker/tree/master/tika)
>
> On the other hand, tesseract seems to support the PDF output as a general
> feature. I took a look into the TikaOCRParser and as far as I understand
> the code simply passes parameters to the tesseract module. In the Tika
> Server module there is already an extension for the language support which
> is also a parameter for the tesseract module. The new tika header param is
> called 'X-Tika-OCRLanguage'.
>
> https://jira.apache.org/jira/browse/TIKA-1477
>
> I've been thinking about it and asked myself whether it would be an idea
> to allow parameters to be passed in general via HTTP headers.
> Maybe this is what the header param X_TIKA_OCR_HEADER_PREFIX does?
>
>
> https://github.com/apache/tika/blob/master/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java
>
> I do not know enough about the implementation. So maybe this is not
> possible or too complex?
>
> What do you think about the idea?
>
>
> Best regards
>
> Ralph
>
>
>
> On 24.04.19 18:15, Tim Allison wrote:
>
> The other vaguely related project that comes to mind is
> https://www.pandoc.org/index.html but I don't know if that has hooks
> to tesseract or a REST API...  Sorry!
>
> On Wed, Apr 24, 2019 at 10:08 AM Tim Allison <[email protected]> wrote:
>
> Maybe ?
> https://github.com/tleyden/open-ocr
>
>
>
> On Wed, Apr 24, 2019 at 9:58 AM Tim Allison <[email protected]> wrote:
>
> The goal of Tika is text and metadata extraction.  Our basic output is .txt,
> xhtml or json. We don’t currently support generation of other formats. Could
> you use DropWizard or similar to wrap tesseract if you need it to be RESTful?
>
> On Wed, Apr 24, 2019 at 8:21 AM Ralph Soika <[email protected]> wrote:
>
> Hi,
>
> I have a question about the Tesseract OCR Parser which is part of Tika:
> Is it possible to set the output of tesseract to PDF format? I think
> tesseract supports this option to convert an image file (e.g. tif) into a
> searchable PDF file:
>
> $ tesseract  --tessdata-dir ./ ./testing/eurotext.png ./testing/eurotext-eng 
> -l eng pdf
>
> I use the Tika REST API and I wonder how I can tell the Tika Server to
> create a PDF output file?
>
>
> Thanks for any help
>
>
> Ralph
>
>
>
> --
>
> *Imixs Software Solutions GmbH*
> *Web:* www.imixs.com *Phone:* +49 (0)89-452136 16
> *Office:* Agnes-Pockels-Bogen 1, 80992 München
> Registergericht: Amtsgericht Muenchen, HRB 136045
> Geschaeftsführer: Gaby Heinle u. Ralph Soika
>
> *Imixs* is an open source company, read more: www.imixs.org
>
