As far as I understand it, you can call the setXXX() methods of the TesseractOCRConfig class through HTTP headers that start with the header prefix you just mentioned. I use this to change the timeout dynamically (with the X-Tika-OCRTimeout HTTP header).
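To illustrate the idea, here is a minimal sketch of how a prefixed header could be mapped onto a setter by reflection. It is not Tika's actual code: DemoOcrConfig is a hypothetical stand-in for TesseractOCRConfig, HeaderConfigSketch and apply() are names I made up, and the prefix value "X-Tika-OCR" is my assumption based on the two header names mentioned in this thread.

```java
import java.lang.reflect.Method;

// Hypothetical stand-in for a config class with setXXX() methods.
// This is NOT Tika's real TesseractOCRConfig.
class DemoOcrConfig {
    private int timeout = 120;
    private String language = "eng";

    public void setTimeout(int timeout) { this.timeout = timeout; }
    public void setLanguage(String language) { this.language = language; }
    public int getTimeout() { return timeout; }
    public String getLanguage() { return language; }
}

class HeaderConfigSketch {
    // Assumed prefix, inferred from X-Tika-OCRTimeout and X-Tika-OCRLanguage.
    static final String PREFIX = "X-Tika-OCR";

    // Maps "<PREFIX>Foo: value" onto config.setFoo(value).
    // Returns false when the header does not match any supported setter.
    static boolean apply(Object config, String header, String value) throws Exception {
        if (!header.startsWith(PREFIX)) {
            return false;
        }
        String setter = "set" + header.substring(PREFIX.length());
        for (Method m : config.getClass().getMethods()) {
            if (!m.getName().equals(setter) || m.getParameterCount() != 1) {
                continue;
            }
            Class<?> type = m.getParameterTypes()[0];
            if (type == int.class) {
                m.invoke(config, Integer.parseInt(value));
            } else if (type == String.class) {
                m.invoke(config, value);
            } else {
                continue; // this sketch only handles int and String setters
            }
            return true;
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        DemoOcrConfig cfg = new DemoOcrConfig();
        apply(cfg, "X-Tika-OCRTimeout", "300");
        apply(cfg, "X-Tika-OCRLanguage", "deu");
        System.out.println(cfg.getTimeout() + " " + cfg.getLanguage()); // prints "300 deu"
    }
}
```

With a mechanism like this, a new option only needs a setter on the config class to become reachable from a header, which is why a PDF output type would mostly be a matter of adding the option and passing it on to tesseract.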
What you would need is a new outputType, I think? OUTPUT_TYPE.PDF, plus an implementation that passes the matching parameter to tesseract?

Grtz,
Steven

On Wed, 24 Apr 2019 at 19:26, Ralph Soika <[email protected]> wrote:

> Hello Tim,
>
> thanks for your feedback. Yes, I also understand now that tika is for text
> and metadata extraction. And so it makes no sense to pollute the project
> with other functionality - such as the generation of new file formats.
>
> I have written a Docker Image with tika-server. And tika did a great job!
> (https://github.com/imixs/imixs-docker/tree/master/tika)
>
> On the other hand, tesseract seems to support the PDF output as a general
> feature. I took a look into the TikaOCRParser and as far as I understand
> the code simply passes parameters to the tesseract module. In the Tika
> Server module there is already an extension for the language support which
> is also a parameter for the tesseract module. The new tika header param is
> called 'X-Tika-OCRLanguage'.
>
> https://jira.apache.org/jira/browse/TIKA-1477
>
> I've been thinking about it and asked myself if it would be an idea to
> allow passing parameters in general via the HTTP header?
> Maybe this is what the header param X_TIKA_OCR_HEADER_PREFIX does?
>
> https://github.com/apache/tika/blob/master/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java
>
> I do not know enough about the implementation. So maybe this is not
> possible or too complex?
>
> What do you think about the idea?
>
> Best regards
>
> Ralph
>
> On 24.04.19 18:15, Tim Allison wrote:
>
> The other vaguely related project that comes to mind
> is https://www.pandoc.org/index.html but I don't know if that has hooks
> to tesseract or a Rest API... Sorry!
>
> On Wed, Apr 24, 2019 at 10:08 AM Tim Allison <[email protected]> wrote:
>
> Maybe?
> https://github.com/tleyden/open-ocr
>
> On Wed, Apr 24, 2019 at 9:58 AM Tim Allison <[email protected]> wrote:
>
> The goal of Tika is text and metadata extraction. Our basic output is .txt,
> xhtml or json. We don't currently support generation of other formats. Could
> you use DropWizard or similar to wrap tesseract if you need it to be restful?
>
> On Wed, Apr 24, 2019 at 8:21 AM Ralph Soika <[email protected]> wrote:
>
> Hi,
>
> I have a question about the Tesseract OCR Parser which is part of Tika:
> Is it possible to define the output of tesseract to PDF format? I think
> tesseract supports this option to convert an image file (e.g. tif) into a
> searchable pdf file:
>
> $ tesseract --tessdata-dir ./ ./testing/eurotext.png ./testing/eurotext-eng -l eng pdf
>
> I use the tika Rest API and I wonder how I can tell the Tika Server to
> create a PDF output file?
>
> Thanks for any help
>
> Ralph
>
> --
>
> *Imixs Software Solutions GmbH*
> *Web:* www.imixs.com *Phone:* +49 (0)89-452136 16
> *Office:* Agnes-Pockels-Bogen 1, 80992 München
> Registergericht: Amtsgericht Muenchen, HRB 136045
> Geschaeftsführer: Gaby Heinle u. Ralph Soika
>
> *Imixs* is an open source company, read more: www.imixs.org
