Yes, that would be how to do it. My personal opinion, I regret to say, is that this doesn't fit well within Tika, but I wouldn't vote against it if a colleague wanted to take it on or if there were broad interest.
If you want to restify tesseract, that should be straightforward. What is the benefit of baking it into tika-server?

On Wed, Apr 24, 2019 at 5:22 PM Ralph Soika wrote:

> I am still trying to understand the TesseractOCRParser class. It looks like the
> method doOCR is the one which calls the tesseract command-line tool. And
> yes, it seems that there is already a way to pass (maybe) any kind of
> optional parameter? But this one feature to generate "searchable pdf" seems
> not to be a parameter but a kind of option:
>
> https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage#searchable-pdf-output
>
> I think 'hocr' and 'pdf' are allowed as the last command option.
>
> On the other hand, the Tika class 'TesseractOCRParser' has this setter
> method 'setOutputType'.
>
> I am not sure, but maybe everything is already prepared by the Tika
> implementation?
>
> If so, then we could add a new resource class to the Tika Server:
>
>     @Path("/pdf")
>     public class PDFResource {
>
>         @PUT
>         @POST
>         @Consumes("*/*")
>         @Produces("application/pdf")
>         public String doOCRtoPDF(final InputStream is) throws IOException {
>             ....
>         }
>     }
>
> But at this point I lose my courage because I know too little about the
> implementation. What do you think? Is there a way to get the low-hanging
> fruit? Or am I on the wrong track...?
>
> On 24.04.19 19:37, Steven Van Ingelgem wrote:
>
> As far as I understand, you can call setXXX() functions from
> the TesseractOCRConfig class based on the header prefix you just mentioned.
> I use this to change the timeout dynamically (with the X-Tika-OCRTimeout
> HTTP header).
>
> What you would need is a new outputType, I think? OUTPUT_TYPE.PDF plus its
> implementation of the parameter to tesseract?
>
> Grtz,
> Steven
>
> On Wed, 24 Apr 2019 at 19:26, Ralph Soika wrote:
>
>> Hello Tim,
>>
>> thanks for your feedback. Yes, I also understand now that Tika is for
>> text and metadata extraction. And so it makes no sense to pollute the
>> project with other functionality, such as the generation of new file
>> formats.
>>
>> I have written a Docker image with tika-server. And Tika did a great job!
>> (https://github.com/imixs/imixs-docker/tree/master/tika)
>>
>> On the other hand, tesseract seems to support PDF output as a general
>> feature. I took a look into the TikaOCRParser and, as far as I understand,
>> the code simply passes parameters to the tesseract module. In the Tika
>> Server module there is already an extension for the language support, which
>> is also a parameter for the tesseract module. The new Tika header param is
>> called 'X-Tika-OCRLanguage'.
>>
>> https://jira.apache.org/jira/browse/TIKA-1477
>>
>> I've been thinking about it and asked myself whether it would be an idea to
>> allow passing parameters in general via the HTTP header.
>> Maybe this is what the header param X_TIKA_OCR_HEADER_PREFIX does?
>>
>> https://github.com/apache/tika/blob/master/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java
>>
>> I do not know enough about the implementation. So maybe this is not
>> possible or too complex?
>>
>> What do you think about the idea?
>>
>> Best regards
>>
>> Ralph
>>
>> On 24.04.19 18:15, Tim Allison wrote:
>>
>> The other vaguely related project that comes to mind
>> is https://www.pandoc.org/index.html but I don't know if that has hooks
>> to tesseract or a REST API... Sorry!
>>
>> On Wed, Apr 24, 2019 at 10:08 AM Tim Allison wrote:
>>
>> Maybe?
>> https://github.com/tleyden/open-ocr
>>
>> On Wed, Apr 24, 2019 at 9:58 AM Tim Allison wrote:
>>
>> The goal of Tika is text and metadata extraction. Our basic output is .txt,
>> xhtml or json. We don't currently support generation of other formats. Could
>> you use DropWizard or similar to wrap tesseract if you need it to be RESTful?
>>
>> On Wed, Apr 24, 2019 at 8:21 AM Ralph Soika wrote:
>>
>> Hi,
>>
>> I have a question about the Tesseract OCR Parser which is part of Tika:
>> Is it possible to set the output of tesseract to PDF format? I think
>> tesseract supports this option to convert an image file (e.g. tif) into a
>> searchable PDF file:
>>
>> $ tesseract --tessdata-dir ./ ./testing/eurotext.png ./testing/eurotext-eng -l eng pdf
>>
>> I use the Tika REST API and I wonder how I can tell the Tika Server to
>> create a PDF output file.
>>
>> Thanks for any help
>>
>> Ralph
>>
>> --
>>
>> *Imixs Software Solutions GmbH*
>> *Web:* www.imixs.com *Phone:* +49 (0)89-452136 16
>> *Office:* Agnes-Pockels-Bogen 1, 80992 München
>> Registered court: Amtsgericht Muenchen, HRB 136045
>> Managing directors: Gaby Heinle and Ralph Soika
>>
>> *Imixs* is an open source company, read more: www.imixs.org
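
For anyone following the "wrap tesseract yourself" route suggested in this thread, here is a minimal sketch in plain Java of calling the tesseract command line with the trailing `pdf` option, the same invocation shape as in Ralph's first mail. It is not Tika code and not how TesseractOCRParser is implemented; the class name `TesseractPdf`, its method names, and the timeout handling are all hypothetical. It assumes a `tesseract` binary is on the PATH and that tesseract writes its result to `<outputBase>.pdf`.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not part of Tika): shells out to the tesseract CLI. */
final class TesseractPdf {

    private TesseractPdf() {}

    /** Builds the argument list: tesseract <input> <outputBase> -l <lang> pdf */
    static List<String> buildCommand(Path input, Path outputBase, String lang) {
        return List.of("tesseract", input.toString(), outputBase.toString(),
                "-l", lang, "pdf");
    }

    /**
     * Runs tesseract and returns the path of the expected PDF.
     * Assumption: tesseract appends ".pdf" to the output base name.
     */
    static Path toSearchablePdf(Path input, Path outputBase, String lang,
                                long timeoutSeconds)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(input, outputBase, lang))
                .redirectErrorStream(true)
                // Discard console output so the child cannot block on a full pipe.
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                .start();

        // Give up if OCR takes longer than the caller allows.
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly();
            throw new IOException("tesseract timed out after " + timeoutSeconds + "s");
        }
        if (process.exitValue() != 0) {
            throw new IOException("tesseract exited with code " + process.exitValue());
        }
        return outputBase.resolveSibling(outputBase.getFileName() + ".pdf");
    }
}
```

A call such as `TesseractPdf.toSearchablePdf(Paths.get("scan.png"), Paths.get("scan-ocr"), "eng", 120)` would then be the piece a small REST wrapper (DropWizard or similar, as mentioned above) hands the uploaded file to before streaming the resulting PDF back.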
