Re: Tika-Server - Tesseract - Output to PDF

Tim Allison Thu, 25 Apr 2019 08:45:01 -0700

Yes, that would be how to do it. I regret to say that, in my personal opinion,
this doesn’t fit well within Tika, but I wouldn’t vote against it if a
colleague wanted to take it on or if there were broad interest.


If you want to restify tesseract, that should be straightforward. What is
the benefit of baking it into tika-server?
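
For the "restify tesseract" route, a minimal sketch in plain Java might look like the following. It assumes the tesseract binary is on the PATH and uses the JDK's built-in com.sun.net.httpserver.HttpServer instead of DropWizard. The class name TesseractPdfService, the /pdf path, and the fixed language "eng" are hypothetical choices for illustration and are not part of Tika.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

class TesseractPdfService {

    // Builds the command line: tesseract <input> <outputBase> -l <lang> pdf
    // (tesseract appends ".pdf" to the output base name itself).
    static List<String> buildCommand(Path input, Path outputBase, String lang) {
        return List.of("tesseract", input.toString(), outputBase.toString(), "-l", lang, "pdf");
    }

    // Writes the uploaded image to a temp dir, runs tesseract, returns the PDF bytes.
    static byte[] ocrToPdf(InputStream image, String lang) throws IOException {
        Path dir = Files.createTempDirectory("ocr");
        Path input = dir.resolve("input");
        Path outputBase = dir.resolve("output");
        Path pdf = dir.resolve("output.pdf");
        try {
            Files.copy(image, input, StandardCopyOption.REPLACE_EXISTING);
            Process process = new ProcessBuilder(buildCommand(input, outputBase, lang))
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            int exit = process.waitFor();
            if (exit != 0) {
                throw new IOException("tesseract exited with code " + exit);
            }
            return Files.readAllBytes(pdf);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("interrupted while waiting for tesseract", e);
        } finally {
            Files.deleteIfExists(input);
            Files.deleteIfExists(pdf);
            Files.deleteIfExists(dir);
        }
    }

    // Creates (but does not start) a server whose /pdf endpoint returns the OCR result.
    static HttpServer createServer(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/pdf", exchange -> {
            try {
                byte[] pdf = ocrToPdf(exchange.getRequestBody(), "eng");
                exchange.getResponseHeaders().set("Content-Type", "application/pdf");
                exchange.sendResponseHeaders(200, pdf.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(pdf);
                }
            } catch (IOException e) {
                exchange.sendResponseHeaders(500, -1);
            } finally {
                exchange.close();
            }
        });
        return server;
    }
}
```

Calling TesseractPdfService.createServer(8080).start() would then accept an image in the request body on /pdf and answer with the searchable PDF. Error handling, timeouts, and language selection are deliberately left out.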

On Wed, Apr 24, 2019 at 5:22 PM Ralph Soika <[email protected]> wrote:

> I am still trying to understand the TesseractOCRParser class. It looks like the
> method doOCR is the one that calls the tesseract command-line tool. And
> yes, it seems that there is already a way to pass (maybe) any kind of
> optional parameter? But this one feature to generate "searchable pdf" seems
> to be not a parameter but a kind of option:
>
>
> https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage#searchable-pdf-output
>
> I think 'hocr' and 'pdf' are allowed as the last command option.
>
> And on the other hand the Tika class 'TesseractOCRParser' has this setter
> method 'setOutputType'
>
> I am not sure but maybe everything is prepared already by the Tika
> implementation?
>
> If so, then we could add a new resource class to the Tika Server:
>
> @Path("/pdf")
> public class PDFResource {
>
>     @PUT
>     @POST
>     @Consumes("*/*")
>     @Produces("application/pdf")
>     public String doOCRtoPDF(final InputStream is) throws IOException {
>         ....
>     }
> }
>
> But at this point I lose my courage because I know too little about the
> implementation. What do you think? Is there a way to get the low-hanging
> fruit? Or am I on the wrong track...?
>
>
> On 24.04.19 19:37, Steven Van Ingelgem wrote:
>
> As far as I understand, you can call the setXXX() methods of
> the TesseractOCRConfig class based on the header prefix you just mentioned.
> I use this to change the timeout dynamically (with the X-Tika-OCRTimeout
> HTTP header).
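
The header-to-setter mapping described above can be sketched roughly as follows. This is an illustration only, not Tika's actual code: OcrConfig is a hypothetical stand-in for TesseractOCRConfig, and applyHeaders is an assumed helper name. The idea is that the part of the header name after the "X-Tika-OCR" prefix selects a setter, which is then called reflectively with the converted header value.

```java
import java.lang.reflect.Method;
import java.util.Map;

class HeaderConfigSketch {

    static final String PREFIX = "X-Tika-OCR";

    // Hypothetical stand-in for a config class with plain setters.
    static class OcrConfig {
        private String language = "eng";
        private int timeout = 120;

        public void setLanguage(String language) { this.language = language; }
        public void setTimeout(int timeout) { this.timeout = timeout; }
        public String getLanguage() { return language; }
        public int getTimeout() { return timeout; }
    }

    // For each header starting with PREFIX, call the matching single-argument setter.
    static void applyHeaders(Map<String, String> headers, Object config) {
        for (Map.Entry<String, String> header : headers.entrySet()) {
            String name = header.getKey();
            if (!name.regionMatches(true, 0, PREFIX, 0, PREFIX.length())) {
                continue; // not an OCR header, ignore it
            }
            String setter = "set" + name.substring(PREFIX.length());
            for (Method method : config.getClass().getMethods()) {
                if (method.getName().equalsIgnoreCase(setter) && method.getParameterCount() == 1) {
                    Object arg = convert(header.getValue(), method.getParameterTypes()[0]);
                    try {
                        method.invoke(config, arg);
                    } catch (ReflectiveOperationException e) {
                        throw new IllegalArgumentException("cannot apply header " + name, e);
                    }
                }
            }
        }
    }

    // Converts the header string to the setter's parameter type (only a few types handled).
    static Object convert(String value, Class<?> type) {
        if (type == String.class) {
            return value;
        }
        if (type == int.class || type == Integer.class) {
            return Integer.parseInt(value);
        }
        if (type == boolean.class || type == Boolean.class) {
            return Boolean.parseBoolean(value);
        }
        throw new IllegalArgumentException("unsupported parameter type " + type);
    }
}
```

With such a scheme, a header like X-Tika-OCRTimeout: 300 ends up as a call to setTimeout(300), and a new option only needs a new setter on the config class.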
>
> What you would need is a new outputType, I think? OUTPUT_TYPE.PDF plus its
> implementation of the parameter to tesseract?
>
>
> Grtz,
> Steven
>
> On Wed, 24 Apr 2019 at 19:26, Ralph Soika <[email protected]> wrote:
>
>> Hello Tim,
>>
>> thanks for your feedback. Yes, I also understand now that tika is for
>> text and metadata extraction. And so it makes no sense to pollute the
>> project with other functionality - such as the generation of new file
>> formats.
>>
>> I have written a Docker Image with tika-server. And tika did a great job!
>> (https://github.com/imixs/imixs-docker/tree/master/tika)
>>
>> On the other hand, tesseract seems to support PDF output as a general
>> feature. I took a look at the TesseractOCRParser and, as far as I understand,
>> the code simply passes parameters to the tesseract module. In the Tika
>> Server module there is already an extension for language support, which
>> is also a parameter for the tesseract module. The new Tika header param is
>> called 'X-Tika-OCRLanguage'.
>>
>> https://jira.apache.org/jira/browse/TIKA-1477
>>
>> I've been thinking about it and asked myself whether it would be an idea to
>> allow passing parameters in general via HTTP headers.
>> Maybe this is what the header prefix X_TIKA_OCR_HEADER_PREFIX is for?
>>
>>
>> https://github.com/apache/tika/blob/master/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java
>>
>> I do not know enough about the implementation, so maybe this is not
>> possible or too complex?
>>
>> What do you think about the idea?
>>
>>
>> Best regards
>>
>> Ralph
>>
>>
>>
>> On 24.04.19 18:15, Tim Allison wrote:
>>
>> The other vaguely related project that comes to mind is
>> https://www.pandoc.org/index.html but I don't know if that has hooks
>> to tesseract or a REST API...  Sorry!
>>
>> On Wed, Apr 24, 2019 at 10:08 AM Tim Allison <[email protected]> wrote:
>>
>> Maybe ?
>> https://github.com/tleyden/open-ocr
>>
>>
>>
>> On Wed, Apr 24, 2019 at 9:58 AM Tim Allison <[email protected]> wrote:
>>
>> The goal of Tika is text and metadata extraction.  Our basic output is .txt,
>> xhtml or json. We don’t currently support generation of other formats. Could
>> you use DropWizard or similar to wrap tesseract if you need it to be RESTful?
>>
>> On Wed, Apr 24, 2019 at 8:21 AM Ralph Soika <[email protected]> wrote:
>>
>> Hi,
>>
>> I have a question about the Tesseract OCR Parser which is part of Tika:
>> Is it possible to set the output of tesseract to PDF format? I think
>> tesseract supports this option to convert an image file (e.g. tif) into a
>> searchable pdf file:
>>
>> $ tesseract  --tessdata-dir ./ ./testing/eurotext.png ./testing/eurotext-eng 
>> -l eng pdf
>>
>> I use the Tika REST API and I wonder how I can tell the Tika Server to
>> create a PDF output file?
>>
>>
>> Thanks for any help
>>
>>
>> Ralph
>>
>>
>>
>> --
>>
>> *Imixs Software Solutions GmbH*
>> *Web:* www.imixs.com *Phone:* +49 (0)89-452136 16
>> *Office:* Agnes-Pockels-Bogen 1, 80992 München
>> Registergericht: Amtsgericht Muenchen, HRB 136045
>> Geschaeftsführer: Gaby Heinle u. Ralph Soika
>>
>> *Imixs* is an open source company, read more: www.imixs.org
>>
