Re: Tika-Server - Tesseract - Output to PDF

Ralph Soika Wed, 24 Apr 2019 14:22:47 -0700

I am still trying to understand the TesseractOCRParser class. It looks like the method doOCR is the one which calls the tesseract command-line tool. And yes, it seems that there is already a way to pass (maybe) any kind of optional parameter? But this one feature to generate a "searchable pdf" seems not to be a parameter but a kind of option:


https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage#searchable-pdf-output


I think 'hocr' and 'pdf' are allowed as the last command option.
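
To make the point about the trailing option concrete, here is a minimal Java sketch that assembles the same command line quoted later in this thread, with the output option ('pdf' or 'hocr') placed last. The class and method names are hypothetical and are not part of Tika; the sketch only builds the argument list and does not reflect how TesseractOCRParser actually does it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: only assembles the argument list.
class TesseractPdfCommandSketch {

    // Builds: tesseract <image> <outputBase> -l <language> <outputOption>
    // The output option ("pdf" or "hocr") goes last, as on the tesseract wiki page.
    static List<String> buildCommand(String image, String outputBase,
                                     String language, String outputOption) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(image);
        cmd.add(outputBase);
        cmd.add("-l");
        cmd.add(language);
        cmd.add(outputOption);
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = buildCommand(
                "./testing/eurotext.png", "./testing/eurotext-eng", "eng", "pdf");
        System.out.println(String.join(" ", cmd));
        // To actually run it (assumes tesseract is installed and on the PATH):
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

With 'pdf' as the last argument, tesseract is expected to write the result next to the given output base (here ./testing/eurotext-eng.pdf).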

On the other hand, the Tika class 'TesseractOCRParser' has this setter method 'setOutputType'.

I am not sure, but maybe everything is already prepared by the Tika implementation?


If so, then we could add a new resource class to the Tika Server:

@Path("/pdf")
public class PDFResource {

    @PUT
    @POST
    @Consumes("*/*")
    @Produces("application/pdf")
    public String doOCRtoPDF(final InputStream is) throws IOException {
        ....
    }
}

But at this point I lose my courage because I know too little about the implementation. What do you think? Is there a way to get the low-hanging fruit? Or am I on the wrong track...?



On 24.04.19 19:37, Steven Van Ingelgem wrote:

As far as I understand, you can call setXXX() functions from the TesseractOCRConfig class based on the header prefix you just mentioned. I use this to change the timeout dynamically (with the X-Tika-OCRTimeout HTTP header).

What you would need is a new outputType, I think? OUTPUT_TYPE.PDF plus its implementation of the parameter to tesseract?
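
As an illustration of the header-prefix idea, the following Java sketch maps a header such as X-Tika-OCRTimeout to a setTimeout(...) call via reflection. It is only a sketch under assumptions: the prefix value "X-Tika-OCR" is inferred from the two header names mentioned in this thread, OcrConfigStandIn is a hypothetical stand-in for TesseractOCRConfig, and only int and String setters are handled. It is not the actual TikaResource code.

```java
import java.lang.reflect.Method;

// Hypothetical sketch, not the actual TikaResource implementation.
class HeaderToSetterSketch {

    // Assumed prefix, inferred from X-Tika-OCRTimeout and X-Tika-OCRLanguage.
    static final String PREFIX = "X-Tika-OCR";

    // Hypothetical stand-in for TesseractOCRConfig with two setters.
    static class OcrConfigStandIn {
        private int timeout = 120;
        private String language = "eng";
        public void setTimeout(int timeout) { this.timeout = timeout; }
        public void setLanguage(String language) { this.language = language; }
        public int getTimeout() { return timeout; }
        public String getLanguage() { return language; }
    }

    // Strips the prefix, prepends "set", and invokes a matching one-argument
    // setter. Returns true if a setter was found and called.
    // Note: the header name is matched case-sensitively in this sketch.
    static boolean apply(Object config, String headerName, String value) {
        if (!headerName.startsWith(PREFIX)) {
            return false;
        }
        String setterName = "set" + headerName.substring(PREFIX.length());
        try {
            for (Method m : config.getClass().getMethods()) {
                if (!m.getName().equals(setterName) || m.getParameterCount() != 1) {
                    continue;
                }
                Class<?> type = m.getParameterTypes()[0];
                if (type == int.class) {
                    m.invoke(config, Integer.parseInt(value));
                    return true;
                }
                if (type == String.class) {
                    m.invoke(config, value);
                    return true;
                }
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not apply " + headerName, e);
        }
        return false;
    }

    public static void main(String[] args) {
        OcrConfigStandIn config = new OcrConfigStandIn();
        apply(config, "X-Tika-OCRTimeout", "300");
        apply(config, "X-Tika-OCRLanguage", "deu");
        System.out.println(config.getTimeout() + " " + config.getLanguage());
    }
}
```

Under this scheme a new output type would need no extra header plumbing: an X-Tika-OCROutputType header would reach a setOutputType setter, provided the config can convert the header value to its output type.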



Grtz,
Steven

On Wed, 24 Apr 2019 at 19:26, Ralph Soika wrote:


    Hello Tim,

    thanks for your feedback. Yes, I also understand now that tika is
    for text and metadata extraction. And so it makes no sense to
    pollute the project with other functionality - such as the
    generation of new file formats.

    I have written a Docker Image with tika-server. And tika did a
    great job! (https://github.com/imixs/imixs-docker/tree/master/tika)

    On the other hand, tesseract seems to support the PDF output as a
    general feature. I took a look into the TesseractOCRParser and as far
    as I understand the code simply passes parameters to the tesseract
    module. In the Tika Server module there is already an extension
    for the language support which is also a parameter for the
    tesseract module. The new tika header param is called
    'X-Tika-OCRLanguage'.

    https://jira.apache.org/jira/browse/TIKA-1477

    I've been thinking about it and asked myself if it would be an
    idea to allow to pass parameters in general via the HTTP header?

    Maybe this is what the header param X_TIKA_OCR_HEADER_PREFIX does?

    
    https://github.com/apache/tika/blob/master/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java

    I do not know enough about the implementation. So maybe this
    is not possible or too complex?

    What do you think about the idea?


    Best regards

    Ralph



    On 24.04.19 18:15, Tim Allison wrote:

    The other vaguely related project that comes to mind is
    https://www.pandoc.org/index.html  but I don't know if that has hooks
    to tesseract or a Rest API...  Sorry!

    On Wed, Apr 24, 2019 at 10:08 AM Tim Allison wrote:

    Maybe ?

    https://github.com/tleyden/open-ocr



    On Wed, Apr 24, 2019 at 9:58 AM Tim Allison wrote:

    The goal of Tika is text and metadata extraction. Our basic output is
    .txt, xhtml or json. We don’t currently support generation of other
    formats. Could you use DropWizard or similar to wrap tesseract if you
    need it to be RESTful?

    On Wed, Apr 24, 2019 at 8:21 AM Ralph Soika wrote:

    Hi,

    I have a question about the Tesseract OCR Parser which is part of Tika:
    Is it possible to set the output of tesseract to PDF format? I think
    tesseract supports this option to convert an image file (e.g. tif) into a
    searchable pdf file:

    $ tesseract --tessdata-dir ./ ./testing/eurotext.png ./testing/eurotext-eng -l eng pdf

    I use the Tika REST API and I wonder how I can tell the Tika Server to
    create a PDF output file?


    Thanks for any help


    Ralph

--

    *Imixs Software Solutions GmbH*
    *Web:* www.imixs.com <http://www.imixs.com> *Phone:* +49
    (0)89-452136 16
    *Office:* Agnes-Pockels-Bogen 1, 80992 München
    Registergericht: Amtsgericht Muenchen, HRB 136045
    Geschaeftsführer: Gaby Heinle u. Ralph Soika

    *Imixs* is an open source company, read more: www.imixs.org
    <http://www.imixs.org>

