Hi Tim,

I totally understand your personal opinion.
I've been thinking a lot about whether the Tesseract PDF output conflicts with Tika's core function. But on the other hand Tika already offers the possibility to translate output into another language. This is a very similar function to translating an image into a searchable PDF file.

I am focusing so much on the Tika server mainly because it fits so well into a microservice architecture.

My suggestion is to open an issue on GitHub that initially only affects the class TesseractOCRParser.
Then we can see how the issue evolves - ok?


Best regards
Ralph

On 25.04.19 17:44, Tim Allison wrote:
Y. That’d be how to do it.  I regret my personal opinion is that I don’t think this fits well w/in Tika, but I wouldn’t vote against this if a colleague wanted to take it on or if there were a large interest.

If you want to restify tesseract, that should be straightforward. What is the benefit to baking it into tika-server?

On Wed, Apr 24, 2019 at 5:22 PM Ralph Soika <[email protected]> wrote:

    I am still trying to understand the TesseractOCRParser class. It looks
    like the method doOCR is the one which calls the tesseract
    command-line tool. And yes, it seems that there is already a way to
    pass (maybe) any kind of optional parameter? But this one feature
    to generate a "searchable pdf" seems not to be a parameter but a
    kind of option:

    https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage#searchable-pdf-output

    I think 'hocr' and 'pdf' are allowed as the last command option.
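
For illustration, a minimal Java sketch of how such a command line could be assembled with the output type as the trailing argument. The class and method names are hypothetical and not part of Tika; the argument order follows the tesseract call quoted in the original question at the bottom of this thread (leaving out --tessdata-dir). It only builds and prints the command, so it runs without tesseract installed.

```java
import java.util.List;

class TesseractPdfCommand {

    // Hypothetical helper: builds "tesseract <image> <outputBase> -l <lang> pdf".
    // The trailing "pdf" selects the searchable-PDF output, as in the
    // command quoted in this thread.
    static List<String> build(String image, String outputBase, String lang) {
        return List.of("tesseract", image, outputBase, "-l", lang, "pdf");
    }

    public static void main(String[] args) {
        List<String> cmd = build("./testing/eurotext.png", "./testing/eurotext-eng", "eng");
        System.out.println(String.join(" ", cmd));
        // To actually run it (assumes tesseract is on the PATH):
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

Keeping the command assembly separate from the process call makes it easy to check the arguments before anything is executed.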

    And on the other hand the Tika class 'TesseractOCRParser' has this
    setter method 'setOutputType'

    I am not sure but maybe everything is prepared already by the Tika
    implementation?

    If so, then we could add a new Resource Class into the Tika Server:

    @Path("/pdf")
    public class PDFResource {

        // JAX-RS permits only one HTTP method designator per method
        @PUT
        @Consumes("*/*")
        @Produces("application/pdf")
        public String doOCRtoPDF(final InputStream is) throws IOException {
            ....
        }
    }

    But at this point I lose my courage because I know too little
    about the implementation. What do you think? Is there a way to get
    the low-hanging fruit? Or am I on the wrong track...?


    On 24.04.19 19:37, Steven Van Ingelgem wrote:
    As far as I understand, you can call setXXX() functions
    from the TesseractOCRConfig class based on the header prefix you
    just mentioned.
    I use this to change the timeout dynamically (with the
    X-Tika-OCRTimeout http header).
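
As a rough sketch of what such a request could look like from Java 11's built-in HTTP client: the two header names are the ones mentioned in this thread, while the class name, the endpoint http://localhost:9998/tika and the example values are assumptions for illustration. The example only builds the request and prints its headers; nothing is sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class TikaOcrRequest {

    // Hypothetical helper: a PUT request carrying the OCR headers
    // named in this thread. The header values are passed through as given.
    static HttpRequest build(URI endpoint, byte[] document, String lang, int timeout) {
        return HttpRequest.newBuilder(endpoint)
                .header("X-Tika-OCRLanguage", lang)
                .header("X-Tika-OCRTimeout", Integer.toString(timeout))
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    public static void main(String[] args) {
        // Assumption: a tika-server instance reachable at this address.
        URI endpoint = URI.create("http://localhost:9998/tika");
        HttpRequest request = build(endpoint, new byte[0], "eng", 300);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().map());
        // To send it:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```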

    What you would need is a new outputType, I think? OUTPUT_TYPE.PDF
    plus its implementation as a parameter to tesseract?


    Grtz,
    Steven

    On Wed, 24 Apr 2019 at 19:26, Ralph Soika <[email protected]> wrote:

        Hello Tim,

        thanks for your feedback. Yes, I also understand now that
        tika is for text and metadata extraction. And so it makes no
        sense to pollute the project with other functionality - such
        as the generation of new file formats.

        I have written a Docker Image with tika-server. And tika did
        a great job!
        (https://github.com/imixs/imixs-docker/tree/master/tika)

        On the other hand, tesseract seems to support the PDF output
        as a general feature. I took a look into the TesseractOCRParser
        and as far as I understand the code simply passes parameters
        to the tesseract module. In the Tika Server module there is
        already an extension for the language support which is also a
        parameter for the tesseract module. The new tika header param
        is called 'X-Tika-OCRLanguage'.

        https://jira.apache.org/jira/browse/TIKA-1477

        I've been thinking about it and asked myself whether it would be
        an idea to allow passing parameters in general via the HTTP
        header?

        Maybe this is what the header param X_TIKA_OCR_HEADER_PREFIX
        does?

        https://github.com/apache/tika/blob/master/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java

        I do not know enough about the implementation. So maybe
        this is not possible or too complex?

        What do you think about the idea?


        Best regards

        Ralph



        On 24.04.19 18:15, Tim Allison wrote:
        The other vaguely related project that comes to mind is
        https://www.pandoc.org/index.html  but I don't know if that has hooks
        to tesseract or a Rest API...  Sorry!

        On Wed, Apr 24, 2019 at 10:08 AM Tim Allison <[email protected]> wrote:
        Maybe ?

        https://github.com/tleyden/open-ocr



        On Wed, Apr 24, 2019 at 9:58 AM Tim Allison <[email protected]> wrote:
        The goal of Tika is text and metadata extraction.  Our basic output is
        .txt, xhtml or json. We don’t currently support generation of other formats.
        Could you use DropWizard or similar to wrap tesseract if you need it to be
        restful?

        On Wed, Apr 24, 2019 at 8:21 AM Ralph Soika <[email protected]> wrote:
        Hi,

        I have a question about the Tesseract OCR Parser which is part of Tika:
        Is it possible to set the output of tesseract to PDF format? I think
        tesseract supports this option to convert an image file (e.g. tif) into a
        searchable pdf file:

        $ tesseract --tessdata-dir ./ ./testing/eurotext.png ./testing/eurotext-eng -l eng pdf

        I use the Tika REST API and I wonder how I can tell the Tika
        Server to create a PDF output file?


        Thanks for any help


        Ralph

