Hi Tim,
I totally understand your personal opinion.
I've been thinking a lot about whether the Tesseract PDF output
conflicts with Tika's core function. But on the other hand Tika already
offers the possibility to translate output into another language. This
is a very similar function to translating an image into a searchable PDF
file.
I am focusing so much on the Tika server only because it fits so
perfectly into a microservice architecture.
My suggestion is to open an issue on GitHub that initially only affects
the class TesseractOCRParser.
And then we see how the issue evolves - ok?
Best regards
Ralph
On 25.04.19 17:44, Tim Allison wrote:
Y. That’d be how to do it. I regret my personal opinion is that I
don’t think this fits well w/in Tika, but I wouldn’t vote against this
if a colleague wanted to take it on or if there were a large interest.
If you want to restify tesseract, that should be straightforward.
What is the benefit to baking it into tika-server?
On Wed, Apr 24, 2019 at 5:22 PM Ralph Soika wrote:
I am still trying to understand the TesseractOCRParser class. It looks
like the method doOCR is the one which calls the tesseract
commandline tool. And yes, it seems that there is already a way to
pass (maybe) any kind of optional parameter? But this one feature
to generate "searchable pdf" seems not to be a parameter but a
kind of option:
https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage#searchable-pdf-output
I think 'hocr' and 'pdf' are allowed as the last command option.
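To make the point about the last command option concrete, here is a minimal sketch of how such a command line could be assembled from Java. The class and method names are made up for illustration and are not taken from TesseractOCRParser; the only things assumed about tesseract are the argument order from the example command in this thread and that a trailing 'pdf' or 'hocr' selects the output format.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: class and method names are hypothetical, not Tika code.
final class TesseractCommandSketch {

    // Builds: tesseract <input> <outputBase> -l <language> [outputConfig]
    // A trailing "pdf" or "hocr" selects the output format; when it is
    // omitted, tesseract writes plain text.
    static List<String> buildCommand(String input, String outputBase,
                                     String language, String outputConfig) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(input);
        cmd.add(outputBase);
        cmd.add("-l");
        cmd.add(language);
        if (outputConfig != null && !outputConfig.isEmpty()) {
            cmd.add(outputConfig);
        }
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = buildCommand(
                "./testing/eurotext.png", "./testing/eurotext-eng", "eng", "pdf");
        System.out.println(String.join(" ", cmd));
        // To actually run it (requires tesseract on the PATH):
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

Keeping the output type as the last, optional list element mirrors the command-line usage, so plain-text output stays the default when nothing is passed.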
And on the other hand the Tika class 'TesseractOCRParser' has this
setter method 'setOutputType'
I am not sure but maybe everything is prepared already by the Tika
implementation?
If so, then we could add a new Resource Class into the Tika Server:
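import java.io.IOException;
import java.io.InputStream;
import javax.ws.rs.Consumes;
import javax.ws.rs.PUT;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;

@Path("/pdf")
public class PDFResource {
    // One HTTP method designator per resource method; PUT is used here.
    @PUT
    @Consumes("*/*")
    @Produces("application/pdf")
    // A PDF is binary, so return bytes rather than a String.
    public byte[] doOCRtoPDF(final InputStream is) throws IOException {
        ....
    }
}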
But at this point I lose my courage because I know too little
about the implementation. What do you think? Is there a way to get
the low-hanging fruit? Or am I on the wrong track...?
On 24.04.19 19:37, Steven Van Ingelgem wrote:
As far as I understand it, you can call setXXX() functions
from the TesseractOCRConfig class based on the header-prefix you
just mentioned.
I use this to change the timeout dynamically (with the
X-Tika-OCRTimeout http header).
What you would need is a new outputType, I think? OUTPUT_TYPE.PDF
plus the implementation that passes the parameter to tesseract?
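As a rough illustration of the header-prefix idea, the sketch below maps a header such as X-Tika-OCRTimeout to a matching setter via reflection. This is not the actual tika-server code: OcrConfigSketch is a hypothetical stand-in for TesseractOCRConfig, and the prefix string and the supported parameter types (int and String) are assumptions made for the example.

```java
import java.lang.reflect.Method;

// Hypothetical stand-in for TesseractOCRConfig; only two settings are modelled.
final class OcrConfigSketch {
    private int timeout = 120;
    private String language = "eng";

    public void setTimeout(int timeout) { this.timeout = timeout; }
    public void setLanguage(String language) { this.language = language; }
    public int getTimeout() { return timeout; }
    public String getLanguage() { return language; }
}

final class HeaderToSetterSketch {
    // Assumed prefix, chosen to match the X-Tika-OCRTimeout header name.
    static final String PREFIX = "X-Tika-OCR";

    // Maps e.g. "X-Tika-OCRTimeout: 300" to config.setTimeout(300).
    static void apply(Object config, String headerName, String value) {
        if (!headerName.startsWith(PREFIX)) {
            return; // not an OCR header, ignore it
        }
        String setter = "set" + headerName.substring(PREFIX.length());
        try {
            for (Method m : config.getClass().getMethods()) {
                if (!m.getName().equals(setter) || m.getParameterCount() != 1) {
                    continue;
                }
                Class<?> type = m.getParameterTypes()[0];
                if (type == int.class) {
                    m.invoke(config, Integer.parseInt(value));
                    return;
                }
                if (type == String.class) {
                    m.invoke(config, value);
                    return;
                }
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not call " + setter, e);
        }
        throw new IllegalArgumentException("No setter for header " + headerName);
    }
}
```

With a mechanism like this, a new output type would only need a new setter on the config class; the header handling itself would not have to change.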
Grtz,
Steven
On Wed, 24 Apr 2019 at 19:26, Ralph Soika wrote:
Hello Tim,
thanks for your feedback. Yes, I also understand now that
tika is for text and metadata extraction. And so it makes no
sense to pollute the project with other functionality - such
as the generation of new file formats.
I have written a Docker Image with tika-server. And tika did
a great job!
(https://github.com/imixs/imixs-docker/tree/master/tika)
On the other hand, tesseract seems to support the PDF output
as a general feature. I took a look into the TesseractOCRParser
and, as far as I understand, the code simply passes parameters
to the tesseract module. In the Tika Server module there is
already an extension for the language support which is also a
parameter for the tesseract module. The new tika header param
is called 'X-Tika-OCRLanguage'.
https://jira.apache.org/jira/browse/TIKA-1477
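For reference, this is roughly how a client could set that header when calling the server. It is only a sketch: the class and method names are invented, the server URL is passed in by the caller, and the /tika endpoint with a PUT body is assumed to be the one used for text extraction.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch only: a hypothetical helper, not part of Tika.
final class TikaOcrRequestSketch {

    // Builds a PUT request that uploads an image and asks for OCR in the
    // given language via the X-Tika-OCRLanguage header.
    static HttpRequest buildRequest(String serverUrl, byte[] image, String language) {
        return HttpRequest.newBuilder(URI.create(serverUrl + "/tika"))
                .header("X-Tika-OCRLanguage", language)
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(image))
                .build();
    }
}
```

The request can then be sent with java.net.http.HttpClient, for example HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()), against a running tika-server instance.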
I've been thinking about it and asked myself whether it would be
an idea to allow passing parameters in general via the HTTP
header.
Maybe this is what the header param X_TIKA_OCR_HEADER_PREFIX
does?
https://github.com/apache/tika/blob/master/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java
I do not know enough about the implementation. So maybe
this is not possible or too complex?
What do you think about the idea?
Best regards
Ralph
On 24.04.19 18:15, Tim Allison wrote:
The other vaguely related project that comes to mind is
https://www.pandoc.org/index.html but I don't know if that has hooks
to tesseract or a Rest API... Sorry!
On Wed, Apr 24, 2019 at 10:08 AM Tim Allison wrote:
Maybe ?
https://github.com/tleyden/open-ocr
On Wed, Apr 24, 2019 at 9:58 AM Tim Allison wrote:
The goal of Tika is text and metadata extraction. Our basic output is
.txt, xhtml or json. We don’t currently support generation of other formats.
Could you use DropWizard or similar to wrap tesseract if you need it to be
restful?
On Wed, Apr 24, 2019 at 8:21 AM Ralph Soika wrote:
Hi,
I have a question about the Tesseract OCR Parser which is part of Tika:
Is it possible to set the output of tesseract to PDF format? I think
tesseract supports this option to convert an image file (e.g. tif) into a
searchable pdf file:
$ tesseract --tessdata-dir ./ ./testing/eurotext.png ./testing/eurotext-eng -l eng pdf
I use the Tika REST API and I wonder how I can tell the Tika
Server to create a PDF output file?
Thanks for any help
Ralph