Hello Tim,
thanks for your feedback. Yes, I now understand that Tika is for text
and metadata extraction, so it makes no sense to pollute the project
with other functionality such as the generation of new file formats.
I have built a Docker image with tika-server, and Tika did a great
job! (https://github.com/imixs/imixs-docker/tree/master/tika)
On the other hand, tesseract seems to support PDF output as a general
feature. I took a look at the TesseractOCRParser and, as far as I
understand the code, it simply passes parameters to the tesseract module.
The Tika Server module already has an extension for language support,
which is also a parameter for the tesseract module. The new Tika
header parameter is called 'X-Tika-OCRLanguage'.
https://jira.apache.org/jira/browse/TIKA-1477
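For illustration, here is a minimal Python sketch of how a client might pass the OCR language to tika-server through that header. The server address `http://localhost:9998/tika`, the helper names `build_ocr_request` and `extract_text`, and the `Accept` value are assumptions made for this example; only the header name 'X-Tika-OCRLanguage' comes from this thread.

```python
import urllib.request

# Assumed address of a locally running tika-server (not taken from the thread).
TIKA_URL = "http://localhost:9998/tika"


def build_ocr_request(data: bytes, language: str, url: str = TIKA_URL) -> urllib.request.Request:
    """Build (but do not send) a PUT request that asks for OCR in `language`."""
    headers = {
        "Accept": "text/plain",          # request plain-text output
        "X-Tika-OCRLanguage": language,  # header named in this thread
    }
    return urllib.request.Request(url, data=data, headers=headers, method="PUT")


def extract_text(path: str, language: str = "eng") -> str:
    """Send the file at `path` to the server and return the response body as text."""
    with open(path, "rb") as fh:
        request = build_ocr_request(fh.read(), language)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With a server running, `extract_text("eurotext.png", "eng")` would send the image and return whatever text the server extracts. The request is built in a separate function so the headers can be inspected without any network access.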
I've been thinking about it and wondered whether it would make sense to
allow passing parameters in general via HTTP headers.
Maybe this is what the header param X_TIKA_OCR_HEADER_PREFIX does?
https://github.com/apache/tika/blob/master/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java
I do not know enough about the implementation, so maybe this is
not possible or too complex.
What do you think about the idea?
Best regards
Ralph
On 24.04.19 18:15, Tim Allison wrote:
The other vaguely related project that comes to mind is
https://www.pandoc.org/index.html but I don't know if that has hooks
to tesseract or a Rest API... Sorry!
On Wed, Apr 24, 2019 at 10:08 AM Tim Allison wrote:
Maybe ?
https://github.com/tleyden/open-ocr
On Wed, Apr 24, 2019 at 9:58 AM Tim Allison wrote:
The goal of Tika is text and metadata extraction. Our basic output is .txt,
xhtml or json. We don’t currently support generation of other formats. Could
you use DropWizard or similar to wrap tesseract if you need it to be RESTful?
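As a rough sketch of what such a wrapper could look like, the following uses Python's standard-library `http.server` in place of DropWizard. The class name `OcrHandler`, the `X-OCR-Language` header and the port are hypothetical choices for this example; it also assumes a `tesseract` binary on the PATH with its default tessdata location.

```python
import subprocess
import tempfile
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path
from typing import List


def tesseract_pdf_command(image: str, output_base: str, lang: str = "eng") -> List[str]:
    """Return the tesseract call that writes a searchable PDF to <output_base>.pdf."""
    return ["tesseract", image, output_base, "-l", lang, "pdf"]


class OcrHandler(BaseHTTPRequestHandler):
    """Accepts an image via PUT and answers with the PDF produced by tesseract."""

    def do_PUT(self):
        length = int(self.headers.get("Content-Length", 0))
        lang = self.headers.get("X-OCR-Language", "eng")  # hypothetical header name
        with tempfile.TemporaryDirectory() as tmp:
            image = Path(tmp) / "input"
            image.write_bytes(self.rfile.read(length))
            output_base = Path(tmp) / "output"
            # The command is passed as a list, so no shell is involved.
            result = subprocess.run(
                tesseract_pdf_command(str(image), str(output_base), lang),
                capture_output=True,
            )
            if result.returncode != 0:
                self.send_error(500, "tesseract failed")
                return
            pdf = output_base.with_suffix(".pdf").read_bytes()
        self.send_response(200)
        self.send_header("Content-Type", "application/pdf")
        self.send_header("Content-Length", str(len(pdf)))
        self.end_headers()
        self.wfile.write(pdf)


def serve(port: int = 8080) -> None:
    """Start the wrapper on the given (assumed) port and block forever."""
    HTTPServer(("", port), OcrHandler).serve_forever()
```

This keeps PDF generation outside Tika: a client would PUT the image to the wrapper and receive the PDF bytes in the response. Error handling, request size limits and validation of the language value are left out to keep the sketch short.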
On Wed, Apr 24, 2019 at 8:21 AM Ralph Soika wrote:
Hi,
I have a question about the Tesseract OCR Parser which is part of Tika:
Is it possible to set the output of tesseract to PDF format? I think
tesseract supports this option to convert an image file (e.g. tif) into a
searchable PDF file:
$ tesseract --tessdata-dir ./ ./testing/eurotext.png ./testing/eurotext-eng -l eng pdf
I use the Tika REST API and I wonder how I can tell the Tika Server to
create a PDF output file.
Thanks for any help
Ralph
--
*Imixs Software Solutions GmbH*
*Web:* www.imixs.com <http://www.imixs.com> *Phone:* +49 (0)89-452136 16
*Office:* Agnes-Pockels-Bogen 1, 80992 München
Registergericht: Amtsgericht Muenchen, HRB 136045
Geschaeftsführer: Gaby Heinle u. Ralph Soika
*Imixs* is an open source company, read more: www.imixs.org
<http://www.imixs.org>