I am still trying to understand the TesseractOCRParser class. It looks like the
method doOCR is the one that calls the tesseract command-line tool. And
yes, it seems there is already a way to pass (maybe) any kind of
optional parameter. But the feature to generate a "searchable pdf"
seems to be not a parameter but a kind of option:
https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage#searchable-pdf-output
I think 'hocr' and 'pdf' are allowed as the last command option.
On the other hand, the Tika class 'TesseractOCRParser' has a
setter method 'setOutputType'.
I am not sure, but maybe everything is already prepared by the Tika
implementation?
If so, then we could add a new resource class to the Tika Server:
    @Path("/pdf")
    public class PDFResource {
        @PUT
        @POST
        @Consumes("*/*")
        @Produces("application/pdf")
        public String doOCRtoPDF(final InputStream is) throws IOException {
            ....
        }
    }
But at this point I lose my courage because I know too little about the
implementation. What do you think? Is there a way to get the low-hanging
fruit? Or am I on the wrong track...?
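To make the "last command option" idea above concrete, here is a minimal sketch of how a command line could be assembled with the output kind appended as the final argument. It is not Tika's doOCR implementation; the class name, the buildCommand helper and the PDF enum value are hypothetical and only illustrate the idea. The argument layout follows the tesseract example from the original question, with --tessdata-dir left out.

```java
import java.util.ArrayList;
import java.util.List;

class TesseractCommandSketch {

    // Output kinds discussed in this thread; PDF is the hypothetical addition.
    enum OutputType { TXT, HOCR, PDF }

    // Builds the argument list only; nothing is executed here.
    static List<String> buildCommand(String inputImage, String outputBase,
                                     String language, OutputType type) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(inputImage);
        cmd.add(outputBase);
        cmd.add("-l");
        cmd.add(language);
        // The output format is chosen by a trailing name such as 'hocr' or 'pdf'.
        // Plain text is the default, so nothing is appended for TXT.
        if (type == OutputType.HOCR) {
            cmd.add("hocr");
        } else if (type == OutputType.PDF) {
            cmd.add("pdf");
        }
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = buildCommand("./testing/eurotext.png",
                "./testing/eurotext-eng", "eng", OutputType.PDF);
        // prints: tesseract ./testing/eurotext.png ./testing/eurotext-eng -l eng pdf
        System.out.println(String.join(" ", cmd));
    }
}
```

A list built this way could be handed to java.lang.ProcessBuilder. The open question from above remains how the generated PDF file would then be returned by the server.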
On 24.04.19 19:37, Steven Van Ingelgem wrote:
As far as I understand, you can call setXXX() functions from
the TesseractOCRConfig class based on the header prefix you just
mentioned.
I use this to change the timeout dynamically (with the
X-Tika-OCRTimeout http header).
What you would need is a new outputType, I think? OUTPUT_TYPE.PDF plus the
implementation that passes the parameter to tesseract?
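The header-to-setter mapping described above could look roughly like the following sketch. It is not the code from TikaResource. DemoOcrConfig is a hypothetical stand-in for TesseractOCRConfig, and the prefix value "X-Tika-OCR" is an assumption inferred from the header names X-Tika-OCRTimeout and X-Tika-OCRLanguage mentioned in this thread.

```java
import java.lang.reflect.Method;
import java.util.LinkedHashMap;
import java.util.Map;

class OcrHeaderSketch {

    // Assumed prefix, inferred from the header names used in this thread.
    static final String PREFIX = "X-Tika-OCR";

    // Hypothetical config bean; NOT Tika's real TesseractOCRConfig.
    public static class DemoOcrConfig {
        private int timeout = 120;
        private String language = "eng";
        public void setTimeout(int timeout) { this.timeout = timeout; }
        public void setLanguage(String language) { this.language = language; }
        public int getTimeout() { return timeout; }
        public String getLanguage() { return language; }
    }

    // For every header starting with PREFIX, call the matching setXXX(value).
    // Matching is case-sensitive here; only int and String setters are handled.
    static void applyHeaders(Map<String, String> headers, Object config) throws Exception {
        for (Map.Entry<String, String> e : headers.entrySet()) {
            String name = e.getKey();
            if (!name.startsWith(PREFIX)) {
                continue; // not an OCR header
            }
            String setter = "set" + name.substring(PREFIX.length());
            for (Method m : config.getClass().getMethods()) {
                if (!m.getName().equals(setter) || m.getParameterCount() != 1) {
                    continue;
                }
                Class<?> type = m.getParameterTypes()[0];
                if (type == int.class) {
                    m.invoke(config, Integer.parseInt(e.getValue()));
                } else if (type == String.class) {
                    m.invoke(config, e.getValue());
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("X-Tika-OCRTimeout", "300");
        headers.put("X-Tika-OCRLanguage", "deu");
        headers.put("Accept", "text/plain"); // ignored: no OCR prefix
        DemoOcrConfig config = new DemoOcrConfig();
        applyHeaders(headers, config);
        // prints: 300 deu
        System.out.println(config.getTimeout() + " " + config.getLanguage());
    }
}
```

With a mechanism like this, a new output type would presumably only need a setter that accepts it, plus the code that forwards the value to tesseract.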
Grtz,
Steven
On Wed, 24 Apr 2019 at 19:26, Ralph Soika <[email protected]
<mailto:[email protected]>> wrote:
Hello Tim,
thanks for your feedback. Yes, I also understand now that tika is
for text and metadata extraction. And so it makes no sense to
pollute the project with other functionality - such as the
generation of new file formats.
I have written a Docker Image with tika-server. And tika did a
great job! (https://github.com/imixs/imixs-docker/tree/master/tika)
On the other hand, tesseract seems to support PDF output as a
general feature. I took a look at the TesseractOCRParser and, as far
as I understand, the code simply passes parameters to the tesseract
module. In the Tika Server module there is already an extension
for language support, which is also a parameter for the
tesseract module. The new tika header param is called
'X-Tika-OCRLanguage'.
https://jira.apache.org/jira/browse/TIKA-1477
I've been thinking about it and wondered whether it would be a good
idea to allow passing parameters in general via the HTTP header.
Maybe this is what the header prefix X_TIKA_OCR_HEADER_PREFIX does?
https://github.com/apache/tika/blob/master/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java
I do not know enough about the implementation, so maybe this
is not possible or too complex?
What do you think about the idea?
Best regards
Ralph
On 24.04.19 18:15, Tim Allison wrote:
The other vaguely related project that comes to mind is
https://www.pandoc.org/index.html but I don't know if that has hooks
to tesseract or a Rest API... Sorry!
On Wed, Apr 24, 2019 at 10:08 AM Tim Allison<[email protected]>
<mailto:[email protected]> wrote:
Maybe ?
https://github.com/tleyden/open-ocr
On Wed, Apr 24, 2019 at 9:58 AM Tim Allison<[email protected]>
<mailto:[email protected]> wrote:
The goal of Tika is text and metadata extraction. Our basic output is
.txt, xhtml or json. We don’t currently support generation of other formats.
Could you use DropWizard or similar to wrap tesseract if you need it to be
restful?
On Wed, Apr 24, 2019 at 8:21 AM Ralph Soika<[email protected]>
<mailto:[email protected]> wrote:
Hi,
I have a question about the Tesseract OCR Parser which is part of Tika:
Is it possible to set the output of tesseract to PDF format? I think
tesseract supports this option to convert an image file (e.g. tif) into a
searchable pdf file:
$ tesseract --tessdata-dir ./ ./testing/eurotext.png
./testing/eurotext-eng -l eng pdf
I use the Tika REST API and I wonder how I can tell the Tika Server to
create a PDF output file.
Thanks for any help
Ralph
--
*Imixs Software Solutions GmbH*
*Web:* www.imixs.com <http://www.imixs.com> *Phone:* +49 (0)89-452136 16
*Office:* Agnes-Pockels-Bogen 1, 80992 München
Registergericht: Amtsgericht Muenchen, HRB 136045
Geschaeftsführer: Gaby Heinle u. Ralph Soika
*Imixs* is an open source company, read more: www.imixs.org
<http://www.imixs.org>