RE: TIKA-OCR issue

Latha Krishnamurthi Wed, 25 Jul 2018 15:23:12 -0700

Hi, thank you very much for your response.

I am starting tika-server with the following command-line options. I am using 
1.16 (I think this was the version LogicalSpark was using when I got it).


Initially
----------
tika-server-1.16.jar --host 0.0.0.0

With Tesseract libraries
--------------------------------
tika-server-1.16.jar --host 0.0.0.0 -c tika-config.xml
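
To rule out the client library, one option is to send an image straight to the server's /tika endpoint and look at the plain-text response. The following is only a minimal sketch: it assumes the server is reachable on port 9998 (the port shown in the startup log) and that the image is a PNG. The function names and the file name in the usage note are hypothetical.

```python
import urllib.request


def build_ocr_request(image_bytes, host="localhost", port=9998,
                      content_type="image/png"):
    """Build (but do not send) a PUT request for tika-server's /tika endpoint.

    host, port and content_type are assumptions; adjust them to the deployment.
    """
    return urllib.request.Request(
        url=f"http://{host}:{port}/tika",
        data=image_bytes,
        method="PUT",
        headers={
            "Content-Type": content_type,  # tells Tika what kind of file this is
            "Accept": "text/plain",        # ask for extracted text only
        },
    )


def extract_text(image_path, **kwargs):
    """Send the image at image_path to the server and return the text it extracts."""
    with open(image_path, "rb") as fh:
        request = build_ocr_request(fh.read(), **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `extract_text("sample.png")` against a running server should return an empty string when OCR is not active and the recognized text when it is.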

I tried the LogicalSpark image in my development environment and it works fine. 
In the actual production environment, we didn’t want to download the jar each 
time, so I download it once and run it from my own Dockerfile. This is a better 
option since we have a microservices architecture. My Dockerfile just copies the 
Tika jar, the Tesseract shared objects, and the Tika config file to the right 
location on the target; that’s all.
We download it once and reuse it.

Maybe I should download 1.19 and try it out, since this bug points to a 
fix in 1.19 (LogicalSpark seems to run 1.18, though). Let me try this 
version and report back.

Thank you once again!

Latha.


From: Tim Allison <[email protected]>
Sent: Tuesday, July 24, 2018 3:04 PM
To: [email protected]
Subject: Re: TIKA-OCR issue

How are you starting tika-server? What’s your command line?

This may be caused by
https://issues.apache.org/jira/browse/TIKA-2669, which is now fixed.

Are you using the same version of Tika as the one from LogicalSpark?

If LogicalSpark works, why build your own? I hear some very savvy Tika folks 
are behind that. :)

On Wed, Jul 18, 2018 at 1:18 PM Latha Krishnamurthi 
<[email protected]> wrote:
Hi,

I am running tika-server-1.16.jar within a Docker container. I build and run 
this using my own Dockerfile. I connect to it using the tika-python library. 
It is not able to extract text from image files. I then downloaded 
Tesseract, installed the ‘so’ files in the container, and set 
LD_LIBRARY_PATH etc., but the extraction still does not happen. Any idea why? 
(Text extraction works fine for PDFs, DOCs, etc.)

(For debugging, I downloaded the prebuilt Docker image and tried it out; it 
works fine with image file extraction. I see that they just download 
Tesseract in addition.) I did not have a tika-config file; I then tried 
creating one, but it did not help.
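
One thing worth checking inside the container: as far as I know, Tika's Tesseract parser in the 1.x line runs the `tesseract` command-line program as an external process, so copying shared objects and setting LD_LIBRARY_PATH may not be enough if the executable itself is not on the PATH. The sketch below is a hedged diagnostic using only the Python standard library; the function names are hypothetical.

```python
import shutil
import subprocess


def find_tesseract(path=None):
    """Return the full path of the tesseract executable, or None if it is not found.

    path is an optional search path string; by default the PATH of the
    current environment is used.
    """
    return shutil.which("tesseract", path=path)


def tesseract_version(executable):
    """Return the first line printed by `<executable> --version`, or "" if none.

    Some Tesseract builds print the version to stderr, so both streams are checked.
    """
    result = subprocess.run(
        [executable, "--version"],
        capture_output=True,
        text=True,
        check=False,
    )
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""
```

Running `find_tesseract()` in the same environment that launches tika-server shows whether the server process could locate the binary at all; if it returns a path, `tesseract_version(...)` confirms that the binary actually starts.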

10:14 $ cat tika-config.xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="extractInlineImages" type="bool">true</param>
        <param name="allowExtractionForAccessibility" type="bool">true</param>
        <param name="catchIntermediateExceptions" type="bool">false</param>
        <!-- we really should throw an exception for this.
             We are currently not checking -->
        <param name="someRandomThingOrOther" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
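
Since the comment in this config notes that unknown parameters are not checked, it can help to list exactly what the file declares before handing it to the server. This is a minimal sketch using the Python standard library; `list_parsers` is a hypothetical helper, and it only reports what is written in the file, not what Tika actually loads.

```python
import xml.etree.ElementTree as ET


def list_parsers(config_xml):
    """Map each declared parser class to a dict of its parameter names and values.

    config_xml is the content of a tika-config.xml file (str or bytes).
    """
    root = ET.fromstring(config_xml)
    parsers = {}
    for parser in root.findall("./parsers/parser"):
        params = {
            param.get("name"): (param.text or "").strip()
            for param in parser.findall("./params/param")
        }
        parsers[parser.get("class")] = params
    return parsers
```

For the file above, the result would contain two entries, one for DefaultParser with no parameters and one for PDFParser with its four parameters, which makes misspelled or unexpected parameter names easy to spot.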

https://github.com/LogicalSpark/docker-tikaserver

Thanks in advance for your response.

Here are my debug log traces when TIKA starts.

======================================================================================================================================
2018-07-17 21:46:38,926 LathaDLP-NOX-18 user.notice tika: Jul 17, 2018 9:46:38 
PM org.apache.tika.config.InitializableProblemHandler$3 
handleInitializableProblem
2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika: WARNING: 
JBIG2ImageReader not loaded. jbig2 files will be ignored
2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika: See 
https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika: for optional 
dependencies.
2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika: TIFFImageWriter not 
loaded. tiff files will not be processed
2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika: See 
https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika: for optional 
dependencies.
2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika: J2KImageReader not 
loaded. JPEG2000 files will not be processed.
2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika: See 
https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika: for optional 
dependencies.
2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika:
2018-07-17 21:46:39,250 LathaDLP-NOX-18 user.notice tika: Jul 17, 2018 9:46:39 
PM org.apache.tika.config.InitializableProblemHandler$3 
handleInitializableProblem
2018-07-17 21:46:39,250 LathaDLP-NOX-18 user.notice tika: WARNING: org.xerial's 
sqlite-jdbc is not loaded.
2018-07-17 21:46:39,250 LathaDLP-NOX-18 user.notice tika: Please provide the 
jar on your classpath to parse sqlite files.
2018-07-17 21:46:39,250 LathaDLP-NOX-18 user.notice tika: See 
tika-parsers/pom.xml for the correct version.
2018-07-17 21:46:39,367 LathaDLP-NOX-18 user.notice tika: INFO  Starting Apache 
Tika 1.16 server
2018-07-17 21:46:40,660 LathaDLP-NOX-18 user.notice tika: INFO  Setting the 
server's publish address to be http://0.0.0.0:9998/
2018-07-17 21:46:40,871 LathaDLP-NOX-18 user.notice tika: INFO  
jetty-8.y.z-SNAPSHOT
2018-07-17 21:46:40,963 LathaDLP-NOX-18 user.notice tika: INFO  Started 
[email protected]:9998
2018-07-17 21:46:40,997 LathaDLP-NOX-18 user.notice tika: INFO  Started Apache 
Tika server at http://0.0.0.0:9998/
======================================================================================================================================
