Hi,

I am running tika-server-1.16.jar inside a Docker container, which I build and
run from my own Dockerfile. I connect to it using the tika-python library.
This setup is not able to extract text from image files. I then downloaded
Tesseract, installed its '.so' files in the container, and set
LD_LIBRARY_PATH etc., but the extraction still does not happen. Any idea why?
(Text extraction works fine for PDFs, DOCs, etc.)
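
One check that may help narrow this down: as far as I understand, Tika's
Tesseract OCR parser runs the `tesseract` command-line program rather than
loading the shared libraries directly, so the executable itself has to be on
the PATH of the process that starts the server. A minimal sketch, meant to be
run inside the container (the helper name `tesseract_status` is made up for
this illustration):

```python
import shutil
import subprocess


def tesseract_status(binary="tesseract"):
    """Return (path, first_version_line), or (None, None) if not on PATH."""
    path = shutil.which(binary)
    if path is None:
        return None, None
    # Some tesseract releases print the version on stderr, others on stdout,
    # so both streams are merged before reading the first line.
    result = subprocess.run(
        [path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    lines = result.stdout.strip().splitlines()
    return path, (lines[0] if lines else "")


if __name__ == "__main__":
    print(tesseract_status())
```

If this reports `(None, None)` for the user that launches the jar, having only
the '.so' files in place would presumably not be enough.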

(As a debugging step I downloaded the prebuilt Docker image and tried it out;
image file extraction works fine there. I see that they just download
Tesseract in addition.) I do not have a tika-config file. I then tried
creating one, but it did not help.

10:14 $ cat tika-config.xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="extractInlineimages" type="bool">true</param>
        <param name="allowExtractionForAccessibility" type="bool">true</param>
        <param name="catchIntermediateExceptions" type="bool">false</param>
        <!-- we really should throw an exception for this.
             We are currently not checking -->
        <param name="someRandomThingOrOther" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
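
To rule out the client side, the server can also be queried directly without
tika-python. The sketch below assumes the server is reachable at
http://localhost:9998 and uses the server's `PUT /tika` endpoint with an
`Accept: text/plain` header; the function names and the default address are
placeholders for illustration.

```python
import urllib.request


def build_tika_request(image_bytes, server="http://localhost:9998"):
    """Build a PUT request that asks the Tika server for plain-text output."""
    return urllib.request.Request(
        url=server.rstrip("/") + "/tika",
        data=image_bytes,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(image_path, server="http://localhost:9998"):
    """Send the file at image_path to the server and return the response text."""
    with open(image_path, "rb") as fh:
        request = build_tika_request(fh.read(), server)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `extract_text("scan.png")` (a hypothetical file name) against both my
own container and the prebuilt one should show whether the difference lies in
the server or in the client.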

https://github.com/LogicalSpark/docker-tikaserver

Thanks in advance for your response.

Here are the debug log traces from when Tika starts.

======================================================================================================================================
2018-07-17 21:46:38,926 LathaDLP-NOX-18 user.notice tika: Jul 17, 2018 9:46:38 
PM org.apache.tika.config.InitializableProblemHandler$3 
handleInitializableProblem
2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika: WARNING: 
JBIG2ImageReader not loaded. jbig2 files will be ignored
2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika: See 
https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika: for optional 
dependencies.
2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika: TIFFImageWriter not 
loaded. tiff files will not be processed
2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika: See 
https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika: for optional 
dependencies.
2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika: J2KImageReader not 
loaded. JPEG2000 files will not be processed.
2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika: See 
https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika: for optional 
dependencies.
2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika:
2018-07-17 21:46:39,250 LathaDLP-NOX-18 user.notice tika: Jul 17, 2018 9:46:39 
PM org.apache.tika.config.InitializableProblemHandler$3 
handleInitializableProblem
2018-07-17 21:46:39,250 LathaDLP-NOX-18 user.notice tika: WARNING: org.xerial's 
sqlite-jdbc is not loaded.
2018-07-17 21:46:39,250 LathaDLP-NOX-18 user.notice tika: Please provide the 
jar on your classpath to parse sqlite files.
2018-07-17 21:46:39,250 LathaDLP-NOX-18 user.notice tika: See 
tika-parsers/pom.xml for the correct version.
2018-07-17 21:46:39,367 LathaDLP-NOX-18 user.notice tika: INFO  Starting Apache 
Tika 1.16 server
2018-07-17 21:46:40,660 LathaDLP-NOX-18 user.notice tika: INFO  Setting the 
server's publish address to be http://0.0.0.0:9998/
2018-07-17 21:46:40,871 LathaDLP-NOX-18 user.notice tika: INFO  
jetty-8.y.z-SNAPSHOT
2018-07-17 21:46:40,963 LathaDLP-NOX-18 user.notice tika: INFO  Started 
SelectChannelConnector@0.0.0.0:9998
2018-07-17 21:46:40,997 LathaDLP-NOX-18 user.notice tika: INFO  Started Apache 
Tika server at http://0.0.0.0:9998/
======================================================================================================================================


