How are you starting tika-server? What’s your command line? This may be caused by https://issues.apache.org/jira/browse/TIKA-2669, which is now fixed.
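One more thing worth ruling out while you check the command line: as far as I know, Tika's OCR parser shells out to the `tesseract` executable rather than loading the shared libraries directly, so the binary itself has to be on the PATH of the process that starts tika-server. A minimal sketch of that check, using only the Python standard library (the helper names `find_tesseract` and `tesseract_version` are made up for illustration), to be run inside the container:

```python
import shutil
import subprocess


def find_tesseract(executable="tesseract"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(executable)


def tesseract_version(path):
    """Run `<path> --version` and return the first non-empty output line.

    Returns None if the binary cannot be started or prints nothing.
    stderr is merged into stdout because some tesseract releases print
    their version banner on stderr.
    """
    try:
        result = subprocess.run(
            [path, "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            timeout=10,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    for line in result.stdout.splitlines():
        if line.strip():
            return line.strip()
    return None


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract is not on PATH for this process")
    else:
        print(path, "->", tesseract_version(path))
```

If this reports that tesseract is not on PATH for the same user and environment that launches the jar, the `.so` files alone are unlikely to help.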
Are you using the same version of Tika as the one from LogicalSpark? If LogicalSpark works, why build your own? I hear some very savvy Tika folks are behind that. :)

On Wed, Jul 18, 2018 at 1:18 PM Latha Krishnamurthi <[email protected]> wrote:

> Hi,
>
> I am running tika-server-1.16.jar within a docker container. I build and run this using my own docker file. I connect to it using the tika-python library. This is not able to extract text out of the image files. I then downloaded tesseract, installed the ‘so’ files in the container, and set the LD_LIBARRY_PATH etc. But the extraction still does not happen. Any idea why? (The text extraction works fine for PDFs, DOCs etc.)
>
> (As a debugging step, I downloaded the prebuilt docker image and tried it out; it works fine with image file extraction. I see that they just download tesseract in addition.) I do not have a tika-config file; I tried creating one, but it did not help.
>
> 10:14 $ cat tika-config.xml
> <?xml version="1.0" encoding="UTF-8"?>
> <properties>
>   <parsers>
>     <parser class="org.apache.tika.parser.DefaultParser"/>
>     <parser class="org.apache.tika.parser.pdf.PDFParser">
>       <params>
>         <param name="extractInlineimages" type="bool">true</param>
>         <param name="allowExtractionForAccessibility" type="bool">true</param>
>         <param name="catchIntermediateExceptions" type="bool">false</param>
>         <!-- we really should throw an exception for this.
>              We are currently not checking -->
>         <param name="someRandomThingOrOther" type="bool">true</param>
>       </params>
>     </parser>
>   </parsers>
> </properties>
>
> https://github.com/LogicalSpark/docker-tikaserver
>
> Thanks in advance for your response.
>
> Here are my debug log traces when TIKA starts.
>
> ======================================================================================================================================
> 2018-07-17 21:46:38,926 LathaDLP-NOX-18 user.notice tika: Jul 17, 2018 9:46:38 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
> 2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika: WARNING: JBIG2ImageReader not loaded. jbig2 files will be ignored
> 2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika: See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
> 2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika: for optional dependencies.
> 2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika: TIFFImageWriter not loaded. tiff files will not be processed
> 2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika: See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
> 2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika: for optional dependencies.
> 2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika: J2KImageReader not loaded. JPEG2000 files will not be processed.
> 2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika: See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
> 2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika: for optional dependencies.
> 2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika:
> 2018-07-17 21:46:39,250 LathaDLP-NOX-18 user.notice tika: Jul 17, 2018 9:46:39 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
> 2018-07-17 21:46:39,250 LathaDLP-NOX-18 user.notice tika: WARNING: org.xerial's sqlite-jdbc is not loaded.
> 2018-07-17 21:46:39,250 LathaDLP-NOX-18 user.notice tika: Please provide the jar on your classpath to parse sqlite files.
> 2018-07-17 21:46:39,250 LathaDLP-NOX-18 user.notice tika: See tika-parsers/pom.xml for the correct version.
> 2018-07-17 21:46:39,367 LathaDLP-NOX-18 user.notice tika: INFO Starting Apache Tika 1.16 server
> 2018-07-17 21:46:40,660 LathaDLP-NOX-18 user.notice tika: INFO Setting the server's publish address to be http://0.0.0.0:9998/
> 2018-07-17 21:46:40,871 LathaDLP-NOX-18 user.notice tika: INFO jetty-8.y.z-SNAPSHOT
> 2018-07-17 21:46:40,963 LathaDLP-NOX-18 user.notice tika: INFO Started [email protected]:9998
> 2018-07-17 21:46:40,997 LathaDLP-NOX-18 user.notice tika: INFO Started Apache Tika server at http://0.0.0.0:9998/
> ======================================================================================================================================
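
To take tika-python out of the picture, it may help to send an image straight to the server's `/tika` endpoint and look at what comes back. A rough sketch using only the Python standard library, assuming the server is reachable on port 9998 as in the log above; the `localhost` host, the function names, and the file name `scan.png` are placeholders, not anything from your setup:

```python
import urllib.request


def build_tika_request(data, base_url="http://localhost:9998", content_type=None):
    """Build a PUT request for tika-server's /tika endpoint (plain-text output)."""
    headers = {"Accept": "text/plain"}
    if content_type:
        # Optional hint; without it the server detects the type itself.
        headers["Content-Type"] = content_type
    url = base_url.rstrip("/") + "/tika"
    return urllib.request.Request(url, data=data, headers=headers, method="PUT")


def extract_text(path, base_url="http://localhost:9998", content_type=None):
    """Send the file at `path` to the server and return the extracted text."""
    with open(path, "rb") as f:
        data = f.read()
    request = build_tika_request(data, base_url, content_type)
    with urllib.request.urlopen(request, timeout=120) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Placeholder file name; substitute an image that contains text.
    print(repr(extract_text("scan.png", content_type="image/png")))
```

If this comes back empty for an image while PDFs and DOCs return text, the problem is probably on the server side (OCR not being invoked) rather than in the client library.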
