How are you starting tika-server? What’s your command line?

This may be caused by
https://issues.apache.org/jira/browse/TIKA-2669, which is now fixed.

Are you using the same version of Tika as the one from LogicalSpark?
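
One quick way to compare versions is to ask each running server directly.
This is a minimal sketch, assuming both containers expose tika-server on
port 9998 and answer GET /version; the base URL and the helper names
(fetch_tika_version, parse_version) are illustrative and not part of
tika-python.

```python
import re
import urllib.request


def fetch_tika_version(base_url="http://localhost:9998"):
    """Return the version string reported by a running tika-server.

    Assumes the server answers GET /version with plain text such as
    'Apache Tika 1.16'. base_url is an assumption; adjust host and port.
    """
    with urllib.request.urlopen(base_url + "/version", timeout=10) as resp:
        return resp.read().decode("utf-8").strip()


def parse_version(text):
    """Extract a comparable tuple, e.g. 'Apache Tika 1.16' -> (1, 16)."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError("no version number found in: %r" % text)
    return tuple(int(part) for part in match.group(1).split("."))


# Usage (needs a running server, so it is left as a comment):
#   mine = parse_version(fetch_tika_version("http://localhost:9998"))
#   print(mine)
```

Running it once against your own container and once against the
LogicalSpark one should show whether the two are on the same release.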

If LogicalSpark works, why build your own? I hear some very savvy Tika
folks are behind that. :)
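
On the Tesseract side, it may also be worth confirming that the tesseract
executable itself is visible to the process that launches tika-server. As
far as I know, Tika's OCR parser runs tesseract as an external command, so
shared libraries on the library path alone would not be enough. A small
sketch, meant to be run inside the container as the same user that starts
the server; the helper names are illustrative:

```python
import shutil
import subprocess


def find_tesseract(name="tesseract"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(name)


def tesseract_version(path):
    """Return the first line printed by '<path> --version'.

    Some tesseract releases print the version to stderr, so stderr is
    merged into stdout before reading.
    """
    result = subprocess.run(
        [path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        check=False,
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    exe = find_tesseract()
    if exe is None:
        print("tesseract is not on PATH for this user")
    else:
        print(exe, "->", tesseract_version(exe))
```

If this reports that tesseract is not on PATH, installing the tesseract
package in the image (as the prebuilt image appears to do) is the first
thing to try.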

On Wed, Jul 18, 2018 at 1:18 PM Latha Krishnamurthi wrote:

> Hi,
>
>
>
> I am running tika-server-1.16.jar in a Docker container that I build and
> run from my own Dockerfile, and I connect to it with the tika-python
> library. It is not able to extract text from image files. I then
> downloaded Tesseract, installed the ‘so’ files in the container, and set
> LD_LIBRARY_PATH etc., but the extraction still does not happen. Any
> idea why? (Text extraction works fine for PDFs, DOCs, etc.)
>
>
>
> (As a debugging step I downloaded the prebuilt Docker image and tried it
> out; it works fine with image file extraction. I see that they just
> download Tesseract in addition.) I do not have a tika-config file; I
> tried creating one, but it did not help.
>
>
>
> 10:14 $ cat tika-config.xml
>
> <?xml version="1.0" encoding="UTF-8"?>
> <properties>
>   <parsers>
>     <parser class="org.apache.tika.parser.DefaultParser"/>
>     <parser class="org.apache.tika.parser.pdf.PDFParser">
>       <params>
>         <param name="extractInlineimages" type="bool">true</param>
>         <param name="allowExtractionForAccessibility" type="bool">true</param>
>         <param name="catchIntermediateExceptions" type="bool">false</param>
>         <!-- we really should throw an exception for this.
>              We are currently not checking -->
>         <param name="someRandomThingOrOther" type="bool">true</param>
>       </params>
>     </parser>
>   </parsers>
> </properties>
>
>
>
> https://github.com/LogicalSpark/docker-tikaserver
>
>
>
> Thanks in advance for your response.
>
> Here are my debug log traces from when Tika starts.
>
>
>
>
> ======================================================================================================================================
>
> 2018-07-17 21:46:38,926 LathaDLP-NOX-18 user.notice tika: Jul 17, 2018 9:46:38 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
> 2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika: WARNING: JBIG2ImageReader not loaded. jbig2 files will be ignored
> 2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika: See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
> 2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika: for optional dependencies.
> 2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika: TIFFImageWriter not loaded. tiff files will not be processed
> 2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika: See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
> 2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika: for optional dependencies.
> 2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika: J2KImageReader not loaded. JPEG2000 files will not be processed.
> 2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika: See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
> 2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika: for optional dependencies.
> 2018-07-17 21:46:38,927 LathaDLP-NOX-18 user.notice tika:
> 2018-07-17 21:46:39,250 LathaDLP-NOX-18 user.notice tika: Jul 17, 2018 9:46:39 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
> 2018-07-17 21:46:39,250 LathaDLP-NOX-18 user.notice tika: WARNING: org.xerial's sqlite-jdbc is not loaded.
> 2018-07-17 21:46:39,250 LathaDLP-NOX-18 user.notice tika: Please provide the jar on your classpath to parse sqlite files.
> 2018-07-17 21:46:39,250 LathaDLP-NOX-18 user.notice tika: See tika-parsers/pom.xml for the correct version.
> 2018-07-17 21:46:39,367 LathaDLP-NOX-18 user.notice tika: INFO  Starting Apache Tika 1.16 server
> 2018-07-17 21:46:40,660 LathaDLP-NOX-18 user.notice tika: INFO  Setting the server's publish address to be http://0.0.0.0:9998/
> 2018-07-17 21:46:40,871 LathaDLP-NOX-18 user.notice tika: INFO  jetty-8.y.z-SNAPSHOT
> 2018-07-17 21:46:40,963 LathaDLP-NOX-18 user.notice tika: INFO  Started [email protected]:9998
> 2018-07-17 21:46:40,997 LathaDLP-NOX-18 user.notice tika: INFO  Started Apache Tika server at http://0.0.0.0:9998/
>
>
> ======================================================================================================================================
