Hi,
I am struggling a bit with the Tika 2.4 server and with activating the
scientific parsers, which are now in a separate module. So far I have not
found a clear example or instructions, so here is how far I have got in
making it work, more or less, in an Ubuntu 20 environment (I am no Java expert):
- install the external dependencies (GDAL and co)
- create a tika-config.xml file which lists some individual parsers
found in the scientific module (see the config at the end of this message)
- start the server with both jars on the classpath and call the main
Tika server class:
$ java -cp "tika-server-standard-2.4.1.jar:tika-parser-scientific-package-2.4.1.jar" \
    org.apache.tika.server.core.TikaServerCli -h '*' -c tika-config.xml
My questions:
- Is this the best approach to get the scientific parsers activated? Can it
be done for all included parsers in one go?
- It looks like GDAL (which also parses image formats) is deactivating
Tesseract for some image formats. Is there a way to undo this, or to
specify the order of the parsers?
Thanks!
Paul
========tika-config.xml ===============
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<properties>
  <!--for example: <mimeTypeRepository resource="/org/apache/tika/mime/tika-mimetypes.xml"/>-->
  <service-loader dynamic="true" loadErrorHandler="WARN"/>
  <encodingDetectors>
    <encodingDetector class="org.apache.tika.detect.DefaultEncodingDetector"/>
  </encodingDetectors>
  <translator class="org.apache.tika.language.translate.DefaultTranslator"/>
  <detectors>
    <detector class="org.apache.tika.detect.DefaultDetector"/>
  </detectors>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>
    <parser class="org.apache.tika.parser.netcdf.NetCDFParser"/>
    <parser class="org.apache.tika.parser.gdal.GDALParser"/>
  </parsers>
</properties>
============================
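
A possible direction for the second question, as an untested sketch only: as far as the Tika configuration documentation describes it, a <parser> entry can carry <mime-exclude> children that stop that parser from handling the listed types. If that also applies to GDALParser here, excluding the overlapping image types from it might leave them to the parsers loaded through DefaultParser (Tesseract included). The two MIME types below are examples chosen for illustration, not a confirmed list of the types where GDAL takes over.

```xml
<parsers>
  <parser class="org.apache.tika.parser.DefaultParser"/>
  <parser class="org.apache.tika.parser.netcdf.NetCDFParser"/>
  <parser class="org.apache.tika.parser.gdal.GDALParser">
    <!-- Assumption: example types only; replace with the image types
         that should not be handled by GDAL -->
    <mime-exclude>image/jpeg</mime-exclude>
    <mime-exclude>image/png</mime-exclude>
  </parser>
</parsers>
```

This fragment would replace the <parsers> block in the config above; whether it actually restores OCR for those types would need to be checked against a running server.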