Tika 2.4.x how to configure the scientific parsers

Paul Borgermans Wed, 13 Jul 2022 08:32:57 -0700

Hi

I am struggling a bit with the Tika 2.4 server and with activating the
scientific parsers, which are now in a separate module. So far I have not
found a clear example or instructions, so here is how far I have got in
making it work, more or less, in an Ubuntu 20 environment (I am no Java
expert):


- install the external dependencies (GDAL and co)
- create a tika-config.xml file that lists some of the individual parsers
found in the scientific module (see below)
- start the server with both jars on the classpath and call the main
Tika server class:
$ java -cp "tika-server-standard-2.4.1.jar:tika-parser-scientific-package-2.4.1.jar" \
    org.apache.tika.server.core.TikaServerCli -h '*' -c tika-config.xml
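
A minimal sketch of the start-up step above, in case the list of jars grows. The helper names build_classpath and tika_server_cmd are made up for illustration and are not part of Tika; the jar names, main class and options are the ones from the command above. The script only prints the command, so it can be checked before it is run.

```shell
#!/usr/bin/env bash
# Hypothetical helper: join jar paths with ':' to form a Java classpath
# (':' is the separator on Linux/macOS).
build_classpath() {
  local IFS=:
  printf '%s\n' "$*"
}

# Hypothetical helper: print the tika-server launch command.
# First argument is the config file, the remaining ones are the jars.
tika_server_cmd() {
  local config="$1"; shift
  printf 'java -cp "%s" org.apache.tika.server.core.TikaServerCli -h "*" -c %s\n' \
    "$(build_classpath "$@")" "$config"
}

# Jar names as used in this mail; adjust the paths to your setup.
tika_server_cmd tika-config.xml \
  tika-server-standard-2.4.1.jar \
  tika-parser-scientific-package-2.4.1.jar
```

Once the server is up, and assuming it listens on the default port 9998, `curl http://localhost:9998/parsers/details` should show whether the GDAL and NetCDF parsers were picked up and for which media types.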

My questions:
- Is this the best approach to get the scientific parsers activated? Can it
be done for all included parsers in one go?
- It looks like GDAL (which also parses image formats) is deactivating
Tesseract for some image formats. Is there a way to undo this, or to
specify the order of the parsers?

Thanks!
Paul

========tika-config.xml ===============

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<properties>
  <!--for example: <mimeTypeRepository resource="/org/apache/tika/mime/tika-mimetypes.xml"/>-->
  <service-loader dynamic="true" loadErrorHandler="WARN"/>
  <encodingDetectors>
    <encodingDetector class="org.apache.tika.detect.DefaultEncodingDetector"/>
  </encodingDetectors>
  <translator class="org.apache.tika.language.translate.DefaultTranslator"/>
  <detectors>
    <detector class="org.apache.tika.detect.DefaultDetector"/>
  </detectors>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>
    <parser class="org.apache.tika.parser.netcdf.NetCDFParser"/>
    <parser class="org.apache.tika.parser.gdal.GDALParser"/>
  </parsers>
</properties>

============================
