That looks right to me. The scientific parsers will be automatically added to the parser list and you shouldn't have to configure them.

The GDALParser does preempt parsing of image files that it covers, and they won't also be parsed by TesseractOCRParser. If you'd like Tesseract, turn off GDAL via the parsers section or limit which file types GDAL handles by decorating it with supported mime-types.

On https://issues.apache.org/jira/browse/TIKA-3812, we document how the ordering changed (was fixed) between 2.4.0 and 2.4.1 if that has any relevance.

If you need specifics on any of the above, please let us know.

Best,
Tim

On Wed, Jul 13, 2022 at 11:32 AM Paul Borgermans <[email protected]> wrote:

> Hi
>
> I am a bit struggling with Tika 2.4 server and activating the scientific
> parsers that are now in a separate module. So far I have not found a clear
> example or instructions, so here is my progress making it work more or less
> in a ubuntu 20 environment ( I am no Java expert):
>
> - install external dependencies (gdal and co)
> - create a tika-config.xml file which specifies some individual parsers
>   found in the scientific module (see also below)
> - start the server with the jars as classpath arguments and call the main
>   tika class:
>
> $ java -cp "tika-server-standard-2.4.1.jar:tika-parser-scientific-package-2.4.1.jar" \
>     org.apache.tika.server.core.TikaServerCli -h '*' -c tika-config.xml
>
> My questions:
> - is this the best approach to get the scientific parsers activated? Can
>   it be done for all included parsers in one go?
> - It looks that GDAL (which also parses image formats) is
>   de-activating tesseract for some image formats. Is there a way to undo
>   this? Or specify the order of parsers?
>
> Thanks!
> Paul
>
> ========tika-config.xml ===============
>
> <?xml version="1.0" encoding="UTF-8" standalone="no"?>
> <properties>
>   <!--for example: <mimeTypeRepository resource="/org/apache/tika/mime/tika-mimetypes.xml"/>-->
>   <service-loader dynamic="true" loadErrorHandler="WARN"/>
>   <encodingDetectors>
>     <encodingDetector class="org.apache.tika.detect.DefaultEncodingDetector"/>
>   </encodingDetectors>
>   <translator class="org.apache.tika.language.translate.DefaultTranslator"/>
>   <detectors>
>     <detector class="org.apache.tika.detect.DefaultDetector"/>
>   </detectors>
>   <parsers>
>     <parser class="org.apache.tika.parser.DefaultParser"/>
>     <parser class="org.apache.tika.parser.netcdf.NetCDFParser"/>
>     <parser class="org.apache.tika.parser.gdal.GDALParser"/>
>   </parsers>
> </properties>
>
> ============================
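P.S. To make the GDAL/Tesseract suggestion above a bit more concrete, here is a rough, untested sketch of what the parsers section might look like. It relies on the `parser-exclude` and `mime-exclude` elements of tika-config.xml. The three image types listed are only an assumption about which formats you would want left to Tesseract, so adjust them to your own data.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Load every parser found on the classpath, except GDAL. -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.gdal.GDALParser"/>
    </parser>
    <!-- Add GDAL back, but keep it away from the image types that
         should go to Tesseract instead. These mime types are
         illustrative assumptions, not a recommended list. -->
    <parser class="org.apache.tika.parser.gdal.GDALParser">
      <mime-exclude>image/jpeg</mime-exclude>
      <mime-exclude>image/png</mime-exclude>
      <mime-exclude>image/tiff</mime-exclude>
    </parser>
  </parsers>
</properties>
```

If you would rather turn GDAL off completely, dropping the second `<parser>` element and keeping only the `parser-exclude` should be enough.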
