RE: Setting parser options

Peter Kronenberg Wed, 06 Jan 2021 06:57:42 -0800

At this point, our architecture is such that we are doing the file processing 
ourselves 😊.


Was my example tika-config.xml correct?  Are you saying I should add 
parser-exclude entries under the DefaultParser for any parsers whose 
properties I will be setting?
What properties are available for the DefaultParser?

At what point in the process can I programmatically query the parser config 
and/or set its values?  That’s the part that’s not clear to me.  If I create my 
own PDFParserConfig, then it’s pretty obvious: upon creation, it reads the 
properties from PDFParser.properties, and then I can query or set them.

How can I do something similar if I am using Tika-config?
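
To at least see which values a tika-config.xml declares for each parser, the file can be read directly with the JDK's own XML parser. The following is only a minimal sketch under stated assumptions: the class name TikaConfigParams and the method readParams are hypothetical and not part of Tika, and the result reflects what the file says, not necessarily the effective configuration after Tika has loaded it.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper (not a Tika class): lists the <param> values that a
// tika-config.xml declares for each <parser> element.
public class TikaConfigParams {

    // Returns parser class name -> (param name -> raw text value), in file order.
    // Parsers without a direct <params> child (e.g. a bare DefaultParser) are skipped.
    public static Map<String, Map<String, String>> readParams(InputStream xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
        Map<String, Map<String, String>> result = new LinkedHashMap<>();

        // Matches every <parser> element, including nested ones; <parser-exclude> is a different tag.
        NodeList parsers = doc.getElementsByTagName("parser");
        for (int i = 0; i < parsers.getLength(); i++) {
            Element parser = (Element) parsers.item(i);
            Map<String, String> params = new LinkedHashMap<>();

            // Look only at direct <params> children so that a nested parser's
            // params are not attributed to its enclosing parser.
            for (Node child = parser.getFirstChild(); child != null; child = child.getNextSibling()) {
                if (child instanceof Element && "params".equals(child.getNodeName())) {
                    NodeList entries = ((Element) child).getElementsByTagName("param");
                    for (int j = 0; j < entries.getLength(); j++) {
                        Element param = (Element) entries.item(j);
                        params.put(param.getAttribute("name"), param.getTextContent().trim());
                    }
                }
            }
            if (!params.isEmpty()) {
                result.put(parser.getAttribute("class"), params);
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Shortened copy of the config from this thread, inlined so the sketch is self-contained.
        String xml =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<properties><parsers>"
            + "<parser class=\"org.apache.tika.parser.DefaultParser\">"
            + "<parser-exclude class=\"org.apache.tika.parser.pdf.PDFParser\"/>"
            + "</parser>"
            + "<parser class=\"org.apache.tika.parser.pdf.PDFParser\"><params>"
            + "<param name=\"extractInlineImages\" type=\"bool\">false</param>"
            + "<param name=\"ocrStrategy\" type=\"string\">OCR_AND_TEXT_EXTRACTION</param>"
            + "</params></parser>"
            + "</parsers></properties>";

        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        readParams(in).forEach((parserClass, params) ->
            System.out.println(parserClass + " -> " + params));
    }
}
```

In an application, the same stream that is passed to the TikaConfig constructor could be opened a second time and handed to readParams; a stream can only be consumed once, so it should not be shared between the two calls.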

From: Tim Allison <[email protected]>
Sent: Wednesday, January 6, 2021 9:30 AM
To: [email protected]
Subject: Re: Setting parser options

If you are doing large-scale processing of untrusted documents, I'd highly 
recommend the config.xml option with tika-server.  Keep your file processing 
out of your process. :D

You do need the entry for default parser.  Given the parser sort order, you 
shouldn't need to exclude, but I do it as good practice.

I'd like to figure out a way to set the field params in the default parsers, 
i.e. leave all the other parsers untouched, but apply the following updates to 
the PDFParser that was already loaded as part of the usual service loading 
process.

<parser class="org.apache.tika.parser.DefaultParser">
        <parser class="org.apache.tika.parser.pdf.PDFParser">
            <params>
                <param name="extractInlineImages" type="bool">false</param>
                <param name="ocrStrategy" type="string">OCR_AND_TEXT_EXTRACTION</param>
            </params>
        </parser>
</parser>

On Tue, Jan 5, 2021 at 5:36 PM Peter Kronenberg <[email protected]> wrote:
What is the recommended way for setting options on PDFParserConfig and 
TesseractOCRConfig?

I’ve come up with two methods:


  1.  Make copies of TesseractOCRConfig.properties and PDFParser.properties and 
put them in my own source tree under the same package name.  Whatever is in 
these copies will be used, and the corresponding properties files in the jar 
are ignored.
  2.  Create a tika-config.xml that looks something like this:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
            <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
        </parser>
        <parser class="org.apache.tika.parser.pdf.PDFParser">
            <params>
                <param name="extractInlineImages" type="bool">false</param>
                <param name="ocrStrategy" type="string">OCR_AND_TEXT_EXTRACTION</param>
            </params>
        </parser>
        <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
            <params>
                <param name="pageSegMode" type="int">1</param>
                <param name="ocrStrategy" type="string">OCR_AND_TEXT_EXTRACTION</param>
            </params>
        </parser>
    </parsers>
</properties>

Then create an instance of TikaConfig and pass it to the AutoDetectParser:

TikaConfig tikaConfig = new TikaConfig(
        this.getClass().getClassLoader().getResourceAsStream("tika-config.xml"));

final AutoDetectParser parser = new AutoDetectParser(tikaConfig);


In the latter case, I’m not exactly sure when the file is loaded.  At what 
point can I call pdfConfig.getxxx() or tesseractOCRConfig.getxxx()?
I’m also not sure whether I need a parser-exclude for both the PDF and 
Tesseract parsers under DefaultParser.  Do I even need the entry for 
DefaultParser?


What is the recommended way of doing this?
