Setting parser options

Peter Kronenberg Tue, 05 Jan 2021 14:28:08 -0800

What is the recommended way for setting options on PDFParserConfig and 
TesseractOCRConfig?


I've come up with two methods:


  1.  Make copies of TesseractOCRConfig.properties and PDFParser.properties and
      put them in my own source tree under the same package name.  Whatever is
      in these copies is used, and the corresponding properties files in the
      jar are ignored.
  2.  Create a tika-config.xml looking something like this:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
            <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
        </parser>
        <parser class="org.apache.tika.parser.pdf.PDFParser">
            <params>
                <param name="extractInlineImages" type="bool">false</param>
                <param name="ocrStrategy" type="string">OCR_AND_TEXT_EXTRACTION</param>
            </params>
        </parser>
        <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
            <params>
                <param name="pageSegMode" type="int">1</param>
                <param name="ocrStrategy" type="string">OCR_AND_TEXT_EXTRACTION</param>
            </params>
        </parser>
    </parsers>
</properties>
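
Since a config like this is easy to break when it gets wrapped or hand-edited, a quick check outside of Tika can help. The following is only a sketch using the JDK's built-in XML parser, not anything Tika provides: the class name ConfigCheck and the method listParams are hypothetical names made up for illustration. It confirms the XML is well-formed and lists which <param> entries sit under which <parser> class.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika: it only inspects the XML structure.
class ConfigCheck {

    // Returns one line per <param>, in document order:
    // "parserClass: name (type) = value".
    // Throws if the XML is not well-formed.
    static List<String> listParams(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        List<String> lines = new ArrayList<>();
        NodeList params = doc.getElementsByTagName("param");
        for (int i = 0; i < params.getLength(); i++) {
            Element param = (Element) params.item(i);

            // Walk up to the enclosing <parser> element to find the class
            // this parameter is declared for.
            Node n = param.getParentNode();
            while (n != null && !"parser".equals(n.getNodeName())) {
                n = n.getParentNode();
            }
            String owner = (n instanceof Element)
                    ? ((Element) n).getAttribute("class")
                    : "?";

            lines.add(owner + ": " + param.getAttribute("name")
                    + " (" + param.getAttribute("type") + ") = "
                    + param.getTextContent().trim());
        }
        return lines;
    }

    public static void main(String[] args) throws Exception {
        // A cut-down version of the config above, inlined for the example.
        String xml =
                "<properties><parsers>"
              + "<parser class=\"org.apache.tika.parser.ocr.TesseractOCRParser\">"
              + "<params><param name=\"pageSegMode\" type=\"int\">1</param></params>"
              + "</parser>"
              + "</parsers></properties>";
        for (String line : listParams(xml)) {
            System.out.println(line);
        }
    }
}
```

This says nothing about whether Tika accepts a given parameter name for a given parser; it only shows what the file declares and where.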

Then create an instance of TikaConfig and pass it to the AutoDetectParser:

TikaConfig tikaConfig = new TikaConfig(
        this.getClass().getClassLoader().getResourceAsStream("tika-config.xml"));

final AutoDetectParser parser = new AutoDetectParser(tikaConfig);


In the latter case, I'm not exactly sure when the file is loaded.  At what point 
can I call pdfConfig.getxxx() or tesseractOCRConfig.getxxx()?
I'm also not sure whether I need a parser-exclude for both the PDF and Tesseract 
parsers under DefaultParser.  Do I even need the entry for DefaultParser?


What is the recommended way of doing this?
