Re: Setting parser options

Tim Allison Wed, 06 Jan 2021 06:30:31 -0800

If you are doing large-scale processing of untrusted documents, I'd highly
recommend the config.xml option with tika-server.  Keep your file
processing out of your process. :D
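
To make the out-of-process suggestion concrete, here is a minimal client sketch using only the JDK (Java 11+). It assumes a tika-server already running on its default port 9998 and started with your config file (in the 1.x line I believe that is the -c/--config option; check your version's --help). The class and method names are made up for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: sends a file to a running tika-server and returns the extracted text.
public class TikaServerClient {

    // Build a PUT request carrying the raw document bytes to the /tika endpoint.
    // "Accept: text/plain" asks the server for plain text rather than XHTML.
    static HttpRequest buildRequest(URI serverBase, byte[] document) {
        return HttpRequest.newBuilder(serverBase.resolve("/tika"))
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    // Parsing happens in the server's JVM, so a bad document cannot take down the caller.
    static String extractText(URI serverBase, Path file) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response = client.send(
                buildRequest(serverBase, Files.readAllBytes(file)),
                HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("tika-server returned HTTP " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("usage: TikaServerClient <file>");
            return;
        }
        // Assumption: server on localhost with the default port.
        System.out.println(extractText(URI.create("http://localhost:9998"), Path.of(args[0])));
    }
}
```

The parser settings then live entirely in the server's config file, and the client never touches PDFParserConfig or TesseractOCRConfig.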


You do need the entry for DefaultParser.  Given the parser sort order, you
shouldn't need the parser-exclude, but I add it as good practice.
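
If you want to follow that practice consistently, a small check along these lines can list the parsers that are configured on their own but not excluded from DefaultParser. This is only a sketch using the JDK's DOM parser, not anything Tika provides; the class and method names are hypothetical, and the embedded XML is a trimmed copy of the config quoted below with the params left out.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical lint: which separately configured parsers lack a parser-exclude?
public class ConfigExcludeCheck {
    static final String DEFAULT_PARSER = "org.apache.tika.parser.DefaultParser";

    // Direct child elements of 'parent' with the given tag name.
    static List<Element> children(Element parent, String tag) {
        List<Element> out = new ArrayList<>();
        NodeList nodes = parent.getChildNodes();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node n = nodes.item(i);
            if (n instanceof Element && tag.equals(n.getNodeName())) {
                out.add((Element) n);
            }
        }
        return out;
    }

    // Returns classes that have their own <parser> entry but no matching
    // <parser-exclude> under the DefaultParser entry.
    static List<String> configuredButNotExcluded(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Set<String> excluded = new HashSet<>();
        List<String> configured = new ArrayList<>();
        for (Element parsers : children(doc.getDocumentElement(), "parsers")) {
            for (Element parser : children(parsers, "parser")) {
                String cls = parser.getAttribute("class");
                if (DEFAULT_PARSER.equals(cls)) {
                    for (Element ex : children(parser, "parser-exclude")) {
                        excluded.add(ex.getAttribute("class"));
                    }
                } else {
                    configured.add(cls);
                }
            }
        }
        configured.removeAll(excluded);
        return configured;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<properties><parsers>"
                + "<parser class=\"org.apache.tika.parser.DefaultParser\">"
                + "<parser-exclude class=\"org.apache.tika.parser.pdf.PDFParser\"/>"
                + "</parser>"
                + "<parser class=\"org.apache.tika.parser.pdf.PDFParser\"/>"
                + "<parser class=\"org.apache.tika.parser.ocr.TesseractOCRParser\"/>"
                + "</parsers></properties>";
        // prints [org.apache.tika.parser.ocr.TesseractOCRParser]
        System.out.println(configuredButNotExcluded(xml));
    }
}
```

For the quoted config it reports only the TesseractOCRParser entry, since PDFParser is already excluded there. A reported class is not an error, just a place where the exclude was left off.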

I'd like to figure out a way to set the field params on the default
parsers in place, i.e. leave all the other parsers untouched, but apply the
following updates to the PDFParser that was already loaded as part of the
usual service loading process:

<parser class="org.apache.tika.parser.DefaultParser">
    <parser class="org.apache.tika.parser.pdf.PDFParser">
        <params>
            <param name="extractInlineImages" type="bool">false</param>
            <param name="ocrStrategy" type="string">OCR_AND_TEXT_EXTRACTION</param>
        </params>
    </parser>
</parser>

On Tue, Jan 5, 2021 at 5:36 PM Peter Kronenberg <[email protected]>
wrote:

> What is the recommended way for setting options on PDFParserConfig and
> TesseractOCRConfig?
>
>
>
> I’ve come up with two methods:
>
>
>
>    1. Make copies of TesseractOCRConfig.properties and
>    PDFParser.properties and put them in my own source under the same package
>    name.  Whatever is in here will be used, ignoring the corresponding
>    properties files in the jar
>    2. Create a tika-config.xml looking something like this:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <properties>
>     <parsers>
>         <parser class="org.apache.tika.parser.DefaultParser">
>             <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
>         </parser>
>         <parser class="org.apache.tika.parser.pdf.PDFParser">
>             <params>
>                 <param name="extractInlineImages" type="bool">false</param>
>                 <param name="ocrStrategy" type="string">OCR_AND_TEXT_EXTRACTION</param>
>             </params>
>         </parser>
>         <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
>             <params>
>                 <param name="pageSegMode" type="int">1</param>
>                 <param name="ocrStrategy" type="string">OCR_AND_TEXT_EXTRACTION</param>
>             </params>
>         </parser>
>     </parsers>
> </properties>
>
>
>
> Create an instance of TikaConfig and pass it to the AutoDetectParser:
>
> TikaConfig tikaConfig = new TikaConfig(
>     this.getClass().getClassLoader().getResourceAsStream("tika-config.xml"));
>
> final AutoDetectParser parser = new AutoDetectParser(tikaConfig);
>
>
>
>
>
> In the latter case, I’m not exactly sure when the file is loaded.  At what
> point can I call pdfConfig.getxxx() or tesseractOCRConfig.getxxx()?
>
> Also not sure if I need to do a parser-exclude of both the PDF and Tess
> parser under DefaultParser.  Do I even need the entry for DefaultParser?
>
>
>
>
>
> What is the recommended way of doing this?
>
