Yes, completely agree.  This has been on my todo list forever --
programmatically dumping the config options for parsers.

This is where it will go:
https://cwiki.apache.org/confluence/display/TIKA/MSOfficeParsers

I'll try to make time for it this week.

On Tue, Jun 21, 2022 at 5:17 AM Inzamam Anwar <[email protected]>
wrote:

> Thank you Tim for the quick response.
>
> I was wondering whether it is possible to provide a detailed 'default' XML
> configuration file covering all settable parameters. This would help
> people from different backgrounds control the behavior of Apache Tika.
>
> Regards
> Inzamam
>
> On Tue, Jun 21, 2022 at 12:16 AM Tim Allison <[email protected]> wrote:
>
>> In looking into this, I discovered:
>> https://issues.apache.org/jira/browse/TIKA-3796 .  It looks like that
>> parameter was not settable via tika-config.  I've fixed this now, and the
>> fix will be in the next release.  I'm not sure, yet, when that will be, but
>> you can build locally or pull a build from Jenkins.
>>
>> The example config that shows how to turn this on/off is here:
>> https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/org/apache/tika/parser/microsoft/tika-config-headers-footers.xml
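>>
>> As a rough sketch only (not copied from the linked file), a config along
>> these lines should work once the fix is available. It assumes the
>> parameter is set on the parser classes themselves, OfficeParser for .doc
>> and OOXMLParser for .docx, rather than on OfficeParserConfig, and that
>> both are excluded from DefaultParser so they are not registered twice:
>>
>> <?xml version="1.0" encoding="UTF-8"?>
>> <properties>
>>   <parsers>
>>     <!-- Assumption: exclude the Office parsers from the default set,
>>          then re-add them below with the custom parameter. -->
>>     <parser class="org.apache.tika.parser.DefaultParser">
>>       <parser-exclude class="org.apache.tika.parser.microsoft.OfficeParser"/>
>>       <parser-exclude class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
>>     </parser>
>>     <!-- Assumption: handles the older binary formats such as .doc -->
>>     <parser class="org.apache.tika.parser.microsoft.OfficeParser">
>>       <params>
>>         <param name="includeHeadersAndFooters" type="bool">false</param>
>>       </params>
>>     </parser>
>>     <!-- Assumption: handles the OOXML formats such as .docx -->
>>     <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
>>       <params>
>>         <param name="includeHeadersAndFooters" type="bool">false</param>
>>       </params>
>>     </parser>
>>   </parsers>
>> </properties>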
>>
>> On Mon, Jun 20, 2022 at 3:56 AM Inzamam Anwar <[email protected]>
>> wrote:
>>
>>> Hello,
>>>
>>> I am trying to omit headers/footers from doc/docx files. I have tried
>>> the following XML configuration file with "tika-server-standard-2.4.0.jar".
>>> I have attached a sample file also. Any help in this regard would be
>>> appreciated.
>>>
>>> <?xml version="1.0" encoding="UTF-8"?>
>>> <properties>
>>>     <parsers>
>>>         <parser class="org.apache.tika.parser.DefaultParser">
>>>             <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
>>>         </parser>
>>>         <parser class="org.apache.tika.parser.pdf.PDFParser">
>>>             <params>
>>>                 <param name="sortByPosition" type="bool">true</param>
>>>             </params>
>>>         </parser>
>>>         <parser class="org.apache.tika.parser.microsoft.OfficeParserConfig">
>>>             <params>
>>>                 <param name="includeHeadersAndFooters" type="bool">false</param>
>>>             </params>
>>>         </parser>
>>>     </parsers>
>>> </properties>
>>>
>>> Regards
>>> Inzamam
>>>
>>>
