Yes, completely agree. This has been on my to-do list forever -- programmatically dumping the config options for parsers.
This is where it will go: https://cwiki.apache.org/confluence/display/TIKA/MSOfficeParsers

I'll try to make time for it this week.

On Tue, Jun 21, 2022 at 5:17 AM Inzamam Anwar <[email protected]> wrote:

> Thank you Tim for the quick response.
>
> I was wondering whether it is possible to make a detailed 'default' xml
> configuration file for all settable parameters or not. Doing this will help
> people from different backgrounds to control behavior of Apache Tika.
>
> Regards
> Inzamam
>
> On Tue, Jun 21, 2022 at 12:16 AM Tim Allison <[email protected]> wrote:
>
>> In looking into this, I discovered:
>> https://issues.apache.org/jira/browse/TIKA-3796 . It looks like that
>> parameter was not settable via tika-config. I've fixed this now, and the
>> fix will be in the next release. I'm not sure, yet, when that will be, but
>> you can build locally or pull a build from Jenkins.
>>
>> The example config that shows how to turn this on/off is here:
>> https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/org/apache/tika/parser/microsoft/tika-config-headers-footers.xml
>>
>> On Mon, Jun 20, 2022 at 3:56 AM Inzamam Anwar <[email protected]>
>> wrote:
>>
>>> Hello,
>>>
>>> I am trying to omit headers/footers from doc/docx files. I have tried
>>> the following XML configuration file with "tika-server-standard-2.4.0.jar".
>>> I have attached a sample file also. Any help in this regard would be
>>> appreciated.
>>>
>>> <?xml version="1.0" encoding="UTF-8"?>
>>> <properties>
>>>   <parsers>
>>>     <parser class="org.apache.tika.parser.DefaultParser">
>>>       <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
>>>     </parser>
>>>     <parser class="org.apache.tika.parser.pdf.PDFParser">
>>>       <params>
>>>         <param name="sortByPosition" type="bool">true</param>
>>>       </params>
>>>     </parser>
>>>     <parser class="org.apache.tika.parser.microsoft.OfficeParserConfig">
>>>       <params>
>>>         <param name="includeHeadersAndFooters" type="bool">false</param>
>>>       </params>
>>>     </parser>
>>>   </parsers>
>>> </properties>
>>>
>>> Regards
>>> Inzamam
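
For reference, a minimal sketch of a tika-config that attaches includeHeadersAndFooters to the parser classes themselves rather than to OfficeParserConfig, which holds settings and is not itself a parser. It is modeled on the example config linked above, not copied from it, so treat the details as assumptions: that OfficeParser handles .doc and OOXMLParser handles .docx, that both should be excluded from DefaultParser so they are not registered twice, and that the build in use includes the TIKA-3796 fix (2.4.0 does not, per the thread).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: check class names and layout against the linked
     tika-config-headers-footers.xml before relying on it. -->
<properties>
  <parsers>
    <!-- Keep every default parser except the two configured below. -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.microsoft.OfficeParser"/>
      <parser-exclude class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
    </parser>
    <!-- Assumed to cover legacy binary formats such as .doc -->
    <parser class="org.apache.tika.parser.microsoft.OfficeParser">
      <params>
        <param name="includeHeadersAndFooters" type="bool">false</param>
      </params>
    </parser>
    <!-- Assumed to cover OOXML formats such as .docx -->
    <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
      <params>
        <param name="includeHeadersAndFooters" type="bool">false</param>
      </params>
    </parser>
  </parsers>
</properties>
```

The PDFParser block from the original config can be kept alongside these entries, with its own parser-exclude line under DefaultParser.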
