Thanks to Tilman for the suggestion. This is the tika-config.xml that turns off XFA parsing, which protects against CVE-2025-54988.
I tested this with 2.x. I trust that it will also work with 3.x.

<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to you under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<properties>
  <service-loader initializableProblemHandler="throw"/>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- this is the one that matters -->
        <param name="extractAcroFormContent" type="bool">false</param>
        <!-- this can override the above. make absolutely sure this is false -->
        <param name="ifXFAExtractOnlyXFA" type="bool">false</param>
      </params>
    </parser>
  </parsers>
</properties>

On Fri, Aug 22, 2025 at 9:13 AM Simon Urli <simon.u...@xwiki.com> wrote:
>
> Hi,
>
> thanks a lot for the fast answer and the details.
>
> On 22/08/2025 at 14:38, Tim Allison wrote:
> > As you read, the vulnerability is an XXE via a PDF with a crafted XFA
> > (xml) file embedded.
> >
> > Generally speaking, the worst case scenario is that a user is running
> > Tika as a user with a high level of permissions, parsing untrusted
> > files and returning the results to an attacker.
> > So, maybe a jobs site runs Tika as root, parses a resume submitted by
> > an attacker and shows the results to the attacker, er, job applicant.
> > The exploit is that the XFA may contain an external entity that would
> > read the contents of e.g. "/etc/passwd" or pull content from
> > "https://our-local-sharepoint.com/super-secret.html" and return that
> > to the attacker as "this is the text we extracted from your
> > resume:xyz". Data exfiltration.
>
> Ok, in our case XWiki should never be running as a user with such a
> level of permissions, but nevertheless it could allow an attacker to
> access some configuration files of XWiki itself that shouldn't be
> readable by any user.
>
> >
> > A not great scenario is that the attacker drops a million such PDFs
> > into your site, and you now have a million network calls to an
> > internal http site on your network or even a public site. This is a
> > denial of service.
> I think it would be less of a problem for us as the parsing is only done
> once for the indexing in a queue, and I think we have some measures to
> prevent uploading too many files at once.
> >
> > The minimal fix is in this commit:
> > https://github.com/apache/tika/commit/bfee6d5569fe9197c4ea947a96e212825184ca33
> >
> > I made some slight updates here:
> > https://github.com/apache/tika/commit/fd2016ffe4a892c06da097b50deeecf8c9d5813a
> >
> > The root cause of the vuln is that I thought that our
> > IGNORING_STAX_ENTITY_RESOLVER was preventing calls to external
> > entities. However, it returned a String, which was not the correct
> > object type, and Java was silently ignoring that problem and backing
> > off to default behavior which allows external entities.
> >
> > Unfortunately, there's no way via configuration to tell Tika to avoid
> > parsing XFA.
> That would indeed have been a good option.
> >
> > One solution would be to refactor your code to use tika-server, which
> > would put all the dependencies into a separate jvm and you wouldn't
> > have jar hell with jakarta etc. That's a heavy lift, I realize.
> Yeah well, we need to perform the jakarta migration anyway for other
> libraries too, it's just that we're lagging behind on the topic...
> >
> > 2.x is EOL, and I'd really personally rather not make another release,
> > but I can see from your note that there is a need. My major concern
> > with a 2.x release is that there are probably a number of other
> > dependencies that now have vulns in their jdk 8 versions, and
> > figuring out which of those dependencies we can update within the
> > jdk8 limitations could take a lot of time.
> >
> > Fellow devs, what do you think?
>
> So clearly that would be the ideal for us: right now we're internally
> discussing forking Tika 2.x, applying your changes and deploying a
> custom version in our own repo to get the fix. It would be better if it
> were an official one, for sure.
>
> Thanks again,
>
> Simon.
>
> >
> > Best,
> > Tim
> >
> > On Fri, Aug 22, 2025 at 5:09 AM Simon Urli <simon.u...@xwiki.com> wrote:
> >> Hello,
> >>
> >> I'm one of the core contributors of the XWiki platform
> >> (https://www.xwiki.org) which relies on Tika.
> >>
> >> We got informed this morning through our automated checks about the
> >> publication of CVE-2025-54988. We still haven't managed to finish our
> >> migration to Tika 3.x because of the complex migration to Jakarta of all
> >> the subsequent dependencies (see
> >> https://jira.xwiki.org/browse/XWIKI-22595) meaning that we depend on
> >> Tika 2.x which is affected by the CVE, apparently without any easy
> >> workaround and without a plan for releasing a bug fix, if I understand
> >> correctly what's been announced regarding the 2.x EOL.
> >>
> >> So at this point we're trying to understand how much we're possibly
> >> affected by this CVE: we're currently using the tika-parser-pdf-module
> >> mainly in that class:
> >> https://github.com/xwiki/xwiki-platform/blob/master/xwiki-platform-core/xwiki-platform-search/xwiki-platform-search-solr/xwiki-platform-search-solr-api/src/main/java/org/xwiki/search/solr/internal/metadata/AbstractSolrMetadataExtractor.java#L520-L543,
> >> where we use it to perform indexing of PDF documents.
> >>
> >> I've tried to look in the recent commits in
> >> https://github.com/apache/tika/commits/3.2.2/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module
> >> to understand the vulnerability a bit better but I'm failing to see it,
> >> and I haven't found any more information in JIRA when browsing the
> >> tickets fixed in 3.2.2.
> >>
> >> So would it be possible to get more information about this
> >> vulnerability, like a possible scenario of an exploit so that we can
> >> check quickly if we're impacted or not?
> >>
> >> Thanks,
> >>
> >> Simon Urli.
> >>
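
A note on the root cause described in the quoted thread above (an entity resolver that returned a String): the StAX XMLResolver contract only accepts an InputStream, an XMLStreamReader or an XMLEventReader as a result, and returning null tells the parser to fall back to its default resolution. The following is a minimal sketch of one way to harden a plain JDK StAX factory under that contract. It is not the Tika patch from the commits linked above. The class name StaxXxeSketch and the helpers hardenedFactory and extractText are made up for illustration, and the only assumption is the standard javax.xml.stream API.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration class; nothing here is taken from Tika's code base.
class StaxXxeSketch {

    /** Builds a StAX factory that refuses DTDs and external entities. */
    static XMLInputFactory hardenedFactory() {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Primary defense: no DTD processing, no external entity resolution.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        // Defense in depth: if the resolver is ever consulted, hand back an
        // empty InputStream, which is one of the result types StAX accepts.
        factory.setXMLResolver((publicId, systemId, baseUri, namespace) ->
                new ByteArrayInputStream(new byte[0]));
        return factory;
    }

    /** Concatenates the character data of the document using the hardened factory. */
    static String extractText(String xml) throws XMLStreamException {
        XMLStreamReader reader =
                hardenedFactory().createXMLStreamReader(new StringReader(xml));
        StringBuilder text = new StringBuilder();
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    text.append(reader.getText());
                }
            }
        } finally {
            reader.close();
        }
        return text.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(extractText("<doc>plain text</doc>"));
    }
}
```

Depending on the StAX implementation, a document that references an external entity may be rejected with an XMLStreamException instead of being parsed with the entity left unexpanded, so callers should be prepared for either outcome. In both cases the referenced file or URL should not be read.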