Simon,

I'm sorry for my delay. I'm hesitant to share the triggering PDF, even offline.

I just added unit tests that confirm the fix for StAX processing: https://github.com/apache/tika/pull/2318 . Will that be of any use to you? The StAX tests failed before the fix.

Also, I can confirm that I was able to trigger Jazzer's XXE/SSRF sanitizer with a custom PDFParser harness against our 2.x code before the fix. The vulnerability was real.

I'm sorry that I can't help more on this.

Best,

Tim

On Wed, Aug 27, 2025 at 5:44 AM Simon Urli <simon.u...@xwiki.com> wrote:
>
> Hello,
>
> thanks for all the details, we tried using
> testPDF_XFA_govdocs1_258578.pdf and we confirmed that the XFA part is
> parsed and indexed on our side. However, ideally we'd like to not lose
> the indexing of the XFA part, and we're still in doubt that the XXE is
> impacting us (partly because we tried to build a customized version of
> tika 2.x by applying patches and we got errors suggesting interactions
> with woodstox for parsing the XML, and apparently woodstox claims to not
> be affected by this kind of XXE).
>
> The problem is that at this point we're still trying to craft a PDF
> exposing the XXE and apparently we're failing to do so... So I know it's
> a bit sensitive, but would it be possible to send us such a PDF,
> probably not publicly on the mailing list but maybe directly to
> secur...@xwiki.org?
>
> Thanks again for all the info and work here,
>
> Simon
>
> On 23/08/2025 at 05:17, Tilman Hausherr wrote:
> > To check that it works, test with the file
> > testPDF_XFA_govdocs1_258578.pdf from the tika source code. If "Abraham
> > Lincoln" is part of the output then it didn't work.
> >
> > Tilman
> >
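
For readers following the thread, here is a minimal sketch of what hardening a StAX reader against XXE generally looks like. It uses only the standard `javax.xml.stream` API from the JDK. It is not the code from the Tika pull request; the class name `HardenedStaxSketch` and its methods are hypothetical and exist only for illustration.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration class, not part of Tika.
public class HardenedStaxSketch {

    // Builds a StAX factory with DTD support and external entities switched off.
    // Both property names are standard constants defined on XMLInputFactory.
    static XMLInputFactory hardenedFactory() {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
        return factory;
    }

    // Concatenates all character data in the document.
    // Throws XMLStreamException if the parser rejects the input.
    static String extractText(String xml) throws XMLStreamException {
        XMLStreamReader reader = hardenedFactory().createXMLStreamReader(new StringReader(xml));
        StringBuilder text = new StringBuilder();
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    text.append(reader.getText());
                }
            }
        } finally {
            reader.close();
        }
        return text.toString();
    }
}
```

With such a factory, a document that declares an external entity should either be rejected with an `XMLStreamException` or be parsed without the entity's content appearing in the output. The exact behavior depends on which StAX implementation is on the classpath.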