Thank you for raising this and sharing a triggering doc. I've opened: https://issues.apache.org/jira/browse/TIKA-4646
On Mon, Feb 2, 2026 at 11:02 AM Mike Flester via user <[email protected]> wrote:

> Hello -
>
> The ooxml/docx began life as a phishing email attachment. The attacker
> hyperlink has been replaced with something benign.
>
> Tika did not extract the link because (I think) it's in "instructional
> text". The document appears to work fine (the victim is able to click the
> link).
>
> I have a bit of POI code (not production quality) that can dig this
> instructional text out.
>
> Link to both the docx and the java code -
> https://limewire.com/d/qtC1E#79Q8zip1SU
>
> $ javac -cp ~/tika/tika-app/target/lib/*:. POILinkExtractor.java
> $ java -cp ~/tika/tika-app/target/lib/*:. POILinkExtractor missing-link.docx
>
> Is this something that might see a place in Tika? As an option on the
> existing XWPFWordExtractorDecorator? Or as a new parser in that package? Or
> would I be best doing something outside of Tika for this case?
>
> Thanks,
> Mike
>
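
For anyone following the thread without downloading the archive, here is a rough, JDK-only sketch of the general idea. It is not the POILinkExtractor code from the link, and it uses neither Tika nor POI APIs. It assumes that "instructional text" means field instruction text, i.e. w:instrText runs between w:fldChar begin/separate markers, plus the w:instr attribute of w:fldSimple. The class name FieldLinkSketch is hypothetical. It also assumes the main part lives at word/document.xml, and it ignores nested fields, headers, and footers.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not part of Tika or POI.
class FieldLinkSketch {
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    // First quoted argument directly after the HYPERLINK keyword.
    static final Pattern HYPERLINK = Pattern.compile("HYPERLINK\\s+\"([^\"]+)\"");

    static List<String> extractHyperlinkTargets(InputStream documentXml) {
        Document doc;
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            // Untrusted input: refuse DOCTYPE declarations to avoid XXE.
            dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            doc = dbf.newDocumentBuilder().parse(documentXml);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse document XML", e);
        }
        List<String> targets = new ArrayList<>();
        StringBuilder instr = new StringBuilder();
        // All WordprocessingML elements, in document order.
        NodeList all = doc.getElementsByTagNameNS(W_NS, "*");
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            String name = e.getLocalName();
            if ("instrText".equals(name)) {
                // One instruction may be split across several runs.
                instr.append(e.getTextContent());
            } else if ("fldChar".equals(name)) {
                String type = e.getAttributeNS(W_NS, "fldCharType");
                if (!"begin".equals(type)) {
                    // "separate" or "end" closes the instruction part.
                    collect(instr.toString(), targets);
                }
                instr.setLength(0); // simplification: nested fields are not tracked
            } else if ("fldSimple".equals(name)) {
                collect(e.getAttributeNS(W_NS, "instr"), targets);
            }
        }
        return targets;
    }

    private static void collect(String instruction, List<String> out) {
        Matcher m = HYPERLINK.matcher(instruction);
        while (m.find()) {
            out.add(m.group(1));
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java FieldLinkSketch.java file.docx");
            return;
        }
        try (ZipFile zip = new ZipFile(args[0])) {
            // Assumption: the main document part is at the conventional path.
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                System.err.println("no word/document.xml in " + args[0]);
                return;
            }
            try (InputStream in = zip.getInputStream(entry)) {
                extractHyperlinkTargets(in).forEach(System.out::println);
            }
        }
    }
}
```

Saved as FieldLinkSketch.java, this should run with "java FieldLinkSketch.java missing-link.docx" on a recent JDK. Whether it finds the link in the attached file depends on the assumptions above holding for that document.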
