Thank you for raising this and sharing a triggering doc. I've opened:
https://issues.apache.org/jira/browse/TIKA-4646

On Mon, Feb 2, 2026 at 11:02 AM Mike Flester via user <[email protected]>
wrote:

> Hello -
>
> The ooxml/docx began life as a phishing email attachment. The attacker
> hyperlink has been replaced with something benign.
>
> Tika did not extract the link because (I think) it's in "instructional
> text". The document appears to work fine (the victim is able to click the
> link).
>
> I have a bit of POI code (not production quality) that can dig this
> instructional text out.
>
> Link to both the docx and the java code -
> https://limewire.com/d/qtC1E#79Q8zip1SU
>
> $ javac -cp ~/tika/tika-app/target/lib/*:. POILinkExtractor.java
> $ java -cp ~/tika/tika-app/target/lib/*:. POILinkExtractor missing-link.docx
>
> Is this something that might see a place in Tika? As an option on the
> existing XWPFWordExtractorDecorator? Or as a new parser in that package? Or
> would I be best doing something outside of Tika for this case?
>
> Thanks,
> Mike
>
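
For readers of the archive, here is a rough, JDK-only sketch of the idea
described above. It assumes that "instructional text" means field
instruction text (w:instrText, or the w:instr attribute on w:fldSimple) and
that the link is stored in the quoted form HYPERLINK "target". It is not
the POILinkExtractor.java from the link above and not Tika or POI code; the
class and method names are made up for illustration.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch: names are illustrative, not Tika or POI API.
class FieldLinkSketch {
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    // Assumption: only the quoted form HYPERLINK "target" is handled.
    static final Pattern HYPERLINK = Pattern.compile("HYPERLINK\\s+\"([^\"]*)\"");

    static List<String> linksFromDocumentXml(InputStream xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        // Untrusted input: refuse DOCTYPE declarations to avoid XXE.
        dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        // All WordprocessingML elements, in document order.
        NodeList all = dbf.newDocumentBuilder().parse(xml)
                .getElementsByTagNameNS(W_NS, "*");
        List<String> links = new ArrayList<>();
        StringBuilder instr = new StringBuilder();
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            switch (e.getLocalName()) {
                case "fldSimple":
                    // Simple field: the whole instruction is in one attribute.
                    collect(e.getAttributeNS(W_NS, "instr"), links);
                    break;
                case "instrText":
                    // Complex field: the instruction may be split across runs.
                    instr.append(e.getTextContent());
                    break;
                case "fldChar":
                    // Any field marker (begin, separate, end) flushes the
                    // text gathered so far. Nested fields are not handled.
                    collect(instr.toString(), links);
                    instr.setLength(0);
                    break;
                default:
                    break;
            }
        }
        collect(instr.toString(), links);
        return links;
    }

    static void collect(String instruction, List<String> out) {
        Matcher m = HYPERLINK.matcher(instruction);
        while (m.find()) {
            out.add(m.group(1));
        }
    }

    public static void main(String[] args) throws Exception {
        try (ZipFile docx = new ZipFile(args[0])) {
            // Only the main document part is read; headers, footers and
            // notes live in other parts and are skipped in this sketch.
            ZipEntry part = docx.getEntry("word/document.xml");
            if (part == null) {
                throw new IllegalArgumentException("no word/document.xml in " + args[0]);
            }
            try (InputStream in = docx.getInputStream(part)) {
                linksFromDocumentXml(in).forEach(System.out::println);
            }
        }
    }
}
```

It needs nothing from the Tika lib directory on the classpath:

$ javac FieldLinkSketch.java
$ java -cp . FieldLinkSketch missing-link.docx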
