Hello -

The OOXML (.docx) file linked below began life as a phishing email
attachment. The attacker's hyperlink has been replaced with something benign.

Tika did not extract the link because (I think) it's in "instructional
text". The document appears to work fine (the victim is able to click the
link).

I have a bit of POI code (not production quality) that can dig this
instructional text out.
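
For anyone who would rather not download the files, here is a rough,
JDK-only sketch of the idea. It is not the POILinkExtractor code and it
does not use POI. It assumes (I have not confirmed this is the whole story)
that the link is stored as a Word HYPERLINK field, with the instruction
held in w:instrText runs between w:fldChar markers, or in the w:instr
attribute of w:fldSimple. The class name InstrTextLinkSketch is made up for
illustration, and only word/document.xml is read (no headers, footers or
footnotes).

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical class name; this is NOT the POILinkExtractor from the download.
final class InstrTextLinkSketch {

    // WordprocessingML main namespace (the usual "w:" prefix).
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Assumption: the target is the first quoted argument after HYPERLINK.
    private static final Pattern HYPERLINK =
            Pattern.compile("HYPERLINK\\s+\"([^\"]+)\"");

    /** Returns HYPERLINK field targets found in a word/document.xml stream. */
    static List<String> extractLinks(InputStream documentXml) {
        List<String> links = new ArrayList<>();
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            // The input is untrusted, so refuse DOCTYPE declarations.
            dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            // All elements in the w: namespace, in document order.
            NodeList all = dbf.newDocumentBuilder().parse(documentXml)
                    .getElementsByTagNameNS(W_NS, "*");

            StringBuilder instr = new StringBuilder();
            for (int i = 0; i < all.getLength(); i++) {
                Element e = (Element) all.item(i);
                switch (e.getLocalName()) {
                    case "instrText":
                        // One instruction may be split across several runs.
                        instr.append(e.getTextContent());
                        break;
                    case "fldChar":
                        // Any field marker ends the instruction collected so far.
                        collect(instr.toString(), links);
                        instr.setLength(0);
                        break;
                    case "fldSimple":
                        collect(e.getAttributeNS(W_NS, "instr"), links);
                        break;
                    default:
                        break;
                }
            }
            collect(instr.toString(), links);
        } catch (ParserConfigurationException | SAXException | IOException ex) {
            throw new IllegalStateException("Could not parse document XML", ex);
        }
        return links;
    }

    // Adds every HYPERLINK target found in one field instruction string.
    private static void collect(String instruction, List<String> links) {
        Matcher m = HYPERLINK.matcher(instruction);
        while (m.find()) {
            links.add(m.group(1));
        }
    }

    public static void main(String[] args) throws IOException {
        try (ZipFile docx = new ZipFile(args[0])) {
            ZipEntry part = docx.getEntry("word/document.xml");
            if (part == null) {
                throw new IOException("No word/document.xml in " + args[0]);
            }
            try (InputStream in = docx.getInputStream(part)) {
                extractLinks(in).forEach(System.out::println);
            }
        }
    }
}
```

With Java 11 or later this can be run directly from source:

$ java InstrTextLinkSketch.java missing-link.docx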

Link to both the docx and the Java code:
https://limewire.com/d/qtC1E#79Q8zip1SU

$ javac -cp ~/tika/tika-app/target/lib/*:. POILinkExtractor.java
$ java -cp ~/tika/tika-app/target/lib/*:. POILinkExtractor missing-link.docx

Is this something that might see a place in Tika? As an option on the
existing XWPFWordExtractorDecorator? Or as a new parser in that package? Or
would I be best doing something outside of Tika for this case?

Thanks,
Mike
