Hello - The ooxml/docx began life as a phishing email attachment. The attacker's hyperlink has been replaced with something benign.
Tika did not extract the link because (I think) it's in "instructional text". The document appears to work fine (the victim is able to click the link). I have a bit of POI code (not production quality) that can dig this instructional text out.

Link to both the docx and the Java code: https://limewire.com/d/qtC1E#79Q8zip1SU

    $ javac -cp ~/tika/tika-app/target/lib/*:. POILinkExtractor.java
    $ java -cp ~/tika/tika-app/target/lib/*:. POILinkExtractor missing-link.docx

Is this something that might find a place in Tika? As an option on the existing XWPFWordExtractorDecorator? Or as a new parser in that package? Or would I be best doing something outside of Tika for this case?

Thanks,
Mike
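
For anyone who cannot fetch the download, here is a rough sketch of the general idea. It is not the POILinkExtractor code linked above. It assumes that "instructional text" refers to field instructions stored in `w:instrText` elements (for example a `HYPERLINK "..."` field), and it uses only the JDK (zip plus DOM) rather than POI. The class name `InstrTextLinkSketch` and its methods are hypothetical names for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical sketch: pull HYPERLINK targets out of w:instrText field instructions.
public class InstrTextLinkSketch {
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    // Only matches the simple form HYPERLINK "target"; fields that put
    // switches before the target are not handled here.
    private static final Pattern HYPERLINK =
            Pattern.compile("HYPERLINK\\s+\"([^\"]+)\"");

    static List<String> extractLinks(InputStream documentXml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            // The input is untrusted, so refuse DOCTYPE declarations.
            dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = dbf.newDocumentBuilder().parse(documentXml);

            // A single field instruction can be split across several runs,
            // so concatenate every w:instrText in document order first.
            NodeList nodes = doc.getElementsByTagNameNS(W_NS, "instrText");
            StringBuilder instructions = new StringBuilder();
            for (int i = 0; i < nodes.getLength(); i++) {
                instructions.append(nodes.item(i).getTextContent());
            }

            List<String> links = new ArrayList<>();
            Matcher m = HYPERLINK.matcher(instructions);
            while (m.find()) {
                links.add(m.group(1));
            }
            return links;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document XML", e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Only the main document part is read; headers, footers and
        // footnotes live in other parts and are ignored in this sketch.
        try (ZipFile zip = new ZipFile(args[0])) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                System.err.println("No word/document.xml in " + args[0]);
                return;
            }
            try (InputStream in = zip.getInputStream(entry)) {
                extractLinks(in).forEach(System.out::println);
            }
        }
    }
}
```

It can be run as `java InstrTextLinkSketch.java missing-link.docx` on a recent JDK. A real implementation would more likely track `w:fldChar` begin/end markers so that each field is handled separately, instead of concatenating all instruction text as this sketch does.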
