On Wed, 11 Nov 2020, nensick wrote:
I am exploring the available features and have also managed to extract
Office macros, but I still can't find a way to get the links.
Imagine you have a PDF or a DOCX in which a "click here" text is a link
pointing to a website (let's say example[.]com). How can I get
example[.]com?
If you were calling Tika from Java directly, it would be fairly easy: just
provide your own content handler that captures only the <a> tags and
records their href attributes. You can use the Tee content handler
(TeeContentHandler) to have a normal text-extraction handler called
alongside your link-capturing one.
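
To show the shape of that idea, here is a rough sketch in Python using the
standard library's xml.sax rather than Tika itself. LinkHandler, TextHandler
and Tee are hypothetical names made up for this illustration; in Java you
would write a SAX ContentHandler of your own and combine it with a text
handler through Tika's TeeContentHandler.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class LinkHandler(ContentHandler):
    """Hypothetical handler: records the href of every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []

    def startElement(self, name, attrs):
        href = attrs.get("href")
        if name == "a" and href:
            self.links.append(href)


class TextHandler(ContentHandler):
    """Hypothetical handler: collects character data as plain text."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def characters(self, content):
        self.parts.append(content)

    @property
    def text(self):
        return "".join(self.parts)


class Tee(ContentHandler):
    """Hypothetical stand-in for a tee: forwards each event to all handlers."""

    def __init__(self, *handlers):
        super().__init__()
        self.handlers = handlers

    def startElement(self, name, attrs):
        for handler in self.handlers:
            handler.startElement(name, attrs)

    def endElement(self, name):
        for handler in self.handlers:
            handler.endElement(name)

    def characters(self, content):
        for handler in self.handlers:
            handler.characters(content)


# Made-up sample input standing in for the XHTML events a parser emits.
SAMPLE = (
    b'<html><body><p>Please <a href="http://example.com/">click here</a>.'
    b"</p></body></html>"
)

link_handler, text_handler = LinkHandler(), TextHandler()
xml.sax.parseString(SAMPLE, Tee(link_handler, text_handler))
print(link_handler.links)
print(text_handler.text)
```

A single pass over the document feeds both handlers, so you get the links
and the text without parsing twice.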
From the Tika Server, it's not quite so simple. I'd probably just ask
the Tika Server for the XHTML version of your document (instead of the
plain text one), then use the XML parsing in your calling language to grab
the links from the <a> tags. Depending on your needs, either call the Tika
Server twice, once for XHTML to get the tags and once for plain text, or
call it just once for XHTML and process the result twice.
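
As a minimal sketch of the "call once, process twice" option, assuming
Python as the calling language: the helper names below are made up, and the
server URL (http://localhost:9998/tika) and the "Accept: text/html" header
are assumptions about a default local Tika Server setup, so adjust them to
yours. It also assumes the returned XHTML is well-formed XML.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Namespace used by XHTML documents.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def fetch_xhtml(path, url="http://localhost:9998/tika"):
    """Send a file to a Tika Server and ask for XHTML (URL is an assumption)."""
    with open(path, "rb") as f:
        request = urllib.request.Request(
            url, data=f.read(), method="PUT", headers={"Accept": "text/html"}
        )
    with urllib.request.urlopen(request) as response:
        return response.read()


def links_and_text(xhtml):
    """Process one XHTML result twice: once for hrefs, once for plain text."""
    root = ET.fromstring(xhtml)
    links = [a.get("href") for a in root.iter(XHTML_NS + "a") if a.get("href")]
    body = root.find(XHTML_NS + "body")
    text = "".join(body.itertext()) if body is not None else ""
    return links, text.strip()


# Made-up sample standing in for a server response.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>demo</title></head>
<body><p>Please <a href="http://example.com/">click here</a> now.</p></body>
</html>"""

links, text = links_and_text(SAMPLE)
print(links)
print(text)
```

With a running server you would pass fetch_xhtml("some-file.docx") to
links_and_text instead of the sample string.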
Nick