Re: Extract URLs from a document

Nick Burch Thu, 12 Nov 2020 03:55:25 -0800

On Wed, 11 Nov 2020, nensick wrote:

I am exploring the available features and I managed also to extract Office macros but I still don't find a way to get the links.
Imagine to have a PDF, a DOCX in which you have a "click here" text as a link pointing to a website (let's say example[.]com). How can I get example[.]com?

If you were calling the Java directly, it would be fairly easy - just provide your own content handler that only captures the <a> tags and records the href attributes of those. You can use the Tee content handler to have a normal text-extraction handler called as well as your link-capturing one.
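
As a rough illustration of the link-capturing handler described above, here is a minimal sketch using only the JDK's SAX classes. The class names LinkCollector and LinkCollectorDemo are made up for this example, and the demo feeds the handler a hand-written XHTML string rather than running a real Tika parse. With Tika on the classpath, the same handler would presumably be passed to the parser, optionally wrapped together with a text handler in org.apache.tika.sax.TeeContentHandler; that wiring is not shown or tested here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records the href attribute of every <a> element it sees.
class LinkCollector extends DefaultHandler {
    private final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // With namespace-aware SAX events the element name is in localName;
        // fall back to qName for parsers that are not namespace-aware.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if ("a".equals(name)) {
            String href = atts.getValue("", "href");
            if (href == null) {
                href = atts.getValue("href");
            }
            if (href != null) {
                links.add(href);
            }
        }
    }

    List<String> getLinks() {
        return links;
    }
}

class LinkCollectorDemo {
    // Stand-in for a Tika parse: drives the handler from an XHTML string.
    static List<String> collect(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        LinkCollector collector = new LinkCollector();
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), collector);
        return collector.getLinks();
    }

    public static void main(String[] args) throws Exception {
        String sample = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<p><a href=\"https://example.com\">click here</a></p>"
                + "</body></html>";
        System.out.println(collect(sample));
    }
}
```

Because the handler ignores everything except <a> start tags, it can sit next to a normal text-extraction handler without interfering with it.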

From the Tika Server, it's not quite so simple. I'd probably just say ask the Tika Server for the XHTML version of your document (instead of the plain text one), then use the XML parsing in your calling language to grab the links from the a tags. Depending on your needs, either call the Tika Server twice, once for XHTML to get tags and once for plain text, or just once for XHTML and process the results twice.
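
A sketch of the "call once for XHTML and process the results twice" option, again in Java with only JDK classes. Several things here are assumptions rather than anything confirmed in this thread: the class name XhtmlLinks is made up, the server is assumed to listen on http://localhost:9998, and the /tika endpoint is assumed to return XHTML when sent "Accept: text/html". Only the parsing part is exercised by the demo; fetchXhtml needs a running server and Java 11 or later.

```java
import java.io.StringReader;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XhtmlLinks {
    // Assumed endpoint and header; adjust to match your Tika Server setup.
    static String fetchXhtml(Path file) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:9998/tika"))
                .header("Accept", "text/html")
                .PUT(HttpRequest.BodyPublishers.ofFile(file))
                .build();
        return HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .body();
    }

    // Parse the XHTML once; the resulting document is reused for both passes.
    static Document parse(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xhtml)));
    }

    // First pass: href values of all <a> elements, in any namespace.
    static List<String> links(Document doc) {
        NodeList anchors = doc.getElementsByTagNameNS("*", "a");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < anchors.getLength(); i++) {
            Element anchor = (Element) anchors.item(i);
            if (anchor.hasAttribute("href")) {
                result.add(anchor.getAttribute("href"));
            }
        }
        return result;
    }

    // Second pass: the text with all markup dropped (includes the <title> text).
    static String text(Document doc) {
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String sample = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title>Demo</title></head>"
                + "<body><p>Please <a href=\"https://example.com\">click here</a>.</p></body></html>";
        Document doc = parse(sample);
        System.out.println(links(doc));
        System.out.println(text(doc));
    }
}
```

Calling the server twice instead (once for XHTML, once for plain text) avoids stripping the markup yourself, at the cost of sending the document a second time.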


Nick
