Hello,

Tika already comes with a handler for collecting links: the LinkContentHandler [1]. Hyperlinks in PDFs are reported as anchors, so this handler picks them up. We use it to collect links from any file type as if they were all HTML files.
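To illustrate the idea, here is a minimal sketch that uses only the JDK's SAX classes, not Tika itself. It shows a handler that records the href attribute of every <a> element, which is roughly what a link-collecting handler does with the anchors Tika reports. The class name HrefCollector, the collect helper and the sample XHTML string are made up for this example. With Tika you would instead pass a LinkContentHandler to the parser's parse method and read the result from getLinks().

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; this is not a Tika class.
public class HrefCollector extends DefaultHandler {
    private final List<String> hrefs = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Only anchors are of interest; every other element is ignored.
        if ("a".equalsIgnoreCase(localName)) {
            String href = atts.getValue("href");
            if (href != null) {
                hrefs.add(href);
            }
        }
    }

    public List<String> getHrefs() {
        return hrefs;
    }

    // Parses an XHTML string and returns the href values in document order.
    public static List<String> collect(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // so localName is "a" even with the XHTML namespace
        HrefCollector handler = new HrefCollector();
        factory.newSAXParser().parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.getHrefs();
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample input, shaped like an XHTML document with one link.
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<p><a href=\"https://example.com/\">click here</a></p>"
                + "</body></html>";
        System.out.println(collect(xhtml)); // prints [https://example.com/]
    }
}
```

The same kind of handler should also work on XHTML output fetched from the Tika Server, provided that output is well-formed XML. When calling Tika from Java, a TeeContentHandler can forward the events to a text-extraction handler and a link handler in one pass.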
Regards,
Markus

[1] https://tika.apache.org/1.19/api/org/apache/tika/sax/LinkContentHandler.html

-----Original message-----
> From: Nick Burch <[email protected]>
> Sent: Thursday 12th November 2020 12:55
> To: nensick <[email protected]>
> Cc: [email protected]
> Subject: Re: Extract URLs from a document
>
> On Wed, 11 Nov 2020, nensick wrote:
> > I am exploring the available features and I managed also to extract
> > Office macros but I still don't find a way to get the links.
> >
> > Imagine to have a PDF, a DOCX in which you have a "click here" text as a
> > link pointing
> > to a website (let's say example[.]com). How can I get example[.].com?
>
> If you were calling the Java directly, it would be fairly easy - just
> provide your own content handler that only captures the <a> tags and
> records the href attributes of those. You can use the Tee content handler
> to have a normal text-extraction handler called as well as your
> link-capturing one
>
> From the Tika Server, it's not quite so simple. I'd probably just say ask
> the Tika Server for the xhtml version of your document (instead of the
> plain text one), then use the xml parsing in your calling language to grab
> the links from the a tags. Depending on your needs, either call the Tika
> Server twice, once for xhtml to get tags and once for plain text, or just
> once for xhtml and process the results twice
>
> Nick
>
