RE: Extract URLs from a document

Markus Jelsma Thu, 12 Nov 2020 04:00:08 -0800

Hello,

Tika already comes with a handler for collecting links: the 
LinkContentHandler [1]. Hyperlinks in PDFs are reported as anchors and can be 
picked up by this handler. We use it to collect links from any file type, as if 
they were all HTML files.


Regards,
Markus

[1] https://tika.apache.org/1.19/api/org/apache/tika/sax/LinkContentHandler.html
 
-----Original message-----
> From:Nick Burch <[email protected]>
> Sent: Thursday 12th November 2020 12:55
> To: nensick <[email protected]>
> Cc: [email protected]
> Subject: Re: Extract URLs from a document
> 
> On Wed, 11 Nov 2020, nensick wrote:
> > I am exploring the available features and I also managed to extract 
> > Office macros, but I still can't find a way to get the links.
> >
> > Imagine a PDF or a DOCX in which a "click here" text is a link 
> > pointing to a website (let's say example[.]com). How can I get 
> > example[.]com?
> 
> If you were calling the Java directly, it would be fairly easy - just 
> provide your own content handler that only captures the <a> tags and 
> records the href attributes of those. You can use the Tee content handler 
> to have a normal text-extraction handler called as well as your 
> link-capturing one.
> 
> From the Tika Server, it's not quite so simple. I'd probably just say ask 
> the Tika Server for the XHTML version of your document (instead of the 
> plain text one), then use the XML parsing in your calling language to grab 
> the links from the <a> tags. Depending on your needs, either call the Tika 
> Server twice, once for XHTML to get tags and once for plain text, or just 
> once for XHTML and process the results twice.
> 
> Nick
>
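
For illustration, the idea Nick describes in the quoted message (a content 
handler that only looks at <a> tags and records their href attributes) could 
be sketched roughly as below. This is only a sketch under assumptions: the 
class name HrefCollector and the method extract are made up for the example, 
and it uses nothing but the JDK's own SAX parser on an XHTML string, such as 
one obtained from the Tika Server. It assumes the XHTML is well-formed XML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: a SAX handler that records the href of every <a> element.
public class HrefCollector extends DefaultHandler {
    private final List<String> hrefs = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // A namespace-aware parser fills in localName; otherwise fall back to qName.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if (!"a".equalsIgnoreCase(name)) {
            return;
        }
        // Try the namespace-aware lookup first, then the plain qualified name.
        String href = atts.getValue("", "href");
        if (href == null) {
            href = atts.getValue("href");
        }
        if (href != null && !href.isEmpty()) {
            hrefs.add(href);
        }
    }

    public List<String> getHrefs() {
        return hrefs;
    }

    // Parses an XHTML string (assumed well-formed) and returns the hrefs in document order.
    public static List<String> extract(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            HrefCollector collector = new HrefCollector();
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), collector);
            return collector.getHrefs();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<p><a href=\"https://example.com\">click here</a></p>"
                + "</body></html>";
        System.out.println(extract(xhtml)); // prints [https://example.com]
    }
}
```

Because the handler is an ordinary org.xml.sax ContentHandler, the same class 
should in principle also be usable when calling Tika from Java directly, 
alongside a text-extraction handler via the Tee content handler, though that 
wiring is not shown here.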
