RE: script tags in LinkContentHandler

Joseph Naegele Tue, 05 Apr 2016 12:53:50 -0700

Thanks Ken,


I'm confused though. The LinkContentHandler in 1.12 now collects <a>, <link>, 
<iframe> and <img>, since https://issues.apache.org/jira/browse/TIKA-1835. In 
my opinion, <script src="…"> belongs in there with the rest of them. What do 
you think?

 

Joe

 

From: Ken Krugler [mailto:[email protected]] 
Sent: Tuesday, April 05, 2016 3:48 PM
To: [email protected]
Subject: Re: script tags in LinkContentHandler

 

Hi Joe,

 

On Apr 5, 2016, at 12:27pm, Joseph Naegele <[email protected]> wrote:

 

Hi all,

 

I'm using Nutch for crawling the web, and one of its built-in HTML parsers uses 
Tika and its LinkContentHandler. I'm interested in collecting *all* links on a 
web page, but I'm surprised the LinkContentHandler doesn't parse <script> tags 
as links. When a <script> tag contains the "src" attribute, the attribute 
should specify a URI and the tag should not contain any content.
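
For illustration only, here is a minimal sketch of the behaviour described above: a SAX handler that records the "src" attribute of every <script> element and ignores inline scripts. The class ScriptSrcCollector and its collect() helper are hypothetical names, not part of Tika and not how LinkContentHandler is implemented. The sketch uses only the JDK's SAX parser, so it assumes well-formed XHTML input; real-world HTML would need a tolerant parser in front of it.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (not a Tika class): collects the "src" URI of each <script> element.
class ScriptSrcCollector extends DefaultHandler {
    private final List<String> sources = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The default JDK SAX parser is not namespace-aware, so the tag name arrives in qName.
        if ("script".equalsIgnoreCase(qName)) {
            String src = atts.getValue("src");
            // Inline scripts have no "src" attribute and are skipped.
            if (src != null && !src.isEmpty()) {
                sources.add(src);
            }
        }
    }

    List<String> getSources() {
        return sources;
    }

    // Convenience helper: parses a well-formed XHTML string and returns the script URIs found.
    static List<String> collect(String xhtml) {
        ScriptSrcCollector handler = new ScriptSrcCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
        return handler.getSources();
    }
}
```

Calling ScriptSrcCollector.collect(...) on a page that has one external script, one inline script and one <a> link returns only the external script's URI; the same check on the element name and "src" attribute is what a link-collecting handler would need to add for <script>.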

 

Is there any particular reason the LinkContentHandler doesn't parse <script> 
tags, or is it just that I'm the first to look for this functionality? I can 
ping the dev mailing list too if necessary.

 

I don’t think there’s a specific reason it’s not included, though see my 
comment on https://issues.apache.org/jira/browse/TIKA-503

 

e.g. what about <link> elements?

 

— Ken





 

Nutch's other built-in HTML parser collects all "outlinks", including <script> 
tags, but I'd prefer to use Tika and Boilerpipe.

 

Thanks,

Joe Naegele

 

--------------------------

Ken Krugler

+1 530-210-6378

http://www.scaleunlimited.com

custom big data solutions & training

Hadoop, Cascading, Cassandra & Solr
