Re: script tags in LinkContentHandler

Ken Krugler Tue, 05 Apr 2016 12:49:13 -0700

Hi Joe,

> On Apr 5, 2016, at 12:27pm, Joseph Naegele wrote:
> 
> Hi all,
>  
> I'm using Nutch for crawling the web, and one of its built-in HTML parsers 
> uses Tika and its LinkContentHandler. I'm interested in collecting *all* 
> links on a web page, but I'm surprised the LinkContentHandler doesn't parse 
> <script> tags as links. When a <script> tag contains the "src" attribute, 
> the attribute should specify a URI and the tag should not contain any content.
>  
> Is there any particular reason the LinkContentHandler doesn't parse <script> 
> tags, or is it just that I'm the first to look for this functionality? I can 
> ping the dev mailing list too if necessary.


I don’t think there’s a specific reason it’s not included, though see my 
comment on https://issues.apache.org/jira/browse/TIKA-503

e.g. what about <link> elements?
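
As a rough illustration of the idea (not Tika's implementation), here is a minimal sketch of a SAX handler that records the "src" of <script> elements and the "href" of <link> elements. The class name ScriptLinkCollector and its collect() helper are hypothetical, and the sketch uses the JDK's own SAX parser on well-formed XHTML rather than Tika. Whether Tika's HTML parser actually forwards <script> elements to a downstream handler is an assumption I have not verified here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for illustration only; it is not part of Tika.
class ScriptLinkCollector extends DefaultHandler {
    private final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // A non-namespace-aware parser may leave localName empty, so fall back to qName.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;

        // Pick the attribute that carries the URI for each element of interest.
        String attr = null;
        if ("script".equalsIgnoreCase(name)) {
            attr = "src";
        } else if ("link".equalsIgnoreCase(name)) {
            attr = "href";
        }

        if (attr != null) {
            String value = atts.getValue(attr);
            // Inline scripts have no "src", so they contribute nothing.
            if (value != null && !value.isEmpty()) {
                links.add(value);
            }
        }
    }

    List<String> getLinks() {
        return links;
    }

    // Convenience helper: parse a well-formed XHTML string and return the collected URIs.
    static List<String> collect(String xhtml) throws Exception {
        ScriptLinkCollector handler = new ScriptLinkCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getLinks();
    }
}
```

Calling ScriptLinkCollector.collect() on a page with one external script, one stylesheet link and one inline script returns only the two URIs, in document order. Since the class is an ordinary SAX ContentHandler, the same startElement logic could in principle be moved into a handler used with Tika.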

— Ken

>  
> Nutch's other built-in HTML parser collects all "outlinks", including 
> <script> tags, but I'd prefer to use Tika and Boilerpipe.
>  
> Thanks,
> Joe Naegele

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
