Hi Joe,

> On Apr 5, 2016, at 12:27pm, Joseph Naegele <[email protected]> wrote:
>
> Hi all,
>
> I'm using Nutch for crawling the web, and one of its built-in HTML parsers
> uses Tika and its LinkContentHandler. I'm interested in collecting *all*
> links on a web page, but I'm surprised the LinkContentHandler doesn't parse
> <script> tags as links. When a <script> tag contains the "src" attribute,
> the attribute should specify a URI and the tag should not contain any content.
>
> Is there any particular reason the LinkContentHandler doesn't parse <script>
> tags, or is it just that I'm the first to look for this functionality? I can
> ping the dev mailing list too if necessary.

I don’t think there’s a specific reason it’s not included, though see my comment on https://issues.apache.org/jira/browse/TIKA-503, e.g. what about <link> elements?

-- Ken

> Nutch's other built-in HTML parser collects all "outlinks", including
> <script> tags, but I'd prefer to use Tika and Boilerpipe.
>
> Thanks,
> Joe Naegele

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
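
To illustrate the idea discussed in this thread, here is a minimal sketch of a SAX handler that collects the "src" of <script> elements and the "href" of <link> elements. It is not Tika's LinkContentHandler and not a patch for it: the class name ScriptLinkCollector and the helper collect() are hypothetical, and the JDK's own SAX parser stands in for the XHTML event stream that Tika would produce. Whether Tika's HTML parser actually forwards <script> and <link> events to a handler depends on its configuration, which this sketch does not cover.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (not part of Tika): records script "src" and link "href" values.
public class ScriptLinkCollector extends DefaultHandler {
    private final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Non-namespace-aware parsers leave localName empty, so fall back to qName.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;

        String attr = null;
        if ("script".equalsIgnoreCase(name)) {
            attr = "src";
        } else if ("link".equalsIgnoreCase(name)) {
            attr = "href";
        }

        if (attr != null) {
            String value = atts.getValue(attr);
            // Inline scripts have no "src" attribute and are skipped.
            if (value != null && !value.isEmpty()) {
                links.add(value);
            }
        }
    }

    public List<String> getLinks() {
        return links;
    }

    // Assumption for this sketch: the input is well-formed XHTML, parsed with the JDK SAX parser.
    public static List<String> collect(String xhtml) throws Exception {
        ScriptLinkCollector handler = new ScriptLinkCollector();
        SAXParserFactory.newInstance()
                .newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getLinks();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head>"
                + "<link rel=\"stylesheet\" href=\"style.css\"/>"
                + "<script src=\"app.js\"></script>"
                + "</head><body>"
                + "<script>var x = 1;</script>"
                + "<a href=\"page.html\">x</a>"
                + "</body></html>";
        System.out.println(collect(page)); // prints [style.css, app.js]
    }
}
```

The <a> element is deliberately ignored here, since anchors are the case the thread says is already handled; only the two element types raised above are collected.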
