Thanks Ken,
I'm confused though. The LinkContentHandler in 1.12 now collects <a>, <link>, <iframe> and <img> since https://issues.apache.org/jira/browse/TIKA-1835. In my opinion, <script src="…"> belongs in there with the rest of them. What do you think?

Joe

From: Ken Krugler [mailto:[email protected]]
Sent: Tuesday, April 05, 2016 3:48 PM
To: [email protected]
Subject: Re: script tags in LinkContentHandler

Hi Joe,

On Apr 5, 2016, at 12:27pm, Joseph Naegele <[email protected]> wrote:

> Hi all,
>
> I'm using Nutch for crawling the web, and one of its built-in HTML parsers uses Tika and its LinkContentHandler. I'm interested in collecting *all* links on a web page, but I'm surprised the LinkContentHandler doesn't parse <script> tags as links. When a <script> tag contains the "src" attribute, the attribute should specify a URI and the tag should not contain any content. Is there any particular reason the LinkContentHandler doesn't parse <script> tags, or is it just that I'm the first to look for this functionality? I can ping the dev mailing list too if necessary.

I don’t think there’s a specific reason it’s not included, though see my comment on https://issues.apache.org/jira/browse/TIKA-503, e.g. what about <link> elements?

-- Ken

> Nutch's other built-in HTML parser collects all "outlinks", including <script> tags, but I'd prefer to use Tika and Boilerpipe.
>
> Thanks,
> Joe Naegele

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
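
For illustration only, the behaviour discussed in this thread can be sketched outside Tika. The snippet below uses Python's standard html.parser, not Tika's LinkContentHandler. The names LinkCollector, LINK_ATTRS and collect_links are hypothetical, and the element-to-attribute table is an assumption based on the elements named above, with <script src> added as proposed.

```python
from html.parser import HTMLParser

# Hypothetical table: element name -> attribute assumed to carry the URI.
# The first four entries mirror the elements named in the thread;
# "script" is the addition being proposed.
LINK_ATTRS = {
    "a": "href",
    "link": "href",
    "iframe": "src",
    "img": "src",
    "script": "src",
}


class LinkCollector(HTMLParser):
    """Collects (element, uri) pairs for the elements in LINK_ATTRS."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag and attribute names before calling us.
        attr_name = LINK_ATTRS.get(tag)
        if attr_name is None:
            return
        for name, value in attrs:
            # Skip missing or empty values, e.g. an inline <script> with no src.
            if name == attr_name and value:
                self.links.append((tag, value))
                break


def collect_links(html):
    """Parse an HTML string and return the collected (element, uri) pairs."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

An inline <script> without a "src" attribute contributes nothing here, which matches the point above that only a <script> carrying "src" names a URI.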
