Hello! Yes, please open a ticket for it. As for #2: in Nutch you can instruct the Tika parser to use a different HtmlMapper. Use IdentityHtmlMapper! I forgot the name of the property, but you can look it up in TikaParser.java; it is near the bottom. The default mapper is indeed a poor choice if you want to grab stuff from normal elements.
M.

-----Original message-----
> From: Joseph Naegele <[email protected]>
> Sent: Wednesday 6th April 2016 22:13
> To: [email protected]
> Subject: RE: script tags in LinkContentHandler
>
> Great, sounds good. Would you like me to open a ticket?
>
> With respect to parsing outlinks in Nutch, there's actually two problems:
>
> 1) <script> missing in LinkContentHandler
> 2) HtmlParser's DefaultHtmlMapper considers <script> a discardable element so
> it's discarded during the parse, similarly to <style>.
>
> Does anyone have opinions on #2?
>
> - Joe
>
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Wednesday, April 06, 2016 9:26 AM
> To: [email protected]
> Subject: RE: script tags in LinkContentHandler
>
> Yes indeed! Script is missing and that's a mistake. See discussion at
> TIKA-1835. We should open a new ticket for it.
> Markus
>
> -----Original message-----
> > From: Ken Krugler <[email protected]>
> > Sent: Tuesday 5th April 2016 22:24
> > To: [email protected]
> > Subject: Re: script tags in LinkContentHandler
> >
> > Hi Joe,
> >
> > I was looking at the version of this file in the (git) Tika-2.0 branch,
> > not the (svn) trunk, and that change isn’t yet in 2.0 - my mistake.
> >
> > I’d rolled in Markus’s patch directly to support these other link types,
> > but I wish I’d remembered the old TIKA-503 discussion, as it would have
> > been better to make that support conditional on using a different
> > constructor, as it’s usually not a good idea to surprise consumers of
> > parse output with new types of data (links).
> >
> > I’ll take this discussion over to TIKA-1835 now.
> >
> > -- Ken
> >
> > On Apr 5, 2016, at 12:53pm, Joseph Naegele <[email protected]> wrote:
> >
> > Thanks Ken, I'm confused though. The LinkContentHandler in 1.12 now
> > collects <a>, <link>, <iframe> and <img>, since
> > https://issues.apache.org/jira/browse/TIKA-1835. In my opinion,
> > <script src="…"> belongs in there with the rest of them. What do you
> > think?
> >
> > Joe
> >
> > From: Ken Krugler [mailto:[email protected]]
> > Sent: Tuesday, April 05, 2016 3:48 PM
> > To: [email protected]
> > Subject: Re: script tags in LinkContentHandler
> >
> > Hi Joe,
> >
> > On Apr 5, 2016, at 12:27pm, Joseph Naegele <[email protected]> wrote:
> >
> > Hi all,
> >
> > I'm using Nutch for crawling the web, and one of its built-in HTML
> > parsers uses Tika and its LinkContentHandler. I'm interested in
> > collecting *all* links on a web page, but I'm surprised the
> > LinkContentHandler doesn't parse <script> tags as links. When a <script>
> > tag contains the "src" attribute, the attribute should specify a URI and
> > the tag should not contain any content.
> >
> > Is there any particular reason the LinkContentHandler doesn't parse
> > <script> tags, or is it just that I'm the first to look for this
> > functionality? I can ping the dev mailing list too if necessary.
> >
> > I don’t think there’s a specific reason it’s not included, though see my
> > comment on https://issues.apache.org/jira/browse/TIKA-503, e.g. what
> > about <link> elements?
> >
> > -- Ken
> >
> > Nutch's other built-in HTML parser collects all "outlinks", including
> > <script> tags, but I'd prefer to use Tika and Boilerpipe.
> >
> > Thanks,
> >
> > Joe Naegele
> >
> > ----------------
> > Ken Krugler
> > +1 530-210-6378
> > http://www.scaleunlimited.com
> > custom big data solutions & training
> > Hadoop, Cascading, Cassandra & Solr
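
For readers following the thread, here is a minimal sketch of the behaviour Joe is asking for: a SAX handler that records link targets from <a>, <link>, <iframe>, <img> and also <script src="...">. This is not Tika's LinkContentHandler and does not use Tika at all. The class names ScriptLinkSketch and SimpleLinkHandler and the "element target" output format are made up for illustration, and the JDK's XML parser on a small well-formed XHTML string stands in for the XHTML SAX events an HTML parser would emit.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class ScriptLinkSketch {

    /** Hypothetical handler for illustration; not Tika's LinkContentHandler. */
    static class SimpleLinkHandler extends DefaultHandler {
        // Element name -> attribute that carries the link target.
        private static final Map<String, String> LINK_ATTRS = Map.of(
                "a", "href",
                "link", "href",
                "iframe", "src",
                "img", "src",
                "script", "src");

        final List<String> links = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            String element = qName.toLowerCase(Locale.ROOT);
            String attr = LINK_ATTRS.get(element);
            if (attr == null) {
                return; // not an element we treat as a link
            }
            String target = atts.getValue(attr);
            // An inline <script> without "src" carries no link, so skip it.
            if (target != null && !target.isEmpty()) {
                links.add(element + " " + target);
            }
        }
    }

    /** Parses well-formed XHTML and returns "element target" entries in document order. */
    public static List<String> collectLinks(String xhtml) throws Exception {
        SimpleLinkHandler handler = new SimpleLinkHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.links;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head>"
                + "<link rel=\"stylesheet\" href=\"style.css\"/>"
                + "<script src=\"app.js\"></script>"
                + "<script>var inline = 1;</script>"
                + "</head><body>"
                + "<a href=\"next.html\">next</a>"
                + "<img src=\"logo.png\"/>"
                + "</body></html>";
        for (String link : collectLinks(page)) {
            System.out.println(link);
        }
    }
}
```

Note that a handler like this only sees <script> if the element reaches it in the first place, which is why problem #2 in the thread (the mapper discarding <script> during the parse) matters as much as problem #1.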
