Re: script tags in LinkContentHandler

Luís Filipe Nassif Wed, 06 Apr 2016 14:22:10 -0700

Hi,

I'm one of those from the forensics world and, of course, my use case needs
to extract everything.


I have already tried IdentityHtmlMapper to extract the "value" attributes of
"input" elements, with no luck. The attribute is not extracted by
DefaultHtmlMapper, yet it is rendered by browsers, so I think
DefaultHtmlMapper needs some improvement. But is HtmlMapper the correct place
to configure that, or must something be done with HTMLSchema (I've tried that
too, but I am not an HTML expert)?

Thanks,
Luis

2016-04-06 17:33 GMT-03:00 Allison, Timothy B. <[email protected]>:

> On #2, I'd prefer not skipping elements.  I definitely understand the use
> case to extract what a human can see, but I suspect if your email address
> ends in 'forensics.com', you'd probably like to see everything as well.
>
> -----Original Message-----
> From: Joseph Naegele [mailto:[email protected]]
> Sent: Wednesday, April 06, 2016 4:14 PM
> To: [email protected]
> Subject: RE: script tags in LinkContentHandler
>
> Great, sounds good. Would you like me to open a ticket?
>
> With respect to parsing outlinks in Nutch, there are actually two problems:
>
> 1) <script> missing in LinkContentHandler
> 2) HtmlParser's DefaultHtmlMapper considers <script> a discardable element
> so it's discarded during the parse, similarly to <style>.
>
> Does anyone have opinions on #2?
>
> - Joe
>
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Wednesday, April 06, 2016 9:26 AM
> To: [email protected]
> Subject: RE: script tags in LinkContentHandler
>
> Yes indeed! Script is missing and that's a mistake. See discussion at
> TIKA-1835. We should open a new ticket for it.
> Markus
>
>
>
> -----Original message-----
> > From: Ken Krugler <[email protected]>
> > Sent: Tuesday 5th April 2016 22:24
> > To: [email protected]
> > Subject: Re: script tags in LinkContentHandler
> >
> > Hi Joe,
> >
> > I was looking at the version of this file in the (git) Tika-2.0 branch,
> > not the (svn) trunk, and that change isn’t yet in 2.0 - my mistake.
> >
> > I’d rolled in Markus’s patch directly to support these other link types,
> > but I wish I’d remembered the old TIKA-503 discussion, as it would have
> > been better to make that support conditional on using a different
> > constructor, as it’s usually not a good idea to surprise consumers of
> > parse output with new types of data (links).
> >
> > I’ll take this discussion over to TIKA-1835 now.
> >
> > -- Ken
> >
> > On Apr 5, 2016, at 12:53pm, Joseph Naegele <[email protected]> wrote:
> >
> > Thanks Ken, I'm confused though. The LinkContentHandler in 1.12 now
> > collects <a>, <link>, <iframe> and <img>, since
> > https://issues.apache.org/jira/browse/TIKA-1835. In my opinion,
> > <script src="…"> belongs in there with the rest of them. What do you
> > think?
> >
> > Joe
> >
> > From: Ken Krugler [mailto:[email protected]]
> > Sent: Tuesday, April 05, 2016 3:48 PM
> > To: [email protected]
> > Subject: Re: script tags in LinkContentHandler
> >
> > Hi Joe,
> >
> > On Apr 5, 2016, at 12:27pm, Joseph Naegele <[email protected]> wrote:
> > Hi all,
> >
> > I'm using Nutch for crawling the web, and one of its built-in HTML
> > parsers uses Tika and its LinkContentHandler. I'm interested in
> > collecting *all* links on a web page, but I'm surprised the
> > LinkContentHandler doesn't parse <script> tags as links. When a <script>
> > tag contains the "src" attribute, the attribute should specify a URI and
> > the tag should not contain any content.
> >
> > Is there any particular reason the LinkContentHandler doesn't parse
> > <script> tags, or is it just that I'm the first to look for this
> > functionality? I can ping the dev mailing list too if necessary.
> >
> > I don’t think there’s a specific reason it’s not included, though see my
> > comment on https://issues.apache.org/jira/browse/TIKA-503, e.g. what
> > about <link> elements?
> >
> > -- Ken
> >
> > Nutch's other built-in HTML parser collects all "outlinks", including
> > <script> tags, but I'd prefer to use Tika and Boilerpipe.
> >
> > Thanks,
> > Joe Naegele
> > ----------------
> > Ken Krugler
> > +1 530-210-6378
> > http://www.scaleunlimited.com
> > custom big data solutions & training
> > Hadoop, Cascading, Cassandra & Solr
>
>
>
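
For readers following the thread, the behavior Joe asks for (treating
<script src="..."> as a link alongside <a>, <link>, <iframe> and <img>) can be
sketched with a plain SAX handler from the JDK. This is only an illustration
under stated assumptions: the class name ScriptAwareLinkCollector and its
collect() helper are hypothetical, it is not Tika's LinkContentHandler, and it
parses well-formed XHTML with the JDK's XML parser rather than real-world HTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical link collector; a stand-in for illustration, not Tika code. */
public class ScriptAwareLinkCollector extends DefaultHandler {

    // Element name -> attribute that carries the link target.
    // The "script" entry is the addition discussed in the thread.
    private static final Map<String, String> LINK_ATTRIBUTES = Map.of(
            "a", "href",
            "link", "href",
            "iframe", "src",
            "img", "src",
            "script", "src");

    private final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The default JDK parser is not namespace aware, so localName is empty
        // and the element name arrives in qName.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        String attribute = LINK_ATTRIBUTES.get(name.toLowerCase(Locale.ROOT));
        if (attribute == null) {
            return; // not an element we treat as a link
        }
        String target = atts.getValue(attribute);
        if (target != null && !target.isEmpty()) {
            links.add(target); // inline scripts without "src" are skipped
        }
    }

    public List<String> getLinks() {
        return links;
    }

    /** Parses a well-formed XHTML string and returns link targets in document order. */
    public static List<String> collect(String xhtml) throws Exception {
        ScriptAwareLinkCollector handler = new ScriptAwareLinkCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getLinks();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head><script src=\"app.js\"></script>"
                + "<script>var inline = 1;</script></head>"
                + "<body><a href=\"next.html\">next</a><img src=\"logo.png\"/></body></html>";
        System.out.println(collect(page)); // prints [app.js, next.html, logo.png]
    }
}
```

Note that a handler like this only sees <script> elements if the upstream
parser passes them through, which is the second problem Joe lists above.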
