You could

1) exclude links to *.js documents with a URL filter, e.g. by adding
   this rule to regex-urlfilter.txt:

# exclude JavaScript
-\.js$

2) exclude outlinks from "link" and "script" elements in general by adding
both tags to the value of this property (shown here with its default,
empty value):

<property>
  <name>parser.html.outlinks.ignore_tags</name>
  <value></value>
  <description>Comma separated list of HTML tags, from which outlinks
  shouldn't be extracted. Nutch takes links from: a, area, form, frame,
  iframe, script, link, img. If you add any of those tags here, it
  won't be taken. Default is empty list. Probably reasonable value
  for most people would be "img,script,link".</description>
</property>
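
For example, an override could look like the sketch below. It assumes
you put site-specific settings in conf/nutch-site.xml (inside the
<configuration> element) and that you still want links from "img"
elements; use "img,script,link" instead if you don't.

<!-- assumed override: skip outlinks from script and link elements only -->
<property>
  <name>parser.html.outlinks.ignore_tags</name>
  <value>script,link</value>
</property>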


On 04/10/2012 11:36 PM, SUJIT PAL wrote:
Hi all,

This is for Nutch trunk version.

During the parse phase, is it possible to suppress JavaScript outlinks by 
setting a configuration parameter? If so, what would the parameter be?

Thanks very much,
Sujit


