Re: Limiting outlink tags.

2007-09-07 Thread Doğacan Güney
Hi Marcin,

On 9/7/07, Marcin Okraszewski [EMAIL PROTECTED] wrote:
 Hi,
 I have noticed that Nutch considers img/@src an outlink. I suppose in many 
 cases people do not want to treat an image as an outlink; at least I don't. 
 The same goes for script/@src. However, there seems to be no way to limit 
 the outlink tags. DOMContentUtils.getOutlinks() takes links from all of 
 a, area, form, frame, iframe, script, link, img. Only the form element can be 
 turned off, via the parser.html.form.use_action parameter.

 I would suggest introducing a new configuration parameter to turn certain 
 elements on or off. This could be done with a single parameter containing a 
 comma-separated list of tags to be turned off.

 What is your opinion? If you think it is a valid issue I can make a patch for 
 this.

There is already NUTCH-488 open for this (with a patch). Feel free to
add comments/patches/etc. there. Btw, I agree that using a CSV is
better than using a new configuration parameter for every tag.


 Regards,
 Marcin




-- 
Doğacan Güney


Limiting outlink tags.

2007-09-06 Thread Marcin Okraszewski
Hi,
I have noticed that Nutch considers img/@src an outlink. I suppose in many 
cases people do not want to treat an image as an outlink; at least I don't. 
The same goes for script/@src. However, there seems to be no way to limit 
the outlink tags. DOMContentUtils.getOutlinks() takes links from all of 
a, area, form, frame, iframe, script, link, img. Only the form element can be 
turned off, via the parser.html.form.use_action parameter.

I would suggest introducing a new configuration parameter to turn certain 
elements on or off. This could be done with a single parameter containing a 
comma-separated list of tags to be turned off.
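
To make the proposal concrete, here is a minimal sketch of how such a 
comma-separated value might be turned into the set of tags that are still 
followed. It is plain Java and does not touch Nutch itself: the class name 
OutlinkTagFilter, its methods, and the sample value "img, script" are all 
made up for illustration, and this is not the patch attached to any issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch: shows only the CSV handling.
public class OutlinkTagFilter {
    // The tags named in this message as outlink sources.
    static final List<String> ALL_TAGS = Arrays.asList(
        "a", "area", "form", "frame", "iframe", "script", "link", "img");

    // Split the comma-separated value, trimming blanks and ignoring case.
    static Set<String> parseIgnoredTags(String csv) {
        Set<String> ignored = new HashSet<>();
        if (csv == null) {
            return ignored;
        }
        for (String tag : csv.split(",")) {
            String t = tag.trim().toLowerCase(Locale.ROOT);
            if (!t.isEmpty()) {
                ignored.add(t);
            }
        }
        return ignored;
    }

    // Return the tags that would still yield outlinks, in their usual order.
    static List<String> activeTags(String csv) {
        Set<String> ignored = parseIgnoredTags(csv);
        List<String> active = new ArrayList<>();
        for (String tag : ALL_TAGS) {
            if (!ignored.contains(tag)) {
                active.add(tag);
            }
        }
        return active;
    }

    public static void main(String[] args) {
        // Assumed sample value of the proposed parameter.
        System.out.println(activeTags("img, script"));
        // prints [a, area, form, frame, iframe, link]
    }
}
```

An empty or missing value leaves every tag enabled, so the current behaviour 
would stay the default.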

What is your opinion? If you think it is a valid issue I can make a patch for 
this.

Regards,
Marcin