Hi,

Use a regex URL filter to exclude those URLs and prevent them from being
crawled again.
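
As a rough sketch, assuming the changing segment is always a run of
lowercase letters and digits in the first path position (the pattern and
the 10-20 character length are guesses based on your example URL), you
could add a deny rule to conf/regex-urlfilter.txt, e.g.:

  # skip URLs whose first path segment looks like a session-style token
  -^http://www\.example\.com/[0-9a-z]{10,20}/.*\.aspx$

Rules are applied in order and the first match wins, so the '-' line has
to come before the final '+.' accept-everything rule. Adjust the host and
pattern to match your site's actual URL structure.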

Cheers 
 
-----Original message-----
> From: devang pandey <[email protected]>
> Sent: Wednesday 10th July 2013 10:29
> To: [email protected]
> Subject: nutch crawling issues
> 
> I have a website, e.g. www.example.com. When I crawl it using
> Nutch 1.4, the problem is duplicate crawling. There are a number of
> pages like www.example.com/s38r84rejkfndn/xyz.aspx, and the segment
> s38r84rejkfndn keeps changing every time you visit the page, so the
> crawler crawls it again and again because, I think, to Nutch it looks
> like a new URL every time. Please suggest how I can overcome this issue.
> 
