Hey Markus, but if I specify a regex then those URLs won't be crawled at all. I don't want that; all I want is to crawl them only once.
On Wed, Jul 10, 2013 at 3:23 PM, Markus Jelsma <[email protected]> wrote:

> Hi - conf/regex-url-filter.txt, and make sure urlfilter-regex is
> enabled in your nutch-site plugin.includes config.
>
> -----Original message-----
> > From: devang pandey <[email protected]>
> > Sent: Wednesday 10th July 2013 11:51
> > To: [email protected]
> > Subject: Re: nutch crawling issues
> >
> > Hello Markus, I have one point of confusion: should I implement the
> > changes in the crawl-url filter or the regex filter?
> >
> > On Wed, Jul 10, 2013 at 3:12 PM, Markus Jelsma
> > <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > Use a regex URL filter to filter those URLs and prevent them from
> > > being crawled again.
> > >
> > > Cheers
> > >
> > > -----Original message-----
> > > > From: devang pandey <[email protected]>
> > > > Sent: Wednesday 10th July 2013 10:29
> > > > To: [email protected]
> > > > Subject: nutch crawling issues
> > > >
> > > > I have a website, e.g. www.example.com. When I crawl it with
> > > > Nutch 1.4, the problem is duplicated crawling. There are a number
> > > > of pages like www.example.com/s38r84rejkfndn/xyz.aspx, and the
> > > > segment s38r84rejkfndn changes every time you visit the page, so
> > > > the crawler crawls it again and again; to Nutch it must look like
> > > > a new URL every time. Please suggest how I can overcome this
> > > > issue.
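A URL *filter*, as the top reply points out, would exclude these URLs from the crawl entirely. To fetch each page exactly once, a URL *normalizer* is the alternative technique: Nutch's urlnormalizer-regex plugin (configured in conf/regex-normalize.xml) can rewrite every variant of the URL to one canonical form, so the crawler treats them as the same page. A minimal sketch of such a rule, assuming the changing token is a 14-character lowercase alphanumeric path segment directly after the host (and assuming the site still serves the page when that segment is dropped — worth verifying first):

```
<?xml version="1.0"?>
<regex-normalize>
  <!-- Collapse the per-visit token segment, e.g.
       www.example.com/s38r84rejkfndn/xyz.aspx -> www.example.com/xyz.aspx.
       The {14} length and [a-z0-9] character class are assumptions taken
       from the single example URL in the thread; adjust to the real pattern. -->
  <regex>
    <pattern>^(https?://www\.example\.com)/[a-z0-9]{14}/</pattern>
    <substitution>$1/</substitution>
  </regex>
</regex-normalize>
```

For this to take effect, urlnormalizer-regex must be listed in the plugin.includes property (it is in the default value shipped with Nutch), and existing entries in the CrawlDb that were fetched under the old token URLs will still need to be filtered or expired once.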

