Hi - am I missing something? You want Nutch to follow outlinks on specific pages only, not on domains or hosts as a whole? If the latter, the domain URL filter is a good solution; a minimal configuration sketch is below. In the case of the former, you would have to come up with an elaborate set-up of scripts, swapping URL filter configurations between crawls. URL filters alone won't do it: they operate on URLs only, with no metadata and no origin URL.
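For the whole-domain/host case this is configuration rather than code. A minimal sketch, assuming the urlfilter-domain and urlfilter-regex plugins are enabled via plugin.includes in nutch-site.xml; the host names and patterns below are examples only, not taken from this thread:

    conf/domain-urlfilter.txt  (urlfilter-domain: one domain or host per line, everything else is rejected)

        example.org
        news.example.com

    conf/regex-urlfilter.txt  (urlfilter-regex: first matching +/- rule wins)

        # skip common static resources
        -\.(gif|jpg|png|css|js)$
        # accept example.org and its subdomains
        +^https?://([a-z0-9-]+\.)*example\.org/
        # reject everything else
        -.

The domain filter keeps the crawl inside the listed hosts; the regex filter can then prune within those hosts. But note that neither helps with "only the links found on page A/B/C", which is the point below.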
I have thought about this before: URL filters should get more context when it is available, such as CrawlDatum metadata and the origin URL. It would make things like this much easier to build, and it would also make it easier to detect spider traps. A rough sketch of what such an interface could look like follows the quoted message below.

Markus

-----Original message-----
> From: Junqiang Zhang <[email protected]>
> Sent: Monday 7th March 2016 17:36
> To: [email protected]
> Subject: Re: [MASSMAIL] How to set up Nutch to only crawl links on designated web pages repeatedly?
>
> Hello Eyeris,
>
> Thank you very much for your suggestion. Sorry for my late reply.
>
> Using the URL filter plugins is a good option, and I am doing this for my
> current crawling task. However, using URL filters is not exactly what
> I want. I feel there should be a better way to restrict Nutch to crawl
> only the links on designated web pages. Currently, Nutch may not
> provide such a feature.
>
> Best,
> Junqiang
>
> On Sun, Jan 31, 2016 at 9:26 PM, Eyeris Rodriguez Rueda <[email protected]> wrote:
> > Hello Jun.
> > Maybe you can use Nutch's URL filter plugins. These plugins are used to
> > filter or restrict the visiting of links.
> > Please, I need more details about your situation.
> >
> > 1 - How are the links to visit selected on your pages (A, B, C)? Is there some
> > pattern, subdomain, or keyword in the links' URLs?
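To make that idea concrete, here is a rough, hypothetical sketch of a context-aware filter interface. Nothing like this exists in Nutch today (the real URLFilter.filter(String) sees only the URL string); the names ContextAwareURLFilter, Context and filter(url, context) are made up for illustration:

    // Hypothetical sketch only - not part of Nutch.
    public interface ContextAwareURLFilter {

      /** Extra context a filter could use besides the URL itself. */
      class Context {
        public final String originUrl;                          // page the link was found on
        public final org.apache.nutch.crawl.CrawlDatum datum;   // status, fetch metadata, ...

        public Context(String originUrl, org.apache.nutch.crawl.CrawlDatum datum) {
          this.originUrl = originUrl;
          this.datum = datum;
        }
      }

      /** Return the URL to keep it, or null to filter it out. */
      String filter(String url, Context context);
    }

With something like this, "only follow outlinks found on pages A, B and C" would be a one-line check on context.originUrl instead of a scripted juggling of filter configurations, and a trap detector could look at the datum as well.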

