Hi - am I missing something? You want Nutch to follow outlinks on specific pages only, not on domains or hosts as a whole? If the latter, the domain URL filter is a good solution; a minimal configuration sketch is below. In the case of the former, you would have to come up with an elaborate set-up of scripts, swapping URL filter configurations between crawls. URL filters alone won't do it: they operate on URLs only, with no metadata and no origin URL.
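For the whole-domain/host case this is configuration rather than code. A minimal sketch, assuming the urlfilter-domain and urlfilter-regex plugins are enabled via plugin.includes in nutch-site.xml; the host names and patterns below are examples only, not taken from this thread:

    conf/domain-urlfilter.txt  (urlfilter-domain: one domain or host per line, everything else is rejected)

        example.org
        news.example.com

    conf/regex-urlfilter.txt  (urlfilter-regex: first matching +/- rule wins)

        # skip common static resources
        -\.(gif|jpg|png|css|js)$
        # accept example.org and its subdomains
        +^https?://([a-z0-9-]+\.)*example\.org/
        # reject everything else
        -.

The domain filter keeps the crawl inside the listed hosts; the regex filter can then prune within those hosts. But note that neither helps with "only the links found on page A/B/C", which is the point below.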
I have thought about this before: URL filters should get more context when it is available, such as CrawlDatum metadata and the origin URL. It would make things like this much easier to build, and it would also make it easier to detect spider traps. A rough sketch of what such an interface could look like follows the quoted message below.

Markus

-----Original message-----
> From: Junqiang Zhang <[email protected]>
> Sent: Monday 7th March 2016 17:36
> To: [email protected]
> Subject: Re: [MASSMAIL] How to set up Nutch to only crawl links on designated web pages repeatedly?
>
> Hello Eyeris,
>
> Thank you very much for your suggestion. Sorry for my late reply.
>
> Using the URL filter plugins is a good option, and I am doing this for my
> current crawling task. However, using URL filters is not exactly what
> I want. I feel there should be a better way to restrict Nutch to crawl
> only the links on designated web pages. Currently, Nutch may not
> provide such a feature.
>
> Best,
> Junqiang
>
> On Sun, Jan 31, 2016 at 9:26 PM, Eyeris Rodriguez Rueda <[email protected]> wrote:
> > Hello Jun.
> > Maybe you can use Nutch's URL filter plugins. These plugins are used to
> > filter or restrict the visiting of links.
> > Please, I need more details about your situation.
> >
> > 1 - How are the links to visit selected on your pages (A, B, C)? Is there some
> > pattern, subdomain, or keyword in the links' URLs?
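To make that idea concrete, here is a rough, hypothetical sketch of a context-aware filter interface. Nothing like this exists in Nutch today (the real URLFilter.filter(String) sees only the URL string); the names ContextAwareURLFilter, Context and filter(url, context) are made up for illustration:

    // Hypothetical sketch only - not part of Nutch.
    public interface ContextAwareURLFilter {

      /** Extra context a filter could use besides the URL itself. */
      class Context {
        public final String originUrl;                          // page the link was found on
        public final org.apache.nutch.crawl.CrawlDatum datum;   // status, fetch metadata, ...

        public Context(String originUrl, org.apache.nutch.crawl.CrawlDatum datum) {
          this.originUrl = originUrl;
          this.datum = datum;
        }
      }

      /** Return the URL to keep it, or null to filter it out. */
      String filter(String url, Context context);
    }

With something like this, "only follow outlinks found on pages A, B and C" would be a one-line check on context.originUrl instead of a scripted juggling of filter configurations, and a trap detector could look at the datum as well.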

