Hi, I still have one question:

Suppose I have seed URLs from two domains (abc.com and abcd.com), and there are
external links pointing from pages on abc.com to abcd.com. If I set
"db.ignore.external.links" to "true", would all of those URLs pointing to
abcd.com be ignored? Furthermore, what happens when the seed URLs themselves
include abcd.com?

-----Original Message-----
From: Jean-Francois Gingras [mailto:[email protected]] 
Sent: Sunday, May 15, 2011 5:55 AM
To: [email protected]
Subject: Re: how to force nutch to crawl specific urls?

Hi,

Just to add on top of what Gabriele and Luis said, you may want to look at 
"db.ignore.external.links". If you have a large seed list, 
crawl-urlfilter and regex-urlfilter can become quite a pain to maintain and 
may have a performance impact (am I right?). If you don't want to add new 
links at all, even from the same host, then you should take a look at 
"db.update.additions.allowed".
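
For reference, the two properties mentioned above would be set in nutch-site.xml roughly like this (a sketch only; the property names come from this thread, and the values shown are the ones being discussed, not defaults you should blindly copy):

```xml
<!-- nutch-site.xml (sketch): overrides for the properties discussed above -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading to a different host are
  discarded, so links from abc.com pages to abcd.com would be
  dropped rather than added to the crawl.</description>
</property>
<property>
  <name>db.update.additions.allowed</name>
  <value>false</value>
  <description>If false, the updatedb step adds no newly discovered
  URLs at all, so only the injected seed URLs are ever
  fetched.</description>
</property>
```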

-----Original Message----- 
From: Luis Cappa Banda
Sent: Saturday, May 14, 2011 8:18 AM
To: [email protected]
Subject: Re: how to force nutch to crawl specific urls?

Hello.

As Gabriele said before, you should specify the list of URLs that you'll use
for crawling/fetching. You can enter a specific URL, or a domain URL
pointing to, for example, a particular HTML or PDF file. Of course, you can put
several URLs in a list, not only one. In any case, be careful with the
crawl-urlfilter.txt config file: if you have configured it before, maybe the
patterns you chose won't be applicable to your new URL list, and then
you won't index anything. It's a very common mistake.
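
To illustrate the caution about crawl-urlfilter.txt: the filter file is a list of +/- regex rules applied top to bottom, and a seed URL that matches no "+" rule (or hits a "-" rule first) is silently dropped. A sketch for the two example domains from this thread might look like the following (the domain names are just the ones used in the question):

```
# regex-urlfilter.txt / crawl-urlfilter.txt (sketch)
# Accept the two example domains, including subdomains:
+^https?://([a-z0-9-]+\.)*abc\.com/
+^https?://([a-z0-9-]+\.)*abcd\.com/
# Reject everything else:
-.
```

If you later add seed URLs from a new domain without adding a matching "+" line, those seeds are filtered out before fetching, which is exactly the "you won't index anything" mistake described above.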
