Hi,
Just to add on top of what Gabriele and Luis said, you may want to look at
"db.ignore.external.links". If you have a large seed list,
crawl-urlfilter and regex-urlfilter can become quite a pain to maintain and
may have a performance impact (am I right?). If you don't want to add new
links, even from the same host, then you should take a look at
"db.update.additions.allowed".
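For reference, both properties go in conf/nutch-site.xml. A minimal sketch (the values shown are examples, not defaults; check nutch-default.xml for your Nutch version):

```xml
<!-- conf/nutch-site.xml: example overrides, values are illustrative -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>Only follow links pointing to the same host as the page
  they were found on; external links are dropped.</description>
</property>
<property>
  <name>db.update.additions.allowed</name>
  <value>false</value>
  <description>Do not add newly discovered links to the crawl db, so
  only the injected seed URLs are ever fetched.</description>
</property>
```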
-----Original Message-----
From: Luis Cappa Banda
Sent: Saturday, May 14, 2011 8:18 AM
To: [email protected]
Subject: Re: how to force nutch to crawl specific urls?
Hello.
As Gabriele said before, you should specify the list of URLs that you'll use
for crawling/fetching. You can enter a specific URL or a domain URL
pointing to, for example, a particular HTML or PDF file. Of course, you can
put several URLs in a list, not only one. In any case, be careful with the
crawl-urlfilter.txt config file: if you have configured it before, the
pattern you chose may not be applicable to your new URL list, and then
you won't index anything. It's a very common mistake.
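To illustrate that pitfall, here is a crawl-urlfilter.txt sketch (example.com is a placeholder): a filter locked to one domain silently rejects every seed from any other domain, so the crawl fetches nothing new.

```
# Skip URLs with common non-content suffixes (typical stock rule)
-\.(gif|jpg|png|css|js|zip|gz)$
# Accept only URLs under example.com -- seeds on other domains are dropped
+^http://([a-z0-9]*\.)*example.com/
# Reject everything else
-.
```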