Hi, I still have one question: suppose my seed URLs come from two domains (abc.com and abcd.com), and pages on abc.com contain links pointing to abcd.com. If I set "db.ignore.external.links" to "true", would all those links to abcd.com be ignored? Furthermore, what about the seed URLs with abcd.com?
-----Original Message-----
From: Jean-Francois Gingras [mailto:[email protected]]
Sent: Sunday, May 15, 2011 5:55 AM
To: [email protected]
Subject: Re: how to force nutch to crawl specific urls?

Hi,

Just to add on top of what Gabriele and Luis said, you may want to look at "db.ignore.external.links". If you have a large seed list, crawl-urlfilter and regex-urlfilter can become quite a pain to maintain and may have a performance impact (am I right?). If you don't want to add new links, even from the same host, then you should take a look at "db.update.additions.allowed".

-----Original Message-----
From: Luis Cappa Banda
Sent: Saturday, May 14, 2011 8:18 AM
To: [email protected]
Subject: Re: how to force nutch to crawl specific urls?

Hello.

As Gabriele said before, you should specify the URL list that you'll use for crawling/fetching. You can enter a specific URL, or a domain URL pointing to, for example, a particular HTML or PDF file. Of course, you can put several URLs in a list, not only one. In any case, be careful with the crawl-urlfilter.txt config file: if you have configured it before, the patterns you chose may not be applicable to your new URL list, and then you won't index anything. It's a very common mistake.
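For reference, here is a sketch of how the two properties discussed above would be set, assuming a standard Nutch 1.x setup where overrides go in conf/nutch-site.xml (the values and description wording below are illustrative, not the shipped defaults):

```xml
<!-- conf/nutch-site.xml (fragment) -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>When true, outlinks pointing to a different host than the
  page they were found on are skipped during the crawldb update.</description>
</property>
<property>
  <name>db.update.additions.allowed</name>
  <value>false</value>
  <description>When false, newly discovered URLs are not added to the
  crawldb, so only the injected seed URLs get crawled.</description>
</property>
```

With these overrides you may be able to avoid maintaining long per-domain patterns in crawl-urlfilter.txt / regex-urlfilter.txt, as suggested above.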

