Thanks for the advice. One thing I'd like to point out: URLs in the seed list may NOT actually be fetched!
My experience: I put the following URLs in the seed list:

http://www.abc.com/page1.html
http://www.abc.com/page2.html
http://www.abc.com/page3.html
http://www.abc.com/page4.html
http://www.abc.com/page5.html
http://www.abc.com/page6.html
http://www.abc.com/page7.html
http://www.abc.com/page8.html

I expected all of them to be fetched, but they were not. The reason seems to be that the fetch process takes the depth and topN options into account when deciding which URLs to fetch. Once that decision was made and http://www.abc.com/page6.html was not fetched, it could not be fetched afterwards even when I put that single URL alone in the seed list, most likely because of the 30-day re-fetch period. Am I right here?

The reason I ask is: apart from the seed list, can I use score and boost methods to ensure that specific URLs get fetched?

Thanks.

-----Original Message-----
From: Jean-Francois Gingras [mailto:[email protected]]
Sent: Sunday, May 15, 2011 5:55 AM
To: [email protected]
Subject: Re: how to force nutch to crawl specific urls?

Hi,

Just to add on top of what Gabriele and Luis said, you may want to look at "db.ignore.external.links". If you have a large seed list, crawl-urlfilter and regex-urlfilter can become quite a pain to maintain and may have a performance impact (am I right?). If you don't want to add new links at all, even from the same host, then you should take a look at "db.update.additions.allowed".

-----Original Message-----
From: Luis Cappa Banda
Sent: Saturday, May 14, 2011 8:18 AM
To: [email protected]
Subject: Re: how to force nutch to crawl specific urls?

Hello,

As Gabriele said before, you should specify the URL list that you'll use for crawling/fetching. You can enter a specific URL, or a domain URL pointing to, for example, a particular HTML or PDF file. Of course, you can put several URLs in the list, not only one.
In any case, be careful with the crawl-urlfilter.txt config file: if you have configured it before, the patterns you defined may not be applicable to your new URL list, and then you won't index anything. It's a very common mistake.
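On the score question: the Nutch 1.x Injector can read optional per-URL metadata from the seed file, which lets you boost a specific URL's score and shorten its re-fetch interval. The `nutch.score` and `nutch.fetchInterval` keys below are what I believe the Injector recognizes, but this depends on your Nutch version, so check your Injector's documentation. A minimal sketch:

```
# seed/urls.txt -- one URL per line; tab-separated metadata is optional
# (nutch.score / nutch.fetchInterval keys assume the Nutch 1.x Injector)
http://www.abc.com/page6.html	nutch.score=100.0	nutch.fetchInterval=86400
```

After injecting this file, the boosted score makes page6.html much more likely to survive the generator's topN cut, and the per-URL fetch interval (in seconds) overrides the 30-day default so the URL becomes eligible for re-fetching sooner.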
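For reference, the two properties Jean-Francois mentioned go in conf/nutch-site.xml. A sketch of how they might be set (values are illustrative; check nutch-default.xml for the exact semantics in your version):

```xml
<!-- conf/nutch-site.xml (fragment) -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>Only follow links within the hosts of the seed URLs,
  so no external-link filtering rules are needed.</description>
</property>
<property>
  <name>db.update.additions.allowed</name>
  <value>false</value>
  <description>Do not add newly discovered links to the crawldb;
  only the injected seed URLs are ever candidates for fetching.</description>
</property>
```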

