Thanks for the advice.

Just want to point out one thing: URLs in the seed list may NOT be fetched!

My experience: for instance, I put the following URLs in the seed
list:

http://www.abc.com/page1.html
http://www.abc.com/page2.html
http://www.abc.com/page3.html
http://www.abc.com/page4.html
http://www.abc.com/page5.html
http://www.abc.com/page6.html
http://www.abc.com/page7.html
http://www.abc.com/page8.html

I expected all of these URLs to be fetched, but they were not. The reason
seems to be that the fetch process takes the depth and topN options into
account when deciding which URLs to fetch. Once that decision was made and
http://www.abc.com/page6.html was not fetched, it could not be fetched even
when I put only that one URL, http://www.abc.com/page6.html, in the seed
list, most likely because of the 30-day re-fetching period. Am I right here?

The reason I am asking is: apart from the seed list, can I use score and
boost mechanisms to ensure that specific URLs get fetched?
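One possibly relevant option (worth verifying against your Nutch version): the Nutch 1.x Injector understands per-URL metadata in the seed file, given as tab-separated key=value pairs after the URL. A sketch of a seed file that raises the score and shortens the re-fetch interval (in seconds) for one page; the concrete values here are illustrative only:

```
# urls/seed.txt -- per-URL metadata for the Nutch 1.x Injector
# (tab-separated key=value pairs; confirm the key names for your version)
http://www.abc.com/page6.html	nutch.score=10	nutch.fetchInterval=3600
```

After editing the seed file you would re-run "bin/nutch inject crawl/crawldb urls"; note that re-injecting may not overwrite an existing CrawlDb entry, depending on your Nutch version and its injector settings.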

Thanks.




-----Original Message-----
From: Jean-Francois Gingras [mailto:[email protected]] 
Sent: Sunday, May 15, 2011 5:55 AM
To: [email protected]
Subject: Re: how to force nutch to crawl specific urls?

Hi,

Just to add on top of what Gabriele and Luis said, you may want to look at 
"db.ignore.external.links". If you have a large seed list, 
crawl-urlfilter and regex-urlfilter can become quite a pain to maintain and 
may have a performance impact (am I right?). If you don't want to add new 
links, even from the same host, then you should take a look at 
"db.update.additions.allowed".
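To make the two properties above concrete, a sketch of how they could be set in conf/nutch-site.xml (values are illustrative; check nutch-default.xml for the exact semantics in your version):

```
<!-- conf/nutch-site.xml -- sketch only -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>Ignore outlinks pointing to a different host than the
  page they were found on, keeping the crawl within the seed hosts.</description>
</property>
<property>
  <name>db.update.additions.allowed</name>
  <value>false</value>
  <description>If false, newly discovered links are not added to the
  CrawlDb, so only the injected seed URLs are ever fetched.</description>
</property>
```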

-----Original Message----- 
From: Luis Cappa Banda
Sent: Saturday, May 14, 2011 8:18 AM
To: [email protected]
Subject: Re: how to force nutch to crawl specific urls?

Hello.

As Gabriele said before, you should specify the URL list that you'll use
for crawling/fetching. You can enter a specific URL, or a domain URL
pointing to, for example, a particular HTML or PDF file. Of course, you can
put several URLs in a list, not only one. In any case, be careful with the
crawl-urlfilter.txt config file: if you have configured it before, the
patterns you chose may not be applicable to your new URL list, and then
you won't index anything. It's a very common mistake. 
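For illustration, a minimal filter file for the example host from this thread (www.abc.com is hypothetical); rules are evaluated top-down and the first match wins, so an overly broad reject rule placed above your accept rule will silently drop new seed URLs:

```
# crawl-urlfilter.txt / regex-urlfilter.txt -- illustrative patterns only
# accept anything under the seed host:
+^http://www\.abc\.com/
# reject everything else:
-.
```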
