Fetch performance

Danicela nutch Wed, 28 Sep 2011 07:51:09 -0700

Hi,

 My config is:


 Nutch 1.0.
 generate.max.per.host = 130
 fetcher.server.delay = 5
 fetcher.threads.fetch = 50
 number of hosts in seeds = 30

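 If the fetch were efficient, each host would take 130 * 6 seconds (the 5 s 
delay plus about 1 s of imprecision) = 780 seconds, so a fetch should last 
about 13 min, since all hosts can be fetched in parallel.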
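
 The estimate above can be written out as a small calculation. This is only a 
sketch: the function name and the 1-second overhead parameter are illustrative 
assumptions, and it assumes every host is fetched in parallel from the start, 
so the per-host politeness delay is the only limit.

```python
# Lower-bound estimate of fetch duration when the per-host delay dominates.
# Assumption: all hosts are fetched in parallel, one request at a time per host.

def expected_fetch_seconds(urls_per_host, server_delay_s, overhead_s=1.0):
    """Time to fetch one host's URLs sequentially with a politeness delay."""
    return urls_per_host * (server_delay_s + overhead_s)

# Values from the config above: generate.max.per.host = 130,
# fetcher.server.delay = 5, plus ~1 s of imprecision per request.
seconds = expected_fetch_seconds(130, 5)
minutes = seconds / 60
print(f"{seconds:.0f} s = {minutes:.1f} min")  # prints "780 s = 13.0 min"
```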

 According to the results, a fetch lasts 26 minutes.

 When I analysed hadoop.log, I noticed that some sites are fetched during the 
first 13 minutes, while the other sites, which are not fetched at all before 
the 13th minute, only start being fetched after it. These sites are then 
fetched until the 26th minute.

 I conclude that the fetch takes twice as long as it should, because some of 
the sites are fetched only after the others have finished. (Some STATS are 
produced between the 2 steps.)
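
 One way to check this kind of split is to compute, for each host, when its 
first and last fetch happened relative to the start of the run. The sketch 
below is only an illustration: it assumes log lines of the form 
"YYYY-MM-DD HH:MM:SS,mmm INFO  fetcher.Fetcher - fetching <url>", and the 
sample lines and host names are made up, not taken from the log discussed 
here. Adjust the pattern if your hadoop.log looks different.

```python
import re
from datetime import datetime
from urllib.parse import urlparse

# Assumed line format: "<date> <time>,<millis> ... fetching <url>"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ .*\bfetching (\S+)")

def host_windows(lines):
    """Map each host to the (first, last) timestamp of its 'fetching' lines."""
    windows = {}
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines that are not fetch events
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        host = urlparse(m.group(2)).netloc
        first, last = windows.get(host, (ts, ts))
        windows[host] = (min(first, ts), max(last, ts))
    return windows

def minutes_since_start(windows):
    """Convert each host's window to minutes after the earliest fetch."""
    start = min(first for first, _ in windows.values())
    return {
        host: ((first - start).total_seconds() / 60,
               (last - start).total_seconds() / 60)
        for host, (first, last) in windows.items()
    }

# Hypothetical sample lines illustrating a two-wave fetch.
sample = [
    "2011-09-28 10:00:00,000 INFO  fetcher.Fetcher - fetching http://a.example/1",
    "2011-09-28 10:12:54,000 INFO  fetcher.Fetcher - fetching http://a.example/130",
    "2011-09-28 10:13:00,000 INFO  fetcher.Fetcher - fetching http://b.example/1",
    "2011-09-28 10:25:54,000 INFO  fetcher.Fetcher - fetching http://b.example/130",
]

offsets = minutes_since_start(host_windows(sample))
for host, (first, last) in sorted(offsets.items()):
    print(f"{host}: first fetch at {first:.1f} min, last at {last:.1f} min")
```

 Hosts whose first fetch is near 0 belong to the first step. Hosts whose first 
fetch is near the 13th minute are the ones that were delayed.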

 How can we prevent this split? In other words, how can we force all sites to 
be fetched from the beginning?

 Thanks in advance for helping.
