Hi,

Check the value of the '-numFetchers' parameter when calling generate. I guess you are using a value of 2 in non-distributed mode, i.e. the two parts are fetched sequentially, one after the other, rather than in parallel.
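To make the arithmetic concrete, here is a rough back-of-the-envelope model in Python using the figures from the original question (130 URLs per host, about 6 seconds per request). It is only a sketch, not Nutch code: the function name and parameters are made up for illustration, and it assumes that the per-host delay is the bottleneck, that every part contains at least one host with the full 130 URLs, and that in non-distributed mode the parts run strictly one after the other.

```python
def estimated_fetch_minutes(urls_per_host, seconds_per_request,
                            num_fetchers, distributed=False):
    """Rough fetch duration estimate (hypothetical helper, not a Nutch API).

    Assumes the slowest host in each part holds `urls_per_host` URLs and
    that politeness delay, not thread count, limits throughput.
    """
    # Time needed to drain the busiest host in one part.
    seconds_per_part = urls_per_host * seconds_per_request
    # Assumption: in distributed mode the parts run in parallel,
    # in non-distributed mode they run sequentially.
    parts_in_sequence = 1 if distributed else num_fetchers
    return seconds_per_part * parts_in_sequence / 60


if __name__ == "__main__":
    # One part: the 13 minutes expected in the question.
    print(estimated_fetch_minutes(130, 6, num_fetchers=1))
    # Two parts run sequentially: the 26 minutes observed.
    print(estimated_fetch_minutes(130, 6, num_fetchers=2))
```

Under these assumptions, a value of 2 for '-numFetchers' in non-distributed mode would account for both the doubling and the split around the 13th minute.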
I'd strongly advise moving to a more recent version of Nutch if you can. A considerable number of improvements have been added since 1.0.

Julien

On 28 September 2011 15:50, Danicela nutch <[email protected]> wrote:

> Hi,
>
> My config is:
>
> Nutch 1.0.
> generate.max.per.host = 130
> fetcher.server.delay = 5
> fetcher.threads.fetch = 50
> number of hosts in seeds = 30
>
> If the fetch were efficient, we would get 130 * 6 (5 + 1 imprecision) seconds
> = 13 min for a fetch.
>
> According to the results, a fetch lasts 26 minutes.
>
> When I analyse hadoop.log, I notice that some sites are fetched during
> the first 13 minutes, and the other sites, which weren't fetched until the
> 13th minute, begin to be fetched after the 13th minute. These sites are
> fetched until the 26th minute.
>
> I can conclude that the fetch takes twice as long as it should,
> because some of the sites are fetched only after the others. (Some STATS are
> produced between the 2 steps.)
>
> How can we prevent this split? I mean, how can we force all sites to be
> fetched from the beginning?
>
> Thanks in advance for helping.

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

