Hi,

Check the value of the '-numFetchers' parameter when calling generate. I
guess you are using a value of 2 in non-distributed mode, i.e. the two fetch
lists are processed in sequential order rather than in parallel.
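
As a rough illustration of why that would double the fetch time, here is a
minimal sketch of the arithmetic using the numbers from the quoted message.
It is a simplified model, not Nutch code: the function name and the
assumption that the per-host politeness delay dominates are mine.

```python
def estimated_fetch_minutes(urls_per_host, seconds_per_request,
                            num_fetchers, parallel):
    """Rough fetch-time estimate under a simplified model (not Nutch code).

    Assumes each partition contains hosts with up to `urls_per_host` URLs,
    that one host is fetched at one request per `seconds_per_request`, and
    that the slowest host therefore determines how long a partition takes.
    """
    seconds_per_partition = urls_per_host * seconds_per_request
    # In local (non-distributed) mode the partitions run one after another;
    # in distributed mode they can run at the same time.
    passes = 1 if parallel else num_fetchers
    return seconds_per_partition * passes / 60


# Numbers from the quoted message: 130 URLs per host, 5 s delay + 1 s slack.
two_lists_sequential = estimated_fetch_minutes(130, 6, num_fetchers=2, parallel=False)
one_list = estimated_fetch_minutes(130, 6, num_fetchers=1, parallel=False)
print(two_lists_sequential, one_list)
```

Under these assumptions, two fetch lists run back to back give the observed
26 minutes, while a single list gives the expected 13.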

I'd strongly advise moving to a more recent version of Nutch if you can.
A considerable number of improvements have been added since 1.0.

Julien

On 28 September 2011 15:50, Danicela nutch <[email protected]> wrote:

> Hi,
>
>  My config is :
>
>  Nutch 1.0.
>  generate.max.per.host = 130
>  fetcher.server.delay = 5
>  fetcher.threads.fetch = 50
>  number of hosts in seeds = 30
>
>  If the fetch were efficient, we would get 130 * 6 seconds (5 s delay + 1 s
> of imprecision) = 13 min for a fetch.
>
>  According to the results, a fetch lasts 26 minutes.
>
>  When I analysed hadoop.log, I noticed that some sites are fetched during
> the first 13 minutes, and the other sites, which weren't fetched before the
> 13th minute, only begin to be fetched after it. These sites are
> fetched until the 26th minute.
>
>  I can conclude that the fetch lasts twice as long as it should,
> because some of the sites are fetched only after the others. (Some STATS are
> produced between the 2 steps.)
>
>  How can we prevent this split? I mean, how can we force all sites to be
> fetched from the beginning?
>
>  Thanks in advance for helping.
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
