Hi - the difference between your local and distributed timings is odd, and is
probably due to a really bad configuration or hardware setup. The overhead of a
properly configured distributed setup is on the order of a few dozen seconds,
for distributing the .job file etc.

In general, you have very few hosts to crawl. Even if they are large (500k-1m
URLs), you can easily recrawl them all within 30 days running in local mode.
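
As a rough back-of-the-envelope check of the 30-day figure, the sketch below computes the average fetch rate such a recrawl needs. The host count of 3 is only an assumption for illustration (the original message just says "a few domains"), and `required_rate` is a hypothetical helper, not part of Nutch.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def required_rate(total_urls: int, days: float) -> float:
    """Average fetches per second needed to cover total_urls within days."""
    return total_urls / (days * SECONDS_PER_DAY)


# Assumed workload: 3 hosts with 1,000,000 URLs each, 30-day recrawl window.
hosts = 3
urls_per_host = 1_000_000

rate = required_rate(hosts * urls_per_host, 30)
print(f"{rate:.2f} fetches/second overall")
print(f"{rate / hosts:.2f} fetches/second per host")
```

Under these assumptions the crawl needs only a little over one fetch per second in total, which a single machine should handle comfortably.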

If you have a few dozen hosts or more with the same recrawl strategy, you need 
to run distributed, on a properly configured cluster.

Regarding your Nutch config, having so many threads is over the top and 
won't change anything. With this fetcher.server.delay you hardly need 10 
threads in total. Also, in general, don't use more than 1-2 threads per queue; 
this is more polite. And with this delay, you can probably recrawl a large host 
within one day, which is not so polite.
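
To put numbers on the politeness point, here is a small sketch of how long one host takes to recrawl at different delays. It assumes one request at a time per host and ignores response time, so it gives a lower bound only; `days_to_recrawl` is a hypothetical helper, and the delays other than the 0.1 seconds from your config are just illustrative values.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_to_recrawl(urls_on_host: int, delay_seconds: float) -> float:
    """Lower bound on days needed to fetch every URL on one host.

    Assumes one request per delay interval and zero response time,
    so a real crawl will take longer than this.
    """
    return urls_on_host * delay_seconds / SECONDS_PER_DAY


# 0.1 s is the delay from the quoted config; the rest are illustrative.
for delay in (0.1, 1.0, 2.5, 5.0):
    days = days_to_recrawl(1_000_000, delay)
    print(f"delay {delay:>4}s: {days:6.1f} days")
```

With these assumptions, a 1m-URL host at a 0.1 second delay is done in a bit over a day, a 2.5 second delay still fits inside a 30-day window, and a 5 second delay does not.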

Try to keep a delay of a few seconds, unless you need to recrawl everything 
within a very short timeframe. And keep in mind that fetching this fast 
increases stress on your servers, probably invalidating caches along the way.

Anyway, with so few URLs, stay local.

Regards,
Markus
 
-----Original message-----
> From:Srinivasan Ramaswamy <[email protected]>
> Sent: Tuesday 23rd May 2017 20:35
> To: [email protected]
> Subject: Local mode vs Distributed mode ? Which one is faster for doing deep 
> crawl of few domains ?
> 
> Hi All
> 
> We have a few domains and we would like to crawl all pages (deep crawling)
> from those domains (excluding external links).
> 
> We started with a domain that has 400 urls and started crawling using
> Nutch. Here is the time taken between the two modes for the smaller domain
> local mode  = 5 minutes
> distributed mode (a cluster of 3 nodes) = 2 hours
> 
> We tried the same with a domain that has > 100K urls and local mode still
> seems to be faster. Time taken for the bigger domain
> 
> local mode crawled 28K urls in 4 hours
> distributed mode crawled only 12k urls in 11 hours
> 
> When I looked into the information printed in console, I saw that it runs a
> mapreduce job for every step in each iteration in distributed mode. It
> looked to me like these map reduce jobs for a not so big number of urls are
> slowing things down.
> 
> Here is some of the configuration
> 
>  db.ignore.external.links=true
>  fetcher.server.delay=0.1
>  fetcher.queue.mode=byHost
> 
> smaller domain
>  fetcher.threads.fetch=100
>  fetcher.threads.per.queue=100
> 
> bigger domain (as we wanted to see whether number of threads make a
> difference)
>  fetcher.threads.fetch=400
>  fetcher.threads.per.queue=200
> 
> The performance looks surprisingly slow. Are we missing something ? Any
> suggestion would be really appreciated.
> 
> 
> Thanks
> Srini
> 
