Hi All,

We have a few domains, and we would like to crawl all pages (deep crawling) from those domains, excluding external links.
We started with a domain that has 400 URLs and crawled it using Nutch. Here is the time taken between the two modes for the smaller domain:

local mode = 5 minutes
distributed mode (a cluster of 3 nodes) = 2 hours

We tried the same with a domain that has > 100K URLs, and local mode still seems to be faster. Time taken for the bigger domain:

local mode crawled 28K URLs in 4 hours
distributed mode crawled only 12K URLs in 11 hours

When I looked at the information printed to the console, I saw that in distributed mode a MapReduce job is launched for every step of each iteration. It looks to me like the overhead of these MapReduce jobs is what slows things down for such a modest number of URLs.

Here is some of the configuration:

db.ignore.external.links=true
fetcher.server.delay=0.1
fetcher.queue.mode=byHost

smaller domain:
fetcher.threads.fetch=100
fetcher.threads.per.queue=100

bigger domain (as we wanted to see whether the number of threads makes a difference):
fetcher.threads.fetch=400
fetcher.threads.per.queue=200

The performance looks surprisingly slow. Are we missing something? Any suggestions would be really appreciated.

Thanks,
Srini
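In case it helps anyone reproduce this, here is how the shared settings above look in our nutch-site.xml (the property names and values are exactly the ones listed; the layout of the file is just the standard Hadoop-style configuration format):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Stay within our own domains: drop outlinks to external hosts -->
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
  <!-- Delay (in seconds) between successive requests to the same server -->
  <property>
    <name>fetcher.server.delay</name>
    <value>0.1</value>
  </property>
  <!-- Partition the fetch queues by host -->
  <property>
    <name>fetcher.queue.mode</name>
    <value>byHost</value>
  </property>
  <!-- Values used for the bigger domain; the smaller domain used 100/100 -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>400</value>
  </property>
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>200</value>
  </property>
</configuration>
```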

