How deep is your crawl going?  5 (default)?  The big issue is RAM, which you 
don't have much of.  How good are the CPUs?  In your case, I'd go with:

-Xmx768m (or -Xmx640m) and 3 maps / 2 reduces per worker node.  You can 
maybe go 1 map / 3 reduces on the job tracker.  What kind of data are you 
parsing?  Just the text, or everything?  That's also a factor to consider.
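
In config terms, that's something like this in mapred-site.xml on each worker 
node (the numbers are just the starting point above for 4GB boxes, not gospel):

<!-- 3 map slots and 2 reduce slots per tasktracker, 768MB heap per child task -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>3</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx768m</value>
</property>

Same properties for the 1 map / 3 reduces split on the jobtracker node.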

Depending on your CPUs and bandwidth, I'd go for 128-512 threads.  1024 if you 
have a university pipe to plug up and decent CPUs.  Also set simultaneous threads 
per host to 4-8 if it's a big site.
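
On the Nutch side those live in nutch-site.xml as fetcher.threads.fetch and 
fetcher.threads.per.host; 256/4 below is just a middle-of-the-road pick from 
that range, not a magic number:

<!-- total fetcher threads; 128-512 depending on CPU and bandwidth -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>256</value>
</property>
<!-- simultaneous connections to any single host -->
<property>
  <name>fetcher.threads.per.host</name>
  <value>4</value>
</property>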

The total number of maps/reduces is a question of how many links you intend 
to crawl.  I'm not sure it's an optimal number yet, but it seems to work well 
for my purposes.  I recently crawled 13k+ sites at a depth of 10 and used 
181 maps and 181 reduces, on 10 m2.xlarge EC2 spot instances.  The workers 
had 7/7 maps/reduces; the jobtracker had 2/2.  Harvesting about 1.4 million 
pages in total took about 9-10 hours; I limited links to 2048 per host, 
which helped.  I used 1024 threads.  I'm sure there's plenty to tweak that 
I'm not aware of, though I don't know how much more time I'll have to 
spend on it.
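
For reference, that run boils down to roughly these settings 
(generate.max.per.host is the usual Nutch property for a per-host cap on the 
fetchlist; treat the block as a sketch of my setup, not a recipe):

<property>
  <name>mapred.map.tasks</name>
  <value>181</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>181</value>
</property>
<!-- cap URLs per host in each generated segment -->
<property>
  <name>generate.max.per.host</name>
  <value>2048</value>
</property>
<!-- total fetcher threads for that crawl -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>1024</value>
</property>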

Scott

On Aug 7, 2010, at 2:47 PM, AJ Chen wrote:

> I'm setting up a small cluster for crawling 3000 domains: 1 master, 3
> slaves. Using the default configs, each step (generate, fetch, updatedb)
> runs much slower than expected. Tuning configurations for better performance
> of course depends on many factors, but there should be a good starting
> point for a small cluster of commodity Linux servers (4GB RAM). The
> following parameters are mentioned in hadoop or nutch documents. For a
> 4-node cluster, please suggest some good values you have found in your
> experience.
> 
> *conf/core-site.xml*:
> fs.inmemory.size.mb=  (200 for larger merge memory)
> io.file.buffer.size=4096 (default)
> io.sort.factor=10 (default)
> io.sort.mb=100 (default)
> 
> *conf/hdfs-site.xml*:
> dfs.block.size=67108864 (default), 134217728 for large filesystems
> dfs.namenode.handler.count=10 (default)
> dfs.https.enable (default=false)
> dfs.replication (default=3)
> 
> *conf/mapred-site.xml*:
> mapred.job.tracker.handler.count=10
> mapred.map.tasks=2 (default)  (40 for 4 nodes?)
> mapred.reduce.tasks=1 (default)  (for 4 nodes: 0.95*4*4)
> mapred.reduce.parallel.copies=5(default)
> mapred.submit.replication=10
> mapred.tasktracker.map.tasks.maximum=4  (2 default)
> mapred.tasktracker.reduce.tasks.maximum=4 (default=2)
> mapred.child.java.opts=-Xmx200m (default), -Xmx512m, -Xmx1024m
> mapred.job.reuse.jvm.num.tasks=1  (-1 no limit)
> 
> thanks,
> aj
> -- 
> AJ Chen, PhD
> Chair, Semantic Web SIG, sdforum.org
> http://web2express.org
> twitter @web2express
> Palo Alto, CA, USA
