I'm setting up a small cluster for crawling 3000 domains: 1 master, 3
slaves. Using the default configs, each step (generate, fetch, updatedb)
runs much slower than expected. Tuning configurations for better performance
of course depends on many factors, but there should be a good starting
point for a small cluster of commodity Linux servers (4GB RAM each). The
following parameters are mentioned in the Hadoop and Nutch documentation.
For a 4-node cluster, please suggest some good values you have found in
your experience.

*conf/core-site.xml*:
fs.inmemory.size.mb=  (200 for larger memory when merging)
io.file.buffer.size=4096 (default)
io.sort.factor=10 (default)
io.sort.mb=100 (default)
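
For reference, this is roughly how I'd put those into core-site.xml once
values are settled (the numbers here are just the tentative ones above,
not tested recommendations):

<?xml version="1.0"?>
<configuration>
  <!-- tentative: larger in-memory buffer for merging map outputs -->
  <property>
    <name>fs.inmemory.size.mb</name>
    <value>200</value>
  </property>
  <!-- the rest left at defaults for now -->
  <property>
    <name>io.file.buffer.size</name>
    <value>4096</value>
  </property>
  <property>
    <name>io.sort.factor</name>
    <value>10</value>
  </property>
  <property>
    <name>io.sort.mb</name>
    <value>100</value>
  </property>
</configuration>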

*conf/hdfs-site.xml*:
dfs.block.size=67108864 (default); 134217728 for large file-systems
dfs.namenode.handler.count=10 (default)
dfs.https.enable=false (default)
dfs.replication=3 (default)
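
Likewise for hdfs-site.xml; here I'd probably try the larger block size,
but again these are only guesses:

<?xml version="1.0"?>
<configuration>
  <!-- guess: 128MB blocks; default is 67108864 -->
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>
  </property>
  <!-- defaults -->
  <property>
    <name>dfs.namenode.handler.count</name>
    <value>10</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>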

*conf/mapred-site.xml*:
mapred.job.tracker.handler.count=10
mapred.map.tasks=2 (default)  (40 for 4 nodes?)
mapred.reduce.tasks=1 (default)  (for 4 nodes: 0.95*4*4 ≈ 15?)
mapred.reduce.parallel.copies=5 (default)
mapred.submit.replication=10
mapred.tasktracker.map.tasks.maximum=4 (default=2)
mapred.tasktracker.reduce.tasks.maximum=4 (default=2)
mapred.child.java.opts=-Xmx200m (default); -Xmx512m or -Xmx1024m?
mapred.job.reuse.jvm.num.tasks=1 (default); -1 for no limit
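
And the mapred-site.xml sketch I'm starting from (the values are just the
guesses listed above, not recommendations):

<?xml version="1.0"?>
<configuration>
  <!-- guess: 4 map and 4 reduce slots per 4GB node -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
  </property>
  <!-- guess: 0.95 * 4 nodes * 4 reduce slots per node -->
  <property>
    <name>mapred.reduce.tasks</name>
    <value>15</value>
  </property>
  <!-- guess: more heap per child JVM than the 200m default -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>
  <!-- reuse child JVMs without limit? -->
  <property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <value>-1</value>
  </property>
</configuration>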

thanks,
aj
-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
