I'm setting up a small cluster for crawling 3000 domains: 1 master and 3 slaves. With the default configs, each step (generate, fetch, updatedb) runs much slower than expected. Tuning for better performance of course depends on many factors, but there should be a good starting point for a small cluster of commodity Linux servers (4GB RAM each). The following parameters are mentioned in the Hadoop or Nutch documentation. For a 4-node cluster, please suggest some good values you have found in your experience.
*conf/core-site.xml*:
  fs.inmemory.size.mb=200 (larger memory for merging)
  io.file.buffer.size=4096 (default)
  io.sort.factor=10 (default)
  io.sort.mb=100 (default)

*conf/hdfs-site.xml*:
  dfs.block.size=67108864 (default); 134217728 for a large file system
  dfs.namenode.handler.count=10 (default)
  dfs.https.enable (default=false)
  dfs.replication (default=3)

*conf/mapred-site.xml*:
  mapred.job.tracker.handler.count=10
  mapred.map.tasks=2 (default) (40 for 4 nodes?)
  mapred.reduce.tasks=1 (default) (for 4 nodes: 0.95*4*4)
  mapred.reduce.parallel.copies=5 (default)
  mapred.submit.replication=10
  mapred.tasktracker.map.tasks.maximum=4 (default=2)
  mapred.tasktracker.reduce.tasks.maximum=4 (default=2)
  mapred.child.java.opts=-Xmx200m (default), -Xmx512m, -Xmx1024m
  mapred.job.reuse.jvm.num.tasks=1 (-1 for no limit)
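For reference, a sketch of how such overrides would look in conf/mapred-site.xml; the values are only picked from the candidates listed above (reduce tasks from 0.95*4*4, rounded down), so treat them as guesses rather than tested settings:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- conf/mapred-site.xml: candidate task-slot and JVM settings for 4 nodes with 4GB RAM each -->
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>   <!-- concurrent map tasks per node (default 2) -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>   <!-- concurrent reduce tasks per node (default 2) -->
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>   <!-- heap per task JVM (default -Xmx200m) -->
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>15</value>   <!-- ~0.95 * 4 nodes * 4 reduce slots per node -->
  </property>
  <property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <value>-1</value>   <!-- -1 = reuse task JVMs without limit (default 1) -->
  </property>
</configuration>

The core-site.xml and hdfs-site.xml overrides would use the same <property> format in their respective files.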
thanks,
aj

--
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
