How deep is your crawl going? 5 (the default)? The big issue is RAM, which you don't have much of. How good are the CPUs? In your case, I'd go with -Xmx768m (or -Xmx640m) and 3 maps / 2 reduces per worker node; you can maybe run 1 map / 3 reduces on the jobtracker.
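For what it's worth, on a 4GB worker node that recommendation would look roughly like this in mapred-site.xml (Hadoop 0.20-era property names; treat the values as a starting point, not gospel -- five child JVMs at 768m already brush up against 4GB, hence the 640m fallback):

<!-- per worker node: 3 map slots, 2 reduce slots, modest child heap -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>3</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx768m</value>
</property>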
What kind of data are you parsing? Just the text, or everything? That's also a factor to consider. Depending on your CPUs and bandwidth, I'd go for 128-512 fetcher threads; 1024 if you have a university pipe to saturate and decent CPUs. Also watch simultaneous threads per host: 4-8 if it's a big site (both thread settings live in nutch-site.xml; see the sketch after the quoted message below).

The total number of maps/reduces is a question of how many links you intend to crawl. I'm not sure it's an optimal number yet, but it seems to work well for my purposes: I recently crawled 13k+ sites at a depth of 10 and used 181 maps and 181 reduces, on 10 m2.xlarge EC2 spot instances. The workers ran 7 maps / 7 reduces each; the jobtracker ran 2/2. Harvesting about 1.4 million pages in total took about 9-10 hours. I limited links to 2048 per host, which helped for my purposes, and used 1024 threads. I'm sure there's plenty more to tweak that I'm not aware of, though I don't know how much more time I'll have left to spend on it.

Scott

On Aug 7, 2010, at 2:47 PM, AJ Chen wrote:

> I'm setting up a small cluster for crawling 3000 domains: 1 master, 3 slaves.
> Using the default configs, each step (generate, fetch, updatedb) runs much
> slower than expected. Tuning the configuration for better performance of
> course depends on many factors, but there should be a good starting point
> for a small cluster of commodity Linux servers (4GB RAM). The following
> parameters are mentioned in the Hadoop or Nutch documentation. For a
> 4-node cluster, please suggest some good values you have found in your
> experience.
>
> conf/core-site.xml:
> fs.inmemory.size.mb= (200 for larger merge memory?)
> io.file.buffer.size=4096 (default)
> io.sort.factor=10 (default)
> io.sort.mb=100 (default)
>
> conf/hdfs-site.xml:
> dfs.block.size=67108864 (default), 134217728 for a large file system
> dfs.namenode.handler.count=10 (default)
> dfs.https.enable (default=false)
> dfs.replication (default=3)
>
> conf/mapred-site.xml:
> mapred.job.tracker.handler.count=10
> mapred.map.tasks=2 (default) (40 for 4 nodes?)
> mapred.reduce.tasks=1 (default) (for 4 nodes: 0.95*4*4?)
> mapred.reduce.parallel.copies=5 (default)
> mapred.submit.replication=10
> mapred.tasktracker.map.tasks.maximum=4 (default=2)
> mapred.tasktracker.reduce.tasks.maximum=4 (default=2)
> mapred.child.java.opts=-Xmx200m (default), -Xmx512m, -Xmx1024M
> mapred.job.reuse.jvm.num.tasks=1 (-1 for no limit)
>
> thanks,
> aj
> --
> AJ Chen, PhD
> Chair, Semantic Web SIG, sdforum.org
> http://web2express.org
> twitter @web2express
> Palo Alto, CA, USA
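For completeness, here is a sketch of how the fetcher knobs mentioned above look in nutch-site.xml. fetcher.threads.fetch and fetcher.threads.per.host are the Nutch 1.x names for the total thread count and the per-host limit, and the 2048-links-per-host cap is, if I remember the property name right, generate.max.per.host; the values are just examples inside the ranges suggested above.

<!-- nutch-site.xml: overall fetcher parallelism and per-host politeness -->
<property>
  <name>fetcher.threads.fetch</name>
  <!-- pick something in the 128-512 range (1024 on a fat pipe) -->
  <value>256</value>
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <!-- 4-8 for big sites -->
  <value>4</value>
</property>
<property>
  <name>generate.max.per.host</name>
  <!-- caps URLs selected per host in each generate cycle, like the 2048 limit above -->
  <value>2048</value>
</property>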

