Scott, thank you for the detailed suggestions; they're very helpful. I have only 4 low-end nodes and am experimenting with different settings now. A couple more questions about getting the most out of such a small cluster:

- Can the jobtracker run on the same machine as the namenode (i.e. the master node)?
- What happens if I add the master node's name to the conf/slaves file (sketched below)? Does that make the master node also a worker node, and if so, does it help or hurt performance?
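To make the second question concrete, this is the layout I have in mind. The hostnames are made up, and I'm assuming the standard Hadoop conf/slaves format of one worker hostname per line:

conf/slaves (on the master):

  master1
  slave1
  slave2
  slave3

where master1 is the (hypothetical) hostname of the box already running the namenode and jobtracker, and slave1-3 are the dedicated workers. If I understand the start scripts correctly, this would start a datanode and tasktracker on master1 in addition to the namenode and jobtracker, but please correct me if that's wrong.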
best,
-aj

On Sat, Aug 7, 2010 at 6:10 PM, Scott Gonyea <[email protected]> wrote:

> How deep is your crawl going? 5 (default)? The big issue is RAM, which
> you don't have much of. How good are the CPUs? In your case, I'd go with:
>
> -Xmx768m (or -Xmx640m) and do 3 maps / 2 reduces per worker node. You can
> maybe go 1 map / 3 reduces on the job tracker. What kind of data are you
> parsing? Just the text, or everything? That's also a factor to consider.
>
> Depending on your CPUs and bandwidth, I'd go for 128-512 threads; 1024 if
> you have a university pipe to plug up and decent CPUs. Also, simultaneous
> threads per host: 4-8 if it's a big site.
>
> The total number of maps/reduces is a question of how many links you
> intend to crawl. I'm not sure it's an optimal number yet, but it seems to
> work well for my purposes. I recently crawled 13k+ sites, at a depth of 10,
> and used 181 maps and 181 reduces on 10 m2.xlarge EC2 spot instances.
> The workers had 7/7 maps/reduces; the jobtracker had 2/2.
> Harvesting about 1.4 million pages in total took about 9-10 hours. I
> limited links to 2048 per host, which suited my purposes, and used 1024
> threads. I'm sure there's plenty more to tweak that I'm not aware of,
> though I don't know how much more time I'll have left to spend on it.
>
> Scott
>
> On Aug 7, 2010, at 2:47 PM, AJ Chen wrote:
>
> > I'm setting up a small cluster for crawling 3000 domains: 1 master, 3
> > slaves. Using the default configs, each step (generate, fetch, updatedb)
> > runs much slower than expected. Tuning the configuration for better
> > performance of course depends on many factors, but there should be a good
> > starting point for a small cluster of commodity linux servers (4GB RAM).
> > The following parameters are mentioned in the hadoop or nutch documents.
> > For a 4-node cluster, please suggest some good values you have found in
> > your experience.
> >
> > *conf/core-site.xml*:
> > fs.inmemory.size.mb=200 (larger memory for merging)
> > io.file.buffer.size=4096 (default)
> > io.sort.factor=10 (default)
> > io.sort.mb=100 (default)
> >
> > *conf/hdfs-site.xml*:
> > dfs.block.size=67108864 (default), 134217728 for a large file-system
> > dfs.namenode.handler.count=10 (default)
> > dfs.https.enable (default=false)
> > dfs.replication (default=3)
> >
> > *conf/mapred-site.xml*:
> > mapred.job.tracker.handler.count=10
> > mapred.map.tasks=2 (default) (40 for 4 nodes?)
> > mapred.reduce.tasks=1 (default) (for 4 nodes: 0.95*4*4)
> > mapred.reduce.parallel.copies=5 (default)
> > mapred.submit.replication=10
> > mapred.tasktracker.map.tasks.maximum=4 (default=2)
> > mapred.tasktracker.reduce.tasks.maximum=4 (default=2)
> > mapred.child.java.opts=-Xmx200m (default), -Xmx512m, -Xmx1024M
> > mapred.job.reuse.jvm.num.tasks=1 (-1 for no limit)
> >
> > thanks,
> > aj
> > --
> > AJ Chen, PhD
> > Chair, Semantic Web SIG, sdforum.org
> > http://web2express.org
> > twitter @web2express
> > Palo Alto, CA, USA

--
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
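P.S. For my own notes, here is how I read your suggested per-worker settings as config overrides. This is just a rough sketch against Hadoop 0.20 / Nutch 1.x property names; the values are the ones from your mail, not anything I've verified on my cluster, and I'm assuming fetcher.threads.fetch, fetcher.threads.per.host, and generate.max.per.host are the right knobs for the thread counts and the per-host link cap you mention.

conf/mapred-site.xml (inside <configuration>, on each worker node):

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>3</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx768m</value>
  </property>

conf/nutch-site.xml (inside <configuration>):

  <property>
    <name>fetcher.threads.fetch</name>
    <value>128</value>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>4</value>
  </property>
  <property>
    <name>generate.max.per.host</name>
    <value>2048</value>
  </property>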

