My suggestion is to approach Nutch by installing and working with Hadoop first. Ignore the masters/slaves files. Just set up the datanode/tasktracker on each box and give it the DNS name of your jobtracker. That'll handle the "hi, I can process work" part for you. At least, that's how I handled it. That allowed me to set up a bunch of Hadoop nodes and scale them up, no problem.
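Concretely, that's just a couple of properties on each worker box; something like the below, where master.example.com and the ports are placeholders for whatever your jobtracker/namenode actually answers on (swap in your own DNS name):

conf/core-site.xml:

  <configuration>
    <property>
      <!-- placeholder hostname/port; an s3n:// URI works here too if you skip HDFS, which is what I do -->
      <name>fs.default.name</name>
      <value>hdfs://master.example.com:9000</value>
    </property>
  </configuration>

conf/mapred-site.xml:

  <configuration>
    <property>
      <!-- DNS name of the jobtracker; this is the "hi, I can process work" hookup -->
      <name>mapred.job.tracker</name>
      <value>master.example.com:9001</value>
    </property>
  </configuration>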
Job Tracker: Determines what work to perform and doles it out.
Name Node / Secondary Name Node: HDFS management. The SNN is optional in an HDFS setup, and the NN and SNN are only necessary if you are using HDFS. I use s3(n), so I didn't have to care about them or their overhead.
Task Tracker: These are what process work, as assigned by the Job Tracker.
Data Node: Name Node slave, only required when using HDFS.

So, if you run a task tracker on the same system as the JT, it'll process work. There's no reason not to, unless you are sticking the JT on a weak computer. That said, don't overburden the Job Tracker or you'll just make life that much more difficult (imo).

From my experience, and the Gods of Nutchdoopcene may disagree, the Job Tracker and Map tasks are much more CPU intensive than they are memory intensive. Reduce operations benefit more from RAM than they do from CPU.

My suggestion, I suppose, is to make your master node a JT+TT+NN+DN; you should even be able to get away with doing a 1 map / 3 reduce run on it. Make everything else a DN+TT.

Your Map/Reduce count should, in my limited experience, scale with (sites * depth) and with consideration for your memory limits. Less memory = smaller units of work.

Again, I'd love for the resident Nutch scholar to call me dumb and critique my assumptions / grooming methodologies.

To answer your question about the master/slaves: if you do work on your master node, it is also a slave, in addition to being a master.

Aaand, bed time. Good luck.

sg

On Aug 8, 2010, at 7:08 PM, AJ Chen wrote:

> Scott, thank you for the detailed suggestions. It's very helpful. I have only 4 low-end nodes and am experimenting with different settings now. A couple more questions for getting the most out of such a small cluster:
> - Can the jobtracker be on the same namenode (i.e. master node)?
> - What happens if I add the master node name to the conf/slaves file? Does it make the master node also a worker node? If yes, does it help performance or not?
>
> best,
> -aj
>
> On Sat, Aug 7, 2010 at 6:10 PM, Scott Gonyea <[email protected]> wrote:
>
>> How deep is your crawl going? 5 (the default)? The big issue is RAM, which you don't have much of. How good are the CPUs? In your case, I'd go with:
>>
>> -Xmx768m (or -Xmx640m) and 3 maps / 2 reduces per worker node. You can maybe go 1 map / 3 reduces on the job tracker. What kind of data are you parsing? Just the text, or everything? That's also a factor to consider.
>>
>> Depending on your CPUs and bandwidth, I'd go for 128-512 threads; 1024 if you have a university pipe to plug up and decent CPUs. Also set the simultaneous threads per host: 4-8 if it's a big site.
>>
>> The total number of maps/reduces is a question of how many links you intend to crawl. I'm not sure it's an optimal number yet, but it seems to work well for my purposes. I recently crawled 13k+ sites at a depth of 10 and used 181 maps and 181 reduces, on 10 m2.xlarge EC2 spot instances. The workers had 7/7 maps/reduces; the jobtracker had 2/2. Harvesting about 1.4 million pages in total took about 9-10 hours; I limited links to 2048 per host, which helped my purposes. I used 1024 threads. I'm sure there's plenty to tweak that I'm not aware of, though I don't know how much more time I'll have left to spend on it.
>>
>> Scott
>>
>> On Aug 7, 2010, at 2:47 PM, AJ Chen wrote:
>>
>>> I'm setting up a small cluster for crawling 3000 domains: 1 master, 3 slaves. Using the default configs, each step (generate, fetch, updatedb) runs much slower than expected. Tuning configurations for better performance of course depends on many factors, but there should be a good starting point for a small cluster of commodity Linux servers (4GB RAM). The following parameters are mentioned in the Hadoop and Nutch documents. For a 4-node cluster, please suggest some good values you have found in your experience.
>>>
>>> conf/core-site.xml:
>>> fs.inmemory.size.mb=200 (larger memory for merging)
>>> io.file.buffer.size=4096 (default)
>>> io.sort.factor=10 (default)
>>> io.sort.mb=100 (default)
>>>
>>> conf/hdfs-site.xml:
>>> dfs.block.size=67108864 (default), 134217728 for large file systems
>>> dfs.namenode.handler.count=10 (default)
>>> dfs.https.enable (default=false)
>>> dfs.replication (default=3)
>>>
>>> conf/mapred-site.xml:
>>> mapred.job.tracker.handler.count=10
>>> mapred.map.tasks=2 (default) (40 for 4 nodes?)
>>> mapred.reduce.tasks=1 (default) (for 4 nodes: 0.95*4*4)
>>> mapred.reduce.parallel.copies=5 (default)
>>> mapred.submit.replication=10
>>> mapred.tasktracker.map.tasks.maximum=4 (default=2)
>>> mapred.tasktracker.reduce.tasks.maximum=4 (default=2)
>>> mapred.child.java.opts=-Xmx200m (default), -Xmx512m, -Xmx1024m
>>> mapred.job.reuse.jvm.num.tasks=1 (-1 = no limit)
>>>
>>> thanks,
>>> aj
>>> --
>>> AJ Chen, PhD
>>> Chair, Semantic Web SIG, sdforum.org
>>> http://web2express.org
>>> twitter @web2express
>>> Palo Alto, CA, USA
>
> --
> AJ Chen, PhD
> Chair, Semantic Web SIG, sdforum.org
> http://web2express.org
> twitter @web2express
> Palo Alto, CA, USA
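
PS: For completeness, here is roughly what the numbers floated in this thread look like in conf/mapred-site.xml on one of the 4 GB worker nodes. The values are just the starting point discussed above (3 maps / 2 reduces per worker, -Xmx768m or -Xmx640m per child), so treat them as a sketch to benchmark, not gospel:

  <configuration>
    <property>
      <!-- 3 concurrent map slots per worker; drop to 1 on the jobtracker box -->
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>3</value>
    </property>
    <property>
      <!-- 2 concurrent reduce slots per worker; bump to 3 on the jobtracker box -->
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>
    </property>
    <property>
      <!-- heap per child task; 768m or 640m were the numbers suggested for 4 GB nodes -->
      <name>mapred.child.java.opts</name>
      <value>-Xmx768m</value>
    </property>
  </configuration>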

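PPS: The fetcher thread counts live on the Nutch side, in conf/nutch-site.xml. The 128-512 total threads and 4-8 per host mentioned above are rough ranges to test against your own bandwidth and CPUs; the values below are just one point inside those ranges:

  <configuration>
    <property>
      <!-- total fetcher threads; 128-512 was the suggested range, 1024 with a big pipe -->
      <name>fetcher.threads.fetch</name>
      <value>256</value>
    </property>
    <property>
      <!-- simultaneous threads against a single host; 4-8 if it's a big site -->
      <name>fetcher.threads.per.host</name>
      <value>4</value>
    </property>
  </configuration>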
