Scott, thanks again for your insights. My 4 cheap linux boxes are now crawling selected sites at about 1M pages per day. The fetch itself is reasonably fast, but once the crawl db has >10M urls, a lot of time is spent generating the segment (2-3 hours) and updating the crawldb (4-5 hours after each segment). I expect this non-fetching time to keep growing as the crawl db grows to 100M urls. Is there any good way to reduce the non-fetching time (i.e. generate segment and update crawldb)?
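For context, the cycle I'm running looks roughly like this (just a sketch using Nutch 1.x command names; the crawl/ paths, the -topN cap and the thread count are placeholders, and exact flags may differ by version):

  CRAWLDB=crawl/crawldb
  SEGMENTS=crawl/segments

  # generate a fetch list; -topN caps the segment size
  bin/nutch generate $CRAWLDB $SEGMENTS -topN 250000
  SEGMENT=`ls -d $SEGMENTS/* | tail -1`   # newest (timestamped) segment

  # fetching itself is fast...
  bin/nutch fetch $SEGMENT -threads 50

  # ...but this step takes 4-5 hours once the crawldb is >10M urls
  bin/nutch updatedb $CRAWLDB $SEGMENT

As I understand it, both the generate and updatedb jobs read through the entire crawldb, which is why their runtime tracks db size rather than segment size.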
I found the following settings work OK for a small cheap cluster (see the sketch below for where they go):

dfs.replication (default=3)
mapred.map.tasks=40
mapred.reduce.tasks=8
mapred.tasktracker.map.tasks.maximum=2
mapred.tasktracker.reduce.tasks.maximum=2
mapred.child.java.opts=-Xmx1024M
mapred.job.reuse.jvm.num.tasks=-1
fetch threads: 20-100
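In case it helps anyone else, the mapred.* ones go into conf/mapred-site.xml on every node, dfs.replication goes in conf/hdfs-site.xml, and the fetch threads are fetcher.threads.fetch in conf/nutch-site.xml (this assumes a stock Hadoop 0.20-style conf layout). A minimal sketch of mapred-site.xml with the values above:

  <configuration>
    <property>
      <name>mapred.map.tasks</name>
      <value>40</value>
    </property>
    <property>
      <name>mapred.reduce.tasks</name>
      <value>8</value>
    </property>
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>2</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>
    </property>
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx1024M</value>
    </property>
    <property>
      <name>mapred.job.reuse.jvm.num.tasks</name>
      <value>-1</value>
    </property>
  </configuration>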
thanks,
AJ

On Sun, Aug 8, 2010 at 11:00 PM, Scott Gonyea <[email protected]> wrote:

> My suggestion is to approach Nutch by installing and working with Hadoop,
> first. Ignore the slaves/master file. Just setup the datanode/tasktracker
> and give it the DNS name of your jobtracker. That'll handle the "hi, I can
> process work" for you. At least, that's how I handled it. That allowed me
> to setup a bunch of Hadoop nodes and scale them up, no problem.
>
> Job Tracker:
> Determines what work to perform and doles it out
> Name Node / Secondary Name Node:
> HDFS management. SNN is optional in an HDFS setup.
> NN and SNN are only necessary, if you are using HDFS. I use s3(n), so I
> didn't have to care about them or their overhead.
> Task Tracker:
> These are what process work, assigned by the Job Tracker.
> Data Node:
> Name Node slave and only required when using HDFS.
>
> So, if you run a task tracker on the same system as the JT... It'll process
> work. There's no reason not to, unless you are sticking the JT on a weak
> computer.
>
> That said, don't overburden the Job Tracker or you'll just make life that
> much more difficult (imo). From my experiences, and the Gods of
> Nutchdoopcene may disagree, the Job Tracker and Map tasks are much more CPU
> intensive than they are memory intensive. Reduce operations benefit more
> from RAM than they do from CPU.
>
> My suggestion, I suppose, is to make your master node a JT+TT+NN+DN, and
> you should even be able to get away with doing a 1m/3r run on it. Make
> everything else a DN+TT. Your Map/Reduce count should, in my limited
> experience, scale in size with your (sites*depth) and with consideration to
> your memory limits. Less memory = smaller units of work. Again, I'd love
> for the resident Nutch scholar to call me dumb and critique my assumptions /
> grooming methodologies.
>
> To answer your question about the master/slaves... If you do work on your
> master node, it is also a slave--in addition to being a master.
>
> Aaand, bed time. Good luck.
>
> sg
>
>
> On Aug 8, 2010, at 7:08 PM, AJ Chen wrote:
>
> > Scott, thank you for the detailed suggestions. It's very helpful. I have
> > only 4 low-end nodes, experimenting with different settings now. A couple of
> > more questions for getting most out of such a small cluster:
> > - Can the jobtracker be on the same namenode (i.e. master node)?
> > - What happen if I add the master node name in the conf/slaves file? Does it
> > make the master node also a worker node? If yes, does it help performance or
> > not?
> >
> > best,
> > -aj
> >
> > On Sat, Aug 7, 2010 at 6:10 PM, Scott Gonyea <[email protected]> wrote:
> >
> >> How deep is your crawl going? 5 (default)? The big issue is RAM, which
> >> you don't have much of. How good are the CPUs? In your case, I'd go with:
> >>
> >> -Xmx768m (or -Xmx640m) and do 3 maps / 2 reduces, per worker node. You can
> >> maybe go 1 map / 3 reduces on the job tracker. What kind of data are you
> >> parsing? Just the text, or everything? That's also a factor to consider.
> >>
> >> Depending on your CPUs and bandwidth, I'd go for 128-512 threads. 1024 if
> >> you have a university pipe to plug up and decent CPUs. Also, simultaneous
> >> threads per host--4-8 if it's a big site.
> >>
> >> The number of maps/reduces, total, is a question of how many links you
> >> intend to crawl. I'm not sure if it's an optimal number, yet, but it's
> >> seemingly well, for my purposes. I recently crawled 13k+ sites, at a depth
> >> of 10, and used 181 maps and 181 reduces. I used 10 m2.xlarge EC2 spot
> >> instances. The workers had 7/7 maps/reduces. The jobtracker had 2/2.
> >> Harvesting about 1.4 million pages, in total, took about 9-10 hours; I
> >> limited links to 2048 per host, which helped my purposes. I used 1024
> >> threads. I'm sure there's plenty to tweak, that I'm not aware of, though I
> >> don't know how much more time I'll have left to spend on it.
> >>
> >> Scott
> >>
> >> On Aug 7, 2010, at 2:47 PM, AJ Chen wrote:
> >>
> >>> I'm setting up a small cluster for crawling 3000 domains: 1 master, 3
> >>> slaves. Using the default configs, each step (generate, fetch, updatedb)
> >>> runs much slower than expected. Tuning configurations for better performance
> >>> of course depends on many factors. but, there should be a good starting
> >>> point for a small cluster of commodity linux servers (4GB RAM). The
> >>> following parameters are mentioned in hadoop or nutch documents. For a
> >>> 4-node cluster, please suggest some good values you have found in your
> >>> experience.
> >>>
> >>> conf/core-site.xml:
> >>> fs.inmemory.size.mb= (200 larger memory for merging)
> >>> io.file.buffer.size=4096(default)
> >>> io.sort.factor=10(default)
> >>> io.sort.mb=100(default)
> >>>
> >>> conf/hdfs-site.xml:
> >>> dfs.block.size=67108864 (default), 134217728 for large file-system
> >>> dfs.namenode.handler.count=10(default)
> >>> dfs.https.enable (default=false)
> >>> dfs.replication (default=3)
> >>>
> >>> conf/mapred-site.xml:
> >>> mapred.job.tracker.handler.count=10
> >>> mapred.map.tasks=2(default) (40 for 4 nodes?)
> >>> mapred.reduce.tasks=1(default) (for 4 nodes: 0.95*4*4)
> >>> mapred.reduce.parallel.copies=5(default)
> >>> mapred.submit.replication=10
> >>> mapred.tasktracker.map.tasks.maximum=4 (2 default)
> >>> mapred.tasktracker.reduce.tasks.maximum=4 (default=2)
> >>> mapred.child.java.opts=-Xmx200m(default), -Xmx512m, -Xmx1024M
> >>> mapred.job.reuse.jvm.num.tasks=1 (-1 no limit)
> >>>
> >>> thanks,
> >>> aj
> >>> --
> >>> AJ Chen, PhD
> >>> Chair, Semantic Web SIG, sdforum.org
> >>> http://web2express.org
> >>> twitter @web2express
> >>> Palo Alto, CA, USA
> >>
> >
> > --
> > AJ Chen, PhD
> > Chair, Semantic Web SIG, sdforum.org
> > http://web2express.org
> > twitter @web2express
> > Palo Alto, CA, USA
>

--
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA

