Scott, thank you for the detailed suggestions. They're very helpful. I have
only 4 low-end nodes and am experimenting with different settings now. A
couple more questions about getting the most out of such a small cluster:
- Can the jobtracker run on the same node as the namenode (i.e., the master
node)?
- What happens if I add the master node's name to the conf/slaves file, as in
the sketch below? Does that make the master node a worker node as well? If so,
does it help or hurt performance?
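
For example, with hypothetical hostnames, conf/slaves would become:

  master1
  slave1
  slave2
  slave3

where master1 is also the namenode/jobtracker host.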

best,
-aj

On Sat, Aug 7, 2010 at 6:10 PM, Scott Gonyea <[email protected]> wrote:

> How deep is your crawl going?  5 (default)?  The big issue is RAM, which
> you don't have much of.  How good are the CPUs?  In your case, I'd go with:
>
> -Xmx768m (or -Xmx640m) and 3 maps / 2 reduces per worker node.  You can
> maybe go 1 map / 3 reduces on the jobtracker node.  What kind of data are
> you parsing?  Just the text, or everything?  That's also a factor to
> consider.
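>
> In mapred-site.xml on each worker, that would look something like this
> (a sketch; adjust for your hardware):
>
>   <property>
>     <name>mapred.tasktracker.map.tasks.maximum</name>
>     <value>3</value>
>   </property>
>   <property>
>     <name>mapred.tasktracker.reduce.tasks.maximum</name>
>     <value>2</value>
>   </property>
>   <property>
>     <name>mapred.child.java.opts</name>
>     <value>-Xmx768m</value>
>   </property>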
>
> Depending on your CPUs and bandwidth, I'd go for 128-512 fetcher threads,
> or 1024 if you have a university pipe to plug up and decent CPUs.  Also cap
> the simultaneous threads per host: 4-8 if it's a big site.
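>
> In nutch-site.xml those map to something like this (the values here are
> just a starting point, not gospel):
>
>   <property>
>     <name>fetcher.threads.fetch</name>
>     <value>256</value>
>   </property>
>   <property>
>     <name>fetcher.threads.per.host</name>
>     <value>8</value>
>   </property>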
>
> The total number of maps/reduces is a question of how many links you intend
> to crawl.  I'm not sure it's an optimal number yet, but it seems to work
> well for my purposes.  I recently crawled 13k+ sites at a depth of 10 and
> used 181 maps and 181 reduces, on 10 m2.xlarge EC2 spot instances.  The
> workers had 7/7 maps/reduces; the jobtracker had 2/2.  Harvesting about 1.4
> million pages in total took about 9-10 hours.  I limited links to 2048 per
> host, which suited my purposes, and used 1024 threads.  I'm sure there's
> plenty to tweak that I'm not aware of, though I don't know how much more
> time I'll have to spend on it.
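>
> (If you want to replicate the per-host cap, that's Nutch's
> generate.max.per.host property:
>
>   <property>
>     <name>generate.max.per.host</name>
>     <value>2048</value>
>   </property>
> )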
>
> Scott
>
> On Aug 7, 2010, at 2:47 PM, AJ Chen wrote:
>
> > I'm setting up a small cluster for crawling 3000 domains: 1 master, 3
> > slaves. Using the default configs, each step (generate, fetch, updatedb)
> > runs much slower than expected. Tuning configurations for better
> > performance of course depends on many factors, but there should be a good
> > starting point for a small cluster of commodity Linux servers (4GB RAM).
> > The following parameters are mentioned in the Hadoop and Nutch
> > documentation. For a 4-node cluster, please suggest some good values you
> > have found in your experience.
> >
> > *conf/core-site.xml*:
> > fs.inmemory.size.mb=  (200 for larger memory for merging)
> > io.file.buffer.size=4096 (default)
> > io.sort.factor=10 (default)
> > io.sort.mb=100 (default)
> >
> > *conf/hdfs-site.xml*:
> > dfs.block.size=67108864 (default); 134217728 for large file systems
> > dfs.namenode.handler.count=10 (default)
> > dfs.https.enable (default=false)
> > dfs.replication (default=3)
> >
> > *conf/mapred-site.xml*:
> > mapred.job.tracker.handler.count=10 (default)
> > mapred.map.tasks=2 (default)  (40 for 4 nodes?)
> > mapred.reduce.tasks=1 (default)  (for 4 nodes: 0.95*4*4 ≈ 15)
> > mapred.reduce.parallel.copies=5 (default)
> > mapred.submit.replication=10
> > mapred.tasktracker.map.tasks.maximum=4 (default=2)
> > mapred.tasktracker.reduce.tasks.maximum=4 (default=2)
> > mapred.child.java.opts=-Xmx200m (default); -Xmx512m, -Xmx1024m
> > mapred.job.reuse.jvm.num.tasks=1 (default); -1 = no limit
> >
> > thanks,
> > aj
> > --
> > AJ Chen, PhD
> > Chair, Semantic Web SIG, sdforum.org
> > http://web2express.org
> > twitter @web2express
> > Palo Alto, CA, USA
>
>


-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
