Scott, thanks again for your insights. My 4 cheap Linux boxes are now
crawling selected sites at about 1M pages per day. The fetch itself is
reasonably fast. But when the crawl db has >10M URLs, a lot of time is spent
generating each segment (2-3 hours) and updating the crawldb (4-5 hours after
each segment). I expect this non-fetching time will keep growing as the crawl
db approaches 100M URLs. Is there any good way to reduce the non-fetching
time (i.e. generate segment and update crawldb)?

I found the following settings work OK for a small, cheap cluster (a rough
sketch of the corresponding config follows the list):
dfs.replication (default=3)
mapred.map.tasks=40
mapred.reduce.tasks=8
mapred.tasktracker.map.tasks.maximum=2
mapred.tasktracker.reduce.tasks.maximum=2
mapred.child.java.opts=-Xmx1024M
mapred.job.reuse.jvm.num.tasks=-1
fetch threads: 20-100
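
For reference, here is roughly how these would be written out in
mapred-site.xml (Hadoop 0.20-style property names; dfs.replication belongs in
hdfs-site.xml, and the fetch thread count is a Nutch setting rather than a
Hadoop one):

  <!-- mapred-site.xml: the values from the list above -->
  <property>
    <name>mapred.map.tasks</name>
    <value>40</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>8</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024M</value>
  </property>
  <property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <value>-1</value>
  </property>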

thanks,
AJ

On Sun, Aug 8, 2010 at 11:00 PM, Scott Gonyea <[email protected]> wrote:

> My suggestion is to approach Nutch by installing and working with Hadoop
> first.  Ignore the slaves/masters files.  Just set up the datanode/tasktracker
> and give it the DNS name of your jobtracker.  That'll handle the "hi, I can
> process work" part for you.  At least, that's how I handled it.  That allowed
> me to set up a bunch of Hadoop nodes and scale them up, no problem.
>
> Job Tracker:
> Determines what work to perform and doles it out
> Name Node / Secondary Name Node:
> HDFS management.  SNN is optional in an HDFS setup.
> NN and SNN are only necessary if you are using HDFS.  I use s3(n), so I
> didn't have to care about them or their overhead.
> Task Tracker:
> These are what process work, assigned by the Job Tracker.
> Data Node:
> Name Node slave and only required when using HDFS.
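>
> As a rough sketch (classic Hadoop-style XML config; the hostname and ports
> are placeholders), "give it the DNS name of your jobtracker" comes down to
> two properties on every node:
>
>  <!-- core-site.xml: where the NameNode (HDFS) lives -->
>  <property>
>    <name>fs.default.name</name>
>    <value>hdfs://master.example.com:9000</value>
>  </property>
>
>  <!-- mapred-site.xml: where the JobTracker lives -->
>  <property>
>    <name>mapred.job.tracker</name>
>    <value>master.example.com:9001</value>
>  </property>
>
> With those set, each datanode/tasktracker you start reports in on its own.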
>
> So, if you run a task tracker on the same system as the JT... it'll process
> work.  There's no reason not to, unless you are sticking the JT on a weak
> computer.
>
> That said, don't overburden the Job Tracker or you'll just make life that
> much more difficult (imo).  From my experience, and the Gods of
> Nutchdoopcene may disagree, the Job Tracker and Map tasks are much more CPU
> intensive than they are memory intensive.  Reduce operations benefit more
> from RAM than they do from CPU.
>
> My suggestion, I suppose, is to make your master node a JT+TT+NN+DN, and
> you should even be able to get away with doing a 1m/3r run on it.  Make
> everything else a DN+TT.  Your Map/Reduce count should, in my limited
> experience, scale with your (sites*depth) and with consideration of
> your memory limits.  Less memory = smaller units of work.  Again, I'd love
> for the resident Nutch scholar to call me dumb and critique my assumptions /
> grooming methodologies.
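>
> To make that layout concrete, a sketch of which daemons run where (using the
> stock hadoop-daemon.sh script; how exactly you start things is up to you):
>
>  # on the master (JT + TT + NN + DN)
>  bin/hadoop-daemon.sh start namenode
>  bin/hadoop-daemon.sh start jobtracker
>  bin/hadoop-daemon.sh start datanode
>  bin/hadoop-daemon.sh start tasktracker
>
>  # on every other box (DN + TT only)
>  bin/hadoop-daemon.sh start datanode
>  bin/hadoop-daemon.sh start tasktracker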
>
> To answer your question about the master/slaves...  If you do work on your
> master node, it is also a slave--in addition to being a master.
>
> Aaand, bed time.  Good luck.
>
> sg
>
>
> On Aug 8, 2010, at 7:08 PM, AJ Chen wrote:
>
> > Scott, thank you for the detailed suggestions. It's very helpful. I have
> > only 4 low-end nodes, and I'm experimenting with different settings now. A
> > couple more questions for getting the most out of such a small cluster:
> > - Can the jobtracker run on the same node as the namenode (i.e. the master
> > node)?
> > - What happens if I add the master node's name to the conf/slaves file?
> > Does that make the master node also a worker node? If yes, does it help
> > performance or not?
> >
> > best,
> > -aj
> >
> > On Sat, Aug 7, 2010 at 6:10 PM, Scott Gonyea <[email protected]> wrote:
> >
> >> How deep is your crawl going?  5 (default)?  The big issue is RAM, which
> >> you don't have much of.  How good are the CPUs?  In your case, I'd go with:
> >>
> >> -Xmx768m (or -Xmx640m) and do 3 maps / 2 reduces, per worker node.  You can
> >> maybe go 1 map / 3 reduces on the job tracker.  What kind of data are you
> >> parsing?  Just the text, or everything?  That's also a factor to consider.
> >>
> >> Depending on your CPUs and bandwidth, I'd go for 128-512 threads.  1024 if
> >> you have a university pipe to plug up and decent CPUs.  Also, simultaneous
> >> threads per host--4-8 if it's a big site.
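> >>
> >> In nutch-site.xml terms that's roughly the following (the values are just
> >> examples picked from the ranges above; the total thread count can also be
> >> passed as -threads to the fetch job):
> >>
> >>  <property>
> >>    <name>fetcher.threads.fetch</name>
> >>    <value>512</value>  <!-- total fetcher threads -->
> >>  </property>
> >>  <property>
> >>    <name>fetcher.threads.per.host</name>
> >>    <value>4</value>  <!-- simultaneous threads against a single host -->
> >>  </property>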
> >>
> >> The total number of maps/reduces is a question of how many links you
> >> intend to crawl.  I'm not sure it's an optimal number yet, but it seems to
> >> work well for my purposes.  I recently crawled 13k+ sites, at a depth
> >> of 10, and used 181 maps and 181 reduces.  I used 10 m2.xlarge EC2 spot
> >> instances.  The workers had 7/7 maps/reduces.  The jobtracker had 2/2.
> >> Harvesting about 1.4 million pages, in total, took about 9-10 hours; I
> >> limited links to 2048 per host, which helped my purposes.  I used 1024
> >> threads.  I'm sure there's plenty to tweak that I'm not aware of, though I
> >> don't know how much more time I'll have left to spend on it.
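> >>
> >> (The 2048-links-per-host cap is most likely a generator setting; in Nutch
> >> 1.x it would look something like this in nutch-site.xml, though the exact
> >> property name may differ by version:)
> >>
> >>  <property>
> >>    <name>generate.max.per.host</name>
> >>    <value>2048</value>  <!-- cap URLs per host in each fetch list -->
> >>  </property>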
> >>
> >> Scott
> >>
> >> On Aug 7, 2010, at 2:47 PM, AJ Chen wrote:
> >>
> >>> I'm setting up a small cluster for crawling 3000 domains: 1 master, 3
> >>> slaves. Using the default configs, each step (generate, fetch, updatedb)
> >>> runs much slower than expected. Tuning configurations for better
> >>> performance of course depends on many factors, but there should be a good
> >>> starting point for a small cluster of commodity Linux servers (4GB RAM).
> >>> The following parameters are mentioned in the Hadoop and Nutch documents.
> >>> For a 4-node cluster, please suggest some good values you have found in
> >>> your experience.
> >>>
> >>> *conf/core-site.xml*:
> >>> fs.inmemory.size.mb=  (200 for larger merge memory)
> >>> io.file.buffer.size=4096(default)
> >>> io.sort.factor=10(default)
> >>> io.sort.mb=100(default)
> >>>
> >>> *conf/hdfs-site.xml*:
> >>> dfs.block.size=67108864 (default), 134217728 for large file systems
> >>> dfs.namenode.handler.count=10(default)
> >>> dfs.https.enable (default=false)
> >>> dfs.replication (default=3)
> >>>
> >>> *conf/mapred-site.xml*:
> >>> mapred.job.tracker.handler.count=10
> >>> mapred.map.tasks=2(default)  (40 for 4 nodes?)
> >>> mapred.reduce.tasks=1(default)  (for 4 nodes: 0.95*4*4)
> >>> mapred.reduce.parallel.copies=5(default)
> >>> mapred.submit.replication=10
> >>> mapred.tasktracker.map.tasks.maximum=4  (2 default)
> >>> mapred.tasktracker.reduce.tasks.maximum=4 (default=2)
> >>> mapred.child.java.opts=-Xmx200m(default),  -Xmx512m, -Xmx1024M
> >>> mapred.job.reuse.jvm.num.tasks=1  (-1 no limit)
> >>>
> >>> thanks,
> >>> aj
> >>> --
> >>> AJ Chen, PhD
> >>> Chair, Semantic Web SIG, sdforum.org
> >>> http://web2express.org
> >>> twitter @web2express
> >>> Palo Alto, CA, USA
> >>
> >>
> >
> >
> > --
> > AJ Chen, PhD
> > Chair, Semantic Web SIG, sdforum.org
> > http://web2express.org
> > twitter @web2express
> > Palo Alto, CA, USA
>
>


-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
