Hi,

Some info on optimisation that I can share:
1. Make sure that Hadoop can use the native libraries. Nutch comes without them, but it is not hard to get them from a Hadoop pack or to compile them, and they make a BIG difference. In my tests, a job that ran for two days without them finished in only 5 hours once they were available. (A sketch of how to check for and install them follows the quoted message below.)

2. Apparently, Nutch is faster if the load (the number of URLs to fetch) is spread more or less evenly over iterations. This is not surprising, given that sorting is involved. As you may know, Arch (an extension of Nutch I am working on) allows dividing web sites into areas and processing the areas sequentially, one by one. This is convenient when you want to get intranet crawling right and do not want to reindex everything after fixing a problem local to some area, or when you want to configure different refresh intervals for different parts of your site. The latest version of Arch (in testing, not released yet) also allows processing all areas in parallel, injecting all known links as seeds. It turned out that the sequential mode was faster, probably because it spread the load much more evenly: it finished the job on our sites in about 12 hours, whereas the parallel mode took 2 days. Once I added the native libraries, the time dropped to 5 hours, so it is possible that, with the native libraries available, the dependency on load size is less prominent. If it is still a problem, use the topN parameter to limit the number of URLs fetched per iteration and thus spread the load over iterations (see the example after the quoted message below).

Regards,
Arkadi

> -----Original Message-----
> From: Bai Shen [mailto:[email protected]]
> Sent: Saturday, 17 December 2011 3:05 AM
> To: [email protected]
> Subject: Re: Nutch Hadoop Optimization
>
> The parse takes under two minutes.
>
> One of the problems I'm running into is how to make nutch run more jobs,
> and how to run that many jobs on the machine without thrashing the hard drive.
>
> On Fri, Dec 16, 2011 at 5:33 AM, Lewis John Mcgibbney
> <[email protected]> wrote:
>
> > It looks like its the parsing of these segments that is taking time... no?
> >
> > On Thu, Dec 15, 2011 at 9:57 PM, Bai Shen <[email protected]> wrote:
> > > On Thu, Dec 15, 2011 at 12:47 PM, Lewis John Mcgibbney
> > > <[email protected]> wrote:
> > >
> > >> This is overwhelmingly weighted towards Hadoop configuration.
> > >>
> > >> There are some guidance notes on the Nutch wiki for performance issues
> > >> so you may wish to give them a try first.
> > >> --
> > >> Lewis
> > >
> > > I'm assuming you're referring to this page?
> > > http://wiki.apache.org/nutch/OptimizingCrawls
> > >
> > > On Thu, Dec 15, 2011 at 2:01 PM, Markus Jelsma
> > > <[email protected]> wrote:
> > >
> > >> Well, if performance is low its likely not a Hadoop issue. Hadoop tuning is
> > >> only required if you start pushing it to limits.
> > >>
> > >> I would indeed check the Nutch wiki. There are important settings such as
> > >> threads, queues etc that are very important.
> > >
> > > I did end up tweaking some of the hadoop settings, as it looked like it was
> > > thrashing the disk due to not spreading out the map tasks.
> > >
> > > On Thu, Dec 15, 2011 at 3:00 PM, Julien Nioche
> > > <[email protected]> wrote:
> > >
> > >> Having beefy machines is not going to be very useful for the fetching step
> > >> which is IO bound and usually takes most of the time.
> > >> How big is your crawldb?
> > >> How long do the generate / parse and update steps take?
> > >> Having more than one machine won't make a massive difference if your
> > >> crawldb or segments are small.
> > >>
> > >> Julien
> > >
> > > The machines were all I had handy to make the cluster with.
> > >
> > > I'm looking at the time for a recent job and here's what I'm seeing. This
> > > is with 12k urls queued by domain with a max of 50 urls per domain.
> > > I know why the fetcher takes so long. Most of the fetcher map jobs finish
> > > in 3-4 minutes, but 1-2 always end up getting stuck on a single site and
> > > taking an additional ten minutes to work through the remaining urls. Not
> > > sure how to fix that.
> > > The crawldb had around 1.2 million urls in it when I looked this afternoon.
> > >
> > > nutch-1.4.job                                       SUCCEEDED  Thu Dec 15 16:14:30 EST 2011 - Thu Dec 15 16:14:44 EST 2011
> > > generate: select from crawl/crawldb                 SUCCEEDED  Thu Dec 15 16:14:45 EST 2011 - Thu Dec 15 16:16:17 EST 2011
> > > generate: partition crawl/segments/20111215161618   SUCCEEDED  Thu Dec 15 16:16:19 EST 2011 - Thu Dec 15 16:16:42 EST 2011
> > > fetch crawl/segments/20111215161618                 SUCCEEDED  Thu Dec 15 16:16:44 EST 2011 - Thu Dec 15 16:33:29 EST 2011
> > > parse crawl/segments/20111215161618                 SUCCEEDED  Thu Dec 15 16:33:30 EST 2011 - Thu Dec 15 16:35:11 EST 2011
> > > crawldb crawl/crawldb                                SUCCEEDED  Thu Dec 15 16:35:12 EST 2011 - Thu Dec 15 16:36:37 EST 2011
> > > linkdb crawl/linkdb                                  SUCCEEDED  Thu Dec 15 16:36:38 EST 2011 - Thu Dec 15 16:36:58 EST 2011
> > > linkdb merge crawl/linkdb                            SUCCEEDED  Thu Dec 15 16:36:59 EST 2011 - Thu Dec 15 16:38:27 EST 2011
> > > index-solr http://solr:8080/solr                     SUCCEEDED  Thu Dec 15 16:38:28 EST 2011 - Thu Dec 15 16:38:56 EST 2011
> >
> > --
> > Lewis
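On tip 1: the quickest sanity check is the Hadoop task/job logs. When the native libraries are missing, NativeCodeLoader prints a warning; when they are found, it reports that it loaded them. The sketch below shows that check plus one common way of putting the libraries in place; the platform directory name and all paths are only examples and depend on your Hadoop layout, so adjust them to your installation.

    # Look for the NativeCodeLoader messages in the Hadoop task/job logs:
    #   WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform
    #   INFO util.NativeCodeLoader: Loaded the native-hadoop library
    grep -r "native-hadoop" $HADOOP_HOME/logs | tail

    # Copy the pre-built libraries from a Hadoop distribution onto every node
    # (platform directory and paths below are examples only):
    cp /path/to/hadoop-dist/lib/native/Linux-amd64-64/libhadoop.* \
       $HADOOP_HOME/lib/native/Linux-amd64-64/
    # Alternatively, some setups pass -Djava.library.path=<dir with the libs>
    # to the task JVMs via mapred.child.java.opts.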
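On the topN suggestion in tip 2: the cap is given to the generate step, so every iteration works on a fetch list of roughly the same size. A minimal example, using the crawldb and segments paths from the quoted log (the 50000 figure is only an example, not a recommendation):

    # Generate at most 50000 URLs for this segment, then run the usual
    # fetch / parse / updatedb steps and repeat per iteration:
    bin/nutch generate crawl/crawldb crawl/segments -topN 50000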
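For what it is worth, the setup described in the quoted thread (URLs queued by domain, with a maximum of 50 per domain) is normally expressed through the generate properties in conf/nutch-site.xml. The snippet below is only an illustration of that kind of configuration, with values taken from the thread:

    <!-- nutch-site.xml: illustrative values matching the setup described above -->
    <property>
      <name>generate.count.mode</name>
      <value>domain</value>
    </property>
    <property>
      <name>generate.max.count</name>
      <value>50</value>
    </property>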

