The parse takes under two minutes (16:33:30 to 16:35:11 in the timings
quoted below); it's the fetch, at almost 17 minutes, that dominates the
cycle.

One of the problems I'm running into is how to make Nutch run more jobs
concurrently, and how to run that many jobs on the machine without
thrashing the hard drive.
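
The kind of tuning I mean is capping the task slots per node in
mapred-site.xml. A rough sketch, assuming the Hadoop 0.20/1.x property
names that Nutch 1.4 runs against (the values are illustrative, not
recommendations):

  <!-- mapred-site.xml: cap concurrent tasks per TaskTracker so that
       several map tasks don't compete for the same disks at once -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>

Fewer slots per node means less disk contention, but also fewer tasks in
flight, which is exactly the tension I'm trying to balance.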

On Fri, Dec 16, 2011 at 5:33 AM, Lewis John Mcgibbney <
[email protected]> wrote:

> It looks like it's the parsing of these segments that is taking time... no?
>
> On Thu, Dec 15, 2011 at 9:57 PM, Bai Shen <[email protected]> wrote:
> > On Thu, Dec 15, 2011 at 12:47 PM, Lewis John Mcgibbney <
> > [email protected]> wrote:
> >
> >> This is overwhelmingly weighted towards Hadoop configuration.
> >>
> >> There are some guidance notes on the Nutch wiki for performance
> >> issues, so you may wish to give them a try first.
> >> --
> >>  Lewis
> >>
> >
> > I'm assuming you're referring to this page?
> > http://wiki.apache.org/nutch/OptimizingCrawls
> >
> >
> > On Thu, Dec 15, 2011 at 2:01 PM, Markus Jelsma
> > <[email protected]> wrote:
> >
> >> Well, if performance is low it's likely not a Hadoop issue. Hadoop
> >> tuning is only required if you start pushing it to its limits.
> >>
> >> I would indeed check the Nutch wiki. There are settings, such as the
> >> fetcher threads and queues, that are very important.
> >>
> >>
> > I did end up tweaking some of the Hadoop settings, as it looked like it
> > was thrashing the disk due to not spreading out the map tasks.
> >
> >
> > On Thu, Dec 15, 2011 at 3:00 PM, Julien Nioche <
> > [email protected]> wrote:
> >
> >>
> >> Having beefy machines is not going to be very useful for the fetching
> >> step, which is I/O bound and usually takes most of the time.
> >> How big is your crawldb? How long do the generate / parse and update
> >> steps take? Having more than one machine won't make a massive
> >> difference if your crawldb or segments are small.
> >>
> >> Julien
> >>
> >>
> > The machines were all I had handy to make the cluster with.
> >
> >
> > I'm looking at the times for a recent job, and here's what I'm seeing.
> > This is with 12k URLs queued by domain, with a max of 50 URLs per
> > domain.
> > I know why the fetcher takes so long. Most of the fetcher map tasks
> > finish in 3-4 minutes, but 1-2 always end up getting stuck on a single
> > site and take an additional ten minutes to work through the remaining
> > URLs. Not sure how to fix that (a sketch of the relevant settings
> > follows at the end of this mail).
> > The crawldb had around 1.2 million URLs in it when I looked this
> > afternoon.
> >
> > All steps ran on Thu Dec 15 EST 2011:
> >
> > nutch-1.4.job                                      SUCCEEDED  16:14:30 - 16:14:44
> > generate: select from crawl/crawldb                SUCCEEDED  16:14:45 - 16:16:17
> > generate: partition crawl/segments/20111215161618  SUCCEEDED  16:16:19 - 16:16:42
> > fetch crawl/segments/20111215161618                SUCCEEDED  16:16:44 - 16:33:29
> > parse crawl/segments/20111215161618                SUCCEEDED  16:33:30 - 16:35:11
> > crawldb crawl/crawldb                              SUCCEEDED  16:35:12 - 16:36:37
> > linkdb crawl/linkdb                                SUCCEEDED  16:36:38 - 16:36:58
> > linkdb merge crawl/linkdb                          SUCCEEDED  16:36:59 - 16:38:27
> > index-solr http://solr:8080/solr                   SUCCEEDED  16:38:28 - 16:38:56
>
>
>
> --
> Lewis
>
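
For anyone following along: these are the nutch-site.xml properties
relevant to the per-domain queueing and the stuck-fetcher long tail
discussed above. A rough sketch, assuming Nutch 1.4 property names
(values illustrative, not recommendations; see conf/nutch-default.xml
for the defaults):

  <!-- cap how many URLs generate selects per domain; matches the
       "max of 50 URLs per domain" setup above -->
  <property>
    <name>generate.max.count</name>
    <value>50</value>
  </property>
  <property>
    <name>generate.count.mode</name>
    <value>domain</value>
  </property>

  <!-- total fetch threads vs. threads per queue; per-queue stays at 1
       for politeness, which is why one slow site can stall a map task -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>50</value>
  </property>
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>1</value>
  </property>

  <!-- hard time limit on the fetch step: once it expires, the remaining
       queued URLs are skipped and picked up again in a later cycle,
       which bounds the long tail -->
  <property>
    <name>fetcher.timelimit.mins</name>
    <value>10</value>
  </property>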
