It looks like it's the parsing of these segments that is taking time... no?

On Thu, Dec 15, 2011 at 9:57 PM, Bai Shen <[email protected]> wrote:

> On Thu, Dec 15, 2011 at 12:47 PM, Lewis John Mcgibbney <
> [email protected]> wrote:
>
>> This is overwhelmingly weighted towards Hadoop configuration.
>>
>> There are some guidance notes on the Nutch wiki for performance issues
>> so you may wish to give them a try first.
>> --
>> Lewis
>>
>
> I'm assuming you're referring to this page?
> http://wiki.apache.org/nutch/OptimizingCrawls
>
>
> On Thu, Dec 15, 2011 at 2:01 PM, Markus Jelsma
> <[email protected]> wrote:
>
>> Well, if performance is low it's likely not a Hadoop issue. Hadoop tuning is
>> only required if you start pushing it to its limits.
>>
>> I would indeed check the Nutch wiki. There are settings such as threads,
>> queues etc. that are very important.
>>
>>
> I did end up tweaking some of the Hadoop settings, as it looked like it was
> thrashing the disk due to not spreading out the map tasks.
>
>
> On Thu, Dec 15, 2011 at 3:00 PM, Julien Nioche <
> [email protected]> wrote:
>
>>
>> Having beefy machines is not going to be very useful for the fetching step,
>> which is IO bound and usually takes most of the time.
>> How big is your crawldb? How long do the generate / parse and update steps
>> take? Having more than one machine won't make a massive difference if your
>> crawldb or segments are small.
>>
>> Julien
>>
>>
> The machines were all I had handy to make the cluster with.
>
>
> I'm looking at the time for a recent job and here's what I'm seeing. This
> is with 12k URLs queued by domain with a max of 50 URLs per domain.
> I know why the fetcher takes so long. Most of the fetcher map jobs finish
> in 3-4 minutes, but 1-2 always end up getting stuck on a single site and
> taking an additional ten minutes to work through the remaining URLs. Not
> sure how to fix that.
> The crawldb had around 1.2 million URLs in it when I looked this afternoon.
>
> nutch-1.4.job                                      SUCCEEDED  Thu Dec 15 16:14:30 EST 2011  Thu Dec 15 16:14:44 EST 2011
> generate: select from crawl/crawldb                SUCCEEDED  Thu Dec 15 16:14:45 EST 2011  Thu Dec 15 16:16:17 EST 2011
> generate: partition crawl/segments/20111215161618  SUCCEEDED  Thu Dec 15 16:16:19 EST 2011  Thu Dec 15 16:16:42 EST 2011
> fetch crawl/segments/20111215161618                SUCCEEDED  Thu Dec 15 16:16:44 EST 2011  Thu Dec 15 16:33:29 EST 2011
> parse crawl/segments/20111215161618                SUCCEEDED  Thu Dec 15 16:33:30 EST 2011  Thu Dec 15 16:35:11 EST 2011
> crawldb crawl/crawldb                              SUCCEEDED  Thu Dec 15 16:35:12 EST 2011  Thu Dec 15 16:36:37 EST 2011
> linkdb crawl/linkdb                                SUCCEEDED  Thu Dec 15 16:36:38 EST 2011  Thu Dec 15 16:36:58 EST 2011
> linkdb merge crawl/linkdb                          SUCCEEDED  Thu Dec 15 16:36:59 EST 2011  Thu Dec 15 16:38:27 EST 2011
> index-solr http://solr:8080/solr                   SUCCEEDED  Thu Dec 15 16:38:28 EST 2011  Thu Dec 15 16:38:56 EST 2011
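
A quick way to check which step dominates is to subtract each job's start time from its finish time. The sketch below does that for the times quoted above. It is not part of Nutch or Hadoop; the helper names (`seconds_between`, `step_durations`) and the shortened job labels are made up for illustration, and it assumes every timestamp falls on the same day, as they do in this run.

```python
from datetime import datetime

# Start/finish times copied from the job list quoted above
# (all on Thu Dec 15 2011, EST). Labels are shortened job names.
JOBS = [
    ("nutch-1.4.job", "16:14:30", "16:14:44"),
    ("generate: select", "16:14:45", "16:16:17"),
    ("generate: partition", "16:16:19", "16:16:42"),
    ("fetch", "16:16:44", "16:33:29"),
    ("parse", "16:33:30", "16:35:11"),
    ("crawldb", "16:35:12", "16:36:37"),
    ("linkdb", "16:36:38", "16:36:58"),
    ("linkdb merge", "16:36:59", "16:38:27"),
    ("index-solr", "16:38:28", "16:38:56"),
]


def seconds_between(start, finish):
    """Elapsed seconds between two HH:MM:SS times on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(finish, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


def step_durations(jobs):
    """Return (label, elapsed_seconds) for every job in the list."""
    return [(name, seconds_between(start, finish)) for name, start, finish in jobs]


if __name__ == "__main__":
    durations = step_durations(JOBS)
    total = sum(seconds for _, seconds in durations)
    # Print the slowest steps first, with their share of the summed job time.
    for name, seconds in sorted(durations, key=lambda d: d[1], reverse=True):
        share = 100 * seconds / total
        print(f"{name:<20} {seconds // 60:>2}m{seconds % 60:02d}s  {share:5.1f}%")
```

On these timestamps the fetch job comes to 1005 seconds and the parse job to 101 seconds, so for this particular run the fetch step is where most of the time goes.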
-- Lewis

