On Thu, Dec 15, 2011 at 12:47 PM, Lewis John Mcgibbney <[email protected]> wrote:
> This is overwhelmingly weighted towards Hadoop configuration.
>
> There are some guidance notes on the Nutch wiki for performance issues
> so you may wish to give them a try first.
> --
> Lewis

I'm assuming you're referring to this page?
http://wiki.apache.org/nutch/OptimizingCrawls

On Thu, Dec 15, 2011 at 2:01 PM, Markus Jelsma <[email protected]> wrote:
> Well, if performance is low it's likely not a Hadoop issue. Hadoop tuning is
> only required if you start pushing it to limits.
>
> I would indeed check the Nutch wiki. There are important settings such as
> threads, queues etc. that are very important.

I did end up tweaking some of the Hadoop settings, as it looked like it was thrashing the disk because the map tasks were not being spread out.

On Thu, Dec 15, 2011 at 3:00 PM, Julien Nioche <[email protected]> wrote:
> Having beefy machines is not going to be very useful for the fetching step
> which is IO bound and usually takes most of the time.
> How big is your crawldb? How long do the generate / parse and update steps
> take? Having more than one machine won't make a massive difference if your
> crawldb or segments are small.
>
> Julien

The machines were all I had handy to build the cluster with.

The crawldb had around 1.2 million URLs in it when I looked this afternoon. I'm looking at the timings for a recent job; they are listed below. This run had 12k URLs, queued by domain, with a max of 50 URLs per domain.

I know why the fetcher takes so long. Most of the fetcher map tasks finish in 3-4 minutes, but 1-2 always end up getting stuck on a single site and take an additional ten minutes to work through the remaining URLs. I'm not sure how to fix that.
All times are Thu Dec 15 2011, EST.

Job                                                Status     Started   Finished
nutch-1.4.job                                      SUCCEEDED  16:14:30  16:14:44
generate: select from crawl/crawldb                SUCCEEDED  16:14:45  16:16:17
generate: partition crawl/segments/20111215161618  SUCCEEDED  16:16:19  16:16:42
fetch crawl/segments/20111215161618                SUCCEEDED  16:16:44  16:33:29
parse crawl/segments/20111215161618                SUCCEEDED  16:33:30  16:35:11
crawldb crawl/crawldb                              SUCCEEDED  16:35:12  16:36:37
linkdb crawl/linkdb                                SUCCEEDED  16:36:38  16:36:58
linkdb merge crawl/linkdb                          SUCCEEDED  16:36:59  16:38:27
index-solr http://solr:8080/solr                   SUCCEEDED  16:38:28  16:38:56
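
To make the timings above easier to compare, here is a minimal Python sketch that turns the start and finish times into per-job durations and a share of the total. The job names and times are copied from the list above; the helper names (`duration_seconds`, `summarize`) are hypothetical and only there for illustration. It assumes every job starts and finishes on the same day, which holds for this run.

```python
from datetime import datetime

# (job name, started, finished), copied from the job list above.
# All times are on Thu Dec 15 2011, EST.
JOBS = [
    ("nutch-1.4.job", "16:14:30", "16:14:44"),
    ("generate: select from crawl/crawldb", "16:14:45", "16:16:17"),
    ("generate: partition crawl/segments/20111215161618", "16:16:19", "16:16:42"),
    ("fetch crawl/segments/20111215161618", "16:16:44", "16:33:29"),
    ("parse crawl/segments/20111215161618", "16:33:30", "16:35:11"),
    ("crawldb crawl/crawldb", "16:35:12", "16:36:37"),
    ("linkdb crawl/linkdb", "16:36:38", "16:36:58"),
    ("linkdb merge crawl/linkdb", "16:36:59", "16:38:27"),
    ("index-solr http://solr:8080/solr", "16:38:28", "16:38:56"),
]

TIME_FORMAT = "%H:%M:%S"


def duration_seconds(started, finished):
    """Return the elapsed seconds between two same-day HH:MM:SS times."""
    delta = datetime.strptime(finished, TIME_FORMAT) - datetime.strptime(started, TIME_FORMAT)
    return int(delta.total_seconds())


def summarize(jobs):
    """Return ([(name, seconds, percent of summed job time)], total seconds)."""
    durations = [(name, duration_seconds(start, end)) for name, start, end in jobs]
    total = sum(seconds for _, seconds in durations)
    rows = [(name, seconds, 100.0 * seconds / total) for name, seconds in durations]
    return rows, total


if __name__ == "__main__":
    rows, total = summarize(JOBS)
    for name, seconds, percent in rows:
        print(f"{name:<50} {seconds:>5d}s {percent:5.1f}%")
    print(f"{'total job time':<50} {total:>5d}s")
```

On these numbers the fetch job accounts for roughly 69% of the summed job time (1005 of 1456 seconds), which fits the point that fetching is where most of the time goes.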

