Re: Nutch Hadoop Optimization

Lewis John Mcgibbney Fri, 16 Dec 2011 02:33:38 -0800

It looks like it's the parsing of these segments that is taking time... no?

On Thu, Dec 15, 2011 at 9:57 PM, Bai Shen <[email protected]> wrote:
> On Thu, Dec 15, 2011 at 12:47 PM, Lewis John Mcgibbney <
> [email protected]> wrote:
>
>> This is overwhelmingly weighted towards Hadoop configuration.
>>
>> There are some guidance notes on the Nutch wiki for performance issues
>> so you may wish to give them a try first.
>> --
>>  Lewis
>>
>
> I'm assuming you're referring to this page?
> http://wiki.apache.org/nutch/OptimizingCrawls
>
>
> On Thu, Dec 15, 2011 at 2:01 PM, Markus Jelsma
> <[email protected]> wrote:
>
>> Well, if performance is low it's likely not a Hadoop issue. Hadoop tuning is
>> only required if you start pushing it to limits.
>>
>> I would indeed check the Nutch wiki. There are settings such as
>> threads, queues, etc. that are very important.
>>
>>
> I did end up tweaking some of the hadoop settings, as it looked like it was
> thrashing the disk due to not spreading out the map tasks.
>
>
> On Thu, Dec 15, 2011 at 3:00 PM, Julien Nioche <
> [email protected]> wrote:
>
>>
>> Having beefy machines is not going to be very useful for the fetching step
>> which is IO bound and usually takes most of the time.
>> How big is your crawldb?  How long do the generate / parse and update steps
>> take? Having more than one machine won't make a massive difference if your
>> crawldb or segments are small.
>>
>> Julien
>>
>>
> The machines were all I had handy to make the cluster with.
>
>
> I'm looking at the time for a recent job and here's what I'm seeing.  This
> is with 12k urls queued by domain with a max of 50 urls per domain.
> I know why the fetcher takes so long.  Most of the fetcher map jobs finish
> in 3-4 minutes, but 1-2 always end up getting stuck on a single site and
> taking an additional ten minutes to work through the remaining urls.  Not
> sure how to fix that.
> The crawldb had around 1.2 million urls in it when I looked this afternoon.
>
> Job                                              Status     Started                       Finished
> nutch-1.4.job                                    SUCCEEDED  Thu Dec 15 16:14:30 EST 2011  Thu Dec 15 16:14:44 EST 2011
> generate: select from crawl/crawldb              SUCCEEDED  Thu Dec 15 16:14:45 EST 2011  Thu Dec 15 16:16:17 EST 2011
> generate: partition crawl/segments/20111215161618  SUCCEEDED  Thu Dec 15 16:16:19 EST 2011  Thu Dec 15 16:16:42 EST 2011
> fetch crawl/segments/20111215161618              SUCCEEDED  Thu Dec 15 16:16:44 EST 2011  Thu Dec 15 16:33:29 EST 2011
> parse crawl/segments/20111215161618              SUCCEEDED  Thu Dec 15 16:33:30 EST 2011  Thu Dec 15 16:35:11 EST 2011
> crawldb crawl/crawldb                            SUCCEEDED  Thu Dec 15 16:35:12 EST 2011  Thu Dec 15 16:36:37 EST 2011
> linkdb crawl/linkdb                              SUCCEEDED  Thu Dec 15 16:36:38 EST 2011  Thu Dec 15 16:36:58 EST 2011
> linkdb merge crawl/linkdb                        SUCCEEDED  Thu Dec 15 16:36:59 EST 2011  Thu Dec 15 16:38:27 EST 2011
> index-solr http://solr:8080/solr                 SUCCEEDED  Thu Dec 15 16:38:28 EST 2011  Thu Dec 15 16:38:56 EST 2011
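
For anyone wanting to compare the steps, here is a rough sketch that turns the start and end times in the job list above into elapsed seconds per job. The times are copied from that list; the short job labels and the helper name job_durations are just my own for illustration, and it assumes every job starts and ends on the same day.

```python
from datetime import datetime

# (label, start, end) copied from the quoted job list; all on Thu Dec 15 2011, EST.
# The short labels are illustrative, not official Nutch job names.
JOBS = [
    ("nutch-1.4.job",       "16:14:30", "16:14:44"),
    ("generate: select",    "16:14:45", "16:16:17"),
    ("generate: partition", "16:16:19", "16:16:42"),
    ("fetch",               "16:16:44", "16:33:29"),
    ("parse",               "16:33:30", "16:35:11"),
    ("crawldb",             "16:35:12", "16:36:37"),
    ("linkdb",              "16:36:38", "16:36:58"),
    ("linkdb merge",        "16:36:59", "16:38:27"),
    ("index-solr",          "16:38:28", "16:38:56"),
]


def job_durations(jobs):
    """Return {label: elapsed seconds}, assuming start and end fall on the same day."""
    fmt = "%H:%M:%S"
    result = {}
    for name, start, end in jobs:
        elapsed = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
        result[name] = int(elapsed.total_seconds())
    return result


if __name__ == "__main__":
    durations = job_durations(JOBS)
    total = sum(durations.values())
    # Longest-running jobs first, with each job's share of the summed job time.
    for name, secs in sorted(durations.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:<20} {secs:>5d}s  {100 * secs / total:5.1f}%")
```

Sorting by elapsed time makes it easier to see which step dominates the run before deciding what to tune.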




-- 
Lewis
