Thanks, Steve,
Dennis

--- On Wed, 10/6/10, Steve Cohen <[email protected]> wrote:

From: Steve Cohen <[email protected]>
Subject: Re: need a larger map task number
To: "Dennis" <[email protected]>
Cc: [email protected]
Date: Wednesday, October 6, 2010, 9:30 AM

Here is a link to nutch configuration files:

http://wiki.apache.org/nutch/NutchConfigurationFiles

Read the whole page, but here is a snippet:

"So for example if you define the property in hadoop-default.xml or
nutch-default.xml and it is not defined in either hadoop-site.xml or
nutch-site.xml then the property will stand. If you define the property in
either nutch-site.xml or hadoop-site.xml then it will override
nutch-default.xml and hadoop-default.xml settings. And if you define it in
both hadoop-site.xml and nutch-site.xml then the nutch-site.xml will
override the hadoop-site.xml settings because nutch-site.xml is added after
hadoop-site.xml. And remember only individual properties are overridden not
the entire file"
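
To make the override rule in that snippet concrete, here is a rough sketch in
Python. It only models the behaviour described above; it is not how Nutch or
Hadoop actually load configuration. The load order in RESOURCES is an
assumption chosen to match the wiki text (defaults first, then
hadoop-site.xml, then nutch-site.xml), and all property values are made up
for illustration.

```python
# Illustrative model of per-property overriding, as described in the wiki
# snippet. File contents and load order below are assumptions, not real
# defaults read from a Nutch installation.

def merge_resources(resources):
    """Merge (file name, properties) pairs in load order.

    A later file overrides an earlier one property by property; properties
    it does not mention are left untouched.
    """
    merged = {}
    origin = {}
    for name, props in resources:
        for key, value in props.items():
            merged[key] = value
            origin[key] = name  # remember which file supplied the final value
    return merged, origin


# Assumed load order: defaults, then hadoop-site.xml, then nutch-site.xml.
RESOURCES = [
    ("hadoop-default.xml", {"mapred.map.tasks": "2", "mapred.reduce.tasks": "1"}),
    ("nutch-default.xml", {"http.agent.name": ""}),
    ("hadoop-site.xml", {"mapred.map.tasks": "16"}),
    ("nutch-site.xml", {"mapred.map.tasks": "32", "http.agent.name": "example-crawler"}),
]

merged, origin = merge_resources(RESOURCES)

if __name__ == "__main__":
    for key in sorted(merged):
        print(f"{key} = {merged[key]!r} (from {origin[key]})")
```

In this model mapred.map.tasks ends up with the nutch-site.xml value, while
mapred.reduce.tasks keeps its default because no site file defines it.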

On Tue, Oct 5, 2010 at 8:46 PM, Dennis <[email protected]> wrote:

> Thanks, Steve
>
> I'm using Nutch 1.1, and I installed it following this:
> http://wiki.apache.org/nutch/NutchHadoopTutorial.
> But I did not see any hadoop-site.xml file. I used grep to look for anything
> related to 'task' (see below). Besides, the "crawldb crawl/crawldb" job
> uses more mapreduce tasks, usually 4, while other jobs use only 2.
> Any idea?
>
> b...@nutch03:~/nutch/search$ grep task conf/*
> conf/capacity-scheduler.xml:  <!-- The default configuration settings for the capacity task scheduler -->
> conf/domain-suffixes.xml:    <!--  ke : http://www.kenic.or.ke/index.php?option=com_content&task=view&id=117&Itemid=145-->
> conf/domain-suffixes.xml:    <!--  TASK geographical domains (www.task.gda.pl/uslugi/dns)-->
> conf/hadoop-policy.xml:    <description>ACL for InterTrackerProtocol, used by the tasktrackers to
> conf/hadoop-policy.xml:    <name>security.task.umbilical.protocol.acl</name>
> conf/hadoop-policy.xml:    tasks to communicate with the parent tasktracker.
> conf/mapred-site.xml:    reduce task.
> conf/mapred-site.xml:  <name>mapred.map.tasks</name>
> conf/mapred-site.xml:    define mapred.map tasks to be number of slave hosts
> conf/mapred-site.xml:  <name>mapred.reduce.tasks</name>
> conf/mapred-site.xml:    define mapred.reduce tasks to be number of slave hosts
>
> Dennis
>
>
> --- On Tue, 10/5/10, Steve Cohen <[email protected]> wrote:
>
>
> From: Steve Cohen <[email protected]>
> Subject: Re: need a larger map task number
> To: [email protected]
> Date: Tuesday, October 5, 2010, 9:40 PM
>
>
> For nutch, I found that updating the values in hadoop-site.xml was enough,
> though I also set values for mapred.tasktracker.map.tasks.maximum and
> mapred.tasktracker.reduce.tasks.maximum.
>
> On Tue, Oct 5, 2010 at 9:24 AM, Dennis <[email protected]> wrote:
>
> > Hi, all
> > My "fetch" job uses only 2 map tasks and 2 reduce tasks, although I set
> > "mapred.map.tasks" and "mapred.reduce.tasks" to "32" in "mapreduce.xml",
> > and I need it to run faster. How can I make Nutch use more map and
> > reduce tasks when it's fetching?
> > Dennis
> >
> >
> >
>
>
>



      
