Thanks, Steve,

Dennis

--- On Wed, 10/6/10, Steve Cohen <[email protected]> wrote:

From: Steve Cohen <[email protected]>
Subject: Re: need a larger map task number
To: "Dennis" <[email protected]>
Cc: [email protected]
Date: Wednesday, October 6, 2010, 9:30 AM

Here is a link to the Nutch configuration files:

http://wiki.apache.org/nutch/NutchConfigurationFiles

Read the whole page, but here is a snippet:

"So for example if you define the property in hadoop-default.xml or
nutch-default.xml and it is not defined in either hadoop-site.xml or
nutch-site.xml then the property will stand. If you define the property in
either nutch-site.xml or hadoop-site.xml then it will override
nutch-default.xml and hadoop-default.xml settings. And if you define it in
both hadoop-site.xml and nutch-site.xml then the nutch-site.xml will
override the hadoop-site.xml settings because nutch-site.xml is added after
hadoop-site.xml. And remember only individual properties are overridden not
the entire file"

On Tue, Oct 5, 2010 at 8:46 PM, Dennis <[email protected]> wrote:

> Thanks, Steve
>
> I'm using Nutch 1.1, and I installed it following this:
> http://wiki.apache.org/nutch/NutchHadoopTutorial.
> But I did not see any hadoop-site.xml file. I used grep to look for anything
> related to 'task' (see below). Besides, the "crawldb crawl/crawldb" job
> uses more mapreduce tasks, usually 4, while other jobs use only 2.
> Any idea?
>
> b...@nutch03:~/nutch/search$ grep task conf/*
> conf/capacity-scheduler.xml: <!-- The default configuration settings for the capacity task scheduler -->
> conf/domain-suffixes.xml: <!-- ke : http://www.kenic.or.ke/index.php?option=com_content&task=view&id=117&Itemid=145-->
> conf/domain-suffixes.xml: <!-- TASK geographical domains (www.task.gda.pl/uslugi/dns)-->
> conf/hadoop-policy.xml: <description>ACL for InterTrackerProtocol, used by the tasktrackers to
> conf/hadoop-policy.xml: <name>security.task.umbilical.protocol.acl</name>
> conf/hadoop-policy.xml: tasks to communicate with the parent tasktracker.
> conf/mapred-site.xml: reduce task.
> conf/mapred-site.xml: <name>mapred.map.tasks</name>
> conf/mapred-site.xml: define mapred.map tasks to be number of slave hosts
> conf/mapred-site.xml: <name>mapred.reduce.tasks</name>
> conf/mapred-site.xml: define mapred.reduce tasks to be number of slave hosts
>
> Dennis
>
> --- On Tue, 10/5/10, Steve Cohen <[email protected]> wrote:
>
> From: Steve Cohen <[email protected]>
> Subject: Re: need a larger map task number
> To: [email protected]
> Date: Tuesday, October 5, 2010, 9:40 PM
>
> For nutch, I found that updating the values in hadoop-site.xml was enough,
> though I also set values for mapred.tasktracker.map.tasks.maximum and
> mapred.tasktracker.reduce.tasks.maximum.
>
> On Tue, Oct 5, 2010 at 9:24 AM, Dennis <[email protected]> wrote:
>
> > Hi, all
> > My "fetch" job uses only 2 map tasks and 2 reduce tasks although I
> > configured "mapred.map.tasks" and "mapred.reduce.tasks" in "mapreduce.xml"
> > to "32", and I need it to run faster. How can I make nutch use more map and
> > reduce tasks when it's fetching?
> > Dennis
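
For reference, the override order described in the wiki snippet quoted in this thread can be sketched roughly as below. This is only an illustration in Python, not Nutch or Hadoop code: the `resolve` helper is hypothetical, the property values are made up, and the relative order of the two *-default.xml files is an assumption (the snippet only says that the site files override the defaults and that nutch-site.xml is added after hadoop-site.xml).

```python
def resolve(layers):
    """Merge property dicts in load order.

    Later files override earlier ones, one property at a time;
    a file never replaces another file wholesale.
    """
    merged = {}
    for _filename, props in layers:
        merged.update(props)
    return merged


# Load order, earliest first. The values are made up for illustration.
layers = [
    ("hadoop-default.xml", {"mapred.map.tasks": "2", "mapred.reduce.tasks": "1"}),
    ("nutch-default.xml", {}),
    ("hadoop-site.xml", {"mapred.map.tasks": "16"}),
    ("nutch-site.xml", {"mapred.map.tasks": "32"}),
]

effective = resolve(layers)
print(effective["mapred.map.tasks"])     # value from nutch-site.xml
print(effective["mapred.reduce.tasks"])  # default stands, no site file sets it
```

In this sketch the map setting comes from nutch-site.xml because it is loaded last, while the reduce setting keeps its default because no site file defines it.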

