7, each node has a datanode and a tasktracker running on it. I attach the
full file here:
2014.03.07|10:13:17~/HadoopSetupTest/hadoop-1.2.1/conf>cat mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>compute-1-23:50331</value>
<description>The host and port at which the MapReduce job tracker runs.
If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
<property>
<name>mapred.job.tracker.http.address</name>
<value>0.0.0.0:50332</value>
<description>The address and port on which the JobTracker's HTTP server
(web UI) listens.
</description>
</property>
<property>
<name>mapred.task.tracker.http.address</name>
<value>0.0.0.0:50333</value>
<description>The address and port on which the TaskTracker's HTTP server
(web UI) listens.
</description>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>7</value>
<description>The maximum number of map tasks that will be run
simultaneously by a task tracker.
</description>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>7</value>
<description>The maximum number of reduce tasks that will be run
simultaneously by a task tracker.
</description>
</property>
<property>
<name>mapred.jobtracker.taskScheduler</name>
<value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
<name>mapred.fairscheduler.poolnameproperty</name>
<value>pool.name</value>
<description>pool name property can be specified in jobconf</description>
</property>
<property>
<name>mapred.local.dir</name>
<value>${hadoop.tmp.dir}/mapred/local</value>
<description>The local directory where MapReduce stores intermediate
data files. May be a comma-separated list of
directories on different devices in order to spread disk I/O.
Directories that do not exist are ignored.
</description>
</property>
<property>
<name>mapred.system.dir</name>
<value>${hadoop.tmp.dir}/system/mapred</value>
<description>The shared directory where MapReduce stores control files.
</description>
</property>
<property>
<name>mapred.tasktracker.dns.interface</name>
<value>default</value>
<description>The name of the Network Interface from which a task
tracker should report its IP address. (e.g. eth0)
</description>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx3600m -XX:+UseParallelGC -mx1024m -XX:MaxHeapFreeRatio=10
-XX:MinHeapFreeRatio=10</value>
<description>Java opts for the task tracker child processes.
The following symbol, if present, will be interpolated: @taskid@ is
replaced by the current TaskID. Any other occurrences of '@' will go
unchanged. For example, to enable verbose gc logging to a file named
for the taskid in /tmp and to set the heap maximum to be a gigabyte,
pass a 'value' of:
-Xmx1024m -verbose:gc -Xloggc:/tmp/@[email protected]
The configuration variable mapred.child.ulimit can be used to control
the maximum virtual memory of the child processes.
</description>
</property>
<property>
<name>mapred.job.reuse.jvm.num.tasks</name>
<value>-1</value>
<description>How many tasks to run per jvm. If set to -1, there is no
limit.</description>
</property>
<property>
<name>mapred.job.tracker.handler.count</name>
<value>40</value>
<description>The number of server threads for the JobTracker. This
should be roughly 4% of the number of tasktracker nodes.</description>
</property>
<property>
<name>mapred.jobtracker.maxtasks.per.job</name>
<value>-1</value>
<description>The maximum number of tasks for a single job. A value of
-1 indicates that there is no maximum.</description>
</property>
<property>
<name>mapred.tasktracker.expiry.interval</name>
<value>600000</value>
<description>The time interval, in milliseconds, after which a task
tracker is declared 'lost' if it stops sending heartbeats to the
jobtracker. The default is 1000*60*10, i.e. 10 minutes.</description>
</property>
<property>
<name>mapred.task.timeout</name>
<value>0</value>
<description>The number of milliseconds before a task will be terminated
if it neither reads an input, writes an output, nor updates its status
string. A value of 0 disables the timeout.</description>
</property>
<property>
<name>mapred.map.tasks.speculative.execution</name>
<value>false</value>
<description>set the speculative execution for map tasks</description>
</property>
<property>
<name>mapred.reduce.tasks.speculative.execution</name>
<value>false</value>
<description>set the speculative execution for reduce
tasks</description>
</property>
<property>
<name>mapred.hosts.exclude</name>
<value>conf/excludes</value>
</property>
</configuration>
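A quick sanity check on the slot and heap settings in the file above (a rough sketch; the 16 GB per node figure comes from later in this thread, and it assumes HotSpot's behavior that the last heap flag on the command line wins, -mx being an old alias for -Xmx):

```python
# Rough memory budget implied by the mapred-site.xml above.
# The child JVM opts contain both -Xmx3600m and -mx1024m; -mx is an old
# alias for -Xmx, and with HotSpot the last heap flag wins, so the
# effective max heap per child is likely 1024 MB, not 3600 MB.
map_slots = 7             # mapred.tasktracker.map.tasks.maximum
reduce_slots = 7          # mapred.tasktracker.reduce.tasks.maximum
node_ram_mb = 16 * 1024   # 16 GB per node, figure taken from the thread

worst_case_mb = (map_slots + reduce_slots) * 3600   # if -Xmx3600m applied
effective_mb = (map_slots + reduce_slots) * 1024    # if -mx1024m wins

print(worst_case_mb, node_ram_mb)  # 50400 vs 16384: heavily over-committed
print(effective_mb, node_ram_mb)   # 14336 vs 16384: fits, but heap is small
```

Either way the settings look risky: 14 concurrent 3600 MB heaps cannot fit in 16 GB, and a 1024 MB heap per worker is small for an in-memory graph.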
2014-03-07 9:59 GMT-06:00 Claudio Martella <[email protected]>:
> that depends on your cluster configuration. what is the maximum number of
> mappers you can have concurrently on each node?
>
>
> On Fri, Mar 7, 2014 at 4:42 PM, Suijian Zhou <[email protected]>wrote:
>
>> The current setting is:
>> <name>mapred.child.java.opts</name>
>> <value>-Xmx6144m -XX:+UseParallelGC -mx1024m -XX:MaxHeapFreeRatio=10
>> -XX:MinHeapFreeRatio=10</value>
>>
>> Is 6144 MB enough (for each task tracker)? I.e., I have 39 nodes to process
>> the 8*2GB input files.
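The arithmetic behind that question, assuming the 7 map slots per node and 16 GB of RAM per node stated elsewhere in the thread:

```python
# Whether -Xmx6144m fits: with 7 concurrent map tasks per node, each
# child JVM at 6144 MB would commit far more memory than a 16 GB node has.
heap_mb = 6144
map_slots_per_node = 7        # from mapred.tasktracker.map.tasks.maximum
node_ram_mb = 16 * 1024

committed_mb = heap_mb * map_slots_per_node
print(committed_mb)                 # 43008 MB requested per node
print(committed_mb - node_ram_mb)   # 26624 MB of over-commit per node
```

So 6144 MB per child is only viable if far fewer tasks run per node; otherwise the node swaps or the OS starts killing processes.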
>>
>> Best Regards,
>> Suijian
>>
>>
>>
>> 2014-03-07 9:21 GMT-06:00 Claudio Martella <[email protected]>:
>>
>> this setting won't be used by Giraph (or by any mapreduce application),
>>> but by the hadoop infrastructure itself.
>>> you should use mapred.child.java.opts instead.
>>>
>>>
>>> On Fri, Mar 7, 2014 at 4:19 PM, Suijian Zhou <[email protected]>wrote:
>>>
>>>> Hi, Claudio,
>>>> I have set the following when I ran the program:
>>>> export HADOOP_DATANODE_OPTS="-Xmx10g"
>>>> and
>>>> export HADOOP_HEAPSIZE=30000
>>>>
>>>> in hadoop-env.sh and restarted hadoop.
>>>>
>>>> Best Regards,
>>>> Suijian
>>>>
>>>>
>>>>
>>>> 2014-03-06 17:29 GMT-06:00 Claudio Martella <[email protected]
>>>> >:
>>>>
>>>> did you actually increase the heap?
>>>>>
>>>>>
>>>>> On Thu, Mar 6, 2014 at 11:43 PM, Suijian Zhou
>>>>> <[email protected]>wrote:
>>>>>
>>>>>> Hi,
>>>>>> I tried to process only 2 of the input files, i.e., 2GB + 2GB input,
>>>>>> and the program finished successfully in 6 minutes. But as I have 39
>>>>>> nodes, shouldn't they be enough to load and process the 8*2GB=16GB
>>>>>> graph? Can somebody give some hints? (Will all the nodes participate
>>>>>> in graph loading from HDFS, or does only the master node load the
>>>>>> graph?) Thanks!
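On the loading question: in Giraph the workers all read input splits in parallel (the master only coordinates), so a back-of-the-envelope per-worker estimate looks like the following; the 3x in-memory blow-up factor for Java objects versus on-disk text is an assumption, not a measured number:

```python
# Hypothetical per-worker memory estimate for loading this graph.
input_gb = 8 * 2        # 8 files of ~2 GB each on HDFS
workers = 39            # workers load input splits in parallel
blowup = 3.0            # assumed Java object overhead vs. on-disk text

per_worker_gb = input_gb / workers * blowup
print(round(per_worker_gb, 2))  # ~1.23 GB per worker, before messages
```

That only fits if each worker JVM actually gets a heap larger than this estimate plus message buffers, which circles back to the -Xmx discussion above.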
>>>>>>
>>>>>> Best Regards,
>>>>>> Suijian
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2014-03-06 16:24 GMT-06:00 Suijian Zhou <[email protected]>:
>>>>>>
>>>>>> Hi, Experts,
>>>>>>> I'm trying to process a graph with PageRank in Giraph, but the
>>>>>>> program always gets stuck.
>>>>>>> There are 8 input files, each ~2GB in size, and all are copied
>>>>>>> onto HDFS. I use 39 nodes, and each node has 16GB Mem and 8 cores. It
>>>>>>> keeps printing the same info (as the following) on the screen after 2
>>>>>>> hours; it looks like no progress at all. What are the possible reasons?
>>>>>>> Tests with small example files run without problems. Thanks!
>>>>>>>
>>>>>>> 14/03/06 16:17:42 INFO job.JobProgressTracker: Data from 39 workers
>>>>>>> - Compute superstep 0: 5854829 out of 49200000 vertices computed; 181
>>>>>>> out
>>>>>>> of 1521 partitions computed
>>>>>>> 14/03/06 16:17:47 INFO job.JobProgressTracker: Data from 39 workers
>>>>>>> - Compute superstep 0: 5854829 out of 49200000 vertices computed; 181
>>>>>>> out
>>>>>>> of 1521 partitions computed
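The two log lines above carry identical counters five seconds apart, i.e. zero forward progress; the percentages are plain arithmetic on the logged numbers:

```python
# Progress implied by the JobProgressTracker lines: the counters are
# frozen between log lines, so superstep 0 is stalled, not just slow.
vertices_done, vertices_total = 5_854_829, 49_200_000
parts_done, parts_total = 181, 1521

print(round(100 * vertices_done / vertices_total, 1))  # 11.9 (% of vertices)
print(round(100 * parts_done / parts_total, 1))        # 11.9 (% of partitions)
# Frozen counters like this typically point at memory pressure / GC
# thrashing on the workers rather than at slow computation.
```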
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Suijian
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Claudio Martella
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Claudio Martella
>>>
>>>
>>
>>
>
>
> --
> Claudio Martella
>
>