Hey Arjun,

I am glad someone finally responded to this thread. I am surprised no one else is trying to figure out these configuration settings...
Here is my understanding of your questions (though I am not sure my answers are right):

*Is setting both mapreduce.map.cpu.vcores and yarn.nodemanager.resource.cpu-vcores required?*

Yes, I believe you need both of these set, or else they will revert to their default values. Importantly, I think you should set them to the same value so that you spawn one mapper/Giraph worker per machine (as this was said to be optimal). Since I have 32 cores per machine, I have set both of these values to 32, and that has worked to spawn only one worker per machine (unless I try to have a worker share a machine with the master). Check this page out:
http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/

*What happens if they are not set, while giraph.numComputeThreads is set?*

The above parameters specify how many cores per machine you are allowing workers to use AND how many cores one worker will be allocated. If you don't set *giraph.numComputeThreads*, then the worker will use the default number of compute threads (I think that is 1) despite possibly being allocated more cores. Hence, I set *giraph.numComputeThreads*, *giraph.numInputThreads*, and *giraph.numOutputThreads* to the same value as the above two parameters: the total number of cores in one machine (for me, 32). Giraph is never going to fully utilize the entire machine, so I don't think it's really possible to tell whether these settings are exactly right, but all of this seems reasonable based on my experience and on how these parameters are defined.

*Are there any other parameters that must be set in order to make sure we are really using the cores, not just multi-threading on a single core?*

No idea, but the above parameters and some memory configurations are all I set. The memory configuration is worse in my opinion, as I was running into memory issues and ended up having to manually set the following parameters (see the config sketch below the quoted thread at the bottom of this mail):

- yarn.nodemanager.resource.memory-mb
- yarn.scheduler.minimum-allocation-mb
- yarn.scheduler.maximum-allocation-mb
- mapreduce.map.memory.mb
- -yh (in the Giraph arguments)

All of these had to be set manually before Giraph would run without memory issues.

Best regards,
Steve

On Thu, Apr 23, 2015 at 8:15 PM, Arjun Sharma <[email protected]> wrote:

> Just bumping up this thread, as I have the same questions as Steven.
>
> Steven, did you get to know if setting both mapreduce.map.cpu.vcores and
> yarn.nodemanager.resource.cpu-vcores is required? What happens if they
> are not set, while giraph.numComputeThreads is set? Are there any
> other parameters that must be set in order to make sure we are *really*
> using the cores, not just multi-threading on a single core?
>
>
> On Wed, Mar 18, 2015 at 11:48 AM, Steven Harenberg <[email protected]>
> wrote:
>
>> Hi all,
>>
>> Previously with MapReduce v1, the suggestion was to have a 1:1
>> correspondence between workers and compute nodes (machines) and to set
>> the number of threads to be the number of cores per machine. To achieve
>> this configuration, we would set "mapred.tasktracker.map.tasks.maximum=1".
>> Since workers correspond to mappers, this would ensure there was one
>> worker per machine.
>>
>> Now I am reading that with YARN this property no longer exists, as there
>> are no TaskTrackers. Instead, we have the global property
>> "yarn.nodemanager.resource.cpu-vcores", which specifies the cores _per
>> node_, and the property "mapreduce.map.cpu.vcores", which specifies the
>> cores _per map task_.
>>
>> If we want to have one mapper per node that is fully utilizing the
>> machine, I assume we should just set mapreduce.map.cpu.vcores =
>> yarn.nodemanager.resource.cpu-vcores = the # of cores per node. Is this
>> correct?
>>
>> Do I still need to set giraph.numComputeThreads to be the number of cores
>> per node?
>>
>> Thanks,
>> Steve
>>
>
>
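P.S. For anyone who finds this thread later, here is a sketch of the CPU and memory settings described above. The vcore value of 32 matches my machines; the memory numbers below are made-up placeholders rather than my real values, so size them to your own hardware:

    <!-- yarn-site.xml: cluster-level resources YARN may hand out per node -->
    <property>
      <name>yarn.nodemanager.resource.cpu-vcores</name>
      <value>32</value> <!-- total cores per machine -->
    </property>
    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>57344</value> <!-- placeholder: RAM per node you give to YARN -->
    </property>
    <property>
      <name>yarn.scheduler.minimum-allocation-mb</name>
      <value>1024</value> <!-- placeholder -->
    </property>
    <property>
      <name>yarn.scheduler.maximum-allocation-mb</name>
      <value>57344</value> <!-- placeholder: must be >= the map task request -->
    </property>

    <!-- mapred-site.xml: what each map task (each Giraph worker) requests -->
    <property>
      <name>mapreduce.map.cpu.vcores</name>
      <value>32</value> <!-- equal to the per-node total, so one worker per node -->
    </property>
    <property>
      <name>mapreduce.map.memory.mb</name>
      <value>57344</value> <!-- placeholder: container size per worker -->
    </property>

With the two vcore values equal, each map task asks for a whole node, and on my cluster that was enough to get exactly one Giraph worker per machine.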

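P.P.S. And here is roughly what a launch command looks like with the thread settings and -yh in place. The jar, computation class, input/output formats, paths, worker count, and heap size are illustrative placeholders borrowed from the Giraph examples, not values from this thread:

    hadoop jar giraph-examples-jar-with-dependencies.jar \
      org.apache.giraph.GiraphRunner \
      org.apache.giraph.examples.SimpleShortestPathsComputation \
      -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
      -vip /path/to/input \
      -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
      -op /path/to/output \
      -w 4 \
      -ca giraph.numComputeThreads=32,giraph.numInputThreads=32,giraph.numOutputThreads=32 \
      -yh 14336
    # -w  : number of workers (here, 4 machines, one worker each)
    # -ca : per-worker thread counts, matched to the cores on one machine
    # -yh : heap per Giraph task in MB (YARN profile only; placeholder value)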