Szilard Nemeth created YARN-9421:
------------------------------------

             Summary: Implement SafeMode for ResourceManager by defining a resource threshold
                 Key: YARN-9421
                 URL: https://issues.apache.org/jira/browse/YARN-9421
             Project: Hadoop YARN
          Issue Type: New Feature
            Reporter: Szilard Nemeth
            Assignee: Szilard Nemeth


We have a hypothetical testcase in our test suite that tests Resource Types.
 The test does the following:
 1. Sets up a resource named "gpu".
 2. Out of 9 NodeManager nodes, 1 node has 100 of "gpu".
 3. Executes a sleep job with the resource requests "-Dmapreduce.reduce.resource.gpu=7" and "-Dyarn.app.mapreduce.am.resource.gpu=11" (a full command is sketched below).

Sometimes we encounter situations where the app submission fails with:
{code:java}
2019-02-25 06:09:56,795 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: RM app submission failed in validating AM resource request for application application_1551103768202_0001
org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request! Cannot allocate containers as requested resource is greater than maximum allowed allocation. Requested resource type=[gpu], Requested resource=<memory:1024, vCores:1, gpu: 11>, maximum allowed allocation=<memory:8192, vCores:1>, please note that maximum allowed allocation is calculated by scheduler based on maximum resource of registered NodeManagers, which might be less than configured maximum allocation=<memory:16003, vCores:4, gpu: 9223372036854775807>{code}
It's clearly visible that the maximum allowed allocation does not include any "gpu" resources.

Looking into the logs further, I realized that sometimes the node holding the "gpu" resources is registered after the app is submitted.
 In a real-world situation, and even with this very specific test execution, we can't be sure in which order NMs register with the RM.
 With the advent of resource types, this issue became more likely to surface.

If we have a cluster where "rare" resources like GPUs exist only on a few nodes out of 100, we can quickly run into a situation where the NMs with GPUs register later than the normal nodes. While these critical NMs are still registering, we will most likely hit the same InvalidResourceRequestException when submitting jobs that request GPUs.

There is a naive solution to this:
 1. Make the RM wait some amount of time for NMs to register, putting submitted applications on hold in the meantime. This could work in some situations, but it's not the most flexible solution, as different clusters have different requirements. Of course, we can make it more flexible by making the timeout value configurable, as sketched below.
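As an illustration, the timeout could be exposed as an RM configuration property. The property name below is purely hypothetical, not an existing YARN key:
{code:java}
<!-- Hypothetical property: how long the RM parks app submissions after
     startup while waiting for NMs to register. Not an existing YARN key. -->
<property>
  <name>yarn.resourcemanager.safemode.nm-registration-timeout-ms</name>
  <value>120000</value>
</property>{code}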

*A more flexible alternative would be:*
 2. We define a threshold of Resource capability: while the cluster's aggregate registered capacity is below this threshold, we put submitted jobs on hold; once the threshold is reached, we let jobs pass through.
 This is very similar to an existing concept, the SafeMode in HDFS ([https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Safemode]).
 Back to my GPU example above, the threshold could be: 8 vcores, 16GB memory, 3 GPUs.
 By defining a threshold like this, we can ensure that most submitted jobs won't be lost, just "parked" until the NMs are registered. A minimal sketch of such a check follows below.
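Here is a minimal sketch of the threshold check, reusing the existing Resource/Resources utilities. The RMSafeMode class and its wiring into the RM are hypothetical; only Resource, Resources.fitsIn and setResourceValue are existing APIs:
{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

/** Hypothetical helper: the RM would stay in safe mode until the aggregate
 *  capacity of registered NMs covers the configured threshold. */
public class RMSafeMode {
  private final Resource threshold;

  public RMSafeMode(Resource threshold) {
    this.threshold = threshold;
  }

  /** True while submissions should be parked, i.e. the cluster resource
   *  does not yet cover the threshold in every resource type. */
  public boolean isInSafeMode(Resource clusterResource) {
    return !Resources.fitsIn(threshold, clusterResource);
  }
}{code}
For the GPU example above, the threshold would be built as Resource.newInstance(16384, 8) followed by threshold.setResourceValue("gpu", 3) (assuming "gpu" is a registered resource type), and checked against the scheduler's cluster resource on every NM registration.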

The final solution could be the Resource threshold alone, or a combination of the threshold and the timeout value. I'm open to any other suggestions as well.
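For completeness, the threshold variant could be configured in a similar way; again, the property name and value format below are purely hypothetical:
{code:java}
<!-- Hypothetical property describing the minimum aggregate capacity the RM
     must see before leaving safe mode. Not an existing YARN key. -->
<property>
  <name>yarn.resourcemanager.safemode.resource-threshold</name>
  <value>memory-mb=16384,vcores=8,gpu=3</value>
</property>{code}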

*Last but not least, a very easy way to reproduce the issue on a 3-node cluster:*
 1. Configure a resource type named 'testres'.
 2. Node1 runs the RM; Node2 and Node3 run NMs.
 3. Node2 has 1 testres.
 4. Node3 has 0 testres.
 5. Stop all nodes.
 6. Start the RM on Node1.
 7. Start the NM on Node3 (the one without the resource).
 8. Start a pi job, requesting 1 testres for the AM.

Here's the command to start the job:
{code:java}
MY_HADOOP_VERSION=3.3.0-SNAPSHOT
pushd /opt/hadoop
bin/yarn jar "./share/hadoop/mapreduce/hadoop-mapreduce-examples-$MY_HADOOP_VERSION.jar" pi \
  -Dyarn.app.mapreduce.am.resource.testres=1 1 1000
popd{code}
 

*Configurations*: 
 node1: yarn-site.xml of ResourceManager:
{code:java}
<property>
 <name>yarn.resource-types</name>
 <value>testres</value>
</property>{code}
node2: yarn-site.xml of NodeManager:
{code:java}
<property>
 <name>yarn.resource-types</name>
 <value>testres</value>
</property>
<property>
 <name>yarn.nodemanager.resource-type.testres</name>
 <value>1</value>
</property>{code}
node3: yarn-site.xml of NodeManager:
{code:java}
<property>
 <name>yarn.resource-types</name>
 <value>testres</value>
</property>{code}
Please see the full process logs from the RM, NM, and YARN client attached.


