Szilard Nemeth created YARN-9421:
------------------------------------
Summary: Implement SafeMode for ResourceManager by defining a resource threshold
Key: YARN-9421
URL: https://issues.apache.org/jira/browse/YARN-9421
Project: Hadoop YARN
Issue Type: New Feature
Reporter: Szilard Nemeth
Assignee: Szilard Nemeth
We have a hypothetical testcase in our test suite that tests Resource Types.
The test does the following:
1. Sets up a resource named "gpu".
2. Out of 9 NodeManager nodes, 1 node has 100 of "gpu".
3. Executes a sleep job with the resource requests
"-Dmapreduce.reduce.resource.gpu=7" and
"-Dyarn.app.mapreduce.am.resource.gpu=11" (see the example command after this list).
Sometimes the app submission fails with:
{code:java}
2019-02-25 06:09:56,795 WARN
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: RM app submission
failed in validating AM resource request for application
application_1551103768202_0001
org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid
resource request! Cannot allocate containers as requested resource is greater
than maximum allowed allocation. Requested resource type=[gpu], Requested
resource=<memory:1024, vCores:1, gpu: 11>, maximum allowed
allocation=<memory:8192, vCores:1>, please note that maximum allowed allocation
is calculated by scheduler based on maximum resource of registered
NodeManagers, which might be less than configured maximum
allocation=<memory:16003, vCores:4, gpu: 9223372036854775807>{code}
It's clearly visible that the maximum allowed allocation does not include any
"gpu" resource.
Looking into the logs further, I realized that sometimes the node holding the
"gpu" resources is registered after the app is submitted.
In a real-world situation, and even with this very specific test execution, we
can't be sure in which order the NMs register with the RM.
With the advent of resource types, this issue became more likely to surface.
If we have a cluster where a "rare" resource like GPUs exists only on a few
nodes out of 100, we can quickly run into a situation where the NMs with GPUs
register later than the normal nodes. While those critical NMs are still
registering, we will most likely hit the same
InvalidResourceRequestException whenever we submit jobs requesting GPUs.
There is a naive solution to this:
1. Have the RM wait for a while so that NMs can register, and put submitted
applications on hold in the meantime. This could work in some situations, but
it's not the most flexible solution, as different clusters can have different
requirements. Of course, we can make this more flexible by making the timeout
value configurable (a sketch of such a property follows).
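For illustration, the timeout could be exposed as a config property; the property name below is hypothetical, it does not exist in YARN today:
{code:java}
<!-- Hypothetical property proposed by this issue: how long the RM keeps
     submitted applications on hold while waiting for NMs to register. -->
<property>
  <name>yarn.resourcemanager.safemode.wait-ms</name>
  <value>60000</value>
</property>{code}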
*A more flexible alternative would be:*
2. We define a threshold of Resource capability: while the cluster hasn't
reached this threshold, we put submitted jobs on hold. Once we have reached the
threshold, we let jobs pass through.
This is very similar to an already existing concept, the SafeMode in HDFS
([https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Safemode]).
Back to my GPU example above, the threshold could be: 8 vCores, 16 GB of
memory, 3 GPUs. By defining a threshold like this (see the configuration sketch
below), we can ensure that most of the submitted jobs won't be lost, just
"parked" until the NMs are registered.
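Expressed as configuration, the example threshold could look like the following (these property names are hypothetical and only illustrate the proposal):
{code:java}
<!-- Hypothetical properties: the RM stays in SafeMode and parks submitted
     apps until the aggregate registered NM capacity reaches these values. -->
<property>
  <name>yarn.resourcemanager.safemode.threshold.memory-mb</name>
  <value>16384</value>
</property>
<property>
  <name>yarn.resourcemanager.safemode.threshold.vcores</name>
  <value>8</value>
</property>
<property>
  <name>yarn.resourcemanager.safemode.threshold.resource-types.gpu</name>
  <value>3</value>
</property>{code}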
The final solution could be the Resource threshold alone, or the combination of
the threshold and a timeout value; a minimal sketch of the combined check
follows below. I'm open to any other suggestions as well.
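Here is a minimal, self-contained sketch of how the combined gate could behave. This is illustration only; the class and all its names are made up and do not exist in the RM:
{code:java}
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of the proposed RM SafeMode gate: applications are parked until
 * either the aggregate registered NM capacity reaches a configured
 * threshold, or (one possible interpretation of the combined approach)
 * a configured timeout has elapsed.
 */
public class SafeModeGate {
  private final Map<String, Long> threshold; // resource name -> required amount
  private final long timeoutMs;              // fallback timeout
  private final long startTimeMs = System.currentTimeMillis();

  public SafeModeGate(Map<String, Long> threshold, long timeoutMs) {
    this.threshold = threshold;
    this.timeoutMs = timeoutMs;
  }

  /** True once submitted apps may pass through to the scheduler. */
  public boolean canAdmitApps(Map<String, Long> registeredClusterCapacity) {
    boolean thresholdReached = threshold.entrySet().stream().allMatch(
        e -> registeredClusterCapacity.getOrDefault(e.getKey(), 0L) >= e.getValue());
    boolean timedOut = System.currentTimeMillis() - startTimeMs >= timeoutMs;
    return thresholdReached || timedOut;
  }

  public static void main(String[] args) {
    Map<String, Long> threshold = new HashMap<>();
    threshold.put("memory-mb", 16384L);
    threshold.put("vcores", 8L);
    threshold.put("gpu", 3L);
    SafeModeGate gate = new SafeModeGate(threshold, 60_000L);

    // Only the "normal" NMs have registered so far: apps stay parked.
    Map<String, Long> capacity = new HashMap<>();
    capacity.put("memory-mb", 16384L);
    capacity.put("vcores", 8L);
    System.out.println(gate.canAdmitApps(capacity)); // false (no gpu yet)

    // The GPU node registers: threshold reached, apps pass through.
    capacity.put("gpu", 100L);
    System.out.println(gate.canAdmitApps(capacity)); // true
  }
}{code}
With the timeout acting as a fallback, apps would never be parked indefinitely even if a rare-resource NM never registers at all.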
*Last but not least, a very easy way to reproduce the issue on a 3 node
cluster:*
1. Configure a resource type, named 'testres'.
2. Node1 runs the RM; Node2 and Node3 run NMs
3. Node2 has 1 testres
4. Node3 has 0 testres
5. Stop all nodes
6. Start RM on Node1
7. Start NM on Node3 (the one without the resource)
8. Start a pi job, requesting 1 testres for the AM
Here's the command to start the job:
{code:java}
MY_HADOOP_VERSION=3.3.0-SNAPSHOT
pushd /opt/hadoop
bin/yarn jar "./share/hadoop/mapreduce/hadoop-mapreduce-examples-$MY_HADOOP_VERSION.jar" pi \
  -Dyarn.app.mapreduce.am.resource.testres=1 1 1000
popd{code}
*Configurations*:
node1: yarn-site.xml of ResourceManager:
{code:java}
<property>
<name>yarn.resource-types</name>
<value>testres</value>
</property>{code}
node2: yarn-site.xml of NodeManager:
{code:java}
<property>
<name>yarn.resource-types</name>
<value>testres</value>
</property>
<property>
<name>yarn.nodemanager.resource-type.testres</name>
<value>1</value>
</property>{code}
node3: yarn-site.xml of NodeManager:
{code:java}
<property>
<name>yarn.resource-types</name>
<value>testres</value>
</property>{code}
Please see the full process logs from the RM, the NMs, and the YARN client attached.