[
https://issues.apache.org/jira/browse/YARN-9421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16804805#comment-16804805
]
Adam Antal commented on YARN-9421:
----------------------------------
Thanks for filing this [~snemeth], it is an interesting proposal.
Just to add another corner case: what if we have three NMs with 1, 1 and 2
testres, respectively? We have an app requesting 2 testres, and a safemode
threshold of 2 testres. If the two 1-testres NMs come up first, the cluster
will have 2 testres and will exit safemode, but the app still fails with the
exception, because the only node that could host 2 testres is not yet
available.
What is the key feature we are trying to accomplish? The way I see it, the
problem here and in your example is that we actually depend on specific
node(s) coming up, not just on an aggregate amount of resources.
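To make this corner case concrete, here is a tiny standalone sketch (plain
Java, purely illustrative, using the three-NM example above with 1, 1 and 2
testres) showing that a cluster-wide threshold can be satisfied while no
single registered node can host the request:
{code:java}
import java.util.List;

public class SafeModeCornerCase {
    public static void main(String[] args) {
        // testres capacity of the NMs registered so far (the two 1-testres nodes)
        List<Integer> registeredNodeTestres = List.of(1, 1);
        int safeModeThreshold = 2;   // cluster-wide testres threshold
        int requestedTestres = 2;    // the app's resource request

        int clusterTotal = registeredNodeTestres.stream().mapToInt(Integer::intValue).sum();
        boolean exitsSafeMode = clusterTotal >= safeModeThreshold;     // true: 1 + 1 >= 2
        boolean someNodeFits = registeredNodeTestres.stream()
                .anyMatch(c -> c >= requestedTestres);                 // false: no single node has 2

        System.out.println("exits safemode: " + exitsSafeMode);        // true
        System.out.println("request can be placed: " + someNodeFits);  // false -> the app still fails
    }
}
{code}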
> Implement SafeMode for ResourceManager by defining a resource threshold
> -----------------------------------------------------------------------
>
> Key: YARN-9421
> URL: https://issues.apache.org/jira/browse/YARN-9421
> Project: Hadoop YARN
> Issue Type: New Feature
> Reporter: Szilard Nemeth
> Priority: Major
> Attachments: client-log.log, nodemanager.log, resourcemanager.log
>
>
> We have a hypothetical testcase in our test suite that tests Resource Types.
> The test does the following:
> 1. Sets up a resource named "gpu"
> 2. Out of 9 NodeManager nodes, 1 node has 100 of "gpu".
> 3. It executes a sleep job with resource requests:
> "-Dmapreduce.reduce.resource.gpu=7" and
> "-Dyarn.app.mapreduce.am.resource.gpu=11"
> Sometimes, we encounter situations where the app submission fails with:
> {code:java}
> 2019-02-25 06:09:56,795 WARN
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: RM app submission
> failed in validating AM resource request for application
> application_1551103768202_0001
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid
> resource request! Cannot allocate containers as requested resource is greater
> than maximum allowed allocation. Requested resource type=[gpu], Requested
> resource=<memory:1024, vCores:1, gpu: 11>, maximum allowed
> allocation=<memory:8192, vCores:1>, please note that maximum allowed
> allocation is calculated by scheduler based on maximum resource of registered
> NodeManagers, which might be less than configured maximum
> allocation=<memory:16003, vCores:4, gpu: 9223372036854775807>{code}
> It's clearly visible that the maximum allowed allocation does not have any
> "gpu" resources.
>
> Looking into the logs further, I realized that sometimes the node having the
> "gpu" resources is registered after the app is submitted.
> In a real-world situation, and even with this very special test execution, we
> can't be sure in which order the NMs register with the RM.
> With the advent of resource types, this issue is more likely to surface.
> If we have a cluster where "rare" resources like GPUs exist only on a few
> nodes out of 100, we can quickly run into a situation where the NMs with GPUs
> register later than the normal nodes. While these critical NMs are still
> registering, we will most likely hit the same InvalidResourceRequestException
> if we submit jobs requesting GPUs.
> There is a naive solution to this:
> 1. Give the RM some time to wait for the NMs to register themselves, and put
> submitted applications on hold in the meantime. This could work in some
> situations, but it's not the most flexible solution, as different clusters
> can have different requirements. Of course, we can make this more flexible by
> making the timeout value configurable.
> *A more flexible alternative would be:*
> 2. We define a threshold of Resource capability: while the cluster hasn't
> reached this threshold, we put submitted jobs on hold. Once the threshold is
> reached, we let jobs pass through.
> This is very similar to an already existing concept, the SafeMode in HDFS
> ([https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Safemode]).
> Back to my GPU example above, the threshold could be: 8 vcores, 16 GB memory,
> 3 GPUs.
> Defining a threshold like this, we can ensure that most of the submitted jobs
> won't be lost, just "parked" until the NMs are registered.
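> To make the proposal concrete, here is a minimal, self-contained sketch of
> such a gate (the class and method names below are hypothetical illustrations,
> not actual ResourceManager code): the RM would keep submissions "parked"
> while the aggregated capability of the registered NMs is below the configured
> threshold.
> {code:java}
> import java.util.HashMap;
> import java.util.Map;
>
> // Hypothetical sketch, not actual RM code: holds app submissions while the
> // aggregated capability of registered NMs is below a configured threshold.
> public class ResourceSafeModeGate {
>     // Threshold from the example above: 8 vcores, 16 GB memory, 3 GPUs
>     private final Map<String, Long> threshold = Map.of(
>             "vcores", 8L, "memory-mb", 16384L, "gpu", 3L);
>
>     // Aggregate capability of all NMs that have registered so far
>     private final Map<String, Long> registeredCapability = new HashMap<>();
>
>     public synchronized void onNodeRegistered(Map<String, Long> nodeCapability) {
>         nodeCapability.forEach((name, value) ->
>                 registeredCapability.merge(name, value, Long::sum));
>     }
>
>     // Submissions stay "parked" while this returns true
>     public synchronized boolean inSafeMode() {
>         return threshold.entrySet().stream().anyMatch(e ->
>                 registeredCapability.getOrDefault(e.getKey(), 0L) < e.getValue());
>     }
> }
> {code}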
> The final solution could be the Resource threshold alone, or the combination
> of the threshold and the timeout value. I'm open to any other suggestions as
> well.
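> As a sketch of what the combined configuration could look like (the property
> names below are hypothetical, for illustration only; they are not existing
> YARN configuration keys):
> {code:java}
> <!-- Hypothetical keys, for illustration only -->
> <property>
>   <name>yarn.resourcemanager.safemode.threshold.resources</name>
>   <value>memory-mb=16384,vcores=8,gpu=3</value>
> </property>
> <property>
>   <name>yarn.resourcemanager.safemode.max-wait-ms</name>
>   <value>120000</value>
> </property>{code}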
> *Last but not least, a very easy way to reproduce the issue on a 3 node
> cluster:*
> 1. Configure a resource type, named 'testres'.
> 2. Node1 runs the RM, Node2 and Node3 run the NMs
> 3. Node2 has 1 testres
> 4. Node3 has 0 testres
> 5. Stop all nodes
> 6. Start RM on Node1
> 7. Start NM on Node3 (the one without the resource)
> 8. Start a pi job, request 1 testres for the AM
> Here's the command to start the job:
> {code:java}
> MY_HADOOP_VERSION=3.3.0-SNAPSHOT;pushd /opt/hadoop;bin/yarn jar
> "./share/hadoop/mapreduce/hadoop-mapreduce-examples-$MY_HADOOP_VERSION.jar"
> pi -Dyarn.app.mapreduce.am.resource.testres=1 1 1000;popd{code}
>
> *Configurations*:
> node1: yarn-site.xml of ResourceManager:
> {code:java}
> <property>
>   <name>yarn.resource-types</name>
>   <value>testres</value>
> </property>{code}
> node2: yarn-site.xml of NodeManager:
> {code:java}
> <property>
>   <name>yarn.resource-types</name>
>   <value>testres</value>
> </property>
> <property>
>   <name>yarn.nodemanager.resource-type.testres</name>
>   <value>1</value>
> </property>{code}
> node3: yarn-site.xml of NodeManager:
> {code:java}
> <property>
>   <name>yarn.resource-types</name>
>   <value>testres</value>
> </property>{code}
> Please see the full process logs of the RM, NM and YARN client attached.