[ https://issues.apache.org/jira/browse/YARN-9421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Szilard Nemeth updated YARN-9421:
---------------------------------
    Attachment: resourcemanager.log
                client-log.log
                nodemanager.log

> Implement SafeMode for ResourceManager by defining a resource threshold
> -----------------------------------------------------------------------
>
>                 Key: YARN-9421
>                 URL: https://issues.apache.org/jira/browse/YARN-9421
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Szilard Nemeth
>            Priority: Major
>         Attachments: client-log.log, nodemanager.log, resourcemanager.log
>
>
> We have a hypothetical testcase in our test suite that tests Resource Types.
> The test does the following:
> 1. Sets up a resource named "gpu".
> 2. Out of 9 NodeManager nodes, 1 node has 100 of "gpu".
> 3. It executes a sleep job with the resource requests
> "-Dmapreduce.reduce.resource.gpu=7" and
> "-Dyarn.app.mapreduce.am.resource.gpu=11".
> Sometimes, we encounter situations where the app submission fails with:
> {code:java}
> 2019-02-25 06:09:56,795 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: RM app submission failed in validating AM resource request for application application_1551103768202_0001
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request! Cannot allocate containers as requested resource is greater than maximum allowed allocation. Requested resource type=[gpu], Requested resource=<memory:1024, vCores:1, gpu: 11>, maximum allowed allocation=<memory:8192, vCores:1>, please note that maximum allowed allocation is calculated by scheduler based on maximum resource of registered NodeManagers, which might be less than configured maximum allocation=<memory:16003, vCores:4, gpu: 9223372036854775807>{code}
> It is clearly visible that the maximum allowed allocation does not contain any "gpu" resources.
>
> Looking into the logs further, I realized that sometimes the node holding the "gpu" resources registers only after the app is submitted.
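> The failure mode can be seen from the exception text: the scheduler derives the maximum allowed allocation from the NodeManagers registered so far, so a resource type is simply absent until a node advertising it has registered. A minimal, hypothetical model of that behavior (plain Java, not the actual YARN scheduler code; the class and method names are invented for illustration):
> {code:java}
> import java.util.ArrayList;
> import java.util.HashMap;
> import java.util.List;
> import java.util.Map;
>
> // Hypothetical model of how the scheduler's "maximum allowed allocation"
> // follows the registered NodeManagers: a resource type only appears in it
> // after a node advertising that resource registers.
> public class MaxAllocationModel {
>     // Each registered node reports its capacity as resourceName -> amount.
>     private final List<Map<String, Long>> registeredNodes = new ArrayList<>();
>
>     public void registerNode(Map<String, Long> capacity) {
>         registeredNodes.add(capacity);
>     }
>
>     // Per-resource maximum across the currently registered nodes.
>     public Map<String, Long> maximumAllowedAllocation() {
>         Map<String, Long> max = new HashMap<>();
>         for (Map<String, Long> node : registeredNodes) {
>             node.forEach((res, amount) -> max.merge(res, amount, Math::max));
>         }
>         return max;
>     }
>
>     public static void main(String[] args) {
>         MaxAllocationModel scheduler = new MaxAllocationModel();
>         // The node without the custom resource registers first:
>         scheduler.registerNode(Map.of("memory", 8192L, "vcores", 1L));
>         // "gpu" is absent, so a request with gpu=11 fails validation here.
>         System.out.println(scheduler.maximumAllowedAllocation().containsKey("gpu")); // false
>         // Once the GPU node registers, the same request would pass:
>         scheduler.registerNode(Map.of("memory", 16003L, "vcores", 4L, "gpu", 100L));
>         System.out.println(scheduler.maximumAllowedAllocation().get("gpu")); // 100
>     }
> }{code}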
> In a real-world situation, and even with this very specific test execution, we can't be sure in which order the NMs register with the RM.
> With the advent of resource types, this issue is more likely to surface.
> If we have a cluster where some "rare" resources like GPUs exist only on a few nodes out of a 100, we can quickly run into a situation where the NMs with GPUs register later than the normal nodes. While the critical NMs are still registering, we will most likely hit the same InvalidResourceRequestException if we submit jobs requesting GPUs.
> There is a naive solution to this:
> 1. Give the RM some time to wait for the NMs to register, and put submitted applications on hold in the meantime. This could work in some situations, but it is not the most flexible solution as different clusters can have different requirements. Of course, we can make this more flexible by making the timeout value configurable.
> *A more flexible alternative would be:*
> 2. We define a threshold of Resource capability: while we haven't reached this threshold, we put submitted jobs on hold. Once we have reached the threshold, we let jobs pass through.
> This is very similar to an already existing concept, SafeMode in HDFS ([https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Safemode]).
> Back to my GPU example above, the threshold could be: 8 vcores, 16 GB memory, 3 GPUs.
> Defining a threshold like this, we can ensure that most of the submitted jobs won't be lost, just "parked" until the NMs are registered.
> The final solution could be the resource threshold alone, or a combination of the threshold and a timeout value. I'm open to any other suggestions as well.
> *Last but not least, a very easy way to reproduce the issue on a 3-node cluster:*
> 1. Configure a resource type named 'testres'.
> 2. Node1 runs the RM; Node2 and Node3 run NMs.
> 3. Node2 has 1 testres.
> 4. Node3 has 0 testres.
> 5. Stop all nodes.
> 6. Start the RM on Node1.
> 7. Start the NM on Node3 (the one without the resource).
> 8. Start a pi job, requesting 1 testres for the AM.
> Here's the command to start the job:
> {code:java}
> MY_HADOOP_VERSION=3.3.0-SNAPSHOT;pushd /opt/hadoop;bin/yarn jar "./share/hadoop/mapreduce/hadoop-mapreduce-examples-$MY_HADOOP_VERSION.jar" pi -Dyarn.app.mapreduce.am.resource.testres=1 1 1000;popd{code}
>
> *Configurations:*
> node1: yarn-site.xml of the ResourceManager:
> {code:java}
> <property>
>   <name>yarn.resource-types</name>
>   <value>testres</value>
> </property>{code}
> node2: yarn-site.xml of the NodeManager:
> {code:java}
> <property>
>   <name>yarn.resource-types</name>
>   <value>testres</value>
> </property>
> <property>
>   <name>yarn.nodemanager.resource-type.testres</name>
>   <value>1</value>
> </property>{code}
> node3: yarn-site.xml of the NodeManager:
> {code:java}
> <property>
>   <name>yarn.resource-types</name>
>   <value>testres</value>
> </property>{code}
> Please see the full process logs from the RM, NM, and YARN client attached.
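> The threshold-plus-timeout proposal above could be sketched as a gate in front of app submission. This is a hypothetical model only, not actual YARN code; the class and method names (SafeModeGate, onNodeRegistered, admitSubmissions) are invented for illustration:
> {code:java}
> import java.util.HashMap;
> import java.util.Map;
>
> // Hypothetical sketch of the proposed RM SafeMode: submissions are "parked"
> // until the aggregate registered capacity reaches a configured per-resource
> // threshold, with an optional timeout after which the gate opens regardless.
> public class SafeModeGate {
>     private final Map<String, Long> threshold;     // e.g. vcores=8, memory=16384, gpu=3
>     private final Map<String, Long> registered = new HashMap<>();
>     private final long deadlineMillis;             // timeout escape hatch
>
>     public SafeModeGate(Map<String, Long> threshold, long timeoutMillis) {
>         this.threshold = threshold;
>         this.deadlineMillis = System.currentTimeMillis() + timeoutMillis;
>     }
>
>     // Called whenever a NodeManager registers: add its capacity to the total.
>     public void onNodeRegistered(Map<String, Long> capacity) {
>         capacity.forEach((res, amount) -> registered.merge(res, amount, Long::sum));
>     }
>
>     // Submissions pass once every threshold resource is satisfied,
>     // or once the configured timeout has elapsed.
>     public boolean admitSubmissions() {
>         if (System.currentTimeMillis() >= deadlineMillis) {
>             return true;
>         }
>         return threshold.entrySet().stream()
>             .allMatch(e -> registered.getOrDefault(e.getKey(), 0L) >= e.getValue());
>     }
>
>     public static void main(String[] args) {
>         // Threshold from the GPU example: 8 vcores, 16 GB memory, 3 GPUs.
>         SafeModeGate gate = new SafeModeGate(
>             Map.of("vcores", 8L, "memory", 16384L, "gpu", 3L), 3_600_000L);
>         gate.onNodeRegistered(Map.of("vcores", 4L, "memory", 8192L));
>         System.out.println(gate.admitSubmissions()); // false: no gpu registered yet
>         gate.onNodeRegistered(Map.of("vcores", 4L, "memory", 8192L, "gpu", 3L));
>         System.out.println(gate.admitSubmissions()); // true: threshold reached
>     }
> }{code}
> In this sketch, apps submitted while admitSubmissions() returns false would be queued rather than rejected with InvalidResourceRequestException; a real implementation would also have to handle node removal and make the threshold configurable.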