[ https://issues.apache.org/jira/browse/YARN-9421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16808567#comment-16808567 ]

Szilard Nemeth edited comment on YARN-9421 at 4/3/19 10:05 AM:
---------------------------------------------------------------

[~adam.antal]: Coming back to your corner case: as [~wilfreds] said, this can 
happen with any of the default resources, such as memory, vcores, etc.
Do you still have concerns?

[~eyang]: Thanks for your comments!
You are right about the concern that the cluster can change frequently. I 
haven't mentioned it yet, but I meant to: I want to use the safemode mechanism 
only on startup. If we define a low enough timeout value, jobs can't queue up, 
so we don't use much memory. I also agree that safemode shouldn't be the 
default behavior; I never intended it to be: it is definitely planned as an 
opt-in feature.
Does this answer all of your concerns / questions? I didn't really get the SLA 
part, sorry.
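
To make this concrete, here is a minimal sketch of the opt-in, startup-only 
behavior I have in mind. The class and the property keys are hypothetical, 
just to illustrate the shape of the feature:
{code:java}
// Hypothetical sketch only: neither this class nor these property keys
// exist in YARN; they illustrate the proposed opt-in behavior.
public class SafeModeMonitor {
  // Illustrative config keys; real names would be decided in the patch.
  public static final String SAFEMODE_ENABLED =
      "yarn.resourcemanager.safemode.enabled";
  public static final String SAFEMODE_TIMEOUT_MS =
      "yarn.resourcemanager.safemode.timeout-ms";

  private final boolean enabled;
  private final long timeoutMs;
  private final long startTimeMs = System.currentTimeMillis();

  public SafeModeMonitor(boolean enabled, long timeoutMs) {
    this.enabled = enabled;
    this.timeoutMs = timeoutMs;
  }

  /** Submissions are parked only during the startup window. */
  public boolean shouldParkSubmissions() {
    if (!enabled) {
      return false; // opt-in: default behavior stays unchanged
    }
    long elapsedMs = System.currentTimeMillis() - startTimeMs;
    // A low timeout bounds how long jobs can queue up, so memory usage
    // from parked submissions stays small.
    return elapsedMs < timeoutMs;
  }
}
{code}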


> Implement SafeMode for ResourceManager by defining a resource threshold
> -----------------------------------------------------------------------
>
>                 Key: YARN-9421
>                 URL: https://issues.apache.org/jira/browse/YARN-9421
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Szilard Nemeth
>            Priority: Major
>         Attachments: client-log.log, nodemanager.log, resourcemanager.log
>
>
> We have a hypothetical testcase in our test suite that tests Resource Types.
>  The test does the following: 
>  1. Sets up a resource named "gpu"
>  2. Out of 9 NodeManager nodes, 1 node has 100 of "gpu".
>  3. It executes a sleep job with resource requests: 
>  "-Dmapreduce.reduce.resource.gpu=7" and 
> "-Dyarn.app.mapreduce.am.resource.gpu=11"
> Sometimes we encounter situations where the app submission fails with: 
> {code:java}
> 2019-02-25 06:09:56,795 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: RM app submission 
> failed in validating AM resource request for application 
> application_1551103768202_0001
>  org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request! Cannot allocate containers as requested resource is greater 
> than maximum allowed allocation. Requested resource type=[gpu], Requested 
> resource=<memory:1024, vCores:1, gpu: 11>, maximum allowed 
> allocation=<memory:8192, vCores:1>, please note that maximum allowed 
> allocation is calculated by scheduler based on maximum resource of registered 
> NodeManagers, which might be less than configured maximum 
> allocation=<memory:16003, vCores:4, gpu: 9223372036854775807>{code}
> It's clearly visible that the maximum allowed allocation does not have any 
> "gpu" resources.
>  
> Looking into the logs further, I realized that sometimes the node holding the 
> "gpu" resources is registered after the app is submitted.
>  In a real-world situation, and even with this very special test execution, 
> we can't be sure in which order NMs register with the RM.
>  With the advent of resource types, this issue is more likely to surface.
> If we have a cluster where some "rare" resources like GPUs exist on only a 
> few nodes out of 100, we can quickly run into a situation where the NMs with 
> GPUs register later than the normal nodes. While these critical NMs are 
> still registering, we will most likely hit the same 
> InvalidResourceRequestException if we submit jobs requesting GPUs.
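> The following plain-Java sketch (not actual scheduler code) illustrates why 
> the maximum allowed allocation in the exception above has no "gpu" 
> component: the scheduler derives it only from the resources of 
> already-registered NMs.
> {code:java}
> import java.util.HashMap;
> import java.util.List;
> import java.util.Map;
> 
> public class MaxAllocationDemo {
>   // Component-wise maximum over the registered nodes' resources,
>   // mimicking how the scheduler derives the maximum allowed allocation.
>   static Map<String, Long> maxAllocation(List<Map<String, Long>> nodes) {
>     Map<String, Long> max = new HashMap<>();
>     for (Map<String, Long> node : nodes) {
>       node.forEach((type, value) -> max.merge(type, value, Math::max));
>     }
>     return max;
>   }
> 
>   public static void main(String[] args) {
>     // Only the "normal" node has registered; the GPU node has not yet.
>     Map<String, Long> normalNode = Map.of("memory", 8192L, "vcores", 1L);
>     // The result has only memory and vcores, no "gpu" entry at all, so a
>     // request for gpu=11 can never fit: hence the exception above.
>     System.out.println(maxAllocation(List.of(normalNode)));
>   }
> }
> {code}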
> There is a naive solution to this: 
>  1. Give the RM some time for NMs to register themselves, and put submitted 
> applications on hold in the meantime. This could work in some situations, 
> but it's not the most flexible solution, as different clusters can have 
> different requirements. Of course, we can make this more flexible by making 
> the timeout value configurable.
> *A more flexible alternative would be:*
>  2. We define a threshold of Resource capability: until this threshold is 
> reached, we put submitted jobs on hold. Once the threshold is reached, we 
> let jobs pass through. 
>  This is very similar to an already existing concept, the SafeMode in HDFS 
> ([https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Safemode]).
>  Back to my GPU example above, the threshold could be: 8 vcores, 16GB, 3 
> GPUs. 
>  By defining a threshold like this, we can ensure that most of the submitted 
> jobs won't be lost, just "parked" until the NMs are registered.
> The final solution could be the Resource threshold alone, or the combination 
> of the threshold and a timeout value. I'm open to any other suggestions as 
> well.
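> To illustrate, a hypothetical sketch of the threshold check itself (all 
> names are made up): the RM keeps submissions parked until the aggregate 
> capacity of the registered NMs covers the configured threshold, optionally 
> giving up the wait after a timeout.
> {code:java}
> import java.util.Map;
> 
> public class ResourceThresholdCheck {
>   // e.g. vcores=8, memory=16384 (MB), gpu=3 from the example above
>   private final Map<String, Long> threshold;
>   private final long timeoutMs;
>   private final long startTimeMs = System.currentTimeMillis();
> 
>   public ResourceThresholdCheck(Map<String, Long> threshold, long timeoutMs) {
>     this.threshold = threshold;
>     this.timeoutMs = timeoutMs;
>   }
> 
>   /** True once every resource type in the threshold is covered by the
>    *  aggregate capacity of the registered NodeManagers. */
>   public boolean thresholdReached(Map<String, Long> registeredCapacity) {
>     return threshold.entrySet().stream().allMatch(e ->
>         registeredCapacity.getOrDefault(e.getKey(), 0L) >= e.getValue());
>   }
> 
>   /** Threshold combined with a timeout: leave safemode when either the
>    *  threshold is reached or the timeout has expired. */
>   public boolean canLeaveSafeMode(Map<String, Long> registeredCapacity) {
>     return thresholdReached(registeredCapacity)
>         || System.currentTimeMillis() - startTimeMs >= timeoutMs;
>   }
> }
> {code}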
> *Last but not least, a very easy way to reproduce the issue on a 3-node 
> cluster:* 
>  1. Configure a resource type, named 'testres'.
>  2. Node1 runs the RM; Node2 and Node3 run NMs
>  3. Node2 has 1 testres
>  4. Node3 has 0 testres
>  5. Stop all nodes
>  6. Start RM on Node1
>  7. Start NM on Node3 (the one without the resource)
>  8. Start a pi job, request 1 testres for the AM
> Here's the command to start the job:
> {code:java}
> MY_HADOOP_VERSION=3.3.0-SNAPSHOT;pushd /opt/hadoop;bin/yarn jar 
> "./share/hadoop/mapreduce/hadoop-mapreduce-examples-$MY_HADOOP_VERSION.jar" 
> pi -Dyarn.app.mapreduce.am.resource.testres=1 1 1000;popd{code}
>  
> *Configurations*: 
>  node1: yarn-site.xml of ResourceManager:
> {code:java}
> <property>
>  <name>yarn.resource-types</name>
>  <value>testres</value>
> </property>{code}
> node2: yarn-site.xml of NodeManager:
> {code:java}
> <property>
>  <name>yarn.resource-types</name>
>  <value>testres</value>
> </property>
> <property>
>  <name>yarn.nodemanager.resource-type.testres</name>
>  <value>1</value>
> </property>{code}
> node3: yarn-site.xml of NodeManager:
> {code:java}
> <property>
>  <name>yarn.resource-types</name>
>  <value>testres</value>
> </property>{code}
> Please see the full process logs from the RM, NM, and YARN client attached.
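> Finally, if we go with the threshold approach, the threshold itself could be 
> expressed as a single configuration value. A purely illustrative parser (the 
> property name and format are made up for this sketch):
> {code:java}
> import java.util.HashMap;
> import java.util.Map;
> 
> public class ThresholdConfig {
>   // Made-up property and format, e.g.:
>   // yarn.resourcemanager.safemode.resource-threshold=memory=16384,vcores=8,gpu=3
>   static Map<String, Long> parseThreshold(String value) {
>     Map<String, Long> threshold = new HashMap<>();
>     for (String entry : value.split(",")) {
>       String[] kv = entry.trim().split("=");
>       threshold.put(kv[0], Long.parseLong(kv[1]));
>     }
>     return threshold;
>   }
> 
>   public static void main(String[] args) {
>     // Threshold from the GPU example: 16 GB memory, 8 vcores, 3 GPUs.
>     System.out.println(parseThreshold("memory=16384,vcores=8,gpu=3"));
>   }
> }
> {code}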


