[ 
https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14529911#comment-14529911
 ] 

Wangda Tan commented on YARN-1680:
----------------------------------

[~cwelch]:
bq. I'd suggest that nodelable specific headroom logic probably doesn't belong 
there either.
Node-label-specific headroom logic is different. The approach in my mind is that all 
apps in a queue or user share the same headroom under a label; there's no per-app 
headroom computation logic. In other words, I think we shouldn't support computing 
headroom of "A && B" for app1 and headroom of "C || D" for app2 on the RM side.
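To illustrate what I mean, a rough sketch (class and field names are made up, not 
actual CapacityScheduler code): headroom is computed once per queue/user/partition 
and shared by every app under that key, rather than evaluated per app per label 
expression.

{code:java}
// Illustrative sketch only -- not the real CS classes.
// Headroom is keyed by (queue, user, node-label partition) and shared by all
// apps under that key; there is no per-app evaluation of expressions like "A && B".
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class SharedLabelHeadroom {
  // key: queue + "/" + user + "/" + nodeLabelPartition
  private final Map<String, Long> headroomMB = new ConcurrentHashMap<>();

  // RM side: recompute when the partition's resources or the user's usage change.
  void update(String queue, String user, String partition, long availableMB) {
    headroomMB.put(queue + "/" + user + "/" + partition, availableMB);
  }

  // Every app of this queue/user asking about this partition gets the same value.
  long headroomFor(String queue, String user, String partition) {
    return headroomMB.getOrDefault(queue + "/" + user + "/" + partition, 0L);
  }
}
{code}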

bq. Better than a deadlock, but not as good as if it had received accurate 
headroom and could have avoided the reactionary delay
Actually I think this statement may not be true. Assume we compute an accurate 
headroom for an app; that doesn't mean the app can actually get as much resource as 
we computed. In CS, apps get resources in FIFO order, so when you tell an app "you 
can use 40 GB", that really means "if you are at the head of the queue, and the 
queue is at the head of all queues, you can get 40 GB". If you're in some random 
place in the queue, you may not be able to get it even after hours.
So my point is: accurate headroom can still starve apps. In any case, an app needs 
its own "delay" before preempting its own containers to avoid starvation.
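For example, the app-side "delay" I mean could look roughly like this (a sketch 
only; the policy class and timeout knob are made up, not the actual MR AM logic):

{code:java}
// Sketch of the AM-side "delay" idea: even with an accurate headroom, only
// preempt our own containers (e.g. reducers) after a starvation timeout,
// because headroom is a FIFO promise, not a guarantee.
class SelfPreemptionPolicy {
  private final long starvationTimeoutMs;   // illustrative knob, not a real config
  private long pendingSinceMs = -1;

  SelfPreemptionPolicy(long starvationTimeoutMs) {
    this.starvationTimeoutMs = starvationTimeoutMs;
  }

  /** Call on every AM heartbeat. Returns true when we should start preempting. */
  boolean shouldPreempt(boolean hasPendingRequests, boolean gotAnyAllocation, long nowMs) {
    if (!hasPendingRequests || gotAnyAllocation) {
      pendingSinceMs = -1;           // not starving, reset the clock
      return false;
    }
    if (pendingSinceMs < 0) {
      pendingSinceMs = nowMs;        // start counting starvation time
    }
    return nowMs - pendingSinceMs >= starvationTimeoutMs;
  }
}
{code}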

[~jianhe]:
bq. Doing the calculation in one place is still a more accurate snapshot than 
doing the calculations in multiple places. 
This may not always be true considering computation efficiency. On the RM side, we 
cannot scan all nodes for each application, so the result will only be an 
approximate one (for example, with hard locality, label expressions, 
affinity/anti-affinity, etc.), but on the application side, even with 10000 nodes, 
we can scan all nodes 100 times every second.
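As a rough sketch of what the application side can do with that kind of scan 
(illustrative names only, not a proposed patch): subtract the free memory of the 
nodes this AM has blacklisted from the headroom the RM reports.

{code:java}
// Sketch of an AM-side adjustment (illustrative names, not the real MR AM code):
// take the headroom the RM reports and subtract whatever free memory sits on
// nodes this AM has blacklisted, since the RM will never place our containers there.
import java.util.Map;
import java.util.Set;

class BlacklistAwareHeadroom {
  /**
   * @param reportedHeadroomMB headroom from the allocate response
   * @param freeMBPerNode      free memory per node, as tracked by the AM
   * @param blacklistedNodes   nodes this AM has blacklisted
   */
  static long adjust(long reportedHeadroomMB,
                     Map<String, Long> freeMBPerNode,
                     Set<String> blacklistedNodes) {
    long unusable = 0;
    for (String node : blacklistedNodes) {
      unusable += freeMBPerNode.getOrDefault(node, 0L);
    }
    return Math.max(0, reportedHeadroomMB - unusable);
  }
}
{code}

This is the kind of correction the AM can afford to do itself, because it already 
knows its own blacklist.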

> availableResources sent to applicationMaster in heartbeat should exclude 
> blacklistedNodes free memory.
> ------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-1680
>                 URL: https://issues.apache.org/jira/browse/YARN-1680
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: capacityscheduler
>    Affects Versions: 2.2.0, 2.3.0
>         Environment: SuSE 11 SP2 + Hadoop-2.3 
>            Reporter: Rohith
>            Assignee: Craig Welch
>         Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, 
> YARN-1680-v2.patch, YARN-1680.patch
>
>
> There are 4 NodeManagers with 8 GB each, so total cluster capacity is 32 GB. 
> Cluster slow start is set to 1.
> A job is running and its reducer tasks occupy 29 GB of the cluster. One 
> NodeManager (NM-4) became unstable (3 map tasks got killed), so the MRAppMaster 
> blacklisted the unstable NodeManager (NM-4). All reducer tasks are now running in 
> the cluster.
> The MRAppMaster does not preempt the reducers because, for the reducer preemption 
> calculation, the headRoom still includes the blacklisted node's memory. This makes 
> the job hang forever (the ResourceManager does not assign any new containers on 
> blacklisted nodes, but the availableResources it returns still counts that free 
> memory as part of the cluster's free memory).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
