Craig Welch commented on YARN-1680:


bq. Actually I think this statement may not true, assume we compute an accurate 
headroom for app, but that doesn't mean the app can get as much resource as we 
compute...you may not be able to get it after hours.

This would only occur if other applications were allocated those resources, in 
which case the headroom would drop and the application would be made aware of it 
via headroom updates.  The scenario you propose as a counterexample is 
therefore inaccurate.  It is the case that accurate headroom (including a fix for 
the blacklist issue here) will result in faster overall job completion than the 
reactive approach of waiting for allocation failures.


bq. OTOH, blacklisting / hard-locality are app-decisions. From the platform's 
perspective, those nodes, free or otherwise, are actually available for apps to 

Not quite so: the scheduler respects the blacklist and doesn't allocate 
containers to an app when doing so would run counter to the app's blacklisting.

That said, so far the discussion regarding the proposal has largely been about 
where the activity should live.  Let's put that aside for a moment and 
concentrate on the approach itself.  With API additions, additional library 
work, etc., it should be possible to do the same thing outside the scheduler as 
within it.  Whether and what to do in or out of the scheduler still needs to be 
settled, of course, but a decision on how the headroom will be adjusted is 
needed in any case, and it is needed before putting together the change 
wherever it ends up living.


"where app headroom is finalized" == in the scheduler OR in a library 
available to / used by AMs.  If externalized, APIs will obviously need to be 
added to expose whatever information is not yet available outside the scheduler.

- Retain a node/rack blacklist where app headroom is finalized (already the case)
- Add a "last change" timestamp or incrementing counter at the cluster level to 
track node addition/removal (which is what exists for "cluster black/white" 
listing afaict), updated when those events occur
- Add a "last change" timestamp/counter where app headroom is finalized to 
track blacklist changes
- Have "last updated" values where app headroom is finalized to track the 
above two "last change" values, updated when blacklist values are recalculated
- On headroom calculation, where app headroom is finalized, check whether there 
are any entries in the blacklist or a "blacklist deduction" value in its 
ResourceUsage entry (see below), to determine if the blacklist must be taken 
into account
- If the blacklist must be taken into account, check the "last updated" values 
for both cluster and app blacklist changes; if and only if either is stale 
(last updated != last change), recalculate the blacklist deduction
- When calculating the blacklist deduction, use Chen He's basic logic from 
existing patches.  Place the deduction value where app headroom is finalized. 
NodeLabels could be taken into account as well: only blacklist entries which 
match the node-label expression used by the application would be added to the 
deduction, if an expression is in play
- Whenever the headroom is generated where app headroom is finalized, perform 
the blacklist deduction
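To make the counter/staleness mechanics above concrete, here is a minimal 
standalone sketch.  All class and field names (ClusterState, AppHeadroomInfo, 
headroomMb, etc.) are hypothetical illustrations, not the actual YARN scheduler 
types, and the deduction uses a simple sum over blacklisted nodes rather than 
the full logic in the existing patches:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class BlacklistHeadroom {

  /** Cluster-level state; "last change" counter bumped on node add/remove. */
  static class ClusterState {
    long nodeChangeCounter = 0;
    final Map<String, Long> nodeFreeMb = new HashMap<>();

    void addNode(String node, long freeMb) {
      nodeFreeMb.put(node, freeMb);
      nodeChangeCounter++;
    }

    void removeNode(String node) {
      nodeFreeMb.remove(node);
      nodeChangeCounter++;
    }
  }

  /** Per-app state kept "where app headroom is finalized". */
  static class AppHeadroomInfo {
    final Set<String> blacklist = new HashSet<>();
    long blacklistChangeCounter = 0;  // app-level "last change"
    long seenClusterCounter = -1;     // "last updated" vs. cluster changes
    long seenBlacklistCounter = -1;   // "last updated" vs. app blacklist changes
    long blacklistDeductionMb = 0;    // cached deduction

    void blacklistNode(String node) {
      if (blacklist.add(node)) {
        blacklistChangeCounter++;
      }
    }
  }

  /**
   * Compute headroom, recalculating the blacklist deduction only when either
   * "last updated" value is stale relative to its "last change" counter.
   */
  static long headroomMb(ClusterState cluster, AppHeadroomInfo app,
                         long rawHeadroomMb) {
    boolean mustConsider =
        !app.blacklist.isEmpty() || app.blacklistDeductionMb > 0;
    if (mustConsider) {
      boolean stale =
          app.seenClusterCounter != cluster.nodeChangeCounter
          || app.seenBlacklistCounter != app.blacklistChangeCounter;
      if (stale) {
        long deduction = 0;
        for (String node : app.blacklist) {
          // Sum the free memory of blacklisted nodes still in the cluster
          deduction += cluster.nodeFreeMb.getOrDefault(node, 0L);
        }
        app.blacklistDeductionMb = deduction;
        app.seenClusterCounter = cluster.nodeChangeCounter;
        app.seenBlacklistCounter = app.blacklistChangeCounter;
      }
    }
    return Math.max(0, rawHeadroomMb - app.blacklistDeductionMb);
  }
}
```

Note that the expensive recalculation is skipped entirely on the common path: 
if neither counter has moved since the last recalculation, the cached deduction 
is reused, which is the point of tracking "last change" vs. "last updated".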

> availableResources sent to applicationMaster in heartbeat should exclude 
> blacklistedNodes free memory.
> ------------------------------------------------------------------------------------------------------
>                 Key: YARN-1680
>                 URL: https://issues.apache.org/jira/browse/YARN-1680
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: capacityscheduler
>    Affects Versions: 2.2.0, 2.3.0
>         Environment: SuSE 11 SP2 + Hadoop-2.3 
>            Reporter: Rohith
>            Assignee: Craig Welch
>         Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, 
> YARN-1680-v2.patch, YARN-1680.patch
> There are 4 NodeManagers with 8GB each. Total cluster capacity is 32GB. Cluster 
> slow start is set to 1.
> A job is running; its reducer tasks occupy 29GB of the cluster. One NodeManager 
> (NM-4) became unstable (3 maps got killed), so the MRAppMaster blacklisted the 
> unstable NodeManager (NM-4). All reducer tasks are running in the cluster now.
> The MRAppMaster does not preempt the reducers because, for the reducer 
> preemption calculation, the headroom includes the blacklisted node's memory. 
> This makes the job hang forever (the ResourceManager does not assign any new 
> containers on blacklisted nodes but returns an availableResources value that 
> counts the cluster's free memory).

This message was sent by Atlassian JIRA