[ https://issues.apache.org/jira/browse/YARN-2848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208941#comment-14208941 ]
Craig Welch commented on YARN-2848:
-----------------------------------

bq. IIUC, this JIRA is to tackle the cases which app has some special requirements on resource requests (including but not limited to black list nodes, node labels expression, etc.) and RM want to return headroom considering such factors to AM.

Well, yes, although the extent to which they are "special" isn't clear. [YARN-1680] surfaces this as a bug (something of a design miss) for blacklisting of resources, which has been around for some time. Node labels were added recently, but with an eye to being used by processes which will want accurate headroom, userlimit, etc. So the problem already exists; it's not "something new" we're choosing to introduce. It's rather a way of resolving inconsistencies which exist because of functionality which is perhaps not fully complete wrt the rest of the system. Insofar as we want applications to work with constraints on the nodes they use, we will need to solve this problem in some way, or do away with headroom and/or user limits as such, which is not a very attractive choice.

bq. My major concern of this is it will bring more computation complexity in RM side – we already have very heavy computation when trying to allocate containers, like locality/hierachy-of-queues/user-limit/headroom/node-labels

The idea is to minimize the calculation needed during allocation by making adjustments to resources only as needed by external events, which should be relatively infrequent with respect to any given application.

bq. if we trying to resolve the problem by handling events (such as node label change, black node list change, etc.) at app level, it will be very problematic, since some of the operations cannot be even done in O( n ) time.
bq. So I think if some operation have complex of O( n ), (n can be as large as #app in the cluster), we should be very discreet to such operation.

So, the suggestion is not to have the activity which accepts a node label change, or a node addition or removal from the cluster, synchronously notify all applications of that change. Rather, applications check for changes relevant to them: changes to the nodes held by a label they care about (label level info), and node additions or removals relevant to their blacklisting (cluster level info). An application adjusts its resource view only when it determines it is necessary to do so. At the level of the cluster handling the addition or removal of a node, or changes to the nodes for a node label, nothing more than an indication of "last change" for the resources needs to occur. Applications will simply check for the "change indications" that they care about and take action as needed. It should be as efficient and lightweight as possible, and would not impose any O(n) (where n = #app in cluster) operations on any single/synchronous code path.

> (FICA) Applications should maintain an application specific 'cluster'
> resource to calculate headroom and userlimit
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-2848
>                 URL: https://issues.apache.org/jira/browse/YARN-2848
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: capacityscheduler
>            Reporter: Craig Welch
>            Assignee: Craig Welch
>
> Likely solutions to [YARN-1680] (properly handling node and rack blacklisting
> with cluster level node additions and removals) will entail managing an
> application-level "slice" of the cluster resource available to the
> application for use in accurately calculating the application headroom and
> user limit.
> There is an assumption that events which impact this resource
> will occur less frequently than the need to calculate headroom, userlimit,
> etc (which is a valid assumption given that the calculation occurs on every
> allocation heartbeat).
> Given that, the application should (with assistance from cluster-level
> code...) detect changes to the composition of the cluster (node addition,
> removal) and when those have occurred, calculate an application specific
> cluster resource by comparing cluster nodes to its own blacklist (both rack
> and individual node). I think it makes sense to include nodelabel
> considerations in this calculation, as it will be efficient to do both at
> the same time, and the single resource value reflecting both constraints could
> then be used for efficient frequent headroom and userlimit calculations while
> remaining highly accurate. The application would need to be made aware of
> nodelabel changes it is interested in (the application or removal of labels
> of interest to the application to/from nodes). For this purpose, the
> application submission's nodelabel expression would be used to determine the
> nodelabel impact on the resource used to calculate userlimit and headroom.
> (Cases where the application elected to request resources not using the
> application level label expression are out of scope for this, but for the
> common usecase of an application which uses a particular expression
> throughout, userlimit and headroom would be accurate.) This could also provide
> an overall mechanism for handling application-specific resource constraints
> which might be added in the future.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
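
To make the "last change" indication discussed above a bit more concrete, here is a minimal sketch of one way it might look. Everything in it is hypothetical: ClusterView, AppClusterResource, the change stamp counter and the use of a plain memory figure in place of a full Resource are assumptions for illustration, not existing YARN classes and not the patch for this issue. It covers only the cluster level / blacklist case; a per-label stamp would presumably follow the same pattern.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical cluster-level bookkeeping; not an existing YARN class.
class ClusterView {
    // nodeId -> memory in MB (a stand-in for a full Resource)
    private final Map<String, Long> nodeMemoryMb = new ConcurrentHashMap<>();
    // The "last change" indication: bumped on every node addition or removal.
    private final AtomicLong changeStamp = new AtomicLong();

    void addNode(String nodeId, long memoryMb) {
        nodeMemoryMb.put(nodeId, memoryMb);
        changeStamp.incrementAndGet(); // O(1); no application is notified
    }

    void removeNode(String nodeId) {
        if (nodeMemoryMb.remove(nodeId) != null) {
            changeStamp.incrementAndGet();
        }
    }

    long changeStamp() { return changeStamp.get(); }

    Map<String, Long> nodes() { return nodeMemoryMb; }
}

// Hypothetical per-application view; assumed to be used from a single thread.
class AppClusterResource {
    private final ClusterView cluster;
    private final Set<String> blacklist = new HashSet<>();
    private long seenStamp = -1;   // forces a computation on first use
    private boolean blacklistDirty;
    private long cachedMemoryMb;
    int recomputations;            // exposed only to show how rarely the scan runs

    AppClusterResource(ClusterView cluster) { this.cluster = cluster; }

    void blacklistNode(String nodeId) {
        if (blacklist.add(nodeId)) {
            blacklistDirty = true;
        }
    }

    // Called on each allocation heartbeat; cheap unless something changed.
    long availableMemoryMb() {
        // Read the stamp before scanning so a concurrent change is picked up
        // by the next call rather than missed.
        long current = cluster.changeStamp();
        if (current != seenStamp || blacklistDirty) {
            long total = 0;
            for (Map.Entry<String, Long> e : cluster.nodes().entrySet()) {
                if (!blacklist.contains(e.getKey())) {
                    total += e.getValue();
                }
            }
            cachedMemoryMb = total;
            seenStamp = current;
            blacklistDirty = false;
            recomputations++;
        }
        return cachedMemoryMb;
    }
}
```

In this sketch the node add/remove path does constant work regardless of the number of applications, and each application pays for a scan of the node list only on the first heartbeat after a change it can observe.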