[jira] [Commented] (YARN-1040) De-link container life cycle from an Allocation
[ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212248#comment-15212248 ] Bikas Saha commented on YARN-1040:
--
bq. Hmmm.. Given that launching multiple processes, being a new feature, I feel that it should be fine to mandate the app to use new APIs, no?

In Tez/Spark, using the ability to launch multiple processes in containers will clearly need the new APIs on the NM. That could be an optional feature, local to that part of the code, that can be safely added and then turned on/off in an isolated manner by users. That is fine. But if, in order to use the new APIs for this one optional feature, we have to change Tez/Spark to redo their AM-RM implementations and update all their internals regarding the concepts of allocations and containers (where what the entire code used to consider containers are now allocations), then I hope we appreciate how destabilizing that change would be to those projects.

> De-link container life cycle from an Allocation
> ---
>
> Key: YARN-1040
> URL: https://issues.apache.org/jira/browse/YARN-1040
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager
> Affects Versions: 3.0.0
> Reporter: Steve Loughran
> Attachments: YARN-1040-rough-design.pdf
>
> The AM should be able to exec >1 process in a container, rather than have the
> NM automatically release the container when the single process exits.
> This would let an AM restart a process on the same container repeatedly,
> which for HBase would offer locality on a restarted region server.
> We may also want the ability to exec multiple processes in parallel, so that
> something could be run in the container while a long-lived process was
> already running. This can be useful in monitoring and reconfiguring the
> long-lived process, as well as shutting it down.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1040) De-link container life cycle from an Allocation
[ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208906#comment-15208906 ] Bikas Saha commented on YARN-1040:
--
It would be great if existing apps could use the changes in YARN-1040 to run more than a single process (sequentially or concurrently). If we use YARN-1040 to build the primitives here, then those primitives could be used for the broader work designed for services (which seems to be indicated in the design doc). Without YARN-1040, existing Java-based apps cannot use features like increasing container memory, because the JVM has to be restarted before it can grow to a larger size.

I can see the argument for asking users to use new APIs for new features, but requiring existing apps to change their AM/RM implementations (which have been stabilized with much effort) just to be able to launch multiple processes does not seem empathetic.

Separately from this, I have not been actively involved in the project for a while. Hence my understanding of the scope and semantic changes proposed here may be stale, and I may be inaccurate in thinking that they are fundamental enough to deserve a dedicated jira for a wider discussion. You guys can make the call on that.
[jira] [Commented] (YARN-1040) De-link container life cycle from an Allocation
[ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15202391#comment-15202391 ] Bikas Saha commented on YARN-1040:
--
This design doc effectively looks like a re-design of almost all core semantics of YARN. It probably deserves a wider discussion on the dev email list and under its own jira. Although it covers YARN-1040 and YARN-4726, the scope looks much wider, and careful thinking about backwards compatibility is needed. Conceptually, this changes the current semantic understanding of allocation and container that's widely understood externally. I am afraid that this jira, or just the folks on this thread, are not enough to make a decision on the given proposal.

As far as this jira is concerned, both the previous (say a) and new (say b) proposals sound similar, with startContainer_in_a renamed to startAllocation_in_b and startProcess_in_a renamed to startContainer_in_b. So we may be fine in that restricted part, minus the renamings.
[jira] [Updated] (YARN-4758) Enable discovery of AMs by containers
[ https://issues.apache.org/jira/browse/YARN-4758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated YARN-4758:
-
Description:
{color:red}
This is already discussed on the umbrella JIRA YARN-1489. Copying some of my condensed summary from the design doc (section 3.2.10.3) of YARN-4692.
{color}
Even after the existing work on work-preserving AM restart (Section 3.1.2 / YARN-1489), we still haven't solved the problem of old running containers not knowing where the new AM starts running after the previous AM crashes. This is an especially important problem to solve for long-running services, where we'd like to avoid killing service containers when AMs fail over. So far, we have left this as a task for the apps, but solving it in YARN is much more desirable.

[(Task) This looks very much like the service registry (YARN-913), but for app containers to discover their own AMs. Combining this requirement (of any container being able to find its AM across failovers) with those of services (being able to find through DNS where a service container is running - YARN-4757) will push our registry scalability needs much higher than those of just service endpoints. This calls for a more distributed solution for registry readers, something that is discussed in the comments sections of YARN-1489 and MAPREDUCE-6608. See comment https://issues.apache.org/jira/browse/YARN-1489?focusedCommentId=13862359&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13862359

> Enable discovery of AMs by containers
> -
>
> Key: YARN-4758
> URL: https://issues.apache.org/jira/browse/YARN-4758
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Vinod Kumar Vavilapalli
[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171244#comment-15171244 ] Bikas Saha commented on YARN-1011:
--
bq. If it absolutely wants a guaranteed container, we should allocate a guaranteed container and kill the opportunistic one. If it does not want, we can let the opportunistic container continue to run.

I get that. The question is which of the two is the default behavior? Next question to consider: if we let the opportunistic container run and count it as part of guaranteed capacity, then what prevents the node from killing it when the node's resources actually get over-used and something needs to be killed by the node? And how does the rest of the system (YARN + app) react to losing a guaranteed container?

bq. The overall cluster utilization is an implementation detail, its sole purpose is to reduce the chances of running into cases that need cross-node promotion.

I am sorry, I could not understand how that is so.

> [Umbrella] Schedule containers based on utilization of currently allocated
> containers
> -
>
> Key: YARN-1011
> URL: https://issues.apache.org/jira/browse/YARN-1011
> Project: Hadoop YARN
> Issue Type: New Feature
> Reporter: Arun C Murthy
> Attachments: patch-for-yarn-1011.patch, yarn-1011-design-v0.pdf,
> yarn-1011-design-v1.pdf, yarn-1011-design-v2.pdf
>
> Currently RM allocates containers and assumes resources allocated are
> utilized.
> RM can, and should, get to a point where it measures utilization of allocated
> containers and, if appropriate, allocate more (speculative?) containers.
[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container
[ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167889#comment-15167889 ] Bikas Saha commented on YARN-1040:
--
Vinod, the plan you are suggesting has merits. But my initial impression is that reworking allocations and containers is a much bigger change than what's proposed earlier in this jira, not only internally in YARN but also externally, in terms of how users of YARN think about the whole larger flow of allocations and containers. The proposal discussed earlier is of much smaller scope and, I believe, sufficient to take us where we need to go. And it does not need reworking of the RM-related flow of allocations and containers. E.g. it may not be necessary for the RM to understand single-use vs. multi-use vs. concurrent-use allocations. But for the RM-level changes you are suggesting, we may be on the path of convergence.

At this point, the discussion is complex enough that we may want to gather interested people, have it as a group outside jira comments, and then post the result back.
[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container
[ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167738#comment-15167738 ] Bikas Saha commented on YARN-1040:
--
My guess is that YARN-4725 may be redundant after we do this work, because we would then have exposed primitives that let apps make that happen. The arguments for YARN not doing it by itself remain the same: if it can be done easily by the app, and is very likely app-dependent without a one-size-fits-all answer, then let the app do it.

Coming back to this jira: yes, let's please track any first-class support for the notion of upgrades separately; it can be done as a follow-up. Perhaps we can put the design in a document and look at the next level of details. We can send email to the dev list after adding a more detailed document to this jira. Then, based on +ve feedback, we could go ahead with jiras/code. The devil is in the details :) This would be a significant change and we could use more eyes for reviews.

For the startProcess identifier, it may be useful for the app to provide the identifier in startProcess and then use it later to refer to the process, e.g. in stopProcess, vs. YARN trying to come up with identifiers. This may make the app's life easier because it could use meaningful names based on its own logic. We can discuss such details in the design document.
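The app-provided identifier idea above can be sketched as follows. This is a minimal illustration, not an existing YARN API: the class name, the method signatures, and the choice to report a name collision via a boolean are all assumptions made for this example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: an NM-side table of processes inside one container,
// keyed by identifiers the AM supplies (e.g. "regionserver-v2") rather than
// identifiers YARN invents.
public class ProcessTable {
    // app-chosen id -> launch command for the running process
    private final Map<String, String> processes = new ConcurrentHashMap<>();

    // Returns false if the app-chosen id is already in use, so the AM learns
    // about the collision instead of YARN silently renaming the process.
    public boolean startProcess(String id, String command) {
        return processes.putIfAbsent(id, command) == null;
    }

    // The same app-chosen id later addresses the process, e.g. to stop it.
    public boolean stopProcess(String id) {
        return processes.remove(id) != null;
    }

    public int runningCount() {
        return processes.size();
    }
}
```

With app-chosen names, a call like stopProcess("regionserver-v1") reads meaningfully in the app's own terms, which is the ease-of-use point made in the comment.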
[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container
[ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166633#comment-15166633 ] Bikas Saha commented on YARN-1040:
--
I am sorry if I caused a digression by mentioning Slider etc. I am not sure the upgrade scenario is the only one for this jira, since this jira covers a broader set. Even without upgrades, apps can change the processes they are running in a container without having to lose the container allocation. Identical calls of primitives could be used without the notion of upgrade. E.g. start a Java process first for a Java task, then launch a Python process for a Python task. To the NM this is identical to starting v1 and then starting v2. So while it makes sense for the second case to use an API called upgrade, it may not for the first one. (Unrelated to this jira, IMO, YARN should allow upgrade of app code without losing containers but not necessarily understand it deeply. E.g. YARN need not assume that an upgrade will need additional resources, or try to acquire them transparently for the application.)

For the purpose of this jira, here are my thoughts from when I opened YARN-1292 to delink process lifecycle from the container:
1) New API - acquireContainer - ask for the allocated resource. The API has a flag to specify whether process exit implies releaseContainer. This is for backwards compatibility, with a default of true. Apps that want to keep that behavior can explicitly pass true when using the new API; this is mainly to reduce the number of RPCs for apps like MR/Tez etc.
2) New API - startProcess - start the remote process.
3) New API - stopProcess - stop the remote process.
4) New API - releaseContainer - release the allocated resource.
5) Potentially a new API for localization, though in theory this could be separate.

Since this fine-grained control makes the protocol chatty, we can reduce the RPC traffic by having a new NM RPC, say NMCommand, that takes a sequence of API primitives that can be sent in one RPC. So the current startContainer API effectively becomes NMCommand(1, 2) and stopContainer becomes NMCommand(3, 4). This can be leveraged for backwards compatibility and rolling upgrades.

The above items would effectively delink process and container lifecycle and close out this jira. This provides the fine-grained control in core YARN that can be used for various scenarios, e.g. upgrades, without YARN understanding the scenarios. If we need to add higher-level notions for upgrades etc., those could be done as separate items. I hope that helps make my thoughts concrete within the scope of this jira.
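The five primitives and the batched NMCommand RPC described above can be sketched like this. All names are hypothetical, taken from the comment rather than from any shipped YARN interface; the legacy mappings mirror the NMCommand(1, 2) and NMCommand(3, 4) equivalences stated above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a batched NM command: a sequence of fine-grained
// primitives carried in a single RPC to keep the protocol from getting chatty.
public class NMCommand {
    public enum Primitive {
        ACQUIRE_CONTAINER,  // 1) ask for the allocated resource
        START_PROCESS,      // 2) start the remote process
        STOP_PROCESS,       // 3) stop the remote process
        RELEASE_CONTAINER,  // 4) release the allocated resource
        LOCALIZE            // 5) localize resources, possibly with no start
    }

    private final List<Primitive> steps;

    public NMCommand(Primitive... steps) {
        this.steps = Arrays.asList(steps);
    }

    public List<Primitive> steps() {
        return steps;
    }

    // Legacy startContainer == NMCommand(1, 2): acquire, then start.
    public static NMCommand legacyStartContainer() {
        return new NMCommand(Primitive.ACQUIRE_CONTAINER, Primitive.START_PROCESS);
    }

    // Legacy stopContainer == NMCommand(3, 4): stop, then release.
    public static NMCommand legacyStopContainer() {
        return new NMCommand(Primitive.STOP_PROCESS, Primitive.RELEASE_CONTAINER);
    }
}
```

Because the old startContainer/stopContainer calls map onto fixed two-step batches, existing apps keep working unchanged while new apps compose their own sequences.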
[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container
[ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15165668#comment-15165668 ] Bikas Saha commented on YARN-1040:
--
Agree with your scenarios. I am trying to figure out a way by which this does not become a YARN problem (both initial work and ongoing maintenance). E.g. we don't know for sure whether the resource needs to be x, 2x or 3x. This is an allocation decision and cannot be made without the RM's blessing. And increasing container resources is already work in progress and may become another NM primitive. Next, what is the ordering of the tasks during an upgrade? We could implement one of many possibilities, but then be stuck with bug-fixing or improving it, and potentially see it used as a precedent to implement yet another upgrade policy.

Hence my suggestion of creating composable primitives that can be used to easily implement these flows, leaving it to the apps to determine the exact upgrade paths. Perhaps Slider is a better place to wrap different upgrade possibilities using the composable primitives, e.g. SliderStopAllUpgradePolicy or SliderConcurrentUpgradePolicy. Or they could be provided as helper libs in YARN/NMClient so apps don't have to compose the primitives from scratch. The main aim is to keep core YARN/NM simple by creating primitives and layering complexity on top. This approach may be simpler and more incremental to develop, test and deploy. Of course, these are my personal design views :) Thoughts?
[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container
[ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163708#comment-15163708 ] Bikas Saha commented on YARN-1040:
--
I am not sure we need to place (somewhat artificial) constraints on the app when it's not clear that they practically affect YARN.
1) A container with no process should be allowed. Apps could terminate all running tasks of version A, then start running tasks of version B, when the two are not backwards compatible.
2) A container should be allowed to run multiple processes. This is similar to an existing process spawning more processes, except that the NM has to add the new process to existing monitoring/cgroups etc.
3) startProcess should be allowed with no process actually started. This will allow apps to localize new resources to an existing container. Alternatively, we could create a new localization API that's delinked from starting the process. But re-localization is an important related feature that we should look at supporting via this work, because currently it does not work since it's tied to starting a process.
4) Most current apps already communicate directly with their tasks and hence can shut them down when they are not needed. However, as suggested above, it may be useful for the NM to provide a feature whereby the previous task is shut down when a new task request is received. Alternatively, the NM could provide a stopProcess API to make that explicit.

IMO all of this should be allowed. The timeline could differ, with some parts allowed earlier and some later based on implementation effort.

Thinking ahead, it may be useful for the NM to accept a series of API calls within the same RPC (with the current mechanism supported as a single command entity for backwards compatibility). Then we will not have to build a lot of logic into the NM; the app can get all features by composing a multi-command entity. E.g.
Current start process = {acquire, localize, start} // where acquire means start container
Current shutdown process = {stop, release} // where release means give up container
Only localize = {localize}
Start another process = {localize, start}
Start another process after shutting down first process = {stop, start} or {stop, localize, start}
Start another process and then shut down the first process = {start, stop}
New container shutdown = {release} // at this point there may be 0 or more processes running, which will be stopped
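The relaxed rules argued for above (zero processes allowed, concurrent processes allowed, localize-only steps allowed) can be captured in a small checker. This is an illustrative sketch with made-up names, applied to the concatenation of all command batches over one container's lifetime; the only hard constraints kept are that the lifetime begins with acquire, releases at most once at the end, and never stops more processes than were started.

```java
import java.util.List;

// Hypothetical sketch: validate the concatenated sequence of primitives sent
// for one container over its lifetime, under the relaxed rules above.
public class SequenceRules {
    public enum Op { ACQUIRE, LOCALIZE, START, STOP, RELEASE }

    public static boolean isValid(List<Op> ops) {
        if (ops.isEmpty() || ops.get(0) != Op.ACQUIRE) {
            return false; // a lifetime starts by acquiring the allocation
        }
        int running = 0;
        for (int i = 1; i < ops.size(); i++) {
            switch (ops.get(i)) {
                case ACQUIRE:
                    return false; // already acquired
                case START:
                    running++; // multiple concurrent processes are fine
                    break;
                case STOP:
                    if (--running < 0) return false; // nothing left to stop
                    break;
                case LOCALIZE:
                    break; // localize-only, with no start, is deliberately legal
                case RELEASE:
                    // release ends the lifetime; any remaining processes get stopped
                    return i == ops.size() - 1;
            }
        }
        return true; // ending with zero running processes is also deliberately legal
    }
}
```

For example, {acquire, localize} models "only localize" and {acquire, start, start} models two concurrent processes; both are accepted, while stopping before any start is rejected.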
[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157482#comment-15157482 ] Bikas Saha commented on YARN-1011:
--
2.3 The relationship between overall cluster utilization and node over-allocation is not clear. Like you say, in an under-allocated cluster it would likely be easy to find a guaranteed container. So I am not sure we should make this tenuous link formal by adding it as a config in the code. Regardless of the cluster's allocation state, a node could be over-allocated.
3 So we are going to add a promotion notification to the AM-RM protocol, right? By corollary, we would be adding a flag to the initial allocation that shows whether it was guaranteed or opportunistic, right?
3.2 I agree that cross-node promotion is complex. But I am afraid it does not look like something that can be deferred for later, because that is likely not possible. It is very likely that an app may have a guaranteed and an opportunistic container, and when it gives up a guaranteed container we will need to allocate another one to it. That may be the opportunistic container or a new one. So it's OK to defer any advanced stuff at this time, but at a minimum, for the sake of a complete logic definition, we will need some default behavior. The obvious default would be to ignore the opportunistic container and let it run until it finishes or is preempted (because the node becomes busy). This aligns with the philosophy of opportunistic allocation being a secondary scheduling loop. Perhaps this is what we had in mind; if so, my request is to call it out for the sake of completeness.
[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15155867#comment-15155867 ] Bikas Saha commented on YARN-1011:
--
2.3 What's the reasoning behind this? Over-allocating a node seems to be a local decision based on the node's expected and actual utilization. So I would expect the logic to be something like: 1) the node is already 100% allocated; 2) actual utilization is < 80%; 3) over-allocate to bring actual utilization to ~80%.
3 What is the AM/RM interaction in this promotion?
3.2 It is not clear what is actually happening here. Will a new container be allocated, with the opportunistic container allowed to continue until it exits or is preempted? How does all this interact with preemption?
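The three-step local rule sketched in point 2.3 above can be written out directly. This is a sketch of the reasoning only: the 80% target, the method name, and the memory-only view are assumptions for illustration, not YARN configuration or API.

```java
// Hypothetical sketch of the local over-allocation decision in point 2.3:
// over-allocate only a node that is fully allocated yet under-utilized, and
// only by enough to bring measured utilization up to roughly the target.
public class OverAllocation {
    static final double TARGET_UTILIZATION = 0.80; // assumed target, not a YARN config

    // nodeCapacityMb: physical memory of the node
    // allocatedMb:    sum of resources promised to containers
    // usedMb:         measured actual usage
    // Returns the extra MB the node could offer opportunistically, or 0.
    public static long opportunisticHeadroomMb(long nodeCapacityMb, long allocatedMb, long usedMb) {
        if (allocatedMb < nodeCapacityMb) {
            return 0; // 1) only when the node is already 100% allocated
        }
        if ((double) usedMb / nodeCapacityMb >= TARGET_UTILIZATION) {
            return 0; // 2) only when actual utilization is below the target
        }
        // 3) over-allocate just enough to bring utilization to ~ the target
        return Math.round(TARGET_UTILIZATION * nodeCapacityMb) - usedMb;
    }
}
```

Keeping the decision a pure function of one node's numbers matches the point that over-allocation is a local decision, independent of overall cluster utilization.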
[jira] [Commented] (YARN-2019) Retrospect on decision of making RM crashed if any exception throw in ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15118116#comment-15118116 ] Bikas Saha commented on YARN-2019:
--
Does this now mean that during a failover the new RM could forget about the jobs that failed to get stored by the previous RM?

> Retrospect on decision of making RM crashed if any exception throw in
> ZKRMStateStore
>
> Key: YARN-2019
> URL: https://issues.apache.org/jira/browse/YARN-2019
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Junping Du
> Assignee: Jian He
> Priority: Critical
> Labels: ha
> Fix For: 2.7.2, 2.6.2
> Attachments: YARN-2019.1-wip.patch, YARN-2019.patch, YARN-2019.patch
>
> Currently, if anything abnormal happens in ZKRMStateStore, it will throw a
> fatal exception to crash the RM. As shown in YARN-1924, it could be due to an
> RM HA internal bug itself, and not a fatal exception. We should retrospect
> some decisions here, as the HA feature is designed to protect the key
> component, not disturb it.
[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15086326#comment-15086326 ] Bikas Saha commented on YARN-1011: -- Some of what I am saying emanates from prior experience with a different Hadoop like system. You can read more about it here. http://research.microsoft.com/pubs/232978/osdi14-paper-boutin.pdf > [Umbrella] Schedule containers based on utilization of currently allocated > containers > - > > Key: YARN-1011 > URL: https://issues.apache.org/jira/browse/YARN-1011 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Arun C Murthy > Attachments: yarn-1011-design-v0.pdf, yarn-1011-design-v1.pdf > > > Currently RM allocates containers and assumes resources allocated are > utilized. > RM can, and should, get to a point where it measures utilization of allocated > containers and, if appropriate, allocate more (speculative?) containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15086325#comment-15086325 ] Bikas Saha commented on YARN-1011: -- bq. At this point what happens to the opportunistic container. It is clearly running at lower priority on the node and as such we are not giving the job its guaranteed capacity. Yes. The difference is that the opportunistic container may not be convertible into a normal container because that node is still over-allocated. So at this point, what should be done? Should this container be terminated and run somewhere else as normal (because capacity is now available)? Should some other container be preempted on this node to make this container normal? Should the RM allocate a normal container and give it to the app in addition to the running opportunistic container, in case the app can do the transfer internally? Also, with this feature in place, should we run all containers beyond guaranteed capacity as opportunistic containers? This would ensure that any excess containers that we give to a job will not affect the performance of the guaranteed containers of other jobs. This would also make the scheduling and allocation more consistent, in that guaranteed containers always run at normal priority and extra containers run at lower priority. The extra container could be extra over capacity (but without over-subscription) or extra over-subscription. Because of this I feel that running tasks at lower priority could be an independent (but related) work item. Staying on this topic, and adding configuration to it: it may make sense to add some way by which an application can say don't over-subscribe nodes when my containers run on them. 
Putting cgroups or docker in this context, would these mechanisms support over-allocating resources like cpu or memory? bq. When space frees up on nodes, NMs send candidate containers for promotion on the heartbeat. That shouldn't be necessary since the RM will get to know about free capacity and run its scheduling cycle for that node - at which point it will be able to take action like allocating a new container or upgrading an existing one. There isn't anything the NM can tell the RM (which the RM already does not know) except for the current utilization of the node. Some of what I am saying emanates from prior experience with a different Hadoop-like system. You can read more about it here. http://research.microsoft.com/pubs/232978/osdi14-paper-boutin.pdf > [Umbrella] Schedule containers based on utilization of currently allocated > containers > - > > Key: YARN-1011 > URL: https://issues.apache.org/jira/browse/YARN-1011 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Arun C Murthy > Attachments: yarn-1011-design-v0.pdf, yarn-1011-design-v1.pdf > > > Currently RM allocates containers and assumes resources allocated are > utilized. > RM can, and should, get to a point where it measures utilization of allocated > containers and, if appropriate, allocate more (speculative?) containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15083957#comment-15083957 ] Bikas Saha commented on YARN-1011: -- I agree with relying on natural container churn in preference to preemption to avoid lost work, though the issue of clearly defining the scheduler policy still remains. bq. If we were oversubscribing 10X then I'd probably want it for sure, but if it's at most 2X capacity then worst case is a container only gets 50% of the resource it had requested. Obviously for something like memory this has to be closely controlled because going over the physical capabilities of the machine has very significant consequences. But for CPU, I'd definitely be inclined to live with the occasional 50% worst case for all containers, in order to avoid the 1/1024th worst case for OPPORTUNISTIC containers on a busy node. I did not understand this. Does this mean it's OK for normal containers to run 50% slower in the presence of opportunistic containers? If yes, then there are scenarios where this may not be a valid choice. E.g. when a cluster is running a mix of SLA and non-SLA jobs. Non-SLA jobs are OK if their containers get slowed down to increase cluster utilization by running opportunistic containers, because we are getting higher overall throughput. But SLA jobs are not OK with missing deadlines because their tasks ran 50% slower. IMO, the litmus test for a feature like this would be to take an existing cluster (with low utilization because tasks are asking for more resources than what they need 100% of the time), then turn this feature on and get better cluster utilization and throughput without affecting the existing workload, whatever the internal implementation details may be. Agree? bq. 50% of maximum-under-utilized resource of past 30 min for each NM can be used to allocate opportunistic containers. These are heuristics and may all be valid under different circumstances. 
What we should step back and see is what is the source of this optimization. Observation: the cluster is under-utilized despite being fully allocated. Possible reasons: 1) Tasks are incorrectly over-allocated. They will never use the resources they ask for and hence we can safely run additional opportunistic containers. So this feature is used to compensate for poorly configured applications. Probably a valid scenario, but is it common? 2) Tasks are correctly allocated but don't use their capacity to the limit all the time. E.g. Terasort will use high cpu only during the sorting but not during the entire length of the job. But its containers will ask for enough CPU to run the sort in the desired time. This is typical application behavior where resource usage varies over time. So this feature is used to soak up the fallow resources in the cluster while tasks are not using their quoted capacity. The arguments and assumptions we make need to be considered in the light of which of 1 or 2 is the common case and where this feature will be useful. While it's useful to have configuration knobs, for a complex dynamic feature like this that is basically reacting to runtime observations, it may be quite hard to configure this statically using manual configuration. While some limits, like a max over-allocation limit, are easy and probably required to configure, we should look at making this feature work by itself instead of relying exclusively on configuration (hell :P) for users to make this feature usable. > [Umbrella] Schedule containers based on utilization of currently allocated > containers > - > > Key: YARN-1011 > URL: https://issues.apache.org/jira/browse/YARN-1011 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Arun C Murthy > Attachments: yarn-1011-design-v0.pdf, yarn-1011-design-v1.pdf > > > Currently RM allocates containers and assumes resources allocated are > utilized. 
> RM can, and should, get to a point where it measures utilization of allocated > containers and, if appropriate, allocate more (speculative?) containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15083683#comment-15083683 ] Bikas Saha commented on YARN-1011: -- Good points but let me play the devil's advocate to get some more clarity :) bq. As soon as we realize the perf is slower because the node has higher usage than we had anticipated, we preempt the container and retry allocation (guaranteed or opportunistic depending on the new cluster state). So, it shouldn't run slower for longer than our monitoring interval. Is this assumption okay? How do we determine that the perf is slower? The CPU would never exceed 100% even under over-allocation. Is preempting always necessary? If we are sure that the OS is going to starve the opportunistic containers, then can we assume that when the node is fully utilized, only our guaranteed containers are using resources? If so, we can let the opportunistic containers be, so that they can start soaking up excess capacity after the normal containers have stopped spiking. Perhaps some experiments will shed some light on this. bq. The opportunistic container will continue to run on this node so long as it is getting the resources it needs. If there is any sort of resource contention, it is preempted and is up for allocation on one of the free nodes. Let's say job capacity is 1 container and the job asks for 2. It gets 1 normal container and 1 opportunistic container. Now it releases its 1 normal container. At this point what happens to the opportunistic container? It is clearly running at lower priority on the node and as such we are not giving the job its guaranteed capacity. The question is not about finding an optimal solution for this problem (and there may not be one). The issue here is to crisply define the semantics around scheduling in the design. Whatever the semantics are, we should clearly know what they are. IMO, the exact semantics of scheduling should be in the docs. bq. 
The RM schedules the next highest priority "task" for which it couldn't find a guaranteed container as an opportunistic container. This task continues to run as long as it is getting enough resources. If there is no resource contention, the task continues to run. If guaranteed resources free up on the node it is running on, isn't it fair to promote the container to Guaranteed? Sure. And that's why the system should upgrade opportunistic containers in the order in which they were allocated. However, the decision must be made at the RM and not the NM, since the NMs don't know about total capacity and multiple NMs locally upgrading their opportunistic containers might end up over-allocating for a job. Further, the queue sharing state may have changed since the opportunistic allocation, and hence assuming that the opportunistic container "would have" gotten that allocation anyway, at a later point in time, may not be valid. In summary, what we need in the document is a clear definition of the scheduling policy around this - whatever that policy may be. > [Umbrella] Schedule containers based on utilization of currently allocated > containers > - > > Key: YARN-1011 > URL: https://issues.apache.org/jira/browse/YARN-1011 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Arun C Murthy > Attachments: yarn-1011-design-v0.pdf, yarn-1011-design-v1.pdf > > > Currently RM allocates containers and assumes resources allocated are > utilized. > RM can, and should, get to a point where it measures utilization of allocated > containers and, if appropriate, allocate more (speculative?) containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
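The promotion ordering argued for above (upgrade opportunistic containers in allocation order, with the decision made at the RM where the job's guaranteed capacity is known) could look roughly like the following sketch. The class and method names are hypothetical, not YARN APIs:

```python
from collections import deque

class PromotionQueue:
    """Hypothetical sketch: promotion decisions happen at the RM, in
    allocation order, and only while the job stays within its guaranteed
    capacity - an invariant a purely NM-local decision cannot check."""

    def __init__(self, guaranteed_capacity):
        self.guaranteed_capacity = guaranteed_capacity
        self.guaranteed_in_use = 0
        self.opportunistic = deque()  # containers in allocation order

    def allocate_opportunistic(self, container_id):
        self.opportunistic.append(container_id)

    def promote_next(self):
        """Promote the oldest opportunistic container, but only if the
        job's guaranteed capacity still has room for it."""
        if self.opportunistic and self.guaranteed_in_use < self.guaranteed_capacity:
            self.guaranteed_in_use += 1
            return self.opportunistic.popleft()
        return None

q = PromotionQueue(guaranteed_capacity=1)
q.allocate_opportunistic("c1")
q.allocate_opportunistic("c2")
print(q.promote_next())  # "c1" is promoted first (allocation order)
print(q.promote_next())  # None: promoting "c2" would exceed guaranteed capacity
```

If several NMs each promoted their local opportunistic container, the `guaranteed_in_use < guaranteed_capacity` check could be violated cluster-wide, which is the over-allocation risk the comment points out.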
[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15081950#comment-15081950 ] Bikas Saha commented on YARN-1011: -- In Tez we always try to allocate the most important work to the next allocated container. So doing opportunistic containers without providing the AM with the ability to know about it and use it judiciously may not be something that can be delayed to a second phase. Being able to choose only guaranteed or non-guaranteed containers only covers half the problem (and probably the less relevant one), in which an application should always finish in 1 min using guaranteed capacity but may sometimes finish in 30s because it got opportunistic containers. The other side is probably more important, where a regression is caused due to opportunistic containers. 1) the app got opportunistic containers and their perf wasn't the same as normal containers - so it ran slower. This may be mitigated by the system guaranteeing that only excess containers beyond guaranteed capacity would be opportunistic. This would require that the system upgrade opportunistic containers in the same order as it would allocate containers. However, things get complicated because a node with an opportunistic container may continue to run its normal containers while space frees up for guaranteed capacity on other nodes. At this point, which container becomes guaranteed - the new one on a free node or the opportunistic one that is already doing work? Which one should be preempted? 2) the app suffered because its guaranteed containers got slowed down due to competition from opportunistic containers. This needs strong support for lower priority resource consumption by opportunistic containers. IMO, the NM cannot make a local choice about upgrading its opportunistic containers because this is effectively a resource allocation decision and only the RM has the info to do that. 
The NM does not know if this would exceed guaranteed capacity and in total, a bunch of NMs making this choice locally can lead to excessive over-allocation of guaranteed resources. > [Umbrella] Schedule containers based on utilization of currently allocated > containers > - > > Key: YARN-1011 > URL: https://issues.apache.org/jira/browse/YARN-1011 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Arun C Murthy > Attachments: yarn-1011-design-v0.pdf, yarn-1011-design-v1.pdf > > > Currently RM allocates containers and assumes resources allocated are > utilized. > RM can, and should, get to a point where it measures utilization of allocated > containers and, if appropriate, allocate more (speculative?) containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15072279#comment-15072279 ] Bikas Saha commented on YARN-1011: -- In my prior experience, something like this is not practical without pro-active cpu management (which has been delegated to future work in the document). It is essential to run opportunistic tasks at lower OS cpu priority so that they never obstruct the progress of normal tasks. Typically we will find that the machine is most under-utilized in cpu, since most processing has bursty cpu. When a normal task has a cpu burst it should not have to contend with an opportunistic task, since that would be detrimental to the expected performance of that task. Without this, jobs will not run predictably in the cluster. From what I have seen, users prefer predictability over most other things, i.e. having a 1 min job run in 1 min all the time vs making that job run in 30s 85% of the time but in 2 mins 5% of the time, because the latter makes it really hard to establish SLAs. In fact, this is the litmus test for opportunistic scheduling. It should be able to raise the utilization of a cluster from (say) 50% to (say) 75% without affecting the latency of the jobs compared to when the cluster was running at 50%. For memory, in fact, it's OK to share and reach 100% capacity, but it's important to check that the machine does not start thrashing. Most well-written tasks will run within their memory limits and start spilling etc. Opportunistic tasks are trying to occupy the memory that these tasks thought they could use but are not using (or that these tasks are keeping in buffer on purpose). The crucial thing to consider here is to look for stats that signify the onset of memory paging activity (or overall memory over-subscription at the OS level). At that point, even normal tasks that are within their limit will be adversely affected because the OS will start paging memory to disk. 
So we need to start proactively killing opportunistic tasks before such paging activity gets triggered. Handling opportunistic tasks raises questions on the involvement of the AMs. Unless I missed something, this is not called out clearly in the doc. In that sense it would be instructive to consider opportunistic scheduling in a similar light as preemption. The app got a container that it should not have gotten at that time if we had been strict, but got it because we decided to loosen the strings (of queue capacity or machine capacity respectively). - will opportunistic containers be given only for containers that are beyond queue capacity, such that we don't break any guarantees on their liveness? i.e. an AM will not expect to lose any container that is within its queue capacity, but opportunistic containers can be killed at any time. - does the AM need to know that a newly allocated container was opportunistic? E.g. so that it does not schedule the highest priority work on that container. - will conversion of opportunistic containers to regular containers be automatically done by the RM? Will the RM notify the AM about such conversions? - when terminating opportunistic containers, will the RM ask the AM about which containers to kill? Given the above perf related scenarios this may not be a viable option. > [Umbrella] Schedule containers based on utilization of currently allocated > containers > - > > Key: YARN-1011 > URL: https://issues.apache.org/jira/browse/YARN-1011 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Arun C Murthy > Attachments: yarn-1011-design-v0.pdf > > > Currently RM allocates containers and assumes resources allocated are > utilized. > RM can, and should, get to a point where it measures utilization of allocated > containers and, if appropriate, allocate more (speculative?) containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
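The "kill opportunistic tasks before paging sets in" policy from the comment above might be sketched like this. The paging-rate threshold, the task fields, and the newest-first victim order are assumptions for illustration, not anything from the design doc:

```python
def select_victims(paging_rate, paging_threshold, opportunistic_tasks,
                   memory_needed_mb):
    """Illustrative policy: once OS paging activity crosses a threshold,
    proactively pick opportunistic tasks to kill (newest first, since
    they have done the least work) until enough memory is freed."""
    if paging_rate < paging_threshold:
        return []  # no memory pressure yet; leave opportunistic tasks alone
    victims, freed_mb = [], 0
    # Sort by start time descending: kill the most recently started first.
    for task in sorted(opportunistic_tasks, key=lambda t: t["start"], reverse=True):
        if freed_mb >= memory_needed_mb:
            break
        victims.append(task["id"])
        freed_mb += task["mem_mb"]
    return victims

tasks = [{"id": "t1", "start": 1, "mem_mb": 512},
         {"id": "t2", "start": 2, "mem_mb": 512}]
print(select_victims(0.0, 1.0, tasks, 512))   # no pressure -> []
print(select_victims(2.0, 1.0, tasks, 512))   # pressure -> kill newest first
```

The key design point from the comment survives in the threshold check: the trigger is the onset of paging (a machine-wide signal), not any single task exceeding its own limit.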
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060559#comment-15060559 ] Bikas Saha commented on YARN-1197: -- The API supports it but the backend implementation does not. So in the future, based on need, this could be supported compatibly. Do you have a scenario where this is essential? > Support changing resources of an allocated container > > > Key: YARN-1197 > URL: https://issues.apache.org/jira/browse/YARN-1197 > Project: Hadoop YARN > Issue Type: Task > Components: api, graceful, nodemanager, resourcemanager >Affects Versions: 2.1.0-beta >Reporter: Wangda Tan > Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, > YARN-1197_Design.2015.06.24.pdf, YARN-1197_Design.2015.07.07.pdf, > YARN-1197_Design.2015.08.21.pdf, YARN-1197_Design.pdf > > > The current YARN resource management logic assumes the resource allocated to a > container is fixed during its lifetime. When users want to change the resource > of an allocated container, the only way is to release it and allocate a new > container of the expected size. > Allowing run-time changing of the resources of an allocated container will give us > better control of resource usage on the application side -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4108) CapacityScheduler: Improve preemption to preempt only those containers that would satisfy the incoming request
[ https://issues.apache.org/jira/browse/YARN-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028104#comment-15028104 ] Bikas Saha commented on YARN-4108: -- These problems will be hard to solve without involving the scheduler in the decision cycle. The preemption policy can determine how much to preempt from a queue at a macro level. But the actual containers to preempt would be selected by the scheduler. That is where using the global node picture will help. For a given container request, we can scan its nodes (if any) and make either an allocation or a preemption decision. Else, if we are doing container allocation on node heartbeat, then just like the delay scheduling logic, we can mark a node for preemption but not preempt it, and associate that node with the container request for which preemption is needed (request.nodeToPreempt). And we can cycle through all nodes like this and change the request->node association when we find better nodes to preempt. After cycling through all nodes, when we again reach the node matching request.nodeToPreempt, we can execute the decision of actually preempting on that node. If there are no nodes that can satisfy the request (e.g. the request wants node A but preemptedQueue has no containers on node A) then the scheduler should be able to call back to the preemption module and notify it, so that some other queue can be picked to preempt. > CapacityScheduler: Improve preemption to preempt only those containers that > would satisfy the incoming request > -- > > Key: YARN-4108 > URL: https://issues.apache.org/jira/browse/YARN-4108 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Reporter: Wangda Tan >Assignee: Wangda Tan > > This is sibling JIRA for YARN-2154. We should make sure container preemption > is more effective. 
> *Requirements:* > 1) Can handle case of user-limit preemption > 2) Can handle case of resource placement requirements, such as: hard-locality > (I only want to use rack-1) / node-constraints (YARN-3409) / black-list (I > don't want to use rack1 and host\[1-3\]) > 3) Can handle preemption within a queue: cross-user preemption (YARN-2113), > cross-application preemption (such as priority-based (YARN-1963) / > fairness-based (YARN-3319)). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
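The delay-scheduling-style marking scheme described in the comment above (mark a candidate node for a pending request, re-mark when a better node is seen, and only execute the preemption when the scan revisits the marked node) can be sketched as follows. `nodeToPreempt` comes from the comment itself; the scoring and return values are illustrative assumptions:

```python
class PendingRequest:
    """A container request waiting on preemption; structure is illustrative."""
    def __init__(self, wanted_nodes=None):
        self.wanted_nodes = wanted_nodes  # None = no hard-locality constraint
        self.node_to_preempt = None       # the request.nodeToPreempt mark
        self.best_score = None            # how good the marked node is

def on_node_heartbeat(request, node, preemption_score):
    """Called as the scheduler cycles through node heartbeats.
    Returns what happened: 'skip', 'preempt', 'marked', or 'kept'."""
    # Respect hard locality constraints, if any.
    if request.wanted_nodes is not None and node not in request.wanted_nodes:
        return "skip"
    # Revisiting the currently marked node after a full cycle:
    # no better node was found, so execute the preemption now.
    if node == request.node_to_preempt:
        return "preempt"
    # Otherwise, (re)mark if this node is a better preemption candidate.
    if request.best_score is None or preemption_score > request.best_score:
        request.node_to_preempt, request.best_score = node, preemption_score
        return "marked"
    return "kept"

req = PendingRequest()
for n, score in [("A", 1), ("B", 3), ("C", 2), ("A", 1), ("B", 3)]:
    print(n, on_node_heartbeat(req, n, score))  # B is preempted on the revisit
```

Waiting for the revisit is what gives every node one full cycle to offer a better candidate before any container is actually killed, mirroring how delay scheduling trades a little latency for a better placement.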
[jira] [Commented] (YARN-4390) Consider container request size during CS preemption
[ https://issues.apache.org/jira/browse/YARN-4390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15025542#comment-15025542 ] Bikas Saha commented on YARN-4390: -- I am not sure if this is a bug as described. If preemption does free 8x1GB containers then it will create 8GB of free space on the node. The scheduler (which is container request size aware) should then allocate 1x8GB container to the under-allocated AM. [~curino] Is that correct? Of course there could be a bug in the implementation, but by design this should not be happening. However, if YARN ends up preempting 8x1GB containers on different nodes then the under-allocated AM will not get its resources, and that may result in further avoidable preemptions. This is [~sunilg]'s case. > Consider container request size during CS preemption > > > Key: YARN-4390 > URL: https://issues.apache.org/jira/browse/YARN-4390 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 3.0.0, 2.8.0, 2.7.3 >Reporter: Eric Payne >Assignee: Eric Payne > > There are multiple reasons why preemption could unnecessarily preempt > containers. One is that an app could be requesting a large container (say > 8-GB), and the preemption monitor could conceivably preempt multiple > containers (say 8, 1-GB containers) in order to fill the large container > request. These smaller containers would then be rejected by the requesting AM > and potentially given right back to the preempted app. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
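The distinction drawn in the comment above (8x1GB freed on one node can host a 1x8GB request, but the same capacity spread across nodes cannot) amounts to a per-node aggregation check. This is an illustrative sketch, not scheduler code:

```python
from collections import defaultdict

def satisfies_request(freed_containers, request_mb):
    """Sketch of the sizing concern: preempted capacity only helps a
    large request if it accumulates on a single node. freed_containers
    is a list of (node, size_mb) pairs; the structure is illustrative."""
    freed_per_node = defaultdict(int)
    for node, size_mb in freed_containers:
        freed_per_node[node] += size_mb
    # Only a single-node total >= the request can host the big container.
    return any(total >= request_mb for total in freed_per_node.values())

# Eight 1GB containers freed on one node can host an 8GB container...
print(satisfies_request([("n1", 1024)] * 8, 8192))                   # True
# ...but the same 8GB freed across eight different nodes cannot.
print(satisfies_request([(f"n{i}", 1024) for i in range(8)], 8192))  # False
```

The second case is exactly the "further avoidable preemptions" scenario: the freed capacity is real but unusable for the pending request.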
[jira] [Commented] (YARN-2047) RM should honor NM heartbeat expiry after RM restart
[ https://issues.apache.org/jira/browse/YARN-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14997181#comment-14997181 ] Bikas Saha commented on YARN-2047: -- I think the general idea is that the AM cannot be trusted about allocated resources or running containers. > RM should honor NM heartbeat expiry after RM restart > > > Key: YARN-2047 > URL: https://issues.apache.org/jira/browse/YARN-2047 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Bikas Saha > > After the RM restarts, it forgets about existing NM's (and their potentially > decommissioned status too). After restart, the RM cannot maintain the > contract to the AM's that a lost NM's containers will be marked finished > within the expiry time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2047) RM should honor NM heartbeat expiry after RM restart
[ https://issues.apache.org/jira/browse/YARN-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991069#comment-14991069 ] Bikas Saha commented on YARN-2047: -- From the description it seems like the original scope was making sure that a lost NM's containers are marked expired by the RM even across RM restart. For that, won't it be enough to save a dead/decommissioned NM's info in the state store? Upon restart, repopulate the decommissioned/dead status from the state store. The RM can take appropriate action at that time - e.g. cancelling an AM's containers for those NMs when the AM re-registers, or asking those NMs to restart and re-register if they heartbeat again. If this is a required action then it would also imply that saving such nodes would be a critical state change operation. So, e.g., a decommission command from the admin should not complete until the store has been updated. Is that the case? > RM should honor NM heartbeat expiry after RM restart > > > Key: YARN-2047 > URL: https://issues.apache.org/jira/browse/YARN-2047 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Bikas Saha > > After the RM restarts, it forgets about existing NM's (and their potentially > decommissioned status too). After restart, the RM cannot maintain the > contract to the AM's that a lost NM's containers will be marked finished > within the expiry time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3911) Add tail of stderr to diagnostics if container fails to launch or it container logs are empty
[ https://issues.apache.org/jira/browse/YARN-3911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983845#comment-14983845 ] Bikas Saha commented on YARN-3911: -- Sure > Add tail of stderr to diagnostics if container fails to launch or it > container logs are empty > - > > Key: YARN-3911 > URL: https://issues.apache.org/jira/browse/YARN-3911 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bikas Saha > > The stderr may have useful info in those cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4278) On AM registration, response should include cluster Nodes report on demanded by registration request.
[ https://issues.apache.org/jira/browse/YARN-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14970840#comment-14970840 ] Bikas Saha commented on YARN-4278: -- Clients can ask for it but IMO no clients actually do. Hence we probably don't have enough evidence to suggest that it might work in practice, especially at large deployments like Yahoo. > On AM registration, response should include cluster Nodes report on demanded > by registration request. > -- > > Key: YARN-4278 > URL: https://issues.apache.org/jira/browse/YARN-4278 > Project: Hadoop YARN > Issue Type: Improvement > Components: api >Reporter: Rohith Sharma K S >Assignee: Rohith Sharma K S > > From the yarn-dev mailing list discussion thread > [Thread-1|http://mail-archives.apache.org/mod_mbox/hadoop-yarn-dev/201510.mbox/%3c0ee80f6f7a98a64ebd18f2be839c91156798a...@szxeml512-mbs.china.huawei.com%3E] > > [Thread-2|http://mail-archives.apache.org/mod_mbox/hadoop-yarn-dev/201510.mbox/%3c4f7812fc-ab5d-465d-ac89-824735698...@hortonworks.com%3E] > > Slider needs to know cluster node details to provide support for > affinity/anti-affinity on containers. > Current behavior: during the life span of an application, updatedNodes are sent in > the allocate request only if there is a change like added/removed/'state > change' in the nodes. Otherwise cluster nodes are not updated to the AM. > One approach suggested by [~ste...@apache.org] is to let the AM registration > response hold the cluster nodes report -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4278) On AM registration, response should include cluster Nodes report on demanded by registration request.
[ https://issues.apache.org/jira/browse/YARN-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14970408#comment-14970408 ] Bikas Saha commented on YARN-4278: -- Not sure if the alternative of allowing an AM to use the ClientRMProtocol was considered in the discussions. Node reports as well as other data like queue info (which AMs might also be interested in) would then become available to the AM. As far as the approach in this jira is concerned, it would be good to check whether the AM-RM resync (after an RM restart) would automatically benefit from this change or whether it would need to be handled separately. Any concerns about the size of this data for large clusters? Would it fit easily on the RPC? Would it cause excessive network usage on the RM NIC when many AMs register/resync with it? > On AM registration, response should include cluster Nodes report on demanded > by registration request. > -- > > Key: YARN-4278 > URL: https://issues.apache.org/jira/browse/YARN-4278 > Project: Hadoop YARN > Issue Type: Improvement > Components: api >Reporter: Rohith Sharma K S >Assignee: Rohith Sharma K S > > From the yarn-dev mailing list discussion thread > [Thread-1|http://mail-archives.apache.org/mod_mbox/hadoop-yarn-dev/201510.mbox/%3c0ee80f6f7a98a64ebd18f2be839c91156798a...@szxeml512-mbs.china.huawei.com%3E] > > [Thread-2|http://mail-archives.apache.org/mod_mbox/hadoop-yarn-dev/201510.mbox/%3c4f7812fc-ab5d-465d-ac89-824735698...@hortonworks.com%3E] > > Slider needs to know cluster node details to provide support for > affinity/anti-affinity on containers. > Current behavior: during the life span of an application, updatedNodes are sent in > the allocate request only if there is a change like added/removed/'state > change' in the nodes. Otherwise cluster nodes are not updated to the AM. > One approach suggested by [~ste...@apache.org] is to let the AM registration > response hold the cluster nodes report -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1509) Make AMRMClient support send increase container request and get increased/decreased containers
[ https://issues.apache.org/jira/browse/YARN-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14961154#comment-14961154 ] Bikas Saha commented on YARN-1509: -- Sounds good > Make AMRMClient support send increase container request and get > increased/decreased containers > -- > > Key: YARN-1509 > URL: https://issues.apache.org/jira/browse/YARN-1509 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan (No longer used) >Assignee: MENG DING > Attachments: YARN-1509.1.patch, YARN-1509.2.patch, YARN-1509.3.patch, > YARN-1509.4.patch, YARN-1509.5.patch > > > As described in YARN-1197, we need add API in AMRMClient to support > 1) Add increase request > 2) Can get successfully increased/decreased containers from RM -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1509) Make AMRMClient support send increase container request and get increased/decreased containers
[ https://issues.apache.org/jira/browse/YARN-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14961035#comment-14961035 ] Bikas Saha commented on YARN-1509: -- Does issuing an increase followed by a decrease actually remove the pending change request on the RM, or will it cause the RM to try to change a container resource to the same size as the existing resource and then go down the code path of increase (new token) or decrease (update NM)? This is a corner case that would be good to double-check. If that works and the RM actually removes the pending container change request, then we could use this mechanism to implement a cancel method wrapper in the AMRMClient. Otherwise, if fixes are needed on the RM side, we could do them separately when we fix the RM. > Make AMRMClient support send increase container request and get > increased/decreased containers > -- > > Key: YARN-1509 > URL: https://issues.apache.org/jira/browse/YARN-1509 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan (No longer used) >Assignee: MENG DING > Attachments: YARN-1509.1.patch, YARN-1509.2.patch, YARN-1509.3.patch, > YARN-1509.4.patch, YARN-1509.5.patch > > > As described in YARN-1197, we need add API in AMRMClient to support > 1) Add increase request > 2) Can get successfully increased/decreased containers from RM -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1509) Make AMRMClient support send increase container request and get increased/decreased containers
[ https://issues.apache.org/jira/browse/YARN-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959877#comment-14959877 ] Bikas Saha commented on YARN-1509: -- bq. Increase/Decrease/Change Not sure why the implementation went down the route of creating a separation of increase vs decrease throughout the flow. In any case, that is the back-end implementation. That can change and evolve without affecting user code. This is the user-facing API, so once code is written against this API, making backwards-incompatible changes or providing new functionality in the future needs to be considered. API simplicity also needs to be considered. 1) Having a changeRequest API allows us to support other kinds of changes (a mix of increase and decrease) at a later point. For now, the API could check for increase/decrease (which it has to do anyway for sanity checking) and reject unsupported scenarios. 2) Even for just increase or decrease, given that the user provides the old Container + new Request Size, it should be easy for the library to figure out whether an increase or decrease is needed. Why burden the user by having them call 2 different APIs that are essentially making the same request? Having to handle the container token for increase but do nothing for decrease is something that already has to be explained to the user. Same for the fact that decrease is quick but increase is not. So those aspects of user education are probably orthogonal. We have iterated on this API question a few times now. In the above 2 points I have tried to summarize the reasons for requesting a change instead of increase/decrease. Even if we don't support both increase and decrease (point 1), I think the simplicity of having a single API (point 2) would be an advantage of change vs increase/decrease. This also simplifies things to a single callback onContainerResourceChanged() instead of 2 callbacks. 
At this point, I will request you and [~leftnoteasy] to consider both future extensions and API simplicity to make a final call on having 2 APIs and 2 callbacks vs 1. bq. My thought is, since we already ask the user to provide the old container when he sends out the change request, he should have the old container already, so we don't necessarily have to provide the old container info in the callback method I am fine either way. I discussed the same thing offline with [~leftnoteasy] and he thought that having the old container info would make life easier for the user. A user who is using this API is very likely going to have state about the container for which a change has been requested and can match them up using the containerId. bq. The AbstractCallbackHandler.onError will be called when the change container request throws exception on the RM side. The change request is sent inside the allocate heartbeat request, right? So I am not sure how we get an exception back for the specific case of a failed container change request. Or are you saying that invalid container resource change requests are immediately rejected by the RM synchronously in the allocate RPC? bq. I am not against providing a separate cancel API. But I think the API needs to be clear that the cancel is only for increase request, NOT decrease request (just like we don't have something like cancel release container). Having a simple cancel request regardless of increase or decrease is preferable, since then we are not leaking the current state of the implementation to the user. It is future-safe. E.g., if we later find an issue with decreasing and fixing it makes decrease non-instantaneous, then we don't want to have to change the API to support that. But today, given that it is instantaneous, we can simply ignore the cancellation of a decrease in the cancel method of AMRMClient. I think the RM does not support a cancel container resource change request. Does it? 
If it does not, then perhaps this API can be discussed/added in a separate jira after there is back end support for it. > Make AMRMClient support send increase container request and get > increased/decreased containers > -- > > Key: YARN-1509 > URL: https://issues.apache.org/jira/browse/YARN-1509 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan (No longer used) >Assignee: MENG DING > Attachments: YARN-1509.1.patch, YARN-1509.2.patch, YARN-1509.3.patch, > YARN-1509.4.patch, YARN-1509.5.patch > > > As described in YARN-1197, we need add API in AMRMClient to support > 1) Add increase request > 2) Can get successfully increased/decreased containers from RM -- This message was sent by Atlassian JIRA (v6.3.4#6332)
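The single-API argument in the comment above can be sketched as follows. The method name and classification rule are hypothetical illustrations, not the committed YARN-1509 API; the point is only that the library can infer increase vs decrease by comparing the current allocation with the requested one:

```java
// Sketch of a single "change" entry point that classifies a request as an
// increase or a decrease by comparing the container's current resource
// with the requested one. A mixed request (e.g. more cpu but less memory)
// is rejected for now, per the "sanity check and reject unsupported
// scenarios" suggestion, leaving room to support it later.
public class ContainerChangeSketch {
    enum ChangeKind { INCREASE, DECREASE, UNSUPPORTED_MIX, NO_CHANGE }

    static ChangeKind classify(long curMemMb, int curVcores,
                               long reqMemMb, int reqVcores) {
        int mem = Long.compare(reqMemMb, curMemMb);
        int cpu = Integer.compare(reqVcores, curVcores);
        if (mem == 0 && cpu == 0) return ChangeKind.NO_CHANGE;
        if (mem >= 0 && cpu >= 0) return ChangeKind.INCREASE;
        if (mem <= 0 && cpu <= 0) return ChangeKind.DECREASE;
        return ChangeKind.UNSUPPORTED_MIX;  // not supported today
    }
}
```

With such a classifier the user calls one API and gets one callback, and the increase-only details (new container token) stay an internal concern of the library.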
[jira] [Commented] (YARN-3911) Add tail of stderr to diagnostics if container fails to launch or its container logs are empty
[ https://issues.apache.org/jira/browse/YARN-3911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959579#comment-14959579 ] Bikas Saha commented on YARN-3911: -- I am sorry for the super delayed response. For some reason I was not a watcher on this. By tail I did not mean the Linux tail command but just the concept of sending the last set of entries (say the last 4K of data) from the stderr and/or syslog if the container launch fails. > Add tail of stderr to diagnostics if container fails to launch or its > container logs are empty > - > > Key: YARN-3911 > URL: https://issues.apache.org/jira/browse/YARN-3911 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bikas Saha > > The stderr may have useful info in those cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
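The "last 4K of data" idea can be sketched as a stand-alone snippet — an illustrative version, not the NM implementation:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Illustrative "tail" in the sense described above: read only the last
// maxBytes of a log file (e.g. 4096 for ~4K) so it can be attached to
// container diagnostics without shipping the whole file.
public class LogTail {
    static String tailBytes(String path, int maxBytes) throws IOException {
        try (RandomAccessFile f = new RandomAccessFile(path, "r")) {
            long start = Math.max(0, f.length() - maxBytes);
            byte[] buf = new byte[(int) (f.length() - start)];
            f.seek(start);       // skip everything before the tail window
            f.readFully(buf);
            return new String(buf, StandardCharsets.UTF_8);
        }
    }
}
```

Seeking from the end keeps the cost bounded regardless of how large stderr grew before the launch failure.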
[jira] [Commented] (YARN-1509) Make AMRMClient support send increase container request and get increased/decreased containers
[ https://issues.apache.org/jira/browse/YARN-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14950835#comment-14950835 ] Bikas Saha commented on YARN-1509: -- A change container request (maybe not supported now) can be increase cpu + decrease memory. Hence a built-in concept of increase and decrease in the API is something I am wary of. So how about
{code}
public abstract void onContainersResourceChanged(Map oldToNewContainers);
OR
public abstract void onContainersResourceChanged(List updatedContainerInfo);
{code}
Would there be a case (maybe not currently) when a change container request can fail on the RM? Should the callback allow notifying about a failure to change the container? What if the RM notifies the AMRMClient about a completed container that happens to have a pending change request? What should happen in this case? Should the AMRMClient clear that pending request? Should it also notify the user that the pending container change request has failed, or just rely on onContainerCompleted() to let the AM get that information? I would be wary of overloading cancel with a second container change request. To be clear, here we are discussing user-facing semantics and API. Having clear semantics is important vs implicit or overloaded behavior. E.g. are there cases where an increase followed by a decrease request is a valid scenario, and how would that be different compared to an increase followed by a cancel? Should the RM do different things for increase followed by cancel vs increase followed by decrease? AM restart does not need any handling since the AM is going to start from a clean slate. Sorry, my bad. I missed the handling of the RM restart case. Is there an existing test for that code path that could be augmented to make sure that the new changes are tested? 
> Make AMRMClient support send increase container request and get > increased/decreased containers > -- > > Key: YARN-1509 > URL: https://issues.apache.org/jira/browse/YARN-1509 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan (No longer used) >Assignee: MENG DING > Attachments: YARN-1509.1.patch, YARN-1509.2.patch, YARN-1509.3.patch, > YARN-1509.4.patch, YARN-1509.5.patch > > > As described in YARN-1197, we need add API in AMRMClient to support > 1) Add increase request > 2) Can get successfully increased/decreased containers from RM -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1509) Make AMRMClient support send increase container request and get increased/decreased containers
[ https://issues.apache.org/jira/browse/YARN-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14947315#comment-14947315 ] Bikas Saha commented on YARN-1509: -- Sorry for coming in late on this. I have some questions on the API. Why are there separate methods for increase and decrease instead of a single method to change the container resource size? By comparing the existing resource allocation of a container to the new requested resource allocation, it should be clear whether an increase or decrease is being requested. Also, for completeness, is there a need for a cancelContainerResourceChange()? After a container resource change request has been submitted, what are my options as a user other than to wait for the request to be satisfied by the RM? If I release the container, does it mean all pending change requests for that container should be removed? From a quick look at the patch, it does not look like that is being covered, unless I am missing something. What will happen if the AM restarts after submitting a change request? Does the AM-RM re-register protocol need an update to handle the case of re-synchronizing on the change requests? What happens if the RM restarts? If these are explained in a document, then please point me to the document. The patch did not seem to have anything around this area, so I thought I would ask. Also, why have the callback interface methods been made non-public? Would that be an incompatible change? 
> Make AMRMClient support send increase container request and get > increased/decreased containers > -- > > Key: YARN-1509 > URL: https://issues.apache.org/jira/browse/YARN-1509 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan (No longer used) >Assignee: MENG DING > Attachments: YARN-1509.1.patch, YARN-1509.2.patch, YARN-1509.3.patch, > YARN-1509.4.patch, YARN-1509.5.patch > > > As described in YARN-1197, we need add API in AMRMClient to support > 1) Add increase request > 2) Can get successfully increased/decreased containers from RM -- This message was sent by Atlassian JIRA (v6.3.4#6332)
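The release-clears-pending-changes behavior asked about in the comment above could look like this on the client side. This is a minimal bookkeeping sketch with a simplified String container id and String resource, not the actual AMRMClient internals:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of client-side bookkeeping: when a container is released, any
// pending resource-change request for it is dropped too, so the RM is
// never asked to resize a container the AM has already given up.
public class PendingChangeBookkeeping {
    private final Map<String, String> pendingChanges = new HashMap<>();

    void requestChange(String containerId, String newResource) {
        pendingChanges.put(containerId, newResource);
    }

    void releaseContainer(String containerId) {
        // releasing implies cancelling any outstanding change request
        pendingChanges.remove(containerId);
    }

    boolean hasPendingChange(String containerId) {
        return pendingChanges.containsKey(containerId);
    }
}
```

The same removal hook would apply when the RM reports the container as completed, which is the other cleanup case raised in the discussion.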
[jira] [Commented] (YARN-4087) Followup fixes after YARN-2019 regarding RM behavior when state-store error occurs
[ https://issues.apache.org/jira/browse/YARN-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734102#comment-14734102 ] Bikas Saha commented on YARN-4087: -- Repeating my comments from YARN-2019 here. There would be 2 kinds of state store operations - reads and writes. Writes may be of 2 kinds - critical and non-critical. E.g. saving an application submission is critical. Saving node information is perhaps not critical. It would affect system correctness if critical write operation errors are allowed to be ignored. We end up with YARN-4118 and other such potential issues. Essentially we are turning a write-ahead log into something that is optional. I don't see how the system can make stable reliability guarantees by making the write-ahead log non-fatal. On the other hand, read errors or non-critical write errors should not block RM progress but do need to be potentially retried. That also does not seem to be addressed in the patch. > Followup fixes after YARN-2019 regarding RM behavior when state-store error > occurs > -- > > Key: YARN-4087 > URL: https://issues.apache.org/jira/browse/YARN-4087 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jian He >Assignee: Jian He > Fix For: 2.7.2, 2.6.2 > > Attachments: YARN-4087-branch-2.6.patch, YARN-4087.1.patch, > YARN-4087.2.patch, YARN-4087.3.patch, YARN-4087.5.patch, YARN-4087.6.patch, > YARN-4087.7.patch > > > Several fixes: > 1. Set YARN_FAIL_FAST to be false by default, since this makes more sense in > production environment. > 2. If HA is enabled and if there's any state-store error, after the retry > operation failed, we always transition RM to standby state. Otherwise, we > may see two active RMs running. YARN-4107 is one example. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2019) Retrospect on decision of making RM crashed if any exception throw in ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734100#comment-14734100 ] Bikas Saha commented on YARN-2019: -- Sorry for coming in late on this. There would be 2 kinds of state store operations - reads and writes. Writes may be of 2 kinds - critical and non-critical. E.g. saving an application submission is critical. Saving node information is perhaps not critical. It would affect system correctness if critical write operation errors are allowed to be ignored. We end up with YARN-4118 and other such potential issues. Essentially we are turning a write-ahead log into something that is optional. I don't see how the system can make stable reliability guarantees by making the write-ahead log non-fatal. On the other hand, read errors or non-critical write errors should not block RM progress but do need to be potentially retried. That also does not seem to be addressed in the patch. > Retrospect on decision of making RM crashed if any exception throw in > ZKRMStateStore > > > Key: YARN-2019 > URL: https://issues.apache.org/jira/browse/YARN-2019 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junping Du >Assignee: Jian He >Priority: Critical > Labels: ha > Fix For: 2.8.0, 2.7.2, 2.6.2 > > Attachments: YARN-2019.1-wip.patch, YARN-2019.patch, YARN-2019.patch > > > Currently, if anything abnormal happens in ZKRMStateStore, it will throw a fatal > exception to crash the RM. As shown in YARN-1924, it could be due to an RM HA > internal bug itself, not a fatal exception. We should revisit some decisions > here, as the HA feature is designed to protect a key component, not disturb it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3942) Timeline store to read events from HDFS
[ https://issues.apache.org/jira/browse/YARN-3942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729917#comment-14729917 ] Bikas Saha commented on YARN-3942: -- In Option 1 the latency would be the time to read the entire session file for a session that has run many DAGs, right? Since we know the file size to be read, could we return a message saying something like "scanning file size FOO. Expect BAR latency"? > Timeline store to read events from HDFS > --- > > Key: YARN-3942 > URL: https://issues.apache.org/jira/browse/YARN-3942 > Project: Hadoop YARN > Issue Type: Improvement > Components: timelineserver >Reporter: Jason Lowe >Assignee: Jason Lowe > Attachments: YARN-3942.001.patch > > > This adds a new timeline store plugin that is intended as a stop-gap measure > to mitigate some of the issues we've seen with ATS v1 while waiting for ATS > v2. The intent of this plugin is to provide a workable solution for running > the Tez UI against the timeline server on large-scale clusters running many > thousands of jobs per day. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4088) RM should be able to process heartbeats from NM asynchronously
[ https://issues.apache.org/jira/browse/YARN-4088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14717907#comment-14717907 ] Bikas Saha commented on YARN-4088: -- Right. So the combined objective is to continue to have small heartbeat intervals with larger clusters while still using the central scheduler for all allocations. Clearly, in theory, that is a bottleneck by design, and our attempt is to engineer our way out of it for medium-size clusters. Right? :) > RM should be able to process heartbeats from NM asynchronously > -- > > Key: YARN-4088 > URL: https://issues.apache.org/jira/browse/YARN-4088 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager, scheduler >Reporter: Srikanth Kandula > > Today, the RM sequentially processes one heartbeat after another. > Imagine a 3000 server cluster with each server heart-beating every 3s. This > gives the RM 1ms on average to process each NM heartbeat. That is tough. > It is true that there are several underlying datastructures that will be > touched during heartbeat processing. So, it is non-trivial to parallelize the > NM heartbeat. Yet, it is quite doable... > Parallelizing the NM heartbeat would substantially improve the scalability of > the RM, allowing it to either > a) run larger clusters or > b) support faster heartbeats or dynamic scaling of heartbeats > c) take more asks from each application or > d) use cleverer/ more expensive algorithms such as node labels or better > packing or ... > Indeed the RM's scalability limit has been cited as the motivating reason for > a variety of efforts which will become less needed if this can be solved. > Ditto for slow heartbeats. See Sparrow and Mercury papers for example. > Can we take a shot at this? > If not, could we discuss why. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4088) RM should be able to process heartbeats from NM asynchronously
[ https://issues.apache.org/jira/browse/YARN-4088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14717831#comment-14717831 ] Bikas Saha commented on YARN-4088: -- Why not on a 3K cluster? We could slow down heartbeats to (say) 10s on a 3K node cluster. That should work, though I agree that NM info would be stale for longer, if that's your point. > RM should be able to process heartbeats from NM asynchronously > -- > > Key: YARN-4088 > URL: https://issues.apache.org/jira/browse/YARN-4088 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager, scheduler >Reporter: Srikanth Kandula > > Today, the RM sequentially processes one heartbeat after another. > Imagine a 3000 server cluster with each server heart-beating every 3s. This > gives the RM 1ms on average to process each NM heartbeat. That is tough. > It is true that there are several underlying datastructures that will be > touched during heartbeat processing. So, it is non-trivial to parallelize the > NM heartbeat. Yet, it is quite doable... > Parallelizing the NM heartbeat would substantially improve the scalability of > the RM, allowing it to either > a) run larger clusters or > b) support faster heartbeats or dynamic scaling of heartbeats > c) take more asks from each application or > d) use cleverer/ more expensive algorithms such as node labels or better > packing or ... > Indeed the RM's scalability limit has been cited as the motivating reason for > a variety of efforts which will become less needed if this can be solved. > Ditto for slow heartbeats. See Sparrow and Mercury papers for example. > Can we take a shot at this? > If not, could we discuss why. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4088) RM should be able to process heartbeats from NM asynchronously
[ https://issues.apache.org/jira/browse/YARN-4088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14717746#comment-14717746 ] Bikas Saha commented on YARN-4088: -- Is the suggestion to process them concurrently? Not quite sure what async means here. Is it async wrt the RPC thread? Another alternative would be to dynamically adjust the NM heartbeat interval. IIRC, the NM's next heartbeat interval is sent by the RM in the response to the heartbeat. If not, then this could be added. The RM could potentially increase this interval till it reaches a steady/stable state of heartbeat processing. This would help in self-adjusting to cluster sizes: small for small clusters and high for large clusters. It could tune up under high load and then tune down once load diminishes. > RM should be able to process heartbeats from NM asynchronously > -- > > Key: YARN-4088 > URL: https://issues.apache.org/jira/browse/YARN-4088 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager, scheduler >Reporter: Srikanth Kandula > > Today, the RM sequentially processes one heartbeat after another. > Imagine a 3000 server cluster with each server heart-beating every 3s. This > gives the RM 1ms on average to process each NM heartbeat. That is tough. > It is true that there are several underlying datastructures that will be > touched during heartbeat processing. So, it is non-trivial to parallelize the > NM heartbeat. Yet, it is quite doable... > Parallelizing the NM heartbeat would substantially improve the scalability of > the RM, allowing it to either > a) run larger clusters or > b) support faster heartbeats or dynamic scaling of heartbeats > c) take more asks from each application or > d) use cleverer/ more expensive algorithms such as node labels or better > packing or ... > Indeed the RM's scalability limit has been cited as the motivating reason for > a variety of efforts which will become less needed if this can be solved. 
> Ditto for slow heartbeats. See Sparrow and Mercury papers for example. > Can we take a shot at this? > If not, could we discuss why. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
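The self-adjusting heartbeat interval suggested above can be sketched as follows. Thresholds, step sizes, and the load signal are made-up illustration values, not anything in YARN's configuration:

```java
// Sketch of a self-adjusting NM heartbeat interval: the RM raises the
// interval it hands back to NMs while heartbeat processing is backed up,
// and lowers it again as load diminishes. queuedHeartbeats stands in for
// whatever load signal the RM actually uses; all constants are
// illustration values.
public class AdaptiveHeartbeatInterval {
    static final int MIN_INTERVAL_MS = 1000;   // fast on small/idle clusters
    static final int MAX_INTERVAL_MS = 10000;  // slow on large/busy clusters

    static int nextIntervalMs(int currentMs, int queuedHeartbeats) {
        if (queuedHeartbeats > 100) {          // overloaded: back off
            return Math.min(MAX_INTERVAL_MS, currentMs * 2);
        } else if (queuedHeartbeats < 10) {    // idle: speed back up
            return Math.max(MIN_INTERVAL_MS, currentMs / 2);
        }
        return currentMs;                      // steady state: hold
    }
}
```

Because the adjustment is driven by observed load rather than a fixed cluster size, the same policy is small for small clusters and high for large ones, tuning up under pressure and back down afterwards.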
[jira] [Commented] (YARN-3736) Add RMStateStore apis to store and load accepted reservations for failover
[ https://issues.apache.org/jira/browse/YARN-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660415#comment-14660415 ] Bikas Saha commented on YARN-3736: -- Sounds good! > Add RMStateStore apis to store and load accepted reservations for failover > -- > > Key: YARN-3736 > URL: https://issues.apache.org/jira/browse/YARN-3736 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler, fairscheduler, resourcemanager >Reporter: Subru Krishnan >Assignee: Anubhav Dhoot > Fix For: 2.8.0 > > Attachments: YARN-3736.001.patch, YARN-3736.001.patch, > YARN-3736.002.patch, YARN-3736.003.patch, YARN-3736.004.patch, > YARN-3736.005.patch > > > We need to persist the current state of the plan, i.e. the accepted > ReservationAllocations & corresponding RLESpareseResourceAllocations to the > RMStateStore so that we can recover them on RM failover. This involves making > all the reservation system data structures protobuf friendly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3736) Add RMStateStore apis to store and load accepted reservations for failover
[ https://issues.apache.org/jira/browse/YARN-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14658798#comment-14658798 ] Bikas Saha commented on YARN-3736: -- Folks, how much load is this going to add to the state store (in terms of actual data volume and rate of stores/updates)? The original design of the RM state store assumes low volume of data and updates (proportional to number of new applications entering the system). So asking. Please let me know if this has been discussed and I missed that part. > Add RMStateStore apis to store and load accepted reservations for failover > -- > > Key: YARN-3736 > URL: https://issues.apache.org/jira/browse/YARN-3736 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler, fairscheduler, resourcemanager >Reporter: Subru Krishnan >Assignee: Anubhav Dhoot > Fix For: 2.8.0 > > Attachments: YARN-3736.001.patch, YARN-3736.001.patch, > YARN-3736.002.patch, YARN-3736.003.patch, YARN-3736.004.patch, > YARN-3736.005.patch > > > We need to persist the current state of the plan, i.e. the accepted > ReservationAllocations & corresponding RLESpareseResourceAllocations to the > RMStateStore so that we can recover them on RM failover. This involves making > all the reservation system data structures protobuf friendly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2005) Blacklisting support for scheduling AMs
[ https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646700#comment-14646700 ] Bikas Saha commented on YARN-2005: -- I am fine with opening a separate jira for the specific case I mentioned. Opened YARN-3994 for that. If you want, you can extend its scope to blacklisting. > Blacklisting support for scheduling AMs > --- > > Key: YARN-2005 > URL: https://issues.apache.org/jira/browse/YARN-2005 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 0.23.10, 2.4.0 >Reporter: Jason Lowe >Assignee: Anubhav Dhoot > Attachments: YARN-2005.001.patch, YARN-2005.002.patch, > YARN-2005.003.patch, YARN-2005.004.patch > > > It would be nice if the RM supported blacklisting a node for an AM launch > after the same node fails a configurable number of AM attempts. This would > be similar to the blacklisting support for scheduling task attempts in the > MapReduce AM but for scheduling AM attempts on the RM side. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3994) RM should respect AM resource/placement constraints
Bikas Saha created YARN-3994: Summary: RM should respect AM resource/placement constraints Key: YARN-3994 URL: https://issues.apache.org/jira/browse/YARN-3994 Project: Hadoop YARN Issue Type: Bug Reporter: Bikas Saha Today, locality and cpu for the AM can be specified in the AM launch container request but are ignored at the RM. Locality is assumed to be ANY and cpu is dropped. There may be other things too that are ignored. This should be fixed so that the user gets what is specified in their code to launch the AM. cc [~leftnoteasy] [~vvasudev] [~adhoot] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2005) Blacklisting support for scheduling AMs
[ https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644966#comment-14644966 ] Bikas Saha commented on YARN-2005: -- I believe the reverse case is also valid. A user may want to specify a locality constraint for the AM, but today that is ignored because it's dropped from the AM resource request by the RM. Similarly, only the memory resource constraint is used and others (cpu etc.) are dropped. Perhaps this jira should tackle this problem holistically by fully respecting the resource request as specified in the AM container launch context. > Blacklisting support for scheduling AMs > --- > > Key: YARN-2005 > URL: https://issues.apache.org/jira/browse/YARN-2005 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 0.23.10, 2.4.0 >Reporter: Jason Lowe >Assignee: Anubhav Dhoot > Attachments: YARN-2005.001.patch, YARN-2005.002.patch, > YARN-2005.003.patch, YARN-2005.004.patch > > > It would be nice if the RM supported blacklisting a node for an AM launch > after the same node fails a configurable number of AM attempts. This would > be similar to the blacklisting support for scheduling task attempts in the > MapReduce AM but for scheduling AM attempts on the RM side. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3911) Add tail of stderr to diagnostics if container fails to launch or its container logs are empty
Bikas Saha created YARN-3911: Summary: Add tail of stderr to diagnostics if container fails to launch or its container logs are empty Key: YARN-3911 URL: https://issues.apache.org/jira/browse/YARN-3911 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha The stderr may have useful info in those cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606218#comment-14606218 ] Bikas Saha commented on YARN-1197: -- There has been a lot of discussion that looks like it is converging. It would be helpful for other interested (but not deeply involved) people if there were an updated design document with details about the agreed-upon design. Also, if this document could outline some of the intuition/logic behind the design choices (like going through the AM for low latency), then it would be super useful. Thanks! > Support changing resources of an allocated container > > > Key: YARN-1197 > URL: https://issues.apache.org/jira/browse/YARN-1197 > Project: Hadoop YARN > Issue Type: Task > Components: api, nodemanager, resourcemanager >Affects Versions: 2.1.0-beta >Reporter: Wangda Tan > Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, > YARN-1197_Design.2015.06.24.pdf, YARN-1197_Design.pdf > > > The current YARN resource management logic assumes resource allocated to a > container is fixed during the lifetime of it. When users want to change a > resource > of an allocated container the only way is releasing it and allocating a new > container with expected size. > Allowing run-time changing resources of an allocated container will give us > better control of resource usage in application side -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability
[ https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14548492#comment-14548492 ] Bikas Saha commented on YARN-1902: -- An alternate approach that we tried in Apache Tez is to wrap a TaskScheduler around the AMRMClient that would take requests from the application and do the matching internally. Since it would know the matching, it could automatically remove the matched requests also. (It still does not remove the race condition, but it is cleaner with respect to the user as an API.) The TaskScheduler was written to be independent of Tez code so that we could contribute it to YARN as a library; however, we did not find time to do so. That code has now evolved quite a bit, but the original, well-tested code could still be extracted from the Tez 0.1 branch and contributed to YARN if someone is interested in doing that work. > Allocation of too many containers when a second request is done with the same > resource capability > - > > Key: YARN-1902 > URL: https://issues.apache.org/jira/browse/YARN-1902 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Affects Versions: 2.2.0, 2.3.0, 2.4.0 >Reporter: Sietse T. Au >Assignee: Sietse T. Au > Labels: client > Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch > > > Regarding AMRMClientImpl > Scenario 1: > Given a ContainerRequest x with Resource y, when addContainerRequest is > called z times with x, allocate is called and at least one of the z allocated > containers is started, then if another addContainerRequest call is done and > subsequently an allocate call to the RM, (z+1) containers will be allocated, > where 1 container is expected. > Scenario 2: > No containers are started between the allocate calls. > Analyzing debug logs of the AMRMClientImpl, I have found that indeed (z+1) > are requested in both scenarios, but that only in the second scenario the > correct behavior is observed. 
> Looking at the implementation I have found that this (z+1) request is caused > by the structure of the remoteRequestsTable. The consequence of Map<Resource, > ResourceRequestInfo> is that ResourceRequestInfo does not hold any > information about whether a request has been sent to the RM yet or not. > There are workarounds for this, such as releasing the excess containers > received. > The solution implemented is to initialize a new ResourceRequest in > ResourceRequestInfo when a request has been successfully sent to the RM. > The patch includes a test in which scenario one is tested. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability
[ https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546421#comment-14546421 ] Bikas Saha commented on YARN-1902: -- Yes. And then the RM may give a container on H1 which is not useful for the app. If we again auto-decrement and release the container then we end up with 2 outstanding requests and the job will hang because it needs 3 containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability
[ https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546377#comment-14546377 ] Bikas Saha commented on YARN-1902: -- The AMRMClient was not written to automatically remove requests because it does not know which requests will be matched to allocated containers. The explicit contract is for users of AMRMClient to remove requests that have been matched to containers. If we change that behavior to automatically remove requests then it may lead to issues where 2 entities are removing requests: 1) the user and 2) the AMRMClient. So that change should only be made in a different version of AMRMClient, or else existing users will break. In the worst case, if the AMRMClient (automatically) removes the wrong request then the application will hang because the RM will not provide it the container that is needed. Not automatically removing the request has the downside of getting additional containers that need to be released by the application. We chose excess containers over hanging for the original implementation. Excess containers should happen rarely because the user controls when AMRMClient heartbeats to the RM and can do that after having removed all matched requests, so that the remote request table reflects the current state of outstanding requests. There may still be a race condition on the RM side that gives more containers. Excess containers can happen more often with AMRMClientAsync, because it heartbeats on a regular schedule and can send more requests than are really outstanding if the heartbeat goes out before the user has removed the matched requests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
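The bookkeeping contract described above (absolute request counts plus explicit removal of matched requests by the user) can be sketched with a toy model. This is hypothetical illustration code, not the actual AMRMClientImpl: the capability key is an arbitrary string, and the heartbeat sends the absolute total, so a matched request that is never removed inflates the next ask to z+1.

```java
import java.util.HashMap;
import java.util.Map;

public class RequestTable {
    // capability -> total outstanding containers asked at that capability
    private final Map<String, Integer> outstanding = new HashMap<>();

    public void addContainerRequest(String capability) {
        outstanding.merge(capability, 1, Integer::sum);
    }

    // The user must call this for each request matched to an allocation.
    public void removeContainerRequest(String capability) {
        outstanding.merge(capability, -1, Integer::sum);
    }

    // What the heartbeat sends: the absolute count, not a delta.
    public int askFor(String capability) {
        return outstanding.getOrDefault(capability, 0);
    }

    public static void main(String[] args) {
        RequestTable t = new RequestTable();
        for (int i = 0; i < 3; i++) t.addContainerRequest("2GB-1vcore"); // z = 3
        // One container is allocated and started, but the app forgets to
        // remove the matched request before adding the "second request":
        t.addContainerRequest("2GB-1vcore");
        System.out.println(t.askFor("2GB-1vcore")); // 4, i.e. z+1
        // Correct usage: remove the matched request first.
        t.removeContainerRequest("2GB-1vcore");
        System.out.println(t.askFor("2GB-1vcore")); // 3
    }
}
```

This mirrors the trade-off in the comment: the table cannot know which of the z requests was matched, so removal has to come from the user.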
[jira] [Resolved] (YARN-240) Rename ProcessTree.isSetsidAvailable
[ https://issues.apache.org/jira/browse/YARN-240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha resolved YARN-240. - Resolution: Won't Fix > Rename ProcessTree.isSetsidAvailable > > > Key: YARN-240 > URL: https://issues.apache.org/jira/browse/YARN-240 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: trunk-win >Reporter: Bikas Saha >Assignee: Bikas Saha > > The logical use of this member is to find out if processes can be grouped > into a unit for process manipulation, e.g. killing process groups. > setsid is the Linux implementation and it leaks into the name. > I suggest renaming it to isProcessGroupAvailable. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-255) Support secure AM launch for unmanaged AM's
[ https://issues.apache.org/jira/browse/YARN-255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated YARN-255: Assignee: (was: Bikas Saha) > Support secure AM launch for unmanaged AM's > --- > > Key: YARN-255 > URL: https://issues.apache.org/jira/browse/YARN-255 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 3.0.0 >Reporter: Bikas Saha > > Currently unmanaged AM launch does not get security tokens because tokens are > passed by the RM to the AM via the NM during AM container launch. For > unmanaged AM's the RM can send tokens in the SubmitApplicationResponse to the > secure client. The client can then pass these onto the AM in a manner similar > to the NM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-394) RM should be able to return requests that it cannot fulfill
[ https://issues.apache.org/jira/browse/YARN-394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated YARN-394: Assignee: (was: Bikas Saha) > RM should be able to return requests that it cannot fulfill > --- > > Key: YARN-394 > URL: https://issues.apache.org/jira/browse/YARN-394 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bikas Saha > > Currently, the RM has no way of returning requests that cannot be met. e.g. > if the app wants a specific node and that node dies, then the RM should > return that request instead of holding onto it indefinitely. > Some situations in which this would be useful are: > * After YARN-392, requests are location specific, and the locations that were > requested are no longer in the cluster. > * A high memory machine is lost, and resource requests above certain sizes > are no longer able to be satisfied anywhere. > * All nodes in the cluster become unavailable. > At these points, there is no way the RM can inform the apps about its > inability to allocate requests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-456) allow OS scheduling priority of NM to be different than the containers it launches for Windows
[ https://issues.apache.org/jira/browse/YARN-456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated YARN-456: Assignee: (was: Bikas Saha) > allow OS scheduling priority of NM to be different than the containers it > launches for Windows > -- > > Key: YARN-456 > URL: https://issues.apache.org/jira/browse/YARN-456 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Bikas Saha > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart
[ https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14523973#comment-14523973 ] Bikas Saha commented on YARN-556: - [~jianhe] [~adhoot] [~kasha] [~vinodkv] Should we resolve this jira as complete? > RM Restart phase 2 - Work preserving restart > > > Key: YARN-556 > URL: https://issues.apache.org/jira/browse/YARN-556 > Project: Hadoop YARN > Issue Type: New Feature > Components: resourcemanager >Reporter: Bikas Saha >Assignee: Bikas Saha > Attachments: Work Preserving RM Restart.pdf, > WorkPreservingRestartPrototype.001.patch, YARN-1372.prelim.patch > > > YARN-128 covered storing the state needed for the RM to recover critical > information. This umbrella jira will track changes needed to recover the > running state of the cluster so that work can be preserved across RM restarts. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-745) Move UnmanagedAMLauncher to yarn client package
[ https://issues.apache.org/jira/browse/YARN-745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated YARN-745: Assignee: (was: Bikas Saha) > Move UnmanagedAMLauncher to yarn client package > --- > > Key: YARN-745 > URL: https://issues.apache.org/jira/browse/YARN-745 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bikas Saha > > It's currently sitting in the yarn applications project, which sounds wrong. The client > project sounds better since it contains the utilities/libraries that clients > use to write and debug yarn applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-908) URL is a YARN API record instead of being java.net.url
[ https://issues.apache.org/jira/browse/YARN-908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14523396#comment-14523396 ] Bikas Saha commented on YARN-908: - The issue is why not use the standard Java URL instead of a YARN-specific URL. One of the things the YARN URL does is serialize to PB, but it could be serialized as a string instead. So what is the use of a YARN URL class? It just confuses users. > URL is a YARN API record instead of being java.net.url > -- > > Key: YARN-908 > URL: https://issues.apache.org/jira/browse/YARN-908 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.1.0-beta >Reporter: Bikas Saha > > Why is it not a java.net.url? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2140) Add support for network IO isolation/scheduling for containers
[ https://issues.apache.org/jira/browse/YARN-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14355395#comment-14355395 ] Bikas Saha commented on YARN-2140: -- This paper may have useful insights into the network sharing issues. http://research.microsoft.com/en-us/um/people/srikanth/data/nsdi11_seawall.pdf > Add support for network IO isolation/scheduling for containers > -- > > Key: YARN-2140 > URL: https://issues.apache.org/jira/browse/YARN-2140 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Wei Yan >Assignee: Wei Yan > Attachments: NetworkAsAResourceDesign.pdf > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2261) YARN should have a way to run post-application cleanup
[ https://issues.apache.org/jira/browse/YARN-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325548#comment-14325548 ] Bikas Saha commented on YARN-2261: -- Looks like AM preemption will not fail the AM and so the comments about AM preemption are probably not valid. > YARN should have a way to run post-application cleanup > -- > > Key: YARN-2261 > URL: https://issues.apache.org/jira/browse/YARN-2261 > Project: Hadoop YARN > Issue Type: New Feature > Components: resourcemanager >Reporter: Vinod Kumar Vavilapalli >Assignee: Vinod Kumar Vavilapalli > > See MAPREDUCE-5956 for context. Specific options are at > https://issues.apache.org/jira/browse/MAPREDUCE-5956?focusedCommentId=14054562&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14054562. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2261) YARN should have a way to run post-application cleanup
[ https://issues.apache.org/jira/browse/YARN-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325547#comment-14325547 ] Bikas Saha commented on YARN-2261: -- Sounds reasonable. Though things like how YARN exposes this information to the user would eventually need to be thought about. Currently the RM page shows running applications. How will it show cleanup in progress? Can the original AM be preempted due to lack of resources? In that case, how will we launch the cleanup container? Though probably the same problem exists now, because if an AM is preempted it would not be able to clean up. However, moving this responsibility from the user to YARN (as proposed in this jira) makes that a YARN problem to solve. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3025) Provide API for retrieving blacklisted nodes
[ https://issues.apache.org/jira/browse/YARN-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292848#comment-14292848 ] Bikas Saha commented on YARN-3025: -- Yes. But that would mean that the RM cannot provide the latest updates. > Provide API for retrieving blacklisted nodes > > > Key: YARN-3025 > URL: https://issues.apache.org/jira/browse/YARN-3025 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Ted Yu > > We have the following method which updates the blacklist: > {code} > public synchronized void updateBlacklist(List<String> blacklistAdditions, > List<String> blacklistRemovals) { > {code} > Upon AM failover, there should be an API which returns the blacklisted nodes > so that the new AM can make consistent decisions. > The new API can be: > {code} > public synchronized List<String> getBlacklistedNodes() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3025) Provide API for retrieving blacklisted nodes
[ https://issues.apache.org/jira/browse/YARN-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291522#comment-14291522 ] Bikas Saha commented on YARN-3025: -- In the worst case, the frequency of this update can be once per heartbeat per AM. That can be high. Let's say 1000 AMs pinging every 1 sec. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3025) Provide API for retrieving blacklisted nodes
[ https://issues.apache.org/jira/browse/YARN-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14271658#comment-14271658 ] Bikas Saha commented on YARN-3025: -- The AM is expected to maintain its own state. However, this could be part of the initial sync with the RM: when the AM connects to the RM, it could provide the list of known running containers (which the AM could connect to), etc. However, the RM probably does not persist this information after RM restart. So it may not be able to provide this information if both the RM and AM restart, leading to an inconsistent response based on whether one or both have restarted before reconnecting. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
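A sketch of the AM-side alternative implied in this thread (hypothetical code, not a YARN API): the AM keeps its own blacklist as part of its state and replays the full set as additions on re-register, instead of asking the RM for state the RM may not have persisted across restarts.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AmBlacklist {
    private final Set<String> blacklisted = new HashSet<>();

    public void blacklist(String node)   { blacklisted.add(node); }
    public void unblacklist(String node) { blacklisted.remove(node); }

    // On reconnect after AM failover, replay the full list as "additions"
    // in the first updateBlacklist call instead of querying the RM for it.
    public List<String> additionsForReregister() {
        return new ArrayList<>(blacklisted);
    }

    public static void main(String[] args) {
        AmBlacklist am = new AmBlacklist();
        am.blacklist("host1");
        am.blacklist("host2");
        am.unblacklist("host1");
        System.out.println(am.additionsForReregister()); // [host2]
    }
}
```

This keeps the AM's view consistent regardless of whether the RM, the AM, or both have restarted, at the cost of the AM persisting the list itself.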
[jira] [Commented] (YARN-2139) [Umbrella] Support for Disk as a Resource in YARN
[ https://issues.apache.org/jira/browse/YARN-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14233341#comment-14233341 ] Bikas Saha commented on YARN-2139: -- So to be clear, currently vdisks is counting the number of physical drives present on the box. Something to keep in mind would be whether this also entails a change in the NM policy of providing a directory on every local dir (which typically maps to every disk) to every task. Tasks are free to choose one or more of those dirs (disks) to write to. This puts the spinning disk head under contention and affects performance of all writers on that disk because seeks are expensive. The rule of thumb tends to be to allocate about as many tasks to a machine as the number of disks (maybe 2x) so as to keep this seek cost low. Should we consider evaluating a change in this policy that gives 1 local dir to a container with 1 vdisk? This way a machine with 6 disks (and 6 vdisks) would have 6 tasks running, each with its own "dedicated" disk. Offhand it's hard to say how this would compare with all 6 disks allocated to all 6 tasks and letting cgroups enforce sharing. If multiple tasks end up choosing the same disk for their writes, then they may not end up getting the "allocation" that they thought they would get. > [Umbrella] Support for Disk as a Resource in YARN > -- > > Key: YARN-2139 > URL: https://issues.apache.org/jira/browse/YARN-2139 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Wei Yan > Attachments: Disk_IO_Isolation_Scheduling_3.pdf, > Disk_IO_Scheduling_Design_1.pdf, Disk_IO_Scheduling_Design_2.pdf, > YARN-2139-prototype-2.patch, YARN-2139-prototype.patch > > > YARN should consider disk as another resource for (1) scheduling tasks on > nodes, (2) isolation at runtime, (3) spindle locality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2139) [Umbrella] Support for Disk as a Resource in YARN
[ https://issues.apache.org/jira/browse/YARN-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232448#comment-14232448 ] Bikas Saha commented on YARN-2139: -- Is the concept of vdisk representing a spinning disk or is it going to be some pluggable API? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2139) [Umbrella] Support for Disk as a Resource in YARN
[ https://issues.apache.org/jira/browse/YARN-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232133#comment-14232133 ] Bikas Saha commented on YARN-2139: -- Thanks for the update. It's not clear to me how we are going to clearly de-couple (1) and (2) from (3). On first thought, scheduling is what prevents over-allocation, and the NM enforces the scheduling decision. Could you please throw some light on that? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability
[ https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228030#comment-14228030 ] Bikas Saha commented on YARN-1902: -- If all the requests are for the same location (say star) then the client needs to send all outstanding requests to the RM. The AM-RM protocol is not a delta protocol. Sending deltas will not work because the RM expects all requests at a given location to be presented for every update. It simply sets that value from the request into its internal book-keeping objects (vs. performing an increment for the "new" request). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
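The set-not-increment semantics described above can be illustrated with a minimal sketch (hypothetical code, not the actual RM scheduler): because the stored ask is overwritten on every update, a client that sends only a delta silently discards its earlier outstanding requests.

```java
import java.util.HashMap;
import java.util.Map;

public class SchedulerBookkeeping {
    // location (e.g. "*", a rack, or a host) -> outstanding container count
    private final Map<String, Integer> asks = new HashMap<>();

    // The RM *sets* the value from the request; it does not increment.
    public void updateResourceRequest(String location, int numContainers) {
        asks.put(location, numContainers);
    }

    public int outstanding(String location) {
        return asks.getOrDefault(location, 0);
    }

    public static void main(String[] args) {
        SchedulerBookkeeping rm = new SchedulerBookkeeping();
        rm.updateResourceRequest("*", 5);        // AM wants 5 containers at "*"
        rm.updateResourceRequest("*", 1);        // sent as a delta by mistake
        System.out.println(rm.outstanding("*")); // 1, not 6: the earlier asks are lost
    }
}
```

This is why the client must always send the full outstanding total for each location on every heartbeat.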
[jira] [Commented] (YARN-2139) [Umbrella] Support for Disk as a Resource in YARN
[ https://issues.apache.org/jira/browse/YARN-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221591#comment-14221591 ] Bikas Saha commented on YARN-2139: -- Given that this design and possible implementation might go through unstable rounds and are currently not abstracted enough in the core code, doing this on a branch seems prudent. Given that SSDs are becoming common, thinking of storage as only spinning disks may be limited. Multiple writers may affect each other more negatively on spinning disk vs SSDs. It may be useful to see if the consideration of storage could be abstracted into a plugin so that storage could have a different resource allocation policy by storage type (e.g. allocate/share by spindle for spinning disk storage vs allocate/share by iops on ssd storage vs allocate/share by network bandwidth for non-DAS storage). If we can abstract the policy into a plugin on trunk itself then perhaps we would not need a branch. Secondly, it will probably take a long time to agree on what a common policy should be and the consensus decision will probably not be a good fit for a large percentage of real clusters because of hardware variety. So making this a plugin would enable quicker development, trial and usage of disk based allocation compared to arriving at a grand unified allocation model for storage. > [Umbrella] Support for Disk as a Resource in YARN > -- > > Key: YARN-2139 > URL: https://issues.apache.org/jira/browse/YARN-2139 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Wei Yan > Attachments: Disk_IO_Scheduling_Design_1.pdf, > Disk_IO_Scheduling_Design_2.pdf, YARN-2139-prototype-2.patch, > YARN-2139-prototype.patch > > > YARN should consider disk as another resource for (1) scheduling tasks on > nodes, (2) isolation at runtime, (3) spindle locality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability
[ https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188516#comment-14188516 ] Bikas Saha commented on YARN-1902: -- bq. Given a ContainerRequest x with Resource y, when addContainerRequest is called z times with x, allocate is called and at least one of the z allocated containers is started, then if another addContainerRequest call is done and subsequently an allocate call to the RM, (z+1) containers will be allocated, where 1 container is expected. Firstly, I am not sure if the same ContainerRequest object can be passed multiple times in addContainerRequest. It should be a different object each time (even if they point to the same resource). This might have something to do with the internal book-keeping done for matching requests. Secondly, after z requests are made and 1 allocation is received, z-1 requests remain. If you are using AMRMClientImpl then it is your (the user's) responsibility to call removeContainerRequest() for the request that was matched to this container. The AMRMClient does not know which of your z requests could be assigned to this container. So in the general case, it cannot automatically remove a request from the internal table because it does not know which request to remove. If the javadocs don't clarify these semantics then we can improve the javadocs. Thirdly, the protocol between the AMRMClient and the RM has an inherent race. So if the client had earlier asked for z containers and in the next heartbeat reduces that to z-1, the RM may actually return z containers to it because it had already allocated them to this client before the client updated the RM with the new value. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14171459#comment-14171459 ] Bikas Saha commented on YARN-2314: -- My understanding from the comments was that in most cases this cache was adding overhead without benefit since the RPC layer was not controlled by the cache. We have no empirical evidence either way about the performance. If you know of cases where this change of default might cause issues, then it would be helpful if they were enumerated in a comment. Then Tez/other users could test for those cases when they upgrade to 2.6 and make their own choices. > ContainerManagementProtocolProxy can create thousands of threads for a large > cluster > > > Key: YARN-2314 > URL: https://issues.apache.org/jira/browse/YARN-2314 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Affects Versions: 2.1.0-beta >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Critical > Attachments: YARN-2314.patch, disable-cm-proxy-cache.patch, > nmproxycachefix.prototype.patch > > > ContainerManagementProtocolProxy has a cache of NM proxies, and the size of > this cache is configurable. However the cache can grow far beyond the > configured size when running on a large cluster and blow AM address/container > limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.2#6252)
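For users who want to opt out of (or size) the proxy cache explicitly rather than rely on the release default, a client-side setting along these lines should work. This is a sketch; the property name `yarn.client.max-cached-nodemanagers-proxies` is assumed from the YARN client configuration of that era, so verify it against your release's yarn-default.xml:

```xml
<!-- yarn-site.xml (client side): 0 disables the NM proxy cache, so each
     ContainerManagementProtocol call builds a fresh proxy instead of
     holding thousands of cached ones on a large cluster. -->
<property>
  <name>yarn.client.max-cached-nodemanagers-proxies</name>
  <value>0</value>
</property>
```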
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14171429#comment-14171429 ] Bikas Saha commented on YARN-2314: -- To be clear, my question was only to clarify if Tez would get the benefits without doing anything because the defaults are correct. Looks like that is the case. Thanks! > ContainerManagementProtocolProxy can create thousands of threads for a large > cluster > > > Key: YARN-2314 > URL: https://issues.apache.org/jira/browse/YARN-2314 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Affects Versions: 2.1.0-beta >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Critical > Attachments: YARN-2314.patch, disable-cm-proxy-cache.patch, > nmproxycachefix.prototype.patch > > > ContainerManagementProtocolProxy has a cache of NM proxies, and the size of > this cache is configurable. However the cache can grow far beyond the > configured size when running on a large cluster and blow AM address/container > limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14171408#comment-14171408 ] Bikas Saha commented on YARN-2314: -- Folks, this is something that would be of interest in Tez since it uses the ContainerManagementProtocolProxy. My summary understanding is that the default is to turn this proxy off and this improves things for large scale clusters. So when Tez moves to 2.6 then it will automatically pick the defaults (which turn caching off) and benefit for large clusters. Is that correct? > ContainerManagementProtocolProxy can create thousands of threads for a large > cluster > > > Key: YARN-2314 > URL: https://issues.apache.org/jira/browse/YARN-2314 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Affects Versions: 2.1.0-beta >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Critical > Attachments: YARN-2314.patch, disable-cm-proxy-cache.patch, > nmproxycachefix.prototype.patch > > > ContainerManagementProtocolProxy has a cache of NM proxies, and the size of > this cache is configurable. However the cache can grow far beyond the > configured size when running on a large cluster and blow AM address/container > limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2261) YARN should have a way to run post-application cleanup
[ https://issues.apache.org/jira/browse/YARN-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059094#comment-14059094 ] Bikas Saha commented on YARN-2261: -- Would that be an existing issue that needs to be tracked separately? > YARN should have a way to run post-application cleanup > -- > > Key: YARN-2261 > URL: https://issues.apache.org/jira/browse/YARN-2261 > Project: Hadoop YARN > Issue Type: New Feature > Components: resourcemanager >Reporter: Vinod Kumar Vavilapalli >Assignee: Vinod Kumar Vavilapalli > > See MAPREDUCE-5956 for context. Specific options are at > https://issues.apache.org/jira/browse/MAPREDUCE-5956?focusedCommentId=14054562&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14054562. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2261) YARN should have a way to run post-application cleanup
[ https://issues.apache.org/jira/browse/YARN-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14058979#comment-14058979 ] Bikas Saha commented on YARN-2261: -- The cleanup would be indistinguishable from an AM that is cleaning up post job completion (as it happens today). Especially if we use the second approach (via AM cleanup mode), this would be virtually indistinguishable from what happens today. > YARN should have a way to run post-application cleanup > -- > > Key: YARN-2261 > URL: https://issues.apache.org/jira/browse/YARN-2261 > Project: Hadoop YARN > Issue Type: New Feature > Components: resourcemanager >Reporter: Vinod Kumar Vavilapalli >Assignee: Vinod Kumar Vavilapalli > > See MAPREDUCE-5956 for context. Specific options are at > https://issues.apache.org/jira/browse/MAPREDUCE-5956?focusedCommentId=14054562&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14054562. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2261) YARN should have a way to run post-application cleanup
[ https://issues.apache.org/jira/browse/YARN-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14055231#comment-14055231 ] Bikas Saha commented on YARN-2261: -- +1 for having the control/responsibility in YARN An alternative that may fit better with the RM model of launching AMs is to optionally have the RM run the AM in cleanup mode. This way the cleanup logic can reside in the AM as it does today and the RM does not need to learn any new tricks about launching anything other than the AM. The existing launch context is used to launch the AM and the AM is told (via env or via register) that it's in cleanup mode. The AM can use its logic to do cleanup and then successfully unregister with the RM. Until the unregister happens the RM can keep restarting the AM in cleanup mode for a max of N times (to handle unexpected failures). When an AM is running in cleanup mode, it will not be allowed to make any allocate requests. This can be handled via AMRMClient so that AMs that use AMRMClient don't need to do anything. The ApplicationMasterService, of course, will need to handle this for non-AMRMClient AMs. By this method, minimal changes will be needed in the API and RM internals to enable this feature in a compatible manner. > YARN should have a way to run post-application cleanup > -- > > Key: YARN-2261 > URL: https://issues.apache.org/jira/browse/YARN-2261 > Project: Hadoop YARN > Issue Type: New Feature > Components: resourcemanager >Reporter: Vinod Kumar Vavilapalli >Assignee: Vinod Kumar Vavilapalli > > See MAPREDUCE-5956 for context. Specific options are at > https://issues.apache.org/jira/browse/MAPREDUCE-5956?focusedCommentId=14054562&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14054562. -- This message was sent by Atlassian JIRA (v6.2#6252)
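The retry flow proposed in that comment can be sketched in plain Java. Everything here is hypothetical (this design was never committed as-is): an RM-side loop that relaunches the AM in cleanup mode until it unregisters, giving up after N attempts.

```java
import java.util.function.Supplier;

// Hypothetical sketch of the proposed RM-side loop: relaunch the AM in
// "cleanup mode" until it unregisters, for at most maxRetries attempts.
public class CleanupDriver {
    // launchAttempt simulates one AM launch in cleanup mode; it returns true
    // if the AM unregistered cleanly. In the proposal, the AM learns it is
    // in cleanup mode via env or register, and allocate calls are rejected.
    public static int runCleanup(Supplier<Boolean> launchAttempt, int maxRetries) {
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            if (launchAttempt.get()) {
                return attempt;        // unregister received: cleanup done
            }
        }
        return -1;                     // give up after N unexpected failures
    }

    public static void main(String[] args) {
        // First attempt crashes, second unregisters successfully.
        int[] calls = {0};
        int used = runCleanup(() -> ++calls[0] == 2, 3);
        System.out.println("attempts=" + used);
    }
}
```

The cap on retries is what keeps a buggy cleanup routine from pinning cluster resources forever.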
[jira] [Commented] (YARN-1366) AM should implement Resync with the ApplicationMasterService instead of shutting down
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053201#comment-14053201 ] Bikas Saha commented on YARN-1366: -- bq. I meant an empty response. After this does AMRMClientASync need to see AMCommand.RE_SYNC? That should be dead code right? > AM should implement Resync with the ApplicationMasterService instead of > shutting down > - > > Key: YARN-1366 > URL: https://issues.apache.org/jira/browse/YARN-1366 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Bikas Saha >Assignee: Rohith > Attachments: YARN-1366.1.patch, YARN-1366.10.patch, > YARN-1366.11.patch, YARN-1366.12.patch, YARN-1366.2.patch, YARN-1366.3.patch, > YARN-1366.4.patch, YARN-1366.5.patch, YARN-1366.6.patch, YARN-1366.7.patch, > YARN-1366.8.patch, YARN-1366.9.patch, YARN-1366.patch, > YARN-1366.prototype.patch, YARN-1366.prototype.patch > > > The ApplicationMasterService currently sends a resync response to which the > AM responds by shutting down. The AM behavior is expected to change to > calling resyncing with the RM. Resync means resetting the allocate RPC > sequence number to 0 and the AM should send its entire outstanding request to > the RM. Note that if the AM is making its first allocate call to the RM then > things should proceed like normal without needing a resync. The RM will > return all containers that have completed since the RM last synced with the > AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) AM should implement Resync with the ApplicationMasterService instead of shutting down
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050978#comment-14050978 ] Bikas Saha commented on YARN-1366: -- Does a null response make sense for the user? > AM should implement Resync with the ApplicationMasterService instead of > shutting down > - > > Key: YARN-1366 > URL: https://issues.apache.org/jira/browse/YARN-1366 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Bikas Saha >Assignee: Rohith > Attachments: YARN-1366.1.patch, YARN-1366.10.patch, > YARN-1366.11.patch, YARN-1366.2.patch, YARN-1366.3.patch, YARN-1366.4.patch, > YARN-1366.5.patch, YARN-1366.6.patch, YARN-1366.7.patch, YARN-1366.8.patch, > YARN-1366.9.patch, YARN-1366.patch, YARN-1366.prototype.patch, > YARN-1366.prototype.patch > > > The ApplicationMasterService currently sends a resync response to which the > AM responds by shutting down. The AM behavior is expected to change to > calling resyncing with the RM. Resync means resetting the allocate RPC > sequence number to 0 and the AM should send its entire outstanding request to > the RM. Note that if the AM is making its first allocate call to the RM then > things should proceed like normal without needing a resync. The RM will > return all containers that have completed since the RM last synced with the > AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) AM should implement Resync with the ApplicationMasterService instead of shutting down
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050965#comment-14050965 ] Bikas Saha commented on YARN-1366: -- Why are we returning the old allocateResponse to the user? What is the user expected to do with this allocateResponse that has a RESYNC command in it? Should we make a second call to allocate (after re-registering) and then send that response back up to the user?
{code}
+// re register with RM
+registerApplicationMaster();
+return allocateResponse;
+  }
{code}
There needs to be some clear documentation that if the user has not removed container requests that have already been satisfied, then the re-register may end up sending the entire ask list to the RM (including matched requests), which would mean the RM could end up giving it a lot of new allocated containers. > AM should implement Resync with the ApplicationMasterService instead of > shutting down > - > > Key: YARN-1366 > URL: https://issues.apache.org/jira/browse/YARN-1366 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Bikas Saha >Assignee: Rohith > Attachments: YARN-1366.1.patch, YARN-1366.10.patch, > YARN-1366.11.patch, YARN-1366.2.patch, YARN-1366.3.patch, YARN-1366.4.patch, > YARN-1366.5.patch, YARN-1366.6.patch, YARN-1366.7.patch, YARN-1366.8.patch, > YARN-1366.9.patch, YARN-1366.patch, YARN-1366.prototype.patch, > YARN-1366.prototype.patch > > > The ApplicationMasterService currently sends a resync response to which the > AM responds by shutting down. The AM behavior is expected to change to > calling resyncing with the RM. Resync means resetting the allocate RPC > sequence number to 0 and the AM should send its entire outstanding request to > the RM. Note that if the AM is making its first allocate call to the RM then > things should proceed like normal without needing a resync. 
The RM will > return all containers that have completed since the RM last synced with the > AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
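The resync contract in the description above can be sketched as a self-contained simulation. All names are hypothetical (the real logic lives in AMRMClientImpl): on RESYNC the client re-registers, resets its allocate sequence number to 0, and resends its entire outstanding ask list.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the resync contract: re-register, reset the allocate
// RPC sequence number, and resend every outstanding request.
public class ResyncClient {
    int responseId = 0;                       // allocate RPC sequence number
    final List<String> outstandingAsks = new ArrayList<>();
    boolean registered = false;

    void register() { registered = true; }

    // Called when an allocate response carries the RESYNC command.
    List<String> onResync() {
        register();                           // re-register with the (new) RM
        responseId = 0;                       // reset the sequence number
        // Resend everything still outstanding. Note the pitfall raised in
        // the comment above: asks already satisfied but never removed via
        // removeContainerRequest get resent too, and may be re-allocated.
        return new ArrayList<>(outstandingAsks);
    }

    public static void main(String[] args) {
        ResyncClient c = new ResyncClient();
        c.outstandingAsks.add("2GB-1vcore");
        c.responseId = 7;                     // mid-session sequence number
        List<String> resent = c.onResync();
        assert c.registered && c.responseId == 0 && resent.size() == 1;
        System.out.println("resent " + resent.size() + " ask(s)");
    }
}
```

A first-ever allocate call looks identical to a fresh start, which is why the description says no resync is needed in that case.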
[jira] [Commented] (YARN-614) Retry attempts automatically for hardware failures or YARN issues and set default app retries to 1
[ https://issues.apache.org/jira/browse/YARN-614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043029#comment-14043029 ] Bikas Saha commented on YARN-614: - Steve, what you want should already happen. AM will suicide and the RM will restart it and count it as an AM failure, just like it does today. This jira is just making other sources of failure, like node loss, not count as an AM failure. > Retry attempts automatically for hardware failures or YARN issues and set > default app retries to 1 > -- > > Key: YARN-614 > URL: https://issues.apache.org/jira/browse/YARN-614 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Bikas Saha >Assignee: Xuan Gong > Fix For: 2.5.0 > > Attachments: YARN-614-0.patch, YARN-614-1.patch, YARN-614-2.patch, > YARN-614-3.patch, YARN-614-4.patch, YARN-614-5.patch, YARN-614-6.patch, > YARN-614.7.patch > > > Attempts can fail due to a large number of user errors and they should not be > retried unnecessarily. The only reason YARN should retry an attempt is when > the hardware fails or YARN has an error. NM failing, lost NM and NM disk > errors are the hardware errors that come to mind. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034876#comment-14034876 ] Bikas Saha commented on YARN-2052: -- With 32 bits for epoch number we have 4 billion restarts before it overflows. We are probably safe without any handling. > ContainerId creation after work preserving restart is broken > > > Key: YARN-2052 > URL: https://issues.apache.org/jira/browse/YARN-2052 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch > > > Container ids are made unique by using the app identifier and appending a > monotonically increasing sequence number to it. Since container creation is a > high churn activity the RM does not store the sequence number per app. So > after restart it does not know what the new sequence number should be for new > allocations. -- This message was sent by Atlassian JIRA (v6.2#6252)
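The headroom claimed in that comment is easy to verify with the id arithmetic. This is a sketch only; the actual bit layout chosen in the YARN-2052 patches may differ. It packs a 32-bit epoch (bumped on every RM restart) above a 32-bit per-app sequence number in a single long id:

```java
// Sketch of why 32 bits of epoch is ample: 2^32 distinct epochs, one per
// RM restart, before the field wraps.
public class EpochId {
    // Hypothetical layout: epoch in the high 32 bits, sequence in the low 32.
    static long containerId(int epoch, int sequence) {
        return ((long) epoch << 32) | (sequence & 0xFFFFFFFFL);
    }

    static int epochOf(long id) {
        return (int) (id >>> 32);   // unsigned shift recovers the epoch
    }

    public static void main(String[] args) {
        long id = containerId(3, 42);
        assert epochOf(id) == 3;
        // ~4.29 billion restarts before the 32-bit epoch overflows.
        System.out.println("epochs available = " + (1L << 32));
    }
}
```

Ids minted before a restart keep their old epoch, so they never collide with ids minted after it.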
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034732#comment-14034732 ] Bikas Saha commented on YARN-2052: -- Ah. I did not see the rest of the comment. Yes. Integer overflow is a problem. We should make it a long in the same release as the epoch number addition so that we don't have to worry about that. > ContainerId creation after work preserving restart is broken > > > Key: YARN-2052 > URL: https://issues.apache.org/jira/browse/YARN-2052 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch > > > Container ids are made unique by using the app identifier and appending a > monotonically increasing sequence number to it. Since container creation is a > high churn activity the RM does not store the sequence number per app. So > after restart it does not know what the new sequence number should be for new > allocations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034731#comment-14034731 ] Bikas Saha commented on YARN-2052: -- Why would ContainerId#compareTo fail? Existing containerId's should remain unchanged after RM restart. Only new container ids should have a different epoch number. > ContainerId creation after work preserving restart is broken > > > Key: YARN-2052 > URL: https://issues.apache.org/jira/browse/YARN-2052 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch > > > Container ids are made unique by using the app identifier and appending a > monotonically increasing sequence number to it. Since container creation is a > high churn activity the RM does not store the sequence number per app. So > after restart it does not know what the new sequence number should be for new > allocations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1373) Transition RMApp and RMAppAttempt state to RUNNING after restart for recovered running apps
[ https://issues.apache.org/jira/browse/YARN-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034700#comment-14034700 ] Bikas Saha commented on YARN-1373: -- Sorry I am not clear how this is a dup. This jira is tracking new behavior in the RM that will transition a recovered RMAppImpl/RMAppAttemptImpl (and still running for real) app to a RUNNING state instead of a terminal recovered state. This is to ensure that the state machines are in the correct state for the running AM to resync and continue as running. This is not related to killing the app master process on the NM. > Transition RMApp and RMAppAttempt state to RUNNING after restart for > recovered running apps > --- > > Key: YARN-1373 > URL: https://issues.apache.org/jira/browse/YARN-1373 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Bikas Saha >Assignee: Omkar Vinit Joshi > > Currently the RM moves recovered app attempts to a terminal recovered > state and starts a new attempt. Instead, it will have to transition the last > attempt to a running state such that it can proceed as normal once the > running attempt has resynced with the ApplicationMasterService (YARN-1365 and > YARN-1366). If the RM had started the application container before dying then > the AM would be up and trying to contact the RM. The RM may have had died > before launching the container. For this case, the RM should wait for AM > liveliness period and issue a kill container for the stored master container. > It should transition this attempt to some RECOVER_ERROR state and proceed to > start a new attempt. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034691#comment-14034691 ] Bikas Saha commented on YARN-2052: -- bq. Had an offline discussion with Vinod. Maybe it's still better to persist some sequence number to indicate the number of RM restarts when RM starts up. Is this the same as the epoch number that was mentioned earlier in this jira? https://issues.apache.org/jira/browse/YARN-2052?focusedCommentId=13996675&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13996675. Seems to me that it's the same with epoch number changed to num-rm-restarts. > ContainerId creation after work preserving restart is broken > > > Key: YARN-2052 > URL: https://issues.apache.org/jira/browse/YARN-2052 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch > > > Container ids are made unique by using the app identifier and appending a > monotonically increasing sequence number to it. Since container creation is a > high churn activity the RM does not store the sequence number per app. So > after restart it does not know what the new sequence number should be for new > allocations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028734#comment-14028734 ] Bikas Saha commented on YARN-2052: -- Don't we already read and write synchronously from the store during RM startup? If we have an epoch number then it must be persisted. > ContainerId creation after work preserving restart is broken > > > Key: YARN-2052 > URL: https://issues.apache.org/jira/browse/YARN-2052 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA > > Container ids are made unique by using the app identifier and appending a > monotonically increasing sequence number to it. Since container creation is a > high churn activity the RM does not store the sequence number per app. So > after restart it does not know what the new sequence number should be for new > allocations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2148) TestNMClient failed due more exit code values added and passed to AM
[ https://issues.apache.org/jira/browse/YARN-2148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028213#comment-14028213 ] Bikas Saha commented on YARN-2148: -- lgtm. +1. > TestNMClient failed due more exit code values added and passed to AM > > > Key: YARN-2148 > URL: https://issues.apache.org/jira/browse/YARN-2148 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Affects Versions: 3.0.0, 2.5.0 >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-2148.patch > > > Currently, TestNMClient will be failed in trunk, see > https://builds.apache.org/job/PreCommit-YARN-Build/3959/testReport/junit/org.apache.hadoop.yarn.client.api.impl/TestNMClient/testNMClient/ > {code} > java.lang.AssertionError: null > at org.junit.Assert.fail(Assert.java:86) > at org.junit.Assert.assertTrue(Assert.java:41) > at org.junit.Assert.assertTrue(Assert.java:52) > at > org.apache.hadoop.yarn.client.api.impl.TestNMClient.testGetContainerStatus(TestNMClient.java:385) > at > org.apache.hadoop.yarn.client.api.impl.TestNMClient.testContainerManagement(TestNMClient.java:347) > at > org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClient(TestNMClient.java:226) > {code} > Test cases in TestNMClient uses following code to verify exit code of > COMPLETED containers > {code} > testGetContainerStatus(container, i, ContainerState.COMPLETE, > "Container killed by the ApplicationMaster.", Arrays.asList( > new Integer[] {137, 143, 0})); > {code} > But YARN-2091 added logic to make exit code reflecting the actual status, so > exit code of the "killed by ApplicationMaster" will be -105, > {code} > if (container.hasDefaultExitCode()) { > container.exitCode = exitEvent.getExitCode(); > } > {code} > We should update test case as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2148) TestNMClient failed due more exit code values added and passed to AM
[ https://issues.apache.org/jira/browse/YARN-2148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028216#comment-14028216 ] Bikas Saha commented on YARN-2148: -- [~ozawa] Can you please apply this patch and run the full YARN tests. I would like to make sure that we catch all tests that may have failed due to the container exit status. I can commit the patch if there are no new test failures. Else we should fix them all in this jira. Thanks! > TestNMClient failed due more exit code values added and passed to AM > > > Key: YARN-2148 > URL: https://issues.apache.org/jira/browse/YARN-2148 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Affects Versions: 3.0.0, 2.5.0 >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-2148.patch > > > Currently, TestNMClient will be failed in trunk, see > https://builds.apache.org/job/PreCommit-YARN-Build/3959/testReport/junit/org.apache.hadoop.yarn.client.api.impl/TestNMClient/testNMClient/ > {code} > java.lang.AssertionError: null > at org.junit.Assert.fail(Assert.java:86) > at org.junit.Assert.assertTrue(Assert.java:41) > at org.junit.Assert.assertTrue(Assert.java:52) > at > org.apache.hadoop.yarn.client.api.impl.TestNMClient.testGetContainerStatus(TestNMClient.java:385) > at > org.apache.hadoop.yarn.client.api.impl.TestNMClient.testContainerManagement(TestNMClient.java:347) > at > org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClient(TestNMClient.java:226) > {code} > Test cases in TestNMClient uses following code to verify exit code of > COMPLETED containers > {code} > testGetContainerStatus(container, i, ContainerState.COMPLETE, > "Container killed by the ApplicationMaster.", Arrays.asList( > new Integer[] {137, 143, 0})); > {code} > But YARN-2091 added logic to make exit code reflecting the actual status, so > exit code of the "killed by ApplicationMaster" will be -105, > {code} > if (container.hasDefaultExitCode()) { > container.exitCode = 
exitEvent.getExitCode(); > } > {code} > We should update test case as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027187#comment-14027187 ] Bikas Saha commented on YARN-1365: -- Sounds like the right approach. Keeps things consistent. Allowing unregister without register (while sounding harmless by itself) would need changes in the state machine to support and also breaks the existing contract that even an "empty" application needs to at least call register and unregister or it's considered failed. > ApplicationMasterService to allow Register and Unregister of an app that was > running before restart > --- > > Key: YARN-1365 > URL: https://issues.apache.org/jira/browse/YARN-1365 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Bikas Saha >Assignee: Anubhav Dhoot > Attachments: YARN-1365.001.patch, YARN-1365.002.patch, > YARN-1365.003.patch, YARN-1365.004.patch, YARN-1365.initial.patch > > > For an application that was running before restart, the > ApplicationMasterService currently throws an exception when the app tries to > make the initial register or final unregister call. These should succeed and > the RMApp state machine should transition to completed like normal. > Unregistration should succeed for an app that the RM considers complete since > the RM may have died after saving completion in the store but before > notifying the AM that the AM is free to exit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2091) Add more values to ContainerExitStatus and pass it from NM to RM and then to app masters
[ https://issues.apache.org/jira/browse/YARN-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated YARN-2091: - Summary: Add more values to ContainerExitStatus and pass it from NM to RM and then to app masters (was: Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters) > Add more values to ContainerExitStatus and pass it from NM to RM and then to > app masters > > > Key: YARN-2091 > URL: https://issues.apache.org/jira/browse/YARN-2091 > Project: Hadoop YARN > Issue Type: Task >Reporter: Bikas Saha >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2091.1.patch, YARN-2091.2.patch, YARN-2091.3.patch, > YARN-2091.4.patch, YARN-2091.5.patch, YARN-2091.6.patch, YARN-2091.7.patch, > YARN-2091.8.patch > > > Currently, the AM cannot programmatically determine if the task was killed > due to using excessive memory. The NM kills it without passing this > information in the container status back to the RM. So the AM cannot take any > action here. The jira tracks adding this exit status and passing it from the > NM to the RM and then the AM. In general, there may be other such actions > taken by YARN that are currently opaque to the AM. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2140) Add support for network IO isolation/scheduling for containers
[ https://issues.apache.org/jira/browse/YARN-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026672#comment-14026672 ] Bikas Saha commented on YARN-2140: -- [~ywskycn] For this and YARN-2139 my suggestion would be to first post a design sketch and discuss some alternatives. You may prototype some approach to get supporting data for that design doc. This will help get community interaction and understanding for your proposal and enable quicker progress. > Add support for network IO isolation/scheduling for containers > -- > > Key: YARN-2140 > URL: https://issues.apache.org/jira/browse/YARN-2140 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Wei Yan >Assignee: Wei Yan > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2091) Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters
[ https://issues.apache.org/jira/browse/YARN-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020350#comment-14020350 ] Bikas Saha commented on YARN-2091: -- How about naming it hasDefaultExitCode() and directly using ContainerExitStatus.INVALID instead of creating a new member variable.
{code}
+
+  private boolean isDefaultExitCode() {
+    return (this.exitCode == DEFAULT_EXIT_CODE);
+  }
{code}
Can we rename this to getContainerExitStatus() so that it's clear that we are getting a ContainerExitStatus value.
{code}
+  public int getReason() {
+    return this.reason;
+  }
{code}
I would suggest the following names:
KILL_AM_STOP_CONTAINER -> KILLED_BY_APPMASTER - Container was terminated by stop request by the app master
KILL_BY_RESOURCEMANAGER -> KILLED_BY_RESOURCEMANAGER - Container was terminated by the resource manager.
KILL_FINISHED_APPMASTER -> KILLED_AFTER_APP_COMPLETION - Container was terminated after the application finished
KILL_EXCEEDED_PMEM -> KILLED_EXCEEDED_PMEM - Container terminated because of exceeding allocated physical memory
KILL_EXCEEDED_VMEM -> KILLED_EXCEEDED_VMEM - Container terminated because of exceeding allocated virtual memory
Spurious edit?
{code}
-  //if the current state is NEW it means the CONTAINER_INIT was never
+  //if the current state is NEW it means the CONTAINER_INIT was never
{code}
I am +1 after these comments. [~vinodkv] Do you have any further comments? > Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters > --- > > Key: YARN-2091 > URL: https://issues.apache.org/jira/browse/YARN-2091 > Project: Hadoop YARN > Issue Type: Task >Reporter: Bikas Saha >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2091.1.patch, YARN-2091.2.patch, YARN-2091.3.patch, > YARN-2091.4.patch, YARN-2091.5.patch, YARN-2091.6.patch, YARN-2091.7.patch > > > Currently, the AM cannot programmatically determine if the task was killed > due to using excessive memory. 
The NM kills it without passing this > information in the container status back to the RM. So the AM cannot take any > action here. The jira tracks adding this exit status and passing it from the > NM to the RM and then the AM. In general, there may be other such actions > taken by YARN that are currently opaque to the AM. -- This message was sent by Atlassian JIRA (v6.2#6252)
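A hedged sketch of what the renamed constants could look like with the one-line docs the review asks for. This is a stand-in class, not the real org.apache.hadoop.yarn.api.records.ContainerExitStatus; the numeric values are illustrative placeholders chosen only to be distinct and non-zero.

```java
// Illustrative stand-in, not the actual YARN class: the renamed exit-status
// constants proposed in the review, each carrying the one-line doc comment
// the reviewer asks for. Values are placeholders for the sketch.
final class ExitStatusNamesSketch {
    /** Container was terminated by a stop request from the app master. */
    static final int KILLED_BY_APPMASTER = -105;
    /** Container was terminated by the resource manager. */
    static final int KILLED_BY_RESOURCEMANAGER = -106;
    /** Container was terminated after the application finished. */
    static final int KILLED_AFTER_APP_COMPLETION = -107;
    /** Container terminated because of exceeding allocated physical memory. */
    static final int KILLED_EXCEEDED_PMEM = -104;
    /** Container terminated because of exceeding allocated virtual memory. */
    static final int KILLED_EXCEEDED_VMEM = -103;

    private ExitStatusNamesSketch() {} // constants holder, not instantiable
}
```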
[jira] [Commented] (YARN-2091) Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters
[ https://issues.apache.org/jira/browse/YARN-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14016081#comment-14016081 ] Bikas Saha commented on YARN-2091:
--
If we are sure that the default value is set in the code to ContainerExitStatus.INVALID, then that sounds good. Given that ContainerExitStatus.INVALID == 5000, we have to explicitly initialize the field with that value, since Java will default an int to 0.
> Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters
> ---
>
> Key: YARN-2091
> URL: https://issues.apache.org/jira/browse/YARN-2091
> Project: Hadoop YARN
> Issue Type: Task
> Reporter: Bikas Saha
> Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2091.1.patch, YARN-2091.2.patch, YARN-2091.3.patch, YARN-2091.4.patch, YARN-2091.5.patch, YARN-2091.6.patch
>
> Currently, the AM cannot programmatically determine if the task was killed due to using excessive memory. The NM kills it without passing this information in the container status back to the RM. So the AM cannot take any action here. The jira tracks adding this exit status and passing it from the NM to the RM and then the AM. In general, there may be other such actions taken by YARN that are currently opaque to the AM.
-- This message was sent by Atlassian JIRA (v6.2#6252)
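A minimal, self-contained sketch of the initialization concern raised above: a Java int field defaults to 0, and 0 is a legitimate exit code, so "not yet set" needs an explicitly assigned out-of-band sentinel. ContainerSketch and the sentinel value used here are illustrative stand-ins, not the real ContainerImpl or ContainerExitStatus.

```java
// Stand-in sketch (not the real YARN classes). Relying on Java's default
// field value of 0 would make a successfully exited container look
// indistinguishable from one whose exit code was never recorded.
class ContainerSketch {
    // Stand-in for ContainerExitStatus.INVALID; value is illustrative.
    static final int INVALID = -1000;

    // Explicit initialization: without this the field would silently be 0.
    private int exitCode = INVALID;

    boolean hasDefaultExitCode() {
        return exitCode == INVALID;
    }

    void setExitCode(int code) {
        this.exitCode = code;
    }

    int getExitCode() {
        return exitCode;
    }
}
```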
[jira] [Commented] (YARN-2091) Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters
[ https://issues.apache.org/jira/browse/YARN-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015611#comment-14015611 ] Bikas Saha commented on YARN-2091:
--
Can this miss a case when the exitCode has not been set (e.g. when the container crashes on its own)? Should we check if the exitCode has already been set (e.g. via a kill event) and, if it's not set, then set it from exitEvent? How can we check if the exitCode has not been set? Maybe have some uninitialized/invalid default value.
{code}
@@ -829,7 +829,6 @@ public void transition(ContainerImpl container, ContainerEvent event) {
   @Override
   public void transition(ContainerImpl container, ContainerEvent event) {
     ContainerExitEvent exitEvent = (ContainerExitEvent) event;
-    container.exitCode = exitEvent.getExitCode();
{code}
The new exit status codes need better comments/docs. E.g. what is the difference between the 2 new appmaster-related exit statuses? Is KILL_BY_RESOURCEMANAGER a generic value that can be replaced later on by a more specific reason like preempted?
> Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters
> ---
>
> Key: YARN-2091
> URL: https://issues.apache.org/jira/browse/YARN-2091
> Project: Hadoop YARN
> Issue Type: Task
> Reporter: Bikas Saha
> Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2091.1.patch, YARN-2091.2.patch, YARN-2091.3.patch, YARN-2091.4.patch, YARN-2091.5.patch, YARN-2091.6.patch
>
> Currently, the AM cannot programmatically determine if the task was killed due to using excessive memory. The NM kills it without passing this information in the container status back to the RM. So the AM cannot take any action here. The jira tracks adding this exit status and passing it from the NM to the RM and then the AM. In general, there may be other such actions taken by YARN that are currently opaque to the AM.
-- This message was sent by Atlassian JIRA (v6.2#6252)
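The check-before-set pattern suggested above could look roughly like this. All names here (TransitionSketch, onKill, onProcessExit) and the sentinel value are hypothetical simplifications of the ContainerImpl transition code, not the actual patch under review.

```java
// Simplified stand-in for the ContainerImpl exit transition discussed above.
// A kill event records a specific reason first; a later process-exit event
// only fills in the raw exit code when no reason was already recorded, so a
// container that crashes on its own still has its exit code captured.
class TransitionSketch {
    static final int INVALID = -1000; // stand-in "not yet set" sentinel
    int exitCode = INVALID;

    // NM kills the container for a specific reason (e.g. exceeded memory).
    void onKill(int reason) {
        exitCode = reason;
    }

    // The container process exits; keep an earlier kill reason if present.
    void onProcessExit(int processExitCode) {
        if (exitCode == INVALID) {
            exitCode = processExitCode;
        }
    }
}
```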
[jira] [Commented] (YARN-2091) Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters
[ https://issues.apache.org/jira/browse/YARN-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014284#comment-14014284 ] Bikas Saha commented on YARN-2091:
--
We can check all cases of ContainerKillEvent and add new ExitStatus values where it makes sense, or use some good default value. If a test needs to change to account for a new value then we should change the test. There may be other cases of exit status being set or tested which are unrelated to container kill event. Those can stay out of the scope of this jira.
> Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters
> ---
>
> Key: YARN-2091
> URL: https://issues.apache.org/jira/browse/YARN-2091
> Project: Hadoop YARN
> Issue Type: Task
> Reporter: Bikas Saha
> Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2091.1.patch, YARN-2091.2.patch, YARN-2091.3.patch, YARN-2091.4.patch
>
> Currently, the AM cannot programmatically determine if the task was killed due to using excessive memory. The NM kills it without passing this information in the container status back to the RM. So the AM cannot take any action here. The jira tracks adding this exit status and passing it from the NM to the RM and then the AM. In general, there may be other such actions taken by YARN that are currently opaque to the AM.
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2091) Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters
[ https://issues.apache.org/jira/browse/YARN-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014079#comment-14014079 ] Bikas Saha commented on YARN-2091:
--
Why is "isAMAware" needed? All values in ContainerExitStatus are public, and hence user code should already be aware of them.
> Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters
> ---
>
> Key: YARN-2091
> URL: https://issues.apache.org/jira/browse/YARN-2091
> Project: Hadoop YARN
> Issue Type: Task
> Reporter: Bikas Saha
> Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2091.1.patch, YARN-2091.2.patch, YARN-2091.3.patch, YARN-2091.4.patch
>
> Currently, the AM cannot programmatically determine if the task was killed due to using excessive memory. The NM kills it without passing this information in the container status back to the RM. So the AM cannot take any action here. The jira tracks adding this exit status and passing it from the NM to the RM and then the AM. In general, there may be other such actions taken by YARN that are currently opaque to the AM.
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2091) Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters
[ https://issues.apache.org/jira/browse/YARN-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013071#comment-14013071 ] Bikas Saha commented on YARN-2091:
--
That would make sense if YARN allowed specifying pmem and vmem separately in a resource request. Without that, the information is not actionable.
> Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters
> ---
>
> Key: YARN-2091
> URL: https://issues.apache.org/jira/browse/YARN-2091
> Project: Hadoop YARN
> Issue Type: Task
> Reporter: Bikas Saha
> Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2091.1.patch, YARN-2091.2.patch
>
> Currently, the AM cannot programmatically determine if the task was killed due to using excessive memory. The NM kills it without passing this information in the container status back to the RM. So the AM cannot take any action here. The jira tracks adding this exit status and passing it from the NM to the RM and then the AM. In general, there may be other such actions taken by YARN that are currently opaque to the AM.
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2091) Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters
[ https://issues.apache.org/jira/browse/YARN-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012980#comment-14012980 ] Bikas Saha commented on YARN-2091:
--
Instead of having the following if-else code everywhere, can we simply always use the "reason" from the kill event? That way, if we add a different reason tomorrow (exceeded disk quota), we don't have to find all these special cases.
{code}
+  if (killEvent.getReason() == ContainerExitStatus.KILL_EXCEEDED_MEMORY) {
+    container.exitCode = killEvent.getReason();
+  } else {
+    container.exitCode = ExitCode.TERMINATED.getExitCode();
+  }
{code}
As an exercise, we can find other cases of ContainerKillEvent and see if new enums (like exceeded_memory) can be added, or if a suitable default value can be found. Clearly, if we are killing, then there should be a reason.
> Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters
> ---
>
> Key: YARN-2091
> URL: https://issues.apache.org/jira/browse/YARN-2091
> Project: Hadoop YARN
> Issue Type: Task
> Reporter: Bikas Saha
> Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2091.1.patch, YARN-2091.2.patch
>
> Currently, the AM cannot programmatically determine if the task was killed due to using excessive memory. The NM kills it without passing this information in the container status back to the RM. So the AM cannot take any action here. The jira tracks adding this exit status and passing it from the NM to the RM and then the AM. In general, there may be other such actions taken by YARN that are currently opaque to the AM.
-- This message was sent by Atlassian JIRA (v6.2#6252)
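The simplification proposed above, sketched with hypothetical stand-in classes: the kill event always carries its reason, and the container copies it unconditionally, so adding a new reason later requires no new branch at this site.

```java
// Stand-in sketch of "always use the reason from the kill event" instead of
// special-casing each reason with if/else. A new reason (e.g. an exceeded
// disk quota) would flow through without touching this code.
class KillEventSketch {
    private final int reason;

    KillEventSketch(int reason) {
        this.reason = reason;
    }

    int getReason() {
        return reason;
    }
}

class ContainerStateSketch {
    int exitCode;

    void onKill(KillEventSketch event) {
        // Unconditional copy: no per-reason branching to maintain here.
        exitCode = event.getReason();
    }
}
```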
[jira] [Commented] (YARN-2091) Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters
[ https://issues.apache.org/jira/browse/YARN-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010630#comment-14010630 ] Bikas Saha commented on YARN-2091:
--
We are on the same page. The kill reason is directly a ContainerExitStatus.
> Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters
> ---
>
> Key: YARN-2091
> URL: https://issues.apache.org/jira/browse/YARN-2091
> Project: Hadoop YARN
> Issue Type: Task
> Reporter: Bikas Saha
> Assignee: Tsuyoshi OZAWA
>
> Currently, the AM cannot programmatically determine if the task was killed due to using excessive memory. The NM kills it without passing this information in the container status back to the RM. So the AM cannot take any action here. The jira tracks adding this exit status and passing it from the NM to the RM and then the AM. In general, there may be other such actions taken by YARN that are currently opaque to the AM.
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2091) Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters
[ https://issues.apache.org/jira/browse/YARN-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010585#comment-14010585 ] Bikas Saha commented on YARN-2091:
--
That's the missing piece AFAIK. The exit reason needs to be passed along internally through the NM and then on to the RM and AM. Maybe simply directly use ContainerExitStatus instead of a new reason object inside ContainerKillEvent.
> Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters
> ---
>
> Key: YARN-2091
> URL: https://issues.apache.org/jira/browse/YARN-2091
> Project: Hadoop YARN
> Issue Type: Task
> Reporter: Bikas Saha
> Assignee: Tsuyoshi OZAWA
>
> Currently, the AM cannot programmatically determine if the task was killed due to using excessive memory. The NM kills it without passing this information in the container status back to the RM. So the AM cannot take any action here. The jira tracks adding this exit status and passing it from the NM to the RM and then the AM. In general, there may be other such actions taken by YARN that are currently opaque to the AM.
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007866#comment-14007866 ] Bikas Saha commented on YARN-796:
--
Thanks [~john.jian.fang]. An interesting use case from your comment is that labels can be used to specify not only affinity but also anti-affinity, i.e. don't place tasks on nodes with a certain label. Do I correctly understand this as being your use case? Or is your ask that node managers should specify their own labels when they register with the RM, instead of the node-manager-to-label mapping being a central RM configuration?
> Allow for (admin) labels on nodes and resource-requests
> ---
>
> Key: YARN-796
> URL: https://issues.apache.org/jira/browse/YARN-796
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Arun C Murthy
> Assignee: Wangda Tan
> Attachments: YARN-796.patch
>
> It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc.
> We should expose these labels and allow applications to specify labels on resource-requests.
> Obviously we need to support admin operations on adding/removing node labels.
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2091) Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters
Bikas Saha created YARN-2091:
Summary: Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters
Key: YARN-2091
URL: https://issues.apache.org/jira/browse/YARN-2091
Project: Hadoop YARN
Issue Type: Task
Reporter: Bikas Saha
Currently, the AM cannot programmatically determine if the task was killed due to using excessive memory. The NM kills it without passing this information in the container status back to the RM. So the AM cannot take any action here. The jira tracks adding this exit status and passing it from the NM to the RM and then the AM. In general, there may be other such actions taken by YARN that are currently opaque to the AM.
-- This message was sent by Atlassian JIRA (v6.2#6252)