[jira] [Updated] (YARN-422) Add NM client library
[ https://issues.apache.org/jira/browse/YARN-422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-422:
- Attachment: YARN-422.3.patch

The newest patch makes the following significant changes:
1. NMCommunicator closes the RPC proxy every time an interaction with the NodeManager finishes.
2. No strict order of starting/querying/stopping a container is enforced. Because of this, NMClient#getContainerStatus and NMClient#stopContainer need to take two more params, i.e., NodeId and ContainerToken. These are used to start the RPC proxy (previously startContainer had to be called first, so that this information was already stored when the subsequent interactions were invoked).
3. Due to the stateless sessions, NMClientImpl no longer needs to keep an NMCommunicator instance for each started container. However, NMClientImpl still needs to remember which containers have not been stopped: the containers still alive when NMClientImpl stops need to be stopped then, since otherwise there may be no way to stop them afterwards.
4. CallbackHandler provides distinct handlers for exceptions raised by each type of interaction with a container. Therefore, the event type no longer needs to be exposed publicly.
In addition, I've addressed the code refactoring issues mentioned in Vinod's comments and modified the test cases.

Add NM client library
-
Key: YARN-422
URL: https://issues.apache.org/jira/browse/YARN-422
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Bikas Saha
Assignee: Zhijie Shen
Attachments: AMNMClient_Defination.txt, AMNMClient_Definition_Updated_With_Tests.txt, proposal_v1.pdf, YARN-422.1.patch, YARN-422.2.patch, YARN-422.3.patch

Create a simple wrapper over the ContainerManager protocol to hide the details of the protocol implementation.
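To make the new calling convention concrete, here is a minimal Java sketch of the stateless flow described in the patch notes. The parameter lists of getContainerStatus/stopContainer follow the description above and are illustrative, not the final committed signatures.

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.ContainerStatus;
import org.apache.hadoop.yarn.client.api.NMClient;

public class NMClientSketch {
  // Illustrative flow: each call opens and closes its own RPC proxy, so no
  // per-container NMCommunicator state survives between calls.
  void launchQueryStop(Configuration conf, Container container,
      ContainerLaunchContext ctx) throws Exception {
    NMClient nmClient = NMClient.createNMClient();
    nmClient.init(conf);
    nmClient.start();
    try {
      nmClient.startContainer(container, ctx);

      // Query and stop may now happen in any order, because NodeId and
      // ContainerToken are passed explicitly on every call.
      ContainerStatus status = nmClient.getContainerStatus(
          container.getId(), container.getNodeId(), container.getContainerToken());
      System.out.println("Container state: " + status.getState());

      nmClient.stopContainer(
          container.getId(), container.getNodeId(), container.getContainerToken());
    } finally {
      nmClient.stop(); // also stops any containers still alive (see the discussion below)
    }
  }
}
{code}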
[jira] [Commented] (YARN-422) Add NM client library
[ https://issues.apache.org/jira/browse/YARN-422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13649119#comment-13649119 ] Bikas Saha commented on YARN-422:
-
Is it necessary for the library to stop all containers before stopping itself? I don't think this is part of the semantics of the protocol, and it should not be enforced by the library. I can easily see cases in which clients start a bunch of long-running containers and go away.
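One way to reconcile the two behaviors would be to make the cleanup an opt-out policy rather than a hard rule. The sketch below is purely hypothetical; cleanupRunningContainersOnStop is an illustrative name, not an existing NMClient method.

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.client.api.NMClient;

public class DetachingClientSketch {
  void startAndDetach(Configuration conf) {
    NMClient nmClient = NMClient.createNMClient();
    // Hypothetical switch (illustrative only): opt out of the
    // stop-all-containers-on-stop behavior, so long-running containers
    // survive the client that started them.
    nmClient.cleanupRunningContainersOnStop(false);
    nmClient.init(conf);
    nmClient.start();
    // ... start long-running containers, then detach ...
    nmClient.stop(); // the client goes away; its containers keep running
  }
}
{code}

With such a default left at true, the patch-3 behavior would be preserved, while clients that intentionally outlive their session get a way to detach.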
[jira] [Commented] (YARN-568) FairScheduler: support for work-preserving preemption
[ https://issues.apache.org/jira/browse/YARN-568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13649126#comment-13649126 ] Carlo Curino commented on YARN-568:
---
Sandy, I agree with your summary of the FS mechanics, and you raise important questions that I try to address below. The idea behind the preemption we are introducing is to preempt first and kill later, to allow the AM to save its work before the kill (in the CS we go a step further and let the AM pick the containers, but it is a bit trickier, so I would leave it out for the time being). This requires us to be consistent in how we pick the containers: first ask nicely, then kill the same containers if the AM is ignoring us or being too slow. This is needed to give the AM a consistent view of the RM's needs. Assuming we are consistent in picking containers, I think the simple mechanics we posted should be ok. Now, how can we get there:

1) This translates into a deterministic choice of containers across invocations of the preemption procedure. Sorting by priority is a first step in that direction (although, as I commented [here | https://issues.apache.org/jira/browse/YARN-569?focusedCommentId=13638825&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13638825], there are some other issues with that). Adding reverse-container-ordering (missing now) might help guarantee that the picking order is consistent. In particular, if the need for preemption is consistent over time, no new containers would be granted to this app, so picking from the tail should yield a consistent set of containers (minus the ones naturally expiring, which would be accounted for in future runs as a reduced preemption need). On the other hand, if the cluster conditions change drastically enough (e.g., a big job finishes) and there is no more need to kill some containers from this app, we save the cost of kill-and-reschedule. In a sense, instead of looking at an instantaneous need for preemption every 15 sec, we check every 5 seconds and only kill when there is a sustained need over a window of maxWaitTimeBeforeKill. I think that if we can get this to work as intended we would get a better overall policy (less jitter).

2) toPreempt is decremented in all three cases because we would otherwise double-kill for the same resource needs: imagine you want 5 containers back and send the corresponding preemption requests; while the AMs are working on preemption, the preemption procedure is called again and re-detects that we want 5 containers back. If you don't account for the pending requests (i.e., decrement toPreempt for those too), you would pick (preempt or kill) another 5 containers (depending on the time constants, this could happen more than twice)... now we are forcing the AMs to release 10 (or more) containers for a 5-container preemption need. Anyway, I agree that once we converge on this we should document it clearly in the code; this seems the kind of code that people would try to "fix" :-). The shift you spotted with this comment is from running rarely enough that all the actions initiated during a previous run are fully reflected in the current cluster state, to running frequently enough that the actions we are taking might not be visible yet. This forces us to do some more bookkeeping and have robust heuristics, but I think it is worth the improvement in the scheduler's behavior.

3) It is probably good to have a no-preemption mode in which we simply kill outright.
However, by setting the time constants right (e.g., preemptionInterval to 5 sec and maxWaitTimeBeforeKill to 10 sec) you would get the same effect of a hard kill at most 15 sec after the need for preemption arises, but for every preemption-aware AM we could save the progress made so far. In our current MR implementation of preemption, you might get containers back even faster, as we release containers once we are done checkpointing. Note that since we are not actually killing at every preemptionInterval, we could set it very low (if the performance of the FS allows it) and get more points of observation and faster reaction times, while maxWaitTimeBeforeKill would be tuned as a tradeoff between giving the AM enough time to preempt and the speed of rebalance. I will look into adding the allocation order as a second-level ordering for containers. Please let me know whether this seems enough or I am missing something.

FairScheduler: support for work-preserving preemption
--
Key: YARN-568
URL: https://issues.apache.org/jira/browse/YARN-568
Project: Hadoop YARN
Issue Type: Sub-task
Components: scheduler
Reporter: Carlo Curino
Assignee: Carlo Curino
Attachments: YARN-568.patch, YARN-568.patch

In the attached
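A compressed sketch of the bookkeeping in points 1) and 2) may help. Assumptions: candidates arrive pre-sorted deterministically (priority first, then reverse allocation order), resources are counted in memory units, and all names are illustrative rather than actual FairScheduler code.

{code}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerId;

class PreemptionLoopSketch {
  private final Map<ContainerId, Long> warnedAt = new HashMap<ContainerId, Long>();
  private final long maxWaitTimeBeforeKill = 10000L; // ms; tunable

  // 'candidates' must arrive in a deterministic order (priority first, then
  // reverse allocation order) so successive runs pick the same containers.
  void preemptResources(List<Container> candidates, int toPreempt, long now) {
    for (Container c : candidates) {
      if (toPreempt <= 0) {
        break;
      }
      Long warned = warnedAt.get(c.getId());
      if (warned == null) {
        askAmToPreempt(c);                  // first, ask nicely
        warnedAt.put(c.getId(), now);
      } else if (now - warned > maxWaitTimeBeforeKill) {
        killContainer(c);                   // sustained need: force the kill
        warnedAt.remove(c.getId());
      }
      // Decrement in all three cases (newly warned, killed, and warned but
      // still within the grace window); otherwise the next invocation would
      // re-detect the same need and claim a second batch of containers for
      // the same resources.
      toPreempt -= c.getResource().getMemory();
    }
  }

  private void askAmToPreempt(Container c) { /* emit a preemption message to the AM */ }
  private void killContainer(Container c)  { /* emit a kill event */ }
}
{code}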
[jira] [Updated] (YARN-569) CapacityScheduler: support for preemption (using a capacity monitor)
[ https://issues.apache.org/jira/browse/YARN-569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated YARN-569:
Attachment: preemption.2.patch

Attaching a patch that contains WIP code to add preemption to the capacity scheduler. It was written in pre-DRF times. The approach is similar to the current efforts in having the logic in a separate thread, so most of the code should still apply easily. The approach differs in that it turns off reservations and also specifies where the preempted resources should go. Hopefully there will be something helpful in it to contribute to the efforts in this jira.

CapacityScheduler: support for preemption (using a capacity monitor)
Key: YARN-569
URL: https://issues.apache.org/jira/browse/YARN-569
Project: Hadoop YARN
Issue Type: Sub-task
Components: capacityscheduler
Reporter: Carlo Curino
Assignee: Carlo Curino
Attachments: 3queues.pdf, CapScheduler_with_preemption.pdf, preemption.2.patch, YARN-569.patch, YARN-569.patch

There is a tension between the fast-paced reactive role of the CapacityScheduler, which needs to respond quickly to applications' resource requests and node updates, and the more introspective, time-based considerations needed to observe and correct capacity balance. To this purpose, instead of hacking the delicate mechanisms of the CapacityScheduler directly, we opted to add support for preemption by means of a Capacity Monitor, which can optionally be run as a separate service (much like the NMLivelinessMonitor). The capacity monitor (similarly to equivalent functionality in the fair scheduler) runs on intervals (e.g., every 3 seconds), observes the state of the assignment of resources to queues by the capacity scheduler, performs off-line computation to determine whether preemption is needed and how best to edit the current schedule to improve capacity balance, and generates events that produce four possible actions:
# Container de-reservations
# Resource-based preemptions
# Container-based preemptions
# Container killing
The actions listed above are progressively more costly, and it is up to the policy to use them as desired to achieve the rebalancing goals. Note that, due to the lag in the effect of these actions, the policy should operate at the macroscopic level (e.g., preempt tens of containers from a queue) and not try to tightly and consistently micromanage container allocations.
- Preemption policy (ProportionalCapacityPreemptionPolicy): -
Preemption policies are by design pluggable; in the following we present an initial policy (ProportionalCapacityPreemptionPolicy) we have been experimenting with.
The ProportionalCapacityPreemptionPolicy behaves as follows:
# it gathers from the scheduler the state of the queues, in particular their current capacity, guaranteed capacity, and pending requests (*)
# if there are pending requests from queues that are under capacity, it computes a new ideal balanced state (**)
# it computes the set of preemptions needed to repair the current schedule and achieve capacity balance (accounting for natural completion rates, and respecting bounds on the amount of preemption we allow in each round)
# it selects which applications to preempt from each over-capacity queue (the last one in the FIFO order)
# it removes reservations from the most recently assigned app until the amount of resources to reclaim is obtained, or until no more reservations exist
# (if not enough) it issues preemptions for containers from the same application (in reverse chronological order, last assigned container first), again until enough is reclaimed or until no containers except the AM container are left
# (if not enough) it moves on to unreserving and preempting from the next application
# containers that have been asked to preempt are tracked across executions; if a container is among the ones to be preempted for more than a certain time, it is moved into the list of containers to be forcibly killed
Notes:
(*) at the moment, in order to avoid double-counting of the requests, we only look at the ANY part of pending resource requests, which means we might not preempt on behalf of AMs that ask only for specific locations but not ANY.
(**) The ideal balanced state is one in which each queue has at least its guaranteed capacity, and the spare capacity is distributed among the queues that want more as a weighted fair share, where the weighting is based on the guaranteed capacity of a queue; the computation runs to a fixed point (a sketch of this computation follows below).
Tunables of the ProportionalCapacityPreemptionPolicy:
# observe-only mode (i.e., log the actions it would take, but behave as read-only)
# how frequently to run the policy
# how long to wait between preemption and kill of a container
# which fraction of the containers I would like to obtain should I preempt (has to do with the natural
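As a concrete illustration of the (**) step, here is a minimal sketch, assuming capacities and demands are plain doubles and demand = current usage + pending: spare capacity is repeatedly handed out to unsatisfied queues in proportion to their guarantees, capped at their demand, until a fixed point is reached. Names and structure are illustrative, not the actual policy code.

{code}
// Illustrative fixed-point computation for the (**) ideal balanced state.
// guaranteed[i] = queue i's guaranteed capacity; demand[i] = used + pending.
class IdealBalanceSketch {
  static double[] computeIdealAssignment(double total, double[] guaranteed, double[] demand) {
    int n = guaranteed.length;
    double[] ideal = new double[n];
    boolean[] satisfied = new boolean[n];
    double spare = total;
    while (spare > 1e-9) {
      // Weight the still-hungry queues by their guaranteed capacity.
      double weightSum = 0;
      for (int i = 0; i < n; i++) {
        if (!satisfied[i]) weightSum += guaranteed[i];
      }
      if (weightSum <= 0) break;        // everyone is satisfied; leftover stays idle
      double handedOut = 0;
      for (int i = 0; i < n; i++) {
        if (satisfied[i]) continue;
        double share = spare * guaranteed[i] / weightSum;
        double next = Math.min(ideal[i] + share, demand[i]);
        handedOut += next - ideal[i];
        ideal[i] = next;
        if (next >= demand[i]) satisfied[i] = true; // capped at its demand
      }
      spare -= handedOut;
      if (handedOut <= 1e-9) break;     // fixed point reached
    }
    // The per-queue preemption target is then max(0, used[i] - ideal[i]),
    // clipped by the per-round preemption bound.
    return ideal;
  }
}
{code}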
[jira] [Commented] (YARN-569) CapacityScheduler: support for preemption (using a capacity monitor)
[ https://issues.apache.org/jira/browse/YARN-569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13649235#comment-13649235 ] Carlo Curino commented on YARN-569:
---
Thanks Bikas, we will look into it and see whether we can integrate your ideas straight into the patch, or at least set things up to prepare the ground for a future version of this that leverages your work.