[jira] [Updated] (YARN-422) Add NM client library

2013-05-04 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-422:
-

Attachment: YARN-422.3.patch

In the newest patch, there are the following significant changes:

1. NMCommunicator closes the RPC proxy every time an interaction with 
NodeManager is finished.

2. No strict order of starting/querying/stopping a container is enforced. 
Because of this, NMClient#getContainerStatus and NMClient#stopContainer 
need to be changed to take two more params, i.e., NodeId and ContainerToken. 
These will be used to start the RPC proxy (previously startContainer had to be 
called first, so that this information was already stored when the subsequent 
interactions were invoked).

3. Due to the stateless session, NMClientImpl no longer needs to keep an 
NMCommunicator instance for each started container. However, NMClientImpl 
still needs to remember which containers have not been stopped. The alive 
containers need to be stopped when NMClientImpl stops; otherwise, they may not 
be stoppable. (A rough sketch of this bookkeeping appears at the end of this 
comment.)

4. CallbackHandler distinguishes the handlers for exceptions happening under 
each type of interaction with a container. Therefore, the event type no longer 
needs to be exposed to the public.

In addition, I've addressed the code refactoring issues mentioned in Vinod's 
comments, and modified the test cases.
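
To make point 3 concrete, here is a rough, illustrative sketch of that 
bookkeeping; all class and field names are placeholders of mine, not the actual 
types in the patch:

{code:java}
// Illustrative sketch only: with stateless sessions there is no per-container
// NMCommunicator to cache, but the client still remembers which containers it
// started and has not yet stopped, so they can be stopped when the client stops.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class StartedContainerTracker {

  /** Connection details kept per started container, enough to recreate an RPC
   *  proxy on demand (node address and container token). */
  static final class StartedContainer {
    final String containerId;
    final String nodeAddress;
    final byte[] containerToken;

    StartedContainer(String containerId, String nodeAddress, byte[] containerToken) {
      this.containerId = containerId;
      this.nodeAddress = nodeAddress;
      this.containerToken = containerToken;
    }
  }

  private final Map<String, StartedContainer> aliveContainers =
      new ConcurrentHashMap<String, StartedContainer>();

  void onContainerStarted(StartedContainer c) {
    aliveContainers.put(c.containerId, c);
  }

  void onContainerStopped(String containerId) {
    aliveContainers.remove(containerId);
  }

  /** Called when the client library itself is stopped: stop whatever remains. */
  void stopAllAliveContainers(ContainerStopper stopper) {
    for (StartedContainer c : aliveContainers.values()) {
      stopper.stop(c);
    }
    aliveContainers.clear();
  }

  /** Stand-in for the actual stop-container call. */
  interface ContainerStopper {
    void stop(StartedContainer container);
  }
}
{code}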

 Add NM client library
 -

 Key: YARN-422
 URL: https://issues.apache.org/jira/browse/YARN-422
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Bikas Saha
Assignee: Zhijie Shen
 Attachments: AMNMClient_Defination.txt, 
 AMNMClient_Definition_Updated_With_Tests.txt, proposal_v1.pdf, 
 YARN-422.1.patch, YARN-422.2.patch, YARN-422.3.patch


 Create a simple wrapper over the ContainerManager protocol to hide the 
 details of the protocol implementation.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-422) Add NM client library

2013-05-04 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13649119#comment-13649119
 ] 

Bikas Saha commented on YARN-422:
-

Is it necessary for the library to stop all containers before stopping itself? 
I don't think that is the semantics of the protocol, and it should not be 
enforced by the library. I can easily see cases in which clients start a bunch 
of long-running containers and go away.



[jira] [Commented] (YARN-568) FairScheduler: support for work-preserving preemption

2013-05-04 Thread Carlo Curino (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13649126#comment-13649126
 ] 

Carlo Curino commented on YARN-568:
---

Sandy, I agree with your summary of the FS mechanics, and you raise important 
questions that I try to address below. 

The idea behind the preemption we are introducing is to preempt first and kill 
later, to allow the AM to save its work before the kill (in the CS we go a step 
further and let the AM pick the containers, but that is a bit trickier, so I 
would leave it out for the time being). This requires us to be consistent in 
how we pick the containers: first ask nicely, and then kill the same containers 
if the AM is ignoring us or being too slow. This is needed to give the AM a 
consistent view of the RM's needs. Assuming we are being consistent in picking 
containers, I think the simple mechanics we posted should be ok.

Now how can we get there:

1) This translates into a deterministic choice of containers across invocations 
of the preemption procedure. Sorting by priority is a first step in that 
direction (although, as I commented [here | 
https://issues.apache.org/jira/browse/YARN-569?focusedCommentId=13638825&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13638825],
 there are some other issues with that). Adding reverse-container-ordering 
might help guarantee the picking order is consistent (missing now). In 
particular, if the need for preemption is consistent over time, no new 
containers would be granted to this app, so picking from the tail should 
yield a consistent set of containers (minus the ones naturally expiring, which 
would be accounted for in future runs as a reduced preemption need). On the 
other hand, if the cluster conditions change drastically enough (e.g., a big 
job finishes) and there is no more need to kill some containers from this app, 
we save the cost of kill-and-reschedule. In a sense, instead of looking at an 
instantaneous need for preemption every 15 sec, we check every 5 sec and only 
kill when there is a sustained need over a window of maxWaitTimeBeforeKill. I 
think that if we can get this to work as intended we would get a better 
overall policy (less jitter).
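
To illustrate the kind of deterministic ordering described in point 1 (this is 
a sketch of mine with made-up fields, not the FairScheduler's actual data 
structures), the pick order could be expressed as: least important priority 
first, and within a priority the most recently allocated container first:

{code:java}
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Sketch of a stable preemption pick order: containers are compared by priority
// first and then by descending container id (a stand-in for allocation order),
// so repeated runs of the preemption procedure keep selecting from the same
// tail of recently allocated, least important containers.
final class PreemptionOrder {

  // Illustrative candidate: a lower "priority" value means more important.
  static final class Candidate {
    final int priority;
    final long containerId; // assumed to grow monotonically with allocation time
    Candidate(int priority, long containerId) {
      this.priority = priority;
      this.containerId = containerId;
    }
  }

  static final Comparator<Candidate> PICK_FIRST = new Comparator<Candidate>() {
    @Override
    public int compare(Candidate a, Candidate b) {
      // Least important (numerically larger priority) containers come first...
      if (a.priority != b.priority) {
        return Integer.compare(b.priority, a.priority);
      }
      // ...and within the same priority, the newest allocation comes first.
      return Long.compare(b.containerId, a.containerId);
    }
  };

  static void sortForPreemption(List<Candidate> candidates) {
    Collections.sort(candidates, PICK_FIRST);
  }
}
{code}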

2) toPreempt is decremented in all three cases because we would otherwise 
double-kill for the same resource need: imagine you want 5 containers back and 
send the corresponding preemption requests; while the AMs are working on 
preemption, the preemption procedure is called again and re-detects that we 
want 5 containers back. If you don't account for the pending requests (i.e., 
decrementing toPreempt for those too) you would pick (preempt or kill) another 
5 containers (and depending on the time constants this could happen more than 
twice)... now we are forcing the AMs to release 10 (or more) containers for a 
5-container preemption need. Anyway, I agree that once we converge on this we 
should comment it clearly in the code; this seems the kind of code that people 
would try to fix :-). The shift you spotted with this comment is from running 
rarely enough that all the actions initiated during a previous run are fully 
reflected in the current cluster state, to running frequently enough that the 
actions we are taking might not be visible yet. This forces us to do some more 
bookkeeping and have robust heuristics, but I think it is worth the improvement 
in the scheduler behavior.
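
A tiny sketch of the accounting in point 2 (again with placeholder names, not 
the code in the patch): what was already asked for in earlier runs is charged 
against the newly detected need, so a sustained need of 5 containers does not 
lead to 10 being selected:

{code:java}
// Illustrative bookkeeping only: "outstandingAsked" counts resources for which
// preemption requests are already pending; only the difference with the newly
// detected need is selected in the current run, avoiding double-kills.
final class PreemptionAccounting {

  private long outstandingAsked;

  /** Resources to select in this run, after crediting pending requests. */
  long remainingToSelect(long detectedNeed) {
    return Math.max(0, detectedNeed - outstandingAsked);
  }

  /** Called when a new preemption request is sent to an AM. */
  void onPreemptionRequested(long amount) {
    outstandingAsked += amount;
  }

  /** Called when a requested container is released, killed, or the request expires. */
  void onRequestResolved(long amount) {
    outstandingAsked = Math.max(0, outstandingAsked - amount);
  }
}
{code}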

3) It is probably good to have a no-preemption mode in which we simply kill 
outright. However, by setting the time constants right (e.g., 
preemptionInterval to 5 sec and maxWaitTimeBeforeKill to 10 sec) you would get 
the same effect of a hard kill at most 15 sec after there is a need for 
preemption, while for every preemption-aware AM we could save the progress made 
so far. In our current MR implementation of preemption, you might get 
containers back even faster, as we release containers once we are done 
checkpointing. Note that since we are not actually killing at every 
preemptionInterval, we could set that very low (if the performance of the FS 
allows it) and get more points of observation and faster reaction times, while 
maxWaitTimeBeforeKill would be tuned as a tradeoff between giving the AM enough 
time to preempt and the speed of rebalancing.
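
As a rough sketch of the preempt-then-kill timing in point 3 (names and 
structure are mine, not the actual FairScheduler code): with a 5 sec 
preemptionInterval and a 10 sec maxWaitTimeBeforeKill, a container that stays 
selected across runs is hard-killed roughly 10-15 sec after the need first 
appears, while a need that disappears in the meantime never causes a kill:

{code:java}
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

// Illustrative tracker: a container is only hard-killed after it has been
// continuously marked for preemption for at least maxWaitTimeBeforeKill.
final class PreemptThenKillTracker {

  private final long maxWaitTimeBeforeKillMs;
  // containerId -> time of the first (still current) preemption request
  private final Map<String, Long> firstAskedAt = new HashMap<String, Long>();

  PreemptThenKillTracker(long maxWaitTimeBeforeKillMs) {
    this.maxWaitTimeBeforeKillMs = maxWaitTimeBeforeKillMs;
  }

  /** Called on each monitor run for every container selected this round; returns
   *  true when the container has waited long enough and should be killed. */
  boolean markAndCheckKill(String containerId, long nowMs) {
    Long first = firstAskedAt.get(containerId);
    if (first == null) {
      firstAskedAt.put(containerId, nowMs);
      return false; // first ask: send the preemption request, do not kill yet
    }
    return nowMs - first >= maxWaitTimeBeforeKillMs;
  }

  /** Containers no longer selected (need disappeared or container finished)
   *  are forgotten so they will not be killed later. */
  void retainOnlySelected(Set<String> selectedThisRound) {
    Iterator<Map.Entry<String, Long>> it = firstAskedAt.entrySet().iterator();
    while (it.hasNext()) {
      if (!selectedThisRound.contains(it.next().getKey())) {
        it.remove();
      }
    }
  }
}
{code}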

I will look into adding the allocation order as a second-level ordering for 
containers. Please let me know whether this seems enough or whether I am 
missing something.


 FairScheduler: support for work-preserving preemption 
 --

 Key: YARN-568
 URL: https://issues.apache.org/jira/browse/YARN-568
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: scheduler
Reporter: Carlo Curino
Assignee: Carlo Curino
 Attachments: YARN-568.patch, YARN-568.patch


 In the attached 

[jira] [Updated] (YARN-569) CapacityScheduler: support for preemption (using a capacity monitor)

2013-05-04 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated YARN-569:


Attachment: preemption.2.patch

Attaching a patch that contains WIP code to add preemption to the capacity 
scheduler. It was written in pre-DRF times. The approach is similar to the 
current efforts in having the logic in a separate thread, so most of the code 
should still apply easily. The approach differs in that it turns off 
reservations and also specifies where the preempted resources should go. 
Hopefully there will be something helpful in it to contribute to the efforts 
in this jira.

[jira] [Commented] (YARN-569) CapacityScheduler: support for preemption (using a capacity monitor)

2013-05-04 Thread Carlo Curino (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13649235#comment-13649235
 ] 

Carlo Curino commented on YARN-569:
---

Thanks Bikas, we will look into it and see whether we can integrate your ideas 
straight into the patch, or at least set things up to prepare the ground for a 
future version of this that leverages your work.


 CapacityScheduler: support for preemption (using a capacity monitor)
 

 Key: YARN-569
 URL: https://issues.apache.org/jira/browse/YARN-569
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Reporter: Carlo Curino
Assignee: Carlo Curino
 Attachments: 3queues.pdf, CapScheduler_with_preemption.pdf, 
 preemption.2.patch, YARN-569.patch, YARN-569.patch


 There is a tension between the fast-paced, reactive role of the 
 CapacityScheduler, which needs to respond quickly to applications' resource 
 requests and node updates, and the more introspective, time-based 
 considerations needed to observe and correct for capacity balance. To this 
 purpose, instead of hacking the delicate mechanisms of the CapacityScheduler 
 directly, we opted to add support for preemption by means of a Capacity 
 Monitor, which can be run optionally as a separate service (much like the 
 NMLivelinessMonitor).
 The capacity monitor (similarly to equivalent functionality in the fair 
 scheduler) runs on intervals (e.g., every 3 seconds): it observes the state of 
 the assignment of resources to queues from the capacity scheduler, performs 
 off-line computation to determine whether preemption is needed and how best to 
 edit the current schedule to improve capacity, and generates events that 
 produce four possible actions:
 # Container de-reservations
 # Resource-based preemptions
 # Container-based preemptions
 # Container killing
 The actions listed above are progressively more costly, and it is up to the 
 policy to use them as desired to achieve the rebalancing goals. 
 Note that due to the lag in the effect of these actions, the policy should 
 operate at the macroscopic level (e.g., preempt tens of containers 
 from a queue) and not try to tightly and consistently micromanage 
 container allocations.
 - Preemption policy  (ProportionalCapacityPreemptionPolicy): 
 - 
 Preemption policies are by design pluggable; in the following we present an 
 initial policy (ProportionalCapacityPreemptionPolicy) we have been 
 experimenting with. The ProportionalCapacityPreemptionPolicy behaves as 
 follows:
 # it gathers from the scheduler the state of the queues, in particular their 
 current capacity, guaranteed capacity and pending requests (*)
 # if there are pending requests from queues that are under capacity, it 
 computes a new ideal balanced state (**)
 # it computes the set of preemptions needed to repair the current schedule 
 and achieve capacity balance (accounting for natural completion rates, and 
 respecting bounds on the amount of preemption we allow for each round)
 # it selects which applications to preempt from each over-capacity queue (the 
 last one in the FIFO order)
 # it removes reservations from the most recently assigned app until the amount 
 of resources to reclaim is obtained, or until no more reservations exist
 # (if not enough) it issues preemptions for containers from the same 
 applications (in reverse chronological order, last assigned container first), 
 again until the necessary amount is reached or until no containers except the 
 AM container are left
 # (if not enough) it moves on to unreserve and preempt from the next 
 application
 # containers that have been asked to be preempted are tracked across 
 executions; if a container is among the ones to be preempted for more than a 
 certain time, it is moved to the list of containers to be forcibly killed
 Notes:
 (*) at the moment, in order to avoid double-counting of the requests, we only 
 look at the ANY part of pending resource requests, which means we might not 
 preempt on behalf of AMs that ask only for specific locations but not ANY. 
 (**) The ideal balanced state is one in which each queue has at least its 
 guaranteed capacity, and the spare capacity is distributed among queues (that 
 want some) as a weighted fair share, where the weighting is based on the 
 guaranteed capacity of a queue and the computation runs to a fixed point.
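
For illustration, one possible reading of the (**) computation as code (a 
sketch under my own assumptions about the inputs, not the actual 
ProportionalCapacityPreemptionPolicy implementation): every queue first gets 
min(guaranteed, wanted), and the leftover capacity is repeatedly split among 
the still-hungry queues in proportion to their guarantees until the assignment 
stops changing:

{code:java}
// Sketch of the ideal-balance fixed point: guaranteed shares first, then the
// spare capacity is distributed as a weighted fair share among queues that
// still want more, iterating until nothing changes.
final class IdealCapacitySketch {

  // guaranteed[i]: queue i's guaranteed capacity; wanted[i]: used + pending demand.
  // Returns the ideal assignment per queue; "total" is the cluster capacity.
  static double[] computeIdealAssignment(double total, double[] guaranteed, double[] wanted) {
    int n = guaranteed.length;
    double[] ideal = new double[n];
    for (int i = 0; i < n; i++) {
      ideal[i] = Math.min(guaranteed[i], wanted[i]); // guaranteed share first
    }
    double unassigned = total;
    for (double v : ideal) {
      unassigned -= v;
    }
    boolean changed = true;
    while (changed && unassigned > 1e-9) {
      changed = false;
      // Weight the still-hungry queues by their guaranteed capacity.
      double totalWeight = 0;
      for (int i = 0; i < n; i++) {
        if (wanted[i] > ideal[i]) {
          totalWeight += guaranteed[i];
        }
      }
      if (totalWeight <= 0) {
        break;
      }
      double newlyAssigned = 0;
      for (int i = 0; i < n; i++) {
        if (wanted[i] > ideal[i]) {
          double offer = unassigned * guaranteed[i] / totalWeight;
          double take = Math.min(offer, wanted[i] - ideal[i]);
          if (take > 0) {
            ideal[i] += take;
            newlyAssigned += take;
            changed = true;
          }
        }
      }
      unassigned -= newlyAssigned; // leftover is redistributed next iteration
    }
    return ideal;
  }
}
{code}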
 Tunables of the ProportionalCapacityPreemptionPolicy:
 # observe-only mode (i.e., log the actions it would take, but behave as 
 read-only)
 # how frequently to run the policy
 # how long to wait between preemption and kill of a container
 # which fraction of the containers I would like to obtain should I preempt 
 (has to do with the natural