[jira] [Commented] (YARN-914) Support graceful decommission of nodemanager

2015-02-19 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327610#comment-14327610
 ] 

Junping Du commented on YARN-914:
-

Break down this feature into sub-JIRAs.

> Support graceful decommission of nodemanager
> 
>
> Key: YARN-914
> URL: https://issues.apache.org/jira/browse/YARN-914
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.0.4-alpha
>Reporter: Luke Lu
>Assignee: Junping Du
> Attachments: Gracefully Decommission of NodeManager (v1).pdf, 
> Gracefully Decommission of NodeManager (v2).pdf, 
> GracefullyDecommissionofNodeManagerv3.pdf
>
>
> When NMs are decommissioned for non-fault reasons (capacity change, etc.), 
> it's desirable to minimize the impact on running applications.
> Currently, if an NM is decommissioned, all running containers on the NM need to 
> be rescheduled on other NMs. Furthermore, for finished map tasks, if their 
> map outputs are not fetched by the reducers of the job, these map tasks will 
> need to be rerun as well.
> We propose to introduce a mechanism to optionally gracefully decommission a 
> node manager.





[jira] [Commented] (YARN-914) Support graceful decommission of nodemanager

2015-02-17 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324404#comment-14324404
 ] 

Jason Lowe commented on YARN-914:
-

bq.  what's the benefit of step 1 over decommission nodes directly after 
timeout?

We don't have to notify AMs if we want to keep things simpler.  However, we 
already support preempting (i.e., killing) specific containers via 
StrictPreemptionContract, so it seems straightforward to allow the AMs to be a 
bit more proactive.  Note that we'd still need a timeout to give them time to 
respond, so the decomm would be two phases, the first where we're simply 
waiting for containers to complete on their own, and the second where we notify 
AMs about imminent preemption and give them a little bit of time to react 
before forcibly killing any remaining containers.  The advantage of adding the 
preemption-with-explicit-grace-period feature is that we don't need two 
separate timeout phases.  Without the feature, telling AMs too early that their 
containers are going away might make them do something expensive/drastic when 
the container is going to complete on its own in a few more minutes.  Letting 
them know the deadline explicitly lets them make the call of whether to do 
anything or let it ride.
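
For reference, a rough sketch of the AM side of the existing strict-preemption path (illustrative only, not from any patch; the explicit grace period discussed above is not modeled since that annotation does not exist yet):

{code:java}
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.PreemptionContainer;
import org.apache.hadoop.yarn.api.records.PreemptionMessage;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class StrictPreemptionCheck {

  /** Returns the ids of containers the RM says it will forcibly reclaim. */
  public static Set<ContainerId> containersToBeReclaimed(
      AMRMClient<ContainerRequest> amRmClient, float progress) throws Exception {
    Set<ContainerId> doomed = new HashSet<ContainerId>();
    AllocateResponse response = amRmClient.allocate(progress);
    PreemptionMessage msg = response.getPreemptionMessage();
    if (msg != null && msg.getStrictContract() != null) {
      for (PreemptionContainer pc : msg.getStrictContract().getContainers()) {
        // The AM can checkpoint, reschedule, or simply let these finish/die.
        doomed.add(pc.getId());
      }
    }
    return doomed;
  }
}
{code}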

bq.  If there is benefit, why we don't do this today when decommission nodes?

Because today's decommission is instantaneous and not graceful, and fixing that 
is the point of this JIRA. ;-)



[jira] [Commented] (YARN-914) Support graceful decommission of nodemanager

2015-02-17 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324390#comment-14324390
 ] 

Junping Du commented on YARN-914:
-

bq. The main point I'm trying to make here is that we shouldn't be worrying too 
much about long-running services right now. 
Agreed, especially since the discussion above pushed timeout tracking out of the 
YARN core. The new CLI will track the time (configurable per operation) and send 
a forced decommission after the timeout. We could also notify the AM of the NM's 
decommissioning (and timeout), though that would be more complicated. 

bq.  In the short-term I think we just go with a configurable decomm timeout 
and AM notification via strict preemption as the timeout expires. If we want to 
get a bit fancier, we can annotate the strict preemption with a timeout so the 
AM knows approximately when the preemption will occur.
Ok. My understanding is that we have two steps here: 1. notify the AM via strict 
preemption after the timeout; 2. notify the AM via flexible preemption, with a 
tolerated timeout, when decommissioning starts. A quick question: what's the 
benefit of step 1 over decommissioning nodes directly after the timeout? And if 
there is a benefit, why don't we do this today when decommissioning nodes?

bq. With that feature we would notify AMs as soon as the node is marked for 
decomm that their containers will be forcibly preempted (i.e.: killed) in X 
minutes, and it's up to each AM to decide whether to do anything about it or if 
their containers on that node will complete within that time naturally. With 
that setup we don't have to special-case LRS apps or anything like that, as 
we're telling the apps ASAP the decomm is happening and giving them time to 
deal with it, LRS or not.
Makes sense. Sounds like there is already a sub-JIRA being created, and we can 
extend it to include a timeout.



[jira] [Commented] (YARN-914) Support graceful decommission of nodemanager

2015-02-17 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324277#comment-14324277
 ] 

Jason Lowe commented on YARN-914:
-

bq. I think prediction of expected runtime of containers could be hard in YARN 
case. However, can we typically say long running service containers are 
expected to run very long or infinite? If so, notifying AM to preempt 
containers of LRS make more sense here than waiting here for timeout. Isn't it? 

The main point I'm trying to make here is that we shouldn't be worrying too 
much about long-running services right now.  YARN doesn't even know which are 
which yet, and without any kind of container lifespan prediction there's no way 
to know whether a container will finish within the decomm timeout window or 
not.  YARN knowing which apps are LRS is a primitive form of container lifespan 
prediction (i.e.: LRS = containers run forever).  We will have the same 
problems with apps that aren't LRS but have containers that can run for a 
"long" time, where "long" is larger than the decomm timeout.  That's why I'm 
not convinced it makes sense to do anything special for LRS apps vs. other apps.

In the short-term I think we just go with a configurable decomm timeout and AM 
notification via strict preemption as the timeout expires.  If we want to get a 
bit fancier, we can annotate the strict preemption with a timeout so the AM 
knows approximately _when_ the preemption will occur.  With that feature we 
would notify AMs as soon as the node is marked for decomm that their containers 
will be forcibly preempted (i.e.: killed) in X minutes, and it's up to each AM 
to decide whether to do anything about it or if their containers on that node 
will complete within that time naturally.  With that setup we don't have to 
special-case LRS apps or anything like that, as we're telling the apps ASAP the 
decomm is happening and giving them time to deal with it, LRS or not.
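
If we did annotate the preemption with a deadline, the AM-side policy could be as trivial as the sketch below (purely illustrative; the deadline parameter is the proposed annotation, not an existing API, and the finish estimate is whatever heuristic the AM already has):

{code:java}
public class DecommissionDeadlinePolicy {

  /**
   * @param killDeadlineMillis    absolute time at which the RM says containers on
   *                              the decommissioning node will be killed (the
   *                              proposed annotation; hypothetical, not a real API)
   * @param estimatedFinishMillis the AM's own estimate of when its work on that
   *                              node will complete naturally
   * @return true if the AM should react now (checkpoint, reschedule, etc.)
   */
  public static boolean shouldReactNow(long killDeadlineMillis, long estimatedFinishMillis) {
    // Let the containers ride if they will finish before the forced kill;
    // otherwise act proactively rather than losing the work at the deadline.
    return estimatedFinishMillis >= killDeadlineMillis;
  }
}
{code}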



[jira] [Commented] (YARN-914) Support graceful decommission of nodemanager

2015-02-16 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323124#comment-14323124
 ] 

Junping Du commented on YARN-914:
-

Thanks [~jlowe] for review and comments!
bq. Nit: How about DECOMMISSIONING instead of DECOMMISSION_IN_PROGRESS?
Sounds good. Will update it later.

bq. We should remove its available (not total) resources from the cluster then 
continue to remove available resources as containers complete on that node. 
That's a very good point. Yes, we should update the resources that way.

bq. As for the UI changes, initial thought is that decommissioning nodes should 
still show up in the active nodes list since they are still running containers. 
A separate decommissioning tab to filter for those nodes would be nice, 
although I suppose users can also just use the jquery table to sort/search for 
nodes in that state from the active nodes list if it's too crowded to add yet 
another node state tab (or maybe get rid of some effectively dead tabs like the 
reboot state tab).
Makes sense. I will add this to the proposal, and we can discuss more UI details in the UI JIRA later.

bq. For the NM restart open question, this should no longer an issue now that 
the NM is unaware of graceful decommission.
Right.

bq. For the AM dealing with being notified of decommissioning, again I think 
this should just be treated like a strict preemption for the short term. IMHO 
all the AM needs to know is that the RM is planning on taking away those 
containers, and what the AM should do about it is similar whether the reason 
for removal is preemption or decommissioning.


bq. Back to the long running services delaying decommissioning concern, does 
YARN even know the difference between a long-running container and a "normal" 
container? 
I'm afraid not, at least for now. YARN-1039 should be a start toward making that differentiation.

bq. If it doesn't, how is it supposed to know a container is not going to 
complete anytime soon? Even a "normal" container could run for many hours. It 
seems to me the first thing we would need before worrying about this scenario 
is the ability for YARN to know/predict the expected runtime of containers.
I think predicting the expected runtime of containers could be hard in the YARN 
case. However, can we typically say that long-running service containers are 
expected to run very long, or even forever? If so, notifying the AM to preempt 
the LRS containers makes more sense than waiting for the timeout, doesn't it? 

bq. There's still an open question about tracking the timeout RM side instead 
of NM side. Sounds like the NM side is not going to be pursued at this point, 
and we're going with no built-in timeout support in YARN for the short-term.
That was unclear at the beginning of the discussion but is much clearer now; I 
will remove this part.



[jira] [Commented] (YARN-914) Support graceful decommission of nodemanager

2015-02-11 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316980#comment-14316980
 ] 

Jason Lowe commented on YARN-914:
-

Thanks for updating the doc, Junping.  Additional comments:

Nit: How about DECOMMISSIONING instead of DECOMMISSION_IN_PROGRESS?

The design says when a node starts decommissioning we will remove its resources 
from the cluster, but that's not really the case, correct?  We should remove 
its available (not total) resources from the cluster then continue to remove 
available resources as containers complete on that node.  Failing to do so will 
result in weird metrics like more resources running on the cluster than the 
cluster says it has, etc.
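
A tiny sketch of the intended bookkeeping (illustrative only, not the actual scheduler/metrics code): on entering the decommissioning state only the node's currently available resources leave the cluster total, and the rest is removed as containers finish.

{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class DecommissioningAccounting {

  private final Resource clusterTotal;

  public DecommissioningAccounting(Resource clusterTotal) {
    this.clusterTotal = clusterTotal;
  }

  /** Called when the node transitions RUNNING -> DECOMMISSIONING. */
  public void onDecommissioningStart(Resource nodeTotal, Resource nodeUsed) {
    // Remove only what is currently unused; used resources stay counted so
    // "used" can never exceed the advertised cluster capacity.
    Resource available = Resources.subtract(nodeTotal, nodeUsed);
    Resources.subtractFrom(clusterTotal, available);
  }

  /** Called each time a container completes on the decommissioning node. */
  public void onContainerFinished(Resource released) {
    // The freed capacity is never re-offered on this node, so drop it too.
    Resources.subtractFrom(clusterTotal, released);
  }
}
{code}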

Are we only going to support graceful decommission via updates to the 
include/exclude files and refresh?  Not needed for the initial cut, but 
thinking of a couple of use-cases and curious what others thought:
* Would be convenient to have an rmadmin command that does this in one step, 
especially for a single-node.  Arguably if we are persisting cluster nodes in 
the state store we can migrate the list there, and the include/exclude list 
simply become convenient ways to batch-update the cluster state.
* Will NMs be able to request a graceful decommission via their health check 
script?  There have been some cases in the past where it would have been nice 
for the NM to request a ramp-down on containers but not instantly kill all of 
them with an UNHEALTHY report.

As for the UI changes, initial thought is that decommissioning nodes should 
still show up in the active nodes list since they are still running containers. 
 A separate decommissioning tab to filter for those nodes would be nice, 
although I suppose users can also just use the jquery table to sort/search for 
nodes in that state from the active nodes list if it's too crowded to add yet 
another node state tab (or maybe get rid of some effectively dead tabs like the 
reboot state tab).

For the NM restart open question, this should no longer be an issue now that the 
NM is unaware of graceful decommission.  All the RM needs to do is ensure that a 
node that is rejoining the cluster when the RM thought it was already part of 
it retains its previous running/decommissioning state.  That way if an NM is 
decommissioning before the restart it will continue to decommission after it 
restarts.

For the AM dealing with being notified of decommissioning, again I think this 
should just be treated like a strict preemption for the short term.  IMHO all 
the AM needs to know is that the RM is planning on taking away those 
containers, and what the AM should do about it is similar whether the reason 
for removal is preemption or decommissioning.

Back to the long running services delaying decommissioning concern, does YARN 
even know the difference between a long-running container and a "normal" 
container?  If it doesn't, how is it supposed to know a container is not going 
to complete anytime soon?  Even a "normal" container could run for many hours.  
It seems to me the first thing we would need before worrying about this 
scenario is the ability for YARN to know/predict the expected runtime of 
containers.

There's still an open question about tracking the timeout RM side instead of NM 
side.  Sounds like the NM side is not going to be pursued at this point, and 
we're going with no built-in timeout support in YARN for the short-term.



[jira] [Commented] (YARN-914) Support graceful decommission of nodemanager

2015-02-11 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316606#comment-14316606
 ] 

Junping Du commented on YARN-914:
-

bq. I do agree with Vinod that there should minimally be an easy way, CLI or 
otherwise, for outside scripts driving the decommission to either force it or 
wait for it to complete. If waiting, there also needs to be a way to either 
have the wait have a timeout which will force after that point or another 
method with which to easily kill the containers still on that node.
Makes sense. Sounds like most of us here agree to go with the second approach 
proposed by Ming and refined by Vinod.



[jira] [Commented] (YARN-914) Support graceful decommission of nodemanager

2015-02-10 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14314677#comment-14314677
 ] 

Jason Lowe commented on YARN-914:
-

bq. However, YARN-2567 is about threshold thing, may be a wrong JIRA number?

That's the right JIRA.  It's about waiting for a threshold number of nodes to 
report back in after the RM recovers, and the RM would need to persist the 
state about the nodes in the cluster to know what percentage of the old nodes 
have reported back in.

As for whether we should just provide hooks vs. making it much more of a 
turnkey solution, I'd be an advocate for initially seeing what we can do with 
hooks.  Based on what we learn from trying to do decommission that way, we can 
provide feedback into the process of making it a built-in, turnkey solution 
later.  I do agree with Vinod that there should minimally be an easy way, CLI 
or otherwise, for outside scripts driving the decommission to either force it 
or wait for it to complete.  If waiting, there also needs to be a way to either 
have the wait have a timeout which will force after that point or another 
method with which to easily kill the containers still on that node.



[jira] [Commented] (YARN-914) Support graceful decommission of nodemanager

2015-02-10 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14314653#comment-14314653
 ] 

Junping Du commented on YARN-914:
-

Thanks [~vinodkv] for comments!
bq. IAC, I think we should also have a CLI command to decommission the node 
which optionally waits till the decommission succeeds.
That sounds pretty good. This new CLI can simply decommission the related nodes 
"gracefully" and then, after a timeout, forcefully decommission any nodes that 
haven't finished. Compared with the external-script approach proposed by Ming 
above, this depends less on effort outside of Hadoop.  
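
Roughly, the CLI could behave like the sketch below. The graceful/forceful decommission steps are placeholders for refreshNodes-based admin operations that don't exist as single calls yet; only the wait-with-timeout loop uses existing client APIs.

{code:java}
import java.util.List;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class GracefulDecommissionDriver {

  public static void waitThenForce(Configuration conf, String host, long timeoutMs)
      throws Exception {
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(conf);
    yarn.start();
    try {
      // Placeholder: add the host to the exclude file and refresh nodes to
      // start the graceful decommission (no such single call exists today).

      long deadline = System.currentTimeMillis() + timeoutMs;
      while (System.currentTimeMillis() < deadline) {
        if (isDecommissioned(yarn, host)) {
          return; // node drained on its own before the deadline
        }
        TimeUnit.SECONDS.sleep(10);
      }
      // Placeholder: containers are still running after the grace period, so
      // fall back to today's immediate (forceful) decommission.
    } finally {
      yarn.stop();
    }
  }

  private static boolean isDecommissioned(YarnClient yarn, String host) throws Exception {
    List<NodeReport> reports = yarn.getNodeReports(NodeState.DECOMMISSIONED);
    for (NodeReport report : reports) {
      if (host.equals(report.getNodeId().getHost())) {
        return true;
      }
    }
    return false;
  }
}
{code}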

bq. Regarding storage of the decommission state, YARN-2567 also plans to make 
sure that the state of all nodes is maintained up to date on the state-store. 
That helps with many other cases too. We should combine these efforts.
That makes sense. However, YARN-2567 is about a threshold; maybe that's the 
wrong JIRA number?

bq. Regarding long running services, I think it makes sense to let the admin 
initiating the decommission know - not in terms of policy but as a diagnostic. 
Other than waiting for a timeout, the admin may not have noticed that a service 
is running on this node before the decommission is triggered.

bq. This is the umbrella concern I have. There are two ways to do this: Let 
YARN manage the decommission process or manage it on top of YARN. If the later 
is the approach, I don't see a lot to be done here besides YARN-291. No?
Agreed that the second approach takes less effort. Even so, we still need the RM 
to be aware of when containers/apps on a node have finished so it can trigger 
the NM shutdown, letting the decommission happen earlier (and at varying times 
per node), which I guess is important for upgrading a large cluster. Isn't it? 
For YARN-291, my understanding is that we no longer rely on any of the open 
issues left there, because we only need to set the NM's resource to 0 at 
runtime, which is already provided. BTW, I think the approach you just proposed 
above is "the 2nd approach + a new CLI", isn't it? I prefer to go this way but 
would like to hear other people's ideas here as well.



[jira] [Commented] (YARN-914) Support graceful decommission of nodemanager

2015-02-09 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14312677#comment-14312677
 ] 

Vinod Kumar Vavilapalli commented on YARN-914:
--

Is the decommission_timeout a server side config or specifiable for each 
decommission request? The current refreshNodes approach will not enable a per 
request config. IAC, I think we should also have a CLI command to decommission 
the node which optionally waits till the decommission succeeds.

Regarding storage of the decommission state, YARN-2567 also plans to make sure 
that the state of all nodes is maintained up to date on the state-store. That 
helps with many other cases too. We should combine these efforts. /cc [~jianhe]

Regarding long running services, I think it makes sense to let the admin 
initiating the decommission know - not in terms of policy but as a diagnostic. 
Other than waiting for a timeout, the admin may not have noticed that a service 
is running on this node before the decommission is triggered.

bq. Alternatively we can remove graceful decommission timeout for YARN layer 
and let external decommission script handle that. If the script considers the 
graceful decommission takes too long, it can ask YARN to do the immediate 
decommission.
This is the umbrella concern I have. There are two ways to do this: Let YARN 
manage the decommission process or manage it on top of YARN. If the latter is 
the approach, I don't see a lot to be done here besides YARN-291. No? 



[jira] [Commented] (YARN-914) Support graceful decommission of nodemanager

2015-02-09 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14312525#comment-14312525
 ] 

Junping Du commented on YARN-914:
-

Thanks for review and comments, [~xgong], [~jlowe] and [~mingma]!

bq. I believe this is about the configuration synchronization between multiple 
RM nodes. Please take a look at 
https://issues.apache.org/jira/browse/YARN-1666, and 
https://issues.apache.org/jira/browse/YARN-1611
Thanks for pointing this out. Sounds like we have already resolved most of the 
problem; good to know. :)

bq. Do we really need to handle the "LRS containers" and "short-term 
containers" differently? There are lots of different cases we need to take 
care. I think that we can just use the same way to handle both.
I haven't thought this through yet. IMO, the benefit of this feature is to 
provide a reasonable time window for running applications to get a chance to 
finish before the nodes get decommissioned. Given the endless lifecycle of LRS 
containers, I don't see the benefit of keeping LRS containers running until the 
timeout; it only delays the decommission process. Or do we assume the AM can 
react to LRS containers when it gets notified? Maybe for the first step we can 
treat LRS and non-LRS containers the same way to keep it simple, but I think we 
should keep an open mind on this.

bq. Maybe we need to track the timeout at RM side and NM side. RM can stop NM 
if the timeout is reached but it does not receive the "decommission complete" 
from NM.
Sounds reasonable, given that communication between the NM and RM can break. 
However, as Jason Lowe proposed below, we could track it only on the RM side. Thoughts?

bq. For transferring knowledge to the standby RM, we could persist the graceful 
decomm node list to the state store.
Yes. Sounds like most of the work is already done in YARN-1666 (decommission node 
list) and YARN-1611 (timeout value), as @Xuan mentioned above. The only work left 
here is to keep track of the start time of each decommissioning node, isn't it?

bq. I agree with Xuan that so far I don't see a need to treat LRS and normal 
containers separately. Either a container exits before the decommission timeout 
or it doesn't.
Just as we want the decommission to happen before the timeout if all containers 
and apps are finished, we don't want to spend unnecessary time delaying the 
decommission process, do we? However, it could be the other way around if we 
think the delay can help an LRS application. Anyway, as mentioned above, it 
should be fine to keep the same behavior for both for now, but I think we need 
to keep it in mind.  

bq. Just to be clear, the NM is already tracking which applications are active 
on a node and is reporting these to the RM on heartbeats (see NM context and 
NodeStatusUpdaterImpl appTokenKeepAliveMap). The DecommissionService doesn't 
need to explicitly track the apps itself as this is already being done.
Yes. The diagram includes not only the new components but also the existing 
ones. Thanks for the reminder, though. 

bq. As for doing this RM side or NM side, I think it can simplify things if we 
do this on the RM side. The RM already needs to know about graceful 
decommission to avoid scheduling new apps/containers on the node. Also the NM 
is heartbeating active apps back to the RM, so it's easy for the RM to track 
which apps are still active on a particular node. If the RMNodeImpl state 
machine sees that it's in the decommissioning state and all apps/containers 
have completed then it can transition to the decommissioned state. For timeouts 
the RM can simply set a timer-delivered event to the RMNode when the graceful 
decommission starts, and the RMNode can act accordingly when the timer event 
arrives, killing containers etc. Actually I'm not sure the NM needs to know 
about graceful decommission at all, which IMHO simplifies the design since only 
one daemon needs to participate and be knowledgeable of the feature. The NM 
would simply see the process as a reduction in container assignments until 
eventually containers are killed and the RM tells it that it's decommissioned.
That makes sense. In addition, I think the RMNodes don't even have to track time 
themselves (in the worst case, thousands of threads would need to check the 
time); we could have something like a DecommissionTimeoutMonitor derived from 
AbstractLivelinessMonitor. When it detects a timeout, it can send a 
decommission_timeout event to the RMNode so the node shutdown happens. Also, I 
agree that the NM may not need to be aware of this decommission_in_progress state.
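
Something along these lines, as a rough sketch (the class name, event wiring, and intervals are hypothetical; only AbstractLivelinessMonitor's register/expire contract is assumed):

{code:java}
import org.apache.hadoop.yarn.api.records.NodeId;
import org.apache.hadoop.yarn.util.AbstractLivelinessMonitor;
import org.apache.hadoop.yarn.util.SystemClock;

public class DecommissionTimeoutMonitor extends AbstractLivelinessMonitor<NodeId> {

  /** Callback invoked when a decommissioning node exceeds its grace period. */
  public interface TimeoutHandler {
    void onDecommissionTimeout(NodeId nodeId);
  }

  private final TimeoutHandler handler;

  public DecommissionTimeoutMonitor(int timeoutMs, TimeoutHandler handler) {
    super(DecommissionTimeoutMonitor.class.getName(), new SystemClock());
    this.handler = handler;
    setExpireInterval(timeoutMs);       // per-node grace period
    setMonitorInterval(timeoutMs / 10); // how often expirations are checked
  }

  @Override
  protected void expire(NodeId nodeId) {
    // The node did not drain in time; ask the RM to force the decommission,
    // e.g. by dispatching a (hypothetical) decommission_timeout event to the RMNode.
    handler.onDecommissionTimeout(nodeId);
  }

  // register(nodeId) would be called when decommissioning starts and
  // unregister(nodeId) when the node drains on its own before the deadline.
}
{code}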

bq. To clarify decomm node list, it appears there are two things, one is the 
decomm request list; another one is the run time state of the decomm nodes. 
From Xuan's comment it appears we want to put the request in HDFS and leverage 
FileSystemBasedConfigurationProvider to read it at run time. Given it is 
considered configuration, that seems a good fit. Jason mentioned the

[jira] [Commented] (YARN-914) Support graceful decommission of nodemanager

2015-02-05 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307724#comment-14307724
 ] 

Ming Ma commented on YARN-914:
--

I agree with Jason. It is easier if the NM doesn't need to know about decommission. 
There is a scalability concern that Junping might have brought up, but it 
shouldn't be an issue.

To clarify the decomm node list: it appears there are two things, one is the decomm 
request list; the other is the runtime state of the decomm nodes. From 
Xuan's comment it appears we want to put the request in HDFS and leverage 
FileSystemBasedConfigurationProvider to read it at runtime. Given that it is 
considered configuration, that seems a good fit. Jason mentioned the state 
store; that can be used to track the runtime state of the decomm. This is 
necessary given that we plan to introduce a timeout for graceful decommission. 
However, if we assume ResourceOption's overcommitTimeout state is stored in the 
state store for the RM failover case as part of YARN-291, then the new active RM 
can just replay the state transition. If so, it seems we don't need to persist 
the decomm runtime state to the state store.

Alternatively, we can remove the graceful decommission timeout from the YARN 
layer and let an external decommission script handle it. If the script considers the 
graceful decommission takes too long, it can ask YARN to do the immediate 
decommission.

BTW, it appears fair scheduler doesn't support ConfigurationProvider.

Recommission is another scenario. It can happen when a node is in the 
decommissioned state or the decommission_in_progress state.



[jira] [Commented] (YARN-914) Support graceful decommission of nodemanager

2015-02-05 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307545#comment-14307545
 ] 

Jason Lowe commented on YARN-914:
-

For transferring knowledge to the standby RM, we could persist the graceful 
decomm node list to the state store.

I agree with Xuan that so far I don't see a need to treat LRS and normal 
containers separately.  Either a container exits before the decommission 
timeout or it doesn't.

Just to be clear, the NM is already tracking which applications are active on a 
node and is reporting these to the RM on heartbeats (see NM context and 
NodeStatusUpdaterImpl appTokenKeepAliveMap).  The DecommissionService doesn't 
need to explicitly track the apps itself as this is already being done.

As for doing this RM side or NM side, I think it can simplify things if we do 
this on the RM side.  The RM already needs to know about graceful decommission 
to avoid scheduling new apps/containers on the node.  Also the NM is 
heartbeating active apps back to the RM, so it's easy for the RM to track which 
apps are still active on a particular node.  If the RMNodeImpl state machine 
sees that it's in the decommissioning state and all apps/containers have 
completed then it can transition to the decommissioned state.  For timeouts the 
RM can simply set a timer-delivered event to the RMNode when the graceful 
decommission starts, and the RMNode can act accordingly when the timer event 
arrives, killing containers etc.  Actually I'm not sure the NM needs to know 
about graceful decommission at all, which IMHO simplifies the design since only 
one daemon needs to participate and be knowledgeable of the feature.  The NM 
would simply see the process as a reduction in container assignments until 
eventually containers are killed and the RM tells it that it's decommissioned.
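
A much-simplified, self-contained sketch of that RM-side decision (the real RMNodeImpl uses YARN's StateMachineFactory, so names and structure here are illustrative only):

{code:java}
public class DecommissioningNodeTracker {

  public enum State { RUNNING, DECOMMISSIONING, DECOMMISSIONED }

  private State state = State.RUNNING;

  /** Admin marked the node for graceful decommission (e.g. via refreshNodes). */
  public synchronized void onGracefulDecommissionRequested() {
    if (state == State.RUNNING) {
      // Scheduler stops placing new containers; existing ones keep running.
      state = State.DECOMMISSIONING;
    }
  }

  /** NM heartbeat reported how much is still running on the node. */
  public synchronized void onStatusUpdate(int runningContainers, int activeApplications) {
    if (state == State.DECOMMISSIONING
        && runningContainers == 0 && activeApplications == 0) {
      state = State.DECOMMISSIONED; // node fully drained on its own
    }
  }

  /** Timer-delivered event fired when the decommission grace period expires. */
  public synchronized void onDecommissionTimeout() {
    if (state == State.DECOMMISSIONING) {
      // Any remaining containers would be killed here, then the node is done.
      state = State.DECOMMISSIONED;
    }
  }

  public synchronized State getState() {
    return state;
  }
}
{code}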



[jira] [Commented] (YARN-914) Support graceful decommission of nodemanager

2015-02-04 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306602#comment-14306602
 ] 

Xuan Gong commented on YARN-914:


Thanks for the proposal [~djp]

bq. RM in failed over (with HA enabled) when gracefully decommission is just 
triggered. We should make sure the new active RM can carry on the action 
forward (how to keep sync for decommissioned node list between active and 
standby RM?)

I believe this is about the configuration synchronization between multiple RM 
nodes. Please take a look at https://issues.apache.org/jira/browse/YARN-1666, 
and https://issues.apache.org/jira/browse/YARN-1611

bq. With containers of long running services, the timeout may not help but only 
delay the upgrade/reboot process. Shall we skip it and decommission directly in 
this case?

Do we really need to handle "LRS containers" and "short-term containers" 
differently? There are lots of different cases we need to take care of. I think 
we can just handle both the same way.

bq. Another possibility is to track decommission timeout in RM side, instead of 
NM side - a new decommission services proposed above. Which way is better?

Maybe we need to track the timeout on both the RM side and the NM side. The RM 
can stop the NM if the timeout is reached but it has not received the 
"decommission complete" from the NM.



[jira] [Commented] (YARN-914) Support graceful decommission of nodemanager

2015-01-23 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289286#comment-14289286
 ] 

Jason Lowe commented on YARN-914:
-

bq. The first step I was thinking to keep NM running in a low resource mode 
after graceful decommissioned

I think it could be useful to leave the NM process up after the graceful 
decommission completes.  That allows automated decommissioning tools to know 
the process completed by querying the NM directly.  If the NM exits then the 
tool may have difficulty distinguishing between the NM crashing just before 
decommissioning completed vs. successful completion.  The RM will be tracking 
this state as well, so it may not be critical to do it one way or the other if 
the tool is querying the RM rather than the NM directly.

bq. However, I am not sure if they can handle state migration to new node ahead 
of predictable node lost here, or be stateless more or less make more sense 
here?

I agree with Ming that it would be nice if the graceful decommission process 
could give the AMs a "heads up" about what's going on.  The simplest way to 
accomplish that is to leverage the already existing preemption framework to 
tell the AM that YARN is about to take the resources away.  The 
StrictPreemptionContract portion of the PreemptionMessage can be used to list 
exact resources that YARN will be reclaiming and give the AM a chance to react 
to that before the containers are reclaimed.  It's then up to the AM if it 
wants to do anything special or just let the containers get killed after a 
timeout.

bq. These notification may still be necessary, so AM won't add these nodes into 
blacklist if container get killed afterwards. Thoughts?

I thought we could leverage the updated nodes list of the AllocateResponse to 
let AMs know when nodes are entering the decommissioning state or at least when 
the decommission state completes (and containers are killed).  Although if the 
AM adds the node to the blacklist, that's not such a bad thing either since the 
RM should never allocate new containers on a decommissioning node anyway.
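
A small sketch of that updated-nodes idea (illustrative; DECOMMISSIONED is an existing NodeState, while the DECOMMISSIONING state is only proposed by this JIRA and therefore not referenced in the code):

{code:java}
import java.util.List;

import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;

public class UpdatedNodesInspector {

  public static void logDecommissionedNodes(AllocateResponse response) {
    List<NodeReport> updated = response.getUpdatedNodes();
    for (NodeReport node : updated) {
      if (node.getNodeState() == NodeState.DECOMMISSIONED) {
        // Containers on this node have been (or are about to be) killed; the AM
        // can skip blacklisting since the RM will not schedule there anyway.
        System.out.println("Node decommissioned: " + node.getNodeId());
      }
    }
  }
}
{code}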




[jira] [Commented] (YARN-914) Support graceful decommission of nodemanager

2015-01-22 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14288644#comment-14288644
 ] 

Junping Du commented on YARN-914:
-

Sorry for replying late. These are all good points, a couple of comments:

bq. Sounds like we need a new state for NM, called "decommission_in_progress" 
when NM is draining the containers.
Agreed. We need a dedicated state for the NM in this situation, and both the AM 
and the RM should be aware of it so they can handle it properly.  

bq. To clarify my early comment "all its map output are fetched or until all 
the applications the node touches have completed", the question is when YARN 
can declare a node's state has been gracefully drained and thus the node 
gracefully decommissioned ( admins can shutdown the whole machine without any 
impact on jobs ). For MR, the state could be running tasks/containers or mapper 
outputs. Say we have timeout of 30 minutes for decommission, it takes 3 minutes 
to finish the mappers on the node, another 5 minutes for the job to finish, 
then YARN can declare the node gracefully decommissioned in 8 minutes, instead 
of waiting for 30 minutes. RM knows all applications on any given NM. So if all 
applications on any given node have completed, RM can mark the node 
"decommissioned".
For the first step I was thinking of keeping the NM running in a low-resource 
mode after it is gracefully decommissioned: no running containers, no new 
containers spawned, no obvious resource consumption, etc., much like putting 
these nodes into a maintenance mode. The timeout value there is used to kill 
unfinished containers and release resources. I'm not quite sure we have to 
terminate the NM after the timeout, but I would like to understand your use case here.

bq. Yes, I meant long running services. If YARN just kills the containers upon 
decommission request, the impact could vary. Some services might not have 
states to drain. Or maybe the services can handle the state migration on their 
own without YARN's help. For such services, maybe we can just use 
ResourceOption's timeout for that; set timeout to 0 and NM will just kill the 
containers.
I believe most of these services already handle losing nodes, since no node in a 
YARN cluster can be reliable all the time. However, I am not sure whether they 
can migrate state to a new node ahead of a predictable node loss, or whether 
being more or less stateless makes more sense here. If we have an example 
application that could easily migrate one node's state to another, then we can 
discuss how to provide some rudimentary support for it.   

bq. Given we don't plan to have applications checkpoint and migrate states, it 
doesn't seem to be necessary to have YARN notify applications upon decommission 
requests. Just to call it out.
These notifications may still be necessary, so the AM won't add these nodes to 
its blacklist if containers get killed afterwards. Thoughts?

bq. It might be useful to have a new state called "decommissioned_timeout", so 
that admins know the node has been gracefully decommissioned or not.
As in my comments above, we can see whether we have to terminate the NM. If not, 
I prefer to use a "maintenance" state, and the admin can decide whether to fully 
decommission it later. Again, we should talk through your scenarios here. 



[jira] [Commented] (YARN-914) Support graceful decommission of nodemanager

2015-01-06 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14266692#comment-14266692
 ] 

Ming Ma commented on YARN-914:
--

Thanks, Junping. The timeout is definitely necessary.

* Sounds like we need a new state for NM, called "decommission_in_progress" 
when the NM is draining the containers. When the RM considers the decommission 
complete, the node will be marked "decommissioned".

* To clarify my early comment "all its map output are fetched or until all the 
applications the node touches have completed", the question is when YARN can 
declare a node's state has been gracefully drained and thus the node gracefully 
decommissioned (admins can shut down the whole machine without any impact on 
jobs). For MR, the state could be running tasks/containers or mapper outputs. 
Say we have a timeout of 30 minutes for decommission: if it takes 3 minutes to 
finish the mappers on the node, another 5 minutes for the job to finish, then 
YARN can declare the node gracefully decommissioned in 8 minutes, instead of 
waiting for 30 minutes. RM knows all applications on any given NM. So if all 
applications on any given node have completed, RM can mark the node 
"decommissioned".

* Yes, I meant long running services. If YARN just kills the containers upon 
decommission request, the impact could vary. Some services might not have 
states to drain. Or maybe the services can handle the state migration on their 
own without YARN's help. For such services, maybe we can just use 
ResourceOption's timeout for that; set timeout to 0 and NM will just kill the 
containers.

* Given we don't plan to have applications checkpoint and migrate states, it 
doesn't seem to be necessary to have YARN notify applications upon decommission 
requests. Just to call it out.

* It might be useful to have a new state called "decommissioned_timeout", so 
that admins know whether the node was gracefully decommissioned or not.

Thoughts?



[jira] [Commented] (YARN-914) Support graceful decommission of nodemanager

2014-12-22 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256547#comment-14256547
 ] 

Junping Du commented on YARN-914:
-

Hi [~mingma], Thanks for comments here.
bq. So YARN will reduce the capacity of the nodes as part of the decomission 
process until all its map output are fetched or until all the applications the 
node touches have completed?
Yes. I am not sure it is necessary for YARN to additionally mark the node as 
decommissioned, since the node's resource is already updated to 0 and no 
container will get a chance to be allocated on the node. The auxiliary services 
should still be running, which shouldn't consume much resource if there are no 
service requests.

bq. In addition, it will be interesting to understand how you handle long 
running jobs.
Do you mean long-running services? 
First, I think we should support a timeout when draining a node's resources 
(ResourceOption already has a timeout in its design; see the sketch below), so 
running containers are preempted if they run out of time. 
Second, we should support a special container tag for long-running services 
(there are some discussions in YARN-1039) so we don't have to waste time waiting 
for such containers to finish until the timeout. 
Third, from an operational perspective, we could add a long-running label to 
specific nodes and try not to decommission nodes with that label.
Let me know if this makes sense to you.
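
For the first point, a sketch of the YARN-291 building block (illustrative; the admin-protocol wiring is omitted, and the over-commit timeout semantics should be double-checked against ResourceOption's javadoc):

{code:java}
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.NodeId;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceOption;
import org.apache.hadoop.yarn.server.api.protocolrecords.UpdateNodeResourceRequest;

public class DrainNodeResource {

  /**
   * Builds a request that shrinks a node's advertised resource to zero so that
   * no new containers land there while existing ones drain. The over-commit
   * timeout's units and its "negative means no timeout" convention should be
   * verified against ResourceOption before relying on them.
   */
  public static UpdateNodeResourceRequest drainRequest(NodeId nodeId,
      int overCommitTimeout) {
    Resource zero = Resource.newInstance(0, 0);
    ResourceOption option = ResourceOption.newInstance(zero, overCommitTimeout);
    // The request would then be sent through the RM admin protocol
    // (ResourceManagerAdministrationProtocol#updateNodeResource).
    return UpdateNodeResourceRequest.newInstance(
        Collections.singletonMap(nodeId, option));
  }
}
{code}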




[jira] [Commented] (YARN-914) Support graceful decommission of nodemanager

2014-12-19 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254382#comment-14254382
 ] 

Ming Ma commented on YARN-914:
--

[~djp], thanks for working on this.

It looks like we are going to use YARN-291 and thus the "drain the state" 
approach, instead of the more complicated "migrate the state" approach. So YARN 
will reduce the capacity of the nodes as part of the decommission process until 
all of its map outputs are fetched or until all the applications the node touches 
have completed? In addition, it will be interesting to understand how you 
handle long running jobs.

FYI, https://issues.apache.org/jira/browse/YARN-1996 will drain containers of 
unhealthy nodes.




[jira] [Commented] (YARN-914) Support graceful decommission of nodemanager

2014-01-13 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870080#comment-13870080
 ] 

Ming Ma commented on YARN-914:
--

Junping/Luke, have you looked into the checkpointing framework being done to 
support preemption? One possible design to support this scenario could be 
something like:

1. Drain NM with a timeout. When NM is being drained, no more tasks will be 
assigned to this node.
2. After the timeout, RM -> AM -> tasks checkpointing will kick in. Task state 
and application-level state such as map outputs will be preserved; tasks will 
be rescheduled to other nodes.



[jira] [Commented] (YARN-914) Support graceful decommission of nodemanager

2013-11-08 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13817145#comment-13817145
 ] 

Steve Loughran commented on YARN-914:
-

YARN-1394 adds the need for AMs to be told of NM failure/decommission as causes 
for container completion.



[jira] [Commented] (YARN-914) Support graceful decommission of nodemanager

2013-07-15 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13709255#comment-13709255
 ] 

Aaron T. Myers commented on YARN-914:
-

Thanks, Luke.



[jira] [Commented] (YARN-914) Support graceful decommission of nodemanager

2013-07-15 Thread Luke Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13709144#comment-13709144
 ] 

Luke Lu commented on YARN-914:
--

[~atm]: Nice catch! Of course :)



[jira] [Commented] (YARN-914) Support graceful decommission of nodemanager

2013-07-15 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13709067#comment-13709067
 ] 

Aaron T. Myers commented on YARN-914:
-

Should we perhaps do an s/NN/NM/g in the description of this JIRA? Or does this 
have something to do with the Name Node and I'm completely missing it?

> Support graceful decommission of nodemanager
> 
>
> Key: YARN-914
> URL: https://issues.apache.org/jira/browse/YARN-914
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.0.4-alpha
>Reporter: Luke Lu
>Assignee: Junping Du
>
> When NNs are decommissioned for non-fault reasons (capacity change etc.), 
> it's desirable to minimize the impact to running applications.
> Currently if a NN is decommissioned, all running containers on the NN need to 
> be rescheduled on other NNs. Further more, for finished map tasks, if their 
> map output are not fetched by the reducers of the job, these map tasks will 
> need to be rerun as well.
> We propose to introduce a mechanism to optionally gracefully decommission a 
> node manager.
