[jira] [Commented] (YARN-914) (Umbrella) Support graceful decommission of nodemanager

2017-11-16 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16256106#comment-16256106
 ] 

Junping Du commented on YARN-914:
-

The client-side graceful decommission work has been completed, along with 
proper documentation, so we can claim that part of the goal has been achieved. 
I think we should move the server-side decommission work into a phase two that 
covers the HA fixes, the JSON format issues, and other enhancements, which 
would keep this list cleaner. If nobody objects, I will create a new umbrella 
JIRA (and a new branch) and move all open JIRAs under it.
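
For readers landing here, the documented client-side flow looks roughly like 
the sketch below. The exclude-file path, host name, and timeout are 
illustrative assumptions rather than values taken from this JIRA, and the 
exact rmadmin flags depend on the Hadoop release.

{code}
# Sketch of the client-side graceful decommission flow (illustrative values).

# 1. List the node to be decommissioned in the RM exclude file, i.e. the file
#    pointed to by yarn.resourcemanager.nodes.exclude-path in yarn-site.xml.
echo "nm-host-01.example.com" >> /etc/hadoop/conf/yarn.exclude

# 2. Trigger a graceful refresh; the CLI tracks the DECOMMISSIONING nodes and
#    waits up to the given timeout (in seconds) before forcing decommission.
yarn rmadmin -refreshNodes -g 3600 -client
{code}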

> (Umbrella) Support graceful decommission of nodemanager
> ---
>
> Key: YARN-914
> URL: https://issues.apache.org/jira/browse/YARN-914
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: graceful
>Affects Versions: 2.0.4-alpha
>Reporter: Luke Lu
>Assignee: Junping Du
> Attachments: Gracefully Decommission of NodeManager (v1).pdf, 
> Gracefully Decommission of NodeManager (v2).pdf, 
> GracefullyDecommissionofNodeManagerv3.pdf
>
>
> When NMs are decommissioned for non-fault reasons (capacity changes, etc.), 
> it's desirable to minimize the impact on running applications.
> Currently, if an NM is decommissioned, all running containers on the NM need 
> to be rescheduled on other NMs. Furthermore, for finished map tasks, if their 
> map outputs have not been fetched by the reducers of the job, those map tasks 
> will need to be rerun as well.
> We propose to introduce a mechanism to optionally decommission a node manager 
> gracefully.



[jira] [Commented] (YARN-914) (Umbrella) Support graceful decommission of nodemanager

2016-02-05 Thread Daniel Zhi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135138#comment-15135138
 ] 

Daniel Zhi commented on YARN-914:
-

I have applied and merged my code changes on top of the latest Hadoop trunk 
branch (3.0.0-SNAPSHOT), launched a cluster, and verified that graceful 
decommission works as expected. Per the suggestion, I created a sub-JIRA with a 
doc that describes the design, plus the patch on top of the latest trunk.
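
For anyone reproducing this, a rough sketch of porting a patch set onto trunk 
and building a deployable distribution is shown below; the remote URL is the 
one mentioned later in this thread, the branch name is hypothetical, and the 
Maven invocation follows the usual BUILDING.txt recipe.

{code}
# Hypothetical sketch: rebase local graceful-decommission work onto Apache trunk.
git remote add apache git://git.apache.org/hadoop.git
git fetch apache
git checkout my-graceful-decom        # hypothetical local feature branch
git rebase apache/trunk               # resolve conflicts, re-run unit tests

# Build a distribution tarball to deploy on a test cluster.
mvn clean package -Pdist -DskipTests -Dtar
{code}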


[jira] [Commented] (YARN-914) (Umbrella) Support graceful decommission of nodemanager

2016-02-05 Thread Daniel Zhi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135159#comment-15135159
 ] 

Daniel Zhi commented on YARN-914:
-

For lack of a better title, the sub-JIRA is currently named "Automatic and 
Asynchronous Decommissioning Nodes Status Tracking".


[jira] [Commented] (YARN-914) (Umbrella) Support graceful decommission of nodemanager

2015-12-21 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15066518#comment-15066518
 ] 

Junping Du commented on YARN-914:
-

Hi [~danzhi], thanks for sharing the information above, and welcome to 
contributing to Apache Hadoop.

bq. Our implementation is much in sync with the architecture and idea in the 
JIRA design document.
Good to hear that we are on the same page. One thing we need to pay attention 
to: many patches have already been committed to trunk/branch-2.8. Since this is 
a continuing development effort in YARN, the code (currently internal to you) 
that duplicates existing functionality or APIs needs to be removed before 
contributing; otherwise it takes reviewers/committers much more effort to work 
out which functionality/APIs are duplicated and which are not, and that usually 
takes far longer.

bq. On the other hand, there are additional details and component-level 
designs that the JIRA design document does not necessarily discuss or touch. 
These details naturally surfaced during the development iterations, and the 
corresponding designs matured and stabilized.
I agree that a design document can, in general, miss some implementation 
details. However, more background/details can be found in the JIRA discussions 
and in the patches themselves. Let me explain below.

bq. One example is the DecommissioningNodeWatcher which, embedded in 
ResourceTrackingService, automatically and asynchronously tracks the status of 
DECOMMISSIONING nodes after the client/admin makes a graceful decommission 
request. Another example is per-node decommission timeout support, which is 
useful for decommissioning a node that will be terminated soon.
Actually, our current design and the committed patches already support a 
timeout feature. There are basically two ways to handle the timeout: on the RM 
side or on the CLI side, and both have pros and cons.
Per the discussions above 
(https://issues.apache.org/jira/browse/YARN-914?focusedCommentId=14312677&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14312677
 and 
https://issues.apache.org/jira/browse/YARN-914?focusedCommentId=14312677&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14312677),
 we (Jason, Vinod, and I) all agreed to go with the CLI approach first; it has 
already been implemented in sub-JIRA YARN-3225 and committed. Of course, we are 
open to the other implementation approach, but we do want it to be behind an 
on/off configuration switch so that it does not affect the currently preferred 
option that we have already implemented.
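
Purely for illustration, an RM-side timeout behind a configuration switch could 
be expressed as a yarn-site.xml property along the lines of the sketch below; 
the property name shown is the one later Hadoop releases adopted, so treat it 
as an assumption in the context of this discussion rather than something that 
was committed at the time.

{code:xml}
<!-- Illustrative RM-side alternative: the ResourceManager enforces the
     graceful-decommission timeout itself instead of the CLI tracking it.
     Property name follows later releases; value is in seconds. -->
<property>
  <name>yarn.resourcemanager.nodemanager-graceful-decommission-timeout-secs</name>
  <value>3600</value>
</property>
{code}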

bq. Are you able to share these details in an "augmented" design doc? Agreeing 
on the design would greatly help with review/commits later.
I would prefer an effort to abstract the different implementations for 
tracking/handling the timeout. That does not sound like an overall "augmented" 
design, given that the implementation was previously described as "much in 
sync" with the current architecture and design. It would also be more 
appropriate to create a sub-JIRA to discuss your ideas and attach your document 
there, given that we already have a very long discussion here on the overall 
design.

bq. As far as implementation goes, it is recommended to create subtasks as you 
see fit. Note that it is easier to review smaller chunks of code. Also, since 
you guys have implemented it already, can you comment on how much of the code 
changes are in frequently updated parts? If not much, it might make sense to 
develop on a branch and merge it to trunk.
I would say most parts of YARN-914 are already committed or have patches 
available. Enhancing the timeout tracking/handling here does not sound like a 
massive amount of work, so a dedicated development branch seems unnecessary to 
me. However, I would prefer to create a sub-JIRA to discuss the idea/scope and 
take a look at your demo code (with the code/features that are already 
committed, or publicly available as patches, removed) before making any 
judgement/decision.

[~danzhi], the concrete steps I would suggest for now are:
1. Review all the JIRA discussions/design docs/implementations under this 
umbrella JIRA so far, and understand the scope and the gap relative to your 
current internal implementation.
2. Raise a sub-JIRA with your ideas/design, highlighting the different options 
for discussion. If possible, attach a demo patch with any code or features that 
overlap the existing patches removed, for better understanding. We can discuss 
later how to bring in your patch contribution.
Does that make sense?

[jira] [Commented] (YARN-914) (Umbrella) Support graceful decommission of nodemanager

2015-12-18 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15064054#comment-15064054
 ] 

Jason Lowe commented on YARN-914:
-

[~danzhi] the patch should be against trunk.  We always commit first against 
trunk and then backport to prior releases in reverse release order (e.g.: 
trunk->branch-2->branch-2.8->branch-2.7) so we avoid a situation where a 
feature or fix is in a release but disappears in a subsequently released 
version.  See the [How to 
Contribute|http://wiki.apache.org/hadoop/HowToContribute] page for more 
information including details on preparing and naming the patch, etc.
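
As a hedged illustration of that order (the branch names are the ones listed 
above; the commit hash is a placeholder and the cherry-pick flow is the common 
convention, not something prescribed in this thread):

{code}
# Illustrative backport flow once a change has landed on trunk.
git checkout trunk
git log --oneline -1                      # note the commit hash, e.g. abc1234
git checkout branch-2   && git cherry-pick -x abc1234
git checkout branch-2.8 && git cherry-pick -x abc1234
git checkout branch-2.7 && git cherry-pick -x abc1234   # stop at the oldest line that needs the fix
{code}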

Is this implementation in line with the design document on this JIRA, or is it 
using a different approach?


[jira] [Commented] (YARN-914) (Umbrella) Support graceful decommission of nodemanager

2015-12-18 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15065119#comment-15065119
 ] 

Karthik Kambatla commented on YARN-914:
---

bq. On the other hand, there are additional details and component-level 
designs that the JIRA design document does not necessarily discuss or touch.
Are you able to share these details in an "augmented" design doc? Agreeing on 
the design would greatly help with review/commits later.

As far as implementation goes, it is recommended to create subtasks as you see 
fit. Note that it is easier to review smaller chunks of code. Also, since you 
guys have implemented it already, can you comment on how much of the code 
changes are in frequently updated parts? If not much, it might make sense to 
develop on a branch and merge it to trunk. 


[jira] [Commented] (YARN-914) (Umbrella) Support graceful decommission of nodemanager

2015-12-18 Thread Daniel Zhi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15064737#comment-15064737
 ] 

Daniel Zhi commented on YARN-914:
-

Thanks. Always committing to trunk first makes a lot of sense to me. We would 
need to port the code to trunk and likely build an AMI image with it so we can 
leverage our internal verification test system.

Our implementation is much in sync with the architecture and idea in the JIRA 
design document. On the other hand, there are additional details and 
component-level designs that the JIRA design document does not necessarily 
discuss or touch. These details naturally surfaced during the development 
iterations, and the corresponding designs matured and stabilized. One example 
is the DecommissioningNodeWatcher which, embedded in ResourceTrackingService, 
automatically and asynchronously tracks the status of DECOMMISSIONING nodes 
after the client/admin makes a graceful decommission request. Another example 
is per-node decommission timeout support, which is useful for decommissioning 
a node that will be terminated soon.


[jira] [Commented] (YARN-914) (Umbrella) Support graceful decommission of nodemanager

2015-12-17 Thread Daniel Zhi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15062988#comment-15062988
 ] 

Daniel Zhi commented on YARN-914:
-

AWS EMR (Elastic MapReduce) has implemented graceful decommission of YARN 
nodes and included it in several of the most recent AMI releases. The 
implementation has been verified in thousands of customer clusters. We would 
like to contribute the implementation back to Apache Hadoop.

Internally we have the code on both Hadoop 2.6.0 and Hadoop 2.7.1. To prepare 
for contributing it back to Apache Hadoop, which branch should we prepare the 
code against?


[jira] [Commented] (YARN-914) (Umbrella) Support graceful decommission of nodemanager

2015-12-17 Thread Parvez (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15063000#comment-15063000
 ] 

Parvez commented on YARN-914:
-

Hi Daniel,

Thank you for the reply. Yes, AWS released a new AMI version that supports 
graceful decommissioning of nodes. I guess you are referring to 
http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-manage-resize.html#graceful-shrink

Sorry, I don't have much idea about the specific branch.


[jira] [Commented] (YARN-914) (Umbrella) Support graceful decommission of nodemanager

2015-12-17 Thread Daniel Zhi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15063033#comment-15063033
 ] 

Daniel Zhi commented on YARN-914:
-

Yes. (Another related blog: 
https://aws.amazon.com/blogs/aws/amazon-emr-release-4-1-0-spark-1-5-0-hue-3-7-1-hdfs-encryption-presto-oozie-zeppelin-improved-resizing/)
 

Just to clarify my question: my current patch is on top of Hadoop 2.7.1. 
However, I see the branches "trunk", "branch-2.8", and "branch-2.7.2" in 
git://git.apache.org/hadoop.git. It would require extra preparation to make a 
patch against these branches, and it is unclear to me which branch to prepare 
the patch against.
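
A hedged sketch of the usual preparation is below; the working branch name is 
made up, and the patch file naming follows the HowToContribute conventions as 
I understand them.

{code}
# Illustrative: port the 2.7.1-based changes onto trunk and produce a patch file.
git clone git://git.apache.org/hadoop.git && cd hadoop
git checkout trunk
git checkout -b YARN-914-graceful         # hypothetical working branch
# ...apply/port the 2.7.1 changes, resolve conflicts, run the affected tests...
git diff trunk > YARN-914.001.patch       # name per the HowToContribute page
{code}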


[jira] [Commented] (YARN-914) (Umbrella) Support graceful decommission of nodemanager

2015-12-17 Thread Daniel Zhi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15062997#comment-15062997
 ] 

Daniel Zhi commented on YARN-914:
-

Maybe you are already aware of this: the EMR team has implemented graceful 
decommission in recent AMIs (for example, AMI 3.10.0 or 4.2.0). In these new 
AMIs, when you resize the cluster down, the control logic selects the best 
candidates and gracefully decommissions them instead of terminating them right 
away as before. You can move to AMI 3.10.0 (which runs Hadoop 2.6.0).


[jira] [Commented] (YARN-914) (Umbrella) Support graceful decommission of nodemanager

2015-09-17 Thread Parvez (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803230#comment-14803230
 ] 

Parvez commented on YARN-914:
-

Hi,

I am facing issues when trying to resize an AWS EMR cluster that is configured 
with Hadoop 2.6.0.

Resizing works fine, but when a node that has containers running on it is 
decommissioned, the entire EMR cluster stops functioning. On a resize request, 
EMR terminates a task node (EC2 instance) randomly, without checking whether 
it has containers running on it.

YARN should move the containers and the job from one node to another here, 
which I suppose it isn't doing.

Could it be related to the issue described here?

Please answer. Thank you.


[jira] [Commented] (YARN-914) (Umbrella) Support graceful decommission of nodemanager

2015-03-18 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367873#comment-14367873
 ] 

Junping Du commented on YARN-914:
-

Hi, can someone on the watch list help review the patch in sub-JIRA YARN-3212? 
Thanks!
