[jira] [Assigned] (YARN-11489) Fix memory leak of DelegationTokenRenewer futures in DelegationTokenRenewerPoolTracker

2023-05-08 Thread Chun Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Chen reassigned YARN-11489:


Assignee: Chun Chen

> Fix memory leak of DelegationTokenRenewer futures in 
> DelegationTokenRenewerPoolTracker
> --
>
> Key: YARN-11489
> URL: https://issues.apache.org/jira/browse/YARN-11489
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chun Chen
>Assignee: Chun Chen
>Priority: Major
>
> The future of the DelegationTokenRenewer runnable was not removed properly 
> before; it is only removed when the runnable times out.
> Also, a Queue is a better fit than a Map for storing the futures.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-11489) Fix memory leak of DelegationTokenRenewer futures in DelegationTokenRenewerPoolTracker

2023-05-08 Thread Chun Chen (Jira)
Chun Chen created YARN-11489:


 Summary: Fix memory leak of DelegationTokenRenewer futures in 
DelegationTokenRenewerPoolTracker
 Key: YARN-11489
 URL: https://issues.apache.org/jira/browse/YARN-11489
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Chun Chen


The future of the DelegationTokenRenewer runnable was not removed properly 
before; it is only removed when the runnable times out.
Also, a Queue is a better fit than a Map for storing the futures.
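
For illustration only, here is a minimal sketch of the idea, assuming a simple 
executor-backed tracker (the class and method names below are hypothetical and do 
not match the real DelegationTokenRenewerPoolTracker code): keep the futures in a 
queue and drop them as soon as they complete, instead of keeping them in a map 
that is only cleaned up on timeout.
{code}
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical tracker; names do not match the real
// DelegationTokenRenewerPoolTracker implementation.
public class RenewerFutureTracker {
  private final ExecutorService pool = Executors.newFixedThreadPool(4);
  private final Queue<Future<?>> futures = new ConcurrentLinkedQueue<>();

  public void submit(Runnable renewerTask) {
    // Keep a handle only while the task is alive.
    futures.add(pool.submit(renewerTask));
  }

  // Called periodically (or right after a task finishes): drop completed
  // futures so they do not accumulate, which is the leak described above.
  public void purgeCompleted() {
    futures.removeIf(Future::isDone);
  }
}
{code}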





--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-2674) Distributed shell AM may re-launch containers if RM work preserving restart happens

2018-04-10 Thread Chun Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433277#comment-16433277
 ] 

Chun Chen commented on YARN-2674:
-

[~shaneku...@gmail.com] Please pick up the work here. I no longer work on YARN.

> Distributed shell AM may re-launch containers if RM work preserving restart 
> happens
> ---
>
> Key: YARN-2674
> URL: https://issues.apache.org/jira/browse/YARN-2674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: applications, resourcemanager
>Reporter: Chun Chen
>Assignee: Chun Chen
>Priority: Major
>  Labels: oct16-easy
> Attachments: YARN-2674.1.patch, YARN-2674.2.patch, YARN-2674.3.patch, 
> YARN-2674.4.patch, YARN-2674.5.patch
>
>
> Currently, if an RM work-preserving restart happens while distributed shell is 
> running, the distributed shell AM may re-launch all of the containers, 
> including new/running/completed ones. We must make sure it won't re-launch the 
> running/completed containers.
> We need to remove allocated containers from 
> AMRMClientImpl#remoteRequestsTable once the AM receives them from the RM.
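
As an illustration of the guard described above (a sketch, not the actual 
YARN-2674 patch), an AM callback handler could remember which containers it has 
already launched and skip duplicates handed back after an RM work-preserving 
restart; launchContainerAsync below is a hypothetical helper:
{code}
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerId;

// Sketch only: assumes an AMRMClientAsync-style callback flow.
public abstract class RestartSafeCallbackHandler {
  private final Set<ContainerId> known = ConcurrentHashMap.newKeySet();

  public void onContainersAllocated(List<Container> allocated) {
    for (Container c : allocated) {
      // After an RM work-preserving restart the AM may be handed containers
      // it already launched; only start genuinely new ones.
      if (known.add(c.getId())) {
        launchContainerAsync(c);
      }
    }
  }

  protected abstract void launchContainerAsync(Container container);
}
{code}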



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-1983) Support heterogeneous container types at runtime on YARN

2015-06-16 Thread Chun Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14589179#comment-14589179
 ] 

Chun Chen commented on YARN-1983:
-

[~sidharta-s], thanks for letting me know the progress on this. I'm happy to 
learn about your concerns and the design for implementing it.

 Support heterogeneous container types at runtime on YARN
 

 Key: YARN-1983
 URL: https://issues.apache.org/jira/browse/YARN-1983
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Junping Du
 Attachments: YARN-1983.2.patch, YARN-1983.patch


 Different container types (default, LXC, docker, VM box, etc.) have different 
 semantics for isolation of security, namespace/env, performance, etc.
 Per the discussions in YARN-1964, we have some good thoughts on supporting 
 different types of containers running on YARN, specified by the application at 
 runtime, which would largely enhance YARN's flexibility to meet heterogeneous 
 apps' isolation requirements at runtime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1983) Support heterogeneous container types at runtime on YARN

2015-06-15 Thread Chun Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587370#comment-14587370
 ] 

Chun Chen commented on YARN-1983:
-

[~vinodkv], following your suggestion, I propose the following change:
1. Allow NM_CE to specify a comma-separated list of CE classes.
2. Allow the user to specify an env named NM_CLIENT_CE in the CLC. If the value 
of NM_CLIENT_CE is one of the CE classes configured above, choose that one to 
execute the container; otherwise throw an exception.
3. If the user specifies only one CE class in NM_CE, ignore NM_CLIENT_CE in the 
env of the CLC and always use that one to execute containers.
4. If the user specifies multiple classes in NM_CE, they have to configure a 
default CE named NM_DEFAULT_CE in yarn-site.xml in case they don't specify the 
env NM_CLIENT_CE when submitting containers (see the sketch below).

NM_CE=yarn.nodemanager.container-executor.class
NM_CLIENT_CE=yarn.nodemanager.client.container-executor.class
NM_DEFAULT_CE=yarn.nodemanager.default.container-executor.class
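
For illustration, a client or test could wire the proposed keys roughly as 
below; only yarn.nodemanager.container-executor.class exists in YARN today, the 
"default"/"client" keys are part of this proposal, and the executor class names 
are just examples:
{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

class ProposedMultiCeConfig {
  static YarnConfiguration build() {
    YarnConfiguration conf = new YarnConfiguration();
    // Proposal: NM_CE accepts a comma-separated list of CE classes.
    conf.set("yarn.nodemanager.container-executor.class",
        "org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor,"
            + "org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor");
    // Proposal: default CE used when the CLC env does not pick one.
    conf.set("yarn.nodemanager.default.container-executor.class",
        "org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor");
    // An application would additionally set the env
    // yarn.nodemanager.client.container-executor.class in its CLC to pick one.
    return conf;
  }
}
{code}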

 Support heterogeneous container types at runtime on YARN
 

 Key: YARN-1983
 URL: https://issues.apache.org/jira/browse/YARN-1983
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Junping Du
 Attachments: YARN-1983.2.patch, YARN-1983.patch


 Different container types (default, LXC, docker, VM box, etc.) have different 
 semantics for isolation of security, namespace/env, performance, etc.
 Per the discussions in YARN-1964, we have some good thoughts on supporting 
 different types of containers running on YARN, specified by the application at 
 runtime, which would largely enhance YARN's flexibility to meet heterogeneous 
 apps' isolation requirements at runtime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3469) ZKRMStateStore: Avoid setting watches that are not required

2015-06-14 Thread Chun Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14585394#comment-14585394
 ] 

Chun Chen commented on YARN-3469:
-

Just a note. I saw ZOOKEEPER-706 has been resolved recently after 5 years.

 ZKRMStateStore: Avoid setting watches that are not required
 ---

 Key: YARN-3469
 URL: https://issues.apache.org/jira/browse/YARN-3469
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.6.0
Reporter: Jun Gong
Assignee: Jun Gong
Priority: Minor
 Fix For: 2.7.1

 Attachments: YARN-3469.01.patch


 In ZKRMStateStore, most operations (e.g. getDataWithRetries) set watches on the 
 znodes. A large number of watches will cause problems such as [ZOOKEEPER-706: 
 large numbers of watches can cause 
 session re-establishment to 
 fail|https://issues.apache.org/jira/browse/ZOOKEEPER-706].  
 Although there is a workaround of setting jute.maxbuffer to a larger value, we 
 need to keep adjusting this value as more apps and attempts are stored in 
 ZK. And those watches are useless now. It might be better not to set 
 watches.
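
For reference, this is roughly what the change means at the plain ZooKeeper API 
level (a sketch, not the ZKRMStateStore retry wrappers): the boolean watch 
argument controls whether a watch is registered on the znode.
{code}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

class ZkReadNoWatch {
  static byte[] read(ZooKeeper zk, String path)
      throws KeeperException, InterruptedException {
    Stat stat = new Stat();
    // Passing 'false' (or a null Watcher) means no watch is left behind, so a
    // large state store does not pile watches onto the ZK session.
    return zk.getData(path, false, stat);
  }
}
{code}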



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2674) Distributed shell AM may re-launch containers if RM work preserving restart happens

2015-06-05 Thread Chun Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Chen updated YARN-2674:

Attachment: YARN-2674.4.patch

 Distributed shell AM may re-launch containers if RM work preserving restart 
 happens
 ---

 Key: YARN-2674
 URL: https://issues.apache.org/jira/browse/YARN-2674
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Chun Chen
Assignee: Chun Chen
 Attachments: YARN-2674.1.patch, YARN-2674.2.patch, YARN-2674.3.patch, 
 YARN-2674.4.patch


 Currently, if an RM work-preserving restart happens while distributed shell is 
 running, the distributed shell AM may re-launch all of the containers, 
 including new/running/completed ones. We must make sure it won't re-launch the 
 running/completed containers.
 We need to remove allocated containers from 
 AMRMClientImpl#remoteRequestsTable once the AM receives them from the RM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2674) Distributed shell AM may re-launch containers if RM work preserving restart happens

2015-06-05 Thread Chun Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14574106#comment-14574106
 ] 

Chun Chen commented on YARN-2674:
-

Upload a patch to fix test failures.

 Distributed shell AM may re-launch containers if RM work preserving restart 
 happens
 ---

 Key: YARN-2674
 URL: https://issues.apache.org/jira/browse/YARN-2674
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Chun Chen
Assignee: Chun Chen
 Attachments: YARN-2674.1.patch, YARN-2674.2.patch, YARN-2674.3.patch, 
 YARN-2674.4.patch


 Currently, if an RM work-preserving restart happens while distributed shell is 
 running, the distributed shell AM may re-launch all of the containers, 
 including new/running/completed ones. We must make sure it won't re-launch the 
 running/completed containers.
 We need to remove allocated containers from 
 AMRMClientImpl#remoteRequestsTable once the AM receives them from the RM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2674) Distributed shell AM may re-launch containers if RM work preserving restart happens

2015-06-05 Thread Chun Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Chen updated YARN-2674:

Attachment: YARN-2674.5.patch

Upload YARN-2674.5.patch to remove unnecessary synchronization.

 Distributed shell AM may re-launch containers if RM work preserving restart 
 happens
 ---

 Key: YARN-2674
 URL: https://issues.apache.org/jira/browse/YARN-2674
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Chun Chen
Assignee: Chun Chen
 Attachments: YARN-2674.1.patch, YARN-2674.2.patch, YARN-2674.3.patch, 
 YARN-2674.4.patch, YARN-2674.5.patch


 Currently, if an RM work-preserving restart happens while distributed shell is 
 running, the distributed shell AM may re-launch all of the containers, 
 including new/running/completed ones. We must make sure it won't re-launch the 
 running/completed containers.
 We need to remove allocated containers from 
 AMRMClientImpl#remoteRequestsTable once the AM receives them from the RM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2674) Distributed shell AM may re-launch containers if RM work preserving restart happens

2015-06-04 Thread Chun Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Chen updated YARN-2674:

Attachment: YARN-2674.3.patch

 Distributed shell AM may re-launch containers if RM work preserving restart 
 happens
 ---

 Key: YARN-2674
 URL: https://issues.apache.org/jira/browse/YARN-2674
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Chun Chen
Assignee: Chun Chen
 Attachments: YARN-2674.1.patch, YARN-2674.2.patch, YARN-2674.3.patch


 Currently, if an RM work-preserving restart happens while distributed shell is 
 running, the distributed shell AM may re-launch all of the containers, 
 including new/running/completed ones. We must make sure it won't re-launch the 
 running/completed containers.
 We need to remove allocated containers from 
 AMRMClientImpl#remoteRequestsTable once the AM receives them from the RM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2674) Distributed shell AM may re-launch containers if RM work preserving restart happens

2015-06-04 Thread Chun Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14572535#comment-14572535
 ] 

Chun Chen commented on YARN-2674:
-

Upload YARN-2674.3.patch with a test case and more detailed comments.

 Distributed shell AM may re-launch containers if RM work preserving restart 
 happens
 ---

 Key: YARN-2674
 URL: https://issues.apache.org/jira/browse/YARN-2674
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Chun Chen
Assignee: Chun Chen
 Attachments: YARN-2674.1.patch, YARN-2674.2.patch, YARN-2674.3.patch


 Currently, if an RM work-preserving restart happens while distributed shell is 
 running, the distributed shell AM may re-launch all of the containers, 
 including new/running/completed ones. We must make sure it won't re-launch the 
 running/completed containers.
 We need to remove allocated containers from 
 AMRMClientImpl#remoteRequestsTable once the AM receives them from the RM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-06-03 Thread Chun Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14571957#comment-14571957
 ] 

Chun Chen commented on YARN-3749:
-

Thanks for reviewing and committing the patch, [~xgong].

 We should make a copy of configuration when init MiniYARNCluster with 
 multiple RMs
 --

 Key: YARN-3749
 URL: https://issues.apache.org/jira/browse/YARN-3749
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Chun Chen
Assignee: Chun Chen
 Fix For: 2.8.0

 Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, 
 YARN-3749.5.patch, YARN-3749.6.patch, YARN-3749.7.patch, YARN-3749.7.patch, 
 YARN-3749.patch


 When I was trying to write a test case for YARN-2674, I found the DS client 
 trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 
 when RM failover happens. But I initially set 
 yarn.resourcemanager.address.rm1=0.0.0.0:18032 and 
 yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found that it 
 is in ClientRMService that the value of yarn.resourcemanager.address.rm2 gets 
 changed to 0.0.0.0:18032. See the following code in ClientRMService:
 {code}
 clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
     YarnConfiguration.RM_ADDRESS,
     YarnConfiguration.DEFAULT_RM_ADDRESS,
     server.getListenerAddress());
 {code}
 Since we use the same configuration instance for rm1 and rm2 and init both 
 RMs before we start either of them, yarn.resourcemanager.ha.id is changed to 
 rm2 during the init of rm2 and is still rm2 while rm1 is starting.
 So I think it is safe to make a copy of the configuration when initializing 
 each RM.
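
A minimal sketch of the proposed fix, assuming a MiniYARNCluster-style init loop 
(the method name and HA id naming below are illustrative, not the exact patch):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.server.resourcemanager.ResourceManager;

// Sketch only; the real change lives in MiniYARNCluster#initResourceManager.
class PerRmConfigInit {
  static void initResourceManagers(Configuration base, ResourceManager[] rms) {
    for (int i = 0; i < rms.length; i++) {
      // Give every RM its own copy so rm2's init cannot overwrite the
      // yarn.resourcemanager.ha.id (and resolved addresses) seen by rm1.
      Configuration rmConf = new YarnConfiguration(base);
      rmConf.set(YarnConfiguration.RM_HA_ID, "rm" + (i + 1));
      rms[i].init(rmConf);
    }
  }
}
{code}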



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-06-02 Thread Chun Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570202#comment-14570202
 ] 

Chun Chen commented on YARN-3749:
-

Thanks for reviewing the patch, [~zxu] ! 

 We should make a copy of configuration when init MiniYARNCluster with 
 multiple RMs
 --

 Key: YARN-3749
 URL: https://issues.apache.org/jira/browse/YARN-3749
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Chun Chen
Assignee: Chun Chen
 Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, 
 YARN-3749.5.patch, YARN-3749.6.patch, YARN-3749.7.patch, YARN-3749.patch


 When I was trying to write a test case for YARN-2674, I found the DS client 
 trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 
 when RM failover happens. But I initially set 
 yarn.resourcemanager.address.rm1=0.0.0.0:18032 and 
 yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found that it 
 is in ClientRMService that the value of yarn.resourcemanager.address.rm2 gets 
 changed to 0.0.0.0:18032. See the following code in ClientRMService:
 {code}
 clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
     YarnConfiguration.RM_ADDRESS,
     YarnConfiguration.DEFAULT_RM_ADDRESS,
     server.getListenerAddress());
 {code}
 Since we use the same configuration instance for rm1 and rm2 and init both 
 RMs before we start either of them, yarn.resourcemanager.ha.id is changed to 
 rm2 during the init of rm2 and is still rm2 while rm1 is starting.
 So I think it is safe to make a copy of the configuration when initializing 
 each RM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-06-02 Thread Chun Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568586#comment-14568586
 ] 

Chun Chen commented on YARN-3749:
-

bq. It looks like we need keep conf.set(YarnConfiguration.RM_HA_ID, 
RM1_NODE_ID); in TestRMEmbeddedElector to fix this test failure.
Sorry, my bad. Uploaded YARN-3749.7.patch to fix that and added a test in 
{{TestYarnConfiguration}} to make sure {{YarnConfiguration#updateConnectAddr}} 
won't add a suffix to NM service address configurations. 

 We should make a copy of configuration when init MiniYARNCluster with 
 multiple RMs
 --

 Key: YARN-3749
 URL: https://issues.apache.org/jira/browse/YARN-3749
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Chun Chen
Assignee: Chun Chen
 Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, 
 YARN-3749.5.patch, YARN-3749.6.patch, YARN-3749.7.patch, YARN-3749.patch


 When I was trying to write a test case for YARN-2674, I found the DS client 
 trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 
 when RM failover happens. But I initially set 
 yarn.resourcemanager.address.rm1=0.0.0.0:18032 and 
 yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found that it 
 is in ClientRMService that the value of yarn.resourcemanager.address.rm2 gets 
 changed to 0.0.0.0:18032. See the following code in ClientRMService:
 {code}
 clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
     YarnConfiguration.RM_ADDRESS,
     YarnConfiguration.DEFAULT_RM_ADDRESS,
     server.getListenerAddress());
 {code}
 Since we use the same configuration instance for rm1 and rm2 and init both 
 RMs before we start either of them, yarn.resourcemanager.ha.id is changed to 
 rm2 during the init of rm2 and is still rm2 while rm1 is starting.
 So I think it is safe to make a copy of the configuration when initializing 
 each RM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-06-01 Thread Chun Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Chen updated YARN-3749:

Attachment: YARN-3749.2.patch

Uploaded a new patch to fix the test cases.
Many of the previous test failures ("The HA Configuration has multiple addresses 
that match local node's address.") were because I forgot to set 
YarnConfiguration.RM_HA_ID before starting the NM.

The patch also contains 2 minor fixes: it moves reading the conf value of 
RM_SCHEDULER_ADDRESS from serviceStart to serviceInit in 
ApplicationMasterService, and it replaces the duplicated setRpcAddressForRM in 
tests with HAUtil. (A small sketch of the serviceInit change follows below.)
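
In spirit, the serviceInit change looks roughly like the sketch below (written 
against the generic Hadoop service lifecycle, not the literal 
ApplicationMasterService diff):
{code}
import java.net.InetSocketAddress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.service.AbstractService;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Sketch: resolve the scheduler address while the service is initialized,
// not when it starts, so a per-RM copied configuration is honored.
class SchedulerAddressService extends AbstractService {
  private InetSocketAddress bindAddress;

  SchedulerAddressService() {
    super("SchedulerAddressService");
  }

  @Override
  protected void serviceInit(Configuration conf) throws Exception {
    bindAddress = conf.getSocketAddr(
        YarnConfiguration.RM_SCHEDULER_ADDRESS,
        YarnConfiguration.DEFAULT_RM_SCHEDULER_ADDRESS,
        YarnConfiguration.DEFAULT_RM_SCHEDULER_PORT);
    super.serviceInit(conf);
  }

  @Override
  protected void serviceStart() throws Exception {
    // serviceStart only uses the value resolved during serviceInit.
    super.serviceStart();
  }
}
{code}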

 We should make a copy of configuration when init MiniYARNCluster with 
 multiple RMs
 --

 Key: YARN-3749
 URL: https://issues.apache.org/jira/browse/YARN-3749
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Chun Chen
Assignee: Chun Chen
 Attachments: YARN-3749.2.patch, YARN-3749.patch


 When I was trying to write a test case for YARN-2674, I found the DS client 
 trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 
 when RM failover happens. But I initially set 
 yarn.resourcemanager.address.rm1=0.0.0.0:18032 and 
 yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found that it 
 is in ClientRMService that the value of yarn.resourcemanager.address.rm2 gets 
 changed to 0.0.0.0:18032. See the following code in ClientRMService:
 {code}
 clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
     YarnConfiguration.RM_ADDRESS,
     YarnConfiguration.DEFAULT_RM_ADDRESS,
     server.getListenerAddress());
 {code}
 Since we use the same configuration instance for rm1 and rm2 and init both 
 RMs before we start either of them, yarn.resourcemanager.ha.id is changed to 
 rm2 during the init of rm2 and is still rm2 while rm1 is starting.
 So I think it is safe to make a copy of the configuration when initializing 
 each RM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-06-01 Thread Chun Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Chen updated YARN-3749:

Attachment: YARN-3749.7.patch

 We should make a copy of configuration when init MiniYARNCluster with 
 multiple RMs
 --

 Key: YARN-3749
 URL: https://issues.apache.org/jira/browse/YARN-3749
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Chun Chen
Assignee: Chun Chen
 Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, 
 YARN-3749.5.patch, YARN-3749.6.patch, YARN-3749.7.patch, YARN-3749.patch


 When I was trying to write a test case for YARN-2674, I found the DS client 
 trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 
 when RM failover happens. But I initially set 
 yarn.resourcemanager.address.rm1=0.0.0.0:18032 and 
 yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found that it 
 is in ClientRMService that the value of yarn.resourcemanager.address.rm2 gets 
 changed to 0.0.0.0:18032. See the following code in ClientRMService:
 {code}
 clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
     YarnConfiguration.RM_ADDRESS,
     YarnConfiguration.DEFAULT_RM_ADDRESS,
     server.getListenerAddress());
 {code}
 Since we use the same configuration instance for rm1 and rm2 and init both 
 RMs before we start either of them, yarn.resourcemanager.ha.id is changed to 
 rm2 during the init of rm2 and is still rm2 while rm1 is starting.
 So I think it is safe to make a copy of the configuration when initializing 
 each RM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-06-01 Thread Chun Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Chen updated YARN-3749:

Attachment: YARN-3749.4.patch

 We should make a copy of configuration when init MiniYARNCluster with 
 multiple RMs
 --

 Key: YARN-3749
 URL: https://issues.apache.org/jira/browse/YARN-3749
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Chun Chen
Assignee: Chun Chen
 Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, 
 YARN-3749.patch


 When I was trying to write a test case for YARN-2674, I found the DS client 
 trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 
 when RM failover happens. But I initially set 
 yarn.resourcemanager.address.rm1=0.0.0.0:18032 and 
 yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found that it 
 is in ClientRMService that the value of yarn.resourcemanager.address.rm2 gets 
 changed to 0.0.0.0:18032. See the following code in ClientRMService:
 {code}
 clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
     YarnConfiguration.RM_ADDRESS,
     YarnConfiguration.DEFAULT_RM_ADDRESS,
     server.getListenerAddress());
 {code}
 Since we use the same configuration instance for rm1 and rm2 and init both 
 RMs before we start either of them, yarn.resourcemanager.ha.id is changed to 
 rm2 during the init of rm2 and is still rm2 while rm1 is starting.
 So I think it is safe to make a copy of the configuration when initializing 
 each RM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-06-01 Thread Chun Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Chen updated YARN-3749:

Attachment: YARN-3749.5.patch

 We should make a copy of configuration when init MiniYARNCluster with 
 multiple RMs
 --

 Key: YARN-3749
 URL: https://issues.apache.org/jira/browse/YARN-3749
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Chun Chen
Assignee: Chun Chen
 Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, 
 YARN-3749.5.patch, YARN-3749.patch


 When I was trying to write a test case for YARN-2674, I found the DS client 
 trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 
 when RM failover happens. But I initially set 
 yarn.resourcemanager.address.rm1=0.0.0.0:18032 and 
 yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found that it 
 is in ClientRMService that the value of yarn.resourcemanager.address.rm2 gets 
 changed to 0.0.0.0:18032. See the following code in ClientRMService:
 {code}
 clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
     YarnConfiguration.RM_ADDRESS,
     YarnConfiguration.DEFAULT_RM_ADDRESS,
     server.getListenerAddress());
 {code}
 Since we use the same configuration instance for rm1 and rm2 and init both 
 RMs before we start either of them, yarn.resourcemanager.ha.id is changed to 
 rm2 during the init of rm2 and is still rm2 while rm1 is starting.
 So I think it is safe to make a copy of the configuration when initializing 
 each RM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-06-01 Thread Chun Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568521#comment-14568521
 ] 

Chun Chen commented on YARN-3749:
-

[~zxu], thanks, I agree. Uploaded YARN-3749.6.patch to address your comments.

 We should make a copy of configuration when init MiniYARNCluster with 
 multiple RMs
 --

 Key: YARN-3749
 URL: https://issues.apache.org/jira/browse/YARN-3749
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Chun Chen
Assignee: Chun Chen
 Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, 
 YARN-3749.5.patch, YARN-3749.6.patch, YARN-3749.patch


 When I was trying to write a test case for YARN-2674, I found the DS client 
 trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 
 when RM failover happens. But I initially set 
 yarn.resourcemanager.address.rm1=0.0.0.0:18032 and 
 yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found that it 
 is in ClientRMService that the value of yarn.resourcemanager.address.rm2 gets 
 changed to 0.0.0.0:18032. See the following code in ClientRMService:
 {code}
 clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
     YarnConfiguration.RM_ADDRESS,
     YarnConfiguration.DEFAULT_RM_ADDRESS,
     server.getListenerAddress());
 {code}
 Since we use the same configuration instance for rm1 and rm2 and init both 
 RMs before we start either of them, yarn.resourcemanager.ha.id is changed to 
 rm2 during the init of rm2 and is still rm2 while rm1 is starting.
 So I think it is safe to make a copy of the configuration when initializing 
 each RM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-06-01 Thread Chun Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Chen updated YARN-3749:

Attachment: YARN-3749.3.patch

 We should make a copy of configuration when init MiniYARNCluster with 
 multiple RMs
 --

 Key: YARN-3749
 URL: https://issues.apache.org/jira/browse/YARN-3749
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Chun Chen
Assignee: Chun Chen
 Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.patch


 When I was trying to write a test case for YARN-2674, I found the DS client 
 trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 
 when RM failover happens. But I initially set 
 yarn.resourcemanager.address.rm1=0.0.0.0:18032 and 
 yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found that it 
 is in ClientRMService that the value of yarn.resourcemanager.address.rm2 gets 
 changed to 0.0.0.0:18032. See the following code in ClientRMService:
 {code}
 clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
     YarnConfiguration.RM_ADDRESS,
     YarnConfiguration.DEFAULT_RM_ADDRESS,
     server.getListenerAddress());
 {code}
 Since we use the same configuration instance for rm1 and rm2 and init both 
 RMs before we start either of them, yarn.resourcemanager.ha.id is changed to 
 rm2 during the init of rm2 and is still rm2 while rm1 is starting.
 So I think it is safe to make a copy of the configuration when initializing 
 each RM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-06-01 Thread Chun Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568448#comment-14568448
 ] 

Chun Chen commented on YARN-3749:
-

Thanks for the review [~zxu] [~iwasakims].
Upload a new patch to address your comments.

bq. 1. It looks like setRpcAddressForRM and setConfForRM are only used by test 
code. Should we create a new HA test utility file to include these functions?

Moved setRpcAddressForRM and setConfForRM to HATestUtil.java

bq. 2. Do we really need the following change at MiniYARNCluster#serviceInit 
conf.set(YarnConfiguration.RM_HA_ID, rm0);

This is indeed necessary; as [~iwasakims] commented, it is used to bypass the 
check in `HAUtil#getRMHAId` that the NodeManager instance runs.

bq. 3. Is any particular reason to configure YarnConfiguration.RM_HA_ID as 
RM2_NODE_ID instead of RM1_NODE_ID in ProtocolHATestBase?

Not really, changed it to RM1_NODE_ID.

bq. I think there should be a comment explain that it is a dummy for unit test 
at least.

Added a comment in `MiniYARNCluster#serviceInit`

Also, the newly uploaded YARN-3749.4.patch only makes a copy of the 
configuration in initResourceManager when there are multiple RMs. If there is 
only one RM, many test cases in yarn-client depend on the random ports assigned 
after the RM starts.

 We should make a copy of configuration when init MiniYARNCluster with 
 multiple RMs
 --

 Key: YARN-3749
 URL: https://issues.apache.org/jira/browse/YARN-3749
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Chun Chen
Assignee: Chun Chen
 Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, 
 YARN-3749.patch


 When I was trying to write a test case for YARN-2674, I found the DS client 
 trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 
 when RM failover happens. But I initially set 
 yarn.resourcemanager.address.rm1=0.0.0.0:18032 and 
 yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found that it 
 is in ClientRMService that the value of yarn.resourcemanager.address.rm2 gets 
 changed to 0.0.0.0:18032. See the following code in ClientRMService:
 {code}
 clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
     YarnConfiguration.RM_ADDRESS,
     YarnConfiguration.DEFAULT_RM_ADDRESS,
     server.getListenerAddress());
 {code}
 Since we use the same configuration instance for rm1 and rm2 and init both 
 RMs before we start either of them, yarn.resourcemanager.ha.id is changed to 
 rm2 during the init of rm2 and is still rm2 while rm1 is starting.
 So I think it is safe to make a copy of the configuration when initializing 
 each RM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-06-01 Thread Chun Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Chen updated YARN-3749:

Attachment: YARN-3749.6.patch

 We should make a copy of configuration when init MiniYARNCluster with 
 multiple RMs
 --

 Key: YARN-3749
 URL: https://issues.apache.org/jira/browse/YARN-3749
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Chun Chen
Assignee: Chun Chen
 Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, 
 YARN-3749.5.patch, YARN-3749.6.patch, YARN-3749.patch


 When I was trying to write a test case for YARN-2674, I found the DS client 
 trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 
 when RM failover happens. But I initially set 
 yarn.resourcemanager.address.rm1=0.0.0.0:18032 and 
 yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found that it 
 is in ClientRMService that the value of yarn.resourcemanager.address.rm2 gets 
 changed to 0.0.0.0:18032. See the following code in ClientRMService:
 {code}
 clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
     YarnConfiguration.RM_ADDRESS,
     YarnConfiguration.DEFAULT_RM_ADDRESS,
     server.getListenerAddress());
 {code}
 Since we use the same configuration instance for rm1 and rm2 and init both 
 RMs before we start either of them, yarn.resourcemanager.ha.id is changed to 
 rm2 during the init of rm2 and is still rm2 while rm1 is starting.
 So I think it is safe to make a copy of the configuration when initializing 
 each RM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-06-01 Thread Chun Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568458#comment-14568458
 ] 

Chun Chen commented on YARN-3749:
-

Uploaded YARN-3749.5.patch to set {{YarnConfiguration.RM_HA_ID}} only in 
{{MiniYARNCluster#serviceInit}} and removed it from the other tests.

 We should make a copy of configuration when init MiniYARNCluster with 
 multiple RMs
 --

 Key: YARN-3749
 URL: https://issues.apache.org/jira/browse/YARN-3749
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Chun Chen
Assignee: Chun Chen
 Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, 
 YARN-3749.5.patch, YARN-3749.patch


 When I was trying to write a test case for YARN-2674, I found the DS client 
 trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 
 when RM failover happens. But I initially set 
 yarn.resourcemanager.address.rm1=0.0.0.0:18032 and 
 yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found that it 
 is in ClientRMService that the value of yarn.resourcemanager.address.rm2 gets 
 changed to 0.0.0.0:18032. See the following code in ClientRMService:
 {code}
 clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
     YarnConfiguration.RM_ADDRESS,
     YarnConfiguration.DEFAULT_RM_ADDRESS,
     server.getListenerAddress());
 {code}
 Since we use the same configuration instance for rm1 and rm2 and init both 
 RMs before we start either of them, yarn.resourcemanager.ha.id is changed to 
 rm2 during the init of rm2 and is still rm2 while rm1 is starting.
 So I think it is safe to make a copy of the configuration when initializing 
 each RM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-05-30 Thread Chun Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14566326#comment-14566326
 ] 

Chun Chen commented on YARN-3749:
-

Upload a patch to fix it.

 We should make a copy of configuration when init MiniYARNCluster with 
 multiple RMs
 --

 Key: YARN-3749
 URL: https://issues.apache.org/jira/browse/YARN-3749
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Chun Chen
Assignee: Chun Chen
 Attachments: YARN-3749.patch


 When I was trying to write a test case for YARN-2674, I found the DS client 
 trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 
 when RM failover happens. But I initially set 
 yarn.resourcemanager.address.rm1=0.0.0.0:18032 and 
 yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found that it 
 is in ClientRMService that the value of yarn.resourcemanager.address.rm2 gets 
 changed to 0.0.0.0:18032. See the following code in ClientRMService:
 {code}
 clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
     YarnConfiguration.RM_ADDRESS,
     YarnConfiguration.DEFAULT_RM_ADDRESS,
     server.getListenerAddress());
 {code}
 Since we use the same configuration instance for rm1 and rm2 and init both 
 RMs before we start either of them, yarn.resourcemanager.ha.id is changed to 
 rm2 during the init of rm2 and is still rm2 while rm1 is starting.
 So I think it is safe to make a copy of the configuration when initializing 
 each RM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2674) Distributed shell AM may re-launch containers if RM work preserving restart happens

2015-05-30 Thread Chun Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14566327#comment-14566327
 ] 

Chun Chen commented on YARN-2674:
-

Thanks for the comments, [~vinodkv]. Will upload a new patch with a test case 
after YARN-3749 is fixed.

 Distributed shell AM may re-launch containers if RM work preserving restart 
 happens
 ---

 Key: YARN-2674
 URL: https://issues.apache.org/jira/browse/YARN-2674
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Chun Chen
Assignee: Chun Chen
 Attachments: YARN-2674.1.patch, YARN-2674.2.patch


 Currently, if an RM work-preserving restart happens while distributed shell is 
 running, the distributed shell AM may re-launch all of the containers, 
 including new/running/completed ones. We must make sure it won't re-launch the 
 running/completed containers.
 We need to remove allocated containers from 
 AMRMClientImpl#remoteRequestsTable once the AM receives them from the RM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-05-30 Thread Chun Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Chen updated YARN-3749:

Summary: We should make a copy of configuration when init MiniYARNCluster 
with multiple RMs  (was: We should make a copy of config MiniYARNCluster )

 We should make a copy of configuration when init MiniYARNCluster with 
 multiple RMs
 --

 Key: YARN-3749
 URL: https://issues.apache.org/jira/browse/YARN-3749
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Chun Chen





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3749) We should make a copy of config MiniYARNCluster

2015-05-30 Thread Chun Chen (JIRA)
Chun Chen created YARN-3749:
---

 Summary: We should make a copy of config MiniYARNCluster 
 Key: YARN-3749
 URL: https://issues.apache.org/jira/browse/YARN-3749
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Chun Chen






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-05-30 Thread Chun Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Chen updated YARN-3749:

Description: 
When I was trying to write a test case for YARN-2674, I found the DS client 
trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 when 
RM failover happens. But I initially set 
yarn.resourcemanager.address.rm1=0.0.0.0:18032 and 
yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found that it 
is in ClientRMService that the value of yarn.resourcemanager.address.rm2 gets 
changed to 0.0.0.0:18032. See the following code in ClientRMService:
{code}
clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
    YarnConfiguration.RM_ADDRESS,
    YarnConfiguration.DEFAULT_RM_ADDRESS,
    server.getListenerAddress());
{code}

Since we use the same configuration instance for rm1 and rm2 and init both RMs 
before we start either of them, yarn.resourcemanager.ha.id is changed to rm2 
during the init of rm2 and is still rm2 while rm1 is starting.
So I think it is safe to make a copy of the configuration when initializing 
each RM.

 We should make a copy of configuration when init MiniYARNCluster with 
 multiple RMs
 --

 Key: YARN-3749
 URL: https://issues.apache.org/jira/browse/YARN-3749
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Chun Chen

 When I was trying to write a test case for YARN-2674, I found the DS client 
 trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 
 when RM failover happens. But I initially set 
 yarn.resourcemanager.address.rm1=0.0.0.0:18032 and 
 yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found that it 
 is in ClientRMService that the value of yarn.resourcemanager.address.rm2 gets 
 changed to 0.0.0.0:18032. See the following code in ClientRMService:
 {code}
 clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
     YarnConfiguration.RM_ADDRESS,
     YarnConfiguration.DEFAULT_RM_ADDRESS,
     server.getListenerAddress());
 {code}
 Since we use the same configuration instance for rm1 and rm2 and init both 
 RMs before we start either of them, yarn.resourcemanager.ha.id is changed to 
 rm2 during the init of rm2 and is still rm2 while rm1 is starting.
 So I think it is safe to make a copy of the configuration when initializing 
 each RM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-05-30 Thread Chun Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Chen reassigned YARN-3749:
---

Assignee: Chun Chen

 We should make a copy of configuration when init MiniYARNCluster with 
 multiple RMs
 --

 Key: YARN-3749
 URL: https://issues.apache.org/jira/browse/YARN-3749
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Chun Chen
Assignee: Chun Chen

 When I was trying to write a test case for YARN-2674, I found the DS client 
 trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 
 when RM failover happens. But I initially set 
 yarn.resourcemanager.address.rm1=0.0.0.0:18032 and 
 yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found that it 
 is in ClientRMService that the value of yarn.resourcemanager.address.rm2 gets 
 changed to 0.0.0.0:18032. See the following code in ClientRMService:
 {code}
 clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
     YarnConfiguration.RM_ADDRESS,
     YarnConfiguration.DEFAULT_RM_ADDRESS,
     server.getListenerAddress());
 {code}
 Since we use the same configuration instance for rm1 and rm2 and init both 
 RMs before we start either of them, yarn.resourcemanager.ha.id is changed to 
 rm2 during the init of rm2 and is still rm2 while rm1 is starting.
 So I think it is safe to make a copy of the configuration when initializing 
 each RM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs

2015-05-30 Thread Chun Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Chen updated YARN-3749:

Attachment: YARN-3749.patch

 We should make a copy of configuration when init MiniYARNCluster with 
 multiple RMs
 --

 Key: YARN-3749
 URL: https://issues.apache.org/jira/browse/YARN-3749
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Chun Chen
Assignee: Chun Chen
 Attachments: YARN-3749.patch


 When I was trying to write a test case for YARN-2674, I found the DS client 
 trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 
 when RM failover happens. But I initially set 
 yarn.resourcemanager.address.rm1=0.0.0.0:18032 and 
 yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found that it 
 is in ClientRMService that the value of yarn.resourcemanager.address.rm2 gets 
 changed to 0.0.0.0:18032. See the following code in ClientRMService:
 {code}
 clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
     YarnConfiguration.RM_ADDRESS,
     YarnConfiguration.DEFAULT_RM_ADDRESS,
     server.getListenerAddress());
 {code}
 Since we use the same configuration instance for rm1 and rm2 and init both 
 RMs before we start either of them, yarn.resourcemanager.ha.id is changed to 
 rm2 during the init of rm2 and is still rm2 while rm1 is starting.
 So I think it is safe to make a copy of the configuration when initializing 
 each RM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3080) The DockerContainerExecutor could not write the right pid to container pidFile

2015-02-28 Thread Chun Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341427#comment-14341427
 ] 

Chun Chen commented on YARN-3080:
-

[~ashahab], I think we can simply fix this by using the pid of the session 
script bash process instead, since docker run will block until it exits. If the 
docker container exits, the session script bash process will exit immediately. 
As for signalContainer, we can use docker kill --signal=SIGNAL containerId 
instead.
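
A rough sketch of the signalling idea (illustrative only, not the 
DockerContainerExecutor implementation): shell out to docker kill instead of 
signalling a host pid.
{code}
import java.io.IOException;

import org.apache.hadoop.util.Shell.ShellCommandExecutor;

// Sketch only; class and method names are hypothetical.
class DockerSignal {
  static void signalContainer(String containerName, String signal)
      throws IOException {
    // e.g. signal = "SIGTERM" or "SIGKILL"
    ShellCommandExecutor exec = new ShellCommandExecutor(new String[] {
        "docker", "kill", "--signal=" + signal, containerName});
    exec.execute();
  }
}
{code}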

 The DockerContainerExecutor could not write the right pid to container pidFile
 --

 Key: YARN-3080
 URL: https://issues.apache.org/jira/browse/YARN-3080
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Beckham007
Assignee: Abin Shahab
 Attachments: YARN-3080.patch, YARN-3080.patch, YARN-3080.patch, 
 YARN-3080.patch


 The docker_container_executor_session.sh is like this:
 {quote}
 #!/usr/bin/env bash
 echo `/usr/bin/docker inspect --format {{.State.Pid}} 
 container_1421723685222_0008_01_02` > 
 /data/nm_restart/hadoop-2.4.1/data/yarn/local/nmPrivate/application_1421723685222_0008/container_1421723685222_0008_01_02/container_1421723685222_0008_01_02.pid.tmp
 /bin/mv -f 
 /data/nm_restart/hadoop-2.4.1/data/yarn/local/nmPrivate/application_1421723685222_0008/container_1421723685222_0008_01_02/container_1421723685222_0008_01_02.pid.tmp 
 /data/nm_restart/hadoop-2.4.1/data/yarn/local/nmPrivate/application_1421723685222_0008/container_1421723685222_0008_01_02/container_1421723685222_0008_01_02.pid
 /usr/bin/docker run --rm  --name container_1421723685222_0008_01_02 -e 
 GAIA_HOST_IP=c162 -e GAIA_API_SERVER=10.6.207.226:8080 -e 
 GAIA_CLUSTER_ID=shpc-nm_restart -e GAIA_QUEUE=root.tdwadmin -e 
 GAIA_APP_NAME=test_nm_docker -e GAIA_INSTANCE_ID=1 -e 
 GAIA_CONTAINER_ID=container_1421723685222_0008_01_02 --memory=32M 
 --cpu-shares=1024 -v 
 /data/nm_restart/hadoop-2.4.1/data/yarn/container-logs/application_1421723685222_0008/container_1421723685222_0008_01_02:/data/nm_restart/hadoop-2.4.1/data/yarn/container-logs/application_1421723685222_0008/container_1421723685222_0008_01_02
  -v 
 /data/nm_restart/hadoop-2.4.1/data/yarn/local/usercache/tdwadmin/appcache/application_1421723685222_0008/container_1421723685222_0008_01_02:/data/nm_restart/hadoop-2.4.1/data/yarn/local/usercache/tdwadmin/appcache/application_1421723685222_0008/container_1421723685222_0008_01_02
  -P -e A=B --privileged=true docker.oa.com:8080/library/centos7 bash 
 /data/nm_restart/hadoop-2.4.1/data/yarn/local/usercache/tdwadmin/appcache/application_1421723685222_0008/container_1421723685222_0008_01_02/launch_container.sh
 {quote}
 The DockerContainerExecutor runs docker inspect before docker run, so docker 
 inspect cannot get the right pid for the docker container, and 
 signalContainer() and NM restart would fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1983) Support heterogeneous container types at runtime on YARN

2015-02-05 Thread Chun Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14307247#comment-14307247
 ] 

Chun Chen commented on YARN-1983:
-

Thanks for the comments, [~vinodkv] [~chris.douglas].
Do you mean we can simply use the NM_DOCKER_CONTAINER_EXECUTOR_IMAGE_NAME env to 
identify container types? IMHO, users can currently implement a custom 
${yarn.nodemanager.container-executor.class} for their own scenarios, but what 
if they want to use both LinuxCE and their own CE at runtime? I think 
NM_DOCKER_CONTAINER_EXECUTOR_IMAGE_NAME is not enough to distinguish these 
different container types. 
Besides, based on my proposal, if users don't want to use DockerCE, they don't 
need to change any configuration.
I think my current patch is indeed intrusive but more general, right?

 Support heterogeneous container types at runtime on YARN
 

 Key: YARN-1983
 URL: https://issues.apache.org/jira/browse/YARN-1983
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Junping Du
 Attachments: YARN-1983.2.patch, YARN-1983.patch


 Different container types (default, LXC, docker, VM box, etc.) have different 
 semantics for isolation of security, namespace/env, performance, etc.
 Per the discussions in YARN-1964, we have some good thoughts on supporting 
 different types of containers running on YARN, specified by the application at 
 runtime, which would largely enhance YARN's flexibility to meet heterogeneous 
 apps' isolation requirements at runtime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1983) Support heterogeneous container types at runtime on YARN

2015-01-30 Thread Chun Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Chen updated YARN-1983:

Attachment: YARN-1983.2.patch

Update the patch to rewrite unit tests.

 Support heterogeneous container types at runtime on YARN
 

 Key: YARN-1983
 URL: https://issues.apache.org/jira/browse/YARN-1983
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Junping Du
 Attachments: YARN-1983.2.patch, YARN-1983.patch


 Different container types (default, LXC, docker, VM box, etc.) have different 
 semantics for isolation of security, namespace/env, performance, etc.
 Per the discussions in YARN-1964, we have some good thoughts on supporting 
 different types of containers running on YARN, specified by the application at 
 runtime, which would largely enhance YARN's flexibility to meet heterogeneous 
 apps' isolation requirements at runtime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3077) RM should create yarn.resourcemanager.zk-state-store.parent-path recursively

2015-01-28 Thread Chun Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14296263#comment-14296263
 ] 

Chun Chen commented on YARN-3077:
-

[~ozawa], OK, uploaded a new patch that updates the previous one and changes 
the name to be self-explanatory.

 RM should create yarn.resourcemanager.zk-state-store.parent-path recursively
 

 Key: YARN-3077
 URL: https://issues.apache.org/jira/browse/YARN-3077
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: Chun Chen
 Attachments: YARN-3077.2.patch, YARN-3077.3.patch, YARN-3077.patch


 If multiple clusters share a zookeeper cluster, users might use 
 /rmstore/${yarn.resourcemanager.cluster-id} as the state store path. If the 
 user specifies a custom value which is not a top-level path for 
 ${yarn.resourcemanager.zk-state-store.parent-path}, YARN should create the 
 parent path first.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3077) RM should create yarn.resourcemanager.zk-state-store.parent-path recursively

2015-01-28 Thread Chun Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Chen updated YARN-3077:

Attachment: YARN-3077.3.patch

 RM should create yarn.resourcemanager.zk-state-store.parent-path recursively
 

 Key: YARN-3077
 URL: https://issues.apache.org/jira/browse/YARN-3077
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: Chun Chen
 Attachments: YARN-3077.2.patch, YARN-3077.3.patch, YARN-3077.patch


 If multiple clusters share a zookeeper cluster, users might use 
 /rmstore/${yarn.resourcemanager.cluster-id} as the state store path. If the 
 user specifies a custom value which is not a top-level path for 
 ${yarn.resourcemanager.zk-state-store.parent-path}, YARN should create the 
 parent path first.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2718) Create a CompositeConatainerExecutor that combines DockerContainerExecutor and DefaultContainerExecutor

2015-01-26 Thread Chun Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14293050#comment-14293050
 ] 

Chun Chen commented on YARN-2718:
-

Yes, my mistake. Thanks for pointing that out. Attached a fixed patch on YARN-1983.

 Create a CompositeConatainerExecutor that combines DockerContainerExecutor 
 and DefaultContainerExecutor
 ---

 Key: YARN-2718
 URL: https://issues.apache.org/jira/browse/YARN-2718
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Abin Shahab
 Attachments: YARN-2718.patch


 There should be a composite container executor that allows users to run their 
 jobs with DockerContainerExecutor, but switch to DefaultContainerExecutor for 
 debugging purposes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1983) Support heterogeneous container types at runtime on YARN

2015-01-26 Thread Chun Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Chen updated YARN-1983:

Attachment: YARN-1983.patch

 Support heterogeneous container types at runtime on YARN
 

 Key: YARN-1983
 URL: https://issues.apache.org/jira/browse/YARN-1983
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Junping Du
 Attachments: YARN-1983.patch


 Different container types (default, LXC, docker, VM box, etc.) have different 
 semantics for isolation of security, namespace/env, performance, etc.
 Per the discussions in YARN-1964, we have some good thoughts on supporting 
 different types of containers running on YARN, specified by the application at 
 runtime, which would largely enhance YARN's flexibility to meet heterogeneous 
 apps' isolation requirements at runtime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2992) ZKRMStateStore crashes due to session expiry

2015-01-26 Thread Chun Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292937#comment-14292937
 ] 

Chun Chen commented on YARN-2992:
-

[~kasha] [~rohithsharma] [~jianhe], we are constantly facing the following error
RM log
{code}
2015-01-27 00:13:19,379 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server 10.196.128.13/10.196.128.13:2181. Will not attempt to 
authenticate using SASL (unknown erro
r)
2015-01-27 00:13:19,383 INFO org.apache.zookeeper.ClientCnxn: Socket connection 
established to 10.196.128.13/10.196.128.13:2181, initiating session
2015-01-27 00:13:19,404 INFO org.apache.zookeeper.ClientCnxn: Session 
establishment complete on server 10.196.128.13/10.196.128.13:2181, sessionid = 
0x24ab193421e4812, negotiated timeout = 
1
2015-01-27 00:13:19,417 WARN org.apache.zookeeper.ClientCnxn: Session 
0x24ab193421e4812 for server 10.196.128.13/10.196.128.13:2181, unexpected 
error, closing socket connection and attempti
ng reconnect
java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:65)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:470)
at 
org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
at 
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
2015-01-27 00:13:19,517 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = 
ConnectionLoss
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931)
at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:895)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:892)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1031)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1050)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:898)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.access$600(ZKRMStateStore.java:82)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$VerifyActiveStatusThread.run(ZKRMStateStore.java:1003)
2015-01-27 00:13:19,518 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Retrying 
operation on ZK. Retry no. 934
{code}

ZK log
{code}
2015-01-27 00:13:19,300 [myid:1] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted 
socket connection from /10.240.92.100:46464
2015-01-27 00:13:19,302 [myid:1] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@861] - Client 
attempting to renew session 0x24ab193421e4812 at /10.240.92.100:46464
2015-01-27 00:13:19,302 [myid:1] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@108] - Revalidating client: 
0x24ab193421e4812
2015-01-27 00:13:19,303 [myid:1] - INFO  
[QuorumPeer[myid=1]/0.0.0.0:2181:ZooKeeperServer@617] - Established session 
0x24ab193421e4812 with negotiated timeout 1 for client /10.240.92.100:46464
2015-01-27 00:13:19,303 [myid:1] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@892] - got auth 
packet /10.240.92.100:46464
2015-01-27 00:13:19,303 [myid:1] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@926] - auth success 
/10.240.92.100:46464
2015-01-27 00:13:19,320 [myid:1] - WARN  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception 
causing close of session 0x24ab193421e4812 due to java.io.IOException: Len 
error 1425415
2015-01-27 00:13:19,321 [myid:1] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket 
connection for client /10.240.92.100:46464 which had sessionid 0x24ab193421e4812
2015-01-27 00:13:23,093 [myid:1] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted 
socket connection from /10.240.92.100:46477
2015-01-27 00:13:23,159 [myid:1] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@861] - Client 
attempting to renew session 0x24ab193421e4812 at /10.240.92.100:46477
2015-01-27 00:13:23,159 [myid:1] - INFO  
{code}

[jira] [Commented] (YARN-2718) Create a CompositeConatainerExecutor that combines DockerContainerExecutor and DefaultContainerExecutor

2015-01-26 Thread Chun Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292903#comment-14292903
 ] 

Chun Chen commented on YARN-2718:
-

Well, the patch I've uploaded mainly focuses on making apps running on the same 
YARN cluster able to specify different container executors. YARN currently 
only supports using a single container executor to launch containers. As 
[~guoleitao] said, we want to run both MapReduce jobs and Docker containers on 
the same cluster. I think maybe it's better for me to upload the patch on YARN-1983.
As for debugging Docker containers, we implemented a service registry feature 
that registers the host IP and ports of the running containers in a 
highly available key-value store (etcd) and uses a web shell for debugging. 

 Create a CompositeConatainerExecutor that combines DockerContainerExecutor 
 and DefaultContainerExecutor
 ---

 Key: YARN-2718
 URL: https://issues.apache.org/jira/browse/YARN-2718
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Abin Shahab
 Attachments: YARN-2718.patch


 There should be a composite container that allows users to run their jobs in 
 DockerContainerExecutor, but switch to DefaultContainerExecutor for debugging 
 purposes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1983) Support heterogeneous container types at runtime on YARN

2015-01-26 Thread Chun Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14293060#comment-14293060
 ] 

Chun Chen commented on YARN-1983:
-

Attached a patch that creates a CompositeContainerExecutor to implement this. 
The patch allows apps to specify the container executor class in 
ContainerLaunchContext. It also changes 
${yarn.nodemanager.container-executor.class} to accept a comma-separated list 
of container executor classes and adds a new configuration, 
${yarn.nodemanager.default.container-executor.class}, the default container 
executor used to launch containers that are submitted without specifying a 
container executor.
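
To make the dispatch idea concrete, here is a minimal, self-contained sketch; SimpleExecutor, CompositeExecutorDemo, and the EXECUTOR_CLASS environment key are hypothetical stand-ins, not the real ContainerExecutor API, and the actual patch may wire this differently.

{code}
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a container executor; not the YARN API. */
interface SimpleExecutor {
  int launch(String containerId, Map<String, String> env);
}

/** Dispatches each container to the executor named in its launch environment. */
public class CompositeExecutorDemo implements SimpleExecutor {
  private final Map<String, SimpleExecutor> executors = new HashMap<>();
  private final SimpleExecutor defaultExecutor;

  CompositeExecutorDemo(Map<String, SimpleExecutor> executors,
                        String defaultExecutorName) {
    this.executors.putAll(executors);
    // plays the role of ${yarn.nodemanager.default.container-executor.class}
    this.defaultExecutor = executors.get(defaultExecutorName);
  }

  @Override
  public int launch(String containerId, Map<String, String> env) {
    // "EXECUTOR_CLASS" is an illustrative key the app would set in its
    // ContainerLaunchContext environment; the real patch may use another name.
    SimpleExecutor chosen =
        executors.getOrDefault(env.get("EXECUTOR_CLASS"), defaultExecutor);
    return chosen.launch(containerId, env);
  }
}
{code}

The composite simply owns one instance per configured executor class and forwards every call, falling back to the default when the app does not ask for a specific one.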

 Support heterogeneous container types at runtime on YARN
 

 Key: YARN-1983
 URL: https://issues.apache.org/jira/browse/YARN-1983
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Junping Du
 Attachments: YARN-1983.patch


 Different container types (default, LXC, Docker, VM box, etc.) have different 
 semantics for isolation of security, namespace/env, performance, etc.
 Per discussions in YARN-1964, we have some good thoughts on supporting 
 different types of containers running on YARN, specified by the application at 
 runtime, which would largely enhance YARN's flexibility to meet heterogeneous 
 apps' isolation requirements at runtime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3077) RM should create yarn.resourcemanager.zk-state-store.parent-path recursively

2015-01-23 Thread Chun Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290363#comment-14290363
 ] 

Chun Chen commented on YARN-3077:
-

Thanks for reviewing the patch, [~jianhe]. Uploaded a new patch addressing your 
comments.

 RM should create yarn.resourcemanager.zk-state-store.parent-path recursively
 

 Key: YARN-3077
 URL: https://issues.apache.org/jira/browse/YARN-3077
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: Chun Chen
 Attachments: YARN-3077.2.patch, YARN-3077.patch


 If multiple clusters share a ZooKeeper cluster, users might use 
 /rmstore/${yarn.resourcemanager.cluster-id} as the state store path. If a user 
 specifies a custom value that is not a top-level path for 
 ${yarn.resourcemanager.zk-state-store.parent-path}, YARN should create the parent 
 path first.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3077) RM should create yarn.resourcemanager.zk-state-store.parent-path recursively

2015-01-23 Thread Chun Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Chen updated YARN-3077:

Attachment: YARN-3077.2.patch

 RM should create yarn.resourcemanager.zk-state-store.parent-path recursively
 

 Key: YARN-3077
 URL: https://issues.apache.org/jira/browse/YARN-3077
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: Chun Chen
 Attachments: YARN-3077.2.patch, YARN-3077.patch


 If multiple clusters share a ZooKeeper cluster, users might use 
 /rmstore/${yarn.resourcemanager.cluster-id} as the state store path. If a user 
 specifies a custom value that is not a top-level path for 
 ${yarn.resourcemanager.zk-state-store.parent-path}, YARN should create the parent 
 path first.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3094) reset timer for liveness monitors after RM recovery

2015-01-23 Thread Chun Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290395#comment-14290395
 ] 

Chun Chen commented on YARN-3094:
-

Since the RM can't receive pings from the AM until ApplicationMasterService starts, I 
think it is more accurate to reset the timer in the AMLivelinessMonitor service after 
ApplicationMasterService starts. I suggest initializing the AMLivelinessMonitor service 
after ApplicationMasterService in RMActiveServices#serviceInit.
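
To illustrate why the registration order matters: RMActiveServices is a CompositeService, and a CompositeService inits and starts its children in the order they were added, so adding the AMLivelinessMonitor after ApplicationMasterService means its timers only begin once the RM can actually receive AM pings. A minimal, self-contained sketch with dummy services (not the real RM classes):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.service.AbstractService;
import org.apache.hadoop.service.CompositeService;

public class StartOrderDemo extends CompositeService {

  /** Dummy service that just logs when it starts. */
  static class Noisy extends AbstractService {
    Noisy(String name) { super(name); }
    @Override
    protected void serviceStart() throws Exception {
      System.out.println("started: " + getName());
      super.serviceStart();
    }
  }

  public StartOrderDemo() {
    super("StartOrderDemo");
    // Children start in registration order, so the "monitor" only begins
    // after the "master service" is already up.
    addService(new Noisy("application-master-service"));
    addService(new Noisy("am-liveliness-monitor"));
  }

  public static void main(String[] args) {
    StartOrderDemo demo = new StartOrderDemo();
    demo.init(new Configuration());
    demo.start();  // prints application-master-service first, then the monitor
    demo.stop();
  }
}
{code}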

 reset timer for liveness monitors after RM recovery
 ---

 Key: YARN-3094
 URL: https://issues.apache.org/jira/browse/YARN-3094
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Jun Gong
Assignee: Jun Gong
 Attachments: YARN-3094.patch


 When the RM restarts, it will recover RMAppAttempts and register them with the 
 AMLivelinessMonitor if they are not in a final state. AMs will time out in the RM if 
 the recovery process takes a long time for some reason (e.g. too many apps). 
 In our system, we found the recovery process took about 3 minutes, and all AMs 
 timed out.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3077) RM should create yarn.resourcemanager.zk-state-store.parent-path recursively

2015-01-21 Thread Chun Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285419#comment-14285419
 ] 

Chun Chen commented on YARN-3077:
-

The failed tests pass on my own laptop.

 RM should create yarn.resourcemanager.zk-state-store.parent-path recursively
 

 Key: YARN-3077
 URL: https://issues.apache.org/jira/browse/YARN-3077
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: Chun Chen
 Attachments: YARN-3077.patch


 If multiple clusters share a ZooKeeper cluster, users might use 
 /rmstore/${yarn.resourcemanager.cluster-id} as the state store path. If a user 
 specifies a custom value that is not a top-level path for 
 ${yarn.resourcemanager.zk-state-store.parent-path}, YARN should create the parent 
 path first.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2466) Umbrella issue for Yarn launched Docker Containers

2015-01-21 Thread Chun Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285434#comment-14285434
 ] 

Chun Chen commented on YARN-2466:
-

I think this feature is currently only an alpha version and the author 
[~ashahab] does not seem active on it right now. I uploaded a patch that creates a 
CompositeContainerExecutor to allow running different types of containers with 
different container executors at the same time. See 
https://issues.apache.org/jira/browse/YARN-2718.

 Umbrella issue for Yarn launched Docker Containers
 --

 Key: YARN-2466
 URL: https://issues.apache.org/jira/browse/YARN-2466
 Project: Hadoop YARN
  Issue Type: New Feature
Affects Versions: 2.4.1
Reporter: Abin Shahab
Assignee: Abin Shahab

 Docker (https://www.docker.io/) is, increasingly, a very popular container 
 technology.
 In context of YARN, the support for Docker will provide a very elegant 
 solution to allow applications to package their software into a Docker 
 container (entire Linux file system incl. custom versions of perl, python 
 etc.) and use it as a blueprint to launch all their YARN containers with 
 requisite software environment. This provides both consistency (all YARN 
 containers will have the same software environment) and isolation (no 
 interference with whatever is installed on the physical machine).
 In addition to software isolation mentioned above, Docker containers will 
 provide resource, network, and user-namespace isolation. 
 Docker provides resource isolation through cgroups, similar to 
 LinuxContainerExecutor. This prevents one job from taking other jobs 
 resource(memory and CPU) on the same hadoop cluster. 
 User-namespace isolation will ensure that root in the container is mapped to 
 an unprivileged user on the host. This is currently being added to Docker.
 Network isolation will ensure that one user’s network traffic is completely 
 isolated from another user’s network traffic. 
 Last but not least, the interaction of Docker and Kerberos will have to 
 be worked out. These Docker containers must work in a secure Hadoop 
 environment.
 Additional details are here: 
 https://wiki.apache.org/hadoop/dineshs/IsolatingYarnAppsInDockerContainers



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2718) Create a CompositeConatainerExecutor that combines DockerContainerExecutor and DefaultContainerExecutor

2015-01-21 Thread Chun Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Chen updated YARN-2718:

Attachment: YARN-2718.patch

Uploaded my patch to share my thoughts on creating a 
CompositeContainerExecutor that enables running multiple types of containers with 
different container executors at the same time, with no need to switch 
container executors.

 Create a CompositeConatainerExecutor that combines DockerContainerExecutor 
 and DefaultContainerExecutor
 ---

 Key: YARN-2718
 URL: https://issues.apache.org/jira/browse/YARN-2718
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Abin Shahab
 Attachments: YARN-2718.patch


 There should be a composite container that allows users to run their jobs in 
 DockerContainerExecutor, but switch to DefaultContainerExecutor for debugging 
 purposes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3077) RM should create yarn.resourcemanager.zk-state-store.parent-path recursively

2015-01-20 Thread Chun Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285174#comment-14285174
 ] 

Chun Chen commented on YARN-3077:
-

Thanks. :)

 RM should create yarn.resourcemanager.zk-state-store.parent-path recursively
 

 Key: YARN-3077
 URL: https://issues.apache.org/jira/browse/YARN-3077
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: Chun Chen

 If multiple clusters share a ZooKeeper cluster, users might use 
 /rmstore/${yarn.resourcemanager.cluster-id} as the state store path. If a user 
 specifies a custom value that is not a top-level path for 
 ${yarn.resourcemanager.zk-state-store.parent-path}, YARN should create the parent 
 path first.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3077) RM should create yarn.resourcemanager.zk-state-store.parent-path recursively

2015-01-20 Thread Chun Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285161#comment-14285161
 ] 

Chun Chen commented on YARN-3077:
-

[~varun_saxena] I would like to do it myself.

 RM should create yarn.resourcemanager.zk-state-store.parent-path recursively
 

 Key: YARN-3077
 URL: https://issues.apache.org/jira/browse/YARN-3077
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: Chun Chen
Assignee: Varun Saxena

 If multiple clusters share a ZooKeeper cluster, users might use 
 /rmstore/${yarn.resourcemanager.cluster-id} as the state store path. If a user 
 specifies a custom value that is not a top-level path for 
 ${yarn.resourcemanager.zk-state-store.parent-path}, YARN should create the parent 
 path first.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3077) RM should create yarn.resourcemanager.zk-state-store.parent-path recursively

2015-01-20 Thread Chun Chen (JIRA)
Chun Chen created YARN-3077:
---

 Summary: RM should create 
yarn.resourcemanager.zk-state-store.parent-path recursively
 Key: YARN-3077
 URL: https://issues.apache.org/jira/browse/YARN-3077
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: Chun Chen


If multiple clusters share a ZooKeeper cluster, users might use 
/rmstore/${yarn.resourcemanager.cluster-id} as the state store path. If a user 
specifies a custom value that is not a top-level path for 
${yarn.resourcemanager.zk-state-store.parent-path}, YARN should create the parent 
path first.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3077) RM should create yarn.resourcemanager.zk-state-store.parent-path recursively

2015-01-20 Thread Chun Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Chen updated YARN-3077:

Attachment: YARN-3077.patch

 RM should create yarn.resourcemanager.zk-state-store.parent-path recursively
 

 Key: YARN-3077
 URL: https://issues.apache.org/jira/browse/YARN-3077
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: Chun Chen
 Attachments: YARN-3077.patch


 If multiple clusters share a ZooKeeper cluster, users might use 
 /rmstore/${yarn.resourcemanager.cluster-id} as the state store path. If a user 
 specifies a custom value that is not a top-level path for 
 ${yarn.resourcemanager.zk-state-store.parent-path}, YARN should create the parent 
 path first.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2262) Few fields displaying wrong values in Timeline server after RM restart

2015-01-19 Thread Chun Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14282328#comment-14282328
 ] 

Chun Chen commented on YARN-2262:
-

I think the second exception is because TFile is immutable and 
FileSystemApplicationHistoryStore uses TFile as the underlying storage layer.

 Few fields displaying wrong values in Timeline server after RM restart
 --

 Key: YARN-2262
 URL: https://issues.apache.org/jira/browse/YARN-2262
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Affects Versions: 2.4.0
Reporter: Nishan Shetty
Assignee: Naganarasimha G R
 Attachments: Capture.PNG, Capture1.PNG, 
 yarn-testos-historyserver-HOST-10-18-40-95.log, 
 yarn-testos-resourcemanager-HOST-10-18-40-84.log, 
 yarn-testos-resourcemanager-HOST-10-18-40-95.log


 Few fields displaying wrong values in Timeline server after RM restart
 State:null
 FinalStatus:  UNDEFINED
 Started:  8-Jul-2014 14:58:08
 Elapsed:  2562047397789hrs, 44mins, 47sec 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2674) Distributed shell AM may re-launch containers if RM work preserving restart happens

2014-12-02 Thread Chun Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Chen updated YARN-2674:

Attachment: YARN-2674.2.patch

 Distributed shell AM may re-launch containers if RM work preserving restart 
 happens
 ---

 Key: YARN-2674
 URL: https://issues.apache.org/jira/browse/YARN-2674
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Chun Chen
 Attachments: YARN-2674.1.patch, YARN-2674.2.patch


 Currently, if an RM work-preserving restart happens while distributed shell is 
 running, the distributed shell AM may re-launch all the containers, including 
 new/running/completed ones. We must make sure it won't re-launch the 
 running/completed containers.
 We need to remove allocated containers from 
 AMRMClientImpl#remoteRequestsTable once the AM receives them from the RM. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2674) Distributed shell AM may re-launch containers if RM work preserving restart happens

2014-12-02 Thread Chun Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14231719#comment-14231719
 ] 

Chun Chen commented on YARN-2674:
-

Thanks for the review, [~jianhe]. Uploaded a new patch addressing your comments. 

Looking at the patch again, I think other applications using AMRMClientImpl 
might have the same issue if they don't explicitly call removeContainerRequest. 
IMHO, it would be better if we could fix the issue within AMRMClientImpl. Any thoughts?
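
Until such a fix lands in AMRMClientImpl, below is a minimal sketch of the explicit bookkeeping an AM can do with the public AMRMClient API: remove the matching ContainerRequest as soon as a container is allocated, so the request is not re-sent (and re-satisfied) after an RM restart. The matching here is deliberately simplified (first match at ANY locality); real AMs may need stricter matching.

{code}
import java.util.Collection;
import java.util.List;

import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ResourceRequest;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class AllocateLoopSketch {

  static void heartbeatOnce(AMRMClient<ContainerRequest> amRMClient) throws Exception {
    AllocateResponse response = amRMClient.allocate(0.1f);

    for (Container allocated : response.getAllocatedContainers()) {
      // Find an outstanding request this container satisfies and remove it,
      // so the request table is not replayed against a restarted RM.
      List<? extends Collection<ContainerRequest>> matching =
          amRMClient.getMatchingRequests(allocated.getPriority(),
              ResourceRequest.ANY, allocated.getResource());
      if (!matching.isEmpty() && !matching.get(0).isEmpty()) {
        ContainerRequest satisfied = matching.get(0).iterator().next();
        amRMClient.removeContainerRequest(satisfied);
      }
      // ... hand the container off to the NM client / launch logic ...
    }
  }
}
{code}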

 Distributed shell AM may re-launch containers if RM work preserving restart 
 happens
 ---

 Key: YARN-2674
 URL: https://issues.apache.org/jira/browse/YARN-2674
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Chun Chen
 Attachments: YARN-2674.1.patch, YARN-2674.2.patch


 Currently, if an RM work-preserving restart happens while distributed shell is 
 running, the distributed shell AM may re-launch all the containers, including 
 new/running/completed ones. We must make sure it won't re-launch the 
 running/completed containers.
 We need to remove allocated containers from 
 AMRMClientImpl#remoteRequestsTable once the AM receives them from the RM. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2718) Create a CompositeConatainerExecutor that combines DockerContainerExecutor and DefaultContainerExecutor

2014-11-17 Thread Chun Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14214690#comment-14214690
 ] 

Chun Chen commented on YARN-2718:
-

[~ashahab], if you don't mind, can I work on this? I have already implemented 
it on our branch.

 Create a CompositeConatainerExecutor that combines DockerContainerExecutor 
 and DefaultContainerExecutor
 ---

 Key: YARN-2718
 URL: https://issues.apache.org/jira/browse/YARN-2718
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Abin Shahab

 There should be a composite container that allows users to run their jobs in 
 DockerContainerExecutor, but switch to DefaultContainerExecutor for debugging 
 purposes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1964) Create Docker analog of the LinuxContainerExecutor in YARN

2014-11-06 Thread Chun Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201527#comment-14201527
 ] 

Chun Chen commented on YARN-1964:
-

Hi [~ashahab], thanks for the patch. I'm also working on this and we are 
running Docker containers on YARN now. Based on our experience, here are my 
comments (a sketch illustrating points 1-3 follows the quoted code below):
1. Since yarn.nodemanager.docker-container-executor.image-name is an 
application-specified container launch environment argument that would be 
exported in launch_container.sh, and bash doesn't allow dot-separated 
environment variable names, you might need to change it to 
ApplicationConstants.Environment.DOCKER_IMAGE_NAME.
2. Remove "--net=host" from the docker run command, because the cluster 
administrator might not want Docker containers to use the host network directly. 
3. Define a new environment variable, maybe 
ApplicationConstants.Environment.DOCKER_RUN_ARGS, to allow the application to 
specify custom options for the docker run command such as -P, -e, etc. This way, 
we can also specify "--net=host" from the application side.
4. Remove localDirMount from the docker run command; there is no need to mount 
the local dir.
5. Use ApplicationConstants.Environment.HADOOP_YARN_HOME instead of the 
"HADOOP_YARN_HOME" string literal:
{code}
exclusionSet.add(HADOOP_YARN_HOME);
exclusionSet.add(HADOOP_COMMON_HOME);
exclusionSet.add(HADOOP_HDFS_HOME);
exclusionSet.add(HADOOP_COMMON_HOME);
exclusionSet.add(JAVA_HOME);
{code}
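
To make points 1-3 concrete, here is a self-contained sketch of assembling the docker run command from the container's launch environment rather than from a dot-separated config key. DOCKER_IMAGE_NAME and DOCKER_RUN_ARGS are the proposed (not yet existing) environment constants, and the command layout is illustrative, not the actual executor code.

{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class DockerRunCommandSketch {

  /** Builds a docker run command from the container's launch environment. */
  static List<String> buildCommand(String containerId, Map<String, String> env) {
    // Proposed env constants, exported by the app in its ContainerLaunchContext.
    String image = env.get("DOCKER_IMAGE_NAME");
    String extraArgs = env.getOrDefault("DOCKER_RUN_ARGS", "");

    List<String> cmd = new ArrayList<>(Arrays.asList(
        "docker", "run", "--rm", "--name", containerId));
    // No --net=host by default; the app can opt in through DOCKER_RUN_ARGS,
    // e.g. DOCKER_RUN_ARGS="--net=host -P -e FOO=bar".
    if (!extraArgs.trim().isEmpty()) {
      cmd.addAll(Arrays.asList(extraArgs.trim().split("\\s+")));
    }
    cmd.add(image);
    cmd.add("bash");
    cmd.add("launch_container.sh");
    return cmd;
  }
}
{code}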

 Create Docker analog of the LinuxContainerExecutor in YARN
 --

 Key: YARN-1964
 URL: https://issues.apache.org/jira/browse/YARN-1964
 Project: Hadoop YARN
  Issue Type: New Feature
Affects Versions: 2.2.0
Reporter: Arun C Murthy
Assignee: Abin Shahab
 Attachments: YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, 
 YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, 
 yarn-1964-branch-2.2.0-docker.patch, yarn-1964-branch-2.2.0-docker.patch, 
 yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, 
 yarn-1964-docker.patch, yarn-1964-docker.patch


 Docker (https://www.docker.io/) is, increasingly, a very popular container 
 technology.
 In context of YARN, the support for Docker will provide a very elegant 
 solution to allow applications to *package* their software into a Docker 
 container (entire Linux file system incl. custom versions of perl, python 
 etc.) and use it as a blueprint to launch all their YARN containers with 
 requisite software environment. This provides both consistency (all YARN 
 containers will have the same software environment) and isolation (no 
 interference with whatever is installed on the physical machine).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2674) Distributed shell AM may re-launch containers if RM work preserving restart happens

2014-10-09 Thread Chun Chen (JIRA)
Chun Chen created YARN-2674:
---

 Summary: Distributed shell AM may re-launch containers if RM work 
preserving restart happens
 Key: YARN-2674
 URL: https://issues.apache.org/jira/browse/YARN-2674
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Chun Chen


Currently, if an RM work-preserving restart happens while distributed shell is 
running, the distributed shell AM may re-launch all the containers, including 
new/running/completed ones. We must make sure it won't re-launch the 
running/completed containers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2674) Distributed shell AM may re-launch containers if RM work preserving restart happens

2014-10-09 Thread Chun Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Chen updated YARN-2674:

Attachment: YARN-2674.1.patch

 Distributed shell AM may re-launch containers if RM work preserving restart 
 happens
 ---

 Key: YARN-2674
 URL: https://issues.apache.org/jira/browse/YARN-2674
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Chun Chen
 Attachments: YARN-2674.1.patch


 Currently, if an RM work-preserving restart happens while distributed shell is 
 running, the distributed shell AM may re-launch all the containers, including 
 new/running/completed ones. We must make sure it won't re-launch the 
 running/completed containers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2674) Distributed shell AM may re-launch containers if RM work preserving restart happens

2014-10-09 Thread Chun Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Chen updated YARN-2674:

Description: 
Currently, if an RM work-preserving restart happens while distributed shell is 
running, the distributed shell AM may re-launch all the containers, including 
new/running/completed ones. We must make sure it won't re-launch the 
running/completed containers.
We need to remove allocated containers from AMRMClientImpl#remoteRequestsTable 
once the AM receives them from the RM. 

  was:Currently, if an RM work-preserving restart happens while distributed shell 
is running, the distributed shell AM may re-launch all the containers, including 
new/running/completed ones. We must make sure it won't re-launch the 
running/completed containers.


 Distributed shell AM may re-launch containers if RM work preserving restart 
 happens
 ---

 Key: YARN-2674
 URL: https://issues.apache.org/jira/browse/YARN-2674
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Chun Chen
 Attachments: YARN-2674.1.patch


 Currently, if an RM work-preserving restart happens while distributed shell is 
 running, the distributed shell AM may re-launch all the containers, including 
 new/running/completed ones. We must make sure it won't re-launch the 
 running/completed containers.
 We need to remove allocated containers from 
 AMRMClientImpl#remoteRequestsTable once the AM receives them from the RM. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2612) Some completed containers are not reported to NM

2014-09-26 Thread Chun Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Chen updated YARN-2612:

Description: 
We are testing RM work-preserving restart and found the following logs when we 
ran a simple MapReduce task, pi. Some completed containers that were already 
pulled by the AM were never reported back to the NM, so the NM continuously 
reported the completed containers even after the AM had finished. 
{code}
2014-09-26 17:00:42,228 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
Null container completed...
2014-09-26 17:00:42,228 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
Null container completed...
2014-09-26 17:00:43,230 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
Null container completed...
2014-09-26 17:00:43,230 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
Null container completed...
2014-09-26 17:00:44,233 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
Null container completed...
2014-09-26 17:00:44,233 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
Null container completed...
{code}

In YARN-1372, the NM keeps reporting completed containers to the RM until it gets 
an ACK from the RM. If the AM does not call allocate, which means the AM does not 
ack the RM, the RM will not ack the NM. We ([~chenchun]) have observed these two 
cases when running the MapReduce task 'pi':
1) The RM sends completed containers to the AM. After receiving them, the AM 
thinks it has done its work and does not need more resources, so it does not call 
allocate.
2) When the AM finishes, it cannot ack the RM because the AM itself has not 
finished yet.

To solve this problem, we have two solutions:
1) When RMAppAttempt calls FinalTransition, it means the AppAttempt has finished, 
so the RM could send this AppAttempt's completed containers to the NM.
2) In FairScheduler#nodeUpdate, if a completed container sent by the NM does not 
have a corresponding RMContainer, the RM just acks it back to the NM (see the 
sketch after this description).

We prefer solution 2 because it is clearer and more concise. However, the RM might 
ack the same completed containers to the NM many times.
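
A self-contained sketch of the filtering idea in solution 2 follows; the names here (liveContainers, containersToAck) are stand-ins and the real change would live in the scheduler's nodeUpdate handling, but the logic is just: if a completed container reported by the NM has no corresponding RMContainer, put it straight on the list acked back to that NM.

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class NodeUpdateAckSketch {

  /** liveContainers is a stand-in for the scheduler's RMContainer map, keyed by id. */
  static List<String> containersToAck(List<String> completedFromNode,
                                      Map<String, Object> liveContainers) {
    List<String> ackToNode = new ArrayList<>();
    for (String containerId : completedFromNode) {
      if (!liveContainers.containsKey(containerId)) {
        // No corresponding RMContainer ("Null container completed..."):
        // nothing left for the scheduler to do, so ack it back to the NM
        // so the NM stops re-reporting it on every heartbeat.
        ackToNode.add(containerId);
      }
      // Otherwise the normal completed-container handling applies.
    }
    return ackToNode;
  }
}
{code}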

  was:
In YARN-1372, the NM keeps reporting completed containers to the RM until it gets 
an ACK from the RM. If the AM does not call allocate, which means the AM does not 
ack the RM, the RM will not ack the NM. We ([~chenchun]) have observed these two 
cases when running the MapReduce task 'pi':
1) The RM sends completed containers to the AM. After receiving them, the AM 
thinks it has done its work and does not need more resources, so it does not call 
allocate.
2) When the AM finishes, it cannot ack the RM because the AM itself has not 
finished yet.

To solve this problem, we have two solutions:
1) When RMAppAttempt calls FinalTransition, it means the AppAttempt has finished, 
so the RM could send this AppAttempt's completed containers to the NM.
2) In FairScheduler#nodeUpdate, if a completed container sent by the NM does not 
have a corresponding RMContainer, the RM just acks it back to the NM.

We prefer solution 2 because it is clearer and more concise. However, the RM might 
ack the same completed containers to the NM many times.


 Some completed containers are not reported to NM
 

 Key: YARN-2612
 URL: https://issues.apache.org/jira/browse/YARN-2612
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Jun Gong
 Fix For: 2.6.0

 Attachments: YARN-2612.2.patch, YARN-2612.patch


 We are testing RM work-preserving restart and found the following logs when 
 we ran a simple MapReduce task, pi. Some completed containers that were already 
 pulled by the AM were never reported back to the NM, so the NM continuously 
 reported the completed containers even after the AM had finished. 
 {code}
 2014-09-26 17:00:42,228 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:42,228 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:43,230 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:43,230 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:44,233 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:44,233 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 {code}
 In YARN-1372, the NM keeps reporting completed containers to the RM until it gets 
 an ACK from the RM. If the AM does not call allocate, which means the AM does not 
 ack the RM, the RM will not ack the NM. We ([~chenchun]) have observed these two 
 cases when running