[jira] [Assigned] (YARN-11489) Fix memory leak of DelegationTokenRenewer futures in DelegationTokenRenewerPoolTracker
[ https://issues.apache.org/jira/browse/YARN-11489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun Chen reassigned YARN-11489: Assignee: Chun Chen > Fix memory leak of DelegationTokenRenewer futures in > DelegationTokenRenewerPoolTracker > -- > > Key: YARN-11489 > URL: https://issues.apache.org/jira/browse/YARN-11489 > Project: Hadoop YARN > Issue Type: Bug > Reporter: Chun Chen > Assignee: Chun Chen > Priority: Major > > The future of the DelegationTokenRenewer runnable was previously not removed properly; > it was only removed if the runnable timed out. > A Queue structure is also more suitable than a Map for storing the futures. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11489) Fix memory leak of DelegationTokenRenewer futures in DelegationTokenRenewerPoolTracker
Chun Chen created YARN-11489: Summary: Fix memory leak of DelegationTokenRenewer futures in DelegationTokenRenewerPoolTracker Key: YARN-11489 URL: https://issues.apache.org/jira/browse/YARN-11489 Project: Hadoop YARN Issue Type: Bug Reporter: Chun Chen The future of the DelegationTokenRenewer runnable was previously not removed properly; it was only removed if the runnable timed out. A Queue structure is also more suitable than a Map for storing the futures. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
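A minimal sketch of the queue-based idea above (hypothetical class and field names, not the actual YARN-11489 patch): completed futures are drained from the head of a FIFO queue, so entries are released whenever a renewal reaches any terminal state instead of only on timeout.
{code}
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Future;

public class RenewerFutureTracker {
  // FIFO of submitted renewer futures, oldest first; finished entries are
  // polled from the head so the collection cannot grow without bound.
  private final Queue<Future<?>> futures = new ConcurrentLinkedQueue<>();

  public void track(Future<?> future) {
    futures.add(future);
    drainCompleted();
  }

  private void drainCompleted() {
    Future<?> head;
    // isDone() is true for normal completion, failure, and cancellation,
    // so every terminal state releases its entry, not just timeouts.
    while ((head = futures.peek()) != null && head.isDone()) {
      futures.poll();
    }
  }
}
{code}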
[jira] [Commented] (YARN-2674) Distributed shell AM may re-launch containers if RM work preserving restart happens
[ https://issues.apache.org/jira/browse/YARN-2674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433277#comment-16433277 ] Chun Chen commented on YARN-2674: - [~shaneku...@gmail.com] Please carry on the work here; I no longer work on YARN. > Distributed shell AM may re-launch containers if RM work preserving restart > happens > --- > > Key: YARN-2674 > URL: https://issues.apache.org/jira/browse/YARN-2674 > Project: Hadoop YARN > Issue Type: Sub-task > Components: applications, resourcemanager > Reporter: Chun Chen > Assignee: Chun Chen > Priority: Major > Labels: oct16-easy > Attachments: YARN-2674.1.patch, YARN-2674.2.patch, YARN-2674.3.patch, > YARN-2674.4.patch, YARN-2674.5.patch > > > Currently, if an RM work-preserving restart happens while a distributed shell job is > running, the distributed shell AM may re-launch all the containers, including > new/running/completed ones. We must make sure it won't re-launch the > running/completed containers. > We need to remove allocated containers from > AMRMClientImpl#remoteRequestsTable once the AM receives them from the RM. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
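A minimal sketch of the fix direction named in the description (the helper below is hypothetical; AMRMClientImpl's real bookkeeping is more involved): each received allocation removes the outstanding ask it satisfies, so a resynced RM cannot satisfy the same ask a second time.
{code}
import java.util.List;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;

public class AllocationBookkeeping {
  /** Drop each satisfied ask as soon as the AM receives its container. */
  public void onAllocate(AllocateResponse response) {
    List<Container> allocated = response.getAllocatedContainers();
    for (Container c : allocated) {
      removeMatchingRequest(c.getPriority(), c.getNodeId().getHost(),
          c.getResource());
    }
  }

  private void removeMatchingRequest(Priority priority, String host,
      Resource capability) {
    // Hypothetical: decrement/remove the remoteRequestsTable entry keyed by
    // (priority, location, capability) so it is not re-sent after RM restart.
  }
}
{code}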
[jira] [Commented] (YARN-1983) Support heterogeneous container types at runtime on YARN
[ https://issues.apache.org/jira/browse/YARN-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14589179#comment-14589179 ] Chun Chen commented on YARN-1983: - [~sidharta-s], thanks for letting me know the progress of this; I'd be happy to learn about all your concerns and designs for implementing it. Support heterogeneous container types at runtime on YARN Key: YARN-1983 URL: https://issues.apache.org/jira/browse/YARN-1983 Project: Hadoop YARN Issue Type: Improvement Reporter: Junping Du Attachments: YARN-1983.2.patch, YARN-1983.patch Different container types (default, LXC, docker, VM box, etc.) have different semantics on isolation of security, namespace/env, performance, etc. Per the discussions in YARN-1964, we have some good thoughts on supporting different types of containers running on YARN, specified by the application at runtime, which would largely enhance YARN's flexibility to meet heterogeneous apps' isolation requirements at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1983) Support heterogeneous container types at runtime on YARN
[ https://issues.apache.org/jira/browse/YARN-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587370#comment-14587370 ] Chun Chen commented on YARN-1983: - [~vinodkv], according to your suggestion, I propose the following change: 1. Allow NM_CE to specify a comma-separated list of CE classes. 2. Allow users to specify an env named NM_CLIENT_CE in the CLC. If the value of NM_CLIENT_CE is one of the CE classes configured previously, choose that one to execute the container; otherwise, throw an exception. 3. If the user specifies only one CE class in NM_CE, ignore NM_CLIENT_CE in the env of the CLC and always use that one to execute containers. 4. If the user specifies multiple classes in NM_CE, a default CE named NM_DEFAULT_CE has to be configured in yarn-site.xml in case NM_CLIENT_CE is not set in the env when submitting containers. NM_CE=yarn.nodemanager.container-executor.class NM_CLIENT_CE=yarn.nodemanager.client.container-executor.class NM_DEFAULT_CE=yarn.nodemanager.default.container-executor.class Support heterogeneous container types at runtime on YARN Key: YARN-1983 URL: https://issues.apache.org/jira/browse/YARN-1983 Project: Hadoop YARN Issue Type: Improvement Reporter: Junping Du Attachments: YARN-1983.2.patch, YARN-1983.patch Different container types (default, LXC, docker, VM box, etc.) have different semantics on isolation of security, namespace/env, performance, etc. Per the discussions in YARN-1964, we have some good thoughts on supporting different types of containers running on YARN, specified by the application at runtime, which would largely enhance YARN's flexibility to meet heterogeneous apps' isolation requirements at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
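A minimal sketch of the four rules in that proposal (hypothetical class; the key names are the ones the comment maps out):
{code}
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.conf.Configuration;

public class ExecutorSelector {
  static final String NM_CE = "yarn.nodemanager.container-executor.class";
  static final String NM_DEFAULT_CE =
      "yarn.nodemanager.default.container-executor.class";

  /** Pick the CE class for a container, per rules 1-4 above. */
  public static String select(Configuration conf, String requestedFromClc) {
    List<String> configured =
        Arrays.asList(conf.getTrimmedStrings(NM_CE)); // rule 1: comma list
    if (configured.size() == 1) {
      return configured.get(0);                       // rule 3: ignore env
    }
    if (requestedFromClc == null || requestedFromClc.isEmpty()) {
      return conf.get(NM_DEFAULT_CE);                 // rule 4: fall back
    }
    if (configured.contains(requestedFromClc)) {
      return requestedFromClc;                        // rule 2: honor request
    }
    throw new IllegalArgumentException(               // rule 2: reject unknown
        "Unknown container executor: " + requestedFromClc);
  }
}
{code}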
[jira] [Commented] (YARN-3469) ZKRMStateStore: Avoid setting watches that are not required
[ https://issues.apache.org/jira/browse/YARN-3469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14585394#comment-14585394 ] Chun Chen commented on YARN-3469: - Just a note: I saw that ZOOKEEPER-706 was resolved recently, after 5 years. ZKRMStateStore: Avoid setting watches that are not required --- Key: YARN-3469 URL: https://issues.apache.org/jira/browse/YARN-3469 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Priority: Minor Fix For: 2.7.1 Attachments: YARN-3469.01.patch In ZKRMStateStore, most operations (e.g. getDataWithRetries) set watches on znodes. Large numbers of watches will cause problems such as [ZOOKEEPER-706: large numbers of watches can cause session re-establishment to fail|https://issues.apache.org/jira/browse/ZOOKEEPER-706]. Although there is a workaround of setting jute.maxbuffer to a larger value, we would need to keep adjusting this value as more apps and attempts are stored in ZK. And those watches are useless now, so it might be better not to set them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
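A minimal sketch of the idea (not the actual patch): reads go through the plain ZooKeeper API with watch=false, so there is no watch registered that would have to be replayed during session re-establishment.
{code}
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class NoWatchRead {
  /** Read a znode without registering a watch on it. */
  public static byte[] readNoWatch(ZooKeeper zk, String path) throws Exception {
    Stat stat = new Stat();
    return zk.getData(path, false /* watch */, stat);
  }
}
{code}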
[jira] [Updated] (YARN-2674) Distributed shell AM may re-launch containers if RM work preserving restart happens
[ https://issues.apache.org/jira/browse/YARN-2674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun Chen updated YARN-2674: Attachment: YARN-2674.4.patch Distributed shell AM may re-launch containers if RM work preserving restart happens --- Key: YARN-2674 URL: https://issues.apache.org/jira/browse/YARN-2674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Chun Chen Assignee: Chun Chen Attachments: YARN-2674.1.patch, YARN-2674.2.patch, YARN-2674.3.patch, YARN-2674.4.patch Currently, if an RM work-preserving restart happens while a distributed shell job is running, the distributed shell AM may re-launch all the containers, including new/running/completed ones. We must make sure it won't re-launch the running/completed containers. We need to remove allocated containers from AMRMClientImpl#remoteRequestsTable once the AM receives them from the RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2674) Distributed shell AM may re-launch containers if RM work preserving restart happens
[ https://issues.apache.org/jira/browse/YARN-2674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14574106#comment-14574106 ] Chun Chen commented on YARN-2674: - Uploaded a patch to fix the test failures. Distributed shell AM may re-launch containers if RM work preserving restart happens --- Key: YARN-2674 URL: https://issues.apache.org/jira/browse/YARN-2674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Chun Chen Assignee: Chun Chen Attachments: YARN-2674.1.patch, YARN-2674.2.patch, YARN-2674.3.patch, YARN-2674.4.patch Currently, if an RM work-preserving restart happens while a distributed shell job is running, the distributed shell AM may re-launch all the containers, including new/running/completed ones. We must make sure it won't re-launch the running/completed containers. We need to remove allocated containers from AMRMClientImpl#remoteRequestsTable once the AM receives them from the RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2674) Distributed shell AM may re-launch containers if RM work preserving restart happens
[ https://issues.apache.org/jira/browse/YARN-2674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun Chen updated YARN-2674: Attachment: YARN-2674.5.patch Uploaded YARN-2674.5.patch to remove unnecessary synchronization. Distributed shell AM may re-launch containers if RM work preserving restart happens --- Key: YARN-2674 URL: https://issues.apache.org/jira/browse/YARN-2674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Chun Chen Assignee: Chun Chen Attachments: YARN-2674.1.patch, YARN-2674.2.patch, YARN-2674.3.patch, YARN-2674.4.patch, YARN-2674.5.patch Currently, if an RM work-preserving restart happens while a distributed shell job is running, the distributed shell AM may re-launch all the containers, including new/running/completed ones. We must make sure it won't re-launch the running/completed containers. We need to remove allocated containers from AMRMClientImpl#remoteRequestsTable once the AM receives them from the RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2674) Distributed shell AM may re-launch containers if RM work preserving restart happens
[ https://issues.apache.org/jira/browse/YARN-2674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun Chen updated YARN-2674: Attachment: YARN-2674.3.patch Distributed shell AM may re-launch containers if RM work preserving restart happens --- Key: YARN-2674 URL: https://issues.apache.org/jira/browse/YARN-2674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Chun Chen Assignee: Chun Chen Attachments: YARN-2674.1.patch, YARN-2674.2.patch, YARN-2674.3.patch Currently, if an RM work-preserving restart happens while a distributed shell job is running, the distributed shell AM may re-launch all the containers, including new/running/completed ones. We must make sure it won't re-launch the running/completed containers. We need to remove allocated containers from AMRMClientImpl#remoteRequestsTable once the AM receives them from the RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2674) Distributed shell AM may re-launch containers if RM work preserving restart happens
[ https://issues.apache.org/jira/browse/YARN-2674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14572535#comment-14572535 ] Chun Chen commented on YARN-2674: - Uploaded YARN-2674.3.patch with a test case and more detailed comments. Distributed shell AM may re-launch containers if RM work preserving restart happens --- Key: YARN-2674 URL: https://issues.apache.org/jira/browse/YARN-2674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Chun Chen Assignee: Chun Chen Attachments: YARN-2674.1.patch, YARN-2674.2.patch, YARN-2674.3.patch Currently, if an RM work-preserving restart happens while a distributed shell job is running, the distributed shell AM may re-launch all the containers, including new/running/completed ones. We must make sure it won't re-launch the running/completed containers. We need to remove allocated containers from AMRMClientImpl#remoteRequestsTable once the AM receives them from the RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs
[ https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14571957#comment-14571957 ] Chun Chen commented on YARN-3749: - Thanks for reviewing and committing the patch, [~xgong]. We should make a copy of configuration when init MiniYARNCluster with multiple RMs -- Key: YARN-3749 URL: https://issues.apache.org/jira/browse/YARN-3749 Project: Hadoop YARN Issue Type: Bug Reporter: Chun Chen Assignee: Chun Chen Fix For: 2.8.0 Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, YARN-3749.5.patch, YARN-3749.6.patch, YARN-3749.7.patch, YARN-3749.7.patch, YARN-3749.patch When I was trying to write a test case for YARN-2674, I found the DS client trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 during RM failover, even though I initially set yarn.resourcemanager.address.rm1=0.0.0.0:18032 and yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found it is in ClientRMService that the value of yarn.resourcemanager.address.rm2 gets changed to 0.0.0.0:18032. See the following code in ClientRMService: {code} clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST, YarnConfiguration.RM_ADDRESS, YarnConfiguration.DEFAULT_RM_ADDRESS, server.getListenerAddress()); {code} Since we use the same configuration instance for rm1 and rm2 and init both RMs before we start either of them, yarn.resourcemanager.ha.id is changed to rm2 during the init of rm2 and is therefore still rm2 while rm1 is starting. So I think it is safe to make a copy of the configuration when initializing each RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
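A minimal sketch of the proposed fix (not the actual patch): each RM of the HA pair gets its own deep copy of the configuration, so updateConnectAddr and the RM_HA_ID writes in one RM can no longer leak into the other through a shared instance.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class PerRmConf {
  /** Build an independent configuration for one RM of a MiniYARNCluster. */
  public static Configuration forRm(Configuration base, String rmId) {
    Configuration copy = new YarnConfiguration(base); // copy, not alias
    copy.set(YarnConfiguration.RM_HA_ID, rmId);       // e.g. "rm1" or "rm2"
    return copy;
  }
}
{code}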
[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs
[ https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570202#comment-14570202 ] Chun Chen commented on YARN-3749: - Thanks for reviewing the patch, [~zxu]! We should make a copy of configuration when init MiniYARNCluster with multiple RMs -- Key: YARN-3749 URL: https://issues.apache.org/jira/browse/YARN-3749 Project: Hadoop YARN Issue Type: Bug Reporter: Chun Chen Assignee: Chun Chen Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, YARN-3749.5.patch, YARN-3749.6.patch, YARN-3749.7.patch, YARN-3749.patch When I was trying to write a test case for YARN-2674, I found the DS client trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 during RM failover, even though I initially set yarn.resourcemanager.address.rm1=0.0.0.0:18032 and yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found it is in ClientRMService that the value of yarn.resourcemanager.address.rm2 gets changed to 0.0.0.0:18032. See the following code in ClientRMService: {code} clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST, YarnConfiguration.RM_ADDRESS, YarnConfiguration.DEFAULT_RM_ADDRESS, server.getListenerAddress()); {code} Since we use the same configuration instance for rm1 and rm2 and init both RMs before we start either of them, yarn.resourcemanager.ha.id is changed to rm2 during the init of rm2 and is therefore still rm2 while rm1 is starting. So I think it is safe to make a copy of the configuration when initializing each RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs
[ https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568586#comment-14568586 ] Chun Chen commented on YARN-3749: - bq. It looks like we need keep conf.set(YarnConfiguration.RM_HA_ID, RM1_NODE_ID); in TestRMEmbeddedElector to fix this test failure. Sorry, my bad. Uploaded YARN-3749.7.patch to fix that and added a test in {{TestYarnConfiguration}} to make sure {{YarnConfiguration#updateConnectAddr}} won't add a suffix to the NM service address configurations. We should make a copy of configuration when init MiniYARNCluster with multiple RMs -- Key: YARN-3749 URL: https://issues.apache.org/jira/browse/YARN-3749 Project: Hadoop YARN Issue Type: Bug Reporter: Chun Chen Assignee: Chun Chen Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, YARN-3749.5.patch, YARN-3749.6.patch, YARN-3749.7.patch, YARN-3749.patch When I was trying to write a test case for YARN-2674, I found the DS client trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 during RM failover, even though I initially set yarn.resourcemanager.address.rm1=0.0.0.0:18032 and yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found it is in ClientRMService that the value of yarn.resourcemanager.address.rm2 gets changed to 0.0.0.0:18032. See the following code in ClientRMService: {code} clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST, YarnConfiguration.RM_ADDRESS, YarnConfiguration.DEFAULT_RM_ADDRESS, server.getListenerAddress()); {code} Since we use the same configuration instance for rm1 and rm2 and init both RMs before we start either of them, yarn.resourcemanager.ha.id is changed to rm2 during the init of rm2 and is therefore still rm2 while rm1 is starting. So I think it is safe to make a copy of the configuration when initializing each RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs
[ https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun Chen updated YARN-3749: Attachment: YARN-3749.2.patch Uploaded a new patch to fix the test cases. Lots of the previous test failures ("The HA Configuration has multiple addresses that match local node's address.") were because I forgot to set YarnConfiguration.RM_HA_ID before starting the NM. The patch also contains two minor fixes: it moves reading the conf value of RM_SCHEDULER_ADDRESS from serviceStart to serviceInit in ApplicationMasterService, and moves the duplicated setRpcAddressForRM in tests to HAUtil. We should make a copy of configuration when init MiniYARNCluster with multiple RMs -- Key: YARN-3749 URL: https://issues.apache.org/jira/browse/YARN-3749 Project: Hadoop YARN Issue Type: Bug Reporter: Chun Chen Assignee: Chun Chen Attachments: YARN-3749.2.patch, YARN-3749.patch When I was trying to write a test case for YARN-2674, I found the DS client trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 during RM failover, even though I initially set yarn.resourcemanager.address.rm1=0.0.0.0:18032 and yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found it is in ClientRMService that the value of yarn.resourcemanager.address.rm2 gets changed to 0.0.0.0:18032. See the following code in ClientRMService: {code} clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST, YarnConfiguration.RM_ADDRESS, YarnConfiguration.DEFAULT_RM_ADDRESS, server.getListenerAddress()); {code} Since we use the same configuration instance for rm1 and rm2 and init both RMs before we start either of them, yarn.resourcemanager.ha.id is changed to rm2 during the init of rm2 and is therefore still rm2 while rm1 is starting. So I think it is safe to make a copy of the configuration when initializing each RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs
[ https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun Chen updated YARN-3749: Attachment: YARN-3749.7.patch We should make a copy of configuration when init MiniYARNCluster with multiple RMs -- Key: YARN-3749 URL: https://issues.apache.org/jira/browse/YARN-3749 Project: Hadoop YARN Issue Type: Bug Reporter: Chun Chen Assignee: Chun Chen Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, YARN-3749.5.patch, YARN-3749.6.patch, YARN-3749.7.patch, YARN-3749.patch When I was trying to write a test case for YARN-2674, I found the DS client trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 during RM failover, even though I initially set yarn.resourcemanager.address.rm1=0.0.0.0:18032 and yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found it is in ClientRMService that the value of yarn.resourcemanager.address.rm2 gets changed to 0.0.0.0:18032. See the following code in ClientRMService: {code} clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST, YarnConfiguration.RM_ADDRESS, YarnConfiguration.DEFAULT_RM_ADDRESS, server.getListenerAddress()); {code} Since we use the same configuration instance for rm1 and rm2 and init both RMs before we start either of them, yarn.resourcemanager.ha.id is changed to rm2 during the init of rm2 and is therefore still rm2 while rm1 is starting. So I think it is safe to make a copy of the configuration when initializing each RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs
[ https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun Chen updated YARN-3749: Attachment: YARN-3749.4.patch We should make a copy of configuration when init MiniYARNCluster with multiple RMs -- Key: YARN-3749 URL: https://issues.apache.org/jira/browse/YARN-3749 Project: Hadoop YARN Issue Type: Bug Reporter: Chun Chen Assignee: Chun Chen Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, YARN-3749.patch When I was trying to write a test case for YARN-2674, I found the DS client trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 during RM failover, even though I initially set yarn.resourcemanager.address.rm1=0.0.0.0:18032 and yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found it is in ClientRMService that the value of yarn.resourcemanager.address.rm2 gets changed to 0.0.0.0:18032. See the following code in ClientRMService: {code} clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST, YarnConfiguration.RM_ADDRESS, YarnConfiguration.DEFAULT_RM_ADDRESS, server.getListenerAddress()); {code} Since we use the same configuration instance for rm1 and rm2 and init both RMs before we start either of them, yarn.resourcemanager.ha.id is changed to rm2 during the init of rm2 and is therefore still rm2 while rm1 is starting. So I think it is safe to make a copy of the configuration when initializing each RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs
[ https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun Chen updated YARN-3749: Attachment: YARN-3749.5.patch We should make a copy of configuration when init MiniYARNCluster with multiple RMs -- Key: YARN-3749 URL: https://issues.apache.org/jira/browse/YARN-3749 Project: Hadoop YARN Issue Type: Bug Reporter: Chun Chen Assignee: Chun Chen Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, YARN-3749.5.patch, YARN-3749.patch When I was trying to write a test case for YARN-2674, I found the DS client trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 during RM failover, even though I initially set yarn.resourcemanager.address.rm1=0.0.0.0:18032 and yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found it is in ClientRMService that the value of yarn.resourcemanager.address.rm2 gets changed to 0.0.0.0:18032. See the following code in ClientRMService: {code} clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST, YarnConfiguration.RM_ADDRESS, YarnConfiguration.DEFAULT_RM_ADDRESS, server.getListenerAddress()); {code} Since we use the same configuration instance for rm1 and rm2 and init both RMs before we start either of them, yarn.resourcemanager.ha.id is changed to rm2 during the init of rm2 and is therefore still rm2 while rm1 is starting. So I think it is safe to make a copy of the configuration when initializing each RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs
[ https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568521#comment-14568521 ] Chun Chen commented on YARN-3749: - [~zxu], thanks, I agree. Uploaded YARN-3749.6.patch to address your comments. We should make a copy of configuration when init MiniYARNCluster with multiple RMs -- Key: YARN-3749 URL: https://issues.apache.org/jira/browse/YARN-3749 Project: Hadoop YARN Issue Type: Bug Reporter: Chun Chen Assignee: Chun Chen Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, YARN-3749.5.patch, YARN-3749.6.patch, YARN-3749.patch When I was trying to write a test case for YARN-2674, I found the DS client trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 during RM failover, even though I initially set yarn.resourcemanager.address.rm1=0.0.0.0:18032 and yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found it is in ClientRMService that the value of yarn.resourcemanager.address.rm2 gets changed to 0.0.0.0:18032. See the following code in ClientRMService: {code} clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST, YarnConfiguration.RM_ADDRESS, YarnConfiguration.DEFAULT_RM_ADDRESS, server.getListenerAddress()); {code} Since we use the same configuration instance for rm1 and rm2 and init both RMs before we start either of them, yarn.resourcemanager.ha.id is changed to rm2 during the init of rm2 and is therefore still rm2 while rm1 is starting. So I think it is safe to make a copy of the configuration when initializing each RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs
[ https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun Chen updated YARN-3749: Attachment: YARN-3749.3.patch We should make a copy of configuration when init MiniYARNCluster with multiple RMs -- Key: YARN-3749 URL: https://issues.apache.org/jira/browse/YARN-3749 Project: Hadoop YARN Issue Type: Bug Reporter: Chun Chen Assignee: Chun Chen Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.patch When I was trying to write a test case for YARN-2674, I found the DS client trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 during RM failover, even though I initially set yarn.resourcemanager.address.rm1=0.0.0.0:18032 and yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found it is in ClientRMService that the value of yarn.resourcemanager.address.rm2 gets changed to 0.0.0.0:18032. See the following code in ClientRMService: {code} clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST, YarnConfiguration.RM_ADDRESS, YarnConfiguration.DEFAULT_RM_ADDRESS, server.getListenerAddress()); {code} Since we use the same configuration instance for rm1 and rm2 and init both RMs before we start either of them, yarn.resourcemanager.ha.id is changed to rm2 during the init of rm2 and is therefore still rm2 while rm1 is starting. So I think it is safe to make a copy of the configuration when initializing each RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs
[ https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568448#comment-14568448 ] Chun Chen commented on YARN-3749: - Thanks for the review, [~zxu], [~iwasakims]. Uploaded a new patch to address your comments. bq. 1. It looks like setRpcAddressForRM and setConfForRM are only used by test code. Should we create a new HA test utility file to include these functions? Moved setRpcAddressForRM and setConfForRM to HATestUtil.java. bq. 2. Do we really need the following change at MiniYARNCluster#serviceInit conf.set(YarnConfiguration.RM_HA_ID, rm0); This is indeed necessary; per [~iwasakims]'s comment, it is used to bypass the check in `HAUtil#getRMHAId` for the NodeManager instances. bq. 3. Is any particular reason to configure YarnConfiguration.RM_HA_ID as RM2_NODE_ID instead of RM1_NODE_ID in ProtocolHATestBase? Not really; changed it to RM1_NODE_ID. bq. I think there should be a comment explain that it is a dummy for unit test at least. Added a comment in `MiniYARNCluster#serviceInit`. Also, the newly uploaded YARN-3749.4.patch only makes a copy of the configuration in initResourceManager when there are multiple RMs. If there is only one RM, many test cases in yarn-client depend on the random ports assigned after the RM starts. We should make a copy of configuration when init MiniYARNCluster with multiple RMs -- Key: YARN-3749 URL: https://issues.apache.org/jira/browse/YARN-3749 Project: Hadoop YARN Issue Type: Bug Reporter: Chun Chen Assignee: Chun Chen Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, YARN-3749.patch When I was trying to write a test case for YARN-2674, I found the DS client trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 during RM failover, even though I initially set yarn.resourcemanager.address.rm1=0.0.0.0:18032 and yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found it is in ClientRMService that the value of yarn.resourcemanager.address.rm2 gets changed to 0.0.0.0:18032. See the following code in ClientRMService: {code} clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST, YarnConfiguration.RM_ADDRESS, YarnConfiguration.DEFAULT_RM_ADDRESS, server.getListenerAddress()); {code} Since we use the same configuration instance for rm1 and rm2 and init both RMs before we start either of them, yarn.resourcemanager.ha.id is changed to rm2 during the init of rm2 and is therefore still rm2 while rm1 is starting. So I think it is safe to make a copy of the configuration when initializing each RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs
[ https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun Chen updated YARN-3749: Attachment: YARN-3749.6.patch We should make a copy of configuration when init MiniYARNCluster with multiple RMs -- Key: YARN-3749 URL: https://issues.apache.org/jira/browse/YARN-3749 Project: Hadoop YARN Issue Type: Bug Reporter: Chun Chen Assignee: Chun Chen Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, YARN-3749.5.patch, YARN-3749.6.patch, YARN-3749.patch When I was trying to write a test case for YARN-2674, I found the DS client trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 during RM failover, even though I initially set yarn.resourcemanager.address.rm1=0.0.0.0:18032 and yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found it is in ClientRMService that the value of yarn.resourcemanager.address.rm2 gets changed to 0.0.0.0:18032. See the following code in ClientRMService: {code} clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST, YarnConfiguration.RM_ADDRESS, YarnConfiguration.DEFAULT_RM_ADDRESS, server.getListenerAddress()); {code} Since we use the same configuration instance for rm1 and rm2 and init both RMs before we start either of them, yarn.resourcemanager.ha.id is changed to rm2 during the init of rm2 and is therefore still rm2 while rm1 is starting. So I think it is safe to make a copy of the configuration when initializing each RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs
[ https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568458#comment-14568458 ] Chun Chen commented on YARN-3749: - Uploaded YARN-3749.5.patch to set {{YarnConfiguration.RM_HA_ID}} only in {{MiniYARNCluster#serviceInit}} and remove it from the other tests. We should make a copy of configuration when init MiniYARNCluster with multiple RMs -- Key: YARN-3749 URL: https://issues.apache.org/jira/browse/YARN-3749 Project: Hadoop YARN Issue Type: Bug Reporter: Chun Chen Assignee: Chun Chen Attachments: YARN-3749.2.patch, YARN-3749.3.patch, YARN-3749.4.patch, YARN-3749.5.patch, YARN-3749.patch When I was trying to write a test case for YARN-2674, I found the DS client trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 during RM failover, even though I initially set yarn.resourcemanager.address.rm1=0.0.0.0:18032 and yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found it is in ClientRMService that the value of yarn.resourcemanager.address.rm2 gets changed to 0.0.0.0:18032. See the following code in ClientRMService: {code} clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST, YarnConfiguration.RM_ADDRESS, YarnConfiguration.DEFAULT_RM_ADDRESS, server.getListenerAddress()); {code} Since we use the same configuration instance for rm1 and rm2 and init both RMs before we start either of them, yarn.resourcemanager.ha.id is changed to rm2 during the init of rm2 and is therefore still rm2 while rm1 is starting. So I think it is safe to make a copy of the configuration when initializing each RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs
[ https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14566326#comment-14566326 ] Chun Chen commented on YARN-3749: - Uploaded a patch to fix it. We should make a copy of configuration when init MiniYARNCluster with multiple RMs -- Key: YARN-3749 URL: https://issues.apache.org/jira/browse/YARN-3749 Project: Hadoop YARN Issue Type: Bug Reporter: Chun Chen Assignee: Chun Chen Attachments: YARN-3749.patch When I was trying to write a test case for YARN-2674, I found the DS client trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 during RM failover, even though I initially set yarn.resourcemanager.address.rm1=0.0.0.0:18032 and yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found it is in ClientRMService that the value of yarn.resourcemanager.address.rm2 gets changed to 0.0.0.0:18032. See the following code in ClientRMService: {code} clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST, YarnConfiguration.RM_ADDRESS, YarnConfiguration.DEFAULT_RM_ADDRESS, server.getListenerAddress()); {code} Since we use the same configuration instance for rm1 and rm2 and init both RMs before we start either of them, yarn.resourcemanager.ha.id is changed to rm2 during the init of rm2 and is therefore still rm2 while rm1 is starting. So I think it is safe to make a copy of the configuration when initializing each RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2674) Distributed shell AM may re-launch containers if RM work preserving restart happens
[ https://issues.apache.org/jira/browse/YARN-2674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14566327#comment-14566327 ] Chun Chen commented on YARN-2674: - Thanks for the comments, [~vinodkv]. Will upload a new patch with a test case after YARN-3749 is fixed. Distributed shell AM may re-launch containers if RM work preserving restart happens --- Key: YARN-2674 URL: https://issues.apache.org/jira/browse/YARN-2674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Chun Chen Assignee: Chun Chen Attachments: YARN-2674.1.patch, YARN-2674.2.patch Currently, if an RM work-preserving restart happens while a distributed shell job is running, the distributed shell AM may re-launch all the containers, including new/running/completed ones. We must make sure it won't re-launch the running/completed containers. We need to remove allocated containers from AMRMClientImpl#remoteRequestsTable once the AM receives them from the RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs
[ https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun Chen updated YARN-3749: Summary: We should make a copy of configuration when init MiniYARNCluster with multiple RMs (was: We should make a copy of config MiniYARNCluster ) We should make a copy of configuration when init MiniYARNCluster with multiple RMs -- Key: YARN-3749 URL: https://issues.apache.org/jira/browse/YARN-3749 Project: Hadoop YARN Issue Type: Bug Reporter: Chun Chen -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3749) We should make a copy of config MiniYARNCluster
Chun Chen created YARN-3749: --- Summary: We should make a copy of config MiniYARNCluster Key: YARN-3749 URL: https://issues.apache.org/jira/browse/YARN-3749 Project: Hadoop YARN Issue Type: Bug Reporter: Chun Chen -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs
[ https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun Chen updated YARN-3749: Description: When I was trying to write a test case for YARN-2674, I found the DS client trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 during RM failover, even though I initially set yarn.resourcemanager.address.rm1=0.0.0.0:18032 and yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found it is in ClientRMService that the value of yarn.resourcemanager.address.rm2 gets changed to 0.0.0.0:18032. See the following code in ClientRMService: {code} clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST, YarnConfiguration.RM_ADDRESS, YarnConfiguration.DEFAULT_RM_ADDRESS, server.getListenerAddress()); {code} Since we use the same configuration instance for rm1 and rm2 and init both RMs before we start either of them, yarn.resourcemanager.ha.id is changed to rm2 during the init of rm2 and is therefore still rm2 while rm1 is starting. So I think it is safe to make a copy of the configuration when initializing each RM. We should make a copy of configuration when init MiniYARNCluster with multiple RMs -- Key: YARN-3749 URL: https://issues.apache.org/jira/browse/YARN-3749 Project: Hadoop YARN Issue Type: Bug Reporter: Chun Chen When I was trying to write a test case for YARN-2674, I found the DS client trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 during RM failover, even though I initially set yarn.resourcemanager.address.rm1=0.0.0.0:18032 and yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found it is in ClientRMService that the value of yarn.resourcemanager.address.rm2 gets changed to 0.0.0.0:18032. See the following code in ClientRMService: {code} clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST, YarnConfiguration.RM_ADDRESS, YarnConfiguration.DEFAULT_RM_ADDRESS, server.getListenerAddress()); {code} Since we use the same configuration instance for rm1 and rm2 and init both RMs before we start either of them, yarn.resourcemanager.ha.id is changed to rm2 during the init of rm2 and is therefore still rm2 while rm1 is starting. So I think it is safe to make a copy of the configuration when initializing each RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs
[ https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun Chen reassigned YARN-3749: --- Assignee: Chun Chen We should make a copy of configuration when init MiniYARNCluster with multiple RMs -- Key: YARN-3749 URL: https://issues.apache.org/jira/browse/YARN-3749 Project: Hadoop YARN Issue Type: Bug Reporter: Chun Chen Assignee: Chun Chen When I was trying to write a test case for YARN-2674, I found the DS client trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 during RM failover, even though I initially set yarn.resourcemanager.address.rm1=0.0.0.0:18032 and yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found it is in ClientRMService that the value of yarn.resourcemanager.address.rm2 gets changed to 0.0.0.0:18032. See the following code in ClientRMService: {code} clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST, YarnConfiguration.RM_ADDRESS, YarnConfiguration.DEFAULT_RM_ADDRESS, server.getListenerAddress()); {code} Since we use the same configuration instance for rm1 and rm2 and init both RMs before we start either of them, yarn.resourcemanager.ha.id is changed to rm2 during the init of rm2 and is therefore still rm2 while rm1 is starting. So I think it is safe to make a copy of the configuration when initializing each RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3749) We should make a copy of configuration when init MiniYARNCluster with multiple RMs
[ https://issues.apache.org/jira/browse/YARN-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun Chen updated YARN-3749: Attachment: YARN-3749.patch We should make a copy of configuration when init MiniYARNCluster with multiple RMs -- Key: YARN-3749 URL: https://issues.apache.org/jira/browse/YARN-3749 Project: Hadoop YARN Issue Type: Bug Reporter: Chun Chen Assignee: Chun Chen Attachments: YARN-3749.patch When I was trying to write a test case for YARN-2674, I found the DS client trying to connect to both rm1 and rm2 with the same address 0.0.0.0:18032 during RM failover, even though I initially set yarn.resourcemanager.address.rm1=0.0.0.0:18032 and yarn.resourcemanager.address.rm2=0.0.0.0:28032. After digging, I found it is in ClientRMService that the value of yarn.resourcemanager.address.rm2 gets changed to 0.0.0.0:18032. See the following code in ClientRMService: {code} clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST, YarnConfiguration.RM_ADDRESS, YarnConfiguration.DEFAULT_RM_ADDRESS, server.getListenerAddress()); {code} Since we use the same configuration instance for rm1 and rm2 and init both RMs before we start either of them, yarn.resourcemanager.ha.id is changed to rm2 during the init of rm2 and is therefore still rm2 while rm1 is starting. So I think it is safe to make a copy of the configuration when initializing each RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3080) The DockerContainerExecutor could not write the right pid to container pidFile
[ https://issues.apache.org/jira/browse/YARN-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341427#comment-14341427 ] Chun Chen commented on YARN-3080: - [~ashahab], I think we can simply fix this by using the pid of the session script bash process instead, since docker run blocks until the container exits. If the docker container exits, the session script bash process exits immediately. As for signalContainer, we can use docker kill --signal=SIGNAL containerId instead. The DockerContainerExecutor could not write the right pid to container pidFile -- Key: YARN-3080 URL: https://issues.apache.org/jira/browse/YARN-3080 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Beckham007 Assignee: Abin Shahab Attachments: YARN-3080.patch, YARN-3080.patch, YARN-3080.patch, YARN-3080.patch The docker_container_executor_session.sh is like this: {quote} #!/usr/bin/env bash echo `/usr/bin/docker inspect --format {{.State.Pid}} container_1421723685222_0008_01_02` /data/nm_restart/hadoop-2.4.1/data/yarn/local/nmPrivate/application_1421723685222_0008/container_1421723685222_0008_01_02/container_1421723685222_0008_01_02.pid.tmp /bin/mv -f /data/nm_restart/hadoop-2.4.1/data/yarn/local/nmPrivate/application_1421723685222_0008/container_1421723685222_0008_01_02/container_1421723685222_0008_01_02.pid.tmp /data/nm_restart/hadoop-2.4.1/data/yarn/local/nmPrivate/application_1421723685222_0008/container_1421723685222_0008_01_02/container_1421723685222_0008_01_02.pid /usr/bin/docker run --rm --name container_1421723685222_0008_01_02 -e GAIA_HOST_IP=c162 -e GAIA_API_SERVER=10.6.207.226:8080 -e GAIA_CLUSTER_ID=shpc-nm_restart -e GAIA_QUEUE=root.tdwadmin -e GAIA_APP_NAME=test_nm_docker -e GAIA_INSTANCE_ID=1 -e GAIA_CONTAINER_ID=container_1421723685222_0008_01_02 --memory=32M --cpu-shares=1024 -v /data/nm_restart/hadoop-2.4.1/data/yarn/container-logs/application_1421723685222_0008/container_1421723685222_0008_01_02:/data/nm_restart/hadoop-2.4.1/data/yarn/container-logs/application_1421723685222_0008/container_1421723685222_0008_01_02 -v /data/nm_restart/hadoop-2.4.1/data/yarn/local/usercache/tdwadmin/appcache/application_1421723685222_0008/container_1421723685222_0008_01_02:/data/nm_restart/hadoop-2.4.1/data/yarn/local/usercache/tdwadmin/appcache/application_1421723685222_0008/container_1421723685222_0008_01_02 -P -e A=B --privileged=true docker.oa.com:8080/library/centos7 bash /data/nm_restart/hadoop-2.4.1/data/yarn/local/usercache/tdwadmin/appcache/application_1421723685222_0008/container_1421723685222_0008_01_02/launch_container.sh {quote} The DockerContainerExecutor runs docker inspect before docker run, so docker inspect couldn't get the right pid for the docker container, and signalContainer() and NM restart would fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
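A minimal sketch of the suggested direction (hypothetical method and arguments, not an actual patch): the session script records its own bash pid via $$ before invoking docker run, which runs in the foreground and blocks until the container exits, so the recorded pid stays valid exactly as long as the container runs. Signals would then be delivered with docker kill --signal=SIGNAL rather than to that pid.
{code}
public class SessionScriptSketch {
  /** Compose a session script whose recorded pid is the bash process itself. */
  public static String sessionScript(String pidFile, String containerId,
      String image, String launchCmd) {
    return "#!/usr/bin/env bash\n"
        // $$ is the pid of this bash process; docker run below runs in the
        // foreground, so bash lives exactly as long as the container does.
        + "echo $$ > " + pidFile + ".tmp\n"
        + "/bin/mv -f " + pidFile + ".tmp " + pidFile + "\n"
        + "/usr/bin/docker run --rm --name " + containerId + " "
        + image + " " + launchCmd + "\n";
  }
}
{code}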
[jira] [Commented] (YARN-1983) Support heterogeneous container types at runtime on YARN
[ https://issues.apache.org/jira/browse/YARN-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14307247#comment-14307247 ] Chun Chen commented on YARN-1983: - Thanks for the comments, [~vinodkv] [~chris.douglas]. Do you mean we can simply use the NM_DOCKER_CONTAINER_EXECUTOR_IMAGE_NAME env to identify container types? IMHO, users can currently implement a custom ${yarn.nodemanager.container-executor.class} for their own scenarios; what if they want to use both LinuxCE and their own CE at runtime? I think NM_DOCKER_CONTAINER_EXECUTOR_IMAGE_NAME is not enough to distinguish these different container types. Besides, based on my proposal, if users don't want to use DockerCE, they don't need to change any configuration. I think my current patch is indeed intrusive, but more general, right? Support heterogeneous container types at runtime on YARN Key: YARN-1983 URL: https://issues.apache.org/jira/browse/YARN-1983 Project: Hadoop YARN Issue Type: Improvement Reporter: Junping Du Attachments: YARN-1983.2.patch, YARN-1983.patch Different container types (default, LXC, docker, VM box, etc.) have different semantics on isolation of security, namespace/env, performance, etc. Per the discussions in YARN-1964, we have some good thoughts on supporting different types of containers running on YARN, specified by the application at runtime, which would largely enhance YARN's flexibility to meet heterogeneous apps' isolation requirements at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1983) Support heterogeneous container types at runtime on YARN
[ https://issues.apache.org/jira/browse/YARN-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun Chen updated YARN-1983: Attachment: YARN-1983.2.patch Updated the patch to rewrite the unit tests. Support heterogeneous container types at runtime on YARN Key: YARN-1983 URL: https://issues.apache.org/jira/browse/YARN-1983 Project: Hadoop YARN Issue Type: Improvement Reporter: Junping Du Attachments: YARN-1983.2.patch, YARN-1983.patch Different container types (default, LXC, docker, VM box, etc.) have different semantics on isolation of security, namespace/env, performance, etc. Per the discussions in YARN-1964, we have some good thoughts on supporting different types of containers running on YARN, specified by the application at runtime, which would largely enhance YARN's flexibility to meet heterogeneous apps' isolation requirements at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3077) RM should create yarn.resourcemanager.zk-state-store.parent-path recursively
[ https://issues.apache.org/jira/browse/YARN-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14296263#comment-14296263 ] Chun Chen commented on YARN-3077: - [~ozawa], OK, uploaded a new patch to update it and changed the name to be self-explanatory. RM should create yarn.resourcemanager.zk-state-store.parent-path recursively Key: YARN-3077 URL: https://issues.apache.org/jira/browse/YARN-3077 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Chun Chen Attachments: YARN-3077.2.patch, YARN-3077.3.patch, YARN-3077.patch If multiple clusters share a zookeeper cluster, users might use /rmstore/${yarn.resourcemanager.cluster-id} as the state store path. If the user specifies a custom value which is not a top-level path for ${yarn.resourcemanager.zk-state-store.parent-path}, YARN should create the parent path first. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
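A minimal sketch of recursive parent creation (not the actual patch; the real store would go through its retry wrappers and configured ACLs): every missing ancestor of the configured parent path is created before the store uses it.
{code}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class RecursiveCreate {
  /** Create path and all of its missing ancestors, like mkdir -p. */
  public static void mkdirs(ZooKeeper zk, String path) throws Exception {
    StringBuilder sb = new StringBuilder();
    for (String part : path.split("/")) {
      if (part.isEmpty()) {
        continue; // skip the leading "/" and any double slashes
      }
      sb.append('/').append(part);
      if (zk.exists(sb.toString(), false) == null) {
        try {
          zk.create(sb.toString(), new byte[0],
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        } catch (KeeperException.NodeExistsException e) {
          // benign race with another creator; keep descending
        }
      }
    }
  }
}
{code}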
[jira] [Updated] (YARN-3077) RM should create yarn.resourcemanager.zk-state-store.parent-path recursively
[ https://issues.apache.org/jira/browse/YARN-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun Chen updated YARN-3077: Attachment: YARN-3077.3.patch RM should create yarn.resourcemanager.zk-state-store.parent-path recursively Key: YARN-3077 URL: https://issues.apache.org/jira/browse/YARN-3077 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Chun Chen Attachments: YARN-3077.2.patch, YARN-3077.3.patch, YARN-3077.patch If multiple clusters share a zookeeper cluster, users might use /rmstore/${yarn.resourcemanager.cluster-id} as the state store path. If the user specifies a custom value which is not a top-level path for ${yarn.resourcemanager.zk-state-store.parent-path}, YARN should create the parent path first. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2718) Create a CompositeConatainerExecutor that combines DockerContainerExecutor and DefaultContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14293050#comment-14293050 ] Chun Chen commented on YARN-2718: - Yes, my mistake. Thanks for pointing that out. Attached a fixed patch on YARN-1983. Create a CompositeConatainerExecutor that combines DockerContainerExecutor and DefaultContainerExecutor --- Key: YARN-2718 URL: https://issues.apache.org/jira/browse/YARN-2718 Project: Hadoop YARN Issue Type: New Feature Reporter: Abin Shahab Attachments: YARN-2718.patch There should be a composite container executor that allows users to run their jobs in DockerContainerExecutor, but switch to DefaultContainerExecutor for debugging purposes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1983) Support heterogeneous container types at runtime on YARN
[ https://issues.apache.org/jira/browse/YARN-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun Chen updated YARN-1983: Attachment: YARN-1983.patch Support heterogeneous container types at runtime on YARN Key: YARN-1983 URL: https://issues.apache.org/jira/browse/YARN-1983 Project: Hadoop YARN Issue Type: Improvement Reporter: Junping Du Attachments: YARN-1983.patch Different container types (default, LXC, docker, VM box, etc.) have different semantics on isolation of security, namespace/env, performance, etc. Per the discussions in YARN-1964, we have some good thoughts on supporting different types of containers running on YARN, specified by the application at runtime, which would largely enhance YARN's flexibility to meet heterogeneous apps' isolation requirements at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2992) ZKRMStateStore crashes due to session expiry
[ https://issues.apache.org/jira/browse/YARN-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292937#comment-14292937 ] Chun Chen commented on YARN-2992: - [~kasha] [~rohithsharma] [~jianhe], we are constantly facing the following errors. RM log: {code} 2015-01-27 00:13:19,379 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.196.128.13/10.196.128.13:2181. Will not attempt to authenticate using SASL (unknown error) 2015-01-27 00:13:19,383 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to 10.196.128.13/10.196.128.13:2181, initiating session 2015-01-27 00:13:19,404 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server 10.196.128.13/10.196.128.13:2181, sessionid = 0x24ab193421e4812, negotiated timeout = 1 2015-01-27 00:13:19,417 WARN org.apache.zookeeper.ClientCnxn: Session 0x24ab193421e4812 for server 10.196.128.13/10.196.128.13:2181, unexpected error, closing socket connection and attempting reconnect java.io.IOException: Broken pipe at sun.nio.ch.FileDispatcherImpl.write0(Native Method) at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) at sun.nio.ch.IOUtil.write(IOUtil.java:65) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:470) at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068) 2015-01-27 00:13:19,517 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:895) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:892) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1031) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1050) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:898) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.access$600(ZKRMStateStore.java:82) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$VerifyActiveStatusThread.run(ZKRMStateStore.java:1003) 2015-01-27 00:13:19,518 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Retrying operation on ZK. Retry no.
934 {code} ZK log: {code} 2015-01-27 00:13:19,300 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.240.92.100:46464 2015-01-27 00:13:19,302 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@861] - Client attempting to renew session 0x24ab193421e4812 at /10.240.92.100:46464 2015-01-27 00:13:19,302 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@108] - Revalidating client: 0x24ab193421e4812 2015-01-27 00:13:19,303 [myid:1] - INFO [QuorumPeer[myid=1]/0.0.0.0:2181:ZooKeeperServer@617] - Established session 0x24ab193421e4812 with negotiated timeout 1 for client /10.240.92.100:46464 2015-01-27 00:13:19,303 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@892] - got auth packet /10.240.92.100:46464 2015-01-27 00:13:19,303 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@926] - auth success /10.240.92.100:46464 2015-01-27 00:13:19,320 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x24ab193421e4812 due to java.io.IOException: Len error 1425415 2015-01-27 00:13:19,321 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /10.240.92.100:46464 which had sessionid 0x24ab193421e4812 2015-01-27 00:13:23,093 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.240.92.100:46477 2015-01-27 00:13:23,159 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@861] - Client attempting to renew session 0x24ab193421e4812 at /10.240.92.100:46477 2015-01-27 00:13:23,159 [myid:1] - INFO
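A hedged reading of the logs above, based on the jute.maxbuffer discussion quoted under YARN-3469 earlier in this digest rather than on this message itself: the ZK-side "Len error 1425415" suggests a packet larger than the default 1 MB jute.maxbuffer, after which the server drops the connection and the client retries in a loop. A sketch of the usual workaround (the 4 MB value is illustrative only):
{code}
public class JuteBufferWorkaround {
  public static void main(String[] args) {
    // Must be set before the ZooKeeper client is constructed; the ZK servers
    // need the same -Djute.maxbuffer raised in their JVM options as well.
    System.setProperty("jute.maxbuffer", String.valueOf(4 * 1024 * 1024));
  }
}
{code}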
[jira] [Commented] (YARN-2718) Create a CompositeConatainerExecutor that combines DockerContainerExecutor and DefaultContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292903#comment-14292903 ] Chun Chen commented on YARN-2718: - Well, the patch I've uploaded mainly focuses on letting apps running on the same YARN cluster specify different container executors. YARN currently only supports using a single container executor to launch containers. As [~guoleitao] said, we want to run both MapReduce jobs and Docker containers on the same cluster. I think it may be better for me to upload the patch on YARN-1983. As for debugging Docker containers, we implemented a service registry feature that registers the host IP and ports of running containers in etcd, a highly available key-value store, and we use a web shell to debug. Create a CompositeConatainerExecutor that combines DockerContainerExecutor and DefaultContainerExecutor --- Key: YARN-2718 URL: https://issues.apache.org/jira/browse/YARN-2718 Project: Hadoop YARN Issue Type: New Feature Reporter: Abin Shahab Attachments: YARN-2718.patch There should be a composite container that allows users to run their jobs in DockerContainerExecutor, but switch to DefaultContainerExecutor for debugging purposes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1983) Support heterogeneous container types at runtime on YARN
[ https://issues.apache.org/jira/browse/YARN-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14293060#comment-14293060 ] Chun Chen commented on YARN-1983: - Attaching a patch which creates a CompositeContainerExecutor to implement this. The patch allows apps to specify the container executor class in ContainerLaunchContext. It also changes ${yarn.nodemanager.container-executor.class} to accept a comma-separated list of container executor classes, and adds a new configuration, ${yarn.nodemanager.default.container-executor.class}, the default container executor used to launch containers when an app does not specify one. Support heterogeneous container types at runtime on YARN Key: YARN-1983 URL: https://issues.apache.org/jira/browse/YARN-1983 Project: Hadoop YARN Issue Type: Improvement Reporter: Junping Du Attachments: YARN-1983.patch Different container types (default, LXC, docker, VM box, etc.) have different semantics on isolation of security, namespace/env, performance, etc. Per discussions in YARN-1964, we have some good thoughts on supporting different types of containers running on YARN and specified by application at runtime, which largely enhances YARN's flexibility to meet heterogeneous apps' requirements on isolation at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
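As a rough sketch of what the proposed configuration might look like (the property values are examples only; the executor class names are the stock YARN ones, and this is not the patch itself):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Illustrative sketch of the proposed NM configuration, not the patch.
public class ExecutorConfigExample {
  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    // comma-separated list of container executors the NM may instantiate
    conf.set("yarn.nodemanager.container-executor.class",
        "org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor,"
            + "org.apache.hadoop.yarn.server.nodemanager.DockerContainerExecutor");
    // fallback executor for containers submitted without an explicit executor
    conf.set("yarn.nodemanager.default.container-executor.class",
        "org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor");
  }
}
{code}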
[jira] [Commented] (YARN-3077) RM should create yarn.resourcemanager.zk-state-store.parent-path recursively
[ https://issues.apache.org/jira/browse/YARN-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290363#comment-14290363 ] Chun Chen commented on YARN-3077: - Thanks for reviewing the patch, [~jianhe]. Uploaded a new patch addressing your comments. RM should create yarn.resourcemanager.zk-state-store.parent-path recursively Key: YARN-3077 URL: https://issues.apache.org/jira/browse/YARN-3077 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Chun Chen Attachments: YARN-3077.2.patch, YARN-3077.patch If multiple clusters share a ZooKeeper cluster, users might use /rmstore/${yarn.resourcemanager.cluster-id} as the state store path. If the user specifies a custom value which is not a top-level path for ${yarn.resourcemanager.zk-state-store.parent-path}, YARN should create the parent path first. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
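For illustration, recursive creation just walks the configured parent path and creates each missing ancestor; a minimal sketch against the plain ZooKeeper API (not the actual patch; retries and error handling elided):
{code}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Minimal sketch, not the actual patch: create every missing ancestor of a
// path such as /rmstore/cluster1 one level at a time. A concurrent creator
// can still win the race, so real code should tolerate NodeExists.
class ZKPaths {
  static void createRecursively(ZooKeeper zk, String path)
      throws KeeperException, InterruptedException {
    StringBuilder prefix = new StringBuilder();
    for (String part : path.substring(1).split("/")) { // assumes a leading "/"
      prefix.append('/').append(part);
      if (zk.exists(prefix.toString(), false) == null) {
        zk.create(prefix.toString(), new byte[0],
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
      }
    }
  }
}
{code}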
[jira] [Updated] (YARN-3077) RM should create yarn.resourcemanager.zk-state-store.parent-path recursively
[ https://issues.apache.org/jira/browse/YARN-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun Chen updated YARN-3077: Attachment: YARN-3077.2.patch RM should create yarn.resourcemanager.zk-state-store.parent-path recursively Key: YARN-3077 URL: https://issues.apache.org/jira/browse/YARN-3077 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Chun Chen Attachments: YARN-3077.2.patch, YARN-3077.patch If multiple clusters share a ZooKeeper cluster, users might use /rmstore/${yarn.resourcemanager.cluster-id} as the state store path. If the user specifies a custom value which is not a top-level path for ${yarn.resourcemanager.zk-state-store.parent-path}, YARN should create the parent path first. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3094) reset timer for liveness monitors after RM recovery
[ https://issues.apache.org/jira/browse/YARN-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290395#comment-14290395 ] Chun Chen commented on YARN-3094: - Since the RM can't receive pings from AMs until ApplicationMasterService starts, I think it is more accurate to reset the timer in the AMLivelinessMonitor service after ApplicationMasterService starts. I suggest initializing the AMLivelinessMonitor service after ApplicationMasterService in RMActiveServices#serviceInit. reset timer for liveness monitors after RM recovery --- Key: YARN-3094 URL: https://issues.apache.org/jira/browse/YARN-3094 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3094.patch When the RM restarts, it will recover RMAppAttempts and register them with the AMLivelinessMonitor if they are not in a final state. AMs will time out in the RM if the recovery process takes a long time for some reason (e.g. too many apps). In our system, we found the recovery process took about 3 minutes, and all AMs timed out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
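To make the suggestion concrete, here is a rough sketch of re-arming the timers once ApplicationMasterService is up (the method and collection names here are assumptions for illustration, not a patch):
{code}
import java.util.Collection;

import org.apache.hadoop.yarn.api.records.ApplicationAttemptId;
import org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.AMLivelinessMonitor;

// Rough sketch, not a patch: register recovered attempts only after
// ApplicationMasterService has started, so each attempt's liveness clock
// starts from a moment when its AM can actually reach the RM again.
// AbstractLivelinessMonitor#register stamps "now" as the last-heartbeat time.
class LivenessRearm {
  static void rearm(AMLivelinessMonitor monitor,
      Collection<ApplicationAttemptId> recoveredRunningAttempts) { // assumed collection
    for (ApplicationAttemptId attemptId : recoveredRunningAttempts) {
      monitor.register(attemptId);
    }
  }
}
{code}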
[jira] [Commented] (YARN-3077) RM should create yarn.resourcemanager.zk-state-store.parent-path recursively
[ https://issues.apache.org/jira/browse/YARN-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285419#comment-14285419 ] Chun Chen commented on YARN-3077: - The failed tests passed on my own laptop. RM should create yarn.resourcemanager.zk-state-store.parent-path recursively Key: YARN-3077 URL: https://issues.apache.org/jira/browse/YARN-3077 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Chun Chen Attachments: YARN-3077.patch If multiple clusters share a ZooKeeper cluster, users might use /rmstore/${yarn.resourcemanager.cluster-id} as the state store path. If the user specifies a custom value which is not a top-level path for ${yarn.resourcemanager.zk-state-store.parent-path}, YARN should create the parent path first. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2466) Umbrella issue for Yarn launched Docker Containers
[ https://issues.apache.org/jira/browse/YARN-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285434#comment-14285434 ] Chun Chen commented on YARN-2466: - I think this feature is currently only an alpha version, and the author [~ashahab] does not seem active on it right now. I uploaded a patch that creates a CompositeContainerExecutor to allow running different types of containers with different container executors at the same time. See https://issues.apache.org/jira/browse/YARN-2718. Umbrella issue for Yarn launched Docker Containers -- Key: YARN-2466 URL: https://issues.apache.org/jira/browse/YARN-2466 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.4.1 Reporter: Abin Shahab Assignee: Abin Shahab Docker (https://www.docker.io/) is, increasingly, a very popular container technology. In context of YARN, the support for Docker will provide a very elegant solution to allow applications to package their software into a Docker container (entire Linux file system incl. custom versions of perl, python etc.) and use it as a blueprint to launch all their YARN containers with requisite software environment. This provides both consistency (all YARN containers will have the same software environment) and isolation (no interference with whatever is installed on the physical machine). In addition to software isolation mentioned above, Docker containers will provide resource, network, and user-namespace isolation. Docker provides resource isolation through cgroups, similar to LinuxContainerExecutor. This prevents one job from taking other jobs' resources (memory and CPU) on the same Hadoop cluster. User-namespace isolation will ensure that the root on the container is mapped to an unprivileged user on the host. This is currently being added to Docker. Network isolation will ensure that one user’s network traffic is completely isolated from another user’s network traffic. Last but not least, the interaction of Docker and Kerberos will have to be worked out. These Docker containers must work in a secure Hadoop environment. Additional details are here: https://wiki.apache.org/hadoop/dineshs/IsolatingYarnAppsInDockerContainers -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2718) Create a CompositeConatainerExecutor that combines DockerContainerExecutor and DefaultContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun Chen updated YARN-2718: Attachment: YARN-2718.patch Uploading my patch to share my thoughts on creating a CompositeContainerExecutor that enables running multiple types of containers with different container executors at the same time, with no need to switch container executors. Create a CompositeConatainerExecutor that combines DockerContainerExecutor and DefaultContainerExecutor --- Key: YARN-2718 URL: https://issues.apache.org/jira/browse/YARN-2718 Project: Hadoop YARN Issue Type: New Feature Reporter: Abin Shahab Attachments: YARN-2718.patch There should be a composite container that allows users to run their jobs in DockerContainerExecutor, but switch to DefaultContainerExecutor for debugging purposes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
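To make the idea concrete without restating the patch, the delegation core might look like this (the CONTAINER_EXECUTOR env var name is an assumption for illustration, not what the patch uses):
{code}
import java.util.Map;

import org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor;
import org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container;

// Hedged sketch of the delegation idea, not the attached patch: keep one
// instance per configured executor class and route each container to the
// executor its app asked for, falling back to the configured default.
class CompositeExecutorSketch {
  private final Map<String, ContainerExecutor> executors; // keyed by class name
  private final ContainerExecutor defaultExecutor;

  CompositeExecutorSketch(Map<String, ContainerExecutor> executors,
      ContainerExecutor defaultExecutor) {
    this.executors = executors;
    this.defaultExecutor = defaultExecutor;
  }

  ContainerExecutor select(Container container) {
    // CONTAINER_EXECUTOR is an assumed env var name, not the patch's actual one
    String requested =
        container.getLaunchContext().getEnvironment().get("CONTAINER_EXECUTOR");
    ContainerExecutor chosen = requested == null ? null : executors.get(requested);
    return chosen != null ? chosen : defaultExecutor;
  }
}
{code}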
[jira] [Commented] (YARN-3077) RM should create yarn.resourcemanager.zk-state-store.parent-path recursively
[ https://issues.apache.org/jira/browse/YARN-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285174#comment-14285174 ] Chun Chen commented on YARN-3077: - Thanks. :) RM should create yarn.resourcemanager.zk-state-store.parent-path recursively Key: YARN-3077 URL: https://issues.apache.org/jira/browse/YARN-3077 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Chun Chen If multiple clusters share a ZooKeeper cluster, users might use /rmstore/${yarn.resourcemanager.cluster-id} as the state store path. If the user specifies a custom value which is not a top-level path for ${yarn.resourcemanager.zk-state-store.parent-path}, YARN should create the parent path first. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3077) RM should create yarn.resourcemanager.zk-state-store.parent-path recursively
[ https://issues.apache.org/jira/browse/YARN-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285161#comment-14285161 ] Chun Chen commented on YARN-3077: - [~varun_saxena] I would like to do it myself. RM should create yarn.resourcemanager.zk-state-store.parent-path recursively Key: YARN-3077 URL: https://issues.apache.org/jira/browse/YARN-3077 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Chun Chen Assignee: Varun Saxena If multiple clusters share a ZooKeeper cluster, users might use /rmstore/${yarn.resourcemanager.cluster-id} as the state store path. If the user specifies a custom value which is not a top-level path for ${yarn.resourcemanager.zk-state-store.parent-path}, YARN should create the parent path first. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3077) RM should create yarn.resourcemanager.zk-state-store.parent-path recursively
Chun Chen created YARN-3077: --- Summary: RM should create yarn.resourcemanager.zk-state-store.parent-path recursively Key: YARN-3077 URL: https://issues.apache.org/jira/browse/YARN-3077 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Chun Chen If multiple clusters share a ZooKeeper cluster, users might use /rmstore/${yarn.resourcemanager.cluster-id} as the state store path. If the user specifies a custom value which is not a top-level path for ${yarn.resourcemanager.zk-state-store.parent-path}, YARN should create the parent path first. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3077) RM should create yarn.resourcemanager.zk-state-store.parent-path recursively
[ https://issues.apache.org/jira/browse/YARN-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun Chen updated YARN-3077: Attachment: YARN-3077.patch RM should create yarn.resourcemanager.zk-state-store.parent-path recursively Key: YARN-3077 URL: https://issues.apache.org/jira/browse/YARN-3077 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Chun Chen Attachments: YARN-3077.patch If multiple clusters share a ZooKeeper cluster, users might use /rmstore/${yarn.resourcemanager.cluster-id} as the state store path. If the user specifies a custom value which is not a top-level path for ${yarn.resourcemanager.zk-state-store.parent-path}, YARN should create the parent path first. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2262) Few fields displaying wrong values in Timeline server after RM restart
[ https://issues.apache.org/jira/browse/YARN-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14282328#comment-14282328 ] Chun Chen commented on YARN-2262: - I think the second exception is because TFile is immutable and FileSystemApplicationHistoryStore uses TFile as the underlying storage layer. Few fields displaying wrong values in Timeline server after RM restart -- Key: YARN-2262 URL: https://issues.apache.org/jira/browse/YARN-2262 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.4.0 Reporter: Nishan Shetty Assignee: Naganarasimha G R Attachments: Capture.PNG, Capture1.PNG, yarn-testos-historyserver-HOST-10-18-40-95.log, yarn-testos-resourcemanager-HOST-10-18-40-84.log, yarn-testos-resourcemanager-HOST-10-18-40-95.log Few fields displaying wrong values in Timeline server after RM restart State:null FinalStatus: UNDEFINED Started: 8-Jul-2014 14:58:08 Elapsed: 2562047397789hrs, 44mins, 47sec -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2674) Distributed shell AM may re-launch containers if RM work preserving restart happens
[ https://issues.apache.org/jira/browse/YARN-2674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun Chen updated YARN-2674: Attachment: YARN-2674.2.patch Distributed shell AM may re-launch containers if RM work preserving restart happens --- Key: YARN-2674 URL: https://issues.apache.org/jira/browse/YARN-2674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Chun Chen Attachments: YARN-2674.1.patch, YARN-2674.2.patch Currently, if RM work preserving restart happens while distributed shell is running, the distributed shell AM may re-launch all the containers, including new/running/complete. We must make sure it won't re-launch the running/complete containers. We need to remove allocated containers from AMRMClientImpl#remoteRequestsTable once the AM receives them from the RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2674) Distributed shell AM may re-launch containers if RM work preserving restart happens
[ https://issues.apache.org/jira/browse/YARN-2674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14231719#comment-14231719 ] Chun Chen commented on YARN-2674: - Thanks for the review, [~jianhe]. Uploaded a new patch addressing your comments. Looking at the patch again, I think other applications using AMRMClientImpl might have the same issue if they don't explicitly call removeContainerRequest. IMHO, it would be better if we could fix the issue within AMRMClientImpl. Any thoughts? Distributed shell AM may re-launch containers if RM work preserving restart happens --- Key: YARN-2674 URL: https://issues.apache.org/jira/browse/YARN-2674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Chun Chen Attachments: YARN-2674.1.patch, YARN-2674.2.patch Currently, if RM work preserving restart happens while distributed shell is running, the distributed shell AM may re-launch all the containers, including new/running/complete. We must make sure it won't re-launch the running/complete containers. We need to remove allocated containers from AMRMClientImpl#remoteRequestsTable once the AM receives them from the RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
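For AMs coding against AMRMClient directly, the explicit-removal workaround looks roughly like this (a sketch under assumed matching semantics, not the attached patch):
{code}
import java.util.Collection;
import java.util.List;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

// Sketch, not the attached patch: once a container is allocated, remove a
// matching ContainerRequest so the ask is not re-sent to the RM (and
// re-satisfied) after a work-preserving RM restart.
class AllocationAck {
  static void ackAllocation(AMRMClient<ContainerRequest> amRMClient,
      Container allocated) {
    // "*" means any host (ResourceRequest.ANY); real matching may be stricter
    List<? extends Collection<ContainerRequest>> matches =
        amRMClient.getMatchingRequests(allocated.getPriority(), "*",
            allocated.getResource());
    if (!matches.isEmpty() && !matches.get(0).isEmpty()) {
      amRMClient.removeContainerRequest(matches.get(0).iterator().next());
    }
  }
}
{code}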
[jira] [Commented] (YARN-2718) Create a CompositeConatainerExecutor that combines DockerContainerExecutor and DefaultContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14214690#comment-14214690 ] Chun Chen commented on YARN-2718: - [~ashahab], if you don't mind, may I work on this? I have already implemented it on our branch. Create a CompositeConatainerExecutor that combines DockerContainerExecutor and DefaultContainerExecutor --- Key: YARN-2718 URL: https://issues.apache.org/jira/browse/YARN-2718 Project: Hadoop YARN Issue Type: New Feature Reporter: Abin Shahab There should be a composite container that allows users to run their jobs in DockerContainerExecutor, but switch to DefaultContainerExecutor for debugging purposes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1964) Create Docker analog of the LinuxContainerExecutor in YARN
[ https://issues.apache.org/jira/browse/YARN-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201527#comment-14201527 ] Chun Chen commented on YARN-1964: - Hi [~ashahab], thanks for the patch. I'm also working on this, and we are running Docker containers on YARN now. Based on our experience, here are my comments:
1. Since yarn.nodemanager.docker-container-executor.image-name is an application-specified container launch environment argument which would be exported in launch_container.sh, and bash doesn't allow dot-separated environment variable names, you might need to change it to ApplicationConstants.Environment.DOCKER_IMAGE_NAME.
2. Remove "--net=host" from the docker run command, because cluster administrators might not want Docker containers to use the host network directly.
3. Define a new environment variable, e.g. ApplicationConstants.Environment.DOCKER_RUN_ARGS, to allow applications to specify custom options for the docker run command such as -P, -e, etc. This way, "--net=host" can also be specified from the application side.
4. Remove localDirMount from the docker run command; there is no need to mount the local dir.
5. Use ApplicationConstants.Environment.HADOOP_YARN_HOME instead of "HADOOP_YARN_HOME":
{code}
exclusionSet.add(HADOOP_YARN_HOME);
exclusionSet.add(HADOOP_COMMON_HOME);
exclusionSet.add(HADOOP_HDFS_HOME);
exclusionSet.add(HADOOP_COMMON_HOME);
exclusionSet.add(JAVA_HOME);
{code}
Create Docker analog of the LinuxContainerExecutor in YARN -- Key: YARN-1964 URL: https://issues.apache.org/jira/browse/YARN-1964 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.2.0 Reporter: Arun C Murthy Assignee: Abin Shahab Attachments: YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, yarn-1964-branch-2.2.0-docker.patch, yarn-1964-branch-2.2.0-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch Docker (https://www.docker.io/) is, increasingly, a very popular container technology. In context of YARN, the support for Docker will provide a very elegant solution to allow applications to *package* their software into a Docker container (entire Linux file system incl. custom versions of perl, python etc.) and use it as a blueprint to launch all their YARN containers with requisite software environment. This provides both consistency (all YARN containers will have the same software environment) and isolation (no interference with whatever is installed on the physical machine). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
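To illustrate comments 1 through 4 above (the env var names below are assumptions made for this sketch, not what the patch uses):
{code}
import java.util.Map;

import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;

// Illustration of comments 1-4 above; DOCKER_IMAGE_NAME and DOCKER_RUN_ARGS
// are assumed env var names. The image and extra docker-run options come
// from the container's launch environment, not from a dot-separated NM
// config key; nothing hard-codes --net=host and no local dir is mounted.
class DockerRunCommandSketch {
  static String build(ContainerLaunchContext ctx, String containerIdStr,
      String launchScript) {
    Map<String, String> env = ctx.getEnvironment();
    String image = env.get("DOCKER_IMAGE_NAME");                // set by the app
    String extraArgs = env.getOrDefault("DOCKER_RUN_ARGS", ""); // e.g. "-P --net=host"
    return String.format("docker run --name %s %s %s bash %s",
        containerIdStr, extraArgs, image, launchScript);
  }
}
{code}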
[jira] [Created] (YARN-2674) Distributed shell AM may re-launch containers if RM work preserving restart happens
Chun Chen created YARN-2674: --- Summary: Distributed shell AM may re-launch containers if RM work preserving restart happens Key: YARN-2674 URL: https://issues.apache.org/jira/browse/YARN-2674 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chun Chen Currently, if RM work preserving restart happens while distributed shell is running, the distributed shell AM may re-launch all the containers, including new/running/complete. We must make sure it won't re-launch the running/complete containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2674) Distributed shell AM may re-launch containers if RM work preserving restart happens
[ https://issues.apache.org/jira/browse/YARN-2674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun Chen updated YARN-2674: Attachment: YARN-2674.1.patch Distributed shell AM may re-launch containers if RM work preserving restart happens --- Key: YARN-2674 URL: https://issues.apache.org/jira/browse/YARN-2674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Chun Chen Attachments: YARN-2674.1.patch Currently, if RM work preserving restart happens while distributed shell is running, the distributed shell AM may re-launch all the containers, including new/running/complete. We must make sure it won't re-launch the running/complete containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2674) Distributed shell AM may re-launch containers if RM work preserving restart happens
[ https://issues.apache.org/jira/browse/YARN-2674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun Chen updated YARN-2674: Description: Currently, if RM work preserving restart happens while distributed shell is running, the distributed shell AM may re-launch all the containers, including new/running/complete. We must make sure it won't re-launch the running/complete containers. We need to remove allocated containers from AMRMClientImpl#remoteRequestsTable once the AM receives them from the RM. was: Currently, if RM work preserving restart happens while distributed shell is running, the distributed shell AM may re-launch all the containers, including new/running/complete. We must make sure it won't re-launch the running/complete containers. Distributed shell AM may re-launch containers if RM work preserving restart happens --- Key: YARN-2674 URL: https://issues.apache.org/jira/browse/YARN-2674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Chun Chen Attachments: YARN-2674.1.patch Currently, if RM work preserving restart happens while distributed shell is running, the distributed shell AM may re-launch all the containers, including new/running/complete. We must make sure it won't re-launch the running/complete containers. We need to remove allocated containers from AMRMClientImpl#remoteRequestsTable once the AM receives them from the RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2612) Some completed containers are not reported to NM
[ https://issues.apache.org/jira/browse/YARN-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun Chen updated YARN-2612: Description: We are testing RM work preserving restart and found the following logs when we ran a simple MapReduce task, PI. Some completed containers which were already pulled by the AM were never acknowledged back to the NM, so the NM continuously reported the completed containers even after the AM had finished.
{code}
2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
{code}
In YARN-1372, the NM reports completed containers to the RM until it gets an ACK from the RM. If the AM does not call allocate, which means the AM does not ack the RM, the RM will not ack the NM. We ([~chenchun]) have observed these two cases when running the MapReduce task 'pi': 1) The RM sends completed containers to the AM. After receiving them, the AM thinks it has done its work and does not need more resources, so it does not call allocate. 2) When the AM finishes, it cannot ack the RM because the AM itself has not finished yet. To solve this problem, we have two solutions: 1) When RMAppAttempt calls FinalTransition, the AppAttempt is finished, so the RM could send this AppAttempt's completed containers to the NM. 2) In FairScheduler#nodeUpdate, if a completed container sent by the NM does not have a corresponding RMContainer, the RM just acks it to the NM. We prefer solution 2 because it is clearer and more concise. However, the RM might ack the same completed containers to the NM many times. was: In YARN-1372, the NM reports completed containers to the RM until it gets an ACK from the RM. If the AM does not call allocate, which means the AM does not ack the RM, the RM will not ack the NM. We ([~chenchun]) have observed these two cases when running the MapReduce task 'pi': 1) The RM sends completed containers to the AM. After receiving them, the AM thinks it has done its work and does not need more resources, so it does not call allocate. 2) When the AM finishes, it cannot ack the RM because the AM itself has not finished yet. To solve this problem, we have two solutions: 1) When RMAppAttempt calls FinalTransition, the AppAttempt is finished, so the RM could send this AppAttempt's completed containers to the NM. 2) In FairScheduler#nodeUpdate, if a completed container sent by the NM does not have a corresponding RMContainer, the RM just acks it to the NM. We prefer solution 2 because it is clearer and more concise. However, the RM might ack the same completed containers to the NM many times. Some completed containers are not reported to NM Key: YARN-2612 URL: https://issues.apache.org/jira/browse/YARN-2612 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jun Gong Fix For: 2.6.0 Attachments: YARN-2612.2.patch, YARN-2612.patch We are testing RM work preserving restart and found the following logs when we ran a simple MapReduce task, PI. Some completed containers which were already pulled by the AM were never acknowledged back to the NM, so the NM continuously reported the completed containers even after the AM had finished.
{code}
2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
{code}
In YARN-1372, the NM reports completed containers to the RM until it gets an ACK from the RM. If the AM does not call allocate, which means the AM does not ack the RM, the RM will not ack the NM. We ([~chenchun]) have observed these two cases when running the MapReduce task 'pi'.
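Solution 2, sketched (the ack helper is hypothetical and the real scheduler plumbing differs; this is not the attached patch):
{code}
import java.util.List;

import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.ContainerStatus;
import org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainer;
import org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerEventType;

// Sketch of solution 2, not the attached patch: in the scheduler's
// nodeUpdate, a completed container with no corresponding RMContainer has
// already been pulled by the AM, so just ack it straight back to the NM.
abstract class NodeUpdateSketch {
  void handleCompletedContainers(List<ContainerStatus> completedContainers) {
    for (ContainerStatus completed : completedContainers) {
      RMContainer rmContainer = getRMContainer(completed.getContainerId());
      if (rmContainer == null) {
        // AM already pulled this container; ack the NM so it stops reporting
        ackToNodeManager(completed.getContainerId()); // hypothetical helper
        continue;
      }
      completedContainer(rmContainer, completed, RMContainerEventType.FINISHED);
    }
  }

  abstract RMContainer getRMContainer(ContainerId id);
  abstract void ackToNodeManager(ContainerId id);
  abstract void completedContainer(RMContainer c, ContainerStatus s,
      RMContainerEventType e);
}
{code}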