[jira] [Commented] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17802724#comment-17802724 ] Shilun Fan commented on YARN-6409: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > RM does not blacklist node for AM launch failures > - > > Key: YARN-6409 > URL: https://issues.apache.org/jira/browse/YARN-6409 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha2 >Reporter: Haibo Chen >Assignee: Kanwaljeet Sachdev >Priority: Major > Attachments: YARN-6409.00.patch, YARN-6409.01.patch, > YARN-6409.02.patch, YARN-6409.03.patch > > > Currently, node blacklisting upon AM failures only handles failures that > happen after AM container is launched (see > RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()). However, AM launch > can also fail if the NM, where the AM container is allocated, goes > unresponsive. Because it is not handled, scheduler may continue to allocate > AM containers on that same NM for the following app attempts. > {code} > Application application_1478721503753_0870 failed 2 times due to Error > launching appattempt_1478721503753_0870_02. Got exception: > java.io.IOException: Failed on local exception: java.io.IOException: > java.net.SocketTimeoutException: 6 millis timeout while waiting for > channel to be ready for read. ch : java.nio.channels.SocketChannel[connected > local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host > Details : local host is: "*.me.com/17.111.179.113"; destination host is: > "*.me.com":8041; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1475) > at org.apache.hadoop.ipc.Client.call(Client.java:1408) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > > at com.sun.proxy.$Proxy86.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96) > > at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) > > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > > at com.sun.proxy.$Proxy87.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120) > > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256) > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: java.net.SocketTimeoutException: 6 millis > timeout while waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) > > at > org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650) > > at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738) > at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524) > at org.apache.hadoop.ipc.Client.call(Client.java:1447) > ... 15 more > Caused by: java.net.SocketTimeoutException: 6 millis timeout while > waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read(BufferedInputStream.java:265) > at java.io.DataInputStream.readInt(DataInputStream.java:387) > at >
[jira] [Commented] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500079#comment-17500079 ] Hadoop QA commented on YARN-6409: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Logfile || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} | | {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 8s{color} | {color:red}{color} | {color:red} YARN-6409 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} | \\ \\ || Subsystem || Report/Notes || | JIRA Issue | YARN-6409 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12868160/YARN-6409.03.patch | | Console output | https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/1275/console | | versions | git=2.17.1 | | Powered by | Apache Yetus 0.13.0-SNAPSHOT https://yetus.apache.org | This message was automatically generated. > RM does not blacklist node for AM launch failures > - > > Key: YARN-6409 > URL: https://issues.apache.org/jira/browse/YARN-6409 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha2 >Reporter: Haibo Chen >Assignee: Kanwaljeet Sachdev >Priority: Major > Attachments: YARN-6409.00.patch, YARN-6409.01.patch, > YARN-6409.02.patch, YARN-6409.03.patch > > > Currently, node blacklisting upon AM failures only handles failures that > happen after AM container is launched (see > RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()). However, AM launch > can also fail if the NM, where the AM container is allocated, goes > unresponsive. Because it is not handled, scheduler may continue to allocate > AM containers on that same NM for the following app attempts. > {code} > Application application_1478721503753_0870 failed 2 times due to Error > launching appattempt_1478721503753_0870_02. Got exception: > java.io.IOException: Failed on local exception: java.io.IOException: > java.net.SocketTimeoutException: 6 millis timeout while waiting for > channel to be ready for read. ch : java.nio.channels.SocketChannel[connected > local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host > Details : local host is: "*.me.com/17.111.179.113"; destination host is: > "*.me.com":8041; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1475) > at org.apache.hadoop.ipc.Client.call(Client.java:1408) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > > at com.sun.proxy.$Proxy86.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96) > > at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) > > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > > at com.sun.proxy.$Proxy87.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120) > > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256) > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: java.net.SocketTimeoutException: 6 millis > timeout while waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) > > at > org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650) > > at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738) > at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524) > at
[jira] [Commented] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500071#comment-17500071 ] chaosju commented on YARN-6409: --- For the above scenario, the blacklist mechanism will also fail. > RM does not blacklist node for AM launch failures > - > > Key: YARN-6409 > URL: https://issues.apache.org/jira/browse/YARN-6409 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha2 >Reporter: Haibo Chen >Assignee: Kanwaljeet Sachdev >Priority: Major > Attachments: YARN-6409.00.patch, YARN-6409.01.patch, > YARN-6409.02.patch, YARN-6409.03.patch > > > Currently, node blacklisting upon AM failures only handles failures that > happen after AM container is launched (see > RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()). However, AM launch > can also fail if the NM, where the AM container is allocated, goes > unresponsive. Because it is not handled, scheduler may continue to allocate > AM containers on that same NM for the following app attempts. > {code} > Application application_1478721503753_0870 failed 2 times due to Error > launching appattempt_1478721503753_0870_02. Got exception: > java.io.IOException: Failed on local exception: java.io.IOException: > java.net.SocketTimeoutException: 6 millis timeout while waiting for > channel to be ready for read. ch : java.nio.channels.SocketChannel[connected > local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host > Details : local host is: "*.me.com/17.111.179.113"; destination host is: > "*.me.com":8041; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1475) > at org.apache.hadoop.ipc.Client.call(Client.java:1408) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > > at com.sun.proxy.$Proxy86.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96) > > at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) > > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > > at com.sun.proxy.$Proxy87.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120) > > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256) > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: java.net.SocketTimeoutException: 6 millis > timeout while waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) > > at > org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650) > > at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738) > at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524) > at org.apache.hadoop.ipc.Client.call(Client.java:1447) > ... 15 more > Caused by: java.net.SocketTimeoutException: 6 millis timeout while > waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read(BufferedInputStream.java:265) > at java.io.DataInputStream.readInt(DataInputStream.java:387) > at >
[jira] [Commented] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500067#comment-17500067 ] chaosju commented on YARN-6409: --- Application application_1638900618047_13607751 failed 2 times due to Error launching appattempt_1638900618047_13607751_02. Got exception: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container. This token is expired. current time is 1646452340499 found 1646208502269 Note: System times on machines may be out of sync. Check system time and time zones. at sun.reflect.GeneratedConstructorAccessor141.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateExceptionImpl(SerializedExceptionPBImpl.java:171) +1 at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:182) at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106) at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:124) at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:250) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) . Failing the application. > RM does not blacklist node for AM launch failures > - > > Key: YARN-6409 > URL: https://issues.apache.org/jira/browse/YARN-6409 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha2 >Reporter: Haibo Chen >Assignee: Kanwaljeet Sachdev >Priority: Major > Attachments: YARN-6409.00.patch, YARN-6409.01.patch, > YARN-6409.02.patch, YARN-6409.03.patch > > > Currently, node blacklisting upon AM failures only handles failures that > happen after AM container is launched (see > RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()). However, AM launch > can also fail if the NM, where the AM container is allocated, goes > unresponsive. Because it is not handled, scheduler may continue to allocate > AM containers on that same NM for the following app attempts. > {code} > Application application_1478721503753_0870 failed 2 times due to Error > launching appattempt_1478721503753_0870_02. Got exception: > java.io.IOException: Failed on local exception: java.io.IOException: > java.net.SocketTimeoutException: 6 millis timeout while waiting for > channel to be ready for read. ch : java.nio.channels.SocketChannel[connected > local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host > Details : local host is: "*.me.com/17.111.179.113"; destination host is: > "*.me.com":8041; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1475) > at org.apache.hadoop.ipc.Client.call(Client.java:1408) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > > at com.sun.proxy.$Proxy86.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96) > > at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) > > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > > at com.sun.proxy.$Proxy87.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120) > > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256) > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: java.net.SocketTimeoutException: 6 millis > timeout while waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 >
[jira] [Commented] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080872#comment-17080872 ] Hadoop QA commented on YARN-6409: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue} Docker mode activated. {color} | | {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 7s{color} | {color:red} YARN-6409 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} | \\ \\ || Subsystem || Report/Notes || | JIRA Issue | YARN-6409 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12868160/YARN-6409.03.patch | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/25854/console | | Powered by | Apache Yetus 0.8.0 http://yetus.apache.org | This message was automatically generated. > RM does not blacklist node for AM launch failures > - > > Key: YARN-6409 > URL: https://issues.apache.org/jira/browse/YARN-6409 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha2 >Reporter: Haibo Chen >Assignee: Kanwaljeet Sachdev >Priority: Major > Attachments: YARN-6409.00.patch, YARN-6409.01.patch, > YARN-6409.02.patch, YARN-6409.03.patch > > > Currently, node blacklisting upon AM failures only handles failures that > happen after AM container is launched (see > RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()). However, AM launch > can also fail if the NM, where the AM container is allocated, goes > unresponsive. Because it is not handled, scheduler may continue to allocate > AM containers on that same NM for the following app attempts. > {code} > Application application_1478721503753_0870 failed 2 times due to Error > launching appattempt_1478721503753_0870_02. Got exception: > java.io.IOException: Failed on local exception: java.io.IOException: > java.net.SocketTimeoutException: 6 millis timeout while waiting for > channel to be ready for read. ch : java.nio.channels.SocketChannel[connected > local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host > Details : local host is: "*.me.com/17.111.179.113"; destination host is: > "*.me.com":8041; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1475) > at org.apache.hadoop.ipc.Client.call(Client.java:1408) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > > at com.sun.proxy.$Proxy86.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96) > > at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) > > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > > at com.sun.proxy.$Proxy87.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120) > > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256) > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: java.net.SocketTimeoutException: 6 millis > timeout while waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) > > at > org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650) > > at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738) > at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524) > at org.apache.hadoop.ipc.Client.call(Client.java:1447) > ... 15 more > Caused by: java.net.SocketTimeoutException: 6 millis
[jira] [Commented] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690345#comment-16690345 ] Hadoop QA commented on YARN-6409: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue} Docker mode activated. {color} | | {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 6s{color} | {color:red} YARN-6409 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} | \\ \\ || Subsystem || Report/Notes || | JIRA Issue | YARN-6409 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12868160/YARN-6409.03.patch | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/22595/console | | Powered by | Apache Yetus 0.8.0 http://yetus.apache.org | This message was automatically generated. > RM does not blacklist node for AM launch failures > - > > Key: YARN-6409 > URL: https://issues.apache.org/jira/browse/YARN-6409 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha2 >Reporter: Haibo Chen >Assignee: Kanwaljeet Sachdev >Priority: Major > Attachments: YARN-6409.00.patch, YARN-6409.01.patch, > YARN-6409.02.patch, YARN-6409.03.patch > > > Currently, node blacklisting upon AM failures only handles failures that > happen after AM container is launched (see > RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()). However, AM launch > can also fail if the NM, where the AM container is allocated, goes > unresponsive. Because it is not handled, scheduler may continue to allocate > AM containers on that same NM for the following app attempts. > {code} > Application application_1478721503753_0870 failed 2 times due to Error > launching appattempt_1478721503753_0870_02. Got exception: > java.io.IOException: Failed on local exception: java.io.IOException: > java.net.SocketTimeoutException: 6 millis timeout while waiting for > channel to be ready for read. ch : java.nio.channels.SocketChannel[connected > local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host > Details : local host is: "*.me.com/17.111.179.113"; destination host is: > "*.me.com":8041; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1475) > at org.apache.hadoop.ipc.Client.call(Client.java:1408) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > > at com.sun.proxy.$Proxy86.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96) > > at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) > > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > > at com.sun.proxy.$Proxy87.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120) > > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256) > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: java.net.SocketTimeoutException: 6 millis > timeout while waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) > > at > org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650) > > at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738) > at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524) > at org.apache.hadoop.ipc.Client.call(Client.java:1447) > ... 15 more > Caused by: java.net.SocketTimeoutException: 6 millis
[jira] [Commented] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509970#comment-16509970 ] genericqa commented on YARN-6409: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue} Docker mode activated. {color} | | {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 7s{color} | {color:red} YARN-6409 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} | \\ \\ || Subsystem || Report/Notes || | JIRA Issue | YARN-6409 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12868160/YARN-6409.03.patch | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/21017/console | | Powered by | Apache Yetus 0.8.0-SNAPSHOT http://yetus.apache.org | This message was automatically generated. > RM does not blacklist node for AM launch failures > - > > Key: YARN-6409 > URL: https://issues.apache.org/jira/browse/YARN-6409 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha2 >Reporter: Haibo Chen >Assignee: Kanwaljeet Sachdev >Priority: Major > Attachments: YARN-6409.00.patch, YARN-6409.01.patch, > YARN-6409.02.patch, YARN-6409.03.patch > > > Currently, node blacklisting upon AM failures only handles failures that > happen after AM container is launched (see > RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()). However, AM launch > can also fail if the NM, where the AM container is allocated, goes > unresponsive. Because it is not handled, scheduler may continue to allocate > AM containers on that same NM for the following app attempts. > {code} > Application application_1478721503753_0870 failed 2 times due to Error > launching appattempt_1478721503753_0870_02. Got exception: > java.io.IOException: Failed on local exception: java.io.IOException: > java.net.SocketTimeoutException: 6 millis timeout while waiting for > channel to be ready for read. ch : java.nio.channels.SocketChannel[connected > local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host > Details : local host is: "*.me.com/17.111.179.113"; destination host is: > "*.me.com":8041; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1475) > at org.apache.hadoop.ipc.Client.call(Client.java:1408) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > > at com.sun.proxy.$Proxy86.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96) > > at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) > > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > > at com.sun.proxy.$Proxy87.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120) > > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256) > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: java.net.SocketTimeoutException: 6 millis > timeout while waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) > > at > org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650) > > at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738) > at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524) > at org.apache.hadoop.ipc.Client.call(Client.java:1447) > ... 15 more > Caused by: java.net.SocketTimeoutException: 6
[jira] [Commented] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16185063#comment-16185063 ] Hadoop QA commented on YARN-6409: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 18s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 3 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 12m 40s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 25s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 37s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 8m 44s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 55s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 18s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 30s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 32s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 32s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 23s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 1 new + 232 unchanged - 0 fixed = 233 total (was 232) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 32s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 9m 1s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 5s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 45m 59s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 15s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 83m 16s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:71bbb86 | | JIRA Issue | YARN-6409 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12868160/YARN-6409.03.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux f8fdb0b59768 4.4.0-43-generic #63-Ubuntu SMP Wed Oct 12 13:48:03 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | trunk / ca669f9 | | Default Java | 1.8.0_144 | | findbugs | v3.1.0-RC1 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/17691/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt | | unit | https://builds.apache.org/job/PreCommit-YARN-Build/17691/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt | |
[jira] [Commented] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16078962#comment-16078962 ] Rohith Sharma K S commented on YARN-6409: - Thanks Jason for your valuable inputs. bq. If this is a common occurrence then that needs to be root-caused. this really make sense to me before going with this JIRA fix. > RM does not blacklist node for AM launch failures > - > > Key: YARN-6409 > URL: https://issues.apache.org/jira/browse/YARN-6409 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha2 >Reporter: Haibo Chen >Assignee: Haibo Chen > Attachments: YARN-6409.00.patch, YARN-6409.01.patch, > YARN-6409.02.patch, YARN-6409.03.patch > > > Currently, node blacklisting upon AM failures only handles failures that > happen after AM container is launched (see > RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()). However, AM launch > can also fail if the NM, where the AM container is allocated, goes > unresponsive. Because it is not handled, scheduler may continue to allocate > AM containers on that same NM for the following app attempts. > {code} > Application application_1478721503753_0870 failed 2 times due to Error > launching appattempt_1478721503753_0870_02. Got exception: > java.io.IOException: Failed on local exception: java.io.IOException: > java.net.SocketTimeoutException: 6 millis timeout while waiting for > channel to be ready for read. ch : java.nio.channels.SocketChannel[connected > local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host > Details : local host is: "*.me.com/17.111.179.113"; destination host is: > "*.me.com":8041; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1475) > at org.apache.hadoop.ipc.Client.call(Client.java:1408) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > > at com.sun.proxy.$Proxy86.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96) > > at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) > > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > > at com.sun.proxy.$Proxy87.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120) > > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256) > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: java.net.SocketTimeoutException: 6 millis > timeout while waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) > > at > org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650) > > at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738) > at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524) > at org.apache.hadoop.ipc.Client.call(Client.java:1447) > ... 15 more > Caused by: java.net.SocketTimeoutException: 6 millis timeout while > waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read(BufferedInputStream.java:265) > at
[jira] [Commented] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16078762#comment-16078762 ] Jason Lowe commented on YARN-6409: -- Sorry for the long delay in responding -- was out of the office for quite a bit lately. The approach seems reasonable to me assuming we're getting some level of retries during the AM launch from the container manager proxy in AMLauncher. If we can't get an AM to launch on that node even after some retries then it is very likely a subsequent attempt will also fail. IMHO that's an appropriate time to blacklist. If the container manager proxy is _not_ doing the retries in this case then that would be the first place to fix -- we should be trying a bit harder to get the current attempt launched before jumping to blacklist conclusions. I'm confused on how the NM is capable of regularly heartbeating (and thus getting scheduled for AM launches) but also regularly not responding to launch requests. If this is a common occurrence then that needs to be root-caused. This proposed change is not really a fix for that, just a workaround in case it occurs. Without a fix it will lead to prolonged AM and task launch times, since I'm assuming AMs will see similar difficulties trying to launch tasks on a node if the RM cannot launch an AM on it. > RM does not blacklist node for AM launch failures > - > > Key: YARN-6409 > URL: https://issues.apache.org/jira/browse/YARN-6409 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha2 >Reporter: Haibo Chen >Assignee: Haibo Chen > Attachments: YARN-6409.00.patch, YARN-6409.01.patch, > YARN-6409.02.patch, YARN-6409.03.patch > > > Currently, node blacklisting upon AM failures only handles failures that > happen after AM container is launched (see > RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()). However, AM launch > can also fail if the NM, where the AM container is allocated, goes > unresponsive. Because it is not handled, scheduler may continue to allocate > AM containers on that same NM for the following app attempts. > {code} > Application application_1478721503753_0870 failed 2 times due to Error > launching appattempt_1478721503753_0870_02. Got exception: > java.io.IOException: Failed on local exception: java.io.IOException: > java.net.SocketTimeoutException: 6 millis timeout while waiting for > channel to be ready for read. ch : java.nio.channels.SocketChannel[connected > local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host > Details : local host is: "*.me.com/17.111.179.113"; destination host is: > "*.me.com":8041; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1475) > at org.apache.hadoop.ipc.Client.call(Client.java:1408) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > > at com.sun.proxy.$Proxy86.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96) > > at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) > > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > > at com.sun.proxy.$Proxy87.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120) > > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256) > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: java.net.SocketTimeoutException: 6 millis > timeout while waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) > > at > org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650) > > at
[jira] [Commented] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068806#comment-16068806 ] Haibo Chen commented on YARN-6409: -- Thanks Ray for suggestion. That sounds good to me. [~jlowe] [~djp] Do you guys have experience on this issue? We'd appreciate your input on which is the preferred. We have seen this issue with our customers fairly often. > RM does not blacklist node for AM launch failures > - > > Key: YARN-6409 > URL: https://issues.apache.org/jira/browse/YARN-6409 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha2 >Reporter: Haibo Chen >Assignee: Haibo Chen > Attachments: YARN-6409.00.patch, YARN-6409.01.patch, > YARN-6409.02.patch, YARN-6409.03.patch > > > Currently, node blacklisting upon AM failures only handles failures that > happen after AM container is launched (see > RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()). However, AM launch > can also fail if the NM, where the AM container is allocated, goes > unresponsive. Because it is not handled, scheduler may continue to allocate > AM containers on that same NM for the following app attempts. > {code} > Application application_1478721503753_0870 failed 2 times due to Error > launching appattempt_1478721503753_0870_02. Got exception: > java.io.IOException: Failed on local exception: java.io.IOException: > java.net.SocketTimeoutException: 6 millis timeout while waiting for > channel to be ready for read. ch : java.nio.channels.SocketChannel[connected > local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host > Details : local host is: "*.me.com/17.111.179.113"; destination host is: > "*.me.com":8041; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1475) > at org.apache.hadoop.ipc.Client.call(Client.java:1408) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > > at com.sun.proxy.$Proxy86.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96) > > at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) > > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > > at com.sun.proxy.$Proxy87.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120) > > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256) > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: java.net.SocketTimeoutException: 6 millis > timeout while waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) > > at > org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650) > > at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738) > at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524) > at org.apache.hadoop.ipc.Client.call(Client.java:1447) > ... 15 more > Caused by: java.net.SocketTimeoutException: 6 millis timeout while > waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read(BufferedInputStream.java:265) > at
[jira] [Commented] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068796#comment-16068796 ] Ray Chiang commented on YARN-6409: -- What about making this a configuration setting? It seems like this shows up more on larger clusters (higher chance of network down, more nodes to deal with, etc.). > RM does not blacklist node for AM launch failures > - > > Key: YARN-6409 > URL: https://issues.apache.org/jira/browse/YARN-6409 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha2 >Reporter: Haibo Chen >Assignee: Haibo Chen > Attachments: YARN-6409.00.patch, YARN-6409.01.patch, > YARN-6409.02.patch, YARN-6409.03.patch > > > Currently, node blacklisting upon AM failures only handles failures that > happen after AM container is launched (see > RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()). However, AM launch > can also fail if the NM, where the AM container is allocated, goes > unresponsive. Because it is not handled, scheduler may continue to allocate > AM containers on that same NM for the following app attempts. > {code} > Application application_1478721503753_0870 failed 2 times due to Error > launching appattempt_1478721503753_0870_02. Got exception: > java.io.IOException: Failed on local exception: java.io.IOException: > java.net.SocketTimeoutException: 6 millis timeout while waiting for > channel to be ready for read. ch : java.nio.channels.SocketChannel[connected > local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host > Details : local host is: "*.me.com/17.111.179.113"; destination host is: > "*.me.com":8041; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1475) > at org.apache.hadoop.ipc.Client.call(Client.java:1408) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > > at com.sun.proxy.$Proxy86.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96) > > at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) > > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > > at com.sun.proxy.$Proxy87.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120) > > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256) > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: java.net.SocketTimeoutException: 6 millis > timeout while waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) > > at > org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650) > > at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738) > at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524) > at org.apache.hadoop.ipc.Client.call(Client.java:1447) > ... 15 more > Caused by: java.net.SocketTimeoutException: 6 millis timeout while > waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read(BufferedInputStream.java:265) > at java.io.DataInputStream.readInt(DataInputStream.java:387)
[jira] [Commented] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16013424#comment-16013424 ] Rohith Sharma K S commented on YARN-6409: - IMO, I have concern for blacklisting a AM node on SocketTimeoutException. It would simply lead to loose one of the *capable/healthy node* get blacklisted. The NM might be too busy at particular point of time, but it is capable for launching containers without any doubt. There is NO way to purge blacklisted nodes unless blacklisted node count reaches threshold. In socketTimeException, many healthy nodes get blacklisted. I would like to hear opinion from other community folks who worked on AM blacklisting feature. cc :/ [~jlowe] [~vvasudev] [~vinodkv] [~sunilg] [~djp] > RM does not blacklist node for AM launch failures > - > > Key: YARN-6409 > URL: https://issues.apache.org/jira/browse/YARN-6409 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha2 >Reporter: Haibo Chen >Assignee: Haibo Chen > Attachments: YARN-6409.00.patch, YARN-6409.01.patch, > YARN-6409.02.patch, YARN-6409.03.patch > > > Currently, node blacklisting upon AM failures only handles failures that > happen after AM container is launched (see > RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()). However, AM launch > can also fail if the NM, where the AM container is allocated, goes > unresponsive. Because it is not handled, scheduler may continue to allocate > AM containers on that same NM for the following app attempts. > {code} > Application application_1478721503753_0870 failed 2 times due to Error > launching appattempt_1478721503753_0870_02. Got exception: > java.io.IOException: Failed on local exception: java.io.IOException: > java.net.SocketTimeoutException: 6 millis timeout while waiting for > channel to be ready for read. ch : java.nio.channels.SocketChannel[connected > local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host > Details : local host is: "*.me.com/17.111.179.113"; destination host is: > "*.me.com":8041; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1475) > at org.apache.hadoop.ipc.Client.call(Client.java:1408) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > > at com.sun.proxy.$Proxy86.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96) > > at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) > > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > > at com.sun.proxy.$Proxy87.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120) > > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256) > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: java.net.SocketTimeoutException: 6 millis > timeout while waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) > > at > org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650) > > at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738) > at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524) > at org.apache.hadoop.ipc.Client.call(Client.java:1447) > ... 15 more > Caused by: java.net.SocketTimeoutException: 6 millis timeout while > waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at >
[jira] [Commented] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16013321#comment-16013321 ] Robert Kanter commented on YARN-6409: - [~rohithsharma], any comments? > RM does not blacklist node for AM launch failures > - > > Key: YARN-6409 > URL: https://issues.apache.org/jira/browse/YARN-6409 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha2 >Reporter: Haibo Chen >Assignee: Haibo Chen > Attachments: YARN-6409.00.patch, YARN-6409.01.patch, > YARN-6409.02.patch, YARN-6409.03.patch > > > Currently, node blacklisting upon AM failures only handles failures that > happen after AM container is launched (see > RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()). However, AM launch > can also fail if the NM, where the AM container is allocated, goes > unresponsive. Because it is not handled, scheduler may continue to allocate > AM containers on that same NM for the following app attempts. > {code} > Application application_1478721503753_0870 failed 2 times due to Error > launching appattempt_1478721503753_0870_02. Got exception: > java.io.IOException: Failed on local exception: java.io.IOException: > java.net.SocketTimeoutException: 6 millis timeout while waiting for > channel to be ready for read. ch : java.nio.channels.SocketChannel[connected > local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host > Details : local host is: "*.me.com/17.111.179.113"; destination host is: > "*.me.com":8041; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1475) > at org.apache.hadoop.ipc.Client.call(Client.java:1408) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > > at com.sun.proxy.$Proxy86.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96) > > at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) > > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > > at com.sun.proxy.$Proxy87.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120) > > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256) > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: java.net.SocketTimeoutException: 6 millis > timeout while waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) > > at > org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650) > > at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738) > at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524) > at org.apache.hadoop.ipc.Client.call(Client.java:1447) > ... 15 more > Caused by: java.net.SocketTimeoutException: 6 millis timeout while > waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read(BufferedInputStream.java:265) > at java.io.DataInputStream.readInt(DataInputStream.java:387) > at > org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:367) > at >
[jira] [Commented] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16013320#comment-16013320 ] Robert Kanter commented on YARN-6409: - I imagine the depth will vary depending on the timing and other factors of the error. A depth of 3 sounds reasonable and we can always increase it later if we see that it's not enough. +1 > RM does not blacklist node for AM launch failures > - > > Key: YARN-6409 > URL: https://issues.apache.org/jira/browse/YARN-6409 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha2 >Reporter: Haibo Chen >Assignee: Haibo Chen > Attachments: YARN-6409.00.patch, YARN-6409.01.patch, > YARN-6409.02.patch, YARN-6409.03.patch > > > Currently, node blacklisting upon AM failures only handles failures that > happen after AM container is launched (see > RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()). However, AM launch > can also fail if the NM, where the AM container is allocated, goes > unresponsive. Because it is not handled, scheduler may continue to allocate > AM containers on that same NM for the following app attempts. > {code} > Application application_1478721503753_0870 failed 2 times due to Error > launching appattempt_1478721503753_0870_02. Got exception: > java.io.IOException: Failed on local exception: java.io.IOException: > java.net.SocketTimeoutException: 6 millis timeout while waiting for > channel to be ready for read. ch : java.nio.channels.SocketChannel[connected > local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host > Details : local host is: "*.me.com/17.111.179.113"; destination host is: > "*.me.com":8041; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1475) > at org.apache.hadoop.ipc.Client.call(Client.java:1408) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > > at com.sun.proxy.$Proxy86.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96) > > at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) > > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > > at com.sun.proxy.$Proxy87.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120) > > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256) > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: java.net.SocketTimeoutException: 6 millis > timeout while waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) > > at > org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650) > > at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738) > at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524) > at org.apache.hadoop.ipc.Client.call(Client.java:1447) > ... 15 more > Caused by: java.net.SocketTimeoutException: 6 millis timeout while > waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read(BufferedInputStream.java:265) > at
[jira] [Commented] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16011424#comment-16011424 ] Hadoop QA commented on YARN-6409: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 11s{color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 3 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 12m 20s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 32s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 22s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 31s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 15s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 54s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 16s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 29s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 26s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 22s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 1 new + 233 unchanged - 0 fixed = 234 total (was 233) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 31s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 15s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 1s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 58s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 16s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 38m 1s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 18s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 58m 19s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:14b5c93 | | JIRA Issue | YARN-6409 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12868160/YARN-6409.03.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux 9b5e90a59118 4.4.0-43-generic #63-Ubuntu SMP Wed Oct 12 13:48:03 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | trunk / c48f297 | | Default Java | 1.8.0_131 | | findbugs | v3.1.0-RC1 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/15933/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/15933/testReport/ | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/15933/console | | Powered by | Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org | This message was automatically generated. > RM does not blacklist node for AM launch failures > - > > Key: YARN-6409 > URL:
[jira] [Commented] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16011322#comment-16011322 ] Haibo Chen commented on YARN-6409: -- Thanks [~rkanter] for your review! I am not familiar with the RPC code, not sure if the SocketTimeoutException will always be at depth 3. But it looks to me that the raw exception can be wrapped multiple times. Do you think if it's safe to assume that NM connection is down once we see a SocketTimeoutException regardless of at which depth it appears? I'll update the patch to address checkstyle issue and test failures. > RM does not blacklist node for AM launch failures > - > > Key: YARN-6409 > URL: https://issues.apache.org/jira/browse/YARN-6409 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha2 >Reporter: Haibo Chen >Assignee: Haibo Chen > Attachments: YARN-6409.00.patch, YARN-6409.01.patch, > YARN-6409.02.patch > > > Currently, node blacklisting upon AM failures only handles failures that > happen after AM container is launched (see > RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()). However, AM launch > can also fail if the NM, where the AM container is allocated, goes > unresponsive. Because it is not handled, scheduler may continue to allocate > AM containers on that same NM for the following app attempts. > {code} > Application application_1478721503753_0870 failed 2 times due to Error > launching appattempt_1478721503753_0870_02. Got exception: > java.io.IOException: Failed on local exception: java.io.IOException: > java.net.SocketTimeoutException: 6 millis timeout while waiting for > channel to be ready for read. ch : java.nio.channels.SocketChannel[connected > local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host > Details : local host is: "*.me.com/17.111.179.113"; destination host is: > "*.me.com":8041; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1475) > at org.apache.hadoop.ipc.Client.call(Client.java:1408) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > > at com.sun.proxy.$Proxy86.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96) > > at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) > > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > > at com.sun.proxy.$Proxy87.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120) > > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256) > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: java.net.SocketTimeoutException: 6 millis > timeout while waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) > > at > org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650) > > at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738) > at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524) > at org.apache.hadoop.ipc.Client.call(Client.java:1447) > ... 15 more > Caused by: java.net.SocketTimeoutException: 6 millis timeout while > waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) > at
[jira] [Commented] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16011261#comment-16011261 ] Robert Kanter commented on YARN-6409: - The overall approach seems fine to me, but the {{TestRMAppAttemptTransitions}} failure looks related and doesn't fail without the patch. Can you take a look at it? Looks like you need to check {{instanceof}} before casting. Also, on the depth of 3 thing, is there a reason the {{SocketTimeoutException}} won't be at the same depth each time? > RM does not blacklist node for AM launch failures > - > > Key: YARN-6409 > URL: https://issues.apache.org/jira/browse/YARN-6409 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha2 >Reporter: Haibo Chen >Assignee: Haibo Chen > Attachments: YARN-6409.00.patch, YARN-6409.01.patch, > YARN-6409.02.patch > > > Currently, node blacklisting upon AM failures only handles failures that > happen after AM container is launched (see > RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()). However, AM launch > can also fail if the NM, where the AM container is allocated, goes > unresponsive. Because it is not handled, scheduler may continue to allocate > AM containers on that same NM for the following app attempts. > {code} > Application application_1478721503753_0870 failed 2 times due to Error > launching appattempt_1478721503753_0870_02. Got exception: > java.io.IOException: Failed on local exception: java.io.IOException: > java.net.SocketTimeoutException: 6 millis timeout while waiting for > channel to be ready for read. ch : java.nio.channels.SocketChannel[connected > local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host > Details : local host is: "*.me.com/17.111.179.113"; destination host is: > "*.me.com":8041; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1475) > at org.apache.hadoop.ipc.Client.call(Client.java:1408) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > > at com.sun.proxy.$Proxy86.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96) > > at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) > > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > > at com.sun.proxy.$Proxy87.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120) > > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256) > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: java.net.SocketTimeoutException: 6 millis > timeout while waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) > > at > org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650) > > at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738) > at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524) > at org.apache.hadoop.ipc.Client.call(Client.java:1447) > ... 15 more > Caused by: java.net.SocketTimeoutException: 6 millis timeout while > waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at
[jira] [Commented] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005507#comment-16005507 ] Hadoop QA commented on YARN-6409: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 17s{color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 14m 25s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 35s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 28s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 37s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 19s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 4s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 34s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 33s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 33s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 26s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 3 new + 156 unchanged - 0 fixed = 159 total (was 156) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 35s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 15s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 7s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 39m 20s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 21s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 63m 1s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.resourcemanager.TestRMRestart | | | hadoop.yarn.server.resourcemanager.rmapp.attempt.TestRMAppAttemptTransitions | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:14b5c93 | | JIRA Issue | YARN-6409 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12867418/YARN-6409.02.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux b154299c946a 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | trunk / ad1e3e4 | | Default Java | 1.8.0_121 | | findbugs | v3.1.0-RC1 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/15898/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt | | unit | https://builds.apache.org/job/PreCommit-YARN-Build/15898/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/15898/testReport/ | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U:
[jira] [Commented] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005365#comment-16005365 ] Haibo Chen commented on YARN-6409: -- Going three levels down so that the failures indicated in the above stacktrace can be captured. > RM does not blacklist node for AM launch failures > - > > Key: YARN-6409 > URL: https://issues.apache.org/jira/browse/YARN-6409 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha2 >Reporter: Haibo Chen >Assignee: Haibo Chen > Attachments: YARN-6409.00.patch, YARN-6409.01.patch, > YARN-6409.02.patch > > > Currently, node blacklisting upon AM failures only handles failures that > happen after AM container is launched (see > RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()). However, AM launch > can also fail if the NM, where the AM container is allocated, goes > unresponsive. Because it is not handled, scheduler may continue to allocate > AM containers on that same NM for the following app attempts. > {code} > Application application_1478721503753_0870 failed 2 times due to Error > launching appattempt_1478721503753_0870_02. Got exception: > java.io.IOException: Failed on local exception: java.io.IOException: > java.net.SocketTimeoutException: 6 millis timeout while waiting for > channel to be ready for read. ch : java.nio.channels.SocketChannel[connected > local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host > Details : local host is: "*.me.com/17.111.179.113"; destination host is: > "*.me.com":8041; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1475) > at org.apache.hadoop.ipc.Client.call(Client.java:1408) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > > at com.sun.proxy.$Proxy86.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96) > > at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) > > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > > at com.sun.proxy.$Proxy87.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120) > > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256) > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: java.net.SocketTimeoutException: 6 millis > timeout while waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) > > at > org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650) > > at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738) > at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524) > at org.apache.hadoop.ipc.Client.call(Client.java:1447) > ... 15 more > Caused by: java.net.SocketTimeoutException: 6 millis timeout while > waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read(BufferedInputStream.java:265) > at java.io.DataInputStream.readInt(DataInputStream.java:387) > at > org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:367) >
[jira] [Commented] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15993457#comment-15993457 ] Haibo Chen commented on YARN-6409: -- Yeah, NMs can be unresponsive during the heartbeat window, from what we can tell. > RM does not blacklist node for AM launch failures > - > > Key: YARN-6409 > URL: https://issues.apache.org/jira/browse/YARN-6409 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha2 >Reporter: Haibo Chen >Assignee: Haibo Chen > Attachments: YARN-6409.00.patch, YARN-6409.01.patch > > > Currently, node blacklisting upon AM failures only handles failures that > happen after AM container is launched (see > RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()). However, AM launch > can also fail if the NM, where the AM container is allocated, goes > unresponsive. Because it is not handled, scheduler may continue to allocate > AM containers on that same NM for the following app attempts. > {code} > Application application_1478721503753_0870 failed 2 times due to Error > launching appattempt_1478721503753_0870_02. Got exception: > java.io.IOException: Failed on local exception: java.io.IOException: > java.net.SocketTimeoutException: 6 millis timeout while waiting for > channel to be ready for read. ch : java.nio.channels.SocketChannel[connected > local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host > Details : local host is: "*.me.com/17.111.179.113"; destination host is: > "*.me.com":8041; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1475) > at org.apache.hadoop.ipc.Client.call(Client.java:1408) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > > at com.sun.proxy.$Proxy86.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96) > > at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) > > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > > at com.sun.proxy.$Proxy87.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120) > > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256) > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: java.net.SocketTimeoutException: 6 millis > timeout while waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) > > at > org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650) > > at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738) > at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524) > at org.apache.hadoop.ipc.Client.call(Client.java:1447) > ... 15 more > Caused by: java.net.SocketTimeoutException: 6 millis timeout while > waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read(BufferedInputStream.java:265) > at java.io.DataInputStream.readInt(DataInputStream.java:387) > at > org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:367) > at >
[jira] [Commented] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15988555#comment-15988555 ] Rohith Sharma K S commented on YARN-6409: - bq. Do you mean to say there is NO way to remove blacklisted nodes from AM container? Yup bq. RM will often try the subsequent AM launch attempts on the same node and fail the application eventually, Do you mean NM is unresponsive but NM heartbeats are sent to RM? > RM does not blacklist node for AM launch failures > - > > Key: YARN-6409 > URL: https://issues.apache.org/jira/browse/YARN-6409 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha2 >Reporter: Haibo Chen >Assignee: Haibo Chen > Attachments: YARN-6409.00.patch, YARN-6409.01.patch > > > Currently, node blacklisting upon AM failures only handles failures that > happen after AM container is launched (see > RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()). However, AM launch > can also fail if the NM, where the AM container is allocated, goes > unresponsive. Because it is not handled, scheduler may continue to allocate > AM containers on that same NM for the following app attempts. > {code} > Application application_1478721503753_0870 failed 2 times due to Error > launching appattempt_1478721503753_0870_02. Got exception: > java.io.IOException: Failed on local exception: java.io.IOException: > java.net.SocketTimeoutException: 6 millis timeout while waiting for > channel to be ready for read. ch : java.nio.channels.SocketChannel[connected > local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host > Details : local host is: "*.me.com/17.111.179.113"; destination host is: > "*.me.com":8041; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1475) > at org.apache.hadoop.ipc.Client.call(Client.java:1408) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > > at com.sun.proxy.$Proxy86.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96) > > at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) > > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > > at com.sun.proxy.$Proxy87.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120) > > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256) > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: java.net.SocketTimeoutException: 6 millis > timeout while waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) > > at > org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650) > > at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738) > at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524) > at org.apache.hadoop.ipc.Client.call(Client.java:1447) > ... 15 more > Caused by: java.net.SocketTimeoutException: 6 millis timeout while > waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at
[jira] [Commented] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987332#comment-15987332 ] Haibo Chen commented on YARN-6409: -- bq. there is way to remove blacklisted nodes for AM containers Do you mean to say there is NO way to remove blacklisted nodes from AM container? I thought about being more specific on the types of launch failures that we want to blacklist a node for, but from the stacktrace, it seems that SockeTimeoutException (which is the failure for which I want blacklist a node) is wrapped deep inside an IOException. Agree that we should be more specific here. bq. My doubt is why NM with connectivity issues are blacklisted? Ideally, they shouldn't. But we have observed that if NM do not recover form connectivity issues quickly enough (maybe say JVM was not responsive), RM will often try the subsequent AM launch attempts on the same node and fail the application eventually, which is highly undesirable. This spirit is the same as when we blacklist a node for an invalid container exist status. To have all nodes blacklisted because of this change, all of them need to have connectivity issues during the window when RM tries to launch the app AM container. > RM does not blacklist node for AM launch failures > - > > Key: YARN-6409 > URL: https://issues.apache.org/jira/browse/YARN-6409 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha2 >Reporter: Haibo Chen >Assignee: Haibo Chen > Attachments: YARN-6409.00.patch, YARN-6409.01.patch > > > Currently, node blacklisting upon AM failures only handles failures that > happen after AM container is launched (see > RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()). However, AM launch > can also fail if the NM, where the AM container is allocated, goes > unresponsive. Because it is not handled, scheduler may continue to allocate > AM containers on that same NM for the following app attempts. > {code} > Application application_1478721503753_0870 failed 2 times due to Error > launching appattempt_1478721503753_0870_02. Got exception: > java.io.IOException: Failed on local exception: java.io.IOException: > java.net.SocketTimeoutException: 6 millis timeout while waiting for > channel to be ready for read. ch : java.nio.channels.SocketChannel[connected > local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host > Details : local host is: "*.me.com/17.111.179.113"; destination host is: > "*.me.com":8041; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1475) > at org.apache.hadoop.ipc.Client.call(Client.java:1408) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > > at com.sun.proxy.$Proxy86.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96) > > at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) > > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > > at com.sun.proxy.$Proxy87.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120) > > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256) > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: java.net.SocketTimeoutException: 6 millis > timeout while waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) > > at > org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650) > > at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738) > at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) > at
[jira] [Commented] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15986389#comment-15986389 ] Rohith Sharma K S commented on YARN-6409: - As of today there is way to remove blacklisted nodes for AM containers. Should be very careful to blacklist AM nodes. I see from patch that for all IOException during AM container launch failure are blacklisted. This would be a risk. My doubt is why NM with connectivity issues are blacklisted? These are recoverable and will be back to normal. Moreover these are capable of launching containers. Lets say if AM have resource constraint on specific nodes and these nodes get blacklisted, then there is no way to launch this AM on any nodes and hang forever. > RM does not blacklist node for AM launch failures > - > > Key: YARN-6409 > URL: https://issues.apache.org/jira/browse/YARN-6409 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha2 >Reporter: Haibo Chen >Assignee: Haibo Chen > Attachments: YARN-6409.00.patch, YARN-6409.01.patch > > > Currently, node blacklisting upon AM failures only handles failures that > happen after AM container is launched (see > RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()). However, AM launch > can also fail if the NM, where the AM container is allocated, goes > unresponsive. Because it is not handled, scheduler may continue to allocate > AM containers on that same NM for the following app attempts. > {code} > Application application_1478721503753_0870 failed 2 times due to Error > launching appattempt_1478721503753_0870_02. Got exception: > java.io.IOException: Failed on local exception: java.io.IOException: > java.net.SocketTimeoutException: 6 millis timeout while waiting for > channel to be ready for read. ch : java.nio.channels.SocketChannel[connected > local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host > Details : local host is: "*.me.com/17.111.179.113"; destination host is: > "*.me.com":8041; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1475) > at org.apache.hadoop.ipc.Client.call(Client.java:1408) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > > at com.sun.proxy.$Proxy86.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96) > > at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) > > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > > at com.sun.proxy.$Proxy87.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120) > > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256) > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: java.net.SocketTimeoutException: 6 millis > timeout while waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) > > at > org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650) > > at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738) > at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524) > at org.apache.hadoop.ipc.Client.call(Client.java:1447) > ... 15 more > Caused by: java.net.SocketTimeoutException: 6 millis timeout while > waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) > at
[jira] [Commented] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983707#comment-15983707 ] Haibo Chen commented on YARN-6409: -- The findbugs issues and unit test failures are unrelated. > RM does not blacklist node for AM launch failures > - > > Key: YARN-6409 > URL: https://issues.apache.org/jira/browse/YARN-6409 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha2 >Reporter: Haibo Chen >Assignee: Haibo Chen > Attachments: YARN-6409.00.patch, YARN-6409.01.patch > > > Currently, node blacklisting upon AM failures only handles failures that > happen after AM container is launched (see > RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()). However, AM launch > can also fail if the NM, where the AM container is allocated, goes > unresponsive. Because it is not handled, scheduler may continue to allocate > AM containers on that same NM for the following app attempts. > {code} > Application application_1478721503753_0870 failed 2 times due to Error > launching appattempt_1478721503753_0870_02. Got exception: > java.io.IOException: Failed on local exception: java.io.IOException: > java.net.SocketTimeoutException: 6 millis timeout while waiting for > channel to be ready for read. ch : java.nio.channels.SocketChannel[connected > local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host > Details : local host is: "*.me.com/17.111.179.113"; destination host is: > "*.me.com":8041; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1475) > at org.apache.hadoop.ipc.Client.call(Client.java:1408) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > > at com.sun.proxy.$Proxy86.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96) > > at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) > > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > > at com.sun.proxy.$Proxy87.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120) > > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256) > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: java.net.SocketTimeoutException: 6 millis > timeout while waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) > > at > org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650) > > at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738) > at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524) > at org.apache.hadoop.ipc.Client.call(Client.java:1447) > ... 15 more > Caused by: java.net.SocketTimeoutException: 6 millis timeout while > waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read(BufferedInputStream.java:265) > at java.io.DataInputStream.readInt(DataInputStream.java:387) > at > org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:367) > at >
[jira] [Commented] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15975756#comment-15975756 ] Hadoop QA commented on YARN-6409: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 18s{color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 14m 31s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 34s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 27s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 36s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 20s{color} | {color:green} trunk passed {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 2s{color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager in trunk has 8 extant Findbugs warnings. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 32s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 34s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 34s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 28s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 3 new + 156 unchanged - 0 fixed = 159 total (was 156) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 37s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 18s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 12s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 39m 29s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 17s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 63m 27s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.resourcemanager.rmapp.attempt.TestRMAppAttemptTransitions | | | hadoop.yarn.server.resourcemanager.TestRMRestart | | | hadoop.yarn.server.resourcemanager.TestResourceTrackerService | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ac17dc | | JIRA Issue | YARN-6409 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12864160/YARN-6409.01.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux 55be01de3bf2 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | trunk / c154935 | | Default Java | 1.8.0_121 | | findbugs | v3.1.0-RC1 | | findbugs | https://builds.apache.org/job/PreCommit-YARN-Build/15681/artifact/patchprocess/branch-findbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-warnings.html | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/15681/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt | | unit |
[jira] [Commented] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15975680#comment-15975680 ] Haibo Chen commented on YARN-6409: -- Ah, I see why the test was failing. I was trying to fail the AM launch with an instance of Exception, which is not counted as a reason to blacklist the node. Patch updated to address that. > RM does not blacklist node for AM launch failures > - > > Key: YARN-6409 > URL: https://issues.apache.org/jira/browse/YARN-6409 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha2 >Reporter: Haibo Chen >Assignee: Haibo Chen > Attachments: YARN-6409.00.patch > > > Currently, node blacklisting upon AM failures only handles failures that > happen after AM container is launched (see > RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()). However, AM launch > can also fail if the NM, where the AM container is allocated, goes > unresponsive. Because it is not handled, scheduler may continue to allocate > AM containers on that same NM for the following app attempts. > {code} > Application application_1478721503753_0870 failed 2 times due to Error > launching appattempt_1478721503753_0870_02. Got exception: > java.io.IOException: Failed on local exception: java.io.IOException: > java.net.SocketTimeoutException: 6 millis timeout while waiting for > channel to be ready for read. ch : java.nio.channels.SocketChannel[connected > local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host > Details : local host is: "*.me.com/17.111.179.113"; destination host is: > "*.me.com":8041; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1475) > at org.apache.hadoop.ipc.Client.call(Client.java:1408) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > > at com.sun.proxy.$Proxy86.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96) > > at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) > > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > > at com.sun.proxy.$Proxy87.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120) > > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256) > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: java.net.SocketTimeoutException: 6 millis > timeout while waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) > > at > org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650) > > at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738) > at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524) > at org.apache.hadoop.ipc.Client.call(Client.java:1447) > ... 15 more > Caused by: java.net.SocketTimeoutException: 6 millis timeout while > waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read(BufferedInputStream.java:265) > at java.io.DataInputStream.readInt(DataInputStream.java:387) > at >
[jira] [Commented] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971314#comment-15971314 ] Haibo Chen commented on YARN-6409: -- The added unit test in the patch is flaky. Will address it in a new patch > RM does not blacklist node for AM launch failures > - > > Key: YARN-6409 > URL: https://issues.apache.org/jira/browse/YARN-6409 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha2 >Reporter: Haibo Chen >Assignee: Haibo Chen > Attachments: YARN-6409.00.patch > > > Currently, node blacklisting upon AM failures only handles failures that > happen after AM container is launched (see > RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()). However, AM launch > can also fail if the NM, where the AM container is allocated, goes > unresponsive. Because it is not handled, scheduler may continue to allocate > AM containers on that same NM for the following app attempts. > {code} > Application application_1478721503753_0870 failed 2 times due to Error > launching appattempt_1478721503753_0870_02. Got exception: > java.io.IOException: Failed on local exception: java.io.IOException: > java.net.SocketTimeoutException: 6 millis timeout while waiting for > channel to be ready for read. ch : java.nio.channels.SocketChannel[connected > local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host > Details : local host is: "*.me.com/17.111.179.113"; destination host is: > "*.me.com":8041; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1475) > at org.apache.hadoop.ipc.Client.call(Client.java:1408) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > > at com.sun.proxy.$Proxy86.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96) > > at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) > > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > > at com.sun.proxy.$Proxy87.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120) > > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256) > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: java.net.SocketTimeoutException: 6 millis > timeout while waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) > > at > org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650) > > at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738) > at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524) > at org.apache.hadoop.ipc.Client.call(Client.java:1447) > ... 15 more > Caused by: java.net.SocketTimeoutException: 6 millis timeout while > waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read(BufferedInputStream.java:265) > at java.io.DataInputStream.readInt(DataInputStream.java:387) > at > org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:367) > at >
[jira] [Commented] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963647#comment-15963647 ] Hadoop QA commented on YARN-6409: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 24s{color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 15m 27s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 47s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 32s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 50s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 21s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 35s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 30s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 46s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 45s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 45s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 30s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 2 new + 157 unchanged - 0 fixed = 159 total (was 157) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 45s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 17s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 24s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 41m 4s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 19s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 68m 13s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.resourcemanager.rmapp.attempt.TestRMAppAttemptTransitions | | | hadoop.yarn.server.resourcemanager.TestNodeBlacklistingOnAMFailures | | | hadoop.yarn.server.resourcemanager.TestRMRestart | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:612578f | | JIRA Issue | YARN-6409 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12862750/YARN-6409.00.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux 99a7cfdb60b9 3.13.0-105-generic #152-Ubuntu SMP Fri Dec 2 15:37:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | trunk / e9ac61c | | Default Java | 1.8.0_121 | | findbugs | v3.0.0 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/15573/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt | | unit | https://builds.apache.org/job/PreCommit-YARN-Build/15573/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/15573/testReport/ | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U:
[jira] [Commented] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963580#comment-15963580 ] Haibo Chen commented on YARN-6409: -- Sorry for my delayed reply, [~rohithsharma]. I have added description to the problem I was thinking of, and a draft patch for review. > RM does not blacklist node for AM launch failures > - > > Key: YARN-6409 > URL: https://issues.apache.org/jira/browse/YARN-6409 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha2 >Reporter: Haibo Chen >Assignee: Haibo Chen > Attachments: YARN-6409.00.patch > > > Currently, node blacklisting upon AM failures only handles failures that > happen after AM container is launched (see > RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()). However, AM launch > can also fail if the NM, where the AM container is allocated, goes > unresponsive. Because it is not handled, scheduler may continue to allocate > AM containers on that same NM for the following app attempts. > {code} > Application application_1478721503753_0870 failed 2 times due to Error > launching appattempt_1478721503753_0870_02. Got exception: > java.io.IOException: Failed on local exception: java.io.IOException: > java.net.SocketTimeoutException: 6 millis timeout while waiting for > channel to be ready for read. ch : java.nio.channels.SocketChannel[connected > local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host > Details : local host is: "*.me.com/17.111.179.113"; destination host is: > "*.me.com":8041; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1475) > at org.apache.hadoop.ipc.Client.call(Client.java:1408) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > > at com.sun.proxy.$Proxy86.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96) > > at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) > > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > > at com.sun.proxy.$Proxy87.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120) > > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256) > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: java.net.SocketTimeoutException: 6 millis > timeout while waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) > > at > org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650) > > at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738) > at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524) > at org.apache.hadoop.ipc.Client.call(Client.java:1447) > ... 15 more > Caused by: java.net.SocketTimeoutException: 6 millis timeout while > waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read(BufferedInputStream.java:265) > at java.io.DataInputStream.readInt(DataInputStream.java:387) > at > org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:367) > at >
[jira] [Commented] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15948859#comment-15948859 ] Rohith Sharma K S commented on YARN-6409: - [~haibochen] would you give more details? How many nodes were there? Launch failed for what purpose? If you can get launch failure exit code it will be help full. > RM does not blacklist node for AM launch failures > - > > Key: YARN-6409 > URL: https://issues.apache.org/jira/browse/YARN-6409 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha2 >Reporter: Haibo Chen >Assignee: Haibo Chen > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org