[ 
https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-6409:
-----------------------------
    Description: 
Currently, node blacklisting upon AM failures only handles failures that happen 
after AM container is launched (see 
RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()).  However, AM launch can 
also fail if the NM, where the AM container is allocated, goes unresponsive.  
Because it is not handled, scheduler may continue to allocate AM containers on 
that same NM for the following app attempts. 

{code}
Application application_1478721503753_0870 failed 2 times due to Error 
launching appattempt_1478721503753_0870_000002. Got exception: 
java.io.IOException: Failed on local exception: java.io.IOException: 
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel 
to be ready for read. ch : java.nio.channels.SocketChannel[connected 
local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host Details 
: local host is: "*.me.com/17.111.179.113"; destination host is: 
"*.me.com":8041; 
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) 
at org.apache.hadoop.ipc.Client.call(Client.java:1475) 
at org.apache.hadoop.ipc.Client.call(Client.java:1408) 
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
 
at com.sun.proxy.$Proxy86.startContainers(Unknown Source) 
at 
org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96)
 
at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 
at java.lang.reflect.Method.invoke(Method.java:497) 
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
 
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
 
at com.sun.proxy.$Proxy87.startContainers(Unknown Source) 
at 
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120)
 
at 
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256)
 
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
at java.lang.Thread.run(Thread.java:745) 
Caused by: java.io.IOException: java.net.SocketTimeoutException: 60000 millis 
timeout while waiting for channel to be ready for read. ch : 
java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 
remote=*.me.com/17.111.178.125:8041] 
at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) 
at java.security.AccessController.doPrivileged(Native Method) 
at javax.security.auth.Subject.doAs(Subject.java:422) 
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
 
at 
org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650)
 
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738) 
at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) 
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524) 
at org.apache.hadoop.ipc.Client.call(Client.java:1447) 
... 15 more 
Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting 
for channel to be ready for read. ch : 
java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 
remote=*.me.com/17.111.178.125:8041] 
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) 
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) 
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) 
at java.io.FilterInputStream.read(FilterInputStream.java:133) 
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) 
at java.io.BufferedInputStream.read(BufferedInputStream.java:265) 
at java.io.DataInputStream.readInt(DataInputStream.java:387) 
at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:367) 
at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:560) 
at org.apache.hadoop.ipc.Client$Connection.access$1900(Client.java:375) 
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:730) 
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:726) 
at java.security.AccessController.doPrivileged(Native Method) 
at javax.security.auth.Subject.doAs(Subject.java:422) 
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
 
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:725) 
... 18 more 
. Failing the application.
{code}

  was:
Currently, node blacklisting upon AM failures only handles failures that happen 
after AM container is launched (see 
RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()).  However, AM launch can 
also fail if the NM, where the AM container is allocated, goes unresponsive.  
Because it is not handled, scheduler may continue to allocate AM containers on 
that same NM for the following app attempts. 

{code}
Application application_1478721503753_0870 failed 2 times due to Error 
launching appattempt_1478721503753_0870_000002. Got exception: 
java.io.IOException: Failed on local exception: java.io.IOException: 
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel 
to be ready for read. ch : java.nio.channels.SocketChannel[connected 
local=/17.111.179.113:46702 
remote=mr22p32ic-ztep01041101.me.com/17.111.178.125:8041]; Host Details : local 
host is: "*.me.com/17.111.179.113"; destination host is: "*.me.com":8041; 
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) 
at org.apache.hadoop.ipc.Client.call(Client.java:1475) 
at org.apache.hadoop.ipc.Client.call(Client.java:1408) 
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
 
at com.sun.proxy.$Proxy86.startContainers(Unknown Source) 
at 
org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96)
 
at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 
at java.lang.reflect.Method.invoke(Method.java:497) 
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
 
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
 
at com.sun.proxy.$Proxy87.startContainers(Unknown Source) 
at 
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120)
 
at 
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256)
 
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
at java.lang.Thread.run(Thread.java:745) 
Caused by: java.io.IOException: java.net.SocketTimeoutException: 60000 millis 
timeout while waiting for channel to be ready for read. ch : 
java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 
remote=*.me.com/17.111.178.125:8041] 
at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) 
at java.security.AccessController.doPrivileged(Native Method) 
at javax.security.auth.Subject.doAs(Subject.java:422) 
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
 
at 
org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650)
 
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738) 
at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) 
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524) 
at org.apache.hadoop.ipc.Client.call(Client.java:1447) 
... 15 more 
Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting 
for channel to be ready for read. ch : 
java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 
remote=*.me.com/17.111.178.125:8041] 
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) 
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) 
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) 
at java.io.FilterInputStream.read(FilterInputStream.java:133) 
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) 
at java.io.BufferedInputStream.read(BufferedInputStream.java:265) 
at java.io.DataInputStream.readInt(DataInputStream.java:387) 
at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:367) 
at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:560) 
at org.apache.hadoop.ipc.Client$Connection.access$1900(Client.java:375) 
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:730) 
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:726) 
at java.security.AccessController.doPrivileged(Native Method) 
at javax.security.auth.Subject.doAs(Subject.java:422) 
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
 
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:725) 
... 18 more 
. Failing the application.
{code}


> RM does not blacklist node for AM launch failures
> -------------------------------------------------
>
>                 Key: YARN-6409
>                 URL: https://issues.apache.org/jira/browse/YARN-6409
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.0.0-alpha2
>            Reporter: Haibo Chen
>            Assignee: Haibo Chen
>         Attachments: YARN-6409.00.patch
>
>
> Currently, node blacklisting upon AM failures only handles failures that 
> happen after AM container is launched (see 
> RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()).  However, AM launch 
> can also fail if the NM, where the AM container is allocated, goes 
> unresponsive.  Because it is not handled, scheduler may continue to allocate 
> AM containers on that same NM for the following app attempts. 
> {code}
> Application application_1478721503753_0870 failed 2 times due to Error 
> launching appattempt_1478721503753_0870_000002. Got exception: 
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 60000 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host 
> Details : local host is: "*.me.com/17.111.179.113"; destination host is: 
> "*.me.com":8041; 
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) 
> at org.apache.hadoop.ipc.Client.call(Client.java:1475) 
> at org.apache.hadoop.ipc.Client.call(Client.java:1408) 
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
>  
> at com.sun.proxy.$Proxy86.startContainers(Unknown Source) 
> at 
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96)
>  
> at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.lang.reflect.Method.invoke(Method.java:497) 
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
>  
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>  
> at com.sun.proxy.$Proxy87.startContainers(Unknown Source) 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256)
>  
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  
> at java.lang.Thread.run(Thread.java:745) 
> Caused by: java.io.IOException: java.net.SocketTimeoutException: 60000 millis 
> timeout while waiting for channel to be ready for read. ch : 
> java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 
> remote=*.me.com/17.111.178.125:8041] 
> at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) 
> at java.security.AccessController.doPrivileged(Native Method) 
> at javax.security.auth.Subject.doAs(Subject.java:422) 
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
>  
> at 
> org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650)
>  
> at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738) 
> at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) 
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524) 
> at org.apache.hadoop.ipc.Client.call(Client.java:1447) 
> ... 15 more 
> Caused by: java.net.SocketTimeoutException: 60000 millis timeout while 
> waiting for channel to be ready for read. ch : 
> java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 
> remote=*.me.com/17.111.178.125:8041] 
> at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) 
> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) 
> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) 
> at java.io.FilterInputStream.read(FilterInputStream.java:133) 
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) 
> at java.io.BufferedInputStream.read(BufferedInputStream.java:265) 
> at java.io.DataInputStream.readInt(DataInputStream.java:387) 
> at 
> org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:367) 
> at 
> org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:560) 
> at org.apache.hadoop.ipc.Client$Connection.access$1900(Client.java:375) 
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:730) 
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:726) 
> at java.security.AccessController.doPrivileged(Native Method) 
> at javax.security.auth.Subject.doAs(Subject.java:422) 
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
>  
> at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:725) 
> ... 18 more 
> . Failing the application.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to