[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16078762#comment-16078762 ]

Jason Lowe commented on YARN-6409:
----------------------------------

Sorry for the long delay in responding -- I was out of the office quite a bit 
lately.

The approach seems reasonable to me assuming we're getting some level of 
retries during the AM launch from the container manager proxy in AMLauncher.  
If we can't get an AM to launch on that node even after some retries then it is 
very likely a subsequent attempt will also fail.  IMHO that's an appropriate 
time to blacklist.  If the container manager proxy is _not_ doing the retries 
in this case then that would be the first place to fix -- we should be trying a 
bit harder to get the current attempt launched before jumping to blacklist 
conclusions.
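
For reference, a minimal sketch of the NM-connect retry knobs I believe the 
container manager proxy honors here (the class name below is just for 
illustration, and exact defaults can vary by release, so treat this as a sketch 
rather than gospel):

{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Minimal illustration of the NM-connect retry settings the RM-side
// container manager proxy is expected to honor when launching the AM.
public class NmConnectRetrySettings {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    // How long to keep retrying the connection to the NM before giving up on the launch.
    long maxWaitMs = conf.getLong(
        YarnConfiguration.CLIENT_NM_CONNECT_MAX_WAIT_MS,
        YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_MAX_WAIT_MS);
    // How long to wait between individual retry attempts.
    long retryIntervalMs = conf.getLong(
        YarnConfiguration.CLIENT_NM_CONNECT_RETRY_INTERVAL_MS,
        YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_RETRY_INTERVAL_MS);
    System.out.println("yarn.client.nodemanager-connect.max-wait-ms = " + maxWaitMs);
    System.out.println("yarn.client.nodemanager-connect.retry-interval-ms = " + retryIntervalMs);
  }
}
{code}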

I'm confused about how the NM is capable of regularly heartbeating (and thus 
getting scheduled for AM launches) yet regularly failing to respond to launch 
requests.  If this is a common occurrence then it needs to be root-caused.  
The proposed change is not really a fix for that, just a workaround in case it 
occurs.  Without a real fix it will lead to prolonged AM and task launch times, 
since I'm assuming AMs will see similar difficulties trying to launch tasks on 
a node if the RM cannot launch an AM on it.
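
To make the intended behavior concrete, here is a rough, purely hypothetical 
model of the decision being proposed (none of these class or method names come 
from the patch or from the actual RMAppAttemptImpl/AMLauncher code paths): once 
an AM launch on a node still fails after the proxy-level retries, blacklist 
that node so the scheduler places the next attempt's AM elsewhere.

{code}
import java.util.HashSet;
import java.util.Set;

// Toy model only -- illustrates the blacklist-on-launch-failure idea,
// not the actual resourcemanager implementation.
public class AmLaunchBlacklistModel {
  private final Set<String> blacklistedNodes = new HashSet<>();

  // Invoked once the launcher has exhausted its retries against a node.
  public void onAmLaunchFailed(String nodeId) {
    blacklistedNodes.add(nodeId);
  }

  // The scheduler would skip blacklisted nodes when placing the next attempt's AM.
  public boolean isBlacklisted(String nodeId) {
    return blacklistedNodes.contains(nodeId);
  }

  public static void main(String[] args) {
    AmLaunchBlacklistModel model = new AmLaunchBlacklistModel();
    model.onAmLaunchFailed("unresponsive-nm.example.com:8041");
    System.out.println(model.isBlacklisted("unresponsive-nm.example.com:8041")); // true
    System.out.println(model.isBlacklisted("healthy-nm.example.com:8041"));      // false
  }
}
{code}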

> RM does not blacklist node for AM launch failures
> -------------------------------------------------
>
>                 Key: YARN-6409
>                 URL: https://issues.apache.org/jira/browse/YARN-6409
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.0.0-alpha2
>            Reporter: Haibo Chen
>            Assignee: Haibo Chen
>         Attachments: YARN-6409.00.patch, YARN-6409.01.patch, 
> YARN-6409.02.patch, YARN-6409.03.patch
>
>
> Currently, node blacklisting upon AM failures only handles failures that 
> happen after the AM container is launched (see 
> RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()).  However, AM launch 
> can also fail if the NM where the AM container is allocated goes 
> unresponsive.  Because that case is not handled, the scheduler may continue 
> to allocate AM containers on that same NM for subsequent app attempts.
> {code}
> Application application_1478721503753_0870 failed 2 times due to Error launching appattempt_1478721503753_0870_000002. Got exception: java.io.IOException: Failed on local exception: java.io.IOException: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host Details : local host is: "*.me.com/17.111.179.113"; destination host is: "*.me.com":8041;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1475)
> at org.apache.hadoop.ipc.Client.call(Client.java:1408)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
> at com.sun.proxy.$Proxy86.startContainers(Unknown Source)
> at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96)
> at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
> at com.sun.proxy.$Proxy87.startContainers(Unknown Source)
> at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120)
> at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]
> at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
> at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650)
> at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738)
> at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524)
> at org.apache.hadoop.ipc.Client.call(Client.java:1447)
> ... 15 more
> Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]
> at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
> at java.io.FilterInputStream.read(FilterInputStream.java:133)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
> at java.io.DataInputStream.readInt(DataInputStream.java:387)
> at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:367)
> at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:560)
> at org.apache.hadoop.ipc.Client$Connection.access$1900(Client.java:375)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:730)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:726)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
> at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:725)
> ... 18 more
> . Failing the application.
> {code}


