[ https://issues.apache.org/jira/browse/YARN-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106999#comment-14106999 ]

Jason Lowe commented on YARN-2441:
----------------------------------

Ah, then this seems like a case where a client (likely an AM) is connecting to 
the NM before the NM has finished registering with the RM to get the secret 
keys.  Trying to block new container requests at the app level probably isn't 
going to work in practice because the SASL layer in RPC doesn't let the 
connection get to the point where the app can try to reject the request.

IMHO we should remove the "blocking client requests" code and instead do a 
delayed server start, sorta like the delay added by YARN-1337 when NM recovery 
is enabled.  Ideally the RPC layer would support the ability to bind to a 
server socket but not start accepting requests until later.  That would allow 
us to register with the RM knowing what our client port is but without trying 
to let clients through that port until we're really ready.
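
To make the "bind now, accept later" idea concrete, here is a minimal standalone sketch using plain java.net sockets. It is not Hadoop's ipc.Server, and the names DelayedStartServer, getPort and startAccepting are made up for illustration. It only shows that the port can be known (and advertised) before any client is actually served.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical sketch; not Hadoop's ipc.Server.
class DelayedStartServer implements AutoCloseable {
    private final ServerSocket socket;

    // Binds right away (port 0 picks an ephemeral port) so the port is
    // known before any client is served.
    DelayedStartServer(int port) throws IOException {
        socket = new ServerSocket();
        socket.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), port));
    }

    // The port that could be reported at registration time.
    int getPort() {
        return socket.getLocalPort();
    }

    // Call only once everything the request handlers need is in place.
    void startAccepting() {
        Thread acceptor = new Thread(this::acceptLoop, "delayed-acceptor");
        acceptor.setDaemon(true);
        acceptor.start();
    }

    private void acceptLoop() {
        while (!socket.isClosed()) {
            try (Socket client = socket.accept()) {
                client.getOutputStream().write('K'); // trivial "ready" reply
            } catch (IOException e) {
                // accept() fails once the socket is closed; the loop then exits
            }
        }
    }

    @Override
    public void close() throws IOException {
        socket.close();
    }
}
```

One caveat with this sketch: in Java, bind() also puts the socket into the listening state, so the OS completes TCP handshakes and queues early clients in the backlog. They are not refused, they just get no response until startAccepting() is called.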

A shorter-term fix might be to have the secret manager throw an exception that 
clients can retry on if the master key isn't set yet, instead of hitting the NPE.
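
A rough sketch of that shorter-term idea, written as standalone Java rather than against the real Hadoop classes. KeyNotReadyException, SketchSecretManager and RetryingClient are hypothetical names, and the HMAC step is only a placeholder for however the real password is derived. The point is just the null check that turns the NPE into an explicit, retriable failure.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical exception type: signals "not ready yet, try again later".
class KeyNotReadyException extends IOException {
    KeyNotReadyException(String message) {
        super(message);
    }
}

// Hypothetical stand-in for the NM-side secret manager; not the Hadoop class.
class SketchSecretManager {
    // Stays null until registration with the RM delivers the master key.
    private volatile byte[] masterKey;

    void setMasterKey(byte[] key) {
        masterKey = key.clone();
    }

    byte[] retrievePassword(String tokenId) throws IOException {
        byte[] key = masterKey; // read the volatile field once
        if (key == null) {
            // Explicit, retriable failure instead of a NullPointerException.
            throw new KeyNotReadyException("master key not set yet; retry later");
        }
        try {
            // Placeholder derivation: HMAC of the token id with the master key.
            Mac mac = Mac.getInstance("HmacSHA1");
            mac.init(new SecretKeySpec(key, "HmacSHA1"));
            return mac.doFinal(tokenId.getBytes(StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IOException(e);
        }
    }
}

// Hypothetical client-side helper: retries only on KeyNotReadyException.
class RetryingClient {
    static byte[] fetchPassword(SketchSecretManager sm, String tokenId,
                                int maxAttempts, long sleepMillis)
            throws IOException, InterruptedException {
        for (int attempt = 1; ; attempt++) {
            try {
                return sm.retrievePassword(tokenId);
            } catch (KeyNotReadyException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up and surface the last failure
                }
                Thread.sleep(sleepMillis);
            }
        }
    }
}
```

In the real code path the lookup happens inside the SASL handshake (the trace goes through SecretManager.retriableRetrievePassword), so whatever exception is used would have to be one the RPC layer reports back to the client as retriable. That part is not covered by this sketch.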

> NPE in nodemanager after restart
> --------------------------------
>
>                 Key: YARN-2441
>                 URL: https://issues.apache.org/jira/browse/YARN-2441
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: Nishan Shetty
>            Priority: Minor
>
> {code}
> 2014-08-22 16:43:19,640 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Blocking new container-requests as container manager rpc server is still starting.
> 2014-08-22 16:43:19,658 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
> 2014-08-22 16:43:19,675 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 45026: starting
> 2014-08-22 16:43:20,029 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager: Updating node address : host-10-18-40-95:45026
> 2014-08-22 16:43:20,029 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: ContainerManager started at /10.18.40.95:45026
> 2014-08-22 16:43:20,030 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: ContainerManager bound to host-10-18-40-95/10.18.40.95:45026
> 2014-08-22 16:43:20,073 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
> 2014-08-22 16:43:20,098 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 45027
> 2014-08-22 16:43:20,158 INFO org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding protocol org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB to the server
> 2014-08-22 16:43:20,178 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
> 2014-08-22 16:43:20,192 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 45027: starting
> 2014-08-22 16:43:20,210 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 45026: readAndProcess from client 10.18.40.84 threw exception [java.lang.NullPointerException]
> java.lang.NullPointerException
>       at org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:167)
>       at org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:43)
>       at org.apache.hadoop.security.token.SecretManager.retriableRetrievePassword(SecretManager.java:91)
>       at org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.getPassword(SaslRpcServer.java:278)
>       at org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.handle(SaslRpcServer.java:305)
>       at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:585)
>       at com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
>       at org.apache.hadoop.ipc.Server$Connection.processSaslToken(Server.java:1384)
>       at org.apache.hadoop.ipc.Server$Connection.processSaslMessage(Server.java:1361)
>       at org.apache.hadoop.ipc.Server$Connection.saslProcess(Server.java:1275)
>       at org.apache.hadoop.ipc.Server$Connection.saslReadAndProcess(Server.java:1238)
>       at org.apache.hadoop.ipc.Server$Connection.processRpcOutOfBandRequest(Server.java:1878)
>       at org.apache.hadoop.ipc.Server$Connection.processOneRpc(Server.java:1755)
>       at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1519)
>       at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:750)
>       at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:624)
>       at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:595)
> 2014-08-22 16:43:20,227 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 45026: readAndProcess from client 10.18.40.84 threw exception [java.lang.NullPointerException]
> java.lang.NullPointerException
>       at org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:167)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
