Lee young gon created YARN-10969:
------------------------------------

             Summary: After RM fail-over, getContainerStatus fails from 
ApplicationMaster to NodeManager
                 Key: YARN-10969
                 URL: https://issues.apache.org/jira/browse/YARN-10969
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 3.1.2
            Reporter: Lee young gon


If the artifact type in the yarn-service spec is docker, the AM periodically 
calls getContainerStatus through the NMClient.

After an RM fail-over, getContainerStatus starts to fail once a specific time 
(yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs * 2) has 
elapsed.
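
To make the size of that window concrete, here is a minimal sketch of the arithmetic. The class and method names are hypothetical, and the 86400-second value is assumed to be the stock default for this property; a cluster may override it in yarn-site.xml.

```java
import java.time.Duration;

public class NmTokenWindow {
    // Assumed default of
    // yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs (24 hours).
    static final long DEFAULT_ROLLING_INTERVAL_SECS = 24L * 60 * 60;

    // Per the report, a re-issued NMToken stops working after roughly
    // two rolling intervals.
    static Duration failureWindow(long rollingIntervalSecs) {
        return Duration.ofSeconds(rollingIntervalSecs).multipliedBy(2);
    }

    public static void main(String[] args) {
        // Optionally pass the configured interval (in seconds) as the first argument.
        long interval = args.length > 0
                ? Long.parseLong(args[0])
                : DEFAULT_ROLLING_INTERVAL_SECS;
        System.out.println("getContainerStatus may start failing about "
                + failureWindow(interval).toHours() + " hours after the token is re-issued");
    }
}
```

With the assumed default, the failure would show up about two days after the fail-over, which can make the link to the fail-over easy to miss.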

The following error then appears in the AM log:
{code:java}
2021-04-05 19:18:47,381 [pool-5-thread-2] ERROR instance.ComponentInstance - [COMPINSTANCE regionserver-2 : container_e82_1612399098156_879545_01_000004] Failed to get container status on ac3iax2079.bdp.bdata.ai:9454, will try again
javax.security.sasl.SaslException: DIGEST-MD5: digest response format violation. Mismatched response. [Caused by org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): DIGEST-MD5: digest response format violation. Mismatched response.]
 at sun.reflect.GeneratedConstructorAccessor35.newInstance(Unknown Source)
 at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
 at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
 at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80)
 at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119)
 at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.getContainerStatuses(ContainerManagementProtocolPBClientImpl.java:159)
 at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
 at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
 at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
 at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
 at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
 at com.sun.proxy.$Proxy47.getContainerStatuses(Unknown Source)
 at org.apache.hadoop.yarn.client.api.impl.NMClientImpl.getContainerStatus(NMClientImpl.java:339)
 at org.apache.hadoop.yarn.service.component.instance.ComponentInstance$ContainerStatusRetriever.run(ComponentInstance.java:958)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): DIGEST-MD5: digest response format violation. Mismatched response.
 at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1511)
 at org.apache.hadoop.ipc.Client.call(Client.java:1457)
 at org.apache.hadoop.ipc.Client.call(Client.java:1367)
 at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
 at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
 at com.sun.proxy.$Proxy46.getContainerStatuses(Unknown Source)
 at org.apache.hadoo
{code}
The overall flow is as follows.
 # The AM starts
 # The AM requests containers
 # The RM assigns a container
 ## The RM creates a token for each assigned NM from the NMToken master key 
and delivers the tokens to the AM
 ## This NMToken master key is rolled periodically 
(yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs) for each NM. 
Apart from a short timing gap, the NM holds the same value as the RM
 # When a container is assigned to the NM, the NM stores the NMToken master 
key in use at that point (the same master key used by the RM) in oldMasterKeys 
(a hashmap), with the ApplicationAttemptId as the key
 # After that, requests from the AM to the NM (getContainerStatus) are made 
with the issued token
 ** Even if the master key is rolled, the request succeeds because the key is 
stored in the NM's oldMasterKeys (persisted in NMStateStore)
 # But it becomes a problem if the AM loses that token for any reason (e.g. RM 
failover, AM restart)
 # For example, when the AM restarts, it uses a token created with the NMToken 
master key that is current at that point, and that token is only effective for 
a specific time 
(yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs * 2)
 ** If the NM's oldMasterKeys already holds an old master key for the 
ApplicationAttemptId, the new token is accepted only while its key matches the 
NM's currentMasterKey or preliminaryMasterKey; once those keys have rolled 
past it, the token is invalid

That is, whenever the AM is re-issued a token for any reason, that token is 
only valid for a specific time.
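
The flow above can be reproduced with a small, self-contained model. This is a simplified sketch built only from the description in this report, not the actual NM secret-manager code: the class name, the method names, and the use of integer key ids are all hypothetical, and "previous key" here stands for the second key that the report says the NM still accepts.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of how an NM decides whether an NMToken's master key is acceptable.
public class NmTokenKeyModel {
    private int currentKeyId = 1;
    private Integer previousKeyId = null;
    // Key id recorded per application attempt when a container is started.
    private final Map<String, Integer> oldKeyByAttempt = new HashMap<>();

    int currentKeyId() {
        return currentKeyId;
    }

    // One rolling interval elapses: the current key becomes the previous key.
    void rollMasterKey() {
        previousKeyId = currentKeyId;
        currentKeyId++;
    }

    // Container start: remember the key that signed the token for this attempt.
    void startContainer(String attemptId, int tokenKeyId) {
        oldKeyByAttempt.put(attemptId, tokenKeyId);
    }

    // A token is accepted if its key is the one stored for the attempt,
    // or the NM's current key, or the NM's previous key.
    boolean accepts(String attemptId, int tokenKeyId) {
        Integer stored = oldKeyByAttempt.get(attemptId);
        if (stored != null && stored == tokenKeyId) {
            return true;
        }
        if (tokenKeyId == currentKeyId) {
            return true;
        }
        return previousKeyId != null && previousKeyId == tokenKeyId;
    }

    public static void main(String[] args) {
        NmTokenKeyModel nm = new NmTokenKeyModel();
        String attempt = "appattempt_1";

        int originalKey = nm.currentKeyId();
        nm.startContainer(attempt, originalKey);
        nm.rollMasterKey();
        nm.rollMasterKey();
        // The original token survives any number of rolls via the stored key.
        System.out.println("original token: " + nm.accepts(attempt, originalKey));

        // RM fail-over: the AM is re-issued a token signed with the current key,
        // but no container start records that key for the attempt.
        int reissuedKey = nm.currentKeyId();
        System.out.println("re-issued, 0 rolls: " + nm.accepts(attempt, reissuedKey));
        nm.rollMasterKey();
        System.out.println("re-issued, 1 roll:  " + nm.accepts(attempt, reissuedKey));
        nm.rollMasterKey();
        System.out.println("re-issued, 2 rolls: " + nm.accepts(attempt, reissuedKey));
    }
}
```

In this model the re-issued token is accepted immediately and after one roll, then rejected after the second roll, which matches the "rolling interval * 2" lifetime described above.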



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
