Lee young gon created YARN-10969:
------------------------------------
Summary: After RM fail-over, getContainerStatus fails from
ApplicationMaster to NodeManager
Key: YARN-10969
URL: https://issues.apache.org/jira/browse/YARN-10969
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 3.1.2
Reporter: Lee young gon
If the artifact type in the yarn-service spec is docker, the AM periodically
calls getContainerStatus through the NMClient.
When an RM fail-over occurs, getContainerStatus starts failing after a certain
time (yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs * 2).
The following error is then logged in the AM:
{code:java}
2021-04-05 19:18:47,381 [pool-5-thread-2] ERROR instance.ComponentInstance - [COMPINSTANCE regionserver-2 : container_e82_1612399098156_879545_01_000004] Failed to get container status on ac3iax2079.bdp.bdata.ai:9454, will try again
javax.security.sasl.SaslException: DIGEST-MD5: digest response format violation. Mismatched response. [Caused by org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): DIGEST-MD5: digest response format violation. Mismatched response.]
    at sun.reflect.GeneratedConstructorAccessor35.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
    at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80)
    at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119)
    at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.getContainerStatuses(ContainerManagementProtocolPBClientImpl.java:159)
    at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
    at com.sun.proxy.$Proxy47.getContainerStatuses(Unknown Source)
    at org.apache.hadoop.yarn.client.api.impl.NMClientImpl.getContainerStatus(NMClientImpl.java:339)
    at org.apache.hadoop.yarn.service.component.instance.ComponentInstance$ContainerStatusRetriever.run(ComponentInstance.java:958)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): DIGEST-MD5: digest response format violation. Mismatched response.
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1511)
    at org.apache.hadoop.ipc.Client.call(Client.java:1457)
    at org.apache.hadoop.ipc.Client.call(Client.java:1367)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
    at com.sun.proxy.$Proxy46.getContainerStatuses(Unknown Source)
    at org.apache.hadoo
{code}
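To make the "interval * 2" window concrete, here is a small sketch of the arithmetic. The class and method names are made up for illustration, and the 86400-second interval is only an assumed example value (I believe it is the shipped default, but please check yarn-default.xml for your version).

```java
import java.time.Duration;

// Hypothetical helper, not part of Hadoop: computes the longest time a
// re-issued NMToken could stay usable if it survives two master key rolls.
class NMTokenWindow {
    static Duration maxValidity(long rollingIntervalSecs) {
        // Upper bound from the report: rolling interval * 2.
        return Duration.ofSeconds(rollingIntervalSecs).multipliedBy(2);
    }

    public static void main(String[] args) {
        long intervalSecs = 86400L; // assumed example value, not read from a real config
        System.out.println(maxValidity(intervalSecs).toHours() + " hours");
    }
}
```

With that assumed interval, the failures would show up at most about two days after the fail-over, which matches the delayed nature of the error.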
The overall flow is as follows.
# The AM starts
# The AM requests containers
# The RM assigns a container
## The RM creates an NMToken for each assigned NM from the NMToken master key and delivers the tokens to the AM
## The NMToken master key is rolled periodically (yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs) for each NM. Apart from a short timing gap, the NM holds the same value as the RM
# The NM is assigned a container and stores the NMToken master key in use at that point (the same master key the RM used) in its oldMasterKeys hashmap, with the ApplicationAttemptId as the key
# After that, requests from the AM to the NM (getContainerStatus) are made with the issued token
** Even after the master key is rolled, these requests succeed because the key is stored in the NM's oldMasterKeys (persisted in the NMStateStore)
# The problem arises when the AM loses that token for any reason (e.g. RM failover, AM restart)
# For example, when the AM restarts, it uses a token created with the NMToken master key that is current at that point, and that token is only valid for a limited time (yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs * 2)
** If the NM's oldMasterKeys already holds an old master key for the ApplicationAttemptId, the new token is accepted only while its key is still the NM's currentMasterKey or preliminaryMasterKey; after that it is rejected
In short, whenever the AM is re-issued a token for any reason, that token is
only valid for a limited time.
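The flow above can be sketched as a toy model. This is not the actual NM secret manager code; all names here (NMTokenKeyModel, issueToken, accepts, and so on) are hypothetical, master keys are reduced to integer ids, and the RM and NM are assumed to roll in lockstep. The second key the NM keeps is called previousKeyId in the sketch; the description above refers to it as preliminaryMasterKey.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of NMToken master key handling (illustrative only).
class NMTokenKeyModel {
    private int currentKeyId = 1;
    private int previousKeyId = 0; // 0 means "no previous key yet"
    // Key id remembered per application attempt when its container started.
    private final Map<String, Integer> oldKeyByAttempt = new HashMap<>();

    // RM side: a new token carries the id of the current master key.
    public int issueToken() {
        return currentKeyId;
    }

    // One rolling interval elapses: current becomes previous.
    public void rollMasterKey() {
        previousKeyId = currentKeyId;
        currentKeyId++;
    }

    // NM side: on container start, remember the key used for this attempt.
    public void startContainer(String attemptId, int tokenKeyId) {
        if (!accepts(attemptId, tokenKeyId)) {
            throw new IllegalStateException("token rejected");
        }
        oldKeyByAttempt.put(attemptId, tokenKeyId);
    }

    // NM side: a token is accepted if its key is current, previous,
    // or the one remembered for this application attempt.
    public boolean accepts(String attemptId, int tokenKeyId) {
        if (tokenKeyId == currentKeyId || tokenKeyId == previousKeyId) {
            return true;
        }
        Integer old = oldKeyByAttempt.get(attemptId);
        return old != null && old == tokenKeyId;
    }

    public static void main(String[] args) {
        NMTokenKeyModel nm = new NMTokenKeyModel();
        int original = nm.issueToken();
        nm.startContainer("attempt_1", original);
        nm.rollMasterKey();
        nm.rollMasterKey();
        nm.rollMasterKey();
        // Original token still works: its key is remembered per attempt.
        System.out.println("original after 3 rolls: " + nm.accepts("attempt_1", original));

        // RM fail-over or AM restart: the AM gets a token with the current key,
        // but the NM's per-attempt entry is not updated by getContainerStatus.
        int reissued = nm.issueToken();
        nm.rollMasterKey();
        System.out.println("reissued after 1 roll: " + nm.accepts("attempt_1", reissued));
        nm.rollMasterKey();
        System.out.println("reissued after 2 rolls: " + nm.accepts("attempt_1", reissued));
    }
}
```

In this model the original token keeps working across any number of rolls, while the re-issued token is rejected once its key is neither current nor previous, which is the behaviour reported here.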