[
https://issues.apache.org/jira/browse/YARN-10969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lee young gon updated YARN-10969:
---------------------------------
Description:
If the artifact type in the yarn-service spec is docker, the AM periodically
calls getContainerStatus through the NMClient.
After an RM fail-over, getContainerStatus starts failing once a specific period
(yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs * 2) has
elapsed.
The AM then logs the following error:
{code:java}
2021-04-05 19:18:47,381 [pool-5-thread-2] ERROR instance.ComponentInstance - [COMPINSTANCE regionserver-2 : container_e82_1612399098156_879545_01_000004] Failed to get container status on ac3iax2079.bdp.bdata.ai:9454, will try again
javax.security.sasl.SaslException: DIGEST-MD5: digest response format violation. Mismatched response. [Caused by org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): DIGEST-MD5: digest response format violation. Mismatched response.]
    at sun.reflect.GeneratedConstructorAccessor35.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
    at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80)
    at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119)
    at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.getContainerStatuses(ContainerManagementProtocolPBClientImpl.java:159)
    at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
    at com.sun.proxy.$Proxy47.getContainerStatuses(Unknown Source)
    at org.apache.hadoop.yarn.client.api.impl.NMClientImpl.getContainerStatus(NMClientImpl.java:339)
    at org.apache.hadoop.yarn.service.component.instance.ComponentInstance$ContainerStatusRetriever.run(ComponentInstance.java:958)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): DIGEST-MD5: digest response format violation. Mismatched response.
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1511)
    at org.apache.hadoop.ipc.Client.call(Client.java:1457)
    at org.apache.hadoop.ipc.Client.call(Client.java:1367)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
    at com.sun.proxy.$Proxy46.getContainerStatuses(Unknown Source)
    at org.apache.hadoo
{code}
The overall flow is as follows.
# The AM starts.
# The AM requests containers.
# The RM assigns a container.
## The RM creates a token for each assigned NM using the NMToken master key
and delivers the tokens to the AM.
## The NMToken master key is rolled periodically
(yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs) for each NM.
There can be a brief timing difference, but the NM holds the same value as the
RM.
# When the NM is assigned a container, it stores the NMToken master key in
effect at that point (the same master key used by the RM) in oldMasterKeys (a
hashmap), keyed by ApplicationAttemptId.
# After that, the AM sends requests to the NM (getContainerStatus) using the
issued token.
** Even after the master key is rolled, these requests succeed because the key
is kept in the NM's oldMasterKeys (stored in NMStateStore).
# The problem arises when the AM loses that token for any reason (e.g. RM
failover, AM restart).
# For example, when the AM restarts, it uses a token created with the NMToken
master key in effect at that point, and that token is only effective for a
specific period
(yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs * 2).
** If the NM's oldMasterKeys already holds an old master key for the
ApplicationAttemptId, tokens created with the currentMasterKey or
previousMasterKey are accepted, but the token is rejected once its key is no
longer either of those.
That is, whenever the AM is re-issued a token for any reason, that token is
only valid for a limited time.
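
To make the flow above concrete, here is a minimal sketch of the NM-side key check as described in this report. It is a toy model and not the Hadoop implementation: the class NmTokenKeyModel, its methods, the integer key ids, and the attempt id string are all hypothetical names chosen for illustration. It assumes the NM accepts a token only if its key id matches the key pinned for the ApplicationAttemptId, the current master key, or the previous master key.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the NM-side NMToken key check (hypothetical, not Hadoop code). */
class NmTokenKeyModel {
    private int currentKeyId;
    private Integer previousKeyId; // null until the first roll
    // Key pinned per application attempt at first container start ("oldMasterKeys").
    private final Map<String, Integer> oldKeyByAttempt = new HashMap<>();

    NmTokenKeyModel(int initialKeyId) {
        this.currentKeyId = initialKeyId;
    }

    /** One rolling interval elapses: current becomes previous, a new key is activated. */
    void rollMasterKey() {
        previousKeyId = currentKeyId;
        currentKeyId++;
    }

    /** Pins the key of the first token seen for an attempt; later tokens do not replace it. */
    void recordAttemptKey(String attemptId, int tokenKeyId) {
        oldKeyByAttempt.putIfAbsent(attemptId, tokenKeyId);
    }

    /** Returns true if a token signed with tokenKeyId would be accepted for this attempt. */
    boolean accepts(String attemptId, int tokenKeyId) {
        Integer pinned = oldKeyByAttempt.get(attemptId);
        if (pinned != null && pinned == tokenKeyId) {
            return true;
        }
        if (tokenKeyId == currentKeyId) {
            return true;
        }
        return previousKeyId != null && tokenKeyId == previousKeyId;
    }

    public static void main(String[] args) {
        String attempt = "appattempt_example_000001"; // hypothetical id
        NmTokenKeyModel nm = new NmTokenKeyModel(1);
        nm.recordAttemptKey(attempt, 1);      // first container start pins key 1

        nm.rollMasterKey();
        nm.rollMasterKey();
        nm.rollMasterKey();                   // current = 4, previous = 3
        System.out.println(nm.accepts(attempt, 1)); // true: original token, pinned key

        // The AM loses its token (RM fail-over) and is re-issued one signed with key 4.
        nm.recordAttemptKey(attempt, 4);      // no effect: key 1 is already pinned
        System.out.println(nm.accepts(attempt, 4)); // true: key 4 is current
        nm.rollMasterKey();
        System.out.println(nm.accepts(attempt, 4)); // true: key 4 is previous
        nm.rollMasterKey();
        System.out.println(nm.accepts(attempt, 4)); // false: key 4 has rolled out
    }
}
```

In this model the original token keeps working across any number of rolls because its key is pinned, while the re-issued token survives at most two rolls, which matches the "rolling interval * 2" window reported above.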
was:
If the artifact type of yarn-service spec is docker, getContainerStatus is
periodically requested through the NMClient.
And when RM fail-over occurs, getContainerStatus fails after a
specific(yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs *2)
time.
Then the following log occurs in AM
{code:java}
2021-04-05 19:18:47,381 [pool-5-thread-2] ERROR instance.ComponentInstance -
[COMPINSTANCE regionserver-2 : container_e82_1612399098156_879545_01_000004]
Failed to get container status on ac3iax2079.bdp.bdata.ai:9454, will try again
javax.security.sasl.SaslException: DIGEST-MD5: digest response format
violation. Mismatched response. [Caused by
org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException):
DIGEST-MD5: digest response format violation. Mismatched response.] at
sun.reflect.GeneratedConstructorAccessor35.newInstance(Unknown Source) at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at
org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) at
org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80) at
org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119) at
org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.getContainerStatuses(ContainerManagementProtocolPBClientImpl.java:159)
at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source) at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498) at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy47.getContainerStatuses(Unknown Source) at
org.apache.hadoop.yarn.client.api.impl.NMClientImpl.getContainerStatus(NMClientImpl.java:339)
at
org.apache.hadoop.yarn.service.component.instance.ComponentInstance$ContainerStatusRetriever.run(ComponentInstance.java:958)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at
java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745) Caused by:
org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException):
DIGEST-MD5: digest response format violation. Mismatched response. at
org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1511) at
org.apache.hadoop.ipc.Client.call(Client.java:1457) at
org.apache.hadoop.ipc.Client.call(Client.java:1367) at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy46.getContainerStatuses(Unknown Source) at
org.apache.hadoo
{code}
The overall flow is as follows.
# Started AM
# AM requests containers
# RM assigned a container
## RM makes tokens for each NM assigned to NMToken Master Key and delivers
them to AM
## This NMToken master key is rolling
periodically(yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs)
for each NM. There's a timing issue, but it's the same value as RM
# NM is assigned a container and stores the NMToken master key (same as the
master key used by RM) in the old Master Keys (hashmap) at that point with
ApplicationAttemptId as the key
# After that, requests from AM to NM (getContainerStatus) are made through the
issued token
** Even if the master key is rolled, the request succeeds because it is stored
in NM's oldMasterKeys (stored in NMStateStore)
# But it becomes a problem if the AM loses that token for any reason(e.g. RM
failover, AM restart)
# For example, when the AM restarts, the AM uses the token created with the
NMToken master key at that point and is only effective for a
specific(yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs*2) time
** If there is an old MasterKey of ApplicationAttemptId in NM's oldMasterKey,
currentMasterKey and preliminaryMasterKey are valid, but subsequent tokens are
invalid
That is, for any reason, when AM is re-issued with a token, it is only valid
for a specific time
> After RM fail-over, getContainerStatus fails from ApplicationMaster to
> NodeManager
> ----------------------------------------------------------------------------------
>
> Key: YARN-10969
> URL: https://issues.apache.org/jira/browse/YARN-10969
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.1.2
> Reporter: Lee young gon
> Priority: Major
> Attachments: YARN-10969.001.patch
>