[jira] [Updated] (YARN-10969) After RM fail-over, getContainerStatus fails from ApplicationMaster to NodeManager

Lee young gon (Jira) Fri, 24 Sep 2021 00:58:09 -0700


     [ https://issues.apache.org/jira/browse/YARN-10969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Lee young gon updated YARN-10969:
---------------------------------
    Description: 
If the artifact type in the yarn-service spec is docker, the AM periodically 
calls getContainerStatus on the NM through the NMClient.

After an RM fail-over, getContainerStatus starts to fail once a specific 
interval (yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs * 2) 
has elapsed.
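
As a rough illustration of the interval named above, the sketch below doubles 
the configured rolling interval to estimate how long getContainerStatus keeps 
working after the fail-over. The class and method names are hypothetical and 
not part of Hadoop, and the 86400-second value is an assumption (the stock 
one-day default); check yarn-site.xml for the value your cluster actually uses.

```java
import java.time.Duration;

// Hypothetical helper, not part of Hadoop: estimates the window described in
// this report (two NMToken master-key rolling intervals).
final class NmTokenWindowSketch {

    // Assumed default of
    // yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs (one day).
    static final long ASSUMED_ROLLING_INTERVAL_SECS = 24L * 60 * 60;

    // Returns rollingIntervalSecs * 2 as a Duration.
    static Duration maxValidity(long rollingIntervalSecs) {
        if (rollingIntervalSecs <= 0) {
            throw new IllegalArgumentException("rolling interval must be positive");
        }
        return Duration.ofSeconds(Math.multiplyExact(rollingIntervalSecs, 2L));
    }

    public static void main(String[] args) {
        // With the assumed one-day interval, the window is 48 hours.
        System.out.println(maxValidity(ASSUMED_ROLLING_INTERVAL_SECS));
    }
}
```

With a one-day rolling interval the failures would therefore be expected to 
start roughly two days after the fail-over, not immediately.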

When the failure occurs, the AM logs the following error:
{code:java}
2021-04-05 19:18:47,381 [pool-5-thread-2] ERROR instance.ComponentInstance - [COMPINSTANCE regionserver-2 : container_e82_1612399098156_879545_01_000004] Failed to get container status on ac3iax2079.bdp.bdata.ai:9454, will try again
javax.security.sasl.SaslException: DIGEST-MD5: digest response format violation. Mismatched response. [Caused by org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): DIGEST-MD5: digest response format violation. Mismatched response.]
    at sun.reflect.GeneratedConstructorAccessor35.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
    at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80)
    at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119)
    at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.getContainerStatuses(ContainerManagementProtocolPBClientImpl.java:159)
    at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
    at com.sun.proxy.$Proxy47.getContainerStatuses(Unknown Source)
    at org.apache.hadoop.yarn.client.api.impl.NMClientImpl.getContainerStatus(NMClientImpl.java:339)
    at org.apache.hadoop.yarn.service.component.instance.ComponentInstance$ContainerStatusRetriever.run(ComponentInstance.java:958)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): DIGEST-MD5: digest response format violation. Mismatched response.
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1511)
    at org.apache.hadoop.ipc.Client.call(Client.java:1457)
    at org.apache.hadoop.ipc.Client.call(Client.java:1367)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
    at com.sun.proxy.$Proxy46.getContainerStatuses(Unknown Source)
    at org.apache.hadoo
{code}
The overall flow is as follows.
 # The AM starts.
 # The AM requests containers.
 # The RM assigns a container.
 ## The RM creates a token for each assigned NM from the NMToken master key 
and delivers the tokens to the AM.
 ## The NMToken master key is rolled 
periodically (yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs) 
for each NM. Apart from a short timing gap, each NM holds the same key as the 
RM.
 # When the NM is assigned a container, it stores the NMToken master key in 
use at that point (the same master key the RM used) in its oldMasterKeys hash 
map, with the ApplicationAttemptId as the key.
 # After that, requests from the AM to the NM (getContainerStatus) are made 
with the issued token.
 ** Even after the master key is rolled, these requests succeed because the 
key is kept in the NM's oldMasterKeys (persisted in the NMStateStore).
 # The problem appears when the AM loses that token for any reason (e.g. RM 
failover, AM restart).
 # For example, when the AM restarts, it uses a token created with the NMToken 
master key current at that point, and that token is effective only for a 
limited time (yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs 
* 2).
 ** If the NM's oldMasterKeys already holds a master key for the 
ApplicationAttemptId, the re-issued token is accepted only while its key is 
the NM's currentMasterKey or previousMasterKey; after that, the token is 
rejected.

That is, whenever the AM is re-issued a token for any reason, that token is 
valid only for a limited time.
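
The flow above can be sketched as a small in-memory model. This is a 
simplified illustration under assumptions, not the NodeManager's actual code: 
the class NmTokenKeyModel and its methods are hypothetical, master keys are 
reduced to integer ids, and the only behaviour modelled is the one this report 
describes (a key remembered per ApplicationAttemptId is never replaced, and 
any other token is accepted only while its key is current or previous).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the NM-side key check described in this report.
final class NmTokenKeyModel {
    private int currentKeyId;
    private Integer previousKeyId; // null until the first roll
    // Key remembered per application attempt (the report's oldMasterKeys).
    private final Map<String, Integer> oldMasterKeys = new HashMap<>();

    NmTokenKeyModel(int initialKeyId) {
        this.currentKeyId = initialKeyId;
    }

    int currentKeyId() {
        return currentKeyId;
    }

    // One rolling interval elapses: current becomes previous, a new key is current.
    void rollMasterKey() {
        previousKeyId = currentKeyId;
        currentKeyId++;
    }

    // Remembers the key for an attempt only if none is stored yet (assumption
    // taken from the report: an existing entry is not updated).
    void rememberKeyForAttempt(String attemptId, int keyId) {
        oldMasterKeys.putIfAbsent(attemptId, keyId);
    }

    // A token is accepted if its key is the remembered, current, or previous key.
    boolean accepts(String attemptId, int tokenKeyId) {
        Integer remembered = oldMasterKeys.get(attemptId);
        return (remembered != null && remembered == tokenKeyId)
                || tokenKeyId == currentKeyId
                || (previousKeyId != null && previousKeyId == tokenKeyId);
    }

    public static void main(String[] args) {
        String attempt = "appattempt_example"; // placeholder id
        NmTokenKeyModel nm = new NmTokenKeyModel(1);
        nm.rememberKeyForAttempt(attempt, 1);   // first token, key 1
        nm.rollMasterKey();
        nm.rollMasterKey();
        nm.rollMasterKey();
        System.out.println(nm.accepts(attempt, 1));        // original token still works

        int reissued = nm.currentKeyId();        // AM gets a new token after fail-over
        nm.rememberKeyForAttempt(attempt, reissued);       // no effect: entry exists
        nm.rollMasterKey();
        System.out.println(nm.accepts(attempt, reissued)); // key is now "previous"
        nm.rollMasterKey();
        System.out.println(nm.accepts(attempt, reissued)); // key has aged out
    }
}
```

In this model the original token survives any number of rolls, while the 
re-issued token stops being accepted after the second roll, which matches the 
two-interval window reported here.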

  was:
If the artifact type of yarn-service spec is docker, getContainerStatus is 
periodically requested through the NMClient.

And when RM fail-over occurs, getContainerStatus fails after a 
specific(yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs *2) 
time.

Then the following log occurs in AM
{code:java}
2021-04-05 19:18:47,381 [pool-5-thread-2] ERROR instance.ComponentInstance - [COMPINSTANCE regionserver-2 : container_e82_1612399098156_879545_01_000004] Failed to get container status on ac3iax2079.bdp.bdata.ai:9454, will try again
javax.security.sasl.SaslException: DIGEST-MD5: digest response format violation. Mismatched response. [Caused by org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): DIGEST-MD5: digest response format violation. Mismatched response.]
    at sun.reflect.GeneratedConstructorAccessor35.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
    at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80)
    at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119)
    at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.getContainerStatuses(ContainerManagementProtocolPBClientImpl.java:159)
    at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
    at com.sun.proxy.$Proxy47.getContainerStatuses(Unknown Source)
    at org.apache.hadoop.yarn.client.api.impl.NMClientImpl.getContainerStatus(NMClientImpl.java:339)
    at org.apache.hadoop.yarn.service.component.instance.ComponentInstance$ContainerStatusRetriever.run(ComponentInstance.java:958)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): DIGEST-MD5: digest response format violation. Mismatched response.
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1511)
    at org.apache.hadoop.ipc.Client.call(Client.java:1457)
    at org.apache.hadoop.ipc.Client.call(Client.java:1367)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
    at com.sun.proxy.$Proxy46.getContainerStatuses(Unknown Source)
    at org.apache.hadoo
{code}
The overall flow is as follows.
 # Started AM
 # AM requests containers
 # RM assigned a container
 ## RM makes tokens for each NM assigned to NMToken Master Key and delivers 
them to AM
 ## This NMToken master key is rolling 
periodically(yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs) 
for each NM. There's a timing issue, but it's the same value as RM
 # NM is assigned a container and stores the NMToken master key (same as the 
master key used by RM) in the old Master Keys (hashmap) at that point with 
ApplicationAttemptId as the key
 # After that, requests from AM to NM (getContainerStatus) are made through the 
issued token
 ** Even if the master key is rolled, the request succeeds because it is stored 
in NM's oldMasterKeys (stored in NMStateStore)
 # But it becomes a problem if the AM loses that token for any reason(e.g. RM 
failover, AM restart)
 # For example, when the AM restarts, the AM uses the token created with the 
NMToken master key at that point and is only effective for a 
specific(yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs*2) time
 ** If there is an old MasterKey of ApplicationAttemptId in NM's oldMasterKey, 
currentMasterKey and preliminaryMasterKey are valid, but subsequent tokens are 
invalid

That is, for any reason, when AM is re-issued with a token, it is only valid 
for a specific time


> After RM fail-over, getContainerStatus fails from ApplicationMaster to 
> NodeManager
> ----------------------------------------------------------------------------------
>
>                 Key: YARN-10969
>                 URL: https://issues.apache.org/jira/browse/YARN-10969
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.1.2
>            Reporter: Lee young gon
>            Priority: Major
>         Attachments: YARN-10969.001.patch



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

