[
https://issues.apache.org/jira/browse/YARN-7009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16129937#comment-16129937
]
Miklos Szegedi commented on YARN-7009:
--------------------------------------
The exception above was caused by the fact that we are always in the RUNNING
state, so we just jump to reinitialize after restart. I have also seen another
instance, when the container finished on a busy machine early waiting
indefinitely. I solved this by increasing the sleep time to an arbitrarily
large number. We stop the container anyways, plus we have the timeout.
There is another issue with the pid file. If it is not created within 2
seconds, this code may skip the reinitialize keeping the container in the
REINITIALIZE state forever.
processId may be null:
{code}
// get process id from pid file if available
// else if shell is still active, get it from the shell
String processId = null;
if (pidFilePath != null) {
processId = getContainerPid(pidFilePath);
}
// kill process
if (processId != null) {
{code}
Fixing this however needs significant refactoring, so another Jira I think. The
only thing I could do in this jira is to increase the
DEFAULT_NM_PROCESS_KILL_WAIT_MS timeout to 10000. What do you think?
> TestNMClient.testNMClientNoCleanupOnStop is flaky by design
> -----------------------------------------------------------
>
> Key: YARN-7009
> URL: https://issues.apache.org/jira/browse/YARN-7009
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Miklos Szegedi
> Assignee: Miklos Szegedi
> Attachments: YARN-7009.000.patch
>
>
> The sleeps to wait for a transition to reinit and than back to running is not
> long enough, it can miss the reinit event.
> {code}
> java.lang.AssertionError: Exception is not expected:
> org.apache.hadoop.yarn.exceptions.YarnException: Cannot perform RE_INIT on
> [container_1502735389852_0001_01_000001]. Current state is [REINITIALIZING,
> isReInitializing=true].
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.preReInitializeOrLocalizeCheck(ContainerManagerImpl.java:1772)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.reInitializeContainer(ContainerManagerImpl.java:1697)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.reInitializeContainer(ContainerManagerImpl.java:1668)
> at
> org.apache.hadoop.yarn.api.impl.pb.service.ContainerManagementProtocolPBServiceImpl.reInitializeContainer(ContainerManagementProtocolPBServiceImpl.java:214)
> at
> org.apache.hadoop.yarn.proto.ContainerManagementProtocol$ContainerManagementProtocolService$2.callBlockingMethod(ContainerManagementProtocol.java:237)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
> at
> org.apache.hadoop.yarn.client.api.impl.TestNMClient.testReInitializeContainer(TestNMClient.java:567)
> at
> org.apache.hadoop.yarn.client.api.impl.TestNMClient.testContainerManagement(TestNMClient.java:405)
> at
> org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClientNoCleanupOnStop(TestNMClient.java:214)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
> at
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
> at
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at
> org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> Caused by: org.apache.hadoop.yarn.exceptions.YarnException: Cannot perform
> RE_INIT on [container_1502735389852_0001_01_000001]. Current state is
> [REINITIALIZING, isReInitializing=true].
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.preReInitializeOrLocalizeCheck(ContainerManagerImpl.java:1772)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.reInitializeContainer(ContainerManagerImpl.java:1697)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.reInitializeContainer(ContainerManagerImpl.java:1668)
> at
> org.apache.hadoop.yarn.api.impl.pb.service.ContainerManagementProtocolPBServiceImpl.reInitializeContainer(ContainerManagementProtocolPBServiceImpl.java:214)
> at
> org.apache.hadoop.yarn.proto.ContainerManagementProtocol$ContainerManagementProtocolService$2.callBlockingMethod(ContainerManagementProtocol.java:237)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
> at
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateYarnException(RPCUtil.java:75)
> at
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:116)
> at
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.reInitializeContainer(ContainerManagementProtocolPBClientImpl.java:237)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy93.reInitializeContainer(Unknown Source)
> at
> org.apache.hadoop.yarn.client.api.impl.NMClientImpl.reInitializeContainer(NMClientImpl.java:322)
> at
> org.apache.hadoop.yarn.client.api.impl.TestNMClient.testReInitializeContainer(TestNMClient.java:561)
> ... 11 more
> Caused by:
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.YarnException):
> Cannot perform RE_INIT on [container_1502735389852_0001_01_000001]. Current
> state is [REINITIALIZING, isReInitializing=true].
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.preReInitializeOrLocalizeCheck(ContainerManagerImpl.java:1772)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.reInitializeContainer(ContainerManagerImpl.java:1697)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.reInitializeContainer(ContainerManagerImpl.java:1668)
> at
> org.apache.hadoop.yarn.api.impl.pb.service.ContainerManagementProtocolPBServiceImpl.reInitializeContainer(ContainerManagementProtocolPBServiceImpl.java:214)
> at
> org.apache.hadoop.yarn.proto.ContainerManagementProtocol$ContainerManagementProtocolService$2.callBlockingMethod(ContainerManagementProtocol.java:237)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1490)
> at org.apache.hadoop.ipc.Client.call(Client.java:1436)
> at org.apache.hadoop.ipc.Client.call(Client.java:1346)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
> at com.sun.proxy.$Proxy92.reInitializeContainer(Unknown Source)
> at
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.reInitializeContainer(ContainerManagementProtocolPBClientImpl.java:235)
> ... 23 more
> {code}
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]