[
https://issues.apache.org/jira/browse/YARN-9238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
lujie updated YARN-9238:
------------------------
Description:
We have found a data race that can lead to an odd situation.
See
org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.OpportunisticAMSProcessor.allocate ({color:#ff0000}code1{color}):
{code:java}
// Allocate OPPORTUNISTIC containers.
171. SchedulerApplicationAttempt appAttempt =
172.     ((AbstractYarnScheduler)rmContext.getScheduler())
173.         .getApplicationAttempt(appAttemptId);
174.
175. OpportunisticContainerContext oppCtx =
176.     appAttempt.getOpportunisticContainerContext();
177. oppCtx.updateNodeList(getLeastLoadedNodes());
{code}
If the current AM (attempt ID appattempt_0) crashes just before code1#171,
and code1#171~173 then runs to look up the appAttempt for appattempt_0, the
returned appAttempt should represent that AM. Instead, we found that it
represents the new AM, whose attempt ID is appattempt_1. The appAttempt for
the new AM has not yet initialized its oppCtx, so an NPE occurs at code1#177.
{code:java}
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService$OpportunisticAMSProcessor.allocate(OpportunisticContainerAllocatorAMService.java:177)
    at org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
    at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:424)
    at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
    at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:878)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2830)
{code}
We have found why a lookup with the old appattempt_0 returns the appAttempt
that represents the new AM. The code below ({color:#ff0000}code2{color}) is
the body of getApplicationAttempt, which is called at code1#173:
{code:java}
399. public T getApplicationAttempt(ApplicationAttemptId applicationAttemptId) {
400.   SchedulerApplication<T> app = applications.get(
401.       applicationAttemptId.getApplicationId());
402.   return app == null ? null : app.getCurrentAppAttempt();
403. }
{code}
When the old AM crashes, the CurrentAppAttempt of the app is set to the new
appAttempt that represents the new AM. The lookup uses only the application
ID, so code2#402 returns the new appAttempt.
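To make the lookup behavior concrete, here is a minimal, self-contained model of it. The classes AttemptId, Attempt, App, and SchedulerModel are hypothetical stand-ins invented for illustration, not the real YARN types; only the shape of getApplicationAttempt follows code2.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-ins for the YARN types; only the lookup shape of code2 is modeled.
final class AttemptId {
  final String appId;
  final int attemptNo;

  AttemptId(String appId, int attemptNo) {
    this.appId = appId;
    this.attemptNo = attemptNo;
  }
}

final class Attempt {
  final AttemptId id;

  Attempt(AttemptId id) {
    this.id = id;
  }
}

final class App {
  private volatile Attempt currentAttempt;

  Attempt getCurrentAppAttempt() {
    return currentAttempt;
  }

  void setCurrentAppAttempt(Attempt attempt) {
    this.currentAttempt = attempt;
  }
}

final class SchedulerModel {
  private final Map<String, App> applications = new ConcurrentHashMap<>();

  void addApp(String appId, App app) {
    applications.put(appId, app);
  }

  // Same shape as code2: the map is keyed by application ID, so the attempt
  // number inside the requested ID is never consulted.
  Attempt getApplicationAttempt(AttemptId requested) {
    App app = applications.get(requested.appId);
    return app == null ? null : app.getCurrentAppAttempt();
  }
}

class StaleLookupDemo {
  public static void main(String[] args) {
    SchedulerModel scheduler = new SchedulerModel();
    App app = new App();
    AttemptId oldId = new AttemptId("app_1", 0);
    app.setCurrentAppAttempt(new Attempt(oldId));
    scheduler.addApp("app_1", app);

    // The old AM crashes and a new attempt becomes current.
    app.setCurrentAppAttempt(new Attempt(new AttemptId("app_1", 1)));

    // A request that still carries the old ID now resolves to the new attempt.
    Attempt found = scheduler.getApplicationAttempt(oldId);
    System.out.println("requested attempt " + oldId.attemptNo
        + ", got attempt " + found.id.attemptNo);
  }
}
```

Running StaleLookupDemo prints "requested attempt 0, got attempt 1", which mirrors the mismatch between appattempt_0 and appattempt_1 described above.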
If the AM crashes earlier, the bug does not occur, because
ApplicationDoesNotExistInCacheException is thrown before code1 is reached. If
the AM crashes after code1, everything is also fine.
We should add a check that the returned appAttempt has the same ID as the
given ID.
A patch is coming soon!
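The proposed check could look roughly like the following sketch. It is not the attached patch: CheckedLookup and its methods are hypothetical names, attempts are reduced to plain attempt numbers, and returning null on a mismatch is an assumption. How the caller should react to a mismatch is not specified in this report and is left out.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model: each application ID maps to the number of its current attempt.
final class CheckedLookup {
  private final Map<String, Integer> currentAttemptByApp = new ConcurrentHashMap<>();

  void setCurrentAttempt(String appId, int attemptNo) {
    currentAttemptByApp.put(appId, attemptNo);
  }

  // Returns the current attempt number only when it matches the requested one.
  // Returns null (an assumption of this sketch) for an unknown application or
  // for a request that carries the ID of an attempt that is no longer current.
  Integer getApplicationAttempt(String appId, int requestedAttemptNo) {
    Integer current = currentAttemptByApp.get(appId);
    if (current == null || current != requestedAttemptNo) {
      return null;
    }
    return current;
  }
}

class CheckedLookupDemo {
  public static void main(String[] args) {
    CheckedLookup lookup = new CheckedLookup();
    lookup.setCurrentAttempt("app_1", 0);

    // The old AM crashes and attempt 1 becomes current.
    lookup.setCurrentAttempt("app_1", 1);

    // The stale request is no longer silently resolved to the new attempt.
    System.out.println("stale lookup result: "
        + lookup.getApplicationAttempt("app_1", 0));
  }
}
```

With the ID comparison in place, a request for the old attempt is detected at lookup time instead of failing later on the uninitialized oppCtx of the new attempt.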
> We get a wrong attempt by an appAttemptId when AM crash at some point
> ----------------------------------------------------------------------
>
> Key: YARN-9238
> URL: https://issues.apache.org/jira/browse/YARN-9238
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: lujie
> Assignee: lujie
> Priority: Critical
> Attachments: YARN-9238_1.patch,
> hadoop-test-resourcemanager-hadoop11.log