[
https://issues.apache.org/jira/browse/YARN-9238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
lujie updated YARN-9238:
------------------------
Description:
See
org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.OpportunisticAMSProcessor.allocate
{code:java}
// Allocate OPPORTUNISTIC containers.
171. SchedulerApplicationAttempt appAttempt =
172. ((AbstractYarnScheduler)rmContext.getScheduler())
173. .getApplicationAttempt(appAttemptId);
174.
175. OpportunisticContainerContext oppCtx =
176. appAttempt.getOpportunisticContainerContext();
177. oppCtx.updateNodeList(getLeastLoadedNodes());
{code}
If "allocate" reaches line 171 and the MRAppMaster crashes, the ResourceManager
will start a new appAttempt and call
{code:java}
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplication.setCurrentAppAttempt(T
currentAttempt){
this.currentAttempt = currentAttempt;
}{code}
The new appAttempt has not yet initialized its OpportunisticContainerContext
field, so oppCtx == null and a NullPointerException is thrown at line 177:
{code:java}
java.lang.NullPointerException
at org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService$OpportunisticAMSProcessor.allocate(OpportunisticContainerAllocatorAMService.java:177)
at org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:424)
at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:878)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2830)
{code}
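The interleaving described above can be modeled in a small standalone sketch.
This is not the ResourceManager code: Attempt, App, applications and
allocateOutcome are hypothetical stand-ins, and the lookup is assumed to be
keyed by application only, so it returns whichever attempt is current at the
time of the call.
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class StaleAttemptLookupSketch {
  /** Hypothetical stand-in for SchedulerApplicationAttempt. */
  static final class Attempt {
    final String attemptId;
    // Stands in for OpportunisticContainerContext; null until initialized.
    final Object oppCtx;

    Attempt(String attemptId, Object oppCtx) {
      this.attemptId = attemptId;
      this.oppCtx = oppCtx;
    }
  }

  /** Hypothetical stand-in for SchedulerApplication. */
  static final class App {
    private volatile Attempt currentAttempt;

    void setCurrentAppAttempt(Attempt attempt) {
      this.currentAttempt = attempt;
    }

    Attempt getCurrentAppAttempt() {
      return currentAttempt;
    }
  }

  static final Map<String, App> applications = new ConcurrentHashMap<>();

  // Assumption: the lookup ignores the attempt id and returns the current attempt.
  static Attempt getApplicationAttempt(String applicationId) {
    App app = applications.get(applicationId);
    return app == null ? null : app.getCurrentAppAttempt();
  }

  // Reports what the allocate path would see, instead of dereferencing oppCtx.
  static String allocateOutcome(String applicationId) {
    Attempt attempt = getApplicationAttempt(applicationId);
    if (attempt == null) {
      return "no attempt";
    }
    String state = attempt.oppCtx == null ? "oppCtx is null for " : "oppCtx is set for ";
    return state + attempt.attemptId;
  }

  public static void main(String[] args) {
    App app = new App();
    app.setCurrentAppAttempt(new Attempt("appattempt_0", new Object()));
    applications.put("application_1", app);
    System.out.println(allocateOutcome("application_1"));

    // The AM crashes; a new attempt becomes current before its context exists.
    app.setCurrentAppAttempt(new Attempt("appattempt_1", null));
    System.out.println(allocateOutcome("application_1"));
  }
}
{code}
In this model the second call sees the new attempt with a null context, which
is the state in which line 177 would throw.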
was:
We have found a data race that can lead to an odd situation.
See
org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.OpportunisticAMSProcessor.allocate{color:#ff0000}:(code1){color}
{code:java}
// Allocate OPPORTUNISTIC containers.
171. SchedulerApplicationAttempt appAttempt =
172. ((AbstractYarnScheduler)rmContext.getScheduler())
173. .getApplicationAttempt(appAttemptId);
174.
175. OpportunisticContainerContext oppCtx =
176. appAttempt.getOpportunisticContainerContext();
177. oppCtx.updateNodeList(getLeastLoadedNodes());
{code}
If the current AM (attempt ID appattempt_0) crashes just before code1#171,
then when code1#171~173 goes on to look up the appAttempt by appattempt_0, the
returned appAttempt should represent the current AM. But we found that the
returned appAttempt represents the new AM, with attempt ID appattempt_1. This
appAttempt has not initialized its oppCtx, so an NPE happens at code1#177.
{code:java}
java.lang.NullPointerException
at org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService$OpportunisticAMSProcessor.allocate(OpportunisticContainerAllocatorAMService.java:177)
at org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:424)
at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:878)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2830)
{code}
So why does the old appAttempt disappear, and why do we look up by the old
appattempt_0 but get the new appAttempt?
We have found the reason. The code below ({color:#ff0000}code2{color}) is the
body of getApplicationAttempt, called at code1#173:
{code:java}
399. public T getApplicationAttempt(ApplicationAttemptId applicationAttemptId) {
400.   SchedulerApplication<T> app = applications.get(
401.       applicationAttemptId.getApplicationId());
402.   return app == null ? null : app.getCurrentAppAttempt();
403. }
{code}
When the old AM crashes, a new AM and a new appAttempt are created. The
currentAttempt of the app is set to the new appAttempt (see code3), so
code2#402 returns the new appAttempt.
If the AM crashes at the start of the allocate function (code1), the bug does
not occur, due to ApplicationDoesNotExistInCacheException. If the AM crashes
after code1, everything is also fine.
We should add a check that the returned appAttempt has the same ID as the
given ID.
A patch is coming soon!
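The proposed check could look roughly like the following. This is only a
minimal sketch and not the attached patch: Attempt and requireSameAttempt are
hypothetical names, and attempt IDs are modeled as plain strings instead of
ApplicationAttemptId.
{code:java}
import java.util.Objects;

public class AttemptIdGuardSketch {
  /** Hypothetical stand-in for SchedulerApplicationAttempt. */
  static final class Attempt {
    final String attemptId;

    Attempt(String attemptId) {
      this.attemptId = attemptId;
    }
  }

  /**
   * Returns the attempt only if it is the one the caller asked for.
   * A null result means the request targets a previous, removed or
   * non-existent attempt and should be rejected by the caller.
   */
  static Attempt requireSameAttempt(Attempt current, String requestedAttemptId) {
    if (current == null || !Objects.equals(current.attemptId, requestedAttemptId)) {
      return null;
    }
    return current;
  }

  public static void main(String[] args) {
    // The lookup returned the new attempt although the old id was requested.
    Attempt current = new Attempt("appattempt_1");
    System.out.println(requireSameAttempt(current, "appattempt_0") == null);
    System.out.println(requireSameAttempt(current, "appattempt_1") == current);
  }
}
{code}
With a guard of this kind, a stale allocate call is turned away before it can
touch the uninitialized context of the new attempt.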
{color:#ff0000}code3{color}
{code:java}
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplication.setCurrentAppAttempt(T
currentAttempt){
this.currentAttempt = currentAttempt;
}
{code}
> Allocate on previous or removed or non existent application attempt
> -------------------------------------------------------------------
>
> Key: YARN-9238
> URL: https://issues.apache.org/jira/browse/YARN-9238
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: lujie
> Assignee: lujie
> Priority: Critical
> Attachments: YARN-9238_1.patch, YARN-9238_2.patch, YARN-9238_3.patch,
> hadoop-test-resourcemanager-hadoop11.log
>