[
https://issues.apache.org/jira/browse/YARN-6207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15874113#comment-15874113
]
Sunil G edited comment on YARN-6207 at 2/20/17 6:35 AM:
--------------------------------------------------------
An app null check may not fix the race condition correctly; it can still
leave a corner case. I would like to continue the discussion on attempt
states as well.
Let's take [~bibinchundatt]'s scenario itself, where the app attempt events
are still in the Async Dispatcher (delayed). The same scenario will happen
after the first attempt fails as well (the 2nd attempt is delayed).
+What will happen in scheduler:+
In any case, we assume that {{SchedulerApplication}} is created inside the
scheduler (this holds because we check that the app state in ClientRMService
is ACCEPTED/RUNNING). CS's and FS's *moveApplication* invokes
{{getApplicationAttempt}} to get the app attempt object.
{{AbsScheduler#getApplicationAttempt}} could return null in 2 cases: a) when
the application itself is not there; b) when the current attempt is null. As
mentioned above, the app cannot be null here, but the attempt still may be.
In case of first attempt failure,
{{SchedulerApplication.getCurrentAppAttempt}} could return the old object
until the 2nd attempt is set via {{APP_ATTEMPT_ADDED}}. Hence the app null
check will not help in either scheduler (FS has only a null check for app,
not for attempt). Even an attempt null check won't help in case of first AM
failure, as the scheduler still holds the old, stale attempt object in
stopped form.
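To make the cases above concrete, here is a minimal sketch of the lookup. It
is a simplified model, not the real YARN code: {{SchedulerSketch}},
{{SchedulerAppSketch}}, {{AttemptSketch}} and the string app ids are
hypothetical stand-ins, and the handler methods only imitate what the
scheduler does on the corresponding events.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a scheduler-side application attempt.
class AttemptSketch {
    final int attemptNumber;
    boolean stopped; // set once the attempt has been stopped in the scheduler

    AttemptSketch(int attemptNumber) {
        this.attemptNumber = attemptNumber;
    }
}

// Hypothetical stand-in for SchedulerApplication.
class SchedulerAppSketch {
    // Stays null until the first attempt-added event has been handled.
    private AttemptSketch currentAttempt;

    AttemptSketch getCurrentAppAttempt() {
        return currentAttempt;
    }

    void setCurrentAppAttempt(AttemptSketch attempt) {
        this.currentAttempt = attempt;
    }
}

// Hypothetical stand-in for the scheduler's application bookkeeping.
class SchedulerSketch {
    private final Map<String, SchedulerAppSketch> applications = new HashMap<>();

    void addApplication(String appId) {
        applications.put(appId, new SchedulerAppSketch());
    }

    // Imitates handling of an attempt-added event (may arrive late).
    void handleAttemptAdded(String appId, int attemptNumber) {
        applications.get(appId).setCurrentAppAttempt(new AttemptSketch(attemptNumber));
    }

    // Imitates an attempt failure: the old object stays in place, marked stopped.
    void handleAttemptStopped(String appId) {
        AttemptSketch attempt = applications.get(appId).getCurrentAppAttempt();
        if (attempt != null) {
            attempt.stopped = true;
        }
    }

    // Null when (a) the app is unknown or (b) no attempt has been set yet.
    AttemptSketch getApplicationAttempt(String appId) {
        SchedulerAppSketch app = applications.get(appId);
        return app == null ? null : app.getCurrentAppAttempt();
    }
}
```

In this model, a lookup between {{addApplication}} and the first
{{handleAttemptAdded}} returns null (case b), and a lookup between
{{handleAttemptStopped}} and the next {{handleAttemptAdded}} returns a
non-null but stopped attempt, which a plain null check would not catch.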
Also there could be another corner case. Assume move app is called when the
1st attempt has failed and the 2nd attempt is in its init states. It could
potentially push 2 attempts to the target queue. Ideally, if we fix this in
ClientRMService, we need not worry about changes across the schedulers. If
the attempt state is between ACCEPTED and RUNNING, we are sure that the new
attempt has been added to the scheduler.
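A rough sketch of what such a client-side guard could look like is below. It
is only an illustration under assumptions: {{AttemptStateSketch}} is a
hypothetical, shortened state list (not the real attempt state enum),
{{MoveGuardSketch}} is a made-up helper, and the choice of which states count
as "attempt is known to the scheduler" is an assumption of this sketch.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical, shortened list of attempt states, in lifecycle order.
enum AttemptStateSketch { NEW, SUBMITTED, SCHEDULED, ALLOCATED, LAUNCHED, RUNNING, FAILED }

// Hypothetical helper for a check done before the move reaches the scheduler.
class MoveGuardSketch {
    // Assumption: in these states the scheduler has already added the attempt.
    static final Set<AttemptStateSketch> MOVABLE =
        EnumSet.range(AttemptStateSketch.SCHEDULED, AttemptStateSketch.RUNNING);

    // Rejects the move early instead of letting the scheduler hit a null
    // or stale attempt later.
    static void checkMovable(AttemptStateSketch currentAttemptState) {
        if (currentAttemptState == null || !MOVABLE.contains(currentAttemptState)) {
            throw new IllegalStateException(
                "Cannot move application: current attempt is in state "
                    + currentAttemptState);
        }
    }
}
```

With a guard of this shape, a move requested while the new attempt is still
in NEW or SUBMITTED, or while the old attempt is FAILED, is rejected up
front, so neither scheduler needs its own attempt check.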
Discussed offline with [~rohithsharma] as well.
> Move application can fail when attempt add event is delayed
> ------------------------------------------------------------
>
> Key: YARN-6207
> URL: https://issues.apache.org/jira/browse/YARN-6207
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler
> Reporter: Bibin A Chundatt
> Assignee: Bibin A Chundatt
> Attachments: YARN-6207.001.patch, YARN-6207.002.patch
>
>
> *Steps to reproduce*
> 1. Submit application and delay attempt add to Scheduler
> (Simulate using debug at EventDispatcher for SchedulerEventDispatcher)
> 2. Call move application to destination queue.
> {noformat}
> Caused by: org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): java.lang.NullPointerException
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.preValidateMoveApplication(CapacityScheduler.java:2086)
> at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.moveApplicationAcrossQueue(RMAppManager.java:669)
> at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.moveApplicationAcrossQueues(ClientRMService.java:1231)
> at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.moveApplicationAcrossQueues(ApplicationClientProtocolPBServiceImpl.java:388)
> at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:537)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:522)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:867)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:813)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1892)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2659)
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1483)
> at org.apache.hadoop.ipc.Client.call(Client.java:1429)
> at org.apache.hadoop.ipc.Client.call(Client.java:1339)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:115)
> at com.sun.proxy.$Proxy7.moveApplicationAcrossQueues(Unknown Source)
> at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.moveApplicationAcrossQueues(ApplicationClientProtocolPBClientImpl.java:398)
> ... 16 more
> {noformat}