[jira] [Comment Edited] (YARN-6207) Move application can fail when attempt add event is delayed

Sunil G (JIRA) Sun, 19 Feb 2017 22:36:13 -0800

    [ 
https://issues.apache.org/jira/browse/YARN-6207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15874113#comment-15874113
 ]


Sunil G edited comment on YARN-6207 at 2/20/17 6:35 AM:
--------------------------------------------------------

An app null check may not fix the race condition correctly; it can still 
leave a corner case. I would like to continue the discussion on attempt 
states as well.
Let's take [~bibinchundatt]'s scenario, where the app attempt events are 
still queued in the Async Dispatcher (delayed). The same scenario will also 
happen after a first attempt failure (the 2nd attempt is delayed). 

+What will happen in scheduler:+
In any case, we assume that {{SchedulerApplication}} is created inside the 
scheduler (this holds because we check the app state in ClientRMService as 
ACCEPTED/RUNNING). *moveApplication* in both CS and FS invokes 
{{getApplicationAttempt}} to get the app attempt object. 
{{AbsScheduler#getApplicationAttempt}} could return null in 2 cases: a) when 
the application itself is not there; b) when the current attempt is null. As 
mentioned above, the app cannot be null here, but the attempt may still be 
null. In case of a first attempt failure, 
{{SchedulerApplication.getCurrentAppAttempt}} could return the old object 
until the 2nd attempt is set via {{APP_ATTEMPT_ADDED}}.
Hence the app null check will not help in either scheduler (FS has only a 
null check for the app, not for the attempt). Even an attempt null check 
won’t help in the case of a first AM failure, as the scheduler still holds 
the old, stale attempt object in stopped form. 
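
To make the two null cases and the stale-attempt case concrete, here is a 
minimal, self-contained sketch. It is not Hadoop code: the class 
AttemptLookupSketch, its nested types, and the onAppAdded / onAttemptAdded / 
onAttemptFailed methods are hypothetical stand-ins for the scheduler-side 
bookkeeping, and only the name getApplicationAttempt is borrowed from the 
discussion above.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical model of the scheduler-side lookup; not Hadoop code.
class AttemptLookupSketch {

    // Stand-in for a scheduler attempt object.
    static final class Attempt {
        final int attemptId;
        boolean stopped;
        Attempt(int attemptId) { this.attemptId = attemptId; }
    }

    // Stand-in for SchedulerApplication: the current attempt stays null
    // (or stale) until the attempt-added event is processed.
    static final class SchedulerApp {
        Attempt currentAttempt;
    }

    private final Map<String, SchedulerApp> applications = new HashMap<>();

    void onAppAdded(String appId) {
        applications.put(appId, new SchedulerApp());
    }

    // Models handling of the attempt-added event, which may arrive late.
    void onAttemptAdded(String appId, int attemptId) {
        applications.get(appId).currentAttempt = new Attempt(attemptId);
    }

    // Models the first attempt failing: the object is stopped but not replaced.
    void onAttemptFailed(String appId) {
        applications.get(appId).currentAttempt.stopped = true;
    }

    Attempt getApplicationAttempt(String appId) {
        SchedulerApp app = applications.get(appId);
        if (app == null) {
            return null;            // case a: application is not known
        }
        return app.currentAttempt;  // case b: null, or a stale stopped attempt
    }

    public static void main(String[] args) {
        AttemptLookupSketch s = new AttemptLookupSketch();
        s.onAppAdded("app_1");
        // Attempt-added event is delayed: the app exists, the attempt is null.
        System.out.println("attempt is null: "
            + (s.getApplicationAttempt("app_1") == null));

        s.onAttemptAdded("app_1", 1);
        s.onAttemptFailed("app_1");
        // Second attempt-added event is delayed: non-null, but stale.
        Attempt a = s.getApplicationAttempt("app_1");
        System.out.println("attempt " + a.attemptId + " stopped: " + a.stopped);
    }
}
```

In this model a caller that only checks for null would accept the stopped 
first attempt, which is the situation described above.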

Also there could be another corner case. Assume move app is called when the 
1st attempt has failed and the 2nd attempt is in its init states. It could 
potentially push 2 attempts to the target queue. Ideally, if we fix this in 
ClientRMService, we need not worry about changes across schedulers. If the 
attempt state is between ACCEPTED and RUNNING, we are sure that the new 
attempt has been added to the scheduler.
Discussed offline with [~rohithsharma] as well.
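
A rough sketch of what such a service-side guard could look like, under 
stated assumptions: MoveGuardSketch, AttemptState, canMove and 
moveApplication are hypothetical names for illustration only, and the enum 
is a simplified stand-in that does not mirror the real attempt state enum in 
YARN. It only shows the idea of rejecting the move before the scheduler is 
ever consulted.

```java
import java.util.EnumSet;

// Hypothetical sketch of a service-side guard; not the actual ClientRMService code.
class MoveGuardSketch {

    // Simplified stand-in for attempt states. Declaration order matters,
    // because EnumSet.range relies on it.
    enum AttemptState { NEW, SUBMITTED, ACCEPTED, RUNNING, FAILED }

    // States in which the new attempt is assumed to be known to the scheduler.
    private static final EnumSet<AttemptState> MOVABLE =
        EnumSet.range(AttemptState.ACCEPTED, AttemptState.RUNNING);

    static boolean canMove(AttemptState currentAttemptState) {
        return currentAttemptState != null && MOVABLE.contains(currentAttemptState);
    }

    // Rejects the move up front instead of relying on null checks in each scheduler.
    static void moveApplication(String appId, AttemptState state, String targetQueue) {
        if (!canMove(state)) {
            throw new IllegalStateException("Cannot move " + appId + " to "
                + targetQueue + ": current attempt is in state " + state);
        }
        // The hand-off to the scheduler would happen here.
    }

    public static void main(String[] args) {
        System.out.println(canMove(AttemptState.RUNNING));
        System.out.println(canMove(AttemptState.NEW));
    }
}
```

With a check of this shape, a request arriving while the attempt is still in 
its init states or already failed is rejected with a clear error rather than 
reaching the scheduler.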







> Move application can fail when attempt add event is delayed
> ------------------------------------------------------------
>
>                 Key: YARN-6207
>                 URL: https://issues.apache.org/jira/browse/YARN-6207
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>         Attachments: YARN-6207.001.patch, YARN-6207.002.patch
>
>
> *Steps to reproduce*
> 1. Submit an application and delay the attempt add to the Scheduler
> (simulate using a debugger at EventDispatcher for SchedulerEventDispatcher).
> 2. Call move application to the destination queue.
> {noformat}
> Caused by: org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): java.lang.NullPointerException
>       at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.preValidateMoveApplication(CapacityScheduler.java:2086)
>       at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.moveApplicationAcrossQueue(RMAppManager.java:669)
>       at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.moveApplicationAcrossQueues(ClientRMService.java:1231)
>       at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.moveApplicationAcrossQueues(ApplicationClientProtocolPBServiceImpl.java:388)
>       at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:537)
>       at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:522)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
>       at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:867)
>       at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:813)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:422)
>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1892)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2659)
>       at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1483)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1429)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1339)
>       at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
>       at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:115)
>       at com.sun.proxy.$Proxy7.moveApplicationAcrossQueues(Unknown Source)
>       at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.moveApplicationAcrossQueues(ApplicationClientProtocolPBClientImpl.java:398)
>       ... 16 more
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
