[jira] [Updated] (YARN-6207) Move application can fail when attempt add event is delayed

Bibin A Chundatt (JIRA) Wed, 22 Feb 2017 09:34:26 -0800

     [ 
https://issues.apache.org/jira/browse/YARN-6207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Bibin A Chundatt updated YARN-6207:
-----------------------------------
    Attachment: YARN-6207.004.patch

Thank you [~naganarasimha...@apache.org]/[~sunilg]/[~rohithsharma] for the comments.

{quote}
I think !app.isStopped() can be done at upper level along with null check. if 
(null != app || !app.isStopped() )
nit : change null check with java code style i.e app!=null.
{quote} 
In case the application is submitted with transferFromPreviousAttempt in the app 
context, live container metrics need to be updated in the queues.
{quote}
1. app.move(dest); is invoked even when app is STOPPED. Internally it updates 
queue metrics in source queue and also in appScheduling info (which also is 
stopped). I think if app is stopped, we can assume that all internal metrics of 
the app is released from source queue. Hence we may not need to do the same 
again in move. please check once .
{quote}
Since live container metrics need to be updated in {{app.move}}, we can skip only 
the appSchedulingInfo update when the attempt is stopped.
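
To illustrate the idea only (this is not the code in the patch), here is a minimal sketch. The classes {{Queue}} and {{Attempt}} and the counters {{liveContainers}} and {{activeAttempts}} are hypothetical stand-ins, not YARN classes; {{activeAttempts}} stands in for whatever bookkeeping appSchedulingInfo drives. The assumption is that a stopped attempt has already released that bookkeeping from the source queue, while its live containers still count against the queue.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration; these are not the YARN classes.
class MoveStoppedAttemptSketch {

    static final class Queue {
        final String name;
        int liveContainers;  // stand-in for the queue's live container metrics
        int activeAttempts;  // stand-in for state driven by appSchedulingInfo
        Queue(String name) { this.name = name; }
    }

    static final class Attempt {
        Queue queue;
        boolean stopped;
        final List<String> liveContainers = new ArrayList<>();

        Attempt(Queue queue) {
            this.queue = queue;
            queue.activeAttempts++;
        }

        void addContainer(String containerId) {
            liveContainers.add(containerId);
            queue.liveContainers++;
        }

        // Stopping releases the scheduling bookkeeping but keeps live containers.
        void stop() {
            if (!stopped) {
                stopped = true;
                queue.activeAttempts--;
            }
        }

        void move(Queue dest) {
            // Live container metrics always follow the attempt to the new queue.
            int live = liveContainers.size();
            queue.liveContainers -= live;
            dest.liveContainers += live;

            // Assumption: only a running attempt still holds scheduling
            // bookkeeping in the source queue, so skip this part when stopped.
            if (!stopped) {
                queue.activeAttempts--;
                dest.activeAttempts++;
            }
            queue = dest;
        }
    }
}
```

With this shape, moving a stopped attempt that still has two live containers shifts both containers to the destination queue and leaves the attempt counters of both queues untouched.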
{quote}
2. abstractUsersManager.deactivateApplication(user, applicationId); this is 
invoked from app.move(). So do we need to call LQ.finishApplication() except 
the fact that queue may have to be moved to STOPPED if it was draining.
{quote}
As mentioned above, we can skip the appSchedulingInfo update in case of stopped attempts.
{quote}
3. FS also need a null check for attempt, correct?
{quote}
I have handled the SchedulerApplication null check in the latest patch. If the 
comment is regarding the Fair Scheduler, we will currently handle only Capacity 
Scheduler cases in this JIRA.
{quote}
1. One corner case: when ClientRMService validates, app state is still running, but 
when it reaches scheduler application might have got completed hence to be safe 
just we can check whether scheduler application is not null for appId.
{quote}
Done
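
For illustration, a minimal sketch of the two cases the validation has to tolerate: the application has already been removed from the scheduler, or the application is present but its attempt has not been added yet (the delayed attempt add event from this issue). The names {{MoveValidationSketch}}, {{SchedulerApp}}, {{preValidateMove}} and the use of {{IllegalStateException}} are hypothetical stand-ins and are not taken from the patch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-ins for illustration; these are not the YARN classes.
class MoveValidationSketch {

    static final class SchedulerApp {
        final String queue;
        // Stays null until the (possibly delayed) attempt add event is handled.
        volatile Object currentAttempt;
        SchedulerApp(String queue) { this.queue = queue; }
    }

    private final Map<String, SchedulerApp> applications = new ConcurrentHashMap<>();

    void addApplication(String appId, String queue) {
        applications.put(appId, new SchedulerApp(queue));
    }

    void addAttempt(String appId) {
        SchedulerApp app = applications.get(appId);
        if (app != null) {
            app.currentAttempt = new Object();
        }
    }

    void removeApplication(String appId) {
        applications.remove(appId);
    }

    /**
     * Returns true when an attempt exists and attempt-level checks can run,
     * false when the attempt has not been added yet. Throws when the
     * application is no longer known to the scheduler.
     */
    boolean preValidateMove(String appId) {
        SchedulerApp app = applications.get(appId);
        if (app == null) {
            // The app may have completed between the client-side state check
            // and this call, so fail with a clear error instead of an NPE.
            throw new IllegalStateException("App to be moved " + appId + " not found.");
        }
        return app.currentAttempt != null;
    }
}
```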
{quote} 
Can we think of moving dest.submitApplication(appId, user, destQueueName); 
below if (null != app) block so that its better we finish handling all attempt 
related stuff and then update the application related modifications?
{quote}
As discussed offline, validation for application submission to the queue is done in 
queue.submitApplication. Only when limits are reached we should update attempt 
level metrics. This is part of the existing flow, so there is no need to change it.
{quote}
ln 2058, i think we can directly get application.getCurrentAppAttempt
{quote}
Done
{quote}
ln 2103, Was wondering the queue partition information needs to be checked even 
if the attempt doesn't exist, thoughts?
{quote}
When the FiCa app is null, we cannot get the schedulingInfo-based partition 
with the current implementation. We can skip this here; probably another JIRA 
can be filed for the same.
{quote}
comment at ln 2067 can be moved to just before 
{{source.finishApplicationAttempt}}
{quote}
Done

> Move application can fail when attempt add event is delayed
> ------------------------------------------------------------
>
>                 Key: YARN-6207
>                 URL: https://issues.apache.org/jira/browse/YARN-6207
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>         Attachments: YARN-6207.001.patch, YARN-6207.002.patch, 
> YARN-6207.003.patch, YARN-6207.004.patch
>
>
> *Steps to reproduce*
> 1. Submit application and delay attempt add to Scheduler
> (Simulate using debug at EventDispatcher for SchedulerEventDispatcher)
> 2. Call move application to destination queue.
> {noformat}
> Caused by: org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): java.lang.NullPointerException
>       at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.preValidateMoveApplication(CapacityScheduler.java:2086)
>       at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.moveApplicationAcrossQueue(RMAppManager.java:669)
>       at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.moveApplicationAcrossQueues(ClientRMService.java:1231)
>       at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.moveApplicationAcrossQueues(ApplicationClientProtocolPBServiceImpl.java:388)
>       at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:537)
>       at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:522)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
>       at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:867)
>       at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:813)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:422)
>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1892)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2659)
>       at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1483)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1429)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1339)
>       at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
>       at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:115)
>       at com.sun.proxy.$Proxy7.moveApplicationAcrossQueues(Unknown Source)
>       at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.moveApplicationAcrossQueues(ApplicationClientProtocolPBClientImpl.java:398)
>       ... 16 more
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
