Sunil G commented on YARN-1408:

Thank you Wangda for the comment. 
bq.FairScheduler supports preemption as well, but it doesn't inherit the 
PreemptableResourceScheduler interface. 
Yes. This will cause a similar problem in FairScheduler as well.

I agree that we have to add back the ResourceRequest because, as you mentioned 
in the example, the user (the AM here) will not get any update at all.
In summary, the problem is containers that have not yet been pulled by the AM 
but have already been killed by the RM (the preemption case). 

bq.If container transferred to ACQUIRED state, stored ResourceRequests will be 
removed directly.
bq.If container killed/preempted before transferred to ACQUIRED state, we 
should recover stored ResourceRequests.

In line with this, if we want to recover the ResourceRequest during RMContainer 
transitions (such as KILL before ACQUIRED), we need a link back to the respective 
Maybe here we can try adding the required object to the event, which can add the 
ResourceRequest back to the scheduler through a modified 
AppSchedulingInfo#updateResourceRequests. Is this what you had in mind as well?
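
To make the two rules above concrete, here is a rough standalone sketch of the 
idea. It is not the actual RMContainerImpl or AppSchedulingInfo code: the 
classes Request, PendingRequests and Container below are hypothetical stand-ins 
I made up for illustration, and the real change would have to go through the 
RMContainer state machine and its events.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; none of these types are real YARN classes.
public class RecoverRequestSketch {

    enum State { ALLOCATED, ACQUIRED, KILLED }

    // Hypothetical stand-in for a ResourceRequest.
    static final class Request {
        final String resourceName;
        final int memoryMb;

        Request(String resourceName, int memoryMb) {
            this.resourceName = resourceName;
            this.memoryMb = memoryMb;
        }
    }

    // Hypothetical stand-in for the scheduler's pending-request bookkeeping.
    static final class PendingRequests {
        private final List<Request> pending = new ArrayList<>();

        void add(Request request) { pending.add(request); }

        int size() { return pending.size(); }
    }

    // Hypothetical container that remembers the request it was allocated for.
    static final class Container {
        private State state = State.ALLOCATED;
        private Request storedRequest;

        Container(Request request) { this.storedRequest = request; }

        // AM pulled the container: the stored request is no longer needed.
        void acquire() {
            if (state != State.ALLOCATED) {
                throw new IllegalStateException("Cannot acquire in state " + state);
            }
            state = State.ACQUIRED;
            storedRequest = null;
        }

        // Kill (e.g. preemption): recover the request only if the AM never
        // acquired the container.
        void kill(PendingRequests scheduler) {
            if (state == State.KILLED) {
                return;
            }
            if (state == State.ALLOCATED && storedRequest != null) {
                scheduler.add(storedRequest);
            }
            storedRequest = null;
            state = State.KILLED;
        }

        State state() { return state; }
    }

    public static void main(String[] args) {
        PendingRequests pending = new PendingRequests();

        // Killed before the AM acquired it: request goes back to the scheduler.
        Container preempted = new Container(new Request("*", 1024));
        preempted.kill(pending);

        // Acquired first, then killed: nothing is recovered.
        Container acquired = new Container(new Request("*", 1024));
        acquired.acquire();
        acquired.kill(pending);

        System.out.println("pending requests after both kills: " + pending.size());
    }
}
```

In this sketch acquire() rejects a container that is already KILLED; how the 
real state machine should treat ACQUIRED arriving at KILLED is a separate 
question from recovering the request.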

> Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task 
> timeout for 30mins
> ----------------------------------------------------------------------------------------------
>                 Key: YARN-1408
>                 URL: https://issues.apache.org/jira/browse/YARN-1408
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.2.0
>            Reporter: Sunil G
>            Assignee: Sunil G
>         Attachments: Yarn-1408.1.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, 
> Yarn-1408.4.patch, Yarn-1408.patch
> Capacity preemption is enabled as follows.
>  *  yarn.resourcemanager.scheduler.monitor.enable=true
>  *  yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
> Queue = a,b
> Capacity of Queue A = 80%
> Capacity of Queue B = 20%
> Step 1: Submit a big jobA to queue a that uses the full cluster capacity
> Step 2: Submit a jobB to queue b that would use less than 20% of cluster 
> capacity
> A jobA task that uses queue b capacity is preempted and killed.
> This caused the problem below:
> 1. A new container was allocated for jobA in queue A on a node update 
> from an NM.
> 2. This container was preempted immediately by the preemption policy.
> Here the ACQUIRED at KILLED invalid state exception occurred when the next AM 
> heartbeat reached the RM.
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> Can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
> This also caused the task to time out after 30 minutes, as this container 
> had already been killed by preemption.
> attempt_1380289782418_0003_m_000000_0 Timed out after 1800 secs
