Sunil G commented on YARN-1408:

bq. If container killed/preempted before transferred to ACQUIRED state, we 
should recover stored ResourceRequests.

The container count is either decremented or, if it reaches 0, the entry is 
removed from the priority-based map.
So once the entry is removed, a complete ResourceRequest has to be added back 
with the corresponding priority.
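
A minimal sketch of that bookkeeping, assuming a plain priority-to-count map. The class *PendingRequests* and its methods are hypothetical stand-ins for illustration, not the actual AppSchedulingInfo code:

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the scheduler's per-priority bookkeeping;
// this is NOT the real AppSchedulingInfo.
class PendingRequests {
    // priority -> number of containers still requested at that priority
    private final Map<Integer, Integer> countByPriority = new TreeMap<>();

    // Record that the application asked for more containers at a priority.
    void request(int priority, int containers) {
        countByPriority.merge(priority, containers, Integer::sum);
    }

    // On allocation: decrement the count, and drop the entry once it hits 0
    // (returning null from computeIfPresent removes the mapping).
    void onAllocated(int priority) {
        countByPriority.computeIfPresent(priority, (p, n) -> n > 1 ? n - 1 : null);
    }

    // On kill before the AM used the container: put one request back.
    // If the entry was already removed, merge() recreates it with count 1.
    void restore(int priority) {
        countByPriority.merge(priority, 1, Integer::sum);
    }

    int pending(int priority) {
        return countByPriority.getOrDefault(priority, 0);
    }

    boolean hasEntry(int priority) {
        return countByPriority.containsKey(priority);
    }
}
```

In the real scheduler the entry holds a full ResourceRequest rather than a bare count, so restoring it also means rebuilding the capability and locality fields, which this sketch leaves out.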

RMContainer has the priority information, which can be used to get the 
ResourceRequest through the *AppSchedulingInfo#getResourceRequests* API. 
CapacityScheduler implements the PreemptableResourceScheduler interface and 
has a *killContainer* implementation. From there the corresponding 
FiCaSchedulerApp object can be found and thus can get the reference of 

My thinking is that this way the ResourceRequest can be recreated and added 
back if the container is preempted in the ACQUIRED/ALLOCATED etc. states. 
*killContainer* is invoked on the preemption event, and the container state 
can also be found there. 
[~leftnoteasy], [~jianhe], [~mayank_bansal], is this approach fine?
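
To make the proposal concrete, here is a rough sketch of the kill-time check. Everything in it is a simplified assumption for discussion: the *State* enum, the *Container* record, and the choice of ALLOCATED and ACQUIRED as the restorable states are hypothetical and do not mirror the real RMContainer or CapacityScheduler classes:

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

class PreemptionRecoverySketch {
    // Simplified subset of container states; the names follow this discussion only.
    enum State { ALLOCATED, ACQUIRED, RUNNING }

    // Assumption: a killed container in one of these states gets its request back.
    static final Set<State> RESTORABLE = EnumSet.of(State.ALLOCATED, State.ACQUIRED);

    // Hypothetical container record carrying the priority it was allocated at.
    static final class Container {
        final int priority;
        final State state;

        Container(int priority, State state) {
            this.priority = priority;
            this.state = state;
        }
    }

    // priority -> number of outstanding requests (stand-in for the real map)
    private final Map<Integer, Integer> pendingByPriority = new HashMap<>();

    // Stand-in for the kill hook: returns true if a request was re-added.
    boolean killContainer(Container c) {
        if (!RESTORABLE.contains(c.state)) {
            return false; // container was already in use; nothing to restore
        }
        // Recreate (or increment) the entry using the container's own priority.
        pendingByPriority.merge(c.priority, 1, Integer::sum);
        return true;
    }

    int pending(int priority) {
        return pendingByPriority.getOrDefault(priority, 0);
    }
}
```

The point of the sketch is only that the container's own priority and state are enough to decide whether to re-add a request at kill time.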

> Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task 
> timeout for 30mins
> ----------------------------------------------------------------------------------------------
>                 Key: YARN-1408
>                 URL: https://issues.apache.org/jira/browse/YARN-1408
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.2.0
>            Reporter: Sunil G
>            Assignee: Sunil G
>         Attachments: Yarn-1408.1.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, 
> Yarn-1408.4.patch, Yarn-1408.patch
> Capacity preemption is enabled as follows.
>  *  yarn.resourcemanager.scheduler.monitor.enable=true
>  *  yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
> Queue = a,b
> Capacity of Queue A = 80%
> Capacity of Queue B = 20%
> Step 1: Submit a big jobA to queue a that uses the full cluster capacity
> Step 2: Submit jobB to queue b, which would use less than 20% of cluster 
> capacity
> A jobA task that was using queue b capacity was preempted and killed.
> This caused the problem below:
> 1. A new container was allocated for jobA in Queue A on a node update 
> from an NM.
> 2. This container was immediately preempted.
> An ACQUIRED at KILLED invalid state exception was then raised when the next 
> AM heartbeat reached the RM.
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> Can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
> This also caused the task to time out after 30 minutes, as this container 
> had already been killed by preemption.
> attempt_1380289782418_0003_m_000000_0 Timed out after 1800 secs
