[ 
https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994801#comment-13994801
 ] 

Sunil G commented on YARN-1408:
-------------------------------

Please check the scenario below.
After allocating a container to an application, the CapacityScheduler (CS) 
decrements the associated resource request. 
Once this container is identified for preemption, the preemption module in the 
RM kills the container regardless of what state it is in.
 
I am assuming that the state of one container is ACQUIRED [waiting for the 
launch event to become RUNNING]. If it is now marked for preemption, the 
container will be preempted.
 
Hence the next heartbeat to the AM has the same container present in both 
newlyAllocatedContainers and completedContainers [allocation and kill 
happened within one AM heartbeat cycle].
An invalid state transition [ACQUIRED at KILLED] will then happen while 
processing newlyAllocatedContainers on the AM side. This causes the task to 
time out after 30 minutes.
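
To illustrate the failure, here is a minimal sketch of a container state 
machine. It is a simplified model and not the actual RMContainerImpl transition 
table: the class name, the reduced set of states and events, and the handle() 
helper are all assumptions made for illustration. It shows that once a KILL 
from preemption has moved the container to KILLED, a later ACQUIRED event has 
no valid transition.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical, simplified model of a container state machine.
// This is NOT the real RMContainerImpl transition table.
class ContainerStateSketch {
    enum State { ALLOCATED, ACQUIRED, RUNNING, KILLED }
    enum Event { ACQUIRED, LAUNCHED, KILL }

    // Assumed transition table: state -> (event -> next state).
    static final Map<State, Map<Event, State>> TRANSITIONS = new EnumMap<>(State.class);
    static {
        Map<Event, State> allocated = new EnumMap<>(Event.class);
        allocated.put(Event.ACQUIRED, State.ACQUIRED);
        allocated.put(Event.KILL, State.KILLED);
        TRANSITIONS.put(State.ALLOCATED, allocated);

        Map<Event, State> acquired = new EnumMap<>(Event.class);
        acquired.put(Event.LAUNCHED, State.RUNNING);
        acquired.put(Event.KILL, State.KILLED);
        TRANSITIONS.put(State.ACQUIRED, acquired);

        Map<Event, State> running = new EnumMap<>(Event.class);
        running.put(Event.KILL, State.KILLED);
        TRANSITIONS.put(State.RUNNING, running);
        // KILLED is terminal in this sketch: it has no outgoing transitions.
    }

    // Returns the next state, or throws if the event is not valid in the current state.
    static State handle(State current, Event event) {
        Map<Event, State> row = TRANSITIONS.get(current);
        State next = (row == null) ? null : row.get(event);
        if (next == null) {
            throw new IllegalStateException("Invalid event: " + event + " at " + current);
        }
        return next;
    }

    public static void main(String[] args) {
        // Preemption kills the container before the AM heartbeat picks it up.
        State state = handle(State.ALLOCATED, Event.KILL);
        try {
            // The next AM heartbeat then delivers ACQUIRED for the same container.
            handle(state, Event.ACQUIRED);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // prints "Invalid event: ACQUIRED at KILLED"
        }
    }
}
```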
 
If we remove the container from newlyAllocatedContainers, we can avoid the 
invalid state transition. But this will cause the task to hang [the RM has lost 
the resource request].
As explained initially, the RM has allocated a container and the AM is waiting 
to receive that container to assign to a task.
Because of preemption, this never happens, so the task hangs.

I feel we should preempt only those containers which are in the RUNNING state. 
[~devaraj.k] and [~curino], please share your thoughts.
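
A rough sketch of that idea, assuming a hypothetical Candidate type and a 
preemptable() helper (neither exists in YARN; the real preemption policy works 
on RMContainer objects), could look like this. It only keeps containers whose 
state is RUNNING, so containers that are still ALLOCATED or ACQUIRED are never 
selected for preemption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: restrict preemption candidates to RUNNING containers.
class PreemptionFilterSketch {
    enum State { ALLOCATED, ACQUIRED, RUNNING }

    // Illustrative stand-in for a container known to the RM.
    static final class Candidate {
        final String id;
        final State state;

        Candidate(String id, State state) {
            this.id = id;
            this.state = state;
        }
    }

    // Returns only the candidates that are already RUNNING.
    static List<Candidate> preemptable(List<Candidate> candidates) {
        List<Candidate> result = new ArrayList<>();
        for (Candidate c : candidates) {
            if (c.state == State.RUNNING) {
                result.add(c);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Candidate> all = new ArrayList<>();
        all.add(new Candidate("container_1", State.ACQUIRED));
        all.add(new Candidate("container_2", State.RUNNING));
        for (Candidate c : preemptable(all)) {
            System.out.println(c.id); // prints "container_2"
        }
    }
}
```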

> Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task 
> timeout for 30mins
> ----------------------------------------------------------------------------------------------
>
>                 Key: YARN-1408
>                 URL: https://issues.apache.org/jira/browse/YARN-1408
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.2.0
>            Reporter: Sunil G
>             Fix For: 2.5.0
>
>         Attachments: Yarn-1408.1.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, 
> Yarn-1408.4.patch, Yarn-1408.patch
>
>
> Capacity preemption is enabled as follows.
>  *  yarn.resourcemanager.scheduler.monitor.enable=true
>  *  yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
> Queue = a,b
> Capacity of Queue A = 80%
> Capacity of Queue B = 20%
> Step 1: Assign a big jobA on queue a which uses full cluster capacity
> Step 2: Submitted a jobB to queue b  which would use less than 20% of cluster 
> capacity
> A jobA task which uses queue b capacity is preempted and killed.
> This caused below problem:
> 1. New Container has got allocated for jobA in Queue A as per node update 
> from an NM.
> 2. This container has been preempted immediately as per preemption.
> Here ACQUIRED at KILLED Invalid State exception came when the next AM 
> heartbeat reached RM.
> ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> Can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
> ACQUIRED at KILLED
> This also caused the task to time out after 30 minutes, as this container 
> was already killed by preemption.
> attempt_1380289782418_0003_m_000000_0 Timed out after 1800 secs



--
This message was sent by Atlassian JIRA
(v6.2#6252)
