[ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13859336#comment-13859336 ]
Sunil G commented on YARN-1408:
-------------------------------
Hi Devaraj
As per your comments, I have made the changes.
1. Need to handle the invalid transition, and during the transition the
container needs to be removed from ContainerAllocationExpirer to avoid the
timeout.
[Sunil]: Removing the extra preempted container from
newlyAllocatedContainers also takes care of the invalid transition.
When the next AM heartbeat arrives, the container is no longer in
newlyAllocatedContainers, so no ACQUIRED event is fired for it.
2. The patch iterates over newlyAllocatedContainers, checks each entry, and
then removes the match. The container can be removed directly with
java.util.List.remove(Object o) instead.
[Sunil]: Yes, I changed it to remove the container directly from the list.
3. Can you also add a test to demonstrate this case?
[Sunil]: The change only removes an element from
newlyAllocatedContainers; no new functions were added. For now the removal
has been verified by manual testing.
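To make the three points above concrete, here is a minimal, self-contained Java sketch of the idea. It is not the patch and does not use the Hadoop classes: PreemptionSketch, Container, AppAttempt, preempt() and pullNewlyAllocatedContainers() are hypothetical stand-ins, and the state check in acquire() only imitates the "Invalid event: ACQUIRED at KILLED" error. It assumes that the AM heartbeat fires ACQUIRED on every container still listed in newlyAllocatedContainers.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model only; none of these types are the real YARN classes.
final class PreemptionSketch {
    enum State { ALLOCATED, ACQUIRED, KILLED }

    // Hypothetical stand-in for an RM-side container with a tiny state machine.
    static final class Container {
        final String id;
        State state = State.ALLOCATED;

        Container(String id) { this.id = id; }

        void kill() { state = State.KILLED; }

        // ACQUIRED is only valid from ALLOCATED in this model.
        void acquire() {
            if (state != State.ALLOCATED) {
                throw new IllegalStateException("Invalid event: ACQUIRED at " + state);
            }
            state = State.ACQUIRED;
        }
    }

    // Hypothetical stand-in for the scheduler's per-application bookkeeping.
    static final class AppAttempt {
        final List<Container> newlyAllocatedContainers = new ArrayList<>();

        Container allocate(String id) {
            Container c = new Container(id);
            newlyAllocatedContainers.add(c);
            return c;
        }

        // Preemption kills the container. With the flag set (the approach
        // discussed above), it is also dropped from the pending list.
        void preempt(Container c, boolean removeFromNewlyAllocated) {
            c.kill();
            if (removeFromNewlyAllocated) {
                // List.remove(Object): no manual iterate-check-remove loop.
                // Container does not override equals(), so this matches by identity.
                newlyAllocatedContainers.remove(c);
            }
        }

        // AM heartbeat: hand over every pending container and fire ACQUIRED on it.
        List<Container> pullNewlyAllocatedContainers() {
            List<Container> pulled = new ArrayList<>(newlyAllocatedContainers);
            newlyAllocatedContainers.clear();
            for (Container c : pulled) {
                c.acquire();
            }
            return pulled;
        }
    }

    public static void main(String[] args) {
        AppAttempt app = new AppAttempt();
        Container preempted = app.allocate("container_1");
        app.allocate("container_2");
        app.preempt(preempted, true);
        // Only container_2 is still pending, so this prints 1.
        System.out.println(app.pullNewlyAllocatedContainers().size());
    }
}
```

Calling preempt(c, false) instead and then pulling reproduces the failure in this model: the killed container is still in the list, so acquire() throws. A unit test for point 3 could follow the same shape (allocate, preempt, pull, then assert that the preempted container was not handed out).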
> Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task
> timeout for 30mins
> ----------------------------------------------------------------------------------------------
>
> Key: YARN-1408
> URL: https://issues.apache.org/jira/browse/YARN-1408
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.2.0
> Reporter: Sunil G
> Fix For: 2.2.0
>
> Attachments: Yarn-1408.1.patch, Yarn-1408.2.patch, Yarn-1408.patch
>
>
> Capacity preemption is enabled as follows.
> * yarn.resourcemanager.scheduler.monitor.enable=true
> * yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
> Queue = a,b
> Capacity of Queue A = 80%
> Capacity of Queue B = 20%
> Step 1: Submit a big jobA to queue a, which uses the full cluster capacity
> Step 2: Submit a jobB to queue b, which would use less than 20% of cluster
> capacity
> A jobA task that is using queue b's capacity is preempted and killed.
> This caused the problem below:
> 1. A new container was allocated for jobA in queue a on a node update
> from an NM.
> 2. This container was preempted immediately.
> The "ACQUIRED at KILLED" invalid state exception occurred when the next AM
> heartbeat reached the RM.
> ERROR
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
> Can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event:
> ACQUIRED at KILLED
> This also caused the task to time out after 30 minutes, as this container
> was already killed by preemption.
> attempt_1380289782418_0003_m_000000_0 Timed out after 1800 secs
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)