Wangda Tan commented on YARN-1408:

bq. is it possible that schedulerAttempt here is null? e.g. preemption happens 
after the attempt completed.
After thinking about this, we can discuss the following cases:
1) New attempt started after old attempt completed
The scheduler (fair/capacity/fifo) will first mark the old attempt as stopped, but will 
not remove it from the existing SchedulerApplication. And recoverResourceRequests 
will only take effect when the attempt is not stopped:
+    if (!isStopped) {
+      appSchedulingInfo.updateResourceRequests(requests, true);

Then it will add a new attempt via the setCurrentAttempt method, so getCurrentAttempt 
will not become null.
And because the old attempt has completed, we shouldn't recover its resource 
requests anyway, so the no-op is not a problem here either.
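To make case 1 concrete, here is a minimal sketch of the guard described above. The class and field names below are hypothetical, simplified stand-ins for the real SchedulerApplicationAttempt/AppSchedulingInfo; only the `isStopped` guard itself mirrors the patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of case 1: once an attempt has been
// stopped by the scheduler, recovering resource requests into it is a
// deliberate no-op.
class AttemptSketch {
    private boolean isStopped = false;
    final List<String> pendingRequests = new ArrayList<>();

    void stop() {
        isStopped = true;
    }

    // Mirrors the guard from the patch: only a live (not stopped)
    // attempt accepts recovered requests.
    void recoverResourceRequests(List<String> requests) {
        if (!isStopped) {
            pendingRequests.addAll(requests);
        }
    }
}
```

So even if recovery races with a completed old attempt, the stopped attempt silently ignores the requests, and the new attempt is reached through setCurrentAttempt as described above.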

2) Application completed after old attempt completed
It is possible that the SchedulerApplication becomes null here because the old 
application has already completed and been removed from the scheduler. So adding a 
null check around
+    schedulerAttempt.recoverResourceRequests(requests);
Should be enough.
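Sketched out, the guarded recovery path for case 2 might look like the following. This is a hypothetical, self-contained model, not the actual patch: the map, stub class, and helper method are stand-ins for the scheduler's real bookkeeping.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of case 2: the application may already
// have been removed from the scheduler, so the recovery path must
// null-check the attempt before recovering resource requests on it.
class RecoverySketch {
    // applicationId -> current attempt; absent once the app is removed.
    static final Map<String, AttemptStub> apps = new HashMap<>();

    static class AttemptStub {
        int recovered = 0;

        void recoverResourceRequests(List<String> requests) {
            recovered += requests.size();
        }
    }

    // Guarded recovery: a safe no-op when the application is already gone.
    static boolean recover(String appId, List<String> requests) {
        AttemptStub schedulerAttempt = apps.get(appId);
        if (schedulerAttempt == null) {
            return false; // app completed and removed; nothing to recover
        }
        schedulerAttempt.recoverResourceRequests(requests);
        return true;
    }
}
```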

Does this make sense to you, [~jianhe]/[~sunilg]?

> Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task 
> timeout for 30mins
> ----------------------------------------------------------------------------------------------
>                 Key: YARN-1408
>                 URL: https://issues.apache.org/jira/browse/YARN-1408
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.2.0
>            Reporter: Sunil G
>            Assignee: Sunil G
>         Attachments: Yarn-1408.1.patch, Yarn-1408.10.patch, 
> Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch, Yarn-1408.5.patch, 
> Yarn-1408.6.patch, Yarn-1408.7.patch, Yarn-1408.8.patch, Yarn-1408.9.patch, 
> Yarn-1408.patch
> Capacity preemption is enabled as follows.
>  *  yarn.resourcemanager.scheduler.monitor.enable= true ,
>  *  
> yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
> Queues = a, b
> Capacity of Queue A = 80%
> Capacity of Queue B = 20%
> Step 1: Assign a big jobA on queue a which uses full cluster capacity
> Step 2: Submit a jobB to queue b which would use less than 20% of cluster 
> capacity.
> A jobA task which uses queue b's capacity is then preempted and killed.
> This caused below problem:
> 1. A new container got allocated for jobA in Queue A as per a node update 
> from an NM.
> 2. This container was immediately preempted.
> Here the ACQUIRED at KILLED invalid state exception came when the next AM 
> heartbeat reached the RM.
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> Can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
> This also caused the task to time out after 30 minutes, as this container 
> was already killed by preemption.
> attempt_1380289782418_0003_m_000000_0 Timed out after 1800 secs
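For reference, the preemption setup described in the report maps to configuration along these lines: the monitor settings go in yarn-site.xml (values taken from the report), and the 80/20 queue split in capacity-scheduler.xml (the `root.a`/`root.b` queue paths are assumed, not stated in the report).

```xml
<!-- yarn-site.xml: enable the preemption monitor (values from the report) -->
<property>
  <name>yarn.resourcemanager.scheduler.monitor.enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.monitor.policies</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy</value>
</property>

<!-- capacity-scheduler.xml: two queues with the 80/20 split
     (queue paths under root are assumed) -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>a,b</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.a.capacity</name>
  <value>80</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.b.capacity</name>
  <value>20</value>
</property>
```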

This message was sent by Atlassian JIRA
