[jira] [Commented] (YARN-6714) RM crashed with IllegalStateException while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler

Wangda Tan (JIRA) Fri, 16 Jun 2017 10:23:21 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16052158#comment-16052158
 ]


Wangda Tan commented on YARN-6714:
----------------------------------

Thanks again, [~Tao Yang], for investigating and working on the patch. Could you 
move the test case from TestCapacityScheduler to 
TestCapacitySchedulerAsyncScheduling? (The same comment applies to YARN-6678.) 

The root cause of the issue is that the behavior of 
{{AbstractYarnScheduler#getApplicationAttempt}} is inconsistent with its name: it 
discards the application attempt ID and always returns the latest attempt. We 
should: 1) rename it to getCurrentAttempt, 2) change its parameter from attemptId 
to applicationId, and 3) scan all usages to see whether any similar issue 
could happen.
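
To illustrate the mismatch, here is a minimal, self-contained sketch. It is not the real YARN code: AttemptLookupSketch, AttemptId, registerAttempt and isCurrentAttempt are hypothetical names invented for illustration, and the map-based storage is an assumption. It only shows how a lookup that ignores the attempt number hands back the latest attempt, and how a lookup keyed by applicationId makes that explicit.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-ins; these are NOT the real YARN classes.
final class AttemptLookupSketch {

    // Stand-in for an application attempt ID: application ID plus attempt number.
    static final class AttemptId {
        final String applicationId;
        final int attemptNumber;

        AttemptId(String applicationId, int attemptNumber) {
            this.applicationId = applicationId;
            this.attemptNumber = attemptNumber;
        }
    }

    // Assumption: only the latest attempt is tracked per application.
    private final Map<String, AttemptId> currentAttempts = new HashMap<>();

    void registerAttempt(AttemptId id) {
        currentAttempts.put(id.applicationId, id);
    }

    // Mirrors the behavior described above: the attempt number of the
    // requested ID is ignored and the latest attempt is returned.
    AttemptId getApplicationAttempt(AttemptId requested) {
        return currentAttempts.get(requested.applicationId);
    }

    // Proposed shape: the name and parameter say what is actually returned.
    AttemptId getCurrentAttempt(String applicationId) {
        return currentAttempts.get(applicationId);
    }

    // A caller that needs one specific attempt has to compare explicitly.
    boolean isCurrentAttempt(AttemptId requested) {
        AttemptId current = getCurrentAttempt(requested.applicationId);
        return current != null && current.attemptNumber == requested.attemptNumber;
    }
}
```

In this sketch, asking for attempt 1 after attempt 2 has been registered silently yields attempt 2, which is the kind of surprise the rename is meant to remove.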

[~Tao Yang], could you file a separate JIRA for that? (You are welcome to 
work on it as well :)). 

+ [~sunilg]. 

> RM crashed with IllegalStateException while handling APP_ATTEMPT_REMOVED 
> event when async-scheduling enabled in CapacityScheduler
> ---------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-6714
>                 URL: https://issues.apache.org/jira/browse/YARN-6714
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.9.0, 3.0.0-alpha3
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>         Attachments: YARN-6714.001.patch
>
>
> Currently, in the async-scheduling mode of CapacityScheduler, after AM failover 
> has unreserved all reserved containers, the scheduler can still get and commit 
> an outdated reserve proposal from the failed app attempt. This problem 
> happened to an app in our cluster: when the app stopped, it unreserved all 
> reserved containers and compared their appAttemptIds with the current 
> appAttemptId; on a mismatch it threw IllegalStateException and crashed the 
> RM.
> Error log:
> {noformat}
> 2017-06-08 11:02:24,339 FATAL [ResourceManager Event Processor] 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type APP_ATTEMPT_REMOVED to the scheduler
> java.lang.IllegalStateException: Trying to unreserve  for application 
> appattempt_1495188831758_0121_000002 when currently reserved  for application 
> application_1495188831758_0121 on node host: node1:45454 #containers=2 
> available=... used=...
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.unreserveResource(FiCaSchedulerNode.java:123)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:845)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1787)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1957)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:586)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:966)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1740)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:152)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:822)
>         at java.lang.Thread.run(Thread.java:834)
> {noformat}
> When async-scheduling is enabled, CapacityScheduler#doneApplicationAttempt and 
> CapacityScheduler#tryCommit both need to acquire the write lock before executing, so 
> we can check the app attempt state in the commit process to avoid committing 
> outdated proposals.
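
The commit-time check described in the quoted text can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the attached patch: CommitCheckSketch, Attempt, the stopped flag and the reservedContainers counter are hypothetical, and the method names only echo doneApplicationAttempt and tryCommit. The point is that both paths take the same write lock, so a proposal from a stopped attempt can be rejected before it re-creates a reservation.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical, simplified model; these are NOT the real CapacityScheduler classes.
final class CommitCheckSketch {

    // Stand-in for an app attempt; fields are guarded by the scheduler write lock.
    static final class Attempt {
        boolean stopped;
        int reservedContainers;
    }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // Echoes doneApplicationAttempt: release reservations and mark the attempt stopped.
    void doneApplicationAttempt(Attempt attempt) {
        lock.writeLock().lock();
        try {
            attempt.reservedContainers = 0;
            attempt.stopped = true;
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Echoes tryCommit for a reserve proposal: because it holds the same write
    // lock, it sees the stopped flag and can drop an outdated proposal.
    boolean tryCommit(Attempt proposalOwner) {
        lock.writeLock().lock();
        try {
            if (proposalOwner.stopped) {
                return false; // outdated proposal, do not reserve
            }
            proposalOwner.reservedContainers++;
            return true;
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

Without the stopped check, a proposal computed before the attempt was removed could still add a reservation afterwards, leaving a reservation that belongs to an attempt that no longer exists.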


