[ https://issues.apache.org/jira/browse/YARN-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16055413#comment-16055413 ]
Tao Yang commented on YARN-6714:
--------------------------------
Thanks [~leftnoteasy] for reviewing the patch.
{quote}
Could you move test case from TestCapacityScheduler to
TestCapacitySchedulerAsyncScheduling (same comment to YARN-6678 as well).
{quote}
Sure, I will update the patch later for this and YARN-6678.
{quote}
Could you file a separate JIRA for that? (And you are welcome to work on it.)
{quote}
I'm glad to work on this :D
> RM crashed with IllegalStateException while handling APP_ATTEMPT_REMOVED
> event when async-scheduling enabled in CapacityScheduler
> ---------------------------------------------------------------------------------------------------------------------------------
>
> Key: YARN-6714
> URL: https://issues.apache.org/jira/browse/YARN-6714
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.9.0, 3.0.0-alpha3
> Reporter: Tao Yang
> Assignee: Tao Yang
> Attachments: YARN-6714.001.patch
>
>
> Currently, in the async-scheduling mode of CapacityScheduler, after an AM
> failover unreserves all reserved containers, the scheduler can still fetch and
> commit an outdated reserve proposal from the failed app attempt. This problem
> occurred for an app in our cluster: when the app stopped, it unreserved all
> reserved containers and compared their appAttemptId with the current
> appAttemptId; on a mismatch it threw an IllegalStateException and crashed
> the RM.
> Error log:
> {noformat}
> 2017-06-08 11:02:24,339 FATAL [ResourceManager Event Processor] org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_REMOVED to the scheduler
> java.lang.IllegalStateException: Trying to unreserve for application appattempt_1495188831758_0121_000002 when currently reserved for application application_1495188831758_0121 on node host: node1:45454 #containers=2 available=... used=...
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.unreserveResource(FiCaSchedulerNode.java:123)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:845)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1787)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1957)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:586)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:966)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1740)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:152)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:822)
>     at java.lang.Thread.run(Thread.java:834)
> {noformat}
> When async-scheduling is enabled, CapacityScheduler#doneApplicationAttempt and
> CapacityScheduler#tryCommit both need to acquire the write lock before
> executing, so we can check the app attempt state during the commit process to
> avoid committing outdated proposals.
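
The idea in the description can be illustrated with a minimal, self-contained sketch. It is not the code from YARN-6714.001.patch: the class StaleProposalSketch, its nested Attempt and Proposal types, and the node-to-attempt map are hypothetical stand-ins, and only the method names tryCommit and doneApplicationAttempt are borrowed from the description. It assumes both operations serialize on one write lock, so a state check inside the commit path is enough to reject a proposal made for an attempt that has already stopped.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical model of the scheduler; not the real CapacityScheduler API. */
class StaleProposalSketch {

    /** Minimal stand-in for an application attempt. */
    static final class Attempt {
        final String id;
        private boolean stopped; // only touched while holding the write lock

        Attempt(String id) {
            this.id = id;
        }
    }

    /** A reservation proposal produced earlier by an async scheduling thread. */
    static final class Proposal {
        final Attempt attempt;
        final String node;

        Proposal(Attempt attempt, String node) {
            this.attempt = attempt;
            this.node = node;
        }
    }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    // node name -> id of the attempt currently holding a reservation there
    private final Map<String, String> reservedByNode = new HashMap<>();

    /** Commits a proposal unless its attempt has already been stopped. */
    boolean tryCommit(Proposal proposal) {
        lock.writeLock().lock();
        try {
            if (proposal.attempt.stopped) {
                return false; // outdated proposal: drop it instead of reserving
            }
            reservedByNode.put(proposal.node, proposal.attempt.id);
            return true;
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Marks the attempt as stopped and releases all of its reservations. */
    void doneApplicationAttempt(Attempt attempt) {
        lock.writeLock().lock();
        try {
            attempt.stopped = true;
            reservedByNode.values().removeIf(attempt.id::equals);
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Reports whether any attempt currently holds a reservation on the node. */
    boolean hasReservation(String node) {
        lock.readLock().lock();
        try {
            return reservedByNode.containsKey(node);
        } finally {
            lock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        StaleProposalSketch scheduler = new StaleProposalSketch();
        Attempt attempt = new Attempt("appattempt_example_000001");
        // The proposal is created while the attempt is still running ...
        Proposal stale = new Proposal(attempt, "node1");
        // ... but the attempt is removed before the proposal is committed.
        scheduler.doneApplicationAttempt(attempt);
        System.out.println("committed=" + scheduler.tryCommit(stale));
    }
}
```

Because the stopped flag is set and read under the same write lock, no reservation for a removed attempt can be left behind on a node in this model, which is the condition that led to the IllegalStateException in the log above.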