[
https://issues.apache.org/jira/browse/YARN-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16052158#comment-16052158
]
Wangda Tan commented on YARN-6714:
----------------------------------
Thanks again, [~Tao Yang], for investigating and working on the patch. Could you
move the test case from TestCapacityScheduler to
TestCapacitySchedulerAsyncScheduling? (The same comment applies to YARN-6678 as well.)
The root cause of the issue is that the behavior of
{{AbstractYarnScheduler#getApplicationAttempt}} is inconsistent with its name: it
discards the attempt ID it is given and always returns the latest attempt. We
should: 1) rename it to getCurrentAttempt, 2) change the parameter from attemptId
to applicationId, and 3) scan all usages to see whether any similar issue
could happen.
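The mismatch between the method name and its behavior can be sketched as follows. This is a minimal illustration, not the real YARN code: AttemptLookupSketch, AttemptId, Attempt, and startAttempt are hypothetical stand-ins, and only the method names getApplicationAttempt and getCurrentAttempt come from the comment above.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical stand-ins; these are not the real YARN classes.
class AttemptLookupSketch {

    static final class AttemptId {
        final String applicationId;
        final int attemptNumber;

        AttemptId(String applicationId, int attemptNumber) {
            this.applicationId = applicationId;
            this.attemptNumber = attemptNumber;
        }
    }

    static final class Attempt {
        final AttemptId id;

        Attempt(AttemptId id) {
            this.id = id;
        }
    }

    // Assumption: only the latest attempt is tracked per application.
    private final Map<String, Attempt> currentAttempts = new HashMap<>();

    void startAttempt(AttemptId id) {
        currentAttempts.put(id.applicationId, new Attempt(id));
    }

    // Behavior described above: the attempt number is ignored, so the
    // caller may receive a different attempt than the one it asked for.
    Attempt getApplicationAttempt(AttemptId attemptId) {
        return currentAttempts.get(attemptId.applicationId);
    }

    // Proposed shape: the name and parameter say what is actually returned.
    Attempt getCurrentAttempt(String applicationId) {
        return currentAttempts.get(applicationId);
    }

    public static void main(String[] args) {
        String appId = "application_1495188831758_0121";
        AttemptLookupSketch scheduler = new AttemptLookupSketch();
        scheduler.startAttempt(new AttemptId(appId, 1));
        scheduler.startAttempt(new AttemptId(appId, 2));

        Attempt found = scheduler.getApplicationAttempt(new AttemptId(appId, 1));
        System.out.println("asked for attempt 1, got attempt " + found.id.attemptNumber);
    }
}
```

In this sketch a caller that asks for attempt 1 silently receives attempt 2, which is the kind of confusion the rename and the usage scan are meant to rule out.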
[~Tao Yang], could you file a separate JIRA for that change? (And you are welcome
to work on it :)).
+ [~sunilg].
> RM crashed with IllegalStateException while handling APP_ATTEMPT_REMOVED
> event when async-scheduling enabled in CapacityScheduler
> ---------------------------------------------------------------------------------------------------------------------------------
>
> Key: YARN-6714
> URL: https://issues.apache.org/jira/browse/YARN-6714
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.9.0, 3.0.0-alpha3
> Reporter: Tao Yang
> Assignee: Tao Yang
> Attachments: YARN-6714.001.patch
>
>
> Currently, in the async-scheduling mode of CapacityScheduler, after an AM failover
> has unreserved all reserved containers, the scheduler can still get and commit
> an outdated reserve proposal from the failed app attempt. This problem
> happened to an app in our cluster: when the app stopped, it unreserved all
> reserved containers and compared their appAttemptId with the current
> appAttemptId; on a mismatch it threw IllegalStateException and crashed the
> RM.
> Error log:
> {noformat}
> 2017-06-08 11:02:24,339 FATAL [ResourceManager Event Processor]
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in
> handling event type APP_ATTEMPT_REMOVED to the scheduler
> java.lang.IllegalStateException: Trying to unreserve for application
> appattempt_1495188831758_0121_000002 when currently reserved for application
> application_1495188831758_0121 on node host: node1:45454 #containers=2
> available=... used=...
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.unreserveResource(FiCaSchedulerNode.java:123)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:845)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1787)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1957)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:586)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:966)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1740)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:152)
> at
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:822)
> at java.lang.Thread.run(Thread.java:834)
> {noformat}
> When async-scheduling is enabled, CapacityScheduler#doneApplicationAttempt and
> CapacityScheduler#tryCommit both need to acquire the write lock before executing, so
> we can check the app attempt state in the commit process to avoid committing
> outdated proposals.
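The idea in the description's last paragraph, checking the attempt state while holding the same write lock that attempt removal takes, can be sketched roughly as below. This is only an illustration under assumptions: CommitCheckSketch, Attempt, Proposal, startAttempt, and committedCount are hypothetical names, and the actual patch may perform the check differently.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch; not the CapacityScheduler implementation.
class CommitCheckSketch {

    static final class Attempt {
        final int attemptNumber;
        boolean stopped; // guarded by the scheduler write lock

        Attempt(int attemptNumber) {
            this.attemptNumber = attemptNumber;
        }
    }

    // A proposal remembers the attempt it was computed for.
    static final class Proposal {
        final Attempt attempt;

        Proposal(Attempt attempt) {
            this.attempt = attempt;
        }
    }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private Attempt currentAttempt;
    private int committed;

    Attempt startAttempt(int attemptNumber) {
        lock.writeLock().lock();
        try {
            currentAttempt = new Attempt(attemptNumber);
            return currentAttempt;
        } finally {
            lock.writeLock().unlock();
        }
    }

    void doneApplicationAttempt() {
        lock.writeLock().lock();
        try {
            if (currentAttempt != null) {
                currentAttempt.stopped = true;
            }
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Rejects proposals from an attempt that is stopped or no longer current.
    boolean tryCommit(Proposal proposal) {
        lock.writeLock().lock();
        try {
            if (proposal.attempt != currentAttempt || proposal.attempt.stopped) {
                return false; // outdated proposal, skip the commit
            }
            committed++;
            return true;
        } finally {
            lock.writeLock().unlock();
        }
    }

    int committedCount() {
        lock.readLock().lock();
        try {
            return committed;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

Because both methods serialize on the write lock, a proposal computed before the attempt stopped is either committed before removal starts or rejected afterwards; it cannot be committed after removal has completed.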