[
https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576970#comment-14576970
]
Peng Zhang commented on YARN-3535:
----------------------------------
Sorry for the late reply.
Thanks for your comments.
bq. 1. I think the method recoverResourceRequestForContainer should be
synchronized, any thought?
I noticed that it was not synchronized originally. I checked this method and found
that only "applications" needs to be protected (it is accessed by calling
"getCurrentAttemptForContainer()"). "applications" is instantiated as a
ConcurrentHashMap in the derived schedulers, so I think there is no need to add
synchronized.
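To illustrate the reasoning above, here is a minimal sketch, not the actual scheduler code. The class name, the String key and value types, and the method names are hypothetical stand-ins; only the use of ConcurrentHashMap comes from the comment. It shows that a single lookup on such a map is safe from an unsynchronized method.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class AppLookupSketch {
    // Hypothetical stand-in for the scheduler's "applications" map.
    // The real map holds scheduler application objects, not Strings.
    private final ConcurrentMap<String, String> applications =
        new ConcurrentHashMap<>();

    // Registers the current attempt for an application (illustrative only).
    public void addApplication(String appId, String currentAttempt) {
        applications.put(appId, currentAttempt);
    }

    // Deliberately not synchronized: a single get() on a ConcurrentHashMap
    // is thread-safe on its own. Returns null if the app is unknown.
    public String getCurrentAttempt(String appId) {
        return applications.get(appId);
    }
}
```

This only covers a single read. If the method also performed a check-then-act sequence over several shared fields, that would still need its own coordination.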
The other three comments are all related to the tests.
# I changed TestAMRestart.java because the case
testAMRestartWithExistingContainers triggers this logic. After this patch,
one more container may be scheduled, so
attempt.getJustFinishedContainers().size() may become bigger than expectedNum and
the loop never ends. So I simply changed the test scenario to avoid this.
# I agree that this issue exists in all schedulers and should be tested
generally, but I have not found a good way to reproduce it. I'll give it a try with
ParameterizedSchedulerTestBase.
# I changed RMContextImpl.java to expose the schedulerDispatcher so that it can be
started in the test TestFairScheduler; otherwise the event handler cannot be
triggered. I'll check whether this can also be solved based on
ParameterizedSchedulerTestBase.
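The never-ending loop mentioned in the first item can be illustrated with a small sketch. It assumes the test waits for the finished-container count to equal expectedNum exactly; the real loop in TestAMRestart is not shown in this comment. The class, the method names, and the maxPolls bound are hypothetical and exist only so that the example terminates. This shows the failure mode, not the change made in the patch.

```java
import java.util.function.IntSupplier;

public class WaitLoopSketch {
    // Waits for an exact match. If the count has already overshot
    // expectedNum, this never succeeds; maxPolls (an assumption of this
    // sketch) bounds what would otherwise be an endless loop.
    static boolean waitForExact(IntSupplier finishedCount,
                                int expectedNum, int maxPolls) {
        for (int i = 0; i < maxPolls; i++) {
            if (finishedCount.getAsInt() == expectedNum) {
                return true;
            }
        }
        return false;
    }

    // Waits until the count reaches or exceeds expectedNum, so an extra
    // scheduled container does not prevent the wait from finishing.
    static boolean waitForAtLeast(IntSupplier finishedCount,
                                  int expectedNum, int maxPolls) {
        for (int i = 0; i < maxPolls; i++) {
            if (finishedCount.getAsInt() >= expectedNum) {
                return true;
            }
        }
        return false;
    }
}
```

With a count of 3 and an expectedNum of 2, the exact-match wait gives up after maxPolls while the at-least wait returns immediately.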
> ResourceRequest should be restored back to scheduler when RMContainer is
> killed at ALLOCATED
> ---------------------------------------------------------------------------------------------
>
> Key: YARN-3535
> URL: https://issues.apache.org/jira/browse/YARN-3535
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.6.0
> Reporter: Peng Zhang
> Assignee: Peng Zhang
> Priority: Critical
> Labels: BB2015-05-TBR
> Attachments: YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz,
> yarn-app.log
>
>
> During a rolling update of the NM, the AM failed to start a container on the NM,
> and then the job hung.
> The AM logs are attached.