[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626207#comment-14626207 ]
Peng Zhang commented on YARN-3535:
----------------------------------
[~rohithsharma]
Thanks for rebasing and adding tests.
According to my notes, removing {{recoverResourceRequestForContainer}} caused the test {{CapacityScheduler#testRecoverRequestAfterPreemption}} to fail.
However, I cannot remember my original reasoning behind this note:
bq. Remove the call to recoverResourceRequestForContainer from preemption to avoid recovering the RR twice.
I have applied my patch {{YARN-3535-002.patch}} to our production cluster, and preemption works well with FairScheduler.
I have seen the {{TestAMRestart.testAMRestartWithExistingContainers}} failure before, and I think it is caused by the following:
bq. TestAMRestart.java is changed because the case testAMRestartWithExistingContainers triggers this logic. After this patch, one more container may be scheduled, so attempt.getJustFinishedContainers().size() may become bigger than expectedNum and the loop never ends. So I simply changed the condition.
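To illustrate the hang described in that note, here is a minimal, hypothetical sketch; it is NOT the actual TestAMRestart code. It assumes the test polls a finished-container count and waits with an equality check against expectedNum; the class WaitLoopSketch and the method pollUntil are invented for illustration. If one extra container finishes in the same poll, the count skips past expectedNum, so an "!=" wait never exits, while a "<" wait stops.

```java
import java.util.List;

// Hypothetical sketch, NOT the real TestAMRestart code: models a test that
// polls a finished-container count until it reaches expectedNum.
public class WaitLoopSketch {

    // arrivalsPerTick.get(i) = containers reported as just finished on poll i.
    // useEquality selects "!= expectedNum" (fragile) vs "< expectedNum" (robust).
    // maxTicks bounds the loop so the sketch returns -1 instead of hanging.
    static int pollUntil(List<Integer> arrivalsPerTick, int expectedNum,
                         boolean useEquality, int maxTicks) {
        int finished = 0;
        int ticks = 0;
        while (useEquality ? finished != expectedNum : finished < expectedNum) {
            if (ticks >= maxTicks) {
                return -1; // an unbounded test loop would spin here forever
            }
            if (ticks < arrivalsPerTick.size()) {
                finished += arrivalsPerTick.get(ticks);
            }
            ticks++;
        }
        return finished;
    }

    public static void main(String[] args) {
        // Expect 2 finished containers, but an extra one finishes on the second poll.
        List<Integer> arrivals = List.of(1, 2);
        System.out.println("equality check: " + pollUntil(arrivals, 2, true, 100));
        System.out.println("less-than check: " + pollUntil(arrivals, 2, false, 100));
    }
}
```

With the equality check the count jumps from 1 to 3 and never equals 2, so the loop only stops because of the artificial maxTicks bound (prints -1); the less-than check exits as soon as at least expectedNum containers have finished (prints 3).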
> ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
> ---------------------------------------------------------------------------------------------
>
> Key: YARN-3535
> URL: https://issues.apache.org/jira/browse/YARN-3535
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.6.0
> Reporter: Peng Zhang
> Assignee: Peng Zhang
> Priority: Critical
> Attachments: 0003-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log
>
>
> During a rolling update of the NMs, the AM failed to start a container on an NM.
> The job then hung.
> AM logs are attached.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)