Peng Zhang commented on YARN-3535:


Thanks for rebasing and adding tests.

As for removing {{recoverResourceRequestForContainer}}: according to my notes, 
it caused the test {{CapacityScheduler#testRecoverRequestAfterPreemption}} to 
fail. I cannot remember my reasoning beyond what I wrote at the time:
bq. Remove call of recoverResourceRequestForContainer from preemption to avoid 
duplication of recover RR.

I applied my patch {{YARN-3535-002.patch}} to our production cluster, and 
preemption works well with FairScheduler.

I have seen the failure of 
{{TestAMRestart.testAMRestartWithExistingContainers}} before, and I think the 
cause is:
bq. Changing TestAMRestart.java is because that case 
testAMRestartWithExistingContainers will trigger this logic. After this patch, 
one more container may be scheduled, and 
attempt.getJustFinishedContainers().size() may be bigger than expectedNum and 
loop never ends. So I simply change the situation.
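
To illustrate the non-terminating loop described above, here is a minimal standalone sketch. It is not the actual TestAMRestart code: the class {{WaitLoopSketch}}, the method {{pollsUntilExit}}, and the sample sizes are hypothetical, and it assumes the test waits with an exact-match condition on the finished-container count. If the count jumps past expectedNum, an exact-match wait never exits, while a "less than" wait does.

```java
// Hypothetical sketch, not the real TestAMRestart code.
final class WaitLoopSketch {

    /**
     * Simulates a polling wait loop over a sequence of observed sizes
     * (a stand-in for attempt.getJustFinishedContainers().size()).
     * The last observed size repeats once the sequence is exhausted.
     *
     * @return the number of polls before the loop exits,
     *         or -1 if it is still waiting after maxPolls
     */
    static int pollsUntilExit(int[] sizes, int expectedNum,
                              boolean exactMatch, int maxPolls) {
        for (int poll = 0; poll < maxPolls; poll++) {
            int size = sizes[Math.min(poll, sizes.length - 1)];
            // exactMatch models "while (size != expectedNum)";
            // otherwise it models "while (size < expectedNum)".
            boolean keepWaiting = exactMatch
                    ? size != expectedNum
                    : size < expectedNum;
            if (!keepWaiting) {
                return poll + 1;
            }
        }
        return -1; // never satisfied within the poll budget
    }

    public static void main(String[] args) {
        // Assumed scenario: one extra container is scheduled, so the
        // count goes 0, 1, 3 and skips expectedNum = 2.
        int[] overshoot = {0, 1, 3};
        System.out.println(pollsUntilExit(overshoot, 2, true, 100));
        System.out.println(pollsUntilExit(overshoot, 2, false, 100));
    }
}
```

Under these assumptions, either relaxing the wait condition or changing the test scenario so that the extra container is never scheduled avoids the hang; the patch takes the second route.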

>  ResourceRequest should be restored back to scheduler when RMContainer is 
> killed at ALLOCATED
> ---------------------------------------------------------------------------------------------
>                 Key: YARN-3535
>                 URL: https://issues.apache.org/jira/browse/YARN-3535
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Peng Zhang
>            Assignee: Peng Zhang
>            Priority: Critical
>         Attachments: 0003-YARN-3535.patch, YARN-3535-001.patch, 
> YARN-3535-002.patch, syslog.tgz, yarn-app.log
> During rolling update of NM, AM start of container on NM failed. 
> And then job hang there.
> Attach AM logs.

