[
https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629348#comment-14629348
]
Sunil G commented on YARN-3535:
-------------------------------
Hi [~rohithsharma] and [~peng.zhang]
After seeing this patch, I feel there may be a synchronization problem. Please
correct me if I am wrong.
In the ContainerRescheduledTransition code, it is used like this:
{code}
+ container.eventHandler.handle(new ContainerRescheduledEvent(container));
+ new FinishedTransition().transition(container, event);
{code}
Hence ContainerRescheduledEvent is fired to the Scheduler dispatcher, which will
process {{recoverResourceRequestForContainer}} in a separate thread.
Meanwhile, in RMAppImpl, {{FinishedTransition().transition}} will be invoked and
the container will be processed for closure. If the Scheduler
dispatcher is slow because of a long pending event queue, there is a
chance that recoverResourceRequest may not be correct.
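To make the ordering concern concrete, here is a minimal sketch of the suspected race. It is not the real YARN code: {{Container}}, {{QueueingDispatcher}} and {{rescheduleThenFinish}} are hypothetical names, and it assumes (for illustration only) that the finish path clears state which the recovery step reads later.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical, simplified model; none of these names are real YARN classes.
class RescheduleRaceSketch {

    /** Stand-in for a container that still holds its original resource request. */
    static class Container {
        String resourceRequest = "request-for-container-1";
        String recovered = null; // what the scheduler managed to restore
    }

    /** Stand-in for an asynchronous dispatcher: events are queued, not run inline. */
    static class QueueingDispatcher {
        private final Queue<Runnable> pending = new ArrayDeque<>();

        void handle(Runnable event) {
            pending.add(event);
        }

        void drain() {
            Runnable event;
            while ((event = pending.poll()) != null) {
                event.run();
            }
        }
    }

    /** Mirrors the order in the patch: enqueue the reschedule event, then finish. */
    static String rescheduleThenFinish(boolean dispatcherRunsBeforeFinish) {
        Container container = new Container();
        QueueingDispatcher dispatcher = new QueueingDispatcher();

        // Step 1: the reschedule event is only queued for the scheduler.
        dispatcher.handle(() -> container.recovered = container.resourceRequest);

        // A fast dispatcher handles the event before the finish step runs.
        if (dispatcherRunsBeforeFinish) {
            dispatcher.drain();
        }

        // Step 2: the finish step runs inline and (by assumption) clears state.
        container.resourceRequest = null;

        // A slow dispatcher only gets to the event now.
        dispatcher.drain();
        return container.recovered;
    }

    public static void main(String[] args) {
        System.out.println("fast dispatcher: " + rescheduleThenFinish(true));
        System.out.println("slow dispatcher: " + rescheduleThenFinish(false));
    }
}
```

In this model the slow-dispatcher case recovers nothing, because the queued event runs after the inline finish step. Whether the real code loses the request in the same way depends on what the finish transition actually touches.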
I feel we can introduce a new event that moves {{RMContainerImpl}} from ALLOCATED to
WAIT_FOR_REQUEST_RECOVERY, and the scheduler can fire back an event to
{{RMContainerImpl}} indicating that recovery of the resource request is completed. This
can move the state forward to KILLED in {{RMContainerImpl}}.
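A rough sketch of the handshake I have in mind follows. The state names come from the proposal above; the event names {{KILL}} and {{REQUEST_RECOVERED}} and the {{next}} helper are hypothetical, and this is a plain enum-based model rather than the actual {{RMContainerImpl}} state machine.

```java
// Hypothetical model of the proposed handshake; not the real RMContainerImpl.
class RecoveryHandshakeSketch {

    enum State { ALLOCATED, WAIT_FOR_REQUEST_RECOVERY, KILLED }

    // Event names are assumptions made for this sketch.
    enum Event { KILL, REQUEST_RECOVERED }

    /** Returns the next state, or throws if the event is not valid in this state. */
    static State next(State state, Event event) {
        switch (state) {
            case ALLOCATED:
                // Killing an allocated container first waits for the scheduler.
                if (event == Event.KILL) {
                    return State.WAIT_FOR_REQUEST_RECOVERY;
                }
                break;
            case WAIT_FOR_REQUEST_RECOVERY:
                // Only the scheduler's acknowledgement moves the container on.
                if (event == Event.REQUEST_RECOVERED) {
                    return State.KILLED;
                }
                break;
            default:
                break;
        }
        throw new IllegalStateException(
            "invalid event " + event + " in state " + state);
    }

    public static void main(String[] args) {
        State state = State.ALLOCATED;
        state = next(state, Event.KILL);
        System.out.println(state);
        state = next(state, Event.REQUEST_RECOVERED);
        System.out.println(state);
    }
}
```

The point of the intermediate state is that the container cannot reach KILLED until the scheduler has confirmed the recovery, so the finish processing no longer runs ahead of the dispatcher.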
Please share your thoughts.
> ResourceRequest should be restored back to scheduler when RMContainer is
> killed at ALLOCATED
> ---------------------------------------------------------------------------------------------
>
> Key: YARN-3535
> URL: https://issues.apache.org/jira/browse/YARN-3535
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.6.0
> Reporter: Peng Zhang
> Assignee: Peng Zhang
> Priority: Critical
> Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch,
> 0005-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz,
> yarn-app.log
>
>
> During a rolling update of the NM, the AM failed to start a container on the NM,
> and the job then hung.
> AM logs are attached.