[ 
https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629348#comment-14629348
 ] 

Sunil G commented on YARN-3535:
-------------------------------

Hi [~rohithsharma] and [~peng.zhang]
After seeing this patch, I feel there may be a synchronization problem. Please 
correct me if I am wrong.
In the ContainerRescheduledTransition code, it is used like this:
{code}
+      container.eventHandler.handle(new ContainerRescheduledEvent(container));
+      new FinishedTransition().transition(container, event);
{code}
Hence ContainerRescheduledEvent is fired to the Scheduler dispatcher, which will 
process {{recoverResourceRequestForContainer}} in a separate thread. 
Meanwhile, in RMAppImpl, {{FinishedTransition().transition}} will be invoked 
right away and will process the closure of this container. If the Scheduler 
dispatcher is slow because of a long pending event queue, there is a chance 
that recoverResourceRequest runs only after the container has already been 
closed, and so may not be correct.
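
To illustrate the ordering concern, here is a minimal single-threaded sketch. It is not the YARN code: the class name, the queue, and the log strings are hypothetical stand-ins, with a plain queue playing the role of the Scheduler dispatcher's pending event queue.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical stand-in, not the real RMContainerImpl or dispatcher.
public class RescheduleOrderingSketch {
    // Records the order in which the two steps actually run.
    static final List<String> log = new ArrayList<>();
    // Stand-in for the Scheduler dispatcher's pending event queue.
    static final Queue<Runnable> schedulerQueue = new ArrayDeque<>();

    // Mirrors the shape of the patch: enqueue the event, then finish inline.
    static void rescheduledTransition() {
        // The handler only enqueues; the recovery work runs later.
        schedulerQueue.add(() -> log.add("recoverResourceRequest"));
        // The finished transition runs immediately on the caller's side.
        log.add("finishedTransition");
    }

    static List<String> run() {
        log.clear();
        schedulerQueue.clear();
        rescheduledTransition();
        // The "dispatcher" drains its queue only after the caller returns.
        while (!schedulerQueue.isEmpty()) {
            schedulerQueue.poll().run();
        }
        return new ArrayList<>(log);
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In this sketch the finished step is always logged before the recovery step. In the real system the order depends on how far behind the dispatcher thread is, which is the concern raised above.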

I feel we can introduce a new event in {{RMContainerImpl}} that moves the 
container from ALLOCATED to a new WAIT_FOR_REQUEST_RECOVERY state, and the 
scheduler can fire back an event to {{RMContainerImpl}} to indicate that 
recovery of the resource request is completed. This can then move the state 
forward to KILLED in {{RMContainerImpl}}. 
Please share your thoughts.
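
A rough sketch of the proposed handshake, assuming a simplified state machine. Only ALLOCATED, WAIT_FOR_REQUEST_RECOVERY and KILLED come from the comment above; the class name and the event names KILL and REQUEST_RECOVERED are hypothetical and are not part of the existing {{RMContainerImpl}}.

```java
// Hypothetical simplification, not the real RMContainerImpl state machine.
public class RecoveryHandshakeSketch {
    enum State { ALLOCATED, WAIT_FOR_REQUEST_RECOVERY, KILLED }

    // Assumed event names for illustration only.
    enum Event { KILL, REQUEST_RECOVERED }

    static State next(State state, Event event) {
        // A kill at ALLOCATED parks the container until the scheduler replies.
        if (state == State.ALLOCATED && event == Event.KILL) {
            return State.WAIT_FOR_REQUEST_RECOVERY;
        }
        // Only the scheduler's acknowledgement moves the container to KILLED.
        if (state == State.WAIT_FOR_REQUEST_RECOVERY
                && event == Event.REQUEST_RECOVERED) {
            return State.KILLED;
        }
        // Events that do not apply in the current state are ignored here.
        return state;
    }

    public static void main(String[] args) {
        State s = State.ALLOCATED;
        s = next(s, Event.KILL);
        System.out.println(s);
        s = next(s, Event.REQUEST_RECOVERED);
        System.out.println(s);
    }
}
```

The point of the intermediate state is that the container cannot reach KILLED until the scheduler has confirmed that the resource request was restored, regardless of how long the dispatcher queue is.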

>  ResourceRequest should be restored back to scheduler when RMContainer is 
> killed at ALLOCATED
> ---------------------------------------------------------------------------------------------
>
>                 Key: YARN-3535
>                 URL: https://issues.apache.org/jira/browse/YARN-3535
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Peng Zhang
>            Assignee: Peng Zhang
>            Priority: Critical
>         Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 
> 0005-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, 
> yarn-app.log
>
>
> During rolling update of NM, AM start of container on NM failed. 
> And then job hang there.
> Attach AM logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
