[ 
https://issues.apache.org/jira/browse/YARN-10341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shilun Fan updated YARN-10341:
------------------------------
          Component/s: service-scheduler
     Target Version/s: 3.3.1, 3.4.0
    Affects Version/s: 3.3.1
                       3.4.0

> Yarn Service Container Completed event doesn't get processed 
> -------------------------------------------------------------
>
>                 Key: YARN-10341
>                 URL: https://issues.apache.org/jira/browse/YARN-10341
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: service-scheduler
>    Affects Versions: 3.4.0, 3.3.1
>            Reporter: Bilwa S T
>            Assignee: Bilwa S T
>            Priority: Critical
>             Fix For: 3.4.0, 3.3.1
>
>         Attachments: YARN-10341.001.patch, YARN-10341.002.patch, 
> YARN-10341.003.patch, YARN-10341.004.patch
>
>
> If 10 workers are running and some containers get killed, after a while we 
> see that only 9 workers are running. This is because the 
> CONTAINER_COMPLETED event is not processed on the AM side. 
>  The issue is in the code below:
> {code:java}
> public void onContainersCompleted(List<ContainerStatus> statuses) {
>       for (ContainerStatus status : statuses) {
>         ContainerId containerId = status.getContainerId();
>         ComponentInstance instance = liveInstances.get(status.getContainerId());
>         if (instance == null) {
>           LOG.warn(
>               "Container {} Completed. No component instance exists. exitStatus={}. diagnostics={} ",
>               containerId, status.getExitStatus(), status.getDiagnostics());
>           return;
>         }
>         ComponentEvent event =
>             new ComponentEvent(instance.getCompName(), CONTAINER_COMPLETED)
>                 .setStatus(status).setInstance(instance)
>                 .setContainerId(containerId);
>         dispatcher.getEventHandler().handle(event);
>       }
> {code}
> If no component instance exists for a container, the method returns 
> immediately and does not iterate over the remaining containers. This happens 
> when restart_policy is "ON_FAILURE".
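
One way to picture the behaviour described above is the simplified sketch below. It is not the attached patch. It assumes the remedy is to skip only the unknown container (`continue`) instead of leaving the callback (`return`). The class `CompletedEventSketch` and its plain-string types are hypothetical stand-ins for `ContainerStatus` and `ComponentInstance`, so the example runs without Hadoop on the classpath.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for the AM callback. Strings replace
// ContainerStatus / ComponentInstance so the sketch is self-contained.
class CompletedEventSketch {

  // Returns the instance names for which a CONTAINER_COMPLETED event
  // would be dispatched (the real code hands events to a dispatcher).
  static List<String> onContainersCompleted(
      Map<String, String> liveInstances, List<String> completedContainerIds) {
    List<String> dispatched = new ArrayList<>();
    for (String containerId : completedContainerIds) {
      String instance = liveInstances.get(containerId);
      if (instance == null) {
        System.err.println(
            "Container " + containerId + " completed. No component instance exists.");
        // Assumption: skip only this container; a 'return' here would
        // drop every status that follows it in the same callback.
        continue;
      }
      dispatched.add(instance);
    }
    return dispatched;
  }

  public static void main(String[] args) {
    Map<String, String> live = new HashMap<>();
    live.put("container_01", "worker-0");
    live.put("container_03", "worker-2");
    // container_02 has no live instance, yet container_03 is still handled.
    List<String> events = onContainersCompleted(
        live, List.of("container_01", "container_02", "container_03"));
    System.out.println(events); // prints [worker-0, worker-2]
  }
}
```

With `return` in place of `continue`, the same input would yield only `worker-0`, which matches the reported symptom of a worker never being replaced.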



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
