[jira] [Updated] (YARN-10895) ContainerIdPBImpl objects still can be leaked in RMNodeImpl.completedContainers

Jeongin Ju (Jira) Tue, 24 Aug 2021 02:59:05 -0700


     [ 
https://issues.apache.org/jira/browse/YARN-10895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jeongin Ju updated YARN-10895:
------------------------------
    Description: 
YARN-10467 fixed a ContainerIdPBImpl object leak in 
RMNodeImpl.completedContainers.

After applying the YARN-10467 patch and operating a cluster with a large 
number of nodes, we found that a similar heap leak still exists.

In a heap dump taken after failover (so this RM is no longer the active one), 
about 4.5G is used by ContainerIdPBImpl objects in 
RMNodeImpl.completedContainers.

 

There are two cases.

 

1. Apps with 'KeepContainersAcrossApplicationAttempts' set are not cleared 
when they fail

Even though 'KeepContainersAcrossApplicationAttempts' is set, we should clear 
RMAppAttemptImpl.justFinishedContainers.

If an app attempt fails and is retried by a next attempt, we may not need to 
clear RMAppAttemptImpl.justFinishedContainers, because the related 
ContainerIdPBImpl objects will be handed over to the next attempt and 
eventually cleared.

However, when the app itself fails, there is no next attempt and the heap leak 
occurs.

(We found this case when a YARN Service application failed over multiple 
attempts because of OOM in the AM.)
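
Case 1 can be illustrated with a small, self-contained model. This is only a 
sketch, not the ResourceManager code: the Node and Attempt classes and the 
handOverTo, ackAndRelease, and run names are hypothetical stand-ins, and 
container IDs are plain strings. It assumes the node keeps a completed 
container ID until some attempt explicitly releases it.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class KeepContainersLeakSketch {
    /** Hypothetical stand-in for a node that remembers completed containers. */
    public static class Node {
        final Set<String> completedContainers = new HashSet<>();

        void containerCompleted(String id) {
            completedContainers.add(id);
        }

        void cleanup(List<String> ids) {
            completedContainers.removeAll(ids);
        }
    }

    /** Hypothetical stand-in for an app attempt buffering finished container IDs. */
    public static class Attempt {
        final List<String> justFinishedContainers = new ArrayList<>();

        /** A failed attempt with a successor moves its buffer to that successor. */
        void handOverTo(Attempt next) {
            next.justFinishedContainers.addAll(justFinishedContainers);
            justFinishedContainers.clear();
        }

        /** Tells the node to forget the buffered IDs, then empties the buffer. */
        void ackAndRelease(Node node) {
            node.cleanup(justFinishedContainers);
            justFinishedContainers.clear();
        }
    }

    /** Runs the scenario and returns how many IDs the node still holds. */
    public static int run(boolean releaseOnFinalFailure) {
        Node node = new Node();
        Attempt first = new Attempt();
        Attempt last = new Attempt();

        node.containerCompleted("container_1");
        first.justFinishedContainers.add("container_1");
        first.handOverTo(last); // attempt 1 fails, attempt 2 takes over

        node.containerCompleted("container_2");
        last.justFinishedContainers.add("container_2");

        // Attempt 2 fails and the whole app fails: there is no next attempt.
        if (releaseOnFinalFailure) {
            last.ackAndRelease(node);
        }
        return node.completedContainers.size();
    }

    public static void main(String[] args) {
        System.out.println("without release: " + run(false));
        System.out.println("with release: " + run(true));
    }
}
```

In this model, run(false) leaves 2 IDs on the node and run(true) leaves 0, 
which mirrors why the final failed attempt has to release its buffer itself.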

 

2. Apps killed explicitly by the user

When an app is killed by the user through the 'yarn application -kill' CLI or 
the Web UI, RMAppAttemptImpl.amContainerFinished is not called because the app 
and app attempt states have already changed.

 

To handle this, we added sendFinishedContainersToNMs calls for both 
RMAppAttemptImpl.finishedContainersSentToAm and 
RMAppAttemptImpl.justFinishedContainers when the attempt is set to 'KILLED'.
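
The shape of that change can be sketched as follows. This is a simplified 
model and not the attached patch: the class name, the onKilled and demo 
methods, and the use of string-keyed maps are assumptions made for 
illustration. Only the names sendFinishedContainersToNMs, 
finishedContainersSentToAm, and justFinishedContainers are taken from the 
description above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KilledAttemptCleanupSketch {
    // Assumed layout: node name -> finished container IDs pending cleanup.
    final Map<String, List<String>> finishedContainersSentToAm = new HashMap<>();
    final Map<String, List<String>> justFinishedContainers = new HashMap<>();

    // Records what each node was told to forget (stands in for an RM-to-NM event).
    final Map<String, List<String>> sentToNodes = new HashMap<>();

    /** Sends every pending ID to its node, then empties the pending map. */
    void sendFinishedContainersToNMs(Map<String, List<String>> pending) {
        for (Map.Entry<String, List<String>> e : pending.entrySet()) {
            sentToNodes.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                       .addAll(e.getValue());
        }
        pending.clear();
    }

    /** Hypothetical hook run when the attempt moves to KILLED. */
    void onKilled() {
        sendFinishedContainersToNMs(finishedContainersSentToAm);
        sendFinishedContainersToNMs(justFinishedContainers);
    }

    /** Builds a small scenario and returns the number of IDs sent to nodes. */
    public static int demo() {
        KilledAttemptCleanupSketch attempt = new KilledAttemptCleanupSketch();
        attempt.finishedContainersSentToAm
               .computeIfAbsent("nm1", k -> new ArrayList<>()).add("container_1");
        attempt.justFinishedContainers
               .computeIfAbsent("nm1", k -> new ArrayList<>()).add("container_2");
        attempt.justFinishedContainers
               .computeIfAbsent("nm2", k -> new ArrayList<>()).add("container_3");

        attempt.onKilled();

        int sent = 0;
        for (List<String> ids : attempt.sentToNodes.values()) {
            sent += ids.size();
        }
        // Both pending maps must be empty once the kill path has run.
        if (!attempt.finishedContainersSentToAm.isEmpty()
                || !attempt.justFinishedContainers.isEmpty()) {
            throw new IllegalStateException("pending containers were not flushed");
        }
        return sent;
    }

    public static void main(String[] args) {
        System.out.println("IDs sent to nodes: " + demo());
    }
}
```

Flushing both collections on the kill path means the node side gets a chance 
to drop its references even though amContainerFinished never runs.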

 

We found and patched this on our 3.1.2 cluster, but trunk appears to have the 
same problem.

I have attached a patch based on trunk.

 

Thanks!



> ContainerIdPBImpl objects still can be leaked in 
> RMNodeImpl.completedContainers
> -------------------------------------------------------------------------------
>
>                 Key: YARN-10895
>                 URL: https://issues.apache.org/jira/browse/YARN-10895
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.1.2
>            Reporter: Jeongin Ju
>            Priority: Major
>         Attachments: YARN-10895.001.patch
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
