[ 
https://issues.apache.org/jira/browse/YARN-10895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18040477#comment-18040477
 ] 

ASF GitHub Bot commented on YARN-10895:
---------------------------------------

github-actions[bot] commented on PR #3327:
URL: https://github.com/apache/hadoop/pull/3327#issuecomment-3573253102

   We're closing this stale PR because it has been open for 100 days with no 
activity. This isn't a judgement on the merit of the PR in any way. It's just a 
way of keeping the PR queue manageable.
   If you feel like this was a mistake, or you would like to continue working 
on it, please feel free to re-open it and ask for a committer to remove the 
stale tag and review again.
   Thanks all for your contribution.




> ContainerIdPBImpl objects still can be leaked in 
> RMNodeImpl.completedContainers
> -------------------------------------------------------------------------------
>
>                 Key: YARN-10895
>                 URL: https://issues.apache.org/jira/browse/YARN-10895
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.1.2
>            Reporter: Jeongin Ju
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: YARN-10895.001.patch
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> YARN-10467 fixed a ContainerIdPBImpl object leak in 
> RMNodeImpl.completedContainers.
> After applying the YARN-10467 patch and operating a cluster with a large 
> number of nodes, we found that a similar heap leak still exists.
> In a heap dump taken after failover (so it is not from the active RM), about 
> 4.5G is used by ContainerIdPBImpl objects in RMNodeImpl.completedContainers.
>  
> There are two cases.
>  
> 1. Apps with 'KeepContainersAcrossApplicationAttempts' set are not cleared 
> when they fail
> Even when 'KeepContainersAcrossApplicationAttempts' is set, we should clear 
> RMAppAttemptImpl.justFinishedContainers.
> If an app attempt fails and is retried by a next attempt, we may not need to 
> clear RMAppAttemptImpl.justFinishedContainers, because the related 
> ContainerIdPBImpl objects will be handed over to the next attempt and 
> eventually cleared.
> However, when the app itself fails, there is no next attempt and the heap 
> leak occurs.
> (We found this case when a YARN Service application failed over multiple 
> attempts because of OOM in the AM.)
>  
> 2. Apps are killed explicitly by the user
> When an app is killed by the user through the 'yarn application -kill' CLI or 
> the web UI, RMAppAttemptImpl.amContainerFinished is not called because the 
> app and app attempt states have already changed.
>  
> To handle this, we added a sendFinishedContainersToNMs call for both 
> RMAppAttemptImpl.finishedContainersSentToAm and 
> RMAppAttemptImpl.justFinishedContainers when the attempt is set to 'KILLED'.
>  
> We found this and patched our cluster on 3.1.2, but trunk still seems to 
> have the same problem.
> I attached a patch based on trunk.
>  
> Thanks!
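
The leak pattern in the description above can be illustrated with a toy model. This is only a sketch and not Hadoop code: the classes LeakSketch, Node, and Attempt, and every method on them, are hypothetical stand-ins. Node plays the role of RMNodeImpl.completedContainers, and Attempt plays the role of RMAppAttemptImpl.justFinishedContainers. The assumption is that the node drops a container ID only after the attempt acknowledges it, so a terminal transition that skips the acknowledgement leaves the IDs behind.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the leak; all names here are hypothetical, not Hadoop APIs. */
class LeakSketch {

    /** Stand-in for the node side: keeps completed IDs until acknowledged. */
    static class Node {
        private final Set<String> completedContainers = new HashSet<>();

        void containerCompleted(String containerId) {
            completedContainers.add(containerId);
        }

        /** Stand-in for the cleanup that an acknowledgement triggers. */
        void ackFinishedContainers(List<String> containerIds) {
            completedContainers.removeAll(containerIds);
        }

        int pendingCount() {
            return completedContainers.size();
        }
    }

    /** Stand-in for the attempt side: finished containers not yet acknowledged. */
    static class Attempt {
        private final Node node;
        private final List<String> justFinishedContainers = new ArrayList<>();

        Attempt(Node node) {
            this.node = node;
        }

        void containerFinished(String containerId) {
            node.containerCompleted(containerId);
            justFinishedContainers.add(containerId);
        }

        /** Terminal transition without cleanup: the node keeps every ID. */
        void killWithoutFlush() {
            justFinishedContainers.clear();
        }

        /** Terminal transition with cleanup: acknowledge first, then drop. */
        void killWithFlush() {
            node.ackFinishedContainers(justFinishedContainers);
            justFinishedContainers.clear();
        }
    }

    public static void main(String[] args) {
        Node leakyNode = new Node();
        Attempt leaky = new Attempt(leakyNode);
        leaky.containerFinished("container_1");
        leaky.containerFinished("container_2");
        leaky.killWithoutFlush();

        Node cleanNode = new Node();
        Attempt clean = new Attempt(cleanNode);
        clean.containerFinished("container_1");
        clean.containerFinished("container_2");
        clean.killWithFlush();

        System.out.println("IDs left without flush: " + leakyNode.pendingCount());
        System.out.println("IDs left with flush: " + cleanNode.pendingCount());
    }
}
```

In this model, killWithFlush corresponds roughly to the idea of calling sendFinishedContainersToNMs when the attempt reaches 'KILLED'; the real patch may differ in where and how it does this.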



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
