[
https://issues.apache.org/jira/browse/YARN-10895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jeongin Ju updated YARN-10895:
------------------------------
Description:
YARN-10467 fixed the ContainerIdPBImpl object leak in
RMNodeImpl.completedContainers.
After applying the YARN-10467 patch and running a cluster with a large number
of nodes, we found that a similar heap leak still exists.
In a heap dump taken after failover (so the RM was no longer active), about
4.5G is used by ContainerIdPBImpl objects in RMNodeImpl.completedContainers.
There are two cases.
1. Apps with 'KeepContainersAcrossApplicationAttempts' are not cleaned up when
they fail
Even when 'KeepContainersAcrossApplicationAttempts' is set, we should clear
RMAppAttemptImpl.justFinishedContainers.
If an app attempt fails and is retried by a next attempt, we may not need to
clear RMAppAttemptImpl.justFinishedContainers, because the related
ContainerIdPBImpl objects are handed over to the next attempt and eventually
cleared.
However, when the app itself fails, there is no next attempt and the heap leak
occurs.
(We found this case when a YARN Service application failed over multiple
attempts because of OOM in the AM.)
2. Apps killed explicitly by the user
When an app is killed by the user through the 'yarn application -kill' CLI or
the web UI, RMAppAttemptImpl.amContainerFinished is not called because the app
and app attempt states have already changed.
To handle this, we added a sendFinishedContainersToNMs call for each of
RMAppAttemptImpl.finishedContainersSentToAm and
RMAppAttemptImpl.justFinishedContainers when the attempt is set to 'KILLED'.
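
To illustrate the two cases, here is a minimal, self-contained sketch of the
bookkeeping involved. It is not the attached patch and does not use the real
ResourceManager classes: Node and Attempt are simplified stand-ins for
RMNodeImpl and RMAppAttemptImpl, container ids are plain strings, and
onTerminal and remainingAfterEnd are hypothetical names introduced only for
this example. It assumes that a node keeps a completed container id until the
attempt acknowledges it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CompletedContainersLeakSketch {

    /** Stand-in for RMNodeImpl: keeps completed container ids until acknowledged. */
    static final class Node {
        final Set<String> completedContainers = new HashSet<>();
    }

    /** Stand-in for RMAppAttemptImpl with its two per-node bookkeeping maps. */
    static final class Attempt {
        final Map<Node, List<String>> justFinishedContainers = new HashMap<>();
        final Map<Node, List<String>> finishedContainersSentToAM = new HashMap<>();

        /** Record a finished container on both the node and the attempt. */
        void containerFinished(Node node, String containerId) {
            node.completedContainers.add(containerId);
            justFinishedContainers
                .computeIfAbsent(node, n -> new ArrayList<>())
                .add(containerId);
        }

        /** Acknowledge every id in the map to its node, then empty the map. */
        static void sendFinishedContainersToNMs(Map<Node, List<String>> finished) {
            finished.forEach((node, ids) -> node.completedContainers.removeAll(ids));
            finished.clear();
        }

        /**
         * Hypothetical terminal hook: the attempt was KILLED, or it failed and
         * there is no next attempt to inherit the finished containers.
         */
        void onTerminal() {
            sendFinishedContainersToNMs(finishedContainersSentToAM);
            sendFinishedContainersToNMs(justFinishedContainers);
        }
    }

    /** Returns how many container ids the node still holds after the attempt ends. */
    static int remainingAfterEnd(boolean cleanUpOnTerminal) {
        Node node = new Node();
        Attempt attempt = new Attempt();
        attempt.containerFinished(node, "container_1");
        attempt.containerFinished(node, "container_2");
        if (cleanUpOnTerminal) {
            attempt.onTerminal();
        }
        return node.completedContainers.size();
    }

    public static void main(String[] args) {
        System.out.println("without cleanup: " + remainingAfterEnd(false));
        System.out.println("with cleanup:    " + remainingAfterEnd(true));
    }
}
```

In this model the node still holds both ids when the attempt ends without the
terminal cleanup, and none when the cleanup runs for both maps.
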
We found this on our 3.1.2 cluster and patched it there, but trunk seems to
have the same problem.
I have attached a patch based on trunk.
Thanks!
> ContainerIdPBImpl objects still can be leaked in
> RMNodeImpl.completedContainers
> -------------------------------------------------------------------------------
>
> Key: YARN-10895
> URL: https://issues.apache.org/jira/browse/YARN-10895
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 3.1.2
> Reporter: Jeongin Ju
> Priority: Major
> Attachments: YARN-10895.001.patch