[ 
https://issues.apache.org/jira/browse/FLINK-29927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17630326#comment-17630326
 ] 

Chesnay Schepler commented on FLINK-29927:
------------------------------------------

Unless I'm severely underestimating Scala magic, this leak should've been there 
for a long time.

Should be easy to fix though; I've opened a PR.

> AkkaUtils#getAddress may cause memory leak
> ------------------------------------------
>
>                 Key: FLINK-29927
>                 URL: https://issues.apache.org/jira/browse/FLINK-29927
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / RPC
>    Affects Versions: 1.16.0, 1.15.2
>            Reporter: Gen Luo
>            Assignee: Chesnay Schepler
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.17.0, 1.15.3, 1.16.1
>
>         Attachments: RemoteAddressExtensionLeaking.png
>
>
> We found a slow memory leak in the JM. When MetricFetcherImpl retrieves 
> metrics, it always calls MetricQueryServiceRetriever#retrieveService first. 
> That method acquires the address of a task manager, which uses 
> AkkaUtils#getAddress internally. The getAddress method is implemented 
> like this:
> {code:java}
>     public static Address getAddress(ActorSystem system) {
>         return new RemoteAddressExtension().apply(system).getAddress();
>     }
> {code}
> and the RemoteAddressExtension#apply is like this:
> {code:scala}
>   def apply(system: ActorSystem): T = {
>     java.util.Objects.requireNonNull(system, "system must not be null!").registerExtension(this)
>   }
> {code}
> This means every call of AkkaUtils#getAddress registers a new extension 
> with the ActorSystem, which can never be released until the ActorSystem exits.
> Most usages of the method are called only once during initialization, but 
> as described above, MetricFetcherImpl also uses the method. This can 
> happen periodically while users have the WebUI open, or when users 
> call the REST API directly to get metrics. This means the memory may keep 
> leaking. 
> The leak may have been introduced in FLINK-23662 when porting the Scala 
> version of AkkaUtils to Java, though I'm not sure whether the Scala version 
> has the same issue.
> The leak seems very slow. We observed it on a job that had been running for 
> more than one month with only 1G of memory for the job manager. So I suppose 
> it is not an emergency, but it still needs to be fixed.
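
For illustration only, the mechanism described above can be modelled in plain Java without Akka. Everything in this sketch is hypothetical: FakeActorSystem and FakeAddressExtensionId are stand-ins, not Akka or Flink classes, and the sketch assumes the registry is keyed by the extension-id instance and never drops entries while the system lives. The shared-instance variant shows one possible way to avoid the growth; it is not necessarily what the PR does.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

public class ExtensionLeakSketch {

    /** Hypothetical stand-in for an actor system's extension registry. */
    static final class FakeActorSystem {
        // Assumption: keyed by the extension-id instance; entries are never removed.
        private final Map<Object, String> extensions = new ConcurrentHashMap<>();

        String registerExtension(Object extensionId) {
            // Registers the id on first use, otherwise returns the existing entry.
            return extensions.computeIfAbsent(extensionId, id -> "fake-address");
        }

        int registeredCount() {
            return extensions.size();
        }
    }

    /** Hypothetical extension id; it relies on identity-based equals/hashCode. */
    static final class FakeAddressExtensionId {
        String apply(FakeActorSystem system) {
            return Objects.requireNonNull(system, "system must not be null!")
                    .registerExtension(this);
        }
    }

    /** Single shared id, used by the non-leaking variant. */
    static final FakeAddressExtensionId INSTANCE = new FakeAddressExtensionId();

    /** Mirrors the reported pattern: a fresh id per call, so one new entry per call. */
    static String getAddressLeaky(FakeActorSystem system) {
        return new FakeAddressExtensionId().apply(system);
    }

    /** Reuses one id, so repeated calls hit the same registry entry. */
    static String getAddressShared(FakeActorSystem system) {
        return INSTANCE.apply(system);
    }

    /** Calls one of the two variants repeatedly and reports the registry size. */
    public static int countAfterCalls(int calls, boolean leaky) {
        FakeActorSystem system = new FakeActorSystem();
        for (int i = 0; i < calls; i++) {
            if (leaky) {
                getAddressLeaky(system);
            } else {
                getAddressShared(system);
            }
        }
        return system.registeredCount();
    }

    public static void main(String[] args) {
        System.out.println("leaky:  " + countAfterCalls(1000, true));
        System.out.println("shared: " + countAfterCalls(1000, false));
    }
}
```

In this model the leaky variant ends with one registry entry per call, while the shared variant stays at a single entry no matter how often it is called.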



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
