[jira] [Commented] (FLINK-34726) Flink Kubernetes Operator has some room for optimizing performance.

2024-03-19 Thread Fei Feng (Jira)


[ https://issues.apache.org/jira/browse/FLINK-34726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828262#comment-17828262 ]

Fei Feng commented on FLINK-34726:
--

I think these two things (the rest cluster client and the FlinkDeployment resource 
info) should be held in the Context, to avoid unnecessary runtime overhead 
and GC pressure. The difficulty lies in how to update them promptly when changes 
occur. For example, the rest cluster client should be recreated or updated if the 
JobManager rest address changes, and if the FlinkDeployment object changes, the 
session job's SecondaryResource should be updated.
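
A minimal sketch of that idea, assuming a hypothetical holder object kept per 
deployment inside the Context (class and method names are illustrative, not the 
operator's actual API): the client is rebuilt only when the rest address derived 
from the observe configuration changes.

{code:java}
import org.apache.flink.client.program.rest.RestClusterClient;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.RestOptions;
import org.apache.flink.kubernetes.configuration.KubernetesConfigOptions;

// Hypothetical holder (not operator code): keeps one RestClusterClient per
// reconciled deployment and rebuilds it only when the JobManager rest address
// in the observe configuration changes.
public class CachedRestClientHolder implements AutoCloseable {

    private RestClusterClient<String> client;
    private String cachedRestAddress;

    public synchronized RestClusterClient<String> getOrCreate(Configuration observeConf)
            throws Exception {
        String restAddress =
                observeConf.get(RestOptions.ADDRESS) + ":" + observeConf.get(RestOptions.PORT);
        if (client == null || !restAddress.equals(cachedRestAddress)) {
            close(); // drop the stale client before building a new one
            client =
                    new RestClusterClient<>(
                            observeConf, observeConf.get(KubernetesConfigOptions.CLUSTER_ID));
            cachedRestAddress = restAddress;
        }
        return client;
    }

    @Override
    public synchronized void close() {
        if (client != null) {
            client.close();
            client = null;
        }
    }
}
{code}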
 

> Flink Kubernetes Operator has some room for optimizing performance.
> ---
>
> Key: FLINK-34726
> URL: https://issues.apache.org/jira/browse/FLINK-34726
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Affects Versions: kubernetes-operator-1.5.0, kubernetes-operator-1.6.0, 
> kubernetes-operator-1.7.0
>Reporter: Fei Feng
>Priority: Major
> Attachments: operator_no_submit_no_kill.flamegraph.html
>
>
> When there is a huge number of FlinkDeployments and FlinkSessionJobs in a 
> Kubernetes cluster, there is a significant delay between an event being 
> submitted to the reconcile thread pool and that event being processed.
> This is our test: we gave the operator plenty of resources (CPU: 10 cores, 
> memory: 20g, reconcile thread pool size of 200) and first deployed 1 jobs 
> (one FlinkDeployment and one FlinkSessionJob per job), then ran job 
> submit/delete tests. We found that:
> 1. It took about 2 minutes between creating the new FlinkDeployment and 
> FlinkSessionJob CRs in Kubernetes and the Flink job being submitted to the 
> JobManager.
> 2. It took about 1 minute between deleting a FlinkDeployment and 
> FlinkSessionJob CR and the Flink job and session cluster being cleaned up.
>  
> I used async-profiler to capture a flamegraph while there was a huge number 
> of FlinkDeployments and FlinkSessionJobs. I found two obvious areas for 
> optimization:
> 1. For FlinkDeployment: in the observe step we call 
> AbstractFlinkService.getClusterInfo/listJobs/getTaskManagerInfo, and every 
> time we call one of these methods we create a RestClusterClient, send the 
> requests, and close it. I think we should reuse the RestClusterClient as 
> much as possible to avoid frequently creating objects and to reduce GC 
> pressure.
> 2. For FlinkSessionJob (this issue is more obvious): in a single reconcile 
> loop we call getSecondaryResource 5 times to get the FlinkDeployment 
> resource info. Based on my current understanding of the Flink Operator, we 
> do not need to call it 5 times per reconcile loop; calling it once is enough 
> (see the sketch after this quoted description). If so, we could save about 
> 30% CPU usage (each getSecondaryResource call costs about 6% CPU).
> [^operator_no_submit_no_kill.flamegraph.html]
> I hope we can discuss solutions to address this problem together. I'm very 
> willing to optimize and resolve this issue.
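
To make point 2 of the description concrete, here is a minimal, purely 
illustrative sketch (not the operator's actual reconciler; the helper methods 
are hypothetical and the CR import paths are from memory): the FlinkDeployment 
is looked up once at the top of the session job reconcile and passed down, so 
the secondary-resource lookup and its deserialization happen once per loop 
instead of five times.

{code:java}
import java.util.Optional;

import io.javaoperatorsdk.operator.api.reconciler.Context;
import org.apache.flink.kubernetes.operator.api.FlinkDeployment;
import org.apache.flink.kubernetes.operator.api.FlinkSessionJob;

// Illustrative only: fetch the owning FlinkDeployment once per reconcile loop
// and hand the same object to every step, instead of repeating the
// getSecondaryResource() lookup in each observe/reconcile/status helper.
public class SessionJobReconcileSketch {

    public void reconcile(FlinkSessionJob sessionJob, Context<FlinkSessionJob> context) {
        // Single cache lookup (and single deserialization) per loop.
        Optional<FlinkDeployment> sessionCluster =
                context.getSecondaryResource(FlinkDeployment.class);
        if (sessionCluster.isEmpty()) {
            return; // session cluster not visible yet; nothing to do in this sketch
        }
        FlinkDeployment deployment = sessionCluster.get();

        observe(sessionJob, deployment);       // hypothetical helpers, not operator API
        reconcileSpec(sessionJob, deployment);
        updateStatus(sessionJob, deployment);
    }

    private void observe(FlinkSessionJob job, FlinkDeployment deployment) {}

    private void reconcileSpec(FlinkSessionJob job, FlinkDeployment deployment) {}

    private void updateStatus(FlinkSessionJob job, FlinkDeployment deployment) {}
}
{code}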





[jira] [Commented] (FLINK-34726) Flink Kubernetes Operator has some room for optimizing performance.

2024-03-19 Thread Fei Feng (Jira)


[ https://issues.apache.org/jira/browse/FLINK-34726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828253#comment-17828253 ]

Fei Feng commented on FLINK-34726:
--

"Sounds a bit strange that getSecondaryResource is so expensive as that should 
happen from a cache. "
-
It's actually not strange, because getSecondaryResource's implementation involves 
object JSON de/serialization (you can see it in the flamegraph), so the size of 
the FlinkDeployment CR can affect the CPU cost of this call. (Our FlinkDeployment 
CR size was 21K, which is not reasonable, and we will reduce the CR size.)
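
If useful for confirming this, a rough probe along these lines (purely 
illustrative, to be called from somewhere a JOSDK Context is available) could 
show whether the average per-call cost grows with the CR size:

{code:java}
import java.util.Optional;

import io.javaoperatorsdk.operator.api.reconciler.Context;
import org.apache.flink.kubernetes.operator.api.FlinkDeployment;
import org.apache.flink.kubernetes.operator.api.FlinkSessionJob;

// Rough micro-measurement (illustrative only): average the cost of repeated
// getSecondaryResource() calls; if it scales with the CR size, per-call
// deserialization is the likely culprit.
public final class SecondaryResourceCostProbe {

    public static long averageCallMicros(Context<FlinkSessionJob> context, int iterations) {
        long start = System.nanoTime();
        Optional<FlinkDeployment> last = Optional.empty();
        for (int i = 0; i < iterations; i++) {
            last = context.getSecondaryResource(FlinkDeployment.class);
        }
        // Use the result so the loop is not trivially dead code.
        System.out.println("secondary resource present: " + last.isPresent());
        return (System.nanoTime() - start) / iterations / 1_000;
    }
}
{code}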






[jira] [Commented] (FLINK-34726) Flink Kubernetes Operator has some room for optimizing performance.

2024-03-19 Thread Gyula Fora (Jira)


[ https://issues.apache.org/jira/browse/FLINK-34726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828210#comment-17828210 ]

Gyula Fora commented on FLINK-34726:


Thanks for the detailed analysis [~Fei Feng]. You are completely right that we 
don't optimise the rest client usage and that may add a significant overhead. 
We have done similar optimisation in the past for config access/generation by 
using the FlinkResourceContext class. 

We could probably move the rest client generation logic there instead of hiding 
it under the FlinkService completely. This will however be a bigger change, as 
it will also affect the methods of the FlinkService interface.

Sounds a bit strange that getSecondaryResource is so expensive as that should 
happen from a cache. We should look into why it's expensive in the first place, 
because passing the FlinkDeployment objects around will make the code a bit more 
complicated, but I guess that could also be hidden under the 
FlinkSessionJobContext.
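
A minimal sketch of the "hide it under the context" variant, assuming a 
hypothetical wrapper rather than the actual FlinkSessionJobContext class: the 
first accessor call performs the getSecondaryResource lookup and the result is 
memoized, so later steps reuse it without the FlinkDeployment being passed 
around explicitly.

{code:java}
import java.util.Optional;

import io.javaoperatorsdk.operator.api.reconciler.Context;
import org.apache.flink.kubernetes.operator.api.FlinkDeployment;
import org.apache.flink.kubernetes.operator.api.FlinkSessionJob;

// Hypothetical sketch: memoize the secondary-resource lookup inside a
// session-job context so all steps of one reconcile share the same
// FlinkDeployment instance via a getSessionCluster()-style accessor.
public class MemoizingSessionJobContext {

    private final Context<FlinkSessionJob> josdkContext;
    private Optional<FlinkDeployment> sessionCluster; // filled on first access

    public MemoizingSessionJobContext(Context<FlinkSessionJob> josdkContext) {
        this.josdkContext = josdkContext;
    }

    /** Single getSecondaryResource call per reconcile, cached for later steps. */
    public Optional<FlinkDeployment> getSessionCluster() {
        if (sessionCluster == null) {
            sessionCluster = josdkContext.getSecondaryResource(FlinkDeployment.class);
        }
        return sessionCluster;
    }
}
{code}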




--
This message was sent by Atlassian Jira
(v8.20.10#820010)