[jira] [Commented] (FLINK-34213) Consider using accumulated busy time instead of busyMsPerSecond

2024-01-23 Thread Maximilian Michels (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17810006#comment-17810006
 ] 

Maximilian Michels commented on FLINK-34213:


If we had to query metrics per vertex, that would be too expensive, but it 
seems like that is not necessary. Here is an example REST API response from the 
{{/jobs/}} endpoint:

{noformat}
{
  "jid": "b4f918c2a0312de9fe7369a7db093e96",
  "name": "-",
  "isStoppable": false,
  "state": "RUNNING",
  "start-time": 1705094021727,
  "end-time": -1,
  "duration": 928985186,
  "maxParallelism": 1,
  "now": 1706023006913,
  "timestamps": {
    "SUSPENDED": 0,
    "RUNNING": 1705094036134,
    "FAILING": 0,
    "CANCELED": 0,
    "CANCELLING": 0,
    "CREATED": 1705094035034,
    "INITIALIZING": 1705094021727,
    "FAILED": 0,
    "RESTARTING": 0,
    "RECONCILING": 0,
    "FINISHED": 0
  },
  "vertices": [
    {
      "id": "db1f263dc155338dc2a9622a2e06d115",
      "name": "",
      "maxParallelism": 1,
      "parallelism": 18,
      "status": "RUNNING",
      "start-time": 1705094037437,
      "end-time": -1,
      "duration": 928969476,
      "tasks": {
        "CANCELED": 0,
        "DEPLOYING": 0,
        "CANCELING": 0,
        "RECONCILING": 0,
        "FINISHED": 0,
        "SCHEDULED": 0,
        "CREATED": 0,
        "INITIALIZING": 0,
        "FAILED": 0,
        "RUNNING": 18
      },
      "metrics": {
        "read-bytes": 0,
        "read-bytes-complete": true,
        "write-bytes": 2907138853415272,
        "write-bytes-complete": true,
        "read-records": 0,
        "read-records-complete": true,
        "write-records": 229589536334,
        "write-records-complete": true,
        "accumulated-backpressured-time": 1533744940,
        "accumulated-idle-time": 10026044858,
        "accumulated-busy-time": 5161601268
      }
    },
    ...
  ]
}
{noformat}

Note the accumulated backpressured/idle/busy time fields in each vertex's metrics.
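
As a rough illustration of why a single response is enough, the sketch below pulls the accumulated times for every vertex out of one parsed job-details payload. The helper name {{accumulated_times}} and the trimmed-down {{SAMPLE}} are assumptions made for this example; only the field names and values are taken from the response above.

```python
import json

# Trimmed copy of the response above; only the fields this sketch reads.
SAMPLE = """
{
  "now": 1706023006913,
  "vertices": [
    {
      "id": "db1f263dc155338dc2a9622a2e06d115",
      "parallelism": 18,
      "metrics": {
        "accumulated-backpressured-time": 1533744940,
        "accumulated-idle-time": 10026044858,
        "accumulated-busy-time": 5161601268
      }
    }
  ]
}
"""


def accumulated_times(job_details):
    """Map vertex id -> (busy, idle, backpressured) accumulated times.

    Hypothetical helper: reads all vertices from a single job-details
    response, so no per-vertex metric query is issued.
    """
    result = {}
    for vertex in job_details["vertices"]:
        metrics = vertex["metrics"]
        result[vertex["id"]] = (
            metrics["accumulated-busy-time"],
            metrics["accumulated-idle-time"],
            metrics["accumulated-backpressured-time"],
        )
    return result


times = accumulated_times(json.loads(SAMPLE))
```

In the sample, busy + idle + backpressured time comes to roughly the vertex duration multiplied by its parallelism (18), which is consistent with the values being summed over subtasks.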

> Consider using accumulated busy time instead of busyMsPerSecond
> ---
>
> Key: FLINK-34213
> URL: https://issues.apache.org/jira/browse/FLINK-34213
> Project: Flink
>  Issue Type: Improvement
>  Components: Autoscaler, Kubernetes Operator
>Reporter: Maximilian Michels
>Priority: Minor
>
> We might achieve much better accuracy if we used the accumulated busy time 
> metrics from Flink, instead of the momentarily collected ones.
> We would use the diff between the last accumulated and the current 
> accumulated busy time.
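
A minimal sketch of the diff described above, not the autoscaler's actual implementation. It assumes two samples of the same vertex, each carrying the response's {{now}} timestamp, the vertex's {{accumulated-busy-time}} (stored here under the hypothetical key {{busy}}) and its parallelism, and it assumes the accumulated value is summed over subtasks. The function name and sample values are made up for illustration.

```python
def busy_ms_per_second(prev, curr):
    """Average busy ms per second per subtask between two samples.

    Each sample is a dict with:
      now         - wall-clock timestamp of the sample, in ms
      busy        - accumulated busy time of the vertex, in ms
      parallelism - number of subtasks of the vertex
    Returns None when the rate cannot be computed.
    """
    elapsed_ms = curr["now"] - prev["now"]
    if elapsed_ms <= 0 or curr["parallelism"] <= 0:
        return None
    busy_delta = curr["busy"] - prev["busy"]
    if busy_delta < 0:
        # Counter went backwards, e.g. after a restart: skip this interval.
        return None
    # Assumption: "busy" is summed over subtasks, so divide by parallelism.
    return 1000.0 * busy_delta / (elapsed_ms * curr["parallelism"])


# Hypothetical samples taken 60 s apart for a vertex with 18 subtasks.
previous = {"now": 0, "busy": 0, "parallelism": 18}
current = {"now": 60_000, "busy": 540_000, "parallelism": 18}
rate = busy_ms_per_second(previous, current)  # 500.0 ms busy per second
```

Because the rate is averaged over the whole interval between samples, short spikes or dips at the moment of collection do not skew it the way a momentary reading can.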



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34213) Consider using accumulated busy time instead of busyMsPerSecond

2024-01-23 Thread Gyula Fora (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17809922#comment-17809922
 ] 

Gyula Fora commented on FLINK-34213:


The problem here is that we currently use aggregated metrics only. To 
track the diff you would need to query individual vertex metrics and track them, 
which would be a huge overhead / cost for jobs with many vertices.
