[jira] [Commented] (FLINK-17328) Expose network metric for job vertex in rest api
[ https://issues.apache.org/jira/browse/FLINK-17328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17252668#comment-17252668 ]

Piotr Nowojski commented on FLINK-17328:

I'm currently taking over the parent issue and investigating how it could be implemented. After the investigation I will either use the existing tickets and assign them to myself, or modify them or create new ones.

> Expose network metric for job vertex in rest api
>
> Key: FLINK-17328
> URL: https://issues.apache.org/jira/browse/FLINK-17328
> Project: Flink
> Issue Type: Sub-task
> Components: Runtime / Metrics, Runtime / REST
> Reporter: lining
> Priority: Major
> Labels: pull-request-available
>
> JobDetailsHandler
> * pool usage: outPoolUsageAvg, inputExclusiveBuffersUsageAvg, inputFloatingBuffersUsageAvg
> * back-pressured: shows whether the vertex is back-pressured (merging all of its subtasks)

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17328) Expose network metric for job vertex in rest api
[ https://issues.apache.org/jira/browse/FLINK-17328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17251665#comment-17251665 ]

Till Rohrmann commented on FLINK-17328:

[~pnowojski] is this still relevant? What is the state of this ticket?
[jira] [Commented] (FLINK-17328) Expose network metric for job vertex in rest api
[ https://issues.apache.org/jira/browse/FLINK-17328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195554#comment-17195554 ]

Piotr Nowojski commented on FLINK-17328:

Sorry [~chesnay], I think I've misunderstood you in that case. Regarding the pool usages, they are helpful for two things:
# Input pool usage is useful for finding a task that is back-pressured but is not back-pressuring its upstream tasks. That's a temporary situation, but it can happen, especially if the task is emitting a large chunk of buffered data, like {{WindowOperator}} after firing a timer.
# The combination of average pool usage with the "is back-pressured" state can be used to distinguish between the case where only a couple of channels are back-pressured (data skew) and the case where all of the channels are.

It's not as important as the "is back-pressured" fact, but still useful and hard to digest without the job graph.

> Expose network metric for job vertex in rest api
>
> Key: FLINK-17328
> URL: https://issues.apache.org/jira/browse/FLINK-17328
> Project: Flink
> Issue Type: Sub-task
> Components: Runtime / Metrics, Runtime / REST
> Reporter: lining
> Assignee: lining
> Priority: Major
> Labels: pull-request-available
>
> JobDetailsHandler
> * pool usage: outPoolUsageAvg, inputExclusiveBuffersUsageAvg, inputFloatingBuffersUsageAvg
> * back-pressured: shows whether the vertex is back-pressured (merging all of its subtasks)
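The two uses of pool usage described in the comment above could be sketched roughly as follows. This is only an illustration of the reasoning, not code from Flink or from the pull request: the function name `interpret`, its parameters, and the `full` threshold of 0.9 are all assumptions chosen for the example.

```python
from typing import List


def interpret(back_pressured: bool,
              out_pool_usage_avg: float,
              in_pool_usage_avg: float,
              full: float = 0.9) -> List[str]:
    """Turn vertex-level values into human-readable observations.

    All names and the `full` threshold are hypothetical; pool usages are
    expected as fractions between 0.0 and 1.0.
    """
    if not back_pressured:
        return ["not back-pressured"]

    notes = []
    # Use 2: average output pool usage separates "a few channels" from "all channels".
    if out_pool_usage_avg >= full:
        notes.append("all output channels appear back-pressured")
    else:
        notes.append("only some output channels appear back-pressured (possible data skew)")

    # Use 1: a back-pressured task whose input pool is not full is not
    # (yet) back-pressuring its upstream tasks.
    if in_pool_usage_avg >= full:
        notes.append("input buffers full: back-pressure is likely propagating upstream")
    else:
        notes.append("input buffers not full: upstream tasks are not back-pressured yet")
    return notes


if __name__ == "__main__":
    for line in interpret(True, out_pool_usage_avg=0.4, in_pool_usage_avg=0.3):
        print(line)
```

A single cut-off is a deliberate simplification; a real UI would more likely show the raw averages next to the back-pressure state and leave the interpretation to the user.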
[jira] [Commented] (FLINK-17328) Expose network metric for job vertex in rest api
[ https://issues.apache.org/jira/browse/FLINK-17328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195478#comment-17195478 ]

Chesnay Schepler commented on FLINK-17328:

I never said displaying it in the UI is not more convenient than doing the matching by hand. What I'm questioning is why we expose the pool usages through the REST API when all you really need is "backpressure between subtask A of Task 1 and subtask B of Task 2" or "this edge has 20% more data than other edges".
[jira] [Commented] (FLINK-17328) Expose network metric for job vertex in rest api
[ https://issues.apache.org/jira/browse/FLINK-17328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195466#comment-17195466 ]

Piotr Nowojski commented on FLINK-17328:

What I meant by difficult is that if you have ~100 tasks (with hundreds of parallel subtasks each), it's really difficult to understand what's happening with the job without visualising the data in the shape of the job graph. Have you tried doing it [~chesnay]? :) In textual form, you are forced to look at the tasks (or subtasks, for data skew) one by one. Grafana and other metrics visualisers don't help much with that.

Now compare this to looking at a graph with green, yellow or red dots, and with some similar marker for the average state of the buffer pools. One quick glance and it becomes immediately obvious:
* what is back-pressured and what's not
* whether there is some data skew involved, and on which edges

Moreover, just for the sake of the sanity of people using Flink or answering users' problems, it's really good to have some basic functionality built into the system that allows one to understand what's happening.
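As a rough illustration of the "green, yellow or red dots" idea in the comment above, a per-vertex back-pressure ratio could be mapped to a colour as sketched here. The function name, the vertex names, and the thresholds (0.10 and 0.50) are assumptions made for this example and are not taken from the ticket.

```python
from typing import Dict


def back_pressure_color(ratio: float) -> str:
    """Map a back-pressure ratio in [0, 1] to a traffic-light colour.

    The cut-offs are illustrative assumptions, not values defined in this ticket.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError(f"ratio must be within [0, 1], got {ratio}")
    if ratio <= 0.10:
        return "green"
    if ratio <= 0.50:
        return "yellow"
    return "red"


def color_job_graph(ratios: Dict[str, float]) -> Dict[str, str]:
    """Colour every vertex of a job graph given its back-pressure ratio."""
    return {vertex: back_pressure_color(r) for vertex, r in ratios.items()}


if __name__ == "__main__":
    # Hypothetical vertices and ratios, made up for the example.
    print(color_job_graph({"Source": 0.8, "Window": 0.3, "Sink": 0.0}))
```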
[jira] [Commented] (FLINK-17328) Expose network metric for job vertex in rest api
[ https://issues.apache.org/jira/browse/FLINK-17328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195454#comment-17195454 ]

Chesnay Schepler commented on FLINK-17328:

If the matching of such metrics to the JobGraph is painful, then I don't see how exposing these in the UI solves anything. I would think that a better goal would be to have the REST API provide a more high-level take on where back-pressure is, instead of exposing a bunch of low-level metrics and doing the matching in the UI.
[jira] [Commented] (FLINK-17328) Expose network metric for job vertex in rest api
[ https://issues.apache.org/jira/browse/FLINK-17328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195451#comment-17195451 ]

Piotr Nowojski commented on FLINK-17328:

I would tend to agree with [~lining]. When monitoring backpressure or data skew (for which the state of the buffer pools can be used), it's important to know the topology of the job. Although most of this information is currently available in one way or another via metrics, correlating the buffer usage of subtasks/tasks with the job graph is a very painful and manual process, while the UI can present it easily in a very readable form.
[jira] [Commented] (FLINK-17328) Expose network metric for job vertex in rest api
[ https://issues.apache.org/jira/browse/FLINK-17328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195262#comment-17195262 ]

lining commented on FLINK-17328:

The WebUI already has back-pressure monitoring. But users need to know the network metrics of both the current vertex and its upstream vertices to judge whether the current vertex is the source of the back-pressure. Today users have to record the relevant information themselves. This is just an improvement to the existing functionality.
[jira] [Commented] (FLINK-17328) Expose network metric for job vertex in rest api
[ https://issues.apache.org/jira/browse/FLINK-17328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17150858#comment-17150858 ]

Chesnay Schepler commented on FLINK-17328:

I'm not convinced this is necessary. Not only are these metrics fairly low-level, but there are already metric REST endpoints for aggregating metrics across subtasks. As usual, the WebUI is meant to serve basic functionality, not to replace an entire monitoring stack.
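For context on the existing aggregation endpoints mentioned in the comment above, the sketch below builds a request URL for the subtask-metrics aggregation endpoint and parses a response of the shape it is expected to return. The base URL, the job and vertex ids, the metric names, and the sample payload are all placeholders or assumptions for illustration; check them against the REST API documentation of the Flink version in use.

```python
import json
from typing import Dict, Iterable
from urllib.parse import urlencode


def aggregated_metrics_url(base_url: str, job_id: str, vertex_id: str,
                           metrics: Iterable[str], agg: str = "avg") -> str:
    """Build the URL for aggregating the given metrics across all subtasks of a vertex."""
    # Keep commas unescaped so the `get` parameter stays a readable list.
    query = urlencode({"get": ",".join(metrics), "agg": agg}, safe=",")
    return (f"{base_url.rstrip('/')}/jobs/{job_id}"
            f"/vertices/{vertex_id}/subtasks/metrics?{query}")


def parse_aggregates(body: str, agg: str = "avg") -> Dict[str, float]:
    """Extract {metric id: aggregated value} from a JSON list of metric objects."""
    return {entry["id"]: float(entry[agg]) for entry in json.loads(body)}


if __name__ == "__main__":
    # Placeholder ids and assumed metric names.
    print(aggregated_metrics_url(
        "http://localhost:8081", "<jobid>", "<vertexid>",
        ["buffers.outPoolUsage", "buffers.inPoolUsage"]))
    # Made-up payload in the assumed response shape; no request is sent.
    sample = '[{"id": "buffers.outPoolUsage", "avg": 0.75}]'
    print(parse_aggregates(sample))
```

This only covers one vertex at a time; matching the values to the job graph still has to happen on the client side, which is the point debated in this thread.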
[jira] [Commented] (FLINK-17328) Expose network metric for job vertex in rest api
[ https://issues.apache.org/jira/browse/FLINK-17328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091368#comment-17091368 ]

Gary Yao commented on FLINK-17328:

I assigned you but I cannot promise a timely review at the moment.

> Expose network metric for job vertex in rest api
>
> Key: FLINK-17328
> URL: https://issues.apache.org/jira/browse/FLINK-17328
> Project: Flink
> Issue Type: Sub-task
> Components: Runtime / Metrics, Runtime / REST
> Reporter: lining
> Assignee: lining
> Priority: Major
>
> JobVertexDetailsHandler
> * pool usage: outPoolUsageAvg, inputExclusiveBuffersUsageAvg, inputFloatingBuffersUsageAvg
> * back-pressured: shows whether the vertex is back-pressured (merging all of its subtasks)
[jira] [Commented] (FLINK-17328) Expose network metric for job vertex in rest api
[ https://issues.apache.org/jira/browse/FLINK-17328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090203#comment-17090203 ]

lining commented on FLINK-17328:

[~gary] could you assign it to me?

> Expose network metric for job vertex in rest api
>
> Key: FLINK-17328
> URL: https://issues.apache.org/jira/browse/FLINK-17328
> Project: Flink
> Issue Type: Sub-task
> Components: Runtime / Metrics, Runtime / REST
> Reporter: lining
> Priority: Major
>
> JobVertexDetailsHandler
> * pool usage: outPoolUsageAvg, inputExclusiveBuffersUsageAvg, inputFloatingBuffersUsageAvg
> * back-pressured: shows whether the vertex is back-pressured (merging all of its subtasks)