[jira] [Commented] (FLINK-12576) inputQueueLength metric does not work for LocalInputChannels

2019-10-14 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-12576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16950879#comment-16950879
 ] 

Piotr Nowojski commented on FLINK-12576:


I think it depends. If the original issue was not fixed, then re-opening and 
removing the incorrect fix version is ok. In this case, the conclusion was that 
everything works as it should, so keeping the previous fix version makes more 
sense, as the code behaves/will behave in the same way in 1.9.2 as in 1.9.0.

> inputQueueLength metric does not work for LocalInputChannels
>
> Key: FLINK-12576
> URL: https://issues.apache.org/jira/browse/FLINK-12576
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Metrics, Runtime / Network
> Affects Versions: 1.6.4, 1.7.2, 1.8.0, 1.9.0
> Reporter: Piotr Nowojski
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.9.0
>
> Attachments: Screen Shot 2019-09-24 at 3.11.15 PM.png, Screen Shot 
> 2019-09-24 at 3.13.05 PM.png, Screen Shot 2019-09-24 at 3.22.36 PM.png, 
> Screen Shot 2019-09-24 at 3.22.53 PM.png, 
> flink-1.8-2-single-slot-TMs-input.png, 
> flink-1.8-2-single-slot-TMs-output.png, flink-1.8-input-subtasks.png, 
> flink-1.8-output-subtasks.png, image-2019-09-26-11-34-24-878.png, 
> image-2019-09-26-11-36-06-027.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently {{inputQueueLength}} ignores LocalInputChannels 
> ({{SingleInputGate#getNumberOfQueuedBuffers}}). This can cause mistakes 
> when looking for the causes of back pressure (for example, if a task is 
> back pressuring the whole Flink job, but there is data skew and only local 
> input channels are being used).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-12576) inputQueueLength metric does not work for LocalInputChannels

2019-10-13 Thread Jark Wu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-12576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16950674#comment-16950674
 ] 

Jark Wu commented on FLINK-12576:

Hi [~pnowojski], I was moving "OPEN" 1.9.0 issues to 1.9.2, because 1.9.0 has 
already been released. 
I didn't notice that this is a "REOPEN" issue, so I have reset it back to 1.9.0. 

Btw, if the conclusion of a "REOPEN" issue is that there is still something to 
fix, what should the fixVersion be?
Should we open another issue to track this? 



[jira] [Commented] (FLINK-12576) inputQueueLength metric does not work for LocalInputChannels

2019-10-13 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-12576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16950367#comment-16950367
 ] 

Piotr Nowojski commented on FLINK-12576:


[~jark], why did you move the fix version from 1.9.0 to 1.9.2? As far as I 
remember, this was fixed in 1.9.0



[jira] [Commented] (FLINK-12576) inputQueueLength metric does not work for LocalInputChannels

2019-09-26 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-12576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938675#comment-16938675
 ] 

Piotr Nowojski commented on FLINK-12576:


I think we should document that {{inPoolUsage}} ignores local channels and 
that in such cases, to rule out mistakes, it's best to check the 
{{inputQueueLength}} value as well, especially since I haven't found this 
mentioned anywhere:
https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html
https://flink.apache.org/2019/07/23/flink-network-stack-2.html



[jira] [Commented] (FLINK-12576) inputQueueLength metric does not work for LocalInputChannels

2019-09-26 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-12576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938424#comment-16938424
 ] 

zhijiang commented on FLINK-12576:

Thanks for reproducing the case [~kevin.cyj]

My earlier concern was also about which specific input metric was meant, which 
is why I asked the first question before:

> 1. The input metric here is for {{inputQueueLength}}?

If it was the {{inputQueueLength}} case, there must be a potential bug, since 
this Jira ticket was originally motivated by this metric for LocalInputChannel.

If it was the {{inPoolUsage}} case, it is reasonable for the value to always be 
0 for a local channel: the channel fetches buffers directly from the upstream 
partition queue, so its buffer pool is never used and the usage is always 0. 
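
To illustrate the explanation above, here is a minimal Java sketch of why a 
pool-based usage metric stays at 0 when all data arrives through local 
channels. It is not Flink's actual code: {{PoolUsageSketch}}, {{BufferPool}} 
and {{usageAfter}} are hypothetical names, and the assumption is simply that 
only remote channels request buffers from the receiver-side pool.

```java
final class PoolUsageSketch {

    /** Hypothetical receiver-side buffer pool; only remote channels draw from it. */
    static final class BufferPool {
        final int size;
        int used;

        BufferPool(int size) {
            this.size = size;
        }

        /** Takes one buffer from the pool if any is left. */
        void request() {
            if (used < size) {
                used++;
            }
        }

        /** Fraction of the pool that is in use, in the spirit of an inPoolUsage gauge. */
        float usage() {
            return size == 0 ? 0f : (float) used / size;
        }
    }

    /**
     * Simulates receiving buffers over remote and local channels.
     * Assumption: local buffers stay in the producer's queue, so they never
     * touch the receiver-side pool and do not show up in its usage.
     */
    static float usageAfter(int remoteBuffers, int localBuffers, int poolSize) {
        BufferPool pool = new BufferPool(poolSize);
        for (int i = 0; i < remoteBuffers; i++) {
            pool.request();
        }
        // localBuffers is intentionally unused: nothing is requested for them.
        return pool.usage();
    }
}
```

With only local traffic, for example {{usageAfter(0, 5, 4)}}, the usage is 0 
regardless of how many buffers are waiting upstream.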



[jira] [Commented] (FLINK-12576) inputQueueLength metric does not work for LocalInputChannels

2019-09-26 Thread Yingjie Cao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-12576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938406#comment-16938406
 ] 

Yingjie Cao commented on FLINK-12576:

[~alpinegizmo] That is because only one of the two upstream tasks emits 
records to the downstream task. One of the downstream tasks always gets its 
input from a local input channel, so no buffer from the input buffer pool is 
used. inPoolUsage always being zero for one of the channels is therefore what's 
expected. 

For the one two-slot TM case, both input channels of the Backpressure vertex 
are local input channels, so inPoolUsage should also be zero.



[jira] [Commented] (FLINK-12576) inputQueueLength metric does not work for LocalInputChannels

2019-09-26 Thread David Anderson (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-12576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938405#comment-16938405
 ] 

David Anderson commented on FLINK-12576:


Ok, I see what's going on now, at least to some extent. I see now that the 
input queue length metric is behaving as documented.

I wasn't focused on the input queue length metric when I re-opened this ticket 
– I was only looking at the inPoolUsage and exclusive and floating buffer 
metrics. Is it the case that these metrics are also intended to ignore local 
input channels? If so, then I guess the only bug is in the documentation, which 
fails to explain this.



[jira] [Commented] (FLINK-12576) inputQueueLength metric does not work for LocalInputChannels

2019-09-26 Thread David Anderson (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-12576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938397#comment-16938397
 ] 

David Anderson commented on FLINK-12576:


> Input queue length of both the local and remote channel are not always zero. 
> Did I do something wrong?

No, you did nothing wrong: that's what I see as well. See the screenshots I 
posted showing all of the input metrics in various cases. With 2 single-slot 
TMs, both channels have input queue length that's not always zero. However, 
inPoolUsage is always zero for one of the channels, which I believe is wrong. 

And in the case of one two-slot TM, then both channels are local, and both show 
input queue length (and all other input metrics) that is always zero, which is 
definitely confusing, if not wrong.

If the current behavior is somehow considered "correct" then the documentation 
needs to be updated to explain which of these metrics don't work in the local 
case -- or better, the metrics should be renamed to make it clear what they are 
actually measuring.




[jira] [Commented] (FLINK-12576) inputQueueLength metric does not work for LocalInputChannels

2019-09-25 Thread Yingjie Cao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-12576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938232#comment-16938232
 ] 

Yingjie Cao commented on FLINK-12576:

[~alpinegizmo] I tried the instructions given above, but could not reproduce 
the problem.

The instructions I used are:

!image-2019-09-26-11-36-06-027.png!

Here is my result:

!image-2019-09-26-11-34-24-878.png!

The input queue lengths of both the local and the remote channel are not always 
zero. Did I do something wrong?



[jira] [Commented] (FLINK-12576) inputQueueLength metric does not work for LocalInputChannels

2019-09-24 Thread David Anderson (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-12576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936889#comment-16936889
 ] 

David Anderson commented on FLINK-12576:


Here's what's going on overall in that Flink 1.8 test:

 !flink-1.8-output-subtasks.png! 
 !flink-1.8-input-subtasks.png! 



[jira] [Commented] (FLINK-12576) inputQueueLength metric does not work for LocalInputChannels

2019-09-24 Thread David Anderson (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-12576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936885#comment-16936885
 ] 

David Anderson commented on FLINK-12576:


In terms of what's happening overall in that Flink 1.8 test:

 !flink-1.8-output-subtasks.png! 
 !flink-1.8-input-subtasks.png! 




[jira] [Commented] (FLINK-12576) inputQueueLength metric does not work for LocalInputChannels

2019-09-24 Thread David Anderson (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-12576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936877#comment-16936877
 ] 

David Anderson commented on FLINK-12576:


Here are the results for Flink 1.8 with two single-slot TMs.

 !flink-1.8-2-single-slot-TMs-output.png! 
 !flink-1.8-2-single-slot-TMs-input.png! 



[jira] [Commented] (FLINK-12576) inputQueueLength metric does not work for LocalInputChannels

2019-09-24 Thread David Anderson (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-12576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936784#comment-16936784
 ] 

David Anderson commented on FLINK-12576:


I just did some more careful testing, this time with

taskmanager.network.memory.buffers-per-channel: 1
taskmanager.network.memory.floating-buffers-per-gate: 1

which I think is as low as the buffering can go. 

Here are the various input and output metrics, running on Flink 1.9 with 2 
single-slot TMs:

 !Screen Shot 2019-09-24 at 3.22.53 PM.png! 
 !Screen Shot 2019-09-24 at 3.22.36 PM.png! 

Running on Flink 1.9 with a single two-slot TM looks like this:

 !Screen Shot 2019-09-24 at 3.13.05 PM.png! 
 !Screen Shot 2019-09-24 at 3.11.15 PM.png! 

I'll see if I can repeat the case you asked about on Flink 1.8.



[jira] [Commented] (FLINK-12576) inputQueueLength metric does not work for LocalInputChannels

2019-09-23 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-12576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936348#comment-16936348
 ] 

zhijiang commented on FLINK-12576:

Thanks for reporting this [~alpinegizmo]

I want to confirm two things:

1. The input metric here is for {{inputQueueLength}}?

2. Have you checked whether this problem exists before release-1.9, especially 
for the non-local case with 2 single-slot TMs?

This ticket actually made two main changes. One is to take local input channels 
into account in the input metric ({{inputQueueLength}}). The other is that, for 
remote input channels, the metric value is now read without synchronization. 
So I wonder whether this could cause a visibility issue for the metric reporter 
thread. But it seems that in your testing this issue only happens for the 
parallelism of the backpressure operator.
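
As a generic illustration of the visibility concern raised above (not a 
description of what Flink does), here is a minimal Java sketch of one way to 
expose a queue length to a reporter thread that does not take the queue's 
lock. {{QueueSizeGaugeSketch}} and all of its members are hypothetical names. 
The assumption is that the queue itself is only touched under a lock, while 
the size is mirrored into an {{AtomicInteger}} so that lock-free reads still 
see up-to-date values.

```java
import java.util.ArrayDeque;
import java.util.concurrent.atomic.AtomicInteger;

final class QueueSizeGaugeSketch {

    private final Object lock = new Object();
    private final ArrayDeque<byte[]> receivedBuffers = new ArrayDeque<>();

    // Mirrors receivedBuffers.size(); can be read safely without holding the lock.
    private final AtomicInteger queuedCount = new AtomicInteger();

    /** Called by the thread that receives data. */
    void onBuffer(byte[] buffer) {
        synchronized (lock) {
            receivedBuffers.add(buffer);
            queuedCount.incrementAndGet();
        }
    }

    /** Called by the consuming task thread; returns null when the queue is empty. */
    byte[] poll() {
        synchronized (lock) {
            byte[] next = receivedBuffers.poll();
            if (next != null) {
                queuedCount.decrementAndGet();
            }
            return next;
        }
    }

    /** Called by the metric reporter thread; takes no lock. */
    int unsynchronizedQueueLength() {
        return queuedCount.get();
    }
}
```

Reading {{receivedBuffers.size()}} directly from the reporter thread would 
avoid the extra counter, but a plain {{ArrayDeque}} gives no visibility 
guarantee for such a read.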

 



[jira] [Commented] (FLINK-12576) inputQueueLength metric does not work for LocalInputChannels

2019-05-29 Thread Piotr Nowojski (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16850844#comment-16850844
 ] 

Piotr Nowojski commented on FLINK-12576:


Yes, I think you are right. For example, for {{PipelinedSubpartition}} this 
should return {{PipelinedSubpartition.buffers.size()}}.
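
The counting discussed here can be sketched in Java as follows. This is only 
an illustration under stated assumptions, not the actual patch: 
{{Channel}}, {{RemoteChannel}}, {{LocalChannel}} and {{Subpartition}} are 
hypothetical stand-ins for the Flink classes, and the assumption is that a 
local channel reports the size of the producer's buffer queue while a remote 
channel reports the size of its own receive queue.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

final class QueueLengthSketch {

    /** Hypothetical stand-in for an input channel; not Flink's actual interface. */
    interface Channel {
        int queuedBuffers();
    }

    /** Remote channel: buffers received over the network wait in its own queue. */
    static final class RemoteChannel implements Channel {
        final Queue<byte[]> received = new ArrayDeque<>();

        @Override
        public int queuedBuffers() {
            return received.size();
        }
    }

    /** Producer-side queue, standing in for a pipelined subpartition. */
    static final class Subpartition {
        final Queue<byte[]> buffers = new ArrayDeque<>();
    }

    /** Local channel: holds no buffers itself and reads from the producer's queue. */
    static final class LocalChannel implements Channel {
        final Subpartition upstream;

        LocalChannel(Subpartition upstream) {
            this.upstream = upstream;
        }

        @Override
        public int queuedBuffers() {
            // Same number the producer would report as its output queue length.
            return upstream.buffers.size();
        }
    }

    /** Gate-level input queue length: the sum over all channels, local ones included. */
    static int inputQueueLength(List<Channel> channels) {
        int total = 0;
        for (Channel channel : channels) {
            total += channel.queuedBuffers();
        }
        return total;
    }
}
```

If the local channel were skipped in the loop, a gate fed only by local 
channels would always report 0, which is the situation described in this 
ticket.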



[jira] [Commented] (FLINK-12576) inputQueueLength metric does not work for LocalInputChannels

2019-05-28 Thread aitozi (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849862#comment-16849862
 ] 

aitozi commented on FLINK-12576:


Hi [~pnowojski], I checked the code. For LocalInputChannel, do we just have to 
count the buffers queued in the ResultSubpartition? I think the outQueueLength 
of a LocalInputChannel should be equal to its inputQueueLength, right?
