[jira] [Commented] (FLINK-11127) Make metrics query service establish connection to JobManager
[ https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166216#comment-17166216 ]

Robert Metzger commented on FLINK-11127:

In my opinion yes, we can close it.

> Make metrics query service establish connection to JobManager
> -------------------------------------------------------------
>
> Key: FLINK-11127
> URL: https://issues.apache.org/jira/browse/FLINK-11127
> Project: Flink
> Issue Type: Improvement
> Components: Deployment / Kubernetes, Runtime / Coordination, Runtime / Metrics
> Affects Versions: 1.7.0, 1.9.2, 1.10.0
> Reporter: Ufuk Celebi
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> As part of FLINK-10247, the internal metrics query service has been separated
> into its own actor system. Before this change, the JobManager (JM) queried
> TaskManager (TM) metrics via the TM actor. Now, the JM needs to establish a
> separate connection to the TM metrics query service actor.
> In the context of Kubernetes, this is problematic as the JM will typically
> *not* be able to resolve the TMs by name, resulting in warnings as follows:
> {code}
> 2018-12-11 08:32:33,962 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink-metrics@flink-task-manager-64b868487c-x9l4b:39183] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink-metrics@flink-task-manager-64b868487c-x9l4b:39183]] Caused by: [flink-task-manager-64b868487c-x9l4b: Name does not resolve]
> {code}
> In order to expose the TMs by name in Kubernetes, users require a service
> *for each* TM instance, which is not practical.
> This currently results in the web UI not being able to display some basic
> metrics about the number of sent records. You can reproduce this by following
> the READMEs in {{flink-container/kubernetes}}.
> This worked before, because the JM is typically exposed via a service with a
> known name and the TMs establish the connection to it, which the metrics query
> service piggybacked on.
> A potential solution to this might be to let the query service connect to the
> JM similar to how the TMs register.
> I tagged this ticket as an improvement, but in the context of Kubernetes I
> would consider this to be a bug.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
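The failure in the quoted warning comes down to the JM holding an address whose host part is a pod name rather than an IP. A minimal Java sketch of that check follows, assuming the address always has the shape `akka.tcp://system@host:port` seen in the log; the class and method names are hypothetical and are not part of Flink.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class MetricsAddressCheck {

    /** Returns the host part of an address shaped like akka.tcp://system@host:port. */
    public static String hostOf(String akkaAddress) {
        int at = akkaAddress.indexOf('@');
        int colon = akkaAddress.lastIndexOf(':');
        if (at < 0 || colon <= at) {
            throw new IllegalArgumentException("Unexpected address: " + akkaAddress);
        }
        return akkaAddress.substring(at + 1, colon);
    }

    /** True if the host is shaped like a dotted-quad IPv4 literal, which needs no DNS lookup. */
    public static boolean isIpv4Literal(String host) {
        return host.matches("\\d{1,3}(\\.\\d{1,3}){3}");
    }

    /** Attempts a plain JVM name lookup; returns false if the name does not resolve. */
    public static boolean resolves(String host) {
        try {
            InetAddress.getByName(host);
            return true;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Address copied from the warning in the issue description.
        String address = "akka.tcp://flink-metrics@flink-task-manager-64b868487c-x9l4b:39183";
        String host = hostOf(address);
        System.out.println(host + " is an IP literal: " + isIpv4Literal(host));
        System.out.println(host + " resolves from this machine: " + resolves(host));
    }
}
```

Run from the JM container, a check like this would be expected to report that the pod name is not an IP literal and does not resolve, which matches the "Name does not resolve" cause in the warning.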
[jira] [Commented] (FLINK-11127) Make metrics query service establish connection to JobManager
[ https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166196#comment-17166196 ]

Till Rohrmann commented on FLINK-11127:

Can this ticket be closed [~rmetzger]? Why did you reopen it [~chesnay]?
[jira] [Commented] (FLINK-11127) Make metrics query service establish connection to JobManager
[ https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164484#comment-17164484 ]

Robert Metzger commented on FLINK-11127:

I don't know ECS well enough to really comment on this. But since no other user has complained about Flink on ECS, and you've also moved away from it, I would consider this problem resolved.
[jira] [Commented] (FLINK-11127) Make metrics query service establish connection to JobManager
[ https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164385#comment-17164385 ]

Rafi Aroch commented on FLINK-11127:

Data processing worked because the TMs were the initiators of the connection. It was only the metrics that did not work, when the JM was polling for them. But maybe I'm missing something.
[jira] [Commented] (FLINK-11127) Make metrics query service establish connection to JobManager
[ https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164381#comment-17164381 ]

Robert Metzger commented on FLINK-11127:

Thanks a lot for your quick comment. I believe Flink would also work on AWS ECS (in versions after 1.8), because the data exchange (of processing data) was able to establish a connection. I'm closing this ticket now.

> Make metrics query service establish connection to JobManager
> -------------------------------------------------------------
>
> Key: FLINK-11127
> URL: https://issues.apache.org/jira/browse/FLINK-11127
> Project: Flink
> Issue Type: Improvement
> Components: Deployment / Kubernetes, Runtime / Coordination, Runtime / Metrics
> Affects Versions: 1.7.0, 1.9.2, 1.10.0
> Reporter: Ufuk Celebi
> Assignee: Robert Metzger
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
[jira] [Commented] (FLINK-11127) Make metrics query service establish connection to JobManager
[ https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164373#comment-17164373 ]

Rafi Aroch commented on FLINK-11127:

[~rmetzger] At the time, I did not use K8S; I used AWS ECS.

In ECS, container IP addresses are not accessible from one container to another (unless you use a specific network mode, which was not recommended). To access the JM from the TMs I used an ELB, but the other way around was not possible.

Eventually we switched everything to K8S. I guess this is not a very common use case nowadays, so it probably makes sense to close the ticket.
[jira] [Commented] (FLINK-11127) Make metrics query service establish connection to JobManager
[ https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164351#comment-17164351 ]

Robert Metzger commented on FLINK-11127:

After offline discussions with [~trohrmann] and [~uce], I believe we can close this issue.

Since FLINK-11632 configures "taskmanager.network.bind-policy" to "ip" by default, establishing the connection from the JM to the TM metrics query service should always work on Kubernetes, because the TMs now advertise an IP address instead of a pod name that the JM cannot resolve.

A good argument for why this works is the way the network stack (specifically netty) establishes connections between the TaskManagers: they also connect to each other via the TM IP. So connections between any pods in K8s based on IP addresses should work; if not, we would have much bigger problems.

I don't fully get what [~aroch] means by "accessible from the outside":
{quote}Also, the "ip" bind-policy would not help because the resolved IP is the internal network IP which is not accessible from outside and JM fails to fetch metrics.{quote}
... if you mean "outside" as in outside the K8s cluster (say through a load balancer into the cluster), then I agree, the IP won't be accessible. But that's also not what we need here. Internal access is sufficient.

Unless [~aroch] can describe a scenario where the JM cannot connect to the TMs, I would close this ticket.
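The effect of the bind policy discussed above can be illustrated with a small Java sketch. It assumes only the documented meaning of the option ("name" advertises the hostname, "ip" advertises the host's IP address); the class, enum, pod IP 10.0.0.5 and helper methods are hypothetical and do not reflect Flink's internal implementation.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class BindPolicySketch {

    /** Mirrors the two values of taskmanager.network.bind-policy. */
    public enum BindPolicy { NAME, IP }

    /** Picks what a TM would advertise to the JM under the given policy. */
    public static String advertisedHost(InetAddress local, BindPolicy policy) {
        return policy == BindPolicy.IP ? local.getHostAddress() : local.getHostName();
    }

    /** Builds a fixed example address without any DNS lookup (IP is made up). */
    public static InetAddress examplePod() {
        try {
            return InetAddress.getByAddress(
                    "flink-task-manager-64b868487c-x9l4b", new byte[] {10, 0, 0, 5});
        } catch (UnknownHostException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        InetAddress pod = examplePod();
        // "name": the JM receives a pod name it cannot resolve without a per-TM service.
        System.out.println("name -> " + advertisedHost(pod, BindPolicy.NAME));
        // "ip": the JM receives a pod IP that is routable inside the cluster.
        System.out.println("ip   -> " + advertisedHost(pod, BindPolicy.IP));
    }
}
```

With the "ip" policy the JM never has to resolve a pod name, which is why in-cluster access is sufficient on Kubernetes.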
[jira] [Commented] (FLINK-11127) Make metrics query service establish connection to JobManager
[ https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16966250#comment-16966250 ]

Till Rohrmann commented on FLINK-11127:

You are right [~aroch] that we should try to fix this problem properly by inverting the registration direction. At the moment I think nobody is actively working on it. If you want to take a stab at it you could try. Before starting to write code, I would be interested in your solution approach, though.
[jira] [Commented] (FLINK-11127) Make metrics query service establish connection to JobManager
[ https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16965619#comment-16965619 ]

Rafi Aroch commented on FLINK-11127:

I'm seeing the same issue.

We are not using Kubernetes (yet) in the company, we use AWS ECS. So I can't use the K8S workaround. Also, the _"ip"_ bind-policy would not help because the resolved IP is the internal network IP which is not accessible from outside and JM fails to fetch metrics.

Are we considering fixing the root cause of this issue, as was suggested by [~uce] in the description: _"let the query service connect to the JM similar to how the TMs register"_?

This fix would cover all deployment methods. I would be happy to help.
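The proposal to invert the connection direction can be sketched with plain sockets. This is only an illustration of the idea under stated assumptions, not Flink's RPC or actor code: the "QUERY" line protocol, the class and method names, and the metric string are all hypothetical. The TM dials the JM (the direction that already works behind a service or an ELB), and the JM then sends its metrics query over that inbound connection instead of opening a new one toward the TM.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class InvertedMetricsRegistration {

    /** TM side (hypothetical): dial the JM, then answer QUERY lines on that same connection. */
    public static void startTaskManager(InetAddress jmAddress, int jmPort, String metrics) {
        Thread tm = new Thread(() -> {
            try (Socket socket = new Socket(jmAddress, jmPort);
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8));
                 PrintWriter out = new PrintWriter(socket.getOutputStream(), true)) {
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.equals("QUERY")) {
                        out.println(metrics);
                    }
                }
            } catch (IOException e) {
                // The JM went away; a real client would retry its registration here.
            }
        });
        tm.setDaemon(true);
        tm.start();
    }

    /** JM side (hypothetical): accept one inbound TM connection and query metrics over it. */
    public static String queryOverInboundConnection(ServerSocket jm) throws IOException {
        try (Socket tm = jm.accept()) {
            tm.setSoTimeout(5000); // do not wait forever for a reply
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(tm.getInputStream(), StandardCharsets.UTF_8));
            PrintWriter out = new PrintWriter(tm.getOutputStream(), true);
            out.println("QUERY");
            return in.readLine();
        }
    }

    /** Runs both sides on the loopback interface and returns the TM's reply. */
    public static String demo() {
        try (ServerSocket jm = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            jm.setSoTimeout(5000); // do not wait forever for a TM to register
            startTaskManager(jm.getInetAddress(), jm.getLocalPort(), "numRecordsOut=42");
            return queryOverInboundConnection(jm);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this arrangement the JM never needs to resolve or reach a TM address, so the same approach would apply to Kubernetes, ECS, or any setup where only the JM is reachable by a stable name.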
[jira] [Commented] (FLINK-11127) Make metrics query service establish connection to JobManager
[ https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964969#comment-16964969 ]

Ken commented on FLINK-11127:

[~hwanju] [~trohrmann], for context, the "fix" I mentioned is from Sergei in this thread.

https://issues.apache.org/jira/browse/FLINK-11127?focusedCommentId=16719651=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16719651

First, I don't know if the issue I am seeing was a non-issue in 1.7. Because we are running Flink in k8s, we had to implement what Sergei suggested in order for the JM to identify the TMs.

I can only confirm that the error shows up with 1.8 whenever I manually cancel the job through the UI. The job is a continuously running job using the default restart strategy.

What I did not post is the message logged just before the error: connecting to the JM failed with a different IP (starting with 170.x.x.x) than the one my JM is actually at (starting with 10.x.x.x).
[jira] [Commented] (FLINK-11127) Make metrics query service establish connection to JobManager
[ https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964210#comment-16964210 ]

Hwanju Kim commented on FLINK-11127:

[~tsubasa2oo2], in addition to what [~trohrmann] said (I am also curious why the cancel led to a metric connection issue), I wonder which workaround you used before 1.8, and whether it worked fine in the same scenario without the error you mentioned above. The DNS-related error would show up as an akka error with "Name or service not known". We have tested TM-to-RM registration but have not necessarily seen the metrics connection error that you showed.
[jira] [Commented] (FLINK-11127) Make metrics query service establish connection to JobManager
[ https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16963811#comment-16963811 ]

Till Rohrmann commented on FLINK-11127:

It sounds a bit strange that cancelling a job makes the cluster connection break down. Usually, a job cancellation should have no effect on the underlying cluster communication [~tsubasa2oo2]. Could you maybe provide us with the corresponding debug logs?
[jira] [Commented] (FLINK-11127) Make metrics query service establish connection to JobManager
[ https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962337#comment-16962337 ]
Ken commented on FLINK-11127:
[~hwanju], I upgraded from 1.7 to 1.8. Previously I used the workaround. With 1.8, I noticed that
{code:java}
Association with remote system [akka.tcp://flink-metrics@: has failed
{code}
has surfaced again. This does not happen when I first start the JM and TMs. I was able to force this error by simply cancelling the job and letting it auto-restart. I tried both the workaround and removing
{code:java}
taskmanager.host
{code}
and using
{code:java}
taskmanager.network.bind-policy: ip
{code}
Whenever I cancel the job through the UI, the warning shows up in the JM and TMs, and the statistics and metrics in the UI stop working.
[jira] [Commented] (FLINK-11127) Make metrics query service establish connection to JobManager
[ https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16950234#comment-16950234 ]
Hwanju Kim commented on FLINK-11127:
[~vicTTim], although I have not retested with 1.8 (we will shortly, though), I think that problem has been solved by FLINK-11632 if you use "ip" as the bind policy. What we have applied is a subset of FLINK-11632.
[jira] [Commented] (FLINK-11127) Make metrics query service establish connection to JobManager
[ https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16943231#comment-16943231 ]
Tim commented on FLINK-11127:
[~hwanju] - Thanks very much for describing this scenario. It is exactly what I have been experiencing. Am I to take it that the proposal has been implemented in full in FLINK-11632? I checked that ticket, and Till confirmed it is available from version 1.8.0 onwards (I'm on 1.7.x). Thanks!
[jira] [Commented] (FLINK-11127) Make metrics query service establish connection to JobManager
[ https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821683#comment-16821683 ]
Hwanju Kim commented on FLINK-11127:
We experienced a similar issue, but a more serious one, affecting the communication between the resource manager and the task manager. In a normal situation it works fine, since only the TMs actively connect to the JM, whose name is resolvable (i.e., there is no outbound association from the JM actor, only inbound). However, if a TM hits a fatal error, such as a task not responding to a cancel request, it performs a graceful cleanup, part of which is closing its akka system, which sends a poison pill to the JM, and then shuts itself down. Once the JM receives this poison pill, its actor (as part of the fail-over restart) starts an outbound association to the destination TM host name that was provided during the initial handshake. This outbound association cannot succeed if the TM is not reachable via its host name, as in a typical Kubernetes setting. From this point on, the TM can talk to the JM for TM registration, but the JM cannot respond to the registration request, since the outbound association can never be made. This failure of the outbound association from the JM's akka endpoint leaves task scheduling stuck indefinitely, because TM registration fails with this error:
{code:java}
2019-02-28 21:58:15,867 DEBUG org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.ssl.tcp://flink@flink-jobmanager:6123/user/resourcemanager, retrying in 1 ms.
{code}
In response to the constant failure above, the JM also fails slot allocation indefinitely:
{code:java}
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 30 ms
{code}
We know there are multiple workarounds suggested in this thread, such as a stateful set, an init container, and passing a JVM argument, but we did not want to add artifacts and complexity to our production deployment just to fix this issue (I tried the last one, taskmanager.host, as it is the least invasive to the deployment, but it did not work in our case). Therefore, we went ahead and added a "_taskmanager.rpc.use-host-address_" configuration option to Flink. It is false by default, but if it is set to true, then for the RPC setting only, the TM simply uses _taskManagerAddress.getHostAddress()_ instead of _taskManagerAddress.getHostName()_ (the actual patch is a few lines, as you would expect). It was minimal enough for us, and it has solved the problem so far. We decided to do it this way because it could be a helpful option for an environment like the usual Kubernetes setting without a TM stateful set or other tweaks. I am not sure whether you are interested in this approach, but I am sharing it for thought.
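The host-address switch described in Hwanju Kim's comment can be illustrated with plain JDK calls. This is only a minimal sketch, not the actual patch (which is not shown in the thread): the class name {{RpcAddressSketch}}, both method names, and the sample pod IP are hypothetical, and the boolean parameter merely stands in for the proposed _taskmanager.rpc.use-host-address_ option.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Illustrative sketch only; not Flink code. All names here are hypothetical. */
final class RpcAddressSketch {

    /**
     * Picks the string a TaskManager would advertise for RPC.
     * useHostAddress stands in for the proposed taskmanager.rpc.use-host-address option.
     */
    static String advertisedAddress(InetAddress taskManagerAddress, boolean useHostAddress) {
        return useHostAddress
                ? taskManagerAddress.getHostAddress() // textual IP, needs no name resolution
                : taskManagerAddress.getHostName();   // pod name, which the JM may not resolve
    }

    /** Builds an InetAddress from a name and IPv4 octets without any DNS lookup. */
    static InetAddress fixedAddress(String hostName, int a, int b, int c, int d) {
        try {
            return InetAddress.getByAddress(
                    hostName, new byte[] {(byte) a, (byte) b, (byte) c, (byte) d});
        } catch (UnknownHostException e) {
            // getByAddress only throws this for an address of illegal length.
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Pod name taken from the log in the issue description; the IP is made up.
        InetAddress tm = fixedAddress("flink-task-manager-64b868487c-x9l4b", 10, 1, 2, 3);
        System.out.println(advertisedAddress(tm, false)); // the pod name
        System.out.println(advertisedAddress(tm, true));  // 10.1.2.3
    }
}
```

With the flag off, the JM is handed a pod name it typically cannot resolve in Kubernetes; with the flag on, it is handed an IP address it can connect to directly.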
[jira] [Commented] (FLINK-11127) Make metrics query service establish connection to JobManager
[ https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773917#comment-16773917 ]
Alex commented on FLINK-11127:
[~spoganshev], [~ray365], there is a simpler way to set configuration options (in particular {{taskmanager.host}}): some main classes in Flink (for starting the JM and TM services) accept optional arguments of the form {{-Dconfig-key=config-value}}, which override the corresponding {{config-key}} in {{flink-conf.yaml}}.
*Side note:* these optional arguments are written in the same style as the JVM's {{-Dsome-option}}, but they can appear after the main class.
Furthermore, the official [Flink docker images|https://hub.docker.com/_/flink] let you pass additional command line arguments through as container arguments. So, in this concrete case, with the official Flink docker images and the example [Kubernetes templates|https://github.com/apache/flink/tree/master/flink-container/kubernetes], you can just modify the TM deployment template:
{noformat}
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: flink-task-manager
spec:
  replicas: ${FLINK_JOB_PARALLELISM}
  template:
    metadata:
      labels:
        app: flink
        component: task-manager
    spec:
      containers:
      - name: flink-task-manager
        image: ${FLINK_IMAGE_NAME}
        args: ["task-manager", "-Djobmanager.rpc.address=flink-job-cluster", "-Dtaskmanager.host=$(K8S_POD_IP)"] # <<< additional new command line arg
        env:
        - name: K8S_POD_IP # <<< env variable definition, from K8s' downward api
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
{noformat}
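The override behaviour Alex describes, where a {{-Dconfig-key=config-value}} argument takes precedence over the same key in {{flink-conf.yaml}}, can be sketched as follows. This is an assumption-laden illustration, not Flink's actual argument parsing (which is not shown in this thread): the class {{DynamicPropertySketch}} and the method {{applyOverrides}} are hypothetical, and the file configuration is modelled as a plain map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only; not Flink's real parser. All names here are hypothetical. */
final class DynamicPropertySketch {

    /** Returns a copy of fileConfig with every -Dkey=value argument applied on top. */
    static Map<String, String> applyOverrides(Map<String, String> fileConfig, String... args) {
        Map<String, String> effective = new LinkedHashMap<>(fileConfig);
        for (String arg : args) {
            if (!arg.startsWith("-D")) {
                continue; // skip non-option arguments such as the "task-manager" command
            }
            int eq = arg.indexOf('=');
            if (eq > 2) { // require a non-empty key between "-D" and "="
                effective.put(arg.substring(2, eq), arg.substring(eq + 1));
            }
        }
        return effective;
    }

    public static void main(String[] args) {
        // Stand-in for values read from flink-conf.yaml.
        Map<String, String> fileConfig = new LinkedHashMap<>();
        fileConfig.put("jobmanager.rpc.address", "localhost");

        // Arguments shaped like the container args in the template; the IP is made up.
        Map<String, String> effective = applyOverrides(
                fileConfig,
                "task-manager",
                "-Djobmanager.rpc.address=flink-job-cluster",
                "-Dtaskmanager.host=10.1.2.3");
        System.out.println(effective);
    }
}
```

In the Kubernetes template, {{$(K8S_POD_IP)}} is expanded by Kubernetes before the container starts, so the process only ever sees a literal IP in the {{-Dtaskmanager.host=...}} argument.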
[jira] [Commented] (FLINK-11127) Make metrics query service establish connection to JobManager
[ https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753461#comment-16753461 ]
Cristian commented on FLINK-11127:
[~ray365] so how does the JM know how to call that service? And how does it ensure that the metrics it receives are the right ones for the TM it wants to get the metrics from?
[jira] [Commented] (FLINK-11127) Make metrics query service establish connection to JobManager
[ https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736052#comment-16736052 ]
seye commented on FLINK-11127:
Our team found another solution that worked. These are the steps:
1) _metrics.internal.query-service.port_: \{set open port}
2) expose that port in your TM container
3) create a headless service for your stateful set (TM)
{code:java}
apiVersion: v1
kind: Service
metadata:
  labels:
    app: flink-taskmanager
  name: flink-taskmanager
spec:
  clusterIP: None
  ports:
  - port: {port you set in 1}
  selector:
    app: flink-taskmanager
{code}
*Note: the taskmanager.host config is not needed.*
[jira] [Commented] (FLINK-11127) Make metrics query service establish connection to JobManager
[ https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724877#comment-16724877 ]
Sergei Poganshev commented on FLINK-11127:
[~gvnagarjun] Sure, it's also a possibility. We just didn't want to taint the docker image with workaround logic.
[jira] [Commented] (FLINK-11127) Make metrics query service establish connection to JobManager
[ https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724737#comment-16724737 ]
Nagarjun Guraja commented on FLINK-11127:
[~spoganshev] I am wondering: modifying the docker entrypoint script to first configure *taskmanager.host* with the pod IP and then invoke taskmanager.sh should also do the trick, instead of using an init container, right? Do you see any issue with that approach, other than baking the workaround into the docker image as opposed to handling it externally?
[jira] [Commented] (FLINK-11127) Make metrics query service establish connection to JobManager
[ https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16720579#comment-16720579 ]
Sergei Poganshev commented on FLINK-11127:
Yes, the workaround works.
[jira] [Commented] (FLINK-11127) Make metrics query service establish connection to JobManager
[ https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16720158#comment-16720158 ]
Ufuk Celebi commented on FLINK-11127:
[~plucas] Thanks for the update. Regarding connecting by IP: I think in order to expose the TM by IP, users need to configure the {{taskmanager.host}} option as mentioned by [~spoganshev]. His workaround should work, but it is indeed a bit involved.
[jira] [Commented] (FLINK-11127) Make metrics query service establish connection to JobManager
[ https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16720154#comment-16720154 ]

Patrick Lucas commented on FLINK-11127:
---------------------------------------

StatefulSets now support the podManagementPolicy option of "Parallel", which allows all the Pods to launch or terminate at the same time, which partially solves the problem, though experimentation is needed to really know whether that is preferable to using a Deployment. (See: https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#pod-management-policies)

Would it be feasible for the JMs to connect to the TMs by IP address (as they do naturally for other types of communication)?
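The "Parallel" pod management policy mentioned in the comment above could be set as in the following minimal sketch. It builds a StatefulSet manifest as a plain Python dictionary and prints it as JSON. The function name, the {{app}} label, the container name, and all argument values (including the image tag) are hypothetical placeholders; only the {{podManagementPolicy}} field and its "Parallel" value are taken from the Kubernetes documentation linked in the comment.

```python
import json


def task_manager_stateful_set(name, service_name, replicas, image):
    """Build a StatefulSet manifest whose pods are launched and terminated in parallel."""
    # Hypothetical label, used only to tie the selector to the pod template.
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name},
        "spec": {
            # Name of the (headless) service that governs the pods.
            "serviceName": service_name,
            "replicas": replicas,
            # The default is "OrderedReady", which creates pods one after another.
            "podManagementPolicy": "Parallel",
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": "taskmanager", "image": image}]},
            },
        },
    }


if __name__ == "__main__":
    # All values below are placeholders for illustration.
    manifest = task_manager_stateful_set(
        "flink-task-manager", "flink-task-manager", 3, "flink:1.7.0"
    )
    print(json.dumps(manifest, indent=2))
```

Whether this is preferable to a Deployment is left open in the comment; the sketch only shows where the option goes.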
[jira] [Commented] (FLINK-11127) Make metrics query service establish connection to JobManager
[ https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16719974#comment-16719974 ]

Till Rohrmann commented on FLINK-11127:
---------------------------------------

Thanks for reporting this issue [~uce]. I think your solution proposal is how we should fix it.
[jira] [Commented] (FLINK-11127) Make metrics query service establish connection to JobManager
[ https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16719651#comment-16719651 ]

Sergei Poganshev commented on FLINK-11127:
------------------------------------------

Here's a (slightly tricky) workaround that works:
* configure the *metrics.internal.query-service.port* property to some fixed port (e.g. **)
* in the taskmanager deployment:
** expose that fixed port in the taskmanager deployment spec
** use an init container to put the taskmanager's pod IP into the Flink configuration file:
*** define a shared _emptyDir_ volume for *both* the Flink container and the init container; the Flink container mounts it at /etc/flink
*** init container:
**** add an environment variable to the init container definition that references *status.podIP*
**** mount the Flink configuration folder into the init container (anywhere other than /etc/flink); this is used as the basis for the changes
**** copy the files from this folder to /etc/flink
**** make the init container command append the pod IP to flink-conf.yaml as the value of the *taskmanager.host* property
*** since /etc/flink is mounted into both containers, changes made by the init container are visible in the Flink container

Configured this way, the jobmanager connects directly to the taskmanagers' IPs on the configured fixed port.
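The init container steps of the workaround above can be sketched roughly as follows. This is only an illustration of the copy-and-append logic, not the commenter's actual script: the function name, the environment variable name {{POD_IP}}, and the directory arguments are assumptions. The {{status.podIP}} field reference, the {{flink-conf.yaml}} file name, and the {{taskmanager.host}} property are the ones named in the comment.

```python
import shutil
from pathlib import Path

# Environment variable definition for the init container. It uses the Kubernetes
# downward API to expose the pod IP. The variable name POD_IP is hypothetical.
POD_IP_ENV = {
    "name": "POD_IP",
    "valueFrom": {"fieldRef": {"fieldPath": "status.podIP"}},
}


def prepare_flink_conf(source_dir, target_dir, pod_ip):
    """Copy the base configuration into the shared volume and set taskmanager.host.

    source_dir: where the original Flink configuration folder is mounted
                (anywhere other than /etc/flink).
    target_dir: the shared emptyDir volume (mounted at /etc/flink in the workaround).
    pod_ip:     the pod IP, e.g. read from the POD_IP environment variable.
    """
    source = Path(source_dir)
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    # Copy every regular file of the base configuration into the shared volume.
    for entry in source.iterdir():
        if entry.is_file():
            shutil.copy(entry, target / entry.name)

    # Append the pod IP as the value of taskmanager.host.
    conf = target / "flink-conf.yaml"
    text = conf.read_text(encoding="utf-8") if conf.exists() else ""
    if text and not text.endswith("\n"):
        text += "\n"
    text += f"taskmanager.host: {pod_ip}\n"
    conf.write_text(text, encoding="utf-8")
    return conf
```

An init container could call {{prepare_flink_conf}} with its own mount point for the base configuration, {{/etc/flink}} as the target, and the value of the {{POD_IP}} environment variable. The fixed *metrics.internal.query-service.port* value would already be part of the base {{flink-conf.yaml}}, so the sketch does not touch it.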
[jira] [Commented] (FLINK-11127) Make metrics query service establish connection to JobManager
[ https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16719153#comment-16719153 ]

Sergei Poganshev commented on FLINK-11127:
------------------------------------------

[~uce] Ah. Thanks for the heads up. We'll probably use a StatefulSet just for development purposes for now (on a small number of nodes), until the problem is fixed.
[jira] [Commented] (FLINK-11127) Make metrics query service establish connection to JobManager
[ https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16719004#comment-16719004 ]

Ufuk Celebi commented on FLINK-11127:
-------------------------------------

[~spoganshev] Yes, the port configuration is correct. I'm aware of the {{StatefulSet}} workaround, but in my opinion it is not feasible for larger replica counts, because it results in pods being created sequentially:

{quote}
For a StatefulSet with N replicas, when Pods are being deployed, they are created sequentially, in order from \{0..N-1}. [...] When the nginx example above is created, three Pods will be deployed in the order web-0, web-1, web-2. web-1 will not be deployed before web-0 is [Running and Ready|https://kubernetes.io/docs/user-guide/pod-states/], and web-2 will not be deployed until web-1 is Running and Ready.
{quote}

(https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#deployment-and-scaling-guarantees)
[jira] [Commented] (FLINK-11127) Make metrics query service establish connection to JobManager
[ https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718980#comment-16718980 ]

Sergei Poganshev commented on FLINK-11127:
------------------------------------------

Looks like `metrics.internal.query-service.port` can be used for this workaround.
[jira] [Commented] (FLINK-11127) Make metrics query service establish connection to JobManager
[ https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718967#comment-16718967 ]

Sergei Poganshev commented on FLINK-11127:
------------------------------------------

Is there a way to configure the port on which the taskmanager listens for connections? If so, then at least a headless service and a StatefulSet can be used to work around the issue.
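To illustrate why a headless service combined with a StatefulSet helps with name resolution: Kubernetes gives each such pod a stable DNS name built from the StatefulSet name, the pod ordinal, the service name, and the namespace. The sketch below only builds these names and the corresponding metrics query service addresses in the form seen in the log output of the issue description. The function names and all example values are hypothetical, {{cluster.local}} is the usual default cluster domain and may differ per cluster, and whether a TaskManager actually advertises this name depends on its configuration.

```python
def stateful_set_pod_hosts(stateful_set, service, namespace, replicas,
                           cluster_domain="cluster.local"):
    """Return the stable DNS names of the pods of a StatefulSet behind a headless service."""
    return [
        f"{stateful_set}-{ordinal}.{service}.{namespace}.svc.{cluster_domain}"
        for ordinal in range(replicas)
    ]


def metrics_query_service_address(host, port):
    """Format an address like the one shown in the warning of the issue description."""
    return f"akka.tcp://flink-metrics@{host}:{port}"


if __name__ == "__main__":
    # Placeholder names and port, for illustration only.
    for host in stateful_set_pod_hosts("flink-task-manager", "flink-task-manager", "default", 2):
        print(metrics_query_service_address(host, 50101))
```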
[jira] [Commented] (FLINK-11127) Make metrics query service establish connection to JobManager
[ https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716636#comment-16716636 ]

Ufuk Celebi commented on FLINK-11127:
-------------------------------------

[~trohrm...@apache.org] [~Zentol] Are you aware of a workaround for this? Is it possible to use the IP address instead of the hostname?