[GitHub] Tom-Goong commented on a change in pull request #7820: [FLINK-11742][Metrics]Push metrics to Pushgateway without "instance"
Tom-Goong commented on a change in pull request #7820: [FLINK-11742][Metrics]Push metrics to Pushgateway without "instance" URL: https://github.com/apache/flink/pull/7820#discussion_r261851802 ## File path: flink-metrics/flink-metrics-prometheus/src/main/java/org/apache/flink/metrics/prometheus/PrometheusPushGatewayReporter.java ## @@ -73,7 +77,7 @@ public void open(MetricConfig config) { @Override public void report() { try { - pushGateway.push(CollectorRegistry.defaultRegistry, jobName); + pushGateway.push(CollectorRegistry.defaultRegistry, jobName, instance); Review comment: Compared with **instance = ""**, **instance = jobname** just enriches the meaning of the indicator, but does not solve the data loss problem. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] Tom-Goong commented on a change in pull request #7820: [FLINK-11742][Metrics]Push metrics to Pushgateway without "instance"
Tom-Goong commented on a change in pull request #7820: [FLINK-11742][Metrics]Push metrics to Pushgateway without "instance" URL: https://github.com/apache/flink/pull/7820#discussion_r261851648 ## File path: flink-metrics/flink-metrics-prometheus/src/main/java/org/apache/flink/metrics/prometheus/PrometheusPushGatewayReporter.java ## @@ -73,7 +77,7 @@ public void open(MetricConfig config) { @Override public void report() { try { - pushGateway.push(CollectorRegistry.defaultRegistry, jobName); + pushGateway.push(CollectorRegistry.defaultRegistry, jobName, instance); Review comment: In fact, using a random string is also a compromise. However, if you only use **jobName** as the instance, putgateway will merge into one according to the **metric name and instance**, and the data will be lost. Just like the picture I wrote in JIRA (https://issues.apache.org/jira/browse/FLINK-11742 ). Now I adjust the instance to a splicing of jobName and random strings, which can reduce confusion. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] Tom-Goong commented on a change in pull request #7820: [FLINK-11742][Metrics]Push metrics to Pushgateway without "instance"
Tom-Goong commented on a change in pull request #7820: [FLINK-11742][Metrics]Push metrics to Pushgateway without "instance" URL: https://github.com/apache/flink/pull/7820#discussion_r261476961 ## File path: flink-metrics/flink-metrics-prometheus/src/main/java/org/apache/flink/metrics/prometheus/PrometheusPushGatewayReporter.java ## @@ -73,7 +77,7 @@ public void open(MetricConfig config) { @Override public void report() { try { - pushGateway.push(CollectorRegistry.defaultRegistry, jobName); + pushGateway.push(CollectorRegistry.defaultRegistry, jobName, instance); Review comment: First of all, thank you very much for the information provided, which will help me better understand Flink related knowledge. This question depends on how we view the Flink cluster. First, consider the cluster as a black box. JM is the same as "dispatch" in SpringMVC. That whole cluster is a Promethues Job. Second, the JM class is compared to a more feature-rich nginx. Then different jobs correspond to different Promethues Jobs. Third, treat each Flink Job as a microservice cluster. The same function of Task and even Sub-Task is a specific micro-service function, a series of associated micro-services complete the entire business. Because the above two methods need to solve a problem, different TM runs this different Flink Job's task, the same task runs on different TMs. This means that the TM's logo cannot accurately separate the tasks of different Flink Jobs, metrics of the same name still conflict . JM and TM are just the running containers for jobs. So in theory, we need to be precise to the Task and even the Sub-Task level, then use the Job prefix to group them. To solve this problem perfectly, we need a comprehensive understanding of the entire Metric system. Maybe we should close this PR and decide how to deal with it after a thorough discussion in the community. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] Tom-Goong commented on a change in pull request #7820: [FLINK-11742][Metrics]Push metrics to Pushgateway without "instance"
Tom-Goong commented on a change in pull request #7820: [FLINK-11742][Metrics]Push metrics to Pushgateway without "instance" URL: https://github.com/apache/flink/pull/7820#discussion_r261221674 ## File path: flink-metrics/flink-metrics-prometheus/src/main/java/org/apache/flink/metrics/prometheus/PrometheusPushGatewayReporter.java ## @@ -73,7 +77,7 @@ public void open(MetricConfig config) { @Override public void report() { try { - pushGateway.push(CollectorRegistry.defaultRegistry, jobName); + pushGateway.push(CollectorRegistry.defaultRegistry, jobName, instance); Review comment: > In Prometheus terms, an endpoint you can scrape is called an instance, usually corresponding to a single process. A collection of instances with the same purpose, a process replicated for scalability or reliability for example, is called a job. > For example, an API server job with four replicated instances: job: api-server -- instance 1: 1.2.3.4:5670 -- instance 2: 1.2.3.4:5671 -- instance 3: 5.6.7.8:5670 -- instance 4: 5.6.7.8:5671 https://prometheus.io/docs/concepts/jobs_instances/#jobs-and-instances I think a Flink job corresponds to a Prometheus job, and taskmanager and jobmanager correspond to different instances. If the jobName is used as the instance label, the same metrics of different tasksmanages will conflict, and operations such as sum will fail. I failed to use ip:port to form an instance label, because I am not familiar enough with the Flink source. So use this the easiest way This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services