Re: Prometheus Pushgateway Reporter Can not DELETE metrics on pushgateway

2020-05-13 Thread Yun Tang
Hi

From our experience, instead of offering more resources to the Prometheus Pushgateway and its servers, we could leverage Flink's feature (available since Flink 1.10) to avoid sending unnecessary data, especially high-cardinality tags such as task_attempt_id. In general, we could exclude "operator_id;task_id;task_attempt_id", which are rarely used, via metrics.reporter.<reporter name>.scope.variables.excludes [1].
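
For example, with the reporter named promgateway as in your configuration, a minimal sketch of the exclusion in flink-conf.yaml would look roughly like this (based on the option name above; not tested against your setup):

metrics.reporter.promgateway.scope.variables.excludes: operator_id;task_id;task_attempt_id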

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.10/monitoring/metrics.html#reporter

Best
Yun Tang


Re: Prometheus Pushgateway Reporter Can not DELETE metrics on pushgateway

2020-05-12 Thread Thomas Huang
I met this issue three months ago. We eventually concluded that the Prometheus Pushgateway cannot handle high-throughput metric data, but we solved the issue via service discovery: we changed the Prometheus metric reporter code to add registration logic, so each job exposes its host and port on a discovery service, and then we wrote a plugin for Prometheus that gets the service list and pulls the metrics from the Flink jobs.
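
As a rough illustration of that pull-based setup (our plugin is not shown; the file path and labels below are hypothetical, and 9249 is the default port of Flink's pull-based PrometheusReporter), the registration logic could simply keep a target file up to date and let Prometheus read it through its standard file-based service discovery:

scrape_configs:
  - job_name: 'flink-jobs'
    file_sd_configs:
      - files:
          - '/etc/prometheus/flink-targets.json'   # kept up to date by the registration service
        refresh_interval: 30s

where the target file might contain entries such as:

[
  { "targets": ["flink-host-1:9249"], "labels": { "flink_job": "myJob" } }
]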




Prometheus Pushgateway Reporter Can not DELETE metrics on pushgateway

2020-05-12 Thread 李佳宸
Hi,

I got stuck using Prometheus and the Pushgateway to collect metrics. Here is my reporter configuration:

metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
metrics.reporter.promgateway.host: localhost
metrics.reporter.promgateway.port: 9091
metrics.reporter.promgateway.jobName: myJob
metrics.reporter.promgateway.randomJobNameSuffix: true
metrics.reporter.promgateway.deleteOnShutdown: true

And the version information:
Flink 1.9.1
Prometheus 2.18
PushGateway 1.2 & 0.9 (I have already tried both)

I found that when the Flink cluster restarts, new metrics show up whose jobName carries a new random suffix, but the metrics with the jobName from before the restart still exist (their values stop updating). Since Prometheus keeps periodically pulling the data from the Pushgateway, I end up with a bunch of time series whose values never change.

It looks like:

# HELP flink_jobmanager_Status_JVM_CPU_Load Load (scope: jobmanager_Status_JVM_CPU)
# TYPE flink_jobmanager_Status_JVM_CPU_Load gauge
flink_jobmanager_Status_JVM_CPU_Load{host="localhost",instance="",job="myJobae71620b106e8c2fdf86cb5c65fd6414"} 0
flink_jobmanager_Status_JVM_CPU_Load{host="localhost",instance="",job="myJobe50caa3be194aeb2ff71a64bced17cea"} 0.0006602344673593189
# HELP flink_jobmanager_Status_JVM_CPU_Time Time (scope: jobmanager_Status_JVM_CPU)
# TYPE flink_jobmanager_Status_JVM_CPU_Time gauge
flink_jobmanager_Status_JVM_CPU_Time{host="localhost",instance="",job="myJobae71620b106e8c2fdf86cb5c65fd6414"} 4.54512e+09
flink_jobmanager_Status_JVM_CPU_Time{host="localhost",instance="",job="myJobe50caa3be194aeb2ff71a64bced17cea"} 8.24809e+09
# HELP flink_jobmanager_Status_JVM_ClassLoader_ClassesLoaded ClassesLoaded (scope: jobmanager_Status_JVM_ClassLoader)
# TYPE flink_jobmanager_Status_JVM_ClassLoader_ClassesLoaded gauge
flink_jobmanager_Status_JVM_ClassLoader_ClassesLoaded{host="localhost",instance="",job="myJobae71620b106e8c2fdf86cb5c65fd6414"} 5984
flink_jobmanager_Status_JVM_ClassLoader_ClassesLoaded{host="localhost",instance="",job="myJobe50caa3be194aeb2ff71a64bced17cea"} 6014
# HELP flink_jobmanager_Status_JVM_ClassLoader_ClassesUnloaded ClassesUnloaded (scope: jobmanager_Status_JVM_ClassLoader)
# TYPE flink_jobmanager_Status_JVM_ClassLoader_ClassesUnloaded gauge
flink_jobmanager_Status_JVM_ClassLoader_ClassesUnloaded{host="localhost",instance="",job="myJobae71620b106e8c2fdf86cb5c65fd6414"} 0
flink_jobmanager_Status_JVM_ClassLoader_ClassesUnloaded{host="localhost",instance="",job="myJobe50caa3be194aeb2ff71a64bced17cea"} 0

PS: This cluster has one JobManager.

In my understanding, when I set metrics.reporter.promgateway.deleteOnShutdown to true, the old metrics should be deleted from the Pushgateway, but somehow it didn't work.
Is my understanding of this configuration correct? Is there any solution for deleting the stale metrics from the Pushgateway?
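
As a side note, a stale group can also be removed from the Pushgateway manually via its HTTP API. Assuming the Pushgateway at localhost:9091 from the configuration above, the group for the old job name shown in the output could be deleted with:

curl -X DELETE http://localhost:9091/metrics/job/myJobae71620b106e8c2fdf86cb5c65fd6414

This removes only the group whose grouping key is exactly that job name, so with many stale groups it would have to be scripted.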

Thanks!