RE: Socket timeout when report metrics to pushgateway

2023-12-17 Thread Jiabao Sun
Hi,

The pushgateway receives metrics in push mode. When a single pushgateway 
instance is under high load, it can run into performance problems. 
A simple solution is to set up multiple pushgateways and push the metrics to 
different pushgateways according to task group.
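
As a rough sketch of that idea (this is not something Flink does for you), 
a deployment script could pick a pushgateway for each task group 
deterministically and pass the result as reporter options when submitting the 
job. The gateway addresses, the function names and the reporter name 
"promgateway" below are assumptions for illustration only; adjust them to 
match your own setup.

```python
import hashlib

# Hypothetical pushgateway endpoints; replace with your real hosts.
PUSHGATEWAYS = [
    "pushgateway-0:9091",
    "pushgateway-1:9091",
    "pushgateway-2:9091",
]


def pick_pushgateway(task_group, gateways=PUSHGATEWAYS):
    """Map a task group name to one gateway, stable across runs and machines."""
    # sha256 is used instead of hash() because hash() of a str is
    # randomized per Python process.
    digest = hashlib.sha256(task_group.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(gateways)
    return gateways[index]


def reporter_options(task_group):
    """Build the host/port options for a reporter assumed to be named 'promgateway'."""
    host, port = pick_pushgateway(task_group).rsplit(":", 1)
    return {
        "metrics.reporter.promgateway.host": host,
        "metrics.reporter.promgateway.port": port,
    }


if __name__ == "__main__":
    # "orders" and "payments" are made-up task group names.
    for group in ("orders", "payments"):
        print(group, reporter_options(group))
```

The same task group always lands on the same pushgateway, so Prometheus only 
needs to scrape every pushgateway in the list.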

There are other metrics reporters available based on the push model, such as 
InfluxDB[1]. In a clustered mode, InfluxDB may offer better performance than 
pushgateway. 
You can try using InfluxDB as an alternative and evaluate its performance.

I suspect the reason for using pushgateway is that, when running Flink on 
YARN in application or per-job mode, the task ports are randomized, 
which makes it difficult for Prometheus to determine which tasks to scrape. 

By the way, if you deploy jobs with the Flink Kubernetes Operator, you can 
use the Prometheus metrics reporter directly, without the need for 
pushgateway[2].

Best,
Jiabao

[1] 
https://nightlies.apache.org/flink/flink-docs-master/zh/docs/deployment/metric_reporters/#influxdb
[2] 
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/metrics-logging/#how-to-enable-prometheus-example


On 2023/12/12 08:23:22 李琳 wrote:
> Hello,
>   We configured Flink to report metrics to the Prometheus pushgateway. After 
> the program has been running for a while and a large amount of data has 
> been reported, pushes to the pushgateway fail with a socket timeout 
> exception, and much of the metric data fails to be reported. The exception 
> is as follows:
> 
> 
> 2023-12-12 04:13:07,812 WARN org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter [] - Failed to push metrics to PushGateway with jobName 00034937_20231211200917_54ede15602bb8704c3a98ec481bea96, groupingKey{}.
> java.net.SocketTimeoutException: Read timed out
>     at java.net.SocketInputStream.socketRead(Native Method) ~[?:1.8.0_281]
>     at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[?:1.8.0_281]
>     at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[?:1.8.0_281]
>     at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_281]
>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) ~[?:1.8.0_281]
>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) ~[?:1.8.0_281]
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:345) ~[?:1.8.0_281]
>     at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735) ~[?:1.8.0_281]
>     at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678) ~[?:1.8.0_281]
>     at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1593) ~[?:1.8.0_281]
>     at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498) ~[?:1.8.0_281]
>     at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480) ~[?:1.8.0_281]
>     at io.prometheus.client.exporter.PushGateway.doRequest(PushGateway.java:315) ~[flink-metrics-prometheus-1.13.5.jar:1.13.5]
>     at io.prometheus.client.exporter.PushGateway.push(PushGateway.java:138) ~[flink-metrics-prometheus-1.13.5.jar:1.13.5]
>     at org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter.report(PrometheusPushGatewayReporter.java:63) [flink-metrics-prometheus-1.13.5.jar:1.13.5]
>     at org.apache.flink.runtime.metrics.MetricRegistryImpl$ReporterTask.run(MetricRegistryImpl.java:494) [flink-dist_2.11-1.13.5.jar:1.13.5]
> 
> After testing, we found that it is caused by the amount of data reported to 
> the pushgateway. Restarting the pushgateway server made the exception 
> disappear, but it reappeared after several hours.
> 
> So I would like to know how to configure Flink or the pushgateway to avoid 
> this exception.
> 
> best regards.
> leilinee 

Socket timeout when report metrics to pushgateway

2023-12-12 Thread 李琳
Hello,
  We configured Flink to report metrics to the Prometheus pushgateway. After 
the program has been running for a while and a large amount of data has been 
reported, pushes to the pushgateway fail with a socket timeout exception, and 
much of the metric data fails to be reported. The exception is as follows:


2023-12-12 04:13:07,812 WARN org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter [] - Failed to push metrics to PushGateway with jobName 00034937_20231211200917_54ede15602bb8704c3a98ec481bea96, groupingKey{}.
java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead(Native Method) ~[?:1.8.0_281]
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[?:1.8.0_281]
    at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[?:1.8.0_281]
    at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_281]
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) ~[?:1.8.0_281]
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) ~[?:1.8.0_281]
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345) ~[?:1.8.0_281]
    at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735) ~[?:1.8.0_281]
    at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678) ~[?:1.8.0_281]
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1593) ~[?:1.8.0_281]
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498) ~[?:1.8.0_281]
    at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480) ~[?:1.8.0_281]
    at io.prometheus.client.exporter.PushGateway.doRequest(PushGateway.java:315) ~[flink-metrics-prometheus-1.13.5.jar:1.13.5]
    at io.prometheus.client.exporter.PushGateway.push(PushGateway.java:138) ~[flink-metrics-prometheus-1.13.5.jar:1.13.5]
    at org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter.report(PrometheusPushGatewayReporter.java:63) [flink-metrics-prometheus-1.13.5.jar:1.13.5]
    at org.apache.flink.runtime.metrics.MetricRegistryImpl$ReporterTask.run(MetricRegistryImpl.java:494) [flink-dist_2.11-1.13.5.jar:1.13.5]

After testing, we found that it is caused by the amount of data reported to 
the pushgateway. Restarting the pushgateway server made the exception 
disappear, but it reappeared after several hours.

So I would like to know how to configure Flink or the pushgateway to avoid 
this exception.

best regards.
leilinee