[prometheus-users] Re: better way to get notified about (true) single scrape failures?

2023-05-12 Thread Christoph Anton Mitterer
Hey Brian

On Wednesday, May 10, 2023 at 9:03:36 AM UTC+2 Brian Candler wrote:

It depends on the exact semantics of "for". e.g. take a simple case of 1 
minute rule evaluation interval. If you apply "for: 1m" then I guess that 
means the alert must be firing for two successive evaluations (otherwise, 
"for: 1m" would have no effect).


Seems you're right.

In the meantime I did quite some testing with the following Alertmanager route 
(note that I didn't use 5m but 1m, simply in order not to have to wait so 
long):
  routes:
  - match_re:
      alertname: 'td.*'
    receiver:       admins_monitoring
    group_by:       [alertname]
    group_wait:     0s
    group_interval: 1s

and the following rules:
groups:
  - name: alerts_general_single-scrapes
    interval: 15s
    rules:
    - alert: td-fast
      expr: 'min_over_time(up[75s]) == 0 unless max_over_time(up[75s]) == 0'
      for:  1m
    - alert: td
      expr: 'up == 0'
      for:  1m

My understanding is (correct me if I'm wrong) that Prometheus basically runs 
one loop for the scrape job (which in my case has an interval of 15s) and 
another one that evaluates the alert rules (above, every 15s) and then sends 
the alert to the Alertmanager (if it's firing).
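For reference, a minimal sketch of what that pairing looks like in prometheus.yml (the target address and rule-file name below are placeholders; the job name is taken from the labels further down). The scrape loop and the rule-evaluation loop are configured independently of each other:

  global:
    scrape_interval: 15s        # how often each target is scraped
    evaluation_interval: 15s    # default for rule groups; a group's own interval: overrides it

  rule_files:
    - 'alerts_general.yml'      # placeholder name for the file holding the td / td-fast rules above

  scrape_configs:
    - job_name: 'node'
      static_configs:
        - targets: ['node-exporter.example:9100']   # placeholder target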

It felt a bit brittle to have the rules evaluated with the same period as the 
scrapes, so I ran all tests once with 15s for the rule interval and once with 
10s. But it seems this doesn't change the behaviour.


But up[5m] only looks at samples wholly contained within a 5 minute window, 
and therefore will normally only look at 5 samples.


As you can see above, I had already noticed before that you were indeed right: 
if my for: is e.g. 4 * evaluation_interval (15s) = 1m, I need to look back 
5 * evaluation_interval (15s) = 75s. (The extra interval is needed so that a 
single failed scrape is still inside the lookback window at the last evaluation 
of the for: period, which can be up to 75s after the scrape itself.)

At least in my tests, that seemed to produce the desired behaviour, except 
for one case:
When the "slow" td fires (i.e. after 5 consecutive "0"s) and then, within 
(less than?) 1m, there is another sequence of "0"s that eventually causes a 
"slow" td again. In that case, td-fast fires for a while, until it switches 
over directly to td firing.

Was your idea above with something like:
>expr: min_over_time(up[8m]) == 0 unless max_over_time(up[6m]) == 0
>for: 7m
intended to fix that issue?

Or could one perhaps somehow use 
ALERTS{alertname="td",instance="lcg-lrz-ext.grid.lrz.de",job="node"}[??s] 
== 1 to check whether it did fire, and then silence the false positive?
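For comparison, the way this kind of suppression is usually expressed on the Alertmanager side is an inhibition rule; a minimal sketch, using the alert names and labels from above:

  inhibit_rules:
    - source_match:
        alertname: 'td'            # while the slow alert is firing...
      target_match:
        alertname: 'td-fast'       # ...suppress notifications for the fast one
      equal: ['instance', 'job']   # but only for the same target

On newer Alertmanager versions the equivalent fields are source_matchers / target_matchers. Note this only suppresses notifications while td is actually firing; it doesn't change what the ALERTS series records.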

 

  (If there is jitter in the sampling time, then occasionally it might look 
at 4 or 6 samples)


Jitter in the sense that the samples are taken at slightly different times?
Do you think that could affect the desired behaviour? I would intuitively 
expect that it only causes the "base duration" not to be exactly e.g. 1m... 
so e.g. instead of taking exactly 1m for the "slow" td to fire, it would fire 
up to ~15s earlier or later (and similarly for td-fast).


Another point I basically don't understand: how does all that relate to 
the scrape intervals?
The plain up == 0 simply looks at the most recent sample (going back up to 
5m, as you said in the other thread).

The range up[Ns] looks back N seconds and returns whichever samples fall 
between then and now. AFAIU, it doesn't "automatically" go back any further 
than that (unlike the 5m lookback above), right?

In order for the for: to work I need at least two samples... so doesn't 
that mean that as soon as any scrape interval is longer than for:-time (1m) / 2 
= ~30s (in the above example), the above two alerts will never fire, even if 
the target is down?

So if I had e.g. some jobs scraping only every 10m, I'd need another pair of 
td/td-fast alerts which filter on the job (up{job="longRunning"}) and either 
have only a td (if that makes sense), or a td-fast that fires if one of the 
every-10m scrapes fails plus an even longer "slow" td that fires if, say, 
scrapes fail for 1h (something like the sketch below).


If what I've written above is correct (and it may well not be!), then

expr: up == 0
for: 5m

will fire if "up" is zero for 6 cycles, whereas


As far as I understand you... 6 cycles of rule evaluation interval... with 
at least two samples within that interval, right?
 

... unless max_over_time(up[5m])

will suppress an alert if "up" is zero for (usually) 5 cycles.



 Last but not least, an (only) partially related question:

Once an alert fires (in Prometheus), even if just for one evaluation-interval 
cycle, and there is no inhibition rule or the like in Alertmanager... 
is it expected that a notification is sent out for sure, regardless of 
Alertmanager's grouping settings?
Like when the alert fires for one short 15s evaluation interval and clears 
again afterwards, but group_wait: is set to some 7d... is it expected that 
this single firing event is still sent after 7d, even though it has already 
resolved by the time the 7d are over and there was e.g. no further firing in between?


Thanks a lot :-)
Chris.


[prometheus-users] How does AlertManager webhook notification retry work in case the receiver is down for 1 hour?

2023-05-12 Thread Abdul
Hi All,

I have a question regarding AlertManager webhook notifications, for the 
scenario where the receiver is down for between 5 and 60 minutes. Will 
AlertManager send the alerts that occurred during this time once the receiver 
is up again?

For example,

The AlertManager webhook is configured at 10:00 AM and alerts are flowing; 
then the receiver is DOWN between 10:30 AM and 11:30 AM. Will the alerts 
created between 10:30 AM and 11:30 AM all be sent to the receiver once it is 
UP again, or won't they be sent at all?
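For reference, a minimal sketch of the kind of configuration being described (receiver name and URL are placeholders). Broadly speaking, AlertManager retries failed deliveries on its own for a while, and a group that is still firing is re-notified according to group_interval / repeat_interval, so whether the 10:30-11:30 alerts eventually reach the receiver mostly depends on whether they are still firing (or reported as resolved via send_resolved) once delivery succeeds again:

```
route:
  receiver: my-webhook                 # placeholder
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h                  # still-firing alerts are re-sent at least this often

receivers:
  - name: my-webhook
    webhook_configs:
      - url: 'http://webhook.example:9000/alerts'   # placeholder URL
        send_resolved: true            # also notify when alerts resolve
```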

Thanks,
Abdul



[prometheus-users] How to use Jmx exporter JavaAgent as a sidecar container

2023-05-12 Thread tantan hngo
As the title suggests, I want to run the javaagent in the same pod as our 
Tomcat.
How do I run the javaagent against our .war file that is in the other 
container?

I tried just running the agent by itself and letting it listen on the port 
that we've defined in our Tomcat to expose its JMX, but got nothing.

I get various errors depending on how I do things, like: a missing httpserver 
class error, "no main manifest attribute in jmx_prom_javagent.jar", etc.
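A -javaagent can only instrument the JVM it is loaded into, so it cannot live in a separate container. The part of jmx_exporter that is meant to run as a sidecar is the standalone jmx_prometheus_httpserver, which connects to Tomcat over remote JMX. A rough sketch, where the image, ports and paths are placeholders and Tomcat is assumed to have remote JMX enabled on port 9010:

```
# config.yaml mounted into the sidecar (e.g. from a ConfigMap):
#   hostPort: localhost:9010    # Tomcat's JMX port; containers in a pod share localhost
#   rules:
#     - pattern: ".*"

# pod spec fragment (ConfigMap volume definition omitted):
containers:
  - name: tomcat
    image: our-tomcat-image:latest                 # placeholder
    env:
      - name: CATALINA_OPTS                        # expose JMX for the sidecar
        value: >-
          -Dcom.sun.management.jmxremote
          -Dcom.sun.management.jmxremote.port=9010
          -Dcom.sun.management.jmxremote.rmi.port=9010
          -Dcom.sun.management.jmxremote.authenticate=false
          -Dcom.sun.management.jmxremote.ssl=false
  - name: jmx-exporter
    image: registry.example/jmx-exporter:latest    # placeholder image containing the httpserver jar
    command: ["java", "-jar", "/opt/jmx_prometheus_httpserver.jar", "9404", "/config/config.yaml"]
    ports:
      - containerPort: 9404                        # Prometheus scrapes this port
    volumeMounts:
      - name: jmx-exporter-config
        mountPath: /config
```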

Any ideas? <3



[prometheus-users] Prometheus JMX Exporter Help

2023-05-12 Thread tantan hngo


Hello everyone,

not quite sure if this is where I should be asking for help, but I'm kind of 
befuddled by this whole situation. I'm trying to set up 
https://github.com/prometheus/jmx_exporter for our containerized Java 
application on our cluster, specifically the JavaAgent, as we are interested 
in getting the CPU and memory metrics. However, when getting it initialized I 
am faced with this:
[image: error screenshot - Prometheus JMX Exporter for Java 17]


After doing some research, it appears this class references internal 
packages and therefore "makes it unusable for modern Java apps" 
(https://github.com/prometheus/client_java/issues/533, 
https://github.com/open-telemetry/opentelemetry-java/issues/4192), and also 
that the error suggests the agent was written for an older Java, since these 
classes were apparently removed years ago.

I am not a Java developer, just trying to make this work for our monitoring 
stack. Has anyone else here tried exporting JMX metrics with Prometheus, or 
are you maybe using something else? I tried basically doing the same thing as 
this person: 
https://github.com/Ilak-0/prometheus-jmx-exporter-kubernetes
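For what it's worth, the JavaAgent is normally attached via a JVM flag on the application container itself rather than started as its own process; a minimal sketch, where the agent path, port and config file are placeholders. Since the linked issues tie the missing-internal-class errors to older agent/Java combinations, trying a recent jmx_prometheus_javaagent release is also worth a shot:

```
# fragment of the application container spec; JAVA_TOOL_OPTIONS is read by the JVM at startup
env:
  - name: JAVA_TOOL_OPTIONS
    value: "-javaagent:/opt/jmx/jmx_prometheus_javaagent.jar=9404:/opt/jmx/config.yaml"
ports:
  - containerPort: 9404          # the agent serves Prometheus metrics on /metrics here
# /opt/jmx/config.yaml can start out as simply:
#   rules:
#     - pattern: ".*"
```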

I've been stuck on this for too long and will soon have my hair falling off 
><



[prometheus-users] Seeking help with Prometheus empty value problem

2023-05-12 Thread 张星



When the value of query_result_sys_network_track_error is empty, the result 
is also empty. I want it to default to 0, regardless of whether the minuend 
or subtrahend is empty. I tried using vector(0), but when used together 
with sum by, the same problem occurs.

My current query is:

```

(sum by (host)(sum_over_time(query_result_sys_network_track_error[24h])))
-
(sum by (host)(sum_over_time(query_result_sys_network_track_error[24h] offset 7d)))

```

I tried using vector(0) to solve this problem but it did not work:

```

(sum by (host)(sum_over_time(query_result_sys_network_track_error[24h])) or vector(0))
-
(sum by (host)(sum_over_time(query_result_sys_network_track_error[24h] offset 7d)) or vector(0))

```
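One common idiom for this, as an untested sketch against the same metric: vector(0) carries no host label, so it can never match the other side of the subtraction. Instead, default each side to 0 only for the hosts that exist on the other side:

```
(
  sum by (host)(sum_over_time(query_result_sys_network_track_error[24h]))
    or
  sum by (host)(sum_over_time(query_result_sys_network_track_error[24h] offset 7d)) * 0
)
-
(
  sum by (host)(sum_over_time(query_result_sys_network_track_error[24h] offset 7d))
    or
  sum by (host)(sum_over_time(query_result_sys_network_track_error[24h])) * 0
)
```

If both sides are completely empty there is still nothing to return, because no host label is left to attach a 0 to.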



[prometheus-users] Deleting all segments newer than corrupted segment

2023-05-12 Thread Jérôme Loyet
Hi,

this morning we noticed that a Prometheus server with 3.3TB of metrics stopped 
returning metrics older than ~2h30. The disk still held the full 3.3TB of 
data.

When I restarted the Prometheus server, it started to replay the WAL and 
found a corrupted segment. It then deleted all segments after the corrupted 
one... in the end, the 3.3TB of data were reduced to 48GB...

I don't understand why a corrupted segment implies deleting all newer 
segments. To me this makes no sense and makes the Prometheus TSDB look 
unreliable. I would have expected the TSDB to be rock solid and able to 
recover from segment corruption, or in the worst case to lose just that 
segment... not all segments newer than the corrupted one.

What is the technical reason behind this?

Thank you

Regards
++ Jerome
