[prometheus-users] Inhibit resolved messages from inhibited alerts

2024-03-21 Thread Michael Kogelman
It seems that resolved messages are still sent/received when an inhibited
alert is resolved. Is there any way to squelch these as well? Or is this
pretty much as intended?

Thanks!

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/CAOOH65-eRHhDwZcRS3s_qcZ84%2BTN9PqVZuM4KKyv2h2z9sLmLw%40mail.gmail.com.


Re: [prometheus-users] blackbox_exporter 0.24.0 and smokeping_prober 0.7.1 - DNS cache "nscd" not working

2024-03-21 Thread Chris Siebenmann
> Having a quick look at the binary, it seems, that netgo build tag was 
> applied:
>
> $ strings blackbox_exporter-0.24.0.linux-amd64/blackbox_exporter | egrep 
> '\-tags.*net.*'
> build   -tags=netgo
> build   -tags=netgo

As a side note: if you have the Go toolchain available, you can use 'go
version -m <binary>' to conveniently dump out all of this information
(among other things). Taken from the current blackbox_exporter binary
release:

build   -tags=netgo
build   CGO_ENABLED=0

(In the case of standard Prometheus exporters like node_exporter and
blackbox_exporter, it looks like the tags information is reported in
their '--version' output, although not the CGO setting. But 'go version
-m' is authoritative for all Go binaries.)

- cks

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/285787.1711035177%40apps0.cs.toronto.edu.


Re: [prometheus-users] Re: better way to get notified about (true) single scrape failures?

2024-03-21 Thread Christoph Anton Mitterer

I've been looking into possible alternatives, based on the ideas given here.

I) A first, completely different approach might be:
  - alert: target-down
    expr: 'max_over_time( up[1m0s] ) == 0'
    for: 0s
and: (
  - alert: single-scrape-failure
    expr: 'min_over_time( up[2m0s] ) == 0'
    for: 1m
or
  - alert: single-scrape-failure
    expr: 'resets( up[2m0s] ) > 0'
    for: 1m
or perhaps even
  - alert: single-scrape-failure
    expr: 'changes( up[2m0s] ) >= 2'
    for: 1m
(which would, however, behave a bit differently, I guess)
)

plus an inhibit rule that silences single-scrape-failure when
target-down fires.
The for: 1m is needed so that target-down has a chance to fire
(and inhibit) before single-scrape-failure does.

I'm not really sure whether that works in all cases, though,
especially since I look back much further (and the additional time
span further back may undesirably trigger again).


Using for: > 0 seems generally a bit fragile for my use case, because I
want to capture even single scrape failures, but with for: > 0 I need
at least two evaluations to actually trigger, so my evaluation period
must be small enough that it's done >= 2 times during the scrape interval.

Also, I guess the scrape intervals and the evaluation intervals are not
synced, so with for: 0s, when I look back e.g. [1m] and assume a
certain number of samples in that range, there may actually be
more or fewer.


If I forget about the above approach with inhibiting, then I need to 
consider cases like:
time>
- 0 1 0 0 0 0 0 0
first zero should be a single-scrape-failure, the last 6 however a
target-down
- 1 0 0 0 0 0 1 0 0 0 0 0 0
same here, the first 5 should be a single-scrape-failure, the last 6
however a target-down
- 1 0 0 0 0 0 0 1 0 0 0 0 0 0
here however, both should be target-down
- 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0
or
1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0
here, 2x target-down, 1x single-scrape-failure
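
As an illustration only, the desired outcomes listed above can be restated
as a small Python sketch. It assumes a 10s scrape interval, so that six
consecutive failed scrapes fill the 1m window used by target-down. The
function name classify_zero_runs and its down_threshold parameter are made
up for this example; the code models the wanted classification, not any
PromQL expression.

```python
from itertools import groupby


def classify_zero_runs(samples, down_threshold=6):
    """Label each run of consecutive 0 samples of `up`.

    Assumption: a run of at least `down_threshold` zeros counts as
    target-down, anything shorter as single-scrape-failure.
    """
    labels = []
    for value, run in groupby(samples):
        length = sum(1 for _ in run)
        if value == 0:
            if length >= down_threshold:
                labels.append("target-down")
            else:
                labels.append("single-scrape-failure")
    return labels


# First case from the list: one isolated zero, then six zeros in a row.
print(classify_zero_runs([0, 1, 0, 0, 0, 0, 0, 0]))
# -> ['single-scrape-failure', 'target-down']
```

Any alerting expression that is meant to satisfy these cases could be
checked against such a reference classification.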




II) Using the original {min,max}_over_time approach:
- min_over_time(up[1m]) == 0
tells me, there was at least one missing scrape in the last 1m.
but that alone would already be the case for the first zero:
. . . . . 0
so:
- for: 1m
was added (and the [1m] was enlarged)
but this would still fire with
0 0 0 0 0 0 0
which should however be a target-down
so:
- unless max_over_time(up[1m]) == 0
was added to silence it then
but that would still fail in e.g. the case when a previous
target-down runs out:
0 0 0 0 0 0 -> target down
the next is a 1
0 0 0 0 0 0 1 -> single-scrape-failure
and in some similar cases.

Plus the usage of for: >0s is - in my special case - IMO fragile.
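
The failure after a target-down ends can be reproduced with a simplified
Python model of the expression from II). This is a sketch under stated
assumptions: one sample every 10s, so [1m] is taken to cover the last six
samples, and the for: clause is ignored entirely. The function name
approach_ii_expr is hypothetical.

```python
def approach_ii_expr(samples, window=6):
    """Evaluate, at every sample, a simplified model of
    min_over_time(up[1m]) == 0 unless max_over_time(up[1m]) == 0.

    Assumption: one sample per 10s, so [1m] covers the last 6 samples.
    """
    results = []
    for i in range(len(samples)):
        recent = samples[max(0, i - window + 1):i + 1]
        # "A unless B" keeps A only where B does not hold.
        results.append(min(recent) == 0 and max(recent) != 0)
    return results


# A target-down (six zeros) that ends with one successful scrape:
print(approach_ii_expr([0, 0, 0, 0, 0, 0, 1]))
# -> [False, False, False, False, False, False, True]
```

The expression is suppressed while only zeros are in the window, but
becomes true as soon as the first 1 arrives, which is the unwanted
single-scrape-failure described above.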



III) So in my previous mail I came up with the idea of using:
  - alert: target-down
    expr: 'max_over_time( up[1m0s] ) == 0'
    for: 0s
  - alert: single-scrape-failure
    expr: 'min_over_time(up[15s] offset 1m) == 0
           unless max_over_time(up[1m0s]) == 0
           unless max_over_time(up[1m0s] offset 1m10s) == 0
           unless max_over_time(up[1m0s] offset 1m) == 0
           unless max_over_time(up[1m0s] offset 50s) == 0
           unless max_over_time(up[1m0s] offset 40s) == 0
           unless max_over_time(up[1m0s] offset 30s) == 0
           unless max_over_time(up[1m0s] offset 20s) == 0
           unless max_over_time(up[1m0s] offset 10s) == 0'
    for: 0m
The idea was that, when I don't use for: >0s, the first time
window where one can be really sure (in all cases) whether
it's a single-scrape-failure or a target-down is a 0 in -70s to
-60s:
-130s -120s -110s -100s -90s  -80s  -70s  -60s  -50s  -40s  -30s  -20s  -10s  0s/now
  |     |     |     |     |     |     |  0  |     |     |     |     |     |     |
  |     |     |     |     |     |     |     |     |     |     |  1  |  0  |  1  |  case 1
  |     |     |     |     |     |     |  0  |  0  |  0  |  0  |  0  |  0  |  0  |  case 2
  |     |     |     |  1  |  0  |  0  |  0  |  0  |  0  |  0  |  0  |  1  |  1  |  case 3
In case 1 it would already be clear when the zero is between -20s
and -10s.
But if there's a sequence of zeros, it only becomes clear once the
first zero has reached -70s to -60s.

Now the zero in that time span could also be that of a target-down
sequence of zeros like in case 3.
For these cases, I had the shifted silencers that each looked over
1m.

Looked good at first, though there were some open questions.
There is at least one main problem, namely that it would fail in e.g. this case:
-130s -120s -110s -100s -90s  -80s  -70s  -60s  -50s  -40s  -30s  -20s  -10s  0s/now
  |  1  |  1  |  1  |  1  |  1  |  1  | 0 1 |  0  |  0  |  0  |  0  |  0  |  0  |  case 8a
The zero between -70s and -60s would be noticed, but still be
silenced, because the one would not.




Chris Siebenmann suggested using resets(), and also keep_firing_for:, which
Ben Kochie suggested, too.

At first I didn't quite understand how the latter would help me. Maybe I have
the wrong mindset for it, so could you guys please explain what your idea
was with keep_firing_for:?




IV) resets() sounded promising at first, but while I tried quite a few
variations, I wasn't able to get anything working.
First, something like
resets(up[1m]) >= 1
alone (with or without a for: >0s) would already fire in case of:
time>
1 0
which still could become a target-down but also in case of:
1 0 0 0 0 0 0
which is a target down.
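
To illustrate why both sequences look the same to such an expression,
here is a minimal Python stand-in for resets(), which counts decreases
between consecutive samples. It is a sketch only: it operates on a plain
list and ignores range-window boundaries and timestamps.

```python
def resets(samples):
    """Count decreases between consecutive samples, as resets() does."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur < prev)


print(resets([1, 0]))                 # -> 1
print(resets([1, 0, 0, 0, 0, 0, 0]))  # -> 1
```

Both cases yield exactly one reset, so resets(up[1m]) >= 1 on its own
cannot tell a short failure from the start of a target-down.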
And I think even