Re: [prometheus-users] Re: better way to get notified about (true) single scrape failures?

2024-04-05 Thread Christoph Anton Mitterer
Hey Chris. On Thursday, April 4, 2024 at 8:41:02 PM UTC+2 Chris Siebenmann wrote: > - The evaluation interval is sufficiently less than the scrape > interval, so that it's guaranteed that none of the `up`-samples are > being missed. I assume you were referring to the above specific point?

Re: [prometheus-users] Re: better way to get notified about (true) single scrape failures?

2024-04-04 Thread Chris Siebenmann
> The assumptions I've made are basically three: > - Prometheus does that "faking" of sample times, and thus these are > always on point with exactly the scrape interval between each. > This in turn should mean, that if I have e.g. a scrape interval of > 10s, and I do up[20s], then

Re: [prometheus-users] Re: better way to get notified about (true) single scrape failures?

2024-04-04 Thread Christoph Anton Mitterer
Hey. On Friday, March 22, 2024 at 9:20:45 AM UTC+1 Brian Candler wrote: You want to "capture" single scrape failures? Sure - it's already being captured. Make yourself a dashboard. Well as I've said before, the dashboard always has the problem that someone actually needs to look at it.

Re: [prometheus-users] Re: better way to get notified about (true) single scrape failures?

2024-03-22 Thread 'Brian Candler' via Prometheus Users
Personally I think you're looking at this wrong. You want to "capture" single scrape failures? Sure - it's already being captured. Make yourself a dashboard. But do you really want to be *alerted* on every individual one-time scrape failure? That goes against the whole philosophy of

Re: [prometheus-users] Re: better way to get notified about (true) single scrape failures?

2024-03-21 Thread Christoph Anton Mitterer
I've been looking into possible alternatives, based on the ideas given here. I) First one completely different approach might be: - alert: target-down expr: 'max_over_time( up[1m0s] ) == 0' for: 0s and: ( - alert: single-scrape-failure expr: 'min_over_time( up[2m0s] ) == 0' for: 1m or - alert:
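Laid out as rule-file entries, the two complete rules quoted in the message above might read as in the sketch below. The group name is an assumption added for illustration; the third rule is cut off in this excerpt and is therefore omitted rather than guessed.

```yaml
groups:
  - name: target-availability     # hypothetical group name, not from the thread
    rules:
      # Fires as soon as every `up` sample in the last minute is 0.
      - alert: target-down
        expr: 'max_over_time( up[1m0s] ) == 0'
        for: 0s
      # Pending while any `up` sample in the last two minutes is 0;
      # fires once that has held for 1m.
      - alert: single-scrape-failure
        expr: 'min_over_time( up[2m0s] ) == 0'
        for: 1m
```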

Re: [prometheus-users] Re: better way to get notified about (true) single scrape failures?

2024-03-18 Thread Ben Kochie
I usually recommend throwing out any "But this is how Icinga does it" thinking. The way we do things in Prometheus for this kind of thing is to simply think about "availability". For any scrape failures: avg_over_time(up[5m]) < 1 For more than one scrape failure (assuming 15s intervals)
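As a rough sketch, the first expression above could be wrapped in an alerting rule like this. The group and alert names are assumptions; only the expression comes from the message. The second expression is truncated in this excerpt and is not reconstructed here.

```yaml
groups:
  - name: availability            # hypothetical group name
    rules:
      - alert: ScrapeFailures     # hypothetical alert name
        # The average drops below 1 whenever at least one `up` sample
        # in the 5m window is 0.
        expr: avg_over_time(up[5m]) < 1
```

With the 15s interval mentioned, a 5m window holds roughly 20 samples, so a single failed scrape would give an average of about 19/20 = 0.95.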

Re: [prometheus-users] Re: better way to get notified about (true) single scrape failures?

2024-03-17 Thread Christoph Anton Mitterer
Hey Chris. On Sun, 2024-03-17 at 22:40 -0400, Chris Siebenmann wrote: > > One thing you can look into here for detecting and counting failed > scrapes is resets(). This works perfectly well when applied to a > gauge Though it is documented as to be only used with counters... :-/ > that is 1
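One way the quoted resets() idea might be written as a rule is sketched below. The group name, alert name, and the 5m window are all assumptions, and, as noted in the reply, resets() is documented for counters, so using it on the `up` gauge relies on it counting any decrease between consecutive samples.

```yaml
groups:
  - name: scrape-failure-counting   # hypothetical group name
    rules:
      - alert: ScrapeFailed         # hypothetical alert name
        # On `up`, each decrease is a 1 -> 0 transition, i.e. the start
        # of a run of failed scrapes inside the window.
        expr: resets(up[5m]) > 0    # the 5m window is an assumption
```

A target that was already down before the window started shows no decrease inside it, so this sketch would need a separate target-down rule alongside it.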

Re: [prometheus-users] Re: better way to get notified about (true) single scrape failures?

2024-03-17 Thread Chris Siebenmann
> As a reminder, my goal was: > - if e.g. scrapes fail for 1m, a target-down alert shall fire (similar to > how Icinga would put the host into down state, after pings failed or a > number of seconds) > - but even if a single scrape fails (which alone wouldn't trigger the above > alert) I'd

[prometheus-users] Re: better way to get notified about (true) single scrape failures?

2024-03-17 Thread Christoph Anton Mitterer
Hey there. I eventually got back to this and I'm still fighting this problem. As a reminder, my goal was: - if e.g. scrapes fail for 1m, a target-down alert shall fire (similar to how Icinga would put the host into down state, after pings failed or a number of seconds) - but even if a single

[prometheus-users] Re: better way to get notified about (true) single scrape failures?

2023-05-13 Thread Brian Candler
On Saturday, 13 May 2023 at 03:26:18 UTC+1 Christoph Anton Mitterer wrote: (If there is jitter in the sampling time, then occasionally it might look at 4 or 6 samples) Jitter in the sense that the samples are taken at slightly different times? Yes. Each sample is timestamped with the time

[prometheus-users] Re: better way to get notified about (true) single scrape failures?

2023-05-12 Thread Christoph Anton Mitterer
Hey Brian. On Wednesday, May 10, 2023 at 9:03:36 AM UTC+2 Brian Candler wrote: It depends on the exact semantics of "for". e.g. take a simple case of 1 minute rule evaluation interval. If you apply "for: 1m" then I guess that means the alert must be firing for two successive evaluations

[prometheus-users] Re: better way to get notified about (true) single scrape failures?

2023-05-10 Thread Brian Candler
> Not sure if I'm right, but I think if one places both rules in the same group (and I think even the order shouldn't matter?), then the original: > expr: min_over_time(up[5m]) == 0 unless max_over_time(up[5m]) == 0 > for: 5m > with 5m being the "for:"-time of the long-alert should be

[prometheus-users] Re: better way to get notified about (true) single scrape failures?

2023-05-09 Thread Christoph Anton Mitterer
Hey Brian. On Tuesday, May 9, 2023 at 9:55:22 AM UTC+2 Brian Candler wrote: That's tricky to get exactly right. You could try something like this (untested): expr: min_over_time(up[5m]) == 0 unless max_over_time(up[5m]) == 0 for: 5m - min_over_time will be 0 if any single scrape

[prometheus-users] Re: better way to get notified about (true) single scrape failures?

2023-05-09 Thread Brian Candler
That's tricky to get exactly right. You could try something like this (untested): expr: min_over_time(up[5m]) == 0 unless max_over_time(up[5m]) == 0 for: 5m - min_over_time will be 0 if any single scrape failed in the past 5 minutes - max_over_time will be 0 if all scrapes failed (which
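As a rule-file fragment, the suggested expression might look like the sketch below. The group and alert names are assumptions for illustration; the expression and the `for:` value are as quoted in the message, which itself marks them as untested.

```yaml
groups:
  - name: scrape-failures            # hypothetical group name
    rules:
      - alert: SingleScrapeFailure   # hypothetical alert name
        # Left side: some scrape in the last 5m failed.
        # `unless` drops series where every scrape in the last 5m failed.
        expr: min_over_time(up[5m]) == 0 unless max_over_time(up[5m]) == 0
        for: 5m
```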