Hey Chris.
On Thursday, April 4, 2024 at 8:41:02 PM UTC+2 Chris Siebenmann wrote:
> - The evaluation interval is sufficiently less than the scrape
> interval, so that it's guaranteed that none of the `up`-samples are
> being missed.
I assume you were referring to the above specific point?
> The assumptions I've made are basically three:
> - Prometheus does that "faking" of sample times, and thus these are
> always on point with exactly the scrape interval between each.
This in turn should mean that if I have e.g. a scrape interval of
10s, and I do up[20s], then
Hey.
On Friday, March 22, 2024 at 9:20:45 AM UTC+1 Brian Candler wrote:
You want to "capture" single scrape failures? Sure - it's already being
captured. Make yourself a dashboard.
Well, as I've said before, the dashboard always has the problem that someone
actually needs to look at it.
Personally I think you're looking at this wrong.
You want to "capture" single scrape failures? Sure - it's already being
captured. Make yourself a dashboard.
But do you really want to be *alerted* on every individual one-time scrape
failure? That goes against the whole philosophy of
I've been looking into possible alternatives, based on the ideas given here.
I) First one completely different approach might be:
- alert: target-down
  expr: 'max_over_time( up[1m0s] ) == 0'
  for: 0s
and: (
- alert: single-scrape-failure
expr: 'min_over_time( up[2m0s] ) == 0'
for: 1m
or
- alert:
I usually recommend throwing out any "But this is how Icinga does it"
thinking.
The way we do things in Prometheus for this kind of thing is to simply
think about "availability".
For any scrape failures:
avg_over_time(up[5m]) < 1
For more than one scrape failure (assuming 15s intervals)
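One plausible way to finish that thought, as a hedged sketch: with a 15s scrape interval, a 5m window holds about 20 samples, so "more than one failure" corresponds to an average below 19/20. The 0.95 threshold below is my own arithmetic, not taken from the original post:

```promql
# Any scrape failure in the last 5 minutes:
avg_over_time(up[5m]) < 1

# More than one failed scrape out of the ~20 samples that a 15s
# interval yields in 5m (19/20 = 0.95; threshold is an assumption):
avg_over_time(up[5m]) < 0.95
```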
Hey Chris.
On Sun, 2024-03-17 at 22:40 -0400, Chris Siebenmann wrote:
>
> One thing you can look into here for detecting and counting failed
> scrapes is resets(). This works perfectly well when applied to a
> gauge
Though it is documented as only to be used with counters... :-/
> that is 1
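A hedged sketch of the resets() idea: resets() is documented for counters, but mechanically it counts any decrease between adjacent samples, so on a 0/1 gauge like `up` each 1-to-0 drop is counted.

```promql
# Number of times `up` fell from 1 to 0 in the last hour, i.e. the
# number of failed-scrape episodes (consecutive failures count once):
resets(up[1h]) > 0
```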
> As a reminder, my goal was:
> - if e.g. scrapes fail for 1m, a target-down alert shall fire (similar to
> how Icinga would put the host into down state, after pings failed for a
> number of seconds)
> - but even if a single scrape fails (which alone wouldn't trigger the above
> alert) I'd
Hey there.
I eventually got back to this and I'm still fighting this problem.
As a reminder, my goal was:
- if e.g. scrapes fail for 1m, a target-down alert shall fire (similar to
how Icinga would put the host into down state, after pings failed for a
number of seconds)
- but even if a single
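That two-tier goal can be sketched as a pair of rules (untested; the alert names, window lengths, and `for:` durations are my assumptions, following the min_over_time/max_over_time pattern discussed elsewhere in the thread):

```yaml
groups:
  - name: scrape-health   # group name is a placeholder
    rules:
      # Fires once scrapes have been failing continuously for 1m:
      - alert: TargetDown
        expr: up == 0
        for: 1m
      # Fires on any failed scrape in the window, but is suppressed
      # while every scrape is failing (TargetDown covers that case):
      - alert: SingleScrapeFailure
        expr: min_over_time(up[5m]) == 0 unless max_over_time(up[5m]) == 0
```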
On Saturday, 13 May 2023 at 03:26:18 UTC+1 Christoph Anton Mitterer wrote:
(If there is jitter in the sampling time, then occasionally it might look
at 4 or 6 samples)
Jitter in the sense that the samples are taken at slightly different times?
Yes. Each sample is timestamped with the time
Hey Brian
On Wednesday, May 10, 2023 at 9:03:36 AM UTC+2 Brian Candler wrote:
It depends on the exact semantics of "for". e.g. take a simple case of 1
minute rule evaluation interval. If you apply "for: 1m" then I guess that
means the alert must be firing for two successive evaluations
> Not sure if I'm right, but I think if one places both rules in the same
group (and I think even the order shouldn't matter?), then the original:
> expr: min_over_time(up[5m]) == 0 unless max_over_time(up[5m]) == 0
> for: 5m
> with 5m being the "for:"-time of the long-alert should be
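The interaction of `for:` with the evaluation interval discussed above can be illustrated with a hedged sketch: with a 1m evaluation interval, `for: 1m` means the expression must hold at two consecutive evaluations before the alert fires.

```yaml
# Assumption: the group's evaluation_interval is 1m.
# The alert goes pending at the first evaluation where expr is true,
# and fires only if expr is still true 1m later, at the next evaluation.
- alert: ExampleDown   # name is a placeholder
  expr: up == 0
  for: 1m
```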
Hey Brian.
On Tuesday, May 9, 2023 at 9:55:22 AM UTC+2 Brian Candler wrote:
That's tricky to get exactly right. You could try something like this
(untested):
expr: min_over_time(up[5m]) == 0 unless max_over_time(up[5m]) == 0
for: 5m
- min_over_time will be 0 if any single scrape
That's tricky to get exactly right. You could try something like this
(untested):
expr: min_over_time(up[5m]) == 0 unless max_over_time(up[5m]) == 0
for: 5m
- min_over_time will be 0 if any single scrape failed in the past 5 minutes
- max_over_time will be 0 if all scrapes failed (which
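Spelled out, the `unless` in that rule acts as a suppression: the left-hand side drops out whenever the right-hand side also matches for the same labels. A hedged restatement:

```promql
# min_over_time(up[5m]) == 0  -> at least one scrape failed in 5m
# max_over_time(up[5m]) == 0  -> every scrape failed in 5m
# LHS unless RHS              -> "some but not all scrapes failed",
# leaving the all-failed case to a separate target-down alert:
min_over_time(up[5m]) == 0 unless max_over_time(up[5m]) == 0
```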