Re: [prometheus-users] Re: better way to get notified about (true) single scrape failures?

2024-04-05 Thread Christoph Anton Mitterer
Hey Chris.

On Thursday, April 4, 2024 at 8:41:02 PM UTC+2 Chris Siebenmann wrote:

> - The evaluation interval is sufficiently less than the scrape 
> interval, so that it's guaranteed that none of the `up`-samples are 
> being missed. 


I assume you were referring to the above specific point?

Maybe there is a misunderstanding:

With the above I merely meant that my solution requires the alert 
rule evaluation interval to be small enough that, when it looks at 
resets(up[20s] offset 60s) (which is the window from -70s to -50s PLUS an 
additional shift by 10s, so effectively -80s to -60s), the evaluations 
happen often enough that no sample can "jump over" that time window.

I.e. if the scrape interval was 10s, but the evaluation interval only 20s, 
it would surely miss some.
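
Just to make that concrete, a minimal sketch of the kind of config I mean (the
5s value and the group name are only illustrative assumptions, not what I
actually run):

# prometheus.yml
global:
  scrape_interval: 10s

# rules file: the group is evaluated more often than the scrape interval,
# so no sample can fall between two evaluations of the 20s window
groups:
  - name: target-down-alerts
    interval: 5s
    rules:
      # ... the TD and TDSS alerts discussed below ...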
 

I don't believe this assumption about up{} is correct. My understanding 
is that up{} is not merely an indication that Prometheus has connected 
to the target exporter, but an indication that it has successfully 
scraped said exporter. Prometheus can only know this after all samples 
from the scrape target have been received and ingested and there are no 
unexpected errors, which means that just like other metrics from the 
scrape, up{} can only be visible after the scrape has finished (and 
Prometheus knows whether it succeeded or not). 


Yes, I'd have assumed so as well. Therefore I generally shifted both alerts 
by 10s, hoping that 10s is enough for all that.

 

How long scrapes take is variable and can be up to almost their timeout 
interval. You may wish to check 'scrape_duration_seconds'. Our metrics 
suggest that this can go right up to the timeout (possibly in the case 
of failed scrapes). 


Interesting. 

I see the same (I mean entries that go up to and even a bit above the 
timeout). It would be interesting to know whether these are scrapes that 
still made it "just in time" (despite actually taking a bit longer than the 
timeout)... or whether these are only ones that timed out and were 
discarded.
Because the name scrape_duration_seconds would kind of imply it's the 
former, but I guess it's actually the latter.
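
One way to look at this (just a sketch; the 1d range is arbitrary, and the
second query assumes --enable-feature=extra-scrape-metrics, which exposes
scrape_timeout_seconds):

# worst scrape duration per target over the last day
topk(10, max_over_time(scrape_duration_seconds[1d]))

# same, but as a ratio against the configured timeout
max_over_time(scrape_duration_seconds[1d]) / scrape_timeout_seconds > 0.9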

So what do you think that means for me and my solution now? That I should 
shift all my checks even further, i.e. by at least the scrape_timeout plus 
some extra time for the data getting into the TSDB?
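
Just so I understand what that would mean in practice - a sketch only,
assuming scrape_timeout=10s plus ~5s margin for ingestion, i.e. shifting
everything by 15s instead of 10s:

- alert: general_target-down
  expr: 'max_over_time(up[1m] offset 15s) == 0'
  for:  0s
- alert: general_target-down_single-scrapes
  expr: 'resets(up[20s] offset 65s) >= 1  unless  max_over_time(up[50s] offset 15s) == 0'
  for:  0s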


Thanks,
Chris.



Re: [prometheus-users] Re: better way to get notified about (true) single scrape failures?

2024-04-04 Thread Chris Siebenmann
> The assumptions I've made are basically three:
> - Prometheus does that "faking" of sample times, and thus these are
>   always on point with exactly the scrape interval between each.
>   This in turn should mean, that if I have e.g. a scrape interval of
>   10s, and I do up[20s], then regardless of when this is done, I get
>   at least 2 samples, and in some rare cases (when the evaluation
>   happens exactly on a scrape time), 3 samples.
>   Never more, never less.
>   Which for `up` I think should be true, as Prometheus itself
>   generates it, right, and not the exporter that is scraped.
> - The evaluation interval is sufficiently less than the scrape
>   interval, so that it's guaranteed that none of the `up`-samples are
>   being missed.

I don't believe this assumption about up{} is correct. My understanding
is that up{} is not merely an indication that Prometheus has connected
to the target exporter, but an indication that it has successfully
scraped said exporter. Prometheus can only know this after all samples
from the scrape target have been received and ingested and there are no
unexpected errors, which means that just like other metrics from the
scrape, up{} can only be visible after the scrape has finished (and
Prometheus knows whether it succeeded or not).

How long scrapes take is variable and can be up to almost their timeout
interval. You may wish to check 'scrape_duration_seconds'. Our metrics
suggest that this can go right up to the timeout (possibly in the case
of failed scrapes).

- cks



Re: [prometheus-users] Re: better way to get notified about (true) single scrape failures?

2024-04-04 Thread Christoph Anton Mitterer
Hey.

On Friday, March 22, 2024 at 9:20:45 AM UTC+1 Brian Candler wrote:

You want to "capture" single scrape failures?  Sure - it's already being 
captured.  Make yourself a dashboard.


Well as I've said before, the dashboard always has the problem that someone 
actually needs to look at it.
 

But do you really want to be *alerted* on every individual one-time scrape 
failure?  That goes against the whole philosophy of alerting, where alerts 
should be "urgent, important, actionable, and real".  A single scrape 
failure is none of those.


I guess in the end I'll see whether or not I'm annoyed by it. ;-)
 

How often do you get hosts where:
(1) occasional scrape failures occur; and
(2) there are enough of them to make you investigate further, but not 
enough to trigger any alerts?


So far I've seen two kinds of nodes, those where I never get scrape errors, 
and those where they happen regularly - and probably need investigation.


Anyway... I think I might have found a solution which - if some
assumptions I've made are correct - I'm somewhat confident
works, even in the strange cases.


The assumptions I've made are basically four:
- Prometheus does that "faking" of sample times, and thus these are
  always on point with exactly the scrape interval between each.
  This in turn should mean, that if I have e.g. a scrape interval of
  10s, and I do up[20s], then regardless of when this is done, I get
  at least 2 samples, and in some rare cases (when the evaluation
  happens exactly on a scrape time), 3 samples.
  Never more, never less.
  Which for `up` I think should be true, as Prometheus itself
  generates it, right, and not the exporter that is scraped.
- The evaluation interval is sufficiently less than the scrape
  interval, so that it's guaranteed that none of the `up`-samples are
  being missed.
- After some small time (e.g. 10s) it's guaranteed that all samples
  are in the TSDB and a query will return them.
  (basically, to counter the observation I've made in
  https://groups.google.com/g/prometheus-users/c/mXk3HPtqLsg )
- Both alerts run in the same alert group, and that means (I hope) that
  each query in them is evaluated with respect to the very same time.

With that, my final solution would be:
- alert: general_target-down   (TD below)
  expr: 'max_over_time(up[1m] offset 10s) == 0'
  for:  0s
- alert: general_target-down_single-scrapes   (TDSS below)
  expr: 'resets(up[20s] offset 60s) >= 1  unless  max_over_time(up[50s] offset 10s) == 0'
  for:  0s

And that seems to actually work for at least the practical cases (of
course it's difficult to simulate the cases where the evaluation
happens right at the time of a scrape).
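
For anyone wanting to reproduce the timing offline: promtool's rule unit
tests can simulate it exactly. A minimal sketch (the job/instance labels and
the alerts.yml file name are made up for the example):

# tdss_test.yml - run with:  promtool test rules tdss_test.yml
rule_files:
  - alerts.yml
evaluation_interval: 10s
tests:
  - interval: 10s
    input_series:
      - series: 'up{job="node", instance="host1"}'
        values: '1 1 1 0 1 1 1 1 1 1 1 1'   # one failed scrape at t=30s
    alert_rule_test:
      - eval_time: 90s    # the 0 at t=30s lies in [now-80s, now-60s] = [10s, 30s]
        alertname: general_target-down_single-scrapes
        exp_alerts:
          - exp_labels:
              job: node
              instance: host1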

For anyone who'd ever be interested in the details, and why I think that 
works in all cases,
I've attached the git logs where I describe the changes in my config git 
below.

Thanks to everyone for helping me with that :-)

Best wishes,
Chris.


(needs a mono-spaced font to work out nicely)
TL/DR:
-
commit f31f3c656cae4aeb79ce4bfd1782a624784c1c43
Author: Christoph Anton Mitterer 
Date:   Mon Mar 25 02:01:57 2024 +0100

alerts: overhauled the `general_target-down_single-scrapes`-alert

This is a major overhaul of the 
`general_target-down_single-scrapes`-alert,
which turned out to have been quite an effort that went over several 
months.

Before this branch was merged, the `general_target-down_single-scrapes`-alert
(from now on called “TDSS”) had various issues.
While the alert did stop firing when the `general_target-down`-alert (from now
on called “TD”) started to do so, it would still also fire when scrapes failed
that eventually turned out to be an actual TD.
For example, the first few (< ≈7) `0`s would have caused TDSS to fire, which
would seamlessly be replaced by a firing TD (unless any `1`s came in between).

Assumptions made below:
• The scraping interval is `10s`.
• If a (single) time series for the `up`-metric is given like `0 1 0 0 1`,
  time goes from left (farther back in time) to right (less far back in
  time).

I) Goals

There should be two alerts:
• TD
  Is for general use and similar to Icinga’s concept of a host being `UP` or
  `DOWN` (with the minor difference that an unreachable Prometheus target does
  not necessarily mean that a host is `DOWN` in that sense).
  It should fire after scraping has failed for some time, for example one
  minute (which is assumed from now on).
• TDSS
  Since Prometheus is all about monitoring metrics, it’s of interest whether
  scraping fails, even if only every now and then for very short amounts of
  time, because in those cases samples are lost.
  TD will notice 

Re: [prometheus-users] Re: better way to get notified about (true) single scrape failures?

2024-03-22 Thread 'Brian Candler' via Prometheus Users
Personally I think you're looking at this wrong.

You want to "capture" single scrape failures?  Sure - it's already being 
captured.  Make yourself a dashboard.

But do you really want to be *alerted* on every individual one-time scrape 
failure?  That goes against the whole philosophy of alerting, where alerts 
should be "urgent, important, actionable, and real".  A single scrape 
failure is none of those.

If you want to do further investigation when a host has more than N 
single-scrape failures in 24 hours, sure. But firstly, is that urgent 
enough to warrant an alert? If it is, then you also say you *don't* want to 
be alerted on this when a more important alert has been sent for the same 
host in the same time period.  That's tricky to get right, which is what 
this whole thread is about. Like you say: alertmanager is probably not the 
right tool for that.

How often do you get hosts where:
(1) occasional scrape failures occur; and
(2) there are enough of them to make you investigate further, but not 
enough to trigger any alerts?

If it's "not often" then I wouldn't worry too much about it anyway (check a 
dashboard), but in any case you don't want to waste time trying to bend 
existing tooling to work in ways it wasn't intended for. That is: if you 
need suitable tooling, then write it.

It could be as simple as a script doing one query per day, using the same 
logic I just outlined above:
- identify hosts with scrape failures above a particular threshold over the 
last 24 hours
- identify hosts where one or more alerts have been generated over the last 
24 hours (there are metrics for this)
- subtract the second set from the first set
- if the remaining set is non-empty, then send a notification

You can do this in any language of your choice, or even a shell script with 
promtool/curl and jq.
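
For the query part of such a script, a rough PromQL sketch of that logic (the
threshold of 50, the 10s subquery step, and matching on the instance label are
assumptions you would adapt):

# instances with more than ~50 failed scrapes in the last 24h ...
  count_over_time((up == 0)[24h:10s]) > 50
# ... except those that already had an alert firing in that period
unless on (instance)
  count_over_time(ALERTS{alertstate="firing"}[24h])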

On Friday 22 March 2024 at 02:31:52 UTC Christoph Anton Mitterer wrote:

>
> I've been looking into possible alternatives, based on the ideas given 
> here.
>
> I) First one completely different approach might be:
> - alert: target-down expr: 'max_over_time( up[1m0s] ) == 0' for: 0s and: (
> - alert: single-scrape-failure
> expr: 'min_over_time( up[2m0s] ) == 0'
> for: 1m
> or
> - alert: single-scrape-failure
> expr: 'resets( up[2m0s] ) > 0'
> for: 1m
> or perhaps even
> - alert: single-scrape-failure
> expr: 'changes( up[2m0s] ) >= 2'
> for: 1m
> (which would however behave a bit different, I guess)
> )
>
> plus an inhibit rule, that silences single-scrape-failure when
> target-down fires.
> The for: 1m is needed, so that target-down has a chance to fire
> (and inhibit) before single-scrape-failure does.
>
> I'm not really sure, whether that works in all cases, though,
> especially since I look back much more (and the additional time
> span further back may undesirably trigger again.
>
>
> Using for: > 0 seems generally a bit fragile for my use-case (because I 
> want to capture even single scrape failures, but with for: > 0 I need t to 
> have at least two evaluations to actually trigger, so my evaluation period 
> must be small enough so that it's done >= 2 during the scrape interval.
>
> Also, I guess the scrape intervals and the evaluation intervals are not 
> synced, so when with for: 0s, when I look back e.g. [1m] and assume a 
> certain number of samples in that range, it may be that there are actually 
> more or less.
>
>
> If I forget about the above approach with inhibiting, then I need to 
> consider cases like:
> time>
> - 0 1 0 0 0 0 0 0
> first zero should be a single-scrape-failure, the last 6 however a
> target-down
> - 1 0 0 0 0 0 1 0 0 0 0 0 0
> same here, the first 5 should be a single-scrape-failure, the last 6
> however a target-down
> - 1 0 0 0 0 0 0 1 0 0 0 0 0 0
> here however, both should be target-down
> - 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0
> or
> 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0
> here, 2x target-down, 1x single-scrape-failure
>
>
>
>
> II) Using the original {min,max}_over_time approach:
> - min_over_time(up[1m]) == 0
> tells me, there was at least one missing scrape in the last 1m.
> but that alone would already be the case for the first zero:
> . . . . . 0
> so:
> - for: 1m
> was added (and the [1m] was enlarged)
> but this would still fire with
> 0 0 0 0 0 0 0
> which should however be a target-down
> so:
> - unless max_over_time(up[1m]) == 0
> was added to silence it then
> but that would still fail in e.g. the case when a previous
> target-down runs out:
> 0 0 0 0 0 0 -> target down
> the next is a 1
> 0 0 0 0 0 0 1 -> single-scrape-failure
> and some similar cases,
>
> Plus the usage of for: >0s is - in my special case - IMO fragile.
>
>
>
> III) So in my previous mail I came up with the idea of using:
> - alert: target-down expr: 'max_over_time( up[1m0s] ) == 0' for: 0s - 
> alert: single-scrape-failure expr: 'min_over_time(up[15s] offset 1m) == 0 
> unless max_over_time(up[1m0s]) 

Re: [prometheus-users] Re: better way to get notified about (true) single scrape failures?

2024-03-21 Thread Christoph Anton Mitterer

I've been looking into possible alternatives, based on the ideas given here.

I) First one completely different approach might be:
- alert: target-down
  expr: 'max_over_time( up[1m0s] ) == 0'
  for: 0s
and: (
- alert: single-scrape-failure
  expr: 'min_over_time( up[2m0s] ) == 0'
  for: 1m
or
- alert: single-scrape-failure
  expr: 'resets( up[2m0s] ) > 0'
  for: 1m
or perhaps even
- alert: single-scrape-failure
  expr: 'changes( up[2m0s] ) >= 2'
  for: 1m
(which would however behave a bit different, I guess)
)

plus an inhibit rule, that silences single-scrape-failure when
target-down fires.
The for: 1m is needed, so that target-down has a chance to fire
(and inhibit) before single-scrape-failure does.
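
For reference, the inhibition part would look roughly like this in
alertmanager.yml (a sketch; the matcher syntax assumes a reasonably recent
Alertmanager, and the equal labels are an assumption):

inhibit_rules:
  - source_matchers: [ 'alertname = target-down' ]
    target_matchers: [ 'alertname = single-scrape-failure' ]
    equal: [ 'instance', 'job' ]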

I'm not really sure whether that works in all cases, though,
especially since I look back much more (and the additional time
span further back may undesirably trigger again).


Using for: > 0 seems generally a bit fragile for my use-case (because I 
want to capture even single scrape failures, but with for: > 0 I need it to 
have at least two evaluations to actually trigger, so my evaluation period 
must be small enough that it runs >= 2 times during the scrape interval).

Also, I guess the scrape intervals and the evaluation intervals are not 
synced, so with for: 0s, when I look back e.g. [1m] and assume a certain 
number of samples in that range, there may actually be more or fewer.


If I forget about the above approach with inhibiting, then I need to 
consider cases like:
time>
- 0 1 0 0 0 0 0 0
first zero should be a single-scrape-failure, the last 6 however a
target-down
- 1 0 0 0 0 0 1 0 0 0 0 0 0
same here, the first 5 should be a single-scrape-failure, the last 6
however a target-down
- 1 0 0 0 0 0 0 1 0 0 0 0 0 0
here however, both should be target-down
- 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0
or
1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0
here, 2x target-down, 1x single-scrape-failure




II) Using the original {min,max}_over_time approach:
- min_over_time(up[1m]) == 0
tells me, there was at least one missing scrape in the last 1m.
but that alone would already be the case for the first zero:
. . . . . 0
so:
- for: 1m
was added (and the [1m] was enlarged)
but this would still fire with
0 0 0 0 0 0 0
which should however be a target-down
so:
- unless max_over_time(up[1m]) == 0
was added to silence it then
but that would still fail in e.g. the case when a previous
target-down runs out:
0 0 0 0 0 0 -> target down
the next is a 1
0 0 0 0 0 0 1 -> single-scrape-failure
and some similar cases,

Plus the usage of for: >0s is - in my special case - IMO fragile.



III) So in my previous mail I came up with the idea of using:
- alert: target-down
  expr: 'max_over_time( up[1m0s] ) == 0'
  for: 0s
- alert: single-scrape-failure
  expr: 'min_over_time(up[15s] offset 1m) == 0
           unless max_over_time(up[1m0s]) == 0
           unless max_over_time(up[1m0s] offset 1m10s) == 0
           unless max_over_time(up[1m0s] offset 1m) == 0
           unless max_over_time(up[1m0s] offset 50s) == 0
           unless max_over_time(up[1m0s] offset 40s) == 0
           unless max_over_time(up[1m0s] offset 30s) == 0
           unless max_over_time(up[1m0s] offset 20s) == 0
           unless max_over_time(up[1m0s] offset 10s) == 0'
  for: 0m
The idea was that, when I don't use for: >0s, the first time
window where one can really be sure (in all cases) whether
it's a single-scrape-failure or a target-down is a 0 in -70s to
-60s:

         -130s -120s -110s -100s -90s  -80s  -70s  -60s  -50s  -40s  -30s  -20s  -10s  0s/now
         |     |     |     |     |     |     |  0  |     |     |     |     |     |     |
case 1:  |     |     |     |     |     |     |     |     |     |     |  1  |  0  |  1  |
case 2:  |     |     |     |     |     |     |  0  |  0  |  0  |  0  |  0  |  0  |  0  |
case 3:  |     |     |     |  1  |  0  |  0  |  0  |  0  |  0  |  0  |  0  |  1  |  1  |

In case 1 it would already be clear while the zero is between -20s
and -10s.
But if there's a sequence of zeros, it takes until the zero is between
-70s and -60s before it becomes clear.

Now the zero in that time span could also be that of a target-down
sequence of zeros like in case 3.
For these cases, I had the shifted silencers that each looked over
1m.

Looked good at first, though there were some open questions.
At least one main problem remained, namely that it would fail in e.g. this
case:

         -130s -120s -110s -100s -90s  -80s  -70s  -60s  -50s  -40s  -30s  -20s  -10s  0s/now
case 8a: |  1  |  1  |  1  |  1  |  1  |  1  | 0 1 |  0  |  0  |  0  |  0  |  0  |  0  |

The zero between -70s and -60s would be noticed, but would still be
silenced (by the run of zeros that follows it), because the 1 right next
to it would not prevent that.




Chris Siebenmann suggested using resets()... and keep_firing_for:, which 
Ben Kochie suggested, too.

First I didn't quite understand how the latter would help me? Maybe I have 
the wrong mindset for it, so could you guys please explain what your idea 
was with keep_firing_for:?




IV) resets() sounded promising at first, but while I tried quite some
variations, I wasn't able to get anything working.
First, something like
resets(up[1m]) >= 1
alone (with or without a for: >0s) would already fire in case of:
time>
1 0
which still could become a target-down but also in case of:
1 0 0 0 0 0 0
which is a target down.
And I think even 

Re: [prometheus-users] Re: better way to get notified about (true) single scrape failures?

2024-03-18 Thread Ben Kochie
I usually recommend throwing out any "But this is how Icinga does it"
thinking.

The way we do things in Prometheus for this kind of thing is to simply
think about "availability".

For any scrape failures:

avg_over_time(up[5m]) < 1

For more than one scrape failure (assuming 15s intervals)

avg_over_time(up[5m]) < 0.95

This is a much easier way to think about "uptime".
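
To spell out the arithmetic behind the 0.95 (assuming the 15s interval from
above): 5m / 15s = 20 samples per window, so a single failed scrape gives
avg_over_time(up[5m]) = 19/20 = 0.95. Hence:

avg_over_time(up[5m]) < 1      # at least one failed scrape in the last 5m
avg_over_time(up[5m]) < 0.95   # more than one failed scrape in the last 5m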

Also, if you want, there is the new "keep_firing_for" alerting option.

On Mon, Mar 18, 2024 at 5:45 AM Christoph Anton Mitterer wrote:

> Hey Chris.
>
> On Sun, 2024-03-17 at 22:40 -0400, Chris Siebenmann wrote:
> >
> > One thing you can look into here for detecting and counting failed
> > scrapes is resets(). This works perfectly well when applied to a
> > gauge
>
> Though it is documented as to be only used with counters... :-/
>
>
> > that is 1 or 0, and in this case it will count the number of times
> > the
> > metric went from 1 to 0 in a particular time interval. You can
> > similarly
> > use changes() to count the total number of transitions (either 1->0
> > scrape failures or 0->1 scrapes starting to succeed after failures).
>
> The idea sounds promising... especially to also catch cases like that
> 8a, I've mentioned in my previous mail and where the
> {min,max}_over_time approach seems to fail.
>
>
> > It may also be useful to multiply the result of this by the current
> > value of the metric, so for example:
> >
> >   resets(up{..}[1m]) * up{..}
> >
> > will be non-zero if there have been some number of scrape failures
> > over
> > the past minute *but* the most recent scrape succeeded (if that
> > scrape
> > failed, you're multiplying resets() by zero and getting zero). You
> > can
> > then wrap this in an '(...) > 0' to get something you can maybe use
> > as
> > an alert rule for the 'scrapes failed' notification. You might need
> > to
> > make the range for resets() one step larger than you use for the
> > 'target-down' alert, since resets() will also be zero if up{...} was
> > zero all through its range.
> >
> > (At this point you may also want to look at the alert
> > 'keep_firing_for'
> > setting.)
>
> I will give that some more thinking and reply back if I should find
> some way to make an alert out of this.
>
> Well and probably also if I fail to ^^ ... at least at a first glance I
> wasn't able to use that to create and alert that would behave as
> desired. :/
>
>
> > However, my other suggestion here would be that this notification or
> > count of failed scrapes may be better handled as a dashboard or a
> > periodic report (from a script) instead of through an alert,
> > especially
> > a fast-firing alert.
>
> Well the problem with a dashboard would IMO be, that someone must
> actually look at it or otherwise it would be pointless. ;-)
>
> Not really sure how to do that with a script (which I guess would be
> conceptually similar to an alert... just that it's sent e.g. weekly).
>
> I guess I'm not so much interested in the exact times, when single
> scrapes fail (I cannot correct it retrospectively anyway) but just
> *that* it happens and that I have to look into it.
>
> My assumption kinda is, that normally scrapes aren't lost. So I would
> really only get an alert mail if something's wrong.
> And even if the alert is flaky, like in 1 0 1 0 1 0, I think it could
> still reduce mail but on the alertmanager level?
>
>
> > I think it will be relatively difficult to make an
> > alert give you an accurate count of how many times this happened; if
> > you
> > want such a count to make decisions, a dashboard (possibly
> > visualizing
> > the up/down blips) or a report could be better. A program is also in
> > the
> > position to extract the raw up{...} metrics (with timestamps) and
> > then
> > readily analyze them for things like how long the failed scrapes tend
> > to
> > last for, how frequently they happen, etc etc.
>
> Well that sounds to be quite some effort... and I already think that my
> current approaches required far too much of an effort (and still don't
> fully work ^^).
> As said... despite not really being comparable to Prometheus: in
> Incinga a failed sensor probe would be immediately noticeable.
>
>
> Thanks,
> Chris.
>


Re: [prometheus-users] Re: better way to get notified about (true) single scrape failures?

2024-03-17 Thread Christoph Anton Mitterer
Hey Chris.

On Sun, 2024-03-17 at 22:40 -0400, Chris Siebenmann wrote:
> 
> One thing you can look into here for detecting and counting failed
> scrapes is resets(). This works perfectly well when applied to a
> gauge

Though it is documented as to be used only with counters... :-/


> that is 1 or 0, and in this case it will count the number of times
> the
> metric went from 1 to 0 in a particular time interval. You can
> similarly
> use changes() to count the total number of transitions (either 1->0
> scrape failures or 0->1 scrapes starting to succeed after failures).

The idea sounds promising... especially to also catch cases like that
8a, I've mentioned in my previous mail and where the
{min,max}_over_time approach seems to fail.


> It may also be useful to multiply the result of this by the current
> value of the metric, so for example:
> 
>   resets(up{..}[1m]) * up{..}
> 
> will be non-zero if there have been some number of scrape failures
> over
> the past minute *but* the most recent scrape succeeded (if that
> scrape
> failed, you're multiplying resets() by zero and getting zero). You
> can
> then wrap this in an '(...) > 0' to get something you can maybe use
> as
> an alert rule for the 'scrapes failed' notification. You might need
> to
> make the range for resets() one step larger than you use for the
> 'target-down' alert, since resets() will also be zero if up{...} was
> zero all through its range.
> 
> (At this point you may also want to look at the alert
> 'keep_firing_for'
> setting.)

I will give that some more thinking and reply back if I should find
some way to make an alert out of this.

Well, and probably also if I fail to ^^ ... at least at a first glance I
wasn't able to use that to create an alert that would behave as
desired. :/


> However, my other suggestion here would be that this notification or
> count of failed scrapes may be better handled as a dashboard or a
> periodic report (from a script) instead of through an alert,
> especially
> a fast-firing alert.

Well the problem with a dashboard would IMO be, that someone must
actually look at it or otherwise it would be pointless. ;-)

Not really sure how to do that with a script (which I guess would be
conceptually similar to an alert... just that it's sent e.g. weekly).

I guess I'm not so much interested in the exact times, when single
scrapes fail (I cannot correct it retrospectively anyway) but just
*that* it happens and that I have to look into it.

My assumption kinda is, that normally scrapes aren't lost. So I would
really only get an alert mail if something's wrong.
And even if the alert is flaky, like in 1 0 1 0 1 0, I think the mails could
still be reduced, but on the Alertmanager level?


> I think it will be relatively difficult to make an
> alert give you an accurate count of how many times this happened; if
> you
> want such a count to make decisions, a dashboard (possibly
> visualizing
> the up/down blips) or a report could be better. A program is also in
> the
> position to extract the raw up{...} metrics (with timestamps) and
> then
> readily analyze them for things like how long the failed scrapes tend
> to
> last for, how frequently they happen, etc etc.

Well that sounds to be quite some effort... and I already think that my
current approaches required far too much of an effort (and still don't
fully work ^^).
As said... despite not really being comparable to Prometheus: in
Icinga a failed sensor probe would be immediately noticeable.


Thanks,
Chris.



Re: [prometheus-users] Re: better way to get notified about (true) single scrape failures?

2024-03-17 Thread Chris Siebenmann
> As a reminder, my goal was:
> - if e.g. scrapes fail for 1m, a target-down alert shall fire (similar to
>   how Icinga would put the host into down state, after pings failed or a
>   number of seconds)
> - but even if a single scrape fails (which alone wouldn't trigger the above
>   alert) I'd like to get a notification (telling me, that something might be
>   fishy with the networking or so), that is UNLESS that single failed scrape
>   is part of a sequence of failed scrapes that also caused / will cause the
>   above target-down alert
>
> Assuming in the following, each number is a sample value with ~10s distance 
> for
> the `up` metric of a single host, with the most recent one being the 
> right-most:
> - 1 1 1 1 1 1 1 => should give nothing
> - 1 1 1 1 1 1 0 => should NOT YET give anything (might be just a single 
> failure,
>or develop into the target-down alert)
> - 1 1 1 1 1 0 0 => same as above, not clear yet
> ...
> - 1 0 0 0 0 0 0 => here it's clear, this is a target-down alert

One thing you can look into here for detecting and counting failed
scrapes is resets(). This works perfectly well when applied to a gauge
that is 1 or 0, and in this case it will count the number of times the
metric went from 1 to 0 in a particular time interval. You can similarly
use changes() to count the total number of transitions (either 1->0
scrape failures or 0->1 scrapes starting to succeed after failures).
It may also be useful to multiply the result of this by the current
value of the metric, so for example:

resets(up{..}[1m]) * up{..}

will be non-zero if there have been some number of scrape failures over
the past minute *but* the most recent scrape succeeded (if that scrape
failed, you're multiplying resets() by zero and getting zero). You can
then wrap this in an '(...) > 0' to get something you can maybe use as
an alert rule for the 'scrapes failed' notification. You might need to
make the range for resets() one step larger than you use for the
'target-down' alert, since resets() will also be zero if up{...} was
zero all through its range.

(At this point you may also want to look at the alert 'keep_firing_for'
setting.)
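
Put together as an alert rule, that might look roughly like this (a sketch,
not something we run ourselves; the 1m10s range is the "one step larger"
mentioned above, and keep_firing_for is optional):

- alert: single-scrape-failure
  expr: '(resets(up[1m10s]) * up) > 0'
  for: 0s
  keep_firing_for: 5m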

However, my other suggestion here would be that this notification or
count of failed scrapes may be better handled as a dashboard or a
periodic report (from a script) instead of through an alert, especially
a fast-firing alert. I think it will be relatively difficult to make an
alert give you an accurate count of how many times this happened; if you
want such a count to make decisions, a dashboard (possibly visualizing
the up/down blips) or a report could be better. A program is also in the
position to extract the raw up{...} metrics (with timestamps) and then
readily analyze them for things like how long the failed scrapes tend to
last for, how frequently they happen, etc etc.

- cks
PS: This is not my clever set of tricks, I got it from other people.
