Re: [prometheus-users] Assistance Needed with Prometheus and Alertmanager Configuration

2024-03-30 Thread 'Brian Candler' via Prometheus Users
Only you can determine that, by comparing the lists of alerts from both 
sides and seeing what differs, and looking into how they are generated and 
measured. There are all kinds of things which might affect this, e.g. 
pending/keep_firing_for alerts, group wait etc.

But you might also want to read this:
https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit

If you're generating more than a handful of alerts per day, then maybe you 
need to reconsider what constitutes an "alert".
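
As a concrete starting point for that comparison (assuming Alertmanager is 
listening on localhost:9093; adjust for your setup):

    # In the Prometheus expression browser: count the alert instances that
    # are actually firing. Pending alerts are never sent to Alertmanager.
    count(ALERTS{alertstate="firing"})

    # On the Alertmanager side, list what it currently holds, e.g. with amtool:
    amtool alert query --alertmanager.url=http://localhost:9093

The pending/firing distinction alone can account for a large part of the gap.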

On Saturday 30 March 2024 at 09:49:04 UTC Trio Official wrote:

> Thank you for your prompt response and guidance on addressing the metric 
> staleness issue.
>
> Regarding metric staleness: I confirm that I have already implemented the 
> approach of using square brackets (range vectors) in the recording and 
> alerting rules (e.g. max_over_time(metric[1h])). However, the main challenge 
> persists: a discrepancy between the number of alerts generated by Prometheus 
> and those displayed in Alertmanager. 
>
> To illustrate: in Prometheus I may see approximately 25,000 alerts 
> triggered within a given period, but when I review the corresponding alerts 
> in Alertmanager the count often deviates significantly, showing figures 
> such as 10,000 or 18,000 rather than the expected 25,000.
>
> This inconsistency poses a significant challenge in our alert management 
> process, leading to confusion and potentially overlooking critical alerts.
>
> I would greatly appreciate any further insights or recommendations you may 
> have to address this issue and ensure alignment between Prometheus and 
> Alertmanager in terms of the number of alerts generated and displayed.
> On Saturday, March 30, 2024 at 2:29:42 PM UTC+5:30 Brian Candler wrote:
>
>> On Friday 29 March 2024 at 22:09:18 UTC Chris Siebenmann wrote:
>>
>> I believe that recording rules and alerting rules may similarly have 
>> their evaluation happen at different offsets within their evaluation 
>> interval. This is done for a similar reason: to spread the internal load 
>> of rule evaluations out across time.
>>
>>
>> I think it's more accurate to say that *rule groups* are spread over 
>> their evaluation interval, and rules within the same rule group are 
>> evaluated sequentially.
>>
>> This is how you can build rules that depend on each other, e.g. a recording 
>> rule followed by other rules that depend on its output; put them in the 
>> same rule group.
>>
>> As for scraping: you *can* change this staleness interval 
>> using --query.lookback-delta, but doing so is strongly discouraged. With 
>> the default of 5 minutes, you should use a maximum scrape interval of 
>> 2 minutes, so that even if you miss one scrape for some random reason you 
>> still have two points within the lookback-delta and the time series does 
>> not go stale.
>>
>> There's no good reason to scrape at one hour intervals:
>> * Prometheus is extremely efficient with its storage compression, 
>> especially when adjacent data points are equal, so scraping the same value 
>> every 2 minutes is going to use hardly any more storage than scraping it 
>> every hour.
>> * If you're worried about load on the exporter because responding to a 
>> scrape is slow or expensive, then you should run the exporter every hour 
>> from a local cronjob, and write its output to a persistent location (e.g. 
>> to PushGateway or statsd_exporter, or simply write it to a file which can 
>> be picked up by node_exporter textfile-collector or even a vanilla HTTP 
>> server).  You can then scrape this as often as you like.
>>
>> The node_exporter textfile collector exposes an extra metric with the 
>> modification timestamp of each file, so you can alert if the file isn't 
>> being updated.
>>
>>



Re: [prometheus-users] Assistance Needed with Prometheus and Alertmanager Configuration

2024-03-30 Thread Trio Official


Thank you for your prompt response and guidance on addressing the metric 
staleness issue.

Regarding metric staleness: I confirm that I have already implemented the 
approach of using square brackets (range vectors) in the recording and 
alerting rules (e.g. max_over_time(metric[1h])). However, the main challenge 
persists: a discrepancy between the number of alerts generated by Prometheus 
and those displayed in Alertmanager. 
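
For reference, the pattern I mean is roughly (metric name and threshold are 
placeholders):

    - alert: SomethingTooHigh
      expr: max_over_time(metric[1h]) > 100
      for: 5m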

To illustrate: in Prometheus I may see approximately 25,000 alerts 
triggered within a given period, but when I review the corresponding alerts 
in Alertmanager the count often deviates significantly, showing figures 
such as 10,000 or 18,000 rather than the expected 25,000.

This inconsistency poses a significant challenge in our alert management 
process, leading to confusion and potentially overlooking critical alerts.

I would greatly appreciate any further insights or recommendations you may 
have to address this issue and ensure alignment between Prometheus and 
Alertmanager in terms of the number of alerts generated and displayed.
On Saturday, March 30, 2024 at 2:29:42 PM UTC+5:30 Brian Candler wrote:

> On Friday 29 March 2024 at 22:09:18 UTC Chris Siebenmann wrote:
>
> I believe that recording rules and alerting rules may similarly have 
> their evaluation happen at different offsets within their evaluation 
> interval. This is done for a similar reason: to spread the internal load 
> of rule evaluations out across time.
>
>
> I think it's more accurate to say that *rule groups* are spread over 
> their evaluation interval, and rules within the same rule group are 
> evaluated sequentially.
>
> This is how you can build rules that depend on each other, e.g. a recording 
> rule followed by other rules that depend on its output; put them in the 
> same rule group.
>
> As for scraping: you *can* change this staleness interval 
> using --query.lookback-delta, but doing so is strongly discouraged. With 
> the default of 5 minutes, you should use a maximum scrape interval of 
> 2 minutes, so that even if you miss one scrape for some random reason you 
> still have two points within the lookback-delta and the time series does 
> not go stale.
>
> There's no good reason to scrape at one hour intervals:
> * Prometheus is extremely efficient with its storage compression, 
> especially when adjacent data points are equal, so scraping the same value 
> every 2 minutes is going to use hardly any more storage than scraping it 
> every hour.
> * If you're worried about load on the exporter because responding to a 
> scrape is slow or expensive, then you should run the exporter every hour 
> from a local cronjob, and write its output to a persistent location (e.g. 
> to PushGateway or statsd_exporter, or simply write it to a file which can 
> be picked up by node_exporter textfile-collector or even a vanilla HTTP 
> server).  You can then scrape this as often as you like.
>
> The node_exporter textfile collector exposes an extra metric with the 
> modification timestamp of each file, so you can alert if the file isn't 
> being updated.
>
>



Re: [prometheus-users] Assistance Needed with Prometheus and Alertmanager Configuration

2024-03-30 Thread 'Brian Candler' via Prometheus Users
On Friday 29 March 2024 at 22:09:18 UTC Chris Siebenmann wrote:

I believe that recording rules and alerting rules may similarly have 
their evaluation happen at different offsets within their evaluation 
interval. This is done for a similar reason: to spread the internal load 
of rule evaluations out across time.


I think it's more accurate to say that *rule groups* are spread over 
their evaluation interval, and rules within the same rule group are 
evaluated sequentially.

This is how you can build rules that depend on each other, e.g. a recording 
rule followed by other rules that depend on its output; put them in the 
same rule group.
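
A minimal sketch of that shape (rule and metric names invented for 
illustration):

    groups:
      - name: capacity
        interval: 2m
        rules:
          # The recording rule is evaluated first...
          - record: job:filesystem_avail_ratio:min
            expr: min by (job) (node_filesystem_avail_bytes / node_filesystem_size_bytes)
          # ...and the alerting rule runs next in the same group, so it can
          # rely on the freshly recorded series.
          - alert: FilesystemAlmostFull
            expr: job:filesystem_avail_ratio:min < 0.10
            for: 15m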

As for scraping: you *can* change this staleness interval 
using --query.lookback-delta, but doing so is strongly discouraged. With 
the default of 5 minutes, you should use a maximum scrape interval of 
2 minutes, so that even if you miss one scrape for some random reason you 
still have two points within the lookback-delta and the time series does 
not go stale.
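
In configuration terms, roughly (a sketch, not a complete config):

    # prometheus.yml
    global:
      scrape_interval: 2m   # keeps at least two samples inside the default 5m lookback window

    # Changing the window itself is possible but discouraged, e.g.:
    #   prometheus --query.lookback-delta=10m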

There's no good reason to scrape at one hour intervals:
* Prometheus is extremely efficient with its storage compression, 
especially when adjacent data points are equal, so scraping the same value 
every 2 minutes is going to use hardly any more storage than scraping it 
every hour.
* If you're worried about load on the exporter because responding to a 
scrape is slow or expensive, then you should run the exporter every hour 
from a local cronjob, and write its output to a persistent location (e.g. 
to PushGateway or statsd_exporter, or simply write it to a file which can 
be picked up by node_exporter textfile-collector or even a vanilla HTTP 
server).  You can then scrape this as often as you like.
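
A sketch of the textfile route (paths and the exporter command are 
placeholders):

    #!/bin/sh
    # Run hourly from cron, e.g.:  0 * * * * /usr/local/bin/collect_expensive_metrics
    # Write to a temp file and rename, so node_exporter never serves a
    # half-written file.
    dir=/var/lib/node_exporter/textfile
    /usr/local/bin/expensive_exporter > "$dir/expensive.prom.tmp" \
        && mv "$dir/expensive.prom.tmp" "$dir/expensive.prom"

    # node_exporter reads that directory when started with:
    #   node_exporter --collector.textfile.directory=/var/lib/node_exporter/textfile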

The node_exporter textfile collector exposes an extra metric with the 
modification timestamp of each file, so you can alert if the file isn't 
being updated.
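
For example, something like this in a rule group (assuming the file is 
called expensive.prom and should be refreshed hourly):

    - alert: TextfileMetricsStale
      expr: time() - node_textfile_mtime_seconds{file="expensive.prom"} > 2 * 3600
      for: 15m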



Re: [prometheus-users] Assistance Needed with Prometheus and Alertmanager Configuration

2024-03-29 Thread Chris Siebenmann
> I am encountering challenges with configuring Prometheus and Alertmanager 
> for my application's alarm system. Below are the configurations I am 
> currently using:
>
> *prometheus.yml:* 
>
> Scrape Interval: 1h

This scrape interval is far too high. Although it's not well documented,
you can't set scrape_interval higher than two or three minutes without
causing seriously weird issues, where your rules may not see metrics
because Prometheus considers the metrics stale. Prometheus considers
metrics stale if the most recent sample is more than five minutes old;
this time is not adjustable as far as I know. I believe you've already
seen signs of this in your other problems, but really, as far as I
know, such a configuration basically isn't supported.

(In my view this is such a problem that Prometheus should at least
require a forced 'I know what I'm doing, really I want this' command
line option to accept a scrape interval that's larger than the staleness
interval, or maybe even within ten seconds or so of it.)

I believe that all rule evaluation intervals similarly need to be no
more than five minutes because of the stale metrics issue, since both
recording rules and alerting rules generate metrics (the recording rules
generate their metrics in an obvious way, the alerting rules generate
ALERTS metrics and some other ones). It's possible that alerts don't go
stale inside Prometheus despite their metrics going stale, but I
wouldn't count on this.
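
For example, alert rule evaluation shows up in queries as synthetic series
like these (the alert name is just an example):

    ALERTS{alertname="HighErrorRate", alertstate="pending"}
    ALERTS{alertname="HighErrorRate", alertstate="firing"}
    ALERTS_FOR_STATE{alertname="HighErrorRate"}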

(Although it's possible that metrics from recording and/or alert rules
are special and are exempted from staleness, I would be surprised.)

Prometheus scrapes different targets at different offsets within their
scrape interval, so you can't synchronize scrapes and rule evaluations
the way you apparently want to. The time offset for any particular
scrape target is deterministic but not predictable (and it may change
between eg Prometheus releases, or even on a Prometheus restart).
Prometheus does this to spread out the load of scraping more or less
evenly across the scrape interval, rather than descending on all targets
simultaneously every X seconds or minutes.

I believe that recording rules and alerting rules may similarly have
their evaluation happen at different offsets within their evaluation
interval. This is done for a similar reason: to spread the internal
load of rule evaluations out across time.

- cks
