Hi Stuart, Julien and Ben,
Hope you don't mind that I answer all three replies in one... don't
wanna spam the list ;-)
On Tue, 2023-02-21 at 07:31 +, Stuart Clark wrote:
> Prometheus itself cannot do downsampling, but other related projects
> such as Cortex & Thanos have such features.
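If one goes the Thanos route, downsampling and per-resolution retention are handled by the compactor. A rough sketch (the retention values and the object-store config file are placeholders, not a recommendation):

```
thanos compact \
  --data-dir=/var/lib/thanos/compact \
  --objstore.config-file=bucket.yml \
  --retention.resolution-raw=14d \
  --retention.resolution-5m=90d \
  --retention.resolution-1h=365d
```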
Hey Brian
On Tue, 2023-02-28 at 00:27 -0800, Brian Candler wrote:
>
> I can offer a couple more options:
>
> (1) Use two servers with federation.
> - server 1 does the scraping and keeps the detailed data for 2 weeks
> - server 2 scrapes server 1 at lower interval, using the federation
>
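Brian's option (1) could roughly look like this on server 2 (job name, interval and target address are placeholders):

```yaml
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 5m           # the coarser, long-term interval
    honor_labels: true            # keep server 1's original instance/job labels
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~".+"}'           # pulls everything; narrow down as needed
    static_configs:
      - targets: ['server1:9090']
```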
On Tue, 2023-02-28 at 10:25 +0100, Ben Kochie wrote:
>
> Debian release cycles are too slow for the pace of Prometheus
> development.
It's rather simple to pull the version from Debian unstable, if one
needs to, and that seems pretty current.
> You'd be better off running Prometheus using
Hey.
I wondered whether one can do with Prometheus something similar to what
is possible with systems using RRD (e.g. Ganglia).
Depending on the kind of metrics, like for those from the node exporter,
one may want a very high sample resolution (and thus short scraping
interval) for like the last
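One way to get something RRD-like with plain Prometheus is to pre-aggregate via recording rules and keep only those series long-term (group and rule names below are invented for illustration):

```yaml
groups:
  - name: downsample_5m
    interval: 5m
    rules:
      # coarse 5m CPU utilisation, derived from the raw high-resolution samples
      - record: instance:node_cpu_utilisation:avg5m
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
```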
Hey.
I have some trouble understanding how to do things right™ with respect
to alerting.
In principle I'd like to do two things:
a) have certain alert rules run only for certain instances
(though that may in practice actually be less needed, when only the
respective nodes would generate
On Wednesday, April 26, 2023 at 9:14:35 AM UTC+2 Brian Candler wrote:
> I guess with (2) you also meant having a route which is then permanently
muted?
I'd use a route with a null receiver (i.e. a receiver which has no
_configs under it)
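In alertmanager.yml that could look roughly like this (receiver and matcher names are placeholders):

```yaml
receivers:
  - name: default        # real notification configs omitted here
  - name: 'null'         # no *_configs under it, so matching alerts go nowhere
route:
  receiver: default
  routes:
    - receiver: 'null'
      matchers:
        - alertname = "SomeAlertToMute"
```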
Ah, interesting. It wasn't even clear to me from the
Hey again.
On Wednesday, April 26, 2023 at 9:35:32 AM UTC+2 Brian Candler wrote:
> expr: up{job="myjob"} == 1 unless my_metric
Beware: that will only work if the labels on both 'up' and
'my_metric' match exactly. If they don't, you can either use on(...)
to specify the set
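For example, if 'my_metric' carries extra labels beyond those on 'up', restricting the matching explicitly might look like this (untested sketch):

```
up{job="myjob"} == 1 unless on(instance, job) my_metric
```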
Hey Brian
On Tuesday, April 25, 2023 at 9:59:12 AM UTC+2 Brian Candler wrote:
So really I'd divide the possibilities 3 ways:
a. Prevent the alert being generated from prometheus in the first place, by
writing the expr in such a way that it filters out conditions that you
don't want to alert
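A minimal sketch of (a), with an invented label not from the thread:

```
# instances carrying env="lab" are filtered out before the alert can fire
expr: up{env!="lab"} == 0
```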
On Tuesday, April 25, 2023 at 9:32:25 AM UTC+2 Brian Candler wrote:
I think you would have basically the same problem with Icinga unless you
have configured Icinga with a list of RAID controllers which should be
present on a given device, or a list of drives which should be present in a
Hey there.
What I'm trying to do is basically replace Icinga with Prometheus (well,
not really replacing it, but integrating its checks into Prometheus, which
I need anyway for other purposes).
So I'll have e.g. some metric that shows me the RAID status on
instances, and I want to get an alert, when a
Hey.
I wondered whether the following is possible with Prometheus. I'm basically
thinking about phasing out Icinga and doing all alerting in Prometheus.
For checks that are clearly metrics based (like load or free disk space)
this seems rather easy.
But what about any checks that are not
Hey Ben.
On Saturday, February 11, 2023 at 11:18:44 AM UTC+1 Ben Kochie wrote:
You combine this with an "info" metric that tells you about the rest of the
device.
Ah,... and I assume that one could just also export these info metrics
alongside e.g. node_md_state?
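For reference, such a join might look like this (untested sketch, assuming the node exporter's node_uname_info as the info metric):

```
node_md_state * on(instance) group_left(nodename) node_uname_info
```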
Thanks :-)
Chris.
--
You
Hey Brian.
On Tuesday, May 9, 2023 at 9:55:22 AM UTC+2 Brian Candler wrote:
That's tricky to get exactly right. You could try something like this
(untested):
expr: min_over_time(up[5m]) == 0 unless max_over_time(up[5m]) == 0
for: 5m
- min_over_time will be 0 if any single scrape
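Put together as a complete rule (untested, per Brian's caveat; the alert name is made up):

```yaml
- alert: target-flapping
  expr: min_over_time(up[5m]) == 0 unless max_over_time(up[5m]) == 0
  for: 5m
```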
Hey.
I have an alert rule like this:
groups:
  - name: alerts_general
    rules:
      - alert: general_target-down
        expr: 'up == 0'
        for: 5m
which is intended to notify about a target instance (respectively a
specific exporter on that) being down.
There are also routes in
Hey Brian
On Wednesday, May 10, 2023 at 9:03:36 AM UTC+2 Brian Candler wrote:
It depends on the exact semantics of "for". e.g. take a simple case of 1
minute rule evaluation interval. If you apply "for: 1m" then I guess that
means the alert must be firing for two successive evaluations
Hey there.
I eventually got back to this and I'm still fighting this problem.
As a reminder, my goal was:
- if e.g. scrapes fail for 1m, a target-down alert shall fire (similar to
how Icinga would put a host into down state after pings have failed for a
number of seconds)
- but even if a single
Hey Chris.
On Sun, 2024-03-17 at 22:40 -0400, Chris Siebenmann wrote:
>
> One thing you can look into here for detecting and counting failed
> scrapes is resets(). This works perfectly well when applied to a
> gauge
Though it's documented as being meant only for counters... :-/
> that is 1
I've been looking into possible alternatives, based on the ideas given here.
I) First, one completely different approach might be:
- alert: target-down
  expr: 'max_over_time( up[1m0s] ) == 0'
  for: 0s
and: (
- alert: single-scrape-failure
  expr: 'min_over_time( up[2m0s] ) == 0'
  for: 1m
or
- alert:
Hey.
I noticed a somewhat unexpected behaviour, perhaps someone can explain why
this happens.
- on a Prometheus instance, with a scrape interval of 10s
- doing the following queries via curl from the same node where Prometheus
runs (so there cannot be any different system times or so
Looking
Hey.
On Friday, April 5, 2024 at 7:10:29 AM UTC+2 Ben Kochie wrote:
If the jitter is > 0.002, the real value is stored.
Interesting... though I guess bad for my solution in the other thread,
where I make the assumption that it's guaranteed that samples are always
exactly on point with the
Hey Chris.
On Thursday, April 4, 2024 at 8:41:02 PM UTC+2 Chris Siebenmann wrote:
> - The evaluation interval is sufficiently less than the scrape
> interval, so that it's guaranteed that none of the `up`-samples are
> being missed.
I assume you were referring to the above specific point?
Hey.
I have some simple alerts like:
- alert: node_upgrades_non-security_apt
  expr: 'sum by (instance,job) (
    apt_upgrades_pending{origin!~"(?i)^.*-security(?:\\PL.*)?$"} )'
- alert: node_upgrades_security_apt
  expr: 'sum by (instance,job) (
Hey Brian.
On Saturday, April 6, 2024 at 9:33:27 AM UTC+2 Brian Candler wrote:
> but AFAIU that would simply affect all alerts, i.e. it wouldn't just keep
firing, when the scraping failed, but also when it actually goes back to an
ok state, right?
It affects all alerts individually, and I
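For completeness, keep_firing_for is set per alerting rule, e.g. (a sketch based on the rule discussed earlier in the thread; requires Prometheus >= 2.42):

```yaml
- alert: general_target-down
  expr: 'up == 0'
  for: 5m
  keep_firing_for: 5m   # alert stays firing this long after expr stops matching
```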
On Monday, April 8, 2024 at 11:05:41 PM UTC+2 Brian Candler wrote:
On Monday 8 April 2024 at 20:57:34 UTC+1 Christoph Anton Mitterer wrote:
But for Prometheus, with keep_firing_for, it will be like the same alert.
If the alerts have the exact same set of labels (e.g. the alert
changes in my config git
below.
Thanks to everyone for helping me with that :-)
Best wishes,
Chris.
(needs a mono-spaced font to work out nicely)
TL/DR:
-
commit f31f3c656cae4aeb79ce4bfd1782a624784c1c43
Author: Christoph Anton Mitterer
Date: Mon
Hey Chris, Brian.
Thanks for your replies/confirmations.
On Sunday, March 24, 2024 at 8:16:14 AM UTC+1 Ben Kochie wrote:
Yup, this is correct. Prometheus sets the timestamp of the sample at the
start of the scrape. But since it's an ACID compliant database, the data is
not queryable until