[prometheus-users] Re: job label missing from discoveredLabels (prometheus v2.42.0)

2024-05-31 Thread 'Brian Candler' via Prometheus Users
I don't see this with v2.45.5, and I'm also concerned about why "app": 
"another-testapp" occurs in one of your discoveredLabels.

I suggest you try that, and/or the latest v2.52.1 (you can of course set 
up a completely separate instance but point it to the same service 
discovery source) and see if you can replicate the issue. Also check the 
changelogs and git history to see if there's anything relevant there.
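
One quick way to compare what each instance reports is to dump just the job-related 
fields from the targets API (a sketch only; adjust the host/port for your setup, and 
it assumes jq is available):

curl -s http://localhost:9090/api/v1/targets \
  | jq '.data.activeTargets[] | {scrapePool, discovered_job: .discoveredLabels.job, job: .labels.job}'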

On Friday 31 May 2024 at 20:22:00 UTC+1 Vu Nguyen wrote:

> Hi all,
>
> We have a test code that reads target metadata info and job label name 
> from `discoveredLabels` list. That list is included in the response we get 
> from '/api/v1/targets' endpoint.
>
> During the test, we noticed that the response from target endpoint is 
> inconsistent: the job label sometimes is missing from `discoveredLabels` 
> for a few discovered targets. 
>
> The output below is what I extracted from our deployment: the first target 
> has the job label in its discovered labels, but it is missing from the second 
> target, which is in the same scrape config job.
>
> {
>   "status" : "success",
>   "data" : {
>     "activeTargets" : [ {
>       "discoveredLabels" : {
>         "__meta_kubernetes_pod_phase" : "Running",
>         "__meta_kubernetes_pod_ready" : "true",
>         "__meta_kubernetes_pod_uid" : "a7b4cce2-1be7-4df9-a032-c7a51bb655db",
>         "__metrics_path__" : "/metrics",
>         "__scheme__" : "http",
>         "__scrape_interval__" : "15s",
>         "__scrape_timeout__" : "10s",
>         "job" : "kubernetes-pods"
>       },
>       "labels" : {
>         "app" : "testapp",
>         "job" : "kubernetes-pods",
>         "kubernetes_namespace" : "spider1"
>       },
>       "health" : "down",
>       "scrapeInterval" : "15s",
>       "scrapeTimeout" : "10s"
>     }, {
>       "discoveredLabels" : {
>         "__meta_kubernetes_pod_phase" : "Running",
>         "__meta_kubernetes_pod_ready" : "true",
>         "__meta_kubernetes_pod_uid" : "85dfeac6-985d-479e-8459-fc20ae8dcec3",
>         "__metrics_path__" : "/metrics",
>         "__scheme__" : "http",
>         "__scrape_interval__" : "15s",
>         "__scrape_timeout__" : "10s",
>         "app" : "another-testapp"
>       },
>       "labels" : {
>         "app" : "another-testapp",
>         "job" : "kubernetes-pods",
>         "kubernetes_namespace" : "spider1"
>       },
>       "scrapePool" : "kubernetes-pods",
>       "health" : "down",
>       "scrapeInterval" : "15s",
>       "scrapeTimeout" : "10s"
>     } ]
>   }
> }
>
> Could you please help us understand why we have this inconsistency? Is 
> that the correct way to get the job label value from the `discoveredLabels` set?
>
> Thanks,
> Vu
>



Re: [prometheus-users] how to get count of no.of instance

2024-05-28 Thread 'Brian Candler' via Prometheus Users
Those mangled screenshots are no use. What I would need to see are the 
actual results of the two queries, from the Prometheus web interface (not 
Grafana), in plain text: e.g.

foo{bar="baz",qux="abc"} 42.0

...with the *complete* set of labels, not expurgated. That's what's needed 
to formulate the join query.

On Tuesday 28 May 2024 at 13:23:21 UTC+1 Sameer Modak wrote:

> Hello Brian,
>
> Actually, I tried as you suggested earlier, but when I execute it, it says 
> "no data". So below are screenshots of the individual queries; if I run them 
> individually, they give output.
>
> On Sunday, May 26, 2024 at 1:24:10 PM UTC+5:30 Brian Candler wrote:
>
>> The labels for the two sides of the division need to match exactly.
>>
>> If they match 1:1 except for additional labels, then you can use
>> xxx / on (foo,bar) yyy   # foo,bar are the matching labels
>> or
>> xxx / ignoring (baz,qux) zzz   # baz,qux are the labels to ignore
>>
>> If they match N:1 then you need to use group_left or group_right.
>>
>> If you show the results of the two halves of the query separately then we 
>> can be more specific. That is:
>>
>> sum(kafka_consumergroup_lag{cluster=~"$cluster",consumergroup=~"$consumergroup",topic=~"$topic"})
>>  
>> by (consumergroup, topic) 
>>
>> count(up{job="prometheus.scrape.kafka_exporter"})
>>
>> On Sunday 26 May 2024 at 08:28:10 UTC+1 Sameer Modak wrote:
>>
>>> I tried the same; I'm not getting any data after adding the below:
>>>
>>> sum(kafka_consumergroup_lag{cluster=~"$cluster",consumergroup=~
>>> "$consumergroup",topic=~"$topic"}) by (consumergroup, topic) / count(up{
>>> job="prometheus.scrape.kafka_exporter"})
>>>
>>> On Saturday, May 25, 2024 at 11:53:44 AM UTC+5:30 Ben Kochie wrote:
>>>
 You can use the `up` metric

 sum(...)
 /
 count(up{job="kafka"})

 On Fri, May 24, 2024 at 5:53 PM Sameer Modak  
 wrote:

> Hello Team,
>
> I want to know the number of instances sending data to Prometheus. How do I 
> formulate the query?
>
>
> Basically, I have the working query below, but the issue is that we have 6 
> instances, so it is summing the values of all instances. Instead, we just need 
> the value from one instance.
> sum(kafka_consumergroup_lag{cluster=~"$cluster",consumergroup=~
> "$consumergroup",topic=~"$topic"})by (consumergroup, topic)
> I was thinking of dividing it by 6, but that has to be variable at runtime:
> if 3 exporters are running, then it should be value/3 to get the exact value.
>




Re: [prometheus-users] Pod with Pending phase is in endpoints scraping targets (Prometheus 2.46.0)

2024-05-27 Thread 'Brian Candler' via Prometheus Users
Have you looked in the changelog for Prometheus? I found:

## 2.51.0 / 2024-03-18

* [BUGFIX] Kubernetes SD: Pod status changes were not discovered by 
Endpoints service discovery #13337

*=> fixes #11305, which looks similar to your problem*

## 2.50.0 / 2024-02-22

* [ENHANCEMENT] Kubernetes SD: Check preconditions earlier and avoid 
unnecessary checks or iterations in kube_sd. #13408

I'd say it's worth trying the latest release, 2.51.2.

On Monday 27 May 2024 at 12:21:01 UTC+1 Vu Nguyen wrote:

> Hi,
>
> Do you have a response to this thread? Has anyone ever encountered the 
> issue?
>
> Regards,
> Vu
>
> On Mon, May 20, 2024 at 2:56 PM Vu Nguyen  wrote:
>
>> Hi,
>>
>> With the endpoints scraping role, the job should scrape POD endpoints that are 
>> up and running. That is what we expected.
>>
>> I think, conceptually, K8S does not create an endpoint if a Pod is in another 
>> phase like Pending, Failed, etc.
>>
>> In our environments, Prometheus 2.46.0 on K8S v1.28.2, we currently have 
>> issues: 
>> 1) POD is up and running from `kubectl get pod`, but from Prometheus 
>> discovery page, it shows:
>> __meta_kubernetes_pod_phase="Pending" 
>> __meta_kubernetes_pod_ready="false"  
>>
>> 2) The endpoints job discovers POD targets with pod phase=`Pending`.
>>
>> Those issues disappear after we restart Prometheus pod.  
>>
>> I am not sure whether 1) K8S does not trigger an event after the POD 
>> phase changes, so Prometheus is not able to refresh its endpoints discovery, 
>> or 2) it is a known problem of Prometheus?
>>
>> And do you think it is worth adding the following relabeling rule to the 
>> endpoints job role?
>>
>>   - source_labels: [ __meta_kubernetes_pod_phase ]
>> regex: Pending|Succeeded|Failed|Completed
>> action: drop
>>
>> Thanks, Vu 
>>
>



Re: [prometheus-users] how to get count of no.of instance

2024-05-26 Thread 'Brian Candler' via Prometheus Users
The labels for the two sides of the division need to match exactly.

If they match 1:1 except for additional labels, then you can use
xxx / on (foo,bar) yyy   # foo,bar are the matching labels
or
xxx / ignoring (baz,qux) zzz   # baz,qux are the labels to ignore

If they match N:1 then you need to use group_left or group_right.

If you show the results of the two halves of the query separately then we 
can be more specific. That is:

sum(kafka_consumergroup_lag{cluster=~"$cluster",consumergroup=~"$consumergroup",topic=~"$topic"})
 
by (consumergroup, topic) 

count(up{job="prometheus.scrape.kafka_exporter"})
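
As a sketch of what the combined query often ends up looking like once the labels 
are known (assuming the count() side returns exactly one series; not a definitive 
answer for your labels):

sum by (consumergroup, topic) (kafka_consumergroup_lag{cluster=~"$cluster",consumergroup=~"$consumergroup",topic=~"$topic"})
  / on() group_left()
count(up{job="prometheus.scrape.kafka_exporter"})

or equivalently, dividing by a scalar:

sum by (consumergroup, topic) (kafka_consumergroup_lag{cluster=~"$cluster",consumergroup=~"$consumergroup",topic=~"$topic"})
  / scalar(count(up{job="prometheus.scrape.kafka_exporter"}))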

On Sunday 26 May 2024 at 08:28:10 UTC+1 Sameer Modak wrote:

> I tried the same; I'm not getting any data after adding the below:
>
> sum(kafka_consumergroup_lag{cluster=~"$cluster",consumergroup=~
> "$consumergroup",topic=~"$topic"}) by (consumergroup, topic) / count(up{
> job="prometheus.scrape.kafka_exporter"})
>
> On Saturday, May 25, 2024 at 11:53:44 AM UTC+5:30 Ben Kochie wrote:
>
>> You can use the `up` metric
>>
>> sum(...)
>> /
>> count(up{job="kafka"})
>>
>> On Fri, May 24, 2024 at 5:53 PM Sameer Modak  
>> wrote:
>>
>>> Hello Team,
>>>
>>> I want to know the number of instances sending data to Prometheus. How do I 
>>> formulate the query?
>>>
>>>
>>> Basically, I have the working query below, but the issue is that we have 6 instances, 
>>> so it is summing the values of all instances. Instead, we just need the value from 
>>> one instance.
>>> sum(kafka_consumergroup_lag{cluster=~"$cluster",consumergroup=~
>>> "$consumergroup",topic=~"$topic"})by (consumergroup, topic)
>>> I was thinking of dividing it by 6, but that has to be variable at runtime:
>>> if 3 exporters are running, then it should be value/3 to get the exact value.
>>>
>>



[prometheus-users] Re: Regular Expression and Label Action Support to match two or more source labels

2024-05-22 Thread 'Brian Candler' via Prometheus Users
I would assume that the reason this feature was added was because there 
wasn't a feasible alternative way to do it.

I suggest you upgrade to v2.45.5 which is the current "Long Term Stable" 
release.  The previous LTS release (v2.37) went end-of-life in July 2023, so 
it seems you're very likely running something unsupported at the moment.

On Wednesday 22 May 2024 at 11:52:03 UTC+1 tejaswini vadlamudi wrote:

> Sure Brian, I was aware of this.
> This config comes with a software change, but is there any possibility or 
> workaround in the old (< 2.41) Prometheus releases on this topic?
>
> /Teja
>
> On Wednesday, May 22, 2024 at 12:01:31 PM UTC+2 Brian Candler wrote:
>
>> Yes, there are similar relabel actions "keepequal" and "dropequal":
>>
>> https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config
>>
>> These were added in v2.41.0 / 2022-12-20
>> https://github.com/prometheus/prometheus/pull/11564
>>
>> They behave slightly differently from VM: in Prometheus, the 
>> concatenation of source_labels is compared with target_label.
>>
>> On Tuesday 21 May 2024 at 15:43:05 UTC+1 tejaswini vadlamudi wrote:
>>
>>> The below relabeling rule from Victoria Metrics is useful for matching 
>>> accurate ports and dropping unwanted targets:
>>> - action: keep_if_equal
>>>   source_labels: [__meta_kubernetes_service_annotation_prometheus_io_port,
>>>                   __meta_kubernetes_pod_container_port_number]
>>> Does anyone know how we can compare two labels using Prometheus 
>>> Relabeling rules?
>>>
>>> Based on my analysis, Prometheus doesn't support regex patterns on 
>>> 1. backreferences like \1 
>>> 2. lookaheads or lookbehinds
>>>
>>



[prometheus-users] Re: Regular Expression and Label Action Support to match two or more source labels

2024-05-22 Thread 'Brian Candler' via Prometheus Users
Yes, there are similar relabel actions "keepequal" and "dropequal":
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config

These were added in v2.41.0 / 2022-12-20
https://github.com/prometheus/prometheus/pull/11564

They behave slightly differently from VM: in Prometheus, the concatenation 
of source_labels is compared with target_label.
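
So a sketch of the Prometheus equivalent of the VM rule quoted below (untested, 
but using only the documented relabel_config fields) would be:

  - action: keepequal
    source_labels: [__meta_kubernetes_service_annotation_prometheus_io_port]
    target_label: __meta_kubernetes_pod_container_port_number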

On Tuesday 21 May 2024 at 15:43:05 UTC+1 tejaswini vadlamudi wrote:

> The below relabeling rule from Victoria Metrics is useful for matching 
> accurate ports and dropping unwanted targets:
> - action: keep_if_equal
>   source_labels: [__meta_kubernetes_service_annotation_prometheus_io_port,
>                   __meta_kubernetes_pod_container_port_number]
> Does anyone know how we can compare two labels using Prometheus Relabeling 
> rules?
>
> Based on my analysis, Prometheus doesn't support regex patterns on 
> 1. backreferences like \1 
> 2. lookaheads or lookbehinds
>



[prometheus-users] Re: All Samples Lost when prometheus server return 500 to prometheus agent

2024-05-19 Thread 'Brian Candler' via Prometheus Users
> server returned HTTP status 500 Internal Server Error: too old sample

This is not the server failing to process the data; it's the client 
supplying invalid data. You found that this has been fixed to a 400.

> server returned HTTP status 500 Internal Server Error: label name 
\"prometheus\" is not unique: invalid sample

I can't speak for the authors, but it looks to me like that should be a 400 
as well.

On Monday 20 May 2024 at 04:52:03 UTC+1 koly li wrote:

> Sorry for my poor description. Here is the story:
>
> 1) At first, we were using prometheus v2.47
>
> Then we found that all metrics were missing, so we checked the prometheus log and 
> the prometheus agent log:
>
> prometheus log(lots of lines):
> ts=2024-04-19T20:33:26.485Z caller=write_handler.go:76 level=error 
> component=web msg="Error appending remote write" err="too old sample"
> ts=2024-04-19T20:33:26.539Z caller=write_handler.go:76 level=error 
> component=web msg="Error appending remote write" err="too old sample"
> ts=2024-04-19T20:33:26.626Z caller=write_handler.go:76 level=error 
> component=web msg="Error appending remote write" err="too old sample"
> ts=2024-04-19T20:33:26.775Z caller=write_handler.go:76 level=error 
> component=web msg="Error appending remote write" err="too old sample"
> ts=2024-04-19T20:33:27.042Z caller=write_handler.go:76 level=error 
> component=web msg="Error appending remote write" err="too old sample"
> ts=2024-04-19T20:33:27.552Z caller=write_handler.go:76 level=error 
> component=web msg="Error appending remote write" err="too old sample"
> 
> ts=2024-04-22T03:00:03.327Z caller=write_handler.go:76 level=error 
> component=web msg="Error appending remote write" err="too old sample"
> ts=2024-04-22T03:00:08.394Z caller=write_handler.go:76 level=error 
> component=web msg="Error appending remote write" err="too old sample"
>
> prometheus agent logs:
> ts=2024-04-19T20:33:26.517Z caller=dedupe.go:112 component=remote 
> level=warn remote_name=prometheus-k8s-0 url=
> https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send 
> batch, retrying" err="server returned HTTP status 500 Internal Server 
> Error: too old sample"
> ts=2024-04-19T20:34:29.714Z caller=dedupe.go:112 component=remote 
> level=warn remote_name=prometheus-k8s-0 url=
> https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send 
> batch, retrying" err="server returned HTTP status 500 Internal Server 
> Error: too old sample"
> ts=2024-04-19T20:35:30.113Z caller=dedupe.go:112 component=remote 
> level=warn remote_name=prometheus-k8s-0 url=
> https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send 
> batch, retrying" err="server returned HTTP status 500 Internal Server 
> Error: too old sample"
> ts=2024-04-19T20:36:30.478Z caller=dedupe.go:112 component=remote 
> level=warn remote_name=prometheus-k8s-0 url=
> https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send 
> batch, retrying" err="server returned HTTP status 500 Internal Server 
> Error: too old sample"
> 
> ts=2024-04-22T02:56:57.281Z caller=dedupe.go:112 component=remote 
> level=warn remote_name=prometheus-k8s-0 url=
> https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send 
> batch, retrying" err="server returned HTTP status 500 Internal Server 
> Error: too old sample"
> ts=2024-04-22T02:57:57.624Z caller=dedupe.go:112 component=remote 
> level=warn remote_name=prometheus-k8s-0 url=
> https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send 
> batch, retrying" err="server returned HTTP status 500 Internal Server 
> Error: too old sample"
> ts=2024-04-22T02:58:57.943Z caller=dedupe.go:112 component=remote 
> level=warn remote_name=prometheus-k8s-0 url=
> https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send 
> batch, retrying" err="server returned HTTP status 500 Internal Server 
> Error: too old sample"
> ts=2024-04-22T02:59:58.267Z caller=dedupe.go:112 component=remote 
> level=warn remote_name=prometheus-k8s-0 url=
> https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send 
> batch, retrying" err="server returned HTTP status 500 Internal Server 
> Error: too old sample"
> ts=2024-04-22T03:00:58.733Z caller=dedupe.go:112 component=remote 
> level=warn remote_name=prometheus-k8s-0 url=
> https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send 
> batch, retrying" err="server returned HTTP status 500 Internal Server 
> Error: too old sample"
>
> Then we checked the code:
>
> https://github.com/prometheus/prometheus/blob/release-2.47/storage/remote/write_handler.go#L77
>
> The "too old sample" is considered an 500. And the agent keeps retrying 
> (exit only when the error is not Recoverable, and 500 is considered 
> Recoverable):
>
> https://github.com/prometheus/prometheus/blob/release-2.51/storage/remote/queue_manager.go#L1670
>
> > You may have come across a bug where a *particular* piece of data being 
> sent by the agent was 

[prometheus-users] Re: hundreds of containers, how to alert when a certain container is down?

2024-05-18 Thread 'Brian Candler' via Prometheus Users
Monitoring for a metric vanishing is not a very good way to do alerting. 
Metrics hang around for the "staleness" interval, which by default is 5 
minutes. Ideally, you should monitor all the things you care about 
explicitly, get a success metric like "up" (1 = working, 0 = not working) 
and then alert on "up == 0" or equivalent. This is much more flexible and 
timely.

Having said that, there's a quick and dirty hack that might be good enough 
for you:

expr: container_memory_usage_bytes offset 10m unless 
container_memory_usage_bytes

This will give you an alert if any metric container_memory_usage_bytes 
existed 10 minutes ago but does not exist now. The alert will resolve 
itself after 10 minutes.

The result of this expression is a vector, so it can alert on multiple 
containers at once; each element of the vector will have the container name 
in the label ("name")
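
Wrapped into a rule, that might look like this (a sketch only; the alert name, 
labels and annotations are up to you):

- name: containers
  rules:
  - alert: container_gone
    expr: container_memory_usage_bytes offset 10m unless container_memory_usage_bytes
    labels:
      severity: critical
    annotations:
      summary: "Container {{ $labels.name }} is down"
      description: "Container {{ $labels.name }} existed 10 minutes ago but is not reporting metrics now."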

On Saturday 18 May 2024 at 19:50:48 UTC+1 Sleep Man wrote:

> I have a large number of containers. I learned that the following 
> configuration can monitor a single container being down. How do I configure it to 
> monitor all containers and send the container name once a container is down?
>
>
> - name: containers
>   rules:
>   - alert: jenkins_down
> expr: absent(container_memory_usage_bytes{name="jenkins"})
> for: 30s
> labels:
>   severity: critical
> annotations:
>   summary: "Jenkins down"
>   description: "Jenkins container is down for more than 30 seconds."
>



[prometheus-users] Re: Alertmanager frequently sending erroneous resolve notifications

2024-05-18 Thread 'Brian Candler' via Prometheus Users
> What can be done?

Perhaps the alert condition resolved very briefly. The solution with modern 
versions of prometheus (v2.42.0 or later) is to do this:

for: 2d
keep_firing_for: 10m

The alert won't be resolved unless it has been *continuously* absent for 10 
minutes. (Of course, this means your "resolved" notifications will be 
delayed by 10 minutes - but that's basically the whole point, don't send 
them until you're sure they're not going to retrigger)
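
Applied to a rule like the one you quote below, that would look something like 
this (a sketch; everything else stays as it is):

- alert: Restic Prune Freshness
  expr: restic_prune_status{uptodate!="1"} and restic_prune_status{alerts!="0"}
  for: 2d
  keep_firing_for: 10m
  # labels and annotations unchanged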

The other alternative is simply to turn off resolved notifications 
entirely. This approach sounds odd but has a lot to recommend it:
https://www.robustperception.io/running-into-burning-buildings-because-the-fire-alarm-stopped
https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit
https://blog.cloudflare.com/alerts-observability

The point is that if a problem occurred which was serious enough to alert 
on, then it requires investigation before the case can be "closed": either 
there's an underlying problem, or if it was a false positive then the alert 
condition needs tuning. Sending a resolved message encourages laziness 
("oh, it fixed itself, no further work required").  Also, turning off 
resolved messages instantly reduces your notifications by 50%.

On Saturday 18 May 2024 at 19:50:32 UTC+1 Sarah Dundras wrote:

> Hi, this problem is driving me mad: 
>
> I am monitoring backups that log their backup results to a textfile. It is 
> being picked up and all is well, also the alert are ok, BUT! Alertmanager 
> frequently sends out odd "resolved" notifications although the firing 
> status never changed! 
>
> Here's such an alert rule that does this: 
>
> - alert: Restic Prune Freshness
>   expr: restic_prune_status{uptodate!="1"} and restic_prune_status{alerts!="0"}
>   for: 2d
>   labels:
>     topic: backup
>     freshness: outdated
>     job: "{{ $labels.restic_backup }}"
>     server: "{{ $labels.server }}"
>     product: veeam
>   annotations:
>     description: "Restic Prune for '{{ $labels.backup_name }}' on host '{{ $labels.server_name }}' is not up-to-date (too old)"
>     host_url: "https://backups.example.com/d/3be21566-3d15-4238-a4c5-508b059dccec/restic?orgId=2_name={{ $labels.server_name }}=0_name=All"
>     service_url: "https://backups.example.com/d/3be21566-3d15-4238-a4c5-508b059dccec/restic?orgId=2_name=All=0_name={{ $labels.backup_name }}"
>     service: "{{ $labels.job_name }}"
>
> What can be done? 
>



[prometheus-users] Re: All Samples Lost when prometheus server return 500 to prometheus agent

2024-05-17 Thread 'Brian Candler' via Prometheus Users
It's difficult to make sense of what you're saying. Without seeing logs 
from both the agent and the server while this problem was occurring (e.g. 
`journalctl -eu prometheus`), it's hard to know what was really happening. 
Also you need to say what exact versions of prometheus and the agent were 
running.

The fundamental issue here is, why should restarting the *agent* cause the 
prometheus *server* to stop returning 500 errors?

> So my question is: why is a 5xx from the prometheus server considered 
Recoverable?

It is by definition of the HTTP protocol: 
https://datatracker.ietf.org/doc/html/rfc2616#section-10.5

Actually it depends on exactly which 5xx error code you're talking about, 
but common 500 and 503 errors are generally transient, meaning there was a 
problem at the server and the request may succeed if tried again later.  If 
the prometheus server wanted to tell the client that the request was 
invalid and could never possibly succeed, then it would return a 4xx error.

> And I believe there should be a way to exit the loop, for example a 
maximum number of retries.

You are saying that you would prefer the agent to throw away data, rather 
than hold onto the data and try again later when it may succeed. In this 
situation, retrying is normally the correct thing to do.

You may have come across a bug where a *particular* piece of data being 
sent by the agent was causing a *particular* version of prometheus to fail 
with a 5xx internal error every time. The logs should make it clear if this 
was happening.

On Friday 17 May 2024 at 10:02:49 UTC+1 koly li wrote:

> Hello all,
>
> Recently we found that our samples were all lost. After some investigation, 
> we found:
> 1. We are using the prometheus agent to send all data to the prometheus server by 
> remote write.
> 2. The agent's sample-sending code is in storage\remote\queue_manager.go; 
> the function is sendWriteRequestWithBackoff().
> 3. Inside that function, if attempt() (the function where the request is made to 
> the prometheus server) returns a Recoverable error, then it will 
> retry sending the request.
> 4. When is a Recoverable error returned? One scenario is when the prometheus 
> server returns a 5xx error.
> 5. I think not every 5xx error is recoverable, and there is no other way 
> to exit the for loop in sendWriteRequestWithBackoff(). The agent keeps 
> retrying, but every time it receives a 5xx from the server, so we lost all 
> samples for hours until we restarted the agent.
>
> So my question is: why is a 5xx from the prometheus server considered 
> Recoverable? And I believe there should be a way to exit the loop, for 
> example a maximum number of retries.
>
> It seems that the agent mode is not mature enough to work in production.
>



[prometheus-users] Re: what insecure_skip_verify will do

2024-05-16 Thread 'Brian Candler' via Prometheus Users
Then you did something wrong in your config, but you'll need to show the 
config if you want help fixing it.

It also depends on what you're talking to: is this a scrape job talking to 
an exporter? Is this service discovery? Something else?

On Thursday 16 May 2024 at 15:12:14 UTC+1 Sameer Modak wrote:

> So here is the update: I did try insecure_skip_verify, but I am still getting the 
> error below,
>
>  tls: failed to verify certificate: x509: certificate signed by unknown 
> authority
>
> On Thursday, May 16, 2024 at 1:28:43 PM UTC+5:30 Brian Candler wrote:
>
>> It depends what you mean by "secure".
>>
>> It's encrypted, because you've told it to use HTTPS (HTTP + TLS). If the 
>> remote end doesn't talk TLS, then the two won't be able to establish a 
>> connection at all.
>>
>> However it is also insecure, because the client has no way of knowing 
>> whether the remote device is the one it's expecting to talk to, or an 
>> imposter. If it's an imposter, they can capture any data sent by the 
>> client, and return any data they like to the client. It's the job of a 
>> certificate to verify the identity of the server, and you've told it to 
>> skip that check.
>>
>> On Thursday 16 May 2024 at 07:33:31 UTC+1 Sameer Modak wrote:
>>
>>> Thanks a lot. Is there an easy way to check if the traffic is secure, apart from 
>>> Wireshark?
>>>
>>> On Wednesday, May 15, 2024 at 8:50:18 PM UTC+5:30 Alexander Wilke wrote:
>>>
 It will skip the certificate check. So the certificate may be valid or 
 invalid and is always trusted.
 The connection is still encrypted.

 Sameer Modak wrote on Wednesday, 15 May 2024 at 17:04:07 UTC+2:

> Hello Team,
>
> If I set insecure_skip_verify: true, will my data be unsecured? Will 
> it be non-SSL?
>




[prometheus-users] Re: what insecure_skip_verify will do

2024-05-16 Thread 'Brian Candler' via Prometheus Users
It depends what you mean by "secure".

It's encrypted, because you've told it to use HTTPS (HTTP + TLS). If the 
remote end doesn't talk TLS, then the two won't be able to establish a 
connection at all.

However it is also insecure, because the client has no way of knowing 
whether the remote device is the one it's expecting to talk to, or an 
imposter. If it's an imposter, they can capture any data sent by the 
client, and return any data they like to the client. It's the job of a 
certificate to verify the identity of the server, and you've told it to 
skip that check.
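
If you want that check rather than skipping it, the usual alternative is to point 
the scrape job at the CA that signed the server's certificate, e.g. (a sketch; the 
path is a placeholder):

    tls_config:
      ca_file: /etc/prometheus/my-ca.crt   # placeholder path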

On Thursday 16 May 2024 at 07:33:31 UTC+1 Sameer Modak wrote:

> Thanks a lot. Is there an easy way to check if the traffic is secure, apart from 
> Wireshark?
>
> On Wednesday, May 15, 2024 at 8:50:18 PM UTC+5:30 Alexander Wilke wrote:
>
>> It will skip the certificate check. So the certificate may be valid or 
>> invalid and is always trusted.
>> The connection is still encrypted.
>>
>> Sameer Modak wrote on Wednesday, 15 May 2024 at 17:04:07 UTC+2:
>>
>>> Hello Team,
>>>
>>> If I set insecure_skip_verify: true, will my data be unsecured? Will it 
>>> be non-SSL?
>>>
>>



[prometheus-users] Re: Locatinme in Alertmanager

2024-05-09 Thread 'Brian Candler' via Prometheus Users
Can you describe what the actual problem is? Are you seeing an error 
message, if so what is it?

Why are you defining a time interval of 00:00 to 23:59, which is basically 
all the time apart from 1 minute between 23:59 and 24:00? You also don't 
seem to be referencing it from a routing rule.

In any case, "Time interval" only affects what times notifications are sent 
or muted, and only if you refer to them in a routing rule. It makes no 
change to the *content* of the notification.
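
For reference, a time interval only takes effect when a route refers to it, e.g. 
(a sketch, reusing your interval name; the matcher is a placeholder):

route:
  receiver: 'email and line-notify'
  routes:
    - receiver: 'email and line-notify'
      matchers: [ 'severity="warning"' ]   # placeholder matcher
      mute_time_intervals: [ everyday ]    # or active_time_intervals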

If you want the notifications to contain local times, then you'll need to 
show the configuration of your receivers - are you doing template 
expansion? Which exact parts of the message do you want to change? Of 
course, your webhook receiver can do whatever reformatting of the messages 
you like.

On Thursday 9 May 2024 at 09:06:03 UTC+1 Tareerat Pansuntia wrote:

> Hello all!  I set up a website monitoring project using Prometheus, 
> Blackbox Exporter, and Alertmanager to monitor and send notifications. I 
> have configured it to send alerts via Line Notify using a webhook receiver. 
> However, I am currently facing an issue with setting the timezone to my 
> country's timezone. This is my config:
>
> global:
>  resolve_timeout: 1m
>
> route:
>   group_by: ['alertname']
>   group_wait: 30s
>   group_interval: 10s
>   repeat_interval: 10s
>   receiver: 'email and line-notify'
>
> receivers:
> - name: 'email and line-notify'
>   email_configs:
>   - to: '...
>   webhook_configs:
> - ...
>
> time_intervals:
> - name: everyday
>   time_intervals:
>   - times:
> - start_time: "00:00"
>   end_time: "23:59"
> location: 'Asia/Bangkok'
>
> inhibit_rules:
>   - source_match:
>   severity: 'critical'
> target_match:
>   severity: 'warning'
> equal: ['alertname', 'instance']
>
> Could someone please guide me on the correct format for specifying time 
> intervals in Prometheus?
>
> Regards.
> Tareerat
>
>



[prometheus-users] Re: Does anyone have any examples of what a postgres_exporter.yml file is supposed to look like?

2024-05-08 Thread 'Brian Candler' via Prometheus Users
...then move on to configuring *prometheus* I meant.

On Wednesday 8 May 2024 at 07:11:46 UTC+1 Brian Candler wrote:

> - job_name: 'postgresql_exporter'
>   static_configs:
>     - targets: ['host.docker.internal:5432']
>
> One problem I can see is that you're trying to get prometheus to scrape 
> the postgres SQL port. If you go to the Prometheus web UI and look at the 
> Status > Targets menu option, I think you will see it's currently failing.  
> Or run the query "up == 0".
>
> You need to change it to scrape prometheus exporter: that is port 9187, 
> not port 5432.
>
> However, before you get around to configuring prometheus, I suggest you 
> first make sure that postgres-exporter itself is working properly, by 
> scraping it manually:
>
> curl x.x.x.x:9187/metrics
>
> (or inside the exporter container you could try curl 
> 127.0.0.1:9187/metrics, but that depends if the container has a "curl" 
> binary)
>
> Once you're able to do that (which may also require adjusting your 
> postgres_exporter.yml and/or pg_hba.conf), then move on to configuring 
> postgres.
>
> On Tuesday 7 May 2024 at 21:24:18 UTC+1 Christian Sanchez wrote:
>
>> Hello, all.
>>
>> I've started to learn Prometheus and found out about the 
>> postgres_exporter. I'd like to include metrics from the PostgreSQL server I 
>> have running on Google Cloud.
>>
>> I don't understand how to actually build out the postgres_exporter.yml 
>> file. The prometheus-community GitHub repository doesn't seem 
>> to have examples of building this file out.
>>
>> Maybe I am not reading the README in the repo that well, but I'd like to 
>> see some examples of the exporter file.
>>
>> When running the Prometheus container, this is where I'm expecting to see 
>> the exporter query options (see attachment)
>>
>>
>> I am running Prometheus and the Postgres Exporter through Docker Compose.
>> Here is my docker-compose.yml file:
>> version: '3'
>> services:
>>   prometheus:
>>     image: prom/prometheus
>>     volumes:
>>       - "./prometheus.yml:/etc/prometheus/prometheus.yml"
>>     ports:
>>       - 9090:9090
>>
>>   postgres-exporter:
>>     image: prometheuscommunity/postgres-exporter
>>     volumes:
>>       - "./postgres_exporter.yml:/postgres_exporter.yml:ro"
>>     ports:
>>       - 9187:9187
>>     environment:
>>       DATA_SOURCE_NAME: "postgresql://my-user:my-pa...@host.docker.internal:5432/my-database?sslmode=disable"
>>
>>
>> Here is my prometheus.yml file:
>> global:
>>   scrape_interval: 45s
>>
>> scrape_configs:
>>   - job_name: 'prometheus'
>>     static_configs:
>>       - targets: ['localhost:9090']
>>
>>   - job_name: 'postgresql_exporter'
>>     static_configs:
>>       - targets: ['host.docker.internal:5432']
>>
>>
>>



[prometheus-users] Re: Does anyone have any examples of what a postgres_exporter.yml file is supposed to look like?

2024-05-08 Thread 'Brian Candler' via Prometheus Users
 - job_name: 'postgresql_exporter'
   static_configs:
     - targets: ['host.docker.internal:5432']

One problem I can see is that you're trying to get prometheus to scrape the 
postgres SQL port. If you go to the Prometheus web UI and look at the 
Status > Targets menu option, I think you will see it's currently failing.  
Or run the query "up == 0".

You need to change it to scrape prometheus exporter: that is port 9187, not 
port 5432.
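
With your docker-compose setup, that would look something like this (a sketch; 
the hostname is the compose service name from your file):

  - job_name: 'postgresql_exporter'
    static_configs:
      - targets: ['postgres-exporter:9187']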

However, before you get around to configuring prometheus, I suggest you 
first make sure that postgres-exporter itself is working properly, by 
scraping it manually:

curl x.x.x.x:9187/metrics

(or inside the exporter container you could try curl 
127.0.0.1:9187/metrics, but that depends if the container has a "curl" 
binary)

Once you're able to do that (which may also require adjusting your 
postgres_exporter.yml and/or pg_hba.conf), then move on to configuring 
postgres.

On Tuesday 7 May 2024 at 21:24:18 UTC+1 Christian Sanchez wrote:

> Hello, all.
>
> I've started to learn Prometheus and found out about the 
> postgres_exporter. I'd like to include metrics from the PostgreSQL server I 
> have running on Google Cloud.
>
> I don't understand how to actually build out the postgres_exporter.yml 
> file. The prometheus-community GitHub repository doesn't seem 
> to have examples of building this file out.
>
> Maybe I am not reading the README in the repo that well, but I'd like to 
> see some examples of the exporter file.
>
> When running the Prometheus container, this is where I'm expecting to see 
> the exporter query options (see attachment)
>
>
> I am running Prometheus and the Postgres Exporter through Docker Compose.
> Here is my docker-compose.yml file:
> version: '3'
> services:
>   prometheus:
>     image: prom/prometheus
>     volumes:
>       - "./prometheus.yml:/etc/prometheus/prometheus.yml"
>     ports:
>       - 9090:9090
>
>   postgres-exporter:
>     image: prometheuscommunity/postgres-exporter
>     volumes:
>       - "./postgres_exporter.yml:/postgres_exporter.yml:ro"
>     ports:
>       - 9187:9187
>     environment:
>       DATA_SOURCE_NAME: "postgresql://my-user:my-pa...@host.docker.internal:5432/my-database?sslmode=disable"
>
>
> Here is my prometheus.yml file:
> global:
>   scrape_interval: 45s
>
> scrape_configs:
>   - job_name: 'prometheus'
>     static_configs:
>       - targets: ['localhost:9090']
>
>   - job_name: 'postgresql_exporter'
>     static_configs:
>       - targets: ['host.docker.internal:5432']
>
>
>



[prometheus-users] Re: Compare metrics with differents labels

2024-04-30 Thread 'Brian Candler' via Prometheus Users
There's no metric I see there that tells you whether messages are being 
produced, only whether they're being consumed.

Without that, then I'm not sure you can do any better than this:

sum by (consumergroup, topic) (rate(kafka_consumergroup_current_offset[5m]) 
* 60) == 0
unless on (topic) sum by (topic) 
(rate(kafka_consumergroup_current_offset[5m]) * 60) < 1

The first part:
sum by (consumergroup, topic) (rate(kafka_consumergroup_current_offset[5m]) 
* 60) == 0
will give you an alert for each (consumergroup,topic) combination which has 
not consumed anything in the last 5 minutes.

The second part:
unless on (topic) sum by (topic) 
(rate(kafka_consumergroup_current_offset[5m]) * 60) < 1
will suppress the alert if *no* consumers have consumed at least 1 message 
per minute.  But this won't be useful unless each topic has at least 2 
consumer groups, so that if one is consuming it can alert on the other.

Given the examples you show, it looks like you only have one consumer group 
per topic.  Therefore, I think you need to find a metric which explicitly 
gives the publisher offset for each topic/partition.
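
If you're using the standard kafka_exporter, I believe it also exposes 
kafka_topic_partition_current_offset (the broker-side offset per topic/partition). 
Assuming that metric is available, a sketch would be:

sum by (consumergroup, topic) (rate(kafka_consumergroup_current_offset[5m])) == 0
and on (topic)
sum by (topic) (rate(kafka_topic_partition_current_offset[5m])) > 0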

On Tuesday 30 April 2024 at 18:30:24 UTC+1 Robson Jose wrote:

> like this ?
>
> kafka_consumergroup_current_offset{consumergroup="consumer-events", 
> env="prod", instance="kafka-exporter.monitor:9308", job="kafka-exporter", 
> partition="0", topic="TOPIC-EVENTS"}
> 292350417
> kafka_consumergroup_current_offset{consumergroup="$Default", env="prod", 
> instance="kafka-exporter.monitor:9308", job="kafka-exporter", 
> partition="0", topic="TOPIC-NOTIFICATION"}
> 30027218
> kafka_consumergroup_current_offset{consumergroup="$Default", env="prod", 
> instance="kafka-exporter.monitor:9308", job="kafka-exporter", 
> partition="0", topic="TOPIC-NOTIFICATION-CHAT"}
> 3493310
> kafka_consumergroup_current_offset{consumergroup="consumer-email", 
> env="prod", instance="kafka-exporter.monitor:9308", job="kafka-exporter", 
> partition="0", topic="TOPIC-NOTIFICATION-EMAIL"}
> 82381171
> kafka_consumergroup_current_offset{consumergroup="$Default", env="prod", 
> instance="kafka-exporter.monitor:9308", job="kafka-exporter", 
> partition="0", topic="TOPIC-NOTIFICATION-PUSH"}
> 31267495
> kafka_consumergroup_current_offset{consumergroup="$Default", env="prod", 
> instance="kafka-exporter.monitor:9308", job="kafka-exporter", 
> partition="0", topic="TOPIC-NOTIFICATION-SMS"}
> 366
> kafka_consumergroup_current_offset{consumergroup="$Default", env="prod", 
> instance="kafka-exporter.monitor:9308", job="kafka-exporter", 
> partition="0", topic="TOPIC-NOTIFICATION-WHATSAPP"}
> On Tuesday, 30 April 2024 at 12:28:29 UTC-3, Brian Candler 
> wrote:
>
>> You're showing aggregates, not the raw metrics.
>>
>> On Tuesday 30 April 2024 at 16:23:15 UTC+1 Robson Jose wrote:
>>
>>> like this
>>>   sum by (consumergroup, topic) 
>>> (delta(kafka_consumergroup_current_offset{}[5m])/5)
>>>
>>> {consumergroup="consumer-shop", topic="SHOP-EVENTS"}
>>> 1535.25
>>> {consumergroup="$Default", topic="TOPIC-NOTIFICATION"}
>>> 1.5
>>> {consumergroup="$Default", topic="TOPIC-NOTIFICATION-CHAT"}
>>> 0.25
>>> {consumergroup="consumer-email", topic="TOPIC-NOTIFICATION-EMAIL"}
>>> 0
>>> {consumergroup="$Default", topic="TOPIC-NOTIFICATION-TESTE"}
>>> 1.25
>>> {consumergroup="$Default", topic="TOPIC-NOTIFICATION-SMS"}
>>> 0
>>> {consumergroup="$Default", topic="TOPIC-NOTIFICATION-WHATSAPP"}
>>> 0
>>> {consumergroup="consumer-user-event", topic="TOPIC-USER-EVENTS"}
>>> 0
>>>
>>> On Tuesday, 30 April 2024 at 12:14:23 UTC-3, Brian Candler 
>>> wrote:
>>>
 Without seeing examples of the exact metrics you are receiving then 
 it's hard to be sure what the right query is.

 > I want that if the consumption of messages in the topic in the last 5 
 minutes is 0 and the production of messages is greater than 1 in the topic

 Then you'll want metrics for the consumption (consumer group offset) 
 and production (e.g. partition log-end offset or consumer group lag)

 On Tuesday 30 April 2024 at 13:51:50 UTC+1 Robson Jose wrote:

>
> Hello, thanks for responding. In this case:
>
> I want that if the consumption of messages on the topic in the last 5 
> minutes is 0 and the production of messages is greater than 1 on the 
> topic, 
> then the group of consumers is not consuming messages, and I wanted to 
> return which groups and topics these would be.
> On Friday, 19 April 2024 at 15:36:44 UTC-3, Brian Candler 
> wrote:
>
>> Maybe what you're trying to do is:
>>
>> sum by (consumergroup, topic) 
>> (rate(kafka_consumergroup_current_offset[5m]) * 60) == 0
>> unless sum by (topic) (rate(kafka_consumergroup_current_offset[5m]) * 
>> 60) < 1
>>
>> That is: alert on any combination of (consumergroup,topic) where the 
>> 5-minute rate of consumption is zero, unless the rate for that topic 
>> across 
>> all 

[prometheus-users] Re: Compare metrics with differents labels

2024-04-30 Thread 'Brian Candler' via Prometheus Users
You're showing aggregates, not the raw metrics.

On Tuesday 30 April 2024 at 16:23:15 UTC+1 Robson Jose wrote:

> like this
>   sum by (consumergroup, topic) 
> (delta(kafka_consumergroup_current_offset{}[5m])/5)
>
> {consumergroup="consumer-shop", topic="SHOP-EVENTS"}
> 1535.25
> {consumergroup="$Default", topic="TOPIC-NOTIFICATION"}
> 1.5
> {consumergroup="$Default", topic="TOPIC-NOTIFICATION-CHAT"}
> 0.25
> {consumergroup="consumer-email", topic="TOPIC-NOTIFICATION-EMAIL"}
> 0
> {consumergroup="$Default", topic="TOPIC-NOTIFICATION-TESTE"}
> 1.25
> {consumergroup="$Default", topic="TOPIC-NOTIFICATION-SMS"}
> 0
> {consumergroup="$Default", topic="TOPIC-NOTIFICATION-WHATSAPP"}
> 0
> {consumergroup="consumer-user-event", topic="TOPIC-USER-EVENTS"}
> 0
>
> On Tuesday, 30 April 2024 at 12:28:29 UTC-3, Brian Candler 
> wrote:
>
>> Without seeing examples of the exact metrics you are receiving then it's 
>> hard to be sure what the right query is.
>>
>> > I want that if the consumption of messages in the topic in the last 5 
>> minutes is 0 and the production of messages is greater than 1 in the topic
>>
>> Then you'll want metrics for the consumption (consumer group offset) and 
>> production (e.g. partition log-end offset or consumer group lag)
>>
>> On Tuesday 30 April 2024 at 13:51:50 UTC+1 Robson Jose wrote:
>>
>>>
>>> Hello, thanks for responding. In this case:
>>>
>>> I want that if the consumption of messages on the topic in the last 5 
>>> minutes is 0 and the production of messages is greater than 1 on the topic, 
>>> then the group of consumers is not consuming messages, and I wanted to 
>>> return which groups and topics these would be.
>>> On Friday, 19 April 2024 at 15:36:44 UTC-3, Brian Candler 
>>> wrote:
>>>
 Maybe what you're trying to do is:

 sum by (consumergroup, topic) 
 (rate(kafka_consumergroup_current_offset[5m]) * 60) == 0
 unless sum by (topic) (rate(kafka_consumergroup_current_offset[5m]) * 
 60) < 1

 That is: alert on any combination of (consumergroup,topic) where the 
 5-minute rate of consumption is zero, unless the rate for that topic 
 across 
 all consumers is less than 1 per minute.

 As far as I can tell, kafka_consumergroup_current_offset is a counter, 
 and therefore you should use either rate() or increase().  The only 
 difference is that rate(foo[5m]) gives the increase per second, while 
 increase(foo[5m]) gives the increase per 5 minutes.

 Hence:
 rate(kafka_consumergroup_current_offset[5m]) * 60
 increase(kafka_consumergroup_current_offset[5m]) / 5
 should both be the same, giving the per-minute increase.

 On Friday 19 April 2024 at 18:30:21 UTC+1 Brian Candler wrote:

> Sorry, first link was wrong.
>
> https://groups.google.com/g/prometheus-users/c/IeW_3nyGkR0/m/unto0oGQAQAJ
>
> https://groups.google.com/g/prometheus-users/c/83pEAX44L3M/m/E20UmVJyBQAJ
>
> On Friday 19 April 2024 at 18:28:29 UTC+1 Brian Candler wrote:
>
>> Can you give examples of the metrics in question, and what conditions 
>> you're trying to check for?
>>
>> Looking at your specific PromQL query: Firstly, in my experience, 
>> it's very unusual in Prometheus queries to use ==bool or >bool, and in 
>> this 
>> specific case definitely seems to be wrong.
>>
>> Secondly, you won't be able to join the LH and RH sides of your 
>> expression with "and" unless either they have exactly the same label 
>> sets, 
>> or you modify your condition using "and on (...)" or "and ignoring 
>> (...)".
>>
>> "and" is a vector intersection operator, where the result vector 
>> includes a value if the labels match, and the value is taken from the 
>> LHS, 
>> and that means it doesn't combine the values like you might be used to 
>> in 
>> other programming languages. For example,
>>
>> vector(0) and vector(1)  => value is 0
>> vector(1) and vector(0)  => value is 1
>> vector(42) and vector(99)  => value is 42
>>
>> This is as described in the documentation:
>>
>> vector1 and vector2 results in a vector consisting of the elements 
>> of vector1 for which there are elements in vector2 with exactly 
>> matching label sets. Other elements are dropped. The metric name and 
>> values 
>> are carried over from the left-hand side vector.
>>
>> PromQL alerts on the presence of values, and in PromQL you need to 
>> think in terms of "what (labelled) values are present or absent in this 
>> vector", using the "and/unless" operators to suppress elements in the 
>> result vector, and the "or" operator to add additional elements to the 
>> result vector.
>>
>> Maybe these explanations help:
>>
>> 

[prometheus-users] Re: Compare metrics with differents labels

2024-04-30 Thread 'Brian Candler' via Prometheus Users
Without seeing examples of the exact metrics you are receiving then it's 
hard to be sure what the right query is.

> I want that if the consumption of messages in the topic in the last 5 
minutes is 0 and the production of messages is greater than 1 in the topic

Then you'll want metrics for the consumption (consumer group offset) and 
production (e.g. partition log-end offset or consumer group lag)

On Tuesday 30 April 2024 at 13:51:50 UTC+1 Robson Jose wrote:

>
> Hello, thanks for responding. In this case:
>
> I want that if the consumption of messages on the topic in the last 5 
> minutes is 0 and the production of messages is greater than 1 on the topic, 
> then the group of consumers is not consuming messages, and I wanted to 
> return which groups and topics these would be.
> On Friday, 19 April 2024 at 15:36:44 UTC-3, Brian Candler 
> wrote:
>
>> Maybe what you're trying to do is:
>>
>> sum by (consumergroup, topic) 
>> (rate(kafka_consumergroup_current_offset[5m]) * 60) == 0
>> unless sum by (topic) (rate(kafka_consumergroup_current_offset[5m]) * 60) 
>> < 1
>>
>> That is: alert on any combination of (consumergroup,topic) where the 
>> 5-minute rate of consumption is zero, unless the rate for that topic across 
>> all consumers is less than 1 per minute.
>>
>> As far as I can tell, kafka_consumergroup_current_offset is a counter, 
>> and therefore you should use either rate() or increase().  The only 
>> difference is that rate(foo[5m]) gives the increase per second, while 
>> increase(foo[5m]) gives the increase per 5 minutes.
>>
>> Hence:
>> rate(kafka_consumergroup_current_offset[5m]) * 60
>> increase(kafka_consumergroup_current_offset[5m]) / 5
>> should both be the same, giving the per-minute increase.
>>
>> On Friday 19 April 2024 at 18:30:21 UTC+1 Brian Candler wrote:
>>
>>> Sorry, first link was wrong.
>>> https://groups.google.com/g/prometheus-users/c/IeW_3nyGkR0/m/unto0oGQAQAJ
>>> https://groups.google.com/g/prometheus-users/c/83pEAX44L3M/m/E20UmVJyBQAJ
>>>
>>> On Friday 19 April 2024 at 18:28:29 UTC+1 Brian Candler wrote:
>>>
 Can you give examples of the metrics in question, and what conditions 
 you're trying to check for?

 Looking at your specific PromQL query: Firstly, in my experience, it's 
 very unusual in Prometheus queries to use ==bool or >bool, and in this 
 specific case definitely seems to be wrong.

 Secondly, you won't be able to join the LH and RH sides of your 
 expression with "and" unless either they have exactly the same label sets, 
 or you modify your condition using "and on (...)" or "and ignoring (...)".

 "and" is a vector intersection operator, where the result vector 
 includes a value if the labels match, and the value is taken from the LHS, 
 and that means it doesn't combine the values like you might be used to in 
 other programming languages. For example,

 vector(0) and vector(1)  => value is 0
 vector(1) and vector(0)  => value is 1
 vector(42) and vector(99)  => value is 42

 This is as described in the documentation:

 vector1 and vector2 results in a vector consisting of the elements of 
 vector1 for which there are elements in vector2 with exactly matching 
 label sets. Other elements are dropped. The metric name and values are 
 carried over from the left-hand side vector.

 PromQL alerts on the presence of values, and in PromQL you need to 
 think in terms of "what (labelled) values are present or absent in this 
 vector", using the "and/unless" operators to suppress elements in the 
 result vector, and the "or" operator to add additional elements to the 
 result vector.

 Maybe these explanations help:

 https://groups.google.com/g/prometheus-users/c/IeW_3nyGkR0/m/NH2_CRPaAQAJ

 https://groups.google.com/g/prometheus-users/c/83pEAX44L3M/m/E20UmVJyBQAJ

 On Friday 19 April 2024 at 16:31:23 UTC+1 Robson Jose wrote:

> Good afternoon, I would like to know if it is possible to do this 
> query; the value that should be returned is applications with a value of 0 in 
> the first query and greater than one in the second.
>
> (
>   sum by (consumergroup, topic) 
> (delta(kafka_consumergroup_current_offset{}[5m])/5) ==bool 0
> ) 
> and (
>   sum by (topic) (delta(kafka_consumergroup_current_offset{}[5m])/5) 
> >bool 1
> )
>




[prometheus-users] Re: Compare metrics with differents labels

2024-04-19 Thread 'Brian Candler' via Prometheus Users
Maybe what you're trying to do is:

sum by (consumergroup, topic) (rate(kafka_consumergroup_current_offset[5m]) 
* 60) == 0
unless sum by (topic) (rate(kafka_consumergroup_current_offset[5m]) * 60) < 
1

That is: alert on any combination of (consumergroup,topic) where the 
5-minute rate of consumption is zero, unless the rate for that topic across 
all consumers is less than 1 per minute.

As far as I can tell, kafka_consumergroup_current_offset is a counter, and 
therefore you should use either rate() or increase().  The only difference 
is that rate(foo[5m]) gives the increase per second, while 
increase(foo[5m]) gives the increase per 5 minutes.

Hence:
rate(kafka_consumergroup_current_offset[5m]) * 60
increase(kafka_consumergroup_current_offset[5m]) / 5
should both be the same, giving the per-minute increase.

On Friday 19 April 2024 at 18:30:21 UTC+1 Brian Candler wrote:

> Sorry, first link was wrong.
> https://groups.google.com/g/prometheus-users/c/IeW_3nyGkR0/m/unto0oGQAQAJ
> https://groups.google.com/g/prometheus-users/c/83pEAX44L3M/m/E20UmVJyBQAJ
>
> On Friday 19 April 2024 at 18:28:29 UTC+1 Brian Candler wrote:
>
>> Can you give examples of the metrics in question, and what conditions 
>> you're trying to check for?
>>
>> Looking at your specific PromQL query: Firstly, in my experience, it's 
>> very unusual in Prometheus queries to use ==bool or >bool, and in this 
>> specific case definitely seems to be wrong.
>>
>> Secondly, you won't be able to join the LH and RH sides of your 
>> expression with "and" unless either they have exactly the same label sets, 
>> or you modify your condition using "and on (...)" or "and ignoring (...)".
>>
>> "and" is a vector intersection operator, where the result vector includes 
>> a value if the labels match, and the value is taken from the LHS, and that 
>> means it doesn't combine the values like you might be used to in other 
>> programming languages. For example,
>>
>> vector(0) and vector(1)  => value is 0
>> vector(1) and vector(0)  => value is 1
>> vector(42) and vector(99)  => value is 42
>>
>> This is as described in the documentation:
>>
>> vector1 and vector2 results in a vector consisting of the elements of 
>> vector1 for which there are elements in vector2 with exactly matching 
>> label sets. Other elements are dropped. The metric name and values are 
>> carried over from the left-hand side vector.
>>
>> PromQL alerts on the presence of values, and in PromQL you need to think 
>> in terms of "what (labelled) values are present or absent in this vector", 
>> using the "and/unless" operators to suppress elements in the result vector, 
>> and the "or" operator to add additional elements to the result vector.
>>
>> Maybe these explanations help:
>> https://groups.google.com/g/prometheus-users/c/IeW_3nyGkR0/m/NH2_CRPaAQAJ
>> https://groups.google.com/g/prometheus-users/c/83pEAX44L3M/m/E20UmVJyBQAJ
>>
>> On Friday 19 April 2024 at 16:31:23 UTC+1 Robson Jose wrote:
>>
>>> Good afternoon, I would like to know if it is possible to do this query, 
>>> the value that should return is applications with a value of 0 in the first 
>>> query and greater than one in the 2nd
>>>
>>> (
>>>   sum by (consumergroup, topic) 
>>> (delta(kafka_consumergroup_current_offset{}[5m])/5) ==bool 0
>>> ) 
>>> and (
>>>   sum by (topic) (delta(kafka_consumergroup_current_offset{}[5m])/5) 
>>> >bool 1
>>> )
>>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/9796ea54-47c9-47dc-8f87-460de1468a66n%40googlegroups.com.


[prometheus-users] Re: Compare metrics with differents labels

2024-04-19 Thread 'Brian Candler' via Prometheus Users
Sorry, first link was wrong.
https://groups.google.com/g/prometheus-users/c/IeW_3nyGkR0/m/unto0oGQAQAJ
https://groups.google.com/g/prometheus-users/c/83pEAX44L3M/m/E20UmVJyBQAJ

On Friday 19 April 2024 at 18:28:29 UTC+1 Brian Candler wrote:

> Can you give examples of the metrics in question, and what conditions 
> you're trying to check for?
>
> Looking at your specific PromQL query: Firstly, in my experience, it's 
> very unusual in Prometheus queries to use ==bool or >bool, and in this 
> specific case definitely seems to be wrong.
>
> Secondly, you won't be able to join the LH and RH sides of your expression 
> with "and" unless either they have exactly the same label sets, or you 
> modify your condition using "and on (...)" or "and ignoring (...)".
>
> "and" is a vector intersection operator, where the result vector includes 
> a value if the labels match, and the value is taken from the LHS, and that 
> means it doesn't combine the values like you might be used to in other 
> programming languages. For example,
>
> vector(0) and vector(1)  => value is 0
> vector(1) and vector(0)  => value is 1
> vector(42) and vector(99)  => value is 42
>
> This is as described in the documentation:
>
> vector1 and vector2 results in a vector consisting of the elements of 
> vector1 for which there are elements in vector2 with exactly matching 
> label sets. Other elements are dropped. The metric name and values are 
> carried over from the left-hand side vector.
>
> PromQL alerts on the presence of values, and in PromQL you need to think 
> in terms of "what (labelled) values are present or absent in this vector", 
> using the "and/unless" operators to suppress elements in the result vector, 
> and the "or" operator to add additional elements to the result vector.
>
> Maybe these explanations help:
> https://groups.google.com/g/prometheus-users/c/IeW_3nyGkR0/m/NH2_CRPaAQAJ
> https://groups.google.com/g/prometheus-users/c/83pEAX44L3M/m/E20UmVJyBQAJ
>
> On Friday 19 April 2024 at 16:31:23 UTC+1 Robson Jose wrote:
>
>> Good afternoon, I would like to know if it is possible to do this query, 
>> the value that should return is applications with a value of 0 in the first 
>> query and greater than one in the 2nd
>>
>> (
>>   sum by (consumergroup, topic) 
>> (delta(kafka_consumergroup_current_offset{}[5m])/5) ==bool 0
>> ) 
>> and (
>>   sum by (topic) (delta(kafka_consumergroup_current_offset{}[5m])/5) 
>> >bool 1
>> )
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/304a4437-6cbb-451b-b476-d3196dc6923bn%40googlegroups.com.


[prometheus-users] Re: Compare metrics with differents labels

2024-04-19 Thread 'Brian Candler' via Prometheus Users
Can you give examples of the metrics in question, and what conditions 
you're trying to check for?

Looking at your specific PromQL query: Firstly, in my experience, it's very 
unusual in Prometheus queries to use ==bool or >bool, and in this specific 
case definitely seems to be wrong.

Secondly, you won't be able to join the LH and RH sides of your expression 
with "and" unless either they have exactly the same label sets, or you 
modify your condition using "and on (...)" or "and ignoring (...)".

"and" is a vector intersection operator, where the result vector includes a 
value if the labels match, and the value is taken from the LHS, and that 
means it doesn't combine the values like you might be used to in other 
programming languages. For example,

vector(0) and vector(1)  => value is 0
vector(1) and vector(0)  => value is 1
vector(42) and vector(99)  => value is 42

This is as described in the documentation:

vector1 and vector2 results in a vector consisting of the elements of 
vector1 for which there are elements in vector2 with exactly matching label 
sets. Other elements are dropped. The metric name and values are carried 
over from the left-hand side vector.

PromQL alerts on the presence of values, and in PromQL you need to think in 
terms of "what (labelled) values are present or absent in this vector", 
using the "and/unless" operators to suppress elements in the result vector, 
and the "or" operator to add additional elements to the result vector.

Maybe these explanations help:
https://groups.google.com/g/prometheus-users/c/IeW_3nyGkR0/m/NH2_CRPaAQAJ
https://groups.google.com/g/prometheus-users/c/83pEAX44L3M/m/E20UmVJyBQAJ

On Friday 19 April 2024 at 16:31:23 UTC+1 Robson Jose wrote:

> Good afternoon, I would like to know if it is possible to do this query, 
> the value that should return is applications with a value of 0 in the first 
> query and greater than one in the 2nd
>
> (
>   sum by (consumergroup, topic) 
> (delta(kafka_consumergroup_current_offset{}[5m])/5) ==bool 0
> ) 
> and (
>   sum by (topic) (delta(kafka_consumergroup_current_offset{}[5m])/5) >bool 
> 1
> )
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/d54ade93-2ea4-438e-986a-a9c780ab71acn%40googlegroups.com.


Re: [prometheus-users] Re: Need urgent help!!! Want to modify tags "keys" to lowercase scraping from Cloudwatch-Exporter in Prometheus before sending to Mimir #13912

2024-04-18 Thread 'Brian Candler' via Prometheus Users
No. That test case demonstrates that it is the label *values* that are 
downcased, not the label names, exactly as you said.

On Thursday 18 April 2024 at 13:07:51 UTC+1 Vaibhav Ingulkar wrote:

> Thanks @Brian Candler
>
> Actually not possible fixing the data at source due to multiple 
> variations in diff aws services and huge data modification. So looking to 
> make it dynamically by capturing labels starting with "*tag_*".
>
> As mentioned here 
> https://github.com/prometheus/prometheus/blob/v2.45.4/model/relabel/relabel_test.go#L461-L482
>  can 
> you please give me one example of config to achieve it dynamically for all 
> labels starting with "*tag_*"
>
> It will be great help if that works for me. :)
>
>
> On Thursday, April 18, 2024 at 4:46:15 PM UTC+5:30 Brian Candler wrote:
>
>> You mean you're seeing tag_owner, tag_Owner, tag_OWNER from different 
>> instances? Because the tags weren't entered consistently?
>>
>> I don't see a lowercasing version of the "labelmap" action. So I think 
>> you're back to either:
>>
>> 1. fixing the data at source (e.g. using the EC2 API to read the tags and 
>> reset them to the desired values; and then make policies and procedures so 
>> that new instances have consistent tag names); or
>> 2. proxying / modifying the exporter
>>
>> > I think  lower/upper action in relabeling works to make "*values*" of 
>> labels to lower/upper 
>>
>> I believe so. The way I interpret it, "lowercase" action is the same as 
>> "replace", but the concatenated values from source_labels are lowercased 
>> first. Hence the fixed target_label that you specify will get the 
>> lowercased value, after any regex matching/capturing.
>>
>> The test case here agrees:
>>
>> https://github.com/prometheus/prometheus/blob/v2.45.4/model/relabel/relabel_test.go#L461-L482
>>
>> On Thursday 18 April 2024 at 11:47:16 UTC+1 Vaibhav Ingulkar wrote:
>>
>>> Additionally , I have prepare below config under metric_relable_configs
>>> - action: labelmap
>>>   regex: 'tag_(.*)'
>>>   replacement: $1
>>>
>>> It is giving me a new set of all labels starting with the word '*tag_*' as 
>>> added in the regex, but it is not converting them to lowercase, and it strips 
>>> "*tag_*" from the label name; for ex. *tag_Name* is converted to just "*Name*".
>>> Also the existing label *tag_Name* remains as it is, i.e. the old label 
>>> *tag_Name* and the new label *Name* are both present.
>>>
>>> So Firstly I want that "*tag_"* should remain as it it in new label and 
>>> it should get converted to lower case i.e. for ex. *tag_Budget_Code* to 
>>> *tag_budget_code* or *tag_Name* to *tag_name*
>>> Secondly need to remove old label for ex. *tag_Budget_Code* , *tag_Name* , 
>>> etc
>>>
>>> On Thursday, April 18, 2024 at 3:46:57 PM UTC+5:30 Vaibhav Ingulkar 
>>> wrote:
>>>
 Thanks @Brian Kochie

 Correct me if I am wrong but I think  lower/upper action in relabeling 
 works to make "*values*" of labels to lower/upper and not "*keys*" *i.e. 
 label name itself wont get convert to lowercase*. Right?

 Because I an using *v2.41.0 *and  have tried it and it is converting 
 all values of labels to lowercase.

 Here my requirement is to convert labels i.e. keys to lowercase for ex. 
 *tag_Budget_Code* to *tag_budget_code* or *tag_Name* to *tag_name*

 On Thursday, April 18, 2024 at 2:26:10 PM UTC+5:30 Brian Candler wrote:

> On Thursday 18 April 2024 at 09:42:41 UTC+1 Ben Kochie wrote:
>
> Prometheus can lower/upper in relabeling.
>
>
> Thanks! That was added in v2.36.0, and I missed it.
>


-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/c5a33b4a-b27c-447d-bc6a-3b14a5fb2e12n%40googlegroups.com.


Re: [prometheus-users] Re: Need urgent help!!! Want to modify tags "keys" to lowercase scraping from Cloudwatch-Exporter in Prometheus before sending to Mimir #13912

2024-04-18 Thread 'Brian Candler' via Prometheus Users
You mean you're seeing tag_owner, tag_Owner, tag_OWNER from different 
instances? Because the tags weren't entered consistently?

I don't see a lowercasing version of the "labelmap" action. So I think 
you're back to either:

1. fixing the data at source (e.g. using the EC2 API to read the tags and 
reset them to the desired values; and then make policies and procedures so 
that new instances have consistent tag names); or
2. proxying / modifying the exporter

> I think  lower/upper action in relabeling works to make "*values*" of 
labels to lower/upper 

I believe so. The way I interpret it, "lowercase" action is the same as 
"replace", but the concatenated values from source_labels are lowercased 
first. Hence the fixed target_label that you specify will get the 
lowercased value, after any regex matching/capturing.

The test case here agrees:
https://github.com/prometheus/prometheus/blob/v2.45.4/model/relabel/relabel_test.go#L461-L482
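
To illustrate the point about values (tag_Owner and tag_owner_lc below are 
placeholders):

- action: lowercase
  source_labels: [tag_Owner]
  target_label: tag_owner_lc   # gets e.g. "alice" even if the tag value was "Alice"

i.e. "lowercase" behaves like "replace" plus a downcasing step on the value; 
it never renames the source label itself.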

On Thursday 18 April 2024 at 11:47:16 UTC+1 Vaibhav Ingulkar wrote:

> Additionally , I have prepare below config under metric_relable_configs
> - action: labelmap
>   regex: 'tag_(.*)'
>   replacement: $1
>
> It is giving me a new set of all labels starting with the word '*tag_*' as 
> added in the regex, but it is not converting them to lowercase, and it strips 
> "*tag_*" from the label name; for ex. *tag_Name* is converted to just "*Name*".
> Also the existing label *tag_Name* remains as it is, i.e. the old label 
> *tag_Name* and the new label *Name* are both present.
>
> So Firstly I want that "*tag_"* should remain as it it in new label and 
> it should get converted to lower case i.e. for ex. *tag_Budget_Code* to 
> *tag_budget_code* or *tag_Name* to *tag_name*
> Secondly need to remove old label for ex. *tag_Budget_Code* , *tag_Name* , 
> etc
>
> On Thursday, April 18, 2024 at 3:46:57 PM UTC+5:30 Vaibhav Ingulkar wrote:
>
>> Thanks @Brian Kochie
>>
>> Correct me if I am wrong but I think  lower/upper action in relabeling 
>> works to make "*values*" of labels to lower/upper and not "*keys*" *i.e. 
>> label name itself wont get convert to lowercase*. Right?
>>
>> Because I an using *v2.41.0 *and  have tried it and it is converting all 
>> values of labels to lowercase.
>>
>> Here my requirement is to convert labels i.e. keys to lowercase for ex. 
>> *tag_Budget_Code* to *tag_budget_code* or *tag_Name* to *tag_name*
>>
>> On Thursday, April 18, 2024 at 2:26:10 PM UTC+5:30 Brian Candler wrote:
>>
>>> On Thursday 18 April 2024 at 09:42:41 UTC+1 Ben Kochie wrote:
>>>
>>> Prometheus can lower/upper in relabeling.
>>>
>>>
>>> Thanks! That was added in v2.36.0, and I missed it.
>>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/bb7045b4-eeaf-404a-8aaf-affeae3bcf95n%40googlegroups.com.


Re: [prometheus-users] Re: Need urgent help!!! Want to modify tags "keys" to lowercase scraping from Cloudwatch-Exporter in Prometheus before sending to Mimir #13912

2024-04-18 Thread 'Brian Candler' via Prometheus Users
On Thursday 18 April 2024 at 09:42:41 UTC+1 Ben Kochie wrote:

Prometheus can lower/upper in relabeling.


Thanks! That was added in v2.36.0, and I missed it.

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/b6a6d314-9c36-4d81-957d-22048c6b04ben%40googlegroups.com.


[prometheus-users] Re: Need urgent help!!! Want to modify tags "keys" to lowercase scraping from Cloudwatch-Exporter in Prometheus before sending to Mimir #13912

2024-04-18 Thread 'Brian Candler' via Prometheus Users
> Need urgent help!!!

See https://www.catb.org/~esr/faqs/smart-questions.html#urgent

> we can add *only one pattern (Uppercase or lowercase)* in template code.

At worst you can match like this: tag_Name=~"[fF][oO][oO][bB][aA][rR]"

I don't know of any way internally to prometheus to lowercase labels. What 
you could do though is to write a HTTP proxy: you scrape the proxy from 
prometheus, the proxy scrapes the upstream source, and modifies the labels 
before returning the results to prometheus.

Or: since you're using an external package anyway (cloudwatch_exporter), 
you could modify and recompile it yourself.

IMO it would be better if you fix the data at source, i.e. make your tags be 
consistent in AWS. Prometheus faithfully reproduces the data you give it. 
Garbage in, garbage out.

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/c6d8f693-8238-4929-9dbc-d96e64b57180n%40googlegroups.com.


[prometheus-users] Re: many-to-many not allowed error

2024-04-18 Thread 'Brian Candler' via Prometheus Users
Look at the results of each half of the query separately:

redis_memory_max_bytes{k8s_cluster_name="$cluster", 
namespace="$namespace", pod="$pod_name"}

redis_instance_info{role=~"master|slave"}

You then need to find some set of labels which mean that N entries on the 
left-hand side always match exactly 1 entry on the right-hand side.
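
The usual shape of the fix, assuming redis_instance_info has exactly one 
series per (k8s_cluster_name, namespace, pod) and that "role" is the label 
you want to copy across, is something like:

redis_memory_max_bytes{k8s_cluster_name="$cluster", namespace="$namespace", pod="$pod_name"}
  * on (k8s_cluster_name, namespace, pod) group_left(role)
    redis_instance_info{role=~"master|slave"}

Note that group_left() takes the names of extra labels to copy from the 
right-hand side (e.g. role), not a metric name; and since info-style metrics 
normally have the value 1, multiplying leaves the byte values unchanged.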

On Thursday 18 April 2024 at 07:30:49 UTC+1 saravanan E.M wrote:

> Hi Team
>
> Am getting many-to-many not allowed error while trying to join two time 
> series with role
>
> redis_memory_max_bytes{k8s_cluster_name="$cluster", 
> namespace="$namespace", pod="$pod_name"}
>   * on (k8s_cluster_name, namespace, pod) group_left(redis_instance_info) 
>   (redis_instance_info{role=~"master|slave"})
>
> Kindly help in having the correct query for this.
>
> Thanks
> Saravanan
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/205d246f-6b23-4474-955f-b71012eb3fbfn%40googlegroups.com.


Re: [prometheus-users] Re: Config DNS Prometheus/Blackbox_Exporter

2024-04-18 Thread 'Brian Candler' via Prometheus Users
You don't need a separate job for each DNS server. You can have a single 
job with multiple target blocks.

  - job_name: 'dns'
    scrape_interval: 5s
    metrics_path: /probe
    params:
      module: [dns_probe]
    static_configs:
      - targets:
          - www.google.com
          - www.mindfree.cl
        labels:
          dns: 208.67.220.220 #australia cloudflare
      - targets:
          - www.google.com
          - www.microsoft.com
        labels:
          dns: 198.55.49.149 #canada

    relabel_configs:
      - source_labels: [__address__]
        #target_label: __param_target
        target_label: __param_hostname
      # Populate target URL parameter with dns server IP
      - source_labels: [__param_hostname]
        target_label: instance
      #QUERY
      - source_labels: [dns]
        #target_label: __param_hostname
        target_label: __param_target
      # Populate __address__ with the address of the blackbox exporter to hit
      - target_label: __address__
        replacement: localhost:9115

(Although personally, I would use file_sd_configs for this, so I can edit 
the targets without having to re-read the prometheus config file).
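
As a sketch, that might look like this (the file path is arbitrary):

  - job_name: 'dns'
    scrape_interval: 5s
    metrics_path: /probe
    params:
      module: [dns_probe]
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/dns/*.yml
    # relabel_configs: unchanged from the job above

with a targets file such as /etc/prometheus/targets/dns/opendns.yml:

- targets:
    - www.google.com
    - www.mindfree.cl
  labels:
    dns: 208.67.220.220

Changes to those files are picked up automatically, without reloading 
prometheus.yml.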

On Thursday 18 April 2024 at 01:52:45 UTC+1 Vincent Romero wrote:

> [image: blackbox-dns1.png]
> log blackbox_exporter sorry
>
> El Wednesday, April 17, 2024 a la(s) 8:50:39 PM UTC-4, Vincent Romero 
> escribió:
>
>> Hello every i change the relabel
>>
>> y try this
>>
>> - job_name: '208.67.222.220-opendns' ##REBUILD new blackbox_expoerter
>> scrape_interval: 5s
>> metrics_path: /probe
>> params:
>> module: [dns_probe]
>> static_configs:
>> - targets:
>> - www.google.com
>> - www.mindfree.cl
>> labels:
>> dns: 208.67.220.220 #australia cloudflare
>>
>> relabel_configs:
>> - source_labels: [__address__]
>> #target_label: __param_target
>> target_label: __param_hostname
>> # Populate target URL parameter with dns server IP
>> - source_labels: [__param_hostname]
>> target_label: instance
>> #QUERY
>> - source_labels: [dns]
>> #target_label: __param_hostname
>> target_label: __param_target
>> # Populate __address__ with the address of the blackbox exporter to hit
>> - target_label: __address__
>> replacement: localhost:9115
>>
>> - job_name: '198.55.49.149-canada' ##REBUILD new blackbox_expoerter
>> scrape_interval: 5s
>> metrics_path: /probe
>> params:
>> module: [dns_probe]
>> static_configs:
>> - targets:
>> - www.google.com
>> - www.microsoft.com
>> labels:
>> dns: 198.55.49.149 #canada
>>
>> relabel_configs:
>> - source_labels: [__address__]
>> #target_label: __param_target
>> target_label: __param_hostname
>> # Populate target URL parameter with dns server IP
>> - source_labels: [__param_hostname]
>> target_label: instance
>> #QUERY
>> - source_labels: [dns]
>> #target_label: __param_hostname
>> target_label: __param_target
>> # Populate __address__ with the address of the blackbox exporter to hit
>> - target_label: __address__
>> replacement: localhost:9115
>>
>>
>> with this i can used in target any domain to resolve with labels dns
>>
>> in the log in blackbox i have this
>>
>> looking good no? 
>>
>>
>> El Friday, April 12, 2024 a la(s) 9:46:32 AM UTC-4, Brian Candler 
>> escribió:
>>
>>> It's not really related to blackbox_exporter itself, but I don't 
>>> entirely agree with that comment.
>>>
>>> There are two different things at play here: the address you send the 
>>> query to ("target"), and the name that you are looking up ("queryname").
>>>
>>> - For caching resolvers: large providers use anycast with fixed IP 
>>> addresses, since that's what you have to configure in your client (8.8.8.8, 
>>> 1.1.1.1 etc). Those target addresses will *never* change.  I think 
>>> 185.228.168.9 
>>> falls into this category too: although you could get to it by resolving "
>>> security-filter-dns.cleanbrowsing.org", for a filtered DNS service 
>>> you'd always be using the IP address directly.
>>>
>>> - For authoritative servers: using the nameserver's DNS name (e.g. 
>>> ns.example.com) more closely reflects what the DNS does during 
>>> resolution, but makes it harder to work out what's going wrong if it fails. 
>>> The IP addresses that NS records resolve to can change, but very rarely do 
>>> (and since it's your own authoritative nameservers, you'll know if you 
>>> renumber them). Furthermore, in my experience, NS names are never 

[prometheus-users] Re: Prometheus Azure Service Discovery behind a proxy server

2024-04-15 Thread 'Brian Candler' via Prometheus Users
> Is there a way to enable or add proxy config just for the service 
discoery and microsoft authentication part ?

The configuration of azure sd is here:
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#azure_sd_config
It has its own local settings for proxy_url, proxy_connect_header etc, 
which relate purely to the service discovery, and not to scraping.
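
A sketch of that (the subscription ID and proxy address are placeholders):

  - job_name: 'azure-vms'
    azure_sd_configs:
      - subscription_id: 00000000-0000-0000-0000-000000000000
        authentication_method: ManagedIdentity
        port: 9100
        proxy_url: http://proxy.internal:3128   # used only for the Azure API calls
    # no proxy at the job level, so the scrapes themselves go direct to the
    # discovered private IPs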

On Monday 15 April 2024 at 01:01:59 UTC+1 Durga Prasad Kommareddy wrote:

> I have Prometheus running on a azure VM. And have few other VMs in 
> multiple subscriptions peered with the prometheus VM/Vnet.
>
> So i can reach the target VM metrics at http://IP:9100/metrics. But the 
> service discovery itself is not working unless i use a public IP/internet  
> Prometheus service discovery and microsoft authentication.
>
> Is there a way to enable or add proxy config just for the service discoery 
> and microsoft authentication part ? i dont need proxy for the actual 
> metrcis scraping because the VM can talk to all my target VMs so that'll 
> work.  

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/16554dc8-5482-4c8e-936f-3f0e8294c7f4n%40googlegroups.com.


Re: [prometheus-users] Re: Config DNS Prometheus/Blackbox_Exporter

2024-04-12 Thread 'Brian Candler' via Prometheus Users
It's not really related to blackbox_exporter itself, but I don't entirely 
agree with that comment.

There are two different things at play here: the address you send the query 
to ("target"), and the name that you are looking up ("queryname").

- For caching resolvers: large providers use anycast with fixed IP 
addresses, since that's what you have to configure in your client (8.8.8.8, 
1.1.1.1 etc). Those target addresses will *never* change.  I think 
185.228.168.9 
falls into this category too: although you could get to it by resolving 
"security-filter-dns.cleanbrowsing.org", 
for a filtered DNS service you'd always be using the IP address directly.

- For authoritative servers: using the nameserver's DNS name (e.g. 
ns.example.com) more closely reflects what the DNS does during resolution, 
but makes it harder to work out what's going wrong if it fails. The IP 
addresses that NS records resolve to can change, but very rarely do (and 
since it's your own authoritative nameservers, you'll know if you renumber 
them). Furthermore, in my experience, NS names are never geo-aware: they 
always return static IPs (although these may point to anycast addresses).

- Geo-aware DNS generally takes place for the user-visible query names 
(like "www.google.com") and generally are affected by the *source* address 
where the query is coming from.

On Friday 12 April 2024 at 14:21:57 UTC+1 Conall O'Brien wrote:

> On Wed, 10 Apr 2024 at 06:47, 'Brian Candler' via Prometheus Users <
> promethe...@googlegroups.com> wrote:
>
>> One exporter scrape = one probe test and I think that should remain. You 
>> can get what you want by expanding the targets (which is a *list* of 
>> targets+labels):
>>
>>   static_configs:
>> - targets:
>> - 1.1.1.1
>> - 185.228.168.9
>>   labels:
>> queryname: www.google.com
>> - targets:
>> - 1.1.1.1
>> - 185.228.168.9
>>   labels:
>> queryname: www.microsoft.com
>>
>
> Given the targets, I would strongly suggest using DNS names over raw IP 
> addresses for every scrape. Large providers use geo-aware DNS systems, so 
> the IP numbers change over time for a number of reasons (e.g maintenance, 
> capacity turnup/turndown, etc). Probing raw IPs will not reflect the actual 
> state of the service.
>  
>
>> On Tuesday 9 April 2024 at 22:48:44 UTC+1 Vincent Romero wrote:
>>
>>> Hello, this worked
>>>
>>> With the new feature with simple domain works, but considered whether 
>>> the label required adding N domains?
>>>
>>> Y try add other domain in the same labels
>>>
>>>   - job_name: 'blackbox-dns-monitor'
>>> scrape_interval: 5s
>>> metrics_path: /probe
>>> params:
>>>   module: [dns_probe]
>>> static_configs:
>>>   - targets:
>>> - 1.1.1.1 #australia cloudflare
>>> - 185.228.168.9 #ireland
>>> labels:
>>>   queryname: www.google.com, www.microsoft.com NOT WORK
>>>   queryname: www.microsoft.com NOT WORK (add line)
>>>
>>> [image: Captura de pantalla 2024-04-09 a la(s) 17.44.20.png]
>>>
>>> El Tuesday, April 9, 2024 a la(s) 12:19:25 PM UTC-4, Vincent Romero 
>>> escribió:
>>>
>>>> i will try make build, with this change
>>>>
>>>>
>>>>
>>>> El Saturday, April 6, 2024 a la(s) 2:45:29 PM UTC-3, Brian Candler 
>>>> escribió:
>>>>
>>>>> You're correct that currently the qname is statically configured in 
>>>>> the prober config.
>>>>>
>>>>> A patch was submitted to allow what you want, but hasn't been merged:
>>>>> https://github.com/prometheus/blackbox_exporter/pull/1105
>>>>>
>>>>> You can build blackbox_exporter yourself with this patch applied 
>>>>> though.
>>>>>
>>>>> On Saturday 6 April 2024 at 18:06:01 UTC+1 Vincent Romero wrote:
>>>>>
>>>>>> Hello everyone
>>>>>>
>>>>>> what is the difference between http_2xx and dns module configuration
>>>>>>
>>>>>>
>>>>>> I have this example y my config
>>>>>>
>>>>>> blackbox.yml
>>>>>> modules:
>>>>>>   http_2xx:
>>>>>> prober: http
>>>>>> http:
>>>>>>   preferred_ip_protocol: "ip4"
>>>>>>   http_post_2xx:
>

[prometheus-users] Re: Config DNS Prometheus/Blackbox_Exporter

2024-04-09 Thread 'Brian Candler' via Prometheus Users
One exporter scrape = one probe test and I think that should remain. You 
can get what you want by expanding the targets (which is a *list* of 
targets+labels):

  static_configs:
    - targets:
        - 1.1.1.1
        - 185.228.168.9
      labels:
        queryname: www.google.com
    - targets:
        - 1.1.1.1
        - 185.228.168.9
      labels:
        queryname: www.microsoft.com

On Tuesday 9 April 2024 at 22:48:44 UTC+1 Vincent Romero wrote:

> Hello, this worked
>
> With the new feature with simple domain works, but considered whether the 
> label required adding N domains?
>
> Y try add other domain in the same labels
>
>   - job_name: 'blackbox-dns-monitor'
> scrape_interval: 5s
> metrics_path: /probe
> params:
>   module: [dns_probe]
> static_configs:
>   - targets:
> - 1.1.1.1 #australia cloudflare
> - 185.228.168.9 #ireland
> labels:
>   queryname: www.google.com, www.microsoft.com NOT WORK
>   queryname: www.microsoft.com NOT WORK (add line)
>
> [image: Captura de pantalla 2024-04-09 a la(s) 17.44.20.png]
>
> El Tuesday, April 9, 2024 a la(s) 12:19:25 PM UTC-4, Vincent Romero 
> escribió:
>
>> i will try make build, with this change
>>
>>
>>
>> El Saturday, April 6, 2024 a la(s) 2:45:29 PM UTC-3, Brian Candler 
>> escribió:
>>
>>> You're correct that currently the qname is statically configured in the 
>>> prober config.
>>>
>>> A patch was submitted to allow what you want, but hasn't been merged:
>>> https://github.com/prometheus/blackbox_exporter/pull/1105
>>>
>>> You can build blackbox_exporter yourself with this patch applied though.
>>>
>>> On Saturday 6 April 2024 at 18:06:01 UTC+1 Vincent Romero wrote:
>>>
 Helo everyone

 what is the difference between http_2xx and dns module configuration


 I have this example y my config

 blackbox.yml
 modules:
   http_2xx:
 prober: http
 http:
   preferred_ip_protocol: "ip4"
   http_post_2xx:
 prober: http
 http:
   method: POST
   www.google.com:
 prober: dns
 timeout: 1s
 dns:
   transport_protocol: "udp"
   preferred_ip_protocol: "ip4"
   query_name: "www.google.com"
   query_type: "A"
   valid_rcodes:
 - NOERROR

 prometheus.yml
   - job_name: 'blackbox'
 metrics_path: /probe
 params:
   module: [http_2xx]
 static_configs:
   - targets:
 - https://www.google.com
 relabel_configs:
   - source_labels: [__address__]
 target_label: __param_target
   - source_labels: [__param_target]
 target_label: instance
   - target_label: __address__
 replacement: localhost:9115

   - job_name: 'blackbox-dns-monitor'
 scrape_interval: 1s
 metrics_path: /probe
   #params:
   #module: [mindfree.cl]
 relabel_configs:
 # Populate domain label with domain portion of __address__
 - source_labels: [__address__]
   regex: (.*):.*$
   replacement: $1
   target_label: domain
 # Populate instance label with dns server IP portion of __address__
 - source_labels: [__address__]
   regex: .*:(.*)$
   replacement: $1
   target_label: instance
 # Populate module URL parameter with domain portion of __address__
 # This is a parameter passed to the blackbox exporter
 - source_labels: [domain]
   target_label: __param_module
 # Populate target URL parameter with dns server IP
 - source_labels: [instance]
   target_label: __param_target
 # Populate __address__ with the address of the blackbox exporter to 
 hit
 - target_label: __address__
   replacement: localhost:9115

 static_configs:
   - targets:
 - www.google.com:1.1.1.1 #australia cloudflare
  - www.google.com:8.8.8.8 #example other nameserver


 So, i will try config a simple DNS resolution for any domain
 If i want add other nameserver i need to add other line with the same 
 domain

 Why, when I use the http_2xx module, do I simply add the target?

 Thanks

>>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/74285578-2c0c-48e1-ac85-4ca80cd9bcffn%40googlegroups.com.


[prometheus-users] Re: what to do about flapping alerts?

2024-04-08 Thread 'Brian Candler' via Prometheus Users
On Monday 8 April 2024 at 20:57:34 UTC+1 Christoph Anton Mitterer wrote:

Assume the following (arguably a bit made up) example:
One has a metric that counts the number of failed drives in a RAID. One 
drive fails so some alert starts firing. Eventually the computing centre 
replaces the drive and it starts rebuilding (guess it doesn't matter 
whether the rebuilding is still considered to cause an alert or not). 
Eventually it finishes and the alert should go away (and I should e.g. get 
a resolved message).
But because of keep_firing_for, it doesn't stop straight away.
Now before it does, yet another disk fails.
But for Prometheus, with keep_firing_for, it will be like the same alert.


If the alerts have the exact same set of labels (e.g. the alert is at the 
level of the RAID controller, not at the level of individual drives) then 
yes.

It failed, it was fixed, and it failed again within keep_firing_for: then you only 
get a single alert, with no additional notification.

But that's not the problem you originally asked for:

"When the target goes down, the alert clears and as soon as it's back, it 
pops up again, sending a fresh alert notification."

keep_firing_for can be set differently for different alerts.  So you can 
set it to 10m for the "up == 0" alert, and not set it at all for the RAID 
alert, if that's what you want.
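
For example (rule names invented; the RAID expression uses node_exporter's 
mdadm metric, substitute whatever you actually alert on):

- alert: TargetDown
  expr: up == 0
  for: 5m
  keep_firing_for: 10m    # rides out brief recoveries / scrape flaps

- alert: RaidDiskFailed
  expr: node_md_disks{state="failed"} > 0
  # no keep_firing_for: resolves as soon as the expression returns nothing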

 


Also, depending on how large I have to set keep_firing_for, I will also get 
resolve messages later... which depending on what one does with the alerts 
may also be less desirable.


Surely that delay is essential for the de-flapping scenario you describe: 
you can't send the alert resolved message until you are *sure* the alert 
has resolved (i.e. after keep_firing_for).

Conversely: if you sent the alert resolved message immediately (before 
keeping_firing_for had expired), and the problem recurred, then you'd have 
to send out a new alert failing message - which is the flap noise I think 
you are asking to suppress.

In any case, sending out resolved messages is arguably a bad idea:
https://www.robustperception.io/running-into-burning-buildings-because-the-fire-alarm-stopped

I turned them off, and:
(a) it immediately reduced notifications by 50%
(b) it encourages that alerts are properly investigated (or that alerts are 
properly tuned)

That is: if something was important enough to alert on in the first place, 
then it's important enough to investigate thoroughly, even if the threshold 
has been crossed back to normal since then. And if it wasn't important 
enough to alert on, then the alerting rule needs adjusting to make it less 
noisy.

This is expanded upon in this document:
https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit

 


I think the main problem behind may be rather a conceptual one, namely that 
Prometheus uses "no data" for no alert, which happens as well when there is 
no data because of e.g. scrape failures, so it can’t really differentiate 
between the two conditions.


I think it can.

Scrape failures can be explicitly detected by up == 0.  Alert on those 
separately.

The odd occasional missed scrape doesn't affect most other queries because 
of the lookback-delta: i.e. instant vector queries will look up to 5 
minutes into the past. As long as you're scraping every 2 minutes, you can 
always survive a single failed scrape without noticing it.
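
In other words, something like this in prometheus.yml is enough (a sketch):

global:
  scrape_interval: 2m   # one missed scrape still leaves a sample inside the 5m lookback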

If your device goes away for longer than 5 minutes, then sure the alerting 
data will no longer be there - but then you have no idea whether the 
condition you were alerting on or not exists (since you have no visibility 
of the target state).  Instead, you have a "scrape failed" condition, which 
as I said already, is easy to alert on.

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/6e6de7dd-b156-475f-b76d-6f758f2c3189n%40googlegroups.com.


[prometheus-users] Re: Config DNS Prometheus/Blackbox_Exporter

2024-04-06 Thread 'Brian Candler' via Prometheus Users
You're correct that currently the qname is statically configured in the 
prober config.

A patch was submitted to allow what you want, but hasn't been merged:
https://github.com/prometheus/blackbox_exporter/pull/1105

You can build blackbox_exporter yourself with this patch applied though.

On Saturday 6 April 2024 at 18:06:01 UTC+1 Vincent Romero wrote:

> Hello everyone
>
> what is the difference between http_2xx and dns module configuration
>
>
> I have this example y my config
>
> blackbox.yml
> modules:
>   http_2xx:
> prober: http
> http:
>   preferred_ip_protocol: "ip4"
>   http_post_2xx:
> prober: http
> http:
>   method: POST
>   www.google.com:
> prober: dns
> timeout: 1s
> dns:
>   transport_protocol: "udp"
>   preferred_ip_protocol: "ip4"
>   query_name: "www.google.com"
>   query_type: "A"
>   valid_rcodes:
> - NOERROR
>
> prometheus.yml
>   - job_name: 'blackbox'
> metrics_path: /probe
> params:
>   module: [http_2xx]
> static_configs:
>   - targets:
> - https://www.google.com
> relabel_configs:
>   - source_labels: [__address__]
> target_label: __param_target
>   - source_labels: [__param_target]
> target_label: instance
>   - target_label: __address__
> replacement: localhost:9115
>
>   - job_name: 'blackbox-dns-monitor'
> scrape_interval: 1s
> metrics_path: /probe
>   #params:
>   #module: [mindfree.cl]
> relabel_configs:
> # Populate domain label with domain portion of __address__
> - source_labels: [__address__]
>   regex: (.*):.*$
>   replacement: $1
>   target_label: domain
> # Populate instance label with dns server IP portion of __address__
> - source_labels: [__address__]
>   regex: .*:(.*)$
>   replacement: $1
>   target_label: instance
> # Populate module URL parameter with domain portion of __address__
> # This is a parameter passed to the blackbox exporter
> - source_labels: [domain]
>   target_label: __param_module
> # Populate target URL parameter with dns server IP
> - source_labels: [instance]
>   target_label: __param_target
> # Populate __address__ with the address of the blackbox exporter to hit
> - target_label: __address__
>   replacement: localhost:9115
>
> static_configs:
>   - targets:
> - www.google.com:1.1.1.1 #australia cloudflare
>  - www.google.com:8.8.8.8 #example other nameserver
>
>
> So, i will try config a simple DNS resolution for any domain
> If i want add other nameserver i need to add other line with the same 
> domain
>
> Why, when I use the http_2xx module, do I simply add the target?
>
> Thanks
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/f2c1373c-51a6-446d-8ec1-d2e784abfd40n%40googlegroups.com.


[prometheus-users] Re: what to do about flapping alerts?

2024-04-06 Thread 'Brian Candler' via Prometheus Users
> but AFAIU that would simply affect all alerts, i.e. it wouldn't just keep 
firing, when the scraping failed, but also when it actually goes back to an 
ok state, right?

It affects all alerts individually, and I believe it's exactly what you 
want. A brief flip from "failing" to "OK" doesn't resolve the alert; it 
only resolves if it has remained in the "OK" state for the keep_firing_for 
duration. Therefore you won't get a fresh alert until it's been OK for at 
least keep_firing_for and *then* fails again.

As you correctly surmise, an alert isn't really a boolean condition, it's a 
presence/absence condition: the expr returns a vector of 0 or more alerts, 
each with a unique combination of labels.  "keep_firing_for" retains a 
particular labelled value in the vector for a period of time even if it's 
no longer being generated by the alerting "expr".  Hence if it does 
reappear in the expr output during that time, it's just a continuation of 
the previous alert.

> Similarly, when a node goes completely down (maintenance or so) and then 
up again, all alerts would then start again to fire (and even a generous 
keep_firing_for would have been exceeded)... and send new notifications.

I don't understand what you're saying here. Can you give some specific 
examples?

If you have an alerting expression like "up == 0" and you take 10 machines 
down then your alerting expression will return a vector of ten zeros and 
this will generate ten alerts (typically grouped into a single 
notification, if you use the default alertmanager config)

When they revert to up == 1 then they won't "start again to fire", because 
they were already firing. Indeed, it's almost the opposite. Let's say you 
have keep_firing_for: 10m, then if any machine goes down in the 10 minutes 
after the end of maintenance then it *won't* generate a new alert, because 
it will just be a continuation of the old one.

However, when you're doing maintenance, you might also be using silences to 
prevent notifications. In that case you might want your silence to extend 
10 minutes past the end of the maintenance period.

On Saturday 6 April 2024 at 04:03:07 UTC+1 Christoph Anton Mitterer wrote:

> Hey.
>
> I have some simple alerts like:
> - alert: node_upgrades_non-security_apt
>   expr:  'sum by (instance,job) ( 
> apt_upgrades_pending{origin!~"(?i)^.*-security(?:\\PL.*)?$"} )'
> - alert: node_upgrades_security_apt
>   expr:  'sum by (instance,job) ( 
> apt_upgrades_pending{origin=~"(?i)^.*-security(?:\\PL.*)?$"} )'
>
> If there's no upgrades, these give no value.
> Similarly, for all other simple alerts, like free disk space:
> 1 - node_filesystem_avail_bytes{mountpoint="/", fstype!="rootfs", 
> instance!~"(?i)^.*\\.garching\\.physik\\.uni-muenchen\\.de$"} / 
> node_filesystem_size_bytes  >  0.80
>
> No value => all ok, some value => alert.
>
> I do have some instances which are pretty unstable (i.e. scraping fails 
> every know and then - or more often than that), which are however mostly 
> out of my control, so I cannot do anything about that.
>
> When the target goes down, the alert clears and as soon as it's back, it 
> pops up again, sending a fresh alert notification.
>
> Now I've seen:
> https://github.com/prometheus/prometheus/pull/11827
> which describes keep_firing_for as "the minimum amount of time that an 
> alert should remain firing, after the expression does not return any 
> results", respectively in 
> https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/#rule
>  
> :
> # How long an alert will continue firing after the condition that 
> triggered it # has cleared. [ keep_firing_for:  | default = 0s ] 
>
> but AFAIU that would simply affect all alerts, i.e. it wouldn't just keep 
> firing, when the scraping failed, but also when it actually goes back to an 
> ok state, right?
> That's IMO however rather undesirable.
>
> Similarly, when a node goes completely down (maintenance or so) and then 
> up again, all alerts would then start again to fire (and even a generous 
> keep_firing_for would have been exceeded)... and send new notifications.
>
>
> Is there any way to solve this? Especially that one doesn't get new 
> notifications sent, when the alert never really stopped?
>
> At least I wouldn't understand how keep_firing_for would do this.
>
> Thanks,
> Chris.
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/fa157174-2d90-45f0-9084-dc28e52e88dan%40googlegroups.com.


[prometheus-users] Re: Prometheus alert tagging issue - multiple servers

2024-04-03 Thread 'Brian Candler' via Prometheus Users
On Wednesday 3 April 2024 at 16:01:21 UTC+1 mohan garden wrote:

Is there a way i can see the entire message which alert manager sends out 
to the Opsgenie? - somewhere in the alertmanager logs or a text file?


You could try setting api_url to point to a webserver that you control.

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/27e70b2b-9101-478e-9a2b-364f6287da32n%40googlegroups.com.


[prometheus-users] Re: Prometheus alert tagging issue - multiple servers

2024-04-03 Thread 'Brian Candler' via Prometheus Users
> but i was expecting an additional host=server2 tag on the ticket. 

You won't get that, because CommonLabels is exactly how it sounds: those 
labels which are common to all the alerts in the group.  If one alert has 
instance=server1 and the other has instance=server2, but they're in the 
same alert group, then no 'instance' will appear in CommonLabels.

The documentation is here:
https://prometheus.io/docs/alerting/latest/notifications/

It looks like you could iterate over Alerts.Firing then the Labels within 
each alert.
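
An untested sketch of that, reusing the reReplaceAll trick from your config 
(it only covers the host= part; the tags derived from CommonLabels can stay 
as they are):

  tags: '{{ range .Alerts.Firing }}{{ with .Labels.instance }}host={{ reReplaceAll "(.+):(.+)" "$1" . }},{{ end }}{{ end }}infra,monitor'

You may get a repeated host= entry when several firing alerts come from the 
same instance.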

Alternatively, you could disable grouping and let opsgenie do the grouping 
(I don't know opsgenie, so I don't know how good a job it would do of that)


On Wednesday 3 April 2024 at 09:11:24 UTC+1 mohan garden wrote:

> *correction: 
> *Scenario2: *While server1 trigger is active, a second server ( say 
> server2)'s local disk usage reaches 50%,
>
> i see that the already open Opsgenie ticket's details gets updated as:
>
> ticket header name:  local disk usage reached 50%
> ticket description:  space on /var file system at server1:9100 server = 
> 82%."
>  space on /var file system at 
> server2:9100 server = 80%."
> ticket tags: criteria: overuse , team: support, severity: critical, 
> infra,monitor,host=server1
>
> [image: photo003.png]
>
>
>
> On Wednesday, April 3, 2024 at 1:37:12 PM UTC+5:30 mohan garden wrote:
>
>> Hi Brian, 
>> Thank you for the response, Here are some more details, hope this will 
>> help you in gaining more understanding into the configuration and method i 
>> am using to generate tags :
>>
>>
>> 1. We collect data from the node exporter, and have created some rules 
>> around the collected data. Here is one example - 
>> - alert: "Local Disk usage has reached 50%"
>>   expr: (round( 
>> node_filesystem_avail_bytes{mountpoint=~"/dev.*|/sys*|/|/home|/tmp|/var.*|/boot.*",}
>>  
>> / 
>> node_filesystem_size_bytes{mountpoint=~"/dev.*|/sys*|/|/home|/tmp|/var.*|/boot.*"}
>>  
>> * 100  ,0.1) >= y ) and (round( 
>> node_filesystem_avail_bytes{mountpoint=~"/dev.*|/sys*|/|/home|/tmp|/var.*|/boot.*"}
>>  
>> / 
>> node_filesystem_size_bytes{mountpoint=~"/dev.*|/sys*|/|/home|/tmp|/var.*|/boot.*"}
>>  
>> * 100  ,0.1) <= z )
>>   for: 5m
>>   labels:
>> criteria: overuse
>> severity: critical
>> team: support
>>   annotations:
>> summary: "{{ $labels.instance }} 's  ({{ $labels.device }}) has 
>> low space."
>> description: "space on {{ $labels.mountpoint }} file system at {{ 
>> $labels.instance }} server = {{ $value }}%."
>>
>> 2. at the alert manager , we have created notification rules to notify in 
>> case the aforementioned condition occurs:
>>
>>   smtp_from: 'ser...@example.com'
>>   smtp_require_tls: false
>>   smtp_smarthost: 'ser...@example.com:25 '
>>
>> templates:
>>   - /home/ALERTMANAGER/conf/template/*.tmpl
>>
>> route:
>>   group_wait: 5m
>>   group_interval: 2h
>>   repeat_interval: 5h
>>   receiver: admin
>>   routes:
>>   - match_re:
>>   alertname: ".*Local Disk usage has reached .*%"
>> receiver: admin
>> routes:
>> - match:
>> criteria: overuse
>> severity: critical
>> team: support
>>   receiver: mailsupport
>>   continue: true
>> - match:
>> criteria: overuse
>> team: support
>> severity: critical
>> receiver: opsgeniesupport
>>
>> receivers:
>>   - name: opsgeniesupport
>> opsgenie_configs:
>> - api_key: XYZ
>>   api_url: https://api.opsgenie.com
>>   message: '{{ .CommonLabels.alertname }}'
>>   description: "{{ range .Alerts }}{{ .Annotations.description 
>> }}\n\r{{ end }}"
>>   tags: '{{ range $k, $v := .CommonLabels}}{{ if or (eq $k 
>> "criteria")  (eq $k "severity") (eq $k "team") }}{{$k}}={{$v}},{{ else if 
>> eq $k "instance" }}{{ reReplaceAll "(.+):(.+)" "host=$1" $v 
>> }},{{end}}{{end}},infra,monitor'
>>   priority: 'P1'
>>   update_alerts: true
>>   send_resolved: true
>> ...
>> So you can see that i derive a  tag host= from the instance 
>> label.
>>
>>
>> *Scenario1: *When server1 's local disk usage reaches 50%, i see that 
>> Opsgenie ticket is created having:
>> Opsgenie Ticket metadata: 
>> ticket header name:  local disk usage reached 50%
>> ticket description:  space on /var file system at server1:9100 server = 
>> 82%."
>> ticket tags: criteria: overuse , team: support, severity: critical, 
>> infra,monitor,host=server1
>>
>> so everything works as expected, no issues with Scenario1.
>>
>>
>> *Scenario2: *While server1 trigger is active, a second server ( say 
>> server2)'s local disk usage reaches 50%,
>>
>> i see that Opsgenie tickets are getting updated as:
>> ticket header name:  local disk usage reached 50%
>> ticket description:  space on /var file system at server1:9100 server = 
>> 82%."
>> ticket description:  space on /var file system at server2:9100 server = 
>> 80%."
>> 

[prometheus-users] Re: Prometheus alert tagging issue - multiple servers

2024-04-02 Thread 'Brian Candler' via Prometheus Users
FYI, those images are unreadable - copy-pasted text would be much better.

My guess, though, is that you probably don't want to group alerts before 
sending them to opsgenie. You haven't shown your full alertmanager config, 
but if you have a line like

   group_by: ['alertname']

then try

   group_by: ["..."]

(literally, exactly that: a single string containing three dots, inside 
square brackets)

On Tuesday 2 April 2024 at 17:15:39 UTC+1 mohan garden wrote:

> Dear Prometheus Community,
> I am reaching out regarding an issue i have encountered with  prometheus 
> alert tagging, specifically while creating tickets in Opsgenie.
>
>
> I have configured alertmanager  to send alerts to Opsgenie as , the 
> configuration as :
> [image: photo001.png]i ticket is generated with expected description and 
> tags as - 
> [image: photo002.png]
>
> Now, by default the alerts are grouped by the alert name( default 
> behavior).So when the similar event happens on a different server i see 
> that the description is updated as:
> [image: photo003.png]
> but the tag on the ticket remains same, 
> expected behavior: criteria=..., host=108, host=114, infra.support 
>
> I have set update_alert and send_resolved settings to true.
> I am not sure that in order to make it work as expected, If i need 
> additional configuration at opsgenie or at the alertmanager. 
>
> I would appreciate any insight or guidance on the method to resolve this 
> issue and ensure that alerts for different servers are correctly tagged in 
> Opsgenie.
>
> Thank you in advance.
> Regards,
> CP
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/f4ec4e77-672a-42a5-ad5a-1aa9f82d6b3en%40googlegroups.com.


Re: [prometheus-users] Assistance Needed with Prometheus and Alertmanager Configuration

2024-03-30 Thread 'Brian Candler' via Prometheus Users
Only you can determine that, by comparing the lists of alerts from both 
sides and seeing what differs, and looking into how they are generated and 
measured. There are all kinds of things which might affect this, e.g. 
pending/keep_firing_for alerts, group wait etc.

But you might also want to read this:
https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit

If you're generating more than a handful of alerts per day, then maybe you 
need to reconsider what constitutes an "alert".

On Saturday 30 March 2024 at 09:49:04 UTC Trio Official wrote:

> Thank you for your prompt response and guidance on addressing the metric 
> staleness issue.
>
> Regarding metric staleness  I confirm that I have already implemented the 
> approach to use square brackets for the recording metrics and alerting rule
>  (e.g. max_over_time(metric[1h])). However, the main challenge persists 
> with the discrepancy in the number of alerts generated by Prometheus 
> compared to those displayed in Alertmanager. 
>
> To illustrate, when observing Prometheus, I may observe approximately 
> 25,000 alerts triggered within a given period. However, when reviewing the 
> corresponding alerts in Alertmanager, the count often deviates 
> significantly, displaying figures such as 10,000 or 18,000, rather than the 
> expected 25,000.
>
> This inconsistency poses a significant challenge in our alert management 
> process, leading to confusion and potentially overlooking critical alerts.
>
> I would greatly appreciate any further insights or recommendations you may 
> have to address this issue and ensure alignment between Prometheus and 
> Alertmanager in terms of the number of alerts generated and displayed.
> On Saturday, March 30, 2024 at 2:29:42 PM UTC+5:30 Brian Candler wrote:
>
>> On Friday 29 March 2024 at 22:09:18 UTC Chris Siebenmann wrote:
>>
>> I believe that recording rules and alerting rules similarly may have 
>> their evaluation time happen at different offsets within their 
>> evaluation interval. This is done for the similar reason of spreading 
>> out the internal load of rule evaluations across time.
>>
>>
>> I think it's more accurate to say that *rule groups* are spread 
>> over their evaluation interval, and rules within the same rule group are 
>> evaluated sequentially.
>>  
>> This is how you can build rules that depend on each other, e.g. a recording 
>> rule followed by other rules that depend on its output; put them in the 
>> same rule group.
>>
>> As for scraping: you *can* change this staleness interval, 
>> using --query.lookback-delta, but it's strongly not recommended. Using the 
>> default of 5 mins, you should use a maximum scrape interval of 2 mins so 
>> that even if you miss one scrape for a random reason, you still have two 
>> points within the lookback-delta so that the timeseries does not go stale.
>>
>> There's no good reason to scrape at one hour intervals:
>> * Prometheus is extremely efficient with its storage compression, 
>> especially when adjacent data points are equal, so scraping the same value 
>> every 2 minutes is going to use hardly any more storage than scraping it 
>> every hour.
>> * If you're worried about load on the exporter because responding to a 
>> scrape is slow or expensive, then you should run the exporter every hour 
>> from a local cronjob, and write its output to a persistent location (e.g. 
>> to PushGateway or statsd_exporter, or simply write it to a file which can 
>> be picked up by node_exporter textfile-collector or even a vanilla HTTP 
>> server).  You can then scrape this as often as you like.
>>
>> node_exporter textfile-collector exposes an extra metric for the 
>> timestamp on each file, so you can alert in the case that the file isn't 
>> being updated.
>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/4471ac2e-ee83-494a-9a90-a7c86992a9f6n%40googlegroups.com.


Re: [prometheus-users] Assistance Needed with Prometheus and Alertmanager Configuration

2024-03-30 Thread 'Brian Candler' via Prometheus Users
On Friday 29 March 2024 at 22:09:18 UTC Chris Siebenmann wrote:

I believe that recording rules and alerting rules similarly may have 
their evaluation time happen at different offsets within their 
evaluation interval. This is done for the similar reason of spreading 
out the internal load of rule evaluations across time.


I think it's more accurate to say that *rule groups* are spread over 
their evaluation interval, and rules within the same rule group are evaluated 
sequentially.
 
This is how you can build rules that depend on each other, e.g. a recording 
rule followed by other rules that depend on its output; put them in the 
same rule group.

As for scraping: you *can* change this staleness interval, 
using --query.lookback-delta, but it's strongly not recommended. Using the 
default of 5 mins, you should use a maximum scrape interval of 2 mins so 
that even if you miss one scrape for a random reason, you still have two 
points within the lookback-delta so that the timeseries does not go stale.

There's no good reason to scrape at one hour intervals:
* Prometheus is extremely efficient with its storage compression, 
especially when adjacent data points are equal, so scraping the same value 
every 2 minutes is going to use hardly any more storage than scraping it 
every hour.
* If you're worried about load on the exporter because responding to a 
scrape is slow or expensive, then you should run the exporter every hour 
from a local cronjob, and write its output to a persistent location (e.g. 
to PushGateway or statsd_exporter, or simply write it to a file which can 
be picked up by node_exporter textfile-collector or even a vanilla HTTP 
server).  You can then scrape this as often as you like.

node_exporter textfile-collector exposes an extra metric for the timestamp 
on each file, so you can alert in the case that the file isn't being 
updated.

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/244eb39e-1ded-4161-80cf-b32deb9cd2c7n%40googlegroups.com.


[prometheus-users] Re: Relabeling for proxied hosts

2024-03-28 Thread 'Brian Candler' via Prometheus Users
According to the source in prometheus-common/model/labels.go, these are the 
only declared magic labels:

const (
    // AlertNameLabel is the name of the label containing the an alert's name.
    AlertNameLabel = "alertname"

    // ExportedLabelPrefix is the prefix to prepend to the label names present in
    // exported metrics if a label of the same name is added by the server.
    ExportedLabelPrefix = "exported_"

    // MetricNameLabel is the label name indicating the metric name of a
    // timeseries.
    MetricNameLabel = "__name__"

    // SchemeLabel is the name of the label that holds the scheme on which to
    // scrape a target.
    SchemeLabel = "__scheme__"

    // AddressLabel is the name of the label that holds the address of
    // a scrape target.
    AddressLabel = "__address__"

    // MetricsPathLabel is the name of the label that holds the path on which to
    // scrape a target.
    MetricsPathLabel = "__metrics_path__"

    // ReservedLabelPrefix is a prefix which is not legal in user-supplied
    // label names.
    ReservedLabelPrefix = "__"

    // MetaLabelPrefix is a prefix for labels that provide meta information.
    // Labels with this prefix are used for intermediate label processing and
    // will not be attached to time series.
    MetaLabelPrefix = "__meta_"

    // TmpLabelPrefix is a prefix for temporary labels as part of relabelling.
    // Labels with this prefix are used for intermediate label processing and
    // will not be attached to time series. This is reserved for use in
    // Prometheus configuration files by users.
    TmpLabelPrefix = "__tmp_"

    // ParamLabelPrefix is a prefix for labels that provide URL parameters
    // used to scrape a target.
    ParamLabelPrefix = "__param_"

    // JobLabel is the label name indicating the job from which a timeseries
    // was scraped.
    JobLabel = "job"

    // InstanceLabel is the label name used for the instance label.
    InstanceLabel = "instance"

    // BucketLabel is used for the label that defines the upper bound of a
    // bucket of a histogram ("le" -> "less or equal").
    BucketLabel = "le"

    // QuantileLabel is used for the label that defines the quantile in a
    // summary.
    QuantileLabel = "quantile"
)

Hence I don't think you can do what you want in relabeling; you need 
separate jobs.
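For illustration only, the two-job layout could look roughly like this (job 
names, file paths and the proxy address are placeholders):

scrape_configs:
  - job_name: direct-hosts
    file_sd_configs:
      - files: ['targets/direct.yml']

  - job_name: proxied-hosts
    proxy_url: http://proxy.internal.example:3128
    file_sd_configs:
      - files: ['targets/proxied.yml']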

On Wednesday 27 March 2024 at 21:20:08 UTC Mykola Buhryk wrote:

> Hello, 
>
> I'm looking for a possibility to have one Prometheus job that can include 
> targets that are not directly accessible by Prometheus.
>
> For now, I have 2 separate jobs, one for standard hosts, and a second one 
> for proxied where I need to set the *proxy_url* parameter
>
> So my question is, is there any way to achieve the same result with 
> relabeling within one job?
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/6a126188-03f6-4e60-acb1-01abe4a196c7n%40googlegroups.com.


[prometheus-users] Re: [snmp-exporter] when will --config.expand-environment-variables be available?

2024-03-26 Thread 'Brian Candler' via Prometheus Users
It's in git head, so it's available now if you compile snmp_exporter from 
source. Otherwise you need to wait until the next release. I don't know 
when that will be.

On Tuesday 26 March 2024 at 08:49:45 UTC ohey...@gmail.com wrote:

> Readme on Github shows this option, but it's not available.
>
>
>
> On Tuesday 19 March 2024 at 16:55:45 UTC+1 ohey...@gmail.com wrote:
>
>> Hi,
>>
>>
>> looking for this feature "--config.expand-environment-variables" to get 
>> PCI-DSS compliance configs.
>> Any idea, when it will be available?
>>
>> Thanks a lot and regards,
>>
>> Olaf
>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/be1d2758-3142-496a-a0b4-715da590935en%40googlegroups.com.


Re: [prometheus-users] Re: better way to get notified about (true) single scrape failures?

2024-03-22 Thread 'Brian Candler' via Prometheus Users
Personally I think you're looking at this wrong.

You want to "capture" single scrape failures?  Sure - it's already being 
captured.  Make yourself a dashboard.

But do you really want to be *alerted* on every individual one-time scrape 
failure?  That goes against the whole philosophy of alerting 
,
 
where alerts should be "urgent, important, actionable, and real".  A single 
scrape failure is none of those.

If you want to do further investigation when a host has more than N 
single-scrape failures in 24 hours, sure. But firstly, is that urgent 
enough to warrant an alert? If it is, then you also say you *don't* want to 
be alerted on this when a more important alert has been sent for the same 
host in the same time period.  That's tricky to get right, which is what 
this whole thread is about. Like you say: alertmanager is probably not the 
right tool for that.

How often do you get hosts where:
(1) occasional scrape failures occur; and
(2) there are enough of them to make you investigate further, but not 
enough to trigger any alerts?

If it's "not often" then I wouldn't worry too much it anyway (check a 
dashboard), but in any case you don't want to waste time trying to bend 
existing tooling to work in ways it wasn't intended for. That is: if you 
need suitable tooling, then write it.

It could be as simple as a script doing one query per day, using the same 
logic I just outlined above:
- identify hosts with scrape failures above a particular threshold over the 
last 24 hours
- identify hosts where one or more alerts have been generated over the last 
24 hours (there are metrics for this)
- subtract the second set from the first set
- if the remaining set is non-empty, then send a notification

You can do this in any language of your choice, or even a shell script with 
promtool/curl and jq.
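For example, the two queries such a script might run could be roughly like 
this (the threshold of 5 failures is arbitrary, and it assumes your alerts 
carry an "instance" label):

# instances with more than 5 failed scrapes in the last 24 hours
count_over_time((up == 0)[24h:1m]) > 5

# the same, minus any instance that already had an alert firing in that window
(count_over_time((up == 0)[24h:1m]) > 5)
  unless on (instance)
count_over_time(ALERTS{alertstate="firing"}[24h])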

On Friday 22 March 2024 at 02:31:52 UTC Christoph Anton Mitterer wrote:

>
> I've been looking into possible alternatives, based on the ideas given 
> here.
>
> I) First one completely different approach might be:
> - alert: target-down
>   expr: 'max_over_time( up[1m0s] ) == 0'
>   for: 0s
> and: (
> - alert: single-scrape-failure
> expr: 'min_over_time( up[2m0s] ) == 0'
> for: 1m
> or
> - alert: single-scrape-failure
> expr: 'resets( up[2m0s] ) > 0'
> for: 1m
> or perhaps even
> - alert: single-scrape-failure
> expr: 'changes( up[2m0s] ) >= 2'
> for: 1m
> (which would however behave a bit different, I guess)
> )
>
> plus an inhibit rule, that silences single-scrape-failure when
> target-down fires.
> The for: 1m is needed, so that target-down has a chance to fire
> (and inhibit) before single-scrape-failure does.
>
> I'm not really sure, whether that works in all cases, though,
> especially since I look back much more (and the additional time
> span further back may undesirably trigger again.
>
>
> Using for: > 0 seems generally a bit fragile for my use-case (because I 
> want to capture even single scrape failures, but with for: > 0 I need t to 
> have at least two evaluations to actually trigger, so my evaluation period 
> must be small enough so that it's done >= 2 during the scrape interval.
>
> Also, I guess the scrape intervals and the evaluation intervals are not 
> synced, so when with for: 0s, when I look back e.g. [1m] and assume a 
> certain number of samples in that range, it may be that there are actually 
> more or less.
>
>
> If I forget about the above approach with inhibiting, then I need to 
> consider cases like:
> time>
> - 0 1 0 0 0 0 0 0
> first zero should be a single-scrape-failure, the last 6 however a
> target-down
> - 1 0 0 0 0 0 1 0 0 0 0 0 0
> same here, the first 5 should be a single-scrape-failure, the last 6
> however a target-down
> - 1 0 0 0 0 0 0 1 0 0 0 0 0 0
> here however, both should be target-down
> - 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0
> or
> 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0
> here, 2x target-down, 1x single-scrape-failure
>
>
>
>
> II) Using the original {min,max}_over_time approach:
> - min_over_time(up[1m]) == 0
> tells me, there was at least one missing scrape in the last 1m.
> but that alone would already be the case for the first zero:
> . . . . . 0
> so:
> - for: 1m
> was added (and the [1m] was enlarged)
> but this would still fire with
> 0 0 0 0 0 0 0
> which should however be a target-down
> so:
> - unless max_over_time(up[1m]) == 0
> was added to silence it then
> but that would still fail in e.g. the case when a previous
> target-down runs out:
> 0 0 0 0 0 0 -> target down
> the next is a 1
> 0 0 0 0 0 0 1 -> single-scrape-failure
> and some similar cases,
>
> Plus the usage of for: >0s is - in my special case - IMO fragile.
>
>
>
> III) So in my previous mail I came up with the idea of using:
> - alert: target-down
>   expr: 'max_over_time( up[1m0s] ) == 0'
>   for: 0s
> - alert: single-scrape-failure
>   expr: 'min_over_time(up[15s] offset 1m) == 0
>     unless max_over_time(up[1m0s]) 

[prometheus-users] Re: HTTPS proxy_url

2024-03-20 Thread 'Brian Candler' via Prometheus Users

The error "http2: unsupported scheme" might be affected by this setting:

# Whether to enable HTTP2.
[ enable_http2: <boolean> | default: true ]
Whether that will fix your problem I don't know.

If you've deployed PushProx behind ALB, wouldn't it be accessed simply by 
connecting to its external URL which points to the outside of ALB?  In that 
case you don't need "proxy_url" at all. That's for when you want the HTTP 
request to pass through an external proxy, like Squid.

On Wednesday 20 March 2024 at 17:31:48 UTC Nikolay Buhryk wrote:

> Hello
>
> I would like to use *proxy_url* for my job with hosts that not directly 
> accessible by Prometheus.
>
> PushProx proxy service 
>  is 
> deployed behind AWS ALB with HTTPS, but for some reason, it doesn't work, 
> I'm always getting an error *context deadline exceeded* on the Prometheus 
> side.
>
> However proxy client on the host that I would like to scrape is able to 
> register itself via HTTPS
>
> When I tried to use it with HTTP, setting it in config as *proxy_url: 
> htttp://some_url *to have the possibility to redirect all calls to the 
> HTTPS by ALB it says *http2: unsupported scheme*
>
> Does the *proxy_url* parameter support HTTPS in the URL?
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/61e3ac04-551c-45c6-a075-58bb3bf0286an%40googlegroups.com.


[prometheus-users] Re: Thanos sidecar installation

2024-03-20 Thread 'Brian Candler' via Prometheus Users
Follow the Thanos documentation linked 
from https://github.com/thanos-io/thanos?tab=readme-ov-file#getting-started

In particular: https://thanos.io/tip/thanos/quick-tutorial.md/ shows 
running the sidecar.
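As far as prometheus.yml itself goes, the main thing the sidecar expects (per 
the Thanos docs) is a unique set of external labels on each Prometheus, e.g. 
(label names and values here are just examples):

global:
  external_labels:
    cluster: lab
    replica: prometheus-01

The rest of the sidecar setup is command-line flags on the thanos binary, 
which the quick tutorial above walks through.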

On Wednesday 20 March 2024 at 10:09:45 UTC BHARATH KUMAR wrote:

> Hello All,
>
> I installed node exporter, Prometheus, blackbox exporter and thanos. node 
> exporter and blackbox are working fine and I wrote the jobs in 
> prometheus.yml file
>
> I want to install thanos sidecar; what additional configuration do I need 
> to add in prometheus.yml, and how do I run the thanos sidecar?
>
> I am running everything  as binary. No docker or kubernetes are used.
>
> Thanks in Advance.
>
> Regards,
> Bharath kumar.
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/5c8828d5-bd0c-42b9-98ec-4cbfe7801f95n%40googlegroups.com.


Re: [prometheus-users] blackbox_exporter 0.24.0 and smokeping_prober 0.7.1 - DNS cache "nscd" not working

2024-03-20 Thread 'Brian Candler' via Prometheus Users
> To be able to use DNS caching (without rebuilding), one would need a 
local DNS server with enabled cache on the system which is referenced in 
the resolv.conf.

That's what systemd does: its cache binds to 127.0.0.53, and then you point 
to 127.0.0.53 in /etc/resolv.conf

On Wednesday 20 March 2024 at 06:04:43 UTC Anthony Cairncross wrote:

> Hello there,
>
> I hope I can add some detail to the discussion.
>
> Had a go - no pun intended ;) - at trying to use GODEBUG variables to see 
> what happens.
> When using "export GODEBUG=netdns=cgo+1" and running the precompiled 
> blackbox_exporter like "blackbox_exporter-0.24.0.linux-amd64" you would get 
> something like the following in the output:
>
> go package net: built with netgo build tag; using Go's resolver
>
> Looking at net module from golang at
> https://github.com/golang/go/blob/go1.20.4/src/net/conf.go#L61
> or the explanation in newer versions at
> https://github.com/golang/go/blob/master/src/net/conf.go#L18
>
> It shows that if the app was built with the "netgo" build tag, the 
> go-resolver would always be used and, correspondingly, the use of "netcgo" 
> would be prohibited. As Ben stated, glibc is not used.
> So trying to get it to use glibc functions with "GODEBUG=netdns=cgo" won't 
> work here.
>
> Having a quick look at the binary, it seems, that netgo build tag was 
> applied:
>
> $ strings blackbox_exporter-0.24.0.linux-amd64/blackbox_exporter | egrep 
> '\-tags.*net.*'
> build   -tags=netgo
> build   -tags=netgo
>
> Or as per another var:
>
> $ strings blackbox_exporter-0.24.0.linux-amd64/blackbox_exporter | grep 
> CGO_ENABLED
> build   CGO_ENABLED=0
> build   CGO_ENABLED=0
>
> So go would usually look in /etc/nsswitch.conf, /etc/hosts and then 
> directly call the DNS server from /etc/resolv.conf if there is no local 
> hosts entry.
> To be able to use DNS caching (without rebuilding), one would need a local 
> DNS server with enabled cache on the system which is referenced in the 
> resolv.conf. Like with CoreDNS, bind, dnsmasq, unbound, etc.
> I tried to find something about how go uses nsswitch.conf to get it to use 
> nscd, but nothing helped so far.
>
> Alexander Wilke schrieb am Samstag, 16. März 2024 um 00:05:21 UTC+1:
>
>> Thanks for the hint. I checked the Go DNS feature and found these hints:
>>
>>
>>1. export GODEBUG=netdns=go # force pure Go resolver 
>>2. export GODEBUG=netdns=cgo # force cgo resolver 
>>
>>
>>
>> I tried to set the cgo env variable and restarted services. however 
>> systemd-resolved and nscd seem not to be able to cache it.
>> May have to wait for a colleague who is more experienced in linux than 
>> me. perhaps we can figure it out why it's not working with the new 
>> behaviour.
>>
>>
>>
>>
>> Ben Kochie schrieb am Freitag, 15. März 2024 um 17:52:09 UTC+1:
>>
>>> All of the Prometheus components you're talking about are 
>>> statically compiled Go binaries. These use Go's native DNS resolution. It 
>>> does not use glibc. So maybe looking for solutions related to Golang and 
>>> nscd would help. I've not looked into this myself.
>>>
>>> But on the subject of node local DNS caches. I can highly 
>>> recommend CoreDNS's cache plugin[0]. It even has built-in Prometheus 
>>> support so you can find how good your cache is working. The CoreDNS cache 
>>> specifically supports prefetching, which is important for making sure 
>>> there's no gap or latency in updating the cache when the TTL is close to 
>>> expiring.
>>>
>>> [0]: https://coredns.io/plugins/cache/
>>> [1]: https://coredns.io/plugins/metrics/
>>>
>>> On Fri, Mar 15, 2024 at 3:41 PM Alexander Wilke  
>>> wrote:
>>>
 Hello,

 I am running blackbox_exporter and smokeping_prober on a RHEL8 
 environment. Unfortunately, with our config we have around 4-5 million DNS 
 queries per 24hrs.

 The reason for that is that we do very frequent tcp queries to various 
 destinations which results in many DNS requests.

 To reduce the DNS load on the DNS server we tried to implement "nscd" 
 as a DNS cache.

 However, running strace we notice that the blackbox_exporter is checking 
 resolv.conf, then nsswitch.conf, then /etc/hosts, and then sends the query 
 directly to the DNS server, not using the DNS cache. That's for every target 
 of blackbox_exporter.

 For smokeping_prober I am aware that it resolves DNS only at restart 
 and we notice the same. All requests are directly send to DNS server but 
 not to the cache.

 anyone using nscd on RHEL8 to cache blackbox_exporter and/or 
 smokeping_prober?

 If not has anyone a working, simple configuration with unbound for this 
 specific scenario?

 Is blackbox and smokeping using glibc methods to resolve DNS or 
 something else?

 Thank you very much!

 -- 
 You received this message because you are subscribed to the Google 
 Groups "Prometheus Users" 

[prometheus-users] Re: Get prometheus snapshot for specific timeperiod.

2024-03-17 Thread 'Brian Candler' via Prometheus Users
> To capture data for a specific duration, can you provide the URL query 
that takes time parameters, such as start and finish times. 

No I can't, because there is no feature for that - as the API documentation 
makes clear.

> Snapshot creates a snapshot of* all current data *into 
snapshots/<datetime>-<rand> under the TSDB's data directory

The snapshot for today includes all the earlier data, of course. So I 
presume you could take a snapshot then just delete all the blocks outside 
the time period of interest - but I've never tested it. You should be able 
to find the time ranges from the meta.json files:

root@prometheus:/var/lib/prometheus/data# cat 
01HS1YQS341WWGFYZSN0N8GPR1/meta.json
{
"ulid": "01HS1YQS341WWGFYZSN0N8GPR1",
"minTime": 170994240,
"maxTime": 171052560,
...

Because these blocks are rolled up, they may cover a fairly wide time 
interval. For example, the above block covers 583200000 milliseconds, which 
is 6.75 days.

# date --date @1709942400
Sat Mar  9 00:00:00 GMT 2024
# date --date @1710525600
Fri Mar 15 18:00:00 GMT 2024

However, you should note that you won't save much disk space by trimming 
snapshots in this way (if that's what you're trying to do). That's because 
the snapshots are hardlinks to shared, immutable files. The snapshots will 
only take extra space once the main tsdb has started to expire blocks, and 
the snapshot contains blocks older than that.

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/d2e4b4ac-3852-4bfb-bbb2-d1ca61773578n%40googlegroups.com.


[prometheus-users] Re: Get prometheus snapshot for specific timeperiod.

2024-03-17 Thread 'Brian Candler' via Prometheus Users
I'm not sure what you mean by "preserve prometheus snapshot" - AFAIK the 
snapshot remains forever until you delete it.

If you mean you want to delete snapshots when they reach a particular age, 
then you can do that yourself from a cronjob. e.g. for 90 days retention:
find /snapshots -mtime +90 -type f -delete

On Sunday 17 March 2024 at 14:52:49 UTC abhishek ellendula wrote:

> Hi All
>
> Below is the way how we generally create snapshot.
>
> Snapshot 
> 
>
> Snapshot creates a snapshot of all current data into 
> snapshots/<datetime>-<rand> under the TSDB's data directory and returns 
> the directory as response. It will optionally skip snapshotting data that 
> is only present in the head block, and which has not yet been compacted to 
> disk.
> POST /api/v1/admin/tsdb/snapshot PUT /api/v1/admin/tsdb/snapshot 
>
> URL query parameters:
>
>- skip_head=<bool>: Skip data present in the head block. Optional.
>
> $ curl -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot
> { "status": "success", "data": { "name": "20171210T211224Z-2be650b6d019eb54" } }
>
> The snapshot now exists at 
> <data-dir>/snapshots/20171210T211224Z-2be650b6d019eb54
>
> But is there a any way or method to preserve prometheus snapshot for 
> specific time-period.
>
> Thanks 
> Abhishek
>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/419a7604-38b2-49ae-af0c-4390e5d1n%40googlegroups.com.


[prometheus-users] Re: Best practive: "job_name in prometheus agent? Same job_name allowed ?

2024-03-15 Thread 'Brian Candler' via Prometheus Users
> What would you recommend in a situation with several hundreds or 
thousands of servers or systems within a kubernetes cluster which should 
have the node_exporter installed.

I would just scrape them normally, using service discovery to identify the 
nodes to be scraped.  Implicitly you're saying you can't, or don't want to, 
do that.

> then remote_writes the results to a central prometheus server or a 
loadbalancer which distributes to different prometheus servers.

Definitely don't remote_write to a load balancer; it will be 
non-deterministic which node receives each data point. If you want to load 
share, then statically configure some nodes to point to different 
prometheus instances. If one server goes down then fix it, and remote_write 
should buffer in the mean time.

If you can't stand the idea of losing access to metrics for a short period 
of time, then you could remote_write to multiple servers, and use promxy to 
merge them when querying. But really I think you're adding a lot of cost 
and complexity for little gain.
 
> However I think I will have a problem because if I use "127.0.0.1:9100" 
as target to scrape then all instances are equal.

The instance label does not necessarily have to be the same as the 
"__address__" that you scrape. If you've set the instance label explicitly, 
then prometheus won't change it. But you would have to ensure that each 
host knows its unique name and puts it into the instance label.

> Is there any possibility to use a variable in the scrape_config which 
reflects any environment variable from linux system or any other mechanism 
to make this instance unique?

I've never had to do this, but you could 
try --enable-feature=expand-external-labels
See 
https://prometheus.io/docs/prometheus/latest/feature_flags/#expand-environment-variables-in-external-labels
Then you could leave instance="127.0.0.1:9100" but add another label which 
identifies the node.
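Untested sketch of what that might look like on each agent, assuming you 
start it with --enable-feature=expand-external-labels and that every host has 
a NODE_NAME environment variable set (both of those are assumptions on my 
part):

global:
  external_labels:
    node: ${NODE_NAME}

scrape_configs:
  - job_name: node_exporter
    static_configs:
      - targets: ['127.0.0.1:9100']

remote_write:
  - url: http://central-prometheus.example:9090/api/v1/write   # placeholder URL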

On Thursday 14 March 2024 at 21:59:52 UTC Alexander Wilke wrote:

> Thanks for your response.
>
> What would you recommend in a situation with several hundreds or 
> thousands of servers or systems within a kubernetes cluster which should 
> have the node_exporter installed.
> my idea was to install the node_exporter + prometheus agent. agent scrapes 
> local node_exporter and then remote_writes the results to a central 
> prometheus server or a loadbalancer which distributes to different 
> prometheus servers.
> my idea was to use the same config for all node_exporter + prometheus 
> agents. For that reason they all have the same job name which would be ok.
>
> However I think I will have a problem because if I use "127.0.0.1:9100" 
> as target to scrape then all instances are equal.
>
> Is there any possibility to use a variable in the scrape_config which 
> reflects any environment variable from linux system or any other mechanism 
> to make this instance unique?
>
>
> Brian Candler schrieb am Donnerstag, 14. März 2024 um 13:04:07 UTC+1:
>
>> As long as all the time series have distinct label sets (in particular, 
>> different "instance" labels), and you're not mixing scraping with 
>> remote-writing for the same targets, then I don't see any problem with all 
>> the agents using the same "job" label when remote-writing.
>>
>> On Tuesday 12 March 2024 at 22:30:22 UTC Alexander Wilke wrote:
>>
>>> At the moment I am running the job with name
>>> "node_exporter" which has 20 different targets. (instances)
>>> With this configuration there should not be any conflict.
>>>
>>> my idea is to install the prometheus agent on the nodes itself.
>>> technically it looks like it works if I use the same job_name on the 
>>> agent and central prometheus as long as the targets/instances are different.
>>>
>>> In general I avoid conflicting job_names but in this situation it may be 
>>> ok from my point of view.
>>>
>>> what do you think, recommend in this specific scenario ?
>>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/dfc394b9-5f81-4173-b832-1c7d06702f28n%40googlegroups.com.


[prometheus-users] Re: Best practive: "job_name in prometheus agent? Same job_name allowed ?

2024-03-14 Thread 'Brian Candler' via Prometheus Users
As long as all the time series have distinct label sets (in particular, 
different "instance" labels), and you're not mixing scraping with 
remote-writing for the same targets, then I don't see any problem with all 
the agents using the same "job" label when remote-writing.

On Tuesday 12 March 2024 at 22:30:22 UTC Alexander Wilke wrote:

> At the moment I am running the job with name
> "node_exporter" which has 20 different targets. (instances)
> With this configuration there should not be any conflict.
>
> my idea is to install the prometheus agent on the nodes itself.
> technically it looks like it works if I use the same job_name on the agent 
> and central prometheus as long as the targets/instances are different.
>
> In general I avoid conflicting job_names but in this situation it may be 
> ok from my point of view.
>
> what do you think, recommend in this specific scenario ?
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/24268b8d-0313-44d5-8227-3be95eaacde7n%40googlegroups.com.


[prometheus-users] Re: disable all alerts for a job

2024-03-12 Thread 'Brian Candler' via Prometheus Users
option 1: filter them out in alertmanager, with an extra routing rule that 
matches on the 'job' label and delivers to a null receiver.
option 1b: create a long-lived silence in alertmanager that matches on the 
'job' label

option 2: drop them in alert_relabel_configs 


But it may be clearer in the long term to modify the alerting expressions: 
if you change foo to foo{job!="bar"}, then the intention is obvious.
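Rough, untested sketches of options 1 and 2, assuming the job you want to 
silence is called "bar":

# Option 1: alertmanager.yml fragment routing the job to a receiver with no
# notification integrations
route:
  routes:
    - matchers:
        - job="bar"
      receiver: "null"
receivers:
  - name: "null"

# Option 2: prometheus.yml fragment dropping the alerts before they are sent
alerting:
  alert_relabel_configs:
    - source_labels: [job]
      regex: bar
      action: drop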

On Tuesday 12 March 2024 at 12:51:22 UTC mel wrote:

> How do I disable all alerts for a job without having to modify each alert 
> rule and excluding the job?
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/5935cc12-9f7a-49ee-aa7b-2d88ddbe1492n%40googlegroups.com.


[prometheus-users] Re: drop all some metrics based on regex

2024-03-12 Thread 'Brian Candler' via Prometheus Users
Thanks. I always forget that labels starting with __ are automatically 
dropped after target relabelling, but not metric relabelling.

On Monday 11 March 2024 at 20:49:42 UTC Ben Kochie wrote:

> The other way you can do this is with the "__tmp_keep" pattern. This is 
> where you positively tag the metrics you want to keep, and then use a drop 
> that matches when that temporary label doesn't exist.
>
> metric_relabel_configs:
>   - source_labels: [__name__, name]
> regex: 'node_systemd_.*;(ssh|apache).*'
> target_label: __tmp_keep
> replacement: yes
>   - source_labels: [__tmp_keep,__name__]
> regex: ';node_systemd_.*'
> action: drop
>   - regex: __tmp_keep
> action: labeldrop
>
> On Monday, March 11, 2024 at 9:43:27 PM UTC+1 mel wrote:
>
>> You are absolutely correct but I don't have access to a lot of the 
>> servers so I am trying to drop them on the prometheus side
>>
>> On Monday, March 11, 2024 at 1:39:18 PM UTC-7 Ben Kochie wrote:
>>
>>> relabel actions are exclusive. Drop means keep everything but X. Keep 
>>> means drop everything but X.
>>>
>>> For your exact problem, there is already a node_exporter flag to handle 
>>> this.
>>>
>>> ./node_exporter --collector.systemd.unit-include="(ssh|apache)"
>>>
>>> This will also be more efficient because it will only gather data 
>>> about those two units.
>>>
>>> On Monday, March 11, 2024 at 9:27:30 PM UTC+1 mel wrote:
>>>
 Hello I am using node_exporter and I am trying to drop all 
 node_systemd_unit_state metrics except for a handful of services like 
 (e.g.,) ssh and apache. How would I do this? I came up with the following, 
 but I don't think this is correct because it will drop other metrics as 
 well (metrics that are not related to systemd service)

 metric_relabel_configs:
 - source_labels: [__name__, name]
   regex: 'node_systemd_unit_state;(ssh|apache).*'
   action: keep

 How do I drop all service metrics except for ssh and apache service?

>>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/0778c04a-56e4-4090-a918-50cc7db0f7ean%40googlegroups.com.


[prometheus-users] Re: drop all some metrics based on regex

2024-03-11 Thread 'Brian Candler' via Prometheus Users
You can use temporary variables. Something like this (untested):

metric_relabel_configs:
- source_labels: [__name__, name]
  regex: 'node_systemd_unit_state;(ssh|apache).*'
  target_label: __tmp_keep
  replacement: y
- source_labels: [__name__, __tmp_keep]
  regex: 'node_systemd_unit_state;'
  action: drop

On Monday 11 March 2024 at 20:43:27 UTC mel wrote:

> You are absolutely correct but I don't have access to a lot of the servers 
> so I am trying to drop them on the prometheus side
>
> On Monday, March 11, 2024 at 1:39:18 PM UTC-7 Ben Kochie wrote:
>
>> relabel actions are exclusive. Drop means keep everything but X. Keep 
>> means drop everything but X.
>>
>> For your exact problem, there is already a node_exporter flag to handle 
>> this.
>>
>> ./node_exporter --collector.systemd.unit-include="(ssh|apache)"
>>
>> This will also be more efficient because it will only gather data 
>> about those two units.
>>
>> On Monday, March 11, 2024 at 9:27:30 PM UTC+1 mel wrote:
>>
>>> Hello I am using node_exporter and I am trying to drop all 
>>> node_systemd_unit_state metrics except for a handful of services like 
>>> (e.g.,) ssh and apache. How would I do this? I came up with the following, 
>>> but I don't think this is correct because it will drop other metrics as 
>>> well (metrics that are not related to systemd service)
>>>
>>> metric_relabel_configs:
>>> - source_labels: [__name__, name]
>>>   regex: 'node_systemd_unit_state;(ssh|apache).*'
>>>   action: keep
>>>
>>> How do I drop all service metrics except for ssh and apache service?
>>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/696c1928-c98d-4841-88d7-5772b9497a80n%40googlegroups.com.


[prometheus-users] Re: PromQL - Check for specific value in the past

2024-03-06 Thread 'Brian Candler' via Prometheus Users
You can use a subquery which will sample the data, something like this:

bgp_state_info != 3 and present_over_time((bgp_state_info == 3)[60d:1h])

You can reduce the sampling interval from 1h to reduce the risk of missing 
times when BGP was up, but then the query becomes increasingly expensive.

It would be nice if PromQL allowed you to do filtering and arithmetic 
expressions between range vectors and scalars, e.g. 
present_over_time(bgp_state_info[60d] == 3), but it doesn't.

Another approach is to use a recording rule, where you can combine the 
current value with a new value, e.g.

- record: bgp_seen
  expr: bgp_seen or bgp_state_info == 3

Temporarily set the expression to the subquery to prime it from historical 
data.  With a bit of tweaking you could make the value of this expression 
be the timestamp when bgp_state_info == 3 was first seen.
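For instance, something along these lines (untested) would record the 
evaluation timestamp at which bgp_state_info == 3 was first seen, instead of 
a constant 1:

- record: bgp_seen
  expr: bgp_seen or timestamp(bgp_state_info == 3)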

The alert then becomes:

bgp_state_info != 3 and bgp_seen

On Wednesday 6 March 2024 at 11:48:23 UTC fiala...@gmail.com wrote:

> Hi,
>
> I have a metric bgp_state_info. Ok state is when metric has value = 3, 
> other values (from 1 to 7) are considered as error.
>
> I want to fire alert only for metrics that has value 3 at least only once. 
> In other words I dont' want to fire alert for bgp that never worked.
>
> Is it possible via promQL to do this? I have data retention 60 days and 
> I'm aware of this limitations.
>
> Thank you.
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/9cc1f2fd-cdb8-416f-9a6a-83376f7071b5n%40googlegroups.com.


[prometheus-users] Re: Powershell POST to pushgateway

2024-03-04 Thread 'Brian Candler' via Prometheus Users
*got "101\r"*

suggests that you're sending a Windows newline (\r\n) instead of a 
Unix newline (\n), and that pushgateway isn't happy with that.

I don't know if Windows supports some mechanism for inserting explicit 
control sequences, such as
-Body 'metricname1 101\n'
or
echo -ne 'metricname1 101\n'

On Tuesday 5 March 2024 at 05:29:05 UTC Leen Tux wrote:

> I got this error :
> Invoke-WebRequest -Uri '
> http://192.168.1.111:9091/metrics/job/jobname1/instance/instancename1' 
> -Method Post -Body 'metricname1 101' -ContentType 'application/octet-stream'
> *Invoke-WebRequest : text format parsing error in line 1: unexpected end 
> of input stream*
>
> I am facing the same issue here:
>
> https://stackoverflow.com/questions/68818211/send-metrics-with-pushgateway-prometheus-using-windows-console
> But the solution mentioned did not work for me. 
>
> echo "metricname1 101
> " | Invoke-WebRequest -Uri http://192.168.1.111:9091/metrics/job/jobname1 
> -Method POST 
> *Invoke-WebRequest : text format parsing error in line 1: expected float 
> as value,  got "101\r"*
>
> On Monday, March 4, 2024 at 4:21:22 PM UTC+3 Brian Candler wrote:
>
>> https://superuser.com/questions/344927/powershell-equivalent-of-curl
>>
>> On Monday 4 March 2024 at 07:10:18 UTC Leen Tux wrote:
>>
>>> Hi
>>> What is the powershell command equivalent to:
>>> *$ echo 'metricname1 101' | curl --data-binary @- 
>>> http://localhost:9091/metrics/job/jobname1/instance/instancename1 
>>> *
>>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/848609dd-0db9-48a7-b796-6d441371d98cn%40googlegroups.com.


[prometheus-users] Re: Powershell POST to pushgateway

2024-03-04 Thread 'Brian Candler' via Prometheus Users
https://superuser.com/questions/344927/powershell-equivalent-of-curl

On Monday 4 March 2024 at 07:10:18 UTC Leen Tux wrote:

> Hi
> What is the powershell command equivalent to:
> *$ echo 'metricname1 101' | curl --data-binary @- 
> http://localhost:9091/metrics/job/jobname1/instance/instancename1 
> *
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/8f48629a-2ab0-4e47-8bd0-7f56137f2615n%40googlegroups.com.


[prometheus-users] Re: Best practices to using "xxxxx_info" gauge metric

2024-03-04 Thread 'Brian Candler' via Prometheus Users
Yes, it's good practice and you can read about it here:

https://www.robustperception.io/how-to-have-labels-for-machine-roles
https://www.robustperception.io/exposing-the-software-version-to-prometheus

You may also find these relevant:
https://www.robustperception.io/left-joins-in-promql
https://prometheus.io/docs/prometheus/latest/querying/operators/#many-to-one-and-one-to-many-vector-matches
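As a quick illustration of the join pattern those articles describe (the 
metric and label names here are made up):

# attach the "version" label from an info metric to another metric
my_app_requests_total
  * on (instance, job) group_left (version)
my_app_build_info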

On Monday 4 March 2024 at 07:10:18 UTC Rajat Jindal wrote:

> hello community folks,
>
> what do people think about publishing *_info* metrics with static labels.
>
> e.g. xx_info{"foo"=bar, "foo2"=bar2} with constant value of 1 as gauge 
> metrics (where foo and foo2 are some static metadata associated with a 
> single instance).
>
> is there any documented best practices around having (or not having) such 
> metrics published?
>
> Thank you
> Rajat Jindal
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/0b8ddef7-6ffb-405b-bb41-5b6aa77be65dn%40googlegroups.com.


[prometheus-users] Re: Hi All

2024-03-01 Thread 'Brian Candler' via Prometheus Users
Sorry, but you have the wrong group. This is for Prometheus, not Grafana.

Questions about Grafana should be addressed to the Grafana community:
https://community.grafana.com/

On Friday 1 March 2024 at 21:49:11 UTC+7 h0ksa wrote:

>
> Hello everyone! I have a Grafana dashboard with two panels and a variable 
> representing projects. By default, I want Panel 1 to be displayed, and when 
> the variable is set to "ALL," Panel 1 should continue to be shown. If the 
> variable has any other value, I want Panel 2 to be displayed. Is it 
> possible to achieve this in Grafana?
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/2930cead-787b-4f60-8619-adcee14f9175n%40googlegroups.com.


[prometheus-users] Re: User level disk usage monitoring and notification - with prometheus and alertmanager

2024-02-29 Thread 'Brian Candler' via Prometheus Users
On Thursday 29 February 2024 at 18:46:05 UTC+7 Puneet Singh wrote:

Does the default node exporter have the ability to report the disk usage at 
user level in my context - by extending it via some flag? (I came across the 
text collector and plan to explore that.)
 Or would writing a custom exporter be the optimal workaround?


I don't know what the "optimal" solution would be: you haven't said which 
filesystem you're using, and whether you're actually enforcing quotas at 
the filesystem level - in which case the filesystem will be keeping track 
of them, and you can just ask the filesystem for the current quota for each 
user.

If not, then periodically running du -sk /home/* sounds reasonable as long 
as it's not done too often. And yes, if you reformat those into prometheus 
metrics you can just drop them into a file for the textfile collector to 
pick up. Prometheus itself will add the "instance" label, so you only need 
to add "user" and "mountpoint" attributes (the latter would be statically 
"/home")

Don't use du -sh because you'll get metrics like "25M" or "304K" and it 
will be up to you to normalise them.
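For example, a cron job could write something like the following into the 
textfile-collector directory (the file path and numbers are made up, and the 
values are bytes):

# /var/lib/node_exporter/textfile/home_usage.prom
user_disk_usage_bytes{user="fred",mountpoint="/home"} 1073741824
user_disk_usage_bytes{user="wilma",mountpoint="/home"} 52428800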
 
For alerting, the simple template expansion you have is almost certainly 
not going to work; it's almost certainly not usern...@gmail.com. You could 
however make a static set of metrics mapping username to email:

{username="fred",email="f...@flintstone.com"} 1
{username="wilma",email="w...@rubble.com"} 1

Then scrape this, and do another join in your promQL to pick up the email 
label.  This is similar to the approach from
https://www.robustperception.io/using-time-series-as-alert-thresholds
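A rough sketch of that join, assuming the mapping is scraped as 
user_email_info{username="...",email="..."} 1 and that your usage metric 
carries the same "username" label (adjust the label name if you used "user" 
instead):

user_disk_usage_bytes
  * on (username) group_left (email)
user_email_info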

Once you have the E-mail address as a label, then see
https://www.robustperception.io/using-labels-to-direct-email-notifications/

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/df57003b-7ce8-46f7-a507-5b2e59e44b91n%40googlegroups.com.


[prometheus-users] Re: consul discovery

2024-02-29 Thread 'Brian Candler' via Prometheus Users
Consul is a Hashicorp product. How you configure and manage consul is not 
really a topic for a Prometheus mailing list. 

See https://www.consul.io/community for a list of Consul community 
resources.

On Thursday 29 February 2024 at 00:33:07 UTC+7 sri L wrote:

> Hi all,
>
> I am trying to register onprem multiple nodes in consul DB using single 
> json file but while registering through API, getting syntax error (*Request 
> decode failed: json: cannot unmarshal array into Go value of type 
> structs.RegisterRequest*)
>
>
> I am using below curl command for registering nodes
>
> curl --request PUT --data @nodes.json 
> http://x.x.x.x:8500/v1/catalog/register
>
> Below is the node json file I am using:
>
> [
> {
>   "Node": "ABC",
>   "Address": "ABC.net",
>   "NodeMeta": {
> "external-node": "true",
> "external-probe": "true"
>   },
>   "Service": {
> "ID": "node_exporter",
> "Service": "monitoring",
> "Tags": ["node_exporter"],
> "Port": 9100
>   }
> },
> {
>   "Node": "XYZ",
>   "Address": "XYZ.net",
>   "NodeMeta": {
> "external-node": "true",
> "external-probe": "true"
>   },
>   "Service": {
> "ID": "node_exporter",
> "Service": "monitoring",
> "Tags": ["node_exporter"],
> "Port": 9100
>   }
> }
> ]
>
> I am not installing consul agent in those nodes, just trying to register 
> those external nodes in consul DB and from their Prometheus has to discover 
> those nodes.
>
> Can anyone please suggest on adding multiple nodes with one json file.
>
> Thanks
>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/4d8f6e43-58d4-450a-889a-40d8f22e3d16n%40googlegroups.com.


[prometheus-users] Re: User level disk usage monitoring and notification - with prometheus and alertmanager

2024-02-29 Thread 'Brian Candler' via Prometheus Users
> I don't think  *condition1* and *condition2* will work as labels and 
label values returned by condition1 and condition2 are different.

condition1 if on (instance,mountpoint) group_left(username) condition2

This assumes that both expressions have "instance" and "mountpoint" 
labels; these are the only ones considered when matching. It also assumes 
there is a many-to-1 relationship from the left-hand side (users) to right 
hand side (filesystem), and that there is a label "username" that you would 
like carried forward from the LHS into the result.

> So i need 3 rules  - 1 each for server1,server2 and server3

I don't think so. The vector of results can include values for each 
(user,filesystem,instance) on the LHS, and each (filesystem,instance) on 
the RHS, and alert separately for every filesystem that reaches 90%.

On Wednesday 28 February 2024 at 22:55:11 UTC+7 Puneet Singh wrote:

> Hi All, 
> I have a monitoring requirement related to the user level disk usage and 
> alerting. And i am wondering if prometheus is the correct tool to handle 
> this requirement or,
>   a custom python script (whish uses os, subprocess, smtp module)  to 
> handle monitoring and alerting will be optimial solution in this context?
>
>
> Here is the problem description - 
> In our setup we have 3 servers we have  a single mount point "/", and each 
> user's directory, such as "/home/user1", "/home/user2", and so forth, 
> resides within this mount point.
> [image: Untitled11.png]
>   We enforce disk quotas for individual users, and our goal is to monitor 
> each user's disk usage and trigger alerts to the top 10 users when overall 
> quota exceeds 90%.
>
>
> Challenges:
> 1. Afaik, prometheus monitors the overall storage status and the 
> mountpoint information, so individual user's disk consumption is not being  
> tracked by Prometheus. Example - 
> [image: Untitled12.png]
>
> a) Do i need to write custom exporter here which uses du -sh to figure out 
> the disk usage  ? where 
> user_disk_usage_bytes{*username="ravi"*} 39
>
> b) or node exporter can do this?
>
>
>
>
> after data collection, i need to deal with alerting rule 
> 2. Here is the alert condition on the custom exporter-
>
> *condition1:* can help determine the users who have high usage
> topk*( * user_disk_usage_bytes*  /  * *scalar(*
> node_filesystem_size_bytes{instance="server1:9100",mountpoint='/'}*) ) *
>
> *condition2:*  this can help determine if the usage has reached 90% 
> (available space less than 10%)
>  (node_filesystem_avail_bytes{instance="server1:9100",mountpoint='/'}  
> /   node_filesystem_size_bytes{ instance="server1:9100",mountpoint='/'  }  
>   ) < 0.1
>
> I don't think  *condition1* and *condition2* will work as labels and 
> label values returned by condition1 and condition2 are different.
>
> Is there a way to achieve this with PromQL ?
>
> Now, assuming that i am able to get a list of users if system utilization 
> is 90% as - 
> {username="ravi"}  80
> {username="user1"}  90
> {username="user2"}  70
> {username="user3"}  80
> {username="user4"}  90
>
> the alerting rule will be 
> groups:
> - name: example
>   rules:
>   - alert: Storage space is low on server1
> expr: *condition1* and *condition2*
> for: 10m
> labels: alertname: "Server1's Storage space is running low, Please 
> cleanup the disk space - {{ $labels.username }}" annotations:
>   summary: "you are using {{ $value }}% space on the / space.please 
> cleanup."
> So i need 3 rules  - 1 each for server1,server2 and server3
>
> 3.  Now alert manager is responsible to sending out the alerts 
> And to send the alert , i think this should be the configuration in 
> current context - 
> [image: Untitled14.png]
> as i have already included username in the alert name , and by default 
> grouping of alert happens by alertname so i think with this setting 1:1 
> email should be sent to each user.
>
>
>
> Apologies for the lengthy post , but I have tried expressing the flow to 
> solve this problem based on my understanding of Prometheus so far.
>
> I would greatly appreciate any insights, recommendations, or best 
> practices i can get can offer in achieving dynamic user disk usage 
> monitoring with Prometheus and Alert Manager.
>
> Thank you in advance .
>
> Best regards,
> Puneet
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/049b709b-8a09-4a49-9a71-f29a24314f30n%40googlegroups.com.


[prometheus-users] Re: Integrating Prometheus with Splunk and ServiceNow for automated ticket creation.

2024-02-26 Thread 'Brian Candler' via Prometheus Users
> Invalid authorization

Seems you're not authorizing to Splunk properly. Can you point to their 
documentation which says how you need to authenticate to their API?

I note you're using http rather than https, so HTTP basic auth is probably 
not allowed (it's insecure, it sends the username and password in cleartext 
along with every request). But even with https, they may require you to 
authenticate in some other way.

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/2da167c3-4841-49a4-8ccd-dfc7b8a48bb8n%40googlegroups.com.


[prometheus-users] Re: Metrics from PUSH Consumer - Relabeled Metrics? Check "Up" state?

2024-02-26 Thread 'Brian Candler' via Prometheus Users
> I am still looking for a solution to identify if a device which uses 
"PUSH" method is not sending data anmore for e.g. 10 minutes.

Push an additional metric which is "last push time", and check when that 
value is more than 10 minutes earlier than the current time.

If you already have a metric like "push_device_uptime", which I presume is 
monotonically increasing, then you can check for when this stops increasing:

expr: push_device_uptime <= push_device_uptime offset 10m
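Wrapped into a rule, that could look something like this (the rule and alert 
names and the 10-minute window are just illustrative):

groups:
  - name: push-freshness
    rules:
      - alert: PushDeviceStale
        expr: push_device_uptime <= (push_device_uptime offset 10m)
        annotations:
          summary: "No new pushes from {{ $labels.instance }} in the last 10 minutes"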

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/0e3a8a4f-35ab-468c-9161-9fd81eff9a5dn%40googlegroups.com.


[prometheus-users] Re: PromQL: understanding the and operator

2024-02-23 Thread 'Brian Candler' via Prometheus Users
On Saturday 24 February 2024 at 01:00:57 UTC+7 Alexander Wilke wrote:

Another possibility could be

QueryA + queryB == 0  #both down


No, that doesn't work, for exactly the same reason that "QueryA and QueryB" 
doesn't work.

With a binary expression like "foo + bar", each side is a vector, and each 
element of the vector has a different label set.

The result only combines values from the left and right hand sides with 
*exactly* matching label sets.  Therefore, an element in the LHS with 
{HOSTNAME="server1"} does not match an element in the RHS with 
{hostname="server2"}.  Elements in the LHS which don't match any element in 
the RHS (and vice versa) are dropped.

But you can modify that logic, using for example "foo + ignoring(HOSTNAME) 
bar"

In this case, the HOSTNAME label is ignored when matching the LHS and RHS. 
But if an element on the LHS then matches multiple on the RHS, or vice 
versa, there will be an error.  N:1 or 1:N matches can be made to work by 
adding group_left or group_right clauses. If multiple elements on LHS match 
multiple elements on the RHS, then that doesn't work.

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/c780eea5-d842-404b-a04c-02558163eafbn%40googlegroups.com.


[prometheus-users] Re: PromQL: understanding the and operator

2024-02-23 Thread 'Brian Candler' via Prometheus Users
On Friday 23 February 2024 at 02:28:52 UTC+7 Puneet Singh wrote:

Now I tried to find the time duration where both these services were 
simultaneously down / 0 on both server1 and server2 :
(sum without (USER) (
*go_service_status{HOSTNAME="server1",SERVER_CATEGORY="db1",SERVICETYPE="grade1"}*)
 
< 1) and (sum without (USER) (
*go_service_status{HOSTNAME="server2",SERVER_CATEGORY="db1",SERVICETYPE="grade1"}*)
 
< 1)


I was expecting a graph similar to the once for server2 , but i got :
[image: Untitled.png]

I think i need to ignore the HOSTNAME label , but unable to figure out the 
way to ignore the HOSTNAME label in combination with sum without clause.


You've got exactly the right idea.  It's not the "sum without" that needs 
modifying, it's the "and"

() and ignoring (hostname) ()

 See: 
https://prometheus.io/docs/prometheus/latest/querying/operators/#vector-matching-keywords

In this particular example, there are other ways to do this which might end 
up with a more compact expression. You could have an outer sum over the 
inner sums, but then I think the whole expression simplifies to just

sum without (USER) (
*go_service_status{HOSTNAME=~"server1|server2",SERVER_CATEGORY="db1",SERVICETYPE="grade1"}*)
 
< 1

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/3cf69feb-735f-4dd0-b8a5-1d74008f35ccn%40googlegroups.com.


[prometheus-users] Re: snmp_exporter - Generator.yml configuration

2024-02-21 Thread 'Brian Candler' via Prometheus Users
You need to collect the metric "ifHCInOctets".  The module "if_mib" in the 
supplied sample generator.yml does this.
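
For reference, a minimal generator.yml sketch that walks this object might
look roughly as follows (the lookups shown are illustrative; the shipped
if_mib module walks considerably more than this):

  modules:
    if_mib:
      walk:
        - ifHCInOctets
        - ifHCOutOctets
      lookups:
        - source_indexes: [ifIndex]
          lookup: ifName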

On Wednesday 21 February 2024 at 00:39:19 UTC+7 Mitchell Laframboise wrote:

> Hi there.  What do I have to include in my generator.yml configuration to 
> scrape data for this query?
>
>
> rate(ifHCInOctets{job=~'$JobName',instance=~'$Device',ifName=~'$Interface'}[$__rate_interval])*$interfacebits
>
>
>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/84ab8c27-476a-47fb-bd28-d7d3fd433b85n%40googlegroups.com.


[prometheus-users] Re: interference between prometheus and rabbitmq when starting

2024-02-19 Thread 'Brian Candler' via Prometheus Users
If RabbitMQ attempts to start but fails, the reason will be shown in its 
output (e.g. "journalctl -eu rabbitmq" or whatever the service name is).

One guess is that RabbitMQ is trying to bind to the same port as 
prometheus. Prometheus uses port 9090 by default, and I think RabbitMQ uses 
port 5672 and 15672 by default, so they ought to be fine. But maybe there's 
some additional port configured in one or the other.

If there's no attempt even to start RabbitMQ while Prometheus is running, 
then this would be down to your process manager (e.g. systemd); maybe 
systemd has some weird dependencies configured between those two packages? 
This would be unusual.

Those are a couple of ideas, but basically this is a problem with your 
system which you'll need to resolve locally. The best practice in any case 
would be to run Prometheus and RabbitMQ in separate VMs, or at least 
separate containers.

On Tuesday 20 February 2024 at 02:25:33 UTC+7 Mateus Silva wrote:

> When starting the system, the Prometheus service appears with metrics and 
> exporters, but RabbitMQ does not start. When I stop Prometheus, RabbitMQ 
> starts. Could you tell me what could be causing this problem? I couldn't 
> find the error. Thank you for your help if possible
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/2514533a-2f8e-4a32-a329-c189d9960f20n%40googlegroups.com.


[prometheus-users] Re: prometheus ui on local computer

2024-02-18 Thread 'Brian Candler' via Prometheus Users
localhost:9090 is what you'd enter if prometheus was running on the same 
machine as your browser.

In this case it's remote, so enter
  <server-ip>:9090
where <server-ip> is the IP-address of the server where prometheus is 
running.

On Monday 19 February 2024 at 07:13:02 UTC Leah Stapleton wrote:

> Hello,
> This is my first time using Prometheus. I found it very easy to set up the 
> server but I'm puzzled about viewing the data. 
>
> I have a prometheus server running on a VPN at Digital Ocean, which is set 
> up to scrape data from Caddy webserver.
>
> But how do I view the dashboards?
>
> I know there's a prometheus ui that I can use at localhost:9090, but I 
> don't know how to set that up to view the data on a prometheus server 
> running on a vpn.I can't just open localhost:9090 on my browser and see 
> data from a remote server, there must be some step I am missing. 
>
> Can anyone give me detailed step by step instructions? 
> 1.Do I need to add anything to the prometheus.yml file? if so, what? 
> 2. what command do I run in the terminal of my computer to get the 
> prometheus ui going?
>
> Thank you for your help.
>
> By the way, I've heard Grafana is also a possibility but I'm interested in 
> trying the Prometheus Ui instead. Thank you.
>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/4a64826e-e3c0-42ce-8923-88b59f39f64fn%40googlegroups.com.


[prometheus-users] Re: Hi

2024-02-14 Thread 'Brian Candler' via Prometheus Users
Sorry, I still don't understand what your metrics currently represent, and 
what condition (in terms of the values of those metrics) you need to alert 
on. Perhaps if you give some specific examples of those metrics it would 
help, both in a normal condition, and an error condition.

On Wednesday 14 February 2024 at 13:39:21 UTC h0ksa wrote:

> I apologize if my explanation was unclear. I currently have four tables 
> for web analytics data. I aim to create a query that triggers an alert when 
> either the data retrieval stops or the number of rows falls below one. Any 
> recommendations on how to formulate this query?
> On Wednesday, February 14, 2024 at 2:28:33 PM UTC+1 Brian Candler wrote:
>
>> I'm not sure why you're summing over increase, but if you plot that 
>> PromQL expression in the web UI, does its value drop to zero when the 
>> problem occurs?
>>
>> If so, just add "== 0" to the end of this expression and you can use it 
>> as an alerting expression.
>>
>> On Wednesday 14 February 2024 at 11:37:32 UTC h0ksa wrote:
>>
>>> sum by (table) (increase(pinot_server_realtimeRowsConsumed_Count[5m])) 
>>>
>>> right now iam using this query 
>>>
>>> and rows are always rising in the database but i want to know when they 
>>> stop and trigger an alert 
>>>
>>> On Wednesday, February 14, 2024 at 12:31:53 PM UTC+1 Brian Candler wrote:
>>>
 What Prometheus metrics are you collecting? For example, do you have a 
 metric for the total number of rows in the database? Or do you have a 
 metric for the last time a row was inserted? Or some other metric which 
 can 
 identify new rows - if so, what?

 What is the "previously suggested function"?

 We can't really suggest an alerting function without seeing the metrics 
 themselves.

 On Wednesday 14 February 2024 at 10:16:01 UTC h0ksa wrote:

> Hi all ,
>
>
> I have a dataset with 5000 rows, and my objective is to determine if 
> any new rows have been inserted or created within the last 5 minutes. The 
> previously suggested function may not be suitable because the row count 
> is 
> never expected to drop below 1. Consequently, my focus is on identifying 
> instances where no rows have been created within a 5-minute timeframe. If 
> no new rows are found during this period, I intend to initiate an alert. 
> So 
> which function to use 
> increase()
> delta()
> increase()  or another one .



-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/b5b14849-9b21-4194-8d86-b1c8b6675131n%40googlegroups.com.


[prometheus-users] Re: Hi

2024-02-14 Thread 'Brian Candler' via Prometheus Users
I'm not sure why you're summing over increase, but if you plot that PromQL 
expression in the web UI, does its value drop to zero when the problem 
occurs?

If so, just add "== 0" to the end of this expression and you can use it as 
an alerting expression.
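
As a minimal sketch, the corresponding alerting rule could look something
like this (group and alert names are placeholders):

  groups:
    - name: pinot
      rules:
        - alert: RowsStoppedIncreasing
          expr: sum by (table) (increase(pinot_server_realtimeRowsConsumed_Count[5m])) == 0
          for: 5m   # optional: only fire if it stays at zero for a while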

On Wednesday 14 February 2024 at 11:37:32 UTC h0ksa wrote:

> sum by (table) (increase(pinot_server_realtimeRowsConsumed_Count[5m])) 
>
> right now iam using this query 
>
> and rows are always rising in the database but i want to know when they 
> stop and trigger an alert 
>
> On Wednesday, February 14, 2024 at 12:31:53 PM UTC+1 Brian Candler wrote:
>
>> What Prometheus metrics are you collecting? For example, do you have a 
>> metric for the total number of rows in the database? Or do you have a 
>> metric for the last time a row was inserted? Or some other metric which can 
>> identify new rows - if so, what?
>>
>> What is the "previously suggested function"?
>>
>> We can't really suggest an alerting function without seeing the metrics 
>> themselves.
>>
>> On Wednesday 14 February 2024 at 10:16:01 UTC h0ksa wrote:
>>
>>> Hi all ,
>>>
>>>
>>> I have a dataset with 5000 rows, and my objective is to determine if any 
>>> new rows have been inserted or created within the last 5 minutes. The 
>>> previously suggested function may not be suitable because the row count is 
>>> never expected to drop below 1. Consequently, my focus is on identifying 
>>> instances where no rows have been created within a 5-minute timeframe. If 
>>> no new rows are found during this period, I intend to initiate an alert. So 
>>> which function to use 
>>> increase()
>>> delta()
>>> increase()  or another one .
>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/a7dac863-2dc7-42de-ad4d-69c4f24ce725n%40googlegroups.com.


[prometheus-users] Re: Hi

2024-02-14 Thread 'Brian Candler' via Prometheus Users
What Prometheus metrics are you collecting? For example, do you have a 
metric for the total number of rows in the database? Or do you have a 
metric for the last time a row was inserted? Or some other metric which can 
identify new rows - if so, what?

What is the "previously suggested function"?

We can't really suggest an alerting function without seeing the metrics 
themselves.

On Wednesday 14 February 2024 at 10:16:01 UTC h0ksa wrote:

> Hi all ,
>
>
> I have a dataset with 5000 rows, and my objective is to determine if any 
> new rows have been inserted or created within the last 5 minutes. The 
> previously suggested function may not be suitable because the row count is 
> never expected to drop below 1. Consequently, my focus is on identifying 
> instances where no rows have been created within a 5-minute timeframe. If 
> no new rows are found during this period, I intend to initiate an alert. So 
> which function to use 
> increase()
> delta()
> increase()  or another one .

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/8316ff9a-7aba-4c6c-9c45-06e91d5510d5n%40googlegroups.com.


Re: [prometheus-users] Re: Alert Query

2024-02-14 Thread 'Brian Candler' via Prometheus Users
(max_over_time(kube_pod_status_ready{condition="true"}[10m]) == 0))

That won't solve your problem, because if there were only one initial 
sample with value zero, it would trigger immediately, just as you do 
today.  You would need to make this more complex, for example with 
count_over_time as well.

Be very careful with "and" and "or" operators. They do not work like 
booleans in normal languages. They are combining vectors, matching the 
label sets of each element in the vector.

I recommend you go for the simplest expression which meets your needs 
sufficiently well.
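
For illustration only, one possible (untested) elaboration of the combined
expression quoted above, using count_over_time and assuming a 30s scrape
interval so that roughly 20 samples fall into a 10 minute window:

  (kube_pod_status_ready{condition="true"} == 0
    and max_over_time(kube_pod_status_ready{condition="true"}[10m]) == 1)
  or
  (max_over_time(kube_pod_status_ready{condition="true"}[10m]) == 0
    and count_over_time(kube_pod_status_ready{condition="true"}[10m]) >= 18)

The second half then only fires once the series has existed for most of
the window, rather than on the first zero sample of a freshly created pod.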

On Wednesday 14 February 2024 at 03:46:33 UTC sri L wrote:

> Thanks Brian Candler.
>
> I am thinking of combining two conditions.
>
> ((kube_pod_status_ready{condition="true"} == 0 and 
> max_over_time(kube_pod_status_ready{condition="true"}[10m]) == 1) or 
> (max_over_time(kube_pod_status_ready{condition="true"}[10m]) == 0))
>
> Expecting this expression to alert if pod was up in the last 10 mins and 
> currently unreachable or pod is unreachable from last 10mins or more.
>
> Please correct if there is any better way
>
>
> On Wed, Feb 14, 2024 at 12:32 AM 'Brian Candler' via Prometheus Users <
> promethe...@googlegroups.com> wrote:
>
>> I guess it goes through non-ready states while it's starting up.
>>
>> A simple approach is to put "for: 3m" on the alert so that it doesn't 
>> fire an alert until it has been in the down state for 3 minutes.
>>
>> Another approach would be:
>>
>> kube_pod_status_ready{condition="true"} == 0 and 
>> max_over_time(kube_pod_status_ready{condition="true"}[10m]) == 1
>>
>> This will fire if the pod was ready at any time in the last 10 minutes, 
>> but is not ready now. This does mean that the alert will clear after 10 
>> minutes of error condition, though.
>>
>> On Tuesday 13 February 2024 at 17:18:37 UTC sri L wrote:
>>
>>> Hi all,
>>>
>>> I am trying to create an alert rule for pod unreachable condition. Below 
>>> expression I used but alert was triggering whenever new pod got created, we 
>>> want alert only when the previous state of a pod was in the ready state and 
>>> then went to unreachable/terminating/pending states.
>>>
>>> kube_pod_status_ready{condition="true"} == 0
>>>
>>> Please suggest if we have any suitable alert expression for the above 
>>> requirement.
>>>
>>> Thanks
>>>
>>>
>>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "Prometheus Users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to prometheus-use...@googlegroups.com.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/prometheus-users/7728c736-797f-4771-b809-24e5f6b3931dn%40googlegroups.com.
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/82297f3b-1579-41b9-94df-b14bc4eb3f72n%40googlegroups.com.


[prometheus-users] Re: Alert Query

2024-02-13 Thread 'Brian Candler' via Prometheus Users
I guess it goes through non-ready states while it's starting up.

A simple approach is to put "for: 3m" on the alert so that it doesn't fire 
an alert until it has been in the down state for 3 minutes.
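
As a minimal sketch (the alert name is a placeholder), that looks like:

  - alert: PodNotReady
    expr: kube_pod_status_ready{condition="true"} == 0
    for: 3m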

Another approach would be:

kube_pod_status_ready{condition="true"} == 0 and 
max_over_time(kube_pod_status_ready{condition="true"}[10m]) == 1

This will fire if the pod was ready at any time in the last 10 minutes, but 
is not ready now. This does mean that the alert will clear after 10 minutes 
of error condition, though.

On Tuesday 13 February 2024 at 17:18:37 UTC sri L wrote:

> Hi all,
>
> I am trying to create an alert rule for pod unreachable condition. Below 
> expression I used but alert was triggering whenever new pod got created, we 
> want alert only when the previous state of a pod was in the ready state and 
> then went to unreachable/terminating/pending states.
>
> kube_pod_status_ready{condition="true"} == 0
>
> Please suggest if we have any suitable alert expression for the above 
> requirement.
>
> Thanks
>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/7728c736-797f-4771-b809-24e5f6b3931dn%40googlegroups.com.


Re: [prometheus-users] snmp_exporter 0.25.0 + and prometheus 2.49.1 with "%" in label value - format issue

2024-02-12 Thread 'Brian Candler' via Prometheus Users
Are you running either the Prometheus server or the web browser under 
Windows? STATUS_BREAKPOINT appears here:
https://pkg.go.dev/golang.org/x/sys@v0.17.0/windows#pkg-constants

On Monday 12 February 2024 at 15:58:44 UTC Ben Kochie wrote:

> On Mon, Feb 12, 2024, 16:39 Alexander Wilke  wrote:
>
>> Hello,
>
> thanks for the fast response. Unfortunately the linux environment I have 
>> is very restricted and I first have to check which snmpwalk tool I can use 
>> because downloads are very limited.
>> Will take me some time but I think I will open the issue with the 
>> information I have.
>>
>
> The output from the exporter is fine, no need for other tools.
>
>
>> if I run ltmNodeAddresstype I can see a value of (1) for the IPs in the 
>> /Common partition which is the base partition and has no suffix like %xyz.
>> Other partitions I have the suffix and the address Type value is (3).
>>
>
> Yup, that's what I thought.
>
>
>> So it is probably as you said:
>> ipv4z(3) A non-global IPv4 address including a zone index as defined by 
>> the InetAddressIPv4z textual convention.
>>
>>
>> PS:
>> is it possible that this may cause instability of the prometheus webui? 
>> If I browse the "graph" page and search for f5 metrics, sometimes the 
>> browser is showing a white error page "STATUS_BREAKPOINT".
>> This is a test environment and maybe there something else wrong - however 
>> - it feels like it started with the monitoring of f5 devices via SNMP.
>>
>
> No, this is just a failed string conversion. So you get the default hex 
> conversion instead. 
>
> I don't know what your error is, but I am fairly sure this is unrelated to 
> Prometheus or SNMP data.
>
>
>> Ben Kochie schrieb am Montag, 12. Februar 2024 um 15:20:05 UTC+1:
>>
>>> Looking at the MIB (F5-BIGIP-LOCAL-MIB), I see this MIB definition:
>>>
>>> ltmPoolMemberAddr OBJECT-TYPE
>>>   SYNTAX InetAddress
>>>   MAX-ACCESS read-only
>>>   STATUS current
>>>   DESCRIPTION
>>> "The IP address of a pool member in the specified pool.
>>> It is interpreted within the context of an ltmPoolMemberAddrType 
>>> value."
>>>   ::= { ltmPoolMemberEntry 3 }
>>>
>>> InetAddress syntax comes from INET-ADDRESS-MIB, which has several 
>>> conversion types. Without knowing what the device is exposing 
>>> for ltmPoolMemberAddrType it's hard to say, but I'm guessing it's type 
>>> 3, InetAddressIPv4z.
>>>
>>> I don't think we have this textual convention implemented in the 
>>> exporter.
>>>
>>> Would you mind filing this as an issue on GitHub?
>>> * It would also be helpful to have the sample data as text, rather than 
>>> a screenshot. This makes it easier to work with for creating test cases.
>>> * Please also include walks of `ltmPoolMemberAddrType` as well as 
>>> `ltmPoolMemberAddr`
>>>
>>> https://github.com/prometheus/snmp_exporter/issues
>>>
>>> It would also be helpful to have the sample data as text, rather than a 
>>> screenshot. This makes it easier to work with for creating test cases.
>>>
>>> On Mon, Feb 12, 2024 at 2:54 PM Alexander Wilke  
>>> wrote:
>>>
 Hello,

 I am using the snmp_exporter 0.25.0 and prometheus 2.49.1.

 I am collecting metrics from F5 LTM Loadbalancers. I want to collect 
 the IP-Address.

  

 in general it is working however some IP-address formats are looking 
 like that:

  

 10.10.10.10 which I can import in the correct format

  

 Others a displayed by the F5 system like this:

  

 10.10.10.10%0

 or

 10.10.10.10%1

  

 The trailing  %0 or %1 ... represents a logical separation on the 
 system.

  

 The ingestion into prometheus works however the format is then 
 different and looks like hex. Any chance to get the "raw" information or 
 at 
 least replace the trailing %0?


 [image: ip_address_format_includes_percent.jpg]


 -- 
 You received this message because you are subscribed to the Google 
 Groups "Prometheus Users" group.
 To unsubscribe from this group and stop receiving emails from it, send 
 an email to prometheus-use...@googlegroups.com.
 To view this discussion on the web visit 
 https://groups.google.com/d/msgid/prometheus-users/dd89ed7e-a276-43ff-8bb1-5631ba98cfb7n%40googlegroups.com.

>>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "Prometheus Users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to prometheus-use...@googlegroups.com.
>>
> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/prometheus-users/a5658075-d685-487e-9cac-5d16d3cb0e15n%40googlegroups.com

[prometheus-users] Re: PromQL filter based on current date

2024-02-12 Thread 'Brian Candler' via Prometheus Users
The only ways I know are to use the Prometheus API and set the evaluation 
time, or to use the @ timestamp PromQL modifier. But in either case you 
have to work out the timestamp of the end of the 24 hours of interest, and 
insert it yourself.
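
For illustration, an instant query pinned to the end of the day of
interest could look like this (the timestamp value is a made-up example;
you would substitute the Unix timestamp for midnight at the end of the
current day yourself):

  my_metric{node="ABC"}[24h] @ 1707523199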

On Friday 9 February 2024 at 14:29:39 UTC Dipesh J wrote:

> Hi, 
>
> Is there way to get metrics only for current date instead of using time 
> like [24h] which would probably give metrics for day before too.
>
> last 24 hours
> my_metric{node="ABC"} [24h]
>
> Something like below to give metrics for current date only.
> my_metric{node="ABC"} [TODAYS_DATE]
>
>
> Thanks
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/6f37d3e0-f044-46bb-9c9a-0933c41741acn%40googlegroups.com.


[prometheus-users] Re: Prometheus Federation: cannot unmarshal number into Go struct field

2024-02-05 Thread 'Brian Candler' via Prometheus Users
If you are using file_sd_configs to read /etc/prometheus/federate_sd.json 
(which is what I guess you're doing), then changes will be automatically 
picked up. There's no need to hit the reload endpoint for this (*). 

Of course, the new file has to be valid JSON (or YAML); if it's not, then 
it will be ignored and prometheus will continue using the old contents. You 
can check validity using:

/path/to/promtool check config /path/to/prometheus.yml
/path/to/promtool check service-discovery /path/to/prometheus.yml job_name

(*) You only need to hit /-/reload if you've changed prometheus.yml or any 
rules files. Note that if you'd omitted --web.enable-lifecycle then you'd 
get a 403 Forbidden response with text "Lifecycle API is not enabled."

On Monday 5 February 2024 at 09:28:38 UTC Edwin Vasquez wrote:

> Hey Brian,
>
> You're awesome!  You gave me a hint on what to check It's now working 
> perfectly!  Now I need to figure out on how prometheus read the updates in 
> federate_sd.json without restarting the prometheus service.  It seems 
> like `curl -X POST http://localhost:9090/-/reload` 
>  is not working.  Note that 
> "--web.enable-lifecycle" is included in the docker compose YAML file.  Any 
> thoughts?
> On Sunday, February 4, 2024 at 12:13:14 AM UTC+8 Brian Candler wrote:
>
>> Can you show the content of your /etc/prometheus/federate_sd.json file?  
>> The error suggests to me that you are putting a number where you need a 
>> string, for example
>>
>> {"labels": {"foo": 123}}
>>
>> where it should be
>>
>>{"labels": {"foo": "123"}}
>>
>> On Saturday 3 February 2024 at 16:03:02 UTC Edwin Vasquez wrote:
>>
>>> The issue is happening on Prometheus version: *2.49.1*
>>>
>>> I'm getting the following error on Prometheus federation:
>>>
>>> ts=2024-02-03T13:30:14.209Z caller=file.go:343 level=error 
>>> component="discovery manager scrape" discovery=file config=nrp-federation 
>>> msg="Error reading file" path=/etc/prometheus/federate_sd.json err="json: 
>>> cannot unmarshal number into Go struct field .labels of type 
>>> model.LabelValue"
>>>
>>> *It is not happening on my old instance (with version 2.29.2)*.  One 
>>> thing I have noticed that if the generated federation JSON file has json 
>>> objects that are not properly arrange, the error occurs.
>>>
>>> Any thoughts?
>>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/de05fdf6-c34a-44d4-af71-11a569d672d4n%40googlegroups.com.


Re: [prometheus-users] Prometheus alert evaluation, are they instant queries?

2024-02-03 Thread 'Brian Candler' via Prometheus Users
Even without a subquery, a rule can include a range vector expression and 
then reduce it to an instant vector, e.g.

expr: avg_over_time(snmp_scrape_duration_seconds[5m]) >= 3

On Saturday 3 February 2024 at 16:04:56 UTC Ben Kochie wrote:

> All rule evaluations are instant queries. You do all the "reducer 
> functions" in PromQL itself.
>
> For example, you can use subquery syntax to do something like 
> `avg_over_time()`.
>
> On Sat, Feb 3, 2024 at 5:02 PM 'Andrew Dedesko' via Prometheus Users <
> promethe...@googlegroups.com> wrote:
>
>> Hi,
>>
>> I'm wondering whether prometheus uses instant queries or range queries 
>> when evaluating alert expressions?  The context about why I'm asking might 
>> help clarify my question.
>>
>> I'm comparing Grafana Cloud's alerting functionality with prometheus.  
>> From Grafana Cloud we're querying Google Cloud Metrics with PromQL (it's 
>> Google's Monarch DB with a PromQL interface).  Grafana Cloud's alerting 
>> system takes your PromQL query and performs a *range query* against 
>> Google Cloud Metrics, returning multiple data points over the range you 
>> have selected (e.g. 10 minutes ago to now).  Then you need to choose a 
>> reducer function to turn the time series into an instant scalar (e.g. min, 
>> max, last, mean).
>>
>> Prometheus alerts don't seem to have an option for specifying a range and 
>> also don't have a reducer option.  So this leads me to believe prometheus 
>> uses instant queries to evaluate alert expressions.  But I'd like to know 
>> for sure.
>>
>> Thanks for reading!
>>
>> Here's the Grafana Cloud documentation on alert query ranges and reducers:
>>
>> https://grafana.com/docs/grafana/latest/alerting/alerting-rules/create-grafana-managed-rule/
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "Prometheus Users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to prometheus-use...@googlegroups.com.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/prometheus-users/ef0e2ee6-a32e-479e-bbe4-10499372715cn%40googlegroups.com.
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/12386cd6-b032-45a6-b755-f0f51551b5d3n%40googlegroups.com.


[prometheus-users] Re: Prometheus Federation: cannot unmarshal number into Go struct field

2024-02-03 Thread 'Brian Candler' via Prometheus Users
Can you show the content of your /etc/prometheus/federate_sd.json file?  
The error suggests to me that you are putting a number where you need a 
string, for example

{"labels": {"foo": 123}}

where it should be

   {"labels": {"foo": "123"}}

On Saturday 3 February 2024 at 16:03:02 UTC Edwin Vasquez wrote:

> The issue is happening on Prometheus version: *2.49.1*
>
> I'm getting the following error on Prometheus federation:
>
> ts=2024-02-03T13:30:14.209Z caller=file.go:343 level=error 
> component="discovery manager scrape" discovery=file config=nrp-federation 
> msg="Error reading file" path=/etc/prometheus/federate_sd.json err="json: 
> cannot unmarshal number into Go struct field .labels of type 
> model.LabelValue"
>
> *It is not happening on my old instance (with version 2.29.2)*.  One 
> thing I have noticed that if the generated federation JSON file has json 
> objects that are not properly arrange, the error occurs.
>
> Any thoughts?
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/a2f69af9-d6c5-4358-81b2-2f65a124b07an%40googlegroups.com.


[prometheus-users] Re: Prometheus target disappears from Grafana metrics when it is down.

2024-01-31 Thread 'Brian Candler' via Prometheus Users
Questions about Grafana would be best asked to the Grafana 
Community: https://community.grafana.com/

On Wednesday 31 January 2024 at 14:48:07 UTC donna_u...@comcast.net wrote:

> I have a dashboard in Grafana with a Prometheus data source.  When target 
> goes down it disappears from the Grafana panel. Is there a way to make it 
> static.

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/e87d1884-2cfc-4d6c-9339-b3056ad1b452n%40googlegroups.com.


[prometheus-users] Re: Actual alert repeat_interval = group_interval + repeat_interval ?

2024-01-30 Thread 'Brian Candler' via Prometheus Users
Not wanting to state the obvious, but have you tried
group_interval: 1h
repeat_interval: 3h
?


On Tuesday 30 January 2024 at 18:35:46 UTC Puneet Singh wrote:

> Hi All,
> I am facing an issue with the latest version of Alert manager.
> I have a group_interval which is a perfect divisor of repeat_interval
>
> *group_interval: 1h
> repeat_interval: 4h*
>
> in the aforementioned setting , i get repeated alerts after 5 hours ( if 
> no new alerts are added) which is contrary to what i expected.
> [image: almgr.png]
>
> After countless keystrokes in google search bar, i came across an issue 
> which mirrors what i am experiencing with Alert manager. - 
> https://github.com/prometheus/alertmanager/issues/2320
> [image: issue.png]
> There is no response from the developers on this post.
>
>
> Seems i need to try 
>
> *group_interval: 1m
> repeat_interval: 4h*
> to get as close as possible (4 hour 1 minute)
>  but with this setting i may end up spamming alert receivers -  with 
> updates every 1 minutes in worst case scenario.
>
> Is there a way out ? Please advice.
>
> Thanks , 
> Puneet
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/c8d67736-8eb8-4578-8c9c-8f5005038f24n%40googlegroups.com.


[prometheus-users] Re: How to monitor UP and downtime in Prometheus

2024-01-29 Thread 'Brian Candler' via Prometheus Users
This is a duplicate 
of https://groups.google.com/g/prometheus-users/c/f5aM1n7aPY8 - please 
don't keep posting the same question.

You need a separate piece of software to create your dashboard, the most 
popular of which is Grafana. For any questions about Grafana, please go 
to  https://community.grafana.com/

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/2df776b1-a14c-47b4-a9c1-8636a2262b3dn%40googlegroups.com.


Re: [prometheus-users] Binary operations between range vectors and scalars

2024-01-29 Thread 'Brian Candler' via Prometheus Users
On Monday 29 January 2024 at 15:24:26 UTC Chris Siebenmann wrote:

For instance, if you do 
delta(metric[1h] > 0), does delta() extrapolate using the timestamps of 
the first and last time series in the original range vector or the 
filtered one?


I would expect it to use the timestamps of the first and last points in 
the filtered one.
 

PS: speaking of timestamps, it would be nice if timestamp() worked on a 
range vector and yielded its own range vector of the timestamps of every 
element of the range vector, basically replacing the original values in 
the range vector with the corresponding timestamps. But this is probably 
outside PromQL's processing model.


I agree with that.

I'd also like to be able to answer the question "what was the timestamp of 
the most recent successful scrape"? One option would be
timestamp(last_over_time(up[10m] == 1))
... but as far as I can see, this currently wouldn't work: 
timestamp(last_over_time(up[10m])) just shows the current eval time, unlike 
timestamp(up). Maybe last_over_time() could be changed to preserve the 
timestamp?

Another option would be
max(timestamp(up[10m]))
which requires your proposed change to allow timestamp(rangevector).

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/7846ad98-fae6-4d31-b18d-b0bc825f5a4cn%40googlegroups.com.


[prometheus-users] Re: Prometheus jobs are not showing in prometheus UI

2024-01-29 Thread 'Brian Candler' via Prometheus Users
I can't see the image link you posted: it gives a 404.

There's probably some config error in your prometheus.yml or your targets 
file, but seeing only snippets means I can't spot them.

Try these commands (adjust paths as necessary):

/path/to/promtool check service-discovery /path/to/prometheus.yml 
blackbox-fundconnect_retail
/path/to/promtool check config /path/to/prometheus.yml

And what does the query
up{job="blackbox-fundconnect_retail"}
show?

Apart from that, I don't think there's much more I can do to help you. 
There's a problem with the configs on your system that you'll need to 
diagnose locally.

On Monday 29 January 2024 at 13:57:19 UTC Venkatraman Natarajan wrote:

> Hi Brian,
>
> Thanks for the response.
>
> *Are you saying that job blackbox-fundconnect_retail does not appear 
> there, but blackbox-fundconnect_retail_03 does? *
>
> Yes, Correct. We have 100 jobs in prometheus. Now we have added 5 jobs 
> additionally, in that it is showing only 4 jobs. 
>
> blackbox-fundconnect_retail_03, 
> blackbox-fundconnect_retail_04
> blackbox-fundconnect_retail_05
> blackbox-fundconnect_retail_06
> blackbox-fundconnect_retail - This one is not showing in prometheus UI. 
>
> *Is it possible that blackbox-fundconnect_retail has zero targets 
> configured?*
>
> No, It has 48 targets configured. 
>
> Note: We need to configure different scrape intervals for those jobs 
> that's why configuring different jobs instead of a single job.
>
> [image: image.png]
>
>
> Thanks,
> Venkatraman N
>
> On Thursday, January 25, 2024 at 7:13:23 PM UTC+5:30 Brian Candler wrote:
>
>> Sorry, I don't know what you mean by "not showing all the jobs".  You 
>> have only shown a small portion of the targets page.  Are you saying that 
>> job blackbox-fundconnect_retail does not appear there, 
>> but blackbox-fundconnect_retail_03 does? Is it possible 
>> that blackbox-fundconnect_retail has zero targets configured?
>>
>> The PromQL query "up" will show you all the targets. "count by (job) 
>> (up)" will show you the jobs, with the number of targets for each. Those 
>> won't show a job with zero targets though.
>>
>> Note that in the above case, both scrape jobs look to be identical, in 
>> which case you could have one job:
>>
>>   file_sd_configs:
>> - files:
>> - './dynamic/blackbox/blackbox_retail_FundConnect_targets.yml'
>> - './dynamic/blackbox/blackbox_retail_FundConnectHealth-03_targets.yml'
>>
>> (You can use target labels to distinguish them, if you wish)
>>
>> > Do we have limitations in prometheus jobs.? 
>>
>> No.
>>
>> > Prometheus version: prometheus:v2.27.1
>>
>> That's pretty old (May 2021). Current LTS version is v2.45.2
>>
>> On Thursday 25 January 2024 at 13:11:27 UTC Venkatraman Natarajan wrote:
>>
>>> Hi Team,
>>>
>>> We have 100+ prometheus jobs in prometheus UI but not showing all the 
>>> jobs. 
>>>
>>> The below are sample 2 jobs. In this, first job is not showing in UI but 
>>> second job metrics showing fine.  
>>> - job_name: 'blackbox-fundconnect_retail'
>>>   scrape_interval: 1020s
>>>   metrics_path: /probe
>>>   honor_timestamps: true
>>>   params:
>>> module: [http_2xx]
>>>   file_sd_configs:
>>> - files: 
>>> ['./dynamic/blackbox/blackbox_retail_FundConnect_targets.yml']
>>>   relabel_configs:
>>>   - source_labels: [__address__]
>>> separator: ;
>>> regex: (.*)
>>> target_label: __param_target
>>> replacement: $1
>>> action: replace
>>>   - source_labels: [__param_target]
>>> separator: ;
>>> regex: (.*)
>>> target_label: instance
>>> replacement: $1
>>> action: replace
>>>   - separator: ;
>>> regex: (.*)
>>> target_label: __address__
>>> replacement: {{ XXX }}:9122
>>> action: replace
>>>
>>> - job_name: 'blackbox-fundconnect_retail_03'
>>>   scrape_interval: 1020s
>>>   metrics_path: /probe
>>>   honor_timestamps: true
>>>   params:
>>> module: [http_2xx]
>>>   file_sd_configs:
>>> - files: 
>>> ['./dynamic/blackbox/blackbox_retail_FundConnectHealth-03_targets.yml']
>>>   relabel_configs:
>>>   - source_labels: [__address__]
>>> separator: ;
>>> regex: (.*)
>>> target_label: __param_target
>>> replacement: $1
>>> action: replace
>>>   - source_labels: [__param_target]
>>> separator: ;
>>> regex: (.*)
>>> target_label: instance
>>> replacement: $1
>>> action: replace
>>>   - separator: ;
>>> regex: (.*)
>>> target_label: __address__
>>> replacement: {{ X}}:9122
>>> action: replace
>>>
>>> Do we have limitations in prometheus jobs.? 
>>>
>>> I would like to add more prometheus jobs to scrape the metrics.
>>>
>>> Please find attached screenshot which shows jobs in prometheus UI
>>>
>>> Prometheus version: prometheus:v2.27.1
>>>
>>> Could you please help me on this? 
>>>
>>> Thanks,
>>> Venkatraman N
>>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop 

[prometheus-users] Re: Prometheus Authentication

2024-01-29 Thread 'Brian Candler' via Prometheus Users
See 
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write

There are settings for "authorization", "basic_auth" and "tls_config" that 
can be used to enable authentication to the remote_write endpoint.
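
As a rough sketch (the URL, username and file path are placeholders), the
agent's remote_write section could then look something like:

  remote_write:
    - url: https://prometheus.example.com/api/v1/write
      basic_auth:
        username: agent
        password_file: /etc/prometheus/remote-write.password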

On Monday 29 January 2024 at 12:35:00 UTC Siradj Eddine Fisli wrote:

> Actually i am using ingress-nginx to expose prometheus endpoint , shall i 
> use nginx controller auth ? but what should i pass as argument to 
> remote_write ?
>
>
> Le lundi 29 janvier 2024 à 12:24:32 UTC+1, Brian Candler a écrit :
>
>> Using --web.config-file you can make Prometheus require HTTP Basic 
>> Authentication (basic_auth_users) or TLS client certificate 
>> authentication (client_auth_type, client_ca_file, client_allowed_sans).
>>
>> See: 
>> https://prometheus.io/docs/prometheus/latest/configuration/https/#https-and-authentication
>>
>> If you want this to happen only for certain endpoints like remote_write, 
>> then you'll need to bind prometheus to 127.0.0.1 and run a reverse proxy in 
>> front of it with whatever authorization policy you want.
>>
>> On Monday 29 January 2024 at 10:45:12 UTC Siradj Eddine Fisli wrote:
>>
>>> I have two prometheus instances , one is in agent mode remote writing 
>>> metrics to the second one, i want to add authentication mechanism, also i 
>>> am using kube-prometheus-stack. is there any solution ? 
>>> also prometheus is accessible via https, i configured that using 
>>> cert-manager and letsencrypt.
>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/49b22ab5-97ba-4fcb-9229-837d15c3f80en%40googlegroups.com.


[prometheus-users] Re: Prometheus Authentication

2024-01-29 Thread 'Brian Candler' via Prometheus Users
Using --web.config-file you can make Prometheus require HTTP Basic 
Authentication (basic_auth_users) or TLS client certificate authentication 
(client_auth_type, 
client_ca_file, client_allowed_sans).

See: 
https://prometheus.io/docs/prometheus/latest/configuration/https/#https-and-authentication

If you want this to happen only for certain endpoints like remote_write, 
then you'll need to bind prometheus to 127.0.0.1 and run a reverse proxy in 
front of it with whatever authorization policy you want.

On Monday 29 January 2024 at 10:45:12 UTC Siradj Eddine Fisli wrote:

> I have two prometheus instances , one is in agent mode remote writing 
> metrics to the second one, i want to add authentication mechanism, also i 
> am using kube-prometheus-stack. is there any solution ? 
> also prometheus is accessible via https, i configured that using 
> cert-manager and letsencrypt.

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/ef65060b-caea-4fb6-b4f2-12af9d041aa6n%40googlegroups.com.


[prometheus-users] Binary operations between range vectors and scalars

2024-01-28 Thread 'Brian Candler' via Prometheus Users
I don't know if this has been proposed before, so I'd like to raise it here 
before taking it to github or prometheus-developers.

There are cases where binary operators could act between range vectors and 
scalars, but this is not currently allowed today (except by using 
subqueries, which end up resampling the timeseries). Simple examples:

up[10m] == 1# filtering

foo[10m] * 8# arithmetic

I think the semantics of these are obvious. The operation would take place 
between each value in the range vector and the scalar, and the timestamp of 
each result value would equal the timestamp of each value in the input 
range vector.

Use cases would include things like:

count_over_time(up[10m] == 1)# how many times was there a successful 
scrape in last 10 minutes?

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/72f1d082-0817-4bdf-81ca-a02c3ae19484n%40googlegroups.com.


[prometheus-users] Re: storage.tsdb.max-block-duration to a lower value completely stops compaction

2024-01-26 Thread 'Brian Candler' via Prometheus Users
As far as I know, if you set the compaction period to 3 days, then every 3 
days it will compact the last 3 days worth of data.  As simple as that.

When you say "I have setup GOGC to 60%", what *exact* string value have you 
given for GOGC? I think it must be GOGC=60 not GOGC=60%

If you're limiting the whole VM to 128GiB then setting GOMEMLIMIT a bit 
below this (e.g. "110GiB") may help during compaction time. There are blogs 
about this, e.g.
https://weaviate.io/blog/gomemlimit-a-game-changer-for-high-memory-applications

See https://pkg.go.dev/runtime for the exact format of this setting.
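
If prometheus is run under systemd (an assumption), those could be set
with a drop-in along these lines:

  [Service]
  Environment=GOGC=60
  Environment=GOMEMLIMIT=110GiB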

On Thursday 25 January 2024 at 19:50:22 UTC Sukhada Sankpal wrote:

> Thanks Brian
> I have enclosed a screenshot of TSDB head stats.
> I have setup GOGC to 60% based on recommendation by Bryan Boreham for this 
> setup
>
> However, what does this parameter exactly do? Let's say my data retention 
> is 30 days, this parameter by default sets to 3 days. Does that mean every 
> 3 days the data compaction will be triggered for 30days of data?
> On Wednesday, January 24, 2024 at 11:15:09 PM UTC-8 Brian Candler wrote:
>
>> Since regular blocks are 2h, setting maximum size of compacted blocks to 
>> 1h sound unlikely to work.  And therefore testing with 1d seems reasonable.
>>
>> Can you provide more details about the scale of your environment, in 
>> particular the "head stats" from Status > TSDB Stats in the Prometheus web 
>> interface?
>>
>> However, I think what you're seeing could be simply an artefact of how 
>> Go's garbage collection works, and you can make it more aggressive by 
>> tuning GOGC and/or GOMEMLIMIT. See
>> https://tip.golang.org/doc/gc-guide#GOGC
>> for more details.
>>
>> Roughly speaking, the default garbage collector behaviour in Go is to 
>> allow memory usage to expand to double the current usage, before triggering 
>> a garbage collector cycle. So if the steady-state heap is 50GB, it would be 
>> normal for it to grow to 100GB if you don't tune it.
>>
>> If this is the case, setting smaller compacted blocks is unlikely to make 
>> any difference to memory usage - and it could degrade query performance.
>>
>> On Wednesday 24 January 2024 at 21:45:50 UTC Sukhada Sankpal wrote:
>>
>>> Background on why I wanted to play around this parameter:
>>> Using LTS version for testing i.e. 2.45.2
>>> During compaction i.e. every 3days, the resident memory of prometheus 
>>> spikes to a very high value. Example if average of 
>>> process_resident_memory_bytes is around 50 GB and at the time of compaction 
>>> it spikes to 120 to 160 GB. Considering the usage of 50 GB want memory 
>>> allocated to the host to be around 128GB. But looking at memory usage spike 
>>> during compaction, this doesn't seem to be a workable option and keeping a 
>>> low value may lead to OOM during compaction. It also adds to cost for cloud 
>>> based VMs.
>>> On Wednesday, January 24, 2024 at 1:35:16 PM UTC-8 Sukhada Sankpal wrote:
>>>
 storage.tsdb.max-block-duration default value is set to be 10% of 
 retention time. I am currently using a setup with 30 days of retention and 
 thereby this flags default value is set to be 3 days.
 Based on suggestions posted here: 
 https://github.com/prometheus/prometheus/issues/6934#issuecomment-1610921555
 I changed storage.tsdb.min-block-duration to 30m and 
 storage.tsdb.max-block-duration to 1h. This resulted in no-compaction 
 state 
 and local storage increased quickly.

 In order to enable the compaction and have a safe test, I changed 
 storage.tsdb.max-block-duration to 1day

 I want some guideline on what is a safe lower value of this parameter 
 and keeping it low impact in increased memory usage?

>>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/454ad3b1-2d2a-4f97-b845-eb3608dcd7cfn%40googlegroups.com.


[prometheus-users] Re: Secure BlackBox exporter (Basic Authentication)

2024-01-25 Thread 'Brian Candler' via Prometheus Users
Duplicate of https://groups.google.com/g/prometheus-users/c/TMhocibN14M

On Thursday 25 January 2024 at 14:29:17 UTC Cres Portillo wrote:

> Hello Everyone,
>
> Can you add basic authentication to secure Blackbox Exporter like you can 
> with UI Endpoints?
>
> SECURING PROMETHEUS API AND UI ENDPOINTS USING BASIC AUTH
> https://prometheus.io/docs/guides/basic-auth/
>
> I have basic authentication working for my UI endpoints, but would also 
> like to enable this for Blackbox endpoint (server1:9115)
>
> Thanks in advance. 
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/74ab77aa-2e72-42a5-9ee2-9ee4ab3b5d77n%40googlegroups.com.


[prometheus-users] Re: "Secure" the Blackbox exporter using basic authentication

2024-01-25 Thread 'Brian Candler' via Prometheus Users
Please read the documentation 
here: 
https://github.com/prometheus/blackbox_exporter?tab=readme-ov-file#tls-and-basic-authentication

> To use TLS and/or basic authentication, you need to pass a configuration 
file using the --web.config.file parameter. The format of the file is 
described in the exporter-toolkit repository.
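
As a minimal sketch, such a web config file could look like this (the
username is a placeholder, and the value must be a bcrypt hash of the
password, e.g. generated with htpasswd -B):

  basic_auth_users:
    someuser: <bcrypt hash of the password>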

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/d7f2d598-8d1c-4787-a52c-7726072a1c3cn%40googlegroups.com.


[prometheus-users] Re: Prometheus jobs are not showing in prometheus UI

2024-01-25 Thread 'Brian Candler' via Prometheus Users
Sorry, I don't know what you mean by "not showing all the jobs".  You have 
only shown a small portion of the targets page.  Are you saying that job 
blackbox-fundconnect_retail does not appear there, 
but blackbox-fundconnect_retail_03 does? Is it possible 
that blackbox-fundconnect_retail has zero targets configured?

The PromQL query "up" will show you all the targets. "count by (job) (up)" 
will show you the jobs, with the number of targets for each. Those won't 
show a job with zero targets though.

Note that in the above case, both scrape jobs look to be identical, in 
which case you could have one job:

  file_sd_configs:
- files:
- './dynamic/blackbox/blackbox_retail_FundConnect_targets.yml'
- './dynamic/blackbox/blackbox_retail_FundConnectHealth-03_targets.yml'

(You can use target labels to distinguish them, if you wish)

> Do we have limitations in prometheus jobs.? 

No.

> Prometheus version: prometheus:v2.27.1

That's pretty old (May 2021). Current LTS version is v2.45.2

On Thursday 25 January 2024 at 13:11:27 UTC Venkatraman Natarajan wrote:

> Hi Team,
>
> We have 100+ prometheus jobs in prometheus UI but not showing all the 
> jobs. 
>
> The below are sample 2 jobs. In this, first job is not showing in UI but 
> second job metrics showing fine.  
> - job_name: 'blackbox-fundconnect_retail'
>   scrape_interval: 1020s
>   metrics_path: /probe
>   honor_timestamps: true
>   params:
> module: [http_2xx]
>   file_sd_configs:
> - files: ['./dynamic/blackbox/blackbox_retail_FundConnect_targets.yml']
>   relabel_configs:
>   - source_labels: [__address__]
> separator: ;
> regex: (.*)
> target_label: __param_target
> replacement: $1
> action: replace
>   - source_labels: [__param_target]
> separator: ;
> regex: (.*)
> target_label: instance
> replacement: $1
> action: replace
>   - separator: ;
> regex: (.*)
> target_label: __address__
> replacement: {{ XXX }}:9122
> action: replace
>
> - job_name: 'blackbox-fundconnect_retail_03'
>   scrape_interval: 1020s
>   metrics_path: /probe
>   honor_timestamps: true
>   params:
> module: [http_2xx]
>   file_sd_configs:
> - files: 
> ['./dynamic/blackbox/blackbox_retail_FundConnectHealth-03_targets.yml']
>   relabel_configs:
>   - source_labels: [__address__]
> separator: ;
> regex: (.*)
> target_label: __param_target
> replacement: $1
> action: replace
>   - source_labels: [__param_target]
> separator: ;
> regex: (.*)
> target_label: instance
> replacement: $1
> action: replace
>   - separator: ;
> regex: (.*)
> target_label: __address__
> replacement: {{ X}}:9122
> action: replace
>
> Do we have limitations in prometheus jobs.? 
>
> I would like to add more prometheus jobs to scrape the metrics.
>
> Please find attached screenshot which shows jobs in prometheus UI
>
> Prometheus version: prometheus:v2.27.1
>
> Could you please help me on this? 
>
> Thanks,
> Venkatraman N
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/c758fd26-c945-4f96-a579-59ccb8983cf7n%40googlegroups.com.


[prometheus-users] Re: storage.tsdb.max-block-duration to a lower value completely stops compaction

2024-01-24 Thread 'Brian Candler' via Prometheus Users
Since regular blocks are 2h, setting the maximum size of compacted blocks 
to 1h sounds unlikely to work, and therefore testing with 1d seems 
reasonable.

Can you provide more details about the scale of your environment, in 
particular the "head stats" from Status > TSDB Stats in the Prometheus web 
interface?

However, I think what you're seeing could be simply an artefact of how Go's 
garbage collection works, and you can make it more aggressive by tuning 
GOGC and/or GOMEMLIMIT. See
https://tip.golang.org/doc/gc-guide#GOGC
for more details.

Roughly speaking, the default garbage collector behaviour in Go is to allow 
memory usage to expand to double the current usage, before triggering a 
garbage collector cycle. So if the steady-state heap is 50GB, it would be 
normal for it to grow to 100GB if you don't tune it.

If this is the case, setting smaller compacted blocks is unlikely to make 
any difference to memory usage - and it could degrade query performance.

On Wednesday 24 January 2024 at 21:45:50 UTC Sukhada Sankpal wrote:

> Background on why I wanted to play around this parameter:
> Using LTS version for testing i.e. 2.45.2
> During compaction i.e. every 3days, the resident memory of prometheus 
> spikes to a very high value. Example if average of 
> process_resident_memory_bytes is around 50 GB and at the time of compaction 
> it spikes to 120 to 160 GB. Considering the usage of 50 GB want memory 
> allocated to the host to be around 128GB. But looking at memory usage spike 
> during compaction, this doesn't seem to be a workable option and keeping a 
> low value may lead to OOM during compaction. It also adds to cost for cloud 
> based VMs.
> On Wednesday, January 24, 2024 at 1:35:16 PM UTC-8 Sukhada Sankpal wrote:
>
>> storage.tsdb.max-block-duration default value is set to be 10% of 
>> retention time. I am currently using a setup with 30 days of retention and 
>> thereby this flags default value is set to be 3 days.
>> Based on suggestions posted here: 
>> https://github.com/prometheus/prometheus/issues/6934#issuecomment-1610921555
>> I changed storage.tsdb.min-block-duration to 30m and 
>> storage.tsdb.max-block-duration to 1h. This resulted in no-compaction state 
>> and local storage increased quickly.
>>
>> In order to enable the compaction and have a safe test, I changed 
>> storage.tsdb.max-block-duration to 1day
>>
>> I want some guideline on what is a safe lower value of this parameter and 
>> keeping it low impact in increased memory usage?
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/a0e18b56-13bc-4ce5-839e-17f66eededeen%40googlegroups.com.


[prometheus-users] Re: query to plot graph to monitor endpoints

2024-01-24 Thread 'Brian Candler' via Prometheus Users
I don't think it's a question of "creating a query" unless you've already 
got this data in Prometheus - and if you have, you need to show what 
metrics you have before anyone can advise on queries against them.

If you're not already collecting the data then that would be the starting 
point. If the data is coming from web server logs then you could look at 
mtail or grok_exporter. If you're going to add native instrumentation to 
your web server then you have more options. 

For response latency, remember that one collection interval can cover many 
requests: do you want the graph to show the mean response time, the median, 
the 99th percentile, something else? You might want to consider collecting 
data in histograms to allow for richer querying.
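
For example, if you export request durations as a histogram (the metric 
name below is just a placeholder for whatever your instrumentation 
produces), the 99th percentile over the last 5 minutes would be roughly:

histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))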

Up/down is generally more straightforward: a metric like 0/1. Again 
you'll have to decide where and how to collect it.

Visualizations require a separate front-end like Grafana. The timeline view 
should be doable. However, the Grafana community would be a better place to 
ask questions about Grafana.

On Wednesday 24 January 2024 at 16:07:22 UTC Keshav Sinha wrote:

> Hi Team,
>
> I need help creating a query where I can plot the graph at the X-axis 
> denotes (URL UP /Down Status) and on the Y-Axix denotes response time in 
> per sec
>
> please help me to create something like this 
>
> sample graph i added here 
>
>
> [image: WhatsApp Image 2024-01-24 at 21.00.33.jpeg]



[prometheus-users] Re: Drop Target

2024-01-24 Thread 'Brian Candler' via Prometheus Users
> Now, I want prometheus to read only from job 2,3 and drop 1, do we have a 
provision to do that in file_sd_config?

Yes. Use target relabelling using "drop" or "keep" rules. You will have to 
match on some label(s) which distinguish job 1 from jobs 2 and 3.
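
For example, something like this under the scrape job that reads your 
file_sd file (a sketch, assuming the element_name label is what 
distinguishes the targets you want to drop):

relabel_configs:
  - source_labels: [element_name]
    regex: x
    action: drop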

Note 1: the text you provided is not valid JSON. Even if it were, it is not 
structured correctly for consumption by Prometheus. The top level entity 
must be a list of objects [...], and each object needs to have 
{"targets":[...], "labels":{...}}

Note 2: it's a really bad idea to override the "job" label. This is set by 
prometheus, and should identify the scrape job which collected the data.

On Tuesday 23 January 2024 at 08:01:20 UTC akshay sharma wrote:

> Hi, 
>
> I have a file_sd_config defined in Prometheus configuration file 
> (/tmp/test.json)
> in that, I have 3-4 scrape targets as defined below:
>
> cat /tmp/test.json
>
> {"targets": ["x:123"],"labels": { 
>"job": "1","element_name": "x","__metrics_path__": 
> "/x/test"}
>
> "targets": [
>
> "y:123"],"labels": {"job": "2",   
>  "element_name": "y","__metrics_path__": "/y/test"}
>
> "targets": [
>
> "3:123"],"labels": {"job": "3",   
>  "element_name": "z","__metrics_path__": "/z/test"}
>
> }
>
> Now, I want prometheus to read only from job 2,3 and drop 1, do we have a 
> provision to do that in file_sd_config?
>
> How can I achieve this? please let me know.
>
>
> thanks,
>
>
>
>



Re: [prometheus-users] snmp exporter & snmpv3

2024-01-20 Thread 'Brian Candler' via Prometheus Users
> If you have a working snmp.yml then just add this at the top of the file

But indented and nested under the "auths" key.
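
i.e. something like this at the top of snmp.yml (names and credentials 
are placeholders - check the snmp_exporter docs for the exact fields your 
version supports):

auths:
  my_v3_auth:
    version: 3
    username: monitoruser          # placeholder
    security_level: authPriv
    password: authpassphrase       # placeholder
    auth_protocol: SHA
    priv_protocol: AES
    priv_password: privpassphrase  # placeholder

On recent snmp_exporter versions you then select it from the Prometheus 
scrape config with the "auth" URL parameter (params: auth: [my_v3_auth]).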

> If you use Cisco devices you have to use
> AES-128C or AES-256C

That is not true. You can use "AES" as normal, or you can use "AES192C" or 
"AES256C".  There is no "AES-128C" or similar.



Re: [prometheus-users] delta/increase on a counter return wrong value

2024-01-18 Thread 'Brian Candler' via Prometheus Users
If you are not worried too much about what happens if the counter resets 
during that period, then you can use:

(metric - metric offset 15m) >= 0

On Friday 19 January 2024 at 05:26:42 UTC+8 Chris Siebenmann wrote:

> > I have a counter and I want to counter the number of occurences on a 
> > duration (let's say 15m). I'm using delta() or increase but I'm not 
> getting 
> > the result I'm expecting.
> >
> > value @t0: 30242494
> > value @t0+15m: 30609457
> > calculated diff: 366963
> > round(max_over_time(metric[15m])) - round(min_over_time(metric[15m])): 
> > 366963
> > round(delta(metric[15m])): 373183
> > round(increase(metric[15m])): 373183
> >
> > increase and delta both return the same value but it appears to be wrong 
> > (+6220) while max_over_time - min_over_time return the expected value.
> >
> > I do not understand this behaviour. I must have miss something.
>
> I suspect that you may be running into delta() and increase() time range
> extrapolation. To selectively quote from the delta() documentation
> (there's similar wording for increase()):
>
> The delta is extrapolated to cover the full time range as
> specified in the range vector selector, so that it is possible
> to get a non-integer result even if the sample values are all
> integers.
>
> As far as I know, what matters here is the times when the first and last
> time series points in the range were recorded by Prometheus. If the
> first time series point was actually scraped 35 seconds after the start
> of the range and the last time series point was scraped 20 seconds
> before its end, Prometheus will extrapolate each end out to cover those
> missing 55 seconds. As far as I know there's currently no way of
> disabling this extrapolation; you just have to hope that its effects are
> small.
>
> Unfortunately these true first and last values and timestamps are very
> hard to observe. If you ask for the value at t0, the start of the range,
> as a single value (for example issuing an instant query for 'metric
> @<t0>'), Prometheus will actually look back before the start of the
> range for the most recently scraped value. The timestamp of the most
> recently observed value is 'timestamp(metric)', and you can make that
> 'the most recently observed metric at some time' with 'timestamp(metric
> @<time>)' (and then use 'date -d @<timestamp>' to convert that to a
> human-readable time string; 'date -d "2024-01-18 13:00" +%s' will go
> the other way). If you know your scrape interval, it's possible to
> deduce the likely timestamp of the first time series point within a
> range from getting the timestamp of the most recent point at the start
> of the range (it's likely to be that time plus your scrape interval,
> more or less).
>
> (The brute force way to find this information is to issue an instant
> query for 'metric[15m]', which in the Prometheus web interface will
> return a list of measurements and timestamps; you can then look at the
> first and last timestamps.)
>
> - cks
>



[prometheus-users] Re: Node_exporter 1.7.0 - http_server_config - Strict-Transport-Security

2024-01-17 Thread 'Brian Candler' via Prometheus Users
The YAML parsing error is simply saying that under "http_server_config", 
you cannot put "Strict-Transport-Security".

The documentation says that the only keys allowed under 
"http_server_config" are "http2" and "headers". So it needs to be like this:

http_server_config:
  headers:
    Strict-Transport-Security: "max-age=31536000"
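
Once node_exporter restarts cleanly you can verify the header is being 
returned with something along these lines (host, port and credentials are 
whatever applies to your setup):

curl -sk -u user:password -D - -o /dev/null \
  https://your-host:9100/metrics | grep -i strict-transport-security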

On Wednesday 17 January 2024 at 15:43:06 UTC+8 Alexander Wilke wrote:

> Hello,
>
> I am running:
>
> node_exporter, version 1.7.0 (branch: HEAD, revision: 
> 7333465abf9efba81876303bb57e6fadb946041b)
>   build date:   20231112-23:53:35
>   go version:   go1.21.4
>   platform: linux/amd64
>   tags: netgo osusergo static_build
>
>
>
> Vulnerability scan complained that HSTS is not enabled so I wanted to 
> enable it:
>
> tls_server_config:
>   cert_file: "/opt/node_exporter/node_exporter.pem"
>   key_file: "/opt/node_exporter/node_exporter.key"
>
>   min_version: "TLS12"
>   max_version: "TLS13"
>
>   client_auth_type: "NoClientCert"
>
> basic_auth_users:
> user: 'xxx'
>
> http_server_config:
>   Strict-Transport-Security: max-age=31536000  # 1 year
>
>
> Unfortunately I get this error:
>
> node_exporter: ts=2024-01-17T07:30:04.483Z caller=node_exporter.go:223 
> level=error err="yaml: unmarshal errors:\n  line 14: field 
> Strict-Transport-Security not found in type web.HTTPConfig"
> systemd: node_exporter.service: main process exited, code=exited, 
> status=1/FAILURE
>
>
> I tried to configure it based on this documentation:
> https://prometheus.io/docs/prometheus/latest/configuration/https/
>
> probably I need the other parameters, too like:
> Strict-Transport-Security: max-age=<expire-time>; includeSubDomains; 
> preload 
> How to get this working?
>
>



[prometheus-users] Re: Weird node_exporter network metrics behaviour - NIC problem?

2024-01-16 Thread 'Brian Candler' via Prometheus Users
I would suspect it's due to how the counters are incremented and the new 
values are published.

Suppose in the NIC's API new counter values are published at some odd 
interval like every 0.9 seconds. Your 15 second scrape will sometimes see 
the results of 16 increments from the previous counter, and sometimes 17 
increments.

It's just a guess, but it's the sort of thing that can cause such artefacts.
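
That's also why rate() over a window several times the scrape interval 
looks smooth - it averages across many of those publication intervals. 
For example:

rate(node_network_transmit_bytes_total{instance="xxx:9100", device="eno1"}[2m]) * 8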

On Tuesday 16 January 2024 at 21:55:06 UTC+8 Dito Windyaksa wrote:

> You're right - it's related to our irate query. We tried switching to 
> rate() and it gives us a straight linear line during iperf tests.
>
> We've been using irate for years across dozens of servers, but we've only 
> noticed 'weird drops'/instability samples on this single server.
>
> We don't see any drops during iperf tests using irate query on other 
> servers.
>
> Any clues why? NIC related?
>
>
> On Monday, January 15, 2024 at 7:24:46 PM UTC+8 Bryan Boreham wrote:
>
>> I would recommend you stop using irate().
>> With 4 samples per minute, irate(...[1m]) discards half your 
>> information.  This can lead to artefacts.
>>
>> There is probably some instability in the underlying samples, which is 
>> worth investigating. 
>> An *instant* query like 
>> node_network_transmit_bytes_total{instance="xxx:9100", device="eno1"}[10m] 
>> will give the real, un-sampled, counts.
>>
>>
>> On Monday 15 January 2024 at 01:02:59 UTC mor...@gmail.com wrote:
>>
>>> Yup - both are running under the same scrape interval (15s) and using 
>>> the same irate query:
>>> irate(node_network_transmit_bytes_total{instance="xxx:9100", 
>>> device="eno1"}[1m])*8
>>>
>>> It's an iperf test between each other and no interval argument is set 
>>> (default zero.)
>>>
>>> I wonder if it has something to do with how Broadcom reports network 
>>> stats to /proc/net/dev?
>>> On Monday, January 15, 2024 at 7:49:35 AM UTC+8 Alexander Wilke wrote:
>>>
 Do you have the same scrape_interval for both machines?
 Are you running irate on both queties or "rate" on the one and "irate" 
 on the other?
 Are the iperf intervals the same for both tests?

 Dito Windyaksa schrieb am Montag, 15. Januar 2024 um 00:02:26 UTC+1:

> Hi,
>
> We're migrating to a new bare metal provider and noticed that the 
> network metrics don't add up.
>
> We conducted an iperf test between A and B, and noticed there are 
> "drops" on the new machine during an ongoing iperf test.
>
> We also did not see any bandwidth drops from both iperf server/client 
> side.
>
> [image: Screenshot 2024-01-13 at 06.27.43.png]
>
> Both are running similar queries:
> irate(node_network_receive_bytes_total{instance="xxx", 
> device="eno1"}[1m])*8
>
> One thing is certain: green line machine is running an Intel 10G NIC, 
> while blue line machine is running an Broadcom 10G NIC.
>
> Any ideas?
> Dito
>
>



Re: [prometheus-users] Maximum targets for exporter

2024-01-13 Thread 'Brian Candler' via Prometheus Users
Just to clarify: I picked "4 cores" out of thin air just as an example to 
work through, same as I picked 15 second scrape interval and 150ms per 
scrape.

On Saturday 13 January 2024 at 09:34:21 UTC Brian Candler wrote:

> One reason is you may already have eight 4-core servers lying around.
>
> If it's a VM then of course you can just scale up to the largest instance 
> size available, before you need to go to multiple instances.
>
> On Saturday 13 January 2024 at 00:20:10 UTC Alexander Wilke wrote:
>
>> Hello,
>> sorry to hijack this thread a little bit but Brian talks about "4 CPU 
>> cores" and Ben says "scale horizontally".
>>
>> Just for interest - why not just use 8, 16, or 32 CPU cores? Is Go 
>> limited to a specific CPU count, or is there a disadvantage to having too many 
>> cores?
>> I think if someone is monitoring so many devices this is enterprise 
>> network and servers/VMs with more CPUs are no problem.
>>
>> Ben Kochie schrieb am Freitag, 12. Januar 2024 um 21:50:57 UTC+1:
>>
>>> Those sound like reasonable amounts for those exporters.
>>>
>>> I've heard of people hitting thousands of SNMP devices from the 
>>> snmp_exporter.
>>>
>>> Since the exporters are in Go, they scale well. But if it's not enough, 
>>> the advantage of their design means they can be deployed horizontally. You 
>>> could run several exporters in parallel and use a simple http load balancer 
>>> like Envoy or HAProxy. 
>>>
>>> On Fri, Jan 12, 2024, 02:32 'Elliott Balsley' via Prometheus Users <
>>> promethe...@googlegroups.com> wrote:
>>>
 I'm curious if anyone has experimented to find out how many targets can 
 reasonably be scraped by a single instance of blackbox and snmp exporters. 
  
 I know Prometheus itself can handle tens of thousands of targets, but I'm 
 wondering at what point it becomes necessary to split up the scraping.  
 I'll find out for myself soon enough, I just wanted to check and see if 
 anyone has tested this already.  I'm thinking I would have around 10K 
 targets for blackbox, and 1K for snmp.

 I'm using http_sd_config with a 15 second refresh interval, so that's 
 another potential bottleneck I'll have to test.

 -- 
 You received this message because you are subscribed to the Google 
 Groups "Prometheus Users" group.
 To unsubscribe from this group and stop receiving emails from it, send 
 an email to prometheus-use...@googlegroups.com.
 To view this discussion on the web visit 
 https://groups.google.com/d/msgid/prometheus-users/CALajkdh7EhHAVN5nJNYqJjKvcH_rfT1L7ZaPvPR4L-xjypKSbg%40mail.gmail.com
  
 
 .

>>>



Re: [prometheus-users] Maximum targets for exporter

2024-01-13 Thread 'Brian Candler' via Prometheus Users
One reason is you may already have eight 4-core servers lying around.

If it's a VM then of course you can just scale up to the largest instance 
size available, before you need to go to multiple instances.

On Saturday 13 January 2024 at 00:20:10 UTC Alexander Wilke wrote:

> Hello,
> sorry to hijack this thread a little bit but Brian talks about "4 CPU 
> cores" and Ben says "scale horizontally".
>
> Just for interest - why not just use 8, 16, or 32 CPU cores? Is Go limited 
> at a specific CPU count, or is there a disadvantage to having too many cores?
> I think if someone is monitoring so many devices this is enterprise 
> network and servers/VMs with more CPUs are no problem.
>
> Ben Kochie schrieb am Freitag, 12. Januar 2024 um 21:50:57 UTC+1:
>
>> Those sound like reasonable amounts for those exporters.
>>
>> I've heard of people hitting thousands of SNMP devices from the 
>> snmp_exporter.
>>
>> Since the exporters are in Go, they scale well. But if it's not enough, 
>> the advantage of their design means they can be deployed horizontally. You 
>> could run several exporters in parallel and use a simple http load balancer 
>> like Envoy or HAProxy. 
>>
>> On Fri, Jan 12, 2024, 02:32 'Elliott Balsley' via Prometheus Users <
>> promethe...@googlegroups.com> wrote:
>>
>>> I'm curious if anyone has experimented to find out how many targets can 
>>> reasonably be scraped by a single instance of blackbox and snmp exporters.  
>>> I know Prometheus itself can handle tens of thousands of targets, but I'm 
>>> wondering at what point it becomes necessary to split up the scraping.  
>>> I'll find out for myself soon enough, I just wanted to check and see if 
>>> anyone has tested this already.  I'm thinking I would have around 10K 
>>> targets for blackbox, and 1K for snmp.
>>>
>>> I'm using http_sd_config with a 15 second refresh interval, so that's 
>>> another potential bottleneck I'll have to test.
>>>
>>> -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "Prometheus Users" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to prometheus-use...@googlegroups.com.
>>> To view this discussion on the web visit 
>>> https://groups.google.com/d/msgid/prometheus-users/CALajkdh7EhHAVN5nJNYqJjKvcH_rfT1L7ZaPvPR4L-xjypKSbg%40mail.gmail.com
>>>  
>>> 
>>> .
>>>
>>



[prometheus-users] Re: Maximum targets for exporter

2024-01-12 Thread 'Brian Candler' via Prometheus Users
The http_sd_config refresh is going to be a very tiny part of the resource 
utilisation of Prometheus, although 15 seconds is quite aggressive.

As for the exporters, it depends very much on the scrape interval and the 
duration of each probe, the type of probe, and number of cores you have.

For example: let's say you have a 15 second scrape interval and 10K targets 
= a new scrape every 1.5ms on average (it spreads them out over the time 
period)

If each blackbox or snmp probe takes 150ms to complete, then you are 
processing 100 probes concurrently on average.

If you have 4 cores, then each core is handling 25 probe goroutines. Most 
of the time each goroutine will be waiting for network response from the 
target system.  But some probes may be more computationally expensive, e.g. 
those which involve setting up TLS connections, or SNMP 
privacy/authentication modes.

In short, it sounds to me like it should be fine, but monitor it to be sure.

Before doing any sort of sharding, I'd first put blackbox/snmp exporters 
into separate VMs (i.e. separate from Prometheus itself). That's very 
simple to implement, and gives you a clearer picture of the resource 
utilisation of each.
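
As a rough sketch, pointing Prometheus at a blackbox exporter running on 
another machine is just the usual probe-style relabelling (the hostnames 
and SD URL below are placeholders):

scrape_configs:
  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]
    http_sd_configs:
      - url: http://sd.example.internal/targets
        refresh_interval: 15s
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox.example.internal:9115

The same pattern works for snmp_exporter, with /snmp as the metrics path.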

On Friday 12 January 2024 at 01:32:17 UTC Elliott Balsley wrote:

> I'm curious if anyone has experimented to find out how many targets can 
> reasonably be scraped by a single instance of blackbox and snmp exporters.  
> I know Prometheus itself can handle tens of thousands of targets, but I'm 
> wondering at what point it becomes necessary to split up the scraping.  
> I'll find out for myself soon enough, I just wanted to check and see if 
> anyone has tested this already.  I'm thinking I would have around 10K 
> targets for blackbox, and 1K for snmp.
>
> I'm using http_sd_config with a 15 second refresh interval, so that's 
> another potential bottleneck I'll have to test.
>
>



[prometheus-users] Re: snmp_exporter 0.25.0 - IF-MIB and CISCO-IF-EXTENSION-MIB

2024-01-11 Thread 'Brian Candler' via Prometheus Users
On Thursday 11 January 2024 at 11:27:15 UTC Alexander Wilke wrote:

thank you for that snippet. I could use it to solve my issue:
(sysUpTime - on (instance) group_right () ifLastChange) / 100

However I need to find some time and try to better understand how these 
operations work.


Sure. In addition to the links I gave, there are other PromQL introductions 
you can find via Google. Here are a few that I bookmarked:

https://prometheus.io/docs/prometheus/latest/querying/examples/
https://github.com/infinityworks/prometheus-example-queries
https://timber.io/blog/promql-for-humans/
https://www.weave.works/blog/promql-queries-for-the-rest-of-us/
https://www.slideshare.net/weaveworks/promql-deep-dive-the-prometheus-query-language
https://medium.com/@valyala/promql-tutorial-for-beginners-9ab455142085
https://www.robustperception.io/blog
https://www.robustperception.io/common-query-patterns-in-promql
https://www.robustperception.io/booleans-logic-and-math
https://www.robustperception.io/composing-range-vector-functions-in-promql
https://www.robustperception.io/rate-then-sum-never-sum-then-rate
https://www.robustperception.io/using-group_left-to-calculate-label-proportions
https://www.robustperception.io/extracting-raw-samples-from-prometheus
https://www.robustperception.io/prometheus-query-results-as-csv
https://www.robustperception.io/existential-issues-with-metrics
 

Is there some sort of script builder for promQL ?


Maybe this helps? https://promlens.com/

Referenced from: 
https://groups.google.com/g/prometheus-users/c/gB2r-KabtYU/m/M629uOUGDAAJ



[prometheus-users] Re: smokeping_prober - $(target:raw) - help with ":raw" and how to use multiple targets

2024-01-11 Thread 'Brian Candler' via Prometheus Users
This is a question about Grafana and/or the smokeping_exporter Grafana 
dashboard, not Prometheus.

${target:raw} is a Grafana variable expansion, and the :raw suffix is a 
format specifier:
https://grafana.com/docs/grafana/latest/dashboards/variables/variable-syntax/#variable-syntax
https://grafana.com/docs/grafana/latest/dashboards/variables/variable-syntax/#raw

If you want multiple Grafana selections to be active at once in a PromQL 
query, then in general you need to use regex: *foo{host=~"${target}"}*
Because Grafana understands PromQL it shouldn't be necessary to add a 
:regex suffix here, although it's probably OK to add it. It should expand 
to something like
*foo{host=~"1\.1\.1\.1|8\.8\.8\.8"}*
The important thing is that you use =~ instead of =.

All this is standard Grafana functionality, and therefore further questions 
about this would best be asked in the Grafana Community forum.

If the published smokeping_exporter dashboard allows multiple selections in 
its target var, but uses host= instead of host=~, then that's a bug in the 
dashboard which you'd need to raise with the author.

However, if the published dashboard only allows a single target selection 
and you *modified* it to allow multiple selections, then you broke it. At 
this point you've become a Grafana dashboard developer, and again, the 
Grafana Community would be the best place to ask for help. It's Grafana 
that builds the query; Prometheus can only process whatever query it's 
given.

On Thursday 11 January 2024 at 08:01:15 UTC Alexander Wilke wrote:

> Hello,
>
> I am using the smokeping_prober (
> https://github.com/SuperQ/smokeping_prober) v0.7.1 and the provided 
> dashboard.json.
>
> For whatever reason the queries contain ":raw" endings for the targets.
> This leads to a problem if I want to show several targets in the same 
> graph because targets are not added with "|" in between but with ","
>
> Here is the query with one selected target which is working:
> [image: smoke_ping_one_target.JPG]
>
>
> If I select two or more targets than the query looks like this but not 
> data anymore:
> [image: smoke_ping_more_targets_no_data.JPG]
>
> If I remove the ":raw" at the end of the target I do not get any data no 
> difference if one or more clients. So this is somehow relevant.
>
>
> My idea was to have an overview panel which shows the latency of several 
> smokeping probes and to compare them. I want to place several "sensors" in 
> our DataCenter and they should ping each other. If someone tells me he has 
> performance issues with an application I can select the relevant 
> zones/probers and compare whether the latency changed or not.
>
> If I want to compare 6 probers and have to scroll through 6 panels it is 
> not so elegant, because depending on the latency the scale of the panels is 
> different and may lead to wrong assumptions.
>


