Re: [prometheus-users] Cannot get native Prometheus metrics from graph console

2021-02-11 Thread Christian Hoffmann

Hi,

On 2021-02-12 00:21, Corey Abma wrote:
Thanks for your response Christian. The Prometheus metrics endpoint is 
down (http://:10090/metrics). I've set up TLS encryption in a 
separate config file (web-config.yml) that I give as a command line 
argument for Prometheus. Could there be something wrong there? The file 
contents for that are below.


tls_server_config:
   # Certificate and key files for server to use to authenticate to client.
   cert_file: C:\path\to\cert.crt
   key_file: C:\path\to\key.key


I haven't used that feature myself yet, but assuming that Prometheus 
switches to HTTPS-only in this case, you would likely have to adjust 
your scrape job to use HTTPS as well.


scheme: https

https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scheme
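
For illustration, a minimal sketch of such a scrape job (the job name, 
target and TLS handling are assumptions about your setup, not taken 
from it):

scrape_configs:
  - job_name: 'prometheus'
    scheme: https
    tls_config:
      # only needed if the certificate is not signed by a CA the system
      # already trusts, e.g. a self-signed cert from your web-config.yml
      ca_file: C:\path\to\cert.crt
    static_configs:
      - targets: ['localhost:10090']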

Kind regards,
Christian

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/1699a469-f48b-f6c1-32cf-8a29e0c58af0%40hoffmann-christian.info.


Re: [prometheus-users] Cannot get native Prometheus metrics from graph console

2021-02-11 Thread Christian Hoffmann

On 2021-02-11 22:57, Corey Abma wrote:
Yes, I can confirm there's still an explicit scrape job. It even 
auto-fills in the graphing query console.

That might be historical data though.

I even get this from going to 
/metrics (see screenshot).

Yeah, this confirms that Prometheus still exposes metrics about itself.

Can you check the Targets page in the Web UI regarding the status of 
your Prometheus scrape job at localhost?


Kind regards,
Christian

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/829cf47c-4c47-0031-ae0d-e47d36b638d5%40hoffmann-christian.info.


Re: [prometheus-users] Cannot get native Prometheus metrics from graph console

2021-02-11 Thread Christian Hoffmann

Hi,

On 2021-02-11 20:43, Corey Abma wrote:
I recently updated my Prometheus to v2.24.1. I noticed that if I go to 
my endpoint localhost:10090/metrics, I can see Prometheus metrics (such 
as prometheus_target_interval_length_seconds) being published just fine. 
However, if I go to the graphing console and type in that exact same 
metric, I get no results.


Is anyone else experiencing this issue?


Can you confirm that your config still contains a job to scrape these 
metrics?


Even for Prometheus' own metrics, an explicit scrape job is needed.
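
For reference, a minimal sketch of such a self-scrape job, assuming the 
non-default port 10090 from your description:

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:10090']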

You explicitly mention that you've updated. Do you suspect that this 
behaved differently before the update? Is there a chance that your 
config was modified during the update?


Kind regards,
Christian

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/c3c61948-e16a-d971-9aa6-88037c4bd333%40hoffmann-christian.info.


Re: [prometheus-users] Encryption support for storing basic auth/bearer token in prometheus.yaml

2021-02-01 Thread Christian Hoffmann

Hi,

On 2021-02-01 14:39, vinoth dharmalingam wrote:
We see the basic auth/bearer token details are being stored in the 
prometheus.yaml/password file in a plain text for target scraping. Our 
cyber process does not allow this plain storage. Are there ways for 
storing it in the encrypted format similar to how bcrypt encryption 
supported in web.config.file for HTTP APIs (version 2.24)?
If you are talking about credentials which Prometheus needs in order to 
access other endpoints as a client, then the answer is "no" for 
Prometheus, and as far as I know it is most likely "no" for any other 
software as well.


I have often seen this issue come up: some IT policy forbids storing 
plaintext passwords without considering the context.


It is easily possible (and good policy!) to avoid storing plaintext 
passwords when acting as a server or account database, e.g. when the 
server has to verify a user- or client-supplied secret. This can be 
solved by using plain hashes (outdated, as they are prone to quick 
cracking/rainbow tables) or more advanced schemes such as pbkdf2, bcrypt 
and scrypt.


However, when Prometheus has to prove to another endpoint that it 
possesses a certain secret, this won't work, as hash functions, bcrypt 
etc. are designed to be non-reversible. In theory, you could encrypt 
such passwords, and lots of enterprise software supports such schemes. 
But nobody wants to ask the next question: how would the software (i.e. 
Prometheus) decrypt the stored password? The answer is: with a key which 
has to be placed right next to the config file with the encrypted values.


In other words, in most cases there is zero gain in security while you 
still have to deal with an increase in complexity.


The proper way to deal with this is ensuring that only the desired 
technical users and persons have read permissions for your config files.


This applies to passwords, tokens and private keys all in the same way.

If your IT department has a solution to this fundamental "issue", then I 
guess everyone will be eager to learn about it. :)


Kind regards,
Christian

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/91a53255-09f2-428e-ebec-1a358123d7b1%40hoffmann-christian.info.


Re: [prometheus-users] inhibit_rules

2021-01-30 Thread Christian Hoffmann

Hi,

On 2021-01-30 10:47, Auggie Yang wrote:

I was confused with alertmanager inhibit_rules; details as following:

Example:
8 servers, where one server always has high memory (50%) or disk usage (60%)
while the others have low resource usage; how do you set inhibit_rules for
this special but normal node?


If this node's memory usage is >50% and <90%, then drop this message. If this
node's memory usage is >=90%, then fire the alert to Alertmanager.


I think I would rather solve this only Prometheus-side, e.g. using this 
pattern:

https://www.robustperception.io/using-time-series-as-alert-thresholds

This way, you can configure your per-server thresholds and don't need 
any special-casing in Alertmanager.
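
A rough sketch of that pattern (the metric names and thresholds here are 
hypothetical, not from your setup): expose a per-node threshold series, 
e.g. via the node_exporter textfile collector, and join it in the alert 
expression:

# per-node threshold series
memory_usage_warn_percent{instance="special-node:9100"} 90
memory_usage_warn_percent{instance="other-node:9100"} 50

# alerting rule
- alert: HighMemoryUsage
  expr: memory_usage_percent > on(instance) group_left() memory_usage_warn_percent
  for: 10m

The special node then simply carries a higher threshold instead of 
needing its own inhibition logic.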


Kind regards,
Christian

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/4741e7cf-6e8f-97da-0962-f69366fb8895%40hoffmann-christian.info.


Re: [prometheus-users] SFP 10Gb Port

2021-01-17 Thread Christian Hoffmann

Hi,

On 2021-01-17 19:58, Hirad Rasoolinejad wrote:
Unfortunately, we have lots of services that connect using SFP 10Gb 
ethernet. We have huge traffic coming to our servers, and we need to 
process them and analyze network traffic based on transmit and receive. 
Prometheus and Node Exporter can't detect any SFP port. Is there any 
specific exporter? Do I need to change any configuration?


Are we talking about a standard network interface which shows up in the 
OS as enSOMETHING? Then it should be covered by the kernel metrics which 
node_exporter collects as part of the netclass/netdev/netstat collectors, 
which are enabled by default.

We have no trouble monitoring our interfaces with 10G ports (or higher).

If this is some specialized hardware which is driven by a userspace 
network stack or something, then things might be different.


Can you share some more details on how the interfaces look in the 
operating system and which metrics you are missing?



Kind regards,
Christian

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/784afa37-8608-0f3b-c6dc-ffe420193ec8%40hoffmann-christian.info.


Re: [prometheus-users] Querying prometheus data/series.

2020-12-23 Thread Christian Hoffmann

On 2020-12-23 09:53, akshay sharma wrote:
ONTSTAT{IDF="false",Interval="0",METype="NT",ResourceID="t",instance="172.27.173.103:8999",job="prometheus",rxByte="33",time="1608711202557257990",txByte="18"} 3



1) I want to perform some action on the labels of metrics above, how can 
we achieve this?

ex: rxbyte-txbyte/timestamp
You seem to have created a single metric with several labels which 
contain values. This is problematic, because that's not what Prometheus 
was designed for. You will likely run into performance issues and miss 
out on ways to work with your data.



2) Can the metrics above support multiple values, like a list or a map?

Not directly. This is solved using multiple metrics in Prometheus.

3) As prometheus uses a TSDB, like influx, how can we query series from 
prometheus DB?
Prometheus contains a TSDB. It can be queried using PromQL. There is no 
other, hidden interface.



4) Does promql support operations on labels?
Yes, it supports some operations (such as label_replace()), but this is 
probably not what you are looking for. All values which you want to be 
part of calculations should be the value of a metric, not the label content.


To provide more explicit guidance on how your metrics should look, we 
would need some more context. I assume that you are monitoring some kind 
of network interface. I guess it will have some kind of ID (maybe that's 
the numbers 1 to 10 which you currently store as a value?). I'll just 
assume it can be called device_id.


The metrics for one such device could look like this:

ontstat_rx_bytes{device_id="1"} 31
ontstat_tx_bytes{device_id="1"} 16
ontstat_idf{device_id="1"} 0 # 0 for false
ontstat_info{device_id="1",type="NT",resource_id="t"} 1 # always 1

If the value time= denotes the time when the metric was gathered, it 
should be dropped entirely. Prometheus will figure this out on its own. 
If this is some other timestamp (timestamp of last administrative 
modification or something), it should be added as another metric:


ontstat_last_admin_action_timestamp_seconds{device_id="1"} 
1608712842316890688


The value doesn't look like a Unix timestamp. Maybe it should be converted 
to one (in seconds).


The following documents provide further background:
https://prometheus.io/docs/concepts/data_model/
https://prometheus.io/docs/practices/naming/

In fact, I just provided some examples pointing in what is hopefully the 
right direction. I'm probably missing something as well. Depending on 
context you will also want to do some things differently (e.g. if it 
makes sense semantically to sum up rx/tx metrics, you might consider 
making them a single metric with a label to distinguish them, e.g. 
ontstat_traffic_bytes{device_id="1",queue="tx"}).


The above lines are examples for a single device_id. You would have 
additional such lines for every other device.
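
With metrics structured like that, the calculation from your first 
question becomes plain PromQL. A sketch, using the hypothetical metric 
names from above and assuming the rx/tx values are cumulative counters:

# per-device receive throughput in bytes per second
rate(ontstat_rx_bytes[5m])

# difference between received and transmitted bytes, per device
ontstat_rx_bytes - ontstat_tx_bytes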


It may also make sense to look at the (network) metrics of an existing 
exporter such as node_exporter.


Kind regards,
Christian

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/54936447-5f4e-464a-6db7-b9359b8f3ac9%40hoffmann-christian.info.




Re: [prometheus-users] combine 2 querues

2020-12-17 Thread Christian Hoffmann

Hi,

On 2020-12-17 23:06, Alan Miller wrote:
The problem is that my "instance" fields are IP address:port (eg: 
10.123.5.5:9182).

The best solution would be to fix exactly this. ;)
https://www.robustperception.io/controlling-the-instance-label
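
One sketch of that pattern, placed inside the affected scrape job; this 
variant only strips the port from the instance label and assumes no 
nicer hostname is available at scrape time:

relabel_configs:
  - source_labels: [__address__]
    regex: '(.*):\d+'
    target_label: instance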


So this query returns the instances and what looks like % cpu utilization:
   ( 100 - (avg by (instance) 
(irate(windows_cpu_time_total{mode="idle"}[5m])) * 100) > 95)


And this query gets me the hostnames I'm looking for:
   windows_cs_hostname{hostname=~".*as.*"}

So how do I combine them so that I get the CPU utilization value for 
ONLY hostnames
starting with "as" (here again, the instance fields are the same 
ipaddress:port pairs.



Try something like:

( 100 - (avg by (instance)
(irate(windows_cpu_time_total{mode="idle"}[5m])) * 100) > 95) and
on(instance) windows_cs_hostname{hostname=~".*as.*"}



Kind regards,
Christian

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/cca4e46f-4e41-be6d-f318-56f86d342b35%40hoffmann-christian.info.


Re: [prometheus-users] How do I set the maintenance cycle

2020-12-15 Thread Christian Hoffmann

On 2020-12-15 04:39, zhengwei y wrote:
Does this mean that I have to maintain a timer task to create and delete 
the corresponding silence rules every day?


That would be one possibility. This could also be automated using cron 
or something.


We chose a different approach and solve that on the alerting side using 
specific rules and inhibition:

https://groups.google.com/g/prometheus-users/c/EfgshKvBua0/m/8GCUpT5XBwAJ

Kind regards,
Christian

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/58ed4f87-3ed9-4112-ca6c-a705a5bb1112%40hoffmann-christian.info.


Re: [prometheus-users] overriding alert levels?

2020-12-14 Thread Christian Hoffmann

Hi,

On 2020-12-14 15:56, klavs@gmail.com wrote:
I would like to adjust alert levels for f.ex. disk space - so hosts that 
match some tag (like environment: dev) have a different level.


I see I am not the only one with such needs - these guys even 
implemented their own "extension" to Prometheus: 
https://www.lablabs.io/2020/04/19/how-we-solved-our-need-to-override-prometheus-alerts/


I tried to ask if such shouldn't be added (I'd gladly try and work on 
upstreaming to Prometheus with my colleagues) - but that did not get any 
traction https://github.com/prometheus/prometheus/issues/7923


Do any of you know if this is at all possible today?

Regarding disk size, f.ex. for larger disks (10TB+) we normally want 
levels in gigabytes instead of percentages, as 300GB available is only a 
small percentage but more than enough, and often we won't want that to 
trigger an alert on such a large device.
Before Prometheus, we implemented the disk check so that it switched to 
a "<100GB avail" threshold on devices larger than 10TB instead of using 
the percentage alert levels.


We are using this pattern successfully. Have you tried it?
https://www.robustperception.io/using-time-series-as-alert-thresholds
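
For the disk-size case, a rough sketch of that pattern (the threshold 
metric name and values are hypothetical): publish a per-host threshold 
in bytes, e.g. via the textfile collector, and compare against it:

# per-host threshold, e.g. 100GiB for hosts with very large disks
node_filesystem_warn_bytes{instance="bighost:9100"} 107374182400

# alerting rule
- alert: FilesystemLowSpace
  expr: node_filesystem_avail_bytes < on(instance) group_left() node_filesystem_warn_bytes
  for: 15m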

Kind regards,
Christian

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/d01fd00b-eacf-bb50-35f7-3fc615317275%40hoffmann-christian.info.


Re: [prometheus-users] Best practice for gathering metrics from client sites

2020-12-14 Thread Christian Hoffmann

Hi,

On 2020-12-14 11:24, Patrick Macdonald wrote:
If we have a Prometheus server on site-A and an arbitrary number of 
client sites, each with hardware we might want monitor, is there best 
practice for how to achieve this?
I'm assuming pushgateway isn't the correct use-case here?  Is the only 
way to create a tunnel in to each client site so that the Prometheus 
server can access each target directly?
I'm hoping there might be some way to achieve this without much work on 
the client IT side.


Do you have SSH access? Then sshified may help with creating dynamic 
tunnels:


https://github.com/hoffie/sshified

Kind regards,
Christian

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/1962fdf6-3545-48b5-7384-965b580eec29%40hoffmann-christian.info.


Re: [prometheus-users] Prometheus metrics repeat, and cause the promql to be unavailable.

2020-12-11 Thread Christian Hoffmann

Hi,

On 2020-12-11 04:36, sowhat7 wrote:
I add a label for node 10.193.32.44, and I get repeat metrics from 
prometheus at a certain moment.
Is this a permanent issue or does it affect the time around your change 
only?
I suspect the latter. In that case, you could use some kind of 
aggregation to work around that (e.g. topk(1, ...), min()). You could 
also try to delete the overlapping data (probably plus some minutes due 
to staleness) using the DELETE API.
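
A sketch of the aggregation workaround (the metric and label names are 
placeholders for whatever you are querying and whichever label was added):

min without (my_new_label) (node_load1{instance=~"10.193.32.44.*"})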


If the former, then there might be some issue with service discovery and 
you would need to share your relevant configs in order for someone to be 
able to help.



prometheus version: 2.7.2
node-exporter version: 0.17.0

These versions sound really, really old. You should consider updating.

Kind regards,
Christian

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/812cad88-4efc-8e45-e7d6-988e30265c5b%40hoffmann-christian.info.


Re: [prometheus-users] prom query time range

2020-12-10 Thread Christian Hoffmann

Hi,

On 12/10/20 8:57 AM, robu...@gmail.com wrote:
I struggle with a prom query, probably not so difficult when you know 
how to do it ;-)


I monitor how many systems are up, with count(up{job="node"})

A couple of days ago this number jumped; now I'd like to find out which 
systems are "new"/haven't been there before.


I can get a list of systems by calling the API with 'query="up"' (and 
then some jq magic) to produce a list of systems that are up NOW.


but how do I query prometheus to get a list of systems that were up, 
let's say, three days before NOW?


If you just want the list of systems which were up three days ago, you 
can use plain PromQL: "up offset 3d".


You can also use the query API's time= parameter (independent from 
PromQL) to ask for a specific evaluation time:


https://prometheus.io/docs/prometheus/latest/querying/api/#instant-queries

If you want to compare the results between now and three days before, 
you can do so as well. To list all current up metrics which were not 
there 3 days ago, you could run: "up unless up offset 3d"


Kind regards,
Christian

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/c37e067f-7508-d753-d3bf-7218003f8a22%40hoffmann-christian.info.


Re: [prometheus-users] Alert descriptions on the edge

2020-12-10 Thread Christian Hoffmann

Hi,

On 12/10/20 11:33 AM, deln...@gmail.com wrote:
   expr: (node_filesystem_avail_bytes{fstype=~"ext[234]|btrfs|xfs|zfs"}
     / node_filesystem_size_bytes{fstype=~"ext[234]|btrfs|xfs|zfs"} * 100 < 10
     and node_filesystem_readonly{fstype=~"ext[234]|btrfs|xfs|zfs"} == 0)
   for: 15m
   labels:
      severity: warning
   annotations:
      description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left.
      identifier: '{{ $labels.instance }}'
      summary: Filesystem has less than 10% space left.

When the alert is *fired*, the notification we're getting says:

   "has only 9.99% available space left."

Once the alert is *resolved* - assuming it's a little above 10% by then - 
the notification we're getting says, for instance:

   "has only 9.98% available space left"

How do you explain this anomaly?


The notifications are sent by Alertmanager. Alertmanager gets the 
current state of an alert in regular intervals. The notifications always 
contain the most recent data, this includes the description.


Therefore, when the alert first fired and the first notification was 
sent out, 9.99% was probably the current value. The value later got 
updated in Alertmanager with more recent data from Prometheus (e.g. 
9.98%). You probably did not notice, as this does not cause a new 
notification to be sent out (descriptions are annotations, not labels). 
When the alert resolved, the most recent representation was taken.


I guess that would explain it. :)

Kind regards,
Christian

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/556be349-1ace-4c3a-63a3-ed8ac0cc005e%40hoffmann-christian.info.


Re: [prometheus-users] Hunting down the root cause of fluctuating node_filesystem_avail_bytes

2020-12-09 Thread Christian Hoffmann

Hi,

On 12/9/20 3:48 PM, iono sphere wrote:

I am not sure who could I ask this, but I would like to try here.

Currently, I'm seeing something weird in my server. Thanks to Prometheus 
and Node-Exporter, I have seen that node_filesystem_avail_bytes has been 
fluctuating up and down for hundreds of Megabytes once a day or two. So 
I am trying to hunt down the root cause of this. I have been asking 
everyone already that at the time of 
node_filesystem_avail_bytes increasing/decreasing for hundreds of 
Megabytes nobody has access the server. I have already checked the 
auth.log file as well to confirm this. Our server only has a PHP Laravel 
project being used as a development server, MySQL, Vue.js Front-End 
Server, and Docker to containerize these. But none of these should 
create or remove hundreds of Megabytes...


My current guess right now is the swap file might be at work but then I 
am not sure.
Swap often has its own partition. If it's really a file, it is usually 
pre-allocated, I think. So this sounds unlikely to me.


Are there other possible systematic processes in Ubuntu that 
create/remove hundreds of megabytes once a day or two?
Cron jobs or systemd timers come to mind. Log rotation might behave that 
way for large logs. MySQL might perform some reorganization. Prometheus 
itself runs a process called compaction which temporarily needs more space:


https://prometheus.io/docs/prometheus/latest/storage/#compaction

Checking the Prometheus logs and maybe increasing the log level can 
help. There are also metrics around compaction (e.g. 
prometheus_tsdb_compactions_total) which are worth checking.
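
For example, graphing something like this next to 
node_filesystem_avail_bytes should show whether the dips line up with 
compactions (just a sketch):

increase(prometheus_tsdb_compactions_total[1h])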


In fact, is there a way to subtract all the values of expected 
systematic processes from node_filesystem_avail_bytes?
I don't think so. node_exporter takes the value directly from the 
relevant kernel interfaces. It's hard to decide what's systematic and 
what not. ;)


If you cannot find the cause after checking the items from above, you 
can try hunting it down using systematic tracing (perf trace, systemtap, 
iosnoop [1]).


[1] http://www.brendangregg.com/blog/2014-07-16/iosnoop-for-linux.html

Kind regards,
Christian

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/8c1b329e-7821-5413-f073-7b01ce453591%40hoffmann-christian.info.


Re: [prometheus-users] How do I detect a status code of 301,302 with blackbox-exporter

2020-12-09 Thread Christian Hoffmann

Hi,

On 12/8/20 4:01 AM, fun...@gmail.com wrote:

someone can help me?


If you want to verify that your target always returns one of the two 
status codes, then define a custom blackbox_exporter http module with 
valid_status_codes: [301, 302].


I assume you will also have a Location: header in this case. If this 
should be checked as well, you might want to use 
fail_if_header_not_matches additionally. no_follow_redirects: true might 
also be necessary.
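
Putting this together, a sketch of such a module (the module name and 
the expected Location pattern are assumptions, adjust to your target):

modules:
  http_redirect_check:
    prober: http
    http:
      no_follow_redirects: true
      valid_status_codes: [301, 302]
      fail_if_header_not_matches:
        - header: Location
          regexp: "https://.*"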


If you are just looking for such a metric, probe_http_status_code may be 
what you want.


Kind regards,
Christian

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/a68da866-1b5b-4d28-71c7-c8ddb5c68e4c%40hoffmann-christian.info.


Re: [prometheus-users] Is there any way to update prometheus expr?

2020-12-03 Thread Christian Hoffmann

Hi,

On 12/3/20 10:06 AM, lizhihua0925 wrote:
What I want to do is limit the rule to evaluate only metrics which have 
the specified label.
Ah, I understood that you wanted to add labels, but you already said 
that you wanted to add matchers.


So, just to confirm that I understand correctly:

If the user enters the expression 'foo > 1' it should automatically be 
modified to read 'foo{some_static_label="some_dynamic_value"} > 1', right?


So, basically you are looking for a way to programmatically modify PromQL 
expressions.


This is possible, preferably by using Go, as that's where the reference 
PromQL parser happens to be.


That sounds like what I'm doing in the prometheus-filter-proxy except 
that I'm modifying in-flight requests while you want to modify PromQL 
expressions in alerts. You could still use the same pattern:


* Parse the expression
* Traverse all relevant selectors
* Add the matcher you want
* Convert to string representation again

For reference: 
https://github.com/hoffie/prometheus-filter-proxy/blob/master/filter.go#L57


There are other projects which do similar things, e.g. in the 
Kubernetes/OpenShift ecosystem. This may provide further inspiration. :)


Kind regards,
Christian

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/513ec04f-e1b7-278c-cb11-2217e6af7ef7%40hoffmann-christian.info.




Re: [prometheus-users] Is there any way to update prometheus expr?

2020-12-03 Thread Christian Hoffmann

Hi,

On 12/3/20 9:48 AM, Allenzh li wrote:

Exactly, I develop a API which accept prometheus rule from web.
When user create a new rule, I want to add a fixed matcher(xxId="xxx"), 
which label name is fixed and label value is various.


eg.
     cpu_usage / avg_over_time(cpu_usage[5m] offset 24h) > 2
  -> cpu_usage{xxId="xxx"} / avg_over_time(cpu_usage{xxId="xxx"}[5m] offset 24h) > 2
So it seems like your web application will only handle the 
alert-specific PromQL expression which ends up in the alert's "expr" field.


It seems like it would be straightforward to add the static label with 
the dynamic value directly to the Prometheus alert using the labels: 
field?


https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/ 
(Ctrl+F "severity: page")
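
For illustration, a sketch using the expression and label from your 
example (the alert name is arbitrary):

groups:
  - name: example
    rules:
      - alert: CpuUsageDoubled
        expr: cpu_usage / avg_over_time(cpu_usage[5m] offset 24h) > 2
        labels:
          xxId: "xxx"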


Kind regards,
Christian

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/2883a76e-8221-d626-32cd-abc7233e686d%40hoffmann-christian.info.


Re: [prometheus-users] Is there any way to update prometheus expr?

2020-12-03 Thread Christian Hoffmann

Hi,

On 12/2/20 4:51 AM, Allenzh li wrote:
Hi, recently, I want to add a label to all alerting exprs; is there any 
way to achieve that?


What exactly are you trying to do? Technically, you can use alert 
relabeling to add a static (or at least deterministic) label to each 
outgoing alert.
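
A minimal sketch of the alert relabeling variant in prometheus.yml (the 
label name and value are placeholders):

alerting:
  alert_relabel_configs:
    - target_label: environment
      replacement: production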

Depending on what you want to accomplish, there may be better ways though.

Kind regards,
Christian

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/c5fe503d-66ff-9dfb-e863-45d053a0eae1%40hoffmann-christian.info.


Re: [prometheus-users] Promethus metrics caluclated in MB or Bytes or KB

2020-11-26 Thread Christian Hoffmann

Hi,

On 11/27/20 7:18 AM, Bharathwaj Shankar wrote:
I can see various metrics in Prometheus; basically, in which unit are 
they calculated?


As per the official documentation, all values should be in base units. 
That would be bytes:

https://prometheus.io/docs/practices/naming/#base-units

This should usually be documented using the _bytes suffix in the metric 
names.


However, technically Prometheus does not have a concept of units. 
Prometheus just stores the numbers. It's up to the exporters to provide 
sane values and up to the visualizations (e.g. Grafana dashboards) and 
rules to make sense of it.


Kind regards,
Christian

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/3a3a621f-5b33-7ca8-c14d-5622bb64acf7%40hoffmann-christian.info.


Re: [prometheus-users] Issue in start promentheus service

2020-11-25 Thread Christian Hoffmann

On 11/25/20 9:58 AM, 'Kunal Khandelwal' via Prometheus Users wrote:
root@ARL-KUNAL:/home/kunal/Documents/Prometheus/prometheus-2.22.2.linux-amd64# 
systemctl cat prometheus.service
# Warning: prometheus.service changed on disk, the version systemd has 
loaded is outdated.
# This output shows the current version of the unit's original fragment 
and drop-in files.
# If fragments or drop-ins were added or removed, they are not properly 
reflected in this output.

# Run 'systemctl daemon-reload' to reload units.

^ I suggest first trying this.

Kind regards,
Christian

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/1cba4b6f-ca25-3270-5f6b-205ec92ad5b0%40hoffmann-christian.info.




Re: [prometheus-users] Issue in start promentheus service

2020-11-25 Thread Christian Hoffmann

Hi,

On 11/25/20 9:40 AM, 'Kunal Khandelwal' via Prometheus Users wrote:

No, I didn't miss the "s"; that's just cut off. In my service file it has the proper name.

Ah, ok.


Well, here is my service file, can you tell me what I missed?
I don't spot any obvious problems at first glance. However, I do 
notice that the file does not match the error message you posted. Line 
14 in this file is the --config.file line, while the error message 
refers to --web.listen-address.


I still suspect that this is a systemd problem.

Can you confirm that systemd loaded the most recent version ("systemctl 
daemon-reload" if unsure)? Does the "systemctl cat prometheus.service" 
output match what you posted?


Kind regards,
Christian



[Unit]
Description=Prometheus
#Documentation=https://prometheus.io/docs/introduction/overview/
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Restart=on-failure
Group=prometheus
Type=simple
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/usr/local/bin/prometheus \
   --config.file=/etc/prometheus/prometheus.yml \
   --storage.tsdb.path=/var/lib/prometheus \
   --web.console.templates=/etc/prometheus/consoles \
   --web.console.libraries=/etc/prometheus/console_libraries \
   --web.listen-address=0.0.0.0:9090

SyslogIdentifier=prometheus
Restart=always

[Install]
WantedBy=multi-user.target



On Wednesday, November 25, 2020 at 2:01:58 PM UTC+5:30 Christian 
Hoffmann wrote:


Hi,

On 11/25/20 9:19 AM, 'Kunal Khandelwal' via Prometheus Users wrote:
 > I am facing an issue while starting Prometheus Service in
Ubuntu., it's
 > throwing the following:
In the future, could you please start a new thread? The way you posted
makes it appear that your issue is somehow related to the one from
Danilo.

 > Nov 25 12:52:36 ARL-KUNAL systemd[1]:
 > /etc/systemd/system/prometheus.service:14: Unknown lvalue
 > '--web.listen-addres
 >
 > Can someone help me out ??
You're missing an "s". The proper name is --web.listen-address.

Also, the message sounds like systemd is complaining (not Prometheus).
This would mean that your .service file has a syntax error. Sounds like
an unintended line break or a missing line continuation character on
the
previous line (\).

Kind regards,
Christian

--
You received this message because you are subscribed to the Google 
Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send 
an email to prometheus-users+unsubscr...@googlegroups.com 
<mailto:prometheus-users+unsubscr...@googlegroups.com>.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/6511c918-0aef-4cef-8677-16e4eed7bb8dn%40googlegroups.com 
<https://groups.google.com/d/msgid/prometheus-users/6511c918-0aef-4cef-8677-16e4eed7bb8dn%40googlegroups.com?utm_medium=email_source=footer>.


--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/3ae57867-ee60-9d92-5546-663ee0d2468e%40hoffmann-christian.info.


Re: [prometheus-users] Issue in start promentheus service

2020-11-25 Thread Christian Hoffmann

Hi,

On 11/25/20 9:19 AM, 'Kunal Khandelwal' via Prometheus Users wrote:
I am facing an issue while starting Prometheus Service in Ubuntu., it's 
throwing the following:
In the future, could you please start a new thread? The way you posted 
makes it appear that your issue is somehow related to the one from Danilo.


Nov 25 12:52:36 ARL-KUNAL systemd[1]: 
/etc/systemd/system/prometheus.service:14: Unknown lvalue 
'--web.listen-addres


Can someone help me out ??

You're missing an "s". The proper name is --web.listen-address.

Also, the message sounds like systemd is complaining (not Prometheus). 
This would mean that your .service file has a syntax error. Sounds like 
an unintended line break or a missing line continuation character on the 
previous line (\).


Kind regards,
Christian

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/d7808a95-b240-b849-5a64-5e6460e047de%40hoffmann-christian.info.


Re: [prometheus-users] Storing log events data

2020-11-23 Thread Christian Hoffmann

Hi,

On 11/23/20 10:41 PM, kiran wrote:
> I am trying to push log events for lambda functions in Prometheus.

> I am trying to see if we can even save this kind of data and if so any
> recommended structure in Prometheus. E.g whenever a lambda function is
> invoked, AWS puts log events data and each run will have multiple events
> and associated data for each event. Any suggestions on how to design the
> data? Is a event considered a metric in this case.
> Here is a generic structure of the data from AWS where you see
> ‘LogEvents’ is an array of events.
> 
> https://docs.aws.amazon.com/lambda/latest/dg/services-cloudwatchlogs.html

This FAQ entry might be relevant:
https://prometheus.io/docs/introduction/faq/#how-to-feed-logs-into-prometheus


In short, if you want to extract metrics from logs, there are tools to 
do that. If you want to store plain logs, use something which aims to 
support that such as ELK or Loki. :)


Kind regards,
Christian

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/9717544f-56fa-8b5c-1f79-752fe539ee50%40hoffmann-christian.info.


Re: [prometheus-users] Time of day alert

2020-11-23 Thread Christian Hoffmann

Hi,

On 11/21/20 2:41 PM, Aleksandar Ilic wrote:
I was wondering if there is any way to set an alert only to be triggered 
at a specific time of day.


As I saw for alertmanager there is PR open on GitHub but wondering if 
there is any workaround for this or any other way.


What David suggested would be the simplest method, I think. It has a 
drawback though: The alert will resolve once the working hours pass.


We use the following pattern instead:

* Create a pseudo-alert InhibitOutOfWorkingHours which fires except 
during the relevant working hours. We want this to be localtime-aware, 
which is why we don't use hour() but a textfile collector-provided 
metric called localtime_hour, etc.


* We use alert relabeling to automatically add the label 
inhibited_by= to all Inhibit.* alerts (i.e. 
{alertname="InhibitOutOfWorkingHours",inhibited_by="InhibitOutOfWorkingHours"})


* We set up a generic inhibition rule which takes source alerts with 
alertname="Inhibit.+" and inhibits target alerts with equal inhibited_by 
labels.


* The alert(s) which should be inhibited during certain timeframes are 
modified to have the appropriate label (e.g. 
inhibited_by="InhibitOutOfWorkingHours")



Might sound complicated, but it works fine so far and is supported by 
our configuration management logic.
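
A rough sketch of the pieces described above (the hours and the label 
handling are simplified; localtime_hour is our custom textfile-collector 
metric):

# Prometheus alerting rule
- alert: InhibitOutOfWorkingHours
  expr: localtime_hour < 8 or localtime_hour >= 17
  labels:
    inhibited_by: InhibitOutOfWorkingHours  # in our setup added via alert relabeling

# Alertmanager: generic inhibition rule
inhibit_rules:
  - source_match_re:
      alertname: 'Inhibit.+'
    equal: ['inhibited_by']

Any alert that should stay quiet outside working hours then gets the 
label inhibited_by="InhibitOutOfWorkingHours".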


Kind regards,
Christian

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/663b62ec-fdde-193b-7183-4fce4a40b003%40hoffmann-christian.info.


Re: [prometheus-users] Correctly using metric_relabel_configs

2020-11-19 Thread Christian Hoffmann
Hi,

On 11/19/20 11:32 PM, Laurent Dumont wrote:
> collectd_openstack_nova_gauge{exported_instance="site1-director.potato.com",instance="123.123.123.123:9103",job="collectors",openstack_nova="hypervisor-site1-compute001.potato.com",role="Director",short_hostname="director",site="site1",tripleo_role="Director",type="hypervisor_version"}
> 
> In our case, because we are scraping a central endpoint, our
> short_hostname will carry the label of the actual scraped target.
> Ideally, I would want short_hostname to be changed to qasite1-compute001.
> 
> I've tried using the following metric_relabel_configs, but while it
> doesn't break anything, it doesn't seem to change the labels when
> looking at Prometheus.
> 
>     metric_relabel_configs:
>       # Change the nova metric short_hostname to the compute name.
>       - source_labels: ["openstack_nova"]
>         regex: '([a-z]+.-compute\d\d\d)'

Prometheus regexps are fully anchored. Either try adding the
\.potato\.com suffix to the end of your regex (after the bracket) or
adapt the regex to add some wildcard (e.g. \..+) in the same place.
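
A sketch combining both suggestions (a leading wildcard plus the domain 
suffix) so that the anchored regex matches the whole label value; this 
assumes the rest of your rule writes the capture into short_hostname:

metric_relabel_configs:
  - source_labels: ["openstack_nova"]
    regex: '.*?([a-z]+.-compute\d\d\d)\.potato\.com'
    target_label: short_hostname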

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/00c0d494-ce49-27b5-abcd-74e3fcc152af%40hoffmann-christian.info.


Re: [prometheus-users] Debugging OOM issue.

2020-11-09 Thread Christian Hoffmann
Hi,

On 11/9/20 10:56 AM, yagyans...@gmail.com wrote:
> Hi. I am using Promtheus v 2.20.1 and suddenly my Prometheus crashed
> because of Memory overshoot. How to pinpoint what caused the Prometheus
> to go OOM or which query caused the Prometheus go OOM?

Prometheus writes the currently active queries to a file which is read
upon restart. Prometheus will print all unfinished queries, see here:

https://www.robustperception.io/what-queries-were-running-when-prometheus-died

This should help pin-pointing the relevant queries.

Often it's some combination of querying long time ranges and/or 
high-cardinality metrics.

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/caf68358-7dbd-c851-8c36-479d6e30cb6a%40hoffmann-christian.info.


Re: [prometheus-users] Case insensitive regex for Alertmanager

2020-10-25 Thread Christian Hoffmann
Hi,

On 10/25/20 12:42 PM, Shubham Choudhary wrote:
> Can we write Case insensitive regex for Alertmanager?
> 
> For example, label is {team="analytics"} but sometimes it can be
> {team="Analytics"} or {team="ANALYTICS"}
> 
> - receiver: opsgenie-ANALYTICS_Prometheus
>   match_re:
>     chef_env: .*analytics.*
>     severity: ^(critical)$
> 
> I have used (?i:(Match)) as mentioned in
> https://groups.google.com/g/prometheus-users/c/xOeIocdR_hI/m/bK1lDYF5AQAJ but
> didn't work.

Hrm, I would have expected it to work. Could it be some quoting issue?
I'd use single quotes as soon as some non-word characters are included.
What was the behavior you were seeing? What version of alertmanager are
you using?

A quick test with alertmanager 0.21's amtool looks like it would be
working as expected, see below.

Kind regards,
Christian


$ cat test.yml
route:
  receiver: 'default'
  routes:
  - match_re:
      severity: '(?i:critical)'
    receiver: critical

receivers:
- name: 'default'
- name: 'critical'

$ ./amtool config routes --config.file test.yml test severity=CrItIcAl
critical

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/5c101486-93c2-c118-e10d-e6374ab0cb67%40hoffmann-christian.info.


Re: [prometheus-users] ssl expiry notification

2020-10-25 Thread Christian Hoffmann
Hi,

On 10/23/20 5:54 PM, barnyb...@gmail.com wrote:
> Hello my friends.
> I'm using ribbybibby/ssl_exporter for checking ssl expiry
> for some services. All works fine, but I would like to add more
> information to the slack message. Specifically, add to the messages
> instance on which the certificates expire.
> Now the slack receives messages with the name of the alert, the number
> of instances with expiring certificates and a link to the prom. I'm
> trying to change the config but to no avail so far.
> ```
>   - alert: ssl_cert_expire
>     expr: ssl_cert_not_after{ - time() < 86400 *7
>     for: 2m
>     labels:
>       severity: 'warning'
>     annotations:
>       title: 'Warning: SSL cert will expire soon for the site (instance
> {{ $labels.instance }})'
>       description: 'SSL cert expires in 7 days\n  VALUE = {{ $value
> }}\n  LABELS: {{ $labels }}'
> ```

I think we would need some more details in order to help:

1) Can you verify the PromQL expression (there seems to be a "{" too
much or something else missing)?
2) Can you share an example result from your PromQL?
3) Can you show your Alertmanager config, especially the slack receiver
part (be sure to delete any secrets)?
4) Can you verify that your config reloads of Prometheus and
Alertmanager have been successful? Try checking the logs (stderr) and/or
the config metrics.

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/8a31234f-87c1-863f-3325-67e420f96100%40hoffmann-christian.info.


Re: [prometheus-users] Removing old data it not happening if "no space left on device" regardless retention params

2020-10-25 Thread Christian Hoffmann
Hi,

On 10/23/20 12:56 AM, Shox wrote:
> We are experiencing a catch-22: when there is "no space left on device"
> on an attempt to write to /wal, disk space is not freed. After some
> investigation, it looks like removing old data happens only after
> compaction, but compaction can't happen as there is no space left on the device.
> 
> Is that a expected behavior? Is there any solutions around this? Or any
> ongoing work to prevent this?
> 
> I actually do not understand a reason why that was done by this way.
> Could somebody please help?

I would say this is expected behavior and you would be expected to leave
enough space for Prometheus to breathe. As far as I understand,
compaction works by reading the existing blocks, merging them and
writing the new files to disk. The original files can only be safely
deleted when the new file has been written.

There is ongoing work regarding compaction [2], but I think it's related
to those cases where it is explicitly triggered via clean_tombstones.

[1] https://prometheus.io/docs/prometheus/latest/storage/
[2] https://github.com/prometheus/prometheus/issues/7957

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/32d0cc5c-97ff-bf6c-b389-e7334480900c%40hoffmann-christian.info.


Re: [prometheus-users] blackbox exporter's probe_ssl_earliest_cert_expiry giving negative values

2020-10-20 Thread Christian Hoffmann
Hi,

On 10/20/20 11:25 AM, deln...@gmail.com wrote:
> I understand there's an ongoing discussion
>  on this
> issue. How do you prevent false(or true) alerts when one of the
> applications is providing multiple certs and one of these has expired?
> Silencing these is not solution.

Your mail's subject says probe_ssl_earliest_cert_expiry while the linked
issue references the rather new
probe_ssl_last_chain_expiry_timestamp_seconds.

Which one have you been experimenting with? The latter sounds like it
might fit your usecase.
Which blackbox_exporter version are you using? There was a bugfix
related to this metric in the last release:

[1] https://github.com/prometheus/blackbox_exporter/pull/681

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/599c390d-e935-970c-3bc0-ca25df7e5f14%40hoffmann-christian.info.


Re: [prometheus-users] Re: Delta usage issues?

2020-10-17 Thread Christian Hoffmann
On 10/17/20 2:31 AM, li yun wrote:
> sum_over_time(isphone{name="qq",exname!~"test|test1"}[5m
> ])-sum_over_time(isphone{name="qq",exname!~"test|test1"}[5m])offset 5m

Try placing the offset modifier right next to the metric name:

sum_over_time(isphone{name="qq",exname!~"test|test1"}[5m]) -
sum_over_time(isphone{name="qq",exname!~"test|test1"}[5m] offset 5m)

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/bcd245d0-3315-f78c-209b-dec613d75cc5%40hoffmann-christian.info.


Re: [prometheus-users] Re: How to find newly added indicators

2020-10-17 Thread Christian Hoffmann
On 10/17/20 2:28 AM, li yun wrote:
> Because my data is collected every 5 minutes
If possible, try fixing that. The maximum sane scrape interval is 2m. If
nothing else helps, you may consider hiding the problem by using
recording rules (with sum_over_time) or changing --query.lookback-delta.

>, I need to use
> sum_over_time(isphone{name="user",exname~".*"}[5m] unless
> sum_over_time(isphone{name="user",exname~".*" }[5m], but there will be
> syntax errors; is there any better way

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/26c82460-d7a1-00f9-840e-9a21513b61de%40hoffmann-christian.info.


Re: [prometheus-users] Re: How to find newly added indicators

2020-10-14 Thread Christian Hoffmann
Hi,

On 10/13/20 12:22 PM, li yun wrote:
> For example, the following situation
> *isphone{name="user",exname~"13"}*
> This exname program will continuously collect a lot of monitoring
> indicators, but I want to know which indicators have been added in a
> certain period of time
> 在2020年10月13日星期二 UTC+8 下午6:19:15 写道:
> 
> Hello everyone, I have encountered some problems in the process of
> using prometheus. There are many monitoring indicators. I used the
> service discovery method. I need to know which indicators are newly
> added by service discovery in a certain period of time. May I ask
> this prometheus Can it be done?

You can ask Prometheus to return those metrics which do exist now and
haven't existed some time ago. Example for 1 hour:


isphone{name="user",exname~"13"} unless
isphone{name="user",exname~"13"} offset 1h

Note: It sounds like you might be putting rather dynamic data
(exname?) into labels. This may lead to cardinality issues. Be sure to
implement some safeguards and/or be aware of the potential resource
requirements.

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/e93aeffb-1d7b-175d-be04-405be5a526bd%40hoffmann-christian.info.


Re: [prometheus-users] Time drift between my browser and server

2020-10-14 Thread Christian Hoffmann
Hi,

On 10/13/20 9:06 AM, HENG KUAN WEE _ wrote:
> Therefore, I believe that no data being shown is due to the following
> error on the Prometheus UI: "*Warning!* Detected 5508.10 seconds time
> difference between your browser and the server. Prometheus relies on
> accurate time and time drift might cause unexpected query results." 

Certainly sounds like something which should be fixed. :)

Have you tried anything to analyze/fix this?

It essentially means that either your server's clock or your client's
clock is wrong. Try syncing both to a reliable time source.

On Linux you would typically use chronyd, ntpd or systemd-timesyncd for
that. Windows and OSX offer system settings for doing similar things, I
think.
If all else fails, it might temporarily work to set the clocks
manually. But really, NTP would be the proper solution. :)

Kind regards
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/9e4817cf-e5b3--42ae-616e3894065a%40hoffmann-christian.info.


Re: [prometheus-users] Monitor email incoming

2020-09-26 Thread Christian Hoffmann
Hi,

On 9/24/20 1:09 PM, Igor pember wrote:
> How do I use Prometheus to monitor an email system?
> For example, I want to monitor Gmail: I send an email to a test account
> every hour, and if the Gmail system can't receive messages, alerts should
> be raised.

You would need some mechanism for sending the mail and some exporter
which verifies that this (or at least "a") mail made it through.

I could think of sending the mail via a cronjob and trying one of the
existing Gmail or IMAP exporters to verify the count:

https://github.com/jamesread/prometheus-gmail-exporter
https://github.com/camptocamp/imap-mailbox-exporter

(I don't have experience with any of them)

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/38e6e098-3837-4089-f754-2d52b2d3ba21%40hoffmann-christian.info.


Re: [prometheus-users] Re: Exemplars: Jump from Grafana to Traces (a dream come true)

2020-09-26 Thread Christian Hoffmann
On 9/25/20 3:59 PM, 'Thomas Güttler' via Prometheus Users wrote:
> Is it already possible to let Prometheus export Exemplars (trace-ids)?

I think there has been some progress meanwhile, but I haven't seen
something externally usable yet.

Issues might give a good clue:

https://github.com/prometheus/prometheus/pull/6309
https://github.com/prometheus/prometheus/search?q=exemplars=issues

Kind regards,
Christian


-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/2a966b15-1d37-6dda-0dbc-6493966ba90a%40hoffmann-christian.info.


Re: [prometheus-users] different alert thresholds per service

2020-09-18 Thread Christian Hoffmann
Hi,

On 9/18/20 7:53 AM, Bhupendra kumar wrote:
> My question is how can configure Prometheus alert thresholds per service
> and as well as server.
> 
> Example: I have two machine 1 is (webserver 1) and 2 is (webserver 2).
> 
> server 1 alert receive after 5 minute
> server 2 alert receive after 10 minute

For different timings (e.g. for: or *_over_time(...[5m])) you will need
multiple alerts, I think.
You can either hardcode the servers in these alerts (e.g. only match on
instance=~"instanceA|instanceB" or instance!~"instanceA|instanceB") or
you can use this pattern to decide which servers/services apply to which
alert:

https://www.robustperception.io/using-time-series-as-alert-thresholds
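
A minimal sketch of the first approach (instance values and durations
are placeholders):

groups:
- name: per-server-alerts
  rules:
  - alert: InstanceDownFast
    expr: up{instance=~"webserver1.*"} == 0
    for: 5m
    labels:
      severity: critical
  - alert: InstanceDownSlow
    expr: up{instance=~"webserver2.*"} == 0
    for: 10m
    labels:
      severity: critical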

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/015a6676-35f7-3d9a-164d-4c68bac3b452%40hoffmann-christian.info.


Re: [prometheus-users] Prettifying and simplifying metrics/visualizations

2020-09-15 Thread Christian Hoffmann
On 9/15/20 10:55 AM, John Dexter wrote:
> I'm still finding my feet with Prometheus and one thing that is a bit
> awkward is that time-series names are pretty cumbersome. We want a
> customer-facing dashboard so let's say I want to monitor network activity:
> 
> rate(windows_net_packets_total[2m])
> 
> What is displayed is:
> 
> {instance="localhost:9182",job="node",nic="Local_Area_Connection__11"} 0
> {instance="localhost:9182",job="node",nic="isatap__..._"} 0
> {instance="localhost:9182",job="node",nic="vmxnet3_Ethernet_Adapter__2"}
> 14.411582607039099
> 
> If I push into Grafana I get the 3 time-series displayed and the sort of
> issues I face are:
> 
>   * instance=..., job=... is pretty verbose, I wish it just said
> 'localhost'. Is this possible somehow?
Yes, Grafana handles this. You can use Grafana templates in the Legend
field (e.g. {{instance}})¹.

If you want to get rid of the port as well, you might want to look into
modifying your instance label when scraping.

Example:
https://www.robustperception.io/controlling-the-instance-label

>   * I just want one time-series per machine, and I don't really want to
> have to hard-code nic-name in my YAML. Does PromQL let me
> aggregate over a specific label?
Yes, sure. Just choose the kind of aggregation, e.g.

sum by (instance) (... your query ...)

https://prometheus.io/docs/prometheus/latest/querying/operators/#aggregation-operators
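
Putting both together for your example (the legend format belongs into
Grafana's Legend field, not into the query itself):

  Query:  sum by (instance) (rate(windows_net_packets_total[2m]))
  Legend: {{instance}}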

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/1a2ea603-9c79-91ef-37ce-18ff2fdbdf8e%40hoffmann-christian.info.


Re: [prometheus-users] Query when executed with prometheus web interface, but doesn't work with Grafana

2020-09-14 Thread Christian Hoffmann
On 9/14/20 4:20 PM, dykow wrote:
>> Maybe promxy, Thanos, Cortex or VictoriaMetrics may be better solutions
>> for you.
> I've done some research and I don't understand how these tools are
> better than prometheus' native federation functionality, except HA and
> long retention capabilities.
> What problems of prometheus' native federation do the solve?
> 
> For example promxy:
> /"Promxy is a prometheus proxy that makes many shards of prometheus
> appear as a single API endpoint to the user."  /
> It sounds exactly like what federation can do.
Promxy would allow you to run your independent Prometheus instances
without having to forward/copy all data to a central instance, have
Grafana point to just one data source and still have a global view.

(I'm not saying this would solve your problem, it was just a side note
regarding the federation setup ;)).

Which version of Prometheus are you using? I'm asking because I think I
remember that the message you are seeing (duplicate series) has been
improved recently to include more specific details. Maybe this helps in
debugging.

Kind regards,
Christian


-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/67d53f58-0166-ffcb-c24b-c9de239bb32b%40hoffmann-christian.info.


Re: [prometheus-users] Query when executed with prometheus web interface, but doesn't work with Grafana

2020-09-14 Thread Christian Hoffmann
Hi,

On 9/14/20 1:58 PM, dykow wrote:
> I am querying vmware_exporter metrics with the following:
> (vmware_vm_mem_usage_average / on(host_name) group_left(instance,
> cluster_name, dc_name, monitoring_name) vmware_host_memory_max) * 100
>
> Prometheus web interface is returning metrics, but when I try to build
> a Grafana dashboard using that query I get:
> "found duplicate series for the match group {...} found duplicate
> series for the match group on the right hand-side of the operation:
> [..]; many-to-many matching not allowed: matching labels must be
> unique on one side"
> 
> Unfortunately I cannot share full error message. How is possible, that
> prometheus has no problem, but Grafana has?
If it is really the exact same query (you can confirm via the browser
developer tools, or via tcpdump if the traffic between Grafana and
Prometheus is unencrypted), then the likely reason is that Grafana runs
the query over a time range, and your problem may only manifest there.


> PS I am pulling data from "global" prometheus in federation, which
> collects all metrics from Prometheis down in hierarchy. (I know it's not
> recommended, but manager really want's to have all metrics on one chart.
> I suspect, that this kind of federation may be the root cause of my
> problem, but what can I do...)

Maybe promxy, Thanos, Cortex or VictoriaMetrics may be better solutions
for you.

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/2eaa18a8-012e-56a5-f2a9-a2c0c2221fc5%40hoffmann-christian.info.


Re: [prometheus-users] compression in prometheus

2020-09-01 Thread Christian Hoffmann
Hi,

On 9/1/20 2:50 PM, Rodolphe Ghio wrote:
> I am currently doing an internship and my tutor asked me to write an
> algorithm to compress prometheus data, what do you think about that, is
> it possible?

I think there are lots of resources regarding Prometheus' encoding,
which is already supposed to be highly efficient. I guess this would be
a good starting point. Lots of useful information can be found in docs,
issues and PromCon talks.

Some examples:
https://prometheus.io/docs/prometheus/latest/storage/
https://prometheus.io/blog/2016/05/08/when-to-use-varbit-chunks/
https://github.com/prometheus/prometheus/issues/5865
https://github.com/prometheus/prometheus/issues/5876
https://github.com/prometheus/prometheus/issues/478
https://promcon.io/2017-munich/talks/storing-16-bytes-at-scale/

Third party blogs:
https://medium.com/faun/victoriametrics-achieving-better-compression-for-time-series-data-than-gorilla-317bc1f95932

https://blog.timescale.com/blog/time-series-compression-algorithms-explained/

Long story short: Lots of optimizations have already been done, but
there are still more ideas floating around.

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/a467c995-e62c-7458-6838-3a0de1777266%40hoffmann-christian.info.


Re: [prometheus-users] /metrics endpoint not showing metrics scraped from applications

2020-08-29 Thread Christian Hoffmann
Hi,

On 8/28/20 10:08 PM, 'Rounak Salim' via Prometheus Users wrote:
> I'm unfamiliar with federation but it seems like it's mostly used for
> pulling data from multiple Prometheus instances into a single instance.
> Do all the scraped metrics show up on the /federate URL if federation is
> setup?
The federate endpoint can access all metrics from the Prometheus
instance. It's supposed to be used to grab all aggregated metrics, for
example (see the docs and the job:.* matcher). In theory, it can also
export all metrics.
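
For illustration, you can query the federate endpoint manually; the
match[] selector below grabs everything and is only meant as a quick
test (host and port are placeholders):

  curl -G 'http://localhost:9090/federate' --data-urlencode 'match[]={job=~".+"}'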

However, as soon as the data volume rises you may run into resource
problems on Prometheus, the network or DataDog. The API simply isn't
intended to be used like this.

Maybe DataDog can be used as a remote_write target instead?

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/3cd34a8e-32d1-2a9f-9533-54c93b1d44c4%40hoffmann-christian.info.


Re: [prometheus-users] Prometheus/Zookeeper version

2020-08-27 Thread Christian Hoffmann
Hi,

On 8/27/20 4:49 PM, My Me wrote:
> Are you saying that Prometheus doesn't use Zookeeper internally ?
I guess you are both right, somehow. :)

Prometheus does not require Zookeeper and it's not a core feature.
However, Prometheus can use Zookeeper for its service discovery. It
therefore contains a zookeeper client library.
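
For illustration, that integration is configured via
serverset_sd_configs in a scrape job (servers and paths below are
placeholders):

scrape_configs:
- job_name: 'zk-discovered'
  serverset_sd_configs:
  - servers:
    - 'zk1.example.com:2181'
    paths:
    - '/services/my-service'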

The following file might help as a pointer to which version is used:
https://github.com/prometheus/prometheus/blob/master/go.mod

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/130c0e86-3e50-9d8c-01a1-5b38d6fd08cf%40hoffmann-christian.info.


Re: [prometheus-users] Correlating different alerts to produce single alert

2020-08-26 Thread Christian Hoffmann
Hi,

On 8/27/20 2:19 AM, radhamani...@gmail.com wrote:
> If pods in two different namespaces go down,then we need to send a alert
> as an appA is down..
> Can I simply write expr as Kubepoddown_in_namespaceA and
> Kubepoddown_in_namespaceB ,and send alert message as "AppA is down"?
Yes, that's what I meant with the ALERTS metric. Have you tried it? You
can check those queries using the Prometheus UI (or API) before putting
them into alert rules.

For example, check
ALERTS{alertname="Kubepoddown_in_namespaceA",alertstate="firing"}.
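
As a sketch, a combined expression based on the alert names from your
pseudocode could look like this (the on() makes the set operation ignore
all labels when matching, which is usually what you want here):

  ALERTS{alertname="Kubepoddown_in_namespaceA", alertstate="firing"}
    and on()
  ALERTS{alertname="Kubepoddown_in_namespaceB", alertstate="firing"}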

> I just wrote the pseudocode of the expr but I want to know if this expr
> simply works with just AND operator..
My second suggestion still applies: It might be better or more
transparent if you could directly reference the relevant metrics.
However, to provide more specific guidance, we would need to know the
details of your existing two alerts.

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/1dbd7a6e-f277-a93c-0c23-f1747cbb018b%40hoffmann-christian.info.


Re: [prometheus-users] go_memstats_alloc_bytes

2020-08-26 Thread Christian Hoffmann
Hi,

On 8/26/20 1:27 PM, deepak...@gmail.com wrote:
> How can i pull the metrics related to a application from the server
> using Prometheus?

You would usually instrument the application itself (by using one of the
language clients such as client_python, client_golang, etc.). If this
isn't possible (e.g. because it's a proprietary application), you can
try to find other ways such as finding or writing an exporter which gets
the data via an API. If this isn't feasible either, you can try to go to
higher levels of abstractions: You can parse logs (mtail,
multilog_exporter, grok_exporter), scrape process or service details
(process_exporter, systemd_exporter, node_exporter's systemd collector)
or you can even just target the network service itself by using
blackbox_exporter.

This will give different levels of insights. Often, not all ways are
possible.

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/b6bdc90c-3a0a-48f2-bc0d-48f0bea2a8fa%40hoffmann-christian.info.


Re: [prometheus-users] /metrics endpoint not showing metrics scraped from applications

2020-08-26 Thread Christian Hoffmann
Hi,

On 8/27/20 3:16 AM, 'Rounak Salim' via Prometheus Users wrote:
> The /metrics endpoint only shows me the metrics for the Prometheus
> server and none of the metrics scraped by Prometheus from other
> applications are shown.
> 
> How can I get all my metrics to be shown in the /metrics endpoint?
I don't think you can. /metrics is supposed to be about the application
(Prometheus) itself, not about data from other servers.

The /federate endpoint would probably do what you want:
https://prometheus.io/docs/prometheus/latest/federation/

However, keep in mind that it's not intended to export arbitrarily large
amounts of data.
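
For completeness, a minimal federation scrape job on the receiving side
might look like this (the match[] selector and the target address are
placeholders):

scrape_configs:
- job_name: 'federate'
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
    - '{job=~".+"}'
  static_configs:
  - targets:
    - 'source-prometheus:9090'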

> This is causing problems getting our metrics consumed by DataDog agents
> since they are only able to pick up the Prometheus server metrics from
> the /metrics endpoint and I couldn't get it to work with any other
> endpoint on their config file. 
> Followed this documentation to get it setup with the Prometheus
> integration: https://www.datadoghq.com/blog/monitor-prometheus-metrics/
I don't know DataDog, but it seems like it is intended to scrape the
targets itself, without any Prometheus server in between?

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/20e07153-12c4-cc53-f69f-d3d04bb6bca6%40hoffmann-christian.info.


Re: [prometheus-users] Correlating different alerts to produce single alert

2020-08-26 Thread Christian Hoffmann
Hi,

On 8/26/20 11:00 PM, radhamani...@gmail.com wrote:
> I want to send alert by  doing some correlation based upon multiple
> alerts.For eg: if podA,podB,serviceA are all 100% down in two different
> namespaces(namespace1,namespace2),then  I want to send alert like
> ApplicationA is down. Is this possible? How to do the correlation
> between different alerts?

There is a special metric called ALERTS which exists for every active
(pending or firing) alert. You could cross-reference this.

However, there might be better ways which are more readable. I'm not
sure either if it would be a good idea to use ALERTS in other alerts. :)

In general this sounds like you could go with a simple up-based alert?

If you need to distinguish based on namespace, you might be looking for
the count or sum functions along with some by(namespace)?
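
As a sketch (the job name is an assumption, and this assumes your
targets carry a namespace label):

  count by (namespace) (up{job="my-app"} == 0) > 0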

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/1d5d9627-8cc1-de20-811b-cf9a855bb804%40hoffmann-christian.info.


Re: [prometheus-users] Alerts / Alert Manager

2020-08-26 Thread Christian Hoffmann
Hi,

On 8/20/20 12:47 PM, 'azha...@googlemail.com' via Prometheus Users wrote:
> I have 2 alerts 
> 
> - The first being to fire if CPU is more then 70% (WMI)
> 
> - The second  to report whether an instance is down
> 
> 100 - (avg by(instance) (rate(wmi_cpu_time_total{mode="idle"}[2m])) *
> 100) > 70
> 
>  
> 
> 
> up == 0  
> 
> 
> Post generating a CPU spike i can confirm that my client CPU is indeed 100% 
> 
> @echo off 
> :loop 
> goto loop 
> 
> however i get the second alert (up==0) firing and  reporting the
> instance is down despite it not being down. The strange thing is this is
> intermittent behavior as occasionally I do get the CPU firing alert
> instead of the instance down alert. 
>  
> 
> Im just wondering why when the CPU is clearly maxed out at 100% the 
> instance is reporting as down... and why sometimes this isn't the case.

So you are getting the Instance Down alert instead of the High CPU alert?

The up metric is special. It is generated by Prometheus itself and
always exists for anything which is a scrape target.

The fact that your CPU alert does not fire and that up == 0 probably
indicates that Prometheus fails to receive metrics from your
wmi_exporter. We may only speculate why that is. Maybe the load is so
high that the scrape times out?

You can check the Prometheus Web UI Targets page to see the last scrape
error for your target. If it is indeed a timeout ("deadline exceeded")
you could try increasing the scrape_timeout option to make Prometheus
wait longer for the exporter to reply.
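
For example (values are just an illustration; scrape_timeout must not
exceed scrape_interval):

scrape_configs:
- job_name: 'windows'
  scrape_interval: 1m
  scrape_timeout: 50s
  static_configs:
  - targets:
    - 'myhost:9182'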

Side note: If I remember correctly, the wmi_exporter has been renamed to
windows_exporter (along with the metrics). This might mean that you are
running an older version. Maybe updating helps if the newer version is
more performant (I don't know, just guessing).


Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/9e0a07ba-baf0-6a97-b652-06d7ecde1d17%40hoffmann-christian.info.


Re: [prometheus-users] Wal inclusion in retention.size

2020-08-26 Thread Christian Hoffmann
Hi,

On 8/21/20 10:59 AM, Venkata Bhagavatula wrote:
> In the https://prometheus.io/docs/prometheus/2.20/storage/ link,
> Following is mentioned regarding retention.size
> 
> |--storage.tsdb.retention.size|: [EXPERIMENTAL] This determines the
> maximum number of bytes that storage blocks can use (note that this does
> not include the WAL size, which can be substantial). The oldest data
> will be removed first. Defaults to |0| or disabled. This flag is
> experimental and can be changed in future releases. Units supported: B,
> KB, MB, GB, TB, PB, EB. Ex: "512MB" 
> 
> But where in Release notes of 2.15.0, see that it is mentioned that
> retention.size includes wal also. 
> [ENHANCEMENT] TSDB: WAL size is now used for size based retention
> calculation. #5886 
> 
> Does documentation needs correction? 

Yes, well spotted, I would assume the same. I guess a Pull Request would
be welcome. :)

I think this would be the repo/file to target:
https://github.com/prometheus/prometheus/blob/master/docs/storage.md

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/7c8a8ca7-5750-879a-0435-8db138f26a3d%40hoffmann-christian.info.


Re: [prometheus-users] blackbox probe : x509: certificate signed by unknown authority even with insecure_skip_verify set to true

2020-08-26 Thread Christian Hoffmann
Hi,

On 8/26/20 1:26 PM, Marion Guthmuller wrote:
> I'm trying to monitor a website with prometheus and blackbox exporter.
> Each of them are running inside a docker (images pulled from official
> docker hub https://hub.docker.com/r/prom/prometheus and
> https://hub.docker.com/r/prom/blackbox-exporter/).
> 
> My prometheus config:
> 
> prometheus_config.png
> 
> My blackbox config:
> 
> blackbox_config.png
> 
> *My issue:*
> 
> The target that I'm trying to monitor is using a self-signed
> certificate. That's why I tried to set insecure_skip_verify: true but it
> doesn't seem to handle it. I have the following error in the probe debug:
> 
> ts=2020-08-26T10:09:33.988405182Z caller=main.go:119 module=http_2xx
> target=https://example.com *level=error msg="Error for HTTP request"
> err="Get \"https://x.x.x.x\": x509: certificate signed by unknown
> authority"*

Can you confirm that you restarted blackbox_exporter after changing the
config? Comparing start time (ps) with config file modification time
(ls/stat) should be a good indicator.
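
For example (the config path is an assumption; inside the container you
may need a docker exec first):

  stat -c '%y' /path/to/blackbox.yml           # config file modification time
  ps -eo lstart,cmd | grep blackbox_exporter   # process start time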

I cannot think of any other reason why this wouldn't work. The config
looks right and the module name matches.

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/faf6f6f1-2c85-9726-dab1-d5c32066b8b8%40hoffmann-christian.info.


Re: [prometheus-users] Prometheus and metric_relabel_configs

2020-08-16 Thread Christian Hoffmann
Hi,

On 8/16/20 8:43 PM, Thomas Berger wrote:
> I already searched for it (HTTP API), but found no information about it.
> However, under localhost: 9090 (insert metric at cursor) all metrics are
> displayed.
> Above the Execute button, the input field says Expession,
> but i don't know how can formulate the query . No chance.
> Do you have any idea??

This is exactly the right place. The "Expression" field takes PromQL
syntax. You can just copy and paste what I suggested.

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/2cad44b8-d53b-4665-fca7-1fce87dd5498%40hoffmann-christian.info.


Re: [prometheus-users] Prometheus and metric_relabel_configs

2020-08-16 Thread Christian Hoffmann
Hi,

On 8/16/20 12:33 AM, Thomas Berger wrote:
> with the exporters i get a large number of metrics.
> However, I don't need all of them.
> I don't want to save these and remove them from the DB.
> This is described in the documentation with metric_relabel_configs.
> 
> Example to remove all go:. * metrics:
> 
>     metric_relabel_configs:
>     - source_labels: [__name__]
>   regex: (go_.*)
>   action: drop
> 
> My question:
> Can I test the regex beforehand?
> Is there a way to display all metrics by specifying this regex?

I think you should be able to test this via a regular PromQL query. The
metric name is stored in the magic label __name__:

{__name__=~"go_.*"}
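
To also get an idea of how many series such a rule would drop, you can
count them:

  count({__name__=~"go_.*"})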

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/9de5a5a5-158c-e4bb-e043-10f629ef5be6%40hoffmann-christian.info.


Re: [prometheus-users] Conditional routing to alertmanager

2020-08-13 Thread Christian Hoffmann
Hi,

On 8/13/20 7:56 PM, Johny wrote:
> i have a requirement  which requires routing only alerts for
> pagerduty to a specific alert manager mesh due to proxy set up. All
> other alerts are sent to default alertmanager. Is it possible to do
> this conditional routing in prometheus alerting configuration?
If you know which alerts are destined for pagerduty on the
Prometheus-side (e.g. because they already have some label), you could
drop them using alert_relabel_configs.

Otherwise, null-route the unneeded alerts as Brian suggested.

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/3fe6fa7b-b30a-8abf-56d8-ddfe1566a3e1%40hoffmann-christian.info.


Re: [prometheus-users] Grouping of alarms (group_interval, group_wait and repeat_interval)

2020-08-12 Thread Christian Hoffmann
Hi,

On 8/12/20 3:41 PM, rosaLux161 wrote:
> If alert 1 and alert 2 occur simultaneously or in a very short time,
> then only one alert should be sent out. If alert 2 only occurs after
> some time, then another alert should be sent. The latter does not work.
> If alert 2 occurs, nothing happens.
Hrm, that sounds unexpected. Could it be that OpsGenie is doing some
additional filtering/grouping?
Maybe try with a simpler receiver for testing, e.g. email?
You can also try checking the logs and/or Alertmanager metrics to see if
there are any problems with sending notifications.

Note: What you describe as "alert" is usually referred to as
"notification" in Alertmanager terms.

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/f3c7eccb-6bb6-8780-9fc5-c11163059153%40hoffmann-christian.info.


Re: [prometheus-users] Silenced alerts and web.hook receiver

2020-08-10 Thread Christian Hoffmann
Hi,

On 8/9/20 4:47 PM, Anthony Dakhin wrote:
> I'm new to Prometheus and currently trying to implement a proxy between
> Alertmanager and our monitoring system. I've configured web.hook
> receiver based on sample Alertmanager config, so I'm able to receive
> POSTs when alert is firing or resolved - so far so good.
> 
> If I create silence before an alert starts firing, receiver does not get
> any information about this alert - that's OK. But if a silence is
> created after an alert started firing, Alertmanager does not inform
> (web.hook) receiver about this silence in any way. I expected to receive
> some sort of notification, so I could supress the corresponding alert in
> my monitoring system.
> 
> Is this behaviour intenional? If so, is there any possible way to work
> around this problem?
I think this behaviour is intentional and I cannot think of any way to
circumvent this, besides extending Alertmanager.

The only thing which comes to mind would be observing (polling)
Alertmanager via its APIs in order to find any silenced alerts. This
should be possible, but would probably not really be that elegant.

https://github.com/prometheus/alertmanager#api
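
For example, the v2 API can be polled for alerts including silenced ones
(host and port are placeholders); each returned alert carries a status
object indicating whether it is currently suppressed and by which
silences:

  curl -s 'http://alertmanager:9093/api/v2/alerts?silenced=true'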

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/14562d52-6708-babd-d7d0-4b245f550f67%40hoffmann-christian.info.


Re: [prometheus-users] AlertManager: how the time in message generated?

2020-08-10 Thread Christian Hoffmann
Hi,

On 8/10/20 8:51 AM, leiwa...@gmail.com wrote:
> I pushed the alert message to wechat: 
> 
>   [Warning]:
> 2020-08-10 03:03:53
> description = zhaoqing : 1081.122236  
> 
> I want to know how the time I marked in red is generated and what it
> represents?
> It is  about 40 mins earlier than the time when the msg is received.

As far as I know, this should come from the Prometheus side and would be
part of the HTTP request to alertmanager.

There may be some delays between a firing alert in Prometheus and a sent
notification in Alertmanager (e.g. group_wait). Can you show your config?

A difference of 40 minutes sounds like a lot. Not sure, but another wild
guess: it could also be related to a flapping alert?

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/a5754bd9-a57c-881c-a522-4a2ea20c2a0f%40hoffmann-christian.info.


Re: [prometheus-users] alert label with variable for pushgateway metrics

2020-08-09 Thread Christian Hoffmann
Hi,

On 8/10/20 7:06 AM, Aravind Poojari wrote:
> We are facing an issue while writing alert rules for the above jobs &
> instances.
> We are unable to use a template so we have to write the alert rules for
> each and every job and their respective instances. It's kind of hard as
> instances keep on increasing every day. Following is an example of alert
> rule how we are using configuration
> 
> ##Alert rule config
> ---
>     - alert: HighCPU-Critical
>   expr:
> instance:node_cpu_utilization:ratio{job="node-exporter-test",
> instance="instance-two",  mode="idle"} > 0.90
>   for: 1m
>   labels:
>     severity: critical
>   annotations:
>     title: CPU use percent is extremely high on {{ $labels.instance
> }} for the past 10 minutes.
> 
>     - alert: HighCPU-Critical
>   expr:
> instance:node_cpu_utilization:ratio{job="node-exporter-test",
> instance="instance-two",  mode="idle"} > 0.90
>   for: 1m
>   labels:
>     severity: critical
>   annotations:
>     title: CPU use percent is extremely high on {{ $labels.instance
> }} for the past 10 minutes.
> --
> We have to repeat the same for every job and their instances. Let us
> know if we can use variables for all jobs & instance values may be
> something like this [I ain't sure].

What's your reason for repeating the rules for each server?

Prometheus does not have any concept of objects or servers (in contrast
to other monitoring systems). This means that you can design your
queries rather freely.

To be more specific: If you want to monitor all configured targets, just
drop the instance= label. If you want to continue explicitly listing
your targets there, you can still simplify by using regular expressions
(instance=~"instance1|instance2|instance3"). The same is true for
recording rules.
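
As a sketch based on the rule you posted, with the instance matcher
dropped so that one rule covers every target of that job:

    - alert: HighCPU-Critical
      expr: instance:node_cpu_utilization:ratio{job="node-exporter-test", mode="idle"} > 0.90
      for: 1m
      labels:
        severity: critical
      annotations:
        title: CPU use percent is extremely high on {{ $labels.instance }} for the past 10 minutes.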

Hope this helps.

Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/21fc61ed-daba-5983-b4f8-fad2cd67ce19%40hoffmann-christian.info.


Re: [prometheus-users] How can I Send Individual Alert for Different Servers on Different usage Criteria

2020-08-08 Thread Christian Hoffmann
Hi,

On 8/5/20 3:45 PM, Pachha Gopi wrote:
> Hi @Christian I am facing an issue that my alert manager is not sending
> the alerts at regular intervals. How can I configure my alert manager to
> send alerts to my slack?

Are you looking for the repeat_interval option, which defaults to 4h?

https://prometheus.io/docs/alerting/latest/configuration/
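
For example, in alertmanager.yml (4h is the default; lower it if you
want more frequent reminders for still-firing alerts):

route:
  receiver: slack
  repeat_interval: 4h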

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/5b333a2b-86b8-a5c6-a0d9-79c7fb50aaa3%40hoffmann-christian.info.


Re: [prometheus-users] How can I Send Individual Alert for Different Servers on Different usage Criteria

2020-08-05 Thread Christian Hoffmann
Hi,

On 8/5/20 2:40 PM, Pachha Gopi wrote:
> I am Using Prometheus for my Production Servers,my question is is there
> any way that we can configure different alerts for individual server.
> for example I have 3 Servers ,Server1 cpu usage is 20% ,Server 2 cpu
> usage is 30% and Server 3 cpu usage is 90% .Now I need to get alert if
> server 1 is above 50% and server 2 cpu usage is above 90% and server 3
> cpu usage is below 50%.is there anyway that i can create rules according
> my requirement for individual servers.Please help me out if anyone know
> how to do it.

Did you have a look at this pattern already?
https://www.robustperception.io/using-time-series-as-alert-thresholds

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/b5bdce0b-5fca-fd3c-cd8d-4fd616d6cf70%40hoffmann-christian.info.


Re: [prometheus-users] Transmission of (Database) Table Data to Prometheus

2020-08-05 Thread Christian Hoffmann
On 8/3/20 8:34 AM, 'Píer Bauer' via Prometheus Users wrote:
> Due to the fact that my query (in real world) contains several thousand
> rows of output, I would like to pursue a generic approach to avoid
> setting a separate PowerShell variable for each table cell data...
> 
> 
> But currently I don't know what data structure is necessary or how to
> parse my results (JOB_NAMES (column 1) and STATUS (column 2)) in order
> to trasmit them subsequently to Prometheus, so that Prometheus
> "understands" that there is not only one single value arriving
> simultaneously but mutltiple values (table data) simultaneously instead...
Not sure I understand completely, but I'll try:

If you are looking for a way to get the status of multiple jobs into
Prometheus in a dynamic way (i.e. rows), then this should be possible. I
assume you are looking for labels.

You could create output such as:
job_status{job_name="a"}

(If possible, it might make sense to choose another word than "job" as
this is a term which is already used in Prometheus in the form of
"scrape jobs").

If you are looking for a way to get multiple datapoints for the same job
(i.e. columns) into Prometheus, then this will get harder. Prometheus is
designed to scrape exactly one data point per metric per scrape.
Everything else would be considered backfilling, which is not yet ready
for use by regular users, as far as I understand.

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/71fdbc27-5d0f-54f2-0919-70e81a347a59%40hoffmann-christian.info.


Re: [prometheus-users] Alertmanager: resolved message received immediately after warning message even resolve_timeout is 5m

2020-08-05 Thread Christian Hoffmann
On 8/5/20 10:21 AM, leiwa...@gmail.com wrote:
> rules.yml
> groups:
> - name: network-delay
>   rules:
>   - alert: "network delay"
>     expr: probe_duration_seconds * 1000 > 3000
>     for: 1s
>     labels:
>       severity: warning
>       team: ops
>     annotations:
>       description: "{{$labels.instance}} : {{ $value }}"
> 
> alertmanager resolve_timeout is 5m
> 
> So after a warning is fired, in my opinion at least 5 mins later the
> resolved messge can be sent. But the resolved msg is sent immediately
> after the warning msg. why?

resolve_timeout is only relevant in very rare cases, at least when
Alertmanager is used together with Prometheus.

What's your reason for asking? Are you maybe looking for group_wait and
group_interval instead?

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/e0ba6677-9ed2-f20d-a392-7826bb6205ba%40hoffmann-christian.info.


Re: [prometheus-users] Prometheus memory issue

2020-08-05 Thread Christian Hoffmann
Hi,

On 8/4/20 12:24 PM, Vinod M V wrote:
> 
>           I am facing Memory usage with Prometheus service and
> Maintaining 30 days of data from Node exporter, Process exporter and JMX
> exporter for 95 servers in Prometheus Database. 
> 
>          Grafana and Prometheus are running on the same node. When
> loading node_exporter or JVM Grafana dashboards, memory usage will shoot
> up to more than 10 GB and sometime usage will hit max and Prometheus
> will be restarted. Moreover memory usage will not reduce even after
> closing JVM/Node exporter Grafana Dashboards. 
> 
>          Always need to restart Prometheus to release the memory. I am
> expecting 300 more servers to this configuration in future. 

95 servers don't sound like that much. However, it really depends on how
many metrics you are scraping from them. node_exporter should usually be
no problem. process_exporter's metric count may highly depend on your
configuration. jmx_exporter is probably what accounts for the largest
number of metrics. You can try to check if you can drop some of them
(e.g. by reworking the jmx_exporter config, or, if nothing helps, via
relabelling).

You may also want to look into your dashboards. Queries which load a lot
of metrics over longer timeframes might cause memory spikes. It can help
to set up recording rules which you can then use in your dashboard.
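
A sketch of such a recording rule which a dashboard could then query
instead of the raw series (rule and metric names are placeholders):

groups:
- name: dashboard-helpers
  rules:
  - record: instance:node_cpu_utilization:rate5m
    expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))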

> Can someone please suggest a solution for this memory issue ( or
> suggest a best database for save the metrics) and suggest a open source
> solution for High availability

Prometheus already contains a highly optimized TSDB. It can also be set
up as a high availability setup. Just run two identically configured
Prometheus instances in order to ensure that at least one stays
available. You can then either point Grafana to just one of them
(especially if you want to avoid dashboard queries overloading both
instances) or you could load balance across both of your Prometheus instances.

If you are looking for alternatives, you can try Thanos, Cortex or
VictoriaMetrics.


Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/abf47c43-12f7-975e-f711-4466ee2fb12b%40hoffmann-christian.info.


Re: [prometheus-users] Able to specify bind port, but not address

2020-08-05 Thread Christian Hoffmann
Hi,

On 8/4/20 3:23 PM, jumble wrote:
> Latest prometheus, on RHEL8.
> 
> Observed behavior: bound to |127.0.0.1:9090|

This sounds unexpected. Are you using the official binaries from
prometheus.io / github?

Can you share the exact logs from your experiments?

Is it possible that you've got multiple Prometheus installations, i.e.
one that is already running (bound to localhost) and the one where you
tried to run experiments?

I suggest checking the process list (ps aux | grep prometheus), watching
the logs for "address already in use" messages and/or trying a different
port.
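
For reference, an explicit bind to all interfaces and a check of what is
actually listening could look like this (9090 is the default port):

  ./prometheus --web.listen-address=0.0.0.0:9090
  ss -tlnp | grep 9090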

It would be really surprising if this was a bug in Prometheus, Go or
RHEL8. But who knows :)

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/4b0a3bb2-eebe-69ee-5608-feceeb207383%40hoffmann-christian.info.


Re: [prometheus-users] blackbox dns probe failed

2020-08-05 Thread Christian Hoffmann
Hi,

On 8/4/20 10:54 AM, e huang wrote:
> ts=2020-08-04T05:41:58.646Z caller=main.go:169
> module=dns_eboss.enmonster.com target=10.208.100.9 level=debug
> msg="Error while sending a DNS query" err="read udp4 10.208.100.
> 10:36709->10.208.100.9:53: i/o timeout"
> ts=2020-08-04T05:41:58.646Z caller=main.go:169
> module=dns_eboss.enmonster.com target=10.208.100.9 level=debug
> msg="Probe failed" duration_seconds=9.500409824
> 
> 
>   What did you do that produced an error?
> 
> Monitoring DNS resolution, with the configuration shown above.
> 
> 
>   What did you expect to see?
> 
> It should not report an error.
> 
> 
>   What did you see instead?
> 
> My DNS server runs dnsmasq. It has been load tested without problems,
> and the resolution logs from the time of the failed probes also look
> normal. Strangely, a DNS server built with coredns did not show this
> problem.


The log seems to say that there was a timeout. Is this issue
reproducible? It may help to set up tcpdump to capture the exact traffic
between blackbox_exporter and your DNS server.
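
A capture on the blackbox_exporter host could look like this (interface
and output filename are assumptions; the IP is taken from your log):

  tcpdump -ni any -w dns-probe.pcap host 10.208.100.9 and port 53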

Could there be any firewalls inbetween?

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/fcdd64f4-ec54-04e4-2f0c-8ebdab9ababe%40hoffmann-christian.info.


Re: [prometheus-users] How to prevent sending resolve notification after resolve_timeout?

2020-08-05 Thread Christian Hoffmann
Hi,

On 8/4/20 2:21 PM, shiqi chai wrote:
> Hey guys, I have a problem with the configuration of resolve_timeout.
> As I understand it, a resolved notification will be sent after the
> timeout. But the issue is actually still firing, which disturbs the
> correct resolved notification. How can I prevent it?

Not sure I understand. Are you using Prometheus with Alertmanager? If
so, the resolve_timeout option should only be relevant in very rare
cases as Prometheus will send the EndsAt field. See also the note in the
docs about it:
https://prometheus.io/docs/alerting/latest/configuration/

Can you explain in more detail how your setup looks, what you are doing,
what you are expecting and what happens instead?

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/26d13432-c046-df25-3fe8-593cd405d839%40hoffmann-christian.info.


Re: [prometheus-users] Remove orphan alert in Prometheus

2020-07-25 Thread Christian Hoffmann
Hi,

On 7/25/20 7:41 AM, jro...@gmail.com wrote:
> Somehow I ended with an alert from a Prometheus scrape job that was
> removed at some point and now I got this orphan alert that it's been
> triggered and being sent to the Alertmanager configured receiver. How
> can I remove this alert? Can I remove it or resolve it using the API? I
> tried the documentation and I did not find it. Please advise.

I don't think there are any ways for removing or resolving alerts using
the API. I don't even think there is any cache or something. I rather
suspect that your alert is still around.

Can you provide some more details about your alert rules, which rule
still fires unexpectedly and what service discovery you are using?

I guess you would have to delete an old alert rule or maybe an old
target (also check the Targets page in the web UI).

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/590e0f0e-314f-4414-fd6a-3e7cb5884709%40hoffmann-christian.info.


Re: [prometheus-users] How do I to determine if a single alert out of a set of grouped together alerts common label has a certain value?

2020-07-21 Thread Christian Hoffmann
On 7/16/20 11:51 PM, 'a z' via Prometheus Users wrote:
> I am unsure how within Prometheus/Alertmanager templating how I can
> check if one of the alert labels has a certain value.
In Alertmanager templates, you can use Go's template syntax which allows
for the "eq" function. You can see an example here:

https://github.com/prometheus/alertmanager/blob/master/template/default.tmpl#L4

If you are grouping by this label, you can access it using .GroupLabels:

https://prometheus.io/docs/alerting/latest/notifications/
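
A minimal sketch of both pieces combined, for use inside a notification
template (the label name "severity" and its value are assumptions):

  {{ if eq .GroupLabels.severity "critical" }}text for critical alerts{{ else }}text for everything else{{ end }}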

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/250e788c-11ad-d1e8-cf06-2d0aea65d226%40hoffmann-christian.info.


Re: [prometheus-users] node exporter process in defunct state, unable to restart

2020-07-21 Thread Christian Hoffmann
Hi,

On 7/21/20 10:01 PM, Lakshman Savadamuthu wrote:
> just FYI, there are few other hosts in this cluster, where node_exporter
> is running just fine without any issues.
> We have started the process using systemctl command, here is the service
> file:
> 
> # cat /etc/systemd/system/node_exporter.service
> 
> [Unit]
> 
> Description=Node Exporter
> 
> 
> [Service]
> 
> User=prometheus
> 
> ExecStart=/usr/local/bin/node_exporter --collector.filesystem
> --collector.netdev --collector.cpu --collector.diskstats
> --collector.mdadm --collector.loadavg --collector.time --collector.uname
> --collector.logind
> --collector.textfile.directory=/var/lib/node_exporter/textfile_collector
> --collector.systemd
> 
> 
> [Install]
> 
> WantedBy=default.target
> 
> [
^^^ This looks truncated somehow?

> Also here is the stack trace:
> 
> [root@mesosagent13 ~]# cat /proc/53547/stack
> 
> [] do_exit+0x6bb/0xa40
> 
> [] do_group_exit+0x3f/0xa0
> 
> [] get_signal_to_deliver+0x1ce/0x5e0
> 
> [] do_signal+0x57/0x6f0
> 
> [] do_notify_resume+0x72/0xc0
> 
> [] int_signal+0x12/0x17
> 
> [] 0x
> 
> [root@mesosagent13 ~]#

Sounds like a classic zombie process example. This means the parent
(i.e. systemd) is expected to clean this up.
Not sure how it can happen with systemd. Maybe try restarting it
(systemctl daemon-reexec).


Besides that, I suggest continuing the other tests such as running
node_exporter without systemd and with increased debug level. This
should be possible despite the Zombie process.

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/b53dbf6c-a74d-30eb-b4a0-3eee37cef8f7%40hoffmann-christian.info.


Re: [prometheus-users] node exporter process in defunct state, unable to restart

2020-07-21 Thread Christian Hoffmann
Hi,

On 7/21/20 9:34 PM, Lakshman Savadamuthu wrote:
> Thanks for the reply Christian.
> Looks like the node_exporter is in defunct state, i can't even stop the
> process now.
> 
> Here is the version:
> 
> [root@mesosagent13 ~]# /usr/local/bin/node_exporter --version
> 
> node_exporter, version 0.17.0 (branch: master, revision:
> 36e3b2a923e551830b583ecd43c8f9a9726576cf)
Meanwhile, the latest version is 1.0.1, so updating might be worth a try
(although I don't know of any fixes specific to your issue).

> [root@mesosagent13 ~]# ps -aef | grep node_exporter
> 
> root       8600  61971  0 12:31 pts/0    00:00:00 grep --color=auto
> *node_exporter*
> 
> prometh+  53547      1 20 Jun22 ?        6-02:57:16 [*node_exporter*]
> 
> 
> [root@mesosagent13 ~]#
> 
> Tried killing the process also using pkill -f option, that also didnt help.
Hrm, this usually sounds like the process invoking node_exporter has not
recognized the exit properly yet. Is this from the start using systemd?
Can you share the unit file?

Or is this from a manual start? Could it be that you had backgrounded
the process using "&" or using Ctrl+Z? If so, try foregrounding it (fg)
so that the shell can properly handle the exit.

You can try to look at what this process was last doing by running
cat /proc/53547/stack

But I suspect that it will not lead to anything useful.

I think this may just be a dead process table entry. If nothing helps,
you could reboot. In any case, this shouldn't prevent you from running
further tests (e.g. it should not block the listening port or anything).

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/9823e730-ac37-9979-cbfb-a9642291ea33%40hoffmann-christian.info.


Re: [prometheus-users] How can I check which specific version of node exporter was installed together with the deb package?

2020-07-21 Thread Christian Hoffmann
Hi,

On 7/21/20 8:54 PM, mordowiciel wrote:
> I've installed the 0.15.2+ds version of the prometheus-node-exporter deb
> package on Ubuntu 18.04. I was expecting that it would contain the
> prometheus-node-exporter version 0.15.2 too, but looking at the names of
> the exported metrics, I can see that installed node exporter must have
> the version >=0.16.0, due to the breaking changes in the naming of some
> popular metrics (ex. node_memory_Active_bytes -> node_memory_Active). 
> 
> Running /usr/bin/prometheus-node-exporter --version did not help me much
> - it just returned 
>      version 0.15.2+ds (branch: debian/sid, revision: 0.15.2+ds-1)
> 
> Is there a possibility to check which specific version of node exporter
> is installed in chosen deb package version?

This sounds more like a distro/packaging question. I'm no Ubuntu/Debian
specialist, but this may help:

This lists your version along with the specification file (*.dsc?):

https://packages.ubuntu.com/bionic/prometheus-node-exporter

The .dsc file references this git:

https://salsa.debian.org/go-team/packages/prometheus-node-exporter/-/commits/debian/0.15.2+ds-1

I guess you could work from either a diff (maybe via git command line
client) or via the commit list to work out the differences.

Maybe the changelog also helps. :)

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/d2e31060-2c47-33d4-8696-8df39c241517%40hoffmann-christian.info.


Re: [prometheus-users] node exporter process in defunct state, unable to restart

2020-07-21 Thread Christian Hoffmann
Hi,


On 7/21/20 8:58 PM, Lakshman Savadamuthu wrote:
[...]
> Jul 21 11:51:21 mesosagent13.xstackstage1.infosight.nimblestorage.com
> node_exporter[35895]: time="2020-07-21T11:51:21-07:00" level=info msg="
> - diskstats" source="node_exporter.go:104"
> 
> Jul 21 11:51:21 mesosagent13.xstackstage1.infosight.nimblestorage.com
> systemd[1]: *node_exporter.service: main process exited, code=exited,
> status=1/FAILURE*
Looks like node_exporter is exiting directly after starting, right?

Have you tried running node_exporter without systemd? You can also try
setting --log.level=debug so that we can have some more hints about the
possible issue.
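
For example, running it in the foreground under the service user (the
binary path is an assumption; adjust to wherever your node_exporter
lives):

  sudo -u prometheus /usr/local/bin/node_exporter --log.level=debug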

I suspect a failure in an early startup stage (parameter parsing, path
validation), but I would have expected an explicit log message about this.

If this issue is also reproducible when starting without systemd and the
debug log level does not lead to anything, I would try running it with
strace.

If this is a systemd-only issue, try verifying users/permissions and
maybe share your unit file.

What version of node_exporter is this?

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/6a116323-8b59-461a-d9cc-5536c25b8702%40hoffmann-christian.info.


Re: [prometheus-users] How to set the maximum number of alertmanager group_by

2020-07-18 Thread Christian Hoffmann
Hi,

On 7/17/20 5:26 AM, long wrote:
> When I set up group_by in alertmanager.yml, I have an alert manager with
> 25 alerts, but it will be split into 3 messages each with a maximum of
> 10 alerts. How do I set the maximum number of alerts in 1 message?

I haven't heard of an alertmanager-side limitation. Could it be that
this is done on your msteams integration side?

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/85338988-3d21-fb79-4c8c-d1e210b357b6%40hoffmann-christian.info.


Re: [prometheus-users] Prometheus AlertManager filter.

2020-07-14 Thread Christian Hoffmann
On 7/14/20 8:45 PM, Zhang Zhao wrote:
> Hi Christian,
> After I updated the config below, seems everything stopped feeding to
> ServiceNow even the ones with “inc:servicenow” label.. Any idea?

Hrm, your config looks fine to me. Can you show us an example alert
definition which is not routed correctly?

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/f872ba2a-3dba-e46c-fcdc-8123afff172c%40hoffmann-christian.info.


Re: [prometheus-users] Prometheus AlertManager filter.

2020-07-14 Thread Christian Hoffmann
Hi,

On 7/14/20 7:24 PM, Zhang Zhao wrote:
>> I added a filter in the alertmanager config so that only alerts that
>> contain "inc:servicenow" label are able to be fed to ServiceNow.
>> However it didn't work as expected. I still saw events that do not
>> contain this label getting fed to ServiceNow. Below was my config.
>> Please advice where was wrong. Appreciate it.
>>
[...]
>>   receiver: prometheus-snow
^^
>>   routes:
>>   - receiver: "prometheus-snow"
>> match:
>>   inc: servicenow

You are still setting prometheus-snow as your default receiver. If you
want to null-route everything which doesn't match, you can define a
receiver without any details (such as "devnull") and use that as a default.
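
A sketch of that first option (the receiver name "devnull" is
arbitrary):

route:
  receiver: devnull
  routes:
  - receiver: prometheus-snow
    match:
      inc: servicenow

receivers:
- name: devnull          # no integrations configured, so alerts routed here go nowhere
- name: prometheus-snow
  # ... your existing ServiceNow integration ...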

Another option would be dropping all irrelevant alerts from ever
reaching alertmanager by using alert relabelling.


Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/f8195f20-55ca-6385-f1d8-8e7d9f5af87f%40hoffmann-christian.info.


Re: [prometheus-users] Blackbox exporter soap check 500

2020-07-14 Thread Christian Hoffmann
Hi,

On 7/14/20 11:17 AM, Yusuf Dönmez wrote:
[...]
>   - source_labels: [__address__]
> target_label: __param_target
>   - source_labels: [__param_target]
> target_label: instance
>   - target_label: __address__
> replacement:
[...]
> - labels:
> module: magento1_x
>   targets:
>   - 'https://www._client_host_.com/api/v2_soap/'

The only thing I notice immediately is that you seem to set a label
which is identical to your blackbox_exporter module (module=magento1_x),
but I suspect that it is not passed to blackbox_exporter. Can you
confirm via the targets on the Prometheus Web UI? I suspect there is no
module=magento1_x.

If you don't need the module as a label in your metrics, I suggest
simply changing it to __param_module: magento1_x so that it is used
internally only.

If you do need it, I suggest adding a relabeling rule which copies the
value of the module label into the __param_module special label.
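
Roughly like this (untested sketch):

  relabel_configs:
  - source_labels: [module]
    target_label: __param_module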


If this is not the case, I could only suspect some problem with quoting
or some other headers. A tcpdump would help, but only if you are able to
run this without TLS or are able to intercept it somehow (mitmproxy or
Wireshark + TLS secret capturing via LD_PRELOAD or something). You could
also hack up blackbox_exporter to output the relevant headers if possible.

As far as I can see, the existing debug methods do not cover it.

I guess you've already tried what's listed here:

https://www.robustperception.io/debugging-blackbox-exporter-failures

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/f36bb881-0868-1a7c-6b3a-be7cac00ae85%40hoffmann-christian.info.


Re: [prometheus-users] Re: Storage Retention 15d for Prometheus

2020-07-13 Thread Christian Hoffmann
On 7/13/20 3:58 PM, Bhupendra kumar wrote:
> Yes pls check.
The last line looks like the restart was not successful. This might mean
that Prometheus is still running with an older cmdline. Can you check
the process list?

ps aux | grep prometheus or something? I suspect it might still say 15d.

If this is the case, I suggest debugging the restart problem. Looks like
you are using systemd. What unit are you using? Has Prometheus always
been started using the systemd unit?

Does it work reliably if you kill Prometheus once and start/restart it
using systemd?

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/2c29bc4a-93a9-9de7-5306-291fea3036c9%40hoffmann-christian.info.


Re: [prometheus-users] Sent resolved or inactive status (alertmanager)

2020-07-13 Thread Christian Hoffmann
Hi,

On 7/13/20 1:27 PM, Dmitry wrote:
> Hello!
> I have standard rule for prometheus alertmanager:
>   rules:
>   - alert: Instance_down
> expr: up == 0
> for: 1m
> # Labels - additional labels to be attached to the alert
> labels:
>   severity: 'critical'
> annotations:
>   title: 'Instance {{ $labels.instance }} down'
>   description: '{{ $labels.instance }} of job {{ $labels.job }} has been 
> down for more than 1 minute.'
> 
> I receive firing alert messages on my email and slack when instance
> falls. But i don't receive any messages when instance work is restored,
> and alert status changes from firing.
> Is there a way to sent a message when problem is resolved or alert comes
> from firing to inactive?

Can you confirm that you've set the send_resolved: true option in the
relevant email_configs and slack_configs blocks?
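
For reference, a minimal sketch (receiver name and addresses are made
up; the Slack api_url can come from the global config):

  receivers:
  - name: team-notifications
    email_configs:
    - to: oncall@example.com
      send_resolved: true
    slack_configs:
    - channel: '#alerts'
      send_resolved: true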

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/2b53f914-a74b-894f-7c0d-cc0a2360fdc8%40hoffmann-christian.info.


Re: [prometheus-users] Monitor process in RHEL 6

2020-07-07 Thread Christian Hoffmann
Hi,

On 7/7/20 7:49 PM, Krishnan Subramanian wrote:
> Hi, am looking at monitoring processes in RHEL 6 via node exporter.  For
> RHEL 7 and above i can use the node exporter --collector.systemd.  
> 
> I am looking at similar option in RHEL 6? is there a way possible?  

There are multiple process_exporter projects such as this one:
https://github.com/ncabatoff/process-exporter

I don't think node_exporter has any integrated feature for that (in
fact, even the systemd part is targeted to be moved to a dedicated
exporter).

On the other hand, one can hope that RHEL 6 will be gone by the end of
the year, when regular support ends... :)

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/9cc41847-cefc-3a63-fbff-c5d2ad6fb8f1%40hoffmann-christian.info.


Re: [prometheus-users] Defining Prometheus alerts with different thresholds per node

2020-07-02 Thread Christian Hoffmann
Hi,

On 7/2/20 2:17 AM, LabTest Diagnostics wrote:
> I've written some alerts for memory usage (for windows nodes) that look
> like this:
> 
> |
> expr:100*(windows_os_physical_memory_free_bytes)/(windows_cs_physical_memory_bytes)<70
> |
> 
> Currently, any server that exceeds 70% of available mem should give us
> an alert. This doesn't work for me as there are some nodes that
> consistently clock over 80% of the memory.
> 
> Is there a way to specify the threshold levels for alerts on a instance
> basis?

Yes, you can use time series as thresholds:
https://www.robustperception.io/using-time-series-as-alert-thresholds
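
As a rough, untested sketch (the instance value, job name and numbers
are only placeholders), the per-instance overrides could be recording
rules:

  groups:
  - name: memory-thresholds
    rules:
    - record: memory_free_percent_threshold
      expr: 10
      labels:
        instance: 'busy-server:9182'

The alert expression would then compare against the override and fall
back to 70 for all other instances:

  100 * windows_os_physical_memory_free_bytes / windows_cs_physical_memory_bytes
    < on(instance) group_left()
      (memory_free_percent_threshold
        or on(instance) (up{job="windows"} * 0 + 70))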

Kind regards
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/9ceec19b-ac17-2b07-f073-b810ef42a331%40hoffmann-christian.info.


Re: [prometheus-users] Re: Is it possible to use REST Service to provide data for prometheus

2020-07-02 Thread Christian Hoffmann
Hi,

On 7/2/20 8:15 AM, Thorsten Stork wrote:
> another question to this: How/where will the endpoint of my REST service
> configured, so prometheus will call ist it and get the values at the
> actual time ?

That's the scrape configuration.

https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config

The simplest form is the static service discovery:
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#static_config

The path defaults to /metrics, but can be changed (metrics_path).
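
A minimal sketch (host name and path are placeholders):

  scrape_configs:
  - job_name: 'middleware'
    metrics_path: /api/metrics
    static_configs:
    - targets: ['middleware.example.com:8080']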

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/78c8f914-cf03-0395-2a40-f6d8b89ea350%40hoffmann-christian.info.


Re: [prometheus-users] Is it possible to use REST Service to provide data for prometheus

2020-07-01 Thread Christian Hoffmann
Hi,

On 7/1/20 5:25 PM, Thorsten Stork wrote:
> I am really new to the prometheus topics, but should evaluate monitoring
> functionality 
> for a system.
> 
> On this system (middleware) we kan provide REST Service, http-Services
> or webservices which can deliver some metrics at the current time (no
> historic data).
> On this system we could not install exporter or other things like
> libraries or so.
> 
> How far I understand, Prometheus pulls the data from an exporter via
> REST oder http-call, is this right ?
Right, it's just a simple HTTP GET request which returns the metrics in
a defined format.

> So is the a definition how the service have to look like to simulate an
> "exporter" or my complete idea the wrong way ?
No need to "simulate". It's a completely valid way. :)

Depending on your environment and the programming language, it might be
easier to use one of the existing client libraries for that.

This documentation describes the format:
https://prometheus.io/docs/instrumenting/exposition_formats/
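
As a tiny, made-up example, the body of such a response could look
like this:

  # HELP queue_jobs_waiting Number of jobs currently waiting.
  # TYPE queue_jobs_waiting gauge
  queue_jobs_waiting{queue="incoming"} 42
  # HELP http_requests_total Requests handled since start.
  # TYPE http_requests_total counter
  http_requests_total{method="get"} 1027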

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/6c4889a1-348b-3dc6-2b73-1169bdf14f9c%40hoffmann-christian.info.


Re: [prometheus-users] Multiple remote_write

2020-07-01 Thread Christian Hoffmann
Hi,

On 7/1/20 4:44 PM, Ramachandra Bhaskar Ayalavarapu wrote:
> Is it possible for a single Prometheus to have multiple remote_write adapters 
> depending on jobs ?
> For example job1 should be writing to r1 (cortex) and r2 to another 
> remote_write instance ?
Although I'm not using it: As far as I understand, multiple remote write
configs are supported.
Each remote write config can have write_relabel_configs which allow you
to filter what to send. This works per remote-write target.

https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write

> Another problem I face is certain jobs only need to do remote_write and 
> certain jobs locally. Please let me know if that’s possible at job level
Yes, this should be possible using the same mechanism by filtering on
the job label.
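
A rough sketch (URLs are placeholders, untested):

  remote_write:
  - url: https://cortex.example.com/api/prom/push
    write_relabel_configs:
    - source_labels: [job]
      regex: job1
      action: keep
  - url: https://other.example.com/api/v1/write
    write_relabel_configs:
    - source_labels: [job]
      regex: job2
      action: keep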

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/a51cdeee-3983-853e-7853-1a7e77a28c1d%40hoffmann-christian.info.


Re: [prometheus-users] Is it possible to extract labels when generating AlertManager alert ?

2020-06-30 Thread Christian Hoffmann
Hi,

On 6/25/20 8:55 PM, Sébastien Dionne wrote:
> I have few java applications that I'll deploy in my cluster.  I need to
> know how can I detect if a instance is up or down with Prometheus. 
> 
> *Alerting with AlertManager*
> *
> *
> I have a alert that check for "instanceDown" and send a alert to
> AlertManager-webhook. So when one instance is down, i'm receiving alerts
> in my application.  
> 
> But how can I extract the labels that are in that instance ?
What do you mean by "in that instance"?

If the label is part of your service discovery, then it should be
attached to all series from that target. This would also imply that it
would be part of any alert by default unless you aggregate it away (e.g.
by using sum, avg or something).

If the label is only part of some info-style metric, you will have to
mix this metric into your alert.
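
As a rough sketch, assuming the label is exposed on an info-style
metric (my_app_info is a made-up name), the join could look like this:

  up == 0
    * on(instance) group_left(releaseUUIDGroup)
  my_app_info

The releaseUUIDGroup label would then be part of the alert's labels
and therefore also of the webhook payload.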

Can you share one of the relevant alert rules if you need more specific
guidance?

Note: I don't know how many releaseUUIDGroups you have, but having UUIDs
as label values might ring some alarm bells due to the potential for
high cardinality issues. :)

Kind regards,
Christian


> 
> ex : I have a special labels in all my application that link the pod to
> the information that I have in the database
> 
> releaseUUIDGroup=bf79b8ab-a7c1-4d27-8f3c-6e0f0a089c70
> 
> 
> there is a way to add that information in the message that AlertManager
> send ?
> 
> right now I configure AlertManager to send the alert to
> : 
> https://webhook.site/#!/815a0b0b-f40c-4fc2-984d-e29cb9606840/b0dd701d-e972-48d4-9083-385e6a788d55/1
> 
> for an example, I kill the pod : prometheus-pushgateway
> 
> and I received this message : 
> 
> {
>   "receiver": "default-receiver",
>   "status": "resolved",
>   "alerts": [
> {
>   "status": "resolved",
>   "labels": {
> "alertname": "InstanceDown",
> "instance": "prometheus-pushgateway.default.svc:9091",
> "job": "prometheus-pushgateway",
> "severity": "page"
>   },
>   "annotations": {
> "description": "prometheus-pushgateway.default.svc:9091 of job 
> prometheus-pushgateway has been down for more than 1 minute.",
> "summary": "Instance prometheus-pushgateway.default.svc:9091 down"
>   },
>   "startsAt": "2020-06-19T17:09:53.862877577Z",
>   "endsAt": "2020-06-22T11:23:53.862877577Z",
>   "generatorURL": 
> "http://prometheus-server-57d8dcc67f-qnmkj:9090/graph?g0.expr=up+%3D%3D+0=1",
>   "fingerprint": "1ed4a1dca68d64fb"
> }
>   ],
>   "groupLabels": {},
>   "commonLabels": {
> "alertname": "InstanceDown",
> "instance": "prometheus-pushgateway.default.svc:9091",
> "job": "prometheus-pushgateway",
> "severity": "page"
>   },
>   "commonAnnotations": {
> "description": "prometheus-pushgateway.default.svc:9091 of job 
> prometheus-pushgateway has been down for more than 1 minute.",
> "summary": "Instance prometheus-pushgateway.default.svc:9091 down"
>   },
>   "externalURL": "http://localhost:9093",
>   "version": "4",
>   "groupKey": "{}:{}"
> }
> 
> -- 
> You received this message because you are subscribed to the Google
> Groups "Prometheus Users" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to prometheus-users+unsubscr...@googlegroups.com
> .
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/prometheus-users/20ec33e0-e9bf-4f2a-b366-092743dad957o%40googlegroups.com
> .

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/f46ecd4e-0986-939f-a98e-def56d0f1fe9%40hoffmann-christian.info.


Re: [prometheus-users] prometheus is scraping metrics from an instance which has no exporter running on

2020-06-30 Thread Christian Hoffmann
Hi,


On 6/23/20 8:32 AM, Yashar Nesabian wrote:
> Hi,
> A few days ago I realized IPMI exporter is not running on one of our
> bare metals but we didn't get any alert from our Prometheus. Although I
> cannot get the metrics via curl on the Prometheus server, our Prometheus
> is scraping metrics successfully from this server!
> here is the Prometheus page indicating the Prometheus can scrap metrics
> successfully :
> 
> prom.png
> 
> 
> But when I SSH to the server, no one is listening on port 9290:
> 
> 
> 
> And I've checked the DNS records, they are correct (when I ping the
> address, it returns the correct address. Here is the curl result from
> the Prometheus server for one-08:
> |
> curl http://one-08.compute.x.y.z:9290
> curl: (7) Failed to connect to one-08.compute.x.y.z port 9290:
> Connection refused
> |
> 
> The weird thing is I can see one-08 metrics on the Prometheus server
> (for the moment):
> 
> prom1.png
> 
> 
> 
> I tried to put this job on another Prometheus server but I get an error
> on the second one claiming context deadline exceeded which is correct.

Could a DNS cache be involved?

Try comparing
getent hosts one-08
vs.
dig one-08
on the Prometheus machine.

You can also try tcpdump to analyze where Prometheus is actually
connecting.

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/c338d7d6-1e08-2ce6-10ce-64be1e416f3a%40hoffmann-christian.info.


Re: [prometheus-users] Is there a good grok user group? I need a pattern!

2020-06-30 Thread Christian Hoffmann
Hi,

On 6/24/20 11:14 AM, Danny de Waard wrote:
> Prometheus users,
> 
> Who of you knows a good grok site/group/knowledge base where i can
> figure out my pattern.
> I can not figure out how to get my ssl log good in grok.

Looks like this is used in Logstash, maybe you can ask there?

https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/28e3825b-87c5-4aa4-4320-d8bf8b332e7b%40hoffmann-christian.info.


Re: [prometheus-users] Issues with group_left to exclude specific label value

2020-06-30 Thread Christian Hoffmann
Hi,

On 6/25/20 4:05 PM, Al wrote:
> All hosts from which I collect node_exporter metrics each have an
> additional node_role metric (added via textfile collector) which
> identifies all the Chef roles a given host has.  As an example, say we
> have 3 hosts with the following textfile collector metrics:
> 
> *server1:*
> |
> node_role{role="web_server_app1"}
> |
> 
> *server2:*
> |
> node_role{role="redis_server"}
> |
> 
> *server3:*
> |
> node_role{role="db_server"}
> |
> 
> *server4:*
> |
> node_role{role="web_server_app2"}
> |
> 
> 
> 
> I'm attempting to write a PromQL query which will return the current
> disk usage % for hosts that do not have a specific role "web_server"
> assigned to them.  I've attempted the following PromQL query although
> it's invalid as we end up with many results on the right hand side,
> which doesn't match the many-to-one nature of the group left:
> 
> |
> 
> 100-(
> 
>    (node_filesystem_free_bytes{mountpoint=“”/data}*on
> (hostname)group_left(role)node_role{role!~“web_server.*”})
> 
>    /
> 
>    (node_filesystem_size_bytes{mountpoint=“”/data}*on
> (hostname)group_left(role)node_role{role!~“web_server”})
> 
>    *
> 
>    100
> 
> )
> 
> |
> 
> How could I modify this query so that it correctly return the disk usage
> percentage of server2 and server3?

The pattern basically looks fine.
Some remarks:

- Your quotation marks do not look like ASCII quotes -- I guess this
comes from pasting?
- The quotation marks around mountpoint= seem off (one should come after
/data).
- Your role regexp is not identical. The regexp in the second part lacks
the .* and will therefore match all your example servers.
- The role "join" can probably be omitted from the second part when
using on(instance, mountpoint)

In general I suggest trying the parts of your query individually and
only putting them into the larger query once both parts return what you
need.
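
Putting the remarks together, an untested sketch of the corrected
query (assuming node_role and the filesystem metrics share the
hostname label as in your query):

  100 - (
      (
        node_filesystem_free_bytes{mountpoint="/data"}
          * on(hostname) group_left(role) node_role{role!~"web_server.*"}
      )
    / on(instance, mountpoint) group_left()
      node_filesystem_size_bytes{mountpoint="/data"}
    * 100
  )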

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/9e2af627-d490-7eba-9738-faa6b72e1758%40hoffmann-christian.info.


Re: [prometheus-users] Merging too prometheus datasources on the same grafana dashboard

2020-06-30 Thread Christian Hoffmann
Hi,

On 6/29/20 11:43 AM, Daly Graty wrote:
> I got to grafana servers first one is monitoring kubernetes installed on
> the master the second is on a separate Vm both are pinging !
> I need to merge both of them in order to access them with the same URL
> I tried to added kubernetes prometheus ( my first Grafana server) as a
> data source on the second one but i got an ‘’ error gateway’’ !
> some help please !

I suggest looking at the Grafana logs. Also try accessing your data
source URL of the problematic Prometheus instance from your Grafana
server via curl. Maybe there is some firewall restriction in place (ping
is not sufficient, you will need tcp access on the relevant port [9090,
by default]).

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/453f1c40-18c7-6b6a-67e9-271e6dfdba8d%40hoffmann-christian.info.


Re: [prometheus-users] Custom Threshold for a particular instance.

2020-06-30 Thread Christian Hoffmann
Hi,

On 6/24/20 8:09 PM, yagyans...@gmail.com wrote:
> Hi. Currently I am using a custom threshold in case of my Memory alerts.
> I have 2 main labels for my every node exporter target - cluster and
> component.
> My custom threshold till now has been based on the component as I had to
> define that particular custom threshold for all the servers of the
> component. But now, I have 5 instances, all from different components
> and I have to set the threshold as 97. How do approach this?
> 
> My typical node exporter job.
>   - job_name: 'node_exporter_JOB-A'
>     static_configs:
>     - targets: [ 'x.x.x.x:9100' , 'x.x.x.x:9100']
>   labels:
>     cluster: 'Cluster-A'
>     env: 'PROD'
>     component: 'Comp-A'
>     scrape_interval: 10s
> 
> Recording rule for custom thresholds.
>   - record: abcd_critical
>     expr: 99.9
>     labels:
>   component: 'Comp-A'
> 
>   - record: xyz_critical
>     expr: 95
>     labels:
>   node: 'Comp-B'
> 
> The expression for Memory Alert.
> ((node_memory_MemTotal_bytes - node_memory_MemFree_bytes -
> node_memory_Cached_bytes) / node_memory_MemTotal_bytes * 100) *
> on(instance) group_left(nodename) node_uname_info > on(component)
> group_left() (*abcd_critical* or *xyz_critical* or on(node) count by
> (component)((node_memory_MemTotal_bytes - node_memory_MemFree_bytes -
> node_memory_Cached_bytes) / node_memory_MemTotal_bytes * 100) * 0 + 90)
> 
> Now, I have 5 servers with different components. How to include that in
> the most optimized manner?

This looks almost like the pattern described here:
https://www.robustperception.io/using-time-series-as-alert-thresholds

It looks like you already tried to integrate the two different ways to
specify thresholds, right? Is there any specific problem with it?

Sadly, this pattern quickly becomes complex, especially if nested (like
you would need to do) and if combined with an already longer query (like
in your case).

I can only suggest to try to move some of the complexity out of the
query (e.g. by moving the memory calculation to a recording rule instead).
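
For example (the rule name is made up):

  groups:
  - name: memory
    rules:
    - record: instance:memory_usage:percent
      expr: |
        (node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Cached_bytes)
          / node_memory_MemTotal_bytes * 100

The alert expressions could then reference instance:memory_usage:percent
instead of repeating the calculation.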

You can also split the rule into multiple rules (with the same name).
You will just have to ensure that they only ever fire for a subset of
your instances (e.g. the first variant would only fire for
compartment-based thresholds, the second only for instance-based
thresholds).

Hope this helps.

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/2565fb74-b5ab-26a9-7656-8b81eeb277ff%40hoffmann-christian.info.


Re: [prometheus-users] Prometheus timeseries and table panel in grafana

2020-06-30 Thread Christian Hoffmann
Hi,

On 6/23/20 5:43 PM, neel patel wrote:
> I am using prometheus and grafana combo to monitor PostgreSQL database.
> 
> Now prometheus stores the timeseries as below.
> 
> disk_free_space{file_system="/dev/sda1",file_system_type=“xfs”,mount_point="/boot",server=“127.0.0.1:5432”}
> 9.5023104e+07
> disk_free_space{file_system="/dev/sda1",file_system_type=“xfs”,mount_point="/boot",server=“127.0.0.1:5433”}
> 9.5023104e+07
> disk_free_space{file_system="/dev/sda3",file_system_type=“xfs”,mount_point="/",server=“127.0.0.1:5432”}
> 2.8713885696e+10
> disk_free_space{file_system="/dev/sda3",file_system_type=“xfs”,mount_point="/",server=“127.0.0.1:5433”}
> 2.8714070016e+10
> disk_free_space{file_system=“rootfs”,file_system_type=“rootfs”,mount_point="/",server=“127.0.0.1:5432”}
> 2.8713885696e+10
> disk_free_space{file_system=“rootfs”,file_system_type=“rootfs”,mount_point="/",server=“127.0.0.1:5433”}
> 2.8714070016e+10
> 
> How to plot Table panel in grafana using above metrics. I can plot the
> time series using line chart but how to represent above data in Table in
> grafana as it is timeseries ? Any pointers will be helpful.

Are you looking for the table panel?

https://grafana.com/docs/grafana/latest/panels/visualizations/table-panel/

Note: This mailing list is primarily targeted at Prometheus. Although
many Prometheus users are also Grafana users, there may be better
Grafana-focused support channels elsewhere (I don't know). :)

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/7a3934fb-41ab-00f6-a4c6-01ec2b735fca%40hoffmann-christian.info.


Re: [prometheus-users] disk speed

2020-06-30 Thread Christian Hoffmann
Hi,

On 6/23/20 4:45 PM, 'Metrics Searcher' via Prometheus Users wrote:
> Does anyone know how to collect the disk speed, like I can do it via
> hdparm or dd?

I don't know of a standard solution for this. Also, your examples are
benchmark-style measurements which cannot be collected passively and
continuously like other system metrics (e.g. disk throughput based on
normal usage).

You can still set up a small cronjob to run such benchmarks and place
the results in a .prom file for the node_exporter's textfile collector.

For background/examples, see this blog post and the three linked articles:
https://www.robustperception.io/atomic-writes-and-the-textfile-collector

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/3146ddb5-af8f-5616-8a24-2383cb9bb1d7%40hoffmann-christian.info.


Re: [prometheus-users] Job label in file-based SD

2020-06-29 Thread Christian Hoffmann
Hi,

On 6/26/20 10:33 AM, Björn Fischer wrote:
> I was going through the guide for file-based service discovery [1] and
> noticed that they are setting the job label in the targets file. That
> doesn't make sense to me. Targets are not strictly job-specific and
> Prometheus is setting the job label anyways. Can someone think of a
> reason to set the job label explicitly here?

Yes, Prometheus will set a job label anyway. I assume it was simply
chosen as an example for setting a label there without inventing a
non-default label.

While I think there are cases where overriding the job label makes sense
[1], I don't think it is too common.

So, I agree that this example is a bit confusing. Maybe a Pull Request
to improve this would be welcome?


Kind regards,
Christian


[1] For example, we want all node_exporter jobs to have the job="node"
label, but we have to make them multiple jobs because of different
proxy_url settings. Therefore, we have node-proxy1, but explicitly set
job="node" as a label.

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/2eafb501-70f8-25f3-d98a-634754f19ec2%40hoffmann-christian.info.


Re: [prometheus-users] Alert handling using alertmanager even handler .

2020-06-29 Thread Christian Hoffmann
Hi,

On 6/30/20 7:51 AM, Pooja Chauhan wrote:
> Hi Christian,
> Can u pls gve me the official document link which you are referring.

This is the official documentation outlining the alert rule syntax:

https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/2635fcc6-12f7-a4c0-ccd3-1e24d51019e2%40hoffmann-christian.info.


Re: [prometheus-users] Prometheus Query/alerting rules related to NFS Detach mount using node-exporter mountstats and nfs collector does not work

2020-06-29 Thread Christian Hoffmann
Hi,

On 6/26/20 12:50 PM, Satyam Vishnoi wrote:
> I should get alert when below given /mapr nfs mount-point get detached .
> 
> 
> I am using following 2 metrics provided by node-exporter collector
> mountstats and nfs .
> 
> 
> Query-1 absent(node_filesystem_size_bytes {
> job="exporter_node",fstype="nfs" ,mountpoint="/mapr",instance=~".*:9100"
> }) ==1
> 
> 
> Query-2 rate(node_mountstats_nfs_age_seconds_total
> {export="localhost:/mapr"}[5m]) < 1
> 
> 
> My first query is providing me desired output but i have to hard-code
> instance value on it. the regex used above does not work. My second
> query with rate doesn't give me a stable value to put on alert .

Right, absent will only work if you know what to look for. However,
maybe Prometheus already has the required information in the form of
other series?

If you want to alert if any of your instances lacks a specific
mountpoint, you can try this:

  up == 1 unless on(instance)
node_filesystem_size_bytes{fstype="nfs",mountpoint="/mapr"}

This selects all online nodes [1] and discards all nodes which do have
the mountpoint. What's left is all online nodes without the mountpoint,
which should be exactly what you were after.

Kind regards,
Christian


[1] You can also skip the == 1 part to check all nodes. However, it
usually makes little sense to place such an alert for offline nodes as I
assume you will alert on up != 1 anyway.

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/3b8c36f9-91b3-72db-ad31-45186e06e194%40hoffmann-christian.info.


Re: [prometheus-users] Alert handling using alertmanager even handler .

2020-06-29 Thread Christian Hoffmann
Hi,

On 6/28/20 3:23 PM, Pooja Chauhan wrote:
> I want to handle alerts like jenkins process down using alertmanager
> even handler. But the document is not helping me with how to configure
> it . Really need help on from where to download this
> :https://github.com/jjneely/am-event-handler  and how to start with
> it.The documentation is not really enough for me .I tried searching for
> examples but couldn't fine.Please help me with some examples also.

This tool does not seem to be that popular. Maybe you can outline what
you have tried, what you intend to do and what doesn't work yet as expected?

This will increase chances that someone can help.

Note: The documented alert rule syntax is outdated. You will have to use
the modern, yaml-based syntax from the official docs for this to work
(promtool bundled with older Prometheus versions will also be able to
convert such rules).
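
For reference, a minimal rule in the modern syntax could look roughly
like this (job name, duration and labels are placeholders):

  groups:
  - name: jenkins
    rules:
    - alert: JenkinsDown
      expr: up{job="jenkins"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Jenkins is down on {{ $labels.instance }}"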

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/fe82ca2b-364f-9d96-41b7-fc65201e5a66%40hoffmann-christian.info.


Re: [prometheus-users] Prometheus 2.18 incompatibility with 2.04

2020-06-20 Thread Christian Hoffmann
On 6/20/20 5:31 PM, Johny wrote:
> If it is non-compliant endpoint, the problem should appear in both
> versions, isn't it? It is effecting more than one series. The set up is
> in corporate org so I cannot expose end points publicly.
Maybe you can build a small reproducer: Grab your metrics via curl, set
up a webserver to serve the file and let a fresh Prometheus instance
scrape it. If the problem no longer occurs, this would be a chance to
look for differences.
If it does occur, try obfuscating the data as needed and providing the
obfuscated data points so that someone can look into it.

> I have an prometheus front end instance that remote reads from multiple
> prometheus backends. the time series is sharded across multiple
> backends. The results are also inconsistent in 2.18.1. Sometimes I get
> fewer time series back but what is consistent is the last data point is
> duplicated on all time series. Just switching front end to 2.4 with same
> configuration file fixes the problem.
As you are using remote read, try updating to at least 2.18.2 as Brian
suggested.

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/43a63d12-3f46-6b24-c26a-b3a3ca2ece11%40hoffmann-christian.info.


Re: [prometheus-users] Prometheus javamelody

2020-06-19 Thread Christian Hoffmann
Hi,

On 6/16/20 12:09 PM, Shivam Soni wrote:
> I got an issue in Prometheus configure to java melody.
> can anyone solve this?
> plz check URL:
> 
> https://github.com/prometheus/prometheus/issues/7404

I'm seeing a small but maybe relevant difference between your Prometheus
config and your test case: The Prometheus config uses "localhost" while
your test uses a hostname. This may not matter network-wise if your
service is bound to both these addresses. However, it may matter
HTTP-wise as these can be interpreted as different virtual hosts by the
target web servers.

I suggest trying to use the hostname instead of localhost in Prometheus.

If that doesn't work either, it might be some more buried difference
(e.g. headers), although that would sound like an uncommon reason for a
404 error. A tcpdump/Wireshark comparison of both requests might help there.

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/938b7e60-aa50-020a-c8a5-dc9cc4611839%40hoffmann-christian.info.


Re: [prometheus-users] Alertmanger "Not Grouped" alerts

2020-06-19 Thread Christian Hoffmann
Hi,

On 6/19/20 8:34 AM, Romenyrr wrote:
> I've come across this issue where I'm grouping by 'alertname' but
> nothing is being grouped except for one odd group. When I click on the
> group tab and click on "Enable custom grouping" that seems to sort
> everything by 'alertname'. 
> 
> This grouping issue is creating an issue where I'm just getting 1 big
> alert in Opsgenie with 74 items in it. Has anyone come across this before? 
> 
> 2765A9DC-B163-4D2B-A83B-07D34E71A66F.jpeg
> 
> Here's the ouput of Status > Config from the Alertmanager UI
> 
> |
> 
> route:
>   receiver: opsgenie
>   routes:
>   - receiver: opsgenie
>     group_by:
>     - alertname
>     match_re:
>       severity: warning|critical
>     group_wait: 10s

Can you confirm that all of your affected alerts have a severity label
of either warning or critical?

All others will probably be handled by the default route where you don't
have a group_by config. Maybe you intended to place it there instead?

If receiver and everything else is identical, I don't think you should
have child routes (routes:) at all. Just place it at the top-level route
instead.
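
For example, based on your config above (sketch):

  route:
    receiver: opsgenie
    group_by: ['alertname']
    group_wait: 10s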

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/06a55961-180c-2a17-f1ea-fda0a0e2fa26%40hoffmann-christian.info.


Re: [prometheus-users] Error while writing an alert rule in alert.yml file

2020-06-19 Thread Christian Hoffmann
Hi,

On 6/19/20 4:20 PM, Isabel Noronha wrote:
>  This is just code snippet of my alerts.yml file
> - alert: ContainerKilled
>     expr: IF absent(((time() - container_last_seen{name=".+"}) < 5))
>     for: 15s
>     labels:
>       severity: warning
>     annotations:
>       summary: "Container killed (instance {{ $labels.instance }})"
>       description: "A container has been killed\n  VALUE = {{ $value
> }}\n  LABELS: {{ $labels }}"
> 
> Could anyone tell me what am I doing wrong in this rule?
> It's throwing error :parse error: unexpected identifier \"absent\""

The "IF" doesn't belong there. Try removing it. :)
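
The expr line would then read:

    expr: absent(((time() - container_last_seen{name=".+"}) < 5))

Note that name=".+" is an exact string match; if you meant a regular
expression there, that would be name=~".+".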

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/7e03929a-ac68-d56e-4a14-1bfacdc329d5%40hoffmann-christian.info.


Re: [prometheus-users] Prometheus AlertManager Alert Grouping

2020-06-18 Thread Christian Hoffmann
On 6/18/20 3:00 AM, Zhang Zhao wrote:
> Hi, I have a question for alert grouping in AlertManager. I integrated
> Prometheus Alerts to ServiceNow via Webhook.  I see the events were
> captured on ServiceNow side as below. However, inside each of events
> below, there were multiple alerts included. Is there a way to break it
> off so that one alert from Prometheus corresponds to one event on
> ServiceNow? I tried to group_by by alertname and status, but it didn't
> work as expected. Seems have to add other condition in group_by setting.
> Thanks.
> image.png

Sounds like you are looking for the magic
group_by: ['...']

option. :)

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/6f9b1bfd-09f5-cfdf-1fda-f462a1ef9641%40hoffmann-christian.info.

