[prometheus-users] Defining Prometheus alerts with different thresholds per node

2020-07-01 Thread LabTest Diagnostics
I've written some alerts for memory usage (for Windows nodes) that look 
like this:

expr: 100 * (windows_os_physical_memory_free_bytes) / (windows_cs_physical_memory_bytes) < 70

Currently, any server that crosses the 70% threshold should give us an 
alert. This doesn't work for me, as there are some nodes that consistently 
clock in at over 80% memory usage.

Is there a way to specify the threshold levels for alerts on a per-instance 
basis? 
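
A sketch of one way to do this (not from the thread; the instance regex, rule names and thresholds are placeholders): split the alert into two rules whose instance matchers partition the nodes, each with its own threshold.

```yaml
groups:
  - name: windows-memory
    rules:
      # Default threshold for every node except the hypothetical "busy" ones.
      - alert: WindowsLowFreeMemory
        expr: |
          100 * windows_os_physical_memory_free_bytes{instance!~"busy-node-.*"}
              / windows_cs_physical_memory_bytes < 70
        for: 10m
      # Different threshold only for the nodes that normally run hotter.
      - alert: WindowsLowFreeMemoryBusyNodes
        expr: |
          100 * windows_os_physical_memory_free_bytes{instance=~"busy-node-.*"}
              / windows_cs_physical_memory_bytes < 10
        for: 10m
```

Another common pattern is to expose a per-instance threshold as its own time series (via recording rules or a small file-based exporter) and compare against it with `on(instance) group_left`, so the alert expression doesn't hard-code instance names.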

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/44ec9291-83fe-441f-9d4a-5e52f05c6649o%40googlegroups.com.


[prometheus-users] Node-exporter does not provide metrics from all filesystems

2020-07-01 Thread Eduardo Cardoso
**RancherOS Version: (ros os version)**

v1.5.5

**Where are you running RancherOS? (docker-machine, AWS, GCE, baremetal, 
etc.)**

VMware

Why does node-exporter not collect metrics from the additional filesystems in 
RancherOS?

# RancherOS version

cat /etc/os-release
```
NAME="RancherOS"
VERSION=v1.5.5
ID=rancheros
ID_LIKE=
VERSION_ID=v1.5.5
PRETTY_NAME="RancherOS v1.5.5"
HOME_URL="http://rancher.com/rancher-os/"
SUPPORT_URL="https://forums.rancher.com/c/rancher-os"
BUG_REPORT_URL="https://github.com/rancher/os/issues"
BUILD_ID=
```

# mounts configuration
```
ros config set mounts 
'[["/dev/sdb1","/var/lib/user-docker","ext4",""],["/dev/sdb2","/data","ext4",""],["/dev/sdb3","/var/lib/kubelet","ext4",""],["/dev/sdb4","/opt/rke","ext4",""]]'
```
# filesystems 

```
df -h 

Filesystem  Size  Used Avail Use% Mounted on
overlay 4.7G  1.9G  2.6G  43% /
devtmpfs 16G 0   16G   0% /dev
tmpfs16G 0   16G   0% /sys/fs/cgroup
/dev/sda1   4.7G  1.9G  2.6G  43% /mnt
none 16G  1.1M   16G   1% /run
shm  64M 0   64M   0% /dev/shm
/dev/sdb198G  600M   93G   1% /var/lib/user-docker
/dev/sdb249G   53M   47G   1% /data
/dev/sdb3   9.8G   37M  9.3G   1% /var/lib/kubelet
/dev/sdb4   9.8G   37M  9.3G   1% /opt/rke
```

# node-exporter service configuration
```
node_exporter:
  image: prometheus/node-exporter:v0.18.1
  command: |
    --path.procfs=/host/proc
    --path.sysfs=/host/sys
    --collector.filesystem.ignored-fs-types=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$$
    --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+)($$|/)
    --collector.ksmd
    --collector.filesystem
    --collector.mountstats
    --path.rootfs=/host/root
    --web.listen-address=0.0.0.0:9100
    --log.level=debug
  privileged: true
  labels:
    io.rancher.os.scope: "system"
    io.rancher.os.after: "docker"
    io.rancher.os.detach: "true"
  ports:
    - "9100:9100/tcp"
  volumes:
    - /proc:/host/proc:ro
    - /sys:/host/sys:ro
    - /:/host/root:ro
  restart: unless-stopped
```
# metrics

```
curl localhost:9100/metrics | grep node_filesystem_free_bytes

# HELP node_filesystem_free_bytes Filesystem free space in bytes.
# TYPE node_filesystem_free_bytes gauge
node_filesystem_free_bytes{device="/dev/sda1",fstype="ext4",mountpoint="/"} 2.98029056e+09
node_filesystem_free_bytes{device="none",fstype="tmpfs",mountpoint="/run"} 1.6853835776e+10
```

# problem

I would like to receive metrics for the filesystems: 

/data 
/var/lib/user-docker
/var/lib/kubelet
/opt/rke 

but I only have metrics from filesystem:

/
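
A hedged guess (not confirmed in the thread): the node-exporter system container only sees the mounts that existed in its mount namespace when it started, so ext4 filesystems mounted later via `ros config set mounts` never show up under `/host/root`. A minimal sketch of the volumes section with slave mount propagation, assuming RancherOS system services accept the standard Docker bind-mount options:

```yaml
node_exporter:
  image: prometheus/node-exporter:v0.18.1
  volumes:
    - /proc:/host/proc:ro
    - /sys:/host/sys:ro
    # rslave propagation so filesystems mounted on the host after this system
    # container starts become visible under /host/root; restarting the
    # node_exporter service after changing the mounts has a similar effect.
    - /:/host/root:ro,rslave
```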





-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/7d825700-5d4b-44b2-9908-8478517a56cdo%40googlegroups.com.


Re: [prometheus-users] Prometheus wal folder and memory usage on startup

2020-07-01 Thread Viktor Radnai
Hi Julien,

Thanks for clarifying that. In that case I'll see if the issue will recur
with 2.19.2 in the next few weeks.

Vik

On Wed, 1 Jul 2020 at 19:08, Julien Pivotto 
wrote:

> Once 2.19 is running, it will create an mmapped head, which will improve that.
>
> I agree that starting 2.19 with a 2.18 wal won't make a change.
>
> Le mer. 1 juil. 2020 à 19:55, Viktor Radnai  a
> écrit :
>
>> Hi again Ben,
>>
>> Unfortunately upgrading to 2.19.2 does not solve the startup issue.
>> Prometheus gets OOMKilled before even starting to parse the last 25
>> segments which represent the last 50 minutes worth of data. Based on this
>> the estimated memory requirement should be somewhere between 60-70GB but
>> the worker node only has 52GB. The other Prometheus pod currently consumes
>> 7.7GB.
>>
>> The left of the graph is 2.18.1, the right is 2.19.2. I inadvertently
>> reinstated a previously set 40GB memory limit and updated the replicaset to
>> increase it back to 50GB -- this is the reason for the second Prometheus
>> restart and the slightly higher plateau for the last two OOMs.
>>
>> Unless there is a way to move some WAL segments out and then restore them
>> later, I'll try to delete the last 50 minutes worth of segments to get the
>> pod to come up.
>>
>> Thanks,
>> Vik
>>
>> On Wed, 1 Jul 2020 at 16:39, Viktor Radnai 
>> wrote:
>>
>>> Hi Ben,
>>>
>>> We are running 2.18.1 -- I will upgrade to 2.19.2 and see if this solves
>>> the problem. I currently have one of the two replicas in production
>>> crashlooping so I'll try to roll this out in the next few hours and report
>>> back.
>>>
>>> Thanks,
>>> Vik
>>>
>>> On Wed, 1 Jul 2020 at 16:32, Ben Kochie  wrote:
>>>
 What version of Prometheus do you have deployed? We've made several
 major improvements to WAL handling and startup in the last couple of
 releases.

 I would recommend upgrading to 2.19.2 if you haven't.

 On Wed, Jul 1, 2020 at 5:06 PM Viktor Radnai 
 wrote:

> Hi all,
>
> We have a recurring problem with Prometheus repeatedly getting
> OOMKilled on startup while trying to process the write ahead log. I tried
> to look through Github issues but there was no solution or currently open
> issue as far as I could see.
>
> We are running on Kubernetes in GKE using the prometheus-operator Helm
> chart, using Google Cloud's Preemptible VMs. These VMs get killed every 24
> hours maximum, so our Prometheus pods also get killed and automatically
> migrated by Kubernetes (the data is on a persistent volume of course). To
> avoid loss of metrics, we run two identically configured replicas with
> their own storage, scraping all the same targets.
>
> We monitor numerous GCE VMs that do batch processing, running anywhere
> between a few minutes to several hours. This workload is bursty,
> fluctuating between tens and hundreds of VMs active at any time, so
> sometimes the Prometheus wal folder grows to  between 10-15GB in size.
> Prometheus usually handles this workload with about half a CPU core and 
> 8GB
> of RAM and if left to its own devices, the wal folder will shrink again
> when the load decreases.
>
> The problem is that when there is a backlog and Prometheus is
> restarted (due to the preemptive VM going away), it will use several times
> more RAM to recover the wal folder. This often exhausts all the available
> memory on the Kubernetes worker, so Prometheus is killed by the OOM killed
> over and over again, until I log in and delete the wal folder, losing
> several hours of metrics. I have already doubled the size of the VMs just
> to accommodate Prometheus and I am reluctant to do this again. Running
> non-preemptive VMs would triple the cost of these instances and Prometheus
> might still get restarted when we roll out an update -- so this would
> probably not even solve the issue properly.
>
> I don't know if there is something special in our use case, but I did
> come across a blog describing the same high memory usage behaviour on
> startup.
>
> I feel that unless there is a fix I can do, this would warrant either
> a bug or feature request -- Prometheus should be able to recover without
> operator intervention or losing metrics. And for a process running on
> Kubernetes, we should be able to set memory "request" and "limit" values
> that are close to actual expected usage, rather than 3-4 times the steady
> state usage just to accommodate the memory requirements of the startup
> phase.
>
> Please let me know what information I should provide, if any. I have
> some graph screenshots that would be relevant.
>
> Many thanks,
> Vik
>
> --
> You received this message because you are subscribed to the Google
> Groups "Prometheus Users" group.
> To unsubscribe from this group and stop receiving 

Re: [prometheus-users] Prometheus wal folder and memory usage on startup

2020-07-01 Thread Julien Pivotto
Once 2.19 is running, it will create an mmapped head, which will improve that.

I agree that starting 2.19 with a 2.18 wal won't make a change.

Le mer. 1 juil. 2020 à 19:55, Viktor Radnai  a
écrit :

> Hi again Ben,
>
> Unfortunately upgrading to 2.19.2 does not solve the startup issue.
> Prometheus gets OOMKilled before even starting to parse the last 25
> segments which represent the last 50 minutes worth of data. Based on this
> the estimated memory requirement should be somewhere between 60-70GB but
> the worker node only has 52GB. The other Prometheus pod currently consumes
> 7.7GB.
>
> The left of the graph is 2.18.1, the right is 2.19.2. I inadvertently
> reinstated a previously set 40GB memory limit and updated the replicaset to
> increase it back to 50GB -- this is the reason for the second Prometheus
> restart and the slightly higher plateau for the last two OOMs.
>
> Unless there is a way to move some WAL segments out and then restore them
> later, I'll try to delete the last 50 minutes worth of segments to get the
> pod to come up.
>
> Thanks,
> Vik
>
> On Wed, 1 Jul 2020 at 16:39, Viktor Radnai 
> wrote:
>
>> Hi Ben,
>>
>> We are running 2.18.1 -- I will upgrade to 2.19.2 and see if this solves
>> the problem. I currently have one of the two replicas in production
>> crashlooping so I'll try to roll this out in the next few hours and report
>> back.
>>
>> Thanks,
>> Vik
>>
>> On Wed, 1 Jul 2020 at 16:32, Ben Kochie  wrote:
>>
>>> What version of Prometheus do you have deployed? We've made several
>>> major improvements to WAL handling and startup in the last couple of
>>> releases.
>>>
>>> I would recommend upgrading to 2.19.2 if you haven't.
>>>
>>> On Wed, Jul 1, 2020 at 5:06 PM Viktor Radnai 
>>> wrote:
>>>
 Hi all,

 We have a recurring problem with Prometheus repeatedly getting
 OOMKilled on startup while trying to process the write ahead log. I tried
 to look through Github issues but there was no solution or currently open
 issue as far as I could see.

 We are running on Kubernetes in GKE using the prometheus-operator Helm
 chart, using Google Cloud's Preemptible VMs. These VMs get killed every 24
 hours maximum, so our Prometheus pods also get killed and automatically
 migrated by Kubernetes (the data is on a persistent volume of course). To
 avoid loss of metrics, we run two identically configured replicas with
 their own storage, scraping all the same targets.

 We monitor numerous GCE VMs that do batch processing, running anywhere
 between a few minutes to several hours. This workload is bursty,
 fluctuating between tens and hundreds of VMs active at any time, so
 sometimes the Prometheus wal folder grows to  between 10-15GB in size.
 Prometheus usually handles this workload with about half a CPU core and 8GB
 of RAM and if left to its own devices, the wal folder will shrink again
 when the load decreases.

 The problem is that when there is a backlog and Prometheus is restarted
 (due to the preemptive VM going away), it will use several times more RAM
 to recover the wal folder. This often exhausts all the available memory on
 the Kubernetes worker, so Prometheus is killed by the OOM killed over and
 over again, until I log in and delete the wal folder, losing several hours
 of metrics. I have already doubled the size of the VMs just to accommodate
 Prometheus and I am reluctant to do this again. Running non-preemptive VMs
 would triple the cost of these instances and Prometheus might still get
 restarted when we roll out an update -- so this would probably not even
 solve the issue properly.

 I don't know if there is something special in our use case, but I did
 come across a blog describing the same high memory usage behaviour on
 startup.

 I feel that unless there is a fix I can do, this would warrant either a
 bug or feature request -- Prometheus should be able to recover without
 operator intervention or losing metrics. And for a process running on
 Kubernetes, we should be able to set memory "request" and "limit" values
 that are close to actual expected usage, rather than 3-4 times the steady
 state usage just to accommodate the memory requirements of the startup
 phase.

 Please let me know what information I should provide, if any. I have
 some graph screenshots that would be relevant.

 Many thanks,
 Vik

 --
 You received this message because you are subscribed to the Google
 Groups "Prometheus Users" group.
 To unsubscribe from this group and stop receiving emails from it, send
 an email to prometheus-users+unsubscr...@googlegroups.com.
 To view this discussion on the web visit
 https://groups.google.com/d/msgid/prometheus-users/CANx-tGgY3vJ-dzyOjYMAu1dRvdsfO83Ux_Y0g7XAeKzPTmGWLQ%40mail.gmail.com
 

[prometheus-users] Re: Is it possible to use REST Service to provide data for prometheus

2020-07-01 Thread Thorsten Stork
Thank you very much. I will try it out.

Am Mittwoch, 1. Juli 2020 17:25:22 UTC+2 schrieb Thorsten Stork:
>
> Hello,
>
> I am really new to the prometheus topics, but should evaluate monitoring 
> functionality 
> for a system.
>
> On this system (middleware) we kan provide REST Service, http-Services or 
> webservices which can deliver some metrics at the current time (no historic 
> data).
> On this system we could not install exporter or other things like 
> libraries or so.
>
> How far I understand, Prometheus pulls the data from an exporter via REST 
> oder http-call, is this right ?
>
> So is the a definition how the service have to look like to simulate an 
> "exporter" or my complete idea the wrong way ?
>
> How should be the best way to solve my requirements ?
>
> Thank you.
>
> Regards
>Thorsten
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/fda02ff5-be54-4f93-bd0e-18a3223ef477o%40googlegroups.com.


[prometheus-users] Re: Multiple remote_write

2020-07-01 Thread Brian Candler
Remote writes can be filtered using write_relabel_configs. See:
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config
and in particular the "drop" action.  You can filter by any labels, 
including job.
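
For example, a sketch with placeholder URLs and job names (untested):

```yaml
remote_write:
  # Send only series from job1 to Cortex.
  - url: http://cortex.example.com/api/v1/push
    write_relabel_configs:
      - source_labels: [job]
        regex: job1
        action: keep
  # Send everything except job1 to the second endpoint.
  - url: http://other-receiver.example.com/api/v1/write
    write_relabel_configs:
      - source_labels: [job]
        regex: job1
        action: drop
```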

You can't disable local writing.  Prometheus uses it for alert rules and 
recording rules.

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/5d67c69c-9234-45aa-a992-4f4e0a378301o%40googlegroups.com.


[prometheus-users] Re: Not able to receive emails from alert manager

2020-07-01 Thread Brian Candler
Well, all I can say for definite is that you're not running with the config 
file that you think you are.  You're running alertmanager with a config 
that has a webhook receiver pointing to port 5001.

Check how you're starting alertmanager.  Check what config file you're 
pointing it to on the command line.  Check you restarted it after editing 
the config (actually, "killall -HUP alertmanager" is all you need).  Check 
the logs after restarting it.  Check you're looking on the right server.

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/14ac5c46-09a8-4a43-b6a1-0e21dc26aca1o%40googlegroups.com.


Re: [prometheus-users] Is it possible to use REST Service to provide data for prometheus

2020-07-01 Thread Christian Hoffmann
Hi,

On 7/1/20 5:25 PM, Thorsten Stork wrote:
> I am really new to the prometheus topics, but should evaluate monitoring
> functionality 
> for a system.
> 
> On this system (middleware) we kan provide REST Service, http-Services
> or webservices which can deliver some metrics at the current time (no
> historic data).
> On this system we could not install exporter or other things like
> libraries or so.
> 
> How far I understand, Prometheus pulls the data from an exporter via
> REST oder http-call, is this right ?
Right, it's just a simple HTTP GET request which returns the metrics in
a defined format.

> So is the a definition how the service have to look like to simulate an
> "exporter" or my complete idea the wrong way ?
No need to "simulate". It's a completely valid way. :)

Depending on your environment and the programming language, it might be
easier to use one of the existing client libraries for that.

This documentation describes the format:
https://prometheus.io/docs/instrumenting/exposition_formats/
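
A minimal example of what such an endpoint could return (metric names and values are made up for illustration):

```
# HELP app_jobs_processed_total Total number of jobs processed.
# TYPE app_jobs_processed_total counter
app_jobs_processed_total 1027
# HELP app_queue_depth Current number of queued jobs.
# TYPE app_queue_depth gauge
app_queue_depth 42
```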

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/6c4889a1-348b-3dc6-2b73-1169bdf14f9c%40hoffmann-christian.info.


Re: [prometheus-users] Multiple remote_write

2020-07-01 Thread Christian Hoffmann
Hi,

On 7/1/20 4:44 PM, Ramachandra Bhaskar Ayalavarapu wrote:
> Is it possible for a single Prometheus to have multiple remote_write adapters 
> depending on jobs ?
> For example job1 should be writing to r1 (cortex) and r2 to another 
> remote_write instance ?
Although I'm not using it: As far as I understand, multiple remote write
configs are supported.
Each remote write config can have write_relabel_configs which allow you
to filter what to send. This works per-remote write target.

https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write

> Another problem I face is certain jobs only need to do remote_write and 
> certain jobs locally. Please let me know if that’s possible at job level
Yes, this should be possible using the same mechanism by filtering on
the job label.

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/a51cdeee-3983-853e-7853-1a7e77a28c1d%40hoffmann-christian.info.


Re: [prometheus-users] Prometheus wal folder and memory usage on startup

2020-07-01 Thread Viktor Radnai
Hi Ben,

We are running 2.18.1 -- I will upgrade to 2.19.2 and see if this solves
the problem. I currently have one of the two replicas in production
crashlooping so I'll try to roll this out in the next few hours and report
back.

Thanks,
Vik

On Wed, 1 Jul 2020 at 16:32, Ben Kochie  wrote:

> What version of Prometheus do you have deployed? We've made several major
> improvements to WAL handling and startup in the last couple of releases.
>
> I would recommend upgrading to 2.19.2 if you haven't.
>
> On Wed, Jul 1, 2020 at 5:06 PM Viktor Radnai 
> wrote:
>
>> Hi all,
>>
>> We have a recurring problem with Prometheus repeatedly getting OOMKilled
>> on startup while trying to process the write ahead log. I tried to look
>> through Github issues but there was no solution or currently open issue as
>> far as I could see.
>>
>> We are running on Kubernetes in GKE using the prometheus-operator Helm
>> chart, using Google Cloud's Preemptible VMs. These VMs get killed every 24
>> hours maximum, so our Prometheus pods also get killed and automatically
>> migrated by Kubernetes (the data is on a persistent volume of course). To
>> avoid loss of metrics, we run two identically configured replicas with
>> their own storage, scraping all the same targets.
>>
>> We monitor numerous GCE VMs that do batch processing, running anywhere
>> between a few minutes to several hours. This workload is bursty,
>> fluctuating between tens and hundreds of VMs active at any time, so
>> sometimes the Prometheus wal folder grows to  between 10-15GB in size.
>> Prometheus usually handles this workload with about half a CPU core and 8GB
>> of RAM and if left to its own devices, the wal folder will shrink again
>> when the load decreases.
>>
>> The problem is that when there is a backlog and Prometheus is restarted
>> (due to the preemptive VM going away), it will use several times more RAM
>> to recover the wal folder. This often exhausts all the available memory on
>> the Kubernetes worker, so Prometheus is killed by the OOM killed over and
>> over again, until I log in and delete the wal folder, losing several hours
>> of metrics. I have already doubled the size of the VMs just to accommodate
>> Prometheus and I am reluctant to do this again. Running non-preemptive VMs
>> would triple the cost of these instances and Prometheus might still get
>> restarted when we roll out an update -- so this would probably not even
>> solve the issue properly.
>>
>> I don't know if there is something special in our use case, but I did
>> come across a blog describing the same high memory usage behaviour on
>> startup.
>>
>> I feel that unless there is a fix I can do, this would warrant either a
>> bug or feature request -- Prometheus should be able to recover without
>> operator intervention or losing metrics. And for a process running on
>> Kubernetes, we should be able to set memory "request" and "limit" values
>> that are close to actual expected usage, rather than 3-4 times the steady
>> state usage just to accommodate the memory requirements of the startup
>> phase.
>>
>> Please let me know what information I should provide, if any. I have some
>> graph screenshots that would be relevant.
>>
>> Many thanks,
>> Vik
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Prometheus Users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to prometheus-users+unsubscr...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/prometheus-users/CANx-tGgY3vJ-dzyOjYMAu1dRvdsfO83Ux_Y0g7XAeKzPTmGWLQ%40mail.gmail.com
>> 
>> .
>>
>

-- 
My other sig is hilarious

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/CANx-tGgz631pR8LGDxWpw%2BGFQDAGezOWETWZAj4-n%3DSV-m3www%40mail.gmail.com.


Re: [prometheus-users] Prometheus wal folder and memory usage on startup

2020-07-01 Thread Viktor Radnai
Hi Matthias,

Thanks, I think this should definitely help, but I'm not sure if it will always
solve the problem. If I understand it correctly, the WAL holds 6 hours of
data, and in our experience the high-water mark for memory usage seems to be
about 3-4 times the WAL size. So while processing 2 hours' worth, you might go
higher than normal, but not several times higher.

What would be very nice is if Prometheus observed the rlimit set for the
maximum virtual memory size and flushed the WAL when it gets close to that.
When Prometheus starts up, it already prints the values (if set):
level=info ts=2020-07-01T15:31:50.711Z caller=main.go:341
vm_limits="(soft=unlimited, hard=unlimited)"

I tried setting these with a small Bash script wrapper and ulimit, but this
resulted in a Golang OOM error and termination instead of the Linux OOM
killer and termination :)
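
For reference, a sketch of the kind of wrapper described above (the limit value and binary path are placeholders; this reproduces the behaviour described rather than fixing it):

```sh
#!/bin/sh
# Cap virtual memory (value is in KB, here ~45 GiB) before exec'ing Prometheus,
# so an over-large WAL replay fails with a Go allocation error instead of the
# kernel OOM killer terminating the whole pod.
ulimit -v 47185920
exec /bin/prometheus "$@"
```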

Many thanks,
Vik

On Wed, 1 Jul 2020 at 16:27, Matthias Rampke  wrote:

> I have been thinking about this problem as well, since we ran into a
> similar issue yesterday. In our case, Prometheus had already failed to
> write out a TSDB block for a few hours but kept on piling data into the
> head block.
>
> Could TSDB write out blocks *during* WAL recovery? Say, for every two
> hours' worth of WAL or even more frequently, it could pause recovery, write
> a block, delete the WAL up to that point, continue recovery. This would put
> something of a bound on the memory usage during recovery, and alleviate the
> issue that recovery from out-of-memory takes *even more memory*.
>
> Would this help in your case?
>
> /MR
>
>
> On Wed, Jul 1, 2020 at 3:06 PM Viktor Radnai 
> wrote:
>
>> Hi all,
>>
>> We have a recurring problem with Prometheus repeatedly getting OOMKilled
>> on startup while trying to process the write ahead log. I tried to look
>> through Github issues but there was no solution or currently open issue as
>> far as I could see.
>>
>> We are running on Kubernetes in GKE using the prometheus-operator Helm
>> chart, using Google Cloud's Preemptible VMs. These VMs get killed every 24
>> hours maximum, so our Prometheus pods also get killed and automatically
>> migrated by Kubernetes (the data is on a persistent volume of course). To
>> avoid loss of metrics, we run two identically configured replicas with
>> their own storage, scraping all the same targets.
>>
>> We monitor numerous GCE VMs that do batch processing, running anywhere
>> between a few minutes to several hours. This workload is bursty,
>> fluctuating between tens and hundreds of VMs active at any time, so
>> sometimes the Prometheus wal folder grows to  between 10-15GB in size.
>> Prometheus usually handles this workload with about half a CPU core and 8GB
>> of RAM and if left to its own devices, the wal folder will shrink again
>> when the load decreases.
>>
>> The problem is that when there is a backlog and Prometheus is restarted
>> (due to the preemptive VM going away), it will use several times more RAM
>> to recover the wal folder. This often exhausts all the available memory on
>> the Kubernetes worker, so Prometheus is killed by the OOM killed over and
>> over again, until I log in and delete the wal folder, losing several hours
>> of metrics. I have already doubled the size of the VMs just to accommodate
>> Prometheus and I am reluctant to do this again. Running non-preemptive VMs
>> would triple the cost of these instances and Prometheus might still get
>> restarted when we roll out an update -- so this would probably not even
>> solve the issue properly.
>>
>> I don't know if there is something special in our use case, but I did
>> come across a blog describing the same high memory usage behaviour on
>> startup.
>>
>> I feel that unless there is a fix I can do, this would warrant either a
>> bug or feature request -- Prometheus should be able to recover without
>> operator intervention or losing metrics. And for a process running on
>> Kubernetes, we should be able to set memory "request" and "limit" values
>> that are close to actual expected usage, rather than 3-4 times the steady
>> state usage just to accommodate the memory requirements of the startup
>> phase.
>>
>> Please let me know what information I should provide, if any. I have some
>> graph screenshots that would be relevant.
>>
>> Many thanks,
>> Vik
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Prometheus Users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to prometheus-users+unsubscr...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/prometheus-users/CANx-tGgY3vJ-dzyOjYMAu1dRvdsfO83Ux_Y0g7XAeKzPTmGWLQ%40mail.gmail.com
>> 
>> .
>>
>

-- 
My other sig is hilarious

-- 
You received this message because you are subscribed to the 

Re: [prometheus-users] Prometheus wal folder and memory usage on startup

2020-07-01 Thread Ben Kochie
What version of Prometheus do you have deployed? We've made several major
improvements to WAL handling and startup in the last couple of releases.

I would recommend upgrading to 2.19.2 if you haven't.

On Wed, Jul 1, 2020 at 5:06 PM Viktor Radnai 
wrote:

> Hi all,
>
> We have a recurring problem with Prometheus repeatedly getting OOMKilled
> on startup while trying to process the write ahead log. I tried to look
> through Github issues but there was no solution or currently open issue as
> far as I could see.
>
> We are running on Kubernetes in GKE using the prometheus-operator Helm
> chart, using Google Cloud's Preemptible VMs. These VMs get killed every 24
> hours maximum, so our Prometheus pods also get killed and automatically
> migrated by Kubernetes (the data is on a persistent volume of course). To
> avoid loss of metrics, we run two identically configured replicas with
> their own storage, scraping all the same targets.
>
> We monitor numerous GCE VMs that do batch processing, running anywhere
> between a few minutes to several hours. This workload is bursty,
> fluctuating between tens and hundreds of VMs active at any time, so
> sometimes the Prometheus wal folder grows to  between 10-15GB in size.
> Prometheus usually handles this workload with about half a CPU core and 8GB
> of RAM and if left to its own devices, the wal folder will shrink again
> when the load decreases.
>
> The problem is that when there is a backlog and Prometheus is restarted
> (due to the preemptive VM going away), it will use several times more RAM
> to recover the wal folder. This often exhausts all the available memory on
> the Kubernetes worker, so Prometheus is killed by the OOM killed over and
> over again, until I log in and delete the wal folder, losing several hours
> of metrics. I have already doubled the size of the VMs just to accommodate
> Prometheus and I am reluctant to do this again. Running non-preemptive VMs
> would triple the cost of these instances and Prometheus might still get
> restarted when we roll out an update -- so this would probably not even
> solve the issue properly.
>
> I don't know if there is something special in our use case, but I did come
> across a blog describing the same high memory usage behaviour on startup.
>
> I feel that unless there is a fix I can do, this would warrant either a
> bug or feature request -- Prometheus should be able to recover without
> operator intervention or losing metrics. And for a process running on
> Kubernetes, we should be able to set memory "request" and "limit" values
> that are close to actual expected usage, rather than 3-4 times the steady
> state usage just to accommodate the memory requirements of the startup
> phase.
>
> Please let me know what information I should provide, if any. I have some
> graph screenshots that would be relevant.
>
> Many thanks,
> Vik
>
> --
> You received this message because you are subscribed to the Google Groups
> "Prometheus Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to prometheus-users+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/prometheus-users/CANx-tGgY3vJ-dzyOjYMAu1dRvdsfO83Ux_Y0g7XAeKzPTmGWLQ%40mail.gmail.com
> 
> .
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/CABbyFmqmdsgoXqk4TPNVOannHjT8acUfWG2mFv4oj0uJQb2S-A%40mail.gmail.com.


Re: [prometheus-users] Prometheus wal folder and memory usage on startup

2020-07-01 Thread Matthias Rampke
I have been thinking about this problem as well, since we ran into a
similar issue yesterday. In our case, Prometheus had already failed to
write out a TSDB block for a few hours but kept on piling data into the
head block.

Could TSDB write out blocks *during* WAL recovery? Say, for every two
hours' worth of WAL or even more frequently, it could pause recovery, write
a block, delete the WAL up to that point, continue recovery. This would put
something of a bound on the memory usage during recovery, and alleviate the
issue that recovery from out-of-memory takes *even more memory*.

Would this help in your case?

/MR


On Wed, Jul 1, 2020 at 3:06 PM Viktor Radnai 
wrote:

> Hi all,
>
> We have a recurring problem with Prometheus repeatedly getting OOMKilled
> on startup while trying to process the write ahead log. I tried to look
> through Github issues but there was no solution or currently open issue as
> far as I could see.
>
> We are running on Kubernetes in GKE using the prometheus-operator Helm
> chart, using Google Cloud's Preemptible VMs. These VMs get killed every 24
> hours maximum, so our Prometheus pods also get killed and automatically
> migrated by Kubernetes (the data is on a persistent volume of course). To
> avoid loss of metrics, we run two identically configured replicas with
> their own storage, scraping all the same targets.
>
> We monitor numerous GCE VMs that do batch processing, running anywhere
> between a few minutes to several hours. This workload is bursty,
> fluctuating between tens and hundreds of VMs active at any time, so
> sometimes the Prometheus wal folder grows to  between 10-15GB in size.
> Prometheus usually handles this workload with about half a CPU core and 8GB
> of RAM and if left to its own devices, the wal folder will shrink again
> when the load decreases.
>
> The problem is that when there is a backlog and Prometheus is restarted
> (due to the preemptive VM going away), it will use several times more RAM
> to recover the wal folder. This often exhausts all the available memory on
> the Kubernetes worker, so Prometheus is killed by the OOM killed over and
> over again, until I log in and delete the wal folder, losing several hours
> of metrics. I have already doubled the size of the VMs just to accommodate
> Prometheus and I am reluctant to do this again. Running non-preemptive VMs
> would triple the cost of these instances and Prometheus might still get
> restarted when we roll out an update -- so this would probably not even
> solve the issue properly.
>
> I don't know if there is something special in our use case, but I did come
> across a blog describing the same high memory usage behaviour on startup.
>
> I feel that unless there is a fix I can do, this would warrant either a
> bug or feature request -- Prometheus should be able to recover without
> operator intervention or losing metrics. And for a process running on
> Kubernetes, we should be able to set memory "request" and "limit" values
> that are close to actual expected usage, rather than 3-4 times the steady
> state usage just to accommodate the memory requirements of the startup
> phase.
>
> Please let me know what information I should provide, if any. I have some
> graph screenshots that would be relevant.
>
> Many thanks,
> Vik
>
> --
> You received this message because you are subscribed to the Google Groups
> "Prometheus Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to prometheus-users+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/prometheus-users/CANx-tGgY3vJ-dzyOjYMAu1dRvdsfO83Ux_Y0g7XAeKzPTmGWLQ%40mail.gmail.com
> 
> .
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/CAMV%3D_gb5nbM3a1hXOPmeJxFN6mhAFbkkY7Ec21iGxbxPUtA4pQ%40mail.gmail.com.


[prometheus-users] Is it possible to use REST Service to provide data for prometheus

2020-07-01 Thread Thorsten Stork
Hello,

I am really new to the Prometheus topics, but I should evaluate monitoring 
functionality for a system.

On this system (middleware) we can provide REST services, HTTP services or 
web services which can deliver some metrics at the current time (no historic 
data).
On this system we cannot install exporters or other things like libraries.

As far as I understand, Prometheus pulls the data from an exporter via a REST 
or HTTP call, is this right?

So is there a definition of how the service has to look to simulate an 
"exporter", or is my whole idea the wrong way?

What would be the best way to solve my requirements?

Thank you.

Regards
   Thorsten

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/f63e434d-1a21-4a91-a086-077900182412o%40googlegroups.com.


[prometheus-users] Re: How to include separate labels into custom email template for notification

2020-07-01 Thread 'Владимир organ2' via Prometheus Users
I figured out that this part is used for creating *[FIRING:1] or [RESOLVED]*, 
and it works:
{{ define "__subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ end }}
But I don't know what I should insert inside to get only one label.

If I insert {{ .labels.alertname }} inside, it just breaks my email:
{{ define "__subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .labels.alertname }} {{ end }}
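
A hedged note (not from the thread): in the notification-template context the data exposes .CommonLabels, .GroupLabels and .Alerts (capitalised); there is no lowercase .labels field, which is why the template breaks. Assuming alertname is common to all alerts in the group, something along these lines is closer to what Alertmanager expects:
{{ define "__subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}{{ end }}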


-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/b23afb1c-85e8-4387-aa3a-013405a3fcb1o%40googlegroups.com.


[prometheus-users] Prometheus wal folder and memory usage on startup

2020-07-01 Thread Viktor Radnai
Hi all,

We have a recurring problem with Prometheus repeatedly getting OOMKilled on
startup while trying to process the write ahead log. I tried to look
through Github issues but there was no solution or currently open issue as
far as I could see.

We are running on Kubernetes in GKE using the prometheus-operator Helm
chart, using Google Cloud's Preemptible VMs. These VMs get killed every 24
hours maximum, so our Prometheus pods also get killed and automatically
migrated by Kubernetes (the data is on a persistent volume of course). To
avoid loss of metrics, we run two identically configured replicas with
their own storage, scraping all the same targets.

We monitor numerous GCE VMs that do batch processing, running anywhere
from a few minutes to several hours. This workload is bursty, fluctuating
between tens and hundreds of VMs active at any time, so sometimes the
Prometheus wal folder grows to between 10-15GB in size. Prometheus usually
handles this workload with about half a CPU core and 8GB of RAM, and if
left to its own devices, the wal folder will shrink again when the load
decreases.

The problem is that when there is a backlog and Prometheus is restarted
(due to the preemptible VM going away), it will use several times more RAM
to recover the wal folder. This often exhausts all the available memory on
the Kubernetes worker, so Prometheus is killed by the OOM killer over and
over again, until I log in and delete the wal folder, losing several hours
of metrics. I have already doubled the size of the VMs just to accommodate
Prometheus and I am reluctant to do this again. Running non-preemptible VMs
would triple the cost of these instances and Prometheus might still get
restarted when we roll out an update -- so this would probably not even
solve the issue properly.

I don't know if there is something special in our use case, but I did come
across a blog describing the same high memory usage behaviour on startup.

I feel that unless there is a fix I can do, this would warrant either a bug
or feature request -- Prometheus should be able to recover without operator
intervention or losing metrics. And for a process running on Kubernetes, we
should be able to set memory "request" and "limit" values that are close to
actual expected usage, rather than 3-4 times the steady state usage just to
accommodate the memory requirements of the startup phase.

Please let me know what information I should provide, if any. I have some
graph screenshots that would be relevant.

Many thanks,
Vik

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/CANx-tGgY3vJ-dzyOjYMAu1dRvdsfO83Ux_Y0g7XAeKzPTmGWLQ%40mail.gmail.com.


[prometheus-users] Multiple remote_write

2020-07-01 Thread Ramachandra Bhaskar Ayalavarapu
Hello

Is it possible for a single Prometheus to have multiple remote_write adapters 
depending on jobs?
For example, job1 should be writing to r1 (cortex) and r2 to another 
remote_write instance?


Another problem I face is that certain jobs only need to do remote_write and 
certain jobs only write locally. Please let me know if that's possible at the 
job level.

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/8ee29864-473e-4b7c-8dda-0d75a5a5a0c2o%40googlegroups.com.


[prometheus-users] Re: Not able to receive emails from alert manager

2020-07-01 Thread Ali Shiekh
Here's my alertmanager.yml file:

[image: 2.JPG]



And it contains the same config I attached above.
Please assist me with this; I'm new to Prometheus.
Thanks.


On Wednesday, July 1, 2020 at 12:23:48 PM UTC+5, Brian Candler wrote:
>
> You're not showing us the actual alertmanager.yml config you're using.
>
> Port 5001 comes from one of the example configs for an alertmanager 
> webhook:
>
> ./examples/webhook/echo.go: log.Fatal(http.ListenAndServe(":5001", 
> http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
> ./examples/ha/alertmanager.yml:  - url: 'http://127.0.0.1:5001/'
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/748289dc-931b-472f-8016-a44cc4c6e99ao%40googlegroups.com.


[prometheus-users] Re: Not able to receive emails from alert manager

2020-07-01 Thread Ali Shiekh
This is the only file that I'm using. Can you guide me with this? I'm new to 
Prometheus.

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/7ae47db0-7d5b-48bb-91a0-92db5b8ec7ceo%40googlegroups.com.


Re: [prometheus-users] Re: node_exporter support to read from libraries

2020-07-01 Thread Ben Kochie
We did some work previously to eliminate the use of CGO in the node_exporter
on Linux. CGO has a long history of being a bit fragile. We do use it on
non-Linux platforms for gathering data from OS syscalls that don't have
Go-native implementations.

Normally I wouldn't object to adding Redfish to the node_exporter, but the
lack of a Go-native library implementation makes it not a good fit for our
goals.

On Wed, Jul 1, 2020 at 9:28 AM Brian Candler  wrote:

> It sounds to me like this should be a separate exporter, like
> ipmi_exporter - especially since libredfish is a C library (according to
> Google anyway)
>
> --
> You received this message because you are subscribed to the Google Groups
> "Prometheus Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to prometheus-users+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/prometheus-users/4afce4b3-425e-4962-98c4-128b6b2e1548o%40googlegroups.com
> 
> .
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/CABbyFmquxkGYh9CHNdPW-Z9tWS%2BfQrGSAnmHM4FhnNZqcMcX5A%40mail.gmail.com.


[prometheus-users] Adding extra label to all prometheus metrics

2020-07-01 Thread Sandeep Rao Kokkirala
Hi Team,

I want to display the cluster name in all my Prometheus queries. I tried adding 
the two scrape configs below, but I'm getting the same output.

additionalScrapeConfigs:
  - job_name: 'local'
    honor_labels: true
    static_configs:
      - targets:
          - ':'
        labels:
          cluster: 'qa-app'

additionalScrapeConfigs:
  - job_name: 'local'
    honor_labels: true
    static_configs:
      - targets:
          - '127.0.0.1'
        labels:
          cluster: 'qa-app'


Query:

up{job="apiserver"}

Current output:

up{endpoint="https",instance=":6443",job="apiserver",namespace="default",service="kubernetes"} 1

Expected output:

up{cluster="qa-app",endpoint="https",instance=":6443",job="apiserver",namespace="default",service="kubernetes"} 1
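
A hedged sketch (not from the thread, and a guess rather than a confirmed fix): a `labels:` block in a static_config only applies to the targets of that scrape job, so the `apiserver` job would need the label attached to its own scrape configuration, for example with a relabel rule:

```yaml
  - job_name: 'apiserver'   # placeholder; with prometheus-operator this job is generated elsewhere
    relabel_configs:
      - target_label: cluster
        replacement: qa-app
```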



-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/830290da-fdc7-42ac-bf71-c89dc3c8ee63o%40googlegroups.com.


[prometheus-users] Re: node_exporter support to read from libraries

2020-07-01 Thread Brian Candler
It sounds to me like this should be a separate exporter, like ipmi_exporter 
- especially since libredfish is a C library (according to Google anyway)

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/4afce4b3-425e-4962-98c4-128b6b2e1548o%40googlegroups.com.


[prometheus-users] Re: Not able to receive emails from alert manager

2020-07-01 Thread Brian Candler
You're not showing us the actual alertmanager.yml config you're using.

Port 5001 comes from one of the example configs for an alertmanager webhook:

./examples/webhook/echo.go: log.Fatal(http.ListenAndServe(":5001", 
http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
./examples/ha/alertmanager.yml:  - url: 'http://127.0.0.1:5001/'

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/8807c69d-c13b-4f26-b0d9-976de0660c48o%40googlegroups.com.


[prometheus-users] Re: promql

2020-07-01 Thread Brian Candler
Prometheus is not a general-purpose database.

You cannot populate data using PromQL - there is no "insert" statement 
equivalent.

In fact, you cannot populate data into Prometheus' internal time series 
database (TSDB) in any way at all, except by having Prometheus scrape the 
data from an exporter.  You cannot backfill historical data, for instance.

You *can* get Prometheus to write data to a remote storage system, and read 
it back again. There is a list of integrations on the Prometheus website. 
I don't see MongoDB listed, so you might end up having to write that 
yourself.

It could be that some other system will suit your needs better - 
TimescaleDB, InfluxDB, VictoriaMetrics etc.

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/6140b571-42c1-46ce-8a89-4754f3f65c42o%40googlegroups.com.


Re: [prometheus-users] Is it possible to extract labels when generating AlertManager alert ?

2020-07-01 Thread Brian Candler
On Tuesday, 30 June 2020 14:40:46 UTC+1, Sébastien Dionne wrote:
>
> but is there a way to configure the scrape interval with an annotation too?
>
> I could have applications that we want to monitor each 15 sec and others 
> at 45sec interval or more.
>
>
You can have two different scrape jobs, one with interval 15s and one with 
interval 45s.  Use the relabeling step to drop targets which have the wrong 
annotation for that job.
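
A sketch of that approach, assuming Kubernetes pod discovery and a hypothetical `prometheus.io/interval` annotation (all names are placeholders):

```yaml
scrape_configs:
  - job_name: 'pods-15s'
    scrape_interval: 15s
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods whose annotation matches this job's interval.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_interval]
        regex: 15s
        action: keep
  - job_name: 'pods-45s'
    scrape_interval: 45s
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_interval]
        regex: 45s
        action: keep
```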

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/1c166180-a19f-46de-9e89-53291c0e21ceo%40googlegroups.com.