Re: [PATCH 0/7] psi: pressure stall information for CPU, memory, and IO

2018-05-25 Thread Suren Baghdasaryan
Hi Johannes,
I tried your previous memdelay patches before this new set was posted
and the results were promising for predicting when an Android system is close
to OOM. I'm definitely going to try this one after I backport it to
4.9.

On Mon, May 7, 2018 at 2:01 PM, Johannes Weiner  wrote:
> Hi,
>
> I previously submitted a version of this patch set called "memdelay",
> which translated delays from reclaim, swap-in, thrashing page cache
> into a pressure percentage of lost walltime. I've since extended this
> code to aggregate all delay states tracked by delayacct in order to
> have generalized pressure/overcommit levels for CPU, memory, and IO.
>
> There was feedback from Peter on the previous version that I have
> incorporated as much as possible and as it still applies to this code:
>
> - got rid of the extra lock in the sched callbacks; all task
>   state changes we care about serialize through rq->lock
>
> - got rid of ktime_get() inside the sched callbacks and
>   switched time measuring to rq_clock()
>
> - got rid of all divisions inside the sched callbacks,
>   tracking everything natively in ns now
>
> I also moved this stuff into existing sched/stat.h callbacks, so it
> doesn't get in the way in sched/core.c, and of course moved the whole
> thing behind CONFIG_PSI since not everyone is going to want it.

Would it make sense to split CONFIG_PSI into CONFIG_PSI_CPU,
CONFIG_PSI_MEM and CONFIG_PSI_IO, since one might need only a specific
subset of this feature?

>
> Real-world applications
>
> Since the last posting, we've begun using the data collected by this
> code quite extensively at Facebook, and with several success stories.
>
> First we used it on systems that frequently locked up in low memory
> situations. The reason this happens is that the OOM killer is
> triggered by reclaim not being able to make forward progress, but with
> fast flash devices there is *always* some clean and uptodate cache to
> reclaim; the OOM killer never kicks in, even as tasks wait 80-90% of
> the time faulting executables. There is no situation where this ever
> makes sense in practice. We wrote a <100 line POC python script to
> monitor memory pressure and kill stuff manually, way before such
> pathological thrashing.
>
> We've since extended the python script into a more generic oomd that
> we use all over the place, not just to avoid livelocks but also to
> guarantee latency and throughput SLAs, since they're usually violated
> way before the kernel OOM killer would ever kick in.
>
> We also use the memory pressure info for loadshedding. Our batch job
> infrastructure used to refuse new requests on heuristics based on RSS
> and other existing VM metrics in an attempt to avoid OOM kills and
> maximize utilization. Since it was still plagued by frequent OOM
> kills, we switched it to shed load on psi memory pressure, which has
> turned out to be a much better bellwether, and we managed to reduce
> OOM kills drastically. Reducing the rate of OOM outages from the
> worker pool raised its aggregate productivity, and we were able to
> switch that service to smaller machines.
>
> Lastly, we use cgroups to isolate a machine's main workload from
> maintenance crap like package upgrades, logging, configuration, as
> well as to prevent multiple workloads on a machine from stepping on
> each other's toes. We were not able to do this properly without the
> pressure metrics; we would see latency or bandwidth drops, but it
> would often be hard or impossible to root-cause them post-mortem. We now
> log and graph the pressure metrics for all containers in our fleet and
> can trivially link service drops to resource pressure after the fact.
>
> How do you use this?
>
> A kernel with CONFIG_PSI=y will create a /proc/pressure directory with
> 3 files: cpu, memory, and io. If using cgroup2, cgroups will also have
> cpu.pressure, memory.pressure and io.pressure files, which simply
> calculate pressure at the cgroup level instead of system-wide.
>
> The cpu file contains one line:
>
> some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722
>
> The averages give the percentage of walltime in which some tasks are
> delayed on the runqueue while another task has the CPU. They're recent
> averages over 10s, 1m, 5m windows, so you can tell short term trends
> from long term ones, similarly to the load average.
>
> What to make of this number? If CPU utilization is at 100% and CPU
> pressure is 0, it means the system is perfectly utilized, with one
> runnable thread per CPU and nobody waiting. At two or more runnable
> tasks per CPU, the system is 100% overcommitted and the pressure
> average will indicate as much. From a utilization perspective this is
> a great state of course: no CPU cycles are being wasted, even when 50%
> of the threads were to go idle (and most workloads do vary). From the
> perspective of the individual job it's not great, however, and they
> might do better with more resources. Depending on what your priority
> is, an elevated "some" number may or may not require action.


Re: [PATCH 0/7] psi: pressure stall information for CPU, memory, and IO

2018-05-14 Thread Christopher Lameter
On Mon, 14 May 2018, Johannes Weiner wrote:

> Since I'm using the same model and infrastructure for memory and IO
> load as well, IMO it makes more sense to present them in a coherent
> interface instead of trying to retrofit and change the loadavg file,
> which might not even be possible.

Well, I keep looking at the loadavg output from numerous tools, and then in
my mind I divide by the number of processors, guess if any of the threads
would be doing I/O, and if I cannot figure that out, groan and run "vmstat"
for a while to figure it out.

Let's have some numbers there that make more sense, please.



Re: [PATCH 0/7] psi: pressure stall information for CPU, memory, and IO

2018-05-14 Thread Johannes Weiner
On Mon, May 14, 2018 at 03:39:33PM +0000, Christopher Lameter wrote:
> On Mon, 7 May 2018, Johannes Weiner wrote:
> 
> > What to make of this number? If CPU utilization is at 100% and CPU
> > pressure is 0, it means the system is perfectly utilized, with one
> > runnable thread per CPU and nobody waiting. At two or more runnable
> > tasks per CPU, the system is 100% overcommitted and the pressure
> > average will indicate as much. From a utilization perspective this is
> > a great state of course: no CPU cycles are being wasted, even when 50%
> > of the threads were to go idle (and most workloads do vary). From the
> > perspective of the individual job it's not great, however, and they
> > might do better with more resources. Depending on what your priority
> > is, an elevated "some" number may or may not require action.
> 
> This looks awfully similar to loadavg. Problem is that loadavg gets
> screwed up by tasks blocked waiting for I/O. Isn't there some way to fix
> loadavg instead?

Counting iowaiting tasks is one thing, but there are a few more things
that make it hard to use for telling the impact of CPU competition:

- It's not normalized to available CPU count. The loadavg in isolation
  doesn't mean anything, and you have to know the number of CPUs and
  any CPU bindings / restrictions in effect, which presents at least
  some difficulty when monitoring a big heterogeneous fleet.

- The way it's sampled makes it impossible to use for latencies. You
  could be mostly idle but periodically have herds of tasks competing
  for the CPU for short, low-latency operations. Even if we changed
  this in the implementation, you're still stuck with the interface
  that has...

- ...a short-term load window of 1m. This is generally fairly coarse
  for something that can be loaded and unloaded as abruptly as the CPU.

I'm trying to fix these with a portable way of aggregating multi-cpu
states, as well as tracking the true time spent in a state instead of
sampling it. Plus a smaller short-term window of 10s, but that's
almost irrelevant because I'm exporting the absolute state time clock
so you can calculate your own averages over any time window you want.
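
For illustration, here is a minimal sketch (in Python, matching the POC
script mentioned in the cover letter) of a consumer doing exactly that:
sample the absolute total counter twice and derive a pressure percentage
over a window of its own choosing. The file path and field layout follow
the example line quoted in this thread, and the assumption that total is
a stall-time counter in microseconds is just that, an assumption to be
checked against the patches.

import time

def read_total(path="/proc/pressure/cpu", wanted="some"):
    # Example line: "some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722"
    with open(path) as f:
        for line in f:
            if not line.startswith(wanted):
                continue
            fields = dict(kv.split("=") for kv in line.split()[1:])
            return int(fields["total"])
    raise RuntimeError("no %r line in %s" % (wanted, path))

WINDOW = 30  # seconds; any window you like

t0, stall0 = time.monotonic(), read_total()
time.sleep(WINDOW)
t1, stall1 = time.monotonic(), read_total()

# Assumption: total is absolute stall time in microseconds.
pct = 100.0 * ((stall1 - stall0) / 1e6) / (t1 - t0)
print("some CPU pressure over the last %ds: %.2f%%" % (WINDOW, pct))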

Since I'm using the same model and infrastructure for memory and IO
load as well, IMO it makes more sense to present them in a coherent
interface instead of trying to retrofit and change the loadavg file,
which might not even be possible.



Re: [PATCH 0/7] psi: pressure stall information for CPU, memory, and IO

2018-05-14 Thread Bart Van Assche

On 05/14/18 08:39, Christopher Lameter wrote:
> On Mon, 7 May 2018, Johannes Weiner wrote:
>
>> What to make of this number? If CPU utilization is at 100% and CPU
>> pressure is 0, it means the system is perfectly utilized, with one
>> runnable thread per CPU and nobody waiting. At two or more runnable
>> tasks per CPU, the system is 100% overcommitted and the pressure
>> average will indicate as much. From a utilization perspective this is
>> a great state of course: no CPU cycles are being wasted, even when 50%
>> of the threads were to go idle (and most workloads do vary). From the
>> perspective of the individual job it's not great, however, and they
>> might do better with more resources. Depending on what your priority
>> is, an elevated "some" number may or may not require action.
>
> This looks awfully similar to loadavg. Problem is that loadavg gets
> screwed up by tasks blocked waiting for I/O. Isn't there some way to fix
> loadavg instead?


The following article explains why it probably made sense in 1993 to 
include TASK_UNINTERRUPTIBLE in loadavg and also why this no longer 
makes sense today:


http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html

Bart.



Re: [PATCH 0/7] psi: pressure stall information for CPU, memory, and IO

2018-05-14 Thread Christopher Lameter
On Mon, 7 May 2018, Johannes Weiner wrote:

> What to make of this number? If CPU utilization is at 100% and CPU
> pressure is 0, it means the system is perfectly utilized, with one
> runnable thread per CPU and nobody waiting. At two or more runnable
> tasks per CPU, the system is 100% overcommitted and the pressure
> average will indicate as much. From a utilization perspective this is
> a great state of course: no CPU cycles are being wasted, even when 50%
> of the threads were to go idle (and most workloads do vary). From the
> perspective of the individual job it's not great, however, and they
> might do better with more resources. Depending on what your priority
> is, an elevated "some" number may or may not require action.

This looks awfully similar to loadavg. Problem is that loadavg gets
screwed up by tasks blocked waiting for I/O. Isn't there some way to fix
loadavg instead?




[PATCH 0/7] psi: pressure stall information for CPU, memory, and IO

2018-05-07 Thread Johannes Weiner
Hi,

I previously submitted a version of this patch set called "memdelay",
which translated delays from reclaim, swap-in, thrashing page cache
into a pressure percentage of lost walltime. I've since extended this
code to aggregate all delay states tracked by delayacct in order to
have generalized pressure/overcommit levels for CPU, memory, and IO.

There was feedback from Peter on the previous version that I have
incorporated as much as possible and as it still applies to this code:

- got rid of the extra lock in the sched callbacks; all task
  state changes we care about serialize through rq->lock

- got rid of ktime_get() inside the sched callbacks and
  switched time measuring to rq_clock()

- got rid of all divisions inside the sched callbacks,
  tracking everything natively in ns now

I also moved this stuff into existing sched/stat.h callbacks, so it
doesn't get in the way in sched/core.c, and of course moved the whole
thing behind CONFIG_PSI since not everyone is going to want it.

Real-world applications

Since the last posting, we've begun using the data collected by this
code quite extensively at Facebook, and with several success stories.

First we used it on systems that frequently locked up in low memory
situations. The reason this happens is that the OOM killer is
triggered by reclaim not being able to make forward progress, but with
fast flash devices there is *always* some clean and uptodate cache to
reclaim; the OOM killer never kicks in, even as tasks wait 80-90% of
the time faulting executables. There is no situation where this ever
makes sense in practice. We wrote a <100 line POC python script to
monitor memory pressure and kill stuff manually, way before such
pathological thrashing.
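
That script isn't reproduced in this posting; purely as an illustration of
the idea, a monitor along those lines might look like the sketch below. The
40% threshold, the 10-second confirmation period and the largest-RSS victim
policy are invented for the example, and the /proc/pressure/memory format it
reads is the one described under "How do you use this?" further down.

import os, signal, time

THRESHOLD = 40.0   # percent of walltime stalled; hypothetical value
HOLD = 10          # seconds the threshold must hold before acting

def mem_pressure_avg10():
    # First line of /proc/pressure/memory is the "some" line.
    with open("/proc/pressure/memory") as f:
        some = f.readline().split()
    fields = dict(kv.split("=") for kv in some[1:])
    return float(fields["avg10"])

def largest_rss_pid():
    best_pid, best_rss = None, -1
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open("/proc/%s/statm" % pid) as f:
                rss = int(f.read().split()[1])   # resident pages
        except (OSError, ValueError, IndexError):
            continue
        if rss > best_rss:
            best_pid, best_rss = int(pid), rss
    return best_pid

over_since = None
while True:
    if mem_pressure_avg10() > THRESHOLD:
        over_since = over_since or time.monotonic()
        if time.monotonic() - over_since >= HOLD:
            victim = largest_rss_pid()
            if victim and victim != os.getpid():
                os.kill(victim, signal.SIGKILL)
            over_since = None
    else:
        over_since = None
    time.sleep(1)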

We've since extended the python script into a more generic oomd that
we use all over the place, not just to avoid livelocks but also to
guarantee latency and throughput SLAs, since they're usually violated
way before the kernel OOM killer would ever kick in.

We also use the memory pressure info for loadshedding. Our batch job
infrastructure used to refuse new requests on heuristics based on RSS
and other existing VM metrics in an attempt to avoid OOM kills and
maximize utilization. Since it was still plagued by frequent OOM
kills, we switched it to shed load on psi memory pressure, which has
turned out to be a much better bellwether, and we managed to reduce
OOM kills drastically. Reducing the rate of OOM outages from the
worker pool raised its aggregate productivity, and we were able to
switch that service to smaller machines.
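
As a rough illustration of that kind of check, a load-shedding gate could be
as small as the sketch below: refuse new work while the 10-second memory
pressure average is elevated. The 20% threshold and the use of the "some"
line are arbitrary choices for the example; the file format is described
under "How do you use this?" further down.

def memory_pressure_avg10(path="/proc/pressure/memory"):
    # First line of the file is the "some" line.
    with open(path) as f:
        fields = dict(kv.split("=") for kv in f.readline().split()[1:])
    return float(fields["avg10"])

def should_accept_work(threshold=20.0):
    # Hypothetical admission check: shed load above the threshold.
    return memory_pressure_avg10() < threshold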

Lastly, we use cgroups to isolate a machine's main workload from
maintenance crap like package upgrades, logging, configuration, as
well as to prevent multiple workloads on a machine from stepping on
each other's toes. We were not able to do this properly without the
pressure metrics; we would see latency or bandwidth drops, but it
would often be hard or impossible to root-cause them post-mortem. We now
log and graph the pressure metrics for all containers in our fleet and
can trivially link service drops to resource pressure after the fact.

How do you use this?

A kernel with CONFIG_PSI=y will create a /proc/pressure directory with
3 files: cpu, memory, and io. If using cgroup2, cgroups will also have
cpu.pressure, memory.pressure and io.pressure files, which simply
calculate pressure at the cgroup level instead of system-wide.

The cpu file contains one line:

some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722

The averages give the percentage of walltime in which some tasks are
delayed on the runqueue while another task has the CPU. They're recent
averages over 10s, 1m, 5m windows, so you can tell short term trends
from long term ones, similarly to the load average.
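
For reference, a small parsing sketch: it turns a pressure file into a dict
of numbers and works the same for the system-wide /proc/pressure files and,
under cgroup2, for a group's cpu.pressure, memory.pressure and io.pressure
files. The field names follow the example line above; this is an
illustration, not part of the patch set.

def parse_pressure(path):
    # Returns e.g. {"some": {"avg10": 2.04, ..., "total": 157656722}, ...}
    result = {}
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            kind, *pairs = line.split()          # "some" or "full"
            fields = dict(kv.split("=") for kv in pairs)
            result[kind] = {
                "avg10": float(fields["avg10"]),
                "avg60": float(fields["avg60"]),
                "avg300": float(fields["avg300"]),
                "total": int(fields["total"]),
            }
    return result

if __name__ == "__main__":
    print(parse_pressure("/proc/pressure/cpu"))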

What to make of this number? If CPU utilization is at 100% and CPU
pressure is 0, it means the system is perfectly utilized, with one
runnable thread per CPU and nobody waiting. At two or more runnable
tasks per CPU, the system is 100% overcommitted and the pressure
average will indicate as much. From a utilization perspective this is
a great state of course: no CPU cycles are being wasted, even when 50%
of the threads were to go idle (and most workloads do vary). From the
perspective of the individual job it's not great, however, and they
might do better with more resources. Depending on what your priority
is, an elevated "some" number may or may not require action.

The memory file contains two lines:

some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828
full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258

The some line is the same as for cpu: the time in which at least one
task is stalled on the resource.

The full line, however, indicates time in which *nobody* is using the
CPU productively due to pressure: all non-idle tasks could be waiting
on thrashing cache simultaneously. It can also happen when a single
reclaimer occupies the CPU, 
