Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-24 Thread Johannes Weiner
On Thu, Jul 19, 2018 at 10:31:15PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 19, 2018 at 02:47:40PM -0400, Johannes Weiner wrote:
> > On Wed, Jul 18, 2018 at 02:03:18PM +0200, Peter Zijlstra wrote:
> > > On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > > > +   /* Update task counts according to the set/clear bitmasks */
> > > > +   for (to = 0; (bo = ffs(clear)); to += bo, clear >>= bo) {
> > > > +   int idx = to + (bo - 1);
> > > > +
> > > > +   if (tasks[idx] == 0 && !psi_bug) {
> > > > +   printk_deferred(KERN_ERR "psi: task underflow! 
> > > > cpu=%d idx=%d tasks=[%u %u %u] clear=%x set=%x\n",
> > > > +   cpu, idx, tasks[0], tasks[1], 
> > > > tasks[2],
> > > > +   clear, set);
> > > > +   psi_bug = 1;
> > > > +   }
> > > 
> > >   WARN_ONCE(!tasks[idx], ...);
> > 
> > It's just open-coded because of the printk_deferred, since this is
> > inside the scheduler.
> 
> Yeah, meh. There's ton of WARNs in the scheduler, WARNs should not
> trigger anyway.

This one in particular gave us quite a runaround. We had a subtle bug
in how psi processed task CPU migration that would only manifest with
hundreds of thousands of machine hours. When it triggered, instead of
the warning, we'd crash on a corrupted stack with a completely useless
crash dump - PC pointing to things that couldn't possibly trap etc.

So printk_deferred has been a lot more useful in those rare but
desparate cases ;-) Plus we keep the machine alive.


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-24 Thread Johannes Weiner
On Thu, Jul 19, 2018 at 10:31:15PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 19, 2018 at 02:47:40PM -0400, Johannes Weiner wrote:
> > On Wed, Jul 18, 2018 at 02:03:18PM +0200, Peter Zijlstra wrote:
> > > On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > > > +   /* Update task counts according to the set/clear bitmasks */
> > > > +   for (to = 0; (bo = ffs(clear)); to += bo, clear >>= bo) {
> > > > +   int idx = to + (bo - 1);
> > > > +
> > > > +   if (tasks[idx] == 0 && !psi_bug) {
> > > > +   printk_deferred(KERN_ERR "psi: task underflow! 
> > > > cpu=%d idx=%d tasks=[%u %u %u] clear=%x set=%x\n",
> > > > +   cpu, idx, tasks[0], tasks[1], 
> > > > tasks[2],
> > > > +   clear, set);
> > > > +   psi_bug = 1;
> > > > +   }
> > > 
> > >   WARN_ONCE(!tasks[idx], ...);
> > 
> > It's just open-coded because of the printk_deferred, since this is
> > inside the scheduler.
> 
> Yeah, meh. There's ton of WARNs in the scheduler, WARNs should not
> trigger anyway.

This one in particular gave us quite a runaround. We had a subtle bug
in how psi processed task CPU migration that would only manifest with
hundreds of thousands of machine hours. When it triggered, instead of
the warning, we'd crash on a corrupted stack with a completely useless
crash dump - PC pointing to things that couldn't possibly trap etc.

So printk_deferred has been a lot more useful in those rare but
desparate cases ;-) Plus we keep the machine alive.


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-20 Thread Peter Zijlstra
On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> +static bool psi_update_stats(struct psi_group *group)
> +{

> + for_each_online_cpu(cpu) {
> + struct psi_group_cpu *groupc = per_cpu_ptr(group->cpus, cpu);
> + unsigned long nonidle;
> +
> + if (!groupc->nonidle_time)
> + continue;
> +
> + nonidle = nsecs_to_jiffies(groupc->nonidle_time);
> + groupc->nonidle_time = 0;
> + nonidle_total += nonidle;
> +
> + for (r = 0; r < NR_PSI_RESOURCES; r++) {
> + struct psi_resource *res = >res[r];
> +
> + some[r] += (res->times[0] + res->times[1]) * nonidle;
> + full[r] += res->times[1] * nonidle;
> +
> + /* It's racy, but we can tolerate some error */
> + res->times[0] = 0;
> + res->times[1] = 0;
> + }
> + }

An alternative for this, that also allows that ondemand update, but
without spamming the rq->lock would be something like:

struct psi_group_cpu {
u32 tasks[3];
u32 cpu_state : 2;
u32 mem_state : 2;
u32 io_state  : 2;
u32 :0;

u64 last_update_time;

u32 nonidle;
u32 full[2];
u32 some[3];
} cacheline_aligned_in_smp;

/* Allocate _2_ copies */
DEFINE_PER_CPU_ALIGNED_SHARED(struct psi_group_cpu[2], psi_cpus);

struct psi_group global_psi = {
.cpus = _cpus[0],
};


u64 sums[6] = { 0, };

for_each_possible_cpu(cpu) {
struct psi_group_cpu *pgc = per_cpu_ptr(group->cpus, cpu);
u32 *active, *shadow;

active = [0].nonidle;
shadow = [1].nonidle;

/*
 * Compare the active count to the shadow count
 * if different, compute the delta and update the shadow
 * copy.
 * This only writes to the shadow copy (separate line)
 * and leaves the active a read-only access.
 */
for (i = 0; i < 6; i++) {
u32 old = READ_ONCE(shadow[i]);
u32 new = READ_ONCE(active[i]);

delta = (new - old);
if (!delta) {
if (!i)
goto next;
continue;
}

WRITE_ONCE(shadow[i], new);

sums[i] += delta;
}
next:   ;
}


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-20 Thread Peter Zijlstra
On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> +static bool psi_update_stats(struct psi_group *group)
> +{

> + for_each_online_cpu(cpu) {
> + struct psi_group_cpu *groupc = per_cpu_ptr(group->cpus, cpu);
> + unsigned long nonidle;
> +
> + if (!groupc->nonidle_time)
> + continue;
> +
> + nonidle = nsecs_to_jiffies(groupc->nonidle_time);
> + groupc->nonidle_time = 0;
> + nonidle_total += nonidle;
> +
> + for (r = 0; r < NR_PSI_RESOURCES; r++) {
> + struct psi_resource *res = >res[r];
> +
> + some[r] += (res->times[0] + res->times[1]) * nonidle;
> + full[r] += res->times[1] * nonidle;
> +
> + /* It's racy, but we can tolerate some error */
> + res->times[0] = 0;
> + res->times[1] = 0;
> + }
> + }

An alternative for this, that also allows that ondemand update, but
without spamming the rq->lock would be something like:

struct psi_group_cpu {
u32 tasks[3];
u32 cpu_state : 2;
u32 mem_state : 2;
u32 io_state  : 2;
u32 :0;

u64 last_update_time;

u32 nonidle;
u32 full[2];
u32 some[3];
} cacheline_aligned_in_smp;

/* Allocate _2_ copies */
DEFINE_PER_CPU_ALIGNED_SHARED(struct psi_group_cpu[2], psi_cpus);

struct psi_group global_psi = {
.cpus = _cpus[0],
};


u64 sums[6] = { 0, };

for_each_possible_cpu(cpu) {
struct psi_group_cpu *pgc = per_cpu_ptr(group->cpus, cpu);
u32 *active, *shadow;

active = [0].nonidle;
shadow = [1].nonidle;

/*
 * Compare the active count to the shadow count
 * if different, compute the delta and update the shadow
 * copy.
 * This only writes to the shadow copy (separate line)
 * and leaves the active a read-only access.
 */
for (i = 0; i < 6; i++) {
u32 old = READ_ONCE(shadow[i]);
u32 new = READ_ONCE(active[i]);

delta = (new - old);
if (!delta) {
if (!i)
goto next;
continue;
}

WRITE_ONCE(shadow[i], new);

sums[i] += delta;
}
next:   ;
}


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-20 Thread Johannes Weiner
On Wed, Jul 18, 2018 at 06:06:23PM -0400, Johannes Weiner wrote:
> On Tue, Jul 17, 2018 at 05:01:42PM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > > +static bool psi_update_stats(struct psi_group *group)
> > > +{
> > > + u64 some[NR_PSI_RESOURCES] = { 0, };
> > > + u64 full[NR_PSI_RESOURCES] = { 0, };
> > > + unsigned long nonidle_total = 0;
> > > + unsigned long missed_periods;
> > > + unsigned long expires;
> > > + int cpu;
> > > + int r;
> > > +
> > > + mutex_lock(>stat_lock);
> > > +
> > > + /*
> > > +  * Collect the per-cpu time buckets and average them into a
> > > +  * single time sample that is normalized to wallclock time.
> > > +  *
> > > +  * For averaging, each CPU is weighted by its non-idle time in
> > > +  * the sampling period. This eliminates artifacts from uneven
> > > +  * loading, or even entirely idle CPUs.
> > > +  *
> > > +  * We could pin the online CPUs here, but the noise introduced
> > > +  * by missing up to one sample period from CPUs that are going
> > > +  * away shouldn't matter in practice - just like the noise of
> > > +  * previously offlined CPUs returning with a non-zero sample.
> > 
> > But why!? cpuu_read_lock() is neither expensive nor complicated. So why
> > try and avoid it?
> 
> Hm, I don't feel strongly about it either way. I'll add it.

Thinking more about it, this really doesn't buy anything. Whether a
CPU comes online or goes offline during the loop is no different than
that happening right before grabbing the cpus_read_lock(). If we see a
sample from a CPU, we incorporate it, if not we don't.

So it's not so much avoidance as it's lack of reason for synchronizing
against hotplugging in any fashion. The comment is wrong. This noise
it points to is there with and without the lock, and the only way to
avoid it would be to do either for_each_possible_cpu() in that loop or
having a hotplug callback that would flush the offlining CPU bucket
into a holding place for missed dead cpu samples that the aggregation
loop checks every time. Neither of these seem remotely worth the cost.

I'll fix the comment instead.


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-20 Thread Johannes Weiner
On Wed, Jul 18, 2018 at 06:06:23PM -0400, Johannes Weiner wrote:
> On Tue, Jul 17, 2018 at 05:01:42PM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > > +static bool psi_update_stats(struct psi_group *group)
> > > +{
> > > + u64 some[NR_PSI_RESOURCES] = { 0, };
> > > + u64 full[NR_PSI_RESOURCES] = { 0, };
> > > + unsigned long nonidle_total = 0;
> > > + unsigned long missed_periods;
> > > + unsigned long expires;
> > > + int cpu;
> > > + int r;
> > > +
> > > + mutex_lock(>stat_lock);
> > > +
> > > + /*
> > > +  * Collect the per-cpu time buckets and average them into a
> > > +  * single time sample that is normalized to wallclock time.
> > > +  *
> > > +  * For averaging, each CPU is weighted by its non-idle time in
> > > +  * the sampling period. This eliminates artifacts from uneven
> > > +  * loading, or even entirely idle CPUs.
> > > +  *
> > > +  * We could pin the online CPUs here, but the noise introduced
> > > +  * by missing up to one sample period from CPUs that are going
> > > +  * away shouldn't matter in practice - just like the noise of
> > > +  * previously offlined CPUs returning with a non-zero sample.
> > 
> > But why!? cpuu_read_lock() is neither expensive nor complicated. So why
> > try and avoid it?
> 
> Hm, I don't feel strongly about it either way. I'll add it.

Thinking more about it, this really doesn't buy anything. Whether a
CPU comes online or goes offline during the loop is no different than
that happening right before grabbing the cpus_read_lock(). If we see a
sample from a CPU, we incorporate it, if not we don't.

So it's not so much avoidance as it's lack of reason for synchronizing
against hotplugging in any fashion. The comment is wrong. This noise
it points to is there with and without the lock, and the only way to
avoid it would be to do either for_each_possible_cpu() in that loop or
having a hotplug callback that would flush the offlining CPU bucket
into a holding place for missed dead cpu samples that the aggregation
loop checks every time. Neither of these seem remotely worth the cost.

I'll fix the comment instead.


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-19 Thread Peter Zijlstra
On Thu, Jul 19, 2018 at 02:47:40PM -0400, Johannes Weiner wrote:
> On Wed, Jul 18, 2018 at 02:03:18PM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > > + /* Update task counts according to the set/clear bitmasks */
> > > + for (to = 0; (bo = ffs(clear)); to += bo, clear >>= bo) {
> > > + int idx = to + (bo - 1);
> > > +
> > > + if (tasks[idx] == 0 && !psi_bug) {
> > > + printk_deferred(KERN_ERR "psi: task underflow! cpu=%d 
> > > idx=%d tasks=[%u %u %u] clear=%x set=%x\n",
> > > + cpu, idx, tasks[0], tasks[1], tasks[2],
> > > + clear, set);
> > > + psi_bug = 1;
> > > + }
> > 
> > WARN_ONCE(!tasks[idx], ...);
> 
> It's just open-coded because of the printk_deferred, since this is
> inside the scheduler.

Yeah, meh. There's ton of WARNs in the scheduler, WARNs should not
trigger anyway. But yeah printk is crap, which is why I don't use printk
anymore:

  https://lkml.kernel.org/r/20170928121823.430053...@infradead.org




Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-19 Thread Peter Zijlstra
On Thu, Jul 19, 2018 at 02:47:40PM -0400, Johannes Weiner wrote:
> On Wed, Jul 18, 2018 at 02:03:18PM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > > + /* Update task counts according to the set/clear bitmasks */
> > > + for (to = 0; (bo = ffs(clear)); to += bo, clear >>= bo) {
> > > + int idx = to + (bo - 1);
> > > +
> > > + if (tasks[idx] == 0 && !psi_bug) {
> > > + printk_deferred(KERN_ERR "psi: task underflow! cpu=%d 
> > > idx=%d tasks=[%u %u %u] clear=%x set=%x\n",
> > > + cpu, idx, tasks[0], tasks[1], tasks[2],
> > > + clear, set);
> > > + psi_bug = 1;
> > > + }
> > 
> > WARN_ONCE(!tasks[idx], ...);
> 
> It's just open-coded because of the printk_deferred, since this is
> inside the scheduler.

Yeah, meh. There's ton of WARNs in the scheduler, WARNs should not
trigger anyway. But yeah printk is crap, which is why I don't use printk
anymore:

  https://lkml.kernel.org/r/20170928121823.430053...@infradead.org




Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-19 Thread Johannes Weiner
On Wed, Jul 18, 2018 at 02:03:18PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > +   /* Update task counts according to the set/clear bitmasks */
> > +   for (to = 0; (bo = ffs(clear)); to += bo, clear >>= bo) {
> > +   int idx = to + (bo - 1);
> > +
> > +   if (tasks[idx] == 0 && !psi_bug) {
> > +   printk_deferred(KERN_ERR "psi: task underflow! cpu=%d 
> > idx=%d tasks=[%u %u %u] clear=%x set=%x\n",
> > +   cpu, idx, tasks[0], tasks[1], tasks[2],
> > +   clear, set);
> > +   psi_bug = 1;
> > +   }
> 
>   WARN_ONCE(!tasks[idx], ...);

It's just open-coded because of the printk_deferred, since this is
inside the scheduler.

It actually used to be a straight-up WARN_ONCE() in older
versions. Recursive scheduling bugs are no fun to debug ;)


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-19 Thread Johannes Weiner
On Wed, Jul 18, 2018 at 02:03:18PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > +   /* Update task counts according to the set/clear bitmasks */
> > +   for (to = 0; (bo = ffs(clear)); to += bo, clear >>= bo) {
> > +   int idx = to + (bo - 1);
> > +
> > +   if (tasks[idx] == 0 && !psi_bug) {
> > +   printk_deferred(KERN_ERR "psi: task underflow! cpu=%d 
> > idx=%d tasks=[%u %u %u] clear=%x set=%x\n",
> > +   cpu, idx, tasks[0], tasks[1], tasks[2],
> > +   clear, set);
> > +   psi_bug = 1;
> > +   }
> 
>   WARN_ONCE(!tasks[idx], ...);

It's just open-coded because of the printk_deferred, since this is
inside the scheduler.

It actually used to be a straight-up WARN_ONCE() in older
versions. Recursive scheduling bugs are no fun to debug ;)


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-19 Thread Johannes Weiner
On Thu, Jul 19, 2018 at 08:08:20AM -0700, Linus Torvalds wrote:
> On Wed, Jul 18, 2018 at 5:03 AM Peter Zijlstra  wrote:
> >
> > And as said before, we can compress the state from 12 bytes, to 6 bits
> > (or 1 byte), giving another 11 bytes for 59 bytes free.
> >
> > Leaving us just 5 bytes short of needing a single cacheline :/
> 
> Do you actually need 64 bits for the times?
> 
> That's the big cost. And it seems ridiculous, if you actually care about size.
> 
> You already have a 64-bit start time. Everything else is some
> cumulative relative time. Do those really need 64-bit and nanosecond
> resolution?
> 
> Maybe a 32-bit microsecond would be ok - would you ever account more
> than 35 minutes of anything without starting anew?

D'oh, you're right, the per-cpu buckets don't need to be this big at
all. In fact, we flush those deltas out every 2 seconds when there is
activity to maintain the running averages. Since we get 4.2s worth of
nanoseconds into a u32, we don't even need to divide in the hotpath.

Something along the lines of this here should work:

static void psi_group_change(struct psi_group *group, int cpu, u64 now,
 unsigned int clear, unsigned int set)
{
struct psi_group_cpu *groupc;
unsigned int *tasks;
unsigned int t;
u32 delta;

groupc = per_cpu_ptr(group->cpus, cpu);
tasks = groupc->tasks;

/* Time since last task change on this runqueue */
delta = now - groupc->last_time;
groupc->last_time = now;

/* Tasks waited for IO? */
if (tasks[NR_IOWAIT]) {
if (!tasks[NR_RUNNING])
groupc->full_time[PSI_IO] += delta;
else
groupc->some_time[PSI_IO] += delta;
}

/* Tasks waited for memory? */
if (tasks[NR_MEMSTALL]) {
if (!tasks[NR_RUNNING] ||
(cpu_curr(cpu)->flags & PF_MEMSTALL))
groupc->full_time[PSI_MEM] += delta;
else
groupc->some_time[PSI_MEM] += delta;
}

/* Tasks waited for the CPU? */
if (tasks[NR_RUNNING] > 1)
groupc->some_time[PSI_CPU] += delta;

/* Tasks were generally non-idle? To weigh the CPU in summaries */
if (tasks[NR_RUNNING] || tasks[NR_IOWAIT] || tasks[NR_MEMSTALL])
groupc->nonidle_time += delta;

/* Update task counts according to the set/clear bitmasks */
for (t = 0; clear; clear &= ~(1 << t), t++)
if (clear & (1 << t))
groupc->tasks[t]--;
for (t = 0; set; set &= ~(1 << t), t++)
if (set & (1 << t))
groupc->tasks[t]++;

/* Kick the stats aggregation worker if it's gone to sleep */
if (!delayed_work_pending(>clock_work))
schedule_delayed_work(>clock_work, PSI_FREQ);
}

And then we can pack it down to one cacheline:

struct psi_group_cpu {
/* States of the tasks belonging to this group */
unsigned int tasks[NR_PSI_TASK_COUNTS]; // 3

/* Time sampling bucket for pressure states - no FULL for CPU */
u32 some_time[NR_PSI_RESOURCES];
u32 full_time[NR_PSI_RESOURCES - 1];

/* Time sampling bucket for non-idle state (ns) */
u32 nonidle_time;

/* Time of last task change in this group (rq_clock) */
u64 last_time;
};

I'm going to go test with this.

Thanks


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-19 Thread Johannes Weiner
On Thu, Jul 19, 2018 at 08:08:20AM -0700, Linus Torvalds wrote:
> On Wed, Jul 18, 2018 at 5:03 AM Peter Zijlstra  wrote:
> >
> > And as said before, we can compress the state from 12 bytes, to 6 bits
> > (or 1 byte), giving another 11 bytes for 59 bytes free.
> >
> > Leaving us just 5 bytes short of needing a single cacheline :/
> 
> Do you actually need 64 bits for the times?
> 
> That's the big cost. And it seems ridiculous, if you actually care about size.
> 
> You already have a 64-bit start time. Everything else is some
> cumulative relative time. Do those really need 64-bit and nanosecond
> resolution?
> 
> Maybe a 32-bit microsecond would be ok - would you ever account more
> than 35 minutes of anything without starting anew?

D'oh, you're right, the per-cpu buckets don't need to be this big at
all. In fact, we flush those deltas out every 2 seconds when there is
activity to maintain the running averages. Since we get 4.2s worth of
nanoseconds into a u32, we don't even need to divide in the hotpath.

Something along the lines of this here should work:

static void psi_group_change(struct psi_group *group, int cpu, u64 now,
 unsigned int clear, unsigned int set)
{
struct psi_group_cpu *groupc;
unsigned int *tasks;
unsigned int t;
u32 delta;

groupc = per_cpu_ptr(group->cpus, cpu);
tasks = groupc->tasks;

/* Time since last task change on this runqueue */
delta = now - groupc->last_time;
groupc->last_time = now;

/* Tasks waited for IO? */
if (tasks[NR_IOWAIT]) {
if (!tasks[NR_RUNNING])
groupc->full_time[PSI_IO] += delta;
else
groupc->some_time[PSI_IO] += delta;
}

/* Tasks waited for memory? */
if (tasks[NR_MEMSTALL]) {
if (!tasks[NR_RUNNING] ||
(cpu_curr(cpu)->flags & PF_MEMSTALL))
groupc->full_time[PSI_MEM] += delta;
else
groupc->some_time[PSI_MEM] += delta;
}

/* Tasks waited for the CPU? */
if (tasks[NR_RUNNING] > 1)
groupc->some_time[PSI_CPU] += delta;

/* Tasks were generally non-idle? To weigh the CPU in summaries */
if (tasks[NR_RUNNING] || tasks[NR_IOWAIT] || tasks[NR_MEMSTALL])
groupc->nonidle_time += delta;

/* Update task counts according to the set/clear bitmasks */
for (t = 0; clear; clear &= ~(1 << t), t++)
if (clear & (1 << t))
groupc->tasks[t]--;
for (t = 0; set; set &= ~(1 << t), t++)
if (set & (1 << t))
groupc->tasks[t]++;

/* Kick the stats aggregation worker if it's gone to sleep */
if (!delayed_work_pending(>clock_work))
schedule_delayed_work(>clock_work, PSI_FREQ);
}

And then we can pack it down to one cacheline:

struct psi_group_cpu {
/* States of the tasks belonging to this group */
unsigned int tasks[NR_PSI_TASK_COUNTS]; // 3

/* Time sampling bucket for pressure states - no FULL for CPU */
u32 some_time[NR_PSI_RESOURCES];
u32 full_time[NR_PSI_RESOURCES - 1];

/* Time sampling bucket for non-idle state (ns) */
u32 nonidle_time;

/* Time of last task change in this group (rq_clock) */
u64 last_time;
};

I'm going to go test with this.

Thanks


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-19 Thread Linus Torvalds
On Wed, Jul 18, 2018 at 5:03 AM Peter Zijlstra  wrote:
>
> And as said before, we can compress the state from 12 bytes, to 6 bits
> (or 1 byte), giving another 11 bytes for 59 bytes free.
>
> Leaving us just 5 bytes short of needing a single cacheline :/

Do you actually need 64 bits for the times?

That's the big cost. And it seems ridiculous, if you actually care about size.

You already have a 64-bit start time. Everything else is some
cumulative relative time. Do those really need 64-bit and nanosecond
resolution?

Maybe a 32-bit microsecond would be ok - would you ever account more
than 35 minutes of anything without starting anew?

 Linus


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-19 Thread Linus Torvalds
On Wed, Jul 18, 2018 at 5:03 AM Peter Zijlstra  wrote:
>
> And as said before, we can compress the state from 12 bytes, to 6 bits
> (or 1 byte), giving another 11 bytes for 59 bytes free.
>
> Leaving us just 5 bytes short of needing a single cacheline :/

Do you actually need 64 bits for the times?

That's the big cost. And it seems ridiculous, if you actually care about size.

You already have a 64-bit start time. Everything else is some
cumulative relative time. Do those really need 64-bit and nanosecond
resolution?

Maybe a 32-bit microsecond would be ok - would you ever account more
than 35 minutes of anything without starting anew?

 Linus


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-19 Thread Peter Zijlstra
On Wed, Jul 18, 2018 at 06:36:44PM -0400, Johannes Weiner wrote:
> On Wed, Jul 18, 2018 at 02:03:18PM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > > + /* Time in which tasks wait for the CPU */
> > > + state = PSI_NONE;
> > > + if (tasks[NR_RUNNING] > 1)
> > > + state = PSI_SOME;
> > > + time_state(>res[PSI_CPU], state, now);
> > > +
> > > + /* Time in which tasks wait for memory */
> > > + state = PSI_NONE;
> > > + if (tasks[NR_MEMSTALL]) {
> > > + if (!tasks[NR_RUNNING] ||
> > > + (cpu_curr(cpu)->flags & PF_MEMSTALL))
> > 
> > I'm confused, why do we care if the current tasks is MEMSTALL or not?
> 
> We want to know whether we're losing CPU potential because of a lack
> of memory. That can happen when the task waits for refaults and the
> CPU goes idle, but it can also happen when the CPU is performing
> reclaim.
> 
> If the task waits for refaults and something else is runnable, we're
> not losing CPU potential. But if the task performs reclaim and uses
> the CPU, nothing else can do productive work on that CPU.

Right, this is because MEMSTALL is not just blocking (as per that other
sub-thread).

This is really unfortunate, because it means the state is not a simple
function of the task counts.


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-19 Thread Peter Zijlstra
On Wed, Jul 18, 2018 at 06:36:44PM -0400, Johannes Weiner wrote:
> On Wed, Jul 18, 2018 at 02:03:18PM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > > + /* Time in which tasks wait for the CPU */
> > > + state = PSI_NONE;
> > > + if (tasks[NR_RUNNING] > 1)
> > > + state = PSI_SOME;
> > > + time_state(>res[PSI_CPU], state, now);
> > > +
> > > + /* Time in which tasks wait for memory */
> > > + state = PSI_NONE;
> > > + if (tasks[NR_MEMSTALL]) {
> > > + if (!tasks[NR_RUNNING] ||
> > > + (cpu_curr(cpu)->flags & PF_MEMSTALL))
> > 
> > I'm confused, why do we care if the current tasks is MEMSTALL or not?
> 
> We want to know whether we're losing CPU potential because of a lack
> of memory. That can happen when the task waits for refaults and the
> CPU goes idle, but it can also happen when the CPU is performing
> reclaim.
> 
> If the task waits for refaults and something else is runnable, we're
> not losing CPU potential. But if the task performs reclaim and uses
> the CPU, nothing else can do productive work on that CPU.

Right, this is because MEMSTALL is not just blocking (as per that other
sub-thread).

This is really unfortunate, because it means the state is not a simple
function of the task counts.


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-19 Thread Peter Zijlstra
On Thu, Jul 19, 2018 at 08:50:38AM -0400, Johannes Weiner wrote:
> On Thu, Jul 19, 2018 at 11:26:14AM +0200, Peter Zijlstra wrote:
> > On Wed, Jul 18, 2018 at 02:03:18PM +0200, Peter Zijlstra wrote:
> > 
> > > Leaving us just 5 bytes short of needing a single cacheline :/
> > > 
> > > struct ponies {
> > > unsigned int   tasks[3];  
> > >/* 012 */
> > > unsigned int   cpu_state:2;   
> > >/*12:30  4 */
> > > unsigned int   io_state:2;
> > >/*12:28  4 */
> > > unsigned int   mem_state:2;   
> > >/*12:26  4 */
> > > 
> > > /* XXX 26 bits hole, try to pack */
> > > 
> > > /* typedef u64 */ long long unsigned int last_time;   
> > >/*16 8 */
> > > /* typedef u64 */ long long unsigned int some_time[3];
> > >/*2424 */
> > > /* typedef u64 */ long long unsigned int full_time[2];
> > >/*4816 */
> > > /* --- cacheline 1 boundary (64 bytes) --- */
> > > /* typedef u64 */ long long unsigned int nonidle_time;
> > >/*64 8 */
> > > 
> > > /* size: 72, cachelines: 2, members: 8 */
> > > /* bit holes: 1, sum bit holes: 26 bits */
> > > /* last cacheline: 8 bytes */
> > > };
> > > 
> > > ARGGH!
> > 
> > It _might_ be possible to use curr->se.exec_start for last_time if you
> > very carefully audit and place the hooks. I've not gone through it in
> > detail, but it might just work.
> 
> Hnngg, and chop off an entire cacheline...

Yes.. a worthy goal :-)

> But don't we flush that delta out and update the timestamp on every
> tick?

Indeed.

> entity_tick() does update_curr(). That might be too expensive :(

Well, since you already do all this accounting on every enqueue/dequeue,
this can run many thousands of times per tick already, so once per tick
doesn't sound bad.

However, I just realized this might not in fact work, because
curr->se.exec_start is per task, and you really want something per-cpu
for this.

Bah, if only perf had a useful tool to report on data layout instead of
this c2c crap.. :-( The thinking being that we could maybe find a
usage-hole (a data member that is not in fact used) near something we
already touch for writing. 






Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-19 Thread Peter Zijlstra
On Thu, Jul 19, 2018 at 08:50:38AM -0400, Johannes Weiner wrote:
> On Thu, Jul 19, 2018 at 11:26:14AM +0200, Peter Zijlstra wrote:
> > On Wed, Jul 18, 2018 at 02:03:18PM +0200, Peter Zijlstra wrote:
> > 
> > > Leaving us just 5 bytes short of needing a single cacheline :/
> > > 
> > > struct ponies {
> > > unsigned int   tasks[3];  
> > >/* 012 */
> > > unsigned int   cpu_state:2;   
> > >/*12:30  4 */
> > > unsigned int   io_state:2;
> > >/*12:28  4 */
> > > unsigned int   mem_state:2;   
> > >/*12:26  4 */
> > > 
> > > /* XXX 26 bits hole, try to pack */
> > > 
> > > /* typedef u64 */ long long unsigned int last_time;   
> > >/*16 8 */
> > > /* typedef u64 */ long long unsigned int some_time[3];
> > >/*2424 */
> > > /* typedef u64 */ long long unsigned int full_time[2];
> > >/*4816 */
> > > /* --- cacheline 1 boundary (64 bytes) --- */
> > > /* typedef u64 */ long long unsigned int nonidle_time;
> > >/*64 8 */
> > > 
> > > /* size: 72, cachelines: 2, members: 8 */
> > > /* bit holes: 1, sum bit holes: 26 bits */
> > > /* last cacheline: 8 bytes */
> > > };
> > > 
> > > ARGGH!
> > 
> > It _might_ be possible to use curr->se.exec_start for last_time if you
> > very carefully audit and place the hooks. I've not gone through it in
> > detail, but it might just work.
> 
> Hnngg, and chop off an entire cacheline...

Yes.. a worthy goal :-)

> But don't we flush that delta out and update the timestamp on every
> tick?

Indeed.

> entity_tick() does update_curr(). That might be too expensive :(

Well, since you already do all this accounting on every enqueue/dequeue,
this can run many thousands of times per tick already, so once per tick
doesn't sound bad.

However, I just realized this might not in fact work, because
curr->se.exec_start is per task, and you really want something per-cpu
for this.

Bah, if only perf had a useful tool to report on data layout instead of
this c2c crap.. :-( The thinking being that we could maybe find a
usage-hole (a data member that is not in fact used) near something we
already touch for writing. 






Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-19 Thread Johannes Weiner
On Thu, Jul 19, 2018 at 11:26:14AM +0200, Peter Zijlstra wrote:
> On Wed, Jul 18, 2018 at 02:03:18PM +0200, Peter Zijlstra wrote:
> 
> > Leaving us just 5 bytes short of needing a single cacheline :/
> > 
> > struct ponies {
> > unsigned int   tasks[3];
> >  /* 012 */
> > unsigned int   cpu_state:2; 
> >  /*12:30  4 */
> > unsigned int   io_state:2;  
> >  /*12:28  4 */
> > unsigned int   mem_state:2; 
> >  /*12:26  4 */
> > 
> > /* XXX 26 bits hole, try to pack */
> > 
> > /* typedef u64 */ long long unsigned int last_time; 
> >  /*16 8 */
> > /* typedef u64 */ long long unsigned int some_time[3];  
> >  /*2424 */
> > /* typedef u64 */ long long unsigned int full_time[2];  
> >  /*4816 */
> > /* --- cacheline 1 boundary (64 bytes) --- */
> > /* typedef u64 */ long long unsigned int nonidle_time;  
> >  /*64 8 */
> > 
> > /* size: 72, cachelines: 2, members: 8 */
> > /* bit holes: 1, sum bit holes: 26 bits */
> > /* last cacheline: 8 bytes */
> > };
> > 
> > ARGGH!
> 
> It _might_ be possible to use curr->se.exec_start for last_time if you
> very carefully audit and place the hooks. I've not gone through it in
> detail, but it might just work.

Hnngg, and chop off an entire cacheline...

But don't we flush that delta out and update the timestamp on every
tick? entity_tick() does update_curr(). That might be too expensive :(


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-19 Thread Johannes Weiner
On Thu, Jul 19, 2018 at 11:26:14AM +0200, Peter Zijlstra wrote:
> On Wed, Jul 18, 2018 at 02:03:18PM +0200, Peter Zijlstra wrote:
> 
> > Leaving us just 5 bytes short of needing a single cacheline :/
> > 
> > struct ponies {
> > unsigned int   tasks[3];
> >  /* 012 */
> > unsigned int   cpu_state:2; 
> >  /*12:30  4 */
> > unsigned int   io_state:2;  
> >  /*12:28  4 */
> > unsigned int   mem_state:2; 
> >  /*12:26  4 */
> > 
> > /* XXX 26 bits hole, try to pack */
> > 
> > /* typedef u64 */ long long unsigned int last_time; 
> >  /*16 8 */
> > /* typedef u64 */ long long unsigned int some_time[3];  
> >  /*2424 */
> > /* typedef u64 */ long long unsigned int full_time[2];  
> >  /*4816 */
> > /* --- cacheline 1 boundary (64 bytes) --- */
> > /* typedef u64 */ long long unsigned int nonidle_time;  
> >  /*64 8 */
> > 
> > /* size: 72, cachelines: 2, members: 8 */
> > /* bit holes: 1, sum bit holes: 26 bits */
> > /* last cacheline: 8 bytes */
> > };
> > 
> > ARGGH!
> 
> It _might_ be possible to use curr->se.exec_start for last_time if you
> very carefully audit and place the hooks. I've not gone through it in
> detail, but it might just work.

Hnngg, and chop off an entire cacheline...

But don't we flush that delta out and update the timestamp on every
tick? entity_tick() does update_curr(). That might be too expensive :(


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-19 Thread Peter Zijlstra
On Wed, Jul 18, 2018 at 02:03:18PM +0200, Peter Zijlstra wrote:

> Leaving us just 5 bytes short of needing a single cacheline :/
> 
> struct ponies {
> unsigned int   tasks[3];  
>/* 012 */
> unsigned int   cpu_state:2;   
>/*12:30  4 */
> unsigned int   io_state:2;
>/*12:28  4 */
> unsigned int   mem_state:2;   
>/*12:26  4 */
> 
> /* XXX 26 bits hole, try to pack */
> 
> /* typedef u64 */ long long unsigned int last_time;   
>/*16 8 */
> /* typedef u64 */ long long unsigned int some_time[3];
>/*2424 */
> /* typedef u64 */ long long unsigned int full_time[2];
>/*4816 */
> /* --- cacheline 1 boundary (64 bytes) --- */
> /* typedef u64 */ long long unsigned int nonidle_time;
>/*64 8 */
> 
> /* size: 72, cachelines: 2, members: 8 */
> /* bit holes: 1, sum bit holes: 26 bits */
> /* last cacheline: 8 bytes */
> };
> 
> ARGGH!

It _might_ be possible to use curr->se.exec_start for last_time if you
very carefully audit and place the hooks. I've not gone through it in
detail, but it might just work.


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-19 Thread Peter Zijlstra
On Wed, Jul 18, 2018 at 02:03:18PM +0200, Peter Zijlstra wrote:

> Leaving us just 5 bytes short of needing a single cacheline :/
> 
> struct ponies {
> unsigned int   tasks[3];  
>/* 012 */
> unsigned int   cpu_state:2;   
>/*12:30  4 */
> unsigned int   io_state:2;
>/*12:28  4 */
> unsigned int   mem_state:2;   
>/*12:26  4 */
> 
> /* XXX 26 bits hole, try to pack */
> 
> /* typedef u64 */ long long unsigned int last_time;   
>/*16 8 */
> /* typedef u64 */ long long unsigned int some_time[3];
>/*2424 */
> /* typedef u64 */ long long unsigned int full_time[2];
>/*4816 */
> /* --- cacheline 1 boundary (64 bytes) --- */
> /* typedef u64 */ long long unsigned int nonidle_time;
>/*64 8 */
> 
> /* size: 72, cachelines: 2, members: 8 */
> /* bit holes: 1, sum bit holes: 26 bits */
> /* last cacheline: 8 bytes */
> };
> 
> ARGGH!

It _might_ be possible to use curr->se.exec_start for last_time if you
very carefully audit and place the hooks. I've not gone through it in
detail, but it might just work.


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-18 Thread Johannes Weiner
On Wed, Jul 18, 2018 at 02:03:18PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > +   /* Time in which tasks wait for the CPU */
> > +   state = PSI_NONE;
> > +   if (tasks[NR_RUNNING] > 1)
> > +   state = PSI_SOME;
> > +   time_state(>res[PSI_CPU], state, now);
> > +
> > +   /* Time in which tasks wait for memory */
> > +   state = PSI_NONE;
> > +   if (tasks[NR_MEMSTALL]) {
> > +   if (!tasks[NR_RUNNING] ||
> > +   (cpu_curr(cpu)->flags & PF_MEMSTALL))
> 
> I'm confused, why do we care if the current tasks is MEMSTALL or not?

We want to know whether we're losing CPU potential because of a lack
of memory. That can happen when the task waits for refaults and the
CPU goes idle, but it can also happen when the CPU is performing
reclaim.

If the task waits for refaults and something else is runnable, we're
not losing CPU potential. But if the task performs reclaim and uses
the CPU, nothing else can do productive work on that CPU.


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-18 Thread Johannes Weiner
On Wed, Jul 18, 2018 at 02:03:18PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > +   /* Time in which tasks wait for the CPU */
> > +   state = PSI_NONE;
> > +   if (tasks[NR_RUNNING] > 1)
> > +   state = PSI_SOME;
> > +   time_state(>res[PSI_CPU], state, now);
> > +
> > +   /* Time in which tasks wait for memory */
> > +   state = PSI_NONE;
> > +   if (tasks[NR_MEMSTALL]) {
> > +   if (!tasks[NR_RUNNING] ||
> > +   (cpu_curr(cpu)->flags & PF_MEMSTALL))
> 
> I'm confused, why do we care if the current tasks is MEMSTALL or not?

We want to know whether we're losing CPU potential because of a lack
of memory. That can happen when the task waits for refaults and the
CPU goes idle, but it can also happen when the CPU is performing
reclaim.

If the task waits for refaults and something else is runnable, we're
not losing CPU potential. But if the task performs reclaim and uses
the CPU, nothing else can do productive work on that CPU.


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-18 Thread Johannes Weiner
On Tue, Jul 17, 2018 at 05:17:05PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > @@ -457,6 +457,22 @@ config TASK_IO_ACCOUNTING
> >  
> >   Say N if unsure.
> >  
> > +config PSI
> > +   bool "Pressure stall information tracking"
> > +   select SCHED_INFO
> 
> What's the deal here? AFAICT it does not in fact use SCHED_INFO for
> _anything_. You just hooked into the sched_info_{en,de}queue() hooks,
> but you don't use any of the sched_info data.
> 
> So the dependency is an artificial one that should not exist.

You're right, it doesn't strictly depend on it. I'll split that out.


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-18 Thread Johannes Weiner
On Tue, Jul 17, 2018 at 05:17:05PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > @@ -457,6 +457,22 @@ config TASK_IO_ACCOUNTING
> >  
> >   Say N if unsure.
> >  
> > +config PSI
> > +   bool "Pressure stall information tracking"
> > +   select SCHED_INFO
> 
> What's the deal here? AFAICT it does not in fact use SCHED_INFO for
> _anything_. You just hooked into the sched_info_{en,de}queue() hooks,
> but you don't use any of the sched_info data.
> 
> So the dependency is an artificial one that should not exist.

You're right, it doesn't strictly depend on it. I'll split that out.


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-18 Thread Johannes Weiner
On Tue, Jul 17, 2018 at 05:01:42PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > +static bool psi_update_stats(struct psi_group *group)
> > +{
> > +   u64 some[NR_PSI_RESOURCES] = { 0, };
> > +   u64 full[NR_PSI_RESOURCES] = { 0, };
> > +   unsigned long nonidle_total = 0;
> > +   unsigned long missed_periods;
> > +   unsigned long expires;
> > +   int cpu;
> > +   int r;
> > +
> > +   mutex_lock(>stat_lock);
> > +
> > +   /*
> > +* Collect the per-cpu time buckets and average them into a
> > +* single time sample that is normalized to wallclock time.
> > +*
> > +* For averaging, each CPU is weighted by its non-idle time in
> > +* the sampling period. This eliminates artifacts from uneven
> > +* loading, or even entirely idle CPUs.
> > +*
> > +* We could pin the online CPUs here, but the noise introduced
> > +* by missing up to one sample period from CPUs that are going
> > +* away shouldn't matter in practice - just like the noise of
> > +* previously offlined CPUs returning with a non-zero sample.
> 
> But why!? cpuu_read_lock() is neither expensive nor complicated. So why
> try and avoid it?

Hm, I don't feel strongly about it either way. I'll add it.

> > +   /* total= */
> > +   for (r = 0; r < NR_PSI_RESOURCES; r++) {
> > +   do_div(some[r], max(nonidle_total, 1UL));
> > +   do_div(full[r], max(nonidle_total, 1UL));
> > +
> > +   group->some[r] += some[r];
> > +   group->full[r] += full[r];
> 
>   group->some[r] = div64_ul(some[r], max(nonidle_total, 1UL));
>   group->full[r] = div64_ul(full[r], max(nonidle_total, 1UL));
> 
> Is easier to read imo.

Sounds good to me, I'll change that.


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-18 Thread Johannes Weiner
On Tue, Jul 17, 2018 at 05:01:42PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > +static bool psi_update_stats(struct psi_group *group)
> > +{
> > +   u64 some[NR_PSI_RESOURCES] = { 0, };
> > +   u64 full[NR_PSI_RESOURCES] = { 0, };
> > +   unsigned long nonidle_total = 0;
> > +   unsigned long missed_periods;
> > +   unsigned long expires;
> > +   int cpu;
> > +   int r;
> > +
> > +   mutex_lock(>stat_lock);
> > +
> > +   /*
> > +* Collect the per-cpu time buckets and average them into a
> > +* single time sample that is normalized to wallclock time.
> > +*
> > +* For averaging, each CPU is weighted by its non-idle time in
> > +* the sampling period. This eliminates artifacts from uneven
> > +* loading, or even entirely idle CPUs.
> > +*
> > +* We could pin the online CPUs here, but the noise introduced
> > +* by missing up to one sample period from CPUs that are going
> > +* away shouldn't matter in practice - just like the noise of
> > +* previously offlined CPUs returning with a non-zero sample.
> 
> But why!? cpuu_read_lock() is neither expensive nor complicated. So why
> try and avoid it?

Hm, I don't feel strongly about it either way. I'll add it.

> > +   /* total= */
> > +   for (r = 0; r < NR_PSI_RESOURCES; r++) {
> > +   do_div(some[r], max(nonidle_total, 1UL));
> > +   do_div(full[r], max(nonidle_total, 1UL));
> > +
> > +   group->some[r] += some[r];
> > +   group->full[r] += full[r];
> 
>   group->some[r] = div64_ul(some[r], max(nonidle_total, 1UL));
>   group->full[r] = div64_ul(full[r], max(nonidle_total, 1UL));
> 
> Is easier to read imo.

Sounds good to me, I'll change that.


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-18 Thread Johannes Weiner
On Tue, Jul 17, 2018 at 04:21:57PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > diff --git a/include/linux/sched/stat.h b/include/linux/sched/stat.h
> > index 04f1321d14c4..ac39435d1521 100644
> > --- a/include/linux/sched/stat.h
> > +++ b/include/linux/sched/stat.h
> > @@ -28,10 +28,14 @@ static inline int sched_info_on(void)
> > return 1;
> >  #elif defined(CONFIG_TASK_DELAY_ACCT)
> > extern int delayacct_on;
> > +   if (delayacct_on)
> > +   return 1;
> > +#elif defined(CONFIG_PSI)
> > +   extern int psi_disabled;
> > +   if (!psi_disabled)
> > +   return 1;
> >  #endif
> > +   return 0;
> >  }
> 
> Doesn't that want to be something like:
> 
> static inline bool sched_info_on(void)
> {
> #ifdef CONFIG_SCHEDSTAT
>   return true;
> #else /* !SCHEDSTAT */
> #ifdef CONFIG_TASK_DELAY_ACCT
>   extern int delayacct_on;
>   if (delayacct_on)
>   return true;
> #endif /* DELAYACCT */
> #ifdef CONFIG_PSI
>   extern int psi_disabled;
>   if (!psi_disabled)
>   return true;
> #endif
>   return false;
> #endif /* !SCHEDSTATE */
> }
> 
> Such that if you build a TASK_DELAY_ACCT && PSI kernel, and boot with
> nodelayacct, you still get sched_info_on().

You're right, that was a brainfart on my end. But as you point out in
the other email, the SCHED_INFO dependency is artificial, so I'll
rework this entire part.


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-18 Thread Johannes Weiner
On Tue, Jul 17, 2018 at 04:21:57PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > diff --git a/include/linux/sched/stat.h b/include/linux/sched/stat.h
> > index 04f1321d14c4..ac39435d1521 100644
> > --- a/include/linux/sched/stat.h
> > +++ b/include/linux/sched/stat.h
> > @@ -28,10 +28,14 @@ static inline int sched_info_on(void)
> > return 1;
> >  #elif defined(CONFIG_TASK_DELAY_ACCT)
> > extern int delayacct_on;
> > +   if (delayacct_on)
> > +   return 1;
> > +#elif defined(CONFIG_PSI)
> > +   extern int psi_disabled;
> > +   if (!psi_disabled)
> > +   return 1;
> >  #endif
> > +   return 0;
> >  }
> 
> Doesn't that want to be something like:
> 
> static inline bool sched_info_on(void)
> {
> #ifdef CONFIG_SCHEDSTAT
>   return true;
> #else /* !SCHEDSTAT */
> #ifdef CONFIG_TASK_DELAY_ACCT
>   extern int delayacct_on;
>   if (delayacct_on)
>   return true;
> #endif /* DELAYACCT */
> #ifdef CONFIG_PSI
>   extern int psi_disabled;
>   if (!psi_disabled)
>   return true;
> #endif
>   return false;
> #endif /* !SCHEDSTATE */
> }
> 
> Such that if you build a TASK_DELAY_ACCT && PSI kernel, and boot with
> nodelayacct, you still get sched_info_on().

You're right, that was a brainfart on my end. But as you point out in
the other email, the SCHED_INFO dependency is artificial, so I'll
rework this entire part.


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-18 Thread Johannes Weiner
On Tue, Jul 17, 2018 at 04:16:14PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > +/* Tracked task states */
> > +enum psi_task_count {
> > +   NR_RUNNING,
> > +   NR_IOWAIT,
> > +   NR_MEMSTALL,
> > +   NR_PSI_TASK_COUNTS,
> > +};
> 
> > +/* Resources that workloads could be stalled on */
> > +enum psi_res {
> > +   PSI_CPU,
> > +   PSI_MEM,
> > +   PSI_IO,
> > +   NR_PSI_RESOURCES,
> > +};
> 
> These two have mem and iowait in different order. It really doesn't
> matter, but my brain stumbled.

No problem, I swapped them around for v3.


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-18 Thread Johannes Weiner
On Tue, Jul 17, 2018 at 04:16:14PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > +/* Tracked task states */
> > +enum psi_task_count {
> > +   NR_RUNNING,
> > +   NR_IOWAIT,
> > +   NR_MEMSTALL,
> > +   NR_PSI_TASK_COUNTS,
> > +};
> 
> > +/* Resources that workloads could be stalled on */
> > +enum psi_res {
> > +   PSI_CPU,
> > +   PSI_MEM,
> > +   PSI_IO,
> > +   NR_PSI_RESOURCES,
> > +};
> 
> These two have mem and iowait in different order. It really doesn't
> matter, but my brain stumbled.

No problem, I swapped them around for v3.


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-18 Thread Johannes Weiner
On Tue, Jul 17, 2018 at 12:03:47PM +0200, Peter Zijlstra wrote:
> This is still a scary amount of accounting; not to mention you'll be
> adding O(cgroup-depth) to this in a later patch.
> 
> Where are the performance numbers for all this?

I benchmarked it using our two most scheduling sensitive workloads:
memcache and webserver. They handle a ton of small requests - lots of
wakeups and sleeps with little actual work in between - so they tend
to be canaries for scheduler regressions.

In the tests, the boxes were handling live traffic over the course of
several hours. Half the machines, the control, ran with CONFIG_PSI=n.

For memcache I used eight machines total. They're 2-socket, 14 core,
56 thread boxes. The test runs for half the test period, flips the
test and control kernels on the hardware to rule out HW factors, DC
location etc., then runs the other half of the test.

For the webservers, I used 32 machines total. They're single socket,
16 core, 32 thread machines.

During the memcache test, CPU load was nopsi=78.05% psi=78.98% in the
first half and nopsi=77.52% psi=78.25%, so psi added between 0.7 and
0.9 percentage points to the CPU load, a difference of about 1%.

As far as end-to-end request latency from the client perspective goes,
we don't sample those finely enough to capture the requests going to
those particular machines during the test, but we know the p50
turnaround time in this workload is 54us, and perf bench sched pipe on
those machines show nopsi=5.232666 us/op and psi=5.587347 us/op, so
this doesn't add much here either.

The profile for the pipe benchmark shows:

 0.87%  sched-pipe  [kernel.vmlinux][k] psi_group_change
 0.83%  perf.real   [kernel.vmlinux][k] psi_group_change
 0.82%  perf.real   [kernel.vmlinux][k] psi_task_change
 0.58%  sched-pipe  [kernel.vmlinux][k] psi_task_change


The webserver load is running inside 4 nested cgroup levels. The CPU
load with both nopsi and psi kernels was indistinguishable at 81%.

For comparison, we had to disable the cgroup cpu controller on the
webservers because it added 4 percentage points to the CPU% during
this same exact test.

Versions of this accounting code now run on 80% of our fleet. None of
our workloads have reported regressions during the rollout.

[ Also note that the webservers that tested the nopsi kernel were
  during that time susceptible to swap storms, memory livelocks, and
  eventual hardresets because without psi they couldn't run our full
  resource isolation stack that would prevent that ;) ]

Let me know if there are other tests I could run.


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-18 Thread Johannes Weiner
On Tue, Jul 17, 2018 at 12:03:47PM +0200, Peter Zijlstra wrote:
> This is still a scary amount of accounting; not to mention you'll be
> adding O(cgroup-depth) to this in a later patch.
> 
> Where are the performance numbers for all this?

I benchmarked it using our two most scheduling sensitive workloads:
memcache and webserver. They handle a ton of small requests - lots of
wakeups and sleeps with little actual work in between - so they tend
to be canaries for scheduler regressions.

In the tests, the boxes were handling live traffic over the course of
several hours. Half the machines, the control, ran with CONFIG_PSI=n.

For memcache I used eight machines total. They're 2-socket, 14 core,
56 thread boxes. The test runs for half the test period, flips the
test and control kernels on the hardware to rule out HW factors, DC
location etc., then runs the other half of the test.

For the webservers, I used 32 machines total. They're single socket,
16 core, 32 thread machines.

During the memcache test, CPU load was nopsi=78.05% psi=78.98% in the
first half and nopsi=77.52% psi=78.25%, so psi added between 0.7 and
0.9 percentage points to the CPU load, a difference of about 1%.

As far as end-to-end request latency from the client perspective goes,
we don't sample those finely enough to capture the requests going to
those particular machines during the test, but we know the p50
turnaround time in this workload is 54us, and perf bench sched pipe on
those machines show nopsi=5.232666 us/op and psi=5.587347 us/op, so
this doesn't add much here either.

The profile for the pipe benchmark shows:

 0.87%  sched-pipe  [kernel.vmlinux][k] psi_group_change
 0.83%  perf.real   [kernel.vmlinux][k] psi_group_change
 0.82%  perf.real   [kernel.vmlinux][k] psi_task_change
 0.58%  sched-pipe  [kernel.vmlinux][k] psi_task_change


The webserver load is running inside 4 nested cgroup levels. The CPU
load with both nopsi and psi kernels was indistinguishable at 81%.

For comparison, we had to disable the cgroup cpu controller on the
webservers because it added 4 percentage points to the CPU% during
this same exact test.

Versions of this accounting code now run on 80% of our fleet. None of
our workloads have reported regressions during the rollout.

[ Also note that the webservers that tested the nopsi kernel were
  during that time susceptible to swap storms, memory livelocks, and
  eventual hardresets because without psi they couldn't run our full
  resource isolation stack that would prevent that ;) ]

Let me know if there are other tests I could run.


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-18 Thread Johannes Weiner
On Wed, Jul 18, 2018 at 06:31:15PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 18, 2018 at 09:56:33AM -0400, Johannes Weiner wrote:
> > On Wed, Jul 18, 2018 at 02:46:27PM +0200, Peter Zijlstra wrote:
> 
> > > I'm confused by this whole MEMSTALL thing... I thought the idea was to
> > > account the time we were _blocked_ because of memstall, but you seem to
> > > count the time we're _running_ with PF_MEMSTALL.
> > 
> > Under heavy memory pressure, a lot of active CPU time is spent
> > scanning and rotating through the LRU lists, which we do want to
> > capture in the pressure metric. What we really want to know is the
> > time in which CPU potential goes to waste due to a lack of
> > resources. That's the CPU going idle due to a memstall, but it's also
> > a CPU doing *work* which only occurs due to a lack of memory. We want
> > to know about both to judge how productive system and workload are.
> 
> Then maybe memstall (esp. the 'stall' part of it) is a bit of a
> misnomer.

I'm not tied to that name, but I can't really think of a better
one. It was called PF_MEMDELAY in the past, but "delay" also has
busy-spinning connotations in the kernel. "wait" also implies that
it's a passive state.

> > > And esp. the wait_on_page_bit_common caller seems performance sensitive,
> > > and the above function is quite expensive.
> > 
> > Right, but we don't call it on every invocation, only when waiting for
> > the IO to read back a page that was recently deactivated and evicted:
> > 
> > if (bit_nr == PG_locked &&
> > !PageUptodate(page) && PageWorkingset(page)) {
> > if (!PageSwapBacked(page))
> > delayacct_thrashing_start();
> > psi_memstall_enter();
> > thrashing = true;
> > }
> > 
> > That means the page cache workingset/file active list is thrashing, in
> > which case the IO itself is our biggest concern, not necessarily a few
> > additional cycles before going to sleep to wait on its completion.
> 
> Ah, right. PageWorkingset() is only true if we (recently) evicted that
> page before, right?

Yep, but not all of those, only the ones who were on the active list
in their previous incarnation, aka refaulting *hot* pages, aka there
is little chance this is healthy behavior.


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-18 Thread Johannes Weiner
On Wed, Jul 18, 2018 at 06:31:15PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 18, 2018 at 09:56:33AM -0400, Johannes Weiner wrote:
> > On Wed, Jul 18, 2018 at 02:46:27PM +0200, Peter Zijlstra wrote:
> 
> > > I'm confused by this whole MEMSTALL thing... I thought the idea was to
> > > account the time we were _blocked_ because of memstall, but you seem to
> > > count the time we're _running_ with PF_MEMSTALL.
> > 
> > Under heavy memory pressure, a lot of active CPU time is spent
> > scanning and rotating through the LRU lists, which we do want to
> > capture in the pressure metric. What we really want to know is the
> > time in which CPU potential goes to waste due to a lack of
> > resources. That's the CPU going idle due to a memstall, but it's also
> > a CPU doing *work* which only occurs due to a lack of memory. We want
> > to know about both to judge how productive system and workload are.
> 
> Then maybe memstall (esp. the 'stall' part of it) is a bit of a
> misnomer.

I'm not tied to that name, but I can't really think of a better
one. It was called PF_MEMDELAY in the past, but "delay" also has
busy-spinning connotations in the kernel. "wait" also implies that
it's a passive state.

> > > And esp. the wait_on_page_bit_common caller seems performance sensitive,
> > > and the above function is quite expensive.
> > 
> > Right, but we don't call it on every invocation, only when waiting for
> > the IO to read back a page that was recently deactivated and evicted:
> > 
> > if (bit_nr == PG_locked &&
> > !PageUptodate(page) && PageWorkingset(page)) {
> > if (!PageSwapBacked(page))
> > delayacct_thrashing_start();
> > psi_memstall_enter();
> > thrashing = true;
> > }
> > 
> > That means the page cache workingset/file active list is thrashing, in
> > which case the IO itself is our biggest concern, not necessarily a few
> > additional cycles before going to sleep to wait on its completion.
> 
> Ah, right. PageWorkingset() is only true if we (recently) evicted that
> page before, right?

Yep, but not all of those, only the ones who were on the active list
in their previous incarnation, aka refaulting *hot* pages, aka there
is little chance this is healthy behavior.


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-18 Thread Peter Zijlstra
On Wed, Jul 18, 2018 at 09:56:33AM -0400, Johannes Weiner wrote:
> On Wed, Jul 18, 2018 at 02:46:27PM +0200, Peter Zijlstra wrote:

> > I'm confused by this whole MEMSTALL thing... I thought the idea was to
> > account the time we were _blocked_ because of memstall, but you seem to
> > count the time we're _running_ with PF_MEMSTALL.
> 
> Under heavy memory pressure, a lot of active CPU time is spent
> scanning and rotating through the LRU lists, which we do want to
> capture in the pressure metric. What we really want to know is the
> time in which CPU potential goes to waste due to a lack of
> resources. That's the CPU going idle due to a memstall, but it's also
> a CPU doing *work* which only occurs due to a lack of memory. We want
> to know about both to judge how productive system and workload are.

Then maybe memstall (esp. the 'stall' part of it) is a bit of a
misnomer.

> > And esp. the wait_on_page_bit_common caller seems performance sensitive,
> > and the above function is quite expensive.
> 
> Right, but we don't call it on every invocation, only when waiting for
> the IO to read back a page that was recently deactivated and evicted:
> 
>   if (bit_nr == PG_locked &&
>   !PageUptodate(page) && PageWorkingset(page)) {
>   if (!PageSwapBacked(page))
>   delayacct_thrashing_start();
>   psi_memstall_enter();
>   thrashing = true;
>   }
> 
> That means the page cache workingset/file active list is thrashing, in
> which case the IO itself is our biggest concern, not necessarily a few
> additional cycles before going to sleep to wait on its completion.

Ah, right. PageWorkingset() is only true if we (recently) evicted that
page before, right?


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-18 Thread Peter Zijlstra
On Wed, Jul 18, 2018 at 09:56:33AM -0400, Johannes Weiner wrote:
> On Wed, Jul 18, 2018 at 02:46:27PM +0200, Peter Zijlstra wrote:

> > I'm confused by this whole MEMSTALL thing... I thought the idea was to
> > account the time we were _blocked_ because of memstall, but you seem to
> > count the time we're _running_ with PF_MEMSTALL.
> 
> Under heavy memory pressure, a lot of active CPU time is spent
> scanning and rotating through the LRU lists, which we do want to
> capture in the pressure metric. What we really want to know is the
> time in which CPU potential goes to waste due to a lack of
> resources. That's the CPU going idle due to a memstall, but it's also
> a CPU doing *work* which only occurs due to a lack of memory. We want
> to know about both to judge how productive system and workload are.

Then maybe memstall (esp. the 'stall' part of it) is a bit of a
misnomer.

> > And esp. the wait_on_page_bit_common caller seems performance sensitive,
> > and the above function is quite expensive.
> 
> Right, but we don't call it on every invocation, only when waiting for
> the IO to read back a page that was recently deactivated and evicted:
> 
>   if (bit_nr == PG_locked &&
>   !PageUptodate(page) && PageWorkingset(page)) {
>   if (!PageSwapBacked(page))
>   delayacct_thrashing_start();
>   psi_memstall_enter();
>   thrashing = true;
>   }
> 
> That means the page cache workingset/file active list is thrashing, in
> which case the IO itself is our biggest concern, not necessarily a few
> additional cycles before going to sleep to wait on its completion.

Ah, right. PageWorkingset() is only true if we (recently) evicted that
page before, right?


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-18 Thread Johannes Weiner
Hi Peter,

thanks for the feedback so far, I'll get to the other emails
later. I'm currently running A/B tests against our production traffic
to get uptodate numbers in particular on the optimizations you
suggested for the cacheline packing, time_state(), ffs() etc.

On Wed, Jul 18, 2018 at 02:46:27PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> 
> > +static inline void psi_enqueue(struct task_struct *p, u64 now, bool wakeup)
> > +{
> > +   int clear = 0, set = TSK_RUNNING;
> > +
> > +   if (psi_disabled)
> > +   return;
> > +
> > +   if (!wakeup || p->sched_psi_wake_requeue) {
> > +   if (p->flags & PF_MEMSTALL)
> > +   set |= TSK_MEMSTALL;
> > +   if (p->sched_psi_wake_requeue)
> > +   p->sched_psi_wake_requeue = 0;
> > +   } else {
> > +   if (p->in_iowait)
> > +   clear |= TSK_IOWAIT;
> > +   }
> > +
> > +   psi_task_change(p, now, clear, set);
> > +}
> > +
> > +static inline void psi_dequeue(struct task_struct *p, u64 now, bool sleep)
> > +{
> > +   int clear = TSK_RUNNING, set = 0;
> > +
> > +   if (psi_disabled)
> > +   return;
> > +
> > +   if (!sleep) {
> > +   if (p->flags & PF_MEMSTALL)
> > +   clear |= TSK_MEMSTALL;
> > +   } else {
> > +   if (p->in_iowait)
> > +   set |= TSK_IOWAIT;
> > +   }
> > +
> > +   psi_task_change(p, now, clear, set);
> > +}
> 
> > +/**
> > + * psi_memstall_enter - mark the beginning of a memory stall section
> > + * @flags: flags to handle nested sections
> > + *
> > + * Marks the calling task as being stalled due to a lack of memory,
> > + * such as waiting for a refault or performing reclaim.
> > + */
> > +void psi_memstall_enter(unsigned long *flags)
> > +{
> > +   struct rq_flags rf;
> > +   struct rq *rq;
> > +
> > +   if (psi_disabled)
> > +   return;
> > +
> > +   *flags = current->flags & PF_MEMSTALL;
> > +   if (*flags)
> > +   return;
> > +   /*
> > +* PF_MEMSTALL setting & accounting needs to be atomic wrt
> > +* changes to the task's scheduling state, otherwise we can
> > +* race with CPU migration.
> > +*/
> > +   rq = this_rq_lock_irq();
> > +
> > +   update_rq_clock(rq);
> > +
> > +   current->flags |= PF_MEMSTALL;
> > +   psi_task_change(current, rq_clock(rq), 0, TSK_MEMSTALL);
> > +
> > +   rq_unlock_irq(rq, );
> > +}
> 
> I'm confused by this whole MEMSTALL thing... I thought the idea was to
> account the time we were _blocked_ because of memstall, but you seem to
> count the time we're _running_ with PF_MEMSTALL.

Under heavy memory pressure, a lot of active CPU time is spent
scanning and rotating through the LRU lists, which we do want to
capture in the pressure metric. What we really want to know is the
time in which CPU potential goes to waste due to a lack of
resources. That's the CPU going idle due to a memstall, but it's also
a CPU doing *work* which only occurs due to a lack of memory. We want
to know about both to judge how productive system and workload are.

> And esp. the wait_on_page_bit_common caller seems performance sensitive,
> and the above function is quite expensive.

Right, but we don't call it on every invocation, only when waiting for
the IO to read back a page that was recently deactivated and evicted:

if (bit_nr == PG_locked &&
!PageUptodate(page) && PageWorkingset(page)) {
if (!PageSwapBacked(page))
delayacct_thrashing_start();
psi_memstall_enter();
thrashing = true;
}

That means the page cache workingset/file active list is thrashing, in
which case the IO itself is our biggest concern, not necessarily a few
additional cycles before going to sleep to wait on its completion.


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-18 Thread Johannes Weiner
Hi Peter,

thanks for the feedback so far, I'll get to the other emails
later. I'm currently running A/B tests against our production traffic
to get uptodate numbers in particular on the optimizations you
suggested for the cacheline packing, time_state(), ffs() etc.

On Wed, Jul 18, 2018 at 02:46:27PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> 
> > +static inline void psi_enqueue(struct task_struct *p, u64 now, bool wakeup)
> > +{
> > +   int clear = 0, set = TSK_RUNNING;
> > +
> > +   if (psi_disabled)
> > +   return;
> > +
> > +   if (!wakeup || p->sched_psi_wake_requeue) {
> > +   if (p->flags & PF_MEMSTALL)
> > +   set |= TSK_MEMSTALL;
> > +   if (p->sched_psi_wake_requeue)
> > +   p->sched_psi_wake_requeue = 0;
> > +   } else {
> > +   if (p->in_iowait)
> > +   clear |= TSK_IOWAIT;
> > +   }
> > +
> > +   psi_task_change(p, now, clear, set);
> > +}
> > +
> > +static inline void psi_dequeue(struct task_struct *p, u64 now, bool sleep)
> > +{
> > +   int clear = TSK_RUNNING, set = 0;
> > +
> > +   if (psi_disabled)
> > +   return;
> > +
> > +   if (!sleep) {
> > +   if (p->flags & PF_MEMSTALL)
> > +   clear |= TSK_MEMSTALL;
> > +   } else {
> > +   if (p->in_iowait)
> > +   set |= TSK_IOWAIT;
> > +   }
> > +
> > +   psi_task_change(p, now, clear, set);
> > +}
> 
> > +/**
> > + * psi_memstall_enter - mark the beginning of a memory stall section
> > + * @flags: flags to handle nested sections
> > + *
> > + * Marks the calling task as being stalled due to a lack of memory,
> > + * such as waiting for a refault or performing reclaim.
> > + */
> > +void psi_memstall_enter(unsigned long *flags)
> > +{
> > +   struct rq_flags rf;
> > +   struct rq *rq;
> > +
> > +   if (psi_disabled)
> > +   return;
> > +
> > +   *flags = current->flags & PF_MEMSTALL;
> > +   if (*flags)
> > +   return;
> > +   /*
> > +* PF_MEMSTALL setting & accounting needs to be atomic wrt
> > +* changes to the task's scheduling state, otherwise we can
> > +* race with CPU migration.
> > +*/
> > +   rq = this_rq_lock_irq();
> > +
> > +   update_rq_clock(rq);
> > +
> > +   current->flags |= PF_MEMSTALL;
> > +   psi_task_change(current, rq_clock(rq), 0, TSK_MEMSTALL);
> > +
> > +   rq_unlock_irq(rq, );
> > +}
> 
> I'm confused by this whole MEMSTALL thing... I thought the idea was to
> account the time we were _blocked_ because of memstall, but you seem to
> count the time we're _running_ with PF_MEMSTALL.

Under heavy memory pressure, a lot of active CPU time is spent
scanning and rotating through the LRU lists, which we do want to
capture in the pressure metric. What we really want to know is the
time in which CPU potential goes to waste due to a lack of
resources. That's the CPU going idle due to a memstall, but it's also
a CPU doing *work* which only occurs due to a lack of memory. We want
to know about both to judge how productive system and workload are.

> And esp. the wait_on_page_bit_common caller seems performance sensitive,
> and the above function is quite expensive.

Right, but we don't call it on every invocation, only when waiting for
the IO to read back a page that was recently deactivated and evicted:

if (bit_nr == PG_locked &&
!PageUptodate(page) && PageWorkingset(page)) {
if (!PageSwapBacked(page))
delayacct_thrashing_start();
psi_memstall_enter();
thrashing = true;
}

That means the page cache workingset/file active list is thrashing, in
which case the IO itself is our biggest concern, not necessarily a few
additional cycles before going to sleep to wait on its completion.


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-18 Thread Peter Zijlstra
On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:

> +static inline void psi_enqueue(struct task_struct *p, u64 now, bool wakeup)
> +{
> + int clear = 0, set = TSK_RUNNING;
> +
> + if (psi_disabled)
> + return;
> +
> + if (!wakeup || p->sched_psi_wake_requeue) {
> + if (p->flags & PF_MEMSTALL)
> + set |= TSK_MEMSTALL;
> + if (p->sched_psi_wake_requeue)
> + p->sched_psi_wake_requeue = 0;
> + } else {
> + if (p->in_iowait)
> + clear |= TSK_IOWAIT;
> + }
> +
> + psi_task_change(p, now, clear, set);
> +}
> +
> +static inline void psi_dequeue(struct task_struct *p, u64 now, bool sleep)
> +{
> + int clear = TSK_RUNNING, set = 0;
> +
> + if (psi_disabled)
> + return;
> +
> + if (!sleep) {
> + if (p->flags & PF_MEMSTALL)
> + clear |= TSK_MEMSTALL;
> + } else {
> + if (p->in_iowait)
> + set |= TSK_IOWAIT;
> + }
> +
> + psi_task_change(p, now, clear, set);
> +}

> +/**
> + * psi_memstall_enter - mark the beginning of a memory stall section
> + * @flags: flags to handle nested sections
> + *
> + * Marks the calling task as being stalled due to a lack of memory,
> + * such as waiting for a refault or performing reclaim.
> + */
> +void psi_memstall_enter(unsigned long *flags)
> +{
> + struct rq_flags rf;
> + struct rq *rq;
> +
> + if (psi_disabled)
> + return;
> +
> + *flags = current->flags & PF_MEMSTALL;
> + if (*flags)
> + return;
> + /*
> +  * PF_MEMSTALL setting & accounting needs to be atomic wrt
> +  * changes to the task's scheduling state, otherwise we can
> +  * race with CPU migration.
> +  */
> + rq = this_rq_lock_irq();
> +
> + update_rq_clock(rq);
> +
> + current->flags |= PF_MEMSTALL;
> + psi_task_change(current, rq_clock(rq), 0, TSK_MEMSTALL);
> +
> + rq_unlock_irq(rq, );
> +}

I'm confused by this whole MEMSTALL thing... I thought the idea was to
account the time we were _blocked_ because of memstall, but you seem to
count the time we're _running_ with PF_MEMSTALL.


And esp. the wait_on_page_bit_common caller seems performance sensitive,
and the above function is quite expensive.




Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-18 Thread Peter Zijlstra
On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:

> +static inline void psi_enqueue(struct task_struct *p, u64 now, bool wakeup)
> +{
> + int clear = 0, set = TSK_RUNNING;
> +
> + if (psi_disabled)
> + return;
> +
> + if (!wakeup || p->sched_psi_wake_requeue) {
> + if (p->flags & PF_MEMSTALL)
> + set |= TSK_MEMSTALL;
> + if (p->sched_psi_wake_requeue)
> + p->sched_psi_wake_requeue = 0;
> + } else {
> + if (p->in_iowait)
> + clear |= TSK_IOWAIT;
> + }
> +
> + psi_task_change(p, now, clear, set);
> +}
> +
> +static inline void psi_dequeue(struct task_struct *p, u64 now, bool sleep)
> +{
> + int clear = TSK_RUNNING, set = 0;
> +
> + if (psi_disabled)
> + return;
> +
> + if (!sleep) {
> + if (p->flags & PF_MEMSTALL)
> + clear |= TSK_MEMSTALL;
> + } else {
> + if (p->in_iowait)
> + set |= TSK_IOWAIT;
> + }
> +
> + psi_task_change(p, now, clear, set);
> +}

> +/**
> + * psi_memstall_enter - mark the beginning of a memory stall section
> + * @flags: flags to handle nested sections
> + *
> + * Marks the calling task as being stalled due to a lack of memory,
> + * such as waiting for a refault or performing reclaim.
> + */
> +void psi_memstall_enter(unsigned long *flags)
> +{
> + struct rq_flags rf;
> + struct rq *rq;
> +
> + if (psi_disabled)
> + return;
> +
> + *flags = current->flags & PF_MEMSTALL;
> + if (*flags)
> + return;
> + /*
> +  * PF_MEMSTALL setting & accounting needs to be atomic wrt
> +  * changes to the task's scheduling state, otherwise we can
> +  * race with CPU migration.
> +  */
> + rq = this_rq_lock_irq();
> +
> + update_rq_clock(rq);
> +
> + current->flags |= PF_MEMSTALL;
> + psi_task_change(current, rq_clock(rq), 0, TSK_MEMSTALL);
> +
> + rq_unlock_irq(rq, );
> +}

I'm confused by this whole MEMSTALL thing... I thought the idea was to
account the time we were _blocked_ because of memstall, but you seem to
count the time we're _running_ with PF_MEMSTALL.


And esp. the wait_on_page_bit_common caller seems performance sensitive,
and the above function is quite expensive.




Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-18 Thread Peter Zijlstra
On Wed, Jul 18, 2018 at 02:03:18PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > +   for (to = 0; (bo = ffs(set)); to += bo, set >>= bo)
> > +   tasks[to + (bo - 1)]++;
> 
> You want to benchmark this, but since it's only 3 consecutive bits, it
> might actually be faster to not use ffs() and simply test all 3 bits:
> 
>   for (to = set, bo = 0; to; to &= ~(1 << bo), bo++)

if (to & (1 << bo))

>   tasks[bo]++;




Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-18 Thread Peter Zijlstra
On Wed, Jul 18, 2018 at 02:03:18PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > +   for (to = 0; (bo = ffs(set)); to += bo, set >>= bo)
> > +   tasks[to + (bo - 1)]++;
> 
> You want to benchmark this, but since it's only 3 consecutive bits, it
> might actually be faster to not use ffs() and simply test all 3 bits:
> 
>   for (to = set, bo = 0; to; to &= ~(1 << bo), bo++)

if (to & (1 << bo))

>   tasks[bo]++;




Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-18 Thread Peter Zijlstra
On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> +/* Tracked task states */
> +enum psi_task_count {
> + NR_RUNNING,
> + NR_IOWAIT,
> + NR_MEMSTALL,
> + NR_PSI_TASK_COUNTS,
> +};
> +
> +/* Task state bitmasks */
> +#define TSK_RUNNING  (1 << NR_RUNNING)
> +#define TSK_IOWAIT   (1 << NR_IOWAIT)
> +#define TSK_MEMSTALL (1 << NR_MEMSTALL)
> +
> +/* Resources that workloads could be stalled on */
> +enum psi_res {
> + PSI_CPU,
> + PSI_MEM,
> + PSI_IO,
> + NR_PSI_RESOURCES,
> +};
> +
> +/* Pressure states for a group of tasks */
> +enum psi_state {
> + PSI_NONE,   /* No stalled tasks */
> + PSI_SOME,   /* Stalled tasks & working tasks */
> + PSI_FULL,   /* Stalled tasks & no working tasks */
> + NR_PSI_STATES,
> +};
> +
> +struct psi_resource {
> + /* Current pressure state for this resource */
> + enum psi_state state;

This has a 4 byte hole here (really 7 but GCC is generous and uses 4
bytes for the enum that spans the value range [0-2]).

> + /* Start of current state (rq_clock) */
> + u64 state_start;
> +
> + /* Time sampling buckets for pressure states SOME and FULL (ns) */
> + u64 times[2];
> +};
> +
> +struct psi_group_cpu {
> + /* States of the tasks belonging to this group */
> + unsigned int tasks[NR_PSI_TASK_COUNTS];
> +
> + /* There are runnable or D-state tasks */
> + int nonidle;
> +
> + /* Start of current non-idle state (rq_clock) */
> + u64 nonidle_start;
> +
> + /* Time sampling bucket for non-idle state (ns) */
> + u64 nonidle_time;
> +
> + /* Per-resource pressure tracking in this group */
> + struct psi_resource res[NR_PSI_RESOURCES];
> +};

> +static DEFINE_PER_CPU(struct psi_group_cpu, system_group_cpus);

Since psi_group_cpu is exactly 2 lines big, I think you want the above
to be DEFINE_PER_CPU_SHARED_ALIGNED() to minimize cache misses on
accounting. Also, I think you want to stick cacheline_aligned_in_smp
on the structure, such that alloc_percpu() also DTRT.

Of those 2 lines, 12 bytes are wasted because of that hole above, and a
further 8 are wasted because PSI_CPU does not use FULL, for a total of
20 wasted bytes in there.

> +static void time_state(struct psi_resource *res, int state, u64 now)
> +{
> + if (res->state != PSI_NONE) {
> + bool was_full = res->state == PSI_FULL;
> +
> + res->times[was_full] += now - res->state_start;
> + }
> + if (res->state != state)
> + res->state = state;
> + if (res->state != PSI_NONE)
> + res->state_start = now;
> +}

Does the compiler optimize that and fold the two != NONE branches?

> +static void psi_group_change(struct psi_group *group, int cpu, u64 now,
> +  unsigned int clear, unsigned int set)
> +{
> + enum psi_state state = PSI_NONE;
> + struct psi_group_cpu *groupc;
> + unsigned int *tasks;
> + unsigned int to, bo;
> +
> + groupc = per_cpu_ptr(group->cpus, cpu);
> + tasks = groupc->tasks;

bool was_nonidle = tasks[NR_RUNNING] || tasks[NR_IOWAIT] || 
tasks[NR_MEMSTALL];

> + /* Update task counts according to the set/clear bitmasks */
> + for (to = 0; (bo = ffs(clear)); to += bo, clear >>= bo) {
> + int idx = to + (bo - 1);
> +
> + if (tasks[idx] == 0 && !psi_bug) {
> + printk_deferred(KERN_ERR "psi: task underflow! cpu=%d 
> idx=%d tasks=[%u %u %u] clear=%x set=%x\n",
> + cpu, idx, tasks[0], tasks[1], tasks[2],
> + clear, set);
> + psi_bug = 1;
> + }

WARN_ONCE(!tasks[idx], ...);

> + tasks[idx]--;
> + }
> + for (to = 0; (bo = ffs(set)); to += bo, set >>= bo)
> + tasks[to + (bo - 1)]++;

You want to benchmark this, but since it's only 3 consecutive bits, it
might actually be faster to not use ffs() and simply test all 3 bits:

for (to = set, bo = 0; to; to &= ~(1 << bo), bo++)
tasks[bo]++;

or something like that.

> +
> + /* Time in which tasks wait for the CPU */
> + state = PSI_NONE;
> + if (tasks[NR_RUNNING] > 1)
> + state = PSI_SOME;
> + time_state(>res[PSI_CPU], state, now);
> +
> + /* Time in which tasks wait for memory */
> + state = PSI_NONE;
> + if (tasks[NR_MEMSTALL]) {
> + if (!tasks[NR_RUNNING] ||
> + (cpu_curr(cpu)->flags & PF_MEMSTALL))

I'm confused, why do we care if the current tasks is MEMSTALL or not?

> + state = PSI_FULL;
> + else
> + state = PSI_SOME;
> + }
> + time_state(>res[PSI_MEM], state, now);
> +
> + /* Time in which tasks wait for IO */
> + state = PSI_NONE;
> + if (tasks[NR_IOWAIT]) {
> + if (!tasks[NR_RUNNING])
> + state = 

Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-18 Thread Peter Zijlstra
On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> +/* Tracked task states */
> +enum psi_task_count {
> + NR_RUNNING,
> + NR_IOWAIT,
> + NR_MEMSTALL,
> + NR_PSI_TASK_COUNTS,
> +};
> +
> +/* Task state bitmasks */
> +#define TSK_RUNNING  (1 << NR_RUNNING)
> +#define TSK_IOWAIT   (1 << NR_IOWAIT)
> +#define TSK_MEMSTALL (1 << NR_MEMSTALL)
> +
> +/* Resources that workloads could be stalled on */
> +enum psi_res {
> + PSI_CPU,
> + PSI_MEM,
> + PSI_IO,
> + NR_PSI_RESOURCES,
> +};
> +
> +/* Pressure states for a group of tasks */
> +enum psi_state {
> + PSI_NONE,   /* No stalled tasks */
> + PSI_SOME,   /* Stalled tasks & working tasks */
> + PSI_FULL,   /* Stalled tasks & no working tasks */
> + NR_PSI_STATES,
> +};
> +
> +struct psi_resource {
> + /* Current pressure state for this resource */
> + enum psi_state state;

This has a 4 byte hole here (really 7 but GCC is generous and uses 4
bytes for the enum that spans the value range [0-2]).

> + /* Start of current state (rq_clock) */
> + u64 state_start;
> +
> + /* Time sampling buckets for pressure states SOME and FULL (ns) */
> + u64 times[2];
> +};
> +
> +struct psi_group_cpu {
> + /* States of the tasks belonging to this group */
> + unsigned int tasks[NR_PSI_TASK_COUNTS];
> +
> + /* There are runnable or D-state tasks */
> + int nonidle;
> +
> + /* Start of current non-idle state (rq_clock) */
> + u64 nonidle_start;
> +
> + /* Time sampling bucket for non-idle state (ns) */
> + u64 nonidle_time;
> +
> + /* Per-resource pressure tracking in this group */
> + struct psi_resource res[NR_PSI_RESOURCES];
> +};

> +static DEFINE_PER_CPU(struct psi_group_cpu, system_group_cpus);

Since psi_group_cpu is exactly 2 lines big, I think you want the above
to be DEFINE_PER_CPU_SHARED_ALIGNED() to minimize cache misses on
accounting. Also, I think you want to stick cacheline_aligned_in_smp
on the structure, such that alloc_percpu() also DTRT.

Of those 2 lines, 12 bytes are wasted because of that hole above, and a
further 8 are wasted because PSI_CPU does not use FULL, for a total of
20 wasted bytes in there.

> +static void time_state(struct psi_resource *res, int state, u64 now)
> +{
> + if (res->state != PSI_NONE) {
> + bool was_full = res->state == PSI_FULL;
> +
> + res->times[was_full] += now - res->state_start;
> + }
> + if (res->state != state)
> + res->state = state;
> + if (res->state != PSI_NONE)
> + res->state_start = now;
> +}

Does the compiler optimize that and fold the two != NONE branches?

> +static void psi_group_change(struct psi_group *group, int cpu, u64 now,
> +  unsigned int clear, unsigned int set)
> +{
> + enum psi_state state = PSI_NONE;
> + struct psi_group_cpu *groupc;
> + unsigned int *tasks;
> + unsigned int to, bo;
> +
> + groupc = per_cpu_ptr(group->cpus, cpu);
> + tasks = groupc->tasks;

bool was_nonidle = tasks[NR_RUNNING] || tasks[NR_IOWAIT] || 
tasks[NR_MEMSTALL];

> + /* Update task counts according to the set/clear bitmasks */
> + for (to = 0; (bo = ffs(clear)); to += bo, clear >>= bo) {
> + int idx = to + (bo - 1);
> +
> + if (tasks[idx] == 0 && !psi_bug) {
> + printk_deferred(KERN_ERR "psi: task underflow! cpu=%d 
> idx=%d tasks=[%u %u %u] clear=%x set=%x\n",
> + cpu, idx, tasks[0], tasks[1], tasks[2],
> + clear, set);
> + psi_bug = 1;
> + }

WARN_ONCE(!tasks[idx], ...);

> + tasks[idx]--;
> + }
> + for (to = 0; (bo = ffs(set)); to += bo, set >>= bo)
> + tasks[to + (bo - 1)]++;

You want to benchmark this, but since it's only 3 consecutive bits, it
might actually be faster to not use ffs() and simply test all 3 bits:

for (to = set, bo = 0; to; to &= ~(1 << bo), bo++)
tasks[bo]++;

or something like that.

> +
> + /* Time in which tasks wait for the CPU */
> + state = PSI_NONE;
> + if (tasks[NR_RUNNING] > 1)
> + state = PSI_SOME;
> + time_state(>res[PSI_CPU], state, now);
> +
> + /* Time in which tasks wait for memory */
> + state = PSI_NONE;
> + if (tasks[NR_MEMSTALL]) {
> + if (!tasks[NR_RUNNING] ||
> + (cpu_curr(cpu)->flags & PF_MEMSTALL))

I'm confused, why do we care if the current tasks is MEMSTALL or not?

> + state = PSI_FULL;
> + else
> + state = PSI_SOME;
> + }
> + time_state(>res[PSI_MEM], state, now);
> +
> + /* Time in which tasks wait for IO */
> + state = PSI_NONE;
> + if (tasks[NR_IOWAIT]) {
> + if (!tasks[NR_RUNNING])
> + state = 

Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-17 Thread Peter Zijlstra
On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> +struct psi_group {
> + struct psi_group_cpu *cpus;

That one wants a __percpu annotation on I think. Also, maybe a rename.

> +
> + struct mutex stat_lock;
> +
> + u64 some[NR_PSI_RESOURCES];
> + u64 full[NR_PSI_RESOURCES];
> +
> + unsigned long period_expires;
> +
> + u64 last_some[NR_PSI_RESOURCES];
> + u64 last_full[NR_PSI_RESOURCES];
> +
> + unsigned long avg_some[NR_PSI_RESOURCES][3];
> + unsigned long avg_full[NR_PSI_RESOURCES][3];
> +
> + struct delayed_work clock_work;
> +};


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-17 Thread Peter Zijlstra
On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> +struct psi_group {
> + struct psi_group_cpu *cpus;

That one wants a __percpu annotation on I think. Also, maybe a rename.

> +
> + struct mutex stat_lock;
> +
> + u64 some[NR_PSI_RESOURCES];
> + u64 full[NR_PSI_RESOURCES];
> +
> + unsigned long period_expires;
> +
> + u64 last_some[NR_PSI_RESOURCES];
> + u64 last_full[NR_PSI_RESOURCES];
> +
> + unsigned long avg_some[NR_PSI_RESOURCES][3];
> + unsigned long avg_full[NR_PSI_RESOURCES][3];
> +
> + struct delayed_work clock_work;
> +};


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-17 Thread Peter Zijlstra
On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> diff --git a/include/linux/sched/stat.h b/include/linux/sched/stat.h
> index 04f1321d14c4..ac39435d1521 100644
> --- a/include/linux/sched/stat.h
> +++ b/include/linux/sched/stat.h
> @@ -28,10 +28,14 @@ static inline int sched_info_on(void)
>   return 1;
>  #elif defined(CONFIG_TASK_DELAY_ACCT)
>   extern int delayacct_on;
> - return delayacct_on;
> -#else
> - return 0;
> + if (delayacct_on)
> + return 1;
> +#elif defined(CONFIG_PSI)
> + extern int psi_disabled;
> + if (!psi_disabled)
> + return 1;
>  #endif
> + return 0;
>  }
>  
>  #ifdef CONFIG_SCHEDSTATS

> diff --git a/init/Kconfig b/init/Kconfig
> index 18b151f0ddc1..e34859bda33e 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -457,6 +457,22 @@ config TASK_IO_ACCOUNTING
>  
> Say N if unsure.
>  
> +config PSI
> + bool "Pressure stall information tracking"
> + select SCHED_INFO

What's the deal here? AFAICT it does not in fact use SCHED_INFO for
_anything_. You just hooked into the sched_info_{en,de}queue() hooks,
but you don't use any of the sched_info data.

So the dependency is an artificial one that should not exist.

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9586a8141f16..16e8c8c8f432 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -744,7 +744,7 @@ static inline void enqueue_task(struct rq *rq, struct 
> task_struct *p, int flags)
>   update_rq_clock(rq);
>  
>   if (!(flags & ENQUEUE_RESTORE))
> - sched_info_queued(rq, p);
> + sched_info_queued(rq, p, flags & ENQUEUE_WAKEUP);
>  
>   p->sched_class->enqueue_task(rq, p, flags);
>  }
> @@ -755,7 +755,7 @@ static inline void dequeue_task(struct rq *rq, struct 
> task_struct *p, int flags)
>   update_rq_clock(rq);
>  
>   if (!(flags & DEQUEUE_SAVE))
> - sched_info_dequeued(rq, p);
> + sched_info_dequeued(rq, p, flags & DEQUEUE_SLEEP);
>  
>   p->sched_class->dequeue_task(rq, p, flags);
>  }

> diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
> index 8aea199a39b4..15b858cbbcb0 100644
> --- a/kernel/sched/stats.h
> +++ b/kernel/sched/stats.h

>  #ifdef CONFIG_SCHED_INFO
>  static inline void sched_info_reset_dequeued(struct task_struct *t)
>  {
>   t->sched_info.last_queued = 0;
>  }
>  
> +static inline void sched_info_reset_queued(struct task_struct *t, u64 now)
> +{
> + if (!t->sched_info.last_queued)
> + t->sched_info.last_queued = now;
> +}
> +
>  /*
>   * We are interested in knowing how long it was from the *first* time a
>   * task was queued to the time that it finally hit a CPU, we call this 
> routine
>   * from dequeue_task() to account for possible rq->clock skew across CPUs. 
> The
>   * delta taken on each CPU would annul the skew.
>   */
> -static inline void sched_info_dequeued(struct rq *rq, struct task_struct *t)
> +static inline void sched_info_dequeued(struct rq *rq, struct task_struct *t,
> +bool sleep)
>  {
>   unsigned long long now = rq_clock(rq), delta = 0;
>  
> - if (unlikely(sched_info_on()))
> + if (unlikely(sched_info_on())) {
>   if (t->sched_info.last_queued)
>   delta = now - t->sched_info.last_queued;
> + psi_dequeue(t, now, sleep);
> + }
>   sched_info_reset_dequeued(t);
>   t->sched_info.run_delay += delta;
>  
> @@ -104,11 +190,14 @@ static void sched_info_arrive(struct rq *rq, struct 
> task_struct *t)
>   * the timestamp if it is already not set.  It's assumed that
>   * sched_info_dequeued() will clear that stamp when appropriate.
>   */
> -static inline void sched_info_queued(struct rq *rq, struct task_struct *t)
> +static inline void sched_info_queued(struct rq *rq, struct task_struct *t,
> +  bool wakeup)
>  {
>   if (unlikely(sched_info_on())) {
> - if (!t->sched_info.last_queued)
> - t->sched_info.last_queued = rq_clock(rq);
> + unsigned long long now = rq_clock(rq);
> +
> + sched_info_reset_queued(t, now);
> + psi_enqueue(t, now, wakeup);
>   }
>  }
>  
> @@ -127,7 +216,8 @@ static inline void sched_info_depart(struct rq *rq, 
> struct task_struct *t)
>   rq_sched_info_depart(rq, delta);
>  
>   if (t->state == TASK_RUNNING)
> - sched_info_queued(rq, t);
> + if (unlikely(sched_info_on()))
> + sched_info_reset_queued(t, rq_clock(rq));
>  }


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-17 Thread Peter Zijlstra
On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> diff --git a/include/linux/sched/stat.h b/include/linux/sched/stat.h
> index 04f1321d14c4..ac39435d1521 100644
> --- a/include/linux/sched/stat.h
> +++ b/include/linux/sched/stat.h
> @@ -28,10 +28,14 @@ static inline int sched_info_on(void)
>   return 1;
>  #elif defined(CONFIG_TASK_DELAY_ACCT)
>   extern int delayacct_on;
> - return delayacct_on;
> -#else
> - return 0;
> + if (delayacct_on)
> + return 1;
> +#elif defined(CONFIG_PSI)
> + extern int psi_disabled;
> + if (!psi_disabled)
> + return 1;
>  #endif
> + return 0;
>  }
>  
>  #ifdef CONFIG_SCHEDSTATS

> diff --git a/init/Kconfig b/init/Kconfig
> index 18b151f0ddc1..e34859bda33e 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -457,6 +457,22 @@ config TASK_IO_ACCOUNTING
>  
> Say N if unsure.
>  
> +config PSI
> + bool "Pressure stall information tracking"
> + select SCHED_INFO

What's the deal here? AFAICT it does not in fact use SCHED_INFO for
_anything_. You just hooked into the sched_info_{en,de}queue() hooks,
but you don't use any of the sched_info data.

So the dependency is an artificial one that should not exist.

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9586a8141f16..16e8c8c8f432 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -744,7 +744,7 @@ static inline void enqueue_task(struct rq *rq, struct 
> task_struct *p, int flags)
>   update_rq_clock(rq);
>  
>   if (!(flags & ENQUEUE_RESTORE))
> - sched_info_queued(rq, p);
> + sched_info_queued(rq, p, flags & ENQUEUE_WAKEUP);
>  
>   p->sched_class->enqueue_task(rq, p, flags);
>  }
> @@ -755,7 +755,7 @@ static inline void dequeue_task(struct rq *rq, struct 
> task_struct *p, int flags)
>   update_rq_clock(rq);
>  
>   if (!(flags & DEQUEUE_SAVE))
> - sched_info_dequeued(rq, p);
> + sched_info_dequeued(rq, p, flags & DEQUEUE_SLEEP);
>  
>   p->sched_class->dequeue_task(rq, p, flags);
>  }

> diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
> index 8aea199a39b4..15b858cbbcb0 100644
> --- a/kernel/sched/stats.h
> +++ b/kernel/sched/stats.h

>  #ifdef CONFIG_SCHED_INFO
>  static inline void sched_info_reset_dequeued(struct task_struct *t)
>  {
>   t->sched_info.last_queued = 0;
>  }
>  
> +static inline void sched_info_reset_queued(struct task_struct *t, u64 now)
> +{
> + if (!t->sched_info.last_queued)
> + t->sched_info.last_queued = now;
> +}
> +
>  /*
>   * We are interested in knowing how long it was from the *first* time a
>   * task was queued to the time that it finally hit a CPU, we call this 
> routine
>   * from dequeue_task() to account for possible rq->clock skew across CPUs. 
> The
>   * delta taken on each CPU would annul the skew.
>   */
> -static inline void sched_info_dequeued(struct rq *rq, struct task_struct *t)
> +static inline void sched_info_dequeued(struct rq *rq, struct task_struct *t,
> +bool sleep)
>  {
>   unsigned long long now = rq_clock(rq), delta = 0;
>  
> - if (unlikely(sched_info_on()))
> + if (unlikely(sched_info_on())) {
>   if (t->sched_info.last_queued)
>   delta = now - t->sched_info.last_queued;
> + psi_dequeue(t, now, sleep);
> + }
>   sched_info_reset_dequeued(t);
>   t->sched_info.run_delay += delta;
>  
> @@ -104,11 +190,14 @@ static void sched_info_arrive(struct rq *rq, struct 
> task_struct *t)
>   * the timestamp if it is already not set.  It's assumed that
>   * sched_info_dequeued() will clear that stamp when appropriate.
>   */
> -static inline void sched_info_queued(struct rq *rq, struct task_struct *t)
> +static inline void sched_info_queued(struct rq *rq, struct task_struct *t,
> +  bool wakeup)
>  {
>   if (unlikely(sched_info_on())) {
> - if (!t->sched_info.last_queued)
> - t->sched_info.last_queued = rq_clock(rq);
> + unsigned long long now = rq_clock(rq);
> +
> + sched_info_reset_queued(t, now);
> + psi_enqueue(t, now, wakeup);
>   }
>  }
>  
> @@ -127,7 +216,8 @@ static inline void sched_info_depart(struct rq *rq, 
> struct task_struct *t)
>   rq_sched_info_depart(rq, delta);
>  
>   if (t->state == TASK_RUNNING)
> - sched_info_queued(rq, t);
> + if (unlikely(sched_info_on()))
> + sched_info_reset_queued(t, rq_clock(rq));
>  }


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-17 Thread Peter Zijlstra
On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> +static bool psi_update_stats(struct psi_group *group)
> +{
> + u64 some[NR_PSI_RESOURCES] = { 0, };
> + u64 full[NR_PSI_RESOURCES] = { 0, };
> + unsigned long nonidle_total = 0;
> + unsigned long missed_periods;
> + unsigned long expires;
> + int cpu;
> + int r;
> +
> + mutex_lock(>stat_lock);
> +
> + /*
> +  * Collect the per-cpu time buckets and average them into a
> +  * single time sample that is normalized to wallclock time.
> +  *
> +  * For averaging, each CPU is weighted by its non-idle time in
> +  * the sampling period. This eliminates artifacts from uneven
> +  * loading, or even entirely idle CPUs.
> +  *
> +  * We could pin the online CPUs here, but the noise introduced
> +  * by missing up to one sample period from CPUs that are going
> +  * away shouldn't matter in practice - just like the noise of
> +  * previously offlined CPUs returning with a non-zero sample.

But why!? cpuu_read_lock() is neither expensive nor complicated. So why
try and avoid it?

> +  */
> + for_each_online_cpu(cpu) {
> + struct psi_group_cpu *groupc = per_cpu_ptr(group->cpus, cpu);
> + unsigned long nonidle;
> +
> + if (!groupc->nonidle_time)
> + continue;
> +
> + nonidle = nsecs_to_jiffies(groupc->nonidle_time);
> + groupc->nonidle_time = 0;
> + nonidle_total += nonidle;
> +
> + for (r = 0; r < NR_PSI_RESOURCES; r++) {
> + struct psi_resource *res = >res[r];
> +
> + some[r] += (res->times[0] + res->times[1]) * nonidle;
> + full[r] += res->times[1] * nonidle;
> +
> + /* It's racy, but we can tolerate some error */
> + res->times[0] = 0;
> + res->times[1] = 0;
> + }
> + }
> +
> + /*
> +  * Integrate the sample into the running statistics that are
> +  * reported to userspace: the cumulative stall times and the
> +  * decaying averages.
> +  *
> +  * Pressure percentages are sampled at PSI_FREQ. We might be
> +  * called more often when the user polls more frequently than
> +  * that; we might be called less often when there is no task
> +  * activity, thus no data, and clock ticks are sporadic. The
> +  * below handles both.
> +  */
> +
> + /* total= */
> + for (r = 0; r < NR_PSI_RESOURCES; r++) {
> + do_div(some[r], max(nonidle_total, 1UL));
> + do_div(full[r], max(nonidle_total, 1UL));
> +
> + group->some[r] += some[r];
> + group->full[r] += full[r];

group->some[r] = div64_ul(some[r], max(nonidle_total, 1UL));
group->full[r] = div64_ul(full[r], max(nonidle_total, 1UL));

Is easier to read imo.

> + }
> +
> + /* avgX= */
> + expires = group->period_expires;
> + if (time_before(jiffies, expires))
> + goto out;
> +
> + missed_periods = (jiffies - expires) / PSI_FREQ;
> + group->period_expires = expires + ((1 + missed_periods) * PSI_FREQ);
> +
> + for (r = 0; r < NR_PSI_RESOURCES; r++) {
> + u64 some, full;
> +
> + some = group->some[r] - group->last_some[r];
> + full = group->full[r] - group->last_full[r];
> +
> + calc_avgs(group->avg_some[r], some, missed_periods);
> + calc_avgs(group->avg_full[r], full, missed_periods);
> +
> + group->last_some[r] = group->some[r];
> + group->last_full[r] = group->full[r];
> + }
> +out:
> + mutex_unlock(>stat_lock);
> + return nonidle_total;
> +}


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-17 Thread Peter Zijlstra
On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> +static bool psi_update_stats(struct psi_group *group)
> +{
> + u64 some[NR_PSI_RESOURCES] = { 0, };
> + u64 full[NR_PSI_RESOURCES] = { 0, };
> + unsigned long nonidle_total = 0;
> + unsigned long missed_periods;
> + unsigned long expires;
> + int cpu;
> + int r;
> +
> + mutex_lock(>stat_lock);
> +
> + /*
> +  * Collect the per-cpu time buckets and average them into a
> +  * single time sample that is normalized to wallclock time.
> +  *
> +  * For averaging, each CPU is weighted by its non-idle time in
> +  * the sampling period. This eliminates artifacts from uneven
> +  * loading, or even entirely idle CPUs.
> +  *
> +  * We could pin the online CPUs here, but the noise introduced
> +  * by missing up to one sample period from CPUs that are going
> +  * away shouldn't matter in practice - just like the noise of
> +  * previously offlined CPUs returning with a non-zero sample.

But why!? cpuu_read_lock() is neither expensive nor complicated. So why
try and avoid it?

> +  */
> + for_each_online_cpu(cpu) {
> + struct psi_group_cpu *groupc = per_cpu_ptr(group->cpus, cpu);
> + unsigned long nonidle;
> +
> + if (!groupc->nonidle_time)
> + continue;
> +
> + nonidle = nsecs_to_jiffies(groupc->nonidle_time);
> + groupc->nonidle_time = 0;
> + nonidle_total += nonidle;
> +
> + for (r = 0; r < NR_PSI_RESOURCES; r++) {
> + struct psi_resource *res = >res[r];
> +
> + some[r] += (res->times[0] + res->times[1]) * nonidle;
> + full[r] += res->times[1] * nonidle;
> +
> + /* It's racy, but we can tolerate some error */
> + res->times[0] = 0;
> + res->times[1] = 0;
> + }
> + }
> +
> + /*
> +  * Integrate the sample into the running statistics that are
> +  * reported to userspace: the cumulative stall times and the
> +  * decaying averages.
> +  *
> +  * Pressure percentages are sampled at PSI_FREQ. We might be
> +  * called more often when the user polls more frequently than
> +  * that; we might be called less often when there is no task
> +  * activity, thus no data, and clock ticks are sporadic. The
> +  * below handles both.
> +  */
> +
> + /* total= */
> + for (r = 0; r < NR_PSI_RESOURCES; r++) {
> + do_div(some[r], max(nonidle_total, 1UL));
> + do_div(full[r], max(nonidle_total, 1UL));
> +
> + group->some[r] += some[r];
> + group->full[r] += full[r];

group->some[r] = div64_ul(some[r], max(nonidle_total, 1UL));
group->full[r] = div64_ul(full[r], max(nonidle_total, 1UL));

Is easier to read imo.

> + }
> +
> + /* avgX= */
> + expires = group->period_expires;
> + if (time_before(jiffies, expires))
> + goto out;
> +
> + missed_periods = (jiffies - expires) / PSI_FREQ;
> + group->period_expires = expires + ((1 + missed_periods) * PSI_FREQ);
> +
> + for (r = 0; r < NR_PSI_RESOURCES; r++) {
> + u64 some, full;
> +
> + some = group->some[r] - group->last_some[r];
> + full = group->full[r] - group->last_full[r];
> +
> + calc_avgs(group->avg_some[r], some, missed_periods);
> + calc_avgs(group->avg_full[r], full, missed_periods);
> +
> + group->last_some[r] = group->some[r];
> + group->last_full[r] = group->full[r];
> + }
> +out:
> + mutex_unlock(>stat_lock);
> + return nonidle_total;
> +}


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-17 Thread Peter Zijlstra
On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> diff --git a/include/linux/sched/stat.h b/include/linux/sched/stat.h
> index 04f1321d14c4..ac39435d1521 100644
> --- a/include/linux/sched/stat.h
> +++ b/include/linux/sched/stat.h
> @@ -28,10 +28,14 @@ static inline int sched_info_on(void)
>   return 1;
>  #elif defined(CONFIG_TASK_DELAY_ACCT)
>   extern int delayacct_on;
> + if (delayacct_on)
> + return 1;
> +#elif defined(CONFIG_PSI)
> + extern int psi_disabled;
> + if (!psi_disabled)
> + return 1;
>  #endif
> + return 0;
>  }

Doesn't that want to be something like:

static inline bool sched_info_on(void)
{
#ifdef CONFIG_SCHEDSTAT
return true;
#else /* !SCHEDSTAT */
#ifdef CONFIG_TASK_DELAY_ACCT
extern int delayacct_on;
if (delayacct_on)
return true;
#endif /* DELAYACCT */
#ifdef CONFIG_PSI
extern int psi_disabled;
if (!psi_disabled)
return true;
#endif
return false;
#endif /* !SCHEDSTATE */
}

Such that if you build a TASK_DELAY_ACCT && PSI kernel, and boot with
nodelayacct, you still get sched_info_on().


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-17 Thread Peter Zijlstra
On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> diff --git a/include/linux/sched/stat.h b/include/linux/sched/stat.h
> index 04f1321d14c4..ac39435d1521 100644
> --- a/include/linux/sched/stat.h
> +++ b/include/linux/sched/stat.h
> @@ -28,10 +28,14 @@ static inline int sched_info_on(void)
>   return 1;
>  #elif defined(CONFIG_TASK_DELAY_ACCT)
>   extern int delayacct_on;
> + if (delayacct_on)
> + return 1;
> +#elif defined(CONFIG_PSI)
> + extern int psi_disabled;
> + if (!psi_disabled)
> + return 1;
>  #endif
> + return 0;
>  }

Doesn't that want to be something like:

static inline bool sched_info_on(void)
{
#ifdef CONFIG_SCHEDSTAT
return true;
#else /* !SCHEDSTAT */
#ifdef CONFIG_TASK_DELAY_ACCT
extern int delayacct_on;
if (delayacct_on)
return true;
#endif /* DELAYACCT */
#ifdef CONFIG_PSI
extern int psi_disabled;
if (!psi_disabled)
return true;
#endif
return false;
#endif /* !SCHEDSTATE */
}

Such that if you build a TASK_DELAY_ACCT && PSI kernel, and boot with
nodelayacct, you still get sched_info_on().


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-17 Thread Peter Zijlstra
On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> +/* Tracked task states */
> +enum psi_task_count {
> + NR_RUNNING,
> + NR_IOWAIT,
> + NR_MEMSTALL,
> + NR_PSI_TASK_COUNTS,
> +};

> +/* Resources that workloads could be stalled on */
> +enum psi_res {
> + PSI_CPU,
> + PSI_MEM,
> + PSI_IO,
> + NR_PSI_RESOURCES,
> +};

These two have mem and iowait in different order. It really doesn't
matter, but my brain stumbled.


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-17 Thread Peter Zijlstra
On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> +/* Tracked task states */
> +enum psi_task_count {
> + NR_RUNNING,
> + NR_IOWAIT,
> + NR_MEMSTALL,
> + NR_PSI_TASK_COUNTS,
> +};

> +/* Resources that workloads could be stalled on */
> +enum psi_res {
> + PSI_CPU,
> + PSI_MEM,
> + PSI_IO,
> + NR_PSI_RESOURCES,
> +};

These two have mem and iowait in different order. It really doesn't
matter, but my brain stumbled.


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-17 Thread Peter Zijlstra
On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> +static void time_state(struct psi_resource *res, int state, u64 now)
> +{
> + if (res->state != PSI_NONE) {
> + bool was_full = res->state == PSI_FULL;
> +
> + res->times[was_full] += now - res->state_start;
> + }
> + if (res->state != state)
> + res->state = state;
> + if (res->state != PSI_NONE)
> + res->state_start = now;
> +}
> +
> +static void psi_group_change(struct psi_group *group, int cpu, u64 now,
> +  unsigned int clear, unsigned int set)
> +{
> + enum psi_state state = PSI_NONE;
> + struct psi_group_cpu *groupc;
> + unsigned int *tasks;
> + unsigned int to, bo;
> +
> + groupc = per_cpu_ptr(group->cpus, cpu);
> + tasks = groupc->tasks;
> +
> + /* Update task counts according to the set/clear bitmasks */
> + for (to = 0; (bo = ffs(clear)); to += bo, clear >>= bo) {
> + int idx = to + (bo - 1);
> +
> + if (tasks[idx] == 0 && !psi_bug) {
> + printk_deferred(KERN_ERR "psi: task underflow! cpu=%d 
> idx=%d tasks=[%u %u %u] clear=%x set=%x\n",
> + cpu, idx, tasks[0], tasks[1], tasks[2],
> + clear, set);
> + psi_bug = 1;
> + }
> + tasks[idx]--;
> + }
> + for (to = 0; (bo = ffs(set)); to += bo, set >>= bo)
> + tasks[to + (bo - 1)]++;
> +
> + /* Time in which tasks wait for the CPU */
> + state = PSI_NONE;
> + if (tasks[NR_RUNNING] > 1)
> + state = PSI_SOME;
> + time_state(>res[PSI_CPU], state, now);
> +
> + /* Time in which tasks wait for memory */
> + state = PSI_NONE;
> + if (tasks[NR_MEMSTALL]) {
> + if (!tasks[NR_RUNNING] ||
> + (cpu_curr(cpu)->flags & PF_MEMSTALL))
> + state = PSI_FULL;
> + else
> + state = PSI_SOME;
> + }
> + time_state(>res[PSI_MEM], state, now);
> +
> + /* Time in which tasks wait for IO */
> + state = PSI_NONE;
> + if (tasks[NR_IOWAIT]) {
> + if (!tasks[NR_RUNNING])
> + state = PSI_FULL;
> + else
> + state = PSI_SOME;
> + }
> + time_state(>res[PSI_IO], state, now);
> +
> + /* Time in which tasks are non-idle, to weigh the CPU in summaries */
> + if (groupc->nonidle)
> + groupc->nonidle_time += now - groupc->nonidle_start;
> + groupc->nonidle = tasks[NR_RUNNING] ||
> + tasks[NR_IOWAIT] || tasks[NR_MEMSTALL];
> + if (groupc->nonidle)
> + groupc->nonidle_start = now;
> +
> + /* Kick the stats aggregation worker if it's gone to sleep */
> + if (!delayed_work_pending(>clock_work))
> + schedule_delayed_work(>clock_work, PSI_FREQ);
> +}
> +
> +void psi_task_change(struct task_struct *task, u64 now, int clear, int set)
> +{
> + int cpu = task_cpu(task);
> +
> + if (psi_disabled)
> + return;
> +
> + if (!task->pid)
> + return;
> +
> + if (((task->psi_flags & set) ||
> +  (task->psi_flags & clear) != clear) &&
> + !psi_bug) {
> + printk_deferred(KERN_ERR "psi: inconsistent task state! 
> task=%d:%s cpu=%d psi_flags=%x clear=%x set=%x\n",
> + task->pid, task->comm, cpu,
> + task->psi_flags, clear, set);
> + psi_bug = 1;
> + }
> +
> + task->psi_flags &= ~clear;
> + task->psi_flags |= set;
> +
> + psi_group_change(_system, cpu, now, clear, set);
> +}


> +/*
> + * PSI tracks state that persists across sleeps, such as iowaits and
> + * memory stalls. As a result, it has to distinguish between sleeps,
> + * where a task's runnable state changes, and requeues, where a task
> + * and its state are being moved between CPUs and runqueues.
> + */
> +static inline void psi_enqueue(struct task_struct *p, u64 now, bool wakeup)
> +{
> + int clear = 0, set = TSK_RUNNING;
> +
> + if (psi_disabled)
> + return;
> +
> + if (!wakeup || p->sched_psi_wake_requeue) {
> + if (p->flags & PF_MEMSTALL)
> + set |= TSK_MEMSTALL;
> + if (p->sched_psi_wake_requeue)
> + p->sched_psi_wake_requeue = 0;
> + } else {
> + if (p->in_iowait)
> + clear |= TSK_IOWAIT;
> + }
> +
> + psi_task_change(p, now, clear, set);
> +}
> +
> +static inline void psi_dequeue(struct task_struct *p, u64 now, bool sleep)
> +{
> + int clear = TSK_RUNNING, set = 0;
> +
> + if (psi_disabled)
> + return;
> +
> + if (!sleep) {
> + if (p->flags & PF_MEMSTALL)
> + clear |= TSK_MEMSTALL;
> + } else {
> + if (p->in_iowait)
> + set |= TSK_IOWAIT;

Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-17 Thread Peter Zijlstra
On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> +static void time_state(struct psi_resource *res, int state, u64 now)
> +{
> + if (res->state != PSI_NONE) {
> + bool was_full = res->state == PSI_FULL;
> +
> + res->times[was_full] += now - res->state_start;
> + }
> + if (res->state != state)
> + res->state = state;
> + if (res->state != PSI_NONE)
> + res->state_start = now;
> +}
> +
> +static void psi_group_change(struct psi_group *group, int cpu, u64 now,
> +  unsigned int clear, unsigned int set)
> +{
> + enum psi_state state = PSI_NONE;
> + struct psi_group_cpu *groupc;
> + unsigned int *tasks;
> + unsigned int to, bo;
> +
> + groupc = per_cpu_ptr(group->cpus, cpu);
> + tasks = groupc->tasks;
> +
> + /* Update task counts according to the set/clear bitmasks */
> + for (to = 0; (bo = ffs(clear)); to += bo, clear >>= bo) {
> + int idx = to + (bo - 1);
> +
> + if (tasks[idx] == 0 && !psi_bug) {
> + printk_deferred(KERN_ERR "psi: task underflow! cpu=%d 
> idx=%d tasks=[%u %u %u] clear=%x set=%x\n",
> + cpu, idx, tasks[0], tasks[1], tasks[2],
> + clear, set);
> + psi_bug = 1;
> + }
> + tasks[idx]--;
> + }
> + for (to = 0; (bo = ffs(set)); to += bo, set >>= bo)
> + tasks[to + (bo - 1)]++;
> +
> + /* Time in which tasks wait for the CPU */
> + state = PSI_NONE;
> + if (tasks[NR_RUNNING] > 1)
> + state = PSI_SOME;
> + time_state(>res[PSI_CPU], state, now);
> +
> + /* Time in which tasks wait for memory */
> + state = PSI_NONE;
> + if (tasks[NR_MEMSTALL]) {
> + if (!tasks[NR_RUNNING] ||
> + (cpu_curr(cpu)->flags & PF_MEMSTALL))
> + state = PSI_FULL;
> + else
> + state = PSI_SOME;
> + }
> + time_state(>res[PSI_MEM], state, now);
> +
> + /* Time in which tasks wait for IO */
> + state = PSI_NONE;
> + if (tasks[NR_IOWAIT]) {
> + if (!tasks[NR_RUNNING])
> + state = PSI_FULL;
> + else
> + state = PSI_SOME;
> + }
> + time_state(>res[PSI_IO], state, now);
> +
> + /* Time in which tasks are non-idle, to weigh the CPU in summaries */
> + if (groupc->nonidle)
> + groupc->nonidle_time += now - groupc->nonidle_start;
> + groupc->nonidle = tasks[NR_RUNNING] ||
> + tasks[NR_IOWAIT] || tasks[NR_MEMSTALL];
> + if (groupc->nonidle)
> + groupc->nonidle_start = now;
> +
> + /* Kick the stats aggregation worker if it's gone to sleep */
> + if (!delayed_work_pending(>clock_work))
> + schedule_delayed_work(>clock_work, PSI_FREQ);
> +}
> +
> +void psi_task_change(struct task_struct *task, u64 now, int clear, int set)
> +{
> + int cpu = task_cpu(task);
> +
> + if (psi_disabled)
> + return;
> +
> + if (!task->pid)
> + return;
> +
> + if (((task->psi_flags & set) ||
> +  (task->psi_flags & clear) != clear) &&
> + !psi_bug) {
> + printk_deferred(KERN_ERR "psi: inconsistent task state! 
> task=%d:%s cpu=%d psi_flags=%x clear=%x set=%x\n",
> + task->pid, task->comm, cpu,
> + task->psi_flags, clear, set);
> + psi_bug = 1;
> + }
> +
> + task->psi_flags &= ~clear;
> + task->psi_flags |= set;
> +
> + psi_group_change(_system, cpu, now, clear, set);
> +}


> +/*
> + * PSI tracks state that persists across sleeps, such as iowaits and
> + * memory stalls. As a result, it has to distinguish between sleeps,
> + * where a task's runnable state changes, and requeues, where a task
> + * and its state are being moved between CPUs and runqueues.
> + */
> +static inline void psi_enqueue(struct task_struct *p, u64 now, bool wakeup)
> +{
> + int clear = 0, set = TSK_RUNNING;
> +
> + if (psi_disabled)
> + return;
> +
> + if (!wakeup || p->sched_psi_wake_requeue) {
> + if (p->flags & PF_MEMSTALL)
> + set |= TSK_MEMSTALL;
> + if (p->sched_psi_wake_requeue)
> + p->sched_psi_wake_requeue = 0;
> + } else {
> + if (p->in_iowait)
> + clear |= TSK_IOWAIT;
> + }
> +
> + psi_task_change(p, now, clear, set);
> +}
> +
> +static inline void psi_dequeue(struct task_struct *p, u64 now, bool sleep)
> +{
> + int clear = TSK_RUNNING, set = 0;
> +
> + if (psi_disabled)
> + return;
> +
> + if (!sleep) {
> + if (p->flags & PF_MEMSTALL)
> + clear |= TSK_MEMSTALL;
> + } else {
> + if (p->in_iowait)
> + set |= TSK_IOWAIT;

Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-14 Thread Peter Zijlstra
On Fri, Jul 13, 2018 at 12:17:56PM -0400, Johannes Weiner wrote:
> On Fri, Jul 13, 2018 at 11:21:53AM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > > +static inline void psi_ttwu_dequeue(struct task_struct *p)
> > > +{
> > > + if (psi_disabled)
> > > + return;
> > > + /*
> > > +  * Is the task being migrated during a wakeup? Make sure to
> > > +  * deregister its sleep-persistent psi states from the old
> > > +  * queue, and let psi_enqueue() know it has to requeue.
> > > +  */
> > > + if (unlikely(p->in_iowait || (p->flags & PF_MEMSTALL))) {
> > > + struct rq_flags rf;
> > > + struct rq *rq;
> > > + int clear = 0;
> > > +
> > > + if (p->in_iowait)
> > > + clear |= TSK_IOWAIT;
> > > + if (p->flags & PF_MEMSTALL)
> > > + clear |= TSK_MEMSTALL;
> > > +
> > > + rq = __task_rq_lock(p, );
> > > + update_rq_clock(rq);
> > > + psi_task_change(p, rq_clock(rq), clear, 0);
> > > + p->sched_psi_wake_requeue = 1;
> > > + __task_rq_unlock(rq, );
> > > + }
> > > +}
> > 
> > Still NAK, what happened to this here:

> That's my thought process, anyway. I'd be more than happy to make this
> more lightweight, but I don't see a way to do it without losing
> significant functional precision.

I think you're going to have to. We put a lot of effort into not taking
the old rq->lock on remote wakeups and got a significant performance
benefit from that.

You just utterly destroyed that for workloads with a high number of
iowait wakeups.


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-14 Thread Peter Zijlstra
On Fri, Jul 13, 2018 at 12:17:56PM -0400, Johannes Weiner wrote:
> On Fri, Jul 13, 2018 at 11:21:53AM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > > +static inline void psi_ttwu_dequeue(struct task_struct *p)
> > > +{
> > > + if (psi_disabled)
> > > + return;
> > > + /*
> > > +  * Is the task being migrated during a wakeup? Make sure to
> > > +  * deregister its sleep-persistent psi states from the old
> > > +  * queue, and let psi_enqueue() know it has to requeue.
> > > +  */
> > > + if (unlikely(p->in_iowait || (p->flags & PF_MEMSTALL))) {
> > > + struct rq_flags rf;
> > > + struct rq *rq;
> > > + int clear = 0;
> > > +
> > > + if (p->in_iowait)
> > > + clear |= TSK_IOWAIT;
> > > + if (p->flags & PF_MEMSTALL)
> > > + clear |= TSK_MEMSTALL;
> > > +
> > > + rq = __task_rq_lock(p, );
> > > + update_rq_clock(rq);
> > > + psi_task_change(p, rq_clock(rq), clear, 0);
> > > + p->sched_psi_wake_requeue = 1;
> > > + __task_rq_unlock(rq, );
> > > + }
> > > +}
> > 
> > Still NAK, what happened to this here:

> That's my thought process, anyway. I'd be more than happy to make this
> more lightweight, but I don't see a way to do it without losing
> significant functional precision.

I think you're going to have to. We put a lot of effort into not taking
the old rq->lock on remote wakeups and got a significant performance
benefit from that.

You just utterly destroyed that for workloads with a high number of
iowait wakeups.


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-14 Thread Peter Zijlstra


Hi Johannes,

A few quick comments on first reading; I'll do a second and more
thorough reading on Monday.

On Fri, Jul 13, 2018 at 12:17:56PM -0400, Johannes Weiner wrote:
> First off, what I want to do can indeed be done without a strong link
> of a sleeping task to a CPU. We don't rely on it, and it's something I
> only figured out in v2. The important thing is not, as I previously
> thought, that CPUs are tracked independently from each other, but that
> we use potential execution threads as the baseline for potential that
> could be wasted by resource delays. Tracking CPUs independently just
> happens to do that implicitly, but it's not a requirement.

I don't follow, but I don't think I agree.

Consider the case of 2 CPUs and 2 blocked tasks. If they both blocked on
the same CPU, then only that CPU has lost potential. Whereas the only
thing that matters is the number of blocked tasks and the number of idle
CPUs.

Those two tasks can fill the two idle CPUs. Tracking per CPU just
utterly confuses the matter.


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-14 Thread Peter Zijlstra


Hi Johannes,

A few quick comments on first reading; I'll do a second and more
thorough reading on Monday.

On Fri, Jul 13, 2018 at 12:17:56PM -0400, Johannes Weiner wrote:
> First off, what I want to do can indeed be done without a strong link
> of a sleeping task to a CPU. We don't rely on it, and it's something I
> only figured out in v2. The important thing is not, as I previously
> thought, that CPUs are tracked independently from each other, but that
> we use potential execution threads as the baseline for potential that
> could be wasted by resource delays. Tracking CPUs independently just
> happens to do that implicitly, but it's not a requirement.

I don't follow, but I don't think I agree.

Consider the case of 2 CPUs and 2 blocked tasks. If they both blocked on
the same CPU, then only that CPU has lost potential. Whereas the only
thing that matters is the number of blocked tasks and the number of idle
CPUs.

Those two tasks can fill the two idle CPUs. Tracking per CPU just
utterly confuses the matter.


Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-13 Thread Johannes Weiner
Hi Peter,

On Fri, Jul 13, 2018 at 11:21:53AM +0200, Peter Zijlstra wrote:
> On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > +static inline void psi_ttwu_dequeue(struct task_struct *p)
> > +{
> > +   if (psi_disabled)
> > +   return;
> > +   /*
> > +* Is the task being migrated during a wakeup? Make sure to
> > +* deregister its sleep-persistent psi states from the old
> > +* queue, and let psi_enqueue() know it has to requeue.
> > +*/
> > +   if (unlikely(p->in_iowait || (p->flags & PF_MEMSTALL))) {
> > +   struct rq_flags rf;
> > +   struct rq *rq;
> > +   int clear = 0;
> > +
> > +   if (p->in_iowait)
> > +   clear |= TSK_IOWAIT;
> > +   if (p->flags & PF_MEMSTALL)
> > +   clear |= TSK_MEMSTALL;
> > +
> > +   rq = __task_rq_lock(p, );
> > +   update_rq_clock(rq);
> > +   psi_task_change(p, rq_clock(rq), clear, 0);
> > +   p->sched_psi_wake_requeue = 1;
> > +   __task_rq_unlock(rq, );
> > +   }
> > +}
> 
> Still NAK, what happened to this here:
> 
>   
> https://lkml.kernel.org/r/20180514083353.gn12...@hirez.programming.kicks-ass.net

I did react to this in the v2 docs / code comments, but I should have
been more direct about addressing your points - sorry about that.

In that thread we disagree about exactly how to aggregate task stalls
to produce meaningful numbers, but your main issue is with the way we
track state per-CPU instead of globally, given the rq lock cost on
wake-migration and the meaning of task->cpu of a sleeping task.

First off, what I want to do can indeed be done without a strong link
of a sleeping task to a CPU. We don't rely on it, and it's something I
only figured out in v2. The important thing is not, as I previously
thought, that CPUs are tracked independently from each other, but that
we use potential execution threads as the baseline for potential that
could be wasted by resource delays. Tracking CPUs independently just
happens to do that implicitly, but it's not a requirement.

In v2 of psi.c I'm outlining a model that formulates the SOME and FULL
states from global state in a way that still produces meaningful
numbers on SMP machines by comparing the task state to the number of
possible concurrent execution threads. Here is the excerpt:

threads = min(nr_nonidle_tasks, nr_cpus)
   SOME = min(nr_delayed_tasks / threads, 1)
   FULL = (threads - min(nr_running_tasks, threads)) / threads

It's followed in psi.c by examples of how/why it works, but whether
you agree with the exact formula or not, what you can see is that it
could be implemented exactly like the load average: use per-cpu
counters to construct global values for those task counts, fold and
sample that state periodically and feed it into the running averages.

So whytf is it still done with cpu-local task states?

The general problem with sampling here is that it's way too coarse to
capture the events we want to know about. The load average is okay-ish
for long term trends, but interactive things care about stalls in the
millisecond range each, and we cannot get those accurately with
second-long sampling intervals (and we cannot fold the CPU state much
more frequently than this before it gets prohibitively expensive).

Since our stall states are composed of multiple tasks, recording the
precise time spent in them requires some sort of serialization with
scheduling activity, and doing that globally would be a non-starter on
SMP. Hence still the CPU-local state tracking to approximate the
global state.

Now to your concern about relying on the task<->CPU association.

We don't *really* rely on a strict association, it's more of a hint or
historic correlation. It's fine if tasks move around on us, we just
want to approximate when CPUs go idle due to stalls or lack of work.
Let's take your quote from the thread:

: Note that a task doesn't sleep on a CPU. When it sleeps it is not
: strictly associated with a CPU, only when it runs does it have an
: association.
:
: What is the value of accounting a sleep state to a particular CPU
: if the task when wakes up on another? Where did the sleep take place?

Let's say you have a CPU running a task that then stalls on
memory. When it wakes back up it gets moved to another CPU.

We don't care so much about what happens after the task wakes up, we
just need to know where the task was running when it stalled. Even if
the task gets migrated on wakeup - *while* the stall is occuring, we
can say whether that task's old CPU goes idle due to that stall, and
has to report FULL; or something else can run on it, in which case it
only reports SOME. And even if the task bounced around CPUs while it
was running, and it was only briefly on the CPU on which it stalled -
what we care about is a CPU being idle because of stalls instead of a
genuine lack of work.

This is certainly susceptible to delayed tasks bunching 

Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-13 Thread Johannes Weiner
Hi Peter,

On Fri, Jul 13, 2018 at 11:21:53AM +0200, Peter Zijlstra wrote:
> On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > +static inline void psi_ttwu_dequeue(struct task_struct *p)
> > +{
> > +   if (psi_disabled)
> > +   return;
> > +   /*
> > +* Is the task being migrated during a wakeup? Make sure to
> > +* deregister its sleep-persistent psi states from the old
> > +* queue, and let psi_enqueue() know it has to requeue.
> > +*/
> > +   if (unlikely(p->in_iowait || (p->flags & PF_MEMSTALL))) {
> > +   struct rq_flags rf;
> > +   struct rq *rq;
> > +   int clear = 0;
> > +
> > +   if (p->in_iowait)
> > +   clear |= TSK_IOWAIT;
> > +   if (p->flags & PF_MEMSTALL)
> > +   clear |= TSK_MEMSTALL;
> > +
> > +   rq = __task_rq_lock(p, );
> > +   update_rq_clock(rq);
> > +   psi_task_change(p, rq_clock(rq), clear, 0);
> > +   p->sched_psi_wake_requeue = 1;
> > +   __task_rq_unlock(rq, );
> > +   }
> > +}
> 
> Still NAK, what happened to this here:
> 
>   
> https://lkml.kernel.org/r/20180514083353.gn12...@hirez.programming.kicks-ass.net

I did react to this in the v2 docs / code comments, but I should have
been more direct about addressing your points - sorry about that.

In that thread we disagree about exactly how to aggregate task stalls
to produce meaningful numbers, but your main issue is with the way we
track state per-CPU instead of globally, given the rq lock cost on
wake-migration and the meaning of task->cpu of a sleeping task.

First off, what I want to do can indeed be done without a strong link
of a sleeping task to a CPU. We don't rely on it, and it's something I
only figured out in v2. The important thing is not, as I previously
thought, that CPUs are tracked independently from each other, but that
we use potential execution threads as the baseline for potential that
could be wasted by resource delays. Tracking CPUs independently just
happens to do that implicitly, but it's not a requirement.

In v2 of psi.c I'm outlining a model that formulates the SOME and FULL
states from global state in a way that still produces meaningful
numbers on SMP machines by comparing the task state to the number of
possible concurrent execution threads. Here is the excerpt:

threads = min(nr_nonidle_tasks, nr_cpus)
   SOME = min(nr_delayed_tasks / threads, 1)
   FULL = (threads - min(nr_running_tasks, threads)) / threads

It's followed in psi.c by examples of how/why it works, but whether
you agree with the exact formula or not, what you can see is that it
could be implemented exactly like the load average: use per-cpu
counters to construct global values for those task counts, fold and
sample that state periodically and feed it into the running averages.

So whytf is it still done with cpu-local task states?

The general problem with sampling here is that it's way too coarse to
capture the events we want to know about. The load average is okay-ish
for long term trends, but interactive things care about stalls in the
millisecond range each, and we cannot get those accurately with
second-long sampling intervals (and we cannot fold the CPU state much
more frequently than this before it gets prohibitively expensive).

Since our stall states are composed of multiple tasks, recording the
precise time spent in them requires some sort of serialization with
scheduling activity, and doing that globally would be a non-starter on
SMP. Hence still the CPU-local state tracking to approximate the
global state.

Now to your concern about relying on the task<->CPU association.

We don't *really* rely on a strict association, it's more of a hint or
historic correlation. It's fine if tasks move around on us, we just
want to approximate when CPUs go idle due to stalls or lack of work.
Let's take your quote from the thread:

: Note that a task doesn't sleep on a CPU. When it sleeps it is not
: strictly associated with a CPU, only when it runs does it have an
: association.
:
: What is the value of accounting a sleep state to a particular CPU
: if the task when wakes up on another? Where did the sleep take place?

Let's say you have a CPU running a task that then stalls on
memory. When it wakes back up it gets moved to another CPU.

We don't care so much about what happens after the task wakes up, we
just need to know where the task was running when it stalled. Even if
the task gets migrated on wakeup - *while* the stall is occuring, we
can say whether that task's old CPU goes idle due to that stall, and
has to report FULL; or something else can run on it, in which case it
only reports SOME. And even if the task bounced around CPUs while it
was running, and it was only briefly on the CPU on which it stalled -
what we care about is a CPU being idle because of stalls instead of a
genuine lack of work.

This is certainly susceptible to delayed tasks bunching 

Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-13 Thread Peter Zijlstra
On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> +static inline void psi_ttwu_dequeue(struct task_struct *p)
> +{
> + if (psi_disabled)
> + return;
> + /*
> +  * Is the task being migrated during a wakeup? Make sure to
> +  * deregister its sleep-persistent psi states from the old
> +  * queue, and let psi_enqueue() know it has to requeue.
> +  */
> + if (unlikely(p->in_iowait || (p->flags & PF_MEMSTALL))) {
> + struct rq_flags rf;
> + struct rq *rq;
> + int clear = 0;
> +
> + if (p->in_iowait)
> + clear |= TSK_IOWAIT;
> + if (p->flags & PF_MEMSTALL)
> + clear |= TSK_MEMSTALL;
> +
> + rq = __task_rq_lock(p, );
> + update_rq_clock(rq);
> + psi_task_change(p, rq_clock(rq), clear, 0);
> + p->sched_psi_wake_requeue = 1;
> + __task_rq_unlock(rq, );
> + }
> +}

Still NAK, what happened to this here:

  
https://lkml.kernel.org/r/20180514083353.gn12...@hirez.programming.kicks-ass.net



Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-13 Thread Peter Zijlstra
On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> +static inline void psi_ttwu_dequeue(struct task_struct *p)
> +{
> + if (psi_disabled)
> + return;
> + /*
> +  * Is the task being migrated during a wakeup? Make sure to
> +  * deregister its sleep-persistent psi states from the old
> +  * queue, and let psi_enqueue() know it has to requeue.
> +  */
> + if (unlikely(p->in_iowait || (p->flags & PF_MEMSTALL))) {
> + struct rq_flags rf;
> + struct rq *rq;
> + int clear = 0;
> +
> + if (p->in_iowait)
> + clear |= TSK_IOWAIT;
> + if (p->flags & PF_MEMSTALL)
> + clear |= TSK_MEMSTALL;
> +
> + rq = __task_rq_lock(p, );
> + update_rq_clock(rq);
> + psi_task_change(p, rq_clock(rq), clear, 0);
> + p->sched_psi_wake_requeue = 1;
> + __task_rq_unlock(rq, );
> + }
> +}

Still NAK, what happened to this here:

  
https://lkml.kernel.org/r/20180514083353.gn12...@hirez.programming.kicks-ass.net



[PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-12 Thread Johannes Weiner
When systems are overcommitted and resources become contended, it's
hard to tell exactly the impact this has on workload productivity, or
how close the system is to lockups and OOM kills. In particular, when
machines work multiple jobs concurrently, the impact of overcommit in
terms of latency and throughput on the individual job can be enormous.

In order to maximize hardware utilization without sacrificing
individual job health or risk complete machine lockups, this patch
implements a way to quantify resource pressure in the system.

A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or
IO, respectively. Stall states are aggregate versions of the per-task
delay accounting delays:

   cpu: some tasks are runnable but not executing on a CPU
   memory: tasks are reclaiming, or waiting for swapin or thrashing cache
   io: tasks are waiting for io completions

These percentages of walltime can be thought of as pressure
percentages, and they give a general sense of system health and
productivity loss incurred by resource overcommit. They can also
indicate when the system is approaching lockup scenarios and OOMs.

To do this, psi keeps track of the task states associated with each
CPU and samples the time they spend in stall states. Every 2 seconds,
the samples are averaged across CPUs - weighted by the CPUs' non-idle
time to eliminate artifacts from unused CPUs - and translated into
percentages of walltime. A running average of those percentages is
maintained over 10s, 1m, and 5m periods (similar to the loadaverage).

v2:
- stable clock tick, as per Peter
- data structure layout optimization, as per Peter
- fix u64 divisions on 32 bit, as per Peter
- outermost psi_disabled checks, as per Peter
- coding style fixes, as per Peter
- just-in-time stats aggregation, as per Suren
- fix task state corruption with CONFIG_PREEMPT, as per Suren
- CONFIG_PSI=n build error
- avoid writing p->sched_psi_wake_requeue unnecessarily
- documentation & comment updates

Signed-off-by: Johannes Weiner 
---
 Documentation/accounting/psi.txt |  64 
 include/linux/psi.h  |  27 ++
 include/linux/psi_types.h|  90 +
 include/linux/sched.h|  10 +
 include/linux/sched/stat.h   |  10 +-
 init/Kconfig |  16 +
 kernel/fork.c|   4 +
 kernel/sched/Makefile|   1 +
 kernel/sched/core.c  |   7 +-
 kernel/sched/psi.c   | 585 +++
 kernel/sched/sched.h |   2 +
 kernel/sched/stats.h | 102 +-
 mm/compaction.c  |   5 +
 mm/filemap.c |  15 +-
 mm/page_alloc.c  |  10 +
 mm/vmscan.c  |  13 +
 16 files changed, 946 insertions(+), 15 deletions(-)
 create mode 100644 Documentation/accounting/psi.txt
 create mode 100644 include/linux/psi.h
 create mode 100644 include/linux/psi_types.h
 create mode 100644 kernel/sched/psi.c

diff --git a/Documentation/accounting/psi.txt b/Documentation/accounting/psi.txt
new file mode 100644
index ..51e7ef14142e
--- /dev/null
+++ b/Documentation/accounting/psi.txt
@@ -0,0 +1,64 @@
+
+PSI - Pressure Stall Information
+
+
+:Date: April, 2018
+:Author: Johannes Weiner 
+
+When CPU, memory or IO devices are contended, workloads experience
+latency spikes, throughput losses, and run the risk of OOM kills.
+
+Without an accurate measure of such contention, users are forced to
+either play it safe and under-utilize their hardware resources, or
+roll the dice and frequently suffer the disruptions resulting from
+excessive overcommit.
+
+The psi feature identifies and quantifies the disruptions caused by
+such resource crunches and the time impact it has on complex workloads
+or even entire systems.
+
+Having an accurate measure of productivity losses caused by resource
+scarcity aids users in sizing workloads to hardware--or provisioning
+hardware according to workload demand.
+
+As psi aggregates this information in realtime, systems can be managed
+dynamically using techniques such as load shedding, migrating jobs to
+other systems or data centers, or strategically pausing or killing low
+priority or restartable batch jobs.
+
+This allows maximizing hardware utilization without sacrificing
+workload health or risking major disruptions such as OOM kills.
+
+Pressure interface
+==
+
+Pressure information for each resource is exported through the
+respective file in /proc/pressure/ -- cpu, memory, and io.
+
+In both cases, the format for CPU is as such:
+
+some avg10=0.00 avg60=0.00 avg300=0.00 total=0
+
+and for memory and IO:
+
+some avg10=0.00 avg60=0.00 avg300=0.00 total=0
+full avg10=0.00 avg60=0.00 avg300=0.00 total=0
+
+The "some" line indicates the share of time in which at least some
+tasks are 

[PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

2018-07-12 Thread Johannes Weiner
When systems are overcommitted and resources become contended, it's
hard to tell exactly the impact this has on workload productivity, or
how close the system is to lockups and OOM kills. In particular, when
machines work multiple jobs concurrently, the impact of overcommit in
terms of latency and throughput on the individual job can be enormous.

In order to maximize hardware utilization without sacrificing
individual job health or risk complete machine lockups, this patch
implements a way to quantify resource pressure in the system.

A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or
IO, respectively. Stall states are aggregate versions of the per-task
delay accounting delays:

   cpu: some tasks are runnable but not executing on a CPU
   memory: tasks are reclaiming, or waiting for swapin or thrashing cache
   io: tasks are waiting for io completions

These percentages of walltime can be thought of as pressure
percentages, and they give a general sense of system health and
productivity loss incurred by resource overcommit. They can also
indicate when the system is approaching lockup scenarios and OOMs.

To do this, psi keeps track of the task states associated with each
CPU and samples the time they spend in stall states. Every 2 seconds,
the samples are averaged across CPUs - weighted by the CPUs' non-idle
time to eliminate artifacts from unused CPUs - and translated into
percentages of walltime. A running average of those percentages is
maintained over 10s, 1m, and 5m periods (similar to the loadaverage).

v2:
- stable clock tick, as per Peter
- data structure layout optimization, as per Peter
- fix u64 divisions on 32 bit, as per Peter
- outermost psi_disabled checks, as per Peter
- coding style fixes, as per Peter
- just-in-time stats aggregation, as per Suren
- fix task state corruption with CONFIG_PREEMPT, as per Suren
- CONFIG_PSI=n build error
- avoid writing p->sched_psi_wake_requeue unnecessarily
- documentation & comment updates

Signed-off-by: Johannes Weiner 
---
 Documentation/accounting/psi.txt |  64 
 include/linux/psi.h  |  27 ++
 include/linux/psi_types.h|  90 +
 include/linux/sched.h|  10 +
 include/linux/sched/stat.h   |  10 +-
 init/Kconfig |  16 +
 kernel/fork.c|   4 +
 kernel/sched/Makefile|   1 +
 kernel/sched/core.c  |   7 +-
 kernel/sched/psi.c   | 585 +++
 kernel/sched/sched.h |   2 +
 kernel/sched/stats.h | 102 +-
 mm/compaction.c  |   5 +
 mm/filemap.c |  15 +-
 mm/page_alloc.c  |  10 +
 mm/vmscan.c  |  13 +
 16 files changed, 946 insertions(+), 15 deletions(-)
 create mode 100644 Documentation/accounting/psi.txt
 create mode 100644 include/linux/psi.h
 create mode 100644 include/linux/psi_types.h
 create mode 100644 kernel/sched/psi.c

diff --git a/Documentation/accounting/psi.txt b/Documentation/accounting/psi.txt
new file mode 100644
index ..51e7ef14142e
--- /dev/null
+++ b/Documentation/accounting/psi.txt
@@ -0,0 +1,64 @@
+
+PSI - Pressure Stall Information
+
+
+:Date: April, 2018
+:Author: Johannes Weiner 
+
+When CPU, memory or IO devices are contended, workloads experience
+latency spikes, throughput losses, and run the risk of OOM kills.
+
+Without an accurate measure of such contention, users are forced to
+either play it safe and under-utilize their hardware resources, or
+roll the dice and frequently suffer the disruptions resulting from
+excessive overcommit.
+
+The psi feature identifies and quantifies the disruptions caused by
+such resource crunches and the time impact it has on complex workloads
+or even entire systems.
+
+Having an accurate measure of productivity losses caused by resource
+scarcity aids users in sizing workloads to hardware--or provisioning
+hardware according to workload demand.
+
+As psi aggregates this information in realtime, systems can be managed
+dynamically using techniques such as load shedding, migrating jobs to
+other systems or data centers, or strategically pausing or killing low
+priority or restartable batch jobs.
+
+This allows maximizing hardware utilization without sacrificing
+workload health or risking major disruptions such as OOM kills.
+
+Pressure interface
+==
+
+Pressure information for each resource is exported through the
+respective file in /proc/pressure/ -- cpu, memory, and io.
+
+In both cases, the format for CPU is as such:
+
+some avg10=0.00 avg60=0.00 avg300=0.00 total=0
+
+and for memory and IO:
+
+some avg10=0.00 avg60=0.00 avg300=0.00 total=0
+full avg10=0.00 avg60=0.00 avg300=0.00 total=0
+
+The "some" line indicates the share of time in which at least some
+tasks are