Re: [PATCH 07/13] memcontrol: schedule throttling if we are congested

2018-06-11 Thread Johannes Weiner
On Tue, Jun 05, 2018 at 09:29:42AM -0400, Josef Bacik wrote:
> From: Tejun Heo 
> 
> Memory allocations can induce swapping via kswapd or direct reclaim.  If
> we are having IO done for us by kswapd and don't actually go into direct
> reclaim we may never get scheduled for throttling.  So instead check to
> see if our cgroup is congested, and if so schedule the throttling.
> Before we return to user space the throttling stuff will only throttle
> if we actually required it.
> 
> Signed-off-by: Tejun Heo 

Looks good to me now, thanks.

Acked-by: Johannes Weiner 


Re: [PATCH 05/13] swap,blkcg: issue swap io with the appropriate context

2018-06-11 Thread Johannes Weiner
On Tue, Jun 05, 2018 at 09:29:40AM -0400, Josef Bacik wrote:
> From: Tejun Heo 
> 
> For backcharging we need to know who the page belongs to when swapping
> it out.
> 
> Signed-off-by: Tejun Heo 
> Signed-off-by: Josef Bacik 

Acked-by: Johannes Weiner 


Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-23 Thread Johannes Weiner
On Wed, May 09, 2018 at 04:33:24PM +0530, Vinayak Menon wrote:
> On 5/8/2018 2:31 AM, Johannes Weiner wrote:
> > +   /* Kick the stats aggregation worker if it's gone to sleep */
> > +   if (!delayed_work_pending(&group->clock_work))
> 
> This causes a crash when the work is scheduled before system_wq is up. In my
> case the first schedule was called from kthreadd. And I had to do this to
> make it work.
> if (keventd_up() && !delayed_work_pending(&group->clock_work))
>
> > +   schedule_delayed_work(&group->clock_work, MY_LOAD_FREQ);

I was trying to figure out how this is possible, and it didn't make
sense because we do initialize the system_wq way before kthreadd.

Did you by any chance backport this to a pre-4.10 kernel which does
not have 3347fa092821 ("workqueue: make workqueue available early
during boot") yet?

> > +void psi_task_change(struct task_struct *task, u64 now, int clear, int set)
> > +{
> > +   struct cgroup *cgroup, *parent;
> 
> unused variables

They're used in the next patch, I'll fix that up.

Thanks


Re: [PATCH 0/7] psi: pressure stall information for CPU, memory, and IO

2018-05-14 Thread Johannes Weiner
On Mon, May 14, 2018 at 03:39:33PM +, Christopher Lameter wrote:
> On Mon, 7 May 2018, Johannes Weiner wrote:
> 
> > What to make of this number? If CPU utilization is at 100% and CPU
> > pressure is 0, it means the system is perfectly utilized, with one
> > runnable thread per CPU and nobody waiting. At two or more runnable
> > tasks per CPU, the system is 100% overcommitted and the pressure
> > average will indicate as much. From a utilization perspective this is
> > a great state of course: no CPU cycles are being wasted, even when 50%
> > of the threads were to go idle (and most workloads do vary). From the
> > perspective of the individual job it's not great, however, and they
> > might do better with more resources. Depending on what your priority
> > is, an elevated "some" number may or may not require action.
> 
> This looks awfully similar to loadavg. Problem is that loadavg gets
> screwed up by tasks blocked waiting for I/O. Isn't there some way to fix
> loadavg instead?

Counting iowaiting tasks is one thing, but there are a few more things
that make it hard to use for telling the impact of CPU competition:

- It's not normalized to available CPU count. The loadavg in isolation
  doesn't mean anything, and you have to know the number of CPUs and
  any CPU bindings / restrictions in effect, which presents at least
  some difficulty when monitoring a big heterogeneous fleet.

- The way it's sampled makes it impossible to use for latencies. You
  could be mostly idle but periodically have herds of tasks competing
  for the CPU for short, low-latency operations. Even if we changed
  this in the implementation, you're still stuck with the interface
  that has...

- ...a short-term load window of 1m. This is generally fairly coarse
  for something that can be loaded and unloaded as abruptly as the CPU.

I'm trying to fix these with a portable way of aggregating multi-cpu
states, as well as tracking the true time spent in a state instead of
sampling it. Plus a smaller short-term window of 10s, but that's
almost irrelevant because I'm exporting the absolute state time clock
so you can calculate your own averages over any time window you want.

Since I'm using the same model and infrastructure for memory and IO
load as well, IMO it makes more sense to present them in a coherent
interface instead of trying to retrofit and change the loadavg file,
which might not even be possible.
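
To make the "calculate your own averages over any time window" point
concrete, here is a minimal userspace sketch (not part of this series).
It assumes /proc/pressure/memory exposes a "some ... total=" line and
that total= is cumulative stall time in microseconds; both are
assumptions about the final interface, not something spelled out in
this mail.

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Parse the cumulative "some" stall time from /proc/pressure/memory. */
static unsigned long long read_some_total(void)
{
        FILE *f = fopen("/proc/pressure/memory", "r");
        char line[256];
        unsigned long long total = 0;

        if (!f)
                return 0;
        while (fgets(line, sizeof(line), f)) {
                char *t = strstr(line, "total=");

                if (!strncmp(line, "some", 4) && t)
                        sscanf(t, "total=%llu", &total);
        }
        fclose(f);
        return total;
}

int main(void)
{
        unsigned int window_sec = 30;   /* any window you want */
        unsigned long long t0, t1;

        t0 = read_some_total();
        sleep(window_sec);
        t1 = read_some_total();

        /* Stall time accrued in the window vs. the window's walltime. */
        printf("memory some pressure over %us: %.2f%%\n", window_sec,
               (double)(t1 - t0) / (window_sec * 1000000.0) * 100.0);
        return 0;
}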


Re: [PATCH 7/7] psi: cgroup support

2018-05-10 Thread Johannes Weiner
On Wed, May 09, 2018 at 01:07:36PM +0200, Peter Zijlstra wrote:
> On Mon, May 07, 2018 at 05:01:35PM -0400, Johannes Weiner wrote:
> > --- a/kernel/sched/psi.c
> > +++ b/kernel/sched/psi.c
> > @@ -260,6 +260,18 @@ void psi_task_change(struct task_struct *task, u64 
> > now, int clear, int set)
> > task->psi_flags |= set;
> >  
> > psi_group_update(&psi_system, cpu, now, clear, set);
> > +
> > +#ifdef CONFIG_CGROUPS
> > +   cgroup = task->cgroups->dfl_cgrp;
> > +   while (cgroup && (parent = cgroup_parent(cgroup))) {
> > +   struct psi_group *group;
> > +
> > +   group = cgroup_psi(cgroup);
> > +   psi_group_update(group, cpu, now, clear, set);
> > +
> > +   cgroup = parent;
> > +   }
> > +#endif
> >  }
> 
> TJ fixed needing that for stats at some point, why can't you do the
> same?

The stats deltas are all additive, so it's okay to delay flushing them
up the tree right before somebody is trying to look at them.

With this, though, we are tracking time of an aggregate state composed
of child tasks, and that state might not be identical for you and all
your ancestors, so every time a task state changes we have to evaluate
and start/stop clocks on every level, because we cannot derive our
state from the state history of our child groups.

For example, say you have the following tree:

  root
 /
A
  /   \
 A1   A2
  running=1   running=1

I.e. there is a running task in A1 and one in A2.

root, A, A1, and A2 are all PSI_NONE as nothing is stalled.

Now the task in A2 enters a memstall.

  root
 /
A
  /   \
 A1   A2
  running=1   memstall=1

From the perspective of A2, the group is now fully blocked and starts
recording time in PSI_FULL.

From the perspective of A, it has a working group below it and a
stalled one, which would make it PSI_SOME, so it starts recording time
in PSI_SOME.

The root/system level likewise has to start the timer on PSI_SOME.

Now the task in A1 enters a memstall, and we have to propagate the
PSI_FULL state up A1 -> A -> root.

I'm not quite sure how we could make this lazy. Say we hadn't
propagated the state from A1 and A2 right away, and somebody is asking
about the averages for A. We could tell that A1 and A2 had been in
PSI_FULL recently, but we wouldn't know whether their time in these
states fully overlapped (all PSI_FULL), overlapped partially (some
PSI_FULL and some PSI_SOME), or didn't overlap at all (PSI_SOME).
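
To make the per-level evaluation concrete, here is a minimal sketch of
how a group's pressure state could be derived from its task counts; the
function and the exact logic are illustrative, not the patch's actual
code.

/* Illustrative sketch only -- not the implementation in this series. */
enum psi_state {
        PSI_NONE,       /* no stalled tasks */
        PSI_SOME,       /* stalled tasks & working tasks */
        PSI_FULL,       /* stalled tasks & no working tasks */
};

static enum psi_state group_memstall_state(unsigned int nr_running,
                                           unsigned int nr_memstall)
{
        if (!nr_memstall)
                return PSI_NONE;        /* root, A, A1, A2 before any stall */
        if (nr_running)
                return PSI_SOME;        /* A while only A2's task is stalled */
        return PSI_FULL;                /* A once both A1 and A2 are stalled */
}

Because the result depends on the combination of counts at each level,
every ancestor has to re-evaluate it on every task state change, which
is what the eager walk in the quoted hunk does.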


Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-10 Thread Johannes Weiner
On Wed, May 09, 2018 at 12:21:00PM +0200, Peter Zijlstra wrote:
> On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> > +   local_irq_disable();
> > +   rq = this_rq();
> > +   raw_spin_lock(&rq->lock);
> > +   rq_pin_lock(rq, &rf);
> 
> Given that churn in sched.h, you seen rq_lock() and friends.
> 
> Either write this like:
> 
>   local_irq_disable();
>   rq = this_rq();
>   rq_lock(rq, &rf);
> 
> Or introduce "rq = this_rq_lock_irq()", which we could also use in
> do_sched_yield().

Sounds good, I'll add that.

> > +   update_rq_clock(rq);
> > +
> > +   current->flags |= PF_MEMSTALL;
> > +   psi_task_change(current, rq_clock(rq), 0, TSK_MEMSTALL);
> > +
> > +   rq_unpin_lock(rq, &rf);
> > +   raw_spin_unlock(&rq->lock);
> > +   local_irq_enable();
> 
> That's called rq_unlock_irq().

I'll use that. This code was first written against a kernel that
didn't have 8a8c69c32778 ("sched/core: Add rq->lock wrappers.") yet ;)
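
For reference, a minimal sketch of what the suggested helper could look
like, assuming the rq_lock() wrappers from 8a8c69c32778; this is an
illustration of the idea, not code from this series.

/* Sketch: lock this CPU's runqueue with interrupts disabled. */
static inline struct rq *this_rq_lock_irq(struct rq_flags *rf)
        __acquires(rq->lock)
{
        struct rq *rq;

        local_irq_disable();
        rq = this_rq();
        rq_lock(rq, rf);

        return rq;
}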


Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-10 Thread Johannes Weiner
On Wed, May 09, 2018 at 12:14:54PM +0200, Peter Zijlstra wrote:
> On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 15750c222ca2..1658477466d5 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
[...]
> What's all this churn about?

The psi callbacks in kernel/sched/stat.h use these rq lock functions
from this file, but sched.h includes stat.h before those definitions.

I'll move this into a separate patch with a proper explanation.


Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-10 Thread Johannes Weiner
On Wed, May 09, 2018 at 12:05:51PM +0200, Peter Zijlstra wrote:
> On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> > +   u64 some[NR_PSI_RESOURCES] = { 0, };
> > +   u64 full[NR_PSI_RESOURCES] = { 0, };
> 
> > +   some[r] /= max(nonidle_total, 1UL);
> > +   full[r] /= max(nonidle_total, 1UL);
> 
> That's a bare 64bit divide.. that typically failed to build on 32bit
> archs.

Ah yes, I'll switch that to do_div(). Thanks
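
For the record, a sketch of that conversion, assuming do_div() from
asm/div64.h, which divides a u64 in place by a 32-bit divisor and so
builds on 32-bit architectures; variable names follow the quoted hunk,
and this is an illustration rather than the actual v2 change.

        /*
         * Avoid the bare u64 division: do_div(n, base) divides the
         * u64 n in place by the 32-bit base.  nonidle_total is a
         * jiffies count over a 2s period, so it fits in 32 bits.
         */
        unsigned long divisor = max(nonidle_total, 1UL);

        for (r = 0; r < NR_PSI_RESOURCES; r++) {
                do_div(some[r], divisor);
                do_div(full[r], divisor);
        }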


Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-10 Thread Johannes Weiner
On Wed, May 09, 2018 at 12:04:55PM +0200, Peter Zijlstra wrote:
> On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> > +static void psi_clock(struct work_struct *work)
> > +{
> > +   u64 some[NR_PSI_RESOURCES] = { 0, };
> > +   u64 full[NR_PSI_RESOURCES] = { 0, };
> > +   unsigned long nonidle_total = 0;
> > +   unsigned long missed_periods;
> > +   struct delayed_work *dwork;
> > +   struct psi_group *group;
> > +   unsigned long expires;
> > +   int cpu;
> > +   int r;
> > +
> > +   dwork = to_delayed_work(work);
> > +   group = container_of(dwork, struct psi_group, clock_work);
> > +
> > +   /*
> > +* Calculate the sampling period. The clock might have been
> > +* stopped for a while.
> > +*/
> > +   expires = group->period_expires;
> > +   missed_periods = (jiffies - expires) / MY_LOAD_FREQ;
> > +   group->period_expires = expires + ((1 + missed_periods) * MY_LOAD_FREQ);
> > +
> > +   /*
> > +* Aggregate the per-cpu state into a global state. Each CPU
> > +* is weighted by its non-idle time in the sampling period.
> > +*/
> > +   for_each_online_cpu(cpu) {
> 
> Typically when using online CPU state, you also need hotplug notifiers
> to deal with changes in the online set.
> 
> You also typically need something like cpus_read_lock() around an
> iteration of online CPUs, to avoid the set changing while you're poking
> at them.
> 
> The lack for neither is evident or explained.

The per-cpu state we access is allocated for each possible CPU, so
that is safe (and state being all 0 is semantically sound, too). In a
race with onlining, we might miss some per-cpu samples, but would
catch them the next time. In a race with offlining, we may never
consider the final up to 2s state history of the disappearing CPU; we
could have an offlining callback to flush the state, but I'm not sure
this would be an actual problem in the real world since the error is
small (smallest averaging window is 5 sampling periods) and then would
age out quickly.

I can certainly add a comment explaining this at least.

> > +   struct psi_group_cpu *groupc = per_cpu_ptr(group->cpus, cpu);
> > +   unsigned long nonidle;
> > +
> > +   nonidle = nsecs_to_jiffies(groupc->nonidle_time);
> > +   groupc->nonidle_time = 0;
> > +   nonidle_total += nonidle;
> > +
> > +   for (r = 0; r < NR_PSI_RESOURCES; r++) {
> > +   struct psi_resource *res = &groupc->res[r];
> > +
> > +   some[r] += (res->times[0] + res->times[1]) * nonidle;
> > +   full[r] += res->times[1] * nonidle;
> > +
> > +   /* It's racy, but we can tolerate some error */
> > +   res->times[0] = 0;
> > +   res->times[1] = 0;
> > +   }
> > +   }


Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-10 Thread Johannes Weiner
On Wed, May 09, 2018 at 11:59:38AM +0200, Peter Zijlstra wrote:
> On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> > diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
> > new file mode 100644
> > index ..b22b0ffc729d
> > --- /dev/null
> > +++ b/include/linux/psi_types.h
> > @@ -0,0 +1,84 @@
> > +#ifndef _LINUX_PSI_TYPES_H
> > +#define _LINUX_PSI_TYPES_H
> > +
> > +#include 
> > +
> > +#ifdef CONFIG_PSI
> > +
> > +/* Tracked task states */
> > +enum psi_task_count {
> > +   NR_RUNNING,
> > +   NR_IOWAIT,
> > +   NR_MEMSTALL,
> > +   NR_PSI_TASK_COUNTS,
> > +};
> > +
> > +/* Task state bitmasks */
> > +#define TSK_RUNNING    (1 << NR_RUNNING)
> > +#define TSK_IOWAIT (1 << NR_IOWAIT)
> > +#define TSK_MEMSTALL   (1 << NR_MEMSTALL)
> > +
> > +/* Resources that workloads could be stalled on */
> > +enum psi_res {
> > +   PSI_CPU,
> > +   PSI_MEM,
> > +   PSI_IO,
> > +   NR_PSI_RESOURCES,
> > +};
> > +
> > +/* Pressure states for a group of tasks */
> > +enum psi_state {
> > +   PSI_NONE,   /* No stalled tasks */
> > +   PSI_SOME,   /* Stalled tasks & working tasks */
> > +   PSI_FULL,   /* Stalled tasks & no working tasks */
> > +   NR_PSI_STATES,
> > +};
> > +
> > +struct psi_resource {
> > +   /* Current pressure state for this resource */
> > +   enum psi_state state;
> > +
> > +   /* Start of current state (cpu_clock) */
> > +   u64 state_start;
> > +
> > +   /* Time sampling buckets for pressure states (ns) */
> > +   u64 times[NR_PSI_STATES - 1];
> 
> Fails to explain why no FULL.

It's NONE that's excluded. I'll add a comment.

> > +struct psi_group_cpu {
> > +   /* States of the tasks belonging to this group */
> > +   unsigned int tasks[NR_PSI_TASK_COUNTS];
> > +
> 
> AFAICT there's a hole here, that would fit the @nonidle member. Which
> also avoids the later hole generated by it.

Good spot, I'll reshuffle this accordingly.

> > +   /* Per-resource pressure tracking in this group */
> > +   struct psi_resource res[NR_PSI_RESOURCES];
> > +
> > +   /* There are runnable or D-state tasks */
> > +   bool nonidle;
> 
> Mandatory complaint about using _Bool in composites goes here.

int it is.

Thanks
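
Putting those two replies together, the reshuffled layout could
plausibly end up like this; a sketch of the agreed direction, not the
actual v2 structure.

struct psi_group_cpu {
        /* States of the tasks belonging to this group */
        unsigned int tasks[NR_PSI_TASK_COUNTS];

        /*
         * There are runnable or D-state tasks.  An int rather than a
         * bool, placed in the 4-byte hole after tasks[] so the later
         * padding before res[] (which needs u64 alignment) goes away.
         */
        int nonidle;

        /* Per-resource pressure tracking in this group */
        struct psi_resource res[NR_PSI_RESOURCES];
};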


Re: [PATCH 5/7] sched: loadavg: make calc_load_n() public

2018-05-10 Thread Johannes Weiner
On Wed, May 09, 2018 at 11:49:06AM +0200, Peter Zijlstra wrote:
> On Mon, May 07, 2018 at 05:01:33PM -0400, Johannes Weiner wrote:
> > +static inline unsigned long
> > +fixed_power_int(unsigned long x, unsigned int frac_bits, unsigned int n)
> > +{
> > +   unsigned long result = 1UL << frac_bits;
> > +
> > +   if (n) {
> > +   for (;;) {
> > +   if (n & 1) {
> > +   result *= x;
> > +   result += 1UL << (frac_bits - 1);
> > +   result >>= frac_bits;
> > +   }
> > +   n >>= 1;
> > +   if (!n)
> > +   break;
> > +   x *= x;
> > +   x += 1UL << (frac_bits - 1);
> > +   x >>= frac_bits;
> > +   }
> > +   }
> > +
> > +   return result;
> > +}
> 
> No real objection; but that does look a wee bit fat for an inline I
> suppose.

Fair enough, I'll put these back where I found them and make
calc_load_n() extern instead.


Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-10 Thread Johannes Weiner
On Wed, May 09, 2018 at 01:38:49PM +0200, Peter Zijlstra wrote:
> On Wed, May 09, 2018 at 12:46:18PM +0200, Peter Zijlstra wrote:
> > On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> > 
> > > @@ -2038,6 +2038,7 @@ try_to_wake_up(struct task_struct *p, unsigned int 
> > > state, int wake_flags)
> > >   cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
> > >   if (task_cpu(p) != cpu) {
> > >   wake_flags |= WF_MIGRATED;
> > > + psi_ttwu_dequeue(p);
> > >   set_task_cpu(p, cpu);
> > >   }
> > >  
> > 
> > > +static inline void psi_ttwu_dequeue(struct task_struct *p)
> > > +{
> > > + /*
> > > +  * Is the task being migrated during a wakeup? Make sure to
> > > +  * deregister its sleep-persistent psi states from the old
> > > +  * queue, and let psi_enqueue() know it has to requeue.
> > > +  */
> > > + if (unlikely(p->in_iowait || (p->flags & PF_MEMSTALL))) {
> > > + struct rq_flags rf;
> > > + struct rq *rq;
> > > + int clear = 0;
> > > +
> > > + if (p->in_iowait)
> > > + clear |= TSK_IOWAIT;
> > > + if (p->flags & PF_MEMSTALL)
> > > + clear |= TSK_MEMSTALL;
> > > +
> > > + rq = __task_rq_lock(p, &rf);
> > > + update_rq_clock(rq);
> > > + psi_task_change(p, rq_clock(rq), clear, 0);
> > > + p->sched_psi_wake_requeue = 1;
> > > + __task_rq_unlock(rq, &rf);
> > > + }
> > > +}
> > 
> > Yeah, no... not happening.
> > 
> > We spend a lot of time to never touch the old rq->lock on wakeups. Mason
> > was the one pushing for that, so he should very well know this.
> > 
> > The one cross-cpu atomic (iowait) is already a problem (the whole iowait
> > accounting being useless makes it even worse), adding significant remote
> > prodding is just really bad.
> 
> Also, since all you need is the global number, I don't think you
> actually need any of this. See what we do for nr_uninterruptible.
> 
> In general I think you want to (re)read loadavg.c some more, and maybe
> reuse a bit more of that.

So there is a reason I'm tracking productivity states per-cpu and not
globally. Consider the following example periods on two CPUs:

CPU 0
Task 1: | EXECUTING  | memstalled |
Task 2: | runqueued  | EXECUTING  |

CPU 1
Task 3: | memstalled | EXECUTING  |

If we tracked only the global number of stalled tasks, similarly to
nr_uninterruptible, the number would be elevated throughout the whole
sampling period, giving a pressure value of 100% for "some stalled".
And, since there is always something executing, a "full stall" of 0%.

Now consider what happens when the Task 3 sequence is the other way
around:

CPU 0
Task 1: | EXECUTING  | memstalled |
Task 2: | runqueued  | EXECUTING  |

CPU 1
Task 3: | EXECUTING  | memstalled |

Here the number of stalled tasks is elevated only during half of the
sampling period, this time giving a pressure reading of 50% for "some"
(and again 0% for "full").

That's a different measurement, but in terms of workload progress, the
sequences are functionally equivalent. In both scenarios the same
amount of productive CPU cycles is spent advancing tasks 1, 2 and 3,
and the same amount of potentially productive CPU time is lost due to
the contention of memory. We really ought to read the same pressure.

So what I'm doing is calculating the productivity loss on each CPU in
a sampling period as if they were independent time slices. It doesn't
matter how you slice and dice the sequences within each one - if used
CPU time and lost CPU time have the same proportion, we have the same
pressure.

In both scenarios above, this method will give a pressure reading of
some=50% and full=25% of "normalized walltime", which is the time loss
the work would experience on a single CPU executing it serially.

To illustrate:

CPU X
              1            2            3            4
Task 1: | EXECUTING  | memstalled | sleeping   | sleeping   |
Task 2: | runqueued  | EXECUTING  | sleeping   | sleeping   |
Task 3: | sleeping   | sleeping   | EXECUTING  | memstalled |

You can clearly see the 50% of walltime in which *somebody* isn't
advancing (2 and 4), and the 25% of walltime in which *no* tasks are
(3). Same amount of work, same memory stalls, same pressure numbers.

Globalized state tracking would produce those numbers on the single
CPU (obviously), but once concurrency gets into the mix, it's
questionable what its results mean. It certainly isn't able to
reliably detect equivalent slowdowns of individual tasks ("some" is
all over the place), and in this example wasn't able to capture the
impact of contention on overall work completion ("full" is 0%).

* CPU 0: some = 50%, full =  0%
  CPU 1: some = 50%, full = 50%
avg: some = 50%, full = 25%
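
Spelled out as a standalone calculation (a sketch, not kernel code):
each CPU's some/full share is weighted by its non-idle time and then
normalized, and with both CPUs fully non-idle the weights are equal, so
it reduces to the plain average quoted above.

#include <stdio.h>

int main(void)
{
        /* Per-CPU results for the example period above. */
        double some[2]    = { 50.0, 50.0 };     /* % of period */
        double full[2]    = {  0.0, 50.0 };     /* % of period */
        double nonidle[2] = {  1.0,  1.0 };     /* non-idle weight */
        double some_sum = 0, full_sum = 0, weight = 0;
        int cpu;

        for (cpu = 0; cpu < 2; cpu++) {
                some_sum += some[cpu] * nonidle[cpu];
                full_sum += full[cpu] * nonidle[cpu];
                weight   += nonidle[cpu];
        }
        /* Prints: some = 50%, full = 25% */
        printf("some = %.0f%%, full = %.0f%%\n",
               some_sum / weight, full_sum / weight);
        return 0;
}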


Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-08 Thread Johannes Weiner
On Mon, May 07, 2018 at 05:42:36PM -0700, Randy Dunlap wrote:
> On 05/07/2018 02:01 PM, Johannes Weiner wrote:
> > + * The ratio is tracked in decaying time averages over 10s, 1m, 5m
> > + * windows. Cumluative stall times are tracked and exported as well to
> 
>Cumulative
> 

> > +/**
> > + * psi_memstall_leave - mark the end of an memory stall section
> 
> end of a memory

Thanks Randy, I'll get those fixed.


Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-08 Thread Johannes Weiner
On Tue, May 08, 2018 at 11:04:09AM +0800, kbuild test robot wrote:
>    118  #else /* CONFIG_PSI */
>    119  static inline void psi_enqueue(struct task_struct *p, u64 now)
>    120  {
>    121  }
>    122  static inline void psi_dequeue(struct task_struct *p, u64 now)
>    123  {
>    124  }
>    125  static inline void psi_ttwu_dequeue(struct task_struct *p) {}
>  > 126  {
>    127  }

Stupid last-minute cleanup reshuffling. v2 will have:

diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index cb4a68bcf37a..ff6256b3d216 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -122,7 +122,7 @@ static inline void psi_enqueue(struct task_struct *p, u64 
now)
 static inline void psi_dequeue(struct task_struct *p, u64 now)
 {
 }
-static inline void psi_ttwu_dequeue(struct task_struct *p) {}
+static inline void psi_ttwu_dequeue(struct task_struct *p)
 {
 }
 #endif /* CONFIG_PSI */


[PATCH 1/7] mm: workingset: don't drop refault information prematurely

2018-05-07 Thread Johannes Weiner
From: Johannes Weiner <jwei...@fb.com>

If we just keep enough refault information to match the CURRENT page
cache during reclaim time, we could lose a lot of events when there is
only a temporary spike in non-cache memory consumption that pushes out
all the cache. Once cache comes back, we won't see those refaults.
They might not be actionable for LRU aging, but we want to know about
them for measuring memory pressure.

Signed-off-by: Johannes Weiner <jwei...@fb.com>
---
 mm/workingset.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/mm/workingset.c b/mm/workingset.c
index 40ee02c83978..53759a3cf99a 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -364,7 +364,7 @@ static unsigned long count_shadow_nodes(struct shrinker 
*shrinker,
 {
unsigned long max_nodes;
unsigned long nodes;
-   unsigned long cache;
+   unsigned long pages;
 
/* list_lru lock nests inside the IRQ-safe i_pages lock */
local_irq_disable();
@@ -393,14 +393,14 @@ static unsigned long count_shadow_nodes(struct shrinker 
*shrinker,
 *
 * PAGE_SIZE / radix_tree_nodes / node_entries * 8 / PAGE_SIZE
 */
-   if (sc->memcg) {
-   cache = mem_cgroup_node_nr_lru_pages(sc->memcg, sc->nid,
-LRU_ALL_FILE);
-   } else {
-   cache = node_page_state(NODE_DATA(sc->nid), NR_ACTIVE_FILE) +
-   node_page_state(NODE_DATA(sc->nid), NR_INACTIVE_FILE);
-   }
-   max_nodes = cache >> (RADIX_TREE_MAP_SHIFT - 3);
+#ifdef CONFIG_MEMCG
+   if (sc->memcg)
+   pages = page_counter_read(&sc->memcg->memory);
+   else
+#endif
+   pages = node_present_pages(sc->nid);
+
+   max_nodes = pages >> (RADIX_TREE_MAP_SHIFT - 3);
 
if (nodes <= max_nodes)
return 0;
-- 
2.17.0



[PATCH 2/7] mm: workingset: tell cache transitions from workingset thrashing

2018-05-07 Thread Johannes Weiner
Refaults happen during transitions between workingsets as well as
in-place thrashing. Knowing the difference between the two has a range
of applications, including measuring the impact of memory shortage on
the system performance, as well as the ability to smarter balance
pressure between the filesystem cache and the swap-backed workingset.

During workingset transitions, inactive cache refaults and pushes out
established active cache. When that active cache isn't stale, however,
and also ends up refaulting, that's bonafide thrashing.

Introduce a new page flag that tells on eviction whether the page has
been active or not in its lifetime. This bit is then stored in the
shadow entry, to classify refaults as transitioning or thrashing.

How many page->flags does this leave us with on 32-bit?

20 bits are always page flags

21 if you have an MMU

23 with the zone bits for DMA, Normal, HighMem, Movable

29 with the sparsemem section bits

30 if PAE is enabled

31 with this patch.

So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA
nodes. If that's not enough, the system can switch to discontigmem and
re-gain the 6 or 7 sparsemem section bits.

Signed-off-by: Johannes Weiner <han...@cmpxchg.org>
---
 include/linux/mmzone.h |  1 +
 include/linux/page-flags.h |  5 +-
 include/linux/swap.h   |  2 +-
 include/trace/events/mmflags.h |  1 +
 mm/filemap.c   |  9 ++--
 mm/huge_memory.c   |  1 +
 mm/memcontrol.c|  2 +
 mm/migrate.c   |  2 +
 mm/swap_state.c|  1 +
 mm/vmscan.c|  1 +
 mm/vmstat.c|  1 +
 mm/workingset.c| 95 ++
 12 files changed, 79 insertions(+), 42 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 32699b2dc52a..6af87946d241 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -163,6 +163,7 @@ enum node_stat_item {
NR_ISOLATED_FILE,   /* Temporary isolated pages from file lru */
WORKINGSET_REFAULT,
WORKINGSET_ACTIVATE,
+   WORKINGSET_RESTORE,
WORKINGSET_NODERECLAIM,
NR_ANON_MAPPED, /* Mapped anonymous pages */
NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e34a27727b9a..7af1c3c15d8e 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -69,13 +69,14 @@
  */
 enum pageflags {
PG_locked,  /* Page is locked. Don't touch. */
-   PG_error,
PG_referenced,
PG_uptodate,
PG_dirty,
PG_lru,
PG_active,
+   PG_workingset,
PG_waiters, /* Page has waiters, check its waitqueue. Must 
be bit #7 and in the same byte as "PG_locked" */
+   PG_error,
PG_slab,
PG_owner_priv_1,/* Owner use. If pagecache, fs may use*/
PG_arch_1,
@@ -280,6 +281,8 @@ PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, 
PF_HEAD)
 PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD)
 PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
TESTCLEARFLAG(Active, active, PF_HEAD)
+PAGEFLAG(Workingset, workingset, PF_HEAD)
+   TESTCLEARFLAG(Workingset, workingset, PF_HEAD)
 __PAGEFLAG(Slab, slab, PF_NO_TAIL)
 __PAGEFLAG(SlobFree, slob_free, PF_NO_TAIL)
 PAGEFLAG(Checked, checked, PF_NO_COMPOUND)/* Used by some filesystems 
*/
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2417d288e016..d8c47dcdec6f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -296,7 +296,7 @@ struct vma_swap_readahead {
 
 /* linux/mm/workingset.c */
 void *workingset_eviction(struct address_space *mapping, struct page *page);
-bool workingset_refault(void *shadow);
+void workingset_refault(struct page *page, void *shadow);
 void workingset_activation(struct page *page);
 
 /* Do not use directly, use workingset_lookup_update */
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index a81cffb76d89..a1675d43777e 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -88,6 +88,7 @@
{1UL << PG_dirty,   "dirty" },  \
{1UL << PG_lru, "lru"   },  \
{1UL << PG_active,  "active"},  \
+   {1UL << PG_workingset,  "workingset"},  \
{1UL << PG_slab,"slab"  },  \
{1UL << PG_owner_priv_1,"owner_priv_1"  },  \
{1UL << PG_arch_1,  "arch_1"},  \
diff --git a/mm/filemap.c b/mm/filemap.c
index 0604cb02e6f3.

[PATCH 7/7] psi: cgroup support

2018-05-07 Thread Johannes Weiner
On a system that executes multiple cgrouped jobs and independent
workloads, we don't just care about the health of the overall system,
but also that of individual jobs, so that we can ensure individual job
health, fairness between jobs, or prioritize some jobs over others.

This patch implements pressure stall tracking for cgroups. In kernels
with CONFIG_PSI=y, cgroups will have cpu.pressure, memory.pressure,
and io.pressure files that track aggregate pressure stall times for
only the tasks inside the cgroup.

Signed-off-by: Johannes Weiner <han...@cmpxchg.org>
---
 Documentation/cgroup-v2.txt | 18 +
 include/linux/cgroup-defs.h |  4 ++
 include/linux/cgroup.h  | 15 +++
 include/linux/psi.h | 25 
 init/Kconfig|  4 ++
 kernel/cgroup/cgroup.c  | 45 -
 kernel/sched/psi.c  | 79 -
 7 files changed, 186 insertions(+), 4 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 74cdeaed9f7a..a22879dba019 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -963,6 +963,12 @@ All time durations are in microseconds.
$PERIOD duration.  "max" for $MAX indicates no limit.  If only
one number is written, $MAX is updated.
 
+  cpu.pressure
+   A read-only nested-key file which exists on non-root cgroups.
+
+   Shows pressure stall information for CPU. See
+   Documentation/accounting/psi.txt for details.
+
 
 Memory
------
@@ -1199,6 +1205,12 @@ PAGE_SIZE multiple when read back.
Swap usage hard limit.  If a cgroup's swap usage reaches this
limit, anonymous memory of the cgroup will not be swapped out.
 
+  memory.pressure
+   A read-only nested-key file which exists on non-root cgroups.
+
+   Shows pressure stall information for memory. See
+   Documentation/accounting/psi.txt for details.
+
 
 Usage Guidelines
 
@@ -1334,6 +1346,12 @@ IO Interface Files
 
  8:16 rbps=2097152 wbps=max riops=max wiops=max
 
+  io.pressure
+   A read-only nested-key file which exists on non-root cgroups.
+
+   Shows pressure stall information for IO. See
+   Documentation/accounting/psi.txt for details.
+
 
 Writeback
~~~~~~~~~
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index dc5b70449dc6..280f18da956a 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_CGROUPS
 
@@ -424,6 +425,9 @@ struct cgroup {
/* used to schedule release agent */
struct work_struct release_agent_work;
 
+   /* used to track pressure stalls */
+   struct psi_group psi;
+
/* used to store eBPF programs */
struct cgroup_bpf bpf;
 
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 473e0c0abb86..fd94c294c207 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -627,6 +627,11 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
pr_cont_kernfs_path(cgrp->kn);
 }
 
+static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
+{
+   return &cgrp->psi;
+}
+
 static inline void cgroup_init_kthreadd(void)
 {
/*
@@ -680,6 +685,16 @@ static inline union kernfs_node_id 
*cgroup_get_kernfs_id(struct cgroup *cgrp)
return NULL;
 }
 
+static inline struct cgroup *cgroup_parent(struct cgroup *cgrp)
+{
+   return NULL;
+}
+
+static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
+{
+   return NULL;
+}
+
 static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
   struct cgroup *ancestor)
 {
diff --git a/include/linux/psi.h b/include/linux/psi.h
index 371af1479699..05c3dae3e9c5 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -4,6 +4,9 @@
 #include 
 #include 
 
+struct seq_file;
+struct css_set;
+
 #ifdef CONFIG_PSI
 
 extern bool psi_disabled;
@@ -15,6 +18,14 @@ void psi_task_change(struct task_struct *task, u64 now, int 
clear, int set);
 void psi_memstall_enter(unsigned long *flags);
 void psi_memstall_leave(unsigned long *flags);
 
+int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res);
+
+#ifdef CONFIG_CGROUPS
+int psi_cgroup_alloc(struct cgroup *cgrp);
+void psi_cgroup_free(struct cgroup *cgrp);
+void cgroup_move_task(struct task_struct *p, struct css_set *to);
+#endif
+
 #else /* CONFIG_PSI */
 
 static inline void psi_init(void) {}
@@ -22,6 +33,20 @@ static inline void psi_init(void) {}
 static inline void psi_memstall_enter(unsigned long *flags) {}
 static inline void psi_memstall_leave(unsigned long *flags) {}
 
+#ifdef CONFIG_CGROUPS
+static inline int psi_cgroup_alloc(struct cgroup *cgrp)
+{
+   return 0;
+}
+static inline void psi_cgroup_free(struct cgroup *cgrp)
+{
+}
+static inline void cgroup_move_task(struct task_str

[PATCH 3/7] delayacct: track delays from thrashing cache pages

2018-05-07 Thread Johannes Weiner
Delay accounting already measures the time a task spends in direct
reclaim and waiting for swapin, but in low memory situations tasks
can spend a significant amount of their time waiting on
thrashing page cache. This isn't tracked right now.

To know the full impact of memory contention on an individual task,
measure the delay when waiting for a recently evicted active cache
page to read back into memory.

Also update tools/accounting/getdelays.c:

 [hannes@computer accounting]$ sudo ./getdelays -d -p 1
 print delayacct stats ON
 PID 1

 CPU         count     real total  virtual total    delay total  delay average
             50318          74500      847346785      400533713         0.008ms
 IO          count    delay total  delay average
               435      122601218            0ms
 SWAP        count    delay total  delay average
                 0              0            0ms
 RECLAIM     count    delay total  delay average
                 0              0            0ms
 THRASHING   count    delay total  delay average
                19       12621439            0ms

Signed-off-by: Johannes Weiner <han...@cmpxchg.org>
---
 include/linux/delayacct.h  | 23 +++
 include/uapi/linux/taskstats.h |  6 +-
 kernel/delayacct.c | 15 +++
 mm/filemap.c   | 11 +++
 tools/accounting/getdelays.c   |  8 +++-
 5 files changed, 61 insertions(+), 2 deletions(-)

diff --git a/include/linux/delayacct.h b/include/linux/delayacct.h
index 5e335b6203f4..d3e75b3ba487 100644
--- a/include/linux/delayacct.h
+++ b/include/linux/delayacct.h
@@ -57,7 +57,12 @@ struct task_delay_info {
 
u64 freepages_start;
u64 freepages_delay;/* wait for memory reclaim */
+
+   u64 thrashing_start;
+   u64 thrashing_delay;/* wait for thrashing page */
+
u32 freepages_count;/* total count of memory reclaim */
+   u32 thrashing_count;/* total count of thrash waits */
 };
 #endif
 
@@ -76,6 +81,8 @@ extern int __delayacct_add_tsk(struct taskstats *, struct 
task_struct *);
 extern __u64 __delayacct_blkio_ticks(struct task_struct *);
 extern void __delayacct_freepages_start(void);
 extern void __delayacct_freepages_end(void);
+extern void __delayacct_thrashing_start(void);
+extern void __delayacct_thrashing_end(void);
 
 static inline int delayacct_is_task_waiting_on_io(struct task_struct *p)
 {
@@ -156,6 +163,18 @@ static inline void delayacct_freepages_end(void)
__delayacct_freepages_end();
 }
 
+static inline void delayacct_thrashing_start(void)
+{
+   if (current->delays)
+   __delayacct_thrashing_start();
+}
+
+static inline void delayacct_thrashing_end(void)
+{
+   if (current->delays)
+   __delayacct_thrashing_end();
+}
+
 #else
 static inline void delayacct_set_flag(int flag)
 {}
@@ -182,6 +201,10 @@ static inline void delayacct_freepages_start(void)
 {}
 static inline void delayacct_freepages_end(void)
 {}
+static inline void delayacct_thrashing_start(void)
+{}
+static inline void delayacct_thrashing_end(void)
+{}
 
 #endif /* CONFIG_TASK_DELAY_ACCT */
 
diff --git a/include/uapi/linux/taskstats.h b/include/uapi/linux/taskstats.h
index b7aa7bb2349f..5e8ca16a9079 100644
--- a/include/uapi/linux/taskstats.h
+++ b/include/uapi/linux/taskstats.h
@@ -34,7 +34,7 @@
  */
 
 
-#define TASKSTATS_VERSION  8
+#define TASKSTATS_VERSION  9
 #define TS_COMM_LEN    32  /* should be >= TASK_COMM_LEN
 * in linux/sched.h */
 
@@ -164,6 +164,10 @@ struct taskstats {
/* Delay waiting for memory reclaim */
__u64   freepages_count;
__u64   freepages_delay_total;
+
+   /* Delay waiting for thrashing page */
+   __u64   thrashing_count;
+   __u64   thrashing_delay_total;
 };
 
 
diff --git a/kernel/delayacct.c b/kernel/delayacct.c
index e2764d767f18..02ba745c448d 100644
--- a/kernel/delayacct.c
+++ b/kernel/delayacct.c
@@ -134,9 +134,12 @@ int __delayacct_add_tsk(struct taskstats *d, struct 
task_struct *tsk)
d->swapin_delay_total = (tmp < d->swapin_delay_total) ? 0 : tmp;
tmp = d->freepages_delay_total + tsk->delays->freepages_delay;
d->freepages_delay_total = (tmp < d->freepages_delay_total) ? 0 : tmp;
+   tmp = d->thrashing_delay_total + tsk->delays->thrashing_delay;
+   d->thrashing_delay_total = (tmp < d->thrashing_delay_total) ? 0 : tmp;
d->blkio_count += tsk->delays->blkio_count;
d->swapin_count += tsk->delays->swapin_count;
d->freepages_count += tsk->delays->freepages_count;
+   d->thrashing_count += tsk->delays->thrashing_count;
 spin_unlock_irqrestore(&tsk->delays->lock, flags);
 
return 0;
@@ -168,3

[PATCH 5/7] sched: loadavg: make calc_load_n() public

2018-05-07 Thread Johannes Weiner
It's going to be used in the following patch. Keep the churn separate.

Signed-off-by: Johannes Weiner <han...@cmpxchg.org>
---
 include/linux/sched/loadavg.h | 69 +++
 kernel/sched/loadavg.c| 69 ---
 2 files changed, 69 insertions(+), 69 deletions(-)

diff --git a/include/linux/sched/loadavg.h b/include/linux/sched/loadavg.h
index cc9cc62bb1f8..0e4c24978751 100644
--- a/include/linux/sched/loadavg.h
+++ b/include/linux/sched/loadavg.h
@@ -37,6 +37,75 @@ calc_load(unsigned long load, unsigned long exp, unsigned 
long active)
return newload / FIXED_1;
 }
 
+/**
+ * fixed_power_int - compute: x^n, in O(log n) time
+ *
+ * @x: base of the power
+ * @frac_bits: fractional bits of @x
+ * @n: power to raise @x to.
+ *
+ * By exploiting the relation between the definition of the natural power
+ * function: x^n := x*x*...*x (x multiplied by itself for n times), and
+ * the binary encoding of numbers used by computers: n := \Sum n_i * 2^i,
+ * (where: n_i \elem {0, 1}, the binary vector representing n),
+ * we find: x^n := x^(\Sum n_i * 2^i) := \Prod x^(n_i * 2^i), which is
+ * of course trivially computable in O(log_2 n), the length of our binary
+ * vector.
+ */
+static inline unsigned long
+fixed_power_int(unsigned long x, unsigned int frac_bits, unsigned int n)
+{
+   unsigned long result = 1UL << frac_bits;
+
+   if (n) {
+   for (;;) {
+   if (n & 1) {
+   result *= x;
+   result += 1UL << (frac_bits - 1);
+   result >>= frac_bits;
+   }
+   n >>= 1;
+   if (!n)
+   break;
+   x *= x;
+   x += 1UL << (frac_bits - 1);
+   x >>= frac_bits;
+   }
+   }
+
+   return result;
+}
+
+/*
+ * a1 = a0 * e + a * (1 - e)
+ *
+ * a2 = a1 * e + a * (1 - e)
+ *    = (a0 * e + a * (1 - e)) * e + a * (1 - e)
+ *    = a0 * e^2 + a * (1 - e) * (1 + e)
+ *
+ * a3 = a2 * e + a * (1 - e)
+ *    = (a0 * e^2 + a * (1 - e) * (1 + e)) * e + a * (1 - e)
+ *    = a0 * e^3 + a * (1 - e) * (1 + e + e^2)
+ *
+ *  ...
+ *
+ * an = a0 * e^n + a * (1 - e) * (1 + e + ... + e^n-1) [1]
+ *    = a0 * e^n + a * (1 - e) * (1 - e^n)/(1 - e)
+ *    = a0 * e^n + a * (1 - e^n)
+ *
+ * [1] application of the geometric series:
+ *
+ *              n         1 - x^(n+1)
+ *     S_n := \Sum x^i = -------------
+ *             i=0           1 - x
+ */
+static inline unsigned long
+calc_load_n(unsigned long load, unsigned long exp,
+   unsigned long active, unsigned int n)
+{
+   return calc_load(load, fixed_power_int(exp, FSHIFT, n), active);
+}
+
 #define LOAD_INT(x) ((x) >> FSHIFT)
 #define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
 
diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
index 54fbdfb2d86c..0736e349a54e 100644
--- a/kernel/sched/loadavg.c
+++ b/kernel/sched/loadavg.c
@@ -210,75 +210,6 @@ static long calc_load_nohz_fold(void)
return delta;
 }
 
-/**
- * fixed_power_int - compute: x^n, in O(log n) time
- *
- * @x: base of the power
- * @frac_bits: fractional bits of @x
- * @n: power to raise @x to.
- *
- * By exploiting the relation between the definition of the natural power
- * function: x^n := x*x*...*x (x multiplied by itself for n times), and
- * the binary encoding of numbers used by computers: n := \Sum n_i * 2^i,
- * (where: n_i \elem {0, 1}, the binary vector representing n),
- * we find: x^n := x^(\Sum n_i * 2^i) := \Prod x^(n_i * 2^i), which is
- * of course trivially computable in O(log_2 n), the length of our binary
- * vector.
- */
-static unsigned long
-fixed_power_int(unsigned long x, unsigned int frac_bits, unsigned int n)
-{
-   unsigned long result = 1UL << frac_bits;
-
-   if (n) {
-   for (;;) {
-   if (n & 1) {
-   result *= x;
-   result += 1UL << (frac_bits - 1);
-   result >>= frac_bits;
-   }
-   n >>= 1;
-   if (!n)
-   break;
-   x *= x;
-   x += 1UL << (frac_bits - 1);
-   x >>= frac_bits;
-   }
-   }
-
-   return result;
-}
-
-/*
- * a1 = a0 * e + a * (1 - e)
- *
- * a2 = a1 * e + a * (1 - e)
- *    = (a0 * e + a * (1 - e)) * e + a * (1 - e)
- *    = a0 * e^2 + a * (1 - e) * (1 + e)
- *
- * a3 = a2 * e + a * (1 - e)
- *    = (a0 * e^2 + a * (1 - e) * (1 + e)) * e + a * (1 - e)
- *    = a0 * e^3 + a * (1 - e) * (1 + e + e^2)
- *
- *  ...
- *
- * an = a0 * e^n + a * (1 - e) * (1 + e + ... + e^n-1) [1]

[PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-07 Thread Johannes Weiner
When systems are overcommitted and resources become contended, it's
hard to tell exactly the impact this has on workload productivity, or
how close the system is to lockups and OOM kills. In particular, when
machines work multiple jobs concurrently, the impact of overcommit in
terms of latency and throughput on the individual job can be enormous.

In order to maximize hardware utilization without sacrificing
individual job health or risk complete machine lockups, this patch
implements a way to quantify resource pressure in the system.

A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or
IO, respectively. Stall states are aggregate versions of the per-task
delay accounting delays:

   cpu: some tasks are runnable but not executing on a CPU
   memory: tasks are reclaiming, or waiting for swapin or thrashing cache
   io: tasks are waiting for io completions

These percentages of walltime can be thought of as pressure
percentages, and they give a general sense of system health and
productivity loss incurred by resource overcommit. They can also
indicate when the system is approaching lockup scenarios and OOMs.

To do this, psi keeps track of the task states associated with each
CPU and samples the time they spend in stall states. Every 2 seconds,
the samples are averaged across CPUs - weighted by the CPUs' non-idle
time to eliminate artifacts from unused CPUs - and translated into
percentages of walltime. A running average of those percentages is
maintained over 10s, 1m, and 5m periods (similar to the loadaverage).

Signed-off-by: Johannes Weiner <han...@cmpxchg.org>
---
 Documentation/accounting/psi.txt |  73 ++
 include/linux/psi.h  |  27 ++
 include/linux/psi_types.h|  84 ++
 include/linux/sched.h|  10 +
 include/linux/sched/stat.h   |  10 +-
 init/Kconfig |  16 ++
 kernel/fork.c|   4 +
 kernel/sched/Makefile|   1 +
 kernel/sched/core.c  |   3 +
 kernel/sched/psi.c   | 424 +++
 kernel/sched/sched.h | 166 ++--
 kernel/sched/stats.h |  91 ++-
 mm/compaction.c  |   5 +
 mm/filemap.c |  15 +-
 mm/page_alloc.c  |  10 +
 mm/vmscan.c  |  13 +
 16 files changed, 859 insertions(+), 93 deletions(-)
 create mode 100644 Documentation/accounting/psi.txt
 create mode 100644 include/linux/psi.h
 create mode 100644 include/linux/psi_types.h
 create mode 100644 kernel/sched/psi.c

diff --git a/Documentation/accounting/psi.txt b/Documentation/accounting/psi.txt
new file mode 100644
index ..e051810d5127
--- /dev/null
+++ b/Documentation/accounting/psi.txt
@@ -0,0 +1,73 @@
+================================
+PSI - Pressure Stall Information
+================================
+
+:Date: April, 2018
+:Author: Johannes Weiner <han...@cmpxchg.org>
+
+When CPU, memory or IO devices are contended, workloads experience
+latency spikes, throughput losses, and run the risk of OOM kills.
+
+Without an accurate measure of such contention, users are forced to
+either play it safe and under-utilize their hardware resources, or
+roll the dice and frequently suffer the disruptions resulting from
+excessive overcommit.
+
+The psi feature identifies and quantifies the disruptions caused by
+such resource crunches and the time impact it has on complex workloads
+or even entire systems.
+
+Having an accurate measure of productivity losses caused by resource
+scarcity aids users in sizing workloads to hardware--or provisioning
+hardware according to workload demand.
+
+As psi aggregates this information in realtime, systems can be managed
+dynamically using techniques such as load shedding, migrating jobs to
+other systems or data centers, or strategically pausing or killing low
+priority or restartable batch jobs.
+
+This allows maximizing hardware utilization without sacrificing
+workload health or risking major disruptions such as OOM kills.
+
+Pressure interface
+==
+
+Pressure information for each resource is exported through the
+respective file in /proc/pressure/ -- cpu, memory, and io.
+
+The format for CPU is as such:
+
+some avg10=0.00 avg60=0.00 avg300=0.00 total=0
+
+and for memory and IO:
+
+some avg10=0.00 avg60=0.00 avg300=0.00 total=0
+full avg10=0.00 avg60=0.00 avg300=0.00 total=0
+
+The "some" line indicates the share of time in which at least some
+tasks are stalled on a given resource.
+
+The "full" line indicates the share of time in which all non-idle
+tasks are stalled on a given resource simultaneously. In this state
+actual CPU cycles are going to waste, and a workload that spends
+extended time in this state is considered to be thrashing. This has
+severe impact on performance, and it's useful to 

[PATCH 0/7] psi: pressure stall information for CPU, memory, and IO

2018-05-07 Thread Johannes Weiner
Hi,

I previously submitted a version of this patch set called "memdelay",
which translated delays from reclaim, swap-in, thrashing page cache
into a pressure percentage of lost walltime. I've since extended this
code to aggregate all delay states tracked by delayacct in order to
have generalized pressure/overcommit levels for CPU, memory, and IO.

There was feedback from Peter on the previous version that I have
incorporated as much as possible and as it still applies to this code:

- got rid of the extra lock in the sched callbacks; all task
  state changes we care about serialize through rq->lock

- got rid of ktime_get() inside the sched callbacks and
  switched time measuring to rq_clock()

- got rid of all divisions inside the sched callbacks,
  tracking everything natively in ns now

I also moved this stuff into existing sched/stat.h callbacks, so it
doesn't get in the way in sched/core.c, and of course moved the whole
thing behind CONFIG_PSI since not everyone is going to want it.

Real-world applications

Since the last posting, we've begun using the data collected by this
code quite extensively at Facebook, and with several success stories.

First we used it on systems that frequently locked up in low memory
situations. The reason this happens is that the OOM killer is
triggered by reclaim not being able to make forward progress, but with
fast flash devices there is *always* some clean and uptodate cache to
reclaim; the OOM killer never kicks in, even as tasks wait 80-90% of
the time faulting executables. There is no situation where this ever
makes sense in practice. We wrote a <100 line POC python script to
monitor memory pressure and kill stuff manually, way before such
pathological thrashing.

We've since extended the python script into a more generic oomd that
we use all over the place, not just to avoid livelocks but also to
guarantee latency and throughput SLAs, since they're usually violated
way before the kernel OOM killer would ever kick in.

We also use the memory pressure info for loadshedding. Our batch job
infrastructure used to refuse new requests on heuristics based on RSS
and other existing VM metrics in an attempt to avoid OOM kills and
maximize utilization. Since it was still plagued by frequent OOM
kills, we switched it to shed load on psi memory pressure, which has
turned out to be a much better bellwether, and we managed to reduce
OOM kills drastically. Reducing the rate of OOM outages from the
worker pool raised its aggregate productivity, and we were able to
switch that service to smaller machines.

Lastly, we use cgroups to isolate a machine's main workload from
maintenance crap like package upgrades, logging, configuration, as
well as to prevent multiple workloads on a machine from stepping on
each others' toes. We were not able to do this properly without the
pressure metrics; we would see latency or bandwidth drops, but it
would often be hard to impossible to rootcause it post-mortem. We now
log and graph the pressure metrics for all containers in our fleet and
can trivially link service drops to resource pressure after the fact.

How do you use this?

A kernel with CONFIG_PSI=y will create a /proc/pressure directory with
3 files: cpu, memory, and io. If using cgroup2, cgroups will also have
cpu.pressure, memory.pressure and io.pressure files, which simply
calculate pressure at the cgroup level instead of system-wide.

The cpu file contains one line:

some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722

The averages give the percentage of walltime in which some tasks are
delayed on the runqueue while another task has the CPU. They're recent
averages over 10s, 1m, 5m windows, so you can tell short term trends
from long term ones, similarly to the load average.

What to make of this number? If CPU utilization is at 100% and CPU
pressure is 0, it means the system is perfectly utilized, with one
runnable thread per CPU and nobody waiting. At two or more runnable
tasks per CPU, the system is 100% overcommitted and the pressure
average will indicate as much. From a utilization perspective this is
a great state of course: no CPU cycles are being wasted, even when 50%
of the threads were to go idle (and most workloads do vary). From the
perspective of the individual job it's not great, however, and they
might do better with more resources. Depending on what your priority
is, an elevated "some" number may or may not require action.

The memory file contains two lines:

some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828
full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258

The some line is the same as for cpu: the time in which at least one
task is stalled on the resource.

The full line, however, indicates time in which *nobody* is using the
CPU productively due to pressure: all non-idle tasks could be waiting
on thrashing cache simultaneously. It can also happen when a single
reclaimer occupies the CPU, 

[PATCH 4/7] sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOAD

2018-05-07 Thread Johannes Weiner
There are several definitions of those functions/macros in places that
mess with fixed-point load averages. Provide an official version.

Signed-off-by: Johannes Weiner <han...@cmpxchg.org>
---
 .../platforms/cell/cpufreq_spudemand.c|  2 +-
 arch/powerpc/platforms/cell/spufs/sched.c |  9 +++-
 arch/s390/appldata/appldata_os.c  |  4 
 drivers/cpuidle/governors/menu.c  |  4 
 fs/proc/loadavg.c |  3 ---
 include/linux/sched/loadavg.h | 21 +++
 kernel/debug/kdb/kdb_main.c   |  7 +--
 kernel/sched/loadavg.c| 15 -
 8 files changed, 22 insertions(+), 43 deletions(-)

diff --git a/arch/powerpc/platforms/cell/cpufreq_spudemand.c 
b/arch/powerpc/platforms/cell/cpufreq_spudemand.c
index 882944c36ef5..5d8e8b6bb1cc 100644
--- a/arch/powerpc/platforms/cell/cpufreq_spudemand.c
+++ b/arch/powerpc/platforms/cell/cpufreq_spudemand.c
@@ -49,7 +49,7 @@ static int calc_freq(struct spu_gov_info_struct *info)
cpu = info->policy->cpu;
 busy_spus = atomic_read(&cbe_spu_info[cpu_to_node(cpu)].busy_spus);
 
-   CALC_LOAD(info->busy_spus, EXP, busy_spus * FIXED_1);
+   info->busy_spus = calc_load(info->busy_spus, EXP, busy_spus * FIXED_1);
pr_debug("cpu %d: busy_spus=%d, info->busy_spus=%ld\n",
cpu, busy_spus, info->busy_spus);
 
diff --git a/arch/powerpc/platforms/cell/spufs/sched.c 
b/arch/powerpc/platforms/cell/spufs/sched.c
index ccc421503363..70101510b19d 100644
--- a/arch/powerpc/platforms/cell/spufs/sched.c
+++ b/arch/powerpc/platforms/cell/spufs/sched.c
@@ -987,9 +987,9 @@ static void spu_calc_load(void)
unsigned long active_tasks; /* fixed-point */
 
active_tasks = count_active_contexts() * FIXED_1;
-   CALC_LOAD(spu_avenrun[0], EXP_1, active_tasks);
-   CALC_LOAD(spu_avenrun[1], EXP_5, active_tasks);
-   CALC_LOAD(spu_avenrun[2], EXP_15, active_tasks);
+   spu_avenrun[0] = calc_load(spu_avenrun[0], EXP_1, active_tasks);
+   spu_avenrun[1] = calc_load(spu_avenrun[1], EXP_5, active_tasks);
+   spu_avenrun[2] = calc_load(spu_avenrun[2], EXP_15, active_tasks);
 }
 
 static void spusched_wake(struct timer_list *unused)
@@ -1071,9 +1071,6 @@ void spuctx_switch_state(struct spu_context *ctx,
}
 }
 
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
-
 static int show_spu_loadavg(struct seq_file *s, void *private)
 {
int a, b, c;
diff --git a/arch/s390/appldata/appldata_os.c b/arch/s390/appldata/appldata_os.c
index 433a994b1a89..54f375627532 100644
--- a/arch/s390/appldata/appldata_os.c
+++ b/arch/s390/appldata/appldata_os.c
@@ -25,10 +25,6 @@
 
 #include "appldata.h"
 
-
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
-
 /*
  * OS data
  *
diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 1bfe03ceb236..3738b670df7a 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -133,10 +133,6 @@ struct menu_device {
int interval_ptr;
 };
 
-
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
-
 static inline int get_loadavg(unsigned long load)
 {
return LOAD_INT(load) * 10 + LOAD_FRAC(load) / 10;
diff --git a/fs/proc/loadavg.c b/fs/proc/loadavg.c
index b572cc865b92..8bee50a97c0f 100644
--- a/fs/proc/loadavg.c
+++ b/fs/proc/loadavg.c
@@ -10,9 +10,6 @@
 #include 
 #include 
 
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
-
 static int loadavg_proc_show(struct seq_file *m, void *v)
 {
unsigned long avnrun[3];
diff --git a/include/linux/sched/loadavg.h b/include/linux/sched/loadavg.h
index 80bc84ba5d2a..cc9cc62bb1f8 100644
--- a/include/linux/sched/loadavg.h
+++ b/include/linux/sched/loadavg.h
@@ -22,10 +22,23 @@ extern void get_avenrun(unsigned long *loads, unsigned long 
offset, int shift);
 #define EXP_5  2014/* 1/exp(5sec/5min) */
 #define EXP_15 2037/* 1/exp(5sec/15min) */
 
-#define CALC_LOAD(load,exp,n) \
-   load *= exp; \
-   load += n*(FIXED_1-exp); \
-   load >>= FSHIFT;
+/*
+ * a1 = a0 * e + a * (1 - e)
+ */
+static inline unsigned long
+calc_load(unsigned long load, unsigned long exp, unsigned long active)
+{
+   unsigned long newload;
+
+   newload = load * exp + active * (FIXED_1 - exp);
+   if (active >= load)
+   newload += FIXED_1-1;
+
+   return newload / FIXED_1;
+}
+
+#define LOAD_INT(x) ((x) >> FSHIFT)
+#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
 
 extern void calc_global_load(unsigned long ticks);
 
diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c
i

LSF/MM 2018: Call for Proposals

2018-01-15 Thread Johannes Weiner
The annual Linux Storage, Filesystem and Memory Management (LSF/MM)
Summit for 2018 will be held from April 23-25 at the Deer Valley
Lodges in Park City, Utah. LSF/MM is an invitation-only technical
workshop to map out improvements to the Linux storage, filesystem and
memory management subsystems that will make their way into the
mainline kernel within the coming years.


http://events.linuxfoundation.org/events/linux-storage-filesystem-and-mm-summit

LSF/MM 2018 will be a three day, stand-alone conference with three
subsystem-specific tracks, cross-track discussions, as well as BoF and
hacking sessions.

On behalf of the committee I am issuing a call for agenda proposals
that are suitable for cross-track discussion as well as technical
subjects for the breakout sessions.

If advance notice is required for visa applications then please point
that out in your proposal or request to attend, and submit the topic
as soon as possible.

1) Proposals for agenda topics should be sent before January 31st,
2018 to:

lsf...@lists.linux-foundation.org

and CC the mailing lists that are relevant for the topic in question:

FS: linux-fsde...@vger.kernel.org
MM: linux...@kvack.org
Block:  linux-block@vger.kernel.org
ATA:linux-...@vger.kernel.org
SCSI:   linux-s...@vger.kernel.org
NVMe:   linux-n...@lists.infradead.org

Please tag your proposal with [LSF/MM TOPIC] to make it easier to
track. In addition, please make sure to start a new thread for each
topic rather than following up to an existing one. Agenda topics and
attendees will be selected by the program committee, but the final
agenda will be formed by consensus of the attendees on the day.

2) Requests to attend the summit for those that are not proposing a
topic should be sent to:

lsf...@lists.linux-foundation.org

Please summarize what expertise you will bring to the meeting, and
what you would like to discuss. Please also tag your email with
[LSF/MM ATTEND] and send it as a new thread so there is less chance of
it getting lost.

We will try to cap attendance at around 25-30 per track to facilitate
discussions, although the final numbers will depend on the room sizes
at the venue.

For discussion leaders, slides and visualizations are encouraged to
outline the subject matter and focus the discussions. Please refrain
from lengthy presentations and talks; the sessions are supposed to be
interactive, inclusive discussions.

There will be no recording or audio bridge. However, we expect that
written minutes will be published as we did in previous years:

2017: https://lwn.net/Articles/lsfmm2017/

2016: https://lwn.net/Articles/lsfmm2016/

2015: https://lwn.net/Articles/lsfmm2015/

2014: http://lwn.net/Articles/LSFMM2014/

2013: http://lwn.net/Articles/548089/

3) If you have feedback on last year's meeting that we can use to
improve this year's, please also send that to:

lsf...@lists.linux-foundation.org

Thank you on behalf of the program committee:

Anna Schumaker (Filesystems)
Jens Axboe (Storage)
Josef Bacik (Filesystems)
Martin K. Petersen (Storage)
Michal Hocko (MM)
Rik van Riel (MM)

Johannes Weiner


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2017-01-09 Thread Johannes Weiner
On Mon, Jan 09, 2017 at 09:30:05PM +0100, Jan Kara wrote:
> On Sat 07-01-17 21:02:00, Johannes Weiner wrote:
> > On Tue, Jan 03, 2017 at 01:28:25PM +0100, Jan Kara wrote:
> > > On Mon 02-01-17 16:11:36, Johannes Weiner wrote:
> > > > On Fri, Dec 23, 2016 at 03:33:29AM -0500, Johannes Weiner wrote:
> > > > > On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> > > > > > On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > > > > > > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > > > > > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner 
> > > > > > > > <da...@fromorbit.com> wrote:
> > > > > > > > > I unmounted the fs, mkfs'd it again, ran the
> > > > > > > > > workload again and about a minute in this fired:
> > > > > > > > >
> > > > > > > > > [628867.607417] [ cut here ]
> > > > > > > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at 
> > > > > > > > > mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
> > > > > > > > 
> > > > > > > > Well, part of the changes during the merge window were the 
> > > > > > > > shadow
> > > > > > > > entry tracking changes that came in through Andrew's tree. 
> > > > > > > > Adding
> > > > > > > > Johannes Weiner to the participants.
> > 
> > Okay, the below patch should address this problem. Dave Jones managed
> > to reproduce it with the added WARN_ONs, and they made it obvious. He
> > cannot trigger it anymore with this fix applied. Thanks Dave!
> 
> FWIW the patch looks good to me. I'd just note that the need to pass the
> callback to deletion function and the fact that we do it only in cases
> where we think it is needed appears errorprone. With the warning you've
> added it should at least catch the cases where we got it wrong but more
> robust would be if the radix tree root contained a pointer to the callback
> function so that we would not rely on passing the callback to every place
> which can possibly free a node. Also conceptually this would make more
> sense to me since the fact that we may need to do some cleanup on node
> deletion is a property of the particular radix tree and how we use it.
> OTOH that would mean growing radix tree root by one pointer which would be
> unpopular I guess.

The last sentence is the crux, unfortunately. The first iteration of
the shadow shrinker linked up mappings that contained shadow entries,
rather than nodes. The code would have been drastically simpler in
pretty much all regards, without changes to the radix tree API. But
Dave Chinner objected to adding a pointer to the inode when we could
stick the list linkage into empty space (left by slab fragmentation) inside the nodes. A
fair point, I guess, especially when you consider metadata-intensive
workloads. But now we're kind of stuck with the complexity of this.


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2017-01-02 Thread Johannes Weiner
On Fri, Dec 23, 2016 at 03:33:29AM -0500, Johannes Weiner wrote:
> On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> > On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner <da...@fromorbit.com> 
> > > > wrote:
> > > > > I unmounted the fs, mkfs'd it again, ran the
> > > > > workload again and about a minute in this fired:
> > > > >
> > > > > [628867.607417] [ cut here ]
> > > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 
> > > > > shadow_lru_isolate+0x171/0x220
> > > > 
> > > > Well, part of the changes during the merge window were the shadow
> > > > entry tracking changes that came in through Andrew's tree. Adding
> > > > Johannes Weiner to the participants.
> > > > 
> > > > > Now, this workload does not touch the page cache at all - it's
> > > > > entirely an XFS metadata workload, so it should not really be
> > > > > affecting the working set code.
> > > > 
> > > > Well, I suspect that anything that creates memory pressure will end up
> > > > triggering the working set code, so ..
> > > > 
> > > > That said, obviously memory corruption could be involved and result in
> > > > random issues too, but I wouldn't really expect that in this code.
> > > > 
> > > > It would probably be really useful to get more data points - is the
> > > > problem reliably in this area, or is it going to be random and all
> > > > over the place.
> > > 
> > > Data point: kswapd got WARNING on mm/workingset.c:457 in 
> > > shadow_lru_isolate,
> > > soon followed by NULL pointer deref in list_lru_isolate, one time when
> > > I tried out Sunday's git tree.  Not seen since, I haven't had time to
> > > investigate, just set it aside as something to worry about if it happens
> > > again.  But it looks like shadow_lru_isolate() has issues beyond Dave's
> > > case (I've no XFS and no iscsi), suspect unrelated to his other problems.
> > 
> > This seems consistent with what Dave observed: we encounter regular
> > pages in radix tree nodes on the shadow LRU that should only contain
> > nodes full of exceptional shadow entries. It could be an issue in the
> > new slot replacement code and the node tracking callback.
> 
> Both encounters seem to indicate use-after-free. Dave's node didn't
> warn about an unexpected node->count / node->exceptional state, but
> had entries that were inconsistent with that. Hugh got the counter
> warning but crashed on a list_head that's not NULLed in a live node.
> 
> workingset_update_node() should be called on page cache radix tree
> leaf nodes that go empty. I must be missing an update_node callback
> where a leaf node gets freed somewhere.

Sorry for going silent on this. I'm traveling over the holidays
with sporadic access to my emails and no access to real equipment.

The times I managed to sneak away to look at the code didn't turn up
anything useful yet.

Andrea encountered the warning as well and I gave him a debugging
patch (attached below), but he hasn't been able to reproduce this
condition. I've personally never seen the warning trigger, even though
the patches have been running on my main development machine for quite
a while now. Albeit against an older base; I've updated to Linus's
master branch now in case it's an interaction with other new code.

If anybody manages to reproduce this, that would be helpful. Any extra
eyes on this would be much appreciated too until I'm back at my desk.

Thanks

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 6f382e07de77..0783af1c0ebb 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -640,6 +640,8 @@ static inline void radix_tree_shrink(struct radix_tree_root *root,
update_node(node, private);
}
 
+   WARN_ON_ONCE(!list_empty(&node->private_list));
+
radix_tree_node_free(node);
}
 }
@@ -666,6 +668,8 @@ static void delete_node(struct radix_tree_root *root,
root->rnode = NULL;
}
 
+   WARN_ON_ONCE(!list_empty(&node->private_list));
+
radix_tree_node_free(node);
 
node = parent;
@@ -767,6 +771,7 @@ static void radix_tree_free_nodes(struct radix_tree_node *node)
struct radix_tree_node *old = child;
offset = child->offset + 1;
child = child->parent;
+   WARN_ON_ONCE(!list_empty(&old->private_list));
radix_tree_node_free(old);
if (old == entry_to_node(node))
return;



Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-23 Thread Johannes Weiner
On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner <da...@fromorbit.com> wrote:
> > > > I unmounted the fs, mkfs'd it again, ran the
> > > > workload again and about a minute in this fired:
> > > >
> > > > [628867.607417] [ cut here ]
> > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 
> > > > shadow_lru_isolate+0x171/0x220
> > > 
> > > Well, part of the changes during the merge window were the shadow
> > > entry tracking changes that came in through Andrew's tree. Adding
> > > Johannes Weiner to the participants.
> > > 
> > > > Now, this workload does not touch the page cache at all - it's
> > > > entirely an XFS metadata workload, so it should not really be
> > > > affecting the working set code.
> > > 
> > > Well, I suspect that anything that creates memory pressure will end up
> > > triggering the working set code, so ..
> > > 
> > > That said, obviously memory corruption could be involved and result in
> > > random issues too, but I wouldn't really expect that in this code.
> > > 
> > > It would probably be really useful to get more data points - is the
> > > problem reliably in this area, or is it going to be random and all
> > > over the place.
> > 
> > Data point: kswapd got WARNING on mm/workingset.c:457 in shadow_lru_isolate,
> > soon followed by NULL pointer deref in list_lru_isolate, one time when
> > I tried out Sunday's git tree.  Not seen since, I haven't had time to
> > investigate, just set it aside as something to worry about if it happens
> > again.  But it looks like shadow_lru_isolate() has issues beyond Dave's
> > case (I've no XFS and no iscsi), suspect unrelated to his other problems.
> 
> This seems consistent with what Dave observed: we encounter regular
> pages in radix tree nodes on the shadow LRU that should only contain
> nodes full of exceptional shadow entries. It could be an issue in the
> new slot replacement code and the node tracking callback.

Both encounters seem to indicate use-after-free. Dave's node didn't
warn about an unexpected node->count / node->exceptional state, but
had entries that were inconsistent with that. Hugh got the counter
warning but crashed on a list_head that's not NULLed in a live node.

workingset_update_node() should be called on page cache radix tree
leaf nodes that go empty. I must be missing an update_node callback
where a leaf node gets freed somewhere.


Re: [PATCH v4] mm: don't cap request size based on read-ahead setting

2016-11-18 Thread Johannes Weiner
On Thu, Nov 17, 2016 at 02:23:10PM -0700, Jens Axboe wrote:
> We ran into a funky issue, where someone doing 256K buffered reads saw
> 128K requests at the device level. Turns out it is read-ahead capping
> the request size, since we use 128K as the default setting. This doesn't
> make a lot of sense - if someone is issuing 256K reads, they should see
> 256K reads, regardless of the read-ahead setting, if the underlying
> device can support a 256K read in a single command.
> 
> To make matters more confusing, there's an odd interaction with the
> fadvise hint setting. If we tell the kernel we're doing sequential IO on
> this file descriptor, we can get twice the read-ahead size. But if we
> tell the kernel that we are doing random IO, hence disabling read-ahead,
> we do get nice 256K requests at the lower level. This is because
> ondemand and forced read-ahead behave differently, with the latter doing
> the right thing. An application developer will be, rightfully,
> scratching his head at this point, wondering wtf is going on. A good one
> will dive into the kernel source, and silently weep.

With the FADV_RANDOM part of the changelog updated, this looks good to
me. Just a few nitpicks below.

> This patch introduces a bdi hint, io_pages. This is the soft max IO size
> for the lower level, I've hooked it up to the bdev settings here.
> Read-ahead is modified to issue the maximum of the user request size,
> and the read-ahead max size, but capped to the max request size on the
> device side. The latter is done to avoid reading ahead too much, if the
> application asks for a huge read. With this patch, the kernel behaves
> like the application expects.
> 
> Signed-off-by: Jens Axboe <ax...@fb.com>

> @@ -207,12 +207,17 @@ int __do_page_cache_readahead(struct address_space
> *mapping, struct file *filp,
>   * memory at once.
>   */
>  int force_page_cache_readahead(struct address_space *mapping, struct file
> *filp,

Linewrap (but you already knew that ;))

> - pgoff_t offset, unsigned long nr_to_read)
> +pgoff_t offset, unsigned long nr_to_read)
>  {
> + struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
> + struct file_ra_state *ra = >f_ra;
> + unsigned long max_pages;
> +
>   if (unlikely(!mapping->a_ops->readpage && !mapping->a_ops->readpages))
>   return -EINVAL;
>
> - nr_to_read = min(nr_to_read, inode_to_bdi(mapping->host)->ra_pages);
> + max_pages = max_t(unsigned long, bdi->io_pages, ra->ra_pages);
> + nr_to_read = min(nr_to_read, max_pages);

It would be useful to have the comment on not capping below optimal IO
size from ondemand_readahead() here as well.

> @@ -369,10 +374,18 @@ ondemand_readahead(struct address_space *mapping,
>  bool hit_readahead_marker, pgoff_t offset,
>  unsigned long req_size)
>  {
> - unsigned long max = ra->ra_pages;
> + struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
> + unsigned long max_pages = ra->ra_pages;
>   pgoff_t prev_offset;
> 
>   /*
> +  * If the request exceeds the readahead window, allow the read to
> +  * be up to the optimal hardware IO size
> +  */
> + if (req_size > max_pages && bdi->io_pages > max_pages)
> + max_pages = min(req_size, bdi->io_pages);
> +
> + /*
>* start of file
>*/
>   if (!offset)

Please feel free to add:

Acked-by: Johannes Weiner <han...@cmpxchg.org>


Re: [PATCH/RFC] mm: don't cap request size based on read-ahead setting

2016-11-16 Thread Johannes Weiner
On Tue, Nov 15, 2016 at 03:41:58PM -0700, Jens Axboe wrote:
> On 11/15/2016 03:27 PM, Johannes Weiner wrote:
> > Hi Jens,
> > 
> > On Thu, Nov 10, 2016 at 10:00:37AM -0700, Jens Axboe wrote:
> > > Hi,
> > > 
> > > We ran into a funky issue, where someone doing 256K buffered reads saw
> > > 128K requests at the device level. Turns out it is read-ahead capping
> > > the request size, since we use 128K as the default setting. This doesn't
> > > make a lot of sense - if someone is issuing 256K reads, they should see
> > > 256K reads, regardless of the read-ahead setting.
> > > 
> > > To make matters more confusing, there's an odd interaction with the
> > > fadvise hint setting. If we tell the kernel we're doing sequential IO on
> > > this file descriptor, we can get twice the read-ahead size. But if we
> > > tell the kernel that we are doing random IO, hence disabling read-ahead,
> > > we do get nice 256K requests at the lower level. An application
> > > developer will be, rightfully, scratching his head at this point,
> > > wondering wtf is going on. A good one will dive into the kernel source,
> > > and silently weep.
> > > 
> > > This patch introduces a bdi hint, io_pages. This is the soft max IO size
> > > for the lower level, I've hooked it up to the bdev settings here.
> > > Read-ahead is modified to issue the maximum of the user request size,
> > > and the read-ahead max size, but capped to the max request size on the
> > > device side. The latter is done to avoid reading ahead too much, if the
> > > application asks for a huge read. With this patch, the kernel behaves
> > > like the application expects.
> > > 
> > > 
> > > diff --git a/block/blk-settings.c b/block/blk-settings.c
> > > index f679ae122843..65f16cf4f850 100644
> > > --- a/block/blk-settings.c
> > > +++ b/block/blk-settings.c
> > > @@ -249,6 +249,7 @@ void blk_queue_max_hw_sectors(struct request_queue *q,
> > > unsigned int max_hw_secto
> > >   max_sectors = min_not_zero(max_hw_sectors, limits->max_dev_sectors);
> > >   max_sectors = min_t(unsigned int, max_sectors, BLK_DEF_MAX_SECTORS);
> > >   limits->max_sectors = max_sectors;
> > > + q->backing_dev_info.io_pages = max_sectors >> (PAGE_SHIFT - 9);
> > 
> > Could we simply set q->backing_dev_info.ra_pages here? This would
> > start the disk out with a less magical readahead setting than the
> > current 128k default, while retaining the ability for the user to
> > override it in sysfs later on. Plus, one less attribute to juggle.
> 
> We could, but then we'd have two places that tweak the same knob. I
> think it's perfectly valid to have the read-ahead size be bigger than
> the max request size, if you want some pipelining, for instance.

I'm not sure I follow. Which would be the two places and which knob?

What I meant is that it could work like this: when the queue gets allocated,
we set ra_pages to the hard-coded 128K, like we do right now. When the
driver initializes and calls blk_queue_max_hw_sectors() it would set
ra_pages to the more informed, device-optimized max_sectors >>
(PAGE_SHIFT - 9). And once it's all initialized, the user can still
make adjustments to the default we picked in the kernel heuristic.
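
In diff form, a rough and completely untested sketch of that alternative,
reusing the context from the patch you posted (only the added line differs
from your version):

--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ ... @@ void blk_queue_max_hw_sectors(struct request_queue *q, unsigned int max_hw_sectors)
 	max_sectors = min_not_zero(max_hw_sectors, limits->max_dev_sectors);
 	max_sectors = min_t(unsigned int, max_sectors, BLK_DEF_MAX_SECTORS);
 	limits->max_sectors = max_sectors;
+	/*
+	 * Seed the readahead window from the device limit instead of the
+	 * boot-time 128k default; ra_pages stays user-tunable via sysfs.
+	 */
+	q->backing_dev_info.ra_pages = max_sectors >> (PAGE_SHIFT - 9);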

> The 128k default is silly, though, that should be smarter. It should
> probably default to the max request size.

Could you clarify the difference between max request size and what
blk_queue_max_hw_sectors() sets? The way I understood your patch is
that we want to use a readahead cap that's better suited to the
underlying IO device than the magic 128K. What am I missing?

> > > @@ -369,10 +369,18 @@ ondemand_readahead(struct address_space *mapping,
> > >  bool hit_readahead_marker, pgoff_t offset,
> > >  unsigned long req_size)
> > >  {
> > > - unsigned long max = ra->ra_pages;
> > > + unsigned long max_pages;
> > >   pgoff_t prev_offset;
> > > 
> > >   /*
> > > +  * Use the max of the read-ahead pages setting and the requested IO
> > > +  * size, and then the min of that and the soft IO size for the
> > > +  * underlying device.
> > > +  */
> > > + max_pages = max_t(unsigned long, ra->ra_pages, req_size);
> > > + max_pages = min_not_zero(inode_to_bdi(mapping->host)->io_pages, 
> > > max_pages);
> > 
> > This code would then go away, and it would apply the benefit of this
> > patch automatically to explicit readahead(2) and FADV_WILLNEED calls
> > going through fo

Re: [PATCH/RFC] mm: don't cap request size based on read-ahead setting

2016-11-15 Thread Johannes Weiner
Hi Jens,

On Thu, Nov 10, 2016 at 10:00:37AM -0700, Jens Axboe wrote:
> Hi,
> 
> We ran into a funky issue, where someone doing 256K buffered reads saw
> 128K requests at the device level. Turns out it is read-ahead capping
> the request size, since we use 128K as the default setting. This doesn't
> make a lot of sense - if someone is issuing 256K reads, they should see
> 256K reads, regardless of the read-ahead setting.
>
> To make matters more confusing, there's an odd interaction with the
> fadvise hint setting. If we tell the kernel we're doing sequential IO on
> this file descriptor, we can get twice the read-ahead size. But if we
> tell the kernel that we are doing random IO, hence disabling read-ahead,
> we do get nice 256K requests at the lower level. An application
> developer will be, rightfully, scratching his head at this point,
> wondering wtf is going on. A good one will dive into the kernel source,
> and silently weep.
> 
> This patch introduces a bdi hint, io_pages. This is the soft max IO size
> for the lower level, I've hooked it up to the bdev settings here.
> Read-ahead is modified to issue the maximum of the user request size,
> and the read-ahead max size, but capped to the max request size on the
> device side. The latter is done to avoid reading ahead too much, if the
> application asks for a huge read. With this patch, the kernel behaves
> like the application expects.
> 
> 
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index f679ae122843..65f16cf4f850 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -249,6 +249,7 @@ void blk_queue_max_hw_sectors(struct request_queue *q,
> unsigned int max_hw_secto
>   max_sectors = min_not_zero(max_hw_sectors, limits->max_dev_sectors);
>   max_sectors = min_t(unsigned int, max_sectors, BLK_DEF_MAX_SECTORS);
>   limits->max_sectors = max_sectors;
> + q->backing_dev_info.io_pages = max_sectors >> (PAGE_SHIFT - 9);

Could we simply set q->backing_dev_info.ra_pages here? This would
start the disk out with a less magical readahead setting than the
current 128k default, while retaining the ability for the user to
override it in sysfs later on. Plus, one less attribute to juggle.

> @@ -369,10 +369,18 @@ ondemand_readahead(struct address_space *mapping,
>  bool hit_readahead_marker, pgoff_t offset,
>  unsigned long req_size)
>  {
> - unsigned long max = ra->ra_pages;
> + unsigned long max_pages;
>   pgoff_t prev_offset;
> 
>   /*
> +  * Use the max of the read-ahead pages setting and the requested IO
> +  * size, and then the min of that and the soft IO size for the
> +  * underlying device.
> +  */
> + max_pages = max_t(unsigned long, ra->ra_pages, req_size);
> + max_pages = min_not_zero(inode_to_bdi(mapping->host)->io_pages, 
> max_pages);

This code would then go away, and it would apply the benefit of this
patch automatically to explicit readahead(2) and FADV_WILLNEED calls
going through force_page_cache_readahead() as well.
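
Either way, the sizing logic boils down to a max-then-min clamp. A tiny
userspace mock-up of that arithmetic, in page units (the helper name and all
the numbers below are invented for illustration; "io_pages" stands in for the
per-device hint, whatever form it ends up taking):

#include <stdio.h>

/* Mirror of the proposed readahead sizing, everything in pages. */
static unsigned long readahead_max(unsigned long ra_pages,
				   unsigned long io_pages,
				   unsigned long req_size)
{
	unsigned long max_pages = ra_pages;

	/*
	 * If the request exceeds the readahead window, let the read grow
	 * up to the device's optimal IO size, but no further.
	 */
	if (req_size > max_pages && io_pages > max_pages)
		max_pages = req_size < io_pages ? req_size : io_pages;

	return max_pages;
}

int main(void)
{
	/* with 4k pages: 128k window, 512k device limit, 256k read */
	printf("%lu\n", readahead_max(32, 128, 64));   /* 64 pages = 256k */
	/* a 4M read is still capped by the 512k device limit */
	printf("%lu\n", readahead_max(32, 128, 1024)); /* 128 pages = 512k */
	return 0;
}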