Re: [PATCH] x86/retpoline: Avoid return buffer underflows on context switch

2018-01-08 Thread Paul Turner
On Mon, Jan 8, 2018 at 4:48 PM, David Woodhouse  wrote:
> On Tue, 2018-01-09 at 00:44 +, Woodhouse, David wrote:
>> On IRC, Arjan assures me that 'pause' here really is sufficient as a
>> speculation trap. If we do end up returning back here as a
>> misprediction, that 'pause' will stop the speculative execution on
>> affected CPUs even though it isn't *architecturally* documented to do
>> so.
>>
>> Arjan, can you confirm that in email please?
>
>
> That actually doesn't make sense to me. If 'pause' alone is sufficient,
> then why in $DEITY's name would we need a '1:pause;jmp 1b' loop in the
> retpoline itself?
>
> Arjan?

On further investigation, I don't understand any of the motivation for
the changes here:
- It micro-benchmarks several cycles slower than the suggested
implementation on average (38 vs 44 cycles) [likely due to lost 16-byte call
alignment]
- It's much larger in terms of .text size (120 bytes @ 16 calls, 218
bytes @ 30 calls) vs (61 bytes)
- I'm not sure it's universally correct in preventing speculation:

(1) I am able to observe a small timing difference between executing
"1: pause; jmp 1b;" and "pause" in the speculative path.
 Given that alignment is otherwise identical, this should only
occur if execution is non-identical, which would require speculative
execution to proceed beyond the pause.
(2) When we proposed and reviewed the sequence, this was not cited by
architects as a way of preventing speculation.  Indeed, as David
points out, if it were sufficient we would have used it within the
sequence itself, without the loop.


If the claim above is true -- which (1) actually appears to contradict
-- it seems to bear stronger validation.  Particularly since, in the
suggested sequences, we can fit the jmps within the space we get for
free by aligning the call targets.


Re: [PATCH v6 11/10] x86/retpoline: Avoid return buffer underflows on context switch II

2018-01-08 Thread Paul Turner
On Mon, Jan 8, 2018 at 5:21 PM, Andi Kleen  wrote:
> On Mon, Jan 08, 2018 at 05:16:02PM -0800, Andi Kleen wrote:
>> > If we clear the registers, what the hell are you going to put in the
>> > RSB that helps you?
>>
>> RSB allows you to control chains of gadgets.
>
> I admit the gadget thing is a bit obscure.
>
> There's another case we were actually more worried about:
>
> On Skylake and Broadwell when the RSB underflows it will fall back to the

(Broadwell without the microcode update)

> indirect branch predictor, which can be poisoned and we try to avoid
> using with retpoline. So we try to avoid underflows, and this filling
> helps us with that.
>
> Does that make more sense?

The majority of the confusion does not stem from this being complicated.
It's that there's been great reluctance to document the details which
are different -- or changed by the microcode -- even at a high level
such as this.
Because of this, some of the details instead include vague "things are
different on Skylake" notes, with no exposition.

>
> -Andi


Re: [PATCH] x86/retpoline: Avoid return buffer underflows on context switch

2018-01-08 Thread Paul Turner
On Mon, Jan 8, 2018 at 2:25 PM, Andi Kleen  wrote:
>> So pjt did alignment, a single unroll and per discussion earlier today
>> (CET) or late last night (PST), he only does 16.
>
> I used the Intel recommended sequence, which recommends 32.
>
> Not sure if alignment makes a difference. I can check.
>
>> Why is none of that done here? Also, can we pretty please stop using
>> those retarded number labels, they make this stuff unreadable.
>
> Personally I find the magic labels with strange ASCII characters
> far less readable than a simple number.
>
> But can change it if you insist.
>
>> Also, pause is unlikely to stop speculation, that comment doesn't make
>> sense. Looking at PJT's version there used to be a speculation trap in
>> there, but I can't see that here.
>
> My understanding is that it stops speculation. But could also
> use LFENCE.
>

Neither pause nor lfence stop speculation.

> -Andi


Re: [PATCH] x86/retpoline: Avoid return buffer underflows on context switch

2018-01-08 Thread Paul Turner
On Mon, Jan 8, 2018 at 2:11 PM, Peter Zijlstra  wrote:
> On Mon, Jan 08, 2018 at 12:15:31PM -0800, Andi Kleen wrote:
>> diff --git a/arch/x86/include/asm/nospec-branch.h 
>> b/arch/x86/include/asm/nospec-branch.h
>> index b8c8eeacb4be..e84e231248c2 100644
>> --- a/arch/x86/include/asm/nospec-branch.h
>> +++ b/arch/x86/include/asm/nospec-branch.h
>> @@ -53,6 +53,35 @@
>>  #endif
>>  .endm
>>
>> +/*
>> + * We use 32-N: 32 is the max return buffer size,
>> + * but there should have been at a minimum two
>> + * controlled calls already: one into the kernel
>> + * from entry*.S and another into the function
>> + * containing this macro. So N=2, thus 30.
>> + */
>> +#define NUM_BRANCHES_TO_FILL 30
>> +
>> +/*
>> + * Fill the CPU return branch buffer to prevent
>> + * indirect branch prediction on underflow.
>> + * Caller should check for X86_FEATURE_SMEP and X86_FEATURE_RETPOLINE
>> + */
>> +.macro FILL_RETURN_BUFFER
>> +#ifdef CONFIG_RETPOLINE
>> + .rept   NUM_BRANCHES_TO_FILL
>> + call 1221f
>> + pause   /* stop speculation */
>> +1221:
>> + .endr
>> +#ifdef CONFIG_64BIT
>> + addq $8*NUM_BRANCHES_TO_FILL, %rsp
>> +#else
>> + addl $4*NUM_BRANCHES_TO_FILL, %esp
>> +#endif
>> +#endif
>> +.endm
>
> So pjt did alignment, a single unroll and per discussion earlier today
> (CET) or late last night (PST), he only does 16.
>
> Why is none of that done here? Also, can we pretty please stop using
> those retarded number labels, they make this stuff unreadable.
>
> Also, pause is unlikely to stop speculation, that comment doesn't make
> sense. Looking at PJT's version there used to be a speculation trap in
> there, but I can't see that here.
>

You definitely want the speculation traps: these entries are
potentially consumed.
Worse: The first entry that will be consumed is the last call in your
linear chain, meaning that it immediately gets to escape into
alternative execution.
(When I was experimenting with icache-minimizing constructions here I
actually used intentional backwards jumps in linear chains to avoid
this.)

The sequence I reported is what ended up seeming optimal.

>


Re: [PATCH v6 00/10] Retpoline: Avoid speculative indirect calls in kernel

2018-01-08 Thread Paul Turner
For Intel the manuals state that it's 16 entries -- 2.5.2.1
Agner also reports 16 (presumably experimentally measured)  e.g.
http://www.agner.org/optimize/microarchitecture.pdf [3.8]
For AMD it can be larger, for example 32 entries on Fam17h (but 16
entries on Fam16h).

For future-proofing a binary, or for a newer AMD processor, 32 calls
are required.  I would suggest tuning this based on the current CPU,
which covers the future case while saving cycles now.
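
[ A sketch of the tuning suggested above: choose the fill depth once,
from CPUID, rather than hard-coding 32 everywhere.  This is
illustrative only; the vendor check below is an assumption of the
sketch, not a complete vendor/family table, and the 16/32 values simply
follow the figures quoted in this thread. ]

#include <cpuid.h>

static unsigned int rsb_fill_depth(void)
{
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
                return 32;              /* unknown CPU: be conservative */

        /* "AuthenticAMD": larger RSBs exist (e.g. 32 entries on Fam17h) */
        if (ebx == 0x68747541 && edx == 0x69746e65 && ecx == 0x444d4163)
                return 32;

        return 16;                      /* Intel: 16 entries per SDM 2.5.2.1 */
}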



On Mon, Jan 8, 2018 at 3:16 AM, Andrew Cooper <andrew.coop...@citrix.com> wrote:
> On 08/01/18 10:42, Paul Turner wrote:
>> A sequence for efficiently refilling the RSB is:
>> mov $8, %rax;
>> .align 16;
>>3: call 4f;
>>   3p: pause; call 3p;
>>  .align 16;
>>   4: call 5f;
>>   4p: pause; call 4p;
>>  .align 16;
>>5: dec %rax;
>>   jnz 3b;
>>   add $(16*8), %rsp;
>> This implementation uses 8 loops, with 2 calls per iteration.  This is
>> marginally faster than a single call per iteration.  We did not
>> observe useful benefit (particularly relative to text size) from
>> further unrolling.  This may also be usefully split into smaller (e.g.
>> 4 or 8 call)  segments where we can usefully pipeline/intermix with
>> other operations.  It includes retpoline type traps so that if an
>> entry is consumed, it cannot lead to controlled speculation.  On my
>> test system it took ~43 cycles on average.  Note that non-zero
>> displacement calls should be used as these may be optimized to not
>> interact with the RSB due to their use in fetching RIP for 32-bit
>> relocations.
>
> Guidance from both Intel and AMD still states that 32 calls are required
> in general.  Is your above code optimised for a specific processor which
> you know the RSB to be smaller on?
>
> ~Andrew


Re: [PATCH v6 00/10] Retpoline: Avoid speculative indirect calls in kernel

2018-01-08 Thread Paul Turner
On Mon, Jan 8, 2018 at 2:45 AM, David Woodhouse <dw...@infradead.org> wrote:
> On Mon, 2018-01-08 at 02:34 -0800, Paul Turner wrote:
>> One detail that is missing is that we still need RSB refill in some
>> cases.
>> This is not because the retpoline sequence itself will underflow (it
>> is actually guaranteed not to, since it consumes only RSB entries
>> that it generates.
>> But either to avoid poisoning of the RSB entries themselves, or to
>> avoid the hardware turning to alternate predictors on RSB underflow.
>>
>> Enumerating the cases we care about:
>>
>> • user->kernel in the absence of SMEP:
>> In the absence of SMEP, we must worry about user-generated RSB
>> entries being consumable by kernel execution.
>> Generally speaking, for synchronous execution this will not occur
>> (e.g. syscall, interrupt), however, one important case remains.
>> When we context switch between two threads, we should flush the RSB
>> so that execution generated from the unbalanced return path on the
>> thread that we just scheduled into, cannot consume RSB entries
>> potentially installed by the prior thread.
>
> Or IBPB here, yes? That's what we had in the original patch set when
> retpoline came last, and what I assume will be put back again once we
> *finally* get our act together and reinstate the full set of microcode
> patches.

IBPB is *much* more expensive than the sequence I suggested.
If the kernel has been protected with a retpoline compilation, it is
much faster to not use IBPB here; we only need to prevent
ret-poisoning in this case.

>
>> kernel->kernel independent of SMEP:
>> While much harder to coordinate, facilities such as eBPF potentially
>> allow exploitable return targets to be created.
>> Generally speaking (particularly if eBPF has been disabled) the risk
>> is _much_ lower here, since we can only return into kernel execution
>> that was already occurring on another thread (which could e.g. likely
>> be attacked there directly independent of RSB poisoning.)
>>
>> guest->hypervisor, independent of SMEP:
>> For guest ring0 -> host ring0 transitions, it is possible that the
>> tagging only includes that the entry was only generated in a ring0
>> context.  Meaning that a guest generated entry may be consumed by the
>> host.  This admits:
>
> We are also stuffing the RSB on vmexit in the IBRS/IBPB patch set,
> aren't we?

A) I am enumerating all of the cases for completeness.  It was missed
by many that this detail was necessary on this patch, independently of
IBRS.
B) For the cases in (A) that are duplicated by that series, where the
specifics are contributory to correctness in both, we should not
hand-wave over the fact that they may or may not be covered by another
patch-set.  Users need to understand what's required for complete
protection, particularly if they are backporting.


Re: [PATCH v6 00/10] Retpoline: Avoid speculative indirect calls in kernel

2018-01-08 Thread Paul Turner
On Mon, Jan 8, 2018 at 2:38 AM, Jiri Kosina <ji...@kernel.org> wrote:
> On Mon, 8 Jan 2018, Paul Turner wrote:
>
>> user->kernel in the absence of SMEP:
>> In the absence of SMEP, we must worry about user-generated RSB entries
>> being consumable by kernel execution.
>> Generally speaking, for synchronous execution this will not occur (e.g.
>> syscall, interrupt), however, one important case remains.
>> When we context switch between two threads, we should flush the RSB so that
>> execution generated from the unbalanced return path on the thread that we
>> just scheduled into, cannot consume RSB entries potentially installed by
>> the prior thread.
>
> I am still unclear whether this closes it completely, as when HT is on,
> the RSB is shared between the threads, right? Therefore one thread can
> poision it for the other without even context switch happening.
>

See 2.6.1.1 [Replicated resources]:
  "The return stack predictor is replicated to improve branch
prediction of return instructions"

(This is part of the reason that the sequence is attractive; its use
of the RSB to control prediction naturally prevents cross-sibling
attack.)

> --
> Jiri Kosina
> SUSE Labs
>


Re: [PATCH v6 00/10] Retpoline: Avoid speculative indirect calls in kernel

2018-01-08 Thread Paul Turner
[ First send did not make list because gmail ate its plain-text force
when I pasted content. ]

One detail that is missing is that we still need RSB refill in some cases.
This is not because the retpoline sequence itself will underflow (it
is actually guaranteed not to, since it consumes only RSB entries that
it generates).
Rather, it is either to avoid poisoning of the RSB entries themselves,
or to avoid the hardware turning to alternate predictors on RSB
underflow.

Enumerating the cases we care about:

user->kernel in the absence of SMEP:
In the absence of SMEP, we must worry about user-generated RSB entries
being consumable by kernel execution.
Generally speaking, for synchronous execution this will not occur
(e.g. syscall, interrupt), however, one important case remains.
When we context switch between two threads, we should flush the RSB so
that execution generated from the unbalanced return path on the thread
that we just scheduled into, cannot consume RSB entries potentially
installed by the prior thread.

kernel->kernel independent of SMEP:
While much harder to coordinate, facilities such as eBPF potentially
allow exploitable return targets to be created.
Generally speaking (particularly if eBPF has been disabled) the risk
is _much_ lower here, since we can only return into kernel execution
that was already occurring on another thread (which could e.g. likely
be attacked there directly independent of RSB poisoning.)

guest->hypervisor, independent of SMEP:
For guest ring0 -> host ring0 transitions, it is possible that the RSB
tagging records only that the entry was generated in a ring0 context,
meaning that a guest-generated entry may be consumed by the host.  This
admits:

hypervisor_run_vcpu_implementation() {
  
  … run virtualized work (1)
   
  < update vmcs state, prior to any function return > (2)
   < return from hypervisor_run_vcpu_implementation() to handle VMEXIT > (3)
}

A guest is able to craft poisoned entries at (1) which, if not flushed
at (2), may immediately be eligible for consumption at (3).

While the cases above involve the crafting and use of poisoned
entries, recall also that one of the initial conditions was that we
should avoid RSB underflow, as some CPUs may try to use other indirect
predictors when this occurs.

The cases we care about here are:
- When we return _into_ protected execution.  For the kernel, this
means when we exit interrupt context into kernel context, since we may
have emptied or reduced the number of RSB entries while in interrupt
context.
- Context switch (even if we are returning to user code, we need to at
least unwind the scheduler/triggering frames that preempted it
previously; considering that detail, this is a subset of the above, but
it is listed for completeness).
- On VMEXIT (it turns out we need to worry about both poisoned entries
and no entries; the solution is a single refill nonetheless).
- Leaving deeper (>C1) c-states, which may have flushed hardware state
- Where we are unwinding call-chains of >16 entries[*]

[*] This is obviously the trickiest case.  Fortunately, it is tough to
exploit since such call-chains are reasonably rare, and action must
typically be predicted at a considerable distance from where current
execution lies, both dramatically reducing the feasibility of an
attack and lowering the bit-rate (the number of ops per attempt is
necessarily increased).  For our systems, since we control the binary
image we can determine this through aggregate profiling of every
machine in the fleet.  I'm happy to provide those symbols; but it's
obviously restricted from complete coverage due to code differences.
Generally, this is a level of paranoia no typical user will likely
care about and only applies to a subset of CPUs.


A sequence for efficiently refilling the RSB is:
mov $8, %rax;
.align 16;
   3: call 4f;
  3p: pause; call 3p;
 .align 16;
  4: call 5f;
  4p: pause; call 4p;
 .align 16;
   5: dec %rax;
  jnz 3b;
  add $(16*8), %rsp;
This implementation uses 8 loops, with 2 calls per iteration.  This is
marginally faster than a single call per iteration.  We did not
observe useful benefit (particularly relative to text size) from
further unrolling.  This may also be split into smaller (e.g. 4- or
8-call) segments which we can usefully pipeline/intermix with other
operations.  It includes retpoline-type traps so that if an
entry is consumed, it cannot lead to controlled speculation.  On my
test system it took ~43 cycles on average.  Note that non-zero
displacement calls should be used as these may be optimized to not
interact with the RSB due to their use in fetching RIP for 32-bit
relocations.
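
[ As written above, labels such as "3p:" are illustrative and will not
assemble.  A hedged user-space rendering with numeric local labels,
wrapped in GCC extended asm so it can actually be built and timed, is
sketched below.  The label numbers are arbitrary; build with
-mno-red-zone (or call this from a non-leaf context), since the filler
calls push below the current stack pointer. ]

static inline void rsb_refill_16_entries(void)
{
        unsigned long loops;

        asm volatile(
                "       mov     $8, %0\n"
                "       .align  16\n"
                "771:   call    772f\n"         /* return address = trap at 773 */
                "773:   pause\n"                /* speculation trap             */
                "       call    773b\n"
                "       .align  16\n"
                "772:   call    774f\n"         /* return address = trap at 775 */
                "775:   pause\n"                /* speculation trap             */
                "       call    775b\n"
                "       .align  16\n"
                "774:   dec     %0\n"
                "       jnz     771b\n"
                "       add     $(16*8), %%rsp\n"  /* drop the 16 filler slots */
                : "=r" (loops)
                :
                : "cc", "memory");
}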

On Mon, Jan 8, 2018 at 2:34 AM, Paul Turner <p...@google.com> wrote:
> One detail that is missing is that we still need RSB refill in some cases.
> This is not because the retpoline sequence itself will underflow (it is
> actually guaranteed not to, since it consumes only RSB entries that it

Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

2018-01-05 Thread Paul Turner
On Fri, Jan 5, 2018 at 3:26 AM, Paolo Bonzini <pbonz...@redhat.com> wrote:
> On 05/01/2018 11:28, Paul Turner wrote:
>>
>> The "pause; jmp" sequence proved minutely faster than "lfence;jmp" which is 
>> why
>> it was chosen.
>>
>>   "pause; jmp" 33.231 cycles/call 9.517 ns/call
>>   "lfence; jmp" 33.354 cycles/call 9.552 ns/call
>
> Do you have timings for a non-retpolined indirect branch with the
> predictor suppressed via IBRS=1?  So at least we can compute the break
> even point.

The data I collected here previously had the run-time cost as a wash.
On Skylake, an IBRS=1 indirect branch and a retpolined indirect branch
had costs within a few cycles of each other.

The costs to consider when making a choice here are:

- The transition overheads.  This depends on how frequently you will
be switching in and out of protected code (as IBRS needs to be enabled
and disabled at these boundaries).
- The frequency at which you will be executing protected code on one
sibling, and unprotected code on another (enabling IBRS may affect
sibling execution, depending on SKU)
- The implementation cost (retpoline requires auditing/rebuilding your
target, while IBRS can be used out of the box).


>
> Paolo


Re: [PATCH 0/7] IBRS patch series

2018-01-05 Thread Paul Turner
On Fri, Jan 5, 2018 at 3:32 AM, Paolo Bonzini  wrote:
> On 04/01/2018 22:22, Van De Ven, Arjan wrote:
>> this is about a level of paranoia you are comfortable with.
>>
>> Retpoline on Skylake raises the bar for the issue enormously, but
>> there are a set of corner cases that exist and that are not trivial
>> to prove you dealt with them.
>>
>> personally I am comfortable with retpoline on Skylake, but I would
>> like to have IBRS as an opt in for the paranoid.
>>
>
> IBRS actually seems to perform more than decently on Skylake with the
> microcode updates we're shipping in RHEL---much more so than Broadwell
> or Haswell.  Can you confirm that this is expected?
>

The performance cost of IBRS varies with processor generation.
Skylake incurs the least overhead.  It is expected that future
generations will be better still.

There are two distinct costs: The overhead to an indirect branch, and
the transition cost for enabling/disabling the feature as we schedule
into (and out of) protected code.

A naked indirect call (with IBRS enabled) on Skylake and a retpolined
call have approximately the same cost.
(I have not compared this cost for pre-Skylake uarchs.)

The main difference is that a retpolined binary does not incur the
transition costs, nor does it interact with sibling execution while
protected code is running.

A trade-off to consider when choosing between them is that retpoline,
on the other hand, carries a higher implementation (versus run-time)
cost.
As Arjan references above, this is also a sliding scale.  The bar is
significantly raised by compiling with retpoline alone, and the
vulnerability is wholly eliminated if the preconditions are also
satisfied.
(And obviously, if you do not have the sources to the target you are
trying to protect, IBRS allows you to run it in a protected fashion --
while it cannot easily be retpolined.)


Re: [PATCH 0/7] IBRS patch series

2018-01-05 Thread Paul Turner
On Thu, Jan 4, 2018 at 11:33 AM, Linus Torvalds
 wrote:
> On Thu, Jan 4, 2018 at 11:19 AM, David Woodhouse  wrote:
>>
>> On Skylake the target for a 'ret' instruction may also come from the
>> BTB. So if you ever let the RSB (which remembers where the 'call's came
>> from get empty, you end up vulnerable.
>
> That sounds like it could cause mispredicts, but it doesn't sound 
> _exploitable_.
>
> Sure, interrupts in between the call instruction and the 'ret' could
> overflow the return stack. And we could migrate to another CPU. And so
> apparently SMM clears the return stack too.
>
> ... but again, none of them sound even remotely _exploitable_.

These are also mitigatable; the retpoline sequence itself will never
result in an RSB underflow.

So long as the underlying binary satisfies the precondition that it
will not underflow its own RSB, then if we subsequently guarantee never
to _reduce_ the number of entries in its RSB at any point remote to its
own execution, the precondition is preserved and underflow will not
occur.
This does not preclude perturbing the RSB; it only requires that its
contents contain at least as many (non-exploitable) entries as when
execution was interrupted.
We do not need to actually track how many entries it is currently
using, because returning control to it with a full RSB guarantees that
the number of entries cannot have decreased.

>
> Remember: it's not mispredicts that leak information. It's *exploits"
> that use forced very specific  mispredicts to leak information.
>
> There's a big difference there. And I think patch authors should keep
> that difference in mind.
>
> For example, flushing the BTB at kernel entry doesn't mean that later
> in-kernel indirect branches don't get predicted, and doesn't even mean
> that they don't get mis-predicted. It only means that an exploit can't
> pre-populate those things and use them for exploits.

Or worse, that a co-executing SMT sibling is not continually refilling them.

>
>Linus


Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

2018-01-05 Thread Paul Turner
On Fri, Jan 05, 2018 at 10:55:38AM +, David Woodhouse wrote:
> On Fri, 2018-01-05 at 02:28 -0800, Paul Turner wrote:
> > On Thu, Jan 04, 2018 at 07:27:58PM +, David Woodhouse wrote:
> > > On Thu, 2018-01-04 at 10:36 -0800, Alexei Starovoitov wrote:
> > > > 
> > > > Pretty much.
> > > > Paul's writeup: https://support.google.com/faqs/answer/7625886
> > > > tldr: jmp *%r11 gets converted to:
> > > > call set_up_target;
> > > > capture_spec:
> > > >   pause;
> > > >   jmp capture_spec;
> > > > set_up_target:
> > > >   mov %r11, (%rsp);
> > > >   ret;
> > > > where capture_spec part will be looping speculatively.
> > > 
> > > That is almost identical to what's in my latest patch set, except that
> > > the capture_spec loop has 'lfence' instead of 'pause'.
> > 
> > When choosing this sequence I benchmarked several alternatives here, 
> > including
> > (nothing, nops, fences, and other serializing instructions such as cpuid).
> > 
> > The "pause; jmp" sequence proved minutely faster than "lfence;jmp" which is 
> > why
> > it was chosen.
> > 
> >   "pause; jmp" 33.231 cycles/call 9.517 ns/call
> >   "lfence; jmp" 33.354 cycles/call 9.552 ns/call
> > 
> > (Timings are for a complete retpolined indirect branch.)
> 
> Yeah, I studiously ignored you here and went with only what Intel had
> *assured* me was correct and put into the GCC patches, rather than
> chasing those 35 picoseconds ;)

It's also notable here that while the difference is small in terms of absolute
values, it's likely due to reduced variation:

I would expect:
- pause to be extremely consistent in its timings
- pause and lfence to be close on their average timings, particularly in a
  micro-benchmark.

Which suggests that the difference may be larger in the occasional cases that
you are getting "unlucky" and seeing some other uarch interaction in the lfence
path.
> 
> The GCC patch set already had about four different variants over time,
> with associated "oh shit, that one doesn't actually work; try this".
> What we have in my patch set is precisely what GCC emits at the moment.
> 
> I'm all for optimising it further, but maybe not this week.
> 
> Other than that, is there any other development from your side that I
> haven't captured in the latest (v4) series?
> http://git.infradead.org/users/dwmw2/linux-retpoline.git/




Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

2018-01-05 Thread Paul Turner
On Fri, Jan 05, 2018 at 10:55:38AM +, David Woodhouse wrote:
> On Fri, 2018-01-05 at 02:28 -0800, Paul Turner wrote:
> > On Thu, Jan 04, 2018 at 07:27:58PM +, David Woodhouse wrote:
> > > On Thu, 2018-01-04 at 10:36 -0800, Alexei Starovoitov wrote:
> > > > 
> > > > Pretty much.
> > > > Paul's writeup: https://support.google.com/faqs/answer/7625886
> > > > tldr: jmp *%r11 gets converted to:
> > > > call set_up_target;
> > > > capture_spec:
> > > >   pause;
> > > >   jmp capture_spec;
> > > > set_up_target:
> > > >   mov %r11, (%rsp);
> > > >   ret;
> > > > where capture_spec part will be looping speculatively.
> > > 
> > > That is almost identical to what's in my latest patch set, except that
> > > the capture_spec loop has 'lfence' instead of 'pause'.
> > 
> > When choosing this sequence I benchmarked several alternatives here, 
> > including
> > (nothing, nops, fences, and other serializing instructions such as cpuid).
> > 
> > The "pause; jmp" sequence proved minutely faster than "lfence;jmp" which is 
> > why
> > it was chosen.
> > 
> >   "pause; jmp" 33.231 cycles/call 9.517 ns/call
> >   "lfence; jmp" 33.354 cycles/call 9.552 ns/call
> > 
> > (Timings are for a complete retpolined indirect branch.)
> 
> Yeah, I studiously ignored you here and went with only what Intel had
> *assured* me was correct and put into the GCC patches, rather than
> chasing those 35 picoseconds ;)
> 
> The GCC patch set already had about four different variants over time,
> with associated "oh shit, that one doesn't actually work; try this".
> What we have in my patch set is precisely what GCC emits at the moment.

While I can't speak to the various iterations of the gcc patches, I can confirm
that the details originally provided were reviewed by Intel.

> 
> I'm all for optimising it further, but maybe not this week.
> 
> Other than that, is there any other development from your side that I
> haven't captured in the latest (v4) series?
> http://git.infradead.org/users/dwmw2/linux-retpoline.git/




Re: [RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre")

2018-01-05 Thread Paul Turner
On Thu, Jan 04, 2018 at 08:18:57AM -0800, Andy Lutomirski wrote:
> On Thu, Jan 4, 2018 at 1:30 AM, Woodhouse, David <d...@amazon.co.uk> wrote:
> > On Thu, 2018-01-04 at 01:10 -0800, Paul Turner wrote:
> >> Apologies for the discombobulation around today's disclosure.  Obviously 
> >> the
> >> original goal was to communicate this a little more coherently, but the
> >> unscheduled advances in the disclosure disrupted the efforts to pull this
> >> together more cleanly.
> >>
> >> I wanted to open discussion the "retpoline" approach and and define its
> >> requirements so that we can separate the core
> >> details from questions regarding any particular implementation thereof.
> >>
> >> As a starting point, a full write-up describing the approach is available 
> >> at:
> >>   https://support.google.com/faqs/answer/7625886
> >
> > Note that (ab)using 'ret' in this way is incompatible with CET on
> > upcoming processors. HJ added a -mno-indirect-branch-register option to
> > the latest round of GCC patches, which puts the branch target in a
> > register instead of on the stack. My kernel patches (which I'm about to
> > reconcile with Andi's tweaks and post) do the same.
> >
> > That means that in the cases where at runtime we want to ALTERNATIVE
> > out the retpoline, it just turns back into a bare 'jmp *\reg'.
> >
> >
> 
> I hate to say this, but I think Intel should postpone CET until the
> dust settles.  Intel should also consider a hardware-protected stack
> that is only accessible with PUSH, POP, CALL, RET, and a new MOVSTACK
> instruction.  That, by itself, would give considerable protection.
> But we still need JMP_NO_SPECULATE.  Or, better yet, get the CPU to
> stop leaking data during speculative execution.

Echoing Andy's thoughts, but from a slightly different angle:

1) BTI is worse than the current classes of return attack.  Given this,
   considered as a binary choice, it's equivalent to the current state of the
   world (e.g. no CET).
2) CET will not be "free".  I suspect in its initial revisions it will be more
   valuable for protecting end-users than enterprise workloads (cost is not
   observable for interactive workloads because there's tons of headroom in the
   first place).

While the potential incompatibility is unfortunate, I'm not sure it makes a
significant difference to the adoption rate of CET.



Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

2018-01-05 Thread Paul Turner
On Thu, Jan 04, 2018 at 10:25:35AM -0800, Linus Torvalds wrote:
> On Thu, Jan 4, 2018 at 10:17 AM, Alexei Starovoitov
>  wrote:
> >
> > Clearly Paul's approach to retpoline without lfence is faster.

Using pause rather than lfence does not represent a fundamental difference here.

A protected indirect branch is always adding ~25-30 cycles of overhead.

That this can be avoided in practice is a function of two key factors:
(1) Kernel code uses fewer indirect branches.
(2) The overhead can be avoided for hot indirect branches via devirtualization.
  e.g. the semantic equivalent of,
if (ptr == foo)
  foo();
else
  (*ptr)();
  Allowing foo() to be called directly, even though it was provided as an
  indirect.
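
[ A hedged C sketch of point (2), devirtualizing a hot indirect call.
The names handler_t, hot_likely_target() and dispatch() are
illustrative assumptions of the sketch, not kernel identifiers. ]

typedef void (*handler_t)(void *ctx);

/* Illustrative "hot" target; in practice, the most common handler on the
 * path in question (e.g. a particular protocol's receive function). */
static void hot_likely_target(void *ctx) { (void)ctx; }

static void dispatch(handler_t h, void *ctx)
{
        /*
         * Devirtualization: test for the expected target and call it
         * directly, so the common case is a direct (cheaply predicted)
         * branch and only the uncommon case pays for the retpolined
         * indirect call.
         */
        if (h == hot_likely_target)
                hot_likely_target(ctx);
        else
                h(ctx);
}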

> > I'm guessing it wasn't shared with amazon/intel until now and
> > this set of patches going to adopt it, right?
> >
> > Paul, could you share a link to a set of alternative gcc patches
> > that do retpoline similar to llvm diff ?
> 
> What is the alternative approach? Is it literally just doing a
> 
>   call 1f
> 1:mov real_target,(%rsp)
>ret
> 
> on the assumption that the "ret" will always just predict to that "1"
> due to the call stack?
> 
> Linus


Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

2018-01-05 Thread Paul Turner
On Thu, Jan 04, 2018 at 10:40:23AM -0800, Andi Kleen wrote:
> > Clearly Paul's approach to retpoline without lfence is faster.
> > I'm guessing it wasn't shared with amazon/intel until now and
> > this set of patches going to adopt it, right?
> > 
> > Paul, could you share a link to a set of alternative gcc patches
> > that do retpoline similar to llvm diff ?
> 
> I don't think it's a good idea to use any sequence not signed
> off by CPU designers and extensively tested. 

I can confirm that "pause; jmp" has been previously reviewed by your side.

That said, again, per the other email: once we have guaranteed that speculative
execution will reach this point, then beyond the guarantee that it does
something "safe", the choice of sequence here is a performance detail rather
than one of correctness.

> 
> While another one may work for most tests, it could always fail in
> some corner case.
> 
> Right now we have the more heavy weight one and I would
> suggest to stay with that one for now. Then worry
> about more optimizations later.
> 
> Correctness first.
> 
> -Andi


Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support

2018-01-05 Thread Paul Turner
On Thu, Jan 04, 2018 at 07:27:58PM +, David Woodhouse wrote:
> On Thu, 2018-01-04 at 10:36 -0800, Alexei Starovoitov wrote:
> > 
> > Pretty much.
> > Paul's writeup: https://support.google.com/faqs/answer/7625886
> > tldr: jmp *%r11 gets converted to:
> > call set_up_target;
> > capture_spec:
> >   pause;
> >   jmp capture_spec;
> > set_up_target:
> >   mov %r11, (%rsp);
> >   ret;
> > where capture_spec part will be looping speculatively.
> 
> That is almost identical to what's in my latest patch set, except that
> the capture_spec loop has 'lfence' instead of 'pause'.

When choosing this sequence I benchmarked several alternatives here, including
(nothing, nops, fences, and other serializing instructions such as cpuid).

The "pause; jmp" sequence proved minutely faster than "lfence;jmp" which is why
it was chosen.

  "pause; jmp" 33.231 cycles/call 9.517 ns/call
  "lfence; jmp" 33.354 cycles/call 9.552 ns/call

(Timings are for a complete retpolined indirect branch.)
> 
> As Andi says, I'd want to see explicit approval from the CPU architects
> for making that change.

Beyond guaranteeing that speculative execution is constrained, the choice of
sequence here is a performance detail and not one of correctness.

> 
> We've already had false starts there — for a long time, Intel thought
> that a much simpler option with an lfence after the register load was
> sufficient, and then eventually worked out that in some rare cases it
> wasn't. While AMD still seem to think it *is* sufficient for them,
> apparently.

As an interesting aside, that speculation proceeds beyond lfence can be
trivially proven using the timings above.  In fact, if we substitute only:
  "lfence" (with no jmp)

We see:
  29.573 cycles/call 8.469 ns/call

Now, the only way for this timing to be different is if speculation beyond the
lfence was executed differently.

That said, while this is a negative result, it does suggest that the jmp is
contributing a larger-than-realized cost to our speculative loop.  We can likely
shave off some additional time with some unrolling.  I did try this previously,
but did not see results above the noise floor; it seems worth trying this again;
I will take a look tomorrow.
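
[ For concreteness, the thunk under discussion (with the "pause; jmp"
trap) can be exercised from user space with the hedged sketch below.
The numeric labels and the lea of a local label into %r11 are artifacts
of the sketch: architecturally the ret lands at 904, while a
mispredicted return is captured by the pause/jmp loop at 902. ]

static unsigned int retpoline_jmp_r11_demo(void)
{
        unsigned int landed = 0;

        asm volatile(
                "       lea     904f(%%rip), %%r11\n"   /* the indirect target       */
                "       call    901f\n"                 /* retpolined "jmp *%%r11"   */
                "902:   pause\n"                        /* speculation trap          */
                "       jmp     902b\n"
                "901:   mov     %%r11, (%%rsp)\n"       /* overwrite return address  */
                "       ret\n"                          /* predicted: 902, actual: 904 */
                "904:   movl    $1, %0\n"
                : "=r" (landed)
                :
                : "r11", "memory");

        return landed;
}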




Re: [RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre")

2018-01-04 Thread Paul Turner
On Thu, Jan 4, 2018 at 1:10 AM, Paul Turner <p...@google.com> wrote:
> Apologies for the discombobulation around today's disclosure.  Obviously the
> original goal was to communicate this a little more coherently, but the
> unscheduled advances in the disclosure disrupted the efforts to pull this
> together more cleanly.
>
> I wanted to open discussion of the "retpoline" approach and define its
> requirements so that we can separate the core
> details from questions regarding any particular implementation thereof.
>
> As a starting point, a full write-up describing the approach is available at:
>   https://support.google.com/faqs/answer/7625886
>
> The 30 second version is:
> Returns are a special type of indirect branch.  As function returns are 
> intended
> to pair with function calls, processors often implement dedicated return stack
> predictors.  The choice of this branch prediction allows us to generate an
> indirect branch in which speculative execution is intentionally redirected 
> into
> a controlled location by a return stack target that we control.  Preventing
> branch target injections (also known as "Spectre") against these binaries.
>
> On the targets (Intel Xeon) we have measured so far, cost is within cycles of 
> a
> "native" indirect branch for which branch prediction hardware has been 
> disabled.
> This is unfortunately measurable -- from 3 cycles on average to about 30.
> However the cost is largely mitigated for many workloads since the kernel uses
> comparatively few indirect branches (versus say, a C++ binary).  With some
> effort we have the average overall overhead within the 0-1.5% range for our
> internal workloads, including some particularly high packet processing 
> engines.
>
> There are several components, the majority of which are independent of kernel
> modifications:
>
> (1) A compiler supporting retpoline transformations.

An implementation for LLVM is available at:
  https://reviews.llvm.org/D41723

> (1a) Optionally: annotations for hand-coded indirect jmps, so that they may be
> made compatible with (1).
> [ Note: The only known indirect jmp which is not safe to convert, is the
>   early virtual address check in head entry. ]
> (2) Kernel modifications for preventing return-stack underflow (see document
> above).
>The key points where this occurs are:
>- Context switches (into protected targets)
>- interrupt return (we return into potentially unwinding execution)
>- sleep state exit (flushes caches)
>- guest exit.
>   (These can be run-time gated, a full refill costs 30-45 cycles.)
> (3) Optional: Optimizations so that direct branches can be used for hot kernel
>indirects. While as discussed above, kernel execution generally depends on
>fewer indirect branches, there are a few places (in particular, the
>networking stack) where we have chained sequences of indirects on hot 
> paths.
> (4) More general support for guarding against RSB underflow in an affected
> target.  While this is harder to exploit and may not be required for many
> users, the approaches we have used here are not generally applicable.
> Further discussion is required.
>
> With respect to the what these deltas mean for an unmodified kernel:

Sorry, this should have been: a kernel that does not care about this protection.

It has been a long day :-).

>  (1a) At minimum annotation only.  More complicated, config and
> run-time gated options are also possible.
>  (2) Trivially run-time & config gated.
>  (3) The de-virtualizing of these branches improves performance in both the
>  retpoline and non-retpoline cases.
>
> For an out of the box kernel that is reasonably protected, (1)-(3) are 
> required.
>
> I apologize that this does not come with a clean set of patches, merging the
> things that we and Intel have looked at here.  That was one of the original
> goals for this week.  Strictly speaking, I think that Andi, David, and I have
> a fair amount of merging and clean-up to do here.  This is an attempt
> to keep discussion of the fundamentals at least independent of that.
>
> I'm trying to keep the above reasonably compact/dense.  I'm happy to expand on
> any details in sub-threads.  I'll also link back some of the other compiler 
> work
> which is landing for (1).
>
> Thanks,
>
> - Paul


Re: [RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre")

2018-01-04 Thread Paul Turner
On Thu, Jan 4, 2018 at 1:10 AM, Paul Turner <p...@google.com> wrote:
> Apologies for the discombobulation around today's disclosure.  Obviously the
> original goal was to communicate this a little more coherently, but the
> unscheduled advances in the disclosure disrupted the efforts to pull this
> together more cleanly.
>
> I wanted to open discussion of the "retpoline" approach and define its
> requirements so that we can separate the core
> details from questions regarding any particular implementation thereof.
>
> As a starting point, a full write-up describing the approach is available at:
>   https://support.google.com/faqs/answer/7625886
>
> The 30 second version is:
> Returns are a special type of indirect branch.  As function returns are 
> intended
> to pair with function calls, processors often implement dedicated return stack
> predictors.  The choice of this branch prediction allows us to generate an
> indirect branch in which speculative execution is intentionally redirected 
> into
> a controlled location by a return stack target that we control.  Preventing
> branch target injections (also known as "Spectre") against these binaries.
>
> On the targets (Intel Xeon) we have measured so far, cost is within cycles of 
> a
> "native" indirect branch for which branch prediction hardware has been 
> disabled.
> This is unfortunately measurable -- from 3 cycles on average to about 30.
> However the cost is largely mitigated for many workloads since the kernel uses
> comparatively few indirect branches (versus say, a C++ binary).  With some
> effort we have the average overall overhead within the 0-1.5% range for our
> internal workloads, including some particularly high packet processing 
> engines.
>
> There are several components, the majority of which are independent of kernel
> modifications:
>
> (1) A compiler supporting retpoline transformations.
> (1a) Optionally: annotations for hand-coded indirect jmps, so that they may be
> made compatible with (1).
> [ Note: The only known indirect jmp which is not safe to convert, is the
>   early virtual address check in head entry. ]
> (2) Kernel modifications for preventing return-stack underflow (see document
> above).
>The key points where this occurs are:
>- Context switches (into protected targets)
>- interrupt return (we return into potentially unwinding execution)
>- sleep state exit (flushes caches)
>- guest exit.
>   (These can be run-time gated, a full refill costs 30-45 cycles.)
> (3) Optional: Optimizations so that direct branches can be used for hot kernel
>indirects. While as discussed above, kernel execution generally depends on
>fewer indirect branches, there are a few places (in particular, the
>networking stack) where we have chained sequences of indirects on hot 
> paths.
> (4) More general support for guarding against RSB underflow in an affected
> target.  While this is harder to exploit and may not be required for many
> users, the approaches we have used here are not generally applicable.
> Further discussion is required.
>
> With respect to the what these deltas mean for an unmodified kernel:
>  (1a) At minimum annotation only.  More complicated, config and
> run-time gated options are also possible.
>  (2) Trivially run-time & config gated.
>  (3) The de-virtualizing of these branches improves performance in both the
>  retpoline and non-retpoline cases.
>
> For an out of the box kernel that is reasonably protected, (1)-(3) are 
> required.
>
> I apologize that this does not come with a clean set of patches, merging the
> things that we and Intel have looked at here.  That was one of the original
> goals for this week.  Strictly speaking, I think that Andi, David, and I have
> a fair amount of merging and clean-up to do here.  This is an attempt
> to keep discussion of the fundamentals at least independent of that.
>
> I'm trying to keep the above reasonably compact/dense.  I'm happy to expand on
> any details in sub-threads.  I'll also link back some of the other compiler 
> work
> which is landing for (1).
>
> Thanks,
>
> - Paul


[RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre")

2018-01-04 Thread Paul Turner
Apologies for the discombobulation around today's disclosure.  Obviously the
original goal was to communicate this a little more coherently, but the
unscheduled advances in the disclosure disrupted the efforts to pull this
together more cleanly.

I wanted to open discussion of the "retpoline" approach and define its
requirements so that we can separate the core
details from questions regarding any particular implementation thereof.

As a starting point, a full write-up describing the approach is available at:
  https://support.google.com/faqs/answer/7625886

The 30 second version is:
Returns are a special type of indirect branch.  As function returns are intended
to pair with function calls, processors often implement dedicated return stack
predictors.  The choice of this branch prediction allows us to generate an
indirect branch in which speculative execution is intentionally redirected into
a controlled location by a return stack target that we control, preventing
branch target injection (also known as "Spectre") against these binaries.

On the targets (Intel Xeon) we have measured so far, cost is within cycles of a
"native" indirect branch for which branch prediction hardware has been disabled.
This is unfortunately measurable -- from 3 cycles on average to about 30.
However the cost is largely mitigated for many workloads since the kernel uses
comparatively few indirect branches (versus say, a C++ binary).  With some
effort we have the average overall overhead within the 0-1.5% range for our
internal workloads, including some particularly high packet processing engines.

There are several components, the majority of which are independent of kernel
modifications:

(1) A compiler supporting retpoline transformations.
(1a) Optionally: annotations for hand-coded indirect jmps, so that they may be
made compatible with (1).
[ Note: The only known indirect jmp which is not safe to convert, is the
  early virtual address check in head entry. ]
(2) Kernel modifications for preventing return-stack underflow (see document
above).
   The key points where this occurs are:
   - Context switches (into protected targets)
   - interrupt return (we return into potentially unwinding execution)
    - sleep state exit (flushes caches)
   - guest exit.
  (These can be run-time gated, a full refill costs 30-45 cycles.)
(3) Optional: Optimizations so that direct branches can be used for hot kernel
   indirects. While as discussed above, kernel execution generally depends on
   fewer indirect branches, there are a few places (in particular, the
   networking stack) where we have chained sequences of indirects on hot paths.
(4) More general support for guarding against RSB underflow in an affected
target.  While this is harder to exploit and may not be required for many
users, the approaches we have used here are not generally applicable.
Further discussion is required.

With respect to what these deltas mean for an unmodified kernel:
 (1a) At minimum annotation only.  More complicated, config and
run-time gated options are also possible.
 (2) Trivially run-time & config gated.
 (3) The de-virtualizing of these branches improves performance in both the
 retpoline and non-retpoline cases.

For an out of the box kernel that is reasonably protected, (1)-(3) are required.

I apologize that this does not come with a clean set of patches, merging the
things that we and Intel have looked at here.  That was one of the original
goals for this week.  Strictly speaking, I think that Andi, David, and I have
a fair amount of merging and clean-up to do here.  This is an attempt
to keep discussion of the fundamentals at least independent of that.

I'm trying to keep the above reasonably compact/dense.  I'm happy to expand on
any details in sub-threads.  I'll also link back some of the other compiler work
which is landing for (1).

Thanks,

- Paul


Re: Avoid speculative indirect calls in kernel

2018-01-03 Thread Paul Turner
On Wed, Jan 3, 2018 at 3:51 PM, Linus Torvalds
 wrote:
> On Wed, Jan 3, 2018 at 3:09 PM, Andi Kleen  wrote:
>> This is a fix for Variant 2 in
>> https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html
>>
>> Any speculative indirect calls in the kernel can be tricked
>> to execute any kernel code, which may allow side channel
>> attacks that can leak arbitrary kernel data.
>
> Why is this all done without any configuration options?
>
> A *competent* CPU engineer would fix this by making sure speculation
> doesn't happen across protection domains. Maybe even a L1 I$ that is
> keyed by CPL.
>
> I think somebody inside of Intel needs to really take a long hard look
> at their CPU's, and actually admit that they have issues instead of
> writing PR blurbs that say that everything works as designed.
>
> .. and that really means that all these mitigation patches should be
> written with "not all CPU's are crap" in mind.
>
> Or is Intel basically saying "we are committed to selling you shit
> forever and ever, and never fixing anything"?
>
> Because if that's the case, maybe we should start looking towards the
> ARM64 people more.
>
> Please talk to management. Because I really see exactly two possibibilities:
>
>  - Intel never intends to fix anything
>
> OR
>
>  - these workarounds should have a way to disable them.
>
> Which of the two is it?
>
>Linus

With all of today's excitement these raced slightly with a post we are
making explaining the technique and its application.
The modifications are mostly at the compiler level, to produce
binaries which can safely execute on an affected target.
(How such a binary could be cross-compiled for vulnerable and
non-vulnerable targets is an interesting discussion, but right now all
targets are obviously vulnerable.)

Let me finish getting the post up and I'll bring back more context here.

- Paul


Re: [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11

2017-11-14 Thread Paul Turner
I have some comments that apply to many of the threads.
I've been fully occupied with a wedding and a security issue; but I'm
about to be free to spend the majority of my time on RSEQ things.
I was sorely hoping that day would be today.  But it's looking like
I'm still a day or two from being free for this.
Thank you for the extensive clean-ups and user-side development.  I
have some updates on these topics also.

- Paul

On Tue, Nov 14, 2017 at 1:15 PM, Andy Lutomirski  wrote:
> On Tue, Nov 14, 2017 at 1:08 PM, Linus Torvalds
>  wrote:
>> On Tue, Nov 14, 2017 at 12:03 PM, Mathieu Desnoyers
>>  wrote:
>>> Here is the last RFC round of the updated rseq patchset containing:
>>
>> Andy? You were the one with concerns here and said you'd have
>> something else ready for comparison.
>>
>
> I had a long discussion with Mathieu and KS and I think that this is a
> good compromise.  I haven't reviewed the series all that carefully,
> but I think the idea is sound.
>
> Basically, event_counter is gone (to be re-added in a later kernel if
> it really ends up being necessary, but it looks like it may primarily
> be a temptation to write subtly incorrect user code and to see
> scheduling details that shouldn't be readily exposed rather than a
> genuinely useful feature) and the versioning mechanism for the asm
> critical section bit is improved.  My crazy proposal should be doable
> on top of this if there's demand and if anyone wants to write the
> gnarly code involved.
>
> IOW no objection from me as long as those changes were made, which I
> *think* they were.  Mathieu?


Re: [RESEND PATCH 2/2] sched/fair: Optimize __update_sched_avg()

2017-03-31 Thread Paul Turner
On Thu, Mar 30, 2017 at 7:14 AM, Peter Zijlstra <pet...@infradead.org> wrote:
> On Thu, Mar 30, 2017 at 02:16:58PM +0200, Peter Zijlstra wrote:
>> On Thu, Mar 30, 2017 at 04:21:08AM -0700, Paul Turner wrote:
>
>> > > +
>> > > +   if (unlikely(periods >= LOAD_AVG_MAX_N))
>> > > return LOAD_AVG_MAX;
>
>> >
>> > Is this correct in the iterated periods > LOAD_AVG_MAX_N case?
>> > I don't think the decay above is guaranteed to return these to zero.
>>
>> Ah!
>>
>> Indeed, so decay_load() needs LOAD_AVG_PERIOD * 63 before it truncates
>> to 0, because every LOAD_AVG_PERIOD we half the value; loose 1 bit; so
>> 63 of those and we're 0.
>>
>> But __accumulate_sum() OTOH returns LOAD_AVG_MAX after only
>> LOAD_AVG_MAX_N, which < LOAD_AVG_PERIOD * 63.
>>
>> So yes, combined we exceed LOAD_AVG_MAX, which is bad. Let me think what
>> to do about that.
>
>
> So at the very least it should be decay_load(LOAD_AVG_MAX, 1) (aka
> LOAD_AVG_MAX - 1024), but that still doesn't account for the !0
> decay_load() of the first segment.
>
> I'm thinking that we can compute the middle segment, by taking the max
> value and chopping off the ends, like:
>

>
>              p
>  c2 = 1024 \Sum y^n
>             n=1
>
>             inf        inf
>    = 1024 ( \Sum y^n - \Sum y^n - y^0 )
>             n=0        n=p
>

Very nice!
Minor nit: Second sum needs to be from n=p+1

>
> Which gives something like the below.. Or am I completely off my rocker?
>
> ---
>  kernel/sched/fair.c | 70 
> ++---
>  1 file changed, 18 insertions(+), 52 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 76f67b3e34d6..4f17ec0a378a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2744,26 +2744,6 @@ static const u32 runnable_avg_yN_inv[] = {
>  };
>
>  /*
> - * Precomputed \Sum y^k { 1<=k<=n }.  These are floor(true_value) to prevent
> - * over-estimates when re-combining.
> - */
> -static const u32 runnable_avg_yN_sum[] = {
> -   0, 1002, 1982, 2941, 3880, 4798, 5697, 6576, 7437, 8279, 9103,
> -9909,10698,11470,12226,12966,13690,14398,15091,15769,16433,17082,
> -   17718,18340,18949,19545,20128,20698,21256,21802,22336,22859,23371,
> -};
> -
> -/*
> - * Precomputed \Sum y^k { 1<=k<=n, where n%32=0). Values are rolled down to
> - * lower integers. See Documentation/scheduler/sched-avg.txt how these
> - * were generated:
> - */
> -static const u32 __accumulated_sum_N32[] = {
> -   0, 23371, 35056, 40899, 43820, 45281,
> -   46011, 46376, 46559, 46650, 46696, 46719,
> -};
> -
> -/*
>   * Approximate:
>   *   val * y^n,where y^32 ~= 0.5 (~1 scheduling period)
>   */
> @@ -2795,40 +2775,25 @@ static u64 decay_load(u64 val, u64 n)
> return val;
>  }
>
> -static u32 __accumulate_sum(u64 periods, u32 period_contrib, u32 remainder)
> +static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3)
>  {
> -   u32 c1, c2, c3 = remainder; /* y^0 == 1 */
> -
> -   if (!periods)
> -   return remainder - period_contrib;
> -
> -   if (unlikely(periods >= LOAD_AVG_MAX_N))
> -   return LOAD_AVG_MAX;
> +   u32 c1, c2, c3 = d3; /* y^0 == 1 */
>
> /*
>  * c1 = d1 y^(p+1)
>  */
> -   c1 = decay_load((u64)(1024 - period_contrib), periods);
> +   c1 = decay_load((u64)d1, periods);
>
> -   periods -= 1;
> /*
> -* For updates fully spanning n periods, the contribution to runnable
> -* average will be:
> +*           p
> +* c2 = 1024 \Sum y^n
> +*           n=1
>  *
> -*   c2 = 1024 \Sum y^n
> -*
> -* We can compute this reasonably efficiently by combining:
> -*
> -*   y^PERIOD = 1/2 with precomputed 1024 \Sum y^n {for: n < PERIOD}
> +*          inf        inf
> +* = 1024 ( \Sum y^n - \Sum y^n - y^0 )
> +*          n=0        n=p+1
>  */
> -   if (likely(periods <= LOAD_AVG_PERIOD)) {
> -   c2 = runnable_avg_yN_sum[periods];
> -   } else {
> -   c2 = __accumulated_sum_N32[periods/LOAD_AVG_PERIOD];
> -   periods %= LOAD_AVG_PERIOD;
> -   c2 = decay_load(c2, periods);
> -   c2 += runnable_avg_yN_sum[periods];
> -   }
> +   c2 = LOAD_AVG_MAX - decay_load(LOAD_AVG_MAX, periods) - 1024;

decay_load(LOAD_AVG_MAX, periods + 1)

I computed 
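
[ Spelling out the arithmetic behind the two corrections above (the
n=p+1 lower limit and the extra period in decay_load), using only the
definitions already in this thread: y^32 = 1/2,
LOAD_AVG_MAX = 1024 \sum_{n \ge 0} y^n (approximately) and
decay_load(v, p) = v y^p. ]

\[
c_2 = 1024 \sum_{n=1}^{p} y^n
    = 1024 \Big( \sum_{n=0}^{\infty} y^n \;-\; \sum_{n=p+1}^{\infty} y^n \;-\; y^0 \Big),
\]
and since \(\sum_{n=p+1}^{\infty} y^n = y^{p+1} \sum_{n=0}^{\infty} y^n\),
\[
c_2 = \mathrm{LOAD\_AVG\_MAX} - \mathrm{decay\_load}(\mathrm{LOAD\_AVG\_MAX},\, p+1) - 1024 .
\]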

Re: [RESEND PATCH 2/2] sched/fair: Optimize __update_sched_avg()

2017-03-31 Thread Paul Turner
On Fri, Mar 31, 2017 at 12:01 AM, Peter Zijlstra <pet...@infradead.org> wrote:
> On Thu, Mar 30, 2017 at 03:02:47PM -0700, Paul Turner wrote:
>> On Thu, Mar 30, 2017 at 7:14 AM, Peter Zijlstra <pet...@infradead.org> wrote:
>> > On Thu, Mar 30, 2017 at 02:16:58PM +0200, Peter Zijlstra wrote:
>> >> On Thu, Mar 30, 2017 at 04:21:08AM -0700, Paul Turner wrote:
>> >
>> >> > > +
>> >> > > +   if (unlikely(periods >= LOAD_AVG_MAX_N))
>> >> > > return LOAD_AVG_MAX;
>> >
>> >> >
>> >> > Is this correct in the iterated periods > LOAD_AVG_MAX_N case?
>> >> > I don't think the decay above is guaranteed to return these to zero.
>> >>
>> >> Ah!
>> >>
>> >> Indeed, so decay_load() needs LOAD_AVG_PERIOD * 63 before it truncates
>> >> to 0, because every LOAD_AVG_PERIOD we half the value; loose 1 bit; so
>> >> 63 of those and we're 0.
>> >>
>> >> But __accumulate_sum() OTOH returns LOAD_AVG_MAX after only
>> >> LOAD_AVG_MAX_N, which < LOAD_AVG_PERIOD * 63.
>> >>
>> >> So yes, combined we exceed LOAD_AVG_MAX, which is bad. Let me think what
>> >> to do about that.
>> >
>> >
>> > So at the very least it should be decay_load(LOAD_AVG_MAX, 1) (aka
>> > LOAD_AVG_MAX - 1024), but that still doesn't account for the !0
>> > decay_load() of the first segment.
>> >
>> > I'm thinking that we can compute the middle segment, by taking the max
>> > value and chopping off the ends, like:
>> >
>> >
>> >              p
>> >  c2 = 1024 \Sum y^n
>> >             n=1
>> >
>> >              inf        inf
>> >     = 1024 ( \Sum y^n - \Sum y^n - y^0 )
>> >              n=0        n=p
>> >
>> >
>>
>> So this is endemic to what I think is a deeper problem:
>
> I think not.
>
> 47742*e((345/32)*l(.5))
> 27.12819932019487579284
>
> So even without weight, 345 periods isn't enough to flatten
> LOAD_AVG_MAX.
>
>> The previous rounds of optimization folded weight into load_sum.  I
>> think this introduced a number of correctness problems:
>
> I'll argue you on correctness; but these might well be problems indeed.

Yeah, this side of it will never be numerically perfect.  I was just
suggesting that we can more easily clamp to LOAD_AVG_MAX directly if
weight is not accumulated into the sum.
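For reference, a tiny standalone calculation (plain C; the first number
reproduces the bc arithmetic above, the second approximates decay_load()'s
integer truncation by pure halving once per LOAD_AVG_PERIOD = 32 periods):

#include <math.h>
#include <stdio.h>

int main(void)
{
        double max = 47742.0;           /* LOAD_AVG_MAX */
        unsigned long long v = 47742;
        int periods = 0;

        /* what's left of LOAD_AVG_MAX after LOAD_AVG_MAX_N = 345 periods */
        printf("after 345 periods: %f\n", max * pow(0.5, 345.0 / 32));

        /* how long pure halving takes to truncate it to zero */
        while (v) {
                v >>= 1;
                periods += 32;
        }
        printf("truncates to zero after ~%d periods\n", periods);
        return 0;
}

So the value is still ~27 at 345 periods and only truncates to zero around
period 512 (and, per the above, up to 63 * 32 for a full u64 sum), while
__accumulate_sum() already returns LOAD_AVG_MAX at 345; hence the mismatch.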

>
>> a) the load_avg is no longer independent of weight; a high weight
>> entity can linger for eons. [63 LOAD_AVG_PERIODS]
>
> Certainly longer than before, agreed.
>
>> b) it breaks the dynamic response of load_avg on a weight change.
>> While nice is not common, there's a case that this is really important
>> for which is cgroups with a low number of threads running.  E.g. When we
>> transition from 1->2 threads we immediately halve the weight, but
>> because of the folding it takes a very large time to be numerically
>> correct again.
>
> I think this is a matter of semantics, not correctness. We did have that

Challenge accepted ;)

> weight in the past, so our past average including that isn't incorrect
> per se.

I agree, but it's not correct relative to the numerical
interpretations we actually want to use.  We examine these values for
forward-looking decisions, e.g. how much load we would move if we
migrate this thread, as well as for vruntime calculations.

E.g.  Consider the case above: suppose we have 2 nice-0 threads, one
has been running on 0, one wakes up on 1.  The load-balancer will see
a false imbalance between 0 and 1 [as well as potentially other cores],
when they should be ~equal (with some obvious caveats on what the
threads are actually doing).

Further, this weight is specifically bound to cpu 0, meaning that we
will also see inconsistent future behavior depending on wake-up
placement.
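To put a rough number on "a very large time" (back-of-the-envelope only;
this just assumes the folded sum relaxes toward the new weight at the
y-per-1ms-period decay rate and ignores everything else):

#include <math.h>
#include <stdio.h>

int main(void)
{
        double y = pow(0.5, 1.0 / 32);  /* PELT decay per ~1ms period */
        double seen = 1024.0;           /* weight folded in while single-threaded */
        double target = 512.0;          /* correct per-entity weight after 1->2 */
        int ms = 0;

        while (seen - target > 0.01 * target) {
                seen = seen * y + target * (1.0 - y);
                ms++;
        }
        printf("~%d ms to get within 1%% of the halved weight\n", ms);
        return 0;
}

i.e. on the order of a couple hundred milliseconds of runnable time before
the balancer's view converges, versus immediately if the weight were only
applied at load_avg update time.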

>
> Similarly, our vruntime includes increments based on prior weight.
>
> Now; you're arguing that this is undesired, and this might well be.

Yes, this leads to cross- and intra-group fairness issues; taking the
same example as above:
- The thread running on 0 will receive additional time versus the thread on 1.
- However, the thread on 1 will be correctly weighted at 50%, so
between 0 and 1 we'll see more time in competition with an
equivalently weighted external group than we should.

>
>> c) It doesn't work with scale_load_down and fractional shares below
>> SCHED_LOAD_SCALE [we multiply in a zero -> zero rq load]
>
> Yes, I noticed this, and this is indeed undesired.
>
d) It makes doing stability/clipping above a nightmare.

Re: [RESEND PATCH 2/2] sched/fair: Optimize __update_sched_avg()

2017-03-30 Thread Paul Turner
On Thu, Mar 30, 2017 at 7:14 AM, Peter Zijlstra <pet...@infradead.org> wrote:
> On Thu, Mar 30, 2017 at 02:16:58PM +0200, Peter Zijlstra wrote:
>> On Thu, Mar 30, 2017 at 04:21:08AM -0700, Paul Turner wrote:
>
>> > > +
>> > > +   if (unlikely(periods >= LOAD_AVG_MAX_N))
>> > > return LOAD_AVG_MAX;
>
>> >
>> > Is this correct in the iterated periods > LOAD_AVG_MAX_N case?
>> > I don't think the decay above is guaranteed to return these to zero.
>>
>> Ah!
>>
>> Indeed, so decay_load() needs LOAD_AVG_PERIOD * 63 before it truncates
>> to 0, because every LOAD_AVG_PERIOD we half the value; loose 1 bit; so
>> 63 of those and we're 0.
>>
>> But __accumulate_sum() OTOH returns LOAD_AVG_MAX after only
>> LOAD_AVG_MAX_N, which < LOAD_AVG_PERIOD * 63.
>>
>> So yes, combined we exceed LOAD_AVG_MAX, which is bad. Let me think what
>> to do about that.
>
>
> So at the very least it should be decay_load(LOAD_AVG_MAX, 1) (aka
> LOAD_AVG_MAX - 1024), but that still doesn't account for the !0
> decay_load() of the first segment.
>
> I'm thinking that we can compute the middle segment, by taking the max
> value and chopping off the ends, like:
>
>
>  p
>  c2 = 1024 \Sum y^n
> n=1
>
>   infinf
> = 1024 ( \Sum y^n - \Sum y^n - y^0 )
>   n=0n=p
>
>

So this is endemic to what I think is a deeper problem:

The previous rounds of optimization folded weight into load_sum.  I
think this introduced a number of correctness problems:

a) the load_avg is no longer independent of weight; a high weight
entity can linger for eons. [63 LOAD_AVG_PERIODS]
b) it breaks the dynamic response of load_avg on a weight change.
While nice is not common, there's a case where this is really
important: cgroups with a low number of threads running.  E.g. when we
transition from 1->2 threads we immediately halve the weight, but
because of the folding it takes a very large time to be numerically
correct again.
c) It doesn't work with scale_load_down and fractional shares below
SCHED_LOAD_SCALE [we multiply in a zero -> zero rq load]
d) It makes doing stability/clipping above a nightmare.

I think it's actually *costing* us cycles, since we end up multiplying
in the weight every partial [load_sum] update, but we only actually
need to compute it when updating load_avg [which is per period
overflow].
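A minimal standalone sketch of what I mean here (made-up names, not the
kernel code): keep load_sum unweighted and fold the current weight in only
when load_avg is refreshed at period rollover:

#include <stdint.h>
#include <stdio.h>

#define LOAD_AVG_MAX 47742ULL

struct sketch_avg {
        uint64_t load_sum;      /* unweighted, decayed runnable time */
        uint64_t load_avg;      /* weighted, refreshed once per period rollover */
};

static void refresh_load_avg(struct sketch_avg *sa, uint64_t weight)
{
        /* weight is applied exactly once here, so clamping against
         * LOAD_AVG_MAX and reacting to a weight change are both easy */
        sa->load_avg = weight * sa->load_sum / LOAD_AVG_MAX;
}

int main(void)
{
        struct sketch_avg sa = { .load_sum = LOAD_AVG_MAX };

        refresh_load_avg(&sa, 1024);
        printf("weight 1024 -> load_avg %llu\n", (unsigned long long)sa.load_avg);

        refresh_load_avg(&sa, 512);     /* e.g. group goes from 1 to 2 threads */
        printf("weight  512 -> load_avg %llu\n", (unsigned long long)sa.load_avg);
        return 0;
}

A weight change (e.g. the 1->2 thread transition above) then takes effect at
the next load_avg refresh instead of waiting out the old folded weight.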

> Which gives something like the below.. Or am I completely off my rocker?
>
> ---
>  kernel/sched/fair.c | 70 
> ++---
>  1 file changed, 18 insertions(+), 52 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 76f67b3e34d6..4f17ec0a378a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2744,26 +2744,6 @@ static const u32 runnable_avg_yN_inv[] = {
>  };
>
>  /*
> - * Precomputed \Sum y^k { 1<=k<=n }.  These are floor(true_value) to prevent
> - * over-estimates when re-combining.
> - */
> -static const u32 runnable_avg_yN_sum[] = {
> -   0, 1002, 1982, 2941, 3880, 4798, 5697, 6576, 7437, 8279, 9103,
> -9909,10698,11470,12226,12966,13690,14398,15091,15769,16433,17082,
> -   17718,18340,18949,19545,20128,20698,21256,21802,22336,22859,23371,
> -};
> -
> -/*
> - * Precomputed \Sum y^k { 1<=k<=n, where n%32=0). Values are rolled down to
> - * lower integers. See Documentation/scheduler/sched-avg.txt how these
> - * were generated:
> - */
> -static const u32 __accumulated_sum_N32[] = {
> -   0, 23371, 35056, 40899, 43820, 45281,
> -   46011, 46376, 46559, 46650, 46696, 46719,
> -};
> -
> -/*
>   * Approximate:
>   *   val * y^n,where y^32 ~= 0.5 (~1 scheduling period)
>   */
> @@ -2795,40 +2775,25 @@ static u64 decay_load(u64 val, u64 n)
> return val;
>  }
>
> -static u32 __accumulate_sum(u64 periods, u32 period_contrib, u32 remainder)
> +static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3)
>  {
> -   u32 c1, c2, c3 = remainder; /* y^0 == 1 */
> -
> -   if (!periods)
> -   return remainder - period_contrib;
> -
> -   if (unlikely(periods >= LOAD_AVG_MAX_N))
> -   return LOAD_AVG_MAX;
> +   u32 c1, c2, c3 = d3; /* y^0 == 1 */
>
> /*
>  * c1 = d1 y^(p+1)
>  */
> -   c1 = decay_load((u64)(1024 - period_contrib), periods);
> +   c1 = decay_load((u64)d1, periods);
>
> -   periods -= 1;
> /*
> -* For updates fully spanning n periods, the contribution to runnable
> -* average will be:
> +

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-30 Thread Paul Turner
On Mon, Mar 20, 2017 at 11:08 AM, Patrick Bellasi
 wrote:
> On 20-Mar 13:15, Tejun Heo wrote:
>> Hello,
>>
>> On Tue, Feb 28, 2017 at 02:38:38PM +, Patrick Bellasi wrote:
>> > This patch extends the CPU controller by adding a couple of new
>> > attributes, capacity_min and capacity_max, which can be used to enforce
>> > bandwidth boosting and capping. More specifically:
>> >
>> > - capacity_min: defines the minimum capacity which should be granted
>> > (by schedutil) when a task in this group is running,
>> > i.e. the task will run at least at that capacity
>> >
>> > - capacity_max: defines the maximum capacity which can be granted
>> > (by schedutil) when a task in this group is running,
>> > i.e. the task can run up to that capacity
>>
>> cpu.capacity.min and cpu.capacity.max are the more conventional names.
>
> Ok, should be an easy renaming.
>
>> I'm not sure about the name capacity as it doesn't encode what it does
>> and is difficult to tell apart from cpu bandwidth limits.  I think
>> it'd be better to represent what it controls more explicitly.
>
> In the scheduler jargon, capacity represents the amount of computation
> that a CPU can provide and it's usually defined to be 1024 for the
> biggest CPU (on non SMP systems) running at the highest OPP (i.e.
> maximum frequency).
>
> It's true that it kind of overlaps with the concept of "bandwidth".
> However, the main difference here is that "bandwidth" is not frequency
> (and architecture) scaled.
> Thus, for example, assuming we have only one CPU with these two OPPs:
>
>OPP | Frequency | Capacity
>  1 |500MHz |  512
>  2 |  1GHz | 1024

I think exposing capacity in this manner is extremely challenging.
It's not normalized in any way between architectures, which places a
lot of the ABI in the API.

Have you considered any schemes for normalizing this in a reasonable fashion?
>
> a task running 60% of the time on that CPU when configured to run at
> 500MHz, from the bandwidth standpoint it's using 60% bandwidth but, from
> the capacity standpoint, is using only 30% of the available capacity.
>
> IOW, bandwidth is purely temporal based while capacity factors in both
> frequency and architectural differences.
> Thus, while a "bandwidth" constraint limits the amount of time a task
> can use a CPU, independently from the "actual computation" performed,
> with the new "capacity" constraints we can enforce much "actual
> computation" a task can perform in the "unit of time".
>
>> > These attributes:
>> > a) are tunable at all hierarchy levels, i.e. root group too
>>
>> This usually is problematic because there should be a non-cgroup way
>> of configuring the feature in case cgroup isn't configured or used,
>> and it becomes awkward to have two separate mechanisms configuring the
>> same thing.  Maybe the feature is cgroup specific enough that it makes
>> sense here but this needs more explanation / justification.
>
> In the previous proposal I used to expose global tunables under
> procfs, e.g.:
>
>  /proc/sys/kernel/sched_capacity_min
>  /proc/sys/kernel/sched_capacity_max
>
> which can be used to defined tunable root constraints when CGroups are
> not available, and becomes RO when CGroups are.
>
> Can this be eventually an acceptable option?
>
> In any case I think that this feature will be mainly targeting CGroup
> based systems. Indeed, one of the main goals is to collect
> "application specific" information from "informed run-times". Being
> "application specific" means that we need a way to classify
> applications depending on the runtime context... and that capability
> in Linux is ultimately provided via the CGroup interface.
>
>> > b) allow to create subgroups of tasks which are not violating the
>> >capacity constraints defined by the parent group.
>> >Thus, tasks on a subgroup can only be more boosted and/or more
>>
>> For both limits and protections, the parent caps the maximum the
>> children can get.  At least that's what memcg does for memory.low.
>> Doing that makes sense for memcg because for memory the parent can
>> still do protections regardless of what its children are doing and it
>> makes delegation safe by default.
>
> Just to be more clear, the current proposal enforces:
>
> - capacity_max_child <= capacity_max_parent
>
>   Since, if a task is constrained to get only up to a certain amount
>   of capacity, than its childs cannot use more than that... eventually
>   they can only be further constrained.
>
> - capacity_min_child >= capacity_min_parent
>
>   Since, if a task has been boosted to run at least as much fast, than
>   its childs cannot be constrained to go slower without eventually
>   impacting parent performance.
>
>> I understand why you would want a property like capacity to be the
>> other direction as that way you get more specific as you walk down the
>> tree for both limits and 

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-30 Thread Paul Turner
There is one important, fundamental difference here:
 {cfs,rt}_{period,runtime}_us is a property that applies to a group of
threads; it can be sub-divided.
We can consume 100ms of quota either by having one thread run for
100ms, or by having two threads run for 50ms each.

This is not true for capacity.  It's a tag that affects the individual
threads it's applied to.
I'm also not sure it's a hard constraint.  For example, suppose we
set a max that is smaller than the capacity of a "big" cpu on an
asymmetric system.  In the case that the faster CPU is relatively busy,
but still opportunistically available, we would still want to schedule
the task there.

This definitely seems to make more sense as a per-thread interface in
its current form.


Re: [RESEND PATCH 2/2] sched/fair: Optimize __update_sched_avg()

2017-03-30 Thread Paul Turner
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2767,7 +2767,7 @@ static const u32 __accumulated_sum_N32[]
>   * Approximate:
>   *   val * y^n,where y^32 ~= 0.5 (~1 scheduling period)
>   */
> -static __always_inline u64 decay_load(u64 val, u64 n)
> +static u64 decay_load(u64 val, u64 n)
>  {
> unsigned int local_n;
>
> @@ -2795,32 +2795,113 @@ static __always_inline u64 decay_load(u6
> return val;
>  }
>
> -/*
> - * For updates fully spanning n periods, the contribution to runnable
> - * average will be: \Sum 1024*y^n
> - *
> - * We can compute this reasonably efficiently by combining:
> - *   y^PERIOD = 1/2 with precomputed \Sum 1024*y^n {for  n  - */
> -static u32 __compute_runnable_contrib(u64 n)
> +static u32 __accumulate_sum(u64 periods, u32 period_contrib, u32 remainder)

- The naming here is really ambiguous:
"__accumulate_sum" -> "__accumulate_pelt_segments"?
- Passing in "remainder" seems irrelevant to the sum accumulation.  It would be
  more clear to handle it from the caller.

>
>  {
> -   u32 contrib = 0;
> +   u32 c1, c2, c3 = remainder; /* y^0 == 1 */
>
> -   if (likely(n <= LOAD_AVG_PERIOD))
> -   return runnable_avg_yN_sum[n];
> -   else if (unlikely(n >= LOAD_AVG_MAX_N))
> +   if (!periods)
> +   return remainder - period_contrib;

This is super confusing.  It only works because remainder already had
period_contrib aggregated _into_ it.  We're literally computing:
  remainder + period_contrib - period_contrib

We should just not call this in the !periods case and handle the remainder
below.

> +
> +   if (unlikely(periods >= LOAD_AVG_MAX_N))
> return LOAD_AVG_MAX;
>
> -   /* Since n < LOAD_AVG_MAX_N, n/LOAD_AVG_PERIOD < 11 */
> -   contrib = __accumulated_sum_N32[n/LOAD_AVG_PERIOD];
> -   n %= LOAD_AVG_PERIOD;
> -   contrib = decay_load(contrib, n);
> -   return contrib + runnable_avg_yN_sum[n];
> +   /*
> +* c1 = d1 y^(p+1)
> +*/
> +   c1 = decay_load((u64)(1024 - period_contrib), periods);
> +
> +   periods -= 1;
> +   /*
> +* For updates fully spanning n periods, the contribution to runnable
> +* average will be:
> +*
> +*   c2 = 1024 \Sum y^n
> +*
> +* We can compute this reasonably efficiently by combining:
> +*
> +*   y^PERIOD = 1/2 with precomputed 1024 \Sum y^n {for: n < PERIOD}
> +*/
> +   if (likely(periods <= LOAD_AVG_PERIOD)) {
> +   c2 = runnable_avg_yN_sum[periods];
> +   } else {
> +   c2 = __accumulated_sum_N32[periods/LOAD_AVG_PERIOD];

This still wants the comment justifying why we can't overflow.
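Something along these lines would do (a sketch; it mirrors the comment that
was dropped from __compute_runnable_contrib and assumes LOAD_AVG_MAX_N == 345,
LOAD_AVG_PERIOD == 32):

        /*
         * We already returned LOAD_AVG_MAX for periods >= LOAD_AVG_MAX_N,
         * so here periods < 345 and periods/LOAD_AVG_PERIOD < 11: the
         * lookup stays inside __accumulated_sum_N32[], and each summand is
         * well below LOAD_AVG_MAX, so c2 cannot overflow a u32.
         */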

> +   periods %= LOAD_AVG_PERIOD;
> +   c2 = decay_load(c2, periods);
> +   c2 += runnable_avg_yN_sum[periods];
> +   }
> +
> +   return c1 + c2 + c3;
>  }
>
>  #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
>
>  /*
> + * Accumulate the three separate parts of the sum; d1 the remainder
> + * of the last (incomplete) period, d2 the span of full periods and d3
> + * the remainder of the (incomplete) current period.
> + *
> + *           d1          d2           d3
> + *           ^           ^            ^
> + *           |           |            |
> + *         |<->|<----------------->|<--->|
> + * ... |---x---|------| ... |------|-----x (now)
> + *
> + *                               p
> + * u' = (u + d1) y^(p+1) + 1024 \Sum y^n + d3 y^0
> + *                              n=1
> + *
> + *    = u y^(p+1) +                                     (Step 1)
> + *
> + *                          p
> + *      d1 y^(p+1) + 1024 \Sum y^n + d3 y^0             (Step 2)
> + *                         n=1
> + */
> +static __always_inline u32
> +accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
> +  unsigned long weight, int running, struct cfs_rq *cfs_rq)
> +{
> +   unsigned long scale_freq, scale_cpu;
> +   u64 periods;
> +   u32 contrib;
> +
> +   scale_freq = arch_scale_freq_capacity(NULL, cpu);
> +   scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> +
> +   delta += sa->period_contrib;
> +   periods = delta / 1024; /* A period is 1024us (~1ms) */
> +
> +   if (periods) {
> +   sa->load_sum = decay_load(sa->load_sum, periods);
> +   if (cfs_rq) {
> +   cfs_rq->runnable_load_sum =
> +   decay_load(cfs_rq->runnable_load_sum, 
> periods);
> +   }
> +   sa->util_sum = decay_load((u64)(sa->util_sum), periods);

Move step 2 here, handle remainder below.

> +   }
> +
> +   /*
> +* Step 2
> +*/
> +   delta %= 1024;
> +   contrib = __accumulate_sum(periods, sa->period_contrib, delta);
> +   sa->period_contrib = delta;
> +
> +   contrib = cap_scale(contrib, scale_freq);
> +   if (weight) {
> +   

Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode

2017-02-09 Thread Paul Turner
On Thu, Feb 2, 2017 at 12:06 PM, Tejun Heo  wrote:
> Hello,
>
> This patchset implements cgroup v2 thread mode.  It is largely based
> on the discussions that we had at the plumbers last year.  Here's the
> rough outline.
>
> * Thread mode is explicitly enabled on a cgroup by writing "enable"
>   into "cgroup.threads" file.  The cgroup shouldn't have any child
>   cgroups or enabled controllers.
>
> * Once enabled, arbitrary sub-hierarchy can be created and threads can
>   be put anywhere in the subtree by writing TIDs into "cgroup.threads"
>   file.  Process granularity and no-internal-process constraint don't
>   apply in a threaded subtree.
>
> * To be used in a threaded subtree, controllers should explicitly
>   declare thread mode support and should be able to handle internal
>   competition in some way.
>
> * The root of a threaded subtree serves as the resource domain for the
>   whole subtree.  This is where all the controllers are guaranteed to
>   have a common ground and resource consumptions in the threaded
>   subtree which aren't tied to a specific thread are charged.
>   Non-threaded controllers never see beyond thread root and can assume
>   that all controllers will follow the same rules upto that point.
>
> This allows threaded controllers to implement thread granular resource
> control without getting in the way of system level resource
> partitioning.
>

I think that this is definitely a step in the right direction versus
previous proposals.
However, as proposed it feels like the API is conflating the
process/thread distinction with the core process hierarchy.  While
this does allow previous use-cases to be re-enabled, it seems to do so
at an unnecessary API cost.

As proposed, the cgroup.threads file means that threads are always
anchored in the tree by their process parent.  They may never move
past it.  I.e. if I have cgroups root/A/B, with B allowing sub-thread
moves and the parent belonging to A or B, it is clear that the child
cannot be moved beyond the parent.

Now this, in itself, is a natural restriction.  However, with this in
hand, it means that we are effectively co-mingling two hierarchies
onto the same tree: one that applies to processes, and per-process
sub-trees.

This introduces the following costs/restrictions:

1) We lose the ability to reasonably move a process.  This puts us
back to the existing challenge of the V1 API in which a thread is the
unit we can move atomically.  Hierarchies must be externally managed
and synchronized.

2) This retains all of the problems of the existing V1 API for a
process which wants to use these sub-groups to coordinate its threads.
It must coordinate its operations on these groups with the global
hierarchy (which is not consistently mounted) as well as potential
migration -- (1) above.

With the split as proposed, I fundamentally do not see the advantage
of exposing these as the same hierarchy.  By definition these .thread
files are essentially introducing independent, process level,
sub-hierarchies.

It seems greatly preferable to expose the sub-process level
hierarchies via a separate path, e.g.:
  /proc/{pid, self}/thread_cgroups/

Any controllers enabled on the hierarchy that the process belonged to,
which support thread level operations would appear within.  This fully
addresses (1) and (2) while allowing us to keep the unified
process-granular v2-cgroup mounts as is.

The only case that this does not support vs ".threads" would be some
hybrid where we co-mingle threads from different processes (with the
processes belonging to the same node in the hierarchy).  I'm not aware
of any usage that looks like this.

What are the motivations that you see for forcing this all onto one
mount-point via .threads sub-tree tags?


> This patchset contains the following five patches.  For more details
> on the interface and behavior, please refer to the last patch.
>
>  0001-cgroup-reorganize-cgroup.procs-task-write-path.patch
>  0002-cgroup-add-flags-to-css_task_iter_start-and-implemen.patch
>  0003-cgroup-introduce-cgroup-proc_cgrp-and-threaded-css_s.patch
>  0004-cgroup-implement-CSS_TASK_ITER_THREADED.patch
>  0005-cgroup-implement-cgroup-v2-thread-support.patch
>
> This patchset is on top of cgroup/for-4.11 63f1ca59453a ("Merge branch
> 'cgroup/for-4.11-rdmacg' into cgroup/for-4.11") and available in the
> following git branch.
>
>  git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git 
> review-cgroup2-threads
>
> diffstat follows.  Thanks.
>
>  Documentation/cgroup-v2.txt |   75 -
>  include/linux/cgroup-defs.h |   38 ++
>  include/linux/cgroup.h  |   12
>  kernel/cgroup/cgroup-internal.h |8
>  kernel/cgroup/cgroup-v1.c   |   64 +++-
>  kernel/cgroup/cgroup.c  |  589 
> 
>  kernel/cgroup/cpuset.c  |6
>  kernel/cgroup/freezer.c |6
>  kernel/cgroup/pids.c|1
>  kernel/events/core.c|1
>  

Re: [PATCH] sched/fair: fix calc_cfs_shares fixed point arithmetics

2016-12-19 Thread Paul Turner
On Mon, Dec 19, 2016 at 3:29 PM, Samuel Thibault
<samuel.thiba...@ens-lyon.org> wrote:
> Paul Turner, on Mon 19 Dec 2016 15:26:19 -0800, wrote:
>> >> > -   if (shares < MIN_SHARES)
>> >> > -   shares = MIN_SHARES;
>> > ...
>> >> > return shares;
>> >
>> > This will only make sure that the returned shares is 2, not 2048.
>>
>> This is intentional.  The MIN_SHARES you are seeing here is overloaded.
>> Every "1" unit of share is "SCHED_LOAD_RESOLUTION" bits internally.
>
> I'm not talking about the SCHED_LOAD_RESOLUTION scaling, but about the
> SCHED_FIXEDPOINT_SHIFT scaling, which is what
> 2159197d6677 ("sched/core: Enable increased load resolution on 64-bit 
> kernels")
> modified on 64bit platforms.

 From that commit:

"""
-#if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power
usage under light load  */
+#ifdef CONFIG_64BIT
 # define SCHED_LOAD_RESOLUTION 10
 # define scale_load(w) ((w) << SCHED_LOAD_RESOLUTION)
 # define scale_load_down(w)((w) >> SCHED_LOAD_RESOLUTION)
"""


Please take a deeper look at the scale_load() interactions.


Re: [PATCH] sched/fair: fix calc_cfs_shares fixed point arithmetics

2016-12-19 Thread Paul Turner
On Mon, Dec 19, 2016 at 3:07 PM, Samuel Thibault
<samuel.thiba...@ens-lyon.org> wrote:
> Paul Turner, on Mon 19 Dec 2016 14:44:38 -0800, wrote:
>> On Mon, Dec 19, 2016 at 2:40 PM, Samuel Thibault
>> <samuel.thiba...@ens-lyon.org> wrote:
>> > 2159197d6677 ("sched/core: Enable increased load resolution on 64-bit 
>> > kernels")
>> >
>> > exposed yet another miscalculation in calc_cfs_shares: MIN_SHARES is 
>> > unscaled,
>> > and must thus be scaled before being manipulated against "shares" amounts.
>>
>> It's actually intentional that MIN_SHARES is un-scaled here, this is
>> necessary to support the goal of sub-partitioning groups with small
>> shares.
>
> Uh? you mean it's normal that MIN_SHARES is here compared as such
> against "shares" while e.g. in sched_group_set_shares or effective_load
> it is scaled before comparing with "shares"?

Yes.

sched_group_set_shares() controls the amount allocated to the group.

Both calc_cfs_shares() and effective_load() are subdividing this
total.  Which is why it is scaled up from the external value of 2.

>
>> E.g.  A group with shares=2 and 5 threads will internally provide 2048
>> units of weight for the load-balancer to account for their
>> distribution.
>
> But here "shares" is already scaled, so
>
>> > -   if (shares < MIN_SHARES)
>> > -   shares = MIN_SHARES;
> ...
>> > return shares;
>
> This will only make sure that the returned shares is 2, not 2048.

This is intentional.  The MIN_SHARES you are seeing here is overloaded.
Every "1" unit of share is shifted up by SCHED_LOAD_RESOLUTION bits internally.
We express a minimum of "2" in terms of the unit weight due to larger
numerical errors in the "1" case.
In the unscaled case this needs to be MIN_SHARES, and in the scaled
case, the subdivision of the scaled values must still be >=2.

To make this concrete:
In this case we can then say that there are internally ~410 units of
weight for each of these 5 threads.
Thus, if one cpu has 4 threads and another 1, we see that as a
~1640/410 imbalance, not a 2048/2048 balance.
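The arithmetic, spelled out (standalone illustration only; assumes
SCHED_LOAD_RESOLUTION == 10):

#include <stdio.h>

int main(void)
{
        unsigned long shares = 2;               /* the group's cpu.shares */
        unsigned long internal = shares << 10;  /* 2048 units of weight */
        unsigned long per_thread = internal / 5;/* ~410 per runnable thread */

        printf("cpu with 4 threads: %lu, cpu with 1 thread: %lu\n",
               4 * per_thread, per_thread);
        return 0;
}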

>
> Samuel


Re: [PATCH] sched/fair: fix calc_cfs_shares fixed point arithmetics

2016-12-19 Thread Paul Turner
On Mon, Dec 19, 2016 at 2:40 PM, Samuel Thibault
 wrote:
> 2159197d6677 ("sched/core: Enable increased load resolution on 64-bit 
> kernels")
>
> exposed yet another miscalculation in calc_cfs_shares: MIN_SHARES is unscaled,
> and must thus be scaled before being manipulated against "shares" amounts.

It's actually intentional that MIN_SHARES is un-scaled here; this is
necessary to support the goal of sub-partitioning groups with small
shares.

E.g.  A group with shares=2 and 5 threads will internally provide 2048
units of weight for the load-balancer to account for their
distribution.

>
> Signed-off-by: Samuel Thibault 
> Cc: Peter Zijlstra 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: sta...@vger.kernel.org
> Fixes: 2159197d6677 ("sched/core: Enable increased load resolution on 64-bit 
> kernels")
>
> ---
> This should be backported to 4.7 and 4.8 to fix scheduling priorities
> miscalculations
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6559d19..be84f72 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2657,8 +2657,8 @@ static long calc_cfs_shares(struct cfs_rq *cfs_rq, 
> struct task_group *tg)
> if (tg_weight)
> shares /= tg_weight;
>
> -   if (shares < MIN_SHARES)
> -   shares = MIN_SHARES;
> +   if (shares < scale_load(MIN_SHARES))
> +   shares = scale_load(MIN_SHARES);
> if (shares > tg->shares)
> shares = tg->shares;
>


Re: [RFC PATCH v8 1/9] Restartable sequences system call

2016-11-26 Thread Paul Turner
On Sat, Aug 27, 2016 at 5:21 AM, Pavel Machek <pa...@ucw.cz> wrote:
>
> Hi!
>
>> Expose a new system call allowing each thread to register one userspace
>> memory area to be used as an ABI between kernel and user-space for two
>> purposes: user-space restartable sequences and quick access to read the
>> current CPU number value from user-space.
>>
>> * Restartable sequences (per-cpu atomics)
>>
>> Restartables sequences allow user-space to perform update operations on
>> per-cpu data without requiring heavy-weight atomic operations.
>>
>> The restartable critical sections (percpu atomics) work has been started
>> by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
>> critical sections. [1] [2] The re-implementation proposed here brings a
>> few simplifications to the ABI which facilitates porting to other
>> architectures and speeds up the user-space fast path. A locking-based
>> fall-back, purely implemented in user-space, is proposed here to deal
>> with debugger single-stepping. This fallback interacts with rseq_start()
>> and rseq_finish(), which force retries in response to concurrent
>> lock-based activity.
>
> Hmm. Purely software fallback needed for singlestepping... Looks like this is 
> malware
> writer's dream come true...
>
> Also if you ever get bug in the restartable code, debugger will be useless to 
> debug it...
> unless new abilities are added to debuggers to manually schedule threads on 
> CPUs.
>
> Is this good idea?

We've had some off-list discussion.

I have a revised version, which incorporates some of Mathieu's
improvements but avoids this requirement, nearly ready for posting.

- Paul


Re: [PATCH] sched/fair: Fix fixed point arithmetic width for shares and effective load

2016-08-23 Thread Paul Turner
On Mon, Aug 22, 2016 at 7:00 AM, Dietmar Eggemann
 wrote:
>
> Since commit 2159197d6677 ("sched/core: Enable increased load resolution
> on 64-bit kernels") we now have two different fixed point units for
> load.
> shares in calc_cfs_shares() has 20 bit fixed point unit on 64-bit
> kernels. Therefore use scale_load() on MIN_SHARES.
> wl in effective_load() has 10 bit fixed point unit. Therefore use
> scale_load_down() on tg->shares which has 20 bit fixed point unit on
> 64-bit kernels.
>
> Signed-off-by: Dietmar Eggemann 
> ---
>
> Besides the issue with load_above_capacity when it comes to different
> fixed point units for load "[PATCH] sched/fair: Fix load_above_capacity
> fixed point arithmetic width" there are similar issues for shares in
> calc_cfs_shares() as well as wl in effective_load().
>
>  kernel/sched/fair.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 61d485421bed..18f80c4c7737 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2530,8 +2530,8 @@ static long calc_cfs_shares(struct cfs_rq *cfs_rq, 
> struct task_group *tg)
> if (tg_weight)
> shares /= tg_weight;
>
> -   if (shares < MIN_SHARES)
> -   shares = MIN_SHARES;
> +   if (shares < scale_load(MIN_SHARES))
> +   shares = scale_load(MIN_SHARES);
> if (shares > tg->shares)
> shares = tg->shares;


MIN_SHARES is never scaled; it is an independent floor on the value
that can be assigned as a weight, so we never need to scale it down
(this would actually allow the weight to drop to zero which would be
bad).
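Concretely, scaling the floor itself down would truncate it away entirely
(illustration only; assumes the 64-bit SCHED_LOAD_RESOLUTION of 10):

#include <stdio.h>

#define SCHED_LOAD_RESOLUTION   10
#define MIN_SHARES              2UL

int main(void)
{
        /* a scaled-down floor of 0 would permit a zero weight */
        printf("MIN_SHARES >> SCHED_LOAD_RESOLUTION = %lu\n",
               MIN_SHARES >> SCHED_LOAD_RESOLUTION);
        return 0;
}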

>
>
> @@ -5023,9 +5023,9 @@ static long effective_load(struct task_group *tg, int 
> cpu, long wl, long wg)
>  * wl = S * s'_i; see (2)
>  */
> if (W > 0 && w < W)
> -   wl = (w * (long)tg->shares) / W;
> +   wl = (w * (long)scale_load_down(tg->shares)) / W;
> else
> -   wl = tg->shares;
> +   wl = scale_load_down(tg->shares);


This part looks good, effective_load() works with
source_load/target_load values, which originate in unscaled values.

Independent of this patch:
  When we initially introduced load scaling, it was ~uniform on every
value.  Most of the current pain has come from, and will continue to
come from, that with v2 of the load-tracking this is no longer the
case.  We have a massive number of scaled and unscaled inputs floating
around, many of them derived values (e.g. source_load above) which
require chasing.

I propose we simplify this.  Load scaling is only here so that the
load-balancer can make sane decisions.  We only need
cfs_rq->load.weight values to be scaled.

This means:
  (1) scaling in calculating group se->se.weight (calc_cfs_shares)
  (2) probable scaling in h_load calculations
  (3) if you have a calculation involving a cfs_rq->load.weight value,
you may need to think about scaling.
   [There's a bunch of obvious places this covers, most of them
are the load-balancer.  There's some indirects, e.g. the removed need
to scale in calculating vruntime, but these are easy to audit just by
searching for existing calls to scale.]

Probably missed one, but the fact that this list can be written means
it's 1000 pages shorter than today's requirements.

The fact that (3) becomes the only rule to remember for the most part
makes reasoning about all of this stuff possible again because right
now it sucks.


Re: [tip:locking/core] sched/wait: Fix signal handling in bit wait helpers

2015-12-11 Thread Paul Turner
On Thu, Dec 10, 2015 at 5:09 AM, Peter Zijlstra <pet...@infradead.org> wrote:
> On Thu, Dec 10, 2015 at 08:30:01AM +1100, NeilBrown wrote:
>> On Wed, Dec 09 2015, Peter Zijlstra wrote:
>>
>> > On Wed, Dec 09, 2015 at 12:06:33PM +1100, NeilBrown wrote:
>> >> On Tue, Dec 08 2015, Peter Zijlstra wrote:
>> >>
>> >> >>
>> >> >
>> >> > *sigh*, so that patch was broken.. the below might fix it, but please
>> >> > someone look at it, I seem to have a less than stellar track record
>> >> > here...
>> >>
>> >> This new change seems to be more intrusive than should be needed.
>> >> Can't we just do:
>> >>
>> >>
>> >>  __sched int bit_wait(struct wait_bit_key *word)
>> >>  {
>> >> +  long state = current->state;
>> >
>> > No, current->state can already be changed by this time.
>>
>> Does that matter?
>> It can only have changed to TASK_RUNNING - right?
>> In that case signal_pending_state() will return 0 and the bit_wait() acts
>> as though the thread was woken up normally (which it was) rather than by
>> a signal (which maybe it was too, but maybe that happened just a tiny
>> bit later).
>>
>> As long as signal delivery doesn't change ->state, we should be safe.
>> We should even be safe testing ->state *after* the call the schedule().
>
> Blergh, all I've managed to far is to confuse myself further. Even
> something like the original (+- the EINTR) should work when we consider
> the looping, even when mixed with an occasional spurious wakeup.
>
>
> int bit_wait()
> {
> if (signal_pending_state(current->state, current))
> return -EINTR;
> schedule();
> }
>
>
> This can go wrong against raising a signal thusly:
>
> prepare_to_wait()
> 1:  if (signal_pending_state(current->state, current))
> // false, nothing pending
> schedule();
> set_tsk_thread_flag(t, TIF_SIGPENDING);
>
> 
>
> prepare_to_wait()
> wake_up_state(t, ...);
> 2:  if (signal_pending_state(current->state, current))
> // false, TASK_RUNNING
>
> schedule(); // doesn't block because pending

Note that a quick inspection does not turn up _any_ TASK_INTERRUPTIBLE
callers.  When this previously occurred, it could likely only be with
a fatal signal, which would have hidden these sins.

>
> prepare_to_wait()
> 3:  if (signal_pending_state(current->state, current))
> // true, pending
>

Hugh asked me about this after seeing a crash, here's another exciting
way in which the current code breaks -- this one actually quite
serious:

Consider __lock_page:

void __lock_page(struct page *page)
{
        DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
        __wait_on_bit_lock(page_waitqueue(page), &wait, bit_wait_io,
                                TASK_UNINTERRUPTIBLE);
}

With the current state of the world,

 __sched int bit_wait_io(struct wait_bit_key *word)
 {
-   if (signal_pending_state(current->state, current))
-   return 1;
io_schedule();
+   if (signal_pending(current))
+   return -EINTR;
return 0;
 }

Called from __wait_on_bit_lock.

Previously, signal_pending_state() was checked under
TASK_UNINTERRUPTIBLE (via prepare_to_wait_exclusive).  Now, we simply
check for the presence of any signal -- after we have returned to
running state, e.g. post io_schedule() when somebody has kicked the
wait-queue.

However, this now means that __wait_on_bit_lock() can return -EINTR up
to __lock_page(), which does not validate the return code and blindly
returns.  This looks to have been a previously existing bug, but it
was at least masked by the fact that it required a fatal signal
previously (and that the page we return unlocked is likely going to be
freed from the dying process anyway).

Peter's proposed follow-up above looks strictly more correct.  We need
to evaluate the potential existence of a signal, *after* we return
from schedule, but in the context of the state which we previously
_entered_ schedule() on.
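i.e. something of this shape (a sketch of the direction being argued for,
not Peter's actual patch; the point is that the state we _entered_
schedule() with is what gets tested afterwards):

 __sched int bit_wait_io(struct wait_bit_key *word, int mode)
 {
        io_schedule();
        if (signal_pending_state(mode, current))
                return -EINTR;
        return 0;
 }

With that, __lock_page()'s TASK_UNINTERRUPTIBLE wait can never see a
spurious -EINTR from an unrelated pending signal.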

Reviewed-by: Paul Turner <p...@google.com>




Re: [PATCH 2/4] sched: Document Program-Order guarantees

2015-11-02 Thread Paul Turner
On Mon, Nov 2, 2015 at 5:29 AM, Peter Zijlstra  wrote:
> These are some notes on the scheduler locking and how it provides
> program order guarantees on SMP systems.
>
> Cc: Linus Torvalds 
> Cc: Will Deacon 
> Cc: Oleg Nesterov 
> Cc: Boqun Feng 
> Cc: "Paul E. McKenney" 
> Cc: Jonathan Corbet 
> Cc: Michal Hocko 
> Cc: David Howells 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  kernel/sched/core.c |  142 
> 
>  1 file changed, 142 insertions(+)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1905,6 +1905,148 @@ static void ttwu_queue(struct task_struc
> raw_spin_unlock(>lock);
>  }
>
> +/*
> + * Notes on Program-Order guarantees on SMP systems.
> + *
> + *
> + *   PREEMPTION/MIGRATION
> + *
> + * Regular preemption/migration is safe because as long as the task is 
> runnable
> + * migrations involve both rq locks, albeit not (necessarily) at the same 
> time.
> + *
> + * So we get (we allow 3 CPU migrations):
> + *
> + *   CPU0CPU1CPU2
> + *
> + *   LOCK rq(0)->lock
> + *   sched-out X
> + *   sched-in Y
> + *   UNLOCK rq(0)->lock
> + *
> + *   LOCK rq(0)->lock // MB against CPU0
> + *   dequeue X
> + *   UNLOCK rq(0)->lock
> + *
> + *   LOCK rq(1)->lock
> + *   enqueue X
> + *   UNLOCK rq(1)->lock
> + *
> + *   LOCK rq(1)->lock // MB against CPU2
> + *   sched-out Z
> + *   sched-in X
> + *   UNLOCK rq(1)->lock
> + *
> + * and the first LOCK rq(0) on CPU2 gives a full order against the UNLOCK 
> rq(0)
> + * on CPU0. Similarly the LOCK rq(1) on CPU1 provides full order against the
> + * UNLOCK rq(1) on CPU2, therefore by the time task X runs on CPU1 it must
> + * observe the state it left behind on CPU0.
> + *

I suspect this part might be more explicitly expressed by specifying
the requirements that migration satisfies; then providing an example.
This makes it easier for others to reason about the locks and saves
worrying about whether the examples hit our 3 million sub-cases.

I'd also propose just dropping preemption from this part, we only need
memory order to be correct on migration, whether it's scheduled or not
[it also invites confusion with the wake-up case].

Something like:
When any task 't' migrates, all activity on its prior cpu [c1] is
guaranteed to happen-before any subsequent execution on its new
cpu [c2].  There are 3 components to enforcing this.

[c1]      1) Sched-out of t requires rq(c1)->lock
[any cpu] 2) Any migration of t, by any cpu, is required to synchronize
             on *both* rq(c1)->lock and rq(c2)->lock
[c2]      3) Sched-in of t requires rq(c2)->lock

Transitivity guarantees that (2) orders after (1) and (3) after (2).
Note that in some cases (e.g. active, or idle cpu) the balancing cpu
in (2) may be c1 or c2.

[Follow example]


> + *
> + *   BLOCKING -- aka. SLEEP + WAKEUP
> + *
> + * For blocking things are a little more interesting, because when we dequeue
> + * the task, we don't need to acquire the old rq lock in order to migrate it.
> + *
> + * Say CPU0 does a wait_event() and CPU1 does the wake() and migrates the 
> task
> + * to CPU2 (the most complex example):
> + *
> + *   CPU0 (schedule)  CPU1 (try_to_wake_up) CPU2 (sched_ttwu_pending)
> + *
> + *   X->state = UNINTERRUPTIBLE
> + *   MB
> + *   if (cond)
> + * break
> + *cond = true
> + *
> + *   LOCK rq(0)->lock LOCK X->pi_lock
> + *
> + *   dequeue X
> + *while (X->on_cpu)
> + *  cpu_relax()
> + *   sched-out X
> + *   RELEASE
> + *   X->on_cpu = 0
> + *RMB
> + *X->state = WAKING
> + *set_task_cpu(X,2)
> + *  WMB
> + *  ti(X)->cpu = 2
> + *
> + *llist_add(X, rq(2)) // MB
> + *  llist_del_all() // MB
> + *
> + *  LOCK rq(2)->lock
> + *  enqueue X
> + *  X->state = RUNNING
> + *  UNLOCK rq(2)->lock
> + *
> + *  LOCK rq(2)->lock
> + *  sched-out Z
> + *  sched-in X
> + *  UNLOCK rq(1)->lock
> + *
> + *  if (cond) // _TRUE_
> + *UNLOCK X->pi_lock
> + 

Re: [PATCH] sched: Update task->on_rq when tasks are moving between runqueues

2015-11-02 Thread Paul Turner
On Wed, Oct 28, 2015 at 6:58 PM, Peter Zijlstra  wrote:
> On Wed, Oct 28, 2015 at 05:57:10PM -0700, Olav Haugan wrote:
>> On 15-10-25 11:09:24, Peter Zijlstra wrote:
>> > On Sat, Oct 24, 2015 at 11:01:02AM -0700, Olav Haugan wrote:
>> > > Task->on_rq has three states:
>> > >   0 - Task is not on runqueue (rq)
>> > >   1 (TASK_ON_RQ_QUEUED) - Task is on rq
>> > >   2 (TASK_ON_RQ_MIGRATING) - Task is on rq but in the process of being
>> > >   migrated to another rq
>> > >
>> > > When a task is moving between rqs task->on_rq state should be
>> > > TASK_ON_RQ_MIGRATING
>> >
>> > Only when not holding both rq locks..
>>
>> IMHO I think we should keep the state of p->on_rq updated with the correct 
>> state
>> all the time unless I am incorrect in what p->on_rq represent. The task
>> is moving between rq's and is on the rq so the state should be
>> TASK_ON_RQ_MIGRATING right? I do realize that the code is currently not
>> broken. However, in the future someone might come along and change
>> set_task_cpu() and the code change might rely on an accurate p->on_rq value. 
>> It
>> would be good design to keep this value correct.
>
> At the same time; we should also provide lean and fast code. Is it
> better to add assertions about required state than to add superfluous
> code for just in case scenarios.

The state is only worth publishing if it's exceptional.  I think
Peter's new documentation helps to make this clearer.

The intent of this change may be better captured by pointing out in a
comment somewhere that detach_task() is *also* updating the task_cpu
pointer, which then lets us lean on holding that lock to make the state
non-interesting.


Re: [PATCH 2/4] sched: Document Program-Order guarantees

2015-11-02 Thread Paul Turner
On Mon, Nov 2, 2015 at 12:34 PM, Peter Zijlstra <pet...@infradead.org> wrote:
> On Mon, Nov 02, 2015 at 12:27:05PM -0800, Paul Turner wrote:
>> I suspect this part might be more explicitly expressed by specifying
>> the requirements that migration satisfies; then providing an example.
>> This makes it easier for others to reason about the locks and saves
>> worrying about whether the examples hit our 3 million sub-cases.
>>
>> I'd also propose just dropping preemption from this part, we only need
>> memory order to be correct on migration, whether it's scheduled or not
>> [it also invites confusion with the wake-up case].
>>
>> Something like:
>> When any task 't' migrates, all activity on its prior cpu [c1] is
>> guaranteed to be happens-before any subsequent execution on its new
>> cpu [c2].  There are 3 components to enforcing this.
>>
>> [c1]  1) Sched-out of t requires rq(c1)->lock
>> [any cpu] 2) Any migration of t, by any cpu is required to synchronize
>> on *both* rq(c1)->lock and rq(c2)->lock
>> [c2]  3) Sched-in of t requires cq(c2)->lock
>>
>> Transitivity guarantees that (2) orders after (1) and (3) after (2).
>> Note that in some cases (e.g. active, or idle cpu) the balancing cpu
>> in (2) may be c1 or c2.
>>
>> [Follow example]
>
> Make sense, I'll try and reword things like that.
>
> Note that in don't actually need the strong transitivity here (RCsc),
> weak transitivity (RCpc) is in fact sufficient.

Yeah, I thought about just using acquire/release to talk about the
interplay, in particular the release in (1) and the acquire in (3),
which would make some of this much more explicit and highlight that we
only need RCpc.  We have not been very consistent at using this
terminology, although this could be a good starting point.

If we went this route, we could do something like:

+ * So in this case the scheduler does not provide an obvious full barrier; but
+ * the smp_store_release() in finish_lock_switch(), paired with the control-dep
+ * and smp_rmb() in try_to_wake_up() form a release-acquire pair and fully
+ * order things between CPU0 and CPU1.

Instead of having this (which is complete, but hard to synchronize
with the points at which it actually matters), just use acquire and
release above; then, at the actual site, e.g. in try_to_wake_up(),
document how we deliver the acquire required by the higher-level
documentation/requirements.

This makes it easier to maintain the stupidly racy documentation
consistency property in the future.
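E.g. at the wake-up site, something like (a sketch; the wording is mine,
but the pairing is exactly the one described in the existing comment quoted
above):

        /*
         * Pairs with the smp_store_release() in finish_lock_switch(): the
         * control dependency plus the smp_rmb() below form the ACQUIRE half
         * of the release-acquire pair required by the migration/blocking
         * ordering documentation in core.c (RCpc is sufficient).
         */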


[RFC PATCH v2 2/3] restartable sequences: x86 ABI

2015-10-27 Thread Paul Turner
From: Paul Turner <p...@google.com>

Recall the general ABI is:
   The kernel ABI generally consists of:
 a) A shared TLS word which exports the current cpu and event-count
 b) A shared TLS word which, when non-zero, stores the first post-commit
instruction if a sequence is active.  (The kernel observing rips greater
than this takes no action and concludes user-space has completed its
critical section, but not yet cleared this state).
 c) An architecture specific way to publish the state read from (a) which
user-space believes to be current.

This patch defines (c) for x86, both x86_64 and i386.

The exact sequence is:
  *0. Userspace stores current event+cpu counter values
   1. Userspace loads the rip to move to at failure into cx
   2. Userspace loads the rip of the instruction following
  the critical section into a registered TLS address.
   3. Userspace loads the values read at [0] into a known
  location.
   4. Userspace tests to see whether the current event and
  cpu counter values match those stored at 0.  Manually
  jumping to the address from [1] in the case of a
  mismatch.

  Note that if we are preempted or otherwise interrupted
  then the kernel can also now perform this comparison
  and conditionally jump us to [1].
   5. Our final instruction before [2] is then our commit.
  The critical section is self-terminating.  [2] must
  also be cleared at this point.

 For x86_64:
   [3] uses rdx to represent cpu and event counter as a
   single 64-bit value.

 For i386:
   [3] uses ax for cpu and dx for the event_counter.

  Both:
   Instruction after commit: rseq_state->post_commit_instr
   Current event and cpu state: rseq_state->event_and_cpu

An example user-space x86_64 implementation:
__asm__ __volatile__ goto (
"movq $%l[failed], %%rcx\n"
"movq $1f, %[commit_instr]\n"
"cmpq %[start_value], %[current_value]\n"
"jnz %l[failed]\n"
"movq %[to_write], (%[target])\n"
"1: movq $0, %[commit_instr]\n"
  : /* no outputs */
  : [start_value]"d"(start_value.storage),
[current_value]"m"(__rseq_state),
[to_write]"r"(to_write),
[target]"r"(p),
    [commit_instr]"m"(__rseq_state.post_commit_instr)
  : "rcx", "memory"
  : failed

Signed-off-by: Paul Turner <p...@google.com>
---
 arch/x86/entry/common.c  |3 +
 arch/x86/include/asm/restartable_sequences.h |   18 +++
 arch/x86/kernel/Makefile |2 
 arch/x86/kernel/restartable_sequences.c  |  136 ++
 arch/x86/kernel/signal.c |7 +
 kernel/restartable_sequences.c   |   10 +-
 6 files changed, 173 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/include/asm/restartable_sequences.h
 create mode 100644 arch/x86/kernel/restartable_sequences.c

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index a89fdbc..e382487 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -23,6 +23,7 @@
 #include 
 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -249,6 +250,8 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 
cached_flags)
if (cached_flags & _TIF_NOTIFY_RESUME) {
clear_thread_flag(TIF_NOTIFY_RESUME);
tracehook_notify_resume(regs);
+   if (rseq_active(current))
+   arch_rseq_update_event_counter(regs);
}
 
if (cached_flags & _TIF_USER_RETURN_NOTIFY)
diff --git a/arch/x86/include/asm/restartable_sequences.h 
b/arch/x86/include/asm/restartable_sequences.h
new file mode 100644
index 000..75864a7
--- /dev/null
+++ b/arch/x86/include/asm/restartable_sequences.h
@@ -0,0 +1,18 @@
+#ifndef _ASM_X86_RESTARTABLE_SEQUENCES_H
+#define _ASM_X86_RESTARTABLE_SEQUENCES_H
+
+#include 
+#include 
+#include 
+
+#ifdef CONFIG_RESTARTABLE_SEQUENCES
+
+void arch_rseq_update_event_counter(struct pt_regs *regs);
+
+#else /* !CONFIG_RESTARTABLE_SEQUENCES */
+
+static inline void arch_rseq_update_event_counter(struct pt_regs *regs) {}
+
+#endif
+
+#endif /* _ASM_X86_RESTARTABLE_SEQUENCES_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index b1b78ff..ee98fb6 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -110,6 +110,8 @@ obj-$(CONFIG_EFI)   += sysfb_efi.o
 obj-$(CONFIG_PERF_EVENTS)  += perf_regs.o
 obj-$(CONFIG_TRACING)  += tracepoint.o
 
+obj-$(CONFIG_RESTARTABLE_SEQUENCES)+= restartable_sequences.o
+
 ###
 # 64 bit specific files
 ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x

[RFC PATCH v2 3/3] restartable sequences: basic self-tests

2015-10-27 Thread Paul Turner
From: pjt <p...@pjt-glaptop.roam.corp.google.com>

Implements two basic tests of RSEQ functionality.

The first, "basic_test" only asserts that RSEQ works moderately correctly.
E.g. that:
- The CPUID pointer works
- Code infinitely looping within a critical section will eventually be
interrupted.
- Critical sections are interrupted by signals.

"basic_percpu_ops_test" is a slightly more "realistic" variant, implementing a
few simple per-cpu operations and testing their correctness.  It also includes
a trivial example of how user-space may multiplex the critical section via the
restart handler.

Signed-off-by: Paul Turner <p...@google.com>
---
 tools/testing/selftests/rseq/Makefile  |   13 +
 .../testing/selftests/rseq/basic_percpu_ops_test.c |  268 
 tools/testing/selftests/rseq/basic_test.c  |   87 ++
 tools/testing/selftests/rseq/rseq.c|   36 +++
 tools/testing/selftests/rseq/rseq.h|  109 
 5 files changed, 513 insertions(+)
 create mode 100644 tools/testing/selftests/rseq/Makefile
 create mode 100644 tools/testing/selftests/rseq/basic_percpu_ops_test.c
 create mode 100644 tools/testing/selftests/rseq/basic_test.c
 create mode 100644 tools/testing/selftests/rseq/rseq.c
 create mode 100644 tools/testing/selftests/rseq/rseq.h

diff --git a/tools/testing/selftests/rseq/Makefile 
b/tools/testing/selftests/rseq/Makefile
new file mode 100644
index 000..40b9338
--- /dev/null
+++ b/tools/testing/selftests/rseq/Makefile
@@ -0,0 +1,13 @@
+CFLAGS += -Wall
+LDFLAGS += -lpthread
+
+TESTS = basic_test basic_percpu_ops_test
+
+all: $(TESTS)
+%: %.c
+   $(CC) $(CFLAGS) -o $@ $^ rseq.c $(LDFLAGS)
+
+include ../lib.mk
+
+clean:
+   $(RM) $(TESTS)
diff --git a/tools/testing/selftests/rseq/basic_percpu_ops_test.c 
b/tools/testing/selftests/rseq/basic_percpu_ops_test.c
new file mode 100644
index 000..dcc57ad
--- /dev/null
+++ b/tools/testing/selftests/rseq/basic_percpu_ops_test.c
@@ -0,0 +1,268 @@
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "rseq.h"
+
+struct percpu_lock {
+   intptr_t word[CPU_SETSIZE][64 / sizeof(intptr_t)];  /* Cache aligned */
+};
+
+/* A simple percpu spinlock.  Returns the cpu lock was acquired on. */
+int rseq_percpu_lock(struct percpu_lock *lock)
+{
+   struct rseq_state start;
+   int cpu;
+
+   do {
+   start = rseq_start();
+   cpu = rseq_cpu_at_start(start);
+   } while (lock->word[cpu][0] ||
+         !rseq_finish(&lock->word[cpu][0], 1, start));
+   return cpu;
+}
+
+void rseq_percpu_unlock(struct percpu_lock *lock, int cpu)
+{
+   barrier();  /* need a release-store here, this suffices on x86. */
+   assert(lock->word[cpu][0] == 1);
+   lock->word[cpu][0] = 0;
+}
+
+/*
+ * cmpxchg [with an additional check value].
+ *
+ * Returns:
+ *  -1 if *p != old [ || check_ptr != check_val, ] otherwise
+ *  cpu that rseq_percpu_cmpxchgcheck was executed.
+ *   - If this is different from the passed cpu, no modifications were made.
+ *
+ * Note: When specified, check_ptr is dereferenced iff *p == old
+ */
+int rseq_percpu_cmpxchg(int cpu, intptr_t *p, intptr_t old, intptr_t new)
+{
+   struct rseq_state start;
+
+   while (1) {
+   start = rseq_start();
+   if (rseq_current_cpu() != cpu) return rseq_current_cpu();
+   if (*p != old)
+   return -1;
+   if (rseq_finish(p, new, start)) return cpu;
+   }
+}
+
+int rseq_percpu_cmpxchgcheck(int cpu, intptr_t *p, intptr_t old, intptr_t new,
+intptr_t *check_ptr, intptr_t check_val)
+{
+   struct rseq_state start;
+
+   while (1) {
+   start = rseq_start();
+   if (rseq_current_cpu() != cpu) return rseq_current_cpu();
+   /*
+* Note that we'd want the ultimate implementation of this to
+* be open coded (similar to rseq_finish) so that we can
+* guarantee *check is not dereferenced when old does not
+* match.  This could also be facilitated with a generic
+* rseq_read_if_valid(...) helper.
+*/
+   if (*p != old || *check_ptr != check_val)
+   return -1;
+   if (rseq_finish(p, new, start)) return cpu;
+   }
+}
+
+void rseq_unknown_restart_addr(void *addr)
+{
+   fprintf(stderr, "rseq: unrecognized restart address %p\n", addr);
+   exit(1);
+}
+
+struct spinlock_test_data {
+   struct percpu_lock lock;
+   int counts[CPU_SETSIZE];
+   int reps;
+};
+
+void *test_percpu_spinlock_thread(void *arg)
+{
+   struct spinlock_test_data *data = arg;
+
+   int i, cpu;
+   rseq_init_current_thread();
+   for (i = 0; i < data->reps; i++) {

[RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections

2015-10-27 Thread Paul Turner
This is an update to the previously posted series at:
  https://lkml.org/lkml/2015/6/24/665

Dave Watson has posted a similar follow-up which allows additional critical
regions to be registered as well as single-step support at:
  https://lkml.org/lkml/2015/10/22/588

This series is a new approach which introduces an alternate ABI that does not
depend on open-coded assembly nor a central 'repository' of rseq sequences.
Sequences may now be inlined and the preparatory[*] work for the sequence can
be written in a higher level language.

This new ABI has also been written to support debugger interaction in a way
that the previous ABI could not.

[*] A sequence essentially has 3 steps:
  1) Determine which cpu the sequence is being run on
  2) Preparatory work specific to the state read in 1)
  3) A single commit instruction which finalizes any state updates.

We require a single instruction for (3) so that if it is interrupted in any
way, we can proceed from (1) once more [or potentially bail].

This new ABI can be described as:
 Progress is ordered as follows:
  *0. Userspace stores current event+cpu counter values
   1. Userspace loads the rip to move to at failure into cx
   2. Userspace loads the rip of the instruction following
  the critical section into a registered TLS address.
   3. Userspace loads the values read at [0] into a known
  location.
   4. Userspace tests to see whether the current event and
  cpu counter values match those stored at 0.  Manually
  jumping to the address from [1] in the case of a
  mismatch.
 
  Note that if we are preempted or otherwise interrupted
  then the kernel can also now perform this comparison
  and conditionally jump us to [1].
   5. Our final instruction before [2] is then our commit.
  The critical section is self-terminating.  [2] must
  also be cleared at this point.
 
 For x86_64:
   [3] uses rdx to represent cpu and event counter as a
   single 64-bit value.
 
 For i386:
   [3] uses ax for cpu and dx for the event_counter.
 
  Both:
   Instruction after commit: rseq_state->post_commit_instr
   Current event and cpu state: rseq_state->event_and_cpu

Exactly, for x86_64 this looks like:
  movq <failed>, rcx [1]
  movq $1f, <commit_instr> [2]
  cmpq <start value>, <current value> [3] (start is in rcx)
  jnz <failed> (4)
  movq <to_write>, (<target>) (5)
  1: movq $0, <commit_instr>

There has been some related discussion, which I am supportive of, in which
we use fs/gs instead of TLS.  This maps naturally to the above and removes
the current requirement for per-thread initialization (this is a good thing!).

On debugger interactions:

There are some nice properties about this new style of API which allow it to
actually support safe interactions with a debugger:
 a) The event counter is a per-cpu value.  This means that we can not advance
it if no threads from the same process execute on that cpu.  This
naturally allows basic single step support with thread-isolation.
 b) Single-step can be augmented to evaluate the ABI without incrementing the
event count.
 c) A debugger can also be augmented to evaluate this ABI and push restarts
on the kernel's behalf.

This is also compatible with David's approach of not single stepping between
2-4 above.  However, I think these are ultimately a little stronger since true
single-stepping and breakpoint support would be available.  Which would be
nice to allow actual debugging of sequences.

(Note that I haven't bothered implementing these in the current patchset as we
are still winnowing down on the ABI and they just add complexity.  It's
important to note that they are possible however.)

Thanks,

- Paul




[RFC PATCH v2 1/3] restartable sequences: user-space per-cpu critical sections

2015-10-27 Thread Paul Turner
From: Paul Turner <p...@google.com>

Introduce the notion of a restartable sequence.  This is a piece of user code
that can be described in 3 components:

 1) Establish where [e.g. which cpu] the thread is running
 2) Preparatory work that is dependent on the state in [1].
 3) A committing instruction that proceeds only if [2] and [1] have proceeded
in a coherent state (i.e. no thread from the same address space has run
on the same cpu).

Preemption, or other interruption, beyond [1] guarantees that [3] will not
execute; instead, control is transferred to a user-defined restart
handler.  This handler may arrange for the original operation to be retried,
including potentially resynchronizing with dependent state that may
have been updated in the interim.

This may be used in combination with an in-memory cpu-id to allow user programs
to implement cpu-local data-structures and primitives, without the use/overhead
of any atomics.

The kernel ABI generally consists of:
 a) A shared TLS word which exports the current cpu and associated event-count
 b) A shared TLS word which, when non-zero, stores the first post-commit
instruction if a sequence is active.  (The kernel observing rips greater
than this takes no action and concludes user-space has completed its
critical section, but not yet cleared this state).
 c) An architecture specific way to publish the state read from (a) which
user-space believes to be current.

[ (a) is only read by userspace, it is published by the kernel.
  (b) is only read by the kernel, it is published by userspace. ]

Signed-off-by: Paul Turner <p...@google.com>
---
 arch/Kconfig   |7 ++
 arch/x86/Kconfig   |1 
 arch/x86/entry/syscalls/syscall_64.tbl |1 
 fs/exec.c  |1 
 include/linux/sched.h  |   27 
 include/uapi/asm-generic/unistd.h  |4 +
 init/Kconfig   |   11 +++
 kernel/Makefile|2 -
 kernel/restartable_sequences.c |  112 
 kernel/sched/core.c|4 +
 kernel/sched/sched.h   |3 +
 kernel/sys_ni.c|2 +
 12 files changed, 172 insertions(+), 3 deletions(-)
 create mode 100644 kernel/restartable_sequences.c

diff --git a/arch/Kconfig b/arch/Kconfig
index 671810c..336880c 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -241,6 +241,13 @@ config HAVE_REGS_AND_STACK_ACCESS_API
  declared in asm/ptrace.h
  For example the kprobes-based event tracer needs this API.
 
+config HAVE_RESTARTABLE_SEQUENCE_SUPPORT
+   bool
+   depends on HAVE_REGS_AND_STACK_ACCESS_API
+   help
+ This symbol should be selected by an architecture if it supports an
+ implementation of restartable sequences.
+
 config HAVE_CLK
bool
help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 262b79c..25a0a60 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -135,6 +135,7 @@ config X86
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
select HAVE_REGS_AND_STACK_ACCESS_API
+   select HAVE_RESTARTABLE_SEQUENCE_SUPPORT
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_UID16   if X86_32 || IA32_EMULATION
select HAVE_UNSTABLE_SCHED_CLOCK
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
b/arch/x86/entry/syscalls/syscall_64.tbl
index 278842f..0fd4243 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -331,6 +331,7 @@
 322	64	execveat		stub_execveat
 323	common	userfaultfd		sys_userfaultfd
 324	common	membarrier		sys_membarrier
+325	common	restartable_sequences	sys_restartable_sequences
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/exec.c b/fs/exec.c
index 0a77a69..120ef19 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1599,6 +1599,7 @@ static int do_execveat_common(int fd, struct filename 
*filename,
current->in_execve = 0;
acct_update_integrals(current);
task_numa_free(current);
+   rseq_clear_state_exec();
free_bprm(bprm);
kfree(pathbuf);
putname(filename);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9e1e06c..03ab1f0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1189,6 +1189,21 @@ struct mempolicy;
 struct pipe_inode_info;
 struct uts_namespace;
 
+#ifdef CONFIG_RESTARTABLE_SEQUENCES
+struct restartable_sequence_state {
+   __user void *post_commit_instr_addr;
+   __user u64 *event_and_cpu;
+
+   u32 last_switch_count;
+   u32 event_counter;
+   struct preempt_notifier notifier;
+};
+
+void rseq_clear_state_exec(void);
+#else
+static inline void rseq_clear_state_exec(void) {}
+#endif
+
 str

Re: [RFC PATCH v2 2/3] restartable sequences: x86 ABI

2015-10-27 Thread Paul Turner
On Tue, Oct 27, 2015 at 10:03 PM, Peter Zijlstra <pet...@infradead.org> wrote:
>
> On Tue, Oct 27, 2015 at 04:57:05PM -0700, Paul Turner wrote:
> > +static void rseq_sched_out(struct preempt_notifier *pn,
> > +struct task_struct *next)
> > +{
> > + set_thread_flag(TIF_NOTIFY_RESUME);
> > +}
> >
> >  static __read_mostly struct preempt_ops rseq_preempt_ops = {
> >   .sched_in = rseq_sched_in_nop,
> > - .sched_out = rseq_sched_out_nop,
> > + .sched_out = rseq_sched_out,
> >  };
>
> Since we're unconditionally setting this TIF flag for these tasks, can't
> we introduce something similar to the (contested) TIF_NOHZ_FULL thing
> which is kept on the task indefinitely.
>
So Andy and I talked about this also, I'm in favor, in particular this
has two nice effects:
 a) In exit_to_usermode_loop() we can ensure that this is evaluated
prior to _TIF_SIGPENDING.  This removes the current requirement that
we also validate this state in setup_rt_frame() [which can perturb
this state prior to our existing notifier].
 b) We avoid spurious interactions with other things that use notify resume.

> That avoids having the preempt notifiers and this atomic op in the
> schedule path.


So we still want something there (although it can definitely be
inlined as opposed to a preempt_notifier) since this allows us to only
evaluate this check on returns to user-space that might matter as
opposed to every syscall.


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-10-15 Thread Paul Turner
On Thu, Oct 1, 2015 at 11:46 AM, Tejun Heo <t...@kernel.org> wrote:
> Hello, Paul.
>
> Sorry about the delay.  Things were kinda hectic in the past couple
> weeks.

Likewise :-(

>
> On Fri, Sep 18, 2015 at 04:27:07AM -0700, Paul Turner wrote:
>> On Sat, Sep 12, 2015 at 7:40 AM, Tejun Heo <t...@kernel.org> wrote:
>> > On Wed, Sep 09, 2015 at 05:49:31AM -0700, Paul Turner wrote:
>> >> I do not think this is a layering problem.  This is more like C++:
>> >> there is no sane way to concurrently use all the features available,
>> >> however, reasonably self-consistent subsets may be chosen.
>> >
>> > That's just admitting failure.
>> >
>>
>> Alternatively: accepting there are varied use-cases to
>> support.
>
> Analogies like this can go awry but as we're in it anyway, let's push
> it a bit further.  One of the reasons why C++ isn't lauded as an
> example of great engineering is while it does support a vast number of
> use-cases or rather usage-scenarios (it's not necessarily focused on
> utility but just how things are done) it fails to distill the essence
> of the actual utility out of them and condense it.  It's not just an
> aesthetic argument.  That failure exacts heavy costs on its users and
> is one of the reasons why C++ projects are more prone to horrible
> disarrays unless specific precautions are taken.
>
> I'm not against supporting valid and useful use-cases but not all
> usage-scenarios are equal.  If we can achieve the same eventual goals
> with reasonable trade-offs in a simpler and more straight-forward way,
> that's what we should do even though that'd require some modifications
> to specific usage-scenarios.  ie. the usage-scenarios need to
> scrutinized so that the core of the utility can be extracted and
> abstracted in the, hopefully, minimal way.
>
> This is what worries me when you liken the situation to C++.  You
> probably were trying to make a different point but I'm not sure we're
> on the same page and I think we need to agree at least on this in
> principle; otherwise, we'll just keep talking past each other.


I agree with trying to reach a minimal core functionality that
satisfies all use-cases.  I am only saying, however, that I do not think
we can reduce to an API so minimal that all users will use all parts of
it.  We have to fit more than one usage model in.

>
>> > The kernel does not update all CPU affinity masks when a CPU goes down
>> > or comes up.  It just enforces the intersection and when the
>> > intersection becomes empty, ignores it.  cgroup-scoped behaviors
>> > should reflect what the system does in the global case in general, and
>> > the global behavior here, although missing some bits, is a lot saner
>> > than what cpuset is currently doing.
>>
>> You are conflating two things here:
>> 1) How we maintain these masks
>> 2) The interactions on updates
>>
>> I absolutely agree with you that we want to maintain (1) in a
>> non-pointwise format.  I've already talked about that in other replies
>> on this thread.
>>
>> However for (2) I feel you are:
>>  i) Underestimating the complexity of synchronizing updates with user-space.
>>  ii) Introducing more non-desirable behaviors [partial overwrite] than
>> those you object to [total overwrite].
>
> The thing which bothers me the most is that cpuset behavior is
> different from global case for no good reason.

I've tried to explain above that I believe there are reasonable
reasons for it working the way it does from an interface perspective.
I do not think they can be so quickly discarded out of hand.  However,
I think we should continue winnowing focus and first resolve the model
of interaction for sub-process hierarchies,

> We don't have a model
> right now.  It's schizophrenic.  And what I was trying to say was that
> maybe this is because we never had a working model in the global case
> either but if that's the case we need to solve the global case too or
> at least figure out where we wanna be in the long term.
>
>> It's the most consistent choice; you've not given any reasons above
>> why a solution with only partial consistency is any better.
>>
>> Any choice here is difficult to coordinate, that two APIs allow
>> manipulation of the same property means that we must always
>> choose some compromise here.  I prefer the one with the least
>> surprises.
>
> I don't think the current situation around affinity mask handling can
> be considered consistent and cpuset is pouring more inconsistencies
> into it.  We need to figure it out one way or the other.
>
> ...
>> I do not yet see a 

Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-09-18 Thread Paul Turner
On Sat, Sep 12, 2015 at 7:40 AM, Tejun Heo <t...@kernel.org> wrote:
> Hello,
>
> On Wed, Sep 09, 2015 at 05:49:31AM -0700, Paul Turner wrote:
>> I do not think this is a layering problem.  This is more like C++:
>> there is no sane way to concurrently use all the features available,
>> however, reasonably self-consistent subsets may be chosen.
>
> That's just admitting failure.
>

Alternatively: accepting there are varied use-cases to
support.

>> > I don't get it.  How is that the only consistent way?  Why is making
>> > irreversible changes even a good way?  Just layer the masks and
>> > trigger notification on changes.
>>
>> I'm not sure if you're agreeing or disagreeing here.  Are you saying
>> there's another consistent way from "clobbering then triggering
>> notification changes" since it seems like that's what is rejected and
>> then described.  This certainly does not include any provisions for
>> reversibility (which I think is a non-starter).
>>
>> With respect to layering:  Are you proposing we maintain a separate
>> mask for sched_setaffinity and cpusets, then do different things
>> depending on their intersection, or lack thereof?  I feel this would
>> introduce more inconsistencies than it would solve as these masks would
>> not be separately inspectable from user-space, leading to confusing
>> interactions as they are changed.
>
> So, one of the problems is that the kernel can't have tasks w/o
> runnable CPUs, so we have to some workaround when, for whatever
> reason, a task ends up with no CPU that it can run on.  The other is
> that we never established a consistent way to deal with it in global
> case either.
>
> You say cpuset isn't a layering thing but that simply isn't true.
> It's a cgroup-scope CPU mask.  It layers atop task affinities
> restricting what they can be configured to, limiting the effective
> cpumask to the intersection of actually existing CPUs and overriding
> individual affinity setting when the intersection doesn't exist.
>
> The kernel does not update all CPU affinity masks when a CPU goes down
> or comes up.  It just enforces the intersection and when the
> intersection becomes empty, ignores it.  cgroup-scoped behaviors
> should reflect what the system does in the global case in general, and
> the global behavior here, although missing some bits, is a lot saner
> than what cpuset is currently doing.

You are conflating two things here:
1) How we maintain these masks
2) The interactions on updates

I absolutely agree with you that we want to maintain (1) in a
non-pointwise format.  I've already talked about that in other replies
on this thread.

However for (2) I feel you are:
 i) Underestimating the complexity of synchronizing updates with user-space.
 ii) Introducing more non-desirable behaviors [partial overwrite] than
those you object to [total overwrite].

>
>> There are also very real challenges with how any notification is
>> implemented, independent of delivery:
>> The 'main' of an application often does not have good control or even
>> understanding over its own threads since many may be library managed.
>> Designation of responsibility for updating these masks is difficult.
>> That said, I think a notification would still be a useful improvement
>> here and that some applications would benefit.
>
> And this is the missing piece in the global case too.  We've just
> never solved this problem properly but that does not mean we should go
> off and do something completely different for cgroup case.  Clobbering
> is fine if there's a single entity controlling everything but at that
> level it's nothing more than a shorthand for running taskset on member
> tasks.
>

From user-space's perspective it always involves some out-of-band
clobber since what's specified by cpusets takes precedence.

However the result of overlaying the masks is that different update
combinations will have very different effects, varying from greatly
expanding parallelism to greatly restricting it.  Further, these
effects are hard to predict since anything returned by getaffinity is
obscured by whatever the instantaneous cpuset-level masks happen to be.

>> At the very least, I do not think that cpuset's behavior here can be
>> dismissed as unreasonable.
>
> It sure is very misguided.
>

It's the most consistent choice; you've not given any reasons above
why a solution with only partial consistency is any better.

Any choice here is difficult to coordinate, that two APIs allow
manipulation of the same property means that we must always
choose some compromise here.  I prefer the one with the least
surprises.

>> > What if optimizing cache allocation across competing threads o

Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-09-09 Thread Paul Turner
[ Picking this back up, I was out of the country last week.  Note that
we are also wrestling with some DMARC issues as it was just activated
for Google.com so apologies if people do not receive this directly. ]

On Tue, Aug 25, 2015 at 2:02 PM, Tejun Heo <t...@kernel.org> wrote:
> Hello,
>
> On Mon, Aug 24, 2015 at 04:06:39PM -0700, Paul Turner wrote:
>> > This is an erratic behavior on cpuset's part tho.  Nothing else
>> > behaves this way and it's borderline buggy.
>>
>> It's actually the only sane possible interaction here.
>>
>> If you don't overwrite the masks you can no longer manage cpusets with
>> a multi-threaded application.
>> If you partially overwrite the masks you can create a host of
>> inconsistent behaviors where an application suddenly loses
>> parallelism.
>
> It's a layering problem.  It'd be fine if cpuset either did "layer
> per-thread affinities below w/ config change notification" or "ignore
> and/or reject per-thread affinities".  What we have now is two layers
> manipulating the same field without any mechanism for coordination.
>

I think this is a mischaracterization.  With respect to the two
proposed solutions:

a) Notifications do not solve this problem.
b) Rejecting per-thread affinities is a non-starter.  It's absolutely
needed.  (Aside:  This would also wholly break the existing
sched_setaffinity/getaffinity syscalls.)

I do not think this is a layering problem.  This is more like C++:
there is no sane way to concurrently use all the features available,
however, reasonably self-consistent subsets may be chosen.

>> The *only* consistent way is to clobber all masks uniformly.  Then
>> either arrange for some notification to the application to re-sync, or
>> use sub-sub-containers within the cpuset hierarchy to advertise
>> finer-partitions.
>
> I don't get it.  How is that the only consistent way?  Why is making
> irreversible changes even a good way?  Just layer the masks and
> trigger notification on changes.

I'm not sure if you're agreeing or disagreeing here.  Are you saying
there's another consistent way from "clobbering then triggering
notification changes" since it seems like that's what is rejected and
then described.  This certainly does not include any provisions for
reversibility (which I think is a non-starter).

With respect to layering:  Are you proposing we maintain a separate
mask for sched_setaffinity and cpusets, then do different things
depending on their intersection, or lack thereof?  I feel this would
introduce more inconsistencies than it would solve as these masks would
not be separately inspectable from user-space, leading to confusing
interactions as they are changed.

There are also very real challenges with how any notification is
implemented, independent of delivery:
The 'main' of an application often does not have good control or even
understanding over its own threads since many may be library managed.
Designation of responsibility for updating these masks is difficult.
That said, I think a notification would still be a useful improvement
here and that some applications would benefit.

At the very least, I do not think that cpuset's behavior here can be
dismissed as unreasonable.

>
>> I don't think the case of having a large compute farm with
>> "unimportant" and "important" work is particularly fringe.  Reducing
>> the impact on the "important" work so that we can scavenge more cycles
>> for the latency insensitive "unimportant" is very real.
>
> What if optimizing cache allocation across competing threads of a
> process can yield, say, 3% gain across large compute farm?  Is that
> fringe?

Frankly, yes.  If you have a compute farm sufficiently dedicated to a
single application I'd say that's a fairly specialized use.  I see no
reason why a more 'technical' API should be a challenge for such a
user.  The fundamental point here was that it's ok for the API of some
controllers to be targeted at system rather than user control in terms
of interface.  This does not restrict their use by users where
appropriate.

>
>> Right, but it's exactly because of _how bad_ those other mechanisms
>> _are_ that cgroups was originally created.  Its growth was not
>> managed well from there, but let's not step away from the fact that
>> this interface was created to solve this problem.
>
> Sure, at the same time, please don't forget that there are ample
> reasons we can't replace more basic mechanisms with cgroups.  I'm not
> saying this can't be part of cgroup but rather that it's misguided to
> do plunge into cgroups as the first and only step.
>

So there is definitely a proliferation of discussion regarding
applying cgroups to other problems which I agree we n

Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Paul Turner
On Mon, Aug 24, 2015 at 10:04 AM, Tejun Heo t...@kernel.org wrote:
 Hello, Austin.

 On Mon, Aug 24, 2015 at 11:47:02AM -0400, Austin S Hemmelgarn wrote:
 Just to learn more, what sort of hypervisor support threads are we
 talking about?  They would have to consume considerable amount of cpu
 cycles for problems like this to be relevant and be dynamic in numbers
 in a way which letting them competing against vcpus makes sense.  Do
 IO helpers meet these criteria?
 
 Depending on the configuration, yes they can.  VirtualBox has some rather
 CPU intensive threads that aren't vCPU threads (their emulated APIC thread
 immediately comes to mind), and so does QEMU depending on the emulated

 And the number of those threads fluctuate widely and dynamically?

 hardware configuration (it gets more noticeable when the disk images are
 stored on a SAN and served through iSCSI, NBD, FCoE, or ATAoE, which is
 pretty typical usage for large virtualization deployments).  I've seen cases
 first hand where the vCPU's can make no reasonable progress because they are
 constantly getting crowded out by other threads.

 That alone doesn't require hierarchical resource distribution tho.
 Setting nice levels reasonably is likely to alleviate most of the
 problem.

Nice is not sufficient here.  There could be arbitrarily many threads
within the hypervisor that are not actually hosting guest CPU threads.
The only way to have this competition occur at a reasonably fixed
ratio is a sub-hierarchy.


 The use of the term 'hypervisor support threads' for this is probably not
 the best way of describing the contention, as it's almost always a full
 system virtualization issue, and the contending threads are usually storage
 back-end access threads.

 I would argue that there are better ways to deal properly with this (Isolate
 the non vCPU threads on separate physical CPU's from the hardware emulation
 threads), but such methods require large systems to be practical at any
 scale, and many people don't have the budget for such large systems, and
 this way of doing things is much more flexible for small scale use cases
 (for example, someone running one or two VM's on a laptop under QEMU or
 VirtualBox).

 I don't know.  Someone running one or two VM's on a laptop under
 QEMU doesn't really sound like the use case which absolutely requires
 hierarchical cpu cycle distribution.


We run more than 'one or two' VMs using this configuration.  :)


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Paul Turner
On Mon, Aug 24, 2015 at 1:25 PM, Tejun Heo t...@kernel.org wrote:
 Hello, Austin.

 On Mon, Aug 24, 2015 at 04:00:49PM -0400, Austin S Hemmelgarn wrote:
 That alone doesn't require hierarchical resource distribution tho.
 Setting nice levels reasonably is likely to alleviate most of the
 problem.

 In the cases I've dealt with this myself, nice levels didn't cut it, and I
 had to resort to SCHED_RR with particular care to avoid priority inversions.

 I wonder why.  The difference between -20 and 20 is around 2500x in
 terms of weight.  That should have been enough for expressing whatever
 precedence the vcpus should have over other threads.

This strongly perturbs the load-balancer which performs busiest cpu
selection by weight.

Note that also we do not necessarily want total dominance by vCPU
threads, the hypervisor threads are almost always doing work on their
behalf and we want to provision them with _some_ time.  A
sub-hierarchy allows this to be performed in a way that is independent
of how many vCPUs or support threads that are present.


 I don't know.  Someone running one or two VM's on a laptop under
 QEMU doesn't really sound like the use case which absolutely requires
 hierarchical cpu cycle distribution.

 It depends on the use case.  I never have more than 2 VM's running on my
 laptop (always under QEMU, setting up Xen is kind of pointless on a quad core
 system with only 8G of RAM), and I take extensive advantage of the cpu
 cgroup to partition resources among various services on the host.

 Hmmm... I'm trying to understand the usecases where having hierarchy
 inside a process are actually required so that we don't end up doing
 something complex unnecessarily.  So far, it looks like an easy
 alternative for qemu would be teaching it to manage priorities of its
 threads given that the threads are mostly static - vcpus going up and
 down are explicit operations which can trigger priority adjustments if
 necessary, which is unlikely to begin with.

What you're proposing is both unnecessarily complex and imprecise.
Arbitrating competition between groups of threads is exactly why we
support sub-hierarchies within cpu.


 Thanks.

 --
 tejun


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Paul Turner
On Mon, Aug 24, 2015 at 2:02 PM, Tejun Heo t...@kernel.org wrote:
 Hello,

 On Mon, Aug 24, 2015 at 01:54:08PM -0700, Paul Turner wrote:
  That alone doesn't require hierarchical resource distribution tho.
  Setting nice levels reasonably is likely to alleviate most of the
  problem.

 Nice is not sufficient here.  There could be arbitrarily many threads
 within the hypervisor that are not actually hosting guest CPU threads.
 The only way to have this competition occur at a reasonably fixed
 ratio is a sub-hierarchy.

 I get that having hierarchy of threads would be nicer but am having a
 bit of difficulty seeing why adjusting priorities of threads wouldn't
 be sufficient.  It's not like threads of the same process competing
 with each other is a new problem.  People have been dealing with it
 for ages.  Hierarchical management can be a nice plus but we want the
 problem and proposed solution to be justifiable.

Consider what happens with load asymmetry:

Suppose that we have 10 vcpu threads and 100 support threads.
Suppose that we want the support threads to receive up to 10% of the
time available to the VM as a whole on that machine.

If I have one particular support thread that is busy, I want it to
receive that entire 10% (maybe a guest is pounding on scsi for
example, or in the thread-pool case, I've passed a single expensive
computation).  Conversely, suppose the guest is doing lots of
different things and several support threads are active, I want the
time to be shared between them.

There is no way to implement this with nice.  Either a single thread
can consume 10%, and the group can dominate, or the group cannot
dominate and the single thread can be starved.
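For concreteness, the sub-hierarchy version is trivial to express; the sketch below is illustrative only (the mount point, group names and share values are assumptions, using the v1 cpu.shares knob):

#include <stdio.h>

/* Write a share value into a cgroup's cpu.shares file. */
static void set_shares(const char *grp, int shares)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "%s/cpu.shares", grp);
	f = fopen(path, "w");
	if (!f)
		return;
	fprintf(f, "%d\n", shares);
	fclose(f);
}

int main(void)
{
	/*
	 * vm/ has two children: vcpus/ (the 10 vCPU threads) and
	 * support/ (the 100 support threads).  A 9:1 share ratio caps the
	 * support pool at ~10% of the VM's time regardless of how many of
	 * its threads happen to be runnable.
	 */
	set_shares("/sys/fs/cgroup/cpu/vm/vcpus", 9216);
	set_shares("/sys/fs/cgroup/cpu/vm/support", 1024);
	return 0;
}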


 Thanks.

 --
 tejun


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Paul Turner
On Mon, Aug 24, 2015 at 2:36 PM, Tejun Heo t...@kernel.org wrote:
 Hello, Paul.

 On Mon, Aug 24, 2015 at 01:52:01PM -0700, Paul Turner wrote:
 We typically share our machines between many jobs, these jobs can have
 cores that are private (and not shared with other jobs) and cores
 that are shared (general purpose cores accessible to all jobs on the
 same machine).

 The pool of cpus in the shared pool is dynamic as jobs entering and
 leaving the machine take or release their associated private cores.

 By creating the appropriate sub-containers within the cpuset group we
 allow jobs to pin specific threads to run on their (typically) private
 cores.  This also allows the management daemons additional flexibility
 as it's possible to update which cores we place as private, without
 synchronization with the application.  Note that sched_setaffinity()
 is a non-starter here.

 Why isn't it?  Because the programs themselves might try to override
 it?

The major reasons are:

1) Isolation.  Doing everything with sched_setaffinity means that
programs can use arbitrary resources if they desire.
  1a) These restrictions need to also apply to threads created by
library code.  Which may be 3rd party.
2) Interaction between cpusets and sched_setaffinity.  For necessary
reasons, a cpuset update always overwrites all extant
sched_setaffinity values. ...And we need some cpusets for (1).  And
we need periodic updates for access to shared cores.
3) Virtualization of CPU ids.  (Multiple applications all binding to
core 1 is a bad thing.)


 Let me try to restate:
   I think that we can specify the usage is specifically niche that it
 will *typically* be used by higher level management daemons which

 I really don't think that's the case.


Can you provide examples of non-exceptional usage in this fashion?

 prefer a more technical and specific interface.  This does not
 preclude use by threads, it just makes it less convenient; I think
 that we should be optimizing for flexibility over ease-of-use for a
 very small number of cases here.

 It's more like there are two niche sets of use cases.  If a
 programmable interface or cgroups has to be picked as an exclusive
 alternative, it's pretty clear that programmable interface is the way
 to go.

I strongly disagree here:
  The *major obvious use* is partitioning of a system, which must act
on groups of processes.  Cgroups is the only interface we have which
satisfies this today.


  It's not contained in the process at all.  What if an external entity
  decides to migrate the process into another cgroup inbetween?
 

 If we have 'atomic' moves and a way to access our sub-containers from
 the process in a consistent fashion (e.g. relative paths) then this is
 not an issue.

 But it gets so twisted.  Relative paths aren't enough.  It actually
 has to proxy accesses to already open files.  At that point, why would
 we even keep it as a file-system based interface?

Well no, this can just be reversed and we can have the relative paths
be the actual files which the hierarchy points back at.

Ultimately, they could potentially not even be exposed in the regular
hierarchy.  At this point we could not expose anything that does not
support sub-process splits within processes' hierarchy and we're at a
more reasonable state of affairs.

There is real value in being able to duplicate interface between
process and sub-process level control.


 I am not endorsing the world we are in today, only describing how it
 can be somewhat sanely managed.  Some of these lessons could be
 formalized in imagining the world of tomorrow.  E.g. the sub-process
 mounts could appear within some (non-movable) alternate file-system
 path.

 Ditto.  Wouldn't it be better to implement something which resemables
 conventional programming interface rather than contorting the
 filesystem semantics?


Maybe?  This is a trade-off, some of which is built on the assumptions
we're now debating.

There is also value, cost-wise, in iterative improvement of what we
have today rather than trying to nuke it from orbit.  I do not know
which of these is the right choice, it likely depends strongly on
where we end up for sub-process interfaces.  If we do support those
I'm not sure it makes sense for them to have an entirely different API
from process-level coordination, at which point the file-system
overload is a trade-off rather than a cost.

  The harder answer is:  How do we handle non-fungible resources such as
  CPU assignments within a hierarchy?  This is a big part of why I make
  arguments for certain partitions being management-software only above.
  This is imperfect, but better then where we stand today.
 
  I'm not following.  Why is that different?

 This is generally any time a change in the external-to-application's
 cgroup-parent requires changes in the sub-hierarchy.  This is most
 visible with a resource such as a cpu which is uniquely identified,
 but similarly applies to any limits.

 So, except

Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Paul Turner
On Sat, Aug 22, 2015 at 11:29 AM, Tejun Heo t...@kernel.org wrote:
 Hello, Paul.

 On Fri, Aug 21, 2015 at 12:26:30PM -0700, Paul Turner wrote:
 ...
 A very concrete example of the above is a virtual machine in which you
 want to guarantee scheduling for the vCPU threads which must schedule
 beside many hypervisor support threads.   A hierarchy is the only way
 to fix the ratio at which these compete.

 Just to learn more, what sort of hypervisor support threads are we
 talking about?  They would have to consume considerable amount of cpu
 cycles for problems like this to be relevant and be dynamic in numbers
 in a way which letting them competing against vcpus makes sense.  Do
 IO helpers meet these criteria?

I'm not sure what you mean by an IO helper.  By support threads I mean
any threads that are used in the hypervisor implementation that are
not hosting a vCPU.


 An example that's not the cpu controller is that we use cpusets to
 expose to applications their shared and private cores.  (These
 sets are dynamic based on what is coscheduled on a given machine.)

 Can you please go into more details with these?

We typically share our machines between many jobs, these jobs can have
cores that are private (and not shared with other jobs) and cores
that are shared (general purpose cores accessible to all jobs on the
same machine).

The pool of cpus in the shared pool is dynamic as jobs entering and
leaving the machine take or release their associated private cores.

By creating the appropriate sub-containers within the cpuset group we
allow jobs to pin specific threads to run on their (typically) private
cores.  This also allows the management daemons additional flexibility
as it's possible to update which cores we place as private, without
synchronization with the application.  Note that sched_setaffinity()
is a non-starter here.
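Roughly, the split of responsibilities looks like the sketch below (illustrative only; the cpuset mount point, group names and cpu ranges are assumptions, and creation of the sub-groups themselves is omitted):

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static void write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return;
	fputs(val, f);
	fclose(f);
}

int main(void)
{
	char tid[32];

	/* Management-daemon side: define the job's private cores. */
	write_str("/sys/fs/cgroup/cpuset/job0/private/cpuset.cpus", "2-5");
	write_str("/sys/fs/cgroup/cpuset/job0/private/cpuset.mems", "0");

	/* Application side: pin the calling thread onto its private set
	 * by tid, without ever touching sched_setaffinity(). */
	snprintf(tid, sizeof(tid), "%ld", (long)syscall(SYS_gettid));
	write_str("/sys/fs/cgroup/cpuset/job0/private/tasks", tid);
	return 0;
}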


  Why would you assume that threads of a process wouldn't want to
  configure it ever?  How is this different from CPU affinity?

 In general cache and CPU behave differently.  Generally for it to make
 sense between threads in a process they would have to have wholly
 disjoint memory, at which point the only sane long-term implementation
 is separate processes and the management moves up a level anyway.

 That said, there are surely cases in which it might be convenient to
 use at a per-thread level to correct a specific performance anomaly.
 But at that point, you have certainly reached the level of hammer that
 you can coordinate with an external daemon if necessary.

 So, I'm not super familiar with all the use cases but the whole cache
 allocation thing is almost by nature a specific niche thing and I feel
 pretty reluctant to blow off per-thread usages as too niche to worry
 about.

Let me try to restate:
  I think that we can specify the usage is specifically niche that it
will *typically* be used by higher level management daemons which
prefer a more technical and specific interface.  This does not
preclude use by threads, it just makes it less convenient; I think
that we should be optimizing for flexibility over ease-of-use for a
very small number of cases here.


  I don't follow what you're trying to way with the above paragraph.
  Are you still talking about CAT?  If so, that use case isn't the only
  one.  I'm pretty sure there are people who would want to configure
  cache allocation at thread level.

 I'm not agreeing with you that in cgroups means must be usable by
 applications within that hierarchy.  A cgroup subsystem used as a
 partitioning API only by system management daemons is entirely
 reasonable.  CAT is a reasonable example of this.

 I see.  The same argument.  I don't think CAT just being system
 management thing makes sense.

  So, this is a trade-off we're consciously making.  If there are
  common-enough use cases which require jumping across different cgroup
  domains, we'll try to figure out a way to accomodate those but by
  default migration is a very cold and expensive path.

 The core here was the need for allowing sub-process migration.  I'm
 not sure I follow the performance trade-off argument; haven't we
 historically seen the opposite?  That migration has been a slow-path
 without optimizations and people pushing to make it faster?  This
 seems a hard generalization to make for something that's inherently
 tied to a particular controller.

 It isn't something tied to a particular controller.  Some controllers
 may get impacted less by than others but there's an inherent
 connection between how dynamic an association is and how expensive the
 locking around it needs to be and we need to set up basic behavior and
 usage conventions so that different controllers are designed and
 implemented assuming similar usage patterns; otherwise, we end up with
 the chaotic shit show that we have had where everything behaves
 differently and nobody knows what's the right way to do things and we
 end up locked into weird requirements which some

Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Paul Turner
On Mon, Aug 24, 2015 at 2:17 PM, Tejun Heo t...@kernel.org wrote:
 Hello,

 On Mon, Aug 24, 2015 at 02:10:17PM -0700, Paul Turner wrote:
 Suppose that we have 10 vcpu threads and 100 support threads.
 Suppose that we want the support threads to receive up to 10% of the
 time available to the VM as a whole on that machine.

 If I have one particular support thread that is busy, I want it to
 receive that entire 10% (maybe a guest is pounding on scsi for
 example, or in the thread-pool case, I've passed a single expensive
 computation).  Conversely, suppose the guest is doing lots of
 different things and several support threads are active, I want the
 time to be shared between them.

 There is no way to implement this with nice.  Either a single thread
 can consume 10%, and the group can dominate, or the group cannot
 dominate and the single thread can be starved.

 Would it be possible for you to give realistic and concrete examples?
 I'm not trying to play down the use cases but concrete examples are
 usually helpful at putting things in perspective.

I don't think there's anything that's not realistic or concrete about
the example above.  The suppose parts were only for qualifying the
pool sizes for vcpu and non-vcpu threads above since discussion of
implementation using nice is dependent on knowing these counts.



 Thanks.

 --
 tejun


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Paul Turner
On Mon, Aug 24, 2015 at 2:12 PM, Tejun Heo t...@kernel.org wrote:
 Hello, Paul.

 On Mon, Aug 24, 2015 at 02:00:54PM -0700, Paul Turner wrote:
  Hmmm... I'm trying to understand the usecases where having hierarchy
  inside a process are actually required so that we don't end up doing
  something complex unnecessarily.  So far, it looks like an easy
  alternative for qemu would be teaching it to manage priorities of its
  threads given that the threads are mostly static - vcpus going up and
  down are explicit operations which can trigger priority adjustments if
  necessary, which is unlikely to begin with.

 What you're proposing is both unnecessarily complex and imprecise.
 Arbitrating competition between groups of threads is exactly why we
 support sub-hierarchies within cpu.

 Sure, and to make that behave half-way acceptable, we'll have to take
 on significant amount of effort and likely complexity and I'm trying
 to see whether the usecases are actually justifiable.  I get that
 priority based solution will be less precise and more complex on the
 application side but by how much and does the added precision enough
 to justify the extra facilities to support that?  If it is, sure,
 let's get to it but it'd be great if the concrete prolem cases are
 properly identified and understood.  I'll continue on the other reply.


No problem, I think the conversation is absolutely
constructive/important to have and am happy to help drill down.

 Thanks.

 --
 tejun


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Paul Turner
On Mon, Aug 24, 2015 at 2:40 PM, Tejun Heo t...@kernel.org wrote:
 On Mon, Aug 24, 2015 at 02:19:29PM -0700, Paul Turner wrote:
  Would it be possible for you to give realistic and concrete examples?
  I'm not trying to play down the use cases but concrete examples are
  usually helpful at putting things in perspective.

 I don't think there's anything that's not realistic or concrete about
 the example above.  The suppose parts were only for qualifying the
 pool sizes for vcpu and non-vcpu threads above since discussion of
 implementation using nice is dependent on knowing these counts.

 Hmm... I was hoping for an actual configurations and usage scenarios.
 Preferably something people can set up and play with.

This is much easier to set up and play with synthetically.  Just
create the 10 threads and 100 threads above then experiment with
configurations designed at guaranteeing the set of 100 threads
relatively uniform throughput regardless of how many are active.   I
don't think trying to run a VM stack adds anything except complexity
of reproduction here.

 I take that the
 CPU intensive helper threads are usually IO workers?  Is the scenario
 where the VM is set up with a lot of IO devices and different ones may
 consume large amount of CPU cycles at any given point?

Yes, generally speaking there are a few major classes of IO (flash,
disk, network) that a guest may invoke.  Each of these backends is
separate and chooses its own threading.


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Paul Turner
On Mon, Aug 24, 2015 at 3:49 PM, Tejun Heo t...@kernel.org wrote:
 Hello,

 On Mon, Aug 24, 2015 at 03:03:05PM -0700, Paul Turner wrote:
  Hmm... I was hoping for an actual configurations and usage scenarios.
  Preferably something people can set up and play with.

 This is much easier to set up and play with synthetically.  Just
 create the 10 threads and 100 threads above then experiment with
 configurations designed at guaranteeing the set of 100 threads
 relatively uniform throughput regardless of how many are active.   I
 don't think trying to run a VM stack adds anything except complexity
 of reproduction here.

 Well, but that loses most of details and why such use cases matter to
 begin with.  We can imagine up stuff to induce arbitrary set of
 requirements.

All that's being proved or disproved here is that it's difficult to
coordinate the consumption of asymmetric thread pools using nice.  The
constraints here were drawn from a real-world example.


  I take that the
  CPU intensive helper threads are usually IO workers?  Is the scenario
  where the VM is set up with a lot of IO devices and different ones may
  consume large amount of CPU cycles at any given point?

 Yes, generally speaking there are a few major classes of IO (flash,
 disk, network) that a guest may invoke.  Each of these backends is
 separate and chooses its own threading.

 Hmmm... if that's the case, would limiting iops on those IO devices
 (or classes of them) work?  qemu already implements IO limit mechanism
 after all.

No.

1) They should proceed at the maximum rate that they can that's still
within their provisioning budget.
2) The cost/IO is both inconsistent and changes over time.  Attempting
to micro-optimize every backend for this is infeasible, this is
exactly the type of problem that the scheduler can usefully help
arbitrate.
3) Even pretending (2) is fixable, dynamically dividing these
right-to-work tokens between different I/O device backends is
extremely complex.


 Anyways, a point here is that threads of the same process competing
 isn't a new problem.  There are many ways to make those threads play
 nice as the application itself often has to be involved anyway,
 especially for something like qemu which is heavily involved in
 provisioning resources.

It's certainly not a new problem, but it's a real one, and it's
_hard_.  You're proposing removing the best known solution.


 cgroups can be a nice brute-force add-on which lets sysadmins do wild
 things but it's inherently hacky and incomplete for coordinating
 threads.  For example, what is it gonna do if qemu cloned vcpus and IO
 helpers dynamically off of the same parent thread?

We're talking about sub-process usage here.  This is the application
coordinating itself, NOT the sysadmin.  Processes are becoming larger
and larger, we need many of the same controls within them that we have
between them.

  It requires
 application's cooperation anyway but at the same time is painful to
 actually interact from those applications.

As discussed elsewhere on thread this is really not a problem if you
define consistent rules with respect to which parts are managed by
who.  The argument of potential interference is no different to
messing with an application's on-disk configuration behind its back.
Alternate strawmen which greatly improve this from where we are today
have also been proposed.


 Thanks.

 --
 tejun


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Paul Turner
On Mon, Aug 24, 2015 at 3:19 PM, Tejun Heo t...@kernel.org wrote:
 Hey,

 On Mon, Aug 24, 2015 at 02:58:23PM -0700, Paul Turner wrote:
  Why isn't it?  Because the programs themselves might try to override
  it?

 The major reasons are:

 1) Isolation.  Doing everything with sched_setaffinity means that
 programs can use arbitrary resources if they desire.
   1a) These restrictions need to also apply to threads created by
 library code.  Which may be 3rd party.
 2) Interaction between cpusets and sched_setaffinity.  For necessary
 reasons, a cpuset update always overwrites all extant
 sched_setaffinity values. ...And we need some cpusets for (1).  And
 we need periodic updates for access to shared cores.

 This is an erratic behavior on cpuset's part tho.  Nothing else
 behaves this way and it's borderline buggy.


It's actually the only sane possible interaction here.

If you don't overwrite the masks you can no longer manage cpusets with
a multi-threaded application.
If you partially overwrite the masks you can create a host of
inconsistent behaviors where an application suddenly loses
parallelism.

The *only* consistent way is to clobber all masks uniformly.  Then
either arrange for some notification to the application to re-sync, or
use sub-sub-containers within the cpuset hierarchy to advertise
finer-partitions.

(Generally speaking, there is no real way to mate these APIs and part
of the reason we use sub-containers here.  What's being proposed will
make this worse rather than better.)

 3) Virtualization of CPU ids.  (Multiple applications all binding to
 core 1 is a bad thing.)

 This is about who's setting the affinity, right?  As long as an agent
 which knows system details sets it, which mechanism doesn't really
 matter.

Yes, there are other ways to implement this.


  Let me try to restate:
I think that we can specify the usage is specifically niche that it
  will *typically* be used by higher level management daemons which
 
  I really don't think that's the case.
 

 Can you provide examples of non-exceptional usage in this fashion?

 I heard of two use cases.  One is sytem-partitioning that you're
 talking about and the other is preventing threads of the same process
 from stepping on each other's toes.  There was a fancy word for the
 cacheline cannibalizing behavior which shows up in those scenarios.

So this is a single example right, since the system partitioning case
is the one in which it's exclusively used by a higher level management
daemon.

The case of an process with specifically identified threads in
conflict certainly seems exceptional in the level of optimization both
in implementation and analysis present.  I would expect in this case
that either they are comfortable with the more technical API, or they
can coordinate with an external controller.  Which is much less
overloaded both by number of callers and by number of interfaces than
it is in the cpuset case.


  It's more like there are two niche sets of use cases.  If a
  programmable interface or cgroups has to be picked as an exclusive
  alternative, it's pretty clear that programmable interface is the way
  to go.

 I strongly disagree here:
   The *major obvious use* is partitioning of a system, which must act

 I don't know.  Why is that more major obvious use?  This is super
 duper fringe to begin with.  It's like tallying up beans.  Sure, some
 may be taller than others but they're all still beans and I'm not even
 sure there's a big difference between the two use cases here.

I don't think the case of having a large compute farm with
unimportant and important work is particularly fringe.  Reducing
the impact on the important work so that we can scavenge more cycles
for the latency insensitive unimportant is very real.


 on groups of processes.  Cgroups is the only interface we have which
 satisfies this today.

 Well, not really.  cgroups is more convenient / better at these things
 but not the only way to do it.  People have been doing isolation to
 varying degrees with other mechanisms for ages.


Right, but it's exactly because of _how bad_ those other mechanisms
_are_ that cgroups was originally created.  Its growth was not
managed well from there, but let's not step away from the fact that
this interface was created to solve this problem.

  Ditto.  Wouldn't it be better to implement something which resemables
  conventional programming interface rather than contorting the
  filesystem semantics?

 Maybe?  This is a trade-off, some of which is built on the assumptions
 we're now debating.

 There is also value, cost-wise, in iterative improvement of what we
 have today rather than trying to nuke it from orbit.  I do not know
 which of these is the right choice, it likely depends strongly on
 where we end up for sub-process interfaces.  If we do support those
 I'm not sure it makes sense for them to have an entirely different API
 from process-level coordination, at which point the file-system

Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-21 Thread Paul Turner
On Tue, Aug 18, 2015 at 1:31 PM, Tejun Heo t...@kernel.org wrote:
 Hello, Paul.

 On Mon, Aug 17, 2015 at 09:03:30PM -0700, Paul Turner wrote:
  2) Control within an address-space.  For subsystems with fungible 
  resources,
  e.g. CPU, it can be useful for an address space to partition its own
  threads.  Losing the capability to do this against the CPU controller would
  be a large set-back for instance.  Occasionally, it is useful to share 
  these
  groupings between address spaces when processes are cooperative, but this 
  is
  less of a requirement.
 
  This is important to us.

 Sure, let's build a proper interface for that.  Do you actually need
 sub-hierarchy inside a process?  Can you describe your use case in
 detail and why having hierarchical CPU cycle distribution is essential
 for your use case?


One common example here is a thread-pool.  Having a hierarchical
constraint allows users to specify what proportion of time it should
receive, independent of how many threads are placed in the pool.

A very concrete example of the above is a virtual machine in which you
want to guarantee scheduling for the vCPU threads which must schedule
beside many hypervisor support threads.   A hierarchy is the only way
to fix the ratio at which these compete.

An example that's not the cpu controller is that we use cpusets to
expose to applications their shared and private cores.  (These
sets are dynamic based on what is coscheduled on a given machine.)

  And that's one of the major fuck ups on cgroup's part that must be
  rectified.  Look at the interface being proposed there.  It's exposing
  direct hardware details w/o much abstraction which is fine for a
  system management interface but at the same time it's intended to be
  exposed to individual applications.
 
  FWIW this is something we've had no significant problems managing with
  separate mount mounts and file system protections.  Yes, there are some
  potential warts around atomicity; but we've not found them too onerous.

 You guys control the whole stack.  Of course, you can get away with an
 interface which are pretty messed up in terms of layering and
 isolation; however, generic kernel interface cannot be designed
 according to that standard.

I feel like two points are being conflated here:

Yes, it is sufficiently generic that it's possible to configure
nonsensical things.

But, it is also possible to lock things down presently.  This is, for
better or worse, the direction that general user-space has also taken
with centralized management daemons such as systemd.

Setting design aside for a moment -- which I fully agree with you that
there is room for large improvement in.  The largest idiosyncrasy
today is that the configuration above does depend on having a stable
mount point for applications to manage their sub-hierarchies.
Migrations would improve this greatly, but this is a bit of a detour
because you're looking to fix the fundamental design rather than
improve the state of the world and that's probably a good thing :)


  What I don't quite follow here is the assumption that CAT should would be
  necessarily exposed to individual applications? What's wrong with 
  subsystems
  that are primarily intended only for system management agents, we already
  have several of these.

 Why would you assume that threads of a process wouldn't want to
 configure it ever?  How is this different from CPU affinity?

In general cache and CPU behave differently.  For per-thread cache
partitioning to make sense within a process, the threads would have to have
wholly disjoint memory, at which point the only sane long-term implementation
is separate processes and the management moves up a level anyway.

That said, there are surely cases in which it might be convenient to
use at a per-thread level to correct a specific performance anomaly.
But at that point, you have certainly reached the level of hammer that
you can coordinate with an external daemon if necessary.


  This lack of distinction makes
  people skip the attention that they should be paying when they're
  designing interface exposed to individual programs.  Worse, this makes
  these things fly under the review scrutiny that public API accessible
  to applications usually receives.  Yet, that's what these things end
  up to be.  This just has to stop.  cgroups can't continue to be this
  ghetto shortcut to implementing half-assed APIs.
 
  I certainly don't disagree on this point :).  But as above, I don't quite
  follow why an API being in cgroups must mean it's accessible to an
  application controlled by that group.  This has certainly not been a
  requirement for our use.

 I don't follow what you're trying to say with the above paragraph.
 Are you still talking about CAT?  If so, that use case isn't the only
 one.  I'm pretty sure there are people who would want to configure
 cache allocation at thread level.

I'm not agreeing with you that an API being in cgroups means it must be
usable by applications within

Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-17 Thread Paul Turner
Apologies for the repeat.  Gmail ate its plain text setting for some
reason.  Shame bells.

On Mon, Aug 17, 2015 at 9:02 PM, Paul Turner p...@google.com wrote:


 On Wed, Aug 5, 2015 at 7:31 AM, Tejun Heo t...@kernel.org wrote:

 Hello,

 On Wed, Aug 05, 2015 at 11:10:36AM +0200, Peter Zijlstra wrote:
   I've been thinking about it and I'm now convinced that cgroups just is
   the wrong interface to require each application to be programming
   against.
 
  But people are doing it. So you must give them something. You cannot
  just tell them to go away.

 Sure, more on specifics later, but, first of all, the transition to v2
 is a gradual process.  The new and old hierarchies can co-exist, so
 nothing forces abrupt transitions.  Also, we do want to start as
 restricted as possible and then widen it gradually as necessary.

  So where are the people doing this in this discussion? Or are you
  one-sidedly forcing things? IIRC Google was doing this.

 We've been having those discussions for years in person and on the
 cgroup mailing list.  IIRC, the google case was for blkcg where they
 have an IO proxy process which wanna issue IOs as different cgroups
 depending on who's the original issuer.  They created multiple
 threads, put them in different cgroups and bounce the IOs to the
 matching one; however, this is already pretty silly as they have to
 bounce IOs to different threads.  What makes a lot more sense here is
 the ability to tag an IO as coming from a specific cgroup (or a
 process's cgroup) and there was discussion of using an extra field in
 aio request to indicate this, which is a much better solution for
 the problem, can also express different IO priorities, and is pretty easy to
 implement.


 So we have two major types of use that are relevant to this interface:

 1) Proxy agents.  When a control systems want to perform work on behalf of a
 container, they will sometimes move the acting thread into the relevant
 control groups so that it can be accounted on that container's behalf.
 [This is more relevant for non-persistent resources such as CPU time or I/O
 priorities than charges that will outlive the work such as memory
 allocations.]

 I agree (1) is at best a bit of a hack and can be worked around on the type
 of time-frame these interfaces move at.

 2) Control within an address-space.  For subsystems with fungible resources,
 e.g. CPU, it can be useful for an address space to partition its own
 threads.  Losing the capability to do this against the CPU controller would
 be a large set-back for instance.  Occasionally, it is useful to share these
 groupings between address spaces when processes are cooperative, but this is
 less of a requirement.

 This is important to us.


  The whole libvirt trainwreck also does this (the programming against
  cgroups, not the per task thing afaik).

 AFAIK, libvirt is doing multiple backends anyway and as long as the
 delegation rules are clear, libvirt managing its own subhierarchy is
 not a problem.  It's an administration software stack which requires
 fairly close integration with the userland part of operating system.

  You also cannot mandate system-disease, not everybody will want to run
  that monster. From what I understood last time, Google has no interest
  what so ever of using it.

 But what would require tight coupling of individual applications and
 something like systemd is the kernel failing to set up a reasonable
 boundary between management and application interfaces.  If the kernel
 provides a useable API for individual applications to use, they'll
 program against it and the management part can be whatever.  If we
 fail to do that, individual applications will have to talk to external
 agent to coordinate access to management interface


 It's notable here that for a managed system, the agent coordinating access
 *must* be external


 and that's what'll
 end up creating hard dependency on specific system agents from
 applications like apache or mysql or whatever.  We really don't want
 that.  The kernel *NEEDS* to clearly distinguish those two to prevent
 that from happening.

   I wrote this in the CAT thread too but cgroups may be an
   okay management / administration interface but is a horrible
   programming interface to be used by individual applications.
 
  Yeah, I need to catch up on that CAT thread, but the reality is, people
  use it as a programming interface, whether you like it or not.

 And that's one of the major fuck ups on cgroup's part that must be
 rectified.  Look at the interface being proposed there.  It's exposing
 direct hardware details w/o much abstraction which is fine for a
 system management interface but at the same time it's intended to be
 exposed to individual applications.


 FWIW this is something we've had no significant problems managing with
 separate mount points and file system protections.  Yes, there are some
 potential warts around atomicity; but we've not found them too onerous.

 What I

Re: [RFC PATCH 2/3] restartable sequences: x86 ABI

2015-06-26 Thread Paul Turner
On Fri, Jun 26, 2015 at 12:31 PM, Andy Lutomirski l...@amacapital.net wrote:
 On Fri, Jun 26, 2015 at 11:09 AM, Mathieu Desnoyers
 mathieu.desnoy...@efficios.com wrote:
 - On Jun 24, 2015, at 6:26 PM, Paul Turner p...@google.com wrote:

 Implements the x86 (i386  x86-64) ABIs for interrupting and restarting
 execution within restartable sequence sections.

 With respect to the x86-specific ABI:
  On 32-bit:   Upon restart, the interrupted rip is placed in %ecx
  On 64-bit (or x32):  Upon restart, the interrupted rip is placed in %r10

 While potentially surprising at first glance, this choice is strongly 
 motivated
 by the fact that the available scratch registers under the i386 function 
 call
 ABI overlap with those used as argument registers under x86_64.

 Given that sequences are already personality specific and that we always 
 want
 the arguments to be available for sequence restart, it's much more natural 
 to
 ultimately differentiate the ABI in these two cases.

 Signed-off-by: Paul Turner p...@google.com
 ---
 arch/x86/include/asm/restartable_sequences.h |   50 +++
 arch/x86/kernel/Makefile |2 +
 arch/x86/kernel/restartable_sequences.c  |   69 
 ++
 arch/x86/kernel/signal.c |   12 +
 kernel/restartable_sequences.c   |   11 +++-
 5 files changed, 141 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/include/asm/restartable_sequences.h
 create mode 100644 arch/x86/kernel/restartable_sequences.c

 diff --git a/arch/x86/include/asm/restartable_sequences.h
 b/arch/x86/include/asm/restartable_sequences.h
 new file mode 100644
 index 000..0ceb024
 --- /dev/null
 +++ b/arch/x86/include/asm/restartable_sequences.h
 @@ -0,0 +1,50 @@
 +#ifndef _ASM_X86_RESTARTABLE_SEQUENCES_H
 +#define _ASM_X86_RESTARTABLE_SEQUENCES_H
 +
 +#include <asm/processor.h>
 +#include <asm/ptrace.h>
 +#include <linux/sched.h>
 +
 +#ifdef CONFIG_RESTARTABLE_SEQUENCES
 +
 +static inline bool arch_rseq_in_crit_section(struct task_struct *p,
 +  struct pt_regs *regs)
 +{
 + struct task_struct *leader = p->group_leader;
 + struct restartable_sequence_state *rseq_state = &leader->rseq_state;
 +
 + unsigned long ip = (unsigned long)regs->ip;
 + if (unlikely(ip < (unsigned long)rseq_state->crit_end &&
 +  ip >= (unsigned long)rseq_state->crit_start))
 + return true;
 +
 + return false;
 +}
 +
 +static inline bool arch_rseq_needs_notify_resume(struct task_struct *p)
 +{
 +#ifdef CONFIG_PREEMPT
 + /*
 +  * Under CONFIG_PREEMPT it's possible for regs to be incoherent in the
 +  * case that we took an interrupt during syscall entry.  Avoid this by
 +  * always deferring to our notify-resume handler.
 +  */
 + return true;

 I'm a bit puzzled about this. If I look at perf_get_regs_user() in the perf
 code, task_pt_regs() seems to return the user-space pt_regs for a task with
 a current-mm set (iow, not a kernel thread), even if an interrupt nests on
 top of a system call. The only corner-case is NMIs, where an NMI may 
 interrupt
 in the middle of setting up the task pt_regs, but scheduling should never 
 happen
 there, right ?

 Careful, here!  task_pt_regs returns a pointer to the place where regs
 would be if they were fully initialized.  We can certainly take an
 interrupt in the middle of pt_regs setup (entry_SYSCALL_64 enables
 interrupts very early, for example).  To me, the question is whether
 we can ever be preemptable at such a time.

 It's a bit worse, though: we can certainly be preemptible when other
 code is accessing pt_regs.  clone, execve, sigreturn, and signal
 delivery come to mind.

Yeah Andy covered it exactly: interrupt in pt_regs setup.

With respect to whether we can be preemptible; I think we were
concerned about rescheduling during syscall entry but I'd have to
re-audit the current state of entry_64.S :)

Mathieu also wrote:
 Moving ENABLE_INTERRUPTS(CLBR_NONE) 3 instructions down, just after
 pushq   %rcx    /* pt_regs->ip */
 might solve your issue here. (in entry_SYSCALL_64_after_swapgs)

We considered doing something exactly like this; but I think any
potential changes here should be made in isolation of this series.


 Why don't we give up on poking at user state from the scheduler and do
 it on exit to user mode instead?  Starting in 4.3 (hopefully landing
 in -tip in a week or two), we should have a nice function
 prepare_exit_to_usermode that runs with well-defined state,
 non-reentrantly, that can do whatever you want here, *including user
 memory access*.

So this series already does the exact approximation of that:
The only thing we touch in the scheduler is looking at the kernel copy
pt_regs in the case we know it's safe to.

The entirety of *any* poking (both current cpu pointer updates and
potential rip manipulation) at user-state exactly happens in the
exit

Re: [RFC PATCH 0/3] restartable sequences: fast user-space percpu critical sections

2015-06-25 Thread Paul Turner
On Thu, Jun 25, 2015 at 6:15 PM, Mathieu Desnoyers
mathieu.desnoy...@efficios.com wrote:
 - On Jun 24, 2015, at 10:54 PM, Paul Turner p...@google.com wrote:

 On Wed, Jun 24, 2015 at 5:07 PM, Andy Lutomirski l...@amacapital.net wrote:
 On Wed, Jun 24, 2015 at 3:26 PM, Paul Turner p...@google.com wrote:
 This is a fairly small series demonstrating a feature we've found to be 
 quite
 powerful in practice, restartable sequences.


 On an extremely short glance, I'm starting to think that the right
 approach, at least for x86, is to implement per-cpu gsbase.  Then you
 could do cmpxchg with a gs prefix to atomically take a percpu lock and
 atomically release a percpu lock and check whether someone else stole
 the lock from you.  (Note: cmpxchg, unlike lock cmpxchg, is very
 fast.)

 This is totally useless for other architectures, but I think it would
 be reasonable clean on x86.  Thoughts?

 So this gives semantics that are obviously similar to this_cpu().
 This allows reasonable per-cpu counters (which alone is almost
 sufficient for a strong user-space RCU implementation, giving
 this some legs).

 However, unless there's a nice implementation trick I'm missing, the
 thing that stands out to me for locks (or other primitives) is that
 this forces a two-phase commit.  There's no way (short of say,
 cmpxchg16b) to perform a write conditional on the lock not having been
 stolen from us (and subsequently release the lock).

 e.g.
 1) We take the operation in some sort of speculative mode, such that
 another thread on the same cpu is still allowed to steal from us
 2) We prepare what we want to commit
 3) At this point we have to promote the lock taken in (1) to perform
 our actual commit, or see that someone else has stolen (1)
 4) Release the promoted lock in (3)

 However, this means that if we're preempted at (3) then no other
 thread on that cpu can make progress until we've been rescheduled and
 released the lock; a nice property of the model we have today is that
 threads sharing a cpu can not impede each other beyond what the
 scheduler allows.
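
[ Note: a rough sketch of the two-phase shape described above, assuming a
kernel-maintained per-cpu %gs base (which does not exist today) and a
hypothetical lock word at %gs offset 0; the step numbers match (1)-(4). ]

#include <stdint.h>

#define LOCK_OFF 0UL	/* assumption: per-cpu lock word lives at %gs:0 */

/* cmpxchg without a lock prefix: still a single instruction, so it cannot be
 * split by preemption on this cpu, which is all the per-cpu protocol needs. */
static inline uint32_t gs_cmpxchg(unsigned long off, uint32_t old, uint32_t newval)
{
	uint32_t prev = old;

	asm volatile("cmpxchgl %[newv], %%gs:(%[off])"
		     : [prev] "+a" (prev)
		     : [newv] "r" (newval), [off] "r" (off)
		     : "memory", "cc");
	return prev;
}

/* lock word states: 0 = free, 1 = taken speculatively (stealable), 2 = promoted */
static int percpu_two_phase_op(void)
{
	if (gs_cmpxchg(LOCK_OFF, 0, 1) != 0)		/* (1) speculative take */
		return -1;				/* owned elsewhere; retry */

	/* (2) prepare the commit; a peer on this cpu may still steal us here */

	if (gs_cmpxchg(LOCK_OFF, 1, 2) != 1)		/* (3) promote == commit point */
		return -1;				/* stolen; restart from (1) */

	/* ... commit stores; preemption here blocks peers until we run again ... */

	gs_cmpxchg(LOCK_OFF, 2, 0);			/* (4) release */
	return 0;
}

The window between (3) and (4) is the 'critical' critical section: nothing a
peer thread on the same cpu can do will make progress until the preempted
owner is rescheduled.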

 A lesser concern, but worth mentioning, is that there are also
 potential pitfalls in the interaction with signal handlers,
 particularly if a 2-phase commit is used.

 Assuming we have a gs segment we can use to address per-cpu locks
 in userspace, would the following scheme take care of some of your
 concerns ?

 per-cpu int32_t: each lock initialized to cpu_nr value

 per-cpu lock:
   get current cpu number. Remember this value as CPU lock nr.
   use cmpxchg on gs:lock to grab the lock.
   - Expect old value to be CPU lock nr.
   - Update with a lock flag in most significant bit, CPU lock nr
 in lower bits.
   - Retry if fails. Can be caused by migration or lock being already
 held.

So this is equivalent to just taking the lock in a non-speculative
mode (e.g. non-stealable) from the start.

The challenge remains exactly the same, as soon as an exclusive mode
exists you have some 'critical' critical inner section about the
commit, which impedes the progress of other threads if interrupted.
[The only reason you'd take the lock in any sort of speculative mode
above would be to reduce the length of this.]

I definitely agree that this gets us locks and counters; there are lots
of good things we can do with those, and those alone may be sufficient
to justify implementing it this way.  But I feel this leaves a lot of the
power of full lockless semantics on
the table.  As a concrete example, consider what this does to the
complexity and usefulness of Pop().
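
[ Note: to make the Pop() point concrete -- plain C shown only for
readability; in the restartable-sequence version the loads and the single
committing store below sit inside the registered asm region and the whole
thing restarts from the top if we are preempted before the store.  There is
no lock word at all, so a preempted thread holds nothing that can stall its
cpu-local peers. ]

struct node { struct node *next; };

/* Illustrative sketch of a per-cpu stack pop under an rseq-style guarantee. */
static struct node *percpu_stack_pop(struct node **head /* this cpu's list */)
{
	struct node *top = *head;	/* re-read from scratch on restart */

	if (!top)
		return NULL;
	*head = top->next;		/* the single committing store */
	return top;
}

Doing the same thing on top of the lock-based scheme requires taking and
releasing the per-cpu lock around the update, which reintroduces the
two-phase problem above.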


 per-cpu unlock:
   clear lock flag within the CPU lock nr lock.

 Thanks,

 Mathieu

 --
 Mathieu Desnoyers
 EfficiOS Inc.
 http://www.efficios.com


[RFC PATCH 0/3] restartable sequences: fast user-space percpu critical sections

2015-06-24 Thread Paul Turner
This is a fairly small series demonstrating a feature we've found to be quite
powerful in practice, restartable sequences.

Most simply: these sequences comprise small snippets of user-code that are
guaranteed to be (effectively) executed serially, with support for restart (or
other handling) in the event of preemption or interruption.

This (when combined with an in-memory cpu-id; potentially optional on some
architectures) can be used to implement extremely fast per-cpu operations that
do not rely on the use of actual processor atomics.  We've used this to back
performance critical code such as our malloc implementation with good results.

We previously discussed this feature at LPC:
  
http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf

Mathieu Desnoyers posted an alternate implementation based on the ideas above
at:
  https://lkml.org/lkml/2015/5/21/472

This RFC is based on the approach we currently use internally.  However, I'll
likely posit that neither this approach, nor the one posted above, is what we
should ultimately adopt (provided sufficient interest exists).

The implementation in this series can be summarized as follows:
- We allow a single (per-address) space critical section (and associated
  handler) to be defined.
- When a thread with RSEQ configured (via new syscall) is interrupted within
  a critical section, we modify its return address.  Either within signal
  delivery, or the notify-resume path.  The scheduler touchpoint is only a
  preempt_notifier which (potentially, dependent on config) examines the
  kernel copy of pt_regs.

There are a few core requirements which motivate the approach taken here:
1) We must not interfere with regular scheduling.  Unlike preemption protection
   (or similar), there are no special considerations for code running within a
   critical region beyond that we must move to the restart handler if
   interrupted.
2) The code executing in scheduler context must be tightly constrained, both in
   terms of overhead and that it must not require access to user memory.
3) The fast-path must be fast.  That is, user entry/execution/exit of a
   non-interrupted critical section is the most important case.  The restart
   handler is a 'slow' path that should only happen comparatively infrequently.
   We're strongly motivated here by high-performance, low-level primitives:
   think malloc, or rcu_read_lock.

While the contained implementation performs well under these constraints, it
has some notable limitations which we should consider for more general usage:

1) The definition of a single region works well for statically linked binaries
   but can be challenging when shared-libraries want to use this feature.  This
   is partially mitigated by our experience that a library of primitives is
   generally more useful than a myriad of custom sequences, but it's still a
   major concern.
2) Due to the nature of restart and explicit location requirements it's only
   really reasonable to write this critical section in assembly; which makes
   porting and maintenance less convenient.  (One of the included tests shows a
   reasonable example of what this looks like.)
3) TLS spreads the entrails of its implementation all over compile _and_ link.
   This makes properly handling it within the critical section cumbersome in
   the shared binary case.

We've been thinking about how to address these issues and are considering some
alternate ABIs that still satisfy (1)-(3), but have a more convenient
programming model.  I'll send this as a follow-up, but wanted to first share
the approach we've used to date.

Thanks,

- Paul





Re: [RFC PATCH 0/3] restartable sequences: fast user-space percpu critical sections

2015-06-24 Thread Paul Turner
On Wed, Jun 24, 2015 at 5:07 PM, Andy Lutomirski l...@amacapital.net wrote:
 On Wed, Jun 24, 2015 at 3:26 PM, Paul Turner p...@google.com wrote:
 This is a fairly small series demonstrating a feature we've found to be quite
 powerful in practice, restartable sequences.


 On an extremely short glance, I'm starting to think that the right
 approach, at least for x86, is to implement per-cpu gsbase.  Then you
 could do cmpxchg with a gs prefix to atomically take a percpu lock and
 atomically release a percpu lock and check whether someone else stole
 the lock from you.  (Note: cmpxchg, unlike lock cmpxchg, is very
 fast.)

 This is totally useless for other architectures, but I think it would
 be reasonable clean on x86.  Thoughts?

So this gives semantics that are obviously similar to this_cpu().
This allows reasonable per-cpu counters (which alone is almost
sufficient for a strong user-space RCU implementation, giving
this some legs).

However, unless there's a nice implementation trick I'm missing, the
thing that stands out to me for locks (or other primitives) is that
this forces a two-phase commit.  There's no way (short of say,
cmpxchg16b) to perform a write conditional on the lock not having been
stolen from us (and subsequently release the lock).

e.g.
 1) We take the operation in some sort of speculative mode, such that
another thread on the same cpu is still allowed to steal from us
 2) We prepare what we want to commit
 3) At this point we have to promote the lock taken in (1) to perform
our actual commit, or see that someone else has stolen (1)
 4) Release the promoted lock in (3)

However, this means that if we're preempted at (3) then no other
thread on that cpu can make progress until we've been rescheduled and
released the lock; a nice property of the model we have today is that
threads sharing a cpu can not impede each other beyond what the
scheduler allows.

A lesser concern, but worth mentioning, is that there are also
potential pitfalls in the interaction with signal handlers,
particularly if a 2-phase commit is used.

- Paul


[RFC PATCH 3/3] restartable sequences: basic user-space self-tests

2015-06-24 Thread Paul Turner
Implements two basic tests of RSEQ functionality.

The first, basic_test only asserts that RSEQ works moderately correctly.
E.g. that:
  - The CPUID pointer works
  - Code infinitely looping within a critical section will eventually be
interrupted.

basic_percpu_ops_test is a slightly more realistic variant, implementing a
few simple per-cpu operations and testing their correctness.  It also includes
a trivial example of how user-space may multiplex the critical section via the
restart handler.
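
[ Note: a usage sketch, not part of the patch: a per-cpu counter increment
built on the selftest's rseq_percpu_cmpxchg() and the kernel-maintained
__rseq_current_cpu TLS value (its declaration below is inferred from the
FETCH_CPU macro), following the return convention documented in the .S file:
the current cpu is returned on success or cpu mismatch, -1 if *p != old. ]

#include <stdint.h>

extern __thread volatile int __rseq_current_cpu;
int rseq_percpu_cmpxchg(int cpu, intptr_t *p, intptr_t old, intptr_t new);

static void percpu_counter_inc(intptr_t *counters /* one slot per cpu */)
{
	for (;;) {
		int cpu = __rseq_current_cpu;
		intptr_t old = counters[cpu];

		/* Only commits if we are still executing on 'cpu'. */
		if (rseq_percpu_cmpxchg(cpu, &counters[cpu], old, old + 1) == cpu)
			return;
		/* -1 (value raced) or a different cpu: just retry. */
	}
}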

Signed-off-by: Paul Turner p...@google.com
---
 tools/testing/selftests/rseq/Makefile  |   15 +
 .../testing/selftests/rseq/basic_percpu_ops_test.S |  131 ++
 .../testing/selftests/rseq/basic_percpu_ops_test.c |  250 
 tools/testing/selftests/rseq/basic_test.c  |   76 ++
 tools/testing/selftests/rseq/rseq.c|   48 
 tools/testing/selftests/rseq/rseq.h|   28 ++
 6 files changed, 548 insertions(+)
 create mode 100644 tools/testing/selftests/rseq/Makefile
 create mode 100644 tools/testing/selftests/rseq/basic_percpu_ops_test.S
 create mode 100644 tools/testing/selftests/rseq/basic_percpu_ops_test.c
 create mode 100644 tools/testing/selftests/rseq/basic_test.c
 create mode 100644 tools/testing/selftests/rseq/rseq.c
 create mode 100644 tools/testing/selftests/rseq/rseq.h

diff --git a/tools/testing/selftests/rseq/Makefile 
b/tools/testing/selftests/rseq/Makefile
new file mode 100644
index 000..c5a2b47
--- /dev/null
+++ b/tools/testing/selftests/rseq/Makefile
@@ -0,0 +1,15 @@
+CFLAGS += -Wall
+LDFLAGS += -lpthread
+
+TESTS = basic_test basic_percpu_ops_test
+
+basic_percpu_ops_test: basic_percpu_ops_test.c basic_percpu_ops_test.S
+
+all: $(TESTS)
+%: %.c
+   $(CC) $(CFLAGS) -o $@ $^ rseq.c $(LDFLAGS)
+
+include ../lib.mk
+
+clean:
+   $(RM) $(TESTS)
diff --git a/tools/testing/selftests/rseq/basic_percpu_ops_test.S 
b/tools/testing/selftests/rseq/basic_percpu_ops_test.S
new file mode 100644
index 000..7da7781
--- /dev/null
+++ b/tools/testing/selftests/rseq/basic_percpu_ops_test.S
@@ -0,0 +1,131 @@
+#include rseq.h
+
+#ifdef __x86_64__
+   .text
+   .code64
+
+#define FETCH_CPU(dest) movl %fs:__rseq_current_cpu@TPOFF, dest
+#define CRITICAL_SECTION_OFFSET(label) $label
+
+/* If start <= %RESTART_ADDR_REG < %end, jump to jump_to */
+#define HANDLE_REGION(start, end, jump_to) \
+   cmpq CRITICAL_SECTION_OFFSET(end), %RESTART_ADDR_REG; \
+   jge 1f; \
+   cmpq CRITICAL_SECTION_OFFSET(start), %RESTART_ADDR_REG; \
+   jge jump_to; \
+   1:;
+
+#define HANDLE_REGION_PREFIX(prefix, start, end, jump_to) \
+   HANDLE_REGION(prefix##start, prefix##end, prefix##jump_to)
+
+/*-
+ * Start of actual restartable sequences.
+ *---*/
+   .align 8
+   .globl RSEQ_CRITICAL_SECTION_START
+RSEQ_CRITICAL_SECTION_START:
+/* int rseq_percpu_lock() */
+   .globl rseq_percpu_lock
+   .type  rseq_percpu_lock, @function
+rseq_percpu_lock:
+   .cfi_startproc
+rseq_percpu_lock_region0:
+   FETCH_CPU(%eax)
+   leaq (,%eax,8), %RESTART_ADDR_REG
+   leaq (%rdi,%RESTART_ADDR_REG,8), %RESTART_ADDR_REG
+rseq_percpu_lock_retry:
+   cmpw $0, (%RESTART_ADDR_REG)
+   jne rseq_percpu_lock_retry
+   movw $1, (%RESTART_ADDR_REG)  /* 1 = lock owned */
+rseq_percpu_lock_region1:
+   ret
+rseq_percpu_lock_region2:
+   .cfi_endproc
+
+/*
+ * int rseq_cmpxchg(int cpu, intptr_t *p, intptr_t old, intptr_t new)
+ * int rseq_percpu_cmpxchgcheck(int cpu, intptr_t *p,
+ *  intptr_t old, intptr_t new,
+ *  intptr_t *check_ptr, intptr_t check_val)
+ *
+ * NOTE:  We don't use cmpxchg in the implementation below as that would make
+ * checking the success of our commit operation dependent on flags (which
+ * are in turn clobbered by the restart region) -- furthermore we can't just
+ * retry to fill in the flags since the restarted cmpxchg may have actually
+ * succeeded; spuriously failing subsequent attempts.
+ */
+
+   .globl rseq_percpu_cmpxchg
+   .type   rseq_percpu_cmpxchg, @function
+rseq_percpu_cmpxchg:
+   .cfi_startproc
+rseq_percpu_cmpxchg_region0:
+   FETCH_CPU(%eax)
+   cmp %eax, %edi   /* check cpu vs current_cpu */
+   jne rseq_percpu_cmpxchg_region1
+   cmp %rdx, (%rsi) /* verify *p == old */
+   jne rseq_percpu_cmpxchg_region2
+   mov %rcx, (%rsi)
+rseq_percpu_cmpxchg_region1:
+   ret/* return current cpu, indicating mismatch OR success */
+rseq_percpu_cmpxchg_region2:
+   mov $-1, %eax  /* mismatch versus old or check, return -1 */
+   ret
+rseq_percpu_cmpxchg_region3:
+   .cfi_endproc
+
+   .globl rseq_percpu_cmpxchgcheck
+   .type  rseq_percpu_cmpxchgcheck, @function

[RFC PATCH 2/3] restartable sequences: x86 ABI

2015-06-24 Thread Paul Turner
Implements the x86 (i386  x86-64) ABIs for interrupting and restarting
execution within restartable sequence sections.

With respect to the x86-specific ABI:
  On 32-bit:   Upon restart, the interrupted rip is placed in %ecx
  On 64-bit (or x32):  Upon restart, the interrupted rip is placed in %r10

While potentially surprising at first glance, this choice is strongly motivated
by the fact that the available scratch registers under the i386 function call
ABI overlap with those used as argument registers under x86_64.

Given that sequences are already personality specific and that we always want
the arguments to be available for sequence restart, it's much more natural to
ultimately differentiate the ABI in these two cases.

Signed-off-by: Paul Turner p...@google.com
---
 arch/x86/include/asm/restartable_sequences.h |   50 +++
 arch/x86/kernel/Makefile |2 +
 arch/x86/kernel/restartable_sequences.c  |   69 ++
 arch/x86/kernel/signal.c |   12 +
 kernel/restartable_sequences.c   |   11 +++-
 5 files changed, 141 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/include/asm/restartable_sequences.h
 create mode 100644 arch/x86/kernel/restartable_sequences.c

diff --git a/arch/x86/include/asm/restartable_sequences.h 
b/arch/x86/include/asm/restartable_sequences.h
new file mode 100644
index 000..0ceb024
--- /dev/null
+++ b/arch/x86/include/asm/restartable_sequences.h
@@ -0,0 +1,50 @@
+#ifndef _ASM_X86_RESTARTABLE_SEQUENCES_H
+#define _ASM_X86_RESTARTABLE_SEQUENCES_H
+
+#include <asm/processor.h>
+#include <asm/ptrace.h>
+#include <linux/sched.h>
+
+#ifdef CONFIG_RESTARTABLE_SEQUENCES
+
+static inline bool arch_rseq_in_crit_section(struct task_struct *p,
+struct pt_regs *regs)
+{
+   struct task_struct *leader = p->group_leader;
+   struct restartable_sequence_state *rseq_state = &leader->rseq_state;
+
+   unsigned long ip = (unsigned long)regs->ip;
+   if (unlikely(ip < (unsigned long)rseq_state->crit_end &&
+		ip >= (unsigned long)rseq_state->crit_start))
+   return true;
+
+   return false;
+}
+
+static inline bool arch_rseq_needs_notify_resume(struct task_struct *p)
+{
+#ifdef CONFIG_PREEMPT
+   /*
+* Under CONFIG_PREEMPT it's possible for regs to be incoherent in the
+* case that we took an interrupt during syscall entry.  Avoid this by
+* always deferring to our notify-resume handler.
+*/
+   return true;
+#else
+   return arch_rseq_in_crit_section(p, task_pt_regs(p));
+#endif
+}
+
+void arch_rseq_handle_notify_resume(struct pt_regs *regs);
+void arch_rseq_check_critical_section(struct task_struct *p,
+ struct pt_regs *regs);
+
+#else /* !CONFIG_RESTARTABLE_SEQUENCES */
+
+static inline void arch_rseq_handle_notify_resume(struct pt_regs *regs) {}
+static inline void arch_rseq_check_critical_section(struct task_struct *p,
+   struct pt_regs *regs) {}
+
+#endif
+
+#endif /* _ASM_X86_RESTARTABLE_SEQUENCES_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index febaf18..bd7827d 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -113,6 +113,8 @@ obj-$(CONFIG_TRACING)   += tracepoint.o
 obj-$(CONFIG_IOSF_MBI) += iosf_mbi.o
 obj-$(CONFIG_PMC_ATOM) += pmc_atom.o
 
+obj-$(CONFIG_RESTARTABLE_SEQUENCES)+= restartable_sequences.o
+
 ###
 # 64 bit specific files
 ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/restartable_sequences.c 
b/arch/x86/kernel/restartable_sequences.c
new file mode 100644
index 000..3b38013
--- /dev/null
+++ b/arch/x86/kernel/restartable_sequences.c
@@ -0,0 +1,69 @@
+/*
+ * Restartable Sequences: x86 ABI.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.
+ *
+ * Copyright (C) 2015, Google, Inc.,
+ * Paul Turner p...@google.com and Andrew Hunter a...@google.com
+ *
+ */
+
+#include <linux/sched.h>
+#include <linux/uaccess.h>
+#include <asm/restartable_sequences.h>
+
+void arch_rseq_check_critical_section(struct task_struct *p,
+ struct pt_regs *regs

[RFC PATCH 1/3] restartable sequences: user-space per-cpu critical sections

2015-06-24 Thread Paul Turner
Introduce the notion of 'restartable sequence'.  This is a user-defined range
within which we guarantee user-execution will occur serially with respect
to scheduling events such as migration or competition with other threads.

Preemption, or other interruption within this region, results in control being
transferred to a user-defined restart handler when rescheduled.  This handler
may arrange for the original operation to be retried, including potentially
resynchronizing with dependent state that may have been updated in the interim.

This may be used in combination with an in-memory cpu-id to allow user programs
to implement cpu-local data-structures and primitives, without the use/overhead
of any atomics.

The kernel ABI generally consists of:
- A single (per-address space) critical region
- A restart handler which pairs with the region above
- A (per-thread) memory location which will be kept current with its cpu

The definition of the above is performed via a new syscall,
  SYSCALL_DEFINE5(restartable_sequences,
  int, op, int, flags, long, val1, long, val2, long, val3)

There are currently 2 possible operations,
  1) Configure the critical region (and restart handler)
  2) Configure the per-thread cpu pointer

[ See kernel/restartable_sequences.c for full documentation ]

A thread that has not configured (2) will not be restarted when executing in
(1).

Note that while the kernel only sees a single critical region, arbitrarily many
sequences can be composed via multiplexing of the user-space restart handler.
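
[ Note: a hypothetical user-space registration sketch.  The syscall number
comes from the syscall_64.tbl hunk below; the op values and the argument
ordering are placeholders only -- the real definitions live in
kernel/restartable_sequences.c. ]

#include <unistd.h>
#include <sys/syscall.h>

#define __NR_restartable_sequences	323
#define RSEQ_OP_SET_CRITICAL		0	/* placeholder op number */
#define RSEQ_OP_SET_CPU_POINTER		1	/* placeholder op number */

/* Labels provided by the application's assembly critical section. */
extern char rseq_crit_start[], rseq_crit_end[], rseq_restart_handler[];
static __thread volatile int rseq_cpu = -1;

static int rseq_register_thread(void)
{
	/* (1) the single per-address-space critical region + restart handler */
	if (syscall(__NR_restartable_sequences, RSEQ_OP_SET_CRITICAL, 0,
		    (long)rseq_crit_start, (long)rseq_crit_end,
		    (long)rseq_restart_handler))
		return -1;
	/* (2) per-thread location the kernel keeps current with the cpu */
	return syscall(__NR_restartable_sequences, RSEQ_OP_SET_CPU_POINTER, 0,
		       (long)&rseq_cpu, 0, 0);
}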

This patch introduces the general framework for configuration, as well as
exposing the syscall.  We minimally expose x86 as having support (even though
the actual ABI is added by a subsequent patch) so that this can be compile
tested in isolation.

Signed-off-by: Paul Turner p...@google.com
---
 arch/Kconfig  |7 +
 arch/x86/Kconfig  |1 
 arch/x86/syscalls/syscall_64.tbl  |1 
 fs/exec.c |1 
 include/linux/sched.h |   28 ++
 include/uapi/asm-generic/unistd.h |5 +
 init/Kconfig  |9 ++
 kernel/Makefile   |1 
 kernel/restartable_sequences.c|  185 +
 kernel/sched/core.c   |4 +
 kernel/sched/sched.h  |3 +
 kernel/sys_ni.c   |3 +
 12 files changed, 246 insertions(+), 2 deletions(-)
 create mode 100644 kernel/restartable_sequences.c

diff --git a/arch/Kconfig b/arch/Kconfig
index a65eafb..fb31981 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -229,6 +229,13 @@ config HAVE_REGS_AND_STACK_ACCESS_API
  declared in asm/ptrace.h
  For example the kprobes-based event tracer needs this API.
 
+config HAVE_RESTARTABLE_SEQUENCE_SUPPORT
+   bool
+   depends on HAVE_REGS_AND_STACK_ACCESS_API
+   help
+ This symbol should be selected by an architecture if it supports an
+ implementation of restartable sequences.
+
 config HAVE_CLK
bool
help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8fec044..9c9c92f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -67,6 +67,7 @@ config X86
select HAVE_EFFICIENT_UNALIGNED_ACCESS
select USER_STACKTRACE_SUPPORT
select HAVE_REGS_AND_STACK_ACCESS_API
+   select HAVE_RESTARTABLE_SEQUENCE_SUPPORT
select HAVE_DMA_API_DEBUG
select HAVE_KERNEL_GZIP
select HAVE_KERNEL_BZIP2
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 9ef32d5..1de5cbc 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -329,6 +329,7 @@
 320common  kexec_file_load sys_kexec_file_load
 321common  bpf sys_bpf
 32264  execveatstub_execveat
+323common  restartable_sequences   sys_restartable_sequences
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/exec.c b/fs/exec.c
index 1977c2a..acd38f6 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1590,6 +1590,7 @@ static int do_execveat_common(int fd, struct filename 
*filename,
current-in_execve = 0;
acct_update_integrals(current);
task_numa_free(current);
+   rseq_clear_state_exec();
free_bprm(bprm);
kfree(pathbuf);
putname(filename);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index af0eeba..0540735 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1178,6 +1178,22 @@ struct mempolicy;
 struct pipe_inode_info;
 struct uts_namespace;
 
+#ifdef CONFIG_RESTARTABLE_SEQUENCES
+struct restartable_sequence_state {
+   /* Start and end of an address space's critical section. */
+   void __user *crit_start, __user *crit_end;
+   /* Where preempted threads will be restarted. */
+   void __user *crit_restart;
+   /* Thread's current CPU, typically

Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections

2015-05-21 Thread Paul Turner
Very nice!

Some inline notes:

On Thu, May 21, 2015 at 7:44 AM, Mathieu Desnoyers
mathieu.desnoy...@efficios.com wrote:
 Expose a new system call allowing userspace threads to register
 a TLS area used as an ABI between the kernel and userspace to
 share information required to create efficient per-cpu critical
 sections in user-space.

 This ABI consists of a thread-local structure containing:

 - a nesting count surrounding the critical section,
 - a signal number to be sent to the thread when preempting a thread
   with non-zero nesting count,
 - a flag indicating whether the signal has been sent within the
   critical section,
 - an integer where to store the current CPU number, updated whenever
   the thread is preempted. This CPU number cache is not strictly
   needed, but performs better than getcpu vdso.

While it is fractionally faster, this is actually something we've
strongly considered dropping.  The complexity of correct TLS
addressing is non-trivial.


 This approach is inspired by Paul Turner and Andrew Hunter's work
 on percpu atomics, which lets the kernel handle restart of critical
 sections, ref. 
 http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf

 What is done differently here compared to percpu atomics: we track
 a single nesting counter per thread rather than many ranges of
 instruction pointer values. We deliver a signal to user-space and
 let the logic of restart be handled in user-space, thus moving
 the complexity out of the kernel. The nesting counter approach
 allows us to skip the complexity of interacting with signals that
 would be otherwise needed with the percpu atomics approach, which
 needs to know which instruction pointers are preempted, including
 when preemption occurs on a signal handler nested over an instruction
 pointer of interest.

 Advantages of this approach over percpu atomics:
 - kernel code is relatively simple: complexity of restart sections
   is in user-space,

I think delivering a signal is actually much more complex than the
address check here.
 - easy to port to other architectures: just need to reserve a new
   system call,

Not having an ABI is a potential nice advantage here.

 - for threads which have registered a TLS structure, the fast-path
   at preemption is only a nesting counter check, along with the
   optional store of the current CPU number, rather than comparing
   instruction pointer with possibly many registered ranges,

I'd argue this is not actually an advantage:  The preemption is the
_slow path_.  For a reasonably behaving application this is something
that should be happening on O(ms) boundaries.  Whereas entry/exit of
the critical section may be occurring much more frequently.  This is
also coming at the cost of a slower fast path (nesting updates).

It's also worth noting that the range comparison can be done in logN
(binary search) or even constant time (bound max sequence size and
mask out bits on address, or use a lookup table).
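
[ Note: a minimal sketch of the "mask out bits" variant, assuming every
sequence is padded to a fixed power-of-two slot (256 bytes here) inside one
contiguous text range; all names here are illustrative.  Finding the slot is
a subtract and a shift, and the final test is the same single unsigned
compare as in the one-region case. ]

struct rseq_slot {
	unsigned long start;	/* first byte of the real sequence in this slot */
	unsigned long len;	/* its true length, <= the slot size */
};

#define RSEQ_SLOT_SHIFT 8	/* assumed bound: sequences padded to 256 bytes */

static int ip_in_critical_section(unsigned long ip, unsigned long region_base,
				  const struct rseq_slot *table,
				  unsigned long nr_slots)
{
	unsigned long idx = (ip - region_base) >> RSEQ_SLOT_SHIFT;

	if (idx >= nr_slots)
		return 0;
	return (ip - table[idx].start) < table[idx].len;
}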

That said there are two very specific advantages:
1) CONFIG_PREEMPT greatly complicates correct inspection of the user address.
2) One region per address space is a difficult model for binaries with
shared linkage.

[ Note:  Compatibility of (1) is considered the primary problem with
our current patch-set.   I'll publish it as is, then try to follow up
with some alternatives we've been considering to get around this. ]


 Caveats of this approach compared to the percpu atomics:
 - We need a signal number for this, so it cannot be done without
   designing the application accordingly,
 - Handling restart in user-space is currently performed with page
   protection, for which we install a SIGSEGV signal handler. Again,
   this requires designing the application accordingly, especially
   if the application installs its own segmentation fault handler,
 - It cannot be used for tracing of processes by injection of code
   into their address space, due to interactions with application
   signal handlers.

 The user-space proof of concept code implementing the restart section
 can be found here: https://github.com/compudj/percpu-dev

 Benchmarking sched_getcpu() vs tls cache approach. Getting the
 current CPU number:

 - With Linux vdso:12.7 ns
 - With TLS-cached cpu number:  0.3 ns


Have you also tried a naked read against the segment limit?  This is
more comparable to TLS.  However, it's a definite ABI challenge.
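
[ Note: for reference, this is roughly what a naked read against the segment
limit looks like -- the pre-RDTSCP vdso getcpu encoded the cpu (and node) in
the limit of a per-cpu GDT segment and recovered it with lsl.  The selector
value below is an assumption for illustration. ]

#define PER_CPU_SEG_SELECTOR	(15 * 8 + 3)	/* assumed GDT_ENTRY_PER_CPU = 15 */

static inline unsigned int cpu_from_segment_limit(void)
{
	unsigned int p;

	/* lsl loads the segment limit; the low 12 bits hold the cpu number */
	asm("lsl %1, %0" : "=r" (p) : "r" (PER_CPU_SEG_SELECTOR));
	return p & 0xfff;
}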
 We will use the TLS-cached cpu number for the following
 benchmarks.

 On an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, comparison
 with a baseline running very few load/stores (no locking,
 no getcpu, assuming one thread per CPU with affinity),
 against locking scheme based on lock; cmpxchg, cmpxchg
 (using restart signal), load-store (using restart signal).
 This is performed with 32 threads on a 16-core, hyperthread
 system:

  ns/loop  overhead (ns)
 Baseline:  3.7   0.0
 lock

Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections

2015-05-21 Thread Paul Turner
On Thu, May 21, 2015 at 12:08 PM, Mathieu Desnoyers
mathieu.desnoy...@efficios.com wrote:
 - Original Message -
 On Thu, May 21, 2015 at 10:44:47AM -0400, Mathieu Desnoyers wrote:

  +struct thread_percpu_user {
  +   int32_t nesting;
  +   int32_t signal_sent;
  +   int32_t signo;
  +   int32_t current_cpu;
  +};

 I would require this thing be naturally aligned, such that it does not
 cross cacheline boundaries.

 Good point. Adding a comment into the code to that effect.


  +
  +static void percpu_user_sched_in(struct preempt_notifier *notifier, int
  cpu)
  +{
  +   struct thread_percpu_user __user *tpu_user;
  +   struct thread_percpu_user tpu;
  +   struct task_struct *t = current;
  +
  +   tpu_user = t->percpu_user;
  +   if (tpu_user == NULL)
  +   return;
  +   if (unlikely(t->flags & PF_EXITING))
  +   return;
  +   /*
  +* access_ok() of tpu_user has already been checked by sys_percpu().
  +*/
  +   if (__put_user(smp_processor_id(), &tpu_user->current_cpu)) {
  +   WARN_ON_ONCE(1);
  +   return;
  +   }

 This seems a waste; you already read the number unconditionally, might
 as well double check and avoid the store.

  +   if (__copy_from_user(&tpu, tpu_user, sizeof(tpu))) {
  +   WARN_ON_ONCE(1);
  +   return;
  +   }

   if (tpu.current_cpu != smp_processor_id())
   __put_user();

 Yep, and I could even use the cpu parameter received by the
 function rather than smp_processor_id().




  +   if (!tpu.nesting || tpu.signal_sent)
  +   return;
  +   if (do_send_sig_info(tpu.signo, SEND_SIG_PRIV, t, 0)) {
  +   WARN_ON_ONCE(1);
  +   return;
  +   }
  +   tpu.signal_sent = 1;
  +   if (__copy_to_user(tpu_user, &tpu, sizeof(tpu))) {
  +   WARN_ON_ONCE(1);
  +   return;
  +   }
  +}

 Please do not use preempt notifiers for this.

 Do you recommend we issue a function call from the scheduler
 finish_task_switch() ?


 Second, this all is done with preemption disabled, this means that all
 that user access can fail.

 OK, this is one part I was worried about.


 You print useless WARNs and misbehave. If you detect a desire to fault,
 you could delay until return to userspace and try again there. But it
 all adds complexity.

 We could keep a flag, and then call the function again if we detect a
 desire to fault.


 The big advantage pjt's scheme had is that we have the instruction
 pointer, we do not need to go read userspace memory that might not be
 there. And it being limited  to a single range, while inconvenient,
 simplifies the entire kernel side to:

   if ((unsigned long)(ip - offset) < size)
   do_magic();

 Which is still simpler than the above.

Yeah, we've thought about this part of the API in some depth.  There
are obvious pros and cons.

The advantages of this approach are that:
  a) All of the demuxing can be done in userspace; the kernel check is,
as you say, wholly trivial.
  b) We do not need to do anything complicated, e.g. signal delivery

The largest drawback is that it does require centralization of your
percpu regions.  It's been our experience that most features are best
implemented using a small library of operations rather than
long-form-atomic-sequences.  This is a notable challenge for binaries
with shared linkage.


 There is one big aspect of pjt's approach that I still don't grasp
 after all this time that makes me worry. How does it interact with
 the following scenario ?

 Userspace thread
   - within the code region that needs to be restarted
 - signal handler nested on top
   - running within the signal handler code
 - preempted by kernel
   - checking instruction pointer misses the userspace stack
 underneath the signal handler.

 Given this scenario, is the kernel code really as simple as a pointer check
 on pt_regs, or do we need a stack walk over all signal frames ? Another way
 would be to check for the pt_regs instruction pointer whenever we receive
 a signal, but then it would require per-architectures modifications, and
 suddenly becomes less straightforward.


This is actually straightforward to handle:
On signal delivery, as you are setting up the signal stack, you modify
the interrupted state so that the signal handler will return into the
restart section rather than the original code.
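
[ Note: a rough sketch of that redirect (x86-64), reusing the
restartable_sequence_state fields from patch 1/3; the actual hook in the
series is the small change to arch/x86/kernel/signal.c, called while the
signal frame is being set up. ]

#include <linux/sched.h>
#include <asm/ptrace.h>
#include <asm/restartable_sequences.h>

static void rseq_redirect_to_restart(struct task_struct *t, struct pt_regs *regs)
{
	struct restartable_sequence_state *rseq_state =
		&t->group_leader->rseq_state;
	unsigned long ip = regs->ip;

	/* If we interrupted the critical region, make execution resume in the
	 * restart section (after the handler returns) rather than mid-sequence. */
	if (ip >= (unsigned long)rseq_state->crit_start &&
	    ip <  (unsigned long)rseq_state->crit_end)
		regs->ip = (unsigned long)rseq_state->crit_restart;
}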


Re: [PATCH] sched: fix timeval conversion to jiffies

2014-09-04 Thread Paul Turner
On Thu, Sep 4, 2014 at 2:30 PM, John Stultz john.stu...@linaro.org wrote:
 On Thu, Sep 4, 2014 at 2:17 PM, Andrew Hunter a...@google.com wrote:
 On Wed, Sep 3, 2014 at 5:06 PM, John Stultz john.stu...@linaro.org wrote:
 Maybe with the next version of the patch, before you get into the
 unwinding the math, you might practically describe what is broken,
 then explain how its broken.

 My quick read here is that we're converting a timespec - jiffies, and
 in doing so we round up by one jiffy.

 This seems actually perfectly normal, as we usually end up rounding up
 by a jiffy in many cases since we don't want to undershoot any
 timeout, and we're always allowed return later then specified.

 Well, yes, timeouts can be longer than specified, but what you said
 technically applies just as much to code that arbitrarily multiplies
 jiffies by 10 before returning, no? :)

 The problem isn't the rounding, it's that the rounding is _wrong_: I'm
 fine rounding 10100usec / (1000 usec/jiffie) = 11 jiffies. The current
 code rounds 10000usec / (1000 usec/jiffies) to 11.  I've rewritten the
 description to make this clearer.

 Ok. Very much appreciated!

 In particular, with HZ=1000, we consistently computed that 10000 usec
 was 11 jiffies; the same was true for any exact multiple of
 TICK_NSEC. This is obviously bad as a general rule, and caused
 observable user problems with setitimer() at the very least:

 setitimer(ITIMER_PROF, val, NULL);
 setitimer(ITIMER_PROF, NULL, val);

 would actually add a tick to val!

 So this looks like an issue. Since we convert and store the internal
 representation in jiffies, when we pull it back out, we get the
 rounded up value, which is larger than the timespec value originally
 submitted.  This is really the core issue, correct?

 For the particular user bug reported to me, yes, this was the core
 issue: some code that stopped and restarted an itimer found the
 interval growing by 1ms each time.  But again, it wasn't that it was
 rounded:  if we initially passed e.g. 10500 usec and got back 11000,
 that'd be annoying but workable, because if we then went through
 another cycle of enabling/disabling itimer, we'd set it to 11000 usec
 and get back 11000 again.  What we have now instead adds a full jiffie
 _every time_.

 Ah, ok. This part is key to understanding the problem. Thanks for
 clarifying this.

 This seems to be a quite old bug.. Do you think this is needed for -stable?

Seems reasonable to me.


 thanks
 -john


Re: [PATCH] sched: fix timeval conversion to jiffies

2014-09-03 Thread Paul Turner
John -- Any chance to take a look at this?  (The dilation is pretty
unfortunate when profiling reprograms a timer with it.)

On Thu, Aug 7, 2014 at 5:09 PM, Andrew Hunter a...@google.com wrote:
 timeval_to_jiffies rounding was broken.  It essentially computed
 (eliding seconds)

 jiffies = usec  * (NSEC_PER_USEC/TICK_NSEC)

 by using scaling arithmetic, which took the best approximation of
 NSEC_PER_USEC/TICK_NSEC with denominator of 2^USEC_JIFFIE_SC =
 x/(2^USEC_JIFFIE_SC), and computed:

 jiffies = (usec * x)  USEC_JIFFIE_SC

 and it rounded this calculation up in the intermediate form (since we
 can't necessarily exactly represent TICK_NSEC in usec.) But the
 scaling arithmetic is a (very slight) *over*approximation of the true
 value: rounding the division up by adding 2^USEC_JIFFIE_SC - 1, when
 the error in scaling was added in, was sufficient to add one jiffie to
 the final result.

 In particular, with HZ=1000, we consistently computed that 10000 usec
 was 11 jiffies; the same was true for any exact multiple of
 TICK_NSEC. This is obviously bad as a general rule, and caused
 observable user problems with setitimer() at the very least:

 setitimer(ITIMER_PROF, val, NULL);
 setitimer(ITIMER_PROF, NULL, val);

 would actually add a tick to val!

 We could possibly still round in the intermediate form, adding
 something less than 2^USEC_JIFFIE_SC - 1, but easier still is to
 convert usec-nsec, round in nanoseconds, and then convert using
 time*spec*_to_jiffies.  This adds one constant multiplication, and is
 not observably slower in microbenchmarks on recent x86 hardware.

 Tested: the following program:

 int main() {
   struct itimerval zero = {{0, 0}, {0, 0}};
   /* Initially set to 10 ms. */
   struct itimerval initial = zero;
   initial.it_interval.tv_usec = 10000;
   setitimer(ITIMER_PROF, initial, NULL);
   /* Save and restore several times. */
   for (size_t i = 0; i  10; ++i) {
 struct itimerval prev;
 setitimer(ITIMER_PROF, zero, prev);
 /* on old kernels, this goes up by TICK_USEC every iteration */
 printf(previous value: %ld %ld %ld %ld\n,
prev.it_interval.tv_sec, prev.it_interval.tv_usec,
prev.it_value.tv_sec, prev.it_value.tv_usec);
 setitimer(ITIMER_PROF, prev, NULL);
   }
 return 0;
 }

 Signed-off-by: Andrew Hunter a...@google.com
 Reviewed-by: Paul Turner p...@google.com
 Reported-by: Aaron Jacobs jaco...@google.com
 Change-Id: I7cd0f0764847fd055d39531f54e6ea3dd3ce5453
 ---
  include/linux/jiffies.h | 12 ---
  kernel/time.c   | 54 
 +++--
  2 files changed, 30 insertions(+), 36 deletions(-)

 diff --git a/include/linux/jiffies.h b/include/linux/jiffies.h
 index 1f44466..c367cbd 100644
 --- a/include/linux/jiffies.h
 +++ b/include/linux/jiffies.h
 @@ -258,23 +258,11 @@ extern unsigned long preset_lpj;
  #define SEC_JIFFIE_SC (32 - SHIFT_HZ)
  #endif
  #define NSEC_JIFFIE_SC (SEC_JIFFIE_SC + 29)
 -#define USEC_JIFFIE_SC (SEC_JIFFIE_SC + 19)
  #define SEC_CONVERSION ((unsigned long)((((u64)NSEC_PER_SEC << SEC_JIFFIE_SC) +\
  TICK_NSEC -1) / (u64)TICK_NSEC))

  #define NSEC_CONVERSION ((unsigned long)((((u64)1 << NSEC_JIFFIE_SC) +\
  TICK_NSEC -1) / (u64)TICK_NSEC))
 -#define USEC_CONVERSION  \
 -((unsigned long)((((u64)NSEC_PER_USEC << USEC_JIFFIE_SC) +\
 -TICK_NSEC -1) / (u64)TICK_NSEC))
 -/*
 - * USEC_ROUND is used in the timeval to jiffie conversion.  See there
 - * for more details.  It is the scaled resolution rounding value.  Note
 - * that it is a 64-bit value.  Since, when it is applied, we are already
 - * in jiffies (albit scaled), it is nothing but the bits we will shift
 - * off.
 - */
 -#define USEC_ROUND (u64)(((u64)1 << USEC_JIFFIE_SC) - 1)
  /*
   * The maximum jiffie value is (MAX_INT >> 1).  Here we translate that
   * into seconds.  The 64-bit case will overflow if we are not careful,
 diff --git a/kernel/time.c b/kernel/time.c
 index 7c7964c..3c49ab4 100644
 --- a/kernel/time.c
 +++ b/kernel/time.c
 @@ -496,17 +496,20 @@ EXPORT_SYMBOL(usecs_to_jiffies);
   * that a remainder subtract here would not do the right thing as the
   * resolution values don't fall on second boundries.  I.e. the line:
   * nsec -= nsec % TICK_NSEC; is NOT a correct resolution rounding.
 + * Note that due to the small error in the multiplier here, this
 + * rounding is incorrect for sufficiently large values of tv_nsec, but
 + * well formed timespecs should have tv_nsec < NSEC_PER_SEC, so we're
 + * OK.
   *
   * Rather, we just shift the bits off the right.
   *
   * The >> (NSEC_JIFFIE_SC - SEC_JIFFIE_SC) converts the scaled nsec
   * value to a scaled second value.
   */
 -unsigned long
 -timespec_to_jiffies(const struct timespec *value)
 +static unsigned long
 +__timespec_to_jiffies(unsigned long sec, long nsec)
  {
 -   unsigned long sec = value->tv_sec

Re: [PATCH v2] sched: Reduce contention in update_cfs_rq_blocked_load

2014-08-26 Thread Paul Turner
On Tue, Aug 26, 2014 at 4:11 PM, Jason Low jason.l...@hp.com wrote:
 Based on perf profiles, the update_cfs_rq_blocked_load function constantly
 shows up as taking up a noticeable % of system run time. This is especially
 apparent on larger numa systems.

 Much of the contention is in __update_cfs_rq_tg_load_contrib when we're
 updating the tg load contribution stats. However, it was noticed that the
 values often don't get modified by much. In fact, much of the time, they
 don't get modified at all. However, the update can always get attempted due
 to force_update.

 In this patch, we remove the force_update in only the
 __update_cfs_rq_tg_load_contrib. Thus the tg load contrib stats now get
 modified only if the delta is large enough (in the current code, they get
 updated when the delta is larger than 12.5%). This is a way to rate-limit
 the updates while largely keeping the values accurate.

 When testing this change with AIM7 workloads, we found that it was able to
 reduce the overhead of the function by up to a factor of 20x.

Looks reasonable.


 Cc: Yuyang Du yuyang...@intel.com
 Cc: Waiman Long waiman.l...@hp.com
 Cc: Mel Gorman mgor...@suse.de
 Cc: Mike Galbraith umgwanakikb...@gmail.com
 Cc: Rik van Riel r...@redhat.com
 Cc: Aswin Chandramouleeswaran as...@hp.com
 Cc: Chegu Vinod chegu_vi...@hp.com
 Cc: Scott J Norton scott.nor...@hp.com
 Signed-off-by: Jason Low jason.l...@hp.com
 ---
  kernel/sched/fair.c |   10 --
  1 files changed, 4 insertions(+), 6 deletions(-)

 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
 index fea7d33..7a6e18b 100644
 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -2352,8 +2352,7 @@ static inline u64 __synchronize_entity_decay(struct 
 sched_entity *se)
  }

  #ifdef CONFIG_FAIR_GROUP_SCHED
 -static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
 -int force_update)
 +static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq)
  {
 struct task_group *tg = cfs_rq-tg;
 long tg_contrib;
 @@ -2361,7 +2360,7 @@ static inline void 
 __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
 tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
 tg_contrib -= cfs_rq->tg_load_contrib;

 -   if (force_update || abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {

Another option with slightly higher accuracy would be to increase the
sensitivity here when force_update == 1.

E.g.:
abs(tg_contrib) > cfs_rq->tg_load_contrib / (8 * (1 + force_update))) { ...

Alternatively we could bound total inaccuracy:
   int divisor = force_update ? NR_CPUS : 8;
   if (abs(tg_contrib) > cfs_rq->tg_load_contrib / divisor) { ...


[ And probably rename force_update to want_update ]


 +   if (abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
 atomic_long_add(tg_contrib, &tg->load_avg);
 cfs_rq->tg_load_contrib += tg_contrib;
 }
 @@ -2436,8 +2435,7 @@ static inline void update_rq_runnable_avg(struct rq 
 *rq, int runnable)
 __update_tg_runnable_avg(&rq->avg, &rq->cfs);
  }
  #else /* CONFIG_FAIR_GROUP_SCHED */
 -static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
 -int force_update) {}
 +static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq) {}
  static inline void __update_tg_runnable_avg(struct sched_avg *sa,
   struct cfs_rq *cfs_rq) {}
  static inline void __update_group_entity_contrib(struct sched_entity *se) {}
 @@ -2537,7 +2535,7 @@ static void update_cfs_rq_blocked_load(struct cfs_rq 
 *cfs_rq, int force_update)
 cfs_rq->last_decay = now;
 }

 -   __update_cfs_rq_tg_load_contrib(cfs_rq, force_update);
 +   __update_cfs_rq_tg_load_contrib(cfs_rq);
  }

  /* Add the load generated by se into cfs_rq's child load-average */
 --
 1.7.1





Re: [PATCH] Revert sched: Fix sleep time double accounting in enqueue entity

2014-01-22 Thread Paul Turner
On Wed, Jan 22, 2014 at 9:53 AM,  bseg...@google.com wrote:
 Vincent Guittot vincent.guit...@linaro.org writes:

 This reverts commit 282cf499f03ec1754b6c8c945c9674b02631fb0f.

 With the current implementation, the load average statistics of a sched 
 entity
 change according to other activity on the CPU even if this activity is done
 between the running window of the sched entity and has no influence on the
 running duration of the task.

 When a task wakes up on the same CPU, we currently update 
 last_runnable_update
 with the return value of __synchronize_entity_decay without updating the
 runnable_avg_sum and runnable_avg_period accordingly. In fact, we have to 
 sync
 the load_contrib of the se with the rq's blocked_load_contrib before removing
 it from the latter (with __synchronize_entity_decay) but we must keep
 last_runnable_update unchanged for updating runnable_avg_sum/period during 
 the
 next update_entity_load_avg.

 Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
 Unless paul wants to squash this into a possible change to the if
 (wakeup) stuff:
 Reviewed-by: Ben Segall bseg...@google.com


I can send that separately as Vincent suggests.  This is good to go.


 ---
  kernel/sched/fair.c |8 +---
  1 file changed, 1 insertion(+), 7 deletions(-)

 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
 index e64b079..6d61f20 100644
 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -2365,13 +2365,7 @@ static inline void enqueue_entity_load_avg(struct 
 cfs_rq *cfs_rq,
   }
   wakeup = 0;
   } else {
 - /*
 -  * Task re-woke on same cpu (or else migrate_task_rq_fair()
 -  * would have made count negative); we must be careful to avoid
 -  * double-accounting blocked time after synchronizing decays.
 -  */
 - se->avg.last_runnable_update += __synchronize_entity_decay(se)
 -  << 20;
 + __synchronize_entity_decay(se);
   }

   /* migrated tasks did not contribute to our blocked load */


Re: [PATCH] sched: fix sched_entity avg statistics update

2014-01-21 Thread Paul Turner
On Tue, Jan 21, 2014 at 12:00 PM, Vincent Guittot
vincent.guit...@linaro.org wrote:

 On 21 Jan 2014 at 19:39, bseg...@google.com wrote:



 Vincent Guittot vincent.guit...@linaro.org writes:

  With the current implementation, the load average statistics of a sched
  entity
  change according to other activity on the CPU even if this activity is
  done
  between the running window of the sched entity and has no influence on
  the
  running duration of the task.
 
  When a task wakes up on the same CPU, we currently update
  last_runnable_update
  with the return value of __synchronize_entity_decay without updating the
  runnable_avg_sum and runnable_avg_period accordingly. In fact, we have
  to sync
  the load_contrib of the se with the rq's blocked_load_contrib before
  removing
  it from the latter (with __synchronize_entity_decay) but we must keep
  last_runnable_update unchanged for updating runnable_avg_sum/period
  during the
  next update_entity_load_avg.

 ... Gah, that's correct, we had this right the first time. Could you do
 this as a full revert of 282cf499f03ec1754b6c8c945c9674b02631fb0f (ie
 remove the now inaccurate comment, or maybe replace it with a correct
 one).

 Ok i'm going to remove comment as well and replace it with a new description


I think I need to go through and do a comments patch like we did with
the wake-affine math; it's too easy to make a finicky mistake like
this when not touching this path for a while.

OK, so there are two numerical components we're juggling here:

1) The actual quotient for the current runnable average, stored as
(runnable_avg_sum / runnable_avg_period).  Last updated at
last_runnable_update.
2) The last time we computed the quotient in (1) and accumulated it
within cfs_rq->{runnable, blocked}_load_avg; this is stored in
load_avg_contrib.  We track the passage of off-rq time and migrations
against this value using decay_count.
[ All of the values above are stored on se / se-avg ]

When we are re-enqueuing something and we wish to remove its
contribution from blocked_load_avg, we must update load_avg_contrib in
(2) using the total time it spent off rq (using a jiffy rounded
approximation in decay_count).  However, Alex's patch (which this
reverts) also adjusted the quotient by modifying its last update time
so as to make it look up-to-date, effectively skipping the most recent
idle span.

I think we could make the connection between (1) and (2) more explicit
if we moved the subsequent if (wakeup) logic within the else.  We
can then have a comment that refers to (1) and (2) explicitly, perhaps
something like:

Task re-woke on same cpu (or else migrate_task_rq_fair() would have
made count negative).  Perform an approximate decay on
load_avg_contrib to match blocked_load_avg, and compute a precise
runnable_avg_sum quotient update that will be accumulated into
runnable_load_avg below.
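
A rough, untested sketch of that restructuring (helper names as in the
current enqueue_entity_load_avg()):

	} else {
		/*
		 * Task re-woke on same cpu (or else migrate_task_rq_fair()
		 * would have made count negative).  Perform an approximate
		 * decay on load_avg_contrib to match blocked_load_avg (2),
		 * and compute a precise runnable_avg_sum quotient update (1)
		 * that will be accumulated into runnable_load_avg below.
		 */
		__synchronize_entity_decay(se);
		if (wakeup) {
			subtract_blocked_load_contrib(cfs_rq,
					se->avg.load_avg_contrib);
			update_entity_load_avg(se, 0);
		}
	}

Whether wakeup then still needs to be visible for the later
update_cfs_rq_blocked_load(cfs_rq, !wakeup) call would need checking.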



 
  Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
  ---
   kernel/sched/fair.c |3 +--
   1 file changed, 1 insertion(+), 2 deletions(-)
 
  diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
  index e64b079..5b0ef90 100644
  --- a/kernel/sched/fair.c
  +++ b/kernel/sched/fair.c
  @@ -2370,8 +2370,7 @@ static inline void enqueue_entity_load_avg(struct
  cfs_rq *cfs_rq,
 * would have made count negative); we must be careful to
  avoid
 * double-accounting blocked time after synchronizing
  decays.
 */
  - se->avg.last_runnable_update +=
  __synchronize_entity_decay(se)
  -  << 20;
  + __synchronize_entity_decay(se);
}
 
/* migrated tasks did not contribute to our blocked load */


Re: [PATCH 0/4] sched: remove cpu_load decay

2013-12-09 Thread Paul Turner
On Mon, Dec 9, 2013 at 5:04 PM, Alex Shi alex@linaro.org wrote:
 On 12/03/2013 06:26 PM, Peter Zijlstra wrote:

 Paul, can you guys have a look at this, last time around you have a
 regression with this stuff, so it would be good to hear from you.


 Ping Paul.


Ben was looking at this right before the thanksgiving break.  I'll ask
him what he has tomorrow.

 --
 Thanks
 Alex


Re: [PATCH] [sched]: pick the NULL entity caused the panic.

2013-11-11 Thread Paul Turner
On Tue, Nov 12, 2013 at 8:29 AM, Wang, Xiaoming xiaoming.w...@intel.com wrote:
 cfs_rq gets its group run queue, but the value of
 cfs_rq->nr_running may be zero, which will cause
 the panic in pick_next_task_fair.
 So a check of cfs_rq->nr_running is needed.

 [15729.985797] BUG: unable to handle kernel NULL pointer dereference at 
 0008
 [15729.993838] IP: [c15266f1] rb_next+0x1/0x50
 [15729.998745] *pdpt = 2861a001 *pde = 
 [15730.005221] Oops:  [#1] PREEMPT SMP
 [15730.009677] Modules linked in: atomisp_css2400b0_v2 lm3554 ov2722 imx1x5 
 atmel_mxt_ts
 vxd392 videobuf_vmalloc videobuf_core lm_dump(O) bcm_bt_lpm hdmi_audio 
 bcm4334x(O) kct_daemon(O)
 [15730.028159] CPU: 1 PID: 2510 Comm: mts Tainted: G W O 
 3.10.16-261326-g88236a2 #1
 [15730.037215] task: e86ff080 ti: e83ac000 task.ti: e83ac000
 [15730.043261] EIP: 0060:[c15266f1] EFLAGS: 00010046 CPU: 1
 [15730.049402] EIP is at rb_next+0x1/0x50
 [15730.053602] EAX: 0008 EBX: f3655950 ECX: 004c090e EDX: 
 [15730.060607] ESI:  EDI:  EBP: e83ada44 ESP: e83ada28
 [15730.067623] DS: 007b ES: 007b FS: 00d8 GS: 003b SS: 0068
 [15730.073668] CR0: 80050033 CR2: 0008 CR3: 28095000 CR4: 001007f0
 [15730.080684] DR0:  DR1:  DR2:  DR3: 
 [15730.087699] DR6: 0ff0 DR7: 0400
 [15730.091994] Stack:
 [15730.094251] e83ada44 c12719f0 004c090e f3655900 e86ff334 f3655900 0002 
 e83adacc
 [15730.103086] c1ae384f f3655900 254c b7581800 6be38330 004e 0e4e 
 c20d6900
 [15730.111922] f3655950 c20d6900 f3655900 e86ff080 f1d40600 cfcfa794 e83ada90 
 e83ada8c
 [15730.120754] Call Trace:
 [15730.123502] [c12719f0] ? pick_next_task_fair+0xf0/0x130
 [15730.129647] [c1ae384f] __schedule+0x11f/0x800
 [15730.134821] [c12c7421] ? tracer_tracing_is_on+0x11/0x30
 [15730.140964] [c12c74ad] ? tracing_is_on+0xd/0x10
 [15730.146331] [c1ae8285] ? sub_preempt_count+0x55/0xe0
 [15730.152185] [c1266394] ? finish_task_switch+0x54/0xb0
 [15730.158136] [c1ae3fa3] schedule+0x23/0x60
 [15730.162920] [c1ae16e5] schedule_timeout+0x165/0x280
 [15730.168676] [c1ae8285] ? sub_preempt_count+0x55/0xe0
 [15730.174529] [c1ae350f] wait_for_completion+0x6f/0xc0
 [15730.180382] [c126b3e0] ? try_to_wake_up+0x250/0x250
 [15730.186139] [c1255658] flush_work+0xa8/0x110
 [15730.191214] [c1253fc0] ? worker_pool_assign_id+0x40/0x40
 [15730.197457] [c15c3955] tty_flush_to_ldisc+0x25/0x30
 [15730.203212] [c15bde18] n_tty_poll+0x68/0x180
 [15730.208288] [c15bddb0] ? process_echoes+0x2c0/0x2c0
 [15730.214044] [c15bb2fb] tty_poll+0x6b/0x90
 [15730.218828] [c15bddb0] ? process_echoes+0x2c0/0x2c0
 [15730.224584] [c1339862] do_sys_poll+0x202/0x440
 [15730.229856] [c1ae8285] ? sub_preempt_count+0x55/0xe0
 [15730.235710] [c13234a1] ? kmem_cache_free+0x71/0x180
 [15730.241466] [c13d2bfa] ? jbd2_journal_stop+0x25a/0x370
 [15730.247513] [c13d2bfa] ? jbd2_journal_stop+0x25a/0x370
 [15730.253561] [c13bb2df] ? __ext4_journal_stop+0x5f/0x90
 [15730.259608] [c139787d] ? ext4_dirty_inode+0x4d/0x60
 [15730.265364] [c1ae8285] ? sub_preempt_count+0x55/0xe0
 [15730.271218] [c13549ac] ? generic_write_end+0xac/0x100
 [15730.277168] [c13bb2df] ? __ext4_journal_stop+0x5f/0x90
 [15730.283216] [c1338780] ? __pollwait+0xd0/0xd0
 [15730.288388] [c1338780] ? __pollwait+0xd0/0xd0
 [15730.293561] [c1338780] ? __pollwait+0xd0/0xd0
 [15730.298734] [c1338780] ? __pollwait+0xd0/0xd0
 [15730.303908] [c12ecd85] ? __generic_file_aio_write+0x245/0x470
 [15730.310635] [c12ed059] ? generic_file_aio_write+0xa9/0xd0
 [15730.316975] [c138c910] ? ext4_file_write+0xc0/0x460
 [15730.322730] [c1ae8285] ? sub_preempt_count+0x55/0xe0
 [15730.328583] [c125d09f] ? remove_wait_queue+0x3f/0x50
 [15730.334436] [c1ae8285] ? sub_preempt_count+0x55/0xe0
 [15730.340289] [c12615e6] ? __srcu_read_lock+0x66/0x90
 [15730.346045] [c1ae8285] ? sub_preempt_count+0x55/0xe0
 [15730.351899] [c15407f6] ? __percpu_counter_add+0x96/0xe0
 [15730.358043] [c1329df1] ? __sb_end_write+0x31/0x70
 [15730.363603] [c13285c5] ? vfs_write+0x165/0x1c0
 [15730.368874] [c1339b4a] SyS_poll+0x5a/0xd0
 [15730.373658] [c1ae52a8] syscall_call+0x7/0xb
 [15730.378639] [c1ae] ? add_sysfs_fw_map_entry+0x2f/0x85

 Signed-off-by: xiaoming wang xiaoming.w...@intel.com
 Signed-off-by: Zhang Dongxing dongxing.zh...@intel.com
 ---
  kernel/sched/fair.c |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)

 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
 index 7c70201..2d440f9 100644
 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -3708,7 +3708,7 @@ static struct task_struct *pick_next_task_fair(struct 
 rq *rq)
 se = pick_next_entity(cfs_rq);
 set_next_entity(cfs_rq, se);
 cfs_rq = group_cfs_rq(se);
 -   } while (cfs_rq);
 +   } while (cfs_rq && cfs_rq->nr_running);

 p = task_of(se);
 if (hrtick_enabled(rq))

This can only happen when something else has already corrupted the
rb-tree.  Breaking out here 

Re: [PATCH 01/14] sched: add sched_class-task_dead.

2013-11-11 Thread Paul Turner
On Thu, Nov 7, 2013 at 5:43 AM, Juri Lelli juri.le...@gmail.com wrote:
 From: Dario Faggioli raist...@linux.it

 Add a new function to the scheduling class interface. It is called
 at the end of a context switch, if the prev task is in TASK_DEAD state.

 It might be useful for the scheduling classes that want to be notified
 when one of their task dies, e.g. to perform some cleanup actions.

We could also usefully use this within SCHED_FAIR to remove a task's
load contribution when it completes.
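
A purely hypothetical sketch of that idea (not part of this series; helper
names follow the current fair.c):

static void task_dead_fair(struct task_struct *p)
{
	struct sched_entity *se = &p->se;
	struct cfs_rq *cfs_rq = cfs_rq_of(se);

	/*
	 * Fold any outstanding decay, then drop the dead task's contribution
	 * from its cfs_rq's blocked load instead of waiting for it to decay
	 * away.
	 */
	__synchronize_entity_decay(se);
	subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
}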


 Signed-off-by: Dario Faggioli raist...@linux.it
 Signed-off-by: Juri Lelli juri.le...@gmail.com
 ---
  kernel/sched/core.c  |3 +++
  kernel/sched/sched.h |1 +
  2 files changed, 4 insertions(+)

 diff --git a/kernel/sched/core.c b/kernel/sched/core.c
 index 5ac63c9..850a02c 100644
 --- a/kernel/sched/core.c
 +++ b/kernel/sched/core.c
 @@ -1890,6 +1890,9 @@ static void finish_task_switch(struct rq *rq, struct 
 task_struct *prev)
 if (mm)
 mmdrop(mm);
 if (unlikely(prev_state == TASK_DEAD)) {
 +   if (prev->sched_class->task_dead)
 +   prev->sched_class->task_dead(prev);
 +
 /*
  * Remove function-return probe instances associated with this
  * task and put them back on the free list.
 diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
 index b3c5653..64eda5c 100644
 --- a/kernel/sched/sched.h
 +++ b/kernel/sched/sched.h
 @@ -992,6 +992,7 @@ struct sched_class {
 void (*set_curr_task) (struct rq *rq);
 void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
 void (*task_fork) (struct task_struct *p);
 +   void (*task_dead) (struct task_struct *p);

 void (*switched_from) (struct rq *this_rq, struct task_struct *task);
 void (*switched_to) (struct rq *this_rq, struct task_struct *task);

Reviewed-by: Paul Turner p...@google.com

 --
 1.7.9.5



[tip:sched/core] sched: Guarantee new group-entities always have weight

2013-10-29 Thread tip-bot for Paul Turner
Commit-ID:  0ac9b1c21874d2490331233b3242085f8151e166
Gitweb: http://git.kernel.org/tip/0ac9b1c21874d2490331233b3242085f8151e166
Author: Paul Turner p...@google.com
AuthorDate: Wed, 16 Oct 2013 11:16:27 -0700
Committer:  Ingo Molnar mi...@kernel.org
CommitDate: Tue, 29 Oct 2013 12:02:23 +0100

sched: Guarantee new group-entities always have weight

Currently, group entity load-weights are initialized to zero. This
admits some races with respect to the first time they are re-weighted in
early use. ( Let g[x] denote the se for g on cpu x. )

Suppose that we have root->a and that a enters a throttled state,
immediately followed by a[0]->t1 (the only task running on cpu[0])
blocking:

  put_prev_task(group_cfs_rq(a[0]), t1)
  put_prev_entity(..., t1)
  check_cfs_rq_runtime(group_cfs_rq(a[0]))
  throttle_cfs_rq(group_cfs_rq(a[0]))

Then, before unthrottling occurs, let a[0]->b[0]->t2 wake for the first
time:

  enqueue_task_fair(rq[0], t2)
  enqueue_entity(group_cfs_rq(b[0]), t2)
  enqueue_entity_load_avg(group_cfs_rq(b[0]), t2)
  account_entity_enqueue(group_cfs_rq(b[0]), t2)
  update_cfs_shares(group_cfs_rq(b[0]))
  < skipped because b is part of a throttled hierarchy >
  enqueue_entity(group_cfs_rq(a[0]), b[0])
  ...

We now have b[0] enqueued, yet group_cfs_rq(a[0])->load.weight == 0
which violates invariants in several code-paths. Eliminate the
possibility of this by initializing group entity weight.

Signed-off-by: Paul Turner p...@google.com
Signed-off-by: Peter Zijlstra pet...@infradead.org
Link: 
http://lkml.kernel.org/r/20131016181627.22647.47543.st...@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar mi...@kernel.org
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f6308cb..0923ab2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7198,7 +7198,8 @@ void init_tg_cfs_entry(struct task_group *tg, struct 
cfs_rq *cfs_rq,
se->cfs_rq = parent->my_q;
 
se->my_q = cfs_rq;
-   update_load_set(&se->load, 0);
+   /* guarantee group entities always have weight */
+   update_load_set(&se->load, NICE_0_LOAD);
se->parent = parent;
 }
 


Re: [PATCH 4/5] sched: Guarantee new group-entities always have weight

2013-10-16 Thread Paul Turner
On Wed, Oct 16, 2013 at 3:01 PM, Peter Zijlstra pet...@infradead.org wrote:
 On Wed, Oct 16, 2013 at 11:16:27AM -0700, Ben Segall wrote:
 From: Paul Turner p...@google.com

 Currently, group entity load-weights are initialized to zero. This
 admits some races with respect to the first time they are re-weighted in
 early use. ( Let g[x] denote the se for g on cpu x. )

 Suppose that we have root->a and that a enters a throttled state,
 immediately followed by a[0]->t1 (the only task running on cpu[0])
 blocking:

 put_prev_task(group_cfs_rq(a[0]), t1)
 put_prev_entity(..., t1)
 check_cfs_rq_runtime(group_cfs_rq(a[0]))
 throttle_cfs_rq(group_cfs_rq(a[0]))

 Then, before unthrottling occurs, let a[0]->b[0]->t2 wake for the first
 time:

 enqueue_task_fair(rq[0], t2)
 enqueue_entity(group_cfs_rq(b[0]), t2)
 enqueue_entity_load_avg(group_cfs_rq(b[0]), t2)
 account_entity_enqueue(group_cfs_rq(b[0]), t2)
 update_cfs_shares(group_cfs_rq(b[0]))
  < skipped because b is part of a throttled hierarchy >
 enqueue_entity(group_cfs_rq(a[0]), b[0])
 ...

 We now have b[0] enqueued, yet group_cfs_rq(a[0])->load.weight == 0
 which violates invariants in several code-paths. Eliminate the
 possibility of this by initializing group entity weight.

 Signed-off-by: Paul Turner p...@google.com
 ---
  kernel/sched/fair.c |3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
 index fc44cc3..424c294 100644
 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -7207,7 +7207,8 @@ void init_tg_cfs_entry(struct task_group *tg, struct 
 cfs_rq *cfs_rq,
   se->cfs_rq = parent->my_q;

   se->my_q = cfs_rq;
 - update_load_set(&se->load, 0);
 + /* guarantee group entities always have weight */
 + update_load_set(&se->load, NICE_0_LOAD);
   se->parent = parent;
  }

 Hurm.. this gives new groups a massive weight; nr_cpus * NICE_0. ISTR
 there being some issues with this; or was that on the wakeup path where
 a task woke on a cpu whose group entity had '0' load because it used to
 run on another cpu -- I can't remember.

This could also arbitrarily be MIN_WEIGHT.

I don't think it actually matters, in practice the set of conditions
for this weight to ever see use are very specific (e.g. the race
described above). Otherwise it's always going to be re-initialized (on
first actual enqueue) to the right value.  (NICE_0_LOAD seemed to make
sense since this is what you'd expect for a new, single thread,
autogroup/group.)


 But please do expand how this isn't a problem. I suppose for the regular
 cgroup case, group creation is a rare event so nobody cares, but
 autogroups can come and go far too quickly I think.


This isn't typically a problem since the first enqueue will almost
universally set a weight.

An alternative considered was to remove the throttled-hierarchy check
in the re-weight; however it seemed better to make this actually an
invariant (e.g. an entity ALWAYS has some weight).


Re: [tip:sched/core] sched/balancing: Fix cfs_rq- task_h_load calculation

2013-09-30 Thread Paul Turner
On Mon, Sep 30, 2013 at 7:22 PM, Yuanhan Liu
yuanhan@linux.intel.com wrote:
 On Mon, Sep 30, 2013 at 12:14:03PM +0400, Vladimir Davydov wrote:
 On 09/29/2013 01:47 PM, Yuanhan Liu wrote:
 On Fri, Sep 20, 2013 at 06:46:59AM -0700, tip-bot for Vladimir Davydov 
 wrote:
 Commit-ID:  7e3115ef5149fc502e3a2e80719dba54a8e7409d
 Gitweb:http://git.kernel.org/tip/7e3115ef5149fc502e3a2e80719dba54a8e7409d
 Author: Vladimir Davydovvdavy...@parallels.com
 AuthorDate: Sat, 14 Sep 2013 19:39:46 +0400
 Committer:  Ingo Molnarmi...@kernel.org
 CommitDate: Fri, 20 Sep 2013 11:59:39 +0200
 
 sched/balancing: Fix cfs_rq-task_h_load calculation
 
 Patch a003a2 (sched: Consider runnable load average in move_tasks())
 sets all top-level cfs_rqs' h_load to rq->avg.load_avg_contrib, which is
 always 0. This mistype leads to all tasks having weight 0 when load
 balancing in a cpu-cgroup enabled setup. There obviously should be sum
 of weights of all runnable tasks there instead. Fix it.
 Hi Vladimir,
 
 FYI, Here we found a 17% netperf regression by this patch. Here are some
 changed stats between this commit 7e3115ef5149fc502e3a2e80719dba54a8e7409d
 and it's parent(3029ede39373c368f402a76896600d85a4f7121b)

 Hello,

 Could you please report the following info:

 Hi Vladimir,

 This regression was first found at a 2-core 32 CPU Sandybridge server
 with 64G memory. However, I can't ssh to it now and we are off work
 this week due to holiday. So, sorry, email response may be delayed.

 Then I found this regression exists at another atom micro server as
 well. And the following machine and testcase specific info are all from it.

 And to not make old data confuse you, here I also update the changed
 stats and corresponding text plot as well in attachment.

 1) the test machine cpu topology (i.e. output of 
 /sys/devices/system/cpu/cpu*/{thread_siblings_list,core_siblings_list})

 # grep . 
 /sys/devices/system/cpu/cpu*/topology/{thread_siblings_list,core_siblings_list}
 /sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0-1
 /sys/devices/system/cpu/cpu1/topology/thread_siblings_list:0-1
 /sys/devices/system/cpu/cpu2/topology/thread_siblings_list:2-3
 /sys/devices/system/cpu/cpu3/topology/thread_siblings_list:2-3
 /sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-3
 /sys/devices/system/cpu/cpu1/topology/core_siblings_list:0-3
 /sys/devices/system/cpu/cpu2/topology/core_siblings_list:0-3
 /sys/devices/system/cpu/cpu3/topology/core_siblings_list:0-3


 2) kernel config you used during the test

 Attached.

 3) the output of /sys/kernel/debug/sched_features (debugfs mounted).

 # cat /sys/kernel/debug/sched_features
 GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY CACHE_HOT_BUDDY
 WAKEUP_PREEMPTION ARCH_POWER NO_HRTICK NO_DOUBLE_TICK LB_BIAS NONTASK_POWER
 TTWU_QUEUE NO_FORCE_SD_OVERLAP RT_RUNTIME_SHARE NO_LB_MIN NO_NUMA 
 NO_NUMA_FORCE

 4) netperf server/client options

 Here is our testscript we used:
 #!/bin/bash
 # - test

 # start netserver
 netserver

 sleep 1

 for i in $(seq $nr_threads)
 do
 netperf -t $test -c -C -l $runtime &
 done

 Where,
   $test is TCP_SENDFILE,
   $nr_threads is 8, two times the number of CPUs
   $runtime is 120s

 5) did you place netserver into a separate cpu cgroup?

 Nope.



If this is causing a regression I think it actually calls into
question the original series that included a003a25b227d59d.  This
patch only makes h_load not be a nonsense value.


Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)

2013-09-26 Thread Paul Turner
On Thu, Sep 26, 2013 at 2:58 AM, Peter Zijlstra pet...@infradead.org wrote:
 On Wed, Sep 25, 2013 at 10:56:17AM +0200, Mike Galbraith wrote:
 That will make pipe-test go fugly - pretty, and help very fast/light
 localhost network, but eat heavier localhost overlap recovery.  We need
 a working (and cheap) overlap detector scheme, so we can know when there
 is enough to be worth going after.

 We used to have an overlap detectoring thing.. It went away though.

 But see if you can make something like the below work?

 You could make it a general overlap thing and try without the sync too I
 suppose..

 ---
  include/linux/sched.h |  3 +++
  kernel/sched/fair.c   | 25 +
  2 files changed, 28 insertions(+)

 diff --git a/include/linux/sched.h b/include/linux/sched.h
 index b5344de..5428016 100644
 --- a/include/linux/sched.h
 +++ b/include/linux/sched.h
 @@ -974,6 +974,9 @@ struct sched_entity {
 u64 vruntime;
 u64 prev_sum_exec_runtime;

 +   u64 last_sync_wakeup;
 +   u64 avg_overlap;
 +
 u64 nr_migrations;

  #ifdef CONFIG_SCHEDSTATS
 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
 index 2b89cd2..47b0d0f 100644
 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -2913,6 +2913,17 @@ static void dequeue_task_fair(struct rq *rq, struct 
 task_struct *p, int flags)
 struct sched_entity *se = &p->se;
 int task_sleep = flags & DEQUEUE_SLEEP;

 +   if (se->last_sync_wakeup) {
 +   u64 overlap;
 +   s64 diff;
 +
 +   overlap = rq->clock - se->last_sync_wakeup;
 +   se->last_sync_wakeup = 0;
 +
 +   diff = overlap - se->avg_overlap;
 +   se->avg_overlap += diff >> 8;
 +   }
 +
 for_each_sched_entity(se) {
 cfs_rq = cfs_rq_of(se);
 dequeue_entity(cfs_rq, se, flags);
 @@ -3429,6 +3440,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, 
 int wake_flags)
 int want_affine = 0;
 int sync = wake_flags & WF_SYNC;

 +   if (sync)
 +   p->se.last_sync_wakeup = sched_clock_cpu(cpu);
 +
 if (p->nr_cpus_allowed == 1)
 return prev_cpu;

 @@ -3461,6 +3475,17 @@ select_task_rq_fair(struct task_struct *p, int 
 sd_flag, int wake_flags)
 if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
 prev_cpu = cpu;

 +   /*
 +* Don't bother with select_idle_sibling() in the case of a 
 sync wakeup
 +* where we know the only running task will soon go-away. 
 Going
 +* through select_idle_sibling will only lead to pointless 
 ping-pong.
 +*/
 +   if (sync && prev_cpu == cpu && cpu_rq(cpu)->nr_running == 1 &&

I've long thought of trying something like this.

I like the intent but I'd go a step further in that I think we want to
also implicitly extract WF_SYNC itself.  While pipe_test is a good
microbenchmark it's not entirely representative since, in reality,
overlap is most usefully applied to threads in the same process -- and
they rarely communicate using a pipe.

What we really then care about is predicting the overlap associated
with userspace synchronization objects, typically built on top of
futexes.  Unfortunately the existence/use of per-thread futexes
reduces how much state you could usefully associate with the futex.
One approach might be to hash (with some small saturating counter)
against rip.  But this gets more complicated quite quickly.

 +   current->se.avg_overlap < 1) {
 +   new_cpu = cpu;
 +   goto unlock;
 +   }
 +
 new_cpu = select_idle_sibling(p, prev_cpu);
 goto unlock;
 }


Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)

2013-09-26 Thread Paul Turner
On Thu, Sep 26, 2013 at 4:16 AM, Peter Zijlstra pet...@infradead.org wrote:
 On Thu, Sep 26, 2013 at 03:55:55AM -0700, Paul Turner wrote:
  +   /*
  +* Don't bother with select_idle_sibling() in the case of 
  a sync wakeup
  +* where we know the only running task will soon go-away. 
  Going
  +* through select_idle_sibling will only lead to pointless 
  ping-pong.
  +*/
  +   if (sync  prev_cpu == cpu  cpu_rq(cpu)-nr_running == 
  1 

 I've long thought of trying something like this.

 I like the intent but I'd go a step further in that I think we want to
 also implicitly extract WF_SYNC itself.

 I have vague memories of actually trying something like that a good
 number of years ago.. sadly that's all I remember about it.

 What we really then care about is predicting the overlap associated
 with userspace synchronization objects, typically built on top of
 futexes.  Unfortunately the existence/use of per-thread futexes
 reduces how much state you could usefully associate with the futex.
 One approach might be to hash (with some small saturating counter)
 against rip.  But this gets more complicated quite quickly.

 Why would you need per object storage? To further granulate the
 predicted overlap? instead of having one per task, you have one per
 object?

It is my intuition that there are a few common objects with fairly
polarized behavior:  I.e. For condition variables and producer
consumer queues, a wakeup strongly predicts blocking.  Whereas for
locks protecting objects, e.g. a Mutex, would be expected to have the
opposite behavior.

For this hint to be beneficial you have to get it right frequently,
getting it wrong in the first case hurts cache and in the second hurts
parallelism.  Today we always err on the side of hurting locality
since the cost of getting it wrong is better bounded.  These are
sufficiently common, and likely to be interspersed, that I suspect
allowing them to interact on a thread-wide counter will basically give
a mush result (or even be an anti-predictor since it will strongly
favor the last observation) as an input.
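
To make the rip-hashing idea above concrete, a purely hypothetical sketch
(nothing like this exists today; names and sizes are made up):

#include <linux/hash.h>
#include <linux/types.h>

#define OVERLAP_BITS	8
#define OVERLAP_BUCKETS	(1 << OVERLAP_BITS)

struct overlap_hist {
	/* small saturating counters, indexed by a hash of the wakeup rip */
	u8 counter[OVERLAP_BUCKETS];
};

static inline u8 *overlap_counter(struct overlap_hist *h, unsigned long rip)
{
	return &h->counter[hash_long(rip, OVERLAP_BITS)];
}

/* record whether the waker actually blocked shortly after this wakeup */
static inline void overlap_observe(struct overlap_hist *h, unsigned long rip,
				   bool waker_blocked)
{
	u8 *c = overlap_counter(h, rip);

	if (waker_blocked) {
		if (*c < 255)
			(*c)++;
	} else if (*c > 0) {
		(*c)--;
	}
}

/* query at wakeup: does this site behave like a sync (cond-var) wakeup? */
static inline bool overlap_predict_sync(struct overlap_hist *h,
					unsigned long rip)
{
	return *overlap_counter(h, rip) >= 128;
}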


Re: [PATCH] sched: Fix task_h_load calculation

2013-09-14 Thread Paul Turner
On Sat, Sep 14, 2013 at 8:39 AM, Vladimir Davydov
vdavy...@parallels.com wrote:
 Patch a003a2 (sched: Consider runnable load average in move_tasks())
 sets all top-level cfs_rqs' h_load to rq->avg.load_avg_contrib, which is
 always 0. This mistype leads to all tasks having weight 0 when load
 balancing in a cpu-cgroup enabled setup. There obviously should be sum
 of weights of all runnable tasks there instead. Fix it.


load_avg_contrib is the weight that
 Signed-off-by: Vladimir Davydov vdavy...@parallels.com
 ---
  kernel/sched/fair.c |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
 index 9b3fe1c..13abc29 100644
 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -4242,7 +4242,7 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
 }

 if (!se) {
 -   cfs_rq->h_load = rq->avg.load_avg_contrib;
 +   cfs_rq->h_load = cfs_rq->runnable_load_avg;

Looks good.

Reviewed-by: Paul Turner p...@google.com

 cfs_rq->last_h_load_update = now;
 }

 --
 1.7.10.4



Re: [PATCH 09/10] sched, fair: Fix the sd_parent_degenerate() code

2013-08-27 Thread Paul Turner
On Mon, Aug 26, 2013 at 2:49 PM, Rik van Riel r...@surriel.com wrote:
 On 08/26/2013 08:09 AM, Peter Zijlstra wrote:
 On Sat, Aug 24, 2013 at 03:45:57AM -0700, Paul Turner wrote:
 @@ -5157,6 +5158,13 @@ cpu_attach_domain(struct sched_domain *s
 tmp->parent = parent->parent;
 if (parent->parent)
 parent->parent->child = tmp;
 +   /*
 +* Transfer SD_PREFER_SIBLING down in case of a
 +* degenerate parent; the spans match for this
 +* so the property transfers.
 +*/
 +   if (parent->flags & SD_PREFER_SIBLING)
 +   tmp->flags |= SD_PREFER_SIBLING;
 destroy_sched_domain(parent, cpu);
 } else
 tmp = tmp-parent;


 Reviewed-by: Paul Turner p...@google.com

 BTW, did that comment make sense to you or would you suggest something
 different? I had/am having a hard time with that comment. Somehow it
 leaves me wanting. I know I understand the issue now, but I'll doubt the
 comment will suffice in a years time :/

 The comment made sense to me :)

It makes sense once you read the code.  That we push down is somehow
counter-intuitive in the reading.


 --
 All rights reversed.


Re: [PATCH 07/10] sched, fair: Optimize find_busiest_queue()

2013-08-27 Thread Paul Turner
On Mon, Aug 26, 2013 at 5:07 AM, Peter Zijlstra pet...@infradead.org wrote:
 On Sat, Aug 24, 2013 at 03:33:59AM -0700, Paul Turner wrote:
 On Mon, Aug 19, 2013 at 9:01 AM, Peter Zijlstra pet...@infradead.org wrote:
  +++ b/kernel/sched/fair.c
  @@ -4977,7 +4977,7 @@ static struct rq *find_busiest_queue(str
  unsigned long busiest_load = 0, busiest_power = SCHED_POWER_SCALE;
  int i;
 
  -   for_each_cpu(i, sched_group_cpus(group)) {
  +   for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
  unsigned long power = power_of(i);
  unsigned long capacity = DIV_ROUND_CLOSEST(power,
 
  SCHED_POWER_SCALE);
  @@ -4986,9 +4986,6 @@ static struct rq *find_busiest_queue(str
  if (!capacity)
  capacity = fix_small_capacity(env->sd, group);
 
  -   if (!cpumask_test_cpu(i, env->cpus))
  -   continue;
  -
  rq = cpu_rq(i);
  wl = weighted_cpuload(i);

 There's no need to actually do the divisions immediately below this also.

 e.g.
   unsigned long max_load_power = SCHED_POWER_SCALE;
   ...
  if (wl * max_load_power > max_load * power) {
max_load = wl;
max_load_power = power;
...

 This would actually end up being a little more accurate even.

 [ Alternatively without caching max_load_power we could compare wl *
 power vs max_load * SCHED_POWER_SCALE. ]

 You've got me confused again. You're talking about something like the
 below?

Nevermind, I was looking at a tip tree as I reviewed this one.  What I
was suggesting was exactly:
  [PATCH 01/10] sched: Remove one division operation in find_busiest_queue()


 I suppose the problem with that is that we could keep selecting the
 busiest rq with an unmovable task due to:

 move_tasks():
  if ((load / 2) > env->imbalance)
 goto next;

 That said, the condition in fbq() should at least be modified to match
 this. Now the entire capacity crap comes from:

   bdb94aa sched: Try to deal with low capacity

 But thinking a little more about it, if power drops that low imbalance
 is likely to be _huge_ and we'd not meet that condition. Now if only I
 wrote a more comprehensive Changelog and explained why that wouldn't be
 the case. /me kicks himself.

 ---
 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -4990,28 +4990,12 @@ static struct sched_group *find_busiest_
  static struct rq *find_busiest_queue(struct lb_env *env,
  struct sched_group *group)
  {
 -   struct rq *busiest = NULL, *rq;
 unsigned long busiest_load = 0, busiest_power = 1;
 +   struct rq *busiest = NULL;
 int i;

 for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
 -   unsigned long power = power_of(i);
 -   unsigned long capacity = DIV_ROUND_CLOSEST(power,
 -  SCHED_POWER_SCALE);
 -   unsigned long wl;
 -
 -   if (!capacity)
 -   capacity = fix_small_capacity(env->sd, group);
 -
 -   rq = cpu_rq(i);
 -   wl = weighted_cpuload(i);
 -
 -   /*
 -* When comparing with imbalance, use weighted_cpuload()
 -* which is not scaled with the cpu power.
 -*/
 -   if (capacity && rq->nr_running == 1 && wl > env->imbalance)
 -   continue;
 +   unsigned long wl = weighted_cpuload(i);

 /*
  * For the load comparisons with the other cpu's, consider
 @@ -5027,7 +5011,7 @@ static struct rq *find_busiest_queue(str
 if (wl * busiest_power > busiest_load * power) {
 busiest_load = wl;
 busiest_power = power;
 -   busiest = rq;
 +   busiest = cpu_rq(i);
 }
 }




Re: [PATCH 03/10] sched: Clean-up struct sd_lb_stat

2013-08-25 Thread Paul Turner
On Sun, Aug 25, 2013 at 7:56 PM, Lei Wen adrian.w...@gmail.com wrote:
 On Tue, Aug 20, 2013 at 12:01 AM, Peter Zijlstra pet...@infradead.org wrote:
 From: Joonsoo Kim iamjoonsoo@lge.com

 There is no reason to maintain separate variables for this_group
 and busiest_group in sd_lb_stat, except saving some space.
 But this structure is always allocated in stack, so this saving
 isn't really benificial [peterz: reducing stack space is good; in this
 case readability increases enough that I think its still beneficial]

 This patch unify these variables, so IMO, readability may be improved.

 Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com
 [peterz: lots of style edits, a few fixes and a rename]
 Signed-off-by: Peter Zijlstra pet...@infradead.org
 Link: 
 http://lkml.kernel.org/r/1375778203-31343-4-git-send-email-iamjoonsoo@lge.com
 ---
  kernel/sched/fair.c |  225 
 +---
  1 file changed, 112 insertions(+), 113 deletions(-)

 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -4277,36 +4277,6 @@ static unsigned long task_h_load(struct

 [snip]...
 -   env->imbalance = DIV_ROUND_CLOSEST(
 -   sds->max_load * sds->busiest->sgp->power, SCHED_POWER_SCALE);
 +   env->imbalance = DIV_ROUND_CLOSEST(sds->busiest_stat.avg_load *
 +   sds->busiest->sgp->power, SCHED_POWER_SCALE);


 I am wondering whether it would be more appropriate to change this line as below,
 since it would avoid the division here:
    env->imbalance = (sds->busiest_stat.avg_load * 
 sds->busiest->sgp->power)
    >> SCHED_POWER_SHIFT;

 I am not sure whether the compiler would be smart enough to convert it into
 a >> operation,
 if it sees SCHED_POWER_SCALE is 1024 here.

This would change the rounding.  Fortunately, gcc is smart enough to
handle this.
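
A tiny standalone illustration of the rounding difference (example values
only, not kernel code):

#include <stdio.h>

#define SCHED_POWER_SHIFT	10
#define SCHED_POWER_SCALE	(1UL << SCHED_POWER_SHIFT)
#define DIV_ROUND_CLOSEST(x, d)	(((x) + (d) / 2) / (d))

int main(void)
{
	unsigned long x = 1600;

	/* rounds to nearest: (1600 + 512) / 1024 == 2 */
	printf("%lu\n", DIV_ROUND_CLOSEST(x, SCHED_POWER_SCALE));
	/* a right shift truncates: 1600 >> 10 == 1 */
	printf("%lu\n", x >> SCHED_POWER_SHIFT);
	return 0;
}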


Re: [PATCH 03/10] sched: Clean-up struct sd_lb_stat

2013-08-24 Thread Paul Turner
)
 goto out_balanced;
 }


Reviewed-by: Paul Turner p...@google.com


Re: [PATCH 04/10] sched, fair: Shrink sg_lb_stats and play memset games

2013-08-24 Thread Paul Turner
On Mon, Aug 19, 2013 at 9:01 AM, Peter Zijlstra pet...@infradead.org wrote:
 We can shrink sg_lb_stats because rq::nr_running is an 'unsigned int'
 and cpu numbers are 'int'

 Before:
   sgs:/* size: 72, cachelines: 2, members: 10 */
   sds:/* size: 184, cachelines: 3, members: 7 */

 After:
   sgs:/* size: 56, cachelines: 1, members: 10 */
   sds:/* size: 152, cachelines: 3, members: 7 */

 Further we can avoid clearing all of sds since we do a total
 clear/assignment of sg_stats in update_sg_lb_stats() with exception of
 busiest_stat.avg_load which is referenced in update_sd_pick_busiest().

 Signed-off-by: Peter Zijlstra pet...@infradead.org
 ---
  kernel/sched/fair.c |   42 +++---
  1 file changed, 35 insertions(+), 7 deletions(-)

 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -4282,12 +4282,12 @@ static unsigned long task_h_load(struct
  struct sg_lb_stats {
 unsigned long avg_load; /*Avg load across the CPUs of the group */
 unsigned long group_load; /* Total load over the CPUs of the group */
 -   unsigned long sum_nr_running; /* Nr tasks running in the group */
 unsigned long sum_weighted_load; /* Weighted load of group's tasks */
 unsigned long load_per_task;
 -   unsigned long group_capacity;
 -   unsigned long idle_cpus;
 -   unsigned long group_weight;
 +   unsigned int sum_nr_running; /* Nr tasks running in the group */
 +   unsigned int group_capacity;
 +   unsigned int idle_cpus;
 +   unsigned int group_weight;
 int group_imb; /* Is there an imbalance in the group ? */
 int group_has_capacity; /* Is there extra capacity in the group? */
  };
 @@ -4303,10 +4303,38 @@ struct sd_lb_stats {
 unsigned long total_pwr;/* Total power of all groups in sd */
 unsigned long avg_load; /* Average load across all groups in sd */

 -   struct sg_lb_stats this_stat;   /* Statistics of this group */
 struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
 +   struct sg_lb_stats this_stat;   /* Statistics of this group */
  };

 +static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
 +{
 +   /*
 +* struct sd_lb_stats {
 +* struct sched_group *   busiest; // 0  8
 +* struct sched_group *   this;// 8  8
 +* long unsigned int  total_load;  //16  8
 +* long unsigned int  total_pwr;   //24  8
 +* long unsigned int  avg_load;//32  8
 +* struct sg_lb_stats {
 +* long unsigned int  avg_load;//40  8
 +* long unsigned int  group_load;  //48  8
 +* ...
 +* } busiest_stat; //40 56
 +* struct sg_lb_stats this_stat;   //96 56
 +*
 +* // size: 152, cachelines: 3, members: 7
 +* // last cacheline: 24 bytes
 +* };
 +*
 +* Skimp on the clearing to avoid duplicate work. We can avoid 
 clearing
 +* this_stat because update_sg_lb_stats() does a full 
 clear/assignment.
 +* We must however clear busiest_stat::avg_load because
 +* update_sd_pick_busiest() reads this before assignment.
 +*/

Does gcc seriously not get this right if we just write:
  sds->busiest = NULL;
  sds->local = NULL;
  

etc.?
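
Spelled out against the struct layout quoted above, the alternative being
asked about would be roughly:

	/*
	 * whether gcc merges these into the same stores as the memset()
	 * is exactly the open question
	 */
	sds->busiest = NULL;
	sds->this = NULL;
	sds->total_load = 0;
	sds->total_pwr = 0;
	sds->avg_load = 0;
	sds->busiest_stat.avg_load = 0;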

 +   memset(sds, 0, offsetof(struct sd_lb_stats, busiest_stat.group_load));
 +}
 +
  /**
   * get_sd_load_idx - Obtain the load index for a given sched domain.
   * @sd: The sched_domain whose load_idx is to be obtained.
 @@ -4665,7 +4693,7 @@ static inline void update_sd_lb_stats(st
  */
 if (prefer_sibling && !local_group &&
 sds->this &&
 sds->this_stat.group_has_capacity)
 -   sgs->group_capacity = min(sgs->group_capacity, 1UL);
 +   sgs->group_capacity = min(sgs->group_capacity, 1U);

 /* Now, start updating sd_lb_stats */
 sds->total_load += sgs->group_load;
 @@ -4896,7 +4924,7 @@ static struct sched_group *find_busiest_
 struct sg_lb_stats *this, *busiest;
 struct sd_lb_stats sds;

 -   memset(&sds, 0, sizeof(sds));
 +   init_sd_lb_stats(&sds);

 /*
  * Compute the various statistics relavent for load balancing at




Re: [PATCH 07/10] sched, fair: Optimize find_busiest_queue()

2013-08-24 Thread Paul Turner
On Mon, Aug 19, 2013 at 9:01 AM, Peter Zijlstra pet...@infradead.org wrote:
 Use for_each_cpu_and() and thereby avoid computing the capacity for
 CPUs we know we're not interested in.

 Signed-off-by: Peter Zijlstra pet...@infradead.org
 ---
  kernel/sched/fair.c |5 +
  1 file changed, 1 insertion(+), 4 deletions(-)

 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -4977,7 +4977,7 @@ static struct rq *find_busiest_queue(str
 unsigned long busiest_load = 0, busiest_power = SCHED_POWER_SCALE;
 int i;

 -   for_each_cpu(i, sched_group_cpus(group)) {
 +   for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
 unsigned long power = power_of(i);
 unsigned long capacity = DIV_ROUND_CLOSEST(power,
SCHED_POWER_SCALE);
 @@ -4986,9 +4986,6 @@ static struct rq *find_busiest_queue(str
 if (!capacity)
 capacity = fix_small_capacity(env->sd, group);

 -   if (!cpumask_test_cpu(i, env->cpus))
 -   continue;
 -
 rq = cpu_rq(i);
 wl = weighted_cpuload(i);

There's no need to actually do the divisions immediately below this also.

e.g.
  unsigned long max_load_power = SCHED_POWER_SCALE;
  ...
   if (wl * max_load_power > max_load * power) {
   max_load = wl;
   max_load_power = power;
   ...

This would actually end up being a little more accurate even.

[ Alternatively without caching max_load_power we could compare wl *
power vs max_load * SCHED_POWER_SCALE. ]
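
Concretely, the selection loop with the suggestion applied would look
something like this (a sketch only, capacity/imbalance checks omitted;
helper names as in the patch above):

	unsigned long max_load = 0, max_load_power = SCHED_POWER_SCALE;
	struct rq *busiest = NULL;

	for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
		unsigned long power = power_of(i);
		unsigned long wl = weighted_cpuload(i);

		/*
		 * wl / power > max_load / max_load_power, evaluated
		 * without the division as
		 * wl * max_load_power > max_load * power.
		 */
		if (wl * max_load_power > max_load * power) {
			max_load = wl;
			max_load_power = power;
			busiest = cpu_rq(i);
		}
	}

This also keeps the comparison exact, since no precision is lost to an
intermediate division.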

Reviewed-by: Paul Turner p...@google.com


Re: [PATCH 09/10] sched, fair: Fix the sd_parent_degenerate() code

2013-08-24 Thread Paul Turner
On Mon, Aug 19, 2013 at 9:01 AM, Peter Zijlstra pet...@infradead.org wrote:
 I found that on my wsm box I had a redundant domain:

 [0.949769] CPU0 attaching sched-domain:
 [0.953765]  domain 0: span 0,12 level SIBLING
 [0.958335]   groups: 0 (cpu_power = 587) 12 (cpu_power = 588)
 [0.964548]   domain 1: span 0-5,12-17 level MC
 [0.969206]groups: 0,12 (cpu_power = 1175) 1,13 (cpu_power = 1176) 
 2,14 (cpu_power = 1176) 3,15 (cpu_power = 1176) 4,16 (cpu_power = 1176) 5,17 
 (cpu_power = 1176)
 [0.984993]domain 2: span 0-5,12-17 level CPU
 [0.989822] groups: 0-5,12-17 (cpu_power = 7055)
 [0.995049] domain 3: span 0-23 level NUMA
 [0.999620]  groups: 0-5,12-17 (cpu_power = 7055) 6-11,18-23 
 (cpu_power = 7056)

 Note how domain 2 has only a single group and spans the same CPUs as
 domain 1. We should not keep such domains and do in fact have code to
 prune these.

 It turns out that the 'new' SD_PREFER_SIBLING flag causes this, it
 makes sd_parent_degenerate() fail on the CPU domain. We can easily fix
 this by 'ignoring' the SD_PREFER_SIBLING bit and transfering it to
 whatever domain ends up covering the span.

 With this patch the domains now look like this:

 [0.950419] CPU0 attaching sched-domain:
 [0.954454]  domain 0: span 0,12 level SIBLING
 [0.959039]   groups: 0 (cpu_power = 587) 12 (cpu_power = 588)
 [0.965271]   domain 1: span 0-5,12-17 level MC
 [0.969936]groups: 0,12 (cpu_power = 1175) 1,13 (cpu_power = 1176) 
 2,14 (cpu_power = 1176) 3,15 (cpu_power = 1176) 4,16 (cpu_power = 1176) 5,17 
 (cpu_power = 1176)
 [0.985737]domain 2: span 0-23 level NUMA
 [0.990231] groups: 0-5,12-17 (cpu_power = 7055) 6-11,18-23 (cpu_power 
 = 7056)

 Signed-off-by: Peter Zijlstra pet...@infradead.org
 ---
  kernel/sched/core.c |   10 +-
  1 file changed, 9 insertions(+), 1 deletion(-)

 --- a/kernel/sched/core.c
 +++ b/kernel/sched/core.c
 @@ -4948,7 +4948,8 @@ sd_parent_degenerate(struct sched_domain
 SD_BALANCE_FORK |
 SD_BALANCE_EXEC |
 SD_SHARE_CPUPOWER |
 -   SD_SHARE_PKG_RESOURCES);
 +   SD_SHARE_PKG_RESOURCES |
 +   SD_PREFER_SIBLING);
 if (nr_node_ids == 1)
 pflags &= ~SD_SERIALIZE;
 }
 @@ -5157,6 +5158,13 @@ cpu_attach_domain(struct sched_domain *s
 tmp->parent = parent->parent;
 if (parent->parent)
 parent->parent->child = tmp;
 +   /*
 +* Transfer SD_PREFER_SIBLING down in case of a
 +* degenerate parent; the spans match for this
 +* so the property transfers.
 +*/
 +   if (parent->flags & SD_PREFER_SIBLING)
 +   tmp->flags |= SD_PREFER_SIBLING;
 destroy_sched_domain(parent, cpu);
 } else
 tmp = tmp-parent;


Reviewed-by: Paul Turner p...@google.com


Re: [PATCH 02/10] sched: Factor out code to should_we_balance()

2013-08-23 Thread Paul Turner
On Thu, Aug 22, 2013 at 3:42 AM, Peter Zijlstra pet...@infradead.org wrote:
 On Thu, Aug 22, 2013 at 02:58:27AM -0700, Paul Turner wrote:
 On Mon, Aug 19, 2013 at 9:01 AM, Peter Zijlstra pet...@infradead.org wrote:

  +   if (local_group)
  load = target_load(i, load_idx);

 Keep the braces here:

   if (local_group) {
 load = target_load(i, load_idx);
   } else {
...

 Right you are, I misplaced that hunk in the next patch.

  -   } else {
  +   else {
  load = source_load(i, load_idx);
  if (load  max_cpu_load)
  max_cpu_load = load;


  @@ -5123,12 +5120,11 @@ static int load_balance(int this_cpu, st
 
  schedstat_inc(sd, lb_count[idle]);
 
  -redo:
 -   group = find_busiest_group(&env, balance);
 -
 -   if (*balance == 0)
 +   if (!(*should_balance = should_we_balance(&env)))
  goto out_balanced;

 Given we always initialize *should_balance where we care, it might be
 more readable to write this as:

 Ah, but we don't actually, idle_balance() doesn't initialize
 should_balance -- easy enough to fix though.

I should have been more explicit here.

idle_balance() doesn't initialize it, but it completely ignores the
result; it's just passing something in case a write occurs.
(Arguably it should be int unused = 1 instead of int balance = 1)


 if (!should_we_balance(env)) {
   *continue_balancing = 0;
   goto out_balanced;
 }

 [ With a rename to can_balance ]

 You confused me, your code implied
 %s/should_balance/continue_balancing/g but now you suggest
 %%s/should_balance/can_balance/g ?

 I'm fine with either.

My bad, I started typing this comment, then went and looked at other
parts of the patch and came back to it :)

I think continue_balancing is more clear.


 
  +redo:

 One behavioral change worth noting here is that in the redo case if a
 CPU has become idle we'll continue trying to load-balance in the
 !new-idle case.

 This could be unpleasant in the case where a package has a pinned busy
 core allowing this and a newly idle cpu to start dueling for load.

 While more deterministically bad in this case now, it could racily do
 this before anyway so perhaps not worth worrying about immediately.

 Ah, because the old code would effectively redo the check and find the
 idle cpu and thereby our cpu would no longer be the balance_cpu.

 Indeed. And I don't think this was an intentional change. I'll go put
 the redo back before should_we_balance().


I was trying to get through the rest of these today but I'm out of
time.  I should be able to finish reviewing the other patches in the
set tomorrow.


Re: [PATCH 01/10] sched: Remove one division operation in find_busiest_queue()

2013-08-22 Thread Paul Turner
On Mon, Aug 19, 2013 at 9:00 AM, Peter Zijlstra pet...@infradead.org wrote:
 From: Joonsoo Kim iamjoonsoo@lge.com

 Remove one division operation in find_busiest_queue() by using
 crosswise multiplication:

 wl_i / power_i > wl_j / power_j :=
 wl_i * power_j > wl_j * power_i

 Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com
 [peterz: expanded changelog]
 Signed-off-by: Peter Zijlstra pet...@infradead.org
 Link: 
 http://lkml.kernel.org/r/1375778203-31343-2-git-send-email-iamjoonsoo@lge.com
 ---
  kernel/sched/fair.c |9 -
  1 file changed, 4 insertions(+), 5 deletions(-)

 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -5018,7 +5018,7 @@ static struct rq *find_busiest_queue(str
  struct sched_group *group)
  {
 struct rq *busiest = NULL, *rq;
 -   unsigned long max_load = 0;
 +   unsigned long busiest_load = 0, busiest_power = SCHED_POWER_SCALE;

Initializing this to SCHED_POWER_SCALE assigns a meaning that isn't
really there.  How about just 1?

 int i;

 for_each_cpu(i, sched_group_cpus(group)) {
 @@ -5049,10 +5049,9 @@ static struct rq *find_busiest_queue(str
  * the load can be moved away from the cpu that is potentially
  * running at a lower capacity.
  */
 -   wl = (wl * SCHED_POWER_SCALE) / power;
 -
 -   if (wl > max_load) {
 -   max_load = wl;

A comment wouldn't hurt here.

 +   if (wl * busiest_power > busiest_load * power) {
 +   busiest_load = wl;
 +   busiest_power = power;
 busiest = rq;
 }
 }




Re: [PATCH 02/10] sched: Factor out code to should_we_balance()

2013-08-22 Thread Paul Turner
 idle_balance(int this_cpu, struct r
 rcu_read_lock();
 for_each_domain(this_cpu, sd) {
 unsigned long interval;
 -   int balance = 1;
 +   int should_balance;

 if (!(sd->flags & SD_LOAD_BALANCE))
 continue;
 @@ -5348,7 +5344,8 @@ void idle_balance(int this_cpu, struct r
 if (sd->flags & SD_BALANCE_NEWIDLE) {
 /* If we've pulled tasks over stop searching: */
 pulled_task = load_balance(this_cpu, this_rq,
 -  sd, CPU_NEWLY_IDLE, 
 balance);
 +  sd, CPU_NEWLY_IDLE,
 +  should_balance);
 }

 interval = msecs_to_jiffies(sd->balance_interval);
 @@ -5586,7 +5583,7 @@ void update_max_interval(void)
   */
  static void rebalance_domains(int cpu, enum cpu_idle_type idle)
  {
 -   int balance = 1;
 +   int should_balance = 1;
 struct rq *rq = cpu_rq(cpu);
 unsigned long interval;
 struct sched_domain *sd;
 @@ -5618,7 +5615,7 @@ static void rebalance_domains(int cpu, e
 }

 if (time_after_eq(jiffies, sd->last_balance + interval)) {
 -   if (load_balance(cpu, rq, sd, idle, &balance)) {
 +   if (load_balance(cpu, rq, sd, idle, &should_balance)) 
 {
 /*
  * The LBF_SOME_PINNED logic could have 
 changed
  * env->dst_cpu, so we can't know our idle
 @@ -5641,7 +5638,7 @@ static void rebalance_domains(int cpu, e
  * CPU in our sched group which is doing load balancing more
  * actively.
  */
 -   if (!balance)
 +   if (!should_balance)
 break;
 }
 rcu_read_unlock();



Reviewed-by: Paul Turner p...@google.com


Re: false nr_running check in load balance?

2013-08-15 Thread Paul Turner
On Thu, Aug 15, 2013 at 10:39 AM, Peter Zijlstra pet...@infradead.org wrote:
 On Tue, Aug 13, 2013 at 01:08:17AM -0700, Paul Turner wrote:
 On Tue, Aug 13, 2013 at 12:38 AM, Peter Zijlstra pet...@infradead.org 
 wrote:
  On Tue, Aug 13, 2013 at 12:45:12PM +0800, Lei Wen wrote:
   Not quite right; I think you need busiest->cfs.h_nr_running.
   cfs.nr_running is the number of entries running in this 'group'. If
   you've got nested groups like:
  
'root'
  \
  'A'
  / \
 t1 t2
  
   root.nr_running := 1 'A', even though you've got multiple running tasks.

 One thing though; doesn't h_nr_running over count the number of tasks?
 That is, doesn't it count the runnable entities so the above case would
 give root.h_nr_running := 3, where we would only have 2 runnable tasks.

 Double check this and be careful when doing the conversion.

This should be ok: it's accounted like rq->nr_running, not cfs_rq->nr_running.
Specifically: both only account tasks; group-entities do not contribute.

The fact that this distinction exists, despite the very similar names
is unfortunate.
We could consider renaming to h_nr_{running_,}tasks for clarity.
The same applies to rq->nr_running, although that would involve more churn.


Re: false nr_running check in load balance?

2013-08-13 Thread Paul Turner
On Tue, Aug 13, 2013 at 12:38 AM, Peter Zijlstra pet...@infradead.org wrote:
 On Tue, Aug 13, 2013 at 12:45:12PM +0800, Lei Wen wrote:
  Not quite right; I think you need busiest->cfs.h_nr_running.
  cfs.nr_running is the number of entries running in this 'group'. If
  you've got nested groups like:
 
   'root'
 \
 'A'
 / \
t1 t2
 
  root.nr_running := 1 'A', even though you've got multiple running tasks.
 

 You're absolutely right for this. :)
 I miss it for not considering the group case...

 Then do you think it is necessary to do below change in load_balance() code?
  -   if (busiest->nr_running > 1) {
  +   if (busiest->cfs.h_nr_running > 1) {


 Yes I think that would be fine.

If we pivot to use h_nr_running we should probably also update
call-sites such as cpu_load_avg_per_task() for consistency.


Re: false nr_running check in load balance?

2013-08-13 Thread Paul Turner
On Tue, Aug 13, 2013 at 1:18 AM, Lei Wen adrian.w...@gmail.com wrote:
 Hi Paul,

 On Tue, Aug 13, 2013 at 4:08 PM, Paul Turner p...@google.com wrote:
 On Tue, Aug 13, 2013 at 12:38 AM, Peter Zijlstra pet...@infradead.org 
 wrote:
 On Tue, Aug 13, 2013 at 12:45:12PM +0800, Lei Wen wrote:
  Not quite right; I think you need busiest->cfs.h_nr_running.
  cfs.nr_running is the number of entries running in this 'group'. If
  you've got nested groups like:
 
        'root'
           \
           'A'
           / \
          t1 t2
 
  root.nr_running := 1 'A', even though you've got multiple running tasks.
 

 You're absolutely right for this. :)
 I miss it for not considering the group case...

 Then do you think it is necessary to do below change in load_balance() 
 code?
   -	if (busiest->nr_running > 1) {
   +	if (busiest->cfs.h_nr_running > 1) {


 Yes I think that would be fine.

 If we pivot to use h_nr_running we should probably also update
 call-sites such as cpu_load_avg_per_task() for consistency.

 I didn't find cpu_load_avg_per_task in the latest linux git...
 Is it a new patch pending while not being submitted?

Transposition typo: cpu_avg_load_per_task()
More generally: Most things that examine ->nr_running in the fair
load-balance path.
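
Concretely, the kind of follow-up conversion meant here would look something
like the sketch below. This is for illustration only, not a patch against any
particular tree, and the exact surrounding code differs across versions:

	/* load_balance(): only CFS tasks can be pulled from the busiest rq */
-	if (busiest->nr_running > 1) {
+	if (busiest->cfs.h_nr_running > 1) {

	/* cpu_avg_load_per_task(): divide by the number of CFS tasks rather
	 * than rq->nr_running, which also counts tasks from other scheduling
	 * classes. */
-	unsigned long nr_running = rq->nr_running;
+	unsigned long nr_running = rq->cfs.h_nr_running;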


 Thanks,
 Lei


Re: [PATCH] sched,x86: optimize switch_mm for multi-threaded workloads

2013-07-31 Thread Paul Turner
On Wed, Jul 31, 2013 at 2:43 PM, Rik van Riel r...@redhat.com wrote:
 Don Zickus and Joe Mario have been working on improvements to
 perf, and noticed heavy cache line contention on the mm_cpumask,
 running linpack on a 60 core / 120 thread system.

 The cause turned out to be unnecessary atomic accesses to the
 mm_cpumask. When in lazy TLB mode, the CPU is only removed from
 the mm_cpumask if there is a TLB flush event.

 Most of the time, no such TLB flush happens, and the kernel
 skips the TLB reload.  It can also skip the atomic memory
 set & test.

 Here is a summary of Joe's test results:

  * The __schedule function dropped from 24% of all program cycles down
to 5.5%.
  * The cacheline contention/hotness for accesses to that bitmask went
from being the 1st/2nd hottest - down to the 84th hottest (0.3% of
all shared misses which is now quite cold)
  * The average load latency for the bit-test-n-set instruction in
__schedule dropped from 10k-15k cycles down to an average of 600 cycles.
  * The linpack program results improved from 133 GFlops to 144 GFlops.
Peak GFlops rose from 133 to 153.

 Reported-by: Don Zickus dzic...@redhat.com
 Reported-by: Joe Mario jma...@redhat.com
 Tested-by: Joe Mario jma...@redhat.com
 Signed-off-by: Rik van Riel r...@redhat.com
 ---
  arch/x86/include/asm/mmu_context.h | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

 diff --git a/arch/x86/include/asm/mmu_context.h 
 b/arch/x86/include/asm/mmu_context.h
 index cdbf367..987eb3d 100644
 --- a/arch/x86/include/asm/mmu_context.h
 +++ b/arch/x86/include/asm/mmu_context.h
 @@ -59,11 +59,12 @@ static inline void switch_mm(struct mm_struct *prev, 
 struct mm_struct *next,
 this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);

 -   if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next))) {
 +   if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
 /* We were in lazy tlb mode and leave_mm disabled
  * tlb flush IPI delivery. We must reload CR3
  * to make sure to use no freed page tables.
  */
 +   cpumask_set_cpu(cpu, mm_cpumask(next));
 			load_cr3(next->pgd);
 			load_LDT_nolock(&next->context);
 }

We're carrying the *exact* same patch for the *exact* same reason.  I've
been meaning to send it out but wasn't sure of a good external
workload for this.

Reviewed-by: Paul Turner p...@google.com


Re: [PATCH] sched,x86: optimize switch_mm for multi-threaded workloads

2013-07-31 Thread Paul Turner
We attached the following explanatory comment to our version of the patch:

/*
* In the common case (two user threads sharing mm
* switching) the bit will be set; avoid doing a write
* (via atomic test  set) unless we have to.  This is
* safe, because no other CPU ever writes to our bit
* in the mask, and interrupts are off (so we can't
* take a TLB IPI here.)  If we don't do this, then
* switching threads will pingpong the cpumask
* cacheline.
*/
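
For reference, the same idea in a generic, self-contained form; this is an
illustrative sketch using C11 atomics (the function name is invented here),
not kernel code:

#include <stdatomic.h>
#include <stdbool.h>

/*
 * Test before test-and-set: the plain load leaves the cacheline shared on
 * the common already-set path; the atomic RMW, which takes the line
 * exclusive and causes the ping-pong, only runs when the bit actually
 * needs setting.
 */
static bool mask_set_bit(atomic_ulong *mask, unsigned int bit)
{
	unsigned long b = 1UL << bit;

	if (atomic_load_explicit(mask, memory_order_relaxed) & b)
		return false;				/* already set: read only */

	return !(atomic_fetch_or(mask, b) & b);		/* true if we set it */
}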

On Wed, Jul 31, 2013 at 4:12 PM, Rik van Riel r...@redhat.com wrote:
 On 07/31/2013 06:46 PM, Linus Torvalds wrote:


  On Jul 31, 2013 3:39 PM, Rik van Riel r...@redhat.com wrote:
  
   On 07/31/2013 06:21 PM, Linus Torvalds wrote:
  
   Ummm.. The race is to the testing of the bit, not setting. The testing
   of the bit is not valid before we have set the tlb state, AFAIK.
  
  
   I believe the bit is cleared and set by the current CPU.

 Yes, but we need to be careful with interrupts.


   Interrupts are blocked inside switch_mm, so I think we
   are safe.

 Are they? I thought we removed all that.


 context_switch() shows that the runqueue lock (which is an irq
 lock) is released, and irqs re-enabled, by the next task, after
 switch_to(), in finish_lock_switch(), called from finish_task_switch()

 Note that switch_mm gets called for activate_mm too, or something.


 Good catch, though it looks like activate_mm is only called from
  exec_mmap, with the new mm as its argument.  While the new mm may
  already have pages in memory, it has never been run, so there should
  be nothing in the TLB yet for the new mm.

 This is subtler than I thought, but it does appear to be safe.


 --
 All rights reversed


PROBLEM: Persistent unfair sharing of a processor by auto groups in 3.11-rc2 (has twice regressed)

2013-07-26 Thread Paul Turner
[Moving to LKML]

Max sent in the following (best ever) bug report:

On Wed, Jul 24, 2013 at 8:41 AM, Max Hailperin m...@gustavus.edu wrote:
 [1.] One line summary of the problem:
 Persistent unfair sharing of a processor by auto groups in 3.11-rc2 (has
 twice regressed)

 [2.] Full description of the problem/report:
 Although I've used bisection to figure out where this problem originated and
 have a one-line patch that fixes it, I don't understand the root cause and
 don't have any reason to think my patch is the right one.

 If I run a single processor-hogging loop in each of two terminal windows (so
 one task in each of two auto groups) and use schedtool to pin each of them
 on processor 0, they ought to each get 50% of that processor's time.
 However, a substantial fraction of my trials of this experiment (detailed in
 section 7 below) produce a different result: the two tasks get persistently
 stuck in some ratio other than 50/50.  In the screenshot I have attached,
 they are running 55/45, but I have observed as extreme as 66/33. Whatever
 the ratio is, it remains fixed aside from minor fluctuations for as long as
 I let it run, which has been over an hour.  This phenomenon doesn't show up
 if I boot with nosmp rather than using schedtool or if I disable auto
 grouping.  The tasks really are getting disproportionate CPU rather than
 just looking that way due to some accounting malfunction; I've confirmed
 this by having each produce output every N iterations so that I can see they
 are running at different rates.

 Bisection reveals that this problem most recently was introduced by commit
 17bc14b767cf0692420c43dbe5310ae98a5a7836 which reverted commit
 f269ae0469fc882332bdfb5db15d3c1315fe2a10 (sched: Update_cfs_shares at period
 edge). In between these two commits, I consistently get 50/50 sharing aside
 from  small fluctuations.

 As suggested by the problem having been (re-) introduced by a revert, it was
 also present as of the parent commit of the reverted commit (that is, as of
 48a1675323fa1b7844e479ad2a4469f4558c0f79).  A second round of bisection
 looking backward from there identifies the problem as having initially been
 introduced in commit 82958366cfea1a50e7e90907b2d55ae29ed69974 (sched:
 Replace update_shares weight distribution with per-entity computation).

 I was able to make the problem go away by applying to 3.11-rc2 the one-line
 patch I have attached, which is a subset of the reverted commit.

 [3.] Keywords (i.e., modules, networking, kernel):
 sched

 [4.] Kernel information
 [4.1.] Kernel version (from /proc/version):
 Linux version 3.11.0-rc2 (max@max-st1-Inspiron-5423) (gcc version 4.7.3
 (Ubuntu/Linaro 4.7.3-1ubuntu1) ) #1 SMP Tue Jul 23 13:09:32 CDT 2013

 [4.2.] Kernel .config file:
 See the attached config file.

 [5.] Most recent kernel version which did not have the bug:
 See bisection information above.

 [6.] Output of Oops.. message (if applicable) with symbolic information
  resolved (see Documentation/oops-tracing.txt)
 N/A

 [7.] A small shell script or example program which triggers the
  problem (if possible)
 In each of two terminal windows give the command
   schedtool -a 0 -e bash -c 'while ((1)); do :; done'
 while in a third terminal window running top to observe the result. See the
 attached screenshot for an example.  The results are not consistent, but in
 the buggy versions of the kernel, many of the attempts will show the tasks
 running in a non-50/50 ratio (persistently). It's never taken me more than a
 handful of tries to see the effect. Also, in the cases where the tasks are
 running in more or less 50/50, if I stop one with control-z and then restart
 it, sometimes that is enough to switch to disproportionate execution.

 [8.] Environment
 [8.1.] Software (add the output of the ver_linux script here)
 If some fields are empty or look unusual you may have an old version.
 Compare to the current minimal requirements in Documentation/Changes.

 Linux max-st1-Inspiron-5423 3.11.0-rc2 #1 SMP Tue Jul 23 13:09:32 CDT 2013
 x86_64 x86_64 x86_64 GNU/Linux

 Gnu C  4.7
 Gnu make   3.81
 binutils   2.23.2
 util-linux 2.20.1
 mount  support
 module-init-tools  9
 e2fsprogs  1.42.5
 pcmciautils018
 PPP2.4.5
 Linux C Library2.17
 Dynamic linker (ldd)   2.17
 Procps 3.3.3
 Net-tools  1.60
 Kbd1.15.5
 Sh-utils   8.20
 wireless-tools 30
 Modules Loaded rfcomm parport_pc bnep ppdev arc4 iwldvm mac80211
 i915 snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_intel snd_hda_codec
 iwlwifi drm_kms_helper drm uvcvideo btusb videobuf2_vmalloc videobuf2_memops
 snd_hwdep videobuf2_core bluetooth snd_pcm videodev joydev snd_page_alloc
 snd_seq_midi cfg80211 snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device
 snd_timer mei_me lp i2c_algo_bit psmouse snd mei video 

Re: PROBLEM: Persistent unfair sharing of a processor by auto groups in 3.11-rc2 (has twice regressed)

2013-07-26 Thread Paul Turner
 Ah I see what's happening. Currently we only call update_cfs_shares()
 from the enqueue/dequeue paths. However since your tasks never sleep and
 cannot migrate they never pass this code. This in turns means that the
 inter-cgroup relations (the shares) don't get updated.

 I think your patch would indeed be a correct minimal fix in that it adds
 a share update to the regular tick path ensuring it gets done even in
 the absence of enqueues/dequeues.

Hmm.. /me scratches head.

We definitely _used_ to have a call to update_cfs_shares() in
update_cfs_rq_blocked_load() for _exactly_ this reason.

Let's see... a-ha.

It was originally part of:
  f269ae0469fc882332bdfb5db15d3c1315fe2a10 sched: Update_cfs_shares at
period edge

But we reverted that right at the end of the merge window because the
amortization turned out to be too aggressive for interactive
workloads.  This happened in 17bc14b767cf0692420c43dbe5310ae98a5a7836.

Unfortunately, when we did that, we lost the call to update_cfs_shares().

Yes, this wants to be there.

Reviewed-by: Paul Turner p...@google.com

For a description how about something like:

sched: ensure update_cfs_shares() is called for parents of
continuously-running tasks

We typically update a task_group's shares within the dequeue/enqueue
path.  However, continuously running tasks sharing a CPU are not
subject to these updates as they are only put/picked.  Unfortunately,
when we reverted f269ae046 (in 17bc14b7), we lost the augmenting
periodic update that was supposed to account for this; resulting in a
potential loss of fairness.

To fix this, re-introduce the explicit update in
update_cfs_rq_blocked_load() [called via entity_tick()].


Re: PROBLEM: Persistent unfair sharing of a processor by auto groups in 3.11-rc2 (has twice regressed)

2013-07-26 Thread Paul Turner
On Fri, Jul 26, 2013 at 2:03 PM, Peter Zijlstra pet...@infradead.org wrote:


 OK, so I have the below; however on a second look, Paul, shouldn't that
 update_cfs_shares() call be in entity_tick(), right after calling
 update_cfs_rq_blocked_load(). Because placing it in
 update_cfs_rq_blocked_load() means its now called twice on the
 enqueue/dequeue paths through:

    {en,de}queue_entity()
      {en,de}queue_entity_load_avg()
        update_cfs_rq_blocked_load()
          update_cfs_shares()

Yes, I agree: placing it directly in entity_tick() would be better.

[ In f269ae046 the calls to update_cfs_rq_blocked_load() were amortized
and the separate update in {en,de}queue_entity_load_avg() were
removed. ]
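
For concreteness, roughly what that placement looks like; this is a sketch
against the 3.11-era entity_tick(), and the surrounding calls may differ:

static void
entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
	/* Update run-time statistics of the 'current'. */
	update_curr(cfs_rq);

	/* Ensure that the runnable average is periodically updated. */
	update_entity_load_avg(curr, 1);
	update_cfs_rq_blocked_load(cfs_rq, 1);

	/*
	 * The re-introduced periodic share update: done once per tick here,
	 * rather than inside update_cfs_rq_blocked_load(), so the
	 * enqueue/dequeue paths don't end up calling it twice.
	 */
	update_cfs_shares(cfs_rq);

	/* ... hrtick and preemption checks unchanged ... */
}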





 ---
 Subject: sched: Ensure update_cfs_shares() is called for parents of 
 continuously-running tasks
 From: Max Hailperin m...@gustavus.edu

 We typically update a task_group's shares within the dequeue/enqueue
 path.  However, continuously running tasks sharing a CPU are not
 subject to these updates as they are only put/picked.  Unfortunately,
 when we reverted f269ae046 (in 17bc14b7), we lost the augmenting
 periodic update that was supposed to account for this; resulting in a
 potential loss of fairness.

 To fix this, re-introduce the explicit update in
 update_cfs_rq_blocked_load() [called via entity_tick()].

 Cc: sta...@kernel.org
 Reviewed-by: Paul Turner p...@google.com
 Signed-off-by: Peter Zijlstra pet...@infradead.org
 ---
  kernel/sched/fair.c |1 +
  1 file changed, 1 insertion(+)

 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -1531,6 +1531,7 @@ static void update_cfs_rq_blocked_load(s
 }

 __update_cfs_rq_tg_load_contrib(cfs_rq, force_update);
 +   update_cfs_shares(cfs_rq);
  }

  static inline void update_rq_runnable_avg(struct rq *rq, int runnable)


Re: PROBLEM: Persistent unfair sharing of a processor by auto groups in 3.11-rc2 (has twice regressed)

2013-07-26 Thread Paul Turner
On Fri, Jul 26, 2013 at 2:50 PM, Peter Zijlstra pet...@infradead.org wrote:
 On Fri, Jul 26, 2013 at 02:24:50PM -0700, Paul Turner wrote:
 On Fri, Jul 26, 2013 at 2:03 PM, Peter Zijlstra pet...@infradead.org wrote:
 
 
  OK, so I have the below; however on a second look, Paul, shouldn't that
  update_cfs_shares() call be in entity_tick(), right after calling
  update_cfs_rq_blocked_load(). Because placing it in
  update_cfs_rq_blocked_load() means its now called twice on the
  enqueue/dequeue paths through:
 
      {en,de}queue_entity()
        {en,de}queue_entity_load_avg()
          update_cfs_rq_blocked_load()
            update_cfs_shares()

 Yes, I agree: placing it directly in entity_tick() would be better.

 OK, how about the below then?

Looks good.


 [ In f269ae046 the calls to update_cfs_rq_blocked_load() were amortized
 and the separate update in {en,de}queue_entity_load_avg() were
 removed. ]

 Right, I remember/saw that. Did you ever figure out why that regressed;
 as in should we look to bring some of that back?

Yes, the savings are measurable (we actually still use it internally).

So the particular problem in Linus's workload was that the amortization meant
there was more delay until the first update for a newly created task.  This
then interacted badly with a make -j N workload, since it allowed the tasks
to be over-represented in terms of the group shares they received.

With:
  a75cdaa9: sched: Set an initial value of runnable avg for new forked task

This should now be improved so we should look at bringing it back.

