Re: [PATCH 3/8] powerpc: Mark functions called inside uaccess blocks w/ 'notrace'

2020-10-16 Thread Peter Zijlstra
On Fri, Oct 16, 2020 at 07:56:16AM +0100, Christoph Hellwig wrote:
> On Thu, Oct 15, 2020 at 10:01:54AM -0500, Christopher M. Riedl wrote:
> > Functions called between user_*_access_begin() and user_*_access_end()
> > should be either inlined or marked 'notrace' to prevent leaving
> > userspace access exposed. Mark any such functions relevant to signal
> > handling so that subsequent patches can call them inside uaccess blocks.
> 
> I don't think running this much code with uaccess enabled is a good
> idea.  Please refactor the code to reduce the critical sections with
> uaccess enabled.
> 
> Btw, does powerpc already have the objtool validation that we don't
> accidentally jump out of unsafe uaccess critical sections?

It does not, there was some effort on that a while ago, but I suspect
they're waiting for the ARM64 effort to land and build on that.


Re: [patch RFC 00/15] mm/highmem: Provide a preemptible variant of kmap_atomic & friends

2020-09-24 Thread Peter Zijlstra
On Thu, Sep 24, 2020 at 09:51:38AM -0400, Steven Rostedt wrote:

> > It turns out, that getting selected for pull-balance is exactly that
> > condition, and clearly a migrate_disable() task cannot be pulled, but we
> > can use that signal to try and pull away the running task that's in the
> > way.
> 
> Unless of course that running task is in a migrate disable section
> itself ;-)

See my ramblings here:

  
https://lkml.kernel.org/r/20200924082717.ga1362...@hirez.programming.kicks-ass.net

My plan was to have the migrate_enable() of the running task trigger the
migration in that case.
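
Roughly, and with made-up names (push_pending, push_away_current()) -- this
is a sketch of the intent, not the actual scheduler code:

void migrate_enable(void)
{
        struct task_struct *p = current;

        if (--p->migration_disabled)
                return;

        /*
         * If pull-balance flagged us as being in the way of a
         * migrate_disable() task while we couldn't move, push
         * ourselves away now that we can.
         */
        if (unlikely(this_rq()->push_pending))          /* hypothetical */
                push_away_current(this_rq());           /* hypothetical */
}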



Re: [patch RFC 00/15] mm/highmem: Provide a preemptible variant of kmap_atomic & friends

2020-09-24 Thread Peter Zijlstra
On Thu, Sep 24, 2020 at 08:32:41AM -0400, Steven Rostedt wrote:
> Anyway, instead of blocking. What about having a counter of number of
> migrate disabled tasks per cpu, and when taking a migrate_disable(), and 
> there's
> already another task with migrate_disabled() set, and the current task has
> an affinity greater than 1, it tries to migrate to another CPU?

That doesn't solve the problem. On wakeup we should already prefer an
idle CPU over one running a (RT) task, but you can always wake more
tasks than there are CPUs around and you'll _have_ to stack at some point.

The trick is how to unstack them correctly. We need to detect when a
migrate_disable() task _should_ start running again, and migrate away
whoever is in the way at that point.

It turns out, that getting selected for pull-balance is exactly that
condition, and clearly a migrate_disable() task cannot be pulled, but we
can use that signal to try and pull away the running task that's in the
way.


Re: [PATCH 1/2] lockdep: improve current->(hard|soft)irqs_enabled synchronisation with actual irq state

2020-07-25 Thread Peter Zijlstra
On Thu, Jul 23, 2020 at 08:56:14PM +1000, Nicholas Piggin wrote:
> diff --git a/arch/powerpc/include/asm/hw_irq.h 
> b/arch/powerpc/include/asm/hw_irq.h
> index 3a0db7b0b46e..35060be09073 100644
> --- a/arch/powerpc/include/asm/hw_irq.h
> +++ b/arch/powerpc/include/asm/hw_irq.h
> @@ -200,17 +200,14 @@ static inline bool arch_irqs_disabled(void)
>  #define powerpc_local_irq_pmu_save(flags)\
>do {   \
>   raw_local_irq_pmu_save(flags);  \
> - trace_hardirqs_off();   \
> + if (!raw_irqs_disabled_flags(flags))\
> + trace_hardirqs_off();   \
>   } while(0)

So one problem with the above is something like this:

raw_local_irq_save();

  powerpc_local_irq_pmu_save();

that would now no longer call into tracing/lockdep at all. As a
consequence, lockdep and tracing would show the NMI ran with IRQs
enabled, which is exceptionally weird..

Similar problem with:

raw_local_irq_disable();
local_irq_save()

Now, most architectures today seem to do what x86 also did:


  trace_hardirqs_off()
  ...
  if (irqs_unmasked(regs))
trace_hardirqs_on()


Which is 'funny' when it interleaves like:

local_irq_disable();
...
local_irq_enable()
  trace_hardirqs_on();
  
  raw_local_irq_enable();

Because then it will undo the trace_hardirqs_on() we just did. With the
result that both tracing and lockdep will see a hardirqs-disable without
a matching enable, while the hardware state is enabled.

Which is exactly the state Alexey seems to have run into.

Now, x86, and at least arm64 call nmi_enter() before
trace_hardirqs_off(), but AFAICT Power never did that, and that's part
of the problem. nmi_enter() does lockdep_off() and that _used_ to also
kill IRQ tracking.

Now, my patch changed that, it makes IRQ tracking not respect
lockdep_off(). And that exposed x86 (and everybody else :/) to the same
problem you have.

And this is why I made x86 look at software state in NMIs. Because then
it all works out. For bonus points, trace_hardirqs_*() also has some
do-it-once logic for tracing.



Anyway, it's Saturday evening, time for a beer. I'll stare at this more
later.


Re: [PATCH v3 5/6] powerpc/pseries: implement paravirt qspinlocks for SPLPAR

2020-07-25 Thread Peter Zijlstra
On Fri, Jul 24, 2020 at 03:10:59PM -0400, Waiman Long wrote:
> On 7/24/20 4:16 AM, Will Deacon wrote:
> > On Thu, Jul 23, 2020 at 08:47:59PM +0200, pet...@infradead.org wrote:
> > > On Thu, Jul 23, 2020 at 02:32:36PM -0400, Waiman Long wrote:
> > > > BTW, do you have any comment on my v2 lock holder cpu info qspinlock 
> > > > patch?
> > > > I will have to update the patch to fix the reported 0-day test problem, 
> > > > but
> > > > I want to collect other feedback before sending out v3.
> > > I want to say I hate it all, it adds instructions to a path we spend an
> > > awful lot of time optimizing without really getting anything back for
> > > it.
> > > 
> > > Will, how do you feel about it?
> > I can see it potentially being useful for debugging, but I hate the
> > limitation to 256 CPUs. Even arm64 is hitting that now.
> 
> After thinking more about that, I think we can use all the remaining bits in
> the 16-bit locked_pending. Reserving 1 bit for locked and 1 bit for pending,
> there are 14 bits left. So as long as NR_CPUS < 16k (requirement for 16-bit
> locked_pending), we can put all possible cpu numbers into the lock. We can
> also just use smp_processor_id() without additional percpu data.

That sounds horrific, wouldn't that destroy the whole point of using a
byte for pending?
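
For reference, the current layout (when NR_CPUS < 16K) and the unlock fast
path it enables in the generic code:

/*
 *  0- 7: locked byte
 *     8: pending
 *  9-15: not used
 * 16-17: tail index
 * 18-31: tail cpu (+1)
 */
static __always_inline void queued_spin_unlock(struct qspinlock *lock)
{
        /* unlock is a single plain byte store, no atomic needed */
        smp_store_release(&lock->locked, 0);
}

Stuffing a CPU number into those locked/pending bits means unlock can no
longer be a plain byte store.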

> > Also, you're talking ~1% gains here. I think our collective time would
> > be better spent off reviewing the CNA series and trying to make it more
> > deterministic.
> 
> I thought you guys are not interested in CNA. I do want to get CNA merged,
> if possible. Let me review the current version again and see if there are ways
> we can further improve it.

It's not a lack of interest. We were struggling with the fairness
issues and the complexity of the thing. I forgot the current state of
matters, but at one point UNLOCK was O(n) in waiters, which is, of
course, 'unfortunate'.

I'll have to look up whatever notes remain, but the basic idea of
keeping remote nodes on a secondary list is obviously breaking all sorts
of fairness. After that they pile on a bunch of hacks to fix the worst
of them, but it feels exactly like that, a bunch of hacks.

One of the things I suppose we ought to do is see if some of the ideas
of phase-fair locks can be applied to this.

That coupled with a chronic lack of time for anything :-(


Re: [PATCH 1/2] lockdep: improve current->(hard|soft)irqs_enabled synchronisation with actual irq state

2020-07-23 Thread Peter Zijlstra
On Thu, Jul 23, 2020 at 11:11:03PM +1000, Nicholas Piggin wrote:
> Excerpts from Peter Zijlstra's message of July 23, 2020 9:40 pm:
> > On Thu, Jul 23, 2020 at 08:56:14PM +1000, Nicholas Piggin wrote:
> > 
> >> diff --git a/arch/powerpc/include/asm/hw_irq.h 
> >> b/arch/powerpc/include/asm/hw_irq.h
> >> index 3a0db7b0b46e..35060be09073 100644
> >> --- a/arch/powerpc/include/asm/hw_irq.h
> >> +++ b/arch/powerpc/include/asm/hw_irq.h
> >> @@ -200,17 +200,14 @@ static inline bool arch_irqs_disabled(void)
> >>  #define powerpc_local_irq_pmu_save(flags) \
> >> do {   \
> >>raw_local_irq_pmu_save(flags);  \
> >> -  trace_hardirqs_off();   \
> >> +  if (!raw_irqs_disabled_flags(flags))\
> >> +  trace_hardirqs_off();   \
> >>} while(0)
> >>  #define powerpc_local_irq_pmu_restore(flags)  \
> >>do {\
> >> -  if (raw_irqs_disabled_flags(flags)) {   \
> >> -  raw_local_irq_pmu_restore(flags);   \
> >> -  trace_hardirqs_off();   \
> >> -  } else {\
> >> +  if (!raw_irqs_disabled_flags(flags))\
> >>trace_hardirqs_on();\
> >> -  raw_local_irq_pmu_restore(flags);   \
> >> -  }   \
> >> +  raw_local_irq_pmu_restore(flags);   \
> >>} while(0)
> > 
> > You shouldn't be calling lockdep from NMI context!
> 
> After this patch it doesn't.

You sure, trace_hardirqs_{on,off}() calls into lockdep. (FWIW they're
also broken vs entry ordering, but that's another story).

> trace_hardirqs_on/off implementation appears to expect to be called in NMI 
> context though, for some reason.

Hurpm, not sure.. I'll have to go grep arch code now :/ The generic NMI
code didn't touch that stuff.

Argh, yes, things might be broken there... damn! I'll go frob around.


Re: [PATCH v3 5/6] powerpc/pseries: implement paravirt qspinlocks for SPLPAR

2020-07-23 Thread Peter Zijlstra
On Thu, Jul 09, 2020 at 12:06:13PM -0400, Waiman Long wrote:
> We don't really need to do a pv_spinlocks_init() if pv_kick() isn't
> supported.

Waiman, if you cannot explain how not having kick is a sane thing, what
are you saying here?


Re: [PATCH 1/2] lockdep: improve current->(hard|soft)irqs_enabled synchronisation with actual irq state

2020-07-23 Thread Peter Zijlstra
On Thu, Jul 23, 2020 at 08:56:14PM +1000, Nicholas Piggin wrote:

> diff --git a/arch/powerpc/include/asm/hw_irq.h 
> b/arch/powerpc/include/asm/hw_irq.h
> index 3a0db7b0b46e..35060be09073 100644
> --- a/arch/powerpc/include/asm/hw_irq.h
> +++ b/arch/powerpc/include/asm/hw_irq.h
> @@ -200,17 +200,14 @@ static inline bool arch_irqs_disabled(void)
>  #define powerpc_local_irq_pmu_save(flags)\
>do {   \
>   raw_local_irq_pmu_save(flags);  \
> - trace_hardirqs_off();   \
> + if (!raw_irqs_disabled_flags(flags))\
> + trace_hardirqs_off();   \
>   } while(0)
>  #define powerpc_local_irq_pmu_restore(flags) \
>   do {\
> - if (raw_irqs_disabled_flags(flags)) {   \
> - raw_local_irq_pmu_restore(flags);   \
> - trace_hardirqs_off();   \
> - } else {\
> + if (!raw_irqs_disabled_flags(flags))\
>   trace_hardirqs_on();\
> - raw_local_irq_pmu_restore(flags);   \
> - }   \
> + raw_local_irq_pmu_restore(flags);   \
>   } while(0)

You shouldn't be calling lockdep from NMI context! That is, I recently
added support for that on x86:

  https://lkml.kernel.org/r/20200623083721.155449...@infradead.org
  https://lkml.kernel.org/r/20200623083721.216740...@infradead.org

But you need to be very careful on how you order things, as you can see
the above relies on preempt_count() already having been incremented with
NMI_MASK.
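
The shape of it, going by the x86 idtentry NMI entry (simplified sketch,
not something powerpc can copy verbatim):

  bool irq_state = lockdep_hardirqs_enabled();  /* record software state */

  __nmi_enter();                        /* preempt_count() now includes NMI_OFFSET */
  lockdep_hardirqs_off(CALLER_ADDR0);   /* lockdep can see in_nmi() here */

  /* ... NMI body ... */

  if (irq_state)                        /* restore what software thought on exit */
          lockdep_hardirqs_on(CALLER_ADDR0);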


Re: [PATCH v2 06/10] powerpc/smp: Generalize 2nd sched domain

2020-07-22 Thread Peter Zijlstra
On Wed, Jul 22, 2020 at 01:48:22PM +0530, Srikar Dronamraju wrote:
> * pet...@infradead.org  [2020-07-22 09:46:24]:
> 
> > On Tue, Jul 21, 2020 at 05:08:10PM +0530, Srikar Dronamraju wrote:
> > > Currently "CACHE" domain happens to be the 2nd sched domain as per
> > > powerpc_topology. This domain will collapse if cpumask of l2-cache is
> > > same as SMT domain. However we could generalize this domain such that it
> > > could mean either be a "CACHE" domain or a "BIGCORE" domain.
> > 
> > What's the whole smallcore vs bigcore thing?
> > 
> > Would it make sense to have an actual overview of the various topology
> > layers somewhere? Because I'm getting lost and can't really make sense
> > of the code due to that.
> 
> To quote with an example: using (Power 9)
> 
> Power 9 is an SMT 8 core by design. However this 8 thread core can run as 2
> independent cores, with threads 0,2,4,6 acting as one core and threads
> 1,3,5,7 acting as another core.
> 
> The firmware can decide to showcase them as 2 independent small cores or as
> one big core. However the LLC (i.e last level of cache) is shared between
> all the threads of the SMT 8 core. In future Power chips the LLC might
> change: it may be expanded to be shared with another SMT 8 core, or it
> could be made specific to an SMT 4 core.
> 
> The smt 8 core is also known as fused core/ Big core.
> The smaller smt 4 core is known as small core.
> 
> So on a Power9 Baremetal, the firmware would show up as SMT4 core.
> and we have a CACHE level at SMT 8. CACHE level would be very very similar
> to MC domain in X86.
> 
> Hope this is clear.

Ooh, that thing. I thought P9 was in actual fact an SMT4 hardware part,
but to be compatible with P8 it has this fused option where two cores
act like a single SMT8 part.

The interleaving enumeration is done due to P7 asymmetric cores,
resulting in the scheduler having a preference for the lower threads.

Combined, that results in a P9-fused configuration using two independent
full cores when there are only 2 runnable threads.

Which is a subtly different story from yours.

But reading your explanation, it looks like the Linux topology setup
could actually unravel the fused-faux-SMT8 into two independent SMT4
parts, negating that firmware option.

Anyway, a few comments just around there might be helpful.


Thanks!


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-21 Thread Peter Zijlstra
On Tue, Jul 21, 2020 at 11:15:13AM -0400, Mathieu Desnoyers wrote:
> - On Jul 21, 2020, at 11:06 AM, Peter Zijlstra pet...@infradead.org wrote:
> 
> > On Tue, Jul 21, 2020 at 08:04:27PM +1000, Nicholas Piggin wrote:
> > 
> >> That being said, the x86 sync core gap that I imagined could be fixed
> >> by changing to rq->curr == rq->idle test does not actually exist because
> >> the global membarrier does not have a sync core option. So fixing the
> >> exit_lazy_tlb points that this series does *should* fix that. So
> >> PF_KTHREAD may be less problematic than I thought from implementation
> >> point of view, only semantics.
> > 
> > So I've been trying to figure out where that PF_KTHREAD comes from,
> > commit 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy
> > load") changed 'p->mm' to '!(p->flags & PF_KTHREAD)'.
> > 
> > So the first version:
> > 
> >  
> > https://lkml.kernel.org/r/20190906031300.1647-5-mathieu.desnoy...@efficios.com
> > 
> > appears to unconditionally send the IPI and checks p->mm in the IPI
> > context, but then v2:
> > 
> >  
> > https://lkml.kernel.org/r/20190908134909.12389-1-mathieu.desnoy...@efficios.com
> > 
> > has the current code. But I've been unable to find the reason the
> > 'p->mm' test changed into '!(p->flags & PF_KTHREAD)'.
> 
> Looking back at my inbox, it seems like you are the one who proposed to
> skip all kthreads: 
> 
> https://lkml.kernel.org/r/20190904124333.gq2...@hirez.programming.kicks-ass.net

I had a feeling it might've been me ;-) I just couldn't find the email.

> > The comment doesn't really help either; sure we have the whole lazy mm
> > thing, but that's ->active_mm, not ->mm.
> > 
> > Possibly it is because {,un}use_mm() do not have sufficient barriers to
> > make the remote p->mm test work? Or were we over-eager with the !p->mm
> > doesn't imply kthread 'cleanups' at the time?
> 
> The nice thing about adding back kthreads to the threads considered for 
> membarrier
> IPI is that it has no observable effect on the user-space ABI. No 
> pre-existing kthread
> rely on this, and we just provide an additional guarantee for future kthread
> implementations.
> 
> > Also, I just realized, I still have a fix for use_mm() now
> > kthread_use_mm() that seems to have been lost.
> 
> I suspect we need to at least document the memory barriers in kthread_use_mm 
> and
> kthread_unuse_mm to state that they are required by membarrier if we want to
> ipi kthreads as well.

Right, so going by that email you found it was mostly a case of being
lazy, but yes, if we audit the kthread_{,un}use_mm() barriers and add
any other bits that might be needed, covering kthreads should be
possible.

No objections from me for making it so.


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-16 Thread Peter Zijlstra
On Wed, Jul 15, 2020 at 10:18:20PM -0700, Andy Lutomirski wrote:
> > On Jul 15, 2020, at 9:15 PM, Nicholas Piggin  wrote:

> > CPU0 CPU1
> > 1. user stuff
> > a. membarrier()  2. enter kernel
> > b. read rq->curr 3. rq->curr switched to kthread
> > c. is kthread, skip IPI  4. switch_to kthread
> > d. return to user5. rq->curr switched to user thread
> > 6. switch_to user thread
> > 7. exit kernel
> > 8. more user stuff

> I find it hard to believe that this is x86 only. Why would thread
> switch imply core sync on any architecture?  Is x86 unique in having a
> stupid expensive core sync that is heavier than smp_mb()?

smp_mb() is nowhere near the most expensive barrier we have in Linux;
mb() might qualify, since it has completion requirements and needs to
serialize against external actors.

On x86_64 things are rather murky, we have:

LOCK prefix -- which implies smp_mb() before and after RmW
LFENCE -- which used to be rmb-like, until Spectre, and now it
  is ISYNC-like. Since ISYNC ensures an empty pipeline,
  it also implies all loads are retired (and therefore
  complete), so it implies rmb.
MFENCE -- which is a memop completion barrier; it makes
  sure all previously issued memops are complete.

If you read that carefully, you'll note you have to use LFENCE +
MFENCE to order against non-memop instructions.

But none of them imply dumping the instruction decoder caches, that only
happens on core serializing instructions like CR3 writes, IRET, CPUID
and a few others, I think we recently got a SERIALIZE instruction to add
to this list.
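
For illustration, forcing that from C (CPUID being one of the documented
core serializing instructions; the kernel's own primitive for this is
sync_core(), this is not it):

static inline void serialize_core(void)
{
        unsigned int eax = 1, ebx, ecx, edx;

        /* CPUID drains the pipeline and forces re-fetch/re-decode */
        asm volatile("cpuid"
                     : "+a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx)
                     : : "memory");
}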


On ARM64 there's a whole different set of barriers, and again
smp_mb() is nowhere near the top of the list. They have roughly 3
classes:

ISB -- instruction sync barrier
DMB(x) -- memory ordering in domain x
DSB(x) -- memory completion in domain x

And they have at least 3 domains (IIRC), system, outer, inner.

The ARM64 __switch_to() includes a dsb(sy), just like PowerPC used to
have a SYNC, but since PowerPC is rare in only having one ridiculously
heavy serializing instruction, we got to re-use the smp_mb() early in
__schedule() instead; ARM64 can't do that.


So rather than say that x86 is special here, I'd say that PowerPC is
special here.

> But I’m wondering if all this deferred sync stuff is wrong. In the
> brave new world of io_uring and such, perhaps kernel access matter
> too.  Heck, even:

IIRC the membarrier SYNC_CORE use-case is about user-space
self-modifying code.

Userspace re-uses a text address and needs to SYNC_CORE before it can be
sure the old text is forgotten. Nothing the kernel does matters there.

I suppose the manpage could be more clear there.
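
A rough userspace sketch of that use-case (assumes the process registered
with MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE earlier; error
handling and any non-x86 icache maintenance omitted):

#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>

static int membarrier(int cmd, unsigned int flags)
{
        return syscall(__NR_membarrier, cmd, flags);
}

/* Re-use an executable buffer for new code. */
static void publish_code(void *text, const void *new_code, size_t len)
{
        memcpy(text, new_code, len);    /* write the new instructions */

        /* make sure no thread can still run stale, pre-decoded text */
        membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0);
}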



Re: [PATCH v2 1/3] module: Rename module_alloc() to text_alloc() and move to kernel proper

2020-07-14 Thread Peter Zijlstra
On Tue, Jul 14, 2020 at 07:31:03PM +0300, Jarkko Sakkinen wrote:
> On Tue, Jul 14, 2020 at 03:01:09PM +0200, Peter Zijlstra wrote:
> > to help with text_alloc() usage in generic code, but I think
> > fundamentally, there's only these two options.
> 
> There is one arch (nios2), which uses a regular kzalloc(). This means
> that both text_alloc() and text_memfree() need to be weaks symbols and
> nios2 needs to have overriding text.c to do its thing.

IMO nios2 is just broken.


Re: [PATCH v2 1/3] module: Rename module_alloc() to text_alloc() and move to kernel proper

2020-07-14 Thread Peter Zijlstra
On Tue, Jul 14, 2020 at 03:19:24PM +0300, Ard Biesheuvel wrote:
> So perhaps the answer is to have text_alloc() not with a 'where'
> argument but with a 'why' argument. Or more simply, just have separate
> alloc/free APIs for each case, with generic versions that can be
> overridden by the architecture.

Well, there only seem to be 2 cases here, either the pointer needs to
fit in some immediate displacement, or not.

On x86 we seem to have the advantage of a fairly large immediate
displacement as compared to many other architectures (due to our
variable-sized instructions). And thus we have been fairly liberal with
our usage of it (also our indirect jmps/calls suck, doubly so with
RETCH-POLINE).

Still, the indirect jump, as mentioned by Russell, should work for
arbitrarily placed code for us too.


So I'm thinking that something like:

enum ptr_type {
        immediate_displacement,
        absolute,
};

void *text_alloc(unsigned long size, enum ptr_type type)
{
        unsigned long vstart = VMALLOC_START;
        unsigned long vend   = VMALLOC_END;

        if (type == immediate_displacement) {
                vstart = MODULES_VADDR;
                vend   = MODULES_END;
        }

        return __vmalloc_node_range(size, TEXT_ALIGN, vstart, vend,
                                    GFP_KERNEL, PAGE_KERNEL_EXEC, 0,
                                    NUMA_NO_NODE, _RET_IP_);
}

void text_free(void *ptr)
{
        vfree(ptr);
}

Should work for all cases. Yes, we might then want something like a per
arch:

{BPF,FTRACE,KPROBE}_TEXT_TYPE

to help with text_alloc() usage in generic code, but I think
fundamentally, there's only these two options.
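
Hypothetical call sites, just to make the two cases concrete (text_alloc()
here is the proposal above, not an existing API):

  /* ftrace trampoline: must be reachable with a relative call/jmp */
  void *tramp = text_alloc(PAGE_SIZE, immediate_displacement);

  /* BPF image on an arch that uses indirect calls: anywhere will do */
  void *image = text_alloc(image_size, absolute);

  ...

  text_free(image);
  text_free(tramp);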


Re: [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-07-14 Thread Peter Zijlstra
On Tue, Jul 14, 2020 at 05:46:05AM -0700, Andy Lutomirski wrote:
> x86 has this exact problem. At least no more than 64*8 CPUs share the cache 
> line :)

I've seen patches for a 'sparse' bitmap to solve related problems.

It's basically the same code, except it multiplies everything (size,
bit-nr) by a constant to reduce the number of active bits per line.

This sadly doesn't take topology into account, but reducing contention
is still good of course.
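
Roughly this, with made-up names, just to illustrate the trick; each
logical bit gets a whole word to itself so far fewer active bits share a
cache line:

#define SPARSE_SHIFT    6       /* one logical bit per 64-bit word */

static inline void sparse_set_bit(int nr, unsigned long *map)
{
        set_bit(nr << SPARSE_SHIFT, map);
}

static inline bool sparse_test_bit(int nr, const unsigned long *map)
{
        return test_bit(nr << SPARSE_SHIFT, map);
}

/* backing storage grows by the same factor:
 *      DECLARE_BITMAP(sparse_cpumask, NR_CPUS << SPARSE_SHIFT);
 */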


Re: [PATCH 1/3] module: Rename module_alloc() to text_alloc() and move to kernel proper

2020-07-14 Thread Peter Zijlstra
On Tue, Jul 14, 2020 at 11:33:33AM +0100, Russell King - ARM Linux admin wrote:
> For 32-bit ARM, our bpf code uses "blx/bx" (or equivalent code
> sequences) rather than encoding a "bl" or "b", so BPF there doesn't
> care where the executable memory is mapped, and doesn't need any
> PLTs.  Given that, should bpf always allocate from the vmalloc()
> region to preserve the module space for modules?

Ah, okay, then I suspect arm64 does something similar there. Thanks!

> I'm more concerned about ftrace though, but only because I don't
> have the understanding of that subsystem to really say whether there
> are any side effects from having the allocations potentially be out
> of range of a "bl" or "b" instruction.
> 
> If ftrace jumps both to and from the allocated page using a "load
> address to register, branch to register" approach like BPF does, then
> ftrace should be safe - and again, raises the issue that maybe it
> should always come from vmalloc() space.

I think the problem with ftrace is patching multiple instructions,
because it sounds like you'd need something to load the absolute address
in a register and then jump to that. And where it's relatively easy to
replace a single instruction, replacing multiple instructions gets real
tricky real quick.

Which then leads to you being stuck with that 26-bit displacement, IIRC.
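
Roughly the difference, with illustrative 32-bit ARM sequences in the
comment (not actual ftrace code):

/*
 * Single-instruction patching: one 'bl target' (24-bit immediate,
 * +/-32MB range) can be swapped atomically against a NOP.
 *
 * Reaching an arbitrary address instead needs something like:
 *
 *      movw    ip, #:lower16:target
 *      movt    ip, #:upper16:target
 *      blx     ip
 *
 * and another CPU can be executing in the middle of that sequence
 * while it is being rewritten, so it cannot simply be patched in
 * place the way a single instruction can.
 */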



Re: [PATCH v2 1/3] module: Rename module_alloc() to text_alloc() and move to kernel proper

2020-07-14 Thread Peter Zijlstra
On Tue, Jul 14, 2020 at 11:28:27AM +0100, Will Deacon wrote:

> As Ard says, module_alloc() _is_ special, in the sense that the virtual
> memory it allocates wants to be close to the kernel text, whereas the
> concept of allocating executable memory is broader and doesn't have these
> restrictions. So, while I'm in favour of having a text_alloc() interface
> that can be used by callers which only require an executable mapping, I'd
> much prefer for the module_alloc() code to remain for, err, modules.

So on x86 all those things (kprobes, bpf, ftrace) require that same
closeness.

An interface like the late vmalloc_exec() will simply not work for us.

We recently talked about arm64-kprobes and how you're not doing any of
the optimizations and fully rely on the exception return. And I see
you're one of the few archs that has bpf_jit_alloc_exec() (also,
shouldn't you be using PAGE_KERNEL_EXEC there?). But the BPF core seems
to use module_alloc() as a default means of allocating text.


So what should this look like? Have a text_alloc() with an argument that
indicates where? But then I suppose we also need a means to manage PLT
entries. Otherwise I don't exactly see how you're going to call BPF
code, or how that BPF stuff is going to call back into its helpers.




Re: [RFC PATCH 5/7] lazy tlb: introduce lazy mm refcount helper functions

2020-07-10 Thread Peter Zijlstra
On Fri, Jul 10, 2020 at 11:56:44AM +1000, Nicholas Piggin wrote:

> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 73199470c265..ad95812d2a3f 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -1253,7 +1253,7 @@ void start_secondary(void *unused)
>   unsigned int cpu = smp_processor_id();
>   struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
>  
> - mmgrab(_mm);
> + mmgrab(_mm); /* XXX: where is the mmput for this? */
>   current->active_mm = _mm;
>  
>   smp_store_cpu_info(cpu);

Right; so IIRC it should be this one:

> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 134688d79589..ff9fcbc4e76b 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -578,7 +578,7 @@ static int finish_cpu(unsigned int cpu)
>*/
>   if (mm != _mm)
>   idle->active_mm = _mm;
> - mmdrop(mm);
> + mmdrop_lazy_tlb(mm);
>   return 0;
>  }



Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-10 Thread Peter Zijlstra
On Fri, Jul 10, 2020 at 11:56:43AM +1000, Nicholas Piggin wrote:
> And get rid of the generic sync_core_before_usermode facility.
> 
> This helper is the wrong way around I think. The idea that membarrier
> state requires a core sync before returning to user is the easy one
> that does not need hiding behind membarrier calls. The gap in core
> synchronization due to x86's sysret/sysexit and lazy tlb mode, is the
> tricky detail that is better put in x86 lazy tlb code.
> 
> Consider if an arch did not synchronize core in switch_mm either, then
> membarrier_mm_sync_core_before_usermode would be in the wrong place
> but arch specific mmu context functions would still be the right place.
> There is also an exit_lazy_tlb case that is not covered by this call, which
> could be a bug (a kthread uses the membarrier process's mm, then context
> switches back to the process without switching mm or a lazy mm switch).
> 
> This makes lazy tlb code a bit more modular.

Hurmph, I know I've been staring at this at some point. I think I meant
to have a TIF to force the IRET path in the case of MEMBAR_SYNC_CORE.
But I was discouraged by amluto.



Re: [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-07-10 Thread Peter Zijlstra
On Fri, Jul 10, 2020 at 11:56:46AM +1000, Nicholas Piggin wrote:
> On big systems, the mm refcount can become highly contended when doing
> a lot of context switching with threaded applications (particularly
> switching between the idle thread and an application thread).
> 
> Abandoning lazy tlb slows switching down quite a bit in the important
> user->idle->user cases, so instead implement a non-refcounted scheme
> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
> any remaining lazy ones.
> 
> On a 16-socket 192-core POWER8 system, a context switching benchmark
> with as many software threads as CPUs (so each switch will go in and
> out of idle), upstream can achieve a rate of about 1 million context
> switches per second. After this patch it goes up to 118 million.

That's mighty impressive, however:

> +static void shoot_lazy_tlbs(struct mm_struct *mm)
> +{
> + if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) {
> + smp_call_function_many(mm_cpumask(mm), do_shoot_lazy_tlb, (void 
> *)mm, 1);
> + do_shoot_lazy_tlb(mm);
> + }
> +}

IIRC you (power) never clear a CPU from that mask, so for other
workloads I can see this resulting in massive amounts of IPIs.

For instance, take as many processes as you have CPUs. For each,
manually walk the task across all CPUs and exit. Again.

Clearly, that's an extreme, but still...
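
In userspace that pattern is as simple as (sketch, not a real benchmark):

#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>

int main(void)
{
        long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
        cpu_set_t set;

        /* run on every CPU so each ends up in this mm's mm_cpumask */
        for (long cpu = 0; cpu < ncpus; cpu++) {
                CPU_ZERO(&set);
                CPU_SET(cpu, &set);
                sched_setaffinity(0, sizeof(set), &set);
                sched_yield();
        }

        /* exit: __mmdrop() now has to IPI all of them */
        return 0;
}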




Re: [PATCH v3 5/6] powerpc/pseries: implement paravirt qspinlocks for SPLPAR

2020-07-09 Thread Peter Zijlstra
On Thu, Jul 09, 2020 at 08:53:16PM +1000, Michael Ellerman wrote:
> Nicholas Piggin  writes:
> 
> > Signed-off-by: Nicholas Piggin 
> > ---
> >  arch/powerpc/include/asm/paravirt.h   | 28 
> >  arch/powerpc/include/asm/qspinlock.h  | 66 +++
> >  arch/powerpc/include/asm/qspinlock_paravirt.h |  7 ++
> >  arch/powerpc/platforms/pseries/Kconfig|  5 ++
> >  arch/powerpc/platforms/pseries/setup.c|  6 +-
> >  include/asm-generic/qspinlock.h   |  2 +
> 
> Another ack?

Acked-by: Peter Zijlstra (Intel) 


Re: [PATCH v3 4/6] powerpc/64s: implement queued spinlocks and rwlocks

2020-07-09 Thread Peter Zijlstra
On Thu, Jul 09, 2020 at 08:20:25PM +1000, Michael Ellerman wrote:
> Nicholas Piggin  writes:
> > These have shown significantly improved performance and fairness when
> > spinlock contention is moderate to high on very large systems.
> >
> >  [ Numbers hopefully forthcoming after more testing, but initial
> >results look good ]
> 
> Would be good to have something here, even if it's preliminary.
> 
> > Thanks to the fast path, single threaded performance is not noticeably
> > hurt.
> >
> > Signed-off-by: Nicholas Piggin 
> > ---
> >  arch/powerpc/Kconfig  | 13 
> >  arch/powerpc/include/asm/Kbuild   |  2 ++
> >  arch/powerpc/include/asm/qspinlock.h  | 25 +++
> >  arch/powerpc/include/asm/spinlock.h   |  5 +
> >  arch/powerpc/include/asm/spinlock_types.h |  5 +
> >  arch/powerpc/lib/Makefile |  3 +++
> 
> >  include/asm-generic/qspinlock.h   |  2 ++
> 
> Who's ack do we need for that part?

Mine I suppose would do; as discussed earlier, it probably isn't
required anymore, but I understand the paranoia of not wanting to change
too many things at once :-)


Acked-by: Peter Zijlstra (Intel) 


Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks

2020-07-09 Thread Peter Zijlstra
On Wed, Jul 08, 2020 at 07:54:34PM -0400, Waiman Long wrote:
> On 7/8/20 4:41 AM, Peter Zijlstra wrote:
> > On Tue, Jul 07, 2020 at 03:57:06PM +1000, Nicholas Piggin wrote:
> > > Yes, powerpc could certainly get more performance out of the slow
> > > paths, and then there are a few parameters to tune.
> > Can you clarify? The slow path is already in use on ARM64 which is weak,
> > so I doubt there's superfluous serialization present. And Will spent a
> > fair amount of time on making that thing guarantee forward progress, so
> > there just isn't too much room to play.
> > 
> > > We don't have a good alternate patching for function calls yet, but
> > > that would be something to do for native vs pv.
> > Going by your jump_label implementation, support for static_call should
> > be fairly straightforward too, no?
> > 
> >https://lkml.kernel.org/r/20200624153024.794671...@infradead.org
> > 
> Speaking of static_call, I am also looking forward to it. Do you have an
> idea when that will be merged?

0day had one crash on the last round; I think Steve sent a fix for that
last night and I'll go look at it.

That said, the last posting got 0 feedback, so either everybody is
really happy with it, or not interested. So let us know in the thread,
with some review feedback.

Once I get through enough of the inbox to actually find the fix and test
it, I'll also update the thread, and maybe threaten to merge it if
everybody stays silent :-)


Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks

2020-07-08 Thread Peter Zijlstra
On Tue, Jul 07, 2020 at 03:57:06PM +1000, Nicholas Piggin wrote:
> Yes, powerpc could certainly get more performance out of the slow
> paths, and then there are a few parameters to tune.

Can you clarify? The slow path is already in use on ARM64 which is weak,
so I doubt there's superfluous serialization present. And Will spent a
fair amount of time on making that thing guarantee forward progress, so
there just isn't too much room to play.

> We don't have a good alternate patching for function calls yet, but
> that would be something to do for native vs pv.

Going by your jump_label implementation, support for static_call should
be fairly straightforward too, no?

  https://lkml.kernel.org/r/20200624153024.794671...@infradead.org


Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks

2020-07-08 Thread Peter Zijlstra
On Tue, Jul 07, 2020 at 11:33:45PM -0400, Waiman Long wrote:
> From 5d7941a498935fb225b2c7a3108cbf590114c3db Mon Sep 17 00:00:00 2001
> From: Waiman Long 
> Date: Tue, 7 Jul 2020 22:29:16 -0400
> Subject: [PATCH 2/9] locking/pvqspinlock: Introduce
>  CONFIG_PARAVIRT_QSPINLOCKS_LITE
> 
> Add a new PARAVIRT_QSPINLOCKS_LITE config option that allows
> architectures to use the PV qspinlock code without the need to use or
> implement a pv_kick() function, thus eliminating the atomic unlock
> overhead. The non-atomic queued_spin_unlock() can be used instead.
> The pv_wait() function will still be needed, but it can be a dummy
> function.
> 
> With that option set, the hybrid PV queued/unfair locking code should
> still be able to make it performant enough in a paravirtualized

How is this supposed to work? If there is no kick, you have no control
over who wakes up and fairness goes out the window entirely.

You don't even begin to explain...


Re: [PATCH 2/8] powerpc/pseries: use smp_rmb() in H_CONFER spin yield

2020-07-02 Thread Peter Zijlstra
On Thu, Jul 02, 2020 at 05:48:33PM +1000, Nicholas Piggin wrote:
> There is no need for rmb(), this allows faster lwsync here.

Since you determined this, I'm thinking you actually understand the
ordering here. How about recording that understanding in a comment?

Also, should the lock->slock load not use READ_ONCE() ?
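
Something like this, purely as illustration (not the actual H_CONFER code;
holder_of() and confer_to() are made-up names):

  for (;;) {
          u32 val = READ_ONCE(lock->slock);     /* fresh load each pass */

          if (!val)                     /* lock became free */
                  break;

          /*
           * Order this load against the later load(s) of the holder's
           * yield count; loads vs loads only, so smp_rmb() (lwsync on
           * powerpc) suffices and a full rmb() (sync) is overkill.
           */
          smp_rmb();
          confer_to(holder_of(val));    /* hypothetical H_CONFER yield */
  }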



Re: [PATCH v4 7/8] lockdep: Change hardirq{s_enabled,_context} to per-cpu variables

2020-06-30 Thread Peter Zijlstra
On Tue, Jun 30, 2020 at 07:59:39AM +0200, Ahmed S. Darwish wrote:
> Peter Zijlstra wrote:
> 
> ...
> 
> > -#define lockdep_assert_irqs_disabled() do {\
> > -   WARN_ONCE(debug_locks && !current->lockdep_recursion && \
> > - current->hardirqs_enabled,\
> > - "IRQs not disabled as expected\n");   \
> > -   } while (0)
> 
> ...
> 
> > +#define lockdep_assert_irqs_disabled() \
> > +do {   
> > \
> > +   WARN_ON_ONCE(debug_locks && this_cpu_read(hardirqs_enabled));   \
> > +} while (0)
> 
> I think it would be nice to keep the "IRQs not disabled as expected"
> message. It makes the lockdep splat much more readable.
> 
> This is similarly the case for the v3 lockdep preemption macros:
> 
>   https://lkml.kernel.org/r/20200630054452.3675847-5-a.darw...@linutronix.de
> 
> I did not add a message though to get in-sync with the IRQ macros above.

Hurmph.. the file:line output of a splat is usually all I look at; also,
__WARN_printf() generates such atrocious crap code that I try not to use
it.

I suppose I should do a __WARN_str() or something, but then people are
unlikely to want to use that, too much variation etc. :/

Cursed if you do, cursed if you don't.


Re: [PATCH v4 7/8] lockdep: Change hardirq{s_enabled,_context} to per-cpu variables

2020-06-24 Thread Peter Zijlstra
On Wed, Jun 24, 2020 at 01:32:46PM +0200, Marco Elver wrote:
> From: Marco Elver 
> Date: Wed, 24 Jun 2020 11:23:22 +0200
> Subject: [PATCH] kcsan: Make KCSAN compatible with new IRQ state tracking
> 
> The new IRQ state tracking code does not honor lockdep_off(), and as
> such we should again permit tracing by using non-raw functions in
> core.c. Update the lockdep_off() comment in report.c, to reflect the
> fact there is still a potential risk of deadlock due to using printk()
> from scheduler code.
> 
> Suggested-by: Peter Zijlstra (Intel) 
> Signed-off-by: Marco Elver 

Thanks!

I've put this in front of the series at hand. I'll wait a little while
longer for arch people to give feedback on their header patches before I
stuff the lot into tip/locking/core.


Re: [PATCH v4 7/8] lockdep: Change hardirq{s_enabled,_context} to per-cpu variables

2020-06-24 Thread Peter Zijlstra
On Tue, Jun 23, 2020 at 10:24:04PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 23, 2020 at 08:12:32PM +0200, Peter Zijlstra wrote:
> > Fair enough; I'll rip it all up and boot a KCSAN kernel, see what if
> > anything happens.
> 
> OK, so the below patch doesn't seem to have any nasty recursion issues
> here. The only 'problem' is that lockdep now sees report_lock can cause
> deadlocks.
> 
> It is completely right about it too, but I don't suspect there's much we
> can do about it, it's pretty much the standard printk() with scheduler
> locks held report.

So I've been getting tons and tons of this:

[   60.471348] 
==
[   60.479427] BUG: KCSAN: data-race in __rcu_read_lock / __rcu_read_unlock
[   60.486909]
[   60.488572] write (marked) to 0x88840fff1cf0 of 4 bytes by interrupt on 
cpu 1:
[   60.497026]  __rcu_read_lock+0x37/0x60
[   60.501214]  cpuacct_account_field+0x1b/0x170
[   60.506081]  task_group_account_field+0x32/0x160
[   60.511238]  account_system_time+0xe6/0x110
[   60.515912]  update_process_times+0x1d/0xd0
[   60.520585]  tick_sched_timer+0xfc/0x180
[   60.524967]  __hrtimer_run_queues+0x271/0x440
[   60.529832]  hrtimer_interrupt+0x222/0x670
[   60.534409]  __sysvec_apic_timer_interrupt+0xb3/0x1a0
[   60.540052]  asm_call_on_stack+0x12/0x20
[   60.544434]  sysvec_apic_timer_interrupt+0xba/0x130
[   60.549882]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[   60.555621]  delay_tsc+0x7d/0xe0
[   60.559226]  kcsan_setup_watchpoint+0x292/0x4e0
[   60.564284]  __rcu_read_unlock+0x73/0x2c0
[   60.568763]  __unlock_page_memcg+0xda/0xf0
[   60.573338]  unlock_page_memcg+0x32/0x40
[   60.577721]  page_remove_rmap+0x5c/0x200
[   60.582104]  unmap_page_range+0x83c/0xc10
[   60.586582]  unmap_single_vma+0xb0/0x150
[   60.590963]  unmap_vmas+0x81/0xe0
[   60.594663]  exit_mmap+0x135/0x2b0
[   60.598464]  __mmput+0x21/0x150
[   60.601970]  mmput+0x2a/0x30
[   60.605176]  exit_mm+0x2fc/0x350
[   60.608780]  do_exit+0x372/0xff0
[   60.612385]  do_group_exit+0x139/0x140
[   60.616571]  __do_sys_exit_group+0xb/0x10
[   60.621048]  __se_sys_exit_group+0xa/0x10
[   60.625524]  __x64_sys_exit_group+0x1b/0x20
[   60.630189]  do_syscall_64+0x6c/0xe0
[   60.634182]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   60.639820]
[   60.641485] read to 0x88840fff1cf0 of 4 bytes by task 2430 on cpu 1:
[   60.648969]  __rcu_read_unlock+0x73/0x2c0
[   60.653446]  __unlock_page_memcg+0xda/0xf0
[   60.658019]  unlock_page_memcg+0x32/0x40
[   60.662400]  page_remove_rmap+0x5c/0x200
[   60.666782]  unmap_page_range+0x83c/0xc10
[   60.671259]  unmap_single_vma+0xb0/0x150
[   60.675641]  unmap_vmas+0x81/0xe0
[   60.679341]  exit_mmap+0x135/0x2b0
[   60.683141]  __mmput+0x21/0x150
[   60.686647]  mmput+0x2a/0x30
[   60.689853]  exit_mm+0x2fc/0x350
[   60.693458]  do_exit+0x372/0xff0
[   60.697062]  do_group_exit+0x139/0x140
[   60.701248]  __do_sys_exit_group+0xb/0x10
[   60.705724]  __se_sys_exit_group+0xa/0x10
[   60.710201]  __x64_sys_exit_group+0x1b/0x20
[   60.714872]  do_syscall_64+0x6c/0xe0
[   60.718864]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   60.724503]
[   60.726156] Reported by Kernel Concurrency Sanitizer on:
[   60.732089] CPU: 1 PID: 2430 Comm: sshd Not tainted 
5.8.0-rc2-00186-gb4ee11fe08b3-dirty #303
[   60.741510] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS 
SE5C600.86B.02.02.0002.122320131210 12/23/2013
[   60.752957] 
==

And I figured a quick way to get rid of that would be something like the
below, seeing how volatile gets auto annotated... but that doesn't seem
to actually work.

What am I missing?



diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 352223664ebd..b08861118e1a 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -351,17 +351,17 @@ static int rcu_preempt_blocked_readers_cgp(struct 
rcu_node *rnp)
 
 static void rcu_preempt_read_enter(void)
 {
-   current->rcu_read_lock_nesting++;
+   (*(volatile int *)>rcu_read_lock_nesting)++;
 }
 
 static int rcu_preempt_read_exit(void)
 {
-   return --current->rcu_read_lock_nesting;
+   return --(*(volatile int *)>rcu_read_lock_nesting);
 }
 
 static void rcu_preempt_depth_set(int val)
 {
-   current->rcu_read_lock_nesting = val;
+   WRITE_ONCE(current->rcu_read_lock_nesting, val);
 }
 
 /*



Re: [PATCH v4 7/8] lockdep: Change hardirq{s_enabled,_context} to per-cpu variables

2020-06-23 Thread Peter Zijlstra
On Tue, Jun 23, 2020 at 10:24:04PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 23, 2020 at 08:12:32PM +0200, Peter Zijlstra wrote:
> > Fair enough; I'll rip it all up and boot a KCSAN kernel, see what if
> > anything happens.
> 
> OK, so the below patch doesn't seem to have any nasty recursion issues
> here. The only 'problem' is that lockdep now sees report_lock can cause
> deadlocks.
> 
> It is completely right about it too, but I don't suspect there's much we
> can do about it, it's pretty much the standard printk() with scheduler
> locks held report.

Just for giggles I added the below and that works fine too. Right until
the report_lock deadlock splat of course, thereafter lockdep is
disabled.

diff --git a/kernel/kcsan/report.c b/kernel/kcsan/report.c
index ac5f8345bae9..a011cf0a1611 100644
--- a/kernel/kcsan/report.c
+++ b/kernel/kcsan/report.c
@@ -459,6 +459,8 @@ static void set_other_info_task_blocking(unsigned long 
*flags,
 */
int timeout = max(kcsan_udelay_task, kcsan_udelay_interrupt);

+   lockdep_assert_held(_lock);
+
other_info->task = current;
do {
if (is_running) {
@@ -495,6 +497,8 @@ static void set_other_info_task_blocking(unsigned long 
*flags,
 other_info->task == current);
if (is_running)
set_current_state(TASK_RUNNING);
+
+   lockdep_assert_held(_lock);
 }

 /* Populate @other_info; requires that the provided @other_info not in use. */


Re: [PATCH v4 7/8] lockdep: Change hardirq{s_enabled,_context} to per-cpu variables

2020-06-23 Thread Peter Zijlstra
On Tue, Jun 23, 2020 at 08:12:32PM +0200, Peter Zijlstra wrote:
> Fair enough; I'll rip it all up and boot a KCSAN kernel, see what if
> anything happens.

OK, so the below patch doesn't seem to have any nasty recursion issues
here. The only 'problem' is that lockdep now sees report_lock can cause
deadlocks.

It is completely right about it too, but I don't suspect there's much we
can do about it, it's pretty much the standard printk() with scheduler
locks held report.

---
diff --git a/kernel/kcsan/core.c b/kernel/kcsan/core.c
index 15f67949d11e..732623c30359 100644
--- a/kernel/kcsan/core.c
+++ b/kernel/kcsan/core.c
@@ -397,8 +397,7 @@ kcsan_setup_watchpoint(const volatile void *ptr, size_t 
size, int type)
}
 
if (!kcsan_interrupt_watcher)
-   /* Use raw to avoid lockdep recursion via IRQ flags tracing. */
-   raw_local_irq_save(irq_flags);
+   local_irq_save(irq_flags);
 
watchpoint = insert_watchpoint((unsigned long)ptr, size, is_write);
if (watchpoint == NULL) {
@@ -539,7 +538,7 @@ kcsan_setup_watchpoint(const volatile void *ptr, size_t 
size, int type)
kcsan_counter_dec(KCSAN_COUNTER_USED_WATCHPOINTS);
 out_unlock:
if (!kcsan_interrupt_watcher)
-   raw_local_irq_restore(irq_flags);
+   local_irq_restore(irq_flags);
 out:
user_access_restore(ua_flags);
 }
diff --git a/kernel/kcsan/report.c b/kernel/kcsan/report.c
index ac5f8345bae9..ef31c1d2dac3 100644
--- a/kernel/kcsan/report.c
+++ b/kernel/kcsan/report.c
@@ -605,14 +605,6 @@ void kcsan_report(const volatile void *ptr, size_t size, 
int access_type,
if (WARN_ON(watchpoint_idx < 0 || watchpoint_idx >= 
ARRAY_SIZE(other_infos)))
goto out;
 
-   /*
-* With TRACE_IRQFLAGS, lockdep's IRQ trace state becomes corrupted if
-* we do not turn off lockdep here; this could happen due to recursion
-* into lockdep via KCSAN if we detect a race in utilities used by
-* lockdep.
-*/
-   lockdep_off();
-
if (prepare_report(, type, , other_info)) {
/*
 * Never report if value_change is FALSE, only if we it is
@@ -628,7 +620,6 @@ void kcsan_report(const volatile void *ptr, size_t size, 
int access_type,
release_report(, other_info);
}
 
-   lockdep_on();
 out:
kcsan_enable_current();
 }



Re: [PATCH v4 7/8] lockdep: Change hardirq{s_enabled,_context} to per-cpu variables

2020-06-23 Thread Peter Zijlstra
On Tue, Jun 23, 2020 at 09:13:35PM +0200, Marco Elver wrote:
> I see the below report when I boot with your branch + KCSAN and
> PROVE_LOCKING. config attached. Trying to make sense of what's
> happening.

Ah, I was still playing with tip/master + PROVE_LOCKING + KCSAN and
slowly removing parts of that annotation patch to see what would come
unstuck.

I think I just hit a genuine but unavoidable lockdep report on
report_lock.

> -- >8 --
> 
> [   10.182354] [ cut here ]
> [   10.183058] WARNING: CPU: 7 PID: 136 at kernel/locking/lockdep.c:398 
> lockdep_hardirqs_on_prepare+0x1c6/0x270
> [   10.184347] Modules linked in:
> [   10.184771] CPU: 7 PID: 136 Comm: systemd-journal Not tainted 5.8.0-rc1+ #3
> [   10.185706] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.13.0-1 04/01/2014
> [   10.186821] RIP: 0010:lockdep_hardirqs_on_prepare+0x1c6/0x270
> [   10.187594] Code: 75 28 65 48 8b 04 25 28 00 00 00 48 3b 44 24 08 0f 85 b9 
> 00 00 00 48 83 c4 10 5b 41 5e 41 5f c3 65 48 ff 05 d4 24 4e 75 eb d8 <0f> 0b 
> 90 41 c7 86 c4 08 00 00 00 00 00 00 eb c8 e8 65 09 71 01 85
> [   10.190203] RSP: 0018:a7ee802b7848 EFLAGS: 00010017
> [   10.190989] RAX: 0001 RBX: 955e92a34ab0 RCX: 
> 0001
> [   10.192053] RDX: 0006 RSI: 955e92a34a88 RDI: 
> 955e92a341c0
> [   10.193117] RBP: a7ee802b7be8 R08:  R09: 
> 
> [   10.194186] R10:  R11: 8d07e268 R12: 
> 0001
> [   10.195249] R13: 8e41bb10 R14: 955e92a341c0 R15: 
> 0001
> [   10.196312] FS:  7fd6862aa8c0() GS:955e9fd8() 
> knlGS:
> [   10.197513] CS:  0010 DS:  ES:  CR0: 80050033
> [   10.198373] CR2: 7fd6837dd000 CR3: 000812acc001 CR4: 
> 00760ee0
> [   10.199436] DR0:  DR1:  DR2: 
> 
> [   10.200494] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [   10.201554] PKRU: 5554
> [   10.201967] Call Trace:
> [   10.202348]  ? _raw_spin_unlock_irqrestore+0x40/0x70
> [   10.203093]  trace_hardirqs_on+0x56/0x60   <- enter 
> IRQ flags tracing code?
> [   10.203686]  _raw_spin_unlock_irqrestore+0x40/0x70 <- 
> take report_lock
> [   10.204406]  prepare_report+0x11f/0x150
> [   10.204986]  kcsan_report+0xca/0x6c0   <- 
> generating a KCSAN report
> [   10.212669]  kcsan_found_watchpoint+0xe5/0x110

That appears to be warning about a lockdep_recursion underflow, weird.
I'll go stare at it.




Re: [PATCH v4 7/8] lockdep: Change hardirq{s_enabled,_context} to per-cpu variables

2020-06-23 Thread Peter Zijlstra
On Tue, Jun 23, 2020 at 07:59:57PM +0200, Marco Elver wrote:
> On Tue, Jun 23, 2020 at 06:37PM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 23, 2020 at 06:13:21PM +0200, Ahmed S. Darwish wrote:
> > > Well, freshly merged code is using it. For example, KCSAN:
> > > 
> > > => f1bc96210c6a ("kcsan: Make KCSAN compatible with lockdep")
> > > => kernel/kcsan/report.c:
> > > 
> > > void kcsan_report(...)
> > > {
> > >   ...
> > > /*
> > >  * With TRACE_IRQFLAGS, lockdep's IRQ trace state becomes 
> > > corrupted if
> > >  * we do not turn off lockdep here; this could happen due to 
> > > recursion
> > >  * into lockdep via KCSAN if we detect a race in utilities used by
> > >  * lockdep.
> > >  */
> > > lockdep_off();
> > >   ...
> > > }
> > 
> > Marco, do you remember what exactly happened there? Because I'm about to
> > wreck that. That is, I'm going to make TRACE_IRQFLAGS ignore
> > lockdep_off().
> 
> Yeah, I was trying to squash any kind of recursion:
> 
>   lockdep -> other libs ->
>   -> KCSAN
>   -> print report
>   -> dump stack, printk and friends
>   -> lockdep -> other libs
>   -> KCSAN ...
> 
> Some history:
> 
> * Initial patch to fix:
>   https://lore.kernel.org/lkml/20200115162512.70807-1-el...@google.com/

That patch is weird; just :=n on lockdep.c should've cured that, the
rest is massive overkill.

> * KCSAN+lockdep+ftrace:
>   https://lore.kernel.org/lkml/20200214211035.209972-1-el...@google.com/

That doesn't really have anything useful..

> lockdep now has KCSAN_SANITIZE := n, but we still need to ensure that
> there are no paths out of lockdep, or the IRQ flags tracing code, that
> might lead through other libs, through KCSAN, libs used to generate a
> report, and back to lockdep.
> 
> I never quite figured out the exact trace that led to corruption, but
> avoiding any kind of potential for recursion was the only thing that
> would avoid the check_flags() warnings.

Fair enough; I'll rip it all up and boot a KCSAN kernel, see what if
anything happens.


Re: [PATCH v4 7/8] lockdep: Change hardirq{s_enabled,_context} to per-cpu variables

2020-06-23 Thread Peter Zijlstra
On Tue, Jun 23, 2020 at 06:13:21PM +0200, Ahmed S. Darwish wrote:
> Well, freshly merged code is using it. For example, KCSAN:
> 
> => f1bc96210c6a ("kcsan: Make KCSAN compatible with lockdep")
> => kernel/kcsan/report.c:
> 
> void kcsan_report(...)
> {
>   ...
> /*
>  * With TRACE_IRQFLAGS, lockdep's IRQ trace state becomes corrupted if
>  * we do not turn off lockdep here; this could happen due to recursion
>  * into lockdep via KCSAN if we detect a race in utilities used by
>  * lockdep.
>  */
> lockdep_off();
>   ...
> }

Marco, do you remember what exactly happened there? Because I'm about to
wreck that. That is, I'm going to make TRACE_IRQFLAGS ignore
lockdep_off().


Re: [PATCH v4 7/8] lockdep: Change hardirq{s_enabled,_context} to per-cpu variables

2020-06-23 Thread Peter Zijlstra
On Tue, Jun 23, 2020 at 05:00:31PM +0200, Ahmed S. Darwish wrote:
> On Tue, Jun 23, 2020 at 10:36:52AM +0200, Peter Zijlstra wrote:
> ...
> > -#define lockdep_assert_irqs_disabled() do {
> > \
> > -   WARN_ONCE(debug_locks && !current->lockdep_recursion && \
> > - current->hardirqs_enabled,\
> > - "IRQs not disabled as expected\n");   \
> > -   } while (0)
> > +#define lockdep_assert_irqs_enabled()  
> > \
> > +do {   
> > \
> > +   WARN_ON_ONCE(debug_locks && !this_cpu_read(hardirqs_enabled));  \
> > +} while (0)
> >
> 
> Can we add a small comment on top of lockdep_off(), stating that lockdep
> IRQ tracking will still be kept after a lockdep_off call?

That would only legitimize lockdep_off(). The only comment I want to put
on that is: "if you use this, you're doing it wrong".


[PATCH v4 6/8] arm: Break cyclic percpu include

2020-06-23 Thread Peter Zijlstra
In order to use <asm/percpu.h> in irqflags.h, we need to make sure
asm/percpu.h does not itself depend on irqflags.h.

Signed-off-by: Peter Zijlstra (Intel) 
---
 arch/arm/include/asm/percpu.h |2 ++
 1 file changed, 2 insertions(+)

--- a/arch/arm/include/asm/percpu.h
+++ b/arch/arm/include/asm/percpu.h
@@ -10,6 +10,8 @@
  * in the TPIDRPRW. TPIDRPRW only exists on V6K and V7
  */
 #if defined(CONFIG_SMP) && !defined(CONFIG_CPU_V6)
+register unsigned long current_stack_pointer asm ("sp");
+
 static inline void set_my_cpu_offset(unsigned long off)
 {
/* Set TPIDRPRW */




[PATCH v4 5/8] s390: Break cyclic percpu include

2020-06-23 Thread Peter Zijlstra
In order to use <asm/percpu.h> in irqflags.h, we need to make sure
asm/percpu.h does not itself depend on irqflags.h

Signed-off-by: Peter Zijlstra (Intel) 
---
 arch/s390/include/asm/smp.h |1 +
 arch/s390/include/asm/thread_info.h |1 -
 2 files changed, 1 insertion(+), 1 deletion(-)

--- a/arch/s390/include/asm/smp.h
+++ b/arch/s390/include/asm/smp.h
@@ -10,6 +10,7 @@
 
 #include 
 #include 
+#include 
 
 #define raw_smp_processor_id() (S390_lowcore.cpu_nr)
 
--- a/arch/s390/include/asm/thread_info.h
+++ b/arch/s390/include/asm/thread_info.h
@@ -24,7 +24,6 @@
 #ifndef __ASSEMBLY__
 #include 
 #include 
-#include 
 
 #define STACK_INIT_OFFSET \
(THREAD_SIZE - STACK_FRAME_OVERHEAD - sizeof(struct pt_regs))




[PATCH v4 4/8] powerpc64: Break asm/percpu.h vs spinlock_types.h dependency

2020-06-23 Thread Peter Zijlstra
In order to use <asm/percpu.h> in lockdep.h, we need to make sure
asm/percpu.h does not itself depend on lockdep.

The below seems to make that so and builds powerpc64-defconfig +
PROVE_LOCKING.

Signed-off-by: Peter Zijlstra (Intel) 
---
 arch/powerpc/include/asm/dtl.h |   52 +
 arch/powerpc/include/asm/lppaca.h  |   44 ---
 arch/powerpc/include/asm/paca.h|2 -
 arch/powerpc/kernel/time.c |2 +
 arch/powerpc/kvm/book3s_hv.c   |1 
 arch/powerpc/platforms/pseries/dtl.c   |1 
 arch/powerpc/platforms/pseries/lpar.c  |1 
 arch/powerpc/platforms/pseries/setup.c |1 
 arch/powerpc/platforms/pseries/svm.c   |1 
 9 files changed, 60 insertions(+), 45 deletions(-)

--- /dev/null
+++ b/arch/powerpc/include/asm/dtl.h
@@ -0,0 +1,52 @@
+#ifndef _ASM_POWERPC_DTL_H
+#define _ASM_POWERPC_DTL_H
+
+#include 
+#include 
+
+/*
+ * Layout of entries in the hypervisor's dispatch trace log buffer.
+ */
+struct dtl_entry {
+   u8  dispatch_reason;
+   u8  preempt_reason;
+   __be16  processor_id;
+   __be32  enqueue_to_dispatch_time;
+   __be32  ready_to_enqueue_time;
+   __be32  waiting_to_ready_time;
+   __be64  timebase;
+   __be64  fault_addr;
+   __be64  srr0;
+   __be64  srr1;
+};
+
+#define DISPATCH_LOG_BYTES 4096/* bytes per cpu */
+#define N_DISPATCH_LOG (DISPATCH_LOG_BYTES / sizeof(struct dtl_entry))
+
+/*
+ * Dispatch trace log event enable mask:
+ *   0x1: voluntary virtual processor waits
+ *   0x2: time-slice preempts
+ *   0x4: virtual partition memory page faults
+ */
+#define DTL_LOG_CEDE   0x1
+#define DTL_LOG_PREEMPT0x2
+#define DTL_LOG_FAULT  0x4
+#define DTL_LOG_ALL(DTL_LOG_CEDE | DTL_LOG_PREEMPT | DTL_LOG_FAULT)
+
+extern struct kmem_cache *dtl_cache;
+extern rwlock_t dtl_access_lock;
+
+/*
+ * When CONFIG_VIRT_CPU_ACCOUNTING_NATIVE = y, the cpu accounting code controls
+ * reading from the dispatch trace log.  If other code wants to consume
+ * DTL entries, it can set this pointer to a function that will get
+ * called once for each DTL entry that gets processed.
+ */
+extern void (*dtl_consumer)(struct dtl_entry *entry, u64 index);
+
+extern void register_dtl_buffer(int cpu);
+extern void alloc_dtl_buffers(unsigned long *time_limit);
+extern long hcall_vphn(unsigned long cpu, u64 flags, __be32 *associativity);
+
+#endif /* _ASM_POWERPC_DTL_H */
--- a/arch/powerpc/include/asm/lppaca.h
+++ b/arch/powerpc/include/asm/lppaca.h
@@ -42,7 +42,6 @@
  */
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -146,49 +145,6 @@ struct slb_shadow {
} save_area[SLB_NUM_BOLTED];
 } cacheline_aligned;
 
-/*
- * Layout of entries in the hypervisor's dispatch trace log buffer.
- */
-struct dtl_entry {
-   u8  dispatch_reason;
-   u8  preempt_reason;
-   __be16  processor_id;
-   __be32  enqueue_to_dispatch_time;
-   __be32  ready_to_enqueue_time;
-   __be32  waiting_to_ready_time;
-   __be64  timebase;
-   __be64  fault_addr;
-   __be64  srr0;
-   __be64  srr1;
-};
-
-#define DISPATCH_LOG_BYTES 4096/* bytes per cpu */
-#define N_DISPATCH_LOG (DISPATCH_LOG_BYTES / sizeof(struct dtl_entry))
-
-/*
- * Dispatch trace log event enable mask:
- *   0x1: voluntary virtual processor waits
- *   0x2: time-slice preempts
- *   0x4: virtual partition memory page faults
- */
-#define DTL_LOG_CEDE   0x1
-#define DTL_LOG_PREEMPT0x2
-#define DTL_LOG_FAULT  0x4
-#define DTL_LOG_ALL(DTL_LOG_CEDE | DTL_LOG_PREEMPT | DTL_LOG_FAULT)
-
-extern struct kmem_cache *dtl_cache;
-extern rwlock_t dtl_access_lock;
-
-/*
- * When CONFIG_VIRT_CPU_ACCOUNTING_NATIVE = y, the cpu accounting code controls
- * reading from the dispatch trace log.  If other code wants to consume
- * DTL entries, it can set this pointer to a function that will get
- * called once for each DTL entry that gets processed.
- */
-extern void (*dtl_consumer)(struct dtl_entry *entry, u64 index);
-
-extern void register_dtl_buffer(int cpu);
-extern void alloc_dtl_buffers(unsigned long *time_limit);
 extern long hcall_vphn(unsigned long cpu, u64 flags, __be32 *associativity);
 
 #endif /* CONFIG_PPC_BOOK3S */
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -29,7 +29,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #include 
 
@@ -53,6 +52,7 @@ extern unsigned int debug_smp_processor_
 #define get_slb_shadow()   (get_paca()->slb_shadow_ptr)
 
 struct task_struct;
+struct rtas_args;
 
 /*
  * Defines the layout of the paca.
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -183,6 +183,8 @@ static inline unsigned long read_spurr(u
 
 #ifdef CONFIG_PPC_SPLPAR
 
+#include 
+
 /*
  * Scan the dispatch trace log and count up the stolen time.
  * Should be called with interrupts disab

[PATCH v4 8/8] lockdep: Remove lockdep_hardirq{s_enabled, _context}() argument

2020-06-23 Thread Peter Zijlstra
Now that the macros use per-cpu data, we no longer need the argument.

Signed-off-by: Peter Zijlstra (Intel) 
---
 arch/x86/entry/common.c|2 +-
 include/linux/irqflags.h   |8 
 include/linux/lockdep.h|2 +-
 kernel/locking/lockdep.c   |   30 +++---
 kernel/softirq.c   |2 +-
 tools/include/linux/irqflags.h |4 ++--
 6 files changed, 24 insertions(+), 24 deletions(-)

--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -689,7 +689,7 @@ noinstr void idtentry_exit_user(struct p
 
 noinstr bool idtentry_enter_nmi(struct pt_regs *regs)
 {
-   bool irq_state = lockdep_hardirqs_enabled(current);
+   bool irq_state = lockdep_hardirqs_enabled();
 
__nmi_enter();
lockdep_hardirqs_off(CALLER_ADDR0);
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -40,9 +40,9 @@ DECLARE_PER_CPU(int, hardirq_context);
   extern void trace_hardirqs_off_finish(void);
   extern void trace_hardirqs_on(void);
   extern void trace_hardirqs_off(void);
-# define lockdep_hardirq_context(p)(this_cpu_read(hardirq_context))
+# define lockdep_hardirq_context() (this_cpu_read(hardirq_context))
 # define lockdep_softirq_context(p)((p)->softirq_context)
-# define lockdep_hardirqs_enabled(p)   (this_cpu_read(hardirqs_enabled))
+# define lockdep_hardirqs_enabled()(this_cpu_read(hardirqs_enabled))
 # define lockdep_softirqs_enabled(p)   ((p)->softirqs_enabled)
 # define lockdep_hardirq_enter()   \
 do {   \
@@ -109,9 +109,9 @@ do {\
 # define trace_hardirqs_off_finish()   do { } while (0)
 # define trace_hardirqs_on()   do { } while (0)
 # define trace_hardirqs_off()  do { } while (0)
-# define lockdep_hardirq_context(p)0
+# define lockdep_hardirq_context() 0
 # define lockdep_softirq_context(p)0
-# define lockdep_hardirqs_enabled(p)   0
+# define lockdep_hardirqs_enabled()0
 # define lockdep_softirqs_enabled(p)   0
 # define lockdep_hardirq_enter()   do { } while (0)
 # define lockdep_hardirq_threaded()do { } while (0)
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -736,7 +736,7 @@ do {
\
 
 # define lockdep_assert_RT_in_threaded_ctx() do {  \
WARN_ONCE(debug_locks && !current->lockdep_recursion && \
- lockdep_hardirq_context(current) &&   \
+ lockdep_hardirq_context() &&  \
  !(current->hardirq_threaded || current->irq_config),  
\
  "Not in threaded context on PREEMPT_RT as 
expected\n");   \
 } while (0)
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2062,9 +2062,9 @@ print_bad_irq_dependency(struct task_str
pr_warn("-\n");
pr_warn("%s/%d [HC%u[%lu]:SC%u[%lu]:HE%u:SE%u] is trying to acquire:\n",
curr->comm, task_pid_nr(curr),
-   lockdep_hardirq_context(curr), hardirq_count() >> HARDIRQ_SHIFT,
+   lockdep_hardirq_context(), hardirq_count() >> HARDIRQ_SHIFT,
curr->softirq_context, softirq_count() >> SOFTIRQ_SHIFT,
-   lockdep_hardirqs_enabled(curr),
+   lockdep_hardirqs_enabled(),
curr->softirqs_enabled);
print_lock(next);
 
@@ -3331,9 +3331,9 @@ print_usage_bug(struct task_struct *curr
 
pr_warn("%s/%d [HC%u[%lu]:SC%u[%lu]:HE%u:SE%u] takes:\n",
curr->comm, task_pid_nr(curr),
-   lockdep_hardirq_context(curr), hardirq_count() >> HARDIRQ_SHIFT,
+   lockdep_hardirq_context(), hardirq_count() >> HARDIRQ_SHIFT,
lockdep_softirq_context(curr), softirq_count() >> SOFTIRQ_SHIFT,
-   lockdep_hardirqs_enabled(curr),
+   lockdep_hardirqs_enabled(),
lockdep_softirqs_enabled(curr));
print_lock(this);
 
@@ -3658,7 +3658,7 @@ void lockdep_hardirqs_on_prepare(unsigne
if (unlikely(current->lockdep_recursion & LOCKDEP_RECURSION_MASK))
return;
 
-   if (unlikely(lockdep_hardirqs_enabled(current))) {
+   if (unlikely(lockdep_hardirqs_enabled())) {
/*
 * Neither irq nor preemption are disabled here
 * so this is racy by nature but losing one hit
@@ -3686,7 +3686,7 @@ void lockdep_hardirqs_on_prepare(unsigne
 * Can't allow enabling interrupts while in an interrupt handler,
 * that's general bad form and such. Recursion, limited stack etc..
 */
-   if (DEBUG_LOCKS_WA

[PATCH v4 2/8] x86/entry: Fix NMI vs IRQ state tracking

2020-06-23 Thread Peter Zijlstra
While the nmi_enter() users did
trace_hardirqs_{off_prepare,on_finish}(), there were no matching
lockdep_hardirqs_*() calls to complete the picture.

Introduce idtentry_{enter,exit}_nmi() to enable proper IRQ state
tracking across the NMIs.

Signed-off-by: Peter Zijlstra (Intel) 
---
 arch/x86/entry/common.c |   42 
 arch/x86/include/asm/idtentry.h |3 ++
 arch/x86/kernel/nmi.c   |9 +++-
 arch/x86/kernel/traps.c |   17 +---
 include/linux/hardirq.h |   28 ++
 5 files changed, 70 insertions(+), 29 deletions(-)

--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -550,7 +550,7 @@ SYSCALL_DEFINE0(ni_syscall)
  * The return value must be fed into the rcu_exit argument of
  * idtentry_exit_cond_rcu().
  */
-bool noinstr idtentry_enter_cond_rcu(struct pt_regs *regs)
+noinstr bool idtentry_enter_cond_rcu(struct pt_regs *regs)
 {
if (user_mode(regs)) {
enter_from_user_mode();
@@ -640,7 +640,7 @@ static void idtentry_exit_cond_resched(s
  * Counterpart to idtentry_enter_cond_rcu(). The return value of the entry
  * function must be fed into the @rcu_exit argument.
  */
-void noinstr idtentry_exit_cond_rcu(struct pt_regs *regs, bool rcu_exit)
+noinstr void idtentry_exit_cond_rcu(struct pt_regs *regs, bool rcu_exit)
 {
lockdep_assert_irqs_disabled();
 
@@ -684,7 +684,7 @@ void noinstr idtentry_exit_cond_rcu(stru
  * Invokes enter_from_user_mode() to establish the proper context for
  * NOHZ_FULL. Otherwise scheduling on exit would not be possible.
  */
-void noinstr idtentry_enter_user(struct pt_regs *regs)
+noinstr void idtentry_enter_user(struct pt_regs *regs)
 {
enter_from_user_mode();
 }
@@ -701,13 +701,47 @@ void noinstr idtentry_enter_user(struct
  *
  * Counterpart to idtentry_enter_user().
  */
-void noinstr idtentry_exit_user(struct pt_regs *regs)
+noinstr void idtentry_exit_user(struct pt_regs *regs)
 {
lockdep_assert_irqs_disabled();
 
prepare_exit_to_usermode(regs);
 }
 
+noinstr bool idtentry_enter_nmi(struct pt_regs *regs)
+{
+   bool irq_state = lockdep_hardirqs_enabled(current);
+
+   __nmi_enter();
+   lockdep_hardirqs_off(CALLER_ADDR0);
+   lockdep_hardirq_enter();
+   rcu_nmi_enter();
+
+   instrumentation_begin();
+   trace_hardirqs_off_finish();
+   ftrace_nmi_enter();
+   instrumentation_end();
+
+   return irq_state;
+}
+
+noinstr void idtentry_exit_nmi(struct pt_regs *regs, bool restore)
+{
+   instrumentation_begin();
+   ftrace_nmi_exit();
+   if (restore) {
+   trace_hardirqs_on_prepare();
+   lockdep_hardirqs_on_prepare(CALLER_ADDR0);
+   }
+   instrumentation_end();
+
+   rcu_nmi_exit();
+   lockdep_hardirq_exit();
+   if (restore)
+   lockdep_hardirqs_on(CALLER_ADDR0);
+   __nmi_exit();
+}
+
 #ifdef CONFIG_XEN_PV
 #ifndef CONFIG_PREEMPTION
 /*
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -16,6 +16,9 @@ void idtentry_exit_user(struct pt_regs *
 bool idtentry_enter_cond_rcu(struct pt_regs *regs);
 void idtentry_exit_cond_rcu(struct pt_regs *regs, bool rcu_exit);
 
+bool idtentry_enter_nmi(struct pt_regs *regs);
+void idtentry_exit_nmi(struct pt_regs *regs, bool irq_state);
+
 /**
  * DECLARE_IDTENTRY - Declare functions for simple IDT entry points
  *   No error code pushed by hardware
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -330,7 +330,6 @@ static noinstr void default_do_nmi(struc
__this_cpu_write(last_nmi_rip, regs->ip);
 
instrumentation_begin();
-   trace_hardirqs_off_finish();
 
handled = nmi_handle(NMI_LOCAL, regs);
__this_cpu_add(nmi_stats.normal, handled);
@@ -417,8 +416,6 @@ static noinstr void default_do_nmi(struc
unknown_nmi_error(reason, regs);
 
 out:
-   if (regs->flags & X86_EFLAGS_IF)
-   trace_hardirqs_on_prepare();
instrumentation_end();
 }
 
@@ -478,6 +475,8 @@ static DEFINE_PER_CPU(unsigned long, nmi
 
 DEFINE_IDTENTRY_RAW(exc_nmi)
 {
+   bool irq_state;
+
if (IS_ENABLED(CONFIG_SMP) && arch_cpu_is_offline(smp_processor_id()))
return;
 
@@ -491,14 +490,14 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
 
this_cpu_write(nmi_dr7, local_db_save());
 
-   nmi_enter();
+   irq_state = idtentry_enter_nmi(regs);
 
inc_irq_stat(__nmi_count);
 
if (!ignore_nmis)
default_do_nmi(regs);
 
-   nmi_exit();
+   idtentry_exit_nmi(regs, irq_state);
 
local_db_restore(this_cpu_read(nmi_dr7));
 
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -403,7 +403,7 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
}
 #endif
 
-   nmi_enter();
+   idtentry_enter_nmi(regs);
instrumentation_begin();
notify_die(DIE_TRAP, str, regs, error

[PATCH v4 1/8] lockdep: Prepare for NMI IRQ state tracking

2020-06-23 Thread Peter Zijlstra
There is no reason not to always, accurately, track IRQ state.

This change also makes IRQ state tracking ignore lockdep_off().

Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/locking/lockdep.c |   44 +---
 1 file changed, 41 insertions(+), 3 deletions(-)

--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -3646,7 +3646,16 @@ static void __trace_hardirqs_on_caller(v
  */
 void lockdep_hardirqs_on_prepare(unsigned long ip)
 {
-   if (unlikely(!debug_locks || current->lockdep_recursion))
+   if (unlikely(!debug_locks))
+   return;
+
+   /*
+* NMIs do not (and cannot) track lock dependencies, nothing to do.
+*/
+   if (unlikely(in_nmi()))
+   return;
+
+   if (unlikely(current->lockdep_recursion & LOCKDEP_RECURSION_MASK))
return;
 
if (unlikely(current->hardirqs_enabled)) {
@@ -3692,7 +3701,27 @@ void noinstr lockdep_hardirqs_on(unsigne
 {
struct task_struct *curr = current;
 
-   if (unlikely(!debug_locks || curr->lockdep_recursion))
+   if (unlikely(!debug_locks))
+   return;
+
+   /*
+* NMIs can happen in the middle of local_irq_{en,dis}able() where the
+* tracking state and hardware state are out of sync.
+*
+* NMIs must save lockdep_hardirqs_enabled() to restore IRQ state from,
+* and not rely on hardware state like normal interrupts.
+*/
+   if (unlikely(in_nmi())) {
+   /*
+* Skip:
+*  - recursion check, because NMI can hit lockdep;
+*  - hardware state check, because above;
+*  - chain_key check, see lockdep_hardirqs_on_prepare().
+*/
+   goto skip_checks;
+   }
+
+   if (unlikely(current->lockdep_recursion & LOCKDEP_RECURSION_MASK))
return;
 
if (curr->hardirqs_enabled) {
@@ -3720,6 +3749,7 @@ void noinstr lockdep_hardirqs_on(unsigne
DEBUG_LOCKS_WARN_ON(current->hardirq_chain_key !=
current->curr_chain_key);
 
+skip_checks:
/* we'll do an OFF -> ON transition: */
curr->hardirqs_enabled = 1;
curr->hardirq_enable_ip = ip;
@@ -3735,7 +3765,15 @@ void noinstr lockdep_hardirqs_off(unsign
 {
struct task_struct *curr = current;
 
-   if (unlikely(!debug_locks || curr->lockdep_recursion))
+   if (unlikely(!debug_locks))
+   return;
+
+   /*
+* Matching lockdep_hardirqs_on(), allow NMIs in the middle of lockdep;
+* they will restore the software state. This ensures the software
+* state is consistent inside NMIs as well.
+*/
+   if (unlikely(!in_nmi() && (current->lockdep_recursion & 
LOCKDEP_RECURSION_MASK)))
return;
 
/*




[PATCH v4 3/8] sparc64: Fix asm/percpu.h build error

2020-06-23 Thread Peter Zijlstra
In order to break a header dependency between lockdep and task_struct,
I need per-cpu stuff from lockdep.

Signed-off-by: Peter Zijlstra (Intel) 
---
 arch/sparc/include/asm/percpu_64.h  |2 ++
 arch/sparc/include/asm/trap_block.h |2 ++
 2 files changed, 4 insertions(+)

--- a/arch/sparc/include/asm/percpu_64.h
+++ b/arch/sparc/include/asm/percpu_64.h
@@ -4,7 +4,9 @@
 
 #include 
 
+#ifndef BUILD_VDSO
 register unsigned long __local_per_cpu_offset asm("g5");
+#endif
 
 #ifdef CONFIG_SMP
 
--- a/arch/sparc/include/asm/trap_block.h
+++ b/arch/sparc/include/asm/trap_block.h
@@ -2,6 +2,8 @@
 #ifndef _SPARC_TRAP_BLOCK_H
 #define _SPARC_TRAP_BLOCK_H
 
+#include 
+
 #include 
 #include 
 




[PATCH v4 0/8] lockdep: Change IRQ state tracking to use per-cpu variables

2020-06-23 Thread Peter Zijlstra
Ahmed and Sebastian wanted additional lockdep_assert*() macros and ran into
header hell. I figured using per-cpu variables would cure that, and also
ran into header hell, still tractable though.

By moving the IRQ state into per-cpu variables we remove the dependency on
task_struct.

Patches go on top of anything recent, I think; an actual git tree with them
in it is (for now) here:

  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git 
locking/irqstate

Which 0day blessed with 0 build fails.




[PATCH v4 7/8] lockdep: Change hardirq{s_enabled, _context} to per-cpu variables

2020-06-23 Thread Peter Zijlstra
Currently all IRQ-tracking state is in task_struct, which means that
task_struct needs to be defined before we use it.

Especially for lockdep_assert_irq*() this can lead to header-hell.

Move the hardirq state into per-cpu variables to avoid the task_struct
dependency.

Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/irqflags.h |   19 ---
 include/linux/lockdep.h  |   34 ++
 include/linux/sched.h|2 --
 kernel/fork.c|4 +---
 kernel/locking/lockdep.c |   30 +++---
 kernel/softirq.c |6 ++
 6 files changed, 52 insertions(+), 43 deletions(-)

--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -14,6 +14,7 @@
 
 #include 
 #include 
+#include 
 
 /* Currently lockdep_softirqs_on/off is used only by lockdep */
 #ifdef CONFIG_PROVE_LOCKING
@@ -31,18 +32,22 @@
 #endif
 
 #ifdef CONFIG_TRACE_IRQFLAGS
+
+DECLARE_PER_CPU(int, hardirqs_enabled);
+DECLARE_PER_CPU(int, hardirq_context);
+
   extern void trace_hardirqs_on_prepare(void);
   extern void trace_hardirqs_off_finish(void);
   extern void trace_hardirqs_on(void);
   extern void trace_hardirqs_off(void);
-# define lockdep_hardirq_context(p)((p)->hardirq_context)
+# define lockdep_hardirq_context(p)(this_cpu_read(hardirq_context))
 # define lockdep_softirq_context(p)((p)->softirq_context)
-# define lockdep_hardirqs_enabled(p)   ((p)->hardirqs_enabled)
+# define lockdep_hardirqs_enabled(p)   (this_cpu_read(hardirqs_enabled))
 # define lockdep_softirqs_enabled(p)   ((p)->softirqs_enabled)
-# define lockdep_hardirq_enter()   \
-do {   \
-   if (!current->hardirq_context++)\
-   current->hardirq_threaded = 0;  \
+# define lockdep_hardirq_enter()   \
+do {   \
+   if (this_cpu_inc_return(hardirq_context) == 1)  \
+   current->hardirq_threaded = 0;  \
 } while (0)
 # define lockdep_hardirq_threaded()\
 do {   \
@@ -50,7 +55,7 @@ do {  \
 } while (0)
 # define lockdep_hardirq_exit()\
 do {   \
-   current->hardirq_context--; \
+   this_cpu_dec(hardirq_context);  \
 } while (0)
 # define lockdep_softirq_enter()   \
 do {   \
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -20,6 +20,7 @@ extern int lock_stat;
 #define MAX_LOCKDEP_SUBCLASSES 8UL
 
 #include 
+#include 
 
 enum lockdep_wait_type {
LD_WAIT_INV = 0,/* not checked, catch all */
@@ -703,28 +704,29 @@ do {  
\
lock_release(&(lock)->dep_map, _THIS_IP_);  \
 } while (0)
 
-#define lockdep_assert_irqs_enabled()  do {\
-   WARN_ONCE(debug_locks && !current->lockdep_recursion && \
- !current->hardirqs_enabled,   \
- "IRQs not enabled as expected\n");\
-   } while (0)
+DECLARE_PER_CPU(int, hardirqs_enabled);
+DECLARE_PER_CPU(int, hardirq_context);
 
-#define lockdep_assert_irqs_disabled() do {\
-   WARN_ONCE(debug_locks && !current->lockdep_recursion && \
- current->hardirqs_enabled,\
- "IRQs not disabled as expected\n");   \
-   } while (0)
+#define lockdep_assert_irqs_enabled()  \
+do {   \
+   WARN_ON_ONCE(debug_locks && !this_cpu_read(hardirqs_enabled));  \
+} while (0)
 
-#define lockdep_assert_in_irq() do {   \
-   WARN_ONCE(debug_locks && !current->lockdep_recursion && \
- !current->hardirq_context,\
- "Not in hardirq as expected\n");  \
-   } while (0)
+#define lockdep_assert_irqs_disabled() \
+do {   \
+   WARN_ON_ONCE(debug_locks && this_cpu_read(hardirqs_enabled));   \
+} while (0)
+
+#define lockdep_assert_in_irq()
\
+do {   \
+   WARN_ON_ONCE(debug_locks && !this_cpu_read(hardirq_context));   \
+} while (0)
 
 #else
 # define might_lock(lock) do { } while (0)
 # define might_lock_read(l

Re: [PATCH 3/3] powerpc/8xx: Provide ptep_get() with 16k pages

2020-06-17 Thread Peter Zijlstra
On Thu, Jun 18, 2020 at 12:21:22AM +1000, Michael Ellerman wrote:
> Peter Zijlstra  writes:
> > On Mon, Jun 15, 2020 at 12:57:59PM +, Christophe Leroy wrote:

> >> +#if defined(CONFIG_PPC_8xx) && defined(CONFIG_PPC_16K_PAGES)
> >> +#define __HAVE_ARCH_PTEP_GET
> >> +static inline pte_t ptep_get(pte_t *ptep)
> >> +{
> >> +  pte_t pte = {READ_ONCE(ptep->pte), 0, 0, 0};
> >> +
> >> +  return pte;
> >> +}
> >> +#endif
> >
> > Would it make sense to have a comment with this magic? The casual reader
> > might wonder WTH just happened when he stumbles on this :-)
> 
> I tried writing a helpful comment but it's too late for my brain to form
> sensible sentences.
> 
> Christophe can you send a follow-up with a comment explaining it? In
> particular the zero entries stand out, it's kind of subtle that those
> entries are only populated with the right value when we write to the
> page table.

static inline pte_t ptep_get(pte_t *ptep)
{
unsigned long val = READ_ONCE(ptep->pte);
/* A 16K page is made up of 4 identical 4K entries */
pte_t pte = {val, val, val, val};
return pte;
}

Maybe something like that?


Re: [PATCH v2 12/12] x86/traps: Fix up invalid PASID

2020-06-17 Thread Peter Zijlstra
On Tue, Jun 16, 2020 at 04:23:46PM -0700, Fenghua Yu wrote:
> Hi, Peter,
> 
> On Mon, Jun 15, 2020 at 09:09:28PM +0200, Peter Zijlstra wrote:
> > On Mon, Jun 15, 2020 at 11:55:29AM -0700, Fenghua Yu wrote:
> > 
> > > Or do you suggest to add a random new flag in struct thread_info instead
> > > of a TIF flag?
> > 
> > Why thread_info? What's wrong with something simple like the below. It
> > takes a bit from the 'strictly current' flags word.
> > 
> > 
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index b62e6aaf28f0..fca830b97055 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -801,6 +801,9 @@ struct task_struct {
> > /* Stalled due to lack of memory */
> > unsignedin_memstall:1;
> >  #endif
> > +#ifdef CONFIG_PCI_PASID
> > +   unsignedhas_valid_pasid:1;
> > +#endif
> >  
> > unsigned long   atomic_flags; /* Flags requiring atomic 
> > access. */
> >  
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 142b23645d82..10b3891be99e 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -955,6 +955,10 @@ static struct task_struct *dup_task_struct(struct 
> > task_struct *orig, int node)
> > tsk->use_memdelay = 0;
> >  #endif
> >  
> > +#ifdef CONFIG_PCI_PASID
> > +   tsk->has_valid_pasid = 0;
> > +#endif
> > +
> >  #ifdef CONFIG_MEMCG
> > tsk->active_memcg = NULL;
> >  #endif
> 
> Can I add "Signed-off-by: Peter Zijlstra "
> to this patch? I will send this patch in the next version of the series.

Sure, n/p.


Re: [PATCH v2 12/12] x86/traps: Fix up invalid PASID

2020-06-15 Thread Peter Zijlstra
On Mon, Jun 15, 2020 at 01:17:35PM -0700, Fenghua Yu wrote:
> Hi, Peter,
> 
> On Mon, Jun 15, 2020 at 09:09:28PM +0200, Peter Zijlstra wrote:
> > On Mon, Jun 15, 2020 at 11:55:29AM -0700, Fenghua Yu wrote:
> > 
> > > Or do you suggest to add a random new flag in struct thread_info instead
> > > of a TIF flag?
> > 
> > Why thread_info? What's wrong with something simple like the below. It
> > takes a bit from the 'strictly current' flags word.
> > 
> > 
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index b62e6aaf28f0..fca830b97055 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -801,6 +801,9 @@ struct task_struct {
> > /* Stalled due to lack of memory */
> > unsignedin_memstall:1;
> >  #endif
> > +#ifdef CONFIG_PCI_PASID
> > +   unsignedhas_valid_pasid:1;
> > +#endif
> >  
> > unsigned long   atomic_flags; /* Flags requiring atomic 
> > access. */
> >  
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 142b23645d82..10b3891be99e 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -955,6 +955,10 @@ static struct task_struct *dup_task_struct(struct 
> > task_struct *orig, int node)
> > tsk->use_memdelay = 0;
> >  #endif
> >  
> > +#ifdef CONFIG_PCI_PASID
> > +   tsk->has_valid_pasid = 0;
> > +#endif
> > +
> >  #ifdef CONFIG_MEMCG
> > tsk->active_memcg = NULL;
> >  #endif
> 
> The PASID MSR is x86 specific although PASID is a PCIe concept and per-mm.
> Checking if the MSR has a valid PASID (bit31=1) is x86-specific work.
> The flag should be cleared in clone()/fork() and is only set and
> read in fixup() in the x86 #GP handler for the heuristic. It's not used
> anywhere outside of x86.
> 
> That's why we think the flag should be in the x86 struct thread_info instead
> of in the generic struct task_struct.

I don't think anybody really cares, it's just one bit, we have plenty
left.

On x86_64 there's a u32-sized alignment hole in thread_info, and we also
don't use the upper 32 bits of thread_info::flags; however, using those
would still mean you have to use atomic ops, which you really don't
need.




Re: [PATCH v2 12/12] x86/traps: Fix up invalid PASID

2020-06-15 Thread Peter Zijlstra
On Mon, Jun 15, 2020 at 11:55:29AM -0700, Fenghua Yu wrote:

> Or do you suggest to add a random new flag in struct thread_info instead
> of a TIF flag?

Why thread_info? What's wrong with something simple like the below. It
takes a bit from the 'strictly current' flags word.


diff --git a/include/linux/sched.h b/include/linux/sched.h
index b62e6aaf28f0..fca830b97055 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -801,6 +801,9 @@ struct task_struct {
/* Stalled due to lack of memory */
unsignedin_memstall:1;
 #endif
+#ifdef CONFIG_PCI_PASID
+   unsignedhas_valid_pasid:1;
+#endif
 
unsigned long   atomic_flags; /* Flags requiring atomic 
access. */
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 142b23645d82..10b3891be99e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -955,6 +955,10 @@ static struct task_struct *dup_task_struct(struct 
task_struct *orig, int node)
tsk->use_memdelay = 0;
 #endif
 
+#ifdef CONFIG_PCI_PASID
+   tsk->has_valid_pasid = 0;
+#endif
+
 #ifdef CONFIG_MEMCG
tsk->active_memcg = NULL;
 #endif


Re: [PATCH v2 12/12] x86/traps: Fix up invalid PASID

2020-06-15 Thread Peter Zijlstra
On Mon, Jun 15, 2020 at 11:19:21AM -0700, Raj, Ashok wrote:
> On Mon, Jun 15, 2020 at 06:03:57PM +0200, Peter Zijlstra wrote:
> > 
> > I don't get why you need a rdmsr here, or why not having one would
> > require a TIF flag. Is that because this MSR is XSAVE/XRSTOR managed?
> > 
> > > > > +  */
> > > > > + rdmsrl(MSR_IA32_PASID, pasid_msr);
> > > > > + if (pasid_msr & MSR_IA32_PASID_VALID)
> > > > > + return false;
> > > > > +
> > > > > + /* Fix up the MSR if the MSR doesn't have a valid PASID. */
> > > > > + wrmsrl(MSR_IA32_PASID, pasid | MSR_IA32_PASID_VALID);
> > 
> > How much more expensive is the wrmsr over the rdmsr? Can't we just
> > unconditionally write the current PASID and call it a day?
> 
> The reason to check the rdmsr() is because we are using a heuristic taking
> GP faults. If we already set up the MSR, but we get it a second time, it
> means the reason is something other than PASID_MSR not being set.
> 
> Ideally we would use a TIF_ flag to track this, which would be cheaper, but
> we were told those bits aren't easy to give out.

Why do you need a TIF flag? Why not any other random flag in current?


Re: [PATCH v2 12/12] x86/traps: Fix up invalid PASID

2020-06-15 Thread Peter Zijlstra
On Mon, Jun 15, 2020 at 11:12:59AM -0700, Fenghua Yu wrote:
> > I don't get why you need a rdmsr here, or why not having one would
> > require a TIF flag. Is that because this MSR is XSAVE/XRSTOR managed?
> 
> My concern is that TIF flags are precious (only 3 slots available). Defining
> a new TIF flag may not be worth it while rdmsr() can check if the PASID
> is valid in the MSR. And performance here might not be a big issue
> in #GP.
> 
> But if you think using TIF flag is better, I can define a new TIF flag
> and maintain it per thread (init 0 when clone()/fork(), set 1 in fixup()).
> Then we can avoid using rdmsr() to check valid PASID in the MSR.

WHY ?!?! What do you need a TIF flag for?


Re: [PATCH v2 12/12] x86/traps: Fix up invalid PASID

2020-06-15 Thread Peter Zijlstra
On Mon, Jun 15, 2020 at 08:48:54AM -0700, Fenghua Yu wrote:
> Hi, Peter,
> On Mon, Jun 15, 2020 at 09:56:49AM +0200, Peter Zijlstra wrote:
> > On Fri, Jun 12, 2020 at 05:41:33PM -0700, Fenghua Yu wrote:
> > > +/*
> > > + * Apply some heuristics to see if the #GP fault was caused by a thread
> > > + * that hasn't had the IA32_PASID MSR initialized.  If it looks like that
> > > + * is the problem, try initializing the IA32_PASID MSR. If the heuristic
> > > + * guesses incorrectly, take one more #GP fault.
> > 
> > How is that going to help? Aren't we then going to run this same
> > heuristic again and again and again?
> 
> The heuristic always initializes the MSR with the per-mm PASID IFF the
> mm has a valid PASID but the MSR doesn't have one. This heuristic usually
> happens only once, on the first #GP in a thread.

But it doesn't guarantee the PASID is the right one. Suppose the mm
has a PASID and the MSR has a VALID one, but the MSR's PASID isn't the mm's.
Then we NO-OP. So if the exception was due to us having the wrong PASID,
we're stuck.

> If the next #GP still comes in, the heuristic finds out the MSR already
> has a valid PASID and thus will not fix up the MSR any more. The fixup()
> returns "false" and lets others handle the #GP.
> 
> So the heuristic will be executed once (at most) and won't be executed
> again and again.

So I get that you want to set the PASID on-demand, but I find this all
really weird code to make that happen.

> > > +bool __fixup_pasid_exception(void)
> > > +{
> > > + u64 pasid_msr;
> > > + unsigned int pasid;
> > > +
> > > + /*
> > > +  * This function is called only when this #GP was triggered from user
> > > +  * space. So the mm cannot be NULL.
> > > +  */
> > > + pasid = current->mm->pasid;
> > > + /* If the mm doesn't have a valid PASID, then can't help. */
> > > + if (invalid_pasid(pasid))
> > > + return false;
> > > +
> > > + /*
> > > +  * Since IRQ is disabled now, the current task still owns the FPU on
> > 
> > That's just weird and confusing. What you want to say is that you rely
> > on the exception disabling the interrupt.
> 
> I checked the SDM again. You are right. #GP can be interrupted by a machine
> check or other interrupts. So I cannot assume the current task still owns the
> FPU. Instead of directly using rdmsr() and wrmsr(), I will add helpers that
> can access either the MSR on the processor or the PASID state in memory.

That's not in fact what I meant, but yes, you can take exceptions while
!IF just fine.

> > > +  * this CPU and the PASID MSR can be directly accessed.
> > > +  *
> > > +  * If the MSR has a valid PASID, the #GP must be for some other reason.
> > > +  *
> > > +  * If rdmsr() is really a performance issue, a TIF_ flag may be
> > > +  * added to check if the thread has a valid PASID instead of rdmsr().
> > 
> > I don't understand any of this. Nobody except us writes to this MSR, we
> > should bloody well know what's in it. What gives?
> 
> Patch 4 describes how to manage the MSR and patch 7 describes the format
> of the MSR (20-bit PASID value and bit 31 is valid bit).
> 
> Are they sufficient to help? Or do you mean something else?

I don't get why you need a rdmsr here, or why not having one would
require a TIF flag. Is that because this MSR is XSAVE/XRSTOR managed?

> > > +  */
> > > + rdmsrl(MSR_IA32_PASID, pasid_msr);
> > > + if (pasid_msr & MSR_IA32_PASID_VALID)
> > > + return false;
> > > +
> > > + /* Fix up the MSR if the MSR doesn't have a valid PASID. */
> > > + wrmsrl(MSR_IA32_PASID, pasid | MSR_IA32_PASID_VALID);

How much more expensive is the wrmsr over the rdmsr? Can't we just
unconditionally write the current PASID and call it a day?
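
For illustration, a minimal sketch of that direction, combining the
"just write it" idea with the has_valid_pasid bit proposed elsewhere in
this thread (so an unrelated second #GP isn't swallowed). The names are
the ones from the series under review; this is a sketch, not the
series' code:

static bool __fixup_pasid_exception(void)
{
	unsigned int pasid = current->mm->pasid;

	/* The mm never got a PASID; nothing we can fix up. */
	if (invalid_pasid(pasid))
		return false;

	/* MSR was already programmed once; this #GP is someone else's. */
	if (current->has_valid_pasid)
		return false;

	/* No rdmsr(): unconditionally install the mm's PASID. */
	wrmsrl(MSR_IA32_PASID, pasid | MSR_IA32_PASID_VALID);
	current->has_valid_pasid = 1;

	return true;
}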


Re: [PATCH 3/3] powerpc/8xx: Provide ptep_get() with 16k pages

2020-06-15 Thread Peter Zijlstra
On Mon, Jun 15, 2020 at 12:57:59PM +, Christophe Leroy wrote:
> READ_ONCE() now enforces atomic read, which leads to:


> Fixes: 2ab3a0a02905 ("READ_ONCE: Enforce atomicity for {READ,WRITE}_ONCE() 
> memory accesses")
> Cc: Will Deacon 
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/include/asm/nohash/32/pgtable.h | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h 
> b/arch/powerpc/include/asm/nohash/32/pgtable.h
> index b56f14160ae5..77addb599ce7 100644
> --- a/arch/powerpc/include/asm/nohash/32/pgtable.h
> +++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
> @@ -286,6 +286,16 @@ static inline pte_t ptep_get_and_clear(struct mm_struct 
> *mm, unsigned long addr,
>   return __pte(pte_update(mm, addr, ptep, ~0, 0, 0));
>  }
>  
> +#if defined(CONFIG_PPC_8xx) && defined(CONFIG_PPC_16K_PAGES)
> +#define __HAVE_ARCH_PTEP_GET
> +static inline pte_t ptep_get(pte_t *ptep)
> +{
> + pte_t pte = {READ_ONCE(ptep->pte), 0, 0, 0};
> +
> + return pte;
> +}
> +#endif

Would it make sense to have a comment with this magic? The casual reader
might wonder WTH just happened when he stumbles on this :-)

Other than that, the series looks good to me, Thanks!

Acked-by: Peter Zijlstra (Intel) 


Re: [PATCH v2 12/12] x86/traps: Fix up invalid PASID

2020-06-15 Thread Peter Zijlstra
On Fri, Jun 12, 2020 at 05:41:33PM -0700, Fenghua Yu wrote:
> +/*
> + * Apply some heuristics to see if the #GP fault was caused by a thread
> + * that hasn't had the IA32_PASID MSR initialized.  If it looks like that
> + * is the problem, try initializing the IA32_PASID MSR. If the heuristic
> + * guesses incorrectly, take one more #GP fault.

How is that going to help? Aren't we then going to run this same
heuristic again and again and again?

> + */
> +bool __fixup_pasid_exception(void)
> +{
> + u64 pasid_msr;
> + unsigned int pasid;
> +
> + /*
> +  * This function is called only when this #GP was triggered from user
> +  * space. So the mm cannot be NULL.
> +  */
> + pasid = current->mm->pasid;
> + /* If the mm doesn't have a valid PASID, then can't help. */
> + if (invalid_pasid(pasid))
> + return false;
> +
> + /*
> +  * Since IRQ is disabled now, the current task still owns the FPU on

That's just weird and confusing. What you want to say is that you rely
on the exception disabling the interrupt.

> +  * this CPU and the PASID MSR can be directly accessed.
> +  *
> +  * If the MSR has a valid PASID, the #GP must be for some other reason.
> +  *
> +  * If rdmsr() is really a performance issue, a TIF_ flag may be
> +  * added to check if the thread has a valid PASID instead of rdmsr().

I don't understand any of this. Nobody except us writes to this MSR, we
should bloody well know what's in it. What gives?

> +  */
> + rdmsrl(MSR_IA32_PASID, pasid_msr);
> + if (pasid_msr & MSR_IA32_PASID_VALID)
> + return false;
> +
> + /* Fix up the MSR if the MSR doesn't have a valid PASID. */
> + wrmsrl(MSR_IA32_PASID, pasid | MSR_IA32_PASID_VALID);
> +
> + return true;
> +}
> -- 
> 2.19.1
> 


Re: [PATCH v2 12/12] x86/traps: Fix up invalid PASID

2020-06-15 Thread Peter Zijlstra
On Fri, Jun 12, 2020 at 05:41:33PM -0700, Fenghua Yu wrote:
> @@ -447,6 +458,18 @@ dotraplinkage void do_general_protection(struct pt_regs 
> *regs, long error_code)
>   int ret;
>  
>   RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
> +
> + /*
> +  * Perform the check for a user mode PASID exception before enable
> +  * interrupts. Doing this here ensures that the PASID MSR can be simply
> +  * accessed because the contents are known to be still associated
> +  * with the current process.
> +  */
> + if (user_mode(regs) && fixup_pasid_exception()) {
> + cond_local_irq_enable(regs);
> + return;

OK, so we're done with the exception, lets enable interrupts?

> + }


Re: [PATCH v2 00/12] x86: tag application address space for devices

2020-06-15 Thread Peter Zijlstra
On Fri, Jun 12, 2020 at 05:41:21PM -0700, Fenghua Yu wrote:

> This series only provides simple and basic support for ENQCMD and the MSR:
> 1. Clean up type definitions (patches 1-3). These patches can be in a
>    separate series.
>    - Define "pasid" as "unsigned int" consistently (patches 1 and 2).
>    - Define "flags" as "unsigned int"
> 2. Explain the various technical terms used in the series (patch 4).
> 3. Enumerate support for ENQCMD in the processor (patch 5).
> 4. Handle FPU PASID state and the MSR during context switch (patches 6-7).
> 5. Define "pasid" in mm_struct (patch 8).
> 6. Clear PASID state for a new mm and for forked and cloned threads
>    (patches 9-10).
> 7. Allocate and free a PASID for a process (patch 11).
> 8. Fix up the PASID MSR in the #GP handler when one thread in a process
>    executes ENQCMD for the first time (patch 12).

If this is per mm, should not switch_mm() update the MSR ? I'm not
seeing that, nor do I see it explained why not.
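
For reference, a rough sketch of what that could look like: program the
MSR when switching to a different mm rather than lazily from the #GP
fixup. The MSR, feature bit and invalid_pasid() names are the ones this
series introduces; the helper name and where exactly it would hook into
switch_mm_irqs_off() are purely illustrative:

static inline void switch_pasid_msr(struct mm_struct *next)
{
	if (!cpu_feature_enabled(X86_FEATURE_ENQCMD))
		return;

	/* Don't let the previous mm's PASID leak into the next one. */
	if (invalid_pasid(next->pasid))
		wrmsrl(MSR_IA32_PASID, 0);
	else
		wrmsrl(MSR_IA32_PASID, next->pasid | MSR_IA32_PASID_VALID);
}

The obvious cost would be an MSR write on every mm switch, which is
presumably what the lazy #GP approach is trying to avoid.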


Re: [RFC][PATCH v3 1/5] sparc64: Fix asm/percpu.h build error

2020-06-04 Thread Peter Zijlstra
On Thu, Jun 04, 2020 at 06:57:03PM +0200, Peter Zijlstra wrote:

> I think I see, what happens is that these headers end up in the VDSO
> build, and that doesn't have these CFLAGS, because userspace.
> 
> Let me see what to do about that.

I feel like the below is cheating, but it's the best I could find :/
The VDSO including kernel headers, plus the utter maze that our kernel
headers are, makes it really hard to untangle :/

This builds sparc64-defconfig and sparc64-all{no,mod}config.

Dave, does this work for you, or should I try harder?

---
 arch/sparc/include/asm/percpu_64.h  | 2 ++
 arch/sparc/include/asm/trap_block.h | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/arch/sparc/include/asm/percpu_64.h 
b/arch/sparc/include/asm/percpu_64.h
index 32ef6f05cc565..a8786a4b90b6b 100644
--- a/arch/sparc/include/asm/percpu_64.h
+++ b/arch/sparc/include/asm/percpu_64.h
@@ -4,7 +4,9 @@

 #include 

+#ifndef BUILD_VDSO
 register unsigned long __local_per_cpu_offset asm("g5");
+#endif

 #ifdef CONFIG_SMP

diff --git a/arch/sparc/include/asm/trap_block.h 
b/arch/sparc/include/asm/trap_block.h
index 0f6d0c4f66838..ace0d48e837e5 100644
--- a/arch/sparc/include/asm/trap_block.h
+++ b/arch/sparc/include/asm/trap_block.h
@@ -2,6 +2,8 @@
 #ifndef _SPARC_TRAP_BLOCK_H
 #define _SPARC_TRAP_BLOCK_H

+#include 
+
 #include 
 #include 




Re: [RFC][PATCH v3 1/5] sparc64: Fix asm/percpu.h build error

2020-06-04 Thread Peter Zijlstra
On Fri, May 29, 2020 at 04:29:17PM -0700, David Miller wrote:
> From: Peter Zijlstra 
> Date: Fri, 29 May 2020 23:35:51 +0200
> 
> > ../arch/sparc/include/asm/percpu_64.h:7:24: warning: call-clobbered 
> > register used for global register variable
> > register unsigned long __local_per_cpu_offset asm("g5");
> 
> The "-ffixed-g5" option on the command line tells gcc that we are
> using 'g5' as a fixed register, so some part of your build isn't using
> the:
> 
> KBUILD_CFLAGS += -ffixed-g4 -ffixed-g5 -fcall-used-g7 -Wno-sign-compare
> 
> from arch/sparc/Makefile for some reason.

Thanks, that was the clue I needed.

I think I see, what happens is that these headers end up in the VDSO
build, and that doesn't have these CFLAGS, because userspace.

Let me see what to do about that.


Re: linux-next: build failure on powerpc 8xx with 16k pages

2020-06-04 Thread Peter Zijlstra
On Thu, Jun 04, 2020 at 12:17:23PM +0100, Will Deacon wrote:
> Hi, [+Peter]
> 
> On Thu, Jun 04, 2020 at 10:48:03AM +, Christophe Leroy wrote:
> > Using mpc885_ads_defconfig with CONFIG_PPC_16K_PAGES instead of
> > CONFIG_PPC_4K_PAGES, getting the following build failure:
> > 
> >   CC  mm/gup.o
> > In file included from ./include/linux/kernel.h:11:0,
> >  from mm/gup.c:2:
> > In function 'gup_hugepte.constprop',
> > inlined from 'gup_huge_pd.isra.78' at mm/gup.c:2465:8:
> > ./include/linux/compiler.h:392:38: error: call to '__compiletime_assert_257'
> > declared with attribute error: Unsupported access size for
> > {READ,WRITE}_ONCE().
> >   _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
> >   ^
> > ./include/linux/compiler.h:373:4: note: in definition of macro
> > '__compiletime_assert'
> > prefix ## suffix();\
> > ^
> > ./include/linux/compiler.h:392:2: note: in expansion of macro
> > '_compiletime_assert'
> >   _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
> >   ^
> > ./include/linux/compiler.h:405:2: note: in expansion of macro
> > 'compiletime_assert'
> >   compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
> >   ^
> > ./include/linux/compiler.h:291:2: note: in expansion of macro
> > 'compiletime_assert_rwonce_type'
> >   compiletime_assert_rwonce_type(x);\
> >   ^
> > mm/gup.c:2428:8: note: in expansion of macro 'READ_ONCE'
> >   pte = READ_ONCE(*ptep);
> > ^
> > In function 'gup_get_pte',
> > inlined from 'gup_pte_range' at mm/gup.c:2228:9,
> > inlined from 'gup_pmd_range' at mm/gup.c:2613:15,
> > inlined from 'gup_pud_range' at mm/gup.c:2641:15,
> > inlined from 'gup_p4d_range' at mm/gup.c:2666:15,
> > inlined from 'gup_pgd_range' at mm/gup.c:2694:15,
> > inlined from 'internal_get_user_pages_fast' at mm/gup.c:2785:3:
> 
> At first glance, this looks like a real bug in the 16k page code -- you're
> loading the pte non-atomically on the fast GUP path and so you're prone to
> tearing, which probably isn't what you want. For a short-term hack, I'd
> suggest having CONFIG_HAVE_FAST_GUP depend on !CONFIG_PPC_16K_PAGES, but if
> you want to support this then you'll need to rework your pte_t so that it
> can be loaded atomically.

Looking at commit 55c8fc3f49302, they're all the exact same value, so
what they could do is grow another special gup_get_pte() variant that
just loads the first value.
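
Something along these lines perhaps, given that commit makes all four
words of the 16k pte_t identical. The function name mirrors the real
gup_get_pte() in mm/gup.c, but how an arch override would actually be
wired up is left out, so treat this purely as a sketch:

static inline pte_t gup_get_pte(pte_t *ptep)
{
	/* All four words hold the same value; one atomic load suffices. */
	unsigned long val = READ_ONCE(ptep->pte);

	return (pte_t){ val, val, val, val };
}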

Also, per that very same commit, there's a distinct lack of WRITE_ONCE()
in the pte_update() / __set_pte_at() paths for much of Power.


Re: [PATCH 11/14] x86/entry: Clarify irq_{enter,exit}_rcu()

2020-06-02 Thread Peter Zijlstra
On Tue, Jun 02, 2020 at 10:42:35AM -0400, Qian Cai wrote:

> Reverting this commit fixed the POWER9 boot warning,

ARGH, I'm an idiot. Please try this instead:


diff --git a/kernel/softirq.c b/kernel/softirq.c
index a3eb6eba8c41..c4201b7f42b1 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -438,7 +438,7 @@ void irq_exit_rcu(void)
  */
 void irq_exit(void)
 {
-   irq_exit_rcu();
+   __irq_exit_rcu();
rcu_irq_exit();
 /* must be last! */
lockdep_hardirq_exit();




[PATCH v3 0/5] lockdep: Change IRQ state tracking to use per-cpu variables

2020-05-29 Thread Peter Zijlstra
Ahmed and Sebastian wanted additional lockdep_assert*() macros and ran
into header hell.

Move the IRQ state into per-cpu variables, which removes the dependency on
task_struct, which is what generated the header-hell.

These patches are intended to go on top of:

 https://lkml.kernel.org/r/20200529212728.795169...@infradead.org

but should apply on current tip/master without much difficulty.

There are a few build fixes for Sparc64, PowerPC64 and s390. Especially the
Sparc one I'm not sure about.



[PATCH v3 2/5] powerpc64: Break asm/percpu.h vs spinlock_types.h dependency

2020-05-29 Thread Peter Zijlstra
In order to use asm/percpu.h in lockdep.h, we need to make sure
asm/percpu.h does not itself depend on lockdep.

The below seems to make that so and builds powerpc64-defconfig +
PROVE_LOCKING.

Signed-off-by: Peter Zijlstra (Intel) 
---
 arch/powerpc/include/asm/dtl.h |   52 +
 arch/powerpc/include/asm/lppaca.h  |   44 ---
 arch/powerpc/kernel/time.c |1 
 arch/powerpc/kvm/book3s_hv.c   |1 
 arch/powerpc/platforms/pseries/dtl.c   |1 
 arch/powerpc/platforms/pseries/lpar.c  |1 
 arch/powerpc/platforms/pseries/setup.c |1 
 arch/powerpc/platforms/pseries/svm.c   |1 
 8 files changed, 58 insertions(+), 44 deletions(-)

--- /dev/null
+++ b/arch/powerpc/include/asm/dtl.h
@@ -0,0 +1,52 @@
+#ifndef _ASM_POWERPC_DTL_H
+#define _ASM_POWERPC_DTL_H
+
+#include 
+#include 
+
+/*
+ * Layout of entries in the hypervisor's dispatch trace log buffer.
+ */
+struct dtl_entry {
+   u8  dispatch_reason;
+   u8  preempt_reason;
+   __be16  processor_id;
+   __be32  enqueue_to_dispatch_time;
+   __be32  ready_to_enqueue_time;
+   __be32  waiting_to_ready_time;
+   __be64  timebase;
+   __be64  fault_addr;
+   __be64  srr0;
+   __be64  srr1;
+};
+
+#define DISPATCH_LOG_BYTES 4096/* bytes per cpu */
+#define N_DISPATCH_LOG (DISPATCH_LOG_BYTES / sizeof(struct dtl_entry))
+
+/*
+ * Dispatch trace log event enable mask:
+ *   0x1: voluntary virtual processor waits
+ *   0x2: time-slice preempts
+ *   0x4: virtual partition memory page faults
+ */
+#define DTL_LOG_CEDE   0x1
+#define DTL_LOG_PREEMPT0x2
+#define DTL_LOG_FAULT  0x4
+#define DTL_LOG_ALL(DTL_LOG_CEDE | DTL_LOG_PREEMPT | DTL_LOG_FAULT)
+
+extern struct kmem_cache *dtl_cache;
+extern rwlock_t dtl_access_lock;
+
+/*
+ * When CONFIG_VIRT_CPU_ACCOUNTING_NATIVE = y, the cpu accounting code controls
+ * reading from the dispatch trace log.  If other code wants to consume
+ * DTL entries, it can set this pointer to a function that will get
+ * called once for each DTL entry that gets processed.
+ */
+extern void (*dtl_consumer)(struct dtl_entry *entry, u64 index);
+
+extern void register_dtl_buffer(int cpu);
+extern void alloc_dtl_buffers(unsigned long *time_limit);
+extern long hcall_vphn(unsigned long cpu, u64 flags, __be32 *associativity);
+
+#endif /* _ASM_POWERPC_DTL_H */
--- a/arch/powerpc/include/asm/lppaca.h
+++ b/arch/powerpc/include/asm/lppaca.h
@@ -42,7 +42,6 @@
  */
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -146,49 +145,6 @@ struct slb_shadow {
} save_area[SLB_NUM_BOLTED];
 } cacheline_aligned;
 
-/*
- * Layout of entries in the hypervisor's dispatch trace log buffer.
- */
-struct dtl_entry {
-   u8  dispatch_reason;
-   u8  preempt_reason;
-   __be16  processor_id;
-   __be32  enqueue_to_dispatch_time;
-   __be32  ready_to_enqueue_time;
-   __be32  waiting_to_ready_time;
-   __be64  timebase;
-   __be64  fault_addr;
-   __be64  srr0;
-   __be64  srr1;
-};
-
-#define DISPATCH_LOG_BYTES 4096/* bytes per cpu */
-#define N_DISPATCH_LOG (DISPATCH_LOG_BYTES / sizeof(struct dtl_entry))
-
-/*
- * Dispatch trace log event enable mask:
- *   0x1: voluntary virtual processor waits
- *   0x2: time-slice preempts
- *   0x4: virtual partition memory page faults
- */
-#define DTL_LOG_CEDE   0x1
-#define DTL_LOG_PREEMPT0x2
-#define DTL_LOG_FAULT  0x4
-#define DTL_LOG_ALL(DTL_LOG_CEDE | DTL_LOG_PREEMPT | DTL_LOG_FAULT)
-
-extern struct kmem_cache *dtl_cache;
-extern rwlock_t dtl_access_lock;
-
-/*
- * When CONFIG_VIRT_CPU_ACCOUNTING_NATIVE = y, the cpu accounting code controls
- * reading from the dispatch trace log.  If other code wants to consume
- * DTL entries, it can set this pointer to a function that will get
- * called once for each DTL entry that gets processed.
- */
-extern void (*dtl_consumer)(struct dtl_entry *entry, u64 index);
-
-extern void register_dtl_buffer(int cpu);
-extern void alloc_dtl_buffers(unsigned long *time_limit);
 extern long hcall_vphn(unsigned long cpu, u64 flags, __be32 *associativity);
 
 #endif /* CONFIG_PPC_BOOK3S */
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -69,6 +69,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* powerpc clocksource/clockevent code */
 
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -74,6 +74,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "book3s.h"
 
--- a/arch/powerpc/platforms/pseries/dtl.c
+++ b/arch/powerpc/platforms/pseries/dtl.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -40,6 +40,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 

[RFC][PATCH v3 1/5] sparc64: Fix asm/percpu.h build error

2020-05-29 Thread Peter Zijlstra
In order to break a header dependency between lockdep and task_struct,
I need per-cpu stuff from lockdep.

Including asm/percpu.h from lockdep.h gives a build error; this
patch cures that, but results in the following warning:

../arch/sparc/include/asm/percpu_64.h:7:24: warning: call-clobbered register 
used for global register variable
register unsigned long __local_per_cpu_offset asm("g5");

But I've no idea how to fix that :/ but it does build.

Not-Signed-off-by: Peter Zijlstra (Intel) 
---
 arch/sparc/include/asm/trap_block.h |2 ++
 1 file changed, 2 insertions(+)

--- a/arch/sparc/include/asm/trap_block.h
+++ b/arch/sparc/include/asm/trap_block.h
@@ -2,6 +2,8 @@
 #ifndef _SPARC_TRAP_BLOCK_H
 #define _SPARC_TRAP_BLOCK_H
 
+#include 
+
 #include 
 #include 
 




[PATCH v3 5/5] lockdep: Remove lockdep_hardirq{s_enabled, _context}() argument

2020-05-29 Thread Peter Zijlstra
Now that the macros use per-cpu data, we no longer need the argument.

Signed-off-by: Peter Zijlstra (Intel) 
---
 arch/x86/entry/common.c|2 +-
 include/linux/irqflags.h   |8 
 include/linux/lockdep.h|2 +-
 kernel/locking/lockdep.c   |   30 +++---
 kernel/softirq.c   |2 +-
 tools/include/linux/irqflags.h |4 ++--
 6 files changed, 24 insertions(+), 24 deletions(-)

--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -689,7 +689,7 @@ noinstr void idtentry_exit_user(struct p
 
 noinstr bool idtentry_enter_nmi(struct pt_regs *regs)
 {
-   bool irq_state = lockdep_hardirqs_enabled(current);
+   bool irq_state = lockdep_hardirqs_enabled();
 
__nmi_enter();
lockdep_hardirqs_off(CALLER_ADDR0);
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -40,9 +40,9 @@ DECLARE_PER_CPU(int, hardirq_context);
   extern void trace_hardirqs_off_finish(void);
   extern void trace_hardirqs_on(void);
   extern void trace_hardirqs_off(void);
-# define lockdep_hardirq_context(p)(this_cpu_read(hardirq_context))
+# define lockdep_hardirq_context() (this_cpu_read(hardirq_context))
 # define lockdep_softirq_context(p)((p)->softirq_context)
-# define lockdep_hardirqs_enabled(p)   (this_cpu_read(hardirqs_enabled))
+# define lockdep_hardirqs_enabled()(this_cpu_read(hardirqs_enabled))
 # define lockdep_softirqs_enabled(p)   ((p)->softirqs_enabled)
 # define lockdep_hardirq_enter()   \
 do {   \
@@ -109,9 +109,9 @@ do {\
 # define trace_hardirqs_off_finish()   do { } while (0)
 # define trace_hardirqs_on()   do { } while (0)
 # define trace_hardirqs_off()  do { } while (0)
-# define lockdep_hardirq_context(p)0
+# define lockdep_hardirq_context() 0
 # define lockdep_softirq_context(p)0
-# define lockdep_hardirqs_enabled(p)   0
+# define lockdep_hardirqs_enabled()0
 # define lockdep_softirqs_enabled(p)   0
 # define lockdep_hardirq_enter()   do { } while (0)
 # define lockdep_hardirq_threaded()do { } while (0)
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -736,7 +736,7 @@ do {
\
 
 # define lockdep_assert_RT_in_threaded_ctx() do {  \
WARN_ONCE(debug_locks && !current->lockdep_recursion && \
- lockdep_hardirq_context(current) &&   \
+ lockdep_hardirq_context() &&  \
  !(current->hardirq_threaded || current->irq_config),  
\
  "Not in threaded context on PREEMPT_RT as 
expected\n");   \
 } while (0)
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2062,9 +2062,9 @@ print_bad_irq_dependency(struct task_str
pr_warn("-\n");
pr_warn("%s/%d [HC%u[%lu]:SC%u[%lu]:HE%u:SE%u] is trying to acquire:\n",
curr->comm, task_pid_nr(curr),
-   lockdep_hardirq_context(curr), hardirq_count() >> HARDIRQ_SHIFT,
+   lockdep_hardirq_context(), hardirq_count() >> HARDIRQ_SHIFT,
curr->softirq_context, softirq_count() >> SOFTIRQ_SHIFT,
-   lockdep_hardirqs_enabled(curr),
+   lockdep_hardirqs_enabled(),
curr->softirqs_enabled);
print_lock(next);
 
@@ -3331,9 +3331,9 @@ print_usage_bug(struct task_struct *curr
 
pr_warn("%s/%d [HC%u[%lu]:SC%u[%lu]:HE%u:SE%u] takes:\n",
curr->comm, task_pid_nr(curr),
-   lockdep_hardirq_context(curr), hardirq_count() >> HARDIRQ_SHIFT,
+   lockdep_hardirq_context(), hardirq_count() >> HARDIRQ_SHIFT,
lockdep_softirq_context(curr), softirq_count() >> SOFTIRQ_SHIFT,
-   lockdep_hardirqs_enabled(curr),
+   lockdep_hardirqs_enabled(),
lockdep_softirqs_enabled(curr));
print_lock(this);
 
@@ -3655,7 +3655,7 @@ void lockdep_hardirqs_on_prepare(unsigne
if (DEBUG_LOCKS_WARN_ON(current->lockdep_recursion & 
LOCKDEP_RECURSION_MASK))
return;
 
-   if (unlikely(lockdep_hardirqs_enabled(current))) {
+   if (unlikely(lockdep_hardirqs_enabled())) {
/*
 * Neither irq nor preemption are disabled here
 * so this is racy by nature but losing one hit
@@ -3683,7 +3683,7 @@ void lockdep_hardirqs_on_prepare(unsigne
 * Can't allow enabling interrupts while in an interrupt handler,
 * that's general bad form and such. Recursion, limited stack etc..
 */
-  

[PATCH v3 4/5] lockdep: Change hardirq{s_enabled, _context} to per-cpu variables

2020-05-29 Thread Peter Zijlstra
Currently all IRQ-tracking state is in task_struct, which means that
task_struct needs to be defined before we use it.

Especially for lockdep_assert_irq*() this can lead to header-hell.

Move the hardirq state into per-cpu variables to avoid the task_struct
dependency.

Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/irqflags.h |   19 ---
 include/linux/lockdep.h  |   34 ++
 include/linux/sched.h|2 --
 kernel/fork.c|4 +---
 kernel/locking/lockdep.c |   30 +++---
 kernel/softirq.c |6 ++
 6 files changed, 52 insertions(+), 43 deletions(-)

--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -14,6 +14,7 @@
 
 #include 
 #include 
+#include 
 
 /* Currently lockdep_softirqs_on/off is used only by lockdep */
 #ifdef CONFIG_PROVE_LOCKING
@@ -31,18 +32,22 @@
 #endif
 
 #ifdef CONFIG_TRACE_IRQFLAGS
+
+DECLARE_PER_CPU(int, hardirqs_enabled);
+DECLARE_PER_CPU(int, hardirq_context);
+
   extern void trace_hardirqs_on_prepare(void);
   extern void trace_hardirqs_off_finish(void);
   extern void trace_hardirqs_on(void);
   extern void trace_hardirqs_off(void);
-# define lockdep_hardirq_context(p)((p)->hardirq_context)
+# define lockdep_hardirq_context(p)(this_cpu_read(hardirq_context))
 # define lockdep_softirq_context(p)((p)->softirq_context)
-# define lockdep_hardirqs_enabled(p)   ((p)->hardirqs_enabled)
+# define lockdep_hardirqs_enabled(p)   (this_cpu_read(hardirqs_enabled))
 # define lockdep_softirqs_enabled(p)   ((p)->softirqs_enabled)
-# define lockdep_hardirq_enter()   \
-do {   \
-   if (!current->hardirq_context++)\
-   current->hardirq_threaded = 0;  \
+# define lockdep_hardirq_enter()   \
+do {   \
+   if (this_cpu_inc_return(hardirq_context) == 1)  \
+   current->hardirq_threaded = 0;  \
 } while (0)
 # define lockdep_hardirq_threaded()\
 do {   \
@@ -50,7 +55,7 @@ do {  \
 } while (0)
 # define lockdep_hardirq_exit()\
 do {   \
-   current->hardirq_context--; \
+   this_cpu_dec(hardirq_context);  \
 } while (0)
 # define lockdep_softirq_enter()   \
 do {   \
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -19,6 +19,7 @@ extern int lock_stat;
 
 #define MAX_LOCKDEP_SUBCLASSES 8UL
 
+#include 
 #include 
 
 enum lockdep_wait_type {
@@ -703,28 +704,29 @@ do {  
\
lock_release(&(lock)->dep_map, _THIS_IP_);  \
 } while (0)
 
-#define lockdep_assert_irqs_enabled()  do {\
-   WARN_ONCE(debug_locks && !current->lockdep_recursion && \
- !current->hardirqs_enabled,   \
- "IRQs not enabled as expected\n");\
-   } while (0)
+DECLARE_PER_CPU(int, hardirqs_enabled);
+DECLARE_PER_CPU(int, hardirq_context);
 
-#define lockdep_assert_irqs_disabled() do {\
-   WARN_ONCE(debug_locks && !current->lockdep_recursion && \
- current->hardirqs_enabled,\
- "IRQs not disabled as expected\n");   \
-   } while (0)
+#define lockdep_assert_irqs_enabled()  \
+do {   \
+   WARN_ON_ONCE(debug_locks && !this_cpu_read(hardirqs_enabled));  \
+} while (0)
 
-#define lockdep_assert_in_irq() do {   \
-   WARN_ONCE(debug_locks && !current->lockdep_recursion && \
- !current->hardirq_context,\
- "Not in hardirq as expected\n");  \
-   } while (0)
+#define lockdep_assert_irqs_disabled() \
+do {   \
+   WARN_ON_ONCE(debug_locks && this_cpu_read(hardirqs_enabled));   \
+} while (0)
+
+#define lockdep_assert_in_irq()
\
+do {   \
+   WARN_ON_ONCE(debug_locks && !this_cpu_read(hardirq_context));   \
+} while (0)
 
 #else
 # define might_lock(lock) do { } while (0)
 # define might_lock_read(lock) do { } while (0)
 # define might_lock_nested(lock, subclass) do { } while (0)
+

[PATCH v3 3/5] s390: Break cyclic percpu include

2020-05-29 Thread Peter Zijlstra
In order to use asm/percpu.h in irqflags.h, we need to make sure
asm/percpu.h does not itself depend on irqflags.h

Signed-off-by: Peter Zijlstra (Intel) 
---
 arch/s390/include/asm/smp.h |1 +
 arch/s390/include/asm/thread_info.h |1 -
 2 files changed, 1 insertion(+), 1 deletion(-)

--- a/arch/s390/include/asm/smp.h
+++ b/arch/s390/include/asm/smp.h
@@ -10,6 +10,7 @@
 
 #include 
 #include 
+#include 
 
 #define raw_smp_processor_id() (S390_lowcore.cpu_nr)
 
--- a/arch/s390/include/asm/thread_info.h
+++ b/arch/s390/include/asm/thread_info.h
@@ -24,7 +24,6 @@
 #ifndef __ASSEMBLY__
 #include 
 #include 
-#include 
 
 #define STACK_INIT_OFFSET \
(THREAD_SIZE - STACK_FRAME_OVERHEAD - sizeof(struct pt_regs))




Re: Endless soft-lockups for compiling workload since next-20200519

2020-05-25 Thread Peter Zijlstra
On Mon, May 25, 2020 at 04:05:49PM +0200, Frederic Weisbecker wrote:
> On Mon, May 25, 2020 at 03:21:05PM +0200, Peter Zijlstra wrote:
> > @@ -2320,7 +2304,7 @@ static void ttwu_queue_remote(struct task_struct *p, 
> > int cpu, int wake_flags)
> >  
> > if (llist_add(>wake_entry, >wake_list)) {
> > if (!set_nr_if_polling(rq->idle))
> > -   smp_call_function_single_async(cpu, >wake_csd);
> > +   smp_call_function_single_async(cpu, >wake_csd);
> > else
> > trace_sched_wake_idle_without_ipi(cpu);
> 
> Ok that's of course very unlikely but could it be possible to have the
> following:
> 
> CPU 0 CPU 1 CPU 2
> -   
> 
> //Wake up A
> ttwu_queue(TASK A, CPU 1) idle_loop {
>   ttwu_queue_pending {
>   
>   raw_spin_unlock_irqrestore(rq)
>   # VMEXIT (with IPI still pending)
> 
> //task A migrates here
> 
> wait_event()
> 
> //sleep
> 
> //Wake up A
> ttwu_queue(TASK A, CPU 2) {
> //IPI on CPU 2 ignored
> // due to csd->flags == CSD_LOCK
> 

Right you are.

Bah!

More thinking


Re: Endless soft-lockups for compiling workload since next-20200519

2020-05-25 Thread Peter Zijlstra
On Mon, May 25, 2020 at 03:21:05PM +0200, Peter Zijlstra wrote:
> - flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
> - if (!(flags & NOHZ_KICK_MASK))
> + if (idle != CPU_IDLE)
>   return false;
>  
>   _nohz_idle_balance(this_rq, flags, idle);

Bah, I think I broke something there. Lemme go mend.


Re: Endless soft-lockups for compiling workload since next-20200519

2020-05-25 Thread Peter Zijlstra
On Thu, May 21, 2020 at 02:41:14PM +0200, Frederic Weisbecker wrote:
> On Thu, May 21, 2020 at 01:00:27PM +0200, Peter Zijlstra wrote:

> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 01f94cf52783..b6d8a7b991f0 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10033,7 +10033,7 @@ static void kick_ilb(unsigned int flags)
> >  * is idle. And the softirq performing nohz idle load balance
> >  * will be run before returning from the IPI.
> >  */
> > -   smp_call_function_single_async(ilb_cpu, _rq(ilb_cpu)->nohz_csd);
> > +   smp_call_function_single_async(ilb_cpu, _rq()->nohz_csd);
> 
> My fear here is that if a previous call from the same CPU but to another
> target is still pending, the new one will be spuriously ignored.
> 

Urgh, indeed!

> But I believe we can still keep the remote csd if nohz_flags() are
> strictly only set before the IPI and strictly only cleared from it.
> 
> And I still don't understand why trigger_load_balance() raises the
> softirq without setting the current CPU as ilb. run_rebalance_domains()
> thus ignores it most of the time in the end, or it spuriously clears the
> nohz_flags set by an IPI sender. Or there is something I misunderstood
> there.

That is because it is simple and didn't matter before. Whoever got there
first got to run the ilb whenever the flag was set.

But now we have this race due to having to serialize access to the csd.

We want the IPI to clear the flag, but then the softirq no longer knows
it was supposed to do ILB.

How's this then?

---
 include/linux/sched.h |  4 
 kernel/sched/core.c   | 41 +
 kernel/sched/fair.c   | 15 +++
 kernel/sched/sched.h  |  2 +-
 4 files changed, 25 insertions(+), 37 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f38d62c4632c..136ee400b568 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -696,6 +696,10 @@ struct task_struct {
struct uclamp_seuclamp[UCLAMP_CNT];
 #endif
 
+#ifdef CONFIG_SMP
+   call_single_data_t  wake_csd;
+#endif
+
 #ifdef CONFIG_PREEMPT_NOTIFIERS
/* List of struct preempt_notifier: */
struct hlist_head   preempt_notifiers;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5b286469e26e..90484b988b65 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -637,41 +637,25 @@ void wake_up_nohz_cpu(int cpu)
wake_up_idle_cpu(cpu);
 }
 
-static inline bool got_nohz_idle_kick(void)
+static void nohz_csd_func(void *info)
 {
-   int cpu = smp_processor_id();
-
-   if (!(atomic_read(nohz_flags(cpu)) & NOHZ_KICK_MASK))
-   return false;
-
-   if (idle_cpu(cpu) && !need_resched())
-   return true;
+   struct rq *rq = info;
+   int cpu = cpu_of(rq);
 
+   WARN_ON(!(atomic_read(nohz_flags(cpu)) & NOHZ_KICK_MASK));
/*
-* We can't run Idle Load Balance on this CPU for this time so we
-* cancel it and clear NOHZ_BALANCE_KICK
+* Release the rq::nohz_csd.
 */
+   smp_mb__before_atomic();
atomic_andnot(NOHZ_KICK_MASK, nohz_flags(cpu));
-   return false;
-}
-
-static void nohz_csd_func(void *info)
-{
-   struct rq *rq = info;
 
-   if (got_nohz_idle_kick()) {
-   rq->idle_balance = 1;
+   rq->idle_balance = idle_cpu(cpu);
+   if (rq->idle_balance && !need_resched()) {
+   rq->nohz_idle_balance = 1;
raise_softirq_irqoff(SCHED_SOFTIRQ);
}
 }
 
-#else /* CONFIG_NO_HZ_COMMON */
-
-static inline bool got_nohz_idle_kick(void)
-{
-   return false;
-}
-
 #endif /* CONFIG_NO_HZ_COMMON */
 
 #ifdef CONFIG_NO_HZ_FULL
@@ -2320,7 +2304,7 @@ static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
 
if (llist_add(&p->wake_entry, &rq->wake_list)) {
if (!set_nr_if_polling(rq->idle))
-   smp_call_function_single_async(cpu, &rq->wake_csd);
+   smp_call_function_single_async(cpu, &p->wake_csd);
else
trace_sched_wake_idle_without_ipi(cpu);
}
@@ -2921,6 +2905,9 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 #endif
 #if defined(CONFIG_SMP)
p->on_cpu = 0;
+   p->wake_csd = (struct __call_single_data) {
+   .func = wake_csd_func,
+   };
 #endif
init_task_preempt_count(p);
 #ifdef CONFIG_SMP
@@ -6723,8 +6710,6 @@ void __init sched_init(void)
rq->avg_idle = 2*sysctl_sched_migration_cost;
rq->max_idle_balance_cost = sysctl_sched_migration_cost;
 
-   rq_csd_init(rq, &rq->wake_csd, wake_csd_func);
-
INIT_LIST_HEAD(&rq->cfs_tasks);
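
For context, the sender side this pairs with, kick_ilb(), ends up looking roughly like the sketch below (illustration only, not part of the diff above; mainline naming assumed). The point is that NOHZ_KICK_MASK now serializes access to rq::nohz_csd: the csd is only re-armed after nohz_csd_func() has cleared the flags.

static void kick_ilb(unsigned int flags)
{
	int ilb_cpu = find_new_ilb();

	if (ilb_cpu >= nr_cpu_ids)
		return;

	/*
	 * A previous kick is still in flight until nohz_csd_func() has
	 * cleared NOHZ_KICK_MASK, so bail if any kick bit was already set.
	 */
	flags = atomic_fetch_or(flags, nohz_flags(ilb_cpu));
	if (flags & NOHZ_KICK_MASK)
		return;

	/*
	 * This way we never race a concurrent
	 * smp_call_function_single_async() on the same csd.
	 */
	smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->nohz_csd);
}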

Re: Endless soft-lockups for compiling workload since next-20200519

2020-05-21 Thread Peter Zijlstra
On Thu, May 21, 2020 at 12:49:37PM +0200, Peter Zijlstra wrote:
> On Thu, May 21, 2020 at 11:39:39AM +0200, Peter Zijlstra wrote:
> > On Thu, May 21, 2020 at 02:40:36AM +0200, Frederic Weisbecker wrote:
> 
> > This:
> > 
> > > smp_call_function_single_async() { 
> > > smp_call_function_single_async() {
> > > // verified csd->flags != CSD_LOCK // verified 
> > > csd->flags != CSD_LOCK
> > > csd->flags = CSD_LOCK  csd->flags = 
> > > CSD_LOCK
> > 
> > concurrent smp_call_function_single_async() using the same csd is what
> > I'm looking at as well.
> 
> So something like this ought to cure the fundamental problem and make
> smp_call_function_single_async() more user friendly, but also more
> expensive.
> 
> The problem is that while the ILB case is easy to fix, I can't seem to
> find an equally nice solution for the ttwu_remote_queue() case; that
> would basically require sticking the wake_csd in task_struct, I'll also
> post that.
> 
> So it's either this:

Or this:

---
 include/linux/sched.h | 4 
 kernel/sched/core.c   | 7 ---
 kernel/sched/fair.c   | 2 +-
 kernel/sched/sched.h  | 1 -
 4 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f38d62c4632c..136ee400b568 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -696,6 +696,10 @@ struct task_struct {
struct uclamp_se	uclamp[UCLAMP_CNT];
 #endif
 
+#ifdef CONFIG_SMP
+   call_single_data_t  wake_csd;
+#endif
+
 #ifdef CONFIG_PREEMPT_NOTIFIERS
/* List of struct preempt_notifier: */
struct hlist_head   preempt_notifiers;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5b286469e26e..a7129652e89b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2320,7 +2320,7 @@ static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
 
if (llist_add(&p->wake_entry, &rq->wake_list)) {
if (!set_nr_if_polling(rq->idle))
-   smp_call_function_single_async(cpu, &rq->wake_csd);
+   smp_call_function_single_async(cpu, &p->wake_csd);
else
trace_sched_wake_idle_without_ipi(cpu);
}
@@ -2921,6 +2921,9 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 #endif
 #if defined(CONFIG_SMP)
p->on_cpu = 0;
+   p->wake_csd = (struct __call_single_data) {
+   .func = wake_csd_func,
+   };
 #endif
init_task_preempt_count(p);
 #ifdef CONFIG_SMP
@@ -6723,8 +6726,6 @@ void __init sched_init(void)
rq->avg_idle = 2*sysctl_sched_migration_cost;
rq->max_idle_balance_cost = sysctl_sched_migration_cost;
 
-   rq_csd_init(rq, &rq->wake_csd, wake_csd_func);
-
INIT_LIST_HEAD(&rq->cfs_tasks);
 
rq_attach_root(rq, &def_root_domain);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 01f94cf52783..b6d8a7b991f0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10033,7 +10033,7 @@ static void kick_ilb(unsigned int flags)
 * is idle. And the softirq performing nohz idle load balance
 * will be run before returning from the IPI.
 */
-   smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->nohz_csd);
+   smp_call_function_single_async(ilb_cpu, &this_rq()->nohz_csd);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f7ab6334e992..c35f0ef43ab0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1021,7 +1021,6 @@ struct rq {
 #endif
 
 #ifdef CONFIG_SMP
-   call_single_data_t  wake_csd;
struct llist_head   wake_list;
 #endif
 


Re: Endless soft-lockups for compiling workload since next-20200519

2020-05-21 Thread Peter Zijlstra
On Thu, May 21, 2020 at 11:39:39AM +0200, Peter Zijlstra wrote:
> On Thu, May 21, 2020 at 02:40:36AM +0200, Frederic Weisbecker wrote:

> This:
> 
> > smp_call_function_single_async() { 
> > smp_call_function_single_async() {
> > // verified csd->flags != CSD_LOCK // verified 
> > csd->flags != CSD_LOCK
> > csd->flags = CSD_LOCK  csd->flags = 
> > CSD_LOCK
> 
> concurrent smp_call_function_single_async() using the same csd is what
> I'm looking at as well.

So something like this ought to cure the fundamental problem and make
smp_call_function_single_async() more user friendly, but also more
expensive.

The problem is that while the ILB case is easy to fix, I can't seem to
find an equally nice solution for the ttwu_remote_queue() case; that
would basically require sticking the wake_csd in task_struct, I'll also
post that.

So it's either this:

---
 kernel/smp.c | 21 -
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 84303197caf9..d1ca2a2d1cc7 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -109,6 +109,12 @@ static __always_inline void csd_lock_wait(call_single_data_t *csd)
smp_cond_load_acquire(&csd->flags, !(VAL & CSD_FLAG_LOCK));
 }
 
+/*
+ * csd_lock() can use non-atomic operations to set CSD_FLAG_LOCK because it's
+ * users are careful to only use CPU-local data. IOW, there is no cross-cpu
+ * lock usage. Also, you're not allowed to use smp_call_function*() from IRQs,
+ * and must be extra careful from SoftIRQ.
+ */
 static __always_inline void csd_lock(call_single_data_t *csd)
 {
csd_lock_wait(csd);
@@ -318,7 +324,7 @@ EXPORT_SYMBOL(smp_call_function_single);
 
 /**
  * smp_call_function_single_async(): Run an asynchronous function on a
- *  specific CPU.
+ *  specific CPU.
  * @cpu: The CPU to run on.
  * @csd: Pre-allocated and setup data structure
  *
@@ -339,18 +345,23 @@ EXPORT_SYMBOL(smp_call_function_single);
  */
 int smp_call_function_single_async(int cpu, call_single_data_t *csd)
 {
+   unsigned int csd_flags;
int err = 0;
 
preempt_disable();
 
-   if (csd->flags & CSD_FLAG_LOCK) {
+   /*
+* Unlike the regular smp_call_function*() APIs, this one is actually
+* usable from IRQ context, also the -EBUSY return value suggests
+* it is safe to share csd's.
+*/
+   csd_flags = READ_ONCE(csd->flags);
+   if (csd_flags & CSD_FLAG_LOCK ||
+   cmpxchg(&csd->flags, csd_flags, csd_flags | CSD_FLAG_LOCK) != csd_flags) {
err = -EBUSY;
goto out;
}
 
-   csd->flags = CSD_FLAG_LOCK;
-   smp_wmb();
-
err = generic_exec_single(cpu, csd, csd->func, csd->info);
 
 out:
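
With the cmpxchg() variant above, a caller that really does share a csd can treat -EBUSY as "a kick is already in flight" and coalesce. A hypothetical example (names made up for illustration):

/* Hypothetical caller; one shared csd per target CPU. */
static void kick_remote_cpu(int cpu, call_single_data_t *csd)
{
	int ret = smp_call_function_single_async(cpu, csd);

	/*
	 * -EBUSY means the previous kick has not run its callback yet;
	 * that callback will observe whatever state we just published,
	 * so dropping the duplicate kick is safe.
	 */
	WARN_ON_ONCE(ret && ret != -EBUSY);
}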


Re: Endless soft-lockups for compiling workload since next-20200519

2020-05-21 Thread Peter Zijlstra
On Thu, May 21, 2020 at 02:40:36AM +0200, Frederic Weisbecker wrote:

>atomic_fetch_or(, 
> nohz_flags(0))
> softirq() {#VMEXIT or anything 
> that could stop a CPU for a while
> run_rebalance_domain() {
> nohz_idle_balance() {
> atomic_andnot(NOHZ_KICK_MASK, nohz_flag(0))

I'm an idiot and didn't have enough wake-up-juice; I missed that andnot
clearing the flag again.

Yes, fun fun fun..


Re: Endless soft-lockups for compiling workload since next-20200519

2020-05-21 Thread Peter Zijlstra
On Thu, May 21, 2020 at 02:40:36AM +0200, Frederic Weisbecker wrote:
> On Wed, May 20, 2020 at 02:50:56PM +0200, Peter Zijlstra wrote:
> > On Tue, May 19, 2020 at 11:58:17PM -0400, Qian Cai wrote:
> > > Just a head up. Repeatedly compiling kernels for a while would trigger
> > > endless soft-lockups since next-20200519 on both x86_64 and powerpc.
> > > .config are in,
> > 
> > Could be 90b5363acd47 ("sched: Clean up scheduler_ipi()"), although I've
> > not seen anything like that myself. Let me go have a look.
> > 
> > 
> > In as far as the logs are readable (they're a wrapped mess, please don't
> > do that!), they contain very little useful, as is typical with IPIs :/
> > 
> > > [ 1167.993773][C1] WARNING: CPU: 1 PID: 0 at kernel/smp.c:127
> > > flush_smp_call_function_queue+0x1fa/0x2e0
> 
> So I've tried to think of a race that could produce that and here is
> the only thing I could come up with. It's a bit complicated unfortunately:

This:

> smp_call_function_single_async() { 
> smp_call_function_single_async() {
> // verified csd->flags != CSD_LOCK // verified 
> csd->flags != CSD_LOCK
> csd->flags = CSD_LOCK  csd->flags = 
> CSD_LOCK

concurrent smp_call_function_single_async() using the same csd is what
I'm looking at as well. Now in the ILB case there is an easy cure:

(because there is only a single ilb target)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 01f94cf52783..b6d8a7b991f0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10033,7 +10033,7 @@ static void kick_ilb(unsigned int flags)
 * is idle. And the softirq performing nohz idle load balance
 * will be run before returning from the IPI.
 */
-   smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->nohz_csd);
+   smp_call_function_single_async(ilb_cpu, &this_rq()->nohz_csd);
 }
 
 /*

Qian, can you give that a spin?

But I'm still not convinced of your scenario:

>kick_ilb() {
>atomic_fetch_or(, 
> nohz_flags(0))

> atomic_fetch_or(, nohz_flags(0))   #VMENTER
> smp_call_function_single_async() { 
> smp_call_function_single_async() {
> // verified csd->flags != CSD_LOCK // verified 
> csd->flags != CSD_LOCK
> csd->flags = CSD_LOCK  csd->flags = 
> CSD_LOCK

Note that we check the return value of atomic_fetch_or() and bail if
someone else set a flag in KICK_MASK before us.

Aah, I suppose you're saying this can happen when:

  !(flags & NOHZ_KICK_MASK)

? That's not supposed to happen though.


Anyway, let me go stare at the remove wake-up case, because i'm afraid
that might have the same problem too...


Re: Endless soft-lockups for compiling workload since next-20200519

2020-05-20 Thread Peter Zijlstra
On Tue, May 19, 2020 at 11:58:17PM -0400, Qian Cai wrote:
> Just a head up. Repeatedly compiling kernels for a while would trigger
> endless soft-lockups since next-20200519 on both x86_64 and powerpc.
> .config are in,

Could be 90b5363acd47 ("sched: Clean up scheduler_ipi()"), although I've
not seen anything like that myself. Let me go have a look.


In as far as the logs are readable (they're a wrapped mess, please don't
do that!), they contain very little useful, as is typical with IPIs :/

> [ 1167.993773][C1] WARNING: CPU: 1 PID: 0 at kernel/smp.c:127
> flush_smp_call_function_queue+0x1fa/0x2e0
> [ 1168.00][C1] Modules linked in: nls_iso8859_1 nls_cp437 vfat
> fat kvm_amd ses kvm enclosure dax_pmem irqbypass dax_pmem_core efivars
> acpi_cpufreq efivarfs ip_tables x_tables xfs sd_mod smartpqi
> scsi_transport_sas tg3 mlx5_core libphy firmware_class dm_mirror
> dm_region_hash dm_log dm_mod
> [ 1168.029492][C1] CPU: 1 PID: 0 Comm: swapper/1 Not tainted
> 5.7.0-rc6-next-20200519 #1
> [ 1168.037665][C1] Hardware name: HPE ProLiant DL385
> Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
> [ 1168.046978][C1] RIP: 0010:flush_smp_call_function_queue+0x1fa/0x2e0
> [ 1168.053658][C1] Code: 01 0f 87 c9 12 00 00 83 e3 01 0f 85 cc fe
> ff ff 48 c7 c7 c0 55 a9 8f c6 05 f6 86 cd 01 01 e8 de 09 ea ff 0f 0b
> e9 b2 fe ff ff <0f> 0b e9 52 ff ff ff 0f 0b e9 f2 fe ff ff 65 44 8b 25
> 10 52 3f 71
> [ 1168.073262][C1] RSP: 0018:c9178918 EFLAGS: 00010046
> [ 1168.079253][C1] RAX:  RBX: 430c58f8
> RCX: 8ec26083
> [ 1168.087156][C1] RDX: 0003 RSI: dc00
> RDI: 430c58f8
> [ 1168.095054][C1] RBP: c91789a8 R08: ed1108618cec
> R09: ed1108618cec
> [ 1168.102964][C1] R10: 430c675b R11: 
> R12: 430c58e0
> [ 1168.110866][C1] R13: 8eb30c40 R14: 430c5880
> R15: 430c58e0
> [ 1168.118767][C1] FS:  ()
> GS:4308() knlGS:
> [ 1168.127628][C1] CS:  0010 DS:  ES:  CR0: 80050033
> [ 1168.134129][C1] CR2: 55b169604560 CR3: 000d08a14000
> CR4: 003406e0
> [ 1168.142026][C1] Call Trace:
> [ 1168.145206][C1]  
> [ 1168.147957][C1]  ? smp_call_on_cpu_callback+0xd0/0xd0
> [ 1168.153421][C1]  ? rcu_read_lock_sched_held+0xac/0xe0
> [ 1168.158880][C1]  ? rcu_read_lock_bh_held+0xc0/0xc0
> [ 1168.164076][C1]  generic_smp_call_function_single_interrupt+0x13/0x2b
> [ 1168.170938][C1]  smp_call_function_single_interrupt+0x157/0x4e0
> [ 1168.177278][C1]  ? smp_call_function_interrupt+0x4e0/0x4e0
> [ 1168.183172][C1]  ? interrupt_entry+0xe4/0xf0
> [ 1168.187846][C1]  ? trace_hardirqs_off_caller+0x8d/0x1f0
> [ 1168.193478][C1]  ? trace_hardirqs_on_caller+0x1f0/0x1f0
> [ 1168.199116][C1]  ? _nohz_idle_balance+0x221/0x360
> [ 1168.204228][C1]  ? trace_hardirqs_off_thunk+0x1a/0x1c
> [ 1168.209690][C1]  call_function_single_interrupt+0xf/0x20


Re: [PATCH v2] sched/core: fix illegal RCU from offline CPUs

2020-04-21 Thread Peter Zijlstra
On Fri, Apr 17, 2020 at 09:26:56AM -0400, Qian Cai wrote:

> > Acked-by: Michael Ellerman  (powerpc)
> 
> Peter, can you take a look at this patch when you have a chance?

Sorry, -ETOOMUCHEMAIL, got it now, thanks!


Re: [PATCH 10/28] mm: only allow page table mappings for built-in zsmalloc

2020-04-09 Thread Peter Zijlstra
On Thu, Apr 09, 2020 at 09:08:26AM -0700, Minchan Kim wrote:
> On Wed, Apr 08, 2020 at 01:59:08PM +0200, Christoph Hellwig wrote:
> > This allows to unexport map_vm_area and unmap_kernel_range, which are
> > rather deep internal and should not be available to modules.
> 
> Even though I don't know how many usecase we have using zsmalloc as
> module(I heard only once by dumb reason), it could affect existing
> users. Thus, please include concrete explanation in the patch to
> justify when the complain occurs.

The justification is 'we can unexport functions that have no sane reason
of being exported in the first place'.

The Changelog pretty much says that.


Re: [PATCH 10/28] mm: only allow page table mappings for built-in zsmalloc

2020-04-08 Thread Peter Zijlstra
On Wed, Apr 08, 2020 at 08:01:00AM -0700, Randy Dunlap wrote:
> Hi,
> 
> On 4/8/20 4:59 AM, Christoph Hellwig wrote:
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 36949a9425b8..614cc786b519 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -702,7 +702,7 @@ config ZSMALLOC
> >  
> >  config ZSMALLOC_PGTABLE_MAPPING
> > bool "Use page table mapping to access object in zsmalloc"
> > -   depends on ZSMALLOC
> > +   depends on ZSMALLOC=y
> 
> It's a bool so this shouldn't matter... not needed.

My mm/Kconfig has:

config ZSMALLOC
tristate "Memory allocator for compressed pages"
depends on MMU

which I think means it can be modular, no?


Re: decruft the vmalloc API

2020-04-08 Thread Peter Zijlstra
On Wed, Apr 08, 2020 at 01:58:58PM +0200, Christoph Hellwig wrote:
> Hi all,
> 
> Peter noticed that with some dumb luck you can toast the kernel address
> space with exported vmalloc symbols.
> 
> I used this as an opportunity to decruft the vmalloc.c API and make it
> much more systematic.  This also removes any chance to create vmalloc
> mappings outside the designated areas or using executable permissions
> from modules.  Besides that it removes more than 300 lines of code.
> 

Looks great, thanks for doing this!

Acked-by: Peter Zijlstra (Intel) 


Re: [PATCH 17/28] mm: remove the prot argument from vm_map_ram

2020-04-08 Thread Peter Zijlstra
On Wed, Apr 08, 2020 at 01:59:15PM +0200, Christoph Hellwig wrote:
> This is always GFP_KERNEL - for long term mappings with other properties
> vmap should be used.

 PAGE_KERNEL != GFP_KERNEL :-)

> - return vm_map_ram(mock->pages, mock->npages, 0, PAGE_KERNEL);
> + return vm_map_ram(mock->pages, mock->npages, 0);


Re: [PATCH v3 1/1] ppc/crash: Reset spinlocks during crash

2020-04-01 Thread Peter Zijlstra
On Tue, Mar 31, 2020 at 09:00:21PM -0300, Leonardo Bras wrote:
> During a crash, there is chance that the cpus that handle the NMI IPI
> are holding a spin_lock. If this spin_lock is needed by crashing_cpu it
> will cause a deadlock. (rtas.lock and printk logbuf_lock as of today)
> 
> This is a problem if the system has kdump set up, given if it crashes
> for any reason kdump may not be saved for crash analysis.
> 
> After NMI IPI is sent to all other cpus, force unlock all spinlocks
> needed for finishing crash routine.
> 
> Signed-off-by: Leonardo Bras 

> @@ -129,6 +132,13 @@ static void crash_kexec_prepare_cpus(int cpu)
>   /* Would it be better to replace the trap vector here? */
>  
>   if (atomic_read(&cpus_in_crash) >= ncpus) {
> + /*
> +  * At this point no other CPU is running, and some of them may
> +  * have been interrupted while holding one of the locks needed
> +  * to complete crashing. Free them so there is no deadlock.
> +  */
> + arch_spin_unlock(&logbuf_lock.raw_lock);
> + arch_spin_unlock(&rtas.lock);
>   printk(KERN_EMERG "IPI complete\n");
>   return;
>   }

You might want to add a note to your asm/spinlock.h that you rely on
spin_unlock() unconditionally clearing a lock.

This isn't naturally true for all lock implementations. Consider ticket
locks, doing a surplus unlock will wreck your lock state in that case.
So anybody poking at the powerpc spinlock implementation had better know
you rely on this.
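
To make the ticket-lock point concrete, a minimal sketch (hypothetical lock, not the current powerpc implementation):

/* Hypothetical ticket lock, for illustration only. */
struct ticket_lock {
	unsigned short next;	/* next ticket to hand out */
	unsigned short owner;	/* ticket currently being served */
};

static void ticket_unlock(struct ticket_lock *lock)
{
	/*
	 * Unlock advances the owner ticket.  A surplus unlock (on a lock
	 * nobody holds) pushes 'owner' past 'next', after which every
	 * subsequent lock attempt spins forever on a ticket that is never
	 * served.  The simple test-and-set lock, by contrast, is happy to
	 * have 0 written into it at any time.
	 */
	smp_store_release(&lock->owner, lock->owner + 1);
}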


Re: [PATCH 1/1] ppc/crash: Skip spinlocks during crash

2020-03-30 Thread Peter Zijlstra
On Fri, Mar 27, 2020 at 07:50:20AM +0100, Christophe Leroy wrote:
> 
> 
> Le 26/03/2020 à 23:28, Leonardo Bras a écrit :
> > During a crash, there is chance that the cpus that handle the NMI IPI
> > are holding a spin_lock. If this spin_lock is needed by crashing_cpu it
> > will cause a deadlock. (rtas_lock and printk logbuf_log as of today)
> > 
> > This is a problem if the system has kdump set up, given if it crashes
> > for any reason kdump may not be saved for crash analysis.
> > 
> > Skip spinlocks after NMI IPI is sent to all other cpus.
> > 
> > Signed-off-by: Leonardo Bras 
> > ---
> >   arch/powerpc/include/asm/spinlock.h | 6 ++
> >   arch/powerpc/kexec/crash.c  | 3 +++
> >   2 files changed, 9 insertions(+)
> > 
> > diff --git a/arch/powerpc/include/asm/spinlock.h 
> > b/arch/powerpc/include/asm/spinlock.h
> > index 860228e917dc..a6381d110795 100644
> > --- a/arch/powerpc/include/asm/spinlock.h
> > +++ b/arch/powerpc/include/asm/spinlock.h
> > @@ -111,6 +111,8 @@ static inline void splpar_spin_yield(arch_spinlock_t 
> > *lock) {};
> >   static inline void splpar_rw_yield(arch_rwlock_t *lock) {};
> >   #endif
> > +extern bool crash_skip_spinlock __read_mostly;
> > +
> >   static inline bool is_shared_processor(void)
> >   {
> >   #ifdef CONFIG_PPC_SPLPAR
> > @@ -142,6 +144,8 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
> > if (likely(__arch_spin_trylock(lock) == 0))
> > break;
> > do {
> > +   if (unlikely(crash_skip_spinlock))
> > +   return;
> 
> You are adding a test that reads a global var in the middle of a so hot path
> ? That must kill performance. Can we do different ?

This; adding code to a super hot patch like this for an exceptional case
like the crash handling seems like a very very bad trade to me.

One possible solution is to simply write 0 to the affected spinlocks
after sending the NMI IPI thing, no?
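
That is, something along these lines once the NMI IPI has stopped everybody else (untested sketch; the lock instances are the ones named in the changelog, and their visibility from the crash code is assumed):

static void crash_reset_spinlocks(void)
{
	/*
	 * No other CPU is running anymore, so any lock they held will
	 * never be released by them; force the known offenders back to
	 * the unlocked state instead of burdening arch_spin_lock().
	 */
	rtas.lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;
	logbuf_lock.raw_lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;
}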


Re: [PATCH 18/15] kvm: Replace vcpu->swait with rcuwait

2020-03-22 Thread Peter Zijlstra
On Sun, Mar 22, 2020 at 09:33:17AM -0700, Davidlohr Bueso wrote:
> On Fri, 20 Mar 2020, Peter Zijlstra wrote:
> 
> > On Fri, Mar 20, 2020 at 01:55:26AM -0700, Davidlohr Bueso wrote:
> > > - swait_event_interruptible_exclusive(*wq, ((!vcpu->arch.power_off) &&
> > > -(!vcpu->arch.pause)));
> > > + rcuwait_wait_event(*wait,
> > > +(!vcpu->arch.power_off) && (!vcpu->arch.pause),
> > > +TASK_INTERRUPTIBLE);
> > 
> > > - for (;;) {
> > > - prepare_to_swait_exclusive(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
> > > -
> > > - if (kvm_vcpu_check_block(vcpu) < 0)
> > > - break;
> > > -
> > > - waited = true;
> > > - schedule();
> > > - }
> > > -
> > > - finish_swait(&vcpu->wq, &wait);
> > > + rcuwait_wait_event(&vcpu->wait,
> > > +(block_check = kvm_vcpu_check_block(vcpu)) < 0,
> > > +TASK_INTERRUPTIBLE);
> > 
> > Are these yet more instances that really want to be TASK_IDLE ?
> 
> Hmm probably as it makes sense for a blocked vcpu not to be contributing to
> the loadavg. So if this is the only reason to use interruptible, then yes we
> ought to change it.
> 
> However, I'll make this a separate patch, given this (ab)use isn't as obvious
> as the PS3 case, which is a kthread and therefore signals are masked.

The thing that was a dead give-away was that the return value of the
interruptible wait wasn't used.
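
For reference, the distinction being made, as a sketch with a made-up device structure:

struct dev_state {
	wait_queue_head_t	wq;
	struct rcuwait		wait;
	bool			cond;
};

/* TASK_INTERRUPTIBLE is woken by signals; the result must be acted upon. */
static int wait_for_cond_interruptible(struct dev_state *dev)
{
	if (wait_event_interruptible(dev->wq, dev->cond))
		return -ERESTARTSYS;	/* interrupted by a signal */
	return 0;
}

/*
 * If the caller never acts on signals anyway, TASK_IDLE is the right
 * state: uninterruptible, but excluded from loadavg.
 */
static void wait_for_cond(struct dev_state *dev)
{
	rcuwait_wait_event(&dev->wait, dev->cond, TASK_IDLE);
}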


Re: [PATCH 18/15] kvm: Replace vcpu->swait with rcuwait

2020-03-20 Thread Peter Zijlstra
On Fri, Mar 20, 2020 at 01:55:26AM -0700, Davidlohr Bueso wrote:
> - swait_event_interruptible_exclusive(*wq, ((!vcpu->arch.power_off) &&
> -(!vcpu->arch.pause)));
> + rcuwait_wait_event(*wait,
> +(!vcpu->arch.power_off) && (!vcpu->arch.pause),
> +TASK_INTERRUPTIBLE);

> - for (;;) {
> - prepare_to_swait_exclusive(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
> -
> - if (kvm_vcpu_check_block(vcpu) < 0)
> - break;
> -
> - waited = true;
> - schedule();
> - }
> -
> - finish_swait(&vcpu->wq, &wait);
> + rcuwait_wait_event(&vcpu->wait,
> +(block_check = kvm_vcpu_check_block(vcpu)) < 0,
> +TASK_INTERRUPTIBLE);

Are these yet more instances that really want to be TASK_IDLE ?



Re: [PATCH 17/15] rcuwait: Inform rcuwait_wake_up() users if a wakeup was attempted

2020-03-20 Thread Peter Zijlstra
On Fri, Mar 20, 2020 at 01:55:25AM -0700, Davidlohr Bueso wrote:

> diff --git a/kernel/exit.c b/kernel/exit.c
> index 6cc6cc485d07..b0bb0a8ec4b1 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -234,9 +234,10 @@ void release_task(struct task_struct *p)
>   goto repeat;
>  }
>  
> -void rcuwait_wake_up(struct rcuwait *w)
> +bool rcuwait_wake_up(struct rcuwait *w)
>  {
>   struct task_struct *task;
> + bool ret = false;
>  
>   rcu_read_lock();
>  
> @@ -254,10 +255,15 @@ void rcuwait_wake_up(struct rcuwait *w)
>   smp_mb(); /* (B) */
>  
>   task = rcu_dereference(w->task);
> - if (task)
> + if (task) {
>   wake_up_process(task);
> + ret = true;

ret = wake_up_process(task); ?

> + }
>   rcu_read_unlock();
> +
> + return ret;
>  }
> +EXPORT_SYMBOL_GPL(rcuwait_wake_up);
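
IOW, a sketch of the suggested variant:

bool rcuwait_wake_up(struct rcuwait *w)
{
	bool ret = false;
	struct task_struct *task;

	rcu_read_lock();
	/* pairs with the barriers in rcuwait_wait_event() */
	smp_mb(); /* (B) */

	task = rcu_dereference(w->task);
	if (task)
		ret = wake_up_process(task);	/* true iff we actually woke it */
	rcu_read_unlock();

	return ret;
}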


Re: [patch V2 07/15] powerpc/ps3: Convert half completion to rcuwait

2020-03-19 Thread Peter Zijlstra
On Thu, Mar 19, 2020 at 10:00:24AM +0100, Sebastian Andrzej Siewior wrote:
> On 2020-03-18 21:43:09 [+0100], Thomas Gleixner wrote:
> > --- a/arch/powerpc/platforms/ps3/device-init.c
> > +++ b/arch/powerpc/platforms/ps3/device-init.c
> > @@ -725,12 +728,12 @@ static int ps3_notification_read_write(s
> > unsigned long flags;
> > int res;
> >  
> > -   init_completion(&dev->done);
> > spin_lock_irqsave(&dev->lock, flags);
> > res = write ? lv1_storage_write(dev->sbd.dev_id, 0, 0, 1, 0, lpar,
> > &dev->tag)
> > : lv1_storage_read(dev->sbd.dev_id, 0, 0, 1, 0, lpar,
> > &dev->tag);
> > +   dev->done = false;
> > spin_unlock_irqrestore(&dev->lock, flags);
> > if (res) {
> > pr_err("%s:%u: %s failed %d\n", __func__, __LINE__, op, res);
> > @@ -738,14 +741,10 @@ static int ps3_notification_read_write(s
> > }
> > pr_debug("%s:%u: notification %s issued\n", __func__, __LINE__, op);
> >  
> > -   res = wait_event_interruptible(dev->done.wait,
> > -  dev->done.done || kthread_should_stop());
> > +   rcuwait_wait_event(&dev->wait, dev->done || kthread_should_stop(), TASK_IDLE);
> > +
> …
> 
> Not sure it matters but this struct `dev' is allocated on stack. Should
> the interrupt fire *before* rcuwait_wait_event() set wait.task to NULL
> then it is of random value on the first invocation of rcuwait_wake_up().
> ->
> 
> diff --git a/arch/powerpc/platforms/ps3/device-init.c 
> b/arch/powerpc/platforms/ps3/device-init.c
> index 197347c3c0b24..e87360a0fb40d 100644
> --- a/arch/powerpc/platforms/ps3/device-init.c
> +++ b/arch/powerpc/platforms/ps3/device-init.c
> @@ -809,6 +809,7 @@ static int ps3_probe_thread(void *data)
>   }
>  
>   spin_lock_init(&dev.lock);
> + rcuwait_init(&dev.wait);
>  
>   res = request_irq(irq, ps3_notification_interrupt, 0,
> "ps3_notification", &dev);
> 

Very good, sorry for that.


Re: [RFC 00/11] perf: Enhancing perf to export processor hazard information

2020-03-02 Thread Peter Zijlstra
On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
> Modern processors export such hazard data in Performance
> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
> AMD[3] provides similar information.
> 
> Implementation detail:
> 
> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
> If it's set, kernel converts arch specific hazard information
> into generic format:
> 
>   struct perf_pipeline_haz_data {
>  /* Instruction/Opcode type: Load, Store, Branch  */
>  __u8itype;
>  /* Instruction Cache source */
>  __u8icache;
>  /* Instruction suffered hazard in pipeline stage */
>  __u8hazard_stage;
>  /* Hazard reason */
>  __u8hazard_reason;
>  /* Instruction suffered stall in pipeline stage */
>  __u8stall_stage;
>  /* Stall reason */
>  __u8stall_reason;
>  __u16   pad;
>   };

Kim, does this format indeed work for AMD IBS?


Re: [RFC 02/11] perf/core: Data structure to present hazard data

2020-03-02 Thread Peter Zijlstra
On Mon, Mar 02, 2020 at 10:53:46AM +0530, Ravi Bangoria wrote:
> From: Madhavan Srinivasan 
> 
> Introduce new perf sample_type PERF_SAMPLE_PIPELINE_HAZ to request kernel
> to provide cpu pipeline hazard data. Also, introduce arch independent
> structure 'perf_pipeline_haz_data' to pass hazard data to userspace. This
> is generic structure and arch specific data needs to be converted to this
> format.
> 
> Signed-off-by: Madhavan Srinivasan 
> Signed-off-by: Ravi Bangoria 
> ---
>  include/linux/perf_event.h|  7 ++
>  include/uapi/linux/perf_event.h   | 32 ++-
>  kernel/events/core.c  |  6 +
>  tools/include/uapi/linux/perf_event.h | 32 ++-
>  4 files changed, 75 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 547773f5894e..d5b606e3c57d 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -1001,6 +1001,7 @@ struct perf_sample_data {
>   u64 stack_user_size;
>  
>   u64 phys_addr;
> + struct perf_pipeline_haz_data   pipeline_haz;
>  } cacheline_aligned;
>  
>  /* default value for data source */
> @@ -1021,6 +1022,12 @@ static inline void perf_sample_data_init(struct 
> perf_sample_data *data,
>   data->weight = 0;
>   data->data_src.val = PERF_MEM_NA;
>   data->txn = 0;
> + data->pipeline_haz.itype = PERF_HAZ__ITYPE_NA;
> + data->pipeline_haz.icache = PERF_HAZ__ICACHE_NA;
> + data->pipeline_haz.hazard_stage = PERF_HAZ__PIPE_STAGE_NA;
> + data->pipeline_haz.hazard_reason = PERF_HAZ__HREASON_NA;
> + data->pipeline_haz.stall_stage = PERF_HAZ__PIPE_STAGE_NA;
> + data->pipeline_haz.stall_reason = PERF_HAZ__SREASON_NA;
>  }

NAK, Don't touch anything outside of the first cacheline here.
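
The constraint being enforced: perf_sample_data_init() runs for every single sample, so the fields it writes unconditionally must all live in the first cacheline of the cacheline-aligned struct; everything else is only written when the corresponding sample_type bit asks for it. One way to keep that honest (sketch; 'txn' assumed to be the last unconditionally-initialized field):

	BUILD_BUG_ON(offsetofend(struct perf_sample_data, txn) >
		     L1_CACHE_BYTES);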


Re: [PATCH v4 2/9] perf/core: open access for CAP_SYS_PERFMON privileged process

2020-01-10 Thread Peter Zijlstra
On Thu, Jan 09, 2020 at 02:36:50PM +0300, Alexey Budankov wrote:
> On 08.01.2020 19:07, Peter Zijlstra wrote:
> > On Wed, Dec 18, 2019 at 12:25:35PM +0300, Alexey Budankov wrote:

> >> diff --git a/kernel/events/core.c b/kernel/events/core.c
> >> index 059ee7116008..d9db414f2197 100644
> >> --- a/kernel/events/core.c
> >> +++ b/kernel/events/core.c
> >> @@ -9056,7 +9056,7 @@ static int perf_kprobe_event_init(struct perf_event 
> >> *event)
> >>if (event->attr.type != perf_kprobe.type)
> >>return -ENOENT;
> >>  
> >> -  if (!capable(CAP_SYS_ADMIN))
> >> +  if (!perfmon_capable())
> >>return -EACCES;
> >>  
> >>/*
> > 
> > This one only allows attaching to already extant kprobes, right? It does
> > not allow creation of kprobes.
> 
> This unblocks creation of local trace kprobes and uprobes by CAP_SYS_PERFMON 
> privileged process, exactly the same as for CAP_SYS_ADMIN privileged process.

I've no idea what you just said; it's just words.

Again, this only allows attaching to previously created kprobes, it does
not allow creating kprobes, right?

That is; I don't think CAP_SYS_PERFMON should be allowed to create
kprobes.

As might be clear; I don't actually know what the user-ABI is for
creating kprobes.


Re: [PATCH v4 2/9] perf/core: open access for CAP_SYS_PERFMON privileged process

2020-01-08 Thread Peter Zijlstra
On Wed, Dec 18, 2019 at 12:25:35PM +0300, Alexey Budankov wrote:
> 
> Open access to perf_events monitoring for CAP_SYS_PERFMON privileged
> processes. For backward compatibility reasons access to perf_events
> subsystem remains open for CAP_SYS_ADMIN privileged processes but
> CAP_SYS_ADMIN usage for secure perf_events monitoring is discouraged
> with respect to CAP_SYS_PERFMON capability.
> 
> Signed-off-by: Alexey Budankov 
> ---
>  include/linux/perf_event.h | 6 +++---
>  kernel/events/core.c   | 6 +++---
>  2 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 34c7c6910026..f46acd69425f 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -1285,7 +1285,7 @@ static inline int perf_is_paranoid(void)
>  
>  static inline int perf_allow_kernel(struct perf_event_attr *attr)
>  {
> - if (sysctl_perf_event_paranoid > 1 && !capable(CAP_SYS_ADMIN))
> + if (sysctl_perf_event_paranoid > 1 && !perfmon_capable())
>   return -EACCES;
>  
>   return security_perf_event_open(attr, PERF_SECURITY_KERNEL);
> @@ -1293,7 +1293,7 @@ static inline int perf_allow_kernel(struct 
> perf_event_attr *attr)
>  
>  static inline int perf_allow_cpu(struct perf_event_attr *attr)
>  {
> - if (sysctl_perf_event_paranoid > 0 && !capable(CAP_SYS_ADMIN))
> + if (sysctl_perf_event_paranoid > 0 && !perfmon_capable())
>   return -EACCES;
>  
>   return security_perf_event_open(attr, PERF_SECURITY_CPU);
> @@ -1301,7 +1301,7 @@ static inline int perf_allow_cpu(struct perf_event_attr 
> *attr)
>  
>  static inline int perf_allow_tracepoint(struct perf_event_attr *attr)
>  {
> - if (sysctl_perf_event_paranoid > -1 && !capable(CAP_SYS_ADMIN))
> + if (sysctl_perf_event_paranoid > -1 && !perfmon_capable())
>   return -EPERM;
>  
>   return security_perf_event_open(attr, PERF_SECURITY_TRACEPOINT);

These are OK I suppose.

> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 059ee7116008..d9db414f2197 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -9056,7 +9056,7 @@ static int perf_kprobe_event_init(struct perf_event 
> *event)
>   if (event->attr.type != perf_kprobe.type)
>   return -ENOENT;
>  
> - if (!capable(CAP_SYS_ADMIN))
> + if (!perfmon_capable())
>   return -EACCES;
>  
>   /*

This one only allows attaching to already extant kprobes, right? It does
not allow creation of kprobes.

> @@ -9116,7 +9116,7 @@ static int perf_uprobe_event_init(struct perf_event 
> *event)
>   if (event->attr.type != perf_uprobe.type)
>   return -ENOENT;
>  
> - if (!capable(CAP_SYS_ADMIN))
> + if (!perfmon_capable())
>   return -EACCES;
>  
>   /*

Idem, I presume.

> @@ -11157,7 +11157,7 @@ SYSCALL_DEFINE5(perf_event_open,
>   }
>  
>   if (attr.namespaces) {
> - if (!capable(CAP_SYS_ADMIN))
> + if (!perfmon_capable())
>   return -EACCES;
>   }

And given we basically make the entire kernel observable with this CAP,
busting namespaces shouldn't be a problem either.

So yeah, I suppose that works.
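
For reference, the helper relied upon by the hunks above would be something like this (sketch; capability name as proposed in this series):

/* Sketch of the helper assumed above. */
static inline bool perfmon_capable(void)
{
	/* CAP_SYS_ADMIN keeps working for backward compatibility. */
	return capable(CAP_SYS_PERFMON) || capable(CAP_SYS_ADMIN);
}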


Re: [PATCH v2 2/3] mm/mmu_gather: Invalidate TLB correctly on batch allocation failure and flush

2019-12-18 Thread Peter Zijlstra
On Thu, Dec 19, 2019 at 12:13:48AM +1100, Michael Ellerman wrote:

> >> I'm a little confused though; if nohash is a software TLB fill, why do
> >> you need a TLBI for tables?
> >> 
> >
> > nohash (AKA book3e) has different mmu modes. I don't follow all the 
> > details w.r.t book3e. Paul or Michael might be able to explain the need 
> > for table flush with book3e.
> 
> Some of the Book3E CPUs have a partial hardware table walker. The IBM one (A2)
> did, before we ripped that support out. And the Freescale (NXP) e6500
> does, see eg:
> 
>   28efc35fe68d ("powerpc/e6500: TLB miss handler with hardware tablewalk 
> support")
> 
> They only support walking one level IIRC, ie. you can create a TLB entry
> that points to a PTE page, and the hardware will dereference that to get
> a PTE and load that into the TLB.

Shiny!, all the embedded goodness. Thanks for the info.


Re: [PATCH v2 3/3] asm-generic/tlb: Avoid potential double flush

2019-12-18 Thread Peter Zijlstra
On Wed, Dec 18, 2019 at 11:05:30AM +0530, Aneesh Kumar K.V wrote:

> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -402,7 +402,12 @@ tlb_update_vma_flags(struct mmu_gather *tlb, struct 
> vm_area_struct *vma) { }
>  
>  static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
>  {
> - if (!tlb->end)
> + /*
> +  * Anything calling __tlb_adjust_range() also sets at least one of
> +  * these bits.
> +  */
> + if (!(tlb->freed_tables || tlb->cleared_ptes || tlb->cleared_pmds ||
> +   tlb->cleared_puds || tlb->cleared_p4ds))
>   return;

FWIW I looked at the GCC generated assembly output for this (x86_64) and
it did a single load and mask as expected.


Re: [PATCH v2 2/3] mm/mmu_gather: Invalidate TLB correctly on batch allocation failure and flush

2019-12-18 Thread Peter Zijlstra
On Wed, Dec 18, 2019 at 11:05:29AM +0530, Aneesh Kumar K.V wrote:
> From: Peter Zijlstra 
> 
> Architectures for which we have hardware walkers of Linux page table should
> flush TLB on mmu gather batch allocation failures and batch flush. Some
> architectures like POWER supports multiple translation modes (hash and radix)

nohash, hash and radix in fact :-)

> and in the case of POWER only radix translation mode needs the above TLBI.

> This is because for hash translation mode kernel wants to avoid this extra
> flush since there are no hardware walkers of linux page table. With radix
> translation, the hardware also walks linux page table and with that, kernel
> needs to make sure to TLB invalidate page walk cache before page table pages 
> are
> freed.
> 
> More details in
> commit: d86564a2f085 ("mm/tlb, x86/mm: Support invalidating TLB caches for 
> RCU_TABLE_FREE")
> 
> Fixes: a46cc7a90fd8 ("powerpc/mm/radix: Improve TLB/PWC flushes")
> Signed-off-by: Peter Zijlstra (Intel) 
> Signed-off-by: Aneesh Kumar K.V 
> ---

> diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
> index b2c0be93929d..7f3a8b902325 100644
> --- a/arch/powerpc/include/asm/tlb.h
> +++ b/arch/powerpc/include/asm/tlb.h
> @@ -26,6 +26,17 @@
>  
>  #define tlb_flush tlb_flush
>  extern void tlb_flush(struct mmu_gather *tlb);
> +/*
> + * book3s:
> + * Hash does not use the linux page-tables, so we can avoid
> + * the TLB invalidate for page-table freeing, Radix otoh does use the
> + * page-tables and needs the TLBI.
> + *
> + * nohash:
> + * We still do TLB invalidate in the __pte_free_tlb routine before we
> + * add the page table pages to mmu gather table batch.

I'm a little confused though; if nohash is a software TLB fill, why do
you need a TLBI for tables?

> + */
> +#define tlb_needs_table_invalidate() radix_enabled()
>  
>  /* Get the generic bits... */
>  #include 


Re: [PATCH v2 1/3] powerpc/mmu_gather: Enable RCU_TABLE_FREE even for !SMP case

2019-12-18 Thread Peter Zijlstra


I'm going to assume you'll take these 3 patches through the powerpc tree
for convenience, a few small nits, but otherwise ACK on the lot.

On Wed, Dec 18, 2019 at 11:05:28AM +0530, Aneesh Kumar K.V wrote:
> A follow up patch is going to make sure we correctly invalidate page walk 
> cache
> before we free page table pages. In order to keep things simple enable
> RCU_TABLE_FREE even for !SMP so that we don't have to fixup the !SMP case
> differently in the followup patch

Perhaps more clearly state that !SMP is broken for Radix/PWC. The above
sorta implies it is, but it's not spelled out in any detail.

> Signed-off-by: Aneesh Kumar K.V 

Acked-by: Peter Zijlstra (Intel) 


Re: [RFC PATCH 1/2] mm/mmu_gather: Invalidate TLB correctly on batch allocation failure and flush

2019-12-18 Thread Peter Zijlstra
On Wed, Dec 18, 2019 at 10:52:53AM +0530, Aneesh Kumar K.V wrote:
> Upstream ppc64 is broken after the commit: a46cc7a90fd8
> ("powerpc/mm/radix: Improve TLB/PWC flushes").
> 
> Also the patches are not adding any extra TLBI on either radix or hash.
> 
> Considering we need to backport this to stable and other distributions,
> how about we do this early patches in your series before the Kconfig rename?
> This should enable stable to pick them up with less dependencies. 

OK I suppose. Will you send a new series?


[PATCH] asm-generic/tlb: Avoid potential double flush

2019-12-17 Thread Peter Zijlstra
On Tue, Dec 17, 2019 at 01:34:16PM +0100, Peter Zijlstra wrote:
> Perhaps if we replace !tlb->end with something like:
> 
>   !tlb->freed_tables && !tlb->cleared_p*
> 
> (which GCC should be able to do with a single load and mask)
> 
> I've not really thought too hard about it yet, I need to run some
> errands, but I'll look at it more closely when I get back.

AFAICT this should work.

---
Subject: asm-generic/tlb: Avoid potential double flush

Aneesh reported that:

tlb_flush_mmu()
  tlb_flush_mmu_tlbonly()
tlb_flush() <-- #1
  tlb_flush_mmu_free()
tlb_table_flush()
  tlb_table_invalidate()
tlb_flush_mmu_tlbonly()
  tlb_flush()   <-- #2

does two TLBIs when tlb->fullmm, because __tlb_reset_range() will not
clear tlb->end in that case.

Observe that any caller to __tlb_adjust_range() also sets at least one
of the tlb->freed_tables || tlb->cleared_p* bits, and those are
unconditionally cleared by __tlb_reset_range().

Change the condition for actually issuing TLBI to having one of those
bits set, as opposed to having tlb->end != 0.

Reported-by: "Aneesh Kumar K.V" 
Signed-off-by: Peter Zijlstra (Intel) 
---
 include/asm-generic/tlb.h | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index fe0ea6ff3636..c9a25c5a83e8 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -402,7 +402,12 @@ tlb_update_vma_flags(struct mmu_gather *tlb, struct 
vm_area_struct *vma) { }
 
 static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 {
-   if (!tlb->end)
+   /*
+* Anything calling __tlb_adjust_range() also sets at least one of
+* these bits.
+*/
+   if (!(tlb->freed_tables || tlb->cleared_ptes || tlb->cleared_pmds ||
+ tlb->cleared_puds || tlb->cleared_p4ds))
return;
 
tlb_flush(tlb);


Re: [RFC PATCH 1/2] mm/mmu_gather: Invalidate TLB correctly on batch allocation failure and flush

2019-12-17 Thread Peter Zijlstra
On Tue, Dec 17, 2019 at 04:18:40PM +0530, Aneesh Kumar K.V wrote:
> On 12/17/19 2:39 PM, Peter Zijlstra wrote:
> > On Tue, Dec 17, 2019 at 12:47:12PM +0530, Aneesh Kumar K.V wrote:
> > > Architectures for which we have hardware walkers of Linux page table 
> > > should
> > > flush TLB on mmu gather batch allocation failures and batch flush. Some
> > > architectures like POWER supports multiple translation modes (hash and 
> > > radix)
> > > and in the case of POWER only radix translation mode needs the above TLBI.
> > > This is because for hash translation mode kernel wants to avoid this extra
> > > flush since there are no hardware walkers of linux page table. With radix
> > > translation, the hardware also walks linux page table and with that, 
> > > kernel
> > > needs to make sure to TLB invalidate page walk cache before page table 
> > > pages are
> > > freed.
> > 
> > > Based on changes from Peter Zijlstra 
> > 
> > AFAICT it is all my patch ;-)
> 
> Yes. I moved the changes you had to upstream. I can update the From: in the
> next version if you are ok with that?

Well, since PPC isn't broken per finding the invalidate in
__p*_free_tlb(), let's do these things on top of the patches I proposed
here. Also, you might want to run benchmarks to see if the movement of
that TLBI actually helps (I'm thinking the cost of the PTESYNC might add
up).


Re: [RFC PATCH 2/2] mm/mmu_gather: Avoid multiple page walk cache flush

2019-12-17 Thread Peter Zijlstra
On Tue, Dec 17, 2019 at 03:45:36PM +0530, Aneesh Kumar K.V wrote:
> On 12/17/19 2:28 PM, Peter Zijlstra wrote:
> > On Tue, Dec 17, 2019 at 12:47:13PM +0530, Aneesh Kumar K.V wrote:
> > > On tlb_finish_mmu() kernel does a tlb flush before  mmu gather table 
> > > invalidate.
> > > The mmu gather table invalidate depending on kernel config also does 
> > > another
> > > TLBI. Avoid the later on tlb_finish_mmu().
> > 
> > That is already avoided, if you look at tlb_flush_mmu_tlbonly() it does
> > __tlb_range_reset(), which results in ->end = 0, which then triggers the
> > early exit on the next invocation:
> > 
> > if (!tlb->end)
> > return;
> > 
> 
> Is that true for tlb->fulmm flush?

Hmm, no, but I'm thinking you patch is broken, even for that case. We
must issue the TLBI before call_rcu().

Perhaps if we replace !tlb->end with something like:

  !tlb->freed_tables && !tlb->cleared_p*

(which GCC should be able to do with a single load and mask)

I've not really thought too hard about it yet, I need to run some
errands, but I'll look at it more closely when I get back.


Re: [RFC PATCH 1/2] mm/mmu_gather: Invalidate TLB correctly on batch allocation failure and flush

2019-12-17 Thread Peter Zijlstra
On Tue, Dec 17, 2019 at 12:47:12PM +0530, Aneesh Kumar K.V wrote:
> Architectures for which we have hardware walkers of Linux page table should
> flush TLB on mmu gather batch allocation failures and batch flush. Some
> architectures like POWER supports multiple translation modes (hash and radix)
> and in the case of POWER only radix translation mode needs the above TLBI.
> This is because for hash translation mode kernel wants to avoid this extra
> flush since there are no hardware walkers of linux page table. With radix
> translation, the hardware also walks linux page table and with that, kernel
> needs to make sure to TLB invalidate page walk cache before page table pages 
> are
> freed.

> Based on changes from Peter Zijlstra 

AFAICT it is all my patch ;-)

Anyway, this commit:

> More details in
> commit: d86564a2f085 ("mm/tlb, x86/mm: Support invalidating TLB caches for 
> RCU_TABLE_FREE")

states that you do an explicit invalidate in __p*_free_tlb(), which, if
I'm not mistaken is still there:

  arch/powerpc/include/asm/nohash/pgalloc.h:  tlb_flush_pgtable(tlb, 
address);

Or am I reading this wrong? I'm thinking you can remove that now.

> diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
> index b2c0be93929d..feea1a09bbce 100644
> --- a/arch/powerpc/include/asm/tlb.h
> +++ b/arch/powerpc/include/asm/tlb.h
> @@ -27,6 +27,10 @@
>  #define tlb_flush tlb_flush
>  extern void tlb_flush(struct mmu_gather *tlb);
>  
> +#ifdef CONFIG_HAVE_RCU_TABLE_FREE
/*
 * PPC-Hash does not use the linux page-tables, so we can avoid
 * the TLBI for page-table freeing, PPC-Radix otoh does use the
 * page-tables and needs the TLBI.
 */
> +#define tlb_needs_table_invalidate() radix_enabled()
> +#endif

Also, are you really sure about the !SMP case? Esp. on Radix I'm
thinking that the PWC (page-walk-cache) can give trouble even on UP,
when we get preempted in the middle of mmu_gather. Hmm?

>  /* Get the generic bits... */
>  #include 




Re: [RFC PATCH 2/2] mm/mmu_gather: Avoid multiple page walk cache flush

2019-12-17 Thread Peter Zijlstra
On Tue, Dec 17, 2019 at 12:47:13PM +0530, Aneesh Kumar K.V wrote:
> On tlb_finish_mmu() kernel does a tlb flush before  mmu gather table 
> invalidate.
> The mmu gather table invalidate depending on kernel config also does another
> TLBI. Avoid the later on tlb_finish_mmu().

That is already avoided, if you look at tlb_flush_mmu_tlbonly() it does
__tlb_range_reset(), which results in ->end = 0, which then triggers the
early exit on the next invocation:

if (!tlb->end)
return;


Re: READ_ONCE() + STACKPROTECTOR_STRONG == :/ (was Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.5-2 tag (topic/kasan-bitops))

2019-12-16 Thread Peter Zijlstra
On Mon, Dec 16, 2019 at 10:28:06AM +, Will Deacon wrote:
> However, enabling this for 32-bit ARM is total carnage; as Linus mentioned,
> a whole bunch of code appears to be relying on atomic 64-bit access of
> READ_ONCE(); the perf ring buffer, io_uring, the scheduler, pm_runtime,
> cpuidle, ... :(
> 
> Unfortunately, at least some of these *do* look like bugs, but I can't see
> how we can fix them, not least because the first two are user ABI afaict. It
> may also be that in practice we get 2x32-bit stores, and that works out fine
> when storing a 32-bit virtual address. I'm not sure what (if anything) the
> compiler guarantees in these cases.

Perf does indeed have a (known) problem here for the head/tail values.
Last time we looked at that nobody could really come up with a sane
solution that wouldn't break something.

I'll try and dig out that thread. Perhaps casting the value to 'unsigned
long' internally might work, I forgot the details.


Re: READ_ONCE() + STACKPROTECTOR_STRONG == :/ (was Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.5-2 tag (topic/kasan-bitops))

2019-12-13 Thread Peter Zijlstra
On Fri, Dec 13, 2019 at 11:47:06AM +0100, Luc Van Oostenryck wrote:
> On Thu, Dec 12, 2019 at 09:53:38PM +0100, Peter Zijlstra wrote:
> > Now, looking at the current GCC source:
> > 
> >   
> > https://github.com/gcc-mirror/gcc/blob/97d7270f894395e513667a031a0c309d1819d05e/gcc/c/c-parser.c#L3707
> > 
> > it seems that __typeof__() is supposed to strip all qualifiers from
> > _Atomic types. That lead me to try:
> > 
> > typeof(_Atomic typeof(p)) __p = (p);
> > 
> > But alas, I still get the same junk you got for ool_store_release() :/
> 
> I was checking this to see if Sparse was ready to support this.
> I was a bit surprised because at first sigth GCC was doing as
> it claims (typeof striping const & volatile on _Atomic types)
> but your exampe wasn't working. But it's working if an
> intermediate var is used:
>   _Atomic typeof(p) tmp;
>   typeof(tmp) __p = (p);
> or, uglier but probably more practical:
>   typeof(({_Atomic typeof(p) tmp; })) __p = (p);
> 
> Go figure!

Excellent! I had to change it to something like:

#define unqual_typeof(x)	typeof(({_Atomic typeof(x) ___x __maybe_unused; ___x; }))

but that does indeed work!

Now I suppose we should wrap that in a symbol that indicates our
compiler does indeed support _Atomic, otherwise things will come apart.

That is, my gcc-4.6 doesn't seem to have it, while gcc-4.8 does, which
is exactly the range that needs the daft READ_ONCE() construct, how
convenient :/
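
Roughly (sketch; the version cutoff comes from the observation above, GCC_VERSION from compiler-gcc.h is assumed):

#if GCC_VERSION >= 40800
/* Compiler groks _Atomic; typeof() strips all qualifiers from it. */
#define unqual_typeof(x)	typeof(({ _Atomic typeof(x) ___x __maybe_unused; ___x; }))
#else
/* Old compilers keep the existing (volatile-preserving) typeof(). */
#define unqual_typeof(x)	typeof(x)
#endif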

Something a little like this perhaps?

---

diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 7d9cc5ec4971..c389af602da8 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -75,9 +75,9 @@ static inline unsigned long array_index_mask_nospec(unsigned 
long idx,
 
 #define __smp_store_release(p, v)  \
 do {   \
-   typeof(p) __p = (p);\
-   union { typeof(*p) __val; char __c[1]; } __u =  \
-   { .__val = (__force typeof(*p)) (v) };  \
+   unqual_typeof(p) __p = (p); \
+   union { unqual_typeof(*p) __val; char __c[1]; } __u =   \
+   { .__val = (__force unqual_typeof(*p)) (v) };   \
compiletime_assert_atomic_type(*p); \
kasan_check_write(__p, sizeof(*p)); \
switch (sizeof(*p)) {   \
@@ -110,8 +110,8 @@ do {
\
 
 #define __smp_load_acquire(p)  \
 ({ \
-   union { typeof(*p) __val; char __c[1]; } __u;   \
-   typeof(p) __p = (p);\
+   union { unqual_typeof(*p) __val; char __c[1]; } __u;\
+   unqual_typeof(p) __p = (p); \
compiletime_assert_atomic_type(*p); \
kasan_check_read(__p, sizeof(*p));  \
switch (sizeof(*p)) {   \
@@ -141,8 +141,8 @@ do {
\
 
 #define smp_cond_load_relaxed(ptr, cond_expr)  \
 ({ \
-   typeof(ptr) __PTR = (ptr);  \
-   typeof(*ptr) VAL;   \
+   unqual_typeof(ptr) __PTR = (ptr);   \
+   unqual_typeof(*ptr) VAL;\
for (;;) {  \
VAL = READ_ONCE(*__PTR);\
if (cond_expr)  \
@@ -154,8 +154,8 @@ do {
\
 
 #define smp_cond_load_acquire(ptr, cond_expr)  \
 ({ \
-   typeof(ptr) __PTR = (ptr);  \
-   typeof(*ptr) VAL;   \
+   unqual_typeof(ptr) __PTR = (ptr);   \
+   unqual_typeof(*ptr) VAL;\
for (;;) {  \
VAL = smp_load_acquire(__PTR);  \
if (cond_expr)

Re: READ_ONCE() + STACKPROTECTOR_STRONG == :/ (was Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.5-2 tag (topic/kasan-bitops))

2019-12-12 Thread Peter Zijlstra
On Thu, Dec 12, 2019 at 09:21:57PM +0100, Peter Zijlstra wrote:
> On Thu, Dec 12, 2019 at 07:34:01PM +, Will Deacon wrote:
> > void ool_store_release(volatile unsigned long *ptr, unsigned long val)
> > {
> > smp_store_release(ptr, val);
> > }
> > 
> >  :
> >0:   a9be7bfdstp x29, x30, [sp, #-32]!
> >4:   9002adrpx2, 0 <__stack_chk_guard>
> >8:   9142add x2, x2, #0x0
> >c:   910003fdmov x29, sp
> >   10:   f9400043ldr x3, [x2]
> >   14:   f9000fa3str x3, [x29, #24]
> >   18:   d283mov x3, #0x0// #0
> >   1c:   c89ffc01stlrx1, [x0]
> >   20:   f9400fa1ldr x1, [x29, #24]
> >   24:   f9400040ldr x0, [x2]
> >   28:   ca20eor x0, x1, x0
> >   2c:   b560cbnzx0, 38 
> >   30:   a8c27bfdldp x29, x30, [sp], #32
> >   34:   d65f03c0ret
> >   38:   9400bl  0 <__stack_chk_fail>
> > 
> > It's a mess, and fixing READ_ONCE() doesn't help this case, which is why
> > I was looking at getting rid of volatile where it's not strictly needed.
> > I'm certainly open to other suggestions, I just haven't managed to think
> > of anything else.
> 
> We could move the kernel to C++ and write:
> 
>   std::remove_volatile<decltype(p)>::type __p = (p);
> 
> /me runs like hell...

Also, the GCC __auto_type thing strips _Atomic and const qualifiers but
for some obscure raisin forgets to strip volatile :/

  https://gcc.gnu.org/ml/gcc-patches/2013-11/msg01378.html

Now, looking at the current GCC source:

  
https://github.com/gcc-mirror/gcc/blob/97d7270f894395e513667a031a0c309d1819d05e/gcc/c/c-parser.c#L3707

it seems that __typeof__() is supposed to strip all qualifiers from
_Atomic types. That lead me to try:

typeof(_Atomic typeof(p)) __p = (p);

But alas, I still get the same junk you got for ool_store_release() :/

