Re: [patch V2 08/15] Documentation: Add lock ordering and nesting documentation

2020-03-20 Thread Paul E. McKenney
On Fri, Mar 20, 2020 at 11:36:03PM +0100, Thomas Gleixner wrote:
> "Paul E. McKenney"  writes:
> > On Fri, Mar 20, 2020 at 08:51:44PM +0100, Thomas Gleixner wrote:
> >> "Paul E. McKenney"  writes:
> >> >
> >> >  - The soft interrupt related suffix (_bh()) still disables softirq
> >> >handlers.  However, unlike non-PREEMPT_RT kernels (which disable
> >> >preemption to get this effect), PREEMPT_RT kernels use a per-CPU
> >> >lock to exclude softirq handlers.
> >> 
> >> I've made that:
> >> 
> >>   - The soft interrupt related suffix (_bh()) still disables softirq
> >> handlers.
> >> 
> >> Non-PREEMPT_RT kernels disable preemption to get this effect.
> >> 
> >> PREEMPT_RT kernels use a per-CPU lock for serialization. The lock
> >> disables softirq handlers and prevents reentrancy by a preempting
> >> task.
> >
> > That works!  At the end, I would instead say "prevents reentrancy
> > due to task preemption", but what you have works.
> 
> Yours is better.
> 
> >>- Task state is preserved across spinlock acquisition, ensuring that the
> >>  task-state rules apply to all kernel configurations.  Non-PREEMPT_RT
> >>  kernels leave task state untouched.  However, PREEMPT_RT must change
> >>  task state if the task blocks during acquisition.  Therefore, it
> >>  saves the current task state before blocking and the corresponding
> >>  lock wakeup restores it. A regular not lock related wakeup sets the
> >>  task state to RUNNING. If this happens while the task is blocked on
> >>  a spinlock then the saved task state is changed so that correct
> >>  state is restored on lock wakeup.
> >> 
> >> Hmm?
> >
> > I of course cannot resist editing the last two sentences:
> >
> >... Other types of wakeups unconditionally set task state to RUNNING.
> >If this happens while a task is blocked while acquiring a spinlock,
> >then the task state is restored to its pre-acquisition value at
> >lock-wakeup time.
> 
> Errm no. That would mean
> 
>  state = UNINTERRUPTIBLE
>  lock()
>block()
>  real_state = state
>  state = SLEEPONLOCK
> 
>non lock wakeup
>  state = RUNNING<--- FAIL #1
> 
>lock wakeup
>  state = real_state <--- FAIL #2
> 
> How it works is:
> 
>  state = UNINTERRUPTIBLE
>  lock()
>block()
>  real_state = state
>  state = SLEEPONLOCK
> 
>non lock wakeup
>  real_state = RUNNING
> 
>lock wakeup
>  state = real_state == RUNNING
> 
> If there is no 'non lock wakeup' before the lock wakeup:
> 
>  state = UNINTERRUPTIBLE
>  lock()
>block()
>  real_state = state
>  state = SLEEPONLOCK
> 
>lock wakeup
>  state = real_state == UNINTERRUPTIBLE
> 
> I agree that what I tried to express is hard to parse, but it's at least
> halfways correct :)

Apologies!  That is what I get for not looking it up in the source.  :-/

OK, so I am stupid enough not only to get it wrong, but also to try again:

   ... Other types of wakeups would normally unconditionally set the
   task state to RUNNING, but that does not work here because the task
   must remain blocked until the lock becomes available.  Therefore,
   when a non-lock wakeup attempts to awaken a task blocked waiting
   for a spinlock, it instead sets the saved state to RUNNING.  Then,
   when the lock acquisition completes, the lock wakeup sets the task
   state to the saved state, in this case setting it to RUNNING.

Is that better?
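
In C-like pseudo-code, the behaviour in the traces reads roughly as follows
(a sketch; the names follow the traces above rather than the real kernel
implementation):

static void block_on_rt_spinlock(struct task *t)
{
	t->real_state = t->state;	/* e.g. UNINTERRUPTIBLE */
	t->state = SLEEPONLOCK;		/* wait for the lock owner */
}

static void non_lock_wakeup(struct task *t)
{
	if (t->state == SLEEPONLOCK)
		t->real_state = RUNNING;	/* remember it, stay blocked on the lock */
	else
		t->state = RUNNING;
}

static void lock_wakeup(struct task *t)
{
	t->state = t->real_state;	/* RUNNING, or the original sleep state */
}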

Thanx, Paul


Re: [PATCH 2/2] KVM: PPC: Book3S HV: H_SVM_INIT_START must call UV_RETURN

2020-03-20 Thread Ram Pai
On Fri, Mar 20, 2020 at 11:26:43AM +0100, Laurent Dufour wrote:
> When the call to UV_REGISTER_MEM_SLOT fails, for instance because
> there is not enough free secured memory, the Hypervisor (HV) has to call
   secure memory,

> UV_RETURN to report the error to the Ultravisor (UV). Then the UV will call
> H_SVM_INIT_ABORT to abort the securing phase and go back to the calling VM.
> 
> If kvm->arch.secure_guest is not set, rfid is called in the return path,
> but there is no valid context to get back to the SVM since the Hcall has
> been routed by the Ultravisor.
> 
> Move the setting of kvm->arch.secure_guest earlier in
> kvmppc_h_svm_init_start() so that in the return path, UV_RETURN is called
> instead of rfid.
> 
> Cc: Bharata B Rao 
> Cc: Paul Mackerras 
> Cc: Benjamin Herrenschmidt 
> Cc: Michael Ellerman 
> Signed-off-by: Laurent Dufour 

Reviewed-by: Ram Pai 



Re: [PATCH 1/2] KVM: PPC: Book3S HV: check caller of H_SVM_* Hcalls

2020-03-20 Thread Ram Pai
On Fri, Mar 20, 2020 at 11:26:42AM +0100, Laurent Dufour wrote:
> The Hcalls named H_SVM_* are reserved to the Ultravisor. However, nothing
> prevents a malicious VM or SVM from calling them. This could lead to weird
> results and should be filtered out.
> 
> Checking the Secure bit of the calling MSR ensures that the call is coming
> from either the Ultravisor or an SVM. But any system call made from an SVM
> goes through the Ultravisor, and the Ultravisor should filter out these
> malicious calls. This way, only the Ultravisor is able to make such an
> Hcall.
> 
> Cc: Bharata B Rao 
> Cc: Paul Mackerras 
> Cc: Benjamin Herrenschmidt 
> Cc: Michael Ellerman 
> Signed-off-by: Laurent Dufour 

Reviewed-by: Ram Pai 

> ---
>  arch/powerpc/kvm/book3s_hv.c | 32 +---
>  1 file changed, 21 insertions(+), 11 deletions(-)
> 



Re: [PATCH v2 0/2] Don't generate thousands of new warnings when building docs

2020-03-20 Thread Jonathan Corbet
On Fri, 20 Mar 2020 16:11:01 +0100
Mauro Carvalho Chehab  wrote:

> This small series address a regression caused by a new patch at
> docs-next (and at linux-next).

I don't know how I missed that mess, sorry.  I plead distracting times or
something like that.  Heck, I think I'll blame everything on the plague
for at least the next few weeks.

Anyway, I've applied this, thanks for cleaning it up.

jon


Re: [patch V2 08/15] Documentation: Add lock ordering and nesting documentation

2020-03-20 Thread Thomas Gleixner
"Paul E. McKenney"  writes:
> On Fri, Mar 20, 2020 at 08:51:44PM +0100, Thomas Gleixner wrote:
>> "Paul E. McKenney"  writes:
>> >
>> >  - The soft interrupt related suffix (_bh()) still disables softirq
>> >handlers.  However, unlike non-PREEMPT_RT kernels (which disable
>> >preemption to get this effect), PREEMPT_RT kernels use a per-CPU
>> >lock to exclude softirq handlers.
>> 
>> I've made that:
>> 
>>   - The soft interrupt related suffix (_bh()) still disables softirq
>> handlers.
>> 
>> Non-PREEMPT_RT kernels disable preemption to get this effect.
>> 
>> PREEMPT_RT kernels use a per-CPU lock for serialization. The lock
>> disables softirq handlers and prevents reentrancy by a preempting
>> task.
>
> That works!  At the end, I would instead say "prevents reentrancy
> due to task preemption", but what you have works.

Yours is better.

>>- Task state is preserved across spinlock acquisition, ensuring that the
>>  task-state rules apply to all kernel configurations.  Non-PREEMPT_RT
>>  kernels leave task state untouched.  However, PREEMPT_RT must change
>>  task state if the task blocks during acquisition.  Therefore, it
>>  saves the current task state before blocking and the corresponding
>>  lock wakeup restores it. A regular not lock related wakeup sets the
>>  task state to RUNNING. If this happens while the task is blocked on
>>  a spinlock then the saved task state is changed so that correct
>>  state is restored on lock wakeup.
>> 
>> Hmm?
>
> I of course cannot resist editing the last two sentences:
>
>... Other types of wakeups unconditionally set task state to RUNNING.
>If this happens while a task is blocked while acquiring a spinlock,
>then the task state is restored to its pre-acquisition value at
>lock-wakeup time.

Errm no. That would mean

 state = UNINTERRUPTIBLE
 lock()
   block()
 real_state = state
 state = SLEEPONLOCK

   non lock wakeup
 state = RUNNING<--- FAIL #1

   lock wakeup
 state = real_state <--- FAIL #2

How it works is:

 state = UNINTERRUPTIBLE
 lock()
   block()
 real_state = state
 state = SLEEPONLOCK

   non lock wakeup
 real_state = RUNNING

   lock wakeup
 state = real_state == RUNNING

If there is no 'non lock wakeup' before the lock wakeup:

 state = UNINTERRUPTIBLE
 lock()
   block()
 real_state = state
 state = SLEEPONLOCK

   lock wakeup
 state = real_state == UNINTERRUPTIBLE

I agree that what I tried to express is hard to parse, but it's at least
halfways correct :)

Thanks,

tglx


[powerpc:merge] BUILD SUCCESS a87b93bdf800a4d7a42d95683624a4516e516b4f

2020-03-20 Thread kbuild test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git  
merge
branch HEAD: a87b93bdf800a4d7a42d95683624a4516e516b4f  Automatic merge of 
branches 'master', 'next' and 'fixes' into merge

elapsed time: 569m

configs tested: 157
configs skipped: 0

The following configs have been built successfully.
More configs may be tested in the coming days.

arm  allmodconfig
arm   allnoconfig
arm  allyesconfig
arm64allmodconfig
arm64 allnoconfig
arm64allyesconfig
arm at91_dt_defconfig
arm   efm32_defconfig
arm  exynos_defconfig
armmulti_v5_defconfig
armmulti_v7_defconfig
armshmobile_defconfig
arm   sunxi_defconfig
arm64   defconfig
sparcallyesconfig
pariscgeneric-32bit_defconfig
openrisc simple_smp_defconfig
sparc64  allmodconfig
powerpc   ppc64_defconfig
i386  allnoconfig
i386 alldefconfig
i386 allyesconfig
i386defconfig
ia64 alldefconfig
ia64 allmodconfig
ia64  allnoconfig
ia64 allyesconfig
ia64defconfig
c6x  allyesconfig
c6xevmc6678_defconfig
nios2 10m50_defconfig
nios2 3c120_defconfig
openriscor1ksim_defconfig
xtensa   common_defconfig
xtensa  iss_defconfig
alpha   defconfig
cskydefconfig
nds32 allnoconfig
nds32   defconfig
h8300 edosk2674_defconfig
h8300h8300h-sim_defconfig
h8300   h8s-sim_defconfig
m68k   m5475evb_defconfig
m68k   sun3_defconfig
m68k allmodconfig
m68k  multi_defconfig
arc defconfig
arc  allyesconfig
powerpc defconfig
powerpc  rhel-kconfig
microblaze  mmu_defconfig
microblazenommu_defconfig
powerpc   allnoconfig
mips  fuloong2e_defconfig
mips  malta_kvm_defconfig
mips allyesconfig
mips 64r6el_defconfig
mips  allnoconfig
mips   32r2_defconfig
mips allmodconfig
pariscallnoconfig
parisc   allyesconfig
pariscgeneric-64bit_defconfig
x86_64   randconfig-a001-20200320
x86_64   randconfig-a002-20200320
x86_64   randconfig-a003-20200320
i386 randconfig-a001-20200320
i386 randconfig-a002-20200320
i386 randconfig-a003-20200320
alpharandconfig-a001-20200320
mips randconfig-a001-20200320
nds32randconfig-a001-20200320
parisc   randconfig-a001-20200320
m68k randconfig-a001-20200320
c6x  randconfig-a001-20200320
h8300randconfig-a001-20200320
microblaze   randconfig-a001-20200320
nios2randconfig-a001-20200320
sparc64  randconfig-a001-20200320
csky randconfig-a001-20200320
openrisc randconfig-a001-20200320
s390 randconfig-a001-20200320
sh   randconfig-a001-20200320
xtensa   randconfig-a001-20200320
i386 randconfig-b003-20200320
i386 randconfig-b001-20200320
x86_64   randconfig-b003-20200320
i386 randconfig-b002-20200320
x86_64   randconfig-b002-20200320
x86_64   randconfig-b001-20200320
x86_64   randconfig-c003-20200320
i386 randconfig-c002-20200320
x86_64   randconfig-c001-20200320
i386 randconfig-c003-20200320
i386 randconfig-c001-20200320
x86_64   randconfig-c002-20200320
i386 randconfig-d003-20200320
i386 randconfig-d001-20200320
x86_64   randconfig-d002-20200320
i386 randconfig-d002-20200320
x86_64

[powerpc:next-test] BUILD SUCCESS b3258f133d72af0bb463e198610d47ae57498b00

2020-03-20 Thread kbuild test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git  
next-test
branch HEAD: b3258f133d72af0bb463e198610d47ae57498b00  powerpc/64: Prevent 
stack protection in early boot

elapsed time: 1090m

configs tested: 167
configs skipped: 17

The following configs have been built successfully.
More configs may be tested in the coming days.

arm  allmodconfig
arm   allnoconfig
arm  allyesconfig
arm64allmodconfig
arm64 allnoconfig
arm64allyesconfig
arm at91_dt_defconfig
arm   efm32_defconfig
arm  exynos_defconfig
armmulti_v5_defconfig
armmulti_v7_defconfig
armshmobile_defconfig
arm   sunxi_defconfig
arm64   defconfig
sparcallyesconfig
riscv   defconfig
i386defconfig
s390 allyesconfig
sh  rsk7269_defconfig
pariscgeneric-32bit_defconfig
openrisc simple_smp_defconfig
sparc64  allmodconfig
ia64 alldefconfig
powerpc   ppc64_defconfig
mips  malta_kvm_defconfig
i386  allnoconfig
i386 allyesconfig
i386 alldefconfig
ia64 allmodconfig
ia64defconfig
ia64  allnoconfig
ia64 allyesconfig
c6x  allyesconfig
c6xevmc6678_defconfig
nios2 10m50_defconfig
nios2 3c120_defconfig
openriscor1ksim_defconfig
xtensa   common_defconfig
xtensa  iss_defconfig
alpha   defconfig
cskydefconfig
nds32 allnoconfig
nds32   defconfig
h8300   h8s-sim_defconfig
h8300 edosk2674_defconfig
m68k   m5475evb_defconfig
m68k allmodconfig
h8300h8300h-sim_defconfig
m68k   sun3_defconfig
m68k  multi_defconfig
arc  allyesconfig
arc defconfig
microblaze  mmu_defconfig
microblazenommu_defconfig
powerpc   allnoconfig
powerpc defconfig
powerpc  rhel-kconfig
mips  fuloong2e_defconfig
mips allyesconfig
mips 64r6el_defconfig
mips  allnoconfig
mips   32r2_defconfig
mips allmodconfig
pariscallnoconfig
parisc   allyesconfig
pariscgeneric-64bit_defconfig
x86_64   randconfig-a001-20200320
x86_64   randconfig-a002-20200320
x86_64   randconfig-a003-20200320
i386 randconfig-a001-20200320
i386 randconfig-a002-20200320
i386 randconfig-a003-20200320
alpharandconfig-a001-20200320
m68k randconfig-a001-20200320
mips randconfig-a001-20200320
nds32randconfig-a001-20200320
parisc   randconfig-a001-20200320
c6x  randconfig-a001-20200320
h8300randconfig-a001-20200320
microblaze   randconfig-a001-20200320
nios2randconfig-a001-20200320
sparc64  randconfig-a001-20200320
csky randconfig-a001-20200320
openrisc randconfig-a001-20200320
s390 randconfig-a001-20200320
sh   randconfig-a001-20200320
xtensa   randconfig-a001-20200320
x86_64   randconfig-b001-20200320
x86_64   randconfig-b002-20200320
x86_64   randconfig-b003-20200320
i386 randconfig-b001-20200320
i386 randconfig-b002-20200320
i386 randconfig-b003-20200320
x86_64   randconfig-c001-20200320
x86_64   randconfig-c002-20200320
x86_64   randconfig-c003-20200320
i386 randconfig-c001-20200320
i386 randconfig-c002-20200320
i386 randconfig-c003-20200320
i386 randconfig-d003-20200320
i386 randconfig-d001

Re: [PATCH v12 8/8] MAINTAINERS: perf: Add pattern that matches ppc perf to the perf entry.

2020-03-20 Thread Joe Perches
(removed a bunch of cc's)

On Fri, 2020-03-20 at 18:31 +0200, Andy Shevchenko wrote:
> On Fri, Mar 20, 2020 at 07:42:03AM -0700, Joe Perches wrote:
> > On Fri, 2020-03-20 at 14:42 +0200, Andy Shevchenko wrote:
> > > On Fri, Mar 20, 2020 at 12:23:38PM +0100, Michal Suchánek wrote:
> > > > On Fri, Mar 20, 2020 at 12:33:50PM +0200, Andy Shevchenko wrote:
> > > > > On Fri, Mar 20, 2020 at 11:20:19AM +0100, Michal Suchanek wrote:
> > > > > > While at it also simplify the existing perf patterns.
> > > > > And still missed fixes from parse-maintainers.pl.
> > > > 
> > > > Oh, that script UX is truly ingenious.
> > > 
> > > You have at least two options, their combinations, etc:
> > >  - complain to the author :-)
> > >  - send a patch :-)
> > 
> > Recently:
> > 
> > https://lore.kernel.org/lkml/4d5291fa3fb4962b1fa55e8fd9ef421ef0c1b1e5.ca...@perches.com/
> 
> But why?
> 
> Shouldn't we rather run MAINTAINERS clean up once and require people to use
> parse-maintainers.pl for good?

That can basically only be done by Linus just before he releases
an RC1.

I am for it.  One day...




Re: [patch V2 08/15] Documentation: Add lock ordering and nesting documentation

2020-03-20 Thread Paul E. McKenney
On Fri, Mar 20, 2020 at 08:51:44PM +0100, Thomas Gleixner wrote:
> "Paul E. McKenney"  writes:
> >
> >  - The soft interrupt related suffix (_bh()) still disables softirq
> >handlers.  However, unlike non-PREEMPT_RT kernels (which disable
> >preemption to get this effect), PREEMPT_RT kernels use a per-CPU
> >lock to exclude softirq handlers.
> 
> I've made that:
> 
>   - The soft interrupt related suffix (_bh()) still disables softirq
> handlers.
> 
> Non-PREEMPT_RT kernels disable preemption to get this effect.
> 
> PREEMPT_RT kernels use a per-CPU lock for serialization. The lock
> disables softirq handlers and prevents reentrancy by a preempting
> task.

That works!  At the end, I would instead say "prevents reentrancy
due to task preemption", but what you have works.

> On non-RT this is implicit through preemption disable, but it's non
> obvious for RT as preemption stays enabled.
> 
> > PREEMPT_RT kernels preserve all other spinlock_t semantics:
> >
> >  - Tasks holding a spinlock_t do not migrate.  Non-PREEMPT_RT kernels
> >avoid migration by disabling preemption.  PREEMPT_RT kernels instead
> >disable migration, which ensures that pointers to per-CPU variables
> >remain valid even if the task is preempted.
> >
> >  - Task state is preserved across spinlock acquisition, ensuring that the
> >task-state rules apply to all kernel configurations.  In non-PREEMPT_RT
> >kernels leave task state untouched.  However, PREEMPT_RT must change
> >task state if the task blocks during acquisition.  Therefore, the
> >corresponding lock wakeup restores the task state.  Note that regular
> >(not lock related) wakeups do not restore task state.
> 
>- Task state is preserved across spinlock acquisition, ensuring that the
>  task-state rules apply to all kernel configurations.  Non-PREEMPT_RT
>  kernels leave task state untouched.  However, PREEMPT_RT must change
>  task state if the task blocks during acquisition.  Therefore, it
>  saves the current task state before blocking and the corresponding
>  lock wakeup restores it. A regular not lock related wakeup sets the
>  task state to RUNNING. If this happens while the task is blocked on
>  a spinlock then the saved task state is changed so that correct
>  state is restored on lock wakeup.
> 
> Hmm?

I of course cannot resist editing the last two sentences:

   ... Other types of wakeups unconditionally set task state to RUNNING.
   If this happens while a task is blocked while acquiring a spinlock,
   then the task state is restored to its pre-acquisition value at
   lock-wakeup time.

> > But this code fails on PREEMPT_RT kernels because the memory allocator
> > is fully preemptible and therefore cannot be invoked from truly atomic
> > contexts.  However, it is perfectly fine to invoke the memory allocator
> > while holding normal non-raw spinlocks because they do not disable
> > preemption::
> >
> >> +  spin_lock();
> >> +  p = kmalloc(sizeof(*p), GFP_ATOMIC);
> >> +
> >> +Most places which use GFP_ATOMIC allocations are safe on PREEMPT_RT as the
> >> +execution is forced into thread context and the lock substitution is
> >> +ensuring preemptibility.
> >
> > Interestingly enough, most uses of GFP_ATOMIC allocations are
> > actually safe on PREEMPT_RT because the lock substitution ensures
> > preemptibility.  Only those GFP_ATOMIC allocations that are invoked
> > while holding a raw spinlock or with preemption otherwise disabled need
> > adjustment to work correctly on PREEMPT_RT.
> >
> > [ I am not as confident of the above as I would like to be... ]
> 
> I'd leave that whole paragraph out. This documents the rules and from
> the above code examples it's pretty clear what works and what not :)

Works for me!  ;-)

> > And meeting time, will continue later!
> 
> Enjoy!

Not bad, actually, as meetings go.

Thanx, Paul


Re: [patch V2 08/15] Documentation: Add lock ordering and nesting documentation

2020-03-20 Thread Thomas Gleixner
"Paul E. McKenney"  writes:
>
>  - The soft interrupt related suffix (_bh()) still disables softirq
>handlers.  However, unlike non-PREEMPT_RT kernels (which disable
>preemption to get this effect), PREEMPT_RT kernels use a per-CPU
>lock to exclude softirq handlers.

I've made that:

  - The soft interrupt related suffix (_bh()) still disables softirq
handlers.

Non-PREEMPT_RT kernels disable preemption to get this effect.

PREEMPT_RT kernels use a per-CPU lock for serialization. The lock
disables softirq handlers and prevents reentrancy by a preempting
task.

On non-RT this is implicit through preemption disable, but it's non
obvious for RT as preemption stays enabled.
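
Sketched in C, the scheme looks roughly like this (all names are made up
for illustration; the actual implementation differs):

static DEFINE_PER_CPU(spinlock_t, softirq_serialize_lock) =
	__SPIN_LOCK_UNLOCKED(softirq_serialize_lock);

void rt_local_bh_disable(void)
{
	migrate_disable();	/* stay on this CPU while holding the lock */
	spin_lock(this_cpu_ptr(&softirq_serialize_lock)); /* sleeping lock; preemption stays on */
}

void rt_local_bh_enable(void)
{
	spin_unlock(this_cpu_ptr(&softirq_serialize_lock));
	migrate_enable();
}

/*
 * Softirq handlers take the same per-CPU lock before running, so they are
 * excluded while it is held, and a task which preempts the holder blocks
 * on the lock instead of reentering the protected section.
 */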

> PREEMPT_RT kernels preserve all other spinlock_t semantics:
>
>  - Tasks holding a spinlock_t do not migrate.  Non-PREEMPT_RT kernels
>avoid migration by disabling preemption.  PREEMPT_RT kernels instead
>disable migration, which ensures that pointers to per-CPU variables
>remain valid even if the task is preempted.
>
>  - Task state is preserved across spinlock acquisition, ensuring that the
>task-state rules apply to all kernel configurations.  In non-PREEMPT_RT
>kernels leave task state untouched.  However, PREEMPT_RT must change
>task state if the task blocks during acquisition.  Therefore, the
>corresponding lock wakeup restores the task state.  Note that regular
>(not lock related) wakeups do not restore task state.

   - Task state is preserved across spinlock acquisition, ensuring that the
 task-state rules apply to all kernel configurations.  Non-PREEMPT_RT
 kernels leave task state untouched.  However, PREEMPT_RT must change
 task state if the task blocks during acquisition.  Therefore, it
 saves the current task state before blocking and the corresponding
 lock wakeup restores it. A regular not lock related wakeup sets the
 task state to RUNNING. If this happens while the task is blocked on
 a spinlock then the saved task state is changed so that correct
 state is restored on lock wakeup.

Hmm?

> But this code fails on PREEMPT_RT kernels because the memory allocator
> is fully preemptible and therefore cannot be invoked from truly atomic
> contexts.  However, it is perfectly fine to invoke the memory allocator
> while holding normal non-raw spinlocks because they do not disable
> preemption::
>
>> +  spin_lock();
>> +  p = kmalloc(sizeof(*p), GFP_ATOMIC);
>> +
>> +Most places which use GFP_ATOMIC allocations are safe on PREEMPT_RT as the
>> +execution is forced into thread context and the lock substitution is
>> +ensuring preemptibility.
>
> Interestingly enough, most uses of GFP_ATOMIC allocations are
> actually safe on PREEMPT_RT because the lock substitution ensures
> preemptibility.  Only those GFP_ATOMIC allocations that are invoked
> while holding a raw spinlock or with preemption otherwise disabled need
> adjustment to work correctly on PREEMPT_RT.
>
> [ I am not as confident of the above as I would like to be... ]

I'd leave that whole paragraph out. This documents the rules and from
the above code examples it's pretty clear what works and what not :)
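
Spelled out, the contrast those examples draw is roughly the following
(a sketch; the lock names and struct foo are made up):

struct foo { int x; };

static DEFINE_SPINLOCK(normal_lock);
static DEFINE_RAW_SPINLOCK(atomic_lock);

static void fine_on_preempt_rt(void)
{
	struct foo *p;

	spin_lock(&normal_lock);		/* sleeping lock on PREEMPT_RT */
	p = kmalloc(sizeof(*p), GFP_ATOMIC);	/* OK: context stays preemptible */
	kfree(p);
	spin_unlock(&normal_lock);
}

static void broken_on_preempt_rt(void)
{
	struct foo *p;

	raw_spin_lock(&atomic_lock);		/* truly atomic, even on PREEMPT_RT */
	p = kmalloc(sizeof(*p), GFP_ATOMIC);	/* not OK on RT: the allocator may sleep */
	kfree(p);
	raw_spin_unlock(&atomic_lock);
}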

> And meeting time, will continue later!

Enjoy!

Thanks,

tglx


Re: [PATCH] powerpc/64: ftrace don't trace real mode

2020-03-20 Thread Naveen N. Rao

Hi Nick,

Nicholas Piggin wrote:

This warns and prevents tracing attempted in a real-mode context.


Is this something you're seeing often? Last time we looked at this, KVM 
was the biggest offender and we introduced paca->ftrace_enabled as a way 
to disable ftrace while in KVM code.


While this is cheap when handling ftrace_regs_caller() as done in this 
patch, for simple function tracing (see below), we will have to grab the 
MSR which will slow things down slightly.




Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/trace/ftrace.c|  3 +++
 .../powerpc/kernel/trace/ftrace_64_mprofile.S | 19 +++
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/trace/ftrace.c 
b/arch/powerpc/kernel/trace/ftrace.c
index 7ea0ca044b65..ef965815fcb9 100644
--- a/arch/powerpc/kernel/trace/ftrace.c
+++ b/arch/powerpc/kernel/trace/ftrace.c
@@ -949,6 +949,9 @@ unsigned long prepare_ftrace_return(unsigned long parent, 
unsigned long ip,
 {
unsigned long return_hooker;

+   if (WARN_ON_ONCE((mfmsr() & (MSR_IR|MSR_DR)) != (MSR_IR|MSR_DR)))
+   goto out;
+


This is called on function entry to redirect function return to a 
trampoline if needed. I am not sure if we have (or will have) too many C 
functions that disable MSR_IR|MSR_DR. Unless the number of such 
functions is large, it might be preferable to mark specific functions as 
notrace.
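
For reference, a minimal sketch of that alternative (the function name is
hypothetical):

/*
 * Hypothetical real-mode helper: marked notrace so the function tracer
 * never patches a call into it, instead of checking the MSR at trace time.
 */
static notrace void example_real_mode_helper(void)
{
	/* runs with MSR_IR/MSR_DR cleared; must not enter the tracer */
}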



if (unlikely(ftrace_graph_is_dead()))
goto out;

diff --git a/arch/powerpc/kernel/trace/ftrace_64_mprofile.S 
b/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
index f9fd5f743eba..6205f15cb603 100644
--- a/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
+++ b/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
@@ -51,16 +51,21 @@ _GLOBAL(ftrace_regs_caller)
SAVE_10GPRS(12, r1)
SAVE_10GPRS(22, r1)

-   /* Save previous stack pointer (r1) */
-   addir8, r1, SWITCH_FRAME_SIZE
-   std r8, GPR1(r1)
-
/* Load special regs for save below */
mfmsr   r8
mfctr   r9
mfxer   r10
mfcrr11

+   /* Shouldn't be called in real mode */
+   andi.   r3,r8,(MSR_IR|MSR_DR)
+   cmpdi   r3,(MSR_IR|MSR_DR)
+   bne ftrace_bad_realmode
+
+   /* Save previous stack pointer (r1) */
+   addir8, r1, SWITCH_FRAME_SIZE
+   std r8, GPR1(r1)
+


This stomps on the MSR value in r8, which is saved into pt_regs further 
below.


You'll also have to handle ftrace_caller() which is used for simple 
function tracing. We don't read the MSR there today, but that will be 
needed if we want to suppress tracing.



- Naveen



Re: [PATCH v5 6/7] ASoC: dt-bindings: fsl_easrc: Add document for EASRC

2020-03-20 Thread Rob Herring
On Mon, Mar 09, 2020 at 11:58:33AM +0800, Shengjiu Wang wrote:
> EASRC (Enhanced Asynchronous Sample Rate Converter) is a new
> IP module found on i.MX8MN.
> 
> Signed-off-by: Shengjiu Wang 
> ---
>  .../devicetree/bindings/sound/fsl,easrc.yaml  | 101 ++
>  1 file changed, 101 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/sound/fsl,easrc.yaml
> 
> diff --git a/Documentation/devicetree/bindings/sound/fsl,easrc.yaml 
> b/Documentation/devicetree/bindings/sound/fsl,easrc.yaml
> new file mode 100644
> index ..ff22f8056a63
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/sound/fsl,easrc.yaml
> @@ -0,0 +1,101 @@
> +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> +%YAML 1.2
> +---
> +$id: http://devicetree.org/schemas/sound/fsl,easrc.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> +
> +title: NXP Asynchronous Sample Rate Converter (ASRC) Controller
> +
> +maintainers:
> +  - Shengjiu Wang 
> +
> +properties:
> +  $nodename:
> +pattern: "^easrc@.*"
> +
> +  compatible:
> +const: fsl,imx8mn-easrc
> +
> +  reg:
> +maxItems: 1
> +
> +  interrupts:
> +maxItems: 1
> +
> +  clocks:
> +items:
> +  - description: Peripheral clock
> +
> +  clock-names:
> +items:
> +  - const: mem
> +
> +  dmas:
> +maxItems: 8
> +
> +  dma-names:
> +items:
> +  - const: ctx0_rx
> +  - const: ctx0_tx
> +  - const: ctx1_rx
> +  - const: ctx1_tx
> +  - const: ctx2_rx
> +  - const: ctx2_tx
> +  - const: ctx3_rx
> +  - const: ctx3_tx
> +
> +  fsl,easrc-ram-script-name:

'firmware-name' is the established property name for this.

> +allOf:
> +  - $ref: /schemas/types.yaml#/definitions/string
> +  - const: imx/easrc/easrc-imx8mn.bin

Though if there's only 1 possible value, why does this need to be in DT?

> +description: The coefficient table for the filters

If the firmware is only 1 thing, then perhaps this should just be a DT 
property rather than a separate file. It depends on who owns/creates 
this file. If fixed for the platform, then DT is a good fit. If updated 
separately from DT and boot firmware, then keeping it separate makes 
sense.

> +
> +  fsl,asrc-rate:
> +allOf:
> +  - $ref: /schemas/types.yaml#/definitions/uint32
> +  - minimum: 8000
> +  - maximum: 192000
> +description: Defines a mutual sample rate used by DPCM Back Ends
> +
> +  fsl,asrc-format:
> +allOf:
> +  - $ref: /schemas/types.yaml#/definitions/uint32
> +  - enum: [2, 6, 10, 32, 36]
> +default: 2
> +description:
> +  Defines a mutual sample format used by DPCM Back Ends
> +
> +required:
> +  - compatible
> +  - reg
> +  - interrupts
> +  - clocks
> +  - clock-names
> +  - dmas
> +  - dma-names
> +  - fsl,easrc-ram-script-name
> +  - fsl,asrc-rate
> +  - fsl,asrc-format
> +
> +examples:
> +  - |
> +#include 
> +
> +easrc: easrc@300C {
> +   compatible = "fsl,imx8mn-easrc";
> +   reg = <0x0 0x300C 0x0 0x1>;
> +   interrupts = <0x0 122 0x4>;
> +   clocks = < IMX8MN_CLK_ASRC_ROOT>;
> +   clock-names = "mem";
> +   dmas = < 16 23 0> , < 17 23 0>,
> +  < 18 23 0> , < 19 23 0>,
> +  < 20 23 0> , < 21 23 0>,
> +  < 22 23 0> , < 23 23 0>;
> +   dma-names = "ctx0_rx", "ctx0_tx",
> +   "ctx1_rx", "ctx1_tx",
> +   "ctx2_rx", "ctx2_tx",
> +   "ctx3_rx", "ctx3_tx";
> +   fsl,easrc-ram-script-name = "imx/easrc/easrc-imx8mn.bin";
> +   fsl,asrc-rate  = <8000>;
> +   fsl,asrc-format = <2>;
> +};
> -- 
> 2.21.0
> 


Re: [PATCH v5 1/7] ASoC: dt-bindings: fsl_asrc: Add new property fsl,asrc-format

2020-03-20 Thread Rob Herring
On Mon, Mar 09, 2020 at 02:19:44PM -0700, Nicolin Chen wrote:
> On Mon, Mar 09, 2020 at 11:58:28AM +0800, Shengjiu Wang wrote:
> > In order to support the new EASRC and simplify the code structure,
> > we decided to share the common structure between them. This brings
> > a problem: EASRC accepts the format directly from the devicetree, but
> > ASRC accepts the width from the devicetree.
> > 
> > In order to align with the new EASRC, we add a new property, fsl,asrc-format.
> > The fsl,asrc-format property can replace fsl,asrc-width, so the driver
> > can accept the format from the devicetree and does not need to convert
> > the width to a format.
> > 
> > Signed-off-by: Shengjiu Wang 
> > ---
> >  Documentation/devicetree/bindings/sound/fsl,asrc.txt | 5 +
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/Documentation/devicetree/bindings/sound/fsl,asrc.txt 
> > b/Documentation/devicetree/bindings/sound/fsl,asrc.txt
> > index cb9a25165503..780455cf7f71 100644
> > --- a/Documentation/devicetree/bindings/sound/fsl,asrc.txt
> > +++ b/Documentation/devicetree/bindings/sound/fsl,asrc.txt
> > @@ -51,6 +51,11 @@ Optional properties:
> >   will be in use as default. Otherwise, the big endian
> >   mode will be in use for all the device registers.
> >  
> > +   - fsl,asrc-format   : Defines a mutual sample format used by DPCM 
> > Back
> > + Ends, which can replace the fsl,asrc-width.
> > + The value is SNDRV_PCM_FORMAT_S16_LE, or
> > + SNDRV_PCM_FORMAT_S24_LE
> 
> I still have a concern about the DT binding of this format,
> as it uses values from an ASoC header file instead of a dt-bindings
> header file -- not sure if we can do this. Let's wait for Rob's
> comments.

I assume those are an ABI as well, so it's okay to copy them unless we
already have some format definitions for DT. But it does need to be copied
into a header under include/dt-bindings/.
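
Such a header might look roughly like this (the path and macro names are
hypothetical; the values mirror SNDRV_PCM_FORMAT_S16_LE and
SNDRV_PCM_FORMAT_S24_LE):

/* SPDX-License-Identifier: GPL-2.0 */
/* Hypothetical include/dt-bindings/sound/fsl-asrc.h */
#ifndef __DT_BINDINGS_SOUND_FSL_ASRC_H
#define __DT_BINDINGS_SOUND_FSL_ASRC_H

#define FSL_ASRC_FORMAT_S16_LE	2
#define FSL_ASRC_FORMAT_S24_LE	6

#endif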

Rob


Re: [PATCH v12 8/8] MAINTAINERS: perf: Add pattern that matches ppc perf to the perf entry.

2020-03-20 Thread Andy Shevchenko
On Fri, Mar 20, 2020 at 05:42:04PM +0100, Michal Suchánek wrote:
> On Fri, Mar 20, 2020 at 06:31:57PM +0200, Andy Shevchenko wrote:
> > On Fri, Mar 20, 2020 at 07:42:03AM -0700, Joe Perches wrote:
> > > On Fri, 2020-03-20 at 14:42 +0200, Andy Shevchenko wrote:
> > > > On Fri, Mar 20, 2020 at 12:23:38PM +0100, Michal Suchánek wrote:
> > > > > On Fri, Mar 20, 2020 at 12:33:50PM +0200, Andy Shevchenko wrote:
> > > > > > On Fri, Mar 20, 2020 at 11:20:19AM +0100, Michal Suchanek wrote:
> > > > > > > While at it also simplify the existing perf patterns.
> > > > > > And still missed fixes from parse-maintainers.pl.
> > > > > 
> > > > > Oh, that script UX is truly ingenious.
> > > > 
> > > > You have at least two options, their combinations, etc:
> > > >  - complain to the author :-)
> > > >  - send a patch :-)
> > > 
> > > Recently:
> > > 
> > > https://lore.kernel.org/lkml/4d5291fa3fb4962b1fa55e8fd9ef421ef0c1b1e5.ca...@perches.com/
> > 
> > But why?
> > 
> > Shouldn't we rather run MAINTAINERS clean up once and require people to use
> > parse-maintainers.pl for good?
> 
> That cleanup did not happen yet, and I am not volunteering for one.
> The difference between MAINTAINERS and MAINTAINERS.new is:
> 
>  MAINTAINERS | 5510 
> +--
>  1 file changed, 2755 insertions(+), 2755 deletions(-)

Yes, it was basically a reply to Joe.

-- 
With Best Regards,
Andy Shevchenko




Re: [PATCH v12 8/8] MAINTAINERS: perf: Add pattern that matches ppc perf to the perf entry.

2020-03-20 Thread Michal Suchánek
On Fri, Mar 20, 2020 at 06:31:57PM +0200, Andy Shevchenko wrote:
> On Fri, Mar 20, 2020 at 07:42:03AM -0700, Joe Perches wrote:
> > On Fri, 2020-03-20 at 14:42 +0200, Andy Shevchenko wrote:
> > > On Fri, Mar 20, 2020 at 12:23:38PM +0100, Michal Suchánek wrote:
> > > > On Fri, Mar 20, 2020 at 12:33:50PM +0200, Andy Shevchenko wrote:
> > > > > On Fri, Mar 20, 2020 at 11:20:19AM +0100, Michal Suchanek wrote:
> > > > > > While at it also simplify the existing perf patterns.
> > > > > And still missed fixes from parse-maintainers.pl.
> > > > 
> > > > Oh, that script UX is truly ingenious.
> > > 
> > > You have at least two options, their combinations, etc:
> > >  - complain to the author :-)
> > >  - send a patch :-)
> > 
> > Recently:
> > 
> > https://lore.kernel.org/lkml/4d5291fa3fb4962b1fa55e8fd9ef421ef0c1b1e5.ca...@perches.com/
> 
> But why?
> 
> Shouldn't we rather run MAINTAINERS clean up once and require people to use
> parse-maintainers.pl for good?

That cleanup did not happen yet, and I am not volunteering for one.
The difference between MAINTAINERS and MAINTAINERS.new is:

 MAINTAINERS | 5510 +--
 1 file changed, 2755 insertions(+), 2755 deletions(-)

Thanks

Michal


Re: [PATCH v12 8/8] MAINTAINERS: perf: Add pattern that matches ppc perf to the perf entry.

2020-03-20 Thread Andy Shevchenko
On Fri, Mar 20, 2020 at 07:42:03AM -0700, Joe Perches wrote:
> On Fri, 2020-03-20 at 14:42 +0200, Andy Shevchenko wrote:
> > On Fri, Mar 20, 2020 at 12:23:38PM +0100, Michal Suchánek wrote:
> > > On Fri, Mar 20, 2020 at 12:33:50PM +0200, Andy Shevchenko wrote:
> > > > On Fri, Mar 20, 2020 at 11:20:19AM +0100, Michal Suchanek wrote:
> > > > > While at it also simplify the existing perf patterns.
> > > > And still missed fixes from parse-maintainers.pl.
> > > 
> > > Oh, that script UX is truly ingenious.
> > 
> > You have at least two options, their combinations, etc:
> >  - complain to the author :-)
> >  - send a patch :-)
> 
> Recently:
> 
> https://lore.kernel.org/lkml/4d5291fa3fb4962b1fa55e8fd9ef421ef0c1b1e5.ca...@perches.com/

But why?

Shouldn't we rather run MAINTAINERS clean up once and require people to use
parse-maintainers.pl for good?

-- 
With Best Regards,
Andy Shevchenko




Re: [PATCH v12 8/8] MAINTAINERS: perf: Add pattern that matches ppc perf to the perf entry.

2020-03-20 Thread Michal Suchánek
On Fri, Mar 20, 2020 at 07:42:03AM -0700, Joe Perches wrote:
> On Fri, 2020-03-20 at 14:42 +0200, Andy Shevchenko wrote:
> > On Fri, Mar 20, 2020 at 12:23:38PM +0100, Michal Suchánek wrote:
> > > On Fri, Mar 20, 2020 at 12:33:50PM +0200, Andy Shevchenko wrote:
> > > > On Fri, Mar 20, 2020 at 11:20:19AM +0100, Michal Suchanek wrote:
> > > > > While at it also simplify the existing perf patterns.
> > > > And still missed fixes from parse-maintainers.pl.
> > > 
> > > Oh, that script UX is truly ingenious.
> > 
> > You have at least two options, their combinations, etc:
> >  - complain to the author :-)
> >  - send a patch :-)
> 
> Recently:
> 
> https://lore.kernel.org/lkml/4d5291fa3fb4962b1fa55e8fd9ef421ef0c1b1e5.ca...@perches.com/

Can we expect that the reordering is taken care of in that discussion then?

Thanks

Michal


Re: [patch V2 08/15] Documentation: Add lock ordering and nesting documentation

2020-03-20 Thread Paul E. McKenney
On Thu, Mar 19, 2020 at 07:02:17PM +0100, Thomas Gleixner wrote:
> Paul,
> 
> "Paul E. McKenney"  writes:
> 
> > On Wed, Mar 18, 2020 at 09:43:10PM +0100, Thomas Gleixner wrote:
> >
> > Mostly native-English-speaker services below, so please feel free to
> > ignore.  The one place I made a substantive change, I marked it "@@@".
> > I only did about half of this document, but should this prove useful,
> > I will do the other half later.
> 
> Native speaker services are always useful and appreciated.

Glad it is helpful.  ;-)

[ . . . ]

> >> +
> >> +raw_spinlock_t and spinlock_t
> >> +=
> >> +
> >> +raw_spinlock_t
> >> +--
> >> +
> >> +raw_spinlock_t is a strict spinning lock implementation regardless of the
> >> +kernel configuration including PREEMPT_RT enabled kernels.
> >> +
> >> +raw_spinlock_t is to be used only in real critical core code, low level
> >> +interrupt handling and places where protecting (hardware) state is 
> >> required
> >> +to be safe against preemption and eventually interrupts.
> >> +
> >> +Another reason to use raw_spinlock_t is when the critical section is tiny
> >> +to avoid the overhead of spinlock_t on a PREEMPT_RT enabled kernel in the
> >> +contended case.
> >
> > raw_spinlock_t is a strict spinning lock implementation in all kernels,
> > including PREEMPT_RT kernels.  Use raw_spinlock_t only in real critical
> > core code, low level interrupt handling and places where disabling
> > preemption or interrupts is required, for example, to safely access
> > hardware state.  raw_spinlock_t can sometimes also be used when the
> > critical section is tiny and the lock is lightly contended, thus avoiding
> > RT-mutex overhead.
> >
> > @@@  I added the point about the lock being lightly contended.
> 
> Hmm, not sure. The point is that if the critical section is small, the
> overhead of cross-CPU boosting along with the resulting IPIs is going to
> be at least an order of magnitude larger. And on contention this just
> pushes the raw_spinlock contention off to the raw_spinlock in the rt
> mutex plus the owning task's pi_lock, which makes things even worse.

Fair enough.  So, leaving that out:

raw_spinlock_t is a strict spinning lock implementation in all kernels,
including PREEMPT_RT kernels.  Use raw_spinlock_t only in real critical
core code, low level interrupt handling and places where disabling
preemption or interrupts is required, for example, to safely access
hardware state.  In addition, raw_spinlock_t can sometimes be used when
the critical section is tiny, thus avoiding RT-mutex overhead.
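
As an illustration (the lock and function names here are made up):

static DEFINE_RAW_SPINLOCK(hw_state_lock);	/* tiny, truly atomic section */
static DEFINE_SPINLOCK(stats_lock);		/* ordinary data, may sleep on RT */

static void update_hw_state(void)
{
	unsigned long flags;

	/* Really disables interrupts, even on PREEMPT_RT. */
	raw_spin_lock_irqsave(&hw_state_lock, flags);
	/* ... a handful of instructions touching hardware state ... */
	raw_spin_unlock_irqrestore(&hw_state_lock, flags);
}

static void update_stats(void)
{
	/* Substituted by a sleeping lock on PREEMPT_RT; preemption stays on. */
	spin_lock(&stats_lock);
	/* ... update ordinary kernel data ... */
	spin_unlock(&stats_lock);
}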

> >> + - The hard interrupt related suffixes for spin_lock / spin_unlock
> >> +   operations (_irq, _irqsave / _irqrestore) do not affect the CPUs
> 
> Si senor!

;-)

> >> +   interrupt disabled state
> >> +
> >> + - The soft interrupt related suffix (_bh()) is still disabling the
> >> +   execution of soft interrupts, but contrary to a non PREEMPT_RT enabled
> >> +   kernel, which utilizes the preemption count, this is achieved by a per
> >> +   CPU bottom half locking mechanism.
> >
> >  - The soft interrupt related suffix (_bh()) still disables softirq
> >handlers.  However, unlike non-PREEMPT_RT kernels (which disable
> >preemption to get this effect), PREEMPT_RT kernels use a per-CPU
> >per-bottom-half locking mechanism.
> 
> it's not per-bottom-half anymore. That turned out to be dangerous due to
> dependencies between BH types, e.g. network and timers.

Ah!  OK, how about this?

 - The soft interrupt related suffix (_bh()) still disables softirq
   handlers.  However, unlike non-PREEMPT_RT kernels (which disable
   preemption to get this effect), PREEMPT_RT kernels use a per-CPU
   lock to exclude softirq handlers.

> I hope I was able to encourage you to comment on the other half as well :)

OK, here goes...

> +All other semantics of spinlock_t are preserved:
> +
> + - Migration of tasks which hold a spinlock_t is prevented. On a non
> +   PREEMPT_RT enabled kernel this is implicit due to preemption disable.
> +   PREEMPT_RT has a separate mechanism to achieve this. This ensures that
> +   pointers to per CPU variables stay valid even if the task is preempted.
> +
> + - Task state preservation. The task state is not affected when a lock is
> +   contended and the task has to schedule out and wait for the lock to
> +   become available. The lock wake up restores the task state unless there
> +   was a regular (not lock related) wake up on the task. This ensures that
> +   the task state rules are always correct independent of the kernel
> +   configuration.

How about this?

PREEMPT_RT kernels preserve all other spinlock_t semantics:

 - Tasks holding a spinlock_t do not migrate.  Non-PREEMPT_RT kernels
   avoid migration by disabling preemption.  PREEMPT_RT kernels instead
   disable migration, which ensures that pointers to per-CPU variables
   remain valid even if the task is preempted.

 - Task state is preserved across 

Re: [PATCH] powerpc/pseries: avoid harmless preempt warning

2020-03-20 Thread Christophe Leroy




On 20/03/2020 at 16:24, Nicholas Piggin wrote:

Signed-off-by: Nicholas Piggin 
---
  arch/powerpc/platforms/pseries/lpar.c | 10 +-
  1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/lpar.c 
b/arch/powerpc/platforms/pseries/lpar.c
index 3c3da25b445c..e4ed5317f117 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -636,8 +636,16 @@ static const struct proc_ops 
vcpudispatch_stats_freq_proc_ops = {
  
  static int __init vcpudispatch_stats_procfs_init(void)

  {
-   if (!lppaca_shared_proc(get_lppaca()))
+   /*
+* Avoid smp_processor_id while preemptible. All CPUs should have
+* the same value for lppaca_shared_proc.
+*/
+   preempt_disable();
+   if (!lppaca_shared_proc(get_lppaca())) {
+   preempt_enable();
return 0;
+   }
+   preempt_enable();


Can we avoid the double preempt_enable() with something like:

preempt_disable();
is_shared = lppaca_shared_proc(get_lppaca());
preempt_enable();
if (!is_shared)
return 0;


  
  	if (!proc_create("powerpc/vcpudispatch_stats", 0600, NULL,

_stats_proc_ops))
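
Putting that together, the whole function might then look roughly like
this (a sketch, untested):

static int __init vcpudispatch_stats_procfs_init(void)
{
	bool is_shared;

	/*
	 * Avoid smp_processor_id() while preemptible; all CPUs are expected
	 * to report the same lppaca_shared_proc() value.
	 */
	preempt_disable();
	is_shared = lppaca_shared_proc(get_lppaca());
	preempt_enable();

	if (!is_shared)
		return 0;

	/* ... proc_create("powerpc/vcpudispatch_stats", ...) as in the
	 * existing code ... */
	return 0;
}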



Christophe


[PATCH] powerpc/64: allow rtas to be called in real-mode, use this in machine check

2020-03-20 Thread Nicholas Piggin
rtas_call allocates and uses memory in failure paths, which is
not safe for RMA. It also calls local_irq_save(), which may not be safe
in all real mode contexts.

In particular, machine check may run with interrupts not "reconciled",
and it may have hit while the CPU was in tracing code that should not be
re-entered.

Create a minimal rtas call that is usable by guest machine check
code, and use it there to call "ibm,nmi-interlock".

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/rtas.h  |  1 +
 arch/powerpc/kernel/entry_64.S   | 12 ++--
 arch/powerpc/kernel/rtas.c   | 43 
 arch/powerpc/platforms/pseries/ras.c |  2 +-
 4 files changed, 54 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index 3c1887351c71..4ffc499ce1ac 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -352,6 +352,7 @@ extern struct rtas_t rtas;
 extern int rtas_token(const char *service);
 extern int rtas_service_present(const char *service);
 extern int rtas_call(int token, int, int, int *, ...);
+extern int raw_rtas_call(int token, int, int, int *, ...);
 void rtas_call_unlocked(struct rtas_args *args, int token, int nargs,
int nret, ...);
 extern void __noreturn rtas_restart(char *cmd);
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 51c5b681f70c..309abb677788 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -759,6 +759,13 @@ _GLOBAL(enter_rtas)
li  r0,0
mtcrr0
 
+   /* enter_rtas called from real-mode may not have irqs reconciled
+* but will always have interrupts disabled.
+*/
+   mfmsr   r6
+   andi.   r7,r6,(MSR_IR|MSR_DR)
+   beq 2f
+
 #ifdef CONFIG_BUG
/* There is no way it is acceptable to get here with interrupts enabled,
 * check it with the asm equivalent of WARN_ON
@@ -769,10 +776,10 @@ _GLOBAL(enter_rtas)
 #endif
 
/* Hard-disable interrupts */
-   mfmsr   r6
rldicl  r7,r6,48,1
rotldi  r7,r7,16
mtmsrd  r7,1
+2:
 
/* Unfortunately, the stack pointer and the MSR are also clobbered,
 * so they are saved in the PACA which allows us to restore
@@ -795,7 +802,6 @@ _GLOBAL(enter_rtas)
ori r9,r9,MSR_IR|MSR_DR|MSR_FE0|MSR_FE1|MSR_FP|MSR_RI|MSR_LE
andcr6,r0,r9
 
-__enter_rtas:
sync/* disable interrupts so SRR0/1 */
mtmsrd  r0  /* don't get trashed */
 
@@ -837,7 +843,7 @@ rtas_return_loc:
mtspr   SPRN_SRR1,r4
RFI_TO_KERNEL
b   .   /* prevent speculative execution */
-_ASM_NOKPROBE_SYMBOL(__enter_rtas)
+_ASM_NOKPROBE_SYMBOL(enter_rtas)
 _ASM_NOKPROBE_SYMBOL(rtas_return_loc)
 
.align  3
diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index c5fa251b8950..a058dcfb6726 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -450,6 +450,8 @@ int rtas_call(int token, int nargs, int nret, int *outputs, 
...)
char *buff_copy = NULL;
int ret;
 
+   WARN_ON_ONCE((mfmsr() & (MSR_IR|MSR_DR)) != (MSR_IR|MSR_DR));
+
if (!rtas.entry || token == RTAS_UNKNOWN_SERVICE)
return -1;
 
@@ -483,6 +485,47 @@ int rtas_call(int token, int nargs, int nret, int 
*outputs, ...)
 }
 EXPORT_SYMBOL(rtas_call);
 
+/*
+ * Like rtas_call but no kmalloc or printk etc in error handling, so
+ * error won't go through log_error. No tracing, may be called in real mode.
+ */
+int notrace raw_rtas_call(int token, int nargs, int nret, int *outputs, ...)
+{
+   va_list list;
+   int i;
+   struct rtas_args *rtas_args;
+   int ret;
+
+   WARN_ON_ONCE((mfmsr() & MSR_EE));
+
+   if (!rtas.entry || token == RTAS_UNKNOWN_SERVICE)
+   return -1;
+
+   /*
+* Real mode must have MSR[EE]=0 and we prefer not to touch any
+* irq or preempt state (this may be called in machine check).
+*/
+   preempt_disable_notrace();
+   arch_spin_lock();
+
+   /* We use the global rtas args buffer */
+   rtas_args = 
+
+   va_start(list, outputs);
+   va_rtas_call_unlocked(rtas_args, token, nargs, nret, list);
+   va_end(list);
+
+   if (nret > 1 && outputs != NULL)
+   for (i = 0; i < nret-1; ++i)
+   outputs[i] = be32_to_cpu(rtas_args->rets[i+1]);
+   ret = (nret > 0)? be32_to_cpu(rtas_args->rets[0]): 0;
+
+   arch_spin_unlock();
+   preempt_enable_notrace();
+
+   return ret;
+}
+
 /* For RTAS_BUSY (-2), delay for 1 millisecond.  For an extended busy status
  * code of 990n, perform the hinted delay of 10^n (last digit) milliseconds.
  */
diff --git a/arch/powerpc/platforms/pseries/ras.c 
b/arch/powerpc/platforms/pseries/ras.c
index c74d5e740922..e87f86f02569 100644
--- 

[PATCH] powerpc/64: ftrace don't trace real mode

2020-03-20 Thread Nicholas Piggin
This warns and prevents tracing attempted in a real-mode context.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/trace/ftrace.c|  3 +++
 .../powerpc/kernel/trace/ftrace_64_mprofile.S | 19 +++
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/trace/ftrace.c 
b/arch/powerpc/kernel/trace/ftrace.c
index 7ea0ca044b65..ef965815fcb9 100644
--- a/arch/powerpc/kernel/trace/ftrace.c
+++ b/arch/powerpc/kernel/trace/ftrace.c
@@ -949,6 +949,9 @@ unsigned long prepare_ftrace_return(unsigned long parent, 
unsigned long ip,
 {
unsigned long return_hooker;
 
+   if (WARN_ON_ONCE((mfmsr() & (MSR_IR|MSR_DR)) != (MSR_IR|MSR_DR)))
+   goto out;
+
if (unlikely(ftrace_graph_is_dead()))
goto out;
 
diff --git a/arch/powerpc/kernel/trace/ftrace_64_mprofile.S 
b/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
index f9fd5f743eba..6205f15cb603 100644
--- a/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
+++ b/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
@@ -51,16 +51,21 @@ _GLOBAL(ftrace_regs_caller)
SAVE_10GPRS(12, r1)
SAVE_10GPRS(22, r1)
 
-   /* Save previous stack pointer (r1) */
-   addir8, r1, SWITCH_FRAME_SIZE
-   std r8, GPR1(r1)
-
/* Load special regs for save below */
mfmsr   r8
mfctr   r9
mfxer   r10
mfcrr11
 
+   /* Shouldn't be called in real mode */
+   andi.   r3,r8,(MSR_IR|MSR_DR)
+   cmpdi   r3,(MSR_IR|MSR_DR)
+   bne ftrace_bad_realmode
+
+   /* Save previous stack pointer (r1) */
+   addir8, r1, SWITCH_FRAME_SIZE
+   std r8, GPR1(r1)
+
/* Get the _mcount() call site out of LR */
mflrr7
/* Save it as pt_regs->nip */
@@ -141,6 +146,12 @@ _GLOBAL(ftrace_graph_stub)
 _GLOBAL(ftrace_stub)
blr
 
+ftrace_bad_realmode:
+   REST_4GPRS(8, r1)
+#ifdef CONFIG_BUG
+1: trap
+   EMIT_BUG_ENTRY 1b,__FILE__,__LINE__,(BUGFLAG_WARNING | BUGFLAG_ONCE)
+#endif
 ftrace_no_trace:
mflrr3
mtctr   r3
-- 
2.23.0



Re: [PATCH] arch/powerpc/64: Avoid isync in flush_dcache_range

2020-03-20 Thread Segher Boessenkool
On Fri, Mar 20, 2020 at 08:38:42PM +0530, Aneesh Kumar K.V wrote:
> On 3/20/20 8:35 PM, Segher Boessenkool wrote:
> >On Fri, Mar 20, 2020 at 04:02:42PM +0530, Aneesh Kumar K.V wrote:
> >>As per the ISA, isync is only needed on instruction cache
> >>block invalidate. Remove it from the dcache invalidate.
> >
> >Is that true on older CPUs?
> >
> 
> That is what I found by checking with hardware team.

Oh, the comment right before this function says "does not invalidate
the corresponding insncache blocks", so this looks fine, sorry for not
looking closely enough before.

> One thing I was not 
> able to get full confirmation about was the usage of 'sync' before 'dcbf'.

Yeah, this looks like something that would matter on some implementations.
Would it make anything measurably faster if you would remove that sync?


Segher


[PATCH] powerpc/pseries: avoid harmless preempt warning

2020-03-20 Thread Nicholas Piggin
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/platforms/pseries/lpar.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/lpar.c 
b/arch/powerpc/platforms/pseries/lpar.c
index 3c3da25b445c..e4ed5317f117 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -636,8 +636,16 @@ static const struct proc_ops 
vcpudispatch_stats_freq_proc_ops = {
 
 static int __init vcpudispatch_stats_procfs_init(void)
 {
-   if (!lppaca_shared_proc(get_lppaca()))
+   /*
+* Avoid smp_processor_id while preemptible. All CPUs should have
+* the same value for lppaca_shared_proc.
+*/
+   preempt_disable();
+   if (!lppaca_shared_proc(get_lppaca())) {
+   preempt_enable();
return 0;
+   }
+   preempt_enable();
 
if (!proc_create("powerpc/vcpudispatch_stats", 0600, NULL,
_stats_proc_ops))
-- 
2.23.0



[PATCH v2 1/2] docs: prevent warnings due to autosectionlabel

2020-03-20 Thread Mauro Carvalho Chehab
Changeset 58ad30cf91f0 ("docs: fix reference to core-api/namespaces.rst")
enabled a new feature in Sphinx: it now generates an index entry for each
document title, plus one for each chapter inside it.

There's a drawback, though: one document cannot have two sections
with the same name anymore.

A follow-up patch will change the logic of autosectionlabel to
avoid creating references for every single section title, but we
still need to be able to reference the chapters inside a document.

There are a few places where there are two chapters with the
same name. This patch renames one of the chapters, in order to
avoid a symbol conflict within the same document.

PS: as I don't speak Chinese, I had some help from a friend
(Wen Liu) with the Chinese translation of "publishing patches"
for this document:

Documentation/translations/zh_CN/process/5.Posting.rst

Fixes: 58ad30cf91f0 ("docs: fix reference to core-api/namespaces.rst")
Signed-off-by: Mauro Carvalho Chehab 
---
 Documentation/driver-api/80211/mac80211-advanced.rst  |  8 
 Documentation/driver-api/dmaengine/index.rst  |  4 ++--
 Documentation/filesystems/ecryptfs.rst| 11 +--
 Documentation/kernel-hacking/hacking.rst  |  4 ++--
 Documentation/media/kapi/v4l2-controls.rst|  8 
 Documentation/networking/snmp_counter.rst |  4 ++--
 Documentation/powerpc/ultravisor.rst  |  4 ++--
 Documentation/security/siphash.rst|  8 
 Documentation/target/tcmu-design.rst  |  6 +++---
 .../translations/zh_CN/process/5.Posting.rst  |  2 +-
 Documentation/x86/intel-iommu.rst |  3 ++-
 11 files changed, 31 insertions(+), 31 deletions(-)

diff --git a/Documentation/driver-api/80211/mac80211-advanced.rst 
b/Documentation/driver-api/80211/mac80211-advanced.rst
index 9f1c5bb7ac35..24cb64b3b715 100644
--- a/Documentation/driver-api/80211/mac80211-advanced.rst
+++ b/Documentation/driver-api/80211/mac80211-advanced.rst
@@ -272,8 +272,8 @@ STA information lifetime rules
 .. kernel-doc:: net/mac80211/sta_info.c
:doc: STA information lifetime rules
 
-Aggregation
-===
+Aggregation Functions
+=
 
 .. kernel-doc:: net/mac80211/sta_info.h
:functions: sta_ampdu_mlme
@@ -284,8 +284,8 @@ Aggregation
 .. kernel-doc:: net/mac80211/sta_info.h
:functions: tid_ampdu_rx
 
-Synchronisation
-===
+Synchronisation Functions
+=
 
 TBD
 
diff --git a/Documentation/driver-api/dmaengine/index.rst 
b/Documentation/driver-api/dmaengine/index.rst
index b9df904d0a79..bdc45d8b4cfb 100644
--- a/Documentation/driver-api/dmaengine/index.rst
+++ b/Documentation/driver-api/dmaengine/index.rst
@@ -5,8 +5,8 @@ DMAEngine documentation
 DMAEngine documentation provides documents for various aspects of DMAEngine
 framework.
 
-DMAEngine documentation

+DMAEngine development documentation
+---
 
 This book helps with DMAengine internal APIs and guide for DMAEngine device
 driver writers.
diff --git a/Documentation/filesystems/ecryptfs.rst 
b/Documentation/filesystems/ecryptfs.rst
index 7236172300ef..1f2edef4c57a 100644
--- a/Documentation/filesystems/ecryptfs.rst
+++ b/Documentation/filesystems/ecryptfs.rst
@@ -30,13 +30,12 @@ Userspace requirements include:
 - Libgcrypt
 
 
-Notes
-=
+.. note::
 
-In the beta/experimental releases of eCryptfs, when you upgrade
-eCryptfs, you should copy the files to an unencrypted location and
-then copy the files back into the new eCryptfs mount to migrate the
-files.
+   In the beta/experimental releases of eCryptfs, when you upgrade
+   eCryptfs, you should copy the files to an unencrypted location and
+   then copy the files back into the new eCryptfs mount to migrate the
+   files.
 
 
 Mount-wide Passphrase
diff --git a/Documentation/kernel-hacking/hacking.rst 
b/Documentation/kernel-hacking/hacking.rst
index d707a0a61cc9..eed2136d847f 100644
--- a/Documentation/kernel-hacking/hacking.rst
+++ b/Documentation/kernel-hacking/hacking.rst
@@ -601,7 +601,7 @@ Defined in ``include/linux/export.h``
 
 This is the variant of `EXPORT_SYMBOL()` that allows specifying a symbol
 namespace. Symbol Namespaces are documented in
-:ref:`Documentation/core-api/symbol-namespaces.rst `
+:doc:`../core-api/symbol-namespaces`
 
 :c:func:`EXPORT_SYMBOL_NS_GPL()`
 
@@ -610,7 +610,7 @@ Defined in ``include/linux/export.h``
 
 This is the variant of `EXPORT_SYMBOL_GPL()` that allows specifying a symbol
 namespace. Symbol Namespaces are documented in
-:ref:`Documentation/core-api/symbol-namespaces.rst `
+:doc:`../core-api/symbol-namespaces`
 
 Routines and Conventions
 
diff --git a/Documentation/media/kapi/v4l2-controls.rst 
b/Documentation/media/kapi/v4l2-controls.rst
index b20800cae3f2..5129019afb49 100644
--- 

[PATCH v2 0/2] Don't generate thousands of new warnings when building docs

2020-03-20 Thread Mauro Carvalho Chehab
This small series address a regression caused by a new patch at
docs-next (and at linux-next).

Before this patch, when a cross-reference to a chapter within the
documentation is needed, we had to add a markup like:

.. _foo:

foo
===

This behavior is now different after this patch:

58ad30cf91f0 ("docs: fix reference to core-api/namespaces.rst")

A Sphinx extension now creates such a reference automatically,
without requiring any extra markup.

That, however, comes with a price: it is no longer possible to have
two sections with the same name within the entire kernel docs!

This causes thousands of warnings, as we have sections named
"introduction" in lots of places.

This series solve this regression by doing two changes:

1) The references are now prefixed by the document name. So,
   a file named "bar" would have the "foo" reference as "bar:foo".

2) It will only use the first two levels. The first one is (usually) the
   name of the document, and the second one the chapter name.

This solves almost all the problems we have. Still, there are a few places
where we have two chapters with the same name in the same document.
The first patch addresses this problem.

The second patch limits the scope of autosectionlabel.

Mauro Carvalho Chehab (2):
  docs: prevent warnings due to autosectionlabel
  docs: conf.py: avoid thousands of duplicate label warning on Sphinx

 Documentation/conf.py |  4 
 Documentation/driver-api/80211/mac80211-advanced.rst  |  8 
 Documentation/driver-api/dmaengine/index.rst  |  4 ++--
 Documentation/filesystems/ecryptfs.rst| 11 +--
 Documentation/kernel-hacking/hacking.rst  |  4 ++--
 Documentation/media/kapi/v4l2-controls.rst|  8 
 Documentation/networking/snmp_counter.rst |  4 ++--
 Documentation/powerpc/ultravisor.rst  |  4 ++--
 Documentation/security/siphash.rst|  8 
 Documentation/target/tcmu-design.rst  |  6 +++---
 .../translations/zh_CN/process/5.Posting.rst  |  2 +-
 Documentation/x86/intel-iommu.rst |  3 ++-
 12 files changed, 35 insertions(+), 31 deletions(-)

-- 
2.24.1




Re: [PATCH] arch/powerpc/64: Avoid isync in flush_dcache_range

2020-03-20 Thread Aneesh Kumar K.V

On 3/20/20 8:35 PM, Segher Boessenkool wrote:

On Fri, Mar 20, 2020 at 04:02:42PM +0530, Aneesh Kumar K.V wrote:

As per the ISA, isync is only needed on instruction cache
block invalidate. Remove it from the dcache invalidate.


Is that true on older CPUs?



That is what I found by checking with the hardware team. One thing I was not
able to get full confirmation about was the usage of 'sync' before 'dcbf'.


-aneesh



Re: [PATCH] arch/powerpc/64: Avoid isync in flush_dcache_range

2020-03-20 Thread Segher Boessenkool
On Fri, Mar 20, 2020 at 04:02:42PM +0530, Aneesh Kumar K.V wrote:
> As per the ISA, isync is only needed on instruction cache
> block invalidate. Remove it from the dcache invalidate.

Is that true on older CPUs?


Segher


Re: [PATCH 1/2] dma-mapping: add a dma_ops_bypass flag to struct device

2020-03-20 Thread Greg Kroah-Hartman
On Fri, Mar 20, 2020 at 03:16:39PM +0100, Christoph Hellwig wrote:
> Several IOMMU drivers have a bypass mode where they can use a direct
> mapping if the devices DMA mask is large enough.  Add generic support
> to the core dma-mapping code to do that to switch those drivers to
> a common solution.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  include/linux/device.h  |  6 ++
>  include/linux/dma-mapping.h | 30 ++
>  kernel/dma/mapping.c| 36 +++-
>  3 files changed, 51 insertions(+), 21 deletions(-)

Reviewed-by: Greg Kroah-Hartman 


Re: [PATCH 1/2] KVM: PPC: Book3S HV: check caller of H_SVM_* Hcalls

2020-03-20 Thread Laurent Dufour

Le 20/03/2020 à 13:22, Greg Kurz a écrit :

On Fri, 20 Mar 2020 11:26:42 +0100
Laurent Dufour  wrote:


The hcalls named H_SVM_* are reserved to the Ultravisor. However, nothing
prevents a malicious VM or SVM from calling them. This could lead to weird
results and should be filtered out.

Checking the Secure bit of the calling MSR ensures that the call is coming
from either the Ultravisor or an SVM. But any system call made from an SVM
goes through the Ultravisor, and the Ultravisor should filter out such
malicious calls. This way, only the Ultravisor is able to make such an
hcall.


"Ultravisor should filter" ? And what if it doesn't (eg. because of a bug) ?


If it doesn't, a malicious SVM would be able to call UV-reserved hcalls like
H_SVM_INIT_ABORT, etc., which is not a good idea.




Shouldn't we also check the HV bit of the calling MSR as well to
disambiguate SVM and UV ?


That's another way to do it, but since the SVM hcalls go through the UV, the
UV seems like the right place to do the filtering.




Cc: Bharata B Rao 
Cc: Paul Mackerras 
Cc: Benjamin Herrenschmidt 
Cc: Michael Ellerman 
Signed-off-by: Laurent Dufour 
---
  arch/powerpc/kvm/book3s_hv.c | 32 +---
  1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 33be4d93248a..43773182a737 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1074,25 +1074,35 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
 kvmppc_get_gpr(vcpu, 6));
break;
case H_SVM_PAGE_IN:
-   ret = kvmppc_h_svm_page_in(vcpu->kvm,
-  kvmppc_get_gpr(vcpu, 4),
-  kvmppc_get_gpr(vcpu, 5),
-  kvmppc_get_gpr(vcpu, 6));
+   ret = H_UNSUPPORTED;
+   if (kvmppc_get_srr1(vcpu) & MSR_S)
+   ret = kvmppc_h_svm_page_in(vcpu->kvm,
+  kvmppc_get_gpr(vcpu, 4),
+  kvmppc_get_gpr(vcpu, 5),
+  kvmppc_get_gpr(vcpu, 6));


If calling kvmppc_h_svm_page_in() produces a "weird result" when
the MSR_S bit isn't set, then I think it should do the checking
itself, ie. pass vcpu.

This would also prevent adding that many lines in kvmppc_pseries_do_hcall()
which is a big enough function already. The checking could be done in a
helper in book3s_hv_uvmem.c and used by all UV specific hcalls.


I'm not convinced that would be better, and I followed the way checks for
other hcalls have been made (see H_TLB_INVALIDATE, ...).


I agree kvmppc_pseries_do_hcall() is long, but it is just a big switch(),
quite linear.





break;
case H_SVM_PAGE_OUT:
-   ret = kvmppc_h_svm_page_out(vcpu->kvm,
-   kvmppc_get_gpr(vcpu, 4),
-   kvmppc_get_gpr(vcpu, 5),
-   kvmppc_get_gpr(vcpu, 6));
+   ret = H_UNSUPPORTED;
+   if (kvmppc_get_srr1(vcpu) & MSR_S)
+   ret = kvmppc_h_svm_page_out(vcpu->kvm,
+   kvmppc_get_gpr(vcpu, 4),
+   kvmppc_get_gpr(vcpu, 5),
+   kvmppc_get_gpr(vcpu, 6));
break;
case H_SVM_INIT_START:
-   ret = kvmppc_h_svm_init_start(vcpu->kvm);
+   ret = H_UNSUPPORTED;
+   if (kvmppc_get_srr1(vcpu) & MSR_S)
+   ret = kvmppc_h_svm_init_start(vcpu->kvm);
break;
case H_SVM_INIT_DONE:
-   ret = kvmppc_h_svm_init_done(vcpu->kvm);
+   ret = H_UNSUPPORTED;
+   if (kvmppc_get_srr1(vcpu) & MSR_S)
+   ret = kvmppc_h_svm_init_done(vcpu->kvm);
break;
case H_SVM_INIT_ABORT:
-   ret = kvmppc_h_svm_init_abort(vcpu->kvm);
+   ret = H_UNSUPPORTED;
+   if (kvmppc_get_srr1(vcpu) & MSR_S)
+   ret = kvmppc_h_svm_init_abort(vcpu->kvm);
break;
  
  	default:






Re: [PATCH v12 8/8] MAINTAINERS: perf: Add pattern that matches ppc perf to the perf entry.

2020-03-20 Thread Joe Perches
On Fri, 2020-03-20 at 14:42 +0200, Andy Shevchenko wrote:
> On Fri, Mar 20, 2020 at 12:23:38PM +0100, Michal Suchánek wrote:
> > On Fri, Mar 20, 2020 at 12:33:50PM +0200, Andy Shevchenko wrote:
> > > On Fri, Mar 20, 2020 at 11:20:19AM +0100, Michal Suchanek wrote:
> > > > While at it also simplify the existing perf patterns.
> > > And still missed fixes from parse-maintainers.pl.
> > 
> > Oh, that script UX is truly ingenious.
> 
> You have at least two options, their combinations, etc:
>  - complain to the author :-)
>  - send a patch :-)

Recently:

https://lore.kernel.org/lkml/4d5291fa3fb4962b1fa55e8fd9ef421ef0c1b1e5.ca...@perches.com/




Re: [PATCH 2/2] KVM: PPC: Book3S HV: H_SVM_INIT_START must call UV_RETURN

2020-03-20 Thread Laurent Dufour

Le 20/03/2020 à 12:24, Bharata B Rao a écrit :

On Fri, Mar 20, 2020 at 11:26:43AM +0100, Laurent Dufour wrote:

When the call to UV_REGISTER_MEM_SLOT fails, for instance because there is
not enough free secure memory, the Hypervisor (HV) has to call UV_RETURN to
report the error to the Ultravisor (UV). The UV will then call
H_SVM_INIT_ABORT to abort the securing phase and go back to the calling VM.

If kvm->arch.secure_guest is not set, rfid is called in the return path, but
there is no valid context to get back to the SVM since the hcall has been
routed by the Ultravisor.

Move the setting of kvm->arch.secure_guest earlier in
kvmppc_h_svm_init_start() so that, in the return path, UV_RETURN is called
instead of rfid.

Cc: Bharata B Rao 
Cc: Paul Mackerras 
Cc: Benjamin Herrenschmidt 
Cc: Michael Ellerman 
Signed-off-by: Laurent Dufour 
---
  arch/powerpc/kvm/book3s_hv_uvmem.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 79b1202b1c62..68dff151315c 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -209,6 +209,8 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
int ret = H_SUCCESS;
int srcu_idx;
  
+	kvm->arch.secure_guest = KVMPPC_SECURE_INIT_START;

+
if (!kvmppc_uvmem_bitmap)
return H_UNSUPPORTED;
  
@@ -233,7 +235,6 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)

goto out;
}
}
-   kvm->arch.secure_guest |= KVMPPC_SECURE_INIT_START;


There is an assumption that memory slots would have been registered with UV
if KVMPPC_SECURE_INIT_START has been done. KVM_PPC_SVM_OFF ioctl will skip
unregistration and other steps during reboot if KVMPPC_SECURE_INIT_START
hasn't been done.

Have you checked if that path isn't affected by this change?


I checked that and didn't find any issue there.

My only concern was that block:
	kvm_for_each_vcpu(i, vcpu, kvm) {
		spin_lock(&vcpu->arch.vpa_update_lock);
		unpin_vpa_reset(kvm, &vcpu->arch.dtl);
		unpin_vpa_reset(kvm, &vcpu->arch.slb_shadow);
		unpin_vpa_reset(kvm, &vcpu->arch.vpa);
		spin_unlock(&vcpu->arch.vpa_update_lock);
	}

But that seems to be safe.

However, I'm not that familiar with the KVM code; do you think an additional
KVMPPC_SECURE_INIT_* value is needed here?


Thanks,
Laurent.






[PATCH 2/2] powerpc: use the generic dma_ops_bypass mode

2020-03-20 Thread Christoph Hellwig
Use the DMA API bypass mechanism for direct window mappings.  This uses
common code and speeds up the direct mapping case by avoiding indirect
calls when not using dma ops at all.  It also fixes a problem where
the sync_* methods were using the bypass check for DMA allocations, but
those are part of the streaming ops.

Note that this patch loses the DMA_ATTR_WEAK_ORDERING override, which
has never been well defined and is only used by a few drivers, which
IIRC never showed up in the typical Cell blade setups that are affected
by the ordering workaround.

Fixes: efd176a04bef ("powerpc/pseries/dma: Allow SWIOTLB")
Signed-off-by: Christoph Hellwig 
---
 arch/powerpc/include/asm/device.h |  5 --
 arch/powerpc/kernel/dma-iommu.c   | 90 ---
 2 files changed, 9 insertions(+), 86 deletions(-)

diff --git a/arch/powerpc/include/asm/device.h 
b/arch/powerpc/include/asm/device.h
index 266542769e4b..452402215e12 100644
--- a/arch/powerpc/include/asm/device.h
+++ b/arch/powerpc/include/asm/device.h
@@ -18,11 +18,6 @@ struct iommu_table;
  * drivers/macintosh/macio_asic.c
  */
 struct dev_archdata {
-   /*
-* Set to %true if the dma_iommu_ops are requested to use a direct
-* window instead of dynamically mapping memory.
-*/
-   booliommu_bypass : 1;
/*
 * These two used to be a union. However, with the hybrid ops we need
 * both so here we store both a DMA offset for direct mappings and
diff --git a/arch/powerpc/kernel/dma-iommu.c b/arch/powerpc/kernel/dma-iommu.c
index e486d1d78de2..569fecd7b5b2 100644
--- a/arch/powerpc/kernel/dma-iommu.c
+++ b/arch/powerpc/kernel/dma-iommu.c
@@ -14,23 +14,6 @@
  * Generic iommu implementation
  */
 
-/*
- * The coherent mask may be smaller than the real mask, check if we can
- * really use a direct window.
- */
-static inline bool dma_iommu_alloc_bypass(struct device *dev)
-{
-   return dev->archdata.iommu_bypass && !iommu_fixed_is_weak &&
-   dma_direct_supported(dev, dev->coherent_dma_mask);
-}
-
-static inline bool dma_iommu_map_bypass(struct device *dev,
-   unsigned long attrs)
-{
-   return dev->archdata.iommu_bypass &&
-   (!iommu_fixed_is_weak || (attrs & DMA_ATTR_WEAK_ORDERING));
-}
-
 /* Allocates a contiguous real buffer and creates mappings over it.
  * Returns the virtual address of the buffer and sets dma_handle
  * to the dma address (mapping) of the first page.
@@ -39,8 +22,6 @@ static void *dma_iommu_alloc_coherent(struct device *dev, 
size_t size,
  dma_addr_t *dma_handle, gfp_t flag,
  unsigned long attrs)
 {
-   if (dma_iommu_alloc_bypass(dev))
-   return dma_direct_alloc(dev, size, dma_handle, flag, attrs);
return iommu_alloc_coherent(dev, get_iommu_table_base(dev), size,
dma_handle, dev->coherent_dma_mask, flag,
dev_to_node(dev));
@@ -50,11 +31,7 @@ static void dma_iommu_free_coherent(struct device *dev, 
size_t size,
void *vaddr, dma_addr_t dma_handle,
unsigned long attrs)
 {
-   if (dma_iommu_alloc_bypass(dev))
-   dma_direct_free(dev, size, vaddr, dma_handle, attrs);
-   else
-   iommu_free_coherent(get_iommu_table_base(dev), size, vaddr,
-   dma_handle);
+   iommu_free_coherent(get_iommu_table_base(dev), size, vaddr, dma_handle);
 }
 
 /* Creates TCEs for a user provided buffer.  The user buffer must be
@@ -67,9 +44,6 @@ static dma_addr_t dma_iommu_map_page(struct device *dev, 
struct page *page,
 enum dma_data_direction direction,
 unsigned long attrs)
 {
-   if (dma_iommu_map_bypass(dev, attrs))
-   return dma_direct_map_page(dev, page, offset, size, direction,
-   attrs);
return iommu_map_page(dev, get_iommu_table_base(dev), page, offset,
  size, dma_get_mask(dev), direction, attrs);
 }
@@ -79,11 +53,8 @@ static void dma_iommu_unmap_page(struct device *dev, 
dma_addr_t dma_handle,
 size_t size, enum dma_data_direction direction,
 unsigned long attrs)
 {
-   if (!dma_iommu_map_bypass(dev, attrs))
-   iommu_unmap_page(get_iommu_table_base(dev), dma_handle, size,
-   direction,  attrs);
-   else
-   dma_direct_unmap_page(dev, dma_handle, size, direction, attrs);
+   iommu_unmap_page(get_iommu_table_base(dev), dma_handle, size, direction,
+attrs);
 }
 
 
@@ -91,8 +62,6 @@ static int dma_iommu_map_sg(struct device *dev, struct 
scatterlist *sglist,
int nelems, enum dma_data_direction 

[PATCH 1/2] dma-mapping: add a dma_ops_bypass flag to struct device

2020-03-20 Thread Christoph Hellwig
Several IOMMU drivers have a bypass mode where they can use a direct
mapping if the device's DMA mask is large enough.  Add generic support
to the core dma-mapping code to do that to switch those drivers to
a common solution.

Signed-off-by: Christoph Hellwig 
---
 include/linux/device.h  |  6 ++
 include/linux/dma-mapping.h | 30 ++
 kernel/dma/mapping.c| 36 +++-
 3 files changed, 51 insertions(+), 21 deletions(-)

diff --git a/include/linux/device.h b/include/linux/device.h
index 0cd7c647c16c..09be8bb2c4a6 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -525,6 +525,11 @@ struct dev_links_info {
  *   sync_state() callback.
  * @dma_coherent: this particular device is dma coherent, even if the
  * architecture supports non-coherent devices.
+ * @dma_ops_bypass: If set to %true then the dma_ops are bypassed for the
+ * streaming DMA operations (->map_* / ->unmap_* / ->sync_*),
+ * and optionally (if the coherent mask is large enough) also
+ * for dma allocations.  This flag is managed by the dma ops
+ * instance from ->dma_supported.
  *
  * At the lowest level, every device in a Linux system is represented by an
  * instance of struct device. The device structure contains the information
@@ -625,6 +630,7 @@ struct device {
 defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL)
booldma_coherent:1;
 #endif
+   booldma_ops_bypass : 1;
 };
 
 static inline struct device *kobj_to_dev(struct kobject *kobj)
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 330ad58fbf4d..c3af0cf5e435 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -188,9 +188,15 @@ static inline int dma_mmap_from_global_coherent(struct 
vm_area_struct *vma,
 }
 #endif /* CONFIG_DMA_DECLARE_COHERENT */
 
-static inline bool dma_is_direct(const struct dma_map_ops *ops)
+/*
+ * Check if the device uses a direct mapping for streaming DMA operations.
+ * This allows IOMMU drivers to set a bypass mode if the DMA mask is large
+ * enough.
+ */
+static inline bool dma_map_direct(struct device *dev,
+   const struct dma_map_ops *ops)
 {
-   return likely(!ops);
+   return likely(!ops) || dev->dma_ops_bypass;
 }
 
 /*
@@ -279,7 +285,7 @@ static inline dma_addr_t dma_map_page_attrs(struct device 
*dev,
dma_addr_t addr;
 
BUG_ON(!valid_dma_direction(dir));
-   if (dma_is_direct(ops))
+   if (dma_map_direct(dev, ops))
addr = dma_direct_map_page(dev, page, offset, size, dir, attrs);
else
addr = ops->map_page(dev, page, offset, size, dir, attrs);
@@ -294,7 +300,7 @@ static inline void dma_unmap_page_attrs(struct device *dev, 
dma_addr_t addr,
const struct dma_map_ops *ops = get_dma_ops(dev);
 
BUG_ON(!valid_dma_direction(dir));
-   if (dma_is_direct(ops))
+   if (dma_map_direct(dev, ops))
dma_direct_unmap_page(dev, addr, size, dir, attrs);
else if (ops->unmap_page)
ops->unmap_page(dev, addr, size, dir, attrs);
@@ -313,7 +319,7 @@ static inline int dma_map_sg_attrs(struct device *dev, 
struct scatterlist *sg,
int ents;
 
BUG_ON(!valid_dma_direction(dir));
-   if (dma_is_direct(ops))
+   if (dma_map_direct(dev, ops))
ents = dma_direct_map_sg(dev, sg, nents, dir, attrs);
else
ents = ops->map_sg(dev, sg, nents, dir, attrs);
@@ -331,7 +337,7 @@ static inline void dma_unmap_sg_attrs(struct device *dev, 
struct scatterlist *sg
 
BUG_ON(!valid_dma_direction(dir));
debug_dma_unmap_sg(dev, sg, nents, dir);
-   if (dma_is_direct(ops))
+   if (dma_map_direct(dev, ops))
dma_direct_unmap_sg(dev, sg, nents, dir, attrs);
else if (ops->unmap_sg)
ops->unmap_sg(dev, sg, nents, dir, attrs);
@@ -352,7 +358,7 @@ static inline dma_addr_t dma_map_resource(struct device 
*dev,
if (WARN_ON_ONCE(pfn_valid(PHYS_PFN(phys_addr
return DMA_MAPPING_ERROR;
 
-   if (dma_is_direct(ops))
+   if (dma_map_direct(dev, ops))
addr = dma_direct_map_resource(dev, phys_addr, size, dir, 
attrs);
else if (ops->map_resource)
addr = ops->map_resource(dev, phys_addr, size, dir, attrs);
@@ -368,7 +374,7 @@ static inline void dma_unmap_resource(struct device *dev, 
dma_addr_t addr,
const struct dma_map_ops *ops = get_dma_ops(dev);
 
BUG_ON(!valid_dma_direction(dir));
-   if (!dma_is_direct(ops) && ops->unmap_resource)
+   if (!dma_map_direct(dev, ops) && ops->unmap_resource)
ops->unmap_resource(dev, addr, size, dir, attrs);
debug_dma_unmap_resource(dev, addr, size, dir);
 }
@@ -380,7 +386,7 @@ static inline void dma_sync_single_for_cpu(struct device 

generic DMA bypass flag v2

2020-03-20 Thread Christoph Hellwig
Hi all,

I've recently been chatting with Lu about using dma-iommu and
per-device DMA ops in the intel IOMMU driver, and one missing feature
in dma-iommu is a bypass mode where the direct mapping is used even
when an iommu is attached to improve performance.  The powerpc
code already has a similar mode, so I'd like to move it to the core
DMA mapping code.  As part of that I noticed that the current
powerpc code has a little bug in that it used the wrong check in the
dma_sync_* routines to see if the direct mapping code is used.

These two patches just add the generic code and move powerpc over,
the intel IOMMU bits will require a separate discussion.

The x86 AMD Gart code also has a bypass mode, but it is a lot stranger,
so I'm not going to touch it for now.

Changes since v1:
 - rebased to the current dma-mapping-for-next tree
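
For illustration only, here is a minimal sketch of how an IOMMU driver might
opt a device into the new bypass mode from its ->dma_supported() callback
(the driver and the mask-checking helper below are made up, not taken from
this series):

	static int example_iommu_dma_supported(struct device *dev, u64 mask)
	{
		/*
		 * If the device can address all of system RAM directly, let
		 * the core use the direct mapping for the streaming ops.
		 */
		dev->dma_ops_bypass = example_mask_covers_ram(dev, mask);

		return 1;
	}

The core then consults dev->dma_ops_bypass through dma_map_direct() in the
streaming dma_map_*/dma_unmap_*/dma_sync_* entry points, as patch 1 shows.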


[PATCH] powerpc/64s/hash: add torture_segments kernel boot option to increase SLB faults

2020-03-20 Thread Nicholas Piggin
This option increases the number of SLB misses by limiting the number of
kernel SLB entries and by increasing the flushing of cached lookaside
information. This helps stress-test difficult-to-hit paths in the kernel.
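
For readers unfamiliar with static keys, a rough sketch of how a check like
torture_segments() is typically used to gate a hot path (the call site below
is illustrative only and not taken from this patch; it assumes the
slb_flush_and_restore_bolted() helper built on the
__slb_flush_and_restore_bolted() routine added further down):

	/* Hypothetical call site; a no-op unless torture_segments is set. */
	static void maybe_torture_slb(void)
	{
		if (!torture_segments())
			return;

		/* Flush non-bolted SLB entries to force fresh SLB faults. */
		slb_flush_and_restore_bolted();
	}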

Signed-off-by: Nicholas Piggin 
---
 .../admin-guide/kernel-parameters.txt |   4 +
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |   7 +
 arch/powerpc/mm/book3s64/hash_utils.c |  13 ++
 arch/powerpc/mm/book3s64/slb.c| 145 --
 4 files changed, 124 insertions(+), 45 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index dbc22d684627..cd3ea9f0c6b1 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -861,6 +861,10 @@
can be useful when debugging issues that require an SLB
miss to occur.
 
+   torture_segments [PPC]
+   Limits the number of SLB entries used, and flushes
+   them frequently to stress SLB faults.
+
disable=[IPV6]
See Documentation/networking/ipv6.txt.
 
diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 3fa1b962dc27..de34bf94f38c 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -317,6 +317,13 @@ extern unsigned long tce_alloc_start, tce_alloc_end;
  */
 extern int mmu_ci_restrictions;
 
+extern bool torture_segments_enabled;
+DECLARE_STATIC_KEY_FALSE(torture_segments_key);
+static inline bool torture_segments(void)
+{
+   return static_branch_unlikely(&torture_segments_key);
+}
+
 /*
  * This computes the AVPN and B fields of the first dword of a HPTE,
  * for use when we want to match an existing PTE.  The bottom 7 bits
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index 523d4d39d11e..1e5028e22aae 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -354,6 +354,7 @@ int htab_remove_mapping(unsigned long vstart, unsigned long 
vend,
 }
 
 static bool disable_1tb_segments = false;
+bool torture_segments_enabled __read_mostly = false;
 
 static int __init parse_disable_1tb_segments(char *p)
 {
@@ -362,6 +363,13 @@ static int __init parse_disable_1tb_segments(char *p)
 }
 early_param("disable_1tb_segments", parse_disable_1tb_segments);
 
+static int __init parse_torture_segments(char *p)
+{
+   torture_segments_enabled = true;
+   return 0;
+}
+early_param("torture_segments", parse_torture_segments);
+
 static int __init htab_dt_scan_seg_sizes(unsigned long node,
 const char *uname, int depth,
 void *data)
@@ -853,6 +861,8 @@ static void __init hash_init_partition_table(phys_addr_t 
hash_table,
pr_info("Partition table %p\n", partition_tb);
 }
 
+DEFINE_STATIC_KEY_FALSE(torture_segments_key);
+
 static void __init htab_initialize(void)
 {
unsigned long table;
@@ -869,6 +879,9 @@ static void __init htab_initialize(void)
printk(KERN_INFO "Using 1TB segments\n");
}
 
+   if (torture_segments_enabled)
+   static_branch_enable(&torture_segments_key);
+
/*
 * Calculate the required size of the htab.  We want the number of
 * PTEGs to equal one half the number of real pages.
diff --git a/arch/powerpc/mm/book3s64/slb.c b/arch/powerpc/mm/book3s64/slb.c
index 716204aee3da..d5efce53c54f 100644
--- a/arch/powerpc/mm/book3s64/slb.c
+++ b/arch/powerpc/mm/book3s64/slb.c
@@ -68,7 +68,7 @@ static void assert_slb_presence(bool present, unsigned long 
ea)
 * slbfee. requires bit 24 (PPC bit 39) be clear in RB. Hardware
 * ignores all other bits from 0-27, so just clear them all.
 */
-   ea &= ~((1UL << 28) - 1);
+   ea &= ~((1UL << SID_SHIFT) - 1);
asm volatile(__PPC_SLBFEE_DOT(%0, %1) : "=r"(tmp) : "r"(ea) : "cr0");
 
WARN_ON(present == (tmp == 0));
@@ -153,14 +153,28 @@ void slb_flush_all_realmode(void)
asm volatile("slbmte %0,%0; slbia" : : "r" (0));
 }
 
+static __always_inline void __slb_flush_and_restore_bolted(u32 ih)
+{
+   struct slb_shadow *p = get_slb_shadow();
+   unsigned long ksp_esid_data, ksp_vsid_data;
+
+   ksp_esid_data = be64_to_cpu(p->save_area[KSTACK_INDEX].esid);
+   ksp_vsid_data = be64_to_cpu(p->save_area[KSTACK_INDEX].vsid);
+
+   asm volatile(PPC_SLBIA(%0)" \n"
+"slbmte%1, %2  \n"
+:: "i" (ih),
+   "r" (ksp_vsid_data),
+   "r" (ksp_esid_data)
+: "memory");
+}
+
 /*
  * This flushes non-bolted entries, it can be run in virtual mode. Must
  * be called with interrupts disabled.
  */
 void 

Re: [PATCH 1/2] mm, slub: prevent kmalloc_node crashes and memory leaks

2020-03-20 Thread Srikar Dronamraju
* Vlastimil Babka  [2020-03-20 12:55:32]:

> Sachin reports [1] a crash in SLUB __slab_alloc():
> 
> BUG: Kernel NULL pointer dereference on read at 0x73b0
> Faulting instruction address: 0xc03d55f4
> Oops: Kernel access of bad area, sig: 11 [#1]
> LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> Modules linked in:
> CPU: 19 PID: 1 Comm: systemd Not tainted 5.6.0-rc2-next-20200218-autotest #1
> NIP:  c03d55f4 LR: c03d5b94 CTR: 
> REGS: c008b37836d0 TRAP: 0300   Not tainted  
> (5.6.0-rc2-next-20200218-autotest)
> MSR:  80009033   CR: 24004844  XER: 
> CFAR: c000dec4 DAR: 73b0 DSISR: 4000 IRQMASK: 1
> GPR00: c03d5b94 c008b3783960 c155d400 c008b301f500
> GPR04: 0dc0 0002 c03443d8 c008bb398620
> GPR08: 0008ba2f 0001  
> GPR12: 24004844 c0001ec52a00  
> GPR16: c008a1b20048 c1595898 c1750c18 0002
> GPR20: c1750c28 c1624470 000fffe0 5deadbeef122
> GPR24: 0001 0dc0 0002 c03443d8
> GPR28: c008b301f500 c008bb398620  c00c02287180
> NIP [c03d55f4] ___slab_alloc+0x1f4/0x760
> LR [c03d5b94] __slab_alloc+0x34/0x60
> Call Trace:
> [c008b3783960] [c03d5734] ___slab_alloc+0x334/0x760 (unreliable)
> [c008b3783a40] [c03d5b94] __slab_alloc+0x34/0x60
> [c008b3783a70] [c03d6fa0] __kmalloc_node+0x110/0x490
> [c008b3783af0] [c03443d8] kvmalloc_node+0x58/0x110
> [c008b3783b30] [c03fee38] mem_cgroup_css_online+0x108/0x270
> [c008b3783b90] [c0235aa8] online_css+0x48/0xd0
> [c008b3783bc0] [c023eaec] cgroup_apply_control_enable+0x2ec/0x4d0
> [c008b3783ca0] [c0242318] cgroup_mkdir+0x228/0x5f0
> [c008b3783d10] [c051e170] kernfs_iop_mkdir+0x90/0xf0
> [c008b3783d50] [c043dc00] vfs_mkdir+0x110/0x230
> [c008b3783da0] [c0441c90] do_mkdirat+0xb0/0x1a0
> [c008b3783e20] [c000b278] system_call+0x5c/0x68
> 
> This is a PowerPC platform with following NUMA topology:
> 
> available: 2 nodes (0-1)
> node 0 cpus:
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
> 25 26 27 28 29 30 31
> node 1 size: 35247 MB
> node 1 free: 30907 MB
> node distances:
> node   0   1
>   0:  10  40
>   1:  40  10
> 
> possible numa nodes: 0-31
> 
> This only happens with a mmotm patch "mm/memcontrol.c: allocate shrinker_map 
> on
> appropriate NUMA node" [2] which effectively calls kmalloc_node for each
> possible node. SLUB however only allocates kmem_cache_node on online
> N_NORMAL_MEMORY nodes, and relies on node_to_mem_node to return such valid 
> node
> for other nodes since commit a561ce00b09e ("slub: fall back to
> node_to_mem_node() node if allocating on memoryless node"). This is however 
> not
> true in this configuration where the _node_numa_mem_ array is not initialized
> for nodes 0 and 2-31, thus it contains zeroes and get_partial() ends up
> accessing non-allocated kmem_cache_node.
> 
> A related issue was reported by Bharata (originally by Ramachandran) [3] where
> a similar PowerPC configuration, but with mainline kernel without patch [2]
> ends up allocating large amounts of pages by kmalloc-1k kmalloc-512. This 
> seems
> to have the same underlying issue with node_to_mem_node() not behaving as
> expected, and might probably also lead to an infinite loop with
> CONFIG_SLUB_CPU_PARTIAL [4].
> 
> This patch should fix both issues by not relying on node_to_mem_node() anymore
> and instead simply falling back to NUMA_NO_NODE, when kmalloc_node(node) is
> attempted for a node that's not online, or has no usable memory. The "usable
> memory" condition is also changed from node_present_pages() to N_NORMAL_MEMORY
> node state, as that is exactly the condition that SLUB uses to allocate
> kmem_cache_node structures. The check in get_partial() is removed completely,
> as the checks in ___slab_alloc() are now sufficient to prevent get_partial()
> being reached with an invalid node.
> 
> [1] 
> https://lore.kernel.org/linux-next/3381cd91-ab3d-4773-ba04-e7a072a63...@linux.vnet.ibm.com/
> [2] 
> https://lore.kernel.org/linux-mm/fff0e636-4c36-ed10-281c-8cdb0687c...@virtuozzo.com/
> [3] https://lore.kernel.org/linux-mm/20200317092624.gb22...@in.ibm.com/
> [4] 
> https://lore.kernel.org/linux-mm/088b5996-faae-8a56-ef9c-5b567125a...@suse.cz/
> 
> Reported-and-tested-by: Sachin Sant 
> Reported-by: PUVICHAKRAVARTHY RAMACHANDRAN 
> Tested-by: Bharata B Rao 
> Debugged-by: Srikar Dronamraju 
> Signed-off-by: Vlastimil Babka 
> Fixes: a561ce00b09e ("slub: fall back to node_to_mem_node() node if 
> allocating on memoryless node")

Reviewed-by: Srikar Dronamraju 

-- 

Re: [PATCH 1/2] KVM: PPC: Book3S HV: check caller of H_SVM_* Hcalls

2020-03-20 Thread Greg Kurz
On Fri, 20 Mar 2020 11:26:42 +0100
Laurent Dufour  wrote:

> The hcalls named H_SVM_* are reserved to the Ultravisor. However, nothing
> prevents a malicious VM or SVM from calling them. This could lead to weird
> results and should be filtered out.
> 
> Checking the Secure bit of the calling MSR ensures that the call is coming
> from either the Ultravisor or an SVM. But any system call made from an SVM
> goes through the Ultravisor, and the Ultravisor should filter out such
> malicious calls. This way, only the Ultravisor is able to make such an
> hcall.

"Ultravisor should filter" ? And what if it doesn't (eg. because of a bug) ?

Shouldn't we also check the HV bit of the calling MSR as well to
disambiguate SVM and UV ?

> 
> Cc: Bharata B Rao 
> Cc: Paul Mackerras 
> Cc: Benjamin Herrenschmidt 
> Cc: Michael Ellerman 
> Signed-off-by: Laurent Dufour 
> ---
>  arch/powerpc/kvm/book3s_hv.c | 32 +---
>  1 file changed, 21 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 33be4d93248a..43773182a737 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -1074,25 +1074,35 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
>kvmppc_get_gpr(vcpu, 6));
>   break;
>   case H_SVM_PAGE_IN:
> - ret = kvmppc_h_svm_page_in(vcpu->kvm,
> -kvmppc_get_gpr(vcpu, 4),
> -kvmppc_get_gpr(vcpu, 5),
> -kvmppc_get_gpr(vcpu, 6));
> + ret = H_UNSUPPORTED;
> + if (kvmppc_get_srr1(vcpu) & MSR_S)
> + ret = kvmppc_h_svm_page_in(vcpu->kvm,
> +kvmppc_get_gpr(vcpu, 4),
> +kvmppc_get_gpr(vcpu, 5),
> +kvmppc_get_gpr(vcpu, 6));

If calling kvmppc_h_svm_page_in() produces a "weird result" when
the MSR_S bit isn't set, then I think it should do the checking
itself, ie. pass vcpu.

This would also prevent adding that many lines in kvmppc_pseries_do_hcall()
which is a big enough function already. The checking could be done in a
helper in book3s_hv_uvmem.c and used by all UV specific hcalls.
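
A rough sketch of what such a helper could look like (the names and the extra
vcpu parameter are illustrative only; the handlers would need access to SRR1
one way or another):

	/* True only when the hcall was issued with the Secure bit set. */
	static bool kvmppc_hcall_from_secure(struct kvm_vcpu *vcpu)
	{
		return !!(kvmppc_get_srr1(vcpu) & MSR_S);
	}

	unsigned long kvmppc_h_svm_init_start(struct kvm *kvm, struct kvm_vcpu *vcpu)
	{
		if (!kvmppc_hcall_from_secure(vcpu))
			return H_UNSUPPORTED;

		/* ... existing H_SVM_INIT_START handling ... */
		return H_SUCCESS;
	}
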

>   break;
>   case H_SVM_PAGE_OUT:
> - ret = kvmppc_h_svm_page_out(vcpu->kvm,
> - kvmppc_get_gpr(vcpu, 4),
> - kvmppc_get_gpr(vcpu, 5),
> - kvmppc_get_gpr(vcpu, 6));
> + ret = H_UNSUPPORTED;
> + if (kvmppc_get_srr1(vcpu) & MSR_S)
> + ret = kvmppc_h_svm_page_out(vcpu->kvm,
> + kvmppc_get_gpr(vcpu, 4),
> + kvmppc_get_gpr(vcpu, 5),
> + kvmppc_get_gpr(vcpu, 6));
>   break;
>   case H_SVM_INIT_START:
> - ret = kvmppc_h_svm_init_start(vcpu->kvm);
> + ret = H_UNSUPPORTED;
> + if (kvmppc_get_srr1(vcpu) & MSR_S)
> + ret = kvmppc_h_svm_init_start(vcpu->kvm);
>   break;
>   case H_SVM_INIT_DONE:
> - ret = kvmppc_h_svm_init_done(vcpu->kvm);
> + ret = H_UNSUPPORTED;
> + if (kvmppc_get_srr1(vcpu) & MSR_S)
> + ret = kvmppc_h_svm_init_done(vcpu->kvm);
>   break;
>   case H_SVM_INIT_ABORT:
> - ret = kvmppc_h_svm_init_abort(vcpu->kvm);
> + ret = H_UNSUPPORTED;
> + if (kvmppc_get_srr1(vcpu) & MSR_S)
> + ret = kvmppc_h_svm_init_abort(vcpu->kvm);
>   break;
>  
>   default:



[PATCH v6 09/11] perf/tools: Enhance JSON/metric infrastructure to handle "?"

2020-03-20 Thread Kajol Jain
This patch enhances the current metric infrastructure to handle "?" in the
metric expression. The "?" can be used for parameters whose value is not
known while creating the metric events and which can be replaced later at
runtime with the proper value. It also adds the flexibility to create
multiple events out of a single metric event added in the json file.

The patch adds the function 'arch_get_runtimeparam', an arch-specific
function that returns the number of metric events that need to be created.
By default it returns 1.

This infrastructure is needed for the hv_24x7 socket/chip level events.
"hv_24x7" chip level events need the specific chip-id for which the
data is requested. 'arch_get_runtimeparam' is implemented in header.c
and extracts the number of sockets from the sysfs file "sockets" under
"/sys/devices/hv_24x7/interface/".


With this patch we basically create as many metric events as defined by
runtime_param.

For that, a loop is added in 'metricgroup__add_metric' which creates
multiple events at run time, depending on the return value of
'arch_get_runtimeparam', and merges each of them into 'group_list'.

To achieve that, the parameter value is passed to the 'expr__find_other'
function, which replaces every "?" in the metric expression with this
value.

Since the json file contains a single metric event out of which multiple
events are created, the value is also appended to the original metric name
to identify the parameter value.

For example,
command:# ./perf stat  -M PowerBUS_Frequency -C 0 -I 1000
#   time counts unit events
 1.000101867  9,356,933  hv_24x7/pm_pb_cyc,chip=0/ #  2.3 
GHz  PowerBUS_Frequency_0
 1.000101867  9,366,134  hv_24x7/pm_pb_cyc,chip=1/ #  2.3 
GHz  PowerBUS_Frequency_1
 2.000314878  9,365,868  hv_24x7/pm_pb_cyc,chip=0/ #  2.3 
GHz  PowerBUS_Frequency_0
 2.000314878  9,366,092  hv_24x7/pm_pb_cyc,chip=1/ #  2.3 
GHz  PowerBUS_Frequency_1

So, here the _0 and _1 after PowerBUS_Frequency identify the parameter value.

After adding the events to group_list, expr__parse is called again from the
'generic_metric' function in util/stat-shadow.c, and the parameter value has
to be passed again at that point. To get it, the value is extracted from the
metric name itself; otherwise it would point to the last value stored in
runtime_param and would only match that one.
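
In other words, the expansion loop is roughly of this shape (the helper name
add_one_metric_event() is invented for illustration; the real code lives in
metricgroup__add_metric()):

	int i, count = arch_get_runtimeparam();	/* e.g. number of sockets */

	for (i = 0; i < count; i++) {
		char name[256];

		/* the "_<i>" suffix distinguishes the generated events */
		snprintf(name, sizeof(name), "%s_%d", pe->metric_name, i);

		/* every "?" in the expression is replaced with i while parsing */
		add_one_metric_event(name, pe->metric_expr, i);
	}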

Signed-off-by: Kajol Jain 
---
 tools/perf/arch/powerpc/util/header.c |  8 ++
 tools/perf/tests/expr.c   |  8 +++---
 tools/perf/util/expr.c| 11 
 tools/perf/util/expr.h|  5 ++--
 tools/perf/util/expr.l| 27 +-
 tools/perf/util/metricgroup.c | 40 +++
 tools/perf/util/metricgroup.h |  1 +
 tools/perf/util/stat-shadow.c | 12 ++--
 8 files changed, 86 insertions(+), 26 deletions(-)

diff --git a/tools/perf/arch/powerpc/util/header.c 
b/tools/perf/arch/powerpc/util/header.c
index 3b4cdfc5efd6..c0a86afe63fb 100644
--- a/tools/perf/arch/powerpc/util/header.c
+++ b/tools/perf/arch/powerpc/util/header.c
@@ -7,6 +7,8 @@
 #include 
 #include 
 #include "header.h"
+#include "metricgroup.h"
+#include 
 
 #define mfspr(rn)   ({unsigned long rval; \
 asm volatile("mfspr %0," __stringify(rn) \
@@ -44,3 +46,9 @@ get_cpuid_str(struct perf_pmu *pmu __maybe_unused)
 
return bufp;
 }
+
+int arch_get_runtimeparam(void)
+{
+   int count;
+   return sysfs__read_int("/devices/hv_24x7/interface/sockets", &count) < 0 ? 3 : count;
+}
diff --git a/tools/perf/tests/expr.c b/tools/perf/tests/expr.c
index ea10fc4412c4..516504cf0ea5 100644
--- a/tools/perf/tests/expr.c
+++ b/tools/perf/tests/expr.c
@@ -10,7 +10,7 @@ static int test(struct expr_parse_ctx *ctx, const char *e, 
double val2)
 {
double val;
 
-   if (expr__parse(&val, ctx, e))
+   if (expr__parse(&val, ctx, e, 1))
TEST_ASSERT_VAL("parse test failed", 0);
TEST_ASSERT_VAL("unexpected value", val == val2);
return 0;
@@ -44,15 +44,15 @@ int test__expr(struct test *t __maybe_unused, int subtest 
__maybe_unused)
return ret;
 
p = "FOO/0";
-   ret = expr__parse(&val, &ctx, p);
+   ret = expr__parse(&val, &ctx, p, 1);
TEST_ASSERT_VAL("division by zero", ret == -1);
 
p = "BAR/";
-   ret = expr__parse(&val, &ctx, p);
+   ret = expr__parse(&val, &ctx, p, 1);
TEST_ASSERT_VAL("missing operand", ret == -1);
 
TEST_ASSERT_VAL("find other",
-   expr__find_other("FOO + BAR + BAZ + BOZO", "FOO", &other, &num_other) == 0);
+   expr__find_other("FOO + BAR + BAZ + BOZO", "FOO", &other, &num_other, 1) == 0);
TEST_ASSERT_VAL("find other", num_other == 3);
TEST_ASSERT_VAL("find other", !strcmp(other[0], "BAR"));
TEST_ASSERT_VAL("find other", !strcmp(other[1], "BAZ"));
diff --git 

[PATCH v6 11/11] perf/tools/pmu-events/powerpc: Add hv_24x7 socket/chip level metric events

2020-03-20 Thread Kajol Jain
The hv_24x7 feature in IBM® POWER9™ processor-based servers provides the
facility to continuously collect large numbers of hardware performance
metrics efficiently and accurately.
This patch adds an hv_24x7 metric file for different socket/chip
resources.

Result:

power9 platform:

command:# ./perf stat --metric-only -M Memory_RD_BW_Chip -C 0 -I 1000

 1.96188  0.9  0.3
 2.000285720  0.5  0.1
 3.000424990  0.4  0.1

command:# ./perf stat --metric-only -M PowerBUS_Frequency -C 0 -I 1000

 1.979812.32.3
 2.0002917132.32.3
 3.0004217192.32.3
 4.0005509122.32.3

Signed-off-by: Kajol Jain 
---
 .../arch/powerpc/power9/nest_metrics.json | 19 +++
 1 file changed, 19 insertions(+)
 create mode 100644 tools/perf/pmu-events/arch/powerpc/power9/nest_metrics.json

diff --git a/tools/perf/pmu-events/arch/powerpc/power9/nest_metrics.json 
b/tools/perf/pmu-events/arch/powerpc/power9/nest_metrics.json
new file mode 100644
index ..ac38f5540ac6
--- /dev/null
+++ b/tools/perf/pmu-events/arch/powerpc/power9/nest_metrics.json
@@ -0,0 +1,19 @@
+[
+{
+"MetricExpr": "(hv_24x7@PM_MCS01_128B_RD_DISP_PORT01\\,chip\\=?@ + 
hv_24x7@PM_MCS01_128B_RD_DISP_PORT23\\,chip\\=?@ + 
hv_24x7@PM_MCS23_128B_RD_DISP_PORT01\\,chip\\=?@ + 
hv_24x7@PM_MCS23_128B_RD_DISP_PORT23\\,chip\\=?@)",
+"MetricName": "Memory_RD_BW_Chip",
+"MetricGroup": "Memory_BW",
+"ScaleUnit": "1.6e-2MB"
+},
+{
+"MetricExpr": "(hv_24x7@PM_MCS01_128B_WR_DISP_PORT01\\,chip\\=?@ + 
hv_24x7@PM_MCS01_128B_WR_DISP_PORT23\\,chip\\=?@ + 
hv_24x7@PM_MCS23_128B_WR_DISP_PORT01\\,chip\\=?@ + 
hv_24x7@PM_MCS23_128B_WR_DISP_PORT23\\,chip\\=?@ )",
+"MetricName": "Memory_WR_BW_Chip",
+"MetricGroup": "Memory_BW",
+"ScaleUnit": "1.6e-2MB"
+},
+{
+"MetricExpr": "(hv_24x7@PM_PB_CYC\\,chip\\=?@ )",
+"MetricName": "PowerBUS_Frequency",
+"ScaleUnit": "2.5e-7GHz"
+}
+]
-- 
2.18.1



[PATCH v6 08/11] perf/tools: Refactoring metricgroup__add_metric function

2020-03-20 Thread Kajol Jain
This patch refactors the metricgroup__add_metric function, moving part of
it to a new function, metricgroup__add_metric_param.
No logic change.

Signed-off-by: Kajol Jain 
---
 tools/perf/util/metricgroup.c | 64 +--
 1 file changed, 39 insertions(+), 25 deletions(-)

diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c
index c3a8c701609a..52fb119d25c8 100644
--- a/tools/perf/util/metricgroup.c
+++ b/tools/perf/util/metricgroup.c
@@ -474,6 +474,42 @@ static bool metricgroup__has_constraint(struct pmu_event 
*pe)
return false;
 }
 
+static int metricgroup__add_metric_param(struct strbuf *events,
+   struct list_head *group_list, struct pmu_event *pe)
+{
+
+   const char **ids;
+   int idnum;
+   struct egroup *eg;
+   int ret = -EINVAL;
+
+   if (expr__find_other(pe->metric_expr, NULL, &ids, &idnum) < 0)
+   return ret;
+
+   if (events->len > 0)
+   strbuf_addf(events, ",");
+
+   if (metricgroup__has_constraint(pe))
+   metricgroup__add_metric_non_group(events, ids, idnum);
+   else
+   metricgroup__add_metric_weak_group(events, ids, idnum);
+
+   eg = malloc(sizeof(*eg));
+   if (!eg) {
+   ret = -ENOMEM;
+   return ret;
+   }
+
+   eg->ids = ids;
+   eg->idnum = idnum;
+   eg->metric_name = pe->metric_name;
+   eg->metric_expr = pe->metric_expr;
+   eg->metric_unit = pe->unit;
+   list_add_tail(&eg->nd, group_list);
+
+   return 0;
+}
+
 static int metricgroup__add_metric(const char *metric, struct strbuf *events,
   struct list_head *group_list)
 {
@@ -493,35 +529,13 @@ static int metricgroup__add_metric(const char *metric, 
struct strbuf *events,
continue;
if (match_metric(pe->metric_group, metric) ||
match_metric(pe->metric_name, metric)) {
-   const char **ids;
-   int idnum;
-   struct egroup *eg;
 
pr_debug("metric expr %s for %s\n", pe->metric_expr, 
pe->metric_name);
 
-   if (expr__find_other(pe->metric_expr,
-NULL, &ids, &idnum) < 0)
+   ret = metricgroup__add_metric_param(events,
+   group_list, pe);
+   if (ret == -EINVAL || !ret)
continue;
-   if (events->len > 0)
-   strbuf_addf(events, ",");
-
-   if (metricgroup__has_constraint(pe))
-   metricgroup__add_metric_non_group(events, ids, 
idnum);
-   else
-   metricgroup__add_metric_weak_group(events, ids, 
idnum);
-
-   eg = malloc(sizeof(struct egroup));
-   if (!eg) {
-   ret = -ENOMEM;
-   break;
-   }
-   eg->ids = ids;
-   eg->idnum = idnum;
-   eg->metric_name = pe->metric_name;
-   eg->metric_expr = pe->metric_expr;
-   eg->metric_unit = pe->unit;
-   list_add_tail(&eg->nd, group_list);
-   ret = 0;
}
}
return ret;
-- 
2.18.1



[PATCH v6 10/11] tools/perf: Enable Hz/hz printing for --metric-only option

2020-03-20 Thread Kajol Jain
Commit 54b5091606c18 ("perf stat: Implement --metric-only mode")
added the function 'valid_only_metric()', which drops "Hz" or "hz"
if it is part of "ScaleUnit". This patch enables Hz/hz printing, since
hv_24x7 supports a couple of frequency events.

Signed-off-by: Kajol Jain 
---
 tools/perf/util/stat-display.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c
index 16efdba1973a..ecdebfcdd379 100644
--- a/tools/perf/util/stat-display.c
+++ b/tools/perf/util/stat-display.c
@@ -237,8 +237,6 @@ static bool valid_only_metric(const char *unit)
if (!unit)
return false;
if (strstr(unit, "/sec") ||
-   strstr(unit, "hz") ||
-   strstr(unit, "Hz") ||
strstr(unit, "CPUs utilized"))
return false;
return true;
-- 
2.18.1



Re: [PATCH 18/15] kvm: Replace vcpu->swait with rcuwait

2020-03-20 Thread Peter Zijlstra
On Fri, Mar 20, 2020 at 01:55:26AM -0700, Davidlohr Bueso wrote:
> - swait_event_interruptible_exclusive(*wq, ((!vcpu->arch.power_off) &&
> -(!vcpu->arch.pause)));
> + rcuwait_wait_event(*wait,
> +(!vcpu->arch.power_off) && (!vcpu->arch.pause),
> +TASK_INTERRUPTIBLE);

> - for (;;) {
> - prepare_to_swait_exclusive(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
> -
> - if (kvm_vcpu_check_block(vcpu) < 0)
> - break;
> -
> - waited = true;
> - schedule();
> - }
> -
> - finish_swait(&vcpu->wq, &wait);
> + rcuwait_wait_event(&vcpu->wait,
> +(block_check = kvm_vcpu_check_block(vcpu)) < 0,
> +TASK_INTERRUPTIBLE);

Are these yet more instances that really want to be TASK_IDLE ?



[PATCH v6 07/11] powerpc/hv-24x7: Update post_mobility_fixup() to handle migration

2020-03-20 Thread Kajol Jain
The function 'read_sys_info_pseries()' is added to get system parameter
values like the number of sockets and chips per socket. It gets these
details via rtas_call with the token "PROCESSOR_MODULE_INFO".

In case an LPAR migrates from one system to another, system parameter
details like chips per socket or the number of sockets might change, so
they need to be re-initialized; otherwise these values would correspond to
the previous system.
This patch adds a call to 'read_sys_info_pseries()' from
'post_mobility_fixup()' to re-init the physsockets and physchips values.

Signed-off-by: Kajol Jain 
---
 arch/powerpc/platforms/pseries/mobility.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/mobility.c 
b/arch/powerpc/platforms/pseries/mobility.c
index b571285f6c14..226accd6218b 100644
--- a/arch/powerpc/platforms/pseries/mobility.c
+++ b/arch/powerpc/platforms/pseries/mobility.c
@@ -371,6 +371,18 @@ void post_mobility_fixup(void)
/* Possibly switch to a new RFI flush type */
pseries_setup_rfi_flush();
 
+   /*
+* In case an LPAR migrates from one system to another, system
+* parameters like chips per socket and the number of sockets
+* might change. So they need to be re-initialized, otherwise
+* these values would correspond to the previous system.
+* Here we call read_sys_info_pseries(), declared in
+* platforms/pseries/pseries.h, to re-init the physsockets and
+* physchips values.
+*/
+   if (IS_ENABLED(CONFIG_HV_PERF_CTRS) && IS_ENABLED(CONFIG_PPC_RTAS))
+   read_sys_info_pseries();
+
return;
 }
 
-- 
2.18.1



[PATCH v6 06/11] Documentation/ABI: Add ABI documentation for chips and sockets

2020-03-20 Thread Kajol Jain
Add documentation for the following sysfs files:
/sys/devices/hv_24x7/interface/chips,
/sys/devices/hv_24x7/interface/sockets

Signed-off-by: Kajol Jain 
---
 .../testing/sysfs-bus-event_source-devices-hv_24x7 | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7 
b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7
index ec27c6c9e737..e17e5b444a1c 100644
--- a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7
+++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7
@@ -22,6 +22,20 @@ Description:
Exposes the "version" field of the 24x7 catalog. This is also
extractable from the provided binary "catalog" sysfs entry.
 
+What:  /sys/devices/hv_24x7/interface/sockets
+Date:  March 2020
+Contact:   Linux on PowerPC Developer List 
+Description:   read only
+   This sysfs interface exposes the number of sockets present in 
the
+   system.
+
+What:  /sys/devices/hv_24x7/interface/chips
+Date:  March 2020
+Contact:   Linux on PowerPC Developer List 
+Description:   read only
+   This sysfs interface exposes the number of chips per socket
+   present in the system.
+
 What:  /sys/bus/event_source/devices/hv_24x7/event_descs/
 Date:  February 2014
 Contact:   Linux on PowerPC Developer List 
-- 
2.18.1



[PATCH v6 05/11] powerpc/hv-24x7: Add sysfs files inside hv-24x7 device to show processor details

2020-03-20 Thread Kajol Jain
To expose system-dependent parameters like the total number of sockets and
the number of chips per socket, this patch adds two sysfs files.
"sockets" and "chips" are added to /sys/devices/hv_24x7/interface/
of the "hv_24x7" pmu.

Signed-off-by: Kajol Jain 
---
 arch/powerpc/perf/hv-24x7.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c
index 9ae00f29bd21..a31bd5b88f7a 100644
--- a/arch/powerpc/perf/hv-24x7.c
+++ b/arch/powerpc/perf/hv-24x7.c
@@ -454,6 +454,20 @@ static ssize_t device_show_string(struct device *dev,
return sprintf(buf, "%s\n", (char *)d->var);
 }
 
+#ifdef CONFIG_PPC_RTAS
+static ssize_t sockets_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   return sprintf(buf, "%d\n", physsockets);
+}
+
+static ssize_t chips_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+   return sprintf(buf, "%d\n", physchips);
+}
+#endif
+
 static struct attribute *device_str_attr_create_(char *name, char *str)
 {
struct dev_ext_attribute *attr = kzalloc(sizeof(*attr), GFP_KERNEL);
@@ -1100,6 +1114,10 @@ PAGE_0_ATTR(catalog_len, "%lld\n",
(unsigned long long)be32_to_cpu(page_0->length) * 4096);
 static BIN_ATTR_RO(catalog, 0/* real length varies */);
 static DEVICE_ATTR_RO(domains);
+#ifdef CONFIG_PPC_RTAS
+static DEVICE_ATTR_RO(sockets);
+static DEVICE_ATTR_RO(chips);
+#endif
 
 static struct bin_attribute *if_bin_attrs[] = {
&bin_attr_catalog,
@@ -1110,6 +1128,10 @@ static struct attribute *if_attrs[] = {
&dev_attr_catalog_len.attr,
&dev_attr_catalog_version.attr,
&dev_attr_domains.attr,
+#ifdef CONFIG_PPC_RTAS
+   &dev_attr_sockets.attr,
+   &dev_attr_chips.attr,
+#endif
NULL,
 };
 
-- 
2.18.1



[PATCH v6 04/11] powerpc/hv-24x7: Add rtas call in hv-24x7 driver to get processor details

2020-03-20 Thread Kajol Jain
For hv_24x7 socket/chip level events, the specific chip-id for which
the data is requested has to be added as part of the pmu events.
But the number of chips per socket in the system is not exposed.

This patch implements read_sys_info_pseries() to get system parameter
values like the number of sockets and chips per socket. An rtas_call
with the token "PROCESSOR_MODULE_INFO" is used to get these values.

A subsequent patch exports these values via sysfs.

The patch also makes these parameters default to 1.

Signed-off-by: Kajol Jain 
---
 arch/powerpc/perf/hv-24x7.c  | 72 
 arch/powerpc/platforms/pseries/pseries.h |  3 +
 2 files changed, 75 insertions(+)

diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c
index 48e8f4b17b91..9ae00f29bd21 100644
--- a/arch/powerpc/perf/hv-24x7.c
+++ b/arch/powerpc/perf/hv-24x7.c
@@ -20,6 +20,11 @@
 #include 
 #include 
 
+#ifdef CONFIG_PPC_RTAS
+#include 
+#include <../../platforms/pseries/pseries.h>
+#endif
+
 #include "hv-24x7.h"
 #include "hv-24x7-catalog.h"
 #include "hv-common.h"
@@ -57,6 +62,69 @@ static bool is_physical_domain(unsigned domain)
}
 }
 
+#ifdef CONFIG_PPC_RTAS
+#define PROCESSOR_MODULE_INFO   43
+#define PROCESSOR_MAX_LENGTH   (8 * 1024)
+
+static int strbe16toh(const char *buf, int offset)
+{
+   return (buf[offset] << 8) + buf[offset + 1];
+}
+
+static u32 physsockets;/* Physical sockets */
+static u32 physchips;  /* Physical chips */
+
+/*
+ * Function read_sys_info_pseries() make a rtas_call which require
+ * data buffer of size 8K. As standard 'rtas_data_buf' is of size
+ * 4K, we are adding new local buffer 'rtas_local_data_buf'.
+ */
+char rtas_local_data_buf[PROCESSOR_MAX_LENGTH] __cacheline_aligned;
+
+/*
+ * read_sys_info_pseries()
+ * Retrieve the number of sockets and chips per socket details
+ * through the get-system-parameter rtas call.
+ */
+void read_sys_info_pseries(void)
+{
+   int call_status, len, ntypes;
+
+   /*
+* Making system parameter: chips and sockets default to 1.
+*/
+   physsockets = 1;
+   physchips = 1;
+   memset(rtas_local_data_buf, 0, PROCESSOR_MAX_LENGTH);
+   spin_lock(&rtas_data_buf_lock);
+
+   call_status = rtas_call(rtas_token("ibm,get-system-parameter"), 3, 1,
+   NULL,
+   PROCESSOR_MODULE_INFO,
+   __pa(rtas_local_data_buf),
+   PROCESSOR_MAX_LENGTH);
+
+   spin_unlock(&rtas_data_buf_lock);
+
+   if (call_status != 0) {
+   pr_info("%s %s Error calling get-system-parameter (0x%x)\n",
+   __FILE__, __func__, call_status);
+   } else {
+   rtas_local_data_buf[PROCESSOR_MAX_LENGTH - 1] = '\0';
+   len = strbe16toh(rtas_local_data_buf, 0);
+   if (len < 6)
+   return;
+
+   ntypes = strbe16toh(rtas_local_data_buf, 2);
+
+   if (!ntypes)
+   return;
+   physsockets = strbe16toh(rtas_local_data_buf, 4);
+   physchips = strbe16toh(rtas_local_data_buf, 6);
+   }
+}
+#endif /* CONFIG_PPC_RTAS */
+
 /* Domains for which more than one result element are returned for each event. 
*/
 static bool domain_needs_aggregation(unsigned int domain)
 {
@@ -1605,6 +1673,10 @@ static int hv_24x7_init(void)
if (r)
return r;
 
+#ifdef CONFIG_PPC_RTAS
+   read_sys_info_pseries();
+#endif
+
return 0;
 }
 
diff --git a/arch/powerpc/platforms/pseries/pseries.h 
b/arch/powerpc/platforms/pseries/pseries.h
index 13fa370a87e4..1727559ce304 100644
--- a/arch/powerpc/platforms/pseries/pseries.h
+++ b/arch/powerpc/platforms/pseries/pseries.h
@@ -19,6 +19,9 @@ extern void request_event_sources_irqs(struct device_node *np,
 struct pt_regs;
 
 extern int pSeries_system_reset_exception(struct pt_regs *regs);
+#ifdef CONFIG_PPC_RTAS
+extern void read_sys_info_pseries(void);
+#endif
 extern int pSeries_machine_check_exception(struct pt_regs *regs);
 extern long pseries_machine_check_realmode(struct pt_regs *regs);
 
-- 
2.18.1



[PATCH v6 03/11] powerpc/perf/hv-24x7: Fix inconsistent output values in case multiple hv-24x7 events run

2020-03-20 Thread Kajol Jain
Commit 2b206ee6b0df ("powerpc/perf/hv-24x7: Display change in counter
values") added code to print the _change_ in the counter value rather than
the raw value for 24x7 counters. In case of transactions, the event count
is set to 0 at the beginning of the transaction. It also sets
the event's prev_count to the raw value at the time of initialization.
Because the event count is set to 0, we see some weird behaviour
whenever we run multiple 24x7 events at a time.

For example:

command#: ./perf stat -e "{hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/,
   hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/}"
   -C 0 -I 1000 sleep 100

 1.000121704120 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/
 1.000121704  5 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/
 2.000357733  8 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/
 2.000357733 10 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/
 3.000495215 18,446,744,073,709,551,616 
hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/
 3.000495215 18,446,744,073,709,551,616 
hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/
 4.000641884 56 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/
 4.000641884 18,446,744,073,709,551,616 
hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/
 5.000791887 18,446,744,073,709,551,616 
hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/

We get these large values when running with -I (interval mode).

Because event_count is set to 0, in the interval case the overall
event_count does not increase monotonically: the new delta can be smaller
than the previous count, so printing the interval difference yields a
negative value, which shows up as these huge wrapped numbers.

This patch removes the part where event_count is set to 0 in
'h_24x7_event_read'. There won't be much impact, as event->hw.prev_count is
still set to the raw value at initialization time in order to print the
change in value.
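
For reference, a minimal sketch of the usual perf delta-accounting shape the
paragraph above refers to (simplified, not the exact 24x7 code):

	static void example_event_update(struct perf_event *event, u64 now)
	{
		/* prev_count holds the last raw value read from the hypervisor */
		u64 prev = local64_xchg(&event->hw.prev_count, now);

		/* only the difference is accumulated into the event count */
		local64_add(now - prev, &event->count);
	}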

With this patch
In power9 platform

command#: ./perf stat -e "{hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/,
   hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/}"
   -C 0 -I 1000 sleep 100

 1.000117685 93 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/
 1.000117685  1 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/
 2.000349331 98 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/
 2.000349331  2 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/
 3.000495900131 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/
 3.000495900  4 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/
 4.000645920204 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/
 4.000645920 61 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/
 4.284169997 22 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/

Signed-off-by: Kajol Jain 
Suggested-by: Sukadev Bhattiprolu 
---
 arch/powerpc/perf/hv-24x7.c | 10 --
 1 file changed, 10 deletions(-)

diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c
index 573e0b309c0c..48e8f4b17b91 100644
--- a/arch/powerpc/perf/hv-24x7.c
+++ b/arch/powerpc/perf/hv-24x7.c
@@ -1400,16 +1400,6 @@ static void h_24x7_event_read(struct perf_event *event)
h24x7hw = &get_cpu_var(hv_24x7_hw);
h24x7hw->events[i] = event;
put_cpu_var(h24x7hw);
-   /*
-* Clear the event count so we can compute the _change_
-* in the 24x7 raw counter value at the end of the txn.
-*
-* Note that we could alternatively read the 24x7 value
-* now and save its value in event->hw.prev_count. But
-* that would require issuing a hcall, which would then
-* defeat the purpose of using the txn interface.
-*/
-   local64_set(>count, 0);
}
 
put_cpu_var(hv_24x7_reqb);
-- 
2.18.1



[PATCH v6 02/11] perf expr: Add expr_scanner_ctx object

2020-03-20 Thread Kajol Jain
From: Jiri Olsa 

Add an expr_scanner_ctx object to hold user data
for the expr scanner. Currently it holds only
start_token; Kajol Jain will use it to hold the 24x7
runtime param.

Signed-off-by: Jiri Olsa 
---
 tools/perf/util/expr.c |  6 --
 tools/perf/util/expr.h |  4 
 tools/perf/util/expr.l | 10 +-
 3 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/tools/perf/util/expr.c b/tools/perf/util/expr.c
index c8ccc548a585..c3382d58cf40 100644
--- a/tools/perf/util/expr.c
+++ b/tools/perf/util/expr.c
@@ -3,7 +3,6 @@
 #include 
 #include "expr.h"
 #include "expr-bison.h"
-#define YY_EXTRA_TYPE int
 #include "expr-flex.h"
 
 #ifdef PARSER_DEBUG
@@ -30,11 +29,14 @@ static int
 __expr__parse(double *val, struct expr_parse_ctx *ctx, const char *expr,
  int start)
 {
+   struct expr_scanner_ctx scanner_ctx = {
+   .start_token = start,
+   };
YY_BUFFER_STATE buffer;
void *scanner;
int ret;
 
-   ret = expr_lex_init_extra(start, &scanner);
+   ret = expr_lex_init_extra(&scanner_ctx, &scanner);
if (ret)
return ret;
 
diff --git a/tools/perf/util/expr.h b/tools/perf/util/expr.h
index b9e53f2b5844..0938ad166ece 100644
--- a/tools/perf/util/expr.h
+++ b/tools/perf/util/expr.h
@@ -15,6 +15,10 @@ struct expr_parse_ctx {
struct expr_parse_id ids[MAX_PARSE_ID];
 };
 
+struct expr_scanner_ctx {
+   int start_token;
+};
+
 void expr__ctx_init(struct expr_parse_ctx *ctx);
 void expr__add_id(struct expr_parse_ctx *ctx, const char *id, double val);
 int expr__parse(double *final_val, struct expr_parse_ctx *ctx, const char 
*expr);
diff --git a/tools/perf/util/expr.l b/tools/perf/util/expr.l
index eaad29243c23..2582c2464938 100644
--- a/tools/perf/util/expr.l
+++ b/tools/perf/util/expr.l
@@ -76,13 +76,13 @@ sym [0-9a-zA-Z_\.:@]+
 symbol {spec}*{sym}*{spec}*{sym}*
 
 %%
-   {
-   int start_token;
+   struct expr_scanner_ctx *sctx = expr_get_extra(yyscanner);
 
-   start_token = expr_get_extra(yyscanner);
+   {
+   int start_token = sctx->start_token;
 
-   if (start_token) {
-   expr_set_extra(NULL, yyscanner);
+   if (sctx->start_token) {
+   sctx->start_token = 0;
return start_token;
}
}
-- 
2.18.1



[PATCH v6 01/11] perf expr: Add expr_ prefix for parse_ctx and parse_id

2020-03-20 Thread Kajol Jain
From: Jiri Olsa 

Adding expr_ prefix for parse_ctx and parse_id,
to straighten out the expr* namespace.

There's no functional change.

Signed-off-by: Jiri Olsa 
---
 tools/perf/tests/expr.c   |  4 ++--
 tools/perf/util/expr.c| 10 +-
 tools/perf/util/expr.h| 12 ++--
 tools/perf/util/expr.y|  6 +++---
 tools/perf/util/stat-shadow.c |  2 +-
 5 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/tools/perf/tests/expr.c b/tools/perf/tests/expr.c
index 28313e59d6f6..ea10fc4412c4 100644
--- a/tools/perf/tests/expr.c
+++ b/tools/perf/tests/expr.c
@@ -6,7 +6,7 @@
 #include 
 #include 
 
-static int test(struct parse_ctx *ctx, const char *e, double val2)
+static int test(struct expr_parse_ctx *ctx, const char *e, double val2)
 {
double val;
 
@@ -22,7 +22,7 @@ int test__expr(struct test *t __maybe_unused, int subtest 
__maybe_unused)
const char **other;
double val;
int i, ret;
-   struct parse_ctx ctx;
+   struct expr_parse_ctx ctx;
int num_other;
 
expr__ctx_init(&ctx);
diff --git a/tools/perf/util/expr.c b/tools/perf/util/expr.c
index fd192ddf93c1..c8ccc548a585 100644
--- a/tools/perf/util/expr.c
+++ b/tools/perf/util/expr.c
@@ -11,7 +11,7 @@ extern int expr_debug;
 #endif
 
 /* Caller must make sure id is allocated */
-void expr__add_id(struct parse_ctx *ctx, const char *name, double val)
+void expr__add_id(struct expr_parse_ctx *ctx, const char *name, double val)
 {
int idx;
 
@@ -21,13 +21,13 @@ void expr__add_id(struct parse_ctx *ctx, const char *name, 
double val)
ctx->ids[idx].val = val;
 }
 
-void expr__ctx_init(struct parse_ctx *ctx)
+void expr__ctx_init(struct expr_parse_ctx *ctx)
 {
ctx->num_ids = 0;
 }
 
 static int
-__expr__parse(double *val, struct parse_ctx *ctx, const char *expr,
+__expr__parse(double *val, struct expr_parse_ctx *ctx, const char *expr,
  int start)
 {
YY_BUFFER_STATE buffer;
@@ -52,7 +52,7 @@ __expr__parse(double *val, struct parse_ctx *ctx, const char 
*expr,
return ret;
 }
 
-int expr__parse(double *final_val, struct parse_ctx *ctx, const char *expr)
+int expr__parse(double *final_val, struct expr_parse_ctx *ctx, const char 
*expr)
 {
return __expr__parse(final_val, ctx, expr, EXPR_PARSE) ? -1 : 0;
 }
@@ -75,7 +75,7 @@ int expr__find_other(const char *expr, const char *one, const 
char ***other,
 int *num_other)
 {
int err, i = 0, j = 0;
-   struct parse_ctx ctx;
+   struct expr_parse_ctx ctx;
 
expr__ctx_init();
err = __expr__parse(NULL, , expr, EXPR_OTHER);
diff --git a/tools/perf/util/expr.h b/tools/perf/util/expr.h
index 9377538f4097..b9e53f2b5844 100644
--- a/tools/perf/util/expr.h
+++ b/tools/perf/util/expr.h
@@ -5,19 +5,19 @@
 #define EXPR_MAX_OTHER 20
 #define MAX_PARSE_ID EXPR_MAX_OTHER
 
-struct parse_id {
+struct expr_parse_id {
const char *name;
double val;
 };
 
-struct parse_ctx {
+struct expr_parse_ctx {
int num_ids;
-   struct parse_id ids[MAX_PARSE_ID];
+   struct expr_parse_id ids[MAX_PARSE_ID];
 };
 
-void expr__ctx_init(struct parse_ctx *ctx);
-void expr__add_id(struct parse_ctx *ctx, const char *id, double val);
-int expr__parse(double *final_val, struct parse_ctx *ctx, const char *expr);
+void expr__ctx_init(struct expr_parse_ctx *ctx);
+void expr__add_id(struct expr_parse_ctx *ctx, const char *id, double val);
+int expr__parse(double *final_val, struct expr_parse_ctx *ctx, const char 
*expr);
 int expr__find_other(const char *expr, const char *one, const char ***other,
int *num_other);
 
diff --git a/tools/perf/util/expr.y b/tools/perf/util/expr.y
index 4720cbe79357..cd17486c1c5d 100644
--- a/tools/perf/util/expr.y
+++ b/tools/perf/util/expr.y
@@ -15,7 +15,7 @@
 %define api.pure full
 
 %parse-param { double *final_val }
-%parse-param { struct parse_ctx *ctx }
+%parse-param { struct expr_parse_ctx *ctx }
 %parse-param {void *scanner}
 %lex-param {void* scanner}
 
@@ -39,14 +39,14 @@
 
 %{
 static void expr_error(double *final_val __maybe_unused,
-  struct parse_ctx *ctx __maybe_unused,
+  struct expr_parse_ctx *ctx __maybe_unused,
   void *scanner,
   const char *s)
 {
pr_debug("%s\n", s);
 }
 
-static int lookup_id(struct parse_ctx *ctx, char *id, double *val)
+static int lookup_id(struct expr_parse_ctx *ctx, char *id, double *val)
 {
int i;
 
diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c
index 0fd713d3674f..402af3e8d287 100644
--- a/tools/perf/util/stat-shadow.c
+++ b/tools/perf/util/stat-shadow.c
@@ -729,7 +729,7 @@ static void generic_metric(struct perf_stat_config *config,
   struct runtime_stat *st)
 {
print_metric_t print_metric = out->print_metric;
-   struct parse_ctx pctx;
+   struct expr_parse_ctx pctx;
double 

[PATCH v6 00/11] powerpc/perf: Add json file metric support for the hv_24x7 socket/chip level events

2020-03-20 Thread Kajol Jain
Patchset fixes the inconsistent results we are getting when
we run multiple 24x7 events.

Patchset adds JSON file metric support for the hv_24x7 socket/chip level
events. "hv_24x7" pmu interface events need system-dependent parameters
like socket/chip/core. For example, hv_24x7 chip-level events need the
specific chip-id for which the data is requested to be supplied as part
of the pmu event.

So to enable JSON file support for the "hv_24x7" interface, the patchset
exposes the total number of sockets and the number of chips per socket in
sysfs files (sockets, chips) under "/sys/devices/hv_24x7/interface/".

To get the number of sockets and chips per socket, the patchset adds an
RTAS call with token "PROCESSOR_MODULE_INFO" to get these details. The
patchset also handles the partition migration case, re-initializing these
system-dependent parameters by adding proper calls in post_mobility_fixup()
(mobility.c).

Second patch of the patchset adds expr_scanner_ctx object to hold user
data for the expr scanner, which can be used to hold runtime parameter.

Patches 9 and 11 of the patchset handle the perf tool plumbing needed to
replace the "?" character in a metric expression with the proper value,
and add the hv_24x7 JSON metric file for the different socket/chip
resources.

The patch set also enables Hz/hz printing for the --metric-only option to
print metric data for bus frequency.
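
To make the sysfs interface described above concrete, below is a minimal
userspace sketch of how a tool could read the exposed values. It is only an
illustration and not part of the patchset: the path and the file names
(sockets, chips) are taken from this cover letter, while the function names
and error handling are made up for the example.

	#include <stdio.h>

	static int read_hv_24x7_param(const char *name, int *val)
	{
		char path[256];
		FILE *f;
		int ok;

		snprintf(path, sizeof(path),
			 "/sys/devices/hv_24x7/interface/%s", name);
		f = fopen(path, "r");
		if (!f)
			return -1;
		ok = (fscanf(f, "%d", val) == 1) ? 0 : -1;
		fclose(f);
		return ok;
	}

	int main(void)
	{
		int sockets, chips;

		if (!read_hv_24x7_param("sockets", &sockets) &&
		    !read_hv_24x7_param("chips", &chips))
			printf("sockets=%d chips-per-socket=%d\n",
			       sockets, chips);
		return 0;
	}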

Applied and tested all these patches cleanly on top of Jiri's flex changes,
along with the changes done by Kan Liang for the "Support metric group
constraint" patchset, and made the required changes.

Changelog:
v5 -> v6
- Resolve compilation issue due to rearranging the patch series.
- Rather than adding a new function to take care of the runtime-param case
  in metricgroup__add_metric, use metricgroup__add_metric_param itself
  for that work.
- Address some suggested optimizations, like using the file path directly
  rather than adding a new macro in header.c.
- Change the commit message on the patch adding "?" support by adding a
  simple example.

v4 -> v5
- Use sysfs__read_int instead of sysfs__read_ull while reading the
  parameter value in the powerpc/util/header.c file.

- Use asprintf rather than malloc and sprintf.
  Suggested by Arnaldo Carvalho de Melo

- Break patch 6 from the previous version into two patches:
  - one to refactor the current "metricgroup__add_metric" function,
    and another where the "?" handling infrastructure is actually added.

- Add expr__runtimeparam as part of the 'expr_scanner_ctx' struct
  rather than making it a global variable. Thanks Jiri for
  adding this structure to hold user data for the expr scanner.

- Add the runtime param as an argument to the functions 'expr__find_other'
  and 'expr__parse' and adjust references accordingly.

v3 -> v4
- Apply these patches on top of Kan Liang's changes,
  as suggested by Jiri.

v2 -> v3
- Remove the setting of event_count to 0 in function 'h_24x7_event_read',
  replacing it with a comment, rather than adding 0 to the event_count value.
  Suggested by: Sukadev Bhattiprolu

- Apply the tool-side changes required to replace "?" on top of Jiri's flex
  patch series and make all the changes required to be compatible with the
  added flex change.

v1 -> v2
- Rename hv-24x7 metric json file as nest_metrics.json

Jiri Olsa (2):
  perf expr: Add expr_ prefix for parse_ctx and parse_id
  perf expr: Add expr_scanner_ctx object

Kajol Jain (9):
  powerpc/perf/hv-24x7: Fix inconsistent output values incase multiple
hv-24x7 events run
  powerpc/hv-24x7: Add rtas call in hv-24x7 driver to get processor
details
  powerpc/hv-24x7: Add sysfs files inside hv-24x7 device to show
processor details
  Documentation/ABI: Add ABI documentation for chips and sockets
  powerpc/hv-24x7: Update post_mobility_fixup() to handle migration
  perf/tools: Refactoring metricgroup__add_metric function
  perf/tools: Enhance JSON/metric infrastructure to handle "?"
  tools/perf: Enable Hz/hz prinitg for --metric-only option
  perf/tools/pmu-events/powerpc: Add hv_24x7 socket/chip level metric
events

 .../sysfs-bus-event_source-devices-hv_24x7|  14 +++
 arch/powerpc/perf/hv-24x7.c   | 104 --
 arch/powerpc/platforms/pseries/mobility.c |  12 ++
 arch/powerpc/platforms/pseries/pseries.h  |   3 +
 tools/perf/arch/powerpc/util/header.c |   8 ++
 .../arch/powerpc/power9/nest_metrics.json |  19 
 tools/perf/tests/expr.c   |  12 +-
 tools/perf/util/expr.c|  25 +++--
 tools/perf/util/expr.h|  19 ++--
 tools/perf/util/expr.l|  37 +--
 tools/perf/util/expr.y|   6 +-
 tools/perf/util/metricgroup.c |  88 +++
 tools/perf/util/metricgroup.h |   1 +
 tools/perf/util/stat-display.c|   2 -
 tools/perf/util/stat-shadow.c |  14 ++-
 15 files changed, 287 insertions(+), 77 deletions(-)
 create mode 100644 tools/perf/pmu-events/arch/powerpc/power9/nest_metrics.json

-- 
2.18.1



Re: [PATCH v12 8/8] MAINTAINERS: perf: Add pattern that matches ppc perf to the perf entry.

2020-03-20 Thread Andy Shevchenko
On Fri, Mar 20, 2020 at 12:23:38PM +0100, Michal Suchánek wrote:
> On Fri, Mar 20, 2020 at 12:33:50PM +0200, Andy Shevchenko wrote:
> > On Fri, Mar 20, 2020 at 11:20:19AM +0100, Michal Suchanek wrote:
> > > While at it also simplify the existing perf patterns.

> > And still missed fixes from parse-maintainers.pl.
> 
> Oh, that script UX is truly ingenious.

You have at least two options, their combinations, etc:
 - complain to the author :-)
 - send a patch :-)

> It provides no output and quietly
> creates MAINTAINERS.new which is, of course, not included in the patch.

Yes, it also took me a while to understand how it works; luckily it has a
little help note.

-- 
With Best Regards,
Andy Shevchenko




Re: [PATCH v3 9/9] Documentation/powerpc: VAS API

2020-03-20 Thread Daniel Axtens
Hi Haren,

This is good documentation.

> Power9 introduced Virtual Accelerator Switchboard (VAS) which allows
> userspace to communicate with Nest Accelerator (NX) directly. But
> kernel has to establish channel to NX for userspace. This document
> describes user space API that application can use to establish
> communication channel.
>
> Signed-off-by: Sukadev Bhattiprolu 
> Signed-off-by: Haren Myneni 
> ---
>  Documentation/powerpc/index.rst   |   1 +
>  Documentation/powerpc/vas-api.rst | 246 
> ++
>  2 files changed, 247 insertions(+)
>  create mode 100644 Documentation/powerpc/vas-api.rst
>
> diff --git a/Documentation/powerpc/index.rst b/Documentation/powerpc/index.rst
> index 0d45f0f..afe2d5e 100644
> --- a/Documentation/powerpc/index.rst
> +++ b/Documentation/powerpc/index.rst
> @@ -30,6 +30,7 @@ powerpc
>  syscall64-abi
>  transactional_memory
>  ultravisor
> +vas-api
>  
>  .. only::  subproject and html
>  
> diff --git a/Documentation/powerpc/vas-api.rst 
> b/Documentation/powerpc/vas-api.rst
> new file mode 100644
> index 000..13ce4e7
> --- /dev/null
> +++ b/Documentation/powerpc/vas-api.rst
> @@ -0,0 +1,246 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +.. _VAS-API:
> +
> +===
> +Virtual Accelerator Switchboard (VAS) userspace API
> +===
> +
> +Introduction
> +
> +
> +Power9 processor introduced Virtual Accelerator Switchboard (VAS) which
> +allows both userspace and the kernel to communicate with the co-processor
> +(hardware accelerator) referred to as the Nest Accelerator (NX). The NX
> +unit comprises one or more hardware engines or co-processor types
> +such as 842 compression, GZIP compression and encryption. On power9,
> +userspace applications will have access to only GZIP Compression engine
> +which supports ZLIB and GZIP compression algorithms in the hardware.
> +
> +To communicate with NX, kernel has to establish a channel or window and
> +then requests can be submitted directly without kernel involvement.
> +Requests to the GZIP engine must be formatted as a co-processor Request
> +Block (CRB) and these CRBs must be submitted to the NX using COPY/PASTE
> +instructions to paste the CRB to hardware address that is associated with
> +the engine's request queue.
> +
> +The GZIP engine provides two priority levels of requests: Normal and
> +High. Only Normal requests are supported from userspace right now.
> +
> +This document explains userspace API that is used to interact with
> +kernel to setup channel / window which can be used to send compression
> +requests directly to NX accelerator.
> +
> +
> +Overview
> +
> +
> +Application access to the GZIP engine is provided through
> +/dev/crypto/nx-gzip device node implemented by the VAS/NX device driver.
> +An application must open the /dev/crypto/nx-gzip device to obtain a file
> +descriptor (fd). It should then issue the VAS_TX_WIN_OPEN ioctl with this fd to
> +establish connection to the engine. It means send window is opened on GZIP
> +engine for this process. Once a connection is established, the application
> +should use the mmap() system call to map the hardware address of engine's
> +request queue into the application's virtual address space.
> +
> +The application can then submit one or more requests to the engine by
> +using copy/paste instructions and pasting the CRBs to the virtual address
> +(aka paste_address) returned by mmap(). User space can close the
> +established connection or send window by closing the file descriptior
> +(close(fd)) or upon the process exit.
> +
> +Note that applications can send several requests with the same window or
> +can establish multiple windows, but one window for each file descriptor.
> +
> +Following sections provide additional details and references about the
> +individual steps.
> +
> +NX-GZIP Device Node
> +===
> +
> +There is one /dev/crypto/nx-gzip node in the system and it provides
> +access to all GZIP engines in the system. The only valid operations on
> +/dev/crypto/nx-gzip are:
> +
> + * open() the device for read and write.
> + * issue VAS_TX_WIN_OPEN ioctl
> + * mmap() the engine's request queue into application's virtual
> +   address space (i.e. get a paste_address for the co-processor
> +   engine).
> + * close the device node.
> +
> +Other file operations on this device node are undefined.
> +
> +Note that the copy and paste operations go directly to the hardware and
> +do not go through this device. Refer COPY/PASTE document for more
> +details.
> +
> +Although a system may have several instances of the NX co-processor
> +engines (typically, one per P9 chip) there is just one
> +/dev/crypto/nx-gzip device node in the system. When the nx-gzip device
> +node is opened, Kernel opens send window on a suitable instance of NX
> +accelerator. It finds CPU on which the user process 

Re: [PATCH v3 3/9] powerpc/vas: Add VAS user space API

2020-03-20 Thread Daniel Axtens
Haren Myneni  writes:

> On power9, userspace can send GZIP compression requests directly to NX
> once kernel establishes NX channel / window with VAS. This patch provides
> user space API which allows user space to establish channel using open
> VAS_TX_WIN_OPEN ioctl, mmap and close operations.
>
> Each window corresponds to a file descriptor and an application can open
> multiple windows. Once the fd is obtained, the VAS_TX_WIN_OPEN ioctl opens
> a window on a specific VAS instance, and the mmap() system call maps
> the hardware address of the engine's request queue into the application's
> virtual address space.
>
> The application can then submit one or more requests to the
> engine by using the copy/paste instructions and pasting the CRBs to
> the virtual address (aka paste_address) returned by mmap().
>
> Only the NX GZIP coprocessor type is supported right now, allowing GZIP
> engine access via the /dev/crypto/nx-gzip device node.
>
> Signed-off-by: Sukadev Bhattiprolu 
> Signed-off-by: Haren Myneni 
> ---
>  arch/powerpc/include/asm/vas.h  |  11 ++
>  arch/powerpc/platforms/powernv/Makefile |   2 +-
>  arch/powerpc/platforms/powernv/vas-api.c| 290 
> 
>  arch/powerpc/platforms/powernv/vas-window.c |   6 +-
>  arch/powerpc/platforms/powernv/vas.h|   2 +
>  5 files changed, 307 insertions(+), 4 deletions(-)
>  create mode 100644 arch/powerpc/platforms/powernv/vas-api.c
>
> diff --git a/arch/powerpc/include/asm/vas.h b/arch/powerpc/include/asm/vas.h
> index f93e6b0..e064953 100644
> --- a/arch/powerpc/include/asm/vas.h
> +++ b/arch/powerpc/include/asm/vas.h
> @@ -163,4 +163,15 @@ struct vas_window *vas_tx_win_open(int vasid, enum 
> vas_cop_type cop,
>   */
>  int vas_paste_crb(struct vas_window *win, int offset, bool re);
>  
> +/*
> + * Register / unregister coprocessor type to VAS API which will be exported
> + * to user space. Applications can use this API to open / close window
> + * which can be used to send / receive requests directly to the coprocessor.
> + *
> + * Only NX GZIP coprocessor type is supported now, but this API can be
> + * used for others in future.
> + */
> +int vas_register_coproc_api(struct module *mod);
> +void vas_unregister_coproc_api(void);
> +
>  #endif /* __ASM_POWERPC_VAS_H */
> diff --git a/arch/powerpc/platforms/powernv/Makefile 
> b/arch/powerpc/platforms/powernv/Makefile
> index 395789f..fe3f0fb 100644
> --- a/arch/powerpc/platforms/powernv/Makefile
> +++ b/arch/powerpc/platforms/powernv/Makefile
> @@ -17,7 +17,7 @@ obj-$(CONFIG_MEMORY_FAILURE)+= opal-memory-errors.o
>  obj-$(CONFIG_OPAL_PRD)   += opal-prd.o
>  obj-$(CONFIG_PERF_EVENTS) += opal-imc.o
>  obj-$(CONFIG_PPC_MEMTRACE)   += memtrace.o
> -obj-$(CONFIG_PPC_VAS)+= vas.o vas-window.o vas-debug.o vas-fault.o
> +obj-$(CONFIG_PPC_VAS)+= vas.o vas-window.o vas-debug.o vas-fault.o 
> vas-api.o
>  obj-$(CONFIG_OCXL_BASE)  += ocxl.o
>  obj-$(CONFIG_SCOM_DEBUGFS) += opal-xscom.o
>  obj-$(CONFIG_PPC_SECURE_BOOT) += opal-secvar.o
> diff --git a/arch/powerpc/platforms/powernv/vas-api.c 
> b/arch/powerpc/platforms/powernv/vas-api.c
> new file mode 100644
> index 000..3473a4a
> --- /dev/null
> +++ b/arch/powerpc/platforms/powernv/vas-api.c
> @@ -0,0 +1,290 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * VAS user space API for its accelerators (Only NX-GZIP is supported now)
> + * Copyright (C) 2019 Haren Myneni, IBM Corp
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include "vas.h"
> +
> +/*
> + * The driver creates the device node that can be used as follows:
> + * For NX-GZIP
> + *
> + *   fd = open("/dev/crypto/nx-gzip", O_RDWR);
> + *   rc = ioctl(fd, VAS_TX_WIN_OPEN, );
> + *   paste_addr = mmap(NULL, PAGE_SIZE, prot, MAP_SHARED, fd, 0ULL).
> + *   vas_copy(, 0, 1);
> + *   vas_paste(paste_addr, 0, 1);
> + *   close(fd) or exit process to close window.
> + *
> + * where "vas_copy" and "vas_paste" are defined in copy-paste.h.
> + * copy/paste returns to the user space directly. So refer NX hardware
> + * documententation for excat copy/paste usage and completion / error
> + * conditions.
> + */
> +
> +static char  *coproc_dev_name = "nx-gzip";
> +static atomic_t  coproc_instid = ATOMIC_INIT(0);
> +
> +/*
> + * Wrapper object for the nx-gzip device - there is just one instance of
> + * this node for the whole system.
> + */
> +static struct coproc_dev {
> + struct cdev cdev;
> + struct device *device;
> + char *name;
> + dev_t devt;
> + struct class *class;
> +} coproc_device;
> +
> +/*
> + * One instance per open of a nx-gzip device. Each coproc_instance is
> + * associated with a VAS window after the caller issues
> + * VAS_GZIP_TX_WIN_OPEN ioctl.
> + */
> +struct coproc_instance {
> + int id;
> + struct vas_window *txwin;
> +};
> +
> +static char *coproc_devnode(struct device *dev, umode_t *mode)
> +{
> + return 

[PATCH 1/2] mm, slub: prevent kmalloc_node crashes and memory leaks

2020-03-20 Thread Vlastimil Babka
Sachin reports [1] a crash in SLUB __slab_alloc():

BUG: Kernel NULL pointer dereference on read at 0x73b0
Faulting instruction address: 0xc03d55f4
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in:
CPU: 19 PID: 1 Comm: systemd Not tainted 5.6.0-rc2-next-20200218-autotest #1
NIP:  c03d55f4 LR: c03d5b94 CTR: 
REGS: c008b37836d0 TRAP: 0300   Not tainted  
(5.6.0-rc2-next-20200218-autotest)
MSR:  80009033   CR: 24004844  XER: 
CFAR: c000dec4 DAR: 73b0 DSISR: 4000 IRQMASK: 1
GPR00: c03d5b94 c008b3783960 c155d400 c008b301f500
GPR04: 0dc0 0002 c03443d8 c008bb398620
GPR08: 0008ba2f 0001  
GPR12: 24004844 c0001ec52a00  
GPR16: c008a1b20048 c1595898 c1750c18 0002
GPR20: c1750c28 c1624470 000fffe0 5deadbeef122
GPR24: 0001 0dc0 0002 c03443d8
GPR28: c008b301f500 c008bb398620  c00c02287180
NIP [c03d55f4] ___slab_alloc+0x1f4/0x760
LR [c03d5b94] __slab_alloc+0x34/0x60
Call Trace:
[c008b3783960] [c03d5734] ___slab_alloc+0x334/0x760 (unreliable)
[c008b3783a40] [c03d5b94] __slab_alloc+0x34/0x60
[c008b3783a70] [c03d6fa0] __kmalloc_node+0x110/0x490
[c008b3783af0] [c03443d8] kvmalloc_node+0x58/0x110
[c008b3783b30] [c03fee38] mem_cgroup_css_online+0x108/0x270
[c008b3783b90] [c0235aa8] online_css+0x48/0xd0
[c008b3783bc0] [c023eaec] cgroup_apply_control_enable+0x2ec/0x4d0
[c008b3783ca0] [c0242318] cgroup_mkdir+0x228/0x5f0
[c008b3783d10] [c051e170] kernfs_iop_mkdir+0x90/0xf0
[c008b3783d50] [c043dc00] vfs_mkdir+0x110/0x230
[c008b3783da0] [c0441c90] do_mkdirat+0xb0/0x1a0
[c008b3783e20] [c000b278] system_call+0x5c/0x68

This is a PowerPC platform with following NUMA topology:

available: 2 nodes (0-1)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
25 26 27 28 29 30 31
node 1 size: 35247 MB
node 1 free: 30907 MB
node distances:
node   0   1
  0:  10  40
  1:  40  10

possible numa nodes: 0-31

This only happens with a mmotm patch "mm/memcontrol.c: allocate shrinker_map on
appropriate NUMA node" [2] which effectively calls kmalloc_node for each
possible node. SLUB however only allocates kmem_cache_node on online
N_NORMAL_MEMORY nodes, and relies on node_to_mem_node to return such valid node
for other nodes since commit a561ce00b09e ("slub: fall back to
node_to_mem_node() node if allocating on memoryless node"). This is however not
true in this configuration where the _node_numa_mem_ array is not initialized
for nodes 0 and 2-31, thus it contains zeroes and get_partial() ends up
accessing non-allocated kmem_cache_node.

A related issue was reported by Bharata (originally by Ramachandran) [3] where
a similar PowerPC configuration, but with a mainline kernel without patch [2],
ends up allocating large amounts of pages from kmalloc-1k and kmalloc-512. This seems
to have the same underlying issue with node_to_mem_node() not behaving as
expected, and might probably also lead to an infinite loop with
CONFIG_SLUB_CPU_PARTIAL [4].

This patch should fix both issues by not relying on node_to_mem_node() anymore
and instead simply falling back to NUMA_NO_NODE, when kmalloc_node(node) is
attempted for a node that's not online, or has no usable memory. The "usable
memory" condition is also changed from node_present_pages() to N_NORMAL_MEMORY
node state, as that is exactly the condition that SLUB uses to allocate
kmem_cache_node structures. The check in get_partial() is removed completely,
as the checks in ___slab_alloc() are now sufficient to prevent get_partial()
being reached with an invalid node.
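
To make the fallback described above concrete, here is a hedged sketch of
the node check (not the actual patch hunk -- the real change lives in the
___slab_alloc()/kmalloc_node() paths, and the helper name below is made up
for illustration):

	#include <linux/nodemask.h>
	#include <linux/numa.h>

	/*
	 * If the requested node is not online or has no normal memory,
	 * fall back to NUMA_NO_NODE instead of trusting node_to_mem_node().
	 */
	static inline int slab_fallback_node(int node)
	{
		if (node != NUMA_NO_NODE &&
		    (!node_online(node) ||
		     !node_state(node, N_NORMAL_MEMORY)))
			return NUMA_NO_NODE;
		return node;
	}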

[1] 
https://lore.kernel.org/linux-next/3381cd91-ab3d-4773-ba04-e7a072a63...@linux.vnet.ibm.com/
[2] 
https://lore.kernel.org/linux-mm/fff0e636-4c36-ed10-281c-8cdb0687c...@virtuozzo.com/
[3] https://lore.kernel.org/linux-mm/20200317092624.gb22...@in.ibm.com/
[4] 
https://lore.kernel.org/linux-mm/088b5996-faae-8a56-ef9c-5b567125a...@suse.cz/

Reported-and-tested-by: Sachin Sant 
Reported-by: PUVICHAKRAVARTHY RAMACHANDRAN 
Tested-by: Bharata B Rao 
Debugged-by: Srikar Dronamraju 
Signed-off-by: Vlastimil Babka 
Fixes: a561ce00b09e ("slub: fall back to node_to_mem_node() node if allocating 
on memoryless node")
Cc: sta...@vger.kernel.org
Cc: Mel Gorman 
Cc: Michael Ellerman 
Cc: Michal Hocko 
Cc: Christopher Lameter 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Joonsoo Kim 
Cc: Pekka Enberg 
Cc: David Rientjes 
Cc: Kirill Tkhai 
Cc: Vlastimil Babka 
Cc: Nathan Lynch 
---
Hi, after the RFC got 

Re: [PATCH 2/2] KVM: PPC: Book3S HV: H_SVM_INIT_START must call UV_RETURN

2020-03-20 Thread Bharata B Rao
On Fri, Mar 20, 2020 at 11:26:43AM +0100, Laurent Dufour wrote:
> When the call to UV_REGISTER_MEM_SLOT is failing, for instance because
> there is not enough free secured memory, the Hypervisor (HV) has to call
> UV_RETURN to report the error to the Ultravisor (UV). Then the UV will call
> H_SVM_INIT_ABORT to abort the securing phase and go back to the calling VM.
> 
> If the kvm->arch.secure_guest is not set, in the return path rfid is called
> but there is no valid context to get back to the SVM since the Hcall has
> been routed by the Ultravisor.
> 
> Move the setting of kvm->arch.secure_guest earlier in
> kvmppc_h_svm_init_start() so in the return path, UV_RETURN will be called
> instead of rfid.
> 
> Cc: Bharata B Rao 
> Cc: Paul Mackerras 
> Cc: Benjamin Herrenschmidt 
> Cc: Michael Ellerman 
> Signed-off-by: Laurent Dufour 
> ---
>  arch/powerpc/kvm/book3s_hv_uvmem.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
> b/arch/powerpc/kvm/book3s_hv_uvmem.c
> index 79b1202b1c62..68dff151315c 100644
> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> @@ -209,6 +209,8 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
>   int ret = H_SUCCESS;
>   int srcu_idx;
>  
> + kvm->arch.secure_guest = KVMPPC_SECURE_INIT_START;
> +
>   if (!kvmppc_uvmem_bitmap)
>   return H_UNSUPPORTED;
>  
> @@ -233,7 +235,6 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
>   goto out;
>   }
>   }
> - kvm->arch.secure_guest |= KVMPPC_SECURE_INIT_START;

There is an assumption that memory slots would have been registered with UV
if KVMPPC_SECURE_INIT_START has been done. KVM_PPC_SVM_OFF ioctl will skip
unregistration and other steps during reboot if KVMPPC_SECURE_INIT_START
hasn't been done.

Have you checked if that path isn't affected by this change?

Regards,
Bharata.



Re: [PATCH v12 8/8] MAINTAINERS: perf: Add pattern that matches ppc perf to the perf entry.

2020-03-20 Thread Michal Suchánek
On Fri, Mar 20, 2020 at 12:33:50PM +0200, Andy Shevchenko wrote:
> On Fri, Mar 20, 2020 at 11:20:19AM +0100, Michal Suchanek wrote:
> > While at it also simplify the existing perf patterns.
> > 
> 
> And still missed fixes from parse-maintainers.pl.

Oh, that script UX is truly ingenious. It provides no output and quietly
creates MAINTAINERS.new which is, of course, not included in the patch.

Thanks

Michal

> 
> I see it like below in the linux-next (after the script)
> 
> PERFORMANCE EVENTS SUBSYSTEM
> M:  Peter Zijlstra 
> M:  Ingo Molnar 
> M:  Arnaldo Carvalho de Melo 
> R:  Mark Rutland 
> R:  Alexander Shishkin 
> R:  Jiri Olsa 
> R:  Namhyung Kim 
> L:  linux-ker...@vger.kernel.org
> S:  Supported
> T:  git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 
> perf/core
> F:  arch/*/events/*
> F:  arch/*/events/*/*
> F:  arch/*/include/asm/perf_event.h
> F:  arch/*/kernel/*/*/perf_event*.c
> F:  arch/*/kernel/*/perf_event*.c
> F:  arch/*/kernel/perf_callchain.c
> F:  arch/*/kernel/perf_event*.c
> F:  include/linux/perf_event.h
> F:  include/uapi/linux/perf_event.h
> F:  kernel/events/*
> F:  tools/perf/
> 
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -13080,7 +13080,7 @@ R:  Namhyung Kim 
> >  L: linux-ker...@vger.kernel.org
> >  T: git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git perf/core
> >  S: Supported
> > -F: kernel/events/*
> > +F: kernel/events/
> >  F: include/linux/perf_event.h
> >  F: include/uapi/linux/perf_event.h
> >  F: arch/*/kernel/perf_event*.c
> > @@ -13088,8 +13088,8 @@ F:  arch/*/kernel/*/perf_event*.c
> >  F: arch/*/kernel/*/*/perf_event*.c
> >  F: arch/*/include/asm/perf_event.h
> >  F: arch/*/kernel/perf_callchain.c
> > -F: arch/*/events/*
> > -F: arch/*/events/*/*
> > +F: arch/*/events/
> > +F: arch/*/perf/
> >  F: tools/perf/
> >  
> >  PERFORMANCE EVENTS SUBSYSTEM ARM64 PMU EVENTS
> 
> -- 
> With Best Regards,
> Andy Shevchenko
> 
> 


Re: [PATCH 18/15] kvm: Replace vcpu->swait with rcuwait

2020-03-20 Thread Paolo Bonzini
On 20/03/20 09:55, Davidlohr Bueso wrote:
> Only compiled and tested on x86.

It shows :) as the __KVM_HAVE_ARCH_WQP case is broken.  But no problem, 
Paul and I can pick this up and fix it.

This is missing:

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 506e4df2d730..6e5d85ba588d 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -78,7 +78,7 @@ struct kvmppc_vcore {
struct kvm_vcpu *runnable_threads[MAX_SMT_THREADS];
struct list_head preempt_list;
spinlock_t lock;
-   struct swait_queue_head wq;
+   struct rcuwait wait;
spinlock_t stoltb_lock; /* protects stolen_tb and preempt_tb */
u64 stolen_tb;
u64 preempt_tb;
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 1af96fb5dc6f..8c8122c30b89 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -754,7 +754,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
if (err)
goto out_vcpu_uninit;
 
-   vcpu->arch.wqp = >wq;
+   vcpu->arch.waitp = >wait;
kvmppc_create_vcpu_debugfs(vcpu, vcpu->vcpu_id);
return 0;
 

and...

> -static inline struct swait_queue_head *kvm_arch_vcpu_wq(struct kvm_vcpu 
> *vcpu)
> +static inline struct rcuwait *kvm_arch_vcpu_get_wait(struct kvm_vcpu *vcpu)
>  {
>  #ifdef __KVM_HAVE_ARCH_WQP
> - return vcpu->arch.wqp;
> + return vcpu->arch.wait;

... this needs to be vcpu->arch.waitp.  That should be it.

Thanks!

Paolo

>  #else
> - return >wq;
> + return >wait;
>  #endif
>  }
>  
> diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
> index 0d9438e9de2a..4be71cb58691 100644
> --- a/virt/kvm/arm/arch_timer.c
> +++ b/virt/kvm/arm/arch_timer.c
> @@ -593,7 +593,7 @@ void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu)
>   if (map.emul_ptimer)
>   soft_timer_cancel(_ptimer->hrtimer);
>  
> - if (swait_active(kvm_arch_vcpu_wq(vcpu)))
> + if (rcu_dereference(kvm_arch_vpu_get_wait(vcpu)) != NULL)
>   kvm_timer_blocking(vcpu);
>  
>   /*
> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
> index eda7b624eab8..4a704866e9b6 100644
> --- a/virt/kvm/arm/arm.c
> +++ b/virt/kvm/arm/arm.c
> @@ -579,16 +579,17 @@ void kvm_arm_resume_guest(struct kvm *kvm)
>  
>   kvm_for_each_vcpu(i, vcpu, kvm) {
>   vcpu->arch.pause = false;
> - swake_up_one(kvm_arch_vcpu_wq(vcpu));
> + rcuwait_wake_up(kvm_arch_vcpu_get_wait(vcpu));
>   }
>  }
>  
>  static void vcpu_req_sleep(struct kvm_vcpu *vcpu)
>  {
> - struct swait_queue_head *wq = kvm_arch_vcpu_wq(vcpu);
> + struct rcuwait *wait = kvm_arch_vcpu_get_wait(vcpu);
>  
> - swait_event_interruptible_exclusive(*wq, ((!vcpu->arch.power_off) &&
> -(!vcpu->arch.pause)));
> + rcuwait_wait_event(*wait,
> +(!vcpu->arch.power_off) && (!vcpu->arch.pause),
> +TASK_INTERRUPTIBLE);
>  
>   if (vcpu->arch.power_off || vcpu->arch.pause) {
>   /* Awaken to handle a signal, request we sleep again later. */
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index 15e5b037f92d..10b533f641a6 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -80,8 +80,7 @@ static void async_pf_execute(struct work_struct *work)
>  
>   trace_kvm_async_pf_completed(addr, cr2_or_gpa);
>  
> - if (swq_has_sleeper(>wq))
> - swake_up_one(>wq);
> + rcuwait_wake_up(>wait);
>  
>   mmput(mm);
>   kvm_put_kvm(vcpu->kvm);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 70f03ce0e5c1..6b49dcb321e2 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -343,7 +343,7 @@ static void kvm_vcpu_init(struct kvm_vcpu *vcpu, struct 
> kvm *kvm, unsigned id)
>   vcpu->kvm = kvm;
>   vcpu->vcpu_id = id;
>   vcpu->pid = NULL;
> - init_swait_queue_head(>wq);
> + rcuwait_init(>wait);
>   kvm_async_pf_vcpu_init(vcpu);
>  
>   vcpu->pre_pcpu = -1;
> @@ -2465,9 +2465,8 @@ static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
>  void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>  {
>   ktime_t start, cur;
> - DECLARE_SWAITQUEUE(wait);
> - bool waited = false;
>   u64 block_ns;
> + int block_check = -EINTR;
>  
>   kvm_arch_vcpu_blocking(vcpu);
>  
> @@ -2487,21 +2486,14 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>   ++vcpu->stat.halt_poll_invalid;
>   goto out;
>   }
> +
>   cur = ktime_get();
>   } while (single_task_running() && ktime_before(cur, stop));
>   }
>  
> - for (;;) {
> - prepare_to_swait_exclusive(>wq, , 
> TASK_INTERRUPTIBLE);
> -
> - if (kvm_vcpu_check_block(vcpu) < 0)
> - break;
> -
> - 

[ltcras] powerpc/pseries: Handle UE event for memcpy_mcsafe

2020-03-20 Thread Ganesh Goudar
If we hit UE at an instruction with a fixup entry, flag to
ignore the event and set nip to continue execution at the
fixup entry.
For powernv these changes are already made by
commit 895e3dceeb97 ("powerpc/mce: Handle UE event for memcpy_mcsafe")

Reviewed-by: Mahesh Salgaonkar 
Reviewed-by: Santosh S 
Signed-off-by: Ganesh Goudar 
---
V2: Fixes a trivial checkpatch error in commit msg
---
 arch/powerpc/platforms/pseries/ras.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/ras.c 
b/arch/powerpc/platforms/pseries/ras.c
index 5d49d9d711da..6dc3074a34c5 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -505,6 +506,7 @@ static int mce_handle_error(struct pt_regs *regs, struct 
rtas_error_log *errp)
int initiator = rtas_error_initiator(errp);
int severity = rtas_error_severity(errp);
u8 error_type, err_sub_type;
+   const struct exception_table_entry *entry;
 
if (initiator == RTAS_INITIATOR_UNKNOWN)
mce_err.initiator = MCE_INITIATOR_UNKNOWN;
@@ -558,6 +560,12 @@ static int mce_handle_error(struct pt_regs *regs, struct 
rtas_error_log *errp)
switch (mce_log->error_type) {
case MC_ERROR_TYPE_UE:
mce_err.error_type = MCE_ERROR_TYPE_UE;
+   entry = search_kernel_exception_table(regs->nip);
+   if (entry) {
+   mce_err.ignore_event = true;
+   regs->nip = extable_fixup(entry);
+   disposition = RTAS_DISP_FULLY_RECOVERED;
+   }
switch (err_sub_type) {
case MC_ERROR_UE_IFETCH:
mce_err.u.ue_error_type = MCE_UE_ERROR_IFETCH;
-- 
2.17.2



[PATCH v2] powerpc/pseries: Fix MCE handling on pseries

2020-03-20 Thread Ganesh Goudar
MCE handling on pSeries platform fails as recent rework to use common
code for pSeries and PowerNV in machine check error handling tries to
access per-cpu variables in realmode. The per-cpu variables may be
outside the RMO region on pSeries platform and needs translation to be
enabled for access. Just moving these per-cpu variable into RMO region
did'nt help because we queue some work to workqueues in real mode, which
again tries to touch per-cpu variables. Also fwnmi_release_errinfo()
cannot be called when translation is not enabled.

This patch fixes this by enabling translation in the exception handler
when all required real mode handling is done. This change only affects
the pSeries platform.
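
For readability, the core of the fix described above boils down to turning
instruction and data relocation back on once all real-mode handling is done
(the actual hunk is at the end of this mail); the wrapper below is purely
illustrative:

	#include <asm/reg.h>

	/* Sketch only: enable translation before touching per-cpu data. */
	static inline void mce_enable_translation(void)
	{
		mtmsr(mfmsr() | MSR_IR | MSR_DR);
	}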

Without this fix, the below kernel crash is seen on injecting
SLB multihit:

BUG: Unable to handle kernel data access on read at 0xc0027b205950
Faulting instruction address: 0xc003b7e0
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: mcetest_slb(OE+) af_packet(E) xt_tcpudp(E) ip6t_rpfilter(E) 
ip6t_REJECT(E) ipt_REJECT(E) xt_conntrack(E) ip_set(E) nfnetlink(E) 
ebtable_nat(E) ebtable_broute(E) ip6table_nat(E) ip6table_mangle(E) 
ip6table_raw(E) ip6table_security(E) iptable_nat(E) nf_nat(E) nf_conntrack(E) 
nf_defrag_ipv6(E) nf_defrag_ipv4(E) iptable_mangle(E) iptable_raw(E) 
iptable_security(E) ebtable_filter(E) ebtables(E) ip6table_filter(E) 
ip6_tables(E) iptable_filter(E) ip_tables(E) x_tables(E) xfs(E) ibmveth(E) 
vmx_crypto(E) gf128mul(E) uio_pdrv_genirq(E) uio(E) crct10dif_vpmsum(E) 
rtc_generic(E) btrfs(E) libcrc32c(E) xor(E) zstd_decompress(E) zstd_compress(E) 
raid6_pq(E) sr_mod(E) sd_mod(E) cdrom(E) ibmvscsi(E) scsi_transport_srp(E) 
crc32c_vpmsum(E) dm_mod(E) sg(E) scsi_mod(E)
CPU: 34 PID: 8154 Comm: insmod Kdump: loaded Tainted: G OE 5.5.0-mahesh #1
NIP: c003b7e0 LR: c00f2218 CTR: 
REGS: c7dcb960 TRAP: 0300 Tainted: G OE (5.5.0-mahesh)
MSR: 80001003  CR: 28002428 XER: 2004
CFAR: c00f2214 DAR: c0027b205950 DSISR: 4000 IRQMASK: 0
GPR00: c00f2218 c7dcbbf0 c1544800 c7dcbd70
GPR04: 0001 c7dcbc98 c00800d00258 c008011c
GPR08:  00030003 c1035950 0348
GPR12: 00027a1d c7f9c000 0558 
GPR16: 0540 c0080111 c00801110540 
GPR20: c022af10 c0025480fd70 c0080128 c0004bfbb300
GPR24: c1442330 c008080d c0080800 4009287a77000510
GPR28:  0002 c1033d30 0001
NIP [c003b7e0] save_mce_event+0x30/0x240
LR [c00f2218] pseries_machine_check_realmode+0x2c8/0x4f0
Call Trace:
Instruction dump:
3c4c0151 38429050 7c0802a6 6000 fbc1fff0 fbe1fff8 f821ffd1 3d42ffaf
3fc2ffaf e98d0030 394a1150 3bdef530 <7d6a62aa> 1d2b0048 2f8b0063 380b0001
---[ end trace 46fd63f36bbdd940 ]---

Fixes: 9ca766f9891d ("powerpc/64s/pseries: machine check convert to use common 
event code")
Reviewed-by: Mahesh Salgaonkar 
Reviewed-by: Nicholas Piggin 
Signed-off-by: Ganesh Goudar 
---
v2: Avoid asm code to switch to virtual mode.
---
 arch/powerpc/platforms/pseries/ras.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/ras.c 
b/arch/powerpc/platforms/pseries/ras.c
index 1d7f973c647b..43710b69e09e 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -683,6 +683,17 @@ static int mce_handle_error(struct pt_regs *regs, struct 
rtas_error_log *errp)
 #endif
 
 out:
+   /*
+* Enable translation as we will be accessing per-cpu variables
+* in save_mce_event() which may fall outside RMO region, also
+* leave it enabled because subsequently we will be queuing work
+* to workqueues where again per-cpu variables accessed, besides
+* fwnmi_release_errinfo() crashes when called in realmode on
+* pseries.
+* Note: All the realmode handling like flushing SLB entries for
+*   SLB multihit is done by now.
+*/
+   mtmsr(mfmsr() | MSR_IR | MSR_DR);
save_mce_event(regs, disposition == RTAS_DISP_FULLY_RECOVERED,
_err, regs->nip, eaddr, paddr);
 
-- 
2.17.2



Re: [PATCH 17/15] rcuwait: Inform rcuwait_wake_up() users if a wakeup was attempted

2020-03-20 Thread Peter Zijlstra
On Fri, Mar 20, 2020 at 01:55:25AM -0700, Davidlohr Bueso wrote:

> diff --git a/kernel/exit.c b/kernel/exit.c
> index 6cc6cc485d07..b0bb0a8ec4b1 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -234,9 +234,10 @@ void release_task(struct task_struct *p)
>   goto repeat;
>  }
>  
> -void rcuwait_wake_up(struct rcuwait *w)
> +bool rcuwait_wake_up(struct rcuwait *w)
>  {
>   struct task_struct *task;
> + bool ret = false;
>  
>   rcu_read_lock();
>  
> @@ -254,10 +255,15 @@ void rcuwait_wake_up(struct rcuwait *w)
>   smp_mb(); /* (B) */
>  
>   task = rcu_dereference(w->task);
> - if (task)
> + if (task) {
>   wake_up_process(task);
> + ret = true;

ret = wake_up_process(task); ?

> + }
>   rcu_read_unlock();
> +
> + return ret;
>  }
> +EXPORT_SYMBOL_GPL(rcuwait_wake_up);


Re: [PATCH] cpufreq: powernv: Fix frame-size-overflow in powernv_cpufreq_work_fn

2020-03-20 Thread Gautham R Shenoy
On Mon, Mar 16, 2020 at 07:27:43PM +0530, Pratik Rajesh Sampat wrote:
> The patch avoids allocating cpufreq_policy on the stack, hence fixing the
> frame-size overflow in 'powernv_cpufreq_work_fn'.
>

Thanks for fixing this.

> Fixes: 227942809b52 ("cpufreq: powernv: Restore cpu frequency to policy->cur 
> on unthrottling")
> Signed-off-by: Pratik Rajesh Sampat 

Reviewed-by: Gautham R. Shenoy 

> ---
>  drivers/cpufreq/powernv-cpufreq.c | 13 -
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/cpufreq/powernv-cpufreq.c 
> b/drivers/cpufreq/powernv-cpufreq.c
> index 56f4bc0d209e..20ee0661555a 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -902,6 +902,7 @@ static struct notifier_block powernv_cpufreq_reboot_nb = {
>  void powernv_cpufreq_work_fn(struct work_struct *work)
>  {
>   struct chip *chip = container_of(work, struct chip, throttle);
> + struct cpufreq_policy *policy;
>   unsigned int cpu;
>   cpumask_t mask;
> 
> @@ -916,12 +917,14 @@ void powernv_cpufreq_work_fn(struct work_struct *work)
>   chip->restore = false;
>   for_each_cpu(cpu, ) {
>   int index;
> - struct cpufreq_policy policy;
> 
> - cpufreq_get_policy(, cpu);
> - index = cpufreq_table_find_index_c(, policy.cur);
> - powernv_cpufreq_target_index(, index);
> - cpumask_andnot(, , policy.cpus);
> + policy = cpufreq_cpu_get(cpu);
> + if (!policy)
> + continue;
> + index = cpufreq_table_find_index_c(policy, policy->cur);
> + powernv_cpufreq_target_index(policy, index);
> + cpumask_andnot(, , policy->cpus);
> + cpufreq_cpu_put(policy);
>   }
>  out:
>   put_online_cpus();
> -- 
> 2.24.1
> 


Re: [PATCH v12 8/8] MAINTAINERS: perf: Add pattern that matches ppc perf to the perf entry.

2020-03-20 Thread Andy Shevchenko
On Fri, Mar 20, 2020 at 11:20:19AM +0100, Michal Suchanek wrote:
> While at it also simplify the existing perf patterns.
> 

And still missed fixes from parse-maintainers.pl.

I see it like below in the linux-next (after the script)

PERFORMANCE EVENTS SUBSYSTEM
M:  Peter Zijlstra 
M:  Ingo Molnar 
M:  Arnaldo Carvalho de Melo 
R:  Mark Rutland 
R:  Alexander Shishkin 
R:  Jiri Olsa 
R:  Namhyung Kim 
L:  linux-ker...@vger.kernel.org
S:  Supported
T:  git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git perf/core
F:  arch/*/events/*
F:  arch/*/events/*/*
F:  arch/*/include/asm/perf_event.h
F:  arch/*/kernel/*/*/perf_event*.c
F:  arch/*/kernel/*/perf_event*.c
F:  arch/*/kernel/perf_callchain.c
F:  arch/*/kernel/perf_event*.c
F:  include/linux/perf_event.h
F:  include/uapi/linux/perf_event.h
F:  kernel/events/*
F:  tools/perf/

> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -13080,7 +13080,7 @@ R:Namhyung Kim 
>  L:   linux-ker...@vger.kernel.org
>  T:   git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git perf/core
>  S:   Supported
> -F:   kernel/events/*
> +F:   kernel/events/
>  F:   include/linux/perf_event.h
>  F:   include/uapi/linux/perf_event.h
>  F:   arch/*/kernel/perf_event*.c
> @@ -13088,8 +13088,8 @@ F:arch/*/kernel/*/perf_event*.c
>  F:   arch/*/kernel/*/*/perf_event*.c
>  F:   arch/*/include/asm/perf_event.h
>  F:   arch/*/kernel/perf_callchain.c
> -F:   arch/*/events/*
> -F:   arch/*/events/*/*
> +F:   arch/*/events/
> +F:   arch/*/perf/
>  F:   tools/perf/
>  
>  PERFORMANCE EVENTS SUBSYSTEM ARM64 PMU EVENTS

-- 
With Best Regards,
Andy Shevchenko




[PATCH] arch/powerpc/mm: Enable compound page check for both THP and HugeTLB

2020-03-20 Thread Aneesh Kumar K.V
THP config can result in compound pages. Make sure kernel enables the
PageCompound() check when only THP is enabled.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/mem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 9b4f5fb719e0..b03cbddf9054 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -485,7 +485,7 @@ EXPORT_SYMBOL(flush_dcache_page);
 
 void flush_dcache_icache_page(struct page *page)
 {
-#ifdef CONFIG_HUGETLB_PAGE
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLB_PAGE)
if (PageCompound(page)) {
flush_dcache_icache_hugepage(page);
return;
-- 
2.25.1



[PATCH] arch/powerpc/64: Avoid isync in flush_dcache_range

2020-03-20 Thread Aneesh Kumar K.V
As per the ISA, isync is only needed on instruction cache
block invalidate. Remove it from the dcache invalidate.

Signed-off-by: Aneesh Kumar K.V 
---
Note: IIUC we can also avoid the sync before dcbf.

 arch/powerpc/include/asm/cacheflush.h | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/cacheflush.h 
b/arch/powerpc/include/asm/cacheflush.h
index 4a1c9f0200e1..e92191b390f3 100644
--- a/arch/powerpc/include/asm/cacheflush.h
+++ b/arch/powerpc/include/asm/cacheflush.h
@@ -65,17 +65,13 @@ static inline void flush_dcache_range(unsigned long start, 
unsigned long stop)
unsigned long size = stop - (unsigned long)addr + (bytes - 1);
unsigned long i;
 
-   if (IS_ENABLED(CONFIG_PPC64)) {
+   if (IS_ENABLED(CONFIG_PPC64))
mb();   /* sync */
-   isync();
-   }
 
for (i = 0; i < size >> shift; i++, addr += bytes)
dcbf(addr);
mb();   /* sync */
 
-   if (IS_ENABLED(CONFIG_PPC64))
-   isync();
 }
 
 /*
-- 
2.25.1



[PATCH 0/2] Fix SVM hang at startup

2020-03-20 Thread Laurent Dufour
This series fixes an SVM hang occurring when starting an SVM that requires
more secure memory than is available. The hang happens in the SVM when calling
UV_ESM.

The following is happening:

1. SVM calls UV_ESM
2. Ultravisor (UV) calls H_SVM_INIT_START
3. Hypervisor (HV) calls UV_REGISTER_MEM_SLOT
4. UV returns error because there is not enough free secure memory
5. HV enter the error path in kvmppc_h_svm_init_start()
6. In the return path, since kvm->arch.secure_guest is not yet set, hrfid is
   called
7. As the HV doesn't know the SVM calling context, hrfid jumps to an
   unknown address in the SVM, leading to various exceptions.

This series fixes the setting of kvm->arch.secure_guest in
kvmppc_h_svm_init_start() to ensure that UV_RETURN is called on the return
path to get back to the UV.

In addition, to ensure that a malicious VM will not call UV-reserved hcalls,
a check of the Secure bit in the calling MSR is added to reject such a
call.

It is assumed that the UV will filter out such hcalls made by a malicious
SVM.
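
As an illustration of the Secure bit check mentioned above, the guard
amounts to the sketch below. The helper name is made up; patch 1/2 open-codes
the test for each H_SVM_* case in kvmppc_pseries_do_hcall(), since some of
those hcalls take extra arguments:

	#include <linux/kvm_host.h>
	#include <asm/hvcall.h>
	#include <asm/kvm_book3s.h>
	#include <asm/reg.h>

	/* Only dispatch an H_SVM_* hcall when the caller has MSR_S set. */
	static unsigned long h_svm_guarded(struct kvm_vcpu *vcpu,
					   unsigned long (*handler)(struct kvm *kvm))
	{
		if (!(kvmppc_get_srr1(vcpu) & MSR_S))
			return H_UNSUPPORTED;

		return handler(vcpu->kvm);
	}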

Laurent Dufour (2):
  KVM: PPC: Book3S HV: check caller of H_SVM_* Hcalls
  KVM: PPC: Book3S HV: H_SVM_INIT_START must call UV_RETURN

 arch/powerpc/kvm/book3s_hv.c   | 32 --
 arch/powerpc/kvm/book3s_hv_uvmem.c |  3 ++-
 2 files changed, 23 insertions(+), 12 deletions(-)

-- 
2.25.2



[PATCH 2/2] KVM: PPC: Book3S HV: H_SVM_INIT_START must call UV_RETURN

2020-03-20 Thread Laurent Dufour
When the call to UV_REGISTER_MEM_SLOT fails, for instance because
there is not enough free secure memory, the Hypervisor (HV) has to call
UV_RETURN to report the error to the Ultravisor (UV). Then the UV will call
H_SVM_INIT_ABORT to abort the securing phase and go back to the calling VM.

If the kvm->arch.secure_guest is not set, in the return path rfid is called
but there is no valid context to get back to the SVM since the Hcall has
been routed by the Ultravisor.

Move the setting of kvm->arch.secure_guest earlier in
kvmppc_h_svm_init_start() so in the return path, UV_RETURN will be called
instead of rfid.

Cc: Bharata B Rao 
Cc: Paul Mackerras 
Cc: Benjamin Herrenschmidt 
Cc: Michael Ellerman 
Signed-off-by: Laurent Dufour 
---
 arch/powerpc/kvm/book3s_hv_uvmem.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 79b1202b1c62..68dff151315c 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -209,6 +209,8 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
int ret = H_SUCCESS;
int srcu_idx;
 
+   kvm->arch.secure_guest = KVMPPC_SECURE_INIT_START;
+
if (!kvmppc_uvmem_bitmap)
return H_UNSUPPORTED;
 
@@ -233,7 +235,6 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
goto out;
}
}
-   kvm->arch.secure_guest |= KVMPPC_SECURE_INIT_START;
 out:
srcu_read_unlock(>srcu, srcu_idx);
return ret;
-- 
2.25.2



[PATCH 1/2] KVM: PPC: Book3S HV: check caller of H_SVM_* Hcalls

2020-03-20 Thread Laurent Dufour
The hcalls named H_SVM_* are reserved to the Ultravisor. However, nothing
prevents a malicious VM or SVM from calling them. This could lead to weird
results and should be filtered out.

Checking the Secure bit of the calling MSR ensures that the call is coming
from either the Ultravisor or an SVM. But any system call made from an SVM
goes through the Ultravisor, and the Ultravisor should filter out
these malicious calls. This way, only the Ultravisor is able to make such
an hcall.

Cc: Bharata B Rao 
Cc: Paul Mackerras 
Cc: Benjamin Herrenschmidt 
Cc: Michael Ellerman 
Signed-off-by: Laurent Dufour 
---
 arch/powerpc/kvm/book3s_hv.c | 32 +---
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 33be4d93248a..43773182a737 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1074,25 +1074,35 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
 kvmppc_get_gpr(vcpu, 6));
break;
case H_SVM_PAGE_IN:
-   ret = kvmppc_h_svm_page_in(vcpu->kvm,
-  kvmppc_get_gpr(vcpu, 4),
-  kvmppc_get_gpr(vcpu, 5),
-  kvmppc_get_gpr(vcpu, 6));
+   ret = H_UNSUPPORTED;
+   if (kvmppc_get_srr1(vcpu) & MSR_S)
+   ret = kvmppc_h_svm_page_in(vcpu->kvm,
+  kvmppc_get_gpr(vcpu, 4),
+  kvmppc_get_gpr(vcpu, 5),
+  kvmppc_get_gpr(vcpu, 6));
break;
case H_SVM_PAGE_OUT:
-   ret = kvmppc_h_svm_page_out(vcpu->kvm,
-   kvmppc_get_gpr(vcpu, 4),
-   kvmppc_get_gpr(vcpu, 5),
-   kvmppc_get_gpr(vcpu, 6));
+   ret = H_UNSUPPORTED;
+   if (kvmppc_get_srr1(vcpu) & MSR_S)
+   ret = kvmppc_h_svm_page_out(vcpu->kvm,
+   kvmppc_get_gpr(vcpu, 4),
+   kvmppc_get_gpr(vcpu, 5),
+   kvmppc_get_gpr(vcpu, 6));
break;
case H_SVM_INIT_START:
-   ret = kvmppc_h_svm_init_start(vcpu->kvm);
+   ret = H_UNSUPPORTED;
+   if (kvmppc_get_srr1(vcpu) & MSR_S)
+   ret = kvmppc_h_svm_init_start(vcpu->kvm);
break;
case H_SVM_INIT_DONE:
-   ret = kvmppc_h_svm_init_done(vcpu->kvm);
+   ret = H_UNSUPPORTED;
+   if (kvmppc_get_srr1(vcpu) & MSR_S)
+   ret = kvmppc_h_svm_init_done(vcpu->kvm);
break;
case H_SVM_INIT_ABORT:
-   ret = kvmppc_h_svm_init_abort(vcpu->kvm);
+   ret = H_UNSUPPORTED;
+   if (kvmppc_get_srr1(vcpu) & MSR_S)
+   ret = kvmppc_h_svm_init_abort(vcpu->kvm);
break;
 
default:
-- 
2.25.2



[PATCH v12 8/8] MAINTAINERS: perf: Add pattern that matches ppc perf to the perf entry.

2020-03-20 Thread Michal Suchanek
While at it also simplify the existing perf patterns.

Signed-off-by: Michal Suchanek 
---
v10: new patch
V12: remove redundant entries
---
 MAINTAINERS | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index e1a99197fb34..578429d0 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13080,7 +13080,7 @@ R:  Namhyung Kim 
 L: linux-ker...@vger.kernel.org
 T: git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git perf/core
 S: Supported
-F: kernel/events/*
+F: kernel/events/
 F: include/linux/perf_event.h
 F: include/uapi/linux/perf_event.h
 F: arch/*/kernel/perf_event*.c
@@ -13088,8 +13088,8 @@ F:  arch/*/kernel/*/perf_event*.c
 F: arch/*/kernel/*/*/perf_event*.c
 F: arch/*/include/asm/perf_event.h
 F: arch/*/kernel/perf_callchain.c
-F: arch/*/events/*
-F: arch/*/events/*/*
+F: arch/*/events/
+F: arch/*/perf/
 F: tools/perf/
 
 PERFORMANCE EVENTS SUBSYSTEM ARM64 PMU EVENTS
-- 
2.23.0



[PATCH v12 7/8] powerpc/perf: split callchain.c by bitness

2020-03-20 Thread Michal Suchanek
Building callchain.c with !COMPAT proved quite ugly with all the
defines. Splitting out the 32bit and 64bit parts looks better.

No code change intended.

Signed-off-by: Michal Suchanek 
---
v6:
 - move current_is_64bit consolidation to earlier patch
 - move defines to the top of callchain_32.c
 - Makefile cleanup
v8:
 - fix valid_user_sp
v11:
 - rebase on top of def0bfdbd603
---
 arch/powerpc/perf/Makefile   |   5 +-
 arch/powerpc/perf/callchain.c| 356 +--
 arch/powerpc/perf/callchain.h|  19 ++
 arch/powerpc/perf/callchain_32.c | 196 +
 arch/powerpc/perf/callchain_64.c | 174 +++
 5 files changed, 394 insertions(+), 356 deletions(-)
 create mode 100644 arch/powerpc/perf/callchain.h
 create mode 100644 arch/powerpc/perf/callchain_32.c
 create mode 100644 arch/powerpc/perf/callchain_64.c

diff --git a/arch/powerpc/perf/Makefile b/arch/powerpc/perf/Makefile
index c155dcbb8691..53d614e98537 100644
--- a/arch/powerpc/perf/Makefile
+++ b/arch/powerpc/perf/Makefile
@@ -1,6 +1,9 @@
 # SPDX-License-Identifier: GPL-2.0
 
-obj-$(CONFIG_PERF_EVENTS)  += callchain.o perf_regs.o
+obj-$(CONFIG_PERF_EVENTS)  += callchain.o callchain_$(BITS).o perf_regs.o
+ifdef CONFIG_COMPAT
+obj-$(CONFIG_PERF_EVENTS)  += callchain_32.o
+endif
 
 obj-$(CONFIG_PPC_PERF_CTRS)+= core-book3s.o bhrb.o
 obj64-$(CONFIG_PPC_PERF_CTRS)  += ppc970-pmu.o power5-pmu.o \
diff --git a/arch/powerpc/perf/callchain.c b/arch/powerpc/perf/callchain.c
index b5afd0bec4f8..dd5051015008 100644
--- a/arch/powerpc/perf/callchain.c
+++ b/arch/powerpc/perf/callchain.c
@@ -15,11 +15,9 @@
 #include 
 #include 
 #include 
-#ifdef CONFIG_COMPAT
-#include "../kernel/ppc32.h"
-#endif
 #include 
 
+#include "callchain.h"
 
 /*
  * Is sp valid as the address of the next kernel stack frame after prev_sp?
@@ -102,358 +100,6 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx 
*entry, struct pt_regs *re
}
 }
 
-static inline bool invalid_user_sp(unsigned long sp)
-{
-   unsigned long mask = is_32bit_task() ? 3 : 7;
-   unsigned long top = STACK_TOP - (is_32bit_task() ? 16 : 32);
-
-   return (!sp || (sp & mask) || (sp > top));
-}
-
-#ifdef CONFIG_PPC64
-/*
- * On 64-bit we don't want to invoke hash_page on user addresses from
- * interrupt context, so if the access faults, we read the page tables
- * to find which page (if any) is mapped and access it directly.
- */
-static int read_user_stack_slow(void __user *ptr, void *buf, int nb)
-{
-   int ret = -EFAULT;
-   pgd_t *pgdir;
-   pte_t *ptep, pte;
-   unsigned shift;
-   unsigned long addr = (unsigned long) ptr;
-   unsigned long offset;
-   unsigned long pfn, flags;
-   void *kaddr;
-
-   pgdir = current->mm->pgd;
-   if (!pgdir)
-   return -EFAULT;
-
-   local_irq_save(flags);
-   ptep = find_current_mm_pte(pgdir, addr, NULL, );
-   if (!ptep)
-   goto err_out;
-   if (!shift)
-   shift = PAGE_SHIFT;
-
-   /* align address to page boundary */
-   offset = addr & ((1UL << shift) - 1);
-
-   pte = READ_ONCE(*ptep);
-   if (!pte_present(pte) || !pte_user(pte))
-   goto err_out;
-   pfn = pte_pfn(pte);
-   if (!page_is_ram(pfn))
-   goto err_out;
-
-   /* no highmem to worry about here */
-   kaddr = pfn_to_kaddr(pfn);
-   memcpy(buf, kaddr + offset, nb);
-   ret = 0;
-err_out:
-   local_irq_restore(flags);
-   return ret;
-}
-
-static int read_user_stack_64(unsigned long __user *ptr, unsigned long *ret)
-{
-   if ((unsigned long)ptr > TASK_SIZE - sizeof(unsigned long) ||
-   ((unsigned long)ptr & 7))
-   return -EFAULT;
-
-   if (!probe_user_read(ret, ptr, sizeof(*ret)))
-   return 0;
-
-   return read_user_stack_slow(ptr, ret, 8);
-}
-
-/*
- * 64-bit user processes use the same stack frame for RT and non-RT signals.
- */
-struct signal_frame_64 {
-   chardummy[__SIGNAL_FRAMESIZE];
-   struct ucontext uc;
-   unsigned long   unused[2];
-   unsigned inttramp[6];
-   struct siginfo  *pinfo;
-   void*puc;
-   struct siginfo  info;
-   charabigap[288];
-};
-
-static int is_sigreturn_64_address(unsigned long nip, unsigned long fp)
-{
-   if (nip == fp + offsetof(struct signal_frame_64, tramp))
-   return 1;
-   if (vdso64_rt_sigtramp && current->mm->context.vdso_base &&
-   nip == current->mm->context.vdso_base + vdso64_rt_sigtramp)
-   return 1;
-   return 0;
-}
-
-/*
- * Do some sanity checking on the signal frame pointed to by sp.
- * We check the pinfo and puc pointers in the frame.
- */
-static int sane_signal_64_frame(unsigned long sp)
-{
-   struct signal_frame_64 __user *sf;
-   unsigned long pinfo, puc;
-
-   sf = (struct signal_frame_64 __user *) sp;
-   if 

[PATCH v12 6/8] powerpc/64: Make COMPAT user-selectable disabled on littleendian by default.

2020-03-20 Thread Michal Suchanek
On bigendian ppc64 it is common to have 32bit legacy binaries but much
less so on littleendian.

Signed-off-by: Michal Suchanek 
Reviewed-by: Christophe Leroy 
---
v3: make configurable
---
 arch/powerpc/Kconfig | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 497b7d0b2d7e..29d00b3959b9 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -264,8 +264,9 @@ config PANIC_TIMEOUT
default 180
 
 config COMPAT
-   bool
-   default y if PPC64
+   bool "Enable support for 32bit binaries"
+   depends on PPC64
+   default y if !CPU_LITTLE_ENDIAN
select COMPAT_BINFMT_ELF
select ARCH_WANT_OLD_COMPAT_IPC
select COMPAT_OLD_SIGACTION
-- 
2.23.0



[PATCH v12 5/8] powerpc/64: make buildable without CONFIG_COMPAT

2020-03-20 Thread Michal Suchanek
There are numerous references to 32bit functions in generic and 64bit
code so ifdef them out.

Signed-off-by: Michal Suchanek 
---
v2:
- fix 32bit ifdef condition in signal.c
- simplify the compat ifdef condition in vdso.c - 64bit is redundant
- simplify the compat ifdef condition in callchain.c - 64bit is redundant
v3:
- use IS_ENABLED and maybe_unused where possible
- do not ifdef declarations
- clean up Makefile
v4:
- further makefile cleanup
- simplify is_32bit_task conditions
- avoid ifdef in condition by using return
v5:
- avoid unreachable code on 32bit
- make is_current_64bit constant on !COMPAT
- add stub perf_callchain_user_32 to avoid some ifdefs
v6:
- consolidate current_is_64bit
v7:
- remove leftover perf_callchain_user_32 stub from previous series version
v8:
- fix build again - too trigger-happy with stub removal
- remove a vdso.c hunk that causes warning according to kbuild test robot
v9:
- removed current_is_64bit in previous patch
v10:
- rebase on top of 70ed86f4de5bd
---
 arch/powerpc/include/asm/thread_info.h | 4 ++--
 arch/powerpc/kernel/Makefile   | 6 +++---
 arch/powerpc/kernel/entry_64.S | 2 ++
 arch/powerpc/kernel/signal.c   | 3 +--
 arch/powerpc/kernel/syscall_64.c   | 6 ++
 arch/powerpc/kernel/vdso.c | 3 ++-
 arch/powerpc/perf/callchain.c  | 8 +++-
 7 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/thread_info.h 
b/arch/powerpc/include/asm/thread_info.h
index a2270749b282..ca6c97025704 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -162,10 +162,10 @@ static inline bool test_thread_local_flags(unsigned int 
flags)
return (ti->local_flags & flags) != 0;
 }
 
-#ifdef CONFIG_PPC64
+#ifdef CONFIG_COMPAT
 #define is_32bit_task()(test_thread_flag(TIF_32BIT))
 #else
-#define is_32bit_task()(1)
+#define is_32bit_task()(IS_ENABLED(CONFIG_PPC32))
 #endif
 
 #if defined(CONFIG_PPC64)
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 5700231a8988..98a1c143b613 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -42,16 +42,16 @@ CFLAGS_btext.o += -DDISABLE_BRANCH_PROFILING
 endif
 
 obj-y  := cputable.o ptrace.o syscalls.o \
-  irq.o align.o signal_32.o pmc.o vdso.o \
+  irq.o align.o signal_$(BITS).o pmc.o vdso.o \
   process.o systbl.o idle.o \
   signal.o sysfs.o cacheinfo.o time.o \
   prom.o traps.o setup-common.o \
   udbg.o misc.o io.o misc_$(BITS).o \
   of_platform.o prom_parse.o
-obj-$(CONFIG_PPC64)+= setup_64.o sys_ppc32.o \
-  signal_64.o ptrace32.o \
+obj-$(CONFIG_PPC64)+= setup_64.o \
   paca.o nvram_64.o firmware.o note.o \
   syscall_64.o
+obj-$(CONFIG_COMPAT)   += sys_ppc32.o ptrace32.o signal_32.o
 obj-$(CONFIG_VDSO32)   += vdso32/
 obj-$(CONFIG_PPC_WATCHDOG) += watchdog.o
 obj-$(CONFIG_HAVE_HW_BREAKPOINT)   += hw_breakpoint.o
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 4c0d0400e93d..fe1421e08f09 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -52,8 +52,10 @@
 SYS_CALL_TABLE:
.tc sys_call_table[TC],sys_call_table
 
+#ifdef CONFIG_COMPAT
 COMPAT_SYS_CALL_TABLE:
.tc compat_sys_call_table[TC],compat_sys_call_table
+#endif
 
 /* This value is used to mark exception frames on the stack. */
 exception_marker:
diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
index 4b0152108f61..a264989626fd 100644
--- a/arch/powerpc/kernel/signal.c
+++ b/arch/powerpc/kernel/signal.c
@@ -247,7 +247,6 @@ static void do_signal(struct task_struct *tsk)
sigset_t *oldset = sigmask_to_save();
struct ksignal ksig = { .sig = 0 };
int ret;
-   int is32 = is_32bit_task();
 
BUG_ON(tsk != current);
 
@@ -277,7 +276,7 @@ static void do_signal(struct task_struct *tsk)
 
	rseq_signal_deliver(&ksig, tsk->thread.regs);
 
-   if (is32) {
+   if (is_32bit_task()) {
if (ksig.ka.sa.sa_flags & SA_SIGINFO)
			ret = handle_rt_signal32(&ksig, oldset, tsk);
else
diff --git a/arch/powerpc/kernel/syscall_64.c b/arch/powerpc/kernel/syscall_64.c
index 87d95b455b83..2dcbfe38f5ac 100644
--- a/arch/powerpc/kernel/syscall_64.c
+++ b/arch/powerpc/kernel/syscall_64.c
@@ -24,7 +24,6 @@ notrace long system_call_exception(long r3, long r4, long r5,
   long r6, long r7, long r8,
   unsigned long r0, struct pt_regs *regs)
 {
-   unsigned long 

[PATCH v12 4/8] powerpc/perf: consolidate valid_user_sp -> invalid_user_sp

2020-03-20 Thread Michal Suchanek
Merge the 32bit and 64bit version.

Halve the check constants on 32bit.

Use STACK_TOP since it is defined.

Passing is_64 is now redundant since is_32bit_task() is used to
determine which callchain variant should be used. Use STACK_TOP and
is_32bit_task() directly.

This removes a page from the valid 32bit area on 64bit:
 #define TASK_SIZE_USER32 (0x0000000100000000UL - (1 * PAGE_SIZE))
 #define STACK_TOP_USER32 TASK_SIZE_USER32

Change return value to bool. It is inverted by users anyway.

Change to invalid_user_sp to avoid inverting the return value twice.

Signed-off-by: Michal Suchanek 
---
v8: new patch
v11: simplify by using is_32bit_task()
v12:
 - simplify by precalculating subexpressions
 - change return value to bool
 - remove double inversion
---
 arch/powerpc/perf/callchain.c | 26 ++
 1 file changed, 10 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/perf/callchain.c b/arch/powerpc/perf/callchain.c
index c9a78c6e4361..001d0473a61f 100644
--- a/arch/powerpc/perf/callchain.c
+++ b/arch/powerpc/perf/callchain.c
@@ -102,6 +102,14 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx 
*entry, struct pt_regs *re
}
 }
 
+static inline bool invalid_user_sp(unsigned long sp)
+{
+   unsigned long mask = is_32bit_task() ? 3 : 7;
+   unsigned long top = STACK_TOP - (is_32bit_task() ? 16 : 32);
+
+   return (!sp || (sp & mask) || (sp > top));
+}
+
 #ifdef CONFIG_PPC64
 /*
  * On 64-bit we don't want to invoke hash_page on user addresses from
@@ -161,13 +169,6 @@ static int read_user_stack_64(unsigned long __user *ptr, 
unsigned long *ret)
return read_user_stack_slow(ptr, ret, 8);
 }
 
-static inline int valid_user_sp(unsigned long sp, int is_64)
-{
-   if (!sp || (sp & 7) || sp > (is_64 ? TASK_SIZE : 0x1UL) - 32)
-   return 0;
-   return 1;
-}
-
 /*
  * 64-bit user processes use the same stack frame for RT and non-RT signals.
  */
@@ -226,7 +227,7 @@ static void perf_callchain_user_64(struct 
perf_callchain_entry_ctx *entry,
 
while (entry->nr < entry->max_stack) {
fp = (unsigned long __user *) sp;
-   if (!valid_user_sp(sp, 1) || read_user_stack_64(fp, &next_sp))
+   if (invalid_user_sp(sp) || read_user_stack_64(fp, &next_sp))
return;
	if (level > 0 && read_user_stack_64(&fp[2], &next_ip))
return;
@@ -275,13 +276,6 @@ static inline void perf_callchain_user_64(struct 
perf_callchain_entry_ctx *entry
 {
 }
 
-static inline int valid_user_sp(unsigned long sp, int is_64)
-{
-   if (!sp || (sp & 7) || sp > TASK_SIZE - 32)
-   return 0;
-   return 1;
-}
-
 #define __SIGNAL_FRAMESIZE32   __SIGNAL_FRAMESIZE
 #define sigcontext32   sigcontext
 #define mcontext32 mcontext
@@ -423,7 +417,7 @@ static void perf_callchain_user_32(struct 
perf_callchain_entry_ctx *entry,
 
while (entry->nr < entry->max_stack) {
fp = (unsigned int __user *) (unsigned long) sp;
-   if (!valid_user_sp(sp, 0) || read_user_stack_32(fp, &next_sp))
+   if (invalid_user_sp(sp) || read_user_stack_32(fp, &next_sp))
return;
	if (level > 0 && read_user_stack_32(&fp[1], &next_ip))
return;
-- 
2.23.0



[PATCH v12 3/8] powerpc/perf: consolidate read_user_stack_32

2020-03-20 Thread Michal Suchanek
There are two almost identical copies for 32bit and 64bit.

The function is used only in 32bit code, which will be split out in the next
patch, so consolidate to one function.

Signed-off-by: Michal Suchanek 
Reviewed-by: Christophe Leroy 
---
v6:  new patch
v8:  move the consolidated function out of the ifdef block.
v11: rebase on top of def0bfdbd603
---
 arch/powerpc/perf/callchain.c | 48 +--
 1 file changed, 24 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/perf/callchain.c b/arch/powerpc/perf/callchain.c
index cbc251981209..c9a78c6e4361 100644
--- a/arch/powerpc/perf/callchain.c
+++ b/arch/powerpc/perf/callchain.c
@@ -161,18 +161,6 @@ static int read_user_stack_64(unsigned long __user *ptr, 
unsigned long *ret)
return read_user_stack_slow(ptr, ret, 8);
 }
 
-static int read_user_stack_32(unsigned int __user *ptr, unsigned int *ret)
-{
-   if ((unsigned long)ptr > TASK_SIZE - sizeof(unsigned int) ||
-   ((unsigned long)ptr & 3))
-   return -EFAULT;
-
-   if (!probe_user_read(ret, ptr, sizeof(*ret)))
-   return 0;
-
-   return read_user_stack_slow(ptr, ret, 4);
-}
-
 static inline int valid_user_sp(unsigned long sp, int is_64)
 {
if (!sp || (sp & 7) || sp > (is_64 ? TASK_SIZE : 0x1UL) - 32)
@@ -277,19 +265,9 @@ static void perf_callchain_user_64(struct 
perf_callchain_entry_ctx *entry,
 }
 
 #else  /* CONFIG_PPC64 */
-/*
- * On 32-bit we just access the address and let hash_page create a
- * HPTE if necessary, so there is no need to fall back to reading
- * the page tables.  Since this is called at interrupt level,
- * do_page_fault() won't treat a DSI as a page fault.
- */
-static int read_user_stack_32(unsigned int __user *ptr, unsigned int *ret)
+static int read_user_stack_slow(void __user *ptr, void *buf, int nb)
 {
-   if ((unsigned long)ptr > TASK_SIZE - sizeof(unsigned int) ||
-   ((unsigned long)ptr & 3))
-   return -EFAULT;
-
-   return probe_user_read(ret, ptr, sizeof(*ret));
+   return 0;
 }
 
 static inline void perf_callchain_user_64(struct perf_callchain_entry_ctx 
*entry,
@@ -312,6 +290,28 @@ static inline int valid_user_sp(unsigned long sp, int 
is_64)
 
 #endif /* CONFIG_PPC64 */
 
+/*
+ * On 32-bit we just access the address and let hash_page create a
+ * HPTE if necessary, so there is no need to fall back to reading
+ * the page tables.  Since this is called at interrupt level,
+ * do_page_fault() won't treat a DSI as a page fault.
+ */
+static int read_user_stack_32(unsigned int __user *ptr, unsigned int *ret)
+{
+   int rc;
+
+   if ((unsigned long)ptr > TASK_SIZE - sizeof(unsigned int) ||
+   ((unsigned long)ptr & 3))
+   return -EFAULT;
+
+   rc = probe_user_read(ret, ptr, sizeof(*ret));
+
+   if (IS_ENABLED(CONFIG_PPC64) && rc)
+   return read_user_stack_slow(ptr, ret, 4);
+
+   return rc;
+}
+
 /*
  * Layout for non-RT signal frames
  */
-- 
2.23.0



[PATCH v12 2/8] powerpc: move common register copy functions from signal_32.c to signal.c

2020-03-20 Thread Michal Suchanek
These functions are required for 64bit as well.

Signed-off-by: Michal Suchanek 
Reviewed-by: Christophe Leroy 
---
 arch/powerpc/kernel/signal.c| 141 
 arch/powerpc/kernel/signal_32.c | 140 ---
 2 files changed, 141 insertions(+), 140 deletions(-)

diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
index d215f9554553..4b0152108f61 100644
--- a/arch/powerpc/kernel/signal.c
+++ b/arch/powerpc/kernel/signal.c
@@ -18,12 +18,153 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 
 #include "signal.h"
 
+#ifdef CONFIG_VSX
+unsigned long copy_fpr_to_user(void __user *to,
+  struct task_struct *task)
+{
+   u64 buf[ELF_NFPREG];
+   int i;
+
+   /* save FPR copy to local buffer then write to the thread_struct */
+   for (i = 0; i < (ELF_NFPREG - 1) ; i++)
+   buf[i] = task->thread.TS_FPR(i);
+   buf[i] = task->thread.fp_state.fpscr;
+   return __copy_to_user(to, buf, ELF_NFPREG * sizeof(double));
+}
+
+unsigned long copy_fpr_from_user(struct task_struct *task,
+void __user *from)
+{
+   u64 buf[ELF_NFPREG];
+   int i;
+
+   if (__copy_from_user(buf, from, ELF_NFPREG * sizeof(double)))
+   return 1;
+   for (i = 0; i < (ELF_NFPREG - 1) ; i++)
+   task->thread.TS_FPR(i) = buf[i];
+   task->thread.fp_state.fpscr = buf[i];
+
+   return 0;
+}
+
+unsigned long copy_vsx_to_user(void __user *to,
+  struct task_struct *task)
+{
+   u64 buf[ELF_NVSRHALFREG];
+   int i;
+
+   /* save FPR copy to local buffer then write to the thread_struct */
+   for (i = 0; i < ELF_NVSRHALFREG; i++)
+   buf[i] = task->thread.fp_state.fpr[i][TS_VSRLOWOFFSET];
+   return __copy_to_user(to, buf, ELF_NVSRHALFREG * sizeof(double));
+}
+
+unsigned long copy_vsx_from_user(struct task_struct *task,
+void __user *from)
+{
+   u64 buf[ELF_NVSRHALFREG];
+   int i;
+
+   if (__copy_from_user(buf, from, ELF_NVSRHALFREG * sizeof(double)))
+   return 1;
+   for (i = 0; i < ELF_NVSRHALFREG ; i++)
+   task->thread.fp_state.fpr[i][TS_VSRLOWOFFSET] = buf[i];
+   return 0;
+}
+
+#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
+unsigned long copy_ckfpr_to_user(void __user *to,
+ struct task_struct *task)
+{
+   u64 buf[ELF_NFPREG];
+   int i;
+
+   /* save FPR copy to local buffer then write to the thread_struct */
+   for (i = 0; i < (ELF_NFPREG - 1) ; i++)
+   buf[i] = task->thread.TS_CKFPR(i);
+   buf[i] = task->thread.ckfp_state.fpscr;
+   return __copy_to_user(to, buf, ELF_NFPREG * sizeof(double));
+}
+
+unsigned long copy_ckfpr_from_user(struct task_struct *task,
+ void __user *from)
+{
+   u64 buf[ELF_NFPREG];
+   int i;
+
+   if (__copy_from_user(buf, from, ELF_NFPREG * sizeof(double)))
+   return 1;
+   for (i = 0; i < (ELF_NFPREG - 1) ; i++)
+   task->thread.TS_CKFPR(i) = buf[i];
+   task->thread.ckfp_state.fpscr = buf[i];
+
+   return 0;
+}
+
+unsigned long copy_ckvsx_to_user(void __user *to,
+ struct task_struct *task)
+{
+   u64 buf[ELF_NVSRHALFREG];
+   int i;
+
+   /* save FPR copy to local buffer then write to the thread_struct */
+   for (i = 0; i < ELF_NVSRHALFREG; i++)
+   buf[i] = task->thread.ckfp_state.fpr[i][TS_VSRLOWOFFSET];
+   return __copy_to_user(to, buf, ELF_NVSRHALFREG * sizeof(double));
+}
+
+unsigned long copy_ckvsx_from_user(struct task_struct *task,
+ void __user *from)
+{
+   u64 buf[ELF_NVSRHALFREG];
+   int i;
+
+   if (__copy_from_user(buf, from, ELF_NVSRHALFREG * sizeof(double)))
+   return 1;
+   for (i = 0; i < ELF_NVSRHALFREG ; i++)
+   task->thread.ckfp_state.fpr[i][TS_VSRLOWOFFSET] = buf[i];
+   return 0;
+}
+#endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
+#else
+inline unsigned long copy_fpr_to_user(void __user *to,
+ struct task_struct *task)
+{
+   return __copy_to_user(to, task->thread.fp_state.fpr,
+ ELF_NFPREG * sizeof(double));
+}
+
+inline unsigned long copy_fpr_from_user(struct task_struct *task,
+   void __user *from)
+{
+   return __copy_from_user(task->thread.fp_state.fpr, from,
+ ELF_NFPREG * sizeof(double));
+}
+
+#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
+inline unsigned long copy_ckfpr_to_user(void __user *to,
+struct task_struct *task)
+{
+   return __copy_to_user(to, task->thread.ckfp_state.fpr,
+ ELF_NFPREG * sizeof(double));
+}
+

[PATCH v12 0/8] Disable compat cruft on ppc64le v12

2020-03-20 Thread Michal Suchanek
Less code means fewer bugs, so add a knob to skip the compat stuff.

Changes in v2: saner CONFIG_COMPAT ifdefs
Changes in v3:
 - change llseek to 32bit instead of building it unconditionally in fs
 - cleanup the makefile conditionals
 - remove some ifdefs or convert to IS_DEFINED where possible
Changes in v4:
 - cleanup is_32bit_task and current_is_64bit
 - more makefile cleanup
Changes in v5:
 - more current_is_64bit cleanup
 - split off callchain.c 32bit and 64bit parts
Changes in v6:
 - cleanup makefile after split
 - consolidate read_user_stack_32
 - fix some checkpatch warnings
Changes in v7:
 - add back __ARCH_WANT_SYS_LLSEEK to fix build with llseek
 - remove leftover hunk
 - add review tags
Changes in v8:
 - consolidate valid_user_sp to fix it in the split callchain.c
 - fix build errors/warnings with PPC64 !COMPAT and PPC32
Changes in v9:
 - remove current_is_64bit()
Changes in v10:
 - rebase, sent together with the syscall cleanup
Changes in v11:
 - rebase
 - add MAINTAINERS pattern for ppc perf
Changes in v12:
 - simplify valid_user_sp and change to invalid_user_sp
 - remove superfluous perf patterns in MAINTAINERS

Michal Suchanek (8):
  powerpc: Add back __ARCH_WANT_SYS_LLSEEK macro
  powerpc: move common register copy functions from signal_32.c to
signal.c
  powerpc/perf: consolidate read_user_stack_32
  powerpc/perf: consolidate valid_user_sp -> invalid_user_sp
  powerpc/64: make buildable without CONFIG_COMPAT
  powerpc/64: Make COMPAT user-selectable disabled on littleendian by
default.
  powerpc/perf: split callchain.c by bitness
  MAINTAINERS: perf: Add pattern that matches ppc perf to the perf
entry.

 MAINTAINERS|   6 +-
 arch/powerpc/Kconfig   |   5 +-
 arch/powerpc/include/asm/thread_info.h |   4 +-
 arch/powerpc/include/asm/unistd.h  |   1 +
 arch/powerpc/kernel/Makefile   |   6 +-
 arch/powerpc/kernel/entry_64.S |   2 +
 arch/powerpc/kernel/signal.c   | 144 +-
 arch/powerpc/kernel/signal_32.c| 140 --
 arch/powerpc/kernel/syscall_64.c   |   6 +-
 arch/powerpc/kernel/vdso.c |   3 +-
 arch/powerpc/perf/Makefile |   5 +-
 arch/powerpc/perf/callchain.c  | 356 +
 arch/powerpc/perf/callchain.h  |  19 ++
 arch/powerpc/perf/callchain_32.c   | 196 ++
 arch/powerpc/perf/callchain_64.c   | 174 
 fs/read_write.c|   3 +-
 16 files changed, 556 insertions(+), 514 deletions(-)
 create mode 100644 arch/powerpc/perf/callchain.h
 create mode 100644 arch/powerpc/perf/callchain_32.c
 create mode 100644 arch/powerpc/perf/callchain_64.c

-- 
2.23.0



[PATCH v12 1/8] powerpc: Add back __ARCH_WANT_SYS_LLSEEK macro

2020-03-20 Thread Michal Suchanek
This partially reverts commit caf6f9c8a326 ("asm-generic: Remove
unneeded __ARCH_WANT_SYS_LLSEEK macro")

When CONFIG_COMPAT is disabled on ppc64 the kernel does not build.

There is resistance to both removing the llseek syscall from the 64bit
syscall tables and building the llseek interface unconditionally.

Link: https://lore.kernel.org/lkml/20190828151552.ga16...@infradead.org/
Link: https://lore.kernel.org/lkml/20190829214319.498c7de2@naga/

Signed-off-by: Michal Suchanek 
Reviewed-by: Arnd Bergmann 
---
v7: new patch
---
 arch/powerpc/include/asm/unistd.h | 1 +
 fs/read_write.c   | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/unistd.h 
b/arch/powerpc/include/asm/unistd.h
index b0720c7c3fcf..700fcdac2e3c 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -31,6 +31,7 @@
 #define __ARCH_WANT_SYS_SOCKETCALL
 #define __ARCH_WANT_SYS_FADVISE64
 #define __ARCH_WANT_SYS_GETPGRP
+#define __ARCH_WANT_SYS_LLSEEK
 #define __ARCH_WANT_SYS_NICE
 #define __ARCH_WANT_SYS_OLD_GETRLIMIT
 #define __ARCH_WANT_SYS_OLD_UNAME
diff --git a/fs/read_write.c b/fs/read_write.c
index 59d819c5b92e..bbfa9b12b15e 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -331,7 +331,8 @@ COMPAT_SYSCALL_DEFINE3(lseek, unsigned int, fd, 
compat_off_t, offset, unsigned i
 }
 #endif
 
-#if !defined(CONFIG_64BIT) || defined(CONFIG_COMPAT)
+#if !defined(CONFIG_64BIT) || defined(CONFIG_COMPAT) || \
+   defined(__ARCH_WANT_SYS_LLSEEK)
 SYSCALL_DEFINE5(llseek, unsigned int, fd, unsigned long, offset_high,
unsigned long, offset_low, loff_t __user *, result,
unsigned int, whence)
-- 
2.23.0



Re: [RFC 1/2] mm, slub: prevent kmalloc_node crashes and memory leaks

2020-03-20 Thread Srikar Dronamraju
* Vlastimil Babka  [2020-03-20 09:43:11]:

> On 3/20/20 8:46 AM, Srikar Dronamraju wrote:
> > * Vlastimil Babka  [2020-03-19 15:10:19]:
> > 
> >> On 3/19/20 3:05 PM, Srikar Dronamraju wrote:
> >> > * Vlastimil Babka  [2020-03-19 14:47:58]:
> >> > 
> >> 
> >> No, but AFAICS, such node values are already handled in ___slab_alloc, and
> >> cannot reach get_partial(). If you see something I missed, please do tell.
> >> 
> > 
> > Ah I probably got confused with your previous version where
> > alloc_slab_page() was modified. I see no problems with this version.
> 
> Thanks!
> 
> > Sorry for the noise.
> 
> No problem.
> 
> > A question just for my better understanding,
> > How worse would it be to set node to numa_mem_id() instead of NUMA_NODE_ID
> > when the current node is !N_NORMAL_MEMORY?
> 

Yes,

> (I'm assuming you mean s/NUMA_NODE_ID/NUMA_NO_NODE/)
> 
> Well, numa_mem_id() should work too, but it would make the allocation
> constrained to the node of current cpu, with all the consequences 
> (deactivating
> percpu slab if it was from a different node etc).
> 
> There's no reason why this cpu's node should be the closest node to the one 
> that
> was originally requested (but has no memory), so it's IMO pointless or even
> suboptimal to constraint to it. This can be revisited in case we get 
> guaranteed
> existence of node data with zonelists for all possible nodes, but for now
> NUMA_NO_NODE seems the most reasonable fix to me.
> 

Okay.


-- 
Thanks and Regards
Srikar Dronamraju



Re: [PATCH v3 0/8] mm/memory_hotplug: allow to specify a default online_type

2020-03-20 Thread Baoquan He
On 03/19/20 at 02:12pm, David Hildenbrand wrote:
> Distributions nowadays use udev rules ([1] [2]) to specify if and
> how to online hotplugged memory. The rules seem to get more complex with
> many special cases. Due to the various special cases,
> CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE cannot be used. All memory hotplug
> is handled via udev rules.
> 
> Everytime we hotplug memory, the udev rule will come to the same
> conclusion. Especially Hyper-V (but also soon virtio-mem) add a lot of
> memory in separate memory blocks and wait for memory to get onlined by user
> space before continuing to add more memory blocks (to not add memory faster
> than it is getting onlined). This of course slows down the whole memory
> hotplug process.
> 
> To make the job of distributions easier and to avoid udev rules that get
> more and more complicated, let's extend the mechanism provided by
> - /sys/devices/system/memory/auto_online_blocks
> - "memhp_default_state=" on the kernel cmdline
> to be able to specify also "online_movable" as well as "online_kernel"
> 
> v2 -> v3:
> - "hv_balloon: don't check for memhp_auto_online manually"
> -- init_completion() before register_memory_notifier()
> - Minor typo fix
> 
> v1 -> v2:
> - Tweaked some patch descriptions
> - Added
> -- "powernv/memtrace: always online added memory blocks"
> -- "hv_balloon: don't check for memhp_auto_online manually"
> -- "mm/memory_hotplug: unexport memhp_auto_online"
> - "mm/memory_hotplug: convert memhp_auto_online to store an online_type"
> -- No longer touches hv/memtrace code

Ack the series.

Reviewed-by: Baoquan He 



Re: [linux-next/mainline][bisected 3acac06][ppc] Oops when unloading mpt3sas driver

2020-03-20 Thread Abdul Haleem
On Tue, 2020-02-25 at 12:23 +0530, Sreekanth Reddy wrote:
> On Tue, Feb 25, 2020 at 11:51 AM Abdul Haleem
>  wrote:
> >
> > On Fri, 2020-01-17 at 18:21 +0530, Abdul Haleem wrote:
> > > On Thu, 2020-01-16 at 09:44 -0800, Christoph Hellwig wrote:
> > > > Hi Abdul,
> > > >
> > > > I think the problem is that mpt3sas has some convoluted logic to do
> > > > some DMA allocations with a 32-bit coherent mask, and then switches
> > > > to a 63 or 64 bit mask, which is not supported by the DMA API.
> > > >
> > > > Can you try the patch below?
> > >
> > > Thank you Christoph, with the given patch applied the bug is not seen.
> > >
> > > rmmod of mpt3sas driver is successful, no kernel Oops
> > >
> > > Reported-and-tested-by: Abdul Haleem 
> >
> > Hi Christoph,
> >
> > I see the patch is under discussion, will this be merged upstream any
> > time soon ? as boot is broken on our machines with out your patch.
> >
> 
> Hi Abdul,
> 
> We have posted a new set of patches to fix this issue. This patch set
> won't change the DMA Mask on the fly and also won't hardcode the DMA
> mask to 32 bit.
> 
> [PATCH 0/5] mpt3sas: Fix changing coherent mask after allocation.
> 
> This patchset will have below patches, Please review and try with this
> patch set.
> 
> Suganath Prabu S (5):
>   mpt3sas: Don't change the dma coherent mask after  allocations
>   mpt3sas: Rename function name is_MSB_are_same
>   mpt3sas: Code Refactoring.
>   mpt3sas: Handle RDPQ DMA allocation in same 4g region
>   mpt3sas: Update version to 33.101.00.00

Hi Suganath, 

The above patch fixes the issue, driver is loading and unloading with no
kernel oops. 

Reported-and-tested-by: Abdul Haleem 

-- 
Regard's

Abdul Haleem
IBM Linux Technology Centre





Re: [PATCH v3 3/8] drivers/base/memory: store mapping between MMOP_* and string in an array

2020-03-20 Thread Baoquan He
On 03/20/20 at 10:50am, David Hildenbrand wrote:
> On 20.03.20 08:36, Baoquan He wrote:
> > On 03/19/20 at 02:12pm, David Hildenbrand wrote:
> >> Let's use a simple array which we can reuse soon. While at it, move the
> >> string->mmop conversion out of the device hotplug lock.
> >>
> >> Reviewed-by: Wei Yang 
> >> Acked-by: Michal Hocko 
> >> Cc: Greg Kroah-Hartman 
> >> Cc: Andrew Morton 
> >> Cc: Michal Hocko 
> >> Cc: Oscar Salvador 
> >> Cc: "Rafael J. Wysocki" 
> >> Cc: Baoquan He 
> >> Cc: Wei Yang 
> >> Signed-off-by: David Hildenbrand 
> >> ---
> >>  drivers/base/memory.c | 38 +++---
> >>  1 file changed, 23 insertions(+), 15 deletions(-)
> >>
> >> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> >> index e7e77cafef80..8a7f29c0bf97 100644
> >> --- a/drivers/base/memory.c
> >> +++ b/drivers/base/memory.c
> >> @@ -28,6 +28,24 @@
> >>  
> >>  #define MEMORY_CLASS_NAME "memory"
> >>  
> >> +static const char *const online_type_to_str[] = {
> >> +  [MMOP_OFFLINE] = "offline",
> >> +  [MMOP_ONLINE] = "online",
> >> +  [MMOP_ONLINE_KERNEL] = "online_kernel",
> >> +  [MMOP_ONLINE_MOVABLE] = "online_movable",
> >> +};
> >> +
> >> +static int memhp_online_type_from_str(const char *str)
> >> +{
> >> +  int i;
> > 
> > I would change it as: 
> > 
> > for (int i = 0; i < ARRAY_SIZE(online_type_to_str); i++) {
> > 
> 
> That's not allowed by the C90 standard (and -std=gnu89).
> 
> $ gcc main.c -std=gnu89
> main.c: In function 'main':
> main.c:3:2: error: 'for' loop initial declarations are only allowed in
> C99 or C11 mode
> 3 |  for (int i = 0; i < 8; i++) {
>   |  ^~~

Good to know, thanks.

> 
> One of the reasons why
>   git grep "for (int "
> 
> will result in very little hits (IOW, only 5 in driver code only).
> 
> -- 
> Thanks,
> 
> David / dhildenb



Re: [PATCH] powerpc/pseries: Fix MCE handling on pseries

2020-03-20 Thread Ganesh



On 3/20/20 8:11 AM, Nicholas Piggin wrote:

Ganesh's on March 18, 2020 12:35 am:


On 3/17/20 3:31 PM, Nicholas Piggin wrote:

Ganesh's on March 16, 2020 9:47 pm:

On 3/14/20 9:18 AM, Nicholas Piggin wrote:

Ganesh Goudar's on March 14, 2020 12:04 am:

MCE handling on pSeries platform fails as recent rework to use common
code for pSeries and PowerNV in machine check error handling tries to
access per-cpu variables in realmode. The per-cpu variables may be
outside the RMO region on pSeries platform and needs translation to be
enabled for access. Just moving these per-cpu variables into the RMO region
didn't help because we queue some work to workqueues in real mode, which
again tries to touch per-cpu variables.

Which queues are these? We should not be using Linux workqueues, but the
powerpc mce code which uses irq_work.

Yes, the irq work queue accesses memory outside the RMO.
irq_work_queue()->__irq_work_queue_local()->[this_cpu_ptr(_list) | 
this_cpu_ptr(_list)]

Hmm, okay.


Also fwnmi_release_errinfo()
cannot be called when translation is not enabled.

Why not?

It crashes when we try to get the RTAS token for the "ibm,nmi-interlock" device
tree node. But yes, we can avoid it by storing the rtas_token somewhere; I haven't
tried that. Here is the backtrace I got when fwnmi_release_errinfo() was called
from the realmode handler.

Okay, I actually had problems with that messing up soft-irq state too
and so I sent a patch to get rid of it, but that's the least of your
problems really.


This patch fixes this by enabling translation in the exception handler
when all required real mode handling is done. This change only affects
the pSeries platform.

Not supposed to do this, because we might not be in a state
where the MMU is ready to be turned on at this point.

I'd like to understand better which accesses are a problem, and whether
we can fix them all to be in the RMO.

I faced three such access problems:
* accessing per-cpu data (like mce_event, mce_event_queue and
  mce_ue_event_queue); we can move this inside the RMO.
* calling fwnmi_release_errinfo().
* And queuing work to irq_work_queue, not sure how to fix this.

Yeah. The irq_work_queue one is the biggest problem.

This code "worked" prior to the series unifying pseries and powernv
machine check handlers, 9ca766f9891d ("powerpc/64s/pseries: machine
check convert to use common event code") and friends. But it does in
basically the same way as your fix (i.e., it runs this early handler
in virtual mode), but that's not really the right fix.

Consider: you get a SLB multi hit on a kernel address due to hardware or
software error. That access causes a MCE, but before the error can be
decode to save and flush the SLB, you turn on relocation and that
causes another SLB multi hit...

We turn on relocation only after all the realmode handling/recovery is done
like SLB flush and reload, All we do after we turn relocation on is saving
mce event to array and queuing the work to irq_workqueue.

Oh I see, fwnmi_release_errinfo is done after mce_handle_error, I didn't
read your comment closely!

That means the recovery is done with MSR[ME]=0, which means saving the
SLB entries can take a machine check which will turn into a checkstop,
or walking user page tables and loading memory to handle memory
failures.

We really should release that immediately so we get ME back on.


So we are good to turn it on here.

Possibly. I don't think it's generally a good idea to enable relocation
from an interrupted relocation-off context, but yeah this might be okay.

I think FWNMI mce needs to be fixed to not do this, and do the
nmi-interlock earlier, but for now your patch I guess improves things
significantly. So, okay let's go with it.

You should be able to just use mtmsrd to switch to virtual mode, so no
need for the asm code.

   mtmsr(mfmsr()|MSR_IR|MSR_DR);


Sure, Thanks



Otherwise,

Reviewed-by: Nicholas Piggin 

Thanks,
Nick




Re: [PATCH v3 3/8] drivers/base/memory: store mapping between MMOP_* and string in an array

2020-03-20 Thread David Hildenbrand
On 20.03.20 08:36, Baoquan He wrote:
> On 03/19/20 at 02:12pm, David Hildenbrand wrote:
>> Let's use a simple array which we can reuse soon. While at it, move the
>> string->mmop conversion out of the device hotplug lock.
>>
>> Reviewed-by: Wei Yang 
>> Acked-by: Michal Hocko 
>> Cc: Greg Kroah-Hartman 
>> Cc: Andrew Morton 
>> Cc: Michal Hocko 
>> Cc: Oscar Salvador 
>> Cc: "Rafael J. Wysocki" 
>> Cc: Baoquan He 
>> Cc: Wei Yang 
>> Signed-off-by: David Hildenbrand 
>> ---
>>  drivers/base/memory.c | 38 +++---
>>  1 file changed, 23 insertions(+), 15 deletions(-)
>>
>> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
>> index e7e77cafef80..8a7f29c0bf97 100644
>> --- a/drivers/base/memory.c
>> +++ b/drivers/base/memory.c
>> @@ -28,6 +28,24 @@
>>  
>>  #define MEMORY_CLASS_NAME   "memory"
>>  
>> +static const char *const online_type_to_str[] = {
>> +[MMOP_OFFLINE] = "offline",
>> +[MMOP_ONLINE] = "online",
>> +[MMOP_ONLINE_KERNEL] = "online_kernel",
>> +[MMOP_ONLINE_MOVABLE] = "online_movable",
>> +};
>> +
>> +static int memhp_online_type_from_str(const char *str)
>> +{
>> +int i;
> 
> I would change it as: 
> 
>   for (int i = 0; i < ARRAY_SIZE(online_type_to_str); i++) {
> 

That's not allowed by the C90 standard (and -std=gnu89).

$ gcc main.c -std=gnu89
main.c: In function 'main':
main.c:3:2: error: 'for' loop initial declarations are only allowed in
C99 or C11 mode
3 |  for (int i = 0; i < 8; i++) {
  |  ^~~

One of the reasons why
git grep "for (int "

will result in very little hits (IOW, only 5 in driver code only).
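
For completeness, a gnu89-compatible form of the proposed loop would look like
the sketch below. This is only a sketch: the loop body (the sysfs_streq()
comparison and the -EINVAL fallback) is assumed here, since the quoted hunk is
cut off right after the declaration.

	static int memhp_online_type_from_str(const char *str)
	{
		int i;

		for (i = 0; i < ARRAY_SIZE(online_type_to_str); i++) {
			if (sysfs_streq(str, online_type_to_str[i]))
				return i;
		}
		return -EINVAL;
	}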

-- 
Thanks,

David / dhildenb



[PATCH 4/5] ia64: Remove mm.h from asm/uaccess.h

2020-03-20 Thread Sebastian Andrzej Siewior
The defconfig compiles without linux/mm.h. With mm.h included the
include chain leads to:
|   CC  kernel/locking/percpu-rwsem.o
| In file included from include/linux/huge_mm.h:8,
|  from include/linux/mm.h:567,
|  from arch/ia64/include/asm/uaccess.h:,
|  from include/linux/uaccess.h:11,
|  from include/linux/sched/task.h:11,
|  from include/linux/sched/signal.h:9,
|  from include/linux/rcuwait.h:6,
|  from include/linux/percpu-rwsem.h:8,
|  from kernel/locking/percpu-rwsem.c:6:
| include/linux/fs.h:1422:29: error: array type has incomplete element type 
'struct percpu_rw_semaphore'
|  1422 |  struct percpu_rw_semaphore rw_sem[SB_FREEZE_LEVELS];

once rcuwait.h includes linux/sched/signal.h.

Remove the linux/mm.h include.

Cc: Tony Luck 
Cc: Fenghua Yu 
Cc: linux-i...@vger.kernel.org
Reported-by: kbuild test robot 
Signed-off-by: Sebastian Andrzej Siewior 
---
 arch/ia64/include/asm/uaccess.h | 1 -
 arch/ia64/kernel/process.c  | 1 +
 arch/ia64/mm/ioremap.c  | 1 +
 3 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/ia64/include/asm/uaccess.h b/arch/ia64/include/asm/uaccess.h
index 89782ad3fb887..5c7e79eccaeed 100644
--- a/arch/ia64/include/asm/uaccess.h
+++ b/arch/ia64/include/asm/uaccess.h
@@ -35,7 +35,6 @@
 
 #include 
 #include 
-#include 
 
 #include 
 #include 
diff --git a/arch/ia64/kernel/process.c b/arch/ia64/kernel/process.c
index 968b5f33e725e..743aaf5283278 100644
--- a/arch/ia64/kernel/process.c
+++ b/arch/ia64/kernel/process.c
@@ -681,3 +681,4 @@ machine_power_off (void)
machine_halt();
 }
 
+EXPORT_SYMBOL(ia64_delay_loop);
diff --git a/arch/ia64/mm/ioremap.c b/arch/ia64/mm/ioremap.c
index a09cfa0645369..55fd3eb753ff9 100644
--- a/arch/ia64/mm/ioremap.c
+++ b/arch/ia64/mm/ioremap.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
-- 
2.26.0.rc2



[PATCH 1/5] nds32: Remove mm.h from asm/uaccess.h

2020-03-20 Thread Sebastian Andrzej Siewior
The defconfig compiles without linux/mm.h. With mm.h included the
include chain leads to:
|   CC  kernel/locking/percpu-rwsem.o
| In file included from include/linux/huge_mm.h:8,
|  from include/linux/mm.h:567,
|  from arch/nds32/include/asm/uaccess.h:,
|  from include/linux/uaccess.h:11,
|  from include/linux/sched/task.h:11,
|  from include/linux/sched/signal.h:9,
|  from include/linux/rcuwait.h:6,
|  from include/linux/percpu-rwsem.h:8,
|  from kernel/locking/percpu-rwsem.c:6:
| include/linux/fs.h:1422:29: error: array type has incomplete element type 
'struct percpu_rw_semaphore'
|  1422 |  struct percpu_rw_semaphore rw_sem[SB_FREEZE_LEVELS];

once rcuwait.h includes linux/sched/signal.h.

Remove the linux/mm.h include.

Cc: Nick Hu 
Cc: Greentime Hu 
Cc: Vincent Chen 
Reported-by: kbuild test robot 
Signed-off-by: Sebastian Andrzej Siewior 
---
 arch/nds32/include/asm/uaccess.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/nds32/include/asm/uaccess.h b/arch/nds32/include/asm/uaccess.h
index 8916ad9f9f139..3a9219f53ee0d 100644
--- a/arch/nds32/include/asm/uaccess.h
+++ b/arch/nds32/include/asm/uaccess.h
@@ -11,7 +11,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #define __asmeq(x, y)  ".ifnc " x "," y " ; .err ; .endif\n\t"
 
-- 
2.26.0.rc2



[PATCH 2/5] csky: Remove mm.h from asm/uaccess.h

2020-03-20 Thread Sebastian Andrzej Siewior
The defconfig compiles without linux/mm.h. With mm.h included the
include chain leads to:
|   CC  kernel/locking/percpu-rwsem.o
| In file included from include/linux/huge_mm.h:8,
|  from include/linux/mm.h:567,
|  from arch/csky/include/asm/uaccess.h:,
|  from include/linux/uaccess.h:11,
|  from include/linux/sched/task.h:11,
|  from include/linux/sched/signal.h:9,
|  from include/linux/rcuwait.h:6,
|  from include/linux/percpu-rwsem.h:8,
|  from kernel/locking/percpu-rwsem.c:6:
| include/linux/fs.h:1422:29: error: array type has incomplete element type 
'struct percpu_rw_semaphore'
|  1422 |  struct percpu_rw_semaphore rw_sem[SB_FREEZE_LEVELS];

once rcuwait.h includes linux/sched/signal.h.

Remove the linux/mm.h include.

Cc: Guo Ren 
Cc: linux-c...@vger.kernel.org
Reported-by: kbuild test robot 
Signed-off-by: Sebastian Andrzej Siewior 
---
 arch/csky/include/asm/uaccess.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/csky/include/asm/uaccess.h b/arch/csky/include/asm/uaccess.h
index eaa1c3403a424..abefa125b93cf 100644
--- a/arch/csky/include/asm/uaccess.h
+++ b/arch/csky/include/asm/uaccess.h
@@ -11,7 +11,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
-- 
2.26.0.rc2



[PATCH 3/5] hexagon: Remove mm.h from asm/uaccess.h

2020-03-20 Thread Sebastian Andrzej Siewior
The defconfig compiles without linux/mm.h. With mm.h included the
include chain leads to:
|   CC  kernel/locking/percpu-rwsem.o
| In file included from include/linux/huge_mm.h:8,
|  from include/linux/mm.h:567,
|  from arch/hexagon/include/asm/uaccess.h:,
|  from include/linux/uaccess.h:11,
|  from include/linux/sched/task.h:11,
|  from include/linux/sched/signal.h:9,
|  from include/linux/rcuwait.h:6,
|  from include/linux/percpu-rwsem.h:8,
|  from kernel/locking/percpu-rwsem.c:6:
| include/linux/fs.h:1422:29: error: array type has incomplete element type 
'struct percpu_rw_semaphore'
|  1422 |  struct percpu_rw_semaphore rw_sem[SB_FREEZE_LEVELS];

once rcuwait.h includes linux/sched/signal.h.

Remove the linux/mm.h include.

Cc: Brian Cain 
Cc: linux-hexa...@vger.kernel.org
Reported-by: kbuild test robot 
Signed-off-by: Sebastian Andrzej Siewior 
---
 arch/hexagon/include/asm/uaccess.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/hexagon/include/asm/uaccess.h 
b/arch/hexagon/include/asm/uaccess.h
index 00cb38faad0c4..c1019a736ff13 100644
--- a/arch/hexagon/include/asm/uaccess.h
+++ b/arch/hexagon/include/asm/uaccess.h
@@ -10,7 +10,6 @@
 /*
  * User space memory access functions
  */
-#include 
 #include 
 
 /*
-- 
2.26.0.rc2



[PATCH 0/5] Remove mm.h from arch/*/include/asm/uaccess.h

2020-03-20 Thread Sebastian Andrzej Siewior
The following mini-series removes linux/mm.h from the asm/uaccess.h of
the individual architectures. The series has been compile tested with the
defconfig and additionally for ia64 with the "special" allmodconfig
supplied by the bot. The regular allmod for the architectures does not
compile (even without the series).

Sebastian




[PATCH 5/5] microblaze: Remove mm.h from asm/uaccess.h

2020-03-20 Thread Sebastian Andrzej Siewior
The defconfig compiles without linux/mm.h. With mm.h included the
include chain leads to:
|   CC  kernel/locking/percpu-rwsem.o
| In file included from include/linux/huge_mm.h:8,
|  from include/linux/mm.h:567,
|  from arch/microblaze/include/asm/uaccess.h:,
|  from include/linux/uaccess.h:11,
|  from include/linux/sched/task.h:11,
|  from include/linux/sched/signal.h:9,
|  from include/linux/rcuwait.h:6,
|  from include/linux/percpu-rwsem.h:8,
|  from kernel/locking/percpu-rwsem.c:6:
| include/linux/fs.h:1422:29: error: array type has incomplete element type 
'struct percpu_rw_semaphore'
|  1422 |  struct percpu_rw_semaphore rw_sem[SB_FREEZE_LEVELS];

once rcuwait.h includes linux/sched/signal.h.

Remove the linux/mm.h include.

Cc: Michal Simek 
Reported-by: kbuild test robot 
Signed-off-by: Sebastian Andrzej Siewior 
---
 arch/microblaze/include/asm/uaccess.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/microblaze/include/asm/uaccess.h 
b/arch/microblaze/include/asm/uaccess.h
index a1f206b90753a..4916d5fbea5e3 100644
--- a/arch/microblaze/include/asm/uaccess.h
+++ b/arch/microblaze/include/asm/uaccess.h
@@ -12,7 +12,6 @@
 #define _ASM_MICROBLAZE_UACCESS_H
 
 #include 
-#include 
 
 #include 
 #include 
-- 
2.26.0.rc2



Re: [PATCH v2] libnvdimm: Update persistence domain value for of_pmem and papr_scm device

2020-03-20 Thread Aneesh Kumar K.V


Hi Dan,


Dan Williams  writes:

...


>
>>
>> Or are you suggesting that application should not infer any of those
>> details looking at persistence_domain value? If so what is the purpose
>> of exporting that attribute?
>
> The way the patch was worded I thought it was referring to an explicit
> mechanism outside cpu cache flushes, i.e. a mechanism that required a
> driver call.
>

This patch is blocked because I am not expressing the details correctly.
I updated this as below. Can you suggest if this is ok? If not, what
alternate wording do you suggest to document "memory controller"?


commit 329b46e88f8cd30eee4776b0de7913ab4d496bd8
Author: Aneesh Kumar K.V 
Date:   Wed Dec 18 13:53:16 2019 +0530

libnvdimm: Update persistence domain value for of_pmem and papr_scm device

Currently, the kernel shows the below values:
"persistence_domain":"cpu_cache"
"persistence_domain":"memory_controller"
"persistence_domain":"unknown"

"cpu_cache" indicates that no extra instructions are needed to ensure the
persistence of data in the pmem media on power failure.

"memory_controller" indicates that cpu cache flush instructions are required
to flush the data. The platform provides mechanisms to automatically flush
outstanding write data from the memory controller to pmem on system power loss.

Based on the above, use memory_controller for non-volatile regions on ppc64.

Signed-off-by: Aneesh Kumar K.V 

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 0b4467e378e5..922a4fc3b61b 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -361,8 +361,10 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
 
if (p->is_volatile)
		p->region = nvdimm_volatile_region_create(p->bus, &ndr_desc);
-   else
+   else {
+       set_bit(ND_REGION_PERSIST_MEMCTRL, &ndr_desc.flags);
		p->region = nvdimm_pmem_region_create(p->bus, &ndr_desc);
+   }
if (!p->region) {
dev_err(dev, "Error registering region %pR from %pOF\n",
ndr_desc.res, p->dn);
diff --git a/drivers/nvdimm/of_pmem.c b/drivers/nvdimm/of_pmem.c
index 8224d1431ea9..6826a274a1f1 100644
--- a/drivers/nvdimm/of_pmem.c
+++ b/drivers/nvdimm/of_pmem.c
@@ -62,8 +62,10 @@ static int of_pmem_region_probe(struct platform_device *pdev)
 
if (is_volatile)
		region = nvdimm_volatile_region_create(bus, &ndr_desc);
-   else
+   else {
+       set_bit(ND_REGION_PERSIST_MEMCTRL, &ndr_desc.flags);
		region = nvdimm_pmem_region_create(bus, &ndr_desc);
+   }
 
if (!region)
		dev_warn(&pdev->dev, "Unable to register region %pR from %pOF\n",



Re: [PATCH 19/15] sched/swait: Reword some of the main description

2020-03-20 Thread Sebastian Andrzej Siewior
On 2020-03-20 01:55:27 [-0700], Davidlohr Bueso wrote:
> diff --git a/include/linux/swait.h b/include/linux/swait.h
> index 73e06e9986d4..6e5b5d0e64fd 100644
> --- a/include/linux/swait.h
> +++ b/include/linux/swait.h
> @@ -39,7 +26,7 @@
>   *sleeper state.
>   *
>   *  - the !exclusive mode; because that leads to O(n) wakeups, everything is
> - *exclusive.
> + *exclusive. As such swait_wake_up_one will only ever awake _one_ waiter.
swake_up_one()

>   *  - custom wake callback functions; because you cannot give any guarantees
>   *about random code. This also allows swait to be used in RT, such that

Sebastian


Re: [PATCH 17/15] rcuwait: Inform rcuwait_wake_up() users if a wakeup was attempted

2020-03-20 Thread Sebastian Andrzej Siewior
On 2020-03-20 01:55:25 [-0700], Davidlohr Bueso wrote:
> Let the caller know if wake_up_process() was actually called or not;
> some users can use this information for ad-hoc. Of course returning
> true does not guarantee that wake_up_process() actually woke anything
> up.

Wouldn't it make sense to return wake_up_process() return value to know
if a change of state occurred or not?

Sebastian


Re: [patch V2 11/15] completion: Use simple wait queues

2020-03-20 Thread Davidlohr Bueso

On Wed, 18 Mar 2020, Thomas Gleixner wrote:


From: Thomas Gleixner 

completion uses a wait_queue_head_t to enqueue waiters.

wait_queue_head_t contains a spinlock_t to protect the list of waiters
which excludes it from being used in truly atomic context on a PREEMPT_RT
enabled kernel.

The spinlock in the wait queue head cannot be replaced by a raw_spinlock
because:

 - wait queues can have custom wakeup callbacks, which acquire other
   spinlock_t locks and have potentially long execution times

 - wake_up() walks an unbounded number of list entries during the wake up
   and may wake an unbounded number of waiters.

For simplicity and performance reasons complete() should be usable on
PREEMPT_RT enabled kernels.

completions do not use custom wakeup callbacks and are usually single
waiter, except for a few corner cases.

Replace the wait queue in the completion with a simple wait queue (swait),
which uses a raw_spinlock_t for protecting the waiter list and therefore is
safe to use inside truly atomic regions on PREEMPT_RT.

There is no semantical or functional change:

 - completions use the exclusive wait mode which is what swait provides

 - complete() wakes one exclusive waiter

 - complete_all() wakes all waiters while holding the lock which protects
   the wait queue against newly incoming waiters. The conversion to swait
   preserves this behaviour.

complete_all() might cause unbound latencies with a large number of waiters
being woken at once, but most complete_all() usage sites are either in
testing or initialization code or have only a really small number of
concurrent waiters which for now does not cause a latency problem. Keep it
simple for now.

The fixup of the warning check in the USB gadget driver is just a straight
forward conversion of the lockless waiter check from one waitqueue type to
the other.

Signed-off-by: Thomas Gleixner 
Cc: Arnd Bergmann 


Reviewed-by: Davidlohr Bueso 


Re: [PATCH] KVM: PPC: Book3S HV: Skip kvmppc_uvmem_free if Ultravisor is not supported

2020-03-20 Thread Greg Kurz
On Thu, 19 Mar 2020 19:55:10 -0300
Fabiano Rosas  wrote:

> kvmppc_uvmem_init checks for Ultravisor support and returns early if
> it is not present. Calling kvmppc_uvmem_free at module exit will cause
> an Oops:
> 
> $ modprobe -r kvm-hv
> 
>   Oops: Kernel access of bad area, sig: 11 [#1]
>   
>   NIP:  c0789e90 LR: c0789e8c CTR: c0401030
>   REGS: c03fa7bab9a0 TRAP: 0300   Not tainted  
> (5.6.0-rc6-00033-g6c90b86a745a-dirty)
>   MSR:  90009033   CR: 24002282  XER: 
> 
>   CFAR: c0dae880 DAR: 0008 DSISR: 4000 IRQMASK: 1
>   GPR00: c0789e8c c03fa7babc30 c16fe500 
>   GPR04:  0006  c03faf205c00
>   GPR08:  0001 802d c0080ddde140
>   GPR12: c0401030 c03d9080 0001 
>   GPR16:   00013aad0074 00013aaac978
>   GPR20: 00013aad0070  7fffd1b37158 
>   GPR24: 00014fef0d58  00014fef0cf0 0001
>   GPR28:   c18b2a60 
>   NIP [c0789e90] percpu_ref_kill_and_confirm+0x40/0x170
>   LR [c0789e8c] percpu_ref_kill_and_confirm+0x3c/0x170
>   Call Trace:
>   [c03fa7babc30] [c03faf2064d4] 0xc03faf2064d4 (unreliable)
>   [c03fa7babcb0] [c0400e8c] dev_pagemap_kill+0x6c/0x80
>   [c03fa7babcd0] [c0401064] memunmap_pages+0x34/0x2f0
>   [c03fa7babd50] [c0080548] kvmppc_uvmem_free+0x30/0x80 [kvm_hv]
>   [c03fa7babd80] [c0080ddcef18] kvmppc_book3s_exit_hv+0x20/0x78 
> [kvm_hv]
>   [c03fa7babda0] [c02084d0] sys_delete_module+0x1d0/0x2c0
>   [c03fa7babe20] [c000b9d0] system_call+0x5c/0x68
>   Instruction dump:
>   3fc2001b fb81ffe0 fba1ffe8 fbe1fff8 7c7f1b78 7c9c2378 3bde4560 7fc3f378
>   f8010010 f821ff81 486249a1 6000  7c7d1b78 712a0002 40820084
>   ---[ end trace 5774ef4dc2c98279 ]---
> 
> So this patch checks if kvmppc_uvmem_init actually allocated anything
> before running kvmppc_uvmem_free.
> 
> Fixes: ca9f4942670c ("KVM: PPC: Book3S HV: Support for running secure guests")
> Reported-by: Greg Kurz 
> Signed-off-by: Fabiano Rosas 
> ---

Thanks for the quick fix :)

Tested-by: Greg Kurz 

>  arch/powerpc/kvm/book3s_hv_uvmem.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
> b/arch/powerpc/kvm/book3s_hv_uvmem.c
> index 79b1202b1c62..9d26614b2a77 100644
> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> @@ -806,6 +806,9 @@ int kvmppc_uvmem_init(void)
>  
>  void kvmppc_uvmem_free(void)
>  {
> + if (!kvmppc_uvmem_bitmap)
> + return;
> +
>   memunmap_pages(&kvmppc_uvmem_pgmap);
>   release_mem_region(kvmppc_uvmem_pgmap.res.start,
>  resource_size(&kvmppc_uvmem_pgmap.res));



Re: [patch V2 06/15] rcuwait: Add @state argument to rcuwait_wait_event()

2020-03-20 Thread Davidlohr Bueso

On Fri, 20 Mar 2020, Sebastian Andrzej Siewior wrote:


I thought that v2 has it fixed with the previous commit (acpi: Remove
header dependency). The kbot just reported that everything is fine.
Let me look…


Nah my bad, that build did not have the full series applied :)

Sorry for the noise.

Thanks,
Davidlohr


[PATCH 19/15] sched/swait: Reword some of the main description

2020-03-20 Thread Davidlohr Bueso
With both the increased use of swait and kvm no longer using
it, we can reword some of the comments. While removing Linus'
valid rant, I've also taken care to explicitly mention that swait
is very different from regular wait. In addition, the comment now
advises against using swait in favor of the regular flavor.
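
As a minimal illustration of the exclusive-only semantics, a sketch (not part
of the patch; example_wq and example_done are made-up names):

	static DECLARE_SWAIT_QUEUE_HEAD(example_wq);
	static bool example_done;

	/* waiter: always an exclusive wait, with bounded lock hold times */
	swait_event_exclusive(example_wq, READ_ONCE(example_done));

	/* waker: wakes exactly one waiter, unlike wake_up() */
	WRITE_ONCE(example_done, true);
	swake_up_one(&example_wq);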

Signed-off-by: Davidlohr Bueso 
---
 include/linux/swait.h | 23 +--
 1 file changed, 5 insertions(+), 18 deletions(-)

diff --git a/include/linux/swait.h b/include/linux/swait.h
index 73e06e9986d4..6e5b5d0e64fd 100644
--- a/include/linux/swait.h
+++ b/include/linux/swait.h
@@ -9,23 +9,10 @@
 #include 
 
 /*
- * BROKEN wait-queues.
- *
- * These "simple" wait-queues are broken garbage, and should never be
- * used. The comments below claim that they are "similar" to regular
- * wait-queues, but the semantics are actually completely different, and
- * every single user we have ever had has been buggy (or pointless).
- *
- * A "swake_up_one()" only wakes up _one_ waiter, which is not at all what
- * "wake_up()" does, and has led to problems. In other cases, it has
- * been fine, because there's only ever one waiter (kvm), but in that
- * case gthe whole "simple" wait-queue is just pointless to begin with,
- * since there is no "queue". Use "wake_up_process()" with a direct
- * pointer instead.
- *
- * While these are very similar to regular wait queues (wait.h) the most
- * important difference is that the simple waitqueue allows for deterministic
- * behaviour -- IOW it has strictly bounded IRQ and lock hold times.
+ * Simple waitqueues are semantically very different to regular wait queues
+ * (wait.h). The most important difference is that the simple waitqueue allows
+ * for deterministic behaviour -- IOW it has strictly bounded IRQ and lock hold
+ * times.
  *
  * Mainly, this is accomplished by two things. Firstly not allowing 
swake_up_all
  * from IRQ disabled, and dropping the lock upon every wakeup, giving a higher
@@ -39,7 +26,7 @@
  *sleeper state.
  *
  *  - the !exclusive mode; because that leads to O(n) wakeups, everything is
- *exclusive.
+ *exclusive. As such swait_wake_up_one will only ever awake _one_ waiter.
  *
  *  - custom wake callback functions; because you cannot give any guarantees
  *about random code. This also allows swait to be used in RT, such that
-- 
2.16.4



[PATCH 18/15] kvm: Replace vcpu->swait with rcuwait

2020-03-20 Thread Davidlohr Bueso
The use of any sort of waitqueue (simple or regular) for
waiting/waking vcpus has always been overkill and semantically
wrong. Because this is per-vcpu (which is blocked) there is
only ever a single waiting vcpu, thus no need for any sort of
queue.

As such, make use of the rcuwait primitive, with the following
considerations:

  - rcuwait already provides the proper barriers that serialize
  concurrent waiter and waker.

  - Task wakeup is done in rcu read critical region, with a
  stable task pointer.

  - Because there is no concurrency among waiters, we need
  not worry about rcuwait_wait_event() calls corrupting
  the wait->task. As a consequence, this saves the locking
  done in swait when adding to the queue.

The x86-tscdeadline_latency test mentioned in 8577370fb0cb
("KVM: Use simple waitqueue for vcpu->wq") shows that, on avg,
latency is reduced by around 15% with this change.

Cc: Paolo Bonzini 
Signed-off-by: Davidlohr Bueso 
---

Only compiled and tested on x86.
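
A minimal sketch of the single-waiter pattern the considerations above rely on
(this assumes the @state argument to rcuwait_wait_event() added earlier in the
series; 'w' and 'done' are stand-in names, not taken from the patch):

	struct rcuwait w;
	bool done = false;

	rcuwait_init(&w);

	/* waiter side: publishes current in w->task, then sleeps until done */
	rcuwait_wait_event(&w, READ_ONCE(done), TASK_INTERRUPTIBLE);

	/* waker side: wakes the single waiter, if any, under RCU */
	WRITE_ONCE(done, true);
	rcuwait_wake_up(&w);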

 arch/powerpc/include/asm/kvm_host.h |  2 +-
 arch/powerpc/kvm/book3s_hv.c| 10 --
 arch/x86/kvm/lapic.c|  2 +-
 include/linux/kvm_host.h| 10 +-
 virt/kvm/arm/arch_timer.c   |  2 +-
 virt/kvm/arm/arm.c  |  9 +
 virt/kvm/async_pf.c |  3 +--
 virt/kvm/kvm_main.c | 33 +
 8 files changed, 31 insertions(+), 40 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 6e8b8ffd06ad..e2b4a1e3fb7d 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -752,7 +752,7 @@ struct kvm_vcpu_arch {
u8 irq_pending; /* Used by XIVE to signal pending guest irqs */
u32 last_inst;
 
-   struct swait_queue_head *wqp;
+   struct rcuwait *waitp;
struct kvmppc_vcore *vcore;
int ret;
int trap;
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 2cefd071b848..c7cbc4bd06e9 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -231,13 +231,11 @@ static bool kvmppc_ipi_thread(int cpu)
 static void kvmppc_fast_vcpu_kick_hv(struct kvm_vcpu *vcpu)
 {
int cpu;
-   struct swait_queue_head *wqp;
+   struct rcuwait *wait;
 
-   wqp = kvm_arch_vcpu_wq(vcpu);
-   if (swq_has_sleeper(wqp)) {
-   swake_up_one(wqp);
+   wait = kvm_arch_vcpu_get_wait(vcpu);
+   if (rcuwait_wake_up(wait))
++vcpu->stat.halt_wakeup;
-   }
 
cpu = READ_ONCE(vcpu->arch.thread_cpu);
if (cpu >= 0 && kvmppc_ipi_thread(cpu))
@@ -4274,7 +4272,7 @@ static int kvmppc_vcpu_run_hv(struct kvm_run *run, struct 
kvm_vcpu *vcpu)
}
user_vrsave = mfspr(SPRN_VRSAVE);
 
-   vcpu->arch.wqp = &vcpu->arch.vcore->wq;
+   vcpu->arch.waitp = &vcpu->arch.vcore->wait;
vcpu->arch.pgdir = kvm->mm->pgd;
vcpu->arch.state = KVMPPC_VCPU_BUSY_IN_HOST;
 
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index e3099c642fec..a4420c26dfbc 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1815,7 +1815,7 @@ void kvm_lapic_expired_hv_timer(struct kvm_vcpu *vcpu)
/* If the preempt notifier has already run, it also called 
apic_timer_expired */
if (!apic->lapic_timer.hv_timer_in_use)
goto out;
-   WARN_ON(swait_active(>wq));
+   WARN_ON(rcu_dereference(vcpu->wait.task));
cancel_hv_timer(apic);
apic_timer_expired(apic);
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index bcb9b2ac0791..b5694429aede 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -23,7 +23,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
@@ -277,7 +277,7 @@ struct kvm_vcpu {
struct mutex mutex;
struct kvm_run *run;
 
-   struct swait_queue_head wq;
+   struct rcuwait wait;
struct pid __rcu *pid;
int sigset_active;
sigset_t sigset;
@@ -952,12 +952,12 @@ static inline bool kvm_arch_has_assigned_device(struct 
kvm *kvm)
 }
 #endif
 
-static inline struct swait_queue_head *kvm_arch_vcpu_wq(struct kvm_vcpu *vcpu)
+static inline struct rcuwait *kvm_arch_vcpu_get_wait(struct kvm_vcpu *vcpu)
 {
 #ifdef __KVM_HAVE_ARCH_WQP
-   return vcpu->arch.wqp;
+   return vcpu->arch.waitp;
 #else
-   return &vcpu->wq;
+   return &vcpu->wait;
 #endif
 }
 
diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
index 0d9438e9de2a..4be71cb58691 100644
--- a/virt/kvm/arm/arch_timer.c
+++ b/virt/kvm/arm/arch_timer.c
@@ -593,7 +593,7 @@ void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu)
if (map.emul_ptimer)
soft_timer_cancel(_ptimer->hrtimer);
 
-   if (swait_active(kvm_arch_vcpu_wq(vcpu)))
+   if (rcu_dereference(kvm_arch_vcpu_get_wait(vcpu)) != NULL)
kvm_timer_blocking(vcpu);
 
/*
diff 

[PATCH 17/15] rcuwait: Inform rcuwait_wake_up() users if a wakeup was attempted

2020-03-20 Thread Davidlohr Bueso
Let the caller know if wake_up_process() was actually called or not;
some users can use this information for ad-hoc purposes. Of course returning
true does not guarantee that wake_up_process() actually woke anything
up.

Signed-off-by: Davidlohr Bueso 
---
 include/linux/rcuwait.h |  2 +-
 kernel/exit.c   | 10 --
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/include/linux/rcuwait.h b/include/linux/rcuwait.h
index 6e8798458091..3f83b9a12ad3 100644
--- a/include/linux/rcuwait.h
+++ b/include/linux/rcuwait.h
@@ -24,7 +24,7 @@ static inline void rcuwait_init(struct rcuwait *w)
w->task = NULL;
 }
 
-extern void rcuwait_wake_up(struct rcuwait *w);
+extern bool rcuwait_wake_up(struct rcuwait *w);
 
 /*
  * The caller is responsible for locking around rcuwait_wait_event(),
diff --git a/kernel/exit.c b/kernel/exit.c
index 6cc6cc485d07..b0bb0a8ec4b1 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -234,9 +234,10 @@ void release_task(struct task_struct *p)
goto repeat;
 }
 
-void rcuwait_wake_up(struct rcuwait *w)
+bool rcuwait_wake_up(struct rcuwait *w)
 {
struct task_struct *task;
+   bool ret = false;
 
rcu_read_lock();
 
@@ -254,10 +255,15 @@ void rcuwait_wake_up(struct rcuwait *w)
smp_mb(); /* (B) */
 
task = rcu_dereference(w->task);
-   if (task)
+   if (task) {
wake_up_process(task);
+   ret = true;
+   }
rcu_read_unlock();
+
+   return ret;
 }
+EXPORT_SYMBOL_GPL(rcuwait_wake_up);
 
 /*
  * Determine if a process group is "orphaned", according to the POSIX
-- 
2.16.4



[PATCH 16/15] rcuwait: Get rid of stale name comment

2020-03-20 Thread Davidlohr Bueso
The 'trywake' name was renamed to simply 'wake';
update the comment.

Signed-off-by: Davidlohr Bueso 
---
 kernel/exit.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 0b81b26a872a..6cc6cc485d07 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -243,7 +243,7 @@ void rcuwait_wake_up(struct rcuwait *w)
/*
 * Order condition vs @task, such that everything prior to the load
 * of @task is visible. This is the condition as to why the user called
-* rcuwait_trywake() in the first place. Pairs with set_current_state()
+* rcuwait_wake() in the first place. Pairs with set_current_state()
 * barrier (A) in rcuwait_wait_event().
 *
 *WAITWAKE
-- 
2.16.4



Re: [patch V2 00/15] Lock ordering documentation and annotation for lockdep

2020-03-20 Thread Davidlohr Bueso

On Wed, 18 Mar 2020, Thomas Gleixner wrote:

   The PS3 one got converted by Peter Zijlstra to rcu_wait().


While at it, I think it makes sense to finally convert the kvm vcpu swait
to rcuwait (patch 6/15 starts the necessary api changes). I'm sending
some patches on top of this patchset.

Thanks,
Davidlohr


Re: [patch V2 06/15] rcuwait: Add @state argument to rcuwait_wait_event()

2020-03-20 Thread Sebastian Andrzej Siewior
On 2020-03-19 22:36:57 [-0700], Davidlohr Bueso wrote:
> On Wed, 18 Mar 2020, Thomas Gleixner wrote:
> 
> Right now I'm not sure what the proper fix should be.

I thought that v2 has it fixed with the previous commit (acpi: Remove
header dependency). The kbot just reported that everything is fine.
Let me look…

> Thanks,
> Davidlohr

Sebastian

