[PATCH v2] powerpc/mm: ptdump: Add missing include

2020-04-02 Thread YueHaibing
When PPC_PSERIES is not set, the gcc build fails:

arch/powerpc/mm/ptdump/hashpagetable.c: In function ‘pseries_find’:
arch/powerpc/mm/ptdump/hashpagetable.c:262:18: error: ‘H_SUCCESS’ undeclared 
(first use in this function); did you mean ‘FL_ACCESS’?
   if (lpar_rc != H_SUCCESS)
  ^
  FL_ACCESS

If PPC_PSERIES is not set, <asm/plpar_wrappers.h> does not
include <asm/hvcall.h>, which leads to this failure.

Add missing include file <asm/hvcall.h> in hashpagetable.c to fix this.

Reported-by: Hulk Robot 
Fixes: 65e701b2d2a8 ("powerpc/ptdump: drop non vital #ifdefs")
Signed-off-by: YueHaibing 
---
v2: Add hvcall.h instead of vio.h, also rework patch log
---
 arch/powerpc/mm/ptdump/hashpagetable.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/mm/ptdump/hashpagetable.c 
b/arch/powerpc/mm/ptdump/hashpagetable.c
index b6ed9578382f..3e7e6206688c 100644
--- a/arch/powerpc/mm/ptdump/hashpagetable.c
+++ b/arch/powerpc/mm/ptdump/hashpagetable.c
@@ -20,6 +20,7 @@
 #include <asm/page.h>
 #include <asm/pgalloc.h>
 #include <asm/plpar_wrappers.h>
+#include <asm/hvcall.h>
 #include <linux/memblock.h>
 #include <asm/firmware.h>
 
-- 
2.17.1




Re: [PATCH v3 1/1] ppc/crash: Reset spinlocks during crash

2020-04-02 Thread Nicholas Piggin
Leonardo Bras's on April 3, 2020 10:37 am:
> On Thu, 2020-04-02 at 22:28 +1100, Michael Ellerman wrote:
>> Leonardo Bras  writes:
>> > During a crash, there is a chance that the cpus that handle the NMI IPI
>> > are holding a spin_lock. If this spin_lock is needed by crashing_cpu it
>> > will cause a deadlock. (rtas.lock and printk logbuf_lock as of today)
>> > 
>> > This is a problem if the system has kdump set up: if it crashes
>> > for any reason, the kdump may not be saved for crash analysis.
>> > 
>> > After NMI IPI is sent to all other cpus, force unlock all spinlocks
>> > needed for finishing crash routine.
>> 
>> I'm not convinced this is the right approach.
> 
> Me neither. I think it's a very hacky solution, but I couldn't think of
> anything better at the time.
> 
>> Busting locks is risky, it could easily cause a crash if data structures
>> are left in some inconsistent state.
>> 
>> I think we need to make this code more careful about what it's doing.
>> There's a clue at the top of default_machine_crash_shutdown(), which
>> calls crash_kexec_prepare_cpus():
>> 
>>   * This function is only called after the system
>>   * has panicked or is otherwise in a critical state.
>>   * The minimum amount of code to allow a kexec'd kernel
>>   * to run successfully needs to happen here.
>> 
>> 
>> You said the "IPI complete" message was the cause of one lockup:
>> 
>>   #0  arch_spin_lock 
>>   #1  do_raw_spin_lock 
>>   #2  __raw_spin_lock 
>>   #3  _raw_spin_lock 
>>   #4  vprintk_emit 
>>   #5  vprintk_func
>>   #7  crash_kexec_prepare_cpus 
>>   #8  default_machine_crash_shutdown
>>   #9  machine_crash_shutdown 
>>   #10 __crash_kexec
>>   #11 crash_kexec
>>   #12 oops_end
>> 
>> TBH I think we could just drop that printk() entirely.
>> 
>> Or we could tell printk() that we're in NMI context so that it uses the
>> percpu buffers.
>> 
>> We should probably do the latter anyway, in case there's any other code
>> we call that inadvertently calls printk().
>> 
> 
> I was not aware of using per-cpu buffers in printk. It may be a nice
> solution.
> 
> There is another printk call there:
> printk("kexec: Starting switchover sequence.\n");
> in default_machine_kexec().
> 
> Two printk and one rtas call: it's all I could see using a spinlock
> after IPI Complete.
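
A rough sketch of that per-CPU buffer idea -- assuming the generic
printk_nmi_enter()/printk_nmi_exit() helpers can be used on this path --
would be to bracket the whole shutdown sequence:

        /*
         * Route printk() through the lockless per-CPU NMI buffers so
         * nothing below ever takes logbuf_lock.
         */
        printk_nmi_enter();
        crash_kexec_prepare_cpus(crashing_cpu);
        machine_kexec_mask_interrupts();
        /* ... rest of default_machine_crash_shutdown() ... */
        printk_nmi_exit();

Output is then buffered per CPU and flushed later, instead of being
serialised through the logbuf spinlock at crash time.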
> 
>> 
>> The RTAS trace you sent was:
>> 
>>   #0 arch_spin_lock
>>   #1  lock_rtas () 
>>   #2  rtas_call (token=8204, nargs=1, nret=1, outputs=0x0)
>>   #3  ics_rtas_mask_real_irq (hw_irq=4100) 
>>   #4  machine_kexec_mask_interrupts
>>   #5  default_machine_crash_shutdown
>>   #6  machine_crash_shutdown 
>>   #7  __crash_kexec
>>   #8  crash_kexec
>>   #9  oops_end
>> 
>> 
>> Which doesn't make it clear who holds the RTAS lock. We really shouldn't
>> be crashing while holding the RTAS lock, but I guess it could happen.
>> Can you get a full backtrace?
>> 
> 
> Oh, all traces are from the thread that called the crash, by writing
> 'c' to sysrq. That is what I am using to reproduce.
> 
> #10 bad_page_fault
> #11 handle_page_fault
> #12 __handle_sysrq (key=99, check_mask=false) 
> #13 write_sysrq_trigger 
> #14 proc_reg_write
> #15 __vfs_write
> #16 vfs_write
> #17 SYSC_write
> #18 SyS_write
> #19 system_call
> 
>> 
>> PAPR says we are not allowed to have multiple CPUs calling RTAS at once,
>> except for a very small list of RTAS calls. So if we bust the RTAS lock
>> there's a risk we violate that part of PAPR and crash even harder.
> 
> Interesting, I was not aware.
> 
>> 
>> Also it's not specific to kdump, we can't even get through a normal
>> reboot if we crash with the RTAS lock held.
>> 
>> Anyway here's a patch with some ideas. That allows me to get from a
>> crash with the RTAS lock held through kdump into the 2nd kernel. But it
>> only works if it's the crashing CPU that holds the RTAS lock.
>> 
> 
> Nice idea. 
> But my test environment is just triggering a crash from sysrq, so I
> think it would not improve the result, given that this thread is
> probably not holding the lock by that time.

Crash paths should not take that RTAS lock, it's a massive pain. I'm 
fixing it for machine check, for other crashes I think it can be removed 
too, it just needs to be unpicked. The good thing with crashing is that 
you can reasonably *know* that you're single threaded, so you can 
usually reason through situations like above.

> I noticed that when rtas is locked, irqs and preemption are also
> disabled.
> 
> Should the IPI send by crash be able to interrupt a thread with
> disabled irqs?

Yes. It's been a bit painful, but in the long term it means that a CPU 
which hangs with interrupts off can be debugged, and it means we can 
take it offline to crash without risking that it will be clobbering what 
we're doing.

Arguably what I should have done is try a regular IPI first, wait a few 
seconds, then NMI IPI.
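
Something like the following, inside crash_kexec_prepare_cpus() -- the
grace period constant is made up, the callback and counter names are
the ones that file already uses:

        static void crash_stop_this_cpu(void *arg)
        {
                crash_ipi_callback(get_irq_regs());
        }
        ...
        /* ask nicely first: an ordinary IPI to all other CPUs */
        smp_call_function(crash_stop_this_cpu, NULL, 0);
        mdelay(REGULAR_IPI_GRACE_MS);   /* hypothetical timeout */

        /* escalate to an NMI IPI for anyone who hasn't checked in */
        if (atomic_read(&cpus_in_crash) < ncpus - 1)
                crash_send_ipi(crash_ipi_callback);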

A couple of problems with that. Firstly it probably avoids this issue 
you hit almost all the time, so it won't get fixed. So w

Re: [PATCH v4 6/6] pseries/sysfs: Minimise IPI noise while reading [idle_][s]purr

2020-04-02 Thread Gautham R Shenoy
Hi Naveen,

On Thu, Apr 02, 2020 at 01:04:34PM +0530, Naveen N. Rao wrote:
[..snip..]

> >
> >It does reduce it to a 10ms window. I am not sure if anyone samples PURR
> >etc faster than that rate.
> >
> >I measured how much time it takes to read the purr, spurr, idle_purr,
> >idle_spurr files back-to-back. It takes not more than 150us.  From
> >lparstat will these values be read back-to-back ? If so, we can reduce
> >the staleness_tolerance to something like 500us and still avoid extra
> >IPIs. If not, what is the maximum delay between the first sysfs file
> >read and the last sysfs file read ?
> 
> Oh, for lparstat usage, this is perfectly fine.
> 
> I meant that there could be other users of [s]purr who might care. I don't
> know of one, but since this is an existing sysfs interface, I wanted to
> point out that the behavior might change.

Fair point. Perhaps this should be documented in the Documentation, if
we are going to continue with this patch.

> 
> >
> >>
> >>I wonder if we should introduce a sysctl interface to control thresholding.
> >>It can default to 0, which disables thresholding so that the existing
> >>behavior continues. Applications (lparstat) can optionally set it to suit
> >>their use.
> >
> >We would be introducing 3 new sysfs interfaces that way instead of
> >two.
> >
> >/sys/devices/system/cpu/purr_spurr_staleness
> >/sys/devices/system/cpu/cpuX/idle_purr
> >/sys/devices/system/cpu/cpuX/idle_spurr
> >
> >I don't have a problem with this. Nathan, Michael, thoughts on this?
> >
> >
> >The alternative is to have a procfs interface, something like
> >/proc/powerpc/resource_util_stats
> >
> >which gives a listing similar to /proc/stat, i.e
> >
> >  CPUX <purr> <spurr> <idle_purr> <idle_spurr>
> >
> >Even in this case, the values can be obtained in one-shot with a
> >single IPI and be printed in the row corresponding to the CPU.
> 
> Right -- and that would be optimal requiring a single system call, at the
> cost of using a legacy interface.
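
As a rough sketch, the one-shot read could be a single cross-CPU call
that samples all four values (read_this_idle_[s]purr are from this
series; the rest is illustrative):

        struct util_vals {
                u64 purr, spurr, idle_purr, idle_spurr;
        };

        static void read_util_vals(void *arg)   /* runs on target CPU */
        {
                struct util_vals *v = arg;

                v->purr = mfspr(SPRN_PURR);
                v->spurr = mfspr(SPRN_SPURR);
                v->idle_purr = read_this_idle_purr();
                v->idle_spurr = read_this_idle_spurr();
        }

        /* one IPI fetches the whole row for a given CPU */
        struct util_vals vals;
        smp_call_function_single(cpu, read_util_vals, &vals, 1);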
> 
> The other option would be to drop this patch and to just go with patches 1-5
> introducing the new sysfs interfaces for idle_[s]purr. It isn't entirely
> clear how often this would be used, or its actual impact. We can perhaps
> consider this optimization if and when this causes problems...

I am ok with that. We can revisit the problem if IPI noise becomes
noticable. However, if Nathan or Michael feel that this problem is
better solved now, than leaving it for the future, we will have to
take a call on what the interface is going to be.

> 
> 
> Thanks,
> Naveen
> 

--
Thanks and Regards
gautham.


Re: [PATCH 4/4] ocxl: Remove custom service to allocate interrupts

2020-04-02 Thread Cédric Le Goater
On 4/2/20 5:43 PM, Frederic Barrat wrote:
> We now allocate interrupts through xive directly.
> 
> Signed-off-by: Frederic Barrat 


Reviewed-by: Cédric Le Goater 

> ---
>  arch/powerpc/include/asm/pnv-ocxl.h   |  3 ---
>  arch/powerpc/platforms/powernv/ocxl.c | 30 ---
>  2 files changed, 33 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/pnv-ocxl.h 
> b/arch/powerpc/include/asm/pnv-ocxl.h
> index 7de82647e761..e90650328c9c 100644
> --- a/arch/powerpc/include/asm/pnv-ocxl.h
> +++ b/arch/powerpc/include/asm/pnv-ocxl.h
> @@ -30,7 +30,4 @@ extern int pnv_ocxl_spa_setup(struct pci_dev *dev, void 
> *spa_mem, int PE_mask,
>  extern void pnv_ocxl_spa_release(void *platform_data);
>  extern int pnv_ocxl_spa_remove_pe_from_cache(void *platform_data, int 
> pe_handle);
>  
> -extern int pnv_ocxl_alloc_xive_irq(u32 *irq, u64 *trigger_addr);
> -extern void pnv_ocxl_free_xive_irq(u32 irq);
> -
>  #endif /* _ASM_PNV_OCXL_H */
> diff --git a/arch/powerpc/platforms/powernv/ocxl.c 
> b/arch/powerpc/platforms/powernv/ocxl.c
> index 8c65aacda9c8..ecdad219d704 100644
> --- a/arch/powerpc/platforms/powernv/ocxl.c
> +++ b/arch/powerpc/platforms/powernv/ocxl.c
> @@ -2,7 +2,6 @@
>  // Copyright 2017 IBM Corp.
>  #include <asm/pnv-ocxl.h>
>  #include <asm/opal.h>
> -#include <asm/xive.h>
>  #include <misc/ocxl-config.h>
>  #include "pci.h"
>  
> @@ -484,32 +483,3 @@ int pnv_ocxl_spa_remove_pe_from_cache(void 
> *platform_data, int pe_handle)
>   return rc;
>  }
>  EXPORT_SYMBOL_GPL(pnv_ocxl_spa_remove_pe_from_cache);
> -
> -int pnv_ocxl_alloc_xive_irq(u32 *irq, u64 *trigger_addr)
> -{
> - __be64 flags, trigger_page;
> - s64 rc;
> - u32 hwirq;
> -
> - hwirq = xive_native_alloc_irq();
> - if (!hwirq)
> - return -ENOENT;
> -
> - rc = opal_xive_get_irq_info(hwirq, &flags, NULL, &trigger_page, NULL,
> - NULL);
> - if (rc || !trigger_page) {
> - xive_native_free_irq(hwirq);
> - return -ENOENT;
> - }
> - *irq = hwirq;
> - *trigger_addr = be64_to_cpu(trigger_page);
> - return 0;
> -
> -}
> -EXPORT_SYMBOL_GPL(pnv_ocxl_alloc_xive_irq);
> -
> -void pnv_ocxl_free_xive_irq(u32 irq)
> -{
> - xive_native_free_irq(irq);
> -}
> -EXPORT_SYMBOL_GPL(pnv_ocxl_free_xive_irq);
> 



Re: [PATCH 2/4] ocxl: Access interrupt trigger page from xive directly

2020-04-02 Thread Andrew Donnellan

On 3/4/20 2:43 am, Frederic Barrat wrote:

We can access the trigger page through standard APIs so let's use it
and avoid saving it when allocating the interrupt. It will also allow
us to simplify allocation in a later patch.

Signed-off-by: Frederic Barrat 


I don't see any obvious issues.
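
For reference, I assume the "standard APIs" here boil down to something
like this sketch (helper name illustrative): the trigger page comes from
the xive irq data attached to the mapped interrupt, so it no longer
needs to be saved at allocation time:

        #include <asm/xive.h>

        static u64 afu_irq_trigger_addr(struct afu_irq *irq)
        {
                struct xive_irq_data *xd = irq_get_handler_data(irq->virq);

                /* trig_page is the physical trigger page, 0 if absent */
                return xd ? xd->trig_page : 0;
        }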

Acked-by: Andrew Donnellan 

--
Andrew Donnellan  OzLabs, ADL Canberra
a...@linux.ibm.com IBM Australia Limited



Re: [PATCH 3/4] ocxl: Don't return trigger page when allocating an interrupt

2020-04-02 Thread Cédric Le Goater
On 4/2/20 5:43 PM, Frederic Barrat wrote:
> Existing users of ocxl_link_irq_alloc() have been converted to obtain
> the trigger page of an interrupt through xive directly, so we no longer
> need to return the trigger page when allocating an interrupt.
> 
> It also allows ocxl to use the xive native interface to allocate
> interrupts, instead of its custom service.
> 
> Signed-off-by: Frederic Barrat 

Reviewed-by: Cédric Le Goater 


> ---
>  drivers/misc/ocxl/Kconfig   |  2 +-
>  drivers/misc/ocxl/afu_irq.c |  4 +---
>  drivers/misc/ocxl/link.c| 15 +++
>  drivers/scsi/cxlflash/ocxl_hw.c |  3 +--
>  include/misc/ocxl.h | 10 ++
>  5 files changed, 12 insertions(+), 22 deletions(-)
> 
> diff --git a/drivers/misc/ocxl/Kconfig b/drivers/misc/ocxl/Kconfig
> index 2d2266c1439e..e65773f5cf59 100644
> --- a/drivers/misc/ocxl/Kconfig
> +++ b/drivers/misc/ocxl/Kconfig
> @@ -9,7 +9,7 @@ config OCXL_BASE
>  
>  config OCXL
>   tristate "OpenCAPI coherent accelerator support"
> - depends on PPC_POWERNV && PCI && EEH
> + depends on PPC_POWERNV && PCI && EEH && PPC_XIVE_NATIVE
>   select OCXL_BASE
>   select HOTPLUG_PCI_POWERNV
>   default m
> diff --git a/drivers/misc/ocxl/afu_irq.c b/drivers/misc/ocxl/afu_irq.c
> index b30ec0ef7be7..ecdcfae025b7 100644
> --- a/drivers/misc/ocxl/afu_irq.c
> +++ b/drivers/misc/ocxl/afu_irq.c
> @@ -11,7 +11,6 @@ struct afu_irq {
>   int hw_irq;
>   unsigned int virq;
>   char *name;
> - u64 trigger_page;
>   irqreturn_t (*handler)(void *private);
>   void (*free_private)(void *private);
>   void *private;
> @@ -125,8 +124,7 @@ int ocxl_afu_irq_alloc(struct ocxl_context *ctx, int 
> *irq_id)
>   goto err_unlock;
>   }
>  
> - rc = ocxl_link_irq_alloc(ctx->afu->fn->link, &irq->hw_irq,
> - &irq->trigger_page);
> + rc = ocxl_link_irq_alloc(ctx->afu->fn->link, &irq->hw_irq);
>   if (rc)
>   goto err_idr;
>  
> diff --git a/drivers/misc/ocxl/link.c b/drivers/misc/ocxl/link.c
> index 58d111afd9f6..fd73d3bc0eb6 100644
> --- a/drivers/misc/ocxl/link.c
> +++ b/drivers/misc/ocxl/link.c
> @@ -6,6 +6,7 @@
>  #include <linux/mmu_context.h>
>  #include <asm/copro.h>
>  #include <asm/pnv-ocxl.h>
> +#include <asm/xive.h>
>  #include <misc/ocxl.h>
>  #include "ocxl_internal.h"
>  #include "trace.h"
> @@ -682,23 +683,21 @@ int ocxl_link_remove_pe(void *link_handle, int pasid)
>  }
>  EXPORT_SYMBOL_GPL(ocxl_link_remove_pe);
>  
> -int ocxl_link_irq_alloc(void *link_handle, int *hw_irq, u64 *trigger_addr)
> +int ocxl_link_irq_alloc(void *link_handle, int *hw_irq)
>  {
>   struct ocxl_link *link = (struct ocxl_link *) link_handle;
> - int rc, irq;
> - u64 addr;
> + int irq;
>  
>   if (atomic_dec_if_positive(&link->irq_available) < 0)
>   return -ENOSPC;
>  
> - rc = pnv_ocxl_alloc_xive_irq(&irq, &addr);
> - if (rc) {
> + irq = xive_native_alloc_irq();
> + if (!irq) {
>   atomic_inc(&link->irq_available);
> - return rc;
> + return -ENXIO;
>   }
>  
>   *hw_irq = irq;
> - *trigger_addr = addr;
>   return 0;
>  }
>  EXPORT_SYMBOL_GPL(ocxl_link_irq_alloc);
> @@ -707,7 +706,7 @@ void ocxl_link_free_irq(void *link_handle, int hw_irq)
>  {
>   struct ocxl_link *link = (struct ocxl_link *) link_handle;
>  
> - pnv_ocxl_free_xive_irq(hw_irq);
> + xive_native_free_irq(hw_irq);
>   atomic_inc(&link->irq_available);
>  }
>  EXPORT_SYMBOL_GPL(ocxl_link_free_irq);
> diff --git a/drivers/scsi/cxlflash/ocxl_hw.c b/drivers/scsi/cxlflash/ocxl_hw.c
> index 59452850f71c..03bff0cae658 100644
> --- a/drivers/scsi/cxlflash/ocxl_hw.c
> +++ b/drivers/scsi/cxlflash/ocxl_hw.c
> @@ -613,7 +613,6 @@ static int alloc_afu_irqs(struct ocxlflash_context *ctx, 
> int num)
>   struct ocxl_hw_afu *afu = ctx->hw_afu;
>   struct device *dev = afu->dev;
>   struct ocxlflash_irqs *irqs;
> - u64 addr;
>   int rc = 0;
>   int hwirq;
>   int i;
> @@ -638,7 +637,7 @@ static int alloc_afu_irqs(struct ocxlflash_context *ctx, 
> int num)
>   }
>  
>   for (i = 0; i < num; i++) {
> - rc = ocxl_link_irq_alloc(afu->link_token, &hwirq, &addr);
> + rc = ocxl_link_irq_alloc(afu->link_token, &hwirq);
>   if (unlikely(rc)) {
>   dev_err(dev, "%s: ocxl_link_irq_alloc failed rc=%d\n",
>   __func__, rc);
> diff --git a/include/misc/ocxl.h b/include/misc/ocxl.h
> index 06dd5839e438..a2868adec22f 100644
> --- a/include/misc/ocxl.h
> +++ b/include/misc/ocxl.h
> @@ -480,14 +480,8 @@ int ocxl_link_remove_pe(void *link_handle, int pasid);
>   * Allocate an AFU interrupt associated to the link.
>   *
>   * 'hw_irq' is the hardware interrupt number
> - * 'obj_handle' is the 64-bit object handle to be passed to the AFU to
> - * trigger the interrupt.
> - * On P9, 'obj_handle' is an address, which, if written, triggers the
> - * interrupt. It is an

Re: [PATCH] powerpc/mm: ptdump: Add missing include

2020-04-02 Thread Yuehaibing
On 2020/4/3 12:58, Michael Ellerman wrote:
> YueHaibing  writes:
>> gcc build fails:
> 
> What config? Custom?

A randconfig where CONFIG_PPC_PSERIES is not set, see the attached config.

> 
>> arch/powerpc/mm/ptdump/hashpagetable.c: In function ‘pseries_find’:
>> arch/powerpc/mm/ptdump/hashpagetable.c:262:18: error: ‘H_SUCCESS’ undeclared 
>> (first use in this function); did you mean ‘FL_ACCESS’?
>>if (lpar_rc != H_SUCCESS)
>>   ^
>>   FL_ACCESS
>>
>> Reported-by: Hulk Robot 
>> Fixes: 65e701b2d2a8 ("powerpc/ptdump: drop non vital #ifdefs")
>> Signed-off-by: YueHaibing 
>> ---
>>  arch/powerpc/mm/ptdump/hashpagetable.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/arch/powerpc/mm/ptdump/hashpagetable.c 
>> b/arch/powerpc/mm/ptdump/hashpagetable.c
>> index b6ed9578382f..8ea5f9a3b658 100644
>> --- a/arch/powerpc/mm/ptdump/hashpagetable.c
>> +++ b/arch/powerpc/mm/ptdump/hashpagetable.c
>> @@ -20,6 +20,7 @@
>>  #include <asm/page.h>
>>  #include <asm/pgalloc.h>
>>  #include <asm/plpar_wrappers.h>
>> +#include <asm/vio.h>
> 
> I don't think you want vio.h, hvcall.h has the definition of H_SUCCESS.

Ok, will resend.

> 
> cheers
> 
> .
> 
#
# Automatically generated file; DO NOT EDIT.
# Linux/powerpc 5.6.0 Kernel Configuration
#

#
# Compiler: powerpc-linux-gnu-gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
#
CONFIG_CC_IS_GCC=y
CONFIG_GCC_VERSION=70500
CONFIG_LD_VERSION=23000
CONFIG_CLANG_VERSION=0
CONFIG_CC_CAN_LINK=y
CONFIG_CC_HAS_ASM_GOTO=y
CONFIG_CC_HAS_ASM_INLINE=y
CONFIG_CC_HAS_WARN_MAYBE_UNINITIALIZED=y
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_TABLE_SORT=y
CONFIG_THREAD_INFO_IN_TASK=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
# CONFIG_COMPILE_TEST is not set
CONFIG_UAPI_HEADER_TEST=y
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_BUILD_SALT=""
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_XZ is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
# CONFIG_SWAP is not set
CONFIG_SYSVIPC=y
# CONFIG_WATCH_QUEUE is not set
CONFIG_CROSS_MEMORY_ATTACH=y
# CONFIG_USELIB is not set
CONFIG_HAVE_ARCH_AUDITSYSCALL=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_IRQ_SHOW_LEVEL=y
CONFIG_GENERIC_IRQ_MIGRATION=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_GENERIC_MSI_IRQ_DOMAIN=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
# CONFIG_GENERIC_IRQ_DEBUGFS is not set
# end of IRQ subsystem

CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_ARCH_HAS_TICK_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
CONFIG_NO_HZ=y
# CONFIG_HIGH_RES_TIMERS is not set
# end of Timers subsystem

CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_COUNT=y

#
# CPU/Task time and stats accounting
#
CONFIG_VIRT_CPU_ACCOUNTING=y
# CONFIG_TICK_CPU_ACCOUNTING is not set
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y
# CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set
# CONFIG_SCHED_THERMAL_PRESSURE is not set
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
# CONFIG_PSI is not set
# end of CPU/Task time and stats accounting

CONFIG_CPU_ISOLATION=y

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
CONFIG_RCU_EXPERT=y
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
CONFIG_TASKS_RCU_GENERIC=y
CONFIG_TASKS_RCU=y
CONFIG_TASKS_RUDE_RCU=y
CONFIG_TASKS_TRACE_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
CONFIG_RCU_FANOUT=64
CONFIG_RCU_FANOUT_LEAF=16
# CONFIG_RCU_FAST_NO_HZ is not set
# CONFIG_RCU_NOCB_CPU is not set
CONFIG_TASKS_TRACE_RCU_READ_MB=y
# end of RCU Subsystem

CONFIG_BUILD_BIN2C=y
CONFIG_IKCONFIG=y
CONFIG_IKHEADERS=y
CONFIG_LOG_BUF_SHIFT=17
CONFIG_LOG_CPU_MAX_BUF_SHIFT=12
CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=13

#
# Scheduler features
#
# end of Scheduler features

CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_CC_HAS_INT128=y
CONFIG_CGROUPS=y
# CONFIG_MEMCG is not set
# CONFIG_BLK_CGROUP is not set
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_CFS_BANDWIDTH=y
CONFIG_RT_GROUP_SCHED=y
# CONFIG_CGROUP_PIDS is not set
# CONFIG_CGROUP_RDMA is not set
CONFIG_CGROUP_FREEZER=y
# CONFIG_CGROUP_HUGETLB is not set
# CONFIG_CPUSETS is not set
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_PERF=y
# CONFIG_CGROUP_DEBUG is not set
# CONFIG_NAMESPACES is not set
# CONFIG_CHECKPOINT_RESTORE is not set
CONFIG_SCHED_AUTOGROUP=y
# CONFIG_SYSFS_DEPRECATED is not set
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_RD_GZIP=y
CONFIG_RD_BZIP2=y
CONFIG_RD_LZMA=y
CONFIG_RD_XZ=y
CONFIG_RD_LZO=y
# CONFIG_RD_LZ4 is not set
CONFIG_BOOT_CONFIG=y
CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_HAVE_LD_DEAD_CODE_DATA_ELIMINATION=y
# CONFIG_LD_DEAD_CODE_DATA_ELIMINATION is not set
CONFIG_SYSCTL_EXCEPTION_TRACE=y
CONFIG_EXPERT=y
CONFIG_MULTIUSER=y
# CONFIG_SGETMASK_SYSCALL is not set
CONFIG

Re: [PATCH v4 2/6] powerpc/idle: Add accessor function to always read latest idle PURR

2020-04-02 Thread Gautham R Shenoy
On Wed, Apr 01, 2020 at 03:12:53PM +0530, Naveen N. Rao wrote:
> Hi Gautham,
> 
> Gautham R. Shenoy wrote:
> >From: "Gautham R. Shenoy" 
> >
> >Currently when CPU goes idle, we take a snapshot of PURR via
> >pseries_idle_prolog() which is used at the CPU idle exit to compute
> >the idle PURR cycles via the function pseries_idle_epilog().  Thus,
> >the value of the idle PURR cycles read before pseries_idle_prolog() and
> >after pseries_idle_epilog() is always correct.
> >
> >However, if we were to read the idle PURR cycles from an interrupt
> >context between pseries_idle_prolog() and pseries_idle_epilog() (this will
> >be done in a future patch), then, the value of the idle PURR thus read
> >will not include the cycles spent in the most recent idle period.
> >
> >This patch addresses the issue by providing an accessor function to read
> >the idle PURR such that it includes the cycles spent in the most
> >recent idle period, if we read it between pseries_idle_prolog() and
> >pseries_idle_epilog(). In order to achieve it, the patch saves the
> >snapshot of PURR in pseries_idle_prolog() in a per-cpu variable,
> >instead of on the stack, so that it can be accessed from an interrupt
> >context.
> >
> >Signed-off-by: Gautham R. Shenoy 
> >---
> > arch/powerpc/include/asm/idle.h| 47 
> > +++---
> > arch/powerpc/platforms/pseries/setup.c |  7 +++--
> > drivers/cpuidle/cpuidle-pseries.c  | 15 +--
> > 3 files changed, 47 insertions(+), 22 deletions(-)
> >
> >diff --git a/arch/powerpc/include/asm/idle.h 
> >b/arch/powerpc/include/asm/idle.h
> >index 32064a4c..d4bfb6a 100644
> >--- a/arch/powerpc/include/asm/idle.h
> >+++ b/arch/powerpc/include/asm/idle.h
> >@@ -5,10 +5,27 @@
> > #include 
> >
> > #ifdef CONFIG_PPC_PSERIES
> >-static inline void pseries_idle_prolog(unsigned long *in_purr)
> >+DECLARE_PER_CPU(u64, idle_entry_purr_snap);
> >+
> >+static inline void snapshot_purr_idle_entry(void)
> >+{
> >+*this_cpu_ptr(&idle_entry_purr_snap) = mfspr(SPRN_PURR);
> >+}
> >+
> >+static inline void update_idle_purr_accounting(void)
> >+{
> >+u64 wait_cycles;
> >+u64 in_purr = *this_cpu_ptr(&idle_entry_purr_snap);
> >+
> >+wait_cycles = be64_to_cpu(get_lppaca()->wait_state_cycles);
> >+wait_cycles += mfspr(SPRN_PURR) - in_purr;
> >+get_lppaca()->wait_state_cycles = cpu_to_be64(wait_cycles);
> >+}
> >+
> >+static inline void pseries_idle_prolog(void)
> > {
> > ppc64_runlatch_off();
> >-*in_purr = mfspr(SPRN_PURR);
> >+snapshot_purr_idle_entry();
> > /*
> >  * Indicate to the HV that we are idle. Now would be
> >  * a good time to find other work to dispatch.
> >@@ -16,16 +33,28 @@ static inline void pseries_idle_prolog(unsigned long 
> >*in_purr)
> > get_lppaca()->idle = 1;
> > }
> >
> >-static inline void pseries_idle_epilog(unsigned long in_purr)
> >+static inline void pseries_idle_epilog(void)
> > {
> >-u64 wait_cycles;
> >-
> >-wait_cycles = be64_to_cpu(get_lppaca()->wait_state_cycles);
> >-wait_cycles += mfspr(SPRN_PURR) - in_purr;
> >-get_lppaca()->wait_state_cycles = cpu_to_be64(wait_cycles);
> >+update_idle_purr_accounting();
> > get_lppaca()->idle = 0;
> >-
> > ppc64_runlatch_on();
> > }
> >+
> >+static inline u64 read_this_idle_purr(void)
> >+{
> >+/*
> >+ * If we are reading from an idle context, update the
> >+ * idle-purr cycles corresponding to the last idle period.
> >+ * Since the idle context is not yet over, take a fresh
> >+ * snapshot of the idle-purr.
> >+ */
> >+if (unlikely(get_lppaca()->idle == 1)) {
> >+update_idle_purr_accounting();
> >+snapshot_purr_idle_entry();
> >+}
> >+
> >+return be64_to_cpu(get_lppaca()->wait_state_cycles);
> >+}
> >+
> 
> I think this and read_this_idle_spurr() from the next patch should be moved
> to Patch 4/6, where they are actually used.

The reason I included this function in this patch was to justify why
we were introducing snapshotting the purr values in a global per-cpu
variable instead of on a stack variable. The reason being that someone
might want to read the PURR value from an interrupt context which had
woken up the CPU from idle. At this point, since the epilog() function
wasn't called, the idle PURR count corresponding to this latest idle
period would not yet have been accumulated in lppaca->wait_cycles. Thus,
this helper function safely reads the value by
   1) First updating the lppaca->wait_cycles with the latest idle_purr
   count.
   2) Taking a fresh snapshot, since the time from now to the epilog()
   call is also counted under idle CPU. So the PURR cycle increment
   during this short period should also be accumulated in lppaca->wait_cycles.


prolog()
|   snapshot PURR
|
|
|
Idle
|
| <- Interrupt. Read idle PURR:  update idle PURR;
|                                snapshot PURR;
|                                read idle PURR.
|
epilog()
   

Re: [PATCH 3/4] powerpc/eeh: Remove workaround from eeh_add_device_late()

2020-04-02 Thread Oliver O'Halloran
On Mon, 2020-03-30 at 15:56 +1100, Sam Bobroff wrote:
> When EEH device state was released asynchronously by the device
> release handler, it was possible for an outstanding reference to
> prevent it's release and it was necessary to work around that if a
> device was re-discovered at the same PCI location.

I think this is a bit misleading. The main situation where you'll hit
this hack is when recovering a device with a driver that doesn't
implement the error handling callbacks. In that case the device is
removed, reset, then re-probed by the PCI core, but we assume it's the
same physical device so the eeh_device state remains active.

If you actually changed the underlying device I suspect something bad
would happen.

> Now that the state is released synchronously that is no longer
> possible and the workaround is no longer necessary.

You could probably fold this into the previous patch, but eh.

> Signed-off-by: Sam Bobroff 
> ---
>  arch/powerpc/kernel/eeh.c | 23 +--
>  1 file changed, 1 insertion(+), 22 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
> index c36c5a7db5ca..12c248a16527 100644
> --- a/arch/powerpc/kernel/eeh.c
> +++ b/arch/powerpc/kernel/eeh.c
> @@ -1206,28 +1206,7 @@ void eeh_add_device_late(struct pci_dev *dev)
>   eeh_edev_dbg(edev, "Device already referenced!\n");
>   return;
>   }
> -
> - /*
> -  * The EEH cache might not be removed correctly because of
> -  * unbalanced kref to the device during unplug time, which
> -  * relies on pcibios_release_device(). So we have to remove
> -  * that here explicitly.
> -  */
> - if (edev->pdev) {
> - eeh_rmv_from_parent_pe(edev);
> - eeh_addr_cache_rmv_dev(edev->pdev);
> - eeh_sysfs_remove_device(edev->pdev);
> -
> - /*
> -  * We definitely should have the PCI device removed
> -  * though it wasn't correctly. So we needn't call
> -  * into error handler afterwards.
> -  */
> - edev->mode |= EEH_DEV_NO_HANDLER;
> -
> - edev->pdev = NULL;
> - dev->dev.archdata.edev = NULL;
> - }
> + WARN_ON_ONCE(edev->pdev);
>  
>   if (eeh_has_flag(EEH_PROBE_MODE_DEV))
>   eeh_ops->probe(pdn, NULL);



Re: [PATCH 4/4] powerpc/eeh: Clean up edev cleanup for VFs

2020-04-02 Thread Oliver O'Halloran
On Mon, 2020-03-30 at 15:56 +1100, Sam Bobroff wrote:
> Because the bus notifier calls eeh_rmv_from_parent_pe() (via
> eeh_remove_device()) when a VF is removed, the call in
> remove_sriov_vf_pdns() is redundant.

eeh_rmv_from_parent_pe() won't actually remove the device if the
recovering flag is set on the PE. Are you sure we're not introducing a
race here?



Re: [PATCH 0/2] powerpc: Remove support for ppc405/440 Xilinx platforms

2020-04-02 Thread Michael Ellerman
Benjamin Herrenschmidt  writes:
> On Tue, 2020-03-31 at 16:30 +1100, Michael Ellerman wrote:
>> I have no attachment to 40x, and I'd certainly be happy to have less
>> code in the tree, we struggle to keep even the modern platforms well
>> maintained.
>> 
>> At the same time I don't want to render anyone's hardware obsolete
>> unnecessarily. But if there's really no one using 40x then we should
>> remove it, it could well be broken already.
>> 
>> So I guess post a series to do the removal and we'll see if anyone
>> speaks up.
>
> We shouldn't remove 40x completely. Just remove the Xilinx 405 stuff.

Congratulations on becoming the 40x maintainer!

cheers


Re: [PATCH] powerpc/mm: ptdump: Add missing include

2020-04-02 Thread Michael Ellerman
YueHaibing  writes:
> gcc build fails:

What config? Custom?

> arch/powerpc/mm/ptdump/hashpagetable.c: In function ‘pseries_find’:
> arch/powerpc/mm/ptdump/hashpagetable.c:262:18: error: ‘H_SUCCESS’ undeclared 
> (first use in this function); did you mean ‘FL_ACCESS’?
>if (lpar_rc != H_SUCCESS)
>   ^
>   FL_ACCESS
>
> Reported-by: Hulk Robot 
> Fixes: 65e701b2d2a8 ("powerpc/ptdump: drop non vital #ifdefs")
> Signed-off-by: YueHaibing 
> ---
>  arch/powerpc/mm/ptdump/hashpagetable.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/arch/powerpc/mm/ptdump/hashpagetable.c 
> b/arch/powerpc/mm/ptdump/hashpagetable.c
> index b6ed9578382f..8ea5f9a3b658 100644
> --- a/arch/powerpc/mm/ptdump/hashpagetable.c
> +++ b/arch/powerpc/mm/ptdump/hashpagetable.c
> @@ -20,6 +20,7 @@
>  #include <asm/page.h>
>  #include <asm/pgalloc.h>
>  #include <asm/plpar_wrappers.h>
> +#include <asm/vio.h>

I don't think you want vio.h, hvcall.h has the definition of H_SUCCESS.

cheers


Re: [PATCH 2/4] powerpc/eeh: Release EEH device state synchronously

2020-04-02 Thread Oliver O'Halloran
On Mon, 2020-03-30 at 15:56 +1100, Sam Bobroff wrote:
> EEH device state is currently removed (by eeh_remove_device()) during
> the device release handler, which is invoked as the device's reference
> count drops to zero. This may take some time, or forever, as other
> threads may hold references.
> 
> However, the PCI device state is released synchronously by
> pci_stop_and_remove_bus_device(). This mismatch causes problems, for
> example the device may be re-discovered as a new device before the
> release handler has been called, leaving the PCI and EEH state
> mismatched.
> 
> So instead, call eeh_remove_device() from the bus device removal
> handlers, which are called synchronously in the removal path.
> 
> Signed-off-by: Sam Bobroff 
> ---
>  arch/powerpc/kernel/eeh.c | 26 ++
>  arch/powerpc/kernel/pci-hotplug.c |  2 --
>  2 files changed, 26 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
> index 17cb3e9b5697..c36c5a7db5ca 100644
> --- a/arch/powerpc/kernel/eeh.c
> +++ b/arch/powerpc/kernel/eeh.c
> @@ -1106,6 +1106,32 @@ static int eeh_init(void)
>  
>  core_initcall_sync(eeh_init);
>  
> +static int eeh_device_notifier(struct notifier_block *nb,
> +unsigned long action, void *data)
> +{
> + struct device *dev = data;
> +
> + switch (action) {
> + case BUS_NOTIFY_DEL_DEVICE:
> + eeh_remove_device(to_pci_dev(dev));
> + break;
> + default:
> + break;
> + }

A comment briefly explaining why we're not doing anything in the add
case might be nice.
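
Something along these lines, say (wording illustrative):

	switch (action) {
	case BUS_NOTIFY_DEL_DEVICE:
		eeh_remove_device(to_pci_dev(dev));
		break;
	default:
		/*
		 * EEH state is created in the probe path
		 * (eeh_add_device_early/late), so there is nothing to
		 * do here for BUS_NOTIFY_ADD_DEVICE and friends.
		 */
		break;
	}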

Reviewed-by: Oliver O'Halloran 

> + return NOTIFY_DONE;
> +}
> +
> +static struct notifier_block eeh_device_nb = {
> + .notifier_call = eeh_device_notifier,
> +};
> +
> +static __init int eeh_set_bus_notifier(void)
> +{
> + bus_register_notifier(&pci_bus_type, &eeh_device_nb);
> + return 0;
> +}
> +arch_initcall(eeh_set_bus_notifier);
> +
>  /**
>   * eeh_add_device_early - Enable EEH for the indicated device node
>   * @pdn: PCI device node for which to set up EEH
> diff --git a/arch/powerpc/kernel/pci-hotplug.c 
> b/arch/powerpc/kernel/pci-hotplug.c
> index d6a67f814983..28e9aa274f64 100644
> --- a/arch/powerpc/kernel/pci-hotplug.c
> +++ b/arch/powerpc/kernel/pci-hotplug.c
> @@ -57,8 +57,6 @@ void pcibios_release_device(struct pci_dev *dev)
>   struct pci_controller *phb = pci_bus_to_host(dev->bus);
>   struct pci_dn *pdn = pci_get_pdn(dev);
>  
> - eeh_remove_device(dev);
> -
>   if (phb->controller_ops.release_device)
>   phb->controller_ops.release_device(dev);
>  



Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA

2020-04-02 Thread Hoan Tran

Hi All,

On 3/31/20 7:31 AM, Baoquan He wrote:

On 03/31/20 at 04:21pm, Michal Hocko wrote:

On Tue 31-03-20 22:03:32, Baoquan He wrote:

Hi Michal,

On 03/31/20 at 10:55am, Michal Hocko wrote:

On Tue 31-03-20 11:14:23, Mike Rapoport wrote:

Maybe I mis-read the code, but I don't see how this could happen. In the
HAVE_MEMBLOCK_NODE_MAP=y case, free_area_init_node() calls
calculate_node_totalpages() that ensures that node->node_zones are entirely
within the node because this is checked in zone_spanned_pages_in_node().


zone_spanned_pages_in_node does check that the zone boundaries are within the
node boundaries. But that doesn't really tell anything about other
potential zones interleaving with the physical memory range.
zone->spanned_pages simply gives the physical range for the zone
including holes. Interleaving nodes are essentially a hole
(__absent_pages_in_range is going to skip those).

That means that free_area_init_core simply goes over the whole
physical zone range including holes, and that is why we need to check
for both physical and logical holes (aka other nodes).

The life would be so much easier if the whole thing would simply iterate
over memblocks...


The memblock iterating sounds like a great idea. I tried putting the
memblock iteration in the upper layer, memmap_init(), which is used for
boot mem only anyway. Do you think it's doable and OK? If yes, I can
work out a formal patch to make this simpler, as you said. The draft code
is below. Like this it reuses the existing code and involves little change.


Doing this would be a step in the right direction! I haven't checked the
code very closely though. The below sounds way too simple to be truth I
am afraid. First for_each_mem_pfn_range is available only for
CONFIG_HAVE_MEMBLOCK_NODE_MAP (which is one of the reasons why I keep
saying that I really hate that being conditional). Also I haven't really
checked the deferred initialization path - I have a very vague
recollection that it has been converted to the memblock api but I have
happilly dropped all that memory.


Thanks for your quick response and for pointing out the remaining
suspect aspects; I will investigate what you mentioned and see if they
have an impact.


I would like to check if we can still move on with my patch to remove
CONFIG_NODES_SPAN_OTHER_NODES and have another patch on top of it?


Thanks
Hoan



  

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 138a56c0f48f..558d421f294b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6007,14 +6007,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 * function.  They do not exist on hotplugged memory.
 */
if (context == MEMMAP_EARLY) {
-   if (!early_pfn_valid(pfn)) {
-   pfn = next_pfn(pfn);
-   continue;
-   }
-   if (!early_pfn_in_nid(pfn, nid)) {
-   pfn++;
-   continue;
-   }
if (overlap_memmap_init(zone, &pfn))
continue;
if (defer_init(nid, pfn, end_pfn))
@@ -6130,9 +6122,17 @@ static void __meminit zone_init_free_lists(struct zone *zone)
  }
  
  void __meminit __weak memmap_init(unsigned long size, int nid,
- unsigned long zone, unsigned long start_pfn)
+ unsigned long zone, unsigned long range_start_pfn)
  {
-   memmap_init_zone(size, nid, zone, start_pfn, MEMMAP_EARLY, NULL);
+   unsigned long start_pfn, end_pfn;
+   unsigned long range_end_pfn = range_start_pfn + size;
+   int i;
+   for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
+   start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
+   end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
+   if (end_pfn > start_pfn)
+   memmap_init_zone(size, nid, zone, start_pfn, MEMMAP_EARLY, NULL);
+   }
  }
  
  static int zone_batchsize(struct zone *zone)
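
One detail worth noting in the draft: memmap_init_zone() is still
passed the full `size' for every intersected range. Presumably each
call should only cover the clamped span, something like:

	if (end_pfn > start_pfn)
		memmap_init_zone(end_pfn - start_pfn, nid, zone,
				 start_pfn, MEMMAP_EARLY, NULL);

otherwise every call walks the whole zone span again.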


--
Michal Hocko
SUSE Labs





Re: [PATCH v4 03/25] powerpc/powernv: Map & release OpenCAPI LPC memory

2020-04-02 Thread Michael Ellerman
Benjamin Herrenschmidt  writes:
> On Wed, 2020-04-01 at 01:48 -0700, Dan Williams wrote:
>> > 
>> > +u64 pnv_ocxl_platform_lpc_setup(struct pci_dev *pdev, u64 size)
>> > +{
>> > +   struct pci_controller *hose = pci_bus_to_host(pdev->bus);
>> > +   struct pnv_phb *phb = hose->private_data;
>> 
>> Is calling the local variable 'hose' instead of 'host' on purpose?
>
> Haha that's funny :-)
>
> It's an ld usage that comes iirc from sparc ? or maybe alpha ?

Yeah it was alpha, I found it in the history tree:

  
https://github.com/mpe/linux-fullhistory/blob/1928de59ba4209dc5e9f2cef63560c09ba0df73b/arch/alpha/kernel/mcpcia.c

And airlied found an old manual which confirms it:

  The TIOP module interfaces the AlphaServer 8000 system bus to four I/O 
channels, called "hoses."

  https://www.hpl.hp.com/hpjournal/dtj/vol7num1/vol7num1art4.pdf


So at least now we know where it comes from.

It's also used widely in mips, microblaze, sh and a little bit in drm.

cheers


Re: [PATCH 1/4] powerpc/eeh: fix pseries_eeh_configure_bridge()

2020-04-02 Thread Oliver O'Halloran
On Mon, 2020-03-30 at 15:56 +1100, Sam Bobroff wrote:
> If a device is hot unplugged during EEH recovery, it's possible for the
> RTAS call to ibm,configure-pe in pseries_eeh_configure() to return
> parameter error (-3), however negative return values are not checked
> for and this leads to an infinite loop.
> 
> Fix this by correctly bailing out on negative values.
> 

This should probably be a standalone patch. Looks fine otherwise.

Reviewed-by: Oliver O'Halloran 


> Signed-off-by: Sam Bobroff 
> ---
>  arch/powerpc/platforms/pseries/eeh_pseries.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/eeh_pseries.c 
> b/arch/powerpc/platforms/pseries/eeh_pseries.c
> index 893ba3f562c4..c4ef03bec0de 100644
> --- a/arch/powerpc/platforms/pseries/eeh_pseries.c
> +++ b/arch/powerpc/platforms/pseries/eeh_pseries.c
> @@ -605,7 +605,7 @@ static int pseries_eeh_configure_bridge(struct eeh_pe *pe)
>   config_addr, BUID_HI(pe->phb->buid),
>   BUID_LO(pe->phb->buid));
>  
> - if (!ret)
> + if (ret <= 0)
>   return ret;
>  
>   /*



[PATCH 2/2] powerpc/dt_cpu_ftrs: Advertise support for ISA v3.1 if selected

2020-04-02 Thread Alistair Popple
Advertise support for the ISA v3.1 CPU feature if it is selected in the
device tree.

Signed-off-by: Alistair Popple 
---
 arch/powerpc/kernel/dt_cpu_ftrs.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c 
b/arch/powerpc/kernel/dt_cpu_ftrs.c
index ad75095c121f..7c00419bfb64 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -26,6 +26,7 @@
 /* Device-tree visible constants follow */
 #define ISA_V2_07B  2070
 #define ISA_V3_0B   3000
+#define ISA_V3_1    3100
 
 #define USABLE_PR   (1U << 0)
 #define USABLE_OS   (1U << 1)
@@ -658,6 +659,11 @@ static void __init cpufeatures_setup_start(u32 isa)
cur_cpu_spec->cpu_features |= CPU_FTR_ARCH_300;
cur_cpu_spec->cpu_user_features2 |= PPC_FEATURE2_ARCH_3_00;
}
+
+   if (isa >= 3100) {
+   cur_cpu_spec->cpu_features |= CPU_FTR_ARCH_31;
+   cur_cpu_spec->cpu_user_features2 |= PPC_FEATURE2_ARCH_3_1;
+   }
 }
 
 static bool __init cpufeatures_process_feature(struct dt_cpu_feature *f)
-- 
2.20.1



[PATCH 1/2] powerpc: Add base support for ISA v3.1

2020-04-02 Thread Alistair Popple
Newer ISA versions are enabled by clearing all bits in the PCR
associated with previous versions of the ISA. Enable ISA v3.1 support
by updating the PCR mask to include ISA v3.0. This ensures all PCR
bits corresponding to earlier architecture versions get cleared
thereby enabling ISA v3.1.

Signed-off-by: Alistair Popple 
---
 arch/powerpc/include/asm/cputable.h | 1 +
 arch/powerpc/include/asm/reg.h  | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/cputable.h 
b/arch/powerpc/include/asm/cputable.h
index 86efd5eb0389..5cd111c63b5a 100644
--- a/arch/powerpc/include/asm/cputable.h
+++ b/arch/powerpc/include/asm/cputable.h
@@ -213,6 +213,7 @@ static inline void cpu_feature_keys_init(void) { }
 #define CPU_FTR_P9_TIDR                        LONG_ASM_CONST(0x0000800000000000)
 #define CPU_FTR_P9_TLBIE_ERAT_BUG      LONG_ASM_CONST(0x0001000000000000)
 #define CPU_FTR_P9_RADIX_PREFETCH_BUG  LONG_ASM_CONST(0x0002000000000000)
+#define CPU_FTR_ARCH_31                        LONG_ASM_CONST(0x0004000000000000)
 
 #ifndef __ASSEMBLY__
 
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 5f5c0254ee3a..163773cf011b 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -487,10 +487,11 @@
  * determine both the compatibility level which we want to emulate and the
  * compatibility level which the host is capable of emulating.
  */
+#define   PCR_ARCH_300 0x10 /* Architecture 3.00 */
 #define   PCR_ARCH_207 0x8 /* Architecture 2.07 */
 #define   PCR_ARCH_206 0x4 /* Architecture 2.06 */
 #define   PCR_ARCH_205 0x2 /* Architecture 2.05 */
-#define   PCR_LOW_BITS (PCR_ARCH_207 | PCR_ARCH_206 | PCR_ARCH_205)
+#define   PCR_LOW_BITS (PCR_ARCH_207 | PCR_ARCH_206 | PCR_ARCH_205 | PCR_ARCH_300)
 #define   PCR_MASK ~(PCR_HIGH_BITS | PCR_LOW_BITS) /* PCR Reserved Bits */
 #define SPRN_HEIR   0x153   /* Hypervisor Emulated Instruction Register */
 #define SPRN_TLBINDEXR 0x154   /* P7 TLB control register */
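
To make the mechanics concrete, a short illustration (not part of the
patch) of what the updated mask does when a core is set to run at its
native ISA level -- the PCR write is hypervisor-privileged:

        /*
         * Clear every architected compatibility bit. PCR_LOW_BITS now
         * includes PCR_ARCH_300, so the v3.0 compat bit is cleared too
         * and the core exposes its native ISA v3.1.
         */
        mtspr(SPRN_PCR, mfspr(SPRN_PCR) & ~PCR_LOW_BITS);

Leaving PCR_ARCH_300 set would instead pin problem state to ISA v3.0
behaviour.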
-- 
2.20.1



RE: [PATCH v4 08/25] ocxl: Emit a log message showing how much LPC memory was detected

2020-04-02 Thread Alastair D'Silva
> -Original Message-
> From: Dan Williams 
> Sent: Wednesday, 1 April 2020 7:49 PM
> To: Alastair D'Silva 
> Cc: Aneesh Kumar K . V ; Oliver O'Halloran
> ; Benjamin Herrenschmidt
> ; Paul Mackerras ; Michael
> Ellerman ; Frederic Barrat ;
> Andrew Donnellan ; Arnd Bergmann
> ; Greg Kroah-Hartman ;
> Vishal Verma ; Dave Jiang
> ; Ira Weiny ; Andrew Morton
> ; Mauro Carvalho Chehab
> ; David S. Miller ;
> Rob Herring ; Anton Blanchard ;
> Krzysztof Kozlowski ; Mahesh Salgaonkar
> ; Madhavan Srinivasan
> ; Cédric Le Goater ; Anju T
> Sudhakar ; Hari Bathini
> ; Thomas Gleixner ; Greg
> Kurz ; Nicholas Piggin ; Masahiro
> Yamada ; Alexey Kardashevskiy
> ; Linux Kernel Mailing List ;
> linuxppc-dev ; linux-nvdimm  nvd...@lists.01.org>; Linux MM 
> Subject: Re: [PATCH v4 08/25] ocxl: Emit a log message showing how much
> LPC memory was detected
> 
> On Sun, Mar 29, 2020 at 10:23 PM Alastair D'Silva 
> wrote:
> >
> > This patch emits a message showing how much LPC memory & special
> > purpose memory was detected on an OCXL device.
> >
> > Signed-off-by: Alastair D'Silva 
> > Acked-by: Frederic Barrat 
> > Acked-by: Andrew Donnellan 
> > ---
> >  drivers/misc/ocxl/config.c | 4 
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/drivers/misc/ocxl/config.c b/drivers/misc/ocxl/config.c
> > index a62e3d7db2bf..69cca341d446 100644
> > --- a/drivers/misc/ocxl/config.c
> > +++ b/drivers/misc/ocxl/config.c
> > @@ -568,6 +568,10 @@ static int read_afu_lpc_memory_info(struct
> pci_dev *dev,
> > afu->special_purpose_mem_size =
> > total_mem_size - lpc_mem_size;
> > }
> > +
> > +   dev_info(&dev->dev, "Probed LPC memory of %#llx bytes and special
> purpose memory of %#llx bytes\n",
> > +afu->lpc_mem_size, afu->special_purpose_mem_size);
> 
> A patch for a single log message is too fine grained for my taste, let's 
> squash
> this into another patch in the series.
> 

I'm not sure there's a great place for it. At a pinch, it could go with
the previous patch, but they are really different layers.

> > +
> > return 0;
> >  }
> >
> > --
> > 2.24.1
> >
> 
> 
> --
> This email has been checked for viruses by AVG.
> https://www.avg.com




RE: [PATCH v4 07/25] ocxl: Add functions to map/unmap LPC memory

2020-04-02 Thread Alastair D'Silva


> -Original Message-
> From: Dan Williams 
> Sent: Wednesday, 1 April 2020 7:49 PM
> To: Alastair D'Silva 
> Cc: Aneesh Kumar K . V ; Oliver O'Halloran
> ; Benjamin Herrenschmidt
> ; Paul Mackerras ; Michael
> Ellerman ; Frederic Barrat ;
> Andrew Donnellan ; Arnd Bergmann
> ; Greg Kroah-Hartman ;
> Vishal Verma ; Dave Jiang
> ; Ira Weiny ; Andrew Morton
> ; Mauro Carvalho Chehab
> ; David S. Miller ;
> Rob Herring ; Anton Blanchard ;
> Krzysztof Kozlowski ; Mahesh Salgaonkar
> ; Madhavan Srinivasan
> ; Cédric Le Goater ; Anju T
> Sudhakar ; Hari Bathini
> ; Thomas Gleixner ; Greg
> Kurz ; Nicholas Piggin ; Masahiro
> Yamada ; Alexey Kardashevskiy
> ; Linux Kernel Mailing List ;
> linuxppc-dev ; linux-nvdimm  nvd...@lists.01.org>; Linux MM 
> Subject: Re: [PATCH v4 07/25] ocxl: Add functions to map/unmap LPC
> memory
> 
> On Sun, Mar 29, 2020 at 10:23 PM Alastair D'Silva 
> wrote:
> >
> > Add functions to map/unmap LPC memory
> >
> 
> "map memory" is an overloaded term. I'm guessing this patch has nothing to
> do with mapping memory in the MMU. Is it updating hardware resource
> decoders to start claiming address space that was allocated previously?
> 

It's similar to MMIO - these calls end up setting up a BAR which places the LPC
memory into a physical memory range addressable by the kernel.
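
For illustration, a caller of the new functions might do, roughly:

        int rc = ocxl_afu_map_lpc_mem(afu);     /* sets up the BAR */
        if (rc)
                return rc;

        /* the resource now spans the kernel-addressable LPC range */
        struct resource *res = ocxl_afu_lpc_mem(afu);
        dev_info(&dev->dev, "LPC memory at %pR\n", res);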

> > Signed-off-by: Alastair D'Silva 
> > Acked-by: Frederic Barrat 
> > ---
> >  drivers/misc/ocxl/core.c  | 51 +++
> >  drivers/misc/ocxl/ocxl_internal.h |  3 ++
> >  include/misc/ocxl.h   | 21 +
> >  3 files changed, 75 insertions(+)
> >
> > diff --git a/drivers/misc/ocxl/core.c b/drivers/misc/ocxl/core.c index
> > 2531c6cf19a0..75ff14e3882a 100644
> > --- a/drivers/misc/ocxl/core.c
> > +++ b/drivers/misc/ocxl/core.c
> > @@ -210,6 +210,56 @@ static void unmap_mmio_areas(struct ocxl_afu
> *afu)
> > release_fn_bar(afu->fn, afu->config.global_mmio_bar);  }
> >
> > +int ocxl_afu_map_lpc_mem(struct ocxl_afu *afu) {
> > +   struct pci_dev *dev = to_pci_dev(afu->fn->dev.parent);
> > +
> > +   if ((afu->config.lpc_mem_size + afu-
> >config.special_purpose_mem_size) == 0)
> > +   return 0;
> > +
> > +   afu->lpc_base_addr = ocxl_link_lpc_map(afu->fn->link, dev);
> > +   if (afu->lpc_base_addr == 0)
> > +   return -EINVAL;
> > +
> > +   if (afu->config.lpc_mem_size > 0) {
> > +   afu->lpc_res.start = afu->lpc_base_addr + afu-
> >config.lpc_mem_offset;
> > +   afu->lpc_res.end = afu->lpc_res.start + 
> > afu->config.lpc_mem_size
> - 1;
> > +   }
> > +
> > +   if (afu->config.special_purpose_mem_size > 0) {
> > +   afu->special_purpose_res.start = afu->lpc_base_addr +
> > +
> > afu->config.special_purpose_mem_offset;
> > +   afu->special_purpose_res.end = 
> > afu->special_purpose_res.start +
> > +  
> > afu->config.special_purpose_mem_size - 1;
> > +   }
> > +
> > +   return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(ocxl_afu_map_lpc_mem);
> > +
> > +struct resource *ocxl_afu_lpc_mem(struct ocxl_afu *afu) {
> > +   return &afu->lpc_res;
> > +}
> > +EXPORT_SYMBOL_GPL(ocxl_afu_lpc_mem);
> > +
> > +static void unmap_lpc_mem(struct ocxl_afu *afu) {
> > +   struct pci_dev *dev = to_pci_dev(afu->fn->dev.parent);
> > +
> > +   if (afu->lpc_res.start || afu->special_purpose_res.start) {
> > +   void *link = afu->fn->link;
> > +
> > +   // only release the link when the the last consumer calls 
> > release
> > +   ocxl_link_lpc_release(link, dev);
> > +
> > +   afu->lpc_res.start = 0;
> > +   afu->lpc_res.end = 0;
> > +   afu->special_purpose_res.start = 0;
> > +   afu->special_purpose_res.end = 0;
> > +   }
> > +}
> > +
> >  static int configure_afu(struct ocxl_afu *afu, u8 afu_idx, struct
> > pci_dev *dev)  {
> > int rc;
> > @@ -251,6 +301,7 @@ static int configure_afu(struct ocxl_afu *afu, u8
> > afu_idx, struct pci_dev *dev)
> >
> >  static void deconfigure_afu(struct ocxl_afu *afu)  {
> > +   unmap_lpc_mem(afu);
> > unmap_mmio_areas(afu);
> > reclaim_afu_pasid(afu);
> > reclaim_afu_actag(afu);
> > diff --git a/drivers/misc/ocxl/ocxl_internal.h
> > b/drivers/misc/ocxl/ocxl_internal.h
> > index 2d7575225bd7..7b975a89db7b 100644
> > --- a/drivers/misc/ocxl/ocxl_internal.h
> > +++ b/drivers/misc/ocxl/ocxl_internal.h
> > @@ -52,6 +52,9 @@ struct ocxl_afu {
> > void __iomem *global_mmio_ptr;
> > u64 pp_mmio_start;
> > void *private;
> > +   u64 lpc_base_addr; /* Covers both LPC & special purpose memory */
> > +   struct resource lpc_res;
> > +   struct resource special_purpose_res;
> >  };
> >
> >  enum ocxl_context_status {
> > diff --git a/include/misc/ocxl.h b/include/misc/ocxl.h index
> > 357ef1aadb

Re: [RFC/PATCH 2/3] pseries/kvm: Clear PSSCR[ESL|EC] bits before guest entry

2020-04-02 Thread Nicholas Piggin
Gautham R. Shenoy's on March 31, 2020 10:10 pm:
> From: "Gautham R. Shenoy" 
> 
> ISA v3.0 allows the guest to execute a stop instruction. For this, the
> PSSCR[ESL|EC] bits need to be cleared by the hypervisor before
> scheduling in the guest vCPU.
> 
> Currently we always schedule in a vCPU with PSSCR[ESL|EC] bits
> set. This patch changes the behaviour to enter the guest with
> PSSCR[ESL|EC] bits cleared. This is an RFC patch where we
> unconditionally clear these bits. Ideally this should be done
> conditionally on platforms where the guest stop instruction has no
> Bugs (starting POWER9 DD2.3).

How will guests know that they can use this facility safely after your
series? You need both DD2.3 and a patched KVM.

> 
> Signed-off-by: Gautham R. Shenoy 
> ---
>  arch/powerpc/kvm/book3s_hv.c|  2 +-
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 25 +
>  2 files changed, 14 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index cdb7224..36d059a 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -3424,7 +3424,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
> *vcpu, u64 time_limit,
>   mtspr(SPRN_IC, vcpu->arch.ic);
>   mtspr(SPRN_PID, vcpu->arch.pid);
>  
> - mtspr(SPRN_PSSCR, vcpu->arch.psscr | PSSCR_EC |
> + mtspr(SPRN_PSSCR, (vcpu->arch.psscr  & ~(PSSCR_EC | PSSCR_ESL)) |
> (local_paca->kvm_hstate.fake_suspend << PSSCR_FAKE_SUSPEND_LG));
>  
>   mtspr(SPRN_HFSCR, vcpu->arch.hfscr);
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
> b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index dbc2fec..c2daec3 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -823,6 +823,18 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)
>   mtspr   SPRN_PID, r7
>   mtspr   SPRN_WORT, r8
>  BEGIN_FTR_SECTION
> + /* POWER9-only registers */
> + ld  r5, VCPU_TID(r4)
> + ld  r6, VCPU_PSSCR(r4)
> + lbz r8, HSTATE_FAKE_SUSPEND(r13)
> + lis r7, (PSSCR_EC | PSSCR_ESL)@h /* Allow guest to call stop */
> + andcr6, r6, r7
> + rldimi  r6, r8, PSSCR_FAKE_SUSPEND_LG, 63 - PSSCR_FAKE_SUSPEND_LG
> + ld  r7, VCPU_HFSCR(r4)
> + mtspr   SPRN_TIDR, r5
> + mtspr   SPRN_PSSCR, r6
> + mtspr   SPRN_HFSCR, r7
> +FTR_SECTION_ELSE

Why did you move these around? Just because the POWER9 section became
larger than the other?

That's a real wart in the instruction patching implementation, I think
we can fix it by padding with nops in the macros.

Can you just add the additional required nops to the top branch without
changing them around for this patch, so it's easier to see what's going
on? The end result will be the same after patching. Actually changing
these around can have a slight unintended consequence in that code that
runs before features were patched will execute the IF code. Not a
problem here, but another reason why the instruction patching 
restriction is annoying.

Thanks,
Nick

>   /* POWER8-only registers */
>   ld  r5, VCPU_TCSCR(r4)
>   ld  r6, VCPU_ACOP(r4)
> @@ -833,18 +845,7 @@ BEGIN_FTR_SECTION
>   mtspr   SPRN_CSIGR, r7
>   mtspr   SPRN_TACR, r8
>   nop
> -FTR_SECTION_ELSE
> - /* POWER9-only registers */
> - ld  r5, VCPU_TID(r4)
> - ld  r6, VCPU_PSSCR(r4)
> - lbz r8, HSTATE_FAKE_SUSPEND(r13)
> - orisr6, r6, PSSCR_EC@h  /* This makes stop trap to HV */
> - rldimi  r6, r8, PSSCR_FAKE_SUSPEND_LG, 63 - PSSCR_FAKE_SUSPEND_LG
> - ld  r7, VCPU_HFSCR(r4)
> - mtspr   SPRN_TIDR, r5
> - mtspr   SPRN_PSSCR, r6
> - mtspr   SPRN_HFSCR, r7
> -ALT_FTR_SECTION_END_IFCLR(CPU_FTR_ARCH_300)
> +ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
>  8:
>  
>   ld  r5, VCPU_SPRG0(r4)
> -- 
> 1.9.4
> 
> 


Re: [RFC/PATCH 1/3] powerpc/kvm: Handle H_FAC_UNAVAIL when guest executes stop.

2020-04-02 Thread Nicholas Piggin
Gautham R. Shenoy's on March 31, 2020 10:10 pm:
> From: "Gautham R. Shenoy" 
> 
> If a guest executes a stop instruction when the hypervisor has set the
> PSSCR[ESL|EC] bits, the processor will throw an Hypervisor Facility
> Unavailable exception. Currently when we receive this exception, we
> only check if the exeception is generated due to a doorbell
> instruction, in which case we emulate it. For all other cases,
> including the case when the guest executes a stop-instruction, the
> hypervisor sends a PROGILL to the guest program, which results in a
> guest crash.
> 
> This patch adds code to handle the case when the hypervisor receives a
> H_FAC_UNAVAIL exception due to guest executing the stop
> instruction. The hypervisor increments the pc to the next instruction
> and resumes the guest as expected by the semantics of the
> PSSCR[ESL|EC] = 0 stop instruction.

This seems reasonable, I don't think we need to crash the guest here.

Thanks,
Nick


Re: [PATCH] powerpc/mce: Add MCE notification chain

2020-04-02 Thread Nicholas Piggin
Ganesh Goudar's on March 30, 2020 5:12 pm:
> From: Santosh S 
> 
> Introduce notification chain which lets know about uncorrected memory
> errors(UE). This would help prospective users in pmem or nvdimm subsystem
> to track bad blocks for better handling of persistent memory allocations.
> 
> Signed-off-by: Santosh S 
> Signed-off-by: Ganesh Goudar 

Do you have any such users yet? It would be good to refer to an example 
user and give a brief description of what it does in its notifier.
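
For illustration, I'd expect a prospective pmem consumer to look
something like this (the badblocks plumbing is hand-waved, and I'm
assuming the series exports a registration helper for the chain):

	static int pmem_mce_notify(struct notifier_block *nb,
				   unsigned long val, void *data)
	{
		struct machine_check_event *evt = data;

		if (evt->error_type == MCE_ERROR_TYPE_UE &&
		    evt->u.ue_error.physical_address_provided)
			/* translate evt->u.ue_error.physical_address
			 * into a badblocks entry for the region */ ;

		return NOTIFY_OK;
	}

	static struct notifier_block pmem_mce_nb = {
		.notifier_call = pmem_mce_notify,
	};

	/* at driver init; assumed wrapper exported by this patch */
	mce_register_notifier(&pmem_mce_nb);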

> @@ -263,6 +277,7 @@ static void machine_process_ue_event(struct work_struct 
> *work)
>   while (__this_cpu_read(mce_ue_count) > 0) {
>   index = __this_cpu_read(mce_ue_count) - 1;
>   evt = this_cpu_ptr(&mce_ue_event_queue[index]);
> + blocking_notifier_call_chain(&mce_notifier_list, 0, evt);

Can we really use a blocking notifier here? I'm not sure that we can.

Thanks,
Nick


Re: [PATCH v5 1/4] powerpc/papr_scm: Fetch nvdimm health information from PHYP

2020-04-02 Thread Dan Williams
On Wed, Apr 1, 2020 at 8:08 PM Dan Williams  wrote:
[..]
> >  * "locked" : Indicating that nvdimm contents cant be modified
> >until next power cycle.
>
> There is the generic NDD_LOCKED flag, can you use that? ...and in
> general I wonder if we should try to unify all the common papr_scm and
> nfit health flags in a generic location. It will already be the case
> the ndctl needs to look somewhere papr specific for this data maybe it
> all should have been generic from the beginning.

The more I think about this more I think this would be a good time to
introduce a common "health/" attribute group under the generic nmemX
sysfs, and then have one flag per-file / attribute. Not only does that
match the recommended sysfs ABI better, but it allows ndctl to
enumerate which flags are supported in addition to their state.
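
Roughly, one attribute per flag under a named group (sketch):

	static ssize_t locked_show(struct device *dev,
				   struct device_attribute *attr, char *buf)
	{
		struct nvdimm *nvdimm = to_nvdimm(dev);

		return sprintf(buf, "%d\n",
			       !!test_bit(NDD_LOCKED, &nvdimm->flags));
	}
	static DEVICE_ATTR_RO(locked);

	static struct attribute *nvdimm_health_attrs[] = {
		&dev_attr_locked.attr,
		NULL,
	};

	/* ".name" makes these appear under nmemX/health/ */
	static const struct attribute_group nvdimm_health_group = {
		.name = "health",
		.attrs = nvdimm_health_attrs,
	};

ndctl can then readdir() health/ to discover which flags a given
device supports.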

> In any event, can you also add this content to a new
> Documentation/ABI/testing/sysfs-bus-papr? See sysfs-bus-nfit for
> comparison.


Re: [PATCH RESEND 1/4] uaccess: Add user_read_access_begin/end and user_write_access_begin/end

2020-04-02 Thread Al Viro
On Thu, Apr 02, 2020 at 11:35:57AM -0700, Kees Cook wrote:

> Yup, I think it's a weakness of the ARM implementation and I'd like to
> not extend it further. AFAIK we should never nest, but I would not be
> surprised at all if we did.
> 
> If we were looking at a design goal for all architectures, I'd like
> to be doing what the public PaX patchset did for their memory access
> switching, which is to alarm if calling into "enable" found the access
> already enabled, etc. Such a condition would show an unexpected nesting
> (like we've seen with similar constructs with set_fs() not getting reset
> during an exception handler, etc etc).

FWIW, maybe I'm misreading the ARM uaccess logics, but... it smells like
KERNEL_DS is somewhat more dangerous there than on e.g. x86.

Look: with CONFIG_CPU_DOMAINS, set_fs(KERNEL_DS) tells MMU to ignore
per-page permission bits in DOMAIN_KERNEL (i.e. for kernel address
ranges), allowing them even if they would normally be denied.  We need
that for actual uaccess loads/stores, since those use insns that pretend
to be done in user mode and we want them to access the kernel pages.
But that affects the normal loads/stores as well; unless I'm misreading
that code, it will ignore (supervisor) r/o on a page.  And that's not
just for the code inside the uaccess blocks; *everything* done under
KERNEL_DS is subject to that.

Why do we do that (modify_domain(), that is) inside set_fs() and not
in uaccess_enable() et.al.?
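
For anyone following along, the coupling being pointed at looks roughly
like this in arch/arm (paraphrased from memory, so treat it as a sketch
rather than a verbatim quote of the tree):

	static inline void set_fs(mm_segment_t fs)
	{
		current_thread_info()->addr_limit = fs;
		/*
		 * KERNEL_DS flips DOMAIN_KERNEL into manager mode, which
		 * makes *all* loads/stores ignore the per-page permission
		 * bits -- not just the uaccess ones.
		 */
		modify_domain(DOMAIN_KERNEL,
			      fs == KERNEL_DS ? DOMAIN_MANAGER : DOMAIN_CLIENT);
	}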


Re: [PATCH v4 16/25] nvdimm/ocxl: Implement the Read Error Log command

2020-04-02 Thread Dan Williams
On Tue, Mar 31, 2020 at 1:59 AM Alastair D'Silva  wrote:
>
> The read error log command extracts information from the controller's
> internal error log.
>
> This patch exposes this information in 2 ways:
> - During probe, if an error occurs & a log is available, print it to the
>   console
> - After probe, make the error log available to userspace via an IOCTL.
>   Userspace is notified of pending error logs in a later patch
>   ("powerpc/powernv/pmem: Forward events to userspace")

So, have a look at the recent papr_scm patches to add health flags and
smart data retrieval. I'd prefer to extend existing nvdimm device
retrieval mechanisms than invent new ones.


>
> Signed-off-by: Alastair D'Silva 
> ---
>  .../userspace-api/ioctl/ioctl-number.rst  |   1 +
>  drivers/nvdimm/ocxl/main.c| 240 ++
>  include/uapi/nvdimm/ocxlpmem.h|  46 
>  3 files changed, 287 insertions(+)
>  create mode 100644 include/uapi/nvdimm/ocxlpmem.h
>
> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst 
> b/Documentation/userspace-api/ioctl/ioctl-number.rst
> index 9425377615ce..ba0ce7dca643 100644
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -340,6 +340,7 @@ Code  Seq#Include File
>Comments
>  0xC0  00-0F  linux/usb/iowarrior.h
>  0xCA  00-0F  uapi/misc/cxl.h
>  0xCA  10-2F  uapi/misc/ocxl.h
> +0xCA  30-3F  uapi/nvdimm/ocxlpmem.h  
> OpenCAPI Persistent Memory
>  0xCA  80-BF  uapi/scsi/cxlflash_ioctl.h
>  0xCB  00-1F  CBM 
> serial IEC bus in development:
>   
> 
> diff --git a/drivers/nvdimm/ocxl/main.c b/drivers/nvdimm/ocxl/main.c
> index 9b85fcd3f1c9..e6be0029f658 100644
> --- a/drivers/nvdimm/ocxl/main.c
> +++ b/drivers/nvdimm/ocxl/main.c
> @@ -13,6 +13,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "ocxlpmem.h"
>
>  static const struct pci_device_id pci_tbl[] = {
> @@ -401,10 +402,190 @@ static int file_release(struct inode *inode, struct 
> file *file)
> return 0;
>  }
>
> +/**
> + * error_log_header_parse() - Parse the first 64 bits of the error log 
> command response
> + * @ocxlpmem: the device metadata
> + * @length: out, returns the number of bytes in the response (excluding the 
> 64 bit header)
> + */
> +static int error_log_header_parse(struct ocxlpmem *ocxlpmem, u16 *length)
> +{
> +   int rc;
> +   u64 val;
> +   u16 data_identifier;
> +   u32 data_length;
> +
> +   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu,
> +ocxlpmem->admin_command.data_offset,
> +OCXL_LITTLE_ENDIAN, &val);
> +   if (rc)
> +   return rc;
> +
> +   data_identifier = val >> 48;
> +   data_length = val & 0x;
> +
> +   if (data_identifier != 0x454C) { // 'EL'
> +   dev_err(&ocxlpmem->dev,
> +   "Bad data identifier for error log data, expected 
> 'EL', got '%2s' (%#x), data_length=%u\n",
> +   (char *)&data_identifier,
> +   (unsigned int)data_identifier, data_length);
> +   return -EINVAL;
> +   }
> +
> +   *length = data_length;
> +   return 0;
> +}
> +
> +static int read_error_log(struct ocxlpmem *ocxlpmem,
> + struct ioctl_ocxlpmem_error_log *log,
> + bool buf_is_user)
> +{
> +   u64 val;
> +   u16 user_buf_length;
> +   u16 buf_length;
> +   u64 *buf = (u64 *)log->buf_ptr;
> +   u16 i;
> +   int rc;
> +
> +   if (log->buf_size % 8)
> +   return -EINVAL;
> +
> +   rc = ocxlpmem_chi(ocxlpmem, &val);
> +   if (rc)
> +   return rc;
> +
> +   if (!(val & GLOBAL_MMIO_CHI_ELA))
> +   return -EAGAIN;
> +
> +   user_buf_length = log->buf_size;
> +
> +   mutex_lock(&ocxlpmem->admin_command.lock);
> +
> +   rc = admin_command_execute(ocxlpmem, ADMIN_COMMAND_ERRLOG);
> +   if (rc != STATUS_SUCCESS) {
> +   warn_status(ocxlpmem,
> +   "Unexpected status from retrieve error log", rc);
> +   goto out;
> +   }
> +
> +   rc = error_log_header_parse(ocxlpmem, &log->buf_size);
> +   if (rc)
> +   goto out;
> +   // log->buf_size now contains the returned buffer size, not the user 
> size
> +
> +   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu,
> +ocxlpmem->admin_command.data_offset + 
> 0x08,
> +OCXL_LITTLE_ENDIAN, &val);
> +   if (rc)
> +   goto out;
> +
> +   log->log_identifier = val >> 32;
> +   log->program

Re: [PATCH v3 1/1] ppc/crash: Reset spinlocks during crash

2020-04-02 Thread Leonardo Bras
On Thu, 2020-04-02 at 22:28 +1100, Michael Ellerman wrote:
> Leonardo Bras  writes:
> > During a crash, there is chance that the cpus that handle the NMI IPI
> > are holding a spin_lock. If this spin_lock is needed by crashing_cpu it
> > will cause a deadlock. (rtas.lock and printk logbuf_lock as of today)
> > 
> > This is a problem if the system has kdump set up, given if it crashes
> > for any reason kdump may not be saved for crash analysis.
> > 
> > After NMI IPI is sent to all other cpus, force unlock all spinlocks
> > needed for finishing crash routine.
> 
> I'm not convinced this is the right approach.

Me neither. I think it's a very hacky solution, but I couldn't think of
anything better at the time.

> Busting locks is risky, it could easily cause a crash if data structures
> are left in some inconsistent state.
> 
> I think we need to make this code more careful about what it's doing.
> There's a clue at the top of default_machine_crash_shutdown(), which
> calls crash_kexec_prepare_cpus():
> 
>* This function is only called after the system
>* has panicked or is otherwise in a critical state.
>* The minimum amount of code to allow a kexec'd kernel
>* to run successfully needs to happen here.
> 
> 
> You said the "IPI complete" message was the cause of one lockup:
> 
>   #0  arch_spin_lock 
>   #1  do_raw_spin_lock 
>   #2  __raw_spin_lock 
>   #3  _raw_spin_lock 
>   #4  vprintk_emit 
>   #5  vprintk_func
>   #7  crash_kexec_prepare_cpus 
>   #8  default_machine_crash_shutdown
>   #9  machine_crash_shutdown 
>   #10 __crash_kexec
>   #11 crash_kexec
>   #12 oops_end
> 
> TBH I think we could just drop that printk() entirely.
> 
> Or we could tell printk() that we're in NMI context so that it uses the
> percpu buffers.
> 
> We should probably do the latter anyway, in case there's any other code
> we call that inadvertently calls printk().
> 

I was not aware of using per-cpu buffers in printk. It may be a nice
solution.

There is another printk call there:
printk("kexec: Starting switchover sequence.\n");
in default_machine_kexec().

Two printk calls and one rtas call: that's all I could see using a spinlock
after IPI complete.
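
For reference, a sketch of what "tell printk() we're in NMI context"
could look like around this path; printk_nmi_enter()/printk_nmi_exit()
are the existing hooks for the lockless per-cpu buffers, and the exact
placement is the open question:

	/* Route any stray printk() through the per-cpu NMI buffers
	 * while the other CPUs are being taken down. */
	printk_nmi_enter();
	crash_kexec_prepare_cpus(crashing_cpu);
	/* ... mask interrupts, shut down devices ... */
	printk_nmi_exit();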

> 
> The RTAS trace you sent was:
> 
>   #0 arch_spin_lock
>   #1  lock_rtas () 
>   #2  rtas_call (token=8204, nargs=1, nret=1, outputs=0x0)
>   #3  ics_rtas_mask_real_irq (hw_irq=4100) 
>   #4  machine_kexec_mask_interrupts
>   #5  default_machine_crash_shutdown
>   #6  machine_crash_shutdown 
>   #7  __crash_kexec
>   #8  crash_kexec
>   #9  oops_end
> 
> 
> Which doesn't make it clear who holds the RTAS lock. We really shouldn't
> be crashing while holding the RTAS lock, but I guess it could happen.
> Can you get a full backtrace?
> 

Oh, all traces are from the thread that called the crash, by writing
'c' to sysrq. That is what I am using to reproduce.

#10 bad_page_fault
#11 handle_page_fault
#12 __handle_sysrq (key=99, check_mask=false) 
#13 write_sysrq_trigger 
#14 proc_reg_write
#15 __vfs_write
#16 vfs_write
#17 SYSC_write
#18 SyS_write
#19 system_call

> 
> PAPR says we are not allowed to have multiple CPUs calling RTAS at once,
> except for a very small list of RTAS calls. So if we bust the RTAS lock
> there's a risk we violate that part of PAPR and crash even harder.

Interesting, I was not aware.

> 
> Also it's not specific to kdump, we can't even get through a normal
> reboot if we crash with the RTAS lock held.
> 
> Anyway here's a patch with some ideas. That allows me to get from a
> crash with the RTAS lock held through kdump into the 2nd kernel. But it
> only works if it's the crashing CPU that holds the RTAS lock.
> 

Nice idea. 
But my test environment is just triggering a crash from sysrq, so I
think it would not improve the result, given that this thread is
probably not holding the lock at that time.

I noticed that when rtas is locked, irqs and preemption are also
disabled.

Should the IPI sent by crash be able to interrupt a thread with
disabled irqs?

Best regards,
Leonardo Bras


> cheers
> 
> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
> index c5fa251b8950..44ce74966d60 100644
> --- a/arch/powerpc/kernel/rtas.c
> +++ b/arch/powerpc/kernel/rtas.c
> @@ -25,6 +25,7 @@
>  #include 
>  #include 
>  
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -65,6 +66,8 @@ unsigned long rtas_rmo_buf;
>  void (*rtas_flash_term_hook)(int);
>  EXPORT_SYMBOL(rtas_flash_term_hook);
>  
> +static int rtas_lock_holder = -1;
> +
>  /* RTAS use home made raw locking instead of spin_lock_irqsave
>   * because those can be called from within really nasty contexts
>   * such as having the timebase stopped which would lockup with
> @@ -76,7 +79,20 @@ static unsigned long lock_rtas(void)
>  
>   local_irq_save(flags);
>   preempt_disable();
> - arch_spin_lock(&rtas.lock);
> +
> + if (!arch_spin_trylock(&rtas.lock)) {
> + // Couldn't get the lock, do we already hold it?
> + if (rt

Re: [PATCH v2 01/22] powerpc/pkeys: Avoid using lockless page table walk

2020-04-02 Thread Ram Pai
On Thu, Mar 19, 2020 at 09:25:48AM +0530, Aneesh Kumar K.V wrote:
> Fetch pkey from vma instead of linux page table. Also document the fact that 
> in
> some cases the pkey returned in siginfo won't be the same as the one we took
> keyfault on. Even with linux page table walk, we can end up in a similar 
> scenario.

There is no way to correctly ensure that the key returned through
siginfo is actually the key that took the fault.  We either get it
from the page table or from the corresponding vma.

So we had to choose the lesser evil. Getting it from the page table was
faster, and did not involve taking any locks.  Getting it from the vma
was slower, since it needed locks.  Also, I faintly recall there
is a scenario where the address that gets a key fault has no
corresponding VMA associated with it yet.

Hence the logic used was --
if it is a key fault, then procure the key quickly
from the page table.  In the unlikely event that the fault is
something else, but still has a non-permissive key associated
with it, get the key from the vma.

A well written application should avoid changing the key of an address
space without synchronizing the corresponding threads that operate in
that address range.  However, if the application neglects to do so, then
it is vulnerable to undefined behavior. There is no way to prove that
the reported key is correct or incorrect, since there is no provable
order between the two events: the key-fault event and the key-change
event.

Hence I think the change proposed in this patch may not be necessary.
RP

> 
> Cc: Ram Pai 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/include/asm/mmu.h|  9 ---
>  arch/powerpc/mm/book3s64/hash_utils.c | 24 
>  arch/powerpc/mm/fault.c   | 83 +++
>  3 files changed, 60 insertions(+), 56 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/mmu.h b/arch/powerpc/include/asm/mmu.h
> index 0699cfeeb8c9..cf2a08bfd5cd 100644
> --- a/arch/powerpc/include/asm/mmu.h
> +++ b/arch/powerpc/include/asm/mmu.h
> @@ -291,15 +291,6 @@ static inline bool early_radix_enabled(void)
>  }
>  #endif
>  
> -#ifdef CONFIG_PPC_MEM_KEYS
> -extern u16 get_mm_addr_key(struct mm_struct *mm, unsigned long address);
> -#else
> -static inline u16 get_mm_addr_key(struct mm_struct *mm, unsigned long 
> address)
> -{
> - return 0;
> -}
> -#endif /* CONFIG_PPC_MEM_KEYS */
> -
>  #ifdef CONFIG_STRICT_KERNEL_RWX
>  static inline bool strict_kernel_rwx_enabled(void)
>  {
> diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
> b/arch/powerpc/mm/book3s64/hash_utils.c
> index 523d4d39d11e..8530ddbba56f 100644
> --- a/arch/powerpc/mm/book3s64/hash_utils.c
> +++ b/arch/powerpc/mm/book3s64/hash_utils.c
> @@ -1670,30 +1670,6 @@ void update_mmu_cache(struct vm_area_struct *vma, 
> unsigned long address,
>   hash_preload(vma->vm_mm, address, is_exec, trap);
>  }
>  
> -#ifdef CONFIG_PPC_MEM_KEYS
> -/*
> - * Return the protection key associated with the given address and the
> - * mm_struct.
> - */
> -u16 get_mm_addr_key(struct mm_struct *mm, unsigned long address)
> -{
> - pte_t *ptep;
> - u16 pkey = 0;
> - unsigned long flags;
> -
> - if (!mm || !mm->pgd)
> - return 0;
> -
> - local_irq_save(flags);
> - ptep = find_linux_pte(mm->pgd, address, NULL, NULL);
> - if (ptep)
> - pkey = pte_to_pkey_bits(pte_val(READ_ONCE(*ptep)));
> - local_irq_restore(flags);
> -
> - return pkey;
> -}
> -#endif /* CONFIG_PPC_MEM_KEYS */
> -
>  #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
>  static inline void tm_flush_hash_page(int local)
>  {
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 8db0507619e2..ab99ffa7d946 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -118,9 +118,34 @@ static noinline int bad_area(struct pt_regs *regs, 
> unsigned long address)
>   return __bad_area(regs, address, SEGV_MAPERR);
>  }
>  
> -static int bad_key_fault_exception(struct pt_regs *regs, unsigned long 
> address,
> - int pkey)
> +#ifdef CONFIG_PPC_MEM_KEYS
> +static noinline int bad_access_pkey(struct pt_regs *regs, unsigned long 
> address,
> + struct vm_area_struct *vma)
>  {
> + struct mm_struct *mm = current->mm;
> + int pkey;
> +
> + /*
> +  * We don't try to fetch the pkey from page table because reading
> +  * page table without locking doesn't guarantee stable pte value.
> +  * Hence the pkey value that we return to userspace can be different
> +  * from the pkey that actually caused access error.
> +  *
> +  * It does *not* guarantee that the VMA we find here
> +  * was the one that we faulted on.
> +  *
> +  * 1. T1   : mprotect_key(foo, PAGE_SIZE, pkey=4);
> +  * 2. T1   : set AMR to deny access to pkey=4, touches, page
> +  * 3. T1   : faults...
> +  * 4.T2: mprotect_key(foo, PAGE_SIZ
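
A condensed sketch of the ordering Ram describes above, with helpers as
in the quoted code and a simplified DSISR test:

	/* On a key fault take the fast lockless page-table path;
	 * otherwise fall back to the VMA (slower, needs mmap_sem). */
	if (error_code & DSISR_KEYFAULT)
		pkey = get_mm_addr_key(mm, address);
	else if (vma)
		pkey = vma_pkey(vma);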

Re: [PATCH v4 03/16] powerpc: Use a datatype for instructions

2020-04-02 Thread Jordan Niethe
On Fri, Apr 3, 2020 at 10:45 AM Alistair Popple  wrote:
>
> On Thursday, 2 April 2020 10:52:37 AM AEDT Jordan Niethe wrote:
> > On Wed, Apr 1, 2020 at 9:32 PM Balamuruhan S  wrote:
> > > On Fri, 2020-03-20 at 16:17 +1100, Jordan Niethe wrote:
> > > > Currently unsigned ints are used to represent instructions on powerpc.
> > > > This has worked well as instructions have always been 4 byte words.
> > > > However, a future ISA version will introduce some changes to
> > > > instructions that mean this scheme will no longer work as well. This
> > > > change is Prefixed Instructions. A prefixed instruction is made up of a
> > > > word prefix followed by a word suffix to make an 8 byte double word
> > > > instruction. No matter the endianness of the system the prefix always
> > > > comes first. Prefixed instructions are only planned for powerpc64.
> > > >
> > > > Introduce a ppc_inst type to represent both prefixed and word
> > > > instructions on powerpc64 while keeping it possible to exclusively have
> > > > word instructions on powerpc32. A later patch will expand the type to
> > > > include prefixed instructions but for now just typedef it to a u32.
> > > >
> > > > Later patches will introduce helper functions and macros for
> > > > manipulating the instructions so that powerpc64 and powerpc32 might
> > > > maintain separate type definitions.
> > > >
> > > > Signed-off-by: Jordan Niethe 
> > > > ---
> > > >
> > > >  arch/powerpc/include/asm/code-patching.h | 31 +--
> > > >  arch/powerpc/include/asm/inst.h  | 53 +++
> > > >  arch/powerpc/include/asm/sstep.h |  5 +-
> > > >  arch/powerpc/kernel/align.c  |  2 +-
> > > >  arch/powerpc/kernel/hw_breakpoint.c  |  3 +-
> > > >  arch/powerpc/kernel/kprobes.c|  2 +-
> > > >  arch/powerpc/kernel/mce_power.c  |  5 +-
> > > >  arch/powerpc/kernel/optprobes.c  | 10 ++--
> > > >  arch/powerpc/kernel/trace/ftrace.c   | 66 
> > > >  arch/powerpc/kvm/emulate_loadstore.c |  1 +
> > > >  arch/powerpc/lib/code-patching.c | 54 +--
> > > >  arch/powerpc/lib/sstep.c |  4 +-
> > > >  arch/powerpc/lib/test_emulate_step.c |  9 ++--
> > > >  arch/powerpc/xmon/xmon.c | 12 ++---
> > > >  14 files changed, 160 insertions(+), 97 deletions(-)
> > > >  create mode 100644 arch/powerpc/include/asm/inst.h
> > > >
> > > > diff --git a/arch/powerpc/include/asm/code-patching.h
> > > > b/arch/powerpc/include/asm/code-patching.h
> > > > index 898b54262881..cb5106f92d67 100644
> > > > --- a/arch/powerpc/include/asm/code-patching.h
> > > > +++ b/arch/powerpc/include/asm/code-patching.h
> > > > @@ -11,6 +11,7 @@
> > > >
> > > >  #include 
> > > >  #include 
> > > >  #include 
> > > >
> > > > +#include 
> > > >
> > > >  /* Flags for create_branch:
> > > >   * "b"   == create_branch(addr, target, 0);
> > > >
> > > > @@ -22,27 +23,27 @@
> > > >
> > > >  #define BRANCH_ABSOLUTE  0x2
> > > >
> > > >  bool is_offset_in_branch_range(long offset);
> > > >
> > > > -unsigned int create_branch(const unsigned int *addr,
> > > > +ppc_inst create_branch(const ppc_inst *addr,
> > > >
> > > >  unsigned long target, int flags);
> > > >
> > > > -unsigned int create_cond_branch(const unsigned int *addr,
> > > > +unsigned int create_cond_branch(const ppc_inst *addr,
> > > >
> > > >   unsigned long target, int flags);
> > > >
> > > > -int patch_branch(unsigned int *addr, unsigned long target, int flags);
> > > > -int patch_instruction(unsigned int *addr, unsigned int instr);
> > > > -int raw_patch_instruction(unsigned int *addr, unsigned int instr);
> > > > +int patch_branch(ppc_inst *addr, unsigned long target, int flags);
> > > > +int patch_instruction(ppc_inst *addr, ppc_inst instr);
> > >
> > > we need to handle this change for its user in epapr_paravirt.c,
>
> Seeing a similar issue in kgdb.c:
>
> In file included from /linux/include/linux/kgdb.h:20,
>  from /linux/arch/powerpc/kernel/kgdb.c:18:
> /linux/arch/powerpc/kernel/kgdb.c: In function 'kgdb_arch_set_breakpoint':
> /linux/arch/powerpc/include/asm/kgdb.h:30:22: error: incompatible type for 
> argument 2 of 'patch_instruction'
>  #define BREAK_INSTR  0x7d821008 /* twge r2, r2 */
>   ^~
> /linux/arch/powerpc/kernel/kgdb.c:427:32: note: in expansion of macro 
> 'BREAK_INSTR'
>   err = patch_instruction(addr, BREAK_INSTR);
> ^~~
> In file included from /linux/arch/powerpc/kernel/kgdb.c:27:
> /linux/arch/powerpc/include/asm/code-patching.h:31:44: note: expected 
> 'ppc_inst' {aka 'struct ppc_inst'} but argument is of type 'int'
>  int patch_instruction(void *addr, ppc_inst instr);
>~^
> /linux/arch/powerpc/kernel/kgdb.c: In function 'kgdb_arch_remove_breakpoint':
> /linux/arch/powerpc/kernel/kgdb.c:442:32: er

Re: [PATCH 1/4] scsi: cxlflash: Access interrupt trigger page from xive directly

2020-04-02 Thread Matthew R. Ochs
On Thu, Apr 02, 2020 at 05:43:49PM +0200, Frederic Barrat wrote:
> xive is already mapping the trigger page in kernel space and it can be
> accessed through standard APIs, so let's reuse it and simplify the code.
> 
> Signed-off-by: Frederic Barrat 
> ---
>  drivers/scsi/cxlflash/ocxl_hw.c | 17 +++--
>  drivers/scsi/cxlflash/ocxl_hw.h |  1 -
>  2 files changed, 7 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/scsi/cxlflash/ocxl_hw.c b/drivers/scsi/cxlflash/ocxl_hw.c
> index 7018cd802569..59452850f71c 100644
> --- a/drivers/scsi/cxlflash/ocxl_hw.c
> +++ b/drivers/scsi/cxlflash/ocxl_hw.c
> @@ -15,7 +15,8 @@
>  #include 
>  #include 
>  #include 
> -
> +#include 
> +#include 
>  #include 
>  
>  #include 
> @@ -180,7 +181,7 @@ static int afu_map_irq(u64 flags, struct 
> ocxlflash_context *ctx, int num,
>   struct ocxl_hw_afu *afu = ctx->hw_afu;
>   struct device *dev = afu->dev;
>   struct ocxlflash_irqs *irq;
> - void __iomem *vtrig;
> + struct xive_irq_data *xd;
>   u32 virq;
>   int rc = 0;
>  
> @@ -204,15 +205,14 @@ static int afu_map_irq(u64 flags, struct 
> ocxlflash_context *ctx, int num,
>   goto err1;
>   }
>  
> - vtrig = ioremap(irq->ptrig, PAGE_SIZE);
> - if (unlikely(!vtrig)) {
> - dev_err(dev, "%s: Trigger page mapping failed\n", __func__);
> - rc = -ENOMEM;
> + xd = irq_get_handler_data(virq);
> + if (unlikely(!xd)) {
> + dev_err(dev, "%s: Can't get interrupt data\n", __func__);

The removal of setting the return code injects a bug should this error leg
ever be encountered. So we should either keep the rc assignment, e.g. -EINVAL,
-ENXIO, -ENODEV, etc., or remove this error leg. I lean towards keeping the
assignment.

>   goto err2;
>   }
>  
>   irq->virq = virq;
> - irq->vtrig = vtrig;
> + irq->vtrig = xd->trig_mmio;
>  out:
> 
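
Concretely, keeping the assignment would make the error leg look
something like this (-ENXIO picked arbitrarily for the sketch):

	xd = irq_get_handler_data(virq);
	if (unlikely(!xd)) {
		dev_err(dev, "%s: Can't get interrupt data\n", __func__);
		rc = -ENXIO;	/* keep a failure code for the caller */
		goto err2;
	}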


Re: [PATCH v5 1/4] powerpc/papr_scm: Fetch nvdimm health information from PHYP

2020-04-02 Thread Michael Ellerman
Vaibhav Jain  writes:
> Thanks for reviewing this patch Mpe,
> Michael Ellerman  writes:
>> Vaibhav Jain  writes:
...
>>
>>> +   /* Check for various masks in bitmap and set the buffer */
>>> +   if (health & PAPR_SCM_DIMM_UNARMED_MASK)
>>> +   rc += sprintf(buf, "not_armed ");
>>
>> I know buf is "big enough" but using sprintf() in 2020 is a bit ... :)
>>
>> seq_buf is a pretty thin wrapper over a buffer you can use to make this
>> cleaner and also handles overflow for you.
>>
>> See eg. show_user_instructions() for an example.
>
> Unfortunately seq_buf_printf() is still not an exported symbol and hence not
> usable from external modules.

Send a patch? :)

cheers
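
For anyone picking this up: the patch is essentially a one-line export
in lib/seq_buf.c, after which the driver-side conversion is
straightforward. A sketch (buffer size is a placeholder):

	/* lib/seq_buf.c */
	EXPORT_SYMBOL_GPL(seq_buf_printf);

	/* papr_scm.c: seq_buf tracks the write position and handles
	 * overflow instead of open-coded sprintf() offsets. */
	struct seq_buf s;

	seq_buf_init(&s, buf, PAGE_SIZE);
	if (health & PAPR_SCM_DIMM_UNARMED_MASK)
		seq_buf_printf(&s, "not_armed ");
	rc = seq_buf_used(&s);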



Re: [PATCH v4 03/16] powerpc: Use a datatype for instructions

2020-04-02 Thread Alistair Popple
On Thursday, 2 April 2020 10:52:37 AM AEDT Jordan Niethe wrote:
> On Wed, Apr 1, 2020 at 9:32 PM Balamuruhan S  wrote:
> > On Fri, 2020-03-20 at 16:17 +1100, Jordan Niethe wrote:
> > > Currently unsigned ints are used to represent instructions on powerpc.
> > > This has worked well as instructions have always been 4 byte words.
> > > However, a future ISA version will introduce some changes to
> > > instructions that mean this scheme will no longer work as well. This
> > > change is Prefixed Instructions. A prefixed instruction is made up of a
> > > word prefix followed by a word suffix to make an 8 byte double word
> > > instruction. No matter the endianness of the system the prefix always
> > > comes first. Prefixed instructions are only planned for powerpc64.
> > > 
> > > Introduce a ppc_inst type to represent both prefixed and word
> > > instructions on powerpc64 while keeping it possible to exclusively have
> > > word instructions on powerpc32. A later patch will expand the type to
> > > include prefixed instructions but for now just typedef it to a u32.
> > > 
> > > Later patches will introduce helper functions and macros for
> > > manipulating the instructions so that powerpc64 and powerpc32 might
> > > maintain separate type definitions.
> > > 
> > > Signed-off-by: Jordan Niethe 
> > > ---
> > > 
> > >  arch/powerpc/include/asm/code-patching.h | 31 +--
> > >  arch/powerpc/include/asm/inst.h  | 53 +++
> > >  arch/powerpc/include/asm/sstep.h |  5 +-
> > >  arch/powerpc/kernel/align.c  |  2 +-
> > >  arch/powerpc/kernel/hw_breakpoint.c  |  3 +-
> > >  arch/powerpc/kernel/kprobes.c|  2 +-
> > >  arch/powerpc/kernel/mce_power.c  |  5 +-
> > >  arch/powerpc/kernel/optprobes.c  | 10 ++--
> > >  arch/powerpc/kernel/trace/ftrace.c   | 66 
> > >  arch/powerpc/kvm/emulate_loadstore.c |  1 +
> > >  arch/powerpc/lib/code-patching.c | 54 +--
> > >  arch/powerpc/lib/sstep.c |  4 +-
> > >  arch/powerpc/lib/test_emulate_step.c |  9 ++--
> > >  arch/powerpc/xmon/xmon.c | 12 ++---
> > >  14 files changed, 160 insertions(+), 97 deletions(-)
> > >  create mode 100644 arch/powerpc/include/asm/inst.h
> > > 
> > > diff --git a/arch/powerpc/include/asm/code-patching.h
> > > b/arch/powerpc/include/asm/code-patching.h
> > > index 898b54262881..cb5106f92d67 100644
> > > --- a/arch/powerpc/include/asm/code-patching.h
> > > +++ b/arch/powerpc/include/asm/code-patching.h
> > > @@ -11,6 +11,7 @@
> > > 
> > >  #include 
> > >  #include 
> > >  #include 
> > > 
> > > +#include 
> > > 
> > >  /* Flags for create_branch:
> > >   * "b"   == create_branch(addr, target, 0);
> > > 
> > > @@ -22,27 +23,27 @@
> > > 
> > >  #define BRANCH_ABSOLUTE  0x2
> > >  
> > >  bool is_offset_in_branch_range(long offset);
> > > 
> > > -unsigned int create_branch(const unsigned int *addr,
> > > +ppc_inst create_branch(const ppc_inst *addr,
> > > 
> > >  unsigned long target, int flags);
> > > 
> > > -unsigned int create_cond_branch(const unsigned int *addr,
> > > +unsigned int create_cond_branch(const ppc_inst *addr,
> > > 
> > >   unsigned long target, int flags);
> > > 
> > > -int patch_branch(unsigned int *addr, unsigned long target, int flags);
> > > -int patch_instruction(unsigned int *addr, unsigned int instr);
> > > -int raw_patch_instruction(unsigned int *addr, unsigned int instr);
> > > +int patch_branch(ppc_inst *addr, unsigned long target, int flags);
> > > +int patch_instruction(ppc_inst *addr, ppc_inst instr);
> > 
> > we need to handle this change for its user in epapr_paravirt.c,

Seeing a similar issue in kgdb.c:

In file included from /linux/include/linux/kgdb.h:20,
 from /linux/arch/powerpc/kernel/kgdb.c:18:
/linux/arch/powerpc/kernel/kgdb.c: In function 'kgdb_arch_set_breakpoint':
/linux/arch/powerpc/include/asm/kgdb.h:30:22: error: incompatible type for 
argument 2 of 'patch_instruction'
 #define BREAK_INSTR  0x7d821008 /* twge r2, r2 */
  ^~
/linux/arch/powerpc/kernel/kgdb.c:427:32: note: in expansion of macro 
'BREAK_INSTR'
  err = patch_instruction(addr, BREAK_INSTR);
^~~
In file included from /linux/arch/powerpc/kernel/kgdb.c:27:
/linux/arch/powerpc/include/asm/code-patching.h:31:44: note: expected 
'ppc_inst' {aka 'struct ppc_inst'} but argument is of type 'int'
 int patch_instruction(void *addr, ppc_inst instr);
   ~^
/linux/arch/powerpc/kernel/kgdb.c: In function 'kgdb_arch_remove_breakpoint':
/linux/arch/powerpc/kernel/kgdb.c:442:32: error: incompatible type for argument 
2 of 'patch_instruction'
  err = patch_instruction(addr, instr);
^
In file included from /linux/arch/powerpc/kernel/kgdb.c:27:
/linux/arch/powerpc/include/asm/code-pat

Re: [PATCH v3 1/1] powerpc/kernel: Enables memory hot-remove after reboot on pseries guests

2020-04-02 Thread Leonardo Bras
On Fri, 2020-04-03 at 10:31 +1100, Oliver O'Halloran wrote:
> On Fri, Apr 3, 2020 at 10:07 AM Leonardo Bras  wrote:
> > Hello Oliver, thank you for the feedback.
> > Comments inline:
> > 
> > On Fri, 2020-04-03 at 09:46 +1100, Oliver O'Halloran wrote:
> > > I don't really understand why the flag is needed at all. According to
> > > PAPR any memory provided by dynamic reconfiguration can be hot-removed
> > > so why aren't we treating all DR memory as hot removable? The only
> > > memory guaranteed to be there 100% of the time is what's in the
> > > /memory@0 node since that's supposed to cover the real mode area.
> > 
> > All LMBs are listed in DR memory, even the base memory.
> > 
> > The v1 of the patch would work this way, as qemu would configure its
> > DR memory with (DRC_INVALID | RESERVED) flags and the hot-added memory
> > with (ASSIGNED) flag. Looking for assigned flag would be enough.
> > 
> > But as of today, PowerVM doesn't seem to work that way.
> > When you boot a PowerVM virtual machine with Linux, all memory is added
> > with the same flags (ASSIGNED).
> > 
> > To create a solution that doesn't break PowerVM, this new flag was made
> > necessary.
> 
> I'm still not convinced it's necessary. Why not check memory@0 and use
> the size as a clip level? Any memory above that level gets marked as
> hotpluggable and anything below doesn't. Seems to me that would work
> on all current platforms, so what am I missing here?
> 

Hmm, wouldn't that assume that all unmovable memory sits in the
first LMBs?

If we use the recently approved flag, there would be no such
limitation, and it would be possible to do some other things, like
hot-adding more unmovable memory for kernel usage.

What do you think?

Best regards,




Re: [PATCH v3 1/1] powerpc/kernel: Enables memory hot-remove after reboot on pseries guests

2020-04-02 Thread Oliver O'Halloran
On Fri, Apr 3, 2020 at 10:07 AM Leonardo Bras  wrote:
>
> Hello Oliver, thank you for the feedback.
> Comments inline:
>
> On Fri, 2020-04-03 at 09:46 +1100, Oliver O'Halloran wrote:
> >
> > I don't really understand why the flag is needed at all. According to
> > PAPR any memory provided by dynamic reconfiguration can be hot-removed
> > so why aren't we treating all DR memory as hot removable? The only
> > memory guaranteed to be there 100% of the time is what's in the
> > /memory@0 node since that's supposed to cover the real mode area.
>
> All LMBs are listed in DR memory, even the base memory.
>
> The v1 of the patch would work this way, as qemu would configure its
> DR memory with (DRC_INVALID | RESERVED) flags and the hot-added memory
> with (ASSIGNED) flag. Looking for assigned flag would be enough.
>
> But as of today, PowerVM doesn't seem to work that way.
> When you boot a PowerVM virtual machine with Linux, all memory is added
> with the same flags (ASSIGNED).
>
> To create a solution that doesn't break PowerVM, this new flag was made
> necessary.

I'm still not convinced it's necessary. Why not check memory@0 and use
the size as a clip level? Any memory above that level gets marked as
hotpluggable and anything below doesn't. Seems to me that would work
on all current platforms, so what am I missing here?

>
> Best regards,
> Leonardo Bras
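
A rough sketch of the clip-level idea, assuming a single (base, size)
pair of 2 cells each in /memory@0's "reg"; real code would honour
#address-cells/#size-cells:

	static u64 __init base_memory_top(void)
	{
		struct device_node *np = of_find_node_by_path("/memory@0");
		const __be32 *reg;
		u64 top = 0;

		if (np) {
			reg = of_get_property(np, "reg", NULL);
			if (reg)
				top = of_read_number(reg, 2) +
				      of_read_number(reg + 2, 2);
			of_node_put(np);
		}
		return top;
	}

	/* ...then in early_init_drmem_lmb(): */
	if (base >= base_memory_top())
		memblock_mark_hotplug(base, size);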


Re: [PATCH v3 1/1] powerpc/kernel: Enables memory hot-remove after reboot on pseries guests

2020-04-02 Thread Leonardo Bras
Hello Oliver, thank you for the feedback.
Comments inline:

On Fri, 2020-04-03 at 09:46 +1100, Oliver O'Halloran wrote:
> 
> I don't really understand why the flag is needed at all. According to
> PAPR any memory provided by dynamic reconfiguration can be hot-removed
> so why aren't we treating all DR memory as hot removable? The only
> memory guaranteed to be there 100% of the time is what's in the
> /memory@0 node since that's supposed to cover the real mode area.

All LMBs are listed in DR memory, even the base memory.

The v1 of the patch would work this way, as qemu would configure its
DR memory with (DRC_INVALID | RESERVED) flags and the hot-added memory
with (ASSIGNED) flag. Looking for assigned flag would be enough.

But as of today, PowerVM doesn't seem to work that way. 
When you boot a PowerVM virtual machine with Linux, all memory is added
with the same flags (ASSIGNED).

To create a solution that doesn't break PowerVM, this new flag was made
necessary.

Best regards,
Leonardo Bras




Re: [PATCH v8 2/7] powerpc/kprobes: Mark newly allocated probes as RO

2020-04-02 Thread Russell Currey
On Fri, 2020-04-03 at 00:18 +0530, Naveen N. Rao wrote:
> Naveen N. Rao wrote:
> > Russell Currey wrote:
> > > With CONFIG_STRICT_KERNEL_RWX=y and CONFIG_KPROBES=y, there will
> > > be one
> > > W+X page at boot by default.  This can be tested with
> > > CONFIG_PPC_PTDUMP=y and CONFIG_PPC_DEBUG_WX=y set, and checking
> > > the
> > > kernel log during boot.
> > > 
> > > powerpc doesn't implement its own alloc() for kprobes like other
> > > architectures do, but we couldn't immediately mark RO anyway
> > > since we do
> > > a memcpy to the page we allocate later.  After that, nothing
> > > should be
> > > allowed to modify the page, and write permissions are removed
> > > well
> > > before the kprobe is armed.
> > > 
> > > The memcpy() would fail if >1 probes were allocated, so use
> > > patch_instruction() instead which is safe for RO.
> > > 
> > > Reviewed-by: Daniel Axtens 
> > > Signed-off-by: Russell Currey 
> > > Signed-off-by: Christophe Leroy 
> > > ---
> > >  arch/powerpc/kernel/kprobes.c | 17 +
> > >  1 file changed, 13 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/arch/powerpc/kernel/kprobes.c
> > > b/arch/powerpc/kernel/kprobes.c
> > > index 81efb605113e..fa4502b4de35 100644
> > > --- a/arch/powerpc/kernel/kprobes.c
> > > +++ b/arch/powerpc/kernel/kprobes.c
> > > @@ -24,6 +24,8 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > > +#include 
> > >  
> > >  DEFINE_PER_CPU(struct kprobe *, current_kprobe) = NULL;
> > >  DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
> > > @@ -102,6 +104,16 @@ kprobe_opcode_t *kprobe_lookup_name(const
> > > char *name, unsigned int offset)
> > >   return addr;
> > >  }
> > >  
> > > +void *alloc_insn_page(void)
> > > +{
> > > + void *page = vmalloc_exec(PAGE_SIZE);
> > > +
> > > + if (page)
> > > + set_memory_ro((unsigned long)page, 1);
> > > +
> > > + return page;
> > > +}
> > > +
> > 
> > This crashes for me with KPROBES_SANITY_TEST during the kretprobe
> > test.  
> 
> That isn't needed to reproduce this. After bootup, disabling
> optprobes 
> also shows the crash with kretprobes:
>   sysctl debug.kprobes-optimization=0
> 
> The problem happens to be with patch_instruction() in 
> arch_prepare_kprobe(). During boot, on kprobe init, we register a
> probe 
> on kretprobe_trampoline for use with kretprobes (see 
> arch_init_kprobes()). This results in an instruction slot being 
> allocated, and arch_prepare_kprobe() being called to copy the
> instruction (nop) at kretprobe_trampoline. patch_instruction() is
> failing, resulting in a corrupt instruction which we try to
> emulate/single-step, causing the crash.

Thanks a lot for the info Naveen, I'll investigate.

- Russell

> 
> 
> - Naveen
> 



Re: [PATCH v3 1/1] powerpc/kernel: Enables memory hot-remove after reboot on pseries guests

2020-04-02 Thread Oliver O'Halloran
On Fri, Apr 3, 2020 at 6:55 AM Leonardo Bras  wrote:
>
> When providing guests, it's desirable to resize their memory on demand.
>
> Currently, it's possible to do so by creating a guest with a small base
> memory, hot-plugging all the rest, and using 'movable_node' kernel
> command-line parameter, which puts all hot-plugged memory in
> ZONE_MOVABLE, allowing it to be removed whenever needed.
>
> But there is an issue regarding guest reboot:
> If memory is hot-plugged, and then the guest is rebooted, all hot-plugged
> memory goes to ZONE_NORMAL, which offers no guaranteed hot-removal.
> It usually prevents this memory from being hot-removed from the guest.
>
> It's possible to use device-tree information to fix that behavior, as
> it stores flags for LMB ranges on ibm,dynamic-memory-vN.
> It involves marking each memblock with the correct flags as hotpluggable
> memory, which mm/memblock.c puts in ZONE_MOVABLE during boot if
> 'movable_node' is passed.

I don't really understand why the flag is needed at all. According to
PAPR any memory provided by dynamic reconfiguration can be hot-removed
so why aren't we treating all DR memory as hot removable? The only
memory guaranteed to be there 100% of the time is what's in the
/memory@0 node since that's supposed to cover the real mode area.


Re: [PATCH kernel] powerpc/pseries/ddw: Extend upper limit for huge DMA window for persistent memory

2020-04-02 Thread Wen Xiong
I applied the patch on top of the latest upstream kernel. I ran HTX over pmem nodes for several hours and it works.
 
Tested-by: Wen Xiong
 
Thanks,
Wendy
----- Original message -----
From: Alexey Kardashevskiy
To: linuxppc-dev@lists.ozlabs.org
Cc: Alexey Kardashevskiy , David Gibson , Michael Ellerman , Oliver O'Halloran , "Aneesh Kumar K . V" , Wen Xiong , Brian J King
Subject: [EXTERNAL] [PATCH kernel] powerpc/pseries/ddw: Extend upper limit for huge DMA window for persistent memory
Date: Mon, Mar 30, 2020 8:23 PM

Unlike normal memory ("memory" compatible type in the FDT),
the persistent memory ("ibm,pmemory" in the FDT) can be mapped anywhere
in the guest physical space and it can be used for DMA.

In order to maintain 1:1 mapping via the huge DMA window, we need to
know the maximum physical address at the time of the window setup.
So far we've been looking at "memory" nodes but "ibm,pmemory" does not
have fixed addresses and the persistent memory may be mapped afterwards.

Since the persistent memory is still backed with page structs,
use MAX_PHYSMEM_BITS as the upper limit.

This effectively disables huge DMA window in LPAR under pHyp if
persistent memory is present but this is the best we can do.

Signed-off-by: Alexey Kardashevskiy
---
 arch/powerpc/platforms/pseries/iommu.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 2e0a8eab5588..6d47b4a3ce39 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -945,6 +945,15 @@ static phys_addr_t ddw_memory_hotplug_max(void)
 	phys_addr_t max_addr = memory_hotplug_max();
 	struct device_node *memory;
 
+	/*
+	 * The "ibm,pmemory" can appear anywhere in the address space.
+	 * Assuming it is still backed by page structs, set the upper limit
+	 * for the huge DMA window as MAX_PHYSMEM_BITS.
+	 */
+	if (of_find_node_by_type(NULL, "ibm,pmemory"))
+		return (sizeof(phys_addr_t) * 8 <= MAX_PHYSMEM_BITS) ?
+			(phys_addr_t) -1 : (1ULL << MAX_PHYSMEM_BITS);
+
 	for_each_node_by_type(memory, "memory") {
 		unsigned long start, size;
 		int n_mem_addr_cells, n_mem_size_cells, len;
--
2.17.1
 



Re: [PATCH v8 0/7] powerpc/perf: Add json file metric support for the hv_24x7 socket/chip level events

2020-04-02 Thread Jiri Olsa
On Thu, Apr 02, 2020 at 02:03:33AM +0530, Kajol Jain wrote:
> Patchset adds json file metric support for the hv_24x7 socket/chip level
> events. "hv_24x7" pmu interface events need a system-dependent parameter
> like socket/chip/core. For example, hv_24x7 chip level events need the
> specific chip-id for which the data is requested to be added as part
> of the pmu event.
> 
> So, to enable JSON file support for the "hv_24x7" interface, the patchset
> reads the total number of sockets from sysfs under
> "/sys/devices/hv_24x7/interface/".
> 
> Second patch of the patchset adds expr_scanner_ctx object to hold user
> data for the expr scanner, which can be used to hold runtime parameter.
> 
> Patch 4 & 6 of the patchset handles perf tool plumbing needed to replace
> the "?" character in the metric expression to proper value and hv_24x7
> json metric file for different Socket/chip resources.
> 
> Patchset also enables Hz/hz printing for the --metric-only option to print
> metric data for bus frequency.
> 
> Applied and tested all these patches cleanly on top of jiri's flex changes
> with the changes done by Kan Liang for "Support metric group constraint"
> patchset and made required changes.
> 
> Also apply this patch on top of the fix patch sent earlier
> for printing metric name incase overlapping events.
> https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/commit/?h=perf/core&id=37cd7f65bf71a48f25eeb6d9be5dacb20d008ea6
> 
> Changelog:
> v7 -> v8
> - Add test case for testing parsing of "?" in metric expression
> - Rename variable names to runtime

Acked-by: Jiri Olsa 

thanks,
jirka



Re: WARNING in ext4_da_update_reserve_space

2020-04-02 Thread Murilo Opsfelder Araújo
On Thursday, April 2, 2020 8:02:11 AM -03 syzbot wrote:
> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit:1a147b74 Merge branch 'DSA-mtu'
> git tree:   net-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=14237713e0
> kernel config:  https://syzkaller.appspot.com/x/.config?x=46ee14d4915944bc
> dashboard link: https://syzkaller.appspot.com/bug?extid=67e4f16db666b1c8253c
> compiler:   gcc (GCC) 9.0.0 20181231 (experimental)
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=12237713e0
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=10ec7c97e0
>
> The bug was bisected to:
>
> commit 658b0f92bc7003bc734471f61bf7cd56339eb8c3
> Author: Murilo Opsfelder Araujo 
> Date:   Wed Aug 1 21:33:15 2018 +
>
> powerpc/traps: Print unhandled signals in a separate function

This commit is specific to powerpc and the crash is from an x86_64 system.

There are a bunch of scp errors in the logs:

scp: ./syz-executor998635077: No space left on device

Is it possible that these errors might be misleading the syzbot?

>
> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=15979f5be0
> final crash:https://syzkaller.appspot.com/x/report.txt?x=17979f5be0
> console output: https://syzkaller.appspot.com/x/log.txt?x=13979f5be0
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+67e4f16db666b1c82...@syzkaller.appspotmail.com
> Fixes: 658b0f92bc70 ("powerpc/traps: Print unhandled signals in a separate
> function")
>
> EXT4-fs warning (device sda1): ext4_da_update_reserve_space:344:
> ext4_da_update_reserve_space: ino 15722, used 1 with only 0 reserved data
> blocks [ cut here ]
> WARNING: CPU: 1 PID: 359 at fs/ext4/inode.c:348
> ext4_da_update_reserve_space+0x622/0x7d0 fs/ext4/inode.c:344 Kernel panic -
> not syncing: panic_on_warn set ...
> CPU: 1 PID: 359 Comm: kworker/u4:5 Not tainted 5.6.0-rc7-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011 Workqueue: writeback wb_workfn (flush-8:0)
> Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x188/0x20d lib/dump_stack.c:118
>  panic+0x2e3/0x75c kernel/panic.c:221
>  __warn.cold+0x2f/0x35 kernel/panic.c:582
>  report_bug+0x27b/0x2f0 lib/bug.c:195
>  fixup_bug arch/x86/kernel/traps.c:174 [inline]
>  fixup_bug arch/x86/kernel/traps.c:169 [inline]
>  do_error_trap+0x12b/0x220 arch/x86/kernel/traps.c:267
>  do_invalid_op+0x32/0x40 arch/x86/kernel/traps.c:286
>  invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1027
> RIP: 0010:ext4_da_update_reserve_space+0x622/0x7d0 fs/ext4/inode.c:348
> Code: 02 00 0f 85 94 01 00 00 48 8b 7d 28 49 c7 c0 20 72 3c 88 41 56 48 c7
> c1 80 60 3c 88 53 ba 58 01 00 00 4c 89 c6 e8 1e 6d 0d 00 <0f> 0b 48 b8 00
> 00 00 00 00 fc ff df 4c 89 ea 48 c1 ea 03 0f b6 04 RSP:
> 0018:c90002197288 EFLAGS: 00010296
> RAX:  RBX: 0001 RCX: 
> RDX:  RSI: 820bf066 RDI: f52000432e21
> RBP: 888086b744c8 R08: 0091 R09: ed1015ce6659
> R10: ed1015ce6658 R11: 8880ae7332c7 R12: 0001
> R13: 888086b74990 R14:  R15: 888086b74a40
>  ext4_ext_map_blocks+0x24aa/0x37d0 fs/ext4/extents.c:4500
>  ext4_map_blocks+0x4cb/0x1650 fs/ext4/inode.c:622
>  mpage_map_one_extent fs/ext4/inode.c:2365 [inline]
>  mpage_map_and_submit_extent fs/ext4/inode.c:2418 [inline]
>  ext4_writepages+0x19eb/0x3080 fs/ext4/inode.c:2772
>  do_writepages+0xfa/0x2a0 mm/page-writeback.c:2344
>  __writeback_single_inode+0x12a/0x1410 fs/fs-writeback.c:1452
>  writeback_sb_inodes+0x515/0xdd0 fs/fs-writeback.c:1716
>  wb_writeback+0x2a5/0xd90 fs/fs-writeback.c:1892
>  wb_do_writeback fs/fs-writeback.c:2037 [inline]
>  wb_workfn+0x339/0x11c0 fs/fs-writeback.c:2078
>  process_one_work+0x94b/0x1690 kernel/workqueue.c:2266
>  worker_thread+0x96/0xe20 kernel/workqueue.c:2412
>  kthread+0x357/0x430 kernel/kthread.c:255
>  ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
> Kernel Offset: disabled
> Rebooting in 86400 seconds..
>
>
> ---
> This bug is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkal...@googlegroups.com.
>
> syzbot will keep track of this bug report. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
> For information about bisection process see: https://goo.gl/tpsmEJ#bisection
> syzbot can test patches for this bug, for details see:
> https://goo.gl/tpsmEJ#testing-patches

--
Murilo


Re: [PATCH RESEND 1/4] uaccess: Add user_read_access_begin/end and user_write_access_begin/end

2020-04-02 Thread Linus Torvalds
On Thu, Apr 2, 2020 at 1:27 PM Kees Cook  wrote:
>
> I was just speaking to design principles in this area: if the "enable"
> is called when already enabled, Something Is Wrong. :)

Well, the "something is wrong" could easily be "the hardware does not
support this".

I'm not at all interested in the crazy code to do this in software.
Nobody sane should ever do that.

Yes, I realize that PaX did software emulation of things like that,
and it was one of the reasons why it was never useful to any normal
use.

Security is not an end goal in itself, it's always secondary to "can I
use this".

Security that means "normal people can't use this, it's only for the
special l33t users" is not security, it's garbage. That "do page
tables in software" was a prime example of garbage.

   Linus


Re: [PATCH v8 6/7] tools/perf: Enable Hz/hz printing for --metric-only option

2020-04-02 Thread Andi Kleen
> > diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c
> > index 9e757d18d713..679aaa655824 100644
> > --- a/tools/perf/util/stat-display.c
> > +++ b/tools/perf/util/stat-display.c
> > @@ -237,8 +237,6 @@ static bool valid_only_metric(const char *unit)
> > if (!unit)
> > return false;
> > if (strstr(unit, "/sec") ||
> > -   strstr(unit, "hz") ||
> > -   strstr(unit, "Hz") ||
> 
> will this change output of --metric-only for some setups then?
> 
> Andi, are you ok with this?

Yes should be ok

$ grep -i ScaleUnit.*hz arch/x86/*/* 
$ 

Probably was for some metric we later dropped.

-Andi


Re: [PATCH RESEND 1/4] uaccess: Add user_read_access_begin/end and user_write_access_begin/end

2020-04-02 Thread Kees Cook
On Thu, Apr 02, 2020 at 12:26:52PM -0700, Linus Torvalds wrote:
> On Thu, Apr 2, 2020 at 11:36 AM Kees Cook  wrote:
> >
> > Yup, I think it's a weakness of the ARM implementation and I'd like to
> > not extend it further. AFAIK we should never nest, but I would not be
> > surprised at all if we did.
> 
> Wel, at least the user_access_begin/end() sections can't nest. objtool
> verifies and warns about that on x86.

Right, yes, I mentioned that earlier in the thread. I meant I wasn't
100% sure about ARM's corner cases. I would _hope_ it doesn't.

> > If we were looking at a design goal for all architectures, I'd like
> > to be doing what the public PaX patchset
> 
> We already do better than PaX ever did. Seriously. Mainline has long
> since passed their hacky garbage.

I was just speaking to design principles in this area: if the "enable"
is called when already enabled, Something Is Wrong. :) (And one thing
still missing in this general subject is that x86 still lacks SMAP
emulation. And yes, I understand it's just not been a priority for anyone
that can work on it, but it is still a gap.)

-- 
Kees Cook


[PATCH v3 1/1] powerpc/kernel: Enables memory hot-remove after reboot on pseries guests

2020-04-02 Thread Leonardo Bras
When providing guests, it's desirable to resize their memory on demand.

Currently, it's possible to do so by creating a guest with a small base
memory, hot-plugging all the rest, and using 'movable_node' kernel
command-line parameter, which puts all hot-plugged memory in
ZONE_MOVABLE, allowing it to be removed whenever needed.

But there is an issue regarding guest reboot:
If memory is hot-plugged, and then the guest is rebooted, all hot-plugged
memory goes to ZONE_NORMAL, which offers no guaranteed hot-removal.
It usually prevents this memory from being hot-removed from the guest.

It's possible to use device-tree information to fix that behavior, as
it stores flags for LMB ranges on ibm,dynamic-memory-vN.
It involves marking each memblock with the correct flags as hotpluggable
memory, which mm/memblock.c puts in ZONE_MOVABLE during boot if
'movable_node' is passed.

For carrying such information, the new flag DRCONF_MEM_HOTREMOVABLE was
proposed and accepted into Power Architecture documentation.
This flag should be:
- true (b=1) if the hypervisor may want to hot-remove it later, and
- false (b=0) if it does not care.

During boot, guest kernel reads the device-tree, early_init_drmem_lmb()
is called for every added LMBs. Here, checking for this new flag and
marking memblocks as hotplugable memory is enough to get the desirable
behavior.

This should cause no change if 'movable_node' parameter is not passed
in kernel command-line.

Signed-off-by: Leonardo Bras 
Reviewed-by: Bharata B Rao 

---

Changes since v2:
- New flag name changed from DRCONF_MEM_HOTPLUGGED to
DRCONF_MEM_HOTREMOVABLE

Changes since v1:
- Adds new flag, so PowerVM is compatible with the change.
- Fixes mistakes in code
---
 arch/powerpc/include/asm/drmem.h | 1 +
 arch/powerpc/kernel/prom.c   | 9 +++--
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index 3d76e1c388c2..ad99e27e5b65 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -65,6 +65,7 @@ struct of_drconf_cell_v2 {
 #define DRCONF_MEM_ASSIGNED0x0008
 #define DRCONF_MEM_AI_INVALID  0x0040
 #define DRCONF_MEM_RESERVED0x0080
+#define DRCONF_MEM_HOTREMOVABLE0x0100
 
 static inline u32 drmem_lmb_size(void)
 {
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 6620f37abe73..abc9b04d03ce 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -515,9 +515,14 @@ static void __init early_init_drmem_lmb(struct drmem_lmb 
*lmb,
size = 0x8000ul - base;
}
 
+   if (!validate_mem_limit(base, &size))
+   continue;
+
DBG("Adding: %llx -> %llx\n", base, size);
-   if (validate_mem_limit(base, &size))
-   memblock_add(base, size);
+   memblock_add(base, size);
+
+   if (lmb->flags & DRCONF_MEM_HOTREMOVABLE)
+   memblock_mark_hotplug(base, size);
} while (--rngs);
 }
 #endif /* CONFIG_PPC_PSERIES */
-- 
2.25.1



Re: [PATCH v2] qtpm2: Export tpm2_get_cc_attrs_tbl for ibmvtpm driver as module

2020-04-02 Thread Jarkko Sakkinen
On Wed, Apr 01, 2020 at 02:40:30PM +0530, Sachin Sant wrote:
> 
> 
> > On 20-Mar-2020, at 1:27 AM, Jarkko Sakkinen 
> >  wrote:
> > 
> > On Wed, Mar 18, 2020 at 09:00:17PM -0400, Stefan Berger wrote:
> >> From: Stefan Berger 
> >> 
> >> This patch fixes the following problem when the ibmvtpm driver
> >> is built as a module:
> >> 
> >> ERROR: modpost: "tpm2_get_cc_attrs_tbl" [drivers/char/tpm/tpm_ibmvtpm.ko] 
> >> undefined!
> >> make[1]: *** [scripts/Makefile.modpost:94: __modpost] Error 1
> >> make: *** [Makefile:1298: modules] Error 2
> >> 
> >> Fixes: 18b3670d79ae ("tpm: ibmvtpm: Add support for TPM2")
> >> Signed-off-by: Stefan Berger 
> >> Reported-by: Sachin Sant 
> >> Tested-by: Sachin Sant 
> > 
> 
> Ping. This failure can now be seen in mainline (cad18da0af) as well.

It is in my tree.

/Jarkko


Re: [PATCH RESEND 1/4] uaccess: Add user_read_access_begin/end and user_write_access_begin/end

2020-04-02 Thread Linus Torvalds
On Thu, Apr 2, 2020 at 11:36 AM Kees Cook  wrote:
>
> Yup, I think it's a weakness of the ARM implementation and I'd like to
> not extend it further. AFAIK we should never nest, but I would not be
> surprised at all if we did.

Wel, at least the user_access_begin/end() sections can't nest. objtool
verifies and warns about that on x86.

> If we were looking at a design goal for all architectures, I'd like
> to be doing what the public PaX patchset

We already do better than PaX ever did. Seriously. Mainline has long
since passed their hacky garbage.

Plus PaX  and grsecurity should be actively shunned. Don't look at it,
don't use it, and tell everybody you know to not use that shit.

Linus


Re: [PATCH v8 2/7] powerpc/kprobes: Mark newly allocated probes as RO

2020-04-02 Thread Naveen N. Rao

Naveen N. Rao wrote:

Russell Currey wrote:

With CONFIG_STRICT_KERNEL_RWX=y and CONFIG_KPROBES=y, there will be one
W+X page at boot by default.  This can be tested with
CONFIG_PPC_PTDUMP=y and CONFIG_PPC_DEBUG_WX=y set, and checking the
kernel log during boot.

powerpc doesn't implement its own alloc() for kprobes like other
architectures do, but we couldn't immediately mark RO anyway since we do
a memcpy to the page we allocate later.  After that, nothing should be
allowed to modify the page, and write permissions are removed well
before the kprobe is armed.

The memcpy() would fail if >1 probes were allocated, so use
patch_instruction() instead which is safe for RO.

Reviewed-by: Daniel Axtens 
Signed-off-by: Russell Currey 
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/kprobes.c | 17 +
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index 81efb605113e..fa4502b4de35 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -24,6 +24,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 DEFINE_PER_CPU(struct kprobe *, current_kprobe) = NULL;

 DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
@@ -102,6 +104,16 @@ kprobe_opcode_t *kprobe_lookup_name(const char *name, 
unsigned int offset)
return addr;
 }
 
+void *alloc_insn_page(void)

+{
+   void *page = vmalloc_exec(PAGE_SIZE);
+
+   if (page)
+   set_memory_ro((unsigned long)page, 1);
+
+   return page;
+}
+


This crashes for me with KPROBES_SANITY_TEST during the kretprobe test.  


That isn't needed to reproduce this. After bootup, disabling optprobes 
also shows the crash with kretprobes:

sysctl debug.kprobes-optimization=0

The problem happens to be with patch_instruction() in 
arch_prepare_kprobe(). During boot, on kprobe init, we register a probe 
on kretprobe_trampoline for use with kretprobes (see 
arch_init_kprobes()). This results in an instruction slot being 
allocated, and arch_prepare_kprobe() being called to copy the
instruction (nop) at kretprobe_trampoline. patch_instruction() is
failing, resulting in a corrupt instruction which we try to
emulate/single-step, causing the crash.



- Naveen
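
Whatever the root cause turns out to be, one defensive improvement this
failure mode suggests is to stop ignoring the patch_instruction() return
value in arch_prepare_kprobe(). A sketch, with field names as in the
usual powerpc kprobes code:

	/* Fail kprobe registration instead of later single-stepping a
	 * slot that patch_instruction() could not write. */
	ret = patch_instruction(p->ainsn.insn, *p->addr);
	if (ret)
		return ret;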



Re: [PATCH RESEND 1/4] uaccess: Add user_read_access_begin/end and user_write_access_begin/end

2020-04-02 Thread Kees Cook
On Thu, Apr 02, 2020 at 06:50:32PM +0100, Al Viro wrote:
> On Thu, Apr 02, 2020 at 07:03:28PM +0200, Christophe Leroy wrote:
> 
> > user_access_begin() grants both read and write.
> > 
> > This patch adds user_read_access_begin() and user_write_access_begin() but
> > it doesn't remove user_access_begin()
> 
> Ouch...  So the most generic name is for the rarest case?
>  
> > > What should we do about that?  Do we prohibit such blocks outside
> > > of arch?
> > > 
> > > What should we do about arm and s390?  There we want a cookie passed
> > > from beginning of block to its end; should that be a return value?
> > 
> > That was the way I implemented it in January, see
> > https://patchwork.ozlabs.org/patch/1227926/
> > 
> > There was some discussion around that and most noticeable was:
> > 
> > H. Peter (hpa) said about it: "I have *deep* concern with carrying state in
> > a "key" variable: it's a direct attack vector for a crowbar attack,
> > especially since it is by definition live inside a user access region."
> 
> > This patch minimises the change by just adding user_read_access_begin() and
> > user_write_access_begin() keeping the same parameters as the existing
> > user_access_begin().
> 
> Umm...  What about the arm situation?  The same concerns would apply there,
> wouldn't they?  Currently we have
> static __always_inline unsigned int uaccess_save_and_enable(void)
> {
> #ifdef CONFIG_CPU_SW_DOMAIN_PAN
> unsigned int old_domain = get_domain();
> 
> /* Set the current domain access to permit user accesses */
> set_domain((old_domain & ~domain_mask(DOMAIN_USER)) |
>domain_val(DOMAIN_USER, DOMAIN_CLIENT));
> 
> return old_domain;
> #else
> return 0;
> #endif
> }
> and
> static __always_inline void uaccess_restore(unsigned int flags)
> {
> #ifdef CONFIG_CPU_SW_DOMAIN_PAN
> /* Restore the user access mask */
> set_domain(flags);
> #endif
> }
> 
> How much do we need nesting on those, anyway?  rmk?

Yup, I think it's a weakness of the ARM implementation and I'd like to
not extend it further. AFAIK we should never nest, but I would not be
surprised at all if we did.

If we were looking at a design goal for all architectures, I'd like
to be doing what the public PaX patchset did for their memory access
switching, which is to alarm if calling into "enable" found the access
already enabled, etc. Such a condition would show an unexpected nesting
(like we've seen with similar constructs with set_fs() not getting reset
during an exception handler, etc etc).
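
As a rough sketch of that idea (the names here are illustrative, not an
existing kernel API):

static DEFINE_PER_CPU(bool, uaccess_open);

static inline void checked_user_access_begin(void)
{
        /* alarm on unexpected nesting, as the public PaX patchset did */
        WARN_ON_ONCE(__this_cpu_read(uaccess_open));
        __this_cpu_write(uaccess_open, true);
        /* arch-specific "open user access" would go here */
}

static inline void checked_user_access_end(void)
{
        /* alarm on an unbalanced end, e.g. a missed reset in an exception path */
        WARN_ON_ONCE(!__this_cpu_read(uaccess_open));
        __this_cpu_write(uaccess_open, false);
        /* arch-specific "close user access" would go here */
}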

-- 
Kees Cook


Re: [PATCH RESEND 1/4] uaccess: Add user_read_access_begin/end and user_write_access_begin/end

2020-04-02 Thread Christophe Leroy




Le 02/04/2020 à 19:50, Al Viro a écrit :

On Thu, Apr 02, 2020 at 07:03:28PM +0200, Christophe Leroy wrote:


user_access_begin() grants both read and write.

This patch adds user_read_access_begin() and user_write_access_begin() but
it doesn't remove user_access_begin()


Ouch...  So the most generic name is for the rarest case?
  


I can add another patch at the end of the series to rename 
user_access_begin() to user_full_access_begin().


Would it suit you?

Christophe


Re: [PATCH RESEND 1/4] uaccess: Add user_read_access_begin/end and user_write_access_begin/end

2020-04-02 Thread Al Viro
On Thu, Apr 02, 2020 at 07:03:28PM +0200, Christophe Leroy wrote:

> user_access_begin() grants both read and write.
> 
> This patch adds user_read_access_begin() and user_write_access_begin() but
> it doesn't remove user_access_begin()

Ouch...  So the most generic name is for the rarest case?
 
> > What should we do about that?  Do we prohibit such blocks outside
> > of arch?
> > 
> > What should we do about arm and s390?  There we want a cookie passed
> > from beginning of block to its end; should that be a return value?
> 
> That was the way I implemented it in January, see
> https://patchwork.ozlabs.org/patch/1227926/
> 
> There was some discussion around that and most noticeable was:
> 
> H. Peter (hpa) said about it: "I have *deep* concern with carrying state in
> a "key" variable: it's a direct attack vector for a crowbar attack,
> especially since it is by definition live inside a user access region."

> This patch minimises the change by just adding user_read_access_begin() and
> user_write_access_begin() keeping the same parameters as the existing
> user_access_begin().

Umm...  What about the arm situation?  The same concerns would apply there,
wouldn't they?  Currently we have
static __always_inline unsigned int uaccess_save_and_enable(void)
{
#ifdef CONFIG_CPU_SW_DOMAIN_PAN
unsigned int old_domain = get_domain();

/* Set the current domain access to permit user accesses */
set_domain((old_domain & ~domain_mask(DOMAIN_USER)) |
   domain_val(DOMAIN_USER, DOMAIN_CLIENT));

return old_domain;
#else
return 0;
#endif
}
and
static __always_inline void uaccess_restore(unsigned int flags)
{
#ifdef CONFIG_CPU_SW_DOMAIN_PAN
/* Restore the user access mask */
set_domain(flags);
#endif
}

How much do we need nesting on those, anyway?  rmk?


Re: [PATCH RESEND 1/4] uaccess: Add user_read_access_begin/end and user_write_access_begin/end

2020-04-02 Thread Kees Cook
On Thu, Apr 02, 2020 at 07:03:28PM +0200, Christophe Leroy wrote:
> > What should we do about arm and s390?  There we want a cookie passed
> > from beginning of block to its end; should that be a return value?
> 
> That was the way I implemented it in January, see
> https://patchwork.ozlabs.org/patch/1227926/
> 
> There was some discussion around that and most noticeable was:
> 
> H. Peter (hpa) said about it: "I have *deep* concern with carrying state in
> a "key" variable: it's a direct attack vector for a crowbar attack,
> especially since it is by definition live inside a user access region."

I share this concern -- we want to keep user/kernel access as static as
possible. It should be provable with static analysis, etc (e.g. objtool
does this already for x86).

Since this doesn't disrupt existing R+W access, I'd prefer the design of
this series as-is.

-- 
Kees Cook


Re: [RFC PATCH v2 1/1] powerpc/kernel: Enables memory hot-remove after reboot on pseries guests

2020-04-02 Thread Leonardo Bras
Hello Bharata, thank you for reviewing and testing!

During review of this new flag, it was suggested to change its name to
a better one (from the platform's viewpoint). 

So I will have to change the flag name from DRCONF_MEM_HOTPLUGGED to
DRCONF_MEM_HOTREMOVABLE.

Everything should work the same as today.

Best regards,
Leonardo

On Thu, 2020-04-02 at 14:44 +0530, Bharata B Rao wrote:
> Looks good to me, also tested with PowerKVM guests.
> 
> Reviewed-by: Bharata B Rao 
> 
> Regards,
> Bharata.




Re: [PATCH RESEND 1/4] uaccess: Add user_read_access_begin/end and user_write_access_begin/end

2020-04-02 Thread Christophe Leroy




Le 02/04/2020 à 18:29, Al Viro a écrit :

On Thu, Apr 02, 2020 at 07:34:16AM +, Christophe Leroy wrote:

Some architectures like powerpc64 have the capability to separate
read access and write access protection.
For get_user() and copy_from_user(), powerpc64 only opens read access.
For put_user() and copy_to_user(), powerpc64 only opens write access.
But when using unsafe_get_user() or unsafe_put_user(),
user_access_begin opens both read and write.

Other architectures like powerpc book3s 32 bits only allow write
access protection. And on this architecture protection is a heavy
operation as it requires locking/unlocking per segment of 256Mbytes.
On those architectures it is therefore desirable to do the unlocking
only for write access. (Note that book3s/32 ranges from very old
powermacs from the 90's with the powerpc 601 processor, to modern
ADSL boxes with PowerQuicc II processors for instance, so it
is still worth considering.)

In order to avoid any risk of hacking some variable parameters
passed to user_access_begin/end that would allow leaving user access
open or opening too much, it is preferable to
use dedicated static functions that can't be overridden.

Add a user_read_access_begin and user_read_access_end to only open
read access.

Add a user_write_access_begin and user_write_access_end to only open
write access.

By default, when undefined, those new access helpers fall back to the
existing user_access_begin and user_access_end.
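
For illustration, that fallback can be sketched like this (an
assumption about the shape of the patch, not a quote from it):

#ifndef user_read_access_begin
#define user_read_access_begin(ptr, len)        user_access_begin(ptr, len)
#define user_read_access_end()                  user_access_end()
#endif
#ifndef user_write_access_begin
#define user_write_access_begin(ptr, len)       user_access_begin(ptr, len)
#define user_write_access_end()                 user_access_end()
#endif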


The only problem I have is that we'd better choose the calling
conventions that work for other architectures as well.

AFAICS, aside of ppc and x86 we have (at least) this:
arm:
unsigned int __ua_flags = uaccess_save_and_enable();
...
uaccess_restore(__ua_flags);
arm64:
uaccess_enable_not_uao();
...
uaccess_disable_not_uao();
riscv:
__enable_user_access();
...
__disable_user_access();
s390/mvc:
old_fs = enable_sacf_uaccess();
...
 disable_sacf_uaccess(old_fs);

arm64 and riscv are easy - they map well on what we have now.
The interesting ones are ppc, arm and s390.

You want to specify the kind of access; OK, but...  it's not just read
vs. write - there's read-write as well.  AFAICS, there are 3 users of
that:
* copy_in_user()
* arch_futex_atomic_op_inuser()
* futex_atomic_cmpxchg_inatomic()
The former is of dubious utility (all users outside of arch are in
the badly done compat code), but the other two are not going to go
away.


user_access_begin() grants both read and write.

This patch adds user_read_access_begin() and user_write_access_begin() 
but it doesn't remove user_access_begin()




What should we do about that?  Do we prohibit such blocks outside
of arch?

What should we do about arm and s390?  There we want a cookie passed
from beginning of block to its end; should that be a return value?


That was the way I implemented it in January, see 
https://patchwork.ozlabs.org/patch/1227926/


There was some discussion around that and most noticeable was:

H. Peter (hpa) said about it: "I have *deep* concern with carrying state 
in a "key" variable: it's a direct attack vector for a crowbar attack, 
especially since it is by definition live inside a user access region."




And at least on arm that thing nests (see e.g. __clear_user_memset()
there), so "stash that cookie in current->something" is not a solution...

Folks, let's sort that out while we still have few users of that
interface; changing the calling conventions later will be much harder.
Comments?



This patch minimises the change by just adding user_read_access_begin() 
and user_write_access_begin() keeping the same parameters as the 
existing user_access_begin().


So I can come back to a mix of this patch and the January version if it 
corresponds to everyone's view; it will also be a bit easier for powerpc 
(especially book3s/32). But that didn't seem to be the expected 
direction back when we discussed it in January.


Christophe


Re: [PATCH v2] sched/core: fix illegal RCU from offline CPUs

2020-04-02 Thread Paul E. McKenney
On Thu, Apr 02, 2020 at 12:19:54PM -0400, Qian Cai wrote:
> 
> 
> > On Apr 2, 2020, at 11:54 AM, Paul E. McKenney  wrote:
> > 
> > I do run this combination quite frequently, but only as part of
> > rcutorture, which might not be a representative workload.  For one thing,
> > it has a minimal userspace consisting only of a trivial init program.
> > I don't recall having ever seen this.  (I have seen one recent complaint
> > about an IPI being sent to an offline CPU, but I cannot prove that this
> > was not due to RCU bugs that I was chasing at the time.)
> 
> Yes, with a trivial init it is tough to hit, while running systemd should be 
> able to catch it, as it will use cgroups.

Not planning to add systemd to my rcutorture runs.  ;-)

Thanx, Paul


Re: [PATCH RESEND 1/4] uaccess: Add user_read_access_begin/end and user_write_access_begin/end

2020-04-02 Thread Al Viro
On Thu, Apr 02, 2020 at 07:34:16AM +, Christophe Leroy wrote:
> Some architectures like powerpc64 have the capability to separate
> read access and write access protection.
> For get_user() and copy_from_user(), powerpc64 only opens read access.
> For put_user() and copy_to_user(), powerpc64 only opens write access.
> But when using unsafe_get_user() or unsafe_put_user(),
> user_access_begin opens both read and write.
> 
> Other architectures like powerpc book3s 32 bits only allow write
> access protection. And on this architecture protection is a heavy
> operation as it requires locking/unlocking per segment of 256Mbytes.
> On those architectures it is therefore desirable to do the unlocking
> only for write access. (Note that book3s/32 ranges from very old
> powermacs from the 90's with the powerpc 601 processor, to modern
> ADSL boxes with PowerQuicc II processors for instance, so it
> is still worth considering.)
> 
> In order to avoid any risk of hacking some variable parameters
> passed to user_access_begin/end that would allow leaving user access
> open or opening too much, it is preferable to
> use dedicated static functions that can't be overridden.
> 
> Add a user_read_access_begin and user_read_access_end to only open
> read access.
> 
> Add a user_write_access_begin and user_write_access_end to only open
> write access.
> 
> By default, when undefined, those new access helpers fall back to the
> existing user_access_begin and user_access_end.

The only problem I have is that we'd better choose the calling
conventions that work for other architectures as well.

AFAICS, aside of ppc and x86 we have (at least) this:
arm:
unsigned int __ua_flags = uaccess_save_and_enable();
...
uaccess_restore(__ua_flags);
arm64:
uaccess_enable_not_uao();
...
uaccess_disable_not_uao();
riscv:
__enable_user_access();
...
__disable_user_access();
s390/mvc:
old_fs = enable_sacf_uaccess();
...
disable_sacf_uaccess(old_fs);

arm64 and riscv are easy - they map well on what we have now.
The interesting ones are ppc, arm and s390.

You want to specify the kind of access; OK, but...  it's not just read
vs. write - there's read-write as well.  AFAICS, there are 3 users of
that:
* copy_in_user()
* arch_futex_atomic_op_inuser()
* futex_atomic_cmpxchg_inatomic()
The former is of dubious utility (all users outside of arch are in
the badly done compat code), but the other two are not going to go
away.

What should we do about that?  Do we prohibit such blocks outside
of arch?

What should we do about arm and s390?  There we want a cookie passed
from beginning of block to its end; should that be a return value?

And at least on arm that thing nests (see e.g. __clear_user_memset()
there), so "stash that cookie in current->something" is not a solution...

Folks, let's sort that out while we still have few users of that
interface; changing the calling conventions later will be much harder.
Comments?


Re: [PATCH v2] sched/core: fix illegal RCU from offline CPUs

2020-04-02 Thread Qian Cai



> On Apr 2, 2020, at 11:54 AM, Paul E. McKenney  wrote:
> 
> I do run this combination quite frequently, but only as part of
> rcutorture, which might not be a representative workload.  For one thing,
> it has a minimal userspace consisting only of a trivial init program.
> I don't recall having ever seen this.  (I have seen one recent complaint
> about an IPI being sent to an offline CPU, but I cannot prove that this
> was not due to RCU bugs that I was chasing at the time.)

Yes, with a trivial init it is tough to hit, while running systemd should be 
able to catch it, as it will use cgroups.

Re: [PATCH v8 2/7] powerpc/kprobes: Mark newly allocated probes as RO

2020-04-02 Thread Naveen N. Rao

Russell Currey wrote:

With CONFIG_STRICT_KERNEL_RWX=y and CONFIG_KPROBES=y, there will be one
W+X page at boot by default.  This can be tested with
CONFIG_PPC_PTDUMP=y and CONFIG_PPC_DEBUG_WX=y set, and checking the
kernel log during boot.

powerpc doesn't implement its own alloc() for kprobes like other
architectures do, but we couldn't immediately mark RO anyway since we do
a memcpy to the page we allocate later.  After that, nothing should be
allowed to modify the page, and write permissions are removed well
before the kprobe is armed.

The memcpy() would fail if >1 probes were allocated, so use
patch_instruction() instead which is safe for RO.

Reviewed-by: Daniel Axtens 
Signed-off-by: Russell Currey 
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/kprobes.c | 17 +
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index 81efb605113e..fa4502b4de35 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -24,6 +24,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 DEFINE_PER_CPU(struct kprobe *, current_kprobe) = NULL;

 DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
@@ -102,6 +104,16 @@ kprobe_opcode_t *kprobe_lookup_name(const char *name, 
unsigned int offset)
return addr;
 }
 
+void *alloc_insn_page(void)
+{
+   void *page = vmalloc_exec(PAGE_SIZE);
+
+   if (page)
+   set_memory_ro((unsigned long)page, 1);
+
+   return page;
+}
+


This crashes for me with KPROBES_SANITY_TEST during the kretprobe test.  
It seems to be handling the kretprobe itself properly, but seems to 
crash on the return path. I haven't yet been able to work out what's 
going wrong.


[0.517880] Kprobe smoke test: started
[0.626680] Oops: Exception in kernel mode, sig: 4 [#1]
[0.626708] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
[0.626735] Modules linked in:
[0.626760] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
5.6.0-06592-g2e64694b9137 #51
[0.626795] NIP:  c0080002 LR: c021ce34 CTR: c021c5f8
[0.626829] REGS: c000fd3e3860 TRAP: 0e40   Not tainted  
(5.6.0-06592-g2e64694b9137)
[0.626862] MSR:  92089433   CR: 
28000284  XER: 0004
[0.626922] CFAR: c000ef20 IRQMASK: 0 
[0.626922] GPR00: c0052250 c000fd3e3af0 c2330200 2db8ad86 
[0.626922] GPR04:  c006ba3c 0800 c000fd3a 
[0.626922] GPR08:  aaab c000fd3a 4000 
[0.626922] GPR12: c021c5f0 c252 c0011790  
[0.626922] GPR16:     
[0.626922] GPR20:     
[0.626922] GPR24: c20034bc c12068b8 c2062e50 c000fd2319a0 
[0.626922] GPR28: c0f5ebb0  c21bc278 c2458540 
[0.627234] NIP [c0080002] 0xc0080002

[0.627264] LR [c021ce34] init_test_probes+0x424/0x560
[0.627291] Call Trace:
[0.627313] [c000fd3e3af0] [c021ce34] 
init_test_probes+0x424/0x560 (unreliable)
[0.627356] [c000fd3e3b90] [c202de2c] init_kprobes+0x1a8/0x1c8
[0.627392] [c000fd3e3c00] [c0011140] do_one_initcall+0x60/0x2b0
[0.627432] [c000fd3e3cd0] [c2004674] 
kernel_init_freeable+0x2e0/0x3a0
[0.627471] [c000fd3e3db0] [c00117ac] kernel_init+0x24/0x178
[0.627510] [c000fd3e3e20] [c000c7a8] 
ret_from_kernel_thread+0x5c/0x74
[0.627543] Instruction dump:
[0.627562]         
[0.627607]     <>    
[0.627660] ---[ end trace 964ab92781f5d99d ]---
[0.629607] 



- Naveen



Re: [PATCH v2] sched/core: fix illegal RCU from offline CPUs

2020-04-02 Thread Paul E. McKenney
On Thu, Apr 02, 2020 at 10:00:16AM -0400, Qian Cai wrote:
> 
> 
> > On Apr 2, 2020, at 7:24 AM, Michael Ellerman  wrote:
> > 
> > Qian Cai  writes:
> >> From: Peter Zijlstra 
> >> 
> >> In the CPU-offline process, it calls mmdrop() after idle entry and the
> >> subsequent call to cpuhp_report_idle_dead(). Once execution passes the
> >> call to rcu_report_dead(), RCU is ignoring the CPU, which results in
> >> lockdep complaining when mmdrop() uses RCU from either memcg or
> >> debugobjects below.
> >> 
> >> Fix it by cleaning up the active_mm state from BP instead. Every arch
> >> which has CONFIG_HOTPLUG_CPU should have already called idle_task_exit()
> >> from AP. The only exception is parisc because it switches them to
> >> &init_mm unconditionally (see smp_boot_one_cpu() and smp_cpu_init()),
> >> but the patch will still work there because it calls mmgrab(&init_mm) in
> >> smp_cpu_init() and then should call mmdrop(&init_mm) in finish_cpu().
> > 
> > Thanks for debugging this. How did you hit it in the first place?
> 
> Just repeatedly offline/online CPUs, which will eventually cause an idle
> thread's refcount to go to 0 and trigger __mmdrop(). Of course, it needs
> lockdep enabled (PROVE_RCU?) as well as some luck to hit the cgroup,
> workqueue or debugobjects code paths that call RCU.
> 
> > 
> > A link to the original thread would have helped me:
> > 
> >  https://lore.kernel.org/lkml/20200113190331.12788-1-...@lca.pw/
> > 
> >> WARNING: suspicious RCU usage
> >> -
> >> kernel/workqueue.c:710 RCU or wq_pool_mutex should be held!
> >> 
> >> other info that might help us debug this:
> >> 
> >> RCU used illegally from offline CPU!
> >> Call Trace:
> >> dump_stack+0xf4/0x164 (unreliable)
> >> lockdep_rcu_suspicious+0x140/0x164
> >> get_work_pool+0x110/0x150
> >> __queue_work+0x1bc/0xca0
> >> queue_work_on+0x114/0x120
> >> css_release+0x9c/0xc0
> >> percpu_ref_put_many+0x204/0x230
> >> free_pcp_prepare+0x264/0x570
> >> free_unref_page+0x38/0xf0
> >> __mmdrop+0x21c/0x2c0
> >> idle_task_exit+0x170/0x1b0
> >> pnv_smp_cpu_kill_self+0x38/0x2e0
> >> cpu_die+0x48/0x64
> >> arch_cpu_idle_dead+0x30/0x50
> >> do_idle+0x2f4/0x470
> >> cpu_startup_entry+0x38/0x40
> >> start_secondary+0x7a8/0xa80
> >> start_secondary_resume+0x10/0x14
> > 
> > Do we know when this started happening? ie. can we determine a Fixes
> > tag?
> 
> I don't know. I looked at some commits and it seems the code was like that
> even 10 years ago. It must be that nobody cares to run lockdep (PROVE_RCU?)
> with CPU hotplug very regularly.

I do run this combination quite frequently, but only as part of
rcutorture, which might not be a representative workload.  For one thing,
it has a minimal userspace consisting only of a trivial init program.
I don't recall having ever seen this.  (I have seen one recent complaint
about an IPI being sent to an offline CPU, but I cannot prove that this
was not due to RCU bugs that I was chasing at the time.)

Thanx, Paul

> >> 
> >> Signed-off-by: Qian Cai 
> >> ---
> >> arch/powerpc/platforms/powernv/smp.c |  1 -
> >> include/linux/sched/mm.h |  2 ++
> >> kernel/cpu.c | 18 +-
> >> kernel/sched/core.c  |  5 +++--
> >> 4 files changed, 22 insertions(+), 4 deletions(-)
> >> 
> >> diff --git a/arch/powerpc/platforms/powernv/smp.c 
> >> b/arch/powerpc/platforms/powernv/smp.c
> >> index 13e251699346..b2ba3e95bda7 100644
> >> --- a/arch/powerpc/platforms/powernv/smp.c
> >> +++ b/arch/powerpc/platforms/powernv/smp.c
> >> @@ -167,7 +167,6 @@ static void pnv_smp_cpu_kill_self(void)
> >>/* Standard hot unplug procedure */
> >> 
> >>idle_task_exit();
> >> -  current->active_mm = NULL; /* for sanity */
> > 
> > If I'm reading it right, we'll now be running with active_mm == init_mm
> > in the offline loop.
> > 
> > I guess that's fine, I can't think of any reason it would matter, and it
> > seems like we were NULL'ing it out just for paranoia's sake not because
> > of any actual problem.
> > 
> > Acked-by: Michael Ellerman  (powerpc)
> > 
> > 
> > cheers
> > 
> >> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> >> index c49257a3b510..a132d875d351 100644
> >> --- a/include/linux/sched/mm.h
> >> +++ b/include/linux/sched/mm.h
> >> @@ -49,6 +49,8 @@ static inline void mmdrop(struct mm_struct *mm)
> >>__mmdrop(mm);
> >> }
> >> 
> >> +void mmdrop(struct mm_struct *mm);
> >> +
> >> /*
> >>  * This has to be called after a get_task_mm()/mmget_not_zero()
> >>  * followed by taking the mmap_sem for writing before modifying the
> >> diff --git a/kernel/cpu.c b/kernel/cpu.c
> >> index 2371292f30b0..244d30544377 100644
> >> --- a/kernel/cpu.c
> >> +++ b/kernel/cpu.c
> >> @@ -3,6 +3,7 @@
> >>  *
> >>  * This code is licenced under the GPL.
> >>  */
> >> +#include 
> >> #include 
> >> #include 
> >> #include 
> >> @@ -564,6 +565,21 @@ static int bringup_

[PATCH 1/4] scsi: cxlflash: Access interrupt trigger page from xive directly

2020-04-02 Thread Frederic Barrat
xive is already mapping the trigger page in kernel space and it can be
accessed through standard APIs, so let's reuse it and simplify the code.

Signed-off-by: Frederic Barrat 
---
 drivers/scsi/cxlflash/ocxl_hw.c | 17 +++--
 drivers/scsi/cxlflash/ocxl_hw.h |  1 -
 2 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/drivers/scsi/cxlflash/ocxl_hw.c b/drivers/scsi/cxlflash/ocxl_hw.c
index 7018cd802569..59452850f71c 100644
--- a/drivers/scsi/cxlflash/ocxl_hw.c
+++ b/drivers/scsi/cxlflash/ocxl_hw.c
@@ -15,7 +15,8 @@
 #include 
 #include 
 #include 
-
+#include 
+#include 
 #include 
 
 #include 
@@ -180,7 +181,7 @@ static int afu_map_irq(u64 flags, struct ocxlflash_context 
*ctx, int num,
struct ocxl_hw_afu *afu = ctx->hw_afu;
struct device *dev = afu->dev;
struct ocxlflash_irqs *irq;
-   void __iomem *vtrig;
+   struct xive_irq_data *xd;
u32 virq;
int rc = 0;
 
@@ -204,15 +205,14 @@ static int afu_map_irq(u64 flags, struct 
ocxlflash_context *ctx, int num,
goto err1;
}
 
-   vtrig = ioremap(irq->ptrig, PAGE_SIZE);
-   if (unlikely(!vtrig)) {
-   dev_err(dev, "%s: Trigger page mapping failed\n", __func__);
-   rc = -ENOMEM;
+   xd = irq_get_handler_data(virq);
+   if (unlikely(!xd)) {
+   dev_err(dev, "%s: Can't get interrupt data\n", __func__);
goto err2;
}
 
irq->virq = virq;
-   irq->vtrig = vtrig;
+   irq->vtrig = xd->trig_mmio;
 out:
return rc;
 err2:
@@ -259,8 +259,6 @@ static void afu_unmap_irq(u64 flags, struct 
ocxlflash_context *ctx, int num,
}
 
irq = &ctx->irqs[num];
-   if (irq->vtrig)
-   iounmap(irq->vtrig);
 
if (irq_find_mapping(NULL, irq->hwirq)) {
free_irq(irq->virq, cookie);
@@ -648,7 +646,6 @@ static int alloc_afu_irqs(struct ocxlflash_context *ctx, 
int num)
}
 
irqs[i].hwirq = hwirq;
-   irqs[i].ptrig = addr;
}
 
ctx->irqs = irqs;
diff --git a/drivers/scsi/cxlflash/ocxl_hw.h b/drivers/scsi/cxlflash/ocxl_hw.h
index fc6ad4f985de..f2fe88816bea 100644
--- a/drivers/scsi/cxlflash/ocxl_hw.h
+++ b/drivers/scsi/cxlflash/ocxl_hw.h
@@ -13,7 +13,6 @@
 struct ocxlflash_irqs {
int hwirq;
u32 virq;
-   u64 ptrig;
void __iomem *vtrig;
 };
 
-- 
2.25.1



[PATCH 2/4] ocxl: Access interrupt trigger page from xive directly

2020-04-02 Thread Frederic Barrat
We can access the trigger page through standard APIs, so let's use
them and avoid saving it when allocating the interrupt. This will also
allow us to simplify allocation in a later patch.

Signed-off-by: Frederic Barrat 
---
 drivers/misc/ocxl/afu_irq.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/misc/ocxl/afu_irq.c b/drivers/misc/ocxl/afu_irq.c
index 70f8f1c3929d..b30ec0ef7be7 100644
--- a/drivers/misc/ocxl/afu_irq.c
+++ b/drivers/misc/ocxl/afu_irq.c
@@ -2,6 +2,7 @@
 // Copyright 2017 IBM Corp.
 #include 
 #include 
+#include 
 #include "ocxl_internal.h"
 #include "trace.h"
 
@@ -196,13 +197,16 @@ void ocxl_afu_irq_free_all(struct ocxl_context *ctx)
 
 u64 ocxl_afu_irq_get_addr(struct ocxl_context *ctx, int irq_id)
 {
+   struct xive_irq_data *xd;
struct afu_irq *irq;
u64 addr = 0;
 
mutex_lock(&ctx->irq_lock);
irq = idr_find(&ctx->irq_idr, irq_id);
-   if (irq)
-   addr = irq->trigger_page;
+   if (irq) {
+   xd = irq_get_handler_data(irq->virq);
+   addr = xd ? xd->trig_page : 0;
+   }
mutex_unlock(&ctx->irq_lock);
return addr;
 }
-- 
2.25.1



[PATCH 4/4] ocxl: Remove custom service to allocate interrupts

2020-04-02 Thread Frederic Barrat
We now allocate interrupts through xive directly.

Signed-off-by: Frederic Barrat 
---
 arch/powerpc/include/asm/pnv-ocxl.h   |  3 ---
 arch/powerpc/platforms/powernv/ocxl.c | 30 ---
 2 files changed, 33 deletions(-)

diff --git a/arch/powerpc/include/asm/pnv-ocxl.h 
b/arch/powerpc/include/asm/pnv-ocxl.h
index 7de82647e761..e90650328c9c 100644
--- a/arch/powerpc/include/asm/pnv-ocxl.h
+++ b/arch/powerpc/include/asm/pnv-ocxl.h
@@ -30,7 +30,4 @@ extern int pnv_ocxl_spa_setup(struct pci_dev *dev, void 
*spa_mem, int PE_mask,
 extern void pnv_ocxl_spa_release(void *platform_data);
 extern int pnv_ocxl_spa_remove_pe_from_cache(void *platform_data, int 
pe_handle);
 
-extern int pnv_ocxl_alloc_xive_irq(u32 *irq, u64 *trigger_addr);
-extern void pnv_ocxl_free_xive_irq(u32 irq);
-
 #endif /* _ASM_PNV_OCXL_H */
diff --git a/arch/powerpc/platforms/powernv/ocxl.c 
b/arch/powerpc/platforms/powernv/ocxl.c
index 8c65aacda9c8..ecdad219d704 100644
--- a/arch/powerpc/platforms/powernv/ocxl.c
+++ b/arch/powerpc/platforms/powernv/ocxl.c
@@ -2,7 +2,6 @@
 // Copyright 2017 IBM Corp.
 #include 
 #include 
-#include 
 #include 
 #include "pci.h"
 
@@ -484,32 +483,3 @@ int pnv_ocxl_spa_remove_pe_from_cache(void *platform_data, 
int pe_handle)
return rc;
 }
 EXPORT_SYMBOL_GPL(pnv_ocxl_spa_remove_pe_from_cache);
-
-int pnv_ocxl_alloc_xive_irq(u32 *irq, u64 *trigger_addr)
-{
-   __be64 flags, trigger_page;
-   s64 rc;
-   u32 hwirq;
-
-   hwirq = xive_native_alloc_irq();
-   if (!hwirq)
-   return -ENOENT;
-
-   rc = opal_xive_get_irq_info(hwirq, &flags, NULL, &trigger_page, NULL,
-   NULL);
-   if (rc || !trigger_page) {
-   xive_native_free_irq(hwirq);
-   return -ENOENT;
-   }
-   *irq = hwirq;
-   *trigger_addr = be64_to_cpu(trigger_page);
-   return 0;
-
-}
-EXPORT_SYMBOL_GPL(pnv_ocxl_alloc_xive_irq);
-
-void pnv_ocxl_free_xive_irq(u32 irq)
-{
-   xive_native_free_irq(irq);
-}
-EXPORT_SYMBOL_GPL(pnv_ocxl_free_xive_irq);
-- 
2.25.1



[PATCH 3/4] ocxl: Don't return trigger page when allocating an interrupt

2020-04-02 Thread Frederic Barrat
Existing users of ocxl_link_irq_alloc() have been converted to obtain
the trigger page of an interrupt through xive directly, so we no longer
need to return the trigger page when allocating an interrupt.

It also allows ocxl to use the xive native interface to allocate
interrupts, instead of its custom service.

Signed-off-by: Frederic Barrat 
---
 drivers/misc/ocxl/Kconfig   |  2 +-
 drivers/misc/ocxl/afu_irq.c |  4 +---
 drivers/misc/ocxl/link.c| 15 +++
 drivers/scsi/cxlflash/ocxl_hw.c |  3 +--
 include/misc/ocxl.h | 10 ++
 5 files changed, 12 insertions(+), 22 deletions(-)

diff --git a/drivers/misc/ocxl/Kconfig b/drivers/misc/ocxl/Kconfig
index 2d2266c1439e..e65773f5cf59 100644
--- a/drivers/misc/ocxl/Kconfig
+++ b/drivers/misc/ocxl/Kconfig
@@ -9,7 +9,7 @@ config OCXL_BASE
 
 config OCXL
tristate "OpenCAPI coherent accelerator support"
-   depends on PPC_POWERNV && PCI && EEH
+   depends on PPC_POWERNV && PCI && EEH && PPC_XIVE_NATIVE
select OCXL_BASE
select HOTPLUG_PCI_POWERNV
default m
diff --git a/drivers/misc/ocxl/afu_irq.c b/drivers/misc/ocxl/afu_irq.c
index b30ec0ef7be7..ecdcfae025b7 100644
--- a/drivers/misc/ocxl/afu_irq.c
+++ b/drivers/misc/ocxl/afu_irq.c
@@ -11,7 +11,6 @@ struct afu_irq {
int hw_irq;
unsigned int virq;
char *name;
-   u64 trigger_page;
irqreturn_t (*handler)(void *private);
void (*free_private)(void *private);
void *private;
@@ -125,8 +124,7 @@ int ocxl_afu_irq_alloc(struct ocxl_context *ctx, int 
*irq_id)
goto err_unlock;
}
 
-   rc = ocxl_link_irq_alloc(ctx->afu->fn->link, &irq->hw_irq,
-   &irq->trigger_page);
+   rc = ocxl_link_irq_alloc(ctx->afu->fn->link, &irq->hw_irq);
if (rc)
goto err_idr;
 
diff --git a/drivers/misc/ocxl/link.c b/drivers/misc/ocxl/link.c
index 58d111afd9f6..fd73d3bc0eb6 100644
--- a/drivers/misc/ocxl/link.c
+++ b/drivers/misc/ocxl/link.c
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "ocxl_internal.h"
 #include "trace.h"
@@ -682,23 +683,21 @@ int ocxl_link_remove_pe(void *link_handle, int pasid)
 }
 EXPORT_SYMBOL_GPL(ocxl_link_remove_pe);
 
-int ocxl_link_irq_alloc(void *link_handle, int *hw_irq, u64 *trigger_addr)
+int ocxl_link_irq_alloc(void *link_handle, int *hw_irq)
 {
struct ocxl_link *link = (struct ocxl_link *) link_handle;
-   int rc, irq;
-   u64 addr;
+   int irq;
 
if (atomic_dec_if_positive(&link->irq_available) < 0)
return -ENOSPC;
 
-   rc = pnv_ocxl_alloc_xive_irq(&irq, &addr);
-   if (rc) {
+   irq = xive_native_alloc_irq();
+   if (!irq) {
atomic_inc(&link->irq_available);
-   return rc;
+   return -ENXIO;
}
 
*hw_irq = irq;
-   *trigger_addr = addr;
return 0;
 }
 EXPORT_SYMBOL_GPL(ocxl_link_irq_alloc);
@@ -707,7 +706,7 @@ void ocxl_link_free_irq(void *link_handle, int hw_irq)
 {
struct ocxl_link *link = (struct ocxl_link *) link_handle;
 
-   pnv_ocxl_free_xive_irq(hw_irq);
+   xive_native_free_irq(hw_irq);
atomic_inc(&link->irq_available);
 }
 EXPORT_SYMBOL_GPL(ocxl_link_free_irq);
diff --git a/drivers/scsi/cxlflash/ocxl_hw.c b/drivers/scsi/cxlflash/ocxl_hw.c
index 59452850f71c..03bff0cae658 100644
--- a/drivers/scsi/cxlflash/ocxl_hw.c
+++ b/drivers/scsi/cxlflash/ocxl_hw.c
@@ -613,7 +613,6 @@ static int alloc_afu_irqs(struct ocxlflash_context *ctx, 
int num)
struct ocxl_hw_afu *afu = ctx->hw_afu;
struct device *dev = afu->dev;
struct ocxlflash_irqs *irqs;
-   u64 addr;
int rc = 0;
int hwirq;
int i;
@@ -638,7 +637,7 @@ static int alloc_afu_irqs(struct ocxlflash_context *ctx, 
int num)
}
 
for (i = 0; i < num; i++) {
-   rc = ocxl_link_irq_alloc(afu->link_token, &hwirq, &addr);
+   rc = ocxl_link_irq_alloc(afu->link_token, &hwirq);
if (unlikely(rc)) {
dev_err(dev, "%s: ocxl_link_irq_alloc failed rc=%d\n",
__func__, rc);
diff --git a/include/misc/ocxl.h b/include/misc/ocxl.h
index 06dd5839e438..a2868adec22f 100644
--- a/include/misc/ocxl.h
+++ b/include/misc/ocxl.h
@@ -480,14 +480,8 @@ int ocxl_link_remove_pe(void *link_handle, int pasid);
  * Allocate an AFU interrupt associated to the link.
  *
  * 'hw_irq' is the hardware interrupt number
- * 'obj_handle' is the 64-bit object handle to be passed to the AFU to
- * trigger the interrupt.
- * On P9, 'obj_handle' is an address, which, if written, triggers the
- * interrupt. It is an MMIO address which needs to be remapped (one
- * page).
- */
-int ocxl_link_irq_alloc(void *link_handle, int *hw_irq,
-   u64 *obj_handle);
+ */
+int ocxl_link_irq_alloc(void *link_handle, int *hw_irq);
 
 /*
  * Free a

[PATCH 0/4] ocxl: Cleanup AFU interrupt allocation

2020-04-02 Thread Frederic Barrat
Short series to cleanup AFU interrupt allocation for opencapi.
Current code was using its own allocation service, calling opal
directly to get the trigger page. This is not needed and we can use
xive to achieve the same thing. The only caveat is that the trigger
page address is only valid after the interrupt has been mapped, but
that is not a problem with the way the code is using it.
No functional change.
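
That caveat can be pictured as follows (a sketch with error handling
simplified; it assumes the xive helpers used in the patches below):

static int get_trigger_mmio(struct irq_domain *domain, u32 hwirq,
                            void __iomem **trig)
{
        struct xive_irq_data *xd;
        unsigned int virq = irq_create_mapping(domain, hwirq);

        if (!virq)
                return -ENXIO;
        /* the trigger page is only valid once the interrupt is mapped */
        xd = irq_get_handler_data(virq);
        if (!xd)
                return -ENXIO;
        *trig = xd->trig_mmio;
        return 0;
}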

Frederic Barrat (4):
  scsi: cxlflash: Access interrupt trigger page from xive directly
  ocxl: Access interrupt trigger page from xive directly
  ocxl: Don't return trigger page when allocating an interrupt
  ocxl: Remove custom service to allocate interrupts

 arch/powerpc/include/asm/pnv-ocxl.h   |  3 ---
 arch/powerpc/platforms/powernv/ocxl.c | 30 ---
 drivers/misc/ocxl/Kconfig |  2 +-
 drivers/misc/ocxl/afu_irq.c   | 12 ++-
 drivers/misc/ocxl/link.c  | 15 +++---
 drivers/scsi/cxlflash/ocxl_hw.c   | 20 +++---
 drivers/scsi/cxlflash/ocxl_hw.h   |  1 -
 include/misc/ocxl.h   | 10 ++---
 8 files changed, 25 insertions(+), 68 deletions(-)

-- 
2.25.1



[PATCH v4 3/3] powerpc test_emulate_step: add testcases for divde[.] and divdeu[.] instructions

2020-04-02 Thread Balamuruhan S
add testcases for divde, divde., divdeu, divdeu. emulated
instructions to cover a few scenarios (a rough C model of the
instruction semantics is sketched after the list):
* with the same dividend and divisor to have undefined RT
  for divdeu[.]
* with divide by zero to have undefined RT for both
  divde[.] and divdeu[.]
* with a negative dividend to cover -|divisor| < r <= 0
  for divde[.]
* normal case with proper dividend and divisor for both
  divde[.] and divdeu[.]
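
For reference, a model of the defined cases (an illustration relying on
the GCC __int128 extension, not kernel code): the dividend is RA
concatenated with 64 zero bits, and the result is undefined on overflow
or division by zero, which the scenarios above probe.

static inline long divde_model(long ra, long rb)
{
        /* signed extended divide: (RA:0) / RB, truncated to 64 bits */
        return (long)(((__int128)ra << 64) / rb);
}

static inline unsigned long divdeu_model(unsigned long ra, unsigned long rb)
{
        /* unsigned variant; undefined when ra >= rb or rb == 0 */
        return (unsigned long)(((unsigned __int128)ra << 64) / rb);
}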

Reviewed-by: Sandipan Das 
Signed-off-by: Balamuruhan S 
Acked-by: Naveen N. Rao 
---
 arch/powerpc/lib/test_emulate_step.c | 164 +++
 1 file changed, 164 insertions(+)

diff --git a/arch/powerpc/lib/test_emulate_step.c 
b/arch/powerpc/lib/test_emulate_step.c
index 53df4146dd32..eb1dea47a637 100644
--- a/arch/powerpc/lib/test_emulate_step.c
+++ b/arch/powerpc/lib/test_emulate_step.c
@@ -54,6 +54,14 @@
___PPC_RA(a) | ___PPC_RB(b))
 #define TEST_ADDC_DOT(t, a, b) (PPC_INST_ADDC | ___PPC_RT(t) | \
___PPC_RA(a) | ___PPC_RB(b) | 0x1)
+#define TEST_DIVDE(t, a, b)(PPC_INST_DIVDE | ___PPC_RT(t) |\
+   ___PPC_RA(a) | ___PPC_RB(b))
+#define TEST_DIVDE_DOT(t, a, b)(PPC_INST_DIVDE | ___PPC_RT(t) |   \
+   ___PPC_RA(a) | ___PPC_RB(b) | 0x1)
+#define TEST_DIVDEU(t, a, b)   (PPC_INST_DIVDEU | ___PPC_RT(t) |   \
+   ___PPC_RA(a) | ___PPC_RB(b))
+#define TEST_DIVDEU_DOT(t, a, b)(PPC_INST_DIVDEU | ___PPC_RT(t) |  \
+   ___PPC_RA(a) | ___PPC_RB(b) | 0x1)
 
 #define MAX_SUBTESTS   16
 
@@ -838,6 +846,162 @@ static struct compute_test compute_tests[] = {
}
}
}
+   },
+   {
+   .mnemonic = "divde",
+   .subtests = {
+   {
+   .descr = "RA = LONG_MIN, RB = LONG_MIN",
+   .instr = TEST_DIVDE(20, 21, 22),
+   .regs = {
+   .gpr[21] = LONG_MIN,
+   .gpr[22] = LONG_MIN,
+   }
+   },
+   {
+   .descr = "RA = 1L, RB = 0",
+   .instr = TEST_DIVDE(20, 21, 22),
+   .flags = IGNORE_GPR(20),
+   .regs = {
+   .gpr[21] = 1L,
+   .gpr[22] = 0,
+   }
+   },
+   {
+   .descr = "RA = LONG_MIN, RB = LONG_MAX",
+   .instr = TEST_DIVDE(20, 21, 22),
+   .regs = {
+   .gpr[21] = LONG_MIN,
+   .gpr[22] = LONG_MAX,
+   }
+   }
+   }
+   },
+   {
+   .mnemonic = "divde.",
+   .subtests = {
+   {
+   .descr = "RA = LONG_MIN, RB = LONG_MIN",
+   .instr = TEST_DIVDE_DOT(20, 21, 22),
+   .regs = {
+   .gpr[21] = LONG_MIN,
+   .gpr[22] = LONG_MIN,
+   }
+   },
+   {
+   .descr = "RA = 1L, RB = 0",
+   .instr = TEST_DIVDE_DOT(20, 21, 22),
+   .flags = IGNORE_GPR(20),
+   .regs = {
+   .gpr[21] = 1L,
+   .gpr[22] = 0,
+   }
+   },
+   {
+   .descr = "RA = LONG_MIN, RB = LONG_MAX",
+   .instr = TEST_DIVDE_DOT(20, 21, 22),
+   .regs = {
+   .gpr[21] = LONG_MIN,
+   .gpr[22] = LONG_MAX,
+   }
+   }
+   }
+   },
+   {
+   .mnemonic = "divdeu",
+   .subtests = {
+   {
+   .descr = "RA = LONG_MIN, RB = LONG_MIN",
+   .instr = TEST_DIVDEU(20, 21, 22),
+   .flags = IGNORE_GPR(20),
+   .regs = {
+   .gpr[21] = LONG_MIN,
+   .gpr[22] = LONG_MIN,
+   

[PATCH v4 2/3] powerpc sstep: add support for divde[.] and divdeu[.] instructions

2020-04-02 Thread Balamuruhan S
This patch adds emulation support for divde, divdeu instructions,
* Divide Doubleword Extended (divde[.])
* Divide Doubleword Extended Unsigned (divdeu[.])

Reviewed-by: Sandipan Das 
Signed-off-by: Balamuruhan S 
Acked-by: Naveen N. Rao 
---
 arch/powerpc/lib/sstep.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
index 5f3a7bd9d90d..c9036a75730c 100644
--- a/arch/powerpc/lib/sstep.c
+++ b/arch/powerpc/lib/sstep.c
@@ -1736,7 +1736,18 @@ int analyse_instr(struct instruction_op *op, const 
struct pt_regs *regs,
op->val = (int) regs->gpr[ra] /
(int) regs->gpr[rb];
goto arith_done;
-
+#ifdef __powerpc64__
+   case 425:   /* divde[.] */
+   asm volatile(PPC_DIVDE(%0, %1, %2) :
+   "=r" (op->val) : "r" (regs->gpr[ra]),
+   "r" (regs->gpr[rb]));
+   goto arith_done;
+   case 393:   /* divdeu[.] */
+   asm volatile(PPC_DIVDEU(%0, %1, %2) :
+   "=r" (op->val) : "r" (regs->gpr[ra]),
+   "r" (regs->gpr[rb]));
+   goto arith_done;
+#endif
case 755:   /* darn */
if (!cpu_has_feature(CPU_FTR_ARCH_300))
return -1;
-- 
2.24.1



[PATCH v4 1/3] powerpc ppc-opcode: add divde and divdeu opcodes

2020-04-02 Thread Balamuruhan S
include instruction opcodes for divde and divdeu as macros.

Reviewed-by: Sandipan Das 
Signed-off-by: Balamuruhan S 
Acked-by: Naveen N. Rao 
---
 arch/powerpc/include/asm/ppc-opcode.h | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/include/asm/ppc-opcode.h 
b/arch/powerpc/include/asm/ppc-opcode.h
index c1df75edde44..9c9a604f30a6 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -339,6 +339,8 @@
 #define PPC_INST_DIVWU 0x7c000396
 #define PPC_INST_DIVD  0x7c0003d2
 #define PPC_INST_DIVDU 0x7c000392
+#define PPC_INST_DIVDE 0x7c000352
+#define PPC_INST_DIVDEU0x7c000312
 #define PPC_INST_RLWINM0x5400
 #define PPC_INST_RLWINM_DOT0x5401
 #define PPC_INST_RLWIMI0x5000
@@ -439,6 +441,12 @@
__PPC_RA(a) | __PPC_RB(b))
 #definePPC_DCBZL(a, b) stringify_in_c(.long PPC_INST_DCBZL | \
__PPC_RA(a) | __PPC_RB(b))
+#define PPC_DIVDE(t, a, b) stringify_in_c(.long PPC_INST_DIVDE | \
+   ___PPC_RT(t) | ___PPC_RA(a) | \
+   ___PPC_RB(b))
+#define PPC_DIVDEU(t, a, b)stringify_in_c(.long PPC_INST_DIVDEU| \
+   ___PPC_RT(t) | ___PPC_RA(a) | \
+   ___PPC_RB(b))
 #define PPC_LQARX(t, a, b, eh) stringify_in_c(.long PPC_INST_LQARX | \
___PPC_RT(t) | ___PPC_RA(a) | \
___PPC_RB(b) | __PPC_EH(eh))
-- 
2.24.1



[PATCH v4 0/3] Add support for divde[.] and divdeu[.] instruction emulation

2020-04-02 Thread Balamuruhan S
Hi All,

This patchset adds support to emulate divde, divde., divdeu and divdeu.
instructions and testcases for it.

Changes in v4:
-
Fix review comments from Naveen:
* use TEST_DIVDEU() instead of the wrongly used TEST_DIVDEU_DOT() in
  the divdeu testcase.
* Include `acked-by` tag from Naveen for the series.
* Rebase it on latest mpe's merge tree.

Changes in v3:
-
* Fix suggestion from Sandipan to remove `PPC_INST_DIVDE_DOT` and
  `PPC_INST_DIVDEU_DOT` opcode macros defined in ppc-opcode.h, reuse
  `PPC_INST_DIVDE` and `PPC_INST_DIVDEU` in test_emulate_step.c to
  derive them respectively.

Changes in v2:
-
* Fix review comments from Paul to make divde_dot and divdeu_dot simple
  by using divde and divdeu, then goto `arith_done` instead of
  `compute_done`.
* Include `Reviewed-by` tag from Sandipan Das.
* Rebase with recent mpe's merge tree.

I would request for your review and suggestions for making it better.

Boot Log:

:: ::
:: ::
[2.777518] emulate_step_test: divde  : RA = LONG_MIN, RB = LONG_MIN 
PASS
[2.777882] emulate_step_test: divde  : RA = 1L, RB = 0  
PASS
[2.778432] emulate_step_test: divde  : RA = LONG_MIN, RB = LONG_MAX 
PASS
[2.778880] emulate_step_test: divde. : RA = LONG_MIN, RB = LONG_MIN 
PASS
[2.780172] emulate_step_test: divde. : RA = 1L, RB = 0  
PASS
[2.780582] emulate_step_test: divde. : RA = LONG_MIN, RB = LONG_MAX 
PASS
[2.780983] emulate_step_test: divdeu : RA = LONG_MIN, RB = LONG_MIN 
PASS
[2.781276] emulate_step_test: divdeu : RA = 1L, RB = 0  
PASS
[2.781579] emulate_step_test: divdeu : RA = LONG_MIN, RB = LONG_MAX 
PASS
[2.781820] emulate_step_test: divdeu : RA = LONG_MAX - 1, RB = 
LONG_MAX PASS
[2.782056] emulate_step_test: divdeu : RA = LONG_MIN + 1, RB = 
LONG_MIN PASS
[2.782296] emulate_step_test: divdeu.: RA = LONG_MIN, RB = LONG_MIN 
PASS
[2.782556] emulate_step_test: divdeu.: RA = 1L, RB = 0  
PASS
[2.783502] emulate_step_test: divdeu.: RA = LONG_MIN, RB = LONG_MAX 
PASS
[2.783748] emulate_step_test: divdeu.: RA = LONG_MAX - 1, RB = 
LONG_MAX PASS
[2.783973] emulate_step_test: divdeu.: RA = LONG_MIN + 1, RB = 
LONG_MIN PASS
[2.789617] registered taskstats version 1
[2.794779] printk: console [netcon0] enabled
[2.794931] netconsole: network logging started
[2.795327] hctosys: unable to open rtc device (rtc0)
[2.953449] Freeing unused kernel memory: 5120K
[2.953639] This architecture does not have kernel memory protection.
[2.953918] Run /init as init process
[3.173573] mount (54) used greatest stack depth: 12576 bytes left
[3.252465] mount (55) used greatest stack depth: 12544 bytes left

Welcome to Buildroot
buildroot login:

Balamuruhan S (3):
  powerpc ppc-opcode: add divde and divdeu opcodes
  powerpc sstep: add support for divde[.] and divdeu[.] instructions
  powerpc test_emulate_step: add testcases for divde[.] and divdeu[.]
instructions

 arch/powerpc/include/asm/ppc-opcode.h |   8 ++
 arch/powerpc/lib/sstep.c  |  13 +-
 arch/powerpc/lib/test_emulate_step.c  | 164 ++
 3 files changed, 184 insertions(+), 1 deletion(-)


base-commit: 9c17a47d827c00401c2ff2ba71d7ecf08ed5b677
-- 
2.24.1



RE: [PATCH 0/6] Kill setup_irq()

2020-04-02 Thread Brian Cain
> -Original Message-
> From: linux-hexagon-ow...@vger.kernel.org On Behalf Of afzal mohammed
...
> On Fri, Mar 27, 2020 at 09:48:38PM -0500, Brian Cain wrote:
> 
> > > Note 2: hexagon final image creation fails even w/o my patch
> 
> > What's the nature of the failure in "Note 2"?
> 
> drivers/base/firmware_loader/main.o: In function `fw_is_builtin_firmware':
> /devel/src/kernel6/drivers/base/firmware_loader/main.c:132:(.text+0xc8):
> relocation truncated to fit: R_HEX_16_X against symbol
`__start_builtin_fw'
> defined in .modinfo section in .tmp_vmlinux1
> Makefile:1077: recipe for target 'vmlinux' failed
> make: *** [vmlinux] Error 1

Thanks for reporting it -- I will make a patch to fix it.

-Brian


Re: [PATCH v2] sched/core: fix illegal RCU from offline CPUs

2020-04-02 Thread Qian Cai



> On Apr 2, 2020, at 7:24 AM, Michael Ellerman  wrote:
> 
> Qian Cai  writes:
>> From: Peter Zijlstra 
>> 
>> In the CPU-offline process, it calls mmdrop() after idle entry and the
>> subsequent call to cpuhp_report_idle_dead(). Once execution passes the
>> call to rcu_report_dead(), RCU is ignoring the CPU, which results in
>> lockdep complaining when mmdrop() uses RCU from either memcg or
>> debugobjects below.
>> 
>> Fix it by cleaning up the active_mm state from BP instead. Every arch
>> which has CONFIG_HOTPLUG_CPU should have already called idle_task_exit()
>> from AP. The only exception is parisc because it switches them to
>> &init_mm unconditionally (see smp_boot_one_cpu() and smp_cpu_init()),
>> but the patch will still work there because it calls mmgrab(&init_mm) in
>> smp_cpu_init() and then should call mmdrop(&init_mm) in finish_cpu().
> 
> Thanks for debugging this. How did you hit it in the first place?

Just repeatedly offline/online CPUs, which will eventually cause an idle
thread's refcount to go to 0 and trigger __mmdrop(). Of course, it needs
lockdep enabled (PROVE_RCU?) as well as some luck to hit the cgroup,
workqueue or debugobjects code paths that call RCU.
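
Something as simple as this loop is enough to reproduce it eventually
(a sketch; pick an offlinable CPU number for your topology):

#include <fcntl.h>
#include <unistd.h>

static void set_online(const char *path, const char *val)
{
        int fd = open(path, O_WRONLY);

        if (fd >= 0) {
                write(fd, val, 1);
                close(fd);
        }
}

int main(void)
{
        const char *cpu = "/sys/devices/system/cpu/cpu1/online";

        /* toggle one CPU forever; run with lockdep/PROVE_RCU enabled */
        for (;;) {
                set_online(cpu, "0");
                set_online(cpu, "1");
        }
}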

> 
> A link to the original thread would have helped me:
> 
>  https://lore.kernel.org/lkml/20200113190331.12788-1-...@lca.pw/
> 
>> WARNING: suspicious RCU usage
>> -
>> kernel/workqueue.c:710 RCU or wq_pool_mutex should be held!
>> 
>> other info that might help us debug this:
>> 
>> RCU used illegally from offline CPU!
>> Call Trace:
>> dump_stack+0xf4/0x164 (unreliable)
>> lockdep_rcu_suspicious+0x140/0x164
>> get_work_pool+0x110/0x150
>> __queue_work+0x1bc/0xca0
>> queue_work_on+0x114/0x120
>> css_release+0x9c/0xc0
>> percpu_ref_put_many+0x204/0x230
>> free_pcp_prepare+0x264/0x570
>> free_unref_page+0x38/0xf0
>> __mmdrop+0x21c/0x2c0
>> idle_task_exit+0x170/0x1b0
>> pnv_smp_cpu_kill_self+0x38/0x2e0
>> cpu_die+0x48/0x64
>> arch_cpu_idle_dead+0x30/0x50
>> do_idle+0x2f4/0x470
>> cpu_startup_entry+0x38/0x40
>> start_secondary+0x7a8/0xa80
>> start_secondary_resume+0x10/0x14
> 
> Do we know when this started happening? ie. can we determine a Fixes
> tag?

I don't know. I looked at some commits and it seems the code was like that
even 10 years ago. It must be that nobody cares to run lockdep (PROVE_RCU?)
with CPU hotplug very regularly.

> 
>> 
>> Signed-off-by: Qian Cai 
>> ---
>> arch/powerpc/platforms/powernv/smp.c |  1 -
>> include/linux/sched/mm.h |  2 ++
>> kernel/cpu.c | 18 +-
>> kernel/sched/core.c  |  5 +++--
>> 4 files changed, 22 insertions(+), 4 deletions(-)
>> 
>> diff --git a/arch/powerpc/platforms/powernv/smp.c 
>> b/arch/powerpc/platforms/powernv/smp.c
>> index 13e251699346..b2ba3e95bda7 100644
>> --- a/arch/powerpc/platforms/powernv/smp.c
>> +++ b/arch/powerpc/platforms/powernv/smp.c
>> @@ -167,7 +167,6 @@ static void pnv_smp_cpu_kill_self(void)
>>  /* Standard hot unplug procedure */
>> 
>>  idle_task_exit();
>> -current->active_mm = NULL; /* for sanity */
> 
> If I'm reading it right, we'll now be running with active_mm == init_mm
> in the offline loop.
> 
> I guess that's fine, I can't think of any reason it would matter, and it
> seems like we were NULL'ing it out just for paranoia's sake not because
> of any actual problem.
> 
> Acked-by: Michael Ellerman  (powerpc)
> 
> 
> cheers
> 
>> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
>> index c49257a3b510..a132d875d351 100644
>> --- a/include/linux/sched/mm.h
>> +++ b/include/linux/sched/mm.h
>> @@ -49,6 +49,8 @@ static inline void mmdrop(struct mm_struct *mm)
>>  __mmdrop(mm);
>> }
>> 
>> +void mmdrop(struct mm_struct *mm);
>> +
>> /*
>>  * This has to be called after a get_task_mm()/mmget_not_zero()
>>  * followed by taking the mmap_sem for writing before modifying the
>> diff --git a/kernel/cpu.c b/kernel/cpu.c
>> index 2371292f30b0..244d30544377 100644
>> --- a/kernel/cpu.c
>> +++ b/kernel/cpu.c
>> @@ -3,6 +3,7 @@
>>  *
>>  * This code is licenced under the GPL.
>>  */
>> +#include 
>> #include 
>> #include 
>> #include 
>> @@ -564,6 +565,21 @@ static int bringup_cpu(unsigned int cpu)
>>  return bringup_wait_for_ap(cpu);
>> }
>> 
>> +static int finish_cpu(unsigned int cpu)
>> +{
>> +struct task_struct *idle = idle_thread_get(cpu);
>> +struct mm_struct *mm = idle->active_mm;
>> +
>> +/*
>> + * idle_task_exit() will have switched to &init_mm, now
>> + * clean up any remaining active_mm state.
>> + */
>> +if (mm != &init_mm)
>> +idle->active_mm = &init_mm;
>> +mmdrop(mm);
>> +return 0;
>> +}
>> +
>> /*
>>  * Hotplug state machine related functions
>>  */
>> @@ -1549,7 +1565,7 @@ static struct cpuhp_step cpuhp_hp_states[] = {
>>  [CPUHP_BRINGUP_CPU] = {
>>  .name   = "cpu:bringup",
>>  .startup.single = bringup_cpu,
>> -  

[PATCH] powerpc/mm: ptdump: Add missing include

2020-04-02 Thread YueHaibing
gcc build fails:

arch/powerpc/mm/ptdump/hashpagetable.c: In function ‘pseries_find’:
arch/powerpc/mm/ptdump/hashpagetable.c:262:18: error: ‘H_SUCCESS’ undeclared 
(first use in this function); did you mean ‘FL_ACCESS’?
   if (lpar_rc != H_SUCCESS)
  ^
  FL_ACCESS

Reported-by: Hulk Robot 
Fixes: 65e701b2d2a8 ("powerpc/ptdump: drop non vital #ifdefs")
Signed-off-by: YueHaibing 
---
 arch/powerpc/mm/ptdump/hashpagetable.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/mm/ptdump/hashpagetable.c 
b/arch/powerpc/mm/ptdump/hashpagetable.c
index b6ed9578382f..8ea5f9a3b658 100644
--- a/arch/powerpc/mm/ptdump/hashpagetable.c
+++ b/arch/powerpc/mm/ptdump/hashpagetable.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
-- 
2.17.1




Re: [PATCH v4 00/25] Add support for OpenCAPI Persistent Memory devices

2020-04-02 Thread Greg Kurz
On Thu, 02 Apr 2020 21:06:01 +1100
Michael Ellerman  wrote:

> "Oliver O'Halloran"  writes:
> > On Thu, Apr 2, 2020 at 2:42 PM Michael Ellerman  wrote:
> >> "Alastair D'Silva"  writes:
> >> >> -Original Message-
> >> >> From: Dan Williams 
> >> >>
> >> >> On Sun, Mar 29, 2020 at 10:23 PM Alastair D'Silva 
> >> >> wrote:
> >> >> >
> >> >> > *snip*
> >> >> Are OPAL calls similar to ACPI DSMs? I.e. methods for the OS to invoke
> >> >> platform firmware services? What's Skiboot?
> >> >
> >> > Yes, OPAL is the interface to firmware for POWER. Skiboot is the 
> >> > open-source (and only) implementation of OPAL.
> >>
> >>   https://github.com/open-power/skiboot
> >>
> >> In particular the tokens for calls are defined here:
> >>
> >>   https://github.com/open-power/skiboot/blob/master/include/opal-api.h#L220
> >>
> >> And you can grep for the token to find the implementation:
> >>
> >>   
> >> https://github.com/open-power/skiboot/blob/master/hw/npu2-opencapi.c#L2328
> >
> > I'm not sure I'd encourage anyone to read npu2-opencapi.c. I find it
> > hard enough to follow even with access to the workbooks.
> 
> Compared to certain firmwares that run on certain other platforms it's
> actually pretty readable code ;)
> 

Forth rocks ! ;-)

> > There's an OPAL call API reference here:
> > http://open-power.github.io/skiboot/doc/opal-api/index.html
> 
> Even better.
> 
> cheers



Re: [PATCH v8 6/7] tools/perf: Enable Hz/hz prinitg for --metric-only option

2020-04-02 Thread Jiri Olsa
On Thu, Apr 02, 2020 at 02:03:39AM +0530, Kajol Jain wrote:
> Commit 54b5091606c18 ("perf stat: Implement --metric-only mode")
> added function 'valid_only_metric()' which drops "Hz" or "hz"
> if it is part of "ScaleUnit". This patch enables it, since hv_24x7
> supports a couple of frequency events.
> 
> Signed-off-by: Kajol Jain 
> ---
>  tools/perf/util/stat-display.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c
> index 9e757d18d713..679aaa655824 100644
> --- a/tools/perf/util/stat-display.c
> +++ b/tools/perf/util/stat-display.c
> @@ -237,8 +237,6 @@ static bool valid_only_metric(const char *unit)
>   if (!unit)
>   return false;
>   if (strstr(unit, "/sec") ||
> - strstr(unit, "hz") ||
> - strstr(unit, "Hz") ||

will this change the output of --metric-only for some setups then?

Andi, are you ok with this?

other than this, the patchset looks ok to me

thanks,
jirka

>   strstr(unit, "CPUs utilized"))
>   return false;
>   return true;
> -- 
> 2.21.0
> 



[PATCH v2] powerpc/64: Update Speculation_Store_Bypass in /proc//status

2020-04-02 Thread Michael Ellerman
Currently we don't report anything useful in /proc//status:

  $ grep Speculation_Store_Bypass /proc/self/status
  Speculation_Store_Bypass:   unknown

Our mitigation is currently always a barrier instruction, which
doesn't map that well onto the existing possibilities for the PR_SPEC
values.

However even if we added a "barrier" type PR_SPEC value, userspace
would still need to consult some other source to work out which type
of barrier to use. So reporting "vulnerable" seems sufficient, as
userspace can see that and then consult its source to determine what
barrier to use.
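
For example, userspace can query the reported state like this (a
sketch; the prctl constants come from <linux/prctl.h>):

#include <stdio.h>
#include <sys/prctl.h>
#include <linux/prctl.h>

int main(void)
{
        long ret = prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS,
                         0, 0, 0);

        /* with this patch, PR_SPEC_ENABLE here is shown as "vulnerable" */
        printf("ssb prctl value: %ld\n", ret);
        return 0;
}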

Signed-off-by: Gustavo Walbon 
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/kernel/security.c | 35 ++
 1 file changed, 35 insertions(+)

v2: Change the logic.

diff --git a/arch/powerpc/kernel/security.c b/arch/powerpc/kernel/security.c
index bd70f5be1c27..7c0b7b55e969 100644
--- a/arch/powerpc/kernel/security.c
+++ b/arch/powerpc/kernel/security.c
@@ -7,6 +7,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -353,6 +354,40 @@ ssize_t cpu_show_spec_store_bypass(struct device *dev, 
struct device_attribute *
return sprintf(buf, "Vulnerable\n");
 }
 
+static int ssb_prctl_get(struct task_struct *task)
+{
+   if (stf_enabled_flush_types == STF_BARRIER_NONE)
+   /*
+* We don't have an explicit signal from firmware that we're
+* vulnerable or not, we only have certain CPU revisions that
+* are known to be vulnerable.
+*
+* We assume that if we're on another CPU, where the barrier is
+* NONE, then we are not vulnerable.
+*/
+   return PR_SPEC_NOT_AFFECTED;
+   else
+   /*
+* If we do have a barrier type then we are vulnerable. The
+* barrier is not a global or per-process mitigation, so the
+* only value we can report here is PR_SPEC_ENABLE, which
+* appears as "vulnerable" in /proc.
+*/
+   return PR_SPEC_ENABLE;
+
+   return -EINVAL;
+}
+
+int arch_prctl_spec_ctrl_get(struct task_struct *task, unsigned long which)
+{
+   switch (which) {
+   case PR_SPEC_STORE_BYPASS:
+   return ssb_prctl_get(task);
+   default:
+   return -ENODEV;
+   }
+}
+
 #ifdef CONFIG_DEBUG_FS
 static int stf_barrier_set(void *data, u64 val)
 {

base-commit: c17eb4dca5a353a9dbbb8ad6934fe57af7165e91
-- 
2.25.1



[PATCH] powerpc/64s: Fix doorbell wakeup msgclr optimisation

2020-04-02 Thread Nicholas Piggin
Commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in C")
broke the doorbell wakeup optimisation introduced by commit a9af97aa0a12
("powerpc/64s: msgclr when handling doorbell exceptions from system
reset").

This patch restores it, in C code. It's moved explicitly to the system
reset wakeup path, rather than the doorbell replay path, because it is
always the right thing to do in the wakeup case, but it could be rare to
have a pending message in other cases, in which case it's wasted work --
measurements would have to be done to see whether that was a worthwhile
optimisation, and I suspect it would not be.

The results are similar to those in the original commit, test on POWER8
of context_switch selftests benchmark with polling idle disabled (e.g.,
always nap, giving cross-CPU IPIs) gives the following results:

                                broken    patched
Different threads, same core:   317k/s    375k/s    +18.7%
Different cores:                280k/s    282k/s     +1.0%

Fixes: 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in C")
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 19 ---
 arch/powerpc/kernel/irq.c| 13 +
 2 files changed, 13 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 18bbce143084..728ccb0f560c 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -3121,22 +3121,3 @@ handle_dabr_fault:
li  r5,SIGSEGV
bl  bad_page_fault
b   interrupt_return
-
-/*
- * When doorbell is triggered from system reset wakeup, the message is
- * not cleared, so it would fire again when EE is enabled.
- *
- * When coming from local_irq_enable, there may be the same problem if
- * we were hard disabled.
- *
- * Execute msgclr to clear pending exceptions before handling it.
- */
-h_doorbell_common_msgclr:
-   LOAD_REG_IMMEDIATE(r3, PPC_DBELL_MSGTYPE << (63-36))
-   PPC_MSGCLR(3)
-   b   h_doorbell_common_virt
-
-doorbell_super_common_msgclr:
-   LOAD_REG_IMMEDIATE(r3, PPC_DBELL_MSGTYPE << (63-36))
-   PPC_MSGCLRP(3)
-   b   doorbell_super_common_virt
diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 6ea27dbcb872..ed6230ba0c43 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -527,6 +527,19 @@ void irq_set_pending_from_srr1(unsigned long srr1)
return;
}
 
+   if (reason == PACA_IRQ_DBELL) {
+   /*
+* When doorbell triggers a system reset wakeup, the message
+* is not cleared, so if the doorbell interrupt is replayed
+* and the IPI handled, the doorbell interrupt would still
+* fire when EE is enabled.
+*
+* To avoid taking the superfluous doorbell interrupt,
+* execute a msgclr here before the interrupt is replayed.
+*/
+   ppc_msgclr(PPC_DBELL_MSGTYPE);
+   }
+
/*
 * The 0 index (SRR1[42:45]=b0000) must always evaluate to 0,
 * so this can be called unconditionally with the SRR1 wake
-- 
2.23.0



[PATCH] Revert "powerpc/64: irq_work avoid interrupt when called with hardware irqs enabled"

2020-04-02 Thread Nicholas Piggin
This reverts commit ebb37cf3ffd39fdb6ec5b07111f8bb2f11d92c5f.

That commit does not play well with soft-masked irq state manipulations
in idle, interrupt replay, and possibly others due to tracing code
sometimes using irq_work_queue (e.g., in trace_hardirqs_on()). That
can cause PACA_IRQ_DEC to become set when it is not expected, and be
ignored or cleared or cause warnings.

The net result seems to be missing an irq_work until the next timer
interrupt in the worst case, which is usually not going to be noticed;
however it could be a long time if the tick is disabled, which is
against the spirit of irq_work and might cause real problems.

The idea is still solid, but it would need more work. It's not really
clear if it would be worth the added complexity, so revert this for now
(not a straight revert, but a replacement with a comment explaining why
we might see interrupts happening, which gives git blame something to
find).

Fixes: ebb37cf3ffd3 ("powerpc/64: irq_work avoid interrupt when called with 
hardware irqs enabled")
Signed-off-by: Nicholas Piggin 
---
This started tripping some warnings testing the latest interrupt code
which juggled a few things around, but it looks like it may have had
problems before that too.

 arch/powerpc/kernel/time.c | 44 +++---
 1 file changed, 13 insertions(+), 31 deletions(-)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 1168e8b37e30..716f8d0960a7 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -522,35 +522,6 @@ static inline void clear_irq_work_pending(void)
"i" (offsetof(struct paca_struct, irq_work_pending)));
 }
 
-void arch_irq_work_raise(void)
-{
-   preempt_disable();
-   set_irq_work_pending_flag();
-   /*
-* Non-nmi code running with interrupts disabled will replay
-* irq_happened before it re-enables interrupts, so set the
-* decrementer there instead of causing a hardware exception
-* which would immediately hit the masked interrupt handler
-* and have the net effect of setting the decrementer in
-* irq_happened.
-*
-* NMI interrupts can not check this when they return, so the
-* decrementer hardware exception is raised, which will fire
-* when interrupts are next enabled.
-*
-* BookE does not support this yet, it must audit all NMI
-* interrupt handlers to ensure they call nmi_enter() so this
-* check would be correct.
-*/
-   if (IS_ENABLED(CONFIG_BOOKE) || !irqs_disabled() || in_nmi()) {
-   set_dec(1);
-   } else {
-   hard_irq_disable();
-   local_paca->irq_happened |= PACA_IRQ_DEC;
-   }
-   preempt_enable();
-}
-
 #else /* 32-bit */
 
 DEFINE_PER_CPU(u8, irq_work_pending);
@@ -559,16 +530,27 @@ DEFINE_PER_CPU(u8, irq_work_pending);
 #define test_irq_work_pending()	__this_cpu_read(irq_work_pending)
 #define clear_irq_work_pending()   __this_cpu_write(irq_work_pending, 0)
 
+#endif /* 32 vs 64 bit */
+
 void arch_irq_work_raise(void)
 {
+   /*
+* 64-bit code that uses irq soft-mask can just cause an immediate
+* interrupt here that gets soft masked, if this is called under
+* local_irq_disable(). It might be possible to prevent that happening
+* by noticing interrupts are disabled and setting decrementer pending
+* to be replayed when irqs are enabled. The problem there is that
+* tracing can call irq_work_raise, including in code that does low
+* level manipulations of irq soft-mask state (e.g., trace_hardirqs_on)
+* which could get tangled up if we're messing with the same state
+* here.
+*/
preempt_disable();
set_irq_work_pending_flag();
set_dec(1);
preempt_enable();
 }
 
-#endif /* 32 vs 64 bit */
-
 #else  /* CONFIG_IRQ_WORK */
 
 #define test_irq_work_pending()0
-- 
2.23.0



Re: [PATCH v3 1/1] ppc/crash: Reset spinlocks during crash

2020-04-02 Thread Michael Ellerman
Leonardo Bras  writes:
> During a crash, there is chance that the cpus that handle the NMI IPI
> are holding a spin_lock. If this spin_lock is needed by crashing_cpu it
> will cause a deadlock. (rtas.lock and printk logbuf_lock as of today)
>
> This is a problem if the system has kdump set up, given if it crashes
> for any reason kdump may not be saved for crash analysis.
>
> After NMI IPI is sent to all other cpus, force unlock all spinlocks
> needed for finishing crash routine.

I'm not convinced this is the right approach.

Busting locks is risky, it could easily cause a crash if data structures
are left in some inconsistent state.

I think we need to make this code more careful about what it's doing.
There's a clue at the top of default_machine_crash_shutdown(), which
calls crash_kexec_prepare_cpus():

 * This function is only called after the system
 * has panicked or is otherwise in a critical state.
 * The minimum amount of code to allow a kexec'd kernel
 * to run successfully needs to happen here.


You said the "IPI complete" message was the cause of one lockup:

  #0  arch_spin_lock 
  #1  do_raw_spin_lock 
  #2  __raw_spin_lock 
  #3  _raw_spin_lock 
  #4  vprintk_emit 
  #5  vprintk_func
  #7  crash_kexec_prepare_cpus 
  #8  default_machine_crash_shutdown
  #9  machine_crash_shutdown 
  #10 __crash_kexec
  #11 crash_kexec
  #12 oops_end

TBH I think we could just drop that printk() entirely.

Or we could tell printk() that we're in NMI context so that it uses the
percpu buffers.

We should probably do the latter anyway, in case there's any other code
we call that inadvertently calls printk().
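
Something like this, completely untested (assuming
printk_nmi_enter()/printk_nmi_exit() can be used outside a real NMI):

	/*
	 * Sketch only: divert printk() to the per-CPU NMI buffers for
	 * the whole crash path, so vprintk_emit() never takes
	 * logbuf_lock underneath us.
	 */
	printk_nmi_enter();
	crash_send_ipi(crash_ipi_callback);
	/* ... existing wait loop / shutdown sequence ... */
	printk_nmi_exit();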


The RTAS trace you sent was:

  #0 arch_spin_lock
  #1  lock_rtas () 
  #2  rtas_call (token=8204, nargs=1, nret=1, outputs=0x0)
  #3  ics_rtas_mask_real_irq (hw_irq=4100) 
  #4  machine_kexec_mask_interrupts
  #5  default_machine_crash_shutdown
  #6  machine_crash_shutdown 
  #7  __crash_kexec
  #8  crash_kexec
  #9  oops_end


Which doesn't make it clear who holds the RTAS lock. We really shouldn't
be crashing while holding the RTAS lock, but I guess it could happen.
Can you get a full backtrace?


PAPR says we are not allowed to have multiple CPUs calling RTAS at once,
except for a very small list of RTAS calls. So if we bust the RTAS lock
there's a risk we violate that part of PAPR and crash even harder.

Also it's not specific to kdump, we can't even get through a normal
reboot if we crash with the RTAS lock held.

Anyway here's a patch with some ideas. That allows me to get from a
crash with the RTAS lock held through kdump into the 2nd kernel. But it
only works if it's the crashing CPU that holds the RTAS lock.

cheers

diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index c5fa251b8950..44ce74966d60 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 
+#include <asm/debugfs.h>
 #include 
 #include 
 #include 
@@ -65,6 +66,8 @@ unsigned long rtas_rmo_buf;
 void (*rtas_flash_term_hook)(int);
 EXPORT_SYMBOL(rtas_flash_term_hook);
 
+static int rtas_lock_holder = -1;
+
 /* RTAS use home made raw locking instead of spin_lock_irqsave
  * because those can be called from within really nasty contexts
  * such as having the timebase stopped which would lockup with
@@ -76,7 +79,20 @@ static unsigned long lock_rtas(void)
 
local_irq_save(flags);
preempt_disable();
-   arch_spin_lock(&rtas.lock);
+
+   if (!arch_spin_trylock(&rtas.lock)) {
+   // Couldn't get the lock, do we already hold it?
+   if (rtas_lock_holder == smp_processor_id())
+   // Yes, so we would have deadlocked on ourself. Assume
+   // we're crashing and continue on hopefully ...
+   return flags;
+
+   // No, wait on the lock
+   arch_spin_lock(&rtas.lock);
+   }
+
+   rtas_lock_holder = smp_processor_id();
+
return flags;
 }
 
@@ -85,6 +101,8 @@ static void unlock_rtas(unsigned long flags)
arch_spin_unlock(&rtas.lock);
local_irq_restore(flags);
preempt_enable();
+
+   rtas_lock_holder = -1;
 }
 
 /*
@@ -1263,3 +1281,24 @@ void rtas_take_timebase(void)
timebase = 0;
arch_spin_unlock(&timebase_lock);
 }
+
+static int rtas_crash_set(void *data, u64 val)
+{
+   printk("%s: Taking RTAS lock and then crashing ...\n", __func__);
+   lock_rtas();
+
+   *((volatile int *) 0) = 0;
+
+   return 0;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(fops_rtas_crash, NULL, rtas_crash_set, "%llu\n");
+
+static __init int rtas_crash_debugfs_init(void)
+{
+   debugfs_create_file_unsafe("crash_in_rtas", 0200,
+  powerpc_debugfs_root, NULL,
+  &fops_rtas_crash);
+   return 0;
+}
+device_initcall(rtas_crash_debugfs_init);
diff --git a/arch/powerpc/kexec/crash.c b/arch/powerpc/kexec/crash.c
index d488311efa

Re: [PATCH v2] sched/core: fix illegal RCU from offline CPUs

2020-04-02 Thread Michael Ellerman
Qian Cai  writes:
> From: Peter Zijlstra 
>
> In the CPU-offline process, it calls mmdrop() after idle entry and the
> subsequent call to cpuhp_report_idle_dead(). Once execution passes the
> call to rcu_report_dead(), RCU is ignoring the CPU, which results in
> lockdep complaining when mmdrop() uses RCU from either memcg or
> debugobjects below.
>
> Fix it by cleaning up the active_mm state from BP instead. Every arch
> which has CONFIG_HOTPLUG_CPU should have already called idle_task_exit()
> from AP. The only exception is parisc because it switches them to
> &init_mm unconditionally (see smp_boot_one_cpu() and smp_cpu_init()),
> but the patch will still work there because it calls mmgrab(&init_mm) in
> smp_cpu_init() and then should call mmdrop(&init_mm) in finish_cpu().

Thanks for debugging this. How did you hit it in the first place?

A link to the original thread would have helped me:

  https://lore.kernel.org/lkml/20200113190331.12788-1-...@lca.pw/

> WARNING: suspicious RCU usage
> -
> kernel/workqueue.c:710 RCU or wq_pool_mutex should be held!
>
> other info that might help us debug this:
>
> RCU used illegally from offline CPU!
> Call Trace:
>  dump_stack+0xf4/0x164 (unreliable)
>  lockdep_rcu_suspicious+0x140/0x164
>  get_work_pool+0x110/0x150
>  __queue_work+0x1bc/0xca0
>  queue_work_on+0x114/0x120
>  css_release+0x9c/0xc0
>  percpu_ref_put_many+0x204/0x230
>  free_pcp_prepare+0x264/0x570
>  free_unref_page+0x38/0xf0
>  __mmdrop+0x21c/0x2c0
>  idle_task_exit+0x170/0x1b0
>  pnv_smp_cpu_kill_self+0x38/0x2e0
>  cpu_die+0x48/0x64
>  arch_cpu_idle_dead+0x30/0x50
>  do_idle+0x2f4/0x470
>  cpu_startup_entry+0x38/0x40
>  start_secondary+0x7a8/0xa80
>  start_secondary_resume+0x10/0x14

Do we know when this started happening? ie. can we determine a Fixes
tag?

> 
> Signed-off-by: Qian Cai 
> ---
>  arch/powerpc/platforms/powernv/smp.c |  1 -
>  include/linux/sched/mm.h |  2 ++
>  kernel/cpu.c | 18 +-
>  kernel/sched/core.c  |  5 +++--
>  4 files changed, 22 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/platforms/powernv/smp.c 
> b/arch/powerpc/platforms/powernv/smp.c
> index 13e251699346..b2ba3e95bda7 100644
> --- a/arch/powerpc/platforms/powernv/smp.c
> +++ b/arch/powerpc/platforms/powernv/smp.c
> @@ -167,7 +167,6 @@ static void pnv_smp_cpu_kill_self(void)
>   /* Standard hot unplug procedure */
>  
>   idle_task_exit();
> - current->active_mm = NULL; /* for sanity */

If I'm reading it right, we'll now be running with active_mm == init_mm
in the offline loop.

I guess that's fine, I can't think of any reason it would matter, and it
seems like we were NULL'ing it out just for paranoia's sake not because
of any actual problem.

Acked-by: Michael Ellerman  (powerpc)


cheers

> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index c49257a3b510..a132d875d351 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -49,6 +49,8 @@ static inline void mmdrop(struct mm_struct *mm)
>   __mmdrop(mm);
>  }
>  
> +void mmdrop(struct mm_struct *mm);
> +
>  /*
>   * This has to be called after a get_task_mm()/mmget_not_zero()
>   * followed by taking the mmap_sem for writing before modifying the
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 2371292f30b0..244d30544377 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -3,6 +3,7 @@
>   *
>   * This code is licenced under the GPL.
>   */
> +#include <linux/sched/mm.h>
>  #include <linux/proc_fs.h>
>  #include <linux/smp.h>
>  #include <linux/init.h>
> @@ -564,6 +565,21 @@ static int bringup_cpu(unsigned int cpu)
>   return bringup_wait_for_ap(cpu);
>  }
>  
> +static int finish_cpu(unsigned int cpu)
> +{
> + struct task_struct *idle = idle_thread_get(cpu);
> + struct mm_struct *mm = idle->active_mm;
> +
> + /*
> +  * idle_task_exit() will have switched to &init_mm, now
> +  * clean up any remaining active_mm state.
> +  */
> + if (mm != &init_mm)
> + idle->active_mm = &init_mm;
> + mmdrop(mm);
> + return 0;
> +}
> +
>  /*
>   * Hotplug state machine related functions
>   */
> @@ -1549,7 +1565,7 @@ static struct cpuhp_step cpuhp_hp_states[] = {
>   [CPUHP_BRINGUP_CPU] = {
>   .name   = "cpu:bringup",
>   .startup.single = bringup_cpu,
> - .teardown.single= NULL,
> + .teardown.single= finish_cpu,
>   .cant_stop  = true,
>   },
>   /* Final state before CPU kills itself */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a2694ba82874..8787958339d5 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6200,13 +6200,14 @@ void idle_task_exit(void)
>   struct mm_struct *mm = current->active_mm;
>  
>   BUG_ON(cpu_online(smp_processor_id()));
> + BUG_ON(current != this_rq()->idle);
>  
>   if (mm != &init_mm) {
>   switch_mm(m

WARNING in ext4_da_update_reserve_space

2020-04-02 Thread syzbot
Hello,

syzbot found the following crash on:

HEAD commit:    1a147b74 Merge branch 'DSA-mtu'
git tree:   net-next
console output: https://syzkaller.appspot.com/x/log.txt?x=14237713e0
kernel config:  https://syzkaller.appspot.com/x/.config?x=46ee14d4915944bc
dashboard link: https://syzkaller.appspot.com/bug?extid=67e4f16db666b1c8253c
compiler:   gcc (GCC) 9.0.0 20181231 (experimental)
syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=12237713e0
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=10ec7c97e0

The bug was bisected to:

commit 658b0f92bc7003bc734471f61bf7cd56339eb8c3
Author: Murilo Opsfelder Araujo 
Date:   Wed Aug 1 21:33:15 2018 +

powerpc/traps: Print unhandled signals in a separate function

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=15979f5be0
final crash:https://syzkaller.appspot.com/x/report.txt?x=17979f5be0
console output: https://syzkaller.appspot.com/x/log.txt?x=13979f5be0

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+67e4f16db666b1c82...@syzkaller.appspotmail.com
Fixes: 658b0f92bc70 ("powerpc/traps: Print unhandled signals in a separate 
function")

EXT4-fs warning (device sda1): ext4_da_update_reserve_space:344: 
ext4_da_update_reserve_space: ino 15722, used 1 with only 0 reserved data blocks
[ cut here ]
WARNING: CPU: 1 PID: 359 at fs/ext4/inode.c:348 
ext4_da_update_reserve_space+0x622/0x7d0 fs/ext4/inode.c:344
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 359 Comm: kworker/u4:5 Not tainted 5.6.0-rc7-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Workqueue: writeback wb_workfn (flush-8:0)
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x188/0x20d lib/dump_stack.c:118
 panic+0x2e3/0x75c kernel/panic.c:221
 __warn.cold+0x2f/0x35 kernel/panic.c:582
 report_bug+0x27b/0x2f0 lib/bug.c:195
 fixup_bug arch/x86/kernel/traps.c:174 [inline]
 fixup_bug arch/x86/kernel/traps.c:169 [inline]
 do_error_trap+0x12b/0x220 arch/x86/kernel/traps.c:267
 do_invalid_op+0x32/0x40 arch/x86/kernel/traps.c:286
 invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1027
RIP: 0010:ext4_da_update_reserve_space+0x622/0x7d0 fs/ext4/inode.c:348
Code: 02 00 0f 85 94 01 00 00 48 8b 7d 28 49 c7 c0 20 72 3c 88 41 56 48 c7 c1 
80 60 3c 88 53 ba 58 01 00 00 4c 89 c6 e8 1e 6d 0d 00 <0f> 0b 48 b8 00 00 00 00 
00 fc ff df 4c 89 ea 48 c1 ea 03 0f b6 04
RSP: 0018:c90002197288 EFLAGS: 00010296
RAX:  RBX: 0001 RCX: 
RDX:  RSI: 820bf066 RDI: f52000432e21
RBP: 888086b744c8 R08: 0091 R09: ed1015ce6659
R10: ed1015ce6658 R11: 8880ae7332c7 R12: 0001
R13: 888086b74990 R14:  R15: 888086b74a40
 ext4_ext_map_blocks+0x24aa/0x37d0 fs/ext4/extents.c:4500
 ext4_map_blocks+0x4cb/0x1650 fs/ext4/inode.c:622
 mpage_map_one_extent fs/ext4/inode.c:2365 [inline]
 mpage_map_and_submit_extent fs/ext4/inode.c:2418 [inline]
 ext4_writepages+0x19eb/0x3080 fs/ext4/inode.c:2772
 do_writepages+0xfa/0x2a0 mm/page-writeback.c:2344
 __writeback_single_inode+0x12a/0x1410 fs/fs-writeback.c:1452
 writeback_sb_inodes+0x515/0xdd0 fs/fs-writeback.c:1716
 wb_writeback+0x2a5/0xd90 fs/fs-writeback.c:1892
 wb_do_writeback fs/fs-writeback.c:2037 [inline]
 wb_workfn+0x339/0x11c0 fs/fs-writeback.c:2078
 process_one_work+0x94b/0x1690 kernel/workqueue.c:2266
 worker_thread+0x96/0xe20 kernel/workqueue.c:2412
 kthread+0x357/0x430 kernel/kthread.c:255
 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
Kernel Offset: disabled
Rebooting in 86400 seconds..


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkal...@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
For information about bisection process see: https://goo.gl/tpsmEJ#bisection
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches


Re: [PATCH v5 1/4] powerpc/papr_scm: Fetch nvdimm health information from PHYP

2020-04-02 Thread Michael Ellerman
Vaibhav Jain  writes:

> Implement support for fetching nvdimm health information via
> H_SCM_HEALTH hcall as documented in Ref[1]. The hcall returns a pair
> of 64-bit big-endian integers which are then stored in 'struct
> papr_scm_priv' and subsequently partially exposed to user-space via
> newly introduced dimm specific attribute 'papr_flags'. Also a new asm
> header named 'papr-scm.h' is added that describes the interface
> between PHYP and guest kernel.
>
> Following flags are reported via 'papr_flags' sysfs attribute contents
> of which are space separated string flags indicating various nvdimm
> states:
>
>  * "not_armed": Indicating that nvdimm contents wont survive a power
>  cycle.
>  * "save_fail": Indicating that nvdimm contents couldn't be flushed
>  during last shutdown event.
>  * "restore_fail": Indicating that nvdimm contents couldn't be restored
>  during dimm initialization.
>  * "encrypted": Dimm contents are encrypted.
>  * "smart_notify": There is health event for the nvdimm.
>  * "scrubbed" : Indicating that contents of the nvdimm have been
>  scrubbed.
>  * "locked"   : Indicating that nvdimm contents cant be modified
>  until next power cycle.
>
> [1]: commit 58b278f568f0 ("powerpc: Provide initial documentation for
> PAPR hcalls")
>
> Signed-off-by: Vaibhav Jain 
> ---
> Changelog:
>
> v4..v5 : None
>
> v3..v4 : None
>
> v2..v3 : Removed PAPR_SCM_DIMM_HEALTH_NON_CRITICAL as a condition for
>NVDIMM unarmed [Aneesh]
>
> v1..v2 : New patch in the series.
> ---
>  arch/powerpc/include/asm/papr_scm.h   |  48 ++
>  arch/powerpc/platforms/pseries/papr_scm.c | 105 +-
>  2 files changed, 151 insertions(+), 2 deletions(-)
>  create mode 100644 arch/powerpc/include/asm/papr_scm.h
>
> diff --git a/arch/powerpc/include/asm/papr_scm.h 
> b/arch/powerpc/include/asm/papr_scm.h
> new file mode 100644
> index ..868d3360f56a
> --- /dev/null
> +++ b/arch/powerpc/include/asm/papr_scm.h
> @@ -0,0 +1,48 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * Structures and defines needed to manage nvdimms for spapr guests.
> + */
> +#ifndef _ASM_POWERPC_PAPR_SCM_H_
> +#define _ASM_POWERPC_PAPR_SCM_H_
> +
> +#include 
> +#include 
> +
> +/* DIMM health bitmap bitmap indicators */
> +/* SCM device is unable to persist memory contents */
> +#define PAPR_SCM_DIMM_UNARMEDPPC_BIT(0)

Please don't use PPC_BIT, it's just unnecessary obfuscation for folks
who are reading the code without access to the docs (ie. more or less
everyone other than you :)
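
For example (untested; value assumed equivalent to the PPC_BIT() one):

	/* SCM device is unable to persist memory contents */
	#define PAPR_SCM_DIMM_UNARMED	(1ULL << 63)	/* was PPC_BIT(0) */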

> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
> b/arch/powerpc/platforms/pseries/papr_scm.c
> index 0b4467e378e5..aaf2e4ab1f75 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -14,6 +14,7 @@
>  #include 
>  
>  #include <asm/plpar_wrappers.h>
> +#include <asm/papr_scm.h>
>  
>  #define BIND_ANY_ADDR (~0ul)
>  
> @@ -39,6 +40,13 @@ struct papr_scm_priv {
>   struct resource res;
>   struct nd_region *region;
>   struct nd_interleave_set nd_set;
> +
> + /* Protect dimm data from concurrent access */
> + struct mutex dimm_mutex;
> +
> + /* Health information for the dimm */
> + __be64 health_bitmap;
> + __be64 health_bitmap_valid;

It's much less error prone to store the data in CPU endian and do the
endian conversion only at the point where the data either comes from or
goes to firmware.

That would also mean you can define flags above without needing PPC_BIT
because they'll be in CPU endian too.
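
Untested, but roughly:

	/* plain CPU endian in the driver ... */
	u64 health_bitmap;
	u64 health_bitmap_valid;

	/*
	 * ... and convert only where the data crosses the firmware
	 * boundary (sketch; assumes the hcall return buffer is the
	 * only place big-endian values appear):
	 */
	p->health_bitmap = be64_to_cpu(ret[0]);
	p->health_bitmap_valid = be64_to_cpu(ret[1]);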

> @@ -144,6 +152,35 @@ static int drc_pmem_query_n_bind(struct papr_scm_priv *p)
>   return drc_pmem_bind(p);
>  }
>  
> +static int drc_pmem_query_health(struct papr_scm_priv *p)
> +{
> + unsigned long ret[PLPAR_HCALL_BUFSIZE];
> + int64_t rc;

Use kernel types please, ie. s64, or just long.

> + rc = plpar_hcall(H_SCM_HEALTH, ret, p->drc_index);
> + if (rc != H_SUCCESS) {
> + dev_err(&p->pdev->dev,
> +  "Failed to query health information, Err:%lld\n", rc);
> + return -ENXIO;
> + }
> +
> + /* Protect modifications to papr_scm_priv with the mutex */
> + rc = mutex_lock_interruptible(&p->dimm_mutex);
> + if (rc)
> + return rc;
> +
> + /* Store the retrieved health information in dimm platform data */
> + p->health_bitmap = ret[0];
> + p->health_bitmap_valid = ret[1];
> +
> + dev_dbg(&p->pdev->dev,
> + "Queried dimm health info. Bitmap:0x%016llx Mask:0x%016llx\n",
> + be64_to_cpu(p->health_bitmap),
> + be64_to_cpu(p->health_bitmap_valid));
> +
> + mutex_unlock(&p->dimm_mutex);
> + return 0;
> +}
>  
>  static int papr_scm_meta_get(struct papr_scm_priv *p,
>struct nd_cmd_get_config_data_hdr *hdr)
> @@ -304,6 +341,67 @@ static inli

Re: [PATCH 0/2] powerpc: Remove support for ppc405/440 Xilinx platforms

2020-04-02 Thread Arnd Bergmann
On Wed, Apr 1, 2020 at 11:07 PM Arnd Bergmann  wrote:
>
> On Tue, Mar 31, 2020 at 7:51 PM Segher Boessenkool
>  wrote:
> >
> > On Tue, Mar 31, 2020 at 08:56:23AM +0200, Christophe Leroy wrote:
> > > While we are at it, can we also remove the 601 ? This one is also full
> > > of workarounds and diverges a bit from other 6xx.
> > >
> > > I'm unable to find its end of life date, but it was on the market in
> > > 1994, so I guess it must be outdated by more than 10-15 yr old now ?
> >
> > There probably are still some people running Linux on 601 powermacs.
>
> It could be marked as "BROKEN" for a year to find out for sure ;-)
>
> Apparently there were only two or three models that are old enough to
> have a 601 and new enough to run Linux with PCI and OF: 7200/8200
> and 7500. These were sold for less than 18 months around 1996,
> though one can still find them on eBay.

A. Wilcox said on IRC regarding 601 support in Adélie Linux:

"right now we are primarily targeting G3, though 603 should be supported.
601/601e support is planned to be added for 2.0 (next year)."

  Arnd


Re: [PATCH V2 3/5] selftests/powerpc: Add NX-GZIP engine compress testcase

2020-04-02 Thread Michael Ellerman
Daniel Axtens  writes:
> Raphael Moreira Zinsly  writes:
>
>> Add a compression testcase for the powerpc NX-GZIP engine.
>>
>> Signed-off-by: Bulent Abali 
>> Signed-off-by: Raphael Moreira Zinsly 
...
>> diff --git a/tools/testing/selftests/powerpc/nx-gzip/gzip_vas.c 
>> b/tools/testing/selftests/powerpc/nx-gzip/gzip_vas.c
>> new file mode 100644
>> index ..d28e1efb527b
>> --- /dev/null
>> +++ b/tools/testing/selftests/powerpc/nx-gzip/gzip_vas.c
>> @@ -0,0 +1,259 @@
>> +// SPDX-License-Identifier: GPL-2.0-or-later
>> +
>> +/* Copyright 2020 IBM Corp.
>> + *
>> + * Author: Bulent Abali 
>> + *
>> + */
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include "nx-gzip.h"
>> +#include "nx.h"
>> +#include "copy-paste.h"
>> +#include "nxu.h"
>> +#include "nx_dbg.h"
>> +#include 
>> +
>> +#define barrier()
>> +#define hwsync()	({ asm volatile("hwsync" ::: "memory"); })
>
> This doesn't compile on the clang version I tried as it doesn't
> recognise 'hwsync'.

What assembler are you using? I guess the LLVM one?

Can you try with -fno-integrated-as ?

> Does asm volatile("sync" ::: "memory");
> do the same thing? That is recognised by clang, but I don't know if
> dropping the hw prefix matters!

It shouldn't matter.

But you can just try it and look at the generated code to be sure, you
should get 0x7c0004ac.
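
E.g. (untested; cross toolchain and file name assumed):

	/* sync_test.c: both spellings should assemble to the same opcode */
	void sync_test(void)
	{
		asm volatile("sync" ::: "memory");
	}

	/*
	 * $ powerpc64-linux-gnu-gcc -c sync_test.c && objdump -d sync_test.o
	 * expect: 7c 00 04 ac    sync    (i.e. 0x7c0004ac)
	 */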

cheers


Re: [PATCH v4 03/25] powerpc/powernv: Map & release OpenCAPI LPC memory

2020-04-02 Thread Benjamin Herrenschmidt
On Wed, 2020-04-01 at 01:48 -0700, Dan Williams wrote:
> > 
> > +u64 pnv_ocxl_platform_lpc_setup(struct pci_dev *pdev, u64 size)
> > +{
> > +   struct pci_controller *hose = pci_bus_to_host(pdev->bus);
> > +   struct pnv_phb *phb = hose->private_data;
> 
> Is calling the local variable 'hose' instead of 'host' on purpose?

Haha that's funny :-)

It's an old usage that comes, IIRC, from sparc? Or maybe alpha?
I somewhat accidentally picked it up when adding multiple host-bridge
support on powerpc in the early 2000s, and it hasn't quite died yet :)

Cheers,
Ben.



Re: [PATCH 0/2] powerpc: Remove support for ppc405/440 Xilinx platforms

2020-04-02 Thread Benjamin Herrenschmidt
On Tue, 2020-03-31 at 16:30 +1100, Michael Ellerman wrote:
> I have no attachment to 40x, and I'd certainly be happy to have less
> code in the tree, we struggle to keep even the modern platforms well
> maintained.
> 
> At the same time I don't want to render anyone's hardware obsolete
> unnecessarily. But if there's really no one using 40x then we should
> remove it, it could well be broken already.
> 
> So I guess post a series to do the removal and we'll see if anyone
> speaks up.

We shouldn't remove 40x completely. Just remove the Xilinx 405 stuff.

Cheers,
Ben.




Re: [PATCH v1 06/46] powerpc/kasan: Refactor update of early shadow mappings

2020-04-02 Thread Christophe Leroy

Michael,

On 16/03/2020 at 13:35, Christophe Leroy wrote:

kasan_remap_early_shadow_ro() and kasan_unmap_early_shadow_vmalloc()
are both updating the early shadow mapping: the first one sets
the mapping read-only while the other clears the mapping.

Refactor and create kasan_update_early_region()


There is a trivial conflict with this patch on powerpc/next.
Do you plan to take this series for 5.7? If so, I can repost the series
now with the fix, or just this patch.


Otherwise, what are your plans? This series (patches 18 and 19) will
conflict with the 40x removal series, as both touch that
PTE_ATOMIC_UPDATE stuff. Which series should go first?


Thanks
Christophe



Signed-off-by: Christophe Leroy 
---
  arch/powerpc/mm/kasan/kasan_init_32.c | 39 +--
  1 file changed, 18 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c b/arch/powerpc/mm/kasan/kasan_init_32.c
index c9d053078c37..65fd8b891f8e 100644
--- a/arch/powerpc/mm/kasan/kasan_init_32.c
+++ b/arch/powerpc/mm/kasan/kasan_init_32.c
@@ -79,45 +79,42 @@ static int __init kasan_init_region(void *start, size_t size)
 	return 0;
 }
 
-static void __init kasan_remap_early_shadow_ro(void)
+static void __init
+kasan_update_early_region(unsigned long k_start, unsigned long k_end, pte_t pte)
 {
-	pgprot_t prot = kasan_prot_ro();
-	unsigned long k_start = KASAN_SHADOW_START;
-	unsigned long k_end = KASAN_SHADOW_END;
 	unsigned long k_cur;
 	phys_addr_t pa = __pa(kasan_early_shadow_page);
 
-	kasan_populate_pte(kasan_early_shadow_pte, prot);
-
-	for (k_cur = k_start & PAGE_MASK; k_cur < k_end; k_cur += PAGE_SIZE) {
+	for (k_cur = k_start; k_cur < k_end; k_cur += PAGE_SIZE) {
 		pmd_t *pmd = pmd_ptr_k(k_cur);
 		pte_t *ptep = pte_offset_kernel(pmd, k_cur);
 
 		if ((pte_val(*ptep) & PTE_RPN_MASK) != pa)
 			continue;
 
-		__set_pte_at(&init_mm, k_cur, ptep, pfn_pte(PHYS_PFN(pa), prot), 0);
+		__set_pte_at(&init_mm, k_cur, ptep, pte, 0);
 	}
-	flush_tlb_kernel_range(KASAN_SHADOW_START, KASAN_SHADOW_END);
+
+	flush_tlb_kernel_range(k_start, k_end);
 }
 
-static void __init kasan_unmap_early_shadow_vmalloc(void)
+static void __init kasan_remap_early_shadow_ro(void)
 {
-	unsigned long k_start = (unsigned long)kasan_mem_to_shadow((void *)VMALLOC_START);
-	unsigned long k_end = (unsigned long)kasan_mem_to_shadow((void *)VMALLOC_END);
-	unsigned long k_cur;
+	pgprot_t prot = kasan_prot_ro();
 	phys_addr_t pa = __pa(kasan_early_shadow_page);
 
-	for (k_cur = k_start & PAGE_MASK; k_cur < k_end; k_cur += PAGE_SIZE) {
-		pmd_t *pmd = pmd_offset(pud_offset(pgd_offset_k(k_cur), k_cur), k_cur);
-		pte_t *ptep = pte_offset_kernel(pmd, k_cur);
+	kasan_populate_pte(kasan_early_shadow_pte, prot);
 
-		if ((pte_val(*ptep) & PTE_RPN_MASK) != pa)
-			continue;
+	kasan_update_early_region(KASAN_SHADOW_START, KASAN_SHADOW_END,
+				  pfn_pte(PHYS_PFN(pa), prot));
+}
 
-		__set_pte_at(&init_mm, k_cur, ptep, __pte(0), 0);
-	}
-	flush_tlb_kernel_range(k_start, k_end);
+static void __init kasan_unmap_early_shadow_vmalloc(void)
+{
+	unsigned long k_start = (unsigned long)kasan_mem_to_shadow((void *)VMALLOC_START);
+	unsigned long k_end = (unsigned long)kasan_mem_to_shadow((void *)VMALLOC_END);
+
+	kasan_update_early_region(k_start, k_end, __pte(0));
 }
 
 static void __init kasan_mmu_init(void)




Re: [PATCH v1 39/46] powerpc/8xx: Add a function to early map kernel via huge pages

2020-04-02 Thread Christophe Leroy




On 17/03/2020 at 15:43, Christophe Leroy wrote:



On 17/03/2020 at 02:39, kbuild test robot wrote:

Hi Christophe,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on next-20200316]
[cannot apply to powerpc/next v5.6-rc6 v5.6-rc5 v5.6-rc4 v5.6-rc6]
[if your patch is applied to the wrong git tree, please drop us a note 
to help
improve the system. BTW, we also suggest to use '--base' option to 
specify the
base tree in git format-patch, please see 
https://stackoverflow.com/a/37406982]


url:
https://github.com/0day-ci/linux/commits/Christophe-Leroy/Use-hugepages-to-map-kernel-mem-on-8xx/20200317-065610 


base:    8548fd2f20ed19b0e8c0585b71fdfde1ae00ae3c
config: powerpc-tqm8xx_defconfig (attached as .config)
compiler: powerpc-linux-gcc (GCC) 9.2.0
reproduce:
 wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross 
-O ~/bin/make.cross

 chmod +x ~/bin/make.cross
 # save the attached .config to linux build tree
 GCC_VERSION=9.2.0 make.cross ARCH=powerpc

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot 

All errors (new ones prefixed by >>):

    In file included from arch/powerpc/mm/fault.c:33:
    include/linux/hugetlb.h: In function 'hstate_inode':
include/linux/hugetlb.h:522:9: error: implicit declaration of 
function 'HUGETLBFS_SB'; did you mean 'HUGETLBFS_MAGIC'? 
[-Werror=implicit-function-declaration]

  522 |  return HUGETLBFS_SB(i->i_sb)->hstate;
  | ^~~~
  | HUGETLBFS_MAGIC
include/linux/hugetlb.h:522:30: error: invalid type argument of '->' 
(have 'int')

  522 |  return HUGETLBFS_SB(i->i_sb)->hstate;
  |  ^~
    cc1: all warnings being treated as errors


hstate_inode() shouldn't use HUGETLBFS_SB() when CONFIG_HUGETLBFS is
not set.


Proposed fix at https://patchwork.ozlabs.org/patch/1256108/
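
Roughly, assuming the fix just guards the helper (untested):

static inline struct hstate *hstate_inode(struct inode *i)
{
#ifdef CONFIG_HUGETLBFS
	return HUGETLBFS_SB(i->i_sb)->hstate;
#else
	return NULL;
#endif
}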


The fix is going in 5.7 via the mm tree, it seems; see
https://patchwork.kernel.org/patch/11470105/


Christophe



[...]


include/linux/hugetlb.h:522:30: error: invalid type argument of '->' 
(have 'int')

  522 |  return HUGETLBFS_SB(i->i_sb)->hstate;
  |  ^~
    At top level:
    arch/powerpc//mm/nohash/8xx.c:73:18: error: 
'__early_map_kernel_hugepage' defined but not used 
[-Werror=unused-function]
   73 | static int __ref __early_map_kernel_hugepage(unsigned long 
va, phys_addr_t pa,

  |  ^~~
    cc1: all warnings being treated as errors


This patch is a preparation patch. The function is not used yet, that's
normal. OK, it breaks bisectability. Should it be squashed with the
first user of the function?


Christophe


Re: [PATCH v4 00/25] Add support for OpenCAPI Persistent Memory devices

2020-04-02 Thread Michael Ellerman
"Oliver O'Halloran"  writes:
> On Thu, Apr 2, 2020 at 2:42 PM Michael Ellerman  wrote:
>> "Alastair D'Silva"  writes:
>> >> -Original Message-
>> >> From: Dan Williams 
>> >>
>> >> On Sun, Mar 29, 2020 at 10:23 PM Alastair D'Silva 
>> >> wrote:
>> >> >
>> >> > *snip*
>> >> Are OPAL calls similar to ACPI DSMs? I.e. methods for the OS to invoke
>> >> platform firmware services? What's Skiboot?
>> >
>> > Yes, OPAL is the interface to firmware for POWER. Skiboot is the 
>> > open-source (and only) implementation of OPAL.
>>
>>   https://github.com/open-power/skiboot
>>
>> In particular the tokens for calls are defined here:
>>
>>   https://github.com/open-power/skiboot/blob/master/include/opal-api.h#L220
>>
>> And you can grep for the token to find the implementation:
>>
>>   https://github.com/open-power/skiboot/blob/master/hw/npu2-opencapi.c#L2328
>
> I'm not sure I'd encourage anyone to read npu2-opencapi.c. I find it
> hard enough to follow even with access to the workbooks.

Compared to certain firmwares that run on certain other platforms it's
actually pretty readable code ;)

> There's an OPAL call API reference here:
> http://open-power.github.io/skiboot/doc/opal-api/index.html

Even better.

cheers


[powerpc:next] BUILD SUCCESS c17eb4dca5a353a9dbbb8ad6934fe57af7165e91

2020-04-02 Thread kbuild test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git  
next
branch HEAD: c17eb4dca5a353a9dbbb8ad6934fe57af7165e91  powerpc: Make 
setjmp/longjmp signature standard

elapsed time: 1281m

configs tested: 187
configs skipped: 0

The following configs have been built successfully.
More configs may be tested in the coming days.

arm  allmodconfig
arm   allnoconfig
arm  allyesconfig
arm64                        allmodconfig
arm64                         allnoconfig
arm64                        allyesconfig
arm   efm32_defconfig
arm at91_dt_defconfig
arm                    shmobile_defconfig
arm64   defconfig
arm  exynos_defconfig
arm                    multi_v5_defconfig
arm                       sunxi_defconfig
arm                    multi_v7_defconfig
h8300                h8300h-sim_defconfig
m68k   m5475evb_defconfig
c6x  allyesconfig
powerpc   ppc64_defconfig
ia64                            defconfig
powerpc defconfig
i386  allnoconfig
i386 alldefconfig
i386 allyesconfig
i386defconfig
ia64 alldefconfig
ia64 allmodconfig
ia64  allnoconfig
ia64 allyesconfig
nios2 3c120_defconfig
nios2 10m50_defconfig
c6x                    evmc6678_defconfig
xtensa  iss_defconfig
xtensa   common_defconfig
openrisc simple_smp_defconfig
openrisc                or1ksim_defconfig
nds32   defconfig
nds32 allnoconfig
csky                            defconfig
alpha   defconfig
h8300   h8s-sim_defconfig
h8300 edosk2674_defconfig
m68k allmodconfig
m68k   sun3_defconfig
m68k  multi_defconfig
arc  allyesconfig
arc defconfig
microblaze  mmu_defconfig
microblaze            nommu_defconfig
powerpc   allnoconfig
powerpc  rhel-kconfig
mips   32r2_defconfig
mips 64r6el_defconfig
mips allmodconfig
mips  allnoconfig
mips allyesconfig
mips  fuloong2e_defconfig
mips  malta_kvm_defconfig
parisc                        allnoconfig
parisc                       allyesconfig
parisc            generic-32bit_defconfig
parisc            generic-64bit_defconfig
x86_64   randconfig-a001-20200401
x86_64   randconfig-a002-20200401
x86_64   randconfig-a003-20200401
i386 randconfig-a001-20200401
i386 randconfig-a002-20200401
i386 randconfig-a003-20200401
mips randconfig-a001-20200401
nds32randconfig-a001-20200401
m68k randconfig-a001-20200401
alpharandconfig-a001-20200401
parisc   randconfig-a001-20200401
riscvrandconfig-a001-20200401
alpharandconfig-a001-20200402
m68k randconfig-a001-20200402
mips randconfig-a001-20200402
nds32randconfig-a001-20200402
parisc   randconfig-a001-20200402
riscvrandconfig-a001-20200402
c6x  randconfig-a001-20200401
h8300randconfig-a001-20200401
microblaze   randconfig-a001-20200401
nios2randconfig-a001-20200401
sparc64  randconfig-a001-20200401
csky randconfig-a001-20200401
openrisc randconfig-a001-20200401
s390 randconfig-a001-20200401
sh   randconfig-a001-20200401
xtensa   randconfig-a001-20200401
csky randconfig-a001-20200402
openrisc randconfig-a001-20200402
s390 randconfig-a001-20200402
sh   randconfig-a001-20200402
xtensa   randconfig-a001-20200402
x86_64   randconfig-b001-20200402
x86_64   randconfig-b002-20200402
x86_64   randconfig-b003-20200402
i386 randconfig-b001-20200402
i386 randconfig-b002-20200402
i386 randconfig-b003-20200402
x86_64   randconfig-c001-20200401

Re: [RFC PATCH v2 1/1] powerpc/kernel: Enables memory hot-remove after reboot on pseries guests

2020-04-02 Thread Bharata B Rao
On Wed, Apr 1, 2020 at 8:38 PM Leonardo Bras  wrote:
>
> On Thu, 2020-03-05 at 20:32 -0300, Leonardo Bras wrote:
> > ---
> > The new flag was already proposed on Power Architecture documentation,
> > and it's waiting for approval.
> >
> > I would like to get your comments on this change, but it's still not
> > ready for being merged.
>
> New flag got approved on the documentation.
> Please review this patch.

Looks good to me, also tested with PowerKVM guests.

Reviewed-by: Bharata B Rao 

Regards,
Bharata.
-- 
http://raobharata.wordpress.com/


[PATCH v8 7/7] powerpc/32: use set_memory_attr()

2020-04-02 Thread Russell Currey
From: Christophe Leroy 

Use set_memory_attr() instead of the PPC32-specific change_page_attr().

change_page_attr() was checking that the address was not mapped by
blocks and was handling highmem, but that's unneeded because the
affected pages can't be in highmem and block mapping verification
is already done by the callers.

Signed-off-by: Christophe Leroy 
[ruscur: rebase on powerpc/merge with Christophe's new patches]
Signed-off-by: Russell Currey 
---
v8: Rebase on powerpc/merge

 arch/powerpc/mm/pgtable_32.c | 60 ++--
 1 file changed, 10 insertions(+), 50 deletions(-)

diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c
index f62de06e3d07..0d9d164fad26 100644
--- a/arch/powerpc/mm/pgtable_32.c
+++ b/arch/powerpc/mm/pgtable_32.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -121,64 +122,20 @@ void __init mapin_ram(void)
}
 }
 
-static int __change_page_attr_noflush(struct page *page, pgprot_t prot)
-{
-   pte_t *kpte;
-   unsigned long address;
-
-   BUG_ON(PageHighMem(page));
-   address = (unsigned long)page_address(page);
-
-   if (v_block_mapped(address))
-   return 0;
-   kpte = virt_to_kpte(address);
-   if (!kpte)
-   return -EINVAL;
-   __set_pte_at(&init_mm, address, kpte, mk_pte(page, prot), 0);
-
-   return 0;
-}
-
-/*
- * Change the page attributes of an page in the linear mapping.
- *
- * THIS DOES NOTHING WITH BAT MAPPINGS, DEBUG USE ONLY
- */
-static int change_page_attr(struct page *page, int numpages, pgprot_t prot)
-{
-   int i, err = 0;
-   unsigned long flags;
-   struct page *start = page;
-
-   local_irq_save(flags);
-   for (i = 0; i < numpages; i++, page++) {
-   err = __change_page_attr_noflush(page, prot);
-   if (err)
-   break;
-   }
-   wmb();
-   local_irq_restore(flags);
-   flush_tlb_kernel_range((unsigned long)page_address(start),
-  (unsigned long)page_address(page));
-   return err;
-}
-
 void mark_initmem_nx(void)
 {
-   struct page *page = virt_to_page(_sinittext);
unsigned long numpages = PFN_UP((unsigned long)_einittext) -
 PFN_DOWN((unsigned long)_sinittext);
 
if (v_block_mapped((unsigned long)_stext + 1))
mmu_mark_initmem_nx();
else
-   change_page_attr(page, numpages, PAGE_KERNEL);
+   set_memory_attr((unsigned long)_sinittext, numpages, 
PAGE_KERNEL);
 }
 
 #ifdef CONFIG_STRICT_KERNEL_RWX
 void mark_rodata_ro(void)
 {
-   struct page *page;
unsigned long numpages;
 
if (v_block_mapped((unsigned long)_sinittext)) {
@@ -187,20 +144,18 @@ void mark_rodata_ro(void)
return;
}
 
-   page = virt_to_page(_stext);
numpages = PFN_UP((unsigned long)_etext) -
   PFN_DOWN((unsigned long)_stext);
 
-   change_page_attr(page, numpages, PAGE_KERNEL_ROX);
+   set_memory_attr((unsigned long)_stext, numpages, PAGE_KERNEL_ROX);
/*
 * mark .rodata as read only. Use __init_begin rather than __end_rodata
 * to cover NOTES and EXCEPTION_TABLE.
 */
-   page = virt_to_page(__start_rodata);
numpages = PFN_UP((unsigned long)__init_begin) -
   PFN_DOWN((unsigned long)__start_rodata);
 
-   change_page_attr(page, numpages, PAGE_KERNEL_RO);
+   set_memory_attr((unsigned long)__start_rodata, numpages, 
PAGE_KERNEL_RO);
 
// mark_initmem_nx() should have already run by now
ptdump_check_wx();
@@ -210,9 +165,14 @@ void mark_rodata_ro(void)
 #ifdef CONFIG_DEBUG_PAGEALLOC
 void __kernel_map_pages(struct page *page, int numpages, int enable)
 {
+   unsigned long addr = (unsigned long)page_address(page);
+
if (PageHighMem(page))
return;
 
-   change_page_attr(page, numpages, enable ? PAGE_KERNEL : __pgprot(0));
+   if (enable)
+   set_memory_attr(addr, numpages, PAGE_KERNEL);
+   else
+   set_memory_attr(addr, numpages, __pgprot(0));
 }
 #endif /* CONFIG_DEBUG_PAGEALLOC */
-- 
2.26.0



[PATCH v8 6/7] powerpc/mm: implement set_memory_attr()

2020-04-02 Thread Russell Currey
From: Christophe Leroy 

In addition to the set_memory_xx() functions which allows to change
the memory attributes of not (yet) used memory regions, implement a
set_memory_attr() function to:
- set the final memory protection after init on currently used
kernel regions.
- enable/disable kernel memory regions in the scope of DEBUG_PAGEALLOC.

Unlike the set_memory_xx() functions, which can act in three steps
because the regions are unused, this function must modify pages 'on
the fly', as the kernel is executing from them. At the moment only
PPC32 will use it, and changing page attributes on the fly is not an
issue there.

Signed-off-by: Christophe Leroy 
Reported-by: kbuild test robot 
[ruscur: cast "data" to unsigned long instead of int]
Signed-off-by: Russell Currey 
---
 arch/powerpc/include/asm/set_memory.h |  2 ++
 arch/powerpc/mm/pageattr.c| 33 +++
 2 files changed, 35 insertions(+)

diff --git a/arch/powerpc/include/asm/set_memory.h 
b/arch/powerpc/include/asm/set_memory.h
index 64011ea444b4..b040094f7920 100644
--- a/arch/powerpc/include/asm/set_memory.h
+++ b/arch/powerpc/include/asm/set_memory.h
@@ -29,4 +29,6 @@ static inline int set_memory_x(unsigned long addr, int 
numpages)
return change_memory_attr(addr, numpages, SET_MEMORY_X);
 }
 
+int set_memory_attr(unsigned long addr, int numpages, pgprot_t prot);
+
 #endif
diff --git a/arch/powerpc/mm/pageattr.c b/arch/powerpc/mm/pageattr.c
index 2da3fbab6ff7..2fde1b195c85 100644
--- a/arch/powerpc/mm/pageattr.c
+++ b/arch/powerpc/mm/pageattr.c
@@ -79,3 +79,36 @@ int change_memory_attr(unsigned long addr, int numpages, 
long action)
return apply_to_existing_page_range(&init_mm, start, sz,
change_page_attr, (void *)action);
 }
+
+/*
+ * Set the attributes of a page:
+ *
+ * This function is used by PPC32 at the end of init to set final kernel memory
+ * protection. It includes changing the mapping of the page it is executing from
+ * and data pages it is using.
+ */
+static int set_page_attr(pte_t *ptep, unsigned long addr, void *data)
+{
+   pgprot_t prot = __pgprot((unsigned long)data);
+
+   spin_lock(&init_mm.page_table_lock);
+
+   set_pte_at(&init_mm, addr, ptep, pte_modify(*ptep, prot));
+   flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+
+   spin_unlock(&init_mm.page_table_lock);
+
+   return 0;
+}
+
+int set_memory_attr(unsigned long addr, int numpages, pgprot_t prot)
+{
+   unsigned long start = ALIGN_DOWN(addr, PAGE_SIZE);
+   unsigned long sz = numpages * PAGE_SIZE;
+
+   if (numpages <= 0)
+   return 0;
+
+   return apply_to_existing_page_range(&init_mm, start, sz, set_page_attr,
+   (void *)pgprot_val(prot));
+}
-- 
2.26.0



[PATCH v8 5/7] powerpc/configs: Enable STRICT_MODULE_RWX in skiroot_defconfig

2020-04-02 Thread Russell Currey
skiroot_defconfig is the only powerpc defconfig with STRICT_KERNEL_RWX
enabled, and if you want memory protection for kernel text you'd want it
for modules too, so enable STRICT_MODULE_RWX there.

Acked-by: Joel Stanley 
Signed-off-by: Russell Currey 
---
 arch/powerpc/configs/skiroot_defconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/configs/skiroot_defconfig 
b/arch/powerpc/configs/skiroot_defconfig
index 1b6bdad36b13..66d20dbe67b7 100644
--- a/arch/powerpc/configs/skiroot_defconfig
+++ b/arch/powerpc/configs/skiroot_defconfig
@@ -51,6 +51,7 @@ CONFIG_CMDLINE="console=tty0 console=hvc0 ipr.fast_reboot=1 
quiet"
 # CONFIG_PPC_MEM_KEYS is not set
 CONFIG_JUMP_LABEL=y
 CONFIG_STRICT_KERNEL_RWX=y
+CONFIG_STRICT_MODULE_RWX=y
 CONFIG_MODULES=y
 CONFIG_MODULE_UNLOAD=y
 CONFIG_MODULE_SIG_FORCE=y
-- 
2.26.0



[PATCH v8 4/7] powerpc: Set ARCH_HAS_STRICT_MODULE_RWX

2020-04-02 Thread Russell Currey
To enable strict module RWX on powerpc, set:

CONFIG_STRICT_MODULE_RWX=y

You should also have CONFIG_STRICT_KERNEL_RWX=y set to have any real
security benefit.

ARCH_HAS_STRICT_MODULE_RWX is set to require ARCH_HAS_STRICT_KERNEL_RWX.
This is due to a quirk in arch/Kconfig and arch/powerpc/Kconfig that
makes STRICT_MODULE_RWX *on by default* in configurations where
STRICT_KERNEL_RWX is *unavailable*.

Since this doesn't make much sense, and module RWX without kernel RWX
doesn't make much sense, having the same dependencies as kernel RWX
works around this problem.

Signed-off-by: Russell Currey 
---
 arch/powerpc/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 399a4de28ff0..1488bb5f4179 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -131,6 +131,7 @@ config PPC
select ARCH_HAS_SCALED_CPUTIME  if VIRT_CPU_ACCOUNTING_NATIVE 
&& PPC_BOOK3S_64
select ARCH_HAS_SET_MEMORY
select ARCH_HAS_STRICT_KERNEL_RWX   if ((PPC_BOOK3S_64 || PPC32) && 
!HIBERNATION)
+   select ARCH_HAS_STRICT_MODULE_RWX   if ARCH_HAS_STRICT_KERNEL_RWX
select ARCH_HAS_TICK_BROADCAST  if GENERIC_CLOCKEVENTS_BROADCAST
select ARCH_HAS_UACCESS_FLUSHCACHE
select ARCH_HAS_UACCESS_MCSAFE  if PPC64
-- 
2.26.0



[PATCH v8 3/7] powerpc/mm/ptdump: debugfs handler for W+X checks at runtime

2020-04-02 Thread Russell Currey
Very rudimentary, just

echo 1 > [debugfs]/check_wx_pages

and check the kernel log.  Useful for testing strict module RWX.

Updated the Kconfig entry to reflect this.

Also fixed a typo.

Reviewed-by: Kees Cook 
Signed-off-by: Russell Currey 
---
 arch/powerpc/Kconfig.debug  |  6 --
 arch/powerpc/mm/ptdump/ptdump.c | 21 -
 2 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/Kconfig.debug b/arch/powerpc/Kconfig.debug
index 0b063830eea8..e37960ef68c6 100644
--- a/arch/powerpc/Kconfig.debug
+++ b/arch/powerpc/Kconfig.debug
@@ -370,7 +370,7 @@ config PPC_PTDUMP
  If you are unsure, say N.
 
 config PPC_DEBUG_WX
-   bool "Warn on W+X mappings at boot"
+   bool "Warn on W+X mappings at boot & enable manual checks at runtime"
depends on PPC_PTDUMP && STRICT_KERNEL_RWX
help
  Generate a warning if any W+X mappings are found at boot.
@@ -384,7 +384,9 @@ config PPC_DEBUG_WX
  of other unfixed kernel bugs easier.
 
  There is no runtime or memory usage effect of this option
- once the kernel has booted up - it's a one time check.
+ once the kernel has booted up, it only automatically checks once.
+
+ Enables the "check_wx_pages" debugfs entry for checking at runtime.
 
  If in doubt, say "Y".
 
diff --git a/arch/powerpc/mm/ptdump/ptdump.c b/arch/powerpc/mm/ptdump/ptdump.c
index d92bb8ea229c..525ca5aeaa01 100644
--- a/arch/powerpc/mm/ptdump/ptdump.c
+++ b/arch/powerpc/mm/ptdump/ptdump.c
@@ -4,7 +4,7 @@
  *
  * This traverses the kernel pagetables and dumps the
  * information about the used sections of memory to
- * /sys/kernel/debug/kernel_pagetables.
+ * /sys/kernel/debug/kernel_page_tables.
  *
  * Derived from the arm64 implementation:
  * Copyright (c) 2014, The Linux Foundation, Laura Abbott.
@@ -413,6 +413,25 @@ void ptdump_check_wx(void)
else
pr_info("Checked W+X mappings: passed, no W+X pages found\n");
 }
+
+static int check_wx_debugfs_set(void *data, u64 val)
+{
+   if (val != 1ULL)
+   return -EINVAL;
+
+   ptdump_check_wx();
+
+   return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(check_wx_fops, NULL, check_wx_debugfs_set, "%llu\n");
+
+static int ptdump_check_wx_init(void)
+{
+   return debugfs_create_file("check_wx_pages", 0200, NULL,
+  NULL, &check_wx_fops) ? 0 : -ENOMEM;
+}
+device_initcall(ptdump_check_wx_init);
 #endif
 
 static int ptdump_init(void)
-- 
2.26.0



[PATCH v8 1/7] powerpc/mm: Implement set_memory() routines

2020-04-02 Thread Russell Currey
The set_memory_{ro/rw/nx/x}() functions are required for STRICT_MODULE_RWX,
and are generally useful primitives to have.  This implementation is
designed to be completely generic across powerpc's many MMUs.

It's possible that this could be optimised to be faster for specific
MMUs, but the focus is on having a generic and safe implementation for
now.

This implementation does not handle cases where the caller is attempting
to change the mapping of the page it is executing from, or if another
CPU is concurrently using the page being altered.  These cases likely
shouldn't happen, but a more complex implementation with MMU-specific code
could safely handle them, so that is left as a TODO for now.

These functions do nothing if STRICT_KERNEL_RWX is not enabled.

Reviewed-by: Daniel Axtens 
Signed-off-by: Russell Currey 
Signed-off-by: Christophe Leroy 
---

 arch/powerpc/Kconfig  |  1 +
 arch/powerpc/include/asm/set_memory.h | 32 +++
 arch/powerpc/mm/Makefile  |  2 +-
 arch/powerpc/mm/pageattr.c| 81 +++
 4 files changed, 115 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/include/asm/set_memory.h
 create mode 100644 arch/powerpc/mm/pageattr.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 6f40af294685..399a4de28ff0 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -129,6 +129,7 @@ config PPC
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_MEMBARRIER_CALLBACKS
select ARCH_HAS_SCALED_CPUTIME  if VIRT_CPU_ACCOUNTING_NATIVE 
&& PPC_BOOK3S_64
+   select ARCH_HAS_SET_MEMORY
select ARCH_HAS_STRICT_KERNEL_RWX   if ((PPC_BOOK3S_64 || PPC32) && 
!HIBERNATION)
select ARCH_HAS_TICK_BROADCAST  if GENERIC_CLOCKEVENTS_BROADCAST
select ARCH_HAS_UACCESS_FLUSHCACHE
diff --git a/arch/powerpc/include/asm/set_memory.h 
b/arch/powerpc/include/asm/set_memory.h
new file mode 100644
index ..64011ea444b4
--- /dev/null
+++ b/arch/powerpc/include/asm/set_memory.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_POWERPC_SET_MEMORY_H
+#define _ASM_POWERPC_SET_MEMORY_H
+
+#define SET_MEMORY_RO  0
+#define SET_MEMORY_RW  1
+#define SET_MEMORY_NX  2
+#define SET_MEMORY_X   3
+
+int change_memory_attr(unsigned long addr, int numpages, long action);
+
+static inline int set_memory_ro(unsigned long addr, int numpages)
+{
+   return change_memory_attr(addr, numpages, SET_MEMORY_RO);
+}
+
+static inline int set_memory_rw(unsigned long addr, int numpages)
+{
+   return change_memory_attr(addr, numpages, SET_MEMORY_RW);
+}
+
+static inline int set_memory_nx(unsigned long addr, int numpages)
+{
+   return change_memory_attr(addr, numpages, SET_MEMORY_NX);
+}
+
+static inline int set_memory_x(unsigned long addr, int numpages)
+{
+   return change_memory_attr(addr, numpages, SET_MEMORY_X);
+}
+
+#endif
diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
index 5e147986400d..a998fdac52f9 100644
--- a/arch/powerpc/mm/Makefile
+++ b/arch/powerpc/mm/Makefile
@@ -5,7 +5,7 @@
 
 ccflags-$(CONFIG_PPC64):= $(NO_MINIMAL_TOC)
 
-obj-y  := fault.o mem.o pgtable.o mmap.o \
+obj-y  := fault.o mem.o pgtable.o mmap.o pageattr.o \
   init_$(BITS).o pgtable_$(BITS).o \
   pgtable-frag.o ioremap.o ioremap_$(BITS).o \
   init-common.o mmu_context.o drmem.o
diff --git a/arch/powerpc/mm/pageattr.c b/arch/powerpc/mm/pageattr.c
new file mode 100644
index ..2da3fbab6ff7
--- /dev/null
+++ b/arch/powerpc/mm/pageattr.c
@@ -0,0 +1,81 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * MMU-generic set_memory implementation for powerpc
+ *
+ * Copyright 2019, IBM Corporation.
+ */
+
+#include <linux/mm.h>
+#include <linux/set_memory.h>
+
+#include <asm/mmu.h>
+#include <asm/page.h>
+#include <asm/pgtable.h>
+
+
+/*
+ * Updates the attributes of a page in three steps:
+ *
+ * 1. invalidate the page table entry
+ * 2. flush the TLB
+ * 3. install the new entry with the updated attributes
+ *
+ * This is unsafe if the caller is attempting to change the mapping of the
+ * page it is executing from, or if another CPU is concurrently using the
+ * page being altered.
+ *
+ * TODO make the implementation resistant to this.
+ *
+ * NOTE: can be dangerous to call without STRICT_KERNEL_RWX
+ */
+static int change_page_attr(pte_t *ptep, unsigned long addr, void *data)
+{
+   long action = (long)data;
+   pte_t pte;
+
+   spin_lock(&init_mm.page_table_lock);
+
+   /* invalidate the PTE so it's safe to modify */
+   pte = ptep_get_and_clear(&init_mm, addr, ptep);
+   flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+
+   /* modify the PTE bits as desired, then apply */
+   switch (action) {
+   case SET_MEMORY_RO:
+   pte = pte_wrprotect(pte);
+   break;
+   case SET_MEMORY_RW:
+   pte = 

[PATCH v8 2/7] powerpc/kprobes: Mark newly allocated probes as RO

2020-04-02 Thread Russell Currey
With CONFIG_STRICT_KERNEL_RWX=y and CONFIG_KPROBES=y, there will be one
W+X page at boot by default.  This can be tested with
CONFIG_PPC_PTDUMP=y and CONFIG_PPC_DEBUG_WX=y set, and checking the
kernel log during boot.

powerpc doesn't implement its own alloc() for kprobes like other
architectures do, but we couldn't immediately mark RO anyway since we do
a memcpy to the page we allocate later.  After that, nothing should be
allowed to modify the page, and write permissions are removed well
before the kprobe is armed.

The memcpy() would fail if >1 probes were allocated, so use
patch_instruction() instead which is safe for RO.

Reviewed-by: Daniel Axtens 
Signed-off-by: Russell Currey 
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/kprobes.c | 17 +
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index 81efb605113e..fa4502b4de35 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -24,6 +24,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 DEFINE_PER_CPU(struct kprobe *, current_kprobe) = NULL;
 DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
@@ -102,6 +104,16 @@ kprobe_opcode_t *kprobe_lookup_name(const char *name, 
unsigned int offset)
return addr;
 }
 
+void *alloc_insn_page(void)
+{
+   void *page = vmalloc_exec(PAGE_SIZE);
+
+   if (page)
+   set_memory_ro((unsigned long)page, 1);
+
+   return page;
+}
+
 int arch_prepare_kprobe(struct kprobe *p)
 {
int ret = 0;
@@ -124,11 +136,8 @@ int arch_prepare_kprobe(struct kprobe *p)
}
 
if (!ret) {
-   memcpy(p->ainsn.insn, p->addr,
-   MAX_INSN_SIZE * sizeof(kprobe_opcode_t));
+   patch_instruction(p->ainsn.insn, *p->addr);
p->opcode = *p->addr;
-   flush_icache_range((unsigned long)p->ainsn.insn,
-   (unsigned long)p->ainsn.insn + sizeof(kprobe_opcode_t));
}
 
p->ainsn.boostable = 0;
-- 
2.26.0



Re: [PATCH v10 03/14] powerpc/vas: Alloc and setup IRQ and trigger port address

2020-04-02 Thread Cédric Le Goater
On 4/2/20 9:10 AM, Haren Myneni wrote:
> 
> Allocate a xive irq on each chip with a vas instance. The NX coprocessor
> raises a host CPU interrupt via vas if it encounters page fault on user
> space request buffer. Subsequent patches register the trigger port with
> the NX coprocessor, and create a vas fault handler for this interrupt
> mapping.

Looks good !

> Signed-off-by: Haren Myneni 


Reviewed-by: Cédric Le Goater 

Thanks,

C.

> ---
>  arch/powerpc/platforms/powernv/vas.c | 44 
> +++-
>  arch/powerpc/platforms/powernv/vas.h |  2 ++
>  2 files changed, 40 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/vas.c 
> b/arch/powerpc/platforms/powernv/vas.c
> index ed9cc6d..3303cfe 100644
> --- a/arch/powerpc/platforms/powernv/vas.c
> +++ b/arch/powerpc/platforms/powernv/vas.c
> @@ -15,6 +15,7 @@
>  #include 
>  #include 
>  #include 
> +#include <asm/xive.h>
>  
>  #include "vas.h"
>  
> @@ -25,10 +26,12 @@
>  
>  static int init_vas_instance(struct platform_device *pdev)
>  {
> - int rc, cpu, vasid;
> - struct resource *res;
> - struct vas_instance *vinst;
>   struct device_node *dn = pdev->dev.of_node;
> + struct vas_instance *vinst;
> + struct xive_irq_data *xd;
> + uint32_t chipid, hwirq;
> + struct resource *res;
> + int rc, cpu, vasid;
>  
>   rc = of_property_read_u32(dn, "ibm,vas-id", &vasid);
>   if (rc) {
> @@ -36,6 +39,12 @@ static int init_vas_instance(struct platform_device *pdev)
>   return -ENODEV;
>   }
>  
> + rc = of_property_read_u32(dn, "ibm,chip-id", &chipid);
> + if (rc) {
> + pr_err("No ibm,chip-id property for %s?\n", pdev->name);
> + return -ENODEV;
> + }
> +
>   if (pdev->num_resources != 4) {
>   pr_err("Unexpected DT configuration for [%s, %d]\n",
>   pdev->name, vasid);
> @@ -69,9 +78,32 @@ static int init_vas_instance(struct platform_device *pdev)
>  
>   vinst->paste_win_id_shift = 63 - res->end;
>  
> - pr_devel("Initialized instance [%s, %d], paste_base 0x%llx, "
> - "paste_win_id_shift 0x%llx\n", pdev->name, vasid,
> - vinst->paste_base_addr, vinst->paste_win_id_shift);
> + hwirq = xive_native_alloc_irq_on_chip(chipid);
> + if (!hwirq) {
> + pr_err("Inst%d: Unable to allocate global irq for chip %d\n",
> + vinst->vas_id, chipid);
> + return -ENOENT;
> + }
> +
> + vinst->virq = irq_create_mapping(NULL, hwirq);
> + if (!vinst->virq) {
> + pr_err("Inst%d: Unable to map global irq %d\n",
> + vinst->vas_id, hwirq);
> + return -EINVAL;
> + }
> +
> + xd = irq_get_handler_data(vinst->virq);
> + if (!xd) {
> + pr_err("Inst%d: Invalid virq %d\n",
> + vinst->vas_id, vinst->virq);
> + return -EINVAL;
> + }
> +
> + vinst->irq_port = xd->trig_page;
> + pr_devel("Initialized instance [%s, %d] paste_base 0x%llx paste_win_id_shift 0x%llx IRQ %d Port 0x%llx\n",
> + pdev->name, vasid, vinst->paste_base_addr,
> + vinst->paste_win_id_shift, vinst->virq,
> + vinst->irq_port);
>  
>   for_each_possible_cpu(cpu) {
>   if (cpu_to_chip_id(cpu) == of_get_ibm_chip_id(dn))
> diff --git a/arch/powerpc/platforms/powernv/vas.h b/arch/powerpc/platforms/powernv/vas.h
> index 5574aec..598608b 100644
> --- a/arch/powerpc/platforms/powernv/vas.h
> +++ b/arch/powerpc/platforms/powernv/vas.h
> @@ -313,6 +313,8 @@ struct vas_instance {
>   u64 paste_base_addr;
>   u64 paste_win_id_shift;
>  
> + u64 irq_port;
> + int virq;
>   struct mutex mutex;
>   struct vas_window *rxwin[VAS_COP_TYPE_MAX];
>   struct vas_window *windows[VAS_WINDOWS_PER_CHIP];
> 
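
As an editorial aside, the irq plumbing above is a three-step xive
pattern; a condensed sketch using the same calls as the patch (the
helper name and error codes are illustrative, not from the series):

#include <linux/errno.h>
#include <linux/irq.h>
#include <linux/irqdomain.h>
#include <asm/xive.h>

static int demo_setup_fault_irq(u32 chipid, int *virq, u64 *port)
{
	struct xive_irq_data *xd;
	u32 hwirq;

	/* 1. Allocate a hardware irq on the chip hosting the instance. */
	hwirq = xive_native_alloc_irq_on_chip(chipid);
	if (!hwirq)
		return -ENOENT;

	/* 2. Map it into the generic irq space to obtain a Linux virq. */
	*virq = irq_create_mapping(NULL, hwirq);
	if (!*virq)
		return -EINVAL;

	/* 3. The xive trigger page is the MMIO address the NX engine
	 *    writes to raise this interrupt: the "trigger port". */
	xd = irq_get_handler_data(*virq);
	if (!xd)
		return -EINVAL;
	*port = xd->trig_page;
	return 0;
}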



Re: [PATCH RFC] mm: remove CONFIG_HAVE_MEMBLOCK_NODE_MAP (was: Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA)

2020-04-02 Thread Michal Hocko
On Wed 01-04-20 10:51:55, Mike Rapoport wrote:
> Hi,
> 
> On Wed, Apr 01, 2020 at 01:42:27PM +0800, Baoquan He wrote:
[...]
> > From above information, we can remove HAVE_MEMBLOCK_NODE_MAP, and
> > replace it with CONFIG_NUMA. That sounds more sensible to store nid into
> > memblock when NUMA support is enabled.
>  
> Replacing CONFIG_HAVE_MEMBLOCK_NODE_MAP with CONFIG_NUMA will work, but
> this will not help cleaning up the whole node/zone initialization mess and
> we'll be stuck with two implementations.

Yeah, this is far from optimal.

> The overhead of enabling HAVE_MEMBLOCK_NODE_MAP is only for init time as
> most architectures will anyway discard the entire memblock, so having it in
> a UMA arch won't be a problem. The only exception is arm that uses
> memblock for pfn_valid(); here we may also think about a solution to
> compensate for the addition of nid to the memblock structures.

Well, we can make memblock_region->nid defined only for CONFIG_NUMA.
memblock_get_region_node would then unconditionally return 0 on UMA.
Essentially the same way we do NUMA for other MM code. I only see a few
direct uses of region->nid.
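
An editorial sketch of what that would look like, assuming the memblock
definitions of this era (not from the original mail):

/* include/linux/memblock.h (sketch) */
struct memblock_region {
	phys_addr_t base;
	phys_addr_t size;
	enum memblock_flags flags;
#ifdef CONFIG_NUMA
	int nid;		/* only carried when NUMA is enabled */
#endif
};

static inline int memblock_get_region_node(const struct memblock_region *r)
{
#ifdef CONFIG_NUMA
	return r->nid;
#else
	return 0;		/* UMA: everything is node 0 */
#endif
}
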
-- 
Michal Hocko
SUSE Labs


Re: [PATCH RESEND 2/4] uaccess: Selectively open read or write user access

2020-04-02 Thread Christophe Leroy




On 02/04/2020 at 09:51, Kees Cook wrote:

On Thu, Apr 02, 2020 at 07:34:17AM +, Christophe Leroy wrote:

[...]
diff --git a/kernel/compat.c b/kernel/compat.c
index 843dd17e6078..705ca7e418c6 100644
--- a/kernel/compat.c
+++ b/kernel/compat.c
@@ -199,7 +199,7 @@ long compat_get_bitmap(unsigned long *mask, const compat_ulong_t __user *umask,
bitmap_size = ALIGN(bitmap_size, BITS_PER_COMPAT_LONG);
nr_compat_longs = BITS_TO_COMPAT_LONGS(bitmap_size);
  
-	if (!user_access_begin(umask, bitmap_size / 8))
+	if (!user_write_access_begin(umask, bitmap_size / 8))

This looks mismatched: should be user_read_access_begin()?


Oops, looks like a copy/paste/modify where I modified the original
instead of the copy.

return -EFAULT;
  
  	while (nr_compat_longs > 1) {

@@ -211,11 +211,11 @@ long compat_get_bitmap(unsigned long *mask, const compat_ulong_t __user *umask,
}
if (nr_compat_longs)
unsafe_get_user(*mask, umask++, Efault);
-   user_access_end();
+   user_read_access_end();
return 0;
  
  Efault:

-   user_access_end();
+   user_read_access_end();
return -EFAULT;
  }


(These correctly end read access.)

  
@@ -228,7 +228,7 @@ long compat_put_bitmap(compat_ulong_t __user *umask, unsigned long *mask,

bitmap_size = ALIGN(bitmap_size, BITS_PER_COMPAT_LONG);
nr_compat_longs = BITS_TO_COMPAT_LONGS(bitmap_size);
  
-	if (!user_access_begin(umask, bitmap_size / 8))
+	if (!user_read_access_begin(umask, bitmap_size / 8))

And ..._write_... here?


return -EFAULT;
  
  	while (nr_compat_longs > 1) {

@@ -239,10 +239,10 @@ long compat_put_bitmap(compat_ulong_t __user *umask, unsigned long *mask,
}
if (nr_compat_longs)
unsafe_put_user((compat_ulong_t)*mask, umask++, Efault);
-   user_access_end();
+   user_write_access_end();
return 0;
  Efault:
-   user_access_end();
+   user_write_access_end();
return -EFAULT;
  }


(These correctly end write access.)


All the others look correct. With the above fixed:

Reviewed-by: Kees Cook 



Thanks
Christophe
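
For the record, the fixed compat_get_bitmap() would then pair the read
variants on both ends; an editorial sketch, with the loop body as in
kernel/compat.c and the helpers taken from patch 1/4:

#include <linux/compat.h>
#include <linux/kernel.h>
#include <linux/uaccess.h>

long compat_get_bitmap(unsigned long *mask, const compat_ulong_t __user *umask,
		       unsigned long bitmap_size)
{
	unsigned long nr_compat_longs;

	/* align bitmap up to nearest compat_long_t boundary */
	bitmap_size = ALIGN(bitmap_size, BITS_PER_COMPAT_LONG);
	nr_compat_longs = BITS_TO_COMPAT_LONGS(bitmap_size);

	if (!user_read_access_begin(umask, bitmap_size / 8))
		return -EFAULT;

	while (nr_compat_longs > 1) {
		compat_ulong_t l1, l2;
		unsafe_get_user(l1, umask++, Efault);
		unsafe_get_user(l2, umask++, Efault);
		*mask++ = ((unsigned long)l2 << BITS_PER_COMPAT_LONG) | l1;
		nr_compat_longs -= 2;
	}
	if (nr_compat_longs)
		unsafe_get_user(*mask, umask++, Efault);
	user_read_access_end();
	return 0;

Efault:
	user_read_access_end();
	return -EFAULT;
}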


Re: [PATCH RESEND 3/4] drm/i915/gem: Replace user_access_begin by user_write_access_begin

2020-04-02 Thread Christophe Leroy




On 02/04/2020 at 09:52, Kees Cook wrote:

On Thu, Apr 02, 2020 at 07:34:18AM +, Christophe Leroy wrote:

When i915_gem_execbuffer2_ioctl() is using user_access_begin(),
that's only to perform unsafe_put_user() so use
user_write_access_begin() in order to only open write access.

Signed-off-by: Christophe Leroy 


Why is this split from the other conversions?



I split it from the others because this one is in drivers/ while the
other ones are in the core part of the kernel.

Is it better to squash it into the previous patch?

Christophe



Reviewed-by: Kees Cook 



---
  drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c | 11 ++-
  1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 7643a30ba4cd..4be8205a70b6 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -1611,14 +1611,14 @@ static int eb_copy_relocations(const struct i915_execbuffer *eb)
 * happened we would make the mistake of assuming that the
 * relocations were valid.
 */
-   if (!user_access_begin(urelocs, size))
+   if (!user_write_access_begin(urelocs, size))
goto end;
  
  		for (copied = 0; copied < nreloc; copied++)

unsafe_put_user(-1,
&urelocs[copied].presumed_offset,
end_user);
-   user_access_end();
+   user_write_access_end();
  
  		eb->exec[i].relocs_ptr = (uintptr_t)relocs;

}
@@ -1626,7 +1626,7 @@ static int eb_copy_relocations(const struct i915_execbuffer *eb)
return 0;
  
  end_user:

-   user_access_end();
+   user_write_access_end();
  end:
kvfree(relocs);
err = -EFAULT;
@@ -2991,7 +2991,8 @@ i915_gem_execbuffer2_ioctl(struct drm_device *dev, void *data,
 * And this range already got effectively checked earlier
 * when we did the "copy_from_user()" above.
 */
-   if (!user_access_begin(user_exec_list, count * sizeof(*user_exec_list)))
+   if (!user_write_access_begin(user_exec_list,
+				     count * sizeof(*user_exec_list)))
goto end;
  
  		for (i = 0; i < args->buffer_count; i++) {

@@ -3005,7 +3006,7 @@ i915_gem_execbuffer2_ioctl(struct drm_device *dev, void *data,
end_user);
}
  end_user:
-   user_access_end();
+   user_write_access_end();
  end:;
}
  
-- 
2.25.0





Re: [PATCH RESEND 1/4] uaccess: Add user_read_access_begin/end and user_write_access_begin/end

2020-04-02 Thread Kees Cook
On Thu, Apr 02, 2020 at 07:34:16AM +, Christophe Leroy wrote:
> Some architectures like powerpc64 have the capability to separate
> read access and write access protection.
> For get_user() and copy_from_user(), powerpc64 only opens read access.
> For put_user() and copy_to_user(), powerpc64 only opens write access.
> But when using unsafe_get_user() or unsafe_put_user(),
> user_access_begin opens both read and write.
> 
> Other architectures like powerpc book3s 32 bits only allow write
> access protection. And on this architecture protection is a heavy
> operation as it requires locking/unlocking per segment of 256 Mbytes.
> On those architectures it is therefore desirable to do the unlocking
> only for write access. (Note that book3s/32 ranges from very old
> powermacs from the 90's with the powerpc 601 processor to modern
> ADSL boxes with recent PowerQuicc II processors, for instance, so it
> is still worth considering.)
> 
> In order to avoid any risk that hacked variable parameters passed to
> user_access_begin/end leave user access open, or open too much of it,
> it is preferable to use dedicated static functions that can't be
> overridden.
> 
> Add a user_read_access_begin and a user_read_access_end to only open
> read access.
> 
> Add a user_write_access_begin and a user_write_access_end to only open
> write access.
> 
> By default, when undefined, those new access helpers fall back on the
> existing user_access_begin and user_access_end.
> 
> Signed-off-by: Christophe Leroy 

Reviewed-by: Kees Cook 

-Kees

> Link: https://patchwork.ozlabs.org/patch/1227926/
> ---
> Resending this series as I mistakenly only sent it to the powerpc list
> at the beginning of February (https://patchwork.ozlabs.org/patch/1233172/)
> 
> This series is based on the discussion we had in January, see
> https://patchwork.ozlabs.org/patch/1227926/ . I tried to
> take into account all remarks, especially @hpa 's remark to use
> a fixed API and not base the relocking on a magic id returned at
> unlocking.
> 
> This series is awaited for implementing the selective lkdtm tests that
> exercise powerpc64's independent read and write protection, see
> https://patchwork.ozlabs.org/patch/1231765/
> 
>  include/linux/uaccess.h | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> index 67f016010aad..9861c89f93be 100644
> --- a/include/linux/uaccess.h
> +++ b/include/linux/uaccess.h
> @@ -378,6 +378,14 @@ extern long strnlen_unsafe_user(const void __user *unsafe_addr, long count);
>  static inline unsigned long user_access_save(void) { return 0UL; }
>  static inline void user_access_restore(unsigned long flags) { }
>  #endif
> +#ifndef user_write_access_begin
> +#define user_write_access_begin user_access_begin
> +#define user_write_access_end user_access_end
> +#endif
> +#ifndef user_read_access_begin
> +#define user_read_access_begin user_access_begin
> +#define user_read_access_end user_access_end
> +#endif
>  
>  #ifdef CONFIG_HARDENED_USERCOPY
>  void usercopy_warn(const char *name, const char *detail, bool to_user,
> -- 
> 2.25.0
> 

-- 
Kees Cook
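
As an editorial illustration of the new API: a caller that only stores
to user space opens just write access, and on architectures without
separate read/write protection the fallback defines above make this
identical to plain user_access_begin/end (hypothetical helper, not from
the series):

#include <linux/uaccess.h>

/* Copy one value out to user space with only write access open. */
static long demo_put_value(int __user *uptr, int val)
{
	if (!user_write_access_begin(uptr, sizeof(*uptr)))
		return -EFAULT;
	unsafe_put_user(val, uptr, efault);
	user_write_access_end();
	return 0;

efault:
	user_write_access_end();
	return -EFAULT;
}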


  1   2   >