Re: [PATCH 1/1] powerpc/kernel: Enables memory hot-remove after reboot on pseries guests

2020-03-03 Thread Leonardo Bras
Hello Bharata, thanks for this feedback!

On Wed, 2020-03-04 at 10:13 +0530, Bharata B Rao wrote:
> Hi,
> 
> I tried this a few years back
> (https://patchwork.ozlabs.org/patch/800142/) and didn't pursue it
> further because at that time, it was felt that the approach might not
> work for PowerVM guests, because all the present memory except RMA
> gets marked as hot-pluggable by PowerVM. This discussion is not
> present in the above thread, but during my private discussions with
> Reza and Nathan, it was noted that making all that memory MOVABLE
> is not preferable for PowerVM guests as we might run out of memory for
> kernel allocations.

Humm, this makes sense.
But with my change, these pieces of memory only get into ZONE_MOVABLE
if the boot parameter 'movable_node' is passed to the guest kernel.

So, even if we are unable to sort out a flag combination that works
fine for both use cases, if PowerVM doesn't pass 'movable_node' as a
boot parameter to the kernel, it will behave just as it does today.
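
FWIW, the gating I'm relying on is in the zone-sizing code. A simplified
sketch of the behavior (not the exact mm code, just its shape):

    struct memblock_region *r;

    /* Only ranges explicitly marked MEMBLOCK_HOTPLUG are affected, and
     * only when 'movable_node' was passed on the command line. */
    for_each_memblock(memory, r) {
            if (!movable_node_is_enabled() || !memblock_is_hotpluggable(r))
                    continue;
            /* this range gets steered into ZONE_MOVABLE */
    }

Without the parameter, the hotpluggable flag is simply ignored.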

What are your thoughts on that?

Best regards,

Leonardo Bras




Re: [PATCH v2] ima: add a new CONFIG for loading arch-specific policies

2020-03-03 Thread Ard Biesheuvel
On Wed, 4 Mar 2020 at 03:34, Nayna Jain  wrote:
>
> Every time a new architecture defines the IMA architecture specific
> functions - arch_ima_get_secureboot() and arch_ima_get_policy(), the IMA
> include file needs to be updated. To avoid this "noise", this patch
> defines a new IMA Kconfig IMA_SECURE_AND_OR_TRUSTED_BOOT option, allowing
> the different architectures to select it.
>
> Suggested-by: Linus Torvalds 
> Signed-off-by: Nayna Jain 
> Cc: Ard Biesheuvel 
> Cc: Philipp Rudo 
> Cc: Michael Ellerman 

Acked-by: Ard Biesheuvel 

for the x86 bits, but I'm not an x86 maintainer. Also, you may need to
split this if you want to permit arch maintainers to pick up their
parts individually.


> ---
> v2:
> * Fixed the issue identified by Mimi. Thanks Mimi, Ard, Heiko and Michael for
> discussing the fix.
>
>  arch/powerpc/Kconfig   | 1 +
>  arch/s390/Kconfig  | 1 +
>  arch/x86/Kconfig   | 1 +
>  include/linux/ima.h| 3 +--
>  security/integrity/ima/Kconfig | 9 +
>  5 files changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 497b7d0b2d7e..a5cfde432983 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -979,6 +979,7 @@ config PPC_SECURE_BOOT
> bool
> depends on PPC_POWERNV
> depends on IMA_ARCH_POLICY
> +   select IMA_SECURE_AND_OR_TRUSTED_BOOT
> help
>   Systems with firmware secure boot enabled need to define security
>   policies to extend secure boot to the OS. This config allows a user
> diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
> index 8abe77536d9d..4a502fbcb800 100644
> --- a/arch/s390/Kconfig
> +++ b/arch/s390/Kconfig
> @@ -195,6 +195,7 @@ config S390
> select ARCH_HAS_FORCE_DMA_UNENCRYPTED
> select SWIOTLB
> select GENERIC_ALLOCATOR
> +   select IMA_SECURE_AND_OR_TRUSTED_BOOT if IMA_ARCH_POLICY
>
>
>  config SCHED_OMIT_FRAME_POINTER
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index beea77046f9b..7f5bfaf0cbd2 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -230,6 +230,7 @@ config X86
> select VIRT_TO_BUS
> select X86_FEATURE_NAMES    if PROC_FS
> select PROC_PID_ARCH_STATUS if PROC_FS
> +   select IMA_SECURE_AND_OR_TRUSTED_BOOT   if EFI && IMA_ARCH_POLICY
>
>  config INSTRUCTION_DECODER
> def_bool y
> diff --git a/include/linux/ima.h b/include/linux/ima.h
> index 1659217e9b60..aefe758f4466 100644
> --- a/include/linux/ima.h
> +++ b/include/linux/ima.h
> @@ -30,8 +30,7 @@ extern void ima_kexec_cmdline(const void *buf, int size);
>  extern void ima_add_kexec_buffer(struct kimage *image);
>  #endif
>
> -#if (defined(CONFIG_X86) && defined(CONFIG_EFI)) || defined(CONFIG_S390) \
> -   || defined(CONFIG_PPC_SECURE_BOOT)
> +#ifdef CONFIG_IMA_SECURE_AND_OR_TRUSTED_BOOT
>  extern bool arch_ima_get_secureboot(void);
>  extern const char * const *arch_get_ima_policy(void);
>  #else
> diff --git a/security/integrity/ima/Kconfig b/security/integrity/ima/Kconfig
> index 3f3ee4e2eb0d..d17972aa413a 100644
> --- a/security/integrity/ima/Kconfig
> +++ b/security/integrity/ima/Kconfig
> @@ -327,3 +327,12 @@ config IMA_QUEUE_EARLY_BOOT_KEYS
> depends on IMA_MEASURE_ASYMMETRIC_KEYS
> depends on SYSTEM_TRUSTED_KEYRING
> default y
> +
> +config IMA_SECURE_AND_OR_TRUSTED_BOOT
> +   bool
> +   depends on IMA
> +   depends on IMA_ARCH_POLICY

Doesn't the latter already depend on the former?
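
If so, the entry could be reduced to something like (sketch):

    config IMA_SECURE_AND_OR_TRUSTED_BOOT
            bool
            depends on IMA_ARCH_POLICY
            help
              This option is selected by architectures to enable secure
              and/or trusted boot based on IMA runtime policies.

(IMA_ARCH_POLICY already depends on IMA, and 'default n' is the default
for a bool anyway.)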

> +   default n
> +   help
> +  This option is selected by architectures to enable secure and/or
> +  trusted boot based on IMA runtime policies.
> --
> 2.13.6
>


Re: [PATCH v3 18/27] powerpc/powernv/pmem: Add controller dump IOCTLs

2020-03-03 Thread Andrew Donnellan

On 21/2/20 2:27 pm, Alastair D'Silva wrote:

+static int ioctl_controller_dump_data(struct ocxlpmem *ocxlpmem,
+   struct ioctl_ocxl_pmem_controller_dump_data __user *uarg)
+{
+   struct ioctl_ocxl_pmem_controller_dump_data args;
+   u16 i;
+   u64 val;
+   int rc;
+
+   if (copy_from_user(&args, uarg, sizeof(args)))
+   return -EFAULT;
+
+   if (args.buf_size % 8)
+   return -EINVAL;
+
+   if (args.buf_size > ocxlpmem->admin_command.data_size)
+   return -EINVAL;
+
+   mutex_lock(&ocxlpmem->admin_command.lock);
+
+   rc = admin_command_request(ocxlpmem, ADMIN_COMMAND_CONTROLLER_DUMP);
+   if (rc)
+   goto out;
+
+   val = ((u64)args.offset) << 32;
+   val |= args.buf_size;
+   rc = ocxl_global_mmio_write64(ocxlpmem->ocxl_afu,
+ ocxlpmem->admin_command.request_offset + 0x08,
+ OCXL_LITTLE_ENDIAN, val);
+   if (rc)
+   goto out;
+
+   rc = admin_command_execute(ocxlpmem);
+   if (rc)
+   goto out;
+
+   rc = admin_command_complete_timeout(ocxlpmem,
+   ADMIN_COMMAND_CONTROLLER_DUMP);
+   if (rc < 0) {
+   dev_warn(&ocxlpmem->dev, "Controller dump timed out\n");
+   goto out;
+   }
+
+   rc = admin_response(ocxlpmem);
+   if (rc < 0)
+   goto out;
+   if (rc != STATUS_SUCCESS) {
+   warn_status(ocxlpmem,
+   "Unexpected status from retrieve error log",


Controller dump


+   rc);
+   goto out;
+   }
+
+   for (i = 0; i < args.buf_size; i += 8) {
+   u64 val;
+
+   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu,
+ocxlpmem->admin_command.data_offset + i,
+OCXL_HOST_ENDIAN, &val);


Is a controller dump something where we want to do endian swapping?

Any reason we're not doing the usual check of the data identifier, 
additional data length etc?
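
Something along the lines of the error log path would do, e.g. (sketch;
the 'CD' identifier value is a guess on my part, please check the spec):

    u16 data_identifier = val >> 48;
    u32 data_length = val & 0xFFFFFFFF;

    if (data_identifier != 0x4344) { /* 'CD', hypothetical */
            rc = -EIO;
            goto out;
    }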



+   if (rc)
+   goto out;
+
+   if (copy_to_user(&args.buf[i], &val, sizeof(u64))) {
+   rc = -EFAULT;
+   goto out;
+   }
+   }
+
+   if (copy_to_user(uarg, &args, sizeof(args))) {
+   rc = -EFAULT;
+   goto out;
+   }
+
+   rc = admin_response_handled(ocxlpmem);
+   if (rc)
+   goto out;
+
+out:
+   mutex_unlock(&ocxlpmem->admin_command.lock);
+   return rc;
+}
+
+int request_controller_dump(struct ocxlpmem *ocxlpmem)
+{
+   int rc;
+   u64 busy = 1;
+
+   rc = ocxl_global_mmio_set64(ocxlpmem->ocxl_afu, GLOBAL_MMIO_CHIC,
+   OCXL_LITTLE_ENDIAN,
+   GLOBAL_MMIO_CHI_CDA);


This return code is ignored
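
i.e. this wants something like:

    rc = ocxl_global_mmio_set64(ocxlpmem->ocxl_afu, GLOBAL_MMIO_CHIC,
                                OCXL_LITTLE_ENDIAN, GLOBAL_MMIO_CHI_CDA);
    if (rc)
            return rc;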


+
+
+   rc = ocxl_global_mmio_set64(ocxlpmem->ocxl_afu, GLOBAL_MMIO_HCI,
+   OCXL_LITTLE_ENDIAN,
+   GLOBAL_MMIO_HCI_CONTROLLER_DUMP);
+   if (rc)
+   return rc;
+
+   while (busy) {
+   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu,
+GLOBAL_MMIO_HCI,
+OCXL_LITTLE_ENDIAN, &busy);
+   if (rc)
+   return rc;
+
+   busy &= GLOBAL_MMIO_HCI_CONTROLLER_DUMP;
+   cond_resched();
+   }
+
+   return 0;
+}



--
Andrew Donnellan  OzLabs, ADL Canberra
a...@linux.ibm.com IBM Australia Limited



Re: [PATCH V14] mm/debug: Add tests validating architecture page table helpers

2020-03-03 Thread Christophe Leroy




On 04/03/2020 at 02:39, Qian Cai wrote:



Below is a slightly modified version of your change above, which should
still prevent the bug on powerpc. Will it be possible for you to re-test
it? Once confirmed, I will send a patch enabling this test on powerpc64,
keeping your authorship. Thank you.


This works fine on radix MMU, but I decided to go a bit further and test
hash MMU. The kernel gets stuck here (below). I did confirm that
pte_alloc_map_lock() was successful, so I don't understand hash MMU well
enough to tell why it could still take an interrupt at pte_clear_tests()
even before we call pte_unmap_unlock().


AFAIU, you are not taking an interrupt here. You are stuck in
pte_update(), most likely due to nested locks. Try with LOCKDEP?
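
For reference, the usual knobs (from lib/Kconfig.debug):

    CONFIG_PROVE_LOCKING=y
    CONFIG_DEBUG_ATOMIC_SLEEP=y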


Christophe



[   33.881515][T1] ok 8 - property-entry
[   33.883653][T1] debug_vm_pgtable: debug_vm_pgtable: Validating
architecture page table helpers
[   60.418885][C8] watchdog: BUG: soft lockup - CPU#8 stuck for 23s!
[swapper/0:1]




Re: [PATCH v3 17/27] powerpc/powernv/pmem: Implement the Read Error Log command

2020-03-03 Thread Andrew Donnellan

On 21/2/20 2:27 pm, Alastair D'Silva wrote:

From: Alastair D'Silva 

The read error log command extracts information from the controller's
internal error log.

This patch exposes this information in 2 ways:
- During probe, if an error occurs & a log is available, print it to the
   console
- After probe, make the error log available to userspace via an IOCTL.
   Userspace is notified of pending error logs in a later patch
   ("powerpc/powernv/pmem: Forward events to userspace")

Signed-off-by: Alastair D'Silva 


A few minor style checks at 
https://openpower.xyz/job/snowpatch/job/snowpatch-linux-checkpatch/11787//artifact/linux/checkpatch.log


We should also add some documentation for the user interfaces we're 
adding (same applies for all the remaining patches in this series that 
add more interfaces).



---
  arch/powerpc/platforms/powernv/pmem/ocxl.c| 269 ++
  .../platforms/powernv/pmem/ocxl_internal.h|   1 +
  include/uapi/nvdimm/ocxl-pmem.h   |  46 +++
  3 files changed, 316 insertions(+)
  create mode 100644 include/uapi/nvdimm/ocxl-pmem.h

diff --git a/arch/powerpc/platforms/powernv/pmem/ocxl.c b/arch/powerpc/platforms/powernv/pmem/ocxl.c
index 63109a870d2c..2b64504f9129 100644
--- a/arch/powerpc/platforms/powernv/pmem/ocxl.c
+++ b/arch/powerpc/platforms/powernv/pmem/ocxl.c
@@ -447,10 +447,219 @@ static int file_release(struct inode *inode, struct file *file)
return 0;
  }
  
+/**
+ * error_log_header_parse() - Parse the first 64 bits of the error log command response
+ * @ocxlpmem: the device metadata
+ * @length: out, returns the number of bytes in the response (excluding the 64 bit header)
+ */
+static int error_log_header_parse(struct ocxlpmem *ocxlpmem, u16 *length)
+{
+   int rc;
+   u64 val;
+
+   u16 data_identifier;
+   u32 data_length;
+
+   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu,
+ocxlpmem->admin_command.data_offset,
+OCXL_LITTLE_ENDIAN, &val);
+   if (rc)
+   return rc;
+
+   data_identifier = val >> 48;
+   data_length = val & 0xFFFFFFFF;
+
+   if (data_identifier != 0x454C) { // 'EL'
+   dev_err(&ocxlpmem->dev,
+   "Bad data identifier for error log data, expected 'EL', got 
'%2s' (%#x), data_length=%u\n",
+   (char *)&data_identifier,
+   (unsigned int)data_identifier, data_length);
+   return -EINVAL;


This should be something other than EINVAL I think


+   }
+
+   *length = data_length;
+   return 0;
+}
+
+static int error_log_offset_0x08(struct ocxlpmem *ocxlpmem,
+u32 *log_identifier, u32 *program_ref_code)
+{
+   int rc;
+   u64 val;
+
+   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu,
+ocxlpmem->admin_command.data_offset + 0x08,
+OCXL_LITTLE_ENDIAN, &val);
+   if (rc)
+   return rc;
+
+   *log_identifier = val >> 32;
+   *program_ref_code = val & 0xFFFFFFFF;
+
+   return 0;
+}
+
+static int read_error_log(struct ocxlpmem *ocxlpmem,
+ struct ioctl_ocxl_pmem_error_log *log, bool buf_is_user)
+{
+   u64 val;
+   u16 user_buf_length;
+   u16 buf_length;
+   u16 i;
+   int rc;
+
+   if (log->buf_size % 8)
+   return -EINVAL;
+
+   rc = ocxlpmem_chi(ocxlpmem, &val);
+   if (rc)
+   goto out;
+
+   if (!(val & GLOBAL_MMIO_CHI_ELA))
+   return -EAGAIN;
+
+   user_buf_length = log->buf_size;
+
+   mutex_lock(&ocxlpmem->admin_command.lock);
+
+   rc = admin_command_request(ocxlpmem, ADMIN_COMMAND_ERRLOG);
+   if (rc)
+   goto out;
+
+   rc = admin_command_execute(ocxlpmem);
+   if (rc)
+   goto out;
+
+   rc = admin_command_complete_timeout(ocxlpmem, ADMIN_COMMAND_ERRLOG);
+   if (rc < 0) {
+   dev_warn(&ocxlpmem->dev, "Read error log timed out\n");
+   goto out;
+   }
+
+   rc = admin_response(ocxlpmem);
+   if (rc < 0)
+   goto out;
+   if (rc != STATUS_SUCCESS) {
+   warn_status(ocxlpmem, "Unexpected status from retrieve error log", rc);
+   goto out;
+   }
+
+
+   rc = error_log_header_parse(ocxlpmem, &log->buf_size);
+   if (rc)
+   goto out;
+   // log->buf_size now contains the returned buffer size, not the user size


In the event that the log is truncated to fit the user buffer, we return 
the full log size, I assume this is intentional to signal it's truncated 
as per the nd stuff?
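
(If so, userspace can detect truncation roughly like this; the ioctl name
and the 'buf' member are my assumptions about this series' uapi header,
includes and error handling omitted:

    struct ioctl_ocxl_pmem_error_log log = {
            .buf_size = sizeof(buf), /* multiple of 8 */
            .buf = buf,
    };

    if (ioctl(fd, IOCTL_OCXL_PMEM_ERROR_LOG, &log) == 0 &&
        log.buf_size > sizeof(buf))
            fprintf(stderr, "log truncated, full size %u\n", log.buf_size);
)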



+
+   rc = error_log_offset_0x08(ocxlpmem, &log->log_identifier,
+  &log->program_reference_code);
+   if (rc)
+   goto out;
+
+   rc = ocxl_global_mmio_read64(ocxlpmem-

Re: [RFC 2/3] mm/vma: Introduce VM_ACCESS_FLAGS

2020-03-03 Thread Anshuman Khandual



On 03/03/2020 11:18 PM, Vlastimil Babka wrote:
> On 3/2/20 7:47 AM, Anshuman Khandual wrote:
>> There are many places where all basic VMA access flags (read, write, exec)
>> are initialized or checked against as a group. One such example is during
>> page fault. Existing vma_is_accessible() wrapper already creates the notion
>> of VMA accessibility as a group access permissions. Hence lets just create
>> VM_ACCESS_FLAGS (VM_READ|VM_WRITE|VM_EXEC) which will not only reduce code
>> duplication but also extend the VMA accessibility concept in general.
>>
>> Cc: Russell King 
>> CC: Catalin Marinas 
>> CC: Mark Salter 
>> Cc: Nick Hu 
>> CC: Ley Foon Tan 
>> Cc: Michael Ellerman 
>> Cc: Heiko Carstens 
>> Cc: Yoshinori Sato 
>> Cc: Guan Xuetao 
>> Cc: Dave Hansen 
>> Cc: Thomas Gleixner 
>> Cc: Rob Springer 
>> Cc: Greg Kroah-Hartman 
>> Cc: Andrew Morton 
>> Cc: linux-arm-ker...@lists.infradead.org
>> Cc: linux-c6x-...@linux-c6x.org
>> Cc: nios2-...@lists.rocketboards.org
>> Cc: linuxppc-dev@lists.ozlabs.org
>> Cc: linux-s...@vger.kernel.org
>> Cc: linux...@vger.kernel.org
>> Cc: de...@driverdev.osuosl.org
>> Cc: linux...@kvack.org
>> Cc: linux-ker...@vger.kernel.org
>> Signed-off-by: Anshuman Khandual 
> 
> Dunno. Such a mask seems OK for testing flags, but it's a bit awkward when
> initializing flags, where it covers just one of the many combinations that
> seem to be used. But no strong opinions, the patch looks correct.

Fair enough. The fact that it covers only one of the many init combinations
used at various places is indeed a good point. The page fault handlers do
start with the VMA flags mask as VM_ACCESS_FLAGS, hence I will keep them
there and drop the other init cases here.
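
i.e. the fault handlers do something like (sketch):

    vm_flags_t mask = VM_ACCESS_FLAGS; /* VM_READ | VM_WRITE | VM_EXEC */

    if (unlikely(!(vma->vm_flags & mask)))
            goto bad_area; /* VMA is not accessible at all */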


Re: [PATCH v3 20/27] powerpc/powernv/pmem: Forward events to userspace

2020-03-03 Thread Alastair D'Silva
On Tue, 2020-03-03 at 18:02 +1100, Andrew Donnellan wrote:
> On 21/2/20 2:27 pm, Alastair D'Silva wrote:
> > @@ -938,6 +955,51 @@ static int ioctl_controller_stats(struct ocxlpmem *ocxlpmem,
> > return rc;
> >   }
> >   
> > +static int ioctl_eventfd(struct ocxlpmem *ocxlpmem,
> > +struct ioctl_ocxl_pmem_eventfd __user *uarg)
> > +{
> > +   struct ioctl_ocxl_pmem_eventfd args;
> > +
> > +   if (copy_from_user(&args, uarg, sizeof(args)))
> > +   return -EFAULT;
> > +
> > +   if (ocxlpmem->ev_ctx)
> > +   return -EINVAL;
> 
> I think EBUSY is more appropriate here.
> 

Ok
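
i.e.:

    if (ocxlpmem->ev_ctx)
            return -EBUSY; /* an eventfd is already registered */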

> > +
> > +   ocxlpmem->ev_ctx = eventfd_ctx_fdget(args.eventfd);
> > +   if (!ocxlpmem->ev_ctx)
> > +   return -EFAULT;
> > +
> > +   return 0;
> > +}
> > +
> > +static int ioctl_event_check(struct ocxlpmem *ocxlpmem, u64 __user
> > *uarg)
> > +{
> > +   u64 val = 0;
> > +   int rc;
> > +   u64 chi = 0;
> > +
> > +   rc = ocxlpmem_chi(ocxlpmem, &chi);
> > +   if (rc < 0)
> > +   return rc;
> > +
> > +   if (chi & GLOBAL_MMIO_CHI_ELA)
> > +   val |= IOCTL_OCXL_PMEM_EVENT_ERROR_LOG_AVAILABLE;
> > +
> > +   if (chi & GLOBAL_MMIO_CHI_CDA)
> > +   val |= IOCTL_OCXL_PMEM_EVENT_CONTROLLER_DUMP_AVAILABLE;
> > +
> > +   if (chi & GLOBAL_MMIO_CHI_CFFS)
> > +   val |= IOCTL_OCXL_PMEM_EVENT_FIRMWARE_FATAL;
> > +
> > +   if (chi & GLOBAL_MMIO_CHI_CHFS)
> > +   val |= IOCTL_OCXL_PMEM_EVENT_HARDWARE_FATAL;
> > +
> > +   rc = copy_to_user((u64 __user *) uarg, &val, sizeof(val));
> > +
> > +   return rc;
> > +}
> > +
> >   static long file_ioctl(struct file *file, unsigned int cmd,
> > unsigned long args)
> >   {
> > struct ocxlpmem *ocxlpmem = file->private_data;
> > @@ -966,6 +1028,15 @@ static long file_ioctl(struct file *file,
> > unsigned int cmd, unsigned long args)
> > rc = ioctl_controller_stats(ocxlpmem,
> > (struct
> > ioctl_ocxl_pmem_controller_stats __user *)args);
> > break;
> > +
> > +   case IOCTL_OCXL_PMEM_EVENTFD:
> > +   rc = ioctl_eventfd(ocxlpmem,
> > +  (struct ioctl_ocxl_pmem_eventfd
> > __user *)args);
> > +   break;
> > +
> > +   case IOCTL_OCXL_PMEM_EVENT_CHECK:
> > +   rc = ioctl_event_check(ocxlpmem, (u64 __user *)args);
> > +   break;
> > }
> >   
> > return rc;
> > @@ -1107,6 +1178,146 @@ static void dump_error_log(struct ocxlpmem
> > *ocxlpmem)
> > kfree(buf);
> >   }
> >   
> > +static irqreturn_t imn0_handler(void *private)
> > +{
> > +   struct ocxlpmem *ocxlpmem = private;
> > +   u64 chi = 0;
> > +
> > +   (void)ocxlpmem_chi(ocxlpmem, &chi);
> > +
> > +   if (chi & GLOBAL_MMIO_CHI_ELA) {
> > +   dev_warn(&ocxlpmem->dev, "Error log is available\n");
> > +
> > +   if (ocxlpmem->ev_ctx)
> > +   eventfd_signal(ocxlpmem->ev_ctx, 1);
> > +   }
> > +
> > +   if (chi & GLOBAL_MMIO_CHI_CDA) {
> > +   dev_warn(&ocxlpmem->dev, "Controller dump is
> > available\n");
> > +
> > +   if (ocxlpmem->ev_ctx)
> > +   eventfd_signal(ocxlpmem->ev_ctx, 1);
> > +   }
> > +
> > +
> > +   return IRQ_HANDLED;
> > +}
> > +
> > +static irqreturn_t imn1_handler(void *private)
> > +{
> > +   struct ocxlpmem *ocxlpmem = private;
> > +   u64 chi = 0;
> > +
> > +   (void)ocxlpmem_chi(ocxlpmem, &chi);
> > +
> > +   if (chi & (GLOBAL_MMIO_CHI_CFFS | GLOBAL_MMIO_CHI_CHFS)) {
> > +   dev_err(&ocxlpmem->dev,
> > +   "Controller status is fatal, chi=0x%llx, going
> > offline\n", chi);
> > +
> > +   if (ocxlpmem->nvdimm_bus) {
> > +   nvdimm_bus_unregister(ocxlpmem->nvdimm_bus);
> > +   ocxlpmem->nvdimm_bus = NULL;
> > +   }
> > +
> > +   if (ocxlpmem->ev_ctx)
> > +   eventfd_signal(ocxlpmem->ev_ctx, 1);
> > +   }
> > +
> > +   return IRQ_HANDLED;
> > +}
> > +
> > +
> > +/**
> > + * ocxlpmem_setup_irq() - Set up the IRQs for the OpenCAPI
> > Persistent Memory device
> > + * @ocxlpmem: the device metadata
> > + * Return: 0 on success, negative on failure
> > + */
> > +static int ocxlpmem_setup_irq(struct ocxlpmem *ocxlpmem)
> > +{
> > +   int rc;
> > +   u64 irq_addr;
> > +
> > +   rc = ocxl_afu_irq_alloc(ocxlpmem->ocxl_context, &ocxlpmem-
> > >irq_id[0]);
> > +   if (rc)
> > +   return rc;
> > +
> > +   rc = ocxl_irq_set_handler(ocxlpmem->ocxl_context, ocxlpmem-
> > >irq_id[0],
> > + imn0_handler, NULL, ocxlpmem);
> > +
> > +   irq_addr = ocxl_afu_irq_get_addr(ocxlpmem->ocxl_context,
> > ocxlpmem->irq_id[0]);
> > +   if (!irq_addr)
> > +   return -EINVAL;
> > +
> > +   ocxlpmem->irq_addr[0] = ioremap(irq_addr, PAGE_SIZE);
> > +   if (!ocxlpmem->irq_addr[0])
> > +   return -EINVAL;
> 
> Something other than EINVAL for these two

Ok
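
Will change it to something like (conventional codes, untested):

    irq_addr = ocxl_afu_irq_get_addr(ocxlpmem->ocxl_context,
                                     ocxlpmem->irq_id[0]);
    if (!irq_addr)
            return -ENXIO;

    ocxlpmem->irq_addr[0] = ioremap(irq_addr, PAGE_SIZE);
    if (!ocxlpmem->irq_addr[0])
            return -ENOMEM;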

> 
> > +
> > +   rc = ocxl_global_mmio_write64(ocxlpmem->ocxl_afu,
> > GLOBAL_MMIO_IMA0_OHP,
> > +   

Re: [PATCH v3 03/27] powerpc: Map & release OpenCAPI LPC memory

2020-03-03 Thread Alastair D'Silva
On Tue, 2020-03-03 at 17:10 +1100, Andrew Donnellan wrote:
> On 21/2/20 2:26 pm, Alastair D'Silva wrote:
> > +#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
> > +u64 pnv_ocxl_platform_lpc_setup(struct pci_dev *pdev, u64 size)
> > +{
> > +   struct pci_controller *hose = pci_bus_to_host(pdev->bus);
> > +   struct pnv_phb *phb = hose->private_data;
> > +   u32 bdfn = pci_dev_id(pdev);
> > +   __be64 base_addr_be64;
> > +   u64 base_addr;
> > +   int rc;
> > +
> > +   rc = opal_npu_mem_alloc(phb->opal_id, bdfn, size,
> > &base_addr_be64);
> 
> Sparse warning:
> 
> https://openpower.xyz/job/snowpatch/job/snowpatch-linux-sparse/15776//artifact/linux/report.txt
> 
> I think in patch 1 we need to change a uint64_t to a __be64.
> 

Ok, thanks

-- 
Alastair D'Silva
Open Source Developer
Linux Technology Centre, IBM Australia
mob: 0423 762 819



Re: [RFC 1/3] mm/vma: Define a default value for VM_DATA_DEFAULT_FLAGS

2020-03-03 Thread Anshuman Khandual



On 03/03/2020 10:55 PM, Vlastimil Babka wrote:
> On 3/2/20 7:47 AM, Anshuman Khandual wrote:
>> There are many platforms with exact same value for VM_DATA_DEFAULT_FLAGS
>> This creates a default value for VM_DATA_DEFAULT_FLAGS in line with the
>> existing VM_STACK_DEFAULT_FLAGS. While here, also define some more macros
>> with standard VMA access flag combinations that are used frequently across
>> many platforms. Apart from simplification, this reduces code duplication
>> as well.
>>
>> Cc: Richard Henderson 
>> Cc: Vineet Gupta 
>> Cc: Russell King 
>> Cc: Catalin Marinas 
>> Cc: Mark Salter 
>> Cc: Guo Ren 
>> Cc: Yoshinori Sato 
>> Cc: Brian Cain 
>> Cc: Tony Luck 
>> Cc: Geert Uytterhoeven 
>> Cc: Michal Simek 
>> Cc: Ralf Baechle 
>> Cc: Paul Burton 
>> Cc: Nick Hu 
>> Cc: Ley Foon Tan 
>> Cc: Jonas Bonn 
>> Cc: "James E.J. Bottomley" 
>> Cc: Michael Ellerman 
>> Cc: Paul Walmsley 
>> Cc: Heiko Carstens 
>> Cc: Rich Felker 
>> Cc: "David S. Miller" 
>> Cc: Guan Xuetao 
>> Cc: Thomas Gleixner 
>> Cc: Jeff Dike 
>> Cc: Chris Zankel 
>> Cc: Andrew Morton 
>> Cc: linux-al...@vger.kernel.org
>> Cc: linux-ker...@vger.kernel.org
>> Cc: linux-snps-...@lists.infradead.org
>> Cc: linux-arm-ker...@lists.infradead.org
>> Cc: linux-c6x-...@linux-c6x.org
>> Cc: uclinux-h8-de...@lists.sourceforge.jp
>> Cc: linux-hexa...@vger.kernel.org
>> Cc: linux-i...@vger.kernel.org
>> Cc: linux-m...@lists.linux-m68k.org
>> Cc: linux-m...@vger.kernel.org
>> Cc: nios2-...@lists.rocketboards.org
>> Cc: openr...@lists.librecores.org
>> Cc: linux-par...@vger.kernel.org
>> Cc: linuxppc-dev@lists.ozlabs.org
>> Cc: linux-ri...@lists.infradead.org
>> Cc: linux-s...@vger.kernel.org
>> Cc: linux...@vger.kernel.org
>> Cc: sparcli...@vger.kernel.org
>> Cc: linux...@lists.infradead.org
>> Cc: linux-xte...@linux-xtensa.org
>> Cc: linux...@kvack.org
>> Signed-off-by: Anshuman Khandual 
>
> Reviewed-by: Vlastimil Babka 
> 
> Nit:
> 
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index b0e53ef13ff1..7a764ae6ab68 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -342,6 +342,21 @@ extern unsigned int kobjsize(const void *objp);
>>  /* Bits set in the VMA until the stack is in its final location */
>>  #define VM_STACK_INCOMPLETE_SETUP   (VM_RAND_READ | VM_SEQ_READ)
>>  
>> +#define TASK_EXEC ((current->personality & READ_IMPLIES_EXEC) ? VM_EXEC : 0)
>> +
>> +/* Common data flag combinations */
>> +#define VM_DATA_FLAGS_TSK_EXEC  (VM_READ | VM_WRITE | TASK_EXEC | \
>> + VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
>> +#define VM_DATA_FLAGS_NON_EXEC  (VM_READ | VM_WRITE | VM_MAYREAD | \
>> + VM_MAYWRITE | VM_MAYEXEC)
>> +#define VM_DATA_FLAGS_EXEC  (VM_READ | VM_WRITE | VM_EXEC | \
>> + VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
>> +
>> +#ifndef VM_DATA_DEFAULT_FLAGS   /* arch can override this */
>> +#define VM_DATA_DEFAULT_FLAGS   (VM_READ | VM_WRITE | VM_EXEC | \
>> + VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
> 
> Should you use VM_DATA_FLAGS_EXEC here? Yeah one more macro to expand, but 
> it's
> right above this.

Sure, can do that.
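
i.e. (sketch):

    #ifndef VM_DATA_DEFAULT_FLAGS   /* arch can override this */
    #define VM_DATA_DEFAULT_FLAGS   VM_DATA_FLAGS_EXEC
    #endif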

> 
>> +#endif
>> +
>>  #ifndef VM_STACK_DEFAULT_FLAGS  /* arch can override this */
>>  #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
>>  #endif
>>
> 
> 
> 


Re: [PATCH 1/1] powerpc/kernel: Enables memory hot-remove after reboot on pseries guests

2020-03-03 Thread Bharata B Rao
On Fri, Feb 28, 2020 at 11:36 AM Leonardo Bras  wrote:
>
> When providing guests, it's desirable to resize their memory on demand.
>
> Currently, it's possible to do so by creating a guest with a small base
> memory, hot-plugging all the rest, and using 'movable_node' kernel
> command-line parameter, which puts all hot-plugged memory in
> ZONE_MOVABLE, allowing it to be removed whenever needed.
>
> But there is an issue regarding guest reboot:
> If memory is hot-plugged, and then the guest is rebooted, all hot-plugged
> memory goes to ZONE_NORMAL, which offers no hot-removal guarantee,
> usually preventing that memory from being hot-removed from the guest.
>
> It's possible to use device-tree information to fix that behavior, as
> it stores flags for LMB ranges in ibm,dynamic-memory-vN.
> This involves marking each hot-plugged memblock as hotpluggable memory,
> which mm/memblock.c puts in ZONE_MOVABLE during boot if 'movable_node'
> is passed.
>
> For base memory, qemu assigns these flags to its LMBs:
> (DRCONF_MEM_AI_INVALID | DRCONF_MEM_RESERVED)
> For hot-plugged memory, it assigns (DRCONF_MEM_ASSIGNED).
>
> While the guest kernel reads the device tree, early_init_drmem_lmb() is
> called for every added LMB, doing nothing for base memory, and adding
> memblocks for hot-plugged memory. Skipping base memory happens here:
>
> if ((lmb->flags & DRCONF_MEM_RESERVED) ||
> !(lmb->flags & DRCONF_MEM_ASSIGNED))
> return;
>
> Marking memblocks added by this function as hotpluggable memory
> is enough to get the desired behavior, and should cause no change
> if the 'movable_node' parameter is not passed to the kernel.
>
> Signed-off-by: Leonardo Bras 
> ---
>  arch/powerpc/kernel/prom.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
> index 6620f37abe73..f4d14c67bf53 100644
> --- a/arch/powerpc/kernel/prom.c
> +++ b/arch/powerpc/kernel/prom.c
> @@ -518,6 +518,8 @@ static void __init early_init_drmem_lmb(struct drmem_lmb *lmb,
> DBG("Adding: %llx -> %llx\n", base, size);
> if (validate_mem_limit(base, &size))
> memblock_add(base, size);
> +
> +   early_init_dt_mark_hotplug_memory_arch(base, size);

Hi,

I tried this a few years back
(https://patchwork.ozlabs.org/patch/800142/) and didn't pursue it
further because at that time, it was felt that the approach might not
work for PowerVM guests, because all the present memory except RMA
gets marked as hot-pluggable by PowerVM. This discussion is not
present in the above thread, but during my private discussions with
Reza and Nathan, it was noted that making all that memory MOVABLE
is not preferable for PowerVM guests as we might run out of memory for
kernel allocations.

Regards,
Bharata.
-- 
http://raobharata.wordpress.com/


RE: [PATCH v3 15/27] powerpc/powernv/pmem: Add support for near storage commands

2020-03-03 Thread Alastair D'Silva
On Mon, 2020-03-02 at 10:42 -0800, Dan Williams wrote:
> On Mon, Mar 2, 2020 at 9:59 AM Frederic Barrat  wrote:
> > 
> > 
> > On 21/02/2020 at 04:27, Alastair D'Silva wrote:
> > > From: Alastair D'Silva 
> > > 
> > > Similar to the previous patch, this adds support for near storage
> > > commands.
> > > 
> > > Signed-off-by: Alastair D'Silva 
> > > ---
> > 
> > Is any of these new functions ever called?
> 
> This is my concern as well. The libnvdimm command support is limited
> to the commands that Linux will use. Other passthrough commands are
> supported through a passthrough interface. However, that passthrough
> interface is explicitly limited to publicly documented command sets
> so
> that the kernel has an opportunity to constrain and consolidate
> command implementations across vendors.


It will be in the patch that implements overwrite. I moved that patch
out of this series, as it needs more testing, so I guess I can submit
this alongside it.

-- 
Alastair D'Silva
Open Source Developer
Linux Technology Centre, IBM Australia
mob: 0423 762 819



[PATCH v2 6/6] asm-generic/tlb: avoid potential double flush

2020-03-03 Thread Santosh Sivaraj
From: Peter Zijlstra 

commit 0758cd8304942292e95a0f750c374533db378b32 upstream.

Aneesh reported that:

tlb_flush_mmu()
  tlb_flush_mmu_tlbonly()
tlb_flush() <-- #1
  tlb_flush_mmu_free()
tlb_table_flush()
  tlb_table_invalidate()
tlb_flush_mmu_tlbonly()
  tlb_flush()   <-- #2

does two TLBIs when tlb->fullmm, because __tlb_reset_range() will not
clear tlb->end in that case.

Observe that any caller to __tlb_adjust_range() also sets at least one of
the tlb->freed_tables || tlb->cleared_p* bits, and those are
unconditionally cleared by __tlb_reset_range().

Change the condition for actually issuing TLBI to having one of those bits
set, as opposed to having tlb->end != 0.

Link: http://lkml.kernel.org/r/20200116064531.483522-4-aneesh.ku...@linux.ibm.com
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Aneesh Kumar K.V 
Reported-by: "Aneesh Kumar K.V" 
Cc:   # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: backported to 4.19 stable]
---
 include/asm-generic/tlb.h | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 19934cdd143e..427a70c56ddd 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -179,7 +179,12 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
 
 static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 {
-   if (!tlb->end)
+   /*
+* Anything calling __tlb_adjust_range() also sets at least one of
+* these bits.
+*/
+   if (!(tlb->freed_tables || tlb->cleared_ptes || tlb->cleared_pmds ||
+ tlb->cleared_puds || tlb->cleared_p4ds))
return;
 
tlb_flush(tlb);
-- 
2.24.1



[PATCH v2 5/6] mm/mmu_gather: invalidate TLB correctly on batch allocation failure and flush

2020-03-03 Thread Santosh Sivaraj
From: Peter Zijlstra 

commit 0ed1325967ab5f7a4549a2641c6ebe115f76e228 upstream.

Architectures for which we have hardware walkers of Linux page table
should flush TLB on mmu gather batch allocation failures and batch flush.
Some architectures like POWER supports multiple translation modes (hash
and radix) and in the case of POWER only radix translation mode needs the
above TLBI.  This is because for hash translation mode kernel wants to
avoid this extra flush since there are no hardware walkers of linux page
table.  With radix translation, the hardware also walks linux page table
and with that, kernel needs to make sure to TLB invalidate page walk cache
before page table pages are freed.

More details in commit d86564a2f085 ("mm/tlb, x86/mm: Support invalidating
TLB caches for RCU_TABLE_FREE")

The changes to sparc are to make sure we keep the old behavior since we
are now removing HAVE_RCU_TABLE_NO_INVALIDATE.  The default value for
tlb_needs_table_invalidate is to always force an invalidate and sparc can
avoid the table invalidate.  Hence we define tlb_needs_table_invalidate to
false for sparc architecture.

Link: http://lkml.kernel.org/r/20200116064531.483522-3-aneesh.ku...@linux.ibm.com
Fixes: a46cc7a90fd8 ("powerpc/mm/radix: Improve TLB/PWC flushes")
Signed-off-by: Peter Zijlstra (Intel) 
Cc:   # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: backported to 4.19 stable]
---
 arch/Kconfig|  3 ---
 arch/powerpc/Kconfig|  1 -
 arch/powerpc/include/asm/tlb.h  | 11 +++
 arch/sparc/Kconfig  |  1 -
 arch/sparc/include/asm/tlb_64.h |  9 +
 include/asm-generic/tlb.h   | 15 +++
 mm/memory.c | 16 
 7 files changed, 43 insertions(+), 13 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 061a12b8140e..3abbdb0cea44 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -363,9 +363,6 @@ config HAVE_ARCH_JUMP_LABEL
 config HAVE_RCU_TABLE_FREE
bool
 
-config HAVE_RCU_TABLE_NO_INVALIDATE
-   bool
-
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
bool
 
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1a00ce4b0040..e5bc0cfea2b1 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -216,7 +216,6 @@ config PPC
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
select HAVE_RCU_TABLE_FREE
-   select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if PPC64 && CPU_LITTLE_ENDIAN
select HAVE_SYSCALL_TRACEPOINTS
diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index f0e571b2dc7c..63418275f402 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -30,6 +30,17 @@
 #define tlb_remove_check_page_size_change tlb_remove_check_page_size_change
 
 extern void tlb_flush(struct mmu_gather *tlb);
+/*
+ * book3s:
+ * Hash does not use the linux page-tables, so we can avoid
+ * the TLB invalidate for page-table freeing, Radix otoh does use the
+ * page-tables and needs the TLBI.
+ *
+ * nohash:
+ * We still do TLB invalidate in the __pte_free_tlb routine before we
+ * add the page table pages to mmu gather table batch.
+ */
+#define tlb_needs_table_invalidate()   radix_enabled()
 
 /* Get the generic bits... */
 #include 
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index d90d632868aa..e6f2a38d2e61 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -64,7 +64,6 @@ config SPARC64
select HAVE_KRETPROBES
select HAVE_KPROBES
select HAVE_RCU_TABLE_FREE if SMP
-   select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_MEMBLOCK_NODE_MAP
select HAVE_ARCH_TRANSPARENT_HUGEPAGE
select HAVE_DYNAMIC_FTRACE
diff --git a/arch/sparc/include/asm/tlb_64.h b/arch/sparc/include/asm/tlb_64.h
index a2f3fa61ee36..8cb8f3833239 100644
--- a/arch/sparc/include/asm/tlb_64.h
+++ b/arch/sparc/include/asm/tlb_64.h
@@ -28,6 +28,15 @@ void flush_tlb_pending(void);
 #define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
 #define tlb_flush(tlb) flush_tlb_pending()
 
+/*
+ * SPARC64's hardware TLB fill does not use the Linux page-tables
+ * and therefore we don't need a TLBI when freeing page-table pages.
+ */
+
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+#define tlb_needs_table_invalidate()   (false)
+#endif
+
 #include 
 
 #endif /* _SPARC64_TLB_H */
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index f2b9dc9cbaf8..19934cdd143e 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -61,8 +61,23 @@ struct mmu_table_batch {
 extern void tlb_table_flush(struct mmu_gather *tlb);
 extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
 
+/*
+ * This allows an architecture that does not use the linux page-tables for
+ * hardware to skip the TLBI when freeing page tables.
+ */
+#ifndef tlb_needs_table_invalidate
+#define tlb_needs_table_invalidate() (true)
+#endif

[PATCH v2 4/6] powerpc/mmu_gather: enable RCU_TABLE_FREE even for !SMP case

2020-03-03 Thread Santosh Sivaraj
From: "Aneesh Kumar K.V" 

commit 12e4d53f3f04e81f9e83d6fc10edc7314ab9f6b9 upstream.

Patch series "Fixup page directory freeing", v4.

This is a repost of patch series from Peter with the arch specific changes
except ppc64 dropped.  ppc64 changes are added here because we are redoing
the patch series on top of ppc64 changes.  This makes it easy to backport
these changes.  Only the first 2 patches need to be backported to stable.

The thing is, on anything SMP, freeing page directories should observe the
exact same order as normal page freeing:

 1) unhook page/directory
 2) TLB invalidate
 3) free page/directory

Without this, any concurrent page-table walk could end up with a
Use-after-Free.  This is esp.  trivial for anything that has software
page-table walkers (HAVE_FAST_GUP / software TLB fill) or the hardware
caches partial page-walks (ie.  caches page directories).

Even on UP this might give issues since mmu_gather is preemptible these
days.  An interrupt or preempted task accessing user pages might stumble
into the free page if the hardware caches page directories.

This patch series fixes ppc64 and add generic MMU_GATHER changes to
support the conversion of other architectures.  I haven't added patches
w.r.t other architecture because they are yet to be acked.

This patch (of 9):

A followup patch is going to make sure we correctly invalidate page walk
cache before we free page table pages.  In order to keep things simple
enable RCU_TABLE_FREE even for !SMP so that we don't have to fixup the
!SMP case differently in the followup patch

!SMP case is right now broken for radix translation w.r.t page walk
cache flush.  We can get interrupted in between page table free and
that would imply we have page walk cache entries pointing to tables
which got freed already.  Michael said "both our platforms that run on
Power9 force SMP on in Kconfig, so the !SMP case is unlikely to be a
problem for anyone in practice, unless they've hacked their kernel to
build it !SMP."

Link: http://lkml.kernel.org/r/20200116064531.483522-2-aneesh.ku...@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V 
Cc:  # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: backported for 4.19 stable]
---
 arch/powerpc/Kconfig | 2 +-
 arch/powerpc/include/asm/book3s/32/pgalloc.h | 8 
 arch/powerpc/include/asm/book3s/64/pgalloc.h | 2 --
 arch/powerpc/mm/pgtable-book3s64.c   | 7 ---
 4 files changed, 1 insertion(+), 18 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index e09cfb109b8c..1a00ce4b0040 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -215,7 +215,7 @@ config PPC
select HAVE_HARDLOCKUP_DETECTOR_PERF    if PERF_EVENTS && HAVE_PERF_EVENTS_NMI && !HAVE_HARDLOCKUP_DETECTOR_ARCH
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
-   select HAVE_RCU_TABLE_FREE  if SMP
+   select HAVE_RCU_TABLE_FREE
select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if PPC64 && CPU_LITTLE_ENDIAN
diff --git a/arch/powerpc/include/asm/book3s/32/pgalloc.h b/arch/powerpc/include/asm/book3s/32/pgalloc.h
index 82e44b1a00ae..79ba3fbb512e 100644
--- a/arch/powerpc/include/asm/book3s/32/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/32/pgalloc.h
@@ -110,7 +110,6 @@ static inline void pgtable_free(void *table, unsigned index_size)
 #define check_pgt_cache()  do { } while (0)
 #define get_hugepd_cache_index(x)  (x)
 
-#ifdef CONFIG_SMP
 static inline void pgtable_free_tlb(struct mmu_gather *tlb,
void *table, int shift)
 {
@@ -127,13 +126,6 @@ static inline void __tlb_remove_table(void *_table)
 
pgtable_free(table, shift);
 }
-#else
-static inline void pgtable_free_tlb(struct mmu_gather *tlb,
-   void *table, int shift)
-{
-   pgtable_free(table, shift);
-}
-#endif
 
 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
  unsigned long address)
diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h b/arch/powerpc/include/asm/book3s/64/pgalloc.h
index f9019b579903..1013c0214213 100644
--- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
@@ -47,9 +47,7 @@ extern pmd_t *pmd_fragment_alloc(struct mm_struct *, unsigned long);
 extern void pte_fragment_free(unsigned long *, int);
 extern void pmd_fragment_free(unsigned long *);
 extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift);
-#ifdef CONFIG_SMP
 extern void __tlb_remove_table(void *_table);
-#endif
 
 static inline pgd_t *radix__pgd_alloc(struct mm_struct *mm)
 {
diff --git a/arch/powerpc/mm/pgtable-book3s64.c b/arch/powerpc/mm/pgtable-book3s64.c
index 297db665d953..5b4e9fd8990c 100644
--- a/arch/powerpc/mm/pgtable-book3s64.c
+++ b/arch/powerpc/mm/pgtable-book3s64.c

[PATCH v2 3/6] asm-generic/tlb, arch: Invert CONFIG_HAVE_RCU_TABLE_INVALIDATE

2020-03-03 Thread Santosh Sivaraj
From: Peter Zijlstra 

commit 96bc9567cbe112e9320250f01b9c060c882e8619 upstream.

Make issuing a TLB invalidate for page-table pages the normal case.

The reason is twofold:

 - too many invalidates is safer than too few,
 - most architectures use the linux page-tables natively
   and would thus require this.

Make it an opt-out, instead of an opt-in.

No change in behavior intended.

Signed-off-by: Peter Zijlstra (Intel) 
Cc:  # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: prerequisite for upcoming tlbflush backports]
---
 arch/Kconfig | 2 +-
 arch/powerpc/Kconfig | 1 +
 arch/sparc/Kconfig   | 1 +
 arch/x86/Kconfig | 1 -
 mm/memory.c  | 2 +-
 5 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index a336548487e6..061a12b8140e 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -363,7 +363,7 @@ config HAVE_ARCH_JUMP_LABEL
 config HAVE_RCU_TABLE_FREE
bool
 
-config HAVE_RCU_TABLE_INVALIDATE
+config HAVE_RCU_TABLE_NO_INVALIDATE
bool
 
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 6f475dc5829b..e09cfb109b8c 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -216,6 +216,7 @@ config PPC
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
select HAVE_RCU_TABLE_FREE  if SMP
+   select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if PPC64 && CPU_LITTLE_ENDIAN
select HAVE_SYSCALL_TRACEPOINTS
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index e6f2a38d2e61..d90d632868aa 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -64,6 +64,7 @@ config SPARC64
select HAVE_KRETPROBES
select HAVE_KPROBES
select HAVE_RCU_TABLE_FREE if SMP
+   select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_MEMBLOCK_NODE_MAP
select HAVE_ARCH_TRANSPARENT_HUGEPAGE
select HAVE_DYNAMIC_FTRACE
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index af35f5caadbe..181d0d522977 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -181,7 +181,6 @@ config X86
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
select HAVE_RCU_TABLE_FREE  if PARAVIRT
-   select HAVE_RCU_TABLE_INVALIDATEif HAVE_RCU_TABLE_FREE
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if X86_64 && (UNWINDER_FRAME_POINTER || UNWINDER_ORC) && STACK_VALIDATION
select HAVE_STACKPROTECTOR  if CC_HAS_SANE_STACKPROTECTOR
diff --git a/mm/memory.c b/mm/memory.c
index 1832c5ed6ac0..ba5689610c04 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -327,7 +327,7 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_size)
  */
 static inline void tlb_table_invalidate(struct mmu_gather *tlb)
 {
-#ifdef CONFIG_HAVE_RCU_TABLE_INVALIDATE
+#ifndef CONFIG_HAVE_RCU_TABLE_NO_INVALIDATE
/*
 * Invalidate page-table caches used by hardware walkers. Then we still
 * need to RCU-sched wait while freeing the pages because software
-- 
2.24.1



[PATCH v2 2/6] asm-generic/tlb: Track which levels of the page tables have been cleared

2020-03-03 Thread Santosh Sivaraj
From: Will Deacon 

commit a6d60245d6d9b1caf66b0d94419988c4836980af upstream

It is common for architectures with hugepage support to require only a
single TLB invalidation operation per hugepage during unmap(), rather than
iterating through the mapping at a PAGE_SIZE increment. Currently,
however, the level in the page table where the unmap() operation occurs
is not stored in the mmu_gather structure, therefore forcing
architectures to issue additional TLB invalidation operations or to give
up and over-invalidate by e.g. invalidating the entire TLB.

Ideally, we could add an interval rbtree to the mmu_gather structure,
which would allow us to associate the correct mapping granule with the
various sub-mappings within the range being invalidated. However, this
is costly in terms of book-keeping and memory management, so instead we
approximate by keeping track of the page table levels that are cleared
and provide a means to query the smallest granule required for invalidation.

Signed-off-by: Will Deacon 
Cc:  # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: prerequisite for upcoming tlbflush backports]
---
 include/asm-generic/tlb.h | 58 +--
 mm/memory.c   |  4 ++-
 2 files changed, 53 insertions(+), 9 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 97306b32d8d2..f2b9dc9cbaf8 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -114,6 +114,14 @@ struct mmu_gather {
 */
unsigned intfreed_tables : 1;
 
+   /*
+* at which levels have we cleared entries?
+*/
+   unsigned intcleared_ptes : 1;
+   unsigned intcleared_pmds : 1;
+   unsigned intcleared_puds : 1;
+   unsigned intcleared_p4ds : 1;
+
struct mmu_gather_batch *active;
struct mmu_gather_batch local;
struct page *__pages[MMU_GATHER_BUNDLE];
@@ -148,6 +156,10 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
tlb->end = 0;
}
tlb->freed_tables = 0;
+   tlb->cleared_ptes = 0;
+   tlb->cleared_pmds = 0;
+   tlb->cleared_puds = 0;
+   tlb->cleared_p4ds = 0;
 }
 
 static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
@@ -197,6 +209,25 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 }
 #endif
 
+static inline unsigned long tlb_get_unmap_shift(struct mmu_gather *tlb)
+{
+   if (tlb->cleared_ptes)
+   return PAGE_SHIFT;
+   if (tlb->cleared_pmds)
+   return PMD_SHIFT;
+   if (tlb->cleared_puds)
+   return PUD_SHIFT;
+   if (tlb->cleared_p4ds)
+   return P4D_SHIFT;
+
+   return PAGE_SHIFT;
+}
+
+static inline unsigned long tlb_get_unmap_size(struct mmu_gather *tlb)
+{
+   return 1UL << tlb_get_unmap_shift(tlb);
+}
+
 /*
  * In the case of tlb vma handling, we can optimise these away in the
  * case where we're doing a full MM flush.  When we're doing a munmap,
@@ -230,13 +261,19 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #define tlb_remove_tlb_entry(tlb, ptep, address)   \
do {\
__tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->cleared_ptes = 1;  \
__tlb_remove_tlb_entry(tlb, ptep, address); \
} while (0)
 
-#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)\
-   do { \
-   __tlb_adjust_range(tlb, address, huge_page_size(h)); \
-   __tlb_remove_tlb_entry(tlb, ptep, address);  \
+#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)   \
+   do {\
+   unsigned long _sz = huge_page_size(h);  \
+   __tlb_adjust_range(tlb, address, _sz);  \
+   if (_sz == PMD_SIZE)\
+   tlb->cleared_pmds = 1;  \
+   else if (_sz == PUD_SIZE)   \
+   tlb->cleared_puds = 1;  \
+   __tlb_remove_tlb_entry(tlb, ptep, address); \
} while (0)
 
 /**
@@ -250,6 +287,7 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #define tlb_remove_pmd_tlb_entry(tlb, pmdp, address)   \
do {\
__tlb_adjust_range(tlb, address, HPAGE_PMD_SIZE);   \
+   tlb->cleared_pmds = 1;  \
__tlb_remove_pmd_tlb_entry(tlb, pmdp, address); \
} while (0)
 
@@ -264,6 +302,7 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,

[PATCH v2 1/6] asm-generic/tlb: Track freeing of page-table directories in struct mmu_gather

2020-03-03 Thread Santosh Sivaraj
From: Peter Zijlstra 

commit 22a61c3c4f1379ef8b0ce0d5cb78baf3178950e2 upstream

Some architectures require different TLB invalidation instructions
depending on whether it is only the last-level of page table being
changed, or whether there are also changes to the intermediate
(directory) entries higher up the tree.

Add a new bit to the flags bitfield in struct mmu_gather so that the
architecture code can operate accordingly if it's the intermediate
levels being invalidated.

Signed-off-by: Peter Zijlstra 
Signed-off-by: Will Deacon 
Cc:  # 4.19
Signed-off-by: Santosh Sivaraj 
[santosh: prerequisite for tlbflush backports]
---
 include/asm-generic/tlb.h | 31 +++
 1 file changed, 23 insertions(+), 8 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index b3353e21f3b3..97306b32d8d2 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -97,12 +97,22 @@ struct mmu_gather {
 #endif
unsigned long   start;
unsigned long   end;
-   /* we are in the middle of an operation to clear
-* a full mm and can make some optimizations */
-   unsigned intfullmm : 1,
-   /* we have performed an operation which
-* requires a complete flush of the tlb */
-   need_flush_all : 1;
+   /*
+* we are in the middle of an operation to clear
+* a full mm and can make some optimizations
+*/
+   unsigned intfullmm : 1;
+
+   /*
+* we have performed an operation which
+* requires a complete flush of the tlb
+*/
+   unsigned intneed_flush_all : 1;
+
+   /*
+* we have removed page directories
+*/
+   unsigned intfreed_tables : 1;
 
struct mmu_gather_batch *active;
struct mmu_gather_batch local;
@@ -137,6 +147,7 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
tlb->start = TASK_SIZE;
tlb->end = 0;
}
+   tlb->freed_tables = 0;
 }
 
 static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
@@ -278,6 +289,7 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #define pte_free_tlb(tlb, ptep, address)   \
do {\
__tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->freed_tables = 1;  \
__pte_free_tlb(tlb, ptep, address); \
} while (0)
 #endif
@@ -285,7 +297,8 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #ifndef pmd_free_tlb
 #define pmd_free_tlb(tlb, pmdp, address)   \
do {\
-   __tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   __tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->freed_tables = 1;  \
__pmd_free_tlb(tlb, pmdp, address); \
} while (0)
 #endif
@@ -295,6 +308,7 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #define pud_free_tlb(tlb, pudp, address)   \
do {\
__tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->freed_tables = 1;  \
__pud_free_tlb(tlb, pudp, address); \
} while (0)
 #endif
@@ -304,7 +318,8 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #ifndef p4d_free_tlb
 #define p4d_free_tlb(tlb, pudp, address)   \
do {\
-   __tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   __tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->freed_tables = 1;  \
__p4d_free_tlb(tlb, pudp, address); \
} while (0)
 #endif
-- 
2.24.1



[PATCH v2 0/6] Memory corruption may occur due to incorrect tlb flush

2020-03-03 Thread Santosh Sivaraj
The TLB flush optimisation (a46cc7a90f: powerpc/mm/radix: Improve TLB/PWC
flushes) may result in random memory corruption. Any concurrent page-table walk
could end up with a Use-after-Free. Even on UP this might give issues, since
mmu_gather is preemptible these days. An interrupt or preempted task accessing
user pages might stumble into the free page if the hardware caches page
directories.

The series is a backport of the fix sent by Peter [1].

The first three patches are dependencies for the last patch (avoid potential
double flush). If the performance impact due to double flush is considered
trivial then the first three patches and last patch may be dropped.

This is only for v4.19 stable.

Changelog:
* Send the patches with the correct format (commit sha1 upstream) for stable

--
Aneesh Kumar K.V (1):
  powerpc/mmu_gather: enable RCU_TABLE_FREE even for !SMP case

Peter Zijlstra (4):
  asm-generic/tlb: Track freeing of page-table directories in struct
mmu_gather
  asm-generic/tlb, arch: Invert CONFIG_HAVE_RCU_TABLE_INVALIDATE
  mm/mmu_gather: invalidate TLB correctly on batch allocation failure
and flush
  asm-generic/tlb: avoid potential double flush

Will Deacon (1):
  asm-generic/tlb: Track which levels of the page tables have been
cleared

 arch/Kconfig |   3 -
 arch/powerpc/Kconfig |   2 +-
 arch/powerpc/include/asm/book3s/32/pgalloc.h |   8 --
 arch/powerpc/include/asm/book3s/64/pgalloc.h |   2 -
 arch/powerpc/include/asm/tlb.h   |  11 ++
 arch/powerpc/mm/pgtable-book3s64.c   |   7 --
 arch/sparc/include/asm/tlb_64.h  |   9 ++
 arch/x86/Kconfig |   1 -
 include/asm-generic/tlb.h| 103 ---
 mm/memory.c  |  20 ++--
 10 files changed, 122 insertions(+), 44 deletions(-)

-- 
2.24.1



Re: [PATCH v3 13/27] powerpc/powernv/pmem: Read the capability registers & wait for device ready

2020-03-03 Thread Alastair D'Silva
On Mon, 2020-03-02 at 18:51 +0100, Frederic Barrat wrote:
> 
> On 21/02/2020 at 04:27, Alastair D'Silva wrote:
> > From: Alastair D'Silva 
> > 
> > This patch reads timeouts & firmware version from the controller,
> > and
> > uses those timeouts to wait for the controller to report that it is
> > ready
> > before handing the memory over to libnvdimm.
> > 
> > Signed-off-by: Alastair D'Silva 
> > ---
> >   arch/powerpc/platforms/powernv/pmem/Makefile  |  2 +-
> >   arch/powerpc/platforms/powernv/pmem/ocxl.c| 92 +++
> >   .../platforms/powernv/pmem/ocxl_internal.c| 19 
> >   .../platforms/powernv/pmem/ocxl_internal.h| 24 +
> >   4 files changed, 136 insertions(+), 1 deletion(-)
> >   create mode 100644 arch/powerpc/platforms/powernv/pmem/ocxl_internal.c
> > 
> > diff --git a/arch/powerpc/platforms/powernv/pmem/Makefile b/arch/powerpc/platforms/powernv/pmem/Makefile
> > index 1c55c4193175..4ceda25907d4 100644
> > --- a/arch/powerpc/platforms/powernv/pmem/Makefile
> > +++ b/arch/powerpc/platforms/powernv/pmem/Makefile
> > @@ -4,4 +4,4 @@ ccflags-$(CONFIG_PPC_WERROR)+= -Werror
> >   
> >   obj-$(CONFIG_OCXL_PMEM) += ocxlpmem.o
> >   
> > -ocxlpmem-y := ocxl.o
> > +ocxlpmem-y := ocxl.o ocxl_internal.o
> > diff --git a/arch/powerpc/platforms/powernv/pmem/ocxl.c b/arch/powerpc/platforms/powernv/pmem/ocxl.c
> > index 3c4eeb5dcc0f..431212c9f0cc 100644
> > --- a/arch/powerpc/platforms/powernv/pmem/ocxl.c
> > +++ b/arch/powerpc/platforms/powernv/pmem/ocxl.c
> > @@ -8,6 +8,7 @@
> >   
> >   #include 
> >   #include 
> > +#include 
> >   #include 
> >   #include 
> >   #include 
> > @@ -215,6 +216,36 @@ static int register_lpc_mem(struct ocxlpmem *ocxlpmem)
> > return 0;
> >   }
> >   
> > +/**
> > + * is_usable() - Is a controller usable?
> > + * @ocxlpmem: the device metadata
> > + * @verbose: True to log errors
> > + * Return: true if the controller is usable
> > + */
> > +static bool is_usable(const struct ocxlpmem *ocxlpmem, bool verbose)
> > +{
> > +   u64 chi = 0;
> > +   int rc = ocxlpmem_chi(ocxlpmem, &chi);
> > +
> > +   if (rc < 0)
> > +   return false;
> > +
> > +   if (!(chi & GLOBAL_MMIO_CHI_CRDY)) {
> > +   if (verbose)
> > +   dev_err(&ocxlpmem->dev, "controller is not
> > ready.\n");
> > +   return false;
> > +   }
> > +
> > +   if (!(chi & GLOBAL_MMIO_CHI_MA)) {
> > +   if (verbose)
> > +   dev_err(&ocxlpmem->dev,
> > +   "controller does not have memory
> > available.\n");
> > +   return false;
> > +   }
> > +
> > +   return true;
> > +}
> > +
> >   /**
> >* allocate_minor() - Allocate a minor number to use for an OpenCAPI pmem device
> >* @ocxlpmem: the device metadata
> > @@ -328,6 +359,48 @@ static void ocxlpmem_remove(struct pci_dev *pdev)
> > }
> >   }
> >   
> > +/**
> > + * read_device_metadata() - Retrieve config information from the AFU and save it for future use
> > + * @ocxlpmem: the device metadata
> > + * Return: 0 on success, negative on failure
> > + */
> > +static int read_device_metadata(struct ocxlpmem *ocxlpmem)
> > +{
> > +   u64 val;
> > +   int rc;
> > +
> > +   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu, GLOBAL_MMIO_CCAP0,
> > +OCXL_LITTLE_ENDIAN, &val);
> > +   if (rc)
> > +   return rc;
> > +
> > +   ocxlpmem->scm_revision = val & 0xFFFFFFFF;
> > +   ocxlpmem->read_latency = (val >> 32) & 0xFF;
> > +   ocxlpmem->readiness_timeout = (val >> 48) & 0x0F;
> > +   ocxlpmem->memory_available_timeout = val >> 52;
> > +
> > +   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu, GLOBAL_MMIO_CCAP1,
> > +OCXL_LITTLE_ENDIAN, &val);
> > +   if (rc)
> > +   return rc;
> > +
> > +   ocxlpmem->max_controller_dump_size = val & 0xFFFFFFFF;
> > +
> > +   // Extract firmware version text
> > +   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu, GLOBAL_MMIO_FWVER,
> > +OCXL_HOST_ENDIAN, (u64 *)ocxlpmem->fw_version);
> > +   if (rc)
> > +   return rc;
> > +
> > +   ocxlpmem->fw_version[8] = '\0';
> > +
> > +   dev_info(&ocxlpmem->dev,
> > +"Firmware version '%s' SCM revision %d:%d\n",
> > ocxlpmem->fw_version,
> > +ocxlpmem->scm_revision >> 4, ocxlpmem->scm_revision &
> > 0x0F);
> > +
> > +   return 0;
> > +}
> > +
> >   /**
> >* probe_function0() - Set up function 0 for an OpenCAPI persistent memory device
> >* This is important as it enables templates higher than 0 across all other functions,
> > @@ -368,6 +441,7 @@ static int probe(struct pci_dev *pdev, const struct pci_device_id *ent)
> >   {
> > struct ocxlpmem *ocxlpmem;
> > int rc;
> > +   u16 elapsed, timeout;
> >   
> > if (PCI_FUNC(pdev->devfn) == 0)
> > return probe_function0(pdev);
> > @@ -422,6 +496,24 @@ static int probe(struct pci_dev *pdev, const struct pci_device_id *ent)

Re: [PATCH v3 26/27] powerpc/powernv/pmem: Expose the firmware version in sysfs

2020-03-03 Thread Alastair D'Silva
On Mon, 2020-03-02 at 18:35 +1100, Andrew Donnellan wrote:
> On 21/2/20 2:27 pm, Alastair D'Silva wrote:
> > From: Alastair D'Silva 
> > 
> > This information will be used by ndctl in userspace to help users
> > identify
> > the device.
> 
> You should include the information from the subject line in the body
> of 
> the commit message too.
> 
> I think this patch could probably be squashed in with the last one.
> 

Ok

-- 
Alastair D'Silva
Open Source Developer
Linux Technology Centre, IBM Australia
mob: 0423 762 819



[PATCH v2] ima: add a new CONFIG for loading arch-specific policies

2020-03-03 Thread Nayna Jain
Every time a new architecture defines the IMA architecture specific
functions - arch_ima_get_secureboot() and arch_ima_get_policy(), the IMA
include file needs to be updated. To avoid this "noise", this patch
defines a new IMA Kconfig IMA_SECURE_AND_OR_TRUSTED_BOOT option, allowing
the different architectures to select it.
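For illustration, a minimal sketch of the pair of hooks an architecture
provides once it selects the new option. The hook names come from
include/linux/ima.h; the bodies and the rule strings below are
illustrative only, loosely modeled on existing arch policies:

bool arch_ima_get_secureboot(void)
{
	/* Query the platform's secure boot state; arch-specific. */
	return false; /* placeholder */
}

static const char *const sb_arch_rules[] = {
	"appraise func=KEXEC_KERNEL_CHECK appraise_type=imasig",
	"measure func=KEXEC_KERNEL_CHECK",
	NULL
};

const char *const *arch_get_ima_policy(void)
{
	return arch_ima_get_secureboot() ? sb_arch_rules : NULL;
}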

Suggested-by: Linus Torvalds 
Signed-off-by: Nayna Jain 
Cc: Ard Biesheuvel 
Cc: Philipp Rudo 
Cc: Michael Ellerman 
---
v2:
* Fixed the issue identified by Mimi. Thanks Mimi, Ard, Heiko and Michael for
discussing the fix.

 arch/powerpc/Kconfig   | 1 +
 arch/s390/Kconfig  | 1 +
 arch/x86/Kconfig   | 1 +
 include/linux/ima.h| 3 +--
 security/integrity/ima/Kconfig | 9 +
 5 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 497b7d0b2d7e..a5cfde432983 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -979,6 +979,7 @@ config PPC_SECURE_BOOT
bool
depends on PPC_POWERNV
depends on IMA_ARCH_POLICY
+   select IMA_SECURE_AND_OR_TRUSTED_BOOT
help
  Systems with firmware secure boot enabled need to define security
  policies to extend secure boot to the OS. This config allows a user
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 8abe77536d9d..4a502fbcb800 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -195,6 +195,7 @@ config S390
select ARCH_HAS_FORCE_DMA_UNENCRYPTED
select SWIOTLB
select GENERIC_ALLOCATOR
+   select IMA_SECURE_AND_OR_TRUSTED_BOOT if IMA_ARCH_POLICY
 
 
 config SCHED_OMIT_FRAME_POINTER
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index beea77046f9b..7f5bfaf0cbd2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -230,6 +230,7 @@ config X86
select VIRT_TO_BUS
select X86_FEATURE_NAMESif PROC_FS
select PROC_PID_ARCH_STATUS if PROC_FS
+   select IMA_SECURE_AND_OR_TRUSTED_BOOT   if EFI && IMA_ARCH_POLICY
 
 config INSTRUCTION_DECODER
def_bool y
diff --git a/include/linux/ima.h b/include/linux/ima.h
index 1659217e9b60..aefe758f4466 100644
--- a/include/linux/ima.h
+++ b/include/linux/ima.h
@@ -30,8 +30,7 @@ extern void ima_kexec_cmdline(const void *buf, int size);
 extern void ima_add_kexec_buffer(struct kimage *image);
 #endif
 
-#if (defined(CONFIG_X86) && defined(CONFIG_EFI)) || defined(CONFIG_S390) \
-   || defined(CONFIG_PPC_SECURE_BOOT)
+#ifdef CONFIG_IMA_SECURE_AND_OR_TRUSTED_BOOT
 extern bool arch_ima_get_secureboot(void);
 extern const char * const *arch_get_ima_policy(void);
 #else
diff --git a/security/integrity/ima/Kconfig b/security/integrity/ima/Kconfig
index 3f3ee4e2eb0d..d17972aa413a 100644
--- a/security/integrity/ima/Kconfig
+++ b/security/integrity/ima/Kconfig
@@ -327,3 +327,12 @@ config IMA_QUEUE_EARLY_BOOT_KEYS
depends on IMA_MEASURE_ASYMMETRIC_KEYS
depends on SYSTEM_TRUSTED_KEYRING
default y
+
+config IMA_SECURE_AND_OR_TRUSTED_BOOT
+   bool
+   depends on IMA
+   depends on IMA_ARCH_POLICY
+   default n
+   help
+  This option is selected by architectures to enable secure and/or
+  trusted boot based on IMA runtime policies.
-- 
2.13.6



[PATCH v4 5/5] libnvdimm/region: Introduce an 'align' attribute

2020-03-03 Thread Dan Williams
The align attribute applies an alignment constraint for namespace
creation in a region. Whereas the 'align' attribute of a namespace
applied alignment padding via an info block, the 'align' attribute
applies alignment constraints to the free space allocation.

The default for 'align' is the maximum known memremap_compat_align()
across all archs (16MiB from PowerPC at time of writing) multiplied by
the number of interleave ways if there is blk-aliasing. The minimum is
PAGE_SIZE and allows for the creation of cross-arch incompatible
namespaces, just as previous kernels allowed, but the expectation is
cross-arch and mode-independent compatibility by default.
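As a rough sketch of that default rule (illustrative only, not the
patch itself; MEMREMAP_COMPAT_ALIGN_MAX here stands for the maximum
known memremap_compat_align(), 16MiB, and the constant name is an
assumption):

static unsigned long default_align(struct nd_region *nd_region)
{
	/* largest known memremap_compat_align(): 16MiB (PowerPC hash) */
	unsigned long align = MEMREMAP_COMPAT_ALIGN_MAX;
	u16 mappings = max_t(u16, 1, nd_region->ndr_mappings);

	/* scale by interleave ways when they don't divide evenly */
	if (align % mappings)
		align *= mappings;
	return align;
}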

The regression risk with this change is limited to cases that were
dependent on the ability to create unaligned namespaces, *and* for some
reason are unable to opt-out of aligned namespaces by writing to
'regionX/align'. If such a scenario arises the default can be flipped
from opt-out to opt-in of compat-aligned namespace creation, but that is
a last resort. The kernel will otherwise continue to support existing
defined misaligned namespaces.

Unfortunately this change needs to touch several parts of the
implementation at once:

- region/available_size: expand busy extents to current align
- region/max_available_extent: expand busy extents to current align
- namespace/size: trim free space to current align

...to keep the free space accounting conforming to the dynamic align
setting.

Reported-by: Aneesh Kumar K.V 
Reported-by: Jeff Moyer 
Reviewed-by: Aneesh Kumar K.V 
Reviewed-by: Jeff Moyer 
Link: 
https://lore.kernel.org/r/158041478371.3889308.14542630147672668068.st...@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/dimm_devs.c  |   86 +++
 drivers/nvdimm/namespace_devs.c |9 ++-
 drivers/nvdimm/nd.h |1 
 drivers/nvdimm/region_devs.c|  122 ---
 4 files changed, 192 insertions(+), 26 deletions(-)

diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index 39a61a514746..b7b77e8d9027 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -563,6 +563,21 @@ int nvdimm_security_freeze(struct nvdimm *nvdimm)
return rc;
 }
 
+static unsigned long dpa_align(struct nd_region *nd_region)
+{
+   struct device *dev = &nd_region->dev;
+
+   if (dev_WARN_ONCE(dev, !is_nvdimm_bus_locked(dev),
+   "bus lock required for capacity provision\n"))
+   return 0;
+   if (dev_WARN_ONCE(dev, !nd_region->ndr_mappings || nd_region->align
+   % nd_region->ndr_mappings,
+   "invalid region align %#lx mappings: %d\n",
+   nd_region->align, nd_region->ndr_mappings))
+   return 0;
+   return nd_region->align / nd_region->ndr_mappings;
+}
+
 int alias_dpa_busy(struct device *dev, void *data)
 {
resource_size_t map_end, blk_start, new;
@@ -571,6 +586,7 @@ int alias_dpa_busy(struct device *dev, void *data)
struct nd_region *nd_region;
struct nvdimm_drvdata *ndd;
struct resource *res;
+   unsigned long align;
int i;
 
if (!is_memory(dev))
@@ -608,13 +624,21 @@ int alias_dpa_busy(struct device *dev, void *data)
 * Find the free dpa from the end of the last pmem allocation to
 * the end of the interleave-set mapping.
 */
+   align = dpa_align(nd_region);
+   if (!align)
+   return 0;
+
for_each_dpa_resource(ndd, res) {
+   resource_size_t start, end;
+
if (strncmp(res->name, "pmem", 4) != 0)
continue;
-   if ((res->start >= blk_start && res->start < map_end)
-   || (res->end >= blk_start
-   && res->end <= map_end)) {
-   new = max(blk_start, min(map_end + 1, res->end + 1));
+
+   start = ALIGN_DOWN(res->start, align);
+   end = ALIGN(res->end + 1, align) - 1;
+   if ((start >= blk_start && start < map_end)
+   || (end >= blk_start && end <= map_end)) {
+   new = max(blk_start, min(map_end, end) + 1);
if (new != blk_start) {
blk_start = new;
goto retry;
@@ -654,6 +678,7 @@ resource_size_t nd_blk_available_dpa(struct nd_region 
*nd_region)
.res = NULL,
};
struct resource *res;
+   unsigned long align;
 
if (!ndd)
return 0;
@@ -661,10 +686,20 @@ resource_size_t nd_blk_available_dpa(struct nd_region 
*nd_region)
device_for_each_child(&nvdimm_bus->dev, &info, alias_dpa_busy);
 
/* now account for busy blk allocations in unaliased dpa */
+   align = dpa_align(nd_region);
+   if (

[PATCH v4 4/5] libnvdimm/region: Introduce NDD_LABELING

2020-03-03 Thread Dan Williams
The NDD_ALIASING flag is used to indicate where pmem capacity might
alias with blk capacity and require labeling. It is also used to
indicate whether the DIMM supports labeling. Separate this latter
capability into its own flag so that the NDD_ALIASING flag is scoped to
true aliased configurations.

To my knowledge aliased configurations only exist in the ACPI spec,
there are no known platforms that ship this support in production.

This clarity allows namespace-capacity alignment constraints around
interleave-ways to be relaxed.

Cc: Vishal Verma 
Cc: Oliver O'Halloran 
Reviewed-by: Jeff Moyer 
Reviewed-by: Aneesh Kumar K.V 
Link: 
https://lore.kernel.org/r/158041477856.3889308.4212605617834097674.st...@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams 
---
 arch/powerpc/platforms/pseries/papr_scm.c |2 +-
 drivers/acpi/nfit/core.c  |4 +++-
 drivers/nvdimm/dimm.c |2 +-
 drivers/nvdimm/dimm_devs.c|9 +
 drivers/nvdimm/namespace_devs.c   |2 +-
 drivers/nvdimm/nd.h   |2 +-
 drivers/nvdimm/region_devs.c  |   10 +-
 include/linux/libnvdimm.h |2 ++
 8 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 0b4467e378e5..589858cb3203 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -328,7 +328,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
}
 
dimm_flags = 0;
-   set_bit(NDD_ALIASING, &dimm_flags);
+   set_bit(NDD_LABELING, &dimm_flags);
 
p->nvdimm = nvdimm_create(p->bus, p, NULL, dimm_flags,
  PAPR_SCM_DIMM_CMD_MASK, 0, NULL);
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index a3320f93616d..71d7f2aa1b12 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -2026,8 +2026,10 @@ static int acpi_nfit_register_dimms(struct 
acpi_nfit_desc *acpi_desc)
continue;
}
 
-   if (nfit_mem->bdw && nfit_mem->memdev_pmem)
+   if (nfit_mem->bdw && nfit_mem->memdev_pmem) {
set_bit(NDD_ALIASING, &flags);
+   set_bit(NDD_LABELING, &flags);
+   }
 
/* collate flags across all memdevs for this dimm */
list_for_each_entry(nfit_memdev, &acpi_desc->memdevs, list) {
diff --git a/drivers/nvdimm/dimm.c b/drivers/nvdimm/dimm.c
index 64776ed15bb3..7d4ddc4d9322 100644
--- a/drivers/nvdimm/dimm.c
+++ b/drivers/nvdimm/dimm.c
@@ -99,7 +99,7 @@ static int nvdimm_probe(struct device *dev)
if (ndd->ns_current >= 0) {
rc = nd_label_reserve_dpa(ndd);
if (rc == 0)
-   nvdimm_set_aliasing(dev);
+   nvdimm_set_labeling(dev);
}
nvdimm_bus_unlock(dev);
 
diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index 94ea6dba6b4f..39a61a514746 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -32,7 +32,7 @@ int nvdimm_check_config_data(struct device *dev)
 
if (!nvdimm->cmd_mask ||
!test_bit(ND_CMD_GET_CONFIG_DATA, &nvdimm->cmd_mask)) {
-   if (test_bit(NDD_ALIASING, &nvdimm->flags))
+   if (test_bit(NDD_LABELING, &nvdimm->flags))
return -ENXIO;
else
return -ENOTTY;
@@ -173,11 +173,11 @@ int nvdimm_set_config_data(struct nvdimm_drvdata *ndd, 
size_t offset,
return rc;
 }
 
-void nvdimm_set_aliasing(struct device *dev)
+void nvdimm_set_labeling(struct device *dev)
 {
struct nvdimm *nvdimm = to_nvdimm(dev);
 
-   set_bit(NDD_ALIASING, &nvdimm->flags);
+   set_bit(NDD_LABELING, &nvdimm->flags);
 }
 
 void nvdimm_set_locked(struct device *dev)
@@ -312,8 +312,9 @@ static ssize_t flags_show(struct device *dev,
 {
struct nvdimm *nvdimm = to_nvdimm(dev);
 
-   return sprintf(buf, "%s%s\n",
+   return sprintf(buf, "%s%s%s\n",
test_bit(NDD_ALIASING, &nvdimm->flags) ? "alias " : "",
+   test_bit(NDD_LABELING, &nvdimm->flags) ? "label " : "",
test_bit(NDD_LOCKED, &nvdimm->flags) ? "lock " : "");
 }
 static DEVICE_ATTR_RO(flags);
diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index 77e211c7d94d..01f6c22f0d1a 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -2538,7 +2538,7 @@ static int init_active_labels(struct nd_region *nd_region)
if (!ndd) {
if (test_bit(NDD_LOCKED, &nvdimm->flags))
/* fail, label data may be unreadable */;
-   else if (test_bit(NDD_ALIASING, &nvdimm->flags))
+ 

[PATCH v4 3/5] libnvdimm/namespace: Enforce memremap_compat_align()

2020-03-03 Thread Dan Williams
The pmem driver on PowerPC crashes with the following signature when
instantiating misaligned namespaces that map their capacity via
memremap_pages().

BUG: Unable to handle kernel data access at 0xc00100040600
Faulting instruction address: 0xc0090790
NIP [c0090790] arch_add_memory+0xc0/0x130
LR [c0090744] arch_add_memory+0x74/0x130
Call Trace:
 arch_add_memory+0x74/0x130 (unreliable)
 memremap_pages+0x74c/0xa30
 devm_memremap_pages+0x3c/0xa0
 pmem_attach_disk+0x188/0x770
 nvdimm_bus_probe+0xd8/0x470

With the assumption that only memremap_pages() has alignment
constraints, enforce memremap_compat_align() for
pmem_should_map_pages(), nd_pfn, and nd_dax cases. This includes
preventing the creation of namespaces where the base address is
misaligned and cases where infoblock padding parameters are invalid.

Reported-by: Aneesh Kumar K.V 
Cc: Jeff Moyer 
Fixes: a3619190d62e ("libnvdimm/pfn: stop padding pmem namespaces to section 
alignment")
Reviewed-by: Aneesh Kumar K.V 
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/namespace_devs.c |   17 +
 drivers/nvdimm/pfn.h|   12 
 drivers/nvdimm/pfn_devs.c   |   32 +---
 3 files changed, 58 insertions(+), 3 deletions(-)

diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index 032dc61725ff..77e211c7d94d 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -10,6 +10,7 @@
 #include 
 #include "nd-core.h"
 #include "pmem.h"
+#include "pfn.h"
 #include "nd.h"
 
 static void namespace_io_release(struct device *dev)
@@ -1739,6 +1740,22 @@ struct nd_namespace_common 
*nvdimm_namespace_common_probe(struct device *dev)
return ERR_PTR(-ENODEV);
}
 
+   /*
+* Note, alignment validation for fsdax and devdax mode
+* namespaces happens in nd_pfn_validate() where infoblock
+* padding parameters can be applied.
+*/
+   if (pmem_should_map_pages(dev)) {
+   struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
+   struct resource *res = &nsio->res;
+
+   if (!IS_ALIGNED(res->start | (res->end + 1),
+   memremap_compat_align())) {
+   dev_err(&ndns->dev, "%pr misaligned, unable to map\n", 
res);
+   return ERR_PTR(-EOPNOTSUPP);
+   }
+   }
+
if (is_namespace_pmem(&ndns->dev)) {
struct nd_namespace_pmem *nspm;
 
diff --git a/drivers/nvdimm/pfn.h b/drivers/nvdimm/pfn.h
index acb19517f678..37cb1b8a2a39 100644
--- a/drivers/nvdimm/pfn.h
+++ b/drivers/nvdimm/pfn.h
@@ -24,6 +24,18 @@ struct nd_pfn_sb {
__le64 npfns;
__le32 mode;
/* minor-version-1 additions for section alignment */
+   /**
+* @start_pad: Deprecated attribute to pad start-misaligned namespaces
+*
+* start_pad is deprecated because the original definition did
+* not comprehend that dataoff is relative to the base address
+* of the namespace not the start_pad adjusted base. The result
+* is that the dax path is broken, but the block-I/O path is
+* not. The kernel will no longer create namespaces using start
+* padding, but it still supports block-I/O for legacy
+* configurations mainly to allow a backup, reconfigure the
+* namespace, and restore flow to repair dax operation.
+*/
__le32 start_pad;
__le32 end_trunc;
/* minor-version-2 record the base alignment of the mapping */
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 79fe02d6f657..34db557dbad1 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -446,6 +446,7 @@ static bool nd_supported_alignment(unsigned long align)
 int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
 {
u64 checksum, offset;
+   struct resource *res;
enum nd_pfn_mode mode;
struct nd_namespace_io *nsio;
unsigned long align, start_pad;
@@ -578,13 +579,14 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char 
*sig)
 * established.
 */
nsio = to_nd_namespace_io(&ndns->dev);
-   if (offset >= resource_size(&nsio->res)) {
+   res = &nsio->res;
+   if (offset >= resource_size(res)) {
dev_err(&nd_pfn->dev, "pfn array size exceeds capacity of %s\n",
dev_name(&ndns->dev));
return -EOPNOTSUPP;
}
 
-   if ((align && !IS_ALIGNED(nsio->res.start + offset + start_pad, align))
+   if ((align && !IS_ALIGNED(res->start + offset + start_pad, align))
|| !IS_ALIGNED(offset, PAGE_SIZE)) {
dev_err(&nd_pfn->dev,
"bad offset: %#llx dax disabled align: %#lx\n",
@@ -592,6 +594,18 @@ int nd_pfn_va

[PATCH v4 2/5] libnvdimm/pfn: Prevent raw mode fallback if pfn-infoblock valid

2020-03-03 Thread Dan Williams
The EOPNOTSUPP return code from the pmem driver indicates that the
namespace has a configuration that may be valid, but the current kernel
does not support it. Expand this to all of the nd_pfn_validate() error
conditions after the infoblock has been verified as self consistent.

This prevents exposing the namespace to I/O when the infoblock needs to
be corrected, or the system needs to be put into a different
configuration (like changing the page size on PowerPC).
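For context, a simplified sketch of the probe flow this relies on
(condensed from drivers/nvdimm/pmem.c; details elided):

static int nd_pmem_probe_sketch(struct device *dev,
				struct nd_namespace_common *ndns)
{
	int ret = nd_pfn_probe(dev, ndns);

	if (ret == 0)
		return -ENXIO;	/* claimed by the pfn personality */
	if (ret == -EOPNOTSUPP)
		return ret;	/* valid infoblock, unsupported config */
	return pmem_attach_disk(dev, ndns);	/* raw-mode fallback */
}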

Cc: Aneesh Kumar K.V 
Cc: Jeff Moyer 
Reviewed-by: Aneesh Kumar K.V 
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/pfn_devs.c |8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index a5c25cb87116..79fe02d6f657 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -561,14 +561,14 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char 
*sig)
dev_dbg(&nd_pfn->dev, "align: %lx:%lx mode: %d:%d\n",
nd_pfn->align, align, nd_pfn->mode,
mode);
-   return -EINVAL;
+   return -EOPNOTSUPP;
}
}
 
if (align > nvdimm_namespace_capacity(ndns)) {
dev_err(&nd_pfn->dev, "alignment: %lx exceeds capacity %llx\n",
align, nvdimm_namespace_capacity(ndns));
-   return -EINVAL;
+   return -EOPNOTSUPP;
}
 
/*
@@ -581,7 +581,7 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
if (offset >= resource_size(&nsio->res)) {
dev_err(&nd_pfn->dev, "pfn array size exceeds capacity of %s\n",
dev_name(&ndns->dev));
-   return -EBUSY;
+   return -EOPNOTSUPP;
}
 
if ((align && !IS_ALIGNED(nsio->res.start + offset + start_pad, align))
@@ -589,7 +589,7 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
dev_err(&nd_pfn->dev,
"bad offset: %#llx dax disabled align: %#lx\n",
offset, align);
-   return -ENXIO;
+   return -EOPNOTSUPP;
}
 
return 0;



[PATCH v4 1/5] mm/memremap_pages: Introduce memremap_compat_align()

2020-03-03 Thread Dan Williams
The "sub-section memory hotplug" facility allows memremap_pages() users
like libnvdimm to compensate for hardware platforms like x86 that have a
section size larger than their hardware memory mapping granularity.  The
compensation that sub-section support affords is being tolerant of
physical memory resources shifting by units smaller (64MiB on x86) than
the memory-hotplug section size (128 MiB), where the platform
physical-memory mapping granularity is limited by the number and
capability of address-decode-registers in the memory controller.

While the sub-section support allows memremap_pages() to operate on
sub-section (2MiB) granularity, the Power architecture may still
require 16MiB alignment on "!radix_enabled()" platforms.

In order for libnvdimm to be able to detect and manage this per-arch
limitation, introduce memremap_compat_align() as a common minimum
alignment across all driver-facing memory-mapping interfaces, and let
Power override it to 16MiB in the "!radix_enabled()" case.

The assumption / requirement for 16MiB to be a viable
memremap_compat_align() value is that Power does not have platforms
where its equivalent of address-decode-registers ever hardware remaps a
persistent memory resource on smaller than 16MiB boundaries. Note that I
tried my best to not add a new Kconfig symbol, but header include
entanglements defeated the #ifndef memremap_compat_align design pattern
and the need to export it defeats the __weak design pattern for arch
overrides.

Based on an initial patch by Aneesh.

Link: 
http://lore.kernel.org/r/capcyv4gbgnp95apyabcsocea50tqj9b5h__83vgngjq3oug...@mail.gmail.com
Reported-by: Aneesh Kumar K.V 
Reported-by: Jeff Moyer 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Reviewed-by: Aneesh Kumar K.V 
Signed-off-by: Dan Williams 
---
 arch/powerpc/Kconfig  |1 +
 arch/powerpc/mm/ioremap.c |   21 +
 drivers/nvdimm/pfn_devs.c |2 +-
 include/linux/memremap.h  |8 
 include/linux/mmzone.h|1 +
 lib/Kconfig   |3 +++
 mm/memremap.c |   23 +++
 7 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 497b7d0b2d7e..e6ffe905e2b9 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -122,6 +122,7 @@ config PPC
select ARCH_HAS_GCOV_PROFILE_ALL
select ARCH_HAS_KCOV
select ARCH_HAS_HUGEPD  if HUGETLB_PAGE
+   select ARCH_HAS_MEMREMAP_COMPAT_ALIGN
select ARCH_HAS_MMIOWB  if PPC64
select ARCH_HAS_PHYS_TO_DMA
select ARCH_HAS_PMEM_API
diff --git a/arch/powerpc/mm/ioremap.c b/arch/powerpc/mm/ioremap.c
index fc669643ce6a..b1a0aebe8c48 100644
--- a/arch/powerpc/mm/ioremap.c
+++ b/arch/powerpc/mm/ioremap.c
@@ -2,6 +2,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -97,3 +98,23 @@ void __iomem *do_ioremap(phys_addr_t pa, phys_addr_t offset, 
unsigned long size,
 
return NULL;
 }
+
+#ifdef CONFIG_ZONE_DEVICE
+/*
+ * Override the generic version in mm/memremap.c.
+ *
+ * With hash translation, the direct-map range is mapped with just one
+ * page size selected by htab_init_page_sizes(). Consult
+ * mmu_psize_defs[] to determine the minimum page size alignment.
+*/
+unsigned long memremap_compat_align(void)
+{
+   unsigned int shift = mmu_psize_defs[mmu_linear_psize].shift;
+
+   if (radix_enabled())
+   return SUBSECTION_SIZE;
+   return max(SUBSECTION_SIZE, 1UL << shift);
+
+}
+EXPORT_SYMBOL_GPL(memremap_compat_align);
+#endif
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index b94f7a7e94b8..a5c25cb87116 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -750,7 +750,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
start = nsio->res.start;
size = resource_size(&nsio->res);
npfns = PHYS_PFN(size - SZ_8K);
-   align = max(nd_pfn->align, (1UL << SUBSECTION_SHIFT));
+   align = max(nd_pfn->align, SUBSECTION_SIZE);
end_trunc = start + size - ALIGN_DOWN(start + size, align);
if (nd_pfn->mode == PFN_MODE_PMEM) {
/*
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 6fefb09af7c3..8af1cbd8f293 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -132,6 +132,7 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 
 unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
 void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns);
+unsigned long memremap_compat_align(void);
 #else
 static inline void *devm_memremap_pages(struct device *dev,
struct dev_pagemap *pgmap)
@@ -165,6 +166,12 @@ static inline void vmem_altmap_free(struct vmem_altmap 
*altmap,
unsigned long nr_pfns)
 {
 }
+
+/* when memremap_pages() is disabled all archs can remap a single page */
+static inline unsigned long memremap_compat_align(void)
+

[PATCH v4 0/5] libnvdimm: Cross-arch compatible namespace alignment

2020-03-03 Thread Dan Williams
Changes since v3 [1]:
- Collect Aneesh's Reviewed-by for Patch1-Patch3

- Add commentary to "libnvdimm/namespace: Enforce
  memremap_compat_align()" to clarify alignment validation and why
  start_pad is deprecated (Aneesh)

[1]: 
http://lore.kernel.org/r/158291746615.1609624.7591692546429050845.st...@dwillia2-desk3.amr.corp.intel.com

---

Patch "mm/memremap_pages: Introduce memremap_compat_align()" still
needs a PowerPC maintainer ack for the touches to
arch/powerpc/mm/ioremap.c.

---

Aneesh reports that PowerPC requires 16MiB alignment for the address
range passed to devm_memremap_pages(), and Jeff reports that it is
possible to create a misaligned namespace which blocks future namespace
creation in that region. Both of these issues require namespace
alignment to be managed at the region level rather than padding at the
namespace level which has been a broken approach to date.

Introduce memremap_compat_align() to indicate the hard requirements of
an arch's memremap_pages() implementation. Use the maximum known
memremap_compat_align() to set the default namespace alignment for
libnvdimm. Consult that alignment when allocating free space. Finally,
allow the default region alignment to be overridden to maintain the same
namespace creation capability as previous kernels (modulo dax operation
not being supported with a non-zero start_pad).

The ndctl unit tests, which have some misaligned namespace assumptions,
are updated to use the alignment override where necessary.

Thanks to Aneesh for early feedback and testing on this change to
alignment handling.

---

Dan Williams (5):
  mm/memremap_pages: Introduce memremap_compat_align()
  libnvdimm/pfn: Prevent raw mode fallback if pfn-infoblock valid
  libnvdimm/namespace: Enforce memremap_compat_align()
  libnvdimm/region: Introduce NDD_LABELING
  libnvdimm/region: Introduce an 'align' attribute


 arch/powerpc/Kconfig  |1 
 arch/powerpc/mm/ioremap.c |   21 +
 arch/powerpc/platforms/pseries/papr_scm.c |2 
 drivers/acpi/nfit/core.c  |4 +
 drivers/nvdimm/dimm.c |2 
 drivers/nvdimm/dimm_devs.c|   95 +
 drivers/nvdimm/namespace_devs.c   |   28 +-
 drivers/nvdimm/nd.h   |3 -
 drivers/nvdimm/pfn.h  |   12 +++
 drivers/nvdimm/pfn_devs.c |   40 +++--
 drivers/nvdimm/region_devs.c  |  132 ++---
 include/linux/libnvdimm.h |2 
 include/linux/memremap.h  |8 ++
 include/linux/mmzone.h|1 
 lib/Kconfig   |3 +
 mm/memremap.c |   23 +
 16 files changed, 330 insertions(+), 47 deletions(-)

base-commit: 1d0827b75ee7df497f611a2ac412a88135fb0ef5


Re: [PATCH net-next 00/23] Clean driver, module and FW versions

2020-03-03 Thread David Miller
From: Leon Romanovsky 
Date: Sun,  1 Mar 2020 16:44:33 +0200

> From: Leon Romanovsky 
> 
> Hi,
> 
> This is second batch of the series which removes various static versions
> in favour of globaly defined Linux kernel version.
> 
> The first part with better cover letter can be found here
> https://lore.kernel.org/lkml/20200224085311.460338-1-l...@kernel.org
> 
> The code is based on
> 68e2c37690b0 ("Merge branch 'hsr-several-code-cleanup-for-hsr-module'")
> 
> and WIP branch is
> https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/log/?h=ethtool

Series applied, thanks.


Re: [PATCH V14] mm/debug: Add tests validating architecture page table helpers

2020-03-03 Thread Qian Cai


> Below is slightly modified version of your change above and should still
> prevent the bug on powerpc. Will it be possible for you to re-test this
> ? Once confirmed, will send a patch enabling this test on powerpc64
> keeping your authorship. Thank you.

This works fine on the radix MMU, but I decided to go a bit further and test
the hash MMU. The kernel gets stuck here, below. I did confirm that
pte_alloc_map_lock() was successful, so I don't understand the hash MMU well
enough to tell why it could still take an interrupt at pte_clear_tests() even
before we call pte_unmap_unlock()?
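For reference, a simplified sketch of the pte_clear_tests() step the
decoded trace points at (based on mm/debug_vm_pgtable.c from this
series; details may differ from the exact tree under test):

static void __init pte_clear_tests(struct mm_struct *mm, pte_t *ptep,
				   unsigned long vaddr)
{
	pte_t pte = READ_ONCE(*ptep);

	pte = __pte(pte_val(pte) | RANDOM_ORVALUE);
	set_pte_at(mm, vaddr, ptep, pte);
	barrier();
	/* pte_clear() is where the trace lands, via hash__pte_update() */
	pte_clear(mm, vaddr, ptep);
	pte = READ_ONCE(*ptep);
	WARN_ON(!pte_none(pte));
}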

[   33.881515][T1] ok 8 - property-entry
[   33.883653][T1] debug_vm_pgtable: debug_vm_pgtable: Validating
architecture page table helpers
[   60.418885][C8] watchdog: BUG: soft lockup - CPU#8 stuck for 23s!
[swapper/0:1]
[   60.418913][C8] Modules linked in:
[   60.418927][C8] irq event stamp: 2896762
[   60.418945][C8] hardirqs last  enabled at (2896761): []
fast_exc_return_irq+0x28/0x34
[   60.418960][C8] hardirqs last disabled at (2896762): []
decrementer_common+0x10c/0x130
[   60.418985][C8] softirqs last  enabled at (2896760): []
__do_softirq+0x640/0x8c8
[   60.419009][C8] softirqs last disabled at (2896753): []
irq_exit+0x16c/0x1d0
[   60.419024][C8] CPU: 8 PID: 1 Comm: swapper/0 Not tainted 5.6.0-rc4-next-
20200303+ #7
[   60.419055][C8] NIP:  c103dc14 LR: c103db0c CTR:

[   60.419076][C8] REGS: c0003dd4fa30 TRAP: 0901   Not tainted  (5.6.0-
rc4-next-20200303+)
[   60.419107][C8] MSR:  90009033   CR:
42000222  XER: 
[   60.419134][C8] CFAR: c103dc1c IRQMASK: 0 
[   60.419134][C8] GPR00: c103db0c c0003dd4fcc0 c1657d00
0521000100c0 
[   60.419134][C8] GPR04: 8105 000a f4d9864c
0001 
[   60.419134][C8] GPR08:   0001
000a 
[   60.419134][C8] GPR12:  c01f9880 
[   60.419220][C8] NIP [c103dc14] debug_vm_pgtable+0x7a8/0xbb4
hash__pte_update at arch/powerpc/include/asm/book3s/64/hash.h:159
(inlined by) pte_update at arch/powerpc/include/asm/book3s/64/pgtable.h:359
(inlined by) pte_clear at arch/powerpc/include/asm/book3s/64/pgtable.h:477
(inlined by) pte_clear_tests at mm/debug_vm_pgtable.c:259
(inlined by) debug_vm_pgtable at mm/debug_vm_pgtable.c:368
[   60.419241][C8] LR [c103db0c] debug_vm_pgtable+0x6a0/0xbb4
pmd_basic_tests at mm/debug_vm_pgtable.c:74
(inlined by) debug_vm_pgtable at mm/debug_vm_pgtable.c:363
[   60.419260][C8] Call Trace:
[   60.419278][C8] [c0003dd4fcc0] [c103d994]
debug_vm_pgtable+0x528/0xbb4 (unreliable)
[   60.419302][C8] [c0003dd4fdb0] [c0010eac]
kernel_init+0x30/0x194
[   60.419325][C8] [c0003dd4fe20] [c000b748]
ret_from_kernel_thread+0x5c/0x74
[   60.419363][C8] Instruction dump:
[   60.419382][C8] 7d075078 7ce74b78 7ce0f9ad 40c2fff0 7e449378 7fc3f378
4b03531d 6000 
[   60.419416][C8] 4880 3920 3941 3900 <7e00f8a8> 7e075039
40c2fff8 7e074878 
[   98.908889][C8] rcu: INFO: rcu_sched self-detected stall on CPU
[   98.908933][C8] rcu: 8-: (6500 ticks this GP)
idle=522/1/0x4002 softirq=132/132 fqs=3250 
[   98.908963][C8] (t=6501 jiffies g=-719 q=510)
[   98.908984][C8] NMI backtrace for cpu 8
[   98.909012][C8] CPU: 8 PID: 1 Comm: swapper/0 Tainted:
G L    5.6.0-rc4-next-20200303+ #7
[   98.909025][C8] Call Trace:
[   98.909046][C8] [c0003dd4f360] [c0970fe0]
dump_stack+0xf4/0x164 (unreliable)
[   98.909070][C8] [c0003dd4f3b0] [c097dcf4]
nmi_cpu_backtrace+0x1b4/0x1e0
[   98.909084][C8] [c0003dd4f450] [c097df48]
nmi_trigger_cpumask_backtrace+0x228/0x2c0
[   98.909118][C8] [c0003dd4f500] [c0057bf8]
arch_trigger_cpumask_backtrace+0x28/0x40
[   98.909152][C8] [c0003dd4f520] [c0202dd4]
rcu_dump_cpu_stacks+0x1c4/0x234
[   98.909184][C8] [c0003dd4f5a0] [c0201634]
rcu_sched_clock_irq+0xd54/0x1130
[   98.909207][C8] [c0003dd4f6c0] [c0217068]
update_process_times+0x48/0xb0
[   98.909239][C8] [c0003dd4f6f0] [c02358b4]
tick_sched_handle+0x34/0xb0
[   98.909262][C8] [c0003dd4f720] [c02361d8]
tick_sched_timer+0x68/0xe0
[   98.909284][C8] [c0003dd4f760] [c0219768]
__hrtimer_run_queues+0x528/0xa60
[   98.909306][C8] [c0003dd4f880] [c021ab58]
hrtimer_interrupt+0x128/0x330
[   98.909329][C8] [c0003dd4f930] [c002e1b4]
timer_interrupt+0x264/0x680
[   98.909352][C8] [c0003dd4f9c0] [c0009264]
decrementer_common+0x124/0x130
[   98.909366][C8] --- interrupt: 901 at debug_vm_pgtable+0x7a8/0xbb4
[   98.909366][C8] LR = debug_vm_pgtable+0x6a0/0xbb4
[  

[PATCH v3] powerpc: Replace setup_irq() by request_irq()

2020-03-03 Thread afzal mohammed
request_irq() is preferred over setup_irq(). Invocations of setup_irq()
occur after memory allocators are ready.

Per tglx[1], setup_irq() existed in olden days when allocators were not
ready by the time early interrupts were initialized.

Hence replace setup_irq() by request_irq().

[1] https://lkml.kernel.org/r/alpine.DEB.2.20.1710191609480.1971@nanos

Signed-off-by: afzal mohammed 
---
Hi powerpc maintainers,

if okay w/ this change, please consider taking it thr' your tree, else please
let me know.

Regards
afzal

Link to v2 & v1,
[v2] https://lkml.kernel.org/r/cover.1582471508.git.afzal.mohd...@gmail.com
[v1] https://lkml.kernel.org/r/cover.1581478323.git.afzal.mohd...@gmail.com

v3:
 * Split out from tree wide series, as Thomas suggested to get it thr'
respective maintainers
 * Modify pr_err displayed in case of error
 * Re-arrange code & choose pr_err args as required to improve readability
 * Remove irrelevant parts from commit message & improve
 
v2:
 * Replace pr_err("request_irq() on %s failed" by
   pr_err("%s: request_irq() failed"
 * Commit message massage

 arch/powerpc/platforms/85xx/mpc85xx_cds.c | 10 +++-
 arch/powerpc/platforms/8xx/cpm1.c |  9 ++-
 arch/powerpc/platforms/8xx/m8xx_setup.c   |  9 ++-
 arch/powerpc/platforms/chrp/setup.c   | 14 ---
 arch/powerpc/platforms/powermac/pic.c | 29 +--
 arch/powerpc/platforms/powermac/smp.c | 12 --
 6 files changed, 28 insertions(+), 55 deletions(-)

diff --git a/arch/powerpc/platforms/85xx/mpc85xx_cds.c 
b/arch/powerpc/platforms/85xx/mpc85xx_cds.c
index 6b1436abe9b1..1c5598877d70 100644
--- a/arch/powerpc/platforms/85xx/mpc85xx_cds.c
+++ b/arch/powerpc/platforms/85xx/mpc85xx_cds.c
@@ -218,12 +218,6 @@ static irqreturn_t mpc85xx_8259_cascade_action(int irq, 
void *dev_id)
 {
return IRQ_HANDLED;
 }
-
-static struct irqaction mpc85xxcds_8259_irqaction = {
-   .handler = mpc85xx_8259_cascade_action,
-   .flags = IRQF_SHARED | IRQF_NO_THREAD,
-   .name = "8259 cascade",
-};
 #endif /* PPC_I8259 */
 #endif /* CONFIG_PCI */
 
@@ -271,7 +265,9 @@ static int mpc85xx_cds_8259_attach(void)
 *  disabled when the last user of the shared IRQ line frees their
 *  interrupt.
 */
-   if ((ret = setup_irq(cascade_irq, &mpc85xxcds_8259_irqaction))) {
+   ret = request_irq(cascade_irq, mpc85xx_8259_cascade_action,
+ IRQF_SHARED | IRQF_NO_THREAD, "8259 cascade", NULL);
+   if (ret) {
printk(KERN_ERR "Failed to setup cascade interrupt\n");
return ret;
}
diff --git a/arch/powerpc/platforms/8xx/cpm1.c 
b/arch/powerpc/platforms/8xx/cpm1.c
index a43ee7d1ff85..4db4ca2e1222 100644
--- a/arch/powerpc/platforms/8xx/cpm1.c
+++ b/arch/powerpc/platforms/8xx/cpm1.c
@@ -120,12 +120,6 @@ static irqreturn_t cpm_error_interrupt(int irq, void *dev)
return IRQ_HANDLED;
 }
 
-static struct irqaction cpm_error_irqaction = {
-   .handler = cpm_error_interrupt,
-   .flags = IRQF_NO_THREAD,
-   .name = "error",
-};
-
 static const struct irq_domain_ops cpm_pic_host_ops = {
.map = cpm_pic_host_map,
 };
@@ -187,7 +181,8 @@ unsigned int __init cpm_pic_init(void)
if (!eirq)
goto end;
 
-   if (setup_irq(eirq, &cpm_error_irqaction))
+   if (request_irq(eirq, cpm_error_interrupt, IRQF_NO_THREAD, "error",
+   NULL))
printk(KERN_ERR "Could not allocate CPM error IRQ!");
 
setbits32(&cpic_reg->cpic_cicr, CICR_IEN);
diff --git a/arch/powerpc/platforms/8xx/m8xx_setup.c 
b/arch/powerpc/platforms/8xx/m8xx_setup.c
index f1c805c8adbc..df4d57d07f9a 100644
--- a/arch/powerpc/platforms/8xx/m8xx_setup.c
+++ b/arch/powerpc/platforms/8xx/m8xx_setup.c
@@ -39,12 +39,6 @@ static irqreturn_t timebase_interrupt(int irq, void *dev)
return IRQ_HANDLED;
 }
 
-static struct irqaction tbint_irqaction = {
-   .handler = timebase_interrupt,
-   .flags = IRQF_NO_THREAD,
-   .name = "tbint",
-};
-
 /* per-board overridable init_internal_rtc() function. */
 void __init __attribute__ ((weak))
 init_internal_rtc(void)
@@ -157,7 +151,8 @@ void __init mpc8xx_calibrate_decr(void)
(TBSCR_TBF | TBSCR_TBE));
immr_unmap(sys_tmr2);
 
-   if (setup_irq(virq, &tbint_irqaction))
+   if (request_irq(virq, timebase_interrupt, IRQF_NO_THREAD, "tbint",
+   NULL))
panic("Could not allocate timer IRQ!");
 }
 
diff --git a/arch/powerpc/platforms/chrp/setup.c 
b/arch/powerpc/platforms/chrp/setup.c
index fcf6f2342ef4..8328cd5817b0 100644
--- a/arch/powerpc/platforms/chrp/setup.c
+++ b/arch/powerpc/platforms/chrp/setup.c
@@ -451,13 +451,6 @@ static void __init chrp_find_openpic(void)
of_node_put(np);
 }
 
-#if defined(CONFIG_VT) && defined(CONFIG_INPUT_ADBHID) && defined(CONFIG_XMON)
-static struct irqaction xmon_irqaction = {
-

Re: [PATCH v3 14/18] docs: powerpc: convert vcpudispatch_stats.txt to ReST

2020-03-03 Thread Michael Ellerman
Mauro Carvalho Chehab  writes:
> - Add a SPDX header;
> - Use standard markup for document title;
> - Adjust indentation on lists and add blank lines where
>   needed;
> - Add it to the powerpc index.rst file.
>
> Signed-off-by: Mauro Carvalho Chehab 
> ---
>  Documentation/powerpc/index.rst |  1 +
>  ...ispatch_stats.txt => vcpudispatch_stats.rst} | 17 -
>  2 files changed, 13 insertions(+), 5 deletions(-)
>  rename Documentation/powerpc/{vcpudispatch_stats.txt => 
> vcpudispatch_stats.rst} (94%)

LGTM.

Acked-by: Michael Ellerman  (powerpc)

I'm going to assume this will go via the docs tree, unless you tell me 
otherwise.

cheers


[PATCH v2] powerpc/mm: Fix missing KUAP disable in flush_coherent_icache()

2020-03-03 Thread Michael Ellerman
We received a report of strange kernel faults which turned out to be
due to a missing KUAP disable in flush_coherent_icache() called
from flush_icache_range().

The fault looks like:

  Kernel attempted to access user page (7fffc30d9c00) - exploit attempt? (uid: 
1009)
  BUG: Unable to handle kernel data access on read at 0x7fffc30d9c00
  Faulting instruction address: 0xc007232c
  Oops: Kernel access of bad area, sig: 11 [#1]
  LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
  CPU: 35 PID: 5886 Comm: sigtramp Not tainted 
5.6.0-rc2-gcc-8.2.0-3-gfc37a1632d40 #79
  NIP:  c007232c LR: c003b7fc CTR: 
  REGS: c01e11093940 TRAP: 0300   Not tainted  
(5.6.0-rc2-gcc-8.2.0-3-gfc37a1632d40)
  MSR:  9280b033   CR: 28000884  
XER: 
  CFAR: c00722fc DAR: 7fffc30d9c00 DSISR: 0800 IRQMASK: 0
  GPR00: c003b7fc c01e11093bd0 c23ac200 7fffc30d9c00
  GPR04: 7fffc30d9c18  c01e11093bd4 
  GPR08:  0001  c01e1104ed80
  GPR12:  c01fff6ab380 c16be2d0 4000
  GPR16: c000 bfff  
  GPR20: 7fffc30d9c00 7fffc30d8f58 7fffc30d9c18 7fffc30d9c20
  GPR24: 7fffc30d9c18  c01e11093d90 c01e1104ed80
  GPR28: c01e11093e90  c23d9d18 7fffc30d9c00
  NIP flush_icache_range+0x5c/0x80
  LR  handle_rt_signal64+0x95c/0xc2c
  Call Trace:
0xc01e11093d90 (unreliable)
handle_rt_signal64+0x93c/0xc2c
do_notify_resume+0x310/0x430
ret_from_except_lite+0x70/0x74
  Instruction dump:
  409e002c 7c0802a6 3c62ff31 3863f6a0 f8010080 48195fed 6000 48fe4c8d
  6000 e8010080 7c0803a6 7c0004ac <7c00ffac> 7c0004ac 4c00012c 38210070

This path through handle_rt_signal64() to setup_trampoline() and
flush_icache_range() is only triggered by 64-bit processes that have
unmapped their VDSO, which is rare.

flush_icache_range() takes a range of addresses to flush. In
flush_coherent_icache() we implement an optimisation for CPUs where we
know we don't actually have to flush the whole range, we just need to
do a single icbi.

However we still execute the icbi on the user address of the start of
the range we're flushing. On CPUs that also implement KUAP (Power9)
that leads to the spurious fault above.

We should be able to pass any address, including a kernel address, to
the icbi on these CPUs, which would avoid any interaction with KUAP.
But I don't want to make that change in a bug fix, just in case it
surfaces some strange behaviour on some CPU.

So for now just disable KUAP around the icbi. Note the icbi is treated
as a load, so we allow read access, not write as you'd expect.

Fixes: 890274c2dc4c ("powerpc/64s: Implement KUAP for Radix MMU")
Cc: sta...@vger.kernel.org # v5.2+
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/mm/mem.c | 2 ++
 1 file changed, 2 insertions(+)

v2: Use L1_CACHE_BYTES as suggested by Christophe.

diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index ef7b1119b2e2..36a8c7b105ce 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -373,7 +373,9 @@ static inline bool flush_coherent_icache(unsigned long addr)
 */
if (cpu_has_feature(CPU_FTR_COHERENT_ICACHE)) {
mb(); /* sync */
+   allow_read_from_user((void *)addr, L1_CACHE_BYTES);
icbi((void *)addr);
+   prevent_read_from_user((void *)addr, L1_CACHE_BYTES);
mb(); /* sync */
isync();
return true;
-- 
2.21.1



Re: [PATCH] powerpc/mm: Fix missing KUAP disable in flush_coherent_icache()

2020-03-03 Thread Michael Ellerman
Christophe Leroy  writes:
> Le 03/03/2020 à 13:59, Michael Ellerman a écrit :
>> We received a report of strange kernel faults which turned out to be
>> due to a missing KUAP disable in flush_coherent_icache() called
>> from flush_icache_range().
>> 
>> The fault looks like:
>> 
>>Kernel attempted to access user page (7fffc30d9c00) - exploit attempt? 
>> (uid: 1009)
>>BUG: Unable to handle kernel data access on read at 0x7fffc30d9c00
>>Faulting instruction address: 0xc007232c
>>Oops: Kernel access of bad area, sig: 11 [#1]
>>LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
>>CPU: 35 PID: 5886 Comm: sigtramp Not tainted 
>> 5.6.0-rc2-gcc-8.2.0-3-gfc37a1632d40 #79
>>NIP:  c007232c LR: c003b7fc CTR: 
>>REGS: c01e11093940 TRAP: 0300   Not tainted  
>> (5.6.0-rc2-gcc-8.2.0-3-gfc37a1632d40)
>>MSR:  9280b033   CR: 28000884 
>>  XER: 
>>CFAR: c00722fc DAR: 7fffc30d9c00 DSISR: 0800 IRQMASK: 0
>>GPR00: c003b7fc c01e11093bd0 c23ac200 7fffc30d9c00
>>GPR04: 7fffc30d9c18  c01e11093bd4 
>>GPR08:  0001  c01e1104ed80
>>GPR12:  c01fff6ab380 c16be2d0 4000
>>GPR16: c000 bfff  
>>GPR20: 7fffc30d9c00 7fffc30d8f58 7fffc30d9c18 7fffc30d9c20
>>GPR24: 7fffc30d9c18  c01e11093d90 c01e1104ed80
>>GPR28: c01e11093e90  c23d9d18 7fffc30d9c00
>>NIP flush_icache_range+0x5c/0x80
>>LR  handle_rt_signal64+0x95c/0xc2c
>>Call Trace:
>>  0xc01e11093d90 (unreliable)
>>  handle_rt_signal64+0x93c/0xc2c
>>  do_notify_resume+0x310/0x430
>>  ret_from_except_lite+0x70/0x74
>>Instruction dump:
>>409e002c 7c0802a6 3c62ff31 3863f6a0 f8010080 48195fed 6000 48fe4c8d
>>6000 e8010080 7c0803a6 7c0004ac <7c00ffac> 7c0004ac 4c00012c 38210070
>> 
>> This path through handle_rt_signal64() to setup_trampoline() and
>> flush_icache_range() is only triggered by 64-bit processes that have
>> unmapped their VDSO, which is rare.
>> 
>> flush_icache_range() takes a range of addresses to flush. In
>> flush_coherent_icache() we implement an optimisation for CPUs where we
>> know we don't actually have to flush the whole range, we just need to
>> do a single icbi.
>> 
>> However we still execute the icbi on the user address of the start of
>> the range we're flushing. On CPUs that also implement KUAP (Power9)
>> that leads to the spurious fault above.
>> 
>> We should be able to pass any address, including a kernel address, to
>> the icbi on these CPUs, which would avoid any interaction with KUAP.
>> But I don't want to make that change in a bug fix, just in case it
>> surfaces some strange behaviour on some CPU.
>> 
>> So for now just disable KUAP around the icbi. Note the icbi is treated
>> as a load, so we allow read access, not write as you'd expect.
>> 
>> Fixes: 890274c2dc4c ("powerpc/64s: Implement KUAP for Radix MMU")
>> Cc: sta...@vger.kernel.org # v5.2+
>> Signed-off-by: Michael Ellerman 
>> ---
>>   arch/powerpc/mm/mem.c | 2 ++
>>   1 file changed, 2 insertions(+)
>> 
>> diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
>> index ef7b1119b2e2..184850d9d000 100644
>> --- a/arch/powerpc/mm/mem.c
>> +++ b/arch/powerpc/mm/mem.c
>> @@ -373,7 +373,9 @@ static inline bool flush_coherent_icache(unsigned long 
>> addr)
>>   */
>>  if (cpu_has_feature(CPU_FTR_COHERENT_ICACHE)) {
>>  mb(); /* sync */
>> +allow_read_from_user((void *)addr, 4);
>
> I know that's ignored on Radix, but shouldn't we use L1_CACHE_BYTES as a 
> length ?

Yes you're right, I actually meant to do that but forgot.

Will send a v2.

cheers


Re: [RFC PATCH v1] powerpc/prom_init: disable XIVE in Secure VM.

2020-03-03 Thread Greg Kurz
On Tue, 3 Mar 2020 09:02:05 -0800
Ram Pai  wrote:

> On Tue, Mar 03, 2020 at 07:50:08AM +0100, Cédric Le Goater wrote:
> > On 3/3/20 12:32 AM, David Gibson wrote:
> > > On Fri, Feb 28, 2020 at 11:54:04PM -0800, Ram Pai wrote:
> > >> XIVE is not correctly enabled for Secure VM in the KVM Hypervisor yet.
> > >>
> > >> Hence Secure VM, must always default to XICS interrupt controller.
> > >>
> > >> If XIVE is requested through kernel command line option "xive=on",
> > >> override and turn it off.
> > >>
> > >> If XIVE is the only supported platform interrupt controller; specified
> > >> through qemu option "ic-mode=xive", simply abort. Otherwise default to
> > >> XICS.
> > > 
> > > Uh... the discussion thread here seems to have gotten oddly off
> > > track.  
> > 
> > There seem to be multiple issues. It is difficult to have a clear status.
> > 
> > > So, to try to clean up some misunderstandings on both sides:
> > > 
> > >   1) The guest is the main thing that knows that it will be in secure
> > >  mode, so it's reasonable for it to conditionally use XIVE based
> > >  on that
> > 
> > FW support is required AFAIUI.
> > >   2) The mechanism by which we do it here isn't quite right.  Here the
> > >  guest is checking itself that the host only allows XIVE, but we
> > >  can't do XIVE and is panic()ing.  Instead, in the SVM case we
> > >  should force support->xive to false, and send that in the CAS
> > >  request to the host.  We expect the host to just terminate
> > >  us because of the mismatch, but this will interact better with
> > >  host side options setting policy for panic states and the like.
> > >  Essentially an SVM kernel should behave like an old kernel with
> > >  no XIVE support at all, at least w.r.t. the CAS irq mode flags.
> > 
> > Yes. XIVE shouldn't be requested by the guest.
> 
>   Ok.
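To make (2) concrete, a sketch of what that could look like in
prom_init's CAS support parsing (modeled on
arch/powerpc/kernel/prom_init.c; the prom_svm_enable check and its
placement are assumptions, not the posted RFC):

static void __init prom_parse_xive_model(u8 val,
					 struct platform_support *support)
{
	switch (val) {
	case OV5_FEAT(OV5_XIVE_EITHER):  /* either mode available */
	case OV5_FEAT(OV5_XIVE_EXPLOIT): /* only XIVE available */
		/* a secure VM behaves like a kernel without XIVE */
		support->xive = !prom_svm_enable;
		break;
	}
}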
> 
> > This is the last option 
> > I proposed but I thought there was some negotiation with the hypervisor
> > which is not the case. 
> > 
> > >   3) Although there are means by which the hypervisor can kind of know
> > >  a guest is in secure mode, there's not really an "svm=on" option
> > >  on the host side.  For the most part secure mode is based on
> > >  discussion directly between the guest and the ultravisor with
> > >  almost no hypervisor intervention.
> > 
> > Is there a negotiation with the ultravisor ? 
> 
>   The VM has no negotiation with the ultravisor w.r.t CAS.
> 
> > 
> > >   4) I'm guessing the problem with XIVE in SVM mode is that XIVE needs
> > >  to write to event queues in guest memory, which would have to be
> > >  explicitly shared for secure mode.  That's true whether it's KVM
> > >  or qemu accessing the guest memory, so kernel_irqchip=on/off is
> > >  entirely irrelevant.
> > 
> > This problem should be already fixed.
> > The XIVE event queues are shared 
>   
> Yes I have a patch for the guest kernel that shares the event 
> queue page with the hypervisor. This is done using the
> UV_SHARE_PAGE ultracall. This patch is not sent out to any mailing
> lists yet.

Why ?

> However the patch by itself does not solve the xive problem
> for secure VM.
> 

This patch would allow at least to answer Cedric's question about
kernel_irqchip=off, since this looks like the only thing needed
to make it work.
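For reference, a minimal sketch of such sharing. The uv_share_page()
wrapper exists in arch/powerpc/include/asm/ultravisor.h, but the helper
below and its call site are assumptions, since the actual patch is
unposted:

#include <linux/pfn.h>
#include <asm/svm.h>
#include <asm/ultravisor.h>

/* Mark a guest-allocated XIVE event queue page as shared so the
 * hypervisor (QEMU/KVM) can enqueue events into it. */
static int xive_share_eq_page(void *qpage, unsigned int order)
{
	if (!is_secure_guest())
		return 0;
	return uv_share_page(PHYS_PFN(__pa(qpage)), 1UL << order);
}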

> > and the remaining problem with XIVE is the KVM page fault handler 
> > populating the TIMA and ESB pages. Ultravisor doesn't seem to support
> > this feature and this breaks interrupt management in the guest. 
> 
> Yes. This is the bigger issue that needs to be fixed. When the secure guest
> accesses the page associated with the xive memslot, a page fault is
> generated, which the ultravisor reflects to the hypervisor. Hypervisor
> seems to be mapping Hardware-page to that GPA. Unfortunately it is not
> informing the ultravisor of that map.  I am trying to understand the
> root cause. But since I am not sure what more issues I might run into
> after chasing down that issue, I figured it's better to disable xive
> support in SVM in the interim.
> 
>  BTW: I figured, I don't need this interim patch to disable xive for
> secure VM.  Just doing "svm=on xive=off" on the kernel command line is
> sufficient for now. *
> 

No it is not. If the hypervisor doesn't propose XIVE (ie. ic-mode=xive
on the QEMU command line), the kernel simply ignores "xive=off".

> 
> > 
> > But, kernel_irqchip=off should work out of the box. It seems it doesn't. 
> > Something to investigate.
> 
> Dont know why. 
> 
> Does this option, disable the chip from interrupting the
> guest directly; instead mediates the interrupt through the hypervisor?
> 
> > 
> > > 
> > >   5) All the above said, having to use XICS is pretty crappy.  You
> > >  should really get working on XIVE support for secure VMs.
> > 
> > Yes. 
> 
> and yes too.
> 
> 
> Summary:  I am dropping this patch for now.
> 
> > 
> > Thanks,
> > 
> > C.
> 



Re: [PATCH v3 02/14] powerpc: Define new SRR1 bits for a future ISA version

2020-03-03 Thread Alistair Popple
On Wednesday, 26 February 2020 3:07:04 PM AEDT Jordan Niethe wrote:
> Add the BOUNDARY SRR1 bit definition for when the cause of an alignment
> exception is a prefixed instruction that crosses a 64-byte boundary.
> Add the PREFIXED SRR1 bit definition for exceptions caused by prefixed
> instructions.
> 
> Bit 35 of SRR1 is called SRR1_ISI_N_OR_G. This name comes from it being
> used to indicate that an ISI was due to the access being no-exec or
> guarded. A future ISA version adds another purpose. It is also set if
> there is an access in a cache-inhibited location for prefixed
> instruction.  Rename from SRR1_ISI_N_OR_G to SRR1_ISI_N_G_OR_CIP.
> 
> Signed-off-by: Jordan Niethe 

Confirmed the definitions here match the specifications so:

Reviewed-by: Alistair Popple 

> ---
> v2: Combined all the commits concerning SRR1 bits.
> ---
>  arch/powerpc/include/asm/reg.h  | 4 +++-
>  arch/powerpc/kvm/book3s_hv_nested.c | 2 +-
>  arch/powerpc/kvm/book3s_hv_rm_mmu.c | 2 +-
>  3 files changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
> index c7758c2ccc5f..173f33df4fab 100644
> --- a/arch/powerpc/include/asm/reg.h
> +++ b/arch/powerpc/include/asm/reg.h
> @@ -762,7 +762,7 @@
>  #endif
> 
>  #define   SRR1_ISI_NOPT  0x4000 /* ISI: Not found in hash */
> -#define   SRR1_ISI_N_OR_G0x1000 /* ISI: Access is no-exec or G */
> +#define   SRR1_ISI_N_G_OR_CIP0x1000 /* ISI: Access is no-exec or 
> G or
> CI for a prefixed instruction */ #define   SRR1_ISI_PROT  
> 0x0800 /*
> ISI: Other protection fault */ #define   SRR1_WAKEMASK
> 0x0038 /*
> reason for wakeup */
>  #define   SRR1_WAKEMASK_P8   0x003c /* reason for wakeup on POWER8 and 9
> */ @@ -789,6 +789,8 @@
>  #define   SRR1_PROGADDR  0x0001 /* SRR0 contains subsequent 
> addr */
> 
>  #define   SRR1_MCE_MCP   0x0008 /* Machine check signal 
> caused 
interrupt
> */ +#define   SRR1_BOUNDARY   0x1000 /* Prefixed instruction 
> crosses
> 64-byte boundary */ +#define   SRR1_PREFIXED  0x2000 /* Exception
> caused by prefixed instruction */
> 
>  #define SPRN_HSRR0   0x13A   /* Save/Restore Register 0 */
>  #define SPRN_HSRR1   0x13B   /* Save/Restore Register 1 */
> diff --git a/arch/powerpc/kvm/book3s_hv_nested.c
> b/arch/powerpc/kvm/book3s_hv_nested.c index dc97e5be76f6..6ab685227574
> 100644
> --- a/arch/powerpc/kvm/book3s_hv_nested.c
> +++ b/arch/powerpc/kvm/book3s_hv_nested.c
> @@ -1169,7 +1169,7 @@ static int kvmhv_translate_addr_nested(struct kvm_vcpu
> *vcpu, } else if (vcpu->arch.trap == BOOK3S_INTERRUPT_H_INST_STORAGE) { /*
> Can we execute? */
>   if (!gpte_p->may_execute) {
> - flags |= SRR1_ISI_N_OR_G;
> + flags |= SRR1_ISI_N_G_OR_CIP;
>   goto forward_to_l1;
>   }
>   } else {
> diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> b/arch/powerpc/kvm/book3s_hv_rm_mmu.c index 220305454c23..b53a9f1c1a46
> 100644
> --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> @@ -1260,7 +1260,7 @@ long kvmppc_hpte_hv_fault(struct kvm_vcpu *vcpu,
> unsigned long addr, status &= ~DSISR_NOHPTE;  /* DSISR_NOHPTE ==
> SRR1_ISI_NOPT */
>   if (!data) {
>   if (gr & (HPTE_R_N | HPTE_R_G))
> - return status | SRR1_ISI_N_OR_G;
> + return status | SRR1_ISI_N_G_OR_CIP;
>   if (!hpte_read_permission(pp, slb_v & key))
>   return status | SRR1_ISI_PROT;
>   } else if (status & DSISR_ISSTORE) {






Re: [PATCH v3 01/14] powerpc: Enable Prefixed Instructions

2020-03-03 Thread Alistair Popple
> But hmm, the dt_cpu_ftrs parsing should be picking those up and setting
> them by default I would think (or maybe not because you don't expect
> FSCR bit if OS support is not required).

Right - the generic dt_cpu_ftrs code didn't do the FSCR enablement, which I
assume is because if OS support is required anyway we can just add it in a
similar way to the below patch. My thinking was that we could use it to
disable prefix with firmware if required; however, outside of a lab
environment there isn't much practical use for that, so I'm OK with just
doing something similar to the
below.
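Something along these lines, as a sketch (FSCR_PREFIX is an assumed
bit name, following the existing feat_enable_* pattern in
dt_cpu_ftrs.c):

static int feat_enable_prefix(struct dt_cpu_feature *f)
{
	u64 fscr = mfspr(SPRN_FSCR);

	fscr |= FSCR_PREFIX;	/* hypothetical facility bit */
	mtspr(SPRN_FSCR, fscr);

	return 1;
}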

> All the other FSCR bits are done similarly to this AFAIKS:
> 
>  https://patchwork.ozlabs.org/patch/1244476/
> 
> I would do that for now rather than digging into it too far, but we
> should look at cleaning that up and doing something more like what
> you have here.
>
> >> >  struct dt_cpu_feature_match {
> >> >  
> >> >   const char *name;
> >> >   int (*enable)(struct dt_cpu_feature *f);
> >> > 
> >> > @@ -626,6 +648,7 @@ static struct dt_cpu_feature_match __initdata
> >> > 
> >> >   {"vector-binary128", feat_enable, 0},
> >> >   {"vector-binary16", feat_enable, 0},
> >> >   {"wait-v3", feat_enable, 0},
> >> > 
> >> > + {"prefix-instructions", feat_enable_prefix, 0},
> >> 
> >> That's reasonable to make that a feature, will it specify a minimum
> >> base set of prefix instructions or just that prefix instructions
> >> with the prefix/suffix arrangement exist?
> > 
> > This was just going to be that they exist.
> > 
> >> You may not need "-instructions" on the end, none of the other
> >> instructions do.
> > 
> > Good point.
> > 
> >> I would maybe just hold off upstreaming the dt_cpu_ftrs changes for
> >> a bit. We have to do a pass over new CPU feature device tree, and
> >> some compatibility questions have come up recently.
> >> 
> >> If you wouldn't mind just adding the new [H]FSCR bits and faults
> >> upstream for now, that would be good.
> > 
> > No problem.
> 
> Thanks,
> Nick






Re: [PATCH] powerpc/64: BE option to use ELFv2 ABI for big endian kernels

2020-03-03 Thread Segher Boessenkool
Hi!

On Tue, Mar 03, 2020 at 11:45:27AM +1000, Nicholas Piggin wrote:
> Provide an option to use ELFv2 ABI for big endian builds. This works on
> GCC and clang (since 2014). It is not officially supported by the GNU
> toolchain, but it can give some useful advantages of the ELFv2 ABI for
> BE (e.g., less stack usage). Some distros build BE ELFv2 userspace.

It is not officially supported in the sense that a) as a host config,
it does not exist *at all* (this isn't relevant for the kernel, it does
not use a libc or other libraries, of course); and b) as a target config,
it is not supported in the sense that no one tests it, so we cannot say
anything about what quality code it generates, if it works at all, etc.

But we *do* allow "-mbig -mabi=elfv2", it's just a chicken-and-egg
problem to have this properly tested.  If someone would regularly test
it (incl. sending the test results to gcc-testresults@), I don't see why
it would not become a supported platform.
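For anyone wanting to sanity-check their toolchain, a hedged example
(cross-compiler name assumed): compile a trivial function with
"-mbig -mabi=elfv2" and inspect the object; ELFv2 objects carry ABI
version 2 in e_flags and use local entry points instead of .opd
function descriptors.

/* abi_check.c -- build with:
 *   powerpc64-linux-gnu-gcc -mbig -mabi=elfv2 -O2 -c abi_check.c
 * then verify:
 *   readelf -h abi_check.o   # "Flags: 0x2, abiv2"
 */
long add_one(long x)
{
	return x + 1;
}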

> +override flavour := linux-ppc64v2

That isn't a good name, heh.  This isn't "v2" of anything...  Spell out
the name "ELFv2"?  Or as "elfv2"?  It is just a name after all, it is
version 1 in all three version fields in the ELF headers!


Anyway, looks like it will work, let's see where this goes :-)

Reviewed-by: Segher Boessenkool 


Segher


RE: [RFC PATCH v1] powerpc/prom_init: disable XIVE in Secure VM.

2020-03-03 Thread Ram Pai
On Tue, Mar 03, 2020 at 08:08:51PM +0100, Cédric Le Goater wrote:
> >>>   4) I'm guessing the problem with XIVE in SVM mode is that XIVE needs
> >>>  to write to event queues in guest memory, which would have to be
> >>>  explicitly shared for secure mode.  That's true whether it's KVM
> >>>  or qemu accessing the guest memory, so kernel_irqchip=on/off is
> >>>  entirely irrelevant.
> >>
> >> This problem should be already fixed.
> >> The XIVE event queues are shared 
> > 
> > Yes I have a patch for the guest kernel that shares the event 
> > queue page with the hypervisor. This is done using the
> > UV_SHARE_PAGE ultracall. This patch is not sent out to any mailing
> > lists yet. However the patch by itself does not solve the xive problem
> > for secure VM.
> 
> yes because you also need to share the XIVE TIMA and ESB pages mapped 
> in xive_native_esb_fault() and xive_native_tima_fault(). 

These pages belong to the xive memory slot right? If that is the case,
they are implicitly shared. The Ultravisor will set them up to be
shared. The guest kernel should not be doing anything.

We still need some fixes in KVM and Ultravisor to correctly map the
hardware pages to GPA ranges of the xive memory slot. Work is in progress...



> 
> >> and the remaining problem with XIVE is the KVM page fault handler 
> >> populating the TIMA and ESB pages. Ultravisor doesn't seem to support
> >> this feature and this breaks interrupt management in the guest. 
> > 
> > Yes. This is the bigger issue that needs to be fixed. When the secure guest
> > accesses the page associated with the xive memslot, a page fault is
> > generated, which the ultravisor reflects to the hypervisor. Hypervisor
> > seems to be mapping the hardware page to that GPA. Unfortunately it is not
> > informing the ultravisor of that map.  I am trying to understand the
> > root cause. But since I am not sure what more issues I might run into
> > after chasing down that issue, I figured it's better to disable xive
> > support in SVM in the interim.
> 
> Is it possible to call uv_share_page() from the hypervisor ? 

No. Not allowed. If allowed, the hypervisor can easily attack the SVM.

> 
> >  BTW: I figured, I don't need this interim patch to disable xive for
> > secure VM.  Just doing "svm=on xive=off" on the kernel command line is
> > sufficient for now. *
> 
> Yes. 
> 
> >> But, kernel_irqchip=off should work out of the box. It seems it doesn't. 
> >> Something to investigate.
> > 
> > Don't know why. 
> 
> We need to understand why. 
> 
> You still need the patch to share the event queue page allocated by the 
> guest OS because QEMU will enqueue events. But you should not need anything
> else.

ok. that is assuring.

> 
> > Does this option disable the chip from interrupting the
> > guest directly, and instead mediate the interrupt through the hypervisor?
> 
> Yes. The KVM backend is unused, the XIVE interrupt controller is deactivated
> for the guest and QEMU notifies the vCPUs directly.  
> 
> The TIMA and ESB pages belong to the QEMU process and the guest OS will do 
> some load and store operations onto them for interrupt management. Is that 
> OK from a UV perspective ?  

Yes. These GPA ranges are needed, by design, to be read/writable from
qemu/KVM and the SVM. It's just that the implementation, in its current
form, needs some fixing.

RP



Re: [PATCH] macintosh: windfarm: fix MODINFO regression

2020-03-03 Thread Wolfram Sang

> > sound/aoa/codecs/onyx.c
> > sound/aoa/codecs/tas.c
> 
> These are loaded explicitly via request_module (as snd-aoa-codec-%s).

Good to know, thanks!



signature.asc
Description: PGP signature


Re: [EXTERNAL] Re: [RFC PATCH v1] powerpc/prom_init: disable XIVE in Secure VM.

2020-03-03 Thread Cédric Le Goater
>>  BTW: I figured, I dont need this intermin patch to disable xive for
>> secure VM.  Just doing "svm=on xive=off" on the kernel command line is
>> sufficient for now. *
>>
> 
> No it is not. If the hypervisor doesn't propose XIVE (ie. ic-mode=xive
> on the QEMU command line), the kernel simply ignores "xive=off".

If I am correct, with the option ic-mode=xive, the hypervisor will 
propose only 'xive' in OV5 and not both 'xive' and 'xics'. But the
result is the same because xive cannot be turned off and "xive=off" 
is ignored.

Anyway, it's not the most common way of using the QEMU command
line. I think it's OK to use "xive=off" on the kernel command line 
for now.

C.


Re: [EXTERNAL] Re: [RFC PATCH v1] powerpc/prom_init: disable XIVE in Secure VM.

2020-03-03 Thread Cédric Le Goater
>>>   4) I'm guessing the problem with XIVE in SVM mode is that XIVE needs
>>>  to write to event queues in guest memory, which would have to be
>>>  explicitly shared for secure mode.  That's true whether it's KVM
>>>  or qemu accessing the guest memory, so kernel_irqchip=on/off is
>>>  entirely irrelevant.
>>
>> This problem should be already fixed.
>> The XIVE event queues are shared 
>   
> Yes I have a patch for the guest kernel that shares the event 
> queue page with the hypervisor. This is done using the
> UV_SHARE_PAGE ultracall. This patch is not sent out to any mailing
> lists yet. However the patch by itself does not solve the xive problem
> for secure VM.

yes because you also need to share the XIVE TIMA and ESB pages mapped 
in xive_native_esb_fault() and xive_native_tima_fault(). 

>> and the remaining problem with XIVE is the KVM page fault handler 
>> populating the TIMA and ESB pages. Ultravisor doesn't seem to support
>> this feature and this breaks interrupt management in the guest. 
> 
> Yes. This is the bigger issue that needs to be fixed. When the secure guest
> accesses the page associated with the xive memslot, a page fault is
> generated, which the ultravisor reflects to the hypervisor. Hypervisor
> seems to be mapping the hardware page to that GPA. Unfortunately it is not
> informing the ultravisor of that map.  I am trying to understand the
> root cause. But since I am not sure what more issues I might run into
> after chasing down that issue, I figured it's better to disable xive
> support in SVM in the interim.

Is it possible to call uv_share_page() from the hypervisor ? 

>  BTW: I figured, I don't need this interim patch to disable xive for
> secure VM.  Just doing "svm=on xive=off" on the kernel command line is
> sufficient for now. *

Yes. 

>> But, kernel_irqchip=off should work out of the box. It seems it doesn't. 
>> Something to investigate.
> 
> Don't know why. 

We need to understand why. 

You still need the patch to share the event queue page allocated by the 
guest OS because QEMU will enqueue events. But you should not need anything
else.

> Does this option disable the chip from interrupting the
> guest directly, and instead mediate the interrupt through the hypervisor?

Yes. The KVM backend is unused, the XIVE interrupt controller is deactivated
for the guest and QEMU notifies the vCPUs directly.  

> The TIMA and ESB pages belong to the QEMU process and the guest OS will do 
some load and store operations onto them for interrupt management. Is that 
OK from a UV perspective ?  

Thanks,

C.


Re: [PATCH] macintosh: windfarm: fix MODINFO regression

2020-03-03 Thread Andreas Schwab
On Mar 03 2020, Wolfram Sang wrote:

> sound/aoa/codecs/onyx.c
> sound/aoa/codecs/tas.c

These are loaded explicitly via request_module (as snd-aoa-codec-%s).

Andreas.

-- 
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."


RE: [RFC PATCH v1] powerpc/prom_init: disable XIVE in Secure VM.

2020-03-03 Thread Ram Pai
On Tue, Mar 03, 2020 at 06:45:20PM +0100, Greg Kurz wrote:
> On Tue, 3 Mar 2020 09:02:05 -0800
> Ram Pai  wrote:
> 
> > On Tue, Mar 03, 2020 at 07:50:08AM +0100, Cédric Le Goater wrote:
> > > On 3/3/20 12:32 AM, David Gibson wrote:
> > > > On Fri, Feb 28, 2020 at 11:54:04PM -0800, Ram Pai wrote:
> > > >> XIVE is not correctly enabled for Secure VM in the KVM Hypervisor yet.
> > > >>
> > > >> Hence Secure VM, must always default to XICS interrupt controller.
> > > >>
> > > >> If XIVE is requested through kernel command line option "xive=on",
> > > >> override and turn it off.
> > > >>
> > > >> If XIVE is the only supported platform interrupt controller; specified
> > > >> through qemu option "ic-mode=xive", simply abort. Otherwise default to
> > > >> XICS.
> > > > 
> > > > Uh... the discussion thread here seems to have gotten oddly off
> > > > track.  
> > > 
> > > There seem to be multiple issues. It is difficult to have a clear status.
> > > 
> > > > So, to try to clean up some misunderstandings on both sides:
> > > > 
> > > >   1) The guest is the main thing that knows that it will be in secure
> > > >  mode, so it's reasonable for it to conditionally use XIVE based
> > > >  on that
> > > 
> > > FW support is required AFAIUI.
> > > >   2) The mechanism by which we do it here isn't quite right.  Here the
> > > >  guest is checking itself that the host only allows XIVE, but we
> > > >  can't do XIVE and is panic()ing.  Instead, in the SVM case we
> > > >  should force support->xive to false, and send that in the CAS
> > > >  request to the host.  We expect the host to just terminate
> > > >  us because of the mismatch, but this will interact better with
> > > >  host side options setting policy for panic states and the like.
> > > >  Essentially an SVM kernel should behave like an old kernel with
> > > >  no XIVE support at all, at least w.r.t. the CAS irq mode flags.
> > > 
> > > Yes. XIVE shouldn't be requested by the guest.
> > 
> > Ok.
> > 
> > > This is the last option 
> > > I proposed but I thought there was some negotiation with the hypervisor
> > > which is not the case. 
> > > 
> > > >   3) Although there are means by which the hypervisor can kind of know
> > > >  a guest is in secure mode, there's not really an "svm=on" option
> > > >  on the host side.  For the most part secure mode is based on
> > > >  discussion directly between the guest and the ultravisor with
> > > >  almost no hypervisor intervention.
> > > 
> > > Is there a negotiation with the ultravisor ? 
> > 
> > The VM has no negotiation with the ultravisor w.r.t CAS.
> > 
> > > 
> > > >   4) I'm guessing the problem with XIVE in SVM mode is that XIVE needs
> > > >  to write to event queues in guest memory, which would have to be
> > > >  explicitly shared for secure mode.  That's true whether it's KVM
> > > >  or qemu accessing the guest memory, so kernel_irqchip=on/off is
> > > >  entirely irrelevant.
> > > 
> > > This problem should be already fixed.
> > > The XIVE event queues are shared 
> > 
> > Yes I have a patch for the guest kernel that shares the event 
> > queue page with the hypervisor. This is done using the
> > UV_SHARE_PAGE ultracall. This patch is not sent out to any mailing
> > lists yet.
> 
> Why ?

At this point I am not sure if this is the only change I need in the
guest kernel. I also need changes to KVM and to the ultravisor. It's a
bit premature to send the patch without having figured out everything
needed to get xive working on a Secure VM.

> 
> > However the patch by itself does not solve the xive problem
> > for secure VM.
> > 
> 
> This patch would allow at least to answer Cedric's question about
> kernel_irqchip=off, since this looks like the only thing needed
> to make it work.

hmm.. I am not sure. Are you saying
(a) patch the guest kernel to share the event queue page
(b) run the qemu with "kernel_irqchip=off"
(c) and the guest kernel with "svm=on"

and it should all work?
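
(i.e., roughly something like this, if I read you right -- a sketch
with most options elided, and using this thread's spelling of the
irqchip property:

  qemu-system-ppc64 -machine pseries,ic-mode=xics,kernel_irqchip=off \
      -append "svm=on" ...

with the guest kernel carrying the event-queue-sharing patch.)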

RP



Re: [PATCH v3 18/27] powerpc/powernv/pmem: Add controller dump IOCTLs

2020-03-03 Thread Frederic Barrat




On 21/02/2020 at 04:27, Alastair D'Silva wrote:

From: Alastair D'Silva 

This patch adds IOCTLs to allow userspace to request & fetch dumps
of the internal controller state.

This is useful during debugging or when a fatal error on the controller
has occurred.

Signed-off-by: Alastair D'Silva 
---
  arch/powerpc/platforms/powernv/pmem/ocxl.c | 132 +
  include/uapi/nvdimm/ocxl-pmem.h|  15 +++
  2 files changed, 147 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/pmem/ocxl.c 
b/arch/powerpc/platforms/powernv/pmem/ocxl.c
index 2b64504f9129..2cabafe1fc58 100644
--- a/arch/powerpc/platforms/powernv/pmem/ocxl.c
+++ b/arch/powerpc/platforms/powernv/pmem/ocxl.c
@@ -640,6 +640,124 @@ static int ioctl_error_log(struct ocxlpmem *ocxlpmem,
return 0;
  }
  
+static int ioctl_controller_dump_data(struct ocxlpmem *ocxlpmem,

+   struct ioctl_ocxl_pmem_controller_dump_data __user *uarg)
+{
+   struct ioctl_ocxl_pmem_controller_dump_data args;
+   u16 i;
+   u64 val;
+   int rc;
+
+   if (copy_from_user(&args, uarg, sizeof(args)))
+   return -EFAULT;
+
+   if (args.buf_size % 8)
+   return -EINVAL;
+
+   if (args.buf_size > ocxlpmem->admin_command.data_size)
+   return -EINVAL;
+
+   mutex_lock(&ocxlpmem->admin_command.lock);
+
+   rc = admin_command_request(ocxlpmem, ADMIN_COMMAND_CONTROLLER_DUMP);
+   if (rc)
+   goto out;
+
+   val = ((u64)args.offset) << 32;
+   val |= args.buf_size;
+   rc = ocxl_global_mmio_write64(ocxlpmem->ocxl_afu,
+ ocxlpmem->admin_command.request_offset + 
0x08,
+ OCXL_LITTLE_ENDIAN, val);
+   if (rc)
+   goto out;
+
+   rc = admin_command_execute(ocxlpmem);
+   if (rc)
+   goto out;
+
+   rc = admin_command_complete_timeout(ocxlpmem,
+   ADMIN_COMMAND_CONTROLLER_DUMP);
+   if (rc < 0) {
+   dev_warn(&ocxlpmem->dev, "Controller dump timed out\n");
+   goto out;
+   }
+
+   rc = admin_response(ocxlpmem);
+   if (rc < 0)
+   goto out;
+   if (rc != STATUS_SUCCESS) {
+   warn_status(ocxlpmem,
+   "Unexpected status from retrieve error log",
+   rc);
+   goto out;
+   }




It would help if there was a comment indicating how the 3 ioctls are 
used. My understanding is that the userland flow is:

- requesting the controller to prepare a state dump
- then one or more ioctls to fetch the data. The number of calls 
required to get the full state really depends on the size of the buffer 
passed by the user
- a last ioctl to tell the controller that we're done, presumably to let 
it free some resources.
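
To make that concrete, I'd expect userspace usage roughly along these
lines (a sketch only; the ioctl command names are my guesses from the
struct names, not taken from the patch):

  uint64_t buf[512];
  struct ioctl_ocxl_pmem_controller_dump_data dump = {
          .buf = buf,
          .buf_size = sizeof(buf),
          .offset = 0,
  };

  ioctl(fd, IOCTL_OCXL_PMEM_CONTROLLER_DUMP);              /* prepare a dump */
  ioctl(fd, IOCTL_OCXL_PMEM_CONTROLLER_DUMP_DATA, &dump);  /* repeat, bumping .offset */
  ioctl(fd, IOCTL_OCXL_PMEM_CONTROLLER_DUMP_COMPLETE);     /* let it free resources */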




+
+   for (i = 0; i < args.buf_size; i += 8) {
+   u64 val;
+
+   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu,
+
ocxlpmem->admin_command.data_offset + i,
+OCXL_HOST_ENDIAN, &val);
+   if (rc)
+   goto out;
+
+   if (copy_to_user(&args.buf[i], &val, sizeof(u64))) {
+   rc = -EFAULT;
+   goto out;
+   }
+   }
+
+   if (copy_to_user(uarg, &args, sizeof(args))) {
+   rc = -EFAULT;
+   goto out;
+   }
+
+   rc = admin_response_handled(ocxlpmem);
+   if (rc)
+   goto out;
+
+out:
+   mutex_unlock(&ocxlpmem->admin_command.lock);
+   return rc;
+}
+
+int request_controller_dump(struct ocxlpmem *ocxlpmem)
+{
+   int rc;
+   u64 busy = 1;
+
+   rc = ocxl_global_mmio_set64(ocxlpmem->ocxl_afu, GLOBAL_MMIO_CHIC,
+   OCXL_LITTLE_ENDIAN,
+   GLOBAL_MMIO_CHI_CDA);
+



rc is not checked here.



+
+   rc = ocxl_global_mmio_set64(ocxlpmem->ocxl_afu, GLOBAL_MMIO_HCI,
+   OCXL_LITTLE_ENDIAN,
+   GLOBAL_MMIO_HCI_CONTROLLER_DUMP);
+   if (rc)
+   return rc;
+
+   while (busy) {
+   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu,
+GLOBAL_MMIO_HCI,
+OCXL_LITTLE_ENDIAN, &busy);
+   if (rc)
+   return rc;
+
+   busy &= GLOBAL_MMIO_HCI_CONTROLLER_DUMP;



Setting 'busy' doesn't hurt, but it's not really useful, is it?

We should add some kind of timeout so that if the controller hits an 
issue, we don't spin in kernel space endlessly.
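
Something along these lines, perhaps (untested sketch; the timeout
value is made up):

  unsigned long timeout = jiffies + msecs_to_jiffies(5000);

  while (busy) {
          if (time_after(jiffies, timeout))
                  return -EBUSY;

          rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu, GLOBAL_MMIO_HCI,
                                       OCXL_LITTLE_ENDIAN, &busy);
          if (rc)
                  return rc;

          busy &= GLOBAL_MMIO_HCI_CONTROLLER_DUMP;
          cond_resched();
  }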





+   cond_resched();
+   }
+
+   return 0;
+}
+
+static int ioctl_controller_dump_complete(struct ocxlpmem *ocxlpmem)
+{
+   return ocxl_global_mmio_set64

Re: [RFC 2/3] mm/vma: Introduce VM_ACCESS_FLAGS

2020-03-03 Thread Vlastimil Babka
On 3/2/20 7:47 AM, Anshuman Khandual wrote:
> There are many places where all basic VMA access flags (read, write, exec)
> are initialized or checked against as a group. One such example is during
> page fault. Existing vma_is_accessible() wrapper already creates the notion
> of VMA accessibility as a group access permissions. Hence lets just create
> VM_ACCESS_FLAGS (VM_READ|VM_WRITE|VM_EXEC) which will not only reduce code
> duplication but also extend the VMA accessibility concept in general.
> 
> Cc: Russell King 
> CC: Catalin Marinas 
> CC: Mark Salter 
> Cc: Nick Hu 
> CC: Ley Foon Tan 
> Cc: Michael Ellerman 
> Cc: Heiko Carstens 
> Cc: Yoshinori Sato 
> Cc: Guan Xuetao 
> Cc: Dave Hansen 
> Cc: Thomas Gleixner 
> Cc: Rob Springer 
> Cc: Greg Kroah-Hartman 
> Cc: Andrew Morton 
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-c6x-...@linux-c6x.org
> Cc: nios2-...@lists.rocketboards.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-s...@vger.kernel.org
> Cc: linux...@vger.kernel.org
> Cc: de...@driverdev.osuosl.org
> Cc: linux...@kvack.org
> Cc: linux-ker...@vger.kernel.org
> Signed-off-by: Anshuman Khandual 

Dunno. Such mask seems ok for testing flags, but it's a bit awkward when
initializing flags, where it covers just one of many combinations that seem
to be used. But no strong opinions, the patch looks correct.


Re: [RFC 1/3] mm/vma: Define a default value for VM_DATA_DEFAULT_FLAGS

2020-03-03 Thread Vlastimil Babka
On 3/2/20 7:47 AM, Anshuman Khandual wrote:
> There are many platforms with the exact same value for VM_DATA_DEFAULT_FLAGS.
> This creates a default value for VM_DATA_DEFAULT_FLAGS in line with the
> existing VM_STACK_DEFAULT_FLAGS. While here, also define some more macros
> with standard VMA access flag combinations that are used frequently across
> many platforms. Apart from simplification, this reduces code duplication
> as well.
> 
> Cc: Richard Henderson 
> Cc: Vineet Gupta 
> Cc: Russell King 
> Cc: Catalin Marinas 
> Cc: Mark Salter 
> Cc: Guo Ren 
> Cc: Yoshinori Sato 
> Cc: Brian Cain 
> Cc: Tony Luck 
> Cc: Geert Uytterhoeven 
> Cc: Michal Simek 
> Cc: Ralf Baechle 
> Cc: Paul Burton 
> Cc: Nick Hu 
> Cc: Ley Foon Tan 
> Cc: Jonas Bonn 
> Cc: "James E.J. Bottomley" 
> Cc: Michael Ellerman 
> Cc: Paul Walmsley 
> Cc: Heiko Carstens 
> Cc: Rich Felker 
> Cc: "David S. Miller" 
> Cc: Guan Xuetao 
> Cc: Thomas Gleixner 
> Cc: Jeff Dike 
> Cc: Chris Zankel 
> Cc: Andrew Morton 
> Cc: linux-al...@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux-snps-...@lists.infradead.org
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-c6x-...@linux-c6x.org
> Cc: uclinux-h8-de...@lists.sourceforge.jp
> Cc: linux-hexa...@vger.kernel.org
> Cc: linux-i...@vger.kernel.org
> Cc: linux-m...@lists.linux-m68k.org
> Cc: linux-m...@vger.kernel.org
> Cc: nios2-...@lists.rocketboards.org
> Cc: openr...@lists.librecores.org
> Cc: linux-par...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-ri...@lists.infradead.org
> Cc: linux-s...@vger.kernel.org
> Cc: linux...@vger.kernel.org
> Cc: sparcli...@vger.kernel.org
> Cc: linux...@lists.infradead.org
> Cc: linux-xte...@linux-xtensa.org
> Cc: linux...@kvack.org
> Signed-off-by: Anshuman Khandual 

Nit:

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index b0e53ef13ff1..7a764ae6ab68 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -342,6 +342,21 @@ extern unsigned int kobjsize(const void *objp);
>  /* Bits set in the VMA until the stack is in its final location */
>  #define VM_STACK_INCOMPLETE_SETUP(VM_RAND_READ | VM_SEQ_READ)
>  
> +#define TASK_EXEC ((current->personality & READ_IMPLIES_EXEC) ? VM_EXEC : 0)
> +
> +/* Common data flag combinations */
> +#define VM_DATA_FLAGS_TSK_EXEC   (VM_READ | VM_WRITE | TASK_EXEC | \
> +  VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
> +#define VM_DATA_FLAGS_NON_EXEC   (VM_READ | VM_WRITE | VM_MAYREAD | \
> +  VM_MAYWRITE | VM_MAYEXEC)
> +#define VM_DATA_FLAGS_EXEC   (VM_READ | VM_WRITE | VM_EXEC | \
> +  VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
> +
> +#ifndef VM_DATA_DEFAULT_FLAGS/* arch can override this */
> +#define VM_DATA_DEFAULT_FLAGS(VM_READ | VM_WRITE | VM_EXEC | \
> +  VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)

Should you use VM_DATA_FLAGS_EXEC here? Yeah, one more macro to expand, but it's
right above this (see the sketch below the quoted hunk).

> +#endif
> +
>  #ifndef VM_STACK_DEFAULT_FLAGS   /* arch can override this */
>  #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
>  #endif
> 
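
i.e. something like (sketch):

  #ifndef VM_DATA_DEFAULT_FLAGS	/* arch can override this */
  #define VM_DATA_DEFAULT_FLAGS	VM_DATA_FLAGS_EXEC
  #endif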



RE: [RFC PATCH v1] powerpc/prom_init: disable XIVE in Secure VM.

2020-03-03 Thread Ram Pai
On Tue, Mar 03, 2020 at 07:50:08AM +0100, Cédric Le Goater wrote:
> On 3/3/20 12:32 AM, David Gibson wrote:
> > On Fri, Feb 28, 2020 at 11:54:04PM -0800, Ram Pai wrote:
> >> XIVE is not correctly enabled for Secure VM in the KVM Hypervisor yet.
> >>
> >> Hence Secure VM, must always default to XICS interrupt controller.
> >>
> >> If XIVE is requested through kernel command line option "xive=on",
> >> override and turn it off.
> >>
> >> If XIVE is the only supported platform interrupt controller; specified
> >> through qemu option "ic-mode=xive", simply abort. Otherwise default to
> >> XICS.
> > 
> > Uh... the discussion thread here seems to have gotten oddly off
> > track.  
> 
> There seem to be multiple issues. It is difficult to have a clear status.
> 
> > So, to try to clean up some misunderstandings on both sides:
> > 
> >   1) The guest is the main thing that knows that it will be in secure
> >  mode, so it's reasonable for it to conditionally use XIVE based
> >  on that
> 
> FW support is required AFAIUI.
> >   2) The mechanism by which we do it here isn't quite right.  Here the
> >  guest is checking itself that the host only allows XIVE, but we
> >  can't do XIVE and is panic()ing.  Instead, in the SVM case we
> >  should force support->xive to false, and send that in the CAS
> >  request to the host.  We expect the host to just terminate
> >  us because of the mismatch, but this will interact better with
> >  host side options setting policy for panic states and the like.
> >  Essentially an SVM kernel should behave like an old kernel with
> >  no XIVE support at all, at least w.r.t. the CAS irq mode flags.
> 
> Yes. XIVE shouldn't be requested by the guest.

Ok.

> This is the last option 
> I proposed but I thought there was some negotiation with the hypervisor
> which is not the case. 
> 
> >   3) Although there are means by which the hypervisor can kind of know
> >  a guest is in secure mode, there's not really an "svm=on" option
> >  on the host side.  For the most part secure mode is based on
> >  discussion directly between the guest and the ultravisor with
> >  almost no hypervisor intervention.
> 
> Is there a negotiation with the ultravisor ? 

The VM has no negotiation with the ultravisor w.r.t CAS.

> 
> >   4) I'm guessing the problem with XIVE in SVM mode is that XIVE needs
> >  to write to event queues in guest memory, which would have to be
> >  explicitly shared for secure mode.  That's true whether it's KVM
> >  or qemu accessing the guest memory, so kernel_irqchip=on/off is
> >  entirely irrelevant.
> 
> This problem should be already fixed.
> The XIVE event queues are shared 

Yes I have a patch for the guest kernel that shares the event 
queue page with the hypervisor. This is done using the
UV_SHARE_PAGE ultracall. This patch is not sent out to any mailing
lists yet. However the patch by itself does not solve the xive problem
for secure VM.

> and the remaining problem with XIVE is the KVM page fault handler 
> populating the TIMA and ESB pages. Ultravisor doesn't seem to support
> this feature and this breaks interrupt management in the guest. 

Yes. This is the bigger issue that needs to be fixed. When the secure guest
accesses the page associated with the xive memslot, a page fault is
generated, which the ultravisor reflects to the hypervisor. Hypervisor
seems to be mapping the hardware page to that GPA. Unfortunately it is not
informing the ultravisor of that map.  I am trying to understand the
root cause. But since I am not sure what more issues I might run into
after chasing down that issue, I figured it's better to disable xive
support in SVM in the interim.

 BTW: I figured, I don't need this interim patch to disable xive for
secure VM.  Just doing "svm=on xive=off" on the kernel command line is
sufficient for now. *


> 
> But, kernel_irqchip=off should work out of the box. It seems it doesn't. 
> Something to investigate.

Don't know why. 

Does this option disable the chip from interrupting the
guest directly, and instead mediate the interrupt through the hypervisor?

> 
> > 
> >   5) All the above said, having to use XICS is pretty crappy.  You
> >  should really get working on XIVE support for secure VMs.
> 
> Yes. 

and yes too.


Summary:  I am dropping this patch for now.

> 
> Thanks,
> 
> C.

-- 
Ram Pai



Re: [RFC 02/11] perf/core: Data structure to present hazard data

2020-03-03 Thread Ravi Bangoria



On 3/2/20 8:18 PM, Mark Rutland wrote:

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 377d794d3105..ff252618ca93 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -142,8 +142,9 @@ enum perf_event_sample_format {
PERF_SAMPLE_REGS_INTR   = 1U << 18,
PERF_SAMPLE_PHYS_ADDR   = 1U << 19,
PERF_SAMPLE_AUX = 1U << 20,
+   PERF_SAMPLE_PIPELINE_HAZ= 1U << 21,


Can we please have perf_event_open() reject this sample flag for PMUs
without the new callback (introduced in the next patch)?

That way it'll be possible to detect whether the PMU exposes this.
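
For illustration, the check could be something like this (a sketch; the
callback name below is hypothetical, standing in for whatever the next
patch introduces):

  /* somewhere in the perf_event_open() / perf_init_event() path */
  if ((event->attr.sample_type & PERF_SAMPLE_PIPELINE_HAZ) &&
      !event->pmu->pipeline_haz_data)
          return -EOPNOTSUPP;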


Sure. Will change it.

Ravi



Re: [RFC 02/11] perf/core: Data structure to present hazard data

2020-03-03 Thread Ravi Bangoria




On 3/2/20 8:24 PM, Mark Rutland wrote:

@@ -870,6 +871,13 @@ enum perf_event_type {
 *  { u64   phys_addr;} && PERF_SAMPLE_PHYS_ADDR
 *  { u64   size;
 *char  data[size]; } && PERF_SAMPLE_AUX
+*  { u8itype;
+*u8icache;
+*u8hazard_stage;
+*u8hazard_reason;
+*u8stall_stage;
+*u8stall_reason;
+*u16   pad;} && PERF_SAMPLE_PIPELINE_HAZ
 * };


The existing comment shows the aux data *immediately* after the
phys_addr field, where you've placed struct perf_pipeline_haz_data.

If adding to struct perf_sample_data is fine, this needs to come before
the aux data in this comment. If adding to struct perf_sample_data is
not fine, struct perf_pipeline_haz_data cannot live there.

I suspect the latter is true, but you're getting away with it because
you're not using both PERF_SAMPLE_AUX and PERF_SAMPLE_PIPELINE_HAZ
simultaneously.


Right. Thanks for pointing it out. Will change it.

Ravi



[PATCH v3 14/18] docs: powerpc: convert vcpudispatch_stats.txt to ReST

2020-03-03 Thread Mauro Carvalho Chehab
- Add a SPDX header;
- Use standard markup for document title;
- Adjust identation on lists and add blank lines where
  needed;
- Add it to the powerpc index.rst file.

Signed-off-by: Mauro Carvalho Chehab 
---
 Documentation/powerpc/index.rst |  1 +
 ...ispatch_stats.txt => vcpudispatch_stats.rst} | 17 -
 2 files changed, 13 insertions(+), 5 deletions(-)
 rename Documentation/powerpc/{vcpudispatch_stats.txt => 
vcpudispatch_stats.rst} (94%)

diff --git a/Documentation/powerpc/index.rst b/Documentation/powerpc/index.rst
index 0d45f0fc8e57..29b90b1b6f20 100644
--- a/Documentation/powerpc/index.rst
+++ b/Documentation/powerpc/index.rst
@@ -30,6 +30,7 @@ powerpc
 syscall64-abi
 transactional_memory
 ultravisor
+vcpudispatch_stats
 
 .. only::  subproject and html
 
diff --git a/Documentation/powerpc/vcpudispatch_stats.txt 
b/Documentation/powerpc/vcpudispatch_stats.rst
similarity index 94%
rename from Documentation/powerpc/vcpudispatch_stats.txt
rename to Documentation/powerpc/vcpudispatch_stats.rst
index e21476bfd78c..5704657a5987 100644
--- a/Documentation/powerpc/vcpudispatch_stats.txt
+++ b/Documentation/powerpc/vcpudispatch_stats.rst
@@ -1,5 +1,8 @@
-VCPU Dispatch Statistics:
-=
+.. SPDX-License-Identifier: GPL-2.0
+
+
+VCPU Dispatch Statistics
+
 
 For Shared Processor LPARs, the POWER Hypervisor maintains a relatively
 static mapping of the LPAR processors (vcpus) to physical processor
@@ -20,25 +23,29 @@ The statistics themselves are available by reading the 
procfs file
 a vcpu as represented by the first field, followed by 8 numbers.
 
 The first number corresponds to:
+
 1. total vcpu dispatches since the beginning of statistics collection
 
 The next 4 numbers represent vcpu dispatch dispersions:
+
 2. number of times this vcpu was dispatched on the same processor as last
time
 3. number of times this vcpu was dispatched on a different processor core
as last time, but within the same chip
 4. number of times this vcpu was dispatched on a different chip
 5. number of times this vcpu was dispatches on a different socket/drawer
-(next numa boundary)
+   (next numa boundary)
 
 The final 3 numbers represent statistics in relation to the home node of
 the vcpu:
+
 6. number of times this vcpu was dispatched in its home node (chip)
 7. number of times this vcpu was dispatched in a different node
 8. number of times this vcpu was dispatched in a node further away (numa
-distance)
+   distance)
+
+An example output::
 
-An example output:
 $ sudo cat /proc/powerpc/vcpudispatch_stats
 cpu0 6839 4126 2683 30 0 6821 18 0
 cpu1 2515 1274 1229 12 0 2509 6 0
-- 
2.24.1



Re: [PATCH] powerpc/mm: Fix missing KUAP disable in flush_coherent_icache()

2020-03-03 Thread Christophe Leroy




On 03/03/2020 at 13:59, Michael Ellerman wrote:

We received a report of strange kernel faults which turned out to be
due to a missing KUAP disable in flush_coherent_icache() called
from flush_icache_range().

The fault looks like:

   Kernel attempted to access user page (7fffc30d9c00) - exploit attempt? (uid: 
1009)
   BUG: Unable to handle kernel data access on read at 0x7fffc30d9c00
   Faulting instruction address: 0xc007232c
   Oops: Kernel access of bad area, sig: 11 [#1]
   LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
   CPU: 35 PID: 5886 Comm: sigtramp Not tainted 
5.6.0-rc2-gcc-8.2.0-3-gfc37a1632d40 #79
   NIP:  c007232c LR: c003b7fc CTR: 
   REGS: c01e11093940 TRAP: 0300   Not tainted  
(5.6.0-rc2-gcc-8.2.0-3-gfc37a1632d40)
   MSR:  9280b033   CR: 28000884  
XER: 
   CFAR: c00722fc DAR: 7fffc30d9c00 DSISR: 0800 IRQMASK: 0
   GPR00: c003b7fc c01e11093bd0 c23ac200 7fffc30d9c00
   GPR04: 7fffc30d9c18  c01e11093bd4 
   GPR08:  0001  c01e1104ed80
   GPR12:  c01fff6ab380 c16be2d0 4000
   GPR16: c000 bfff  
   GPR20: 7fffc30d9c00 7fffc30d8f58 7fffc30d9c18 7fffc30d9c20
   GPR24: 7fffc30d9c18  c01e11093d90 c01e1104ed80
   GPR28: c01e11093e90  c23d9d18 7fffc30d9c00
   NIP flush_icache_range+0x5c/0x80
   LR  handle_rt_signal64+0x95c/0xc2c
   Call Trace:
 0xc01e11093d90 (unreliable)
 handle_rt_signal64+0x93c/0xc2c
 do_notify_resume+0x310/0x430
 ret_from_except_lite+0x70/0x74
   Instruction dump:
   409e002c 7c0802a6 3c62ff31 3863f6a0 f8010080 48195fed 6000 48fe4c8d
   6000 e8010080 7c0803a6 7c0004ac <7c00ffac> 7c0004ac 4c00012c 38210070

This path through handle_rt_signal64() to setup_trampoline() and
flush_icache_range() is only triggered by 64-bit processes that have
unmapped their VDSO, which is rare.

flush_icache_range() takes a range of addresses to flush. In
flush_coherent_icache() we implement an optimisation for CPUs where we
know we don't actually have to flush the whole range, we just need to
do a single icbi.

However we still execute the icbi on the user address of the start of
the range we're flushing. On CPUs that also implement KUAP (Power9)
that leads to the spurious fault above.

We should be able to pass any address, including a kernel address, to
the icbi on these CPUs, which would avoid any interaction with KUAP.
But I don't want to make that change in a bug fix, just in case it
surfaces some strange behaviour on some CPU.

So for now just disable KUAP around the icbi. Note the icbi is treated
as a load, so we allow read access, not write as you'd expect.

Fixes: 890274c2dc4c ("powerpc/64s: Implement KUAP for Radix MMU")
Cc: sta...@vger.kernel.org # v5.2+
Signed-off-by: Michael Ellerman 
---
  arch/powerpc/mm/mem.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index ef7b1119b2e2..184850d9d000 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -373,7 +373,9 @@ static inline bool flush_coherent_icache(unsigned long addr)
 */
if (cpu_has_feature(CPU_FTR_COHERENT_ICACHE)) {
mb(); /* sync */
+   allow_read_from_user((void *)addr, 4);


I know that's ignored on Radix, but shouldn't we use L1_CACHE_BYTES as a 
length?



icbi((void *)addr);
+   prevent_read_from_user((void *)addr, 4);
mb(); /* sync */
isync();
return true;



Christophe


[PATCH] powerpc/mm: Fix missing KUAP disable in flush_coherent_icache()

2020-03-03 Thread Michael Ellerman
We received a report of strange kernel faults which turned out to be
due to a missing KUAP disable in flush_coherent_icache() called
from flush_icache_range().

The fault looks like:

  Kernel attempted to access user page (7fffc30d9c00) - exploit attempt? (uid: 
1009)
  BUG: Unable to handle kernel data access on read at 0x7fffc30d9c00
  Faulting instruction address: 0xc007232c
  Oops: Kernel access of bad area, sig: 11 [#1]
  LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
  CPU: 35 PID: 5886 Comm: sigtramp Not tainted 
5.6.0-rc2-gcc-8.2.0-3-gfc37a1632d40 #79
  NIP:  c007232c LR: c003b7fc CTR: 
  REGS: c01e11093940 TRAP: 0300   Not tainted  
(5.6.0-rc2-gcc-8.2.0-3-gfc37a1632d40)
  MSR:  9280b033   CR: 28000884  
XER: 
  CFAR: c00722fc DAR: 7fffc30d9c00 DSISR: 0800 IRQMASK: 0
  GPR00: c003b7fc c01e11093bd0 c23ac200 7fffc30d9c00
  GPR04: 7fffc30d9c18  c01e11093bd4 
  GPR08:  0001  c01e1104ed80
  GPR12:  c01fff6ab380 c16be2d0 4000
  GPR16: c000 bfff  
  GPR20: 7fffc30d9c00 7fffc30d8f58 7fffc30d9c18 7fffc30d9c20
  GPR24: 7fffc30d9c18  c01e11093d90 c01e1104ed80
  GPR28: c01e11093e90  c23d9d18 7fffc30d9c00
  NIP flush_icache_range+0x5c/0x80
  LR  handle_rt_signal64+0x95c/0xc2c
  Call Trace:
0xc01e11093d90 (unreliable)
handle_rt_signal64+0x93c/0xc2c
do_notify_resume+0x310/0x430
ret_from_except_lite+0x70/0x74
  Instruction dump:
  409e002c 7c0802a6 3c62ff31 3863f6a0 f8010080 48195fed 6000 48fe4c8d
  6000 e8010080 7c0803a6 7c0004ac <7c00ffac> 7c0004ac 4c00012c 38210070

This path through handle_rt_signal64() to setup_trampoline() and
flush_icache_range() is only triggered by 64-bit processes that have
unmapped their VDSO, which is rare.

flush_icache_range() takes a range of addresses to flush. In
flush_coherent_icache() we implement an optimisation for CPUs where we
know we don't actually have to flush the whole range, we just need to
do a single icbi.

However we still execute the icbi on the user address of the start of
the range we're flushing. On CPUs that also implement KUAP (Power9)
that leads to the spurious fault above.

We should be able to pass any address, including a kernel address, to
the icbi on these CPUs, which would avoid any interaction with KUAP.
But I don't want to make that change in a bug fix, just in case it
surfaces some strange behaviour on some CPU.

So for now just disable KUAP around the icbi. Note the icbi is treated
as a load, so we allow read access, not write as you'd expect.

Fixes: 890274c2dc4c ("powerpc/64s: Implement KUAP for Radix MMU")
Cc: sta...@vger.kernel.org # v5.2+
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/mm/mem.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index ef7b1119b2e2..184850d9d000 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -373,7 +373,9 @@ static inline bool flush_coherent_icache(unsigned long addr)
 */
if (cpu_has_feature(CPU_FTR_COHERENT_ICACHE)) {
mb(); /* sync */
+   allow_read_from_user((void *)addr, 4);
icbi((void *)addr);
+   prevent_read_from_user((void *)addr, 4);
mb(); /* sync */
isync();
return true;
-- 
2.21.1



[Bug 199471] [Bisected][Regression] windfarm_pm* no longer gets automatically loaded when CONFIG_I2C_POWERMAC=y is set

2020-03-03 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=199471

--- Comment #19 from Wolfram Sang (w...@the-dreams.de) ---
Well, yes, the lm75 code gets loaded, yet it is not clear to me if the device is
now created by DT or not. Well, we will see...

Patch sent out: http://patchwork.ozlabs.org/patch/1248349/

Let's discuss this one. The proof-of-concept here had a missing line but worked
well enough.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

[PATCH] macintosh: windfarm: fix MODINFO regression

2020-03-03 Thread Wolfram Sang
Commit af503716ac14 made sure OF devices get an OF style modalias with
I2C events. It assumed all in-tree users were converted, yet it missed
some Macintosh drivers.

Add an OF module device table for all windfarm drivers to make them
automatically load again.

Fixes: af503716ac14 ("i2c: core: report OF style module alias for devices 
registered via OF")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=199471
Reported-by: Erhard Furtner 
Tested-by: Erhard Furtner 
Signed-off-by: Wolfram Sang 
---

This should also help with this: 
https://lists.debian.org/debian-powerpc/2020/01/msg00062.html
Some more testing would be appreciated because lm75 also has some code
changes I can't test myself, obviously.

By grepping, I found some more potential candidates (using a "MAC,"
prefix but not defining an OF MODULE_DEVICE_TABLE). Does someone know
about bug reports filed for those? I don't want to change them for no
reason:

drivers/macintosh/ams/ams-i2c.c
drivers/macintosh/therm_adt746x.c
sound/aoa/codecs/onyx.c
sound/aoa/codecs/tas.c
sound/ppc/keywest.c

Happy hacking,

   Wolfram

 drivers/macintosh/windfarm_ad7417_sensor.c  |  7 +++
 drivers/macintosh/windfarm_fcu_controls.c   |  7 +++
 drivers/macintosh/windfarm_lm75_sensor.c| 16 +++-
 drivers/macintosh/windfarm_lm87_sensor.c|  7 +++
 drivers/macintosh/windfarm_max6690_sensor.c |  7 +++
 drivers/macintosh/windfarm_smu_sat.c|  7 +++
 6 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/drivers/macintosh/windfarm_ad7417_sensor.c 
b/drivers/macintosh/windfarm_ad7417_sensor.c
index 125605987b44..e7dec328c7cf 100644
--- a/drivers/macintosh/windfarm_ad7417_sensor.c
+++ b/drivers/macintosh/windfarm_ad7417_sensor.c
@@ -312,9 +312,16 @@ static const struct i2c_device_id wf_ad7417_id[] = {
 };
 MODULE_DEVICE_TABLE(i2c, wf_ad7417_id);
 
+static const struct of_device_id wf_ad7417_of_id[] = {
+   { .compatible = "ad7417", },
+   { }
+};
+MODULE_DEVICE_TABLE(of, wf_ad7417_of_id);
+
 static struct i2c_driver wf_ad7417_driver = {
.driver = {
.name   = "wf_ad7417",
+   .of_match_table = wf_ad7417_of_id,
},
.probe  = wf_ad7417_probe,
.remove = wf_ad7417_remove,
diff --git a/drivers/macintosh/windfarm_fcu_controls.c 
b/drivers/macintosh/windfarm_fcu_controls.c
index 67daeec94b44..2470e5a725c8 100644
--- a/drivers/macintosh/windfarm_fcu_controls.c
+++ b/drivers/macintosh/windfarm_fcu_controls.c
@@ -580,9 +580,16 @@ static const struct i2c_device_id wf_fcu_id[] = {
 };
 MODULE_DEVICE_TABLE(i2c, wf_fcu_id);
 
+static const struct of_device_id wf_fcu_of_id[] = {
+   { .compatible = "fcu", },
+   { }
+};
+MODULE_DEVICE_TABLE(of, wf_fcu_of_id);
+
 static struct i2c_driver wf_fcu_driver = {
.driver = {
.name   = "wf_fcu",
+   .of_match_table = wf_fcu_of_id,
},
.probe  = wf_fcu_probe,
.remove = wf_fcu_remove,
diff --git a/drivers/macintosh/windfarm_lm75_sensor.c 
b/drivers/macintosh/windfarm_lm75_sensor.c
index 282c28a17ea1..1e5fa09845e7 100644
--- a/drivers/macintosh/windfarm_lm75_sensor.c
+++ b/drivers/macintosh/windfarm_lm75_sensor.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -91,9 +92,14 @@ static int wf_lm75_probe(struct i2c_client *client,
 const struct i2c_device_id *id)
 {  
struct wf_lm75_sensor *lm;
-   int rc, ds1775 = id->driver_data;
+   int rc, ds1775;
const char *name, *loc;
 
+   if (id)
+   ds1775 = id->driver_data;
+   else
+   ds1775 = !!of_device_get_match_data(&client->dev);
+
DBG("wf_lm75: creating  %s device at address 0x%02x\n",
ds1775 ? "ds1775" : "lm75", client->addr);
 
@@ -164,9 +170,17 @@ static const struct i2c_device_id wf_lm75_id[] = {
 };
 MODULE_DEVICE_TABLE(i2c, wf_lm75_id);
 
+static const struct of_device_id wf_lm75_of_id[] = {
+   { .compatible = "lm75", .data = (void *)0},
+   { .compatible = "ds1775", .data = (void *)1 },
+   { }
+};
+MODULE_DEVICE_TABLE(of, wf_lm75_of_id);
+
 static struct i2c_driver wf_lm75_driver = {
.driver = {
.name   = "wf_lm75",
+   .of_match_table = wf_lm75_of_id,
},
.probe  = wf_lm75_probe,
.remove = wf_lm75_remove,
diff --git a/drivers/macintosh/windfarm_lm87_sensor.c 
b/drivers/macintosh/windfarm_lm87_sensor.c
index b03a33b803b7..d011899c0a8a 100644
--- a/drivers/macintosh/windfarm_lm87_sensor.c
+++ b/drivers/macintosh/windfarm_lm87_sensor.c
@@ -166,9 +166,16 @@ static const struct i2c_device_id wf_lm87_id[] = {
 };
 MODULE_DEVICE_TABLE(i2c, wf_lm87_id);
 
+static const struct of_device_id wf_lm87_of_id[] = {
+   { .compatible = "lm87cimt", },
+   { }
+};
+MODULE_DEVICE_TABLE(of, wf_lm87_of_id);
+
 static struct i2c_driver wf_lm87_driver = {
.d

Re: [PATCH v4 1/8] ASoC: dt-bindings: fsl_asrc: Change asrc-width to asrc-format

2020-03-03 Thread Mark Brown
On Tue, Mar 03, 2020 at 11:59:30AM +0800, Shengjiu Wang wrote:
> On Tue, Mar 3, 2020 at 9:43 AM Rob Herring  wrote:

> > > -   - fsl,asrc-width  : Defines a mutual sample width used by DPCM Back 
> > > Ends.
> > > +   - fsl,asrc-format : Defines a mutual sample format used by DPCM Back
> > > +   Ends. The value is one of SNDRV_PCM_FORMAT_XX in
> > > +   "include/uapi/sound/asound.h"

> > You can't just change properties. They are an ABI.

> I have updated all the things related to this ABI in this patch series.
> What else should I do?

Like Nicolin says you should continue to support the old stuff.  The
kernel should work with people's out of tree DTs too so simply updating
everything in the tree isn't enough.
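
A common pattern is to parse the new property first and fall back to
the legacy one, e.g. (untested sketch; the 16/24-bit mapping is my
assumption about how fsl,asrc-width was used):

  u32 width, asrc_format;
  int ret;

  ret = of_property_read_u32(np, "fsl,asrc-format", &asrc_format);
  if (ret) {
          /* fall back to the old, deprecated property */
          ret = of_property_read_u32(np, "fsl,asrc-width", &width);
          if (!ret)
                  asrc_format = (width == 24) ? SNDRV_PCM_FORMAT_S24_LE
                                              : SNDRV_PCM_FORMAT_S16_LE;
  }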


signature.asc
Description: PGP signature


Re: [PATCH 2/2] powerpc/powernv: Wire up OPAL address lookups

2020-03-03 Thread Michael Ellerman
Nicholas Piggin  writes:
> Use ARCH_HAS_ADDRESS_LOOKUP to look up the opal symbol table. This
> allows crashes and xmon debugging to print firmware symbols.
>
>   Oops: System Reset, sig: 6 [#1]
>   LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
>   Modules linked in:
>   CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.6.0-rc2-dirty #903
>   NIP:  30020434 LR: 3000378c CTR: 30020414
>   REGS: c000fffc3d70 TRAP: 0100   Not tainted  (5.6.0-rc2-dirty)
>   MSR:  92101002   CR: 28022284  XER: 2004
>   CFAR: 30003788 IRQMASK: 3
>   GPR00: 3000378c 31c13c90 30136200 c12cfa10
>   GPR04: c12cfa10 0010  31c10060
>   GPR08: c12cfaaf 30003640  0001
>   GPR12: 300e c149  c139c588
>   GPR16: 31c1 c125a900  c12076a8
>   GPR20: c12a3950 0001 31c10060 c12cfaaf
>   GPR24: 0019 30003640  
>   GPR28: 0010 c12cfa10  
>   NIP [30020434] .dummy_console_write_buffer_space+0x20/0x64 [OPAL]
>   LR [3000378c] opal_entry+0x14c/0x17c [OPAL]
>
> This won't unwind the firmware stack (or its Linux caller) properly if
> firmware and kernel endians don't match, but that problem could be solved
> in powerpc's unwinder.

How well does this work if we're tracing opal calls at the time we oops :)

Though it looks like that's already fishy because we don't do anything
to disable tracing of opal_console_write().

I guess I'm a bit wary of adding numerous further opal calls in the oops
path; I'm sure the opal symbol lookup code is bug-free, but still.

Could we instead suck in the opal symbols early on, and search them in
Linux? I suspect you've thought of that and rejected it, but it would be
good to document why.

cheers

> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 497b7d0b2d7e..4d32b02d35e8 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -115,6 +115,7 @@ config PPC
>   # Please keep this list sorted alphabetically.
>   #
>   select ARCH_32BIT_OFF_T if PPC32
> + select ARCH_HAS_ADDRESS_LOOKUP  if PPC_POWERNV
>   select ARCH_HAS_DEBUG_VIRTUAL
>   select ARCH_HAS_DEVMEM_IS_ALLOWED
>   select ARCH_HAS_ELF_RANDOMIZE
> diff --git a/arch/powerpc/include/asm/opal-api.h 
> b/arch/powerpc/include/asm/opal-api.h
> index c1f25a760eb1..c3a2a797177a 100644
> --- a/arch/powerpc/include/asm/opal-api.h
> +++ b/arch/powerpc/include/asm/opal-api.h
> @@ -214,7 +214,11 @@
>  #define OPAL_SECVAR_GET  176
>  #define OPAL_SECVAR_GET_NEXT 177
>  #define OPAL_SECVAR_ENQUEUE_UPDATE   178
> -#define OPAL_LAST178
> +#define OPAL_PHB_SET_OPTION  179
> +#define OPAL_PHB_GET_OPTION  180

Only pull in the calls you need for this patch.

> +#define OPAL_GET_SYMBOL  181
> +#define OPAL_LOOKUP_SYMBOL   182
> +#define OPAL_LAST182
>  
>  #define QUIESCE_HOLD 1 /* Spin all calls at entry */
>  #define QUIESCE_REJECT   2 /* Fail all calls with 
> OPAL_BUSY */
> diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
> index 9986ac34b8e2..ef2d9273f06f 100644
> --- a/arch/powerpc/include/asm/opal.h
> +++ b/arch/powerpc/include/asm/opal.h
> @@ -312,6 +312,9 @@ s64 opal_mpipl_query_tag(enum opal_mpipl_tags tag, u64 
> *addr);
>  s64 opal_signal_system_reset(s32 cpu);
>  s64 opal_quiesce(u64 shutdown_type, s32 cpu);
>  
> +int64_t opal_get_symbol(uint64_t addr, __be64 *symaddr, __be64 *symsize, 
> char *namebuf, uint64_t buflen);
> +int64_t opal_lookup_symbol(const char *name, __be64 *symaddr, __be64 
> *symsize);
> +
>  /* Internal functions */
>  extern int early_init_dt_scan_opal(unsigned long node, const char *uname,
>  int depth, void *data);
> diff --git a/arch/powerpc/platforms/powernv/opal-call.c 
> b/arch/powerpc/platforms/powernv/opal-call.c
> index 5cd0f52d258f..ba2d94df 100644
> --- a/arch/powerpc/platforms/powernv/opal-call.c
> +++ b/arch/powerpc/platforms/powernv/opal-call.c
> @@ -293,3 +293,5 @@ OPAL_CALL(opal_mpipl_query_tag,   
> OPAL_MPIPL_QUERY_TAG);
>  OPAL_CALL(opal_secvar_get,   OPAL_SECVAR_GET);
>  OPAL_CALL(opal_secvar_get_next,  OPAL_SECVAR_GET_NEXT);
>  OPAL_CALL(opal_secvar_enqueue_update,
> OPAL_SECVAR_ENQUEUE_UPDATE);
> +OPAL_CALL(opal_get_symbol,   OPAL_GET_SYMBOL);
> +OPAL_CALL(opal_lookup_symbol,OPAL_LOOKUP_SYMBOL);
> diff --git a/arch/powerpc/platforms/powernv/opal.c 
> b/arch/powe

Re: [PATCH] macintosh: therm_windtunnel: fix regression when instantiating devices

2020-03-03 Thread Michael Ellerman
Wolfram Sang  writes:
> On Tue, Feb 25, 2020 at 03:12:29PM +0100, Wolfram Sang wrote:
>> Removing attach_adapter from this driver caused a regression for at
>> least some machines. Those machines had the sensors described in their
>> DT, too, so they didn't need manual creation of the sensor devices. The
>> old code worked, though, because manual creation came first. Creation of
>> DT devices then failed later and caused error logs, but the sensors
>> worked nonetheless because of the manually created devices.
>> 
>> When removing attach_adaper, manual creation now comes later and loses
>> the race. The sensor devices were already registered via DT, yet with
>> another binding, so the driver could not be bound to it.
>> 
>> This fix refactors the code to remove the race and only manually creates
>> devices if there are no DT nodes present. Also, the DT binding is updated
>> to match both, the DT and manually created devices. Because we don't
>> know which device creation will be used at runtime, the code to start
>> the kthread is moved to do_probe() which will be called by both methods.
>> 
>> Fixes: 3e7bed52719d ("macintosh: therm_windtunnel: drop using 
>> attach_adapter")
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=201723
>> Reported-by: Erhard Furtner 
>> Tested-by: Erhard Furtner 
>> Signed-off-by: Wolfram Sang 
>
> Applied to for-current, thanks!

Thanks.

cheers


Re: [PATCH v3 17/27] powerpc/powernv/pmem: Implement the Read Error Log command

2020-03-03 Thread Frederic Barrat




On 21/02/2020 at 04:27, Alastair D'Silva wrote:

From: Alastair D'Silva 

The read error log command extracts information from the controller's
internal error log.

This patch exposes this information in 2 ways:
- During probe, if an error occurs & a log is available, print it to the
   console
- After probe, make the error log available to userspace via an IOCTL.
   Userspace is notified of pending error logs in a later patch
   ("powerpc/powernv/pmem: Forward events to userspace")

Signed-off-by: Alastair D'Silva 
---
  arch/powerpc/platforms/powernv/pmem/ocxl.c| 269 ++
  .../platforms/powernv/pmem/ocxl_internal.h|   1 +
  include/uapi/nvdimm/ocxl-pmem.h   |  46 +++
  3 files changed, 316 insertions(+)
  create mode 100644 include/uapi/nvdimm/ocxl-pmem.h

diff --git a/arch/powerpc/platforms/powernv/pmem/ocxl.c 
b/arch/powerpc/platforms/powernv/pmem/ocxl.c
index 63109a870d2c..2b64504f9129 100644
--- a/arch/powerpc/platforms/powernv/pmem/ocxl.c
+++ b/arch/powerpc/platforms/powernv/pmem/ocxl.c
@@ -447,10 +447,219 @@ static int file_release(struct inode *inode, struct file 
*file)
return 0;
  }
  
+/**

+ * error_log_header_parse() - Parse the first 64 bits of the error log command 
response
+ * @ocxlpmem: the device metadata
+ * @length: out, returns the number of bytes in the response (excluding the 64 
bit header)
+ */
+static int error_log_header_parse(struct ocxlpmem *ocxlpmem, u16 *length)
+{
+   int rc;
+   u64 val;
+



Empty line in the middle of declarations



+   u16 data_identifier;
+   u32 data_length;
+
+   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu,
+ocxlpmem->admin_command.data_offset,
+OCXL_LITTLE_ENDIAN, &val);
+   if (rc)
+   return rc;
+
+   data_identifier = val >> 48;
+   data_length = val & 0x;
+
+   if (data_identifier != 0x454C) { // 'EL'
+   dev_err(&ocxlpmem->dev,
+   "Bad data identifier for error log data, expected 'EL', got 
'%2s' (%#x), data_length=%u\n",
+   (char *)&data_identifier,
+   (unsigned int)data_identifier, data_length);
+   return -EINVAL;
+   }
+
+   *length = data_length;
+   return 0;
+}
+
+static int error_log_offset_0x08(struct ocxlpmem *ocxlpmem,
+u32 *log_identifier, u32 *program_ref_code)
+{
+   int rc;
+   u64 val;
+
+   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu,
+ocxlpmem->admin_command.data_offset + 0x08,
+OCXL_LITTLE_ENDIAN, &val);
+   if (rc)
+   return rc;
+
+   *log_identifier = val >> 32;
+   *program_ref_code = val & 0x;
+
+   return 0;
+}
+
+static int read_error_log(struct ocxlpmem *ocxlpmem,
+ struct ioctl_ocxl_pmem_error_log *log, bool 
buf_is_user)
+{
+   u64 val;
+   u16 user_buf_length;
+   u16 buf_length;
+   u16 i;
+   int rc;
+
+   if (log->buf_size % 8)
+   return -EINVAL;
+
+   rc = ocxlpmem_chi(ocxlpmem, &val);
+   if (rc)
+   goto out;




"out" will unlock a mutex not yet taken.




+
+   if (!(val & GLOBAL_MMIO_CHI_ELA))
+   return -EAGAIN;
+
+   user_buf_length = log->buf_size;
+
+   mutex_lock(&ocxlpmem->admin_command.lock);
+
+   rc = admin_command_request(ocxlpmem, ADMIN_COMMAND_ERRLOG);
+   if (rc)
+   goto out;
+
+   rc = admin_command_execute(ocxlpmem);
+   if (rc)
+   goto out;
+
+   rc = admin_command_complete_timeout(ocxlpmem, ADMIN_COMMAND_ERRLOG);
+   if (rc < 0) {
+   dev_warn(&ocxlpmem->dev, "Read error log timed out\n");
+   goto out;
+   }
+
+   rc = admin_response(ocxlpmem);
+   if (rc < 0)
+   goto out;
+   if (rc != STATUS_SUCCESS) {
+   warn_status(ocxlpmem, "Unexpected status from retrieve error 
log", rc);
+   goto out;
+   }
+
+
+   rc = error_log_header_parse(ocxlpmem, &log->buf_size);
+   if (rc)
+   goto out;
+   // log->buf_size now contains the returned buffer size, not the user 
size
+
+   rc = error_log_offset_0x08(ocxlpmem, &log->log_identifier,
+  &log->program_reference_code);
+   if (rc)
+   goto out;




Offset 0x08 gets preferential treatment compared to 0x10 below, and 
it's not clear why.

I would create a subfunction which parses all the fields linearly.




+   rc = ocxl_global_mmio_read64(ocxlpmem->ocxl_afu,
+ocxlpmem->admin_command.data_offset + 0x10,
+OCXL_LITTLE_ENDIAN, &val);
+   if (rc)
+   goto out;
+
+   log->error_log_type = val >> 56;
+   log->action_fl

Re: [PATCH 2/6] powerpc: kvm: no need to check return value of debugfs_create functions

2020-03-03 Thread Michael Ellerman
Greg Kroah-Hartman  writes:
> On Tue, Mar 03, 2020 at 08:45:23PM +1100, Michael Ellerman wrote:
>> Greg Kroah-Hartman  writes:
>> > On Tue, Mar 03, 2020 at 06:46:23PM +1100, Michael Ellerman wrote:
>> >> Greg Kroah-Hartman  writes:
>> >> > When calling debugfs functions, there is no need to ever check the
>> >> > return value.  The function can work or not, but the code logic should
>> >> > never do something different based on this.
>> >> 
>> >> Except it does need to do something different, if the file was created
>> >> it needs to be removed in the remove path.
>> >> 
>> >> > diff --git a/arch/powerpc/kvm/timing.c b/arch/powerpc/kvm/timing.c
>> >> > index bfe4f106cffc..8e4791c6f2af 100644
>> >> > --- a/arch/powerpc/kvm/timing.c
>> >> > +++ b/arch/powerpc/kvm/timing.c
>> >> > @@ -207,19 +207,12 @@ static const struct file_operations 
>> >> > kvmppc_exit_timing_fops = {
>> >> >  void kvmppc_create_vcpu_debugfs(struct kvm_vcpu *vcpu, unsigned int id)
>> >> >  {
>> >> > static char dbg_fname[50];
>> >> > -   struct dentry *debugfs_file;
>> >> >  
>> >> > snprintf(dbg_fname, sizeof(dbg_fname), "vm%u_vcpu%u_timing",
>> >> >  current->pid, id);
>> >> > -   debugfs_file = debugfs_create_file(dbg_fname, 0666,
>> >> > -   kvm_debugfs_dir, vcpu,
>> >> > -   &kvmppc_exit_timing_fops);
>> >> > -
>> >> > -   if (!debugfs_file) {
>> >> > -   printk(KERN_ERR"%s: error creating debugfs file %s\n",
>> >> > -   __func__, dbg_fname);
>> >> > -   return;
>> >> > -   }
>> >> > +   debugfs_create_file(dbg_fname, 0666, kvm_debugfs_dir, vcpu,
>> >> > +   &kvmppc_exit_timing_fops);
>> >> > +
>> >> >  
>> >> > vcpu->arch.debugfs_exit_timing = debugfs_file;
>> >
>> > Ugh, you are right, how did I miss that?  How is 0-day missing this?
>> > It's been in my tree for a long time, odd.
>> 
>> This code isn't enabled by default, or in any defconfig. So it's only
>> allmodconfig that would trip it, I guess 0-day isn't doing powerpc
>> allmodconfig builds.
>> 
>> >> I squashed this in, which seems to work:
>> ...
>> >>  
>> >>  void kvmppc_remove_vcpu_debugfs(struct kvm_vcpu *vcpu)
>> >>  {
>> >> -   if (vcpu->arch.debugfs_exit_timing) {
>> >> +   if (!IS_ERR_OR_NULL(vcpu->arch.debugfs_exit_timing)) {
>> >> debugfs_remove(vcpu->arch.debugfs_exit_timing);
>> >> vcpu->arch.debugfs_exit_timing = NULL;
>> >> }
>> >
>> > No, this can just be:
>> >debugfs_remove(vcpu->arch.debugfs_exit_timing);
>> >
>> > No need to check anything, just call it and the debugfs code can handle
>> > it just fine.
>> 
>> Oh duh, of course, I should have checked.
>> 
>> I'd still like to NULL out the debugfs_exit_timing member, so I'll do:
>> 
>> void kvmppc_remove_vcpu_debugfs(struct kvm_vcpu *vcpu)
>> {
>>  debugfs_remove(vcpu->arch.debugfs_exit_timing);
>>  vcpu->arch.debugfs_exit_timing = NULL;
>> }
>
> Fair enough, but I doubt it ever matters :)

Yeah, but I'm paranoid and I have no way to test this code :)

> Thanks for the fixups, sorry for sending a broken patch, my fault.

No worries, we have too many CONFIG options.

cheers


Re: eh_frame confusion

2020-03-03 Thread Michael Ellerman
"Naveen N. Rao"  writes:
> Rasmus Villemoes wrote:
>> I'm building a ppc32 kernel, and noticed that after upgrading from gcc-7
>> to gcc-8 all object files now end up having .eh_frame section. For
>> vmlinux, that's not a problem, because they all get discarded in
>> arch/powerpc/kernel/vmlinux.lds.S . However, they stick around in
>> modules, which doesn't seem to be useful - given that everything worked
>> just fine with gcc-7, and I don't see anything in the module loader that
>> handles .eh_frame.
>> 
>> The reason I care is that my target has a rather tight rootfs budget,
>> and the .eh_frame section seems to occupy 10-30% of the file size
>> (obviously very depending on the particular module).
>> 
>> Comparing the .foo.o.cmd files, I don't see change in options that might
>> explain this (there's a bunch of new -Wno-*, and the -mspe=no spelling
>> is apparently no longer supported in gcc-8). Both before and after, there's
>> 
>> -fno-dwarf2-cfi-asm
>> 
>> about which gcc's documentation says
>> 
>> '-fno-dwarf2-cfi-asm'
>>  Emit DWARF unwind info as compiler generated '.eh_frame' section
>>  instead of using GAS '.cfi_*' directives.
>> 
>> Looking into where that comes from got me even more confused, because
>> both arm and unicore32 say
>> 
>> # Never generate .eh_frame
>> KBUILD_CFLAGS   += $(call cc-option,-fno-dwarf2-cfi-asm)
>> 
>> while the ppc32 case at hand says
>> 
>> # FIXME: the module load should be taught about the additional relocs
>> # generated by this.
>> # revert to pre-gcc-4.4 behaviour of .eh_frame
>
> Michael opened a task to look into this recently and I had spent some 
> time last week on this. The original commit/discussion adding 
> -fno-dwarf2-cfi-asm refers to R_PPC64_REL32 relocations not being 
> handled by our module loader:
> http://lkml.kernel.org/r/20090224065112.ga6...@bombadil.infradead.org

I opened that issue purely based on noticing the wart in the Makefile,
not because I'd actually tested it.

> However, that is now handled thanks to commit 9f751b82b491d:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9f751b82b491d

Haha, written by me, what an idiot.

So the Makefile hack can presumably be dropped, because the module
loader can handle the relocations.

And then maybe we also want to turn off the unwind tables, but that
would be a separate patch.
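
Something like this, presumably (untested sketch):

  # in arch/powerpc/Makefile, dropping the -fno-dwarf2-cfi-asm workaround
  # and suppressing the unwind tables instead
  KBUILD_CFLAGS += $(call cc-option,-fno-asynchronous-unwind-tables)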

> I did a test build and a simple module loaded fine, so I think 
> -fno-dwarf2-cfi-asm is not required anymore, unless Michael has seen 
> some breakages with it. Michael?

No, as I said above it was just reading the Makefile.

cheers


Re: [PATCH 2/6] powerpc: kvm: no need to check return value of debugfs_create functions

2020-03-03 Thread Greg Kroah-Hartman
On Tue, Mar 03, 2020 at 08:45:23PM +1100, Michael Ellerman wrote:
> Greg Kroah-Hartman  writes:
> > On Tue, Mar 03, 2020 at 06:46:23PM +1100, Michael Ellerman wrote:
> >> Greg Kroah-Hartman  writes:
> >> > When calling debugfs functions, there is no need to ever check the
> >> > return value.  The function can work or not, but the code logic should
> >> > never do something different based on this.
> >> 
> >> Except it does need to do something different, if the file was created
> >> it needs to be removed in the remove path.
> >> 
> >> > diff --git a/arch/powerpc/kvm/timing.c b/arch/powerpc/kvm/timing.c
> >> > index bfe4f106cffc..8e4791c6f2af 100644
> >> > --- a/arch/powerpc/kvm/timing.c
> >> > +++ b/arch/powerpc/kvm/timing.c
> >> > @@ -207,19 +207,12 @@ static const struct file_operations 
> >> > kvmppc_exit_timing_fops = {
> >> >  void kvmppc_create_vcpu_debugfs(struct kvm_vcpu *vcpu, unsigned int id)
> >> >  {
> >> >  static char dbg_fname[50];
> >> > -struct dentry *debugfs_file;
> >> >  
> >> >  snprintf(dbg_fname, sizeof(dbg_fname), "vm%u_vcpu%u_timing",
> >> >   current->pid, id);
> >> > -debugfs_file = debugfs_create_file(dbg_fname, 0666,
> >> > -kvm_debugfs_dir, vcpu,
> >> > -&kvmppc_exit_timing_fops);
> >> > -
> >> > -if (!debugfs_file) {
> >> > -printk(KERN_ERR"%s: error creating debugfs file %s\n",
> >> > -__func__, dbg_fname);
> >> > -return;
> >> > -}
> >> > +debugfs_create_file(dbg_fname, 0666, kvm_debugfs_dir, vcpu,
> >> > +&kvmppc_exit_timing_fops);
> >> > +
> >> >  
> >> >  vcpu->arch.debugfs_exit_timing = debugfs_file;
> >
> > Ugh, you are right, how did I miss that?  How is 0-day missing this?
> > It's been in my tree for a long time, odd.
> 
> This code isn't enabled by default, or in any defconfig. So it's only
> allmodconfig that would trip it, I guess 0-day isn't doing powerpc
> allmodconfig builds.
> 
> >> I squashed this in, which seems to work:
> ...
> >>  
> >>  void kvmppc_remove_vcpu_debugfs(struct kvm_vcpu *vcpu)
> >>  {
> >> -   if (vcpu->arch.debugfs_exit_timing) {
> >> +   if (!IS_ERR_OR_NULL(vcpu->arch.debugfs_exit_timing)) {
> >> debugfs_remove(vcpu->arch.debugfs_exit_timing);
> >> vcpu->arch.debugfs_exit_timing = NULL;
> >> }
> >
> > No, this can just be:
> > debugfs_remove(vcpu->arch.debugfs_exit_timing);
> >
> > No need to check anything, just call it and the debugfs code can handle
> > it just fine.
> 
> Oh duh, of course, I should have checked.
> 
> I'd still like to NULL out the debugfs_exit_timing member, so I'll do:
> 
> void kvmppc_remove_vcpu_debugfs(struct kvm_vcpu *vcpu)
> {
>   debugfs_remove(vcpu->arch.debugfs_exit_timing);
>   vcpu->arch.debugfs_exit_timing = NULL;
> }

Fair enough, but I doubt it ever matters :)
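For what it's worth, the reason no check is needed is that
debugfs_remove() guards itself; abridged, it is roughly:

        void debugfs_remove(struct dentry *dentry)
        {
                if (IS_ERR_OR_NULL(dentry))
                        return;

                /* ... pin the fs and recursively remove the entry ... */
        }

so passing it a NULL or ERR_PTR dentry is a harmless no-op.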

Thanks for the fixups, sorry for sending a broken patch, my fault.

greg k-h


Re: [PATCH 6/6] powerpc: powernv: no need to check return value of debugfs_create functions

2020-03-03 Thread Michael Ellerman
"Oliver O'Halloran"  writes:
> On Mon, Feb 10, 2020 at 12:12 AM Greg Kroah-Hartman
>  wrote:
>>
>> When calling debugfs functions, there is no need to ever check the
>> return value.  The function can work or not, but the code logic should
>> never do something different based on this.
>
> For memtrace, debugfs is the only way to actually use the feature. It'd
> be nice if it still printed out *something* if it failed to create the
> files, rather than just being mysteriously absent.

That's true, but the current code doesn't actually do that anyway.

>> diff --git a/arch/powerpc/platforms/powernv/memtrace.c 
>> b/arch/powerpc/platforms/powernv/memtrace.c
>> index eb2e75dac369..d6d64f8718e6 100644
>> --- a/arch/powerpc/platforms/powernv/memtrace.c
>> +++ b/arch/powerpc/platforms/powernv/memtrace.c
>> @@ -187,11 +187,6 @@ static int memtrace_init_debugfs(void)
>>
>> snprintf(ent->name, 16, "%08x", ent->nid);
>> dir = debugfs_create_dir(ent->name, memtrace_debugfs_dir);
>> -   if (!dir) {
>> -   pr_err("Failed to create debugfs directory for node %d\n",
>> -   ent->nid);
>> -   return -1;
>> -   }

debugfs_create_dir() doesn't return NULL on error; it returns
ERR_PTR(-ENOMEM), so the !dir check never fires and that pr_err() is
dead code.

So I've merged this and if someone wants to they can send a follow-up to
do proper error checking in memtrace.c
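Roughly along these lines, I'd guess (untested):

        dir = debugfs_create_dir(ent->name, memtrace_debugfs_dir);
        if (IS_ERR(dir)) {
                pr_err("Failed to create debugfs directory for node %d\n",
                       ent->nid);
                return PTR_ERR(dir);
        }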

cheers


Re: [PATCH v3 3/7] x86/mm: Thread pgprot_t through init_memory_mapping()

2020-03-03 Thread Michal Hocko
On Fri 21-02-20 11:24:59, Logan Gunthorpe wrote:
> In preparation for supporting a pgprot_t argument for arch_add_memory().
> 
> The prototype of init_memory_mapping() needs to be moved, since its
> original location came before the definition of pgprot_t.
> 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Borislav Petkov 
> Cc: "H. Peter Anvin" 
> Cc: x...@kernel.org
> Cc: Dave Hansen 
> Cc: Andy Lutomirski 
> Cc: Peter Zijlstra 
> Signed-off-by: Logan Gunthorpe 

Acked-by: Michal Hocko 

> ---
>  arch/x86/include/asm/page_types.h |  3 ---
>  arch/x86/include/asm/pgtable.h|  3 +++
>  arch/x86/kernel/amd_gart_64.c |  3 ++-
>  arch/x86/mm/init.c|  9 +
>  arch/x86/mm/init_32.c |  3 ++-
>  arch/x86/mm/init_64.c | 32 +--
>  arch/x86/mm/mm_internal.h |  3 ++-
>  arch/x86/platform/uv/bios_uv.c|  3 ++-
>  8 files changed, 34 insertions(+), 25 deletions(-)
> 
> diff --git a/arch/x86/include/asm/page_types.h 
> b/arch/x86/include/asm/page_types.h
> index c85e15010f48..bf7aa2e290ef 100644
> --- a/arch/x86/include/asm/page_types.h
> +++ b/arch/x86/include/asm/page_types.h
> @@ -73,9 +73,6 @@ static inline phys_addr_t get_max_mapped(void)
>  
>  bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn);
>  
> -extern unsigned long init_memory_mapping(unsigned long start,
> -  unsigned long end);
> -
>  extern void initmem_init(void);
>  
>  #endif   /* !__ASSEMBLY__ */
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 7e118660bbd9..48d6a5960f28 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1046,6 +1046,9 @@ static inline void __meminit 
> init_trampoline_default(void)
>  
>  void __init poking_init(void);
>  
> +unsigned long init_memory_mapping(unsigned long start,
> +   unsigned long end, pgprot_t prot);
> +
>  # ifdef CONFIG_RANDOMIZE_MEMORY
>  void __meminit init_trampoline(void);
>  # else
> diff --git a/arch/x86/kernel/amd_gart_64.c b/arch/x86/kernel/amd_gart_64.c
> index 4e5f50236048..16133819415c 100644
> --- a/arch/x86/kernel/amd_gart_64.c
> +++ b/arch/x86/kernel/amd_gart_64.c
> @@ -744,7 +744,8 @@ int __init gart_iommu_init(void)
>  
>   start_pfn = PFN_DOWN(aper_base);
>   if (!pfn_range_is_mapped(start_pfn, end_pfn))
> - init_memory_mapping(start_pfn<<PAGE_SHIFT, end_pfn<<PAGE_SHIFT);
> + init_memory_mapping(start_pfn<<PAGE_SHIFT, end_pfn<<PAGE_SHIFT,
> +     PAGE_KERNEL);
>  
>   pr_info("PCI-DMA: using GART IOMMU.\n");
>   iommu_size = check_iommu_size(info.aper_base, aper_size);
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index e7bb483557c9..1bba16c5742b 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -467,7 +467,7 @@ bool pfn_range_is_mapped(unsigned long start_pfn, 
> unsigned long end_pfn)
>   * the physical memory. To access them they are temporarily mapped.
>   */
>  unsigned long __ref init_memory_mapping(unsigned long start,
> -unsigned long end)
> + unsigned long end, pgprot_t prot)
>  {
>   struct map_range mr[NR_RANGE_MR];
>   unsigned long ret = 0;
> @@ -481,7 +481,8 @@ unsigned long __ref init_memory_mapping(unsigned long 
> start,
>  
>   for (i = 0; i < nr_range; i++)
>   ret = kernel_physical_mapping_init(mr[i].start, mr[i].end,
> -mr[i].page_size_mask);
> +mr[i].page_size_mask,
> +prot);
>  
>   add_pfn_range_mapped(start >> PAGE_SHIFT, ret >> PAGE_SHIFT);
>  
> @@ -521,7 +522,7 @@ static unsigned long __init init_range_memory_mapping(
>*/
>   can_use_brk_pgt = max(start, (u64)pgt_buf_end<<PAGE_SHIFT) >=
>   min(end, (u64)pgt_buf_top<<PAGE_SHIFT);
> - init_memory_mapping(start, end);
> + init_memory_mapping(start, end, PAGE_KERNEL);
>   mapped_ram_size += end - start;
>   can_use_brk_pgt = true;
>   }
> @@ -661,7 +662,7 @@ void __init init_mem_mapping(void)
>  #endif
>  
>   /* the ISA range is always mapped regardless of memory holes */
> - init_memory_mapping(0, ISA_END_ADDRESS);
> + init_memory_mapping(0, ISA_END_ADDRESS, PAGE_KERNEL);
>  
>   /* Init the trampoline, possibly with KASLR memory offset */
>   init_trampoline();
> diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
> index 3ec3dac7c268..e25a4218e6ff 100644
> --- a/arch/x86/mm/init_32.c
> +++ b/arch/x86/mm/init_32.c
> @@ -253,7 +253,8 @@ static inline int is_kernel_text(unsigned long addr)
>  unsigned long __init
>  kernel_physical_mapping_init(unsigned long start,
>unsigned long end,
> -  unsigned long page_size_mask)
> +  unsigned long page_size_mask,
> +  pgprot_t prot)

Re: [PATCH v3 2/7] mm/memory_hotplug: Rename mhp_restrictions to mhp_params

2020-03-03 Thread Michal Hocko
On Fri 21-02-20 11:24:58, Logan Gunthorpe wrote:
> The mhp_restrictions struct really doesn't specify anything resembling
> a restriction anymore, so rename it to mhp_params, as it is a list
> of extended parameters.
> 
> Signed-off-by: Logan Gunthorpe 

Acked-by: Michal Hocko 
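For context, with the flags field already dropped in the previous patch,
the renamed structure at this point in the series is simply (presumably
it grows a pgprot member later in the series):

        /*
         * Extended parameters for the memory hotplug:
         * altmap: alternative allocator for memmap array
         */
        struct mhp_params {
                struct vmem_altmap *altmap;
        };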

> ---
>  arch/arm64/mm/mmu.c|  4 ++--
>  arch/ia64/mm/init.c|  4 ++--
>  arch/powerpc/mm/mem.c  |  4 ++--
>  arch/s390/mm/init.c|  6 +++---
>  arch/sh/mm/init.c  |  4 ++--
>  arch/x86/mm/init_32.c  |  4 ++--
>  arch/x86/mm/init_64.c  |  8 
>  include/linux/memory_hotplug.h | 16 
>  mm/memory_hotplug.c|  8 
>  mm/memremap.c  |  8 
>  10 files changed, 33 insertions(+), 33 deletions(-)
> 
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 128f70852bf3..ee37bca8aba8 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1050,7 +1050,7 @@ int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
>  
>  #ifdef CONFIG_MEMORY_HOTPLUG
>  int arch_add_memory(int nid, u64 start, u64 size,
> - struct mhp_restrictions *restrictions)
> + struct mhp_params *params)
>  {
>   int flags = 0;
>  
> @@ -1063,7 +1063,7 @@ int arch_add_memory(int nid, u64 start, u64 size,
>   memblock_clear_nomap(start, size);
>  
>   return __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
> -restrictions);
> +params);
>  }
>  void arch_remove_memory(int nid, u64 start, u64 size,
>   struct vmem_altmap *altmap)
> diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
> index b01d68a2d5d9..97bbc23ea1e3 100644
> --- a/arch/ia64/mm/init.c
> +++ b/arch/ia64/mm/init.c
> @@ -670,13 +670,13 @@ mem_init (void)
>  
>  #ifdef CONFIG_MEMORY_HOTPLUG
>  int arch_add_memory(int nid, u64 start, u64 size,
> - struct mhp_restrictions *restrictions)
> + struct mhp_params *params)
>  {
>   unsigned long start_pfn = start >> PAGE_SHIFT;
>   unsigned long nr_pages = size >> PAGE_SHIFT;
>   int ret;
>  
> - ret = __add_pages(nid, start_pfn, nr_pages, restrictions);
> + ret = __add_pages(nid, start_pfn, nr_pages, params);
>   if (ret)
>   printk("%s: Problem encountered in __add_pages() as ret=%d\n",
>  __func__,  ret);
> diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
> index ef7b1119b2e2..b4bece53bec0 100644
> --- a/arch/powerpc/mm/mem.c
> +++ b/arch/powerpc/mm/mem.c
> @@ -128,7 +128,7 @@ static void flush_dcache_range_chunked(unsigned long 
> start, unsigned long stop,
>  }
>  
>  int __ref arch_add_memory(int nid, u64 start, u64 size,
> - struct mhp_restrictions *restrictions)
> +   struct mhp_params *params)
>  {
>   unsigned long start_pfn = start >> PAGE_SHIFT;
>   unsigned long nr_pages = size >> PAGE_SHIFT;
> @@ -144,7 +144,7 @@ int __ref arch_add_memory(int nid, u64 start, u64 size,
>   return -EFAULT;
>   }
>  
> - return __add_pages(nid, start_pfn, nr_pages, restrictions);
> + return __add_pages(nid, start_pfn, nr_pages, params);
>  }
>  
>  void __ref arch_remove_memory(int nid, u64 start, u64 size,
> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> index ac44bd76db4b..e9e4a7abd0cc 100644
> --- a/arch/s390/mm/init.c
> +++ b/arch/s390/mm/init.c
> @@ -268,20 +268,20 @@ device_initcall(s390_cma_mem_init);
>  #endif /* CONFIG_CMA */
>  
>  int arch_add_memory(int nid, u64 start, u64 size,
> - struct mhp_restrictions *restrictions)
> + struct mhp_params *params)
>  {
>   unsigned long start_pfn = PFN_DOWN(start);
>   unsigned long size_pages = PFN_DOWN(size);
>   int rc;
>  
> - if (WARN_ON_ONCE(restrictions->altmap))
> + if (WARN_ON_ONCE(params->altmap))
>   return -EINVAL;
>  
>   rc = vmem_add_mapping(start, size);
>   if (rc)
>   return rc;
>  
> - rc = __add_pages(nid, start_pfn, size_pages, restrictions);
> + rc = __add_pages(nid, start_pfn, size_pages, params);
>   if (rc)
>   vmem_remove_mapping(start, size);
>   return rc;
> diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
> index d1b1ff2be17a..e5114c053364 100644
> --- a/arch/sh/mm/init.c
> +++ b/arch/sh/mm/init.c
> @@ -406,14 +406,14 @@ void __init mem_init(void)
>  
>  #ifdef CONFIG_MEMORY_HOTPLUG
>  int arch_add_memory(int nid, u64 start, u64 size,
> - struct mhp_restrictions *restrictions)
> + struct mhp_params *params)
>  {
>   unsigned long start_pfn = PFN_DOWN(start);
>   unsigned long nr_pages = size >> PAGE_SHIFT;
>   int ret;
>  
>   /* We only have ZONE_NORMAL, so this is easy.. */
> - ret = __add_pages(nid, start_pfn, nr_pages, restrictions);
> - ret = __add_pages(nid, start_pfn, nr_pages, restrictions);
> + ret = __add_pages(nid, start_pfn, nr_pages, params);

Re: eh_frame confusion

2020-03-03 Thread Rasmus Villemoes
On 02/03/2020 18.32, Naveen N. Rao wrote:
> Naveen N. Rao wrote:
>> Michael opened a task to look into this recently and I had spent some
>> time last week on this. The original commit/discussion adding
>> -fno-dwarf2-cfi-asm refers to R_PPC64_REL32 relocations not being
>> handled by our module loader:
>> http://lkml.kernel.org/r/20090224065112.ga6...@bombadil.infradead.org
>>
>> However, that is now handled thanks to commit 9f751b82b491d:
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9f751b82b491d
>>
>>
>> I did a test build and a simple module loaded fine, so I think
>> -fno-dwarf2-cfi-asm is not required anymore, unless Michael has seen
>> some breakages with it. Michael?
>>
>>>
>>> but prior to gcc-8, .eh_frame didn't seem to get generated anyway.
>>>
>>> Can .eh_frame sections be discarded for modules (on ppc32 at least), or
>>> is there some magic that makes them necessary when building with gcc-8?
>>
>> As Segher points out, it looks like we need to add
>> -fno-asynchronous-unwind-tables. Most other architectures seem to use
>> that too.

Yes. Thanks, Segher, that explains that part.

> Can you check if the below patch works? I am yet to test this in more
> detail, but it would be good to know the implications for ppc32.

I'll see if that produces a bootable kernel, but I think I'd prefer a
more piecemeal approach.

One patch to add -fno-asynchronous-unwind-tables (given that other
arches do it unconditionally I don't think cc-option is needed), with a
commit log saying something like "no-op for gcc < 8, prevents .eh_frame
sections that are discarded anyway for vmlinux and waste disk space for
modules". Then another patch can get rid of -fno-dwarf2-cfi-asm if
that's no longer required.

Rasmus


Re: [PATCH v3 1/7] mm/memory_hotplug: Drop the flags field from struct mhp_restrictions

2020-03-03 Thread Michal Hocko
On Fri 21-02-20 11:24:57, Logan Gunthorpe wrote:
> This variable is not used anywhere and should therefore be removed
> from the structure.
> 
> Signed-off-by: Logan Gunthorpe 
> Reviewed-by: David Hildenbrand 

Acked-by: Michal Hocko 

> ---
>  include/linux/memory_hotplug.h | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index f4d59155f3d4..69ff3037528d 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -55,11 +55,9 @@ enum {
>  
>  /*
>   * Restrictions for the memory hotplug:
> - * flags:  MHP_ flags
>   * altmap: alternative allocator for memmap array
>   */
>  struct mhp_restrictions {
> - unsigned long flags;
>   struct vmem_altmap *altmap;
>  };
>  
> -- 
> 2.20.1

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 2/6] powerpc: kvm: no need to check return value of debugfs_create functions

2020-03-03 Thread Michael Ellerman
Greg Kroah-Hartman  writes:
> On Tue, Mar 03, 2020 at 06:46:23PM +1100, Michael Ellerman wrote:
>> Greg Kroah-Hartman  writes:
>> > When calling debugfs functions, there is no need to ever check the
>> > return value.  The function can work or not, but the code logic should
>> > never do something different based on this.
>> 
>> Except it does need to do something different, if the file was created
>> it needs to be removed in the remove path.
>> 
>> > diff --git a/arch/powerpc/kvm/timing.c b/arch/powerpc/kvm/timing.c
>> > index bfe4f106cffc..8e4791c6f2af 100644
>> > --- a/arch/powerpc/kvm/timing.c
>> > +++ b/arch/powerpc/kvm/timing.c
>> > @@ -207,19 +207,12 @@ static const struct file_operations 
>> > kvmppc_exit_timing_fops = {
>> >  void kvmppc_create_vcpu_debugfs(struct kvm_vcpu *vcpu, unsigned int id)
>> >  {
>> >static char dbg_fname[50];
>> > -  struct dentry *debugfs_file;
>> >  
>> >snprintf(dbg_fname, sizeof(dbg_fname), "vm%u_vcpu%u_timing",
>> > current->pid, id);
>> > -  debugfs_file = debugfs_create_file(dbg_fname, 0666,
>> > -  kvm_debugfs_dir, vcpu,
>> > -  &kvmppc_exit_timing_fops);
>> > -
>> > -  if (!debugfs_file) {
>> > -  printk(KERN_ERR"%s: error creating debugfs file %s\n",
>> > -  __func__, dbg_fname);
>> > -  return;
>> > -  }
>> > +  debugfs_create_file(dbg_fname, 0666, kvm_debugfs_dir, vcpu,
>> > +  &kvmppc_exit_timing_fops);
>> > +
>> >  
>> >vcpu->arch.debugfs_exit_timing = debugfs_file;
>
> Ugh, you are right, how did I miss that?  How is 0-day missing this?
> It's been in my tree for a long time, odd.

This code isn't enabled by default, or in any defconfig. So it's only
allmodconfig that would trip it, I guess 0-day isn't doing powerpc
allmodconfig builds.

>> I squashed this in, which seems to work:
...
>>  
>>  void kvmppc_remove_vcpu_debugfs(struct kvm_vcpu *vcpu)
>>  {
>> -   if (vcpu->arch.debugfs_exit_timing) {
>> +   if (!IS_ERR_OR_NULL(vcpu->arch.debugfs_exit_timing)) {
>> debugfs_remove(vcpu->arch.debugfs_exit_timing);
>> vcpu->arch.debugfs_exit_timing = NULL;
>> }
>
> No, this can just be:
>   debugfs_remove(vcpu->arch.debugfs_exit_timing);
>
> No need to check anything, just call it and the debugfs code can handle
> it just fine.

Oh duh, of course, I should have checked.

I'd still like to NULL out the debugfs_exit_timing member, so I'll do:

void kvmppc_remove_vcpu_debugfs(struct kvm_vcpu *vcpu)
{
debugfs_remove(vcpu->arch.debugfs_exit_timing);
vcpu->arch.debugfs_exit_timing = NULL;
}


cheers


Re: [PATCH v3 16/27] powerpc/powernv/pmem: Register a character device for userspace to interact with

2020-03-03 Thread Frederic Barrat




On 21/02/2020 at 04:27, Alastair D'Silva wrote:

From: Alastair D'Silva 

This patch introduces a character device (/dev/ocxl-scmX) which further
patches will use to interact with userspace.

Signed-off-by: Alastair D'Silva 
---
  arch/powerpc/platforms/powernv/pmem/ocxl.c| 116 +-
  .../platforms/powernv/pmem/ocxl_internal.h|   2 +
  2 files changed, 116 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pmem/ocxl.c 
b/arch/powerpc/platforms/powernv/pmem/ocxl.c
index b8bd7e703b19..63109a870d2c 100644
--- a/arch/powerpc/platforms/powernv/pmem/ocxl.c
+++ b/arch/powerpc/platforms/powernv/pmem/ocxl.c
@@ -10,6 +10,7 @@
  #include 
  #include 
  #include 
+#include <linux/cdev.h>
  #include 
  #include 
  #include "ocxl_internal.h"
@@ -339,6 +340,9 @@ static void free_ocxlpmem(struct ocxlpmem *ocxlpmem)
  
  	free_minor(ocxlpmem);
  
+	if (ocxlpmem->cdev.owner)
+		cdev_del(&ocxlpmem->cdev);
+
if (ocxlpmem->metadata_addr)
devm_memunmap(&ocxlpmem->dev, ocxlpmem->metadata_addr);
  
@@ -396,6 +400,70 @@ static int ocxlpmem_register(struct ocxlpmem *ocxlpmem)

return device_register(&ocxlpmem->dev);
  }
  
+static void ocxlpmem_put(struct ocxlpmem *ocxlpmem)
+{
+   put_device(&ocxlpmem->dev);
+}
+
+static struct ocxlpmem *ocxlpmem_get(struct ocxlpmem *ocxlpmem)
+{
+   return (get_device(&ocxlpmem->dev) == NULL) ? NULL : ocxlpmem;
+}
+
+static struct ocxlpmem *find_and_get_ocxlpmem(dev_t devno)
+{
+   struct ocxlpmem *ocxlpmem;
+   int minor = MINOR(devno);
+   /*
+* We don't declare an RCU critical section here, as our AFU
+* is protected by a reference counter on the device. By the time the
+* minor number of a device is removed from the idr, the ref count of
+* the device is already at 0, so no user API will access that AFU and
+* this function can't return it.
+*/



I fixed something related in the ocxl driver (where the introduction of 
the "info" device brought enough changes to make a similar comment 
wrong). See commit a58d37bce0d21. The issue is handling a simultaneous 
open() and removal of the device through /sysfs as gracefully as we can.


We are on a file open path and it's not like we're going to have a 
thousand clients, so performance is not that critical. We can take the 
mutex before searching in the IDR and release it after we increment the 
reference count on the device.
But that's not enough: we could still find the device in the IDR while 
it is being removed in free_ocxlpmem(). I believe the only safe way to 
address it is to remove the user-facing APIs (the char device) before 
calling device_unregister(), so that file_open() cannot find the device 
while it is in the middle of being removed.
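Something along these lines, mirroring the ocxl fix (untested sketch):

        static struct ocxlpmem *find_and_get_ocxlpmem(dev_t devno)
        {
                struct ocxlpmem *ocxlpmem;
                int minor = MINOR(devno);

                mutex_lock(&minors_idr_lock);
                ocxlpmem = idr_find(&minors_idr, minor);
                if (ocxlpmem)
                        ocxlpmem_get(ocxlpmem);
                mutex_unlock(&minors_idr_lock);

                return ocxlpmem;
        }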


  Fred



+   ocxlpmem = idr_find(&minors_idr, minor);
+   if (ocxlpmem)
+   ocxlpmem_get(ocxlpmem);
+   return ocxlpmem;
+}
+
+static int file_open(struct inode *inode, struct file *file)
+{
+   struct ocxlpmem *ocxlpmem;
+
+   ocxlpmem = find_and_get_ocxlpmem(inode->i_rdev);
+   if (!ocxlpmem)
+   return -ENODEV;
+
+   file->private_data = ocxlpmem;
+   return 0;
+}
+
+static int file_release(struct inode *inode, struct file *file)
+{
+   struct ocxlpmem *ocxlpmem = file->private_data;
+
+   ocxlpmem_put(ocxlpmem);
+   return 0;
+}
+
+static const struct file_operations fops = {
+   .owner  = THIS_MODULE,
+   .open   = file_open,
+   .release= file_release,
+};
+
+/**
+ * create_cdev() - Create the chardev in /dev for the device
+ * @ocxlpmem: the SCM metadata
+ * Return: 0 on success, negative on failure
+ */
+static int create_cdev(struct ocxlpmem *ocxlpmem)
+{
+   cdev_init(&ocxlpmem->cdev, &fops);
+   return cdev_add(&ocxlpmem->cdev, ocxlpmem->dev.devt, 1);
+}
+
  /**
   * ocxlpmem_remove() - Free an OpenCAPI persistent memory device
   * @pdev: the PCI device information struct
@@ -572,6 +640,11 @@ static int probe(struct pci_dev *pdev, const struct 
pci_device_id *ent)
goto err;
}
  
+	if (create_cdev(ocxlpmem)) {
+		dev_err(&pdev->dev, "Could not create character device\n");
+		goto err;
+	}



As already mentioned in a previous patch, we branch to the err label so 
rc needs to be set to a valid error.
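i.e. something like (sketch):

        rc = create_cdev(ocxlpmem);
        if (rc) {
                dev_err(&pdev->dev, "Could not create character device\n");
                goto err;
        }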





+
elapsed = 0;
timeout = ocxlpmem->readiness_timeout + 
ocxlpmem->memory_available_timeout;
while (!is_usable(ocxlpmem, false)) {
@@ -613,20 +686,59 @@ static struct pci_driver pci_driver = {
.shutdown = ocxlpmem_remove,
  };
  
+static int file_init(void)
+{
+   int rc;
+
+   mutex_init(&minors_idr_lock);
+   idr_init(&minors_idr);
+
+   rc = alloc_chrdev_region(&ocxlpmem_dev, 0, NUM_MINORS, "ocxl-pmem");
+   if (rc) {
+   idr_destroy(&minors_idr);
+   pr_err("Unable to allocate Ope

[PATCH v2 -next] powerpc/pmac/smp: drop unnecessary volatile qualifier

2020-03-03 Thread YueHaibing
There is no need to mark core99_l2_cache/core99_l3_cache as volatile,
so drop the qualifier.

Signed-off-by: YueHaibing 
---
v2: remove 'volatile' qualifier
---
 arch/powerpc/platforms/powermac/smp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powermac/smp.c 
b/arch/powerpc/platforms/powermac/smp.c
index f95fbde..69ad567 100644
--- a/arch/powerpc/platforms/powermac/smp.c
+++ b/arch/powerpc/platforms/powermac/smp.c
@@ -661,8 +661,8 @@ static void smp_core99_gpio_tb_freeze(int freeze)
 #endif /* !CONFIG_PPC64 */
 
 /* L2 and L3 cache settings to pass from CPU0 to CPU1 on G4 cpus */
-volatile static long int core99_l2_cache;
-volatile static long int core99_l3_cache;
+static long core99_l2_cache;
+static long core99_l3_cache;
 
 static void core99_init_caches(int cpu)
 {
-- 
2.7.4




Re: [PATCH v2 -next] powerpc/pmac/smp: drop unnecessary volatile qualifier

2020-03-03 Thread Christophe Leroy




On 03/03/2020 at 09:56, YueHaibing wrote:

There is no need to mark core99_l2_cache/core99_l3_cache as volatile,
so drop the qualifier.

Signed-off-by: YueHaibing 


Reviewed-by: Christophe Leroy 


---
v2: remove 'volatile' qualifier
---
  arch/powerpc/platforms/powermac/smp.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powermac/smp.c 
b/arch/powerpc/platforms/powermac/smp.c
index f95fbde..69ad567 100644
--- a/arch/powerpc/platforms/powermac/smp.c
+++ b/arch/powerpc/platforms/powermac/smp.c
@@ -661,8 +661,8 @@ static void smp_core99_gpio_tb_freeze(int freeze)
  #endif /* !CONFIG_PPC64 */
  
  /* L2 and L3 cache settings to pass from CPU0 to CPU1 on G4 cpus */

-volatile static long int core99_l2_cache;
-volatile static long int core99_l3_cache;
+static long core99_l2_cache;
+static long core99_l3_cache;
  
  static void core99_init_caches(int cpu)
  {



Re: [PATCH 2/6] powerpc: kvm: no need to check return value of debugfs_create functions

2020-03-03 Thread Greg Kroah-Hartman
On Tue, Mar 03, 2020 at 06:46:23PM +1100, Michael Ellerman wrote:
> Greg Kroah-Hartman  writes:
> > When calling debugfs functions, there is no need to ever check the
> > return value.  The function can work or not, but the code logic should
> > never do something different based on this.
> 
> Except it does need to do something different, if the file was created
> it needs to be removed in the remove path.
> 
> > diff --git a/arch/powerpc/kvm/timing.c b/arch/powerpc/kvm/timing.c
> > index bfe4f106cffc..8e4791c6f2af 100644
> > --- a/arch/powerpc/kvm/timing.c
> > +++ b/arch/powerpc/kvm/timing.c
> > @@ -207,19 +207,12 @@ static const struct file_operations 
> > kvmppc_exit_timing_fops = {
> >  void kvmppc_create_vcpu_debugfs(struct kvm_vcpu *vcpu, unsigned int id)
> >  {
> > static char dbg_fname[50];
> > -   struct dentry *debugfs_file;
> >  
> > snprintf(dbg_fname, sizeof(dbg_fname), "vm%u_vcpu%u_timing",
> >  current->pid, id);
> > -   debugfs_file = debugfs_create_file(dbg_fname, 0666,
> > -   kvm_debugfs_dir, vcpu,
> > -   &kvmppc_exit_timing_fops);
> > -
> > -   if (!debugfs_file) {
> > -   printk(KERN_ERR"%s: error creating debugfs file %s\n",
> > -   __func__, dbg_fname);
> > -   return;
> > -   }
> > +   debugfs_create_file(dbg_fname, 0666, kvm_debugfs_dir, vcpu,
> > +   &kvmppc_exit_timing_fops);
> > +
> >  
> > vcpu->arch.debugfs_exit_timing = debugfs_file;

Ugh, you are right, how did I miss that?  How is 0-day missing this?
It's been in my tree for a long time, odd.

> >  }
> 
> This doesn't build:
> 
> arch/powerpc/kvm/timing.c:217:35: error: 'debugfs_file' undeclared (first 
> use in this function); did you mean 'debugfs_file_put'?
> 
> We can't just drop the assignment, we need the dentry to do the removal:
> 
> void kvmppc_remove_vcpu_debugfs(struct kvm_vcpu *vcpu)
> {
>   if (vcpu->arch.debugfs_exit_timing) {
>   debugfs_remove(vcpu->arch.debugfs_exit_timing);
>   vcpu->arch.debugfs_exit_timing = NULL;
>   }
> }
> 
> 
> I squashed this in, which seems to work:
> 
> diff --git a/arch/powerpc/kvm/timing.c b/arch/powerpc/kvm/timing.c
> index 8e4791c6f2af..5b7a66f86bd5 100644
> --- a/arch/powerpc/kvm/timing.c
> +++ b/arch/powerpc/kvm/timing.c
> @@ -207,19 +207,19 @@ static const struct file_operations 
> kvmppc_exit_timing_fops = {
>  void kvmppc_create_vcpu_debugfs(struct kvm_vcpu *vcpu, unsigned int id)
>  {
> static char dbg_fname[50];
> +   struct dentry *debugfs_file;
>  
> snprintf(dbg_fname, sizeof(dbg_fname), "vm%u_vcpu%u_timing",
>  current->pid, id);
> -   debugfs_create_file(dbg_fname, 0666, kvm_debugfs_dir, vcpu,
> -   &kvmppc_exit_timing_fops);
> -
> +   debugfs_file = debugfs_create_file(dbg_fname, 0666, kvm_debugfs_dir,
> +  vcpu, &kvmppc_exit_timing_fops);
>  
> vcpu->arch.debugfs_exit_timing = debugfs_file;

That works, yes.

>  }
>  
>  void kvmppc_remove_vcpu_debugfs(struct kvm_vcpu *vcpu)
>  {
> -   if (vcpu->arch.debugfs_exit_timing) {
> +   if (!IS_ERR_OR_NULL(vcpu->arch.debugfs_exit_timing)) {
> debugfs_remove(vcpu->arch.debugfs_exit_timing);
> vcpu->arch.debugfs_exit_timing = NULL;
> }

No, this can just be:
debugfs_remove(vcpu->arch.debugfs_exit_timing);

No need to check anything, just call it and the debugfs code can handle
it just fine.

thanks,

greg k-h