Re: [PATCH 02/10] powerpc/xive: guest exploitation of the XIVE interrupt controller

2017-08-09 Thread Benjamin Herrenschmidt
On Thu, 2017-08-10 at 14:28 +1000, David Gibson wrote:
> 
> Also, will POWER9 always have doorbells?  In which case you could
> reduce it to 3 options.

The problem with doorbells on POWER9 guests is that they may have
to trap and be emulated by the hypervisor, since the guest threads
on P9 don't have to match the HW threads of the core.

Thus it's quite possible that using XIVE for IPIs is actually faster
than doorbells in that case.

Cheers,
Ben.



Re: [PATCH 02/10] powerpc/xive: guest exploitation of the XIVE interrupt controller

2017-08-09 Thread David Gibson
On Wed, Aug 09, 2017 at 10:48:48AM +0200, Cédric Le Goater wrote:
> On 08/09/2017 05:53 AM, David Gibson wrote:
> > On Tue, Aug 08, 2017 at 10:56:12AM +0200, Cédric Le Goater wrote:
> >> This is the framework for using XIVE in a PowerVM guest. The support
> >> is very similar to the native one in a much simpler form.
> >>
> >> Instead of OPAL calls, a set of hypervisor calls is used to configure
> >> the interrupt sources and the event/notification queues of the guest:
> >>
> >>  - H_INT_GET_SOURCE_INFO
> >>
> >>used to obtain the address of the MMIO page of the Event State
> >>Buffer (PQ bits) entry associated with the source.
> >>
> >>  - H_INT_SET_SOURCE_CONFIG
> >>
> >>assigns a source to a "target".
> >>
> >>  - H_INT_GET_SOURCE_CONFIG
> >>
> >>    determines which "target" and "priority" a source is assigned to
> >>
> >>  - H_INT_GET_QUEUE_INFO
> >>
> >>returns the address of the notification management page associated
> >>with the specified "target" and "priority".
> >>
> >>  - H_INT_SET_QUEUE_CONFIG
> >>
> >>sets or resets the event queue for a given "target" and "priority".
> >>It is also used to set the notification config associated with the
> >>queue, only unconditional notification for the moment.  Reset is
> >>performed with a queue size of 0 and queueing is disabled in that
> >>case.
> >>
> >>  - H_INT_GET_QUEUE_CONFIG
> >>
> >>returns the queue settings for a given "target" and "priority".
> >>
> >>  - H_INT_RESET
> >>
> >>resets all of the partition's interrupt exploitation structures to
> >>their initial state, losing all configuration set via the hcalls
> >>H_INT_SET_SOURCE_CONFIG and H_INT_SET_QUEUE_CONFIG.
> >>
> >>  - H_INT_SYNC
> >>
> >>    issues a synchronisation on a source to make sure all
> >>    notifications have reached their queue.
> >>
> >> As for XICS, the XIVE interface for the guest is described in the
> >> device tree under the interrupt controller node. A couple of new
> >> properties are specific to XIVE :
> >>
> >>  - "reg"
> >>
> >>    contains the base address and size of the thread interrupt
> >>    management areas (TIMA) for the user level and for the OS level.
> >>    Only the OS level is taken into account.
> >>
> >>  - "ibm,xive-eq-sizes"
> >>
> >>the size of the event queues.
> >>
> >>  - "ibm,xive-lisn-ranges"
> >>
> >>    the interrupt number ranges assigned to the guest. These are
> >>allocated using a simple bitmap.
> >>
> >> Tested with a QEMU XIVE model for pseries and with the Power
> >> hypervisor
> >>
> >> Signed-off-by: Cédric Le Goater 
> > 
> > I don't know XIVE well enough to review in detail, but I've made some
> > comments based on general knowledge.
> 
> Thanks for taking a look.

np

[snip]
> >> +/* Cause IPI as setup by the interrupt controller (xics or xive) */
> >> +static void (*ic_cause_ipi)(int cpu);
> >> +
> >>  static void smp_pseries_cause_ipi(int cpu)
> >>  {
> >> -  /* POWER9 should not use this handler */
> >>if (doorbell_try_core_ipi(cpu))
> >>return;
> >>  
> >> -  icp_ops->cause_ipi(cpu);
> >> +  ic_cause_ipi(cpu);
> > 
> > Wouldn't it make more sense to change smp_ops->cause_ipi, rather than
> > having a double indirection through smp_ops, then the ic_cause_ipi
> > global?
> 
> we need to retain the original setting of smp_ops->cause_ipi 
> somewhere as it can be set from xive_smp_probe() to:
> 
>   icp_ops->cause_ipi 
> 
> or from xics_smp_probe() to:
> 
>   xive_cause_ipi()
> 
> I am not sure we can do any better?

I don't see why not.  You basically have two binary options: xics vs
xive, and doorbells vs no doorbells.  At worst that gives you 4
different versions of ->cause_ipi, and you can work out the right one
in smp_probe().  If the number of combinations got too much larger you
might indeed need some more indirection.  With 4 there's a little code
duplication, but it's small enough that I think it's preferable to an
extra global and an extra indirection.

Also, will POWER9 always have doorbells?  In which case you could
reduce it to 3 options.
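
A rough sketch of that suggestion, with hypothetical function names
(only doorbell_try_core_ipi(), icp_ops->cause_ipi and xive_cause_ipi()
come from the thread above; this is not the actual patch):

static void smp_pseries_cause_ipi_xics_db(int cpu)
{
	if (doorbell_try_core_ipi(cpu))
		return;
	icp_ops->cause_ipi(cpu);
}

static void smp_pseries_cause_ipi_xive_db(int cpu)
{
	if (doorbell_try_core_ipi(cpu))
		return;
	xive_cause_ipi(cpu);
}

static void __init smp_probe(void)
{
	bool db = cpu_has_feature(CPU_FTR_DBELL);

	if (xive_enabled())
		smp_ops->cause_ipi = db ? smp_pseries_cause_ipi_xive_db
					: xive_cause_ipi;
	else
		smp_ops->cause_ipi = db ? smp_pseries_cause_ipi_xics_db
					: icp_ops->cause_ipi;
}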

[snip]
> >> +static void xive_spapr_update_pending(struct xive_cpu *xc)
> >> +{
> >> +  u8 nsr, cppr;
> >> +  u16 ack;
> >> +
> >> +  /* Perform the acknowledge hypervisor to register cycle */
> >> +  ack = be16_to_cpu(__raw_readw(xive_tima + TM_SPC_ACK_OS_REG));
> > 
> > Why do you need the raw_readw() + be16_to_cpu + mb, rather than one of
> > the higher level IO helpers?
> 
> This is one of the many ways to do MMIOs on the TIMA. This memory 
> region defines a set of offsets and sizes for which loads and 
> stores have different effects. 
> 
> See the arch/powerpc/include/asm/xive-regs.h file and 
> arch/powerpc/kvm/book3s_xive_template.c for some more usage.

Sure, much like any IO region.  My point is, why do you want this kind
of complex combo rather than, say, an in_be16() or readw_be()?
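
Spelled out, the two forms being compared look like this (sketch only;
in_be16() is the powerpc big-endian MMIO read helper, which performs
the byteswap and I/O ordering internally):

	/* open-coded: raw load + explicit byteswap + barrier */
	ack = be16_to_cpu(__raw_readw(xive_tima + TM_SPC_ACK_OS_REG));
	mb();

	/* higher-level helper doing the equivalent in one call */
	ack = in_be16(xive_tima + TM_SPC_ACK_OS_REG);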

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson

[PATCH V11 0/3] powernv : Add support for OPAL-OCC command/response interface

2017-08-09 Thread Shilpasri G Bhat
In P9, the OCC (On-Chip-Controller) supports a shared-memory based
command-response interface. Within the shared memory there is an OPAL
command buffer and an OCC response buffer that can be used to send
inband commands to the OCC. The following commands are supported:

1) Set system powercap
2) Set CPU-GPU power shifting ratio
3) Clear min/max for OCC sensor groups

Changes from V10:
- Rebased on powerpc-next
- Add sysfs interface instead of IOCTL
  (Skiboot patch for Patch3 is posted below:
  https://lists.ozlabs.org/pipermail/skiboot/2017-August/008553.html )

Shilpasri G Bhat (3):
  powernv: powercap: Add support for powercap framework
  powernv: Add support to set power-shifting-ratio
  powernv: Add support to clear sensor groups data

 .../ABI/testing/sysfs-firmware-opal-powercap   |  31 +++
 Documentation/ABI/testing/sysfs-firmware-opal-psr  |  18 ++
 .../bindings/powerpc/opal/sensor-groups.txt|  27 +++
 arch/powerpc/include/asm/opal-api.h|   6 +
 arch/powerpc/include/asm/opal.h|  10 +
 arch/powerpc/platforms/powernv/Makefile|   2 +-
 arch/powerpc/platforms/powernv/opal-powercap.c | 244 +
 arch/powerpc/platforms/powernv/opal-psr.c  | 175 +++
 .../powerpc/platforms/powernv/opal-sensor-groups.c | 212 ++
 arch/powerpc/platforms/powernv/opal-wrappers.S |   5 +
 arch/powerpc/platforms/powernv/opal.c  |  10 +
 11 files changed, 739 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/ABI/testing/sysfs-firmware-opal-powercap
 create mode 100644 Documentation/ABI/testing/sysfs-firmware-opal-psr
 create mode 100644 
Documentation/devicetree/bindings/powerpc/opal/sensor-groups.txt
 create mode 100644 arch/powerpc/platforms/powernv/opal-powercap.c
 create mode 100644 arch/powerpc/platforms/powernv/opal-psr.c
 create mode 100644 arch/powerpc/platforms/powernv/opal-sensor-groups.c

-- 
1.8.3.1



[PATCH V11 3/3] powernv: Add support to clear sensor groups data

2017-08-09 Thread Shilpasri G Bhat
Adds support for clearing different sensor groups. OCC inband sensor
groups like CSM, Profiler, Job Scheduler can be cleared using this
driver. The min/max of all sensors belonging to these sensor groups
will be cleared.

Signed-off-by: Shilpasri G Bhat 
---
 .../bindings/powerpc/opal/sensor-groups.txt|  27 +++
 arch/powerpc/include/asm/opal-api.h|   1 +
 arch/powerpc/include/asm/opal.h|   2 +
 arch/powerpc/platforms/powernv/Makefile|   2 +-
 .../powerpc/platforms/powernv/opal-sensor-groups.c | 212 +
 arch/powerpc/platforms/powernv/opal-wrappers.S |   1 +
 arch/powerpc/platforms/powernv/opal.c  |   3 +
 7 files changed, 247 insertions(+), 1 deletion(-)
 create mode 100644 
Documentation/devicetree/bindings/powerpc/opal/sensor-groups.txt
 create mode 100644 arch/powerpc/platforms/powernv/opal-sensor-groups.c

diff --git a/Documentation/devicetree/bindings/powerpc/opal/sensor-groups.txt 
b/Documentation/devicetree/bindings/powerpc/opal/sensor-groups.txt
new file mode 100644
index 000..6ad881c
--- /dev/null
+++ b/Documentation/devicetree/bindings/powerpc/opal/sensor-groups.txt
@@ -0,0 +1,27 @@
+IBM OPAL Sensor Groups Binding
+---
+
+Node: /ibm,opal/sensor-groups
+
+Description: Contains sensor groups available in the Powernv P9
+servers. Each child node indicates a sensor group.
+
+- compatible : Should be "ibm,opal-sensor-group"
+
+Each child node contains below properties:
+
+- type : String to indicate the type of sensor-group
+
+- sensor-group-id: Abstract unique identifier provided by firmware of
+  type <u32> which is used for sensor-group
+  operations like clearing the min/max history of all
+  sensors belonging to the group.
+
+- ibm,chip-id : Chip ID
+
+- sensors : Phandle array of child nodes of /ibm,opal/sensor/
+   belonging to this group
+
+- ops : Array of opal-call numbers indicating available operations on
+   sensor groups like clearing min/max, enabling/disabling sensor
+   group.
diff --git a/arch/powerpc/include/asm/opal-api.h 
b/arch/powerpc/include/asm/opal-api.h
index 0cb7d11..450a60b 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -198,6 +198,7 @@
 #define OPAL_SET_POWERCAP  153
 #define OPAL_GET_POWER_SHIFT_RATIO 154
 #define OPAL_SET_POWER_SHIFT_RATIO 155
+#define OPAL_SENSOR_GROUP_CLEAR156
 #define OPAL_PCI_SET_P2P   157
 #define OPAL_LAST  157
 
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index d87ffcb..97ff192 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -279,6 +279,7 @@ int64_t opal_imc_counters_init(uint32_t type, uint64_t 
address,
 int opal_set_powercap(u32 handle, int token, u32 pcap);
 int opal_get_power_shift_ratio(u32 handle, int token, u32 *psr);
 int opal_set_power_shift_ratio(u32 handle, int token, u32 psr);
+int opal_sensor_group_clear(u32 group_hndl, int token);
 
 /* Internal functions */
 extern int early_init_dt_scan_opal(unsigned long node, const char *uname,
@@ -359,6 +360,7 @@ static inline int opal_get_async_rc(struct opal_msg msg)
 
 void opal_powercap_init(void);
 void opal_psr_init(void);
+void opal_sensor_groups_init(void);
 
 #endif /* __ASSEMBLY__ */
 
diff --git a/arch/powerpc/platforms/powernv/Makefile 
b/arch/powerpc/platforms/powernv/Makefile
index 674ed1e..177b3d4 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -2,7 +2,7 @@ obj-y   += setup.o opal-wrappers.o opal.o 
opal-async.o idle.o
 obj-y  += opal-rtc.o opal-nvram.o opal-lpc.o opal-flash.o
 obj-y  += rng.o opal-elog.o opal-dump.o opal-sysparam.o 
opal-sensor.o
 obj-y  += opal-msglog.o opal-hmi.o opal-power.o opal-irqchip.o
-obj-y  += opal-kmsg.o opal-powercap.o opal-psr.o
+obj-y  += opal-kmsg.o opal-powercap.o opal-psr.o 
opal-sensor-groups.o
 
 obj-$(CONFIG_SMP)  += smp.o subcore.o subcore-asm.o
 obj-$(CONFIG_PCI)  += pci.o pci-ioda.o npu-dma.o
diff --git a/arch/powerpc/platforms/powernv/opal-sensor-groups.c 
b/arch/powerpc/platforms/powernv/opal-sensor-groups.c
new file mode 100644
index 000..7e5a235
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/opal-sensor-groups.c
@@ -0,0 +1,212 @@
+/*
+ * PowerNV OPAL Sensor-groups interface
+ *
+ * Copyright 2017 IBM Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#define pr_fmt(fmt) "opal-sensor-groups: " fmt
+
+#include 

[PATCH V11 1/3] powernv: powercap: Add support for powercap framework

2017-08-09 Thread Shilpasri G Bhat
Adds a generic powercap framework to change the system powercap
inband through the OPAL-OCC command/response interface.

Signed-off-by: Shilpasri G Bhat 
---
 .../ABI/testing/sysfs-firmware-opal-powercap   |  31 +++
 arch/powerpc/include/asm/opal-api.h|   3 +
 arch/powerpc/include/asm/opal.h|   5 +
 arch/powerpc/platforms/powernv/Makefile|   2 +-
 arch/powerpc/platforms/powernv/opal-powercap.c | 244 +
 arch/powerpc/platforms/powernv/opal-wrappers.S |   2 +
 arch/powerpc/platforms/powernv/opal.c  |   4 +
 7 files changed, 290 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/ABI/testing/sysfs-firmware-opal-powercap
 create mode 100644 arch/powerpc/platforms/powernv/opal-powercap.c

diff --git a/Documentation/ABI/testing/sysfs-firmware-opal-powercap 
b/Documentation/ABI/testing/sysfs-firmware-opal-powercap
new file mode 100644
index 000..c9b66ec
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-firmware-opal-powercap
@@ -0,0 +1,31 @@
+What:  /sys/firmware/opal/powercap
+Date:  August 2017
+Contact:   Linux for PowerPC mailing list 
+Description:   Powercap directory for Powernv (P8, P9) servers
+
+   Each folder in this directory contains a
+   power-cappable component.
+
+What:  /sys/firmware/opal/powercap/system-powercap
+   /sys/firmware/opal/powercap/system-powercap/powercap-min
+   /sys/firmware/opal/powercap/system-powercap/powercap-max
+   /sys/firmware/opal/powercap/system-powercap/powercap-current
+Date:  August 2017
+Contact:   Linux for PowerPC mailing list 
+Description:   System powercap directory and attributes applicable for
+   Powernv (P8, P9) servers
+
+   This directory provides powercap information. It
+   contains below sysfs attributes:
+
+   - powercap-min : This file provides the minimum
+ possible powercap in Watt units
+
+   - powercap-max : This file provides the maximum
+ possible powercap in Watt units
+
+   - powercap-current : This file provides the current
+ powercap set on the system. Writing to this file
+ creates a request for setting a new-powercap. The
+ powercap requested must be between powercap-min
+ and powercap-max.
diff --git a/arch/powerpc/include/asm/opal-api.h 
b/arch/powerpc/include/asm/opal-api.h
index ced1ef2..b87305b 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -42,6 +42,7 @@
 #define OPAL_I2C_STOP_ERR  -24
 #define OPAL_XIVE_PROVISIONING -31
 #define OPAL_XIVE_FREE_ACTIVE  -32
+#define OPAL_TIMEOUT   -33
 
 /* API Tokens (in r0) */
 #define OPAL_INVALID_CALL -1
@@ -193,6 +194,8 @@
 #define OPAL_IMC_COUNTERS_INIT 149
 #define OPAL_IMC_COUNTERS_START150
 #define OPAL_IMC_COUNTERS_STOP 151
+#define OPAL_GET_POWERCAP  152
+#define OPAL_SET_POWERCAP  153
 #define OPAL_PCI_SET_P2P   157
 #define OPAL_LAST  157
 
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 5a715e6..6f09ab7 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -275,6 +275,9 @@ int64_t opal_imc_counters_init(uint32_t type, uint64_t 
address,
 int64_t opal_imc_counters_start(uint32_t type, uint64_t cpu_pir);
 int64_t opal_imc_counters_stop(uint32_t type, uint64_t cpu_pir);
 
+int opal_get_powercap(u32 handle, int token, u32 *pcap);
+int opal_set_powercap(u32 handle, int token, u32 pcap);
+
 /* Internal functions */
 extern int early_init_dt_scan_opal(unsigned long node, const char *uname,
   int depth, void *data);
@@ -352,6 +355,8 @@ static inline int opal_get_async_rc(struct opal_msg msg)
 
 void opal_wake_poller(void);
 
+void opal_powercap_init(void);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_POWERPC_OPAL_H */
diff --git a/arch/powerpc/platforms/powernv/Makefile 
b/arch/powerpc/platforms/powernv/Makefile
index a0d4353..f9ec36d 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -2,7 +2,7 @@ obj-y   += setup.o opal-wrappers.o opal.o 
opal-async.o idle.o
 obj-y  += opal-rtc.o opal-nvram.o opal-lpc.o opal-flash.o
 obj-y  += rng.o opal-elog.o opal-dump.o opal-sysparam.o 
opal-sensor.o
 obj-y  += opal-msglog.o opal-hmi.o opal-power.o opal-irqchip.o
-obj-y  += opal-kmsg.o
+obj-y  += opal-kmsg.o opal-powercap.o
 
 obj-$(CONFIG_SMP)  += smp.o subcore.o subcore-asm.o
 
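
As a usage illustration only (not part of the patch), a userspace
program could request a new powercap through the sysfs attribute
documented above; the 2000 W value is an arbitrary example and must lie
between powercap-min and powercap-max:

#include <stdio.h>

int main(void)
{
	const char *path =
		"/sys/firmware/opal/powercap/system-powercap/powercap-current";
	FILE *f = fopen(path, "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	/* Request a 2000 W system powercap; the kernel forwards
	 * the request to the OCC. */
	fprintf(f, "2000\n");
	return fclose(f) ? 1 : 0;
}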

[PATCH V11 2/3] powernv: Add support to set power-shifting-ratio

2017-08-09 Thread Shilpasri G Bhat
This patch adds support to set power-shifting-ratio which hints the
firmware how to distribute/throttle power between different entities
in a system (e.g. CPU vs GPU). This ratio is used by the OCC for its
power-capping algorithm.

Signed-off-by: Shilpasri G Bhat 
---
 Documentation/ABI/testing/sysfs-firmware-opal-psr |  18 +++
 arch/powerpc/include/asm/opal-api.h   |   2 +
 arch/powerpc/include/asm/opal.h   |   3 +
 arch/powerpc/platforms/powernv/Makefile   |   2 +-
 arch/powerpc/platforms/powernv/opal-psr.c | 175 ++
 arch/powerpc/platforms/powernv/opal-wrappers.S|   2 +
 arch/powerpc/platforms/powernv/opal.c |   3 +
 7 files changed, 204 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/ABI/testing/sysfs-firmware-opal-psr
 create mode 100644 arch/powerpc/platforms/powernv/opal-psr.c

diff --git a/Documentation/ABI/testing/sysfs-firmware-opal-psr 
b/Documentation/ABI/testing/sysfs-firmware-opal-psr
new file mode 100644
index 000..cc2ece7
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-firmware-opal-psr
@@ -0,0 +1,18 @@
+What:  /sys/firmware/opal/psr
+Date:  August 2017
+Contact:   Linux for PowerPC mailing list 
+Description:   Power-Shift-Ratio directory for Powernv P9 servers
+
+   Power-Shift-Ratio allows to provide hints the firmware
+   to shift/throttle power between different entities in
+   the system. Each attribute in this directory indicates
+   a settable PSR.
+
+What:  /sys/firmware/opal/psr/cpu_to_gpu_X
+Date:  August 2017
+Contact:   Linux for PowerPC mailing list 
+Description:   PSR sysfs attributes for Powernv P9 servers
+
+   Power-Shift-Ratio between CPU and GPU for a given chip
+   with chip-id X. This file gives the ratio (0-100)
+   which is used by OCC for power-capping.
diff --git a/arch/powerpc/include/asm/opal-api.h 
b/arch/powerpc/include/asm/opal-api.h
index b87305b..0cb7d11 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -196,6 +196,8 @@
 #define OPAL_IMC_COUNTERS_STOP 151
 #define OPAL_GET_POWERCAP  152
 #define OPAL_SET_POWERCAP  153
+#define OPAL_GET_POWER_SHIFT_RATIO 154
+#define OPAL_SET_POWER_SHIFT_RATIO 155
 #define OPAL_PCI_SET_P2P   157
 #define OPAL_LAST  157
 
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 6f09ab7..d87ffcb 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -277,6 +277,8 @@ int64_t opal_imc_counters_init(uint32_t type, uint64_t 
address,
 
 int opal_get_powercap(u32 handle, int token, u32 *pcap);
 int opal_set_powercap(u32 handle, int token, u32 pcap);
+int opal_get_power_shift_ratio(u32 handle, int token, u32 *psr);
+int opal_set_power_shift_ratio(u32 handle, int token, u32 psr);
 
 /* Internal functions */
 extern int early_init_dt_scan_opal(unsigned long node, const char *uname,
@@ -356,6 +358,7 @@ static inline int opal_get_async_rc(struct opal_msg msg)
 void opal_wake_poller(void);
 
 void opal_powercap_init(void);
+void opal_psr_init(void);
 
 #endif /* __ASSEMBLY__ */
 
diff --git a/arch/powerpc/platforms/powernv/Makefile 
b/arch/powerpc/platforms/powernv/Makefile
index f9ec36d..674ed1e 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -2,7 +2,7 @@ obj-y   += setup.o opal-wrappers.o opal.o 
opal-async.o idle.o
 obj-y  += opal-rtc.o opal-nvram.o opal-lpc.o opal-flash.o
 obj-y  += rng.o opal-elog.o opal-dump.o opal-sysparam.o 
opal-sensor.o
 obj-y  += opal-msglog.o opal-hmi.o opal-power.o opal-irqchip.o
-obj-y  += opal-kmsg.o opal-powercap.o
+obj-y  += opal-kmsg.o opal-powercap.o opal-psr.o
 
 obj-$(CONFIG_SMP)  += smp.o subcore.o subcore-asm.o
 obj-$(CONFIG_PCI)  += pci.o pci-ioda.o npu-dma.o
diff --git a/arch/powerpc/platforms/powernv/opal-psr.c 
b/arch/powerpc/platforms/powernv/opal-psr.c
new file mode 100644
index 000..7313b7f
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/opal-psr.c
@@ -0,0 +1,175 @@
+/*
+ * PowerNV OPAL Power-Shift-Ratio interface
+ *
+ * Copyright 2017 IBM Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#define pr_fmt(fmt) "opal-psr: " fmt
+
+#include 
+#include 
+#include 
+
+#include 
+
+DEFINE_MUTEX(psr_mutex);
+
+static struct kobject *psr_kobj;
+
+struct psr_attr {
+   u32 handle;
+   
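
As a usage illustration only (not part of the patch), reading the
CPU-to-GPU power-shifting ratio through the sysfs attribute documented
above could look like this; the chip-id suffix 0 is an arbitrary
example:

#include <stdio.h>

int main(void)
{
	unsigned int ratio;
	FILE *f = fopen("/sys/firmware/opal/psr/cpu_to_gpu_0", "r");

	if (!f) {
		perror("fopen");
		return 1;
	}
	if (fscanf(f, "%u", &ratio) == 1)
		printf("CPU-to-GPU power-shift ratio: %u\n", ratio);
	fclose(f);
	return 0;
}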

Re: [PATCH] mtd: nand: Rename nand.h into rawnand.h

2017-08-09 Thread Shawn Guo
On Fri, Aug 04, 2017 at 05:29:10PM +0200, Boris Brezillon wrote:
> We are planning to share more code between different NAND based
> devices (SPI NAND, OneNAND and raw NANDs), but before doing that
> we need to move the existing include/linux/mtd/nand.h file into
> include/linux/mtd/rawnand.h so we can later create a nand.h header
> containing all common structure and function prototypes.
> 
> Signed-off-by: Boris Brezillon 
> Signed-off-by: Peter Pan 
...
>  arch/arm/mach-imx/mach-qong.c   | 2 +-

Acked-by: Shawn Guo 


Re: [PATCH 08/10] powerpc/xive: take into account '/ibm,plat-res-int-priorities'

2017-08-09 Thread David Gibson
On Wed, Aug 09, 2017 at 09:14:49AM +0200, Cédric Le Goater wrote:
> On 08/09/2017 06:02 AM, David Gibson wrote:
> > On Tue, Aug 08, 2017 at 10:56:18AM +0200, Cédric Le Goater wrote:
> >> '/ibm,plat-res-int-priorities' contains a list of priorities that the
> >> hypervisor has reserved for its own use. Scan these ranges to choose
> >> the lowest unused priority for the xive spapr backend.
> >>
> >> Signed-off-by: Cédric Le Goater 
> >> ---
> >>  arch/powerpc/sysdev/xive/spapr.c | 62 
> >> +++-
> >>  1 file changed, 61 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/arch/powerpc/sysdev/xive/spapr.c 
> >> b/arch/powerpc/sysdev/xive/spapr.c
> >> index 7fc40047c23d..220331986bd8 100644
> >> --- a/arch/powerpc/sysdev/xive/spapr.c
> >> +++ b/arch/powerpc/sysdev/xive/spapr.c
> >> @@ -532,13 +532,70 @@ static const struct xive_ops xive_spapr_ops = {
> >>.name   = "spapr",
> >>  };
> >>  
> >> +/*
> >> + * get max priority from "/ibm,plat-res-int-priorities"
> >> + */
> >> +static bool xive_get_max_prio(u8 *max_prio)
> >> +{
> >> +  struct device_node *rootdn;
> >> +  const __be32 *reg;
> >> +  u32 len;
> >> +  int prio, found;
> >> +
> >> +  rootdn = of_find_node_by_path("/");
> >> +  if (!rootdn) {
> >> +  pr_err("not root node found !\n");
> >> +  return false;
> >> +  }
> >> +
> >> +  reg = of_get_property(rootdn, "ibm,plat-res-int-priorities", &len);
> >> +  if (!reg) {
> >> +  pr_err("Failed to read 'ibm,plat-res-int-priorities' 
> >> property\n");
> >> +  return false;
> >> +  }
> >> +
> >> +  if (len % (2 * sizeof(u32)) != 0) {
> >> +  pr_err("invalid 'ibm,plat-res-int-priorities' property\n");
> >> +  return false;
> >> +  }
> >> +
> >> +  /* HW supports priorities in the range [0-7] and 0xFF is a
> >> +   * wildcard priority used to mask. We scan the ranges reserved
> >> +   * by the hypervisor to find the lowest priority we can use.
> >> +   */
> >> +  found = 0xFF;
> >> +  for (prio = 0; prio < 8; prio++) {
> >> +  int reserved = 0;
> >> +  int i;
> >> +
> >> +  for (i = 0; i < len / (2 * sizeof(u32)); i++) {
> >> +  int base  = be32_to_cpu(reg[2 * i]);
> >> +  int range = be32_to_cpu(reg[2 * i + 1]);
> >> +
> >> +  if (prio >= base && prio < base + range)
> >> +  reserved++;
> >> +  }
> >> +
> >> +  if (!reserved)
> >> +  found = prio;
> > 
> > So you continue the loop here, rather than using break.  Which means
> > found will be the highest valued priority that's not reserved.  Is
> > that what you intended?  The commit message says you find the lowest
> > unused, but do lower numbers mean higher priorities or the other way around?
> 
> yes. I should probably add a statement on how the priorities are 
> ordered: the most privileged is the lowest value.

Ok.  My inclination would be to reverse the order of the loop, and
break; on the first (==lowest priority) unused entry.  But you could
fairly argue that's premature optimization.
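
A sketch of the reversed loop (is_prio_reserved() is a hypothetical
helper wrapping the inner range scan from the patch); scanning from the
highest value, i.e. the lowest priority given that the most privileged
priority is the lowest value, allows a break on the first unreserved
entry:

	for (prio = 7; prio >= 0; prio--) {
		if (!is_prio_reserved(reg, len, prio)) {
			*max_prio = prio;
			return true;
		}
	}
	return false;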

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH 05/16] mm: Protect VMA modifications using VMA sequence count

2017-08-09 Thread Kirill A. Shutemov
On Wed, Aug 09, 2017 at 12:43:33PM +0200, Laurent Dufour wrote:
> On 09/08/2017 12:12, Kirill A. Shutemov wrote:
> > On Tue, Aug 08, 2017 at 04:35:38PM +0200, Laurent Dufour wrote:
> >> The VMA sequence count has been introduced to allow fast detection of
> >> VMA modification when running a page fault handler without holding
> >> the mmap_sem.
> >>
> >> This patch provides protection against the VMA modifications done in:
> >>- madvise()
> >>- mremap()
> >>- mpol_rebind_policy()
> >>- vma_replace_policy()
> >>- change_prot_numa()
> >>- mlock(), munlock()
> >>- mprotect()
> >>- mmap_region()
> >>- collapse_huge_page()
> > 
> > I don't thinks it's anywhere near complete list of places where we touch
> > vm_flags. What is your plan for the rest?
> 
> The goal is only to protect places where change to the VMA is impacting the
> page fault handling. If you think I missed one, please advise.

That's a very fragile approach. We rely here too much on specific
compiler behaviour.

Any write access to vm_flags can, in theory, be translated into several
write accesses, for instance with vm_flags set to 0 in the middle, which
would result in a segfault on a page fault to the vma.

Nothing (apart from common sense) prevents compiler from generating this
kind of pattern.
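
An illustration of the tearing concern, and the usual way to forbid it
(sketch, not from the patch):

	/* A plain read-modify-write like this may legally be compiled
	 * as several stores, transiently exposing vm_flags == 0 to a
	 * lockless page fault handler: */
	vma->vm_flags |= VM_LOCKED;

	/* WRITE_ONCE() prevents the compiler from tearing the store: */
	WRITE_ONCE(vma->vm_flags, vma->vm_flags | VM_LOCKED);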

-- 
 Kirill A. Shutemov


Re: [v6 1/2] raid6/altivec: Add vpermxor implementation for raid6 Q syndrome

2017-08-09 Thread Matt Brown
On Wed, Aug 9, 2017 at 11:26 PM, Michael Ellerman  wrote:
> Matt Brown  writes:
>
>> This patch uses the vpermxor instruction to optimise the raid6 Q syndrome.
>> This instruction was made available with POWER8, ISA version 2.07.
>> It allows for both vperm and vxor instructions to be done in a single
>> instruction. This has been tested for correctness on a ppc64le vm with a
>> basic RAID6 setup containing 5 drives.
>>
>> The performance benchmarks are from the raid6test in the /lib/raid6/test
>> directory. These results are from an IBM Firestone machine with ppc64le
>> architecture. The benchmark results show a 35% speed increase over the best
>> existing algorithm for powerpc (altivec). The raid6test has also been run
>> on a big-endian ppc64 vm to ensure it also works for big-endian
>> architectures.
>>
>> Performance benchmarks:
>>   raid6: altivecx4 gen() 18773 MB/s
>>   raid6: altivecx8 gen() 19438 MB/s
>>
>>   raid6: vpermxor4 gen() 25112 MB/s
>>   raid6: vpermxor8 gen() 26279 MB/s
>>
>> Signed-off-by: Matt Brown 
>> Reviewed-by: Daniel Axtens 
>> ---
>> v6:
>>   - added vpermxor files to .gitignore
>>   - fixup whitespace
>>   - added vpermxor objs to test/Makefile
>> v5:
>>   - moved altivec.uc fix into other patch in series
>> ---
>>  include/linux/raid/pq.h |   4 ++
>>  lib/raid6/.gitignore|   1 +
>>  lib/raid6/Makefile  |  27 -
>>  lib/raid6/algos.c   |   4 ++
>>  lib/raid6/test/Makefile |  17 +++-
>>  lib/raid6/vpermxor.uc   | 104 
>> 
>>  6 files changed, 154 insertions(+), 3 deletions(-)
>>  create mode 100644 lib/raid6/vpermxor.uc
>
> This version at least is not Cc'ed to any of the folks that
> get_maintainers.pl identifies for these files:
>
> $ ./scripts/get_maintainer.pl -f lib/raid6
> s...@fb.com
> gayatri.kamm...@intel.com
> fenghua...@intel.com
> megha@linux.intel.com
> schwidef...@de.ibm.com
> anup.pa...@broadcom.com
> linux-ker...@vger.kernel.org
>
>
> This seems like mostly a list of random folks who've touched this code,
> but maybe some of them would have comments?
>

Ah my bad. I've CC'ed them into this email chain.
Apologies for not including you guys in the original email.
Here is a link to the patchworks patch:
http://patchwork.ozlabs.org/patch/797576/

Thanks,
Matt Brown


Re: [PATCH] soc: Convert to using %pOF instead of full_name

2017-08-09 Thread Rob Herring
On Tue, Jul 18, 2017 at 4:43 PM, Rob Herring  wrote:
> Now that we have a custom printf format specifier, convert users of
> full_name to use %pOF instead. This is preparation to remove storing
> of the full path string for each node.
>
> Signed-off-by: Rob Herring 
> Cc: Scott Wood 
> Cc: Qiang Zhao 
> Cc: Matthias Brugger 
> Cc: Simon Horman 
> Cc: Magnus Damm 
> Cc: Kukjin Kim 
> Cc: Krzysztof Kozlowski 
> Cc: Javier Martinez Canillas 
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-media...@lists.infradead.org
> Cc: linux-renesas-...@vger.kernel.org
> Cc: linux-samsung-...@vger.kernel.org
> ---
>  drivers/soc/fsl/qbman/bman_ccsr.c| 10 +-
>  drivers/soc/fsl/qbman/bman_portal.c  |  8 +++-
>  drivers/soc/fsl/qbman/qman_ccsr.c| 12 ++--
>  drivers/soc/fsl/qbman/qman_portal.c  | 11 ---
>  drivers/soc/fsl/qe/gpio.c|  4 ++--
>  drivers/soc/mediatek/mtk-pmic-wrap.c |  4 ++--
>  drivers/soc/renesas/rcar-rst.c   |  4 ++--
>  drivers/soc/renesas/rcar-sysc.c  |  6 +++---
>  drivers/soc/samsung/pm_domains.c |  8 
>  9 files changed, 31 insertions(+), 36 deletions(-)

Arnd, Olof,

Can you please apply this one.

Rob
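
For context, the conversion pattern the series applies looks like this
sketch (report_node() is a hypothetical caller):

static void report_node(struct device_node *np)
{
	/* before: relies on the stored full path string */
	pr_info("probing %s\n", np->full_name);

	/* after: %pOF lets printk derive the path from the node itself */
	pr_info("probing %pOF\n", np);
}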


Re: [PATCH] PCI: Convert to using %pOF instead of full_name

2017-08-09 Thread Rob Herring
On Wed, Aug 2, 2017 at 5:39 PM, Bjorn Helgaas  wrote:
> On Tue, Jul 18, 2017 at 04:43:21PM -0500, Rob Herring wrote:
>> Now that we have a custom printf format specifier, convert users of
>> full_name to use %pOF instead. This is preparation to remove storing
>> of the full path string for each node.
>>
>> Signed-off-by: Rob Herring 
>> Cc: Thomas Petazzoni 
>> Cc: Jason Cooper 
>> Cc: Bjorn Helgaas 
>> Cc: Thierry Reding 
>> Cc: Jonathan Hunter 
>> Cc: Benjamin Herrenschmidt 
>> Cc: Paul Mackerras 
>> Cc: Michael Ellerman 
>> Cc: linux-...@vger.kernel.org
>> Cc: linux-arm-ker...@lists.infradead.org
>> Cc: linux-te...@vger.kernel.org
>> Cc: linuxppc-dev@lists.ozlabs.org
>
> Applied to pci/misc for v4.14, thanks!

This hasn't shown up in -next.

Rob


Re: [PATCH] mtd: nand: Rename nand.h into rawnand.h

2017-08-09 Thread Tony Lindgren
* Boris Brezillon  [170804 08:30]:
> We are planning to share more code between different NAND based
> devices (SPI NAND, OneNAND and raw NANDs), but before doing that
> we need to move the existing include/linux/mtd/nand.h file into
> include/linux/mtd/rawnand.h so we can later create a nand.h header
> containing all common structure and function prototypes.

For omap:

Acked-by: Tony Lindgren 


[PATCH v7 0/9] Application Data Integrity feature introduced by SPARC M7

2017-08-09 Thread Khalid Aziz
SPARC M7 processor adds additional metadata for memory address space
that can be used to secure access to regions of memory. This additional
metadata is implemented as a 4-bit tag attached to each cacheline size
block of memory. A task can set a tag on any number of such blocks.
Access to such block is granted only if the virtual address used to
access that block of memory has the tag encoded in the uppermost 4 bits
of VA. Since the sparc processor does not implement all 64 bits of VA,
the top 4 bits are available for ADI tags. Any mismatch between tag
encoded in VA
and tag set on the memory block results in a trap. Tags are verified in
the VA presented to the MMU and tags are associated with the physical
page VA maps on to. If a memory page is swapped out and page frame gets
reused for another task, the tags are lost and hence must be saved when
swapping or migrating the page.

A userspace task enables ADI through mprotect(). This patch series adds
a page protection bit PROT_ADI and a corresponding VMA flag
VM_SPARC_ADI. VM_SPARC_ADI is used to trigger setting TTE.mcd bit in the
sparc pte that enables ADI checking on the corresponding page. MMU
validates the tag embedded in VA for every page that has TTE.mcd bit set
in its pte. After enabling ADI on a memory range, the userspace task can
set ADI version tags using stxa instruction with ASI_MCD_PRIMARY or
ASI_MCD_ST_BLKINIT_PRIMARY ASI.
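
For illustration only, enabling ADI from userspace per the description
above might look like the sketch below; the PROT_ADI value is a
placeholder for whatever the series defines, and the stxa tag stores
(ASI_MCD_PRIMARY) are sparc-specific and omitted:

#include <stdio.h>
#include <sys/mman.h>

#ifndef PROT_ADI
#define PROT_ADI 0x10	/* placeholder value for this sketch */
#endif

void enable_adi(void *buf, size_t len)
{
	if (mprotect(buf, len, PROT_READ | PROT_WRITE | PROT_ADI) != 0)
		perror("mprotect(PROT_ADI)");	/* e.g. not backed by RAM */
}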

Once userspace task calls mprotect() with PROT_ADI, kernel takes
following overall steps:

1. Find the VMAs covering the address range passed in to mprotect and
set VM_SPARC_ADI flag. If address range covers a subset of a VMA, the
VMA will be split.

2. When a page is allocated for a VA and the VMA covering this VA has
VM_SPARC_ADI flag set, set the TTE.mcd bit so MMU will check the
version tag.

3. Userspace can now set version tags on the memory it has enabled ADI
on. Userspace accesses ADI enabled memory using a virtual address that
has the version tag embedded in the high bits. MMU validates this
version tag against the actual tag set on the memory. If tag matches,
MMU performs the VA->PA translation and access is granted. If there is a
mismatch, hypervisor sends a data access exception or precise memory
corruption detected exception depending upon whether precise exceptions
are enabled or not (controlled by MCDPERR register). Kernel sends
SIGSEGV to the task with appropriate si_code.

4. If a page is being swapped out or migrated, kernel must save any ADI
tags set on the page. Kernel maintains a page worth of tag storage
descriptors. Each descriptor points to a tag storage space and the
address range it covers. If the page being swapped out or migrated has
ADI enabled on it, kernel finds a tag storage descriptor that covers the
address range for the page or allocates a new descriptor if none of the
existing descriptors cover the address range. Kernel saves tags from the
page into the tag storage space the descriptor points to.
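
A hypothetical sketch of what such a descriptor might hold (field names
are illustrative, not taken from the series):

struct tag_storage_desc {
	unsigned long start;	/* first VA covered by this descriptor */
	unsigned long end;	/* last VA covered */
	unsigned char *tags;	/* saved 4-bit ADI tags, two per byte */
};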

5. When the page is swapped back in or reinstantiated after migration,
kernel restores the version tags on the new physical page by retrieving
the original tag from tag storage pointed to by a tag storage descriptor
for the virtual address range for new page.

User task can disable ADI by calling mprotect() again on the memory
range with PROT_ADI bit unset. Kernel clears the VM_SPARC_ADI flag in
VMAs, merges adjacent VMAs if necessary, and clears TTE.mcd bit in the
corresponding ptes.

IOMMU does not support ADI checking. Any version tags embedded in the
top bits of VA meant for IOMMU, are cleared and replaced with sign
extension of the first non-version tag bit (bit 59 for SPARC M7) for
IOMMU addresses.

This patch series adds support for this feature in 9 patches:

Patch 1/9
  Tag mismatch on access by a task results in a trap from hypervisor as
data access exception or a precise memory corruption detected
  exception. As part of handling these exceptions, kernel sends a
  SIGSEGV to user process with special si_code to indicate which fault
  occurred. This patch adds three new si_codes to differentiate between
  various mismatch errors.

Patch 2/9
  When a page is swapped or migrated, metadata associated with the page
  must be saved so it can be restored later. This patch adds a new
  function that saves/restores this metadata when updating pte upon a
  swap/migration.

Patch 3/9
  SPARC M7 processor adds new fields to control registers to support ADI
  feature. It also adds a new exception for precise traps on tag
  mismatch. This patch adds definitions for the new control register
  fields, new ASIs for ADI and an exception handler for the precise trap
  on tag mismatch.

Patch 4/9
  New hypervisor fault types were added by sparc M7 processor to support
  ADI feature. This patch adds code to handle these fault types for data
  access exception handler.

Patch 5/9
  When ADI is in use for a page and a tag mismatch occurs, processor
  raises "Memory corruption Detected" trap. This patch adds a 

[PATCH v7 7/9] mm: Add address parameter to arch_validate_prot()

2017-08-09 Thread Khalid Aziz
A protection flag may not be valid across entire address space and
hence arch_validate_prot() might need the address a protection bit is
being set on to ensure it is a valid protection flag. For example, sparc
processors support memory corruption detection (as part of ADI feature)
flag on memory addresses mapped on to physical RAM but not on PFN mapped
pages or addresses mapped on to devices. This patch adds address to the
parameters being passed to arch_validate_prot() so protection bits can
be validated in the relevant context.

Signed-off-by: Khalid Aziz 
Cc: Khalid Aziz 
---
v7:
- new patch

 arch/powerpc/include/asm/mman.h | 2 +-
 arch/powerpc/kernel/syscalls.c  | 2 +-
 include/linux/mman.h| 2 +-
 mm/mprotect.c   | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/mman.h b/arch/powerpc/include/asm/mman.h
index 30922f699341..bc74074304a2 100644
--- a/arch/powerpc/include/asm/mman.h
+++ b/arch/powerpc/include/asm/mman.h
@@ -40,7 +40,7 @@ static inline bool arch_validate_prot(unsigned long prot)
return false;
return true;
 }
-#define arch_validate_prot(prot) arch_validate_prot(prot)
+#define arch_validate_prot(prot, addr) arch_validate_prot(prot)
 
 #endif /* CONFIG_PPC64 */
 #endif /* _ASM_POWERPC_MMAN_H */
diff --git a/arch/powerpc/kernel/syscalls.c b/arch/powerpc/kernel/syscalls.c
index a877bf8269fe..6d90ddbd2d11 100644
--- a/arch/powerpc/kernel/syscalls.c
+++ b/arch/powerpc/kernel/syscalls.c
@@ -48,7 +48,7 @@ static inline long do_mmap2(unsigned long addr, size_t len,
 {
long ret = -EINVAL;
 
-   if (!arch_validate_prot(prot))
+   if (!arch_validate_prot(prot, addr))
goto out;
 
if (shift) {
diff --git a/include/linux/mman.h b/include/linux/mman.h
index 634c4c51fe3a..1693d95a88ee 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -49,7 +49,7 @@ static inline void vm_unacct_memory(long pages)
  *
  * Returns true if the prot flags are valid
  */
-static inline bool arch_validate_prot(unsigned long prot)
+static inline bool arch_validate_prot(unsigned long prot, unsigned long addr)
 {
return (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM)) == 0;
 }
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 8edd0d576254..beac2dfbb5fa 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -396,7 +396,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
end = start + len;
if (end <= start)
return -ENOMEM;
-   if (!arch_validate_prot(prot))
+   if (!arch_validate_prot(prot, start))
return -EINVAL;
 
reqprot = prot;
-- 
2.11.0



Re: [RFC Part1 PATCH v3 02/17] x86/CPU/AMD: Add the Secure Encrypted Virtualization CPU feature

2017-08-09 Thread Tom Lendacky

On 7/25/2017 10:33 AM, Borislav Petkov wrote:
> On Tue, Jul 25, 2017 at 10:29:40AM -0500, Tom Lendacky wrote:
>> But early_identify_cpu() calls get_cpu_cap() which will check for cpuid
>> leaf 0x80000008 support and set x86_phys_bits.
>
> Right, but it can't be less than 32, can it? And if it is more than 32
> bits, then it probably doesn't really matter on 32-bit. Unless it is
> less than 36 bits and you do PAE...
>
>> I'll try to build and run a 32-bit kernel and see how this all flows.
>
> Yeah, that would be good.

Ok, finally got around to running a 32-bit kernel and it reports
x86_phys_bits as 48.

Thanks,
Tom

> Thanks.

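
For reference, x86_phys_bits comes from that CPUID leaf; a small
userspace sketch using GCC's cpuid.h shows the same values the kernel
reads (illustrative only):

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	/* leaf 0x80000008 is only valid if the CPU reports it */
	if (__get_cpuid_max(0x80000000, NULL) < 0x80000008)
		return 1;

	__cpuid(0x80000008, eax, ebx, ecx, edx);
	printf("physical address bits: %u\n", eax & 0xff);
	printf("virtual address bits:  %u\n", (eax >> 8) & 0xff);
	return 0;
}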


Re: [PATCH] ibmvnic: Fix unused variable warning

2017-08-09 Thread Tyrel Datwyler
On 08/09/2017 04:16 AM, Michal Suchanek wrote:
> Fixes: a248878d7a1d ("ibmvnic: Check for transport event on driver resume")
> 
> Signed-off-by: Michal Suchanek 
> ---

Reviewed-by: Tyrel Datwyler 



Re: [PATCH 03/10] powerpc/xive: rename xive_poke_esb in xive_esb_read

2017-08-09 Thread David Gibson
On Wed, Aug 09, 2017 at 09:12:29AM +0200, Cédric Le Goater wrote:
> On 08/09/2017 05:55 AM, David Gibson wrote:
> > On Tue, Aug 08, 2017 at 10:56:13AM +0200, Cédric Le Goater wrote:
> >> xive_poke_esb() is performing a load/read so it is better named as
> >> xive_esb_read().
> > 
> > Uh, patch seems to mismatch the comment here, calling it
> > xive_peek_esb() instead.
> 
> uh yes. oops.
> 
> > Does reading the ESB had a side effect on the hardware?  If so "poke"
> > might be an appropriate name.
> 
> ok. I want to introduce a 'write' method. should I name them
> load and store?

Either read/write or load/store.  I don't really care, don't know if
BenH has an opinion.


> 
> Thanks,
> 
> C. 
> 
> >> Also introduce a XIVE_ESB_LOAD_EOI read when EOI'ing
> >> LSI interrupts.
> >>
> >> Signed-off-by: Cédric Le Goater 
> >> ---
> >>  arch/powerpc/sysdev/xive/common.c | 20 ++--
> >>  1 file changed, 10 insertions(+), 10 deletions(-)
> >>
> >> diff --git a/arch/powerpc/sysdev/xive/common.c 
> >> b/arch/powerpc/sysdev/xive/common.c
> >> index 6595462b1fc8..e6b245bb9602 100644
> >> --- a/arch/powerpc/sysdev/xive/common.c
> >> +++ b/arch/powerpc/sysdev/xive/common.c
> >> @@ -190,7 +190,7 @@ static u32 xive_scan_interrupts(struct xive_cpu *xc, 
> >> bool just_peek)
> >>   * This is used to perform the magic loads from an ESB
> >>   * described in xive.h
> >>   */
> >> -static u8 xive_poke_esb(struct xive_irq_data *xd, u32 offset)
> >> +static u8 xive_peek_esb(struct xive_irq_data *xd, u32 offset)
> >>  {
> >>u64 val;
> >>  
> >> @@ -227,7 +227,7 @@ void xmon_xive_do_dump(int cpu)
> >>    xive_dump_eq("IRQ", &xc->queue[xive_irq_priority]);
> >>  #ifdef CONFIG_SMP
> >>{
> >> -  u64 val = xive_poke_esb(&xc->ipi_data, XIVE_ESB_GET);
> >> +  u64 val = xive_peek_esb(&xc->ipi_data, XIVE_ESB_GET);
> >>xmon_printf("  IPI state: %x:%c%c\n", xc->hw_ipi,
> >>val & XIVE_ESB_VAL_P ? 'P' : 'p',
> >>val & XIVE_ESB_VAL_P ? 'Q' : 'q');
> >> @@ -326,9 +326,9 @@ void xive_do_source_eoi(u32 hw_irq, struct 
> >> xive_irq_data *xd)
> >> * properly.
> >> */
> >>if (xd->flags & XIVE_IRQ_FLAG_LSI)
> >> -  in_be64(xd->eoi_mmio);
> >> +  xive_peek_esb(xd, XIVE_ESB_LOAD_EOI);
> >>else {
> >> -  eoi_val = xive_poke_esb(xd, XIVE_ESB_SET_PQ_00);
> >> +  eoi_val = xive_peek_esb(xd, XIVE_ESB_SET_PQ_00);
> >>DBG_VERBOSE("eoi_val=%x\n", offset, eoi_val);
> >>  
> >>/* Re-trigger if needed */
> >> @@ -383,12 +383,12 @@ static void xive_do_source_set_mask(struct 
> >> xive_irq_data *xd,
> >> * ESB accordingly on unmask.
> >> */
> >>if (mask) {
> >> -  val = xive_poke_esb(xd, XIVE_ESB_SET_PQ_01);
> >> +  val = xive_peek_esb(xd, XIVE_ESB_SET_PQ_01);
> >>xd->saved_p = !!(val & XIVE_ESB_VAL_P);
> >>} else if (xd->saved_p)
> >> -  xive_poke_esb(xd, XIVE_ESB_SET_PQ_10);
> >> +  xive_peek_esb(xd, XIVE_ESB_SET_PQ_10);
> >>else
> >> -  xive_poke_esb(xd, XIVE_ESB_SET_PQ_00);
> >> +  xive_peek_esb(xd, XIVE_ESB_SET_PQ_00);
> >>  }
> >>  
> >>  /*
> >> @@ -768,7 +768,7 @@ static int xive_irq_retrigger(struct irq_data *d)
> >> * To perform a retrigger, we first set the PQ bits to
> >> * 11, then perform an EOI.
> >> */
> >> -  xive_poke_esb(xd, XIVE_ESB_SET_PQ_11);
> >> +  xive_peek_esb(xd, XIVE_ESB_SET_PQ_11);
> >>  
> >>/*
> >> * Note: We pass "0" to the hw_irq argument in order to
> >> @@ -803,7 +803,7 @@ static int xive_irq_set_vcpu_affinity(struct irq_data 
> >> *d, void *state)
> >>irqd_set_forwarded_to_vcpu(d);
> >>  
> >>/* Set it to PQ=10 state to prevent further sends */
> >> -  pq = xive_poke_esb(xd, XIVE_ESB_SET_PQ_10);
> >> +  pq = xive_peek_esb(xd, XIVE_ESB_SET_PQ_10);
> >>  
> >>/* No target ? nothing to do */
> >>if (xd->target == XIVE_INVALID_TARGET) {
> >> @@ -832,7 +832,7 @@ static int xive_irq_set_vcpu_affinity(struct irq_data 
> >> *d, void *state)
> >> * for sure the queue slot is no longer in use.
> >> */
> >>if (pq & 2) {
> >> -  pq = xive_poke_esb(xd, XIVE_ESB_SET_PQ_11);
> >> +  pq = xive_peek_esb(xd, XIVE_ESB_SET_PQ_11);
> >>xd->saved_p = true;
> >>  
> >>/*
> > 
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [v2] powerpc/powernv/idle: Disable LOSE_FULL_CONTEXT states when stop-api fails

2017-08-09 Thread Michael Ellerman
On Tue, 2017-08-08 at 08:43:15 UTC, "Gautham R. Shenoy" wrote:
> From: "Gautham R. Shenoy" 
> 
> Currently, we use the opal call opal_slw_set_reg() to inform the
> Sleep-Winkle Engine (SLW) to restore the contents of some of the
> Hypervisor state on wakeup from deep idle states that lose full
> hypervisor context (characterized by the flag
> OPAL_PM_LOSE_FULL_CONTEXT).
> 
> However, the current code has a bug in that if opal_slw_set_reg()
> fails, we don't disable the use of these deep states (winkle on
> POWER8, stop4 onwards on POWER9).
> 
> This patch fixes this bug by ensuring that if programming the
> sleep-winkle engine to restore the hypervisor states in
> pnv_save_sprs_for_deep_states() fails, then we exclude such states by
> clearing the OPAL_PM_LOSE_FULL_CONTEXT flag from
> supported_cpuidle_states. As a result POWER8 will be prevented from
> using winkle for CPU-Hotplug, and POWER9 will put the offlined CPUs to
> the default stop state when available.
> 
> Further, we ensure in the initialization of the cpuidle-powernv driver
> to only include those states whose flags are present in
> supported_cpuidle_states, thereby skipping OPAL_PM_LOSE_FULL_CONTEXT
> states when they have been disabled due to stop-api failure.
> 
> Fixes: 1e1601b38e6 ("powerpc/powernv/idle: Restore SPRs for deep idle
> states via stop API.")
> 
> Signed-off-by: Gautham R. Shenoy 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/785a12afdb4a52903447fd890633c8

cheers


Re: [v6 1/2] raid6/altivec: Add vpermxor implementation for raid6 Q syndrome

2017-08-09 Thread Michael Ellerman
Matt Brown  writes:

> This patch uses the vpermxor instruction to optimise the raid6 Q syndrome.
> This instruction was made available with POWER8, ISA version 2.07.
> It allows for both vperm and vxor instructions to be done in a single
> instruction. This has been tested for correctness on a ppc64le vm with a
> basic RAID6 setup containing 5 drives.
>
> The performance benchmarks are from the raid6test in the /lib/raid6/test
> directory. These results are from an IBM Firestone machine with ppc64le
> architecture. The benchmark results show a 35% speed increase over the best
> existing algorithm for powerpc (altivec). The raid6test has also been run
> on a big-endian ppc64 vm to ensure it also works for big-endian
> architectures.
>
> Performance benchmarks:
>   raid6: altivecx4 gen() 18773 MB/s
>   raid6: altivecx8 gen() 19438 MB/s
>
>   raid6: vpermxor4 gen() 25112 MB/s
>   raid6: vpermxor8 gen() 26279 MB/s
>
> Signed-off-by: Matt Brown 
> Reviewed-by: Daniel Axtens 
> ---
> v6:
>   - added vpermxor files to .gitignore
>   - fixup whitespace
>   - added vpermxor objs to test/Makefile
> v5:
>   - moved altivec.uc fix into other patch in series
> ---
>  include/linux/raid/pq.h |   4 ++
>  lib/raid6/.gitignore|   1 +
>  lib/raid6/Makefile  |  27 -
>  lib/raid6/algos.c   |   4 ++
>  lib/raid6/test/Makefile |  17 +++-
>  lib/raid6/vpermxor.uc   | 104 
> 
>  6 files changed, 154 insertions(+), 3 deletions(-)
>  create mode 100644 lib/raid6/vpermxor.uc

This version at least is not Cc'ed to any of the folks that
get_maintainers.pl identifies for these files:

$ ./scripts/get_maintainer.pl -f lib/raid6
s...@fb.com
gayatri.kamm...@intel.com
fenghua...@intel.com
megha@linux.intel.com
schwidef...@de.ibm.com
anup.pa...@broadcom.com
linux-ker...@vger.kernel.org


This seems like mostly a list of random folks who've touched this code,
but maybe some of them would have comments?

cheers


Re: [PATCH 13/16] perf: Add a speculative page fault sw events

2017-08-09 Thread Laurent Dufour
On 09/08/2017 15:18, Michael Ellerman wrote:
> Laurent Dufour  writes:
> 
>> Add new software events to count succeeded and failed speculative page
>> faults.
>>
>> Signed-off-by: Laurent Dufour 
>> ---
>>  include/uapi/linux/perf_event.h | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/include/uapi/linux/perf_event.h 
>> b/include/uapi/linux/perf_event.h
>> index b1c0b187acfe..fbfb03dff334 100644
>> --- a/include/uapi/linux/perf_event.h
>> +++ b/include/uapi/linux/perf_event.h
>> @@ -111,6 +111,8 @@ enum perf_sw_ids {
>>  PERF_COUNT_SW_EMULATION_FAULTS  = 8,
>>  PERF_COUNT_SW_DUMMY = 9,
>>  PERF_COUNT_SW_BPF_OUTPUT= 10,
>> +PERF_COUNT_SW_SPF_DONE  = 11,
>> +PERF_COUNT_SW_SPF_FAILED= 12,
> 
> Can't you calculate:
> 
>   PERF_COUNT_SW_SPF_FAILED = PERF_COUNT_SW_PAGE_FAULTS - 
> PERF_COUNT_SW_SPF_DONE
> 
> ie. do you need a separate event for it?

Unfortunately not, because PERF_COUNT_SW_PAGE_FAULTS also counts page
faults from kernel space, while SPF only concerns user space page
faults.

Cheers,
Laurent.



Re: [PATCH] powerpc/64s: Add support for ASB_Notify on POWER9

2017-08-09 Thread christophe lombard

On 05/08/2017 at 06:28, Benjamin Herrenschmidt wrote:

On Fri, 2017-08-04 at 16:56 +0200, Christophe Lombard wrote:

The POWER9 core supports a new feature: ASB_Notify which requires the
support of the Special Purpose Register: TIDR.

The ASB_Notify command, generated by the AFU, will attempt to
wake-up the host thread identified by the particular LPID:PID:TID.

The special register TIDR has to be updated with the same value
present in the process element.

Although the register TIDR is 64 bits wide, the CAPI Translation
Service Layer core (XSL9) for Power9 systems limits the size (16 bits)
of the Thread ID when it generates the ASB_Notify message, adding
PID:LPID:TID information from the context.

The content of the internal kernel Thread ID (32 bits) therefore cannot
be used as-is to fill the register TIDR.

This patch avoids this limitation by adding a new interface
for the user. The instructions mfspr/mtspr SPRN_TIDR are emulated,
save/restore SPRs (context switch) are updated and a new feature
(CPU_FTR_TIDR) is added to POWER9 system.

Those CPU_FTR_* are internal to the kernel. You probably also need a
feature in AT_HWCAP2 to indicate to userspace that this is supported.


Thanks, I will look at it in detail.


Also you put the onus of allocating the TIDs onto userspace which is a
bit tricky. What happens if there are duplicate TIDs for example ? (ie,
userspace doesn't allocate it or uses a library that spawns a thread)

Ben.

Signed-off-by: Christophe Lombard 
---
  arch/powerpc/include/asm/cputable.h |  4 +++-
  arch/powerpc/include/asm/emulated_ops.h |  2 ++
  arch/powerpc/include/asm/ppc-opcode.h   |  4 
  arch/powerpc/include/asm/processor.h|  1 +
  arch/powerpc/kernel/process.c   |  8 
  arch/powerpc/kernel/traps.c | 21 +
  6 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/cputable.h 
b/arch/powerpc/include/asm/cputable.h
index d02ad93..706f668 100644
--- a/arch/powerpc/include/asm/cputable.h
+++ b/arch/powerpc/include/asm/cputable.h
@@ -215,6 +215,7 @@ enum {
  #define CPU_FTR_DABRX LONG_ASM_CONST(0x0800)
  #define CPU_FTR_PMAO_BUG  LONG_ASM_CONST(0x1000)
  #define CPU_FTR_POWER9_DD1LONG_ASM_CONST(0x4000)
+#define CPU_FTR_TIDR   LONG_ASM_CONST(0x8000)
  
  #ifndef __ASSEMBLY__
  
@@ -474,7 +475,8 @@ enum {

CPU_FTR_STCX_CHECKS_ADDRESS | CPU_FTR_POPCNTB | CPU_FTR_POPCNTD | \
CPU_FTR_CFAR | CPU_FTR_HVMODE | CPU_FTR_VMX_COPY | \
CPU_FTR_DBELL | CPU_FTR_HAS_PPR | CPU_FTR_DAWR | \
-   CPU_FTR_ARCH_207S | CPU_FTR_TM_COMP | CPU_FTR_ARCH_300)
+   CPU_FTR_ARCH_207S | CPU_FTR_TM_COMP | CPU_FTR_ARCH_300 | \
+   CPU_FTR_TIDR)
  #define CPU_FTRS_POWER9_DD1 ((CPU_FTRS_POWER9 | CPU_FTR_POWER9_DD1) & \
 (~CPU_FTR_SAO))
  #define CPU_FTRS_CELL (CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
diff --git a/arch/powerpc/include/asm/emulated_ops.h 
b/arch/powerpc/include/asm/emulated_ops.h
index f00e10e..e83ad42 100644
--- a/arch/powerpc/include/asm/emulated_ops.h
+++ b/arch/powerpc/include/asm/emulated_ops.h
@@ -54,6 +54,8 @@ extern struct ppc_emulated {
  #ifdef CONFIG_PPC64
struct ppc_emulated_entry mfdscr;
struct ppc_emulated_entry mtdscr;
+   struct ppc_emulated_entry mftidr;
+   struct ppc_emulated_entry mttidr;
struct ppc_emulated_entry lq_stq;
  #endif
  } ppc_emulated;
diff --git a/arch/powerpc/include/asm/ppc-opcode.h 
b/arch/powerpc/include/asm/ppc-opcode.h
index fa9ebae..3ebc446 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -241,6 +241,10 @@
  #define PPC_INST_MFSPR_DSCR_USER_MASK 0xfc1e
  #define PPC_INST_MTSPR_DSCR_USER  0x7c0303a6
  #define PPC_INST_MTSPR_DSCR_USER_MASK 0xfc1e
+#define PPC_INST_MFSPR_TIDR0x7d2452a6
+#define PPC_INST_MFSPR_TIDR_MASK   0xfd2e
+#define PPC_INST_MTSPR_TIDR0x7d2453a6
+#define PPC_INST_MTSPR_TIDR_MASK   0xfd2e
  #define PPC_INST_MFVSRD   0x7c66
  #define PPC_INST_MTVSRD   0x7c000166
  #define PPC_INST_SLBFEE   0x7c0007a7
diff --git a/arch/powerpc/include/asm/processor.h 
b/arch/powerpc/include/asm/processor.h
index fab7ff8..58cc212 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -329,6 +329,7 @@ struct thread_struct {
 */
int dscr_inherit;
unsigned long   ppr;/* used to save/restore SMT priority */
+   unsigned long   tidr;
  #endif
  #ifdef CONFIG_PPC_BOOK3S_64
unsigned long   tar;
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 9f3e2c9..f06ea10 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1084,6 

Re: [PATCH 13/16] perf: Add a speculative page fault sw events

2017-08-09 Thread Michael Ellerman
Laurent Dufour  writes:

> Add new software events to count succeeded and failed speculative page
> faults.
>
> Signed-off-by: Laurent Dufour 
> ---
>  include/uapi/linux/perf_event.h | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index b1c0b187acfe..fbfb03dff334 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -111,6 +111,8 @@ enum perf_sw_ids {
>   PERF_COUNT_SW_EMULATION_FAULTS  = 8,
>   PERF_COUNT_SW_DUMMY = 9,
>   PERF_COUNT_SW_BPF_OUTPUT= 10,
> + PERF_COUNT_SW_SPF_DONE  = 11,
> + PERF_COUNT_SW_SPF_FAILED= 12,

Can't you calculate:

  PERF_COUNT_SW_SPF_FAILED = PERF_COUNT_SW_PAGE_FAULTS - PERF_COUNT_SW_SPF_DONE

ie. do you need a separate event for it?

cheers


Re: [PATCH 3/3] powerpc/mm: Mark __init memory no-execute when STRICT_KERNEL_RWX=y

2017-08-09 Thread Michael Ellerman
Christophe LEROY  writes:
...
> At least it is correct for the ones that use regular pages, and kernel 
> can also be started with nobats or noltlbs at command line, in which 
> case it is usefull to have the page tables correct.

Yep OK.

>> So yes we *should* always mark it no-execute but in practice we can't
>> because it's not page aligned.
>
> On 32 bit it seems to always be aligned to the normal page size, so no 
> problem.
>
>> But if that's different on (some?) 32-bit then we could introduce a new
>> CONFIG symbol that is enabled in the right cases.
>
> For the time being, I have added an "|| CONFIG_PPC32 " on the ifdef, is 
> that OK ?
> See https://patchwork.ozlabs.org/patch/796625/

Yeah looks fine.

cheers


[PATCH 6/6] powerpc/watchdog: add locking around init/exit functions

2017-08-09 Thread Nicholas Piggin
When CPUs start and stop the watchdog, they manipulate shared data
that is normally protected by the lock. Other CPUs can be running
concurrently at this time, so it's a good idea to use locking here
to be on the safe side.

Remove the barrier which is undocumented and didn't do anything.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/watchdog.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c
index b84351f359ea..2f6eadd9408d 100644
--- a/arch/powerpc/kernel/watchdog.c
+++ b/arch/powerpc/kernel/watchdog.c
@@ -300,6 +300,8 @@ static void stop_watchdog_timer_on(unsigned int cpu)
 
 static int start_wd_on_cpu(unsigned int cpu)
 {
+   unsigned long flags;
+
if (cpumask_test_cpu(cpu, &wd_cpus_enabled)) {
WARN_ON(1);
return 0;
@@ -314,12 +316,14 @@ static int start_wd_on_cpu(unsigned int cpu)
if (!cpumask_test_cpu(cpu, &watchdog_cpumask))
return 0;
 
+   wd_smp_lock(&flags);
cpumask_set_cpu(cpu, &wd_cpus_enabled);
if (cpumask_weight(&wd_cpus_enabled) == 1) {
cpumask_set_cpu(cpu, &wd_smp_cpus_pending);
wd_smp_last_reset_tb = get_tb();
}
-   smp_wmb();
+   wd_smp_unlock(&flags);
+
start_watchdog_timer_on(cpu);
 
return 0;
@@ -327,12 +331,17 @@ static int start_wd_on_cpu(unsigned int cpu)
 
 static int stop_wd_on_cpu(unsigned int cpu)
 {
+   unsigned long flags;
+
	if (!cpumask_test_cpu(cpu, &wd_cpus_enabled))
return 0; /* Can happen in CPU unplug case */
 
stop_watchdog_timer_on(cpu);
 
+	wd_smp_lock(&flags);
	cpumask_clear_cpu(cpu, &wd_cpus_enabled);
+	wd_smp_unlock(&flags);
+
wd_smp_clear_cpu_pending(cpu, get_tb());
 
return 0;
-- 
2.13.3



[PATCH 5/6] powerpc/watchdog: Fix marking of stuck CPUs

2017-08-09 Thread Nicholas Piggin
When the SMP detector finds other CPUs stuck, it iterates over
them and marks them as stuck. This pulls them out of the pending
mask and allows the detector to continue with remaining good
CPUs (if nmi_watchdog=panic is not enabled).

The code to do that was buggy because when setting a CPU stuck,
if the pending mask became empty, it was reset to keep the
watchdog running. However the iterator will continue to run
over the new pending mask and mark remaining good CPUs as stuck.

Fix this by doing it with cpumask bitwise operations.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/watchdog.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c
index 5a69654075c1..b84351f359ea 100644
--- a/arch/powerpc/kernel/watchdog.c
+++ b/arch/powerpc/kernel/watchdog.c
@@ -101,10 +101,10 @@ static void wd_lockup_ipi(struct pt_regs *regs)
nmi_panic(regs, "Hard LOCKUP");
 }
 
-static void set_cpu_stuck(int cpu, u64 tb)
+static void set_cpumask_stuck(const struct cpumask *cpumask, u64 tb)
 {
-	cpumask_set_cpu(cpu, &wd_smp_cpus_stuck);
-	cpumask_clear_cpu(cpu, &wd_smp_cpus_pending);
+	cpumask_or(&wd_smp_cpus_stuck, &wd_smp_cpus_stuck, cpumask);
+	cpumask_andnot(&wd_smp_cpus_pending, &wd_smp_cpus_pending, cpumask);
	if (cpumask_empty(&wd_smp_cpus_pending)) {
		wd_smp_last_reset_tb = tb;
		cpumask_andnot(&wd_smp_cpus_pending,
@@ -112,6 +112,10 @@ static void set_cpu_stuck(int cpu, u64 tb)
				&wd_smp_cpus_stuck);
}
 }
+static void set_cpu_stuck(int cpu, u64 tb)
+{
+   set_cpumask_stuck(cpumask_of(cpu), tb);
+}
 
 static void watchdog_smp_panic(int cpu, u64 tb)
 {
@@ -140,9 +144,8 @@ static void watchdog_smp_panic(int cpu, u64 tb)
}
smp_flush_nmi_ipi(100);
 
-   /* Take the stuck CPU out of the watch group */
-	for_each_cpu(c, &wd_smp_cpus_pending)
-   set_cpu_stuck(c, tb);
+   /* Take the stuck CPUs out of the watch group */
+	set_cpumask_stuck(&wd_smp_cpus_pending, tb);
 
wd_smp_unlock();
 
-- 
2.13.3



[PATCH 4/6] powerpc/watchdog: Fix final-check recovered case

2017-08-09 Thread Nicholas Piggin
When the watchdog decides to panic, it takes the lock and double
checks everything (to avoid races with the CPU being unstuck or
panic()ed by something else).

The exit label was misplaced and would result in all-CPUs backtrace
and watchdog panic even in the case that the condition was found to be
resolved.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/watchdog.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c
index 2806383d2339..5a69654075c1 100644
--- a/arch/powerpc/kernel/watchdog.c
+++ b/arch/powerpc/kernel/watchdog.c
@@ -144,7 +144,6 @@ static void watchdog_smp_panic(int cpu, u64 tb)
	for_each_cpu(c, &wd_smp_cpus_pending)
set_cpu_stuck(c, tb);
 
-out:
wd_smp_unlock();
 
printk_safe_flush();
@@ -157,6 +156,11 @@ static void watchdog_smp_panic(int cpu, u64 tb)
 
if (hardlockup_panic)
nmi_panic(NULL, "Hard LOCKUP");
+
+   return;
+
+out:
+   wd_smp_unlock();
 }
 
 static void wd_smp_clear_cpu_pending(int cpu, u64 tb)
-- 
2.13.3



[PATCH 3/6] powerpc/watchdog: Moderate touch_nmi_watchdog overhead

2017-08-09 Thread Nicholas Piggin
Some code can go into a tight loop calling touch_nmi_watchdog (e.g.,
the stop_machine CPU hotplug code). This can cause contention on the
watchdog locks, particularly if all CPUs with the watchdog enabled are
spinning in the loops.

Avoid this storm of activity by running the watchdog timer callback
from this path if we have exceeded the timer period since it was last
run.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/watchdog.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c
index 3a26c3401acc..2806383d2339 100644
--- a/arch/powerpc/kernel/watchdog.c
+++ b/arch/powerpc/kernel/watchdog.c
@@ -266,9 +266,11 @@ static void wd_timer_fn(unsigned long data)
 
 void arch_touch_nmi_watchdog(void)
 {
+   unsigned long ticks = tb_ticks_per_usec * wd_timer_period_ms * 1000;
int cpu = smp_processor_id();
 
-   watchdog_timer_interrupt(cpu);
+   if (get_tb() - per_cpu(wd_timer_tb, cpu) >= ticks)
+   watchdog_timer_interrupt(cpu);
 }
 EXPORT_SYMBOL(arch_touch_nmi_watchdog);
 
-- 
2.13.3



[PATCH 2/6] powerpc/watchdog: Improve watchdog lock primitive

2017-08-09 Thread Nicholas Piggin
- Hard-disable interrupts before taking the lock, which prevents
  soft-NMI re-entrancy and therefore can prevent deadlocks.
- Use raw_ variants of local_irq_disable to avoid irq debugging.
- When the lock is contended, spin at low SMT priority, using
  loads only, and with interrupts enabled (where possible).

Some stalls have been noticed at high loads that go away with improved
locking. There should not be so much locking contention in the first
place (which is addressed in a subsequent patch), but locking should
still be improved.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/watchdog.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c
index 4b9a567c9975..3a26c3401acc 100644
--- a/arch/powerpc/kernel/watchdog.c
+++ b/arch/powerpc/kernel/watchdog.c
@@ -71,15 +71,20 @@ static inline void wd_smp_lock(unsigned long *flags)
 * This may be called from low level interrupt handlers at some
 * point in future.
 */
-   local_irq_save(*flags);
-   while (unlikely(test_and_set_bit_lock(0, &__wd_smp_lock)))
-   cpu_relax();
+   raw_local_irq_save(*flags);
+   hard_irq_disable(); /* Make it soft-NMI safe */
+   while (unlikely(test_and_set_bit_lock(0, &__wd_smp_lock))) {
+   raw_local_irq_restore(*flags);
+   spin_until_cond(!test_bit(0, &__wd_smp_lock));
+   raw_local_irq_save(*flags);
+   hard_irq_disable();
+   }
 }
 
 static inline void wd_smp_unlock(unsigned long *flags)
 {
clear_bit_unlock(0, &__wd_smp_lock);
-   local_irq_restore(*flags);
+   raw_local_irq_restore(*flags);
 }
 
 static void wd_lockup_ipi(struct pt_regs *regs)
-- 
2.13.3
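
For reference, spin_until_cond() used above is the generic helper from
include/linux/processor.h, which expands to roughly the following; on
powerpc, spin_begin()/spin_end() map to HMT_low()/HMT_medium(), so the
waiter sits at low SMT priority and only performs loads until the
condition flips:

#define spin_until_cond(cond)					\
do {								\
	if (unlikely(!(cond))) {				\
		spin_begin();		/* drop SMT priority */	\
		do {						\
			spin_cpu_relax();			\
		} while (!(cond));				\
		spin_end();		/* restore priority */	\
	}							\
} while (0)

That is why the contended path above waits with test_bit() (a plain
load) and only retries the test_and_set_bit_lock() store once the spin
has seen the bit clear.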



[PATCH 1/6] powerpc: NMI IPI improve lock primitive

2017-08-09 Thread Nicholas Piggin
When the NMI IPI lock is contended, spin at low SMT priority, using
loads only, and with interrupts enabled (where possible). This
improves behaviour under high contention (e.g., a system crash when
a number of CPUs are trying to enter the debugger).

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/smp.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index cf0e1245b8cc..8d3320562c70 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -351,7 +351,7 @@ static void nmi_ipi_lock_start(unsigned long *flags)
hard_irq_disable();
while (atomic_cmpxchg(&__nmi_ipi_lock, 0, 1) == 1) {
raw_local_irq_restore(*flags);
-   cpu_relax();
+   spin_until_cond(atomic_read(&__nmi_ipi_lock) == 0);
raw_local_irq_save(*flags);
hard_irq_disable();
}
@@ -360,7 +360,7 @@ static void nmi_ipi_lock_start(unsigned long *flags)
 static void nmi_ipi_lock(void)
 {
while (atomic_cmpxchg(&__nmi_ipi_lock, 0, 1) == 1)
-   cpu_relax();
+   spin_until_cond(atomic_read(&__nmi_ipi_lock) == 0);
 }
 
 static void nmi_ipi_unlock(void)
@@ -475,7 +475,7 @@ int smp_send_nmi_ipi(int cpu, void (*fn)(struct pt_regs *), u64 delay_us)
nmi_ipi_lock_start();
while (nmi_ipi_busy_count) {
nmi_ipi_unlock_end();
-   cpu_relax();
+   spin_until_cond(nmi_ipi_busy_count == 0);
nmi_ipi_lock_start();
}
 
-- 
2.13.3



[PATCH 0/6] watchdog and NMI IPI locking improvements

2017-08-09 Thread Nicholas Piggin
It was noticed that the watchdog was causing hangs and lockups in
some cases, hammering on the watchdog lock, and while fixing that I
found a few other improvements and bugs. Thanks to Paulus for finding
the problem and fixing the lock primitives (I fixed it a bit
differently but the idea is his).

Thanks,
Nick

Nicholas Piggin (6):
  powerpc: NMI IPI improve lock primitive
  powerpc/watchdog: Improve watchdog lock primitive
  powerpc/watchdog: Moderate touch_nmi_watchdog overhead
  powerpc/watchdog: Fix final-check recovered case
  powerpc/watchdog: Fix marking of stuck CPUs
  powerpc/watchdog: add locking around init/exit functions

 arch/powerpc/kernel/smp.c  |  6 +++---
 arch/powerpc/kernel/watchdog.c | 49 ++++++++++++++++++++++++++++++++++++-------------
 2 files changed, 39 insertions(+), 16 deletions(-)

-- 
2.13.3



[PATCH] ibmvnic: Fix unused variable warning

2017-08-09 Thread Michal Suchanek
Fixes: a248878d7a1d ("ibmvnic: Check for transport event on driver resume")

Signed-off-by: Michal Suchanek 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 99576ba4187f..09c20d3b1b79 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -3948,7 +3948,6 @@ static int ibmvnic_resume(struct device *dev)
 {
struct net_device *netdev = dev_get_drvdata(dev);
struct ibmvnic_adapter *adapter = netdev_priv(netdev);
-   int i;
 
if (adapter->state != VNIC_OPEN)
return 0;
-- 
2.10.2



[PATCH] powerpc/configs: Re-enable HARD/SOFT lockup detectors

2017-08-09 Thread Michael Ellerman
In commit 05a4a9527931 ("kernel/watchdog: split up config options"),
CONFIG_LOCKUP_DETECTOR was split into two separate config options,
CONFIG_HARDLOCKUP_DETECTOR and CONFIG_SOFTLOCKUP_DETECTOR.

Our defconfigs still have CONFIG_LOCKUP_DETECTOR=y, but that is no longer
user selectable, and we don't mention the new options, so we end up with
none of them enabled.

So update the defconfigs to turn on the new SOFT and HARD options, the
end result being the same as what we had previously.

Fixes: 05a4a9527931 ("kernel/watchdog: split up config options")
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/configs/powernv_defconfig | 3 ++-
 arch/powerpc/configs/ppc64_defconfig   | 3 ++-
 arch/powerpc/configs/pseries_defconfig | 3 ++-
 3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/configs/powernv_defconfig b/arch/powerpc/configs/powernv_defconfig
index 0695ce047d56..34fc9bbfca9e 100644
--- a/arch/powerpc/configs/powernv_defconfig
+++ b/arch/powerpc/configs/powernv_defconfig
@@ -293,7 +293,8 @@ CONFIG_MAGIC_SYSRQ=y
 CONFIG_DEBUG_KERNEL=y
 CONFIG_DEBUG_STACK_USAGE=y
 CONFIG_DEBUG_STACKOVERFLOW=y
-CONFIG_LOCKUP_DETECTOR=y
+CONFIG_SOFTLOCKUP_DETECTOR=y
+CONFIG_HARDLOCKUP_DETECTOR=y
 CONFIG_LATENCYTOP=y
 CONFIG_SCHED_TRACER=y
 CONFIG_BLK_DEV_IO_TRACE=y
diff --git a/arch/powerpc/configs/ppc64_defconfig b/arch/powerpc/configs/ppc64_defconfig
index 5175028c56ce..c5246d29f385 100644
--- a/arch/powerpc/configs/ppc64_defconfig
+++ b/arch/powerpc/configs/ppc64_defconfig
@@ -324,7 +324,8 @@ CONFIG_MAGIC_SYSRQ=y
 CONFIG_DEBUG_KERNEL=y
 CONFIG_DEBUG_STACK_USAGE=y
 CONFIG_DEBUG_STACKOVERFLOW=y
-CONFIG_LOCKUP_DETECTOR=y
+CONFIG_SOFTLOCKUP_DETECTOR=y
+CONFIG_HARDLOCKUP_DETECTOR=y
 CONFIG_DEBUG_MUTEXES=y
 CONFIG_LATENCYTOP=y
 CONFIG_SCHED_TRACER=y
diff --git a/arch/powerpc/configs/pseries_defconfig b/arch/powerpc/configs/pseries_defconfig
index 1a61aa20dfba..fd5d98a0b95c 100644
--- a/arch/powerpc/configs/pseries_defconfig
+++ b/arch/powerpc/configs/pseries_defconfig
@@ -291,7 +291,8 @@ CONFIG_MAGIC_SYSRQ=y
 CONFIG_DEBUG_KERNEL=y
 CONFIG_DEBUG_STACK_USAGE=y
 CONFIG_DEBUG_STACKOVERFLOW=y
-CONFIG_LOCKUP_DETECTOR=y
+CONFIG_SOFTLOCKUP_DETECTOR=y
+CONFIG_HARDLOCKUP_DETECTOR=y
 CONFIG_LATENCYTOP=y
 CONFIG_SCHED_TRACER=y
 CONFIG_BLK_DEV_IO_TRACE=y
-- 
2.7.4



Re: [PATCH 02/16] mm: Prepare for FAULT_FLAG_SPECULATIVE

2017-08-09 Thread Laurent Dufour
On 09/08/2017 12:08, Kirill A. Shutemov wrote:
> On Tue, Aug 08, 2017 at 04:35:35PM +0200, Laurent Dufour wrote:
>> @@ -2295,7 +2302,11 @@ static int wp_page_copy(struct vm_fault *vmf)
>>  /*
>>   * Re-check the pte - we dropped the lock
>>   */
>> -vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl);
>> +if (!pte_map_lock(vmf)) {
>> +mem_cgroup_cancel_charge(new_page, memcg, false);
>> +ret = VM_FAULT_RETRY;
>> +goto oom_free_new;
> 
> With the change, label is misleading.

That's right.
But I'm wondering whether renaming it to 'out_free_new' and replacing all
the matching 'goto' statements where the old label made sense would help
readability? Do you have a better idea?



Re: [PATCH 05/16] mm: Protect VMA modifications using VMA sequence count

2017-08-09 Thread Laurent Dufour
On 09/08/2017 12:12, Kirill A. Shutemov wrote:
> On Tue, Aug 08, 2017 at 04:35:38PM +0200, Laurent Dufour wrote:
>> The VMA sequence count has been introduced to allow fast detection of
>> VMA modification when running a page fault handler without holding
>> the mmap_sem.
>>
>> This patch provides protection against the VMA modifications done in:
>>  - madvise()
>>  - mremap()
>>  - mpol_rebind_policy()
>>  - vma_replace_policy()
>>  - change_prot_numa()
>>  - mlock(), munlock()
>>  - mprotect()
>>  - mmap_region()
>>  - collapse_huge_page()
> 
> I don't think it's anywhere near a complete list of places where we touch
> vm_flags. What is your plan for the rest?

The goal is only to protect places where a change to the VMA impacts the
page fault handling. If you think I missed one, please advise.

Thanks,
Laurent.



Re: [PATCH] powerpc/mm: Invalidate partition table cache on host proc tbl base update

2017-08-09 Thread Michael Ellerman
Suraj Jitindar Singh  writes:

> The host process table base is stored in the partition table by calling
> the function native_register_process_table(). Currently this just sets
> the entry in memory and is missing a subsequent cache invalidation
> instruction. Any update to the partition table should be followed by a
> cache invalidation instruction specifying invalidation of the caching of
> any partition table entries (RIC = 2, PRS = 0).
>
> We already have a function to update the partition table with the
> required cache invalidation instructions - mmu_partition_table_set_entry().
> Update the native_register_process_table() function to call
> mmu_partition_table_set_entry(), this ensures all appropriate
> invalidation will be performed.
>
> Signed-off-by: Suraj Jitindar Singh 
> ---
>  arch/powerpc/mm/pgtable-radix.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
> index 671a45d..1d5178f 100644
> --- a/arch/powerpc/mm/pgtable-radix.c
> +++ b/arch/powerpc/mm/pgtable-radix.c
> @@ -33,7 +33,8 @@ static int native_register_process_table(unsigned long 
> base, unsigned long pg_sz
>  {
>   unsigned long patb1 = base | table_size | PATB_GR;
>  
> - partition_tb->patb1 = cpu_to_be64(patb1);
> + mmu_partition_table_set_entry(0, be64_to_cpu(partition_tb->patb0),
> +   patb1);

This is really a bit gross.

Can we agree on whether partition_tb is an array or not?

How about ...

cheers

diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
index c1185c8ecb22..5d8be076f8e5 100644
--- a/arch/powerpc/mm/pgtable-radix.c
+++ b/arch/powerpc/mm/pgtable-radix.c
@@ -28,9 +28,13 @@
 static int native_register_process_table(unsigned long base, unsigned long pg_sz,
 unsigned long table_size)
 {
-   unsigned long patb1 = base | table_size | PATB_GR;
+   unsigned long patb0, patb1;
+
+   patb0 = be64_to_cpu(partition_tb[0].patb0);
+   patb1 = base | table_size | PATB_GR;
+
+   mmu_partition_table_set_entry(0, patb0, patb1);
 
-   partition_tb->patb1 = cpu_to_be64(patb1);
return 0;
 }


Re: [PATCH 05/16] mm: Protect VMA modifications using VMA sequence count

2017-08-09 Thread Kirill A. Shutemov
On Tue, Aug 08, 2017 at 04:35:38PM +0200, Laurent Dufour wrote:
> The VMA sequence count has been introduced to allow fast detection of
> VMA modification when running a page fault handler without holding
> the mmap_sem.
> 
> This patch provides protection against the VMA modifications done in:
>   - madvise()
>   - mremap()
>   - mpol_rebind_policy()
>   - vma_replace_policy()
>   - change_prot_numa()
>   - mlock(), munlock()
>   - mprotect()
>   - mmap_region()
>   - collapse_huge_page()

I don't think it's anywhere near a complete list of places where we touch
vm_flags. What is your plan for the rest?
-- 
 Kirill A. Shutemov
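
For context, a minimal sketch of the pattern the series relies on,
assuming the seqcount_t vm_sequence field the series adds to struct
vm_area_struct (the call sites below are illustrative, not the actual
patch hunks):

#include <linux/mm.h>
#include <linux/seqlock.h>

/* Writer side: VMA modification paths, mmap_sem held for writing */
static void vma_update_flags(struct vm_area_struct *vma,
			     unsigned long newflags)
{
	write_seqcount_begin(&vma->vm_sequence);
	vma->vm_flags = newflags;
	write_seqcount_end(&vma->vm_sequence);
}

/* Reader side: speculative fault path, no mmap_sem held */
static bool vma_snapshot_flags(struct vm_area_struct *vma,
			       unsigned long *flags)
{
	unsigned int seq;

	seq = read_seqcount_begin(&vma->vm_sequence);
	*flags = vma->vm_flags;	/* snapshot what the fault needs */
	/* false means the VMA changed underneath us: fall back */
	return !read_seqcount_retry(&vma->vm_sequence, seq);
}

Kirill's point is that every vm_flags writer needs the write side of
this pairing, or the reader's snapshot can be silently stale.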


Re: [PATCH 02/16] mm: Prepare for FAULT_FLAG_SPECULATIVE

2017-08-09 Thread Kirill A. Shutemov
On Tue, Aug 08, 2017 at 04:35:35PM +0200, Laurent Dufour wrote:
> @@ -2295,7 +2302,11 @@ static int wp_page_copy(struct vm_fault *vmf)
>   /*
>* Re-check the pte - we dropped the lock
>*/
> - vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl);
> + if (!pte_map_lock(vmf)) {
> + mem_cgroup_cancel_charge(new_page, memcg, false);
> + ret = VM_FAULT_RETRY;
> + goto oom_free_new;

With the change, label is misleading.

> + }
>   if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
>   if (old_page) {
>   if (!PageAnon(old_page)) {

-- 
 Kirill A. Shutemov


Re: [PATCH 02/10] powerpc/xive: guest exploitation of the XIVE interrupt controller

2017-08-09 Thread Cédric Le Goater
On 08/09/2017 05:53 AM, David Gibson wrote:
> On Tue, Aug 08, 2017 at 10:56:12AM +0200, Cédric Le Goater wrote:
>> This is the framework for using XIVE in a PowerVM guest. The support
>> is very similar to the native one in a much simpler form.
>>
>> Instead of OPAL calls, a set of hypervisor calls is used to configure
>> the interrupt sources and the event/notification queues of the guest:
>>
>>  - H_INT_GET_SOURCE_INFO
>>
>>used to obtain the address of the MMIO page of the Event State
>>Buffer (PQ bits) entry associated with the source.
>>
>>  - H_INT_SET_SOURCE_CONFIG
>>
>>assigns a source to a "target".
>>
>>  - H_INT_GET_SOURCE_CONFIG
>>
>>determines which "target" and "priority" a source is assigned to
>>
>>  - H_INT_GET_QUEUE_INFO
>>
>>returns the address of the notification management page associated
>>with the specified "target" and "priority".
>>
>>  - H_INT_SET_QUEUE_CONFIG
>>
>>sets or resets the event queue for a given "target" and "priority".
>>It is also used to set the notification config associated with the
>>queue, only unconditional notification for the moment.  Reset is
>>performed with a queue size of 0 and queueing is disabled in that
>>case.
>>
>>  - H_INT_GET_QUEUE_CONFIG
>>
>>returns the queue settings for a given "target" and "priority".
>>
>>  - H_INT_RESET
>>
>>resets all of the partition's interrupt exploitation structures to
>>their initial state, losing all configuration set via the hcalls
>>H_INT_SET_SOURCE_CONFIG and H_INT_SET_QUEUE_CONFIG.
>>
>>  - H_INT_SYNC
>>
>>issues a synchronisation on a source to make sure all
>>notifications have reached their queue.
>>
>> As for XICS, the XIVE interface for the guest is described in the
>> device tree under the interrupt controller node. A couple of new
>> properties are specific to XIVE :
>>
>>  - "reg"
>>
>>contains the base address and size of the thread interrupt
>>management areas (TIMA) for the user level and for the OS level. Only
>>the OS level is taken into account.
>>
>>  - "ibm,xive-eq-sizes"
>>
>>the size of the event queues.
>>
>>  - "ibm,xive-lisn-ranges"
>>
>>the interrupt numbers ranges assigned to the guest. These are
>>allocated using a simple bitmap.
>>
>> Tested with a QEMU XIVE model for pseries and with the Power
>> hypervisor
>>
>> Signed-off-by: Cédric Le Goater 
> 
> I don't know XIVE well enough to review in detail, but I've made some
> comments based on general knowledge.

Thanks for taking a look. 
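
For illustration, this is roughly how the guest side would wrap one of
the hcalls listed above, using the standard pseries plumbing (a sketch
only; the actual wrapper in the patch may differ):

static long plpar_int_get_source_info(unsigned long flags,
				      unsigned long lisn,
				      unsigned long *src_flags,
				      unsigned long *eoi_page)
{
	unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
	long rc;

	rc = plpar_hcall(H_INT_GET_SOURCE_INFO, retbuf, flags, lisn);
	if (rc != H_SUCCESS)
		return rc;

	*src_flags = retbuf[0];		/* source flags */
	*eoi_page  = retbuf[1];		/* ESB management page address */
	return 0;
}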
 
> 
>> ---
>>  arch/powerpc/include/asm/hvcall.h|  13 +-
>>  arch/powerpc/include/asm/xive.h  |   2 +
>>  arch/powerpc/platforms/pseries/Kconfig   |   1 +
>>  arch/powerpc/platforms/pseries/hotplug-cpu.c |  10 +-
>>  arch/powerpc/platforms/pseries/kexec.c   |   6 +-
>>  arch/powerpc/platforms/pseries/setup.c   |   8 +-
>>  arch/powerpc/platforms/pseries/smp.c |  32 +-
>>  arch/powerpc/sysdev/xive/Kconfig |   5 +
>>  arch/powerpc/sysdev/xive/Makefile|   1 +
>>  arch/powerpc/sysdev/xive/spapr.c | 554 
>> +++
>>  10 files changed, 619 insertions(+), 13 deletions(-)
>>  create mode 100644 arch/powerpc/sysdev/xive/spapr.c
>>
>> diff --git a/arch/powerpc/include/asm/hvcall.h 
>> b/arch/powerpc/include/asm/hvcall.h
>> index 57d38b504ff7..3d34dc0869f6 100644
>> --- a/arch/powerpc/include/asm/hvcall.h
>> +++ b/arch/powerpc/include/asm/hvcall.h
>> @@ -280,7 +280,18 @@
>>  #define H_RESIZE_HPT_COMMIT 0x370
>>  #define H_REGISTER_PROC_TBL 0x37C
>>  #define H_SIGNAL_SYS_RESET  0x380
>> -#define MAX_HCALL_OPCODEH_SIGNAL_SYS_RESET
>> +#define H_INT_GET_SOURCE_INFO   0x3A8
>> +#define H_INT_SET_SOURCE_CONFIG 0x3AC
>> +#define H_INT_GET_SOURCE_CONFIG 0x3B0
>> +#define H_INT_GET_QUEUE_INFO0x3B4
>> +#define H_INT_SET_QUEUE_CONFIG  0x3B8
>> +#define H_INT_GET_QUEUE_CONFIG  0x3BC
>> +#define H_INT_SET_OS_REPORTING_LINE 0x3C0
>> +#define H_INT_GET_OS_REPORTING_LINE 0x3C4
>> +#define H_INT_ESB   0x3C8
>> +#define H_INT_SYNC  0x3CC
>> +#define H_INT_RESET 0x3D0
>> +#define MAX_HCALL_OPCODEH_INT_RESET
>>  
>>  /* H_VIOCTL functions */
>>  #define H_GET_VIOA_DUMP_SIZE0x01
>> diff --git a/arch/powerpc/include/asm/xive.h 
>> b/arch/powerpc/include/asm/xive.h
>> index c23ff4389ca2..1deb10032d61 100644
>> --- a/arch/powerpc/include/asm/xive.h
>> +++ b/arch/powerpc/include/asm/xive.h
>> @@ -110,6 +110,7 @@ extern bool __xive_enabled;
>>  
>>  static inline bool xive_enabled(void) { return __xive_enabled; }
>>  
>> +extern bool xive_spapr_init(void);
>>  extern bool xive_native_init(void);
>>  extern void xive_smp_probe(void);
>>  extern int  xive_smp_prepare_cpu(unsigned int cpu);
>> @@ -147,6 +148,7 @@ extern int xive_native_get_vp_info(u32 vp_id, u32 
>> *out_cam_id, u32 *out_chip_id)
>>  
>>  static inline bool xive_enabled(void) { return false; }
>>  
>> +static 

Re: [PATCH 10/10] powerpc/xive: fix the size of the cpumask used in xive_find_target_in_mask()

2017-08-09 Thread Cédric Le Goater
On 08/09/2017 09:06 AM, Michael Ellerman wrote:
> Cédric Le Goater  writes:
>> When called from xive_irq_startup(), the size of the cpumask can be
>> larger than nr_cpu_ids. Most of the time, its value is NR_CPUS (2048).
> 
> Ugh, you're right.
> 
>   #define nr_cpumask_bits ((unsigned int)NR_CPUS)
>   ...
>   /**
>* cpumask_weight - Count of bits in *srcp
>* @srcp: the cpumask to count bits (< nr_cpu_ids) in.
>*/
>   static inline unsigned int cpumask_weight(const struct cpumask *srcp)
>   {
>   return bitmap_weight(cpumask_bits(srcp), nr_cpumask_bits);
>   }
> 
> 
> I don't know what the comment on srcp is trying to say. It's not true
> that it only counts nr_cpu_ids worth of bits.
> 
> So it does seem if we're passed a mask with > nr_cpu_ids bits set then
> cpumask_weight() will return > nr_cpu_ids, which is .. unhelpful.
> 
> 
> BUT, I don't see other code handling cpumask_weight() returning >
> nr_cpu_ids - at least I can't find any with some grepping.
> 
> 
> So what is going wrong here that we're being passed a mask with more
> than nr_cpu_ids bits set?
> 
> I think the affinity mask is copied to the desc in desc_smp_init(), and
> the call chain will be:
> 
>   irq_create_mapping()
> -> irq_domain_alloc_descs()
>-> __irq_alloc_descs()
>   -> alloc_descs()
>  -> alloc_desc()
> -> desc_set_defaults()
>-> desc_smp_init()
> 
> irq_create_mapping() is doing:
> 
>   virq = irq_domain_alloc_descs(-1, 1, hwirq, of_node_to_nid(of_node), NULL);
> 
> Where the affinity mask is the NULL at the end.
> 
> So presumably we're hitting the irq_default_affinity case here:
> 
>   static void desc_smp_init(struct irq_desc *desc, int node,
> const struct cpumask *affinity)
>   {
>   if (!affinity)
>   affinity = irq_default_affinity;
>   cpumask_copy(desc->irq_common_data.affinity, affinity);
> 
> 
> Which comes from:
> 
>   static void __init init_irq_default_affinity(void)
>   {
>   #ifdef CONFIG_CPUMASK_OFFSTACK
>   if (!irq_default_affinity)
> zalloc_cpumask_var(&irq_default_affinity, GFP_NOWAIT);
>   #endif
>   if (cpumask_empty(irq_default_affinity))
>   cpumask_setall(irq_default_affinity);
>   }
> 
> And cpumask_setall() will indeed set NR_CPUS bits.
> 
> 
> So that all seems sane, except that it does mean cpumask_weight() can
> return > nr_cpu_ids which is awkward.
> 
> I guess this patch is a good fix, I'll expand the change log a bit.

Yes. Thanks for the digging. I didn't do as much.

Cheers,

C.



Re: [PATCH 03/10] powerpc/xive: rename xive_poke_esb in xive_esb_read

2017-08-09 Thread Cédric Le Goater
On 08/09/2017 05:55 AM, David Gibson wrote:
> On Tue, Aug 08, 2017 at 10:56:13AM +0200, Cédric Le Goater wrote:
>> xive_poke_esb() is performing a load/read so it is better named as
>> xive_esb_read().
> 
> Uh, patch seems to mismatch the comment here, calling it
> xive_peek_esb() instead.

uh yes, oops.

> Does reading the ESB had a side effect on the hardware?  If so "poke"
> might be an appropriate name.

ok. I want to introduce a 'write' method. should I name them
load and store ? 

Thanks,

C. 

>> Also introduce a XIVE_ESB_LOAD_EOI read when EOI'ing
>> LSI interrupts.
>>
>> Signed-off-by: Cédric Le Goater 
>> ---
>>  arch/powerpc/sysdev/xive/common.c | 20 ++--
>>  1 file changed, 10 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/powerpc/sysdev/xive/common.c 
>> b/arch/powerpc/sysdev/xive/common.c
>> index 6595462b1fc8..e6b245bb9602 100644
>> --- a/arch/powerpc/sysdev/xive/common.c
>> +++ b/arch/powerpc/sysdev/xive/common.c
>> @@ -190,7 +190,7 @@ static u32 xive_scan_interrupts(struct xive_cpu *xc, 
>> bool just_peek)
>>   * This is used to perform the magic loads from an ESB
>>   * described in xive.h
>>   */
>> -static u8 xive_poke_esb(struct xive_irq_data *xd, u32 offset)
>> +static u8 xive_peek_esb(struct xive_irq_data *xd, u32 offset)
>>  {
>>  u64 val;
>>  
>> @@ -227,7 +227,7 @@ void xmon_xive_do_dump(int cpu)
>>  xive_dump_eq("IRQ", &xc->queue[xive_irq_priority]);
>>  #ifdef CONFIG_SMP
>>  {
>> -u64 val = xive_poke_esb(&xc->ipi_data, XIVE_ESB_GET);
>> +u64 val = xive_peek_esb(&xc->ipi_data, XIVE_ESB_GET);
>>  xmon_printf("  IPI state: %x:%c%c\n", xc->hw_ipi,
>>  val & XIVE_ESB_VAL_P ? 'P' : 'p',
>>  val & XIVE_ESB_VAL_Q ? 'Q' : 'q');
>> @@ -326,9 +326,9 @@ void xive_do_source_eoi(u32 hw_irq, struct xive_irq_data 
>> *xd)
>>   * properly.
>>   */
>>  if (xd->flags & XIVE_IRQ_FLAG_LSI)
>> -in_be64(xd->eoi_mmio);
>> +xive_peek_esb(xd, XIVE_ESB_LOAD_EOI);
>>  else {
>> -eoi_val = xive_poke_esb(xd, XIVE_ESB_SET_PQ_00);
>> +eoi_val = xive_peek_esb(xd, XIVE_ESB_SET_PQ_00);
>>  DBG_VERBOSE("eoi_val=%x\n", offset, eoi_val);
>>  
>>  /* Re-trigger if needed */
>> @@ -383,12 +383,12 @@ static void xive_do_source_set_mask(struct 
>> xive_irq_data *xd,
>>   * ESB accordingly on unmask.
>>   */
>>  if (mask) {
>> -val = xive_poke_esb(xd, XIVE_ESB_SET_PQ_01);
>> +val = xive_peek_esb(xd, XIVE_ESB_SET_PQ_01);
>>  xd->saved_p = !!(val & XIVE_ESB_VAL_P);
>>  } else if (xd->saved_p)
>> -xive_poke_esb(xd, XIVE_ESB_SET_PQ_10);
>> +xive_peek_esb(xd, XIVE_ESB_SET_PQ_10);
>>  else
>> -xive_poke_esb(xd, XIVE_ESB_SET_PQ_00);
>> +xive_peek_esb(xd, XIVE_ESB_SET_PQ_00);
>>  }
>>  
>>  /*
>> @@ -768,7 +768,7 @@ static int xive_irq_retrigger(struct irq_data *d)
>>   * To perform a retrigger, we first set the PQ bits to
>>   * 11, then perform an EOI.
>>   */
>> -xive_poke_esb(xd, XIVE_ESB_SET_PQ_11);
>> +xive_peek_esb(xd, XIVE_ESB_SET_PQ_11);
>>  
>>  /*
>>   * Note: We pass "0" to the hw_irq argument in order to
>> @@ -803,7 +803,7 @@ static int xive_irq_set_vcpu_affinity(struct irq_data 
>> *d, void *state)
>>  irqd_set_forwarded_to_vcpu(d);
>>  
>>  /* Set it to PQ=10 state to prevent further sends */
>> -pq = xive_poke_esb(xd, XIVE_ESB_SET_PQ_10);
>> +pq = xive_peek_esb(xd, XIVE_ESB_SET_PQ_10);
>>  
>>  /* No target ? nothing to do */
>>  if (xd->target == XIVE_INVALID_TARGET) {
>> @@ -832,7 +832,7 @@ static int xive_irq_set_vcpu_affinity(struct irq_data 
>> *d, void *state)
>>   * for sure the queue slot is no longer in use.
>>   */
>>  if (pq & 2) {
>> -pq = xive_poke_esb(xd, XIVE_ESB_SET_PQ_11);
>> +pq = xive_peek_esb(xd, XIVE_ESB_SET_PQ_11);
>>  xd->saved_p = true;
>>  
>>  /*
> 



Re: [PATCH] powerpc: xive: ensure active irqd when setting affinity

2017-08-09 Thread Benjamin Herrenschmidt
On Wed, 2017-08-09 at 16:15 +1000, Michael Ellerman wrote:
> I'm not sure I'm convinced. We can't handle every possible case of the
> higher level code calling us in situations we don't expect.
> 
> For example irq_data could be NULL, but we trust the higher level code
> not to do that to us.
> 
> Also I don't see any other driver doing this check.
> 
>   $ git grep irqd_is_started
>   include/linux/irq.h:static inline bool irqd_is_started(struct irq_data *d)
>   kernel/irq/chip.c:  if (irqd_is_started(d)) {
  kernel/irq/chip.c:  if (irqd_is_started(&desc->irq_data)) {
>   kernel/irq/cpuhotplug.c:if (irqd_is_per_cpu(d) || 
> !irqd_is_started(d) || !irq_needs_fixup(d)) {

irqd_is_started is brand new so you won't find any :-)

For most cases the problem is a non-issue. Due to how xive works, it's
more of a problem for us because a non-started interrupt has no
targeting information at all.

So this is *somewhat* related to xive internals and I'd rather have
that sanity check in there.

Cheers,
Ben.



Re: [PATCH 10/10] powerpc/xive: fix the size of the cpumask used in xive_find_target_in_mask()

2017-08-09 Thread Benjamin Herrenschmidt
On Wed, 2017-08-09 at 17:06 +1000, Michael Ellerman wrote:
>   /**
>* cpumask_weight - Count of bits in *srcp
>* @srcp: the cpumask to count bits (< nr_cpu_ids) in.
>*/
>   static inline unsigned int cpumask_weight(const struct cpumask *srcp)
>   {
> return bitmap_weight(cpumask_bits(srcp), nr_cpumask_bits);
>   }
> 
> 
> I don't know what the comment on srcp is trying to say. It's not true
> that it only counts nr_cpu_ids worth of bits.

Right, and that's what bit me. We should report that on lkml and maybe
propose a patch that crops the result...

Cheers,
Ben.
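
A sketch of the cropping suggested here, against the generic helper
quoted above (untested, and it may be cheaper to fix the callers that
pass over-wide masks instead):

static inline unsigned int cpumask_weight(const struct cpumask *srcp)
{
	return min_t(unsigned int, nr_cpu_ids,
		     bitmap_weight(cpumask_bits(srcp), nr_cpumask_bits));
}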



Re: [PATCH 08/10] powerpc/xive: take into account '/ibm,plat-res-int-priorities'

2017-08-09 Thread Cédric Le Goater
On 08/09/2017 06:02 AM, David Gibson wrote:
> On Tue, Aug 08, 2017 at 10:56:18AM +0200, Cédric Le Goater wrote:
>> '/ibm,plat-res-int-priorities' contains a list of priorities that the
>> hypervisor has reserved for its own use. Scan these ranges to choose
>> the lowest unused priority for the xive spapr backend.
>>
>> Signed-off-by: Cédric Le Goater 
>> ---
>>  arch/powerpc/sysdev/xive/spapr.c | 62 
>> +++-
>>  1 file changed, 61 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/sysdev/xive/spapr.c 
>> b/arch/powerpc/sysdev/xive/spapr.c
>> index 7fc40047c23d..220331986bd8 100644
>> --- a/arch/powerpc/sysdev/xive/spapr.c
>> +++ b/arch/powerpc/sysdev/xive/spapr.c
>> @@ -532,13 +532,70 @@ static const struct xive_ops xive_spapr_ops = {
>>  .name   = "spapr",
>>  };
>>  
>> +/*
>> + * get max priority from "/ibm,plat-res-int-priorities"
>> + */
>> +static bool xive_get_max_prio(u8 *max_prio)
>> +{
>> +struct device_node *rootdn;
>> +const __be32 *reg;
>> +u32 len;
>> +int prio, found;
>> +
>> +rootdn = of_find_node_by_path("/");
>> +if (!rootdn) {
>> +pr_err("not root node found !\n");
>> +return false;
>> +}
>> +
>> +reg = of_get_property(rootdn, "ibm,plat-res-int-priorities", &len);
>> +if (!reg) {
>> +pr_err("Failed to read 'ibm,plat-res-int-priorities' 
>> property\n");
>> +return false;
>> +}
>> +
>> +if (len % (2 * sizeof(u32)) != 0) {
>> +pr_err("invalid 'ibm,plat-res-int-priorities' property\n");
>> +return false;
>> +}
>> +
>> +/* HW supports priorities in the range [0-7] and 0xFF is a
>> + * wildcard priority used to mask. We scan the ranges reserved
>> + * by the hypervisor to find the lowest priority we can use.
>> + */
>> +found = 0xFF;
>> +for (prio = 0; prio < 8; prio++) {
>> +int reserved = 0;
>> +int i;
>> +
>> +for (i = 0; i < len / (2 * sizeof(u32)); i++) {
>> +int base  = be32_to_cpu(reg[2 * i]);
>> +int range = be32_to_cpu(reg[2 * i + 1]);
>> +
>> +if (prio >= base && prio < base + range)
>> +reserved++;
>> +}
>> +
>> +if (!reserved)
>> +found = prio;
> 
> So you continue the loop here, rather than using break.  Which means
> found will be the highest valued priority that's not reserved.  Is
> that what you intended?  The commit message says you find the lowest
> unused, but do lower numbers mean higher priorities or the other way around?

yes. I should probably add a statement on how the priorities are
ordered: the most privileged is the lowest value.

Thanks,

C. 
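
To make the ordering concrete with a hypothetical property value: if
"ibm,plat-res-int-priorities" contained <6 2>, priorities 6 and 7 would
be reserved, the scan would leave 'found' at 5, and max_prio would end
up as the highest-numbered, i.e. least privileged, priority the guest
can use.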

>> +}
>> +
>> +if (found == 0xFF) {
>> +pr_err("no valid priority found in 
>> 'ibm,plat-res-int-priorities'\n");
>> +return false;
>> +}
>> +
>> +*max_prio = found;
>> +return true;
>> +}
>> +
>>  bool xive_spapr_init(void)
>>  {
>>  struct device_node *np;
>>  struct resource r;
>>  void __iomem *tima;
>>  struct property *prop;
>> -u8 max_prio = 7;
>> +u8 max_prio;
>>  u32 val;
>>  u32 len;
>>  const __be32 *reg;
>> @@ -566,6 +623,9 @@ bool xive_spapr_init(void)
>>  return false;
>>  }
>>  
>> +if (!xive_get_max_prio(&max_prio))
>> +return false;
>> +
>>  /* Feed the IRQ number allocator with the ranges given in the DT */
>>  reg = of_get_property(np, "ibm,xive-lisn-ranges", &len);
>>  if (!reg) {
> 



Re: [PATCH 10/10] powerpc/xive: fix the size of the cpumask used in xive_find_target_in_mask()

2017-08-09 Thread Michael Ellerman
Cédric Le Goater  writes:
> When called from xive_irq_startup(), the size of the cpumask can be
> larger than nr_cpu_ids. Most of the time, its value is NR_CPUS (2048).

Ugh, you're right.

  #define nr_cpumask_bits   ((unsigned int)NR_CPUS)
  ...
  /**
   * cpumask_weight - Count of bits in *srcp
   * @srcp: the cpumask to count bits (< nr_cpu_ids) in.
   */
  static inline unsigned int cpumask_weight(const struct cpumask *srcp)
  {
return bitmap_weight(cpumask_bits(srcp), nr_cpumask_bits);
  }


I don't know what the comment on srcp is trying to say. It's not true
that it only counts nr_cpu_ids worth of bits.

So it does seem if we're passed a mask with > nr_cpu_ids bits set then
cpumask_weight() will return > nr_cpu_ids, which is .. unhelpful.


BUT, I don't see other code handling cpumask_weight() returning >
nr_cpu_ids - at least I can't find any with some grepping.


So what is going wrong here that we're being passed a mask with more
than nr_cpu_ids bits set?

I think the affinity mask is copied to the desc in desc_smp_init(), and
the call chain will be:

  irq_create_mapping()
-> irq_domain_alloc_descs()
   -> __irq_alloc_descs()
  -> alloc_descs()
 -> alloc_desc()
-> desc_set_defaults()
   -> desc_smp_init()

irq_create_mapping() is doing:

  virq = irq_domain_alloc_descs(-1, 1, hwirq, of_node_to_nid(of_node), NULL);

Where the affinity mask is the NULL at the end.

So presumably we're hitting the irq_default_affinity case here:

  static void desc_smp_init(struct irq_desc *desc, int node,
  const struct cpumask *affinity)
  {
if (!affinity)
affinity = irq_default_affinity;
cpumask_copy(desc->irq_common_data.affinity, affinity);


Which comes from:

  static void __init init_irq_default_affinity(void)
  {
  #ifdef CONFIG_CPUMASK_OFFSTACK
if (!irq_default_affinity)
	zalloc_cpumask_var(&irq_default_affinity, GFP_NOWAIT);
  #endif
if (cpumask_empty(irq_default_affinity))
cpumask_setall(irq_default_affinity);
  }

And cpumask_setall() will indeed set NR_CPUS bits.


So that all seems sane, except that it does mean cpumask_weight() can
return > nr_cpu_ids which is awkward.

I guess this patch is a good fix, I'll expand the change log a bit.

cheers


> This can result in such WARNINGs in xive_find_target_in_mask():
>
>[0.094480] WARNING: CPU: 10 PID: 1 at 
> ../arch/powerpc/sysdev/xive/common.c:476 xive_find_target_in_mask+0x110/0x2f0
>[0.094486] Modules linked in:
>[0.094491] CPU: 10 PID: 1 Comm: swapper/0 Not tainted 4.12.0+ #3
>[0.094496] task: c003fae4f200 task.stack: c003fe108000
>[0.094501] NIP: c008a310 LR: c008a2e4 CTR: 
> 0072ca34
>[0.094506] REGS: c003fe10b360 TRAP: 0700   Not tainted  (4.12.0+)
>[0.094510] MSR: 80029033 
>[0.094515]   CR: 88000222  XER: 20040008
>[0.094521] CFAR: c008a2cc SOFTE: 0
>[0.094521] GPR00: c008a274 c003fe10b5e0 c1428f00 
> 0010
>[0.094521] GPR04: 0010 0010 0010 
> 0099
>[0.094521] GPR08: 0010 0001  
> 
>[0.094521] GPR12:  cfff2d00 c000d4d8 
> 
>[0.094521] GPR16:    
> 
>[0.094521] GPR20:    
> c0b451e8
>[0.094521] GPR24:  c1462354 0800 
> 07ff
>[0.094521] GPR28: c1462354 0010 c003f857e418 
> 0010
>[0.094580] NIP [c008a310] xive_find_target_in_mask+0x110/0x2f0
>[0.094585] LR [c008a2e4] xive_find_target_in_mask+0xe4/0x2f0
>[0.094589] Call Trace:
>[0.094593] [c003fe10b5e0] [c008a274] 
> xive_find_target_in_mask+0x74/0x2f0 (unreliable)
>[0.094601] [c003fe10b690] [c008abf0] 
> xive_pick_irq_target.isra.1+0x200/0x230
>[0.094608] [c003fe10b830] [c008b250] 
> xive_irq_startup+0x60/0x180
>[0.094614] [c003fe10b8b0] [c01608f0] irq_startup+0x70/0xd0
>[0.094620] [c003fe10b8f0] [c015df7c] 
> __setup_irq+0x7bc/0x880
>[0.094626] [c003fe10ba90] [c015e30c] 
> request_threaded_irq+0x14c/0x2c0
>[0.094632] [c003fe10baf0] [c00aeb00] 
> request_event_sources_irqs+0x100/0x180
>[0.094639] [c003fe10bc10] [c0e7d2f8] 
> __machine_initcall_pseries_init_ras_IRQ+0x104/0x134
>[0.094646] [c003fe10bc40] [c000cc88] 
> do_one_initcall+0x68/0x1d0
>[0.094652] [c003fe10bd00] [c0e643c8] 
> 

Re: [RFC PATCH v5 5/5] vfio-pci: Allow to expose MSI-X table to userspace when safe

2017-08-09 Thread David Gibson
On Mon, Aug 07, 2017 at 05:25:48PM +1000, Alexey Kardashevskiy wrote:
> Some devices have an MSI-X BAR that is not aligned to the system page
> size when that is greater than 4K (like 64K for ppc64), which at the
> moment prevents such MMIO pages from being mapped to userspace for the
> sake of MSI-X BAR content protection. If such a page happens to share
> the same system page with some frequently accessed registers, the
> entire system page will be emulated, which can seriously affect
> performance.
> 
> This allows mapping of MSI-X tables to userspace if hardware provides
> MSIX isolation via interrupt remapping or filtering; in other words
> allowing direct access to the MSIX BAR won't do any harm to other devices
> or cause spurious interrupts visible to the kernel.
> 
> This adds a wrapping helper to check if a capability is supported by
> an IOMMU group.
> 
> Signed-off-by: Alexey Kardashevskiy 

Reviewed-by: David Gibson 

> ---
>  include/linux/vfio.h |  1 +
>  drivers/vfio/pci/vfio_pci.c  | 20 +---
>  drivers/vfio/pci/vfio_pci_rdwr.c |  5 -
>  drivers/vfio/vfio.c  | 15 +++
>  4 files changed, 37 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 586809abb273..7110bca2fb60 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -46,6 +46,7 @@ struct vfio_device_ops {
>  
>  extern struct iommu_group *vfio_iommu_group_get(struct device *dev);
>  extern void vfio_iommu_group_put(struct iommu_group *group, struct device 
> *dev);
> +extern bool vfio_iommu_group_is_capable(struct device *dev, unsigned long 
> cap);

This diff probably belongs in the earlier patch adding the function,
rather than here where it's first used.  Not worth respinning just for
that, though.

>  extern int vfio_add_group_dev(struct device *dev,
> const struct vfio_device_ops *ops,
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index d87a0a3cda14..c4c39ed64b1e 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -561,11 +561,17 @@ static int msix_sparse_mmap_cap(struct vfio_pci_device 
> *vdev,
>   struct vfio_region_info_cap_sparse_mmap *sparse;
>   size_t end, size;
>   int nr_areas = 2, i = 0, ret;
> + bool is_msix_isolated = vfio_iommu_group_is_capable(&vdev->pdev->dev,
> + IOMMU_GROUP_CAP_ISOLATE_MSIX);
>  
>   end = pci_resource_len(vdev->pdev, vdev->msix_bar);
>  
> - /* If MSI-X table is aligned to the start or end, only one area */
> - if (((vdev->msix_offset & PAGE_MASK) == 0) ||
> + /*
> +  * If MSI-X table is allowed to mmap because of the capability
> +  * of IRQ remapping or aligned to the start or end, only one area
> +  */
> + if (is_msix_isolated ||
> + ((vdev->msix_offset & PAGE_MASK) == 0) ||
>   (PAGE_ALIGN(vdev->msix_offset + vdev->msix_size) >= end))
>   nr_areas = 1;
>  
> @@ -577,6 +583,12 @@ static int msix_sparse_mmap_cap(struct vfio_pci_device 
> *vdev,
>  
>   sparse->nr_areas = nr_areas;
>  
> + if (is_msix_isolated) {
> + sparse->areas[i].offset = 0;
> + sparse->areas[i].size = end;
> + return 0;
> + }
> +
>   if (vdev->msix_offset & PAGE_MASK) {
>   sparse->areas[i].offset = 0;
>   sparse->areas[i].size = vdev->msix_offset & PAGE_MASK;
> @@ -1094,6 +1106,8 @@ static int vfio_pci_mmap(void *device_data, struct 
> vm_area_struct *vma)
>   unsigned int index;
>   u64 phys_len, req_len, pgoff, req_start;
>   int ret;
> + bool is_msix_isolated = vfio_iommu_group_is_capable(&vdev->pdev->dev,
> + IOMMU_GROUP_CAP_ISOLATE_MSIX);
>  
>   index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
>  
> @@ -1115,7 +1129,7 @@ static int vfio_pci_mmap(void *device_data, struct 
> vm_area_struct *vma)
>   if (req_start + req_len > phys_len)
>   return -EINVAL;
>  
> - if (index == vdev->msix_bar) {
> + if (index == vdev->msix_bar && !is_msix_isolated) {
>   /*
>* Disallow mmaps overlapping the MSI-X table; users don't
>* get to touch this directly.  We could find somewhere
> diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c 
> b/drivers/vfio/pci/vfio_pci_rdwr.c
> index 357243d76f10..7514206a5ea7 100644
> --- a/drivers/vfio/pci/vfio_pci_rdwr.c
> +++ b/drivers/vfio/pci/vfio_pci_rdwr.c
> @@ -18,6 +18,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "vfio_pci_private.h"
>  
> @@ -123,6 +124,8 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, 
> char __user *buf,
>   resource_size_t end;
>   void __iomem *io;
>   ssize_t done;
> + bool is_msix_isolated = vfio_iommu_group_is_capable(&vdev->pdev->dev,
> + IOMMU_GROUP_CAP_ISOLATE_MSIX);
>  
>   if 

Re: [RFC PATCH v5 1/5] iommu: Add capabilities to a group

2017-08-09 Thread David Gibson
On Mon, Aug 07, 2017 at 05:25:44PM +1000, Alexey Kardashevskiy wrote:
> This introduces capabilities to IOMMU groups. The first defined
> capability is IOMMU_GROUP_CAP_ISOLATE_MSIX which tells the IOMMU
> group users that a particular IOMMU group is capable of MSIX message
> filtering; this is useful when deciding whether or not to allow mapping
> of MSIX table to the userspace. Various architectures will enable it
> when they decide that it is safe to do so.
> 
> Signed-off-by: Alexey Kardashevskiy 

Reviewed-by: David Gibson 

This seems like a reasonable concept that's probably useful for
something, whether or not it's the best approach for the problem at
hand.

> ---
>  include/linux/iommu.h | 20 
>  drivers/iommu/iommu.c | 28 
>  2 files changed, 48 insertions(+)
> 
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 2cb54adc4a33..6b6f3c2f4904 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -155,6 +155,9 @@ struct iommu_resv_region {
>   enum iommu_resv_typetype;
>  };
>  
> +/* IOMMU group capabilities */
> +#define IOMMU_GROUP_CAP_ISOLATE_MSIX (1U)
> +
>  #ifdef CONFIG_IOMMU_API
>  
>  /**
> @@ -312,6 +315,11 @@ extern void *iommu_group_get_iommudata(struct 
> iommu_group *group);
>  extern void iommu_group_set_iommudata(struct iommu_group *group,
> void *iommu_data,
> void (*release)(void *iommu_data));
> +extern void iommu_group_set_caps(struct iommu_group *group,
> +  unsigned long clearcaps,
> +  unsigned long setcaps);
> +extern bool iommu_group_is_capable(struct iommu_group *group,
> +unsigned long cap);
>  extern int iommu_group_set_name(struct iommu_group *group, const char *name);
>  extern int iommu_group_add_device(struct iommu_group *group,
> struct device *dev);
> @@ -513,6 +521,18 @@ static inline void iommu_group_set_iommudata(struct 
> iommu_group *group,
>  {
>  }
>  
> +static inline void iommu_group_set_caps(struct iommu_group *group,
> + unsigned long clearcaps,
> + unsigned long setcaps)
> +{
> +}
> +
> +static inline bool iommu_group_is_capable(struct iommu_group *group,
> +   unsigned long cap)
> +{
> + return false;
> +}
> +
>  static inline int iommu_group_set_name(struct iommu_group *group,
>  const char *name)
>  {
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 3f6ea160afed..6b2c34fe2c3d 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -52,6 +52,7 @@ struct iommu_group {
>   void (*iommu_data_release)(void *iommu_data);
>   char *name;
>   int id;
> + unsigned long caps;
>   struct iommu_domain *default_domain;
>   struct iommu_domain *domain;
>  };
> @@ -447,6 +448,33 @@ void iommu_group_set_iommudata(struct iommu_group 
> *group, void *iommu_data,
>  EXPORT_SYMBOL_GPL(iommu_group_set_iommudata);
>  
>  /**
> + * iommu_group_set_caps - Change the group capabilities
> + * @group: the group
> + * @clearcaps: capabilities mask to remove
> + * @setcaps: capabilities mask to add
> + *
> + * IOMMU groups can be capable of various features which device drivers
> + * may read and adjust the behavior.
> + */
> +void iommu_group_set_caps(struct iommu_group *group,
> + unsigned long clearcaps, unsigned long setcaps)
> +{
> + group->caps &= ~clearcaps;
> + group->caps |= setcaps;
> +}
> +EXPORT_SYMBOL_GPL(iommu_group_set_caps);
> +
> +/**
> + * iommu_group_is_capable - Returns if a group capability is present
> + * @group: the group
> + */
> +bool iommu_group_is_capable(struct iommu_group *group, unsigned long cap)
> +{
> + return !!(group->caps & cap);
> +}
> +EXPORT_SYMBOL_GPL(iommu_group_is_capable);
> +
> +/**
>   * iommu_group_set_name - set name for a group
>   * @group: the group
>   * @name: name

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
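
For context, a sketch of the intended flow (the call sites are
hypothetical; only the helpers themselves come from this patch):

/* platform IOMMU code, once it knows the hardware filters MSI-X: */
iommu_group_set_caps(group, 0, IOMMU_GROUP_CAP_ISOLATE_MSIX);

/* a consumer such as vfio-pci (patch 5/5) choosing the mmap policy: */
if (iommu_group_is_capable(group, IOMMU_GROUP_CAP_ISOLATE_MSIX))
	allow_direct_msix_mmap = true;	/* safe to map the MSI-X BAR */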




Re: [PATCH 3/3] powerpc/mm: Mark __init memory no-execute when STRICT_KERNEL_RWX=y

2017-08-09 Thread Christophe LEROY



On 09/08/2017 at 04:29, Michael Ellerman wrote:

Christophe LEROY  writes:

On 14/07/2017 at 08:51, Michael Ellerman wrote:

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index c0737c86a362..3d562b210c65 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1192,5 +1192,12 @@ static inline const int pud_pfn(pud_t pud)
BUILD_BUG();
return 0;
   }
+
+#ifdef CONFIG_STRICT_KERNEL_RWX
+void mark_initmem_nx(void);
+#else
+static inline void mark_initmem_nx(void) { }
+#endif
+


Why do we want to limit that to CONFIG_STRICT_KERNEL_RWX ?
Only the kernel text is marked X, even without CONFIG_STRICT_KERNEL_RWX
(at least on PPC32), so I believe we should clear X on init text in any
case, shouldn't we ?


You're right, but ..

On 64-bit when STRICT_KERNEL_RWX=n we make no effort to ensure the
start/end of the init text is on a page boundary.

eg. on 64-bit hash we will typically use a 16M page to map the whole
kernel, text/data/init_text/etc.


Some of the 32-bit platforms also use huge mappings like BATs or large
pages, in which case it is pointless but not harmful to fix the page
tables anyway.
At least it is correct for the ones that use regular pages, and the kernel
can also be started with nobats or noltlbs on the command line, in which
case it is useful to have the page tables correct.




So yes we *should* always mark it no-execute but in practice we can't
because it's not page aligned.


On 32 bit it seems to always be aligned to the normal page size, so no 
problem.




But if that's different on (some?) 32-bit then we could introduce a new
CONFIG symbol that is enabled in the right cases.


For the time being, I have added an "|| CONFIG_PPC32" on the ifdef, is
that OK?

See https://patchwork.ozlabs.org/patch/796625/

Christophe


Re: [PATCH 0/3] Minor updates for PS3

2017-08-09 Thread Michael Ellerman
Jens Axboe  writes:

> On 08/08/2017 04:16 AM, Michael Ellerman wrote:
>> Geoff Levand  writes:
>> 
>>> Hi Michael,
>>>
>>> A few very minor updates for PS3.  Please apply.
>> 
>> Jens do you want to take the block ones, or should I just take the lot?
>
> Up to you, I'm fine either way.

OK I'll grab them.

cheers


Re: [PATCH] powerpc: xive: ensure active irqd when setting affinity

2017-08-09 Thread Michael Ellerman
Sukadev Bhattiprolu  writes:
> Michael Ellerman [m...@ellerman.id.au] wrote:
>> Sukadev Bhattiprolu  writes:
>> > From fd0abf5c61b6041fdb75296e8580b86dc91d08d6 Mon Sep 17 00:00:00 2001
>> > From: Benjamin Herrenschmidt 
>> > Date: Tue, 1 Aug 2017 20:54:41 -0500
>> > Subject: [PATCH] powerpc: xive: ensure active irqd when setting affinity
>> >
>> > Ensure irqd is active before attempting to set affinity. This should
>> > make the set affinity code more robust. For instance, this prevents
>> > these messages seen on a 4.12 based kernel when taking cpus offline:
>> >
>> >[  123.053037264,3] XIVE[ IC 00  ] ISN 2 lead to invalid IVE !
>> >[   77.885859] xive: Error -6 reconfiguring irq 17
>> >[   77.885862] IRQ17: set affinity failed(-6).
>> >
>> > The underlying problem with taking cpus offline was fixed in 4.13-rc1 by:
>> >
>> >commit 91f26cb4cd3c ("genirq/cpuhotplug: Do not migrated shutdown irqs")
>> 
>> So do we still need this? Or is the above only a partial fix?
>
> It would be good to have this fix.
>
> Commit 91f26cb4cd3c fixes the problem, so we won't see the errors with
> that commit applied. But if such a problem were to show up again, xive
> will handle it earlier, before hitting those errors.

I'm not sure I'm convinced. We can't handle every possible case of the
higher level code calling us in situations we don't expect.

For example irq_data could be NULL, but we trust the higher level code
not to do that to us.

Also I don't see any other driver doing this check.

  $ git grep irqd_is_started
  include/linux/irq.h:static inline bool irqd_is_started(struct irq_data *d)
  kernel/irq/chip.c:  if (irqd_is_started(d)) {
  kernel/irq/chip.c:  if (irqd_is_started(&desc->irq_data)) {
  kernel/irq/cpuhotplug.c:if (irqd_is_per_cpu(d) || !irqd_is_started(d) 
|| !irq_needs_fixup(d)) {


cheers