Re: [PATCH] KVM: PPC: Book3S HV: Tunable to configure maximum # of vCPUs per VM

2019-09-10 Thread David Gibson
On Tue, Sep 10, 2019 at 06:49:34PM +0200, Greg Kurz wrote:
> Each vCPU of a VM allocates a XIVE VP in OPAL which is associated with
> 8 event queue (EQ) descriptors, one for each priority. A POWER9 socket
> can handle a maximum of 1M event queues.
> 
> The powernv platform allocates NR_CPUS (== 2048) VPs for the hypervisor,
> and each XIVE KVM device allocates KVM_MAX_VCPUS (== 2048) VPs. This means
> that on a bi-socket system, we can create at most:
> 
> (2 * 1M) / (8 * 2048) - 1 == 127 XIVE or XICS-on-XIVE KVM devices
> 
> ie, start at most 127 VMs benefiting from an in-kernel interrupt controller.
> Subsequent VMs need to rely on much slower userspace emulated XIVE device in
> QEMU.
> 
> This is problematic as one can legitimately expect to start the same
> number of mono-CPU VMs as the number of HW threads available on the
> system (eg, 144 on Witherspoon).
> 
> I'm not aware of any userspace supporting more than 1024 vCPUs. It thus
> seems overkill to consume that many VPs per VM. Ideally we would even
> want userspace to be able to tell KVM about the maximum number of vCPUs
> when creating the VM.
> 
> For now, provide a module parameter to configure the maximum number of
> vCPUs per VM. While here, reduce the default value to 1024 to match the
> current limit in QEMU. This number is only used by the XIVE KVM devices,
> but some more users of KVM_MAX_VCPUS could possibly be converted.
> 
> With this change, I could successfully run 230 mono-CPU VMs on a
> Witherspoon system using the official skiboot-6.3.
> 
> I could even run more VMs by using upstream skiboot containing this
> fix, that allows to better spread interrupts between sockets:
> 
> e97391ae2bb5 ("xive: fix return value of opal_xive_allocate_irq()")
> 
> MAX VCPUS | MAX VMS
> ----------+--------
>      1024 |     255
>       512 |     511
>       256 |    1023 (*)
> 
> (*) the system was barely usable because of the extreme load and
> memory exhaustion but the VMs did start.

Hrm.  I don't love the idea of using a global tunable for this,
although I guess it could have some use.  It's another global system
property that admins have to worry about.

A better approach would seem to be a way for userspace to be able to
hint the maximum number of cpus for a specific VM to the kernel.
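
A purely hypothetical sketch of what such a per-VM hint could look like from
the userspace side, assuming a new VM-wide capability (KVM_CAP_PPC_MAX_VCPUS
is invented here for illustration; only KVM_ENABLE_CAP and struct
kvm_enable_cap are existing interfaces):

#include <linux/kvm.h>
#include <sys/ioctl.h>

/*
 * Hypothetical only: KVM_CAP_PPC_MAX_VCPUS does not exist; it stands in
 * for whatever per-VM interface ends up being agreed on.  The idea is that
 * userspace tells KVM its vCPU ceiling right after KVM_CREATE_VM, so the
 * XIVE device can size its VP block from that instead of KVM_MAX_VCPUS.
 */
static int hint_max_vcpus(int vm_fd, unsigned int smp_max_cpus)
{
	struct kvm_enable_cap cap = {
		.cap  = KVM_CAP_PPC_MAX_VCPUS,	/* invented for illustration */
		.args = { smp_max_cpus },
	};

	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}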

> 
> Signed-off-by: Greg Kurz 
> ---
>  arch/powerpc/include/asm/kvm_host.h   |1 +
>  arch/powerpc/kvm/book3s_hv.c  |   32 
>  arch/powerpc/kvm/book3s_xive.c|2 +-
>  arch/powerpc/kvm/book3s_xive_native.c |2 +-
>  4 files changed, 35 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/kvm_host.h 
> b/arch/powerpc/include/asm/kvm_host.h
> index 6fb5fb4779e0..17582ce38788 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -335,6 +335,7 @@ struct kvm_arch {
>   struct kvm_nested_guest *nested_guests[KVM_MAX_NESTED_GUESTS];
>   /* This array can grow quite large, keep it at the end */
>   struct kvmppc_vcore *vcores[KVM_MAX_VCORES];
> + unsigned int max_vcpus;
>  #endif
>  };
>  
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index f8975c620f41..393d8a1ce9d8 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -125,6 +125,36 @@ static bool nested = true;
>  module_param(nested, bool, S_IRUGO | S_IWUSR);
>  MODULE_PARM_DESC(nested, "Enable nested virtualization (only on POWER9)");
>  
> +#define MIN(x, y) (((x) < (y)) ? (x) : (y))
> +
> +static unsigned int max_vcpus = MIN(KVM_MAX_VCPUS, 1024);
> +
> +static int set_max_vcpus(const char *val, const struct kernel_param *kp)
> +{
> + unsigned int new_max_vcpus;
> + int ret;
> +
> + ret = kstrtouint(val, 0, &new_max_vcpus);
> + if (ret)
> + return ret;
> +
> + if (new_max_vcpus > KVM_MAX_VCPUS)
> + return -EINVAL;
> +
> + max_vcpus = new_max_vcpus;
> +
> + return 0;
> +}
> +
> +static struct kernel_param_ops max_vcpus_ops = {
> + .set = set_max_vcpus,
> + .get = param_get_uint,
> +};
> +
> +module_param_cb(max_vcpus, &max_vcpus_ops, &max_vcpus, S_IRUGO | S_IWUSR);
> +MODULE_PARM_DESC(max_vcpus, "Maximum number of vCPUS per VM (max = "
> +  __stringify(KVM_MAX_VCPUS) ")");
> +
>  static inline bool nesting_enabled(struct kvm *kvm)
>  {
>   return kvm->arch.nested_enable && kvm_is_radix(kvm);
> @@ -4918,6 +4948,8 @@ static int kvmppc_core_init_vm_hv(struct kvm *kvm)
>   if (radix_enabled())
>   kvmhv_radix_debugfs_init(kvm);
>  
> + kvm->arch.max_vcpus = max_vcpus;
> +
>   return 0;
>  }
>  
> diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
> index 2ef43d037a4f..0fea31b64564 100644
> --- a/arch/powerpc/kvm/book3s_xive.c
> +++ b/arch/powerpc/kvm/book3s_xive.c
> @@ -2026,7 +2026,7 @@ static int kvmppc_xive_create(struct kvm_device *dev, 
> u32 type)
>   xive->q_page_order = xive->q_order 

[PATCH 2/2] ASoC: fsl_mqs: Add MQS component driver

2019-09-10 Thread Shengjiu Wang
MQS (medium quality sound), is used to generate medium quality
audio via a standard digital output pin. It can be used to
connect stereo speakers or headphones simply via power amplifier
stages without an additional DAC chip. It only accepts 2-channel,
LSB-valid 16bit, MSB shift-out first, frame sync asserting with
the first bit of the frame, data shifted with the posedge of
bit clock, 44.1 kHz or 48 kHz signals from SAI1 in left justified
format; and it provides the SNR target as no more than 20dB for
the signals below 10 kHz. The signals above 10 kHz will have
worse THD+N values.

MQS provides only simple audio reproduction. No internal pop,
click or distortion artifact reduction methods are provided.

The MQS receives the audio data from the SAI1 Tx section.

Signed-off-by: Shengjiu Wang 
---
 sound/soc/fsl/Kconfig   |  10 ++
 sound/soc/fsl/Makefile  |   2 +
 sound/soc/fsl/fsl_mqs.c | 336 
 3 files changed, 348 insertions(+)
 create mode 100644 sound/soc/fsl/fsl_mqs.c

diff --git a/sound/soc/fsl/Kconfig b/sound/soc/fsl/Kconfig
index aa99c008a925..65e8cd4be930 100644
--- a/sound/soc/fsl/Kconfig
+++ b/sound/soc/fsl/Kconfig
@@ -25,6 +25,16 @@ config SND_SOC_FSL_SAI
  This option is only useful for out-of-tree drivers since
  in-tree drivers select it automatically.
 
+config SND_SOC_FSL_MQS
+   tristate "Medium Quality Sound (MQS) module support"
+   depends on SND_SOC_FSL_SAI
+   select REGMAP_MMIO
+   help
+ Say Y if you want to add Medium Quality Sound (MQS)
+ support for the Freescale CPUs.
+ This option is only useful for out-of-tree drivers since
+ in-tree drivers select it automatically.
+
 config SND_SOC_FSL_AUDMIX
tristate "Audio Mixer (AUDMIX) module support"
select REGMAP_MMIO
diff --git a/sound/soc/fsl/Makefile b/sound/soc/fsl/Makefile
index c0dd04422fe9..8cde88c72d93 100644
--- a/sound/soc/fsl/Makefile
+++ b/sound/soc/fsl/Makefile
@@ -23,6 +23,7 @@ snd-soc-fsl-esai-objs := fsl_esai.o
 snd-soc-fsl-micfil-objs := fsl_micfil.o
 snd-soc-fsl-utils-objs := fsl_utils.o
 snd-soc-fsl-dma-objs := fsl_dma.o
+snd-soc-fsl-mqs-objs := fsl_mqs.o
 
 obj-$(CONFIG_SND_SOC_FSL_AUDMIX) += snd-soc-fsl-audmix.o
 obj-$(CONFIG_SND_SOC_FSL_ASOC_CARD) += snd-soc-fsl-asoc-card.o
@@ -33,6 +34,7 @@ obj-$(CONFIG_SND_SOC_FSL_SPDIF) += snd-soc-fsl-spdif.o
 obj-$(CONFIG_SND_SOC_FSL_ESAI) += snd-soc-fsl-esai.o
 obj-$(CONFIG_SND_SOC_FSL_MICFIL) += snd-soc-fsl-micfil.o
 obj-$(CONFIG_SND_SOC_FSL_UTILS) += snd-soc-fsl-utils.o
+obj-$(CONFIG_SND_SOC_FSL_MQS) += snd-soc-fsl-mqs.o
 obj-$(CONFIG_SND_SOC_POWERPC_DMA) += snd-soc-fsl-dma.o
 
 # MPC5200 Platform Support
diff --git a/sound/soc/fsl/fsl_mqs.c b/sound/soc/fsl/fsl_mqs.c
new file mode 100644
index ..d164f5da3460
--- /dev/null
+++ b/sound/soc/fsl/fsl_mqs.c
@@ -0,0 +1,336 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * ALSA SoC IMX MQS driver
+ *
+ * Copyright (C) 2014-2019 Freescale Semiconductor, Inc.
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define REG_MQS_CTRL   0x00
+
+#define MQS_EN_MASK            (0x1 << 28)
+#define MQS_EN_SHIFT           (28)
+#define MQS_SW_RST_MASK        (0x1 << 24)
+#define MQS_SW_RST_SHIFT       (24)
+#define MQS_OVERSAMPLE_MASK    (0x1 << 20)
+#define MQS_OVERSAMPLE_SHIFT   (20)
+#define MQS_CLK_DIV_MASK       (0xFF << 0)
+#define MQS_CLK_DIV_SHIFT      (0)
+
+/* codec private data */
+struct fsl_mqs {
+   struct regmap *regmap;
+   struct clk *mclk;
+   struct clk *ipg;
+
+   unsigned int reg_iomuxc_gpr2;
+   unsigned int reg_mqs_ctrl;
+   bool use_gpr;
+};
+
+#define FSL_MQS_RATES  (SNDRV_PCM_RATE_44100 | SNDRV_PCM_RATE_48000)
+#define FSL_MQS_FORMATS        SNDRV_PCM_FMTBIT_S16_LE
+
+static int fsl_mqs_hw_params(struct snd_pcm_substream *substream,
+struct snd_pcm_hw_params *params,
+struct snd_soc_dai *dai)
+{
+   struct snd_soc_component *component = dai->component;
+   struct fsl_mqs *mqs_priv = snd_soc_component_get_drvdata(component);
+   unsigned long mclk_rate;
+   int div, res;
+   int bclk, lrclk;
+
+   mclk_rate = clk_get_rate(mqs_priv->mclk);
+   bclk = snd_soc_params_to_bclk(params);
+   lrclk = params_rate(params);
+
+   /*
+* mclk_rate / (oversample(32,64) * FS * 2 * divider ) = repeat_rate;
+* if repeat_rate is 8, mqs can achieve better quality.
+* oversample rate is fixed to 32 currently.
+*/
+   div = mclk_rate / (32 * lrclk * 2 * 8);
+   res = mclk_rate % (32 * lrclk * 2 * 8);
+
+   if (res == 0 && div > 0 && div <= 256) {
+   if (mqs_priv->use_gpr) {
+   regmap_update_bits(mqs_priv->regmap, IOMUXC_GPR2,
+ 

[PATCH 1/2] ASoC: fsl_mqs: add DT binding documentation

2019-09-10 Thread Shengjiu Wang
Add the DT binding documentation for NXP MQS driver

Signed-off-by: Shengjiu Wang 
---
 .../devicetree/bindings/sound/fsl,mqs.txt | 20 +++
 1 file changed, 20 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/sound/fsl,mqs.txt

diff --git a/Documentation/devicetree/bindings/sound/fsl,mqs.txt 
b/Documentation/devicetree/bindings/sound/fsl,mqs.txt
new file mode 100644
index ..a1dbe181204a
--- /dev/null
+++ b/Documentation/devicetree/bindings/sound/fsl,mqs.txt
@@ -0,0 +1,20 @@
+fsl,mqs audio CODEC
+
+Required properties:
+
+  - compatible : Must contain one of "fsl,imx6sx-mqs", "fsl,codec-mqs",
+   "fsl,imx8qm-mqs", "fsl,imx8qxp-mqs".
+  - clocks : A list of phandles + clock-specifiers, one for each entry in
+clock-names
+  - clock-names : Must contain "mclk"
+  - gpr : The gpr node.
+
+Example:
+
+mqs: mqs {
+   compatible = "fsl,imx6sx-mqs";
+   gpr = <&gpr>;
+   clocks = <&clks IMX6SX_CLK_SAI1>;
+   clock-names = "mclk";
+   status = "disabled";
+};
-- 
2.21.0



[PATCH] powerpc/ptrace: Do not return ENOSYS if invalid syscall

2019-09-10 Thread Thadeu Lima de Souza Cascardo
If a tracer sets the syscall number to an invalid one, allow the return
value set by the tracer to be returned to the tracee.

The test for NR_syscalls is already at entry_64.S, and it's at
do_syscall_trace_enter only to skip audit and trace.

After this, seccomp_bpf selftests complete just fine, as the failing test
was using ptrace to change the syscall to return an error or a fake value,
but was failing as it was always returning -ENOSYS.
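
For context, a hedged illustration of the tracer side being described (loosely
modelled on what the powerpc seccomp_bpf selftest does, not copied from it):
on syscall entry the tracer rewrites r0 (the syscall number) to an invalid
value and r3 to the desired return value; with this change the tracee then
sees that r3 value instead of -ENOSYS.

#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <linux/elf.h>
#include <asm/ptrace.h>

/* Sketch only: error handling omitted, register layout per the powerpc uapi. */
static void fake_syscall_result(pid_t tracee, long fake_ret)
{
	struct pt_regs regs;
	struct iovec iov = { .iov_base = &regs, .iov_len = sizeof(regs) };

	ptrace(PTRACE_GETREGSET, tracee, NT_PRSTATUS, &iov);
	regs.gpr[0] = -1;		/* invalid syscall number */
	regs.gpr[3] = fake_ret;		/* value the tracee should see in r3 */
	ptrace(PTRACE_SETREGSET, tracee, NT_PRSTATUS, &iov);
}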

Signed-off-by: Thadeu Lima de Souza Cascardo 
---
 arch/powerpc/kernel/ptrace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index 8c92febf5f44..87315335f66a 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -3316,7 +3316,7 @@ long do_syscall_trace_enter(struct pt_regs *regs)
 
/* Avoid trace and audit when syscall is invalid. */
if (regs->gpr[0] >= NR_syscalls)
-   goto skip;
+   return regs->gpr[0];
 
if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
trace_sys_enter(regs, regs->gpr[0]);
-- 
2.20.1



Re: missing doorbell interrupt when onlining cpu

2019-09-10 Thread Nathan Lynch
Nathan Lynch  writes:

> Nathan Lynch  writes:
>
>> I'm hoping for some help investigating a behavior I see when doing cpu
>> hotplug under load on P9 and P8 LPARs. Occasionally, while coming online
>> a cpu will seem to get "stuck" in idle, with a pending doorbell
>> interrupt unserviced (cpu 12 here):
>>
>> cpuhp/12-70[012] 46133.602202: cpuhp_enter:  cpu: 0012 target: 
>> 205 step: 174 (0xc0028920s)
>>  load.sh-8201  [014] 46133.602248: sched_waking: comm=cpuhp/12 
>> pid=70 prio=120 target_cpu=012
>>  load.sh-8201  [014] 46133.602251: smp_send_reschedule:  (c0052868) 
>> cpu=12
>>   -0 [012] 46133.602252: do_idle:  (c0162e08)
>>  load.sh-8201  [014] 46133.602252: smp_muxed_ipi_message_pass: 
>> (c00527e8) cpu=12 msg=1
>>  load.sh-8201  [014] 46133.602253: doorbell_core_ipi:(c004d3e8) 
>> cpu=12
>>   -0 [012] 46133.602257: arch_cpu_idle:(c0022d08)
>>   -0 [012] 46133.602259: pseries_lpar_idle:(c00d43c8)
>
> I should be more explicit that given my tracing configuration I would
> expect to see doorbell events etc here e.g.
>
>  -0 [012] 46133.602086: doorbell_entry:   
> pt_regs=0xc00200e7fb50
>  -0 [012] 46133.602087: smp_ipi_demux_relaxed: 
> (c00530f8)
>  -0 [012] 46133.602088: scheduler_ipi:
> (c015e4f8)
>  -0 [012] 46133.602091: sched_wakeup: cpuhp/12:70 
> [120] success=1 CPU:012
>  -0 [012] 46133.602092: sched_wakeup: 
> migration/12:71 [0] success=1 CPU:012
>  -0 [012] 46133.602093: doorbell_exit:
> pt_regs=0xc00200e7fb50
>
> but instead cpu 12 goes to idle.

Another clue is that I've occasionaly provoked this warning:

WARNING: CPU: 7 PID: 9045 at arch/powerpc/kernel/irq.c:282 
arch_local_irq_restore+0xdc/0x150
Modules linked in:
CPU: 7 PID: 9045 Comm: offliner Not tainted 5.3.0-rc2-00190-g9b123d1ea237-dirty 
#45
NIP:  c001d91c LR: c1988210 CTR: 00334ee8
REGS: ce19f390 TRAP: 0700   Not tainted  
(5.3.0-rc2-00190-g9b123d1ea237-dirty)
MSR:  80010282b033   CR: 4424  
XER: 2004
CFAR: c001d884 IRQMASK: 0 
GPR00: c1988210 ce19f620 c32f6200  
GPR04: ce589f10 0006 ce19f664 c395f260 
GPR08: 003b 8000 0009  
GPR12: 0001 c0001eca7780 005c 0100106c7de0 
GPR16:   100c0a48 0001 
GPR20: 100c5748 0001fc71 0078 c3345c78 
GPR24: c003ffd99a00 c3349de0  c003fb086c10 
GPR28:  000f c003fb086c10  
NIP [c001d91c] arch_local_irq_restore+0xdc/0x150
LR [c1988210] _raw_spin_unlock_irqrestore+0xa0/0xd0
Call Trace:
[ce19f6a0] [c1988210] _raw_spin_unlock_irqrestore+0xa0/0xd0
[ce19f6d0] [c01be920] try_to_wake_up+0x330/0xf30
[ce19f7a0] [c01bf5b0] wake_up_q+0x70/0xc0
[ce19f7e0] [c02b5a08] cpu_stop_queue_work+0xc8/0x140
[ce19f850] [c02b5bac] queue_stop_cpus_work+0xdc/0x160
[ce19f8b0] [c02b5c98] __stop_cpus+0x68/0xc0
[ce19f950] [c02b65ec] stop_cpus+0x5c/0x90
[ce19f9a0] [c02b6924] stop_machine_cpuslocked+0x194/0x1f0
[ce19fa10] [c016c768] takedown_cpu+0x98/0x260
[ce19fad0] [c016cea4] cpuhp_invoke_callback+0x114/0xf40
[ce19fb60] [c017194c] _cpu_down+0x19c/0x320
[ce19fbd0] [c016ff58] do_cpu_down+0x68/0xb0
[ce19fc10] [c0d4] cpu_subsys_offline+0x24/0x40
[ce19fc30] [c0cc2860] device_offline+0x100/0x140
[ce19fc70] [c0cc2a00] online_store+0x70/0xf0
[ce19fcb0] [c0cbcee8] dev_attr_store+0x38/0x60
[ce19fcd0] [c059c970] sysfs_kf_write+0x70/0xb0
[ce19fd10] [c059afa8] kernfs_fop_write+0xf8/0x280
[ce19fd60] [c04b436c] __vfs_write+0x3c/0x70
[ce19fd80] [c04b8700] vfs_write+0xd0/0x220
[ce19fdd0] [c04b8abc] ksys_write+0x7c/0x140
[ce19fe20] [c000bbd8] system_call+0x5c/0x68

i.e. in arch_local_irq_restore():
/*
 * We should already be hard disabled here. We had bugs
 * where that wasn't the case so let's dbl check it and
 * warn if we are wrong. Only do that when IRQ tracing
 * is enabled as mfmsr() can be costly.
 */
if (WARN_ON_ONCE(mfmsr() & MSR_EE))
__hard_irq_disable();

Anyway, I've proposed a fix:

https://patchwork.ozlabs.org/patch/1160572/


[PATCH] powerpc/pseries: correctly track irq state in default idle

2019-09-10 Thread Nathan Lynch
prep_irq_for_idle() is intended to be called before entering
H_CEDE (and it is used by the pseries cpuidle driver). However the
default pseries idle routine does not call it, leading to mismanaged
lazy irq state when the cpuidle driver isn't in use. Manifestations of
this include:

* Dropped IPIs in the time immediately after a cpu comes
  online (before it has installed the cpuidle handler), making the
  online operation block indefinitely waiting for the new cpu to
  respond.

* Hitting this WARN_ON in arch_local_irq_restore():
/*
 * We should already be hard disabled here. We had bugs
 * where that wasn't the case so let's dbl check it and
 * warn if we are wrong. Only do that when IRQ tracing
 * is enabled as mfmsr() can be costly.
 */
if (WARN_ON_ONCE(mfmsr() & MSR_EE))
__hard_irq_disable();

Call prep_irq_for_idle() from pseries_lpar_idle() and honor its
result.

Fixes: 363edbe2614a ("powerpc: Default arch idle could cede processor on 
pseries")
Signed-off-by: Nathan Lynch 
---
 arch/powerpc/platforms/pseries/setup.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/setup.c 
b/arch/powerpc/platforms/pseries/setup.c
index b955d54628ff..f8adcd0e4589 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -321,6 +321,9 @@ static void pseries_lpar_idle(void)
 * low power mode by ceding processor to hypervisor
 */
 
+   if (!prep_irq_for_idle())
+   return;
+
/* Indicate to hypervisor that we are idle. */
get_lppaca()->idle = 1;
 
-- 
2.20.1



Re: [PATCH] KVM: PPC: Book3S HV: add smp_mb() in kvmppc_set_host_ipi()

2019-09-10 Thread Michael Roth
Quoting Michael Roth (2019-09-05 18:21:22)
> Quoting Michael Ellerman (2019-09-04 22:04:48)
> > That raises the question of whether this needs to be a full barrier or
> > just a write barrier, and where is the matching barrier on the reading
> > side?
> 
> For this particular case I think the same barrier orders it on the
> read-side via kvmppc_set_host_ipi(42, 0) above, but I'm not sure that
> works as a general solution, unless maybe we make that sort of usage
> (clear-before-processing) part of the protocol of using
> kvmppc_set_host_ipi()... it makes sense given we already need to take
> care to not miss clearing them else we get issues like what was fixed
> in 755563bc79c7, which introduced the clear in doorbell_exception(). So
> then it's a matter of additionally making sure we do it prior to
> processing host_ipi state. I haven't looked too closely at the other
> users of kvmppc_set_host_ipi() yet though.



> As far as using rw barriers, I can't think of any reason we couldn't, but
> I wouldn't say I'm at all confident in declaring that safe atm...

I think we need a full barrier after all. The following seems possible
otherwise:

  CPU
X: smp_mb()
X: ipi_message[RESCHEDULE] = 1
X: kvmppc_set_host_ipi(42, 1)
X: smp_mb()
X: doorbell/msgsnd -> 42
   42: doorbell_exception() (from CPU X)
   42: msgsync
   42: kvmppc_set_host_ipi(42, 0) // STORE DEFERRED DUE TO RE-ORDERING
   42: smp_ipi_demux_relaxed()
  105: smp_mb()
  105: ipi_message[CALL_FUNCTION] = 1
  105: smp_mb()
  105: kvmppc_set_host_ipi(42, 1)
   42: kvmppc_set_host_ipi(42, 0) // RE-ORDERED STORE COMPLETES
   42: // returns to executing guest
  105: doorbell/msgsnd -> 42
   42: local_paca->kvm_hstate.host_ipi == 0 // IPI ignored
  105: // hangs waiting on 42 to process messages/call_single_queue

However that also means the current patch is insufficient, since the
barrier for preventing this scenario needs to come *after* setting
paca_ptrs[cpu]->kvm_hstate.host_ipi to 0.

So I think the right interface is for this is to split
kvmppc_set_host_ipi out into:

static inline void kvmppc_set_host_ipi(int cpu)
{
   smp_mb();
   paca_ptrs[cpu]->kvm_hstate.host_ipi = 1;
}

static inline void kvmppc_clear_host_ipi(int cpu)
{
   paca_ptrs[cpu]->kvm_hstate.host_ipi = 0;
   smp_mb();
}
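
To make the intended pairing concrete, a rough sketch of how the two sides
would use the split helpers in the scenario above (sender_sketch/receiver_sketch
are illustrative names; the message bookkeeping and the msgsnd/msgsync doorbell
primitives are reduced to comments):

static void sender_sketch(int cpu)	/* e.g. CPU 105 */
{
	/* ipi_message[CALL_FUNCTION] = 1; */
	kvmppc_set_host_ipi(cpu);	/* smp_mb() happens *before* host_ipi = 1 */
	/* msgsnd/doorbell to @cpu */
}

static void receiver_sketch(int cpu)	/* e.g. CPU 42, in doorbell_exception() */
{
	/* msgsync to order the incoming doorbell */
	kvmppc_clear_host_ipi(cpu);	/* host_ipi = 0, *then* smp_mb() */
	smp_ipi_demux_relaxed();	/* only now is it safe to consume ipi_message */
}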


Re: [PATCH] powerpc: Avoid clang warnings around setjmp and longjmp

2019-09-10 Thread Nathan Chancellor
On Wed, Sep 11, 2019 at 04:30:38AM +1000, Michael Ellerman wrote:
> Nathan Chancellor  writes:
> > On Wed, Sep 04, 2019 at 08:01:35AM -0500, Segher Boessenkool wrote:
> >> On Wed, Sep 04, 2019 at 08:16:45AM +, David Laight wrote:
> >> > From: Nathan Chancellor [mailto:natechancel...@gmail.com]
> >> > > Fair enough so I guess we are back to just outright disabling the
> >> > > warning.
> >> > 
> >> > Just disabling the warning won't stop the compiler generating code
> >> > that breaks a 'user' implementation of setjmp().
> >> 
> >> Yeah.  I have a patch (will send in an hour or so) that enables the
> >> "returns_twice" attribute for setjmp (in ).  In testing
> >> (with GCC trunk) it showed no difference in code generation, but
> >> better save than sorry.
> >> 
> >> It also sets "noreturn" on longjmp, and that *does* help, it saves a
> >> hundred insns or so (all in xmon, no surprise there).
> >> 
> >> I don't think this will make LLVM shut up about this though.  And
> >> technically it is right: the C standard does say that in hosted mode
> >> setjmp is a reserved name and you need to include <setjmp.h> to access
> >> it (not <asm/setjmp.h>).
> >
> > It does not fix the warning, I tested your patch.
> >
> >> So why is the kernel compiled as hosted?  Does adding -ffreestanding
> >> hurt anything?  Is that actually supported on LLVM, on all relevant
> >> versions of it?  Does it shut up the warning there (if not, that would
> >> be an LLVM bug)?
> >
> > It does fix this warning because -ffreestanding implies -fno-builtin,
> > which also solves the warning. LLVM has supported -ffreestanding since
> > at least 3.0.0. There are some parts of the kernel that are compiled
> > with this and it probably should be used in more places but it sounds
> > like there might be some good codegen improvements that are disabled
> > with it:
> >
> > https://lore.kernel.org/lkml/CAHk-=wi-epJZfBHDbKKDZ64us7WkF=lpufhvybmzsteo8q0...@mail.gmail.com/
> 
> For xmon.c and crash.c I think using -ffreestanding would be fine.
> They're both crash/debug code, so we don't care about minor optimisation
> differences. If anything we don't want the compiler being too clever
> when generating that code.
> 
> cheers

I will send a v2 later today along with another patch to fix this
warning and another build error.

Cheers,
Nathan
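
For reference, what the approach discussed above amounts to is roughly the
following declarations (a sketch of the attributes only; the exact prototypes
in arch/powerpc/include/asm/setjmp.h may differ):

/* Sketch: mark setjmp as returning twice and longjmp as never returning. */
#define JMP_BUF_LEN	23

extern long setjmp(long *jmp_buf) __attribute__((returns_twice));
extern void longjmp(long *jmp_buf, long val) __attribute__((noreturn));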


Re: [PATCH] powerpc: Avoid clang warnings around setjmp and longjmp

2019-09-10 Thread Michael Ellerman
Nathan Chancellor  writes:
> On Wed, Sep 04, 2019 at 08:01:35AM -0500, Segher Boessenkool wrote:
>> On Wed, Sep 04, 2019 at 08:16:45AM +, David Laight wrote:
>> > From: Nathan Chancellor [mailto:natechancel...@gmail.com]
>> > > Fair enough so I guess we are back to just outright disabling the
>> > > warning.
>> > 
>> > Just disabling the warning won't stop the compiler generating code
>> > that breaks a 'user' implementation of setjmp().
>> 
>> Yeah.  I have a patch (will send in an hour or so) that enables the
>> "returns_twice" attribute for setjmp (in ).  In testing
>> (with GCC trunk) it showed no difference in code generation, but
>> better save than sorry.
>> 
>> It also sets "noreturn" on longjmp, and that *does* help, it saves a
>> hundred insns or so (all in xmon, no surprise there).
>> 
>> I don't think this will make LLVM shut up about this though.  And
>> technically it is right: the C standard does say that in hosted mode
>> setjmp is a reserved name and you need to include <setjmp.h> to access
>> it (not <asm/setjmp.h>).
>
> It does not fix the warning, I tested your patch.
>
>> So why is the kernel compiled as hosted?  Does adding -ffreestanding
>> hurt anything?  Is that actually supported on LLVM, on all relevant
>> versions of it?  Does it shut up the warning there (if not, that would
>> be an LLVM bug)?
>
> It does fix this warning because -ffreestanding implies -fno-builtin,
> which also solves the warning. LLVM has supported -ffreestanding since
> at least 3.0.0. There are some parts of the kernel that are compiled
> with this and it probably should be used in more places but it sounds
> like there might be some good codegen improvements that are disabled
> with it:
>
> https://lore.kernel.org/lkml/CAHk-=wi-epJZfBHDbKKDZ64us7WkF=lpufhvybmzsteo8q0...@mail.gmail.com/

For xmon.c and crash.c I think using -ffreestanding would be fine.
They're both crash/debug code, so we don't care about minor optimisation
differences. If anything we don't want the compiler being too clever
when generating that code.

cheers


Re: [PATCH] powerpc/xive: Fix bogus error code returned by OPAL

2019-09-10 Thread Cédric Le Goater
On 10/09/2019 15:53, Greg Kurz wrote:
> There's a bug in skiboot that causes the OPAL_XIVE_ALLOCATE_IRQ call
> to return the 32-bit value 0xffffffff when OPAL has run out of IRQs.
> Unfortunately, OPAL return values are signed 64-bit entities and
> errors are supposed to be negative. If that happens, the Linux code
> confusingly treats 0xffffffff as a valid IRQ number and panics at some
> point.
> 
> A fix was recently merged in skiboot:
> 
> e97391ae2bb5 ("xive: fix return value of opal_xive_allocate_irq()")
> 
> but we need a workaround anyway to support older skiboots already
> in the field.
> 
> Internally convert 0xffffffff to OPAL_RESOURCE which is the usual error
> returned upon resource exhaustion.
> 
> Signed-off-by: Greg Kurz 



Reviewed-by: Cédric Le Goater 

Thanks,

C.

> ---
>  arch/powerpc/sysdev/xive/native.c |   13 +++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/sysdev/xive/native.c 
> b/arch/powerpc/sysdev/xive/native.c
> index 37987c815913..c35583f84f9f 100644
> --- a/arch/powerpc/sysdev/xive/native.c
> +++ b/arch/powerpc/sysdev/xive/native.c
> @@ -231,6 +231,15 @@ static bool xive_native_match(struct device_node *node)
>   return of_device_is_compatible(node, "ibm,opal-xive-vc");
>  }
>  
> +static int64_t opal_xive_allocate_irq_fixup(uint32_t chip_id)
> +{
> + s64 irq = opal_xive_allocate_irq(chip_id);
> +
> +#define XIVE_ALLOC_NO_SPACE  0xffffffff /* No possible space */
> + return
> + irq == XIVE_ALLOC_NO_SPACE ? OPAL_RESOURCE : irq;
> +}
> +
>  #ifdef CONFIG_SMP
>  static int xive_native_get_ipi(unsigned int cpu, struct xive_cpu *xc)
>  {
> @@ -238,7 +247,7 @@ static int xive_native_get_ipi(unsigned int cpu, struct 
> xive_cpu *xc)
>  
>   /* Allocate an IPI and populate info about it */
>   for (;;) {
> - irq = opal_xive_allocate_irq(xc->chip_id);
> + irq = opal_xive_allocate_irq_fixup(xc->chip_id);
>   if (irq == OPAL_BUSY) {
>   msleep(OPAL_BUSY_DELAY_MS);
>   continue;
> @@ -259,7 +268,7 @@ u32 xive_native_alloc_irq(void)
>   s64 rc;
>  
>   for (;;) {
> - rc = opal_xive_allocate_irq(OPAL_XIVE_ANY_CHIP);
> + rc = opal_xive_allocate_irq_fixup(OPAL_XIVE_ANY_CHIP);
>   if (rc != OPAL_BUSY)
>   break;
>   msleep(OPAL_BUSY_DELAY_MS);
> 



[PATCH] KVM: PPC: Book3S HV: Tunable to configure maximum # of vCPUs per VM

2019-09-10 Thread Greg Kurz
Each vCPU of a VM allocates a XIVE VP in OPAL which is associated with
8 event queue (EQ) descriptors, one for each priority. A POWER9 socket
can handle a maximum of 1M event queues.

The powernv platform allocates NR_CPUS (== 2048) VPs for the hypervisor,
and each XIVE KVM device allocates KVM_MAX_VCPUS (== 2048) VPs. This means
that on a bi-socket system, we can create at most:

(2 * 1M) / (8 * 2048) - 1 == 127 XIVE or XICS-on-XIVE KVM devices

ie, start at most 127 VMs benefiting from an in-kernel interrupt controller.
Subsequent VMs need to rely on much slower userspace emulated XIVE device in
QEMU.

This is problematic as one can legitimately expect to start the same
number of mono-CPU VMs as the number of HW threads available on the
system (eg, 144 on Witherspoon).

I'm not aware of any userspace supporting more than 1024 vCPUs. It thus
seems overkill to consume that many VPs per VM. Ideally we would even
want userspace to be able to tell KVM about the maximum number of vCPUs
when creating the VM.

For now, provide a module parameter to configure the maximum number of
vCPUs per VM. While here, reduce the default value to 1024 to match the
current limit in QEMU. This number is only used by the XIVE KVM devices,
but some more users of KVM_MAX_VCPUS could possibly be converted.

With this change, I could successfully run 230 mono-CPU VMs on a
Witherspoon system using the official skiboot-6.3.

I could even run more VMs by using upstream skiboot containing this
fix, that allows to better spread interrupts between sockets:

e97391ae2bb5 ("xive: fix return value of opal_xive_allocate_irq()")

MAX VCPUS | MAX VMS
----------+--------
     1024 |     255
      512 |     511
      256 |    1023 (*)

(*) the system was barely usable because of the extreme load and
memory exhaustion but the VMs did start.
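
For reference, the table follows directly from the EQ budget described above;
a minimal sketch of the arithmetic, reusing the same "(2 * 1M) / (8 * N) - 1"
formula (the -1 accounting for the hypervisor's own VP allocation):

#include <stdio.h>

/* Sketch of the per-VM-limit vs. number-of-VMs estimate from above. */
int main(void)
{
	const unsigned long eqs_per_socket = 1UL << 20;	/* 1M EQs per POWER9 socket */
	const unsigned long sockets = 2;		/* bi-socket system */
	const unsigned long eqs_per_vp = 8;		/* one EQ per priority */
	unsigned long max_vcpus;

	for (max_vcpus = 2048; max_vcpus >= 256; max_vcpus /= 2) {
		unsigned long max_vms =
			(sockets * eqs_per_socket) / (eqs_per_vp * max_vcpus) - 1;

		printf("max_vcpus = %4lu -> about %4lu in-kernel XIVE VMs\n",
		       max_vcpus, max_vms);
	}
	return 0;
}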

Signed-off-by: Greg Kurz 
---
 arch/powerpc/include/asm/kvm_host.h   |1 +
 arch/powerpc/kvm/book3s_hv.c  |   32 
 arch/powerpc/kvm/book3s_xive.c|2 +-
 arch/powerpc/kvm/book3s_xive_native.c |2 +-
 4 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 6fb5fb4779e0..17582ce38788 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -335,6 +335,7 @@ struct kvm_arch {
struct kvm_nested_guest *nested_guests[KVM_MAX_NESTED_GUESTS];
/* This array can grow quite large, keep it at the end */
struct kvmppc_vcore *vcores[KVM_MAX_VCORES];
+   unsigned int max_vcpus;
 #endif
 };
 
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index f8975c620f41..393d8a1ce9d8 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -125,6 +125,36 @@ static bool nested = true;
 module_param(nested, bool, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(nested, "Enable nested virtualization (only on POWER9)");
 
+#define MIN(x, y) (((x) < (y)) ? (x) : (y))
+
+static unsigned int max_vcpus = MIN(KVM_MAX_VCPUS, 1024);
+
+static int set_max_vcpus(const char *val, const struct kernel_param *kp)
+{
+   unsigned int new_max_vcpus;
+   int ret;
+
+   ret = kstrtouint(val, 0, &new_max_vcpus);
+   if (ret)
+   return ret;
+
+   if (new_max_vcpus > KVM_MAX_VCPUS)
+   return -EINVAL;
+
+   max_vcpus = new_max_vcpus;
+
+   return 0;
+}
+
+static struct kernel_param_ops max_vcpus_ops = {
+   .set = set_max_vcpus,
+   .get = param_get_uint,
+};
+
+module_param_cb(max_vcpus, &max_vcpus_ops, &max_vcpus, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(max_vcpus, "Maximum number of vCPUS per VM (max = "
+__stringify(KVM_MAX_VCPUS) ")");
+
 static inline bool nesting_enabled(struct kvm *kvm)
 {
return kvm->arch.nested_enable && kvm_is_radix(kvm);
@@ -4918,6 +4948,8 @@ static int kvmppc_core_init_vm_hv(struct kvm *kvm)
if (radix_enabled())
kvmhv_radix_debugfs_init(kvm);
 
+   kvm->arch.max_vcpus = max_vcpus;
+
return 0;
 }
 
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index 2ef43d037a4f..0fea31b64564 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -2026,7 +2026,7 @@ static int kvmppc_xive_create(struct kvm_device *dev, u32 
type)
xive->q_page_order = xive->q_order - PAGE_SHIFT;
 
/* Allocate a bunch of VPs */
-   xive->vp_base = xive_native_alloc_vp_block(KVM_MAX_VCPUS);
+   xive->vp_base = xive_native_alloc_vp_block(kvm->arch.max_vcpus);
pr_devel("VP_Base=%x\n", xive->vp_base);
 
if (xive->vp_base == XIVE_INVALID_VP)
diff --git a/arch/powerpc/kvm/book3s_xive_native.c 
b/arch/powerpc/kvm/book3s_xive_native.c
index 84a354b90f60..20314010da56 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -1095,7 +1095,7 @@ static int 

[PATCH v1] powerpc/pseries: CMM: Drop page array

2019-09-10 Thread David Hildenbrand
We can simply store the pages in a list (page->lru), no need for a
separate data structure (+ complicated handling). This is how most
other balloon drivers store allocated pages without additional tracking
data.

For the notifiers, use page_to_pfn() to check if a page is in the
applicable range. plpar_page_set_loaned()/plpar_page_set_active() were
called with __pa(page_address()) for now, I assume we can simply switch
to page_to_phys() here. The pfn_to_kaddr() handling is now mostly gone.
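
The core of the change is just standard list handling on page->lru; roughly
the following pattern (a simplified sketch with illustrative helper names,
not the literal cmm_alloc_pages()/cmm_free_pages() hunks below):

static LIST_HEAD(cmm_page_list);
static DEFINE_SPINLOCK(cmm_lock);

/* Track a freshly allocated balloon page. */
static void cmm_track_page(struct page *page)
{
	spin_lock(&cmm_lock);
	list_add(&page->lru, &cmm_page_list);	/* page->lru is free for balloon use */
	spin_unlock(&cmm_lock);
}

/* Take one page back off the list, or return NULL if none are loaned. */
static struct page *cmm_untrack_page(void)
{
	struct page *page = NULL;

	spin_lock(&cmm_lock);
	if (!list_empty(&cmm_page_list)) {
		page = list_first_entry(&cmm_page_list, struct page, lru);
		list_del(&page->lru);
	}
	spin_unlock(&cmm_lock);
	return page;
}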

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Arun KS 
Cc: Pavel Tatashin 
Cc: Thomas Gleixner 
Cc: Andrew Morton 
Cc: Vlastimil Babka 
Signed-off-by: David Hildenbrand 
---

Only compile-tested. I hope the page_to_phys() thingy is correct and I
didn't mess up something else / ignoring something important why the array
is needed.

I stumbled over this while looking at how the memory isolation notifier is
used - and wondered why the additional array is necessary. Also, I think
by switching to the generic balloon compaction mechanism, we could get
rid of the memory hotplug notifier and the memory isolation notifier in
this code, as the migration capability of the inflated pages is the real
requirement:
commit 14b8a76b9d53346f2871bf419da2aaf219940c50
Author: Robert Jennings 
Date:   Thu Dec 17 14:44:52 2009 +

powerpc: Make the CMM memory hotplug aware

The Collaborative Memory Manager (CMM) module allocates individual 
pages
over time that are not migratable.  On a long running system this 
can
severely impact the ability to find enough pages to support a 
hotplug
memory remove operation.
[...]

Thoughts?

---
 arch/powerpc/platforms/pseries/cmm.c | 155 ++-
 1 file changed, 31 insertions(+), 124 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/cmm.c 
b/arch/powerpc/platforms/pseries/cmm.c
index b33251d75927..9cab34a667bf 100644
--- a/arch/powerpc/platforms/pseries/cmm.c
+++ b/arch/powerpc/platforms/pseries/cmm.c
@@ -75,21 +75,13 @@ module_param_named(debug, cmm_debug, uint, 0644);
 MODULE_PARM_DESC(debug, "Enable module debugging logging. Set to 1 to enable. "
 "[Default=" __stringify(CMM_DEBUG) "]");
 
-#define CMM_NR_PAGES ((PAGE_SIZE - sizeof(void *) - sizeof(unsigned long)) / 
sizeof(unsigned long))
-
 #define cmm_dbg(...) if (cmm_debug) { printk(KERN_INFO "cmm: "__VA_ARGS__); }
 
-struct cmm_page_array {
-   struct cmm_page_array *next;
-   unsigned long index;
-   unsigned long page[CMM_NR_PAGES];
-};
-
 static unsigned long loaned_pages;
 static unsigned long loaned_pages_target;
 static unsigned long oom_freed_pages;
 
-static struct cmm_page_array *cmm_page_list;
+static LIST_HEAD(cmm_page_list);
 static DEFINE_SPINLOCK(cmm_lock);
 
 static DEFINE_MUTEX(hotplug_mutex);
@@ -138,8 +130,7 @@ static long plpar_page_set_active(unsigned long vpa)
  **/
 static long cmm_alloc_pages(long nr)
 {
-   struct cmm_page_array *pa, *npa;
-   unsigned long addr;
+   struct page *page;
long rc;
 
cmm_dbg("Begin request for %ld pages\n", nr);
@@ -156,43 +147,20 @@ static long cmm_alloc_pages(long nr)
break;
}
 
-   addr = __get_free_page(GFP_NOIO | __GFP_NOWARN |
-  __GFP_NORETRY | __GFP_NOMEMALLOC);
-   if (!addr)
+   page = alloc_page(GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY |
+ __GFP_NOMEMALLOC);
+   if (!page)
break;
spin_lock(&cmm_lock);
-   pa = cmm_page_list;
-   if (!pa || pa->index >= CMM_NR_PAGES) {
-   /* Need a new page for the page list. */
-   spin_unlock(&cmm_lock);
-   npa = (struct cmm_page_array *)__get_free_page(
-   GFP_NOIO | __GFP_NOWARN |
-   __GFP_NORETRY | __GFP_NOMEMALLOC);
-   if (!npa) {
-   pr_info("%s: Can not allocate new page list\n", 
__func__);
-   free_page(addr);
-   break;
-   }
-   spin_lock(&cmm_lock);
-   pa = cmm_page_list;
-
-   if (!pa || pa->index >= CMM_NR_PAGES) {
-   npa->next = pa;
-   npa->index = 0;
-   pa = npa;
-   cmm_page_list = pa;
-   } else
-   free_page((unsigned long) npa);
-   }
-
-   if ((rc = plpar_page_set_loaned(__pa(addr)))) {
+   rc = plpar_page_set_loaned(page_to_phys(page));
+   if (rc) {

[PATCH v3 2/2] powerpc/kexec: move kexec files into a dedicated subdir.

2019-09-10 Thread Christophe Leroy
arch/powerpc/kernel/ contains 8 files dedicated to kexec.

Move them into a dedicated subdirectory.

Signed-off-by: Christophe Leroy 

---
v2: also moved crash.c, as it's part of the kexec suite.
v3: renamed files to remove 'kexec' keyword from names.
---
 arch/powerpc/kernel/Makefile   | 19 +---
 arch/powerpc/kernel/kexec/Makefile | 25 ++
 arch/powerpc/kernel/{ => kexec}/crash.c|  0
 .../kernel/{kexec_elf_64.c => kexec/elf_64.c}  |  0
 arch/powerpc/kernel/{ima_kexec.c => kexec/ima.c}   |  0
 .../kernel/{machine_kexec.c => kexec/machine.c}|  0
 .../{machine_kexec_32.c => kexec/machine_32.c} |  0
 .../{machine_kexec_64.c => kexec/machine_64.c} |  0
 .../machine_file_64.c} |  0
 .../{kexec_relocate_32.S => kexec/relocate_32.S}   |  2 +-
 10 files changed, 27 insertions(+), 19 deletions(-)
 create mode 100644 arch/powerpc/kernel/kexec/Makefile
 rename arch/powerpc/kernel/{ => kexec}/crash.c (100%)
 rename arch/powerpc/kernel/{kexec_elf_64.c => kexec/elf_64.c} (100%)
 rename arch/powerpc/kernel/{ima_kexec.c => kexec/ima.c} (100%)
 rename arch/powerpc/kernel/{machine_kexec.c => kexec/machine.c} (100%)
 rename arch/powerpc/kernel/{machine_kexec_32.c => kexec/machine_32.c} (100%)
 rename arch/powerpc/kernel/{machine_kexec_64.c => kexec/machine_64.c} (100%)
 rename arch/powerpc/kernel/{machine_kexec_file_64.c => 
kexec/machine_file_64.c} (100%)
 rename arch/powerpc/kernel/{kexec_relocate_32.S => kexec/relocate_32.S} (99%)

diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index f6c80f31502a..42e150e6e663 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -5,9 +5,6 @@
 
 CFLAGS_ptrace.o+= -DUTS_MACHINE='"$(UTS_MACHINE)"'
 
-# Disable clang warning for using setjmp without setjmp.h header
-CFLAGS_crash.o += $(call cc-disable-warning, builtin-requires-header)
-
 ifdef CONFIG_PPC64
 CFLAGS_prom_init.o += $(NO_MINIMAL_TOC)
 endif
@@ -81,7 +78,6 @@ obj-$(CONFIG_CRASH_DUMP)  += crash_dump.o
 obj-$(CONFIG_FA_DUMP)  += fadump.o
 ifdef CONFIG_PPC32
 obj-$(CONFIG_E500) += idle_e500.o
-obj-$(CONFIG_KEXEC_CORE)   += kexec_relocate_32.o
 endif
 obj-$(CONFIG_PPC_BOOK3S_32)+= idle_6xx.o l2cr_6xx.o cpu_setup_6xx.o
 obj-$(CONFIG_TAU)  += tau_6xx.o
@@ -125,14 +121,7 @@ pci64-$(CONFIG_PPC64)  += pci_dn.o 
pci-hotplug.o isa-bridge.o
 obj-$(CONFIG_PCI)  += pci_$(BITS).o $(pci64-y) \
   pci-common.o pci_of_scan.o
 obj-$(CONFIG_PCI_MSI)  += msi.o
-obj-$(CONFIG_KEXEC_CORE)   += machine_kexec.o crash.o \
-  machine_kexec_$(BITS).o
-obj-$(CONFIG_KEXEC_FILE)   += machine_kexec_file_$(BITS).o 
kexec_elf_$(BITS).o
-ifdef CONFIG_HAVE_IMA_KEXEC
-ifdef CONFIG_IMA
-obj-y  += ima_kexec.o
-endif
-endif
+obj-$(CONFIG_KEXEC_CORE)   += kexec/
 
 obj-$(CONFIG_AUDIT)+= audit.o
 obj64-$(CONFIG_AUDIT)  += compat_audit.o
@@ -164,12 +153,6 @@ endif
 GCOV_PROFILE_prom_init.o := n
 KCOV_INSTRUMENT_prom_init.o := n
 UBSAN_SANITIZE_prom_init.o := n
-GCOV_PROFILE_machine_kexec_64.o := n
-KCOV_INSTRUMENT_machine_kexec_64.o := n
-UBSAN_SANITIZE_machine_kexec_64.o := n
-GCOV_PROFILE_machine_kexec_32.o := n
-KCOV_INSTRUMENT_machine_kexec_32.o := n
-UBSAN_SANITIZE_machine_kexec_32.o := n
 GCOV_PROFILE_kprobes.o := n
 KCOV_INSTRUMENT_kprobes.o := n
 UBSAN_SANITIZE_kprobes.o := n
diff --git a/arch/powerpc/kernel/kexec/Makefile 
b/arch/powerpc/kernel/kexec/Makefile
new file mode 100644
index ..46e52ee95322
--- /dev/null
+++ b/arch/powerpc/kernel/kexec/Makefile
@@ -0,0 +1,25 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# Makefile for the linux kernel.
+#
+
+# Disable clang warning for using setjmp without setjmp.h header
+CFLAGS_crash.o += $(call cc-disable-warning, builtin-requires-header)
+
+obj-y  += machine.o crash.o machine_$(BITS).o
+
+obj-$(CONFIG_PPC32)+= relocate_32.o
+
+obj-$(CONFIG_KEXEC_FILE)   += machine_file_$(BITS).o elf_$(BITS).o
+
+ifdef CONFIG_HAVE_IMA_KEXEC
+ifdef CONFIG_IMA
+obj-y  += ima.o
+endif
+endif
+
+
+# Disable GCOV, KCOV & sanitizers in odd or sensitive code
+GCOV_PROFILE_machine_$(BITS).o := n
+KCOV_INSTRUMENT_machine_$(BITS).o := n
+UBSAN_SANITIZE_machine_$(BITS).o := n
diff --git a/arch/powerpc/kernel/crash.c b/arch/powerpc/kernel/kexec/crash.c
similarity index 100%
rename from arch/powerpc/kernel/crash.c
rename to arch/powerpc/kernel/kexec/crash.c
diff --git a/arch/powerpc/kernel/kexec_elf_64.c 
b/arch/powerpc/kernel/kexec/elf_64.c
similarity index 100%
rename from arch/powerpc/kernel/kexec_elf_64.c
rename to arch/powerpc/kernel/kexec/elf_64.c
diff --git a/arch/powerpc/kernel/ima_kexec.c b/arch/powerpc/kernel/kexec/ima.c
similarity index 100%
rename from 

[PATCH v3 1/2] powerpc/32: Split kexec low level code out of misc_32.S

2019-09-10 Thread Christophe Leroy
Almost half of misc_32.S is dedicated to kexec.
That's the relocation function for kexec.

Drop it into a dedicated kexec_relocate_32.S

Signed-off-by: Christophe Leroy 

---
v2: no change
v3: renamed kexec_32.S to kexec_relocate_32.S
---
 arch/powerpc/kernel/Makefile|   1 +
 arch/powerpc/kernel/kexec_relocate_32.S | 500 
 arch/powerpc/kernel/misc_32.S   | 491 ---
 3 files changed, 501 insertions(+), 491 deletions(-)
 create mode 100644 arch/powerpc/kernel/kexec_relocate_32.S

diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index c9cc4b689e60..f6c80f31502a 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -81,6 +81,7 @@ obj-$(CONFIG_CRASH_DUMP)  += crash_dump.o
 obj-$(CONFIG_FA_DUMP)  += fadump.o
 ifdef CONFIG_PPC32
 obj-$(CONFIG_E500) += idle_e500.o
+obj-$(CONFIG_KEXEC_CORE)   += kexec_relocate_32.o
 endif
 obj-$(CONFIG_PPC_BOOK3S_32)+= idle_6xx.o l2cr_6xx.o cpu_setup_6xx.o
 obj-$(CONFIG_TAU)  += tau_6xx.o
diff --git a/arch/powerpc/kernel/kexec_relocate_32.S 
b/arch/powerpc/kernel/kexec_relocate_32.S
new file mode 100644
index ..3f8ca6a566fb
--- /dev/null
+++ b/arch/powerpc/kernel/kexec_relocate_32.S
@@ -0,0 +1,500 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * This file contains kexec low-level functions.
+ *
+ * Copyright (C) 2002-2003 Eric Biederman  
+ * GameCube/ppc32 port Copyright (C) 2004 Albert Herranz
+ * PPC44x port. Copyright (C) 2011,  IBM Corporation
+ * Author: Suzuki Poulose 
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+   .text
+
+   /*
+* Must be relocatable PIC code callable as a C function.
+*/
+   .globl relocate_new_kernel
+relocate_new_kernel:
+   /* r3 = page_list   */
+   /* r4 = reboot_code_buffer */
+   /* r5 = start_address  */
+
+#ifdef CONFIG_FSL_BOOKE
+
+   mr  r29, r3
+   mr  r30, r4
+   mr  r31, r5
+
+#define ENTRY_MAPPING_KEXEC_SETUP
+#include "fsl_booke_entry_mapping.S"
+#undef ENTRY_MAPPING_KEXEC_SETUP
+
+   mr  r3, r29
+   mr  r4, r30
+   mr  r5, r31
+
+   li  r0, 0
+#elif defined(CONFIG_44x)
+
+   /* Save our parameters */
+   mr  r29, r3
+   mr  r30, r4
+   mr  r31, r5
+
+#ifdef CONFIG_PPC_47x
+   /* Check for 47x cores */
+   mfspr   r3,SPRN_PVR
+   srwir3,r3,16
+   cmplwi  cr0,r3,PVR_476FPE@h
+   beq setup_map_47x
+   cmplwi  cr0,r3,PVR_476@h
+   beq setup_map_47x
+   cmplwi  cr0,r3,PVR_476_ISS@h
+   beq setup_map_47x
+#endif /* CONFIG_PPC_47x */
+   
+/*
+ * Code for setting up 1:1 mapping for PPC440x for KEXEC
+ *
+ * We cannot switch off the MMU on PPC44x.
+ * So we:
+ * 1) Invalidate all the mappings except the one we are running from.
+ * 2) Create a tmp mapping for our code in the other address space(TS) and
+ *jump to it. Invalidate the entry we started in.
+ * 3) Create a 1:1 mapping for 0-2GiB in chunks of 256M in original TS.
+ * 4) Jump to the 1:1 mapping in original TS.
+ * 5) Invalidate the tmp mapping.
+ *
+ * - Based on the kexec support code for FSL BookE
+ *
+ */
+
+   /* 
+* Load the PID with kernel PID (0).
+* Also load our MSR_IS and TID to MMUCR for TLB search.
+*/
+   li  r3, 0
+   mtspr   SPRN_PID, r3
+   mfmsr   r4
+   andi.   r4,r4,MSR_IS@l
+   beq wmmucr
+   orisr3,r3,PPC44x_MMUCR_STS@h
+wmmucr:
+   mtspr   SPRN_MMUCR,r3
+   sync
+
+   /*
+* Invalidate all the TLB entries except the current entry
+* where we are running from
+*/
+   bl  0f  /* Find our address */
+0: mflrr5  /* Make it accessible */
+   tlbsx   r23,0,r5/* Find entry we are in */
+   li  r4,0/* Start at TLB entry 0 */
+   li  r3,0/* Set PAGEID inval value */
+1: cmpwr23,r4  /* Is this our entry? */
+   beq skip/* If so, skip the inval */
+   tlbwe   r3,r4,PPC44x_TLB_PAGEID /* If not, inval the entry */
+skip:
+   addir4,r4,1 /* Increment */
+   cmpwi   r4,64   /* Are we done? */
+   bne 1b  /* If not, repeat */
+   isync
+
+   /* Create a temp mapping and jump to it */
+   andi.   r6, r23, 1  /* Find the index to use */
+   addir24, r6, 1  /* r24 will contain 1 or 2 */
+
+   mfmsr   r9  /* get the MSR */
+   rlwinm  r5, r9, 27, 31, 31  /* Extract the MSR[IS] */
+   xorir7, r5, 1   /* Use the other address space */
+
+   

Re: [PATCH v5 21/31] powernv/fadump: process architected register state data provided by firmware

2019-09-10 Thread Hari Bathini



On 10/09/19 7:35 PM, Michael Ellerman wrote:
> Hari Bathini  writes:
>> On 09/09/19 9:03 PM, Oliver O'Halloran wrote:
>>> On Mon, Sep 9, 2019 at 11:23 PM Hari Bathini  wrote:
 On 04/09/19 5:50 PM, Michael Ellerman wrote:
> Hari Bathini  writes:
 [...]

>> +/*
>> + * CPU state data is provided by f/w. Below are the definitions
>> + * provided in HDAT spec. Refer to latest HDAT specification for
>> + * any update to this format.
>> + */
>
> How is this meant to work? If HDAT ever changes the format they will
> break all existing kernels in the field.
>
>> +#define HDAT_FADUMP_CPU_DATA_VERSION1

 Changes are not expected here. But this is just to cover for such scenario,
 if that ever happens.
>>>
>>> The HDAT spec doesn't define the SPR numbers for NIA, MSR and the CR.
>>> As far as I can tell the values you've assumed here are chip-specific,
>>> non-architected SPR numbers that come from an array buried somewhere
>>> in the SBE codebase. I don't believe you for a second when you say
>>> that this will never change.
>>
>> At least, the understanding is that this numbers not change across processor
>> generations. If something changes, it is supposed to be handled in SBE. Also,
>> I am told this numbers would be listed in the HDAT Spec. Not sure if that
>> happened yet though. Vasant, you have anything to add?
> 
> That doesn't help much because the HDAT spec is not public.
> 
> The point is with the code written the way it is, these values *must
> not* change, or else all existing kernels will be broken, which is not
> acceptable.

Yeah. It is absurd to error out just by looking at version number...

> 
 Also, I think it is a bit far-fetched to error out if versions mismatch.
 Warning and proceeding sounds worthier because the changes are usually
 backward compatible, if and when there are any. Will update accordingly...
>>>
>>> Literally the only reason I didn't drop the CPU DATA parts of the OPAL
>>> MPIPL series was because I assumed the kernel would do the sensible
>>> thing and reject or ignore the structure if it did not know how to
>>> parse the data.
>>
>> I think, the changes if any, would have to be backward compatible for the 
>> sake
>> of sanity.
> 
> People need to understand that this is an ABI between firmware and
> in-the-field distribution kernels which are only updated at customer
> discretion, or possibly never.
> 
> Any changes *must be* backward compatible.
> 
> Looking at the header struct:
> 
> +struct hdat_fadump_thread_hdr {
> + __be32  pir;
> + /* 0x00 - 0x0F - The corresponding stop state of the core */
> + u8  core_state;
> + u8  reserved[3];
> 
> You have those 3 reserved bytes, so a future revision could repurpose
> one of those as a flag to indicate a new format. And/or the hdr could be
> made bigger and new kernels could be taught to look for new things in
> the space after the hdr but before the reg entries.
> 
> So I think there is a reasonable mechanism for extending the format in
> future, but my point is people must understand that this is an ABI and
> changes must be made accordingly.

True. The folks who make the changes to this format should be aware that
breaking kernel ABI is not going to be pretty and I think they are :)

> 
>> Even if they are not, we are better off exporting the /proc/vmcore
>> with a warning and some crazy CPU register data (if parsing goes alright) 
>> than
>> no dump at all? 
> 
> If it's just a case of reg entries that we don't recognise then yes I
> think it would be OK to just skip them and continue exporting. But if
> there's any deeper misunderstanding of the format then we should bail
> out.

Sure. Will try and fix that by first trying to do a sanity check on the
fields that are needed for parsing the data and proceed with a warning if
nothing weird is detected and fallback to just appending crashing cpu as
done in patch 16/31, if anything weird is observed. That should hopefully
take care of all cases in the best possible way..

> 
> I notice now that you don't do anything in opal_fadump_set_regval_regnum()
> if you are passed a register we don't understand, so that probably needs
> fixing.

f/w provides about 100 odd registers in the CPU state data. Most of them
pt_regs doesn't care about. So, opal_fadump_set_regval_regnum is happy as
long as it finds the registers listed in it. Unless pt_regs changes, we
can stick to this and ignore the rest of them?
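
To illustrate the "ignore what we don't recognise" policy being discussed, a
hypothetical sketch (the helper name and the register-number encoding are
illustrative only, not the actual opal_fadump_set_regval_regnum() from the
series):

static void fadump_store_reg_sketch(struct pt_regs *regs, u32 reg_num, u64 val)
{
	if (reg_num < 32) {		/* GPR0..GPR31 */
		regs->gpr[reg_num] = val;
		return;
	}

	switch (reg_num) {		/* illustrative SPR numbers only */
	case SPRN_LR:
		regs->link = val;
		break;
	case SPRN_XER:
		regs->xer = val;
		break;
	case SPRN_CTR:
		regs->ctr = val;
		break;
	default:
		/* Register not represented in pt_regs: silently skip it. */
		break;
	}
}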

- Hari



Re: [PATCH v2 2/2] powerpc/kexec: move kexec files into a dedicated subdir.

2019-09-10 Thread Segher Boessenkool
On Tue, Sep 10, 2019 at 02:55:27PM +, Christophe Leroy wrote:
> arch/powerpc/kernel/ contains 7 files dedicated to kexec.
> 
> Move them into a dedicated subdirectory.

>  arch/powerpc/kernel/{ => kexec}/ima_kexec.c|  0
>  arch/powerpc/kernel/{ => kexec}/kexec_32.S |  2 +-
>  arch/powerpc/kernel/{ => kexec}/kexec_elf_64.c |  0
>  arch/powerpc/kernel/{ => kexec}/machine_kexec.c|  0
>  arch/powerpc/kernel/{ => kexec}/machine_kexec_32.c |  0
>  arch/powerpc/kernel/{ => kexec}/machine_kexec_64.c |  0
>  .../kernel/{ => kexec}/machine_kexec_file_64.c |  0

The filenames do not really need "kexec" in there anymore then?


Segher


[PATCH v2 2/2] powerpc/kexec: move kexec files into a dedicated subdir.

2019-09-10 Thread Christophe Leroy
arch/powerpc/kernel/ contains 7 files dedicated to kexec.

Move them into a dedicated subdirectory.

Signed-off-by: Christophe Leroy 

---
v2: also moved crash.c, as it's part of the kexec suite.
---
 arch/powerpc/kernel/Makefile   | 19 +---
 arch/powerpc/kernel/kexec/Makefile | 25 ++
 arch/powerpc/kernel/{ => kexec}/crash.c|  0
 arch/powerpc/kernel/{ => kexec}/ima_kexec.c|  0
 arch/powerpc/kernel/{ => kexec}/kexec_32.S |  2 +-
 arch/powerpc/kernel/{ => kexec}/kexec_elf_64.c |  0
 arch/powerpc/kernel/{ => kexec}/machine_kexec.c|  0
 arch/powerpc/kernel/{ => kexec}/machine_kexec_32.c |  0
 arch/powerpc/kernel/{ => kexec}/machine_kexec_64.c |  0
 .../kernel/{ => kexec}/machine_kexec_file_64.c |  0
 10 files changed, 27 insertions(+), 19 deletions(-)
 create mode 100644 arch/powerpc/kernel/kexec/Makefile
 rename arch/powerpc/kernel/{ => kexec}/crash.c (100%)
 rename arch/powerpc/kernel/{ => kexec}/ima_kexec.c (100%)
 rename arch/powerpc/kernel/{ => kexec}/kexec_32.S (99%)
 rename arch/powerpc/kernel/{ => kexec}/kexec_elf_64.c (100%)
 rename arch/powerpc/kernel/{ => kexec}/machine_kexec.c (100%)
 rename arch/powerpc/kernel/{ => kexec}/machine_kexec_32.c (100%)
 rename arch/powerpc/kernel/{ => kexec}/machine_kexec_64.c (100%)
 rename arch/powerpc/kernel/{ => kexec}/machine_kexec_file_64.c (100%)

diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index df708de6f866..42e150e6e663 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -5,9 +5,6 @@
 
 CFLAGS_ptrace.o+= -DUTS_MACHINE='"$(UTS_MACHINE)"'
 
-# Disable clang warning for using setjmp without setjmp.h header
-CFLAGS_crash.o += $(call cc-disable-warning, builtin-requires-header)
-
 ifdef CONFIG_PPC64
 CFLAGS_prom_init.o += $(NO_MINIMAL_TOC)
 endif
@@ -81,7 +78,6 @@ obj-$(CONFIG_CRASH_DUMP)  += crash_dump.o
 obj-$(CONFIG_FA_DUMP)  += fadump.o
 ifdef CONFIG_PPC32
 obj-$(CONFIG_E500) += idle_e500.o
-obj-$(CONFIG_KEXEC_CORE)   += kexec_32.o
 endif
 obj-$(CONFIG_PPC_BOOK3S_32)+= idle_6xx.o l2cr_6xx.o cpu_setup_6xx.o
 obj-$(CONFIG_TAU)  += tau_6xx.o
@@ -125,14 +121,7 @@ pci64-$(CONFIG_PPC64)  += pci_dn.o 
pci-hotplug.o isa-bridge.o
 obj-$(CONFIG_PCI)  += pci_$(BITS).o $(pci64-y) \
   pci-common.o pci_of_scan.o
 obj-$(CONFIG_PCI_MSI)  += msi.o
-obj-$(CONFIG_KEXEC_CORE)   += machine_kexec.o crash.o \
-  machine_kexec_$(BITS).o
-obj-$(CONFIG_KEXEC_FILE)   += machine_kexec_file_$(BITS).o 
kexec_elf_$(BITS).o
-ifdef CONFIG_HAVE_IMA_KEXEC
-ifdef CONFIG_IMA
-obj-y  += ima_kexec.o
-endif
-endif
+obj-$(CONFIG_KEXEC_CORE)   += kexec/
 
 obj-$(CONFIG_AUDIT)+= audit.o
 obj64-$(CONFIG_AUDIT)  += compat_audit.o
@@ -164,12 +153,6 @@ endif
 GCOV_PROFILE_prom_init.o := n
 KCOV_INSTRUMENT_prom_init.o := n
 UBSAN_SANITIZE_prom_init.o := n
-GCOV_PROFILE_machine_kexec_64.o := n
-KCOV_INSTRUMENT_machine_kexec_64.o := n
-UBSAN_SANITIZE_machine_kexec_64.o := n
-GCOV_PROFILE_machine_kexec_32.o := n
-KCOV_INSTRUMENT_machine_kexec_32.o := n
-UBSAN_SANITIZE_machine_kexec_32.o := n
 GCOV_PROFILE_kprobes.o := n
 KCOV_INSTRUMENT_kprobes.o := n
 UBSAN_SANITIZE_kprobes.o := n
diff --git a/arch/powerpc/kernel/kexec/Makefile 
b/arch/powerpc/kernel/kexec/Makefile
new file mode 100644
index ..aa765037f0c0
--- /dev/null
+++ b/arch/powerpc/kernel/kexec/Makefile
@@ -0,0 +1,25 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# Makefile for the linux kernel.
+#
+
+# Disable clang warning for using setjmp without setjmp.h header
+CFLAGS_crash.o += $(call cc-disable-warning, builtin-requires-header)
+
+obj-y  += machine_kexec.o crash.o 
machine_kexec_$(BITS).o
+
+obj-$(CONFIG_PPC32)+= kexec_32.o
+
+obj-$(CONFIG_KEXEC_FILE)   += machine_kexec_file_$(BITS).o 
kexec_elf_$(BITS).o
+
+ifdef CONFIG_HAVE_IMA_KEXEC
+ifdef CONFIG_IMA
+obj-y  += ima_kexec.o
+endif
+endif
+
+
+# Disable GCOV, KCOV & sanitizers in odd or sensitive code
+GCOV_PROFILE_machine_kexec_$(BITS).o := n
+KCOV_INSTRUMENT_machine_kexec_$(BITS).o := n
+UBSAN_SANITIZE_machine_kexec_$(BITS).o := n
diff --git a/arch/powerpc/kernel/crash.c b/arch/powerpc/kernel/kexec/crash.c
similarity index 100%
rename from arch/powerpc/kernel/crash.c
rename to arch/powerpc/kernel/kexec/crash.c
diff --git a/arch/powerpc/kernel/ima_kexec.c 
b/arch/powerpc/kernel/kexec/ima_kexec.c
similarity index 100%
rename from arch/powerpc/kernel/ima_kexec.c
rename to arch/powerpc/kernel/kexec/ima_kexec.c
diff --git a/arch/powerpc/kernel/kexec_32.S 
b/arch/powerpc/kernel/kexec/kexec_32.S
similarity index 99%
rename from arch/powerpc/kernel/kexec_32.S
rename to arch/powerpc/kernel/kexec/kexec_32.S
index 3f8ca6a566fb..b9355e0d5c85 

[PATCH v2 1/2] powerpc/32: Split kexec low level code out of misc_32.S

2019-09-10 Thread Christophe Leroy
Almost half of misc_32.S is dedicated to kexec.

Drop it into a dedicated kexec_32.S

Signed-off-by: Christophe Leroy 

---
v2: no change
---
 arch/powerpc/kernel/Makefile   |   1 +
 arch/powerpc/kernel/kexec_32.S | 500 +
 arch/powerpc/kernel/misc_32.S  | 491 
 3 files changed, 501 insertions(+), 491 deletions(-)
 create mode 100644 arch/powerpc/kernel/kexec_32.S

diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index c9cc4b689e60..df708de6f866 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -81,6 +81,7 @@ obj-$(CONFIG_CRASH_DUMP)  += crash_dump.o
 obj-$(CONFIG_FA_DUMP)  += fadump.o
 ifdef CONFIG_PPC32
 obj-$(CONFIG_E500) += idle_e500.o
+obj-$(CONFIG_KEXEC_CORE)   += kexec_32.o
 endif
 obj-$(CONFIG_PPC_BOOK3S_32)+= idle_6xx.o l2cr_6xx.o cpu_setup_6xx.o
 obj-$(CONFIG_TAU)  += tau_6xx.o
diff --git a/arch/powerpc/kernel/kexec_32.S b/arch/powerpc/kernel/kexec_32.S
new file mode 100644
index ..3f8ca6a566fb
--- /dev/null
+++ b/arch/powerpc/kernel/kexec_32.S
@@ -0,0 +1,500 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * This file contains kexec low-level functions.
+ *
+ * Copyright (C) 2002-2003 Eric Biederman  
+ * GameCube/ppc32 port Copyright (C) 2004 Albert Herranz
+ * PPC44x port. Copyright (C) 2011,  IBM Corporation
+ * Author: Suzuki Poulose 
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+   .text
+
+   /*
+* Must be relocatable PIC code callable as a C function.
+*/
+   .globl relocate_new_kernel
+relocate_new_kernel:
+   /* r3 = page_list   */
+   /* r4 = reboot_code_buffer */
+   /* r5 = start_address  */
+
+#ifdef CONFIG_FSL_BOOKE
+
+   mr  r29, r3
+   mr  r30, r4
+   mr  r31, r5
+
+#define ENTRY_MAPPING_KEXEC_SETUP
+#include "fsl_booke_entry_mapping.S"
+#undef ENTRY_MAPPING_KEXEC_SETUP
+
+   mr  r3, r29
+   mr  r4, r30
+   mr  r5, r31
+
+   li  r0, 0
+#elif defined(CONFIG_44x)
+
+   /* Save our parameters */
+   mr  r29, r3
+   mr  r30, r4
+   mr  r31, r5
+
+#ifdef CONFIG_PPC_47x
+   /* Check for 47x cores */
+   mfspr   r3,SPRN_PVR
+   srwir3,r3,16
+   cmplwi  cr0,r3,PVR_476FPE@h
+   beq setup_map_47x
+   cmplwi  cr0,r3,PVR_476@h
+   beq setup_map_47x
+   cmplwi  cr0,r3,PVR_476_ISS@h
+   beq setup_map_47x
+#endif /* CONFIG_PPC_47x */
+   
+/*
+ * Code for setting up 1:1 mapping for PPC440x for KEXEC
+ *
+ * We cannot switch off the MMU on PPC44x.
+ * So we:
+ * 1) Invalidate all the mappings except the one we are running from.
+ * 2) Create a tmp mapping for our code in the other address space(TS) and
+ *jump to it. Invalidate the entry we started in.
+ * 3) Create a 1:1 mapping for 0-2GiB in chunks of 256M in original TS.
+ * 4) Jump to the 1:1 mapping in original TS.
+ * 5) Invalidate the tmp mapping.
+ *
+ * - Based on the kexec support code for FSL BookE
+ *
+ */
+
+   /* 
+* Load the PID with kernel PID (0).
+* Also load our MSR_IS and TID to MMUCR for TLB search.
+*/
+   li  r3, 0
+   mtspr   SPRN_PID, r3
+   mfmsr   r4
+   andi.   r4,r4,MSR_IS@l
+   beq wmmucr
+   orisr3,r3,PPC44x_MMUCR_STS@h
+wmmucr:
+   mtspr   SPRN_MMUCR,r3
+   sync
+
+   /*
+* Invalidate all the TLB entries except the current entry
+* where we are running from
+*/
+   bl  0f  /* Find our address */
+0: mflrr5  /* Make it accessible */
+   tlbsx   r23,0,r5/* Find entry we are in */
+   li  r4,0/* Start at TLB entry 0 */
+   li  r3,0/* Set PAGEID inval value */
+1: cmpwr23,r4  /* Is this our entry? */
+   beq skip/* If so, skip the inval */
+   tlbwe   r3,r4,PPC44x_TLB_PAGEID /* If not, inval the entry */
+skip:
+   addir4,r4,1 /* Increment */
+   cmpwi   r4,64   /* Are we done? */
+   bne 1b  /* If not, repeat */
+   isync
+
+   /* Create a temp mapping and jump to it */
+   andi.   r6, r23, 1  /* Find the index to use */
+   addir24, r6, 1  /* r24 will contain 1 or 2 */
+
+   mfmsr   r9  /* get the MSR */
+   rlwinm  r5, r9, 27, 31, 31  /* Extract the MSR[IS] */
+   xorir7, r5, 1   /* Use the other address space */
+
+   /* Read the current mapping entries */
+   tlbre   r3, r23, PPC44x_TLB_PAGEID
+   tlbre   r4, r23, PPC44x_TLB_XLAT
+   tlbre   r5, r23, 

[PATCH v2 1/2] NFS: Fix inode fileid checks in attribute revalidation code

2019-09-10 Thread Trond Myklebust
We want to throw out the attribute if it refers to the mounted-on fileid,
and not the real fileid. However, we do not want to block cache consistency
updates from NFSv4 writes.

Reported-by: Murphy Zhou 
Fixes: 7e10cc25bfa0 ("NFS: Don't refresh attributes with mounted-on-file...")
Signed-off-by: Trond Myklebust 
Signed-off-by: Christophe Leroy 
---
 fs/nfs/inode.c | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index c764cfe456e5..2a03bfeec10a 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -1403,11 +1403,12 @@ static int nfs_check_inode_attributes(struct inode 
*inode, struct nfs_fattr *fat
if (NFS_PROTO(inode)->have_delegation(inode, FMODE_READ))
return 0;
 
-   /* No fileid? Just exit */
-   if (!(fattr->valid & NFS_ATTR_FATTR_FILEID))
-   return 0;
+   if (!(fattr->valid & NFS_ATTR_FATTR_FILEID)) {
+   /* Only a mounted-on-fileid? Just exit */
+   if (fattr->valid & NFS_ATTR_FATTR_MOUNTED_ON_FILEID)
+   return 0;
/* Has the inode gone and changed behind our back? */
-   if (nfsi->fileid != fattr->fileid) {
+   } else if (nfsi->fileid != fattr->fileid) {
/* Is this perhaps the mounted-on fileid? */
if ((fattr->valid & NFS_ATTR_FATTR_MOUNTED_ON_FILEID) &&
nfsi->fileid == fattr->mounted_on_fileid)
@@ -1807,11 +1808,12 @@ static int nfs_update_inode(struct inode *inode, struct 
nfs_fattr *fattr)
nfs_display_fhandle_hash(NFS_FH(inode)),
atomic_read(>i_count), fattr->valid);
 
-   /* No fileid? Just exit */
-   if (!(fattr->valid & NFS_ATTR_FATTR_FILEID))
-   return 0;
+   if (!(fattr->valid & NFS_ATTR_FATTR_FILEID)) {
+   /* Only a mounted-on-fileid? Just exit */
+   if (fattr->valid & NFS_ATTR_FATTR_MOUNTED_ON_FILEID)
+   return 0;
/* Has the inode gone and changed behind our back? */
-   if (nfsi->fileid != fattr->fileid) {
+   } else if (nfsi->fileid != fattr->fileid) {
/* Is this perhaps the mounted-on fileid? */
if ((fattr->valid & NFS_ATTR_FATTR_MOUNTED_ON_FILEID) &&
nfsi->fileid == fattr->mounted_on_fileid)
-- 
2.13.3



Re: [PATCH v3] powerpc/lockdep: fix a false positive warning

2019-09-10 Thread Michael Ellerman
Hi Qian,

Sorry I haven't replied sooner, I've been travelling.

Qian Cai  writes:
> The commit 108c14858b9e ("locking/lockdep: Add support for dynamic
> keys") introduced the boot warning on powerpc below, because the
> commit 2d4f567103ff ("KVM: PPC: Introduce kvm_tmp framework") adds
> kvm_tmp[] into the .bss section and then frees the rest of the unused
> space back to the page allocator.

Thanks for debugging this, but I'd like to fix it differently.

kvm_tmp has caused trouble before, with kmemleak, and it can also cause
trouble with STRICT_KERNEL_RWX, so I'd like to change how it's done,
rather than doing more hacks for it.

It should just be a page in text that we use if needed, and don't free,
which should avoid all these problems.

I'll try and get that done and posted soon.

cheers


Re: [PATCH v5 21/31] powernv/fadump: process architected register state data provided by firmware

2019-09-10 Thread Michael Ellerman
Hari Bathini  writes:
> On 09/09/19 9:03 PM, Oliver O'Halloran wrote:
>> On Mon, Sep 9, 2019 at 11:23 PM Hari Bathini  wrote:
>>> On 04/09/19 5:50 PM, Michael Ellerman wrote:
 Hari Bathini  writes:
>>> [...]
>>>
> +/*
> + * CPU state data is provided by f/w. Below are the definitions
> + * provided in HDAT spec. Refer to latest HDAT specification for
> + * any update to this format.
> + */

 How is this meant to work? If HDAT ever changes the format they will
 break all existing kernels in the field.

> +#define HDAT_FADUMP_CPU_DATA_VERSION1
>>>
>>> Changes are not expected here. But this is just to cover for such scenario,
>>> if that ever happens.
>> 
>> The HDAT spec doesn't define the SPR numbers for NIA, MSR and the CR.
>> As far as I can tell the values you've assumed here are chip-specific,
>> non-architected SPR numbers that come from an array buried somewhere
>> in the SBE codebase. I don't believe you for a second when you say
>> that this will never change.
>
> At least, the understanding is that these numbers do not change across processor
> generations. If something changes, it is supposed to be handled in SBE. Also,
> I am told these numbers would be listed in the HDAT Spec. Not sure if that
> has happened yet though. Vasant, do you have anything to add?

That doesn't help much because the HDAT spec is not public.

The point is with the code written the way it is, these values *must
not* change, or else all existing kernels will be broken, which is not
acceptable.

>>> Also, I think it is a bit far-fetched to error out if versions mismatch.
>>> Warning and proceeding sounds worthier because the changes are usually
>>> backward compatible, if and when there are any. Will update accordingly...
>> 
>> Literally the only reason I didn't drop the CPU DATA parts of the OPAL
>> MPIPL series was because I assumed the kernel would do the sensible
>> thing and reject or ignore the structure if it did not know how to
>> parse the data.
>
> I think, the changes if any, would have to be backward compatible for the sake
> of sanity.

People need to understand that this is an ABI between firmware and
in-the-field distribution kernels which are only updated at customer
discretion, or possibly never.

Any changes *must be* backward compatible.

Looking at the header struct:

+struct hdat_fadump_thread_hdr {
+   __be32  pir;
+   /* 0x00 - 0x0F - The corresponding stop state of the core */
+   u8  core_state;
+   u8  reserved[3];

You have those 3 reserved bytes, so a future revision could repurpose
one of those as a flag to indicate a new format. And/or the hdr could be
made bigger and new kernels could be taught to look for new things in
the space after the hdr but before the reg entries.
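
Purely as a sketch of that idea (the 'format' field below is hypothetical and
not part of any spec), one of the reserved bytes could carry a layout version
that old kernels simply ignore:

#include <linux/types.h>

struct hdat_fadump_thread_hdr {
	__be32	pir;
	/* 0x00 - 0x0F - The corresponding stop state of the core */
	u8	core_state;
	u8	format;		/* was reserved[0]; 0 == original layout */
	u8	reserved[2];
	/*
	 * New kernels would check 'format' before interpreting anything past
	 * the header; old kernels keep treating the byte as reserved.
	 */
};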

So I think there is a reasonable mechanism for extending the format in
future, but my point is people must understand that this is an ABI and
changes must be made accordingly.

> Even if they are not, we are better off exporting the /proc/vmcore
> with a warning and some crazy CPU register data (if parsing goes alright) than
> no dump at all? 

If it's just a case of reg entries that we don't recognise then yes I
think it would be OK to just skip them and continue exporting. But if
there's any deeper misunderstanding of the format then we should bail
out.

I notice now that you don't do anything in opal_fadump_set_regval_regnum()
if you are passed a register we don't understand, so that probably needs
fixing.

cheers


[PATCH] powerpc/xive: Fix bogus error code returned by OPAL

2019-09-10 Thread Greg Kurz
There's a bug in skiboot that causes the OPAL_XIVE_ALLOCATE_IRQ call
to return the 32-bit value 0x when OPAL has run out of IRQs.
Unfortunately, OPAL return values are signed 64-bit entities and
errors are supposed to be negative. If that happens, the Linux code
confusingly treats 0x as a valid IRQ number and panics at some
point.

A fix was recently merged in skiboot:

e97391ae2bb5 ("xive: fix return value of opal_xive_allocate_irq()")

but we need a workaround anyway to support older skiboots already
in the field.

Internally convert 0x to OPAL_RESOURCE which is the usual error
returned upon resource exhaustion.
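
A few lines of userspace C illustrate the signedness trap, assuming the
sentinel handed back by buggy skiboot is the all-ones 32-bit value 0xffffffff
(that constant is an assumption here, not taken from the patch below):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint32_t opal_ret = 0xffffffffu;	/* assumed exhaustion sentinel */
	int64_t rc = opal_ret;			/* zero-extends to 4294967295 */

	/* OPAL errors are negative, so this test never fires for the sentinel */
	if (rc < 0)
		printf("treated as an error: %lld\n", (long long)rc);
	else
		printf("treated as a valid IRQ number: %lld\n", (long long)rc);
	return 0;
}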

Signed-off-by: Greg Kurz 
---
 arch/powerpc/sysdev/xive/native.c |   13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/native.c 
b/arch/powerpc/sysdev/xive/native.c
index 37987c815913..c35583f84f9f 100644
--- a/arch/powerpc/sysdev/xive/native.c
+++ b/arch/powerpc/sysdev/xive/native.c
@@ -231,6 +231,15 @@ static bool xive_native_match(struct device_node *node)
return of_device_is_compatible(node, "ibm,opal-xive-vc");
 }
 
+static int64_t opal_xive_allocate_irq_fixup(uint32_t chip_id)
+{
+   s64 irq = opal_xive_allocate_irq(chip_id);
+
+#define XIVE_ALLOC_NO_SPACE0x /* No possible space */
+   return
+   irq == XIVE_ALLOC_NO_SPACE ? OPAL_RESOURCE : irq;
+}
+
 #ifdef CONFIG_SMP
 static int xive_native_get_ipi(unsigned int cpu, struct xive_cpu *xc)
 {
@@ -238,7 +247,7 @@ static int xive_native_get_ipi(unsigned int cpu, struct 
xive_cpu *xc)
 
/* Allocate an IPI and populate info about it */
for (;;) {
-   irq = opal_xive_allocate_irq(xc->chip_id);
+   irq = opal_xive_allocate_irq_fixup(xc->chip_id);
if (irq == OPAL_BUSY) {
msleep(OPAL_BUSY_DELAY_MS);
continue;
@@ -259,7 +268,7 @@ u32 xive_native_alloc_irq(void)
s64 rc;
 
for (;;) {
-   rc = opal_xive_allocate_irq(OPAL_XIVE_ANY_CHIP);
+   rc = opal_xive_allocate_irq_fixup(OPAL_XIVE_ANY_CHIP);
if (rc != OPAL_BUSY)
break;
msleep(OPAL_BUSY_DELAY_MS);



CVE-2019-15031: Linux kernel: powerpc: data leak with FP/VMX triggerable by interrupt in transaction

2019-09-10 Thread Michael Neuling
The Linux kernel for powerpc since v4.15 has a bug in its TM handling during
interrupts whereby any user can read the FP/VMX registers of a different user's
process. Users of TM + FP/VMX can also experience corruption of their FP/VMX
state.

To trigger the bug, a process starts a transaction with FP/VMX off and then
takes an interrupt. Due to the kernel's incorrect handling of the interrupt,
FP/VMX is turned on but the checkpointed state is not updated. If this
transaction then rolls back, the checkpointed state may contain the state of a
different process. This checkpointed state can then be read by the process,
hence leaking data from one process to another.

The trigger for this bug is an interrupt inside a transaction where FP/VMX is
off, hence the process needs FP/VMX off when starting the transaction. FP/VMX
availability is under the control of the kernel and is transparent to the user,
hence the user has to retry the transaction many times to trigger this bug. High
interrupt loads also help trigger this bug.

All 64-bit machines where TM is present are affected. This includes all POWER8
variants and POWER9 VMs under KVM or LPARs under PowerVM. POWER9 bare metal
doesn't support TM and hence is not affected.

The bug was introduced in commit:
  fa7771176b439 ("powerpc: Don't enable FP/Altivec if not checkpointed")
which was originally merged in v4.15.

The upstream fix is here:
  https://git.kernel.org/torvalds/c/a8318c13e79badb92bc6640704a64cc022a6eb97

The fix can be verified by running the tm-poison test from the kernel selftests. This
test is in a patch here:
https://patchwork.ozlabs.org/patch/1157467/
which should eventually end up here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/powerpc/tm/tm-poison.c

cheers
Mikey







CVE-2019-15030: Linux kernel: powerpc: data leak with FP/VMX triggerable by unavailable exception in transaction

2019-09-10 Thread Michael Neuling
The Linux kernel for powerpc since v4.12 has a bug in its TM handling whereby any
user can read the FP/VMX registers of a different user's process. Users of TM +
FP/VMX can also experience corruption of their FP/VMX state.

To trigger the bug, a process starts a transaction and reads an FP/VMX register.
This transaction can then fail, which causes a rollback to the checkpointed
state. Due to the kernel taking an FP/VMX unavailable exception inside a
transaction and the kernel's incorrect handling of this, the checkpointed state
can be set to the FP/VMX registers of another process. This checkpointed state
can then be read by the process, hence leaking data from one process to another.

The trigger for this bug is an FP/VMX unavailable exception inside a
transaction, hence the process needs FP/VMX off when starting the transaction.
FP/VMX availability is under the control of the kernel and is transparent to the
user, hence the user has to retry the transaction many times to trigger this
bug. 

All 64-bit machines where TM is present are affected. This includes all POWER8
variants and POWER9 VMs under KVM or LPARs under PowerVM. POWER9 bare metal
doesn't support TM and hence is not affected.

The bug was introduced in commit:
  f48e91e87e67 ("powerpc/tm: Fix FP and VMX register corruption")
which was originally merged in v4.12.

The upstream fix is here:
  https://git.kernel.org/torvalds/c/8205d5d98ef7f155de211f5e2eb6ca03d95a5a60

The fix can be verified by running the tm-poison test from the kernel selftests. This
test is in a patch here:
https://patchwork.ozlabs.org/patch/1157467/
which should eventually end up here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/powerpc/tm/tm-poison.c

cheers
Mikey







[PATCH v2] powerpc/watchpoint: Disable watchpoint hit by larx/stcx instructions

2019-09-10 Thread Ravi Bangoria
If a watchpoint exception is generated by larx/stcx instructions, the
reservation created by larx is lost while handling the exception, and
thus the stcx instruction always fails. Generally these instructions are
used in a while(1) loop, for example in spinlocks. And because stcx
never succeeds, the loop runs forever and ultimately hangs the system.
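
As an illustration only (this helper is not taken from the kernel sources),
the usual larx/stcx retry pattern looks like the sketch below. If the
reservation set by lwarx is destroyed on every pass, for instance by the
single-stepping done while servicing a watchpoint exception, stwcx. can never
succeed and the loop never terminates:

static inline void atomic_inc_sketch(int *v)
{
	int tmp;

	asm volatile(
"1:	lwarx	%0,0,%2\n"	/* load word and establish the reservation */
"	addi	%0,%0,1\n"
"	stwcx.	%0,0,%2\n"	/* store succeeds only if the reservation survived */
"	bne-	1b\n"		/* store failed (reservation lost) -> retry */
	: "=&r" (tmp), "+m" (*v)
	: "r" (v)
	: "cc", "memory");
}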

Note that ptrace anyway works in one-shot mode and thus for ptrace
we don't change the behaviour. It's up to the ptrace user to take care
of this.

Signed-off-by: Ravi Bangoria 
Acked-by: Naveen N. Rao 
---
v1->v2:
 - v1: 
https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-September/196818.html
 - Christophe's patch is already merged. Don't include it.
 - Rewrite warning message

 arch/powerpc/kernel/hw_breakpoint.c | 49 +++--
 1 file changed, 33 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/kernel/hw_breakpoint.c 
b/arch/powerpc/kernel/hw_breakpoint.c
index 28ad3171bb82..1007ec36b4cb 100644
--- a/arch/powerpc/kernel/hw_breakpoint.c
+++ b/arch/powerpc/kernel/hw_breakpoint.c
@@ -195,14 +195,32 @@ void thread_change_pc(struct task_struct *tsk, struct 
pt_regs *regs)
tsk->thread.last_hit_ubp = NULL;
 }
 
+static bool is_larx_stcx_instr(struct pt_regs *regs, unsigned int instr)
+{
+   int ret, type;
+   struct instruction_op op;
+
+   ret = analyse_instr(, regs, instr);
+   type = GETTYPE(op.type);
+   return (!ret && (type == LARX || type == STCX));
+}
+
 /*
  * Handle debug exception notifications.
  */
 static bool stepping_handler(struct pt_regs *regs, struct perf_event *bp,
 unsigned long addr)
 {
-   int stepped;
-   unsigned int instr;
+   unsigned int instr = 0;
+
+   if (__get_user_inatomic(instr, (unsigned int *)regs->nip))
+   goto fail;
+
+   if (is_larx_stcx_instr(regs, instr)) {
+   printk_ratelimited("Breakpoint hit on instruction that can't be 
emulated."
+  " Breakpoint at 0x%lx will be disabled.\n", 
addr);
+   goto disable;
+   }
 
/* Do not emulate user-space instructions, instead single-step them */
if (user_mode(regs)) {
@@ -211,23 +229,22 @@ static bool stepping_handler(struct pt_regs *regs, struct 
perf_event *bp,
return false;
}
 
-   stepped = 0;
-   instr = 0;
-   if (!__get_user_inatomic(instr, (unsigned int *)regs->nip))
-   stepped = emulate_step(regs, instr);
+   if (!emulate_step(regs, instr))
+   goto fail;
 
+   return true;
+
+fail:
/*
-* emulate_step() could not execute it. We've failed in reliably
-* handling the hw-breakpoint. Unregister it and throw a warning
-* message to let the user know about it.
+* We've failed in reliably handling the hw-breakpoint. Unregister
+* it and throw a warning message to let the user know about it.
 */
-   if (!stepped) {
-   WARN(1, "Unable to handle hardware breakpoint. Breakpoint at "
-   "0x%lx will be disabled.", addr);
-   perf_event_disable_inatomic(bp);
-   return false;
-   }
-   return true;
+   WARN(1, "Unable to handle hardware breakpoint. Breakpoint at "
+   "0x%lx will be disabled.", addr);
+
+disable:
+   perf_event_disable_inatomic(bp);
+   return false;
 }
 
 int hw_breakpoint_handler(struct die_args *args)
-- 
2.21.0



[PATCH 2/2] powerpc/kexec: move kexec files into a dedicated subdir.

2019-09-10 Thread Christophe Leroy
arch/powerpc/kernel/ contains 7 files dedicated to kexec.

Move them into a dedicated subdirectory.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/Makefile   | 16 +---
 arch/powerpc/kernel/kexec/Makefile | 22 ++
 arch/powerpc/kernel/{ => kexec}/ima_kexec.c|  0
 arch/powerpc/kernel/{ => kexec}/kexec_32.S |  2 +-
 arch/powerpc/kernel/{ => kexec}/kexec_elf_64.c |  0
 arch/powerpc/kernel/{ => kexec}/machine_kexec.c|  0
 arch/powerpc/kernel/{ => kexec}/machine_kexec_32.c |  0
 arch/powerpc/kernel/{ => kexec}/machine_kexec_64.c |  0
 .../kernel/{ => kexec}/machine_kexec_file_64.c |  0
 9 files changed, 24 insertions(+), 16 deletions(-)
 create mode 100644 arch/powerpc/kernel/kexec/Makefile
 rename arch/powerpc/kernel/{ => kexec}/ima_kexec.c (100%)
 rename arch/powerpc/kernel/{ => kexec}/kexec_32.S (99%)
 rename arch/powerpc/kernel/{ => kexec}/kexec_elf_64.c (100%)
 rename arch/powerpc/kernel/{ => kexec}/machine_kexec.c (100%)
 rename arch/powerpc/kernel/{ => kexec}/machine_kexec_32.c (100%)
 rename arch/powerpc/kernel/{ => kexec}/machine_kexec_64.c (100%)
 rename arch/powerpc/kernel/{ => kexec}/machine_kexec_file_64.c (100%)

diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index df708de6f866..b65c44d47157 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -81,7 +81,6 @@ obj-$(CONFIG_CRASH_DUMP)  += crash_dump.o
 obj-$(CONFIG_FA_DUMP)  += fadump.o
 ifdef CONFIG_PPC32
 obj-$(CONFIG_E500) += idle_e500.o
-obj-$(CONFIG_KEXEC_CORE)   += kexec_32.o
 endif
 obj-$(CONFIG_PPC_BOOK3S_32)+= idle_6xx.o l2cr_6xx.o cpu_setup_6xx.o
 obj-$(CONFIG_TAU)  += tau_6xx.o
@@ -125,14 +124,7 @@ pci64-$(CONFIG_PPC64)  += pci_dn.o 
pci-hotplug.o isa-bridge.o
 obj-$(CONFIG_PCI)  += pci_$(BITS).o $(pci64-y) \
   pci-common.o pci_of_scan.o
 obj-$(CONFIG_PCI_MSI)  += msi.o
-obj-$(CONFIG_KEXEC_CORE)   += machine_kexec.o crash.o \
-  machine_kexec_$(BITS).o
-obj-$(CONFIG_KEXEC_FILE)   += machine_kexec_file_$(BITS).o 
kexec_elf_$(BITS).o
-ifdef CONFIG_HAVE_IMA_KEXEC
-ifdef CONFIG_IMA
-obj-y  += ima_kexec.o
-endif
-endif
+obj-$(CONFIG_KEXEC_CORE)   += kexec/
 
 obj-$(CONFIG_AUDIT)+= audit.o
 obj64-$(CONFIG_AUDIT)  += compat_audit.o
@@ -164,12 +156,6 @@ endif
 GCOV_PROFILE_prom_init.o := n
 KCOV_INSTRUMENT_prom_init.o := n
 UBSAN_SANITIZE_prom_init.o := n
-GCOV_PROFILE_machine_kexec_64.o := n
-KCOV_INSTRUMENT_machine_kexec_64.o := n
-UBSAN_SANITIZE_machine_kexec_64.o := n
-GCOV_PROFILE_machine_kexec_32.o := n
-KCOV_INSTRUMENT_machine_kexec_32.o := n
-UBSAN_SANITIZE_machine_kexec_32.o := n
 GCOV_PROFILE_kprobes.o := n
 KCOV_INSTRUMENT_kprobes.o := n
 UBSAN_SANITIZE_kprobes.o := n
diff --git a/arch/powerpc/kernel/kexec/Makefile 
b/arch/powerpc/kernel/kexec/Makefile
new file mode 100644
index ..d96ee5660572
--- /dev/null
+++ b/arch/powerpc/kernel/kexec/Makefile
@@ -0,0 +1,22 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# Makefile for the linux kernel.
+#
+
+obj-y  += machine_kexec.o crash.o 
machine_kexec_$(BITS).o
+
+obj-$(CONFIG_PPC32)+= kexec_32.o
+
+obj-$(CONFIG_KEXEC_FILE)   += machine_kexec_file_$(BITS).o 
kexec_elf_$(BITS).o
+
+ifdef CONFIG_HAVE_IMA_KEXEC
+ifdef CONFIG_IMA
+obj-y  += ima_kexec.o
+endif
+endif
+
+
+# Disable GCOV, KCOV & sanitizers in odd or sensitive code
+GCOV_PROFILE_machine_kexec_$(BITS).o := n
+KCOV_INSTRUMENT_machine_kexec_$(BITS).o := n
+UBSAN_SANITIZE_machine_kexec_$(BITS).o := n
diff --git a/arch/powerpc/kernel/ima_kexec.c 
b/arch/powerpc/kernel/kexec/ima_kexec.c
similarity index 100%
rename from arch/powerpc/kernel/ima_kexec.c
rename to arch/powerpc/kernel/kexec/ima_kexec.c
diff --git a/arch/powerpc/kernel/kexec_32.S 
b/arch/powerpc/kernel/kexec/kexec_32.S
similarity index 99%
rename from arch/powerpc/kernel/kexec_32.S
rename to arch/powerpc/kernel/kexec/kexec_32.S
index 3f8ca6a566fb..b9355e0d5c85 100644
--- a/arch/powerpc/kernel/kexec_32.S
+++ b/arch/powerpc/kernel/kexec/kexec_32.S
@@ -32,7 +32,7 @@ relocate_new_kernel:
mr  r31, r5
 
 #define ENTRY_MAPPING_KEXEC_SETUP
-#include "fsl_booke_entry_mapping.S"
+#include 
 #undef ENTRY_MAPPING_KEXEC_SETUP
 
mr  r3, r29
diff --git a/arch/powerpc/kernel/kexec_elf_64.c 
b/arch/powerpc/kernel/kexec/kexec_elf_64.c
similarity index 100%
rename from arch/powerpc/kernel/kexec_elf_64.c
rename to arch/powerpc/kernel/kexec/kexec_elf_64.c
diff --git a/arch/powerpc/kernel/machine_kexec.c 
b/arch/powerpc/kernel/kexec/machine_kexec.c
similarity index 100%
rename from arch/powerpc/kernel/machine_kexec.c
rename to arch/powerpc/kernel/kexec/machine_kexec.c
diff --git a/arch/powerpc/kernel/machine_kexec_32.c 

[PATCH 1/2] powerpc/32: Split kexec low level code out of misc_32.S

2019-09-10 Thread Christophe Leroy
Almost half of misc_32.S is dedicated to kexec.

Drop it into a dedicated kexec_32.S

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/Makefile   |   1 +
 arch/powerpc/kernel/kexec_32.S | 500 +
 arch/powerpc/kernel/misc_32.S  | 491 
 3 files changed, 501 insertions(+), 491 deletions(-)
 create mode 100644 arch/powerpc/kernel/kexec_32.S

diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index c9cc4b689e60..df708de6f866 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -81,6 +81,7 @@ obj-$(CONFIG_CRASH_DUMP)  += crash_dump.o
 obj-$(CONFIG_FA_DUMP)  += fadump.o
 ifdef CONFIG_PPC32
 obj-$(CONFIG_E500) += idle_e500.o
+obj-$(CONFIG_KEXEC_CORE)   += kexec_32.o
 endif
 obj-$(CONFIG_PPC_BOOK3S_32)+= idle_6xx.o l2cr_6xx.o cpu_setup_6xx.o
 obj-$(CONFIG_TAU)  += tau_6xx.o
diff --git a/arch/powerpc/kernel/kexec_32.S b/arch/powerpc/kernel/kexec_32.S
new file mode 100644
index ..3f8ca6a566fb
--- /dev/null
+++ b/arch/powerpc/kernel/kexec_32.S
@@ -0,0 +1,500 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * This file contains kexec low-level functions.
+ *
+ * Copyright (C) 2002-2003 Eric Biederman  
+ * GameCube/ppc32 port Copyright (C) 2004 Albert Herranz
+ * PPC44x port. Copyright (C) 2011,  IBM Corporation
+ * Author: Suzuki Poulose 
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+   .text
+
+   /*
+* Must be relocatable PIC code callable as a C function.
+*/
+   .globl relocate_new_kernel
+relocate_new_kernel:
+   /* r3 = page_list   */
+   /* r4 = reboot_code_buffer */
+   /* r5 = start_address  */
+
+#ifdef CONFIG_FSL_BOOKE
+
+   mr  r29, r3
+   mr  r30, r4
+   mr  r31, r5
+
+#define ENTRY_MAPPING_KEXEC_SETUP
+#include "fsl_booke_entry_mapping.S"
+#undef ENTRY_MAPPING_KEXEC_SETUP
+
+   mr  r3, r29
+   mr  r4, r30
+   mr  r5, r31
+
+   li  r0, 0
+#elif defined(CONFIG_44x)
+
+   /* Save our parameters */
+   mr  r29, r3
+   mr  r30, r4
+   mr  r31, r5
+
+#ifdef CONFIG_PPC_47x
+   /* Check for 47x cores */
+   mfspr   r3,SPRN_PVR
+   srwir3,r3,16
+   cmplwi  cr0,r3,PVR_476FPE@h
+   beq setup_map_47x
+   cmplwi  cr0,r3,PVR_476@h
+   beq setup_map_47x
+   cmplwi  cr0,r3,PVR_476_ISS@h
+   beq setup_map_47x
+#endif /* CONFIG_PPC_47x */
+   
+/*
+ * Code for setting up 1:1 mapping for PPC440x for KEXEC
+ *
+ * We cannot switch off the MMU on PPC44x.
+ * So we:
+ * 1) Invalidate all the mappings except the one we are running from.
+ * 2) Create a tmp mapping for our code in the other address space(TS) and
+ *jump to it. Invalidate the entry we started in.
+ * 3) Create a 1:1 mapping for 0-2GiB in chunks of 256M in original TS.
+ * 4) Jump to the 1:1 mapping in original TS.
+ * 5) Invalidate the tmp mapping.
+ *
+ * - Based on the kexec support code for FSL BookE
+ *
+ */
+
+   /* 
+* Load the PID with kernel PID (0).
+* Also load our MSR_IS and TID to MMUCR for TLB search.
+*/
+   li  r3, 0
+   mtspr   SPRN_PID, r3
+   mfmsr   r4
+   andi.   r4,r4,MSR_IS@l
+   beq wmmucr
+   orisr3,r3,PPC44x_MMUCR_STS@h
+wmmucr:
+   mtspr   SPRN_MMUCR,r3
+   sync
+
+   /*
+* Invalidate all the TLB entries except the current entry
+* where we are running from
+*/
+   bl  0f  /* Find our address */
+0: mflrr5  /* Make it accessible */
+   tlbsx   r23,0,r5/* Find entry we are in */
+   li  r4,0/* Start at TLB entry 0 */
+   li  r3,0/* Set PAGEID inval value */
+1: cmpwr23,r4  /* Is this our entry? */
+   beq skip/* If so, skip the inval */
+   tlbwe   r3,r4,PPC44x_TLB_PAGEID /* If not, inval the entry */
+skip:
+   addir4,r4,1 /* Increment */
+   cmpwi   r4,64   /* Are we done? */
+   bne 1b  /* If not, repeat */
+   isync
+
+   /* Create a temp mapping and jump to it */
+   andi.   r6, r23, 1  /* Find the index to use */
+   addir24, r6, 1  /* r24 will contain 1 or 2 */
+
+   mfmsr   r9  /* get the MSR */
+   rlwinm  r5, r9, 27, 31, 31  /* Extract the MSR[IS] */
+   xorir7, r5, 1   /* Use the other address space */
+
+   /* Read the current mapping entries */
+   tlbre   r3, r23, PPC44x_TLB_PAGEID
+   tlbre   r4, r23, PPC44x_TLB_XLAT
+   tlbre   r5, r23, PPC44x_TLB_ATTRIB
+
+  

Re: [PATCH 2/2] powerpc/watchpoint: Disable watchpoint hit by larx/stcx instructions

2019-09-10 Thread Naveen N. Rao

Ravi Bangoria wrote:

If a watchpoint exception is generated by larx/stcx instructions, the
reservation created by larx is lost while handling the exception, and
thus the stcx instruction always fails. Generally these instructions are
used in a while(1) loop, for example in spinlocks. And because stcx
never succeeds, the loop runs forever and ultimately hangs the system.

Note that ptrace anyway works in one-shot mode and thus for ptrace
we don't change the behaviour. It's up to the ptrace user to take care
of this.

Signed-off-by: Ravi Bangoria 
---
 arch/powerpc/kernel/hw_breakpoint.c | 49 +++--
 1 file changed, 33 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/kernel/hw_breakpoint.c 
b/arch/powerpc/kernel/hw_breakpoint.c
index 28ad3171bb82..9fa496a598ce 100644
--- a/arch/powerpc/kernel/hw_breakpoint.c
+++ b/arch/powerpc/kernel/hw_breakpoint.c
@@ -195,14 +195,32 @@ void thread_change_pc(struct task_struct *tsk, struct 
pt_regs *regs)
tsk->thread.last_hit_ubp = NULL;
 }
 
+static bool is_larx_stcx_instr(struct pt_regs *regs, unsigned int instr)

+{
+   int ret, type;
+   struct instruction_op op;
+
+   ret = analyse_instr(, regs, instr);
+   type = GETTYPE(op.type);
+   return (!ret && (type == LARX || type == STCX));
+}
+
 /*
  * Handle debug exception notifications.
  */
 static bool stepping_handler(struct pt_regs *regs, struct perf_event *bp,
 unsigned long addr)
 {
-   int stepped;
-   unsigned int instr;
+   unsigned int instr = 0;
+
+   if (__get_user_inatomic(instr, (unsigned int *)regs->nip))
+   goto fail;
+
+   if (is_larx_stcx_instr(regs, instr)) {
+   printk_ratelimited("Watchpoint: Can't emulate/single-step larx/"
+  "stcx instructions. Disabling 
watchpoint.\n");


The below WARN() uses the term 'breakpoint'. Better to use consistent 
terminology. I would rewrite the above as:

printk_ratelimited("Breakpoint hit on instruction that can't be emulated. 
"
"Breakpoint at 0x%lx will be disabled.\n", 
addr);

Otherwise:
Acked-by: Naveen N. Rao 

- Naveen


+   goto disable;
+   }
 
 	/* Do not emulate user-space instructions, instead single-step them */

if (user_mode(regs)) {
@@ -211,23 +229,22 @@ static bool stepping_handler(struct pt_regs *regs, struct 
perf_event *bp,
return false;
}
 
-	stepped = 0;

-   instr = 0;
-   if (!__get_user_inatomic(instr, (unsigned int *)regs->nip))
-   stepped = emulate_step(regs, instr);
+   if (!emulate_step(regs, instr))
+   goto fail;
 
+	return true;

+
+fail:
/*
-* emulate_step() could not execute it. We've failed in reliably
-* handling the hw-breakpoint. Unregister it and throw a warning
-* message to let the user know about it.
+* We've failed in reliably handling the hw-breakpoint. Unregister
+* it and throw a warning message to let the user know about it.
 */
-   if (!stepped) {
-   WARN(1, "Unable to handle hardware breakpoint. Breakpoint at "
-   "0x%lx will be disabled.", addr);
-   perf_event_disable_inatomic(bp);
-   return false;
-   }
-   return true;
+   WARN(1, "Unable to handle hardware breakpoint. Breakpoint at "
+   "0x%lx will be disabled.", addr);
+
+disable:
+   perf_event_disable_inatomic(bp);
+   return false;
 }
 
 int hw_breakpoint_handler(struct die_args *args)

--
2.21.0






Re: [PATCH 0/2] powerpc/watchpoint: Disable watchpoint hit by larx/stcx instruction

2019-09-10 Thread Christophe Leroy




Le 10/09/2019 à 12:24, Ravi Bangoria a écrit :

I've prepared my patch on top of Christophe's patch as it's easier
to change stepping_handler() than hw_breakpoint_handler().
The 2nd patch is the actual fix.


Anyway, my patch is already committed on powerpc/next

https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20190904=658d029df0bce6472c94b724ca54d74bc6659c2e

Christophe



Christophe Leroy (1):
   powerpc/hw_breakpoint: move instruction stepping out of
 hw_breakpoint_handler()

Ravi Bangoria (1):
   powerpc/watchpoint: Disable watchpoint hit by larx/stcx instructions

  arch/powerpc/kernel/hw_breakpoint.c | 77 +++--
  1 file changed, 50 insertions(+), 27 deletions(-)



[PATCH 2/2] powerpc/watchpoint: Disable watchpoint hit by larx/stcx instructions

2019-09-10 Thread Ravi Bangoria
If a watchpoint exception is generated by larx/stcx instructions, the
reservation created by larx is lost while handling the exception, and
thus the stcx instruction always fails. Generally these instructions are
used in a while(1) loop, for example in spinlocks. And because stcx
never succeeds, the loop runs forever and ultimately hangs the system.

Note that ptrace anyway works in one-shot mode and thus for ptrace
we don't change the behaviour. It's up to the ptrace user to take care
of this.

Signed-off-by: Ravi Bangoria 
---
 arch/powerpc/kernel/hw_breakpoint.c | 49 +++--
 1 file changed, 33 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/kernel/hw_breakpoint.c 
b/arch/powerpc/kernel/hw_breakpoint.c
index 28ad3171bb82..9fa496a598ce 100644
--- a/arch/powerpc/kernel/hw_breakpoint.c
+++ b/arch/powerpc/kernel/hw_breakpoint.c
@@ -195,14 +195,32 @@ void thread_change_pc(struct task_struct *tsk, struct 
pt_regs *regs)
tsk->thread.last_hit_ubp = NULL;
 }
 
+static bool is_larx_stcx_instr(struct pt_regs *regs, unsigned int instr)
+{
+   int ret, type;
+   struct instruction_op op;
+
+   ret = analyse_instr(, regs, instr);
+   type = GETTYPE(op.type);
+   return (!ret && (type == LARX || type == STCX));
+}
+
 /*
  * Handle debug exception notifications.
  */
 static bool stepping_handler(struct pt_regs *regs, struct perf_event *bp,
 unsigned long addr)
 {
-   int stepped;
-   unsigned int instr;
+   unsigned int instr = 0;
+
+   if (__get_user_inatomic(instr, (unsigned int *)regs->nip))
+   goto fail;
+
+   if (is_larx_stcx_instr(regs, instr)) {
+   printk_ratelimited("Watchpoint: Can't emulate/single-step larx/"
+  "stcx instructions. Disabling 
watchpoint.\n");
+   goto disable;
+   }
 
/* Do not emulate user-space instructions, instead single-step them */
if (user_mode(regs)) {
@@ -211,23 +229,22 @@ static bool stepping_handler(struct pt_regs *regs, struct 
perf_event *bp,
return false;
}
 
-   stepped = 0;
-   instr = 0;
-   if (!__get_user_inatomic(instr, (unsigned int *)regs->nip))
-   stepped = emulate_step(regs, instr);
+   if (!emulate_step(regs, instr))
+   goto fail;
 
+   return true;
+
+fail:
/*
-* emulate_step() could not execute it. We've failed in reliably
-* handling the hw-breakpoint. Unregister it and throw a warning
-* message to let the user know about it.
+* We've failed in reliably handling the hw-breakpoint. Unregister
+* it and throw a warning message to let the user know about it.
 */
-   if (!stepped) {
-   WARN(1, "Unable to handle hardware breakpoint. Breakpoint at "
-   "0x%lx will be disabled.", addr);
-   perf_event_disable_inatomic(bp);
-   return false;
-   }
-   return true;
+   WARN(1, "Unable to handle hardware breakpoint. Breakpoint at "
+   "0x%lx will be disabled.", addr);
+
+disable:
+   perf_event_disable_inatomic(bp);
+   return false;
 }
 
 int hw_breakpoint_handler(struct die_args *args)
-- 
2.21.0



[PATCH 1/2] powerpc/hw_breakpoint: move instruction stepping out of hw_breakpoint_handler()

2019-09-10 Thread Ravi Bangoria
From: Christophe Leroy 

On 8xx, breakpoints stop after executing the instruction, so
stepping/emulation is not needed. Move it into a sub-function and
remove the #ifdefs.

Signed-off-by: Christophe Leroy 
Reviewed-by: Ravi Bangoria 
---
 arch/powerpc/kernel/hw_breakpoint.c | 60 -
 1 file changed, 33 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/kernel/hw_breakpoint.c 
b/arch/powerpc/kernel/hw_breakpoint.c
index c8d1fa2e9d53..28ad3171bb82 100644
--- a/arch/powerpc/kernel/hw_breakpoint.c
+++ b/arch/powerpc/kernel/hw_breakpoint.c
@@ -198,15 +198,43 @@ void thread_change_pc(struct task_struct *tsk, struct 
pt_regs *regs)
 /*
  * Handle debug exception notifications.
  */
+static bool stepping_handler(struct pt_regs *regs, struct perf_event *bp,
+unsigned long addr)
+{
+   int stepped;
+   unsigned int instr;
+
+   /* Do not emulate user-space instructions, instead single-step them */
+   if (user_mode(regs)) {
+   current->thread.last_hit_ubp = bp;
+   regs->msr |= MSR_SE;
+   return false;
+   }
+
+   stepped = 0;
+   instr = 0;
+   if (!__get_user_inatomic(instr, (unsigned int *)regs->nip))
+   stepped = emulate_step(regs, instr);
+
+   /*
+* emulate_step() could not execute it. We've failed in reliably
+* handling the hw-breakpoint. Unregister it and throw a warning
+* message to let the user know about it.
+*/
+   if (!stepped) {
+   WARN(1, "Unable to handle hardware breakpoint. Breakpoint at "
+   "0x%lx will be disabled.", addr);
+   perf_event_disable_inatomic(bp);
+   return false;
+   }
+   return true;
+}
+
 int hw_breakpoint_handler(struct die_args *args)
 {
int rc = NOTIFY_STOP;
struct perf_event *bp;
struct pt_regs *regs = args->regs;
-#ifndef CONFIG_PPC_8xx
-   int stepped = 1;
-   unsigned int instr;
-#endif
struct arch_hw_breakpoint *info;
unsigned long dar = regs->dar;
 
@@ -251,31 +279,9 @@ int hw_breakpoint_handler(struct die_args *args)
  (dar - bp->attr.bp_addr < bp->attr.bp_len)))
info->type |= HW_BRK_TYPE_EXTRANEOUS_IRQ;
 
-#ifndef CONFIG_PPC_8xx
-   /* Do not emulate user-space instructions, instead single-step them */
-   if (user_mode(regs)) {
-   current->thread.last_hit_ubp = bp;
-   regs->msr |= MSR_SE;
+   if (!IS_ENABLED(CONFIG_PPC_8xx) && !stepping_handler(regs, bp, 
info->address))
goto out;
-   }
-
-   stepped = 0;
-   instr = 0;
-   if (!__get_user_inatomic(instr, (unsigned int *) regs->nip))
-   stepped = emulate_step(regs, instr);
 
-   /*
-* emulate_step() could not execute it. We've failed in reliably
-* handling the hw-breakpoint. Unregister it and throw a warning
-* message to let the user know about it.
-*/
-   if (!stepped) {
-   WARN(1, "Unable to handle hardware breakpoint. Breakpoint at "
-   "0x%lx will be disabled.", info->address);
-   perf_event_disable_inatomic(bp);
-   goto out;
-   }
-#endif
/*
 * As a policy, the callback is invoked in a 'trigger-after-execute'
 * fashion
-- 
2.21.0



[PATCH 0/2] powerpc/watchpoint: Disable watchpoint hit by larx/stcx instruction

2019-09-10 Thread Ravi Bangoria
I've prepared my patch on top of Christophe's patch as it's easier
to change stepping_handler() than hw_breakpoint_handler().
The 2nd patch is the actual fix.

Christophe Leroy (1):
  powerpc/hw_breakpoint: move instruction stepping out of
hw_breakpoint_handler()

Ravi Bangoria (1):
  powerpc/watchpoint: Disable watchpoint hit by larx/stcx instructions

 arch/powerpc/kernel/hw_breakpoint.c | 77 +++--
 1 file changed, 50 insertions(+), 27 deletions(-)

-- 
2.21.0



Re: [PATCH v8 2/8] kvmppc: Movement of pages between normal and secure memory

2019-09-10 Thread Bharata B Rao
On Tue, Sep 10, 2019 at 01:59:40PM +0530, Bharata B Rao wrote:
> +static struct page *kvmppc_uvmem_get_page(unsigned long *rmap,
> +   unsigned long gpa, unsigned int lpid)
> +{
> + struct page *dpage = NULL;
> + unsigned long bit, uvmem_pfn;
> + struct kvmppc_uvmem_page_pvt *pvt;
> + unsigned long pfn_last, pfn_first;
> +
> + pfn_first = kvmppc_uvmem_pgmap.res.start >> PAGE_SHIFT;
> + pfn_last = pfn_first +
> +(resource_size(_uvmem_pgmap.res) >> PAGE_SHIFT);
> +
> + spin_lock(_uvmem_pfn_lock);
> + bit = find_first_zero_bit(kvmppc_uvmem_pfn_bitmap,
> +   pfn_last - pfn_first);
> + if (bit >= (pfn_last - pfn_first))
> + goto out;
> + bitmap_set(kvmppc_uvmem_pfn_bitmap, bit, 1);
> +
> + uvmem_pfn = bit + pfn_first;
> + dpage = pfn_to_page(uvmem_pfn);
> + if (!trylock_page(dpage))
> + goto out_clear;
> +
> + pvt = kzalloc(sizeof(*pvt), GFP_KERNEL);

While re-arranging the code, I had moved this allocation outside of the
spinlock and changed it from GFP_ATOMIC to GFP_KERNEL. But later I
realized that the error path exit would be cleaner with the allocation
under the lock. So I moved it back but missed changing it back to GFP_ATOMIC.
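
For context, a minimal sketch of the constraint (generic names, not the
actual kvmppc code): an allocation made while a spinlock is held must use
GFP_ATOMIC, because GFP_KERNEL allocations may sleep and sleeping is not
allowed in atomic context.

#include <linux/slab.h>
#include <linux/spinlock.h>

static void *alloc_under_lock(spinlock_t *lock, size_t size)
{
	void *p;

	spin_lock(lock);
	p = kzalloc(size, GFP_ATOMIC);	/* GFP_KERNEL could sleep here */
	spin_unlock(lock);
	return p;
}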

Here is the updated patch.

From a97e34cdb7e9bc411627690602c6fa484aa16c56 Mon Sep 17 00:00:00 2001
From: Bharata B Rao 
Date: Wed, 22 May 2019 10:13:19 +0530
Subject: [PATCH v8 2/8] kvmppc: Movement of pages between normal and secure
 memory

Manage migration of pages between normal and secure memory of a secure
guest by implementing the H_SVM_PAGE_IN and H_SVM_PAGE_OUT hcalls.

H_SVM_PAGE_IN: Move the content of a normal page to secure page
H_SVM_PAGE_OUT: Move the content of a secure page to normal page

Private ZONE_DEVICE memory equal to the amount of secure memory
available in the platform for running secure guests is created.
Whenever a page belonging to the guest becomes secure, a page from
this private device memory is used to represent and track that secure
page on the HV side. The movement of pages between normal and secure
memory is done via migrate_vma_pages() using UV_PAGE_IN and
UV_PAGE_OUT ucalls.

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/include/asm/hvcall.h   |   4 +
 arch/powerpc/include/asm/kvm_book3s_uvmem.h |  29 ++
 arch/powerpc/include/asm/kvm_host.h |  12 +
 arch/powerpc/include/asm/ultravisor-api.h   |   2 +
 arch/powerpc/include/asm/ultravisor.h   |  14 +
 arch/powerpc/kvm/Makefile   |   3 +
 arch/powerpc/kvm/book3s_hv.c|  19 +
 arch/powerpc/kvm/book3s_hv_uvmem.c  | 431 
 8 files changed, 514 insertions(+)
 create mode 100644 arch/powerpc/include/asm/kvm_book3s_uvmem.h
 create mode 100644 arch/powerpc/kvm/book3s_hv_uvmem.c

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index 2023e327..2595d0144958 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -342,6 +342,10 @@
 #define H_TLB_INVALIDATE   0xF808
 #define H_COPY_TOFROM_GUEST0xF80C
 
+/* Platform-specific hcalls used by the Ultravisor */
+#define H_SVM_PAGE_IN  0xEF00
+#define H_SVM_PAGE_OUT 0xEF04
+
 /* Values for 2nd argument to H_SET_MODE */
 #define H_SET_MODE_RESOURCE_SET_CIABR  1
 #define H_SET_MODE_RESOURCE_SET_DAWR   2
diff --git a/arch/powerpc/include/asm/kvm_book3s_uvmem.h 
b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
new file mode 100644
index ..9603c2b48d67
--- /dev/null
+++ b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __POWERPC_KVM_PPC_HMM_H__
+#define __POWERPC_KVM_PPC_HMM_H__
+
+#ifdef CONFIG_PPC_UV
+unsigned long kvmppc_h_svm_page_in(struct kvm *kvm,
+  unsigned long gra,
+  unsigned long flags,
+  unsigned long page_shift);
+unsigned long kvmppc_h_svm_page_out(struct kvm *kvm,
+   unsigned long gra,
+   unsigned long flags,
+   unsigned long page_shift);
+#else
+static inline unsigned long
+kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gra,
+unsigned long flags, unsigned long page_shift)
+{
+   return H_UNSUPPORTED;
+}
+
+static inline unsigned long
+kvmppc_h_svm_page_out(struct kvm *kvm, unsigned long gra,
+ unsigned long flags, unsigned long page_shift)
+{
+   return H_UNSUPPORTED;
+}
+#endif /* CONFIG_PPC_UV */
+#endif /* __POWERPC_KVM_PPC_HMM_H__ */
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 81cd221ccc04..16633ad3be45 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -869,4 +869,16 @@ static inline void kvm_arch_vcpu_blocking(struct 

Re: [PATCH 1/2] libnvdimm/altmap: Track namespace boundaries in altmap

2019-09-10 Thread Pankaj Gupta


> 
> With PFN_MODE_PMEM namespace, the memmap area is allocated from the device
> area. Some architectures map the memmap area with large page size. On
> architectures like ppc64, a 16MB page for memmap mapping can map 262144 pfns.
> This maps a namespace size of 16G.
> 
> When populating memmap region with 16MB page from the device area,
> make sure the allocated space is not used to map resources outside this
> namespace. Such usage of device area will prevent a namespace destroy.
> 
> Add the resource end pfn in altmap and use that to check if the memmap area
> allocation can map pfns outside the namespace. On ppc64, in such a case, we
> fall back to allocation from memory.
> 
> This fixes the kernel crash reported below:
> 
> [  132.034989] WARNING: CPU: 13 PID: 13719 at mm/memremap.c:133
> devm_memremap_pages_release+0x2d8/0x2e0
> [  133.464754] BUG: Unable to handle kernel data access at 0xc00c00010b204000
> [  133.464760] Faulting instruction address: 0xc007580c
> [  133.464766] Oops: Kernel access of bad area, sig: 11 [#1]
> [  133.464771] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> .
> [  133.464901] NIP [c007580c] vmemmap_free+0x2ac/0x3d0
> [  133.464906] LR [c00757f8] vmemmap_free+0x298/0x3d0
> [  133.464910] Call Trace:
> [  133.464914] [c07cbfd0f7b0] [c00757f8] vmemmap_free+0x298/0x3d0
> (unreliable)
> [  133.464921] [c07cbfd0f8d0] [c0370a44]
> section_deactivate+0x1a4/0x240
> [  133.464928] [c07cbfd0f980] [c0386270]
> __remove_pages+0x3a0/0x590
> [  133.464935] [c07cbfd0fa50] [c0074158]
> arch_remove_memory+0x88/0x160
> [  133.464942] [c07cbfd0fae0] [c03be8c0]
> devm_memremap_pages_release+0x150/0x2e0
> [  133.464949] [c07cbfd0fb70] [c0738ea0]
> devm_action_release+0x30/0x50
> [  133.464955] [c07cbfd0fb90] [c073a5a4]
> release_nodes+0x344/0x400
> [  133.464961] [c07cbfd0fc40] [c073378c]
> device_release_driver_internal+0x15c/0x250
> [  133.464968] [c07cbfd0fc80] [c072fd14] unbind_store+0x104/0x110
> [  133.464973] [c07cbfd0fcd0] [c072ee24] drv_attr_store+0x44/0x70
> [  133.464981] [c07cbfd0fcf0] [c04a32bc] sysfs_kf_write+0x6c/0xa0
> [  133.464987] [c07cbfd0fd10] [c04a1dfc]
> kernfs_fop_write+0x17c/0x250
> [  133.464993] [c07cbfd0fd60] [c03c348c] __vfs_write+0x3c/0x70
> [  133.464999] [c07cbfd0fd80] [c03c75d0] vfs_write+0xd0/0x250
> 
> Reported-by: Sachin Sant 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/mm/init_64.c | 17 -
>  drivers/nvdimm/pfn_devs.c |  2 ++
>  include/linux/memremap.h  |  1 +
>  3 files changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
> index a44f6281ca3a..4e08246acd79 100644
> --- a/arch/powerpc/mm/init_64.c
> +++ b/arch/powerpc/mm/init_64.c
> @@ -172,6 +172,21 @@ static __meminit void vmemmap_list_populate(unsigned
> long phys,
>   vmemmap_list = vmem_back;
>  }
>  
> +static bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long
> start,
> + unsigned long page_size)
> +{
> + unsigned long nr_pfn = page_size / sizeof(struct page);
> + unsigned long start_pfn = page_to_pfn((struct page *)start);
> +
> + if ((start_pfn + nr_pfn) > altmap->end_pfn)
> + return true;
> +
> + if (start_pfn < altmap->base_pfn)
> + return true;
> +
> + return false;
> +}
> +
>  int __meminit vmemmap_populate(unsigned long start, unsigned long end, int
>  node,
>   struct vmem_altmap *altmap)
>  {
> @@ -194,7 +209,7 @@ int __meminit vmemmap_populate(unsigned long start,
> unsigned long end, int node,
>* fail due to alignment issues when using 16MB hugepages, so
>* fall back to system memory if the altmap allocation fail.
>*/
> - if (altmap) {
> + if (altmap && !altmap_cross_boundary(altmap, start, page_size)) 
> {
>   p = altmap_alloc_block_buf(page_size, altmap);
>   if (!p)
>   pr_debug("altmap block allocation failed, 
> falling back to system
>   memory");
> diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
> index 3e7b11cf1aae..a616d69c8224 100644
> --- a/drivers/nvdimm/pfn_devs.c
> +++ b/drivers/nvdimm/pfn_devs.c
> @@ -618,9 +618,11 @@ static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn,
> struct dev_pagemap *pgmap)
>   struct nd_namespace_common *ndns = nd_pfn->ndns;
>   struct nd_namespace_io *nsio = to_nd_namespace_io(>dev);
>   resource_size_t base = nsio->res.start + start_pad;
> + resource_size_t end = nsio->res.end - end_trunc;
>   struct vmem_altmap __altmap = {
>   .base_pfn = init_altmap_base(base),
>   .reserve = init_altmap_reserve(base),
> + .end_pfn = 

[PATCH v3 15/15] powerpc/32s: Activate CONFIG_VMAP_STACK

2019-09-10 Thread Christophe Leroy
A few changes to retrieve DAR and DSISR from struct regs
instead of retrieving them directly, as they may have
changed due to a TLB miss.

Also modifies hash_page() and friends to work with virtual
data addresses instead of physical ones.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/entry_32.S |  4 +++
 arch/powerpc/kernel/head_32.S  | 19 +++---
 arch/powerpc/kernel/head_32.h  |  4 ++-
 arch/powerpc/mm/book3s32/hash_low.S| 46 +-
 arch/powerpc/mm/book3s32/mmu.c |  9 +--
 arch/powerpc/platforms/Kconfig.cputype |  2 ++
 6 files changed, 61 insertions(+), 23 deletions(-)

diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index 00fcf954e742..1d3b152ee54f 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -1365,7 +1365,11 @@ _GLOBAL(enter_rtas)
lis r6,1f@ha/* physical return address for rtas */
addir6,r6,1f@l
tophys(r6,r6)
+#ifdef CONFIG_VMAP_STACK
+   mr  r7, r1
+#else
tophys(r7,r1)
+#endif
lwz r8,RTASENTRY(r4)
lwz r4,RTASBASE(r4)
mfmsr   r9
diff --git a/arch/powerpc/kernel/head_32.S b/arch/powerpc/kernel/head_32.S
index 5bda6a092673..97bc02306a34 100644
--- a/arch/powerpc/kernel/head_32.S
+++ b/arch/powerpc/kernel/head_32.S
@@ -272,14 +272,22 @@ __secondary_hold_acknowledge:
  */
. = 0x200
DO_KVM  0x200
+MachineCheck:
EXCEPTION_PROLOG_0
+#ifdef CONFIG_VMAP_STACK
+   li  r11, MSR_KERNEL & ~(MSR_IR | MSR_RI) /* can take DTLB miss */
+   mtmsr   r11
+#endif
 #ifdef CONFIG_PPC_CHRP
mfspr   r11, SPRN_SPRG_THREAD
+#ifdef CONFIG_VMAP_STACK
+   tovirt(r11, r11)
+#endif
lwz r11, RTAS_SP(r11)
cmpwi   cr1, r11, 0
bne cr1, 7f
 #endif /* CONFIG_PPC_CHRP */
-   EXCEPTION_PROLOG_1
+   EXCEPTION_PROLOG_1 rtas
 7: EXCEPTION_PROLOG_2
addir3,r1,STACK_FRAME_OVERHEAD
 #ifdef CONFIG_PPC_CHRP
@@ -294,7 +302,7 @@ __secondary_hold_acknowledge:
. = 0x300
DO_KVM  0x300
 DataAccess:
-   EXCEPTION_PROLOG
+   EXCEPTION_PROLOG dar
get_and_save_dar_dsisr_on_stack r4, r5, r11
 BEGIN_MMU_FTR_SECTION
 #ifdef CONFIG_PPC_KUAP
@@ -336,7 +344,7 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_HPTE_TABLE)
. = 0x600
DO_KVM  0x600
 Alignment:
-   EXCEPTION_PROLOG
+   EXCEPTION_PROLOG dar
save_dar_dsisr_on_stack r4, r5, r11
addir3,r1,STACK_FRAME_OVERHEAD
EXC_XFER_STD(0x600, alignment_exception)
@@ -643,6 +651,11 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_NEED_DTLB_SW_LRU)
 handle_page_fault_tramp:
EXC_XFER_LITE(0x300, handle_page_fault)
 
+#ifdef CONFIG_VMAP_STACK
+stack_ovf_trampoline:
+   b   stack_ovf
+#endif
+
 AltiVecUnavailable:
EXCEPTION_PROLOG
 #ifdef CONFIG_ALTIVEC
diff --git a/arch/powerpc/kernel/head_32.h b/arch/powerpc/kernel/head_32.h
index 283d4298d555..ae2c8e07e1d5 100644
--- a/arch/powerpc/kernel/head_32.h
+++ b/arch/powerpc/kernel/head_32.h
@@ -38,10 +38,12 @@
andi.   r11, r11, MSR_PR
 .endm
 
-.macro EXCEPTION_PROLOG_1
+.macro EXCEPTION_PROLOG_1 rtas
 #ifdef CONFIG_VMAP_STACK
+   .ifb\rtas
li  r11, MSR_KERNEL & ~(MSR_IR | MSR_RI) /* can take DTLB miss */
mtmsr   r11
+   .endif
subir11, r1, INT_FRAME_SIZE /* use r1 if kernel */
 #else
tophys(r11,r1)  /* use tophys(r1) if kernel */
diff --git a/arch/powerpc/mm/book3s32/hash_low.S 
b/arch/powerpc/mm/book3s32/hash_low.S
index 8bbbd9775c8a..c11b0a005196 100644
--- a/arch/powerpc/mm/book3s32/hash_low.S
+++ b/arch/powerpc/mm/book3s32/hash_low.S
@@ -25,6 +25,12 @@
 #include 
 #include 
 
+#ifdef CONFIG_VMAP_STACK
+#define ADDR_OFFSET0
+#else
+#define ADDR_OFFSETPAGE_OFFSET
+#endif
+
 #ifdef CONFIG_SMP
.section .bss
.align  2
@@ -47,8 +53,8 @@ mmu_hash_lock:
.text
 _GLOBAL(hash_page)
 #ifdef CONFIG_SMP
-   lis r8, (mmu_hash_lock - PAGE_OFFSET)@h
-   ori r8, r8, (mmu_hash_lock - PAGE_OFFSET)@l
+   lis r8, (mmu_hash_lock - ADDR_OFFSET)@h
+   ori r8, r8, (mmu_hash_lock - ADDR_OFFSET)@l
lis r0,0x0fff
b   10f
 11:lwz r6,0(r8)
@@ -66,9 +72,12 @@ _GLOBAL(hash_page)
cmplw   0,r4,r0
ori r3,r3,_PAGE_USER|_PAGE_PRESENT /* test low addresses as user */
mfspr   r5, SPRN_SPRG_PGDIR /* phys page-table root */
+#ifdef CONFIG_VMAP_STACK
+   tovirt(r5, r5)
+#endif
blt+112f/* assume user more likely */
-   lis r5, (swapper_pg_dir - PAGE_OFFSET)@ha   /* if kernel address, 
use */
-   addir5 ,r5 ,(swapper_pg_dir - PAGE_OFFSET)@l/* kernel page 
table */
+   lis r5, (swapper_pg_dir - ADDR_OFFSET)@ha   /* if kernel address, 
use */
+   addir5 ,r5 ,(swapper_pg_dir - ADDR_OFFSET)@l/* kernel page 

[PATCH v3 14/15] powerpc/32s: reorganise DSI handler.

2019-09-10 Thread Christophe Leroy
The part dedicated to handling hash_page() is fully unneeded for
processors without real hash pages, like the 603.

Let's enlarge the content of the feature fixup, and provide
an alternative which jumps directly instead of getting NIPs.

Also, in preparation for VMAP stacks, the end of the DSI handler has
moved later in the code, as it won't fit anymore once VMAP stacks
are there.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/head_32.S | 29 +++--
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/kernel/head_32.S b/arch/powerpc/kernel/head_32.S
index 449625b4ff03..5bda6a092673 100644
--- a/arch/powerpc/kernel/head_32.S
+++ b/arch/powerpc/kernel/head_32.S
@@ -295,24 +295,22 @@ __secondary_hold_acknowledge:
DO_KVM  0x300
 DataAccess:
EXCEPTION_PROLOG
-   mfspr   r10,SPRN_DSISR
-   stw r10,_DSISR(r11)
+   get_and_save_dar_dsisr_on_stack r4, r5, r11
+BEGIN_MMU_FTR_SECTION
 #ifdef CONFIG_PPC_KUAP
-   andis.  r0,r10,(DSISR_BAD_FAULT_32S | DSISR_DABRMATCH | 
DSISR_PROTFAULT)@h
+   andis.  r0, r5, (DSISR_BAD_FAULT_32S | DSISR_DABRMATCH | 
DSISR_PROTFAULT)@h
 #else
-   andis.  r0,r10,(DSISR_BAD_FAULT_32S|DSISR_DABRMATCH)@h
+   andis.  r0, r5, (DSISR_BAD_FAULT_32S | DSISR_DABRMATCH)@h
 #endif
-   bne 1f  /* if not, try to put a PTE */
-   mfspr   r4,SPRN_DAR /* into the hash table */
-   rlwinm  r3,r10,32-15,21,21  /* DSISR_STORE -> _PAGE_RW */
-BEGIN_MMU_FTR_SECTION
+   bne handle_page_fault_tramp /* if not, try to put a PTE */
+   rlwinm  r3, r5, 32 - 15, 21, 21 /* DSISR_STORE -> _PAGE_RW */
bl  hash_page
-END_MMU_FTR_SECTION_IFSET(MMU_FTR_HPTE_TABLE)
-1: lwz r5,_DSISR(r11)  /* get DSISR value */
-   mfspr   r4,SPRN_DAR
-   stw r4, _DAR(r11)
-   EXC_XFER_LITE(0x300, handle_page_fault)
-
+   lwz r5, _DSISR(r11) /* get DSISR value */
+   lwz r4, _DAR(r11)
+   b   handle_page_fault_tramp
+FTR_SECTION_ELSE
+   b   handle_page_fault_tramp
+ALT_MMU_FTR_SECTION_END_IFSET(MMU_FTR_HPTE_TABLE)
 
 /* Instruction access exception. */
. = 0x400
@@ -642,6 +640,9 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_NEED_DTLB_SW_LRU)
 
. = 0x3000
 
+handle_page_fault_tramp:
+   EXC_XFER_LITE(0x300, handle_page_fault)
+
 AltiVecUnavailable:
EXCEPTION_PROLOG
 #ifdef CONFIG_ALTIVEC
-- 
2.13.3



[PATCH v3 09/15] powerpc/8xx: Use alternative scratch registers in DTLB miss handler

2019-09-10 Thread Christophe Leroy
In preparation for handling CONFIG_VMAP_STACK, the DTLB miss handler needs
to use different scratch registers than the other exception handlers, so
that exception entry is not jeopardised when the stack itself takes a
DTLB miss.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/head_8xx.S | 27 ++-
 arch/powerpc/perf/8xx-pmu.c| 12 
 2 files changed, 22 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index 25e19af49705..3de9c5f1746c 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -193,8 +193,9 @@ SystemCall:
 0: lwz r10, (dtlb_miss_counter - PAGE_OFFSET)@l(0)
addir10, r10, 1
stw r10, (dtlb_miss_counter - PAGE_OFFSET)@l(0)
-   mfspr   r10, SPRN_SPRG_SCRATCH0
-   mfspr   r11, SPRN_SPRG_SCRATCH1
+   mfspr   r10, SPRN_DAR
+   mtspr   SPRN_DAR, r11   /* Tag DAR */
+   mfspr   r11, SPRN_M_TW
rfi
 #endif
 
@@ -337,8 +338,8 @@ ITLBMissLinear:
 
. = 0x1200
 DataStoreTLBMiss:
-   mtspr   SPRN_SPRG_SCRATCH0, r10
-   mtspr   SPRN_SPRG_SCRATCH1, r11
+   mtspr   SPRN_DAR, r10
+   mtspr   SPRN_M_TW, r11
mfcrr11
 
/* If we are faulting a kernel address, we have to use the
@@ -403,10 +404,10 @@ DataStoreTLBMiss:
mtspr   SPRN_MD_RPN, r10/* Update TLB entry */
 
/* Restore registers */
-   mtspr   SPRN_DAR, r11   /* Tag DAR */
 
-0: mfspr   r10, SPRN_SPRG_SCRATCH0
-   mfspr   r11, SPRN_SPRG_SCRATCH1
+0: mfspr   r10, SPRN_DAR
+   mtspr   SPRN_DAR, r11   /* Tag DAR */
+   mfspr   r11, SPRN_M_TW
rfi
patch_site  0b, patch__dtlbmiss_exit_1
 
@@ -422,10 +423,10 @@ DTLBMissIMMR:
mtspr   SPRN_MD_RPN, r10/* Update TLB entry */
 
li  r11, RPN_PATTERN
-   mtspr   SPRN_DAR, r11   /* Tag DAR */
 
-0: mfspr   r10, SPRN_SPRG_SCRATCH0
-   mfspr   r11, SPRN_SPRG_SCRATCH1
+0: mfspr   r10, SPRN_DAR
+   mtspr   SPRN_DAR, r11   /* Tag DAR */
+   mfspr   r11, SPRN_M_TW
rfi
patch_site  0b, patch__dtlbmiss_exit_2
 
@@ -459,10 +460,10 @@ DTLBMissLinear:
mtspr   SPRN_MD_RPN, r10/* Update TLB entry */
 
li  r11, RPN_PATTERN
-   mtspr   SPRN_DAR, r11   /* Tag DAR */
 
-0: mfspr   r10, SPRN_SPRG_SCRATCH0
-   mfspr   r11, SPRN_SPRG_SCRATCH1
+0: mfspr   r10, SPRN_DAR
+   mtspr   SPRN_DAR, r11   /* Tag DAR */
+   mfspr   r11, SPRN_M_TW
rfi
patch_site  0b, patch__dtlbmiss_exit_3
 
diff --git a/arch/powerpc/perf/8xx-pmu.c b/arch/powerpc/perf/8xx-pmu.c
index 19124b0b171a..1ad03c55c88c 100644
--- a/arch/powerpc/perf/8xx-pmu.c
+++ b/arch/powerpc/perf/8xx-pmu.c
@@ -157,10 +157,6 @@ static void mpc8xx_pmu_read(struct perf_event *event)
 
 static void mpc8xx_pmu_del(struct perf_event *event, int flags)
 {
-   /* mfspr r10, SPRN_SPRG_SCRATCH0 */
-   unsigned int insn = PPC_INST_MFSPR | __PPC_RS(R10) |
-   __PPC_SPR(SPRN_SPRG_SCRATCH0);
-
mpc8xx_pmu_read(event);
 
/* If it was the last user, stop counting to avoid useles overhead */
@@ -173,6 +169,10 @@ static void mpc8xx_pmu_del(struct perf_event *event, int flags)
break;
case PERF_8xx_ID_ITLB_LOAD_MISS:
if (atomic_dec_return(&itlb_miss_ref) == 0) {
+   /* mfspr r10, SPRN_SPRG_SCRATCH0 */
+   unsigned int insn = PPC_INST_MFSPR | __PPC_RS(R10) |
+   __PPC_SPR(SPRN_SPRG_SCRATCH0);
+
patch_instruction_site(__itlbmiss_exit_1, insn);
 #ifndef CONFIG_PIN_TLB_TEXT
patch_instruction_site(__itlbmiss_exit_2, insn);
@@ -181,6 +181,10 @@ static void mpc8xx_pmu_del(struct perf_event *event, int flags)
break;
case PERF_8xx_ID_DTLB_LOAD_MISS:
if (atomic_dec_return(&dtlb_miss_ref) == 0) {
+   /* mfspr r10, SPRN_DAR */
+   unsigned int insn = PPC_INST_MFSPR | __PPC_RS(R10) |
+   __PPC_SPR(SPRN_DAR);
+
patch_instruction_site(__dtlbmiss_exit_1, insn);
patch_instruction_site(__dtlbmiss_exit_2, insn);
patch_instruction_site(__dtlbmiss_exit_3, insn);
-- 
2.13.3



[PATCH v3 11/15] powerpc/8xx: move DataStoreTLBMiss perf handler

2019-09-10 Thread Christophe Leroy
Move the DataStoreTLBMiss perf handler in order to cope with the future
growth of the exception prolog.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/head_8xx.S | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index 5aa63693f790..1e718e47fe3c 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -166,18 +166,6 @@ SystemCall:
  */
EXCEPTION(0x1000, SoftEmu, program_check_exception, EXC_XFER_STD)
 
-/* Called from DataStoreTLBMiss when perf TLB misses events are activated */
-#ifdef CONFIG_PERF_EVENTS
-   patch_site  0f, patch__dtlbmiss_perf
-0: lwz r10, (dtlb_miss_counter - PAGE_OFFSET)@l(0)
-   addir10, r10, 1
-   stw r10, (dtlb_miss_counter - PAGE_OFFSET)@l(0)
-   mfspr   r10, SPRN_DAR
-   mtspr   SPRN_DAR, r11   /* Tag DAR */
-   mfspr   r11, SPRN_M_TW
-   rfi
-#endif
-
. = 0x1100
 /*
  * For the MPC8xx, this is a software tablewalk to load the instruction
@@ -486,6 +474,18 @@ DARFixed:/* Return from dcbx instruction bug workaround */
/* 0x300 is DataAccess exception, needed by bad_page_fault() */
EXC_XFER_LITE(0x300, handle_page_fault)
 
+/* Called from DataStoreTLBMiss when perf TLB misses events are activated */
+#ifdef CONFIG_PERF_EVENTS
+   patch_site  0f, patch__dtlbmiss_perf
+0: lwz r10, (dtlb_miss_counter - PAGE_OFFSET)@l(0)
+   addir10, r10, 1
+   stw r10, (dtlb_miss_counter - PAGE_OFFSET)@l(0)
+   mfspr   r10, SPRN_DAR
+   mtspr   SPRN_DAR, r11   /* Tag DAR */
+   mfspr   r11, SPRN_M_TW
+   rfi
+#endif
+
 /* On the MPC8xx, these next four traps are used for development
  * support of breakpoints and such.  Someday I will get around to
  * using them.
-- 
2.13.3



[PATCH v3 06/15] powerpc/32: prepare for CONFIG_VMAP_STACK

2019-09-10 Thread Christophe Leroy
To support CONFIG_VMAP_STACK, the kernel has to activate Data MMU
Translation for accessing the stack. Before doing that, it must save
SRR0, SRR1 and also DAR and DSISR when relevant, in order not to lose
them if a Data TLB Miss occurs once translation is reactivated.

This patch adds fields to the thread struct for saving those registers.
It prepares entry_32.S to handle exception entry with
Data MMU Translation enabled and alters the EXCEPTION_PROLOG macros to
save SRR0, SRR1, DAR and DSISR and then re-enable the Data MMU.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/processor.h   |   6 ++
 arch/powerpc/include/asm/thread_info.h |   5 ++
 arch/powerpc/kernel/asm-offsets.c  |   6 ++
 arch/powerpc/kernel/entry_32.S |   7 +++
 arch/powerpc/kernel/head_32.h  | 101 +
 5 files changed, 115 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index a9993e7a443b..92c02d15f117 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -163,6 +163,12 @@ struct thread_struct {
 #if defined(CONFIG_PPC_BOOK3S_32) && defined(CONFIG_PPC_KUAP)
unsigned long   kuap;   /* opened segments for user access */
 #endif
+#ifdef CONFIG_VMAP_STACK
+   unsigned long   srr0;
+   unsigned long   srr1;
+   unsigned long   dar;
+   unsigned long   dsisr;
+#endif
/* Debug Registers */
struct debug_reg debug;
struct thread_fp_state  fp_state;
diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
index 8e1d0195ac36..488d5c4670ff 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -10,10 +10,15 @@
 #define _ASM_POWERPC_THREAD_INFO_H
 
 #include 
+#include 
 
 #ifdef __KERNEL__
 
+#if defined(CONFIG_VMAP_STACK) && CONFIG_THREAD_SHIFT < PAGE_SHIFT
+#define THREAD_SHIFT   PAGE_SHIFT
+#else
 #define THREAD_SHIFT   CONFIG_THREAD_SHIFT
+#endif
 
 #define THREAD_SIZE(1 << THREAD_SHIFT)
 
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 484f54dab247..782cbf489ab0 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -127,6 +127,12 @@ int main(void)
OFFSET(KSP_VSID, thread_struct, ksp_vsid);
 #else /* CONFIG_PPC64 */
OFFSET(PGDIR, thread_struct, pgdir);
+#ifdef CONFIG_VMAP_STACK
+   OFFSET(SRR0, thread_struct, srr0);
+   OFFSET(SRR1, thread_struct, srr1);
+   OFFSET(DAR, thread_struct, dar);
+   OFFSET(DSISR, thread_struct, dsisr);
+#endif
 #ifdef CONFIG_SPE
OFFSET(THREAD_EVR0, thread_struct, evr[0]);
OFFSET(THREAD_ACC, thread_struct, acc);
diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index 317ad9df8ba8..2a26fe19f0b1 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -140,6 +140,9 @@ transfer_to_handler:
stw r12,_CTR(r11)
stw r2,_XER(r11)
mfspr   r12,SPRN_SPRG_THREAD
+#ifdef CONFIG_VMAP_STACK
+   tovirt(r12, r12)
+#endif
beq 2f  /* if from user, fix up THREAD.regs */
addir2, r12, -THREAD
addir11,r1,STACK_FRAME_OVERHEAD
@@ -195,7 +198,11 @@ transfer_to_handler:
 transfer_to_handler_cont:
 3:
mflrr9
+#ifdef CONFIG_VMAP_STACK
+   tovirt(r9, r9)
+#else
tovirt(r2, r2)  /* set r2 to current */
+#endif
lwz r11,0(r9)   /* virtual address of handler */
lwz r9,4(r9)/* where to go when done */
 #if defined(CONFIG_PPC_8xx) && defined(CONFIG_PERF_EVENTS)
diff --git a/arch/powerpc/kernel/head_32.h b/arch/powerpc/kernel/head_32.h
index f19a1ab91fb5..59e775930be8 100644
--- a/arch/powerpc/kernel/head_32.h
+++ b/arch/powerpc/kernel/head_32.h
@@ -10,31 +10,57 @@
  * We assume sprg3 has the physical address of the current
  * task's thread_struct.
  */
-.macro EXCEPTION_PROLOG
-   EXCEPTION_PROLOG_0
+.macro EXCEPTION_PROLOG ext
+   EXCEPTION_PROLOG_0  \ext
EXCEPTION_PROLOG_1
-   EXCEPTION_PROLOG_2
+   EXCEPTION_PROLOG_2  \ext
 .endm
 
-.macro EXCEPTION_PROLOG_0
+.macro EXCEPTION_PROLOG_0 ext
mtspr   SPRN_SPRG_SCRATCH0,r10
mtspr   SPRN_SPRG_SCRATCH1,r11
+#ifdef CONFIG_VMAP_STACK
+   mfspr   r10, SPRN_SPRG_THREAD
+   .ifnb   \ext
+   mfspr   r11, SPRN_DAR
+   stw r11, DAR(r10)
+   mfspr   r11, SPRN_DSISR
+   stw r11, DSISR(r10)
+   .endif
+   mfspr   r11, SPRN_SRR0
+   stw r11, SRR0(r10)
+#endif
mfspr   r11, SPRN_SRR1  /* check whether user or kernel */
+#ifdef CONFIG_VMAP_STACK
+   stw r11, SRR1(r10)
+#endif
mfcrr10
andi.   r11, r11, MSR_PR
 .endm
 
 .macro EXCEPTION_PROLOG_1
+#ifdef CONFIG_VMAP_STACK
+   li  r11, 

[PATCH v3 07/15] powerpc: align stack to 2 * THREAD_SIZE with VMAP_STACK

2019-09-10 Thread Christophe Leroy
In order to ease stack overflow detection, align
stack to 2 * THREAD_SIZE when using VMAP_STACK.
This allows overflow detection using a single bit check.
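
As a rough illustration (not from the patch itself), this small user-space
sketch shows why aligning the stack base to 2 * THREAD_SIZE reduces overflow
detection to a single bit test. THREAD_SHIFT and the stack base below are
assumed values chosen only for the example.

/*
 * Sketch only: with the stack base aligned to 2 * THREAD_SIZE, every valid
 * stack pointer shares bit THREAD_SHIFT with the base, so a pointer that
 * ran past the stack flips that single bit.
 */
#include <stdio.h>

#define THREAD_SHIFT	13			/* assumed: 8K kernel stacks */
#define THREAD_SIZE	(1UL << THREAD_SHIFT)

static int stack_overflowed(unsigned long sp, unsigned long base)
{
	/* base is aligned to 2 * THREAD_SIZE; the stack occupies [base, base + THREAD_SIZE) */
	return !!((sp ^ base) & THREAD_SIZE);
}

int main(void)
{
	unsigned long base = 0xc9000000UL;	/* hypothetical VMAP stack base */

	printf("%d\n", stack_overflowed(base + 0x100, base));	/* 0: still on the stack */
	printf("%d\n", stack_overflowed(base - 0x10, base));	/* 1: ran past the bottom */
	return 0;
}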

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/thread_info.h | 13 +
 arch/powerpc/kernel/setup_32.c |  2 +-
 arch/powerpc/kernel/setup_64.c |  2 +-
 arch/powerpc/kernel/vmlinux.lds.S  |  2 +-
 4 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
index 488d5c4670ff..a2270749b282 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -22,6 +22,19 @@
 
 #define THREAD_SIZE(1 << THREAD_SHIFT)
 
+/*
+ * By aligning VMAP'd stacks to 2 * THREAD_SIZE, we can detect overflow by
+ * checking sp & (1 << THREAD_SHIFT), which we can do cheaply in the entry
+ * assembly.
+ */
+#ifdef CONFIG_VMAP_STACK
+#define THREAD_ALIGN_SHIFT (THREAD_SHIFT + 1)
+#else
+#define THREAD_ALIGN_SHIFT THREAD_SHIFT
+#endif
+
+#define THREAD_ALIGN   (1 << THREAD_ALIGN_SHIFT)
+
 #ifndef __ASSEMBLY__
 #include 
 #include 
diff --git a/arch/powerpc/kernel/setup_32.c b/arch/powerpc/kernel/setup_32.c
index a7541edf0cdb..180e658c1a6b 100644
--- a/arch/powerpc/kernel/setup_32.c
+++ b/arch/powerpc/kernel/setup_32.c
@@ -137,7 +137,7 @@ arch_initcall(ppc_init);
 
 static void *__init alloc_stack(void)
 {
-   void *ptr = memblock_alloc(THREAD_SIZE, THREAD_SIZE);
+   void *ptr = memblock_alloc(THREAD_SIZE, THREAD_ALIGN);
 
if (!ptr)
panic("cannot allocate %d bytes for stack at %pS\n",
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index 44b4c432a273..f630fe4d36a8 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -644,7 +644,7 @@ static void *__init alloc_stack(unsigned long limit, int cpu)
 
BUILD_BUG_ON(STACK_INT_FRAME_SIZE % 16);
 
-   ptr = memblock_alloc_try_nid(THREAD_SIZE, THREAD_SIZE,
+   ptr = memblock_alloc_try_nid(THREAD_SIZE, THREAD_ALIGN,
 MEMBLOCK_LOW_LIMIT, limit,
 early_cpu_to_node(cpu));
if (!ptr)
diff --git a/arch/powerpc/kernel/vmlinux.lds.S b/arch/powerpc/kernel/vmlinux.lds.S
index 060a1acd7c6d..d38335129c06 100644
--- a/arch/powerpc/kernel/vmlinux.lds.S
+++ b/arch/powerpc/kernel/vmlinux.lds.S
@@ -346,7 +346,7 @@ SECTIONS
 #endif
 
/* The initial task and kernel stack */
-   INIT_TASK_DATA_SECTION(THREAD_SIZE)
+   INIT_TASK_DATA_SECTION(THREAD_ALIGN)
 
.data..page_aligned : AT(ADDR(.data..page_aligned) - LOAD_OFFSET) {
PAGE_ALIGNED_DATA(PAGE_SIZE)
-- 
2.13.3



[PATCH v3 05/15] powerpc/32: add a macro to get and/or save DAR and DSISR on stack.

2019-09-10 Thread Christophe Leroy
Refactor reading and saving of DAR and DSISR in exception vectors.

This will ease the implementation of VMAP stack.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/head_32.S  |  5 +
 arch/powerpc/kernel/head_32.h  | 11 +++
 arch/powerpc/kernel/head_8xx.S | 23 +++
 3 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/kernel/head_32.S b/arch/powerpc/kernel/head_32.S
index bebb49d877f2..449625b4ff03 100644
--- a/arch/powerpc/kernel/head_32.S
+++ b/arch/powerpc/kernel/head_32.S
@@ -339,10 +339,7 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_HPTE_TABLE)
DO_KVM  0x600
 Alignment:
EXCEPTION_PROLOG
-   mfspr   r4,SPRN_DAR
-   stw r4,_DAR(r11)
-   mfspr   r5,SPRN_DSISR
-   stw r5,_DSISR(r11)
+   save_dar_dsisr_on_stack r4, r5, r11
addir3,r1,STACK_FRAME_OVERHEAD
EXC_XFER_STD(0x600, alignment_exception)
 
diff --git a/arch/powerpc/kernel/head_32.h b/arch/powerpc/kernel/head_32.h
index 436ffd862d2a..f19a1ab91fb5 100644
--- a/arch/powerpc/kernel/head_32.h
+++ b/arch/powerpc/kernel/head_32.h
@@ -144,6 +144,17 @@
RFI /* jump to handler, enable MMU */
 .endm
 
+.macro save_dar_dsisr_on_stack reg1, reg2, sp
+   mfspr   \reg1, SPRN_DAR
+   mfspr   \reg2, SPRN_DSISR
+   stw \reg1, _DAR(\sp)
+   stw \reg2, _DSISR(\sp)
+.endm
+
+.macro get_and_save_dar_dsisr_on_stack reg1, reg2, sp
+   save_dar_dsisr_on_stack \reg1, \reg2, \sp
+.endm
+
 /*
  * Note: code which follows this uses cr0.eq (set if from kernel),
  * r11, r12 (SRR0), and r9 (SRR1).
diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index 175c3cfc8014..25e19af49705 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -128,12 +128,9 @@ instruction_counter:
. = 0x200
 MachineCheck:
EXCEPTION_PROLOG
-   mfspr r4,SPRN_DAR
-   stw r4,_DAR(r11)
-   li r5,RPN_PATTERN
-   mtspr SPRN_DAR,r5   /* Tag DAR, to be used in DTLB Error */
-   mfspr r5,SPRN_DSISR
-   stw r5,_DSISR(r11)
+   save_dar_dsisr_on_stack r4, r5, r11
+   li  r6, RPN_PATTERN
+   mtspr   SPRN_DAR, r6/* Tag DAR, to be used in DTLB Error */
addi r3,r1,STACK_FRAME_OVERHEAD
EXC_XFER_STD(0x200, machine_check_exception)
 
@@ -156,12 +153,9 @@ InstructionAccess:
. = 0x600
 Alignment:
EXCEPTION_PROLOG
-   mfspr   r4,SPRN_DAR
-   stw r4,_DAR(r11)
-   li  r5,RPN_PATTERN
-   mtspr   SPRN_DAR,r5 /* Tag DAR, to be used in DTLB Error */
-   mfspr   r5,SPRN_DSISR
-   stw r5,_DSISR(r11)
+   save_dar_dsisr_on_stack r4, r5, r11
+   li  r6, RPN_PATTERN
+   mtspr   SPRN_DAR, r6/* Tag DAR, to be used in DTLB Error */
addir3,r1,STACK_FRAME_OVERHEAD
EXC_XFER_STD(0x600, alignment_exception)
 
@@ -502,10 +496,7 @@ DataTLBError:
 DARFixed:/* Return from dcbx instruction bug workaround */
EXCEPTION_PROLOG_1
EXCEPTION_PROLOG_2
-   mfspr   r5,SPRN_DSISR
-   stw r5,_DSISR(r11)
-   mfspr   r4,SPRN_DAR
-   stw r4, _DAR(r11)
+   get_and_save_dar_dsisr_on_stack r4, r5, r11
andis.  r10,r5,DSISR_NOHPTE@h
beq+.Ldtlbie
tlbie   r4
-- 
2.13.3



[PATCH v3 13/15] powerpc/8xx: Enable CONFIG_VMAP_STACK

2019-09-10 Thread Christophe Leroy
This patch enables CONFIG_VMAP_STACK. For that, a few changes are
done in head_8xx.S.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/head_8xx.S | 34 --
 arch/powerpc/platforms/Kconfig.cputype |  1 +
 2 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index 225e242ce1c5..fc6d4d10e298 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -127,7 +127,7 @@ instruction_counter:
 /* Machine check */
. = 0x200
 MachineCheck:
-   EXCEPTION_PROLOG
+   EXCEPTION_PROLOG dar
save_dar_dsisr_on_stack r4, r5, r11
li  r6, RPN_PATTERN
mtspr   SPRN_DAR, r6/* Tag DAR, to be used in DTLB Error */
@@ -140,7 +140,7 @@ MachineCheck:
 /* Alignment exception */
. = 0x600
 Alignment:
-   EXCEPTION_PROLOG
+   EXCEPTION_PROLOG dar
save_dar_dsisr_on_stack r4, r5, r11
li  r6, RPN_PATTERN
mtspr   SPRN_DAR, r6/* Tag DAR, to be used in DTLB Error */
@@ -457,20 +457,26 @@ InstructionTLBError:
  */
. = 0x1400
 DataTLBError:
-   EXCEPTION_PROLOG_0
+   EXCEPTION_PROLOG_0 dar
mfspr   r11, SPRN_DAR
cmpwi   cr1, r11, RPN_PATTERN
beq-cr1, FixupDAR   /* must be a buggy dcbX, icbi insn. */
 DARFixed:/* Return from dcbx instruction bug workaround */
+#ifdef CONFIG_VMAP_STACK
+   li  r11, RPN_PATTERN
+   mtspr   SPRN_DAR, r11   /* Tag DAR, to be used in DTLB Error */
+#endif
EXCEPTION_PROLOG_1
-   EXCEPTION_PROLOG_2
+   EXCEPTION_PROLOG_2 dar
get_and_save_dar_dsisr_on_stack r4, r5, r11
andis.  r10,r5,DSISR_NOHPTE@h
beq+.Ldtlbie
tlbie   r4
 .Ldtlbie:
+#ifndef CONFIG_VMAP_STACK
li  r10,RPN_PATTERN
mtspr   SPRN_DAR,r10/* Tag DAR, to be used in DTLB Error */
+#endif
/* 0x300 is DataAccess exception, needed by bad_page_fault() */
EXC_XFER_LITE(0x300, handle_page_fault)
 
@@ -492,16 +498,20 @@ DARFixed:/* Return from dcbx instruction bug workaround */
  */
 do_databreakpoint:
EXCEPTION_PROLOG_1
-   EXCEPTION_PROLOG_2
+   EXCEPTION_PROLOG_2 dar
addir3,r1,STACK_FRAME_OVERHEAD
mfspr   r4,SPRN_BAR
stw r4,_DAR(r11)
+#ifdef CONFIG_VMAP_STACK
+   lwz r5,_DSISR(r11)
+#else
mfspr   r5,SPRN_DSISR
+#endif
EXC_XFER_STD(0x1c00, do_break)
 
. = 0x1c00
 DataBreakpoint:
-   EXCEPTION_PROLOG_0
+   EXCEPTION_PROLOG_0 dar
mfspr   r11, SPRN_SRR0
cmplwi  cr1, r11, (.Ldtlbie - PAGE_OFFSET)@l
cmplwi  cr7, r11, (.Litlbie - PAGE_OFFSET)@l
@@ -530,6 +540,11 @@ InstructionBreakpoint:
EXCEPTION(0x1e00, Trap_1e, unknown_exception, EXC_XFER_STD)
EXCEPTION(0x1f00, Trap_1f, unknown_exception, EXC_XFER_STD)
 
+#ifdef CONFIG_VMAP_STACK
+stack_ovf_trampoline:
+   b   stack_ovf
+#endif
+
. = 0x2000
 
/* This is the procedure to calculate the data EA for buggy dcbx,dcbi instructions
@@ -650,7 +665,14 @@ FixupDAR:/* Entry point for dcbx workaround. */
 152:
mfdar   r11
mtctr   r11 /* restore ctr reg from DAR */
+#ifdef CONFIG_VMAP_STACK
+   mfspr   r11, SPRN_SPRG_THREAD
+   stw r10, DAR(r11)
+   mfspr   r10, SPRN_DSISR
+   stw r10, DSISR(r11)
+#else
mtdar   r10 /* save fault EA to DAR */
+#endif
mfspr   r10,SPRN_M_TW
b   DARFixed/* Go back to normal TLB handling */
 
diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index 12543e53fa96..3c42569b75cc 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -49,6 +49,7 @@ config PPC_8xx
select PPC_HAVE_KUEP
select PPC_HAVE_KUAP
select PPC_MM_SLICES if HUGETLB_PAGE
+   select HAVE_ARCH_VMAP_STACK
 
 config 40x
bool "AMCC 40x"
-- 
2.13.3



[PATCH v3 10/15] powerpc/8xx: drop exception entries for non-existing exceptions

2019-09-10 Thread Christophe Leroy
head_8xx.S has entries for all exceptions from 0x100 to 0x1f00.
Several of them do not exist and are never generated by the 8xx
in accordance with the documentation.

Remove those entry points to make some room for the future growth of
the exception code.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/head_8xx.S | 29 -
 1 file changed, 29 deletions(-)

diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index 3de9c5f1746c..5aa63693f790 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -134,18 +134,6 @@ MachineCheck:
addi r3,r1,STACK_FRAME_OVERHEAD
EXC_XFER_STD(0x200, machine_check_exception)
 
-/* Data access exception.
- * This is "never generated" by the MPC8xx.
- */
-   . = 0x300
-DataAccess:
-
-/* Instruction access exception.
- * This is "never generated" by the MPC8xx.
- */
-   . = 0x400
-InstructionAccess:
-
 /* External interrupt */
EXCEPTION(0x500, HardwareInterrupt, do_IRQ, EXC_XFER_LITE)
 
@@ -162,16 +150,9 @@ Alignment:
 /* Program check exception */
EXCEPTION(0x700, ProgramCheck, program_check_exception, EXC_XFER_STD)
 
-/* No FPU on MPC8xx.  This exception is not supposed to happen.
-*/
-   EXCEPTION(0x800, FPUnavailable, unknown_exception, EXC_XFER_STD)
-
 /* Decrementer */
EXCEPTION(0x900, Decrementer, timer_interrupt, EXC_XFER_LITE)
 
-   EXCEPTION(0xa00, Trap_0a, unknown_exception, EXC_XFER_STD)
-   EXCEPTION(0xb00, Trap_0b, unknown_exception, EXC_XFER_STD)
-
 /* System call */
. = 0xc00
 SystemCall:
@@ -179,8 +160,6 @@ SystemCall:
 
 /* Single step - not used on 601 */
EXCEPTION(0xd00, SingleStep, single_step_exception, EXC_XFER_STD)
-   EXCEPTION(0xe00, Trap_0e, unknown_exception, EXC_XFER_STD)
-   EXCEPTION(0xf00, Trap_0f, unknown_exception, EXC_XFER_STD)
 
 /* On the MPC8xx, this is a software emulation interrupt.  It occurs
  * for all unimplemented and illegal instructions.
@@ -507,14 +486,6 @@ DARFixed:/* Return from dcbx instruction bug workaround */
/* 0x300 is DataAccess exception, needed by bad_page_fault() */
EXC_XFER_LITE(0x300, handle_page_fault)
 
-   EXCEPTION(0x1500, Trap_15, unknown_exception, EXC_XFER_STD)
-   EXCEPTION(0x1600, Trap_16, unknown_exception, EXC_XFER_STD)
-   EXCEPTION(0x1700, Trap_17, unknown_exception, EXC_XFER_STD)
-   EXCEPTION(0x1800, Trap_18, unknown_exception, EXC_XFER_STD)
-   EXCEPTION(0x1900, Trap_19, unknown_exception, EXC_XFER_STD)
-   EXCEPTION(0x1a00, Trap_1a, unknown_exception, EXC_XFER_STD)
-   EXCEPTION(0x1b00, Trap_1b, unknown_exception, EXC_XFER_STD)
-
 /* On the MPC8xx, these next four traps are used for development
  * support of breakpoints and such.  Someday I will get around to
  * using them.
-- 
2.13.3



[PATCH v3 08/15] powerpc/32: Add early stack overflow detection with VMAP stack.

2019-09-10 Thread Christophe Leroy
To avoid recursive faults, stack overflow detection has to be
performed before writing to the stack in the exception prologs.

Do it by checking the alignment: if the stack pointer alignment is
wrong, it means it is pointing into the following or preceding page.

Without VMAP stack, a stack overflow is catastrophic. With VMAP
stack, a stack overflow isn't destructive, so don't panic. Kill
the task with SIGSEGV instead.

A dedicated overflow stack is set up for each CPU.
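
The log below comes from the LKDTM EXHAUST_STACK test. As a hedged sketch
(not part of the patch), such a test can be triggered from user space via
lkdtm's "direct entry" debugfs file, assuming debugfs is mounted at
/sys/kernel/debug and lkdtm is available:

/* Sketch only: drive lkdtm's direct-entry interface to exhaust the kernel stack. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/kernel/debug/provoke-crash/DIRECT", "w");

	if (!f) {
		perror("lkdtm DIRECT");
		return 1;
	}
	fputs("EXHAUST_STACK", f);	/* the kernel recurses until its stack overflows */
	fclose(f);
	return 0;
}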

lkdtm: Performing direct entry EXHAUST_STACK
lkdtm: Calling function with 512 frame size to depth 32 ...
lkdtm: loop 32/32 ...
lkdtm: loop 31/32 ...
lkdtm: loop 30/32 ...
lkdtm: loop 29/32 ...
lkdtm: loop 28/32 ...
lkdtm: loop 27/32 ...
lkdtm: loop 26/32 ...
lkdtm: loop 25/32 ...
lkdtm: loop 24/32 ...
lkdtm: loop 23/32 ...
lkdtm: loop 22/32 ...
lkdtm: loop 21/32 ...
lkdtm: loop 20/32 ...
Kernel stack overflow in process test[359], r1=c900c008
Oops: Kernel stack overflow, sig: 6 [#1]
BE PAGE_SIZE=4K MMU=Hash PowerMac
Modules linked in:
CPU: 0 PID: 359 Comm: test Not tainted 5.3.0-rc7+ #2225
NIP:  c0622060 LR: c0626710 CTR: 
REGS: c0895f48 TRAP:    Not tainted  (5.3.0-rc7+)
MSR:  1032   CR: 28004224  XER: 
GPR00: c0626ca4 c900c008 c783c000 c07335cc c900c010 c07335cc c900c0f0 c07335cc
GPR08: c900c0f0 0001   28008222   
GPR16:   10010128 1001 b799c245 10010158 c07335cc 0025
GPR24: c069 c08b91d4 c068f688 0020 c900c0f0 c068f668 c08b95b4 c08b91d4
NIP [c0622060] format_decode+0x0/0x4d4
LR [c0626710] vsnprintf+0x80/0x5fc
Call Trace:
[c900c068] [c0626ca4] vscnprintf+0x18/0x48
[c900c078] [c007b944] vprintk_store+0x40/0x214
[c900c0b8] [c007bf50] vprintk_emit+0x90/0x1dc
[c900c0e8] [c007c5cc] printk+0x50/0x60
[c900c128] [c03da5b0] recursive_loop+0x44/0x6c
[c900c338] [c03da5c4] recursive_loop+0x58/0x6c
[c900c548] [c03da5c4] recursive_loop+0x58/0x6c
[c900c758] [c03da5c4] recursive_loop+0x58/0x6c
[c900c968] [c03da5c4] recursive_loop+0x58/0x6c
[c900cb78] [c03da5c4] recursive_loop+0x58/0x6c
[c900cd88] [c03da5c4] recursive_loop+0x58/0x6c
[c900cf98] [c03da5c4] recursive_loop+0x58/0x6c
[c900d1a8] [c03da5c4] recursive_loop+0x58/0x6c
[c900d3b8] [c03da5c4] recursive_loop+0x58/0x6c
[c900d5c8] [c03da5c4] recursive_loop+0x58/0x6c
[c900d7d8] [c03da5c4] recursive_loop+0x58/0x6c
[c900d9e8] [c03da5c4] recursive_loop+0x58/0x6c
[c900dbf8] [c03da5c4] recursive_loop+0x58/0x6c
[c900de08] [c03da67c] lkdtm_EXHAUST_STACK+0x30/0x4c
[c900de18] [c03da3e8] direct_entry+0xc8/0x140
[c900de48] [c029fb40] full_proxy_write+0x64/0xcc
[c900de68] [c01500f8] __vfs_write+0x30/0x1d0
[c900dee8] [c0152cb8] vfs_write+0xb8/0x1d4
[c900df08] [c0152f7c] ksys_write+0x58/0xe8
[c900df38] [c0014208] ret_from_syscall+0x0/0x34
--- interrupt: c01 at 0xf806664
LR = 0x1000c868
Instruction dump:
4b91 80010014 7c832378 7c0803a6 38210010 4e800020 3d20c08a 3ca0c089
8089a0cc 38a58f0c 3861 4ba2d494 <9421ffe0> 7c0802a6 bfc10018 7c9f2378

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/irq.h |  1 +
 arch/powerpc/kernel/entry_32.S | 25 +
 arch/powerpc/kernel/head_32.h  |  4 
 arch/powerpc/kernel/irq.c  |  1 +
 arch/powerpc/kernel/setup_32.c |  1 +
 arch/powerpc/kernel/traps.c| 15 ---
 6 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/irq.h b/arch/powerpc/include/asm/irq.h
index 814dfab7e392..ec74ced2437d 100644
--- a/arch/powerpc/include/asm/irq.h
+++ b/arch/powerpc/include/asm/irq.h
@@ -55,6 +55,7 @@ extern void *mcheckirq_ctx[NR_CPUS];
  */
 extern void *hardirq_ctx[NR_CPUS];
 extern void *softirq_ctx[NR_CPUS];
+extern void *stackovf_ctx[NR_CPUS];
 
 void call_do_softirq(void *sp);
 void call_do_irq(struct pt_regs *regs, void *sp);
diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index 2a26fe19f0b1..00fcf954e742 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -184,9 +184,11 @@ transfer_to_handler:
  */
kuap_save_and_lock r11, r12, r9, r2, r0
addir2, r12, -THREAD
+#ifndef CONFIG_VMAP_STACK
lwz r9,KSP_LIMIT(r12)
cmplw   r1,r9   /* if r1 <= ksp_limit */
ble-stack_ovf   /* then the kernel stack overflowed */
+#endif
 5:
 #if defined(CONFIG_PPC_BOOK3S_32) || defined(CONFIG_E500)
lwz r12,TI_LOCAL_FLAGS(r2)
@@ -298,6 +300,28 @@ reenable_mmu:
  * On kernel stack overflow, load up an initial stack pointer
  * and call StackOverflow(regs), which should not return.
  */
+#ifdef CONFIG_VMAP_STACK
+_GLOBAL(stack_ovf)
+   li  r11, 0
+#ifdef CONFIG_SMP
+   mfspr   r11, SPRN_SPRG_THREAD
+   tovirt(r11, r11)
+   lwz r11, TASK_CPU - THREAD(r11)
+   slwir11, r11, 3
+#endif
+   addis   r11, r11, stackovf_ctx@ha
+   addir11, r11, stackovf_ctx@l
+   lwz r11, 0(r11)
+   cmpwi   cr1, 

[PATCH v3 12/15] powerpc/8xx: split breakpoint exception

2019-09-10 Thread Christophe Leroy
The breakpoint exception is big.

Split it to support future growth of the exception prolog.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/head_8xx.S | 19 ++-
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index 1e718e47fe3c..225e242ce1c5 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -490,14 +490,7 @@ DARFixed:/* Return from dcbx instruction bug workaround */
  * support of breakpoints and such.  Someday I will get around to
  * using them.
  */
-   . = 0x1c00
-DataBreakpoint:
-   EXCEPTION_PROLOG_0
-   mfspr   r11, SPRN_SRR0
-   cmplwi  cr1, r11, (.Ldtlbie - PAGE_OFFSET)@l
-   cmplwi  cr7, r11, (.Litlbie - PAGE_OFFSET)@l
-   beq-cr1, 11f
-   beq-cr7, 11f
+do_databreakpoint:
EXCEPTION_PROLOG_1
EXCEPTION_PROLOG_2
addir3,r1,STACK_FRAME_OVERHEAD
@@ -505,7 +498,15 @@ DataBreakpoint:
stw r4,_DAR(r11)
mfspr   r5,SPRN_DSISR
EXC_XFER_STD(0x1c00, do_break)
-11:
+
+   . = 0x1c00
+DataBreakpoint:
+   EXCEPTION_PROLOG_0
+   mfspr   r11, SPRN_SRR0
+   cmplwi  cr1, r11, (.Ldtlbie - PAGE_OFFSET)@l
+   cmplwi  cr7, r11, (.Litlbie - PAGE_OFFSET)@l
+   cror4*cr1+eq, 4*cr1+eq, 4*cr7+eq
+   bne cr1, do_databreakpoint
mtcrr10
mfspr   r10, SPRN_SPRG_SCRATCH0
mfspr   r11, SPRN_SPRG_SCRATCH1
-- 
2.13.3



[PATCH v3 03/15] powerpc/32: save DEAR/DAR before calling handle_page_fault

2019-09-10 Thread Christophe Leroy
handle_page_fault() is the only function that saves DAR/DEAR itself.

Save DAR/DEAR before calling handle_page_fault() to prepare for
VMAP stacks, which will require saving it even earlier.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/entry_32.S   | 1 -
 arch/powerpc/kernel/head_32.S| 2 ++
 arch/powerpc/kernel/head_40x.S   | 2 ++
 arch/powerpc/kernel/head_8xx.S   | 2 ++
 arch/powerpc/kernel/head_booke.h | 2 ++
 arch/powerpc/kernel/head_fsl_booke.S | 1 +
 6 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index 6273b4862482..317ad9df8ba8 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -621,7 +621,6 @@ ppc_swapcontext:
  */
.globl  handle_page_fault
 handle_page_fault:
-   stw r4,_DAR(r1)
addir3,r1,STACK_FRAME_OVERHEAD
 #ifdef CONFIG_PPC_BOOK3S_32
andis.  r0,r5,DSISR_DABRMATCH@h
diff --git a/arch/powerpc/kernel/head_32.S b/arch/powerpc/kernel/head_32.S
index 9e868567b716..bebb49d877f2 100644
--- a/arch/powerpc/kernel/head_32.S
+++ b/arch/powerpc/kernel/head_32.S
@@ -310,6 +310,7 @@ BEGIN_MMU_FTR_SECTION
 END_MMU_FTR_SECTION_IFSET(MMU_FTR_HPTE_TABLE)
 1: lwz r5,_DSISR(r11)  /* get DSISR value */
mfspr   r4,SPRN_DAR
+   stw r4, _DAR(r11)
EXC_XFER_LITE(0x300, handle_page_fault)
 
 
@@ -327,6 +328,7 @@ BEGIN_MMU_FTR_SECTION
 END_MMU_FTR_SECTION_IFSET(MMU_FTR_HPTE_TABLE)
 1: mr  r4,r12
andis.  r5,r9,DSISR_SRR1_MATCH_32S@h /* Filter relevant SRR1 bits */
+   stw r4, _DAR(r11)
EXC_XFER_LITE(0x400, handle_page_fault)
 
 /* External interrupt */
diff --git a/arch/powerpc/kernel/head_40x.S b/arch/powerpc/kernel/head_40x.S
index 585ea1976550..9bb663977e84 100644
--- a/arch/powerpc/kernel/head_40x.S
+++ b/arch/powerpc/kernel/head_40x.S
@@ -313,6 +313,7 @@ _ENTRY(saved_ksp_limit)
START_EXCEPTION(0x0400, InstructionAccess)
EXCEPTION_PROLOG
mr  r4,r12  /* Pass SRR0 as arg2 */
+   stw r4, _DEAR(r11)
li  r5,0/* Pass zero as arg3 */
EXC_XFER_LITE(0x400, handle_page_fault)
 
@@ -676,6 +677,7 @@ DataAccess:
mfspr   r5,SPRN_ESR /* Grab the ESR, save it, pass arg3 */
stw r5,_ESR(r11)
mfspr   r4,SPRN_DEAR/* Grab the DEAR, save it, pass arg2 */
+   stw r4, _DEAR(r11)
EXC_XFER_LITE(0x300, handle_page_fault)
 
 /* Other PowerPC processors, namely those derived from the 6xx-series
diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index dac7c0a34eea..fb284d95c76a 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -486,6 +486,7 @@ InstructionTLBError:
tlbie   r4
/* 0x400 is InstructionAccess exception, needed by bad_page_fault() */
 .Litlbie:
+   stw r4, _DAR(r11)
EXC_XFER_LITE(0x400, handle_page_fault)
 
 /* This is the data TLB error on the MPC8xx.  This could be due to
@@ -504,6 +505,7 @@ DARFixed:/* Return from dcbx instruction bug workaround */
mfspr   r5,SPRN_DSISR
stw r5,_DSISR(r11)
mfspr   r4,SPRN_DAR
+   stw r4, _DAR(r11)
andis.  r10,r5,DSISR_NOHPTE@h
beq+.Ldtlbie
tlbie   r4
diff --git a/arch/powerpc/kernel/head_booke.h b/arch/powerpc/kernel/head_booke.h
index 2ae635df9026..37fc84ed90e3 100644
--- a/arch/powerpc/kernel/head_booke.h
+++ b/arch/powerpc/kernel/head_booke.h
@@ -467,6 +467,7 @@ ALT_FTR_SECTION_END_IFSET(CPU_FTR_EMB_HV)
mfspr   r5,SPRN_ESR;/* Grab the ESR and save it */\
stw r5,_ESR(r11); \
mfspr   r4,SPRN_DEAR;   /* Grab the DEAR */   \
+   stw r4, _DEAR(r11);   \
EXC_XFER_LITE(0x0300, handle_page_fault)
 
 #define INSTRUCTION_STORAGE_EXCEPTION\
@@ -475,6 +476,7 @@ ALT_FTR_SECTION_END_IFSET(CPU_FTR_EMB_HV)
mfspr   r5,SPRN_ESR;/* Grab the ESR and save it */\
stw r5,_ESR(r11); \
mr  r4,r12; /* Pass SRR0 as arg2 */   \
+   stw r4, _DEAR(r11);   \
li  r5,0;   /* Pass zero as arg3 */   \
EXC_XFER_LITE(0x0400, handle_page_fault)
 
diff --git a/arch/powerpc/kernel/head_fsl_booke.S b/arch/powerpc/kernel/head_fsl_booke.S
index adf0505dbe02..442aaac292b0 100644
--- a/arch/powerpc/kernel/head_fsl_booke.S
+++ b/arch/powerpc/kernel/head_fsl_booke.S
@@ -376,6 +376,7 @@ interrupt_base:
mfspr   r4,SPRN_DEAR/* Grab the DEAR, save it, pass arg2 */
andis.  r10,r5,(ESR_ILK|ESR_DLK)@h
bne 1f
+   

[PATCH v3 01/15] powerpc/32: replace MTMSRD() by mtmsr

2019-09-10 Thread Christophe Leroy
On PPC32, MTMSRD() is simply defined as mtmsr.

Replace MTMSRD(reg) by mtmsr reg in files dedicated to PPC32;
this makes the code less obscure.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/entry_32.S | 18 +-
 arch/powerpc/kernel/head_32.h  |  4 ++--
 2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index d60908ea37fb..6273b4862482 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -397,7 +397,7 @@ ret_from_syscall:
LOAD_REG_IMMEDIATE(r10,MSR_KERNEL)  /* doesn't include MSR_EE */
/* Note: We don't bother telling lockdep about it */
SYNC
-   MTMSRD(r10)
+   mtmsr   r10
lwz r9,TI_FLAGS(r2)
li  r8,-MAX_ERRNO
andi.   r0,r9,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK|_TIF_PERSYSCALL_MASK)
@@ -554,7 +554,7 @@ syscall_exit_work:
 */
ori r10,r10,MSR_EE
SYNC
-   MTMSRD(r10)
+   mtmsr   r10
 
/* Save NVGPRS if they're not saved already */
lwz r4,_TRAP(r1)
@@ -697,7 +697,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_SPE)
and.r0,r0,r11   /* FP or altivec or SPE enabled? */
beq+1f
andcr11,r11,r0
-   MTMSRD(r11)
+   mtmsr   r11
isync
 1: stw r11,_MSR(r1)
mfcrr10
@@ -831,7 +831,7 @@ ret_from_except:
/* Note: We don't bother telling lockdep about it */
LOAD_REG_IMMEDIATE(r10,MSR_KERNEL)
SYNC/* Some chip revs have problems here... */
-   MTMSRD(r10) /* disable interrupts */
+   mtmsr   r10 /* disable interrupts */
 
lwz r3,_MSR(r1) /* Returning to user mode? */
andi.   r0,r3,MSR_PR
@@ -998,7 +998,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_NEED_PAIRED_STWCX)
 */
LOAD_REG_IMMEDIATE(r10,MSR_KERNEL & ~MSR_RI)
SYNC
-   MTMSRD(r10) /* clear the RI bit */
+   mtmsr   r10 /* clear the RI bit */
.globl exc_exit_restart
 exc_exit_restart:
lwz r12,_NIP(r1)
@@ -1234,7 +1234,7 @@ do_resched:   /* r10 contains MSR_KERNEL here */
 #endif
ori r10,r10,MSR_EE
SYNC
-   MTMSRD(r10) /* hard-enable interrupts */
+   mtmsr   r10 /* hard-enable interrupts */
bl  schedule
 recheck:
/* Note: And we don't tell it we are disabling them again
@@ -1243,7 +1243,7 @@ recheck:
 */
LOAD_REG_IMMEDIATE(r10,MSR_KERNEL)
SYNC
-   MTMSRD(r10) /* disable interrupts */
+   mtmsr   r10 /* disable interrupts */
lwz r9,TI_FLAGS(r2)
andi.   r0,r9,_TIF_NEED_RESCHED
bne-do_resched
@@ -1252,7 +1252,7 @@ recheck:
 do_user_signal:/* r10 contains MSR_KERNEL here */
ori r10,r10,MSR_EE
SYNC
-   MTMSRD(r10) /* hard-enable interrupts */
+   mtmsr   r10 /* hard-enable interrupts */
/* save r13-r31 in the exception frame, if not already done */
lwz r3,_TRAP(r1)
andi.   r0,r3,1
@@ -1341,7 +1341,7 @@ _GLOBAL(enter_rtas)
stw r9,8(r1)
LOAD_REG_IMMEDIATE(r0,MSR_KERNEL)
SYNC/* disable interrupts so SRR0/1 */
-   MTMSRD(r0)  /* don't get trashed */
+   mtmsr   r0  /* don't get trashed */
li  r9,MSR_KERNEL & ~(MSR_IR|MSR_DR)
mtlrr6
stw r7, THREAD + RTAS_SP(r2)
diff --git a/arch/powerpc/kernel/head_32.h b/arch/powerpc/kernel/head_32.h
index 8abc7783dbe5..b2ca8c9ffd8b 100644
--- a/arch/powerpc/kernel/head_32.h
+++ b/arch/powerpc/kernel/head_32.h
@@ -50,7 +50,7 @@
rlwinm  r9,r9,0,14,12   /* clear MSR_WE (necessary?) */
 #else
li  r10,MSR_KERNEL & ~(MSR_IR|MSR_DR) /* can take exceptions */
-   MTMSRD(r10) /* (except for mach check in rtas) */
+   mtmsr   r10 /* (except for mach check in rtas) */
 #endif
stw r0,GPR0(r11)
lis r10,STACK_FRAME_REGS_MARKER@ha /* exception frame marker */
@@ -80,7 +80,7 @@
rlwinm  r9,r9,0,14,12   /* clear MSR_WE (necessary?) */
 #else
LOAD_REG_IMMEDIATE(r10, MSR_KERNEL & ~(MSR_IR|MSR_DR)) /* can take exceptions */
-   MTMSRD(r10) /* (except for mach check in rtas) */
+   mtmsr   r10 /* (except for mach check in rtas) */
 #endif
lis r10,STACK_FRAME_REGS_MARKER@ha /* exception frame marker */
stw r2,GPR2(r11)
-- 
2.13.3



[PATCH v3 04/15] powerpc/32: move MSR_PR test into EXCEPTION_PROLOG_0

2019-09-10 Thread Christophe Leroy
In order to simplify the VMAP stack implementation, move the
MSR_PR test into EXCEPTION_PROLOG_0.

This requires not modifying cr0 between EXCEPTION_PROLOG_0
and EXCEPTION_PROLOG_1.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/head_32.h  |  4 ++--
 arch/powerpc/kernel/head_8xx.S | 39 ---
 2 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/kernel/head_32.h b/arch/powerpc/kernel/head_32.h
index 8e345f8d4b0e..436ffd862d2a 100644
--- a/arch/powerpc/kernel/head_32.h
+++ b/arch/powerpc/kernel/head_32.h
@@ -19,12 +19,12 @@
 .macro EXCEPTION_PROLOG_0
mtspr   SPRN_SPRG_SCRATCH0,r10
mtspr   SPRN_SPRG_SCRATCH1,r11
+   mfspr   r11, SPRN_SRR1  /* check whether user or kernel */
mfcrr10
+   andi.   r11, r11, MSR_PR
 .endm
 
 .macro EXCEPTION_PROLOG_1
-   mfspr   r11,SPRN_SRR1   /* check whether user or kernel */
-   andi.   r11,r11,MSR_PR
tophys(r11,r1)  /* use tophys(r1) if kernel */
beq 1f
mfspr   r11,SPRN_SPRG_THREAD
diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index fb284d95c76a..175c3cfc8014 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -497,8 +497,8 @@ InstructionTLBError:
 DataTLBError:
EXCEPTION_PROLOG_0
mfspr   r11, SPRN_DAR
-   cmpwi   cr0, r11, RPN_PATTERN
-   beq-FixupDAR/* must be a buggy dcbX, icbi insn. */
+   cmpwi   cr1, r11, RPN_PATTERN
+   beq-cr1, FixupDAR   /* must be a buggy dcbX, icbi insn. */
 DARFixed:/* Return from dcbx instruction bug workaround */
EXCEPTION_PROLOG_1
EXCEPTION_PROLOG_2
@@ -531,9 +531,9 @@ DARFixed:/* Return from dcbx instruction bug workaround */
 DataBreakpoint:
EXCEPTION_PROLOG_0
mfspr   r11, SPRN_SRR0
-   cmplwi  cr0, r11, (.Ldtlbie - PAGE_OFFSET)@l
+   cmplwi  cr1, r11, (.Ldtlbie - PAGE_OFFSET)@l
cmplwi  cr7, r11, (.Litlbie - PAGE_OFFSET)@l
-   beq-cr0, 11f
+   beq-cr1, 11f
beq-cr7, 11f
EXCEPTION_PROLOG_1
EXCEPTION_PROLOG_2
@@ -578,9 +578,9 @@ FixupDAR:/* Entry point for dcbx workaround. */
mfspr   r10, SPRN_SRR0
mtspr   SPRN_MD_EPN, r10
rlwinm  r11, r10, 16, 0xfff8
-   cmpli   cr0, r11, PAGE_OFFSET@h
+   cmpli   cr1, r11, PAGE_OFFSET@h
mfspr   r11, SPRN_M_TWB /* Get level 1 table */
-   blt+3f
+   blt+cr1, 3f
rlwinm  r11, r10, 16, 0xfff8
 
 0: cmpli   cr7, r11, (PAGE_OFFSET + 0x180)@h
@@ -595,7 +595,7 @@ FixupDAR:/* Entry point for dcbx workaround. */
 3:
lwz r11, (swapper_pg_dir-PAGE_OFFSET)@l(r11)/* Get the 
level 1 entry */
mtspr   SPRN_MD_TWC, r11
-   mtcrr11
+   mtcrf   0x01, r11
mfspr   r11, SPRN_MD_TWC
lwz r11, 0(r11) /* Get the pte */
bt  28,200f /* bit 28 = Large page (8M) */
@@ -608,16 +608,16 @@ FixupDAR:/* Entry point for dcbx workaround. */
  * no need to include them here */
xoris   r10, r11, 0x7c00/* check if major OP code is 31 */
rlwinm  r10, r10, 0, 21, 5
-   cmpwi   cr0, r10, 2028  /* Is dcbz? */
-   beq+142f
-   cmpwi   cr0, r10, 940   /* Is dcbi? */
-   beq+142f
-   cmpwi   cr0, r10, 108   /* Is dcbst? */
-   beq+144f/* Fix up store bit! */
-   cmpwi   cr0, r10, 172   /* Is dcbf? */
-   beq+142f
-   cmpwi   cr0, r10, 1964  /* Is icbi? */
-   beq+142f
+   cmpwi   cr1, r10, 2028  /* Is dcbz? */
+   beq+cr1, 142f
+   cmpwi   cr1, r10, 940   /* Is dcbi? */
+   beq+cr1, 142f
+   cmpwi   cr1, r10, 108   /* Is dcbst? */
+   beq+cr1, 144f   /* Fix up store bit! */
+   cmpwi   cr1, r10, 172   /* Is dcbf? */
+   beq+cr1, 142f
+   cmpwi   cr1, r10, 1964  /* Is icbi? */
+   beq+cr1, 142f
 141:   mfspr   r10,SPRN_M_TW
b   DARFixed/* Nope, go back to normal TLB processing */
 
@@ -676,8 +676,9 @@ FixupDAR:/* Entry point for dcbx workaround. */
add r10, r10, r30   ;b  151f
add r10, r10, r31
 151:
-   rlwinm. r11,r11,19,24,28/* offset into jump table for reg RA */
-   beq 152f/* if reg RA is zero, don't add it */
+   rlwinm  r11,r11,19,24,28/* offset into jump table for reg RA */
+   cmpwi   cr1, r11, 0
+   beq cr1, 152f   /* if reg RA is zero, don't add it */
addir11, r11, 150b@l/* add start of table */
mtctr   r11 /* load ctr with jump address */
rlwinm  r11,r11,0,16,10 /* make sure we don't execute this more 
than once */
-- 
2.13.3



[PATCH v3 00/15] Enable CONFIG_VMAP_STACK on PPC32

2019-09-10 Thread Christophe Leroy
The purpose of this series is to enable CONFIG_VMAP_STACK on PPC32.

rfc v1: initial support on 8xx

rfc v2: added stack overflow detection.

v3:
- Stack overflow detection works, tested with LKDTM STACK_EXHAUST test
- Support for book3s32 added

Christophe Leroy (15):
  powerpc/32: replace MTMSRD() by mtmsr
  powerpc/32: Add EXCEPTION_PROLOG_0 in head_32.h
  powerpc/32: save DEAR/DAR before calling handle_page_fault
  powerpc/32: move MSR_PR test into EXCEPTION_PROLOG_0
  powerpc/32: add a macro to get and/or save DAR and DSISR on stack.
  powerpc/32: prepare for CONFIG_VMAP_STACK
  powerpc: align stack to 2 * THREAD_SIZE with VMAP_STACK
  powerpc/32: Add early stack overflow detection with VMAP stack.
  powerpc/8xx: Use alternative scratch registers in DTLB miss handler
  powerpc/8xx: drop exception entries for non-existing exceptions
  powerpc/8xx: move DataStoreTLBMiss perf handler
  powerpc/8xx: split breakpoint exception
  powerpc/8xx: Enable CONFIG_VMAP_STACK
  powerpc/32s: reorganise DSI handler.
  powerpc/32s: Activate CONFIG_VMAP_STACK

 arch/powerpc/include/asm/irq.h |   1 +
 arch/powerpc/include/asm/processor.h   |   6 ++
 arch/powerpc/include/asm/thread_info.h |  18 
 arch/powerpc/kernel/asm-offsets.c  |   6 ++
 arch/powerpc/kernel/entry_32.S |  55 --
 arch/powerpc/kernel/head_32.S  |  57 ++
 arch/powerpc/kernel/head_32.h  | 129 ---
 arch/powerpc/kernel/head_40x.S |   2 +
 arch/powerpc/kernel/head_8xx.S | 186 +++--
 arch/powerpc/kernel/head_booke.h   |   2 +
 arch/powerpc/kernel/head_fsl_booke.S   |   1 +
 arch/powerpc/kernel/irq.c  |   1 +
 arch/powerpc/kernel/setup_32.c |   3 +-
 arch/powerpc/kernel/setup_64.c |   2 +-
 arch/powerpc/kernel/traps.c|  15 ++-
 arch/powerpc/kernel/vmlinux.lds.S  |   2 +-
 arch/powerpc/mm/book3s32/hash_low.S|  46 +---
 arch/powerpc/mm/book3s32/mmu.c |   9 +-
 arch/powerpc/perf/8xx-pmu.c|  12 ++-
 arch/powerpc/platforms/Kconfig.cputype |   3 +
 20 files changed, 379 insertions(+), 177 deletions(-)

-- 
2.13.3



[PATCH v3 02/15] powerpc/32: Add EXCEPTION_PROLOG_0 in head_32.h

2019-09-10 Thread Christophe Leroy
This patch creates a macro for the very first part of the
exception prolog; this will help when implementing
CONFIG_VMAP_STACK.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/head_32.S  | 4 +---
 arch/powerpc/kernel/head_32.h  | 9 ++---
 arch/powerpc/kernel/head_8xx.S | 9 ++---
 3 files changed, 9 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/kernel/head_32.S b/arch/powerpc/kernel/head_32.S
index 4a24f8f026c7..9e868567b716 100644
--- a/arch/powerpc/kernel/head_32.S
+++ b/arch/powerpc/kernel/head_32.S
@@ -272,9 +272,7 @@ __secondary_hold_acknowledge:
  */
. = 0x200
DO_KVM  0x200
-   mtspr   SPRN_SPRG_SCRATCH0,r10
-   mtspr   SPRN_SPRG_SCRATCH1,r11
-   mfcrr10
+   EXCEPTION_PROLOG_0
 #ifdef CONFIG_PPC_CHRP
mfspr   r11, SPRN_SPRG_THREAD
lwz r11, RTAS_SP(r11)
diff --git a/arch/powerpc/kernel/head_32.h b/arch/powerpc/kernel/head_32.h
index b2ca8c9ffd8b..8e345f8d4b0e 100644
--- a/arch/powerpc/kernel/head_32.h
+++ b/arch/powerpc/kernel/head_32.h
@@ -10,13 +10,16 @@
  * We assume sprg3 has the physical address of the current
  * task's thread_struct.
  */
-
 .macro EXCEPTION_PROLOG
+   EXCEPTION_PROLOG_0
+   EXCEPTION_PROLOG_1
+   EXCEPTION_PROLOG_2
+.endm
+
+.macro EXCEPTION_PROLOG_0
mtspr   SPRN_SPRG_SCRATCH0,r10
mtspr   SPRN_SPRG_SCRATCH1,r11
mfcrr10
-   EXCEPTION_PROLOG_1
-   EXCEPTION_PROLOG_2
 .endm
 
 .macro EXCEPTION_PROLOG_1
diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index 19f583e18402..dac7c0a34eea 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -494,10 +494,7 @@ InstructionTLBError:
  */
. = 0x1400
 DataTLBError:
-   mtspr   SPRN_SPRG_SCRATCH0, r10
-   mtspr   SPRN_SPRG_SCRATCH1, r11
-   mfcrr10
-
+   EXCEPTION_PROLOG_0
mfspr   r11, SPRN_DAR
cmpwi   cr0, r11, RPN_PATTERN
beq-FixupDAR/* must be a buggy dcbX, icbi insn. */
@@ -530,9 +527,7 @@ DARFixed:/* Return from dcbx instruction bug workaround */
  */
. = 0x1c00
 DataBreakpoint:
-   mtspr   SPRN_SPRG_SCRATCH0, r10
-   mtspr   SPRN_SPRG_SCRATCH1, r11
-   mfcrr10
+   EXCEPTION_PROLOG_0
mfspr   r11, SPRN_SRR0
cmplwi  cr0, r11, (.Ldtlbie - PAGE_OFFSET)@l
cmplwi  cr7, r11, (.Litlbie - PAGE_OFFSET)@l
-- 
2.13.3



Re: [PATCH 1/2] libnvdimm/altmap: Track namespace boundaries in altmap

2019-09-10 Thread Dan Williams
On Tue, Sep 10, 2019 at 1:31 AM Aneesh Kumar K.V
 wrote:
>
> On 9/10/19 1:40 PM, Dan Williams wrote:
> > On Mon, Sep 9, 2019 at 11:29 PM Aneesh Kumar K.V
> >  wrote:
> >>
> >> With PFN_MODE_PMEM namespace, the memmap area is allocated from the device
> >> area. Some architectures map the memmap area with large page size. On
> >> architectures like ppc64, 16MB page for memmap mapping can map 262144 pfns.
> >> This maps a namespace size of 16G.
> >>
> >> When populating memmap region with 16MB page from the device area,
> >> make sure the allocated space is not used to map resources outside this
> >> namespace. Such usage of device area will prevent a namespace destroy.
> >>
> >> Add resource end pfn in altmap and use that to check if the memmap area
> >> allocation can map pfn outside the namespace. On ppc64 in such case we 
> >> fallback
> >> to allocation from memory.
> >
> > Shouldn't this instead be comprehended by nd_pfn_init() to increase
> > the reservation size so that it fits in the alignment? It may not
> > always be possible to fall back to allocation from memory for
> > extremely large pmem devices. I.e. at 64GB of memmap per 1TB of pmem
> > there may not be enough DRAM to store the memmap.
> >
>
> We do switch between DRAM and device for memmap allocation. ppc64
> vmemmap_populate  does
>
> if (altmap && !altmap_cross_boundary(altmap, start, page_size)) {
> p = altmap_alloc_block_buf(page_size, altmap);
> if (!p)
> pr_debug("altmap block allocation failed, falling back to 
> system memory");
> }
> if (!p)
> p = vmemmap_alloc_block_buf(page_size, node);
>
>
> With that we should be using DRAM for the first and the last mapping,
> rest of the memmap should be backed by device.

Ah, ok, makes sense.


[PATCH 1/2] powerpc/xmon: Improve output of XIVE interrupts

2019-09-10 Thread Cédric Le Goater
When looping over the list of interrupts, also print the current value
of the PQ bits, obtained with a load on the ESB page. This has the side
effect of faulting in the ESB page of all interrupts.
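
For context (and not part of the patch), here is a minimal sketch of how
the value returned by such an ESB load is decoded into the P and Q flags
printed by the patch; the bit values below are assumptions made only so
the example stands alone.

/* Sketch only: decode the ESB "PQ" state bits for display. */
#include <stdio.h>

#define XIVE_ESB_VAL_P	0x2	/* assumed bit layout for this example */
#define XIVE_ESB_VAL_Q	0x1

static void print_pq(unsigned long esb_val)
{
	printf("PQ=%c%c\n",
	       esb_val & XIVE_ESB_VAL_P ? 'P' : '-',
	       esb_val & XIVE_ESB_VAL_Q ? 'Q' : '-');
}

int main(void)
{
	print_pq(0x0);	/* idle        -> "PQ=--" */
	print_pq(0x3);	/* P and Q set -> "PQ=PQ" */
	return 0;
}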

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/include/asm/xive.h   |  3 +--
 arch/powerpc/sysdev/xive/common.c | 29 ++---
 arch/powerpc/xmon/xmon.c  | 15 ---
 3 files changed, 31 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
index 967d6ab3c977..71f52f22c36b 100644
--- a/arch/powerpc/include/asm/xive.h
+++ b/arch/powerpc/include/asm/xive.h
@@ -99,8 +99,7 @@ extern void xive_flush_interrupt(void);
 
 /* xmon hook */
 extern void xmon_xive_do_dump(int cpu);
-extern int xmon_xive_get_irq_config(u32 irq, u32 *target, u8 *prio,
-   u32 *sw_irq);
+extern int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d);
 
 /* APIs used by KVM */
 extern u32 xive_native_default_eq_shift(void);
diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index ed4561e71951..85a27ec49d34 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -258,10 +258,33 @@ notrace void xmon_xive_do_dump(int cpu)
 #endif
 }
 
-int xmon_xive_get_irq_config(u32 irq, u32 *target, u8 *prio,
-u32 *sw_irq)
+int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d)
 {
-   return xive_ops->get_irq_config(irq, target, prio, sw_irq);
+   int rc;
+   u32 target;
+   u8 prio;
+   u32 lirq;
+
+   rc = xive_ops->get_irq_config(hw_irq, &target, &prio, &lirq);
+   if (rc) {
+   xmon_printf("IRQ 0x%08x : no config rc=%d\n", hw_irq, rc);
+   return rc;
+   }
+
+   xmon_printf("IRQ 0x%08x : target=0x%x prio=%02x lirq=0x%x ",
+   hw_irq, target, prio, lirq);
+
+   if (d) {
+   struct xive_irq_data *xd = irq_data_get_irq_handler_data(d);
+   u64 val = xive_esb_read(xd, XIVE_ESB_GET);
+
+   xmon_printf("PQ=%c%c",
+   val & XIVE_ESB_VAL_P ? 'P' : '-',
+   val & XIVE_ESB_VAL_Q ? 'Q' : '-');
+   }
+
+   xmon_printf("\n");
+   return 0;
 }
 
 #endif /* CONFIG_XMON */
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index dc9832e06256..d83364ebc5c5 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -2572,16 +2572,9 @@ static void dump_all_xives(void)
dump_one_xive(cpu);
 }
 
-static void dump_one_xive_irq(u32 num)
+static void dump_one_xive_irq(u32 num, struct irq_data *d)
 {
-   int rc;
-   u32 target;
-   u8 prio;
-   u32 lirq;
-
-   rc = xmon_xive_get_irq_config(num, &target, &prio, &lirq);
-   xmon_printf("IRQ 0x%08x : target=0x%x prio=%d lirq=0x%x (rc=%d)\n",
-   num, target, prio, lirq, rc);
+   xmon_xive_get_irq_config(num, d);
 }
 
 static void dump_all_xive_irq(void)
@@ -2599,7 +2592,7 @@ static void dump_all_xive_irq(void)
hwirq = (unsigned int)irqd_to_hwirq(d);
/* IPIs are special (HW number 0) */
if (hwirq)
-   dump_one_xive_irq(hwirq);
+   dump_one_xive_irq(hwirq, d);
}
 }
 
@@ -2619,7 +2612,7 @@ static void dump_xives(void)
return;
} else if (c == 'i') {
if (scanhex(&num))
-   dump_one_xive_irq(num);
+   dump_one_xive_irq(num, NULL);
else
dump_all_xive_irq();
return;
-- 
2.21.0



[PATCH 2/2] powerpc/xmon: Fix output of XIVE IPI

2019-09-10 Thread Cédric Le Goater
When dumping the XIVE state of a CPU IPI, xmon does not check whether
the CPU is started, which can cause an error. Add a check for that
and change the output to fit on one line, just like the XIVE interrupts
of the machine.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xive/common.c | 27 ---
 1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 85a27ec49d34..20f45b8a52ab 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -237,25 +237,30 @@ static notrace void xive_dump_eq(const char *name, struct xive_q *q)
i0 = be32_to_cpup(q->qpage + idx);
idx = (idx + 1) & q->msk;
i1 = be32_to_cpup(q->qpage + idx);
-   xmon_printf("  %s Q T=%d %08x %08x ...\n", name,
-   q->toggle, i0, i1);
+   xmon_printf("%s idx=%d T=%d %08x %08x ...", name,
+q->idx, q->toggle, i0, i1);
 }
 
 notrace void xmon_xive_do_dump(int cpu)
 {
struct xive_cpu *xc = per_cpu(xive_cpu, cpu);
 
-   xmon_printf("XIVE state for CPU %d:\n", cpu);
-   xmon_printf("  pp=%02x cppr=%02x\n", xc->pending_prio, xc->cppr);
-   xive_dump_eq("IRQ", >queue[xive_irq_priority]);
+   xmon_printf("CPU %d:", cpu);
+   if (xc) {
+   xmon_printf("pp=%02x CPPR=%02x ", xc->pending_prio, xc->cppr);
+
 #ifdef CONFIG_SMP
-   {
u64 val = xive_esb_read(&xc->ipi_data, XIVE_ESB_GET);
-   xmon_printf("  IPI state: %x:%c%c\n", xc->hw_ipi,
-   val & XIVE_ESB_VAL_P ? 'P' : 'p',
-   val & XIVE_ESB_VAL_Q ? 'Q' : 'q');
-   }
+   {
+   u64 val = xive_esb_read(&xc->ipi_data, XIVE_ESB_GET);
+
+   xmon_printf("IPI=0x%08x PQ=%c%c ", xc->hw_ipi,
+   val & XIVE_ESB_VAL_P ? 'P' : '-',
+   val & XIVE_ESB_VAL_Q ? 'Q' : '-');
+   }
 #endif
+   xive_dump_eq("EQ", >queue[xive_irq_priority]);
+   }
+   xmon_printf("\n");
 }
 
 int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d)
-- 
2.21.0



[PATCH 0/2] powerpc/xmon: Improve output of XIVE commands

2019-09-10 Thread Cédric Le Goater
Hello,

This series extends the interrupt command output with the PQ bit value
and reworks the CPU command output to check that a CPU is started.

Thanks,

C.

Cédric Le Goater (2):
  powerpc/xmon: Improve output of XIVE interrupts
  powerpc/xmon: Fix output of XIVE IPI

 arch/powerpc/include/asm/xive.h   |  3 +-
 arch/powerpc/sysdev/xive/common.c | 56 +++
 arch/powerpc/xmon/xmon.c  | 15 +++--
 3 files changed, 47 insertions(+), 27 deletions(-)

-- 
2.21.0



Re: [PATCH v5 21/31] powernv/fadump: process architected register state data provided by firmware

2019-09-10 Thread Hari Bathini



On 09/09/19 9:03 PM, Oliver O'Halloran wrote:
> On Mon, Sep 9, 2019 at 11:23 PM Hari Bathini  wrote:
>>
>> On 04/09/19 5:50 PM, Michael Ellerman wrote:
>>> Hari Bathini  writes:
>>>
>>
>> [...]
>>
 +/*
 + * CPU state data is provided by f/w. Below are the definitions
 + * provided in HDAT spec. Refer to latest HDAT specification for
 + * any update to this format.
 + */
>>>
>>> How is this meant to work? If HDAT ever changes the format they will
>>> break all existing kernels in the field.
>>>
 +#define HDAT_FADUMP_CPU_DATA_VERSION1
>>
>> Changes are not expected here. But this is just to cover for such scenario,
>> if that ever happens.
> 
> The HDAT spec doesn't define the SPR numbers for NIA, MSR and the CR.
> As far as I can tell the values you've assumed here are chip-specific,
> non-architected SPR numbers that come from an array buried somewhere
> in the SBE codebase. I don't believe you for a second when you say
> that this will never change.

At least, the understanding is that these numbers do not change across
processor generations. If something changes, it is supposed to be handled
in the SBE. Also, I am told these numbers would be listed in the HDAT spec.
Not sure if that has happened yet, though. Vasant, do you have anything to add?

>> Also, I think it is a bit far-fetched to error out if versions mismatch.
>> Warning and proceeding sounds worthier because the changes are usually
>> backward compatible, if and when there are any. Will update accordingly...
> 
> Literally the only reason I didn't drop the CPU DATA parts of the OPAL
> MPIPL series was because I assumed the kernel would do the sensible
> thing and reject or ignore the structure if it did not know how to
> parse the data.

I think the changes, if any, would have to be backward compatible for the sake
of sanity. Even if they are not, aren't we better off exporting /proc/vmcore
with a warning and possibly bogus CPU register data (if parsing goes alright)
than with no dump at all?

- Hari



Re: [PATCH 1/2] libnvdimm/altmap: Track namespace boundaries in altmap

2019-09-10 Thread Aneesh Kumar K.V

On 9/10/19 1:40 PM, Dan Williams wrote:

On Mon, Sep 9, 2019 at 11:29 PM Aneesh Kumar K.V
 wrote:


With PFN_MODE_PMEM namespace, the memmap area is allocated from the device
area. Some architectures map the memmap area with large page size. On
architectures like ppc64, 16MB page for memmap mapping can map 262144 pfns.
This maps a namespace size of 16G.

When populating memmap region with 16MB page from the device area,
make sure the allocated space is not used to map resources outside this
namespace. Such usage of device area will prevent a namespace destroy.

Add resource end pfn in altmap and use that to check if the memmap area
allocation can map pfn outside the namespace. On ppc64 in such case we fall back
to allocation from memory.


Shouldn't this instead be comprehended by nd_pfn_init() to increase
the reservation size so that it fits in the alignment? It may not
always be possible to fall back to allocation from memory for
extremely large pmem devices. I.e. at 64GB of memmap per 1TB of pmem
there may not be enough DRAM to store the memmap.



We do switch between DRAM and device for memmap allocation. ppc64 
vmemmap_populate  does


if (altmap && !altmap_cross_boundary(altmap, start, page_size)) {
p = altmap_alloc_block_buf(page_size, altmap);
if (!p)
pr_debug("altmap block allocation failed, falling back to system 
memory");
}
if (!p)
p = vmemmap_alloc_block_buf(page_size, node);


With that we should be using DRAM for the first and the last mapping, 
rest of the memmap should be backed by device.


-aneesh


[PATCH v8 8/8] KVM: PPC: Ultravisor: Add PPC_UV config option

2019-09-10 Thread Bharata B Rao
From: Anshuman Khandual 

CONFIG_PPC_UV adds support for ultravisor.

Signed-off-by: Anshuman Khandual 
Signed-off-by: Bharata B Rao 
Signed-off-by: Ram Pai 
[ Update config help and commit message ]
Signed-off-by: Claudio Carvalho 
---
 arch/powerpc/Kconfig | 17 +
 1 file changed, 17 insertions(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index d8dcd8820369..044838794112 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -448,6 +448,23 @@ config PPC_TRANSACTIONAL_MEM
help
  Support user-mode Transactional Memory on POWERPC.
 
+config PPC_UV
+   bool "Ultravisor support"
+   depends on KVM_BOOK3S_HV_POSSIBLE
+   select ZONE_DEVICE
+   select DEV_PAGEMAP_OPS
+   select DEVICE_PRIVATE
+   select MEMORY_HOTPLUG
+   select MEMORY_HOTREMOVE
+   default n
+   help
+ This option paravirtualizes the kernel to run in POWER platforms that
+ supports the Protected Execution Facility (PEF). On such platforms,
+ the ultravisor firmware runs at a privilege level above the
+ hypervisor.
+
+ If unsure, say "N".
+
 config LD_HEAD_STUB_CATCH
bool "Reserve 256 bytes to cope with linker stubs in HEAD text" if 
EXPERT
depends on PPC64
-- 
2.21.0



[PATCH v8 7/8] kvmppc: Support reset of secure guest

2019-09-10 Thread Bharata B Rao
Add support for reset of secure guest via a new ioctl KVM_PPC_SVM_OFF.
This ioctl will be issued by QEMU during reset and includes the
following steps:

- Ask UV to terminate the guest via UV_SVM_TERMINATE ucall
- Unpin the VPA pages so that they can be migrated back to secure
  side when guest becomes secure again. This is required because
  pinned pages can't be migrated.
- Reinitialize guest's partition scoped page tables. These are
  freed when the guest becomes secure (H_SVM_INIT_DONE).
- Release all device pages of the secure guest.

After these steps, the guest is ready to issue the UV_ESM call once again
to switch to secure mode.
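
For illustration, here is a minimal sketch of how userspace (e.g. QEMU's machine
reset path) might invoke the new ioctl. Only KVM_PPC_SVM_OFF and its error codes
come from this patch; the helper name and surrounding code are hypothetical:

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>		/* provides KVM_PPC_SVM_OFF once this patch is applied */

/* vm_fd is a VM file descriptor obtained earlier via KVM_CREATE_VM. */
static int reset_secure_guest(int vm_fd)
{
	int ret = ioctl(vm_fd, KVM_PPC_SVM_OFF);

	if (ret < 0)
		perror("KVM_PPC_SVM_OFF");	/* EINVAL or ENOMEM, see api.txt hunk below */
	return ret;
}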

Signed-off-by: Bharata B Rao 
Signed-off-by: Sukadev Bhattiprolu 
[Implementation of uv_svm_terminate() and its call from
guest shutdown path]
Signed-off-by: Ram Pai 
[Unpinning of VPA pages]
---
 Documentation/virt/kvm/api.txt  | 19 ++
 arch/powerpc/include/asm/kvm_book3s_uvmem.h |  7 ++
 arch/powerpc/include/asm/kvm_ppc.h  |  2 +
 arch/powerpc/include/asm/ultravisor-api.h   |  1 +
 arch/powerpc/include/asm/ultravisor.h   |  5 ++
 arch/powerpc/kvm/book3s_hv.c| 74 +
 arch/powerpc/kvm/book3s_hv_uvmem.c  | 62 -
 arch/powerpc/kvm/powerpc.c  | 12 
 include/uapi/linux/kvm.h|  1 +
 9 files changed, 182 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
index 2d067767b617..8e7a02e547e9 100644
--- a/Documentation/virt/kvm/api.txt
+++ b/Documentation/virt/kvm/api.txt
@@ -4111,6 +4111,25 @@ Valid values for 'action':
 #define KVM_PMU_EVENT_ALLOW 0
 #define KVM_PMU_EVENT_DENY 1
 
+4.121 KVM_PPC_SVM_OFF
+
+Capability: basic
+Architectures: powerpc
+Type: vm ioctl
+Parameters: none
+Returns: 0 on successful completion,
+Errors:
+  EINVAL:if ultravisor failed to terminate the secure guest
+  ENOMEM:if hypervisor failed to allocate new radix page tables for guest
+
+This ioctl is used to turn off the secure mode of the guest or transition
+the guest from secure mode to normal mode. This is invoked when the guest
+is reset. This has no effect if called for a normal guest.
+
+This ioctl issues an ultravisor call to terminate the secure guest,
+unpins the VPA pages, reinitializes guest's partition scoped page
+tables and releases all the device pages that are used to track the
+secure pages by hypervisor.
 
 5. The kvm_run structure
 
diff --git a/arch/powerpc/include/asm/kvm_book3s_uvmem.h 
b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
index fc924ef00b91..6b8cc8edd0ab 100644
--- a/arch/powerpc/include/asm/kvm_book3s_uvmem.h
+++ b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
@@ -13,6 +13,8 @@ unsigned long kvmppc_h_svm_page_out(struct kvm *kvm,
unsigned long page_shift);
 unsigned long kvmppc_h_svm_init_start(struct kvm *kvm);
 unsigned long kvmppc_h_svm_init_done(struct kvm *kvm);
+void kvmppc_uvmem_free_memslot_pfns(struct kvm *kvm,
+   struct kvm_memslots *slots);
 #else
 static inline unsigned long
 kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gra,
@@ -37,5 +39,10 @@ static inline unsigned long kvmppc_h_svm_init_done(struct 
kvm *kvm)
 {
return H_UNSUPPORTED;
 }
+
+static inline void kvmppc_uvmem_free_memslot_pfns(struct kvm *kvm,
+ struct kvm_memslots *slots)
+{
+}
 #endif /* CONFIG_PPC_UV */
 #endif /* __POWERPC_KVM_PPC_HMM_H__ */
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 2484e6a8f5ca..e4093d067354 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -177,6 +177,7 @@ extern void kvm_spapr_tce_release_iommu_group(struct kvm 
*kvm,
 extern int kvmppc_switch_mmu_to_hpt(struct kvm *kvm);
 extern int kvmppc_switch_mmu_to_radix(struct kvm *kvm);
 extern void kvmppc_setup_partition_table(struct kvm *kvm);
+extern int kvmppc_reinit_partition_table(struct kvm *kvm);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
struct kvm_create_spapr_tce_64 *args);
@@ -321,6 +322,7 @@ struct kvmppc_ops {
   int size);
int (*store_to_eaddr)(struct kvm_vcpu *vcpu, ulong *eaddr, void *ptr,
  int size);
+   int (*svm_off)(struct kvm *kvm);
 };
 
 extern struct kvmppc_ops *kvmppc_hv_ops;
diff --git a/arch/powerpc/include/asm/ultravisor-api.h 
b/arch/powerpc/include/asm/ultravisor-api.h
index cf200d4ce703..3a27a0c0be05 100644
--- a/arch/powerpc/include/asm/ultravisor-api.h
+++ b/arch/powerpc/include/asm/ultravisor-api.h
@@ -30,5 +30,6 @@
 #define UV_PAGE_IN 0xF128
 #define UV_PAGE_OUT0xF12C
 #define UV_PAGE_INVAL  0xF138
+#define UV_SVM_TERMINATE   0xF13C
 
 #endif /* 

[PATCH v8 6/8] kvmppc: Radix changes for secure guest

2019-09-10 Thread Bharata B Rao
- After the guest becomes secure, when we handle a page fault of a page
  belonging to SVM in HV, send that page to UV via UV_PAGE_IN.
- Whenever a page is unmapped on the HV side, inform UV via UV_PAGE_INVAL.
- Ensure that all routines that walk the guest's secondary page tables
  don't do so for a secure VM. For a secure guest, the active secondary
  page tables are in secure memory, and the secondary page tables in the
  HV are freed when the guest becomes secure.
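
For reference, a rough sketch of what the two new helpers are expected to look
like on the book3s_hv_uvmem.c side (that hunk is not shown in this excerpt); the
exact locking and return-value details are assumptions based on the rest of this
series:

bool kvmppc_is_guest_secure(struct kvm *kvm)
{
	/* Assumed: set once H_SVM_INIT_DONE has completed (see patch 4/8). */
	return !!(kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE);
}

int kvmppc_send_page_to_uv(struct kvm *kvm, unsigned long gpa)
{
	unsigned long pfn;
	int ret;

	pfn = gfn_to_pfn(kvm, gpa >> PAGE_SHIFT);
	if (is_error_noslot_pfn(pfn))
		return -EFAULT;

	ret = uv_page_in(kvm->arch.lpid, pfn << PAGE_SHIFT, gpa, 0, PAGE_SHIFT);
	kvm_release_pfn_clean(pfn);

	/* The radix page fault handler returns this value directly. */
	return (ret == U_SUCCESS) ? RESUME_GUEST : -EFAULT;
}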

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/include/asm/kvm_host.h   | 12 
 arch/powerpc/include/asm/ultravisor-api.h |  1 +
 arch/powerpc/include/asm/ultravisor.h |  5 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c| 22 ++
 arch/powerpc/kvm/book3s_hv_uvmem.c| 20 
 5 files changed, 60 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index cab3099db8d4..17780c82c1b4 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -876,6 +876,8 @@ static inline void kvm_arch_vcpu_block_finish(struct 
kvm_vcpu *vcpu) {}
 #ifdef CONFIG_PPC_UV
 int kvmppc_uvmem_init(void);
 void kvmppc_uvmem_free(void);
+bool kvmppc_is_guest_secure(struct kvm *kvm);
+int kvmppc_send_page_to_uv(struct kvm *kvm, unsigned long gpa);
 #else
 static inline int kvmppc_uvmem_init(void)
 {
@@ -883,6 +885,16 @@ static inline int kvmppc_uvmem_init(void)
 }
 
 static inline void kvmppc_uvmem_free(void) {}
+
+static inline bool kvmppc_is_guest_secure(struct kvm *kvm)
+{
+   return false;
+}
+
+static inline int kvmppc_send_page_to_uv(struct kvm *kvm, unsigned long gpa)
+{
+   return -EFAULT;
+}
 #endif /* CONFIG_PPC_UV */
 
 #endif /* __POWERPC_KVM_HOST_H__ */
diff --git a/arch/powerpc/include/asm/ultravisor-api.h 
b/arch/powerpc/include/asm/ultravisor-api.h
index 46b1ee381695..cf200d4ce703 100644
--- a/arch/powerpc/include/asm/ultravisor-api.h
+++ b/arch/powerpc/include/asm/ultravisor-api.h
@@ -29,5 +29,6 @@
 #define UV_UNREGISTER_MEM_SLOT 0xF124
 #define UV_PAGE_IN 0xF128
 #define UV_PAGE_OUT0xF12C
+#define UV_PAGE_INVAL  0xF138
 
 #endif /* _ASM_POWERPC_ULTRAVISOR_API_H */
diff --git a/arch/powerpc/include/asm/ultravisor.h 
b/arch/powerpc/include/asm/ultravisor.h
index 719c0c3930b9..b333241bbe4c 100644
--- a/arch/powerpc/include/asm/ultravisor.h
+++ b/arch/powerpc/include/asm/ultravisor.h
@@ -57,4 +57,9 @@ static inline int uv_unregister_mem_slot(u64 lpid, u64 slotid)
return ucall_norets(UV_UNREGISTER_MEM_SLOT, lpid, slotid);
 }
 
+static inline int uv_page_inval(u64 lpid, u64 gpa, u64 page_shift)
+{
+   return ucall_norets(UV_PAGE_INVAL, lpid, gpa, page_shift);
+}
+
 #endif /* _ASM_POWERPC_ULTRAVISOR_H */
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 2d415c36a61d..93ad34e63045 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -19,6 +19,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 /*
  * Supported radix tree geometry.
@@ -915,6 +917,9 @@ int kvmppc_book3s_radix_page_fault(struct kvm_run *run, 
struct kvm_vcpu *vcpu,
if (!(dsisr & DSISR_PRTABLE_FAULT))
gpa |= ea & 0xfff;
 
+   if (kvmppc_is_guest_secure(kvm))
+   return kvmppc_send_page_to_uv(kvm, gpa & PAGE_MASK);
+
/* Get the corresponding memslot */
memslot = gfn_to_memslot(kvm, gfn);
 
@@ -972,6 +977,11 @@ int kvm_unmap_radix(struct kvm *kvm, struct 
kvm_memory_slot *memslot,
unsigned long gpa = gfn << PAGE_SHIFT;
unsigned int shift;
 
+   if (kvmppc_is_guest_secure(kvm)) {
+   uv_page_inval(kvm->arch.lpid, gpa, PAGE_SIZE);
+   return 0;
+   }
+
ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
if (ptep && pte_present(*ptep))
kvmppc_unmap_pte(kvm, ptep, gpa, shift, memslot,
@@ -989,6 +999,9 @@ int kvm_age_radix(struct kvm *kvm, struct kvm_memory_slot 
*memslot,
int ref = 0;
unsigned long old, *rmapp;
 
+   if (kvmppc_is_guest_secure(kvm))
+   return ref;
+
ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
if (ptep && pte_present(*ptep) && pte_young(*ptep)) {
old = kvmppc_radix_update_pte(kvm, ptep, _PAGE_ACCESSED, 0,
@@ -1013,6 +1026,9 @@ int kvm_test_age_radix(struct kvm *kvm, struct 
kvm_memory_slot *memslot,
unsigned int shift;
int ref = 0;
 
+   if (kvmppc_is_guest_secure(kvm))
+   return ref;
+
ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
if (ptep && pte_present(*ptep) && pte_young(*ptep))
ref = 1;
@@ -1030,6 +1046,9 @@ static int kvm_radix_test_clear_dirty(struct kvm *kvm,
int ret = 0;
unsigned long old, *rmapp;
 
+   if (kvmppc_is_guest_secure(kvm))
+  

[PATCH v8 5/8] kvmppc: Handle memory plug/unplug to secure VM

2019-09-10 Thread Bharata B Rao
Register the new memslot with UV during plug and unregister
the memslot during unplug.

Signed-off-by: Bharata B Rao 
Acked-by: Paul Mackerras 
---
 arch/powerpc/include/asm/ultravisor-api.h |  1 +
 arch/powerpc/include/asm/ultravisor.h |  5 +
 arch/powerpc/kvm/book3s_hv.c  | 21 +
 3 files changed, 27 insertions(+)

diff --git a/arch/powerpc/include/asm/ultravisor-api.h 
b/arch/powerpc/include/asm/ultravisor-api.h
index c578d9b13a56..46b1ee381695 100644
--- a/arch/powerpc/include/asm/ultravisor-api.h
+++ b/arch/powerpc/include/asm/ultravisor-api.h
@@ -26,6 +26,7 @@
 #define UV_WRITE_PATE  0xF104
 #define UV_RETURN  0xF11C
 #define UV_REGISTER_MEM_SLOT   0xF120
+#define UV_UNREGISTER_MEM_SLOT 0xF124
 #define UV_PAGE_IN 0xF128
 #define UV_PAGE_OUT0xF12C
 
diff --git a/arch/powerpc/include/asm/ultravisor.h 
b/arch/powerpc/include/asm/ultravisor.h
index 58ccf5e2d6bb..719c0c3930b9 100644
--- a/arch/powerpc/include/asm/ultravisor.h
+++ b/arch/powerpc/include/asm/ultravisor.h
@@ -52,4 +52,9 @@ static inline int uv_register_mem_slot(u64 lpid, u64 
start_gpa, u64 size,
size, flags, slotid);
 }
 
+static inline int uv_unregister_mem_slot(u64 lpid, u64 slotid)
+{
+   return ucall_norets(UV_UNREGISTER_MEM_SLOT, lpid, slotid);
+}
+
 #endif /* _ASM_POWERPC_ULTRAVISOR_H */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 2527f1676b59..fc93e5ba5683 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -74,6 +74,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "book3s.h"
 
@@ -4517,6 +4518,26 @@ static void kvmppc_core_commit_memory_region_hv(struct 
kvm *kvm,
if (change == KVM_MR_FLAGS_ONLY && kvm_is_radix(kvm) &&
((new->flags ^ old->flags) & KVM_MEM_LOG_DIRTY_PAGES))
kvmppc_radix_flush_memslot(kvm, old);
+   /*
+* If UV hasn't yet called H_SVM_INIT_START, don't register memslots.
+*/
+   if (!kvm->arch.secure_guest)
+   return;
+
+   switch (change) {
+   case KVM_MR_CREATE:
+   uv_register_mem_slot(kvm->arch.lpid,
+new->base_gfn << PAGE_SHIFT,
+new->npages * PAGE_SIZE,
+0, new->id);
+   break;
+   case KVM_MR_DELETE:
+   uv_unregister_mem_slot(kvm->arch.lpid, old->id);
+   break;
+   default:
+   /* TODO: Handle KVM_MR_MOVE */
+   break;
+   }
 }
 
 /*
-- 
2.21.0



[PATCH v8 2/8] kvmppc: Movement of pages between normal and secure memory

2019-09-10 Thread Bharata B Rao
Manage the migration of pages between the normal and secure memory of a secure
guest by implementing the H_SVM_PAGE_IN and H_SVM_PAGE_OUT hcalls.

H_SVM_PAGE_IN: Move the content of a normal page to a secure page
H_SVM_PAGE_OUT: Move the content of a secure page to a normal page

Private ZONE_DEVICE memory equal to the amount of secure memory
available in the platform for running secure guests is created.
Whenever a page belonging to the guest becomes secure, a page from
this private device memory is used to represent and track that secure
page on the HV side. The movement of pages between normal and secure
memory is done via migrate_vma_pages() using UV_PAGE_IN and
UV_PAGE_OUT ucalls.
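
The book3s_hv.c hunk of this patch (not reproduced in full below) is expected to
dispatch the two new hcalls from kvmppc_pseries_do_hcall(), roughly as in the
following sketch; the register assignments mirror the handler prototypes and the
analogous H_SVM_INIT_* cases in patch 4/8, so treat them as assumptions:

	case H_SVM_PAGE_IN:
		ret = kvmppc_h_svm_page_in(vcpu->kvm,
					   kvmppc_get_gpr(vcpu, 4),	/* gra */
					   kvmppc_get_gpr(vcpu, 5),	/* flags */
					   kvmppc_get_gpr(vcpu, 6));	/* page_shift */
		break;
	case H_SVM_PAGE_OUT:
		ret = kvmppc_h_svm_page_out(vcpu->kvm,
					    kvmppc_get_gpr(vcpu, 4),
					    kvmppc_get_gpr(vcpu, 5),
					    kvmppc_get_gpr(vcpu, 6));
		break;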

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/include/asm/hvcall.h   |   4 +
 arch/powerpc/include/asm/kvm_book3s_uvmem.h |  29 ++
 arch/powerpc/include/asm/kvm_host.h |  12 +
 arch/powerpc/include/asm/ultravisor-api.h   |   2 +
 arch/powerpc/include/asm/ultravisor.h   |  14 +
 arch/powerpc/kvm/Makefile   |   3 +
 arch/powerpc/kvm/book3s_hv.c|  19 +
 arch/powerpc/kvm/book3s_hv_uvmem.c  | 431 
 8 files changed, 514 insertions(+)
 create mode 100644 arch/powerpc/include/asm/kvm_book3s_uvmem.h
 create mode 100644 arch/powerpc/kvm/book3s_hv_uvmem.c

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index 2023e327..2595d0144958 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -342,6 +342,10 @@
 #define H_TLB_INVALIDATE   0xF808
 #define H_COPY_TOFROM_GUEST0xF80C
 
+/* Platform-specific hcalls used by the Ultravisor */
+#define H_SVM_PAGE_IN  0xEF00
+#define H_SVM_PAGE_OUT 0xEF04
+
 /* Values for 2nd argument to H_SET_MODE */
 #define H_SET_MODE_RESOURCE_SET_CIABR  1
 #define H_SET_MODE_RESOURCE_SET_DAWR   2
diff --git a/arch/powerpc/include/asm/kvm_book3s_uvmem.h 
b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
new file mode 100644
index ..9603c2b48d67
--- /dev/null
+++ b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __POWERPC_KVM_PPC_HMM_H__
+#define __POWERPC_KVM_PPC_HMM_H__
+
+#ifdef CONFIG_PPC_UV
+unsigned long kvmppc_h_svm_page_in(struct kvm *kvm,
+  unsigned long gra,
+  unsigned long flags,
+  unsigned long page_shift);
+unsigned long kvmppc_h_svm_page_out(struct kvm *kvm,
+   unsigned long gra,
+   unsigned long flags,
+   unsigned long page_shift);
+#else
+static inline unsigned long
+kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gra,
+unsigned long flags, unsigned long page_shift)
+{
+   return H_UNSUPPORTED;
+}
+
+static inline unsigned long
+kvmppc_h_svm_page_out(struct kvm *kvm, unsigned long gra,
+ unsigned long flags, unsigned long page_shift)
+{
+   return H_UNSUPPORTED;
+}
+#endif /* CONFIG_PPC_UV */
+#endif /* __POWERPC_KVM_PPC_HMM_H__ */
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 81cd221ccc04..16633ad3be45 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -869,4 +869,16 @@ static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu 
*vcpu) {}
 static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
 
+#ifdef CONFIG_PPC_UV
+int kvmppc_uvmem_init(void);
+void kvmppc_uvmem_free(void);
+#else
+static inline int kvmppc_uvmem_init(void)
+{
+   return 0;
+}
+
+static inline void kvmppc_uvmem_free(void) {}
+#endif /* CONFIG_PPC_UV */
+
 #endif /* __POWERPC_KVM_HOST_H__ */
diff --git a/arch/powerpc/include/asm/ultravisor-api.h 
b/arch/powerpc/include/asm/ultravisor-api.h
index 6a0f9c74f959..1cd1f595fd81 100644
--- a/arch/powerpc/include/asm/ultravisor-api.h
+++ b/arch/powerpc/include/asm/ultravisor-api.h
@@ -25,5 +25,7 @@
 /* opcodes */
 #define UV_WRITE_PATE  0xF104
 #define UV_RETURN  0xF11C
+#define UV_PAGE_IN 0xF128
+#define UV_PAGE_OUT0xF12C
 
 #endif /* _ASM_POWERPC_ULTRAVISOR_API_H */
diff --git a/arch/powerpc/include/asm/ultravisor.h 
b/arch/powerpc/include/asm/ultravisor.h
index d7aa97aa7834..0fc4a974b2e8 100644
--- a/arch/powerpc/include/asm/ultravisor.h
+++ b/arch/powerpc/include/asm/ultravisor.h
@@ -31,4 +31,18 @@ static inline int uv_register_pate(u64 lpid, u64 dw0, u64 
dw1)
return ucall_norets(UV_WRITE_PATE, lpid, dw0, dw1);
 }
 
+static inline int uv_page_in(u64 lpid, u64 src_ra, u64 dst_gpa, u64 flags,
+u64 page_shift)
+{
+   return ucall_norets(UV_PAGE_IN, lpid, src_ra, dst_gpa, flags,
+   

[PATCH v8 4/8] kvmppc: H_SVM_INIT_START and H_SVM_INIT_DONE hcalls

2019-09-10 Thread Bharata B Rao
H_SVM_INIT_START: Initiate securing a VM
H_SVM_INIT_DONE: Conclude securing a VM

As part of H_SVM_INIT_START, register all existing memslots with
the UV. H_SVM_INIT_DONE call by UV informs HV that transition of
the guest to secure mode is complete.

These two states (transition to secure mode STARTED and transition
to secure mode COMPLETED) are recorded in kvm->arch.secure_guest.
Setting these states will cause the assembly code that enters the
guest to call the UV_RETURN ucall instead of trying to enter the
guest directly.
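
For reference, a simplified sketch of what the H_SVM_INIT_START handler in
book3s_hv_uvmem.c (not shown in this excerpt) is expected to do, based on the
uv_register_mem_slot() helper and the secure_guest flags added by this patch;
the error handling and locking details are assumptions:

unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
{
	struct kvm_memslots *slots;
	struct kvm_memory_slot *memslot;
	unsigned long ret = H_SUCCESS;
	int srcu_idx;

	srcu_idx = srcu_read_lock(&kvm->srcu);
	slots = kvm_memslots(kvm);
	kvm_for_each_memslot(memslot, slots) {
		/* Tell the UV about every existing memslot. */
		if (uv_register_mem_slot(kvm->arch.lpid,
					 memslot->base_gfn << PAGE_SHIFT,
					 memslot->npages * PAGE_SIZE,
					 0, memslot->id)) {
			ret = H_PARAMETER;
			goto out;
		}
	}
	kvm->arch.secure_guest |= KVMPPC_SECURE_INIT_START;
out:
	srcu_read_unlock(&kvm->srcu, srcu_idx);
	return ret;
}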

Signed-off-by: Bharata B Rao 
Acked-by: Paul Mackerras 
---
 arch/powerpc/include/asm/hvcall.h   |  2 ++
 arch/powerpc/include/asm/kvm_book3s_uvmem.h | 12 
 arch/powerpc/include/asm/kvm_host.h |  4 +++
 arch/powerpc/include/asm/ultravisor-api.h   |  1 +
 arch/powerpc/include/asm/ultravisor.h   |  7 +
 arch/powerpc/kvm/book3s_hv.c|  7 +
 arch/powerpc/kvm/book3s_hv_uvmem.c  | 34 +
 7 files changed, 67 insertions(+)

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index 4e98dd992bd1..13bd870609c3 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -348,6 +348,8 @@
 /* Platform-specific hcalls used by the Ultravisor */
 #define H_SVM_PAGE_IN  0xEF00
 #define H_SVM_PAGE_OUT 0xEF04
+#define H_SVM_INIT_START   0xEF08
+#define H_SVM_INIT_DONE0xEF0C
 
 /* Values for 2nd argument to H_SET_MODE */
 #define H_SET_MODE_RESOURCE_SET_CIABR  1
diff --git a/arch/powerpc/include/asm/kvm_book3s_uvmem.h 
b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
index 9603c2b48d67..fc924ef00b91 100644
--- a/arch/powerpc/include/asm/kvm_book3s_uvmem.h
+++ b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
@@ -11,6 +11,8 @@ unsigned long kvmppc_h_svm_page_out(struct kvm *kvm,
unsigned long gra,
unsigned long flags,
unsigned long page_shift);
+unsigned long kvmppc_h_svm_init_start(struct kvm *kvm);
+unsigned long kvmppc_h_svm_init_done(struct kvm *kvm);
 #else
 static inline unsigned long
 kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gra,
@@ -25,5 +27,15 @@ kvmppc_h_svm_page_out(struct kvm *kvm, unsigned long gra,
 {
return H_UNSUPPORTED;
 }
+
+static inline unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
+{
+   return H_UNSUPPORTED;
+}
+
+static inline unsigned long kvmppc_h_svm_init_done(struct kvm *kvm)
+{
+   return H_UNSUPPORTED;
+}
 #endif /* CONFIG_PPC_UV */
 #endif /* __POWERPC_KVM_PPC_HMM_H__ */
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 16633ad3be45..cab3099db8d4 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -281,6 +281,10 @@ struct kvm_hpt_info {
 
 struct kvm_resize_hpt;
 
+/* Flag values for kvm_arch.secure_guest */
+#define KVMPPC_SECURE_INIT_START 0x1 /* H_SVM_INIT_START has been called */
+#define KVMPPC_SECURE_INIT_DONE  0x2 /* H_SVM_INIT_DONE completed */
+
 struct kvm_arch {
unsigned int lpid;
unsigned int smt_mode;  /* # vcpus per virtual core */
diff --git a/arch/powerpc/include/asm/ultravisor-api.h 
b/arch/powerpc/include/asm/ultravisor-api.h
index 1cd1f595fd81..c578d9b13a56 100644
--- a/arch/powerpc/include/asm/ultravisor-api.h
+++ b/arch/powerpc/include/asm/ultravisor-api.h
@@ -25,6 +25,7 @@
 /* opcodes */
 #define UV_WRITE_PATE  0xF104
 #define UV_RETURN  0xF11C
+#define UV_REGISTER_MEM_SLOT   0xF120
 #define UV_PAGE_IN 0xF128
 #define UV_PAGE_OUT0xF12C
 
diff --git a/arch/powerpc/include/asm/ultravisor.h 
b/arch/powerpc/include/asm/ultravisor.h
index 0fc4a974b2e8..58ccf5e2d6bb 100644
--- a/arch/powerpc/include/asm/ultravisor.h
+++ b/arch/powerpc/include/asm/ultravisor.h
@@ -45,4 +45,11 @@ static inline int uv_page_out(u64 lpid, u64 dst_ra, u64 
src_gpa, u64 flags,
page_shift);
 }
 
+static inline int uv_register_mem_slot(u64 lpid, u64 start_gpa, u64 size,
+  u64 flags, u64 slotid)
+{
+   return ucall_norets(UV_REGISTER_MEM_SLOT, lpid, start_gpa,
+   size, flags, slotid);
+}
+
 #endif /* _ASM_POWERPC_ULTRAVISOR_H */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index c5404db8f0cd..2527f1676b59 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1089,6 +1089,13 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
kvmppc_get_gpr(vcpu, 5),
kvmppc_get_gpr(vcpu, 6));
break;
+   case H_SVM_INIT_START:
+   ret = kvmppc_h_svm_init_start(vcpu->kvm);
+   break;
+   case H_SVM_INIT_DONE:
+  

[PATCH v8 3/8] kvmppc: Shared pages support for secure guests

2019-09-10 Thread Bharata B Rao
A secure guest will share some of its pages with the hypervisor (e.g. virtio
bounce buffers). Support sharing of pages between the hypervisor and the
ultravisor.

Once a secure page is converted to a shared page, the device page is
unmapped from the HV side page tables.

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/include/asm/hvcall.h  |  3 ++
 arch/powerpc/kvm/book3s_hv_uvmem.c | 65 --
 2 files changed, 65 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index 2595d0144958..4e98dd992bd1 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -342,6 +342,9 @@
 #define H_TLB_INVALIDATE   0xF808
 #define H_COPY_TOFROM_GUEST0xF80C
 
+/* Flags for H_SVM_PAGE_IN */
+#define H_PAGE_IN_SHARED0x1
+
 /* Platform-specific hcalls used by the Ultravisor */
 #define H_SVM_PAGE_IN  0xEF00
 #define H_SVM_PAGE_OUT 0xEF04
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index a1eccb065ba9..bcecb643a730 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -46,6 +46,7 @@ struct kvmppc_uvmem_page_pvt {
unsigned long *rmap;
unsigned int lpid;
unsigned long gpa;
+   bool skip_page_out;
 };
 
 /*
@@ -159,6 +160,53 @@ kvmppc_svm_page_in(struct vm_area_struct *vma, unsigned 
long start,
return ret;
 }
 
+/*
+ * Shares the page with HV, thus making it a normal page.
+ *
+ * - If the page is already secure, then provision a new page and share
+ * - If the page is a normal page, share the existing page
+ *
+ * In the former case, uses the dev_pagemap_ops migrate_to_ram handler
+ * to unmap the device page from QEMU's page tables.
+ */
+static unsigned long
+kvmppc_share_page(struct kvm *kvm, unsigned long gpa, unsigned long page_shift)
+{
+
+   int ret = H_PARAMETER;
+   struct page *uvmem_page;
+   struct kvmppc_uvmem_page_pvt *pvt;
+   unsigned long pfn;
+   unsigned long *rmap;
+   struct kvm_memory_slot *slot;
+   unsigned long gfn = gpa >> page_shift;
+   int srcu_idx;
+
+   srcu_idx = srcu_read_lock(&kvm->srcu);
+   slot = gfn_to_memslot(kvm, gfn);
+   if (!slot)
+   goto out;
+
+   rmap = &slot->arch.rmap[gfn - slot->base_gfn];
+   if (kvmppc_rmap_type(rmap) == KVMPPC_RMAP_UVMEM_PFN) {
+   uvmem_page = pfn_to_page(*rmap & ~KVMPPC_RMAP_UVMEM_PFN);
+   pvt = (struct kvmppc_uvmem_page_pvt *)
+   uvmem_page->zone_device_data;
+   pvt->skip_page_out = true;
+   }
+
+   pfn = gfn_to_pfn(kvm, gfn);
+   if (is_error_noslot_pfn(pfn))
+   goto out;
+
+   if (!uv_page_in(kvm->arch.lpid, pfn << page_shift, gpa, 0, page_shift))
+   ret = H_SUCCESS;
+   kvm_release_pfn_clean(pfn);
+out:
+   srcu_read_unlock(&kvm->srcu, srcu_idx);
+   return ret;
+}
+
 /*
  * H_SVM_PAGE_IN: Move page from normal memory to secure memory.
  */
@@ -177,9 +225,12 @@ kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gpa,
if (page_shift != PAGE_SHIFT)
return H_P3;
 
-   if (flags)
+   if (flags & ~H_PAGE_IN_SHARED)
return H_P2;
 
+   if (flags & H_PAGE_IN_SHARED)
+   return kvmppc_share_page(kvm, gpa, page_shift);
+
ret = H_PARAMETER;
srcu_idx = srcu_read_lock(&kvm->srcu);
down_read(&kvm->mm->mmap_sem);
@@ -252,8 +303,16 @@ kvmppc_svm_page_out(struct vm_area_struct *vma, unsigned 
long start,
pvt = spage->zone_device_data;
pfn = page_to_pfn(dpage);
 
-   ret = uv_page_out(pvt->lpid, pfn << page_shift, pvt->gpa, 0,
- page_shift);
+   /*
+* This function is used in two cases:
+* - When HV touches a secure page, for which we do UV_PAGE_OUT
+* - When a secure page is converted to shared page, we touch
+*   the page to essentially unmap the device page. In this
+*   case we skip page-out.
+*/
+   if (!pvt->skip_page_out)
+   ret = uv_page_out(pvt->lpid, pfn << page_shift, pvt->gpa, 0,
+ page_shift);
 
if (ret == U_SUCCESS)
*mig.dst = migrate_pfn(pfn) | MIGRATE_PFN_LOCKED;
-- 
2.21.0



[PATCH v8 1/8] KVM: PPC: Book3S HV: Define usage types for rmap array in guest memslot

2019-09-10 Thread Bharata B Rao
From: Suraj Jitindar Singh 

The rmap array in the guest memslot is an array of size number of guest
pages, allocated at memslot creation time. Each rmap entry in this array
is used to store information about the guest page to which it
corresponds. For example for a hpt guest it is used to store a lock bit,
rc bits, a present bit and the index of a hpt entry in the guest hpt
which maps this page. For a radix guest which is running nested guests
it is used to store a pointer to a linked list of nested rmap entries
which store the nested guest physical address which maps this guest
address and for which there is a pte in the shadow page table.

As there are currently two uses for the rmap array, with the potential
for more in the future, define a type field (the top 8 bits of the rmap
entry) that identifies the type of rmap entry currently present, and
define two values for this field for the two current uses of the rmap
array.

Since the nested case uses the rmap entry to store a pointer, define
this type as having the two high bits set as is expected for a pointer.
Define the hpt entry type as having bit 56 set (bit 7 IBM bit ordering).
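
As an illustration of how the new type field is meant to be consumed, here is a
small hypothetical fragment using the helper this patch introduces; the UVMEM
case mirrors its use later in this series, the rest is just an example:

	unsigned long *rmap = &memslot->arch.rmap[gfn - memslot->base_gfn];

	switch (kvmppc_rmap_type(rmap)) {
	case KVMPPC_RMAP_HPT:
		/* Bottom 32 bits index the HPTE that maps this guest page. */
		break;
	case KVMPPC_RMAP_NESTED:
		/* Entry holds a pointer to a list of nested rmap entries. */
		break;
	case KVMPPC_RMAP_UVMEM_PFN:
		/* Guest page is secure; entry tracks the device (secure) PFN. */
		break;
	}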

Signed-off-by: Suraj Jitindar Singh 
Signed-off-by: Paul Mackerras 
Signed-off-by: Bharata B Rao 
[Added rmap type KVMPPC_RMAP_UVMEM_PFN]
---
 arch/powerpc/include/asm/kvm_host.h | 28 
 arch/powerpc/kvm/book3s_hv_rm_mmu.c |  2 +-
 2 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 4bb552d639b8..81cd221ccc04 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -232,11 +232,31 @@ struct revmap_entry {
 };
 
 /*
- * We use the top bit of each memslot->arch.rmap entry as a lock bit,
- * and bit 32 as a present flag.  The bottom 32 bits are the
- * index in the guest HPT of a HPTE that points to the page.
+ * The rmap array of size number of guest pages is allocated for each memslot.
+ * This array is used to store usage specific information about the guest page.
+ * Below are the encodings of the various possible usage types.
  */
-#define KVMPPC_RMAP_LOCK_BIT   63
+/* Free bits which can be used to define a new usage */
+#define KVMPPC_RMAP_TYPE_MASK  0xff00
+#define KVMPPC_RMAP_NESTED 0xc000  /* Nested rmap array */
+#define KVMPPC_RMAP_HPT0x0100  /* HPT guest */
+#define KVMPPC_RMAP_UVMEM_PFN  0x0200  /* Secure GPA */
+
+static inline unsigned long kvmppc_rmap_type(unsigned long *rmap)
+{
+   return (*rmap & KVMPPC_RMAP_TYPE_MASK);
+}
+
+/*
+ * rmap usage definition for a hash page table (hpt) guest:
+ * 0x0800  Lock bit
+ * 0x0180  RC bits
+ * 0x0001  Present bit
+ * 0x  HPT index bits
+ * The bottom 32 bits are the index in the guest HPT of a HPTE that points to
+ * the page.
+ */
+#define KVMPPC_RMAP_LOCK_BIT   43
 #define KVMPPC_RMAP_RC_SHIFT   32
 #define KVMPPC_RMAP_REFERENCED (HPTE_R_R << KVMPPC_RMAP_RC_SHIFT)
 #define KVMPPC_RMAP_PRESENT0x1ul
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 63e0ce91e29d..7186c65c61c9 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -99,7 +99,7 @@ void kvmppc_add_revmap_chain(struct kvm *kvm, struct 
revmap_entry *rev,
} else {
rev->forw = rev->back = pte_index;
*rmap = (*rmap & ~KVMPPC_RMAP_INDEX) |
-   pte_index | KVMPPC_RMAP_PRESENT;
+   pte_index | KVMPPC_RMAP_PRESENT | KVMPPC_RMAP_HPT;
}
unlock_rmap(rmap);
 }
-- 
2.21.0



[PATCH v8 0/8] kvmppc: Driver to manage pages of secure guest

2019-09-10 Thread Bharata B Rao
Hi,

A pseries guest can be run as a secure guest on Ultravisor-enabled
POWER platforms. On such platforms, this driver will be used to manage
the movement of guest pages between the normal memory managed by
hypervisor(HV) and secure memory managed by Ultravisor(UV).

Private ZONE_DEVICE memory equal to the amount of secure memory
available in the platform for running secure guests is created.
Whenever a page belonging to the guest becomes secure, a page from
this private device memory is used to represent and track that secure
page on the HV side. The movement of pages between normal and secure
memory is done via migrate_vma_pages(). The reverse movement is driven
via pagemap_ops.migrate_to_ram().

The page-in or page-out requests from UV will come to HV as hcalls and
HV will call back into UV via uvcalls to satisfy these page requests.

These patches are against hmm.git
(https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/log/?h=hmm)

plus

Claudio Carvalho's base ultravisor enablement patches that are present
in Michael Ellerman's tree
(https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/log/?h=topic/ppc-kvm)

These patches along with Claudio's above patches are required to
run secure pseries guests on KVM. This patchset is based on hmm.git
because hmm.git has migrate_vma cleanup and not-device memremap_pages
patchsets that are required by this patchset.

Changes in v8
=
- s/kvmppc_devm/kvmppc_uvmem
- Carrying Suraj's patch that defines bit positions for different rmap
  functions from Paul's kvm-next branch. Added KVMPPC_RMAP_UVMEM_PFN
  to this patch.
- No need to use irqsave version of spinlock to protect pfn bitmap
- mmap_sem and srcu_lock reversal in page-in/page-out so that we
  have uniform locking semantics in page-in, page-out, fault and
  reset paths. This also matches with other usages of the same
  two locks in powerpc code.
- kvmppc_uvmem_free_memslot_pfns() needs kvm srcu read lock.
- Addressed all the review feedback from Christoph and Sukadev.
  - Dropped kvmppc_rmap_is_devm_pfn() and introduced kvmppc_rmap_type()
  - Bail out early if page-in request comes for an already paged-in page
  - kvmppc_uvmem_pfn_lock re-arrangement
  - Check for failure from gfn_to_memslot in kvmppc_h_svm_page_in
  - Consolidate migrate_vma setup and related code into two helpers
kvmppc_svm_page_in/out.
  - Use NUMA_NO_NODE in memremap_pages() instead of -1
  - Removed externs in declarations
  - Ensure *rmap assignment gets cleared in the error case in
kvmppc_uvmem_get_page()
- A few other code cleanups

v7: https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-August/195631.html

Anshuman Khandual (1):
  KVM: PPC: Ultravisor: Add PPC_UV config option

Bharata B Rao (6):
  kvmppc: Movement of pages between normal and secure memory
  kvmppc: Shared pages support for secure guests
  kvmppc: H_SVM_INIT_START and H_SVM_INIT_DONE hcalls
  kvmppc: Handle memory plug/unplug to secure VM
  kvmppc: Radix changes for secure guest
  kvmppc: Support reset of secure guest

Suraj Jitindar Singh (1):
  KVM: PPC: Book3S HV: Define usage types for rmap array in guest
memslot

 Documentation/virt/kvm/api.txt  |  19 +
 arch/powerpc/Kconfig|  17 +
 arch/powerpc/include/asm/hvcall.h   |   9 +
 arch/powerpc/include/asm/kvm_book3s_uvmem.h |  48 ++
 arch/powerpc/include/asm/kvm_host.h |  56 +-
 arch/powerpc/include/asm/kvm_ppc.h  |   2 +
 arch/powerpc/include/asm/ultravisor-api.h   |   6 +
 arch/powerpc/include/asm/ultravisor.h   |  36 ++
 arch/powerpc/kvm/Makefile   |   3 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c  |  22 +
 arch/powerpc/kvm/book3s_hv.c| 121 
 arch/powerpc/kvm/book3s_hv_rm_mmu.c |   2 +-
 arch/powerpc/kvm/book3s_hv_uvmem.c  | 604 
 arch/powerpc/kvm/powerpc.c  |  12 +
 include/uapi/linux/kvm.h|   1 +
 15 files changed, 953 insertions(+), 5 deletions(-)
 create mode 100644 arch/powerpc/include/asm/kvm_book3s_uvmem.h
 create mode 100644 arch/powerpc/kvm/book3s_hv_uvmem.c

-- 
2.21.0



Re: [PATCH 1/2] libnvdimm/altmap: Track namespace boundaries in altmap

2019-09-10 Thread Santosh Sivaraj
"Aneesh Kumar K.V"  writes:

> With PFN_MODE_PMEM namespace, the memmap area is allocated from the device
> area. Some architectures map the memmap area with large page size. On
> architectures like ppc64, 16MB page for memmap mapping can map 262144 pfns.
> This maps a namespace size of 16G.
>
> When populating memmap region with 16MB page from the device area,
> make sure the allocated space is not used to map resources outside this
> namespace. Such usage of device area will prevent a namespace destroy.
>
> Add resource end pfn in altmap and use that to check if the memmap area
> allocation can map pfns outside the namespace. On ppc64 in such a case we
> fall back to allocation from memory.
>
> This fixes the kernel crash reported below:
>
> [  132.034989] WARNING: CPU: 13 PID: 13719 at mm/memremap.c:133 
> devm_memremap_pages_release+0x2d8/0x2e0
> [  133.464754] BUG: Unable to handle kernel data access at 0xc00c00010b204000
> [  133.464760] Faulting instruction address: 0xc007580c
> [  133.464766] Oops: Kernel access of bad area, sig: 11 [#1]
> [  133.464771] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> .
> [  133.464901] NIP [c007580c] vmemmap_free+0x2ac/0x3d0
> [  133.464906] LR [c00757f8] vmemmap_free+0x298/0x3d0
> [  133.464910] Call Trace:
> [  133.464914] [c07cbfd0f7b0] [c00757f8] vmemmap_free+0x298/0x3d0 
> (unreliable)
> [  133.464921] [c07cbfd0f8d0] [c0370a44] 
> section_deactivate+0x1a4/0x240
> [  133.464928] [c07cbfd0f980] [c0386270] 
> __remove_pages+0x3a0/0x590
> [  133.464935] [c07cbfd0fa50] [c0074158] 
> arch_remove_memory+0x88/0x160
> [  133.464942] [c07cbfd0fae0] [c03be8c0] 
> devm_memremap_pages_release+0x150/0x2e0
> [  133.464949] [c07cbfd0fb70] [c0738ea0] 
> devm_action_release+0x30/0x50
> [  133.464955] [c07cbfd0fb90] [c073a5a4] release_nodes+0x344/0x400
> [  133.464961] [c07cbfd0fc40] [c073378c] 
> device_release_driver_internal+0x15c/0x250
> [  133.464968] [c07cbfd0fc80] [c072fd14] unbind_store+0x104/0x110
> [  133.464973] [c07cbfd0fcd0] [c072ee24] drv_attr_store+0x44/0x70
> [  133.464981] [c07cbfd0fcf0] [c04a32bc] sysfs_kf_write+0x6c/0xa0
> [  133.464987] [c07cbfd0fd10] [c04a1dfc] 
> kernfs_fop_write+0x17c/0x250
> [  133.464993] [c07cbfd0fd60] [c03c348c] __vfs_write+0x3c/0x70
> [  133.464999] [c07cbfd0fd80] [c03c75d0] vfs_write+0xd0/0x250
>
> Reported-by: Sachin Sant 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/mm/init_64.c | 17 -
>  drivers/nvdimm/pfn_devs.c |  2 ++
>  include/linux/memremap.h  |  1 +
>  3 files changed, 19 insertions(+), 1 deletion(-)

Tested-by: Santosh Sivaraj 

>
> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
> index a44f6281ca3a..4e08246acd79 100644
> --- a/arch/powerpc/mm/init_64.c
> +++ b/arch/powerpc/mm/init_64.c
> @@ -172,6 +172,21 @@ static __meminit void vmemmap_list_populate(unsigned 
> long phys,
>   vmemmap_list = vmem_back;
>  }
>  
> +static bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long 
> start,
> + unsigned long page_size)
> +{
> + unsigned long nr_pfn = page_size / sizeof(struct page);
> + unsigned long start_pfn = page_to_pfn((struct page *)start);
> +
> + if ((start_pfn + nr_pfn) > altmap->end_pfn)
> + return true;
> +
> + if (start_pfn < altmap->base_pfn)
> + return true;
> +
> + return false;
> +}
> +
>  int __meminit vmemmap_populate(unsigned long start, unsigned long end, int 
> node,
>   struct vmem_altmap *altmap)
>  {
> @@ -194,7 +209,7 @@ int __meminit vmemmap_populate(unsigned long start, 
> unsigned long end, int node,
>* fail due to alignment issues when using 16MB hugepages, so
>* fall back to system memory if the altmap allocation fail.
>*/
> - if (altmap) {
> + if (altmap && !altmap_cross_boundary(altmap, start, page_size)) 
> {
>   p = altmap_alloc_block_buf(page_size, altmap);
>   if (!p)
>   pr_debug("altmap block allocation failed, 
> falling back to system memory");
> diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
> index 3e7b11cf1aae..a616d69c8224 100644
> --- a/drivers/nvdimm/pfn_devs.c
> +++ b/drivers/nvdimm/pfn_devs.c
> @@ -618,9 +618,11 @@ static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, 
> struct dev_pagemap *pgmap)
>   struct nd_namespace_common *ndns = nd_pfn->ndns;
>   struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
>   resource_size_t base = nsio->res.start + start_pad;
> + resource_size_t end = nsio->res.end - end_trunc;
>   struct vmem_altmap __altmap = {
>   .base_pfn = init_altmap_base(base),
>   .reserve = init_altmap_reserve(base),
> +   

Re: [PATCH 1/2] libnvdimm/altmap: Track namespace boundaries in altmap

2019-09-10 Thread Dan Williams
On Mon, Sep 9, 2019 at 11:29 PM Aneesh Kumar K.V
 wrote:
>
> With PFN_MODE_PMEM namespace, the memmap area is allocated from the device
> area. Some architectures map the memmap area with large page size. On
> architectures like ppc64, 16MB page for memmap mapping can map 262144 pfns.
> This maps a namespace size of 16G.
>
> When populating memmap region with 16MB page from the device area,
> make sure the allocated space is not used to map resources outside this
> namespace. Such usage of device area will prevent a namespace destroy.
>
> Add resource end pfn in altmap and use that to check if the memmap area
> allocation can map pfns outside the namespace. On ppc64 in such a case we
> fall back to allocation from memory.

Shouldn't this instead be comprehended by nd_pfn_init() to increase
the reservation size so that it fits in the alignment? It may not
always be possible to fall back to allocation from memory for
extremely large pmem devices. I.e. at 64GB of memmap per 1TB of pmem
there may not be enough DRAM to store the memmap.


Re: [PATCH] net/ibmvnic: Fix missing { in __ibmvnic_reset

2019-09-10 Thread David Miller
From: Michal Suchanek 
Date: Mon,  9 Sep 2019 22:44:51 +0200

> Commit 1c2977c09499 ("net/ibmvnic: free reset work of removed device from 
> queue")
> adds a } without corresponding { causing build break.
> 
> Fixes: 1c2977c09499 ("net/ibmvnic: free reset work of removed device from 
> queue")
> Signed-off-by: Michal Suchanek 

Applied.


Re: [PATCH v12 11/12] open: openat2(2) syscall

2019-09-10 Thread Ingo Molnar


* Linus Torvalds  wrote:

> On Sat, Sep 7, 2019 at 10:42 AM Andy Lutomirski  wrote:
> >
> > Linus, you rejected resolveat() because you wanted a *nice* API
> 
> No. I rejected resoveat() because it was a completely broken garbage
> API that couldn't do even basic stuff right (like O_CREAT).
> 
> We have a ton of flag space in the new openat2() model, we might as
> well leave the old flags alone that people are (a) used to and (b) we
> have code to support _anyway_.
> 
> Making up a new flag namespace is only going to cause us - and users -
> more work, and more confusion. For no actual advantage. It's not going
> to be "cleaner". It's just going to be worse.

I suspect there is a "add a clean new flags namespace" analogy to the 
classic "add a clean new standard" XKCD:

https://xkcd.com/927/

Thanks,

Ingo


[PATCH 1/2] libnvdimm/altmap: Track namespace boundaries in altmap

2019-09-10 Thread Aneesh Kumar K.V
With PFN_MODE_PMEM namespace, the memmap area is allocated from the device
area. Some architectures map the memmap area with large page size. On
architectures like ppc64, 16MB page for memmap mapping can map 262144 pfns.
This maps a namespace size of 16G.

When populating memmap region with 16MB page from the device area,
make sure the allocated space is not used to map resources outside this
namespace. Such usage of device area will prevent a namespace destroy.

Add resource end pfn in altmap and use that to check if the memmap area
allocation can map pfns outside the namespace. On ppc64 in such a case we fall back
to allocation from memory.

This fixes the kernel crash reported below:

[  132.034989] WARNING: CPU: 13 PID: 13719 at mm/memremap.c:133 
devm_memremap_pages_release+0x2d8/0x2e0
[  133.464754] BUG: Unable to handle kernel data access at 0xc00c00010b204000
[  133.464760] Faulting instruction address: 0xc007580c
[  133.464766] Oops: Kernel access of bad area, sig: 11 [#1]
[  133.464771] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
.
[  133.464901] NIP [c007580c] vmemmap_free+0x2ac/0x3d0
[  133.464906] LR [c00757f8] vmemmap_free+0x298/0x3d0
[  133.464910] Call Trace:
[  133.464914] [c07cbfd0f7b0] [c00757f8] vmemmap_free+0x298/0x3d0 
(unreliable)
[  133.464921] [c07cbfd0f8d0] [c0370a44] 
section_deactivate+0x1a4/0x240
[  133.464928] [c07cbfd0f980] [c0386270] __remove_pages+0x3a0/0x590
[  133.464935] [c07cbfd0fa50] [c0074158] 
arch_remove_memory+0x88/0x160
[  133.464942] [c07cbfd0fae0] [c03be8c0] 
devm_memremap_pages_release+0x150/0x2e0
[  133.464949] [c07cbfd0fb70] [c0738ea0] 
devm_action_release+0x30/0x50
[  133.464955] [c07cbfd0fb90] [c073a5a4] release_nodes+0x344/0x400
[  133.464961] [c07cbfd0fc40] [c073378c] 
device_release_driver_internal+0x15c/0x250
[  133.464968] [c07cbfd0fc80] [c072fd14] unbind_store+0x104/0x110
[  133.464973] [c07cbfd0fcd0] [c072ee24] drv_attr_store+0x44/0x70
[  133.464981] [c07cbfd0fcf0] [c04a32bc] sysfs_kf_write+0x6c/0xa0
[  133.464987] [c07cbfd0fd10] [c04a1dfc] 
kernfs_fop_write+0x17c/0x250
[  133.464993] [c07cbfd0fd60] [c03c348c] __vfs_write+0x3c/0x70
[  133.464999] [c07cbfd0fd80] [c03c75d0] vfs_write+0xd0/0x250

Reported-by: Sachin Sant 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/init_64.c | 17 -
 drivers/nvdimm/pfn_devs.c |  2 ++
 include/linux/memremap.h  |  1 +
 3 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index a44f6281ca3a..4e08246acd79 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -172,6 +172,21 @@ static __meminit void vmemmap_list_populate(unsigned long 
phys,
vmemmap_list = vmem_back;
 }
 
+static bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long 
start,
+   unsigned long page_size)
+{
+   unsigned long nr_pfn = page_size / sizeof(struct page);
+   unsigned long start_pfn = page_to_pfn((struct page *)start);
+
+   if ((start_pfn + nr_pfn) > altmap->end_pfn)
+   return true;
+
+   if (start_pfn < altmap->base_pfn)
+   return true;
+
+   return false;
+}
+
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int 
node,
struct vmem_altmap *altmap)
 {
@@ -194,7 +209,7 @@ int __meminit vmemmap_populate(unsigned long start, 
unsigned long end, int node,
 * fail due to alignment issues when using 16MB hugepages, so
 * fall back to system memory if the altmap allocation fail.
 */
-   if (altmap) {
+   if (altmap && !altmap_cross_boundary(altmap, start, page_size)) 
{
p = altmap_alloc_block_buf(page_size, altmap);
if (!p)
pr_debug("altmap block allocation failed, 
falling back to system memory");
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 3e7b11cf1aae..a616d69c8224 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -618,9 +618,11 @@ static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, 
struct dev_pagemap *pgmap)
struct nd_namespace_common *ndns = nd_pfn->ndns;
struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
resource_size_t base = nsio->res.start + start_pad;
+   resource_size_t end = nsio->res.end - end_trunc;
struct vmem_altmap __altmap = {
.base_pfn = init_altmap_base(base),
.reserve = init_altmap_reserve(base),
+   .end_pfn = PHYS_PFN(end),
};
 
memcpy(res, &nsio->res, sizeof(*res));
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index f8a5b2a19945..c70996fe48c8 100644
--- a/include/linux/memremap.h

[PATCH 2/2] powerpc/nvdimm: Update vmemmap_populated to check sub-section range

2019-09-10 Thread Aneesh Kumar K.V
With commit: 7cc7867fb061 ("mm/devm_memremap_pages: enable sub-section remap")
pmem namespaces are remapped in 2M chunks. On architectures like ppc64 we
can map the memmap area using 16MB hugepage size and that can cover
a memory range of 16G.

While enabling new pmem namespaces, since memory is added in sub-section chunks,
the kernel should check, before creating a new memmap mapping, whether there is an
existing memmap mapping covering the new pmem namespace. Currently, this is
validated by checking whether the section covering the range is already
initialized or not. Considering there can be multiple namespaces in the same
section this can result in wrong validation. Update this to check for
sub-sections in the range. This is done by checking for all pfns in the range we
are mapping.

We could optimize this by checking just one pfn in each sub-section, but
since this is not a fast path we keep it simple.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/init_64.c | 45 ---
 1 file changed, 23 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 4e08246acd79..7710ccdc19a2 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -70,30 +70,24 @@ EXPORT_SYMBOL_GPL(kernstart_addr);
 
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 /*
- * Given an address within the vmemmap, determine the pfn of the page that
- * represents the start of the section it is within.  Note that we have to
- * do this by hand as the proffered address may not be correctly aligned.
- * Subtraction of non-aligned pointers produces undefined results.
- */
-static unsigned long __meminit vmemmap_section_start(unsigned long page)
-{
-   unsigned long offset = page - ((unsigned long)(vmemmap));
-
-   /* Return the pfn of the start of the section. */
-   return (offset / sizeof(struct page)) & PAGE_SECTION_MASK;
-}
-
-/*
- * Check if this vmemmap page is already initialised.  If any section
+ * Check if this vmemmap page is already initialised.  If any sub section
  * which overlaps this vmemmap page is initialised then this page is
  * initialised already.
  */
-static int __meminit vmemmap_populated(unsigned long start, int page_size)
+
+static int __meminit vmemmap_populated(unsigned long start, int size)
 {
-   unsigned long end = start + page_size;
-   start = (unsigned long)(pfn_to_page(vmemmap_section_start(start)));
+   unsigned long end = start + size;
 
-   for (; start < end; start += (PAGES_PER_SECTION * sizeof(struct page)))
+   /* start is size aligned and it is always > sizeof(struct page) */
+   VM_BUG_ON(start & sizeof(struct page));
+   for (; start < end; start += sizeof(struct page))
+   /*
+* pfn valid check here is intended to really check
+* whether we have any subsection already initialized
+* in this range. We keep it simple by checking every
+* pfn in the range.
+*/
if (pfn_valid(page_to_pfn((struct page *)start)))
return 1;
 
@@ -201,6 +195,12 @@ int __meminit vmemmap_populate(unsigned long start, 
unsigned long end, int node,
void *p = NULL;
int rc;
 
+   /*
+* This vmemmap range is backing different subsections. If any
+* of that subsection is marked valid, that means we already
+* have initialized a page table covering this range and hence
+* the vmemmap range is populated.
+*/
if (vmemmap_populated(start, page_size))
continue;
 
@@ -290,9 +290,10 @@ void __ref vmemmap_free(unsigned long start, unsigned long 
end,
struct page *page;
 
/*
-* the section has already be marked as invalid, so
-* vmemmap_populated() true means some other sections still
-* in this page, so skip it.
+* We have already marked the subsection we are trying to remove
+* invalid. So if we want to remove the vmemmap range, we
+* need to make sure there is no subsection marked valid
+* in this range.
 */
if (vmemmap_populated(start, page_size))
continue;
-- 
2.21.0



[PATCH] crypto: talitos - fix hash result for VMAP_STACK

2019-09-10 Thread Christophe Leroy
When VMAP_STACK is selected, the stack cannot be DMA-mapped.
Therefore, the hash result has to be DMA-mapped in the request
context and copied into areq->result at completion.

Signed-off-by: Christophe Leroy 
---
 drivers/crypto/talitos.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/crypto/talitos.c b/drivers/crypto/talitos.c
index c9d686a0e805..9bd9ff312e2d 100644
--- a/drivers/crypto/talitos.c
+++ b/drivers/crypto/talitos.c
@@ -1728,6 +1728,7 @@ static void common_nonsnoop_hash_unmap(struct device *dev,
   struct ahash_request *areq)
 {
struct talitos_ahash_req_ctx *req_ctx = ahash_request_ctx(areq);
+   struct crypto_ahash *tfm = crypto_ahash_reqtfm(areq);
struct talitos_private *priv = dev_get_drvdata(dev);
bool is_sec1 = has_ftr_sec1(priv);
struct talitos_desc *desc = &edesc->desc;
@@ -1738,6 +1739,9 @@ static void common_nonsnoop_hash_unmap(struct device *dev,
if (desc->next_desc &&
desc->ptr[5].ptr != desc2->ptr[5].ptr)
unmap_single_talitos_ptr(dev, >ptr[5], DMA_FROM_DEVICE);
+   if (req_ctx->last)
+   memcpy(areq->result, req_ctx->hw_context,
+  crypto_ahash_digestsize(tfm));
 
if (req_ctx->psrc)
talitos_sg_unmap(dev, edesc, req_ctx->psrc, NULL, 0, 0);
@@ -1869,7 +1873,7 @@ static int common_nonsnoop_hash(struct talitos_edesc 
*edesc,
if (req_ctx->last)
map_single_talitos_ptr(dev, &desc->ptr[5],
   crypto_ahash_digestsize(tfm),
-  areq->result, DMA_FROM_DEVICE);
+  req_ctx->hw_context, DMA_FROM_DEVICE);
else
map_single_talitos_ptr_nosync(dev, &desc->ptr[5],
  req_ctx->hw_context_size,
-- 
2.13.3



[PATCH v2] PPC: Set reserved PCR bits

2019-09-10 Thread Alistair Popple
Currently the reserved bits of the Processor Compatibility Register
(PCR) are cleared as per the Programming Note in Section 1.3.3 of
version 3.0B of the Power ISA. This causes all new architecture
features to be made available when running on newer processors that
add new architecture-feature bits to the PCR, because such bits must
be set to disable a given feature.

For example, to disable the new features added as part of Version 2.07
of the ISA, the corresponding bit in the PCR needs to be set.

As new processor features generally require explicit kernel support
they should be disabled until such support is implemented. Therefore
kernels should set all unknown/reserved bits in the PCR such that any
new architecture features which the kernel does not currently know
about get disabled.

An update is planned to the ISA to clarify that the PCR is an
exception to the Programming Note on reserved bits in Section 1.3.3.
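
For clarity, evaluating the new macros from the reg.h hunk below gives (the
numeric values here are simply the result of that evaluation, not part of the
patch):

  PCR_HIGH_BITS = PCR_VEC_DIS | PCR_VSX_DIS | PCR_TM_DIS      = 0xe000000000000000
  PCR_LOW_BITS  = PCR_ARCH_207 | PCR_ARCH_206 | PCR_ARCH_205  = 0x000000000000000e
  PCR_MASK      = ~(PCR_HIGH_BITS | PCR_LOW_BITS)             = 0x1ffffffffffffff1

so writing PCR_MASK rather than 0 into the PCR leaves every reserved bit set,
which disables any future architecture feature the kernel does not yet know
about.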

Signed-off-by: Alistair Popple 
Signed-off-by: Jordan Niethe 
Tested-by: Joel Stanley 
---
v2: Added some clarifications to the commit message
---
 arch/powerpc/include/asm/reg.h  |  3 +++
 arch/powerpc/kernel/cpu_setup_power.S   |  6 ++
 arch/powerpc/kernel/dt_cpu_ftrs.c   |  3 ++-
 arch/powerpc/kvm/book3s_hv.c| 11 +++
 arch/powerpc/kvm/book3s_hv_nested.c |  6 +++---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 10 ++
 6 files changed, 27 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 10caa145f98b..258435c75c43 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -475,6 +475,7 @@
 #define   PCR_VEC_DIS  (1ul << (63-0)) /* Vec. disable (bit NA since POWER8) */
 #define   PCR_VSX_DIS  (1ul << (63-1)) /* VSX disable (bit NA since POWER8) */
 #define   PCR_TM_DIS   (1ul << (63-2)) /* Trans. memory disable (POWER8) */
+#define   PCR_HIGH_BITS(PCR_VEC_DIS | PCR_VSX_DIS | PCR_TM_DIS)
 /*
  * These bits are used in the function kvmppc_set_arch_compat() to specify and
  * determine both the compatibility level which we want to emulate and the
@@ -483,6 +484,8 @@
 #define   PCR_ARCH_207 0x8 /* Architecture 2.07 */
 #define   PCR_ARCH_206 0x4 /* Architecture 2.06 */
 #define   PCR_ARCH_205 0x2 /* Architecture 2.05 */
+#define   PCR_LOW_BITS (PCR_ARCH_207 | PCR_ARCH_206 | PCR_ARCH_205)
+#define   PCR_MASK ~(PCR_HIGH_BITS | PCR_LOW_BITS) /* PCR Reserved Bits */
 #defineSPRN_HEIR   0x153   /* Hypervisor Emulated Instruction 
Register */
 #define SPRN_TLBINDEXR 0x154   /* P7 TLB control register */
 #define SPRN_TLBVPNR   0x155   /* P7 TLB control register */
diff --git a/arch/powerpc/kernel/cpu_setup_power.S 
b/arch/powerpc/kernel/cpu_setup_power.S
index 3239a9fe6c1c..a460298c7ddb 100644
--- a/arch/powerpc/kernel/cpu_setup_power.S
+++ b/arch/powerpc/kernel/cpu_setup_power.S
@@ -23,6 +23,7 @@ _GLOBAL(__setup_cpu_power7)
beqlr
li  r0,0
mtspr   SPRN_LPID,r0
+   LOAD_REG_IMMEDIATE(r0, PCR_MASK)
mtspr   SPRN_PCR,r0
mfspr   r3,SPRN_LPCR
li  r4,(LPCR_LPES1 >> LPCR_LPES_SH)
@@ -37,6 +38,7 @@ _GLOBAL(__restore_cpu_power7)
beqlr
li  r0,0
mtspr   SPRN_LPID,r0
+   LOAD_REG_IMMEDIATE(r0, PCR_MASK)
mtspr   SPRN_PCR,r0
mfspr   r3,SPRN_LPCR
li  r4,(LPCR_LPES1 >> LPCR_LPES_SH)
@@ -54,6 +56,7 @@ _GLOBAL(__setup_cpu_power8)
beqlr
li  r0,0
mtspr   SPRN_LPID,r0
+   LOAD_REG_IMMEDIATE(r0, PCR_MASK)
mtspr   SPRN_PCR,r0
mfspr   r3,SPRN_LPCR
ori r3, r3, LPCR_PECEDH
@@ -76,6 +79,7 @@ _GLOBAL(__restore_cpu_power8)
beqlr
li  r0,0
mtspr   SPRN_LPID,r0
+   LOAD_REG_IMMEDIATE(r0, PCR_MASK)
mtspr   SPRN_PCR,r0
mfspr   r3,SPRN_LPCR
ori r3, r3, LPCR_PECEDH
@@ -98,6 +102,7 @@ _GLOBAL(__setup_cpu_power9)
mtspr   SPRN_PSSCR,r0
mtspr   SPRN_LPID,r0
mtspr   SPRN_PID,r0
+   LOAD_REG_IMMEDIATE(r0, PCR_MASK)
mtspr   SPRN_PCR,r0
mfspr   r3,SPRN_LPCR
LOAD_REG_IMMEDIATE(r4, LPCR_PECEDH | LPCR_PECE_HVEE | LPCR_HVICE  | 
LPCR_HEIC)
@@ -123,6 +128,7 @@ _GLOBAL(__restore_cpu_power9)
mtspr   SPRN_PSSCR,r0
mtspr   SPRN_LPID,r0
mtspr   SPRN_PID,r0
+   LOAD_REG_IMMEDIATE(r0, PCR_MASK)
mtspr   SPRN_PCR,r0
mfspr   r3,SPRN_LPCR
LOAD_REG_IMMEDIATE(r4, LPCR_PECEDH | LPCR_PECE_HVEE | LPCR_HVICE | 
LPCR_HEIC)
diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c 
b/arch/powerpc/kernel/dt_cpu_ftrs.c
index bd95318d2202..bceee2fde885 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -101,7 +101,7 @@ static void __restore_cpu_cpufeatures(void)
if (hv_mode) {
mtspr(SPRN_LPID, 0);
mtspr(SPRN_HFSCR, system_registers.hfscr);
-   mtspr(SPRN_PCR, 0);
+