Re: [PATCH v2] powerpc/64: system call implement the bulk of the logic in C

2019-09-04 Thread Nicholas Piggin
Michael Ellerman's message of September 5, 2019 2:14 pm:
> Nicholas Piggin  writes:
>> System call entry and particularly exit code is beyond the limit of what
>> is reasonable to implement in asm.
>>
>> This conversion moves all conditional branches out of the asm code,
>> except for the case that all GPRs should be restored at exit.
>>
>> Null syscall test is about 5% faster after this patch, because the exit
>> work is handled under local_irq_disable, and the hard mask and pending
>> interrupt replay is handled after that, which avoids games with MSR.
>>
>> Signed-off-by: Nicholas Piggin 
>> ---
>> Since v1:
>> - Fix big endian build (mpe)
>> - Fix order of exit tracing to after the result registers have been set.
>> - Move ->softe store before MSR[EE] is set, fix the now obsolete comment.
>> - More #ifdef tidyups and writing the accounting helpers nicer (Christophe)
>> - Minor things like move the TM debug store into C
> 
> This doesn't build in a few configs.
> 
> It needed:
> 
> +#else
> +static inline void kuap_check_amr(void) { }
> 
> In kup-radix.h to fix the KUAP=n build.

Thanks.

> 
> It still fails to build on ppc64e with:
> 
>   arch/powerpc/kernel/syscall_64.c:161:2: error: implicit declaration of function '__mtmsrd' [-Werror=implicit-function-declaration]

Ah, no __mtmsrd or RI on BookE, so that will need a new function in
hw_irq.h. Now that I think about it, I also need to do the irq tracing
with RI=1.

I'll resend a fixed patch.

Thanks,
Nick


[PATCH 2/2] powerpc/pci: Fix IOMMU setup for hotplugged devices on pseries

2019-09-04 Thread Shawn Anastasio
Move PCI device setup from pcibios_add_device() to pcibios_fixup_dev().
This ensures that platform-specific DMA and IOMMU setup occurs after the
device has been registered in sysfs, which is a requirement for IOMMU group
assignment to work.

This fixes IOMMU group assignment for hotplugged devices on pseries, where
the existing behavior results in IOMMU assignment before registration.

Signed-off-by: Shawn Anastasio 
---
 arch/powerpc/kernel/pci-common.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index f627e15bb43c..21b4761bb0ed 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -987,15 +987,14 @@ static void pcibios_setup_device(struct pci_dev *dev)
ppc_md.pci_irq_fixup(dev);
 }
 
-int pcibios_add_device(struct pci_dev *dev)
+void pcibios_fixup_dev(struct pci_dev *dev)
 {
-   /*
-* We can only call pcibios_setup_device() after bus setup is complete,
-* since some of the platform specific DMA setup code depends on it.
-*/
-   if (dev->bus->is_added)
-   pcibios_setup_device(dev);
+   /* Device is registered in sysfs and ready to be set up */
+   pcibios_setup_device(dev);
+}
 
+int pcibios_add_device(struct pci_dev *dev)
+{
 #ifdef CONFIG_PCI_IOV
if (ppc_md.pcibios_fixup_sriov)
ppc_md.pcibios_fixup_sriov(dev);
-- 
2.20.1



[PATCH 1/2] PCI: Introduce pcibios_fixup_dev()

2019-09-04 Thread Shawn Anastasio
Introduce pcibios_fixup_dev to allow platform-specific code to perform
final setup of a PCI device after it has been registered in sysfs.

The default implementation is a no-op.

Signed-off-by: Shawn Anastasio 
---
 drivers/pci/probe.c | 14 ++
 include/linux/pci.h |  1 +
 2 files changed, 15 insertions(+)

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index a3c7338fad86..14eb7ee38794 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2652,6 +2652,17 @@ static void pci_set_msi_domain(struct pci_dev *dev)
	dev_set_msi_domain(&dev->dev, d);
 }
 
+/**
+ * pcibios_fixup_dev - Platform-specific device setup
+ * @dev: Device to set up
+ *
+ * Default empty implementation. Replace with an architecture-specific
+ * setup routine, if necessary.
+ */
+void __weak pcibios_fixup_dev(struct pci_dev *dev)
+{
+}
+
 void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
 {
int ret;
@@ -2699,6 +2710,9 @@ void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
dev->match_driver = false;
	ret = device_add(&dev->dev);
WARN_ON(ret < 0);
+
+   /* Allow platform-specific code to perform final setup of device */
+   pcibios_fixup_dev(dev);
 }
 
 struct pci_dev *pci_scan_single_device(struct pci_bus *bus, int devfn)
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 82e4cd1b7ac3..83eb0e241137 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -960,6 +960,7 @@ void pcibios_bus_add_device(struct pci_dev *pdev);
 void pcibios_add_bus(struct pci_bus *bus);
 void pcibios_remove_bus(struct pci_bus *bus);
 void pcibios_fixup_bus(struct pci_bus *);
+void pcibios_fixup_dev(struct pci_dev *);
 int __must_check pcibios_enable_device(struct pci_dev *, int mask);
 /* Architecture-specific versions may override this (weak) */
 char *pcibios_setup(char *str);
-- 
2.20.1



[PATCH 0/2] Fix IOMMU setup for hotplugged devices on pseries

2019-09-04 Thread Shawn Anastasio
On pseries QEMU guests, IOMMU setup for hotplugged PCI devices is currently
broken for all but the first device on a given bus. The culprit is an ordering
issue in the pseries hotplug path (via pci_rescan_bus()) which results in IOMMU
group assignment occurring before device registration in sysfs. This triggers
the following check in arch/powerpc/kernel/iommu.c:

/*
 * The sysfs entries should be populated before
 * binding IOMMU group. If sysfs entries isn't
 * ready, we simply bail.
 */
if (!device_is_registered(dev))
return -ENOENT;

This fails for hotplugged devices since the pcibios_add_device() call in the
pseries hotplug path (in pci_device_add()) occurs before device_add().
Since the IOMMU groups are set up in pcibios_add_device(), this means that a
sysfs entry will not yet be present and it will fail.

There is a special case that allows the first hotplugged device on a bus to
succeed, though. The powerpc pcibios_add_device() implementation will skip
initializing the device if bus setup is not yet complete.
Later, the pci core will call pcibios_fixup_bus() which will perform setup
for the first (and only) device on the bus and since it has already been
registered in sysfs, the IOMMU setup will succeed.

My current solution is to introduce another pcibios function, pcibios_fixup_dev(),
which is called after device_add() in pci_device_add(). In powerpc code,
pcibios_setup_device() is then moved from pcibios_add_device() to this new
function, so it runs after sysfs registration and IOMMU assignment succeeds.

I added a new pcibios function rather than moving the pcibios_add_device() call
to after the device_add() call in pci_device_add() because other architectures
use it and it wasn't immediately clear to me whether moving it would break
them.

If anybody has more insight or a better way to fix this, please let me know.

Shawn Anastasio (2):
  PCI: Introduce pcibios_fixup_dev()
  powerpc/pci: Fix IOMMU setup for hotplugged devices on pseries

 arch/powerpc/kernel/pci-common.c | 13 ++---
 drivers/pci/probe.c  | 14 ++
 include/linux/pci.h  |  1 +
 3 files changed, 21 insertions(+), 7 deletions(-)

-- 
2.20.1



Re: [PATCH v5 14/31] powernv/fadump: define register/un-register callback functions

2019-09-04 Thread Michael Ellerman
Hari Bathini  writes:
> Make OPAL calls to register and un-register with firmware for MPIPL.
>

This has the same subject as patch 6, would be good to make them
different.

cheers


Re: [PATCH v2] powerpc/64: system call implement the bulk of the logic in C

2019-09-04 Thread Michael Ellerman
Nicholas Piggin  writes:
> System call entry and particularly exit code is beyond the limit of what
> is reasonable to implement in asm.
>
> This conversion moves all conditional branches out of the asm code,
> except for the case that all GPRs should be restored at exit.
>
> Null syscall test is about 5% faster after this patch, because the exit
> work is handled under local_irq_disable, and the hard mask and pending
> interrupt replay is handled after that, which avoids games with MSR.
>
> Signed-off-by: Nicholas Piggin 
> ---
> Since v1:
> - Fix big endian build (mpe)
> - Fix order of exit tracing to after the result registers have been set.
> - Move ->softe store before MSR[EE] is set, fix the now obsolete comment.
> - More #ifdef tidyups and writing the accounting helpers nicer (Christophe)
> - Minor things like move the TM debug store into C

This doesn't build in a few configs.

It needed:

+#else
+static inline void kuap_check_amr(void) { }

In kup-radix.h to fix the KUAP=n build.

It still fails to build on ppc64e with:

  arch/powerpc/kernel/syscall_64.c:161:2: error: implicit declaration of function '__mtmsrd' [-Werror=implicit-function-declaration]

  http://kisskb.ellerman.id.au/kisskb/buildresult/13946972/

Which I haven't debugged yet.

cheers


Re: [PATCH v4 02/16] powerpc/pseries: Introduce option to build secure virtual machines

2019-09-04 Thread Michael Ellerman
Thiago Jung Bauermann  writes:
> Michael Ellerman  writes:
>> On Tue, 2019-08-20 at 02:13:12 UTC, Thiago Jung Bauermann wrote:
>>> Introduce CONFIG_PPC_SVM to control support for secure guests and include
>>> Ultravisor-related helpers when it is selected
>>> 
>>> Signed-off-by: Thiago Jung Bauermann 
>>
>> Patch 2-14 & 16 applied to powerpc next, thanks.
>>
>> https://git.kernel.org/powerpc/c/136bc0397ae21dbf63ca02e5775ad353a479cd2f
>
> Thank you very much!

No worries. I meant to say, there were some minor differences between
your patch 15 adding the documentation and Claudio's version. If you
want those differences applied please send me an incremental patch.

cheers



Re: [PATCH v3 3/4] x86/efi: move common keyring handler functions to new file

2019-09-04 Thread Michael Ellerman
Mimi Zohar  writes:
> (Cc'ing Josh Boyer, David Howells)
>
> On Mon, 2019-09-02 at 21:55 +1000, Michael Ellerman wrote:
>> Nayna Jain  writes:
>> 
>> > The handlers to add the keys to the .platform keyring and blacklisted
>> > hashes to the .blacklist keyring is common for both the uefi and powerpc
>> > mechanisms of loading the keys/hashes from the firmware.
>> >
>> > This patch moves the common code from load_uefi.c to keyring_handler.c
>> >
>> > Signed-off-by: Nayna Jain 
>
> Acked-by: Mimi Zohar 
>
>> > ---
>> >  security/integrity/Makefile   |  3 +-
>> >  .../platform_certs/keyring_handler.c  | 80 +++
>> >  .../platform_certs/keyring_handler.h  | 32 
>> >  security/integrity/platform_certs/load_uefi.c | 67 +---
>> >  4 files changed, 115 insertions(+), 67 deletions(-)
>> >  create mode 100644 security/integrity/platform_certs/keyring_handler.c
>> >  create mode 100644 security/integrity/platform_certs/keyring_handler.h
>> 
>> This has no acks from security folks, though I'm not really clear on who
>> maintains those files.
>
> I upstreamed David's, Josh's, and Nayna's patches, so that's probably
> me.
>
>> Do I take it because it's mostly just code movement people are OK with
>> it going in via the powerpc tree?
>
> Yes, the only reason for splitting load_uefi.c is for powerpc.  These
> patches should be upstreamed together.  

Thanks.

cheers


Re: [PATCH v3 2/3] Powerpc64/Watchpoint: Don't ignore extraneous exceptions

2019-09-04 Thread Ravi Bangoria




On 9/4/19 8:12 PM, Naveen N. Rao wrote:

Ravi Bangoria wrote:

On Powerpc64, the watchpoint match range is double-word granular. On
a watchpoint hit, DAR is set to the first byte of overlap between the
actual access and the watched range, so it's quite possible that DAR
does not point inside the user-specified range. For example, say the
user creates a watchpoint with address range 0x1004 to 0x1007. The hw
would then be configured to watch from 0x1000 to 0x1007. If there is
a 4-byte access from 0x1002 to 0x1005, DAR will point to 0x1002 and
the interrupt handler considers it extraneous, but it's actually not,
because part of the access falls within what the user asked for. So
let the kernel pass it on to the user and let the user decide what to
do with it instead of silently ignoring it. The drawback is that it
can generate false positive events.


I think you should do the additional validation here, instead of generating 
false positives. You should be able to read the instruction, run it through 
analyse_instr(), and then use OP_IS_LOAD_STORE() and GETSIZE() to understand 
the access range. This can be used to then perform a better match against what 
the user asked for.


Ok. Let me see how feasible that is.

But patch 1 and 3 are independent of this and can still go in. mpe?

-Ravi



Re: lockdep warning while booting POWER9 PowerNV

2019-09-04 Thread Michael Ellerman
Bart Van Assche  writes:
> On 8/30/19 2:13 PM, Qian Cai wrote:
>> https://raw.githubusercontent.com/cailca/linux-mm/master/powerpc.config
>> 
>> Once in a while, booting an IBM POWER9 PowerNV system (8335-GTH) would 
>> generate
>> a warning in lockdep_register_key() at,
>> 
>> if (WARN_ON_ONCE(static_obj(key)))
>> 
>> because
>> 
>> key = 0xc19ad118
>> &_stext = 0xc000
>> &_end = 0xc49d
>> 
>> i.e., it will cause static_obj() to return 1.
>
> (back from a trip)
>
> Hi Qian,
>
> Does this mean that on POWER9 it can happen that a dynamically allocated 
> object has an address that falls between &_stext and &_end?

I thought that was true on all arches due to initmem, but seems not.

I guess we have the same problem as s390 and we need to define
arch_is_kernel_initmem_freed().

Qian, can you try this:

diff --git a/arch/powerpc/include/asm/sections.h b/arch/powerpc/include/asm/sections.h
index 4a1664a8658d..616b1b7b7e52 100644
--- a/arch/powerpc/include/asm/sections.h
+++ b/arch/powerpc/include/asm/sections.h
@@ -5,8 +5,22 @@
 
 #include 
 #include 
+
+#define arch_is_kernel_initmem_freed arch_is_kernel_initmem_freed
+
 #include 
 
+extern bool init_mem_is_free;
+
+static inline int arch_is_kernel_initmem_freed(unsigned long addr)
+{
+   if (!init_mem_is_free)
+   return 0;
+
+   return addr >= (unsigned long)__init_begin &&
+   addr < (unsigned long)__init_end;
+}
+
 extern char __head_end[];
 
 #ifdef __powerpc64__


cheers


Re: [PATCH v5 19/31] powerpc/fadump: Update documentation about OPAL platform support

2019-09-04 Thread Michael Ellerman
"Oliver O'Halloran"  writes:
> On Wed, Sep 4, 2019 at 9:51 PM Michael Ellerman  wrote:
>> Hari Bathini  writes:
...
>> > diff --git a/Documentation/powerpc/firmware-assisted-dump.rst b/Documentation/powerpc/firmware-assisted-dump.rst
>> > index d912755..2c3342c 100644
>> > --- a/Documentation/powerpc/firmware-assisted-dump.rst
>> > +++ b/Documentation/powerpc/firmware-assisted-dump.rst
>> > @@ -96,7 +97,9 @@ as follows:
>> >
>> >  Please note that the firmware-assisted dump feature
>> >  is only available on Power6 and above systems with recent
>> > -firmware versions.
>>
>> Notice how "recent" has bit rotted.
>>
>> > +firmware versions on PSeries (PowerVM) platform and Power9
>> > +and above systems with recent firmware versions on PowerNV
>> > +(OPAL) platform.
>>
>> Can we say something more helpful here, ie. "recent" is not very useful.
>> AFAIK it's actually wrong, there isn't a released firmware with the
>> support yet at all, right?
>>
>> Given all the relevant firmware is open source can't we at least point
>> to a commit or release tag or something?
>
> Even if we can quote a git sha it's not terribly useful or user
> friendly. We already gate the feature behind DT nodes / properties
> existing, so why not just say "fadump requires XYZ firmware feature,
> as indicated by  device-tree property."

But how does that help someone who's got a Talos/Blackbird and wants to
test this stuff?

cheers


Re: [PATCH v5 15/31] powernv/fadump: support copying multiple kernel boot memory regions

2019-09-04 Thread Michael Ellerman
Hari Bathini  writes:
> On 04/09/19 5:00 PM, Michael Ellerman wrote:
>> Hari Bathini  writes:
>>> Firmware uses 32-bit field for region size while copying/backing-up
>> 
>> Which firmware exactly is imposing that limit?
>
> I think the MDST/MDRT tables in the f/w. Vasant, which component is that?
>
>>> +   /*
>>> +* Firmware currently supports only 32-bit value for size,
>> 
>> "currently" implies it could change in future?
>> 
>> If it does we assume it will only increase, and we're happy that old
>> kernels will continue to use the 32-bit limit?
>
> I am not aware of any plans to make it 64-bit. Let me just say f/w supports
> only 32-bit to get rid of that ambiguity..

OK. As long as everyone is aware that the kernel has no support for it
increasing without code changes.

cheers


Re: [PATCH v5 11/31] powernv/fadump: add fadump support on powernv

2019-09-04 Thread Michael Ellerman
Hari Bathini  writes:
> On 03/09/19 10:01 PM, Hari Bathini wrote:
>> 
> [...]
 diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
 index f7c8073..b8061fb9 100644
 --- a/arch/powerpc/kernel/fadump.c
 +++ b/arch/powerpc/kernel/fadump.c
 @@ -114,6 +114,9 @@ int __init early_init_dt_scan_fw_dump(unsigned long node, const char *uname,
if (strcmp(uname, "rtas") == 0)
		return rtas_fadump_dt_scan(&fw_dump, node);
  
 +  if (strcmp(uname, "ibm,opal") == 0)
 +  return opal_fadump_dt_scan(&fw_dump, node);
 +
>>>
>>> ie this would become:
>>>
>>> if (strcmp(uname, "ibm,opal") == 0 && opal_fadump_dt_scan(&fw_dump, node))
>>> return 1;
>>>
>> 
>> Yeah. Will update accordingly...
>
> On second thoughts, we don't need a return type at all here. The fw_dump
> struct and callbacks are populated based on what we found in the DT. And
> irrespective of what we found in the DT, we've got to return `1` once the
> particular depth and node is processed.

True. It's a little unclear because you're looking for "rtas" and
"ibm,opal" in the same function. But we know™ that no platform should
have both an "rtas" and an "ibm,opal" node, so once we find either we
are done scanning, regardless of whether the foo_fadump_dt_scan()
succeeds or fails.

cheers


Re: [PATCH] KVM: PPC: Book3S HV: add smp_mb() in kvmppc_set_host_ipi()

2019-09-04 Thread Michael Ellerman
Hi Mike,

Thanks for the patch & great change log, just a few comments.

Michael Roth  writes:
> On a 2-socket Witherspoon system with 128 cores and 1TB of memory
   ^
   Power9 (not everyone knows what a Witherspoon is)
 
> running the following guest configs:
>
>   guest A:
> - 224GB of memory
> - 56 VCPUs (sockets=1,cores=28,threads=2), where:
>   VCPUs 0-1 are pinned to CPUs 0-3,
>   VCPUs 2-3 are pinned to CPUs 4-7,
>   ...
>   VCPUs 54-55 are pinned to CPUs 108-111
>
>   guest B:
> - 4GB of memory
> - 4 VCPUs (sockets=1,cores=4,threads=1)
>
> with the following workloads (with KSM and THP enabled in all):
>
>   guest A:
> stress --cpu 40 --io 20 --vm 20 --vm-bytes 512M
>
>   guest B:
> stress --cpu 4 --io 4 --vm 4 --vm-bytes 512M
>
>   host:
> stress --cpu 4 --io 4 --vm 2 --vm-bytes 256M
>
> the below soft-lockup traces were observed after an hour or so and
> persisted until the host was reset (this was found to be reliably
> reproducible for this configuration, for kernels 4.15, 4.18, 5.0,
> and 5.3-rc5):
>
>   [ 1253.183290] rcu: INFO: rcu_sched self-detected stall on CPU
>   [ 1253.183319] rcu: 124-: (5250 ticks this GP) 
> idle=10a/1/0x4002 softirq=5408/5408 fqs=1941
>   [ 1256.287426] watchdog: BUG: soft lockup - CPU#105 stuck for 23s! [CPU 
> 52/KVM:19709]
>   [ 1264.075773] watchdog: BUG: soft lockup - CPU#24 stuck for 23s! 
> [worker:19913]
>   [ 1264.079769] watchdog: BUG: soft lockup - CPU#31 stuck for 23s! 
> [worker:20331]
>   [ 1264.095770] watchdog: BUG: soft lockup - CPU#45 stuck for 23s! 
> [worker:20338]
>   [ 1264.131773] watchdog: BUG: soft lockup - CPU#64 stuck for 23s! 
> [avocado:19525]
>   [ 1280.408480] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! 
> [ksmd:791]
>   [ 1316.198012] rcu: INFO: rcu_sched self-detected stall on CPU
>   [ 1316.198032] rcu: 124-: (21003 ticks this GP) 
> idle=10a/1/0x4002 softirq=5408/5408 fqs=8243
>   [ 1340.411024] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! 
> [ksmd:791]
>   [ 1379.212609] rcu: INFO: rcu_sched self-detected stall on CPU
>   [ 1379.212629] rcu: 124-: (36756 ticks this GP) 
> idle=10a/1/0x4002 softirq=5408/5408 fqs=14714
>   [ 1404.413615] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! 
> [ksmd:791]
>   [ 1442.227095] rcu: INFO: rcu_sched self-detected stall on CPU
>   [ 1442.227115] rcu: 124-: (52509 ticks this GP) 
> idle=10a/1/0x4002 softirq=5408/5408 fqs=21403
>   [ 1455.111787] INFO: task worker:19907 blocked for more than 120 seconds.
>   [ 1455.111822]   Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
>   [ 1455.111833] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
>   [ 1455.111884] INFO: task worker:19908 blocked for more than 120 seconds.
>   [ 1455.111905]   Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
>   [ 1455.111925] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
>   [ 1455.111966] INFO: task worker:20328 blocked for more than 120 seconds.
>   [ 1455.111986]   Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
>   [ 1455.111998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
>   [ 1455.112048] INFO: task worker:20330 blocked for more than 120 seconds.
>   [ 1455.112068]   Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
>   [ 1455.112097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
>   [ 1455.112138] INFO: task worker:20332 blocked for more than 120 seconds.
>   [ 1455.112159]   Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
>   [ 1455.112179] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
>   [ 1455.112210] INFO: task worker:20333 blocked for more than 120 seconds.
>   [ 1455.112231]   Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
>   [ 1455.112242] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
>   [ 1455.112282] INFO: task worker:20335 blocked for more than 120 seconds.
>   [ 1455.112303]   Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
>   [ 1455.112332] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
>   [ 1455.112372] INFO: task worker:20336 blocked for more than 120 seconds.
>   [ 1455.112392]   Tainted: G L5.3.0-rc5-mdr-vanilla+ #1

There should be stack traces here, did they get lost or you omitted them
for brevity?

> CPU 45/0x2d, 24/0x18, 124/0x7c are stuck on spin locks, likely held by
> CPUs 105/31

That last one "105/31" is confusing because it looks like you're giving
the decimal/hex values again, but you're not.

I know xmon uses hex CPU numbers, but you don't actually refer to them
much in this change log, so it's probably clearer just to convert all
CPU numbers to decimal for the sake of the change log.

> CPU 

Re: [PATCH v12 10/12] namei: aggressively check for nd->root escape on ".." resolution

2019-09-04 Thread Linus Torvalds
On Wed, Sep 4, 2019 at 4:29 PM Al Viro  wrote:
>
> On Wed, Sep 04, 2019 at 03:38:20PM -0700, Linus Torvalds wrote:
> > On Wed, Sep 4, 2019 at 3:31 PM David Howells  wrote:
> > >
> > > It ought to be reasonably easy to make them per-sb at least, I think.  We
> > > don't allow cross-super rename, right?
> >
> > Right now the sequence count handling very much depends on it being a
> > global entity on the reader side, at least.
> >
> > And while the rename sequence count could (and probably should) be
> > per-sb, the same is very much not true of the mount one.
>
> Huh?  That will cost us having to have a per-superblock dentry
> hash table; recall that lockless lockup can give false negatives
> if something gets moved from chain to chain, and rename_lock is
> first and foremost used to catch those and retry.  If we split
> it on per-superblock basis, we can't have dentries from different
> superblocks in the same chain anymore...

That's exactly the "very much depends on it being a global entity on
the reader side" thing.

I'm not convinced that's the _only_ way to handle things. Maybe a
combination of (wild handwaving) per-hashqueue sequence count and some
clever scheme for pathname handling could work.

I've not personally seen a load where the global rename lock has been
a problem (very few things really do a lot of renames), but
system-wide locks do make me nervous.

We have other (and worse) ones. tasklist_lock comes to mind.

 Linus


Re: [PATCH v12 10/12] namei: aggressively check for nd->root escape on ".." resolution

2019-09-04 Thread Al Viro
On Wed, Sep 04, 2019 at 03:38:20PM -0700, Linus Torvalds wrote:
> On Wed, Sep 4, 2019 at 3:31 PM David Howells  wrote:
> >
> > It ought to be reasonably easy to make them per-sb at least, I think.  We
> > don't allow cross-super rename, right?
> 
> Right now the sequence count handling very much depends on it being a
> global entity on the reader side, at least.
> 
> And while the rename sequence count could (and probably should) be
> per-sb, the same is very much not true of the mount one.

Huh?  That will cost us having to have a per-superblock dentry
hash table; recall that lockless lockup can give false negatives
if something gets moved from chain to chain, and rename_lock is
first and foremost used to catch those and retry.  If we split
it on per-superblock basis, we can't have dentries from different
superblocks in the same chain anymore...


Re: missing doorbell interrupt when onlining cpu

2019-09-04 Thread Nathan Lynch
Nathan Lynch  writes:

> I'm hoping for some help investigating a behavior I see when doing cpu
> hotplug under load on P9 and P8 LPARs. Occasionally, while coming online
> a cpu will seem to get "stuck" in idle, with a pending doorbell
> interrupt unserviced (cpu 12 here):
>
> cpuhp/12-70[012] 46133.602202: cpuhp_enter:  cpu: 0012 target: 
> 205 step: 174 (0xc0028920s)
>  load.sh-8201  [014] 46133.602248: sched_waking: comm=cpuhp/12 pid=70 
> prio=120 target_cpu=012
>  load.sh-8201  [014] 46133.602251: smp_send_reschedule:  (c0052868) 
> cpu=12
>   -0 [012] 46133.602252: do_idle:  (c0162e08)
>  load.sh-8201  [014] 46133.602252: smp_muxed_ipi_message_pass: 
> (c00527e8) cpu=12 msg=1
>  load.sh-8201  [014] 46133.602253: doorbell_core_ipi:(c004d3e8) 
> cpu=12
>   -0 [012] 46133.602257: arch_cpu_idle:(c0022d08)
>   -0 [012] 46133.602259: pseries_lpar_idle:(c00d43c8)

I should be more explicit that given my tracing configuration I would
expect to see doorbell events etc here e.g.

 -0 [012] 46133.602086: doorbell_entry:   
pt_regs=0xc00200e7fb50
 -0 [012] 46133.602087: smp_ipi_demux_relaxed: 
(c00530f8)
 -0 [012] 46133.602088: scheduler_ipi:
(c015e4f8)
 -0 [012] 46133.602091: sched_wakeup: cpuhp/12:70 
[120] success=1 CPU:012
 -0 [012] 46133.602092: sched_wakeup: migration/12:71 
[0] success=1 CPU:012
 -0 [012] 46133.602093: doorbell_exit:
pt_regs=0xc00200e7fb50

but instead cpu 12 goes to idle.


Re: [PATCH] powerpc: Avoid clang warnings around setjmp and longjmp

2019-09-04 Thread Nathan Chancellor
On Wed, Sep 04, 2019 at 08:01:35AM -0500, Segher Boessenkool wrote:
> On Wed, Sep 04, 2019 at 08:16:45AM +, David Laight wrote:
> > From: Nathan Chancellor [mailto:natechancel...@gmail.com]
> > > Fair enough so I guess we are back to just outright disabling the
> > > warning.
> > 
> > Just disabling the warning won't stop the compiler generating code
> > that breaks a 'user' implementation of setjmp().
> 
> Yeah.  I have a patch (will send in an hour or so) that enables the
> "returns_twice" attribute for setjmp (in ).  In testing
> (with GCC trunk) it showed no difference in code generation, but
> better save than sorry.
> 
> It also sets "noreturn" on longjmp, and that *does* help, it saves a
> hundred insns or so (all in xmon, no surprise there).
> 
> I don't think this will make LLVM shut up about this though.  And
> technically it is right: the C standard does say that in hosted mode
> setjmp is a reserved name and you need to include  to access
> it (not ).

It does not fix the warning, I tested your patch.

> So why is the kernel compiled as hosted?  Does adding -ffreestanding
> hurt anything?  Is that actually supported on LLVM, on all relevant
> versions of it?  Does it shut up the warning there (if not, that would
> be an LLVM bug)?

It does fix this warning because -ffreestanding implies -fno-builtin,
which also solves the warning. LLVM has supported -ffreestanding since
at least 3.0.0. There are some parts of the kernel that are compiled
with this and it probably should be used in more places but it sounds
like there might be some good codegen improvements that are disabled
with it:

https://lore.kernel.org/lkml/CAHk-=wi-epJZfBHDbKKDZ64us7WkF=lpufhvybmzsteo8q0...@mail.gmail.com/

Cheers,
Nathan


Re: [PATCH 1/1] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-04 Thread Dave Hansen
On 9/3/19 1:01 AM, Anshuman Khandual wrote:
> This adds a test module which will validate architecture page table helpers
> and accessors regarding compliance with generic MM semantics expectations.
> This will help various architectures in validating changes to the existing
> page table helpers or addition of new ones.

This looks really cool.  The "only" complication on x86 is the large
number of compile and runtime options that we have.  When this gets
merged, it would be really nice to make sure that the 0day guys have
good coverage of all the configurations.

I'm not _quite_ sure what kind of bugs it will catch on x86 and I
suspect it'll have more value for the other architectures, but it seems
harmless enough.


missing doorbell interrupt when onlining cpu

2019-09-04 Thread Nathan Lynch
I'm hoping for some help investigating a behavior I see when doing cpu
hotplug under load on P9 and P8 LPARs. Occasionally, while coming online
a cpu will seem to get "stuck" in idle, with a pending doorbell
interrupt unserviced (cpu 12 here):

cpuhp/12-70[012] 46133.602202: cpuhp_enter:  cpu: 0012 target: 205 
step: 174 (0xc0028920s)
 load.sh-8201  [014] 46133.602248: sched_waking: comm=cpuhp/12 pid=70 
prio=120 target_cpu=012
 load.sh-8201  [014] 46133.602251: smp_send_reschedule:  (c0052868) 
cpu=12
  -0 [012] 46133.602252: do_idle:  (c0162e08)
 load.sh-8201  [014] 46133.602252: smp_muxed_ipi_message_pass: 
(c00527e8) cpu=12 msg=1
 load.sh-8201  [014] 46133.602253: doorbell_core_ipi:(c004d3e8) 
cpu=12
  -0 [012] 46133.602257: arch_cpu_idle:(c0022d08)
  -0 [012] 46133.602259: pseries_lpar_idle:(c00d43c8)

This leaves the task initiating the online blocked in a state like this:

[<0>] __switch_to+0x2dc/0x430
[<0>] __cpuhp_kick_ap+0x78/0xa0
[<0>] cpuhp_kick_ap+0x60/0xf0
[<0>] cpuhp_invoke_callback+0xf4/0x780
[<0>] _cpu_up+0x138/0x260
[<0>] do_cpu_up+0x130/0x160
[<0>] cpu_subsys_online+0x68/0xe0
[<0>] device_online+0xb4/0x120
[<0>] online_store+0xb4/0xc0
[<0>] dev_attr_store+0x3c/0x60
[<0>] sysfs_kf_write+0x70/0xb0
[<0>] kernfs_fop_write+0x17c/0x250
[<0>] __vfs_write+0x40/0x80
[<0>] vfs_write+0xd4/0x250
[<0>] ksys_write+0x74/0x130
[<0>] system_call+0x5c/0x70

This trace is from a 5.2.10 kernel, and I've observed the problem on a
4.12 vendor kernel as well.

The issue always occurs before the cpu has completed all the cpuhp
callbacks that need to run on that cpu. Often it occurs before it even
runs a task (rcu_sched, migration, or cpuhp kthreads are the first to
run). But sometimes it will have run a task or two, as in this case.

It seems specific to doorbell i.e. intra-core IPIs; I have not observed
IPIs between cores getting dropped.

sysrq-l gets the newly onlined cpu unstuck.

The cpu can get in this state even after servicing doorbells earlier in
the online process.

This is using the default cede offline state, not stop-self (which I
haven't tried).

Ideas?


Re: [PATCH v12 10/12] namei: aggressively check for nd->root escape on ".." resolution

2019-09-04 Thread Linus Torvalds
On Wed, Sep 4, 2019 at 3:31 PM David Howells  wrote:
>
> It ought to be reasonably easy to make them per-sb at least, I think.  We
> don't allow cross-super rename, right?

Right now the sequence count handling very much depends on it being a
global entity on the reader side, at least.

And while the rename sequence count could (and probably should) be
per-sb, the same is very much not true of the mount one.

So the rename seqcount is likely easier to fix than the mount one, but
neither of them are entirely trivial, afaik.

   Linus


Re: [PATCH v12 10/12] namei: aggressively check for nd->root escape on ".." resolution

2019-09-04 Thread David Howells
Linus Torvalds  wrote:

> > Hinting to userspace to do a retry (with -EAGAIN as you mention in your
> > other mail) wouldn't be a bad thing at all, though you'd almost
> > certainly get quite a few spurious -EAGAINs -- &{mount,rename}_lock are
> > global for the entire machine, after all.
> 
> I'd hope that we have some future (possibly very long-term)
> alternative that is not quite system-global, but yes, right now they
> are.

It ought to be reasonably easy to make them per-sb at least, I think.  We
don't allow cross-super rename, right?

David


[PATCH] KVM: PPC: Book3S HV: add smp_mb() in kvmppc_set_host_ipi()

2019-09-04 Thread Michael Roth
On a 2-socket Witherspoon system with 128 cores and 1TB of memory
running the following guest configs:

  guest A:
- 224GB of memory
- 56 VCPUs (sockets=1,cores=28,threads=2), where:
  VCPUs 0-1 are pinned to CPUs 0-3,
  VCPUs 2-3 are pinned to CPUs 4-7,
  ...
  VCPUs 54-55 are pinned to CPUs 108-111

  guest B:
- 4GB of memory
- 4 VCPUs (sockets=1,cores=4,threads=1)

with the following workloads (with KSM and THP enabled in all):

  guest A:
stress --cpu 40 --io 20 --vm 20 --vm-bytes 512M

  guest B:
stress --cpu 4 --io 4 --vm 4 --vm-bytes 512M

  host:
stress --cpu 4 --io 4 --vm 2 --vm-bytes 256M

the below soft-lockup traces were observed after an hour or so and
persisted until the host was reset (this was found to be reliably
reproducible for this configuration, for kernels 4.15, 4.18, 5.0,
and 5.3-rc5):

  [ 1253.183290] rcu: INFO: rcu_sched self-detected stall on CPU
  [ 1253.183319] rcu: 124-: (5250 ticks this GP) 
idle=10a/1/0x4002 softirq=5408/5408 fqs=1941
  [ 1256.287426] watchdog: BUG: soft lockup - CPU#105 stuck for 23s! [CPU 
52/KVM:19709]
  [ 1264.075773] watchdog: BUG: soft lockup - CPU#24 stuck for 23s! 
[worker:19913]
  [ 1264.079769] watchdog: BUG: soft lockup - CPU#31 stuck for 23s! 
[worker:20331]
  [ 1264.095770] watchdog: BUG: soft lockup - CPU#45 stuck for 23s! 
[worker:20338]
  [ 1264.131773] watchdog: BUG: soft lockup - CPU#64 stuck for 23s! 
[avocado:19525]
  [ 1280.408480] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
  [ 1316.198012] rcu: INFO: rcu_sched self-detected stall on CPU
  [ 1316.198032] rcu: 124-: (21003 ticks this GP) 
idle=10a/1/0x4002 softirq=5408/5408 fqs=8243
  [ 1340.411024] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
  [ 1379.212609] rcu: INFO: rcu_sched self-detected stall on CPU
  [ 1379.212629] rcu: 124-: (36756 ticks this GP) 
idle=10a/1/0x4002 softirq=5408/5408 fqs=14714
  [ 1404.413615] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
  [ 1442.227095] rcu: INFO: rcu_sched self-detected stall on CPU
  [ 1442.227115] rcu: 124-: (52509 ticks this GP) 
idle=10a/1/0x4002 softirq=5408/5408 fqs=21403
  [ 1455.111787] INFO: task worker:19907 blocked for more than 120 seconds.
  [ 1455.111822]   Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
  [ 1455.111833] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
this message.
  [ 1455.111884] INFO: task worker:19908 blocked for more than 120 seconds.
  [ 1455.111905]   Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
  [ 1455.111925] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
this message.
  [ 1455.111966] INFO: task worker:20328 blocked for more than 120 seconds.
  [ 1455.111986]   Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
  [ 1455.111998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
this message.
  [ 1455.112048] INFO: task worker:20330 blocked for more than 120 seconds.
  [ 1455.112068]   Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
  [ 1455.112097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
this message.
  [ 1455.112138] INFO: task worker:20332 blocked for more than 120 seconds.
  [ 1455.112159]   Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
  [ 1455.112179] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
this message.
  [ 1455.112210] INFO: task worker:20333 blocked for more than 120 seconds.
  [ 1455.112231]   Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
  [ 1455.112242] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
this message.
  [ 1455.112282] INFO: task worker:20335 blocked for more than 120 seconds.
  [ 1455.112303]   Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
  [ 1455.112332] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
this message.
  [ 1455.112372] INFO: task worker:20336 blocked for more than 120 seconds.
  [ 1455.112392]   Tainted: G L5.3.0-rc5-mdr-vanilla+ #1

CPU 45/0x2d, 24/0x18, 124/0x7c are stuck on spin locks, likely held by
CPUs 105/31

CPU 105/0x69, and 31/0x1f are stuck in smp_call_function_many(),
waiting on target CPU 42. For instance:

  69:mon> r
  R00 = c020b20c   R16 = 7d1bcd80
  R01 = c0363eaa7970   R17 = 0001
  R02 = c19b3a00   R18 = 006b
  R03 = 002a   R19 = 7d537d7aecf0
  R04 = 002a   R20 = 60e0
  R05 = 002a   R21 = 08010080
  R06 = c0002073fb0caa08   R22 = 0d60
  R07 = c19ddd78   R23 = 0001
  R08 = 002a   R24 = c147a700
  R09 = 0001   R25 = c0002073fb0ca908
  R10 = c08ffeb4e660   R26 = 
  R11 = c0002073fb0ca900   R27 = c19e2464
  R12 = c0050790   R28 = c00812b0
  R13 = 

Re: [PATCH v12 10/12] namei: aggressively check for nd->root escape on ".." resolution

2019-09-04 Thread Linus Torvalds
On Wed, Sep 4, 2019 at 2:49 PM Aleksa Sarai  wrote:
>
> Hinting to userspace to do a retry (with -EAGAIN as you mention in your
> other mail) wouldn't be a bad thing at all, though you'd almost
> certainly get quite a few spurious -EAGAINs -- &{mount,rename}_lock are
> global for the entire machine, after all.

I'd hope that we have some future (possibly very long-term)
alternative that is not quite system-global, but yes, right now they
are.

Which is one reason I'd rather see EAGAIN in user space - yes, it
probably makes it even easier to trigger, but it also means that user
space might be able to do something about it when it does trigger.

For example, maybe user space can first just use an untrusted path
as-is, and if it gets EAGAIN or EXDEV, it may be that user space can
simplify the path (i.e. turn "xyz/.../abc" into just "abc").

And even if user space doesn't do anything like that, I suspect a
performance problem is going to be a whole lot easier to debug and
report when somebody ends up seeing excessive retries happening. As a
developer you'll see it in profiles or in system call traces, rather
than it resulting in very odd possible slowdowns for the kernel.

And yeah, it would probably be best to then at least delay doing
option 3 indefinitely, just to make sure user space knows about and
actually has a test-case for that EAGAIN happening.

  Linus


Re: [PATCH v12 10/12] namei: aggressively check for nd->root escape on ".." resolution

2019-09-04 Thread Aleksa Sarai
On 2019-09-04, Linus Torvalds  wrote:
> On Wed, Sep 4, 2019 at 1:23 PM Aleksa Sarai  wrote:
> > This patch allows for LOOKUP_BENEATH and LOOKUP_IN_ROOT to safely permit
> > ".." resolution (in the case of LOOKUP_BENEATH the resolution will still
> > fail if ".." resolution would resolve a path outside of the root --
> > while LOOKUP_IN_ROOT will chroot(2)-style scope it). Magic-link jumps
> > are still disallowed entirely because now they could result in
> > inconsistent behaviour if resolution encounters a subsequent ".."[*].
> 
> This is the only patch in the series that makes me go "umm".
> 
> Why is it ok to re-initialize m_seq, which is used by other things
> too? I think it's because we're out of RCU lookup, but there's no
> comment about it, and it looks iffy to me. I'd rather have a separate
> sequence count that doesn't have two users with different lifetime
> rules.

Yeah, the reasoning was that it's because we're out of RCU lookup and if
we didn't re-grab ->m_seq we'd hit path_is_under() on every subsequent
".." (even though we've checked that it's safe). But yes, I should've
used a different field to avoid confusion (and stop it looking
unnecessarily dodgy). I will fix that.

> But even apart from that, I think from a "patch continuity" standpoint
> it would be better to introduce the sequence counts as just an error
> condition first - iow, not have the "path_is_under()" check, but just
> return -EXDEV if the sequence number doesn't match.

Ack, will do.

> So you'd have three stages:
> 
>  1) ".." always returns -EXDEV
> 
>  2) ".." returns -EXDEV if there was a concurrent rename/mount
> 
>  3) ".." returns -EXDEV if there was a concurrent rename/mount and we
> reset the sequence numbers and check if you escaped.
> 
> because the sequence number reset really does make me go "hmm", plus I
> get this nagging little feeling in the back of my head that you can
> cause nasty O(n^2) lookup cost behavior with deep paths, lots of "..",
> and repeated path_is_under() calls.

The reason for doing the concurrent-{rename,mount} checks was to try to
avoid the O(n^2) in most cases, but you're right that if you have an
attacker that is spamming renames (or you're on a box with a lot of
renames and/or mounts going on *anywhere*) you will hit an O(n^2) here
(more pedantically, O(m*n) but who's counting?).

Unfortunately, I'm not sure what the best solution would be for this
one. If -EAGAIN retries are on the table, we could limit how many times
we're willing to do path_is_under() and then just return -EAGAIN.

> So (1) sounds safe. (2) sounds simple. And (3) is where I think subtle
> things start happening.
> 
> Also, I'm not 100% convinced that (3) is needed at all. I think the
> retry could be done in user space instead, which needs to have a
> fallback anyway. Yes? No?

Hinting to userspace to do a retry (with -EAGAIN as you mention in your
other mail) wouldn't be a bad thing at all, though you'd almost
certainly get quite a few spurious -EAGAINs -- &{mount,rename}_lock are
global for the entire machine, after all.

But if the only significant roadblock is that (3) seems a bit too hairy,
I would be quite happy with landing (2) as a first step (with -EAGAIN).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH



signature.asc
Description: PGP signature


Re: [PATCH v12 10/12] namei: aggressively check for nd->root escape on ".." resolution

2019-09-04 Thread Linus Torvalds
On Wed, Sep 4, 2019 at 2:35 PM Linus Torvalds
 wrote:
>
> On Wed, Sep 4, 2019 at 2:09 PM Linus Torvalds
>  wrote:
> >
> > So you'd have three stages:
> >
> >  1) ".." always returns -EXDEV
> >
> >  2) ".." returns -EXDEV if there was a concurrent rename/mount
> >
> >  3) ".." returns -EXDEV if there was a concurrent rename/mount and we
> > reset the sequence numbers and check if you escaped.
>
> In fact, I wonder if this should return -EAGAIN instead - to say that
> "retrying may work".

And here "this" was meant to be "case 2" - I was moving the quoted
text around and didn't fix my wording, so now it is ambiguous or
implies #3, which would be crazy.

Sorry for the confusion,

Linus


Re: [PATCH v12 10/12] namei: aggressively check for nd->root escape on ".." resolution

2019-09-04 Thread Linus Torvalds
On Wed, Sep 4, 2019 at 2:09 PM Linus Torvalds
 wrote:
>
> So you'd have three stages:
>
>  1) ".." always returns -EXDEV
>
>  2) ".." returns -EXDEV if there was a concurrent rename/mount
>
>  3) ".." returns -EXDEV if there was a concurrent rename/mount and we
> reset the sequence numbers and check if you escaped.

In fact, I wonder if this should return -EAGAIN instead - to say that
"retrying may work".

Because then:

> Also, I'm not 100% convinced that (3) is needed at all. I think the
> retry could be done in user space instead, which needs to have a
> fallback anyway. Yes? No?

Any user mode fallback would want to know whether it's a final error
or whether simply re-trying might make it work again.

I think that re-try case is valid for any of the possible "races
happened, we can't guarantee that it's safe", and retrying inside the
kernel (or doing that re-validation) could have latency issues.

Maybe ".." is the only such case. I can't think of any other ones in
your series, but at least conceptually they could happen. For example,
we've had people who wanted pathname lookup without any IO happening,
because if you have to wait for IO you could want to use another
thread etc if you're doing some server in user space..

 Linus


Re: [PATCH v12 10/12] namei: aggressively check for nd->root escape on ".." resolution

2019-09-04 Thread Linus Torvalds
On Wed, Sep 4, 2019 at 1:23 PM Aleksa Sarai  wrote:
>
> This patch allows for LOOKUP_BENEATH and LOOKUP_IN_ROOT to safely permit
> ".." resolution (in the case of LOOKUP_BENEATH the resolution will still
> fail if ".." resolution would resolve a path outside of the root --
> while LOOKUP_IN_ROOT will chroot(2)-style scope it). Magic-link jumps
> are still disallowed entirely because now they could result in
> inconsistent behaviour if resolution encounters a subsequent ".."[*].

This is the only patch in the series that makes me go "umm".

Why is it ok to re-initialize m_seq, which is used by other things
too? I think it's because we're out of RCU lookup, but there's no
comment about it, and it looks iffy to me. I'd rather have a separate
sequence count that doesn't have two users with different lifetime
rules.

But even apart from that, I think from a "patch continuity" standpoint
it would be better to introduce the sequence counts as just an error
condition first - iow, not have the "path_is_under()" check, but just
return -EXDEV if the sequence number doesn't match.

So you'd have three stages:

 1) ".." always returns -EXDEV

 2) ".." returns -EXDEV if there was a concurrent rename/mount

 3) ".." returns -EXDEV if there was a concurrent rename/mount and we
reset the sequence numbers and check if you escaped.

because the sequence number reset really does make me go "hmm", plus I
get this nagging little feeling in the back of my head that you can
cause nasty O(n^2) lookup cost behavior with deep paths, lots of "..",
and repeated path_is_under() calls.

So (1) sounds safe. (2) sounds simple. And (3) is where I think subtle
things start happening.

Also, I'm not 100% convinced that (3) is needed at all. I think the
retry could be done in user space instead, which needs to have a
fallback anyway. Yes? No?

 Linus


Re: [PATCH v5 16/31] powernv/fadump: process the crashdump by exporting it as /proc/vmcore

2019-09-04 Thread Hari Bathini



On 04/09/19 5:12 PM, Michael Ellerman wrote:
> Hari Bathini  writes:
>> diff --git a/arch/powerpc/platforms/powernv/opal-fadump.c 
>> b/arch/powerpc/platforms/powernv/opal-fadump.c
>> index a755705..10f6086 100644
>> --- a/arch/powerpc/platforms/powernv/opal-fadump.c
>> +++ b/arch/powerpc/platforms/powernv/opal-fadump.c
>> @@ -41,6 +43,37 @@ static void opal_fadump_update_config(struct fw_dump 
>> *fadump_conf,
>>  fadump_conf->fadumphdr_addr = fdm->fadumphdr_addr;
>>  }
>>  
>> +/*
>> + * This function is called in the capture kernel to get configuration 
>> details
>> + * from metadata setup by the first kernel.
>> + */
>> +static void opal_fadump_get_config(struct fw_dump *fadump_conf,
>> +   const struct opal_fadump_mem_struct *fdm)
>> +{
>> +int i;
>> +
>> +if (!fadump_conf->dump_active)
>> +return;
>> +
>> +fadump_conf->boot_memory_size = 0;
>> +
>> +pr_debug("Boot memory regions:\n");
>> +for (i = 0; i < fdm->region_cnt; i++) {
>> +pr_debug("\t%d. base: 0x%llx, size: 0x%llx\n",
>> + (i + 1), fdm->rgn[i].src, fdm->rgn[i].size);
> 
> Printing the zero-based array off by one (i + 1) seems confusing.

Hmmm... Indexing the regions from `0` sounded inappropriate..

> 
>> +
>> +fadump_conf->boot_memory_size += fdm->rgn[i].size;
>> +}
>> +
>> +/*
>> + * Start address of reserve dump area (permanent reservation) for
>> + * re-registering FADump after dump capture.
>> + */
>> +fadump_conf->reserve_dump_area_start = fdm->rgn[0].dest;
>> +
>> +opal_fadump_update_config(fadump_conf, fdm);
>> +}
>> +
>>  /* Initialize kernel metadata */
>>  static void opal_fadump_init_metadata(struct opal_fadump_mem_struct *fdm)
>>  {
>> @@ -215,24 +248,114 @@ static void opal_fadump_cleanup(struct fw_dump 
>> *fadump_conf)
>>  pr_warn("Could not reset (%llu) kernel metadata tag!\n", ret);
>>  }
>>  
>> +/*
>> + * Convert CPU state data saved at the time of crash into ELF notes.
>> + */
>> +static int __init opal_fadump_build_cpu_notes(struct fw_dump *fadump_conf)
>> +{
>> +u32 num_cpus, *note_buf;
>> +struct fadump_crash_info_header *fdh = NULL;
>> +
>> +num_cpus = 1;
>> +/* Allocate buffer to hold cpu crash notes. */
>> +fadump_conf->cpu_notes_buf_size = num_cpus * sizeof(note_buf_t);
>> +fadump_conf->cpu_notes_buf_size =
>> +PAGE_ALIGN(fadump_conf->cpu_notes_buf_size);
>> +note_buf = fadump_cpu_notes_buf_alloc(fadump_conf->cpu_notes_buf_size);
>> +if (!note_buf) {
>> +pr_err("Failed to allocate 0x%lx bytes for cpu notes buffer\n",
>> +   fadump_conf->cpu_notes_buf_size);
>> +return -ENOMEM;
>> +}
>> +fadump_conf->cpu_notes_buf = __pa(note_buf);
>> +
>> +pr_debug("Allocated buffer for cpu notes of size %ld at %p\n",
>> + (num_cpus * sizeof(note_buf_t)), note_buf);
>> +
>> +if (fadump_conf->fadumphdr_addr)
>> +fdh = __va(fadump_conf->fadumphdr_addr);
>> +
>> +if (fdh && (fdh->crashing_cpu != FADUMP_CPU_UNKNOWN)) {
>> +note_buf = fadump_regs_to_elf_notes(note_buf, &(fdh->regs));
>> +final_note(note_buf);
>> +
>> +pr_debug("Updating elfcore header (%llx) with cpu notes\n",
>> + fdh->elfcorehdr_addr);
>> +fadump_update_elfcore_header(fadump_conf,
>> + __va(fdh->elfcorehdr_addr));
>> +}
>> +
>> +return 0;
>> +}
>> +
>>  static int __init opal_fadump_process(struct fw_dump *fadump_conf)
>>  {
>> -return -EINVAL;
>> +struct fadump_crash_info_header *fdh;
>> +int rc = 0;
> 
> No need to initialise rc there.
> 

rc = -EINVAL;

and


>> +if (!opal_fdm_active || !fadump_conf->fadumphdr_addr)
>> +return -EINVAL;

>> +
>> +/* Validate the fadump crash info header */
>> +fdh = __va(fadump_conf->fadumphdr_addr);
>> +if (fdh->magic_number != FADUMP_CRASH_INFO_MAGIC) {
>> +pr_err("Crash info header is not valid.\n");
>> +return -EINVAL;

return rc; ??

>> +}
>> +
>> +/*
>> + * TODO: To build cpu notes, find a way to map PIR to logical id.
>> + *   Also, we may need different method for pseries and powernv.
>> + *   The currently booted kernel could have a different PIR to
>> + *   logical id mapping. So, try saving info of previous kernel's
>> + *   paca to get the right PIR to logical id mapping.
>> + */
> 
> That TODO is removed by the end of the series, so please just omit it 
> entirely.
> 
>> +rc = opal_fadump_build_cpu_notes(fadump_conf);
>> +if (rc)
>> +return rc;
> 
> I think this all runs early in boot, so we don't need to worry about
> another CPU seeing the partially initialised core due to there being no
> barrier here before we set elfcorehdr_addr?
> 

This is processed in fs/proc/vmcore.c during 

Re: [PATCH v12 11/12] open: openat2(2) syscall

2019-09-04 Thread Randy Dunlap
Hi,
just noisy nits here:

On 9/4/19 1:19 PM, Aleksa Sarai wrote:

> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> index 1d338357df8a..479baf2da10e 100644
> --- a/include/uapi/linux/fcntl.h
> +++ b/include/uapi/linux/fcntl.h
> @@ -93,5 +93,47 @@
>  
>  #define AT_RECURSIVE 0x8000  /* Apply to the entire subtree */
>  
> +/**

/** means "the following is kernel-doc", but it's not, so please either make
it kernel-doc format or just use /* to begin the comment.

> + * Arguments for how openat2(2) should open the target path. If @resolve is
> + * zero, then openat2(2) operates identically to openat(2).
> + *
> + * However, unlike openat(2), unknown bits in @flags result in -EINVAL rather
> + * than being silently ignored. In addition, @mode (or @upgrade_mask) must be
> + * zero unless one of {O_CREAT, O_TMPFILE, O_PATH} are set.
> + *
> + * @flags: O_* flags.
> + * @mode: O_CREAT/O_TMPFILE file mode.
> + * @upgrade_mask: UPGRADE_* flags (to restrict O_PATH re-opening).
> + * @resolve: RESOLVE_* flags.
> + */
> +struct open_how {
> + __u32 flags;
> + union {
> + __u16 mode;
> + __u16 upgrade_mask;
> + };
> + __u16 resolve;
> +};


-- 
~Randy


Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers

2019-09-04 Thread Randy Dunlap
Hi,
just kernel-doc fixes:

On 9/4/19 1:19 PM, Aleksa Sarai wrote:
> 
> diff --git a/lib/struct_user.c b/lib/struct_user.c
> new file mode 100644
> index ..7301ab1bbe98
> --- /dev/null
> +++ b/lib/struct_user.c
> @@ -0,0 +1,182 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Copyright (C) 2019 SUSE LLC
> + * Copyright (C) 2019 Aleksa Sarai 
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define BUFFER_SIZE 64
> +

> +
> +/**
> + * copy_struct_to_user: copy a struct to user space

use correct format:

* copy_struct_to_user - copy a struct to user space

> + * @dst:   Destination address, in user space.
> + * @usize: Size of @dst struct.
> + * @src:   Source address, in kernel space.
> + * @ksize: Size of @src struct.
> + *
> + * Copies a struct from kernel space to user space, in a way that guarantees
> + * backwards-compatibility for struct syscall arguments (as long as future
> + * struct extensions are made such that all new fields are *appended* to the
> + * old struct, and zeroed-out new fields have the same meaning as the old
> + * struct).
> + *
> + * @ksize is just sizeof(*dst), and @usize should've been passed by user 
> space.
> + * The recommended usage is something like the following:
> + *
> + *   SYSCALL_DEFINE2(foobar, struct foo __user *, uarg, size_t, usize)
> + *   {
> + *  int err;
> + *  struct foo karg = {};
> + *
> + *  // do something with karg
> + *
> + *  err = copy_struct_to_user(uarg, usize, , sizeof(karg));
> + *  if (err)
> + *return err;
> + *
> + *  // ...
> + *   }
> + *
> + * There are three cases to consider:
> + *  * If @usize == @ksize, then it's copied verbatim.
> + *  * If @usize < @ksize, then kernel space is "returning" a newer struct to 
> an
> + *older user space. In order to avoid user space getting incomplete
> + *information (new fields might be important), all trailing bytes in @src
> + *(@ksize - @usize) must be zeroed, otherwise -EFBIG is returned.
> + *  * If @usize > @ksize, then the kernel is "returning" an older struct to a
> + *newer user space. The trailing bytes in @dst (@usize - @ksize) will be
> + *zero-filled.
> + *
> + * Returns (in all cases, some data may have been copied):
> + *  * -EFBIG:  (@usize < @ksize) and there are non-zero trailing bytes in 
> @src.
> + *  * -EFAULT: access to user space failed.
> + */
> +int copy_struct_to_user(void __user *dst, size_t usize,
> + const void *src, size_t ksize)
> +{
> + size_t size = min(ksize, usize);
> + size_t rest = abs(ksize - usize);
> +
> + if (unlikely(usize > PAGE_SIZE))
> + return -EFAULT;
> + if (unlikely(!access_ok(dst, usize)))
> + return -EFAULT;
> +
> + /* Deal with trailing bytes. */
> + if (usize < ksize) {
> + if (memchr_inv(src + size, 0, rest))
> + return -EFBIG;
> + } else if (usize > ksize) {
> + if (__memzero_user(dst + size, rest))
> + return -EFAULT;
> + }
> + /* Copy the interoperable parts of the struct. */
> + if (__copy_to_user(dst, src, size))
> + return -EFAULT;
> + return 0;
> +}
> +EXPORT_SYMBOL(copy_struct_to_user);
> +
> +/**

same here:

> + * copy_struct_from_user: copy a struct from user space

* copy_struct_from_user - copy a struct from user space

> + * @dst:   Destination address, in kernel space. This buffer must be @ksize
> + * bytes long.
> + * @ksize: Size of @dst struct.
> + * @src:   Source address, in user space.
> + * @usize: (Alleged) size of @src struct.
> + *
> + * Copies a struct from user space to kernel space, in a way that guarantees
> + * backwards-compatibility for struct syscall arguments (as long as future
> + * struct extensions are made such that all new fields are *appended* to the
> + * old struct, and zeroed-out new fields have the same meaning as the old
> + * struct).
> + *
> + * @ksize is just sizeof(*dst), and @usize should've been passed by user 
> space.
> + * The recommended usage is something like the following:
> + *
> + *   SYSCALL_DEFINE2(foobar, const struct foo __user *, uarg, size_t, usize)
> + *   {
> + *  int err;
> + *  struct foo karg = {};
> + *
> + *  err = copy_struct_from_user(, sizeof(karg), uarg, size);
> + *  if (err)
> + *return err;
> + *
> + *  // ...
> + *   }
> + *
> + * There are three cases to consider:
> + *  * If @usize == @ksize, then it's copied verbatim.
> + *  * If @usize < @ksize, then the user space has passed an old struct to a
> + *newer kernel. The rest of the trailing bytes in @dst (@ksize - @usize)
> + *are to be zero-filled.
> + *  * If @usize > @ksize, then the user space has passed a new struct to an
> + *older kernel. The trailing bytes unknown to the kernel (@usize - 
> @ksize)
> + *are checked to ensure they are zeroed, otherwise -E2BIG is returned.
> + *

Re: [PATCH v12 01/12] lib: introduce copy_struct_{to, from}_user helpers

2019-09-04 Thread Linus Torvalds
On Wed, Sep 4, 2019 at 1:20 PM Aleksa Sarai  wrote:
>
> A common pattern for syscall extensions is increasing the size of a
> struct passed from userspace, such that the zero-value of the new fields
> result in the old kernel behaviour (allowing for a mix of userspace and
> kernel vintages to operate on one another in most cases).

Ack, this makes the whole series (and a few unrelated system calls) cleaner.

   Linus


[PATCH v12 12/12] selftests: add openat2(2) selftests

2019-09-04 Thread Aleksa Sarai
Test all of the various openat2(2) flags, as well as how file
descriptor re-opening works. A small stress-test of a symlink-rename
attack is included to show that the protections against ".."-based
attacks are sufficient.

In addition, the memfd selftest is fixed to no longer depend on the
now-disallowed functionality of upgrading an O_RDONLY descriptor to
O_RDWR.

Signed-off-by: Aleksa Sarai 
---
 tools/testing/selftests/Makefile  |   1 +
 tools/testing/selftests/memfd/memfd_test.c|   7 +-
 tools/testing/selftests/openat2/.gitignore|   1 +
 tools/testing/selftests/openat2/Makefile  |   8 +
 tools/testing/selftests/openat2/helpers.c | 167 
 tools/testing/selftests/openat2/helpers.h | 118 +
 .../testing/selftests/openat2/linkmode_test.c | 333 +++
 .../testing/selftests/openat2/openat2_test.c  | 106 +
 .../selftests/openat2/rename_attack_test.c| 127 ++
 .../testing/selftests/openat2/resolve_test.c  | 402 ++
 10 files changed, 1268 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/openat2/.gitignore
 create mode 100644 tools/testing/selftests/openat2/Makefile
 create mode 100644 tools/testing/selftests/openat2/helpers.c
 create mode 100644 tools/testing/selftests/openat2/helpers.h
 create mode 100644 tools/testing/selftests/openat2/linkmode_test.c
 create mode 100644 tools/testing/selftests/openat2/openat2_test.c
 create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c
 create mode 100644 tools/testing/selftests/openat2/resolve_test.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 25b43a8c2b15..13c02e0d0efc 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -37,6 +37,7 @@ TARGETS += powerpc
 TARGETS += proc
 TARGETS += pstore
 TARGETS += ptrace
+TARGETS += openat2
 TARGETS += rseq
 TARGETS += rtc
 TARGETS += seccomp
diff --git a/tools/testing/selftests/memfd/memfd_test.c 
b/tools/testing/selftests/memfd/memfd_test.c
index c67d32eeb668..e71df3d3e55d 100644
--- a/tools/testing/selftests/memfd/memfd_test.c
+++ b/tools/testing/selftests/memfd/memfd_test.c
@@ -925,7 +925,7 @@ static void test_share_mmap(char *banner, char *b_suffix)
  */
 static void test_share_open(char *banner, char *b_suffix)
 {
-   int fd, fd2;
+   int procfd, fd, fd2;
 
printf("%s %s %s\n", memfd_str, banner, b_suffix);
 
@@ -950,13 +950,16 @@ static void test_share_open(char *banner, char *b_suffix)
mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK);
mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK);
 
+   /* We cannot do a MAY_WRITE re-open of an O_RDONLY fd. */
+   procfd = mfd_assert_open(fd2, O_PATH, 0);
close(fd2);
-   fd2 = mfd_assert_open(fd, O_RDWR, 0);
+   fd2 = mfd_assert_open(procfd, O_WRONLY, 0);
 
mfd_assert_add_seals(fd2, F_SEAL_SEAL);
mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
 
+   close(procfd);
close(fd2);
close(fd);
 }
diff --git a/tools/testing/selftests/openat2/.gitignore 
b/tools/testing/selftests/openat2/.gitignore
new file mode 100644
index ..bd68f6c3fd07
--- /dev/null
+++ b/tools/testing/selftests/openat2/.gitignore
@@ -0,0 +1 @@
+/*_test
diff --git a/tools/testing/selftests/openat2/Makefile 
b/tools/testing/selftests/openat2/Makefile
new file mode 100644
index ..0b8d42ec4052
--- /dev/null
+++ b/tools/testing/selftests/openat2/Makefile
@@ -0,0 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0
+
+CFLAGS += -Wall -O2 -g
+TEST_GEN_PROGS := linkmode_test openat2_test resolve_test rename_attack_test
+
+include ../lib.mk
+
+$(TEST_GEN_PROGS): helpers.c
diff --git a/tools/testing/selftests/openat2/helpers.c 
b/tools/testing/selftests/openat2/helpers.c
new file mode 100644
index ..def6f7720086
--- /dev/null
+++ b/tools/testing/selftests/openat2/helpers.c
@@ -0,0 +1,167 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai 
+ * Copyright (C) 2018-2019 SUSE LLC.
+ */
+
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "helpers.h"
+
+int raw_openat2(int dfd, const char *path, const void *how, size_t size)
+{
+   int ret = syscall(__NR_openat2, dfd, path, how, size);
+   return ret >= 0 ? ret : -errno;
+}
+
+int sys_openat2(int dfd, const char *path, const struct open_how *how)
+{
+   return raw_openat2(dfd, path, how, sizeof(*how));
+}
+
+int sys_openat(int dfd, const char *path, const struct open_how *how)
+{
+   int ret = openat(dfd, path, how->flags, how->mode);
+   return ret >= 0 ? ret : -errno;
+}
+
+int sys_renameat2(int olddirfd, const char *oldpath,
+ int newdirfd, const char *newpath, unsigned int flags)
+{
+   int ret = syscall(__NR_renameat2, olddirfd, oldpath,
+   

[PATCH v12 11/12] open: openat2(2) syscall

2019-09-04 Thread Aleksa Sarai
The most obvious syscall to add support for the new LOOKUP_* scoping
flags would be openat(2). However, there are a few reasons why this is
not the best course of action:

 * The new LOOKUP_* flags are intended to be security features, and
   openat(2) will silently ignore all unknown flags. This means that
   users would need to avoid foot-gunning themselves constantly when
   using this interface if it were part of openat(2). This can be fixed
   by having userspace libraries handle this for users[1], but should be
   avoided if possible.

 * Resolution scoping feels like a different operation to the existing
   O_* flags. And since openat(2) has limited flag space, it seems to be
   quite wasteful to clutter it with 5 flags that are all
   resolution-related. Arguably O_NOFOLLOW is also a resolution flag but
   its entire purpose is to error out if you encounter a trailing
   symlink -- not to scope resolution.

 * Other systems would be able to reimplement this syscall allowing for
   cross-OS standardisation rather than being hidden amongst O_* flags
   which may result in it not being used by all the parties that might
   want to use it (file servers, web servers, container runtimes, etc).

 * It gives us the opportunity to iterate on the O_PATH interface. In
   particular, the new @how->upgrade_mask field for fd re-opening is
   only possible because we have a clean slate without needing to re-use
   the ACC_MODE flag design nor the existing openat(2) @mode semantics.

To this end, we introduce the openat2(2) syscall. It provides all of the
features of openat(2) through the @how->flags argument, but it also
provides a new @how->resolve argument which exposes RESOLVE_* flags
that map to our new LOOKUP_* flags. It also eliminates the long-standing
ugliness of variadic-open(2) by embedding it in a struct.
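As a sketch of what a caller looks like under this extensible-struct convention (note: the struct layout and syscall number below follow what eventually landed in <linux/openat2.h>, which dropped @upgrade_mask -- they are assumptions with respect to this revision of the series):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_openat2
#define __NR_openat2 437	/* assumption: the merged syscall number */
#endif

/* Assumed layout (matches the merged <linux/openat2.h>; this revision
 * of the series also carried an upgrade_mask field). */
struct open_how_sketch {
	uint64_t flags;		/* O_* flags -- unknown bits are rejected */
	uint64_t mode;		/* mode for O_{CREAT,TMPFILE}, else 0 */
	uint64_t resolve;	/* RESOLVE_* scoping flags */
};

static int openat2_sketch(int dfd, const char *path, uint64_t flags,
			  uint64_t resolve)
{
	struct open_how_sketch how;

	memset(&how, 0, sizeof(how));
	how.flags = flags;
	how.resolve = resolve;
	long ret = syscall(__NR_openat2, dfd, path, &how, sizeof(how));
	return ret >= 0 ? (int)ret : -errno;
}
```

Unlike openat(2), passing an undefined flag bit here fails with -EINVAL instead of being silently ignored, which is the foot-gun the first bullet point above is about.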

In order to allow for userspace to lock down their usage of file
descriptor re-opening, openat2(2) has the ability for users to disallow
certain re-opening modes through @how->upgrade_mask. At the moment,
there is no UPGRADE_NOEXEC.

[1]: https://github.com/openSUSE/libpathrs

Suggested-by: Christian Brauner 
Signed-off-by: Aleksa Sarai 
---
 arch/alpha/kernel/syscalls/syscall.tbl  |  1 +
 arch/arm/tools/syscall.tbl  |  1 +
 arch/arm64/include/asm/unistd.h |  2 +-
 arch/arm64/include/asm/unistd32.h   |  2 +
 arch/ia64/kernel/syscalls/syscall.tbl   |  1 +
 arch/m68k/kernel/syscalls/syscall.tbl   |  1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |  1 +
 arch/parisc/kernel/syscalls/syscall.tbl |  1 +
 arch/powerpc/kernel/syscalls/syscall.tbl|  1 +
 arch/s390/kernel/syscalls/syscall.tbl   |  1 +
 arch/sh/kernel/syscalls/syscall.tbl |  1 +
 arch/sparc/kernel/syscalls/syscall.tbl  |  1 +
 arch/x86/entry/syscalls/syscall_32.tbl  |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl  |  1 +
 arch/xtensa/kernel/syscalls/syscall.tbl |  1 +
 fs/open.c   | 94 -
 include/linux/fcntl.h   | 19 -
 include/linux/fs.h  |  4 +-
 include/linux/syscalls.h| 14 ++-
 include/uapi/asm-generic/unistd.h   |  5 +-
 include/uapi/linux/fcntl.h  | 42 +
 24 files changed, 168 insertions(+), 30 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl 
b/arch/alpha/kernel/syscalls/syscall.tbl
index 728fe028c02c..9f374f7d9514 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -475,3 +475,4 @@
 543common  fspick  sys_fspick
 544common  pidfd_open  sys_pidfd_open
 # 545 reserved for clone3
+547common  openat2 sys_openat2
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 6da7dc4d79cc..4ba54bc7e19a 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -449,3 +449,4 @@
 433common  fspick  sys_fspick
 434common  pidfd_open  sys_pidfd_open
 435common  clone3  sys_clone3
+437common  openat2 sys_openat2
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 2629a68b8724..8aa00ccb0b96 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
 #define __ARM_NR_compat_set_tls(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls   436
+#define __NR_compat_syscalls   438
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h 
b/arch/arm64/include/asm/unistd32.h
index 

[PATCH v12 10/12] namei: aggressively check for nd->root escape on ".." resolution

2019-09-04 Thread Aleksa Sarai
This patch allows for LOOKUP_BENEATH and LOOKUP_IN_ROOT to safely permit
".." resolution (in the case of LOOKUP_BENEATH the resolution will still
fail if ".." resolution would resolve a path outside of the root --
while LOOKUP_IN_ROOT will chroot(2)-style scope it). Magic-link jumps
are still disallowed entirely because now they could result in
inconsistent behaviour if resolution encounters a subsequent ".."[*].

The need for this patch is explained by observing there is a fairly
easy-to-exploit race condition with chroot(2) (and thus by extension
LOOKUP_IN_ROOT and LOOKUP_BENEATH if ".." is allowed) where a rename(2)
of a path can be used to "skip over" nd->root and thus escape to the
filesystem above nd->root.

  thread1 [attacker]:
for (;;)
  renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
  thread2 [victim]:
for (;;)
  openat2(dirb, "b/c/../../etc/shadow",
  { .flags = O_PATH, .resolve = RESOLVE_IN_ROOT } );

With fairly significant regularity, thread2 will resolve to
"/etc/shadow" rather than "/a/b/etc/shadow". There is also a similar
(though somewhat more privileged) attack using MS_MOVE.

With this patch, such cases will be detected *during* ".." resolution
(which is the weak point of chroot(2) -- since walking *into* a
subdirectory tautologically cannot result in you walking *outside*
nd->root -- except through a bind-mount or magic-link). By detecting
this at ".." resolution (rather than checking only at the end of the
entire resolution) we can both correct escapes by jumping back to the
root (in the case of LOOKUP_IN_ROOT), as well as avoid revealing to
attackers the structure of the filesystem outside of the root (through
timing attacks for instance).

In order to avoid a quadratic lookup with each ".." entry, we only
activate the slow path if a write through &rename_lock or &mount_lock
has occurred during path resolution (&rename_lock and &mount_lock are
re-taken to further optimise the lookup). Since the primary attack being
protected against is MS_MOVE or rename(2), not doing additional checks
unless a mount or rename have occurred avoids making the common case
slow.
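The retry logic can be modelled in userspace as a toy sequence-counter check (names here are illustrative stand-ins, not the kernel API): readers sample the counter when the walk starts and only take the expensive re-check path if a writer bumped it in the meantime.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for the rename_lock/mount_lock sequence counters. */
static _Atomic unsigned resolve_seq;

/* Sample the counter at path_init() time. */
static unsigned walk_begin(void)
{
	return atomic_load(&resolve_seq);
}

/* A writer (rename(2), MS_MOVE) bumps the counter. */
static void rename_or_mount(void)
{
	atomic_fetch_add(&resolve_seq, 1);
}

/* On "..", only do the expensive path_is_under() re-check if a write
 * happened since the walk started. */
static bool dotdot_needs_recheck(unsigned start)
{
	return atomic_load(&resolve_seq) != start;
}
```

This is why the common case (no concurrent rename or mount) stays on the fast path.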

The use of path_is_under() here might seem suspect, but on further
inspection of the most important race (a path was *inside* the root but
is now *outside*), there appears to be no attack potential:

  * If path_is_under() occurs before the rename, then the path will be
resolved -- however the path was originally inside the root and thus
there is no escape (and to userspace it'd look like the rename
occurred after the path was resolved). If path_is_under() occurs
afterwards, the resolution is blocked.

  * Subsequent ".." jumps are guaranteed to check path_is_under() -- by
    construction, &rename_lock or &mount_lock must have been taken by
the attacker after path_is_under() returned in the victim. Thus ".."
will not be able to escape from the previously-inside-root path.

  * Walking down in the moved path is still safe since the entire
subtree was moved (either by rename(2) or MS_MOVE) and because (as
discussed above) walking down is safe.

A variant of the above attack is included in the selftests for
openat2(2) later in this patch series. I've run this test on several
machines for several days and no instances of a breakout were detected.
While this is not concrete proof that this is safe, when combined with
the above argument it should lend some trustworthiness to this
construction.

[*] It may be acceptable in the future to do a path_is_under() check
after resolving a magic-link and permit resolution if the
nd_jump_link() result is still within the dirfd. However this seems
    unlikely to be a feature that people *really* need -- it can be
added later if it turns out a lot of people want it.

Cc: Al Viro 
Cc: Jann Horn 
Cc: Kees Cook 
Signed-off-by: Aleksa Sarai 
---
 fs/namei.c | 45 +++--
 1 file changed, 31 insertions(+), 14 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 0352d275bd13..fd1eb5ce8baa 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -491,7 +491,7 @@ struct nameidata {
 	struct path	root;
 	struct inode	*inode; /* path.dentry.d_inode */
 	unsigned int	flags;
-	unsigned	seq, m_seq;
+	unsigned	seq, m_seq, r_seq;
 	int		last_type;
 	unsigned	depth;
 	int		total_link_count;
@@ -1758,22 +1758,36 @@ static inline int handle_dots(struct nameidata *nd, int 
type)
if (type == LAST_DOTDOT) {
int error = 0;
 
-   /*
-* LOOKUP_BENEATH resolving ".." is not currently safe -- races
-* can cause our parent to have moved outside of the root and
-* us to skip over it.
-*/
-   if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT)))
-   return -EXDEV;
if (!nd->root.mnt) {

[PATCH v12 09/12] namei: LOOKUP_IN_ROOT: chroot-like path resolution

2019-09-04 Thread Aleksa Sarai
The primary motivation for the need for this flag is container runtimes
which have to interact with malicious root filesystems in the host
namespaces. One of the first requirements for a container runtime to be
secure against a malicious rootfs is that they correctly scope symlinks
(that is, they should be scoped as though they are chroot(2)ed into the
container's rootfs) and ".."-style paths[*]. The already-existing
LOOKUP_NO_XDEV and LOOKUP_NO_MAGICLINKS help defend against other
potential attacks in a malicious rootfs scenario.

Currently most container runtimes try to do this resolution in
userspace[1], causing many potential race conditions. In addition, the
"obvious" alternative (actually performing a {ch,pivot_}root(2))
requires a fork+exec (for some runtimes) which is *very* costly if
necessary for every filesystem operation involving a container.

[*] At the moment, ".." and magic-link jumping are disallowed for the
same reason it is disabled for LOOKUP_BENEATH -- currently it is not
safe to allow it. Future patches may enable it unconditionally once
we have resolved the possible races (for "..") and semantics (for
magic-link jumping).

The most significant *at(2) semantic change with LOOKUP_IN_ROOT is that
absolute pathnames no longer cause the dirfd to be ignored completely.

The rationale is that LOOKUP_IN_ROOT must necessarily chroot-scope
symlinks with absolute paths to dirfd, and so doing it for the base path
seems to be the most consistent behaviour (and also avoids foot-gunning
users who want to scope paths that are absolute).

[1]: https://github.com/cyphar/filepath-securejoin
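In userspace terms, a runtime's in-container lookups could become a single call along these lines (a sketch: the RESOLVE_IN_ROOT value and three-field struct layout follow the merged <linux/openat2.h> uapi and are assumptions with respect to this revision):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_openat2
#define __NR_openat2 437		/* assumed (merged) syscall number */
#endif
#define RESOLVE_IN_ROOT_SKETCH 0x10	/* assumed (merged uapi) value */

struct open_how_in_root {		/* assumed merged layout */
	uint64_t flags, mode, resolve;
};

/* Open @path as if chroot(2)ed into @rootfd: absolute paths and
 * symlink targets are scoped to @rootfd, not the caller's real root. */
static int open_in_root(int rootfd, const char *path)
{
	struct open_how_in_root how = {
		.flags = O_RDONLY,
		.resolve = RESOLVE_IN_ROOT_SKETCH,
	};
	long ret = syscall(__NR_openat2, rootfd, path, &how, sizeof(how));
	return ret >= 0 ? (int)ret : -errno;
}
```

Note the semantic change described above: with RESOLVE_IN_ROOT an absolute path like "/" resolves to the dirfd itself rather than ignoring it.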

Signed-off-by: Aleksa Sarai 
---
 fs/namei.c| 41 +++--
 include/linux/namei.h |  1 +
 2 files changed, 32 insertions(+), 10 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 2e18ce5a313e..0352d275bd13 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -676,7 +676,7 @@ static int unlazy_walk(struct nameidata *nd)
goto out1;
if (!nd->root.mnt) {
/* Restart from path_init() if nd->root was cleared. */
-   if (nd->flags & LOOKUP_BENEATH)
+   if (nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT))
goto out;
} else if (!(nd->flags & LOOKUP_ROOT)) {
if (unlikely(!legitimize_path(nd, &nd->root, nd->root_seq)))
@@ -809,10 +809,18 @@ static int complete_walk(struct nameidata *nd)
return status;
 }
 
-static void set_root(struct nameidata *nd)
+static int set_root(struct nameidata *nd)
 {
struct fs_struct *fs = current->fs;
 
+   /*
+* Jumping to the real root as part of LOOKUP_IN_ROOT is a BUG in namei,
+* but we still have to ensure it doesn't happen because it will cause a
+* breakout from the dirfd.
+*/
+   if (WARN_ON(nd->flags & LOOKUP_IN_ROOT))
+   return -ENOTRECOVERABLE;
+
if (nd->flags & LOOKUP_RCU) {
unsigned seq;
 
@@ -824,6 +832,7 @@ static void set_root(struct nameidata *nd)
} else {
get_fs_root(fs, &nd->root);
}
+   return 0;
 }
 
 static void path_put_conditional(struct path *path, struct nameidata *nd)
@@ -854,6 +863,11 @@ static int nd_jump_root(struct nameidata *nd)
if (nd->path.mnt != NULL && nd->path.mnt != nd->root.mnt)
return -EXDEV;
}
+   if (!nd->root.mnt) {
+   int error = set_root(nd);
+   if (error)
+   return error;
+   }
if (nd->flags & LOOKUP_RCU) {
struct dentry *d;
nd->path = nd->root;
@@ -1100,15 +1114,13 @@ const char *get_link(struct nameidata *nd)
if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
return ERR_PTR(-ELOOP);
/* Not currently safe. */
-   if (unlikely(nd->flags & LOOKUP_BENEATH))
if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT)))
return ERR_PTR(-EXDEV);
}
if (IS_ERR_OR_NULL(res))
return res;
}
if (*res == '/') {
-   if (!nd->root.mnt)
-   set_root(nd);
error = nd_jump_root(nd);
if (unlikely(error))
return ERR_PTR(error);
@@ -1744,15 +1756,20 @@ static inline int may_lookup(struct nameidata *nd)
 static inline int handle_dots(struct nameidata *nd, int type)
 {
if (type == LAST_DOTDOT) {
+   int error = 0;
+
/*
 * LOOKUP_BENEATH resolving ".." is not currently safe -- races
 * can cause our parent to have moved outside of the root and
 * us to skip over it.
 */
-   if (unlikely(nd->flags & LOOKUP_BENEATH))
+   if 

[PATCH v12 08/12] namei: O_BENEATH-style path resolution flags

2019-09-04 Thread Aleksa Sarai
Add the following flags to allow various restrictions on path resolution
(these affect the *entire* resolution, rather than just the final path
component -- as is the case with LOOKUP_FOLLOW).

The primary justification for these flags is to allow for programs to be
far more strict about how they want path resolution to handle symlinks,
mountpoint crossings, and paths that escape the dirfd (through an
absolute path or ".." shenanigans).

This is of particular concern to container runtimes that want to be very
careful about malicious root filesystems that a container's init might
have screwed around with (and there is no real way to protect against
this in userspace if you consider potential races against a malicious
container's init). More classical applications (which have their own
potentially buggy userspace path sanitisation code) include web servers,
archive extraction tools, network file servers, and so on.

These flags are exposed to userspace through openat2(2) in a later
patchset.

* LOOKUP_NO_XDEV: Disallow mount-point crossing (both *down* into one,
  or *up* from one). Both bind-mounts and cross-filesystem mounts are
  blocked by this flag. The naming is based on "find -xdev" as well as
  -EXDEV (though find(1) doesn't walk upwards, the semantics seem
  obvious).

* LOOKUP_NO_MAGICLINKS: Disallows ->get_link "symlink" (or rather,
  magic-link) jumping. This is a very specific restriction, and it
  exists because /proc/$pid/fd/... "symlinks" allow for access outside
  nd->root and pose risk to container runtimes that don't want to be
  tricked into accessing a host path (but do want to allow
  no-funny-business symlink resolution).

* LOOKUP_NO_SYMLINKS: Disallows resolution through symlinks of any kind
  (including magic-links).

* LOOKUP_BENEATH: Disallow "escapes" from the starting point of the
  filesystem tree during resolution (you must stay "beneath" the
  starting point at all times). Currently this is done by disallowing
  ".." and absolute paths (either in the given path or found during
  symlink resolution) entirely, as well as all magic-link jumping.

  The wholesale banning of ".." is because it is currently not safe to
  allow ".." resolution (races can cause the path to be moved outside of
  the root -- this is conceptually similar to historical chroot(2)
  escape attacks). Future patches in this series will address this, and
  will re-enable ".." resolution once it is safe. With those patches,
  ".." resolution will only be allowed if it remains in the root
  throughout resolution (such as "a/../b" not "a/../../outside/b").

  The banning of magic-link jumping is done because it is not clear
  whether semantically they should be allowed -- while some magic-links
  are safe there are many that can cause escapes (and once a
  resolution is outside of the root, O_BENEATH will no longer detect
  it). Future patches may re-enable magic-link jumping when such jumps
  would remain inside the root.

The LOOKUP_NO_*LINK flags return -ELOOP if path resolution would
violate their requirement, while the others all return -EXDEV.
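The -ELOOP case is easy to demonstrate from userspace through the openat2(2) interface added later in this series (the RESOLVE_NO_SYMLINKS value and minimal struct layout below follow the merged uapi, so they are assumptions with respect to this revision):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_openat2
#define __NR_openat2 437		/* assumed (merged) syscall number */
#endif
#define RESOLVE_NO_SYMLINKS_SKETCH 0x04	/* assumed (merged uapi) value */

struct open_how_min {			/* assumed merged layout */
	uint64_t flags, mode, resolve;
};

/* Open @path relative to @dfd with the given RESOLVE_* restrictions,
 * returning an fd or -errno. */
static int resolve_open(int dfd, const char *path, uint64_t resolve)
{
	struct open_how_min how = { .flags = O_RDONLY, .resolve = resolve };
	long ret = syscall(__NR_openat2, dfd, path, &how, sizeof(how));
	return ret >= 0 ? (int)ret : -errno;
}
```

With RESOLVE_NO_SYMLINKS set, any symlink encountered during resolution (trailing or not) fails the whole lookup with -ELOOP; the same path opens fine without the flag.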

This is a refresh of Al's AT_NO_JUMPS patchset[1] (which was a variation
on David Drysdale's O_BENEATH patchset[2], which in turn was based on
the Capsicum project[3]). Input from Linus and Andy in the AT_NO_JUMPS
thread[4] determined most of the API changes made in this refresh.

[1]: https://lwn.net/Articles/721443/
[2]: https://lwn.net/Articles/619151/
[3]: https://lwn.net/Articles/603929/
[4]: https://lwn.net/Articles/723057/

Cc: Christian Brauner 
Suggested-by: David Drysdale 
Suggested-by: Al Viro 
Suggested-by: Andy Lutomirski 
Suggested-by: Linus Torvalds 
Signed-off-by: Aleksa Sarai 
---
 fs/namei.c| 85 ---
 include/linux/namei.h |  7 
 2 files changed, 78 insertions(+), 14 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index e39b573fcc4d..2e18ce5a313e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -674,7 +674,11 @@ static int unlazy_walk(struct nameidata *nd)
goto out2;
if (unlikely(!legitimize_path(nd, >path, nd->seq)))
goto out1;
-   if (nd->root.mnt && !(nd->flags & LOOKUP_ROOT)) {
+   if (!nd->root.mnt) {
+   /* Restart from path_init() if nd->root was cleared. */
+   if (nd->flags & LOOKUP_BENEATH)
+   goto out;
+   } else if (!(nd->flags & LOOKUP_ROOT)) {
if (unlikely(!legitimize_path(nd, &nd->root, nd->root_seq)))
goto out;
}
@@ -843,6 +847,13 @@ static inline void path_to_nameidata(const struct path 
*path,
 
 static int nd_jump_root(struct nameidata *nd)
 {
+   if (unlikely(nd->flags & LOOKUP_BENEATH))
+   return -EXDEV;
+   if (unlikely(nd->flags & LOOKUP_NO_XDEV)) {
+   /* Absolute path arguments to path_init() are allowed. */
+   if (nd->path.mnt != NULL && nd->path.mnt != nd->root.mnt)
+  

[PATCH v12 07/12] open: O_EMPTYPATH: procfs-less file descriptor re-opening

2019-09-04 Thread Aleksa Sarai
Userspace has made use of /proc/self/fd very liberally to allow for
descriptors to be re-opened. There are a wide variety of uses for this
feature, but it has always required constructing a pathname and could
not be done without procfs mounted. The obvious solution for this is to
extend openat(2) to have an AT_EMPTY_PATH-equivalent -- O_EMPTYPATH.

Now that descriptor re-opening has been made safe through the new
magic-link resolution restrictions, we can replicate these restrictions
for O_EMPTYPATH. In particular, we only allow "upgrading" the file
descriptor if the corresponding FMODE_PATH_* bit is set (or the
FMODE_{READ,WRITE} cases for non-O_PATH file descriptors).

When doing openat(O_EMPTYPATH|O_PATH), O_PATH takes precedence and
O_EMPTYPATH is ignored. Very few users ever have a need to O_PATH
re-open an existing file descriptor, and so accommodating them at the
expense of further complicating O_PATH makes little sense. Ultimately,
if users ask for this we can always add RESOLVE_EMPTY_PATH to
resolveat(2) in the future.

Signed-off-by: Aleksa Sarai 
---
 arch/alpha/include/uapi/asm/fcntl.h  |  1 +
 arch/parisc/include/uapi/asm/fcntl.h | 39 ++--
 arch/sparc/include/uapi/asm/fcntl.h  |  1 +
 fs/fcntl.c   |  2 +-
 fs/namei.c   | 20 ++
 fs/open.c|  7 -
 include/linux/fcntl.h|  2 +-
 include/uapi/asm-generic/fcntl.h |  4 +++
 8 files changed, 54 insertions(+), 22 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/fcntl.h 
b/arch/alpha/include/uapi/asm/fcntl.h
index 50bdc8e8a271..1f879bade68b 100644
--- a/arch/alpha/include/uapi/asm/fcntl.h
+++ b/arch/alpha/include/uapi/asm/fcntl.h
@@ -34,6 +34,7 @@
 
 #define O_PATH 04000
 #define __O_TMPFILE01
+#define O_EMPTYPATH02
 
 #define F_GETLK7
 #define F_SETLK8
diff --git a/arch/parisc/include/uapi/asm/fcntl.h 
b/arch/parisc/include/uapi/asm/fcntl.h
index 03ce20e5ad7d..5d709058a76f 100644
--- a/arch/parisc/include/uapi/asm/fcntl.h
+++ b/arch/parisc/include/uapi/asm/fcntl.h
@@ -2,26 +2,27 @@
 #ifndef _PARISC_FCNTL_H
 #define _PARISC_FCNTL_H
 
-#define O_APPEND   00010
-#define O_BLKSEEK  00100 /* HPUX only */
-#define O_CREAT00400 /* not fcntl */
-#define O_EXCL 02000 /* not fcntl */
-#define O_LARGEFILE04000
-#define __O_SYNC   00010
+#define O_APPEND   10
+#define O_BLKSEEK  000100 /* HPUX only */
+#define O_CREAT000400 /* not fcntl */
+#define O_EXCL 002000 /* not fcntl */
+#define O_LARGEFILE004000
+#define __O_SYNC   10
 #define O_SYNC (__O_SYNC|O_DSYNC)
-#define O_NONBLOCK 00024 /* HPUX has separate NDELAY & NONBLOCK */
-#define O_NOCTTY   00040 /* not fcntl */
-#define O_DSYNC00100 /* HPUX only */
-#define O_RSYNC00200 /* HPUX only */
-#define O_NOATIME  00400
-#define O_CLOEXEC  01000 /* set close_on_exec */
-
-#define O_DIRECTORY1 /* must be a directory */
-#define O_NOFOLLOW 00200 /* don't follow links */
-#define O_INVISIBLE00400 /* invisible I/O, for DMAPI/XDSM */
-
-#define O_PATH 02000
-#define __O_TMPFILE04000
+#define O_NONBLOCK 24 /* HPUX has separate NDELAY & NONBLOCK */
+#define O_NOCTTY   40 /* not fcntl */
+#define O_DSYNC000100 /* HPUX only */
+#define O_RSYNC000200 /* HPUX only */
+#define O_NOATIME  000400
+#define O_CLOEXEC  001000 /* set close_on_exec */
+
+#define O_DIRECTORY01 /* must be a directory */
+#define O_NOFOLLOW 000200 /* don't follow links */
+#define O_INVISIBLE000400 /* invisible I/O, for DMAPI/XDSM */
+
+#define O_PATH 002000
+#define __O_TMPFILE004000
+#define O_EMPTYPATH01
 
 #define F_GETLK64  8
 #define F_SETLK64  9
diff --git a/arch/sparc/include/uapi/asm/fcntl.h 
b/arch/sparc/include/uapi/asm/fcntl.h
index 67dae75e5274..dc86c9eaf950 100644
--- a/arch/sparc/include/uapi/asm/fcntl.h
+++ b/arch/sparc/include/uapi/asm/fcntl.h
@@ -37,6 +37,7 @@
 
 #define O_PATH 0x100
 #define __O_TMPFILE0x200
+#define O_EMPTYPATH0x400
 
 #define F_GETOWN   5   /*  for sockets. */
 #define F_SETOWN   6   /*  for sockets. */
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 3d40771e8e7c..4cf05a2fd162 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1031,7 +1031,7 @@ static int __init fcntl_init(void)
 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 * is defined as O_NONBLOCK on some platforms and not on others.
 */
-   BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
+   BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ !=

[PATCH v12 06/12] procfs: switch magic-link modes to be more sane

2019-09-04 Thread Aleksa Sarai
Now that magic-link modes are obeyed for file re-opening purposes, some
of the pre-existing magic-link modes need to be adjusted to be more
semantically correct.

The most blatant example of this is /proc/self/exe, which had a mode of
a+rwx even though tautologically the file could never be opened for
writing (because it is the current->mm of a live process).

With the new O_PATH restrictions, changing the default mode of these
magic-links allows us to avoid delayed-access attacks such as we saw in
CVE-2019-5736.

Signed-off-by: Aleksa Sarai 
---
 fs/proc/base.c   | 20 ++--
 fs/proc/namespaces.c |  2 +-
 2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index ebea9501afb8..297242174402 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -133,9 +133,9 @@ struct pid_entry {
 
 #define DIR(NAME, MODE, iops, fops)	\
 	NOD(NAME, (S_IFDIR|(MODE)), &iops, &fops, {} )
-#define LNK(NAME, get_link)	\
-	NOD(NAME, (S_IFLNK|S_IRWXUGO),	\
-		&proc_pid_link_inode_operations, NULL,	\
+#define LNK(NAME, MODE, get_link)	\
+	NOD(NAME, (S_IFLNK|(MODE)),	\
+		&proc_pid_link_inode_operations, NULL,	\
 		{ .proc_get_link = get_link } )
 #define REG(NAME, MODE, fops)	\
 	NOD(NAME, (S_IFREG|(MODE)), NULL, &fops, {})
@@ -3028,9 +3028,9 @@ static const struct pid_entry tgid_base_stuff[] = {
REG("numa_maps",  S_IRUGO, proc_pid_numa_maps_operations),
 #endif
REG("mem",S_IRUSR|S_IWUSR, proc_mem_operations),
-   LNK("cwd",proc_cwd_link),
-   LNK("root",   proc_root_link),
-   LNK("exe",proc_exe_link),
+   LNK("cwd",S_IRWXUGO, proc_cwd_link),
+   LNK("root",   S_IRWXUGO, proc_root_link),
+   LNK("exe",S_IRUGO|S_IXUGO, proc_exe_link),
REG("mounts", S_IRUGO, proc_mounts_operations),
REG("mountinfo",  S_IRUGO, proc_mountinfo_operations),
REG("mountstats", S_IRUSR, proc_mountstats_operations),
@@ -3429,11 +3429,11 @@ static const struct pid_entry tid_base_stuff[] = {
REG("numa_maps", S_IRUGO, proc_pid_numa_maps_operations),
 #endif
REG("mem",   S_IRUSR|S_IWUSR, proc_mem_operations),
-   LNK("cwd",   proc_cwd_link),
-   LNK("root",  proc_root_link),
-   LNK("exe",   proc_exe_link),
+   LNK("cwd",   S_IRWXUGO, proc_cwd_link),
+   LNK("root",  S_IRWXUGO, proc_root_link),
+   LNK("exe",   S_IRUGO|S_IXUGO, proc_exe_link),
REG("mounts",S_IRUGO, proc_mounts_operations),
-   REG("mountinfo",  S_IRUGO, proc_mountinfo_operations),
+   REG("mountinfo", S_IRUGO, proc_mountinfo_operations),
 #ifdef CONFIG_PROC_PAGE_MONITOR
REG("clear_refs", S_IWUSR, proc_clear_refs_operations),
REG("smaps", S_IRUGO, proc_pid_smaps_operations),
diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index dd2b35f78b09..cd1e130913f7 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -94,7 +94,7 @@ static struct dentry *proc_ns_instantiate(struct dentry 
*dentry,
struct inode *inode;
struct proc_inode *ei;
 
-   inode = proc_pid_make_inode(dentry->d_sb, task, S_IFLNK | S_IRWXUGO);
+   inode = proc_pid_make_inode(dentry->d_sb, task, S_IFLNK | S_IRUGO);
if (!inode)
return ERR_PTR(-ENOENT);
 
-- 
2.23.0



[PATCH v12 05/12] namei: obey trailing magic-link DAC permissions

2019-09-04 Thread Aleksa Sarai
The ability for userspace to "re-open" file descriptors through
/proc/self/fd has been a very useful tool for all sorts of usecases
(container runtimes are one common example). However, the current
interface for doing this has resulted in some pretty subtle security
holes. Userspace can re-open a file descriptor with more permissions
than the original, which can result in cases such as /proc/$pid/exe
being re-opened O_RDWR at a later date even though (by definition)
/proc/$pid/exe cannot be opened for writing. When combined with O_PATH
the results can get even more confusing.

We cannot block this outright. Aside from userspace already depending on
it, it's a useful feature which can actually increase the security of
userspace. For instance, LXC keeps an O_PATH of the container's
/dev/pts/ptmx that gets re-opened to create new ptys and then uses
TIOCGPTPEER to get the slave end. This allows for pty allocation without
resolving paths inside an (untrusted) container's rootfs. There isn't a
trivial way of doing this that is as straight-forward and safe as O_PATH
re-opening.

Instead we have to restrict it in such a way that it doesn't break
(good) users but does block potential attackers. The solution applied in
this patch is to restrict *re-opening* (not resolution through)
magic-links by requiring that mode of the link be obeyed. Normal
symlinks have modes of a+rwx but magic-links have other modes. These
magic-link modes were historically ignored during path resolution, but
they've now been re-purposed for more useful ends.

It is also necessary to define semantics for the mode of an O_PATH
descriptor, since re-opening a magic-link through an O_PATH needs to be
just as restricted as the corresponding magic-link -- otherwise the
above protection can be bypassed. There are two distinct cases:

 1. The target is a regular file (not a magic-link). Userspace depends
on being able to re-open the O_PATH of a regular file, so we must
define the mode to be a+rwx.

 2. The target is a magic-link. In this case, we simply copy the mode of
the magic-link. This results in an O_PATH of a magic-link
effectively acting as a no-op in terms of how much re-opening
privileges a process has.

CAP_DAC_OVERRIDE can be used to override all of these restrictions, but
we only permit &init_user_ns's capabilities to affect these semantics.
The reason for this is that there isn't a clear way to track what
user_ns is the original owner of a given O_PATH chain -- thus an
unprivileged user could create a new userns and O_PATH the file
descriptor, owning it. All signs would indicate that the user really
does have CAP_DAC_OVERRIDE over the new descriptor and the protection
would be bypassed. We thus opt for the more conservative approach.

I have run this patch on several machines for several days. So far, the
only processes which have hit this case ("loadkeys" and "kbd_mode" from
the kbd package[1]) gracefully handle the permission error and do not
cause any user-visible problems. In order to give users a heads-up, a
warning is output to dmesg whenever may_open_magiclink() refuses access.

[1]: http://git.altlinux.org/people/legion/packages/kbd.git
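The /proc/self/exe case is easy to observe from userspace. On a kernel without this patch the write attempt is refused with ETXTBSY (the binary is currently being executed); with the a+rx magic-link mode from this series it would be refused earlier with EACCES. Either way it never yields a writable fd:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Try to open our own running binary, returning an fd or -errno. */
static int open_self_exe(int flags)
{
	int fd = open("/proc/self/exe", flags);
	return fd >= 0 ? fd : -errno;
}
```

The read-only open succeeds on both old and patched kernels, which matches the claim above that well-behaved users are unaffected.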

Suggested-by: Andy Lutomirski 
Suggested-by: Christian Brauner 
Signed-off-by: Aleksa Sarai 
---
 Documentation/filesystems/path-lookup.rst |  12 +--
 fs/internal.h |   1 +
 fs/namei.c| 105 +++---
 fs/open.c |   3 +-
 fs/proc/fd.c  |  23 -
 include/linux/fs.h|   4 +
 include/linux/namei.h |   1 +
 7 files changed, 130 insertions(+), 19 deletions(-)

diff --git a/Documentation/filesystems/path-lookup.rst 
b/Documentation/filesystems/path-lookup.rst
index 434a07b0002b..a57d78ec8bee 100644
--- a/Documentation/filesystems/path-lookup.rst
+++ b/Documentation/filesystems/path-lookup.rst
@@ -1310,12 +1310,14 @@ longer needed.
 ``LOOKUP_JUMPED`` means that the current dentry was chosen not because
 it had the right name but for some other reason.  This happens when
 following "``..``", following a symlink to ``/``, crossing a mount point
-or accessing a "``/proc/$PID/fd/$FD``" symlink.  In this case the
-filesystem has not been asked to revalidate the name (with
-``d_revalidate()``).  In such cases the inode may still need to be
-revalidated, so ``d_op->d_weak_revalidate()`` is called if
+or accessing a "``/proc/$PID/fd/$FD``" symlink (also known as a "magic
+link"). In this case the filesystem has not been asked to revalidate the
+name (with ``d_revalidate()``).  In such cases the inode may still need
+to be revalidated, so ``d_op->d_weak_revalidate()`` is called if
 ``LOOKUP_JUMPED`` is set when the look completes - which may be at the
-final component or, when creating, unlinking, or renaming, at the penultimate 
component.
+final component or, when creating, unlinking, or renaming, at the
+penultimate 

[PATCH v12 04/12] perf_event_open: switch to copy_struct_from_user()

2019-09-04 Thread Aleksa Sarai
The change is very straightforward, and takes advantage of the (very
minor) efficiency improvements in copy_struct_from_user() -- that the
memchr_inv() check is done on a buffer instead of one-at-a-time with
get_user().
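As a rough userspace model of what copy_struct_from_user() does with the extensible-struct convention (a sketch only -- the real helper works on __user pointers and uses memchr_inv()):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Copy a user-supplied struct of @usize bytes into a kernel struct of
 * @ksize bytes: zero-fill the tail on short copies, and reject larger
 * user structs whose unknown trailing bytes are not all zero. */
static int copy_struct_model(void *dst, size_t ksize,
			     const void *src, size_t usize)
{
	size_t copy = usize < ksize ? usize : ksize;
	const unsigned char *tail = (const unsigned char *)src + ksize;
	size_t i;

	for (i = 0; ksize + i < usize; i++) {
		if (tail[i] != 0)
			return -E2BIG;	/* newer userspace set a bit this
					 * kernel doesn't know about */
	}
	memset(dst, 0, ksize);
	memcpy(dst, src, copy);
	return 0;
}
```

The -E2BIG branch is what replaces the old get_user() loop deleted in the diff below, and the tail scan happens on one contiguous buffer.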

Signed-off-by: Aleksa Sarai 
---
 kernel/events/core.c | 45 
 1 file changed, 8 insertions(+), 37 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0463c1151bae..fe5f58443ba6 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -10498,55 +10498,26 @@ static int perf_copy_attr(struct perf_event_attr 
__user *uattr,
u32 size;
int ret;
 
-   if (!access_ok(uattr, PERF_ATTR_SIZE_VER0))
-   return -EFAULT;
-
-   /*
-* zero the full structure, so that a short copy will be nice.
-*/
+   /* Zero the full structure, so that a short copy will be nice. */
memset(attr, 0, sizeof(*attr));
 
ret = get_user(size, &uattr->size);
if (ret)
return ret;
 
-   if (size > PAGE_SIZE)   /* silly large */
-   goto err_size;
-
-   if (!size)  /* abi compat */
+   /* ABI compatibility quirk: */
+   if (!size)
size = PERF_ATTR_SIZE_VER0;
-
if (size < PERF_ATTR_SIZE_VER0)
goto err_size;
 
-   /*
-* If we're handed a bigger struct than we know of,
-* ensure all the unknown bits are 0 - i.e. new
-* user-space does not rely on any kernel feature
-* extensions we dont know about yet.
-*/
-   if (size > sizeof(*attr)) {
-   unsigned char __user *addr;
-   unsigned char __user *end;
-   unsigned char val;
-
-   addr = (void __user *)uattr + sizeof(*attr);
-   end  = (void __user *)uattr + size;
-
-   for (; addr < end; addr++) {
-   ret = get_user(val, addr);
-   if (ret)
-   return ret;
-   if (val)
-   goto err_size;
-   }
-   size = sizeof(*attr);
+   ret = copy_struct_from_user(attr, sizeof(*attr), uattr, size);
+   if (ret) {
+   if (ret == -E2BIG)
+   goto err_size;
+   return ret;
}
 
-   ret = copy_from_user(attr, uattr, size);
-   if (ret)
-   return -EFAULT;
-
attr->size = size;
 
if (attr->__reserved_1)
-- 
2.23.0



[PATCH v12 03/12] sched_setattr: switch to copy_struct_{to,from}_user()

2019-09-04 Thread Aleksa Sarai
The change is very straightforward, and takes advantage of the (very
minor) efficiency improvements in copy_struct_{to,from}_user() -- that
the memchr_inv() check is done on a buffer instead of one-at-a-time
with get_user() or put_user().

Signed-off-by: Aleksa Sarai 
---
 kernel/sched/core.c | 85 ++---
 1 file changed, 10 insertions(+), 75 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 010d578118d6..2f58b07d3468 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4900,9 +4900,6 @@ static int sched_copy_attr(struct sched_attr __user 
*uattr, struct sched_attr *a
u32 size;
int ret;
 
-   if (!access_ok(uattr, SCHED_ATTR_SIZE_VER0))
-   return -EFAULT;
-
/* Zero the full structure, so that a short copy will be nice: */
memset(attr, 0, sizeof(*attr));
 
@@ -4910,45 +4907,19 @@ static int sched_copy_attr(struct sched_attr __user 
*uattr, struct sched_attr *a
if (ret)
return ret;
 
-   /* Bail out on silly large: */
-   if (size > PAGE_SIZE)
-   goto err_size;
-
/* ABI compatibility quirk: */
if (!size)
size = SCHED_ATTR_SIZE_VER0;
-
if (size < SCHED_ATTR_SIZE_VER0)
goto err_size;
 
-   /*
-* If we're handed a bigger struct than we know of,
-* ensure all the unknown bits are 0 - i.e. new
-* user-space does not rely on any kernel feature
-* extensions we dont know about yet.
-*/
-   if (size > sizeof(*attr)) {
-   unsigned char __user *addr;
-   unsigned char __user *end;
-   unsigned char val;
-
-   addr = (void __user *)uattr + sizeof(*attr);
-   end  = (void __user *)uattr + size;
-
-   for (; addr < end; addr++) {
-   ret = get_user(val, addr);
-   if (ret)
-   return ret;
-   if (val)
-   goto err_size;
-   }
-   size = sizeof(*attr);
+   ret = copy_struct_from_user(attr, sizeof(*attr), uattr, size);
+   if (ret) {
+   if (ret == -E2BIG)
+   goto err_size;
+   return ret;
}
 
-   ret = copy_from_user(attr, uattr, size);
-   if (ret)
-   return -EFAULT;
-
if ((attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) &&
size < SCHED_ATTR_SIZE_VER1)
return -EINVAL;
@@ -5105,51 +5076,15 @@ SYSCALL_DEFINE2(sched_getparam, pid_t, pid, struct 
sched_param __user *, param)
return retval;
 }
 
-static int sched_read_attr(struct sched_attr __user *uattr,
-  struct sched_attr *attr,
-  unsigned int usize)
-{
-   int ret;
-
-   if (!access_ok(uattr, usize))
-   return -EFAULT;
-
-   /*
-* If we're handed a smaller struct than we know of,
-* ensure all the unknown bits are 0 - i.e. old
-* user-space does not get uncomplete information.
-*/
-   if (usize < sizeof(*attr)) {
-   unsigned char *addr;
-   unsigned char *end;
-
-   addr = (void *)attr + usize;
-   end  = (void *)attr + sizeof(*attr);
-
-   for (; addr < end; addr++) {
-   if (*addr)
-   return -EFBIG;
-   }
-
-   attr->size = usize;
-   }
-
-   ret = copy_to_user(uattr, attr, attr->size);
-   if (ret)
-   return -EFAULT;
-
-   return 0;
-}
-
 /**
  * sys_sched_getattr - similar to sched_getparam, but with sched_attr
  * @pid: the pid in question.
  * @uattr: structure containing the extended parameters.
- * @size: sizeof(attr) for fwd/bwd comp.
+ * @usize: sizeof(attr) for fwd/bwd comp.
  * @flags: for future extension.
  */
 SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
-   unsigned int, size, unsigned int, flags)
+   unsigned int, usize, unsigned int, flags)
 {
struct sched_attr attr = {
.size = sizeof(struct sched_attr),
@@ -5157,8 +5092,8 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct 
sched_attr __user *, uattr,
struct task_struct *p;
int retval;
 
-   if (!uattr || pid < 0 || size > PAGE_SIZE ||
-   size < SCHED_ATTR_SIZE_VER0 || flags)
+   if (!uattr || pid < 0 || usize > PAGE_SIZE ||
+   usize < SCHED_ATTR_SIZE_VER0 || flags)
return -EINVAL;
 
rcu_read_lock();
@@ -5188,7 +5123,7 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct 
sched_attr __user *, uattr,
 
rcu_read_unlock();
 
-   retval = sched_read_attr(uattr, &attr, size);
+   retval = copy_struct_to_user(uattr, usize, &attr, sizeof(attr));
return retval;
 
 out_unlock:
-- 
2.23.0



[PATCH v12 02/12] clone3: switch to copy_struct_from_user()

2019-09-04 Thread Aleksa Sarai
The change is very straightforward, and takes advantage of the (very
minor) efficiency improvements in copy_struct_from_user() -- that the
memchr_inv() check is done on a buffer instead of one-at-a-time with
get_user().

Additionally, explicitly define CLONE_ARGS_SIZE_VER0 to match the other
users of the struct-extension pattern.

Cc: Christian Brauner 
Signed-off-by: Aleksa Sarai 
---
 include/uapi/linux/sched.h |  2 ++
 kernel/fork.c  | 34 ++
 2 files changed, 8 insertions(+), 28 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index b3105ac1381a..0945805982b4 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -47,6 +47,8 @@ struct clone_args {
__aligned_u64 tls;
 };
 
+#define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
+
 /*
  * Scheduling policies
  */
diff --git a/kernel/fork.c b/kernel/fork.c
index 2852d0e76ea3..70c10d9b429a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2528,39 +2528,17 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, 
unsigned long, newsp,
 #ifdef __ARCH_WANT_SYS_CLONE3
 noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
  struct clone_args __user *uargs,
- size_t size)
+ size_t usize)
 {
+   int err;
struct clone_args args;
 
-   if (unlikely(size > PAGE_SIZE))
-   return -E2BIG;
-
-   if (unlikely(size < sizeof(struct clone_args)))
+   if (unlikely(usize < CLONE_ARGS_SIZE_VER0))
return -EINVAL;
 
-   if (unlikely(!access_ok(uargs, size)))
-   return -EFAULT;
-
-   if (size > sizeof(struct clone_args)) {
-   unsigned char __user *addr;
-   unsigned char __user *end;
-   unsigned char val;
-
-   addr = (void __user *)uargs + sizeof(struct clone_args);
-   end = (void __user *)uargs + size;
-
-   for (; addr < end; addr++) {
-   if (get_user(val, addr))
-   return -EFAULT;
-   if (val)
-   return -E2BIG;
-   }
-
-   size = sizeof(struct clone_args);
-   }
-
-   if (copy_from_user(&args, uargs, size))
-   return -EFAULT;
+   err = copy_struct_from_user(&args, sizeof(args), uargs, usize);
+   if (err)
+   return err;
 
*kargs = (struct kernel_clone_args){
.flags  = args.flags,
-- 
2.23.0



Re: [PATCH v5 15/31] powernv/fadump: support copying multiple kernel boot memory regions

2019-09-04 Thread Hari Bathini



On 04/09/19 5:00 PM, Michael Ellerman wrote:
> Hari Bathini  writes:
>> Firmware uses 32-bit field for region size while copying/backing-up
> 
> Which firmware exactly is imposing that limit?

I think the MDST/MDRT tables in the f/w. Vasant, which component is that?

>> +/*
>> + * Firmware currently supports only 32-bit value for size,
> 
> "currently" implies it could change in future?
> 
> If it does we assume it will only increase, and we're happy that old
> kernels will continue to use the 32-bit limit?

I am not aware of any plans to make it 64-bit. Let me just say f/w supports
only 32-bit to get rid of that ambiguity..

- Hari



[PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers

2019-09-04 Thread Aleksa Sarai
A common pattern for syscall extensions is increasing the size of a
struct passed from userspace, such that the zero-value of the new fields
results in the old kernel behaviour (allowing for a mix of userspace and
kernel vintages to operate on one another in most cases). This is done
in both directions -- hence two helpers -- though it's more common to
have to copy user space structs into kernel space.

Previously there was no common lib/ function that implemented
the necessary extension-checking semantics (and different syscalls
implemented them slightly differently or incompletely[1]). A future
patch replaces all of the common uses of this pattern to use the new
copy_struct_{to,from}_user() helpers.

[1]: For instance {sched_setattr,perf_event_open,clone3}(2) all do
 similar checks to copy_struct_from_user() while rt_sigprocmask(2)
 always rejects differently-sized struct arguments.

Suggested-by: Rasmus Villemoes 
Signed-off-by: Aleksa Sarai 
---
 include/linux/uaccess.h |   5 ++
 lib/Makefile|   2 +-
 lib/struct_user.c   | 182 
 3 files changed, 188 insertions(+), 1 deletion(-)
 create mode 100644 lib/struct_user.c

diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 34a038563d97..0ad9544a1aee 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -230,6 +230,11 @@ static inline unsigned long 
__copy_from_user_inatomic_nocache(void *to,
 
 #endif /* ARCH_HAS_NOCACHE_UACCESS */
 
+extern int copy_struct_to_user(void __user *dst, size_t usize,
+  const void *src, size_t ksize);
+extern int copy_struct_from_user(void *dst, size_t ksize,
+const void __user *src, size_t usize);
+
 /*
  * probe_kernel_read(): safely attempt to read from a location
  * @dst: pointer to the buffer that shall take the data
diff --git a/lib/Makefile b/lib/Makefile
index 29c02a924973..d86c71feaf0a 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -28,7 +28,7 @@ endif
 CFLAGS_string.o := $(call cc-option, -fno-stack-protector)
 endif
 
-lib-y := ctype.o string.o vsprintf.o cmdline.o \
+lib-y := ctype.o string.o struct_user.o vsprintf.o cmdline.o \
 rbtree.o radix-tree.o timerqueue.o xarray.o \
 idr.o extable.o \
 sha1.o chacha.o irq_regs.o argv_split.o \
diff --git a/lib/struct_user.c b/lib/struct_user.c
new file mode 100644
index ..7301ab1bbe98
--- /dev/null
+++ b/lib/struct_user.c
@@ -0,0 +1,182 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2019 SUSE LLC
+ * Copyright (C) 2019 Aleksa Sarai 
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define BUFFER_SIZE 64
+
+/*
+ * "memset(p, 0, size)" but for user space buffers. Caller must have already
+ * checked access_ok(p, size).
+ */
+static int __memzero_user(void __user *p, size_t s)
+{
+   const char zeros[BUFFER_SIZE] = {};
+   while (s > 0) {
+   size_t n = min(s, sizeof(zeros));
+
+   if (__copy_to_user(p, zeros, n))
+   return -EFAULT;
+
+   p += n;
+   s -= n;
+   }
+   return 0;
+}
+
+/**
+ * copy_struct_to_user: copy a struct to user space
+ * @dst:   Destination address, in user space.
+ * @usize: Size of @dst struct.
+ * @src:   Source address, in kernel space.
+ * @ksize: Size of @src struct.
+ *
+ * Copies a struct from kernel space to user space, in a way that guarantees
+ * backwards-compatibility for struct syscall arguments (as long as future
+ * struct extensions are made such that all new fields are *appended* to the
+ * old struct, and zeroed-out new fields have the same meaning as the old
+ * struct).
+ *
+ * @ksize is just sizeof(*dst), and @usize should've been passed by user space.
+ * The recommended usage is something like the following:
+ *
+ *   SYSCALL_DEFINE2(foobar, struct foo __user *, uarg, size_t, usize)
+ *   {
+ *  int err;
+ *  struct foo karg = {};
+ *
+ *  // do something with karg
+ *
+ *  err = copy_struct_to_user(uarg, usize, , sizeof(karg));
+ *  if (err)
+ *return err;
+ *
+ *  // ...
+ *   }
+ *
+ * There are three cases to consider:
+ *  * If @usize == @ksize, then it's copied verbatim.
+ *  * If @usize < @ksize, then kernel space is "returning" a newer struct to an
+ *older user space. In order to avoid user space getting incomplete
+ *information (new fields might be important), all trailing bytes in @src
+ *(@ksize - @usize) must be zeroed, otherwise -EFBIG is returned.
+ *  * If @usize > @ksize, then the kernel is "returning" an older struct to a
+ *newer user space. The trailing bytes in @dst (@usize - @ksize) will be
+ *zero-filled.
+ *
+ * Returns (in all cases, some data may have been copied):
+ *  * -EFBIG:  (@usize < @ksize) and there are non-zero trailing bytes in @src.
+ *  * -EFAULT: access to user space failed.
+ */
+int copy_struct_to_user(void 

[PATCH v12 00/12] namei: openat2(2) path resolution restrictions

2019-09-04 Thread Aleksa Sarai
This patchset is being developed here:


Patch changelog:
 v12:
  * Remove @how->reserved field from openat2(2), and instead use the
(struct, size) design for syscall extensions.
  * Implement copy_struct_{to,from}_user() to unify (struct, size)
syscall extension designs (as well as make them slightly more
efficient by using memchr_inv() as well as using buffers and
avoiding repeated access_ok() checks for trailing byte operations).
  * Port sched_setattr(), perf_event_open(), and clone3() to use the
    new helpers.
 v11: 
  
 v10: 
 v09: 
 v08: 
 v07: 
 v06: 
 v05: 
 v04: 
 v03: 
 v02: 
 v01: 

The need for some sort of control over VFS's path resolution (to avoid
malicious paths resulting in inadvertent breakouts) has been a very
long-standing desire of many userspace applications. This patchset is a
revival of Al Viro's old AT_NO_JUMPS[1,2] patchset (which was a variant
of David Drysdale's O_BENEATH patchset[3] which was a spin-off of the
Capsicum project[4]) with a few additions and changes made based on the
previous discussion within [5] as well as others I felt were useful.

In line with the conclusions of the original discussion of AT_NO_JUMPS,
the flag has been split up into separate flags. However, instead of
being an openat(2) flag it is provided through a new syscall openat2(2)
which provides several other improvements to the openat(2) interface (see the
patch description for more details). The following new LOOKUP_* flags are
added:

  * LOOKUP_NO_XDEV blocks all mountpoint crossings (upwards, downwards,
or through absolute links). Absolute pathnames alone in openat(2) do
not trigger this.

  * LOOKUP_NO_MAGICLINKS blocks resolution through /proc/$pid/fd-style
links. This is done by blocking the usage of nd_jump_link() during
resolution in a filesystem. The term "magic-links" is used to match
with the only reference to these links in Documentation/, but I'm
happy to change the name.

It should be noted that this is different to the scope of
~LOOKUP_FOLLOW in that it applies to all path components. However,
you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
will *not* fail (assuming that no parent component was a
magic-link), and you will have an fd for the magic-link.

  * LOOKUP_BENEATH disallows escapes to outside the starting dirfd's
tree, using techniques such as ".." or absolute links. Absolute
paths in openat(2) are also disallowed. Conceptually this flag is to
ensure you "stay below" a certain point in the filesystem tree --
but this requires some additional machinery to protect against various races
that would allow escape using "..".

Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it
can trivially beam you around the filesystem (breaking the
protection). In future, there might be similar safety checks done as
in LOOKUP_IN_ROOT, but that requires more discussion.

In addition, two new flags are added that expand on the above ideas:

  * LOOKUP_NO_SYMLINKS does what it says on the tin. No symlink
resolution is allowed at all, including magic-links. Just as with
LOOKUP_NO_MAGICLINKS this can still be used with NOFOLLOW to open an
fd for the symlink as long as no parent path had a symlink
component.

  * LOOKUP_IN_ROOT is an extension of LOOKUP_BENEATH that, rather than
blocking attempts to move past the root, forces all such movements
to be scoped to the starting point. This provides chroot(2)-like
protection but without the cost of a chroot(2) for each filesystem
operation, as well as being safe against race attacks that chroot(2)
is not.

If a race is detected (as with LOOKUP_BENEATH) then an error is
generated, and similar to LOOKUP_BENEATH it is not permitted to cross
magic-links with LOOKUP_IN_ROOT.

The primary need for this is from container runtimes, which
currently need to do symlink scoping in userspace[6] when opening
paths in a potentially malicious container. There is a long list of
CVEs 

Re: [PATCH 1/1] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-04 Thread Gerald Schaefer
On Tue,  3 Sep 2019 13:31:46 +0530
Anshuman Khandual  wrote:

> This adds a test module which will validate architecture page table helpers
> and accessors regarding compliance with generic MM semantics expectations.
> This will help various architectures in validating changes to the existing
> page table helpers or addition of new ones.
> 
> Test page table and memory pages creating its entries at various levels are
> all allocated from system memory with required alignments. If memory pages
> with required size and alignment could not be allocated, then all depending
> individual tests are skipped.

This looks very useful, thanks. Of course, s390 is quite special and does
not work nicely with this patch (yet), mostly because of our dynamic page
table levels/folding. Still need to figure out what can be fixed in the arch
code and what would need to be changed in the test module. See below for some
generic comments/questions.

At least one real bug in the s390 code was already revealed by this, which
is very nice. In pmd/pud_bad(), we also check large pmds/puds for sanity,
instead of reporting them as bad, which is apparently not what is expected.

[...]
> +/*
> + * Basic operations
> + *
> + * mkold(entry)  = An old and not a young entry
> + * mkyoung(entry)= A young and not an old entry
> + * mkdirty(entry)= A dirty and not a clean entry
> + * mkclean(entry)= A clean and not a dirty entry
> + * mkwrite(entry)= A write and not a write protected entry
> + * wrprotect(entry)  = A write protected and not a write entry
> + * pxx_bad(entry)= A mapped and non-table entry
> + * pxx_same(entry1, entry2)  = Both entries hold the exact same value
> + */
> +#define VADDR_TEST   (PGDIR_SIZE + PUD_SIZE + PMD_SIZE + PAGE_SIZE)

Why is P4D_SIZE missing in the VADDR_TEST calculation?

[...]
> +
> +#if !defined(__PAGETABLE_PMD_FOLDED) && !defined(__ARCH_HAS_4LEVEL_HACK)
> +static void pud_clear_tests(pud_t *pudp)
> +{
> + memset(pudp, RANDOM_NZVALUE, sizeof(pud_t));
> + pud_clear(pudp);
> + WARN_ON(!pud_none(READ_ONCE(*pudp)));
> +}

For pgd/p4d/pud_clear(), we only clear if the page table level is present
and not folded. The memset() here overwrites the table type bits, so
pud_clear() will not clear anything on s390 and the pud_none() check will
fail.
Would it be possible to OR a (larger) random value into the table, so that
the lower 12 bits would be preserved?

> +
> +static void pud_populate_tests(struct mm_struct *mm, pud_t *pudp, pmd_t 
> *pmdp)
> +{
> + /*
> +  * This entry points to next level page table page.
> +  * Hence this must not qualify as pud_bad().
> +  */
> + pmd_clear(pmdp);
> + pud_clear(pudp);
> + pud_populate(mm, pudp, pmdp);
> + WARN_ON(pud_bad(READ_ONCE(*pudp)));
> +}

This will populate the pud with a pmd pointer that does not point to the
beginning of the pmd table, but to the second entry (because of how
VADDR_TEST is constructed). This will result in failing pud_bad() check
on s390. Not sure why/how it works on other archs, but would it be possible
to align pmdp down to the beginning of the pmd table (and similar for the
other pxd_populate_tests)?

[...]
> +
> + p4d_free(mm, saved_p4dp);
> + pud_free(mm, saved_pudp);
> + pmd_free(mm, saved_pmdp);
> + pte_free(mm, (pgtable_t) virt_to_page(saved_ptep));

pgtable_t is arch-specific, and on s390 it is not a struct page pointer,
but a pte pointer. So this will go wrong, also on all other archs (if any)
where pgtable_t is not struct page.
Would it be possible to use pte_free_kernel() instead, and just pass
saved_ptep directly?

Regards,
Gerald



Re: [PATCH v2] powerpc/fadump: when fadump is supported register the fadump sysfs files.

2019-09-04 Thread Michal Suchánek
On Thu, 29 Aug 2019 10:58:16 +0530
Hari Bathini  wrote:

> On 28/08/19 10:57 PM, Michal Suchanek wrote:
> > Currently it is not possible to distinguish the case when fadump is
> > supported by firmware and disabled in kernel and completely unsupported
> > using the kernel sysfs interface. User can investigate the devicetree
> > but it is more reasonable to provide sysfs files in case we get some
> > fadumpv2 in the future.
> > 
> > With this patch sysfs files are available whenever fadump is supported
> > by firmware.
> > 
> > Signed-off-by: Michal Suchanek 
> > ---  
> 
> [...]
> 
> > -   if (!fw_dump.fadump_supported) {
> > +   if (!fw_dump.fadump_supported && fw_dump.fadump_enabled) {
> > printk(KERN_ERR "Firmware-assisted dump is not supported on"
> > " this hardware\n");
> > -   return 0;
> > }  
> 
> The above hunk is redundant with similar message already logged during
> early boot in fadump_reserve_mem() function. I am not strongly against
> this though. So...

I see this:
[0.00] debug: ignoring loglevel setting.
[0.00] Firmware-assisted dump is not supported on this hardware
[0.00] Reserving 256MB of memory at 128MB for crashkernel (System RAM: 
8192MB)
[0.00] Allocated 5832704 bytes for 2048 pacas at c7a8
[0.00] Page sizes from device-tree:
[0.00] base_shift=12: shift=12, sllp=0x, avpnm=0x, 
tlbiel=1, penc=0
[0.00] base_shift=16: shift=16, sllp=0x0110, avpnm=0x, 
tlbiel=1, penc=1
[0.00] Page orders: linear mapping = 16, virtual = 16, io = 16, vmemmap 
= 16
[0.00] Using 1TB segments
[0.00] Initializing hash mmu with SLB

Clearly the second message is logged from the above code. The duplicate
is capitalized: "Firmware-Assisted Dump is not supported on this
hardware" and I don't see it logged. So if anything is a duplicate that
should be removed, it is the message in fadump_reserve_mem(). It is not
clear why that one is not seen, though.

Thanks

Michal


Re: [PATCH v5 02/31] powerpc/fadump: move internal code to a new file

2019-09-04 Thread Hari Bathini



On 04/09/19 2:32 PM, Mahesh Jagannath Salgaonkar wrote:
> On 9/3/19 9:35 PM, Hari Bathini wrote:
>>
>>
>> On 03/09/19 4:39 PM, Michael Ellerman wrote:
>>> Hari Bathini  writes:
 Make way for refactoring platform specific FADump code by moving code
 that could be referenced from multiple places to fadump-common.c file.

 Signed-off-by: Hari Bathini 
 ---
  arch/powerpc/kernel/Makefile|2 
  arch/powerpc/kernel/fadump-common.c |  140 
 ++
  arch/powerpc/kernel/fadump-common.h |8 ++
  arch/powerpc/kernel/fadump.c|  146 
 ++-
  4 files changed, 158 insertions(+), 138 deletions(-)
  create mode 100644 arch/powerpc/kernel/fadump-common.c
>>>
>>> I don't understand why we need fadump.c and fadump-common.c? They're
>>> both common/shared across pseries & powernv aren't they?
>>
>> The convention I tried to follow is to have fadump-common.c shared between 
>> fadump.c,
>> pseries & powernv code while pseries & powernv code take callback requests 
>> from
>> fadump.c and use fadump-common.c (shared by both platforms), if necessary to 
>> fulfil
>> those requests...
>>
>>> By the end of the series we end up with 149 lines in fadump-common.c
>>> which seems like a waste of time. Just put it all in fadump.c.
>>
>> Yeah. Probably not worth a new C file. Will just have two separate headers. 
>> One for
>> internal code and one for interfacing with other modules...
>>
>> [...]
>>
 + * Copyright 2019, IBM Corp.
 + * Author: Hari Bathini 
>>>
>>> These can just be:
>>>
>>>  * Copyright 2011, Mahesh Salgaonkar, IBM Corporation.
>>>  * Copyright 2019, Hari Bathini, IBM Corporation.
>>>
>>
>> Sure.
>>
 + */
 +
 +#undef DEBUG
>>>
>>> Don't undef DEBUG please.
>>>
>>
>> Sorry! Seeing such thing in most files, I thought this was the convention. 
>> Will drop
>> this change in all the new files I added.
>>
 +#define pr_fmt(fmt) "fadump: " fmt
 +
 +#include 
 +#include 
 +#include 
 +#include 
 +
 +#include "fadump-common.h"
 +
 +void *fadump_cpu_notes_buf_alloc(unsigned long size)
 +{
 +  void *vaddr;
 +  struct page *page;
 +  unsigned long order, count, i;
 +
 +  order = get_order(size);
 +  vaddr = (void *)__get_free_pages(GFP_KERNEL|__GFP_ZERO, order);
 +  if (!vaddr)
 +  return NULL;
 +
 +  count = 1 << order;
 +  page = virt_to_page(vaddr);
 +  for (i = 0; i < count; i++)
 +  SetPageReserved(page + i);
 +  return vaddr;
 +}
>>>
>>> I realise you're just moving this code, but why do we need all this hand
>>> rolled allocation stuff?
>>
>> Yeah, I think alloc_pages_exact() may be better here. Mahesh, am I missing 
>> something?
> 
> We hook up the physical address of this buffer to ELF core header as
> PT_NOTE section. Hence we don't want these pages to be moved around or
> reclaimed.

alloc_pages_exact() + mark_page_reserved() should take care of that, I guess..

- Hari



Re: lockdep warning while booting POWER9 PowerNV

2019-09-04 Thread Bart Van Assche

On 8/30/19 2:13 PM, Qian Cai wrote:

https://raw.githubusercontent.com/cailca/linux-mm/master/powerpc.config

Once in a while, booting an IBM POWER9 PowerNV system (8335-GTH) would generate
a warning in lockdep_register_key() at,

if (WARN_ON_ONCE(static_obj(key)))

because

key = 0xc19ad118
&_stext = 0xc000
&_end = 0xc49d

i.e., it will cause static_obj() returns 1.


(back from a trip)

Hi Qian,

Does this mean that on POWER9 it can happen that a dynamically allocated 
object has an address that falls between &_stext and &_end? Since I am 
not familiar with POWER9 and do not have access to such a system, can you 
propose a patch?


Thanks,

Bart.


[PATCH AUTOSEL 4.19 41/52] ibmvnic: Do not process reset during or after device removal

2019-09-04 Thread Sasha Levin
From: Thomas Falcon 

[ Upstream commit 36f1031c51a2538e5558fb44c6d6b88f98d3c0f2 ]

Currently, the ibmvnic driver will not schedule device resets
if the device is being removed, but does not check the device
state before the reset is actually processed. This leads to a race
where a reset is scheduled with a valid device state but is
processed after the driver has been removed, resulting in an oops.

Fix this by checking the device state before processing a queued
reset event.

Reported-by: Abdul Haleem 
Tested-by: Abdul Haleem 
Signed-off-by: Thomas Falcon 
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index 0ae43d27cdcff..af1e8671515e0 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1996,6 +1996,10 @@ static void __ibmvnic_reset(struct work_struct *work)
 
rwi = get_next_rwi(adapter);
while (rwi) {
+   if (adapter->state == VNIC_REMOVING ||
+   adapter->state == VNIC_REMOVED)
+   goto out;
+
if (adapter->force_reset_recovery) {
adapter->force_reset_recovery = false;
rc = do_hard_reset(adapter, rwi, reset_state);
@@ -2020,7 +2024,7 @@ static void __ibmvnic_reset(struct work_struct *work)
netdev_dbg(adapter->netdev, "Reset failed\n");
free_all_rwi(adapter);
}
-
+out:
adapter->resetting = false;
if (we_lock_rtnl)
rtnl_unlock();
-- 
2.20.1



[PATCH AUTOSEL 5.2 67/94] ibmvnic: Do not process reset during or after device removal

2019-09-04 Thread Sasha Levin
From: Thomas Falcon 

[ Upstream commit 36f1031c51a2538e5558fb44c6d6b88f98d3c0f2 ]

Currently, the ibmvnic driver will not schedule device resets
if the device is being removed, but does not check the device
state before the reset is actually processed. This leads to a race
where a reset is scheduled with a valid device state but is
processed after the driver has been removed, resulting in an oops.

Fix this by checking the device state before processing a queued
reset event.

Reported-by: Abdul Haleem 
Tested-by: Abdul Haleem 
Signed-off-by: Thomas Falcon 
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index 3da6800732656..d103be77eb406 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1981,6 +1981,10 @@ static void __ibmvnic_reset(struct work_struct *work)
 
rwi = get_next_rwi(adapter);
while (rwi) {
+   if (adapter->state == VNIC_REMOVING ||
+   adapter->state == VNIC_REMOVED)
+   goto out;
+
if (adapter->force_reset_recovery) {
adapter->force_reset_recovery = false;
rc = do_hard_reset(adapter, rwi, reset_state);
@@ -2005,7 +2009,7 @@ static void __ibmvnic_reset(struct work_struct *work)
netdev_dbg(adapter->netdev, "Reset failed\n");
free_all_rwi(adapter);
}
-
+out:
adapter->resetting = false;
if (we_lock_rtnl)
rtnl_unlock();
-- 
2.20.1



Re: [PATCH v3 2/3] Powerpc64/Watchpoint: Don't ignore extraneous exceptions

2019-09-04 Thread Naveen N. Rao

Ravi Bangoria wrote:

On Powerpc64, watchpoint match range is double-word granular. On
a watchpoint hit, DAR is set to the first byte of overlap between
actual access and watched range. And thus it's quite possible that
DAR does not point inside user specified range. Ex, say user creates
a watchpoint with address range 0x1004 to 0x1007. So hw would be
configured to watch from 0x1000 to 0x1007. If there is a 4 byte
access from 0x1002 to 0x1005, DAR will point to 0x1002 and thus
interrupt handler considers it as extraneous, but it's actually not,
because part of the access belongs to what user has asked. So, let
kernel pass it on to user and let user decide what to do with it
instead of silently ignoring it. The drawback is, it can generate
false positive events.


I think you should do the additional validation here, instead of 
generating false positives. You should be able to read the instruction, 
run it through analyse_instr(), and then use OP_IS_LOAD_STORE() and 
GETSIZE() to understand the access range. This can be used to then 
perform a better match against what the user asked for.


- Naveen



Re: [PATCH v5 11/31] powernv/fadump: add fadump support on powernv

2019-09-04 Thread Hari Bathini



On 03/09/19 10:01 PM, Hari Bathini wrote:
> 
[...]
>>> diff --git a/arch/powerpc/kernel/fadump-common.h 
>>> b/arch/powerpc/kernel/fadump-common.h
>>> index d2c5b16..f6c52d3 100644
>>> --- a/arch/powerpc/kernel/fadump-common.h
>>> +++ b/arch/powerpc/kernel/fadump-common.h
>>> @@ -140,4 +140,13 @@ static inline int rtas_fadump_dt_scan(struct fw_dump 
>>> *fadump_config, ulong node)
>>>  }
>>>  #endif
>>>  
>>> +#ifdef CONFIG_PPC_POWERNV
>>> +extern int opal_fadump_dt_scan(struct fw_dump *fadump_config, ulong node);
>>> +#else
>>> +static inline int opal_fadump_dt_scan(struct fw_dump *fadump_config, ulong 
>>> node)
>>> +{
>>> +   return 1;
>>> +}
>>
>> Extending the strange flat device tree calling convention to these
>> functions is not ideal.
>>
>> It would be better I think if they just returned bool true/false for
>> "found it" / "not found", and then early_init_dt_scan_fw_dump() can
>> convert that into the appropriate return value.
>>
>>> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
>>> index f7c8073..b8061fb9 100644
>>> --- a/arch/powerpc/kernel/fadump.c
>>> +++ b/arch/powerpc/kernel/fadump.c
>>> @@ -114,6 +114,9 @@ int __init early_init_dt_scan_fw_dump(unsigned long 
>>> node, const char *uname,
>>> if (strcmp(uname, "rtas") == 0)
>>> return rtas_fadump_dt_scan(&fw_dump, node);
>>>  
>>> +   if (strcmp(uname, "ibm,opal") == 0)
>>> +   return opal_fadump_dt_scan(&fw_dump, node);
>>> +
>>
>> ie this would become:
>>
>>  if (strcmp(uname, "ibm,opal") == 0 && opal_fadump_dt_scan(&fw_dump, 
>> node))
>> return 1;
>>
> 
> Yeah. Will update accordingly...

On second thoughts, we don't need a return type at all here. fw_dump struct and 
callbacks are
populated based on what we found in the DT. And irrespective of what we found 
in the DT, we have
to return `1` once the particular depth and node is processed..

- Hari



Re: [PATCH 1/1] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-04 Thread Kirill A. Shutemov
On Tue, Sep 03, 2019 at 01:31:46PM +0530, Anshuman Khandual wrote:
> This adds a test module which will validate architecture page table helpers
> and accessors regarding compliance with generic MM semantics expectations.
> This will help various architectures in validating changes to the existing
> page table helpers or addition of new ones.
> 
> Test page table and memory pages creating its entries at various levels are
> all allocated from system memory with required alignments. If memory pages
> with required size and alignment could not be allocated, then all depending
> individual tests are skipped.

See my comments below.

> 
> Cc: Andrew Morton 
> Cc: Vlastimil Babka 
> Cc: Greg Kroah-Hartman 
> Cc: Thomas Gleixner 
> Cc: Mike Rapoport 
> Cc: Jason Gunthorpe 
> Cc: Dan Williams 
> Cc: Peter Zijlstra 
> Cc: Michal Hocko 
> Cc: Mark Rutland 
> Cc: Mark Brown 
> Cc: Steven Price 
> Cc: Ard Biesheuvel 
> Cc: Masahiro Yamada 
> Cc: Kees Cook 
> Cc: Tetsuo Handa 
> Cc: Matthew Wilcox 
> Cc: Sri Krishna chowdary 
> Cc: Dave Hansen 
> Cc: Russell King - ARM Linux 
> Cc: Michael Ellerman 
> Cc: Paul Mackerras 
> Cc: Martin Schwidefsky 
> Cc: Heiko Carstens 
> Cc: "David S. Miller" 
> Cc: Vineet Gupta 
> Cc: James Hogan 
> Cc: Paul Burton 
> Cc: Ralf Baechle 
> Cc: linux-snps-...@lists.infradead.org
> Cc: linux-m...@vger.kernel.org
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-i...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-s...@vger.kernel.org
> Cc: linux...@vger.kernel.org
> Cc: sparcli...@vger.kernel.org
> Cc: x...@kernel.org
> Cc: linux-ker...@vger.kernel.org
> 
> Suggested-by: Catalin Marinas 
> Signed-off-by: Anshuman Khandual 
> ---
>  mm/Kconfig.debug   |  14 ++
>  mm/Makefile|   1 +
>  mm/arch_pgtable_test.c | 425 +
>  3 files changed, 440 insertions(+)
>  create mode 100644 mm/arch_pgtable_test.c
> 
> diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
> index 327b3ebf23bf..ce9c397f7b07 100644
> --- a/mm/Kconfig.debug
> +++ b/mm/Kconfig.debug
> @@ -117,3 +117,17 @@ config DEBUG_RODATA_TEST
>  depends on STRICT_KERNEL_RWX
>  ---help---
>This option enables a testcase for the setting rodata read-only.
> +
> +config DEBUG_ARCH_PGTABLE_TEST
> + bool "Test arch page table helpers for semantics compliance"
> + depends on MMU
> + depends on DEBUG_KERNEL
> + help
> +   This option provides a kernel module which can be used to test
> +   architecture page table helper functions on various platforms,
> +   verifying that they comply with expected generic MM semantics. This
> +   will help architecture code in making sure that any changes or
> +   new additions of these helpers still conform to the expected
> +   generic MM semantics.
> +
> +   If unsure, say N.
> diff --git a/mm/Makefile b/mm/Makefile
> index d996846697ef..bb572c5aa8c5 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -86,6 +86,7 @@ obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
>  obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
>  obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
>  obj-$(CONFIG_DEBUG_RODATA_TEST) += rodata_test.o
> +obj-$(CONFIG_DEBUG_ARCH_PGTABLE_TEST) += arch_pgtable_test.o
>  obj-$(CONFIG_PAGE_OWNER) += page_owner.o
>  obj-$(CONFIG_CLEANCACHE) += cleancache.o
>  obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
> diff --git a/mm/arch_pgtable_test.c b/mm/arch_pgtable_test.c
> new file mode 100644
> index ..f15be8a73723
> --- /dev/null
> +++ b/mm/arch_pgtable_test.c
> @@ -0,0 +1,425 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * This kernel module validates architecture page table helpers &
> + * accessors and helps in verifying their continued compliance with
> + * generic MM semantics.
> + *
> + * Copyright (C) 2019 ARM Ltd.
> + *
> + * Author: Anshuman Khandual 
> + */
> +#define pr_fmt(fmt) "arch_pgtable_test: %s " fmt, __func__
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +/*
> + * Basic operations
> + *
> + * mkold(entry)  = An old and not a young entry
> + * mkyoung(entry)= A young and not an old entry
> + * mkdirty(entry)= A dirty and not a clean entry
> + * mkclean(entry)= A clean and not a dirty entry
> + * mkwrite(entry)= A write and not a write protected entry
> + * wrprotect(entry)  = A write protected and not a write entry
> + * pxx_bad(entry)= A mapped and non-table entry
> + * pxx_same(entry1, entry2)  = Both entries hold the exact same value
> + */
> +#define VADDR_TEST   (PGDIR_SIZE + PUD_SIZE + PMD_SIZE + PAGE_SIZE)

What is special about this address? How do you know if it is not occupied
yet?

> +#define VMA_TEST_FLAGS   (VM_READ|VM_WRITE|VM_EXEC)
> +#define RANDOM_NZVALUE   

[PATCH] powerpc: Add attributes for setjmp/longjmp

2019-09-04 Thread Segher Boessenkool
The setjmp function should be declared as "returns_twice", or bad
things can happen[1].  This does not actually change generated code
in my testing.

The longjmp function should be declared as "noreturn", so that the
compiler can optimise calls to it better.  This makes the generated
code a little shorter.

Signed-off-by: Segher Boessenkool 

[1] See 
https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-returns_005ftwice-function-attribute
---
 arch/powerpc/include/asm/setjmp.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/setjmp.h 
b/arch/powerpc/include/asm/setjmp.h
index d995061..e9f81bb 100644
--- a/arch/powerpc/include/asm/setjmp.h
+++ b/arch/powerpc/include/asm/setjmp.h
@@ -7,7 +7,7 @@
 
 #define JMP_BUF_LEN23
 
-extern long setjmp(long *);
-extern void longjmp(long *, long);
+extern long setjmp(long *) __attribute__((returns_twice));
+extern void longjmp(long *, long) __attribute__((noreturn));
 
 #endif /* _ASM_POWERPC_SETJMP_H */
-- 
1.8.3.1



Re: [PATCH v2 3/6] powerpc: Convert flush_icache_range & friends to C

2019-09-04 Thread Segher Boessenkool
On Wed, Sep 04, 2019 at 01:23:36PM +1000, Alastair D'Silva wrote:
> > Maybe also add "msr" in the clobbers.
> > 
> Ok.

There is no known register "msr" in GCC.


Segher


Re: [PATCH] powerpc: Avoid clang warnings around setjmp and longjmp

2019-09-04 Thread Segher Boessenkool
On Wed, Sep 04, 2019 at 08:16:45AM +, David Laight wrote:
> From: Nathan Chancellor [mailto:natechancel...@gmail.com]
> > Fair enough so I guess we are back to just outright disabling the
> > warning.
> 
> Just disabling the warning won't stop the compiler generating code
> that breaks a 'user' implementation of setjmp().

Yeah.  I have a patch (will send in an hour or so) that enables the
"returns_twice" attribute for setjmp (in <asm/setjmp.h>).  In testing
(with GCC trunk) it showed no difference in code generation, but
better safe than sorry.

It also sets "noreturn" on longjmp, and that *does* help, it saves a
hundred insns or so (all in xmon, no surprise there).

I don't think this will make LLVM shut up about this though.  And
technically it is right: the C standard does say that in hosted mode
setjmp is a reserved name and you need to include <setjmp.h> to access
it (not <asm/setjmp.h>).

So why is the kernel compiled as hosted?  Does adding -ffreestanding
hurt anything?  Is that actually supported on LLVM, on all relevant
versions of it?  Does it shut up the warning there (if not, that would
be an LLVM bug)?


Segher


Re: linux-next: build warnings after merge of the kbuild tree

2019-09-04 Thread Stephen Rothwell
Hi Masahiro,

On Wed, 4 Sep 2019 15:22:09 +0900 Masahiro Yamada 
 wrote:
>
> For today's linux-next, please squash the following too.
> 
> (This is my fault, since scripts/mkuboot.sh is a bash script)
> 
> 
> diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
> index 41c50f9461e5..2d72327417a9 100644
> --- a/scripts/Makefile.lib
> +++ b/scripts/Makefile.lib
> @@ -374,7 +374,7 @@ UIMAGE_ENTRYADDR ?= $(UIMAGE_LOADADDR)
>  UIMAGE_NAME ?= 'Linux-$(KERNELRELEASE)'
> 
>  quiet_cmd_uimage = UIMAGE  $@
> -  cmd_uimage = $(CONFIG_SHELL) $(MKIMAGE) -A $(UIMAGE_ARCH) -O linux \
> +  cmd_uimage = $(BASH) $(MKIMAGE) -A $(UIMAGE_ARCH) -O linux \
> -C $(UIMAGE_COMPRESSION) $(UIMAGE_OPTS-y) \
> -T $(UIMAGE_TYPE) \
> -a $(UIMAGE_LOADADDR) -e $(UIMAGE_ENTRYADDR) \

Umm, that seems to have already been done.

-- 
Cheers,
Stephen Rothwell




Re: linux-next: build warnings after merge of the kbuild tree

2019-09-04 Thread Stephen Rothwell
Hi Masahiro,

On Wed, 4 Sep 2019 10:00:30 +0900 Masahiro Yamada 
 wrote:
>
> Could you fix it up as follows?
> I will squash it for tomorrow's linux-next.
> 
> 
> --- a/arch/powerpc/Makefile.postlink
> +++ b/arch/powerpc/Makefile.postlink
> @@ -18,7 +18,7 @@ quiet_cmd_relocs_check = CHKREL  $@
>  ifdef CONFIG_PPC_BOOK3S_64
>cmd_relocs_check =   \
> $(CONFIG_SHELL) $(srctree)/arch/powerpc/tools/relocs_check.sh "$(OBJDUMP)" "$@" ; \
> -   $(CONFIG_SHELL) $(srctree)/arch/powerpc/tools/unrel_branch_check.sh "$(OBJDUMP)" "$@"
> +   $(BASH) $(srctree)/arch/powerpc/tools/unrel_branch_check.sh "$(OBJDUMP)" "$@"
>  else
>cmd_relocs_check =   \
> $(CONFIG_SHELL) $(srctree)/arch/powerpc/tools/relocs_check.sh "$(OBJDUMP)" "$@"

I added that in linux-next today.

-- 
Cheers,
Stephen Rothwell




Re: [PATCH v5 21/31] powernv/fadump: process architected register state data provided by firmware

2019-09-04 Thread Michael Ellerman
Hari Bathini  writes:

> diff --git a/arch/powerpc/kernel/fadump-common.h 
> b/arch/powerpc/kernel/fadump-common.h
> index 7107cf2..fc408b0 100644
> --- a/arch/powerpc/kernel/fadump-common.h
> +++ b/arch/powerpc/kernel/fadump-common.h
> @@ -98,7 +98,11 @@ struct fw_dump {
>   /* cmd line option during boot */
>   unsigned long   reserve_bootvar;
>  
> + unsigned long   cpu_state_destination_addr;

AFAICS that is only used in two places, and both of them have to call
__va() on it, so why don't we store the virtual address to start with?

> diff --git a/arch/powerpc/platforms/powernv/opal-fadump.c 
> b/arch/powerpc/platforms/powernv/opal-fadump.c
> index f75b861..9a32a7f 100644
> --- a/arch/powerpc/platforms/powernv/opal-fadump.c
> +++ b/arch/powerpc/platforms/powernv/opal-fadump.c
> @@ -282,15 +283,122 @@ static void opal_fadump_cleanup(struct fw_dump 
> *fadump_conf)
>   pr_warn("Could not reset (%llu) kernel metadata tag!\n", ret);
>  }
>  
> +static inline void opal_fadump_set_regval_regnum(struct pt_regs *regs,
> +  u32 reg_type, u32 reg_num,
> +  u64 reg_val)
> +{
> + if (reg_type == HDAT_FADUMP_REG_TYPE_GPR) {
> + if (reg_num < 32)
> + regs->gpr[reg_num] = reg_val;
> + return;
> + }
> +
> + switch (reg_num) {
> + case SPRN_CTR:
> + regs->ctr = reg_val;
> + break;
> + case SPRN_LR:
> + regs->link = reg_val;
> + break;
> + case SPRN_XER:
> + regs->xer = reg_val;
> + break;
> + case SPRN_DAR:
> + regs->dar = reg_val;
> + break;
> + case SPRN_DSISR:
> + regs->dsisr = reg_val;
> + break;
> + case HDAT_FADUMP_REG_ID_NIP:
> + regs->nip = reg_val;
> + break;
> + case HDAT_FADUMP_REG_ID_MSR:
> + regs->msr = reg_val;
> + break;
> + case HDAT_FADUMP_REG_ID_CCR:
> + regs->ccr = reg_val;
> + break;
> + }
> +}
> +
> +static inline void opal_fadump_read_regs(char *bufp, unsigned int regs_cnt,
> +  unsigned int reg_entry_size,
> +  struct pt_regs *regs)
> +{
> + int i;
> + struct hdat_fadump_reg_entry *reg_entry;

Where's my christmas tree :)

> +
> + memset(regs, 0, sizeof(struct pt_regs));
> +
> + for (i = 0; i < regs_cnt; i++, bufp += reg_entry_size) {
> + reg_entry = (struct hdat_fadump_reg_entry *)bufp;
> + opal_fadump_set_regval_regnum(regs,
> +   be32_to_cpu(reg_entry->reg_type),
> +   be32_to_cpu(reg_entry->reg_num),
> +   be64_to_cpu(reg_entry->reg_val));
> + }
> +}
> +
> +static inline bool __init is_thread_core_inactive(u8 core_state)
> +{
> + bool is_inactive = false;
> +
> + if (core_state == HDAT_FADUMP_CORE_INACTIVE)
> + is_inactive = true;
> +
> + return is_inactive;

return core_state == HDAT_FADUMP_CORE_INACTIVE;

??

In fact there's only one caller, so just drop the inline entirely.

> +}
> +
>  /*
>   * Convert CPU state data saved at the time of crash into ELF notes.
> + *
> + * Each register entry is of 16 bytes, A numerical identifier along with
> + * a GPR/SPR flag in the first 8 bytes and the register value in the next
> + * 8 bytes. For more details refer to F/W documentation.
>   */
>  static int __init opal_fadump_build_cpu_notes(struct fw_dump *fadump_conf)
>  {
>   u32 num_cpus, *note_buf;
>   struct fadump_crash_info_header *fdh = NULL;
> + struct hdat_fadump_thread_hdr *thdr;
> + unsigned long addr;
> + u32 thread_pir;
> + char *bufp;
> + struct pt_regs regs;
> + unsigned int size_of_each_thread;
> + unsigned int regs_offset, regs_cnt, reg_esize;
> + int i;

unsigned int size_of_each_thread, regs_offset, regs_cnt, reg_esize;
struct fadump_crash_info_header *fdh = NULL;
u32 num_cpus, thread_pir, *note_buf;
struct hdat_fadump_thread_hdr *thdr;
struct pt_regs regs;
unsigned long addr;
char *bufp;
int i;

Ah much better :)

Though the number of variables might be an indication that this function
could be split into smaller parts.

> @@ -473,6 +627,26 @@ int __init opal_fadump_dt_scan(struct fw_dump 
> *fadump_conf, ulong node)
>   return 1;
>   }
>  
> + ret = opal_mpipl_query_tag(OPAL_MPIPL_TAG_CPU, &addr);
> + if ((ret != OPAL_SUCCESS) || !addr) {
> + pr_err("Failed to get CPU metadata (%lld)\n", ret);
> + return 1;
> + }
> +
> + addr = be64_to_cpu(addr);
> + pr_debug("CPU metadata addr: %llx\n", addr);
> +
> 

Re: [PATCH v5 19/31] powerpc/fadump: Update documentation about OPAL platform support

2019-09-04 Thread Oliver O'Halloran
On Wed, Sep 4, 2019 at 9:51 PM Michael Ellerman  wrote:
>
> Hari Bathini  writes:
> > With FADump support now available on both pseries and OPAL platforms,
> > update FADump documentation with these details.
> >
> > Signed-off-by: Hari Bathini 
> > ---
> >  Documentation/powerpc/firmware-assisted-dump.rst |  104 
> > +-
> >  1 file changed, 63 insertions(+), 41 deletions(-)
> >
> > diff --git a/Documentation/powerpc/firmware-assisted-dump.rst 
> > b/Documentation/powerpc/firmware-assisted-dump.rst
> > index d912755..2c3342c 100644
> > --- a/Documentation/powerpc/firmware-assisted-dump.rst
> > +++ b/Documentation/powerpc/firmware-assisted-dump.rst
> > @@ -72,7 +72,8 @@ as follows:
> > normal.
> >
> >  -  The freshly booted kernel will notice that there is a new
> > -   node (ibm,dump-kernel) in the device tree, indicating that
> > +   node (ibm,dump-kernel on PSeries or ibm,opal/dump/mpipl-boot
> > +   on OPAL platform) in the device tree, indicating that
> > there is crash data available from a previous boot. During
> > the early boot OS will reserve rest of the memory above
> > boot memory size effectively booting with restricted memory
> > @@ -96,7 +97,9 @@ as follows:
> >
> >  Please note that the firmware-assisted dump feature
> >  is only available on Power6 and above systems with recent
> > -firmware versions.
>
> Notice how "recent" has bit rotted.
>
> > +firmware versions on PSeries (PowerVM) platform and Power9
> > +and above systems with recent firmware versions on PowerNV
> > +(OPAL) platform.
>
> Can we say something more helpful here, ie. "recent" is not very useful.
> AFAIK it's actually wrong, there isn't a released firmware with the
> support yet at all, right?
>
> Given all the relevant firmware is open source can't we at least point
> to a commit or release tag or something?
>
> cheers

Even if we can quote a git sha it's not terribly useful or user
friendly. We already gate the feature behind DT nodes / properties
existing, so why not just say "fadump requires XYZ firmware feature,
as indicated by  device-tree property."


[PATCH v2 20/20] powerpc/64s/exception: only test KVM in SRR interrupts when PR KVM is supported

2019-09-04 Thread Nicholas Piggin
Apart from SRESET, MCE, and syscall (hcall variant), the SRR type
interrupts are not escalated to hypervisor mode, so delivered to the OS.

When running PR KVM, the OS is the hypervisor, and the guest runs with
MSR[PR]=1, so these interrupts must test if a guest was running when
interrupted. These tests are required at the real-mode entry points
because the PR KVM host runs with LPCR[AIL]=0.

In HV KVM and nested HV KVM, the guest always receives these interrupts,
so there is no need for the host to make this test. So remove the tests
if PR KVM is not configured.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 65 ++--
 1 file changed, 62 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 5171ed0e5f68..a711adf1e499 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -214,9 +214,36 @@ do_define_int n
 #ifdef CONFIG_KVM_BOOK3S_64_HANDLER
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
 /*
- * If hv is possible, interrupts come into to the hv version
- * of the kvmppc_interrupt code, which then jumps to the PR handler,
- * kvmppc_interrupt_pr, if the guest is a PR guest.
+ * All interrupts which set HSRR registers, as well as SRESET and MCE and
+ * syscall when invoked with "sc 1" switch to MSR[HV]=1 (HVMODE) to be taken,
+ * so they all generally need to test whether they were taken in guest context.
+ *
+ * Note: SRESET and MCE may also be sent to the guest by the hypervisor, and be
+ * taken with MSR[HV]=0.
+ *
+ * Interrupts which set SRR registers (with the above exceptions) do not
+ * elevate to MSR[HV]=1 mode, though most can be taken when running with
+ * MSR[HV]=1  (e.g., bare metal kernel and userspace). So these interrupts do
+ * not need to test whether a guest is running because they get delivered to
+ * the guest directly, including nested HV KVM guests.
+ *
+ * The exception is PR KVM, where the guest runs with MSR[PR]=1 and the host
+ * runs with MSR[HV]=0, so the host takes all interrupts on behalf of the
+ * guest. PR KVM runs with LPCR[AIL]=0 which causes interrupts to always be
+ * delivered to the real-mode entry point, therefore such interrupts only test
+ * KVM in their real mode handlers, and only when PR KVM is possible.
+ *
+ * Interrupts that are taken in MSR[HV]=0 and escalate to MSR[HV]=1 are always
+ * delivered in real-mode when the MMU is in hash mode because the MMU
+ * registers are not set appropriately to translate host addresses. In nested
+ * radix mode these can be delivered in virt-mode as the host translations are
+ * used implicitly (see: effective LPID, effective PID).
+ */
+
+/*
+ * If an interrupt is taken while a guest is running, it is immediately routed
+ * to KVM to handle. If both HV and PR KVM are possible, KVM interrupts go first
+ * to kvmppc_interrupt_hv, which handles the PR guest case.
  */
 #define kvmppc_interrupt kvmppc_interrupt_hv
 #else
@@ -1250,8 +1277,10 @@ INT_DEFINE_BEGIN(data_access)
IVEC=0x300
IDAR=1
IDSISR=1
+#ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
IKVM_SKIP=1
IKVM_REAL=1
+#endif
 INT_DEFINE_END(data_access)
 
 EXC_REAL_BEGIN(data_access, 0x300, 0x80)
@@ -1298,8 +1327,10 @@ INT_DEFINE_BEGIN(data_access_slb)
IAREA=PACA_EXSLB
IRECONCILE=0
IDAR=1
+#ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
IKVM_SKIP=1
IKVM_REAL=1
+#endif
 INT_DEFINE_END(data_access_slb)
 
 EXC_REAL_BEGIN(data_access_slb, 0x380, 0x80)
@@ -1349,7 +1380,9 @@ INT_DEFINE_BEGIN(instruction_access)
IISIDE=1
IDAR=1
IDSISR=1
+#ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
IKVM_REAL=1
+#endif
 INT_DEFINE_END(instruction_access)
 
 EXC_REAL_BEGIN(instruction_access, 0x400, 0x80)
@@ -1388,7 +1421,9 @@ INT_DEFINE_BEGIN(instruction_access_slb)
IRECONCILE=0
IISIDE=1
IDAR=1
+#ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
IKVM_REAL=1
+#endif
 INT_DEFINE_END(instruction_access_slb)
 
 EXC_REAL_BEGIN(instruction_access_slb, 0x480, 0x80)
@@ -1480,7 +1515,9 @@ INT_DEFINE_BEGIN(alignment)
IVEC=0x600
IDAR=1
IDSISR=1
+#ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
IKVM_REAL=1
+#endif
 INT_DEFINE_END(alignment)
 
 EXC_REAL_BEGIN(alignment, 0x600, 0x100)
@@ -1510,7 +1547,9 @@ EXC_COMMON_BEGIN(alignment_common)
  */
 INT_DEFINE_BEGIN(program_check)
IVEC=0x700
+#ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
IKVM_REAL=1
+#endif
 INT_DEFINE_END(program_check)
 
 EXC_REAL_BEGIN(program_check, 0x700, 0x100)
@@ -1573,7 +1612,9 @@ EXC_COMMON_BEGIN(program_check_common)
 INT_DEFINE_BEGIN(fp_unavailable)
IVEC=0x800
IRECONCILE=0
+#ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
IKVM_REAL=1
+#endif
 INT_DEFINE_END(fp_unavailable)
 
 EXC_REAL_BEGIN(fp_unavailable, 0x800, 0x100)
@@ -1635,7 +1676,9 @@ END_FTR_SECTION_IFSET(CPU_FTR_TM)
 INT_DEFINE_BEGIN(decrementer)
IVEC=0x900

[PATCH v2 19/20] powerpc/64s/exception: add more comments for interrupt handlers

2019-09-04 Thread Nicholas Piggin
A few of the non-standard handlers are left uncommented. Some more
description could be added to some.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 391 ---
 1 file changed, 353 insertions(+), 38 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 413876293659..5171ed0e5f68 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -121,26 +121,26 @@ name:
 /*
  * Interrupt code generation macros
  */
-#define IVEC   .L_IVEC_\name\()
-#define IHSRR  .L_IHSRR_\name\()
-#define IHSRR_IF_HVMODE.L_IHSRR_IF_HVMODE_\name\()
-#define IAREA  .L_IAREA_\name\()
-#define IVIRT  .L_IVIRT_\name\()
-#define IISIDE .L_IISIDE_\name\()
-#define IDAR   .L_IDAR_\name\()
-#define IDSISR .L_IDSISR_\name\()
-#define ISET_RI.L_ISET_RI_\name\()
-#define IBRANCH_TO_COMMON  .L_IBRANCH_TO_COMMON_\name\()
-#define IREALMODE_COMMON   .L_IREALMODE_COMMON_\name\()
-#define IMASK  .L_IMASK_\name\()
-#define IKVM_SKIP  .L_IKVM_SKIP_\name\()
-#define IKVM_REAL  .L_IKVM_REAL_\name\()
+#define IVEC   .L_IVEC_\name\()/* Interrupt vector address */
+#define IHSRR  .L_IHSRR_\name\()   /* Sets SRR or HSRR registers */
+#define IHSRR_IF_HVMODE.L_IHSRR_IF_HVMODE_\name\() /* HSRR if HV else SRR */
+#define IAREA  .L_IAREA_\name\()   /* PACA save area */
+#define IVIRT  .L_IVIRT_\name\()   /* Has virt mode entry point */
+#define IISIDE .L_IISIDE_\name\()  /* Uses SRR0/1 not DAR/DSISR */
+#define IDAR   .L_IDAR_\name\()/* Uses DAR (or SRR0) */
+#define IDSISR .L_IDSISR_\name\()  /* Uses DSISR (or SRR1) */
+#define ISET_RI.L_ISET_RI_\name\() /* Run common code w/ MSR[RI]=1 */
+#define IBRANCH_TO_COMMON  .L_IBRANCH_TO_COMMON_\name\() /* ENTRY branch to common */
+#define IREALMODE_COMMON   .L_IREALMODE_COMMON_\name\() /* Common runs in realmode */
+#define IMASK  .L_IMASK_\name\()   /* IRQ soft-mask bit */
+#define IKVM_SKIP  .L_IKVM_SKIP_\name\()   /* Generate KVM skip handler */
+#define IKVM_REAL  .L_IKVM_REAL_\name\()   /* Real entry tests KVM */
 #define __IKVM_REAL(name)  .L_IKVM_REAL_ ## name
-#define IKVM_VIRT  .L_IKVM_VIRT_\name\()
-#define ISTACK .L_ISTACK_\name\()
+#define IKVM_VIRT  .L_IKVM_VIRT_\name\()   /* Virt entry tests KVM */
+#define ISTACK .L_ISTACK_\name\()  /* Set regular kernel stack */
 #define __ISTACK(name) .L_ISTACK_ ## name
-#define IRECONCILE .L_IRECONCILE_\name\()
-#define IKUAP  .L_IKUAP_\name\()
+#define IRECONCILE .L_IRECONCILE_\name\()  /* Do RECONCILE_IRQ_STATE */
+#define IKUAP  .L_IKUAP_\name\()   /* Do KUAP lock */
 
 #define INT_DEFINE_BEGIN(n)\
 .macro int_define_ ## n name
@@ -751,6 +751,39 @@ __start_interrupts:
 EXC_VIRT_NONE(0x4000, 0x100)
 
 
+/**
+ * Interrupt 0x100 - System Reset Interrupt (SRESET aka NMI).
+ * This is a non-maskable, asynchronous interrupt always taken in real-mode.
+ * It is caused by:
+ * - Wake from power-saving state, on powernv.
+ * - An NMI from another CPU, triggered by firmware or hypercall.
+ * - As a crash/debug signal injected from BMC, firmware or hypervisor.
+ *
+ * Handling:
+ * Power-save wakeup is the only performance critical path, so this is
+ * determined as quickly as possible first. In this case volatile registers
+ * can be discarded and SPRs like CFAR don't need to be read.
+ *
+ * If not a powersave wakeup, then it's run as a regular interrupt, however
+ * it uses its own stack and PACA save area to preserve the regular kernel
+ * environment for debugging.
+ *
+ * This interrupt is not maskable, so triggering it when MSR[RI] is clear,
+ * or SCRATCH0 is in use, etc. may cause a crash. It's also not entirely
+ * correct to switch to virtual mode to run the regular interrupt handler
+ * because it might be interrupted when the MMU is in a bad state (e.g., SLB
+ * is clear).
+ *
+ * FWNMI:
+ * PAPR specifies a "fwnmi" facility which sends the sreset to a different
+ * entry point with a different register set up. Some hypervisors will
+ * send the sreset to 0x100 in the guest if it is not fwnmi capable.
+ *
+ * KVM:
+ * Unlike most SRR interrupts, this may be taken by the host while executing
+ * in a guest, so a KVM test is required. KVM will pull the CPU out of guest
+ * mode and then raise the sreset.
+ */
 INT_DEFINE_BEGIN(system_reset)
IVEC=0x100
IAREA=PACA_EXNMI
@@ -826,6 +859,7 @@ TRAMP_REAL_BEGIN(system_reset_idle_wake)
  * Vectors for the FWNMI option.  Share common code.
  */
 TRAMP_REAL_BEGIN(system_reset_fwnmi)
+   /* XXX: fwnmi guest could run a nested/PR guest, so why no test?  */
__IKVM_REAL(system_reset)=0

[PATCH v2 18/20] powerpc/64s/exception: Clean up SRR specifiers

2019-09-04 Thread Nicholas Piggin
Remove more magic numbers and replace with nicely named bools.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 68 +---
 1 file changed, 32 insertions(+), 36 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 696aa19592e2..413876293659 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -105,11 +105,6 @@ name:
ori reg,reg,(ABS_ADDR(label))@l;\
addis   reg,reg,(ABS_ADDR(label))@h
 
-/* Exception register prefixes */
-#define EXC_HV_OR_STD  2 /* depends on HVMODE */
-#define EXC_HV 1
-#define EXC_STD0
-
 /*
  * Branch to label using its 0xC000 address. This results in instruction
  * address suitable for MSR[IR]=0 or 1, which allows relocation to be turned
@@ -128,6 +123,7 @@ name:
  */
 #define IVEC   .L_IVEC_\name\()
 #define IHSRR  .L_IHSRR_\name\()
+#define IHSRR_IF_HVMODE.L_IHSRR_IF_HVMODE_\name\()
 #define IAREA  .L_IAREA_\name\()
 #define IVIRT  .L_IVIRT_\name\()
 #define IISIDE .L_IISIDE_\name\()
@@ -159,7 +155,10 @@ do_define_int n
.error "IVEC not defined"
.endif
.ifndef IHSRR
-   IHSRR=EXC_STD
+   IHSRR=0
+   .endif
+   .ifndef IHSRR_IF_HVMODE
+   IHSRR_IF_HVMODE=0
.endif
.ifndef IAREA
IAREA=PACA_EXGEN
@@ -257,7 +256,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
ld  r9,IAREA+EX_R9(r13)
ld  r10,IAREA+EX_R10(r13)
/* HSRR variants have the 0x2 bit added to their trap number */
-   .if IHSRR == EXC_HV_OR_STD
+   .if IHSRR_IF_HVMODE
BEGIN_FTR_SECTION
ori r12,r12,(IVEC + 0x2)
FTR_SECTION_ELSE
@@ -278,7 +277,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
ld  r10,IAREA+EX_R10(r13)
ld  r11,IAREA+EX_R11(r13)
ld  r12,IAREA+EX_R12(r13)
-   .if IHSRR == EXC_HV_OR_STD
+   .if IHSRR_IF_HVMODE
BEGIN_FTR_SECTION
b   kvmppc_skip_Hinterrupt
FTR_SECTION_ELSE
@@ -403,7 +402,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_CFAR)
stw r10,IAREA+EX_DSISR(r13)
.endif
 
-   .if IHSRR == EXC_HV_OR_STD
+   .if IHSRR_IF_HVMODE
BEGIN_FTR_SECTION
mfspr   r11,SPRN_HSRR0  /* save HSRR0 */
mfspr   r12,SPRN_HSRR1  /* and HSRR1 */
@@ -482,7 +481,7 @@ DEFINE_FIXED_SYMBOL(\name\()_common_virt)
.abort "Bad maskable vector"
.endif
 
-   .if IHSRR == EXC_HV_OR_STD
+   .if IHSRR_IF_HVMODE
BEGIN_FTR_SECTION
bne masked_Hinterrupt
FTR_SECTION_ELSE
@@ -610,12 +609,9 @@ END_FTR_SECTION_IFSET(CPU_FTR_CFAR)
  * Restore all registers including H/SRR0/1 saved in a stack frame of a
  * standard exception.
  */
-.macro EXCEPTION_RESTORE_REGS hsrr
+.macro EXCEPTION_RESTORE_REGS hsrr=0
/* Move original SRR0 and SRR1 into the respective regs */
ld  r9,_MSR(r1)
-   .if \hsrr == EXC_HV_OR_STD
-   .error "EXC_HV_OR_STD Not implemented for EXCEPTION_RESTORE_REGS"
-   .endif
.if \hsrr
mtspr   SPRN_HSRR1,r9
.else
@@ -890,7 +886,7 @@ EXC_COMMON_BEGIN(system_reset_common)
ld  r10,SOFTE(r1)
stb r10,PACAIRQSOFTMASK(r13)
 
-   EXCEPTION_RESTORE_REGS EXC_STD
+   EXCEPTION_RESTORE_REGS
RFI_TO_USER_OR_KERNEL
 
GEN_KVM system_reset
@@ -944,7 +940,7 @@ TRAMP_REAL_BEGIN(machine_check_fwnmi)
lhz r12,PACA_IN_MCE(r13);   \
subir12,r12,1;  \
sth r12,PACA_IN_MCE(r13);   \
-   EXCEPTION_RESTORE_REGS EXC_STD
+   EXCEPTION_RESTORE_REGS
 
 EXC_COMMON_BEGIN(machine_check_early_common)
/*
@@ -1313,7 +1309,7 @@ ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX)
 
 INT_DEFINE_BEGIN(hardware_interrupt)
IVEC=0x500
-   IHSRR=EXC_HV_OR_STD
+   IHSRR_IF_HVMODE=1
IMASK=IRQS_DISABLED
IKVM_REAL=1
IKVM_VIRT=1
@@ -1482,7 +1478,7 @@ EXC_COMMON_BEGIN(decrementer_common)
 
 INT_DEFINE_BEGIN(hdecrementer)
IVEC=0x980
-   IHSRR=EXC_HV
+   IHSRR=1
ISTACK=0
IRECONCILE=0
IKVM_REAL=1
@@ -1724,7 +1720,7 @@ EXC_COMMON_BEGIN(single_step_common)
 
 INT_DEFINE_BEGIN(h_data_storage)
IVEC=0xe00
-   IHSRR=EXC_HV
+   IHSRR=1
IDAR=1
IDSISR=1
IKVM_SKIP=1
@@ -1756,7 +1752,7 @@ ALT_MMU_FTR_SECTION_END_IFSET(MMU_FTR_TYPE_RADIX)
 
 INT_DEFINE_BEGIN(h_instr_storage)
IVEC=0xe20
-   IHSRR=EXC_HV
+   IHSRR=1
IKVM_REAL=1
IKVM_VIRT=1
 INT_DEFINE_END(h_instr_storage)
@@ -1779,7 +1775,7 @@ EXC_COMMON_BEGIN(h_instr_storage_common)
 
 INT_DEFINE_BEGIN(emulation_assist)
IVEC=0xe40
-   

[PATCH v2 17/20] powerpc/64s/exception: re-inline some handlers

2019-09-04 Thread Nicholas Piggin
The reduction in interrupt entry size allows some handlers to be
re-inlined.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 04359ff5d336..696aa19592e2 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1178,7 +1178,7 @@ INT_DEFINE_BEGIN(data_access)
 INT_DEFINE_END(data_access)
 
 EXC_REAL_BEGIN(data_access, 0x300, 0x80)
-   GEN_INT_ENTRY data_access, virt=0, ool=1
+   GEN_INT_ENTRY data_access, virt=0
 EXC_REAL_END(data_access, 0x300, 0x80)
 EXC_VIRT_BEGIN(data_access, 0x4300, 0x80)
GEN_INT_ENTRY data_access, virt=1
@@ -1208,7 +1208,7 @@ INT_DEFINE_BEGIN(data_access_slb)
 INT_DEFINE_END(data_access_slb)
 
 EXC_REAL_BEGIN(data_access_slb, 0x380, 0x80)
-   GEN_INT_ENTRY data_access_slb, virt=0, ool=1
+   GEN_INT_ENTRY data_access_slb, virt=0
 EXC_REAL_END(data_access_slb, 0x380, 0x80)
 EXC_VIRT_BEGIN(data_access_slb, 0x4380, 0x80)
GEN_INT_ENTRY data_access_slb, virt=1
@@ -1464,7 +1464,7 @@ INT_DEFINE_BEGIN(decrementer)
 INT_DEFINE_END(decrementer)
 
 EXC_REAL_BEGIN(decrementer, 0x900, 0x80)
-   GEN_INT_ENTRY decrementer, virt=0, ool=1
+   GEN_INT_ENTRY decrementer, virt=0
 EXC_REAL_END(decrementer, 0x900, 0x80)
 EXC_VIRT_BEGIN(decrementer, 0x4900, 0x80)
GEN_INT_ENTRY decrementer, virt=1
-- 
2.22.0



[PATCH v2 16/20] powerpc/64s/exception: hdecrementer avoid touching the stack

2019-09-04 Thread Nicholas Piggin
The hdec interrupt handler is reported to sometimes fire in Linux if
KVM leaves it pending after a guest exits. This is harmless, so there
is a no-op handler for it.

The interrupt handler currently uses the regular kernel stack. Change
this to avoid touching the stack entirely.

This should be the last place where the regular Linux stack can be
accessed with asynchronous interrupts (including PMI) soft-masked.
It might be possible to take advantage of this invariant, e.g., to
context switch the kernel stack SLB entry without clearing MSR[EE].

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/time.h  |  1 -
 arch/powerpc/kernel/exceptions-64s.S | 25 -
 arch/powerpc/kernel/time.c   |  9 -
 3 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index 5d78e2844384..39ce95016a3a 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -24,7 +24,6 @@ extern struct clock_event_device decrementer_clockevent;
 
 
 extern void generic_calibrate_decr(void);
-extern void hdec_interrupt(struct pt_regs *regs);
 
 /* Some sane defaults: 125 MHz timebase, 1GHz processor */
 extern unsigned long ppc_proc_freq;
diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 1bca009cd495..04359ff5d336 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1483,6 +1483,8 @@ EXC_COMMON_BEGIN(decrementer_common)
 INT_DEFINE_BEGIN(hdecrementer)
IVEC=0x980
IHSRR=EXC_HV
+   ISTACK=0
+   IRECONCILE=0
IKVM_REAL=1
IKVM_VIRT=1
 INT_DEFINE_END(hdecrementer)
@@ -1494,11 +1496,24 @@ EXC_VIRT_BEGIN(hdecrementer, 0x4980, 0x80)
GEN_INT_ENTRY hdecrementer, virt=1
 EXC_VIRT_END(hdecrementer, 0x4980, 0x80)
 EXC_COMMON_BEGIN(hdecrementer_common)
-   GEN_COMMON hdecrementer
-   bl  save_nvgprs
-   addir3,r1,STACK_FRAME_OVERHEAD
-   bl  hdec_interrupt
-   b   ret_from_except
+   __GEN_COMMON_ENTRY hdecrementer
+   /*
+* Hypervisor decrementer interrupts not caught by the KVM test
+* shouldn't occur but are sometimes left pending on exit from a KVM
+* guest.  We don't need to do anything to clear them, as they are
+* edge-triggered.
+*
+* Be careful to avoid touching the kernel stack.
+*/
+   ld  r10,PACA_EXGEN+EX_CTR(r13)
+   mtctr   r10
+   mtcrf   0x80,r9
+   ld  r9,PACA_EXGEN+EX_R9(r13)
+   ld  r10,PACA_EXGEN+EX_R10(r13)
+   ld  r11,PACA_EXGEN+EX_R11(r13)
+   ld  r12,PACA_EXGEN+EX_R12(r13)
+   ld  r13,PACA_EXGEN+EX_R13(r13)
+   HRFI_TO_KERNEL
 
GEN_KVM hdecrementer
 
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 694522308cd5..bebc8c440289 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -663,15 +663,6 @@ void timer_broadcast_interrupt(void)
 }
 #endif
 
-/*
- * Hypervisor decrementer interrupts shouldn't occur but are sometimes
- * left pending on exit from a KVM guest.  We don't need to do anything
- * to clear them, as they are edge-triggered.
- */
-void hdec_interrupt(struct pt_regs *regs)
-{
-}
-
 #ifdef CONFIG_SUSPEND
 static void generic_suspend_disable_irqs(void)
 {
-- 
2.22.0



[PATCH v2 15/20] powerpc/64s/exception: trim unused arguments from KVMTEST macro

2019-09-04 Thread Nicholas Piggin
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 2705fd84accd..1bca009cd495 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -224,7 +224,7 @@ do_define_int n
 #define kvmppc_interrupt kvmppc_interrupt_pr
 #endif
 
-.macro KVMTEST name, hsrr, n
+.macro KVMTEST name
lbz r10,HSTATE_IN_GUEST(r13)
cmpwi   r10,0
bne \name\()_kvm
@@ -293,7 +293,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
 .endm
 
 #else
-.macro KVMTEST name, hsrr, n
+.macro KVMTEST name
 .endm
 .macro GEN_KVM name
 .endm
@@ -437,7 +437,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_CFAR)
 DEFINE_FIXED_SYMBOL(\name\()_common_real)
 \name\()_common_real:
.if IKVM_REAL
-   KVMTEST \name IHSRR IVEC
+   KVMTEST \name
.endif
 
ld  r10,PACAKMSR(r13)   /* get MSR value for kernel */
@@ -452,7 +452,7 @@ DEFINE_FIXED_SYMBOL(\name\()_common_real)
 DEFINE_FIXED_SYMBOL(\name\()_common_virt)
 \name\()_common_virt:
.if IKVM_VIRT
-   KVMTEST \name IHSRR IVEC
+   KVMTEST \name
.endif
 
.if ISET_RI
@@ -1587,7 +1587,7 @@ INT_DEFINE_END(system_call)
GET_PACA(r13)
std r10,PACA_EXGEN+EX_R10(r13)
INTERRUPT_TO_KERNEL
-   KVMTEST system_call EXC_STD 0xc00 /* uses r10, branch to 
system_call_kvm */
+   KVMTEST system_call /* uses r10, branch to system_call_kvm */
mfctr   r9
 #else
mr  r9,r13
-- 
2.22.0



[PATCH v2 13/20] powerpc/64s/exception: remove confusing IEARLY option

2019-09-04 Thread Nicholas Piggin
Replace IEARLY=1 and IEARLY=2 with IBRANCH_TO_COMMON, which controls
whether the entry code branches to a common handler, and
IREALMODE_COMMON, which controls whether the common handler should
remain in real mode.

These special cases no longer avoid loading the SRR registers; there
is no point, as most of them load the registers immediately anyway.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 48 ++--
 1 file changed, 24 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 3bc3336182c7..c46e4911cff6 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -174,7 +174,8 @@ END_FTR_SECTION_NESTED(ftr,ftr,943)
 #define IDAR   .L_IDAR_\name\()
 #define IDSISR .L_IDSISR_\name\()
 #define ISET_RI.L_ISET_RI_\name\()
-#define IEARLY .L_IEARLY_\name\()
+#define IBRANCH_TO_COMMON  .L_IBRANCH_TO_COMMON_\name\()
+#define IREALMODE_COMMON   .L_IREALMODE_COMMON_\name\()
 #define IMASK  .L_IMASK_\name\()
 #define IKVM_SKIP  .L_IKVM_SKIP_\name\()
 #define IKVM_REAL  .L_IKVM_REAL_\name\()
@@ -218,8 +219,15 @@ do_define_int n
.ifndef ISET_RI
ISET_RI=1
.endif
-   .ifndef IEARLY
-   IEARLY=0
+   .ifndef IBRANCH_TO_COMMON
+   IBRANCH_TO_COMMON=1
+   .endif
+   .ifndef IREALMODE_COMMON
+   IREALMODE_COMMON=0
+   .else
+   .if ! IBRANCH_TO_COMMON
+   .error "IREALMODE_COMMON=1 but IBRANCH_TO_COMMON=0"
+   .endif
.endif
.ifndef IMASK
IMASK=0
@@ -353,6 +361,11 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
  */
 
 .macro GEN_BRANCH_TO_COMMON name, virt
+   .if IREALMODE_COMMON
+   LOAD_HANDLER(r10, \name\()_common)
+   mtctr   r10
+   bctr
+   .else
.if \virt
 #ifndef CONFIG_RELOCATABLE
b   \name\()_common_virt
@@ -366,6 +379,7 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
mtctr   r10
bctr
.endif
+   .endif
 .endm
 
 .macro GEN_INT_ENTRY name, virt, ool=0
@@ -421,11 +435,6 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
stw r10,IAREA+EX_DSISR(r13)
.endif
 
-   .if IEARLY == 2
-   /* nothing more */
-   .elseif IEARLY
-   BRANCH_TO_C000(r11, \name\()_common)
-   .else
.if IHSRR == EXC_HV_OR_STD
BEGIN_FTR_SECTION
mfspr   r11,SPRN_HSRR0  /* save HSRR0 */
@@ -441,6 +450,8 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
mfspr   r11,SPRN_SRR0   /* save SRR0 */
mfspr   r12,SPRN_SRR1   /* and SRR1 */
.endif
+
+   .if IBRANCH_TO_COMMON
GEN_BRANCH_TO_COMMON \name \virt
.endif
 
@@ -918,6 +929,7 @@ INT_DEFINE_BEGIN(machine_check_early)
IVEC=0x200
IAREA=PACA_EXMC
IVIRT=0 /* no virt entry point */
+   IREALMODE_COMMON=1
/*
 * MSR_RI is not enabled, because PACA_EXMC is being used, so a
 * nested machine check corrupts it. machine_check_common enables
@@ -925,7 +937,6 @@ INT_DEFINE_BEGIN(machine_check_early)
 */
ISET_RI=0
ISTACK=0
-   IEARLY=1
IDAR=1
IDSISR=1
IRECONCILE=0
@@ -965,9 +976,6 @@ TRAMP_REAL_BEGIN(machine_check_fwnmi)
EXCEPTION_RESTORE_REGS EXC_STD
 
 EXC_COMMON_BEGIN(machine_check_early_common)
-   mfspr   r11,SPRN_SRR0
-   mfspr   r12,SPRN_SRR1
-
/*
 * Switch to mc_emergency stack and handle re-entrancy (we limit
 * the nested MCE upto level 4 to avoid stack overflow).
@@ -1814,7 +1822,7 @@ EXC_COMMON_BEGIN(emulation_assist_common)
 INT_DEFINE_BEGIN(hmi_exception_early)
IVEC=0xe60
IHSRR=EXC_HV
-   IEARLY=1
+   IREALMODE_COMMON=1
ISTACK=0
IRECONCILE=0
IKUAP=0 /* We don't touch AMR here, we never go to virtual mode */
@@ -1834,8 +1842,6 @@ EXC_REAL_END(hmi_exception, 0xe60, 0x20)
 EXC_VIRT_NONE(0x4e60, 0x20)
 
 EXC_COMMON_BEGIN(hmi_exception_early_common)
-   mfspr   r11,SPRN_HSRR0  /* Save HSRR0 */
-   mfspr   r12,SPRN_HSRR1  /* Save HSRR1 */
mr  r10,r1  /* Save r1 */
ld  r1,PACAEMERGSP(r13) /* Use emergency stack for realmode */
subir1,r1,INT_FRAME_SIZE/* alloc stack frame*/
@@ -2161,29 +2167,23 @@ EXC_VIRT_NONE(0x5400, 0x100)
 INT_DEFINE_BEGIN(denorm_exception)
IVEC=0x1500
IHSRR=EXC_HV
-   IEARLY=2
+   IBRANCH_TO_COMMON=0
IKVM_REAL=1
 INT_DEFINE_END(denorm_exception)
 
 EXC_REAL_BEGIN(denorm_exception, 0x1500, 0x100)
GEN_INT_ENTRY denorm_exception, virt=0
 #ifdef CONFIG_PPC_DENORMALISATION
-   mfspr   r10,SPRN_HSRR1
-   andis.  r10,r10,(HSRR1_DENORM)@h /* 

[PATCH v2 14/20] powerpc/64s/exception: remove the SPR saving patch code macros

2019-09-04 Thread Nicholas Piggin
These are used infrequently enough that they don't provide much help,
so inline them.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 82 ++--
 1 file changed, 28 insertions(+), 54 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index c46e4911cff6..2705fd84accd 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -110,46 +110,6 @@ name:
 #define EXC_HV 1
 #define EXC_STD0
 
-/*
- * PPR save/restore macros used in exceptions-64s.S
- * Used for P7 or later processors
- */
-#define SAVE_PPR(area, ra) \
-BEGIN_FTR_SECTION_NESTED(940)  \
-   ld  ra,area+EX_PPR(r13);/* Read PPR from paca */\
-   std ra,_PPR(r1);\
-END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,940)
-
-#define RESTORE_PPR_PACA(area, ra) \
-BEGIN_FTR_SECTION_NESTED(941)  \
-   ld  ra,area+EX_PPR(r13);\
-   mtspr   SPRN_PPR,ra;\
-END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,941)
-
-/*
- * Get an SPR into a register if the CPU has the given feature
- */
-#define OPT_GET_SPR(ra, spr, ftr)  \
-BEGIN_FTR_SECTION_NESTED(943)  \
-   mfspr   ra,spr; \
-END_FTR_SECTION_NESTED(ftr,ftr,943)
-
-/*
- * Set an SPR from a register if the CPU has the given feature
- */
-#define OPT_SET_SPR(ra, spr, ftr)  \
-BEGIN_FTR_SECTION_NESTED(943)  \
-   mtspr   spr,ra; \
-END_FTR_SECTION_NESTED(ftr,ftr,943)
-
-/*
- * Save a register to the PACA if the CPU has the given feature
- */
-#define OPT_SAVE_REG_TO_PACA(offset, ra, ftr)  \
-BEGIN_FTR_SECTION_NESTED(943)  \
-   std ra,offset(r13); \
-END_FTR_SECTION_NESTED(ftr,ftr,943)
-
 /*
  * Branch to label using its 0xC000 address. This results in instruction
  * address suitable for MSR[IR]=0 or 1, which allows relocation to be turned
@@ -278,18 +238,18 @@ do_define_int n
cmpwi   r10,KVM_GUEST_MODE_SKIP
beq 89f
.else
-BEGIN_FTR_SECTION_NESTED(947)
+BEGIN_FTR_SECTION
ld  r10,IAREA+EX_CFAR(r13)
std r10,HSTATE_CFAR(r13)
-END_FTR_SECTION_NESTED(CPU_FTR_CFAR,CPU_FTR_CFAR,947)
+END_FTR_SECTION_IFSET(CPU_FTR_CFAR)
.endif
 
ld  r10,PACA_EXGEN+EX_CTR(r13)
mtctr   r10
-BEGIN_FTR_SECTION_NESTED(948)
+BEGIN_FTR_SECTION
ld  r10,IAREA+EX_PPR(r13)
std r10,HSTATE_PPR(r13)
-END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
+END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
ld  r11,IAREA+EX_R11(r13)
ld  r12,IAREA+EX_R12(r13)
std r12,HSTATE_SCRATCH0(r13)
@@ -386,10 +346,14 @@ 
END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
SET_SCRATCH0(r13)   /* save r13 */
GET_PACA(r13)
std r9,IAREA+EX_R9(r13) /* save r9 */
-   OPT_GET_SPR(r9, SPRN_PPR, CPU_FTR_HAS_PPR)
+BEGIN_FTR_SECTION
+   mfspr   r9,SPRN_PPR
+END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
HMT_MEDIUM
std r10,IAREA+EX_R10(r13)   /* save r10 - r12 */
-   OPT_GET_SPR(r10, SPRN_CFAR, CPU_FTR_CFAR)
+BEGIN_FTR_SECTION
+   mfspr   r10,SPRN_CFAR
+END_FTR_SECTION_IFSET(CPU_FTR_CFAR)
.if \ool
.if !\virt
b   tramp_real_\name
@@ -402,8 +366,12 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
.endif
.endif
 
-   OPT_SAVE_REG_TO_PACA(IAREA+EX_PPR, r9, CPU_FTR_HAS_PPR)
-   OPT_SAVE_REG_TO_PACA(IAREA+EX_CFAR, r10, CPU_FTR_CFAR)
+BEGIN_FTR_SECTION
+   std r9,IAREA+EX_PPR(r13)
+END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
+BEGIN_FTR_SECTION
+   std r10,IAREA+EX_CFAR(r13)
+END_FTR_SECTION_IFSET(CPU_FTR_CFAR)
INTERRUPT_TO_KERNEL
mfctr   r10
std r10,IAREA+EX_CTR(r13)
@@ -550,7 +518,10 @@ DEFINE_FIXED_SYMBOL(\name\()_common_virt)
.endif
beq 101f/* if from kernel mode  */
ACCOUNT_CPU_USER_ENTRY(r13, r9, r10)
-   SAVE_PPR(IAREA, r9)
+BEGIN_FTR_SECTION
+   ld  r9,IAREA+EX_PPR(r13)/* Read PPR from paca   */
+   std r9,_PPR(r1)
+END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
 101:
.else
.if IKUAP
@@ -590,10 +561,10 @@ DEFINE_FIXED_SYMBOL(\name\()_common_virt)
std r10,_DSISR(r1)
.endif
 

[PATCH v2 12/20] powerpc/64s/exception: move KVM test to common code

2019-09-04 Thread Nicholas Piggin
This allows more code to be moved out of unrelocated regions. The system
call KVMTEST is changed to be open-coded and remain in the tramp area to
avoid having to move it to entry_64.S. The custom nature of the system
call entry code means the hcall case can be made more streamlined than
regular interrupt handlers.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S| 235 
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |  11 --
 arch/powerpc/kvm/book3s_segment.S   |   7 -
 3 files changed, 115 insertions(+), 138 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index af41de2dbc75..3bc3336182c7 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -44,7 +44,6 @@
  * EXC_VIRT_BEGIN/END  - virt (AIL), unrelocated exception vectors
  * TRAMP_REAL_BEGIN- real, unrelocated helpers (virt may call these)
  * TRAMP_VIRT_BEGIN- virt, unreloc helpers (in practice, real can use)
- * TRAMP_KVM_BEGIN - KVM handlers, these are put into real, unrelocated
  * EXC_COMMON  - After switching to virtual, relocated mode.
  */
 
@@ -74,13 +73,6 @@ name:
 #define TRAMP_VIRT_BEGIN(name) \
FIXED_SECTION_ENTRY_BEGIN(virt_trampolines, name)
 
-#ifdef CONFIG_KVM_BOOK3S_64_HANDLER
-#define TRAMP_KVM_BEGIN(name)  \
-   TRAMP_VIRT_BEGIN(name)
-#else
-#define TRAMP_KVM_BEGIN(name)
-#endif
-
 #define EXC_REAL_NONE(start, size) \
FIXED_SECTION_ENTRY_BEGIN_LOCATION(real_vectors, 
exc_real_##start##_##unused, start, size); \
FIXED_SECTION_ENTRY_END_LOCATION(real_vectors, 
exc_real_##start##_##unused, start, size)
@@ -271,6 +263,9 @@ do_define_int n
 .endm
 
 .macro GEN_KVM name
+   .balign IFETCH_ALIGN_BYTES
+\name\()_kvm:
+
.if IKVM_SKIP
cmpwi   r10,KVM_GUEST_MODE_SKIP
beq 89f
@@ -281,13 +276,18 @@ BEGIN_FTR_SECTION_NESTED(947)
 END_FTR_SECTION_NESTED(CPU_FTR_CFAR,CPU_FTR_CFAR,947)
.endif
 
+   ld  r10,PACA_EXGEN+EX_CTR(r13)
+   mtctr   r10
 BEGIN_FTR_SECTION_NESTED(948)
ld  r10,IAREA+EX_PPR(r13)
std r10,HSTATE_PPR(r13)
 END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
-   ld  r10,IAREA+EX_R10(r13)
+   ld  r11,IAREA+EX_R11(r13)
+   ld  r12,IAREA+EX_R12(r13)
std r12,HSTATE_SCRATCH0(r13)
sldir12,r9,32
+   ld  r9,IAREA+EX_R9(r13)
+   ld  r10,IAREA+EX_R10(r13)
/* HSRR variants have the 0x2 bit added to their trap number */
.if IHSRR == EXC_HV_OR_STD
BEGIN_FTR_SECTION
@@ -300,29 +300,16 @@ 
END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
.else
ori r12,r12,(IVEC)
.endif
-
-#ifdef CONFIG_RELOCATABLE
-   /*
-* KVM requires __LOAD_FAR_HANDLER beause kvmppc_interrupt lives
-* outside the head section. CONFIG_RELOCATABLE KVM expects CTR
-* to be saved in HSTATE_SCRATCH1.
-*/
-   ld  r9,IAREA+EX_CTR(r13)
-   std r9,HSTATE_SCRATCH1(r13)
-   __LOAD_FAR_HANDLER(r9, kvmppc_interrupt)
-   mtctr   r9
-   ld  r9,IAREA+EX_R9(r13)
-   bctr
-#else
-   ld  r9,IAREA+EX_R9(r13)
b   kvmppc_interrupt
-#endif
-
 
.if IKVM_SKIP
 89:mtocrf  0x80,r9
+   ld  r10,PACA_EXGEN+EX_CTR(r13)
+   mtctr   r10
ld  r9,IAREA+EX_R9(r13)
ld  r10,IAREA+EX_R10(r13)
+   ld  r11,IAREA+EX_R11(r13)
+   ld  r12,IAREA+EX_R12(r13)
.if IHSRR == EXC_HV_OR_STD
BEGIN_FTR_SECTION
b   kvmppc_skip_Hinterrupt
@@ -407,11 +394,6 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
mfctr   r10
std r10,IAREA+EX_CTR(r13)
mfcrr9
-
-   .if (!\virt && IKVM_REAL) || (\virt && IKVM_VIRT)
-   KVMTEST \name IHSRR IVEC
-   .endif
-
std r11,IAREA+EX_R11(r13)
std r12,IAREA+EX_R12(r13)
 
@@ -475,6 +457,10 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
 .macro __GEN_COMMON_ENTRY name
 DEFINE_FIXED_SYMBOL(\name\()_common_real)
 \name\()_common_real:
+   .if IKVM_REAL
+   KVMTEST \name IHSRR IVEC
+   .endif
+
ld  r10,PACAKMSR(r13)   /* get MSR value for kernel */
.if ! ISET_RI
xorir10,r10,MSR_RI  /* Clear MSR_RI */
@@ -486,6 +472,10 @@ DEFINE_FIXED_SYMBOL(\name\()_common_real)
.balign IFETCH_ALIGN_BYTES
 DEFINE_FIXED_SYMBOL(\name\()_common_virt)
 \name\()_common_virt:
+   .if IKVM_VIRT
+   KVMTEST \name IHSRR IVEC
+   .endif
+
.if ISET_RI
li  r10,MSR_RI
mtmsrd  r10,1   /* Set MSR_RI */
@@ -844,8 +834,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
 */
 EXC_REAL_END(system_reset, 0x100, 0x100)
 

[PATCH v2 11/20] powerpc/64s/exception: move soft-mask test to common code

2019-09-04 Thread Nicholas Piggin
As well as moving code out of the unrelocated vectors, this allows the
masked handlers to be moved to common code, and allows the soft_nmi
handler to be generated more like a regular handler.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 106 +--
 1 file changed, 49 insertions(+), 57 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 473ba1fa7bbe..af41de2dbc75 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -411,36 +411,6 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
.if (!\virt && IKVM_REAL) || (\virt && IKVM_VIRT)
KVMTEST \name IHSRR IVEC
.endif
-   .if IMASK
-   lbz r10,PACAIRQSOFTMASK(r13)
-   andi.   r10,r10,IMASK
-   /* Associate vector numbers with bits in paca->irq_happened */
-   .if IVEC == 0x500 || IVEC == 0xea0
-   li  r10,PACA_IRQ_EE
-   .elseif IVEC == 0x900
-   li  r10,PACA_IRQ_DEC
-   .elseif IVEC == 0xa00 || IVEC == 0xe80
-   li  r10,PACA_IRQ_DBELL
-   .elseif IVEC == 0xe60
-   li  r10,PACA_IRQ_HMI
-   .elseif IVEC == 0xf00
-   li  r10,PACA_IRQ_PMI
-   .else
-   .abort "Bad maskable vector"
-   .endif
-
-   .if IHSRR == EXC_HV_OR_STD
-   BEGIN_FTR_SECTION
-   bne masked_Hinterrupt
-   FTR_SECTION_ELSE
-   bne masked_interrupt
-   ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
-   .elseif IHSRR
-   bne masked_Hinterrupt
-   .else
-   bne masked_interrupt
-   .endif
-   .endif
 
std r11,IAREA+EX_R11(r13)
std r12,IAREA+EX_R12(r13)
@@ -525,6 +495,37 @@ DEFINE_FIXED_SYMBOL(\name\()_common_virt)
 .endm
 
 .macro __GEN_COMMON_BODY name
+   .if IMASK
+   lbz r10,PACAIRQSOFTMASK(r13)
+   andi.   r10,r10,IMASK
+   /* Associate vector numbers with bits in paca->irq_happened */
+   .if IVEC == 0x500 || IVEC == 0xea0
+   li  r10,PACA_IRQ_EE
+   .elseif IVEC == 0x900
+   li  r10,PACA_IRQ_DEC
+   .elseif IVEC == 0xa00 || IVEC == 0xe80
+   li  r10,PACA_IRQ_DBELL
+   .elseif IVEC == 0xe60
+   li  r10,PACA_IRQ_HMI
+   .elseif IVEC == 0xf00
+   li  r10,PACA_IRQ_PMI
+   .else
+   .abort "Bad maskable vector"
+   .endif
+
+   .if IHSRR == EXC_HV_OR_STD
+   BEGIN_FTR_SECTION
+   bne masked_Hinterrupt
+   FTR_SECTION_ELSE
+   bne masked_interrupt
+   ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
+   .elseif IHSRR
+   bne masked_Hinterrupt
+   .else
+   bne masked_interrupt
+   .endif
+   .endif
+
.if ISTACK
andi.   r10,r12,MSR_PR  /* See if coming from user  */
mr  r10,r1  /* Save r1  */
@@ -2339,18 +2340,10 @@ EXC_VIRT_NONE(0x5800, 0x100)
 
 #ifdef CONFIG_PPC_WATCHDOG
 
-#define MASKED_DEC_HANDLER_LABEL 3f
-
-#define MASKED_DEC_HANDLER(_H) \
-3: /* soft-nmi */  \
-   std r12,PACA_EXGEN+EX_R12(r13); \
-   GET_SCRATCH0(r10);  \
-   std r10,PACA_EXGEN+EX_R13(r13); \
-   mfspr   r11,SPRN_SRR0;  /* save SRR0 */ \
-   mfspr   r12,SPRN_SRR1;  /* and SRR1 */  \
-   LOAD_HANDLER(r10, soft_nmi_common); \
-   mtctr   r10;\
-   bctr
+INT_DEFINE_BEGIN(soft_nmi)
+   IVEC=0x900
+   ISTACK=0
+INT_DEFINE_END(soft_nmi)
 
 /*
  * Branch to soft_nmi_interrupt using the emergency stack. The emergency
@@ -2362,19 +2355,16 @@ EXC_VIRT_NONE(0x5800, 0x100)
  * and run it entirely with interrupts hard disabled.
  */
 EXC_COMMON_BEGIN(soft_nmi_common)
+   mfspr   r11,SPRN_SRR0
mr  r10,r1
ld  r1,PACAEMERGSP(r13)
subir1,r1,INT_FRAME_SIZE
-   __ISTACK(decrementer)=0
-   __GEN_COMMON_BODY decrementer
+   __GEN_COMMON_BODY soft_nmi
bl  save_nvgprs
addir3,r1,STACK_FRAME_OVERHEAD
bl  soft_nmi_interrupt
b   ret_from_except
 
-#else /* CONFIG_PPC_WATCHDOG */
-#define MASKED_DEC_HANDLER_LABEL 2f /* normal return */
-#define MASKED_DEC_HANDLER(_H)
 #endif /* CONFIG_PPC_WATCHDOG */
 
 /*
@@ -2393,7 +2383,6 @@ masked_Hinterrupt:
.else
 masked_interrupt:

[PATCH v2 10/20] powerpc/64s/exception: move real->virt switch into the common handler

2019-09-04 Thread Nicholas Piggin
The real mode interrupt entry points currently use rfid to branch to
the common handler in virtual mode. This is a significant amount of
code, and forces other code (notably the KVM test) to live in the
real mode handler.

In the interest of minimising the amount of code that runs unrelocated,
move the switch to virt mode into the common code, and do it with
mtmsrd, which avoids clobbering SRRs (although the post-KVMTEST
performance of real-mode interrupt handlers is not a big concern these
days).

This requires CTR to always be saved (real-mode needs to reach 0xc...),
but that's not a huge impact these days. It could be optimized away in
future.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/exception-64s.h |   4 -
 arch/powerpc/kernel/exceptions-64s.S | 247 ++-
 2 files changed, 105 insertions(+), 146 deletions(-)

diff --git a/arch/powerpc/include/asm/exception-64s.h 
b/arch/powerpc/include/asm/exception-64s.h
index 33f4f72eb035..47bd4ea0837d 100644
--- a/arch/powerpc/include/asm/exception-64s.h
+++ b/arch/powerpc/include/asm/exception-64s.h
@@ -33,11 +33,7 @@
 #include 
 
 /* PACA save area size in u64 units (exgen, exmc, etc) */
-#if defined(CONFIG_RELOCATABLE)
 #define EX_SIZE10
-#else
-#define EX_SIZE9
-#endif
 
 /*
  * maximum recursive depth of MCE exceptions
diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index b8588618cdc3..473ba1fa7bbe 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -32,16 +32,10 @@
 #define EX_CCR 52
 #define EX_CFAR56
 #define EX_PPR 64
-#if defined(CONFIG_RELOCATABLE)
 #define EX_CTR 72
 .if EX_SIZE != 10
.error "EX_SIZE is wrong"
 .endif
-#else
-.if EX_SIZE != 9
-   .error "EX_SIZE is wrong"
-.endif
-#endif
 
 /*
  * Following are fixed section helper macros.
@@ -124,22 +118,6 @@ name:
 #define EXC_HV 1
 #define EXC_STD0
 
-#if defined(CONFIG_RELOCATABLE)
-/*
- * If we support interrupts with relocation on AND we're a relocatable kernel,
- * we need to use CTR to get to the 2nd level handler.  So, save/restore it
- * when required.
- */
-#define SAVE_CTR(reg, area)mfctr   reg ;   std reg,area+EX_CTR(r13)
-#define GET_CTR(reg, area) ld  reg,area+EX_CTR(r13)
-#define RESTORE_CTR(reg, area) ld  reg,area+EX_CTR(r13) ; mtctr reg
-#else
-/* ...else CTR is unused and in register. */
-#define SAVE_CTR(reg, area)
-#define GET_CTR(reg, area) mfctr   reg
-#define RESTORE_CTR(reg, area)
-#endif
-
 /*
  * PPR save/restore macros used in exceptions-64s.S
  * Used for P7 or later processors
@@ -199,6 +177,7 @@ END_FTR_SECTION_NESTED(ftr,ftr,943)
 #define IVEC   .L_IVEC_\name\()
 #define IHSRR  .L_IHSRR_\name\()
 #define IAREA  .L_IAREA_\name\()
+#define IVIRT  .L_IVIRT_\name\()
 #define IISIDE .L_IISIDE_\name\()
 #define IDAR   .L_IDAR_\name\()
 #define IDSISR .L_IDSISR_\name\()
@@ -232,6 +211,9 @@ do_define_int n
.ifndef IAREA
IAREA=PACA_EXGEN
.endif
+   .ifndef IVIRT
+   IVIRT=1
+   .endif
.ifndef IISIDE
IISIDE=0
.endif
@@ -325,7 +307,7 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
 * outside the head section. CONFIG_RELOCATABLE KVM expects CTR
 * to be saved in HSTATE_SCRATCH1.
 */
-   mfctr   r9
+   ld  r9,IAREA+EX_CTR(r13)
std r9,HSTATE_SCRATCH1(r13)
__LOAD_FAR_HANDLER(r9, kvmppc_interrupt)
mtctr   r9
@@ -362,101 +344,6 @@ 
END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
 .endm
 #endif
 
-.macro INT_SAVE_SRR_AND_JUMP label, hsrr, set_ri
-   ld  r10,PACAKMSR(r13)   /* get MSR value for kernel */
-   .if ! \set_ri
-   xorir10,r10,MSR_RI  /* Clear MSR_RI */
-   .endif
-   .if \hsrr == EXC_HV_OR_STD
-   BEGIN_FTR_SECTION
-   mfspr   r11,SPRN_HSRR0  /* save HSRR0 */
-   mfspr   r12,SPRN_HSRR1  /* and HSRR1 */
-   mtspr   SPRN_HSRR1,r10
-   FTR_SECTION_ELSE
-   mfspr   r11,SPRN_SRR0   /* save SRR0 */
-   mfspr   r12,SPRN_SRR1   /* and SRR1 */
-   mtspr   SPRN_SRR1,r10
-   ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
-   .elseif \hsrr
-   mfspr   r11,SPRN_HSRR0  /* save HSRR0 */
-   mfspr   r12,SPRN_HSRR1  /* and HSRR1 */
-   mtspr   SPRN_HSRR1,r10
-   .else
-   mfspr   r11,SPRN_SRR0   /* save SRR0 */
-   mfspr   r12,SPRN_SRR1   /* and SRR1 */
-   mtspr   SPRN_SRR1,r10
-   .endif
-   LOAD_HANDLER(r10, \label\())
-   .if \hsrr == EXC_HV_OR_STD
-   BEGIN_FTR_SECTION
-   mtspr   SPRN_HSRR0,r10
-   HRFI_TO_KERNEL
-   FTR_SECTION_ELSE
-   mtspr   SPRN_SRR0,r10
-   

[PATCH v2 09/20] powerpc/64s/exception: Add ISIDE option

2019-09-04 Thread Nicholas Piggin
Rather than using IDAR=2/IDSISR=2 to select the i-side registers, add
an explicit IISIDE option.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 23 ---
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index bef0c2eee7dc..b8588618cdc3 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -199,6 +199,7 @@ END_FTR_SECTION_NESTED(ftr,ftr,943)
 #define IVEC   .L_IVEC_\name\()
 #define IHSRR  .L_IHSRR_\name\()
 #define IAREA  .L_IAREA_\name\()
+#define IISIDE .L_IISIDE_\name\()
 #define IDAR   .L_IDAR_\name\()
 #define IDSISR .L_IDSISR_\name\()
 #define ISET_RI.L_ISET_RI_\name\()
@@ -231,6 +232,9 @@ do_define_int n
.ifndef IAREA
IAREA=PACA_EXGEN
.endif
+   .ifndef IISIDE
+   IISIDE=0
+   .endif
.ifndef IDAR
IDAR=0
.endif
@@ -542,7 +546,7 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
 */
GET_SCRATCH0(r10)
std r10,IAREA+EX_R13(r13)
-   .if IDAR == 1
+   .if IDAR && !IISIDE
.if IHSRR
mfspr   r10,SPRN_HDAR
.else
@@ -550,7 +554,7 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
.endif
std r10,IAREA+EX_DAR(r13)
.endif
-   .if IDSISR == 1
+   .if IDSISR && !IISIDE
.if IHSRR
mfspr   r10,SPRN_HDSISR
.else
@@ -625,16 +629,18 @@ 
END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
std r9,GPR11(r1)
std r10,GPR12(r1)
std r11,GPR13(r1)
+
.if IDAR
-   .if IDAR == 2
+   .if IISIDE
ld  r10,_NIP(r1)
.else
ld  r10,IAREA+EX_DAR(r13)
.endif
std r10,_DAR(r1)
.endif
+
.if IDSISR
-   .if IDSISR == 2
+   .if IISIDE
ld  r10,_MSR(r1)
lis r11,DSISR_SRR1_MATCH_64S@h
and r10,r10,r11
@@ -643,6 +649,7 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
.endif
std r10,_DSISR(r1)
.endif
+
 BEGIN_FTR_SECTION_NESTED(66)
ld  r10,IAREA+EX_CFAR(r13)
std r10,ORIG_GPR3(r1)
@@ -1311,8 +1318,9 @@ ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX)
 
 INT_DEFINE_BEGIN(instruction_access)
IVEC=0x400
-   IDAR=2
-   IDSISR=2
+   IISIDE=1
+   IDAR=1
+   IDSISR=1
IKVM_REAL=1
 INT_DEFINE_END(instruction_access)
 
@@ -1341,7 +1349,8 @@ INT_DEFINE_BEGIN(instruction_access_slb)
IVEC=0x480
IAREA=PACA_EXSLB
IRECONCILE=0
-   IDAR=2
+   IISIDE=1
+   IDAR=1
IKVM_REAL=1
 INT_DEFINE_END(instruction_access_slb)
 
-- 
2.22.0



[PATCH v2 08/20] powerpc/64s/exception: Remove old INT_KVM_HANDLER

2019-09-04 Thread Nicholas Piggin
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 55 +---
 1 file changed, 26 insertions(+), 29 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index f318869607db..bef0c2eee7dc 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -266,15 +266,6 @@ do_define_int n
.endif
 .endm
 
-.macro INT_KVM_HANDLER name, vec, hsrr, area, skip
-   TRAMP_KVM_BEGIN(\name\()_kvm)
-   KVM_HANDLER \vec, \hsrr, \area, \skip
-.endm
-
-.macro GEN_KVM name
-   KVM_HANDLER IVEC, IHSRR, IAREA, IKVM_SKIP
-.endm
-
 #ifdef CONFIG_KVM_BOOK3S_64_HANDLER
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
 /*
@@ -293,35 +284,35 @@ do_define_int n
bne \name\()_kvm
 .endm
 
-.macro KVM_HANDLER vec, hsrr, area, skip
-   .if \skip
+.macro GEN_KVM name
+   .if IKVM_SKIP
cmpwi   r10,KVM_GUEST_MODE_SKIP
beq 89f
.else
 BEGIN_FTR_SECTION_NESTED(947)
-   ld  r10,\area+EX_CFAR(r13)
+   ld  r10,IAREA+EX_CFAR(r13)
std r10,HSTATE_CFAR(r13)
 END_FTR_SECTION_NESTED(CPU_FTR_CFAR,CPU_FTR_CFAR,947)
.endif
 
 BEGIN_FTR_SECTION_NESTED(948)
-   ld  r10,\area+EX_PPR(r13)
+   ld  r10,IAREA+EX_PPR(r13)
std r10,HSTATE_PPR(r13)
 END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
-   ld  r10,\area+EX_R10(r13)
+   ld  r10,IAREA+EX_R10(r13)
std r12,HSTATE_SCRATCH0(r13)
sldir12,r9,32
/* HSRR variants have the 0x2 bit added to their trap number */
-   .if \hsrr == EXC_HV_OR_STD
+   .if IHSRR == EXC_HV_OR_STD
BEGIN_FTR_SECTION
-   ori r12,r12,(\vec + 0x2)
+   ori r12,r12,(IVEC + 0x2)
FTR_SECTION_ELSE
-   ori r12,r12,(\vec)
+   ori r12,r12,(IVEC)
ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
-   .elseif \hsrr
-   ori r12,r12,(\vec + 0x2)
+   .elseif IHSRR
+   ori r12,r12,(IVEC+ 0x2)
.else
-   ori r12,r12,(\vec)
+   ori r12,r12,(IVEC)
.endif
 
 #ifdef CONFIG_RELOCATABLE
@@ -334,25 +325,25 @@ 
END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
std r9,HSTATE_SCRATCH1(r13)
__LOAD_FAR_HANDLER(r9, kvmppc_interrupt)
mtctr   r9
-   ld  r9,\area+EX_R9(r13)
+   ld  r9,IAREA+EX_R9(r13)
bctr
 #else
-   ld  r9,\area+EX_R9(r13)
+   ld  r9,IAREA+EX_R9(r13)
b   kvmppc_interrupt
 #endif
 
 
-   .if \skip
+   .if IKVM_SKIP
 89:mtocrf  0x80,r9
-   ld  r9,\area+EX_R9(r13)
-   ld  r10,\area+EX_R10(r13)
-   .if \hsrr == EXC_HV_OR_STD
+   ld  r9,IAREA+EX_R9(r13)
+   ld  r10,IAREA+EX_R10(r13)
+   .if IHSRR == EXC_HV_OR_STD
BEGIN_FTR_SECTION
b   kvmppc_skip_Hinterrupt
FTR_SECTION_ELSE
b   kvmppc_skip_interrupt
ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
-   .elseif \hsrr
+   .elseif IHSRR
b   kvmppc_skip_Hinterrupt
.else
b   kvmppc_skip_interrupt
@@ -363,7 +354,7 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
 #else
 .macro KVMTEST name, hsrr, n
 .endm
-.macro KVM_HANDLER name, vec, hsrr, area, skip
+.macro GEN_KVM name
 .endm
 #endif
 
@@ -1640,6 +1631,12 @@ EXC_VIRT_NONE(0x4b00, 0x100)
  * without saving, though xer is not a good idea to use, as hardware may
  * interpret some bits so it may be costly to change them.
  */
+INT_DEFINE_BEGIN(system_call)
+   IVEC=0xc00
+   IKVM_REAL=1
+   IKVM_VIRT=1
+INT_DEFINE_END(system_call)
+
 .macro SYSTEM_CALL virt
 #ifdef CONFIG_KVM_BOOK3S_64_HANDLER
/*
@@ -1733,7 +1730,7 @@ TRAMP_KVM_BEGIN(system_call_kvm)
SET_SCRATCH0(r10)
std r9,PACA_EXGEN+EX_R9(r13)
mfcrr9
-   KVM_HANDLER 0xc00, EXC_STD, PACA_EXGEN, 0
+   GEN_KVM system_call
 #endif
 
 
-- 
2.22.0



[PATCH v2 07/20] powerpc/64s/exception: Remove old INT_COMMON macro

2019-09-04 Thread Nicholas Piggin
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 51 +---
 1 file changed, 24 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index ba2dcd91aaaf..f318869607db 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -591,8 +591,8 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
  * If stack=0, then the stack is already set in r1, and r1 is saved in r10.
  * PPR save and CPU accounting is not done for the !stack case (XXX why not?)
  */
-.macro INT_COMMON vec, area, stack, kaup, reconcile, dar, dsisr
-   .if \stack
+.macro GEN_COMMON name
+   .if ISTACK
andi.   r10,r12,MSR_PR  /* See if coming from user  */
mr  r10,r1  /* Save r1  */
subir1,r1,INT_FRAME_SIZE/* alloc frame on kernel stack  */
@@ -609,54 +609,54 @@ 
END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
std r0,GPR0(r1) /* save r0 in stackframe*/
std r10,GPR1(r1)/* save r1 in stackframe*/
 
-   .if \stack
-   .if \kaup
+   .if ISTACK
+   .if IKUAP
kuap_save_amr_and_lock r9, r10, cr1, cr0
.endif
beq 101f/* if from kernel mode  */
ACCOUNT_CPU_USER_ENTRY(r13, r9, r10)
-   SAVE_PPR(\area, r9)
+   SAVE_PPR(IAREA, r9)
 101:
.else
-   .if \kaup
+   .if IKUAP
kuap_save_amr_and_lock r9, r10, cr1
.endif
.endif
 
/* Save original regs values from save area to stack frame. */
-   ld  r9,\area+EX_R9(r13) /* move r9, r10 to stackframe   */
-   ld  r10,\area+EX_R10(r13)
+   ld  r9,IAREA+EX_R9(r13) /* move r9, r10 to stackframe   */
+   ld  r10,IAREA+EX_R10(r13)
std r9,GPR9(r1)
std r10,GPR10(r1)
-   ld  r9,\area+EX_R11(r13)/* move r11 - r13 to stackframe */
-   ld  r10,\area+EX_R12(r13)
-   ld  r11,\area+EX_R13(r13)
+   ld  r9,IAREA+EX_R11(r13)/* move r11 - r13 to stackframe */
+   ld  r10,IAREA+EX_R12(r13)
+   ld  r11,IAREA+EX_R13(r13)
std r9,GPR11(r1)
std r10,GPR12(r1)
std r11,GPR13(r1)
-   .if \dar
-   .if \dar == 2
+   .if IDAR
+   .if IDAR == 2
ld  r10,_NIP(r1)
.else
-   ld  r10,\area+EX_DAR(r13)
+   ld  r10,IAREA+EX_DAR(r13)
.endif
std r10,_DAR(r1)
.endif
-   .if \dsisr
-   .if \dsisr == 2
+   .if IDSISR
+   .if IDSISR == 2
ld  r10,_MSR(r1)
lis r11,DSISR_SRR1_MATCH_64S@h
and r10,r10,r11
.else
-   lwz r10,\area+EX_DSISR(r13)
+   lwz r10,IAREA+EX_DSISR(r13)
.endif
std r10,_DSISR(r1)
.endif
 BEGIN_FTR_SECTION_NESTED(66)
-   ld  r10,\area+EX_CFAR(r13)
+   ld  r10,IAREA+EX_CFAR(r13)
std r10,ORIG_GPR3(r1)
 END_FTR_SECTION_NESTED(CPU_FTR_CFAR, CPU_FTR_CFAR, 66)
-   GET_CTR(r10, \area)
+   GET_CTR(r10, IAREA)
std r10,_CTR(r1)
std r2,GPR2(r1) /* save r2 in stackframe*/
SAVE_4GPRS(3, r1)   /* save r3 - r6 in stackframe   */
@@ -668,26 +668,22 @@ END_FTR_SECTION_NESTED(CPU_FTR_CFAR, CPU_FTR_CFAR, 66)
mfspr   r11,SPRN_XER/* save XER in stackframe   */
std r10,SOFTE(r1)
std r11,_XER(r1)
-   li  r9,(\vec)+1
+   li  r9,(IVEC)+1
std r9,_TRAP(r1)/* set trap number  */
li  r10,0
ld  r11,exception_marker@toc(r2)
std r10,RESULT(r1)  /* clear regs->result   */
std r11,STACK_FRAME_OVERHEAD-16(r1) /* mark the frame   */
 
-   .if \stack
+   .if ISTACK
ACCOUNT_STOLEN_TIME
.endif
 
-   .if \reconcile
+   .if IRECONCILE
RECONCILE_IRQ_STATE(r10, r11)
.endif
 .endm
 
-.macro GEN_COMMON name
-   INT_COMMON IVEC, IAREA, ISTACK, IKUAP, IRECONCILE, IDAR, IDSISR
-.endm
-
 /*
  * Restore all registers including H/SRR0/1 saved in a stack frame of a
  * standard exception.
@@ -2400,7 +2396,8 @@ EXC_COMMON_BEGIN(soft_nmi_common)
mr  r10,r1
ld  r1,PACAEMERGSP(r13)
subir1,r1,INT_FRAME_SIZE
-   INT_COMMON 0x900, PACA_EXGEN, 0, 1, 1, 0, 0
+   __ISTACK(decrementer)=0
+   GEN_COMMON decrementer
bl  save_nvgprs
addir3,r1,STACK_FRAME_OVERHEAD
bl  soft_nmi_interrupt
-- 
2.22.0



[PATCH v2 06/20] powerpc/64s/exception: Remove old INT_ENTRY macro

2019-09-04 Thread Nicholas Piggin
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 68 
 1 file changed, 30 insertions(+), 38 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index b5decc9a0cbf..ba2dcd91aaaf 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -482,13 +482,13 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
  * - Fall through and continue executing in real, unrelocated mode.
  *   This is done if early=2.
  */
-.macro INT_HANDLER name, vec, ool=0, early=0, virt=0, hsrr=0, area=PACA_EXGEN, ri=1, dar=0, dsisr=0, bitmask=0, kvm=0
+.macro GEN_INT_ENTRY name, virt, ool=0
SET_SCRATCH0(r13)   /* save r13 */
GET_PACA(r13)
-   std r9,\area\()+EX_R9(r13)  /* save r9 */
+   std r9,IAREA+EX_R9(r13) /* save r9 */
OPT_GET_SPR(r9, SPRN_PPR, CPU_FTR_HAS_PPR)
HMT_MEDIUM
-   std r10,\area\()+EX_R10(r13)/* save r10 - r12 */
+   std r10,IAREA+EX_R10(r13)   /* save r10 - r12 */
OPT_GET_SPR(r10, SPRN_CFAR, CPU_FTR_CFAR)
.if \ool
.if !\virt
@@ -502,47 +502,47 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
.endif
.endif
 
-   OPT_SAVE_REG_TO_PACA(\area\()+EX_PPR, r9, CPU_FTR_HAS_PPR)
-   OPT_SAVE_REG_TO_PACA(\area\()+EX_CFAR, r10, CPU_FTR_CFAR)
+   OPT_SAVE_REG_TO_PACA(IAREA+EX_PPR, r9, CPU_FTR_HAS_PPR)
+   OPT_SAVE_REG_TO_PACA(IAREA+EX_CFAR, r10, CPU_FTR_CFAR)
INTERRUPT_TO_KERNEL
-   SAVE_CTR(r10, \area\())
+   SAVE_CTR(r10, IAREA)
mfcrr9
-   .if \kvm
-   KVMTEST \name \hsrr \vec
+   .if (!\virt && IKVM_REAL) || (\virt && IKVM_VIRT)
+   KVMTEST \name IHSRR IVEC
.endif
-   .if \bitmask
+   .if IMASK
lbz r10,PACAIRQSOFTMASK(r13)
-   andi.   r10,r10,\bitmask
+   andi.   r10,r10,IMASK
/* Associate vector numbers with bits in paca->irq_happened */
-   .if \vec == 0x500 || \vec == 0xea0
+   .if IVEC == 0x500 || IVEC == 0xea0
li  r10,PACA_IRQ_EE
-   .elseif \vec == 0x900
+   .elseif IVEC == 0x900
li  r10,PACA_IRQ_DEC
-   .elseif \vec == 0xa00 || \vec == 0xe80
+   .elseif IVEC == 0xa00 || IVEC == 0xe80
li  r10,PACA_IRQ_DBELL
-   .elseif \vec == 0xe60
+   .elseif IVEC == 0xe60
li  r10,PACA_IRQ_HMI
-   .elseif \vec == 0xf00
+   .elseif IVEC == 0xf00
li  r10,PACA_IRQ_PMI
.else
.abort "Bad maskable vector"
.endif
 
-   .if \hsrr == EXC_HV_OR_STD
+   .if IHSRR == EXC_HV_OR_STD
BEGIN_FTR_SECTION
bne masked_Hinterrupt
FTR_SECTION_ELSE
bne masked_interrupt
ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
-   .elseif \hsrr
+   .elseif IHSRR
bne masked_Hinterrupt
.else
bne masked_interrupt
.endif
.endif
 
-   std r11,\area\()+EX_R11(r13)
-   std r12,\area\()+EX_R12(r13)
+   std r11,IAREA+EX_R11(r13)
+   std r12,IAREA+EX_R12(r13)
 
/*
 * DAR/DSISR, SCRATCH0 must be read before setting MSR[RI],
@@ -550,47 +550,39 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
 * not recoverable if they are live.
 */
GET_SCRATCH0(r10)
-   std r10,\area\()+EX_R13(r13)
-   .if \dar == 1
-   .if \hsrr
+   std r10,IAREA+EX_R13(r13)
+   .if IDAR == 1
+   .if IHSRR
mfspr   r10,SPRN_HDAR
.else
mfspr   r10,SPRN_DAR
.endif
-   std r10,\area\()+EX_DAR(r13)
+   std r10,IAREA+EX_DAR(r13)
.endif
-   .if \dsisr == 1
-   .if \hsrr
+   .if IDSISR == 1
+   .if IHSRR
mfspr   r10,SPRN_HDSISR
.else
mfspr   r10,SPRN_DSISR
.endif
-   stw r10,\area\()+EX_DSISR(r13)
+   stw r10,IAREA+EX_DSISR(r13)
.endif
 
-   .if \early == 2
+   .if IEARLY == 2
/* nothing more */
-   .elseif \early
+   .elseif IEARLY
mfctr   r10 /* save ctr, even for !RELOCATABLE */
BRANCH_TO_C000(r11, \name\()_common)
.elseif !\virt
-   INT_SAVE_SRR_AND_JUMP \name\()_common, \hsrr, \ri
+   INT_SAVE_SRR_AND_JUMP \name\()_common, IHSRR, ISET_RI
.else
-   INT_VIRT_SAVE_SRR_AND_JUMP \name\()_common, \hsrr
+   INT_VIRT_SAVE_SRR_AND_JUMP \name\()_common, IHSRR
.endif
.if \ool
.popsection
.endif
 .endm
 

[PATCH v2 05/20] powerpc/64s/exception: Move all interrupt handlers to new style code gen macros

2019-09-04 Thread Nicholas Piggin
Aside from label names and BUG line numbers, the generated code change
is an additional HMI KVM handler added for the "late" KVM handler,
because early and late HMI generation is achieved by defining two
different interrupt types.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 556 ---
 1 file changed, 418 insertions(+), 138 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 828fa4df15cf..b5decc9a0cbf 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -206,8 +206,10 @@ END_FTR_SECTION_NESTED(ftr,ftr,943)
 #define IMASK  .L_IMASK_\name\()
 #define IKVM_SKIP  .L_IKVM_SKIP_\name\()
 #define IKVM_REAL  .L_IKVM_REAL_\name\()
+#define __IKVM_REAL(name)  .L_IKVM_REAL_ ## name
 #define IKVM_VIRT  .L_IKVM_VIRT_\name\()
 #define ISTACK .L_ISTACK_\name\()
+#define __ISTACK(name) .L_ISTACK_ ## name
 #define IRECONCILE .L_IRECONCILE_\name\()
 #define IKUAP  .L_IKUAP_\name\()
 
@@ -570,7 +572,7 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
/* nothing more */
.elseif \early
mfctr   r10 /* save ctr, even for !RELOCATABLE */
-   BRANCH_TO_C000(r11, \name\()_early_common)
+   BRANCH_TO_C000(r11, \name\()_common)
.elseif !\virt
INT_SAVE_SRR_AND_JUMP \name\()_common, \hsrr, \ri
.else
@@ -843,6 +845,19 @@ __start_interrupts:
 EXC_VIRT_NONE(0x4000, 0x100)
 
 
+INT_DEFINE_BEGIN(system_reset)
+   IVEC=0x100
+   IAREA=PACA_EXNMI
+   /*
+* MSR_RI is not enabled, because PACA_EXNMI and nmi stack is
+* being used, so a nested NMI exception would corrupt it.
+*/
+   ISET_RI=0
+   ISTACK=0
+   IRECONCILE=0
+   IKVM_REAL=1
+INT_DEFINE_END(system_reset)
+
 EXC_REAL_BEGIN(system_reset, 0x100, 0x100)
 #ifdef CONFIG_PPC_P7_NAP
/*
@@ -880,11 +895,8 @@ BEGIN_FTR_SECTION
 END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
 #endif
 
-   INT_HANDLER system_reset, 0x100, area=PACA_EXNMI, ri=0, kvm=1
+   GEN_INT_ENTRY system_reset, virt=0
/*
-* MSR_RI is not enabled, because PACA_EXNMI and nmi stack is
-* being used, so a nested NMI exception would corrupt it.
-*
 * In theory, we should not enable relocation here if it was disabled
 * in SRR1, because the MMU may not be configured to support it (e.g.,
 * SLB may have been cleared). In practice, there should only be a few
@@ -893,7 +905,8 @@ END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
 */
 EXC_REAL_END(system_reset, 0x100, 0x100)
 EXC_VIRT_NONE(0x4100, 0x100)
-INT_KVM_HANDLER system_reset 0x100, EXC_STD, PACA_EXNMI, 0
+TRAMP_KVM_BEGIN(system_reset_kvm)
+   GEN_KVM system_reset
 
 #ifdef CONFIG_PPC_P7_NAP
 TRAMP_REAL_BEGIN(system_reset_idle_wake)
@@ -908,8 +921,8 @@ TRAMP_REAL_BEGIN(system_reset_idle_wake)
  * Vectors for the FWNMI option.  Share common code.
  */
 TRAMP_REAL_BEGIN(system_reset_fwnmi)
-   /* See comment at system_reset exception, don't turn on RI */
-   INT_HANDLER system_reset, 0x100, area=PACA_EXNMI, ri=0
+   __IKVM_REAL(system_reset)=0
+   GEN_INT_ENTRY system_reset, virt=0
 
 #endif /* CONFIG_PPC_PSERIES */
 
@@ -929,7 +942,7 @@ EXC_COMMON_BEGIN(system_reset_common)
mr  r10,r1
ld  r1,PACA_NMI_EMERG_SP(r13)
subir1,r1,INT_FRAME_SIZE
-   INT_COMMON 0x100, PACA_EXNMI, 0, 1, 0, 0, 0
+   GEN_COMMON system_reset
bl  save_nvgprs
/*
 * Set IRQS_ALL_DISABLED unconditionally so arch_irqs_disabled does
@@ -971,23 +984,46 @@ EXC_COMMON_BEGIN(system_reset_common)
RFI_TO_USER_OR_KERNEL
 
 
-EXC_REAL_BEGIN(machine_check, 0x200, 0x100)
-   INT_HANDLER machine_check, 0x200, early=1, area=PACA_EXMC, dar=1, dsisr=1
+INT_DEFINE_BEGIN(machine_check_early)
+   IVEC=0x200
+   IAREA=PACA_EXMC
/*
 * MSR_RI is not enabled, because PACA_EXMC is being used, so a
 * nested machine check corrupts it. machine_check_common enables
 * MSR_RI.
 */
+   ISET_RI=0
+   ISTACK=0
+   IEARLY=1
+   IDAR=1
+   IDSISR=1
+   IRECONCILE=0
+   IKUAP=0 /* We don't touch AMR here, we never go to virtual mode */
+INT_DEFINE_END(machine_check_early)
+
+INT_DEFINE_BEGIN(machine_check)
+   IVEC=0x200
+   IAREA=PACA_EXMC
+   ISET_RI=0
+   IDAR=1
+   IDSISR=1
+   IKVM_SKIP=1
+   IKVM_REAL=1
+INT_DEFINE_END(machine_check)
+
+EXC_REAL_BEGIN(machine_check, 0x200, 0x100)
+   GEN_INT_ENTRY machine_check_early, virt=0
 EXC_REAL_END(machine_check, 0x200, 0x100)
 EXC_VIRT_NONE(0x4200, 0x100)
 
 #ifdef CONFIG_PPC_PSERIES
 TRAMP_REAL_BEGIN(machine_check_fwnmi)
/* See comment at machine_check exception, don't turn on RI */
-   INT_HANDLER machine_check, 0x200, early=1, area=PACA_EXMC, 

[PATCH v2 04/20] powerpc/64s/exception: Expand EXC_COMMON and EXC_COMMON_ASYNC macros

2019-09-04 Thread Nicholas Piggin
These don't provide a large amount of code sharing. Removing them
makes code easier to shuffle around. For example, some of the common
instructions will be moved into the common code gen macro.

No generated code change.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 160 ---
 1 file changed, 117 insertions(+), 43 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 0e39e98ef719..828fa4df15cf 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -757,28 +757,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_CAN_NAP)
 #define FINISH_NAP
 #endif
 
-#define EXC_COMMON(name, realvec, hdlr)					\
-   EXC_COMMON_BEGIN(name); \
-   INT_COMMON realvec, PACA_EXGEN, 1, 1, 1, 0, 0 ; \
-   bl  save_nvgprs;\
-   addir3,r1,STACK_FRAME_OVERHEAD; \
-   bl  hdlr;   \
-   b   ret_from_except
-
-/*
- * Like EXC_COMMON, but for exceptions that can occur in the idle task and
- * therefore need the special idle handling (finish nap and runlatch)
- */
-#define EXC_COMMON_ASYNC(name, realvec, hdlr)  \
-   EXC_COMMON_BEGIN(name); \
-   INT_COMMON realvec, PACA_EXGEN, 1, 1, 1, 0, 0 ; \
-   FINISH_NAP; \
-   RUNLATCH_ON;\
-   addir3,r1,STACK_FRAME_OVERHEAD; \
-   bl  hdlr;   \
-   b   ret_from_except_lite
-
-
 /*
  * There are a few constraints to be concerned with.
  * - Real mode exceptions code/data must be located at their physical location.
@@ -1349,7 +1327,13 @@ EXC_VIRT_BEGIN(hardware_interrupt, 0x4500, 0x100)
INT_HANDLER hardware_interrupt, 0x500, virt=1, hsrr=EXC_HV_OR_STD, bitmask=IRQS_DISABLED, kvm=1
 EXC_VIRT_END(hardware_interrupt, 0x4500, 0x100)
 INT_KVM_HANDLER hardware_interrupt, 0x500, EXC_HV_OR_STD, PACA_EXGEN, 0
-EXC_COMMON_ASYNC(hardware_interrupt_common, 0x500, do_IRQ)
+EXC_COMMON_BEGIN(hardware_interrupt_common)
+   INT_COMMON 0x500, PACA_EXGEN, 1, 1, 1, 0, 0
+   FINISH_NAP
+   RUNLATCH_ON
+   addir3,r1,STACK_FRAME_OVERHEAD
+   bl  do_IRQ
+   b   ret_from_except_lite
 
 
 EXC_REAL_BEGIN(alignment, 0x600, 0x100)
@@ -1455,7 +1439,13 @@ EXC_VIRT_BEGIN(decrementer, 0x4900, 0x80)
INT_HANDLER decrementer, 0x900, virt=1, bitmask=IRQS_DISABLED
 EXC_VIRT_END(decrementer, 0x4900, 0x80)
 INT_KVM_HANDLER decrementer, 0x900, EXC_STD, PACA_EXGEN, 0
-EXC_COMMON_ASYNC(decrementer_common, 0x900, timer_interrupt)
+EXC_COMMON_BEGIN(decrementer_common)
+   INT_COMMON 0x900, PACA_EXGEN, 1, 1, 1, 0, 0
+   FINISH_NAP
+   RUNLATCH_ON
+   addir3,r1,STACK_FRAME_OVERHEAD
+   bl  timer_interrupt
+   b   ret_from_except_lite
 
 
 EXC_REAL_BEGIN(hdecrementer, 0x980, 0x80)
@@ -1465,7 +1455,12 @@ EXC_VIRT_BEGIN(hdecrementer, 0x4980, 0x80)
INT_HANDLER hdecrementer, 0x980, virt=1, hsrr=EXC_HV, kvm=1
 EXC_VIRT_END(hdecrementer, 0x4980, 0x80)
 INT_KVM_HANDLER hdecrementer, 0x980, EXC_HV, PACA_EXGEN, 0
-EXC_COMMON(hdecrementer_common, 0x980, hdec_interrupt)
+EXC_COMMON_BEGIN(hdecrementer_common)
+   INT_COMMON 0x980, PACA_EXGEN, 1, 1, 1, 0, 0
+   bl  save_nvgprs
+   addir3,r1,STACK_FRAME_OVERHEAD
+   bl  hdec_interrupt
+   b   ret_from_except
 
 
 EXC_REAL_BEGIN(doorbell_super, 0xa00, 0x100)
@@ -1475,11 +1470,17 @@ EXC_VIRT_BEGIN(doorbell_super, 0x4a00, 0x100)
INT_HANDLER doorbell_super, 0xa00, virt=1, bitmask=IRQS_DISABLED
 EXC_VIRT_END(doorbell_super, 0x4a00, 0x100)
 INT_KVM_HANDLER doorbell_super, 0xa00, EXC_STD, PACA_EXGEN, 0
+EXC_COMMON_BEGIN(doorbell_super_common)
+   INT_COMMON 0xa00, PACA_EXGEN, 1, 1, 1, 0, 0
+   FINISH_NAP
+   RUNLATCH_ON
+   addir3,r1,STACK_FRAME_OVERHEAD
 #ifdef CONFIG_PPC_DOORBELL
-EXC_COMMON_ASYNC(doorbell_super_common, 0xa00, doorbell_exception)
+   bl  doorbell_exception
 #else
-EXC_COMMON_ASYNC(doorbell_super_common, 0xa00, unknown_exception)
+   bl  unknown_exception
 #endif
+   b   ret_from_except_lite
 
 
 EXC_REAL_NONE(0xb00, 0x100)
@@ -1623,7 +1624,12 @@ EXC_VIRT_BEGIN(single_step, 0x4d00, 0x100)
INT_HANDLER single_step, 0xd00, virt=1
 EXC_VIRT_END(single_step, 0x4d00, 0x100)
 INT_KVM_HANDLER single_step, 0xd00, EXC_STD, PACA_EXGEN, 0
-EXC_COMMON(single_step_common, 0xd00, single_step_exception)
+EXC_COMMON_BEGIN(single_step_common)
+   INT_COMMON 0xd00, PACA_EXGEN, 1, 1, 1, 0, 0
+   bl  save_nvgprs
+   addi

[PATCH v2 03/20] powerpc/64s/exception: Add GEN_KVM macro that uses INT_DEFINE parameters

2019-09-04 Thread Nicholas Piggin
No generated code change.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 591ae2a73e18..0e39e98ef719 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -204,6 +204,7 @@ END_FTR_SECTION_NESTED(ftr,ftr,943)
 #define ISET_RI.L_ISET_RI_\name\()
 #define IEARLY .L_IEARLY_\name\()
 #define IMASK  .L_IMASK_\name\()
+#define IKVM_SKIP  .L_IKVM_SKIP_\name\()
 #define IKVM_REAL  .L_IKVM_REAL_\name\()
 #define IKVM_VIRT  .L_IKVM_VIRT_\name\()
 #define ISTACK .L_ISTACK_\name\()
@@ -243,6 +244,9 @@ do_define_int n
.ifndef IMASK
IMASK=0
.endif
+   .ifndef IKVM_SKIP
+   IKVM_SKIP=0
+   .endif
.ifndef IKVM_REAL
IKVM_REAL=0
.endif
@@ -265,6 +269,10 @@ do_define_int n
KVM_HANDLER \vec, \hsrr, \area, \skip
 .endm
 
+.macro GEN_KVM name
+   KVM_HANDLER IVEC, IHSRR, IAREA, IKVM_SKIP
+.endm
+
 #ifdef CONFIG_KVM_BOOK3S_64_HANDLER
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
 /*
@@ -1226,6 +1234,7 @@ INT_DEFINE_BEGIN(data_access)
IVEC=0x300
IDAR=1
IDSISR=1
+   IKVM_SKIP=1
IKVM_REAL=1
 INT_DEFINE_END(data_access)
 
@@ -1235,7 +1244,8 @@ EXC_REAL_END(data_access, 0x300, 0x80)
 EXC_VIRT_BEGIN(data_access, 0x4300, 0x80)
GEN_INT_ENTRY data_access, virt=1
 EXC_VIRT_END(data_access, 0x4300, 0x80)
-INT_KVM_HANDLER data_access, 0x300, EXC_STD, PACA_EXGEN, 1
+TRAMP_KVM_BEGIN(data_access_kvm)
+   GEN_KVM data_access
 EXC_COMMON_BEGIN(data_access_common)
GEN_COMMON data_access
ld  r4,_DAR(r1)
-- 
2.22.0



[PATCH v2 02/20] powerpc/64s/exception: Add GEN_COMMON macro that uses INT_DEFINE parameters

2019-09-04 Thread Nicholas Piggin
No generated code change.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 24 +---
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index e6ad6e6cf65e..591ae2a73e18 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -206,6 +206,9 @@ END_FTR_SECTION_NESTED(ftr,ftr,943)
 #define IMASK  .L_IMASK_\name\()
 #define IKVM_REAL  .L_IKVM_REAL_\name\()
 #define IKVM_VIRT  .L_IKVM_VIRT_\name\()
+#define ISTACK .L_ISTACK_\name\()
+#define IRECONCILE .L_IRECONCILE_\name\()
+#define IKUAP  .L_IKUAP_\name\()
 
 #define INT_DEFINE_BEGIN(n)\
 .macro int_define_ ## n name
@@ -246,6 +249,15 @@ do_define_int n
.ifndef IKVM_VIRT
IKVM_VIRT=0
.endif
+   .ifndef ISTACK
+   ISTACK=1
+   .endif
+   .ifndef IRECONCILE
+   IRECONCILE=1
+   .endif
+   .ifndef IKUAP
+   IKUAP=1
+   .endif
 .endm
 
 .macro INT_KVM_HANDLER name, vec, hsrr, area, skip
@@ -670,6 +682,10 @@ END_FTR_SECTION_NESTED(CPU_FTR_CFAR, CPU_FTR_CFAR, 66)
.endif
 .endm
 
+.macro GEN_COMMON name
+   INT_COMMON IVEC, IAREA, ISTACK, IKUAP, IRECONCILE, IDAR, IDSISR
+.endm
+
 /*
  * Restore all registers including H/SRR0/1 saved in a stack frame of a
  * standard exception.
@@ -1221,13 +1237,7 @@ EXC_VIRT_BEGIN(data_access, 0x4300, 0x80)
 EXC_VIRT_END(data_access, 0x4300, 0x80)
 INT_KVM_HANDLER data_access, 0x300, EXC_STD, PACA_EXGEN, 1
 EXC_COMMON_BEGIN(data_access_common)
-   /*
-* Here r13 points to the paca, r9 contains the saved CR,
-* SRR0 and SRR1 are saved in r11 and r12,
-* r9 - r13 are saved in paca->exgen.
-* EX_DAR and EX_DSISR have saved DAR/DSISR
-*/
-   INT_COMMON 0x300, PACA_EXGEN, 1, 1, 1, 1, 1
+   GEN_COMMON data_access
ld  r4,_DAR(r1)
ld  r5,_DSISR(r1)
 BEGIN_MMU_FTR_SECTION
-- 
2.22.0



[PATCH v2 00/20] remaining interrupt handler changes

2019-09-04 Thread Nicholas Piggin
This is a rebase of the remaining patches in this series onto the
powerpc next branch:

https://lore.kernel.org/r/20190802105709.27696-2-npig...@gmail.com

Plus the next series. A few improvements were added, such as using
the name=val style of parameter for invoking macros.


Nicholas Piggin (20):
  powerpc/64s/exception: Introduce INT_DEFINE parameter block for code
generation
  powerpc/64s/exception: Add GEN_COMMON macro that uses INT_DEFINE
parameters
  powerpc/64s/exception: Add GEN_KVM macro that uses INT_DEFINE
parameters
  powerpc/64s/exception: Expand EXC_COMMON and EXC_COMMON_ASYNC macros
  powerpc/64s/exception: Move all interrupt handlers to new style code
gen macros
  powerpc/64s/exception: Remove old INT_ENTRY macro
  powerpc/64s/exception: Remove old INT_COMMON macro
  powerpc/64s/exception: Remove old INT_KVM_HANDLER
  powerpc/64s/exception: Add ISIDE option
  powerpc/64s/exception: move real->virt switch into the common handler
  powerpc/64s/exception: move soft-mask test to common code
  powerpc/64s/exception: move KVM test to common code
  powerpc/64s/exception: remove confusing IEARLY option
  powerpc/64s/exception: remove the SPR saving patch code macros
  powerpc/64s/exception: trim unused arguments from KVMTEST macro
  powerpc/64s/exception: hdecrementer avoid touching the stack
  powerpc/64s/exception: re-inline some handlers
  powerpc/64s/exception: Clean up SRR specifiers
  powerpc/64s/exception: add more comments for interrupt handlers
  powerpc/64s/exception: only test KVM in SRR interrupts when PR KVM is
supported

 arch/powerpc/include/asm/exception-64s.h |4 -
 arch/powerpc/include/asm/time.h  |1 -
 arch/powerpc/kernel/exceptions-64s.S | 1859 +++---
 arch/powerpc/kernel/time.c   |9 -
 arch/powerpc/kvm/book3s_hv_rmhandlers.S  |   11 -
 arch/powerpc/kvm/book3s_segment.S|7 -
 6 files changed, 1303 insertions(+), 588 deletions(-)

-- 
2.22.0



[PATCH v2 01/20] powerpc/64s/exception: Introduce INT_DEFINE parameter block for code generation

2019-09-04 Thread Nicholas Piggin
The code generation macro arguments are difficult to read, and
defaults can't easily be used.

This introduces a block where parameters can be set for interrupt
handler code generation by the subsequent macros, and adds the first
generation macro for interrupt entry.

One interrupt handler is converted to the new macros to demonstrate
the change; the rest will be converted all at once.

No generated code change.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 77 ++--
 1 file changed, 73 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index d0018dd17e0a..e6ad6e6cf65e 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -193,6 +193,61 @@ END_FTR_SECTION_NESTED(ftr,ftr,943)
mtctr   reg;\
bctr
 
+/*
+ * Interrupt code generation macros
+ */
+#define IVEC   .L_IVEC_\name\()
+#define IHSRR  .L_IHSRR_\name\()
+#define IAREA  .L_IAREA_\name\()
+#define IDAR   .L_IDAR_\name\()
+#define IDSISR .L_IDSISR_\name\()
+#define ISET_RI.L_ISET_RI_\name\()
+#define IEARLY .L_IEARLY_\name\()
+#define IMASK  .L_IMASK_\name\()
+#define IKVM_REAL  .L_IKVM_REAL_\name\()
+#define IKVM_VIRT  .L_IKVM_VIRT_\name\()
+
+#define INT_DEFINE_BEGIN(n)\
+.macro int_define_ ## n name
+
+#define INT_DEFINE_END(n)  \
+.endm ;							\
+int_define_ ## n n ;   \
+do_define_int n
+
+.macro do_define_int name
+   .ifndef IVEC
+   .error "IVEC not defined"
+   .endif
+   .ifndef IHSRR
+   IHSRR=EXC_STD
+   .endif
+   .ifndef IAREA
+   IAREA=PACA_EXGEN
+   .endif
+   .ifndef IDAR
+   IDAR=0
+   .endif
+   .ifndef IDSISR
+   IDSISR=0
+   .endif
+   .ifndef ISET_RI
+   ISET_RI=1
+   .endif
+   .ifndef IEARLY
+   IEARLY=0
+   .endif
+   .ifndef IMASK
+   IMASK=0
+   .endif
+   .ifndef IKVM_REAL
+   IKVM_REAL=0
+   .endif
+   .ifndef IKVM_VIRT
+   IKVM_VIRT=0
+   .endif
+.endm
+
 .macro INT_KVM_HANDLER name, vec, hsrr, area, skip
TRAMP_KVM_BEGIN(\name\()_kvm)
KVM_HANDLER \vec, \hsrr, \area, \skip
@@ -474,7 +529,7 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
 */
GET_SCRATCH0(r10)
std r10,\area\()+EX_R13(r13)
-   .if \dar
+   .if \dar == 1
.if \hsrr
mfspr   r10,SPRN_HDAR
.else
@@ -482,7 +537,7 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
.endif
std r10,\area\()+EX_DAR(r13)
.endif
-   .if \dsisr
+   .if \dsisr == 1
.if \hsrr
mfspr   r10,SPRN_HDSISR
.else
@@ -506,6 +561,14 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
.endif
 .endm
 
+.macro GEN_INT_ENTRY name, virt, ool=0
+   .if ! \virt
INT_HANDLER \name, IVEC, \ool, IEARLY, \virt, IHSRR, IAREA, ISET_RI, IDAR, IDSISR, IMASK, IKVM_REAL
+   .else
INT_HANDLER \name, IVEC, \ool, IEARLY, \virt, IHSRR, IAREA, ISET_RI, IDAR, IDSISR, IMASK, IKVM_VIRT
+   .endif
+.endm
+
 /*
  * On entry r13 points to the paca, r9-r13 are saved in the paca,
  * r9 contains the saved CR, r11 and r12 contain the saved SRR0 and
@@ -1143,12 +1206,18 @@ END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
bl  unrecoverable_exception
b   .
 
+INT_DEFINE_BEGIN(data_access)
+   IVEC=0x300
+   IDAR=1
+   IDSISR=1
+   IKVM_REAL=1
+INT_DEFINE_END(data_access)
 
 EXC_REAL_BEGIN(data_access, 0x300, 0x80)
-   INT_HANDLER data_access, 0x300, ool=1, dar=1, dsisr=1, kvm=1
+   GEN_INT_ENTRY data_access, virt=0, ool=1
 EXC_REAL_END(data_access, 0x300, 0x80)
 EXC_VIRT_BEGIN(data_access, 0x4300, 0x80)
-   INT_HANDLER data_access, 0x300, virt=1, dar=1, dsisr=1
+   GEN_INT_ENTRY data_access, virt=1
 EXC_VIRT_END(data_access, 0x4300, 0x80)
 INT_KVM_HANDLER data_access, 0x300, EXC_STD, PACA_EXGEN, 1
 EXC_COMMON_BEGIN(data_access_common)
-- 
2.22.0



Re: [PATCH v5 20/31] powerpc/fadump: use smaller offset while finding memory for reservation

2019-09-04 Thread Michael Ellerman
Hari Bathini  writes:
> diff --git a/arch/powerpc/kernel/fadump-common.h b/arch/powerpc/kernel/fadump-common.h
> index d2dd117..7107cf2 100644
> --- a/arch/powerpc/kernel/fadump-common.h
> +++ b/arch/powerpc/kernel/fadump-common.h
> @@ -66,6 +66,14 @@ static inline u64 fadump_str_to_u64(const char *str)
>  
>  #define FADUMP_CRASH_INFO_MAGIC  fadump_str_to_u64("FADMPINF")
>  
> +/*
> + * Amount of memory (1024MB) to skip before making another attempt at
> + * reserving memory (after the previous attempt to reserve memory for
> + * FADump failed due to memory holes and/or reserved ranges) to reduce
> + * the likelihood of memory reservation failure.
> + */
> +#define FADUMP_OFFSET_SIZE   0x4000U

This seems like a bit of a hack.

> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
> index 971c50d..8dd2dcc 100644
> --- a/arch/powerpc/kernel/fadump.c
> +++ b/arch/powerpc/kernel/fadump.c
> @@ -371,7 +371,7 @@ int __init fadump_reserve_mem(void)
>   !memblock_is_region_reserved(base, size))
>   break;
>  
> - base += size;
> + base += FADUMP_OFFSET_SIZE;
>   }

The comment above the loop says:

/*
 * Reserve memory at an offset closer to bottom of the RAM to
 * minimize the impact of memory hot-remove operation. We can't
 * use memblock_find_in_range() here since it doesn't allocate
 * from bottom to top.
 */

Is that true? Can't we set memblock to bottom up mode and then call it?

cheers


Re: [PATCH v5 19/31] powerpc/fadump: Update documentation about OPAL platform support

2019-09-04 Thread Michael Ellerman
Hari Bathini  writes:
> With FADump support now available on both pseries and OPAL platforms,
> update FADump documentation with these details.
>
> Signed-off-by: Hari Bathini 
> ---
>  Documentation/powerpc/firmware-assisted-dump.rst |  104 +-
>  1 file changed, 63 insertions(+), 41 deletions(-)
>
> diff --git a/Documentation/powerpc/firmware-assisted-dump.rst b/Documentation/powerpc/firmware-assisted-dump.rst
> index d912755..2c3342c 100644
> --- a/Documentation/powerpc/firmware-assisted-dump.rst
> +++ b/Documentation/powerpc/firmware-assisted-dump.rst
> @@ -72,7 +72,8 @@ as follows:
> normal.
>  
>  -  The freshly booted kernel will notice that there is a new
> -   node (ibm,dump-kernel) in the device tree, indicating that
> +   node (ibm,dump-kernel on PSeries or ibm,opal/dump/mpipl-boot
> +   on OPAL platform) in the device tree, indicating that
> there is crash data available from a previous boot. During
> the early boot OS will reserve rest of the memory above
> boot memory size effectively booting with restricted memory
> @@ -96,7 +97,9 @@ as follows:
>  
>  Please note that the firmware-assisted dump feature
>  is only available on Power6 and above systems with recent
> -firmware versions.

Notice how "recent" has bit rotted.

> +firmware versions on PSeries (PowerVM) platform and Power9
> +and above systems with recent firmware versions on PowerNV
> +(OPAL) platform.

Can we say something more helpful here, ie. "recent" is not very useful.
AFAIK it's actually wrong, there isn't a released firmware with the
support yet at all, right?

Given all the relevant firmware is open source can't we at least point
to a commit or release tag or something?

cheers


Re: [PATCH v5 17/31] powernv/fadump: Warn before processing partial crashdump

2019-09-04 Thread Michael Ellerman
Hari Bathini  writes:
> diff --git a/arch/powerpc/platforms/powernv/opal-fadump.c b/arch/powerpc/platforms/powernv/opal-fadump.c
> index 10f6086..6a05d51 100644
> --- a/arch/powerpc/platforms/powernv/opal-fadump.c
> +++ b/arch/powerpc/platforms/powernv/opal-fadump.c
> @@ -71,6 +71,30 @@ static void opal_fadump_get_config(struct fw_dump 
> *fadump_conf,
>*/
>   fadump_conf->reserve_dump_area_start = fdm->rgn[0].dest;
>  
> + /*
> +  * Rarely, but it can so happen that system crashes before all
> +  * boot memory regions are registered for MPIPL. In such
> +  * cases, warn that the vmcore may not be accurate and proceed
> +  * anyway as that is the best bet considering free pages, cache
> +  * pages, user pages, etc are usually filtered out.
> +  *
> +  * Hope the memory that could not be preserved only has pages
> +  * that are usually filtered out while saving the vmcore.
> +  */
> + if (fdm->region_cnt > fdm->registered_regions) {
> + pr_warn("Not all memory regions are saved as system seems to have crashed before all the memory regions could be registered for MPIPL!\n");

That line is rather long, I mean the actual printed line not the source line.

Also "seems to" is vague, I think better to just state what we know to
be true, ie: "Not all memory regions were saved".

> + pr_warn("  The below boot memory regions could not be saved:\n");
> + i = fdm->registered_regions;
> + while (i < fdm->region_cnt) {
> + pr_warn("\t%d. base: 0x%llx, size: 0x%llx\n", (i + 1),
> + fdm->rgn[i].src, fdm->rgn[i].size);
> + i++;
> + }
> +
> + pr_warn("  Wishing for the above regions to have only pages that are usually filtered out (user pages, free pages, etc..) and proceeding anyway..\n");
> + pr_warn("  But the sanity of the '/proc/vmcore' file depends on whether the above region(s) have any kernel pages or not.\n");

Again those lines are too long for people on small consoles.

And "Wishing" is not really what people want to see when their system
has crashed :)

You should say something more definite, eg:
  "If the unsaved regions only contain pages that are filtered out (eg.
   free/user pages), the vmcore should still be usable. If the unsaved
   regions contain kernel pages the vmcore will be corrupted."

Or something like that.

cheers



Re: [PATCH v5 16/31] powernv/fadump: process the crashdump by exporting it as /proc/vmcore

2019-09-04 Thread Michael Ellerman
Hari Bathini  writes:
> diff --git a/arch/powerpc/platforms/powernv/opal-fadump.c b/arch/powerpc/platforms/powernv/opal-fadump.c
> index a755705..10f6086 100644
> --- a/arch/powerpc/platforms/powernv/opal-fadump.c
> +++ b/arch/powerpc/platforms/powernv/opal-fadump.c
> @@ -41,6 +43,37 @@ static void opal_fadump_update_config(struct fw_dump *fadump_conf,
>   fadump_conf->fadumphdr_addr = fdm->fadumphdr_addr;
>  }
>  
> +/*
> + * This function is called in the capture kernel to get configuration details
> + * from metadata setup by the first kernel.
> + */
> +static void opal_fadump_get_config(struct fw_dump *fadump_conf,
> +const struct opal_fadump_mem_struct *fdm)
> +{
> + int i;
> +
> + if (!fadump_conf->dump_active)
> + return;
> +
> + fadump_conf->boot_memory_size = 0;
> +
> + pr_debug("Boot memory regions:\n");
> + for (i = 0; i < fdm->region_cnt; i++) {
> + pr_debug("\t%d. base: 0x%llx, size: 0x%llx\n",
> +  (i + 1), fdm->rgn[i].src, fdm->rgn[i].size);

Printing the zero-based array off by one (i + 1) seems confusing.

> +
> + fadump_conf->boot_memory_size += fdm->rgn[i].size;
> + }
> +
> + /*
> +  * Start address of reserve dump area (permanent reservation) for
> +  * re-registering FADump after dump capture.
> +  */
> + fadump_conf->reserve_dump_area_start = fdm->rgn[0].dest;
> +
> + opal_fadump_update_config(fadump_conf, fdm);
> +}
> +
>  /* Initialize kernel metadata */
>  static void opal_fadump_init_metadata(struct opal_fadump_mem_struct *fdm)
>  {
> @@ -215,24 +248,114 @@ static void opal_fadump_cleanup(struct fw_dump *fadump_conf)
>   pr_warn("Could not reset (%llu) kernel metadata tag!\n", ret);
>  }
>  
> +/*
> + * Convert CPU state data saved at the time of crash into ELF notes.
> + */
> +static int __init opal_fadump_build_cpu_notes(struct fw_dump *fadump_conf)
> +{
> + u32 num_cpus, *note_buf;
> + struct fadump_crash_info_header *fdh = NULL;
> +
> + num_cpus = 1;
> + /* Allocate buffer to hold cpu crash notes. */
> + fadump_conf->cpu_notes_buf_size = num_cpus * sizeof(note_buf_t);
> + fadump_conf->cpu_notes_buf_size =
> + PAGE_ALIGN(fadump_conf->cpu_notes_buf_size);
> + note_buf = fadump_cpu_notes_buf_alloc(fadump_conf->cpu_notes_buf_size);
> + if (!note_buf) {
> + pr_err("Failed to allocate 0x%lx bytes for cpu notes buffer\n",
> +fadump_conf->cpu_notes_buf_size);
> + return -ENOMEM;
> + }
> + fadump_conf->cpu_notes_buf = __pa(note_buf);
> +
> + pr_debug("Allocated buffer for cpu notes of size %ld at %p\n",
> +  (num_cpus * sizeof(note_buf_t)), note_buf);
> +
> + if (fadump_conf->fadumphdr_addr)
> + fdh = __va(fadump_conf->fadumphdr_addr);
> +
> + if (fdh && (fdh->crashing_cpu != FADUMP_CPU_UNKNOWN)) {
> + note_buf = fadump_regs_to_elf_notes(note_buf, &(fdh->regs));
> + final_note(note_buf);
> +
> + pr_debug("Updating elfcore header (%llx) with cpu notes\n",
> +  fdh->elfcorehdr_addr);
> + fadump_update_elfcore_header(fadump_conf,
> +  __va(fdh->elfcorehdr_addr));
> + }
> +
> + return 0;
> +}
> +
>  static int __init opal_fadump_process(struct fw_dump *fadump_conf)
>  {
> - return -EINVAL;
> + struct fadump_crash_info_header *fdh;
> + int rc = 0;

No need to initialise rc there.

> + if (!opal_fdm_active || !fadump_conf->fadumphdr_addr)
> + return -EINVAL;
> +
> + /* Validate the fadump crash info header */
> + fdh = __va(fadump_conf->fadumphdr_addr);
> + if (fdh->magic_number != FADUMP_CRASH_INFO_MAGIC) {
> + pr_err("Crash info header is not valid.\n");
> + return -EINVAL;
> + }
> +
> + /*
> +  * TODO: To build cpu notes, find a way to map PIR to logical id.
> +  *   Also, we may need different method for pseries and powernv.
> +  *   The currently booted kernel could have a different PIR to
> +  *   logical id mapping. So, try saving info of previous kernel's
> +  *   paca to get the right PIR to logical id mapping.
> +  */

That TODO is removed by the end of the series, so please just omit it entirely.

> + rc = opal_fadump_build_cpu_notes(fadump_conf);
> + if (rc)
> + return rc;

I think this all runs early in boot, so we don't need to worry about
another CPU seeing the partially initialised core due to there being no
barrier here before we set elfcorehdr_addr?

> + /*
> +  * We are done validating dump info and elfcore header is now ready
> +  * to be exported. set elfcorehdr_addr so that vmcore module will
> +  * export the elfcore header through '/proc/vmcore'.
> +  */
> + elfcorehdr_addr = fdh->elfcorehdr_addr;

> @@ -283,5 

Re: [PATCH] sysfs: add BIN_ATTR_WO() macro

2019-09-04 Thread Greg Kroah-Hartman
On Tue, Sep 03, 2019 at 01:37:02PM +1000, Michael Ellerman wrote:
> Greg Kroah-Hartman  writes:
> > This variant was missing from sysfs.h, I guess no one noticed it before.
> >
> > Turns out the powerpc secure variable code can use it, so add it to the
> > tree for it, and potentially others to take advantage of, instead of
> > open-coding it.
> >
> > Reported-by: Nayna Jain 
> > Signed-off-by: Greg Kroah-Hartman 
> > ---
> >
> > I'll queue this up to my tree for 5.4-rc1, but if you want to take this
> > in your tree earlier, feel free to do so.
> 
> OK. This series is blocked on the firmware support going in, so at the
> moment it might miss v5.4 anyway. So this going via your tree is no
> problem.

Ok, will queue it up now, thanks!

greg k-h


Re: [PATCH v5 15/31] powernv/fadump: support copying multiple kernel boot memory regions

2019-09-04 Thread Michael Ellerman
Hari Bathini  writes:
> Firmware uses 32-bit field for region size while copying/backing-up

Which firmware exactly is imposing that limit?

> memory during MPIPL. So, the maximum copy size for a region would
> be a page less than 4GB (aligned to pagesize) but FADump capture
> kernel usually needs more memory than that to be preserved to avoid
> running into out of memory errors.
>
> So, request firmware to copy multiple kernel boot memory regions
> instead of just one (which worked fine for pseries as 64-bit field
> was used for size there).
>
> Signed-off-by: Hari Bathini 
> ---
>  arch/powerpc/platforms/powernv/opal-fadump.c |   35 +-
>  1 file changed, 28 insertions(+), 7 deletions(-)
>
> diff --git a/arch/powerpc/platforms/powernv/opal-fadump.c 
> b/arch/powerpc/platforms/powernv/opal-fadump.c
> index 91fb909..a755705 100644
> --- a/arch/powerpc/platforms/powernv/opal-fadump.c
> +++ b/arch/powerpc/platforms/powernv/opal-fadump.c
> @@ -28,6 +28,8 @@ static int opal_fadump_unregister(struct fw_dump 
> *fadump_conf);
>  static void opal_fadump_update_config(struct fw_dump *fadump_conf,
> const struct opal_fadump_mem_struct *fdm)
>  {
> + pr_debug("Boot memory regions count: %d\n", fdm->region_cnt);
> +
>   /*
>* The destination address of the first boot memory region is the
>* destination address of boot memory regions.
> @@ -50,16 +52,35 @@ static void opal_fadump_init_metadata(struct 
> opal_fadump_mem_struct *fdm)
>  
>  static ulong opal_fadump_init_mem_struct(struct fw_dump *fadump_conf)
>  {
> - ulong addr = fadump_conf->reserve_dump_area_start;
> + ulong src_addr, dest_addr;
> + int max_copy_size, cur_size, size;
>  
>   opal_fdm = __va(fadump_conf->kernel_metadata);
>   opal_fadump_init_metadata(opal_fdm);
>  
> - opal_fdm->region_cnt = 1;
> - opal_fdm->rgn[0].src= RMA_START;
> - opal_fdm->rgn[0].dest   = addr;
> - opal_fdm->rgn[0].size   = fadump_conf->boot_memory_size;
> - addr += fadump_conf->boot_memory_size;
> + /*
> +  * Firmware currently supports only 32-bit value for size,

"currently" implies it could change in future?

If it does we assume it will only increase, and we're happy that old
kernels will continue to use the 32-bit limit?

> +  * align it to pagesize and request firmware to copy multiple
> +  * kernel boot memory regions.
> +  */
> + max_copy_size = _ALIGN_DOWN(U32_MAX, PAGE_SIZE);
> +
> + /* Boot memory regions */
> + src_addr = RMA_START;

I'm not convinced using RMA_START actually makes things any clearer,
given that it's #defined as 0, and we even have a BUILD_BUG_ON() to make
sure it's never anything else.

eg:

src_addr = 0;

> + dest_addr = fadump_conf->reserve_dump_area_start;
> + size = fadump_conf->boot_memory_size;
> + while (size) {
> + cur_size = size > max_copy_size ? max_copy_size : size;
> +
> + opal_fdm->rgn[opal_fdm->region_cnt].src  = src_addr;
> + opal_fdm->rgn[opal_fdm->region_cnt].dest = dest_addr;
> + opal_fdm->rgn[opal_fdm->region_cnt].size = cur_size;
> +
> + opal_fdm->region_cnt++;
> + dest_addr   += cur_size;
> + src_addr+= cur_size;
> + size-= cur_size;
> + }
>  
>   /*
>* Kernel metadata is passed to f/w and retrieved in capture kernel.
> @@ -70,7 +91,7 @@ static ulong opal_fadump_init_mem_struct(struct fw_dump 
> *fadump_conf)
>  
>   opal_fadump_update_config(fadump_conf, opal_fdm);
>  
> - return addr;
> + return dest_addr;
>  }
>  
>  static ulong opal_fadump_get_metadata_size(void)

cheers


Re: [PATCH v5 12/31] powernv/fadump: register kernel metadata address with opal

2019-09-04 Thread Michael Ellerman
Hari Bathini  writes:
> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
> index b8061fb9..a086a09 100644
> --- a/arch/powerpc/kernel/fadump.c
> +++ b/arch/powerpc/kernel/fadump.c
> @@ -283,17 +286,17 @@ static void __init fadump_reserve_crash_area(unsigned 
> long base,
>  
>  int __init fadump_reserve_mem(void)
>  {
> + int ret = 1;
>   unsigned long base, size, memory_boundary;

Please try to use reverse christmas tree style when possible.

> @@ -363,29 +366,43 @@ int __init fadump_reserve_mem(void)
>* use memblock_find_in_range() here since it doesn't allocate
>* from bottom to top.
>*/
> - for (base = fw_dump.boot_memory_size;
> -  base <= (memory_boundary - size);
> -  base += size) {
> + while (base <= (memory_boundary - size)) {
>   if (memblock_is_region_memory(base, size) &&
>   !memblock_is_region_reserved(base, size))
>   break;
> +
> + base += size;
>   }

Some of these changes look like they might not be necessary in this
patch, ie. could be split out into a lead-up patch. eg. the conversion
from for to while. But it's a bit hard to tell.

> diff --git a/arch/powerpc/platforms/powernv/opal-fadump.c 
> b/arch/powerpc/platforms/powernv/opal-fadump.c
> index e330877..e5c4700 100644
> --- a/arch/powerpc/platforms/powernv/opal-fadump.c
> +++ b/arch/powerpc/platforms/powernv/opal-fadump.c
> @@ -13,14 +13,86 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
> +#include 
>  #include 
>  
>  #include "../../kernel/fadump-common.h"
> +#include "opal-fadump.h"
> +
> +static struct opal_fadump_mem_struct *opal_fdm;
> +
> +/* Initialize kernel metadata */
> +static void opal_fadump_init_metadata(struct opal_fadump_mem_struct *fdm)
> +{
> + fdm->version = OPAL_FADUMP_VERSION;
> + fdm->region_cnt = 0;
> + fdm->registered_regions = 0;
> + fdm->fadumphdr_addr = 0;
> +}
>  
>  static ulong opal_fadump_init_mem_struct(struct fw_dump *fadump_conf)
>  {
> - return fadump_conf->reserve_dump_area_start;
> + ulong addr = fadump_conf->reserve_dump_area_start;

I just noticed you're using ulong, which I haven't seen much before. KVM
uses it a lot but not much else.

Because this is all 64-bit only code I'd probably rather you just use
u64 explicitly to avoid anyone having to think about it.

> +
> + opal_fdm = __va(fadump_conf->kernel_metadata);
> + opal_fadump_init_metadata(opal_fdm);
> +
> + opal_fdm->region_cnt = 1;
> + opal_fdm->rgn[0].src= RMA_START;
> + opal_fdm->rgn[0].dest   = addr;
> + opal_fdm->rgn[0].size   = fadump_conf->boot_memory_size;
> + addr += fadump_conf->boot_memory_size;
> +
> + /*
> +  * Kernel metadata is passed to f/w and retrieved in capture kernel.
> +  * So, use it to save fadump header address instead of calculating it.
> +  */
> + opal_fdm->fadumphdr_addr = (opal_fdm->rgn[0].dest +
> + fadump_conf->boot_memory_size);
> +
> + return addr;
> +}
> +
> +static ulong opal_fadump_get_metadata_size(void)
> +{
> + ulong size = sizeof(struct opal_fadump_mem_struct);
> +
> + size = PAGE_ALIGN(size);
> + return size;

return PAGE_ALIGN(sizeof(struct opal_fadump_mem_struct));

???

> diff --git a/arch/powerpc/platforms/powernv/opal-fadump.h 
> b/arch/powerpc/platforms/powernv/opal-fadump.h
> new file mode 100644
> index 000..19cac1f
> --- /dev/null
> +++ b/arch/powerpc/platforms/powernv/opal-fadump.h
> @@ -0,0 +1,33 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * Firmware-Assisted Dump support on POWER platform (OPAL).
> + *
> + * Copyright 2019, IBM Corp.
> + * Author: Hari Bathini 
> + */
> +
> +#ifndef __PPC64_OPAL_FA_DUMP_H__
> +#define __PPC64_OPAL_FA_DUMP_H__

Usual style is _ASM_POWERPC_OPAL_FADUMP_H.


> +/* OPAL FADump structure format version */
> +#define OPAL_FADUMP_VERSION  0x1

What is the meaning of this version? How/can we change it. What does it
mean if it's a different number? Please provide some comments or doco
describing how it's expected to be used.

We're defining some sort of ABI here and I want to understand/have
better documentation on what the implications of that are.

> diff --git a/arch/powerpc/platforms/pseries/rtas-fadump.c 
> b/arch/powerpc/platforms/pseries/rtas-fadump.c
> index 2b94392..4111ee9 100644
> --- a/arch/powerpc/platforms/pseries/rtas-fadump.c
> +++ b/arch/powerpc/platforms/pseries/rtas-fadump.c
> @@ -121,6 +121,21 @@ static ulong rtas_fadump_init_mem_struct(struct fw_dump 
> *fadump_conf)
>   return addr;
>  }
>  
> +/*
> + * On this platform, the metadata structure is passed while registering
> + * for FADump and the same is returned by f/w in capture kernel.
> + * No additional provision to setup kernel metadata separately.
> + */
> 

Re: [PATCH v5 16/23] PCI: hotplug: movable BARs: Don't reserve IO/mem bus space

2019-09-04 Thread Sergey Miroshnichenko
On 9/4/19 8:42 AM, Oliver O'Halloran wrote:
> On Fri, 2019-08-16 at 19:50 +0300, Sergey Miroshnichenko wrote:
>> A hotplugged bridge with many hotplug-capable ports may request
>> reserving more IO space than the machine has. This could be overridden
>> with the "hpiosize=" kernel argument though.
>>
>> But when BARs are movable, there are no need to reserve space anymore:
>> new BARs are allocated not from reserved gaps, but via rearranging the
>> existing BARs. Requesting a precise amount of space for bridge windows
>> increases the chances of adding the new bridge successfully.
> 
> It wouldn't hurt to reserve some memory space to prevent unnecessary
> BAR shuffling at runtime. If it turns out that we need more space then
> we can always fall back to re-assigning the whole tree.
> 

Hi Oliver,

Thank you for your comments!

We had an issue on a x86_64 PC with a small amount of IO space: after
hotplugging an empty bridge of 32 ports even a DEFAULT_HOTPLUG_IO_SIZE
(which is 256) was enough to exhaust the space. So another patch of
this series ("Don't allow added devices to steal resources") had
disabled the BAR allocating for this bridge. It took some time for me
to guess that "hpiosize=0" can solve that.

For MEM and MEM64 spaces it will be harder to reproduce the same, but
there can be a similar problem when fitting between two immovable BARs.

To implement a fallback we would need to add a flag indicating that
allocating this bridge with reserved space has failed, so its windows
should be recalculated without reserved space - and then try again.
Maybe even two kinds of retries: with and without the full
re-assignment. We've tried to avoid adding extra execution paths and
code complexity.

Serge

>> Signed-off-by: Sergey Miroshnichenko 
>> ---
>>  drivers/pci/setup-bus.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
>> index c7b7e30c6284..7d64ec8e7088 100644
>> --- a/drivers/pci/setup-bus.c
>> +++ b/drivers/pci/setup-bus.c
>> @@ -1287,7 +1287,7 @@ void __pci_bus_size_bridges(struct pci_bus *bus, 
>> struct list_head *realloc_head)
>>  
>>  case PCI_HEADER_TYPE_BRIDGE:
>>  pci_bridge_check_ranges(bus);
>> -if (bus->self->is_hotplug_bridge) {
>> +if (bus->self->is_hotplug_bridge && 
>> !pci_movable_bars_enabled()) {
>>  additional_io_size  = pci_hotplug_io_size;
>>  additional_mem_size = pci_hotplug_mem_size;
>>  }
> 


Re: [PATCH v5 10/31] opal: add MPIPL interface definitions

2019-09-04 Thread Michael Ellerman
Hi Hari,

One other comment.

Hari Bathini  writes:
> Signed-off-by: Hari Bathini 

Change log is missing.

Please define what MPIPL means, and give people some explanation of what
it is, how it works and how you're using it for fadump.

cheers


Re: [PATCH v5 10/31] opal: add MPIPL interface definitions

2019-09-04 Thread Michael Ellerman
Hari Bathini  writes:
> On 03/09/19 4:40 PM, Michael Ellerman wrote:
>> Hari Bathini  writes:
>>> diff --git a/arch/powerpc/include/asm/opal.h 
>>> b/arch/powerpc/include/asm/opal.h
>>> index 57bd029..878110a 100644
>>> --- a/arch/powerpc/include/asm/opal.h
>>> +++ b/arch/powerpc/include/asm/opal.h
>>> @@ -39,6 +39,12 @@ int64_t opal_npu_spa_clear_cache(uint64_t phb_id, 
>>> uint32_t bdfn,
>>> uint64_t PE_handle);
>>>  int64_t opal_npu_tl_set(uint64_t phb_id, uint32_t bdfn, long cap,
>>> uint64_t rate_phys, uint32_t size);
>>> +
>>> +int64_t opal_mpipl_update(enum opal_mpipl_ops op, u64 src,
>>> + u64 dest, u64 size);
>>> +int64_t opal_mpipl_register_tag(enum opal_mpipl_tags tag, uint64_t addr);
>>> +int64_t opal_mpipl_query_tag(enum opal_mpipl_tags tag, uint64_t *addr);
>>> +
>> 
>> Please consistently use kernel types for new prototypes in here.
>
> uint64_t instead of 'enum's?

The enums are fine, I mean u64 instead of uint64_t, s64 instead of
int64_t etc.

cheers


[PATCH] KVM: PPC: Book3S HV: Delete an unnecessary check before kfree() in __kvmhv_nested_page_fault()

2019-09-04 Thread Markus Elfring
From: Markus Elfring 
Date: Wed, 4 Sep 2019 11:00:20 +0200

The kfree() function tests whether its argument is NULL
and then returns immediately.
Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring 
---
 arch/powerpc/kvm/book3s_hv_nested.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
b/arch/powerpc/kvm/book3s_hv_nested.c
index 735e0ac6f5b2..36d21090a713 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -1416,8 +1416,7 @@ static long int __kvmhv_nested_page_fault(struct kvm_run *run,
	rmapp = &memslot->arch.rmap[gfn - memslot->base_gfn];
	ret = kvmppc_create_pte(kvm, gp->shadow_pgtable, pte, n_gpa, level,
				mmu_seq, gp->shadow_lpid, rmapp, &n_rmap);
-   if (n_rmap)
-   kfree(n_rmap);
+   kfree(n_rmap);
if (ret == -EAGAIN)
ret = RESUME_GUEST; /* Let the guest try again */

--
2.23.0



Re: [PATCH v3 3/3] Powerpc64/Watchpoint: Rewrite ptrace-hwbreak.c selftest

2019-09-04 Thread Ravi Bangoria




On 8/28/19 11:44 AM, Christophe Leroy wrote:



Le 10/07/2019 à 06:54, Ravi Bangoria a écrit :

ptrace-hwbreak.c selftest is logically broken. On powerpc, when
watchpoint is created with ptrace, signals are generated before
executing the instruction and user has to manually singlestep
the instruction with watchpoint disabled, which selftest never
does and thus it keeps on getting the signal at the same
instruction. If we fix it, selftest fails because the logical
connection between tracer(parent) and tracee(child) is also
broken. Rewrite the selftest and add new tests for unaligned
access.


On the 8xx, signals are generated after executing the instruction.

Can we make the test work in both case ?


Sure, I don't mind. I guess it should be trivial to do that.

But I'm still waiting for Mikey's / mpe's reply on the actual patches.
Mikey, mpe, is it ok to not ignore actual events but generate
false-positive events? Is there any better approach?


Ravi



Re: [PATCH v5 02/31] powerpc/fadump: move internal code to a new file

2019-09-04 Thread Mahesh Jagannath Salgaonkar
On 9/3/19 9:35 PM, Hari Bathini wrote:
> 
> 
> On 03/09/19 4:39 PM, Michael Ellerman wrote:
>> Hari Bathini  writes:
>>> Make way for refactoring platform specific FADump code by moving code
>>> that could be referenced from multiple places to fadump-common.c file.
>>>
>>> Signed-off-by: Hari Bathini 
>>> ---
>>>  arch/powerpc/kernel/Makefile|2 
>>>  arch/powerpc/kernel/fadump-common.c |  140 ++
>>>  arch/powerpc/kernel/fadump-common.h |8 ++
>>>  arch/powerpc/kernel/fadump.c|  146 ++-
>>>  4 files changed, 158 insertions(+), 138 deletions(-)
>>>  create mode 100644 arch/powerpc/kernel/fadump-common.c
>>
>> I don't understand why we need fadump.c and fadump-common.c? They're
>> both common/shared across pseries & powernv aren't they?
> 
> The convention I tried to follow is to have fadump-common.c shared
> between fadump.c and the pseries & powernv code: the platform code
> takes callback requests from fadump.c and uses fadump-common.c
> (shared by both platforms), if necessary, to fulfil those requests...
> 
>> By the end of the series we end up with 149 lines in fadump-common.c
>> which seems like a waste of time. Just put it all in fadump.c.
> 
> Yeah. Probably not worth a new C file. Will just have two separate headers. 
> One for
> internal code and one for interfacing with other modules...
> 
> [...]
> 
>>> + * Copyright 2019, IBM Corp.
>>> + * Author: Hari Bathini 
>>
>> These can just be:
>>
>>  * Copyright 2011, Mahesh Salgaonkar, IBM Corporation.
>>  * Copyright 2019, Hari Bathini, IBM Corporation.
>>
> 
> Sure.
> 
>>> + */
>>> +
>>> +#undef DEBUG
>>
>> Don't undef DEBUG please.
>>
> 
> Sorry! Seeing such a thing in most files, I thought this was the
> convention. Will drop this change in all the new files I added.
> 
>>> +#define pr_fmt(fmt) "fadump: " fmt
>>> +
>>> +#include 
>>> +#include 
>>> +#include 
>>> +#include 
>>> +
>>> +#include "fadump-common.h"
>>> +
>>> +void *fadump_cpu_notes_buf_alloc(unsigned long size)
>>> +{
>>> +   void *vaddr;
>>> +   struct page *page;
>>> +   unsigned long order, count, i;
>>> +
>>> +   order = get_order(size);
>>> +   vaddr = (void *)__get_free_pages(GFP_KERNEL|__GFP_ZERO, order);
>>> +   if (!vaddr)
>>> +   return NULL;
>>> +
>>> +   count = 1 << order;
>>> +   page = virt_to_page(vaddr);
>>> +   for (i = 0; i < count; i++)
>>> +   SetPageReserved(page + i);
>>> +   return vaddr;
>>> +}
>>
>> I realise you're just moving this code, but why do we need all this hand
>> rolled allocation stuff?
> 
> Yeah, I think alloc_pages_exact() may be better here. Mahesh, am I missing 
> something?

We hook up the physical address of this buffer to ELF core header as
PT_NOTE section. Hence we don't want these pages to be moved around or
reclaimed.

Thanks,
-Mahesh.



RE: [PATCH] powerpc: Avoid clang warnings around setjmp and longjmp

2019-09-04 Thread David Laight
From: Nathan Chancellor [mailto:natechancel...@gmail.com]
> Sent: 04 September 2019 01:24
> On Tue, Sep 03, 2019 at 02:31:28PM -0500, Segher Boessenkool wrote:
> > On Mon, Sep 02, 2019 at 10:55:53PM -0700, Nathan Chancellor wrote:
> > > On Thu, Aug 29, 2019 at 09:59:48AM +, David Laight wrote:
> > > > From: Nathan Chancellor
> > > > > Sent: 28 August 2019 19:45
> > > > ...
> > > > > However, I think that -fno-builtin-* would be appropriate here because
> > > > > we are providing our own setjmp implementation, meaning clang should 
> > > > > not
> > > > > be trying to do anything with the builtin implementation like 
> > > > > building a
> > > > > declaration for it.
> > > >
> > > > Isn't implementing setjmp impossible unless you tell the compiler that
> > > > you function is 'setjmp-like' ?
> > >
> > > No idea, PowerPC is the only architecture that does such a thing.
> >
> > Since setjmp can return more than once, yes, exciting things can happen
> > if you do not tell the compiler about this.
> >
> >
> > Segher
> >
> 
> Fair enough so I guess we are back to just outright disabling the
> warning.

Just disabling the warning won't stop the compiler generating code
that breaks a 'user' implementation of setjmp().

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



Re: [PATCH 1/2] ftrace: Fix NULL pointer dereference in t_probe_next()

2019-09-04 Thread Naveen N. Rao

Steven Rostedt wrote:

On Thu,  4 Jul 2019 20:04:41 +0530
"Naveen N. Rao"  wrote:



 kernel/trace/ftrace.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 7b037295a1f1..0791eafb693d 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -3093,6 +3093,10 @@ t_probe_next(struct seq_file *m, loff_t *pos)
	hnd = &iter->probe_entry->hlist;
 
 	hash = iter->probe->ops.func_hash->filter_hash;

+
+   if (!hash)
+   return NULL;
+
size = 1 << hash->size_bits;
 
  retry:


OK, I added this, but I'm also adding this on top:


Thanks, the additional comments do make this much clearer.

Regards,
Naveen



[PATCH v2] powerpc: dump kernel log before carrying out fadump or kdump

2019-09-04 Thread Ganesh Goudar
Since commit 4388c9b3a6ee ("powerpc: Do not send system reset request
through the oops path"), pstore dmesg file is not updated when dump is
triggered from HMC. This commit modified system reset (sreset) handler
to invoke fadump or kdump (if configured), without pushing dmesg to
pstore. This leaves pstore to have old dmesg data which won't be much
of a help if kdump fails to capture the dump. This patch fixes that by
calling kmsg_dump() before heading to fadump or kdump.

Fixes: 4388c9b3a6ee ("powerpc: Do not send system reset request through the 
oops path")
Reviewed-by: Mahesh Salgaonkar 
Reviewed-by: Nicholas Piggin 
Signed-off-by: Ganesh Goudar 
---
V2: Rephrasing the commit message
---
 arch/powerpc/kernel/traps.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 11caa0291254..82f43535e686 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -472,6 +472,7 @@ void system_reset_exception(struct pt_regs *regs)
if (debugger(regs))
goto out;
 
+   kmsg_dump(KMSG_DUMP_OOPS);
/*
 * A system reset is a request to dump, so we always send
 * it through the crashdump code (if fadump or kdump are
-- 
2.17.2



Re: [PATCH v2 2/2] powerpc/nvdimm: use H_SCM_QUERY hcall on H_OVERLAP error

2019-09-04 Thread Vaibhav Jain
Hi Aneesh,

Thanks for the patch. A minor suggestion below:

"Aneesh Kumar K.V"  writes:

> Right now we force an unbind of SCM memory at drcindex on H_OVERLAP error.
> This really slows down operations like kexec where we get the H_OVERLAP
> error because we don't go through a full hypervisor re init.
>
> H_OVERLAP error for a H_SCM_BIND_MEM hcall indicates that SCM memory at
> drc index is already bound. Since we don't specify a logical memory
> address for bind hcall, we can use the H_SCM_QUERY hcall to query
> the already bound logical address.
>
> Boot time difference with and without patch is:
>
> [5.583617] IOMMU table initialized, virtual merging enabled
> [5.603041] papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Retrying 
> bind after unbinding
> [  301.514221] papr_scm ibm,persistent-memory:ibm,pmemory@44108001: Retrying 
> bind after unbinding
> [  340.057238] hv-24x7: read 1530 catalog entries, created 537 event attrs (0 
> failures), 275 descs
>
> after fix
>
> [5.101572] IOMMU table initialized, virtual merging enabled
> [5.116984] papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Querying 
> SCM details
> [5.117223] papr_scm ibm,persistent-memory:ibm,pmemory@44108001: Querying 
> SCM details
> [5.120530] hv-24x7: read 1530 catalog entries, created 537 event attrs (0 
> failures), 275 descs
>
> Signed-off-by: Aneesh Kumar K.V 
> ---
> Changes from V1:
> * Use the first block and last block to query the logical bind memory
> * If we fail to query, unbind and retry the bind.
>
>
>  arch/powerpc/platforms/pseries/papr_scm.c | 48 +++
>  1 file changed, 40 insertions(+), 8 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
> b/arch/powerpc/platforms/pseries/papr_scm.c
> index 3bef4d298ac6..61883291defc 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -65,10 +65,8 @@ static int drc_pmem_bind(struct papr_scm_priv *p)
>   cond_resched();
>   } while (rc == H_BUSY);
>  
> - if (rc) {
> - dev_err(&p->pdev->dev, "bind err: %lld\n", rc);
> + if (rc)
>   return rc;
> - }
>  
>   p->bound_addr = saved;
> + dev_dbg(&p->pdev->dev, "bound drc 0x%x to %pR\n", p->drc_index, &p->res);
> @@ -110,6 +108,42 @@ static void drc_pmem_unbind(struct papr_scm_priv *p)
>   return;
>  }
>  
> +static int drc_pmem_query_n_bind(struct papr_scm_priv *p)
> +{
> + unsigned long start_addr;
> + unsigned long end_addr;
> + unsigned long ret[PLPAR_HCALL_BUFSIZE];
> + int64_t rc;
> +
> +
> + rc = plpar_hcall(H_SCM_QUERY_BLOCK_MEM_BINDING, ret,
> +  p->drc_index, 0);
> + if (rc)
> + goto err_out;
> + start_addr = ret[0];
> +
> + /* Make sure the full region is bound. */
> + rc = plpar_hcall(H_SCM_QUERY_BLOCK_MEM_BINDING, ret,
> +  p->drc_index, p->blocks - 1);
> + if (rc)
> + goto err_out;
> + end_addr = ret[0];
> +
> + if ((end_addr - start_addr) != ((p->blocks - 1) * p->block_size))
> + goto err_out;
> +
> + p->bound_addr = start_addr;
> + dev_dbg(&p->pdev->dev, "bound drc 0x%x to %pR\n", p->drc_index, &p->res);
> + return rc;
> +

> +err_out:
> + dev_info(&p->pdev->dev,
> +  "Failed to query, trying an unbind followed by bind");
> + drc_pmem_unbind(p);
> + return drc_pmem_bind(p);
> +}
Would have preferred error handling for bind failure to be done at a
single location, i.e. in papr_scm_probe(), rather than in
drc_pmem_query_n_bind().

> +
> +
>  static int papr_scm_meta_get(struct papr_scm_priv *p,
>struct nd_cmd_get_config_data_hdr *hdr)
>  {
> @@ -430,13 +464,11 @@ static int papr_scm_probe(struct platform_device *pdev)
>   rc = drc_pmem_bind(p);
>  
>   /* If phyp says drc memory still bound then force unbound and retry */
> - if (rc == H_OVERLAP) {
> - dev_warn(>dev, "Retrying bind after unbinding\n");
> - drc_pmem_unbind(p);
> - rc = drc_pmem_bind(p);
> - }
> + if (rc == H_OVERLAP)
> + rc = drc_pmem_query_n_bind(p);
>  
>   if (rc != H_SUCCESS) {
> + dev_err(&p->pdev->dev, "bind err: %d\n", rc);
>   rc = -ENXIO;
>   goto err;
>   }
> -- 
> 2.21.0
>

-- 
Vaibhav Jain 
Linux Technology Center, IBM India Pvt. Ltd.


