Re: [EXTERNAL] Re: [PATCH v2 3/6] powerpc/eeh: Improve debug messages around device addition

2019-07-17 Thread Sam Bobroff
On Tue, Jul 16, 2019 at 05:00:44PM +1000, Oliver O'Halloran wrote:
> On Tue, 2019-07-16 at 16:48 +1000, Sam Bobroff wrote:
> > On Thu, Jun 20, 2019 at 01:45:24PM +1000, Oliver O'Halloran wrote:
> > > On Thu, Jun 20, 2019 at 12:40 PM Alexey Kardashevskiy wrote:
> > > > On 19/06/2019 14:27, Sam Bobroff wrote:
> > > > > On Tue, Jun 11, 2019 at 03:47:58PM +1000, Alexey Kardashevskiy wrote:
> > > > > > On 07/05/2019 14:30, Sam Bobroff wrote:
> > > > > > > Also remove useless comment.
> > > > > > > 
> > > > > > > Signed-off-by: Sam Bobroff 
> > > > > > > Reviewed-by: Alexey Kardashevskiy 
> > > > > > > ---
> > > > *snip*
> > > > > I can see that edev will be non-NULL here, but that pr_debug() pattern
> > > > > (using the PDN information to form the PCI address) is quite common
> > > > > across the EEH code, so I think rather than changing a couple of
> > > > > specific cases, I should do a separate cleanup patch and introduce
> > > > > something like pdn_debug(pdn, ""). What do you think?
> > > > 
> > > > I'd switch them all to already existing dev_dbg/pci_debug rather than
> > > > adding pdn_debug as imho it should not have been used in the first place
> > > > really...
> > > > 
> > > > > (I don't know exactly when edev->pdev can be NULL.)
> > > > 
> > > > ... and if you switch to dev_dbg/pci_debug, I think quite soon you'll
> > > > know if it can or cannot be NULL :)
> > > 
> > > As far as I can tell edev->pdev is NULL in two cases:
> > > 
> > > 1. Before eeh_device_add_late() has been called on the pdev. The late
> > > part of the add maps the pdev to an edev and sets the pdev's edev
> > > pointer and vice versa.
> > > 2. While recovering EEH-unaware devices. Unaware devices are
> > > destroyed and rescanned and the edev->pdev pointer is cleared by
> > > pcibios_device_release()
> > > 
> > > In most of these cases it should be safe to use the pci_*() functions
> > > rather than making up a new one for printing pdns. In the cases where
> > > we might not have a pci_dev I'd make a new set of prints that take an
> > > EEH dev rather than a pci_dn, since I'd like pci_dn to die sooner
> > > rather than later.
> > > 
> > > Oliver
> > 
> > I'll change the calls in {pnv,pseries}_pcibios_bus_add_device() and
> > eeh_add_device_late() to use dev_dbg() and post a new version.
> > 
> > For {pnv,pseries}_eeh_probe() I'm not sure what we can do; there's no
> > pci_dev available yet and while it would be nice to use the eeh_dev
> > rather than the pdn, it doesn't seem to have the bus/device/fn
> > information we need. Am I missing something there?  (The code in the
> > probe functions seems to get it from the pci_dn.)
> 
> We do have a pci_dev in the powernv case since pnv_eeh_probe() isn't
> called until the late probe happens (which is after the pci_dev has
> been created). I've got some patches to rework the probe path to make
> this a bit clearer, but they need a bit more work.
> 
> > 
> > If there isn't an easy way around this, would it therefore be reasonable
> > to just leave them open-coded as they are?
> 
> I've had this patch floating around a while that should do the trick.
> The PCI_BUSNO macro is probably unnecessary since I'm sure there is
> something that does it in generic code, but I couldn't find it.

Looks good, I'll try including it and create a dev_dbg style function
or macro that takes an edev.

I don't think I can use it in the pcibios bus add device handlers (where
there is no edev, or where it may be attached to the wrong device) but
I'll use it for all the other cases.

If it works out well I can follow up and update more of the EEH logging
to use it :-)

> From 61ff8c23c4d13ff640fb2d069dcedacdf2ee22dd Mon Sep 17 00:00:00 2001
> From: Oliver O'Halloran 
> Date: Thu, 18 Apr 2019 18:25:13 +1000
> Subject: [PATCH] powerpc/eeh: Add bdfn field to eeh_dev
> 
> Preparation for removing pci_dn from the powernv EEH code. The only thing we
> really use pci_dn for is to get the bdfn of the device for config space
> accesses, so adding that information to eeh_dev reduces the need to carry
> around the pci_dn.
> 
> Signed-off-by: Oliver O'Halloran 
> ---
>  arch/powerpc/include/asm/eeh.h | 2 ++
>  arch/powerpc/include/asm/ppc-pci.h | 2 ++
>  arch/powerpc/kernel/eeh_dev.c  | 2 ++
>  3 files changed, 6 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
> index 7fd476d..a208e02 100644
> --- a/arch/powerpc/include/asm/eeh.h
> +++ b/arch/powerpc/include/asm/eeh.h
> @@ -131,6 +131,8 @@ static inline bool eeh_pe_passed(struct eeh_pe *pe)
>  struct eeh_dev {
>   int mode;   /* EEH mode */
>   int class_code; /* Class code of the device */
> + int bdfn;   /* bdfn of device (for cfg ops) */
> + struct pci_controller *controller;
>   int pe_config_addr; /* PE config address*/
>   u32 config_space[16];   /* Saved PCI config space */

[PATCH kernel v5 4/4] powerpc/powernv/ioda2: Create bigger default window with 64k IOMMU pages

2019-07-17 Thread Alexey Kardashevskiy
At the moment we create a small window only for 32bit devices; the window
maps only 0..2GB of the PCI space. For other devices we use either
a sketchy bypass or a hardware bypass, but the former can only work if
the amount of RAM is no bigger than the device's DMA mask and the latter
requires devices to support at least 59bit DMA.

This extends the default DMA window to the maximum size possible to allow
a DMA mask wider than just 32bit. The default window size is now limited
by the iommu_table::it_map allocation bitmap, which is a contiguous
array with one bit per IOMMU page.

This increases the default IOMMU page size from the hard-coded 4K to
the system page size to allow wider DMA masks.

This increases the number of levels so that no single TCE level exceeds
the max order allocation limit. At the same time, this keeps the minimal
number of levels at 2 in order to save memory.

As the extended window now overlaps the 32bit MMIO region, this adds
an area reservation to iommu_init_table().

After this change the default window size is 0x80000000000 == 1<<43 so
devices limited to a DMA mask smaller than the amount of system RAM can
still use more than just 2GB of memory for DMA.

This is an optimization and not a bug fix for DMA API usage.

With the on-demand allocation of indirect TCE table levels enabled and
2 levels, the first TCE level size is just
1<
---
Changes:
v5:
* ditched iommu_init_table_res and pass start..end to iommu_init_table
directly
* fixed WARN_ON in iommu_table_reserve_pages (was opposite)

v4:
* fixed take/release ownership handlers
* fixed reserved region for tables with it_offset!=0 (this is not going
to be exploited here but still this is a correct behavior)

v3:
* fixed tce levels calculation

v2:
* adjusted level number to the max order
---
 arch/powerpc/include/asm/iommu.h  |  7 ++-
 arch/powerpc/kernel/iommu.c   | 74 ---
 arch/powerpc/platforms/cell/iommu.c   |  2 +-
 arch/powerpc/platforms/pasemi/iommu.c |  2 +-
 arch/powerpc/platforms/powernv/pci-ioda.c | 42 ++---
 arch/powerpc/platforms/pseries/iommu.c|  8 +--
 arch/powerpc/platforms/pseries/vio.c  |  2 +-
 arch/powerpc/sysdev/dart_iommu.c  |  2 +-
 8 files changed, 100 insertions(+), 39 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 18d342b815e4..d7bf1f104c15 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -111,6 +111,8 @@ struct iommu_table {
struct iommu_table_ops *it_ops;
struct kref it_kref;
int it_nid;
+   unsigned long it_reserved_start; /* Start of not-DMA-able (MMIO) area */
+   unsigned long it_reserved_end;
 };
 
 #define IOMMU_TABLE_USERSPACE_ENTRY_RO(tbl, entry) \
@@ -149,8 +151,9 @@ extern int iommu_tce_table_put(struct iommu_table *tbl);
 /* Initializes an iommu_table based in values set in the passed-in
  * structure
  */
-extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
-   int nid);
+extern struct iommu_table *iommu_init_table(struct iommu_table *tbl,
+   int nid, unsigned long res_start, unsigned long res_end);
+
 #define IOMMU_TABLE_GROUP_MAX_TABLES   2
 
 struct iommu_table_group;
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 0a67ce9f827e..e7a2b160d4c6 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -633,11 +633,54 @@ static void iommu_table_clear(struct iommu_table *tbl)
 #endif
 }
 
+static void iommu_table_reserve_pages(struct iommu_table *tbl,
+   unsigned long res_start, unsigned long res_end)
+{
+   int i;
+
+   WARN_ON_ONCE(res_end < res_start);
+   /*
+* Reserve page 0 so it will not be used for any mappings.
+* This avoids buggy drivers that consider page 0 to be invalid
+* to crash the machine or even lose data.
+*/
+   if (tbl->it_offset == 0)
+   set_bit(0, tbl->it_map);
+
+   tbl->it_reserved_start = res_start;
+   tbl->it_reserved_end = res_end;
+
+   /* Skip if res_start..res_end is empty or doesn't overlap the table */
+   if (res_start && res_end &&
+   (tbl->it_offset + tbl->it_size < res_start ||
+res_end < tbl->it_offset))
+   return;
+
+   for (i = tbl->it_reserved_start; i < tbl->it_reserved_end; ++i)
+   set_bit(i - tbl->it_offset, tbl->it_map);
+}
+
+static void iommu_table_release_pages(struct iommu_table *tbl)
+{
+   int i;
+
+   /*
+* In case we have reserved the first bit, we should not emit
+* the warning below.
+*/
+   if (tbl->it_offset == 0)
+   clear_bit(0, tbl->it_map);
+
+   for (i = tbl->it_reserved_start; i < tbl->it_reserved_end; ++i)
+   clear_bit(i - tbl->it_offset, tbl->it_map);
+}
+
 /*
  * Build a iommu_table structure.  This contains a bit map 

[PATCH kernel v5 3/4] powerpc/powernv/ioda2: Allocate TCE table levels on demand for default DMA window

2019-07-17 Thread Alexey Kardashevskiy
We already allocate only the first level of multilevel TCE tables for KVM
(alloc_userspace_copy==true), and the rest is allocated on demand.
This is not enabled for bare metal, though.

This removes the KVM limitation (implicit, via the alloc_userspace_copy
parameter) and always allocates just the first level. The on-demand
allocation of missing levels is already implemented.

Since the DMA map might now happen with interrupts disabled, this
allocates TCEs with GFP_ATOMIC; otherwise lockdep reports errors [1].
In practice just a single page is allocated there so the chances of
failure are quite low.

To save time when creating a new clean table, this skips non-allocated
indirect TCE entries in pnv_tce_free just like we already do in
the VFIO IOMMU TCE driver.

This changes the default level number from 1 to 2 to reduce the amount
of memory required for the default 32bit DMA window at boot time.
The default window size is up to 2GB, which requires 4MB of TCEs that are
unlikely to be used entirely or at all as most devices these days are
64bit capable, so by switching to 2 levels by default we save 4032KB of
RAM per device.

While at it, add __GFP_NOWARN to alloc_pages_node() as userspace
can trigger this path via VFIO, see the failure, and try creating a table
again with different parameters which might succeed.

[1]:
===
BUG: sleeping function called from invalid context at mm/page_alloc.c:4596
in_atomic(): 1, irqs_disabled(): 1, pid: 1038, name: scsi_eh_1
2 locks held by scsi_eh_1/1038:
 #0: 5efd659a (&host->eh_mutex){+.+.}, at: ata_eh_acquire+0x34/0x80
 #1: 06cf56a6 (&(&host->lock)->rlock){}, at: 
ata_exec_internal_sg+0xb0/0x5c0
irq event stamp: 500
hardirqs last  enabled at (499): [] 
_raw_spin_unlock_irqrestore+0x94/0xd0
hardirqs last disabled at (500): [] 
_raw_spin_lock_irqsave+0x44/0x120
softirqs last  enabled at (0): [] 
copy_process.isra.4.part.5+0x640/0x1a80
softirqs last disabled at (0): [<>] 0x0
CPU: 73 PID: 1038 Comm: scsi_eh_1 Not tainted 5.2.0-rc6-le_nv2_aikATfstn1-p1 
#634
Call Trace:
[c03d064cef50] [c0c8e6c4] dump_stack+0xe8/0x164 (unreliable)
[c03d064cefa0] [c014ed78] ___might_sleep+0x2f8/0x310
[c03d064cf020] [c03ca084] __alloc_pages_nodemask+0x2a4/0x1560
[c03d064cf220] [c00c2530] pnv_alloc_tce_level.isra.0+0x90/0x130
[c03d064cf290] [c00c2888] pnv_tce+0x128/0x3b0
[c03d064cf360] [c00c2c00] pnv_tce_build+0xb0/0xf0
[c03d064cf3c0] [c00bbd9c] pnv_ioda2_tce_build+0x3c/0xb0
[c03d064cf400] [c004cfe0] ppc_iommu_map_sg+0x210/0x550
[c03d064cf510] [c004b7a4] dma_iommu_map_sg+0x74/0xb0
[c03d064cf530] [c0863944] ata_qc_issue+0x134/0x470
[c03d064cf5b0] [c0863ec4] ata_exec_internal_sg+0x244/0x5c0
[c03d064cf700] [c08642d0] ata_exec_internal+0x90/0xe0
[c03d064cf780] [c08650ac] ata_dev_read_id+0x2ec/0x640
[c03d064cf8d0] [c0878e28] ata_eh_recover+0x948/0x16d0
[c03d064cfa10] [c087d760] sata_pmp_error_handler+0x480/0xbf0
[c03d064cfbc0] [c0884624] ahci_error_handler+0x74/0xe0
[c03d064cfbf0] [c0879fa8] ata_scsi_port_error_handler+0x2d8/0x7c0
[c03d064cfca0] [c087a544] ata_scsi_error+0xb4/0x100
[c03d064cfd00] [c0802450] scsi_error_handler+0x120/0x510
[c03d064cfdb0] [c0140c48] kthread+0x1b8/0x1c0
[c03d064cfe20] [c000bd8c] ret_from_kernel_thread+0x5c/0x70
ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
irq event stamp: 2305


hardirqs last  enabled at (2305): [] 
fast_exc_return_irq+0x28/0x34
hardirqs last disabled at (2303): [] __do_softirq+0x4a0/0x654
WARNING: possible irq lock inversion dependency detected
5.2.0-rc6-le_nv2_aikATfstn1-p1 #634 Tainted: GW
softirqs last  enabled at (2304): [] __do_softirq+0x524/0x654
softirqs last disabled at (2297): [] irq_exit+0x128/0x180

swapper/0/0 just changed the state of lock:
06cf56a6 (&(&host->lock)->rlock){-...}, at: 
ahci_single_level_irq_intr+0xac/0x120
but this lock took another, HARDIRQ-unsafe lock in the past:
 (fs_reclaim){+.+.}


and interrupts could create inverse lock ordering between them.


other info that might help us debug this:
 Possible interrupt unsafe locking scenario:

   CPU0CPU1
   
  lock(fs_reclaim);
   local_irq_disable();
   lock(&(&host->lock)->rlock);
   lock(fs_reclaim);
  
lock(&(&host->lock)->rlock);

 *** DEADLOCK ***

no locks held by swapper/0/0.

the shortest dependencies between 2nd lock and 1st lock:
 -> (fs_reclaim){+.+.} ops: 167579 {
HARDIRQ-ON-W at:
  lock_acquire+0xf8/0x2a0
  fs_reclaim_acquire.part.23+0x44/0x60
   

[PATCH kernel v5 2/4] powerpc/iommu: Allow bypass-only for DMA

2019-07-17 Thread Alexey Kardashevskiy
POWER8 and newer support a bypass mode which maps all host memory to
PCI buses so an IOMMU table is not always required. However, if we fail to
create such a table, the DMA setup fails and the kernel does not boot.

This skips the 32bit DMA setup check if the bypass is selected.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
---

This minor thing helped me debugging next 2 patches so it can help
somebody else too.
---
 arch/powerpc/kernel/dma-iommu.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/dma-iommu.c b/arch/powerpc/kernel/dma-iommu.c
index a0879674a9c8..c963d704fa31 100644
--- a/arch/powerpc/kernel/dma-iommu.c
+++ b/arch/powerpc/kernel/dma-iommu.c
@@ -122,18 +122,17 @@ int dma_iommu_dma_supported(struct device *dev, u64 mask)
 {
struct iommu_table *tbl = get_iommu_table_base(dev);
 
-   if (!tbl) {
-   dev_info(dev, "Warning: IOMMU dma not supported: mask 0x%08llx"
-   ", table unavailable\n", mask);
-   return 0;
-   }
-
if (dev_is_pci(dev) && dma_iommu_bypass_supported(dev, mask)) {
dev->archdata.iommu_bypass = true;
dev_dbg(dev, "iommu: 64-bit OK, using fixed ops\n");
return 1;
}
 
+   if (!tbl) {
+   dev_err(dev, "Warning: IOMMU dma not supported: mask 0x%08llx, table unavailable\n", mask);
+   return 0;
+   }
+
if (tbl->it_offset > (mask >> tbl->it_page_shift)) {
dev_info(dev, "Warning: IOMMU offset too big for device mask\n");
dev_info(dev, "mask: 0x%08llx, table offset: 0x%08lx\n",
-- 
2.17.1



[PATCH kernel v5 0/4] powerpc/ioda2: Yet another attempt to allow DMA masks between 32 and 59

2019-07-17 Thread Alexey Kardashevskiy


This is an attempt to allow DMA masks between 32..59 which are not large
enough to use either a PHB3 bypass mode or a sketchy bypass. Depending
on the max order, up to 40 is usually available.

This is an optimization and not a bug fix for DMA API usage.

Changelogs are in the patches.


This is based on sha1
a2b6f26c264e Christophe Leroy "powerpc/module64: Use symbolic instructions names.".

Please comment. Thanks.



Alexey Kardashevskiy (4):
  powerpc/powernv/ioda: Fix race in TCE level allocation
  powerpc/iommu: Allow bypass-only for DMA
  powerpc/powernv/ioda2: Allocate TCE table levels on demand for default
DMA window
  powerpc/powernv/ioda2: Create bigger default window with 64k IOMMU
pages

 arch/powerpc/include/asm/iommu.h  |  7 +-
 arch/powerpc/platforms/powernv/pci.h  |  2 +-
 arch/powerpc/kernel/dma-iommu.c   | 11 ++-
 arch/powerpc/kernel/iommu.c   | 74 +--
 arch/powerpc/platforms/cell/iommu.c   |  2 +-
 arch/powerpc/platforms/pasemi/iommu.c |  2 +-
 arch/powerpc/platforms/powernv/pci-ioda-tce.c | 38 ++
 arch/powerpc/platforms/powernv/pci-ioda.c | 42 +--
 arch/powerpc/platforms/pseries/iommu.c|  8 +-
 arch/powerpc/platforms/pseries/vio.c  |  2 +-
 arch/powerpc/sysdev/dart_iommu.c  |  2 +-
 11 files changed, 129 insertions(+), 61 deletions(-)

-- 
2.17.1



[PATCH kernel v5 1/4] powerpc/powernv/ioda: Fix race in TCE level allocation

2019-07-17 Thread Alexey Kardashevskiy
pnv_tce() returns a pointer to a TCE entry and originally a TCE table
would be pre-allocated. For the default case of a 2GB window the table
needs only a single level and that is fine. However, if more levels are
requested, it is possible to get a race when 2 threads want a pointer
to a TCE entry from the same page of TCEs.

This adds cmpxchg to handle the race. Note that once a TCE is non-zero,
it cannot become zero again.

CC: sta...@vger.kernel.org # v4.19+
Fixes: a68bd1267b72 ("powerpc/powernv/ioda: Allocate indirect TCE levels on demand")
Signed-off-by: Alexey Kardashevskiy 
---

The race occurs about 30 times in the first 3 minutes of copying files
via rsync and that's about it.

This fixes EEH's from
https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=110810

---
Changes:
v2:
* replaced spin_lock with cmpxchg+readonce
---
 arch/powerpc/platforms/powernv/pci-ioda-tce.c | 18 +-
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda-tce.c b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
index e28f03e1eb5e..8d6569590161 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda-tce.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
@@ -48,6 +48,9 @@ static __be64 *pnv_alloc_tce_level(int nid, unsigned int shift)
return addr;
 }
 
+static void pnv_pci_ioda2_table_do_free_pages(__be64 *addr,
+   unsigned long size, unsigned int levels);
+
static __be64 *pnv_tce(struct iommu_table *tbl, bool user, long idx, bool alloc)
 {
__be64 *tmp = user ? tbl->it_userspace : (__be64 *) tbl->it_base;
@@ -57,9 +60,9 @@ static __be64 *pnv_tce(struct iommu_table *tbl, bool user, long idx, bool alloc)
 
while (level) {
int n = (idx & mask) >> (level * shift);
-   unsigned long tce;
+   unsigned long oldtce, tce = be64_to_cpu(READ_ONCE(tmp[n]));
 
-   if (tmp[n] == 0) {
+   if (!tce) {
__be64 *tmp2;
 
if (!alloc)
@@ -70,10 +73,15 @@ static __be64 *pnv_tce(struct iommu_table *tbl, bool user, long idx, bool alloc)
if (!tmp2)
return NULL;
 
-   tmp[n] = cpu_to_be64(__pa(tmp2) |
-   TCE_PCI_READ | TCE_PCI_WRITE);
+   tce = __pa(tmp2) | TCE_PCI_READ | TCE_PCI_WRITE;
+   oldtce = be64_to_cpu(cmpxchg(&tmp[n], 0,
+   cpu_to_be64(tce)));
+   if (oldtce) {
+   pnv_pci_ioda2_table_do_free_pages(tmp2,
+   ilog2(tbl->it_level_size) + 3, 1);
+   tce = oldtce;
+   }
}
-   tce = be64_to_cpu(tmp[n]);
 
tmp = __va(tce & ~(TCE_PCI_READ | TCE_PCI_WRITE));
idx &= ~mask;
-- 
2.17.1



[PATCH kernel v4 2/2] powerpc/xive: Drop current cpu priority for orphaned interrupts

2019-07-17 Thread Alexey Kardashevskiy
There is a race between releasing an irq on one cpu and fetching it
from XIVE on another cpu. When such a released irq appears in a queue,
we take it from the queue, but we do not change the current priority
on that cpu; since there is no handler for the irq, EOI is never
called and the cpu's current priority remains elevated
(7 vs. 0xff==unmasked). If another irq is then assigned to the same cpu,
that device stops working until the irq is moved to another cpu or
the device is reset.

This implements a ppc_md.orphan_irq callback which is called if no irq
descriptor is found and which drops the current priority
to 0xff, effectively unmasking interrupts on the current CPU.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/sysdev/xive/common.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 082c7e1c20f0..17e696b2d71b 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -283,6 +283,23 @@ static unsigned int xive_get_irq(void)
return irq;
 }
 
+/*
+ * Handles the case when a target CPU catches an interrupt which is being shut
+ * down on another CPU. generic_handle_irq() returns an error in such a case
+ * and then the orphan_irq() handler restores the CPPR to re-enable interrupts.
+ *
+ * Without orphan_irq() and valid irq_desc, there is no other way to restore
+ * the CPPR. This executes on a CPU which caught the interrupt.
+ */
+static void xive_orphan_irq(unsigned int irq)
+{
+   struct xive_cpu *xc = __this_cpu_read(xive_cpu);
+
+   xc->cppr = 0xff;
+   out_8(xive_tima + xive_tima_offset + TM_CPPR, 0xff);
+   DBG_VERBOSE("orphan_irq: irq %d, adjusting CPPR to 0xff\n", irq);
+}
+
 /*
  * After EOI'ing an interrupt, we need to re-check the queue
  * to see if another interrupt is pending since multiple
@@ -1419,6 +1436,7 @@ bool __init xive_core_init(const struct xive_ops *ops, void __iomem *area, u32 o
xive_irq_priority = max_prio;
 
ppc_md.get_irq = xive_get_irq;
+   ppc_md.orphan_irq = xive_orphan_irq;
__xive_enabled = true;
 
pr_devel("Initializing host..\n");
-- 
2.17.1



[PATCH kernel v4 1/2] powerpc: Add handler for orphaned interrupts

2019-07-17 Thread Alexey Kardashevskiy
The test on generic_handle_irq() catches interrupt events that
were served on a target CPU while the source interrupt was being
shut down on another CPU. This may lead to a blocked interrupt queue
on the target CPU, so if there is another irq assigned to that CPU,
that device stops working.

This adds the necessary infrastructure to allow a platform to deal with
it. The next patch implements it for XIVE.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/machdep.h | 3 +++
 arch/powerpc/kernel/irq.c  | 9 ++---
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index c43d6eca9edd..6cc14e28e89a 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -59,6 +59,9 @@ struct machdep_calls {
/* Return an irq, or 0 to indicate there are none pending. */
unsigned int(*get_irq)(void);
 
+   /* Drops irq if it does not have a valid descriptor */
+   void(*orphan_irq)(unsigned int irq);
+
/* PCI stuff */
/* Called after allocating resources */
void(*pcibios_fixup)(void);
diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 5645bc9cbc09..e0689dcb17f0 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -632,10 +632,13 @@ void __do_irq(struct pt_regs *regs)
may_hard_irq_enable();
 
/* And finally process it */
-   if (unlikely(!irq))
+   if (unlikely(!irq)) {
__this_cpu_inc(irq_stat.spurious_irqs);
-   else
-   generic_handle_irq(irq);
+   } else if (generic_handle_irq(irq)) {
+   if (ppc_md.orphan_irq)
+   ppc_md.orphan_irq(irq);
+   __this_cpu_inc(irq_stat.spurious_irqs);
+   }
 
trace_irq_exit(regs);
 
-- 
2.17.1



[PATCH kernel v4 0/2] powerpc/xive: Drop deregistered irqs

2019-07-17 Thread Alexey Kardashevskiy
Now 2 patches.
v3 is here: https://patchwork.ozlabs.org/patch/1133106/


This is based on sha1
a2b6f26c264e Christophe Leroy "powerpc/module64: Use symbolic instructions names.".

Please comment. Thanks.



Alexey Kardashevskiy (2):
  powerpc: Add handler for orphaned interrupts
  powerpc/xive: Drop current cpu priority for orphaned interrupts

 arch/powerpc/include/asm/machdep.h |  3 +++
 arch/powerpc/kernel/irq.c  |  9 ++---
 arch/powerpc/sysdev/xive/common.c  | 18 ++
 3 files changed, 27 insertions(+), 3 deletions(-)

-- 
2.17.1



Re: [PATCH] powerpc/dma: Fix invalid DMA mmap behavior

2019-07-17 Thread Oliver O'Halloran
On Thu, Jul 18, 2019 at 1:16 PM Shawn Anastasio  wrote:
>
> On 7/17/19 9:59 PM, Alexey Kardashevskiy wrote:
> >
> > On 18/07/2019 09:54, Shawn Anastasio wrote:
> >> The refactor of powerpc DMA functions in commit cc17d780
> >> ("powerpc/dma: remove dma_nommu_mmap_coherent") incorrectly
> >> changes the way DMA mappings are handled on powerpc.
> >> Since this change, all mapped pages are marked as cache-inhibited
> >> through the default implementation of arch_dma_mmap_pgprot.
> >> This differs from the previous behavior of only marking pages
> >> in noncoherent mappings as cache-inhibited and has resulted in
> >> sporadic system crashes in certain hardware configurations and
> >> workloads (see Bugzilla).
> >>
> >> This commit restores the previous correct behavior by providing
> >> an implementation of arch_dma_mmap_pgprot that only marks
> >> pages in noncoherent mappings as cache-inhibited. As this behavior
> >> should be universal for all powerpc platforms a new file,
> >> dma-generic.c, was created to store it.
> >>
> >> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204145
> >> Fixes: cc17d780 ("powerpc/dma: remove dma_nommu_mmap_coherent")
> >> Signed-off-by: Shawn Anastasio 
> >
> > Is this the default one?
> >
> > include/linux/dma-noncoherent.h
> > # define arch_dma_mmap_pgprot(dev, prot, attrs) pgprot_noncached(prot)
>
> Yep, that's the one.
>
> > Out of curiosity - do not we want to fix this one for everyone?
>
> Other than m68k, mips, and arm64, everybody else that doesn't have
> ARCH_NO_COHERENT_DMA_MMAP set uses this default implementation, so
> I assume this behavior is acceptable on those architectures.

It might be acceptable, but there's no reason to use pgprot_noncached
if the platform supports cache-coherent DMA.

Christoph (+cc) made the change so maybe he saw something we're missing.

> >> ---
> >>   arch/powerpc/Kconfig |  1 +
> >>   arch/powerpc/kernel/Makefile |  3 ++-
> >>   arch/powerpc/kernel/dma-common.c | 17 +
> >>   3 files changed, 20 insertions(+), 1 deletion(-)
> >>   create mode 100644 arch/powerpc/kernel/dma-common.c
> >>
> >> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> >> index d8dcd8820369..77f6ebf97113 100644
> >> --- a/arch/powerpc/Kconfig
> >> +++ b/arch/powerpc/Kconfig
> >> @@ -121,6 +121,7 @@ config PPC
> >>   select ARCH_32BIT_OFF_T if PPC32
> >>   select ARCH_HAS_DEBUG_VIRTUAL
> >>   select ARCH_HAS_DEVMEM_IS_ALLOWED
> >> +select ARCH_HAS_DMA_MMAP_PGPROT
> >>   select ARCH_HAS_ELF_RANDOMIZE
> >>   select ARCH_HAS_FORTIFY_SOURCE
> >>   select ARCH_HAS_GCOV_PROFILE_ALL
> >> diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
> >> index 56dfa7a2a6f2..ea0c69236789 100644
> >> --- a/arch/powerpc/kernel/Makefile
> >> +++ b/arch/powerpc/kernel/Makefile
> >> @@ -49,7 +49,8 @@ obj-y:= cputable.o ptrace.o
> >> syscalls.o \
> >>  signal.o sysfs.o cacheinfo.o time.o \
> >>  prom.o traps.o setup-common.o \
> >>  udbg.o misc.o io.o misc_$(BITS).o \
> >> -   of_platform.o prom_parse.o
> >> +   of_platform.o prom_parse.o \
> >> +   dma-common.o
> >>   obj-$(CONFIG_PPC64)+= setup_64.o sys_ppc32.o \
> >>  signal_64.o ptrace32.o \
> >>  paca.o nvram_64.o firmware.o
> >> diff --git a/arch/powerpc/kernel/dma-common.c
> >> b/arch/powerpc/kernel/dma-common.c
> >> new file mode 100644
> >> index ..5a15f99f4199
> >> --- /dev/null
> >> +++ b/arch/powerpc/kernel/dma-common.c
> >> @@ -0,0 +1,17 @@
> >> +// SPDX-License-Identifier: GPL-2.0-or-later
> >> +/*
> >> + * Contains common dma routines for all powerpc platforms.
> >> + *
> >> + * Copyright (C) 2019 Shawn Anastasio (sh...@anastas.io)
> >> + */
> >> +
> >> +#include 
> >> +#include 
> >> +
> >> +pgprot_t arch_dma_mmap_pgprot(struct device *dev, pgprot_t prot,
> >> +unsigned long attrs)
> >> +{
> >> +if (!dev_is_dma_coherent(dev))
> >> +return pgprot_noncached(prot);
> >> +return prot;
> >> +}
> >>
> >


[PATCH v3 6/6] s390/mm: Remove sev_active() function

2019-07-17 Thread Thiago Jung Bauermann
All references to sev_active() were moved to arch/x86 so we don't need to
define it for s390 anymore.

Signed-off-by: Thiago Jung Bauermann 
---
 arch/s390/include/asm/mem_encrypt.h | 1 -
 arch/s390/mm/init.c | 8 +---
 2 files changed, 1 insertion(+), 8 deletions(-)

diff --git a/arch/s390/include/asm/mem_encrypt.h b/arch/s390/include/asm/mem_encrypt.h
index ff813a56bc30..2542cbf7e2d1 100644
--- a/arch/s390/include/asm/mem_encrypt.h
+++ b/arch/s390/include/asm/mem_encrypt.h
@@ -5,7 +5,6 @@
 #ifndef __ASSEMBLY__
 
 static inline bool mem_encrypt_active(void) { return false; }
-extern bool sev_active(void);
 
 int set_memory_encrypted(unsigned long addr, int numpages);
 int set_memory_decrypted(unsigned long addr, int numpages);
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index 78c319c5ce48..6286eb3e815b 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -155,15 +155,9 @@ int set_memory_decrypted(unsigned long addr, int numpages)
return 0;
 }
 
-/* are we a protected virtualization guest? */
-bool sev_active(void)
-{
-   return is_prot_virt_guest();
-}
-
 bool force_dma_unencrypted(struct device *dev)
 {
-   return sev_active();
+   return is_prot_virt_guest();
 }
 
 /* protected virtualization */


[PATCH v3 4/6] x86, s390/mm: Move sme_active() and sme_me_mask to x86-specific header

2019-07-17 Thread Thiago Jung Bauermann
Now that generic code doesn't reference them, move sme_active() and
sme_me_mask to an x86-specific header.

Also remove the export for sme_active() since it's only used in files that
won't be built as modules. sme_me_mask on the other hand is used in
arch/x86/kvm/svm.c (via __sme_set() and __psp_pa()) which can be built as a
module so its export needs to stay.

Signed-off-by: Thiago Jung Bauermann 
---
 arch/s390/include/asm/mem_encrypt.h |  4 +---
 arch/x86/include/asm/mem_encrypt.h  | 10 ++
 arch/x86/mm/mem_encrypt.c   |  1 -
 include/linux/mem_encrypt.h | 14 +-
 4 files changed, 12 insertions(+), 17 deletions(-)

diff --git a/arch/s390/include/asm/mem_encrypt.h b/arch/s390/include/asm/mem_encrypt.h
index 3eb018508190..ff813a56bc30 100644
--- a/arch/s390/include/asm/mem_encrypt.h
+++ b/arch/s390/include/asm/mem_encrypt.h
@@ -4,9 +4,7 @@
 
 #ifndef __ASSEMBLY__
 
-#define sme_me_mask0ULL
-
-static inline bool sme_active(void) { return false; }
+static inline bool mem_encrypt_active(void) { return false; }
 extern bool sev_active(void);
 
 int set_memory_encrypted(unsigned long addr, int numpages);
diff --git a/arch/x86/include/asm/mem_encrypt.h b/arch/x86/include/asm/mem_encrypt.h
index 0c196c47d621..848ce43b9040 100644
--- a/arch/x86/include/asm/mem_encrypt.h
+++ b/arch/x86/include/asm/mem_encrypt.h
@@ -92,6 +92,16 @@ early_set_memory_encrypted(unsigned long vaddr, unsigned long size) { return 0;
 
extern char __start_bss_decrypted[], __end_bss_decrypted[], __start_bss_decrypted_unused[];
 
+static inline bool mem_encrypt_active(void)
+{
+   return sme_me_mask;
+}
+
+static inline u64 sme_get_me_mask(void)
+{
+   return sme_me_mask;
+}
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* __X86_MEM_ENCRYPT_H__ */
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index c805f0a5c16e..7139f2f43955 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -344,7 +344,6 @@ bool sme_active(void)
 {
return sme_me_mask && !sev_enabled;
 }
-EXPORT_SYMBOL(sme_active);
 
 bool sev_active(void)
 {
diff --git a/include/linux/mem_encrypt.h b/include/linux/mem_encrypt.h
index 470bd53a89df..0c5b0ff9eb29 100644
--- a/include/linux/mem_encrypt.h
+++ b/include/linux/mem_encrypt.h
@@ -18,23 +18,11 @@
 
 #else  /* !CONFIG_ARCH_HAS_MEM_ENCRYPT */
 
-#define sme_me_mask0ULL
-
-static inline bool sme_active(void) { return false; }
+static inline bool mem_encrypt_active(void) { return false; }
 static inline bool sev_active(void) { return false; }
 
 #endif /* CONFIG_ARCH_HAS_MEM_ENCRYPT */
 
-static inline bool mem_encrypt_active(void)
-{
-   return sme_me_mask;
-}
-
-static inline u64 sme_get_me_mask(void)
-{
-   return sme_me_mask;
-}
-
 #ifdef CONFIG_AMD_MEM_ENCRYPT
 /*
  * The __sme_set() and __sme_clr() macros are useful for adding or removing


[PATCH v3 5/6] fs/core/vmcore: Move sev_active() reference to x86 arch code

2019-07-17 Thread Thiago Jung Bauermann
Secure Encrypted Virtualization is an x86-specific feature, so it shouldn't
appear in generic kernel code because it forces non-x86 architectures to
define the sev_active() function, which doesn't make a lot of sense.

To solve this problem, add an x86 elfcorehdr_read() function to override
the generic weak implementation. To do that, it's necessary to make
read_from_oldmem() public so that it can be used outside of vmcore.c.

Also, remove the export for sev_active() since it's only used in files that
won't be built as modules.

Signed-off-by: Thiago Jung Bauermann 
---
 arch/x86/kernel/crash_dump_64.c |  5 +
 arch/x86/mm/mem_encrypt.c   |  1 -
 fs/proc/vmcore.c|  8 
 include/linux/crash_dump.h  | 14 ++
 include/linux/mem_encrypt.h |  1 -
 5 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/crash_dump_64.c b/arch/x86/kernel/crash_dump_64.c
index 22369dd5de3b..045e82e8945b 100644
--- a/arch/x86/kernel/crash_dump_64.c
+++ b/arch/x86/kernel/crash_dump_64.c
@@ -70,3 +70,8 @@ ssize_t copy_oldmem_page_encrypted(unsigned long pfn, char *buf, size_t csize,
 {
return __copy_oldmem_page(pfn, buf, csize, offset, userbuf, true);
 }
+
+ssize_t elfcorehdr_read(char *buf, size_t count, u64 *ppos)
+{
+   return read_from_oldmem(buf, count, ppos, 0, sev_active());
+}
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 7139f2f43955..b1e823441093 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -349,7 +349,6 @@ bool sev_active(void)
 {
return sme_me_mask && sev_enabled;
 }
-EXPORT_SYMBOL(sev_active);
 
 /* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
 bool force_dma_unencrypted(struct device *dev)
diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 57957c91c6df..ca1f20bedd8c 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -100,9 +100,9 @@ static int pfn_is_ram(unsigned long pfn)
 }
 
 /* Reads a page from the oldmem device from given offset. */
-static ssize_t read_from_oldmem(char *buf, size_t count,
-   u64 *ppos, int userbuf,
-   bool encrypted)
+ssize_t read_from_oldmem(char *buf, size_t count,
+u64 *ppos, int userbuf,
+bool encrypted)
 {
unsigned long pfn, offset;
size_t nr_bytes;
@@ -166,7 +166,7 @@ void __weak elfcorehdr_free(unsigned long long addr)
  */
 ssize_t __weak elfcorehdr_read(char *buf, size_t count, u64 *ppos)
 {
-   return read_from_oldmem(buf, count, ppos, 0, sev_active());
+   return read_from_oldmem(buf, count, ppos, 0, false);
 }
 
 /*
diff --git a/include/linux/crash_dump.h b/include/linux/crash_dump.h
index f774c5eb9e3c..4664fc1871de 100644
--- a/include/linux/crash_dump.h
+++ b/include/linux/crash_dump.h
@@ -115,4 +115,18 @@ static inline int vmcore_add_device_dump(struct vmcoredd_data *data)
return -EOPNOTSUPP;
 }
 #endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
+
+#ifdef CONFIG_PROC_VMCORE
+ssize_t read_from_oldmem(char *buf, size_t count,
+u64 *ppos, int userbuf,
+bool encrypted);
+#else
+static inline ssize_t read_from_oldmem(char *buf, size_t count,
+  u64 *ppos, int userbuf,
+  bool encrypted)
+{
+   return -EOPNOTSUPP;
+}
+#endif /* CONFIG_PROC_VMCORE */
+
 #endif /* LINUX_CRASHDUMP_H */
diff --git a/include/linux/mem_encrypt.h b/include/linux/mem_encrypt.h
index 0c5b0ff9eb29..5c4a18a91f89 100644
--- a/include/linux/mem_encrypt.h
+++ b/include/linux/mem_encrypt.h
@@ -19,7 +19,6 @@
 #else  /* !CONFIG_ARCH_HAS_MEM_ENCRYPT */
 
 static inline bool mem_encrypt_active(void) { return false; }
-static inline bool sev_active(void) { return false; }
 
 #endif /* CONFIG_ARCH_HAS_MEM_ENCRYPT */
 



[PATCH v3 3/6] dma-mapping: Remove dma_check_mask()

2019-07-17 Thread Thiago Jung Bauermann
sme_active() is an x86-specific function so it's better not to call it from
generic code. Christoph Hellwig mentioned that "There is no reason why we
should have a special debug printk just for one specific reason why there
is a requirement for a large DMA mask.", so just remove dma_check_mask().

Signed-off-by: Thiago Jung Bauermann 
---
 kernel/dma/mapping.c | 8 
 1 file changed, 8 deletions(-)

diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 1f628e7ac709..61eeefbfcb36 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -291,12 +291,6 @@ void dma_free_attrs(struct device *dev, size_t size, void *cpu_addr,
 }
 EXPORT_SYMBOL(dma_free_attrs);
 
-static inline void dma_check_mask(struct device *dev, u64 mask)
-{
-   if (sme_active() && (mask < (((u64)sme_get_me_mask() << 1) - 1)))
-   dev_warn(dev, "SME is active, device will require DMA bounce buffers\n");
-}
-
 int dma_supported(struct device *dev, u64 mask)
 {
const struct dma_map_ops *ops = get_dma_ops(dev);
@@ -327,7 +321,6 @@ int dma_set_mask(struct device *dev, u64 mask)
return -EIO;
 
arch_dma_set_mask(dev, mask);
-   dma_check_mask(dev, mask);
*dev->dma_mask = mask;
return 0;
 }
@@ -345,7 +338,6 @@ int dma_set_coherent_mask(struct device *dev, u64 mask)
if (!dma_supported(dev, mask))
return -EIO;
 
-   dma_check_mask(dev, mask);
dev->coherent_dma_mask = mask;
return 0;
 }


[PATCH v3 2/6] swiotlb: Remove call to sme_active()

2019-07-17 Thread Thiago Jung Bauermann
sme_active() is an x86-specific function so it's better not to call it from
generic code.

There's no need to mention which memory encryption feature is active, so
just use a more generic message. Besides, other architectures will have
different names for similar technology.

Signed-off-by: Thiago Jung Bauermann 
---
 kernel/dma/swiotlb.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 62fa5a82a065..e52401f94e91 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -459,8 +459,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
 		panic("Can not allocate SWIOTLB buffer earlier and can't now provide you with the DMA bounce buffer");
 
if (mem_encrypt_active())
-		pr_warn_once("%s is active and system is using DMA bounce buffers\n",
-			     sme_active() ? "SME" : "SEV");
+		pr_warn_once("Memory encryption is active and system is using DMA bounce buffers\n");
 
mask = dma_get_seg_boundary(hwdev);
 


[PATCH v3 0/6] Remove x86-specific code from generic headers

2019-07-17 Thread Thiago Jung Bauermann
Hello,

This version is mostly about splitting up patch 2/3 into three separate
patches, as suggested by Christoph Hellwig. Two other changes are a fix in
patch 1, which wasn't selecting ARCH_HAS_MEM_ENCRYPT for s390 (spotted by
Janani), and the removal of the sme_active and sev_active symbol exports, as
suggested by Christoph Hellwig.

These patches are applied on top of today's dma-mapping/for-next.

I don't have a way to test SME, SEV, nor s390's PEF so the patches have only
been build tested.

Changelog

Since v2:

- Patch "x86,s390: Move ARCH_HAS_MEM_ENCRYPT definition to arch/Kconfig"
  - Added "select ARCH_HAS_MEM_ENCRYPT" to config S390. Suggested by Janani.

- Patch "DMA mapping: Move SME handling to x86-specific files"
  - Split up into 3 new patches. Suggested by Christoph Hellwig.

- Patch "swiotlb: Remove call to sme_active()"
  - New patch.

- Patch "dma-mapping: Remove dma_check_mask()"
  - New patch.

- Patch "x86,s390/mm: Move sme_active() and sme_me_mask to x86-specific header"
  - New patch.
  - Removed export of sme_active symbol. Suggested by Christoph Hellwig.

- Patch "fs/core/vmcore: Move sev_active() reference to x86 arch code"
  - Removed export of sev_active symbol. Suggested by Christoph Hellwig.

- Patch "s390/mm: Remove sev_active() function"
  - New patch.

Since v1:

- Patch "x86,s390: Move ARCH_HAS_MEM_ENCRYPT definition to arch/Kconfig"
  - Remove definition of ARCH_HAS_MEM_ENCRYPT from s390/Kconfig as well.
  - Reworded patch title and message a little bit.

- Patch "DMA mapping: Move SME handling to x86-specific files"
  - Adapt s390's <asm/mem_encrypt.h> as well.
  - Remove dma_check_mask() from kernel/dma/mapping.c. Suggested by
Christoph Hellwig.

Thiago Jung Bauermann (6):
  x86,s390: Move ARCH_HAS_MEM_ENCRYPT definition to arch/Kconfig
  swiotlb: Remove call to sme_active()
  dma-mapping: Remove dma_check_mask()
  x86,s390/mm: Move sme_active() and sme_me_mask to x86-specific header
  fs/core/vmcore: Move sev_active() reference to x86 arch code
  s390/mm: Remove sev_active() function

 arch/Kconfig|  3 +++
 arch/s390/Kconfig   |  4 +---
 arch/s390/include/asm/mem_encrypt.h |  5 +
 arch/s390/mm/init.c |  8 +---
 arch/x86/Kconfig|  4 +---
 arch/x86/include/asm/mem_encrypt.h  | 10 ++
 arch/x86/kernel/crash_dump_64.c |  5 +
 arch/x86/mm/mem_encrypt.c   |  2 --
 fs/proc/vmcore.c|  8 
 include/linux/crash_dump.h  | 14 ++
 include/linux/mem_encrypt.h | 15 +--
 kernel/dma/mapping.c|  8 
 kernel/dma/swiotlb.c|  3 +--
 13 files changed, 42 insertions(+), 47 deletions(-)



[PATCH v3 1/6] x86, s390: Move ARCH_HAS_MEM_ENCRYPT definition to arch/Kconfig

2019-07-17 Thread Thiago Jung Bauermann
powerpc is also going to use this feature, so put it in a generic location.

Signed-off-by: Thiago Jung Bauermann 
Reviewed-by: Thomas Gleixner 
Reviewed-by: Christoph Hellwig 
---
 arch/Kconfig  | 3 +++
 arch/s390/Kconfig | 4 +---
 arch/x86/Kconfig  | 4 +---
 3 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index e8d19c3cb91f..8fc285180848 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -935,6 +935,9 @@ config LOCK_EVENT_COUNTS
  the chance of application behavior change because of timing
  differences. The counts are reported via debugfs.
 
+config ARCH_HAS_MEM_ENCRYPT
+   bool
+
 source "kernel/gcov/Kconfig"
 
 source "scripts/gcc-plugins/Kconfig"
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index a4ad2733eedf..f43319c44454 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -1,7 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0
-config ARCH_HAS_MEM_ENCRYPT
-def_bool y
-
 config MMU
def_bool y
 
@@ -68,6 +65,7 @@ config S390
select ARCH_HAS_GCOV_PROFILE_ALL
select ARCH_HAS_GIGANTIC_PAGE
select ARCH_HAS_KCOV
+   select ARCH_HAS_MEM_ENCRYPT
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_SET_MEMORY
select ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c9f331bb538b..5d3295f2df94 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -68,6 +68,7 @@ config X86
select ARCH_HAS_FORTIFY_SOURCE
select ARCH_HAS_GCOV_PROFILE_ALL
	select ARCH_HAS_KCOV			if X86_64
+   select ARCH_HAS_MEM_ENCRYPT
select ARCH_HAS_MEMBARRIER_SYNC_CORE
	select ARCH_HAS_PMEM_API		if X86_64
select ARCH_HAS_PTE_SPECIAL
@@ -1520,9 +1521,6 @@ config X86_CPA_STATISTICS
  helps to determine the effectiveness of preserving large and huge
  page mappings when mapping protections are changed.
 
-config ARCH_HAS_MEM_ENCRYPT
-   def_bool y
-
 config AMD_MEM_ENCRYPT
bool "AMD Secure Memory Encryption (SME) support"
depends on X86_64 && CPU_SUP_AMD



Re: [PATCH v2] powerpc/nvdimm: Pick the nearby online node if the device node is not online

2019-07-17 Thread Oliver O'Halloran
On Tue, Jul 16, 2019 at 7:08 PM Aneesh Kumar K.V wrote:
>
> This is similar to what ACPI does. The nvdimm layer doesn't bring the SCM
> device's numa node online. Hence we need to make sure we always use an online
> node as ndr_desc.numa_node. Otherwise this results in kernel crashes. The
> target node is used by dax/kmem and that will bring the numa node online
> correctly.
>
> Without this patch, we hit a kernel crash as below because we try to access
> uninitialized NODE_DATA in different code paths.

Right, so we're getting a crash due to libnvdimm (via devm_kmalloc)
trying to do node-local allocations on an offline node. Using a
different node fixes that problem, but what else does changing
ndr_desc.numa_node do?

> cpu 0x0: Vector: 300 (Data Access) at [c000fac53170]
> pc: c04bbc50: ___slab_alloc+0x120/0xca0
> lr: c04bc834: __slab_alloc+0x64/0xc0
> sp: c000fac53400
>msr: 82009033
>dar: 73e8
>  dsisr: 8
>   current = 0xc000fabb6d80
>   paca= 0xc387   irqmask: 0x03   irq_happened: 0x01
> pid   = 7, comm = kworker/u16:0
> Linux version 5.2.0-06234-g76bd729b2644 (kvaneesh@ltc-boston123) (gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)) #135 SMP Thu Jul 11 05:36:30 CDT 2019
> enter ? for help
> [link register   ] c04bc834 __slab_alloc+0x64/0xc0
> [c000fac53400] c000fac53480 (unreliable)
> [c000fac53500] c04bc818 __slab_alloc+0x48/0xc0
> [c000fac53560] c04c30a0 __kmalloc_node_track_caller+0x3c0/0x6b0
> [c000fac535d0] c0cfafe4 devm_kmalloc+0x74/0xc0
> [c000fac53600] c0d69434 nd_region_activate+0x144/0x560
> [c000fac536d0] c0d6b19c nd_region_probe+0x17c/0x370
> [c000fac537b0] c0d6349c nvdimm_bus_probe+0x10c/0x230
> [c000fac53840] c0cf3cc4 really_probe+0x254/0x4e0
> [c000fac538d0] c0cf429c driver_probe_device+0x16c/0x1e0
> [c000fac53950] c0cf0b44 bus_for_each_drv+0x94/0x130
> [c000fac539b0] c0cf392c __device_attach+0xdc/0x200
> [c000fac53a50] c0cf231c bus_probe_device+0x4c/0xf0
> [c000fac53a90] c0ced268 device_add+0x528/0x810
> [c000fac53b60] c0d62a58 nd_async_device_register+0x28/0xa0
> [c000fac53bd0] c01ccb8c async_run_entry_fn+0xcc/0x1f0
> [c000fac53c50] c01bcd9c process_one_work+0x46c/0x860
> [c000fac53d20] c01bd4f4 worker_thread+0x364/0x5f0
> [c000fac53db0] c01c7260 kthread+0x1b0/0x1c0
> [c000fac53e20] c000b954 ret_from_kernel_thread+0x5c/0x68
>
> With the patch we get
>
>  # numactl -H
> available: 2 nodes (0-1)
> node 0 cpus:
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
> node 1 size: 130865 MB
> node 1 free: 129130 MB
> node distances:
> node   0   1
>   0:  10  20
>   1:  20  10
>  # cat /sys/bus/nd/devices/region0/numa_node
> 0
>  # dmesg | grep papr_scm
> [   91.332305] papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Region registered with target node 2 and online node 0
>
> Signed-off-by: Aneesh Kumar K.V 
> ---
> changes from V1:
> * handle NUMA_NO_NODE
>
>  arch/powerpc/platforms/pseries/papr_scm.c | 30 +--
>  1 file changed, 28 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
> index c8ec670ee924..b813bc92f35f 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -255,12 +255,32 @@ static const struct attribute_group *papr_scm_dimm_groups[] = {
> NULL,
>  };
>
> +static inline int papr_scm_node(int node)
> +{
> +   int min_dist = INT_MAX, dist;
> +   int nid, min_node;
> +
> +   if ((node == NUMA_NO_NODE) || node_online(node))
> +   return node;
> +
> +   min_node = first_online_node;
> +   for_each_online_node(nid) {
> +   dist = node_distance(node, nid);
> +   if (dist < min_dist) {
> +   min_dist = dist;
> +   min_node = nid;
> +   }
> +   }
> +   return min_node;
> +}
> +
>  static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
>  {
> struct device *dev = &p->pdev->dev;
> struct nd_mapping_desc mapping;
> struct nd_region_desc ndr_desc;
> unsigned long dimm_flags;
> +   int target_nid, online_nid;
>
> p->bus_desc.ndctl = papr_scm_ndctl;
> p->bus_desc.module = THIS_MODULE;
> @@ -299,8 +319,11 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
>
> memset(&ndr_desc, 0, sizeof(ndr_desc));
> ndr_desc.attr_groups = region_attr_groups;
> -   ndr_desc.numa_node = dev_to_node(&p->pdev->dev);
> -   ndr_desc.target_node = ndr_desc.numa_node;
> +   target_nid = dev_to_node(&p->pdev->dev);
> +   online_nid =

Re: [PATCH v9 05/10] namei: O_BENEATH-style path resolution flags

2019-07-17 Thread Aleksa Sarai
On 2019-07-14, Al Viro  wrote:
> On Sun, Jul 14, 2019 at 05:00:29PM +1000, Aleksa Sarai wrote:
> > The basic property being guaranteed by LOOKUP_IN_ROOT is that it will
> > not result in resolution of a path component which was not inside the
> > root of the dirfd tree at some point during resolution (and that all
> > absolute symlink and ".." resolution will be done relative to the
> > dirfd). This may smell slightly of chroot(2), because unfortunately it
> > is a similar concept -- the reason for this is to allow for a more
> > efficient way to safely resolve paths inside a rootfs than spawning a
> > separate process to then pass back the fd to the caller.
> 
> IDGI...  If attacker can modify your subtree, you have already lost -
> after all, they can make anything appear inside that tree just before
> your syscall is made and bring it back out immediately afterwards.
> And if they can't, what is the race you are trying to protect against?
> Confused...

I'll be honest, this code mostly exists because Jann Horn said that it
was necessary in order for this interface to be safe against those kinds
of attacks. Though it's also entirely possible I'm just misremembering
the attack scenario he described when I posted v1 of this series last
year.

The use-case I need this functionality for (as do other container
runtimes) is one where you are trying to safely interact with a
directory tree that is a (malicious) container's root filesystem -- so
the container won't be able to move the directory tree root, nor can
they move things outside the rootfs into it (or the reverse). Users
dealing with FTP, web, or file servers probably have similar
requirements.

There is an obvious race condition if you allow the attacker to move the
root (I give an example and test-case of it in the last patch in the
series), and given that it is fairly trivial to defend against I don't
see the downside in including it? But it's obviously your call -- and
maybe Jann Horn can explain the reasoning behind this much better than I
can.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH



signature.asc
Description: PGP signature


Re: [PATCH] powerpc/dma: Fix invalid DMA mmap behavior

2019-07-17 Thread Shawn Anastasio

On 7/17/19 9:59 PM, Alexey Kardashevskiy wrote:



On 18/07/2019 09:54, Shawn Anastasio wrote:

The refactor of powerpc DMA functions in commit cc17d780
("powerpc/dma: remove dma_nommu_mmap_coherent") incorrectly
changes the way DMA mappings are handled on powerpc.
Since this change, all mapped pages are marked as cache-inhibited
through the default implementation of arch_dma_mmap_pgprot.
This differs from the previous behavior of only marking pages
in noncoherent mappings as cache-inhibited and has resulted in
sporadic system crashes in certain hardware configurations and
workloads (see Bugzilla).

This commit restores the previous correct behavior by providing
an implementation of arch_dma_mmap_pgprot that only marks
pages in noncoherent mappings as cache-inhibited. As this behavior
should be universal for all powerpc platforms a new file,
dma-generic.c, was created to store it.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204145
Fixes: cc17d780 ("powerpc/dma: remove dma_nommu_mmap_coherent")
Signed-off-by: Shawn Anastasio 



Is this the default one?

include/linux/dma-noncoherent.h
# define arch_dma_mmap_pgprot(dev, prot, attrs) pgprot_noncached(prot)


Yep, that's the one.


Out of curiosity - do not we want to fix this one for everyone?


Other than m68k, mips, and arm64, everybody else that doesn't have
ARCH_NO_COHERENT_DMA_MMAP set uses this default implementation, so
I assume this behavior is acceptable on those architectures.

Either way, the patch is correct. I'm glad to know it was not my
"powerpc/ioda2: Yet another attempt to allow DMA masks between 32 and 59"
which broke it :)


Yeah, turns out it was just bad luck that I happened to run into these
crashes right after deciding to try out your patch :)


Reviewed-by: Alexey Kardashevskiy 


Thanks!






---
  arch/powerpc/Kconfig |  1 +
  arch/powerpc/kernel/Makefile |  3 ++-
  arch/powerpc/kernel/dma-common.c | 17 +
  3 files changed, 20 insertions(+), 1 deletion(-)
  create mode 100644 arch/powerpc/kernel/dma-common.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index d8dcd8820369..77f6ebf97113 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -121,6 +121,7 @@ config PPC
  select ARCH_32BIT_OFF_T if PPC32
  select ARCH_HAS_DEBUG_VIRTUAL
  select ARCH_HAS_DEVMEM_IS_ALLOWED
+    select ARCH_HAS_DMA_MMAP_PGPROT
  select ARCH_HAS_ELF_RANDOMIZE
  select ARCH_HAS_FORTIFY_SOURCE
  select ARCH_HAS_GCOV_PROFILE_ALL
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 56dfa7a2a6f2..ea0c69236789 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -49,7 +49,8 @@ obj-y := cputable.o ptrace.o syscalls.o \

 signal.o sysfs.o cacheinfo.o time.o \
 prom.o traps.o setup-common.o \
 udbg.o misc.o io.o misc_$(BITS).o \
-   of_platform.o prom_parse.o
+   of_platform.o prom_parse.o \
+   dma-common.o
  obj-$(CONFIG_PPC64)    += setup_64.o sys_ppc32.o \
 signal_64.o ptrace32.o \
 paca.o nvram_64.o firmware.o
diff --git a/arch/powerpc/kernel/dma-common.c b/arch/powerpc/kernel/dma-common.c

new file mode 100644
index ..5a15f99f4199
--- /dev/null
+++ b/arch/powerpc/kernel/dma-common.c
@@ -0,0 +1,17 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Contains common dma routines for all powerpc platforms.
+ *
+ * Copyright (C) 2019 Shawn Anastasio (sh...@anastas.io)
+ */
+
+#include 
+#include 
+
+pgprot_t arch_dma_mmap_pgprot(struct device *dev, pgprot_t prot,
+    unsigned long attrs)
+{
+    if (!dev_is_dma_coherent(dev))
+    return pgprot_noncached(prot);
+    return prot;
+}





Re: [PATCH] powerpc/dma: Fix invalid DMA mmap behavior

2019-07-17 Thread Alexey Kardashevskiy




On 18/07/2019 09:54, Shawn Anastasio wrote:

The refactor of powerpc DMA functions in commit cc17d780
("powerpc/dma: remove dma_nommu_mmap_coherent") incorrectly
changes the way DMA mappings are handled on powerpc.
Since this change, all mapped pages are marked as cache-inhibited
through the default implementation of arch_dma_mmap_pgprot.
This differs from the previous behavior of only marking pages
in noncoherent mappings as cache-inhibited and has resulted in
sporadic system crashes in certain hardware configurations and
workloads (see Bugzilla).

This commit restores the previous correct behavior by providing
an implementation of arch_dma_mmap_pgprot that only marks
pages in noncoherent mappings as cache-inhibited. As this behavior
should be universal for all powerpc platforms, a new file,
dma-common.c, was created to store it.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204145
Fixes: cc17d780 ("powerpc/dma: remove dma_nommu_mmap_coherent")
Signed-off-by: Shawn Anastasio 



Is this the default one?

include/linux/dma-noncoherent.h
# define arch_dma_mmap_pgprot(dev, prot, attrs) pgprot_noncached(prot)

Out of curiosity - do not we want to fix this one for everyone?

Either way, the patch is correct. I'm glad to know it was not my
"powerpc/ioda2: Yet another attempt to allow DMA masks between 32 and 59"
which broke it :)


Reviewed-by: Alexey Kardashevskiy 




---
  arch/powerpc/Kconfig |  1 +
  arch/powerpc/kernel/Makefile |  3 ++-
  arch/powerpc/kernel/dma-common.c | 17 +
  3 files changed, 20 insertions(+), 1 deletion(-)
  create mode 100644 arch/powerpc/kernel/dma-common.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index d8dcd8820369..77f6ebf97113 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -121,6 +121,7 @@ config PPC
select ARCH_32BIT_OFF_T if PPC32
select ARCH_HAS_DEBUG_VIRTUAL
select ARCH_HAS_DEVMEM_IS_ALLOWED
+   select ARCH_HAS_DMA_MMAP_PGPROT
select ARCH_HAS_ELF_RANDOMIZE
select ARCH_HAS_FORTIFY_SOURCE
select ARCH_HAS_GCOV_PROFILE_ALL
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 56dfa7a2a6f2..ea0c69236789 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -49,7 +49,8 @@ obj-y := cputable.o ptrace.o syscalls.o \
   signal.o sysfs.o cacheinfo.o time.o \
   prom.o traps.o setup-common.o \
   udbg.o misc.o io.o misc_$(BITS).o \
-  of_platform.o prom_parse.o
+  of_platform.o prom_parse.o \
+  dma-common.o
  obj-$(CONFIG_PPC64)   += setup_64.o sys_ppc32.o \
   signal_64.o ptrace32.o \
   paca.o nvram_64.o firmware.o
diff --git a/arch/powerpc/kernel/dma-common.c b/arch/powerpc/kernel/dma-common.c
new file mode 100644
index ..5a15f99f4199
--- /dev/null
+++ b/arch/powerpc/kernel/dma-common.c
@@ -0,0 +1,17 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Contains common dma routines for all powerpc platforms.
+ *
+ * Copyright (C) 2019 Shawn Anastasio (sh...@anastas.io)
+ */
+
+#include 
+#include 
+
+pgprot_t arch_dma_mmap_pgprot(struct device *dev, pgprot_t prot,
+   unsigned long attrs)
+{
+   if (!dev_is_dma_coherent(dev))
+   return pgprot_noncached(prot);
+   return prot;
+}



--
Alexey


Re: [PATCH v4 7/8] KVM: PPC: Ultravisor: Enter a secure guest

2019-07-17 Thread Sukadev Bhattiprolu
Michael Ellerman [m...@ellerman.id.au] wrote:
> Claudio Carvalho  writes:
> > From: Sukadev Bhattiprolu 
> >
> > To enter a secure guest, we have to go through the ultravisor, therefore
> > we do a ucall when we are entering a secure guest.
> >
> > This change is needed for any sort of entry to the secure guest from the
> > hypervisor, whether it is a return from an hcall, a return from a
> > hypervisor interrupt, or the first time that a secure guest vCPU is run.
> >
> > If we are returning from an hcall, the results are already in the
> > appropriate registers R3:12, except for R3, R6 and R7. R3 has the status
> > of the reflected hcall, therefore we move it to R0 for the ultravisor and
> > set R3 to the UV_RETURN ucall number. R6,7 were used as temporary
> > registers, hence we restore them.
> 
> This is another case where some documentation would help people to
> review the code.
> 
> > Have fast_guest_return check the kvm_arch.secure_guest field so that a
> > new CPU enters UV when started (in response to a RTAS start-cpu call).
> >
> > Thanks to input from Paul Mackerras, Ram Pai and Mike Anderson.
> >
> > Signed-off-by: Sukadev Bhattiprolu 
> > [ Pass SRR1 in r11 for UV_RETURN, fix kvmppc_msr_interrupt to preserve
> >   the MSR_S bit ]
> > Signed-off-by: Paul Mackerras 
> > [ Fix UV_RETURN ucall number and arch.secure_guest check ]
> > Signed-off-by: Ram Pai 
> > [ Save the actual R3 in R0 for the ultravisor and use R3 for the
> >   UV_RETURN ucall number. Update commit message and ret_to_ultra comment ]
> > Signed-off-by: Claudio Carvalho 
> > ---
> >  arch/powerpc/include/asm/kvm_host.h   |  1 +
> >  arch/powerpc/include/asm/ultravisor-api.h |  1 +
> >  arch/powerpc/kernel/asm-offsets.c |  1 +
> >  arch/powerpc/kvm/book3s_hv_rmhandlers.S   | 40 +++
> >  4 files changed, 37 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > index cffb365d9d02..89813ca987c2 100644
> > --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > @@ -36,6 +36,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  /* Sign-extend HDEC if not on POWER9 */
> >  #define EXTEND_HDEC(reg)   \
> > @@ -1092,16 +1093,12 @@ BEGIN_FTR_SECTION
> >  END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
> >  
> > ld  r5, VCPU_LR(r4)
> > -   ld  r6, VCPU_CR(r4)
> > mtlr    r5
> > -   mtcr    r6
> >  
> > ld  r1, VCPU_GPR(R1)(r4)
> > ld  r2, VCPU_GPR(R2)(r4)
> > ld  r3, VCPU_GPR(R3)(r4)
> > ld  r5, VCPU_GPR(R5)(r4)
> > -   ld  r6, VCPU_GPR(R6)(r4)
> > -   ld  r7, VCPU_GPR(R7)(r4)
> > ld  r8, VCPU_GPR(R8)(r4)
> > ld  r9, VCPU_GPR(R9)(r4)
> > ld  r10, VCPU_GPR(R10)(r4)
> > @@ -1119,10 +1116,38 @@ BEGIN_FTR_SECTION
> > mtspr   SPRN_HDSISR, r0
> >  END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
> >  
> > +   ld  r6, VCPU_KVM(r4)
> > +   lbz r7, KVM_SECURE_GUEST(r6)
> > +   cmpdi   r7, 0
> 
> You could hoist the load of r6 and r7 to here?

we could move 'ld r7' here. r6 is used to restore CR below so
it (r6) has to stay there?

> 
> > +   bne ret_to_ultra
> > +
> > +   lwz r6, VCPU_CR(r4)
> > +   mtcr    r6
> > +
> > +   ld  r7, VCPU_GPR(R7)(r4)
> > +   ld  r6, VCPU_GPR(R6)(r4)
> > ld  r0, VCPU_GPR(R0)(r4)
> > ld  r4, VCPU_GPR(R4)(r4)
> > HRFI_TO_GUEST
> > b   .
> > +/*
> > + * We are entering a secure guest, so we have to invoke the ultravisor to do
> > + * that. If we are returning from a hcall, the results are already in the
> > + * appropriate registers R3:12, except for R3, R6 and R7. R3 has the status of
> > + * the reflected hcall, therefore we move it to R0 for the ultravisor and set
> > + * R3 to the UV_RETURN ucall number. R6,7 were used as temporary registers
> > + * above, hence we restore them.
> > + */
> > +ret_to_ultra:
> > +   lwz r6, VCPU_CR(r4)
> > +   mtcr    r6
> > +   mfspr   r11, SPRN_SRR1
> > +   mr  r0, r3
> > +   LOAD_REG_IMMEDIATE(r3, UV_RETURN)
> 
> Worth open coding to save three instructions?

Yes, good point:

-   LOAD_REG_IMMEDIATE(r3, UV_RETURN)
+
+   li  r3, 0
+   oris    r3, r3, (UV_RETURN)@__AS_ATHIGH
+   ori r3, r3, (UV_RETURN)@l
+

> 
> > +   ld  r7, VCPU_GPR(R7)(r4)
> > +   ld  r6, VCPU_GPR(R6)(r4)
> > +   ld  r4, VCPU_GPR(R4)(r4)
> > +   sc  2
> >  
> >  /*
> >   * Enter the guest on a P9 or later system where we have exactly
> > @@ -3318,13 +3343,16 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_TYPE_RADIX)
> >   *   r0 is used as a scratch register
> >   */
> >  kvmppc_msr_interrupt:
> > +   andis.  r0, r11, MSR_S@h
> > rldicl  r0, r11, 64 - MSR_TS_S_LG, 62
> > -   cmpwi   r0, 2 /* Check if we are in transactional state..  */
> > +   cmpwi   cr1, r0, 2 /* Check if we are in transactional state..  */
> > ld  r11, VCPU_INTR_MSR(r9

Re: [PATCH] powerpc: remove meaningless KBUILD_ARFLAGS addition

2019-07-17 Thread Masahiro Yamada
On Thu, Jul 18, 2019 at 1:46 AM Segher Boessenkool wrote:
>
> On Thu, Jul 18, 2019 at 12:19:36AM +0900, Masahiro Yamada wrote:
> > On Wed, Jul 17, 2019 at 11:38 PM Segher Boessenkool wrote:
> > >
> > > On Tue, Jul 16, 2019 at 10:15:47PM +1000, Michael Ellerman wrote:
> > > > Segher Boessenkool  writes:
> > > > And it's definitely calling ar with no flags, eg:
> > > >
> > > >   rm -f init/built-in.a; powerpc-linux-ar rcSTPD init/built-in.a init/main.o init/version.o init/do_mounts.o init/do_mounts_rd.o init/do_mounts_initrd.o init/do_mounts_md.o init/initramfs.o init/init_task.o
> > >
> > > This uses thin archives.  Those will work fine.
> > >
> > > The failing case was empty files IIRC, stuff created from no inputs.
> >
> > Actually, empty files are created everywhere.
>
> >    cmd_ar_builtin = rm -f $@; $(AR) rcSTP$(KBUILD_ARFLAGS) $@ $(real-prereqs)
>
> You use thin archives.
>
> Does every config use thin archives always nowadays?

Kbuild always uses thin archives as far as vmlinux is concerned.

But, there are some other call-sites.

masahiro@pug:~/ref/linux$ git grep  '$(AR)' -- :^Documentation :^tools
arch/powerpc/boot/Makefile:BOOTAR := $(AR)
arch/unicore32/lib/Makefile:$(Q)$(AR) p $(GNU_LIBC_A) $(notdir $@) > $@
arch/unicore32/lib/Makefile:$(Q)$(AR) p $(GNU_LIBGCC_A) $(notdir $@) > $@
lib/raid6/test/Makefile: $(AR) cq $@ $^
scripts/Kbuild.include:ar-option = $(call try-run, $(AR) rc$(1) "$$TMP",$(1),$(2))
scripts/Makefile.build:  cmd_ar_builtin = rm -f $@; $(AR) rcSTP$(KBUILD_ARFLAGS) $@ $(real-prereqs)
scripts/Makefile.lib:  cmd_ar = rm -f $@; $(AR) rcsTP$(KBUILD_ARFLAGS) $@ $(real-prereqs)
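(For context, the `ar-option` helper above is Kbuild's usual try-run probe; around this time the top-level Makefile set KBUILD_ARFLAGS via `$(call ar-option,D)`. A rough standalone shell sketch of the same probe, with made-up variable names:)

```shell
# Rough sketch of the try-run probe behind ar-option: run ar with the
# candidate modifier against a scratch archive and keep the modifier
# only if the command succeeds.  Variable names are made up.
TMP=$(mktemp -u)                  # path only; ar creates the file
if ar rcD "$TMP" 2>/dev/null; then
    ARFLAG_D=D                    # ar supports deterministic mode
else
    ARFLAG_D=
fi
rm -f "$TMP"
echo "ar flags: rcSTP${ARFLAG_D}"
```

So a modifier like the `D` in the `rcSTPD` invocations quoted earlier in this thread only ends up on the command line when the installed ar actually supports it.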


Probably you are interested in arch/powerpc/boot/Makefile.
This one does not seem to be a thin archive.


> > BTW, your commit 8995ac8702737147115e1c75879a1a2d75627b9e
> > dates back to 2008.
> >
> > At that time, thin archive was not used.
>
> Yes, I know.  This isn't about built-in.[oa], it is about *other*
> archives we at least *used to* create.  If we *know* we do not anymore,
> then this workaround can of course be removed (and good riddance).

If it is not about built-in.[oa],
which archive are you talking about?

Can you pinpoint the one?

masahiro@pug:~/ref/linux$ git log --oneline  -1
8995ac870273 (HEAD) [POWERPC] Specify GNUTARGET on $(AR) invocations
masahiro@pug:~/ref/linux$ git grep '(AR)'
Documentation/kbuild/makefiles.txt: per-directory options to $(LD) and $(AR).
arch/powerpc/Makefile:CROSS32AR := GNUTARGET=elf32-powerpc $(AR)
arch/powerpc/Makefile:override AR   := GNUTARGET=elf$(SZ)-powerpc $(AR)
drivers/md/raid6test/Makefile:   $(AR) cq $@ $^
scripts/Makefile.build:   rm -f $@; $(AR) rcs $@)
scripts/Makefile.build:cmd_link_l_target = rm -f $@; $(AR) $(EXTRA_ARFLAGS) rcs $@ $(lib-y)
masahiro@pug:~/ref/linux$ git grep '(CROSS32AR)'
arch/powerpc/boot/Makefile:  cmd_bootar = $(CROSS32AR) -cr $@. $(filter-out FORCE,$^); mv $@. $@



> If ar creates an archive file (a real one, not a thin archive), and it
> has no input files, it uses its default object format as destination
> format, if it isn't told to use something else.  And that doesn't work,
> it needs to use some format compatible with what that archive later is
> linked with.

I compile-tested v4.10, which was before the thin-archive migration,
but I did not see any problem building ppc32/64.

Whether or not it is a thin archive,
an empty archive is always an 8-byte file.

masahiro@pug:~/ref/linux$ cat kernel/livepatch/built-in.o
!<arch>
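That claim is easy to check directly (a quick sketch, assuming GNU binutils `ar` is on PATH):

```shell
# Sketch assuming GNU binutils ar: an archive created from no input
# members contains only the 8-byte global header "!<arch>\n",
# regardless of host or target architecture.
rm -f empty.a
ar rcS empty.a            # same r/c/S modifiers kbuild uses, no members
wc -c < empty.a           # 8 bytes
head -c 7 empty.a; echo   # the magic: !<arch>
rm -f empty.a
```

A thin archive differs only in the magic (`!<thin>\n`), which is why the empty case looks the same either way.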

Is there room for caring about the underlying architecture?


-- 
Best Regards
Masahiro Yamada


[PATCH] ibmvfc: fix WARN_ON during event pool release

2019-07-17 Thread Tyrel Datwyler
While removing an ibmvfc client adapter, a WARN_ON like the following
is seen in the kernel log:

WARNING: CPU: 6 PID: 5421 at ./include/linux/dma-mapping.h:541
ibmvfc_free_event_pool+0x12c/0x1f0 [ibmvfc]
CPU: 6 PID: 5421 Comm: rmmod Tainted: GE 
4.17.0-rc1-next-20180419-autotest #1
NIP:  d290328c LR: d290325c CTR: c036ee20
REGS: c00288d1b7e0 TRAP: 0700   Tainted: GE  
(4.17.0-rc1-next-20180419-autotest)
MSR:  80010282b033   CR: 44008828  
XER: 2000
CFAR: c036e408 SOFTE: 1
GPR00: d290325c c00288d1ba60 d2917900 c00289d75448
GPR04: 0071 c000ff87 1804 0001
GPR08:  c156e838 0001 d290c640
GPR12: c036ee20 c0001ec4dc00  
GPR16:   0100276901e0 10020598
GPR20: 10020550 10020538 10020578 100205b0
GPR24:   10020590 5deadbeef100
GPR28: 5deadbeef200 d2910b00 0071 c002822f87d8
NIP [d290328c] ibmvfc_free_event_pool+0x12c/0x1f0 [ibmvfc]
LR [d290325c] ibmvfc_free_event_pool+0xfc/0x1f0 [ibmvfc]
Call Trace:
[c00288d1ba60] [d290325c] ibmvfc_free_event_pool+0xfc/0x1f0 
[ibmvfc] (unreliable)
[c00288d1baf0] [d2909390] ibmvfc_abort_task_set+0x7b0/0x8b0 [ibmvfc]
[c00288d1bb70] [c00d8c68] vio_bus_remove+0x68/0x100
[c00288d1bbb0] [c07da7c4] device_release_driver_internal+0x1f4/0x2d0
[c00288d1bc00] [c07da95c] driver_detach+0x7c/0x100
[c00288d1bc40] [c07d8af4] bus_remove_driver+0x84/0x140
[c00288d1bcb0] [c07db6ac] driver_unregister+0x4c/0xa0
[c00288d1bd20] [c00d6e7c] vio_unregister_driver+0x2c/0x50
[c00288d1bd50] [d290ba0c] cleanup_module+0x24/0x15e0 [ibmvfc]
[c00288d1bd70] [c01dadb0] sys_delete_module+0x220/0x2d0
[c00288d1be30] [c000b284] system_call+0x58/0x6c
Instruction dump:
e8410018 e87f0068 809f0078 e8bf0080 e8df0088 2fa3 419e008c e9230200
2fa9 419e0080 894d098a 794a07e0 <0b0a> e9290008 2fa9 419e0028

This is tripped as a result of IRQs being disabled during the call to
dma_free_coherent() by ibmvfc_free_event_pool(). At this point in the code path
we have quiesced the adapter, and it's overly paranoid anyway to be holding the
host lock.

Reported-by: Abdul Haleem 
Signed-off-by: Tyrel Datwyler 
---
 drivers/scsi/ibmvscsi/ibmvfc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
index acd16e0d52cf..8cdbac076a1b 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.c
+++ b/drivers/scsi/ibmvscsi/ibmvfc.c
@@ -4864,8 +4864,8 @@ static int ibmvfc_remove(struct vio_dev *vdev)
 
spin_lock_irqsave(vhost->host->host_lock, flags);
ibmvfc_purge_requests(vhost, DID_ERROR);
-   ibmvfc_free_event_pool(vhost);
spin_unlock_irqrestore(vhost->host->host_lock, flags);
+   ibmvfc_free_event_pool(vhost);
 
ibmvfc_free_mem(vhost);
spin_lock(&ibmvfc_driver_lock);
-- 
2.12.3



[PATCH] powerpc/dma: Fix invalid DMA mmap behavior

2019-07-17 Thread Shawn Anastasio
The refactor of powerpc DMA functions in commit cc17d780
("powerpc/dma: remove dma_nommu_mmap_coherent") incorrectly
changed the way DMA mappings are handled on powerpc.
Since this change, all mapped pages are marked as cache-inhibited
through the default implementation of arch_dma_mmap_pgprot.
This differs from the previous behavior of only marking pages
in noncoherent mappings as cache-inhibited and has resulted in
sporadic system crashes in certain hardware configurations and
workloads (see Bugzilla).

This commit restores the previous correct behavior by providing
an implementation of arch_dma_mmap_pgprot that only marks
pages in noncoherent mappings as cache-inhibited. As this behavior
should be universal for all powerpc platforms, a new file,
dma-common.c, was created to store it.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204145
Fixes: cc17d780 ("powerpc/dma: remove dma_nommu_mmap_coherent")
Signed-off-by: Shawn Anastasio 
---
 arch/powerpc/Kconfig |  1 +
 arch/powerpc/kernel/Makefile |  3 ++-
 arch/powerpc/kernel/dma-common.c | 17 +
 3 files changed, 20 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/kernel/dma-common.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index d8dcd8820369..77f6ebf97113 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -121,6 +121,7 @@ config PPC
select ARCH_32BIT_OFF_T if PPC32
select ARCH_HAS_DEBUG_VIRTUAL
select ARCH_HAS_DEVMEM_IS_ALLOWED
+   select ARCH_HAS_DMA_MMAP_PGPROT
select ARCH_HAS_ELF_RANDOMIZE
select ARCH_HAS_FORTIFY_SOURCE
select ARCH_HAS_GCOV_PROFILE_ALL
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 56dfa7a2a6f2..ea0c69236789 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -49,7 +49,8 @@ obj-y := cputable.o ptrace.o syscalls.o \
   signal.o sysfs.o cacheinfo.o time.o \
   prom.o traps.o setup-common.o \
   udbg.o misc.o io.o misc_$(BITS).o \
-  of_platform.o prom_parse.o
+  of_platform.o prom_parse.o \
+  dma-common.o
 obj-$(CONFIG_PPC64)+= setup_64.o sys_ppc32.o \
   signal_64.o ptrace32.o \
   paca.o nvram_64.o firmware.o
diff --git a/arch/powerpc/kernel/dma-common.c b/arch/powerpc/kernel/dma-common.c
new file mode 100644
index ..5a15f99f4199
--- /dev/null
+++ b/arch/powerpc/kernel/dma-common.c
@@ -0,0 +1,17 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Contains common dma routines for all powerpc platforms.
+ *
+ * Copyright (C) 2019 Shawn Anastasio (sh...@anastas.io)
+ */
+
+#include 
+#include 
+
+pgprot_t arch_dma_mmap_pgprot(struct device *dev, pgprot_t prot,
+   unsigned long attrs)
+{
+   if (!dev_is_dma_coherent(dev))
+   return pgprot_noncached(prot);
+   return prot;
+}
-- 
2.22.0



Re: [PATCH] powerpc: remove meaningless KBUILD_ARFLAGS addition

2019-07-17 Thread Segher Boessenkool
On Thu, Jul 18, 2019 at 12:19:36AM +0900, Masahiro Yamada wrote:
> On Wed, Jul 17, 2019 at 11:38 PM Segher Boessenkool
>  wrote:
> >
> > On Tue, Jul 16, 2019 at 10:15:47PM +1000, Michael Ellerman wrote:
> > > Segher Boessenkool  writes:
> > > And it's definitely calling ar with no flags, eg:
> > >
> > >   rm -f init/built-in.a; powerpc-linux-ar rcSTPD init/built-in.a 
> > > init/main.o init/version.o init/do_mounts.o init/do_mounts_rd.o 
> > > init/do_mounts_initrd.o init/do_mounts_md.o init/initramfs.o 
> > > init/init_task.o
> >
> > This uses thin archives.  Those will work fine.
> >
> > The failing case was empty files IIRC, stuff created from no inputs.
> 
> Actually, empty files are created everywhere.

> cmd_ar_builtin = rm -f $@; $(AR) rcSTP$(KBUILD_ARFLAGS) $@ $(real-prereqs)

You use thin archives.

Does every config use thin archives always nowadays?

> BTW, your commit 8995ac8702737147115e1c75879a1a2d75627b9e
> dates back to 2008.
> 
> At that time, thin archive was not used.

Yes, I know.  This isn't about built-in.[oa], it is about *other*
archives we at least *used to* create.  If we *know* we do not anymore,
then this workaround can of course be removed (and good riddance).

If ar creates an archive file (a real one, not a thin archive), and it
has no input files, it uses its default object format as destination
format, if it isn't told to use something else.  And that doesn't work,
it needs to use some format compatible with what that archive later is
linked with.


Segher


Re: [PATCH v4 13/15] docs: ABI: testing: make the files compatible with ReST output

2019-07-17 Thread Jonathan Cameron
On Wed, 17 Jul 2019 09:28:17 -0300
Mauro Carvalho Chehab  wrote:

> Some files over there won't parse well by Sphinx.
> 
> Fix them.
> 
> Signed-off-by: Mauro Carvalho Chehab 
Hi Mauro,

Does feel like this one should perhaps have been broken up a touch!

For the IIO ones I've eyeballed them rather than testing the results.

Acked-by: Jonathan Cameron 




Re: [PATCH] powerpc: remove meaningless KBUILD_ARFLAGS addition

2019-07-17 Thread Masahiro Yamada
On Wed, Jul 17, 2019 at 11:38 PM Segher Boessenkool
 wrote:
>
> On Tue, Jul 16, 2019 at 10:15:47PM +1000, Michael Ellerman wrote:
> > Segher Boessenkool  writes:
> > And it's definitely calling ar with no flags, eg:
> >
> >   rm -f init/built-in.a; powerpc-linux-ar rcSTPD init/built-in.a 
> > init/main.o init/version.o init/do_mounts.o init/do_mounts_rd.o 
> > init/do_mounts_initrd.o init/do_mounts_md.o init/initramfs.o 
> > init/init_task.o
>
> This uses thin archives.  Those will work fine.
>
> The failing case was empty files IIRC, stuff created from no inputs.

Actually, empty files are created everywhere.
I do not see any problems whether the target is 32-bit or 64-bit.


Just try allnoconfig, for example.
You can apply the following to debug.


diff --git a/scripts/Makefile.build b/scripts/Makefile.build
index be38198d98b2..9d6b30e50663 100644
--- a/scripts/Makefile.build
+++ b/scripts/Makefile.build
@@ -413,6 +413,7 @@ quiet_cmd_ar_builtin = AR  $@
   cmd_ar_builtin = rm -f $@; $(AR) rcSTP$(KBUILD_ARFLAGS) $@ $(real-prereqs)

 $(builtin-target): $(real-obj-y) FORCE
+   @$(if $(real-prereqs),,echo creating empty archive: $@)
$(call if_changed,ar_builtin)

 targets += $(builtin-target)

The following are all empty files created from no input.

creating empty archive: usr/built-in.a
creating empty archive: arch/powerpc/crypto/built-in.a
creating empty archive: arch/powerpc/math-emu/built-in.a
  CHK include/generated/compile.h
creating empty archive: arch/powerpc/net/built-in.a
creating empty archive: arch/powerpc/platforms/built-in.a
creating empty archive: arch/powerpc/sysdev/built-in.a
creating empty archive: certs/built-in.a
creating empty archive: ipc/built-in.a
creating empty archive: block/built-in.a
creating empty archive: drivers/amba/built-in.a
creating empty archive: sound/built-in.a
creating empty archive: drivers/auxdisplay/built-in.a
creating empty archive: crypto/built-in.a
creating empty archive: drivers/block/built-in.a
creating empty archive: net/built-in.a
creating empty archive: drivers/bus/built-in.a
creating empty archive: drivers/cdrom/built-in.a
creating empty archive: drivers/char/ipmi/built-in.a
creating empty archive: arch/powerpc/kernel/trace/built-in.a
creating empty archive: drivers/char/agp/built-in.a
creating empty archive: virt/lib/built-in.a
creating empty archive: drivers/clk/actions/built-in.a
creating empty archive: drivers/clocksource/built-in.a
creating empty archive: drivers/clk/analogbits/built-in.a
creating empty archive: drivers/firewire/built-in.a
creating empty archive: drivers/clk/bcm/built-in.a
creating empty archive: drivers/clk/imx/built-in.a
creating empty archive: drivers/clk/imgtec/built-in.a
creating empty archive: drivers/clk/ingenic/built-in.a
creating empty archive: drivers/clk/mediatek/built-in.a
creating empty archive: drivers/gpu/drm/arm/built-in.a
creating empty archive: drivers/clk/mvebu/built-in.a
creating empty archive: drivers/firmware/broadcom/built-in.a
creating empty archive: drivers/firmware/imx/built-in.a
creating empty archive: drivers/gpu/drm/bridge/synopsys/built-in.a
creating empty archive: drivers/clk/renesas/built-in.a
creating empty archive: drivers/firmware/meson/built-in.a
creating empty archive: drivers/firmware/psci/built-in.a
creating empty archive: drivers/clk/ti/built-in.a
creating empty archive: drivers/gpu/drm/hisilicon/built-in.a
creating empty archive: drivers/firmware/tegra/built-in.a
creating empty archive: drivers/gpu/drm/i2c/built-in.a
creating empty archive: drivers/gpu/drm/omapdrm/displays/built-in.a
creating empty archive: drivers/gpu/drm/omapdrm/dss/built-in.a
creating empty archive: drivers/firmware/xilinx/built-in.a
creating empty archive: drivers/gpu/drm/panel/built-in.a
creating empty archive: drivers/gpu/drm/rcar-du/built-in.a
creating empty archive: drivers/hwtracing/intel_th/built-in.a
creating empty archive: kernel/livepatch/built-in.a
creating empty archive: drivers/gpu/drm/tilcdc/built-in.a
creating empty archive: drivers/base/firmware_loader/builtin/built-in.a
creating empty archive: drivers/base/power/built-in.a
creating empty archive: drivers/gpu/vga/built-in.a
creating empty archive: drivers/base/test/built-in.a
creating empty archive: drivers/i2c/algos/built-in.a
creating empty archive: drivers/i3c/built-in.a
creating empty archive: drivers/i2c/busses/built-in.a
creating empty archive: drivers/idle/built-in.a
creating empty archive: drivers/i2c/muxes/built-in.a
creating empty archive: drivers/iommu/built-in.a
creating empty archive: fs/devpts/built-in.a
creating empty archive: drivers/macintosh/built-in.a
creating empty archive: fs/notify/dnotify/built-in.a
creating empty archive: fs/notify/fanotify/built-in.a
creating empty archive: fs/notify/inotify/built-in.a
creating empty archive: drivers/media/common/b2c2/built-in.a
creating empty archive: drivers/media/common/saa7146/built-in.a
creating empty archive: fs/quota/built-in.a
creating empty archive: drivers/media/firewi

Re: [PATCH v4 4/8] KVM: PPC: Ultravisor: Use UV_WRITE_PATE ucall to register a PATE

2019-07-17 Thread Ryan Grimm
On Thu, 2019-07-11 at 22:57 +1000, Michael Ellerman wrote:
> Claudio Carvalho  writes:
> > From: Michael Anderson 
> > 
> > When running under an ultravisor, the ultravisor controls the real
> > partition table and has it in secure memory where the hypervisor
> > can't
> > access it, and therefore we (the HV) have to do a ucall whenever we
> > want
> > to update an entry.
> > 
> > The HV still keeps a copy of its view of the partition table in
> > normal
> > memory so that the nest MMU can access it.
> > 
> > Both partition tables will have PATE entries for HV and normal
> > virtual
> 
> Can you expand novel acronyms on their first usage please, ie. PATE.
> 

Agreed.  This confused me a while ago.  It is "Partition Table Entry",
correct?

> > machines.
> > 
> > Suggested-by: Ryan Grimm 
> 

Please remove this and add 

Reviewed-by: Ryan Grimm 


> "Suggested" implies this is optional, but it's not :)
> 
> I'm not sure exactly what Ryan's input was, but feel free to make up
> a
> better tag :)
> 
> > Signed-off-by: Michael Anderson 
> > Signed-off-by: Madhavan Srinivasan 
> > Signed-off-by: Ram Pai 
> > [ Write the pate in HV's table before doing that in UV's ]
> > Signed-off-by: Claudio Carvalho 
> > ---
> >  arch/powerpc/include/asm/ultravisor-api.h |  5 +++-
> >  arch/powerpc/include/asm/ultravisor.h | 14 ++
> >  arch/powerpc/mm/book3s64/hash_utils.c |  3 +-
> >  arch/powerpc/mm/book3s64/pgtable.c| 34
> > +--
> >  arch/powerpc/mm/book3s64/radix_pgtable.c  |  9 --
> >  5 files changed, 57 insertions(+), 8 deletions(-)
> > 
> > diff --git a/arch/powerpc/include/asm/ultravisor-api.h
> > b/arch/powerpc/include/asm/ultravisor-api.h
> > index 49e766adabc7..141940771add 100644
> > --- a/arch/powerpc/include/asm/ultravisor-api.h
> > +++ b/arch/powerpc/include/asm/ultravisor-api.h
> > @@ -15,6 +15,9 @@
> >  #define U_SUCCESS  H_SUCCESS
> >  #define U_FUNCTION H_FUNCTION
> >  #define U_PARAMETERH_PARAMETER
> > +#define U_PERMISSION   H_PERMISSION
> >  
> > -#endif /* _ASM_POWERPC_ULTRAVISOR_API_H */
> 
> What happened there? Just a diff artifact?
> 
> > +/* opcodes */
> > +#define UV_WRITE_PATE  0xF104
> >  
> > +#endif /* _ASM_POWERPC_ULTRAVISOR_API_H */
> > diff --git a/arch/powerpc/include/asm/ultravisor.h
> > b/arch/powerpc/include/asm/ultravisor.h
> > index a78a2dacfd0b..996c1efd6c6d 100644
> > --- a/arch/powerpc/include/asm/ultravisor.h
> > +++ b/arch/powerpc/include/asm/ultravisor.h
> > @@ -12,6 +12,8 @@
> >  
> >  #if !defined(__ASSEMBLY__)
> >  
> > +#include 
> > +
> >  /* Internal functions */
> >  extern int early_init_dt_scan_ultravisor(unsigned long node, const
> > char *uname,
> >  int depth, void *data);
> > @@ -28,8 +30,20 @@ extern int
> > early_init_dt_scan_ultravisor(unsigned long node, const char
> > *uname,
> >   */
> >  #if defined(CONFIG_PPC_POWERNV)
> >  long ucall(unsigned long opcode, unsigned long *retbuf, ...);
> > +#else
> > +static long ucall(unsigned long opcode, unsigned long *retbuf,
> > ...)
> 
>   ^
>   inline
> 
> > +{
> > +   return U_NOT_AVAILABLE;
> > +}
> >  #endif
> 
> That should have been in the previous patch.
> 
> > +static inline int uv_register_pate(u64 lpid, u64 dw0, u64 dw1)
> > +{
> > +   unsigned long retbuf[UCALL_BUFSIZE];
> > +
> > +   return ucall(UV_WRITE_PATE, retbuf, lpid, dw0, dw1);
> 
> You probably want a ucall_norets().
> 
> > +}
> > +
> >  #endif /* !__ASSEMBLY__ */
> >  
> >  #endif /* _ASM_POWERPC_ULTRAVISOR_H */
> > diff --git a/arch/powerpc/mm/book3s64/hash_utils.c
> > b/arch/powerpc/mm/book3s64/hash_utils.c
> > index 1ff451892d7f..220a4e133240 100644
> > --- a/arch/powerpc/mm/book3s64/hash_utils.c
> > +++ b/arch/powerpc/mm/book3s64/hash_utils.c
> > @@ -1080,9 +1080,10 @@ void hash__early_init_mmu_secondary(void)
> >  
> > if (!cpu_has_feature(CPU_FTR_ARCH_300))
> > mtspr(SPRN_SDR1, _SDR1);
> > -   else
> > +   else if (!firmware_has_feature(FW_FEATURE_ULTRAVISOR))
> > mtspr(SPRN_PTCR,
> >   __pa(partition_tb) | (PATB_SIZE_SHIFT -
> > 12));
> 
> Needs a comment for the else case.
> 
> > }
> > /* Initialize SLB */
> > slb_initialize();
> > diff --git a/arch/powerpc/mm/book3s64/pgtable.c
> > b/arch/powerpc/mm/book3s64/pgtable.c
> > index ad3dd977c22d..224c5c7c2e3d 100644
> > --- a/arch/powerpc/mm/book3s64/pgtable.c
> > +++ b/arch/powerpc/mm/book3s64/pgtable.c
> > @@ -16,6 +16,8 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> > +#include 
> >  
> >  #include 
> >  #include 
> > @@ -206,12 +208,25 @@ void __init mmu_partition_table_init(void)
> >  * 64 K size.
> >  */
> > ptcr = __pa(partition_tb) | (PATB_SIZE_SHIFT - 12);
> > -   mtspr(SPRN_PTCR, ptcr);
> > +   /*
> > +* If ultravisor is available, it is responsible for creating
> > and
> > +* mana

Re: [PATCH] powerpc: remove meaningless KBUILD_ARFLAGS addition

2019-07-17 Thread Segher Boessenkool
On Tue, Jul 16, 2019 at 10:15:47PM +1000, Michael Ellerman wrote:
> Segher Boessenkool  writes:
> And it's definitely calling ar with no flags, eg:
> 
>   rm -f init/built-in.a; powerpc-linux-ar rcSTPD init/built-in.a init/main.o 
> init/version.o init/do_mounts.o init/do_mounts_rd.o init/do_mounts_initrd.o 
> init/do_mounts_md.o init/initramfs.o init/init_task.o

This uses thin archives.  Those will work fine.

The failing case was empty files IIRC, stuff created from no inputs.

> So presumably at some point ar learnt to cope with objects that don't
> match its default? (how do I ask it what its default is?)

The first supported target in the --help list is the default, I think?

$ powerpc-linux-ar --help
 [...]
powerpc-linux-ar: supported targets: elf32-powerpc aixcoff-rs6000 
elf32-powerpcle ppcboot elf64-powerpc elf64-powerpcle elf64-little elf64-big 
elf32-little elf32-big plugin srec symbolsrec verilog tekhex binary ihex

$ powerpc64-linux-ar --help
 [...]
powerpc64-linux-ar: supported targets: elf64-powerpc elf64-powerpcle 
elf32-powerpc elf32-powerpcle aixcoff-rs6000 aixcoff64-rs6000 aix5coff64-rs6000 
elf64-little elf64-big elf32-little elf32-big plugin srec symbolsrec verilog 
tekhex binary ihex

$ powerpc64le-linux-ar --help
 [...]
powerpc64le-linux-ar: supported targets: elf64-powerpcle elf64-powerpc 
elf32-powerpcle elf32-powerpc aixcoff-rs6000 aixcoff64-rs6000 aix5coff64-rs6000 
elf64-little elf64-big elf32-little elf32-big plugin srec symbolsrec verilog 
tekhex binary ihex

$ powerpcle-linux-ar --help
 [...]
powerpcle-linux-ar: supported targets: elf32-powerpcle aixcoff-rs6000 
elf32-powerpc ppcboot elf64-powerpc elf64-powerpcle elf64-little elf64-big 
elf32-little elf32-big plugin srec symbolsrec verilog tekhex binary ihex

> > Then again, does that work at *all* nowadays?  Do we even consider that
> > important, *should* it work?
> 
> Yes and yes. There were a lot of bugs in the kernel makefiles after we
> added LE support which prevented a biarch/biendian compiler from working.
> But now it does work and we want it to keep working because it means you
> can have a single compiler for building 32-bit, 64-bit BE & 64-bit LE.

Good to hear :-)


Segher


Re: [PATCH v2 4/4] tools: Add fchmodat4

2019-07-17 Thread Arnaldo Carvalho de Melo
On Tue, Jul 16, 2019 at 06:27:19PM -0700, Palmer Dabbelt wrote:
> I'm not sure why it's necessary to add this explicitly to tools/ as well
> as arch/, but there were a few instances of fspick in here so I blindly
> added fchmodat4 in the same fashion.

The copies in tools/ for these specific files are used to generate a
syscall table used by 'perf trace', and we don't/can't access files
outside of tools/ to build tools/perf/, so we grab a copy and have
checks in place to warn perf developers when those copies get out of
sync.

It's not required that kernel developers update anything in tools;
you're welcome to do so if you wish, though.

Thanks,

- Arnaldo
 
> Signed-off-by: Palmer Dabbelt 
> ---
>  tools/include/uapi/asm-generic/unistd.h   | 4 +++-
>  tools/perf/arch/x86/entry/syscalls/syscall_64.tbl | 1 +
>  2 files changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/include/uapi/asm-generic/unistd.h 
> b/tools/include/uapi/asm-generic/unistd.h
> index a87904daf103..36232ea94956 100644
> --- a/tools/include/uapi/asm-generic/unistd.h
> +++ b/tools/include/uapi/asm-generic/unistd.h
> @@ -844,9 +844,11 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
>  __SYSCALL(__NR_fsmount, sys_fsmount)
>  #define __NR_fspick 433
>  __SYSCALL(__NR_fspick, sys_fspick)
> +#define __NR_fchmodat4 434
> +__SYSCALL(__NR_fchmodat4, sys_fchmodat4)
>  
>  #undef __NR_syscalls
> -#define __NR_syscalls 434
> +#define __NR_syscalls 435
>  
>  /*
>   * 32 bit systems traditionally used different
> diff --git a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl 
> b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
> index b4e6f9e6204a..b92d5b195e66 100644
> --- a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -355,6 +355,7 @@
>  431  common  fsconfig__x64_sys_fsconfig
>  432  common  fsmount __x64_sys_fsmount
>  433  common  fspick  __x64_sys_fspick
> +434  common  fchmodat4   __x64_sys_fchmodat4
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> -- 
> 2.21.0

-- 

- Arnaldo


[PATCH v4 12/15] docs: ABI: stable: make files ReST compatible

2019-07-17 Thread Mauro Carvalho Chehab
Several entries at the stable ABI files won't parse if we pass
them directly to the ReST output.

Adjust them, in order to allow adding their contents as-is at
the stable ABI book.

Signed-off-by: Mauro Carvalho Chehab 
---
 Documentation/ABI/stable/firewire-cdev|  4 +
 Documentation/ABI/stable/sysfs-acpi-pmprofile | 22 +++--
 Documentation/ABI/stable/sysfs-bus-firewire   |  3 +
 Documentation/ABI/stable/sysfs-bus-nvmem  | 19 ++--
 Documentation/ABI/stable/sysfs-bus-usb|  6 +-
 .../ABI/stable/sysfs-class-backlight  |  1 +
 .../ABI/stable/sysfs-class-infiniband | 95 +--
 Documentation/ABI/stable/sysfs-class-rfkill   | 13 ++-
 Documentation/ABI/stable/sysfs-class-tpm  | 90 +-
 Documentation/ABI/stable/sysfs-devices|  5 +-
 Documentation/ABI/stable/sysfs-driver-ib_srp  |  1 +
 .../ABI/stable/sysfs-firmware-efi-vars|  4 +
 .../ABI/stable/sysfs-firmware-opal-dump   |  5 +
 .../ABI/stable/sysfs-firmware-opal-elog   |  2 +
 Documentation/ABI/stable/sysfs-hypervisor-xen |  3 +
 Documentation/ABI/stable/vdso |  5 +-
 Documentation/admin-guide/abi-stable.rst  |  1 +
 17 files changed, 179 insertions(+), 100 deletions(-)

diff --git a/Documentation/ABI/stable/firewire-cdev 
b/Documentation/ABI/stable/firewire-cdev
index f72ed653878a..c9e8ff026154 100644
--- a/Documentation/ABI/stable/firewire-cdev
+++ b/Documentation/ABI/stable/firewire-cdev
@@ -14,12 +14,14 @@ Description:
Each /dev/fw* is associated with one IEEE 1394 node, which can
be remote or local nodes.  Operations on a /dev/fw* file have
different scope:
+
  - The 1394 node which is associated with the file:
  - Asynchronous request transmission
  - Get the Configuration ROM
  - Query node ID
  - Query maximum speed of the path between this node
and local node
+
  - The 1394 bus (i.e. "card") to which the node is attached to:
  - Isochronous stream transmission and reception
  - Asynchronous stream transmission and reception
@@ -31,6 +33,7 @@ Description:
manager
  - Query cycle time
  - Bus reset initiation, bus reset event reception
+
  - All 1394 buses:
  - Allocation of IEEE 1212 address ranges on the local
link layers, reception of inbound requests to such
@@ -43,6 +46,7 @@ Description:
userland implement different access permission models, some
operations are restricted to /dev/fw* files that are associated
with a local node:
+
  - Addition of descriptors or directories to the local
nodes' Configuration ROM
  - PHY packet transmission and reception
diff --git a/Documentation/ABI/stable/sysfs-acpi-pmprofile 
b/Documentation/ABI/stable/sysfs-acpi-pmprofile
index 964c7a8afb26..fd97d22b677f 100644
--- a/Documentation/ABI/stable/sysfs-acpi-pmprofile
+++ b/Documentation/ABI/stable/sysfs-acpi-pmprofile
@@ -6,17 +6,21 @@ Description:  The ACPI pm_profile sysfs interface exports the 
platform
power management (and performance) requirement expectations
as provided by BIOS. The integer value is directly passed as
retrieved from the FADT ACPI table.
-Values: For possible values see ACPI specification:
+
+Values:For possible values see ACPI specification:
5.2.9 Fixed ACPI Description Table (FADT)
Field: Preferred_PM_Profile
 
Currently these values are defined by spec:
-   0 Unspecified
-   1 Desktop
-   2 Mobile
-   3 Workstation
-   4 Enterprise Server
-   5 SOHO Server
-   6 Appliance PC
-   7 Performance Server
+
+   == =
+   0  Unspecified
+   1  Desktop
+   2  Mobile
+   3  Workstation
+   4  Enterprise Server
+   5  SOHO Server
+   6  Appliance PC
+   7  Performance Server
+   >7 Reserved
+   == =
diff --git a/Documentation/ABI/stable/sysfs-bus-firewire 
b/Documentation/ABI/stable/sysfs-bus-firewire
index 41e5a0cd1e3e..9ac9eddb82ef 100644
--- a/Documentation/ABI/stable/sysfs-bus-firewire
+++ b/Documentation/ABI/stable/sysfs-bus-firewire
@@ -47,6 +47,7 @@ Description:
IEEE 1394 node device attribute.
Read-only and immutable.
 Values:1: The sysfs entry represents a local node (a 
controller card).
+
0

[PATCH v3 14/20] docs: ABI: stable: make files ReST compatible

2019-07-17 Thread Mauro Carvalho Chehab
Several entries at the stable ABI files won't parse if we pass
them directly to the ReST output.

Adjust them, in order to allow adding their contents as-is at
the stable ABI book.

Signed-off-by: Mauro Carvalho Chehab 
---
 Documentation/ABI/stable/firewire-cdev|  4 +
 Documentation/ABI/stable/sysfs-acpi-pmprofile | 22 +++--
 Documentation/ABI/stable/sysfs-bus-firewire   |  3 +
 Documentation/ABI/stable/sysfs-bus-nvmem  | 19 ++--
 Documentation/ABI/stable/sysfs-bus-usb|  6 +-
 .../ABI/stable/sysfs-class-backlight  |  1 +
 .../ABI/stable/sysfs-class-infiniband | 95 +--
 Documentation/ABI/stable/sysfs-class-rfkill   | 13 ++-
 Documentation/ABI/stable/sysfs-class-tpm  | 90 +-
 Documentation/ABI/stable/sysfs-devices|  5 +-
 Documentation/ABI/stable/sysfs-driver-ib_srp  |  1 +
 .../ABI/stable/sysfs-firmware-efi-vars|  4 +
 .../ABI/stable/sysfs-firmware-opal-dump   |  5 +
 .../ABI/stable/sysfs-firmware-opal-elog   |  2 +
 Documentation/ABI/stable/sysfs-hypervisor-xen |  3 +
 Documentation/ABI/stable/vdso |  5 +-
 16 files changed, 178 insertions(+), 100 deletions(-)

diff --git a/Documentation/ABI/stable/firewire-cdev 
b/Documentation/ABI/stable/firewire-cdev
index f72ed653878a..c9e8ff026154 100644
--- a/Documentation/ABI/stable/firewire-cdev
+++ b/Documentation/ABI/stable/firewire-cdev
@@ -14,12 +14,14 @@ Description:
Each /dev/fw* is associated with one IEEE 1394 node, which can
be remote or local nodes.  Operations on a /dev/fw* file have
different scope:
+
  - The 1394 node which is associated with the file:
  - Asynchronous request transmission
  - Get the Configuration ROM
  - Query node ID
  - Query maximum speed of the path between this node
and local node
+
  - The 1394 bus (i.e. "card") to which the node is attached to:
  - Isochronous stream transmission and reception
  - Asynchronous stream transmission and reception
@@ -31,6 +33,7 @@ Description:
manager
  - Query cycle time
  - Bus reset initiation, bus reset event reception
+
  - All 1394 buses:
  - Allocation of IEEE 1212 address ranges on the local
link layers, reception of inbound requests to such
@@ -43,6 +46,7 @@ Description:
userland implement different access permission models, some
operations are restricted to /dev/fw* files that are associated
with a local node:
+
  - Addition of descriptors or directories to the local
nodes' Configuration ROM
  - PHY packet transmission and reception
diff --git a/Documentation/ABI/stable/sysfs-acpi-pmprofile 
b/Documentation/ABI/stable/sysfs-acpi-pmprofile
index 964c7a8afb26..fd97d22b677f 100644
--- a/Documentation/ABI/stable/sysfs-acpi-pmprofile
+++ b/Documentation/ABI/stable/sysfs-acpi-pmprofile
@@ -6,17 +6,21 @@ Description:  The ACPI pm_profile sysfs interface exports the 
platform
power management (and performance) requirement expectations
as provided by BIOS. The integer value is directly passed as
retrieved from the FADT ACPI table.
-Values: For possible values see ACPI specification:
+
+Values:For possible values see ACPI specification:
5.2.9 Fixed ACPI Description Table (FADT)
Field: Preferred_PM_Profile
 
Currently these values are defined by spec:
-   0 Unspecified
-   1 Desktop
-   2 Mobile
-   3 Workstation
-   4 Enterprise Server
-   5 SOHO Server
-   6 Appliance PC
-   7 Performance Server
+
+   == =
+   0  Unspecified
+   1  Desktop
+   2  Mobile
+   3  Workstation
+   4  Enterprise Server
+   5  SOHO Server
+   6  Appliance PC
+   7  Performance Server
+   >7 Reserved
+   == =
diff --git a/Documentation/ABI/stable/sysfs-bus-firewire 
b/Documentation/ABI/stable/sysfs-bus-firewire
index 41e5a0cd1e3e..9ac9eddb82ef 100644
--- a/Documentation/ABI/stable/sysfs-bus-firewire
+++ b/Documentation/ABI/stable/sysfs-bus-firewire
@@ -47,6 +47,7 @@ Description:
IEEE 1394 node device attribute.
Read-only and immutable.
 Values:1: The sysfs entry represents a local node (a 
controller card).
+
0: The sysfs entry represents a remote node.
 
 
@@ -12

Re: [PATCH v2 3/4] arch: Register fchmodat4, usually as syscall 434

2019-07-17 Thread Arnd Bergmann
On Wed, Jul 17, 2019 at 3:29 AM Palmer Dabbelt  wrote:
>
> This registers the new fchmodat4 syscall in most places as number 434,
> with alpha being the exception where it's 544.  I found all these sites
> by grepping for fspick, which I assume has found me everything.

434 is now pidfd_open, the next free one is 436.
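
For illustration, the corrected table entries might look something like the
following (hypothetical: this assumes the next free common number is still 436
at merge time, and that alpha keeps its usual +110 offset, matching the 434/544
pair in the original patch):

```
# arch/arm/tools/syscall.tbl (and similar generic tables)
436	common	fchmodat4		sys_fchmodat4

# arch/alpha/kernel/syscalls/syscall.tbl
546	common	fchmodat4		sys_fchmodat4
```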

>  arch/alpha/kernel/syscalls/syscall.tbl  | 1 +
>  arch/arm/tools/syscall.tbl  | 1 +
>  arch/arm64/include/asm/unistd32.h   | 2 ++

You missed arch/arm64/include/asm/unistd.h, which
contains __NR_compat_syscalls

  Arnd


[PATCH] powerpc/xive: Fix loop exit-condition in xive_find_target_in_mask()

2019-07-17 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

xive_find_target_in_mask() contains the following for(;;) loop, which has
a bug when @first == cpumask_first(@mask) and condition 1 fails to hold
for every CPU in @mask. In this case we loop forever.

  first = cpu;
  for (;;) {
  if (cpu_online(cpu) && xive_try_pick_target(cpu)) // condition 1
  return cpu;
  cpu = cpumask_next(cpu, mask);
  if (cpu == first) // condition 2
  break;

  if (cpu >= nr_cpu_ids) // condition 3
  cpu = cpumask_first(mask);
  }

This is because, when @first == cpumask_first(@mask), we never hit
condition 2 (cpu == first): prior to this check, we have already
executed "cpu = cpumask_next(cpu, mask)", which sets @cpu either to a
value greater than @first or to nr_cpu_ids. When this is coupled with
the fact that condition 1 is never met, we never exit this loop.

This was discovered by the hard-lockup detector while running LTP tests
concurrently with SMT switch tests.

 watchdog: CPU 12 detected hard LOCKUP on other CPUs 68
 watchdog: CPU 12 TB:85587019220796, last SMP heartbeat TB:85578827223399 
(15999ms ago)
 watchdog: CPU 68 Hard LOCKUP
 watchdog: CPU 68 TB:85587019361273, last heartbeat TB:85576815065016 (19930ms 
ago)
 CPU: 68 PID: 45050 Comm: hxediag Kdump: loaded Not tainted 
4.18.0-100.el8.ppc64le #1
 NIP:  c06f5578 LR: c0cba9ec CTR: 
 REGS: c000201fff3c7d80 TRAP: 0100   Not tainted  (4.18.0-100.el8.ppc64le)
 MSR:  92883033   CR: 24028424  XER: 

 CFAR: c06f558c IRQMASK: 1
 GPR00: c00afc58 c000201c01c43400 c15ce500 c000201cae26ec18
 GPR04: 0800 0540 0800 00f8
 GPR08: 0020 00a8 8000 c0081a1beed8
 GPR12: c00b1410 c000201fff7f4c00  
 GPR16:   0540 0001
 GPR20: 0048 1011 c0081a1e3780 c000201cae26ed18
 GPR24:  c000201cae26ed8c 0001 c1116bc0
 GPR28: c1601ee8 c1602494 c000201cae26ec18 001f
 NIP [c06f5578] find_next_bit+0x38/0x90
 LR [c0cba9ec] cpumask_next+0x2c/0x50
 Call Trace:
 [c000201c01c43400] [c000201cae26ec18] 0xc000201cae26ec18 (unreliable)
 [c000201c01c43420] [c00afc58] xive_find_target_in_mask+0x1b8/0x240
 [c000201c01c43470] [c00b0228] xive_pick_irq_target.isra.3+0x168/0x1f0
 [c000201c01c435c0] [c00b1470] xive_irq_startup+0x60/0x260
 [c000201c01c43640] [c01d8328] __irq_startup+0x58/0xf0
 [c000201c01c43670] [c01d844c] irq_startup+0x8c/0x1a0
 [c000201c01c436b0] [c01d57b0] __setup_irq+0x9f0/0xa90
 [c000201c01c43760] [c01d5aa0] request_threaded_irq+0x140/0x220
 [c000201c01c437d0] [c0081a17b3d4] bnx2x_nic_load+0x188c/0x3040 [bnx2x]
 [c000201c01c43950] [c0081a187c44] bnx2x_self_test+0x1fc/0x1f70 [bnx2x]
 [c000201c01c43a90] [c0adc748] dev_ethtool+0x11d8/0x2cb0
 [c000201c01c43b60] [c0b0b61c] dev_ioctl+0x5ac/0xa50
 [c000201c01c43bf0] [c0a8d4ec] sock_do_ioctl+0xbc/0x1b0
 [c000201c01c43c60] [c0a8dfb8] sock_ioctl+0x258/0x4f0
 [c000201c01c43d20] [c04c9704] do_vfs_ioctl+0xd4/0xa70
 [c000201c01c43de0] [c04ca274] sys_ioctl+0xc4/0x160
 [c000201c01c43e30] [c000b388] system_call+0x5c/0x70
 Instruction dump:
 78aad182 54a806be 3920 78a50664 794a1f24 7d294036 7d43502a 7d295039
 4182001c 4834 78a9d182 79291f24 <7d23482a> 2fa9 409e0020 38a50040

To fix this, move the check for condition 2 after the check for
condition 3, so that in the problem case we break out of the loop soon
after iterating through all the CPUs in @mask. Use a do..while() loop
to achieve this.
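
The resulting loop structure can be sketched in standalone form as follows
(illustrative only, not kernel code: cpumask_next()/cpumask_first() are modeled
by scanning a plain bool array, and xive_try_pick_target() by a pickable[]
flag):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * The do..while places the "cpu == first" test after the wrap-around,
 * so one full pass over the mask is guaranteed to terminate even when
 * no CPU is pickable and @first is the lowest set bit of the mask.
 */
#define NR_CPUS 8

static int next_cpu(const bool *mask, int cpu)
{
	for (cpu++; cpu < NR_CPUS; cpu++)
		if (mask[cpu])
			return cpu;
	return NR_CPUS;			/* models nr_cpu_ids */
}

static int first_cpu(const bool *mask)
{
	return next_cpu(mask, -1);
}

int find_target_in_mask(const bool *mask, const bool *pickable, int cpu)
{
	int first = cpu;

	do {
		if (pickable[cpu])	/* models condition 1 */
			return cpu;
		cpu = next_cpu(mask, cpu);
		if (cpu >= NR_CPUS)	/* condition 3: wrap around */
			cpu = first_cpu(mask);
	} while (cpu != first);		/* condition 2, now checked last */

	return -1;			/* no valid target in the mask */
}
```

Starting at the first set bit with nothing pickable, the loop now visits each
CPU once, wraps, and exits when it returns to @first instead of spinning.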

Fixes: 243e25112d06 ("powerpc/xive: Native exploitation of the XIVE
interrupt controller")
Cc:  # 4.12+
Reported-by: Indira P. Joga 
Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/sysdev/xive/common.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index 082c7e1..1cdb395 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -479,7 +479,7 @@ static int xive_find_target_in_mask(const struct cpumask 
*mask,
 * Now go through the entire mask until we find a valid
 * target.
 */
-   for (;;) {
+   do {
/*
 * We re-check online as the fallback case passes us
 * an untested affinity mask
@@ -487,12 +487,11 @@ static int xive_find_target_in_mask(const struct cpumask 
*mask,
if (cpu_online(cpu) && xive_try_pick_target(cpu))
return cpu;
cpu = cpumask_next(cpu, mask);
-   if (cpu == first)
-   break;
  

[RFC PATCH 10/10] powerpc/fsl_booke/kaslr: dump out kernel offset information on panic

2019-07-17 Thread Jason Yan
When KASLR is enabled, the kernel offset is different on every boot,
which makes it more difficult to debug the kernel. Dump out the kernel
offset on panic so that the kernel can be debugged easily.

Signed-off-by: Jason Yan 
Cc: Diana Craciun 
Cc: Michael Ellerman 
Cc: Christophe Leroy 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Nicholas Piggin 
Cc: Kees Cook 
---
 arch/powerpc/include/asm/page.h |  5 +
 arch/powerpc/kernel/machine_kexec.c |  1 +
 arch/powerpc/kernel/setup-common.c  | 23 +++
 3 files changed, 29 insertions(+)

diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index 60a68d3a54b1..cd3ac530e58d 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -317,6 +317,11 @@ struct vm_area_struct;
 
 extern unsigned long kimage_vaddr;
 
+static inline unsigned long kaslr_offset(void)
+{
+   return kimage_vaddr - KERNELBASE;
+}
+
 #include 
 #endif /* __ASSEMBLY__ */
 #include 
diff --git a/arch/powerpc/kernel/machine_kexec.c 
b/arch/powerpc/kernel/machine_kexec.c
index c4ed328a7b96..078fe3d76feb 100644
--- a/arch/powerpc/kernel/machine_kexec.c
+++ b/arch/powerpc/kernel/machine_kexec.c
@@ -86,6 +86,7 @@ void arch_crash_save_vmcoreinfo(void)
VMCOREINFO_STRUCT_SIZE(mmu_psize_def);
VMCOREINFO_OFFSET(mmu_psize_def, shift);
 #endif
+   vmcoreinfo_append_str("KERNELOFFSET=%lx\n", kaslr_offset());
 }
 
 /*
diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index 1f8db666468d..49e540c0adeb 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -715,12 +715,35 @@ static struct notifier_block ppc_panic_block = {
.priority = INT_MIN /* may not return; must be done last */
 };
 
+/*
+ * Dump out kernel offset information on panic.
+ */
+static int dump_kernel_offset(struct notifier_block *self, unsigned long v,
+ void *p)
+{
+   const unsigned long offset = kaslr_offset();
+
+   if (IS_ENABLED(CONFIG_RANDOMIZE_BASE) && offset > 0)
+   pr_emerg("Kernel Offset: 0x%lx from 0x%lx\n",
+offset, KERNELBASE);
+   else
+   pr_emerg("Kernel Offset: disabled\n");
+
+   return 0;
+}
+
+static struct notifier_block kernel_offset_notifier = {
+   .notifier_call = dump_kernel_offset
+};
+
 void __init setup_panic(void)
 {
/* PPC64 always does a hard irq disable in its panic handler */
if (!IS_ENABLED(CONFIG_PPC64) && !ppc_md.panic)
return;
atomic_notifier_chain_register(&panic_notifier_list, &ppc_panic_block);
+   atomic_notifier_chain_register(&panic_notifier_list,
+  &kernel_offset_notifier);
 }
 
 #ifdef CONFIG_CHECK_CACHE_COHERENCY
-- 
2.17.2



[RFC PATCH 04/10] powerpc/fsl_booke/32: introduce create_tlb_entry() helper

2019-07-17 Thread Jason Yan
Add a new helper create_tlb_entry() to create a TLB entry from a virtual
and physical address. This is a preparation for booting the kernel at a
randomized address.

Signed-off-by: Jason Yan 
Cc: Diana Craciun 
Cc: Michael Ellerman 
Cc: Christophe Leroy 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Nicholas Piggin 
Cc: Kees Cook 
---
 arch/powerpc/kernel/head_fsl_booke.S | 30 
 arch/powerpc/mm/mmu_decl.h   |  1 +
 2 files changed, 31 insertions(+)

diff --git a/arch/powerpc/kernel/head_fsl_booke.S 
b/arch/powerpc/kernel/head_fsl_booke.S
index adf0505dbe02..a57d44638031 100644
--- a/arch/powerpc/kernel/head_fsl_booke.S
+++ b/arch/powerpc/kernel/head_fsl_booke.S
@@ -1114,6 +1114,36 @@ __secondary_hold_acknowledge:
.long   -1
 #endif
 
+/*
+ * Create a 64M tlb by address and entry
+ * r3/r4 - physical address
+ * r5 - virtual address
+ * r6 - entry
+ */
+_GLOBAL(create_tlb_entry)
+   lis r7,0x1000   /* Set MAS0(TLBSEL) = 1 */
+   rlwimi  r7,r6,16,4,15   /* Setup MAS0 = TLBSEL | ESEL(r6) */
+   mtspr   SPRN_MAS0,r7/* Write MAS0 */
+
+   lis r6,(MAS1_VALID|MAS1_IPROT)@h
+   ori r6,r6,(MAS1_TSIZE(BOOK3E_PAGESZ_64M))@l
+   mtspr   SPRN_MAS1,r6/* Write MAS1 */
+
+   lis r6,MAS2_EPN_MASK(BOOK3E_PAGESZ_64M)@h
+   ori r6,r6,MAS2_EPN_MASK(BOOK3E_PAGESZ_64M)@l
+   and r6,r6,r5
+   ori r6,r6,MAS2_M@l
+   mtspr   SPRN_MAS2,r6/* Write MAS2(EPN) */
+
+   mr  r8,r4
+   ori r8,r8,(MAS3_SW|MAS3_SR|MAS3_SX)
+   mtspr   SPRN_MAS3,r8/* Write MAS3(RPN) */
+
+   tlbwe   /* Write TLB */
+   isync
+   sync
+   blr
+
 /*
  * Create a tlb entry with the same effective and physical address as
  * the tlb entry used by the current running code. But set the TS to 1.
diff --git a/arch/powerpc/mm/mmu_decl.h b/arch/powerpc/mm/mmu_decl.h
index 32c1a191c28a..d7737cf97cee 100644
--- a/arch/powerpc/mm/mmu_decl.h
+++ b/arch/powerpc/mm/mmu_decl.h
@@ -142,6 +142,7 @@ extern unsigned long calc_cam_sz(unsigned long ram, 
unsigned long virt,
 extern void adjust_total_lowmem(void);
 extern int switch_to_as1(void);
 extern void restore_to_as0(int esel, int offset, void *dt_ptr, int bootcpu);
+extern void create_tlb_entry(phys_addr_t phys, unsigned long virt, int entry);
 #endif
 extern void loadcam_entry(unsigned int index);
 extern void loadcam_multi(int first_idx, int num, int tmp_idx);
-- 
2.17.2



[RFC PATCH 05/10] powerpc/fsl_booke/32: introduce reloc_kernel_entry() helper

2019-07-17 Thread Jason Yan
Add a new helper reloc_kernel_entry() to jump back to the start of the
new kernel. After we have put the new kernel in a randomized place, we
can use this new helper to enter the kernel and begin relocating again.

Signed-off-by: Jason Yan 
Cc: Diana Craciun 
Cc: Michael Ellerman 
Cc: Christophe Leroy 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Nicholas Piggin 
Cc: Kees Cook 
---
 arch/powerpc/kernel/head_fsl_booke.S | 16 
 arch/powerpc/mm/mmu_decl.h   |  1 +
 2 files changed, 17 insertions(+)

diff --git a/arch/powerpc/kernel/head_fsl_booke.S 
b/arch/powerpc/kernel/head_fsl_booke.S
index a57d44638031..ce40f96dae20 100644
--- a/arch/powerpc/kernel/head_fsl_booke.S
+++ b/arch/powerpc/kernel/head_fsl_booke.S
@@ -1144,6 +1144,22 @@ _GLOBAL(create_tlb_entry)
sync
blr
 
+/*
+ * Return to the start of the relocated kernel and run again
+ * r3 - virtual address of fdt
+ * r4 - entry of the kernel
+ */
+_GLOBAL(reloc_kernel_entry)
+   mfmsr   r7
+   li  r8,(MSR_IS | MSR_DS)
+   andcr7,r7,r8
+
+   mtspr   SPRN_SRR0,r4
+   mtspr   SPRN_SRR1,r7
+   isync
+   sync
+   rfi
+
 /*
  * Create a tlb entry with the same effective and physical address as
  * the tlb entry used by the current running code. But set the TS to 1.
diff --git a/arch/powerpc/mm/mmu_decl.h b/arch/powerpc/mm/mmu_decl.h
index d7737cf97cee..dae8e9177574 100644
--- a/arch/powerpc/mm/mmu_decl.h
+++ b/arch/powerpc/mm/mmu_decl.h
@@ -143,6 +143,7 @@ extern void adjust_total_lowmem(void);
 extern int switch_to_as1(void);
 extern void restore_to_as0(int esel, int offset, void *dt_ptr, int bootcpu);
 extern void create_tlb_entry(phys_addr_t phys, unsigned long virt, int entry);
+extern void reloc_kernel_entry(void *fdt, int addr);
 #endif
 extern void loadcam_entry(unsigned int index);
 extern void loadcam_multi(int first_idx, int num, int tmp_idx);
-- 
2.17.2



[RFC PATCH 07/10] powerpc/fsl_booke/32: randomize the kernel image offset

2019-07-17 Thread Jason Yan
Now that we have basic support for relocating the kernel to an
appropriate place, we can start to randomize the offset.

Entropy is derived from the banner and the timer, which change on every
build and boot. This is not particularly secure, so additionally the
bootloader may pass entropy via the /chosen/kaslr-seed node in the
device tree.

We use the first 512M of low memory to randomize the kernel image. The
memory is split into 64M zones. We use the lower 8 bits of the entropy
to select the index of the 64M zone, and then choose a 16K-aligned
offset inside that zone to put the kernel at.

KERNELBASE

|-->   64M   <--|
|   |
+---+++---+
|   |||kernel||   |
+---+++---+
| |
|->   offset<-|

  kimage_vaddr

We also check whether we would overlap with areas such as the dtb, the
initrd or the crashkernel region. If we cannot find a proper area,
KASLR is disabled and we boot from the original kernel location.
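
The zone/offset selection described above can be sketched as follows (a
standalone illustration under stated assumptions, not the patch's code: the
name pick_kaslr_offset() is invented here, and the exact way the remaining
seed bits select the 16K slot inside the zone is an assumption):

```c
#include <assert.h>

#define SZ_16K	0x4000UL
#define SZ_64M	0x4000000UL
#define SZ_512M	0x20000000UL
#define ZONES	(SZ_512M / SZ_64M)	/* eight 64M zones */

/* Map an entropy seed to a 16K-aligned offset within the first 512M. */
unsigned long pick_kaslr_offset(unsigned long seed)
{
	/* lower 8 bits of the seed select the 64M zone index */
	unsigned long zone = (seed & 0xff) % ZONES;
	/* higher seed bits select the 16K-aligned slot inside the zone */
	unsigned long slot = (seed >> 8) % (SZ_64M / SZ_16K);

	return zone * SZ_64M + slot * SZ_16K;
}
```

Any seed therefore yields an offset that is 16K-aligned and below 512M,
matching the layout in the diagram above.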

Signed-off-by: Jason Yan 
Cc: Diana Craciun 
Cc: Michael Ellerman 
Cc: Christophe Leroy 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Nicholas Piggin 
Cc: Kees Cook 
---
 arch/powerpc/kernel/kaslr_booke.c | 335 +-
 1 file changed, 333 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/kaslr_booke.c 
b/arch/powerpc/kernel/kaslr_booke.c
index 72d8e9432048..90357f4bd313 100644
--- a/arch/powerpc/kernel/kaslr_booke.c
+++ b/arch/powerpc/kernel/kaslr_booke.c
@@ -22,6 +22,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -33,15 +35,342 @@
 #include 
 #include 
 #include 
+#include 
 #include 
+#include 
+#include 
+
+#ifdef DEBUG
+#define DBG(fmt...) printk(KERN_ERR fmt)
+#else
+#define DBG(fmt...)
+#endif
+
+struct regions {
+   unsigned long pa_start;
+   unsigned long pa_end;
+   unsigned long kernel_size;
+   unsigned long dtb_start;
+   unsigned long dtb_end;
+   unsigned long initrd_start;
+   unsigned long initrd_end;
+   unsigned long crash_start;
+   unsigned long crash_end;
+   int reserved_mem;
+   int reserved_mem_addr_cells;
+   int reserved_mem_size_cells;
+};
 
 extern int is_second_reloc;
 
+/* Simplified build-specific string for starting entropy. */
+static const char build_str[] = UTS_RELEASE " (" LINUX_COMPILE_BY "@"
+   LINUX_COMPILE_HOST ") (" LINUX_COMPILER ") " UTS_VERSION;
+static char __initdata early_command_line[COMMAND_LINE_SIZE];
+
+static __init void kaslr_get_cmdline(void *fdt)
+{
+   const char *cmdline = CONFIG_CMDLINE;
+   if (!IS_ENABLED(CONFIG_CMDLINE_FORCE)) {
+   int node;
+   const u8 *prop;
+   node = fdt_path_offset(fdt, "/chosen");
+   if (node < 0)
+   goto out;
+
+   prop = fdt_getprop(fdt, node, "bootargs", NULL);
+   if (!prop)
+   goto out;
+   cmdline = prop;
+   }
+out:
+   strscpy(early_command_line, cmdline, COMMAND_LINE_SIZE);
+}
+
+static unsigned long __init rotate_xor(unsigned long hash, const void *area,
+   size_t size)
+{
+   size_t i;
+   unsigned long *ptr = (unsigned long *)area;
+
+   for (i = 0; i < size / sizeof(hash); i++) {
+   /* Rotate by odd number of bits and XOR. */
+   hash = (hash << ((sizeof(hash) * 8) - 7)) | (hash >> 7);
+   hash ^= ptr[i];
+   }
+
+   return hash;
+}
+
+/* Attempt to create a simple but unpredictable starting entropy. */
+static unsigned long __init get_boot_seed(void *fdt)
+{
+   unsigned long hash = 0;
+
+   hash = rotate_xor(hash, build_str, sizeof(build_str));
+   hash = rotate_xor(hash, fdt, fdt_totalsize(fdt));
+
+   return hash;
+}
+
+static __init u64 get_kaslr_seed(void *fdt)
+{
+   int node, len;
+   fdt64_t *prop;
+   u64 ret;
+
+   node = fdt_path_offset(fdt, "/chosen");
+   if (node < 0)
+   return 0;
+
+   prop = fdt_getprop_w(fdt, node, "kaslr-seed", &len);
+   if (!prop || len != sizeof(u64))
+   return 0;
+
+   ret = fdt64_to_cpu(*prop);
+   *prop = 0;
+   return ret;
+}
+
+static __init bool regions_overlap(u32 s1, u32 e1, u32 s2, u32 e2)
+{
+   return e1 >= s2 && e2 >= s1;
+}
+
+static __init bool overlaps_reserved_region(const void *fdt, u32 start,
+  u32 end, struct regions *regions)
+{
+   int subnode, len, i;
+   u64 base, size;
+
+   /* check for overlap with /memreserve/ entries */
+   for (i = 0; i < fdt_num_mem_rsv(fdt); i++) {
+   if (fdt_get_mem_rsv(fdt, i, &base, &size) < 0)
+   continue;
+   

[RFC PATCH 09/10] powerpc/fsl_booke/kaslr: support nokaslr cmdline parameter

2019-07-17 Thread Jason Yan
One may want to disable KASLR at boot time, so provide a cmdline
parameter 'nokaslr' to support this.

Signed-off-by: Jason Yan 
Cc: Diana Craciun 
Cc: Michael Ellerman 
Cc: Christophe Leroy 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Nicholas Piggin 
Cc: Kees Cook 
---
 arch/powerpc/kernel/kaslr_booke.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/arch/powerpc/kernel/kaslr_booke.c 
b/arch/powerpc/kernel/kaslr_booke.c
index 00339c05879f..e65a5d9d2ff1 100644
--- a/arch/powerpc/kernel/kaslr_booke.c
+++ b/arch/powerpc/kernel/kaslr_booke.c
@@ -373,6 +373,18 @@ static unsigned long __init kaslr_choose_location(void 
*dt_ptr, phys_addr_t size
return kaslr_offset;
 }
 
+static inline __init bool kaslr_disabled(void)
+{
+   char *str;
+
+   str = strstr(early_command_line, "nokaslr");
+   if ((str == early_command_line) ||
+   (str > early_command_line && *(str - 1) == ' '))
+   return true;
+
+   return false;
+}
+
 /*
  * To see if we need to relocate the kernel to a random offset
  * void *dt_ptr - address of the device tree
@@ -388,6 +400,8 @@ notrace void __init kaslr_early_init(void *dt_ptr, 
phys_addr_t size)
kernel_sz = (unsigned long)_end - KERNELBASE;
 
kaslr_get_cmdline(dt_ptr);
+   if (kaslr_disabled())
+   return;
 
offset = kaslr_choose_location(dt_ptr, size, kernel_sz);
 
-- 
2.17.2



[RFC PATCH 06/10] powerpc/fsl_booke/32: implement KASLR infrastructure

2019-07-17 Thread Jason Yan
This patch adds support for booting the kernel from places other than
KERNELBASE. Since CONFIG_RELOCATABLE is already supported, all we need
to do is map or copy the kernel to a proper place and relocate.
Freescale Book-E parts expect lowmem to be mapped by fixed TLB entries
(TLB1). The TLB1 entries are not suitable for mapping the kernel
directly in a randomized region, so we choose to copy the kernel to a
proper place and restart in order to relocate.

The offset of the kernel is not randomized yet (a fixed 64M offset is
used). We will randomize it in the next patch.

Signed-off-by: Jason Yan 
Cc: Diana Craciun 
Cc: Michael Ellerman 
Cc: Christophe Leroy 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Nicholas Piggin 
Cc: Kees Cook 
---
 arch/powerpc/Kconfig  | 11 +++
 arch/powerpc/kernel/Makefile  |  1 +
 arch/powerpc/kernel/early_32.c|  2 +-
 arch/powerpc/kernel/fsl_booke_entry_mapping.S | 13 ++-
 arch/powerpc/kernel/head_fsl_booke.S  | 15 +++-
 arch/powerpc/kernel/kaslr_booke.c | 83 +++
 arch/powerpc/mm/mmu_decl.h|  6 ++
 arch/powerpc/mm/nohash/fsl_booke.c|  7 +-
 8 files changed, 125 insertions(+), 13 deletions(-)
 create mode 100644 arch/powerpc/kernel/kaslr_booke.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index f516796dd819..3742df54bdc8 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -547,6 +547,17 @@ config RELOCATABLE
  setting can still be useful to bootwrappers that need to know the
  load address of the kernel (eg. u-boot/mkimage).
 
+config RANDOMIZE_BASE
+   bool "Randomize the address of the kernel image"
+   depends on (FSL_BOOKE && FLATMEM && PPC32)
+   select RELOCATABLE
+   help
+ Randomizes the virtual address at which the kernel image is
+ loaded, as a security feature that deters exploit attempts
+ relying on knowledge of the location of kernel internals.
+
+ If unsure, say N.
+
 config RELOCATABLE_TEST
bool "Test relocatable kernel"
depends on (PPC64 && RELOCATABLE)
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 56dfa7a2a6f2..cf87a0921db4 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -105,6 +105,7 @@ extra-$(CONFIG_PPC_8xx) := head_8xx.o
 extra-y+= vmlinux.lds
 
 obj-$(CONFIG_RELOCATABLE)  += reloc_$(BITS).o
+obj-$(CONFIG_RANDOMIZE_BASE)   += kaslr_booke.o
 
 obj-$(CONFIG_PPC32)+= entry_32.o setup_32.o early_32.o
 obj-$(CONFIG_PPC64)+= dma-iommu.o iommu.o
diff --git a/arch/powerpc/kernel/early_32.c b/arch/powerpc/kernel/early_32.c
index 3482118ffe76..fe8347cdc07d 100644
--- a/arch/powerpc/kernel/early_32.c
+++ b/arch/powerpc/kernel/early_32.c
@@ -32,5 +32,5 @@ notrace unsigned long __init early_init(unsigned long dt_ptr)
 
apply_feature_fixups();
 
-   return KERNELBASE + offset;
+   return kimage_vaddr + offset;
 }
diff --git a/arch/powerpc/kernel/fsl_booke_entry_mapping.S 
b/arch/powerpc/kernel/fsl_booke_entry_mapping.S
index de0980945510..6d2967673ac7 100644
--- a/arch/powerpc/kernel/fsl_booke_entry_mapping.S
+++ b/arch/powerpc/kernel/fsl_booke_entry_mapping.S
@@ -161,17 +161,16 @@ skpinv:   addir6,r6,1 /* 
Increment */
lis r6,(MAS1_VALID|MAS1_IPROT)@h
ori r6,r6,(MAS1_TSIZE(BOOK3E_PAGESZ_64M))@l
mtspr   SPRN_MAS1,r6
-   lis r6,MAS2_VAL(PAGE_OFFSET, BOOK3E_PAGESZ_64M, M_IF_NEEDED)@h
-   ori r6,r6,MAS2_VAL(PAGE_OFFSET, BOOK3E_PAGESZ_64M, M_IF_NEEDED)@l
-   mtspr   SPRN_MAS2,r6
+   lis r6,MAS2_EPN_MASK(BOOK3E_PAGESZ_64M)@h
+   ori r6,r6,MAS2_EPN_MASK(BOOK3E_PAGESZ_64M)@l
+   and r6,r6,r20
+   ori r6,r6,M_IF_NEEDED@l
+   mtspr   SPRN_MAS2,r6
mtspr   SPRN_MAS3,r8
tlbwe
 
 /* 7. Jump to KERNELBASE mapping */
-   lis r6,(KERNELBASE & ~0xfff)@h
-   ori r6,r6,(KERNELBASE & ~0xfff)@l
-   rlwinm  r7,r25,0,0x03ff
-   add r6,r7,r6
+   mr  r6,r20
 
 #elif defined(ENTRY_MAPPING_KEXEC_SETUP)
 /*
diff --git a/arch/powerpc/kernel/head_fsl_booke.S 
b/arch/powerpc/kernel/head_fsl_booke.S
index ce40f96dae20..d34933b0745a 100644
--- a/arch/powerpc/kernel/head_fsl_booke.S
+++ b/arch/powerpc/kernel/head_fsl_booke.S
@@ -155,6 +155,8 @@ _ENTRY(_start);
  */
 
 _ENTRY(__early_start)
+   LOAD_REG_ADDR_PIC(r20, kimage_vaddr)
+   lwz r20,0(r20)
 
 #define ENTRY_MAPPING_BOOT_SETUP
 #include "fsl_booke_entry_mapping.S"
@@ -277,8 +279,8 @@ set_ivor:
ori r6, r6, swapper_pg_dir@l
lis r5, abatron_pteptrs@h
ori r5, r5, abatron_pteptrs@l
-   lis r4, KERNELBASE@h
-   ori r4, r4, KERNELBASE@l
+   lis r3, kimage_vaddr@ha
+   lwz r4, kimage_vaddr@l(r3)
stw r5, 0(r4)   /* Save abatron_pteptrs at a fixed 

[RFC PATCH 08/10] powerpc/fsl_booke/kaslr: clear the original kernel if randomized

2019-07-17 Thread Jason Yan
The original copy of the kernel still exists in memory; clear it now.

Signed-off-by: Jason Yan 
Cc: Diana Craciun 
Cc: Michael Ellerman 
Cc: Christophe Leroy 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Nicholas Piggin 
Cc: Kees Cook 
---
 arch/powerpc/kernel/kaslr_booke.c  | 11 +++
 arch/powerpc/mm/mmu_decl.h |  2 ++
 arch/powerpc/mm/nohash/fsl_booke.c |  1 +
 3 files changed, 14 insertions(+)

diff --git a/arch/powerpc/kernel/kaslr_booke.c 
b/arch/powerpc/kernel/kaslr_booke.c
index 90357f4bd313..00339c05879f 100644
--- a/arch/powerpc/kernel/kaslr_booke.c
+++ b/arch/powerpc/kernel/kaslr_booke.c
@@ -412,3 +412,14 @@ notrace void __init kaslr_early_init(void *dt_ptr, 
phys_addr_t size)
 
reloc_kernel_entry(dt_ptr, kimage_vaddr);
 }
+
+void __init kaslr_second_init(void)
+{
+   /* If randomized, clear the original kernel */
+   if (kimage_vaddr != KERNELBASE) {
+   unsigned long kernel_sz;
+
+   kernel_sz = (unsigned long)_end - kimage_vaddr;
+   memset((void *)KERNELBASE, 0, kernel_sz);
+   }
+}
diff --git a/arch/powerpc/mm/mmu_decl.h b/arch/powerpc/mm/mmu_decl.h
index 754ae1e69f92..9912ee598f9b 100644
--- a/arch/powerpc/mm/mmu_decl.h
+++ b/arch/powerpc/mm/mmu_decl.h
@@ -150,8 +150,10 @@ extern void loadcam_multi(int first_idx, int num, int 
tmp_idx);
 
 #ifdef CONFIG_RANDOMIZE_BASE
 extern void kaslr_early_init(void *dt_ptr, phys_addr_t size);
+extern void kaslr_second_init(void);
 #else
 static inline void kaslr_early_init(void *dt_ptr, phys_addr_t size) {}
+static inline void kaslr_second_init(void) {}
 #endif
 
 struct tlbcam {
diff --git a/arch/powerpc/mm/nohash/fsl_booke.c 
b/arch/powerpc/mm/nohash/fsl_booke.c
index 8d25a8dc965f..fa5a87f5c08e 100644
--- a/arch/powerpc/mm/nohash/fsl_booke.c
+++ b/arch/powerpc/mm/nohash/fsl_booke.c
@@ -269,6 +269,7 @@ notrace void __init relocate_init(u64 dt_ptr, phys_addr_t 
start)
kernstart_addr = start;
if (is_second_reloc) {
virt_phys_offset = PAGE_OFFSET - memstart_addr;
+   kaslr_second_init();
return;
}
 
-- 
2.17.2



[RFC PATCH 03/10] powerpc: introduce kimage_vaddr to store the kernel base

2019-07-17 Thread Jason Yan
Now the kernel base is a fixed value - KERNELBASE. To support KASLR, we
need a variable to store the kernel base.

Signed-off-by: Jason Yan 
Cc: Diana Craciun 
Cc: Michael Ellerman 
Cc: Christophe Leroy 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Nicholas Piggin 
Cc: Kees Cook 
---
 arch/powerpc/include/asm/page.h | 2 ++
 arch/powerpc/mm/init-common.c   | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index 0d52f57fca04..60a68d3a54b1 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -315,6 +315,8 @@ void arch_free_page(struct page *page, int order);
 
 struct vm_area_struct;
 
+extern unsigned long kimage_vaddr;
+
 #include 
 #endif /* __ASSEMBLY__ */
 #include 
diff --git a/arch/powerpc/mm/init-common.c b/arch/powerpc/mm/init-common.c
index 9273c38009cb..c7a98c73e5c1 100644
--- a/arch/powerpc/mm/init-common.c
+++ b/arch/powerpc/mm/init-common.c
@@ -25,6 +25,8 @@ phys_addr_t memstart_addr = (phys_addr_t)~0ull;
 EXPORT_SYMBOL(memstart_addr);
 phys_addr_t kernstart_addr;
 EXPORT_SYMBOL(kernstart_addr);
+unsigned long kimage_vaddr = KERNELBASE;
+EXPORT_SYMBOL(kimage_vaddr);
 
 static bool disable_kuep = !IS_ENABLED(CONFIG_PPC_KUEP);
 static bool disable_kuap = !IS_ENABLED(CONFIG_PPC_KUAP);
-- 
2.17.2



[RFC PATCH 00/10] implement KASLR for powerpc/fsl_booke/32

2019-07-17 Thread Jason Yan
This series implements KASLR for powerpc/fsl_booke/32, as a security
feature that deters exploit attempts relying on knowledge of the location
of kernel internals.

Since CONFIG_RELOCATABLE is already supported, all we need to do is map
or copy the kernel to a proper place and relocate. Freescale Book-E
parts expect lowmem to be mapped by fixed TLB entries (TLB1). The TLB1
entries are not suitable for mapping the kernel directly in a randomized
region, so we choose to copy the kernel to a proper place and restart in
order to relocate.

Entropy is derived from the banner and the timer base, which change on
every build and boot. This is not particularly secure, so additionally
the bootloader may pass entropy via the /chosen/kaslr-seed node in the
device tree.

We use the first 512M of low memory to randomize the kernel image. The
memory is split into 64M zones. We use the lower 8 bits of the entropy
to select the index of the 64M zone, and then choose a 16K-aligned
offset inside that zone to put the kernel at.
KERNELBASE

|-->   64M   <--|
|   |
+---+++---+
|   |||kernel||   |
+---+++---+
| |
|->   offset<-|

  kimage_vaddr

We also check whether we would overlap with areas such as the dtb, the
initrd or the crashkernel region. If we cannot find a proper area,
KASLR is disabled and we boot from the original kernel location.

Jason Yan (10):
  powerpc: unify definition of M_IF_NEEDED
  powerpc: move memstart_addr and kernstart_addr to init-common.c
  powerpc: introduce kimage_vaddr to store the kernel base
  powerpc/fsl_booke/32: introduce create_tlb_entry() helper
  powerpc/fsl_booke/32: introduce reloc_kernel_entry() helper
  powerpc/fsl_booke/32: implement KASLR infrastructure
  powerpc/fsl_booke/32: randomize the kernel image offset
  powerpc/fsl_booke/kaslr: clear the original kernel if randomized
  powerpc/fsl_booke/kaslr: support nokaslr cmdline parameter
  powerpc/fsl_booke/kaslr: dump out kernel offset information on panic

 arch/powerpc/Kconfig  |  11 +
 arch/powerpc/include/asm/nohash/mmu-book3e.h  |  10 +
 arch/powerpc/include/asm/page.h   |   7 +
 arch/powerpc/kernel/Makefile  |   1 +
 arch/powerpc/kernel/early_32.c|   2 +-
 arch/powerpc/kernel/exceptions-64e.S  |  10 -
 arch/powerpc/kernel/fsl_booke_entry_mapping.S |  23 +-
 arch/powerpc/kernel/head_fsl_booke.S  |  61 ++-
 arch/powerpc/kernel/kaslr_booke.c | 439 ++
 arch/powerpc/kernel/machine_kexec.c   |   1 +
 arch/powerpc/kernel/misc_64.S |   5 -
 arch/powerpc/kernel/setup-common.c|  23 +
 arch/powerpc/mm/init-common.c |   7 +
 arch/powerpc/mm/init_32.c |   5 -
 arch/powerpc/mm/init_64.c |   5 -
 arch/powerpc/mm/mmu_decl.h|  10 +
 arch/powerpc/mm/nohash/fsl_booke.c|   8 +-
 17 files changed, 580 insertions(+), 48 deletions(-)
 create mode 100644 arch/powerpc/kernel/kaslr_booke.c

-- 
2.17.2



[RFC PATCH 02/10] powerpc: move memstart_addr and kernstart_addr to init-common.c

2019-07-17 Thread Jason Yan
These two variables are both defined in init_32.c and init_64.c. Move
them to init-common.c.

Signed-off-by: Jason Yan 
Cc: Diana Craciun 
Cc: Michael Ellerman 
Cc: Christophe Leroy 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Nicholas Piggin 
Cc: Kees Cook 
---
 arch/powerpc/mm/init-common.c | 5 +
 arch/powerpc/mm/init_32.c | 5 -
 arch/powerpc/mm/init_64.c | 5 -
 3 files changed, 5 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/mm/init-common.c b/arch/powerpc/mm/init-common.c
index a84da92920f7..9273c38009cb 100644
--- a/arch/powerpc/mm/init-common.c
+++ b/arch/powerpc/mm/init-common.c
@@ -21,6 +21,11 @@
 #include 
 #include 
 
+phys_addr_t memstart_addr = (phys_addr_t)~0ull;
+EXPORT_SYMBOL(memstart_addr);
+phys_addr_t kernstart_addr;
+EXPORT_SYMBOL(kernstart_addr);
+
 static bool disable_kuep = !IS_ENABLED(CONFIG_PPC_KUEP);
 static bool disable_kuap = !IS_ENABLED(CONFIG_PPC_KUAP);
 
diff --git a/arch/powerpc/mm/init_32.c b/arch/powerpc/mm/init_32.c
index b04896a88d79..872df48ae41b 100644
--- a/arch/powerpc/mm/init_32.c
+++ b/arch/powerpc/mm/init_32.c
@@ -56,11 +56,6 @@
 phys_addr_t total_memory;
 phys_addr_t total_lowmem;
 
-phys_addr_t memstart_addr = (phys_addr_t)~0ull;
-EXPORT_SYMBOL(memstart_addr);
-phys_addr_t kernstart_addr;
-EXPORT_SYMBOL(kernstart_addr);
-
 #ifdef CONFIG_RELOCATABLE
 /* Used in __va()/__pa() */
 long long virt_phys_offset;
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index a44f6281ca3a..c836f1269ee7 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -63,11 +63,6 @@
 
 #include 
 
-phys_addr_t memstart_addr = ~0;
-EXPORT_SYMBOL_GPL(memstart_addr);
-phys_addr_t kernstart_addr;
-EXPORT_SYMBOL_GPL(kernstart_addr);
-
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 /*
  * Given an address within the vmemmap, determine the pfn of the page that
-- 
2.17.2



[RFC PATCH 01/10] powerpc: unify definition of M_IF_NEEDED

2019-07-17 Thread Jason Yan
M_IF_NEEDED is defined too many times. Move it to a common place.

Signed-off-by: Jason Yan 
Cc: Diana Craciun 
Cc: Michael Ellerman 
Cc: Christophe Leroy 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Nicholas Piggin 
Cc: Kees Cook 
---
 arch/powerpc/include/asm/nohash/mmu-book3e.h  | 10 ++
 arch/powerpc/kernel/exceptions-64e.S  | 10 --
 arch/powerpc/kernel/fsl_booke_entry_mapping.S | 10 --
 arch/powerpc/kernel/misc_64.S |  5 -
 4 files changed, 10 insertions(+), 25 deletions(-)

diff --git a/arch/powerpc/include/asm/nohash/mmu-book3e.h 
b/arch/powerpc/include/asm/nohash/mmu-book3e.h
index 4c9777d256fb..0877362e48fa 100644
--- a/arch/powerpc/include/asm/nohash/mmu-book3e.h
+++ b/arch/powerpc/include/asm/nohash/mmu-book3e.h
@@ -221,6 +221,16 @@
 #define TLBILX_T_CLASS26
 #define TLBILX_T_CLASS37
 
+/*
+ * The mapping only needs to be cache-coherent on SMP, except on
+ * Freescale e500mc derivatives where it's also needed for coherent DMA.
+ */
+#if defined(CONFIG_SMP) || defined(CONFIG_PPC_E500MC)
+#define M_IF_NEEDEDMAS2_M
+#else
+#define M_IF_NEEDED0
+#endif
+
 #ifndef __ASSEMBLY__
 #include 
 
diff --git a/arch/powerpc/kernel/exceptions-64e.S 
b/arch/powerpc/kernel/exceptions-64e.S
index 1cfb3da4a84a..fd49ec07ce4a 100644
--- a/arch/powerpc/kernel/exceptions-64e.S
+++ b/arch/powerpc/kernel/exceptions-64e.S
@@ -1342,16 +1342,6 @@ skpinv:  addir6,r6,1 /* 
Increment */
sync
isync
 
-/*
- * The mapping only needs to be cache-coherent on SMP, except on
- * Freescale e500mc derivatives where it's also needed for coherent DMA.
- */
-#if defined(CONFIG_SMP) || defined(CONFIG_PPC_E500MC)
-#define M_IF_NEEDEDMAS2_M
-#else
-#define M_IF_NEEDED0
-#endif
-
 /* 6. Setup KERNELBASE mapping in TLB[0]
  *
  * r3 = MAS0 w/TLBSEL & ESEL for the entry we started in
diff --git a/arch/powerpc/kernel/fsl_booke_entry_mapping.S 
b/arch/powerpc/kernel/fsl_booke_entry_mapping.S
index ea065282b303..de0980945510 100644
--- a/arch/powerpc/kernel/fsl_booke_entry_mapping.S
+++ b/arch/powerpc/kernel/fsl_booke_entry_mapping.S
@@ -153,16 +153,6 @@ skpinv:addir6,r6,1 /* 
Increment */
tlbivax 0,r9
TLBSYNC
 
-/*
- * The mapping only needs to be cache-coherent on SMP, except on
- * Freescale e500mc derivatives where it's also needed for coherent DMA.
- */
-#if defined(CONFIG_SMP) || defined(CONFIG_PPC_E500MC)
-#define M_IF_NEEDEDMAS2_M
-#else
-#define M_IF_NEEDED0
-#endif
-
 #if defined(ENTRY_MAPPING_BOOT_SETUP)
 
 /* 6. Setup KERNELBASE mapping in TLB1[0] */
diff --git a/arch/powerpc/kernel/misc_64.S b/arch/powerpc/kernel/misc_64.S
index b55a7b4cb543..26074f92d4bc 100644
--- a/arch/powerpc/kernel/misc_64.S
+++ b/arch/powerpc/kernel/misc_64.S
@@ -432,11 +432,6 @@ kexec_create_tlb:
rlwimi  r9,r10,16,4,15  /* Setup MAS0 = TLBSEL | ESEL(r9) */
 
 /* Set up a temp identity mapping v:0 to p:0 and return to it. */
-#if defined(CONFIG_SMP) || defined(CONFIG_PPC_E500MC)
-#define M_IF_NEEDEDMAS2_M
-#else
-#define M_IF_NEEDED0
-#endif
mtspr   SPRN_MAS0,r9
 
lis r9,(MAS1_VALID|MAS1_IPROT)@h
-- 
2.17.2