Re: rename probe_kernel_* and probe_user_*

2020-06-18 Thread Michael Ellerman
Linus Torvalds  writes:
> [ Explicitly added architecture lists and developers to the cc to make
> this more visible ]
>
> On Wed, Jun 17, 2020 at 12:38 AM Christoph Hellwig  wrote:
>>
>> Andrew and I decided to drop the patches implementing your suggested
>> rename of the probe_kernel_* and probe_user_* helpers from -mm as there
>> were way too many conflicts.  After -rc1 might be a good time for this as
>> all the conflicts are resolved now.
>
> So I've merged this renaming now, together with my changes to make
> 'get_kernel_nofault()' look and act a lot more like 'get_user()'.
>
> It just felt wrong (and potentially dangerous) to me to have a
> 'get_kernel_nofault()' naming that implied semantics that we're all
> familiar with from 'get_user()', but acting very differently.
>
> But part of the fixups I made for the type checking are for
> architectures where I didn't even compile-test the end result. I
> looked at every case individually, and the patch looks sane, but I
> could have screwed something up.
>
> Basically, 'get_kernel_nofault()' doesn't do the same automagic type
> munging from the pointer to the target that 'get_user()' does, but at
> least now it checks that the types are superficially compatible.
> There should be build failures if they aren't, but I hopefully fixed
> everything up properly for all architectures.
>
> This email is partly to ask people to double-check, but partly just as
> a heads-up so that _if_ I screwed something up, you'll have the
> background and it won't take you by surprise.

The powerpc changes look right, compile cleanly and seem to work
correctly.

cheers


Re: powerpc/pci: [PATCH 1/1 V3] PCIE PHB reset

2020-06-18 Thread Oliver O'Halloran
On Wed, Jun 17, 2020 at 4:29 PM Michael Ellerman  wrote:
>
> "Oliver O'Halloran"  writes:
> > On Tue, Jun 16, 2020 at 9:55 PM Michael Ellerman wrote:
> >> wenxi...@linux.vnet.ibm.com writes:
> >> > From: Wen Xiong 
> >> >
> >> > Several device drivers hit EEH (Extended Error Handling) when triggering
> >> > kdump on pseries PowerVM. This patch implements a reset of the PHBs
> >> > in generic PCI code when triggering kdump.
> >>
> >> Actually it's in pseries specific PCI code, and the reset is done in the
> >> 2nd kernel as it boots, not when triggering the kdump.
> >>
> >> You're doing it as a:
> >>
> >>   machine_postcore_initcall(pseries, pseries_phb_reset);
> >>
> >> But we do the EEH initialisation in:
> >>
> >>   core_initcall_sync(eeh_init);
> >>
> >> Which happens first.
> >>
> >> So it seems to me that this should be called from pseries_eeh_init().
> >
> > This happens to use some of the same RTAS calls as EEH, but it's
> > entirely orthogonal to it.
>
> I don't agree. I mean it's literally calling EEH_RESET_FUNDAMENTAL etc.
> Those RTAS calls are all documented in the EEH section of PAPR.
>
> I guess you're saying it's orthogonal to the kernel handling an EEH and
> doing the recovery process etc, which I can kind of see.
>
> > Wedging the two together doesn't make any real sense IMO since this
> > should be usable even with !CONFIG_EEH.
>
> You can't turn CONFIG_EEH off for pseries or powernv.

Not yet :)

> And if you could this patch wouldn't compile because it uses EEH
> constants that are behind #ifdef CONFIG_EEH.

That's fixable.

> If you could turn CONFIG_EEH off it would presumably be because you were
> on a platform that didn't support EEH, in which case you wouldn't need
> this code.

I think there's an argument to be made for disabling EEH in some
situations. A lot of drivers do a pretty poor job of recovering in the
first place, so it's conceivable that someone might want to disable it
in, say, a kdump kernel. That said, the real reason is mostly for the
sake of code organisation. EEH is an optional platform feature, but you
wouldn't know it looking at the implementation, and I'd like to stop it
bleeding into odd places. Making it buildable with !CONFIG_EEH would
probably help.

> So IMO this is EEH code, and should be with the other EEH code and
> should be behind CONFIG_EEH.

*shrug*

I wanted it to follow the model of the powernv implementation of the
same feature which is done immediately after initialising the
pci_controller and independent of all of the EEH setup. Although,
looking at it again I see it calls pnv_eeh_phb_reset() which is in
eeh_powernv.c so I guess that's pretty similar to what you're
suggesting.

> That sounds like a good cleanup. I'm not concerned about conflicts
> within arch/powerpc, I can fix them up.
>
> >> > + list_for_each_entry(phb, &hose_list, list_node) {
> >> > + config_addr = pseries_get_pdn_addr(phb);
> >> > + if (config_addr == -1)
> >> > + continue;
> >> > +
> >> > + ret = rtas_call(ibm_set_slot_reset, 4, 1, NULL,
> >> > + config_addr, BUID_HI(phb->buid),
> >> > + BUID_LO(phb->buid), EEH_RESET_FUNDAMENTAL);
> >> > +
> >> > + /* If fundamental-reset not supported, try hot-reset */
> >> > + if (ret == -8)
> >>
> >> Where does -8 come from?
> >
> > There's a comment right there.
>
> Yeah I guess. I was expecting it would map to some RTAS_ERROR_FOO value,
> but it's just literally -8 in PAPR.

Yeah, as far as I can tell the meaning of the return codes is
specific to each RTAS call, which is a bit unfortunate.
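
For reference, the fallback that patch implements around that status looks
roughly like this (a sketch assembled from the hunk quoted above, not the
final code):

	/* PAPR's ibm,set-slot-reset returns the literal status -8 when
	 * fundamental reset is unsupported; it maps to no generic
	 * RTAS_ERROR_* constant. Fall back to hot reset in that case.
	 */
	ret = rtas_call(ibm_set_slot_reset, 4, 1, NULL, config_addr,
			BUID_HI(phb->buid), BUID_LO(phb->buid),
			EEH_RESET_FUNDAMENTAL);
	if (ret == -8)	/* fundamental-reset not supported, try hot-reset */
		ret = rtas_call(ibm_set_slot_reset, 4, 1, NULL, config_addr,
				BUID_HI(phb->buid), BUID_LO(phb->buid),
				EEH_RESET_HOT);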


Re: [PATCH] ASoC: fsl_spdif: Add pm runtime function

2020-06-18 Thread Nicolin Chen
On Thu, Jun 18, 2020 at 07:55:34PM +0800, Shengjiu Wang wrote:
> Add pm runtime support and move clock handling there.
> Close the clocks at suspend to reduce the power consumption.
> 
> fsl_spdif_suspend is replaced by pm_runtime_force_suspend.
> fsl_spdif_resume is replaced by pm_runtime_force_resume.
> 
> Signed-off-by: Shengjiu Wang 

LGTM, just some nits; please add my ack after fixing:

Acked-by: Nicolin Chen 

> @@ -495,25 +496,10 @@ static int fsl_spdif_startup(struct snd_pcm_substream *substream,

>  
> -disable_txclk:
> - for (i--; i >= 0; i--)
> - clk_disable_unprepare(spdif_priv->txclk[i]);
>  err:
> - if (!IS_ERR(spdif_priv->spbaclk))
> - clk_disable_unprepare(spdif_priv->spbaclk);
> -err_spbaclk:
> - clk_disable_unprepare(spdif_priv->coreclk);
> -
>   return ret;

Only "return ret;" remains now. We could clean the goto away.

> -static int fsl_spdif_resume(struct device *dev)
> +static int fsl_spdif_runtime_resume(struct device *dev)

> +disable_rx_clk:
> + clk_disable_unprepare(spdif_priv->rxclk);
> +disable_tx_clk:
> +disable_spba_clk:

Why have two duplicated ones? Could probably drop the 2nd one.


[PATCH 4/4] powerpc/pseries/iommu: Remove default DMA window before creating DDW

2020-06-18 Thread Leonardo Bras
On LoPAR "DMA Window Manipulation Calls", it's recommended to remove the
default DMA window for the device, before attempting to configure a DDW,
in order to make the maximum resources available for the next DDW to be
created.

This is a requirement for some devices to use DDW, given they only
allow one DMA window.

If setting up a new DDW fails anywhere after the removal of this
default DMA window, restore it using reset_dma_window.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/platforms/pseries/iommu.c | 20 +---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index de633f6ae093..68d1ea957ac7 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1074,8 +1074,9 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
u64 dma_addr, max_addr;
struct device_node *dn;
u32 ddw_avail[3];
+
struct direct_window *window;
-   struct property *win64;
+   struct property *win64, *dfl_win;
struct dynamic_dma_window_prop *ddwprop;
struct failed_ddw_pdn *fpdn;
 
@@ -1110,8 +1111,19 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
if (ret)
goto out_failed;
 
-   /*
-* Query if there is a second window of size to map the
+   /*
+* First step of setting up DDW is removing the default DMA window,
+* if it's present. It will make all the resources available to the
+* new DDW window.
+* If anything fails after this, we need to restore it.
+*/
+
+   dfl_win = of_find_property(pdn, "ibm,dma-window", NULL);
+   if (dfl_win)
+   remove_dma_window(pdn, ddw_avail, dfl_win);
+
+   /*
+* Query if there is a window of size to map the
 * whole partition.  Query returns number of windows, largest
 * block assigned to PE (partition endpoint), and two bitmasks
 * of page sizes: supported and supported for migrate-dma.
@@ -1219,6 +1231,8 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
kfree(win64);
 
 out_failed:
+   if (dfl_win)
+   reset_dma_window(dev, pdn);
 
fpdn = kzalloc(sizeof(*fpdn), GFP_KERNEL);
if (!fpdn)
-- 
2.25.4



[PATCH 3/4] powerpc/pseries/iommu: Move window-removing part of remove_ddw into remove_dma_window

2020-06-18 Thread Leonardo Bras
Move the window-removing part of remove_ddw into a new function
(remove_dma_window), so it can be used to remove other DMA windows.

It's useful for removing DMA windows that don't use the DIRECT64_PROPNAME
property, such as the device's default DMA window, which uses
"ibm,dma-window".

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/platforms/pseries/iommu.c | 53 +++---
 1 file changed, 31 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 5e1fbc176a37..de633f6ae093 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -767,25 +767,14 @@ static int __init disable_ddw_setup(char *str)
 
 early_param("disable_ddw", disable_ddw_setup);
 
-static void remove_ddw(struct device_node *np, bool remove_prop)
+static void remove_dma_window(struct device_node *pdn, u32 *ddw_avail,
+ struct property *win)
 {
struct dynamic_dma_window_prop *dwp;
-   struct property *win64;
-   u32 ddw_avail[3];
u64 liobn;
-   int ret = 0;
-
-   ret = of_property_read_u32_array(np, "ibm,ddw-applicable",
-&ddw_avail[0], 3);
-
-   win64 = of_find_property(np, DIRECT64_PROPNAME, NULL);
-   if (!win64)
-   return;
-
-   if (ret || win64->length < sizeof(*dwp))
-   goto delprop;
+   int ret;
 
-   dwp = win64->value;
+   dwp = win->value;
liobn = (u64)be32_to_cpu(dwp->liobn);
 
/* clear the whole window, note the arg is in kernel pages */
@@ -793,24 +782,44 @@ static void remove_ddw(struct device_node *np, bool remove_prop)
1ULL << (be32_to_cpu(dwp->window_shift) - PAGE_SHIFT), dwp);
if (ret)
pr_warn("%pOF failed to clear tces in window.\n",
-   np);
+   pdn);
else
pr_debug("%pOF successfully cleared tces in window.\n",
-np);
+pdn);
 
ret = rtas_call(ddw_avail[2], 1, 1, NULL, liobn);
if (ret)
pr_warn("%pOF: failed to remove direct window: rtas returned "
"%d to ibm,remove-pe-dma-window(%x) %llx\n",
-   np, ret, ddw_avail[2], liobn);
+   pdn, ret, ddw_avail[2], liobn);
else
pr_debug("%pOF: successfully removed direct window: rtas 
returned "
"%d to ibm,remove-pe-dma-window(%x) %llx\n",
-   np, ret, ddw_avail[2], liobn);
+   pdn, ret, ddw_avail[2], liobn);
+}
+
+static void remove_ddw(struct device_node *np, bool remove_prop)
+{
+   struct property *win;
+   u32 ddw_avail[3];
+   int ret = 0;
+
+   ret = of_property_read_u32_array(np, "ibm,ddw-applicable",
+&ddw_avail[0], 3);
+   if (ret)
+   return;
+
+   win = of_find_property(np, DIRECT64_PROPNAME, NULL);
+   if (!win)
+   return;
+
+   if (win->length >= sizeof(struct dynamic_dma_window_prop))
+   remove_dma_window(np, ddw_avail, win);
+
+   if (!remove_prop)
+   return;
 
-delprop:
-   if (remove_prop)
-   ret = of_remove_property(np, win64);
+   ret = of_remove_property(np, win);
if (ret)
pr_warn("%pOF: failed to remove direct window property: %d\n",
np, ret);
-- 
2.25.4



[PATCH 2/4] powerpc/pseries/iommu: Implement ibm, reset-pe-dma-windows rtas call

2020-06-18 Thread Leonardo Bras
Platforms supporting the DDW option starting with LoPAR level 2.7 implement
ibm,ddw-extensions. The first extension available (index 2) carries the
token for the ibm,reset-pe-dma-windows RTAS call, which is used to restore
the default DMA window for a device if it has been deleted.

It does so by resetting the TCE table allocation for the PE to its
boot-time value, available in the "ibm,dma-window" device tree node.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/platforms/pseries/iommu.c | 33 ++
 1 file changed, 33 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index e5a617738c8b..5e1fbc176a37 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1012,6 +1012,39 @@ static phys_addr_t ddw_memory_hotplug_max(void)
return max_addr;
 }
 
+/*
+ * Platforms supporting the DDW option starting with LoPAR level 2.7 implement
+ * ibm,ddw-extensions, which carries the rtas token for
+ * ibm,reset-pe-dma-windows.
+ * That rtas-call can be used to restore the default DMA window for the device.
+ */
+static void reset_dma_window(struct pci_dev *dev, struct device_node *par_dn)
+{
+   int ret;
+   u32 cfg_addr, ddw_ext[3];
+   u64 buid;
+   struct device_node *dn;
+   struct pci_dn *pdn;
+
+   ret = of_property_read_u32_array(par_dn, "ibm,ddw-extensions",
+&ddw_ext[0], 3);
+   if (ret)
+   return;
+
+   dn = pci_device_to_OF_node(dev);
+   pdn = PCI_DN(dn);
+   buid = pdn->phb->buid;
+   cfg_addr = ((pdn->busno << 16) | (pdn->devfn << 8));
+
+   ret = rtas_call(ddw_ext[1], 3, 1, NULL, cfg_addr,
+   BUID_HI(buid), BUID_LO(buid));
+   if (ret)
+   dev_info(&dev->dev,
+"ibm,reset-pe-dma-windows(%x) %x %x %x returned %d ",
+ddw_ext[1], cfg_addr, BUID_HI(buid), BUID_LO(buid),
+ret);
+}
+
 /*
  * If the PE supports dynamic dma windows, and there is space for a table
  * that can map all pages in a linear offset, then setup such a table,
-- 
2.25.4



[PATCH 1/4] powerpc/pseries/iommu: Update call to ibm, query-pe-dma-windows

2020-06-18 Thread Leonardo Bras
From LoPAR level 2.8, "ibm,ddw-extensions" index 3 can make the number of
outputs from "ibm,query-pe-dma-windows" go from 5 to 6.

This change in output size is meant to expand the address size of the
largest_available_block PE TCE from 32-bit to 64-bit, which ends up
shifting the page_size and migration_capable outputs.

This requires updating ddw_query_response->largest_available_block from
u32 to u64, and manually assigning the values from the buffer into this
struct according to the output size.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/platforms/pseries/iommu.c | 57 +-
 1 file changed, 46 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 6d47b4a3ce39..e5a617738c8b 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -334,7 +334,7 @@ struct direct_window {
 /* Dynamic DMA Window support */
 struct ddw_query_response {
u32 windows_available;
-   u32 largest_available_block;
+   u64 largest_available_block;
u32 page_size;
u32 migration_capable;
 };
@@ -869,14 +869,32 @@ static int find_existing_ddw_windows(void)
 }
 machine_arch_initcall(pseries, find_existing_ddw_windows);
 
+/*
+ * From LoPAR level 2.8, "ibm,ddw-extensions" index 3 can rule how many output
+ * parameters ibm,query-pe-dma-windows will have, ranging from 5 to 6.
+ */
+
+static int query_ddw_out_sz(struct device_node *par_dn)
+{
+   int ret;
+   u32 ddw_ext[3];
+
+   ret = of_property_read_u32_array(par_dn, "ibm,ddw-extensions",
+&ddw_ext[0], 3);
+   if (ret || ddw_ext[0] < 2 || ddw_ext[2] != 1)
+   return 5;
+   return 6;
+}
+
 static int query_ddw(struct pci_dev *dev, const u32 *ddw_avail,
-   struct ddw_query_response *query)
+struct ddw_query_response *query,
+struct device_node *par_dn)
 {
struct device_node *dn;
struct pci_dn *pdn;
-   u32 cfg_addr;
+   u32 cfg_addr, query_out[5];
u64 buid;
-   int ret;
+   int ret, out_sz;
 
/*
 * Get the config address and phb buid of the PE window.
@@ -888,12 +906,29 @@ static int query_ddw(struct pci_dev *dev, const u32 *ddw_avail,
pdn = PCI_DN(dn);
buid = pdn->phb->buid;
cfg_addr = ((pdn->busno << 16) | (pdn->devfn << 8));
+   out_sz = query_ddw_out_sz(par_dn);
+
+   ret = rtas_call(ddw_avail[0], 3, out_sz, query_out,
+   cfg_addr, BUID_HI(buid), BUID_LO(buid));
+   dev_info(&dev->dev, "ibm,query-pe-dma-windows(%x) %x %x %x returned %d\n",
+ddw_avail[0], cfg_addr, BUID_HI(buid), BUID_LO(buid), ret);
+
+   switch (out_sz) {
+   case 5:
+   query->windows_available = query_out[0];
+   query->largest_available_block = query_out[1];
+   query->page_size = query_out[2];
+   query->migration_capable = query_out[3];
+   break;
+   case 6:
+   query->windows_available = query_out[0];
+   query->largest_available_block = ((u64)query_out[1] << 32) |
+query_out[2];
+   query->page_size = query_out[3];
+   query->migration_capable = query_out[4];
+   break;
+   }
 
-   ret = rtas_call(ddw_avail[0], 3, 5, (u32 *)query,
- cfg_addr, BUID_HI(buid), BUID_LO(buid));
-   dev_info(&dev->dev, "ibm,query-pe-dma-windows(%x) %x %x %x"
-   " returned %d\n", ddw_avail[0], cfg_addr, BUID_HI(buid),
-   BUID_LO(buid), ret);
return ret;
 }
 
@@ -1040,7 +1075,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 * of page sizes: supported and supported for migrate-dma.
 */
dn = pci_device_to_OF_node(dev);
-   ret = query_ddw(dev, ddw_avail, &query);
+   ret = query_ddw(dev, ddw_avail, &query, pdn);
if (ret != 0)
goto out_failed;
 
@@ -1068,7 +1103,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
/* check largest block * page size > max memory hotplug addr */
max_addr = ddw_memory_hotplug_max();
if (query.largest_available_block < (max_addr >> page_shift)) {
-   dev_dbg(&dev->dev, "can't map partition max 0x%llx with %u "
+   dev_dbg(&dev->dev, "can't map partition max 0x%llx with %llu "
  "%llu-sized pages\n", max_addr,  
query.largest_available_block,
  1ULL << page_shift);
goto out_failed;
-- 
2.25.4



[PATCH 0/4] Remove default DMA window before creating DDW

2020-06-18 Thread Leonardo Bras
There are some devices that only allow 1 DMA window to exist at a time,
and in those cases, a DDW is never created for them, since the default
DMA window keeps using this resource.

LoPAR recommends this procedure:
1. Remove the default DMA window,
2. Query for which configs the DDW can be created,
3. Create a DDW.

Patch #1:
- After LoPAR level 2.8, there is an extension that can make
  ibm,query-pe-dma-windows to have 6 outputs instead of 5. This changes the
  order of the outputs, and that can cause some trouble. 
- query_ddw() was updated to check how many outputs the 
  ibm,query-pe-dma-windows is supposed to have, update the rtas_call() and
  deal correctly with the outputs in both cases.
- This patch looks somewhat unrelated to the series, but it can avoid future
  problems on DDW creation.

Patch #2 implements a new rtas call to recover the default DMA window,
in case anything fails after it was removed, and a DDW couldn't be created.

Patch #3 moves the window-removing code from remove_ddw() to
remove_dma_window(), creating a way to delete any DMA window, so it can be
used to delete the default DMA window.

Patch #4 makes use of the remove_dma_window() from patch #3 to remove the
default DMA window before query_ddw() and the rtas call from patch #2
to recover it if something goes wrong.
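
Taken together, the patches make enable_ddw() behave roughly like this
(a simplified sketch of the flow based on the patches below, not the
literal code):

	dfl_win = of_find_property(pdn, "ibm,dma-window", NULL);
	if (dfl_win)
		remove_dma_window(pdn, ddw_avail, dfl_win);	/* patches #3/#4 */

	ret = query_ddw(dev, ddw_avail, &query, pdn);		/* patch #1 */
	if (ret)
		goto out_failed;

	/* ... create and configure the new DDW ... */
	return dma_addr;

out_failed:
	if (dfl_win)
		reset_dma_window(dev, pdn);			/* patch #2 */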

All patches were tested on an LPAR with an Ethernet VF:
4005:01:00.0 Ethernet controller: Mellanox Technologies MT27700 Family
[ConnectX-4 Virtual Function]

Leonardo Bras (4):
  powerpc/pseries/iommu: Update call to ibm,query-pe-dma-windows
  powerpc/pseries/iommu: Implement ibm,reset-pe-dma-windows rtas call
  powerpc/pseries/iommu: Move window-removing part of remove_ddw into
remove_dma_window
  powerpc/pseries/iommu: Remove default DMA window before creating DDW

 arch/powerpc/platforms/pseries/iommu.c | 163 +++--
 1 file changed, 127 insertions(+), 36 deletions(-)

-- 
2.25.4



Re: [PATCH 2/2] powerpc/hv-24x7: Add sysfs files inside hv-24x7 device to show cpumask

2020-06-18 Thread Gautham R Shenoy
On Thu, Jun 18, 2020 at 05:57:13PM +0530, Kajol Jain wrote:
> This patch adds a cpumask attr to the hv_24x7 pmu along with ABI documentation.
> 
> command:# cat /sys/devices/hv_24x7/cpumask
> 0
> 
> Signed-off-by: Kajol Jain 
> ---
>  .../sysfs-bus-event_source-devices-hv_24x7|  6 
>  arch/powerpc/perf/hv-24x7.c   | 31 ++-
>  2 files changed, 36 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7 b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7
> index e8698afcd952..281e7b367733 100644
> --- a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7
> +++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7
> @@ -43,6 +43,12 @@ Description:   read only
>   This sysfs interface exposes the number of cores per chip
>   present in the system.
> 
> +What:		/sys/devices/hv_24x7/cpumask
> +Date:		June 2020
> +Contact:	Linux on PowerPC Developer List
> +Description:	read only
> +		This sysfs file exposes the cpumask.

Could you please describe in a little more detail what the
cpumask is?

> +
>  What:		/sys/bus/event_source/devices/hv_24x7/event_descs/
>  Date:		February 2014
>  Contact:	Linux on PowerPC Developer List
> diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c
> index fdc4ae155d60..03d870a9fc36 100644
> --- a/arch/powerpc/perf/hv-24x7.c
> +++ b/arch/powerpc/perf/hv-24x7.c
> @@ -448,6 +448,12 @@ static ssize_t device_show_string(struct device *dev,
>   return sprintf(buf, "%s\n", (char *)d->var);
>  }
> 
> +static ssize_t cpumask_get_attr(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + return cpumap_print_to_pagebuf(true, buf, &hv_24x7_cpumask);
> +}
> +
>  static ssize_t sockets_show(struct device *dev,
>   struct device_attribute *attr, char *buf)
>  {
> @@ -1116,6 +1122,17 @@ static DEVICE_ATTR_RO(sockets);
>  static DEVICE_ATTR_RO(chipspersocket);
>  static DEVICE_ATTR_RO(coresperchip);
> 
> +static DEVICE_ATTR(cpumask, S_IRUGO, cpumask_get_attr, NULL);
> +
> +static struct attribute *cpumask_attrs[] = {
> + &dev_attr_cpumask.attr,
> + NULL,
> +};
> +
> +static struct attribute_group cpumask_attr_group = {
> + .attrs = cpumask_attrs,
> +};
> +
>  static struct bin_attribute *if_bin_attrs[] = {
>   &bin_attr_catalog,
>   NULL,
> @@ -1143,6 +1160,11 @@ static const struct attribute_group *attr_groups[] = {
>   &event_desc_group,
>   &event_long_desc_group,
>   &if_group,
> + /*
> +  * This NULL is a placeholder for the cpumask attr, which will be updated
> +  * only if cpuhotplug registration is successful
> +  */
> + NULL,
>   NULL,
>  };
> 
> @@ -1727,8 +1749,15 @@ static int hv_24x7_init(void)
> 
>   /* init cpuhotplug */
>   r = hv_24x7_cpu_hotplug_init();
> - if (r)
> + if (r) {
>   pr_err("hv_24x7: CPU hotplug init failed\n");
> + } else {
> + /*
> +  * Cpu hotplug init is successful, add the
> +  * cpumask file as part of pmu attr group
> +  */
> + attr_groups[5] = &cpumask_attr_group;

Since this is only a one-time initialization, wouldn't it be safer to
iterate through attr_groups[] and assign cpumask_attr_group to the
first NULL location?
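
Something like this, perhaps (sketch, assuming attr_groups[] keeps a spare
slot before its trailing NULL terminator as in the patch):

	/* Assign to the first free slot instead of hard-coding index 5. */
	int i;

	for (i = 0; attr_groups[i]; i++)
		;
	attr_groups[i] = &cpumask_attr_group;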

> + }
> 
>   r = perf_pmu_register(&h_24x7_pmu, h_24x7_pmu.name, -1);
>   if (r)
> -- 
> 2.18.2
> 


Re: [PATCH 1/2] powerpc/perf/hv-24x7: Add cpu hotplug support

2020-06-18 Thread Gautham R Shenoy
Hello Kajol,

On Thu, Jun 18, 2020 at 05:57:12PM +0530, Kajol Jain wrote:
> This patch adds cpu hotplug functions to the hv_24x7 pmu.
> A new cpuhp_state "CPUHP_AP_PERF_POWERPC_HV_24x7_ONLINE" enum
> is added.
> 
> The online function updates the cpumask only if it is empty, as the
> primary intention of adding hotplug support is to designate a CPU to
> make the HCALL to collect the count data.
> 
> The offline function tests and clears the corresponding cpu in the
> cpumask and updates the cpumask to any other active cpu.
> 
> With this patchset, perf tool side does not need "-C "
> to be added.
> 
> Signed-off-by: Kajol Jain 
> ---
>  arch/powerpc/perf/hv-24x7.c | 45 +
>  include/linux/cpuhotplug.h  |  1 +
>  2 files changed, 46 insertions(+)
> 
> diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c
> index db213eb7cb02..fdc4ae155d60 100644
> --- a/arch/powerpc/perf/hv-24x7.c
> +++ b/arch/powerpc/perf/hv-24x7.c
> @@ -31,6 +31,8 @@ static int interface_version;
>  /* Whether we have to aggregate result data for some domains. */
>  static bool aggregate_result_elements;
> 
> +static cpumask_t hv_24x7_cpumask;
> +
>  static bool domain_is_valid(unsigned domain)
>  {
>   switch (domain) {
> @@ -1641,6 +1643,44 @@ static struct pmu h_24x7_pmu = {
>   .capabilities = PERF_PMU_CAP_NO_EXCLUDE,
>  };
> 
> +static int ppc_hv_24x7_cpu_online(unsigned int cpu)
> +{
> + /* Make this CPU the designated target for counter collection */
> + if (cpumask_empty(&hv_24x7_cpumask))
> + cpumask_set_cpu(cpu, &hv_24x7_cpumask);
> +
> + return 0;
> +}
> +
> +static int ppc_hv_24x7_cpu_offline(unsigned int cpu)
> +{
> + int target = -1;
> +
> + /* Check if exiting cpu is used for collecting 24x7 events */
> + if (!cpumask_test_and_clear_cpu(cpu, &hv_24x7_cpumask))
> + return 0;
> +
> + /* Find a new cpu to collect 24x7 events */
> + target = cpumask_any_but(cpu_active_mask, cpu);

cpumask_any_but() typically picks the first CPU in cpu_active_mask
that is not @cpu.


> +
> + if (target < 0 || target >= nr_cpu_ids)
> + return -1;
> +
> + /* Migrate 24x7 events to the new target */
> + cpumask_set_cpu(target, &hv_24x7_cpumask);
> + perf_pmu_migrate_context(&h_24x7_pmu, cpu, target);


On a system with N CPUs numbered [0..N-1], can you please verify whether
the time required to sequentially offline CPUs [0..N-2], in that
order, increases with this patch?

I am asking this because we have encountered this problem once before
at a customer site and the commit 9c9f8fb71fee ("powerpc/perf: Use
cpumask_last() to determine the designated cpu for nest/core units.")
was introduced to fix that problem.
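
With that commit's approach, the offline path would prefer the
highest-numbered active CPU instead, e.g. (sketch):

	/* cpumask_last() keeps sequential offlining of CPUs 0..N-2 from
	 * migrating the PMU context on every step.
	 */
	target = cpumask_last(cpu_active_mask);
	if (target >= nr_cpu_ids || target == cpu)
		return -1;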

> +
> + return 0;
> +}
> +
> +static int hv_24x7_cpu_hotplug_init(void)
> +{
> + return cpuhp_setup_state(CPUHP_AP_PERF_POWERPC_HV_24x7_ONLINE,
> +   "perf/powerpc/hv_24x7:online",
> +   ppc_hv_24x7_cpu_online,
> +   ppc_hv_24x7_cpu_offline);
> +}
> +
>  static int hv_24x7_init(void)
>  {
>   int r;
> @@ -1685,6 +1725,11 @@ static int hv_24x7_init(void)
>   if (r)
>   return r;
> 
> + /* init cpuhotplug */
> + r = hv_24x7_cpu_hotplug_init();
> + if (r)
> + pr_err("hv_24x7: CPU hotplug init failed\n");
> +
>   r = perf_pmu_register(&h_24x7_pmu, h_24x7_pmu.name, -1);
>   if (r)
>   return r;
> diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
> index 8377afef8806..16ed8f6f8774 100644
> --- a/include/linux/cpuhotplug.h
> +++ b/include/linux/cpuhotplug.h
> @@ -180,6 +180,7 @@ enum cpuhp_state {
>   CPUHP_AP_PERF_POWERPC_CORE_IMC_ONLINE,
>   CPUHP_AP_PERF_POWERPC_THREAD_IMC_ONLINE,
>   CPUHP_AP_PERF_POWERPC_TRACE_IMC_ONLINE,
> + CPUHP_AP_PERF_POWERPC_HV_24x7_ONLINE,
>   CPUHP_AP_WATCHDOG_ONLINE,
>   CPUHP_AP_WORKQUEUE_ONLINE,
>   CPUHP_AP_RCUTREE_ONLINE,
> -- 
> 2.18.2
> 


Re: [PATCHv2] tpm: ibmvtpm: Wait for ready buffer before probing for TPM2 attributes

2020-06-18 Thread Jerry Snitselaar

On Fri, Jun 19 2020, David Gibson wrote:

The tpm2_get_cc_attrs_tbl() call will result in TPM commands being issued,
which will need the use of the internal command/response buffer.  But,
we're issuing this *before* we've waited to make sure that buffer is
allocated.

This can result in intermittent failures to probe if the hypervisor / TPM
implementation doesn't respond quickly enough.  I find it fails almost
every time with an 8 vcpu guest under KVM with software emulated TPM.

To fix it, just move the tpm2_get_cc_attrs_tbl() call after the
existing code to wait for initialization, which will ensure the buffer
is allocated.

Fixes: 18b3670d79ae9 ("tpm: ibmvtpm: Add support for TPM2")
Signed-off-by: David Gibson 
---


Reviewed-by: Jerry Snitselaar 



Changes from v1:
* Fixed a formatting error in the commit message
* Added some more detail to the commit message

drivers/char/tpm/tpm_ibmvtpm.c | 14 +++---
1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/char/tpm/tpm_ibmvtpm.c b/drivers/char/tpm/tpm_ibmvtpm.c
index 09fe45246b8cc..994385bf37c0c 100644
--- a/drivers/char/tpm/tpm_ibmvtpm.c
+++ b/drivers/char/tpm/tpm_ibmvtpm.c
@@ -683,13 +683,6 @@ static int tpm_ibmvtpm_probe(struct vio_dev *vio_dev,
if (rc)
goto init_irq_cleanup;

-   if (!strcmp(id->compat, "IBM,vtpm20")) {
-   chip->flags |= TPM_CHIP_FLAG_TPM2;
-   rc = tpm2_get_cc_attrs_tbl(chip);
-   if (rc)
-   goto init_irq_cleanup;
-   }
-
if (!wait_event_timeout(ibmvtpm->crq_queue.wq,
ibmvtpm->rtce_buf != NULL,
HZ)) {
@@ -697,6 +690,13 @@ static int tpm_ibmvtpm_probe(struct vio_dev *vio_dev,
goto init_irq_cleanup;
}

+   if (!strcmp(id->compat, "IBM,vtpm20")) {
+   chip->flags |= TPM_CHIP_FLAG_TPM2;
+   rc = tpm2_get_cc_attrs_tbl(chip);
+   if (rc)
+   goto init_irq_cleanup;
+   }
+
return tpm_chip_register(chip);
init_irq_cleanup:
do {
--
2.26.2





[PATCHv2] tpm: ibmvtpm: Wait for ready buffer before probing for TPM2 attributes

2020-06-18 Thread David Gibson
The tpm2_get_cc_attrs_tbl() call will result in TPM commands being issued,
which will need the use of the internal command/response buffer.  But,
we're issuing this *before* we've waited to make sure that buffer is
allocated.

This can result in intermittent failures to probe if the hypervisor / TPM
implementation doesn't respond quickly enough.  I find it fails almost
every time with an 8 vcpu guest under KVM with software emulated TPM.

To fix it, just move the tpm2_get_cc_attrs_tbl() call after the
existing code to wait for initialization, which will ensure the buffer
is allocated.

Fixes: 18b3670d79ae9 ("tpm: ibmvtpm: Add support for TPM2")
Signed-off-by: David Gibson 
---

Changes from v1:
 * Fixed a formatting error in the commit message
 * Added some more detail to the commit message
 
drivers/char/tpm/tpm_ibmvtpm.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/char/tpm/tpm_ibmvtpm.c b/drivers/char/tpm/tpm_ibmvtpm.c
index 09fe45246b8cc..994385bf37c0c 100644
--- a/drivers/char/tpm/tpm_ibmvtpm.c
+++ b/drivers/char/tpm/tpm_ibmvtpm.c
@@ -683,13 +683,6 @@ static int tpm_ibmvtpm_probe(struct vio_dev *vio_dev,
if (rc)
goto init_irq_cleanup;
 
-   if (!strcmp(id->compat, "IBM,vtpm20")) {
-   chip->flags |= TPM_CHIP_FLAG_TPM2;
-   rc = tpm2_get_cc_attrs_tbl(chip);
-   if (rc)
-   goto init_irq_cleanup;
-   }
-
if (!wait_event_timeout(ibmvtpm->crq_queue.wq,
ibmvtpm->rtce_buf != NULL,
HZ)) {
@@ -697,6 +690,13 @@ static int tpm_ibmvtpm_probe(struct vio_dev *vio_dev,
goto init_irq_cleanup;
}
 
+   if (!strcmp(id->compat, "IBM,vtpm20")) {
+   chip->flags |= TPM_CHIP_FLAG_TPM2;
+   rc = tpm2_get_cc_attrs_tbl(chip);
+   if (rc)
+   goto init_irq_cleanup;
+   }
+
return tpm_chip_register(chip);
 init_irq_cleanup:
do {
-- 
2.26.2



Re: [PATCH V3 (RESEND) 0/3] arm64: Enable vmemmap mapping from device memory

2020-06-18 Thread Anshuman Khandual



On 06/18/2020 02:26 PM, Mike Rapoport wrote:
> On Thu, Jun 18, 2020 at 06:45:27AM +0530, Anshuman Khandual wrote:
>> This series enables vmemmap backing memory allocation from device memory
>> ranges on arm64. But before that, it enables vmemmap_populate_basepages()
>> and vmemmap_alloc_block_buf() to accommodate struct vmem_altmap based
>> allocation requests.
>>
>> This series applies on 5.8-rc1.
>>
>> Pending Question:
>>
>> altmap_alloc_block_buf() does not have any other remaining users in
>> the tree after this change. Should it be converted into a static
>> function and its declaration be dropped from the header
>> (include/linux/mm.h). Avoided doing so because I was not sure if there
>> are any off-tree users or not.
> 
> Well, off-tree users probably have an active fork anyway so they could
> switch to vmemmap_alloc_block_buf()...

Sure, will make the function static and remove its declaration
from the header.

> 
> Regardless, can you please update Documentation/vm/memory-model.rst to
> keep it in sync with the code?
Sure, will do.


Re: [PATCH v2 2/4] KVM: PPC: Book3S HV: track the state GFNs associated with secure VMs

2020-06-18 Thread Ram Pai
On Thu, Jun 18, 2020 at 03:31:06PM +0200, Laurent Dufour wrote:
> Le 18/06/2020 à 11:19, Ram Pai a écrit :
> >

.snip..

> >
> >  1. States of a GFN
> > ---
> >  The GFN can be in one of the following states.
> >diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c

...snip...

> >index 803940d..3448459 100644
> >--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
> >+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> >@@ -1100,7 +1100,7 @@ void kvmppc_radix_flush_memslot(struct kvm *kvm,
> > unsigned int shift;
> > if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_START)
> >-kvmppc_uvmem_drop_pages(memslot, kvm, true);
> >+kvmppc_uvmem_drop_pages(memslot, kvm, true, false);
> 
> When reviewing the v1 of this series, I asked you the question about
> the fact that the call here is made with purge_gfn = false. Your
> answer was:
> 
> >This function does not know, under what context it is called. Since
> >its job is to just flush the memslot, it cannot assume anything
> >about purging the pages in the memslot.
> 
> Indeed in the case of the memory hotplug operation, this function is
> called to wipe the page from the secure device in the case the pages
> are secured. In that case the purge is required. Indeed, I checked
> the other call to kvmppc_radix_flush_memslot() in
> kvmppc_core_flush_memslot_hv() and I cannot see why in that case too
> purge_gfn should be false, especially when the memslot is reused as
> detailed in __kvm_set_memory_region() around the call to
> kvm_arch_flush_shadow_memslot().
> 
> I'm sorry to not have ask this earlier, but could you please elaborate on 
> this?

You are right. kvmppc_radix_flush_memslot() is called every time with
the intention of disassociating the memslot from that VM, which implies
the memslot is intended to be deleted and possibly reused.

I should be calling kvmppc_uvmem_drop_pages() with purge_gfn=true here
as well.
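
i.e. (sketch):

	/* In kvmppc_radix_flush_memslot(): */
	if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_START)
		kvmppc_uvmem_drop_pages(memslot, kvm, true,
					/* purge_gfn = */ true);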

I expect some form of problem to show up in the memory hotplug/unplug path.

RP



[PATCH net v2] ibmveth: Fix max MTU limit

2020-06-18 Thread Thomas Falcon
The max MTU limit defined for ibmveth does not account for
virtual ethernet buffer overhead, which is twenty-two additional
bytes set aside for the ethernet header and eight additional bytes
of an opaque handle reserved for use by the hypervisor. Update the
max MTU to reflect this overhead.

Signed-off-by: Thomas Falcon 
Fixes: d894be57ca92 ("ethernet: use net core MTU range checking in more drivers")
Fixes: 110447f8269a ("ethernet: fix min/max MTU typos")
---
v2: Include Fixes tags suggested by Jakub Kicinski
---
 drivers/net/ethernet/ibm/ibmveth.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
index 96d36ae5049e..c5c732601e35 100644
--- a/drivers/net/ethernet/ibm/ibmveth.c
+++ b/drivers/net/ethernet/ibm/ibmveth.c
@@ -1715,7 +1715,7 @@ static int ibmveth_probe(struct vio_dev *dev, const struct vio_device_id *id)
}
 
netdev->min_mtu = IBMVETH_MIN_MTU;
-   netdev->max_mtu = ETH_MAX_MTU;
+   netdev->max_mtu = ETH_MAX_MTU - IBMVETH_BUFF_OH;
 
memcpy(netdev->dev_addr, mac_addr_p, ETH_ALEN);
 
-- 
2.26.2



[PATCH net] ibmvnic: continue to init if CRQ reset returns H_CLOSED

2020-06-18 Thread Dany Madden
Continue the reset path when the partner adapter is not ready or H_CLOSED
is returned from the CRQ reset. This patch allows the CRQ init to proceed
to establish a valid CRQ for traffic to flow after reset.

Signed-off-by: Dany Madden 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 2baf7b3ff4cb..4b7cb483c47f 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1971,13 +1971,18 @@ static int do_reset(struct ibmvnic_adapter *adapter,
release_sub_crqs(adapter, 1);
} else {
rc = ibmvnic_reset_crq(adapter);
-   if (!rc)
+   if (rc == H_CLOSED || rc == H_SUCCESS) {
rc = vio_enable_interrupts(adapter->vdev);
+   if (rc)
+   netdev_err(adapter->netdev,
+  "Reset failed to enable 
interrupts. rc=%d\n",
+  rc);
+   }
}
 
if (rc) {
netdev_err(adapter->netdev,
-  "Couldn't initialize crq. rc=%d\n", rc);
+  "Reset couldn't initialize crq. rc=%d\n", 
rc);
goto out;
}
 
-- 
2.18.2



[PATCH] pci: pcie: AER: Fix logging of Correctable errors

2020-06-18 Thread Matt Jolly
The AER documentation indicates that correctable (severity=Corrected)
errors should be output as a warning so that users can filter them if
they choose to; this functionality does not appear to have been
implemented.

This patch modifies the functions aer_print_error and __aer_print_error
to log correctable errors as a warning (pci_warn) rather than as an
error (pci_err). It partially addresses several bugs relating to kernel
message buffer spam from misbehaving devices - the root cause (possibly
device firmware?) isn't addressed, but the dmesg output is less alarming
for end users and can be filtered separately from uncorrectable errors.
This should hopefully reduce the need for users to disable AER to
suppress corrected errors.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=201517
Link: https://bugzilla.kernel.org/show_bug.cgi?id=196183

Signed-off-by: Matt Jolly 
---
 drivers/pci/pcie/aer.c | 36 ++--
 1 file changed, 26 insertions(+), 10 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 3acf56683915..131ecc0df2cb 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -662,12 +662,18 @@ static void __aer_print_error(struct pci_dev *dev,
errmsg = i < ARRAY_SIZE(aer_uncorrectable_error_string) ?
aer_uncorrectable_error_string[i] : NULL;
 
-   if (errmsg)
-   pci_err(dev, "   [%2d] %-22s%s\n", i, errmsg,
-   info->first_error == i ? " (First)" : "");
-   else
+   if (errmsg) {
+   if (info->severity == AER_CORRECTABLE) {
+   pci_warn(dev, "   [%2d] %-22s%s\n", i, errmsg,
+   info->first_error == i ? " (First)" : "");
+   } else {
+   pci_err(dev, "   [%2d] %-22s%s\n", i, errmsg,
+   info->first_error == i ? " (First)" : "");
+   }
+   } else {
pci_err(dev, "   [%2d] Unknown Error Bit%s\n",
i, info->first_error == i ? " (First)" : "");
+   }
}
pci_dev_aer_stats_incr(dev, info);
 }
@@ -686,13 +692,23 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
layer = AER_GET_LAYER_ERROR(info->severity, info->status);
agent = AER_GET_AGENT(info->severity, info->status);
 
-   pci_err(dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
-   aer_error_severity_string[info->severity],
-   aer_error_layer[layer], aer_agent_string[agent]);
+   if  (info->severity == AER_CORRECTABLE) {
+   pci_warn(dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
+   aer_error_severity_string[info->severity],
+   aer_error_layer[layer], aer_agent_string[agent]);
 
-   pci_err(dev, "  device [%04x:%04x] error status/mask=%08x/%08x\n",
-   dev->vendor, dev->device,
-   info->status, info->mask);
+   pci_warn(dev, "  device [%04x:%04x] error 
status/mask=%08x/%08x\n",
+   dev->vendor, dev->device,
+   info->status, info->mask);
+   } else {
+   pci_err(dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
+   aer_error_severity_string[info->severity],
+   aer_error_layer[layer], aer_agent_string[agent]);
+
+   pci_err(dev, "  device [%04x:%04x] error 
status/mask=%08x/%08x\n",
+   dev->vendor, dev->device,
+   info->status, info->mask);
+   }
 
__aer_print_error(dev, info);
 
-- 
2.26.2



Re: rename probe_kernel_* and probe_user_*

2020-06-18 Thread Helge Deller
On 18.06.20 21:48, Linus Torvalds wrote:
> [ Explicitly added architecture lists and developers to the cc to make
> this more visible ]
>
> On Wed, Jun 17, 2020 at 12:38 AM Christoph Hellwig  wrote:
>>
>> Andrew and I decided to drop the patches implementing your suggested
>> rename of the probe_kernel_* and probe_user_* helpers from -mm as there
>> were way too many conflicts.  After -rc1 might be a good time for this as
>> all the conflicts are resolved now.
>
> So I've merged this renaming now, together with my changes to make
> 'get_kernel_nofault()' look and act a lot more like 'get_user()'.
>
> It just felt wrong (and potentially dangerous) to me to have a
> 'get_kernel_nofault()' naming that implied semantics that we're all
> familiar with from 'get_user()', but acting very differently.
>
> But part of the fixups I made for the type checking are for
> architectures where I didn't even compile-test the end result. I
> looked at every case individually, and the patch looks sane, but I
> could have screwed something up.
>
> Basically, 'get_kernel_nofault()' doesn't do the same automagic type
> munging from the pointer to the target that 'get_user()' does, but at
> least now it checks that the types are superficially compatible.
> There should be build failures if they aren't, but I hopefully fixed
> everything up properly for all architectures.
>
> This email is partly to ask people to double-check, but partly just as
> a heads-up so that _if_ I screwed something up, you'll have the
> background and it won't take you by surprise.

Linus, thanks for the heads-up!
With your change it compiles cleanly on 32- and 64-bit parisc.

Helge


Re: [PATCH net] ibmveth: Fix max MTU limit

2020-06-18 Thread Thomas Falcon



On 6/18/20 10:57 AM, Jakub Kicinski wrote:

On Thu, 18 Jun 2020 10:43:46 -0500 Thomas Falcon wrote:

The max MTU limit defined for ibmveth is not accounting for
virtual ethernet buffer overhead, which is twenty-two additional
bytes set aside for the ethernet header and eight additional bytes
of an opaque handle reserved for use by the hypervisor. Update the
max MTU to reflect this overhead.

Signed-off-by: Thomas Falcon 

How about

Fixes: d894be57ca92 ("ethernet: use net core MTU range checking in more drivers")
Fixes: 110447f8269a ("ethernet: fix min/max MTU typos")

?


Thanks, do you need me to send a v2 with those tags?

Tom



Re: [PATCH v2 0/4] Migrate non-migrated pages of a SVM.

2020-06-18 Thread Ram Pai
I should have elaborated on the problem and the need for these patches.

Explaining it here. Will add it to the series in next version.

-

The time taken to switch a VM to a Secure-VM increases with the size of
the VM.  A 100GB VM takes about 7 minutes. This is unacceptable.  This
linear increase is caused by suboptimal behavior of the Ultravisor and
the Hypervisor.  The Ultravisor unnecessarily migrates all the GFNs of
the VM from normal-memory to secure-memory. It only has to migrate the
necessary and sufficient GFNs.

However, when the optimization is incorporated in the Ultravisor, the
Hypervisor starts misbehaving. The Hypervisor has an inbuilt assumption
that the Ultravisor will explicitly request to migrate each and every
GFN of the VM. If only necessary and sufficient GFNs are requested for
migration, the Hypervisor continues to manage the rest of the GFNs as
normal GFNs. This leads to memory corruption, manifested consistently
when the SVM reboots.

The same is true when a memory slot is hotplugged into an SVM. The
Hypervisor expects the Ultravisor to request migration of all GFNs to
secure-GFNs.  But at the same time, the Hypervisor is unable to handle
any H_SVM_PAGE_IN requests from the Ultravisor made in the context of
the UV_REGISTER_MEM_SLOT ucall.  This problem manifests as random errors
in the SVM when a memory slot is hotplugged.

This patch series automatically migrates the non-migrated pages of a SVM,
and thus solves the problem.

--



On Thu, Jun 18, 2020 at 02:19:01AM -0700, Ram Pai wrote:
> This patch series migrates the non-migrated pages of a SVM.
> This is required when the UV calls H_SVM_INIT_DONE, and
> when a memory-slot is hotplugged to a Secure VM.
> 
> Testing: Passed rigorous SVM reboot test using different
>   sized SVMs.
> 
> Changelog:
>   . fixed a bug observed by Bharata. Pages that
>   were paged-in and later paged-out must also be
>   skipped from migration during H_SVM_INIT_DONE.
> 
> Laurent Dufour (1):
>   KVM: PPC: Book3S HV: migrate hot plugged memory
> 
> Ram Pai (3):
>   KVM: PPC: Book3S HV: Fix function definition in book3s_hv_uvmem.c
>   KVM: PPC: Book3S HV: track the state GFNs associated with secure VMs
>   KVM: PPC: Book3S HV: migrate remaining normal-GFNs to secure-GFNs in
> H_SVM_INIT_DONE
> 
>  Documentation/powerpc/ultravisor.rst|   2 +
>  arch/powerpc/include/asm/kvm_book3s_uvmem.h |   8 +-
>  arch/powerpc/kvm/book3s_64_mmu_radix.c  |   2 +-
>  arch/powerpc/kvm/book3s_hv.c|  12 +-
>  arch/powerpc/kvm/book3s_hv_uvmem.c  | 449 ++--
>  5 files changed, 368 insertions(+), 105 deletions(-)
> 
> -- 
> 1.8.3.1

-- 
Ram Pai


Re: rename probe_kernel_* and probe_user_*

2020-06-18 Thread Linus Torvalds
[ Explicitly added architecture lists and developers to the cc to make
this more visible ]

On Wed, Jun 17, 2020 at 12:38 AM Christoph Hellwig  wrote:
>
> Andrew and I decided to drop the patches implementing your suggested
> rename of the probe_kernel_* and probe_user_* helpers from -mm as there
> were way too many conflicts.  After -rc1 might be a good time for this as
> all the conflicts are resolved now.

So I've merged this renaming now, together with my changes to make
'get_kernel_nofault()' look and act a lot more like 'get_user()'.

It just felt wrong (and potentially dangerous) to me to have a
'get_kernel_nofault()' naming that implied semantics that we're all
familiar with from 'get_user()', but acting very differently.

But part of the fixups I made for the type checking are for
architectures where I didn't even compile-test the end result. I
looked at every case individually, and the patch looks sane, but I
could have screwed something up.

Basically, 'get_kernel_nofault()' doesn't do the same automagic type
munging from the pointer to the target that 'get_user()' does, but at
least now it checks that the types are superficially compatible.
There should be build failures if they aren't, but I hopefully fixed
everything up properly for all architectures.

This email is partly to ask people to double-check, but partly just as
a heads-up so that _if_ I screwed something up, you'll have the
background and it won't take you by surprise.

   Linus
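
A minimal sketch of the calling-convention difference described above
(read_both() is a hypothetical helper, not kernel code; both macros return
0 on success and -EFAULT on a faulting access):

#include <linux/uaccess.h>

/* Sketch only: contrast get_user() with get_kernel_nofault(). */
static long read_both(long __user *uptr, long *kptr)
{
	long uval, kval;

	/* get_user() derives the access size and type from *uptr itself. */
	if (get_user(uval, uptr))
		return -EFAULT;

	/*
	 * get_kernel_nofault() does no such automagic type munging; it
	 * now only build-checks that the destination and source types
	 * are superficially compatible.
	 */
	if (get_kernel_nofault(kval, kptr))
		return -EFAULT;

	return uval + kval;
}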


Re: [PATCH v2 02/12] ocxl: Change type of pasid to unsigned int

2020-06-18 Thread Frederic Barrat




Le 18/06/2020 à 17:37, Fenghua Yu a écrit :

> The first 3 patches clean up pasid and flag definitions to prepare for
following patches.

If you think this patch can be dropped, we will drop it.


Yes, I think that's the case.

Thanks,

 Fred


[PATCH v1 5/5] KVM: PPC: Book3S HV: Use H_RPT_INVALIDATE in nested KVM

2020-06-18 Thread Bharata B Rao
In the nested KVM case, replace H_TLB_INVALIDATE with the new hcall
H_RPT_INVALIDATE if available. The availability of this hcall
is determined from the "hcall-rpt-invalidate" string in the
ibm,hypertas-functions DT property.

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/include/asm/firmware.h   |  4 +++-
 arch/powerpc/kvm/book3s_64_mmu_radix.c| 27 ++-
 arch/powerpc/kvm/book3s_hv_nested.c   | 13 +--
 arch/powerpc/platforms/pseries/firmware.c |  1 +
 4 files changed, 36 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
index 6003c2e533a0..aa6a5ef5d483 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -52,6 +52,7 @@
 #define FW_FEATURE_PAPR_SCM	ASM_CONST(0x0020)
 #define FW_FEATURE_ULTRAVISOR  ASM_CONST(0x0040)
 #define FW_FEATURE_STUFF_TCE   ASM_CONST(0x0080)
+#define FW_FEATURE_RPT_INVALIDATE ASM_CONST(0x0100)
 
 #ifndef __ASSEMBLY__
 
@@ -71,7 +72,8 @@ enum {
FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN |
FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
-   FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR,
+   FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
+   FW_FEATURE_RPT_INVALIDATE,
FW_FEATURE_PSERIES_ALWAYS = 0,
FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL | FW_FEATURE_ULTRAVISOR,
FW_FEATURE_POWERNV_ALWAYS = 0,
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 84acb4769487..fcf8b031a32e 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Supported radix tree geometry.
@@ -313,10 +314,17 @@ void kvmppc_radix_tlbie_page(struct kvm *kvm, unsigned long addr,
return;
}
 
-   psi = shift_to_mmu_psize(pshift);
-   rb = addr | (mmu_get_ap(psi) << PPC_BITLSHIFT(58));
-   rc = plpar_hcall_norets(H_TLB_INVALIDATE, H_TLBIE_P1_ENC(0, 0, 1),
-   lpid, rb);
+   if (!firmware_has_feature(FW_FEATURE_RPT_INVALIDATE)) {
+   psi = shift_to_mmu_psize(pshift);
+   rb = addr | (mmu_get_ap(psi) << PPC_BITLSHIFT(58));
+   rc = plpar_hcall_norets(H_TLB_INVALIDATE,
+   H_TLBIE_P1_ENC(0, 0, 1), lpid, rb);
+   } else {
+   rc = pseries_rpt_invalidate(lpid, H_RPTI_TARGET_CMMU,
+   H_RPTI_TYPE_NESTED |
+   H_RPTI_TYPE_TLB, H_RPTI_PAGE_ALL,
+   addr, addr + psize);
+   }
if (rc)
pr_err("KVM: TLB page invalidation hcall failed, rc=%ld\n", rc);
 }
@@ -330,8 +338,15 @@ static void kvmppc_radix_flush_pwc(struct kvm *kvm, unsigned int lpid)
return;
}
 
-   rc = plpar_hcall_norets(H_TLB_INVALIDATE, H_TLBIE_P1_ENC(1, 0, 1),
-   lpid, TLBIEL_INVAL_SET_LPID);
+   if (!firmware_has_feature(FW_FEATURE_RPT_INVALIDATE))
+   rc = plpar_hcall_norets(H_TLB_INVALIDATE,
+   H_TLBIE_P1_ENC(1, 0, 1),
+   lpid, TLBIEL_INVAL_SET_LPID);
+   else
+   rc = pseries_rpt_invalidate(lpid, H_RPTI_TARGET_CMMU,
+   H_RPTI_TYPE_NESTED |
+   H_RPTI_TYPE_PWC, H_RPTI_PAGE_ALL,
+   0, -1UL);
if (rc)
pr_err("KVM: TLB PWC invalidation hcall failed, rc=%ld\n", rc);
 }
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
index 75993f44519b..81f903284d34 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static struct patb_entry *pseries_partition_tb;
 
@@ -402,8 +403,16 @@ static void kvmhv_flush_lpid(unsigned int lpid)
return;
}
 
-   rc = plpar_hcall_norets(H_TLB_INVALIDATE, H_TLBIE_P1_ENC(2, 0, 1),
-   lpid, TLBIEL_INVAL_SET_LPID);
+   if (!firmware_has_feature(FW_FEATURE_RPT_INVALIDATE))
+   rc = plpar_hcall_norets(H_TLB_INVALIDATE,
+   H_TLBIE_P1_ENC(2, 0, 1),
+   lpid, TLBIEL_INVAL_SET_LPID);
+   else
+   rc = pseries_rpt_invalidate(lpid, H_RPTI_TARGET_CMMU,
+   H_RPTI_TYPE_NESTED |
+   H_RPTI_TYPE_TLB | H_RPTI_TYPE_PWC |
+   H_RPTI_TYPE_PAT,
+   

[PATCH v1 4/5] powerpc/mm/book3s64/radix: Off-load TLB invalidations to host when !GTSE

2020-06-18 Thread Bharata B Rao
From: Nicholas Piggin 

When the platform doesn't support GTSE, let TLB invalidation requests
for radix guests be off-loaded to the host using the H_RPT_INVALIDATE
hcall.
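
A typical invocation under !GTSE would then look something like this
(sketch, using the wrapper and flag definitions added below):

	/* Off-load a process-scoped TLB range flush to the host. */
	pseries_rpt_invalidate(pid, H_RPTI_TARGET_CMMU, H_RPTI_TYPE_TLB,
			       H_RPTI_PAGE_ALL, start, end);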

Signed-off-by: Nicholas Piggin 
Signed-off-by: Bharata B Rao 
[hcall wrapper, error path handling and renames]
---
 arch/powerpc/include/asm/hvcall.h | 27 ++-
 arch/powerpc/include/asm/plpar_wrappers.h | 52 +
 arch/powerpc/mm/book3s64/radix_tlb.c  | 95 +--
 3 files changed, 166 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
index e90c073e437e..3f9bc7ad1cdd 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -305,7 +305,8 @@
 #define H_SCM_UNBIND_ALL	0x3FC
 #define H_SCM_HEALTH		0x400
 #define H_SCM_PERFORMANCE_STATS 0x418
-#define MAX_HCALL_OPCODE   H_SCM_PERFORMANCE_STATS
+#define H_RPT_INVALIDATE   0x448
+#define MAX_HCALL_OPCODE   H_RPT_INVALIDATE
 
 /* Scope args for H_SCM_UNBIND_ALL */
 #define H_UNBIND_SCOPE_ALL (0x1)
@@ -389,6 +390,30 @@
 #define PROC_TABLE_RADIX   0x04
 #define PROC_TABLE_GTSE	0x01
 
+/*
+ * Defines for
+ * H_RPT_INVALIDATE - Invalidate RPT translation lookaside information.
+ */
+
+/* Type of translation to invalidate (type) */
+#define H_RPTI_TYPE_NESTED	0x0001	/* Invalidate nested guest partition-scope */
+#define H_RPTI_TYPE_TLB	0x0002	/* Invalidate TLB */
+#define H_RPTI_TYPE_PWC	0x0004	/* Invalidate Page Walk Cache */
+#define H_RPTI_TYPE_PRT	0x0008	/* Invalidate Process Table Entries if H_RPTI_TYPE_NESTED is clear */
+#define H_RPTI_TYPE_PAT	0x0008	/* Invalidate Partition Table Entries if H_RPTI_TYPE_NESTED is set */
+
+/* Invalidation targets (target) */
+#define H_RPTI_TARGET_CMMU	0x01 /* All virtual processors in the partition */
+#define H_RPTI_TARGET_CMMU_LOCAL	0x02 /* Current virtual processor */
+#define H_RPTI_TARGET_NMMU	0x04 /* All nest/accelerator agents in use by the partition */
+
+/* Page size mask (page sizes) */
+#define H_RPTI_PAGE_4K 0x01
+#define H_RPTI_PAGE_64K0x02
+#define H_RPTI_PAGE_2M 0x04
+#define H_RPTI_PAGE_1G 0x08
+#define H_RPTI_PAGE_ALL (-1UL)
+
 #ifndef __ASSEMBLY__
 #include 
 
diff --git a/arch/powerpc/include/asm/plpar_wrappers.h b/arch/powerpc/include/asm/plpar_wrappers.h
index 4497c8afb573..92320bb309c7 100644
--- a/arch/powerpc/include/asm/plpar_wrappers.h
+++ b/arch/powerpc/include/asm/plpar_wrappers.h
@@ -334,6 +334,51 @@ static inline long plpar_get_cpu_characteristics(struct h_cpu_char_result *p)
return rc;
 }
 
+/*
+ * Wrapper to H_RPT_INVALIDATE hcall that handles return values appropriately
+ *
+ * - Returns H_SUCCESS on success
+ * - For H_BUSY return value, we retry the hcall.
+ * - For any other hcall failures, attempt a full flush once before
+ *   resorting to BUG().
+ *
+ * Note: This hcall is expected to fail only very rarely. The correct
+ * error recovery of killing the process/guest will be eventually
+ * needed.
+ */
+static inline long pseries_rpt_invalidate(u32 pid, u64 target, u64 type,
+ u64 page_sizes, u64 start, u64 end)
+{
+   long rc;
+   unsigned long all = H_RPTI_TYPE_TLB | H_RPTI_TYPE_PWC;
+
+   while (true) {
+   rc = plpar_hcall_norets(H_RPT_INVALIDATE, pid, target, type,
+   page_sizes, start, end);
+   if (rc == H_BUSY) {
+   cpu_relax();
+   continue;
+   } else if (rc == H_SUCCESS)
+   return rc;
+
+   /* Flush request failed, try with a full flush once */
+   if (type & H_RPTI_TYPE_NESTED)
+   all |= H_RPTI_TYPE_PAT;
+   else
+   all |= H_RPTI_TYPE_PRT;
+retry:
+   rc = plpar_hcall_norets(H_RPT_INVALIDATE, pid, target,
+   all, page_sizes, 0, -1UL);
+   if (rc == H_BUSY) {
+   cpu_relax();
+   goto retry;
+   } else if (rc == H_SUCCESS)
+   return rc;
+
+   BUG();
+   }
+}
+
 #else /* !CONFIG_PPC_PSERIES */
 
 static inline long plpar_set_ciabr(unsigned long ciabr)
@@ -346,6 +391,13 @@ static inline long plpar_pte_read_4(unsigned long flags, unsigned long ptex,
 {
return 0;
 }
+
+static inline long pseries_rpt_invalidate(u32 pid, u64 target, u64 type,
+ u64 page_sizes, u64 start, u64 end)
+{
+   return 0;
+}
+
 #endif /* CONFIG_PPC_PSERIES */
 
 #endif /* _ASM_POWERPC_PLPAR_WRAPPERS_H */
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index b5cc9b23cf02..733935b68f37 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/a

[PATCH v1 3/5] powerpc/pseries: H_REGISTER_PROC_TBL should ask for GTSE only if enabled

2020-06-18 Thread Bharata B Rao
H_REGISTER_PROC_TBL asks for GTSE by default. The GTSE flag bit should
be set only when GTSE is supported.

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/platforms/pseries/lpar.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index e4ed5317f117..58ba76bc1964 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -1680,9 +1680,11 @@ static int pseries_lpar_register_process_table(unsigned long base,
 
if (table_size)
flags |= PROC_TABLE_NEW;
-   if (radix_enabled())
-   flags |= PROC_TABLE_RADIX | PROC_TABLE_GTSE;
-   else
+   if (radix_enabled()) {
+   flags |= PROC_TABLE_RADIX;
+   if (mmu_has_feature(MMU_FTR_GTSE))
+   flags |= PROC_TABLE_GTSE;
+   } else
flags |= PROC_TABLE_HPT_SLB;
for (;;) {
rc = plpar_hcall_norets(H_REGISTER_PROC_TBL, flags, base,
-- 
2.21.3



[PATCH v1 2/5] powerpc/prom_init: Ask for Radix GTSE only if supported.

2020-06-18 Thread Bharata B Rao
In the case of radix, don't ask for GTSE by default but ask
only if GTSE is enabled.

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/kernel/prom_init.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 5f15b10eb007..16dd14f58ba6 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -1336,12 +1336,15 @@ static void __init prom_check_platform_support(void)
}
}
 
-   if (supported.radix_mmu && supported.radix_gtse &&
-   IS_ENABLED(CONFIG_PPC_RADIX_MMU)) {
-   /* Radix preferred - but we require GTSE for now */
-   prom_debug("Asking for radix with GTSE\n");
+   if (supported.radix_mmu && IS_ENABLED(CONFIG_PPC_RADIX_MMU)) {
+   /* Radix preferred - Check if GTSE is also supported */
+   prom_debug("Asking for radix\n");
ibm_architecture_vec.vec5.mmu = OV5_FEAT(OV5_MMU_RADIX);
-   ibm_architecture_vec.vec5.radix_ext = OV5_FEAT(OV5_RADIX_GTSE);
+   if (supported.radix_gtse)
+   ibm_architecture_vec.vec5.radix_ext =
+   OV5_FEAT(OV5_RADIX_GTSE);
+   else
+   prom_debug("Radix GTSE isn't supported\n");
} else if (supported.hash_mmu) {
/* Default to hash mmu (if we can) */
prom_debug("Asking for hash\n");
-- 
2.21.3



[PATCH v1 1/5] powerpc/mm: Make GTSE an MMU FTR

2020-06-18 Thread Bharata B Rao
Make GTSE an MMU feature and enable it by default for radix.
However, for guests, enable it conditionally, based on whether the
hypervisor advertises support via the OV5 vector.

Having GTSE as an MMU feature will make it easy to enable radix
without GTSE. Currently radix assumes GTSE is enabled by default.
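
To make the intent concrete, here is a minimal sketch (not part of this
patch) of how a radix flush path can branch on the new feature bit. The
hcall wrapper and the H_RPTI_* constants are introduced by later patches
in this series, so treat those names as assumptions:

static void flush_process_scoped_tlb(unsigned long pid)
{
	if (mmu_has_feature(MMU_FTR_GTSE)) {
		/* GTSE on: the guest may issue tlbie itself */
		_tlbie_pid(pid, RIC_FLUSH_TLB);
	} else {
		/* GTSE off: off-load the invalidation to the hypervisor */
		pseries_rpt_invalidate(pid, H_RPTI_TARGET_CMMU,
				       H_RPTI_TYPE_TLB, H_RPTI_PAGE_ALL,
				       0, -1UL);
	}
}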

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/include/asm/mmu.h| 4 
 arch/powerpc/kernel/dt_cpu_ftrs.c | 1 +
 arch/powerpc/mm/init_64.c | 5 -
 3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/mmu.h b/arch/powerpc/include/asm/mmu.h
index f4ac25d4df05..884d51995934 100644
--- a/arch/powerpc/include/asm/mmu.h
+++ b/arch/powerpc/include/asm/mmu.h
@@ -28,6 +28,9 @@
  * Individual features below.
  */
 
+/* Guest Translation Shootdown Enable */
+#define MMU_FTR_GTSE   ASM_CONST(0x1000)
+
 /*
  * Support for 68 bit VA space. We added that from ISA 2.05
  */
@@ -173,6 +176,7 @@ enum {
 #endif
 #ifdef CONFIG_PPC_RADIX_MMU
MMU_FTR_TYPE_RADIX |
+   MMU_FTR_GTSE |
 #ifdef CONFIG_PPC_KUAP
MMU_FTR_RADIX_KUAP |
 #endif /* CONFIG_PPC_KUAP */
diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c 
b/arch/powerpc/kernel/dt_cpu_ftrs.c
index 3a409517c031..fcb815b3a84d 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -337,6 +337,7 @@ static int __init feat_enable_mmu_radix(struct 
dt_cpu_feature *f)
 #ifdef CONFIG_PPC_RADIX_MMU
cur_cpu_spec->mmu_features |= MMU_FTR_TYPE_RADIX;
cur_cpu_spec->mmu_features |= MMU_FTRS_HASH_BASE;
+   cur_cpu_spec->mmu_features |= MMU_FTR_GTSE;
cur_cpu_spec->cpu_user_features |= PPC_FEATURE_HAS_MMU;
 
return 1;
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index c7ce4ec5060e..a7b571c60e90 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -408,12 +408,15 @@ static void __init early_check_vec5(void)
if (!(vec5[OV5_INDX(OV5_RADIX_GTSE)] &
OV5_FEAT(OV5_RADIX_GTSE))) {
pr_warn("WARNING: Hypervisor doesn't support RADIX with 
GTSE\n");
-   }
+   cur_cpu_spec->mmu_features &= ~MMU_FTR_GTSE;
+   } else
+   cur_cpu_spec->mmu_features |= MMU_FTR_GTSE;
/* Do radix anyway - the hypervisor said we had to */
cur_cpu_spec->mmu_features |= MMU_FTR_TYPE_RADIX;
} else if (mmu_supported == OV5_FEAT(OV5_MMU_HASH)) {
/* Hypervisor only supports hash - disable radix */
cur_cpu_spec->mmu_features &= ~MMU_FTR_TYPE_RADIX;
+   cur_cpu_spec->mmu_features &= ~MMU_FTR_GTSE;
}
 }
 
-- 
2.21.3



[PATCH v1 0/5] Off-load TLB invalidations to host for !GTSE

2020-06-18 Thread Bharata B Rao
The hypervisor may choose not to enable the Guest Translation Shootdown
Enable (GTSE) option for the guest. When GTSE isn't ON, the guest OS
isn't permitted to use instructions like tlbie and tlbsync directly, but
is expected to make hypervisor calls to get the TLB flushed.

This series enables the TLB flush routines in the radix code to
off-load TLB flushing to the hypervisor via the newly proposed hcall
H_RPT_INVALIDATE. The specification of this hcall is still evolving;
the patchset is posted here for early comments.

To easily check the availability of GTSE, it is made an MMU feature.
The OV5 handling and H_REGISTER_PROC_TBL hcall are changed to
handle GTSE as an optionally available feature and to not assume GTSE
when radix support is available.

The actual hcall implementation for KVM isn't included in this
patchset.

H_RPT_INVALIDATE

Syntax:
int64   /* H_Success: Return code on successful completion */
        /* H_Busy: repeat the call with the same parameters */
        /* H_Parameter, H_P2, H_P3, H_P4, H_P5: invalid parameters */
hcall(const uint64 H_RPT_INVALIDATE, /* Invalidate RPT translation lookaside information */
      uint64 pid,       /* PID/LPID to invalidate */
      uint64 target,    /* Invalidation target */
      uint64 type,      /* Type of lookaside information */
      uint64 pageSizes, /* Page sizes */
      uint64 start,     /* Start of Effective Address (EA) range (inclusive) */
      uint64 end)       /* End of EA range (exclusive) */

Invalidation targets (target)
-----------------------------
Core MMU    0x01 /* All virtual processors in the partition */
Core local MMU  0x02 /* Current virtual processor */
Nest MMU    0x04 /* All nest/accelerator agents in use by the partition */

A combination of the above can be specified, except core and core local together.

Type of translation to invalidate (type)
---
NESTED   0x0001  /* invalidate nested guest partition-scope */
TLB  0x0002  /* Invalidate TLB */
PWC  0x0004  /* Invalidate Page Walk Cache */
PRT  0x0008  /* Invalidate Process Table Entries if NESTED is clear*/
PAT  0x0008  /* Invalidate Partition Table Entries  if NESTED is set*/

A combination of the above can be specified.

Page size mask (pages)
--
4K  0x01
64K 0x02
2M  0x04
1G  0x08
All sizes   (-1UL)

A combination of the above can be specified.
All page sizes can be selected with -1.

Semantics: Invalidate radix tree lookaside information
   matching the parameters given.
* Return H_P2, H_P3 or H_P4 if target, type, or pageSizes parameters are
  different from the defined values.
* Return H_PARAMETER if NESTED is set and pid is not a valid nested
  LPID allocated to this partition
* Return H_P5 if (start, end) doesn't form a valid range. Start and end
  should be a valid Quadrant address and  end > start.
* Return H_NotSupported if the partition is not running in radix
  translation mode.
* May invalidate more translation information than requested.
* If start = 0 and end = -1, set the range to cover all valid addresses.
  Else start and end should be aligned to 4kB (lower 12 bits clear).
* If NESTED is clear, then invalidate process scoped lookaside information.
  Else pid specifies a nested LPID, and the invalidation is performed
  on nested guest partition table and nested guest partition scope real
  addresses.
* If pid = 0 and NESTED is clear, then valid addresses are quadrant 3 and
  quadrant 0 spaces; else valid addresses are quadrant 0.
* Pages which are fully covered by the range are to be invalidated.
  Those which are partially covered are considered outside invalidation
  range, which allows a caller to optimally invalidate ranges that may
  contain mixed page sizes.
* Return H_SUCCESS on success.
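
As a worked example of the encoding above (a sketch; the H_RPTI_* macro
names are assumed from the hvcall.h additions in this series), a
process-scoped flush of both the TLB and the page walk cache for one
PID, on all processors and the nest MMU, over the whole address space:

static long flush_all_for_pid(u64 pid)
{
	/* target = Core MMU | Nest MMU (0x01 | 0x04),
	 * type = TLB | PWC (0x0002 | 0x0004),
	 * pageSizes = -1UL (all sizes), EA range = [0, -1UL). */
	return plpar_hcall_norets(H_RPT_INVALIDATE, pid,
				  H_RPTI_TARGET_CMMU | H_RPTI_TARGET_NMMU,
				  H_RPTI_TYPE_TLB | H_RPTI_TYPE_PWC,
				  H_RPTI_PAGE_ALL, 0, -1UL);
}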

Bharata B Rao (4):
  powerpc/mm: Make GTSE an MMU FTR
  powerpc/prom_init: Ask for Radix GTSE only if supported.
  powerpc/pseries: H_REGISTER_PROC_TBL should ask for GTSE only if
enabled
  KVM: PPC: Book3S HV: Use H_RPT_INVALIDATE in nested KVM

Nicholas Piggin (1):
  powerpc/mm/book3s64/radix: Off-load TLB invalidations to host when
!GTSE

 arch/powerpc/include/asm/firmware.h   |  4 +-
 arch/powerpc/include/asm/hvcall.h | 27 ++-
 arch/powerpc/include/asm/mmu.h|  4 +
 arch/powerpc/include/asm/plpar_wrappers.h | 52 +
 arch/powerpc/kernel/dt_cpu_ftrs.c |  1 +
 arch/powerpc/kernel/prom_init.c   | 13 ++--
 arch/powerpc/kvm/book3s_64_mmu_radix.c| 27 +--
 arch/powerpc/kvm/book3s_hv_nested.c   | 13 +++-
 arch/powerpc/mm/book3s64/radix_tlb.c  | 95 +--
 arch/powerpc/mm/init_64.c |  5 +-
 arch/powerpc/platforms/pseries/firmware.c |  1 +
 arch/powerpc/platforms/pseries/lpar.c |  8 +-
 12 files changed, 224 insertions(+), 26 deletions(-)

Re: [PATCH net] ibmveth: Fix max MTU limit

2020-06-18 Thread Jakub Kicinski
On Thu, 18 Jun 2020 10:43:46 -0500 Thomas Falcon wrote:
> The max MTU limit defined for ibmveth is not accounting for
> virtual ethernet buffer overhead, which is twenty-two additional
> bytes set aside for the ethernet header and eight additional bytes
> of an opaque handle reserved for use by the hypervisor. Update the
> max MTU to reflect this overhead.
> 
> Signed-off-by: Thomas Falcon 

How about

Fixes: d894be57ca92 ("ethernet: use net core MTU range checking in more 
drivers")
Fixes: 110447f8269a ("ethernet: fix min/max MTU typos")

?


[PATCH net] ibmveth: Fix max MTU limit

2020-06-18 Thread Thomas Falcon
The max MTU limit defined for ibmveth is not accounting for
virtual ethernet buffer overhead, which is twenty-two additional
bytes set aside for the ethernet header and eight additional bytes
of an opaque handle reserved for use by the hypervisor. Update the
max MTU to reflect this overhead.
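
Taking the commit message's numbers at face value, the arithmetic is
(sketch only; the real overhead constant is IBMVETH_BUFF_OH in
ibmveth.h, and the macro names below are illustrative):

#define VETH_ETH_HDR_OVERHEAD	22	/* ethernet header area */
#define VETH_HANDLE_OVERHEAD	8	/* opaque hypervisor handle */

/* ETH_MAX_MTU is 65535, so 65535 - (22 + 8) = 65505 usable bytes. */
netdev->max_mtu = ETH_MAX_MTU -
		  (VETH_ETH_HDR_OVERHEAD + VETH_HANDLE_OVERHEAD);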

Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmveth.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ibm/ibmveth.c 
b/drivers/net/ethernet/ibm/ibmveth.c
index 96d36ae5049e..c5c732601e35 100644
--- a/drivers/net/ethernet/ibm/ibmveth.c
+++ b/drivers/net/ethernet/ibm/ibmveth.c
@@ -1715,7 +1715,7 @@ static int ibmveth_probe(struct vio_dev *dev, const 
struct vio_device_id *id)
}
 
netdev->min_mtu = IBMVETH_MIN_MTU;
-   netdev->max_mtu = ETH_MAX_MTU;
+   netdev->max_mtu = ETH_MAX_MTU - IBMVETH_BUFF_OH;
 
memcpy(netdev->dev_addr, mac_addr_p, ETH_ALEN);
 
-- 
2.26.2



Re: [PATCH v2 02/12] ocxl: Change type of pasid to unsigned int

2020-06-18 Thread Fenghua Yu
Hi, Frederic,

On Thu, Jun 18, 2020 at 10:05:19AM +0200, Frederic Barrat wrote:
> 
> 
> Le 13/06/2020 à 02:41, Fenghua Yu a écrit :
> >PASID is defined as "int" although it's a 20-bit value and shouldn't be
> >negative int. To be consistent with type defined in iommu, define PASID
> >as "unsigned int".
> 
> 
> It looks like this patch was considered because of the use of 'pasid' in
> variable or function names. The ocxl driver only makes sense on powerpc and
> shouldn't compile on anything else, so it's probably useless in the context
> of that series.
> The pasid here is defined by the opencapi specification
> (https://opencapi.org), it is borrowed from the PCI world and you could
> argue it could be an unsigned int, but then I think the patch doesn't go
> far enough. Considering it's not used on x86, I think this patch can be
> dropped.

The first 3 patches clean up pasid and flag definitions to prepare for
the following patches.

If you think this patch can be dropped, we will drop it.

Thanks.

-Fenghua


[PATCH 4/6] exec: split prepare_arg_pages

2020-06-18 Thread Christoph Hellwig
Move counting the arguments and environment variables out of
prepare_arg_pages and rename the rest of the function to check_arg_limit.
This prepares for a version of do_execveat that takes kernel pointers.

Signed-off-by: Christoph Hellwig 
---
 fs/exec.c | 26 ++
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index a5d91f8b1341d5..34781db6bf6889 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -435,20 +435,10 @@ static int count_strings(const char __user *const __user 
*argv)
return i;
 }
 
-static int prepare_arg_pages(struct linux_binprm *bprm,
-   const char __user *const __user *argv,
-   const char __user *const __user *envp)
+static int check_arg_limit(struct linux_binprm *bprm)
 {
unsigned long limit, ptr_size;
 
-   bprm->argc = count_strings(argv);
-   if (bprm->argc < 0)
-   return bprm->argc;
-
-   bprm->envc = count_strings(envp);
-   if (bprm->envc < 0)
-   return bprm->envc;
-
/*
 * Limit to 1/4 of the max stack size or 3/4 of _STK_LIM
 * (whichever is smaller) for the argv+env strings.
@@ -1886,7 +1876,19 @@ int do_execveat(int fd, struct filename *filename,
if (retval)
goto out_unmark;
 
-   retval = prepare_arg_pages(bprm, argv, envp);
+   bprm->argc = count_strings(argv);
+   if (bprm->argc < 0) {
+   retval = bprm->argc;
+   goto out;
+   }
+
+   bprm->envc = count_strings(envp);
+   if (bprm->envc < 0) {
+   retval = bprm->envc;
+   goto out;
+   }
+
+   retval = check_arg_limit(bprm);
if (retval < 0)
goto out;
 
-- 
2.26.2



[PATCH 6/6] kernel: add a kernel_wait helper

2020-06-18 Thread Christoph Hellwig
Add a helper that waits for a pid and stores the status in the passed
in kernel pointer.  Use it to fix the usage of kernel_wait4 in
call_usermodehelper_exec_sync that only happens to work due to the
implicit set_fs(KERNEL_DS) for kernel threads.
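
For illustration, a hypothetical in-kernel call site (my_helper_fn is a
placeholder) now reads:

static int my_helper_fn(void *unused)
{
	return 0;	/* placeholder workload */
}

static void run_and_reap_helper(void)
{
	int status = -ECHILD;
	pid_t pid = kernel_thread(my_helper_fn, NULL, SIGCHLD);

	if (pid < 0)
		return;
	/* status is a plain kernel pointer: no __user cast, no set_fs() */
	kernel_wait(pid, &status);
	pr_info("helper exited, status %d\n", status);
}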

Signed-off-by: Christoph Hellwig 
---
 include/linux/sched/task.h |  1 +
 kernel/exit.c  | 16 
 kernel/umh.c   | 29 -
 3 files changed, 21 insertions(+), 25 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 38359071236ad7..a80007df396e95 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -102,6 +102,7 @@ struct task_struct *fork_idle(int);
 struct mm_struct *copy_init_mm(void);
 extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
 extern long kernel_wait4(pid_t, int __user *, int, struct rusage *);
+int kernel_wait(pid_t pid, int *stat);
 
 extern void free_task(struct task_struct *tsk);
 
diff --git a/kernel/exit.c b/kernel/exit.c
index 727150f2810338..fd598846df0b17 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1626,6 +1626,22 @@ long kernel_wait4(pid_t upid, int __user *stat_addr, int 
options,
return ret;
 }
 
+int kernel_wait(pid_t pid, int *stat)
+{
+   struct wait_opts wo = {
+   .wo_type= PIDTYPE_PID,
+   .wo_pid = find_get_pid(pid),
+   .wo_flags   = WEXITED,
+   };
+   int ret;
+
+   ret = do_wait(&wo);
+   if (ret > 0 && wo.wo_stat)
+   *stat = wo.wo_stat;
+   put_pid(wo.wo_pid);
+   return ret;
+}
+
 SYSCALL_DEFINE4(wait4, pid_t, upid, int __user *, stat_addr,
int, options, struct rusage __user *, ru)
 {
diff --git a/kernel/umh.c b/kernel/umh.c
index 1284823dbad338..6fd948e478bec4 100644
--- a/kernel/umh.c
+++ b/kernel/umh.c
@@ -126,37 +126,16 @@ static void call_usermodehelper_exec_sync(struct 
subprocess_info *sub_info)
 {
pid_t pid;
 
-   /* If SIGCLD is ignored kernel_wait4 won't populate the status. */
+   /* If SIGCLD is ignored do_wait won't populate the status. */
kernel_sigaction(SIGCHLD, SIG_DFL);
pid = kernel_thread(call_usermodehelper_exec_async, sub_info, SIGCHLD);
-   if (pid < 0) {
+   if (pid < 0)
sub_info->retval = pid;
-   } else {
-   int ret = -ECHILD;
-   /*
-* Normally it is bogus to call wait4() from in-kernel because
-* wait4() wants to write the exit code to a userspace address.
-* But call_usermodehelper_exec_sync() always runs as kernel
-* thread (workqueue) and put_user() to a kernel address works
-* OK for kernel threads, due to their having an mm_segment_t
-* which spans the entire address space.
-*
-* Thus the __user pointer cast is valid here.
-*/
-   kernel_wait4(pid, (int __user *)&ret, 0, NULL);
-
-   /*
-* If ret is 0, either call_usermodehelper_exec_async failed and
-* the real error code is already in sub_info->retval or
-* sub_info->retval is 0 anyway, so don't mess with it then.
-*/
-   if (ret)
-   sub_info->retval = ret;
-   }
+   else
+   kernel_wait(pid, &sub_info->retval);
 
/* Restore default kernel sig handler */
kernel_sigaction(SIGCHLD, SIG_IGN);
-
umh_complete(sub_info);
 }
 
-- 
2.26.2



[PATCH 5/6] exec: add a kernel_execveat helper

2020-06-18 Thread Christoph Hellwig
Add a kernel_execveat helper to execute a binary with kernel space argv
and envp pointers.  Switch executing init and user mode helpers to this
new helper instead of relying on the implicit set_fs(KERNEL_DS) for early
init code and kernel threads, and move the getname call into the
do_execve helper.
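
A hypothetical early-init call site, assuming the parameter order
(fd, filename, argv, envp, flags) mirrors do_execveat (the diff below
is truncated before the helper's full definition):

static int try_run_init(void)
{
	static const char *const argv[] = { "/sbin/init", NULL };
	static const char *const envp[] = { "HOME=/", "TERM=linux", NULL };

	/* kernel-space argv/envp; no set_fs(KERNEL_DS) dance needed */
	return kernel_execveat(AT_FDCWD, argv[0], argv, envp, 0);
}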

Signed-off-by: Christoph Hellwig 
---
 fs/exec.c   | 109 
 include/linux/binfmts.h |   6 +--
 init/main.c |   6 +--
 kernel/umh.c|   8 ++-
 4 files changed, 95 insertions(+), 34 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 34781db6bf6889..7923b8334ae600 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -435,6 +435,21 @@ static int count_strings(const char __user *const __user 
*argv)
return i;
 }
 
+static int count_kernel_strings(const char *const *argv)
+{
+   int i;
+
+   if (!argv)
+   return 0;
+
+   for (i = 0; argv[i]; i++) {
+   if (i >= MAX_ARG_STRINGS)
+   return -E2BIG;
+   }
+
+   return i;
+}
+
 static int check_arg_limit(struct linux_binprm *bprm)
 {
unsigned long limit, ptr_size;
@@ -611,6 +626,19 @@ int copy_string_kernel(const char *arg, struct 
linux_binprm *bprm)
 }
 EXPORT_SYMBOL(copy_string_kernel);
 
+static int copy_strings_kernel(int argc, const char *const *argv,
+   struct linux_binprm *bprm)
+{
+   int ret = 0;   /* nothing to copy when argc == 0 */
+
+   while (argc-- > 0) {
+   ret = copy_string_kernel(argv[argc], bprm);
+   if (ret)
+   break;
+   }
+   return ret;
+}
+
 #ifdef CONFIG_MMU
 
 /*
@@ -1793,9 +1821,11 @@ static int exec_binprm(struct linux_binprm *bprm)
return 0;
 }
 
-int do_execveat(int fd, struct filename *filename,
+static int __do_execveat(int fd, struct filename *filename,
const char __user *const __user *argv,
const char __user *const __user *envp,
+   const char *const *kernel_argv,
+   const char *const *kernel_envp,
int flags, struct file *file)
 {
char *pathbuf = NULL;
@@ -1876,16 +1906,30 @@ int do_execveat(int fd, struct filename *filename,
if (retval)
goto out_unmark;
 
-   bprm->argc = count_strings(argv);
-   if (bprm->argc < 0) {
-   retval = bprm->argc;
-   goto out;
-   }
+   if (unlikely(kernel_argv)) {
+   bprm->argc = count_kernel_strings(kernel_argv);
+   if (bprm->argc < 0) {
+   retval = bprm->argc;
+   goto out;
+   }
 
-   bprm->envc = count_strings(envp);
-   if (bprm->envc < 0) {
-   retval = bprm->envc;
-   goto out;
+   bprm->envc = count_kernel_strings(kernel_envp);
+   if (bprm->envc < 0) {
+   retval = bprm->envc;
+   goto out;
+   }
+   } else {
+   bprm->argc = count_strings(argv);
+   if (bprm->argc < 0) {
+   retval = bprm->argc;
+   goto out;
+   }
+
+   bprm->envc = count_strings(envp);
+   if (bprm->envc < 0) {
+   retval = bprm->envc;
+   goto out;
+   }
}
 
retval = check_arg_limit(bprm);
@@ -1902,13 +1946,22 @@ int do_execveat(int fd, struct filename *filename,
goto out;
 
bprm->exec = bprm->p;
-   retval = copy_strings(bprm->envc, envp, bprm);
-   if (retval < 0)
-   goto out;
 
-   retval = copy_strings(bprm->argc, argv, bprm);
-   if (retval < 0)
-   goto out;
+   if (unlikely(kernel_argv)) {
+   retval = copy_strings_kernel(bprm->envc, kernel_envp, bprm);
+   if (retval < 0)
+   goto out;
+   retval = copy_strings_kernel(bprm->argc, kernel_argv, bprm);
+   if (retval < 0)
+   goto out;
+   } else {
+   retval = copy_strings(bprm->envc, envp, bprm);
+   if (retval < 0)
+   goto out;
+   retval = copy_strings(bprm->argc, argv, bprm);
+   if (retval < 0)
+   goto out;
+   }
 
retval = exec_binprm(bprm);
if (retval < 0)
@@ -1959,6 +2012,23 @@ int do_execveat(int fd, struct filename *filename,
return retval;
 }
 
+static int do_execveat(int fd, const char *filename,
+  const char __user *const __user *argv,
+  const char __user *const __user *envp, int flags)
+{
+   int lookup_flags = (flags & AT_EMPTY_PATH) ? LOOKUP_EMPTY : 0;
+   struct filename *name = getname_flags(filename, lookup_flags, NULL);
+
+   return __do_execveat(fd, name, argv, envp, NULL, NULL, flags, NULL);
+}
+
+int kernel_execveat(int fd, const char *file

[PATCH 3/6] exec: cleanup the count() function

2020-06-18 Thread Christoph Hellwig
Remove the max argument as it is hard wired to MAX_ARG_STRINGS, and
give the function a slightly less generic name.

Signed-off-by: Christoph Hellwig 
---
 fs/exec.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 4e5db0e35797a5..a5d91f8b1341d5 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -407,9 +407,9 @@ get_user_arg_ptr(const char __user *const __user *argv, int 
nr)
 }
 
 /*
- * count() counts the number of strings in array ARGV.
+ * count_strings() counts the number of strings in array ARGV.
  */
-static int count(const char __user *const __user *argv, int max)
+static int count_strings(const char __user *const __user *argv)
 {
int i = 0;
 
@@ -423,7 +423,7 @@ static int count(const char __user *const __user *argv, int 
max)
if (IS_ERR(p))
return -EFAULT;
 
-   if (i >= max)
+   if (i >= MAX_ARG_STRINGS)
return -E2BIG;
++i;
 
@@ -441,11 +441,11 @@ static int prepare_arg_pages(struct linux_binprm *bprm,
 {
unsigned long limit, ptr_size;
 
-   bprm->argc = count(argv, MAX_ARG_STRINGS);
+   bprm->argc = count_strings(argv);
if (bprm->argc < 0)
return bprm->argc;
 
-   bprm->envc = count(envp, MAX_ARG_STRINGS);
+   bprm->envc = count_strings(envp);
if (bprm->envc < 0)
return bprm->envc;
 
-- 
2.26.2



[PATCH 2/6] exec: simplify the compat syscall handling

2020-06-18 Thread Christoph Hellwig
The only difference between the compat exec* syscalls and their
native versions is the compat_ptr() sign extension, and the fact that
the pointer arithmetic for the two-dimensional arrays needs to use
the compat pointer size.  Instead of the compat wrappers and the
struct user_arg_ptr machinery, just use in_compat_syscall() to do the
right thing for the compat case deep inside get_user_arg_ptr().
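
The core of the idea, sketched (error paths and the x32 special case
trimmed; see fs/exec.c in the full patch for the real version):

static const char __user *
get_user_arg_ptr(const char __user *const __user *argv, int nr)
{
	if (in_compat_syscall()) {
		const compat_uptr_t __user *compat_argv =
			(const compat_uptr_t __user *)argv;
		compat_uptr_t compat;

		/* compat tasks: 32-bit slots, compat_ptr() sign extension */
		if (get_user(compat, compat_argv + nr))
			return ERR_PTR(-EFAULT);
		return compat_ptr(compat);
	} else {
		const char __user *native;

		/* native tasks: full-width pointer slots */
		if (get_user(native, argv + nr))
			return ERR_PTR(-EFAULT);
		return native;
	}
}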

Signed-off-by: Christoph Hellwig 
---
 arch/arm64/include/asm/unistd32.h |   4 +-
 arch/mips/kernel/syscalls/syscall_n32.tbl |   4 +-
 arch/mips/kernel/syscalls/syscall_o32.tbl |   4 +-
 arch/parisc/kernel/syscalls/syscall.tbl   |   4 +-
 arch/powerpc/kernel/syscalls/syscall.tbl  |   4 +-
 arch/s390/kernel/syscalls/syscall.tbl |   4 +-
 arch/sparc/kernel/syscalls.S  |   4 +-
 arch/x86/entry/syscall_x32.c  |   7 ++
 arch/x86/entry/syscalls/syscall_32.tbl|   4 +-
 arch/x86/entry/syscalls/syscall_64.tbl|   4 +-
 fs/exec.c | 103 --
 include/linux/compat.h|   7 --
 include/uapi/asm-generic/unistd.h |   4 +-
 tools/include/uapi/asm-generic/unistd.h   |   4 +-
 .../arch/powerpc/entry/syscalls/syscall.tbl   |   4 +-
 .../perf/arch/s390/entry/syscalls/syscall.tbl |   4 +-
 .../arch/x86/entry/syscalls/syscall_64.tbl|   4 +-
 17 files changed, 56 insertions(+), 117 deletions(-)

diff --git a/arch/arm64/include/asm/unistd32.h 
b/arch/arm64/include/asm/unistd32.h
index 6d95d0c8bf2f47..141f5d2ff1c34f 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -33,7 +33,7 @@ __SYSCALL(__NR_link, sys_link)
 #define __NR_unlink 10
 __SYSCALL(__NR_unlink, sys_unlink)
 #define __NR_execve 11
-__SYSCALL(__NR_execve, compat_sys_execve)
+__SYSCALL(__NR_execve, sys_execve)
 #define __NR_chdir 12
 __SYSCALL(__NR_chdir, sys_chdir)
/* 13 was sys_time */
@@ -785,7 +785,7 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
 #define __NR_bpf 386
 __SYSCALL(__NR_bpf, sys_bpf)
 #define __NR_execveat 387
-__SYSCALL(__NR_execveat, compat_sys_execveat)
+__SYSCALL(__NR_execveat, sys_execveat)
 #define __NR_userfaultfd 388
 __SYSCALL(__NR_userfaultfd, sys_userfaultfd)
 #define __NR_membarrier 389
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl 
b/arch/mips/kernel/syscalls/syscall_n32.tbl
index f777141f52568f..e861b5ab7179c9 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -64,7 +64,7 @@
 54 n32 getsockopt  compat_sys_getsockopt
 55 n32 clone   __sys_clone
 56 n32 fork__sys_fork
-57 n32 execve  compat_sys_execve
+57 n32 execve  sys_execve
 58 n32 exitsys_exit
 59 n32 wait4   compat_sys_wait4
 60 n32 killsys_kill
@@ -328,7 +328,7 @@
 317n32 getrandom   sys_getrandom
 318n32 memfd_createsys_memfd_create
 319n32 bpf sys_bpf
-320n32 execveatcompat_sys_execveat
+320n32 execveatsys_execveat
 321n32 userfaultfd sys_userfaultfd
 322n32 membarrier  sys_membarrier
 323n32 mlock2  sys_mlock2
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl 
b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 13280625d312e9..bba80f74e9968e 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -18,7 +18,7 @@
 8  o32 creat   sys_creat
 9  o32 linksys_link
 10 o32 unlink  sys_unlink
-11 o32 execve  sys_execve  compat_sys_execve
+11 o32 execve  sys_execve
 12 o32 chdir   sys_chdir
 13 o32 timesys_time32
 14 o32 mknod   sys_mknod
@@ -367,7 +367,7 @@
 353o32 getrandom   sys_getrandom
 354o32 memfd_createsys_memfd_create
 355o32 bpf sys_bpf
-356o32 execveatsys_execveat compat_sys_execveat
+356o32 execveatsys_execveat
 357o32 userfaultfd sys_userfaultfd
 358o32 membarrier  sys_membarrier
 359o32 mlock2  sys_mlock2
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl 
b/arch/parisc/kernel/syscalls/sysca

[PATCH 1/6] exec: cleanup the execve wrappers

2020-06-18 Thread Christoph Hellwig
Remove a whole bunch of wrappers that eventually all call
__do_execve_file, and consolidate the execve helpers to:

  (1) __do_execveat, which is the lowest level helper implementing the
      actual functionality
  (2) do_execveat, which is used by all callers that want native
      pointers
  (3) do_compat_execve, which is used by all compat syscalls

Signed-off-by: Christoph Hellwig 
---
 fs/exec.c   | 98 +++--
 include/linux/binfmts.h | 12 ++---
 init/main.c |  7 +--
 kernel/umh.c| 16 +++
 4 files changed, 41 insertions(+), 92 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index e6e8a9a7032784..354fdaa536ae7d 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1815,10 +1815,7 @@ static int exec_binprm(struct linux_binprm *bprm)
return 0;
 }
 
-/*
- * sys_execve() executes a new program.
- */
-static int __do_execve_file(int fd, struct filename *filename,
+static int __do_execveat(int fd, struct filename *filename,
struct user_arg_ptr argv,
struct user_arg_ptr envp,
int flags, struct file *file)
@@ -1972,74 +1969,16 @@ static int __do_execve_file(int fd, struct filename 
*filename,
return retval;
 }
 
-static int do_execveat_common(int fd, struct filename *filename,
- struct user_arg_ptr argv,
- struct user_arg_ptr envp,
- int flags)
-{
-   return __do_execve_file(fd, filename, argv, envp, flags, NULL);
-}
-
-int do_execve_file(struct file *file, void *__argv, void *__envp)
-{
-   struct user_arg_ptr argv = { .ptr.native = __argv };
-   struct user_arg_ptr envp = { .ptr.native = __envp };
-
-   return __do_execve_file(AT_FDCWD, NULL, argv, envp, 0, file);
-}
-
-int do_execve(struct filename *filename,
-   const char __user *const __user *__argv,
-   const char __user *const __user *__envp)
-{
-   struct user_arg_ptr argv = { .ptr.native = __argv };
-   struct user_arg_ptr envp = { .ptr.native = __envp };
-   return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
-}
-
 int do_execveat(int fd, struct filename *filename,
const char __user *const __user *__argv,
const char __user *const __user *__envp,
-   int flags)
+   int flags, struct file *file)
 {
struct user_arg_ptr argv = { .ptr.native = __argv };
struct user_arg_ptr envp = { .ptr.native = __envp };
 
-   return do_execveat_common(fd, filename, argv, envp, flags);
-}
-
-#ifdef CONFIG_COMPAT
-static int compat_do_execve(struct filename *filename,
-   const compat_uptr_t __user *__argv,
-   const compat_uptr_t __user *__envp)
-{
-   struct user_arg_ptr argv = {
-   .is_compat = true,
-   .ptr.compat = __argv,
-   };
-   struct user_arg_ptr envp = {
-   .is_compat = true,
-   .ptr.compat = __envp,
-   };
-   return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
-}
-
-static int compat_do_execveat(int fd, struct filename *filename,
- const compat_uptr_t __user *__argv,
- const compat_uptr_t __user *__envp,
- int flags)
-{
-   struct user_arg_ptr argv = {
-   .is_compat = true,
-   .ptr.compat = __argv,
-   };
-   struct user_arg_ptr envp = {
-   .is_compat = true,
-   .ptr.compat = __envp,
-   };
-   return do_execveat_common(fd, filename, argv, envp, flags);
+   return __do_execveat(fd, filename, argv, envp, flags, file);
 }
-#endif
 
 void set_binfmt(struct linux_binfmt *new)
 {
@@ -2070,7 +2009,7 @@ SYSCALL_DEFINE3(execve,
const char __user *const __user *, argv,
const char __user *const __user *, envp)
 {
-   return do_execve(getname(filename), argv, envp);
+   return do_execveat(AT_FDCWD, getname(filename), argv, envp, 0, NULL);
 }
 
 SYSCALL_DEFINE5(execveat,
@@ -2080,18 +2019,34 @@ SYSCALL_DEFINE5(execveat,
int, flags)
 {
int lookup_flags = (flags & AT_EMPTY_PATH) ? LOOKUP_EMPTY : 0;
+   struct filename *name = getname_flags(filename, lookup_flags, NULL);
 
-   return do_execveat(fd,
-  getname_flags(filename, lookup_flags, NULL),
-  argv, envp, flags);
+   return do_execveat(fd, name, argv, envp, flags, NULL);
 }
 
 #ifdef CONFIG_COMPAT
+static int do_compat_execve(int fd, struct filename *filename,
+   const compat_uptr_t __user *__argv,
+   const compat_uptr_t __user *__envp,
+   int flags)
+{
+   struct user_arg_ptr argv = {
+   .is_compat = true,
+   .ptr.compat = __argv,
+   };
+   struct user_arg_ptr envp = {
+   .is_compat = true,
+  

properly support exec and wait with kernel pointers v2

2020-06-18 Thread Christoph Hellwig
Hi all,

this series first cleans up the exec code and then adds proper
kernel_execveat and kernel_wait callers instead of relying on the fact
that the early init code and kernel threads implicitly run with
the address limit set to KERNEL_DS.

Note that the cleanup removes the compat execve(at) handlers entirely, as
we can handle the compat difference very nicely in a unified codebase.
x32 needs two hacky #defines for that for now, although those can go
away if the x32 syscall rework from Brian gets merged.

Changes since v1:
 - remove a pointless ifdef from get_user_arg_ptr
 - remove the need for a compat syscall handler for x32


Diffstat:
 arch/arm64/include/asm/unistd32.h  |4 
 arch/mips/kernel/syscalls/syscall_n32.tbl  |4 
 arch/mips/kernel/syscalls/syscall_o32.tbl  |4 
 arch/parisc/kernel/syscalls/syscall.tbl|4 
 arch/powerpc/kernel/syscalls/syscall.tbl   |4 
 arch/s390/kernel/syscalls/syscall.tbl  |4 
 arch/sparc/kernel/syscalls.S   |4 
 arch/x86/entry/syscall_x32.c   |7 
 arch/x86/entry/syscalls/syscall_32.tbl |4 
 arch/x86/entry/syscalls/syscall_64.tbl |4 
 fs/exec.c  |  248 -
 include/linux/binfmts.h|   10 
 include/linux/compat.h |7 
 include/linux/sched/task.h |1 
 include/uapi/asm-generic/unistd.h  |4 
 init/main.c|5 
 kernel/exit.c  |   16 +
 kernel/umh.c   |   43 ---
 tools/include/uapi/asm-generic/unistd.h|4 
 tools/perf/arch/powerpc/entry/syscalls/syscall.tbl |4 
 tools/perf/arch/s390/entry/syscalls/syscall.tbl|4 
 tools/perf/arch/x86/entry/syscalls/syscall_64.tbl  |4 
 22 files changed, 170 insertions(+), 223 deletions(-)


[PATCH] mm/debug_vm_pgtable: Fix build failure with powerpc 8xx

2020-06-18 Thread Christophe Leroy
Since commit 9e343b467c70 ("READ_ONCE: Enforce atomicity for
{READ,WRITE}_ONCE() memory accesses"), READ_ONCE() cannot be used
anymore to read complex page table entries. This leads to:

  CC  mm/debug_vm_pgtable.o
In file included from ./include/asm-generic/bug.h:5,
 from ./arch/powerpc/include/asm/bug.h:109,
 from ./include/linux/bug.h:5,
 from ./include/linux/mmdebug.h:5,
 from ./include/linux/gfp.h:5,
 from mm/debug_vm_pgtable.c:13:
In function 'pte_clear_tests',
inlined from 'debug_vm_pgtable' at mm/debug_vm_pgtable.c:363:2:
./include/linux/compiler.h:392:38: error: call to '__compiletime_assert_210' 
declared with attribute error: Unsupported access size for {READ,WRITE}_ONCE().
  392 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
  |  ^
./include/linux/compiler.h:373:4: note: in definition of macro 
'__compiletime_assert'
  373 |prefix ## suffix();\
  |^~
./include/linux/compiler.h:392:2: note: in expansion of macro 
'_compiletime_assert'
  392 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
  |  ^~~
./include/linux/compiler.h:405:2: note: in expansion of macro 
'compiletime_assert'
  405 |  compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), 
\
  |  ^~
./include/linux/compiler.h:291:2: note: in expansion of macro 
'compiletime_assert_rwonce_type'
  291 |  compiletime_assert_rwonce_type(x);\
  |  ^~
mm/debug_vm_pgtable.c:249:14: note: in expansion of macro 'READ_ONCE'
  249 |  pte_t pte = READ_ONCE(*ptep);
  |  ^
make[2]: *** [mm/debug_vm_pgtable.o] Error 1

Fix it by using the recently added ptep_get() helper.
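
For reference, the generic fallback added by that series (in
include/linux/pgtable.h) is simply a READ_ONCE() wrapper that
architectures may override:

#ifndef __HAVE_ARCH_PTEP_GET
static inline pte_t ptep_get(pte_t *ptep)
{
	return READ_ONCE(*ptep);
}
#endif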

Fixes: 9e343b467c70 ("READ_ONCE: Enforce atomicity for {READ,WRITE}_ONCE() 
memory accesses")
Signed-off-by: Christophe Leroy 
---
 mm/debug_vm_pgtable.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index e45623016aea..61ab16fb2e36 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -246,13 +246,13 @@ static void __init pgd_populate_tests(struct mm_struct 
*mm, pgd_t *pgdp,
 static void __init pte_clear_tests(struct mm_struct *mm, pte_t *ptep,
   unsigned long vaddr)
 {
-   pte_t pte = READ_ONCE(*ptep);
+   pte_t pte = ptep_get(ptep);
 
pte = __pte(pte_val(pte) | RANDOM_ORVALUE);
set_pte_at(mm, vaddr, ptep, pte);
barrier();
pte_clear(mm, vaddr, ptep);
-   pte = READ_ONCE(*ptep);
+   pte = ptep_get(ptep);
WARN_ON(!pte_none(pte));
 }
 
-- 
2.25.0



Re: [PATCH 3/3] powerpc/8xx: Provide ptep_get() with 16k pages

2020-06-18 Thread Christophe Leroy




On 18/06/2020 at 02:58, Michael Ellerman wrote:

Peter Zijlstra  writes:

On Thu, Jun 18, 2020 at 12:21:22AM +1000, Michael Ellerman wrote:

Peter Zijlstra  writes:

On Mon, Jun 15, 2020 at 12:57:59PM +, Christophe Leroy wrote:



+#if defined(CONFIG_PPC_8xx) && defined(CONFIG_PPC_16K_PAGES)
+#define __HAVE_ARCH_PTEP_GET
+static inline pte_t ptep_get(pte_t *ptep)
+{
+   pte_t pte = {READ_ONCE(ptep->pte), 0, 0, 0};
+
+   return pte;
+}
+#endif


Would it make sense to have a comment with this magic? The casual reader
might wonder WTH just happened when he stumbles on this :-)


I tried writing a helpful comment but it's too late for my brain to form
sensible sentences.

Christophe can you send a follow-up with a comment explaining it? In
particular the zero entries stand out, it's kind of subtle that those
entries are only populated with the right value when we write to the
page table.


static inline pte_t ptep_get(pte_t *ptep)
{
unsigned long val = READ_ONCE(ptep->pte);
/* 16K pages have 4 identical value 4K entries */
pte_t pte = {val, val, val, val};
return pte;
}

Maybe something like that?


I think val wants to be pte_basic_t, but otherwise yeah I like that much
better.



I sent a patch for that.

I'll also send one to fix mm/debug_vm_pgtable.c which also uses 
READ_ONCE() to access page table entries.


Christophe


Re: [PATCH 3/3] powerpc/8xx: Provide ptep_get() with 16k pages

2020-06-18 Thread Christophe Leroy




On 18/06/2020 at 03:00, Michael Ellerman wrote:

Christophe Leroy  writes:

On 17/06/2020 at 16:38, Peter Zijlstra wrote:

On Thu, Jun 18, 2020 at 12:21:22AM +1000, Michael Ellerman wrote:

Peter Zijlstra  writes:

On Mon, Jun 15, 2020 at 12:57:59PM +, Christophe Leroy wrote:



+#if defined(CONFIG_PPC_8xx) && defined(CONFIG_PPC_16K_PAGES)
+#define __HAVE_ARCH_PTEP_GET
+static inline pte_t ptep_get(pte_t *ptep)
+{
+   pte_t pte = {READ_ONCE(ptep->pte), 0, 0, 0};
+
+   return pte;
+}
+#endif


Would it make sense to have a comment with this magic? The casual reader
might wonder WTH just happened when he stumbles on this :-)


I tried writing a helpful comment but it's too late for my brain to form
sensible sentences.

Christophe can you send a follow-up with a comment explaining it? In
particular the zero entries stand out, it's kind of subtle that those
entries are only populated with the right value when we write to the
page table.


static inline pte_t ptep_get(pte_t *ptep)
{
unsigned long val = READ_ONCE(ptep->pte);
/* 16K pages have 4 identical value 4K entries */
pte_t pte = {val, val, val, val};
return pte;
}

Maybe something like that?


This should work as well. Indeed nobody cares about what's in the other
three. They are only there to ensure that ptep++ increases the ptep
pointer by 16 bytes. Only the HW requires 4 identical values; that's 
taken care of in set_pte_at() and pte_update().
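
(For context, sketched from the 8xx headers, so treat the exact field
names as assumptions: with CONFIG_PPC_16K_PAGES one pte_t covers four
identical 4k hardware entries, so sizeof(pte_t) == 16 and ptep++
advances 16 bytes.)

typedef struct { pte_basic_t pte, pte1, pte2, pte3; } pte_t;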


Right, but it seems less error-prone to have the in-memory
representation match what we have in the page table (well that's
in-memory too but you know what I mean).


So we should use whichever is most efficient. Thinking about it once
more, maybe what you propose is the most efficient, as there is no need
to load another register with value 0 in order to write it to the stack.


On 64-bit I'd say it makes zero difference, the only thing that's going
to matter is the load from ptep->pte. I don't know whether that's true
on the 8xx cores though.


On the 8xx core, loading a register with value 0 will take one cycle 
unless there is some bubble left by another instruction (like a load 
from memory or a taken branch). But that's in the noise.


Christophe


Re: [PATCH v2 0/2] powerpc/pci: unmap interrupts when a PHB is removed

2020-06-18 Thread Cédric Le Goater
On 6/17/20 6:29 PM, Cédric Le Goater wrote:
> Hello,
> 
> When a passthrough IO adapter is removed from a pseries machine using
> hash MMU and the XIVE interrupt mode, the POWER hypervisor expects the
> guest OS to clear all page table entries related to the adapter. If
> some are still present, the RTAS call which isolates the PCI slot
> returns error 9001 "valid outstanding translations" and the removal of
> the IO adapter fails. This is because when the PHBs are scanned, Linux
> maps automatically some interrupts in the Linux interrupt number space
> but these are never removed.
> 
> To solve this problem, we introduce a PPC platform specific
> pcibios_remove_bus() routine which clears all interrupt mappings when
> the bus is removed. This also clears the associated page table entries
> of the ESB pages when using XIVE.
> 
> For this purpose, we record the logical interrupt numbers of the
> mapped interrupt under the PHB structure and let pcibios_remove_bus()
> do the clean up.
> 
> Tested on :
> 
>   - PowerNV with PCI, OpenCAPI, CAPI and GPU adapters. I don't know
> how to inject a failure on a PHB but that would be a good test.

I found out that powering down the slot is enough:

echo 0 > /sys/bus/pci/slots//power

The IRQ cleanup is done as expected on baremetal also.

Cheers,

C. 

>   - KVM P8+P9 guests with passthrough PCI adapters, but PHBs can not
> be removed under QEMU/KVM.   
>   - PowerVM with passthrough PCI adapters (main target)
>   
> Thanks,
> 
> C.
> 
> Changes since v1:
> 
>  - extended the removal to interrupts other than the legacy INTx.
> 
> Cédric Le Goater (2):
>   powerpc/pci: unmap legacy INTx interrupts when a PHB is removed
>   powerpc/pci: unmap all interrupts when a PHB is removed
> 
>  arch/powerpc/include/asm/pci-bridge.h |   6 ++
>  arch/powerpc/kernel/pci-common.c  | 114 ++
>  2 files changed, 120 insertions(+)
> 



Re: [PATCH v2 2/4] KVM: PPC: Book3S HV: track the state GFNs associated with secure VMs

2020-06-18 Thread Laurent Dufour

On 18/06/2020 at 11:19, Ram Pai wrote:

During the life of SVM, its GFNs transition through normal, secure and
shared states. Since the kernel does not track GFNs that are shared, it
is not possible to disambiguate a shared GFN from a GFN whose PFN has
not yet been migrated to a secure-PFN. Also it is not possible to
disambiguate a secure-GFN from a GFN whose PFN has been paged out from
the ultravisor.

The ability to identify the state of a GFN is needed to skip migration of its
PFN to secure-PFN during ESM transition.

The code is re-organized to track the states of a GFN as explained
below.


  1. States of a GFN
 ---
  The GFN can be in one of the following states.

  (a) Secure - The GFN is secure. The GFN is associated with
a Secure VM; the contents of the GFN are not accessible
to the Hypervisor.  This GFN can be backed by a secure-PFN,
or can be backed by a normal-PFN with contents encrypted.
The former is true when the GFN is paged-in into the
ultravisor. The latter is true when the GFN is paged-out
of the ultravisor.

  (b) Shared - The GFN is shared. The GFN is associated with
a secure VM. The contents of the GFN are accessible to
the Hypervisor. This GFN is backed by a normal-PFN and its
contents are unencrypted.

  (c) Normal - The GFN is normal. The GFN is associated with
a normal VM. The contents of the GFN are accessible to
the Hypervisor. Its contents are never encrypted.

  2. States of a VM.
 ---

  (a) Normal VM:  A VM whose contents are always accessible to
the hypervisor.  All its GFNs are normal-GFNs.

  (b) Secure VM: A VM whose contents are not accessible to the
hypervisor without the VM's consent.  Its GFNs are
either Shared-GFN or Secure-GFNs.

  (c) Transient VM: A Normal VM that is transitioning to a secure VM.
The transition starts on successful return of
H_SVM_INIT_START, and ends on successful return
of H_SVM_INIT_DONE. This transient VM can have GFNs
in any of the three states; i.e. Secure-GFN, Shared-GFN,
and Normal-GFN. The VM never executes in this state
in supervisor-mode.

  3. Memory slot State.
 --
The state of a memory slot mirrors the state of the
VM the memory slot is associated with.

  4. VM State transition.
 

   A VM always starts in Normal Mode.

   H_SVM_INIT_START moves the VM into transient state. During this
   time the Ultravisor may request some of its GFNs to be shared or
   secured. So its GFNs can be in one of the three GFN states.

   H_SVM_INIT_DONE moves the VM entirely from transient state to
   secure-state. At this point any left-over normal-GFNs are
   transitioned to Secure-GFN.

   H_SVM_INIT_ABORT moves the transient VM back to normal VM.
   All its GFNs are moved to Normal-GFNs.

   UV_TERMINATE transitions the secure-VM back to normal-VM. All
   the secure-GFNs and shared-GFNs are transitioned to normal-GFNs.
   Note: The contents of the normal-GFNs are undefined at this point.

  5. GFN state implementation:
 -

  Secure GFN is associated with a secure-PFN; also called uvmem_pfn,
  when the GFN is paged-in. Its pfn[] has KVMPPC_GFN_UVMEM_PFN flag
  set, and contains the value of the secure-PFN.
  It is associated with a normal-PFN; also called mem_pfn, when
  the GFN is pagedout. Its pfn[] has KVMPPC_GFN_MEM_PFN flag set.
  The value of the normal-PFN is not tracked.

  Shared GFN is associated with a normal-PFN. Its pfn[] has
  KVMPPC_UVMEM_SHARED_PFN flag set. The value of the normal-PFN
  is not tracked.

  Normal GFN is associated with normal-PFN. Its pfn[] has
  no flag set. The value of the normal-PFN is not tracked.
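
  A sketch of the pfns[] encoding just described (the bit positions
  here are illustrative assumptions; the patch defines the real values):

/* High bits of the memslot's pfns[] array encode the GFN state. */
#define KVMPPC_GFN_UVMEM_PFN	(1UL << 63)	/* secure, paged-in  */
#define KVMPPC_GFN_MEM_PFN	(1UL << 62)	/* secure, paged-out */
#define KVMPPC_UVMEM_SHARED_PFN	(1UL << 61)	/* shared            */
#define KVMPPC_GFN_FLAG_MASK	(KVMPPC_GFN_UVMEM_PFN | \
				 KVMPPC_GFN_MEM_PFN | \
				 KVMPPC_UVMEM_SHARED_PFN)

/* Only the secure, paged-in state carries a meaningful PFN value. */
static inline bool gfn_is_uvmem_pfn(unsigned long entry,
				    unsigned long *uvmem_pfn)
{
	if (!(entry & KVMPPC_GFN_UVMEM_PFN))
		return false;
	if (uvmem_pfn)
		*uvmem_pfn = entry & ~KVMPPC_GFN_FLAG_MASK;
	return true;
}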

  6. Life cycle of a GFN
     ------------------
  --------------------------------------------------------------
  |        |   Share   |  Unshare  |    SVM    |H_SVM_INIT_DONE|
  |        | operation | operation |  abort/   |               |
  |        |           |           | terminate |               |
  --------------------------------------------------------------
  | Secure |  Shared   |  Secure   |  Normal   |    Secure     |
  | Shared |  Shared   |  Secure   |  Normal   |    Shared     |
  | Normal |  Shared   |  Secure   |  Normal   |    Secure     |
  --------------------------------------------------------------

  7. Life cycle of a VM
     ------------------
  ---------------------------------------------------------------------
  |         | start |   H_SVM_   |  H_SVM_   |   H_SVM_   |  UV_SVM_  |
  |         |  VM   | INIT_START | INIT_DONE | INIT_ABORT | TERMINATE |
  ---------------------------------------------------------------------

Re: [PATCH] powerpc/8xx: use pmd_off() to access a PMD entry in pte_update()

2020-06-18 Thread Michael Ellerman
On Mon, 15 Jun 2020 12:22:29 +0300, Mike Rapoport wrote:
> The pte_update() implementation for PPC_8xx unfolds page table from the PGD
> level to access a PMD entry. Since 8xx has only 2-level page table this can
> be simplified with pmd_off() shortcut.
> 
> Replace explicit unfolding with pmd_off() and drop defines of pgd_index()
> and pgd_offset() that are no longer needed.

Applied to powerpc/fixes.

[1/1] powerpc/8xx: use pmd_off() to access a PMD entry in pte_update()
  https://git.kernel.org/powerpc/c/687993ccf3b05070598b89fad97410b26d7bc9d2

cheers


Re: [PATCH] powerpc/64s: Fix KVM interrupt using wrong save area

2020-06-18 Thread Michael Ellerman
On Mon, 15 Jun 2020 16:12:47 +1000, Nicholas Piggin wrote:
> The CTR register reload in the KVM interrupt path used the wrong save
> area for SLB (and NMI) interrupts.

Applied to powerpc/fixes.

[1/1] powerpc/64s: Fix KVM interrupt using wrong save area
  https://git.kernel.org/powerpc/c/0bdcfa182506526fbe4e088ff9ca86a31b81828d

cheers


Re: [PATCH 1/2] powerpc/syscalls: Use the number when building SPU syscall table

2020-06-18 Thread Michael Ellerman
On Tue, 16 Jun 2020 23:56:16 +1000, Michael Ellerman wrote:
> Currently the macro that inserts entries into the SPU syscall table
> doesn't actually use the "nr" (syscall number) parameter.
> 
> This does work, but it relies on the exact right number of syscall
> entries being emitted in order for the syscall numbers to line up with
> the array entries. If for example we had two entries with the same
> syscall number we wouldn't get an error, it would just cause all
> subsequent syscalls to be off by one in the spu_syscall_table.
> 
> [...]

Applied to powerpc/fixes.

[1/2] powerpc/syscalls: Use the number when building SPU syscall table
  https://git.kernel.org/powerpc/c/1497eea68624f6076bf3eaf66baec3771ea04045
[2/2] powerpc/syscalls: Split SPU-ness out of ABI
  https://git.kernel.org/powerpc/c/35e32a6cb5f694fda54a5f391917e4ceefa0fece

cheers


Re: [PATCH 0/3] Fix build failure with v5.8-rc1

2020-06-18 Thread Michael Ellerman
On Mon, 15 Jun 2020 12:57:55 + (UTC), Christophe Leroy wrote:
> Commit 2ab3a0a02905 ("READ_ONCE: Enforce atomicity for
> {READ,WRITE}_ONCE() memory accesses") leads to following build
> failure on powerpc 8xx.
> 
> To fix it, this small series introduces a new helper named ptep_get()
> to replace the direct access with READ_ONCE(). This new helper
> can be overriden by architectures.
> 
> [...]

Applied to powerpc/fixes.

[1/3] mm/gup: Use huge_ptep_get() in gup_hugepte()
  https://git.kernel.org/powerpc/c/01a80ec6495f9e43f61b3231f3b283ca050a800e
[2/3] mm: Allow arches to provide ptep_get()
  https://git.kernel.org/powerpc/c/f7583fd6bdcc4d0b43f68fb81ebfae9669ee9338
[3/3] powerpc/8xx: Provide ptep_get() with 16k pages
  https://git.kernel.org/powerpc/c/b55129f97aeefd265314e12d98935330e011a14a

cheers


Re: [PATCH v2 1/4] powerpc/instruction_dump: Fix kernel crash with show_instructions

2020-06-18 Thread Michael Ellerman
On Sun, 24 May 2020 15:08:19 +0530, Aneesh Kumar K.V wrote:
> With Hard Lockup watchdog, we can hit a BUG() if we take a watchdog
> interrupt when in OPAL mode. This happens in show_instructions()
> where the kernel takes the watchdog NMI IPI with MSR_IR == 0.
> With that show_instructions() updates the variable pc in the loop
> and the second iterations will result in BUG().
> 
> We hit the BUG_ON due to the below check in __va()
> 
> [...]

Patch 1 applied to powerpc/fixes.

[1/4] powerpc: Fix kernel crash in show_instructions() w/DEBUG_VIRTUAL
  https://git.kernel.org/powerpc/c/a6e2c226c3d51fd93636320e47cabc8a8f0824c5

cheers


[PATCH 2/2] powerpc/hv-24x7: Add sysfs files inside hv-24x7 device to show cpumask

2020-06-18 Thread Kajol Jain
This patch adds a cpumask attribute to the hv_24x7 PMU, along with ABI documentation.

command:# cat /sys/devices/hv_24x7/cpumask
0

Signed-off-by: Kajol Jain 
---
 .../sysfs-bus-event_source-devices-hv_24x7|  6 
 arch/powerpc/perf/hv-24x7.c   | 31 ++-
 2 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7 
b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7
index e8698afcd952..281e7b367733 100644
--- a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7
+++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7
@@ -43,6 +43,12 @@ Description: read only
This sysfs interface exposes the number of cores per chip
present in the system.
 
+What:  /sys/devices/hv_24x7/cpumask
+Date:  June 2020
+Contact:   Linux on PowerPC Developer List 
+Description:   read only
+   This sysfs file exposes the cpumask of the CPU designated to collect hv_24x7 event count data.
+
 What:  /sys/bus/event_source/devices/hv_24x7/event_descs/
 Date:  February 2014
 Contact:   Linux on PowerPC Developer List 
diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c
index fdc4ae155d60..03d870a9fc36 100644
--- a/arch/powerpc/perf/hv-24x7.c
+++ b/arch/powerpc/perf/hv-24x7.c
@@ -448,6 +448,12 @@ static ssize_t device_show_string(struct device *dev,
return sprintf(buf, "%s\n", (char *)d->var);
 }
 
+static ssize_t cpumask_get_attr(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   return cpumap_print_to_pagebuf(true, buf, &hv_24x7_cpumask);
+}
+
 static ssize_t sockets_show(struct device *dev,
struct device_attribute *attr, char *buf)
 {
@@ -1116,6 +1122,17 @@ static DEVICE_ATTR_RO(sockets);
 static DEVICE_ATTR_RO(chipspersocket);
 static DEVICE_ATTR_RO(coresperchip);
 
+static DEVICE_ATTR(cpumask, S_IRUGO, cpumask_get_attr, NULL);
+
+static struct attribute *cpumask_attrs[] = {
+   &dev_attr_cpumask.attr,
+   NULL,
+};
+
+static struct attribute_group cpumask_attr_group = {
+   .attrs = cpumask_attrs,
+};
+
 static struct bin_attribute *if_bin_attrs[] = {
&bin_attr_catalog,
NULL,
@@ -1143,6 +1160,11 @@ static const struct attribute_group *attr_groups[] = {
&event_desc_group,
&event_long_desc_group,
&if_group,
+   /*
+* This NULL is a placeholder for the cpumask attr, which is filled in
+* only if cpu hotplug registration is successful
+*/
+   NULL,
NULL,
 };
 
@@ -1727,8 +1749,15 @@ static int hv_24x7_init(void)
 
/* init cpuhotplug */
r = hv_24x7_cpu_hotplug_init();
-   if (r)
+   if (r) {
pr_err("hv_24x7: CPU hotplug init failed\n");
+   } else {
+   /*
+* Cpu hotplug init is successful, add the
+* cpumask file as part of pmu attr group
+*/
+   attr_groups[5] = &cpumask_attr_group;
+   }
 
r = perf_pmu_register(&h_24x7_pmu, h_24x7_pmu.name, -1);
if (r)
-- 
2.18.2



[PATCH 1/2] powerpc/perf/hv-24x7: Add cpu hotplug support

2020-06-18 Thread Kajol Jain
This patch adds cpu hotplug functions to the hv_24x7 pmu.
A new cpuhp_state enum, "CPUHP_AP_PERF_POWERPC_HV_24x7_ONLINE",
is added.

The online function updates the cpumask only if it is empty, since
the primary intention of adding hotplug support is to designate a
CPU to make the HCALL that collects the count data.

The offline function tests and clears the exiting cpu in the cpumask
and migrates event collection to any other active cpu.

With this patchset, the perf tool side no longer needs a "-C <cpu>"
option to be added.

Signed-off-by: Kajol Jain 
---
 arch/powerpc/perf/hv-24x7.c | 45 +
 include/linux/cpuhotplug.h  |  1 +
 2 files changed, 46 insertions(+)

diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c
index db213eb7cb02..fdc4ae155d60 100644
--- a/arch/powerpc/perf/hv-24x7.c
+++ b/arch/powerpc/perf/hv-24x7.c
@@ -31,6 +31,8 @@ static int interface_version;
 /* Whether we have to aggregate result data for some domains. */
 static bool aggregate_result_elements;
 
+static cpumask_t hv_24x7_cpumask;
+
 static bool domain_is_valid(unsigned domain)
 {
switch (domain) {
@@ -1641,6 +1643,44 @@ static struct pmu h_24x7_pmu = {
.capabilities = PERF_PMU_CAP_NO_EXCLUDE,
 };
 
+static int ppc_hv_24x7_cpu_online(unsigned int cpu)
+{
+   /* Make this CPU the designated target for counter collection */
+   if (cpumask_empty(&hv_24x7_cpumask))
+   cpumask_set_cpu(cpu, &hv_24x7_cpumask);
+
+   return 0;
+}
+
+static int ppc_hv_24x7_cpu_offline(unsigned int cpu)
+{
+   int target = -1;
+
+   /* Check if exiting cpu is used for collecting 24x7 events */
+   if (!cpumask_test_and_clear_cpu(cpu, &hv_24x7_cpumask))
+   return 0;
+
+   /* Find a new cpu to collect 24x7 events */
+   target = cpumask_any_but(cpu_active_mask, cpu);
+
+   if (target < 0 || target >= nr_cpu_ids)
+   return -1;
+
+   /* Migrate 24x7 events to the new target */
+   cpumask_set_cpu(target, &hv_24x7_cpumask);
+   perf_pmu_migrate_context(&h_24x7_pmu, cpu, target);
+
+   return 0;
+}
+
+static int hv_24x7_cpu_hotplug_init(void)
+{
+   return cpuhp_setup_state(CPUHP_AP_PERF_POWERPC_HV_24x7_ONLINE,
+ "perf/powerpc/hv_24x7:online",
+ ppc_hv_24x7_cpu_online,
+ ppc_hv_24x7_cpu_offline);
+}
+
 static int hv_24x7_init(void)
 {
int r;
@@ -1685,6 +1725,11 @@ static int hv_24x7_init(void)
if (r)
return r;
 
+   /* init cpuhotplug */
+   r = hv_24x7_cpu_hotplug_init();
+   if (r)
+   pr_err("hv_24x7: CPU hotplug init failed\n");
+
r = perf_pmu_register(&h_24x7_pmu, h_24x7_pmu.name, -1);
if (r)
return r;
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 8377afef8806..16ed8f6f8774 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -180,6 +180,7 @@ enum cpuhp_state {
CPUHP_AP_PERF_POWERPC_CORE_IMC_ONLINE,
CPUHP_AP_PERF_POWERPC_THREAD_IMC_ONLINE,
CPUHP_AP_PERF_POWERPC_TRACE_IMC_ONLINE,
+   CPUHP_AP_PERF_POWERPC_HV_24x7_ONLINE,
CPUHP_AP_WATCHDOG_ONLINE,
CPUHP_AP_WORKQUEUE_ONLINE,
CPUHP_AP_RCUTREE_ONLINE,
-- 
2.18.2



[PATCH 0/2] Add cpu hotplug support for powerpc/perf/hv-24x7

2020-06-18 Thread Kajol Jain
This patchset adds cpu hotplug support to the hv_24x7 driver by adding
online/offline cpu hotplug functions. It also adds a sysfs file,
"cpumask", to expose the current online CPU that can be used for
hv_24x7 event counting.

Kajol Jain (2):
  powerpc/perf/hv-24x7: Add cpu hotplug support
  powerpc/hv-24x7: Add sysfs files inside hv-24x7 device to show cpumask

 .../sysfs-bus-event_source-devices-hv_24x7|  6 ++
 arch/powerpc/perf/hv-24x7.c   | 74 +++
 include/linux/cpuhotplug.h|  1 +
 3 files changed, 81 insertions(+)

-- 
2.18.2



[PATCH] powerpc/8xx: Modify ptep_get()

2020-06-18 Thread Christophe Leroy
Move ptep_get() close to pte_update(), in an ifdef section already
dedicated to powerpc 8xx. This section contains an explanation of
the layout of page table entries.

Also modify it to return the pte value repeated 4 times instead of
padding with zeroes.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/nohash/32/pgtable.h | 22 +++-
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h 
b/arch/powerpc/include/asm/nohash/32/pgtable.h
index b0afbdd07740..b9e134d0f03a 100644
--- a/arch/powerpc/include/asm/nohash/32/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
@@ -249,6 +249,18 @@ static inline pte_basic_t pte_update(struct mm_struct *mm, 
unsigned long addr, p
 
return old;
 }
+
+#ifdef CONFIG_PPC_16K_PAGES
+#define __HAVE_ARCH_PTEP_GET
+static inline pte_t ptep_get(pte_t *ptep)
+{
+   pte_basic_t val = READ_ONCE(ptep->pte);
+   pte_t pte = {val, val, val, val};
+
+   return pte;
+}
+#endif /* CONFIG_PPC_16K_PAGES */
+
 #else
 static inline pte_basic_t pte_update(struct mm_struct *mm, unsigned long addr, 
pte_t *p,
 unsigned long clr, unsigned long set, int 
huge)
@@ -284,16 +296,6 @@ static inline pte_t ptep_get_and_clear(struct mm_struct 
*mm, unsigned long addr,
return __pte(pte_update(mm, addr, ptep, ~0, 0, 0));
 }
 
-#if defined(CONFIG_PPC_8xx) && defined(CONFIG_PPC_16K_PAGES)
-#define __HAVE_ARCH_PTEP_GET
-static inline pte_t ptep_get(pte_t *ptep)
-{
-   pte_t pte = {READ_ONCE(ptep->pte), 0, 0, 0};
-
-   return pte;
-}
-#endif
-
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addr,
  pte_t *ptep)
-- 
2.25.0



[PATCH] ASoC: fsl_spdif: Add pm runtime function

2020-06-18 Thread Shengjiu Wang
Add pm runtime support and move clock handling into the runtime PM
callbacks. Disable the clocks at suspend to reduce power consumption.

fsl_spdif_suspend is replaced by pm_runtime_force_suspend.
fsl_spdif_resume is replaced by pm_runtime_force_resume.
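
The shape of the runtime PM pair this patch adds, sketched (the real
callbacks also handle the per-rate tx clocks and are simplified here):

static int fsl_spdif_runtime_suspend(struct device *dev)
{
	struct fsl_spdif_priv *spdif_priv = dev_get_drvdata(dev);

	/* stop touching hardware registers, then gate the clocks */
	regcache_cache_only(spdif_priv->regmap, true);
	clk_disable_unprepare(spdif_priv->rxclk);
	if (!IS_ERR(spdif_priv->spbaclk))
		clk_disable_unprepare(spdif_priv->spbaclk);
	clk_disable_unprepare(spdif_priv->coreclk);

	return 0;
}

static int fsl_spdif_runtime_resume(struct device *dev)
{
	struct fsl_spdif_priv *spdif_priv = dev_get_drvdata(dev);
	int ret;

	ret = clk_prepare_enable(spdif_priv->coreclk);
	if (ret)
		return ret;

	if (!IS_ERR(spdif_priv->spbaclk)) {
		ret = clk_prepare_enable(spdif_priv->spbaclk);
		if (ret)
			goto disable_core;
	}

	ret = clk_prepare_enable(spdif_priv->rxclk);
	if (ret)
		goto disable_spba;

	/* clocks are back: replay the register cache into the hardware */
	regcache_cache_only(spdif_priv->regmap, false);
	regcache_mark_dirty(spdif_priv->regmap);
	return regcache_sync(spdif_priv->regmap);

disable_spba:
	if (!IS_ERR(spdif_priv->spbaclk))
		clk_disable_unprepare(spdif_priv->spbaclk);
disable_core:
	clk_disable_unprepare(spdif_priv->coreclk);
	return ret;
}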

Signed-off-by: Shengjiu Wang 
---
 sound/soc/fsl/fsl_spdif.c | 113 ++
 1 file changed, 67 insertions(+), 46 deletions(-)

diff --git a/sound/soc/fsl/fsl_spdif.c b/sound/soc/fsl/fsl_spdif.c
index 5bc0e4729341..46719fd2f1ec 100644
--- a/sound/soc/fsl/fsl_spdif.c
+++ b/sound/soc/fsl/fsl_spdif.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -495,25 +496,10 @@ static int fsl_spdif_startup(struct snd_pcm_substream 
*substream,
struct platform_device *pdev = spdif_priv->pdev;
struct regmap *regmap = spdif_priv->regmap;
u32 scr, mask;
-   int i;
int ret;
 
/* Reset module and interrupts only for first initialization */
if (!snd_soc_dai_active(cpu_dai)) {
-   ret = clk_prepare_enable(spdif_priv->coreclk);
-   if (ret) {
-   dev_err(&pdev->dev, "failed to enable core clock\n");
-   return ret;
-   }
-
-   if (!IS_ERR(spdif_priv->spbaclk)) {
-   ret = clk_prepare_enable(spdif_priv->spbaclk);
-   if (ret) {
-   dev_err(&pdev->dev, "failed to enable spba 
clock\n");
-   goto err_spbaclk;
-   }
-   }
-
ret = spdif_softreset(spdif_priv);
if (ret) {
dev_err(&pdev->dev, "failed to soft reset\n");
@@ -531,18 +517,10 @@ static int fsl_spdif_startup(struct snd_pcm_substream 
*substream,
mask = SCR_TXFIFO_AUTOSYNC_MASK | SCR_TXFIFO_CTRL_MASK |
SCR_TXSEL_MASK | SCR_USRC_SEL_MASK |
SCR_TXFIFO_FSEL_MASK;
-   for (i = 0; i < SPDIF_TXRATE_MAX; i++) {
-   ret = clk_prepare_enable(spdif_priv->txclk[i]);
-   if (ret)
-   goto disable_txclk;
-   }
} else {
scr = SCR_RXFIFO_FSEL_IF8 | SCR_RXFIFO_AUTOSYNC;
mask = SCR_RXFIFO_FSEL_MASK | SCR_RXFIFO_AUTOSYNC_MASK|
SCR_RXFIFO_CTL_MASK | SCR_RXFIFO_OFF_MASK;
-   ret = clk_prepare_enable(spdif_priv->rxclk);
-   if (ret)
-   goto err;
}
regmap_update_bits(regmap, REG_SPDIF_SCR, mask, scr);
 
@@ -551,15 +529,7 @@ static int fsl_spdif_startup(struct snd_pcm_substream 
*substream,
 
return 0;
 
-disable_txclk:
-   for (i--; i >= 0; i--)
-   clk_disable_unprepare(spdif_priv->txclk[i]);
 err:
-   if (!IS_ERR(spdif_priv->spbaclk))
-   clk_disable_unprepare(spdif_priv->spbaclk);
-err_spbaclk:
-   clk_disable_unprepare(spdif_priv->coreclk);
-
return ret;
 }
 
@@ -569,20 +539,17 @@ static void fsl_spdif_shutdown(struct snd_pcm_substream 
*substream,
struct snd_soc_pcm_runtime *rtd = substream->private_data;
struct fsl_spdif_priv *spdif_priv = 
snd_soc_dai_get_drvdata(asoc_rtd_to_cpu(rtd, 0));
struct regmap *regmap = spdif_priv->regmap;
-   u32 scr, mask, i;
+   u32 scr, mask;
 
if (substream->stream == SNDRV_PCM_STREAM_PLAYBACK) {
scr = 0;
mask = SCR_TXFIFO_AUTOSYNC_MASK | SCR_TXFIFO_CTRL_MASK |
SCR_TXSEL_MASK | SCR_USRC_SEL_MASK |
SCR_TXFIFO_FSEL_MASK;
-   for (i = 0; i < SPDIF_TXRATE_MAX; i++)
-   clk_disable_unprepare(spdif_priv->txclk[i]);
} else {
scr = SCR_RXFIFO_OFF | SCR_RXFIFO_CTL_ZERO;
mask = SCR_RXFIFO_FSEL_MASK | SCR_RXFIFO_AUTOSYNC_MASK|
SCR_RXFIFO_CTL_MASK | SCR_RXFIFO_OFF_MASK;
-   clk_disable_unprepare(spdif_priv->rxclk);
}
regmap_update_bits(regmap, REG_SPDIF_SCR, mask, scr);
 
@@ -591,9 +558,6 @@ static void fsl_spdif_shutdown(struct snd_pcm_substream 
*substream,
spdif_intr_status_clear(spdif_priv);
regmap_update_bits(regmap, REG_SPDIF_SCR,
SCR_LOW_POWER, SCR_LOW_POWER);
-   if (!IS_ERR(spdif_priv->spbaclk))
-   clk_disable_unprepare(spdif_priv->spbaclk);
-   clk_disable_unprepare(spdif_priv->coreclk);
}
 }
 
@@ -1350,6 +1314,8 @@ static int fsl_spdif_probe(struct platform_device *pdev)
 
/* Register with ASoC */
dev_set_drvdata(&pdev->dev, spdif_priv);
+   pm_runtime_enable(&pdev->dev);
+   regcache_cache_only(spdif_priv->regmap, true);
 
ret = devm_snd_soc_register_component(&pdev->dev, &fsl_spdif_component,
 

[PATCH v2 4/4] KVM: PPC: Book3S HV: migrate hot plugged memory

2020-06-18 Thread Ram Pai
From: Laurent Dufour 

When a memory slot is hot plugged to a SVM, PFNs associated with the
GFNs in that slot must be migrated to secure-PFNs, aka device-PFNs.

kvmppc_uv_migrate_mem_slot() is called to accomplish this. The
UV_PAGE_IN ucall is skipped, since the ultravisor does not trust the
contents of those pages and would ignore them anyway.

Signed-off-by: Ram Pai 
[resolved conflicts, and modified the commit log]
Signed-off-by: Laurent Dufour 
---
 arch/powerpc/include/asm/kvm_book3s_uvmem.h |  2 ++
 arch/powerpc/kvm/book3s_hv.c| 10 ++
 arch/powerpc/kvm/book3s_hv_uvmem.c  |  2 +-
 3 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_uvmem.h 
b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
index f0c5708..05ae789 100644
--- a/arch/powerpc/include/asm/kvm_book3s_uvmem.h
+++ b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
@@ -23,6 +23,8 @@ unsigned long kvmppc_h_svm_page_out(struct kvm *kvm,
 void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
 struct kvm *kvm, bool skip_page_out,
 bool purge_gfn);
+int kvmppc_uv_migrate_mem_slot(struct kvm *kvm,
+   const struct kvm_memory_slot *memslot);
 #else
 static inline int kvmppc_uvmem_init(void)
 {
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 6cf80e5..bf7324d 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4531,10 +4531,12 @@ static void kvmppc_core_commit_memory_region_hv(struct 
kvm *kvm,
case KVM_MR_CREATE:
if (kvmppc_uvmem_slot_init(kvm, new))
return;
-   uv_register_mem_slot(kvm->arch.lpid,
-new->base_gfn << PAGE_SHIFT,
-new->npages * PAGE_SIZE,
-0, new->id);
+   if (uv_register_mem_slot(kvm->arch.lpid,
+new->base_gfn << PAGE_SHIFT,
+new->npages * PAGE_SIZE,
+0, new->id))
+   return;
+   kvmppc_uv_migrate_mem_slot(kvm, new);
break;
case KVM_MR_DELETE:
uv_unregister_mem_slot(kvm->arch.lpid, old->id);
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 78f8580..4d8f5bc 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -451,7 +451,7 @@ static int kvmppc_svm_migrate_page(struct vm_area_struct 
*vma,
return ret;
 }
 
-static int kvmppc_uv_migrate_mem_slot(struct kvm *kvm,
+int kvmppc_uv_migrate_mem_slot(struct kvm *kvm,
const struct kvm_memory_slot *memslot)
 {
unsigned long gfn = memslot->base_gfn;
-- 
1.8.3.1



[PATCH v2 3/4] KVM: PPC: Book3S HV: migrate remaining normal-GFNs to secure-GFNs in H_SVM_INIT_DONE

2020-06-18 Thread Ram Pai
H_SVM_INIT_DONE incorrectly assumes that the Ultravisor has explicitly
called H_SVM_PAGE_IN for all secure pages. These GFNs continue to be
normal GFNs associated with normal PFNs, when in fact they should have
been secure GFNs associated with device PFNs.

Move all the PFNs associated with the SVM's GFNs to secure-PFNs in
H_SVM_INIT_DONE. Skip the GFNs that are already paged-in or shared
through H_SVM_PAGE_IN, or paged-in and subsequently paged-out through
UV_PAGE_OUT.
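
A condensed sketch of the per-GFN walk this implies (inferred from the
commit message and the helpers in the diff below; the actual
kvmppc_uv_migrate_mem_slot() hunk is cut off in this archive):

  unsigned long gfn, start, end;
  int ret = 0;

  for (gfn = memslot->base_gfn;
       gfn < memslot->base_gfn + memslot->npages; gfn++) {
          /* already paged-in, paged-out or shared: skip */
          if (kvmppc_gfn_has_transitioned(gfn, kvm))
                  continue;

          start = gfn_to_hva(kvm, gfn);
          end = start + (1UL << PAGE_SHIFT);

          /* allocate a device PFN and UV_PAGE_IN the contents */
          ret = kvmppc_svm_migrate_page(vma, start, end,
                                        gfn << PAGE_SHIFT, kvm,
                                        PAGE_SHIFT, true);
          if (ret)
                  break;
  }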

Cc: Paul Mackerras 
Cc: Benjamin Herrenschmidt 
Cc: Michael Ellerman 
Cc: Bharata B Rao 
Cc: Aneesh Kumar K.V 
Cc: Sukadev Bhattiprolu 
Cc: Laurent Dufour 
Cc: Thiago Jung Bauermann 
Cc: David Gibson 
Cc: Claudio Carvalho 
Cc: kvm-...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Ram Pai 
---
 Documentation/powerpc/ultravisor.rst |   2 +
 arch/powerpc/kvm/book3s_hv_uvmem.c   | 235 +--
 2 files changed, 171 insertions(+), 66 deletions(-)

diff --git a/Documentation/powerpc/ultravisor.rst 
b/Documentation/powerpc/ultravisor.rst
index 363736d..3bc8957 100644
--- a/Documentation/powerpc/ultravisor.rst
+++ b/Documentation/powerpc/ultravisor.rst
@@ -933,6 +933,8 @@ Return values
* H_UNSUPPORTED if called from the wrong context (e.g.
from an SVM or before an H_SVM_INIT_START
hypercall).
+   * H_STATE   if the hypervisor could not successfully
+transition the VM to Secure VM.
 
 Description
 ~~~
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 666d1bb..78f8580 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -339,6 +339,21 @@ static bool kvmppc_gfn_is_uvmem_pfn(unsigned long gfn, 
struct kvm *kvm,
return false;
 }
 
+/* return true, if the GFN is a shared-GFN, or a secure-GFN */
+bool kvmppc_gfn_has_transitioned(unsigned long gfn, struct kvm *kvm)
+{
+   struct kvmppc_uvmem_slot *p;
+
+   list_for_each_entry(p, &kvm->arch.uvmem_pfns, list) {
+   if (gfn >= p->base_pfn && gfn < p->base_pfn + p->nr_pfns) {
+   unsigned long index = gfn - p->base_pfn;
+
+   return (p->pfns[index] & KVMPPC_GFN_FLAG_MASK);
+   }
+   }
+   return false;
+}
+
 unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
 {
struct kvm_memslots *slots;
@@ -377,14 +392,152 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
return ret;
 }
 
+static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm);
+
+/*
+ * Alloc a PFN from private device memory pool. If @pagein is true,
+ * copy page from normal memory to secure memory using UV_PAGE_IN uvcall.
+ */
+static int kvmppc_svm_migrate_page(struct vm_area_struct *vma,
+   unsigned long start,
+   unsigned long end, unsigned long gpa, struct kvm *kvm,
+   unsigned long page_shift,
+   bool pagein)
+{
+   unsigned long src_pfn, dst_pfn = 0;
+   struct migrate_vma mig;
+   struct page *dpage;
+   struct page *spage;
+   unsigned long pfn;
+   int ret = 0;
+
+   memset(&mig, 0, sizeof(mig));
+   mig.vma = vma;
+   mig.start = start;
+   mig.end = end;
+   mig.src = &src_pfn;
+   mig.dst = &dst_pfn;
+
+   ret = migrate_vma_setup(&mig);
+   if (ret)
+   return ret;
+
+   if (!(*mig.src & MIGRATE_PFN_MIGRATE)) {
+   ret = -1;
+   goto out_finalize;
+   }
+
+   dpage = kvmppc_uvmem_get_page(gpa, kvm);
+   if (!dpage) {
+   ret = -1;
+   goto out_finalize;
+   }
+
+   if (pagein) {
+   pfn = *mig.src >> MIGRATE_PFN_SHIFT;
+   spage = migrate_pfn_to_page(*mig.src);
+   if (spage) {
+   ret = uv_page_in(kvm->arch.lpid, pfn << page_shift,
+   gpa, 0, page_shift);
+   if (ret)
+   goto out_finalize;
+   }
+   }
+
+   *mig.dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
+   migrate_vma_pages(&mig);
+out_finalize:
+   migrate_vma_finalize(&mig);
+   return ret;
+}
+
+static int kvmppc_uv_migrate_mem_slot(struct kvm *kvm,
+   const struct kvm_memory_slot *memslot)
+{
+   unsigned long gfn = memslot->base_gfn;
+   unsigned long end;
+   bool downgrade = false;
+   struct vm_area_struct *vma;
+   int i, ret = 0;
+   unsigned long start = gfn_to_hva(kvm, gfn);
+
+   if (kvm_is_error_hva(start))
+   return H_STATE;
+
+   end = start + (memslot->npages << PAGE_SHIFT);
+
+   down_write(&kvm->mm->mmap_sem);
+
+   mutex_lock(&kvm->arch.uvmem_lock);
+   vma = find_vma_intersection(kvm->mm, start, end);
+   if (!vma || vma-

[PATCH v2 2/4] KVM: PPC: Book3S HV: track the state GFNs associated with secure VMs

2020-06-18 Thread Ram Pai
During the life of an SVM, its GFNs transition through normal, secure and
shared states. Since the kernel does not track GFNs that are shared, it
is not possible to disambiguate a shared GFN from a GFN whose PFN has
not yet been migrated to a secure-PFN. Nor is it possible to
disambiguate a secure-GFN from a GFN whose PFN has been paged out from
the ultravisor.

The ability to identify the state of a GFN is needed to skip migration of its
PFN to a secure-PFN during ESM transition.

The code is re-organized to track the states of a GFN as explained
below.


 1. States of a GFN
---
 The GFN can be in one of the following states.

 (a) Secure - The GFN is secure. The GFN is associated with
a Secure VM; the contents of the GFN are not accessible
to the Hypervisor.  This GFN can be backed by a secure-PFN,
or by a normal-PFN with its contents encrypted.
The former is true when the GFN is paged-in to the
ultravisor; the latter when the GFN is paged-out
of the ultravisor.

 (b) Shared - The GFN is shared. The GFN is associated with a
secure VM. The contents of the GFN are accessible to the
Hypervisor. This GFN is backed by a normal-PFN and its
contents are un-encrypted.

 (c) Normal - The GFN is normal. The GFN is associated with
a normal VM. The contents of the GFN are accessible to
the Hypervisor. Its contents are never encrypted.

 2. States of a VM.
---

 (a) Normal VM:  A VM whose contents are always accessible to
the hypervisor.  All its GFNs are normal-GFNs.

 (b) Secure VM: A VM whose contents are not accessible to the
hypervisor without the VM's consent.  Its GFNs are
either Shared-GFNs or Secure-GFNs.

 (c) Transient VM: A Normal VM that is transitioning to a secure VM.
The transition starts on successful return of
H_SVM_INIT_START, and ends on successful return
of H_SVM_INIT_DONE. This transient VM can have GFNs
in any of the three states, i.e. Secure-GFN, Shared-GFN,
and Normal-GFN. The VM never executes in this state
in supervisor-mode.

 3. Memory slot State.
--
The state of a memory slot mirrors the state of the
VM the memory slot is associated with.

 4. VM State transition.


  A VM always starts in Normal Mode.

  H_SVM_INIT_START moves the VM into transient state. During this
  time the Ultravisor may request some of its GFNs to be shared or
  secured. So its GFNs can be in one of the three GFN states.

  H_SVM_INIT_DONE moves the VM entirely from transient state to
  secure-state. At this point any left-over normal-GFNs are
  transitioned to Secure-GFN.

  H_SVM_INIT_ABORT moves the transient VM back to normal VM.
  All its GFNs are moved to Normal-GFNs.

  UV_TERMINATE transitions the secure-VM back to a normal-VM. All
  the secure-GFNs and shared-GFNs are transitioned to normal-GFNs.
  Note: The contents of the normal-GFNs are undefined at this point.
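
  The transitions above can be summarised as a small state machine
  (a toy encoding; enum and function names are illustrative, not
  the kernel's):

  enum vm_state { VM_NORMAL, VM_TRANSIENT, VM_SECURE };

  enum vm_event {
          EV_H_SVM_INIT_START,    /* successful return */
          EV_H_SVM_INIT_DONE,     /* successful return */
          EV_H_SVM_INIT_ABORT,
          EV_UV_TERMINATE,
  };

  static enum vm_state vm_next_state(enum vm_state s, enum vm_event e)
  {
          switch (e) {
          case EV_H_SVM_INIT_START:
                  return s == VM_NORMAL ? VM_TRANSIENT : s;
          case EV_H_SVM_INIT_DONE:
                  return s == VM_TRANSIENT ? VM_SECURE : s;
          case EV_H_SVM_INIT_ABORT:
                  return s == VM_TRANSIENT ? VM_NORMAL : s;
          case EV_UV_TERMINATE:
                  return s == VM_SECURE ? VM_NORMAL : s;
          }
          return s;
  }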

 5. GFN state implementation:
-

 A Secure GFN is associated with a secure-PFN, also called a uvmem_pfn,
 when the GFN is paged-in. Its pfn[] entry has the KVMPPC_GFN_UVMEM_PFN
 flag set, and contains the value of the secure-PFN.
 It is associated with a normal-PFN, also called a mem_pfn, when
 the GFN is paged-out. Its pfn[] entry has the KVMPPC_GFN_MEM_PFN flag
 set. The value of the normal-PFN is not tracked.

 A Shared GFN is associated with a normal-PFN. Its pfn[] entry has the
 KVMPPC_UVMEM_SHARED_PFN flag set. The value of the normal-PFN
 is not tracked.

 A Normal GFN is associated with a normal-PFN. Its pfn[] entry has
 no flag set. The value of the normal-PFN is not tracked.
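
 An illustrative encoding of these flags (bit positions are assumptions
 for the sketch; the real definitions live in book3s_hv_uvmem.c):

  /* high bits of pfns[gfn - base] carry the state, low bits the PFN */
  #define KVMPPC_GFN_UVMEM_PFN    (1UL << 63) /* secure, device-PFN backed */
  #define KVMPPC_GFN_MEM_PFN      (1UL << 62) /* secure, paged out */
  #define KVMPPC_UVMEM_SHARED_PFN (1UL << 61) /* shared with the HV */
  #define KVMPPC_GFN_FLAG_MASK    (KVMPPC_GFN_UVMEM_PFN | \
                                   KVMPPC_GFN_MEM_PFN | \
                                   KVMPPC_UVMEM_SHARED_PFN)
  #define KVMPPC_GFN_PFN_MASK     (~KVMPPC_GFN_FLAG_MASK)

  /* a normal GFN has no flag set, so "has transitioned" is one test */
  static inline bool gfn_state_set(unsigned long entry)
  {
          return entry & KVMPPC_GFN_FLAG_MASK;
  }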

 6. Life cycle of a GFN

 ----------------------------------------------------------------
 | GFN    | Share     | Unshare   | SVM       | H_SVM_INIT_DONE |
 | state  | operation | operation | abort/    |                 |
 |        |           |           | terminate |                 |
 ----------------------------------------------------------------
 | Secure | Shared    | Secure    | Normal    | Secure          |
 | Shared | Shared    | Secure    | Normal    | Shared          |
 | Normal | Shared    | Secure    | Normal    | Secure          |
 ----------------------------------------------------------------

 7. Life cycle of a VM

 
 | |  start|  H_SVM_  |H_SVM_   |H_SVM_ |UV_SVM_|
 | |  VM   |INIT_START|INIT_DONE|INIT_ABORT |TERMINATE  |
 | |   |  | |   |   |
 - ---

[PATCH v2 1/4] KVM: PPC: Book3S HV: Fix function definition in book3s_hv_uvmem.c

2020-06-18 Thread Ram Pai
Without this fix, git gets confused: it generates the wrong
function context for code changes in subsequent patches.
Weird, but true.

Cc: Paul Mackerras 
Cc: Benjamin Herrenschmidt 
Cc: Michael Ellerman 
Cc: Bharata B Rao 
Cc: Aneesh Kumar K.V 
Cc: Sukadev Bhattiprolu 
Cc: Laurent Dufour 
Cc: Thiago Jung Bauermann 
Cc: David Gibson 
Cc: Claudio Carvalho 
Cc: kvm-...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Ram Pai 
---
 arch/powerpc/kvm/book3s_hv_uvmem.c | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index ad950f89..3599aaa 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -369,8 +369,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long 
gpa, struct kvm *kvm)
  * Alloc a PFN from private device memory pool and copy page from normal
  * memory to secure memory using UV_PAGE_IN uvcall.
  */
-static int
-kvmppc_svm_page_in(struct vm_area_struct *vma, unsigned long start,
+static int kvmppc_svm_page_in(struct vm_area_struct *vma, unsigned long start,
   unsigned long end, unsigned long gpa, struct kvm *kvm,
   unsigned long page_shift, bool *downgrade)
 {
@@ -437,8 +436,8 @@ static struct page *kvmppc_uvmem_get_page(unsigned long 
gpa, struct kvm *kvm)
  * In the former case, uses dev_pagemap_ops.migrate_to_ram handler
  * to unmap the device page from QEMU's page tables.
  */
-static unsigned long
-kvmppc_share_page(struct kvm *kvm, unsigned long gpa, unsigned long page_shift)
+static unsigned long kvmppc_share_page(struct kvm *kvm, unsigned long gpa,
+   unsigned long page_shift)
 {
 
int ret = H_PARAMETER;
@@ -487,9 +486,9 @@ static struct page *kvmppc_uvmem_get_page(unsigned long 
gpa, struct kvm *kvm)
  * H_PAGE_IN_SHARED flag makes the page shared which means that the same
  * memory in is visible from both UV and HV.
  */
-unsigned long
-kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gpa,
-unsigned long flags, unsigned long page_shift)
+unsigned long kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gpa,
+   unsigned long flags,
+   unsigned long page_shift)
 {
bool downgrade = false;
unsigned long start, end;
@@ -546,10 +545,10 @@ static struct page *kvmppc_uvmem_get_page(unsigned long 
gpa, struct kvm *kvm)
  * Provision a new page on HV side and copy over the contents
  * from secure memory using UV_PAGE_OUT uvcall.
  */
-static int
-kvmppc_svm_page_out(struct vm_area_struct *vma, unsigned long start,
-   unsigned long end, unsigned long page_shift,
-   struct kvm *kvm, unsigned long gpa)
+static int kvmppc_svm_page_out(struct vm_area_struct *vma,
+   unsigned long start,
+   unsigned long end, unsigned long page_shift,
+   struct kvm *kvm, unsigned long gpa)
 {
unsigned long src_pfn, dst_pfn = 0;
struct migrate_vma mig;
-- 
1.8.3.1



[PATCH v2 0/4] Migrate non-migrated pages of a SVM.

2020-06-18 Thread Ram Pai
This patch series migrates the non-migrated pages of a SVM.
This is required when the UV calls H_SVM_INIT_DONE, and
when a memory-slot is hotplugged to a Secure VM.

Testing: Passed rigorous SVM reboot test using different
sized SVMs.

Changelog:
. fixed a bug observed by Bharata. Pages that
were paged-in and later paged-out must also be
skipped from migration during H_SVM_INIT_DONE.

Laurent Dufour (1):
  KVM: PPC: Book3S HV: migrate hot plugged memory

Ram Pai (3):
  KVM: PPC: Book3S HV: Fix function definition in book3s_hv_uvmem.c
  KVM: PPC: Book3S HV: track the state GFNs associated with secure VMs
  KVM: PPC: Book3S HV: migrate remaining normal-GFNs to secure-GFNs in
H_SVM_INIT_DONE

 Documentation/powerpc/ultravisor.rst|   2 +
 arch/powerpc/include/asm/kvm_book3s_uvmem.h |   8 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c  |   2 +-
 arch/powerpc/kvm/book3s_hv.c|  12 +-
 arch/powerpc/kvm/book3s_hv_uvmem.c  | 449 ++--
 5 files changed, 368 insertions(+), 105 deletions(-)

-- 
1.8.3.1



Re: [PATCH] mm: Move p?d_alloc_track to separate header file

2020-06-18 Thread Mike Rapoport
On Wed, Jun 17, 2020 at 06:12:26PM -0700, Andrew Morton wrote:
> On Tue,  9 Jun 2020 14:05:33 +0200 Joerg Roedel  wrote:
> 
> > From: Joerg Roedel 
> > 
> > The functions are only used in two source files, so there is no need
> > for them to be in the global  header. Move them to the new
> >  header and include it only where needed.
> > 
> > ...
> >
> > new file mode 100644
> > index ..1dcc865029a2
> > --- /dev/null
> > +++ b/include/linux/pgalloc-track.h
> > @@ -0,0 +1,51 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _LINUX_PGALLLC_TRACK_H
> > +#define _LINUX_PGALLLC_TRACK_H
> 
> hm, no #includes.  I guess this is OK, given the limited use.
> 
> But it does make one wonder whether ioremap.c should be moved from lib/
> to mm/ and this file should be moved from include/linux/ to mm/.

It makes sense, but I am anyway planning consolidation of pgalloc.h, so
most probably pgalloc-track will not survive until 5.9-rc1 :)

If you think it is worth moving ioremap.c to mm/ regardless of the churn,
I can send a patch for that.

> Oh well.

-- 
Sincerely yours,
Mike.


Re: [PATCH V3 (RESEND) 0/3] arm64: Enable vmemmap mapping from device memory

2020-06-18 Thread Mike Rapoport
On Thu, Jun 18, 2020 at 06:45:27AM +0530, Anshuman Khandual wrote:
> This series enables vmemmap backing memory allocation from device memory
> ranges on arm64. But before that, it enables vmemmap_populate_basepages()
> and vmemmap_alloc_block_buf() to accommodate struct vmem_altmap based
> allocation requests.
> 
> This series applies on 5.8-rc1.
> 
> Pending Question:
> 
> altmap_alloc_block_buf() does not have any other remaining users in
> the tree after this change. Should it be converted into a static
> function and its declaration be dropped from the header
> (include/linux/mm.h). Avoided doing so because I was not sure if there
> are any off-tree users or not.

Well, off-tree users probably have an active fork anyway so they could
switch to vmemmap_alloc_block_buf()...

Regardless, can you please update Documentation/vm/memory-model.rst to
keep it in sync with the code?

> Changes in V3:
> 
> - Dropped comment from free_hotplug_page_range() per Robin
> - Modified comment in unmap_hotplug_range() per Robin
> - Enabled altmap support in vmemmap_alloc_block_buf() per Robin
> 
> Changes in V2: (https://lkml.org/lkml/2020/3/4/475)
> 
> - Rebased on latest hot-remove series (v14) adding P4D page table support
> 
> Changes in V1: (https://lkml.org/lkml/2020/1/23/12)
> 
> - Added an WARN_ON() in unmap_hotplug_range() when altmap is
>   provided without the page table backing memory being freed
> 
> Changes in RFC V2: (https://lkml.org/lkml/2019/10/21/11)
> 
> - Changed the commit message on 1/2 patch per Will
> - Changed the commit message on 2/2 patch as well
> - Rebased on arm64 memory hot remove series (v10)
> 
> RFC V1: (https://lkml.org/lkml/2019/6/28/32)
> 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Mark Rutland 
> Cc: Paul Walmsley 
> Cc: Palmer Dabbelt 
> Cc: Tony Luck 
> Cc: Fenghua Yu 
> Cc: Dave Hansen 
> Cc: Andy Lutomirski 
> Cc: Peter Zijlstra 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: David Hildenbrand 
> Cc: Mike Rapoport 
> Cc: Michal Hocko 
> Cc: "Matthew Wilcox (Oracle)" 
> Cc: "Kirill A. Shutemov" 
> Cc: Andrew Morton 
> Cc: Dan Williams 
> Cc: Pavel Tatashin 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Michael Ellerman 
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-i...@vger.kernel.org
> Cc: linux-ri...@lists.infradead.org
> Cc: x...@kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux...@kvack.org
> Cc: linux-ker...@vger.kernel.org
> 
> Anshuman Khandual (3):
>   mm/sparsemem: Enable vmem_altmap support in vmemmap_populate_basepages()
>   mm/sparsemem: Enable vmem_altmap support in vmemmap_alloc_block_buf()
>   arm64/mm: Enable vmem_altmap support for vmemmap mappings
> 
>  arch/arm64/mm/mmu.c   | 59 ++-
>  arch/ia64/mm/discontig.c  |  2 +-
>  arch/powerpc/mm/init_64.c | 10 +++
>  arch/riscv/mm/init.c  |  2 +-
>  arch/x86/mm/init_64.c | 12 
>  include/linux/mm.h|  8 --
>  mm/sparse-vmemmap.c   | 38 -
>  7 files changed, 87 insertions(+), 44 deletions(-)
> 
> -- 
> 2.20.1
> 

-- 
Sincerely yours,
Mike.


Re: [PATCH v2 02/12] ocxl: Change type of pasid to unsigned int

2020-06-18 Thread Frederic Barrat
On 13/06/2020 at 02:41, Fenghua Yu wrote:

PASID is defined as "int" although it's a 20-bit value and shouldn't be
a negative int. To be consistent with the type defined in iommu, define
PASID as "unsigned int".



It looks like this patch was considered because of the use of 'pasid' in
variable or function names. The ocxl driver only makes sense on powerpc
and shouldn't compile on anything else, so it's probably useless in the
context of that series.
The pasid here is defined by the opencapi specification
(https://opencapi.org); it is borrowed from the PCI world and you could
argue it should be an unsigned int. But then I think the patch doesn't
go far enough. Considering it's not used on x86, though, I think this
patch can be dropped.
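
A toy illustration of the 20-bit constraint (standalone C; the macro
name is made up for the sketch):

  #include <stdbool.h>
  #include <stdio.h>

  #define TOY_PASID_MAX   ((1u << 20) - 1)        /* 20-bit PASID space */

  static bool pasid_valid(unsigned int pasid)
  {
          /* unsigned int covers the whole range; no negative case */
          return pasid <= TOY_PASID_MAX;
  }

  int main(void)
  {
          printf("%d %d\n", pasid_valid(0xfffff), pasid_valid(0x100000));
          return 0;
  }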


  Fred




Suggested-by: Thomas Gleixner 
Signed-off-by: Fenghua Yu 
Reviewed-by: Tony Luck 
---
v2:
- Create this new patch to define PASID as "unsigned int" consistently in
   ocxl (Thomas)

  drivers/misc/ocxl/config.c|  3 ++-
  drivers/misc/ocxl/link.c  |  6 +++---
  drivers/misc/ocxl/ocxl_internal.h |  6 +++---
  drivers/misc/ocxl/pasid.c |  2 +-
  drivers/misc/ocxl/trace.h | 20 ++--
  include/misc/ocxl.h   |  6 +++---
  6 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/drivers/misc/ocxl/config.c b/drivers/misc/ocxl/config.c
index c8e19bfb5ef9..22d034caed3d 100644
--- a/drivers/misc/ocxl/config.c
+++ b/drivers/misc/ocxl/config.c
@@ -806,7 +806,8 @@ int ocxl_config_set_TL(struct pci_dev *dev, int tl_dvsec)
  }
  EXPORT_SYMBOL_GPL(ocxl_config_set_TL);
  
-int ocxl_config_terminate_pasid(struct pci_dev *dev, int afu_control, int pasid)

+int ocxl_config_terminate_pasid(struct pci_dev *dev, int afu_control,
+   unsigned int pasid)
  {
u32 val;
unsigned long timeout;
diff --git a/drivers/misc/ocxl/link.c b/drivers/misc/ocxl/link.c
index 58d111afd9f6..931f6ae022db 100644
--- a/drivers/misc/ocxl/link.c
+++ b/drivers/misc/ocxl/link.c
@@ -492,7 +492,7 @@ static u64 calculate_cfg_state(bool kernel)
return state;
  }
  
-int ocxl_link_add_pe(void *link_handle, int pasid, u32 pidr, u32 tidr,

+int ocxl_link_add_pe(void *link_handle, unsigned int pasid, u32 pidr, u32 tidr,
u64 amr, struct mm_struct *mm,
void (*xsl_err_cb)(void *data, u64 addr, u64 dsisr),
void *xsl_err_data)
@@ -572,7 +572,7 @@ int ocxl_link_add_pe(void *link_handle, int pasid, u32 
pidr, u32 tidr,
  }
  EXPORT_SYMBOL_GPL(ocxl_link_add_pe);
  
-int ocxl_link_update_pe(void *link_handle, int pasid, __u16 tid)

+int ocxl_link_update_pe(void *link_handle, unsigned int pasid, __u16 tid)
  {
struct ocxl_link *link = (struct ocxl_link *) link_handle;
struct spa *spa = link->spa;
@@ -608,7 +608,7 @@ int ocxl_link_update_pe(void *link_handle, int pasid, __u16 
tid)
return rc;
  }
  
-int ocxl_link_remove_pe(void *link_handle, int pasid)

+int ocxl_link_remove_pe(void *link_handle, unsigned int pasid)
  {
struct ocxl_link *link = (struct ocxl_link *) link_handle;
struct spa *spa = link->spa;
diff --git a/drivers/misc/ocxl/ocxl_internal.h 
b/drivers/misc/ocxl/ocxl_internal.h
index 345bf843a38e..3ca982ba7472 100644
--- a/drivers/misc/ocxl/ocxl_internal.h
+++ b/drivers/misc/ocxl/ocxl_internal.h
@@ -41,7 +41,7 @@ struct ocxl_afu {
struct ocxl_afu_config config;
int pasid_base;
int pasid_count; /* opened contexts */
-   int pasid_max; /* maximum number of contexts */
+   unsigned int pasid_max; /* maximum number of contexts */
int actag_base;
int actag_enabled;
struct mutex contexts_lock;
@@ -69,7 +69,7 @@ struct ocxl_xsl_error {
  
  struct ocxl_context {

struct ocxl_afu *afu;
-   int pasid;
+   unsigned int pasid;
struct mutex status_mutex;
enum ocxl_context_status status;
struct address_space *mapping;
@@ -128,7 +128,7 @@ int ocxl_config_check_afu_index(struct pci_dev *dev,
   * pasid: the PASID for the AFU context
   * tid: the new thread id for the process element
   */
-int ocxl_link_update_pe(void *link_handle, int pasid, __u16 tid);
+int ocxl_link_update_pe(void *link_handle, unsigned int pasid, __u16 tid);
  
  int ocxl_context_mmap(struct ocxl_context *ctx,

struct vm_area_struct *vma);
diff --git a/drivers/misc/ocxl/pasid.c b/drivers/misc/ocxl/pasid.c
index d14cb56e6920..a151fc8f0bec 100644
--- a/drivers/misc/ocxl/pasid.c
+++ b/drivers/misc/ocxl/pasid.c
@@ -80,7 +80,7 @@ static void range_free(struct list_head *head, u32 start, u32 
size,
  
  int ocxl_pasid_afu_alloc(struct ocxl_fn *fn, u32 size)

  {
-   int max_pasid;
+   unsigned int max_pasid;
  
  	if (fn->config.max_pasid_log < 0)

return -ENOSPC;
diff --git a/drivers/misc/ocxl/trace.h b/drivers/misc/ocxl/trace.h
index 17e21cb2addd..019e2fc63b1d 100644
--- a/drivers/misc/ocxl/trace.h
+++ b/drivers/misc/o

[V2 PATCH 3/3] Add support for arm64 to carry over IMA measurement logs

2020-06-18 Thread Prakhar Srivastava
Add support for arm64 to carry over IMA measurement logs.
Update arm64 code to call into functions made available in patch 1/3.

---
 arch/arm64/Kconfig |  1 +
 arch/arm64/include/asm/ima.h   | 17 ++
 arch/arm64/include/asm/kexec.h |  3 ++
 arch/arm64/kernel/machine_kexec_file.c | 47 +-
 4 files changed, 60 insertions(+), 8 deletions(-)
 create mode 100644 arch/arm64/include/asm/ima.h

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 5d513f461957..3d544e2e25e6 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1070,6 +1070,7 @@ config KEXEC
 config KEXEC_FILE
bool "kexec file based system call"
select KEXEC_CORE
+   select HAVE_IMA_KEXEC
help
  This is new version of kexec system call. This system call is
  file based and takes file descriptors as system call argument
diff --git a/arch/arm64/include/asm/ima.h b/arch/arm64/include/asm/ima.h
new file mode 100644
index ..70ac39b74607
--- /dev/null
+++ b/arch/arm64/include/asm/ima.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_ARCH_IMA_H
+#define _ASM_ARCH_IMA_H
+
+struct kimage;
+
+#ifdef CONFIG_IMA_KEXEC
+int arch_ima_add_kexec_buffer(struct kimage *image, unsigned long load_addr,
+ size_t size);
+#else
+static inline int arch_ima_add_kexec_buffer(struct kimage *image,
+   unsigned long load_addr, size_t size)
+{
+   return 0;
+}
+#endif /* CONFIG_IMA_KEXEC */
+#endif /* _ASM_ARCH_IMA_H */
diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
index d24b527e8c00..7bd60c185ad3 100644
--- a/arch/arm64/include/asm/kexec.h
+++ b/arch/arm64/include/asm/kexec.h
@@ -100,6 +100,9 @@ struct kimage_arch {
void *elf_headers;
unsigned long elf_headers_mem;
unsigned long elf_headers_sz;
+
+   phys_addr_t ima_buffer_addr;
+   size_t ima_buffer_size;
 };
 
 extern const struct kexec_file_ops kexec_image_ops;
diff --git a/arch/arm64/kernel/machine_kexec_file.c 
b/arch/arm64/kernel/machine_kexec_file.c
index b40c3b0def92..1e9007c926db 100644
--- a/arch/arm64/kernel/machine_kexec_file.c
+++ b/arch/arm64/kernel/machine_kexec_file.c
@@ -24,20 +24,37 @@
 #include 
 
 /* relevant device tree properties */
-#define FDT_PROP_KEXEC_ELFHDR  "linux,elfcorehdr"
-#define FDT_PROP_MEM_RANGE "linux,usable-memory-range"
-#define FDT_PROP_INITRD_START  "linux,initrd-start"
-#define FDT_PROP_INITRD_END"linux,initrd-end"
-#define FDT_PROP_BOOTARGS  "bootargs"
-#define FDT_PROP_KASLR_SEED"kaslr-seed"
-#define FDT_PROP_RNG_SEED  "rng-seed"
-#define RNG_SEED_SIZE  128
+#define FDT_PROP_KEXEC_ELFHDR  "linux,elfcorehdr"
+#define FDT_PROP_MEM_RANGE "linux,usable-memory-range"
+#define FDT_PROP_INITRD_START  "linux,initrd-start"
+#define FDT_PROP_INITRD_END"linux,initrd-end"
+#define FDT_PROP_BOOTARGS  "bootargs"
+#define FDT_PROP_KASLR_SEED"kaslr-seed"
+#define FDT_PROP_RNG_SEED  "rng-seed"
+#define FDT_PROP_IMA_KEXEC_BUFFER  "linux,ima-kexec-buffer"
+#define RNG_SEED_SIZE  128
 
 const struct kexec_file_ops * const kexec_file_loaders[] = {
&kexec_image_ops,
NULL
 };
 
+/**
+ * arch_ima_add_kexec_buffer - do arch-specific steps to add the IMA buffer
+ *
+ * Architectures should use this function to pass on the IMA buffer
+ * information to the next kernel.
+ *
+ * Return: 0 on success, negative errno on error.
+ */
+int arch_ima_add_kexec_buffer(struct kimage *image, unsigned long load_addr,
+ size_t size)
+{
+   image->arch.ima_buffer_addr = load_addr;
+   image->arch.ima_buffer_size = size;
+   return 0;
+}
+
 int arch_kimage_file_post_load_cleanup(struct kimage *image)
 {
vfree(image->arch.dtb);
@@ -66,6 +83,9 @@ static int setup_dtb(struct kimage *image,
if (ret && ret != -FDT_ERR_NOTFOUND)
goto out;
ret = fdt_delprop(dtb, off, FDT_PROP_MEM_RANGE);
+   if (ret && ret != -FDT_ERR_NOTFOUND)
+   goto out;
+   ret = fdt_delprop(dtb, off, FDT_PROP_IMA_KEXEC_BUFFER);
if (ret && ret != -FDT_ERR_NOTFOUND)
goto out;
 
@@ -119,6 +139,17 @@ static int setup_dtb(struct kimage *image,
goto out;
}
 
+   if (image->arch.ima_buffer_size > 0) {
+
+   ret = fdt_appendprop_addrrange(dtb, 0, off,
+   FDT_PROP_IMA_KEXEC_BUFFER,
+   image->arch.ima_buffer_addr,
+   image->arch.ima_buffer_size);
+   if (ret)
+   return (ret == -FDT_ERR_NOSPACE ? -ENOMEM : -EINVAL);
+
+   }
+
/* add kaslr-seed */
ret = fdt_delprop(dtb, off, FDT_PROP_KASLR_SEED);
if (ret == -FDT_ERR_NOTFOUND)
-- 
2.25.1



[V2 PATCH 2/3] dt-bindings: chosen: Document ima-kexec-buffer

2020-06-18 Thread Prakhar Srivastava
Integrity Measurement Architecture (IMA) validates whether files
have been accidentally or maliciously altered, both remotely and
locally, appraises a file's measurement against a "good" value stored
as an extended attribute, and enforces local file integrity.

IMA also measures the signatures of the kernel and initrd during kexec,
along with the command line used for kexec.
These measurements are critical to verifying the security posture of the OS.

Reserving memory and adding the memory information to a device tree node
acts as the mechanism to carry over IMA measurement logs.

Update the devicetree documentation to reflect the addition of the new
property under the chosen node.

---
 Documentation/devicetree/bindings/chosen.txt | 17 +
 1 file changed, 17 insertions(+)

diff --git a/Documentation/devicetree/bindings/chosen.txt 
b/Documentation/devicetree/bindings/chosen.txt
index 45e79172a646..a15f70c007ef 100644
--- a/Documentation/devicetree/bindings/chosen.txt
+++ b/Documentation/devicetree/bindings/chosen.txt
@@ -135,3 +135,20 @@ e.g.
linux,initrd-end = <0x8280>;
};
 };
+
+linux,ima-kexec-buffer
+--
+
+This property (currently used by powerpc and arm64) holds the memory range,
+the address and the size, of the IMA measurement logs that are being carried
+over to the kexec session.
+
+/ {
+   chosen {
+   linux,ima-kexec-buffer = <0x9 0x8200 0x0 0x8000>;
+   };
+};
+
+This property does not represent real hardware, but the memory allocated for
+carrying the IMA measurement logs. The address and the size are expressed in
+#address-cells and #size-cells, respectively, of the root node.
-- 
2.25.1



[V2 PATCH 0/3] Adding support for carrying IMA measurement logs

2020-06-18 Thread Prakhar Srivastava
Integrity Measurement Architecture (IMA), during kexec (kexec file load),
verifies the kernel signature and measures the signature of the kernel.

The signature in the measurement logs is used to verify the
authenticity of the kernel in the subsequent kexec'd session; however, in
the current implementation IMA measurement logs are not carried over, thus
remote attestation cannot verify the signature of the running kernel.

Add support to arm64 to carry over the IMA measurement logs over kexec.

Add a new chosen node entry, linux,ima-kexec-buffer, to hold the address and
the size of the memory reserved to carry the IMA measurement log.
Refactor existing powerpc code to be used by arm64 as well.

Changelog:

v2:
  Break the series into separate patches:
  - powerpc-related refactoring
  - updating the documentation for the chosen node
  - updating arm64 to support IMA buffer pass

v1:
  Refactor carrying over IMA measurement logs over kexec. This patch
moves the non-architecture specific code out of powerpc and adds it to
security/ima. (Suggested by Thiago)
  Add documentation regarding the ima-kexec-buffer node in the chosen
node documentation.

v0:
  Add a layer of abstraction to use the memory reserved by device tree
for ima buffer pass.
  Add support for ima buffer pass using reserved memory for arm64 kexec.
Update the arch sepcific code path in kexec file load to store the
ima buffer in the reserved memory. The same reserved memory is read
on kexec or cold boot.

Prakhar Srivastava (3):
  Refactoring powerpc code for carrying over IMA measurement logs, to
move non architecture specific code to security/ima.
  dt-bindings: chosen: Document ima-kexec-buffer carrying over IMA
measurement logs over kexec.
  Add support for arm64 to carry over IMA measurement logs

 Documentation/devicetree/bindings/chosen.txt |  17 +++
 arch/arm64/Kconfig   |   1 +
 arch/arm64/include/asm/ima.h |  17 +++
 arch/arm64/include/asm/kexec.h   |   3 +
 arch/arm64/kernel/machine_kexec_file.c   |  47 +--
 arch/powerpc/include/asm/ima.h   |  10 --
 arch/powerpc/kexec/ima.c | 126 ++-
 security/integrity/ima/ima_kexec.c   | 116 +
 8 files changed, 201 insertions(+), 136 deletions(-)
 create mode 100644 arch/arm64/include/asm/ima.h

-- 
2.25.1



[V2 PATCH 1/3] Refactoring powerpc code for carrying over IMA measurement logs, to move non architecture specific code to security/ima.

2020-06-18 Thread Prakhar Srivastava
Powerpc has support to carry over the IMA measurement logs. Refactor the
non-architecture specific code out of arch/powerpc and into security/ima.

The code adds support for reserving and freeing memory for IMA measurement
logs.

---
 arch/powerpc/include/asm/ima.h |  10 ---
 arch/powerpc/kexec/ima.c   | 126 ++---
 security/integrity/ima/ima_kexec.c | 116 ++
 3 files changed, 124 insertions(+), 128 deletions(-)

diff --git a/arch/powerpc/include/asm/ima.h b/arch/powerpc/include/asm/ima.h
index ead488cf3981..c29ec86498f8 100644
--- a/arch/powerpc/include/asm/ima.h
+++ b/arch/powerpc/include/asm/ima.h
@@ -4,15 +4,6 @@
 
 struct kimage;
 
-int ima_get_kexec_buffer(void **addr, size_t *size);
-int ima_free_kexec_buffer(void);
-
-#ifdef CONFIG_IMA
-void remove_ima_buffer(void *fdt, int chosen_node);
-#else
-static inline void remove_ima_buffer(void *fdt, int chosen_node) {}
-#endif
-
 #ifdef CONFIG_IMA_KEXEC
 int arch_ima_add_kexec_buffer(struct kimage *image, unsigned long load_addr,
  size_t size);
@@ -22,7 +13,6 @@ int setup_ima_buffer(const struct kimage *image, void *fdt, 
int chosen_node);
 static inline int setup_ima_buffer(const struct kimage *image, void *fdt,
   int chosen_node)
 {
-   remove_ima_buffer(fdt, chosen_node);
return 0;
 }
 #endif /* CONFIG_IMA_KEXEC */
diff --git a/arch/powerpc/kexec/ima.c b/arch/powerpc/kexec/ima.c
index 720e50e490b6..6054ce91d2a6 100644
--- a/arch/powerpc/kexec/ima.c
+++ b/arch/powerpc/kexec/ima.c
@@ -12,121 +12,6 @@
 #include 
 #include 
 
-static int get_addr_size_cells(int *addr_cells, int *size_cells)
-{
-   struct device_node *root;
-
-   root = of_find_node_by_path("/");
-   if (!root)
-   return -EINVAL;
-
-   *addr_cells = of_n_addr_cells(root);
-   *size_cells = of_n_size_cells(root);
-
-   of_node_put(root);
-
-   return 0;
-}
-
-static int do_get_kexec_buffer(const void *prop, int len, unsigned long *addr,
-  size_t *size)
-{
-   int ret, addr_cells, size_cells;
-
-   ret = get_addr_size_cells(&addr_cells, &size_cells);
-   if (ret)
-   return ret;
-
-   if (len < 4 * (addr_cells + size_cells))
-   return -ENOENT;
-
-   *addr = of_read_number(prop, addr_cells);
-   *size = of_read_number(prop + 4 * addr_cells, size_cells);
-
-   return 0;
-}
-
-/**
- * ima_get_kexec_buffer - get IMA buffer from the previous kernel
- * @addr:  On successful return, set to point to the buffer contents.
- * @size:  On successful return, set to the buffer size.
- *
- * Return: 0 on success, negative errno on error.
- */
-int ima_get_kexec_buffer(void **addr, size_t *size)
-{
-   int ret, len;
-   unsigned long tmp_addr;
-   size_t tmp_size;
-   const void *prop;
-
-   prop = of_get_property(of_chosen, "linux,ima-kexec-buffer", &len);
-   if (!prop)
-   return -ENOENT;
-
-   ret = do_get_kexec_buffer(prop, len, &tmp_addr, &tmp_size);
-   if (ret)
-   return ret;
-
-   *addr = __va(tmp_addr);
-   *size = tmp_size;
-
-   return 0;
-}
-
-/**
- * ima_free_kexec_buffer - free memory used by the IMA buffer
- */
-int ima_free_kexec_buffer(void)
-{
-   int ret;
-   unsigned long addr;
-   size_t size;
-   struct property *prop;
-
-   prop = of_find_property(of_chosen, "linux,ima-kexec-buffer", NULL);
-   if (!prop)
-   return -ENOENT;
-
-   ret = do_get_kexec_buffer(prop->value, prop->length, &addr, &size);
-   if (ret)
-   return ret;
-
-   ret = of_remove_property(of_chosen, prop);
-   if (ret)
-   return ret;
-
-   return memblock_free(addr, size);
-
-}
-
-/**
- * remove_ima_buffer - remove the IMA buffer property and reservation from @fdt
- *
- * The IMA measurement buffer is of no use to a subsequent kernel, so we always
- * remove it from the device tree.
- */
-void remove_ima_buffer(void *fdt, int chosen_node)
-{
-   int ret, len;
-   unsigned long addr;
-   size_t size;
-   const void *prop;
-
-   prop = fdt_getprop(fdt, chosen_node, "linux,ima-kexec-buffer", &len);
-   if (!prop)
-   return;
-
-   ret = do_get_kexec_buffer(prop, len, &addr, &size);
-   fdt_delprop(fdt, chosen_node, "linux,ima-kexec-buffer");
-   if (ret)
-   return;
-
-   ret = delete_fdt_mem_rsv(fdt, addr, size);
-   if (!ret)
-   pr_debug("Removed old IMA buffer reservation.\n");
-}
-
 #ifdef CONFIG_IMA_KEXEC
 /**
  * arch_ima_add_kexec_buffer - do arch-specific steps to add the IMA buffer
@@ -179,13 +64,18 @@ int setup_ima_buffer(const struct kimage *image, void 
*fdt, int chosen_node)
int ret, addr_cells, size_cells, entry_size;
u8 value[16];
 
-   remove_ima_buffer(fdt, chosen_node);
if (!imag