date:20191024

Re: [PATCH] ASoC: fsl_esai: Add spin lock to protect reset and stop

2019-10-24 Thread S.j. Wang



Hi
> 
> On Wed, Oct 23, 2019 at 03:29:49PM +0800, Shengjiu Wang wrote:
> > An xrun may happen at the end of a stream, and
> > trigger->fsl_esai_trigger_stop may be called in the middle of
> > fsl_esai_hw_reset. This may leave the ESAI in a wrong state after stop,
> > and there may be endless xrun interrupts.
> 
> What about fsl_esai_trigger_start? It touches ESAI_xFCR_xFEN bit that is
> being checked in the beginning of fsl_esai_hw_reset.
> 
> Could the scenario below be possible also?
> 
> 1) ESAI TX starts
> 2) Xrun happens to TX
> 3) Starting fsl_esai_hw_reset (enabled[TX] = true; enabled[RX] = false)
> 4) ESAI RX starts
> 5) Finishing fsl_esai_hw_reset (enabled[RX] is still false)
> 
> 
Good catch, this may be possible. Will update in v2.

Best regards
Wang shengjiu
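
The scenario in the quoted steps comes down to the reset path sampling the enable bits and restoring them later, while a trigger changes them in between. A minimal user-space sketch in C, assuming one lock covers both the trigger and the reset path; `struct esai_sim`, `sim_trigger()` and `sim_hw_reset()` are hypothetical names used only for illustration, not the fsl_esai driver code or the eventual v2 patch:

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { TX = 0, RX = 1 };

/* Hypothetical stand-in for the driver state; not the real fsl_esai struct. */
struct esai_sim {
	atomic_flag lock;	/* minimal spin lock */
	bool enabled[2];	/* models the per-direction FIFO enable bit */
	int resets;		/* number of resets performed */
};

static void sim_lock(struct esai_sim *s)
{
	while (atomic_flag_test_and_set_explicit(&s->lock, memory_order_acquire))
		;	/* spin until the lock is free */
}

static void sim_unlock(struct esai_sim *s)
{
	atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

static void sim_init(struct esai_sim *s)
{
	atomic_flag_clear(&s->lock);
	s->enabled[TX] = false;
	s->enabled[RX] = false;
	s->resets = 0;
}

/* Start or stop one direction; takes the same lock as the reset path. */
static void sim_trigger(struct esai_sim *s, int dir, bool on)
{
	sim_lock(s);
	s->enabled[dir] = on;
	sim_unlock(s);
}

/*
 * Snapshot, reset and restore happen inside one critical section, so a
 * start or stop cannot slip in between the snapshot and the restore.
 */
static void sim_hw_reset(struct esai_sim *s)
{
	bool saved[2];

	sim_lock(s);
	saved[TX] = s->enabled[TX];
	saved[RX] = s->enabled[RX];
	s->enabled[TX] = false;		/* the reset clears both directions */
	s->enabled[RX] = false;
	s->resets++;
	s->enabled[TX] = saved[TX];	/* restore what was running */
	s->enabled[RX] = saved[RX];
	sim_unlock(s);
}
```

With the lock held across the whole reset, step 4 of the scenario (RX starting) is ordered either before the snapshot or after the restore, so enabled[RX] cannot be lost.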

Re: [PATCH V7] mm/debug: Add tests validating architecture page table helpers

2019-10-24 Thread Qian Cai




> On Oct 24, 2019, at 11:45 PM, Anshuman Khandual wrote:
> 
> Nothing specific. But just tested this with x86 defconfig with relevant
> configs which are required for this test. Not sure if it involved W=1.

No, it will not. It needs to be run like this:

make W=1 -j 64 2>/tmp/warns

Re: [PATCH] ASoC: fsl_asrc: refine the setting of internal clock divider

2019-10-24 Thread S.j. Wang

Hi
> 
> On Wed, Oct 23, 2019 at 06:25:20AM +0000, S.j. Wang wrote:
> > > On Thu, Oct 17, 2019 at 02:21:08PM +0800, Shengjiu Wang wrote:
> > > > For P2P output, the output divider should align with the output
> > > > sample
> > >
> > > I think we should avoid "P2P" (or "M2M") keyword in the mainline
> > > code as we know M2M will never get merged while somebody working
> > > with the mainline and caring about new feature might be confused.
> >
> > OK. But we are still curious: is there a way to upstream M2M?
> 
> Hmm... I would love to see that happening. Here is an old discussion that
> you may want to take a look at:
> https://mailman.alsa-project.org/pipermail/alsa-devel/2014-May/076797.html
> 
> > > It makes sense to me, yet I feel that the delay at the beginning of
> > > the audio playback might be longer as a compromise. I am okay with
> > > this decision though...
> > >
> > > > > The maximum divider of the ASRC clock is 1024, but the driver
> > > > > does not check this limitation, which may cause an incorrect
> > > > > divider setting.
> > > >
> > > > For non-ideal ratio mode, the clock rate should divide the sample
> > > > rate with no remainder, and the quotient should be less than 1024.
> > > >
> > > > Signed-off-by: Shengjiu Wang 
> 
> > > > @@ -351,7 +352,9 @@ static int fsl_asrc_config_pair(struct
> > > > fsl_asrc_pair
> > > *pair)
> > > >   /* We only have output clock for ideal ratio mode */
> > > >   clk = asrc_priv->asrck_clk[clk_index[ideal ? OUT : IN]];
> > > >
> > > > - div[IN] = clk_get_rate(clk) / inrate;
> > > > + clk_rate = clk_get_rate(clk);
> > >
> > > The fsl_asrc.c file has config.inclk being set to INCLK_NONE and
> > > this sets the "ideal" in this function to true. So, although we tend
> > > to not use ideal ratio setting for p2p cases, yet the input clock is
> > > > still not physically connected, so we still use output clock for
> > > > div[IN] calculation?
> >
> > For p2p case, it can be ideal or non-ideal.  For non-ideal, we still
> > use Output clock for div calculation.
> >
> > >
> > > > I am thinking of something simpler: if we decided not to use ideal
> > > > ratio for "P2P", instead of adding "bool p2p" with the confusing
> > > > "ideal" in this function, could we just set config.inclk to the same
> > > > clock as the output one for "P2P"? By doing so, "P2P" won't go
> > > > through ideal ratio mode while still having a clock rate from the
> > > > output clock for div[IN] calculation here.
> >
> > > The p2p bool forces the output rate to be the sample rate; it has no
> > > impact on ideal ratio mode.
> 
> I just realized that the function has a bottom part for ideal mode
> exclusively -- if we treat p2p as !ideal, those configurations will be
> missing. So you're right, we should have an extra boolean variable.
> 
> > >
> > > > + rem[IN] = do_div(clk_rate, inrate);
> > > > + div[IN] = (u32)clk_rate;
> > > >   if (div[IN] == 0) {
> > >
> > > Could we check div[IN] and rem[IN] here? Like:
> > > if (div[IN] == 0 || div[IN] > 1024) {
> > > pair_err();
> > > goto out;
> > > }
> > >
> > > if (!ideal && rem[IN]) {
> > > pair_err();
> > > goto out;
> > > }
> > >
> > > According to your commit log, I think the max-1024 limitation should
> > > be applied to all cases, not confined to "!ideal" cases right? And
> > > we should add some comments also, indicating it is limited by hardware.
> >
> > > For ideal mode, my test result is that the divider does not impact the
> > > output result, which means ideal mode is OK even if the divider is not
> > > correct...
> 
> OK.
> 
> > >
> > > > >   pair_err("failed to support input sample rate %dHz by asrck_%x\n",
> > > > >   inrate, clk_index[ideal ? OUT : IN]);
> > > > > @@ -360,11 +363,20 @@ static int fsl_asrc_config_pair(struct fsl_asrc_pair *pair)
> > > >
> > > >   clk = asrc_priv->asrck_clk[clk_index[OUT]];
> > > >
> > > > - /* Use fixed output rate for Ideal Ratio mode (INCLK_NONE) */
> > > > - if (ideal)
> > > > - div[OUT] = clk_get_rate(clk) / IDEAL_RATIO_RATE;
> > > > - else
> > > > - div[OUT] = clk_get_rate(clk) / outrate;
> > > > + /*
> > > > > +  * When P2P mode, output rate should align with the out samplerate.
> > > > +  * if set too high output rate, there will be lots of Overload.
> > > > +  * When M2M mode, output rate should also need to align with
> > > > + the out
> > >
> > > > For this "should", do you actually mean "M2M could also"? Sorry, I'm
> > > > just trying to understand everything here, not intentionally being
> > > > picky about words.
> > > My understanding
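
The divider rules debated in this thread (quotient within the 1024 hardware limit, and no remainder in non-ideal mode) can be sketched as a small stand-alone C helper. This follows the check suggested in the review rather than the final patch; `asrc_calc_div()` and `ASRC_MAX_DIV` are hypothetical names, and user-space integer types stand in for the kernel's do_div():

```c
#include <stdbool.h>
#include <stdint.h>

#define ASRC_MAX_DIV 1024u	/* hardware limit quoted in the commit log */

/*
 * Compute the clock divider for one direction.
 * Returns 0 and stores the divider in *div, or -1 if the clock cannot
 * produce the requested sample rate under the rules discussed above.
 */
static int asrc_calc_div(uint64_t clk_rate, uint32_t sample_rate,
			 bool ideal, uint32_t *div)
{
	uint64_t quot, rem;

	if (sample_rate == 0)
		return -1;

	quot = clk_rate / sample_rate;
	rem = clk_rate % sample_rate;

	/* The divider must be non-zero and fit the hardware field. */
	if (quot == 0 || quot > ASRC_MAX_DIV)
		return -1;

	/* Outside ideal ratio mode the clock must divide evenly. */
	if (!ideal && rem != 0)
		return -1;

	*div = (uint32_t)quot;
	return 0;
}
```

For example, a 24.576 MHz clock divides evenly into 48 kHz (divider 512) but leaves a remainder for 44.1 kHz, so the latter is only accepted in ideal ratio mode.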

[PATCH] ASoC: fsl: fsl_dma: fix build failure

2019-10-24 Thread Michael Ellerman

Commit 4ac85de9977e ("ASoC: fsl: fsl_dma: remove snd_pcm_ops") removed
fsl_dma_ops but left a usage, leading to a build error for some
configs, eg. mpc85xx_defconfig:

  sound/soc/fsl/fsl_dma.c: In function ‘fsl_soc_dma_probe’:
  sound/soc/fsl/fsl_dma.c:905:18: error: ‘fsl_dma_ops’ undeclared (first use in this function)
    dma->dai.ops = &fsl_dma_ops;
                    ^~~

Remove the usage to fix the build.

Fixes: 4ac85de9977e ("ASoC: fsl: fsl_dma: remove snd_pcm_ops")
Signed-off-by: Michael Ellerman 
---
 sound/soc/fsl/fsl_dma.c | 1 -
 1 file changed, 1 deletion(-)

This breakage is only in linux-next.

diff --git a/sound/soc/fsl/fsl_dma.c b/sound/soc/fsl/fsl_dma.c
index a092726510d4..2868c4f97cb2 100644
--- a/sound/soc/fsl/fsl_dma.c
+++ b/sound/soc/fsl/fsl_dma.c
@@ -901,7 +901,6 @@ static int fsl_soc_dma_probe(struct platform_device *pdev)
}
 
dma->dai.name = DRV_NAME;
-   dma->dai.ops = &fsl_dma_ops;
dma->dai.open = fsl_dma_open;
dma->dai.close = fsl_dma_close;
dma->dai.ioctl = snd_soc_pcm_lib_ioctl;
-- 
2.21.0

[PATCH 10/10] ocxl: Conditionally bind SCM devices to the generic OCXL driver

2019-10-24 Thread Alastair D'Silva

From: Alastair D'Silva 

This patch allows the user to bind OpenCAPI SCM devices to the generic OCXL
driver.

Signed-off-by: Alastair D'Silva 
---
 drivers/misc/ocxl/Kconfig | 7 +++
 drivers/misc/ocxl/pci.c   | 3 +++
 2 files changed, 10 insertions(+)

diff --git a/drivers/misc/ocxl/Kconfig b/drivers/misc/ocxl/Kconfig
index 1916fa65f2f2..8a683715c97c 100644
--- a/drivers/misc/ocxl/Kconfig
+++ b/drivers/misc/ocxl/Kconfig
@@ -29,3 +29,10 @@ config OCXL
  dedicated OpenCAPI link, and don't follow the same protocol.
 
  If unsure, say N.
+
+config OCXL_SCM_GENERIC
+   bool "Treat OpenCAPI Storage Class Memory as a generic OpenCAPI device"
+   default n
+   help
+ Select this option to treat OpenCAPI Storage Class Memory
+ devices as generic OpenCAPI devices.
diff --git a/drivers/misc/ocxl/pci.c b/drivers/misc/ocxl/pci.c
index cb920aa88d3a..7137055c1883 100644
--- a/drivers/misc/ocxl/pci.c
+++ b/drivers/misc/ocxl/pci.c
@@ -10,6 +10,9 @@
  */
 static const struct pci_device_id ocxl_pci_tbl[] = {
{ PCI_DEVICE(PCI_VENDOR_ID_IBM, 0x062B), },
+#ifdef CONFIG_OCXL_SCM_GENERIC
+   { PCI_DEVICE(PCI_VENDOR_ID_IBM, 0x0625), },
+#endif
{ }
 };
 MODULE_DEVICE_TABLE(pci, ocxl_pci_tbl);
-- 
2.21.0

[PATCH 09/10] powerpc: Enable OpenCAPI Storage Class Memory driver on bare metal

2019-10-24 Thread Alastair D'Silva

From: Alastair D'Silva 

Enable OpenCAPI Storage Class Memory driver on bare metal

Signed-off-by: Alastair D'Silva 
---
 arch/powerpc/configs/powernv_defconfig | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/configs/powernv_defconfig b/arch/powerpc/configs/powernv_defconfig
index 6658cceb928c..45c0eff94964 100644
--- a/arch/powerpc/configs/powernv_defconfig
+++ b/arch/powerpc/configs/powernv_defconfig
@@ -352,3 +352,7 @@ CONFIG_KVM_BOOK3S_64=m
 CONFIG_KVM_BOOK3S_64_HV=m
 CONFIG_VHOST_NET=m
 CONFIG_PRINTK_TIME=y
+CONFIG_OCXL_SCM=m
+CONFIG_DEV_DAX=y
+CONFIG_DEV_DAX_PMEM=y
+CONFIG_FS_DAX=y
-- 
2.21.0

[PATCH 08/10] nvdimm: Add driver for OpenCAPI Storage Class Memory

2019-10-24 Thread Alastair D'Silva

From: Alastair D'Silva 

This driver exposes LPC memory on OpenCAPI SCM cards
as an NVDIMM, allowing the existing nvram infrastructure
to be used.

Signed-off-by: Alastair D'Silva 
---
 drivers/nvdimm/Kconfig |   17 +
 drivers/nvdimm/Makefile|3 +
 drivers/nvdimm/ocxl-scm.c  | 2210 
 drivers/nvdimm/ocxl-scm_internal.c |  232 +++
 drivers/nvdimm/ocxl-scm_internal.h |  331 +
 drivers/nvdimm/ocxl-scm_sysfs.c|  219 +++
 include/uapi/linux/ocxl-scm.h  |  128 ++
 mm/memory_hotplug.c|2 +-
 8 files changed, 3141 insertions(+), 1 deletion(-)
 create mode 100644 drivers/nvdimm/ocxl-scm.c
 create mode 100644 drivers/nvdimm/ocxl-scm_internal.c
 create mode 100644 drivers/nvdimm/ocxl-scm_internal.h
 create mode 100644 drivers/nvdimm/ocxl-scm_sysfs.c
 create mode 100644 include/uapi/linux/ocxl-scm.h

diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index 36af7af6b7cf..e4f7b6b08efd 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -130,4 +130,21 @@ config NVDIMM_TEST_BUILD
  core devm_memremap_pages() implementation and other
  infrastructure.
 
+config OCXL_SCM
+   tristate "OpenCAPI Storage Class Memory"
+   depends on LIBNVDIMM
+   select ZONE_DEVICE
+   select OCXL
+   help
+ Exposes devices that implement the OpenCAPI Storage Class Memory
+ specification as persistent memory regions.
+
+ Select N if unsure.
+
+config OCXL_SCM_DEBUG
+   bool "OpenCAPI Storage Class Memory debugging"
+   depends on OCXL_SCM
+   help
+ Enables low level IOCTLs for OpenCAPI SCM firmware development
+
 endif
diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
index 29203f3d3069..43d826397bfc 100644
--- a/drivers/nvdimm/Makefile
+++ b/drivers/nvdimm/Makefile
@@ -6,6 +6,9 @@ obj-$(CONFIG_ND_BLK) += nd_blk.o
 obj-$(CONFIG_X86_PMEM_LEGACY) += nd_e820.o
 obj-$(CONFIG_OF_PMEM) += of_pmem.o
 obj-$(CONFIG_VIRTIO_PMEM) += virtio_pmem.o nd_virtio.o
+obj-$(CONFIG_OCXL_SCM) += ocxlscm.o
+
+ocxlscm-y := ocxl-scm.o ocxl-scm_internal.o ocxl-scm_sysfs.o
 
 nd_pmem-y := pmem.o
 
diff --git a/drivers/nvdimm/ocxl-scm.c b/drivers/nvdimm/ocxl-scm.c
new file mode 100644
index 000000000000..f4e6cc022de8
--- /dev/null
+++ b/drivers/nvdimm/ocxl-scm.c
@@ -0,0 +1,2210 @@
+// SPDX-License-Identifier: GPL-2.0+
+// Copyright 2019 IBM Corp.
+
+/*
+ * A driver for Storage Class Memory, connected via OpenCAPI
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "ocxl-scm_internal.h"
+
+
+static const struct pci_device_id scm_pci_tbl[] = {
+   { PCI_DEVICE(PCI_VENDOR_ID_IBM, 0x0625), },
+   { }
+};
+
+MODULE_DEVICE_TABLE(pci, scm_pci_tbl);
+
+#define SCM_NUM_MINORS 256 // Total to reserve
+#define SCM_USABLE_TIMEOUT 120 // seconds
+
+static dev_t scm_dev;
+static struct class *scm_class;
+static struct mutex minors_idr_lock;
+static struct idr minors_idr;
+
+static const struct attribute_group *scm_pmem_attribute_groups[] = {
+   &nvdimm_bus_attribute_group,
+   NULL,
+};
+
+static const struct attribute_group *scm_pmem_region_attribute_groups[] = {
+   &nd_region_attribute_group,
+   &nd_device_attribute_group,
+   &nd_mapping_attribute_group,
+   &nd_numa_attribute_group,
+   NULL,
+};
+
+/**
+ * scm_ndctl_config_write() - Handle a ND_CMD_SET_CONFIG_DATA command from ndctl
+ * @scm_data: the SCM metadata
+ * @command: the incoming data to write
+ * Return: 0 on success, negative on failure
+ */
+static int scm_ndctl_config_write(struct scm_data *scm_data,
+ struct nd_cmd_set_config_hdr *command)
+{
+   if (command->in_offset + command->in_length > SCM_LABEL_AREA_SIZE)
+   return -EINVAL;
+
+   memcpy_flushcache(scm_data->metadata_addr + command->in_offset, command->in_buf,
+ command->in_length);
+
+   return 0;
+}
+
+/**
+ * scm_ndctl_config_read() - Handle a ND_CMD_GET_CONFIG_DATA command from ndctl
+ * @scm_data: the SCM metadata
+ * @command: the read request
+ * Return: 0 on success, negative on failure
+ */
+static int scm_ndctl_config_read(struct scm_data *scm_data,
+struct nd_cmd_get_config_data_hdr *command)
+{
+   if (command->in_offset + command->in_length > SCM_LABEL_AREA_SIZE)
+   return -EINVAL;
+
+   memcpy(command->out_buf, scm_data->metadata_addr + command->in_offset,
+  command->in_length);
+
+   return 0;
+}
+
+/**
+ * scm_ndctl_config_size() - Handle a ND_CMD_GET_CONFIG_SIZE command from ndctl
+ * @command: the size request, filled in on return
+ * Return: 0 on success, negative on failure
+ */
+static int scm_ndctl_config_size(struct nd_cmd_get_config_size *command)
+{
+   command->status = 0;
+   command->config_size = SCM_LABEL_AREA_SIZE;
+   command->max_xfer = PAGE_SIZE;
+
+   return 0;
+}
+
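
The config read and write handlers above both guard the label area with `in_offset + in_length > SCM_LABEL_AREA_SIZE`. Assuming those header fields are 32-bit unsigned values, the sum can wrap around for large inputs and slip past the comparison. A hedged sketch of a wrap-free form of the same check, with `range_fits()` as a hypothetical helper name that is not part of the posted patch:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if [offset, offset + length) lies inside an area of
 * area_size bytes. The subtraction cannot wrap because offset is
 * checked against area_size first.
 */
static bool range_fits(uint32_t offset, uint32_t length, uint32_t area_size)
{
	return offset <= area_size && length <= area_size - offset;
}
```

A request with offset 0xfffffff0 and length 0x20 sums to 0x10 in 32-bit arithmetic, which the additive form would accept; the subtractive form rejects it.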

[PATCH 07/10] ocxl: Save the device serial number in ocxl_fn

2019-10-24 Thread Alastair D'Silva

From: Alastair D'Silva 

This patch retrieves the serial number of the card and makes it available
to consumers of the ocxl driver via the ocxl_fn struct.

Signed-off-by: Alastair D'Silva 
---
 drivers/misc/ocxl/config.c | 46 ++
 include/misc/ocxl.h|  1 +
 2 files changed, 47 insertions(+)

diff --git a/drivers/misc/ocxl/config.c b/drivers/misc/ocxl/config.c
index fb0c3b6f8312..a9203c309365 100644
--- a/drivers/misc/ocxl/config.c
+++ b/drivers/misc/ocxl/config.c
@@ -71,6 +71,51 @@ static int find_dvsec_afu_ctrl(struct pci_dev *dev, u8 afu_idx)
return 0;
 }
 
+/**
+ * Find a related PCI device (function 0)
+ * @dev: PCI device to match
+ *
+ * Returns a pointer to the related device, or NULL if not found
+ */
+static struct pci_dev *get_function_0(struct pci_dev *dev)
+{
+   unsigned int devfn = PCI_DEVFN(PCI_SLOT(dev->devfn), 0); // Look for function 0
+
+   return pci_get_domain_bus_and_slot(pci_domain_nr(dev->bus),
+   dev->bus->number, devfn);
+}
+
+static void read_serial(struct pci_dev *dev, struct ocxl_fn_config *fn)
+{
+   u32 low, high;
+   int pos;
+
+   pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DSN);
+   if (pos) {
+   pci_read_config_dword(dev, pos + 0x04, &low);
+   pci_read_config_dword(dev, pos + 0x08, &high);
+
+   fn->serial = low | ((u64)high) << 32;
+
+   return;
+   }
+
+   if (PCI_FUNC(dev->devfn) != 0) {
+   struct pci_dev *related = get_function_0(dev);
+
+   if (!related) {
+   fn->serial = 0;
+   return;
+   }
+
+   read_serial(related, fn);
+   pci_dev_put(related);
+   return;
+   }
+
+   fn->serial = 0;
+}
+
 static void read_pasid(struct pci_dev *dev, struct ocxl_fn_config *fn)
 {
u16 val;
@@ -208,6 +253,7 @@ int ocxl_config_read_function(struct pci_dev *dev, struct ocxl_fn_config *fn)
int rc;
 
read_pasid(dev, fn);
+   read_serial(dev, fn);
 
rc = read_dvsec_tl(dev, fn);
if (rc) {
diff --git a/include/misc/ocxl.h b/include/misc/ocxl.h
index 6f7c02f0d5e3..9843051c3c5b 100644
--- a/include/misc/ocxl.h
+++ b/include/misc/ocxl.h
@@ -46,6 +46,7 @@ struct ocxl_fn_config {
int dvsec_afu_info_pos; /* offset of the AFU information DVSEC */
s8 max_pasid_log;
s8 max_afu_index;
+   u64 serial;
 };
 
 enum ocxl_endian {
-- 
2.21.0
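
Two pieces of arithmetic in this patch are easy to check in isolation: the 64-bit serial assembled from two 32-bit config reads, and the devfn of function 0 in the same slot. A small user-space sketch, where `SLOT_OF`, `MAKE_DEVFN`, `dsn_combine()` and `devfn_function0()` are hypothetical names that mirror what the kernel's PCI_SLOT and PCI_DEVFN macros do:

```c
#include <stdint.h>

/* devfn layout: slot in bits 7:3, function in bits 2:0 */
#define SLOT_OF(devfn)		(((devfn) >> 3) & 0x1f)
#define MAKE_DEVFN(slot, func)	((((slot) & 0x1f) << 3) | ((func) & 0x07))

/* Combine the low and high dwords of the Device Serial Number. */
static uint64_t dsn_combine(uint32_t low, uint32_t high)
{
	return (uint64_t)low | ((uint64_t)high << 32);
}

/* Keep the slot, force the function number to 0. */
static unsigned int devfn_function0(unsigned int devfn)
{
	return MAKE_DEVFN(SLOT_OF(devfn), 0);
}
```

The cast to a 64-bit type before the shift matters: shifting a 32-bit value left by 32 would be undefined in C.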

[PATCH 06/10] ocxl: Add functions to map/unmap LPC memory

2019-10-24 Thread Alastair D'Silva

From: Alastair D'Silva 

Add functions to map/unmap LPC memory

Signed-off-by: Alastair D'Silva 
---
 drivers/misc/ocxl/config.c|  4 +++
 drivers/misc/ocxl/core.c  | 50 +++
 drivers/misc/ocxl/ocxl_internal.h |  3 ++
 include/misc/ocxl.h   | 18 +++
 4 files changed, 75 insertions(+)

diff --git a/drivers/misc/ocxl/config.c b/drivers/misc/ocxl/config.c
index c8e19bfb5ef9..fb0c3b6f8312 100644
--- a/drivers/misc/ocxl/config.c
+++ b/drivers/misc/ocxl/config.c
@@ -568,6 +568,10 @@ static int read_afu_lpc_memory_info(struct pci_dev *dev,
afu->special_purpose_mem_size =
total_mem_size - lpc_mem_size;
}
+
+   dev_info(&dev->dev, "Probed LPC memory of %#llx bytes and special purpose memory of %#llx bytes\n",
+   afu->lpc_mem_size, afu->special_purpose_mem_size);
+
return 0;
 }
 
diff --git a/drivers/misc/ocxl/core.c b/drivers/misc/ocxl/core.c
index 2531c6cf19a0..5554f5ce4b9e 100644
--- a/drivers/misc/ocxl/core.c
+++ b/drivers/misc/ocxl/core.c
@@ -210,6 +210,55 @@ static void unmap_mmio_areas(struct ocxl_afu *afu)
release_fn_bar(afu->fn, afu->config.global_mmio_bar);
 }
 
+int ocxl_afu_map_lpc_mem(struct ocxl_afu *afu)
+{
+   struct pci_dev *dev = to_pci_dev(afu->fn->dev.parent);
+
+   if ((afu->config.lpc_mem_size + afu->config.special_purpose_mem_size) == 0)
+   return 0;
+
+   afu->lpc_base_addr = ocxl_link_lpc_map(afu->fn->link, dev);
+   if (afu->lpc_base_addr == 0)
+   return -EINVAL;
+
+   if (afu->config.lpc_mem_size) {
+   afu->lpc_res.start = afu->lpc_base_addr + afu->config.lpc_mem_offset;
+   afu->lpc_res.end = afu->lpc_res.start + afu->config.lpc_mem_size - 1;
+   }
+
+   if (afu->config.special_purpose_mem_size) {
+   afu->special_purpose_res.start = afu->lpc_base_addr +
+       afu->config.special_purpose_mem_offset;
+   afu->special_purpose_res.end = afu->special_purpose_res.start +
+       afu->config.special_purpose_mem_size - 1;
+   }
+
+   return 0;
+}
+EXPORT_SYMBOL(ocxl_afu_map_lpc_mem);
+
+struct resource *ocxl_afu_lpc_mem(struct ocxl_afu *afu)
+{
+   return &afu->lpc_res;
+}
+EXPORT_SYMBOL(ocxl_afu_lpc_mem);
+
+static void unmap_lpc_mem(struct ocxl_afu *afu)
+{
+   struct pci_dev *dev = to_pci_dev(afu->fn->dev.parent);
+
+   if (afu->lpc_res.start || afu->special_purpose_res.start) {
+   void *link = afu->fn->link;
+
+   ocxl_link_lpc_release(link, dev);
+
+   afu->lpc_res.start = 0;
+   afu->lpc_res.end = 0;
+   afu->special_purpose_res.start = 0;
+   afu->special_purpose_res.end = 0;
+   }
+}
+
 static int configure_afu(struct ocxl_afu *afu, u8 afu_idx, struct pci_dev *dev)
 {
int rc;
@@ -251,6 +300,7 @@ static int configure_afu(struct ocxl_afu *afu, u8 afu_idx, struct pci_dev *dev)
 
 static void deconfigure_afu(struct ocxl_afu *afu)
 {
+   unmap_lpc_mem(afu);
unmap_mmio_areas(afu);
reclaim_afu_pasid(afu);
reclaim_afu_actag(afu);
diff --git a/drivers/misc/ocxl/ocxl_internal.h b/drivers/misc/ocxl/ocxl_internal.h
index 20b417e00949..9f4b47900e62 100644
--- a/drivers/misc/ocxl/ocxl_internal.h
+++ b/drivers/misc/ocxl/ocxl_internal.h
@@ -52,6 +52,9 @@ struct ocxl_afu {
void __iomem *global_mmio_ptr;
u64 pp_mmio_start;
void *private;
+   u64 lpc_base_addr; /* Covers both LPC & special purpose memory */
+   struct resource lpc_res;
+   struct resource special_purpose_res;
 };
 
 enum ocxl_context_status {
diff --git a/include/misc/ocxl.h b/include/misc/ocxl.h
index 06dd5839e438..6f7c02f0d5e3 100644
--- a/include/misc/ocxl.h
+++ b/include/misc/ocxl.h
@@ -212,6 +212,24 @@ int ocxl_irq_set_handler(struct ocxl_context *ctx, int irq_id,
 
 // AFU Metadata
 
+/**
+ * Map the LPC system & special purpose memory for an AFU
+ *
+ * Do not call this during device discovery, as there may be multiple
+ * devices on a link, and the memory is mapped for the whole link, not
+ * just one device. It should only be called after all devices have
+ * registered their memory on the link.
+ *
+ * afu: The AFU that has the LPC memory to map
+ */
+extern int ocxl_afu_map_lpc_mem(struct ocxl_afu *afu);
+
+/**
+ * Get the physical address range of LPC memory for an AFU
+ * afu: The AFU associated with the LPC memory
+ */
+extern struct resource *ocxl_afu_lpc_mem(struct ocxl_afu *afu);
+
 /**
  * Get a pointer to the config for an AFU
  *
-- 
2.21.0
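
ocxl_afu_map_lpc_mem() in this patch derives each resource from the link-wide base address plus a per-AFU offset, with an inclusive end address. A minimal sketch of that arithmetic, using a hypothetical `struct mem_range` and `lpc_sub_range()` in place of the kernel's struct resource:

```c
#include <stdint.h>

/* Inclusive address range, like the start/end pair of a struct resource. */
struct mem_range {
	uint64_t start;
	uint64_t end;
};

/* Caller is expected to skip regions of size 0, as the patch does. */
static struct mem_range lpc_sub_range(uint64_t base, uint64_t offset,
				      uint64_t size)
{
	struct mem_range r;

	r.start = base + offset;
	r.end = r.start + size - 1;	/* last valid byte, not one past it */
	return r;
}
```

Because the end is inclusive, a region that begins right after another starts at the previous end plus one.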

[PATCH 05/10] ocxl: Tally up the LPC memory on a link & allow it to be mapped

2019-10-24 Thread Alastair D'Silva

From: Alastair D'Silva 

Tally up the LPC memory on an OpenCAPI link & allow it to be mapped

Signed-off-by: Alastair D'Silva 
---
 drivers/misc/ocxl/core.c  | 10 ++
 drivers/misc/ocxl/link.c  | 60 +++
 drivers/misc/ocxl/ocxl_internal.h | 33 +
 3 files changed, 103 insertions(+)

diff --git a/drivers/misc/ocxl/core.c b/drivers/misc/ocxl/core.c
index b7a09b21ab36..2531c6cf19a0 100644
--- a/drivers/misc/ocxl/core.c
+++ b/drivers/misc/ocxl/core.c
@@ -230,8 +230,18 @@ static int configure_afu(struct ocxl_afu *afu, u8 afu_idx, struct pci_dev *dev)
if (rc)
goto err_free_pasid;
 
+   if (afu->config.lpc_mem_size || afu->config.special_purpose_mem_size) {
+   rc = ocxl_link_add_lpc_mem(afu->fn->link, afu->config.lpc_mem_offset,
+  afu->config.lpc_mem_size +
+  afu->config.special_purpose_mem_size);
+   if (rc)
+   goto err_free_mmio;
+   }
+
return 0;
 
+err_free_mmio:
+   unmap_mmio_areas(afu);
 err_free_pasid:
reclaim_afu_pasid(afu);
 err_free_actag:
diff --git a/drivers/misc/ocxl/link.c b/drivers/misc/ocxl/link.c
index 58d111afd9f6..1d350d0bb860 100644
--- a/drivers/misc/ocxl/link.c
+++ b/drivers/misc/ocxl/link.c
@@ -84,6 +84,11 @@ struct ocxl_link {
int dev;
atomic_t irq_available;
struct spa *spa;
+   struct mutex lpc_mem_lock;
+   u64 lpc_mem_sz; /* Total amount of LPC memory presented on the link */
+   u64 lpc_mem;
+   int lpc_consumers;
+
void *platform_data;
 };
 static struct list_head links_list = LIST_HEAD_INIT(links_list);
@@ -396,6 +401,8 @@ static int alloc_link(struct pci_dev *dev, int PE_mask, struct ocxl_link **out_l
if (rc)
goto err_spa;
 
+   mutex_init(&link->lpc_mem_lock);
+
/* platform specific hook */
rc = pnv_ocxl_spa_setup(dev, link->spa->spa_mem, PE_mask,
&link->platform_data);
@@ -711,3 +718,56 @@ void ocxl_link_free_irq(void *link_handle, int hw_irq)
atomic_inc(&link->irq_available);
 }
 EXPORT_SYMBOL_GPL(ocxl_link_free_irq);
+
+int ocxl_link_add_lpc_mem(void *link_handle, u64 offset, u64 size)
+{
+   struct ocxl_link *link = (struct ocxl_link *) link_handle;
+
+   // Check for overflow
+   if (offset > (offset + size))
+   return -EINVAL;
+
+   mutex_lock(&link->lpc_mem_lock);
+   link->lpc_mem_sz = max(link->lpc_mem_sz, offset + size);
+
+   mutex_unlock(&link->lpc_mem_lock);
+
+   return 0;
+}
+
+u64 ocxl_link_lpc_map(void *link_handle, struct pci_dev *pdev)
+{
+   struct ocxl_link *link = (struct ocxl_link *) link_handle;
+   u64 lpc_mem;
+
+   mutex_lock(&link->lpc_mem_lock);
+   if (link->lpc_mem) {
+   lpc_mem = link->lpc_mem;
+
+   link->lpc_consumers++;
+   mutex_unlock(&link->lpc_mem_lock);
+   return lpc_mem;
+   }
+
+   link->lpc_mem = pnv_ocxl_platform_lpc_setup(pdev, link->lpc_mem_sz);
+   if (link->lpc_mem)
+   link->lpc_consumers++;
+   lpc_mem = link->lpc_mem;
+   mutex_unlock(&link->lpc_mem_lock);
+
+   return lpc_mem;
+}
+
+void ocxl_link_lpc_release(void *link_handle, struct pci_dev *pdev)
+{
+   struct ocxl_link *link = (struct ocxl_link *) link_handle;
+
+   mutex_lock(&link->lpc_mem_lock);
+   link->lpc_consumers--;
+   if (link->lpc_consumers == 0) {
+   pnv_ocxl_platform_lpc_release(pdev);
+   link->lpc_mem = 0;
+   }
+
+   mutex_unlock(&link->lpc_mem_lock);
+}
diff --git a/drivers/misc/ocxl/ocxl_internal.h b/drivers/misc/ocxl/ocxl_internal.h
index 97415afd79f3..20b417e00949 100644
--- a/drivers/misc/ocxl/ocxl_internal.h
+++ b/drivers/misc/ocxl/ocxl_internal.h
@@ -141,4 +141,37 @@ int ocxl_irq_offset_to_id(struct ocxl_context *ctx, u64 offset);
 u64 ocxl_irq_id_to_offset(struct ocxl_context *ctx, int irq_id);
 void ocxl_afu_irq_free_all(struct ocxl_context *ctx);
 
+/**
+ * ocxl_link_add_lpc_mem() - Increment the amount of memory required by an OpenCAPI link
+ *
+ * @link_handle: The OpenCAPI link handle
+ * @offset: The offset of the memory to add
+ * @size: The amount of memory to increment by
+ *
+ * Return 0 on success, negative on overflow
+ */
+int ocxl_link_add_lpc_mem(void *link_handle, u64 offset, u64 size);
+
+/**
+ * ocxl_link_lpc_map() - Map the LPC memory for an OpenCAPI device
+ *
+ * Since LPC memory belongs to a link, the whole LPC memory available
+ * on the link must be mapped in order to make it accessible to a device.
+ *
+ * @link_handle: The OpenCAPI link handle
+ * @pdev: A device that is on the link
+ */
+u64 ocxl_link_lpc_map(void *link_handle, struct pci_dev *pdev);
+
+/**
+ * ocxl_link_lpc_release() - Release the LPC memory device for an OpenCAPI device
+ *
+ * Offlines LPC memory on an OpenCAPI link for a device. If this is the
+ *
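
The link-level bookkeeping in this patch has two parts: tracking the highest `offset + size` seen on the link (with a wrap-around check), and reference-counting the single mapping shared by all devices on the link. A simplified, single-threaded sketch under those assumptions; `struct link_sim` and `fake_platform_setup()` are hypothetical stand-ins, the base address is made up, and the mutex from the patch is omitted:

```c
#include <stdint.h>

/* Hypothetical model of the per-link LPC state; locking omitted. */
struct link_sim {
	uint64_t lpc_mem_sz;	/* highest offset + size seen so far */
	uint64_t lpc_mem;	/* base address once mapped, 0 if unmapped */
	int consumers;		/* number of active mappings */
};

/* Stand-in for the platform call; returns a made-up base address. */
static uint64_t fake_platform_setup(uint64_t size)
{
	return size ? 0x100000000ULL : 0;
}

static int link_add_lpc_mem(struct link_sim *l, uint64_t offset, uint64_t size)
{
	if (offset > offset + size)	/* unsigned wrap-around */
		return -1;
	if (offset + size > l->lpc_mem_sz)
		l->lpc_mem_sz = offset + size;
	return 0;
}

/* The first caller maps the whole link; later callers share the mapping. */
static uint64_t link_lpc_map(struct link_sim *l)
{
	if (!l->lpc_mem)
		l->lpc_mem = fake_platform_setup(l->lpc_mem_sz);
	if (l->lpc_mem)
		l->consumers++;
	return l->lpc_mem;
}

/* The last consumer to release tears the mapping down. */
static void link_lpc_release(struct link_sim *l)
{
	if (l->consumers > 0 && --l->consumers == 0)
		l->lpc_mem = 0;
}
```

Unlike the posted code, this sketch refuses to decrement below zero on an unbalanced release; that guard is an addition for illustration only.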

[PATCH 04/10] powerpc: Map & release OpenCAPI LPC memory

2019-10-24 Thread Alastair D'Silva

From: Alastair D'Silva 

This patch adds platform support to map & release LPC memory.

Signed-off-by: Alastair D'Silva 
---
 arch/powerpc/include/asm/pnv-ocxl.h   |  2 ++
 arch/powerpc/platforms/powernv/ocxl.c | 41 +++
 include/linux/memory_hotplug.h|  5 
 mm/memory_hotplug.c   |  3 +-
 4 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/pnv-ocxl.h b/arch/powerpc/include/asm/pnv-ocxl.h
index 7de82647e761..f8f8ffb48aa8 100644
--- a/arch/powerpc/include/asm/pnv-ocxl.h
+++ b/arch/powerpc/include/asm/pnv-ocxl.h
@@ -32,5 +32,7 @@ extern int pnv_ocxl_spa_remove_pe_from_cache(void *platform_data, int pe_handle)
 
 extern int pnv_ocxl_alloc_xive_irq(u32 *irq, u64 *trigger_addr);
 extern void pnv_ocxl_free_xive_irq(u32 irq);
+extern u64 pnv_ocxl_platform_lpc_setup(struct pci_dev *pdev, u64 size);
+extern void pnv_ocxl_platform_lpc_release(struct pci_dev *pdev);
 
 #endif /* _ASM_PNV_OCXL_H */
diff --git a/arch/powerpc/platforms/powernv/ocxl.c b/arch/powerpc/platforms/powernv/ocxl.c
index 8c65aacda9c8..c6d4234e0aba 100644
--- a/arch/powerpc/platforms/powernv/ocxl.c
+++ b/arch/powerpc/platforms/powernv/ocxl.c
@@ -475,6 +475,47 @@ void pnv_ocxl_spa_release(void *platform_data)
 }
 EXPORT_SYMBOL_GPL(pnv_ocxl_spa_release);
 
+u64 pnv_ocxl_platform_lpc_setup(struct pci_dev *pdev, u64 size)
+{
+   struct pci_controller *hose = pci_bus_to_host(pdev->bus);
+   struct pnv_phb *phb = hose->private_data;
+   u32 bdfn = (pdev->bus->number << 8) | pdev->devfn;
+   u64 base_addr = 0;
+   int rc;
+
+   rc = opal_npu_mem_alloc(phb->opal_id, bdfn, size, &base_addr);
+   if (rc) {
+   dev_warn(&pdev->dev,
+"OPAL could not allocate LPC memory, rc=%d\n", rc);
+   return 0;
+   }
+
+   base_addr = be64_to_cpu(base_addr);
+
+   rc = check_hotplug_memory_addressable(base_addr >> PAGE_SHIFT,
+ size >> PAGE_SHIFT);
+   if (rc)
+   return 0;
+
+   return base_addr;
+}
+EXPORT_SYMBOL_GPL(pnv_ocxl_platform_lpc_setup);
+
+void pnv_ocxl_platform_lpc_release(struct pci_dev *pdev)
+{
+   struct pci_controller *hose = pci_bus_to_host(pdev->bus);
+   struct pnv_phb *phb = hose->private_data;
+   u32 bdfn = (pdev->bus->number << 8) | pdev->devfn;
+   int rc;
+
+   rc = opal_npu_mem_release(phb->opal_id, bdfn);
+   if (rc)
+   dev_warn(&pdev->dev,
+"OPAL reported rc=%d when releasing LPC memory\n", rc);
+}
+EXPORT_SYMBOL_GPL(pnv_ocxl_platform_lpc_release);
+
+
 int pnv_ocxl_spa_remove_pe_from_cache(void *platform_data, int pe_handle)
 {
struct spa_data *data = (struct spa_data *) platform_data;
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index f46ea71b4ffd..3f5f1a642abe 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -339,6 +339,11 @@ static inline int remove_memory(int nid, u64 start, u64 size)
 static inline void __remove_memory(int nid, u64 start, u64 size) {}
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 
+#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
+int check_hotplug_memory_addressable(unsigned long pfn,
+   unsigned long nr_pages);
+#endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
+
 extern void __ref free_area_init_core_hotplug(int nid);
 extern int __add_memory(int nid, u64 start, u64 size);
 extern int add_memory(int nid, u64 start, u64 size);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 2cecf07b396f..b39827dbd071 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -278,7 +278,7 @@ static int check_pfn_span(unsigned long pfn, unsigned long nr_pages,
return 0;
 }
 
-static int check_hotplug_memory_addressable(unsigned long pfn,
+int check_hotplug_memory_addressable(unsigned long pfn,
unsigned long nr_pages)
 {
const u64 max_addr = PFN_PHYS(pfn + nr_pages) - 1;
@@ -294,6 +294,7 @@ static int check_hotplug_memory_addressable(unsigned long pfn,
 
return 0;
 }
+EXPORT_SYMBOL_GPL(check_hotplug_memory_addressable);
 
 /*
  * Reasonably generic function for adding memory.  It is
-- 
2.21.0
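
The idea behind check_hotplug_memory_addressable(), which this patch exports, is to compute the last byte of the range being added and reject it if any bit above the architecture's physical address width is set. A user-space sketch of that check; `check_addressable()` is a hypothetical name, and the page shift and address width are passed as parameters here instead of coming from the kernel's PAGE_SHIFT and MAX_PHYSMEM_BITS:

```c
#include <stdint.h>

/*
 * Return 0 if the page range [pfn, pfn + nr_pages) is addressable with
 * max_physmem_bits of physical address, -1 otherwise.
 * Assumes nr_pages > 0 and that the shift below does not overflow.
 */
static int check_addressable(uint64_t pfn, uint64_t nr_pages,
			     unsigned int page_shift,
			     unsigned int max_physmem_bits)
{
	/* Last byte of the range, like PFN_PHYS(pfn + nr_pages) - 1 */
	const uint64_t max_addr = ((pfn + nr_pages) << page_shift) - 1;

	if (max_addr >> max_physmem_bits)
		return -1;	/* some bit above the limit is set */

	return 0;
}
```

With 64 KiB pages (shift 16) and a 51-bit limit (2 PB), a range that ends exactly at 2 PB passes, and one page more fails.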

[PATCH 03/10] powerpc: Add OPAL calls for LPC memory alloc/release

2019-10-24 Thread Alastair D'Silva

From: Alastair D'Silva 

Add OPAL calls for LPC memory alloc/release

Signed-off-by: Alastair D'Silva 
Acked-by: Andrew Donnellan 
---
 arch/powerpc/include/asm/opal-api.h| 2 ++
 arch/powerpc/include/asm/opal.h| 3 +++
 arch/powerpc/platforms/powernv/opal-call.c | 2 ++
 3 files changed, 7 insertions(+)

diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
index 378e3997845a..2c88c02e69ed 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -208,6 +208,8 @@
 #define OPAL_HANDLE_HMI2   166
 #define OPAL_NX_COPROC_INIT 167
 #define OPAL_XIVE_GET_VP_STATE 170
+#define OPAL_NPU_MEM_ALLOC 171
+#define OPAL_NPU_MEM_RELEASE   172
 #define OPAL_MPIPL_UPDATE  173
 #define OPAL_MPIPL_REGISTER_TAG 174
 #define OPAL_MPIPL_QUERY_TAG   175
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index a0cf8fba4d12..4db135fb54ab 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -39,6 +39,9 @@ int64_t opal_npu_spa_clear_cache(uint64_t phb_id, uint32_t bdfn,
uint64_t PE_handle);
 int64_t opal_npu_tl_set(uint64_t phb_id, uint32_t bdfn, long cap,
uint64_t rate_phys, uint32_t size);
+int64_t opal_npu_mem_alloc(uint64_t phb_id, uint32_t bdfn,
+   uint64_t size, uint64_t *bar);
+int64_t opal_npu_mem_release(uint64_t phb_id, uint32_t bdfn);
 
 int64_t opal_console_write(int64_t term_number, __be64 *length,
   const uint8_t *buffer);
diff --git a/arch/powerpc/platforms/powernv/opal-call.c b/arch/powerpc/platforms/powernv/opal-call.c
index a2aa5e433ac8..27c4b93c774c 100644
--- a/arch/powerpc/platforms/powernv/opal-call.c
+++ b/arch/powerpc/platforms/powernv/opal-call.c
@@ -287,6 +287,8 @@ OPAL_CALL(opal_pci_set_pbcq_tunnel_bar, OPAL_PCI_SET_PBCQ_TUNNEL_BAR);
 OPAL_CALL(opal_sensor_read_u64, OPAL_SENSOR_READ_U64);
 OPAL_CALL(opal_sensor_group_enable, OPAL_SENSOR_GROUP_ENABLE);
 OPAL_CALL(opal_nx_coproc_init, OPAL_NX_COPROC_INIT);
+OPAL_CALL(opal_npu_mem_alloc, OPAL_NPU_MEM_ALLOC);
+OPAL_CALL(opal_npu_mem_release, OPAL_NPU_MEM_RELEASE);
 OPAL_CALL(opal_mpipl_update, OPAL_MPIPL_UPDATE);
 OPAL_CALL(opal_mpipl_register_tag, OPAL_MPIPL_REGISTER_TAG);
 OPAL_CALL(opal_mpipl_query_tag, OPAL_MPIPL_QUERY_TAG);
-- 
2.21.0

[PATCH 02/10] nvdimm: remove prototypes for nonexistent functions

2019-10-24 Thread Alastair D'Silva

From: Alastair D'Silva 

These functions don't exist, so remove the prototypes for them.

Signed-off-by: Alastair D'Silva 
---
 drivers/nvdimm/nd-core.h | 4 
 1 file changed, 4 deletions(-)

diff --git a/drivers/nvdimm/nd-core.h b/drivers/nvdimm/nd-core.h
index 25fa121104d0..9f121a6aeb02 100644
--- a/drivers/nvdimm/nd-core.h
+++ b/drivers/nvdimm/nd-core.h
@@ -124,11 +124,7 @@ void nd_region_create_dax_seed(struct nd_region *nd_region);
 int nvdimm_bus_create_ndctl(struct nvdimm_bus *nvdimm_bus);
 void nvdimm_bus_destroy_ndctl(struct nvdimm_bus *nvdimm_bus);
 void nd_synchronize(void);
-int nvdimm_bus_register_dimms(struct nvdimm_bus *nvdimm_bus);
-int nvdimm_bus_register_regions(struct nvdimm_bus *nvdimm_bus);
-int nvdimm_bus_init_interleave_sets(struct nvdimm_bus *nvdimm_bus);
 void __nd_device_register(struct device *dev);
-int nd_match_dimm(struct device *dev, void *data);
 struct nd_label_id;
 char *nd_label_gen_id(struct nd_label_id *label_id, u8 *uuid, u32 flags);
 bool nd_is_uuid_unique(struct device *dev, u8 *uuid);
-- 
2.21.0

[PATCH 01/10] memory_hotplug: Add a bounds check to __add_pages

2019-10-24 Thread Alastair D'Silva

From: Alastair D'Silva 

On PowerPC, the address ranges allocated to OpenCAPI LPC memory
are allocated from firmware. These address ranges may be higher
than what older kernels permit, as we increased the maximum
permissible address in commit 4ffe713b7587
("powerpc/mm: Increase the max addressable memory to 2PB"). It is
possible that the addressable range may change again in the
future.

In this scenario, we end up with a bogus section returned from
__section_nr (see the discussion on the thread "mm: Trigger bug on
if a section is not found in __section_nr").

Adding a check here means that we fail early and have an
opportunity to handle the error gracefully, rather than rumbling
on and potentially accessing an incorrect section.

Further discussion is also on the thread ("powerpc: Perform a bounds
check in arch_add_memory")
http://lkml.kernel.org/r/20190827052047.31547-1-alast...@au1.ibm.com

Signed-off-by: Alastair D'Silva 
Reviewed-by: David Hildenbrand 
Acked-by: Michal Hocko 
---
 mm/memory_hotplug.c | 21 +
 1 file changed, 21 insertions(+)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index df570e5c71cc..2cecf07b396f 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -278,6 +278,23 @@ static int check_pfn_span(unsigned long pfn, unsigned long nr_pages,
return 0;
 }
 
+static int check_hotplug_memory_addressable(unsigned long pfn,
+   unsigned long nr_pages)
+{
+   const u64 max_addr = PFN_PHYS(pfn + nr_pages) - 1;
+
+   if (max_addr >> MAX_PHYSMEM_BITS) {
+   const u64 max_allowed = (1ull << (MAX_PHYSMEM_BITS + 1)) - 1;
+
+   WARN(1,
+"Hotplugged memory exceeds maximum addressable address, range=%#llx-%#llx, maximum=%#llx\n",
+PFN_PHYS(pfn), max_addr, max_allowed);
+   return -E2BIG;
+   }
+
+   return 0;
+}
+
 /*
  * Reasonably generic function for adding memory.  It is
  * expected that archs that support memory hotplug will
@@ -291,6 +308,10 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
unsigned long nr, start_sec, end_sec;
struct vmem_altmap *altmap = restrictions->altmap;
 
+   err = check_hotplug_memory_addressable(pfn, nr_pages);
+   if (err)
+   return err;
+
if (altmap) {
/*
 * Validate altmap is within bounds of the total request
-- 
2.21.0
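
The bounds check added by this patch can be sketched in user space roughly as
follows. This is only an illustration: DEMO_PAGE_SHIFT and
DEMO_MAX_PHYSMEM_BITS are assumed values (64K pages, 2PB), and
demo_check_addressable() is a hypothetical name, not the kernel function.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed values for illustration only; the kernel takes the real ones
 * from the architecture configuration. */
#define DEMO_PAGE_SHIFT        16	/* 64K pages */
#define DEMO_MAX_PHYSMEM_BITS  51	/* 2PB of addressable memory */

/* Hypothetical user-space analogue of check_hotplug_memory_addressable(). */
static int demo_check_addressable(uint64_t pfn, uint64_t nr_pages)
{
	/* Physical address of the last byte in the hotplugged range. */
	const uint64_t max_addr = ((pfn + nr_pages) << DEMO_PAGE_SHIFT) - 1;

	/* Any bit set at or above DEMO_MAX_PHYSMEM_BITS is out of range. */
	if (max_addr >> DEMO_MAX_PHYSMEM_BITS)
		return -E2BIG;

	return 0;
}
```

With these assumed values, a range ending exactly at the 2PB boundary is
accepted, while a range that extends one page past it is rejected with -E2BIG.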

[PATCH 00/10] Add support for OpenCAPI SCM devices

2019-10-24 Thread Alastair D'Silva

From: Alastair D'Silva 

This series adds support for OpenCAPI SCM devices, exposing
them as nvdimms so that we can make use of the existing
infrastructure.

The first patch (in memory_hotplug) has reviews/acks, but has
not yet made it upstream.

Alastair D'Silva (10):
  memory_hotplug: Add a bounds check to __add_pages
  nvdimm: remove prototypes for nonexistent functions
  powerpc: Add OPAL calls for LPC memory alloc/release
  powerpc: Map & release OpenCAPI LPC memory
  ocxl: Tally up the LPC memory on a link & allow it to be mapped
  ocxl: Add functions to map/unmap LPC memory
  ocxl: Save the device serial number in ocxl_fn
  nvdimm: Add driver for OpenCAPI Storage Class Memory
  powerpc: Enable OpenCAPI Storage Class Memory driver on bare metal
  ocxl: Conditionally bind SCM devices to the generic OCXL driver

 arch/powerpc/configs/powernv_defconfig |4 +
 arch/powerpc/include/asm/opal-api.h|2 +
 arch/powerpc/include/asm/opal.h|3 +
 arch/powerpc/include/asm/pnv-ocxl.h|2 +
 arch/powerpc/platforms/powernv/ocxl.c  |   41 +
 arch/powerpc/platforms/powernv/opal-call.c |2 +
 drivers/misc/ocxl/Kconfig  |7 +
 drivers/misc/ocxl/config.c |   50 +
 drivers/misc/ocxl/core.c   |   60 +
 drivers/misc/ocxl/link.c   |   60 +
 drivers/misc/ocxl/ocxl_internal.h  |   36 +
 drivers/misc/ocxl/pci.c|3 +
 drivers/nvdimm/Kconfig |   17 +
 drivers/nvdimm/Makefile|3 +
 drivers/nvdimm/nd-core.h   |4 -
 drivers/nvdimm/ocxl-scm.c  | 2210 
 drivers/nvdimm/ocxl-scm_internal.c |  232 ++
 drivers/nvdimm/ocxl-scm_internal.h |  331 +++
 drivers/nvdimm/ocxl-scm_sysfs.c|  219 ++
 include/linux/memory_hotplug.h |5 +
 include/misc/ocxl.h|   19 +
 include/uapi/linux/ocxl-scm.h  |  128 ++
 mm/memory_hotplug.c|   22 +
 23 files changed, 3456 insertions(+), 4 deletions(-)
 create mode 100644 drivers/nvdimm/ocxl-scm.c
 create mode 100644 drivers/nvdimm/ocxl-scm_internal.c
 create mode 100644 drivers/nvdimm/ocxl-scm_internal.h
 create mode 100644 drivers/nvdimm/ocxl-scm_sysfs.c
 create mode 100644 include/uapi/linux/ocxl-scm.h

-- 
2.21.0

Re: [PATCH V7] mm/debug: Add tests validating architecture page table helpers

2019-10-24 Thread Anshuman Khandual

On 10/24/2019 10:21 PM, Qian Cai wrote:
> 
> 
>> On Oct 24, 2019, at 10:50 AM, Anshuman Khandual  
>> wrote:
>>
>> Changes in V7:
>>
>> - Memory allocation and free routines for mapped pages have been dropped
>> - Mapped pfns are derived from standard kernel text symbol per Matthew
>> - Moved debug_vm_pgtable() after page_alloc_init_late() per Michal and Qian
>> - Updated the commit message per Michal
>> - Updated W=1 GCC warning problem on x86 per Qian Cai
> 
> It would be interesting to know if you actually tested it out to see if the
> warning went away. As far as I can tell, GCC is quite stubborn there, so I
> am not going to insist.
> 

Nothing specific. I just tested this with an x86 defconfig plus the relevant
configs which are required for this test. Not sure if it involved W=1. The
problem is that there is no other or better way to keep both conditional checks
in place while also reducing the chances of this warning. IMHO both conditional
checks are required.

RE: [PATCH v7 2/3] Documentation: dt: binding: fsl: Add 'little-endian' and update Chassis define

2019-10-24 Thread Ran Wang

Hi Scott,

On Friday, October 25, 2019 02:34, Scott Wood wrote
> 
> On Mon, 2019-10-21 at 11:49 +0800, Ran Wang wrote:
> > By default, QorIQ SoC's RCPM register block is Big Endian. But there
> > are some exceptions, such as LS1088A and LS2088A, which are Little Endian.
> > So add this optional property to help identify them.
> >
> > Actually LS2021A and other Layerscapes won't totally follow Chassis
> > 2.1, so separate them from powerpc SoC.
> 
> Did you mean LS1021A and "don't" instead of "won't", given the change to the
> examples?

OK, I will change it to "don't" to just describe the current situation.
 
> > Change in v5:
> > - Add 'Reviewed-by: Rob Herring ' to commit
> message.
> > - Rename property 'fsl,#rcpm-wakeup-cells' to '#fsl,rcpm-wakeup-
> > cells'.
> > please see https://lore.kernel.org/patchwork/patch/1101022/
> 
> I'm not sure why Rob considers this the "correct form" -- there are other
> examples of the current form, such as ibm,#dma-address-cells and
> ti,#tlb-entries, and the current form makes more logical sense (# is part of
> the property name, not the vendor).  Oh well.
> 
> > Required properites:
> >- reg : Offset and length of the register set of the RCPM block.
> > -  - fsl,#rcpm-wakeup-cells : The number of IPPDEXPCR register cells
> > in the
> > +  - #fsl,rcpm-wakeup-cells : The number of IPPDEXPCR register cells
> > + in the
> > fsl,rcpm-wakeup property.
> >- compatible : Must contain a chip-specific RCPM block compatible string
> > and (if applicable) may contain a chassis-version RCPM compatible @@
> > -20,6 +20,7 @@ Required properites:
> > * "fsl,qoriq-rcpm-1.0": for chassis 1.0 rcpm
> > * "fsl,qoriq-rcpm-2.0": for chassis 2.0 rcpm
> > * "fsl,qoriq-rcpm-2.1": for chassis 2.1 rcpm
> > +   * "fsl,qoriq-rcpm-2.1+": for chassis 2.1+ rcpm
> 
> Is there something actually called "2.1+"?  It looks a bit like an attempt to
> claim compatibility with all future versions.  If the former, is it a name
> that comes from the hardware side with an intent for it to describe a stable
> interface, or are we later going to see a patch changing some
> by-then-existing device trees from "2.1+" to "2.1++" when some new
> incompatibility is found?
>
> Perhaps it would be better to bind to the specific chip compatibles.

According to the SoC data sheets, the PowerPC SoC T1040 and the current
ARM-based Layerscape SoCs (LS1021A, LS1012A, LS1043A, etc.) are both designed
on the basis of Chassis spec 2.1. However, the Layerscape data sheets also
explicitly state that some minor changes have been made relative to the
Chassis 2.1 spec. In parallel, the SW architecture also differs between T1040
and the Layerscape family: for Layerscape, part of the RCPM programming job
has been moved from the kernel driver to firmware/bootloader (through the PSCI
interface). That's why I have to add a new compatible string to distinguish
them: they cannot use the same driver. I don't think we will add another
string like 2.1++ in the future. If the Chassis spec keeps evolving and
requires different programming logic, we can add more, like 3.0, 4.0, ..., I
think.

Regards,
Ran

Re: [PATCH v6 20/30] powerpc/pci: Fix crash with enabled movable BARs

2019-10-24 Thread Alexey Kardashevskiy




On 25/10/2019 04:12, Sergey Miroshnichenko wrote:
> Add a check for the UNSET resource flag to skip the released BARs


Where/why does it crash exactly? It is not extremely clear from the code. 
Thanks,

> 
> CC: Alexey Kardashevskiy 
> CC: Oliver O'Halloran 
> CC: Sam Bobroff 
> Signed-off-by: Sergey Miroshnichenko 
> ---
>  arch/powerpc/platforms/powernv/pci-ioda.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index c28d0d9b7ee0..33d5ed8c258f 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -2976,7 +2976,8 @@ static void pnv_ioda_setup_pe_res(struct pnv_ioda_pe *pe,
>   int index;
>   int64_t rc;
>  
> - if (!res || !res->flags || res->start > res->end)
> + if (!res || !res->flags || res->start > res->end ||
> + (res->flags & IORESOURCE_UNSET))
>   return;
>  
>   if (res->flags & IORESOURCE_IO) {
> 

-- 
Alexey

[PATCH v5 4/4] powerpc: load firmware trusted keys/hashes into kernel keyring

2019-10-24 Thread Nayna Jain

The keys used to verify the Host OS kernel are managed by firmware as
secure variables. This patch loads the verification keys into the .platform
keyring and revocation hashes into .blacklist keyring. This enables
verification and loading of the kernels signed by the boot time keys which
are trusted by firmware.

Signed-off-by: Nayna Jain 
Reviewed-by: Mimi Zohar 
---
 arch/powerpc/Kconfig  |  1 +
 security/integrity/Kconfig|  8 ++
 security/integrity/Makefile   |  4 +-
 .../integrity/platform_certs/load_powerpc.c   | 86 +++
 4 files changed, 98 insertions(+), 1 deletion(-)
 create mode 100644 security/integrity/platform_certs/load_powerpc.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 949e747bc8c2..5d860ed6c901 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -939,6 +939,7 @@ config PPC_SECURE_BOOT
bool
depends on PPC_POWERNV
depends on IMA_ARCH_POLICY
+   select LOAD_PPC_KEYS
help
  Systems with firmware secure boot enabled need to define security
  policies to extend secure boot to the OS. This config allows a user
diff --git a/security/integrity/Kconfig b/security/integrity/Kconfig
index 0bae6adb63a9..26abee23e4e3 100644
--- a/security/integrity/Kconfig
+++ b/security/integrity/Kconfig
@@ -72,6 +72,14 @@ config LOAD_IPL_KEYS
depends on S390
def_bool y
 
+config LOAD_PPC_KEYS
+   bool "Enable loading of platform and blacklisted keys for POWER"
+   depends on INTEGRITY_PLATFORM_KEYRING
+   depends on PPC_SECURE_BOOT
+   help
+ Enable loading of keys to the .platform keyring and blacklisted
+ hashes to the .blacklist keyring for powerpc based platforms.
+
 config INTEGRITY_AUDIT
bool "Enables integrity auditing support "
depends on AUDIT
diff --git a/security/integrity/Makefile b/security/integrity/Makefile
index 351c9662994b..7ee39d66cf16 100644
--- a/security/integrity/Makefile
+++ b/security/integrity/Makefile
@@ -14,6 +14,8 @@ integrity-$(CONFIG_LOAD_UEFI_KEYS) += platform_certs/efi_parser.o \
  platform_certs/load_uefi.o \
  platform_certs/keyring_handler.o
 integrity-$(CONFIG_LOAD_IPL_KEYS) += platform_certs/load_ipl_s390.o
-
+integrity-$(CONFIG_LOAD_PPC_KEYS) += platform_certs/efi_parser.o \
+ platform_certs/load_powerpc.o \
+ platform_certs/keyring_handler.o
 obj-$(CONFIG_IMA)  += ima/
 obj-$(CONFIG_EVM)  += evm/
diff --git a/security/integrity/platform_certs/load_powerpc.c b/security/integrity/platform_certs/load_powerpc.c
new file mode 100644
index 000000000000..83d99cde5376
--- /dev/null
+++ b/security/integrity/platform_certs/load_powerpc.c
@@ -0,0 +1,86 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 IBM Corporation
+ * Author: Nayna Jain
+ *
+ *  - loads keys and hashes stored and controlled by the firmware.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "keyring_handler.h"
+
+/*
+ * Get a certificate list blob from the named secure variable.
+ */
+static __init void *get_cert_list(u8 *key, unsigned long keylen, uint64_t *size)
+{
+   int rc;
+   void *db;
+
+   rc = secvar_ops->get(key, keylen, NULL, size);
+   if (rc) {
+   pr_err("Couldn't get size: %d\n", rc);
+   return NULL;
+   }
+
+   db = kmalloc(*size, GFP_KERNEL);
+   if (!db)
+   return NULL;
+
+   rc = secvar_ops->get(key, keylen, db, size);
+   if (rc) {
+   kfree(db);
+   pr_err("Error reading db var: %d\n", rc);
+   return NULL;
+   }
+
+   return db;
+}
+
+/*
+ * Load the certs contained in the keys databases into the platform trusted
+ * keyring and the blacklisted X.509 cert SHA256 hashes into the blacklist
+ * keyring.
+ */
+static int __init load_powerpc_certs(void)
+{
+   void *db = NULL, *dbx = NULL;
+   uint64_t dbsize = 0, dbxsize = 0;
+   int rc = 0;
+
+   if (!secvar_ops)
+   return -ENODEV;
+
+   /* Get db, and dbx.  They might not exist, so it isn't
+* an error if we can't get them.
+*/
+   db = get_cert_list("db", 3, &dbsize);
+   if (!db) {
+   pr_err("Couldn't get db list from firmware\n");
+   } else {
+   rc = parse_efi_signature_list("powerpc:db", db, dbsize,
+ get_handler_for_db);
+   if (rc)
+   pr_err("Couldn't parse db signatures: %d\n", rc);
+   kfree(db);
+   }
+
+   dbx = get_cert_list("dbx", 3,  &dbxsize);
+   if (!dbx) {
+   pr_info("Couldn't get dbx list from firmware\n");
+   } else {
+   rc =
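
The get_cert_list() helper above relies on a two-call pattern: the first
get() call passes a NULL buffer to learn the size, the second call fills an
allocated buffer. A minimal user-space sketch of that pattern follows.
demo_get() is a hypothetical stand-in for secvar_ops->get() with made-up
contents and return codes; it does not reflect the OPAL implementation.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Made-up variable contents, for illustration only. */
static const uint8_t demo_value[] = { 0xde, 0xad, 0xbe, 0xef };

/* Hypothetical stand-in for secvar_ops->get(): with data == NULL it only
 * reports the size; otherwise it copies the value into the buffer. */
static int demo_get(const char *key, uint64_t key_len, uint8_t *data,
		    uint64_t *data_size)
{
	if (key_len != 3 || memcmp(key, "db", 3) != 0)
		return -2;	/* no such variable (assumed error code) */

	if (data) {
		if (*data_size < sizeof(demo_value))
			return -1;	/* buffer too small (assumed error code) */
		memcpy(data, demo_value, sizeof(demo_value));
	}
	*data_size = sizeof(demo_value);
	return 0;
}

/* Query the size first, then allocate and read the value. */
static void *demo_get_blob(const char *key, uint64_t key_len, uint64_t *size)
{
	uint8_t *buf;

	if (demo_get(key, key_len, NULL, size))
		return NULL;

	buf = malloc(*size);
	if (!buf)
		return NULL;

	if (demo_get(key, key_len, buf, size)) {
		free(buf);
		return NULL;
	}
	return buf;
}
```

The caller owns the returned buffer and frees it after parsing, which mirrors
the kfree(db) in load_powerpc_certs().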

[PATCH v5 3/4] x86/efi: move common keyring handler functions to new file

2019-10-24 Thread Nayna Jain

The handlers to add the keys to the .platform keyring and blacklisted
hashes to the .blacklist keyring are common to both the uefi and powerpc
mechanisms of loading the keys/hashes from the firmware.

This patch moves the common code from load_uefi.c to keyring_handler.c

Signed-off-by: Nayna Jain 
Acked-by: Mimi Zohar 
---
 security/integrity/Makefile   |  3 +-
 .../platform_certs/keyring_handler.c  | 80 +++
 .../platform_certs/keyring_handler.h  | 32 
 security/integrity/platform_certs/load_uefi.c | 67 +---
 4 files changed, 115 insertions(+), 67 deletions(-)
 create mode 100644 security/integrity/platform_certs/keyring_handler.c
 create mode 100644 security/integrity/platform_certs/keyring_handler.h

diff --git a/security/integrity/Makefile b/security/integrity/Makefile
index 35e6ca773734..351c9662994b 100644
--- a/security/integrity/Makefile
+++ b/security/integrity/Makefile
@@ -11,7 +11,8 @@ integrity-$(CONFIG_INTEGRITY_SIGNATURE) += digsig.o
 integrity-$(CONFIG_INTEGRITY_ASYMMETRIC_KEYS) += digsig_asymmetric.o
 integrity-$(CONFIG_INTEGRITY_PLATFORM_KEYRING) += platform_certs/platform_keyring.o
 integrity-$(CONFIG_LOAD_UEFI_KEYS) += platform_certs/efi_parser.o \
-   platform_certs/load_uefi.o
+ platform_certs/load_uefi.o \
+ platform_certs/keyring_handler.o
 integrity-$(CONFIG_LOAD_IPL_KEYS) += platform_certs/load_ipl_s390.o
 
 obj-$(CONFIG_IMA)  += ima/
diff --git a/security/integrity/platform_certs/keyring_handler.c b/security/integrity/platform_certs/keyring_handler.c
new file mode 100644
index 000000000000..c5ba695c10e3
--- /dev/null
+++ b/security/integrity/platform_certs/keyring_handler.c
@@ -0,0 +1,80 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "../integrity.h"
+
+static efi_guid_t efi_cert_x509_guid __initdata = EFI_CERT_X509_GUID;
+static efi_guid_t efi_cert_x509_sha256_guid __initdata =
+   EFI_CERT_X509_SHA256_GUID;
+static efi_guid_t efi_cert_sha256_guid __initdata = EFI_CERT_SHA256_GUID;
+
+/*
+ * Blacklist a hash.
+ */
+static __init void uefi_blacklist_hash(const char *source, const void *data,
+  size_t len, const char *type,
+  size_t type_len)
+{
+   char *hash, *p;
+
+   hash = kmalloc(type_len + len * 2 + 1, GFP_KERNEL);
+   if (!hash)
+   return;
+   p = memcpy(hash, type, type_len);
+   p += type_len;
+   bin2hex(p, data, len);
+   p += len * 2;
+   *p = 0;
+
+   mark_hash_blacklisted(hash);
+   kfree(hash);
+}
+
+/*
+ * Blacklist an X509 TBS hash.
+ */
+static __init void uefi_blacklist_x509_tbs(const char *source,
+  const void *data, size_t len)
+{
+   uefi_blacklist_hash(source, data, len, "tbs:", 4);
+}
+
+/*
+ * Blacklist the hash of an executable.
+ */
+static __init void uefi_blacklist_binary(const char *source,
+const void *data, size_t len)
+{
+   uefi_blacklist_hash(source, data, len, "bin:", 4);
+}
+
+/*
+ * Return the appropriate handler for particular signature list types found in
+ * the UEFI db and MokListRT tables.
+ */
+__init efi_element_handler_t get_handler_for_db(const efi_guid_t *sig_type)
+{
+   if (efi_guidcmp(*sig_type, efi_cert_x509_guid) == 0)
+   return add_to_platform_keyring;
+   return 0;
+}
+
+/*
+ * Return the appropriate handler for particular signature list types found in
+ * the UEFI dbx and MokListXRT tables.
+ */
+__init efi_element_handler_t get_handler_for_dbx(const efi_guid_t *sig_type)
+{
+   if (efi_guidcmp(*sig_type, efi_cert_x509_sha256_guid) == 0)
+   return uefi_blacklist_x509_tbs;
+   if (efi_guidcmp(*sig_type, efi_cert_sha256_guid) == 0)
+   return uefi_blacklist_binary;
+   return 0;
+}
diff --git a/security/integrity/platform_certs/keyring_handler.h b/security/integrity/platform_certs/keyring_handler.h
new file mode 100644
index 000000000000..2462bfa08fe3
--- /dev/null
+++ b/security/integrity/platform_certs/keyring_handler.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef PLATFORM_CERTS_INTERNAL_H
+#define PLATFORM_CERTS_INTERNAL_H
+
+#include 
+
+void blacklist_hash(const char *source, const void *data,
+   size_t len, const char *type,
+   size_t type_len);
+
+/*
+ * Blacklist an X509 TBS hash.
+ */
+void blacklist_x509_tbs(const char *source, const void *data, size_t len);
+
+/*
+ * Blacklist the hash of an executable.
+ */
+void blacklist_binary(const char *source, const void *data, size_t len);
+
+/*
+ * Return the handler for particular signature list types found in the db.
+ */
+efi_element_handler_t
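
The blacklist entry built by uefi_blacklist_hash() is a type prefix ("tbs:"
or "bin:") followed by the hash in lowercase hex. A user-space sketch of that
string construction is shown below; demo_blacklist_desc() is a hypothetical
helper that stands in for the kernel's kmalloc()/bin2hex() sequence and is
not part of the patch.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated "<type><hex of data>"
 * string, or NULL on allocation failure. The caller frees the result. */
static char *demo_blacklist_desc(const char *type, const uint8_t *data,
				 size_t len)
{
	static const char hex[] = "0123456789abcdef";
	size_t type_len = strlen(type);
	char *desc, *p;
	size_t i;

	/* Prefix, two hex digits per byte, and the terminating NUL. */
	desc = malloc(type_len + len * 2 + 1);
	if (!desc)
		return NULL;

	memcpy(desc, type, type_len);
	p = desc + type_len;
	for (i = 0; i < len; i++) {
		*p++ = hex[data[i] >> 4];
		*p++ = hex[data[i] & 0x0f];
	}
	*p = '\0';

	return desc;
}
```

The buffer size follows the same arithmetic as the patch: type_len + len * 2
+ 1 bytes.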

[PATCH v5 2/4] powerpc: expose secure variables to userspace via sysfs

2019-10-24 Thread Nayna Jain

PowerNV secure variables, which store the keys used for OS kernel
verification, are managed by the firmware. These secure variables need to
be accessed by the userspace for addition/deletion of the certificates.

This patch adds the sysfs interface to expose secure variables for PowerNV
secureboot. The users shall use this interface for manipulating
the keys stored in the secure variables.

Signed-off-by: Nayna Jain 
Reviewed-by: Greg Kroah-Hartman 
---
 Documentation/ABI/testing/sysfs-secvar |  39 +
 arch/powerpc/Kconfig   |  11 ++
 arch/powerpc/kernel/Makefile   |   1 +
 arch/powerpc/kernel/secvar-sysfs.c | 228 +
 4 files changed, 279 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-secvar
 create mode 100644 arch/powerpc/kernel/secvar-sysfs.c

diff --git a/Documentation/ABI/testing/sysfs-secvar b/Documentation/ABI/testing/sysfs-secvar
new file mode 100644
index 000000000000..bc0bedf2b662
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-secvar
@@ -0,0 +1,39 @@
+What:  /sys/firmware/secvar
+Date:  August 2019
+Contact:   Nayna Jain 
+Description:   This directory is created if the POWER firmware supports OS
+   secureboot, thereby secure variables. It exposes interface
+   for reading/writing the secure variables
+
+What:  /sys/firmware/secvar/vars
+Date:  August 2019
+Contact:   Nayna Jain 
+Description:   This directory lists all the secure variables that are supported
+   by the firmware.
+
+What:  /sys/firmware/secvar/vars/<variable name>
+Date:  August 2019
+Contact:   Nayna Jain 
+Description:   Each secure variable is represented as a directory named as
+   <variable name>. The variable name is unique and is in ASCII
+   representation. The data and size can be determined by reading
+   their respective attribute files.
+
+What:  /sys/firmware/secvar/vars/<variable name>/size
+Date:  August 2019
+Contact:   Nayna Jain 
+Description:   An integer representation of the size of the content of the
+   variable. In other words, it represents the size of the data.
+
+What:  /sys/firmware/secvar/vars/<variable name>/data
+Date:  August 2019
+Contact:   Nayna Jain
+Description:   A read-only file containing the value of the variable. The size
+   of the file represents the maximum size of the variable data.
+
+What:  /sys/firmware/secvar/vars/<variable name>/update
+Date:  August 2019
+Contact:   Nayna Jain 
+Description:   A write-only file that is used to submit the new value for the
+   variable. The size of the file represents the maximum size of
+   the variable data that can be written.
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index c795039bdc73..949e747bc8c2 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -945,6 +945,17 @@ config PPC_SECURE_BOOT
  to enable OS secure boot on systems that have firmware support for
  it. If in doubt say N.
 
+config PPC_SECVAR_SYSFS
+   tristate "Enable sysfs interface for POWER secure variables"
+   default y
+   depends on PPC_SECURE_BOOT
+   depends on SYSFS
+   help
+ POWER secure variables are managed and controlled by firmware.
+ These variables are exposed to userspace via sysfs to enable
+ read/write operations on these variables. Say Y if you have
+ secure boot enabled and want to expose variables to userspace.
+
 endmenu
 
 config ISA_DMA_API
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 3cf26427334f..b216e9f316ee 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -162,6 +162,7 @@ obj-y   += ucall.o
 endif
 
 obj-$(CONFIG_PPC_SECURE_BOOT)  += secure_boot.o ima_arch.o secvar-ops.o
+obj-$(CONFIG_PPC_SECVAR_SYSFS) += secvar-sysfs.o
 
 # Disable GCOV, KCOV & sanitizers in odd or sensitive code
 GCOV_PROFILE_prom_init.o := n
diff --git a/arch/powerpc/kernel/secvar-sysfs.c b/arch/powerpc/kernel/secvar-sysfs.c
new file mode 100644
index 000000000000..f0c4950649e0
--- /dev/null
+++ b/arch/powerpc/kernel/secvar-sysfs.c
@@ -0,0 +1,228 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2019 IBM Corporation 
+ *
+ * This code exposes secure variables to user via sysfs
+ */
+
+#define pr_fmt(fmt) "secvar-sysfs: "fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define NAME_MAX_SIZE 1024
+
+static struct kobject *secvar_kobj;
+static struct kset *secvar_kset;
+
+static ssize_t size_show(struct kobject *kobj, struct kobj_attribute *attr,
+char *buf)
+{
+   uint64_t dsize;
+   int rc;
+
+   rc = secvar_ops->get(kobj->name, strlen(kobj->name) + 1, NULL, &dsize);
+   if (rc) {
+   pr_err("Error retrieving variable size %d\n", rc);
+   return rc;
+

[PATCH v5 1/4] powerpc/powernv: Add OPAL API interface to access secure variable

2019-10-24 Thread Nayna Jain

The X.509 certificates trusted by the platform and required to secure boot
the OS kernel are wrapped in secure variables, which are controlled by
OPAL.

This patch adds firmware/kernel interface to read and write OPAL secure
variables based on the unique key.

This support can be enabled using CONFIG_OPAL_SECVAR.

Signed-off-by: Claudio Carvalho 
Signed-off-by: Nayna Jain 
---
 arch/powerpc/include/asm/opal-api.h  |   5 +-
 arch/powerpc/include/asm/opal.h  |   7 +
 arch/powerpc/include/asm/secvar.h|  35 +
 arch/powerpc/kernel/Makefile |   2 +-
 arch/powerpc/kernel/secvar-ops.c |  16 +++
 arch/powerpc/platforms/powernv/Makefile  |   2 +-
 arch/powerpc/platforms/powernv/opal-call.c   |   3 +
 arch/powerpc/platforms/powernv/opal-secvar.c | 140 +++
 arch/powerpc/platforms/powernv/opal.c|   3 +
 9 files changed, 210 insertions(+), 3 deletions(-)
 create mode 100644 arch/powerpc/include/asm/secvar.h
 create mode 100644 arch/powerpc/kernel/secvar-ops.c
 create mode 100644 arch/powerpc/platforms/powernv/opal-secvar.c

diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
index 378e3997845a..c1f25a760eb1 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -211,7 +211,10 @@
 #define OPAL_MPIPL_UPDATE  173
 #define OPAL_MPIPL_REGISTER_TAG174
 #define OPAL_MPIPL_QUERY_TAG   175
-#define OPAL_LAST  175
+#define OPAL_SECVAR_GET176
+#define OPAL_SECVAR_GET_NEXT   177
+#define OPAL_SECVAR_ENQUEUE_UPDATE 178
+#define OPAL_LAST  178
 
 #define QUIESCE_HOLD   1 /* Spin all calls at entry */
 #define QUIESCE_REJECT 2 /* Fail all calls with OPAL_BUSY */
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index a0cf8fba4d12..9986ac34b8e2 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -298,6 +298,13 @@ int opal_sensor_group_clear(u32 group_hndl, int token);
 int opal_sensor_group_enable(u32 group_hndl, int token, bool enable);
 int opal_nx_coproc_init(uint32_t chip_id, uint32_t ct);
 
+int opal_secvar_get(const char *key, uint64_t key_len, u8 *data,
+   uint64_t *data_size);
+int opal_secvar_get_next(const char *key, uint64_t *key_len,
+uint64_t key_buf_size);
+int opal_secvar_enqueue_update(const char *key, uint64_t key_len, u8 *data,
+  uint64_t data_size);
+
 s64 opal_mpipl_update(enum opal_mpipl_ops op, u64 src, u64 dest, u64 size);
 s64 opal_mpipl_register_tag(enum opal_mpipl_tags tag, u64 addr);
 s64 opal_mpipl_query_tag(enum opal_mpipl_tags tag, u64 *addr);
diff --git a/arch/powerpc/include/asm/secvar.h b/arch/powerpc/include/asm/secvar.h
new file mode 100644
index 000000000000..4cc35b58b986
--- /dev/null
+++ b/arch/powerpc/include/asm/secvar.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2019 IBM Corporation
+ * Author: Nayna Jain
+ *
+ * PowerPC secure variable operations.
+ */
+#ifndef SECVAR_OPS_H
+#define SECVAR_OPS_H
+
+#include 
+#include 
+
+extern const struct secvar_operations *secvar_ops;
+
+struct secvar_operations {
+   int (*get)(const char *key, uint64_t key_len, u8 *data,
+  uint64_t *data_size);
+   int (*get_next)(const char *key, uint64_t *key_len,
+   uint64_t keybufsize);
+   int (*set)(const char *key, uint64_t key_len, u8 *data,
+  uint64_t data_size);
+};
+
+#ifdef CONFIG_PPC_SECURE_BOOT
+
+extern void set_secvar_ops(const struct secvar_operations *ops);
+
+#else
+
+static inline void set_secvar_ops(const struct secvar_operations *ops) { }
+
+#endif
+
+#endif
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index e8eb2955b7d5..3cf26427334f 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -161,7 +161,7 @@ ifneq ($(CONFIG_PPC_POWERNV)$(CONFIG_PPC_SVM),)
 obj-y  += ucall.o
 endif
 
-obj-$(CONFIG_PPC_SECURE_BOOT)  += secure_boot.o ima_arch.o
+obj-$(CONFIG_PPC_SECURE_BOOT)  += secure_boot.o ima_arch.o secvar-ops.o
 
 # Disable GCOV, KCOV & sanitizers in odd or sensitive code
 GCOV_PROFILE_prom_init.o := n
diff --git a/arch/powerpc/kernel/secvar-ops.c b/arch/powerpc/kernel/secvar-ops.c
new file mode 100644
index 000000000000..4cfa7dbd8850
--- /dev/null
+++ b/arch/powerpc/kernel/secvar-ops.c
@@ -0,0 +1,16 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 IBM Corporation
+ * Author: Nayna Jain
+ *
+ * This file initializes secvar operations for PowerPC Secureboot
+ */
+
+#include 
+
+const struct secvar_operations *secvar_ops;
+
+void set_secvar_ops(const struct secvar_operations *ops)
+{
+   secvar_ops = ops;

[PATCH v5 0/4] powerpc: expose secure variables to the kernel and userspace

2019-10-24 Thread Nayna Jain

In order to verify the OS kernel on PowerNV systems, secure boot requires
X.509 certificates trusted by the platform. These are stored in secure
variables controlled by OPAL, called OPAL secure variables. In order to
enable users to manage the keys, the secure variables need to be exposed
to userspace.

OPAL provides the runtime services for the kernel to be able to access the
secure variables[1]. This patchset defines the kernel interface for the
OPAL APIs. These APIs are used by the hooks, which load these variables
to the keyring and expose them to the userspace for reading/writing.

The previous version[2] of the patchset added support only for the sysfs
interface. This version adds two more patches that involve loading of
the firmware trusted keys to the kernel keyring.

Overall, this patchset adds the following support:

* expose secure variables to the kernel via OPAL Runtime API interface
* expose secure variables to the userspace via kernel sysfs interface
* load kernel verification and revocation keys to .platform and
.blacklist keyring respectively.

The secure variables can be read/written using simple linux utilities
cat/hexdump.

For example:
Path to the secure variables is:
/sys/firmware/secvar/vars

Each secure variable is listed as a directory.
$ ls -l
total 0
drwxr-xr-x. 2 root root 0 Aug 20 21:20 db
drwxr-xr-x. 2 root root 0 Aug 20 21:20 KEK
drwxr-xr-x. 2 root root 0 Aug 20 21:20 PK

The attributes of each of the secure variables are (for example: PK):
[db]$ ls -l
total 0
-r--r--r--. 1 root root  4096 Oct  1 15:10 data
-r--r--r--. 1 root root 65536 Oct  1 15:10 size
--w-------. 1 root root  4096 Oct  1 15:12 update

The "data" is used to read the existing variable value using hexdump. The
data is stored in ESL format.
The "update" is used to write a new value using cat. The update is
to be submitted as AUTH file.

[1] Depends on skiboot OPAL API changes which remove metadata from
the API. https://lists.ozlabs.org/pipermail/skiboot/2019-September/015203.html.
[2] https://lkml.org/lkml/2019/6/13/1644
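
Besides cat/hexdump, a small program can read the attributes directly. The
sketch below only builds the attribute path and reads the raw bytes;
demo_secvar_path() and demo_read_file() are hypothetical helper names, and
the layout is assumed to be exactly the /sys/firmware/secvar/vars tree shown
above.

```c
#include <stdio.h>

#define DEMO_SECVAR_BASE "/sys/firmware/secvar/vars"

/* Build "<base>/<var>/<attr>" into out.
 * Returns 0 on success, -1 if the path does not fit. */
static int demo_secvar_path(char *out, size_t outlen, const char *var,
			    const char *attr)
{
	int n = snprintf(out, outlen, "%s/%s/%s", DEMO_SECVAR_BASE, var, attr);

	return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}

/* Read up to buflen bytes from path.
 * Returns the number of bytes read, or -1 if the file cannot be opened. */
static long demo_read_file(const char *path, unsigned char *buf, size_t buflen)
{
	FILE *f = fopen(path, "rb");
	size_t n;

	if (!f)
		return -1;

	n = fread(buf, 1, buflen, f);
	fclose(f);

	return (long)n;
}
```

A caller would typically build the path for "size" or "data" of a variable
such as db and pass it to demo_read_file(); interpreting the ESL contents is
outside the scope of this sketch.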

Changelog:
v5:
* rebased to v5.4-rc3
* includes Oliver's feedback
  * changed OPAL API as platform driver
  * sysfs are made default enabled and dependent on PPC_SECURE_BOOT
  * fixed code specific changes in both OPAL API and sysfs
  * reading size of the "data" and "update" file from device-tree.  
  * fixed sysfs documentation to also reflect the data and update file
  size interpretation
  * This patchset is no longer dependent on the ima-arch/blacklist patchset

v4:
* rebased to v5.4-rc1 
* uses __BIN_ATTR_WO macro to create binary attribute as suggested by
  Greg
* removed email id from the file header
* renamed argument keysize to keybufsize in get_next() function
* updated default binary file sizes to 0, as firmware handles checking
against the maximum size
* fixed minor formatting issues in Patch 4/4
* added Greg's and Mimi's Reviewed-by and Acked-by

v3:
* includes Greg's feedback:
 * fixes in Patch 2/4
   * updates the Documentation.
   * fixes code feedbacks
* adds SYSFS Kconfig dependency for SECVAR_SYSFS
* fixes mixed tabs and spaces
* removes "name" attribute for each of the variable name based
directories
* fixes using __ATTR_RO() and __BIN_ATTR_RO() and statics and const
* fixes the racing issue by using kobj_type default groups. Also,
fixes the kobject leakage.
* removes extra print messages
  * updates patch description for Patch 3/4
  * removes file name from Patch 4/4 file header comment and removed
  def_bool y from the LOAD_PPC_KEYS Kconfig

* includes Oliver's feedbacks:
  * fixes Patch 1/2
   * moves OPAL API wrappers after opal_nx_proc_init(), fixed the
   naming, types and removed extern.
   * fixes spaces
   * renames get_variable() to get(), get_next_variable() to get_next()
   and set_variable() to set()
   * removed get_secvar_ops() and defined secvar_ops as global
   * fixes consts and statics
   * removes generic secvar_init() and defined platform specific
   opal_secvar_init()
   * updates opal_secvar_supported() to check for secvar support even
   before checking the OPAL APIs support and also fixed the error codes.
   * adds function that converts OPAL return codes to linux errno
   * moves secvar check support in the opal_secvar_init() and defined its
   prototype in opal.h
  * fixes Patch 2/2
   * fixes static/const
   * defines macro for max name size
   * replaces OPAL error codes with linux errno and also updated error
   handling
   * moves secvar support check before creating sysfs kobjects in 
   secvar_sysfs_init()
   * fixes spaces  

v2:
* removes complete efi-sms from the sysfs implementation and is simplified
* includes Greg's and Oliver's feedbacks:
 * adds sysfs documentation
 * moves sysfs code to arch/powerpc
 * other code related feedbacks.
* adds two new patches to load keys to .platform and .blacklist keyring.
These patches are added to this series as they are also dependent on
OPAL APIs.

Nayna Jain (4):

Re: [PATCH] powerpc/boot: Fix the initrd being overwritten under qemu

2019-10-24 Thread Alexey Kardashevskiy




On 25/10/2019 04:45, Segher Boessenkool wrote:
> On Thu, Oct 24, 2019 at 12:31:24PM +1100, Alexey Kardashevskiy wrote:
>>
>>
>> On 23/10/2019 22:21, Segher Boessenkool wrote:
>>> On Wed, Oct 23, 2019 at 12:36:35PM +1100, Oliver O'Halloran wrote:
>>>> When booting under OF the zImage expects the initrd address and size to be
>>>> passed to it using registers r3 and r4. SLOF (guest firmware used by QEMU)
>>>> currently doesn't do this so the zImage is not aware of the initrd
>>>> location.  This can result in initrd corruption either though the zImage
>>>> extracting the vmlinux over the initrd, or by the vmlinux overwriting the
>>>> initrd when relocating itself.
>>>>
>>>> QEMU does put the linux,initrd-start and linux,initrd-end properties into
>>>> the devicetree to vmlinux to find the initrd. We can work around the SLOF
>>>> bug by also looking those properties in the zImage.
>>>
>>> This is not a bug.  What boot protocol requires passing the initrd start
>>> and size in GPR3, GPR4?
>>
>> So far I was unable to identify it...
> 
> Maybe this comes from yaboot?
> https://git.ozlabs.org/?p=yaboot.git;a=blob;f=second/yaboot.c;h=9b66ab44e1be0ee82b88e386a5d0358428766e73;hb=HEAD#l1186

I asked around, a "common practice" was the response :) It's been like this
for ages and it did not come from any OF/PPC binding. It was also noted that
we do not use zImage right - the whole idea was that it is a single binary
blob with vmlinux _and_ initramdisk to point OF at as at the time it could
only deal with single blobs. So having separate zImage and initrd is out of
zImage design scope (some disagreed here).


>>> The CHRP binding (what SLOF implements) requires passing two zeroes here.
>>> And ePAPR requires passing the address of a device tree and a zero, plus
>>> something in GPR6 to allow distinguishing what it does.
>>>
>>> As Alexey says, initramfs works just fine, so please use that?  initrd was
>>> deprecated when this code was written already.
>>
>> I did not say about anything working fine :)
> 
> Yeah, I read that from your words, wrong it seems.  Sorry.  I often used
> INITRAMFS_SOURCE for kernels for use with SLOF, it's just so convenient.
> 
>> In my case I was using a new QEMU which does full FDT on
>> client-arch-support and that thing would put the original
>> linux,initrd-start/end to the FDT even though the initrd was unpacked and
>> the properties were changed in SLOF. With that fixed, this is an
>> alternative fix for SLOF but I am not pushing it out as I have no idea
>> about the bindings and this also breaks "vmlinux".
>>
>>
>> diff --git a/slof/fs/client.fs b/slof/fs/client.fs
>> index 8a7f6ac4326d..138177e4c2a3 100644
>> --- a/slof/fs/client.fs
>> +++ b/slof/fs/client.fs
>> @@ -45,6 +45,17 @@ VARIABLE  client-callback \ Address of client's callback 
>> function
>>>r  ciregs >r7 !  ciregs >r6 !  client-entry-point @ ciregs >r5 !
>>\ Initialise client-stack-pointer
>>cistack ciregs >r1 !
>> +
>> +  s" linux,initrd-end" get-chosen IF decode-int -rot 2drop ELSE 0 THEN
>> +  s" linux,initrd-start" get-chosen IF decode-int -rot 2drop ELSE 0 THEN
>> +  2dup - dup IF
>> +ciregs >r4 !
>> +ciregs >r3 !
>> +drop
>> +  ELSE
>> +3drop
>> +  THEN
> 
> Something like that should work fine.  Do it in go-32 and go-64 though?
> Or is that the wrong spot?


Nah, I was trying a different initramdisk which complained about my test
kernel being too old, after fixing that, it works. I'll post a patch. Thanks,



-- 
Alexey
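
For readers who do not read Forth, the register setup in the quoted SLOF
snippet can be sketched in C roughly as follows. This is only a model of the
stack manipulation shown above: struct initrd_regs and initrd_to_regs() are
made-up names, and a missing property is represented by passing 0, as the
ELSE branches in the Forth do.

```c
#include <stdint.h>

/* Hypothetical container for the two client registers of interest. */
struct initrd_regs {
    uint64_t r3; /* initrd start address, or 0 if no initrd */
    uint64_t r4; /* initrd size in bytes, or 0 if no initrd */
};

/*
 * Model of the quoted Forth: compute size = end - start and only fill in
 * r3/r4 when the size is non-zero. Like the Forth, no ordering check of
 * start versus end is made.
 */
struct initrd_regs initrd_to_regs(uint64_t start, uint64_t end)
{
    struct initrd_regs regs = { 0, 0 };
    uint64_t size = end - start;

    if (size != 0) {
        regs.r3 = start;
        regs.r4 = size;
    }
    return regs;
}
```

With start = 0x1000 and end = 0x3000 this yields r3 = 0x1000 and r4 = 0x2000;
with both properties absent the registers stay zero, which matches what the
CHRP binding mentioned earlier in the thread expects.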

Re: [PATCH 0/7] towards QE support on ARM

2019-10-24 Thread Li Yang

On Tue, Oct 22, 2019 at 9:54 PM Qiang Zhao  wrote:
>
> On 22/10/2019 18:18, Rasmus Villemoes  wrote:
> > -Original Message-
> > From: Rasmus Villemoes 
> > Sent: 2019年10月22日 18:18
> > To: Qiang Zhao ; Leo Li 
> > Cc: Timur Tabi ; Greg Kroah-Hartman
> > ; linux-ker...@vger.kernel.org;
> > linux-ser...@vger.kernel.org; Jiri Slaby ;
> > linuxppc-dev@lists.ozlabs.org; linux-arm-ker...@lists.infradead.org
> > Subject: Re: [PATCH 0/7] towards QE support on ARM
> >
> > On 22/10/2019 04.24, Qiang Zhao wrote:
> > > On Mon, Oct 22, 2019 at 6:11 AM Leo Li wrote
> >
> > >> Right.  I'm really interested in getting this applied to my tree and
> > >> make it upstream.  Zhao Qiang, can you help to review Rasmus's
> > >> patches and comment?
> > >
> > > As you know, I maintained a similar patchset removing PPC, and someone
> > told me qe_ic should moved into drivers/irqchip/.
> > > I also thought qe_ic is a interrupt control driver, should be moved into 
> > > dir
> > irqchip.
> >
> > Yes, and I also plan to do that at some point. However, that's orthogonal to
> > making the driver build on ARM, so I don't want to mix the two. Making it
> > usable on ARM is my/our priority currently.
> >
> > I'd appreciate your input on my patches.
>
> Yes, we can put this patchset in first place, ensure it can build and work on 
> ARM, then push another patchset to move qe_ic.

Right.  I would only accept a patch series that can really build and
work on ARM.  At least the current out-of-tree patches can make it
work on ARM.  If we accept partial changes, there is no way to make it
work on the latest kernel on ARM then.

Regards,
Leo

Re: [PATCH v9 7/8] ima: check against blacklisted hashes for files with modsig

2019-10-24 Thread Lakshmi Ramasubramanian


On 10/23/2019 8:47 PM, Nayna Jain wrote:


+/*
+ * ima_check_blacklist - determine if the binary is blacklisted.
+ *
+ * Add the hash of the blacklisted binary to the measurement list, based
+ * on policy.
+ *
+ * Returns -EPERM if the hash is blacklisted.
+ */
+int ima_check_blacklist(struct integrity_iint_cache *iint,
+   const struct modsig *modsig, int pcr)
+{
+   enum hash_algo hash_algo;
+   const u8 *digest = NULL;
+   u32 digestsize = 0;
+   int rc = 0;
+
+   if (!(iint->flags & IMA_CHECK_BLACKLIST))
+   return 0;
+
+   if (iint->flags & IMA_MODSIG_ALLOWED && modsig) {
+   ima_get_modsig_digest(modsig, &hash_algo, &digest, &digestsize);
+
+   rc = is_binary_blacklisted(digest, digestsize);
+   if ((rc == -EPERM) && (iint->flags & IMA_MEASURE))
+   process_buffer_measurement(digest, digestsize,
+  "blacklisted-hash", NONE,
+  pcr);
+   }


The enum value "NONE" is being passed to process_buffer_measurement to 
indicate that the check for required action based on ima policy is 
already done by ima_check_blacklist. Not sure, but this can cause 
confusion in the future when someone updates process_buffer_measurement.


Would it instead be better to add another parameter to 
process_buffer_measurement to indicate the above condition?


 -lakshmi

Re: [PATCH 1/2] asm-generic: Make msi.h a mandatory include/asm header

2019-10-24 Thread Paul Walmsley

On Thu, 24 Oct 2019, Michal Simek wrote:

> msi.h is generic for all architectures except for x86 which has own version.
> Enabling MSI by including msi.h to architecture Kbuild is just additional
> step which doesn't need to be done.
> The patch was created based on request to enable MSI for Microblaze.
> 
> Suggested-by: Christoph Hellwig 
> Signed-off-by: Michal Simek 
> ---
> 
> https://lore.kernel.org/linux-riscv/20191008154604.ga7...@infradead.org/

[ ... ]

> diff --git a/arch/riscv/include/asm/Kbuild b/arch/riscv/include/asm/Kbuild
> index 16970f246860..1efaeddf1e4b 100644
> --- a/arch/riscv/include/asm/Kbuild
> +++ b/arch/riscv/include/asm/Kbuild
> @@ -22,7 +22,6 @@ generic-y += kvm_para.h
>  generic-y += local.h
>  generic-y += local64.h
>  generic-y += mm-arch-hooks.h
> -generic-y += msi.h
>  generic-y += percpu.h
>  generic-y += preempt.h
>  generic-y += sections.h

Acked-by: Paul Walmsley  # arch/riscv
Tested-by: Paul Walmsley  # build only, rv32/rv64

Thanks Michał,


- Paul

Re: [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)

2019-10-24 Thread David Hildenbrand


On 23.10.19 09:26, David Hildenbrand wrote:

On 22.10.19 23:54, Dan Williams wrote:

Hi David,

Thanks for tackling this!


Thanks for having a look :)

[...]



I am probably a little bit too careful (but I don't want to break things).
In most places (besides KVM and vfio that are nuts), the
pfn_to_online_page() check could most probably be avoided by a
is_zone_device_page() check. However, I usually get suspicious when I see
a pfn_valid() check (especially after I learned that people mmap parts of
/dev/mem into user space, including memory without memmaps. Also, people
could memmap offline memory blocks this way :/). As long as this does not
hurt performance, I think we should rather do it the clean way.


I'm concerned about using is_zone_device_page() in places that are not
known to already have a reference to the page. Here's an audit of
current usages, and the ones I think need to be cleaned up. The "unsafe"
ones do not appear to have any protections against the device page
being removed (get_dev_pagemap()). Yes, some of these were added by
me. The "unsafe? HMM" ones need HMM eyes because HMM leaks device
pages into anonymous memory paths and I'm not up to speed on how it
guarantees 'struct page' validity vs device shutdown without using
get_dev_pagemap().

smaps_pmd_entry(): unsafe

put_devmap_managed_page(): safe, page reference is held

is_device_private_page(): safe? gpu driver manages private page lifetime

is_pci_p2pdma_page(): safe, page reference is held

uncharge_page(): unsafe? HMM

add_to_kill(): safe, protected by get_dev_pagemap() and dax_lock_page()

soft_offline_page(): unsafe

remove_migration_pte(): unsafe? HMM

move_to_new_page(): unsafe? HMM

migrate_vma_pages() and helpers: unsafe? HMM

try_to_unmap_one(): unsafe? HMM

__put_page(): safe

release_pages(): safe

I'm hoping all the HMM ones can be converted to
is_device_private_page() directly and have that routine grow a nice
comment about how it knows it can always safely de-reference its @page
argument.

For the rest I'd like to propose that we add a facility to determine
ZONE_DEVICE by pfn rather than page. The most straightforward way I
can think of would be to just add another bitmap to mem_section_usage
to indicate if a subsection is ZONE_DEVICE or not.


(it's a somewhat unrelated bigger discussion, but we can start discussing it in 
this thread)

I dislike this for three reasons

a) It does not protect against any races, really, it does not improve things.
b) We do have the exact same problem with pfn_to_online_page(). As long as we
don't hold the memory hotplug lock, memory can get offlined and removed at
any time. Racy.
c) We mix in ZONE specific stuff into the core. It should be "just another zone"

What I propose instead (already discussed in 
https://lkml.org/lkml/2019/10/10/87)

1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE
2. Convert SECTION_IS_ACTIVE to a subsection bitmap
3. Introduce pfn_active() that checks against the subsection bitmap
4. Once the memmap was initialized / prepared, set the subsection active
(similar to SECTION_IS_ONLINE in the buddy right now)
5. Before the memmap gets invalidated, set the subsection inactive
(similar to SECTION_IS_ONLINE in the buddy right now)
6. pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE
7. pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE
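
To make the proposal in the list above a bit more concrete, here is a toy
userspace model, not kernel code. The sizes, struct page layout and the
model_add_range()/model_remove_range() helpers are invented for the sketch;
only the idea (a per-subsection "active" bit checked before the zone is read
from the memmap) comes from the list.

```c
#include <stdbool.h>
#include <stddef.h>

enum zone_type { ZONE_NORMAL, ZONE_MOVABLE, ZONE_DEVICE };

#define NR_PFNS             64 /* toy address space */
#define PFNS_PER_SUBSECTION 8  /* toy granularity, not the kernel's */
#define NR_SUBSECTIONS      (NR_PFNS / PFNS_PER_SUBSECTION)

struct page { enum zone_type zone; };

static struct page memmap[NR_PFNS];
static bool subsection_active[NR_SUBSECTIONS];

/* Initialize the memmap first, then mark the subsections active. */
void model_add_range(unsigned long start, unsigned long nr, enum zone_type zone)
{
    unsigned long pfn;

    for (pfn = start; pfn < start + nr && pfn < NR_PFNS; pfn++)
        memmap[pfn].zone = zone;
    for (pfn = start; pfn < start + nr && pfn < NR_PFNS; pfn++)
        subsection_active[pfn / PFNS_PER_SUBSECTION] = true;
}

/* Mark subsections inactive before the memmap would be invalidated. */
void model_remove_range(unsigned long start, unsigned long nr)
{
    unsigned long pfn;

    for (pfn = start; pfn < start + nr && pfn < NR_PFNS; pfn++)
        subsection_active[pfn / PFNS_PER_SUBSECTION] = false;
}

/* True if the memmap for this pfn is initialized and may be touched. */
bool pfn_active(unsigned long pfn)
{
    return pfn < NR_PFNS && subsection_active[pfn / PFNS_PER_SUBSECTION];
}

/* pfn_active() && zone != ZONE_DEVICE */
struct page *pfn_to_online_page(unsigned long pfn)
{
    if (!pfn_active(pfn) || memmap[pfn].zone == ZONE_DEVICE)
        return NULL;
    return &memmap[pfn];
}

/* pfn_active() && zone == ZONE_DEVICE */
struct page *pfn_to_device_page(unsigned long pfn)
{
    if (!pfn_active(pfn) || memmap[pfn].zone != ZONE_DEVICE)
        return NULL;
    return &memmap[pfn];
}
```

Note that the model has no locking at all, so it says nothing about the races
discussed in this thread; it only shows which lookups succeed for which kind
of pfn.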



Dan, I am suspecting that you want a pfn_to_zone() that will not touch 
the memmap, because it could potentially (altmap) lie on slow memory, right?


A modification might make this possible (but I am not yet sure if we 
want a less generic MM implementation just to fine tune slow memmap 
access here)


1. Keep SECTION_IS_ONLINE as it is with the same semantics
2. Introduce a subsection bitmap to record active ("initialized memmap")
   PFNs. E.g., also set it when setting sections online.
3. Introduce pfn_active() that checks against the subsection bitmap
4. Once the memmap was initialized / prepared, set the subsection active
   (similar to SECTION_IS_ONLINE in the buddy right now)
5. Before the memmap gets invalidated, set the subsection inactive
   (similar to SECTION_IS_ONLINE in the buddy right now)
6. pfn_to_online_page() = pfn_active() && section == SECTION_IS_ONLINE
   (or keep it as is, depends on the RCU locking we eventually
   implement)
7. pfn_to_device_page() = pfn_active() && section != SECTION_IS_ONLINE
8. use pfn_active() whenever we don't care about the zone.

Again, not really a friend of that, it hardcodes ZONE_DEVICE vs. 
!ZONE_DEVICE. When we do a random "pfn_to_page()" (e.g., a pfn walker) 
we really want to touch the memmap right away either way. So we can also 
directly read the zone from it. I really do prefer right now a more 
generic implementation.


--

Thanks,

David / dhildenb

[PATCH v1 10/10] mm/usercopy.c: Update comment in check_page_span() regarding ZONE_DEVICE

2019-10-24 Thread David Hildenbrand

ZONE_DEVICE (a.k.a. device memory) is no longer marked PG_reserved. Update
the comment.

While at it, make it match what the code is actually doing (reject vs.
accept).

Cc: Kees Cook 
Cc: Andrew Morton 
Cc: "Isaac J. Manjarres" 
Cc: "Matthew Wilcox (Oracle)" 
Cc: Qian Cai 
Cc: Thomas Gleixner 
Signed-off-by: David Hildenbrand 
---
 mm/usercopy.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/usercopy.c b/mm/usercopy.c
index 660717a1ea5c..80f254024c97 100644
--- a/mm/usercopy.c
+++ b/mm/usercopy.c
@@ -199,9 +199,9 @@ static inline void check_page_span(const void *ptr, 
unsigned long n,
return;
 
/*
-* Reject if range is entirely either Reserved (i.e. special or
-* device memory), or CMA. Otherwise, reject since the object spans
-* several independently allocated pages.
+* Accept if the range is entirely either Reserved ("special") or
+* CMA. Otherwise, reject since the object spans several independently
+* allocated pages.
 */
is_reserved = PageReserved(page);
is_cma = is_migrate_cma_page(page);
-- 
2.21.0
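
The rule the updated comment states can be sketched as a small standalone
predicate. This is a simplified model under assumptions: struct toy_page and
span_allowed() are invented names, the per-page flags stand in for
PageReserved() and is_migrate_cma_page(), and the first page of the span
decides which property all later pages must share.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the two page properties the comment talks about. */
struct toy_page {
    bool reserved; /* models PageReserved(page) */
    bool cma;      /* models is_migrate_cma_page(page) */
};

/*
 * Accept a multi-page span only if it is entirely Reserved or entirely
 * CMA, judged by the first page. Anything else is rejected, since the
 * object would span several independently allocated pages.
 * An empty span is rejected as well.
 */
bool span_allowed(const struct toy_page *pages, size_t n)
{
    bool is_reserved, is_cma;
    size_t i;

    if (n == 0)
        return false;

    is_reserved = pages[0].reserved;
    is_cma = pages[0].cma;
    if (!is_reserved && !is_cma)
        return false;

    for (i = 1; i < n; i++) {
        if (is_reserved && !pages[i].reserved)
            return false;
        if (is_cma && !pages[i].cma)
            return false;
    }
    return true;
}
```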

[PATCH v1 09/10] mm/memory_hotplug: Don't mark pages PG_reserved when initializing the memmap

2019-10-24 Thread David Hildenbrand

Everything should be prepared to stop setting pages PG_reserved when
initializing the memmap on memory hotplug. Most importantly, we
stop marking ZONE_DEVICE pages PG_reserved.

a) We made sure that any code that relied on PG_reserved to detect
   ZONE_DEVICE memory will no longer rely on PG_reserved (especially,
   by relying on pfn_to_online_page() for now). Details can be found
   below.
b) We made sure that memory blocks with holes cannot be offlined and
   therefore also not onlined. We have quite some code that relies on
   memory holes being marked PG_reserved. This is now not an issue
   anymore.

generic_online_page() still calls __free_pages_core(), which performs
__ClearPageReserved(p). AFAIKS, this should not hurt.

It is worth noting that the users of online_page_callback_t might see a
change. E.g., until now, pages not freed to the buddy by the HyperV
balloon were set PG_reserved until freed via generic_online_page(). Now,
they would look like ordinarily allocated pages (refcount == 1). This
callback is used by the XEN balloon and the HyperV balloon. To not
introduce any silent errors, keep marking the pages PG_reserved. We can
most probably stop doing that, but have to double check if there are
issues (e.g., offlining code aborts right away in has_unmovable_pages()
when it runs into a PageReserved(page))

Update the documentation at various places in the MM core.

There are three PageReserved() users that might be affected by this change.
 - drivers/staging/gasket/gasket_page_table.c:gasket_release_page()
   -> We might (unlikely) set SetPageDirty() on a ZONE_DEVICE page
   -> I assume "we don't care"
 - drivers/staging/kpc2000/kpc_dma/fileops.c:transfer_complete_cb()
   -> We might (unlikely) set SetPageDirty() on a ZONE_DEVICE page
   -> I assume "we don't care"
 - mm/usercopy.c: check_page_span()
   -> According to Dan, non-HMM ZONE_DEVICE usage excluded this code since
  commit 52f476a323f9 ("libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY
  overhead")
   -> It is unclear whether we really cared about ZONE_DEVICE here (HMM) or
  simply about "PG_reserved". The worst thing that could happen is a
  false negative with CONFIG_HARDENED_USERCOPY we should be able to
  identify easily.
   -> There is a discussion to rip out that code completely
   -> I assume "not relevant" / "we don't care"

I audited the other PageReserved() users. They don't affect ZONE_DEVICE:
 - mm/page_owner.c:pagetypeinfo_showmixedcount_print()
   -> Never called for ZONE_DEVICE, (+ pfn_to_online_page(pfn))
 - mm/page_owner.c:init_pages_in_zone()
   -> Never called for ZONE_DEVICE (!populated_zone(zone))
 - mm/page_ext.c:free_page_ext()
   -> Only a BUG_ON(PageReserved(page)), not relevant
 - mm/page_ext.c:has_unmovable_pages()
   -> Not relevant for ZONE_DEVICE
 - mm/page_ext.c:pfn_range_valid_contig()
   -> pfn_to_online_page() already guards us
 - mm/mempolicy.c:queue_pages_pte_range()
   -> vm_normal_page() checks against pte_devmap()
 - mm/memory-failure.c:hwpoison_user_mappings()
   -> Not reached via memory_failure() due to pfn_to_online_page()
   -> Also not reached indirectly via memory_failure_hugetlb()
 - mm/hugetlb.c:gather_bootmem_prealloc()
   -> Only a WARN_ON(PageReserved(page)), not relevant
 - kernel/power/snapshot.c:saveable_highmem_page()
   -> pfn_to_online_page() already guards us
 - kernel/power/snapshot.c:saveable_page()
   -> pfn_to_online_page() already guards us
 - fs/proc/task_mmu.c:can_gather_numa_stats()
   -> vm_normal_page() checks against pte_devmap()
 - fs/proc/task_mmu.c:can_gather_numa_stats_pmd()
   -> vm_normal_page_pmd() checks against pte_devmap()
 - fs/proc/page.c:stable_page_flags()
   -> The reserved bit is simply copied, irrelevant
 - drivers/firmware/memmap.c:release_firmware_map_entry()
   -> really only a check to detect bootmem. Not relevant for ZONE_DEVICE
 - arch/ia64/kernel/mca_drv.c
 - arch/mips/mm/init.c
 - arch/mips/mm/ioremap.c
 - arch/nios2/mm/ioremap.c
 - arch/parisc/mm/ioremap.c
 - arch/sparc/mm/tlb.c
 - arch/xtensa/mm/cache.c
   -> No ZONE_DEVICE support
 - arch/powerpc/mm/init_64.c:vmemmap_free()
   -> Special-cases memmap on altmap
   -> Only a check for bootmem
 - arch/x86/kernel/alternative.c:__text_poke()
   -> Only a WARN_ON(!PageReserved(pages[0])) to verify it is bootmem
 - arch/x86/mm/init_64.c
   -> Only a check for bootmem

Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Sasha Levin 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: Andrew Morton 
Cc: Alexander Duyck 
Cc: Pavel Tatashin 
Cc: Vlastimil Babka 
Cc: Johannes Weiner 
Cc: Anthony Yznaga 
Cc: Michal Hocko 
Cc: Oscar Salvador 
Cc: Dan Williams 
Cc: Mel Gorman 
Cc: Mike Rapoport 
Cc: Anshuman Khandual 
Cc: Matt Sickler 
Cc: Kees Cook 
Suggested-by: Michal Hocko 
Signed-off-by: David Hildenbrand 
---
 drivers/hv/hv_balloon.c|  6 ++
 drivers/xen/balloon.c  |  7 +++
 include/linux/page-flags.h |  8 +---

[PATCH v1 08/10] x86/mm: Prepare __ioremap_check_ram() for PG_reserved changes

2019-10-24 Thread David Hildenbrand

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

Rewrite __ioremap_check_ram() to make sure the function produces the
same result once we stop setting ZONE_DEVICE pages PG_reserved.

Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Signed-off-by: David Hildenbrand 
---
 arch/x86/mm/ioremap.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index a39dcdb5ae34..db6913b48edf 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -77,10 +77,17 @@ static unsigned int __ioremap_check_ram(struct resource 
*res)
start_pfn = (res->start + PAGE_SIZE - 1) >> PAGE_SHIFT;
stop_pfn = (res->end + 1) >> PAGE_SHIFT;
if (stop_pfn > start_pfn) {
-   for (i = 0; i < (stop_pfn - start_pfn); ++i)
-   if (pfn_valid(start_pfn + i) &&
-   !PageReserved(pfn_to_page(start_pfn + i)))
+   for (i = 0; i < (stop_pfn - start_pfn); ++i) {
+   struct page *page;
+/*
+ * We treat any pages that are not online (not managed
+ * by the buddy) as not being RAM. This includes
+ * ZONE_DEVICE pages.
+ */
+   page = pfn_to_online_page(start_pfn + i);
+   if (page && !PageReserved(page))
return IORES_MAP_SYSTEM_RAM;
+   }
}
 
return 0;
-- 
2.21.0
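
The scan in the patch above can be modeled in userspace as follows. This is a
sketch under assumptions, not the x86 code: the toy memmap, struct toy_page
and the toy_* function names are invented, and "online" stands in for what
pfn_to_online_page() would report. The pfn rounding follows the context lines
of the diff: the start is rounded up and the end rounded down, so only fully
covered pages are inspected.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define NR_PFNS    16 /* toy address space */

struct toy_page {
    bool online;   /* managed by the buddy; false for ZONE_DEVICE */
    bool reserved; /* models PageReserved(page) */
};

static struct toy_page toy_memmap[NR_PFNS];

/* Toy model of pfn_to_online_page(): NULL unless the page is online. */
static const struct toy_page *toy_pfn_to_online_page(unsigned long pfn)
{
    if (pfn >= NR_PFNS || !toy_memmap[pfn].online)
        return NULL;
    return &toy_memmap[pfn];
}

/*
 * Return true if any page fully covered by [start, end] (inclusive end,
 * as in struct resource) is online and not reserved, i.e. looks like
 * system RAM.
 */
bool toy_range_has_ram(uint64_t start, uint64_t end)
{
    unsigned long start_pfn = (start + PAGE_SIZE - 1) >> PAGE_SHIFT;
    unsigned long stop_pfn = (end + 1) >> PAGE_SHIFT;
    unsigned long pfn;

    for (pfn = start_pfn; pfn < stop_pfn; pfn++) {
        const struct toy_page *page = toy_pfn_to_online_page(pfn);

        if (page && !page->reserved)
            return true;
    }
    return false;
}
```

In this model a page that is not online is never reported as RAM, whether or
not its reserved flag is set, which is the property the patch wants to keep
once ZONE_DEVICE pages stop being marked PG_reserved.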

[PATCH v1 07/10] powerpc/mm: Prepare maybe_pte_to_page() for PG_reserved changes

2019-10-24 Thread David Hildenbrand

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

Rewrite maybe_pte_to_page() to make sure the function produces the
same result once we stop setting ZONE_DEVICE pages PG_reserved.

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Christophe Leroy 
Cc: "Aneesh Kumar K.V" 
Cc: Allison Randal 
Cc: Nicholas Piggin 
Cc: Thomas Gleixner 
Signed-off-by: David Hildenbrand 
---
 arch/powerpc/mm/pgtable.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index e3759b69f81b..613c98fa7dc0 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -55,10 +55,12 @@ static struct page *maybe_pte_to_page(pte_t pte)
unsigned long pfn = pte_pfn(pte);
struct page *page;
 
-   if (unlikely(!pfn_valid(pfn)))
-   return NULL;
-   page = pfn_to_page(pfn);
-   if (PageReserved(page))
+   /*
+* We reject any pages that are not online (not managed by the buddy).
+* This includes ZONE_DEVICE pages.
+*/
+   page = pfn_to_online_page(pfn);
+   if (unlikely(!page || PageReserved(page)))
return NULL;
return page;
 }
-- 
2.21.0

[PATCH v1 06/10] powerpc/64s: Prepare hash_page_do_lazy_icache() for PG_reserved changes

2019-10-24 Thread David Hildenbrand

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

Rewrite hash_page_do_lazy_icache() to make sure the function produces the
same result once we stop setting ZONE_DEVICE pages PG_reserved.

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: "Aneesh Kumar K.V" 
Cc: Christophe Leroy 
Cc: Nicholas Piggin 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: YueHaibing 
Signed-off-by: David Hildenbrand 
---
 arch/powerpc/mm/book3s64/hash_utils.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index 6c123760164e..a1566039e747 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -1084,13 +1084,15 @@ void hash__early_init_mmu_secondary(void)
  */
 unsigned int hash_page_do_lazy_icache(unsigned int pp, pte_t pte, int trap)
 {
-   struct page *page;
+   struct page *page = pfn_to_online_page(pte_pfn(pte));
 
-   if (!pfn_valid(pte_pfn(pte)))
+   /*
+* We ignore any pages that are not online (not managed by the buddy).
+* This includes ZONE_DEVICE pages.
+*/
+   if (!page)
return pp;
 
-   page = pte_page(pte);
-
/* page is dirty */
if (!test_bit(PG_arch_1, &page->flags) && !PageReserved(page)) {
if (trap == 0x400) {
-- 
2.21.0

[PATCH v1 05/10] powerpc/book3s: Prepare kvmppc_book3s_instantiate_page() for PG_reserved changes

2019-10-24 Thread David Hildenbrand

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

KVM has this weird use case that you can map anything from /dev/mem
into the guest. pfn_valid() is not a reliable check whether the memmap
was initialized and can be touched. pfn_to_online_page() makes sure
that we have an initialized memmap (and don't have ZONE_DEVICE memory).

Rewrite kvmppc_book3s_instantiate_page() similar to kvm_is_reserved_pfn()
to make sure the function produces the same result once we stop setting
ZONE_DEVICE pages PG_reserved.

Cc: Paul Mackerras 
Cc: Benjamin Herrenschmidt 
Cc: Michael Ellerman 
Signed-off-by: David Hildenbrand 
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 2d415c36a61d..05397c0561fc 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -801,12 +801,14 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
   writing, upgrade_p);
if (is_error_noslot_pfn(pfn))
return -EFAULT;
-   page = NULL;
-   if (pfn_valid(pfn)) {
-   page = pfn_to_page(pfn);
-   if (PageReserved(page))
-   page = NULL;
-   }
+   /*
+* We treat any pages that are not online (not managed by the
+* buddy) as reserved - this includes ZONE_DEVICE pages and
+* pages without a memmap (e.g., mapped via /dev/mem).
+*/
+   page = pfn_to_online_page(pfn);
+   if (page && PageReserved(page))
+   page = NULL;
}
 
/*
-- 
2.21.0

[PATCH v1 04/10] vfio/type1: Prepare is_invalid_reserved_pfn() for PG_reserved changes

2019-10-24 Thread David Hildenbrand

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

KVM has this weird use case that you can map anything from /dev/mem
into the guest. pfn_valid() is not a reliable check whether the memmap
was initialized and can be touched. pfn_to_online_page() makes sure
that we have an initialized memmap (and don't have ZONE_DEVICE memory).

Rewrite is_invalid_reserved_pfn() similar to kvm_is_reserved_pfn() to make
sure the function produces the same result once we stop setting ZONE_DEVICE
pages PG_reserved.

Cc: Alex Williamson 
Cc: Cornelia Huck 
Signed-off-by: David Hildenbrand 
---
 drivers/vfio/vfio_iommu_type1.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2ada8e6cdb88..f8ce8c408ba8 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -299,9 +299,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long 
npage, bool async)
  */
 static bool is_invalid_reserved_pfn(unsigned long pfn)
 {
-   if (pfn_valid(pfn))
-   return PageReserved(pfn_to_page(pfn));
+   struct page *page = pfn_to_online_page(pfn);
 
+   /*
+* We treat any pages that are not online (not managed by the buddy)
+* as reserved - this includes ZONE_DEVICE pages and pages without
+* a memmap (e.g., mapped via /dev/mem).
+*/
+   if (page)
+   return PageReserved(page);
return true;
 }
 
-- 
2.21.0

[PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes

2019-10-24 Thread David Hildenbrand

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

KVM has this weird use case that you can map anything from /dev/mem
into the guest. pfn_valid() is not a reliable check whether the memmap
was initialized and can be touched. pfn_to_online_page() makes sure
that we have an initialized memmap (and don't have ZONE_DEVICE memory).

Rewrite kvm_is_reserved_pfn() to make sure the function produces the
same result once we stop setting ZONE_DEVICE pages PG_reserved.

Cc: Paolo Bonzini 
Cc: "Radim Krčmář" 
Cc: Michal Hocko 
Cc: Dan Williams 
Cc: KarimAllah Ahmed 
Signed-off-by: David Hildenbrand 
---
 virt/kvm/kvm_main.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e9eb666eb6e8..9d18cc67d124 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -151,9 +151,15 @@ __weak int kvm_arch_mmu_notifier_invalidate_range(struct 
kvm *kvm,
 
 bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
 {
-   if (pfn_valid(pfn))
-   return PageReserved(pfn_to_page(pfn));
+   struct page *page = pfn_to_online_page(pfn);
 
+   /*
+* We treat any pages that are not online (not managed by the buddy)
+* as reserved - this includes ZONE_DEVICE pages and pages without
+* a memmap (e.g., mapped via /dev/mem).
+*/
+   if (page)
+   return PageReserved(page);
return true;
 }
 
-- 
2.21.0
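
The intent of this rewrite (same answer before and after ZONE_DEVICE pages
lose PG_reserved) can be checked with a small truth-table model. Everything
here is a userspace simplification: struct toy_page and the two function
names are invented, "has_memmap" stands in for pfn_valid() and "online" for a
non-NULL pfn_to_online_page().

```c
#include <stdbool.h>

struct toy_page {
    bool has_memmap; /* models pfn_valid(pfn) */
    bool online;     /* models pfn_to_online_page(pfn) != NULL */
    bool reserved;   /* models PageReserved(page) */
};

/* Old logic: trust PG_reserved for anything with a memmap. */
bool old_is_reserved(const struct toy_page *p)
{
    if (p->has_memmap)
        return p->reserved;
    return true;
}

/* New logic: only online pages are looked at; all else is reserved. */
bool new_is_reserved(const struct toy_page *p)
{
    if (p->has_memmap && p->online)
        return p->reserved;
    return true;
}
```

For a ZONE_DEVICE page (has_memmap, not online) the old logic returns true
while PG_reserved is set and false once it is no longer set, whereas the new
logic returns true in both cases. For ordinary online RAM both variants agree.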

Re: [PATCH v9 4/8] powerpc/ima: define trusted boot policy

2019-10-24 Thread Lakshmi Ramasubramanian


On 10/23/2019 8:47 PM, Nayna Jain wrote:


+/*
+ * The "secure_and_trusted_rules" contains rules for both the secure boot and
+ * trusted boot. The "template=ima-modsig" option includes the appended
+ * signature, when available, in the IMA measurement list.
+ */
+static const char *const secure_and_trusted_rules[] = {
+   "measure func=KEXEC_KERNEL_CHECK template=ima-modsig",
+   "measure func=MODULE_CHECK template=ima-modsig",
+   "appraise func=KEXEC_KERNEL_CHECK appraise_type=imasig|modsig",
+#ifndef CONFIG_MODULE_SIG_FORCE
+   "appraise func=MODULE_CHECK appraise_type=imasig|modsig",
+#endif
+   NULL
+};


Same comment as earlier - any way to avoid using conditional compilation 
in C file?


 -lakshmi

Re: [PATCH v9 3/8] powerpc: detect the trusted boot state of the system

2019-10-24 Thread Lakshmi Ramasubramanian


On 10/23/2019 8:47 PM, Nayna Jain wrote:


+bool is_ppc_trustedboot_enabled(void)
+{
+   struct device_node *node;
+   bool enabled = false;
+
+   node = get_ppc_fw_sb_node();
+   enabled = of_property_read_bool(node, "trusted-enabled");


Can get_ppc_fw_sb_node return NULL?
Would of_property_read_bool handle the case when node is NULL?

 -lakshmi

Re: [PATCH v9 2/8] powerpc/ima: add support to initialize ima policy rules

2019-10-24 Thread Lakshmi Ramasubramanian


On 10/23/2019 8:47 PM, Nayna Jain wrote:


+/*
+ * The "secure_rules" are enabled only on "secureboot" enabled systems.
+ * These rules verify the file signatures against known good values.
+ * The "appraise_type=imasig|modsig" option allows the known good signature
+ * to be stored as an xattr or as an appended signature.
+ *
+ * To avoid duplicate signature verification as much as possible, the IMA
+ * policy rule for module appraisal is added only if CONFIG_MODULE_SIG_FORCE
+ * is not enabled.
+ */
+static const char *const secure_rules[] = {
+   "appraise func=KEXEC_KERNEL_CHECK appraise_type=imasig|modsig",
+#ifndef CONFIG_MODULE_SIG_FORCE
+   "appraise func=MODULE_CHECK appraise_type=imasig|modsig",
+#endif
+   NULL
+};


Is there any way to not use conditional compilation in the above array 
definition? Maybe define different functions to get "secure_rules" for 
when CONFIG_MODULE_SIG_FORCE is defined and when it is not defined.

Just a suggestion.

 -lakshmi
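
One possible shape of the suggestion above, as a standalone sketch rather
than a proposed patch: keep two complete rule tables and pick one through a
function. The names get_secure_rules() and count_rules() are hypothetical. In
the kernel the boolean could presumably come from
IS_ENABLED(CONFIG_MODULE_SIG_FORCE) instead of a function argument, but that
is an assumption about how it would be wired up.

```c
#include <stdbool.h>
#include <stddef.h>

/* Rules used when module signatures are not already enforced elsewhere. */
static const char *const rules_with_module_check[] = {
    "appraise func=KEXEC_KERNEL_CHECK appraise_type=imasig|modsig",
    "appraise func=MODULE_CHECK appraise_type=imasig|modsig",
    NULL
};

/* Rules used when module signature checking is forced (no duplicate check). */
static const char *const rules_without_module_check[] = {
    "appraise func=KEXEC_KERNEL_CHECK appraise_type=imasig|modsig",
    NULL
};

/* Select a rule table without preprocessor conditionals in the array. */
const char *const *get_secure_rules(bool module_sig_force)
{
    return module_sig_force ? rules_without_module_check
                            : rules_with_module_check;
}

/* Count entries in a NULL-terminated rule table. */
size_t count_rules(const char *const *rules)
{
    size_t n = 0;

    while (rules[n])
        n++;
    return n;
}
```

The trade-off is that the shared rule string is duplicated across the two
tables, which is the price for removing the #ifndef from the middle of the
array initializer.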

Re: [PATCH v9 1/8] powerpc: detect the secure boot mode of the system

2019-10-24 Thread Lakshmi Ramasubramanian


On 10/23/2019 8:47 PM, Nayna Jain wrote:

This patch defines a function to detect the secure boot state of a
PowerNV system.



+bool is_ppc_secureboot_enabled(void)
+{
+   struct device_node *node;
+   bool enabled = false;
+
+   node = of_find_compatible_node(NULL, NULL, "ibm,secvar-v1");
+   if (!of_device_is_available(node)) {
+   pr_err("Cannot find secure variable node in device tree; failing to 
secure state\n");
+   goto out;


Related to "goto out;" above:

Would of_find_compatible_node return NULL if the given node is not found?

If of_device_is_available returns false (say, because node is NULL or it 
does not find the specified node) would it be correct to call of_node_put?



+
+out:
+   of_node_put(node);


 -lakshmi
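
The question above comes down to whether the "is available" check and the "put" helper tolerate a NULL node. A small user-space model of that control flow, assuming both helpers are NULL-safe (all model_* names are hypothetical and are not the kernel's OF API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a reference-counted device tree node. */
struct model_node {
    int refcount;
    bool available;
};

/* Lookup may fail and return NULL, like a "find" helper that finds nothing. */
static struct model_node *model_find(struct model_node *candidate)
{
    if (candidate)
        candidate->refcount++;
    return candidate;
}

/* Treat a missing node as "not available" instead of dereferencing it. */
static bool model_is_available(const struct model_node *node)
{
    return node && node->available;
}

/* Dropping a reference on NULL is a no-op, so one exit label is enough. */
static void model_put(struct model_node *node)
{
    if (node)
        node->refcount--;
}

/* Mirrors the control flow under review: every path ends in model_put(). */
static bool model_secureboot_enabled(struct model_node *candidate, bool prop)
{
    struct model_node *node = model_find(candidate);
    bool enabled = false;

    if (!model_is_available(node))
        goto out;

    enabled = prop;
out:
    model_put(node);
    return enabled;
}
```

Under that assumption the single "out" label is correct for both the "node not found" and the "node found but disabled" cases, and the reference count stays balanced.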

Re: [PATCH v9 5/8] ima: make process_buffer_measurement() generic

2019-10-24 Thread Lakshmi Ramasubramanian


On 10/23/19 8:47 PM, Nayna Jain wrote:

Hi Nayna,


+void process_buffer_measurement(const void *buf, int size,
+   const char *eventname, enum ima_hooks func,
+   int pcr)
  {
int ret = 0;
struct ima_template_entry *entry = NULL;



+   if (func) {
+   security_task_getsecid(current, &secid);
+   action = ima_get_action(NULL, current_cred(), secid, 0, func,
+   , );
+   if (!(action & IMA_MEASURE))
+   return;
+   }


In your change set process_buffer_measurement is called with NONE for 
the parameter func. So ima_get_action (the above if block) will not be 
executed.


Wouldn't it be better to update ima_get_action (and related functions) to 
handle the ima policy (func param)?


thanks,
 -lakshmi

[PATCH v1 02/10] KVM: x86/mmu: Prepare kvm_is_mmio_pfn() for PG_reserved changes

2019-10-24 Thread David Hildenbrand

Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

KVM has this weird use case that you can map anything from /dev/mem
into the guest. pfn_valid() is not a reliable check whether the memmap
was initialized and can be touched. pfn_to_online_page() makes sure
that we have an initialized memmap (and don't have ZONE_DEVICE memory).

Rewrite kvm_is_mmio_pfn() to make sure the function produces the
same result once we stop setting ZONE_DEVICE pages PG_reserved.

Cc: Paolo Bonzini 
Cc: "Radim Krčmář" 
Cc: Sean Christopherson 
Cc: Vitaly Kuznetsov 
Cc: Wanpeng Li 
Cc: Jim Mattson 
Cc: Joerg Roedel 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: KarimAllah Ahmed 
Cc: Michal Hocko 
Cc: Dan Williams 
Signed-off-by: David Hildenbrand 
---
 arch/x86/kvm/mmu.c | 29 +
 1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 24c23c66b226..f03089a336de 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2962,20 +2962,25 @@ static bool mmu_need_write_protect(struct kvm_vcpu 
*vcpu, gfn_t gfn,
 
 static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
 {
+   struct page *page = pfn_to_online_page(pfn);
+
+   /*
+* ZONE_DEVICE pages are never online. Online pages that are reserved
+* either indicate the zero page or MMIO pages.
+*/
+   if (page)
+   return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
+
+   /*
+* Anything with a valid (but not online) memmap could be ZONE_DEVICE.
+* Treat only UC/UC-/WC pages as MMIO.
+*/
if (pfn_valid(pfn))
-   return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn)) &&
-   /*
-* Some reserved pages, such as those from NVDIMM
-* DAX devices, are not for MMIO, and can be mapped
-* with cached memory type for better performance.
-* However, the above check misconceives those pages
-* as MMIO, and results in KVM mapping them with UC
-* memory type, which would hurt the performance.
-* Therefore, we check the host memory type in addition
-* and only treat UC/UC-/WC pages as MMIO.
-*/
-   (!pat_enabled() || pat_pfn_immune_to_uc_mtrr(pfn));
+   return !pat_enabled() || pat_pfn_immune_to_uc_mtrr(pfn);
 
+   /*
+* Any RAM that has no memmap (e.g., mapped via /dev/mem) is not MMIO.
+*/
return !e820__mapped_raw_any(pfn_to_hpa(pfn),
 pfn_to_hpa(pfn + 1) - 1,
 E820_TYPE_RAM);
-- 
2.21.0
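
The three-step decision in the rewritten kvm_is_mmio_pfn() can be modelled in user space as below. This is a sketch only: struct pfn_facts and model_is_mmio_pfn() are hypothetical, and each boolean stands in for the result of the kernel helper named in its comment.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-pfn facts; in the kernel these come from helper calls. */
struct pfn_facts {
    bool online;        /* pfn_to_online_page() returned a page */
    bool valid;         /* pfn_valid() is true */
    bool zero_page;     /* is_zero_pfn() */
    bool reserved;      /* PageReserved() on the online page */
    bool pat_enabled;   /* pat_enabled() */
    bool uc_like;       /* host memory type is UC/UC-/WC */
    bool e820_ram;      /* range is listed as RAM in the raw e820 map */
};

static bool model_is_mmio_pfn(const struct pfn_facts *f)
{
    /* 1. Online memmap: only reserved, non-zero pages count as MMIO. */
    if (f->online)
        return !f->zero_page && f->reserved;

    /* 2. Valid but not online (possibly ZONE_DEVICE): use the memory type. */
    if (f->valid)
        return !f->pat_enabled || f->uc_like;

    /* 3. No memmap at all: MMIO unless the range is RAM. */
    return !f->e820_ram;
}
```

The ordering matters: the PageReserved() test is only reached for online pages, so ZONE_DEVICE pages get the same answer whether or not they are marked PG_reserved.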

[PATCH v1 01/10] mm/memory_hotplug: Don't allow to online/offline memory blocks with holes

2019-10-24 Thread David Hildenbrand

Our onlining/offlining code is unnecessarily complicated. Only memory
blocks added during boot can have holes (a range that is not
IORESOURCE_SYSTEM_RAM). Hotplugged memory never has holes (e.g., see
add_memory_resource()). All boot memory is already online.

Therefore, when we stop allowing to offline memory blocks with holes, we
implicitly no longer have to deal with onlining memory blocks with holes.

This allows to simplify the code. For example, we no longer have to
worry about marking pages that fall into memory holes PG_reserved when
onlining memory. We can stop setting pages PG_reserved.

Offlining memory blocks added during boot is usually not guaranteed to work
either way (unmovable data might have easily ended up on that memory during
boot). So stopping to do that should not really hurt (+ people are not
even aware of a setup where that used to work and that the existing code
still works correctly with memory holes). For the use case of offlining
memory to unplug DIMMs, we should see no change. (holes on DIMMs would be
weird).

Please note that hardware errors (PG_hwpoison) are not memory holes and
not affected by this change when offlining.

Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Oscar Salvador 
Cc: Pavel Tatashin 
Cc: Dan Williams 
Cc: Anshuman Khandual 
Signed-off-by: David Hildenbrand 
---
 mm/memory_hotplug.c | 26 --
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 561371ead39a..8d81730cf036 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1447,10 +1447,19 @@ static void node_states_clear_node(int node, struct 
memory_notify *arg)
node_clear_state(node, N_MEMORY);
 }
 
+static int count_system_ram_pages_cb(unsigned long start_pfn,
+unsigned long nr_pages, void *data)
+{
+   unsigned long *nr_system_ram_pages = data;
+
+   *nr_system_ram_pages += nr_pages;
+   return 0;
+}
+
 static int __ref __offline_pages(unsigned long start_pfn,
  unsigned long end_pfn)
 {
-   unsigned long pfn, nr_pages;
+   unsigned long pfn, nr_pages = 0;
unsigned long offlined_pages = 0;
int ret, node, nr_isolate_pageblock;
unsigned long flags;
@@ -1461,6 +1470,20 @@ static int __ref __offline_pages(unsigned long start_pfn,
 
mem_hotplug_begin();
 
+   /*
+* Don't allow to offline memory blocks that contain holes.
+* Consequently, memory blocks with holes can never get onlined
+* (hotplugged memory has no holes and all boot memory is online).
+* This allows to simplify the onlining/offlining code quite a lot.
+*/
+   walk_system_ram_range(start_pfn, end_pfn - start_pfn, &nr_pages,
+ count_system_ram_pages_cb);
+   if (nr_pages != end_pfn - start_pfn) {
+   ret = -EINVAL;
+   reason = "memory holes";
+   goto failed_removal;
+   }
+
/* This makes hotplug much easier...and readable.
   we assume this for now. .*/
if (!test_pages_in_a_zone(start_pfn, end_pfn, &valid_start,
@@ -1472,7 +1495,6 @@ static int __ref __offline_pages(unsigned long start_pfn,
 
zone = page_zone(pfn_to_page(valid_start));
node = zone_to_nid(zone);
-   nr_pages = end_pfn - start_pfn;
 
/* set above range as isolated */
ret = start_isolate_page_range(start_pfn, end_pfn,
-- 
2.21.0
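
The hole check added to __offline_pages() counts the System RAM pages inside the block and compares the total with the block size. A user-space sketch of the same idea, assuming a hypothetical array of non-overlapping RAM ranges in place of walk_system_ram_range():

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one System RAM range, in page frame numbers. */
struct ram_range {
    unsigned long start_pfn;
    unsigned long end_pfn;   /* exclusive */
};

/* Sum the pages of [start_pfn, end_pfn) covered by non-overlapping ranges. */
static unsigned long count_ram_pages(const struct ram_range *ranges, size_t n,
                                     unsigned long start_pfn,
                                     unsigned long end_pfn)
{
    unsigned long total = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned long lo = ranges[i].start_pfn > start_pfn ?
                           ranges[i].start_pfn : start_pfn;
        unsigned long hi = ranges[i].end_pfn < end_pfn ?
                           ranges[i].end_pfn : end_pfn;

        if (lo < hi)
            total += hi - lo;
    }
    return total;
}

/* A block has a hole when fewer RAM pages are found than the block spans. */
static bool block_has_holes(const struct ram_range *ranges, size_t n,
                            unsigned long start_pfn, unsigned long end_pfn)
{
    return count_ram_pages(ranges, n, start_pfn, end_pfn) !=
           end_pfn - start_pfn;
}
```

With RAM at pfns [0, 100) and [120, 300), a block covering [64, 128) contains only 44 RAM pages out of 64 and would be rejected.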

[PATCH v1 00/10] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)

2019-10-24 Thread David Hildenbrand

This is the result of a recent discussion with Michal ([1], [2]). Right
now we set all pages PG_reserved when initializing hotplugged memmaps. This
includes ZONE_DEVICE memory. In case of system memory, PG_reserved is
cleared again when onlining the memory, in case of ZONE_DEVICE memory
never.

In ancient times, we needed PG_reserved, because there was no way to tell
whether the memmap was already properly initialized. We now have
SECTION_IS_ONLINE for that in the case of !ZONE_DEVICE memory. ZONE_DEVICE
memory is already initialized deferred, and there shouldn't be a visible
change in that regard.

One of the biggest fears was side effects. I went ahead and audited all
users of PageReserved(). The details can be found in "mm/memory_hotplug:
Don't mark pages PG_reserved when initializing the memmap".

This patch set adapts all relevant users of PageReserved() to keep the
existing behavior with respect to ZONE_DEVICE pages. The biggest part
that needs changes is KVM, to keep the existing behavior (that's all I
care about in this series).

Note that this series is able to rely completely on pfn_to_online_page().
No new is_zone_device_page() calls are introduced (as requested by Dan).
We are currently discussing a way to mark also ZONE_DEVICE memmaps as
active/initialized - pfn_active() - and lightweight locking to make sure
memmaps remain active (e.g., using RCU). We might later be able to convert
some users of pfn_to_online_page() to pfn_active(). Details can be found
in [3], however, this represents yet another cleanup/fix we'll perform
on top of this cleanup.

I only gave it a quick test with DIMMs on x86-64, but didn't test the
ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Also, I didn't
test the KVM parts (especially with ZONE_DEVICE pages or no memmap at all).
Compile-tested on x86-64 and PPC.

Based on next/master. The current version (kept updated) can be found at:
https://github.com/davidhildenbrand/linux.git online_reserved_cleanup

RFC -> v1:
- Dropped "staging/gasket: Prepare gasket_release_page() for PG_reserved
  changes"
- Dropped "staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved
  changes"
- Converted "mm/usercopy.c: Prepare check_page_span() for PG_reserved
  changes" to "mm/usercopy.c: Update comment in check_page_span()
  regarding ZONE_DEVICE"
- No new users of is_zone_device_page() are introduced.
- Rephrased comments and patch descriptions.

[1] https://lkml.org/lkml/2019/10/21/736
[2] https://lkml.org/lkml/2019/10/21/1034
[3] https://www.spinics.net/lists/linux-mm/msg194112.html

Cc: Michal Hocko 
Cc: Dan Williams 
Cc: kvm-...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: k...@vger.kernel.org
Cc: linux-hyp...@vger.kernel.org
Cc: de...@driverdev.osuosl.org
Cc: xen-de...@lists.xenproject.org
Cc: x...@kernel.org
Cc: Alexander Duyck 

David Hildenbrand (10):
  mm/memory_hotplug: Don't allow to online/offline memory blocks with
holes
  KVM: x86/mmu: Prepare kvm_is_mmio_pfn() for PG_reserved changes
  KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes
  vfio/type1: Prepare is_invalid_reserved_pfn() for PG_reserved changes
  powerpc/book3s: Prepare kvmppc_book3s_instantiate_page() for
PG_reserved changes
  powerpc/64s: Prepare hash_page_do_lazy_icache() for PG_reserved
changes
  powerpc/mm: Prepare maybe_pte_to_page() for PG_reserved changes
  x86/mm: Prepare __ioremap_check_ram() for PG_reserved changes
  mm/memory_hotplug: Don't mark pages PG_reserved when initializing the
memmap
  mm/usercopy.c: Update comment in check_page_span() regarding
ZONE_DEVICE

 arch/powerpc/kvm/book3s_64_mmu_radix.c | 14 +
 arch/powerpc/mm/book3s64/hash_utils.c  | 10 +++---
 arch/powerpc/mm/pgtable.c  | 10 +++---
 arch/x86/kvm/mmu.c | 29 ++---
 arch/x86/mm/ioremap.c  | 13 ++--
 drivers/hv/hv_balloon.c|  6 
 drivers/vfio/vfio_iommu_type1.c| 10 --
 drivers/xen/balloon.c  |  7 +
 include/linux/page-flags.h |  8 +
 mm/memory_hotplug.c| 43 +++---
 mm/page_alloc.c| 11 ---
 mm/usercopy.c  |  6 ++--
 virt/kvm/kvm_main.c| 10 --
 13 files changed, 111 insertions(+), 66 deletions(-)

-- 
2.21.0

Re: [PATCH v7 2/3] Documentation: dt: binding: fsl: Add 'little-endian' and update Chassis define

2019-10-24 Thread Scott Wood

On Mon, 2019-10-21 at 11:49 +0800, Ran Wang wrote:
> By default, QorIQ SoC's RCPM register block is Big Endian. But
> there are some exceptions, such as LS1088A and LS2088A, are
> Little Endian. So add this optional property to help identify
> them.
> 
> Actually LS2021A and other Layerscapes won't totally follow Chassis
> 2.1, so separate them from powerpc SoC.

Did you mean LS1021A and "don't" instead of "won't", given the change to the
examples?

> Change in v5:
>   - Add 'Reviewed-by: Rob Herring ' to commit message.
>   - Rename property 'fsl,#rcpm-wakeup-cells' to '#fsl,rcpm-wakeup-
> cells'.
>   please see https://lore.kernel.org/patchwork/patch/1101022/

I'm not sure why Rob considers this the "correct form" -- there are other
examples of the current form, such as ibm,#dma-address-cells and ti,#tlb-
entries, and the current form makes more logical sense (# is part of the
property name, not the vendor).  Oh well.

> Required properites:
>- reg : Offset and length of the register set of the RCPM block.
> -  - fsl,#rcpm-wakeup-cells : The number of IPPDEXPCR register cells in the
> +  - #fsl,rcpm-wakeup-cells : The number of IPPDEXPCR register cells in the
>   fsl,rcpm-wakeup property.
>- compatible : Must contain a chip-specific RCPM block compatible string
>   and (if applicable) may contain a chassis-version RCPM compatible
> @@ -20,6 +20,7 @@ Required properites:
>   * "fsl,qoriq-rcpm-1.0": for chassis 1.0 rcpm
>   * "fsl,qoriq-rcpm-2.0": for chassis 2.0 rcpm
>   * "fsl,qoriq-rcpm-2.1": for chassis 2.1 rcpm
> + * "fsl,qoriq-rcpm-2.1+": for chassis 2.1+ rcpm

Is there something actually called "2.1+"?  It looks a bit like an attempt to
claim compatibility with all future versions.  If the former, is it a name
that comes from the hardware side with an intent for it to describe a stable
interface, or are we later going to see a patch changing some by-then-existing 
device trees from "2.1+" to "2.1++" when some new incompatibility is found?

Perhaps it would be better to bind to the specific chip compatibles.

-Scott
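
For readers unfamiliar with what the optional 'little-endian' property quoted above changes for a driver, here is a minimal user-space sketch. rcpm_read32() is a hypothetical name, and the boolean models whether the property is present in the device tree node; this is not the driver's actual accessor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Decode a 32-bit register image stored big-endian (the default case). */
static uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Decode the same four bytes as little-endian. */
static uint32_t read_le32(const uint8_t *p)
{
    return ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
           ((uint32_t)p[1] << 8) | (uint32_t)p[0];
}

/*
 * 'little_endian' models the presence of the optional property in the
 * device tree node; rcpm_read32() is a hypothetical name.
 */
static uint32_t rcpm_read32(const uint8_t *reg, bool little_endian)
{
    return little_endian ? read_le32(reg) : read_be32(reg);
}
```

The same four bytes 12 34 56 78 read as 0x12345678 on a big-endian block and as 0x78563412 on a little-endian one, which is why the driver needs the hint.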

Re: [PATCH] powerpc/boot: Fix the initrd being overwritten under qemu

2019-10-24 Thread Segher Boessenkool

On Thu, Oct 24, 2019 at 12:31:24PM +1100, Alexey Kardashevskiy wrote:
> 
> 
> On 23/10/2019 22:21, Segher Boessenkool wrote:
> > On Wed, Oct 23, 2019 at 12:36:35PM +1100, Oliver O'Halloran wrote:
> >> When booting under OF the zImage expects the initrd address and size to be
> >> passed to it using registers r3 and r4. SLOF (guest firmware used by QEMU)
> >> currently doesn't do this so the zImage is not aware of the initrd
> >> location.  This can result in initrd corruption either though the zImage
> >> extracting the vmlinux over the initrd, or by the vmlinux overwriting the
> >> initrd when relocating itself.
> >>
> >> QEMU does put the linux,initrd-start and linux,initrd-end properties into
> >> the devicetree to vmlinux to find the initrd. We can work around the SLOF
> >> bug by also looking those properties in the zImage.
> > 
> > This is not a bug.  What boot protocol requires passing the initrd start
> > and size in GPR3, GPR4?
> 
> So far I was unable to identify it...

Maybe this comes from yaboot?
https://git.ozlabs.org/?p=yaboot.git;a=blob;f=second/yaboot.c;h=9b66ab44e1be0ee82b88e386a5d0358428766e73;hb=HEAD#l1186

> > The CHRP binding (what SLOF implements) requires passing two zeroes here.
> > And ePAPR requires passing the address of a device tree and a zero, plus
> > something in GPR6 to allow distinguishing what it does.
> > 
> > As Alexey says, initramfs works just fine, so please use that?  initrd was
> > deprecated when this code was written already.
> 
> I did not say about anything working fine :)

Yeah, I read that from your words, wrong it seems.  Sorry.  I often used
INITRAMFS_SOURCE for kernels for use with SLOF, it's just so convenient.

> In my case I was using a new QEMU which does full FDT on client-arch-support 
> and that thing would put the original
> linux,initrd-start/end to the FDT even though the initrd was unpacked and the 
> properties were changes in SLOF. With that
> fixed, this is an alternative fix for SLOF but I am not pushing it out as I 
> have no idea about the bindings and this
> also breaks "vmlinux".
> 
> 
> diff --git a/slof/fs/client.fs b/slof/fs/client.fs
> index 8a7f6ac4326d..138177e4c2a3 100644
> --- a/slof/fs/client.fs
> +++ b/slof/fs/client.fs
> @@ -45,6 +45,17 @@ VARIABLE  client-callback \ Address of client's callback 
> function
>>r  ciregs >r7 !  ciregs >r6 !  client-entry-point @ ciregs >r5 !
>\ Initialise client-stack-pointer
>cistack ciregs >r1 !
> +
> +  s" linux,initrd-end" get-chosen IF decode-int -rot 2drop ELSE 0 THEN
> +  s" linux,initrd-start" get-chosen IF decode-int -rot 2drop ELSE 0 THEN
> +  2dup - dup IF
> +ciregs >r4 !
> +ciregs >r3 !
> +drop
> +  ELSE
> +3drop
> +  THEN

Something like that should work fine.  Do it in go-32 and go-64 though?
Or is that the wrong spot?


Segher
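
For readers who do not read Forth, the quoted SLOF change can be restated in C roughly as follows. This is a sketch under assumptions: struct client_regs and set_initrd_regs() are hypothetical, and the two arguments stand for the linux,initrd-start and linux,initrd-end values read from /chosen (0 when a property is absent).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of the client register image set up before entry. */
struct client_regs {
    uint64_t r3;   /* initrd start address, or 0 */
    uint64_t r4;   /* initrd size in bytes, or 0 */
};

/*
 * Mirror of the quoted Forth: if end - start is non-zero, pass start in r3
 * and the size in r4; otherwise leave both registers untouched.
 */
static void set_initrd_regs(struct client_regs *regs,
                            uint64_t initrd_start, uint64_t initrd_end)
{
    uint64_t size = initrd_end - initrd_start;

    if (size) {
        regs->r3 = initrd_start;
        regs->r4 = size;
    }
}
```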

Re: [PATCH] powerpc/tools: Don't quote $objdump in scripts

2019-10-24 Thread Segher Boessenkool

On Thu, Oct 24, 2019 at 11:47:30AM +1100, Michael Ellerman wrote:
> Some of our scripts are passed $objdump and then call it as
> "$objdump". This doesn't work if it contains spaces because we're
> using ccache, for example you get errors such as:
> 
>   ./arch/powerpc/tools/relocs_check.sh: line 48: ccache ppc64le-objdump: No 
> such file or directory
>   ./arch/powerpc/tools/unrel_branch_check.sh: line 26: ccache 
> ppc64le-objdump: No such file or directory
> 
> Fix it by not quoting the string when we expand it, allowing the shell
> to do the right thing for us.

This breaks things for people with spaces in their paths.  Why doesn't your
user use something like  alias objdump="ccache ppc64le-objdump"  , instead?


Segher

[PATCH RFC 11/11] PCI: hotplug: movable bus numbers: compact the gaps in numbering

2019-10-24 Thread Sergey Miroshnichenko

If bus numbers are distributed sparsely and there are a lot of devices in the
tree, hotplugging a bridge into the end of the tree may fail even if it has
fewer slots than the total number of unused bus numbers.

Thus, the feature of bus renaming relies on the continuity of bus numbers,
so if a bridge was unplugged, the gap in bus numbers must be compacted.

Let's densify the bus numbering at the beginning of the next PCI rescan.

Signed-off-by: Sergey Miroshnichenko 
---
 drivers/pci/probe.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index fe9bf012ef33..0c91b9d453dd 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1319,6 +1319,30 @@ static bool pci_new_bus_needed(struct pci_bus *bus, 
const struct pci_dev *self)
return true;
 }
 
+static void pci_compact_bus_numbers(const int domain, const struct resource 
*valid_range)
+{
+   int busnr_p1 = valid_range->start;
+
+   while (busnr_p1 < valid_range->end) {
+   int busnr_p2 = busnr_p1 + 1;
+   struct pci_bus *bus_p2;
+   int delta;
+
+   while (busnr_p2 <= valid_range->end &&
+  !(bus_p2 = pci_find_bus(domain, busnr_p2)))
+   ++busnr_p2;
+
+   if (!bus_p2 || busnr_p2 > valid_range->end)
+   break;
+
+   delta = busnr_p1 - busnr_p2 + 1;
+   if (delta)
+   pci_move_buses(domain, busnr_p2, delta, valid_range);
+
+   ++busnr_p1;
+   }
+}
+
 static unsigned int pci_scan_child_bus_extend(struct pci_bus *bus,
  unsigned int available_buses);
 /**
@@ -3691,6 +3715,9 @@ unsigned int pci_rescan_bus(struct pci_bus *bus)
pci_bus_update_immovable_range(root);
pci_bus_release_root_bridge_resources(root);
 
+   pci_compact_bus_numbers(pci_domain_nr(bus),
+   >busn_res);
+
max = pci_scan_child_bus(root);
 
pci_reassign_root_bus_resources(root);
-- 
2.23.0
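
The compaction loop in pci_compact_bus_numbers() can be tried out in user space with a simple "bus number in use" table. This is a model only: model_move_buses() is a hypothetical stand-in for pci_move_buses() that slides every bus at or above a given number, and the sketch assumes the first bus of the range is present.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_BUSES 256

/* Shift every present bus numbered >= first by delta (delta <= 0 here). */
static void model_move_buses(bool present[MAX_BUSES], int first, int delta,
                             int end)
{
    if (delta == 0)
        return;
    for (int nr = first; nr <= end; nr++) {
        if (present[nr]) {
            present[nr] = false;
            present[nr + delta] = true;
        }
    }
}

/* Close every gap in [start, end]; assumes bus 'start' is present. */
static void model_compact(bool present[MAX_BUSES], int start, int end)
{
    int p1 = start;

    while (p1 < end) {
        int p2 = p1 + 1;

        /* Find the next bus number in use after p1. */
        while (p2 <= end && !present[p2])
            p2++;
        if (p2 > end)
            break;

        /* Non-positive: how far the tail must slide to sit right after p1. */
        model_move_buses(present, p2, p1 - p2 + 1, end);
        p1++;
    }
}
```

Starting from buses 0, 1, 5, 6 and 9, the model ends with buses 0 through 4 in use and no gaps.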

[PATCH RFC 10/11] PCI: hotplug: movable bus numbers: rename proc and sysfs entries

2019-10-24 Thread Sergey Miroshnichenko

Changing the number of a bus (therefore changing addresses of this bus, of
its children and all the buses next in the tree) invalidates entries in
/sys/devices/pci*, /proc/bus/pci/* and symlinks in /sys/bus/pci/devices/*
for all the renamed devices and buses.

Remove the affected proc and sysfs entries and symlinks before renaming the
bus, then recreate them.

Signed-off-by: Sergey Miroshnichenko 
---
 drivers/pci/probe.c | 105 +++-
 1 file changed, 104 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index be9e5754cac7..fe9bf012ef33 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1096,12 +1096,99 @@ static void pci_enable_crs(struct pci_dev *pdev)
 PCI_EXP_RTCTL_CRSSVE);
 }
 
+static void pci_buses_remove_sysfs(int domain, int busnr, int max_bus_number)
+{
+   struct pci_bus *bus;
+   struct pci_dev *dev = NULL;
+
+   bus = pci_find_bus(domain, busnr);
+   if (!bus)
+   return;
+
+   if (busnr < max_bus_number)
+   pci_buses_remove_sysfs(domain, busnr + 1, max_bus_number);
+
+   list_for_each_entry(dev, &bus->devices, bus_list) {
+   device_remove_class_symlinks(&dev->dev);
+   pci_remove_sysfs_dev_files(dev);
+   pci_proc_detach_device(dev);
+   bus_disconnect_device(&dev->dev);
+   }
+
+   device_remove_class_symlinks(&bus->dev);
+   pci_proc_detach_bus(bus);
+}
+
+static void pci_buses_create_sysfs(int domain, int busnr, int max_bus_number)
+{
+   struct pci_bus *bus;
+   struct pci_dev *dev = NULL;
+
+   bus = pci_find_bus(domain, busnr);
+   if (!bus)
+   return;
+
+   device_add_class_symlinks(&bus->dev);
+
+   list_for_each_entry(dev, &bus->devices, bus_list) {
+   bus_add_device(&dev->dev);
+   if (pci_dev_is_added(dev)) {
+   pci_proc_attach_device(dev);
+   pci_create_sysfs_dev_files(dev);
+   device_add_class_symlinks(&dev->dev);
+   }
+   }
+
+   if (busnr < max_bus_number)
+   pci_buses_create_sysfs(domain, busnr + 1, max_bus_number);
+}
+
+static void pci_rename_bus(struct pci_bus *bus, const char *new_bus_name)
+{
+   struct class *class;
+   int err;
+
+   class = bus->dev.class;
+   bus->dev.class = NULL;
+   err = device_rename(&bus->dev, new_bus_name);
+   bus->dev.class = class;
+}
+
+static void pci_rename_bus_devices(struct pci_bus *bus, const int domain,
+  const int new_busnr)
+{
+   struct pci_dev *dev = NULL;
+
+   list_for_each_entry(dev, &bus->devices, bus_list) {
+   char old_name[64];
+   char new_name[64];
+   struct class *class;
+   int err;
+   int i;
+
+   strncpy(old_name, dev_name(&dev->dev), sizeof(old_name));
+   sprintf(new_name, "%04x:%02x:%02x.%d", domain, new_busnr,
+   PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn));
+   class = dev->dev.class;
+   dev->dev.class = NULL;
+   err = device_rename(&dev->dev, new_name);
+   dev->dev.class = class;
+
+   for (i = 0; i < PCI_BRIDGE_RESOURCES; i++)
+   dev->resource[i].name = pci_name(dev);
+   }
+}
+
 static void pci_do_move_buses(const int domain, int busnr, int 
first_moved_busnr,
  int delta, const struct resource *valid_range)
 {
struct pci_bus *bus;
-   int subordinate;
+   int subordinate, old_primary;
u32 old_buses, buses;
+   char old_bus_name[64];
+   char new_bus_name[64];
+   struct resource old_res;
+   int new_busnr = busnr + delta;
 
if (busnr < valid_range->start || busnr > valid_range->end)
return;
@@ -1110,11 +1197,21 @@ static void pci_do_move_buses(const int domain, int 
busnr, int first_moved_busnr
if (!bus)
return;
 
+   old_primary = bus->primary;
+   strncpy(old_bus_name, dev_name(&bus->dev), sizeof(old_bus_name));
+   sprintf(new_bus_name, "%04x:%02x", domain, new_busnr);
+
if (delta > 0) {
pci_do_move_buses(domain, busnr + 1, first_moved_busnr,
  delta, valid_range);
+   pci_rename_bus_devices(bus, domain, new_busnr);
+   pci_rename_bus(bus, new_bus_name);
+   } else {
+   pci_rename_bus(bus, new_bus_name);
+   pci_rename_bus_devices(bus, domain, new_busnr);
}
 
+   memcpy(&old_res, &bus->busn_res, sizeof(old_res));
bus->number += delta;
bus->busn_res.start += delta;
 
@@ -1132,6 +1229,10 @@ static void pci_do_move_buses(const int domain, int 
busnr, int first_moved_busnr
buses |= (unsigned int)(subordinate << 16);
pci_write_config_dword(bus->self,

[PATCH RFC 09/11] PCI: hotplug: Add initial support for movable bus numbers

2019-10-24 Thread Sergey Miroshnichenko

Currently, hot-adding a bridge requires enough bus numbers to be reserved
on the slot. Choosing a favorable number of reserved buses per slot is
relatively simple for predictable cases, but it gets trickier when bridges
can be hot-plugged into hot-plugged bridges: there may be either not enough
buses in a slot for a new big bridge, or all the 255 possible numbers will
be depleted. So hot-add may fail still having unused buses somewhere in the
PCI topology.

Instead of reserving, the bus numbers can be allocated continuously, and
when hot-adding a bridge in the middle of the PCI tree, the conflicting
buses can increment their numbers, creating a gap for the new bridge.

Before moving, ensure there is enough space to move into and that there will
be no conflicts with other buses, taking into consideration that there may be
more than one root bridge in the domain (e.g. on some Intel Xeons one root
has buses 00-7f, and the second one - 80-ff).

The feature is disabled by default to not break the ABI, and can be enabled
by the "pci=movable_buses" command line argument, if all risks are accepted.

The following set of parameters provides a safe activation of the feature:

  pci=realloc,pcie_bus_peer2peer,movable_buses

On x86, the "pci=assign-busses" is also required:

  pci=realloc,pcie_bus_peer2peer,movable_buses,assign-busses

This series is the second half of the work started by the "Movable BARs"
patches, and relies on fixes made there.

Following patches will resolve the introduced issues:
 - fix desynchronization in /sys/devices/pci*, /sys/bus/pci/devices/* and
   /proc/bus/pci/* after changes in PCI topology;
 - compact gaps in numbering, which may appear after removing a bridge, to
   maintain the number continuity.
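
The gap-making step described above can be modelled in user space as follows. This is a sketch with hypothetical names (model_make_gap() is not part of the patch); it assumes a positive count and a single range [0, end] owned by one root bridge.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_BUSES 256

/*
 * Open a gap of 'count' bus numbers at 'first' by shifting every bus
 * numbered >= first upwards. Returns false, changing nothing, when the
 * highest bus would leave [0, end].
 */
static bool model_make_gap(bool present[MAX_BUSES], int first, int count,
                           int end)
{
    int highest = -1;

    for (int nr = first; nr <= end; nr++)
        if (present[nr])
            highest = nr;

    if (highest < 0)
        return true;            /* nothing to move */
    if (highest + count > end)
        return false;           /* not enough room to move into */

    /* Walk downwards so a moved bus never lands on one not yet moved. */
    for (int nr = highest; nr >= first; nr--) {
        if (present[nr]) {
            present[nr] = false;
            present[nr + count] = true;
        }
    }
    return true;
}
```

Moving the highest bus first corresponds to the patch recursing into busnr + 1 before touching the current bus when delta is positive.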

Signed-off-by: Sergey Miroshnichenko 
---
 .../admin-guide/kernel-parameters.txt |   3 +
 drivers/pci/pci.c |   3 +
 drivers/pci/pci.h |   2 +
 drivers/pci/probe.c   | 153 +-
 4 files changed, 156 insertions(+), 5 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index c6243aaed0c9..1bf8dea1f08a 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3529,6 +3529,9 @@
force_floating  [S390] Force usage of floating interrupts.
nomio   [S390] Do not use MIO instructions.
no_movable_bars Don't allow BARs to be moved during hotplug
+   movable_buses   Prefer bus renaming over the number reserving. 
This
+   inflicts the deleting+recreating of sysfs and 
procfs
+   entries.
 
pcie_aspm=  [PCIE] Forcibly enable or disable PCIe Active State 
Power
Management.
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 6ec1b70e4a96..9b2dcaa268e8 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -79,6 +79,7 @@ int pci_domains_supported = 1;
 #endif
 
 bool pci_can_move_bars = true;
+bool pci_movable_buses;
 
 #define DEFAULT_CARDBUS_IO_SIZE(256)
 #define DEFAULT_CARDBUS_MEM_SIZE   (64*1024*1024)
@@ -6335,6 +6336,8 @@ static int __init pci_setup(char *str)
disable_acs_redir_param = str + 18;
} else if (!strncmp(str, "no_movable_bars", 15)) {
pci_can_move_bars = false;
+   } else if (!strncmp(str, "movable_buses", 13)) {
+   pci_movable_buses = true;
} else {
pr_err("PCI: Unknown option `%s'\n", str);
}
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 9b5164d10499..804176bb1d1b 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -289,6 +289,8 @@ void pci_bus_put(struct pci_bus *bus);
 
 bool pci_dev_bar_movable(struct pci_dev *dev, struct resource *res);
 
+extern bool pci_movable_buses;
+
 int assign_fixed_resource_on_bus(struct pci_bus *b, struct resource *r);
 
 /* PCIe link information */
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 3494b5d265d5..be9e5754cac7 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1096,6 +1096,126 @@ static void pci_enable_crs(struct pci_dev *pdev)
 PCI_EXP_RTCTL_CRSSVE);
 }
 
+static void pci_do_move_buses(const int domain, int busnr, int 
first_moved_busnr,
+ int delta, const struct resource *valid_range)
+{
+   struct pci_bus *bus;
+   int subordinate;
+   u32 old_buses, buses;
+
+   if (busnr < valid_range->start || busnr > valid_range->end)
+   return;
+
+   bus = pci_find_bus(domain, busnr);
+   if (!bus)
+   return;
+
+   if (delta > 0) {
+

[PATCH RFC 06/11] powerpc/pci: Enable assigning bus numbers instead of reading them from DT

2019-10-24 Thread Sergey Miroshnichenko

If the firmware indicates support of reassigning bus numbers via the PHB's
"ibm,supported-movable-bdfs" property in DT, PowerNV will not depend on PCI
topology info from DT anymore.

This makes it possible to re-enumerate the fabric, assign the new bus numbers
and switch from the pnv_php module to the standard pciehp driver for PCI
hotplug functionality.

Signed-off-by: Sergey Miroshnichenko 
---
 arch/powerpc/kernel/pci_dn.c | 5 +
 arch/powerpc/platforms/powernv/eeh-powernv.c | 3 ++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
index ad0ecf48e943..b9b7518eb2b4 100644
--- a/arch/powerpc/kernel/pci_dn.c
+++ b/arch/powerpc/kernel/pci_dn.c
@@ -559,6 +559,11 @@ void pci_devs_phb_init_dynamic(struct pci_controller *phb)
phb->pci_data = pdn;
}
 
+   if (of_get_property(dn, "ibm,supported-movable-bdfs", NULL)) {
+   pci_add_flags(PCI_REASSIGN_ALL_BUS);
+   return;
+   }
+
/* Update dn->phb ptrs for new phb and children devices */
pci_traverse_device_nodes(dn, add_pdn, phb);
 }
diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c 
b/arch/powerpc/platforms/powernv/eeh-powernv.c
index 6bc24a47e9ef..6c126aa2a6b7 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -42,7 +42,8 @@ void pnv_pcibios_bus_add_device(struct pci_dev *pdev)
 {
struct pci_dn *pdn = pci_get_pdn(pdev);
 
-   if (eeh_has_flag(EEH_FORCE_DISABLED))
+   if (eeh_has_flag(EEH_FORCE_DISABLED) ||
+   !pci_has_flag(PCI_REASSIGN_ALL_BUS))
return;
 
dev_dbg(&pdev->dev, "EEH: Setting up device\n");
-- 
2.23.0

[PATCH RFC 08/11] PCI: Allow expanding the bridges

2019-10-24 Thread Sergey Miroshnichenko

When hotplugging a bridge, the parent bus may not have [enough] reserved
bus numbers. So before rescanning the bus, set its subordinate number to
the maximum possible value: it is 255 when there is only one root bridge
in the domain.

During the PCI rescan, the subordinate bus number of every bus will be
contracted to the actual value.

Signed-off-by: Sergey Miroshnichenko 
---
 drivers/pci/probe.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 539f5d39bb6d..3494b5d265d5 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -3195,20 +3195,22 @@ static unsigned int pci_dev_count_res_mask(struct 
pci_dev *dev)
return res_mask;
 }
 
-static void pci_bus_rescan_prepare(struct pci_bus *bus)
+static void pci_bus_rescan_prepare(struct pci_bus *bus, int last_bus_number)
 {
struct pci_dev *dev;
 
if (bus->self)
pci_config_pm_runtime_get(bus->self);
 
+   bus->busn_res.end = last_bus_number;
+
list_for_each_entry(dev, &bus->devices, bus_list) {
struct pci_bus *child = dev->subordinate;
 
dev->res_mask = pci_dev_count_res_mask(dev);
 
if (child)
-   pci_bus_rescan_prepare(child);
+   pci_bus_rescan_prepare(child, last_bus_number);
 
if (dev->driver &&
dev->driver->rescan_prepare)
@@ -3439,7 +3441,7 @@ unsigned int pci_rescan_bus(struct pci_bus *bus)
 
if (pci_can_move_bars) {
pcibios_root_bus_rescan_prepare(root);
-   pci_bus_rescan_prepare(root);
+   pci_bus_rescan_prepare(root, root->busn_res.end);
pci_bus_update_immovable_range(root);
pci_bus_release_root_bridge_resources(root);
 
-- 
2.23.0

[PATCH RFC 05/11] drivers: base: Add bus_disconnect_device()

2019-10-24 Thread Sergey Miroshnichenko

Add bus_disconnect_device(), which is similar to bus_remove_device(), but
it doesn't detach the device from its driver, so it can be reconnected to
the same or another bus later.

This is yet another preparation for hotplugging large PCIe bridges, which
may entail changes in BDF addresses of working devices due to movable bus
numbers. Changed addresses require rebuilding the affected entries in
/sys/bus/pci and /proc/bus/pci.

Using bus_disconnect_device()+bus_add_device() during PCI rescan allows the
drivers to work with their devices uninterruptedly, regardless of changes
in PCI addresses.

Signed-off-by: Sergey Miroshnichenko 
---
 drivers/base/bus.c | 36 
 include/linux/device.h |  1 +
 2 files changed, 37 insertions(+)

diff --git a/drivers/base/bus.c b/drivers/base/bus.c
index 8f3445cc533e..52d77fb90218 100644
--- a/drivers/base/bus.c
+++ b/drivers/base/bus.c
@@ -497,6 +497,42 @@ void bus_probe_device(struct device *dev)
mutex_unlock(&bus->p->mutex);
 }
 
+/**
+ * bus_disconnect_device - disconnect device from bus,
+ * but don't detach it from driver
+ * @dev: device to be disconnected
+ *
+ * - Remove device from all interfaces.
+ * - Remove symlink from bus' directory.
+ * - Delete device from bus's list.
+ */
+void bus_disconnect_device(struct device *dev)
+{
+   struct bus_type *bus = dev->bus;
+   struct subsys_interface *sif;
+
+   if (!bus)
+   return;
+
+   mutex_lock(&bus->p->mutex);
+   list_for_each_entry(sif, &bus->p->interfaces, node)
+   if (sif->remove_dev)
+   sif->remove_dev(dev, sif);
+   mutex_unlock(&bus->p->mutex);
+
+   sysfs_remove_link(&dev->kobj, "subsystem");
+   sysfs_remove_link(&dev->bus->p->devices_kset->kobj,
+ dev_name(dev));
+   device_remove_groups(dev, dev->bus->dev_groups);
+   if (klist_node_attached(&dev->p->knode_bus))
+   klist_del(&dev->p->knode_bus);
+
+   pr_debug("bus: '%s': remove device %s\n",
+dev->bus->name, dev_name(dev));
+   bus_put(dev->bus);
+}
+EXPORT_SYMBOL_GPL(bus_disconnect_device);
+
 /**
  * bus_remove_device - remove device from bus
  * @dev: device to be removed
diff --git a/include/linux/device.h b/include/linux/device.h
index 420228ab9c4b..9f098c32a4ad 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -268,6 +268,7 @@ void bus_sort_breadthfirst(struct bus_type *bus,
   int (*compare)(const struct device *a,
  const struct device *b));
 extern int bus_add_device(struct device *dev);
+extern void bus_disconnect_device(struct device *dev);
 extern int device_add_class_symlinks(struct device *dev);
 extern void device_remove_class_symlinks(struct device *dev);
 
-- 
2.23.0

[PATCH RFC 07/11] powerpc/pci: Don't reduce the host bridge bus range

2019-10-24 Thread Sergey Miroshnichenko

Currently the last possible bus number of the PHB is set to the last bus
number used during boot. So when a bridge is hotplugged later, no new
buses can be allocated because they are limited by this value.

Let the host bridge contain any number of buses up to 255.

Signed-off-by: Sergey Miroshnichenko 
---
 arch/powerpc/kernel/pci-common.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 1c448cf25506..5877ef7a39a0 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -1631,7 +1631,6 @@ void pcibios_scan_phb(struct pci_controller *hose)
if (mode == PCI_PROBE_NORMAL) {
pci_bus_update_busn_res_end(bus, 255);
hose->last_busno = pci_scan_child_bus(bus);
-   pci_bus_update_busn_res_end(bus, hose->last_busno);
}
 
/* Platform gets a chance to do some global fixups before
-- 
2.23.0

[PATCH RFC 04/11] drivers: base: Make device_{add|remove}_class_symlinks() public

2019-10-24 Thread Sergey Miroshnichenko

When updating the /sys/devices/pci* entries affected by changes in the PCI
topology, their symlinks in /sys/bus/pci/devices/* must also be rebuilt.

Moving device_add_class_symlinks() and device_remove_class_symlinks() to a
public API allows the PCI subsystem to update the sysfs without destroying
the working affected devices.

Signed-off-by: Sergey Miroshnichenko 
---
 drivers/base/core.c| 6 --
 include/linux/device.h | 2 ++
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 7bd9cd366d41..23e689fc8478 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -1922,7 +1922,7 @@ static void cleanup_glue_dir(struct device *dev, struct 
kobject *glue_dir)
mutex_unlock(&gdp_mutex);
 }
 
-static int device_add_class_symlinks(struct device *dev)
+int device_add_class_symlinks(struct device *dev)
 {
struct device_node *of_node = dev_of_node(dev);
int error;
@@ -1973,8 +1973,9 @@ static int device_add_class_symlinks(struct device *dev)
sysfs_remove_link(&dev->kobj, "of_node");
return error;
 }
+EXPORT_SYMBOL_GPL(device_add_class_symlinks);
 
-static void device_remove_class_symlinks(struct device *dev)
+void device_remove_class_symlinks(struct device *dev)
 {
if (dev_of_node(dev))
sysfs_remove_link(&dev->kobj, "of_node");
@@ -1991,6 +1992,7 @@ static void device_remove_class_symlinks(struct device 
*dev)
 #endif
sysfs_delete_link(&dev->class->p->subsys.kobj, &dev->kobj, dev_name(dev));
 }
+EXPORT_SYMBOL_GPL(device_remove_class_symlinks);
 
 /**
  * dev_set_name - set a device name
diff --git a/include/linux/device.h b/include/linux/device.h
index 4d8bbc8ae73d..420228ab9c4b 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -268,6 +268,8 @@ void bus_sort_breadthfirst(struct bus_type *bus,
   int (*compare)(const struct device *a,
  const struct device *b));
 extern int bus_add_device(struct device *dev);
+extern int device_add_class_symlinks(struct device *dev);
+extern void device_remove_class_symlinks(struct device *dev);
 
 /*
  * Bus notifiers: Get notified of addition/removal of devices
-- 
2.23.0

[PATCH RFC 03/11] drivers: base: Make bus_add_device() public

2019-10-24 Thread Sergey Miroshnichenko

Move the bus_add_device() to a public API, so it can be applied to devices
which are temporarily detached from their buses without being destroyed.

This will be used when the PCI topology changes after hotplugging a bridge:
buses may get their numbers changed, so their child devices must be
reattached and their sysfs and proc files recreated.

Signed-off-by: Sergey Miroshnichenko 
---
 drivers/base/base.h| 1 -
 drivers/base/bus.c | 1 +
 include/linux/device.h | 2 ++
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/base/base.h b/drivers/base/base.h
index 0d32544b6f91..c93d302e6345 100644
--- a/drivers/base/base.h
+++ b/drivers/base/base.h
@@ -110,7 +110,6 @@ extern void container_dev_init(void);
 
 struct kobject *virtual_device_parent(struct device *dev);
 
-extern int bus_add_device(struct device *dev);
 extern void bus_probe_device(struct device *dev);
 extern void bus_remove_device(struct device *dev);
 
diff --git a/drivers/base/bus.c b/drivers/base/bus.c
index a1d1e8256324..8f3445cc533e 100644
--- a/drivers/base/bus.c
+++ b/drivers/base/bus.c
@@ -471,6 +471,7 @@ int bus_add_device(struct device *dev)
bus_put(dev->bus);
return error;
 }
+EXPORT_SYMBOL_GPL(bus_add_device);
 
 /**
  * bus_probe_device - probe drivers for a new device
diff --git a/include/linux/device.h b/include/linux/device.h
index 297239a08bb7..4d8bbc8ae73d 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -267,6 +267,8 @@ int bus_for_each_drv(struct bus_type *bus, struct 
device_driver *start,
 void bus_sort_breadthfirst(struct bus_type *bus,
   int (*compare)(const struct device *a,
  const struct device *b));
+extern int bus_add_device(struct device *dev);
+
 /*
  * Bus notifiers: Get notified of addition/removal of devices
  * and binding/unbinding of drivers to devices.
-- 
2.23.0

[PATCH RFC 02/11] PCI: proc: Nullify a freed pointer

2019-10-24 Thread Sergey Miroshnichenko

A PCI device may be detached from /proc/bus/pci/devices not only when it is
removed, but also when the number of its bus has changed. In this case the
proc entry must be recreated to reflect the new PCI topology.

Nullify the freed pointer to mark it as available for allocating again.

Signed-off-by: Sergey Miroshnichenko 
---
 drivers/pci/proc.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/pci/proc.c b/drivers/pci/proc.c
index 5495537c60c2..c85654dd315b 100644
--- a/drivers/pci/proc.c
+++ b/drivers/pci/proc.c
@@ -443,6 +443,7 @@ int pci_proc_detach_device(struct pci_dev *dev)
 int pci_proc_detach_bus(struct pci_bus *bus)
 {
proc_remove(bus->procdir);
+   bus->procdir = NULL;
return 0;
 }
 
-- 
2.23.0

[PATCH RFC 01/11] PCI: sysfs: Nullify freed pointers

2019-10-24 Thread Sergey Miroshnichenko

After hotplugging a bridge the PCI topology will change: buses may have
their numbers changed. In this case all the affected sysfs entries/symlinks
must be recreated, because they have the BDF address in their names.

Set the freed pointers to NULL, so the !NULL checks will be satisfied when
it's time to recreate the sysfs entries.
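
Why the NULL assignment matters can be shown with a small userspace sketch.
It is a model under assumptions, not the kernel code: struct model_dev,
model_create_attr() and model_remove_attrs() are hypothetical names, and the
"already exists" guard stands in for whatever NULL check the creation path
performs.

```c
#include <stdlib.h>

#define MODEL_NUM_RES 6

/* Hypothetical stand-in for the per-resource attribute pointers. */
struct model_dev {
	char *res_attr[MODEL_NUM_RES];
};

/* Returns 0 on success, -1 if the slot is still occupied or malloc fails. */
static int model_create_attr(struct model_dev *dev, int i)
{
	if (dev->res_attr[i])
		return -1; /* a stale non-NULL pointer would block re-creation */

	dev->res_attr[i] = malloc(16);
	return dev->res_attr[i] ? 0 : -1;
}

/* Free every attribute and reset the slot so it can be created again. */
static void model_remove_attrs(struct model_dev *dev)
{
	int i;

	for (i = 0; i < MODEL_NUM_RES; i++) {
		free(dev->res_attr[i]);
		dev->res_attr[i] = NULL;
	}
}
```

Without the reset in model_remove_attrs(), the second creation attempt would
see a dangling pointer and either refuse to run or reuse freed memory.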

Signed-off-by: Sergey Miroshnichenko 
---
 drivers/pci/pci-sysfs.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 793412954529..a238935c1193 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -1129,12 +1129,14 @@ static void pci_remove_resource_files(struct pci_dev 
*pdev)
if (res_attr) {
sysfs_remove_bin_file(&pdev->dev.kobj, res_attr);
kfree(res_attr);
+   pdev->res_attr[i] = NULL;
}
 
res_attr = pdev->res_attr_wc[i];
if (res_attr) {
sysfs_remove_bin_file(&pdev->dev.kobj, res_attr);
kfree(res_attr);
+   pdev->res_attr_wc[i] = NULL;
}
}
 }
@@ -1175,8 +1177,11 @@ static int pci_create_attr(struct pci_dev *pdev, int 
num, int write_combine)
res_attr->size = pci_resource_len(pdev, num);
res_attr->private = (void *)(unsigned long)num;
retval = sysfs_create_bin_file(&pdev->dev.kobj, res_attr);
-   if (retval)
+   if (retval) {
kfree(res_attr);
+   if (pdev->res_attr[num] == res_attr)
+   pdev->res_attr[num] = NULL;
+   }
 
return retval;
 }
-- 
2.23.0

[PATCH RFC 00/11] PCI: hotplug: Movable bus numbers

2019-10-24 Thread Sergey Miroshnichenko

To allow hotplugging bridges, the kernel or BIOS/bootloader/firmware adds
extra bus numbers per slot, but this range may not be enough for a large
bridge and/or nested bridges when hot-adding a chassis full of devices.

This patchset proposes an approach similar to movable BARs: bus numbers are
not reserved anymore; instead, the kernel moves the "tail" of the PCI tree
by one whenever a new bus is needed.

When something like this is going to happen:
                                                                   *LARGE*
 +-[0020:00]---00.0-[01-20]--+-00.0-[02-08]--+-00.0-[03]--   <--  *NESTED*
 |                           |               +-01.0-[04]--        *BRIDGE*
 |                           |               +-02.0-[05]--
 |                           |               +-03.0-[06]--
 |                           |               +-04.0-[07]--
 |                           |               \-05.0-[08]--
 ...

, this will result in the following:

 
 +-[0020:00]---00.0-[01-22]--+-00.0-[02-22]--+-00.0-[03-1d]----04.0-[04-1d]--+-00.0-[05]--
 |                           |               |                               +-04.0-[06]--
 |                           |               |                               +-09.0-[07]--
 |                           |               |                               +-0c.0-[08-19]----00.0-[09-19]--+-01.0-[0a]--
 |                           |               |                               |                               ...
 |                           |               |                               |                               \-11.0-[19]--
 |                           |               |                               ...
 |                           |               |                               \-15.0-[1d]--
 |                           |               +-01.0-[1e]--  <-- Renamed from 04
 |                           |               +-02.0-[1f]--  <-- Renamed from 05
 |                           |               +-03.0-[20]--  <-- Renamed from 06
 |                           |               +-04.0-[21]--  <-- Renamed from 07
 |                           |               \-05.0-[22]--  <-- Renamed from 08
 ...


This looks to be safe in the kernel, because drivers don't use the raw PCI
BDF ID, and we've tested that on our x86 and PowerNV machines: mass storage
devices holding root filesystems and network adapters just continued their
work while their bus numbers were moved.
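
The renumbering in the tree above can be reproduced with a short userspace
sketch. It is a simplified model under assumptions: model_make_room() is a
hypothetical helper that works on a flat array of bus numbers, whereas the
series itself walks the real bus hierarchy.

```c
#include <stddef.h>

/*
 * Hypothetical model: make room for one new bus numbered new_bus by moving
 * the "tail" (every existing bus numbered new_bus or higher) up by one.
 * Returns how many existing buses were renumbered.
 */
static size_t model_make_room(int *busnr, size_t n, int new_bus)
{
	size_t i, moved = 0;

	for (i = 0; i < n; i++) {
		if (busnr[i] >= new_bus) {
			busnr[i]++;
			moved++;
		}
	}
	return moved;
}
```

Starting from buses 01..08 and making room at 04 for each of the 26 new
buses (04..1d), the old buses 04..08 end up as 1e..22, as in the diagram.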

But here comes the userspace:

 - procfs entries:

% ls -la /proc/bus/pci/*
/proc/bus/pci/00:
00.0
02.0
...
1f.4
1f.6

/proc/bus/pci/04:
00.0

/proc/bus/pci/40:
00.0

 - sysfs entries:

% ls -la /sys/devices/pci0000:00/
0000:00:00.0
0000:00:02.0
...
0000:00:1f.3
0000:00:1f.4
0000:00:1f.6

% ls -la /sys/devices/pci0000:00/0000:00:1c.6/0000:04:00.0/driver
driver -> ../../../../bus/pci/drivers/iwlwifi

 - sysfs symlinks:

% ls -la /sys/bus/pci/devices
0000:00:00.0 -> ../../../devices/pci0000:00/0000:00:00.0
0000:00:02.0 -> ../../../devices/pci0000:00/0000:00:02.0
...
0000:04:00.0 -> ../../../devices/pci0000:00/0000:00:1c.6/0000:04:00.0
0000:40:00.0 -> ../../../devices/pci0000:00/0000:00:1d.2/0000:40:00.0


These patches alter the kernel public API and some internals to be able to
remove these files before changing a bus number, and to create new versions
of them after the device has changed its BDF.
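
Why a bus number change invalidates these names can be seen from how a BDF
string is composed. A minimal sketch, assuming the usual
domain:bus:slot.function layout; model_bdf_name() is a hypothetical helper,
not a kernel function.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: format a PCI address the way it appears in sysfs.
 * devfn packs the slot in its upper 5 bits and the function in the lower 3.
 */
static int model_bdf_name(char *buf, size_t len, unsigned int domain,
			  unsigned int bus, unsigned int devfn)
{
	return snprintf(buf, len, "%04x:%02x:%02x.%x",
			domain, bus, (devfn >> 3) & 0x1f, devfn & 0x7);
}
```

The same device at devfn 0 is "0020:04:00.0" while on bus 04 and
"0020:1e:00.0" once that bus has been renumbered to 1e, so every file and
symlink named after the old string has to be recreated.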

On the one hand, this makes hotplug predictable, cross-platform and
independent of non-kernel program components (BIOS, bootloader, etc.); on
the other hand, it is a severe ABI violation.

udev should probably have a new action like "rename" in addition to
"add" and "remove".

Is it feasible to have this feature disabled by default, but with an option
to enable it via a kernel command line argument like this:

  pci=realloc,movable_buses

?

This code is a follow-up to the "PCI: Allow BAR movement during hotplug"
series (v6).

Sergey Miroshnichenko (11):
  PCI: sysfs: Nullify freed pointers
  PCI: proc: Nullify a freed pointer
  drivers: base: Make bus_add_device() public
  drivers: base: Make device_{add|remove}_class_symlinks() public
  drivers: base: Add bus_disconnect_device()
  powerpc/pci: Enable assigning bus numbers instead of reading them from
DT
  powerpc/pci: Don't reduce the host bridge bus range
  PCI: Allow expanding the bridges
  PCI: hotplug: Add initial support for movable bus numbers
  PCI: hotplug: movable bus numbers: rename proc and sysfs entries
  PCI: hotplug: movable bus numbers: compact the gaps in numbering

 .../admin-guide/kernel-parameters.txt |   3 +
 arch/powerpc/kernel/pci-common.c  |   1 -
 arch/powerpc/kernel/pci_dn.c  |   5 +
 arch/powerpc/platforms/powernv/eeh-powernv.c  |   3 +-
 drivers/base/base.h   |   1 -
 drivers/base/bus.c|  37 +++
 drivers/base/core.c   |   6 +-
 drivers/pci/pci-sysfs.c

[PATCH v6 29/30] PCI: pciehp: movable BARs: Trigger a domain rescan on hp events

2019-10-24 Thread Sergey Miroshnichenko

With movable BARs, adding a hotplugged device is not local to its bridge
anymore, but it affects the whole domain: BARs, bridge windows and bus
numbers can be substantially rearranged. So instead of trying to fit the
new devices into preallocated reserved gaps, initiate a full domain rescan.

pci_rescan_bus() covers all the operations of the replaced functions:
 - assigning new bus numbers, as pci_hp_add_bridge() does;
 - allocating BARs (pci_assign_unassigned_bridge_resources());
 - configuring MPS settings (pcie_bus_configure_settings());
 - binding devices to their drivers (pci_bus_add_devices()).

CC: Lukas Wunner 
Signed-off-by: Sergey Miroshnichenko 
---
 drivers/pci/hotplug/pciehp_pci.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/pci/hotplug/pciehp_pci.c b/drivers/pci/hotplug/pciehp_pci.c
index d17f3bf36f70..6d4c1ef38210 100644
--- a/drivers/pci/hotplug/pciehp_pci.c
+++ b/drivers/pci/hotplug/pciehp_pci.c
@@ -58,6 +58,11 @@ int pciehp_configure_device(struct controller *ctrl)
goto out;
}
 
+   if (pci_can_move_bars) {
+   pci_rescan_bus(parent);
+   goto out;
+   }
+
for_each_pci_bridge(dev, parent)
pci_hp_add_bridge(dev);
 
-- 
2.23.0

[PATCH v6 30/30] Revert "powerpc/powernv/pci: Work around races in PCI bridge enabling"

2019-10-24 Thread Sergey Miroshnichenko

This reverts commit db2173198b9513f7add8009f225afa1f1c79bcc6.

The root cause of this bug is fixed by the following two commits:

  1. "PCI: Fix race condition in pci_enable/disable_device()"
  2. "PCI: Enable bridge's I/O and MEM access for hotplugged devices"

x86 is also affected by this bug if a PCIe bridge has been hotplugged
without being pre-enabled by the BIOS.

CC: Benjamin Herrenschmidt 
Signed-off-by: Sergey Miroshnichenko 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 37 ---
 1 file changed, 37 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 33d5ed8c258f..f12f3a49d3bb 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -3119,49 +3119,12 @@ static void pnv_pci_ioda_create_dbgfs(void)
 #endif /* CONFIG_DEBUG_FS */
 }
 
-static void pnv_pci_enable_bridge(struct pci_bus *bus)
-{
-   struct pci_dev *dev = bus->self;
-   struct pci_bus *child;
-
-   /* Empty bus ? bail */
-   if (list_empty(&bus->devices))
-   return;
-
-   /*
-* If there's a bridge associated with that bus enable it. This works
-* around races in the generic code if the enabling is done during
-* parallel probing. This can be removed once those races have been
-* fixed.
-*/
-   if (dev) {
-   int rc = pci_enable_device(dev);
-   if (rc)
-   pci_err(dev, "Error enabling bridge (%d)\n", rc);
-   pci_set_master(dev);
-   }
-
-   /* Perform the same to child busses */
-   list_for_each_entry(child, &bus->children, node)
-   pnv_pci_enable_bridge(child);
-}
-
-static void pnv_pci_enable_bridges(void)
-{
-   struct pci_controller *hose;
-
-   list_for_each_entry(hose, &hose_list, list_node)
-   pnv_pci_enable_bridge(hose->bus);
-}
-
 static void pnv_pci_ioda_fixup(void)
 {
pnv_pci_ioda_setup_PEs();
pnv_pci_ioda_setup_iommu_api();
pnv_pci_ioda_create_dbgfs();
 
-   pnv_pci_enable_bridges();
-
 #ifdef CONFIG_EEH
pnv_eeh_post_init();
 #endif
-- 
2.23.0

[PATCH v6 27/30] nvme-pci: Handle movable BARs

2019-10-24 Thread Sergey Miroshnichenko

Hotplugged devices can affect the existing ones by moving their BARs. The
PCI subsystem will inform the NVME driver about this by invoking the
.rescan_prepare() and .rescan_done() hooks, so the BARs can be re-mapped.

Tested under the "randrw" mode of the fio tool. Before the hotplugging:

  % sudo cat /proc/iomem
  ...
3fe8-3fe8007f : PCI Bus 0020:0b
  3fe8-3fe8007f : PCI Bus 0020:18
3fe8-3fe8000f : 0020:18:00.0
  3fe8-3fe8000f : nvme
3fe80010-3fe80017 : 0020:18:00.0
  ...

, then another NVME drive was hot-added, so the BARs of 0020:18:00.0 were
moved:

  % sudo cat /proc/iomem
...
3fe8-3fe800ff : PCI Bus 0020:0b
  3fe8-3fe8007f : PCI Bus 0020:10
3fe8-3fe83fff : 0020:10:00.0
  3fe8-3fe83fff : nvme
3fe80001-3fe80001 : 0020:10:00.0
  3fe80080-3fe800ff : PCI Bus 0020:18
3fe80080-3fe8008f : 0020:18:00.0
  3fe80080-3fe8008f : nvme
3fe80090-3fe80097 : 0020:18:00.0
...

During the rescan, both READ and WRITE speeds drop to zero for a while
because the driver is paused, then recover.

Also tested with an NVME as a system drive.

Cc: linux-n...@lists.infradead.org
Cc: Christoph Hellwig 
Signed-off-by: Sergey Miroshnichenko 
---
 drivers/nvme/host/pci.c | 21 -
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 869f462e6b6e..5f162ea5a5f1 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1650,7 +1650,7 @@ static int nvme_remap_bar(struct nvme_dev *dev, unsigned 
long size)
 {
struct pci_dev *pdev = to_pci_dev(dev->dev);
 
-   if (size <= dev->bar_mapped_size)
+   if (dev->bar && size <= dev->bar_mapped_size)
return 0;
if (size > pci_resource_len(pdev, 0))
return -ENOMEM;
@@ -3059,6 +3059,23 @@ static void nvme_error_resume(struct pci_dev *pdev)
flush_work(&dev->ctrl.reset_work);
 }
 
+static void nvme_rescan_prepare(struct pci_dev *pdev)
+{
+   struct nvme_dev *dev = pci_get_drvdata(pdev);
+
+   nvme_dev_disable(dev, false);
+   nvme_dev_unmap(dev);
+   dev->bar = NULL;
+}
+
+static void nvme_rescan_done(struct pci_dev *pdev)
+{
+   struct nvme_dev *dev = pci_get_drvdata(pdev);
+
+   nvme_dev_map(dev);
+   nvme_reset_ctrl_sync(&dev->ctrl);
+}
+
 static const struct pci_error_handlers nvme_err_handler = {
.error_detected = nvme_error_detected,
.slot_reset = nvme_slot_reset,
@@ -3135,6 +3152,8 @@ static struct pci_driver nvme_driver = {
 #endif
.sriov_configure = pci_sriov_configure_simple,
.err_handler= &nvme_err_handler,
+   .rescan_prepare = nvme_rescan_prepare,
+   .rescan_done= nvme_rescan_done,
 };
 
 static int __init nvme_init(void)
-- 
2.23.0

[PATCH v6 28/30] PCI/portdrv: Declare support of movable BARs

2019-10-24 Thread Sergey Miroshnichenko

A switch's BARs are not used by the portdrv driver, but they are still
considered immovable until the .rescan_prepare() and .rescan_done() hooks
are added. Add these hooks to increase the chances of allocating new BARs.
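
The rule described above can be sketched as a small userspace model. The
names here (struct model_driver, model_dev_is_movable()) are hypothetical,
and treating a device without a bound driver as movable is an assumption
made for the sketch, not something this patch states.

```c
#include <stdbool.h>
#include <stddef.h>

struct model_dev;

/* Hypothetical stand-in for the two hooks added to struct pci_driver. */
struct model_driver {
	void (*rescan_prepare)(struct model_dev *dev);
	void (*rescan_done)(struct model_dev *dev);
};

struct model_dev {
	const struct model_driver *driver; /* NULL when no driver is bound */
};

/* A bound device is movable only if its driver provides both hooks. */
static bool model_dev_is_movable(const struct model_dev *dev)
{
	if (!dev->driver)
		return true; /* assumption: nobody holds a mapping of its BARs */

	return dev->driver->rescan_prepare && dev->driver->rescan_done;
}

/* Empty hook, like the portdrv ones: there is no mapping to tear down. */
static void model_noop_hook(struct model_dev *dev)
{
	(void)dev;
}
```

Empty hooks are enough for a driver that never maps its BARs: their presence
is what declares that the device may be relocated.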

Signed-off-by: Sergey Miroshnichenko 
---
 drivers/pci/pcie/portdrv_pci.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
index 0a87091a0800..9dbddc7faaa7 100644
--- a/drivers/pci/pcie/portdrv_pci.c
+++ b/drivers/pci/pcie/portdrv_pci.c
@@ -197,6 +197,14 @@ static const struct pci_error_handlers 
pcie_portdrv_err_handler = {
.resume = pcie_portdrv_err_resume,
 };
 
+static void pcie_portdrv_rescan_prepare(struct pci_dev *pdev)
+{
+}
+
+static void pcie_portdrv_rescan_done(struct pci_dev *pdev)
+{
+}
+
 static struct pci_driver pcie_portdriver = {
.name   = "pcieport",
.id_table   = &port_pci_ids[0],
@@ -207,6 +215,9 @@ static struct pci_driver pcie_portdriver = {
 
.err_handler= &pcie_portdrv_err_handler,
 
+   .rescan_prepare = pcie_portdrv_rescan_prepare,
+   .rescan_done= pcie_portdrv_rescan_done,
+
.driver.pm  = PCIE_PORTDRV_PM_OPS,
 };
 
-- 
2.23.0

[PATCH v6 26/30] PCI: hotplug: movable BARs: Enable the feature by default

2019-10-24 Thread Sergey Miroshnichenko

This is the last patch in the series which implements the essentials of the
Movable BARs feature, so it is turned on by default now. Tested on:

 - x86_64 with "pci=realloc,pcie_bus_peer2peer" command line argument;
 - POWER8 PowerNV+PHB3 ppc64le with "pci=realloc,pcie_bus_peer2peer".

In case of problems, it can still be overridden by the following command
line option:

  pcie_movable_bars=off

CC: Oliver O'Halloran 
Signed-off-by: Sergey Miroshnichenko 
---
 drivers/pci/pci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 85014c6b2817..6ec1b70e4a96 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -78,7 +78,7 @@ static void pci_dev_d3_sleep(struct pci_dev *dev)
 int pci_domains_supported = 1;
 #endif
 
-bool pci_can_move_bars;
+bool pci_can_move_bars = true;
 
 #define DEFAULT_CARDBUS_IO_SIZE(256)
 #define DEFAULT_CARDBUS_MEM_SIZE   (64*1024*1024)
-- 
2.23.0

[PATCH v6 25/30] PNP: Don't reserve BARs for PCI when enabled movable BARs

2019-10-24 Thread Sergey Miroshnichenko

When the Movable BARs feature is supported, the PCI subsystem is able to
distribute existing BARs and allocate new ones itself, without needing the
BIOS to reserve gaps.

CC: Rafael J. Wysocki 
Signed-off-by: Sergey Miroshnichenko 
---
 drivers/pnp/system.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/pnp/system.c b/drivers/pnp/system.c
index 6950503741eb..5977bd11f4d4 100644
--- a/drivers/pnp/system.c
+++ b/drivers/pnp/system.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -58,6 +59,9 @@ static void reserve_resources_of_dev(struct pnp_dev *dev)
struct resource *res;
int i;
 
+   if (pci_can_move_bars)
+   return;
+
for (i = 0; (res = pnp_get_resource(dev, IORESOURCE_IO, i)); i++) {
if (res->flags & IORESOURCE_DISABLED)
continue;
-- 
2.23.0

[PATCH v6 23/30] powerpc/pci: hotplug: Add support for movable BARs

2019-10-24 Thread Sergey Miroshnichenko

Add pcibios_root_bus_rescan_prepare()/_done() hooks for powerpc, so it can
reassign the PE numbers (which depend on BAR sizes and locations) and
update the EEH address cache during a PCI rescan.

New PE numbers are assigned during pci_setup_bridges(root) after the rescan
is done.

CC: Oliver O'Halloran 
CC: Sam Bobroff 
Signed-off-by: Sergey Miroshnichenko 
---
 arch/powerpc/kernel/pci-hotplug.c | 43 +++
 drivers/pci/probe.c   | 10 +++
 include/linux/pci.h   |  3 +++
 3 files changed, 56 insertions(+)

diff --git a/arch/powerpc/kernel/pci-hotplug.c 
b/arch/powerpc/kernel/pci-hotplug.c
index fc62c4bc47b1..42847f5b0f08 100644
--- a/arch/powerpc/kernel/pci-hotplug.c
+++ b/arch/powerpc/kernel/pci-hotplug.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static struct pci_bus *find_bus_among_children(struct pci_bus *bus,
   struct device_node *dn)
@@ -151,3 +152,45 @@ void pci_hp_add_devices(struct pci_bus *bus)
pcibios_finish_adding_to_bus(bus);
 }
 EXPORT_SYMBOL_GPL(pci_hp_add_devices);
+
+static void pci_hp_bus_rescan_prepare(struct pci_bus *bus)
+{
+   struct pci_dev *dev;
+
+   list_for_each_entry(dev, &bus->devices, bus_list) {
+   struct pci_bus *child = dev->subordinate;
+
+   if (child)
+   pci_hp_bus_rescan_prepare(child);
+
+   iommu_del_device(&dev->dev);
+   }
+
+   list_for_each_entry(dev, &bus->devices, bus_list) {
+   pcibios_release_device(dev);
+   }
+}
+
+static void pci_hp_bus_rescan_done(struct pci_bus *bus)
+{
+   struct pci_dev *dev;
+
+   list_for_each_entry(dev, &bus->devices, bus_list) {
+   struct pci_bus *child = dev->subordinate;
+
+   pcibios_bus_add_device(dev);
+
+   if (child)
+   pci_hp_bus_rescan_done(child);
+   }
+}
+
+void pcibios_root_bus_rescan_prepare(struct pci_bus *root)
+{
+   pci_hp_bus_rescan_prepare(root);
+}
+
+void pcibios_root_bus_rescan_done(struct pci_bus *root)
+{
+   pci_hp_bus_rescan_done(root);
+}
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 73452aa81417..539f5d39bb6d 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -3235,6 +3235,14 @@ static void pci_bus_rescan_done(struct pci_bus *bus)
pci_config_pm_runtime_put(bus->self);
 }
 
+void __weak pcibios_root_bus_rescan_prepare(struct pci_bus *root)
+{
+}
+
+void __weak pcibios_root_bus_rescan_done(struct pci_bus *root)
+{
+}
+
 static void pci_setup_bridges(struct pci_bus *bus)
 {
struct pci_dev *dev;
@@ -3430,6 +3438,7 @@ unsigned int pci_rescan_bus(struct pci_bus *bus)
root = root->parent;
 
if (pci_can_move_bars) {
+   pcibios_root_bus_rescan_prepare(root);
pci_bus_rescan_prepare(root);
pci_bus_update_immovable_range(root);
pci_bus_release_root_bridge_resources(root);
@@ -3440,6 +3449,7 @@ unsigned int pci_rescan_bus(struct pci_bus *bus)
 
pci_setup_bridges(root);
pci_bus_rescan_done(root);
+   pcibios_root_bus_rescan_done(root);
} else {
max = pci_scan_child_bus(bus);
pci_assign_unassigned_bus_resources(bus);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index e1edcb3fad31..b5821134bdae 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1275,6 +1275,9 @@ unsigned int pci_rescan_bus(struct pci_bus *bus);
 void pci_lock_rescan_remove(void);
 void pci_unlock_rescan_remove(void);
 
+void pcibios_root_bus_rescan_prepare(struct pci_bus *root);
+void pcibios_root_bus_rescan_done(struct pci_bus *root);
+
 /* Vital Product Data routines */
 ssize_t pci_read_vpd(struct pci_dev *dev, loff_t pos, size_t count, void *buf);
 ssize_t pci_write_vpd(struct pci_dev *dev, loff_t pos, size_t count, const 
void *buf);
-- 
2.23.0

[PATCH v6 24/30] powerpc/powernv/pci: Suppress an EEH error when reading an empty slot

2019-10-24 Thread Sergey Miroshnichenko

Reading an empty slot returns all ones, which triggers a false EEH
error event on PowerNV. A rescan is performed after all the PEs have
been unmapped, so the reserved PE index is used for unfreezing.
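
What "all ones" means for each access width can be sketched as follows.
model_all_ones() is a hypothetical helper written for illustration; it is
assumed to match what the EEH_IO_ERROR_VALUE(size) macro used in the diff
below evaluates to.

```c
#include <stdint.h>

/*
 * Hypothetical helper: the value a config read of `size` bytes (1, 2 or 4)
 * returns when no device answers, i.e. every bit of the read width set.
 */
static uint32_t model_all_ones(int size)
{
	return UINT32_MAX >> ((4 - size) * 8);
}
```

A read that succeeds but returns this value is therefore indistinguishable
from a probe of an empty slot, which is the case the unfreeze handles.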

CC: Oliver O'Halloran 
CC: Sam Bobroff 
Signed-off-by: Sergey Miroshnichenko 
---
 arch/powerpc/platforms/powernv/pci.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci.c 
b/arch/powerpc/platforms/powernv/pci.c
index ffd546cf9204..e1b45dc96474 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -768,9 +768,16 @@ static int pnv_pci_read_config(struct pci_bus *bus,
 
*val = 0xFFFFFFFF;
pdn = pci_get_pdn_by_devfn(bus, devfn);
-   if (!pdn)
-   return pnv_pci_cfg_read_raw(phb->opal_id, bus->number, devfn,
-   where, size, val);
+   if (!pdn) {
+   ret = pnv_pci_cfg_read_raw(phb->opal_id, bus->number, devfn,
+  where, size, val);
+
+   if (!ret && (*val == EEH_IO_ERROR_VALUE(size)) && 
phb->unfreeze_pe)
+   phb->unfreeze_pe(phb, phb->ioda.reserved_pe_idx,
+OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
+
+   return ret;
+   }
 
if (!pnv_pci_cfg_check(pdn))
return PCIBIOS_DEVICE_NOT_FOUND;
-- 
2.23.0

[PATCH v6 22/30] powerpc/pci: Create pci_dn on demand

2019-10-24 Thread Sergey Miroshnichenko

If a struct pci_dn hasn't yet been created for the PCIe device (there was
no DT node for it), allocate this structure and fill it with info read
directly from the device.

CC: Oliver O'Halloran 
CC: Sam Bobroff 
Signed-off-by: Sergey Miroshnichenko 
---
 arch/powerpc/kernel/pci_dn.c | 88 ++--
 1 file changed, 74 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
index 9524009ca1ae..ad0ecf48e943 100644
--- a/arch/powerpc/kernel/pci_dn.c
+++ b/arch/powerpc/kernel/pci_dn.c
@@ -20,6 +20,9 @@
 #include 
 #include 
 
+static struct pci_dn *pci_create_pdn_from_dev(struct pci_dev *pdev,
+ struct pci_dn *parent);
+
 /*
  * The function is used to find the firmware data of one
  * specific PCI device, which is attached to the indicated
@@ -52,6 +55,9 @@ static struct pci_dn *pci_bus_to_pdn(struct pci_bus *bus)
dn = pci_bus_to_OF_node(pbus);
pdn = dn ? PCI_DN(dn) : NULL;
 
+   if (!pdn && pbus->self)
+   pdn = pbus->self->dev.archdata.pci_data;
+
return pdn;
 }
 
@@ -61,10 +67,13 @@ struct pci_dn *pci_get_pdn_by_devfn(struct pci_bus *bus,
struct device_node *dn = NULL;
struct pci_dn *parent, *pdn;
struct pci_dev *pdev = NULL;
+   bool pdev_found = false;
 
/* Fast path: fetch from PCI device */
list_for_each_entry(pdev, &bus->devices, bus_list) {
if (pdev->devfn == devfn) {
+   pdev_found = true;
+
if (pdev->dev.archdata.pci_data)
return pdev->dev.archdata.pci_data;
 
@@ -73,6 +82,9 @@ struct pci_dn *pci_get_pdn_by_devfn(struct pci_bus *bus,
}
}
 
+   if (!pdev_found)
+   pdev = NULL;
+
/* Fast path: fetch from device node */
pdn = dn ? PCI_DN(dn) : NULL;
if (pdn)
@@ -85,9 +97,12 @@ struct pci_dn *pci_get_pdn_by_devfn(struct pci_bus *bus,
 
list_for_each_entry(pdn, &parent->child_list, list) {
if (pdn->busno == bus->number &&
-pdn->devfn == devfn)
-return pdn;
-}
+   pdn->devfn == devfn) {
+   if (pdev)
+   pdev->dev.archdata.pci_data = pdn;
+   return pdn;
+   }
+   }
 
return NULL;
 }
@@ -117,17 +132,17 @@ struct pci_dn *pci_get_pdn(struct pci_dev *pdev)
 
list_for_each_entry(pdn, &parent->child_list, list) {
if (pdn->busno == pdev->bus->number &&
-   pdn->devfn == pdev->devfn)
+   pdn->devfn == pdev->devfn) {
+   pdev->dev.archdata.pci_data = pdn;
return pdn;
+   }
}
 
-   return NULL;
+   return pci_create_pdn_from_dev(pdev, parent);
 }
 
-#ifdef CONFIG_PCI_IOV
-static struct pci_dn *add_one_dev_pci_data(struct pci_dn *parent,
-  int vf_index,
-  int busno, int devfn)
+static struct pci_dn *pci_alloc_pdn(struct pci_dn *parent,
+   int busno, int devfn)
 {
struct pci_dn *pdn;
 
@@ -143,7 +158,6 @@ static struct pci_dn *add_one_dev_pci_data(struct pci_dn 
*parent,
pdn->parent = parent;
pdn->busno = busno;
pdn->devfn = devfn;
-   pdn->vf_index = vf_index;
pdn->pe_number = IODA_INVALID_PE;
INIT_LIST_HEAD(&pdn->child_list);
INIT_LIST_HEAD(&pdn->list);
@@ -151,7 +165,51 @@ static struct pci_dn *add_one_dev_pci_data(struct pci_dn 
*parent,
 
return pdn;
 }
-#endif
+
+static struct pci_dn *pci_create_pdn_from_dev(struct pci_dev *pdev,
+ struct pci_dn *parent)
+{
+   struct pci_dn *pdn = NULL;
+   u32 class_code;
+   u16 device_id;
+   u16 vendor_id;
+
+   if (!parent)
+   return NULL;
+
+   pdn = pci_alloc_pdn(parent, pdev->bus->busn_res.start, pdev->devfn);
+   pci_info(pdev, "Create a new pdn for devfn %2x\n", pdev->devfn / 8);
+
+   if (!pdn) {
+   pci_err(pdev, "%s: Failed to allocate pdn\n", __func__);
+   return NULL;
+   }
+
+   #ifdef CONFIG_EEH
+   if (!eeh_dev_init(pdn)) {
+   kfree(pdn);
+   pci_err(pdev, "%s: Failed to allocate edev\n", __func__);
+   return NULL;
+   }
+   #endif /* CONFIG_EEH */
+
+   pci_bus_read_config_word(pdev->bus, pdev->devfn,
+PCI_VENDOR_ID, &vendor_id);
+   pdn->vendor_id = vendor_id;
+
+   pci_bus_read_config_word(pdev->bus, pdev->devfn,
+PCI_DEVICE_ID, &device_id);
+   pdn->device_id = device_id;
+
+   pci_bus_read_config_dword(pdev->bus, pdev->devfn,
+ PCI_CLASS_REVISION, &class_code);
+   class_code >>=

[PATCH v6 21/30] powerpc/pci: Access PCI config space directly w/o pci_dn

2019-10-24 Thread Sergey Miroshnichenko

To fetch an updated DT for a newly hotplugged device, the OS must explicitly
request it from the firmware via the pnv_php driver.

If pnv_php wasn't triggered/loaded, it is still possible to discover new
devices, provided that PCIe I/O does not stop in the absence of the pci_dn
structure.
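
The fallback described above can be sketched in isolation. This is only a
user-space illustration of the dispatch, not the kernel code: toy_pdn,
toy_access and toy_read_config() are hypothetical names, and no real config
space access takes place.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct pci_dn: only what the sketch needs. */
struct toy_pdn {
	int busno;
	int devfn;
};

/* Which path served the access; exists only for illustration. */
enum toy_path { TOY_PATH_PDN, TOY_PATH_RAW };

struct toy_access {
	enum toy_path path;
	int busno;
	int devfn;
};

/*
 * Take bus/devfn from the pdn when one exists; otherwise fall back to the
 * numbers known from the bus itself, so that a missing pdn does not stop
 * the config access.
 */
static struct toy_access toy_read_config(const struct toy_pdn *pdn,
					 int bus_number, int devfn)
{
	struct toy_access a;

	if (pdn) {
		a.path = TOY_PATH_PDN;
		a.busno = pdn->busno;
		a.devfn = pdn->devfn;
	} else {
		a.path = TOY_PATH_RAW;
		a.busno = bus_number;
		a.devfn = devfn;
	}
	return a;
}
```

In the patch itself the same split shows up as rtas_read_config() for the
pdn case and rtas_read_raw_config() for the case without a pdn.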

Reviewed-by: Oliver O'Halloran 
Signed-off-by: Sergey Miroshnichenko 
---
 arch/powerpc/kernel/rtas_pci.c   | 97 +++-
 arch/powerpc/platforms/powernv/pci.c | 64 --
 2 files changed, 109 insertions(+), 52 deletions(-)

diff --git a/arch/powerpc/kernel/rtas_pci.c b/arch/powerpc/kernel/rtas_pci.c
index ae5e43eaca48..912da28b3737 100644
--- a/arch/powerpc/kernel/rtas_pci.c
+++ b/arch/powerpc/kernel/rtas_pci.c
@@ -42,10 +42,26 @@ static inline int config_access_valid(struct pci_dn *dn, 
int where)
return 0;
 }
 
-int rtas_read_config(struct pci_dn *pdn, int where, int size, u32 *val)
+static int rtas_read_raw_config(unsigned long buid, int busno, unsigned int 
devfn,
+   int where, int size, u32 *val)
 {
int returnval = -1;
-   unsigned long buid, addr;
+   unsigned long addr = rtas_config_addr(busno, devfn, where);
+   int ret;
+
+   if (buid) {
+   ret = rtas_call(ibm_read_pci_config, 4, 2, &returnval,
+   addr, BUID_HI(buid), BUID_LO(buid), size);
+   } else {
+   ret = rtas_call(read_pci_config, 2, 2, &returnval, addr, size);
+   }
+   *val = returnval;
+
+   return ret;
+}
+
+int rtas_read_config(struct pci_dn *pdn, int where, int size, u32 *val)
+{
int ret;
 
if (!pdn)
@@ -58,16 +74,8 @@ int rtas_read_config(struct pci_dn *pdn, int where, int 
size, u32 *val)
return PCIBIOS_SET_FAILED;
 #endif
 
-   addr = rtas_config_addr(pdn->busno, pdn->devfn, where);
-   buid = pdn->phb->buid;
-   if (buid) {
-   ret = rtas_call(ibm_read_pci_config, 4, 2, &returnval,
-   addr, BUID_HI(buid), BUID_LO(buid), size);
-   } else {
-   ret = rtas_call(read_pci_config, 2, 2, &returnval, addr, size);
-   }
-   *val = returnval;
-
+   ret = rtas_read_raw_config(pdn->phb->buid, pdn->busno, pdn->devfn,
+  where, size, val);
if (ret)
return PCIBIOS_DEVICE_NOT_FOUND;
 
@@ -85,18 +93,44 @@ static int rtas_pci_read_config(struct pci_bus *bus,
 
pdn = pci_get_pdn_by_devfn(bus, devfn);
 
-   /* Validity of pdn is checked in here */
-   ret = rtas_read_config(pdn, where, size, val);
-   if (*val == EEH_IO_ERROR_VALUE(size) &&
-   eeh_dev_check_failure(pdn_to_eeh_dev(pdn)))
-   return PCIBIOS_DEVICE_NOT_FOUND;
+   if (pdn) {
+   /* Validity of pdn is checked in here */
+   ret = rtas_read_config(pdn, where, size, val);
+
+   if (*val == EEH_IO_ERROR_VALUE(size) &&
+   eeh_dev_check_failure(pdn_to_eeh_dev(pdn)))
+   ret = PCIBIOS_DEVICE_NOT_FOUND;
+   } else {
+   struct pci_controller *phb = pci_bus_to_host(bus);
+
+   ret = rtas_read_raw_config(phb->buid, bus->number, devfn,
+  where, size, val);
+   }
 
return ret;
 }
 
+static int rtas_write_raw_config(unsigned long buid, int busno, unsigned int 
devfn,
+int where, int size, u32 val)
+{
+   unsigned long addr = rtas_config_addr(busno, devfn, where);
+   int ret;
+
+   if (buid) {
+   ret = rtas_call(ibm_write_pci_config, 5, 1, NULL, addr,
+   BUID_HI(buid), BUID_LO(buid), size, (ulong)val);
+   } else {
+   ret = rtas_call(write_pci_config, 3, 1, NULL, addr, size, 
(ulong)val);
+   }
+
+   if (ret)
+   return PCIBIOS_DEVICE_NOT_FOUND;
+
+   return PCIBIOS_SUCCESSFUL;
+}
+
 int rtas_write_config(struct pci_dn *pdn, int where, int size, u32 val)
 {
-   unsigned long buid, addr;
int ret;
 
if (!pdn)
@@ -109,15 +143,8 @@ int rtas_write_config(struct pci_dn *pdn, int where, int 
size, u32 val)
return PCIBIOS_SET_FAILED;
 #endif
 
-   addr = rtas_config_addr(pdn->busno, pdn->devfn, where);
-   buid = pdn->phb->buid;
-   if (buid) {
-   ret = rtas_call(ibm_write_pci_config, 5, 1, NULL, addr,
-   BUID_HI(buid), BUID_LO(buid), size, (ulong) val);
-   } else {
-   ret = rtas_call(write_pci_config, 3, 1, NULL, addr, size, 
(ulong)val);
-   }
-
+   ret = rtas_write_raw_config(pdn->phb->buid, pdn->busno, pdn->devfn,
+   where, size, val);
if (ret)
return PCIBIOS_DEVICE_NOT_FOUND;
 
@@ -128,12 +155,20 @@ static int rtas_pci_write_config(struct pci_bus *bus,
 unsigned int devfn,
 int where, int size,

[PATCH v6 20/30] powerpc/pci: Fix crash with enabled movable BARs

2019-10-24 Thread Sergey Miroshnichenko

Add a check for the IORESOURCE_UNSET resource flag to skip the released BARs.
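
As a rough illustration of the extended check, the predicate below mirrors
the condition from the diff in user space. The toy_resource type and the
TOY_* flag values are stand-ins invented for this sketch; the real
definitions are in include/linux/ioport.h.

```c
#include <stdbool.h>

/* Illustrative flag values only, not copied from the kernel headers. */
#define TOY_IORESOURCE_IO    0x00000100UL
#define TOY_IORESOURCE_UNSET 0x20000000UL

struct toy_resource {
	unsigned long long start;
	unsigned long long end;
	unsigned long flags;
};

/* Returns true when PE setup should skip this resource. */
static bool toy_skip_resource(const struct toy_resource *res)
{
	return !res || !res->flags || res->start > res->end ||
	       (res->flags & TOY_IORESOURCE_UNSET);
}
```

A released BAR keeps its type bits, so without the last term it would pass
the older three checks and still be treated as a valid resource.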

CC: Alexey Kardashevskiy 
CC: Oliver O'Halloran 
CC: Sam Bobroff 
Signed-off-by: Sergey Miroshnichenko 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index c28d0d9b7ee0..33d5ed8c258f 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2976,7 +2976,8 @@ static void pnv_ioda_setup_pe_res(struct pnv_ioda_pe *pe,
int index;
int64_t rc;
 
-   if (!res || !res->flags || res->start > res->end)
+   if (!res || !res->flags || res->start > res->end ||
+   (res->flags & IORESOURCE_UNSET))
return;
 
if (res->flags & IORESOURCE_IO) {
-- 
2.23.0

[PATCH v6 14/30] PCI: Make sure bridge windows include their fixed BARs

2019-10-24 Thread Sergey Miroshnichenko

When the time comes to select a start address for the bridge window during
the root bus rescan, it should not be just the lowest possible address: this
window must cover all the underlying fixed and immovable BARs. The lowest
address that satisfies this requirement is the start of the .realloc_range
field of struct pci_bus, which is calculated during the preparation for the
rescan.
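
The covering requirement can be written down as a small predicate. This is
a user-space sketch with hypothetical names (toy_range, toy_window_ok); it
follows the check_fixed logic in the diff below, where a range with
start >= end means "no fixed BARs under this bridge".

```c
#include <stdbool.h>
#include <stdint.h>

/* Inclusive address range, like start/end in struct resource. */
struct toy_range {
	uint64_t start;
	uint64_t end;
};

/* True if the allocated window still covers the whole fixed area. */
static bool toy_window_ok(struct toy_range window, struct toy_range fixed)
{
	if (fixed.start >= fixed.end)
		return true;	/* nothing fixed below, any window will do */

	return window.start <= fixed.start && window.end >= fixed.end;
}
```

For example, a fixed area 0x2000-0x2fff is covered by a window
0x1000-0x3fff, but not by a window 0x2800-0x4fff.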

Signed-off-by: Sergey Miroshnichenko 
---
 drivers/pci/bus.c   |  2 +-
 drivers/pci/setup-res.c | 31 +--
 2 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
index 8e40b3e6da77..a1efa87e31b9 100644
--- a/drivers/pci/bus.c
+++ b/drivers/pci/bus.c
@@ -192,7 +192,7 @@ static int pci_bus_alloc_from_region(struct pci_bus *bus, 
struct resource *res,
 * this is an already-configured bridge window, its start
 * overrides "min".
 */
-   if (avail.start)
+   if (min_used < avail.start)
min_used = avail.start;
 
max = avail.end;
diff --git a/drivers/pci/setup-res.c b/drivers/pci/setup-res.c
index a1657a8bf93d..1570bbd620cd 100644
--- a/drivers/pci/setup-res.c
+++ b/drivers/pci/setup-res.c
@@ -248,9 +248,23 @@ static int __pci_assign_resource(struct pci_bus *bus, 
struct pci_dev *dev,
struct resource *res = dev->resource + resno;
resource_size_t min;
int ret;
+   resource_size_t start = (resource_size_t)-1;
+   resource_size_t end = 0;
 
min = (res->flags & IORESOURCE_IO) ? PCIBIOS_MIN_IO : PCIBIOS_MIN_MEM;
 
+   if (dev->subordinate && resno >= PCI_BRIDGE_RESOURCES) {
+   struct pci_bus *child_bus = dev->subordinate;
+   int b_resno = resno - PCI_BRIDGE_RESOURCES;
+   struct resource *immovable_range = 
&child_bus->immovable_range[b_resno];
+
+   if (immovable_range->start < immovable_range->end) {
+   start = immovable_range->start;
+   end = immovable_range->end;
+   min = child_bus->realloc_range[b_resno].start;
+   }
+   }
+
/*
 * First, try exact prefetching match.  Even if a 64-bit
 * prefetchable bridge window is below 4GB, we can't put a 32-bit
@@ -262,7 +276,7 @@ static int __pci_assign_resource(struct pci_bus *bus, 
struct pci_dev *dev,
 IORESOURCE_PREFETCH | IORESOURCE_MEM_64,
 pcibios_align_resource, dev);
if (ret == 0)
-   return 0;
+   goto check_fixed;
 
/*
 * If the prefetchable window is only 32 bits wide, we can put
@@ -274,7 +288,7 @@ static int __pci_assign_resource(struct pci_bus *bus, 
struct pci_dev *dev,
 IORESOURCE_PREFETCH,
 pcibios_align_resource, dev);
if (ret == 0)
-   return 0;
+   goto check_fixed;
}
 
/*
@@ -287,6 +301,19 @@ static int __pci_assign_resource(struct pci_bus *bus, 
struct pci_dev *dev,
ret = pci_bus_alloc_resource(bus, res, size, align, min, 0,
 pcibios_align_resource, dev);
 
+check_fixed:
+   if (ret == 0 && start < end) {
+   if (res->start > start || res->end < end) {
+   dev_err(&dev->dev, "fixed area 0x%llx-0x%llx for %s doesn't fit in the allocated %pR (0x%llx-0x%llx)",
+   (unsigned long long)start, (unsigned long long)end,
+   dev_name(&dev->dev),
+   res, (unsigned long long)res->start,
+   (unsigned long long)res->end);
+   release_resource(res);
+   return -1;
+   }
+   }
+
return ret;
 }
 
-- 
2.23.0

[PATCH v6 16/30] PCI: hotplug: movable BARs: Assign fixed and immovable BARs before others

2019-10-24 Thread Sergey Miroshnichenko

Reassign resources during rescan in two steps: first the fixed/immovable
BARs and bridge windows that have fixed areas, so the movable ones will not
steal these reserved areas; then the rest - so the movable BARs will divide
the rest of the space.

With this change, pci_assign_resource() is now able to assign all types of
BARs, so pdev_assign_fixed_resources() has become unused and is removed.
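
The two-step order can be illustrated with a small user-space sketch. The
names toy_bar, toy_step and toy_assign_all() are hypothetical, and
"assigning" here only records the order in which BARs would be handled.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_bar {
	int id;
	bool is_fixed;
};

enum toy_step { TOY_ASSIGN_FIXED, TOY_ASSIGN_FLOAT };

/* Append to 'order' the ids handled in this step; returns the new count. */
static size_t toy_assign_step(const struct toy_bar *bars, size_t n,
			      enum toy_step step, int *order, size_t count)
{
	for (size_t i = 0; i < n; i++) {
		if (step == TOY_ASSIGN_FIXED && !bars[i].is_fixed)
			continue;
		if (step == TOY_ASSIGN_FLOAT && bars[i].is_fixed)
			continue;
		order[count++] = bars[i].id;
	}
	return count;
}

/* Two passes over the same list: fixed BARs first, then the movable ones. */
static size_t toy_assign_all(const struct toy_bar *bars, size_t n, int *order)
{
	size_t count = toy_assign_step(bars, n, TOY_ASSIGN_FIXED, order, 0);

	return toy_assign_step(bars, n, TOY_ASSIGN_FLOAT, order, count);
}
```

With the list {1: movable, 2: fixed, 3: movable, 4: fixed} the resulting
order is 2, 4, 1, 3, so the movable BARs only see the space that the fixed
ones have left.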

Signed-off-by: Sergey Miroshnichenko 
---
 drivers/pci/pci.h   |  2 ++
 drivers/pci/setup-bus.c | 78 -
 drivers/pci/setup-res.c |  7 ++--
 3 files changed, 53 insertions(+), 34 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 7cd108885598..9b5164d10499 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -289,6 +289,8 @@ void pci_bus_put(struct pci_bus *bus);
 
 bool pci_dev_bar_movable(struct pci_dev *dev, struct resource *res);
 
+int assign_fixed_resource_on_bus(struct pci_bus *b, struct resource *r);
+
 /* PCIe link information */
 #define PCIE_SPEED2STR(speed) \
((speed) == PCIE_SPEED_16_0GT ? "16 GT/s" : \
diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index c7365998fbd6..675a612236d7 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -38,6 +38,15 @@ struct pci_dev_resource {
unsigned long flags;
 };
 
+enum assign_step {
+   assign_fixed_resources,
+   assign_float_resources,
+};
+
+static void _assign_requested_resources_sorted(struct list_head *head,
+  struct list_head *fail_head,
+  enum assign_step step);
+
 static void free_list(struct list_head *head)
 {
struct pci_dev_resource *dev_res, *tmp;
@@ -278,19 +287,47 @@ static void reassign_resources_sorted(struct list_head 
*realloc_head,
  */
 static void assign_requested_resources_sorted(struct list_head *head,
 struct list_head *fail_head)
+{
+   _assign_requested_resources_sorted(head, fail_head, 
assign_fixed_resources);
+   _assign_requested_resources_sorted(head, fail_head, 
assign_float_resources);
+}
+
+static void _assign_requested_resources_sorted(struct list_head *head,
+  struct list_head *fail_head,
+  enum assign_step step)
 {
struct resource *res;
struct pci_dev_resource *dev_res;
int idx;
 
list_for_each_entry(dev_res, head, list) {
+   bool is_fixed = false;
+
if (!pci_dev_bars_enabled(dev_res->dev))
continue;
 
res = dev_res->res;
+   if (!resource_size(res))
+   continue;
+
idx = res - &dev_res->dev->resource[0];
-   if (resource_size(res) &&
-   pci_assign_resource(dev_res->dev, idx)) {
+
+   if (idx < PCI_BRIDGE_RESOURCES) {
+   is_fixed = !pci_dev_bar_movable(dev_res->dev, res);
+   } else {
+   int b_res_idx = pci_get_bridge_resource_idx(res);
+   struct resource *fixed_res =
+   
&dev_res->dev->subordinate->immovable_range[b_res_idx];
+
+   is_fixed = (fixed_res->start < fixed_res->end);
+   }
+
+   if (assign_fixed_resources == step && !is_fixed)
+   continue;
+   else if (assign_float_resources == step && is_fixed)
+   continue;
+
+   if (pci_assign_resource(dev_res->dev, idx)) {
if (fail_head) {
/*
 * If the failed resource is a ROM BAR and
@@ -1335,7 +1372,7 @@ void pci_bus_size_bridges(struct pci_bus *bus)
 }
 EXPORT_SYMBOL(pci_bus_size_bridges);
 
-static void assign_fixed_resource_on_bus(struct pci_bus *b, struct resource *r)
+int assign_fixed_resource_on_bus(struct pci_bus *b, struct resource *r)
 {
int i;
struct resource *parent_r;
@@ -1352,35 +1389,14 @@ static void assign_fixed_resource_on_bus(struct pci_bus 
*b, struct resource *r)
!(r->flags & IORESOURCE_PREFETCH))
continue;
 
-   if (resource_contains(parent_r, r))
-   request_resource(parent_r, r);
-   }
-}
-
-/*
- * Try to assign any resources marked as IORESOURCE_PCI_FIXED, as they are
- * skipped by pbus_assign_resources_sorted().
- */
-static void pdev_assign_fixed_resources(struct pci_dev *dev)
-{
-   int i;
-
-   for (i = 0; i <  PCI_NUM_RESOURCES; i++) {
-   struct pci_bus *b;
-   struct resource *r = &dev->resource[i];
-
-   if (r->parent || !(r->flags & IORESOURCE_PCI_FIXED) ||
-   !(r->flags & (IORESOURCE_IO | IORESOURCE_MEM)))
-   continue;
-
-   b = dev->bus;

[PATCH v6 19/30] PCI: hotplug: movable BARs: Ignore the MEM BAR offsets from bootloader

2019-10-24 Thread Sergey Miroshnichenko

BAR allocation by BIOS/UEFI/bootloader/firmware may be non-optimal and
it may even clash with the kernel's BAR assignment algorithm.

For example, if no space was reserved for SR-IOV BARs, and this bridge
window is packed between immovable BARs (so it is unable to extend),
and if this window can't be moved, the next PCI rescan will fail, as
the kernel tries to find a space for all the BARs, including SR-IOV.

With this patch the kernel will use its own BAR allocation methods
when possible, increasing the chances of a successful hotplug.

Also add a workaround for implicitly used video BARs on x86.
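
What "ignoring the offset" amounts to can be shown with a short user-space
sketch. The toy_mem_bar type and toy_forget_offset() are hypothetical; the
sketch keeps the BAR size and drops the firmware-chosen address, skipping
IO BARs and VGA devices as the diff below does.

```c
#include <stdbool.h>
#include <stdint.h>

struct toy_mem_bar {
	uint64_t start;
	uint64_t end;	/* inclusive */
	bool is_io;
	bool is_vga;
};

/* Drop the firmware-chosen offset but keep the size: [0, size - 1]. */
static void toy_forget_offset(struct toy_mem_bar *bar, bool can_move_bars)
{
	uint64_t size = bar->end - bar->start + 1;

	if (!can_move_bars || bar->is_io || bar->is_vga)
		return;

	bar->start = 0;
	bar->end = size - 1;
}
```

A 1 MiB MEM BAR found at 0xe0000000-0xe00fffff becomes 0x0-0xfffff, which
leaves the final placement entirely to the kernel's allocator.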

Signed-off-by: Sergey Miroshnichenko 
---
 drivers/pci/probe.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 94bbdf9b9dc1..73452aa81417 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -305,6 +305,16 @@ int __pci_read_base(struct pci_dev *dev, enum pci_bar_type 
type,
 pos, (unsigned long long)region.start);
}
 
+   if (pci_can_move_bars &&
+   !(res->flags & IORESOURCE_IO) &&
+   (dev->class >> 8) != PCI_CLASS_DISPLAY_VGA) {
+   pci_warn(dev, "ignore the current offset of BAR %llx-%llx\n",
+l64, l64 + sz64 - 1);
+   res->start = 0;
+   res->end = sz64 - 1;
+   res->flags |= IORESOURCE_SIZEALIGN;
+   }
+
goto out;
 
 
-- 
2.23.0

[PATCH v6 18/30] PCI: hotplug: Configure MPS for hot-added bridges during bus rescan

2019-10-24 Thread Sergey Miroshnichenko

Ensure that MPS settings are set up for bridges which are discovered during
a manually triggered rescan via sysfs. This sequence of bridge init (using
pci_rescan_bus()) will be used for pciehp hot-add events when BARs are
movable.

Signed-off-by: Sergey Miroshnichenko 
---
 drivers/pci/probe.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index d0d00cb3e965..94bbdf9b9dc1 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -3414,7 +3414,7 @@ static void pci_reassign_root_bus_resources(struct 
pci_bus *root)
 unsigned int pci_rescan_bus(struct pci_bus *bus)
 {
unsigned int max;
-   struct pci_bus *root = bus;
+   struct pci_bus *root = bus, *child;
 
while (!pci_is_root_bus(root))
root = root->parent;
@@ -3435,6 +3435,9 @@ unsigned int pci_rescan_bus(struct pci_bus *bus)
pci_assign_unassigned_bus_resources(bus);
}
 
+   list_for_each_entry(child, >children, node)
+   pcie_bus_configure_settings(child);
+
pci_bus_add_devices(bus);
 
return max;
-- 
2.23.0

[PATCH v6 17/30] PCI: hotplug: movable BARs: Don't reserve IO/mem bus space

2019-10-24 Thread Sergey Miroshnichenko

A hotplugged bridge with many hotplug-capable ports may request
reserving more IO space than the machine has. This could be overridden
with the "hpiosize=" kernel argument though.

But when BARs are movable, there is no need to reserve space anymore:
new BARs are allocated not from reserved gaps, but by rearranging the
existing BARs. Requesting a precise amount of space for bridge windows
increases the chances of adding the new bridge successfully.

Signed-off-by: Sergey Miroshnichenko 
---
 drivers/pci/setup-bus.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 675a612236d7..a68ec726010e 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -1285,7 +1285,7 @@ void __pci_bus_size_bridges(struct pci_bus *bus, struct 
list_head *realloc_head)
 
case PCI_HEADER_TYPE_BRIDGE:
pci_bridge_check_ranges(bus);
-   if (bus->self->is_hotplug_bridge) {
+   if (bus->self->is_hotplug_bridge && !pci_can_move_bars) {
additional_io_size  = pci_hotplug_io_size;
additional_mem_size = pci_hotplug_mem_size;
}
-- 
2.23.0

[PATCH v6 15/30] PCI: Fix assigning the fixed prefetchable resources

2019-10-24 Thread Sergey Miroshnichenko

Allow matching IORESOURCE_PCI_FIXED prefetchable BARs to non-prefetchable
windows, so they follow the same rules as immovable BARs.
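
The matching rule can be sketched as a standalone predicate. The TOY_*
flags and toy_can_place() are hypothetical simplifications made for this
sketch: the "type" is reduced to IO vs MEM and prefetchability is kept as a
separate bit.

```c
#include <stdbool.h>

#define TOY_TYPE_IO   0x1u
#define TOY_TYPE_MEM  0x2u
#define TOY_PREFETCH  0x4u	/* deliberately not part of the type mask */

/* May a BAR with bar_flags be placed into a window with window_flags? */
static bool toy_can_place(unsigned int window_flags, unsigned int bar_flags)
{
	unsigned int type_mask = TOY_TYPE_IO | TOY_TYPE_MEM;

	if ((window_flags & type_mask) != (bar_flags & type_mask))
		return false;

	/* A prefetchable window must not host a non-prefetchable BAR. */
	if ((window_flags & TOY_PREFETCH) && !(bar_flags & TOY_PREFETCH))
		return false;

	return true;
}
```

The opposite direction is allowed: a prefetchable BAR may land in a
non-prefetchable MEM window, which is the case this patch enables.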

Signed-off-by: Sergey Miroshnichenko 
---
 drivers/pci/setup-bus.c | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 653ba4d5f191..c7365998fbd6 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -1339,15 +1339,20 @@ static void assign_fixed_resource_on_bus(struct pci_bus 
*b, struct resource *r)
 {
int i;
struct resource *parent_r;
-   unsigned long mask = IORESOURCE_IO | IORESOURCE_MEM |
-IORESOURCE_PREFETCH;
+   unsigned long mask = IORESOURCE_TYPE_BITS;
 
pci_bus_for_each_resource(b, parent_r, i) {
if (!parent_r)
continue;
 
-   if ((r->flags & mask) == (parent_r->flags & mask) &&
-   resource_contains(parent_r, r))
+   if ((r->flags & mask) != (parent_r->flags & mask))
+   continue;
+
+   if (parent_r->flags & IORESOURCE_PREFETCH &&
+   !(r->flags & IORESOURCE_PREFETCH))
+   continue;
+
+   if (resource_contains(parent_r, r))
request_resource(parent_r, r);
}
 }
-- 
2.23.0

[PATCH v6 13/30] PCI: hotplug: movable BARs: Compute limits for relocated bridge windows

2019-10-24 Thread Sergey Miroshnichenko

With enabled movable BARs, bridge windows are recalculated during each pci
rescan. Some of the BARs below the bridge may be fixed/immovable: these
areas are represented by the .immovable_range field in struct pci_bus.

If a bridge window size is equal to its immovable range, it can only be
assigned to the start of this range. But if a bridge window size is larger,
and this difference in size is denoted as "delta", the window can start
from (immovable_range.start - delta) to (immovable_range.start), and it can
end from (immovable_range.end) to (immovable_range.end + delta). This range
(the new .realloc_range field in struct pci_bus) must then be compared with
immovable ranges of neighbouring bridges to guarantee no intersections.

This patch only calculates valid ranges for reallocated bridges during pci
rescan, and the next one will make use of these values during allocation.
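
The "delta" rule described above can be worked through with a small
user-space sketch. toy_range and toy_realloc_range() are hypothetical
names; the comparison with neighbouring bridges is left out, and clamping
the start at address 0 is an assumption of this sketch rather than
something taken from the patch.

```c
#include <stdint.h>

/* Inclusive address range. */
struct toy_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Given the immovable range under a bridge and the size of its window,
 * return the lowest possible window start and the highest possible
 * window end.
 */
static struct toy_range toy_realloc_range(struct toy_range imm,
					  uint64_t window_size)
{
	uint64_t imm_size = imm.end - imm.start + 1;
	uint64_t delta = window_size > imm_size ? window_size - imm_size : 0;
	struct toy_range r;

	/* Slide the window down by delta, but not below address 0. */
	r.start = imm.start > delta ? imm.start - delta : 0;
	/* Slide the window up by delta. */
	r.end = imm.end + delta;

	return r;
}
```

For an immovable range 0x4000-0x4fff and a window of size 0x3000, delta is
0x2000, so the window may start anywhere from 0x2000 to 0x4000 and end
anywhere from 0x4fff to 0x6fff. If the window is exactly as large as the
immovable range, the result collapses to that range.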

Signed-off-by: Sergey Miroshnichenko 
---
 drivers/pci/setup-bus.c | 67 +
 include/linux/pci.h |  6 
 2 files changed, 73 insertions(+)

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index a7546e02ea7c..653ba4d5f191 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -1819,6 +1819,72 @@ static enum enable_type pci_realloc_detect(struct 
pci_bus *bus,
 }
 #endif
 
+/*
+ * Calculate the address margins where the bridge windows may be allocated to 
fit all
+ * the fixed and immovable BARs beneath.
+ */
+static void pci_bus_update_realloc_range(struct pci_bus *bus)
+{
+   struct pci_dev *dev;
+   struct pci_bus *parent = bus->parent;
+   int idx;
+
+   list_for_each_entry(dev, &bus->devices, bus_list)
+   if (dev->subordinate)
+   pci_bus_update_realloc_range(dev->subordinate);
+
+   if (!parent || !bus->self)
+   return;
+
+   for (idx = 0; idx < PCI_BRIDGE_RESOURCE_NUM; ++idx) {
+   struct resource *immovable_range = &bus->immovable_range[idx];
+   resource_size_t window_size = resource_size(bus->resource[idx]);
+   resource_size_t realloc_start, realloc_end;
+
+   bus->realloc_range[idx].start = 0;
+   bus->realloc_range[idx].end = 0;
+
+   /* Check if there any immovable BARs under the bridge */
+   if (immovable_range->start >= immovable_range->end)
+   continue;
+
+   /* The lowest possible address where the bridge window can 
start */
+   realloc_start = immovable_range->end - window_size + 1;
+   /* The highest possible address where the bridge window can end 
*/
+   realloc_end = immovable_range->start + window_size - 1;
+
+   if (realloc_start > immovable_range->start)
+   realloc_start = immovable_range->start;
+
+   if (realloc_end < immovable_range->end)
+   realloc_end = immovable_range->end;
+
+   /*
+* Check that realloc range doesn't intersect with hard fixed 
ranges
+* of neighboring bridges
+*/
+   list_for_each_entry(dev, &parent->devices, bus_list) {
+   struct pci_bus *neighbor = dev->subordinate;
+   struct resource *n_imm_range;
+
+   if (!neighbor || neighbor == bus)
+   continue;
+
+   n_imm_range = &neighbor->immovable_range[idx];
+
+   if (n_imm_range->start >= n_imm_range->end)
+   continue;
+
+   if (n_imm_range->end < immovable_range->start &&
+   n_imm_range->end > realloc_start)
+   realloc_start = n_imm_range->end;
+   }
+
+   bus->realloc_range[idx].start = realloc_start;
+   bus->realloc_range[idx].end = realloc_end;
+   }
+}
+
 /*
  * First try will not touch PCI bridge res.
  * Second and later try will clear small leaf bridge res.
@@ -1838,6 +1904,7 @@ void pci_assign_unassigned_root_bus_resources(struct 
pci_bus *bus)
 
if (pci_can_move_bars) {
__pci_bus_size_bridges(bus, NULL);
+   pci_bus_update_realloc_range(bus);
__pci_bus_assign_resources(bus, NULL, NULL);
 
goto dump;
diff --git a/include/linux/pci.h b/include/linux/pci.h
index ef41be0ce082..e1edcb3fad31 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -587,6 +587,12 @@ struct pci_bus {
 */
struct resource immovable_range[PCI_BRIDGE_RESOURCE_NUM];
 
+   /*
+* Acceptable address range, where the bridge window may reside, 
considering its
+* size, so it will cover all the fixed and immovable BARs below.
+*/
+   struct resource realloc_range[PCI_BRIDGE_RESOURCE_NUM];
+
struct pci_ops  *ops;   /* Configuration access functions */

[PATCH v6 12/30] PCI: hotplug: movable BARs: Calculate immovable parts of bridge windows

2019-10-24 Thread Sergey Miroshnichenko

When movable BARs are enabled, and if a bridge contains a device with fixed
(IORESOURCE_PCI_FIXED) or immovable BARs, the corresponding windows can't be
moved too far away from their original positions - they must still contain
all the fixed/immovable BARs, like this:

  1) Window position before a bus rescan:

  | <--root bridge window--> |
  |  |
  | | <-- bridge window--> | |
  | | movable BARs | **fixed BAR** | |

  2) Possible valid outcome after rescan and move:

  | <--root bridge window--> |
  |  |
  || <-- bridge window--> |  |
  || **fixed BAR** | Movable BARs |  |

An immovable area of a bridge (separate for IO, MEM and MEM64 window types)
is a range that covers all the fixed and immovable BARs of direct children,
and all the fixed areas of child bridges:

  | <--root bridge window--> |
  |  |
  |  | <--  bridge window level 1--> |   |
  |  |  immovable area of this bridge window |   |
  |  |   |   |
  |  | **fixed BAR**  | <--  bridge window level 2--> | BARs |   |
  |  || * fixed area of this bridge * |  |   |
  |  ||   |  |   |
  |  || ***fixed BAR*** |   | ***fixed BAR*** |  |   |

To store these areas, the .immovable_range field has been added to struct
pci_bus. It is filled recursively from leaves to the root before a rescan.

Also make pbus_size_io() and pbus_size_mem() return either their usual
result or the size of the immovable range of the corresponding type,
whichever is larger.
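
The leaves-to-root computation of the immovable area can be sketched in
user space as follows. toy_bus, toy_range and the helpers are hypothetical
simplifications: one range per bus instead of one per window type, and an
empty range is encoded as start >= end, as in the patch.

```c
#include <stddef.h>
#include <stdint.h>

struct toy_range {
	uint64_t start;
	uint64_t end;	/* start >= end means "empty" */
};

struct toy_bus {
	const struct toy_range *fixed_bars;	/* fixed BARs of direct children */
	size_t nr_fixed;
	struct toy_bus **children;		/* buses behind child bridges */
	size_t nr_children;
	struct toy_range immovable;		/* computed result */
};

/* Grow 'acc' so that it also covers 'r'. */
static void toy_extend(struct toy_range *acc, struct toy_range r)
{
	if (r.start >= r.end)
		return;			/* nothing to add */
	if (acc->start >= acc->end) {
		*acc = r;		/* first non-empty range */
		return;
	}
	if (r.start < acc->start)
		acc->start = r.start;
	if (r.end > acc->end)
		acc->end = r.end;
}

/* Children first, so every parent sees up-to-date child ranges. */
static void toy_update_immovable(struct toy_bus *bus)
{
	bus->immovable.start = 0;
	bus->immovable.end = 0;

	for (size_t i = 0; i < bus->nr_children; i++) {
		toy_update_immovable(bus->children[i]);
		toy_extend(&bus->immovable, bus->children[i]->immovable);
	}

	for (size_t i = 0; i < bus->nr_fixed; i++)
		toy_extend(&bus->immovable, bus->fixed_bars[i]);
}
```

If a child bus holds fixed BARs at 0x5000-0x5fff and 0x8000-0x8fff, its
immovable area is 0x5000-0x8fff; a parent with its own fixed BAR at
0x3000-0x3fff then gets 0x3000-0x8fff.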

Signed-off-by: Sergey Miroshnichenko 
---
 drivers/pci/pci.h   | 14 +++
 drivers/pci/probe.c | 88 +
 drivers/pci/setup-bus.c | 17 
 include/linux/pci.h |  6 +++
 4 files changed, 125 insertions(+)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 55344f2c55bf..7cd108885598 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -401,6 +401,20 @@ static inline bool pci_dev_is_disconnected(const struct 
pci_dev *dev)
return dev->error_state == pci_channel_io_perm_failure;
 }
 
+static inline int pci_get_bridge_resource_idx(struct resource *r)
+{
+   int idx = 1;
+
+   if (r->flags & IORESOURCE_IO)
+   idx = 0;
+   else if (!(r->flags & IORESOURCE_PREFETCH))
+   idx = 1;
+   else if (r->flags & IORESOURCE_MEM_64)
+   idx = 2;
+
+   return idx;
+}
+
 /* pci_dev priv_flags */
 #define PCI_DEV_ADDED 0
 #define PCI_DEV_DISABLED_BARS 1
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 2d1157493e6a..d0d00cb3e965 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -545,6 +545,7 @@ void pci_read_bridge_bases(struct pci_bus *child)
 static struct pci_bus *pci_alloc_bus(struct pci_bus *parent)
 {
struct pci_bus *b;
+   int idx;
 
b = kzalloc(sizeof(*b), GFP_KERNEL);
if (!b)
@@ -561,6 +562,11 @@ static struct pci_bus *pci_alloc_bus(struct pci_bus 
*parent)
if (parent)
b->domain_nr = parent->domain_nr;
 #endif
+   for (idx = 0; idx < PCI_BRIDGE_RESOURCE_NUM; ++idx) {
+   b->immovable_range[idx].start = 0;
+   b->immovable_range[idx].end = 0;
+   }
+
return b;
 }
 
@@ -3238,6 +3244,87 @@ static void pci_setup_bridges(struct pci_bus *bus)
pci_setup_bridge(bus);
 }
 
+static void pci_bus_update_immovable_range(struct pci_bus *bus)
+{
+   struct pci_dev *dev;
+   int idx;
+   resource_size_t start, end;
+
+   for (idx = 0; idx < PCI_BRIDGE_RESOURCE_NUM; ++idx) {
+   bus->immovable_range[idx].start = 0;
+   bus->immovable_range[idx].end = 0;
+   }
+
+   list_for_each_entry(dev, &bus->devices, bus_list)
+   if (dev->subordinate)
+   pci_bus_update_immovable_range(dev->subordinate);
+
+   list_for_each_entry(dev, &bus->devices, bus_list) {
+   int i;
+   struct pci_bus *child = dev->subordinate;
+
+   for (i = 0; i < PCI_BRIDGE_RESOURCES; ++i) {
+   struct resource *r = &dev->resource[i];
+
+   if (!r->flags || (r->flags & IORESOURCE_UNSET) || 
!r->parent)
+   continue;
+
+   if (!pci_dev_bar_movable(dev, r)) {
+   idx =

[PATCH v6 11/30] PCI: hotplug: movable BARs: Try to assign unassigned resources only once

2019-10-24 Thread Sergey Miroshnichenko

With enabled BAR movement, BARs and bridge windows can only be assigned to
their direct parents, so there can be only one variant of resource tree,
thus every retry within the pci_assign_unassigned_root_bus_resources() will
result in the same tree, and it is enough to try just once.

In case of failures the pci_reassign_root_bus_resources() disables BARs for
one of the hotplugged devices and tries the assignment again.

Signed-off-by: Sergey Miroshnichenko 
---
 drivers/pci/setup-bus.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index cf325daae1b1..3deb1c343e89 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -1819,6 +1819,13 @@ void pci_assign_unassigned_root_bus_resources(struct 
pci_bus *bus)
int pci_try_num = 1;
enum enable_type enable_local;
 
+   if (pci_can_move_bars) {
+   __pci_bus_size_bridges(bus, NULL);
+   __pci_bus_assign_resources(bus, NULL, NULL);
+
+   goto dump;
+   }
+
/* Don't realloc if asked to do so */
enable_local = pci_realloc_detect(bus, pci_realloc_enable);
if (pci_realloc_enabled(enable_local)) {
-- 
2.23.0

[PATCH v6 10/30] PCI: Prohibit assigning BARs and bridge windows to non-direct parents

2019-10-24 Thread Sergey Miroshnichenko

When movable BARs are enabled, the resource relocation feature from
commit 2bbc6942273b5 ("PCI : ability to relocate assigned pci-resources")
is not used. Instead, the inability to assign a resource is used as a signal
to retry BAR assignment with another configuration of bridge windows.

Signed-off-by: Sergey Miroshnichenko 
---
 drivers/pci/setup-bus.c |  2 ++
 drivers/pci/setup-res.c | 12 
 2 files changed, 14 insertions(+)

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index ff33b47b1bb7..cf325daae1b1 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -1355,6 +1355,8 @@ static void pdev_assign_fixed_resources(struct pci_dev 
*dev)
while (b && !r->parent) {
assign_fixed_resource_on_bus(b, r);
b = b->parent;
+   if (!r->parent && pci_can_move_bars)
+   break;
}
}
 }
diff --git a/drivers/pci/setup-res.c b/drivers/pci/setup-res.c
index d8ca40a97693..a1657a8bf93d 100644
--- a/drivers/pci/setup-res.c
+++ b/drivers/pci/setup-res.c
@@ -298,6 +298,18 @@ static int _pci_assign_resource(struct pci_dev *dev, int 
resno,
 
bus = dev->bus;
while ((ret = __pci_assign_resource(bus, dev, resno, size, min_align))) 
{
+   if (pci_can_move_bars) {
+   if (resno >= PCI_BRIDGE_RESOURCES &&
+   resno <= PCI_BRIDGE_RESOURCE_END) {
+   struct resource *res = dev->resource + resno;
+
+   res->start = 0;
+   res->end = 0;
+   res->flags = 0;
+   }
+   break;
+   }
+
if (!bus->parent || !bus->self->transparent)
break;
bus = bus->parent;
-- 
2.23.0

[PATCH v6 09/30] PCI: Include fixed and immovable BARs into the bus size calculating

2019-10-24 Thread Sergey Miroshnichenko

The only difference between fixed/immovable and movable BARs is that the
former preserve their size and offset after they are released (the
corresponding struct resource* is detached from a bridge window for a while
during a bus rescan).

Include fixed/immovable BARs in the result of pbus_size_mem() and prohibit
assigning them to non-direct parents.

Signed-off-by: Sergey Miroshnichenko 
---
 drivers/pci/setup-bus.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 4b538d132958..ff33b47b1bb7 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -1011,12 +1011,20 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned 
long mask,
struct resource *r = &dev->resource[i];
resource_size_t r_size;
 
-   if (r->parent || (r->flags & IORESOURCE_PCI_FIXED) ||
+   if (r->parent ||
((r->flags & mask) != type &&
 (r->flags & mask) != type2 &&
 (r->flags & mask) != type3))
continue;
r_size = resource_size(r);
+
+   if (!pci_dev_bar_movable(dev, r)) {
+   if (pci_can_move_bars)
+   size += r_size;
+
+   continue;
+   }
+
 #ifdef CONFIG_PCI_IOV
/* Put SRIOV requested res to the optional list */
if (realloc_head && i >= PCI_IOV_RESOURCES &&
-- 
2.23.0

[PATCH v6 06/30] PCI: hotplug: movable BARs: Recalculate all bridge windows during rescan

2019-10-24 Thread Sergey Miroshnichenko

When the movable BARs feature is enabled and a rescan has been requested,
release all the bridge windows and recalculate them from scratch, taking
into account all kinds of BARs: fixed, immovable, movable, new.

This increases the chances of finding memory space to fit BARs for newly
hotplugged devices, especially if no/not enough gaps were reserved by the
BIOS/bootloader/firmware.

The last step of writing the recalculated windows to the bridges is done
by the new pci_setup_bridges() function.

Signed-off-by: Sergey Miroshnichenko 
---
 drivers/pci/pci.h   |  1 +
 drivers/pci/probe.c | 22 ++
 drivers/pci/setup-bus.c | 16 
 3 files changed, 39 insertions(+)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 19bc50597d12..4a3f2b69285b 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -280,6 +280,7 @@ void __pci_bus_assign_resources(const struct pci_bus *bus,
struct list_head *realloc_head,
struct list_head *fail_head);
 bool pci_bus_clip_resource(struct pci_dev *dev, int idx);
+void pci_bus_release_root_bridge_resources(struct pci_bus *bus);
 
 void pci_reassigndev_resource_alignment(struct pci_dev *dev);
 void pci_disable_bridge_window(struct pci_dev *dev);
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 3d8c0f653378..d2dbec51c4df 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -3200,6 +3200,25 @@ static void pci_bus_rescan_done(struct pci_bus *bus)
pci_config_pm_runtime_put(bus->self);
 }
 
+static void pci_setup_bridges(struct pci_bus *bus)
+{
+   struct pci_dev *dev;
+
+   list_for_each_entry(dev, &bus->devices, bus_list) {
+   struct pci_bus *child;
+
+   if (!pci_dev_is_added(dev))
+   continue;
+
+   child = dev->subordinate;
+   if (child)
+   pci_setup_bridges(child);
+   }
+
+   if (bus->self)
+   pci_setup_bridge(bus);
+}
+
 /**
  * pci_rescan_bus - Scan a PCI bus for devices
  * @bus: PCI bus to scan
@@ -3221,8 +3240,11 @@ unsigned int pci_rescan_bus(struct pci_bus *bus)
pci_bus_rescan_prepare(root);
 
max = pci_scan_child_bus(root);
+
+   pci_bus_release_root_bridge_resources(root);
pci_assign_unassigned_root_bus_resources(root);
 
+   pci_setup_bridges(root);
pci_bus_rescan_done(root);
} else {
max = pci_scan_child_bus(bus);
diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index f2f02e6c9000..075e8185b936 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -1635,6 +1635,22 @@ static void pci_bus_release_bridge_resources(struct pci_bus *bus,
pci_bridge_release_resources(bus, type);
 }
 
+void pci_bus_release_root_bridge_resources(struct pci_bus *root_bus)
+{
+   int i;
+   struct resource *r;
+
+   pci_bus_release_bridge_resources(root_bus, IORESOURCE_IO, whole_subtree);
+   pci_bus_release_bridge_resources(root_bus, IORESOURCE_MEM, whole_subtree);
+   pci_bus_release_bridge_resources(root_bus,
+                                    IORESOURCE_MEM_64 | IORESOURCE_PREFETCH,
+                                    whole_subtree);
+
+   pci_bus_for_each_resource(root_bus, r, i) {
+   pci_release_child_resources(root_bus, r);
+   }
+}
+
 static void pci_bus_dump_res(struct pci_bus *bus)
 {
struct resource *res;
-- 
2.23.0

[PATCH v6 08/30] PCI: hotplug: movable BARs: Don't allow added devices to steal resources

2019-10-24 Thread Sergey Miroshnichenko

When movable BARs are enabled, the PCI subsystem at first releases all the
bridge windows and then attempts to assign resources both to previously
working devices and to the newly hotplugged ones, with the same priority.

If a hotplugged device gets its BARs first, this may lead to lack of space
for already working devices, which is unacceptable. If that happens, mark
one of the new devices with the newly introduced flag PCI_DEV_DISABLED_BARS
(if it is not yet marked) and retry the BAR recalculation.

The worst case would be no BARs for hotplugged devices, while all the rest
just continue working.

The algorithm is simple and doesn't retry different subsets of hot-added
devices in case of a failure. E.g. if there is not enough space to allocate
BARs for both hotplugged devices A and B, but there is enough for just A,
then A will be marked with PCI_DEV_DISABLED_BARS first and, after the next
failure, B as well. As a result, A will not get BARs although it could.
This issue is only relevant when hotplugging two or more devices
simultaneously.
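
The retry behaviour described above can be modelled in user space. The
following is only a sketch under simplifying assumptions: each device requests
a single size, "assignment" succeeds whenever the enabled sizes fit into one
capacity value (no alignment or fragmentation), and every sim_* name is
hypothetical rather than part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified device model: one total BAR size per device. */
struct sim_dev {
	unsigned long bar_size;	/* space requested by the device */
	bool is_new;		/* hot-added during this rescan */
	bool bars_disabled;	/* models PCI_DEV_DISABLED_BARS */
};

/* Stand-in for BAR assignment: all enabled devices must fit. */
static bool sim_assign(const struct sim_dev *devs, size_t n,
		       unsigned long capacity)
{
	unsigned long used = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (!devs[i].bars_disabled)
			used += devs[i].bar_size;

	return used <= capacity;
}

/* First hot-added device that still has its BARs enabled, or NULL. */
static struct sim_dev *sim_find_next_new(struct sim_dev *devs, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (devs[i].is_new && !devs[i].bars_disabled)
			return &devs[i];

	return NULL;
}

/*
 * Retry loop: on failure, disable the BARs of one more new device and
 * try again. Returns how many new devices lost their BARs.
 */
static size_t sim_rescan(struct sim_dev *devs, size_t n,
			 unsigned long capacity)
{
	size_t disabled = 0;

	assert(devs != NULL);

	while (!sim_assign(devs, n, capacity)) {
		struct sim_dev *victim = sim_find_next_new(devs, n);

		if (!victim)
			break;	/* only previously working devices are left */

		victim->bars_disabled = true;
		disabled++;
	}

	return disabled;
}
```

With an existing device of size 100, new devices A (30) and B (50) and a
capacity of 130, the loop disables A and then B, although A alone would have
fit. This is the limitation the commit message describes.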

Add a new res_mask bitmask to the struct pci_dev for storing the indices of
assigned BARs.
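
As an illustration of how such a bitmask can be used, here is a user-space
sketch. The structure and the sim_* names are hypothetical; in particular, the
"every previously assigned BAR is assigned again" comparison is an assumption
about how the saved mask might be checked, not a copy of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SIM_NUM_BARS 6	/* hypothetical; stands in for PCI_BRIDGE_RESOURCES */

struct sim_res {
	bool valid;	/* models: flags set and IORESOURCE_UNSET clear */
	bool assigned;	/* models: r->parent != NULL */
};

/* Build a bitmask with bit i set if BAR i is currently assigned. */
static unsigned int sim_count_res_mask(const struct sim_res *res)
{
	unsigned int mask = 0;
	int i;

	assert(res != NULL);

	for (i = 0; i < SIM_NUM_BARS; i++)
		if (res[i].valid && res[i].assigned)
			mask |= 1u << i;

	return mask;
}

/* True if no BAR that was assigned before the rescan is missing now. */
static bool sim_all_bars_reassigned(unsigned int before, unsigned int after)
{
	return (before & ~after) == 0;
}
```

Saving the mask before the rescan and comparing it afterwards makes it cheap
to detect a working device that lost a BAR, without walking the resource tree
twice.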

Signed-off-by: Sergey Miroshnichenko 
---
 drivers/pci/pci.h   |  11 +
 drivers/pci/probe.c | 102 ++--
 drivers/pci/setup-bus.c |  15 ++
 include/linux/pci.h |   1 +
 4 files changed, 126 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 4a3f2b69285b..55344f2c55bf 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -403,6 +403,7 @@ static inline bool pci_dev_is_disconnected(const struct pci_dev *dev)
 
 /* pci_dev priv_flags */
 #define PCI_DEV_ADDED 0
+#define PCI_DEV_DISABLED_BARS 1
 
 static inline void pci_dev_assign_added(struct pci_dev *dev, bool added)
 {
@@ -414,6 +415,16 @@ static inline bool pci_dev_is_added(const struct pci_dev *dev)
return test_bit(PCI_DEV_ADDED, &dev->priv_flags);
 }
 
+static inline void pci_dev_disable_bars(struct pci_dev *dev)
+{
+   assign_bit(PCI_DEV_DISABLED_BARS, &dev->priv_flags, true);
+}
+
+static inline bool pci_dev_bars_enabled(const struct pci_dev *dev)
+{
+   return !test_bit(PCI_DEV_DISABLED_BARS, &dev->priv_flags);
+}
+
 #ifdef CONFIG_PCIEAER
 #include <linux/aer.h>
 
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index d2dbec51c4df..2d1157493e6a 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -3162,6 +3162,23 @@ bool pci_dev_bar_movable(struct pci_dev *dev, struct resource *res)
return pci_dev_movable(dev, res->child);
 }
 
+static unsigned int pci_dev_count_res_mask(struct pci_dev *dev)
+{
+   unsigned int res_mask = 0;
+   int i;
+
+   for (i = 0; i < PCI_BRIDGE_RESOURCES; i++) {
+   struct resource *r = &dev->resource[i];
+
+   if (!r->flags || (r->flags & IORESOURCE_UNSET) || !r->parent)
+   continue;
+
+   res_mask |= (1 << i);
+   }
+
+   return res_mask;
+}
+
 static void pci_bus_rescan_prepare(struct pci_bus *bus)
 {
struct pci_dev *dev;
@@ -3172,6 +3189,8 @@ static void pci_bus_rescan_prepare(struct pci_bus *bus)
list_for_each_entry(dev, &bus->devices, bus_list) {
struct pci_bus *child = dev->subordinate;
 
+   dev->res_mask = pci_dev_count_res_mask(dev);
+
if (child)
pci_bus_rescan_prepare(child);
 
@@ -3207,7 +3226,7 @@ static void pci_setup_bridges(struct pci_bus *bus)
list_for_each_entry(dev, &bus->devices, bus_list) {
struct pci_bus *child;
 
-   if (!pci_dev_is_added(dev))
+   if (!pci_dev_is_added(dev) || !pci_dev_bars_enabled(dev))
continue;
 
child = dev->subordinate;
@@ -3219,6 +3238,83 @@ static void pci_setup_bridges(struct pci_bus *bus)
pci_setup_bridge(bus);
 }
 
+static struct pci_dev *pci_find_next_new_device(struct pci_bus *bus)
+{
+   struct pci_dev *dev;
+
+   if (!bus)
+   return NULL;
+
+   list_for_each_entry(dev, &bus->devices, bus_list) {
+   struct pci_bus *child_bus = dev->subordinate;
+
+   if (!pci_dev_is_added(dev) && pci_dev_bars_enabled(dev))
+   return dev;
+
+   if (child_bus) {
+   struct pci_dev *next_new_dev;
+
+   next_new_dev = pci_find_next_new_device(child_bus);
+   if (next_new_dev)
+   return next_new_dev;
+   }
+   }
+
+   return NULL;
+}
+
+static bool pci_bus_check_all_bars_reassigned(struct pci_bus *bus)
+{
+   struct pci_dev *dev;
+   bool ret = true;
+
+   if (!bus)
+   return false;
+
+   list_for_each_entry(dev, &bus->devices, bus_list) {
+   struct pci_bus *child = dev->subordinate;
+   unsigned int res_mask = pci_dev_count_res_mask(dev);
+
+   if (!pci_dev_bars_enabled(dev))
+

[PATCH v6 05/30] PCI: hotplug: movable BARs: Fix reassigning the released bridge windows

2019-10-24 Thread Sergey Miroshnichenko

When a bridge window is temporarily released during the rescan, its old
size is not relevant anymore - it will be recreated from pbus_size_*(), so
its start value should be zero.

If such window can't be reassigned, don't apply reset_resource(), so the
next retry may succeed.

Signed-off-by: Sergey Miroshnichenko 
---
 drivers/pci/setup-bus.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 2c02eb1acf5d..f2f02e6c9000 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -295,7 +295,8 @@ static void assign_requested_resources_sorted(struct list_head *head,
0 /* don't care */,
0 /* don't care */);
}
-   reset_resource(res);
+   if (!pci_can_move_bars)
+   reset_resource(res);
}
}
 }
@@ -1579,8 +1580,8 @@ static void pci_bridge_release_resources(struct pci_bus *bus,
type = old_flags = r->flags & PCI_RES_TYPE_MASK;
pci_info(dev, "resource %d %pR released\n",
 PCI_BRIDGE_RESOURCES + idx, r);
-   /* Keep the old size */
-   r->end = resource_size(r) - 1;
+   /* Don't keep the old size if the bridge will be recalculated */
+   r->end = pci_can_move_bars ? 0 : (resource_size(r) - 1);
r->start = 0;
r->flags = 0;
 
-- 
2.23.0

[PATCH v6 07/30] PCI: hotplug: movable BARs: Don't disable the released bridge windows

2019-10-24 Thread Sergey Miroshnichenko

On a hotplug event with enabled BAR movement, calculating the new bridge
windows takes some time. During this procedure, the structures that
represent these windows are released - marked for recalculation.

When new bridge windows are ready, they are written to the registers of
every bridge via pci_setup_bridges().

Currently, bridge's registers are updated immediately after releasing a
window to disable it. But if a driver doesn't yet support movable BARs, it
doesn't stop MEM transactions during the hotplug, so disabled bridge
windows will break them.

Let the bridge windows remain operational after releasing, as they will be
updated to the new values at the end of a hotplug event.

Signed-off-by: Sergey Miroshnichenko 
---
 drivers/pci/setup-bus.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 075e8185b936..381ce964cb20 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -1588,7 +1588,8 @@ static void pci_bridge_release_resources(struct pci_bus *bus,
/* Avoiding touch the one without PREF */
if (type & IORESOURCE_PREFETCH)
type = IORESOURCE_PREFETCH;
-   __pci_setup_bridge(bus, type);
+   if (!pci_can_move_bars)
+   __pci_setup_bridge(bus, type);
/* For next child res under same bridge */
r->flags = old_flags;
}
-- 
2.23.0

[PATCH v6 04/30] PCI: Define PCI-specific version of the release_child_resources()

2019-10-24 Thread Sergey Miroshnichenko

If the bridge resources are released with the standard
release_child_resources(), it resets the .start field of the children's BARs
to zero but leaves the STARTALIGN flag set, which makes the resource invalid
for reassignment.

Some resources must preserve their offset and size: those marked with
PCI_FIXED, and the immovable ones, which are bound by drivers without
support for the movable BARs feature.

Add the pci_release_child_resources() to replace release_child_resources()
in handling the described PCI-specific cases.
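
The per-BAR handling can be sketched in user space as follows. This is a
simplified model, not the kernel function: the flag values and all sim_* names
are hypothetical, and the recursion into child buses and the parent/sibling
unlinking are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values; the kernel's IORESOURCE_* constants differ. */
#define SIM_STARTALIGN	0x1ul
#define SIM_SIZEALIGN	0x2ul

struct sim_bar {
	unsigned long start;
	unsigned long end;	/* inclusive, as in struct resource */
	unsigned long flags;
	bool movable;		/* models pci_dev_bar_movable() */
};

/*
 * Release one child BAR: switch it from start-alignment to
 * size-alignment, and drop its base address only if it may move.
 */
static void sim_release_bar(struct sim_bar *bar)
{
	unsigned long size;

	assert(bar != NULL);
	size = bar->end - bar->start + 1;

	bar->flags &= ~SIM_STARTALIGN;
	bar->flags |= SIM_SIZEALIGN;

	if (!bar->movable)
		return;		/* immovable: keep base and size */

	bar->start = 0;
	bar->end = size - 1;
}
```

A movable BAR at 0x1000-0x1fff ends up as 0x0-0xfff, so only its size
survives, while an immovable BAR keeps its original range.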

Signed-off-by: Sergey Miroshnichenko 
---
 drivers/pci/setup-bus.c | 54 -
 1 file changed, 53 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index e7dbe21705ba..2c02eb1acf5d 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -1482,6 +1482,54 @@ static void __pci_bridge_assign_resources(const struct pci_dev *bridge,
(IORESOURCE_IO | IORESOURCE_MEM | IORESOURCE_PREFETCH |\
 IORESOURCE_MEM_64)
 
+/*
+ * Similar to generic release_child_resources(), but aware of immovable BARs and
+ * PCI_FIXED and STARTALIGN flags
+ */
+static void pci_release_child_resources(struct pci_bus *bus, struct resource *r)
+{
+   struct pci_dev *dev;
+
+   if (!bus || !r)
+   return;
+
+   if (r->flags & IORESOURCE_PCI_FIXED)
+   return;
+
+   r->child = NULL;
+
+   list_for_each_entry(dev, &bus->devices, bus_list) {
+   int i;
+
+   for (i = 0; i < PCI_NUM_RESOURCES; i++) {
+   struct resource *tmp = &dev->resource[i];
+   resource_size_t size = resource_size(tmp);
+
+   if (!tmp->flags || tmp->parent != r)
+   continue;
+
+   tmp->parent = NULL;
+   tmp->sibling = NULL;
+
+   pci_release_child_resources(dev->subordinate, tmp);
+
+   tmp->flags &= ~IORESOURCE_STARTALIGN;
+   tmp->flags |= IORESOURCE_SIZEALIGN;
+
+   if (!pci_dev_bar_movable(dev, tmp)) {
+   pci_dbg(dev, "release immovable %pR (%s), keep its flags, base and size\n",
+   tmp, tmp->name);
+   continue;
+   }
+
+   pci_dbg(dev, "release %pR (%s)\n", tmp, tmp->name);
+
+   tmp->start = 0;
+   tmp->end = size - 1;
+   }
+   }
+}
+
 static void pci_bridge_release_resources(struct pci_bus *bus,
 unsigned long type)
 {
@@ -1522,7 +1570,11 @@ static void pci_bridge_release_resources(struct pci_bus *bus,
return;
 
/* If there are children, release them all */
-   release_child_resources(r);
+   if (pci_can_move_bars)
+   pci_release_child_resources(bus, r);
+   else
+   release_child_resources(r);
+
if (!release_resource(r)) {
type = old_flags = r->flags & PCI_RES_TYPE_MASK;
pci_info(dev, "resource %d %pR released\n",
-- 
2.23.0

[PATCH v6 00/30] PCI: Allow BAR movement during hotplug

2019-10-24 Thread Sergey Miroshnichenko

Currently PCI hotplug works on top of resources, which are usually reserved
not by the kernel, but by BIOS, bootloader, firmware, etc. These resources
are gaps in the address space where BARs of new devices may fit, and extra
bus numbers per port, so bridges can be hot-added. This series addresses the
former problem: it teaches the kernel how to redistribute them on the fly,
so hotplug becomes predictable and cross-platform. A follow-up patchset will
propose a solution for bus numbers.

If the memory is arranged in a way that doesn't provide enough space for
BARs of a new hotplugged device, the kernel can pause the drivers of the
"obstructing" devices and move their BARs, so the new BARs can fit into the
freed spaces.

To rearrange the BARs and bridge windows, these patches release all of them
after a rescan and re-assign them in the same way as during the initial PCIe
topology scan at system boot.

When a driver is un-paused by the kernel after the PCIe rescan, it should
ioremap() the new addresses of its BARs.

Drivers indicate their support of the feature by implementing the new hooks
.rescan_prepare() and .rescan_done() in the struct pci_driver. If a driver
doesn't yet support the feature, BARs of its devices will be considered as
immovable (by checking the pci_dev_movable_bars_supported(dev)) and handled
in the same way as resources with the IORESOURCE_PCI_FIXED flag.

If a driver doesn't yet support the feature, its devices are guaranteed to
have their BARs remaining untouched.

Tested on:
 - x86_64 with "pci=pcie_bus_peer2peer"
 - POWER8 PowerNV+OPAL+PHB3 ppc64le with "pci=pcie_bus_peer2peer".

This patchset is a part of our work on adding support for hotplugging
bridges full of other bridges, NVME drives, SAS HBAs and GPUs without
special requirements such as Hot-Plug Controller, reservation of bus
numbers or memory regions by firmware, etc.

Changes since v5:
 - Simplified the disable flag, now it is "pci=no_movable_bars";
 - More deliberate marking the BARs as immovable;
 - Mark as immovable BARs which are used by unbound drivers;
 - Ignoring BAR assignment by non-kernel program components, so the kernel
   is able now to distribute BARs in optimal and predictable way;
 - Move here PowerNV-specific patches from the older "powerpc/powernv/pci:
   Make hotplug self-sufficient, independent of FW and DT" series;
 - Fix EEH cache rebuilding and PE allocation for PowerNV during rescan.

Changes since v4:
 - Feature is enabled by default (turned on by one of the latest patches);
 - Add pci_dev_movable_bars_supported(dev) instead of marking the immovable
   BARs with the IORESOURCE_PCI_FIXED flag;
 - Set up PCIe bridges during rescan via sysfs, so MPS settings are now
   configured not only during system boot or pcihp events;
 - Allow movement of switch's BARs if claimed by portdrv;
 - Update EEH address caches after rescan for powerpc;
 - Don't disable completely hot-added devices which can't have BARs being
   fit - just disable their BARs, so they are still visible in lspci etc;
 - Clearer names: fixed_range_hard -> immovable_range, fixed_range_soft ->
   realloc_range;
 - Drop the patch for pci_restore_config_space() - fixed by properly using
   the runtime PM.

Changes since v3:
 - Rebased to the upstream, so the patches apply cleanly again.

Changes since v2:
 - Fixed double-assignment of bridge windows;
 - Fixed assignment of fixed prefetched resources;
 - Fixed releasing of fixed resources;
 - Fixed a debug message;
 - Removed auto-enabling the movable BARs for x86 - let's rely on the
   "pcie_movable_bars=force" option for now;
 - Reordered the patches - bugfixes first.

Changes since v1:
 - Add a "pcie_movable_bars={ off | force }" command line argument;
 - Handle the IORESOURCE_PCI_FIXED flag properly;
 - Don't move BARs of devices which don't support the feature;
 - Guarantee that new hotplugged devices will not steal memory from working
   devices by ignoring the failing new devices with the new PCI_DEV_IGNORE
   flag;
 - Add rescan_prepare()+rescan_done() to the struct pci_driver instead of
   using the reset_prepare()+reset_done() from struct pci_error_handlers;
 - Add a bugfix of a race condition;
 - Fixed hotplug in a non-pre-enabled (by BIOS/firmware) bridge;
 - Fix the compatibility of the feature with pm_runtime and D3-state;
 - Hotplug events from pciehp also can move BARs;
 - Add support of the feature to the NVME driver.

Sergey Miroshnichenko (30):
  PCI: Fix race condition in pci_enable/disable_device()
  PCI: Enable bridge's I/O and MEM access for hotplugged devices
  PCI: hotplug: Add a flag for the movable BARs feature
  PCI: Define PCI-specific version of the release_child_resources()
  PCI: hotplug: movable BARs: Fix reassigning the released bridge
windows
  PCI: hotplug: movable BARs: Recalculate all bridge windows during
rescan
  PCI: hotplug: movable BARs: Don't disable the released bridge windows
  PCI: hotplug: movable BARs: Don't allow added devices to steal
resources

[PATCH v6 03/30] PCI: hotplug: Add a flag for the movable BARs feature

2019-10-24 Thread Sergey Miroshnichenko

When hot-adding a device, the bridge may have windows that are not big
enough (or too fragmented) for the newly requested BARs to fit in. Expanding
these bridge windows may also be impossible because they are blocked by
"neighboring" BARs and bridge windows.

Still, it may be possible to allocate a memory region for new BARs with the
following procedure:

1) notify all the drivers which support movable BARs to pause and release
   the BARs; the rest of the drivers are guaranteed that their devices will
   not get BARs moved;

2) release all the bridge windows and movable BARs;

3) try to recalculate new bridge windows that will fit all the BAR types:
   - fixed;
   - immovable;
   - movable;
   - newly requested by hot-added devices;

4) if the previous step fails, disable BARs for one of the hot-added
   devices and retry from step 3;

5) notify the drivers, so they remap BARs and resume.

If bridge calculation and BAR assignment fail with hot-added devices, the
BARs of these devices will be disabled, falling back to the same number and
size of BARs as before the hotplug event. The kernel succeeded in assigning
them then, so the same algorithm will provide the same results again.

This makes the prior reservation of memory by BIOS/bootloader/firmware not
required anymore for the PCI hotplug.

Drivers indicate their support of movable BARs by implementing the new
.rescan_prepare() and .rescan_done() hooks in the struct pci_driver. All
device's activity must be paused during a rescan, and iounmap()+ioremap()
must be applied to every used BAR.

If a device is not bound to a driver, its BARs are considered movable.
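
The movability rule stated so far can be summarised by a small predicate. The
sketch below is a user-space model with hypothetical sim_* types; it assumes
that a driver counts as supporting the feature only when both hooks are
implemented, and it follows the statement above that unbound devices are
movable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct pci_driver and struct pci_dev. */
struct sim_driver {
	void (*rescan_prepare)(void *dev);
	void (*rescan_done)(void *dev);
};

struct sim_dev {
	const struct sim_driver *driver;	/* NULL if unbound */
	bool fixed;				/* models a PCI_FIXED BAR */
};

/* Global switch; models pci_can_move_bars. */
static bool sim_can_move_bars = true;

/* Trivial hook used to build a "movable-aware" driver in examples. */
static void sim_noop_hook(void *dev)
{
	(void)dev;
}

static bool sim_dev_bars_movable(const struct sim_dev *dev)
{
	assert(dev != NULL);

	if (!sim_can_move_bars || dev->fixed)
		return false;

	if (!dev->driver)
		return true;	/* unbound: nobody has mapped the BARs */

	return dev->driver->rescan_prepare && dev->driver->rescan_done;
}
```

A driver that leaves either hook unset is treated as not supporting the
feature, so its devices keep their BARs where they are.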

For a higher probability of the successful BAR reassignment, all the BARs
and bridge windows should be released before the rescan, not only those
with higher addresses.

One example of when this is needed: BAR(I) is moved to free a gap for the
new BAR(II):

  Before:
  parent bridge window ===
    hotplug bridge window 
|   BAR(I)|   fixed BAR   |   fixed BAR   | fixed BAR |
   ^
   |
   new BAR(II)

  After:
  parent bridge window =
 --- hotplug bridge window ---
| new BAR(II) |   fixed BAR   |   fixed BAR   | fixed BAR | BAR(I)  |

Another example is a fragmented bridge window jammed between fixed BARs:

  Before:
 = parent bridge window 
 -- hotplug bridge window --
| fixed BAR |   | BAR(I) || BAR(II) || BAR(III) | fixed BAR |
   ^
   |
   new BAR(IV)

 After:
  parent bridge window =
 -- hotplug bridge window --
| fixed BAR | BAR(I) | BAR(II) | BAR(III) | new BAR(IV) | fixed BAR |

This patch is a preparation for future patches with actual implementation,
and for now it just does the following:
 - declares the feature;
 - defines the bool pci_can_move_bars and bool pci_dev_bar_movable(dev);
 - invokes the .rescan_prepare() and .rescan_done() driver notifiers;
 - disables the feature for the powerpc/pseries.

The feature is disabled by default until the final patch of the series.
It can be overridden per-arch using the pci_can_move_bars=false flag or by
the following command line option:

pci=no_movable_bars

CC: Sam Bobroff 
CC: Rajat Jain 
CC: Lukas Wunner 
CC: Oliver O'Halloran 
CC: David Laight 
Signed-off-by: Sergey Miroshnichenko 
---
 .../admin-guide/kernel-parameters.txt |  1 +
 arch/powerpc/platforms/pseries/setup.c|  2 +
 drivers/pci/pci.c |  4 +
 drivers/pci/pci.h |  2 +
 drivers/pci/probe.c   | 85 ++-
 include/linux/pci.h   |  4 +
 6 files changed, 96 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index a84a83f8881e..c6243aaed0c9 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3528,6 +3528,7 @@
may put more devices in an IOMMU group.
force_floating  [S390] Force usage of floating interrupts.
nomio   [S390] Do not use MIO instructions.
+   no_movable_bars Don't allow BARs to be moved during hotplug
 
pcie_aspm=  [PCIE] Forcibly enable or disable PCIe Active State Power
Management.
diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index 0a40201f315f..7cd12c5a2deb 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -920,6 +920,8 @@ static void __init pseries_init(void)
 {
pr_debug(" ->

[PATCH v6 02/30] PCI: Enable bridge's I/O and MEM access for hotplugged devices

2019-10-24 Thread Sergey Miroshnichenko

The PCI_COMMAND_IO and PCI_COMMAND_MEMORY bits of the bridge must be
updated not only when enabling the bridge for the first time, but also if a
hotplugged device requests these types of resources.

Originally these bits were set only by pci_enable_device_flags(), which
exits early if the bridge is already pci_is_enabled(). So if the bridge was
empty initially (an edge case), hotplugged devices fail to perform I/O and
MEM accesses.
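
The edge case can be reproduced with a small user-space model. Everything
below is a sketch: the flag values and sim_* names are hypothetical, and the
refresh_if_enabled parameter only exists to contrast the old early-exit
behaviour with the behaviour this patch aims for.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical values; the kernel's constants differ. */
#define SIM_RES_IO	0x1u	/* window carries I/O resources */
#define SIM_RES_MEM	0x2u	/* window carries memory resources */
#define SIM_CMD_IO	0x1u	/* models PCI_COMMAND_IO */
#define SIM_CMD_MEM	0x2u	/* models PCI_COMMAND_MEMORY */

struct sim_bridge {
	bool enabled;
	unsigned int command;
};

/* Derive the command bits required by the bridge windows. */
static unsigned int sim_cmd_for_windows(const unsigned int *win_flags,
					size_t n)
{
	unsigned int cmd = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (win_flags[i] & SIM_RES_IO)
			cmd |= SIM_CMD_IO;
		if (win_flags[i] & SIM_RES_MEM)
			cmd |= SIM_CMD_MEM;
	}

	return cmd;
}

/*
 * refresh_if_enabled == false models the early exit for an
 * already-enabled bridge; true models updating the bits anyway.
 */
static void sim_enable_bridge(struct sim_bridge *b,
			      const unsigned int *win_flags, size_t n,
			      bool refresh_if_enabled)
{
	assert(b != NULL);

	if (b->enabled && !refresh_if_enabled)
		return;

	b->command |= sim_cmd_for_windows(win_flags, n);
	b->enabled = true;
}
```

Enabling a bridge while its windows are empty leaves the command bits clear.
With the early exit, a later hotplug that adds a memory window never sets the
memory bit.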

Signed-off-by: Sergey Miroshnichenko 
---
 drivers/pci/pci.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 44d0d12c80cf..e85dc63c73fd 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1650,6 +1650,14 @@ static void pci_enable_bridge(struct pci_dev *dev)
pci_enable_bridge(bridge);
 
if (pci_is_enabled(dev)) {
+   int i, bars = 0;
+
+   for (i = PCI_BRIDGE_RESOURCES; i < DEVICE_COUNT_RESOURCE; i++) {
+   if (dev->resource[i].flags & (IORESOURCE_MEM | IORESOURCE_IO))
+   bars |= (1 << i);
+   }
+   do_pci_enable_device(dev, bars);
+
if (!dev->is_busmaster)
pci_set_master(dev);
mutex_unlock(&dev->enable_mutex);
-- 
2.23.0

[PATCH v6 01/30] PCI: Fix race condition in pci_enable/disable_device()

2019-10-24 Thread Sergey Miroshnichenko

This is yet another approach to fix an old [1-2] concurrency issue, which occurs when:
 - two or more devices are being hot-added into a bridge which was
   initially empty;
 - a bridge with two or more devices is being hot-added;
 - during boot, if BIOS/bootloader/firmware doesn't pre-enable bridges.

The problem is that a bridge is reported as enabled before the MEM/IO bits
are actually written to the PCI_COMMAND register, so another driver thread
starts memory requests through the not-yet-enabled bridge:

 CPU0                                   CPU1

 pci_enable_device_mem()                pci_enable_device_mem()
   pci_enable_bridge()                    pci_enable_bridge()
     pci_is_enabled()
       return false;
     atomic_inc_return(enable_cnt)
     Start actual enabling the bridge
     ...                                    pci_is_enabled()
     ...                                      return true;
     ...                                  Start memory requests <-- FAIL
     ...
     Set the PCI_COMMAND_MEMORY bit <-- Must wait for this

Protect pci_enable/disable_device() and pci_enable_bridge() with locking,
similar to the previous solution from commit 40f11adc7cd9 ("PCI: Avoid race
while enabling upstream bridges"), but using per-device mutexes and
preventing dev->enable_cnt from being incremented early.
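
The ordering problem can be shown with a deterministic, single-threaded model.
This is only a sketch with hypothetical sim_* names: it splits the enable
operation into a "begin" and a "finish" step and lets an observer look at the
state in between. It does not model the mutex itself, which in the real code
additionally makes the second CPU wait.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sim_bridge {
	int enable_cnt;		/* models dev->enable_cnt */
	bool mem_enabled;	/* models PCI_COMMAND_MEMORY being set */
};

/* Models pci_is_enabled(): relies on the counter only. */
static bool sim_is_enabled(const struct sim_bridge *b)
{
	assert(b != NULL);
	return b->enable_cnt > 0;
}

/* Old ordering: the counter is bumped before the hardware is touched. */
static void sim_old_enable_begin(struct sim_bridge *b)
{
	b->enable_cnt++;
}

static void sim_old_enable_finish(struct sim_bridge *b)
{
	b->mem_enabled = true;
}

/* New ordering: the counter is bumped only after the hardware write. */
static void sim_new_enable_begin(struct sim_bridge *b)
{
	(void)b;	/* nothing visible to other CPUs yet */
}

static void sim_new_enable_finish(struct sim_bridge *b)
{
	b->mem_enabled = true;
	b->enable_cnt++;
}

/*
 * What a second CPU would conclude: true means it sees the bridge as
 * enabled and would issue memory requests that the hardware rejects.
 */
static bool sim_observer_hits_race(const struct sim_bridge *b)
{
	return sim_is_enabled(b) && !b->mem_enabled;
}
```

Calling the observer between "begin" and "finish" reports the race only for
the old ordering. With the new ordering the bridge is never reported as
enabled before the memory bit is set.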

CC: Srinath Mannam 
CC: Marta Rybczynska 
Signed-off-by: Sergey Miroshnichenko 

[1] 
https://lore.kernel.org/linux-pci/1501858648-8-1-git-send-email-srinath.man...@broadcom.com/T/#u
[RFC PATCH v3] pci: Concurrency issue during pci enable bridge

[2] 
https://lore.kernel.org/linux-pci/744877924.5841545.1521630049567.javamail.zim...@kalray.eu/T/#u
[RFC PATCH] nvme: avoid race-conditions when enabling devices
---
 drivers/pci/pci.c   | 26 ++
 drivers/pci/probe.c |  1 +
 include/linux/pci.h |  1 +
 3 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index a97e2571a527..44d0d12c80cf 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1643,6 +1643,8 @@ static void pci_enable_bridge(struct pci_dev *dev)
struct pci_dev *bridge;
int retval;
 
+   mutex_lock(&dev->enable_mutex);
+
bridge = pci_upstream_bridge(dev);
if (bridge)
pci_enable_bridge(bridge);
@@ -1650,6 +1652,7 @@ static void pci_enable_bridge(struct pci_dev *dev)
if (pci_is_enabled(dev)) {
if (!dev->is_busmaster)
pci_set_master(dev);
+   mutex_unlock(&dev->enable_mutex);
return;
}
 
@@ -1658,11 +1661,14 @@ static void pci_enable_bridge(struct pci_dev *dev)
pci_err(dev, "Error enabling bridge (%d), continuing\n",
retval);
pci_set_master(dev);
+   mutex_unlock(&dev->enable_mutex);
 }
 
 static int pci_enable_device_flags(struct pci_dev *dev, unsigned long flags)
 {
struct pci_dev *bridge;
+   /* Enable-locking of bridges is performed within the pci_enable_bridge() */
+   bool need_lock = !dev->subordinate;
int err;
int i, bars = 0;
 
@@ -1678,8 +1684,13 @@ static int pci_enable_device_flags(struct pci_dev *dev, unsigned long flags)
dev->current_state = (pmcsr & PCI_PM_CTRL_STATE_MASK);
}
 
-   if (atomic_inc_return(&dev->enable_cnt) > 1)
+   if (need_lock)
+   mutex_lock(&dev->enable_mutex);
+   if (pci_is_enabled(dev)) {
+   if (need_lock)
+   mutex_unlock(&dev->enable_mutex);
return 0;   /* already enabled */
+   }
 
bridge = pci_upstream_bridge(dev);
if (bridge)
@@ -1694,8 +1705,10 @@ static int pci_enable_device_flags(struct pci_dev *dev, unsigned long flags)
bars |= (1 << i);
 
err = do_pci_enable_device(dev, bars);
-   if (err < 0)
-   atomic_dec(&dev->enable_cnt);
+   if (err >= 0)
+   atomic_inc(&dev->enable_cnt);
+   if (need_lock)
+   mutex_unlock(&dev->enable_mutex);
return err;
 }
 
@@ -1939,15 +1952,20 @@ void pci_disable_device(struct pci_dev *dev)
if (dr)
dr->enabled = 0;
 
+   mutex_lock(&dev->enable_mutex);
dev_WARN_ONCE(&dev->dev, atomic_read(&dev->enable_cnt) <= 0,
  "disabling already-disabled device");
 
-   if (atomic_dec_return(&dev->enable_cnt) != 0)
+   if (atomic_dec_return(&dev->enable_cnt) != 0) {
+   mutex_unlock(&dev->enable_mutex);
return;
+   }
 
do_pci_disable_device(dev);
 
dev->is_busmaster = 0;
+
+   mutex_unlock(&dev->enable_mutex);
 }
 EXPORT_SYMBOL(pci_disable_device);
 
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 3d5271a7a849..d4f21e413638 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2158,6 +2158,7 @@ struct pci_dev *pci_alloc_dev(struct pci_bus *bus)
INIT_LIST_HEAD(&dev->bus_list);

Re: [PATCH 0/2] Enabling MSI for Microblaze

2019-10-24 Thread Waiman Long

On 10/24/19 6:13 AM, Michal Simek wrote:
> Hi,
>
> these two patches come from discussion with Christoph, Bjorn, Palmer and
> Waiman. The first patch was suggestion by Christoph here
> https://lore.kernel.org/linux-riscv/20191008154604.ga7...@infradead.org/
> The second part was discussed
> https://lore.kernel.org/linux-pci/mhng-5d9bcb53-225e-441f-86cc-b335624b3e7c@palmer-si-x1e/
> and
> https://lore.kernel.org/linux-pci/20191017181937.7004-1-pal...@sifive.com/
>
> Thanks,
> Michal
>
>
> Michal Simek (1):
>   asm-generic: Make msi.h a mandatory include/asm header
>
> Palmer Dabbelt (1):
>   pci: Default to PCI_MSI_IRQ_DOMAIN
>
>  arch/arc/include/asm/Kbuild | 1 -
>  arch/arm/include/asm/Kbuild | 1 -
>  arch/arm64/include/asm/Kbuild   | 1 -
>  arch/mips/include/asm/Kbuild| 1 -
>  arch/powerpc/include/asm/Kbuild | 1 -
>  arch/riscv/include/asm/Kbuild   | 1 -
>  arch/sparc/include/asm/Kbuild   | 1 -
>  drivers/pci/Kconfig | 2 +-
>  include/asm-generic/Kbuild  | 1 +
>  9 files changed, 2 insertions(+), 8 deletions(-)
>
That looks OK.

Acked-by: Waiman Long

Re: [PATCH V7] mm/debug: Add tests validating architecture page table helpers

2019-10-24 Thread Qian Cai




> On Oct 24, 2019, at 10:50 AM, Anshuman Khandual wrote:
> 
> Changes in V7:
> 
> - Memory allocation and free routines for mapped pages have been dropped
> - Mapped pfns are derived from standard kernel text symbol per Matthew
> - Moved debug_vm_pgtable() after page_alloc_init_late() per Michal and Qian
> - Updated the commit message per Michal
> - Updated W=1 GCC warning problem on x86 per Qian Cai

It would be interesting to know if you actually tested it to see if the
warning went away. As far as I can tell, GCC is quite stubborn there, so I am
not going to insist.

[PATCH 03/34 v3] powerpc: Use CONFIG_PREEMPTION

2019-10-24 Thread Sebastian Andrzej Siewior

From: Thomas Gleixner 

CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by CONFIG_PREEMPT_RT.
Both PREEMPT and PREEMPT_RT require the same functionality which today
depends on CONFIG_PREEMPT.

Switch the entry code over to use CONFIG_PREEMPTION.

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Thomas Gleixner 
[bigeasy: +Kconfig]
Signed-off-by: Sebastian Andrzej Siewior 
---
v2…v3: Don't mention die.c changes in the description.
v1…v2: Remove the changes to die.c.

 arch/powerpc/Kconfig   | 2 +-
 arch/powerpc/kernel/entry_32.S | 4 ++--
 arch/powerpc/kernel/entry_64.S | 4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 3e56c9c2f16ee..8ead8d6e1cbc8 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -106,7 +106,7 @@ config LOCKDEP_SUPPORT
 config GENERIC_LOCKBREAK
bool
default y
-   depends on SMP && PREEMPT
+   depends on SMP && PREEMPTION
 
 config GENERIC_HWEIGHT
bool
diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index d60908ea37fb9..e1a4c39b83b86 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -897,7 +897,7 @@ user_exc_return: /* r10 contains MSR_KERNEL here */
bne-0b
 1:
 
-#ifdef CONFIG_PREEMPT
+#ifdef CONFIG_PREEMPTION
/* check current_thread_info->preempt_count */
lwz r0,TI_PREEMPT(r2)
cmpwi   0,r0,0  /* if non-zero, just restore regs and return */
@@ -921,7 +921,7 @@ user_exc_return: /* r10 contains MSR_KERNEL here */
 */
bl  trace_hardirqs_on
 #endif
-#endif /* CONFIG_PREEMPT */
+#endif /* CONFIG_PREEMPTION */
 restore_kuap:
kuap_restore r1, r2, r9, r10, r0
 
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 6467bdab8d405..83733376533e8 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -840,7 +840,7 @@ _GLOBAL(ret_from_except_lite)
bne-0b
 1:
 
-#ifdef CONFIG_PREEMPT
+#ifdef CONFIG_PREEMPTION
/* Check if we need to preempt */
andi.   r0,r4,_TIF_NEED_RESCHED
beq+restore
@@ -871,7 +871,7 @@ _GLOBAL(ret_from_except_lite)
li  r10,MSR_RI
mtmsrd  r10,1 /* Update machine state */
 #endif /* CONFIG_PPC_BOOK3E */
-#endif /* CONFIG_PREEMPT */
+#endif /* CONFIG_PREEMPTION */
 
.globl  fast_exc_return_irq
 fast_exc_return_irq:
-- 
2.23.0

Re: [PATCH 0/2] vfio pci: Add support for OpenCAPI devices

2019-10-24 Thread Greg Kurz

Hi Christophe,

Sorry, I didn't have time to look at your other series yet and
likely the same for this one with the upcoming KVM Forum... :-\
Anyway, for any VFIO related patch, don't forget to Cc the
maintainer, Alex Williamson  .

Cheers,

--
Greg

On Thu, 24 Oct 2019 15:28:03 +0200
christophe lombard  wrote:

> This series adds support for the OpenCAPI devices for vfio pci.
> 
> It builds on top of the existing ocxl driver +
> http://patchwork.ozlabs.org/patch/1177999/
> 
> VFIO is a Linux kernel driver framework used by QEMU to make devices
> directly assignable to virtual machines.
> 
> All OpenCAPI devices on the same PCI slot will all be grouped and
> assigned to the same guest.
> 
> - Assume these are the devices you want to assign
>  0007:00:00.0 Processing accelerators: IBM Device 062b
>  0007:00:00.1 Processing accelerators: IBM Device 062b
> 
> - Two Devices in the group
> $ ls /sys/bus/pci/devices/0007\:00\:00.0/iommu_group/devices/
>  0007:00:00.0  0007:00:00.1
> 
> - Find vendor & device ID
> $ lspci -n -s 0007:00:00
>  0007:00:00.0 1200: 1014:062b
>  0007:00:00.1 1200: 1014:062b
> 
> - Unbind from the current ocxl device driver if already loaded
> $ rmmod ocxl
> 
> - Load vfio-pci if it's not already done.
> $ modprobe vfio-pci
> 
> - Bind to vfio-pci
> $ echo 1014 062b > /sys/bus/pci/drivers/vfio-pci/new_id
> 
>   This will result in a new device node "/dev/vfio/7", which will be
>   used by QEMU to set up the devices for passthrough.
> 
> - Pass to qemu using -device vfio-pci
>   -device vfio-pci,multifunction=on,host=0007:00:00.0,addr=2.0 -device
>   vfio-pci,multifunction=on,host=0007:00:00.1,addr=2.1
> 
> It has been tested in a bare-metal and QEMU environment using the memcpy
> and the AFP AFUs.
> 
> christophe lombard (2):
>   powerpc/powernv: Register IOMMU group for OpenCAPI devices
>   vfio/pci: Introduce OpenCAPI devices support.
> 
>  arch/powerpc/platforms/powernv/ocxl.c | 164 ++---
>  arch/powerpc/platforms/powernv/pci-ioda.c |  19 +-
>  arch/powerpc/platforms/powernv/pci.h  |  13 +
>  drivers/vfio/pci/Kconfig  |   7 +
>  drivers/vfio/pci/Makefile |   1 +
>  drivers/vfio/pci/vfio_pci.c   |  19 ++
>  drivers/vfio/pci/vfio_pci_ocxl.c  | 287 ++
>  drivers/vfio/vfio.c   |  25 ++
>  include/linux/vfio.h  |  13 +
>  include/uapi/linux/vfio.h |  22 ++
>  10 files changed, 530 insertions(+), 40 deletions(-)
>  create mode 100644 drivers/vfio/pci/vfio_pci_ocxl.c
>

Re: [PATCH 1/2] asm-generic: Make msi.h a mandatory include/asm header

2019-10-24 Thread Masahiro Yamada

On Thu, Oct 24, 2019 at 7:13 PM Michal Simek  wrote:
>
> msi.h is generic for all architectures expect of x86 which has own version.

Maybe a typo?  "except"


Anyway, the code looks good to me.

Reviewed-by: Masahiro Yamada 


> Enabling MSI by including msi.h to architecture Kbuild is just additional
> step which doesn't need to be done.
> The patch was created based on request to enable MSI for Microblaze.
>
> Suggested-by: Christoph Hellwig 
> Signed-off-by: Michal Simek 
> ---
>
> https://lore.kernel.org/linux-riscv/20191008154604.ga7...@infradead.org/
> ---
>  arch/arc/include/asm/Kbuild | 1 -
>  arch/arm/include/asm/Kbuild | 1 -
>  arch/arm64/include/asm/Kbuild   | 1 -
>  arch/mips/include/asm/Kbuild| 1 -
>  arch/powerpc/include/asm/Kbuild | 1 -
>  arch/riscv/include/asm/Kbuild   | 1 -
>  arch/sparc/include/asm/Kbuild   | 1 -
>  include/asm-generic/Kbuild  | 1 +
>  8 files changed, 1 insertion(+), 7 deletions(-)
>
> diff --git a/arch/arc/include/asm/Kbuild b/arch/arc/include/asm/Kbuild
> index 393d4f5e1450..1b505694691e 100644
> --- a/arch/arc/include/asm/Kbuild
> +++ b/arch/arc/include/asm/Kbuild
> @@ -17,7 +17,6 @@ generic-y += local64.h
>  generic-y += mcs_spinlock.h
>  generic-y += mm-arch-hooks.h
>  generic-y += mmiowb.h
> -generic-y += msi.h
>  generic-y += parport.h
>  generic-y += percpu.h
>  generic-y += preempt.h
> diff --git a/arch/arm/include/asm/Kbuild b/arch/arm/include/asm/Kbuild
> index 68ca86f85eb7..fa579b23b4df 100644
> --- a/arch/arm/include/asm/Kbuild
> +++ b/arch/arm/include/asm/Kbuild
> @@ -12,7 +12,6 @@ generic-y += local.h
>  generic-y += local64.h
>  generic-y += mm-arch-hooks.h
>  generic-y += mmiowb.h
> -generic-y += msi.h
>  generic-y += parport.h
>  generic-y += preempt.h
>  generic-y += seccomp.h
> diff --git a/arch/arm64/include/asm/Kbuild b/arch/arm64/include/asm/Kbuild
> index 98a5405c8558..bd23f87d6c55 100644
> --- a/arch/arm64/include/asm/Kbuild
> +++ b/arch/arm64/include/asm/Kbuild
> @@ -16,7 +16,6 @@ generic-y += local64.h
>  generic-y += mcs_spinlock.h
>  generic-y += mm-arch-hooks.h
>  generic-y += mmiowb.h
> -generic-y += msi.h
>  generic-y += qrwlock.h
>  generic-y += qspinlock.h
>  generic-y += serial.h
> diff --git a/arch/mips/include/asm/Kbuild b/arch/mips/include/asm/Kbuild
> index c8b595c60910..61b0fc2026e6 100644
> --- a/arch/mips/include/asm/Kbuild
> +++ b/arch/mips/include/asm/Kbuild
> @@ -13,7 +13,6 @@ generic-y += irq_work.h
>  generic-y += local64.h
>  generic-y += mcs_spinlock.h
>  generic-y += mm-arch-hooks.h
> -generic-y += msi.h
>  generic-y += parport.h
>  generic-y += percpu.h
>  generic-y += preempt.h
> diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild
> index 64870c7be4a3..17726f2e46de 100644
> --- a/arch/powerpc/include/asm/Kbuild
> +++ b/arch/powerpc/include/asm/Kbuild
> @@ -10,4 +10,3 @@ generic-y += local64.h
>  generic-y += mcs_spinlock.h
>  generic-y += preempt.h
>  generic-y += vtime.h
> -generic-y += msi.h
> diff --git a/arch/riscv/include/asm/Kbuild b/arch/riscv/include/asm/Kbuild
> index 16970f246860..1efaeddf1e4b 100644
> --- a/arch/riscv/include/asm/Kbuild
> +++ b/arch/riscv/include/asm/Kbuild
> @@ -22,7 +22,6 @@ generic-y += kvm_para.h
>  generic-y += local.h
>  generic-y += local64.h
>  generic-y += mm-arch-hooks.h
> -generic-y += msi.h
>  generic-y += percpu.h
>  generic-y += preempt.h
>  generic-y += sections.h
> diff --git a/arch/sparc/include/asm/Kbuild b/arch/sparc/include/asm/Kbuild
> index b6212164847b..62de2eb2773d 100644
> --- a/arch/sparc/include/asm/Kbuild
> +++ b/arch/sparc/include/asm/Kbuild
> @@ -18,7 +18,6 @@ generic-y += mcs_spinlock.h
>  generic-y += mm-arch-hooks.h
>  generic-y += mmiowb.h
>  generic-y += module.h
> -generic-y += msi.h
>  generic-y += preempt.h
>  generic-y += serial.h
>  generic-y += trace_clock.h
> diff --git a/include/asm-generic/Kbuild b/include/asm-generic/Kbuild
> index adff14fcb8e4..ddfee1bd9dc1 100644
> --- a/include/asm-generic/Kbuild
> +++ b/include/asm-generic/Kbuild
> @@ -4,4 +4,5 @@
>  # (This file is not included when SRCARCH=um since UML borrows several
>  # asm headers from the host architecutre.)
>
> +mandatory-y += msi.h
>  mandatory-y += simd.h
> --
> 2.17.1
>


-- 
Best Regards
Masahiro Yamada

[PATCH 03/34 v2] powerpc: Use CONFIG_PREEMPTION

2019-10-24 Thread Sebastian Andrzej Siewior

From: Thomas Gleixner 

CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by CONFIG_PREEMPT_RT.
Both PREEMPT and PREEMPT_RT require the same functionality which today
depends on CONFIG_PREEMPT.

Switch the entry code over to use CONFIG_PREEMPTION. Add PREEMPT_RT
output in __die().

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Thomas Gleixner 
[bigeasy: +Kconfig]
Signed-off-by: Sebastian Andrzej Siewior 
---
v1…v2: Remove the changes to die.c

 arch/powerpc/Kconfig   | 2 +-
 arch/powerpc/kernel/entry_32.S | 4 ++--
 arch/powerpc/kernel/entry_64.S | 4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 3e56c9c2f16ee..8ead8d6e1cbc8 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -106,7 +106,7 @@ config LOCKDEP_SUPPORT
 config GENERIC_LOCKBREAK
bool
default y
-   depends on SMP && PREEMPT
+   depends on SMP && PREEMPTION
 
 config GENERIC_HWEIGHT
bool
diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index d60908ea37fb9..e1a4c39b83b86 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -897,7 +897,7 @@ user_exc_return:    /* r10 contains MSR_KERNEL here */
bne-    0b
 1:
 
-#ifdef CONFIG_PREEMPT
+#ifdef CONFIG_PREEMPTION
/* check current_thread_info->preempt_count */
lwz r0,TI_PREEMPT(r2)
cmpwi   0,r0,0  /* if non-zero, just restore regs and return */
@@ -921,7 +921,7 @@ user_exc_return:    /* r10 contains MSR_KERNEL here */
 */
bl  trace_hardirqs_on
 #endif
-#endif /* CONFIG_PREEMPT */
+#endif /* CONFIG_PREEMPTION */
 restore_kuap:
kuap_restore r1, r2, r9, r10, r0
 
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 6467bdab8d405..83733376533e8 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -840,7 +840,7 @@ _GLOBAL(ret_from_except_lite)
bne-    0b
 1:
 
-#ifdef CONFIG_PREEMPT
+#ifdef CONFIG_PREEMPTION
/* Check if we need to preempt */
andi.   r0,r4,_TIF_NEED_RESCHED
beq+    restore
@@ -871,7 +871,7 @@ _GLOBAL(ret_from_except_lite)
li  r10,MSR_RI
mtmsrd  r10,1 /* Update machine state */
 #endif /* CONFIG_PPC_BOOK3E */
-#endif /* CONFIG_PREEMPT */
+#endif /* CONFIG_PREEMPTION */
 
.globl  fast_exc_return_irq
 fast_exc_return_irq:
-- 
2.23.0

[PATCH 2/2] vfio/pci: Introduce OpenCAPI devices support.

2019-10-24 Thread christophe lombard

This patch adds new IOCTL commands for VFIO PCI driver to support
configuration and management for OpenCAPI devices, which have been passed
through from host to QEMU VFIO.
OpenCAPI (Open Coherent Accelerator Processor Interface) is an interface
between processors and accelerators.

The main IOCTL command is:
 VFIO_DEVICE_OCXL_OP    Handles devices that support the OpenCAPI
 interface, using the ocxl pnv_* interface.

The following commands are supported, based on the hcalls defined
in ocxl/pseries.c that implements the guest-specific callbacks.
VFIO_DEVICE_OCXL_CONFIG_ADAPTER   Used to configure OpenCAPI adapter
  characteristics.

VFIO_DEVICE_OCXL_CONFIG_SPA   Used to configure the Scheduled Process
  Area (SPA) table for an OpenCAPI device.

VFIO_DEVICE_OCXL_GET_FAULT_STATE  Used to retrieve fault information
  from an OpenCAPI device.

VFIO_DEVICE_OCXL_HANDLE_FAULT Used to respond to an OpenCAPI fault.

The platform data is declared in the vfio_pci_ocxl_link, which is common
to all devices sharing the same domain, same bus and same slot.

The lpid value, required to configure the process element in the
Scheduled Process Area, is not available in the QEMU environment.
It therefore has to be obtained from the host through the iommu group.

Signed-off-by: Christophe Lombard 
---
 drivers/vfio/pci/Kconfig |   7 +
 drivers/vfio/pci/Makefile|   1 +
 drivers/vfio/pci/vfio_pci.c  |  19 ++
 drivers/vfio/pci/vfio_pci_ocxl.c | 287 +++
 drivers/vfio/vfio.c  |  25 +++
 include/linux/vfio.h |  13 ++
 include/uapi/linux/vfio.h|  22 +++
 7 files changed, 374 insertions(+)
 create mode 100644 drivers/vfio/pci/vfio_pci_ocxl.c

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index ac3c1dd3edef..fd3716d10ded 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -45,3 +45,10 @@ config VFIO_PCI_NVLINK2
depends on VFIO_PCI && PPC_POWERNV
help
  VFIO PCI support for P9 Witherspoon machine with NVIDIA V100 GPUs
+
+config VFIO_PCI_OCXL
+   depends on VFIO_PCI
+   def_bool y if OCXL_BASE
+   help
+ VFIO PCI support for devices which handle the Open Coherent
+ Accelerator Processor Interface.
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index f027f8a0e89c..6d55a5fee4b0 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -3,5 +3,6 @@
 vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
 vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
 vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
+vfio-pci-$(CONFIG_VFIO_PCI_OCXL) += vfio_pci_ocxl.o
 
 obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 703948c9fbe1..4f9741bbe790 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1128,6 +1128,25 @@ static long vfio_pci_ioctl(void *device_data,
 
return vfio_pci_ioeventfd(vdev, ioeventfd.offset,
  ioeventfd.data, count, ioeventfd.fd);
+   } else if (cmd == VFIO_DEVICE_OCXL_OP) {
+   struct vfio_device_ocxl_op ocxl_op;
+   int ret = 0;
+
+   minsz = offsetofend(struct vfio_device_ocxl_op, data);
+
+   if (copy_from_user(&ocxl_op, (void __user *)arg, minsz))
+   return -EFAULT;
+
+   if (ocxl_op.argsz < minsz)
+   return -EINVAL;
+
+   ret = vfio_pci_ocxl_ioctl(vdev->pdev, &ocxl_op);
+
+   if (!ret) {
+   if (copy_to_user((void __user *)arg, &ocxl_op, minsz))
+   ret = -EFAULT;
+   }
+   return ret;
}
 
return -ENOTTY;
diff --git a/drivers/vfio/pci/vfio_pci_ocxl.c b/drivers/vfio/pci/vfio_pci_ocxl.c
new file mode 100644
index ..cb5cd4fb416d
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci_ocxl.c
@@ -0,0 +1,287 @@
+// SPDX-License-Identifier: GPL-2.0+
+// Copyright 2019 IBM Corp.
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct vfio_device_ocxl_link {
+   struct list_head list;
+   int domain;
+   int bus;
+   int slot;
+   void *platform_data;
+};
+static struct list_head links_list = LIST_HEAD_INIT(links_list);
+static DEFINE_MUTEX(links_list_lock);
+
+#define VFIO_DEVICE_OCXL_CONFIG_ADAPTER              1
+#define   VFIO_DEVICE_OCXL_CONFIG_ADAPTER_SETUP      1
+#define   VFIO_DEVICE_OCXL_CONFIG_ADAPTER_RELEASE    2
+#define   VFIO_DEVICE_OCXL_CONFIG_ADAPTER_GET_ACTAG  3
+#define   VFIO_DEVICE_OCXL_CONFIG_ADAPTER_GET_PASID  4
+#define   VFIO_DEVICE_OCXL_CONFIG_ADAPTER_SET_TL     5
+#define   VFIO_DEVICE_OCXL_CONFIG_ADAPTER_ALLOC_IRQ  6
+#define

[PATCH 0/2] vfio pci: Add support for OpenCAPI devices

2019-10-24 Thread christophe lombard

This series adds support for the OpenCAPI devices for vfio pci.

It builds on top of the existing ocxl driver +
http://patchwork.ozlabs.org/patch/1177999/

VFIO is a Linux kernel driver framework used by QEMU to make devices
directly assignable to virtual machines.

All OpenCAPI devices on the same PCI slot will be grouped and
assigned to the same guest.

- Assume these are the devices you want to assign
 0007:00:00.0 Processing accelerators: IBM Device 062b
 0007:00:00.1 Processing accelerators: IBM Device 062b

- Two Devices in the group
$ ls /sys/bus/pci/devices/0007\:00\:00.0/iommu_group/devices/
 0007:00:00.0  0007:00:00.1

- Find vendor & device ID
$ lspci -n -s 0007:00:00
 0007:00:00.0 1200: 1014:062b
 0007:00:00.1 1200: 1014:062b

- Unbind from the current ocxl device driver if already loaded
$ rmmod ocxl

- Load vfio-pci if it's not already done.
$ modprobe vfio-pci

- Bind to vfio-pci
$ echo 1014 062b > /sys/bus/pci/drivers/vfio-pci/new_id

  This will result in a new device node "/dev/vfio/7", which will be
  used by QEMU to set up the devices for passthrough.

- Pass to qemu using -device vfio-pci
  -device vfio-pci,multifunction=on,host=0007:00:00.0,addr=2.0 -device
  vfio-pci,multifunction=on,host=0007:00:00.1,addr=2.1

It has been tested in a bare-metal and QEMU environment using the memcpy
and the AFP AFUs.

christophe lombard (2):
  powerpc/powernv: Register IOMMU group for OpenCAPI devices
  vfio/pci: Introduce OpenCAPI devices support.

 arch/powerpc/platforms/powernv/ocxl.c | 164 ++---
 arch/powerpc/platforms/powernv/pci-ioda.c |  19 +-
 arch/powerpc/platforms/powernv/pci.h  |  13 +
 drivers/vfio/pci/Kconfig  |   7 +
 drivers/vfio/pci/Makefile |   1 +
 drivers/vfio/pci/vfio_pci.c   |  19 ++
 drivers/vfio/pci/vfio_pci_ocxl.c  | 287 ++
 drivers/vfio/vfio.c   |  25 ++
 include/linux/vfio.h  |  13 +
 include/uapi/linux/vfio.h |  22 ++
 10 files changed, 530 insertions(+), 40 deletions(-)
 create mode 100644 drivers/vfio/pci/vfio_pci_ocxl.c

-- 
2.21.0

[PATCH 1/2] powerpc/powernv: Register IOMMU group for OpenCAPI devices

2019-10-24 Thread christophe lombard

This patch adds group registration for the OpenCAPI devices.
A unique iommu group is registered for multiple PEs, i.e. for a set of
multiple devices sharing the same domain, same bus and same slot.

This group registration will be used to assign an OpenCAPI device to a
guest to participate in VFIO, like vfio-pci.

The release_ownership hook is used to disable the Scheduled Process Area
and clean allocated data if it's not done previously when the ocxl driver
is unloaded.

To support multiple OpenCAPI devices on the same machine, the iommu group
and platform data are declared in the npu_link, which is common to all
devices sharing the same domain, same bus and same slot.

Signed-off-by: Christophe Lombard 
---
 arch/powerpc/platforms/powernv/ocxl.c | 164 +-
 arch/powerpc/platforms/powernv/pci-ioda.c |  19 ++-
 arch/powerpc/platforms/powernv/pci.h  |  13 ++
 3 files changed, 156 insertions(+), 40 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/ocxl.c b/arch/powerpc/platforms/powernv/ocxl.c
index 12b146c2f855..67b2be965415 100644
--- a/arch/powerpc/platforms/powernv/ocxl.c
+++ b/arch/powerpc/platforms/powernv/ocxl.c
@@ -74,6 +74,8 @@ struct npu_link {
u16 fn_desired_actags[8];
struct actag_range fn_actags[8];
bool assignment_done;
+   struct iommu_group *group;
+   struct platform_data data;
 };
 static struct list_head links_list = LIST_HEAD_INIT(links_list);
 static DEFINE_MUTEX(links_list_lock);
@@ -603,54 +605,56 @@ int pnv_ocxl_platform_setup(struct pci_dev *dev, int PE_mask,
 {
struct pci_controller *hose = pci_bus_to_host(dev->bus);
struct pnv_phb *phb = hose->private_data;
-   struct platform_data *data;
+   struct npu_link *link = NULL;
int xsl_irq;
u32 bdfn;
-   int rc;
-
-   data = kzalloc(sizeof(*data), GFP_KERNEL);
-   if (!data)
-   return -ENOMEM;
+   int rc = 0;
 
-   rc = alloc_spa(dev, data);
-   if (rc) {
-   kfree(data);
-   return rc;
+   mutex_lock(&links_list_lock);
+   link = find_link(dev);
+   if (!link) {
+   dev_err(&dev->dev, "Failed to setup platform\n");
+   mutex_unlock(&links_list_lock);
+   return -ENODEV;
}
 
+   rc = alloc_spa(dev, &link->data);
+   if (rc)
+   goto unlock;
+
rc = get_xsl_irq(dev, &xsl_irq);
if (rc) {
-   free_spa(data);
-   kfree(data);
-   return rc;
+   free_spa(&link->data);
+   goto unlock;
}
 
-   rc = map_xsl_regs(dev, &data->dsisr, &data->dar, &data->tfc,
- &data->pe_handle);
+   rc = map_xsl_regs(dev, &link->data.dsisr, &link->data.dar,
+ &link->data.tfc, &link->data.pe_handle);
if (rc) {
-   free_spa(data);
-   kfree(data);
-   return rc;
+   free_spa(&link->data);
+   goto unlock;
}
 
bdfn = (dev->bus->number << 8) | dev->devfn;
rc = opal_npu_spa_setup(phb->opal_id, bdfn,
-   virt_to_phys(data->spa->spa_mem),
+   virt_to_phys(link->data.spa->spa_mem),
PE_mask);
if (rc) {
dev_err(&dev->dev, "Can't setup Shared Process Area: %d\n", rc);
-   unmap_xsl_regs(data->dsisr, data->dar, data->tfc,
-  data->pe_handle);
-   free_spa(data);
-   kfree(data);
-   return rc;
+   unmap_xsl_regs(link->data.dsisr, link->data.dar,
+  link->data.tfc, link->data.pe_handle);
+   free_spa(&link->data);
+   goto unlock;
}
-   data->phb_opal_id = phb->opal_id;
-   data->bdfn = bdfn;
-   *platform_data = (void *) data;
+   link->data.phb_opal_id = phb->opal_id;
+   link->data.bdfn = bdfn;
 
*hwirq = xsl_irq;
-   return 0;
+   *platform_data = (void *)&link->data;
+
+unlock:
+   mutex_unlock(&links_list_lock);
+   return rc;
 }
 EXPORT_SYMBOL_GPL(pnv_ocxl_platform_setup);
 
@@ -682,11 +686,13 @@ void pnv_ocxl_platform_release(void *platform_data)
struct platform_data *data = (struct platform_data *)platform_data;
int rc;
 
-   rc = opal_npu_spa_setup(data->phb_opal_id, data->bdfn, 0, 0);
-   WARN_ON(rc);
-   unmap_xsl_regs(data->dsisr, data->dar, data->tfc, data->pe_handle);
-   free_spa(data);
-   kfree(data);
+   if (data->spa) {
+   rc = opal_npu_spa_setup(data->phb_opal_id, data->bdfn, 0, 0);
+   WARN_ON(rc);
+   unmap_xsl_regs(data->dsisr, data->dar, data->tfc,
+  data->pe_handle);
+   free_spa(data);
+   }
 }
 EXPORT_SYMBOL_GPL(pnv_ocxl_platform_release);
 
@@ -837,3 +843,95 @@ int pnv_ocxl_remove_pe(void *platform_data, int pasid, u32 *pid,
return remove_pe_from_cache(data,

Re: [PATCH] powerpc/fadump: Remove duplicate message.

2019-10-24 Thread Michal Suchánek

On Thu, Oct 24, 2019 at 04:08:08PM +0530, Hari Bathini wrote:
> 
> Michal, thanks for looking into this.
> 
> On 23/10/19 11:26 PM, Michal Suchanek wrote:
> > There is duplicate message about lack of support by firmware in
> > fadump_reserve_mem and setup_fadump. Due to different capitalization it
> > is clear that the one in setup_fadump is shown on boot. Remove the
> > duplicate that is not shown.
> 
> Actually, the message in fadump_reserve_mem() is logged. fadump_reserve_mem()
> executes first and sets fw_dump.fadump_enabled to `0`, if fadump is not 
> supported.
> So, the other message in setup_fadump() doesn't get logged anymore with recent
> changes. The right thing to do would be to remove similar message in 
> setup_fadump() instead.

I need to re-check with a recent kernel build. I saw the message from
setup_fadump and not the one from fadump_reserve_mem but not sure what
the platform init code looked like in the kernel I tested with.

Thanks

Michal
