Re: [PATCH] ASoC: fsl_esai: Add spin lock to protect reset and stop
Hi

>
> On Wed, Oct 23, 2019 at 03:29:49PM +0800, Shengjiu Wang wrote:
> > xrun may happen at the end of stream, the
> > trigger->fsl_esai_trigger_stop maybe called in the middle of
> > fsl_esai_hw_reset, this may cause esai in wrong state after stop, and
> > there may be endless xrun interrupt.
>
> What about fsl_esai_trigger_start? It touches ESAI_xFCR_xFEN bit that is
> being checked in the beginning of fsl_esai_hw_reset.
>
> Could the scenario below be possible also?
>
> 1) ESAI TX starts
> 2) Xrun happens to TX
> 3) Starting fsl_esai_hw_reset (enabled[TX] = true; enabled[RX] = false)
> 4) ESAI RX starts
> 5) Finishing fsl_esai_hw_reset (enabled[RX] is still false)
>

Good catch, this may be possible. Will update in v2.

Best regards
Wang shengjiu
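
For anyone following the race described in steps 1) to 5), here is a rough user-space sketch of the idea behind the patch title: the reset takes its snapshot of the per-direction enable state and restores it under the same lock that the trigger path takes, so a start cannot slip in between. This is only a model under stated assumptions. `esai_model`, `model_trigger` and `model_hw_reset` are made-up names, a C11 `atomic_flag` stands in for the kernel spin lock, and none of this is the actual fsl_esai code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum { TX = 0, RX = 1 };

/* Hypothetical stand-in for the driver state; not the real fsl_esai struct. */
struct esai_model {
	atomic_flag lock;	/* models the spin lock from the patch title */
	bool enabled[2];	/* models the per-direction xFEN state */
	int resets;		/* number of completed resets */
};

static void model_lock(struct esai_model *m)
{
	while (atomic_flag_test_and_set_explicit(&m->lock, memory_order_acquire))
		;	/* spin until the flag is released */
}

static void model_unlock(struct esai_model *m)
{
	atomic_flag_clear_explicit(&m->lock, memory_order_release);
}

void model_init(struct esai_model *m)
{
	atomic_flag_clear(&m->lock);
	m->enabled[TX] = false;
	m->enabled[RX] = false;
	m->resets = 0;
}

/* Start or stop one direction; serialized against the reset. */
void model_trigger(struct esai_model *m, int dir, bool start)
{
	assert(dir == TX || dir == RX);
	model_lock(m);
	m->enabled[dir] = start;
	model_unlock(m);
}

/* Snapshot, reset and restore as one critical section. */
void model_hw_reset(struct esai_model *m)
{
	bool saved[2];

	model_lock(m);
	saved[TX] = m->enabled[TX];
	saved[RX] = m->enabled[RX];
	m->enabled[TX] = false;		/* the "hardware" is stopped for the reset */
	m->enabled[RX] = false;
	m->resets++;
	m->enabled[TX] = saved[TX];	/* restore what was running before */
	m->enabled[RX] = saved[RX];
	model_unlock(m);
}
```

With the lock held across the whole reset, an RX start either completes before the snapshot (and is restored) or runs after the restore, so step 5) can no longer leave enabled[RX] stale.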
Re: [PATCH V7] mm/debug: Add tests validating architecture page table helpers
> On Oct 24, 2019, at 11:45 PM, Anshuman Khandual wrote:
>
> Nothing specific. But just tested this with x86 defconfig with relevant
> configs which are required for this test. Not sure if it involved W=1.

No, it will not. It needs to run like,

make W=1 -j 64 2>/tmp/warns
Re: [PATCH] ASoC: fsl_asrc: refine the setting of internal clock divider
Hi

>
> On Wed, Oct 23, 2019 at 06:25:20AM +, S.j. Wang wrote:
> > > On Thu, Oct 17, 2019 at 02:21:08PM +0800, Shengjiu Wang wrote:
> > > > For P2P output, the output divider should align with the output
> > > > sample
> > >
> > > I think we should avoid "P2P" (or "M2M") keyword in the mainline
> > > code as we know M2M will never get merged while somebody working
> > > with the mainline and caring about new feature might be confused.
> >
> > Ok. But we still curious that is there a way to upstream m2m?
>
> Hmm..I would love to see that happening. Here is an old discussion that
> you may want to take a look:
> https://mailman.alsa-project.org/pipermail/alsa-devel/2014-May/076797.html
>
> > > It makes sense to me, yet I feel that the delay at the beginning of
> > > the audio playback might be longer as a compromise. I am okay with
> > > this decision though...
> > >
> > > > The maximum divider of asrc clock is 1024, but there is no
> > > > judgement for this limitaion in driver, which may cause the
> > > > divider setting not correct.
> > > >
> > > > For non-ideal ratio mode, the clock rate should divide the sample
> > > > rate with no remainder, and the quotient should be less than 1024.
> > > >
> > > > Signed-off-by: Shengjiu Wang
> > > > @@ -351,7 +352,9 @@ static int fsl_asrc_config_pair(struct fsl_asrc_pair *pair)
> > > >  	/* We only have output clock for ideal ratio mode */
> > > >  	clk = asrc_priv->asrck_clk[clk_index[ideal ? OUT : IN]];
> > > >
> > > > -	div[IN] = clk_get_rate(clk) / inrate;
> > > > +	clk_rate = clk_get_rate(clk);
> > >
> > > The fsl_asrc.c file has config.inclk being set to INCLK_NONE and
> > > this sets the "ideal" in this function to true. So, although we tend
> > > to not use ideal ratio setting for p2p cases, yet the input clock is
> > > still not physically connected, so we still use output clock for div[IN]
> > > calculation?
> >
> > For p2p case, it can be ideal or non-ideal. For non-ideal, we still
> > use Output clock for div calculation.
> >
> > >
> > > I am thinking something simplier: if we decided not to use ideal
> > > ratio for "P2P", instead of adding "bool p2p" with the confusing
> > > "ideal" in this function, could we just set config.inclk to the same
> > > clock as the output one for "P2P"? By doing so, "P2P" won't go
> > > through ideal ratio mode while still having a clock rate from the output
> > > clock for div[IN] calculation here.
> >
> > Bool p2p is to force output rate to be sample rate, no impact to ideal
> > Ratio mode.
>
> I just realized that the function has a bottom part for ideal mode
> exclusively -- if we treat p2p as !ideal, those configurations will be
> missing.
> So you're right, should have an extra boolean variable.
>
> > >
> > > > +	rem[IN] = do_div(clk_rate, inrate);
> > > > +	div[IN] = (u32)clk_rate;
> > > >  	if (div[IN] == 0) {
> > >
> > > Could we check div[IN] and rem[IN] here? Like:
> > > 	if (div[IN] == 0 || div[IN] > 1024) {
> > > 		pair_err();
> > > 		goto out;
> > > 	}
> > >
> > > 	if (!ideal && rem[IN]) {
> > > 		pair_err();
> > > 		goto out;
> > > 	}
> > >
> > > According to your commit log, I think the max-1024 limitation should
> > > be applied to all cases, not confined to "!ideal" cases right? And
> > > we should add some comments also, indicating it is limited by hardware.
> >
> > For ideal mode, my test result is the divider not impact the output
> > result.
> > Which means it is ok for ideal mode even divider is not correct...
>
> OK.
>
> > > >  		pair_err("failed to support input sample rate %dHz by asrck_%x\n",
> > > >  				inrate, clk_index[ideal ? OUT : IN]);
> > > > @@ -360,11 +363,20 @@ static int fsl_asrc_config_pair(struct fsl_asrc_pair *pair)
> > > >
> > > >  	clk = asrc_priv->asrck_clk[clk_index[OUT]];
> > > >
> > > > -	/* Use fixed output rate for Ideal Ratio mode (INCLK_NONE) */
> > > > -	if (ideal)
> > > > -		div[OUT] = clk_get_rate(clk) / IDEAL_RATIO_RATE;
> > > > -	else
> > > > -		div[OUT] = clk_get_rate(clk) / outrate;
> > > > +	/*
> > > > +	 * When P2P mode, output rate should align with the out samplerate.
> > > > +	 * if set too high output rate, there will be lots of Overload.
> > > > +	 * When M2M mode, output rate should also need to align with the out
> > >
> > > For this "should", do you actually mean "M2M could also"? Sorry, I'm
> > > just trying to understand everyting here, not intentionally being picky at
> > > words.
> > > My understanding
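
To make the divider rule from this thread concrete, here is a small stand-alone sketch of the check suggested in the review: reject a zero divider or one above 1024 in all modes, and reject a non-zero remainder only when not in ideal ratio mode. It is an illustration under assumptions, not the driver code. `asrc_calc_div` and `ASRC_DIV_MAX` are hypothetical names, and plain `/` and `%` stand in for the kernel's do_div(). Note the thread leaves open whether the 1024 limit really needs to apply in ideal mode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Upper bound taken from the commit log ("maximum divider ... is 1024"). */
#define ASRC_DIV_MAX 1024

/*
 * Hypothetical helper: compute the clock divider for a sample rate.
 * Returns 0 and stores the divider in *div on success, -1 when the
 * clock cannot support the requested rate.
 */
int asrc_calc_div(uint64_t clk_rate, uint32_t sample_rate, bool ideal,
		  uint32_t *div)
{
	uint64_t quot, rem;

	assert(div != NULL);
	if (sample_rate == 0)
		return -1;

	quot = clk_rate / sample_rate;
	rem = clk_rate % sample_rate;

	/* Limit applied to all modes, as suggested in the review. */
	if (quot == 0 || quot > ASRC_DIV_MAX)
		return -1;

	/* Outside ideal ratio mode the clock must divide evenly. */
	if (!ideal && rem != 0)
		return -1;

	*div = (uint32_t)quot;
	return 0;
}
```

For example, a 24.576 MHz clock gives a divider of 512 for 48 kHz, is rejected for 44.1 kHz outside ideal mode because of the remainder, and is rejected for 8 kHz in any mode because the quotient of 3072 exceeds the limit.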
[PATCH] ASoC: fsl: fsl_dma: fix build failure
Commit 4ac85de9977e ("ASoC: fsl: fsl_dma: remove snd_pcm_ops") removed
fsl_dma_ops but left a usage, leading to a build error for some
configs, eg. mpc85xx_defconfig:

  sound/soc/fsl/fsl_dma.c: In function ‘fsl_soc_dma_probe’:
  sound/soc/fsl/fsl_dma.c:905:18: error: ‘fsl_dma_ops’ undeclared (first use in this function)
    dma->dai.ops = &fsl_dma_ops;
                    ^~~

Remove the usage to fix the build.

Fixes: 4ac85de9977e ("ASoC: fsl: fsl_dma: remove snd_pcm_ops")
Signed-off-by: Michael Ellerman
---
 sound/soc/fsl/fsl_dma.c | 1 -
 1 file changed, 1 deletion(-)

This breakage is only in linux-next.

diff --git a/sound/soc/fsl/fsl_dma.c b/sound/soc/fsl/fsl_dma.c
index a092726510d4..2868c4f97cb2 100644
--- a/sound/soc/fsl/fsl_dma.c
+++ b/sound/soc/fsl/fsl_dma.c
@@ -901,7 +901,6 @@ static int fsl_soc_dma_probe(struct platform_device *pdev)
 	}
 
 	dma->dai.name = DRV_NAME;
-	dma->dai.ops = &fsl_dma_ops;
 	dma->dai.open = fsl_dma_open;
 	dma->dai.close = fsl_dma_close;
 	dma->dai.ioctl = snd_soc_pcm_lib_ioctl;
-- 
2.21.0
[PATCH 10/10] ocxl: Conditionally bind SCM devices to the generic OCXL driver
From: Alastair D'Silva

This patch allows the user to bind OpenCAPI SCM devices to the generic OCXL
driver.

Signed-off-by: Alastair D'Silva
---
 drivers/misc/ocxl/Kconfig | 7 +++
 drivers/misc/ocxl/pci.c   | 3 +++
 2 files changed, 10 insertions(+)

diff --git a/drivers/misc/ocxl/Kconfig b/drivers/misc/ocxl/Kconfig
index 1916fa65f2f2..8a683715c97c 100644
--- a/drivers/misc/ocxl/Kconfig
+++ b/drivers/misc/ocxl/Kconfig
@@ -29,3 +29,10 @@ config OCXL
 	  dedicated OpenCAPI link, and don't follow the same protocol.
 
 	  If unsure, say N.
+
+config OCXL_SCM_GENERIC
+	bool "Treat OpenCAPI Storage Class Memory as a generic OpenCAPI device"
+	default n
+	help
+	  Select this option to treat OpenCAPI Storage Class Memory
+	  devices as generic OpenCAPI devices.
diff --git a/drivers/misc/ocxl/pci.c b/drivers/misc/ocxl/pci.c
index cb920aa88d3a..7137055c1883 100644
--- a/drivers/misc/ocxl/pci.c
+++ b/drivers/misc/ocxl/pci.c
@@ -10,6 +10,9 @@
  */
 static const struct pci_device_id ocxl_pci_tbl[] = {
 	{ PCI_DEVICE(PCI_VENDOR_ID_IBM, 0x062B), },
+#ifdef CONFIG_OCXL_SCM_GENERIC
+	{ PCI_DEVICE(PCI_VENDOR_ID_IBM, 0x0625), },
+#endif
 	{ }
 };
 MODULE_DEVICE_TABLE(pci, ocxl_pci_tbl);
-- 
2.21.0
[PATCH 09/10] powerpc: Enable OpenCAPI Storage Class Memory driver on bare metal
From: Alastair D'Silva

Enable OpenCAPI Storage Class Memory driver on bare metal

Signed-off-by: Alastair D'Silva
---
 arch/powerpc/configs/powernv_defconfig | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/configs/powernv_defconfig b/arch/powerpc/configs/powernv_defconfig
index 6658cceb928c..45c0eff94964 100644
--- a/arch/powerpc/configs/powernv_defconfig
+++ b/arch/powerpc/configs/powernv_defconfig
@@ -352,3 +352,7 @@ CONFIG_KVM_BOOK3S_64=m
 CONFIG_KVM_BOOK3S_64_HV=m
 CONFIG_VHOST_NET=m
 CONFIG_PRINTK_TIME=y
+CONFIG_OCXL_SCM=m
+CONFIG_DEV_DAX=y
+CONFIG_DEV_DAX_PMEM=y
+CONFIG_FS_DAX=y
-- 
2.21.0
[PATCH 08/10] nvdimm: Add driver for OpenCAPI Storage Class Memory
From: Alastair D'Silva This driver exposes LPC memory on OpenCAPI SCM cards as an NVDIMM, allowing the existing nvram infrastructure to be used. Signed-off-by: Alastair D'Silva --- drivers/nvdimm/Kconfig | 17 + drivers/nvdimm/Makefile|3 + drivers/nvdimm/ocxl-scm.c | 2210 drivers/nvdimm/ocxl-scm_internal.c | 232 +++ drivers/nvdimm/ocxl-scm_internal.h | 331 + drivers/nvdimm/ocxl-scm_sysfs.c| 219 +++ include/uapi/linux/ocxl-scm.h | 128 ++ mm/memory_hotplug.c|2 +- 8 files changed, 3141 insertions(+), 1 deletion(-) create mode 100644 drivers/nvdimm/ocxl-scm.c create mode 100644 drivers/nvdimm/ocxl-scm_internal.c create mode 100644 drivers/nvdimm/ocxl-scm_internal.h create mode 100644 drivers/nvdimm/ocxl-scm_sysfs.c create mode 100644 include/uapi/linux/ocxl-scm.h diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig index 36af7af6b7cf..e4f7b6b08efd 100644 --- a/drivers/nvdimm/Kconfig +++ b/drivers/nvdimm/Kconfig @@ -130,4 +130,21 @@ config NVDIMM_TEST_BUILD core devm_memremap_pages() implementation and other infrastructure. +config OCXL_SCM + tristate "OpenCAPI Storage Class Memory" + depends on LIBNVDIMM + select ZONE_DEVICE + select OCXL + help + Exposes devices that implement the OpenCAPI Storage Class Memory + specification as persistent memory regions. + + Select N if unsure. 
+ +config OCXL_SCM_DEBUG + bool "OpenCAPI Storage Class Memory debugging" + depends on OCXL_SCM + help + Enables low level IOCTLs for OpenCAPI SCM firmware development + endif diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile index 29203f3d3069..43d826397bfc 100644 --- a/drivers/nvdimm/Makefile +++ b/drivers/nvdimm/Makefile @@ -6,6 +6,9 @@ obj-$(CONFIG_ND_BLK) += nd_blk.o obj-$(CONFIG_X86_PMEM_LEGACY) += nd_e820.o obj-$(CONFIG_OF_PMEM) += of_pmem.o obj-$(CONFIG_VIRTIO_PMEM) += virtio_pmem.o nd_virtio.o +obj-$(CONFIG_OCXL_SCM) += ocxlscm.o + +ocxlscm-y := ocxl-scm.o ocxl-scm_internal.o ocxl-scm_sysfs.o nd_pmem-y := pmem.o diff --git a/drivers/nvdimm/ocxl-scm.c b/drivers/nvdimm/ocxl-scm.c new file mode 100644 index ..f4e6cc022de8 --- /dev/null +++ b/drivers/nvdimm/ocxl-scm.c @@ -0,0 +1,2210 @@ +// SPDX-License-Identifier: GPL-2.0+ +// Copyright 2019 IBM Corp. + +/* + * A driver for Storage Class Memory, connected via OpenCAPI + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include "ocxl-scm_internal.h" + + +static const struct pci_device_id scm_pci_tbl[] = { + { PCI_DEVICE(PCI_VENDOR_ID_IBM, 0x0625), }, + { } +}; + +MODULE_DEVICE_TABLE(pci, scm_pci_tbl); + +#define SCM_NUM_MINORS 256 // Total to reserve +#define SCM_USABLE_TIMEOUT 120 // seconds + +static dev_t scm_dev; +static struct class *scm_class; +static struct mutex minors_idr_lock; +static struct idr minors_idr; + +static const struct attribute_group *scm_pmem_attribute_groups[] = { + _bus_attribute_group, + NULL, +}; + +static const struct attribute_group *scm_pmem_region_attribute_groups[] = { + _region_attribute_group, + _device_attribute_group, + _mapping_attribute_group, + _numa_attribute_group, + NULL, +}; + +/** + * scm_ndctl_config_write() - Handle a ND_CMD_SET_CONFIG_DATA command from ndctl + * @scm_data: the SCM metadata + * @command: the incoming data to write + * Return: 0 on success, negative on failure + */ +static int 
scm_ndctl_config_write(struct scm_data *scm_data, + struct nd_cmd_set_config_hdr *command) +{ + if (command->in_offset + command->in_length > SCM_LABEL_AREA_SIZE) + return -EINVAL; + + memcpy_flushcache(scm_data->metadata_addr + command->in_offset, command->in_buf, + command->in_length); + + return 0; +} + +/** + * scm_ndctl_config_read() - Handle a ND_CMD_GET_CONFIG_DATA command from ndctl + * @scm_data: the SCM metadata + * @command: the read request + * Return: 0 on success, negative on failure + */ +static int scm_ndctl_config_read(struct scm_data *scm_data, +struct nd_cmd_get_config_data_hdr *command) +{ + if (command->in_offset + command->in_length > SCM_LABEL_AREA_SIZE) + return -EINVAL; + + memcpy(command->out_buf, scm_data->metadata_addr + command->in_offset, + command->in_length); + + return 0; +} + +/** + * scm_ndctl_config_size() - Handle a ND_CMD_GET_CONFIG_SIZE command from ndctl + * @scm_data: the SCM metadata + * @command: the read request + * Return: 0 on success, negative on failure + */ +static int scm_ndctl_config_size(struct nd_cmd_get_config_size *command) +{ + command->status = 0; + command->config_size = SCM_LABEL_AREA_SIZE; + command->max_xfer = PAGE_SIZE; + + return 0; +} +
[PATCH 07/10] ocxl: Save the device serial number in ocxl_fn
From: Alastair D'Silva

This patch retrieves the serial number of the card and makes it available
to consumers of the ocxl driver via the ocxl_fn struct.

Signed-off-by: Alastair D'Silva
---
 drivers/misc/ocxl/config.c | 46 ++
 include/misc/ocxl.h        | 1 +
 2 files changed, 47 insertions(+)

diff --git a/drivers/misc/ocxl/config.c b/drivers/misc/ocxl/config.c
index fb0c3b6f8312..a9203c309365 100644
--- a/drivers/misc/ocxl/config.c
+++ b/drivers/misc/ocxl/config.c
@@ -71,6 +71,51 @@ static int find_dvsec_afu_ctrl(struct pci_dev *dev, u8 afu_idx)
 	return 0;
 }
 
+/**
+ * Find a related PCI device (function 0)
+ * @device: PCI device to match
+ *
+ * Returns a pointer to the related device, or null if not found
+ */
+static struct pci_dev *get_function_0(struct pci_dev *dev)
+{
+	unsigned int devfn = PCI_DEVFN(PCI_SLOT(dev->devfn), 0); // Look for function 0
+
+	return pci_get_domain_bus_and_slot(pci_domain_nr(dev->bus),
+					   dev->bus->number, devfn);
+}
+
+static void read_serial(struct pci_dev *dev, struct ocxl_fn_config *fn)
+{
+	u32 low, high;
+	int pos;
+
+	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DSN);
+	if (pos) {
+		pci_read_config_dword(dev, pos + 0x04, &low);
+		pci_read_config_dword(dev, pos + 0x08, &high);
+
+		fn->serial = low | ((u64)high) << 32;
+
+		return;
+	}
+
+	if (PCI_FUNC(dev->devfn) != 0) {
+		struct pci_dev *related = get_function_0(dev);
+
+		if (!related) {
+			fn->serial = 0;
+			return;
+		}
+
+		read_serial(related, fn);
+		pci_dev_put(related);
+		return;
+	}
+
+	fn->serial = 0;
+}
+
 static void read_pasid(struct pci_dev *dev, struct ocxl_fn_config *fn)
 {
 	u16 val;
@@ -208,6 +253,7 @@ int ocxl_config_read_function(struct pci_dev *dev, struct ocxl_fn_config *fn)
 	int rc;
 
 	read_pasid(dev, fn);
+	read_serial(dev, fn);
 
 	rc = read_dvsec_tl(dev, fn);
 	if (rc) {
diff --git a/include/misc/ocxl.h b/include/misc/ocxl.h
index 6f7c02f0d5e3..9843051c3c5b 100644
--- a/include/misc/ocxl.h
+++ b/include/misc/ocxl.h
@@ -46,6 +46,7 @@ struct ocxl_fn_config {
 	int dvsec_afu_info_pos; /* offset of the AFU information DVSEC */
 	s8 max_pasid_log;
 	s8 max_afu_index;
+	u64 serial;
 };
 
 enum ocxl_endian {
-- 
2.21.0
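
The two pieces of arithmetic in read_serial() and get_function_0() can be tried outside the kernel. The sketch below is only an illustration: `dsn_combine` and `function0_devfn` are hypothetical names, and the devfn layout (device number in bits 7:3, function in bits 2:0) is assumed to match what the kernel's PCI_SLOT/PCI_DEVFN macros encode.

```c
#include <assert.h>
#include <stdint.h>

/* Join the two 32-bit halves of a device serial number into one value. */
uint64_t dsn_combine(uint32_t low, uint32_t high)
{
	return (uint64_t)low | ((uint64_t)high << 32);
}

/*
 * Return the devfn of function 0 in the same slot.
 * Assumed layout: bits 7:3 = device (slot), bits 2:0 = function.
 */
unsigned int function0_devfn(unsigned int devfn)
{
	unsigned int slot;

	assert(devfn <= 0xff);	/* devfn is an 8-bit quantity */
	slot = (devfn >> 3) & 0x1f;
	return slot << 3;	/* function bits cleared */
}
```

So a serial read as low = 0x89abcdef and high = 0x01234567 becomes 0x0123456789abcdef, and a device at slot 1, function 3 (devfn 0x0b) falls back to devfn 0x08.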
[PATCH 06/10] ocxl: Add functions to map/unmap LPC memory
From: Alastair D'Silva Add functions to map/unmap LPC memory Signed-off-by: Alastair D'Silva --- drivers/misc/ocxl/config.c| 4 +++ drivers/misc/ocxl/core.c | 50 +++ drivers/misc/ocxl/ocxl_internal.h | 3 ++ include/misc/ocxl.h | 18 +++ 4 files changed, 75 insertions(+) diff --git a/drivers/misc/ocxl/config.c b/drivers/misc/ocxl/config.c index c8e19bfb5ef9..fb0c3b6f8312 100644 --- a/drivers/misc/ocxl/config.c +++ b/drivers/misc/ocxl/config.c @@ -568,6 +568,10 @@ static int read_afu_lpc_memory_info(struct pci_dev *dev, afu->special_purpose_mem_size = total_mem_size - lpc_mem_size; } + + dev_info(>dev, "Probed LPC memory of %#llx bytes and special purpose memory of %#llx bytes\n", + afu->lpc_mem_size, afu->special_purpose_mem_size); + return 0; } diff --git a/drivers/misc/ocxl/core.c b/drivers/misc/ocxl/core.c index 2531c6cf19a0..5554f5ce4b9e 100644 --- a/drivers/misc/ocxl/core.c +++ b/drivers/misc/ocxl/core.c @@ -210,6 +210,55 @@ static void unmap_mmio_areas(struct ocxl_afu *afu) release_fn_bar(afu->fn, afu->config.global_mmio_bar); } +int ocxl_afu_map_lpc_mem(struct ocxl_afu *afu) +{ + struct pci_dev *dev = to_pci_dev(afu->fn->dev.parent); + + if ((afu->config.lpc_mem_size + afu->config.special_purpose_mem_size) == 0) + return 0; + + afu->lpc_base_addr = ocxl_link_lpc_map(afu->fn->link, dev); + if (afu->lpc_base_addr == 0) + return -EINVAL; + + if (afu->config.lpc_mem_size) { + afu->lpc_res.start = afu->lpc_base_addr + afu->config.lpc_mem_offset; + afu->lpc_res.end = afu->lpc_res.start + afu->config.lpc_mem_size - 1; + } + + if (afu->config.special_purpose_mem_size) { + afu->special_purpose_res.start = afu->lpc_base_addr + + afu->config.special_purpose_mem_offset; + afu->special_purpose_res.end = afu->special_purpose_res.start + + afu->config.special_purpose_mem_size - 1; + } + + return 0; +} +EXPORT_SYMBOL(ocxl_afu_map_lpc_mem); + +struct resource *ocxl_afu_lpc_mem(struct ocxl_afu *afu) +{ + return >lpc_res; +} +EXPORT_SYMBOL(ocxl_afu_lpc_mem); + +static void 
unmap_lpc_mem(struct ocxl_afu *afu) +{ + struct pci_dev *dev = to_pci_dev(afu->fn->dev.parent); + + if (afu->lpc_res.start || afu->special_purpose_res.start) { + void *link = afu->fn->link; + + ocxl_link_lpc_release(link, dev); + + afu->lpc_res.start = 0; + afu->lpc_res.end = 0; + afu->special_purpose_res.start = 0; + afu->special_purpose_res.end = 0; + } +} + static int configure_afu(struct ocxl_afu *afu, u8 afu_idx, struct pci_dev *dev) { int rc; @@ -251,6 +300,7 @@ static int configure_afu(struct ocxl_afu *afu, u8 afu_idx, struct pci_dev *dev) static void deconfigure_afu(struct ocxl_afu *afu) { + unmap_lpc_mem(afu); unmap_mmio_areas(afu); reclaim_afu_pasid(afu); reclaim_afu_actag(afu); diff --git a/drivers/misc/ocxl/ocxl_internal.h b/drivers/misc/ocxl/ocxl_internal.h index 20b417e00949..9f4b47900e62 100644 --- a/drivers/misc/ocxl/ocxl_internal.h +++ b/drivers/misc/ocxl/ocxl_internal.h @@ -52,6 +52,9 @@ struct ocxl_afu { void __iomem *global_mmio_ptr; u64 pp_mmio_start; void *private; + u64 lpc_base_addr; /* Covers both LPC & special purpose memory */ + struct resource lpc_res; + struct resource special_purpose_res; }; enum ocxl_context_status { diff --git a/include/misc/ocxl.h b/include/misc/ocxl.h index 06dd5839e438..6f7c02f0d5e3 100644 --- a/include/misc/ocxl.h +++ b/include/misc/ocxl.h @@ -212,6 +212,24 @@ int ocxl_irq_set_handler(struct ocxl_context *ctx, int irq_id, // AFU Metadata +/** + * Map the LPC system & special purpose memory for an AFU + * + * Do not call this during device discovery, as there may me multiple + * devices on a link, and the memory is mapped for the whole link, not + * just one device. It should only be called after all devices have + * registered their memory on the link. 
+ * + * afu: The AFU that has the LPC memory to map + */ +extern int ocxl_afu_map_lpc_mem(struct ocxl_afu *afu); + +/** + * Get the physical address range of LPC memory for an AFU + * afu: The AFU associated with the LPC memory + */ +extern struct resource *ocxl_afu_lpc_mem(struct ocxl_afu *afu); + /** * Get a pointer to the config for an AFU * -- 2.21.0
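
The range arithmetic in ocxl_afu_map_lpc_mem() is easy to get off by one, so here is a stand-alone sketch of it. It is a model under assumptions only: `mem_range` and `region_at` are hypothetical names, and the inclusive end address mirrors how the patch fills in the start and end fields of its resources.

```c
#include <assert.h>
#include <stdint.h>

/* Inclusive [start, end] range, as the patch uses for its resources. */
struct mem_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Hypothetical helper: place a region of `size` bytes at `offset` from
 * the link-wide base address. A zero-sized region stays {0, 0}.
 */
struct mem_range region_at(uint64_t base, uint64_t offset, uint64_t size)
{
	struct mem_range r = { 0, 0 };

	assert(base != 0);	/* the patch treats a zero base as a failed map */
	if (size == 0)
		return r;

	r.start = base + offset;
	r.end = r.start + size - 1;	/* end is the last valid byte */
	return r;
}
```

With a base of 0x1000, an offset of 0x100 and a size of 0x200 this yields the range 0x1100 to 0x12ff.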
[PATCH 05/10] ocxl: Tally up the LPC memory on a link & allow it to be mapped
From: Alastair D'Silva Tally up the LPC memory on an OpenCAPI link & allow it to be mapped Signed-off-by: Alastair D'Silva --- drivers/misc/ocxl/core.c | 10 ++ drivers/misc/ocxl/link.c | 60 +++ drivers/misc/ocxl/ocxl_internal.h | 33 + 3 files changed, 103 insertions(+) diff --git a/drivers/misc/ocxl/core.c b/drivers/misc/ocxl/core.c index b7a09b21ab36..2531c6cf19a0 100644 --- a/drivers/misc/ocxl/core.c +++ b/drivers/misc/ocxl/core.c @@ -230,8 +230,18 @@ static int configure_afu(struct ocxl_afu *afu, u8 afu_idx, struct pci_dev *dev) if (rc) goto err_free_pasid; + if (afu->config.lpc_mem_size || afu->config.special_purpose_mem_size) { + rc = ocxl_link_add_lpc_mem(afu->fn->link, afu->config.lpc_mem_offset, + afu->config.lpc_mem_size + + afu->config.special_purpose_mem_size); + if (rc) + goto err_free_mmio; + } + return 0; +err_free_mmio: + unmap_mmio_areas(afu); err_free_pasid: reclaim_afu_pasid(afu); err_free_actag: diff --git a/drivers/misc/ocxl/link.c b/drivers/misc/ocxl/link.c index 58d111afd9f6..1d350d0bb860 100644 --- a/drivers/misc/ocxl/link.c +++ b/drivers/misc/ocxl/link.c @@ -84,6 +84,11 @@ struct ocxl_link { int dev; atomic_t irq_available; struct spa *spa; + struct mutex lpc_mem_lock; + u64 lpc_mem_sz; /* Total amount of LPC memory presented on the link */ + u64 lpc_mem; + int lpc_consumers; + void *platform_data; }; static struct list_head links_list = LIST_HEAD_INIT(links_list); @@ -396,6 +401,8 @@ static int alloc_link(struct pci_dev *dev, int PE_mask, struct ocxl_link **out_l if (rc) goto err_spa; + mutex_init(>lpc_mem_lock); + /* platform specific hook */ rc = pnv_ocxl_spa_setup(dev, link->spa->spa_mem, PE_mask, >platform_data); @@ -711,3 +718,56 @@ void ocxl_link_free_irq(void *link_handle, int hw_irq) atomic_inc(>irq_available); } EXPORT_SYMBOL_GPL(ocxl_link_free_irq); + +int ocxl_link_add_lpc_mem(void *link_handle, u64 offset, u64 size) +{ + struct ocxl_link *link = (struct ocxl_link *) link_handle; + + // Check for overflow + if (offset > (offset + 
size)) + return -EINVAL; + + mutex_lock(>lpc_mem_lock); + link->lpc_mem_sz = max(link->lpc_mem_sz, offset + size); + + mutex_unlock(>lpc_mem_lock); + + return 0; +} + +u64 ocxl_link_lpc_map(void *link_handle, struct pci_dev *pdev) +{ + struct ocxl_link *link = (struct ocxl_link *) link_handle; + u64 lpc_mem; + + mutex_lock(>lpc_mem_lock); + if (link->lpc_mem) { + lpc_mem = link->lpc_mem; + + link->lpc_consumers++; + mutex_unlock(>lpc_mem_lock); + return lpc_mem; + } + + link->lpc_mem = pnv_ocxl_platform_lpc_setup(pdev, link->lpc_mem_sz); + if (link->lpc_mem) + link->lpc_consumers++; + lpc_mem = link->lpc_mem; + mutex_unlock(>lpc_mem_lock); + + return lpc_mem; +} + +void ocxl_link_lpc_release(void *link_handle, struct pci_dev *pdev) +{ + struct ocxl_link *link = (struct ocxl_link *) link_handle; + + mutex_lock(>lpc_mem_lock); + link->lpc_consumers--; + if (link->lpc_consumers == 0) { + pnv_ocxl_platform_lpc_release(pdev); + link->lpc_mem = 0; + } + + mutex_unlock(>lpc_mem_lock); +} diff --git a/drivers/misc/ocxl/ocxl_internal.h b/drivers/misc/ocxl/ocxl_internal.h index 97415afd79f3..20b417e00949 100644 --- a/drivers/misc/ocxl/ocxl_internal.h +++ b/drivers/misc/ocxl/ocxl_internal.h @@ -141,4 +141,37 @@ int ocxl_irq_offset_to_id(struct ocxl_context *ctx, u64 offset); u64 ocxl_irq_id_to_offset(struct ocxl_context *ctx, int irq_id); void ocxl_afu_irq_free_all(struct ocxl_context *ctx); +/** + * ocxl_link_add_lpc_mem() - Increment the amount of memory required by an OpenCAPI link + * + * @link_handle: The OpenCAPI link handle + * @offset: The offset of the memory to add + * @size: The amount of memory to increment by + * + * Return 0 on success, negative on overflow + */ +int ocxl_link_add_lpc_mem(void *link_handle, u64 offset, u64 size); + +/** + * ocxl_link_lpc_map() - Map the LPC memory for an OpenCAPI device + * + * Since LPC memory belongs to a link, the whole LPC memory available + * on the link bust be mapped in order to make it accessible to a device. 
+ * + * @link_handle: The OpenCAPI link handle + * @pdev: A device that is on the link + */ +u64 ocxl_link_lpc_map(void *link_handle, struct pci_dev *pdev); + +/** + * ocxl_link_lpc_release() - Release the LPC memory device for an OpenCAPI device + * + * Offlines LPC memory on an OpenCAPI link for a device. If this is the + *
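
The tallying and reference counting in this patch can be modelled in a few lines. The sketch below is a single-threaded illustration under assumptions: `link_model` and its functions are hypothetical names, the mutex from the patch is left out, and the platform setup call is replaced by a caller-supplied base address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the LPC fields added to struct ocxl_link. */
struct link_model {
	uint64_t lpc_mem_sz;	/* total LPC memory presented on the link */
	uint64_t lpc_mem;	/* mapped base address, 0 when unmapped */
	int lpc_consumers;	/* number of users of the mapping */
};

/* Grow the link total to cover [offset, offset + size); -1 on overflow. */
int link_add_lpc_mem(struct link_model *link, uint64_t offset, uint64_t size)
{
	assert(link != NULL);
	if (offset > offset + size)	/* unsigned wrap-around check */
		return -1;
	if (offset + size > link->lpc_mem_sz)
		link->lpc_mem_sz = offset + size;
	return 0;
}

/* Map once, then hand the same base to every later caller. */
uint64_t link_lpc_map(struct link_model *link, uint64_t platform_base)
{
	assert(link != NULL);
	if (link->lpc_mem == 0)
		link->lpc_mem = platform_base;	/* stands in for platform setup */
	if (link->lpc_mem != 0)
		link->lpc_consumers++;
	return link->lpc_mem;
}

/* Drop one user; the mapping goes away with the last one. */
void link_lpc_release(struct link_model *link)
{
	assert(link != NULL && link->lpc_consumers > 0);
	if (--link->lpc_consumers == 0)
		link->lpc_mem = 0;
}
```

The point of taking the maximum of offset + size, rather than summing sizes, is that the regions are positions within one link-wide window, so the window only needs to reach the furthest end.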
[PATCH 04/10] powerpc: Map & release OpenCAPI LPC memory
From: Alastair D'Silva

This patch adds platform support to map & release LPC memory.

Signed-off-by: Alastair D'Silva
---
 arch/powerpc/include/asm/pnv-ocxl.h   | 2 ++
 arch/powerpc/platforms/powernv/ocxl.c | 41 +++
 include/linux/memory_hotplug.h        | 5 +++++
 mm/memory_hotplug.c                   | 3 +-
 4 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/pnv-ocxl.h b/arch/powerpc/include/asm/pnv-ocxl.h
index 7de82647e761..f8f8ffb48aa8 100644
--- a/arch/powerpc/include/asm/pnv-ocxl.h
+++ b/arch/powerpc/include/asm/pnv-ocxl.h
@@ -32,5 +32,7 @@ extern int pnv_ocxl_spa_remove_pe_from_cache(void *platform_data, int pe_handle)
 
 extern int pnv_ocxl_alloc_xive_irq(u32 *irq, u64 *trigger_addr);
 extern void pnv_ocxl_free_xive_irq(u32 irq);
+extern u64 pnv_ocxl_platform_lpc_setup(struct pci_dev *pdev, u64 size);
+extern void pnv_ocxl_platform_lpc_release(struct pci_dev *pdev);
 
 #endif /* _ASM_PNV_OCXL_H */
diff --git a/arch/powerpc/platforms/powernv/ocxl.c b/arch/powerpc/platforms/powernv/ocxl.c
index 8c65aacda9c8..c6d4234e0aba 100644
--- a/arch/powerpc/platforms/powernv/ocxl.c
+++ b/arch/powerpc/platforms/powernv/ocxl.c
@@ -475,6 +475,47 @@ void pnv_ocxl_spa_release(void *platform_data)
 }
 EXPORT_SYMBOL_GPL(pnv_ocxl_spa_release);
 
+u64 pnv_ocxl_platform_lpc_setup(struct pci_dev *pdev, u64 size)
+{
+	struct pci_controller *hose = pci_bus_to_host(pdev->bus);
+	struct pnv_phb *phb = hose->private_data;
+	u32 bdfn = (pdev->bus->number << 8) | pdev->devfn;
+	u64 base_addr = 0;
+	int rc;
+
+	rc = opal_npu_mem_alloc(phb->opal_id, bdfn, size, &base_addr);
+	if (rc) {
+		dev_warn(&pdev->dev,
+			 "OPAL could not allocate LPC memory, rc=%d\n", rc);
+		return 0;
+	}
+
+	base_addr = be64_to_cpu(base_addr);
+
+	rc = check_hotplug_memory_addressable(base_addr >> PAGE_SHIFT,
+					      size >> PAGE_SHIFT);
+	if (rc)
+		return 0;
+
+	return base_addr;
+}
+EXPORT_SYMBOL_GPL(pnv_ocxl_platform_lpc_setup);
+
+void pnv_ocxl_platform_lpc_release(struct pci_dev *pdev)
+{
+	struct pci_controller *hose = pci_bus_to_host(pdev->bus);
+	struct pnv_phb *phb = hose->private_data;
+	u32 bdfn = (pdev->bus->number << 8) | pdev->devfn;
+	int rc;
+
+	rc = opal_npu_mem_release(phb->opal_id, bdfn);
+	if (rc)
+		dev_warn(&pdev->dev,
+			 "OPAL reported rc=%d when releasing LPC memory\n", rc);
+}
+EXPORT_SYMBOL_GPL(pnv_ocxl_platform_lpc_release);
+
+
 int pnv_ocxl_spa_remove_pe_from_cache(void *platform_data, int pe_handle)
 {
 	struct spa_data *data = (struct spa_data *) platform_data;
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index f46ea71b4ffd..3f5f1a642abe 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -339,6 +339,11 @@ static inline int remove_memory(int nid, u64 start, u64 size)
 static inline void __remove_memory(int nid, u64 start, u64 size) {}
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 
+#if CONFIG_MEMORY_HOTPLUG_SPARSE
+int check_hotplug_memory_addressable(unsigned long pfn,
+				     unsigned long nr_pages);
+#endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
+
 extern void __ref free_area_init_core_hotplug(int nid);
 extern int __add_memory(int nid, u64 start, u64 size);
 extern int add_memory(int nid, u64 start, u64 size);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 2cecf07b396f..b39827dbd071 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -278,7 +278,7 @@ static int check_pfn_span(unsigned long pfn, unsigned long nr_pages,
 	return 0;
 }
 
-static int check_hotplug_memory_addressable(unsigned long pfn,
+int check_hotplug_memory_addressable(unsigned long pfn,
 					    unsigned long nr_pages)
 {
 	const u64 max_addr = PFN_PHYS(pfn + nr_pages) - 1;
@@ -294,6 +294,7 @@ static int check_hotplug_memory_addressable(unsigned long pfn,
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(check_hotplug_memory_addressable);
 
 /*
  * Reasonably generic function for adding memory.  It is
-- 
2.21.0
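
Two small conversions in pnv_ocxl_platform_lpc_setup() are worth spelling out: packing the bus number and devfn into the bdfn argument, and turning the big-endian value returned by firmware into a host integer. The sketch below shows both in portable C. `make_bdfn` and `be64_decode` are hypothetical names, and the byte-wise decode is a stand-in for the kernel's be64_to_cpu(), not a copy of it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack bus number and devfn the way the patch builds its bdfn value. */
uint32_t make_bdfn(uint8_t bus, uint8_t devfn)
{
	return ((uint32_t)bus << 8) | devfn;
}

/* Decode 8 big-endian bytes into a host integer, whatever the host order. */
uint64_t be64_decode(const uint8_t bytes[8])
{
	uint64_t value = 0;

	assert(bytes != NULL);
	for (size_t i = 0; i < 8; i++)
		value = (value << 8) | bytes[i];	/* most significant byte first */
	return value;
}
```

For bus 0x01 and devfn 0x08 the packed value is 0x0108.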
[PATCH 03/10] powerpc: Add OPAL calls for LPC memory alloc/release
From: Alastair D'Silva

Add OPAL calls for LPC memory alloc/release

Signed-off-by: Alastair D'Silva
Acked-by: Andrew Donnellan
---
 arch/powerpc/include/asm/opal-api.h        | 2 ++
 arch/powerpc/include/asm/opal.h            | 3 +++
 arch/powerpc/platforms/powernv/opal-call.c | 2 ++
 3 files changed, 7 insertions(+)

diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
index 378e3997845a..2c88c02e69ed 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -208,6 +208,8 @@
 #define OPAL_HANDLE_HMI2			166
 #define	OPAL_NX_COPROC_INIT			167
 #define OPAL_XIVE_GET_VP_STATE			170
+#define OPAL_NPU_MEM_ALLOC			171
+#define OPAL_NPU_MEM_RELEASE			172
 #define OPAL_MPIPL_UPDATE			173
 #define OPAL_MPIPL_REGISTER_TAG			174
 #define OPAL_MPIPL_QUERY_TAG			175
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index a0cf8fba4d12..4db135fb54ab 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -39,6 +39,9 @@ int64_t opal_npu_spa_clear_cache(uint64_t phb_id, uint32_t bdfn,
 				uint64_t PE_handle);
 int64_t opal_npu_tl_set(uint64_t phb_id, uint32_t bdfn, long cap,
 			uint64_t rate_phys, uint32_t size);
+int64_t opal_npu_mem_alloc(uint64_t phb_id, uint32_t bdfn,
+			uint64_t size, uint64_t *bar);
+int64_t opal_npu_mem_release(uint64_t phb_id, uint32_t bdfn);
 
 int64_t opal_console_write(int64_t term_number, __be64 *length,
 			   const uint8_t *buffer);
diff --git a/arch/powerpc/platforms/powernv/opal-call.c b/arch/powerpc/platforms/powernv/opal-call.c
index a2aa5e433ac8..27c4b93c774c 100644
--- a/arch/powerpc/platforms/powernv/opal-call.c
+++ b/arch/powerpc/platforms/powernv/opal-call.c
@@ -287,6 +287,8 @@ OPAL_CALL(opal_pci_set_pbcq_tunnel_bar,	OPAL_PCI_SET_PBCQ_TUNNEL_BAR);
 OPAL_CALL(opal_sensor_read_u64,			OPAL_SENSOR_READ_U64);
 OPAL_CALL(opal_sensor_group_enable,		OPAL_SENSOR_GROUP_ENABLE);
 OPAL_CALL(opal_nx_coproc_init,			OPAL_NX_COPROC_INIT);
+OPAL_CALL(opal_npu_mem_alloc,			OPAL_NPU_MEM_ALLOC);
+OPAL_CALL(opal_npu_mem_release,			OPAL_NPU_MEM_RELEASE);
 OPAL_CALL(opal_mpipl_update,			OPAL_MPIPL_UPDATE);
 OPAL_CALL(opal_mpipl_register_tag,		OPAL_MPIPL_REGISTER_TAG);
 OPAL_CALL(opal_mpipl_query_tag,			OPAL_MPIPL_QUERY_TAG);
-- 
2.21.0
[PATCH 02/10] nvdimm: remove prototypes for nonexistent functions
From: Alastair D'Silva

These functions don't exist, so remove the prototypes for them.

Signed-off-by: Alastair D'Silva
---
 drivers/nvdimm/nd-core.h | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/drivers/nvdimm/nd-core.h b/drivers/nvdimm/nd-core.h
index 25fa121104d0..9f121a6aeb02 100644
--- a/drivers/nvdimm/nd-core.h
+++ b/drivers/nvdimm/nd-core.h
@@ -124,11 +124,7 @@ void nd_region_create_dax_seed(struct nd_region *nd_region);
 int nvdimm_bus_create_ndctl(struct nvdimm_bus *nvdimm_bus);
 void nvdimm_bus_destroy_ndctl(struct nvdimm_bus *nvdimm_bus);
 void nd_synchronize(void);
-int nvdimm_bus_register_dimms(struct nvdimm_bus *nvdimm_bus);
-int nvdimm_bus_register_regions(struct nvdimm_bus *nvdimm_bus);
-int nvdimm_bus_init_interleave_sets(struct nvdimm_bus *nvdimm_bus);
 void __nd_device_register(struct device *dev);
-int nd_match_dimm(struct device *dev, void *data);
 struct nd_label_id;
 char *nd_label_gen_id(struct nd_label_id *label_id, u8 *uuid, u32 flags);
 bool nd_is_uuid_unique(struct device *dev, u8 *uuid);
-- 
2.21.0
[PATCH 01/10] memory_hotplug: Add a bounds check to __add_pages
From: Alastair D'Silva

On PowerPC, the address ranges allocated to OpenCAPI LPC memory are
allocated from firmware. These address ranges may be higher than what
older kernels permit, as we increased the maximum permissible address in
commit 4ffe713b7587 ("powerpc/mm: Increase the max addressable memory to
2PB"). It is possible that the addressable range may change again in the
future.

In this scenario, we end up with a bogus section returned from
__section_nr (see the discussion on the thread "mm: Trigger bug on if a
section is not found in __section_nr").

Adding a check here means that we fail early and have an opportunity to
handle the error gracefully, rather than rumbling on and potentially
accessing an incorrect section.

Further discussion is also on the thread ("powerpc: Perform a bounds
check in arch_add_memory")
http://lkml.kernel.org/r/20190827052047.31547-1-alast...@au1.ibm.com

Signed-off-by: Alastair D'Silva
Reviewed-by: David Hildenbrand
Acked-by: Michal Hocko
---
 mm/memory_hotplug.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index df570e5c71cc..2cecf07b396f 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -278,6 +278,23 @@ static int check_pfn_span(unsigned long pfn, unsigned long nr_pages,
 	return 0;
 }
 
+static int check_hotplug_memory_addressable(unsigned long pfn,
+					    unsigned long nr_pages)
+{
+	const u64 max_addr = PFN_PHYS(pfn + nr_pages) - 1;
+
+	if (max_addr >> MAX_PHYSMEM_BITS) {
+		const u64 max_allowed = (1ull << (MAX_PHYSMEM_BITS + 1)) - 1;
+
+		WARN(1,
+		     "Hotplugged memory exceeds maximum addressable address, range=%#llx-%#llx, maximum=%#llx\n",
+		     PFN_PHYS(pfn), max_addr, max_allowed);
+		return -E2BIG;
+	}
+
+	return 0;
+}
+
 /*
  * Reasonably generic function for adding memory.  It is
  * expected that archs that support memory hotplug will
@@ -291,6 +308,10 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
 	unsigned long nr, start_sec, end_sec;
 	struct vmem_altmap *altmap = restrictions->altmap;
 
+	err = check_hotplug_memory_addressable(pfn, nr_pages);
+	if (err)
+		return err;
+
 	if (altmap) {
 		/*
 		 * Validate altmap is within bounds of the total request
-- 
2.21.0
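For anyone who wants to try the arithmetic outside the kernel, the range check in the patch above can be sketched in plain C. This is only an illustration under stated assumptions: SKETCH_PAGE_SHIFT and SKETCH_MAX_PHYSMEM_BITS are made-up values (the real ones are per-architecture), and sketch_check_addressable() is a hypothetical name, not a kernel function.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed values for illustration only; the real ones depend on the arch. */
#define SKETCH_PAGE_SHIFT       12	/* 4 KiB pages */
#define SKETCH_MAX_PHYSMEM_BITS 46	/* hypothetical addressable limit */

/*
 * Hypothetical userspace analogue of the check in the patch: returns 0 if
 * the last byte of [pfn, pfn + nr_pages) fits in SKETCH_MAX_PHYSMEM_BITS
 * bits, -E2BIG otherwise.
 */
static int sketch_check_addressable(uint64_t pfn, uint64_t nr_pages)
{
	/* Physical address of the last byte in the range. */
	const uint64_t max_addr = ((pfn + nr_pages) << SKETCH_PAGE_SHIFT) - 1;

	if (max_addr >> SKETCH_MAX_PHYSMEM_BITS)
		return -E2BIG;

	return 0;
}
```

With these assumed values, a range whose last byte is at 2^46 - 1 is accepted and a range starting at page 2^34 is rejected, which is the fail-early behaviour the commit message asks for.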
[PATCH 00/10] Add support for OpenCAPI SCM devices
From: Alastair D'Silva This series adds support for OpenCAPI SCM devices, exposing them as nvdimms so that we can make use of the existing infrastructure. The first patch (in memory_hotplug) has reviews/acks, but has not yet made it upstream. Alastair D'Silva (10): memory_hotplug: Add a bounds check to __add_pages nvdimm: remove prototypes for nonexistent functions powerpc: Add OPAL calls for LPC memory alloc/release powerpc: Map & release OpenCAPI LPC memory ocxl: Tally up the LPC memory on a link & allow it to be mapped ocxl: Add functions to map/unmap LPC memory ocxl: Save the device serial number in ocxl_fn nvdimm: Add driver for OpenCAPI Storage Class Memory powerpc: Enable OpenCAPI Storage Class Memory driver on bare metal ocxl: Conditionally bind SCM devices to the generic OCXL driver arch/powerpc/configs/powernv_defconfig |4 + arch/powerpc/include/asm/opal-api.h|2 + arch/powerpc/include/asm/opal.h|3 + arch/powerpc/include/asm/pnv-ocxl.h|2 + arch/powerpc/platforms/powernv/ocxl.c | 41 + arch/powerpc/platforms/powernv/opal-call.c |2 + drivers/misc/ocxl/Kconfig |7 + drivers/misc/ocxl/config.c | 50 + drivers/misc/ocxl/core.c | 60 + drivers/misc/ocxl/link.c | 60 + drivers/misc/ocxl/ocxl_internal.h | 36 + drivers/misc/ocxl/pci.c|3 + drivers/nvdimm/Kconfig | 17 + drivers/nvdimm/Makefile|3 + drivers/nvdimm/nd-core.h |4 - drivers/nvdimm/ocxl-scm.c | 2210 drivers/nvdimm/ocxl-scm_internal.c | 232 ++ drivers/nvdimm/ocxl-scm_internal.h | 331 +++ drivers/nvdimm/ocxl-scm_sysfs.c| 219 ++ include/linux/memory_hotplug.h |5 + include/misc/ocxl.h| 19 + include/uapi/linux/ocxl-scm.h | 128 ++ mm/memory_hotplug.c| 22 + 23 files changed, 3456 insertions(+), 4 deletions(-) create mode 100644 drivers/nvdimm/ocxl-scm.c create mode 100644 drivers/nvdimm/ocxl-scm_internal.c create mode 100644 drivers/nvdimm/ocxl-scm_internal.h create mode 100644 drivers/nvdimm/ocxl-scm_sysfs.c create mode 100644 include/uapi/linux/ocxl-scm.h -- 2.21.0
Re: [PATCH V7] mm/debug: Add tests validating architecture page table helpers
On 10/24/2019 10:21 PM, Qian Cai wrote:
> 
> 
>> On Oct 24, 2019, at 10:50 AM, Anshuman Khandual wrote:
>> 
>> Changes in V7:
>> 
>> - Memory allocation and free routines for mapped pages have been dropped
>> - Mapped pfns are derived from standard kernel text symbol per Matthew
>> - Moved debug_vm_pgtable() after page_alloc_init_late() per Michal and Qian
>> - Updated the commit message per Michal
>> - Updated W=1 GCC warning problem on x86 per Qian Cai
> 
> It would be interesting to know if you actually tested out to see if the
> warning went away. As far as I can tell, the GCC is quite stubborn there, so I
> am not going to insist.
> 

Nothing specific. But I just tested this with x86 defconfig plus the relevant
configs which are required for this test. Not sure if it involved W=1.

The problem is, there is no other or better way to have both the conditional
checks in place while also reducing the chances of this warning. IMHO both the
conditional checks are required.
RE: [PATCH v7 2/3] Documentation: dt: binding: fsl: Add 'little-endian' and update Chassis define
Hi Scott,

On Friday, October 25, 2019 02:34, Scott Wood wrote:
> 
> On Mon, 2019-10-21 at 11:49 +0800, Ran Wang wrote:
> > By default, QorIQ SoC's RCPM register block is Big Endian. But there
> > are some exceptions, such as LS1088A and LS2088A, are Little Endian.
> > So add this optional property to help identify them.
> >
> > Actually LS2021A and other Layerscapes won't totally follow Chassis
> > 2.1, so separate them from powerpc SoC.
> 
> Did you mean LS1021A and "don't" instead of "won't", given the change to the
> examples?

OK, I will change it to "don't" to just tell the current situation.

> > Change in v5:
> > - Add 'Reviewed-by: Rob Herring' to commit message.
> > - Rename property 'fsl,#rcpm-wakeup-cells' to '#fsl,rcpm-wakeup-cells'.
> > please see https://lore.kernel.org/patchwork/patch/1101022/
> 
> I'm not sure why Rob considers this the "correct form" -- there are other
> examples of the current form, such as ibm,#dma-address-cells and
> ti,#tlb-entries, and the current form makes more logical sense (# is part of
> the property name, not the vendor). Oh well.
> 
> > Required properites:
> >   - reg : Offset and length of the register set of the RCPM block.
> > - - fsl,#rcpm-wakeup-cells : The number of IPPDEXPCR register cells in the
> > + - #fsl,rcpm-wakeup-cells : The number of IPPDEXPCR register cells in the
> >   fsl,rcpm-wakeup property.
> >   - compatible : Must contain a chip-specific RCPM block compatible string
> >   and (if applicable) may contain a chassis-version RCPM compatible
> > @@ -20,6 +20,7 @@ Required properites:
> >   * "fsl,qoriq-rcpm-1.0": for chassis 1.0 rcpm
> >   * "fsl,qoriq-rcpm-2.0": for chassis 2.0 rcpm
> >   * "fsl,qoriq-rcpm-2.1": for chassis 2.1 rcpm
> > + * "fsl,qoriq-rcpm-2.1+": for chassis 2.1+ rcpm
> 
> Is there something actually called "2.1+"? It looks a bit like an attempt to
> claim compatibility with all future versions. If the former, is it a name that
> comes from the hardware side with an intent for it to describe a stable
> interface, or are we later going to see a patch changing some
> by-then-existing device trees from "2.1+" to "2.1++" when some new
> incompatibility is found?
> 
> Perhaps it would be better to bind to the specific chip compatibles.

According to the SoC data sheets, the arch designs of the PowerPC SoC T1040 and
the current ARM-based Layerscape SoCs (LS1021A, LS1012A, LS1043A, etc.) are
both based on Chassis spec 2.1. However, the Layerscape data sheets also
explicitly state that some minor changes have been made (based on the Chassis
2.1 spec). In parallel, the SW arch designs of T1040 and the Layerscape family
also differ: for Layerscape, part of the RCPM programming job has been moved
from the kernel driver to firmware/bootloader (through the PSCI interface).

That's why I have to add a new compatible string to distinguish them. They
cannot use the same driver.

I don't think we will add another string like 2.1++ in the future. If the
Chassis spec keeps evolving and requires different programming logic, we can
add more like 3.0, 4.0, ..., I think.

Regards,
Ran
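As a side note on the 'little-endian' property discussed in this thread, the difference it makes to a register access can be sketched in plain C. This is a minimal illustration, assuming a 32-bit register image given as four bytes in memory order; sketch_reg_decode32() is a hypothetical helper, not part of the RCPM driver. In a real driver the same choice would typically be made between the kernel's ioread32() and ioread32be() accessors based on the property.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: decode a 32-bit register image stored as four bytes
 * in memory order, as either little endian or big endian.
 */
static uint32_t sketch_reg_decode32(const uint8_t bytes[4], bool little_endian)
{
	if (little_endian)
		return (uint32_t)bytes[0] | (uint32_t)bytes[1] << 8 |
		       (uint32_t)bytes[2] << 16 | (uint32_t)bytes[3] << 24;

	/* Big endian: the first byte in memory is the most significant. */
	return (uint32_t)bytes[3] | (uint32_t)bytes[2] << 8 |
	       (uint32_t)bytes[1] << 16 | (uint32_t)bytes[0] << 24;
}
```

Reading the same four bytes with the wrong setting yields a byte-swapped value, which is why the binding needs a way to mark the little-endian parts.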
Re: [PATCH v6 20/30] powerpc/pci: Fix crash with enabled movable BARs
On 25/10/2019 04:12, Sergey Miroshnichenko wrote:
> Add a check for the UNSET resource flag to skip the released BARs

Where/why does it crash exactly? It is not extremely clear from the code.

Thanks,

> 
> CC: Alexey Kardashevskiy
> CC: Oliver O'Halloran
> CC: Sam Bobroff
> Signed-off-by: Sergey Miroshnichenko
> ---
>  arch/powerpc/platforms/powernv/pci-ioda.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index c28d0d9b7ee0..33d5ed8c258f 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -2976,7 +2976,8 @@ static void pnv_ioda_setup_pe_res(struct pnv_ioda_pe *pe,
>  	int index;
>  	int64_t rc;
>  
> -	if (!res || !res->flags || res->start > res->end)
> +	if (!res || !res->flags || res->start > res->end ||
> +	    (res->flags & IORESOURCE_UNSET))
>  		return;
>  
>  	if (res->flags & IORESOURCE_IO) {
> 

-- 
Alexey
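To make the patched condition easier to reason about while the crash question is open, here is a small userspace sketch of it. Everything in it is an assumption for illustration: struct sketch_resource is a cut-down stand-in for the kernel's struct resource, and the two flag values are copied from my reading of include/linux/ioport.h and should be checked against the tree.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed flag values (believed to match include/linux/ioport.h). */
#define SKETCH_IORESOURCE_MEM   0x00000200u
#define SKETCH_IORESOURCE_UNSET 0x20000000u	/* no address assigned yet */

/* Cut-down, hypothetical stand-in for struct resource. */
struct sketch_resource {
	uint64_t start;
	uint64_t end;
	unsigned long flags;
};

/* True when the resource would be skipped by the patched condition. */
static bool sketch_skip_resource(const struct sketch_resource *res)
{
	return !res || !res->flags || res->start > res->end ||
	       (res->flags & SKETCH_IORESOURCE_UNSET);
}
```

A released BAR that still carries a plausible start/end pair passes the first three tests, so only the added flag test filters it out in this model.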
[PATCH v5 4/4] powerpc: load firmware trusted keys/hashes into kernel keyring
The keys used to verify the Host OS kernel are managed by firmware as secure variables. This patch loads the verification keys into the .platform keyring and revocation hashes into .blacklist keyring. This enables verification and loading of the kernels signed by the boot time keys which are trusted by firmware. Signed-off-by: Nayna Jain Reviewed-by: Mimi Zohar --- arch/powerpc/Kconfig | 1 + security/integrity/Kconfig| 8 ++ security/integrity/Makefile | 4 +- .../integrity/platform_certs/load_powerpc.c | 86 +++ 4 files changed, 98 insertions(+), 1 deletion(-) create mode 100644 security/integrity/platform_certs/load_powerpc.c diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 949e747bc8c2..5d860ed6c901 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -939,6 +939,7 @@ config PPC_SECURE_BOOT bool depends on PPC_POWERNV depends on IMA_ARCH_POLICY + select LOAD_PPC_KEYS help Systems with firmware secure boot enabled need to define security policies to extend secure boot to the OS. This config allows a user diff --git a/security/integrity/Kconfig b/security/integrity/Kconfig index 0bae6adb63a9..26abee23e4e3 100644 --- a/security/integrity/Kconfig +++ b/security/integrity/Kconfig @@ -72,6 +72,14 @@ config LOAD_IPL_KEYS depends on S390 def_bool y +config LOAD_PPC_KEYS + bool "Enable loading of platform and blacklisted keys for POWER" + depends on INTEGRITY_PLATFORM_KEYRING + depends on PPC_SECURE_BOOT + help + Enable loading of keys to the .platform keyring and blacklisted + hashes to the .blacklist keyring for powerpc based platforms. 
+ config INTEGRITY_AUDIT bool "Enables integrity auditing support " depends on AUDIT diff --git a/security/integrity/Makefile b/security/integrity/Makefile index 351c9662994b..7ee39d66cf16 100644 --- a/security/integrity/Makefile +++ b/security/integrity/Makefile @@ -14,6 +14,8 @@ integrity-$(CONFIG_LOAD_UEFI_KEYS) += platform_certs/efi_parser.o \ platform_certs/load_uefi.o \ platform_certs/keyring_handler.o integrity-$(CONFIG_LOAD_IPL_KEYS) += platform_certs/load_ipl_s390.o - +integrity-$(CONFIG_LOAD_PPC_KEYS) += platform_certs/efi_parser.o \ + platform_certs/load_powerpc.o \ + platform_certs/keyring_handler.o obj-$(CONFIG_IMA) += ima/ obj-$(CONFIG_EVM) += evm/ diff --git a/security/integrity/platform_certs/load_powerpc.c b/security/integrity/platform_certs/load_powerpc.c new file mode 100644 index ..83d99cde5376 --- /dev/null +++ b/security/integrity/platform_certs/load_powerpc.c @@ -0,0 +1,86 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2019 IBM Corporation + * Author: Nayna Jain + * + * - loads keys and hashes stored and controlled by the firmware. + */ +#include +#include +#include +#include +#include +#include +#include +#include "keyring_handler.h" + +/* + * Get a certificate list blob from the named secure variable. + */ +static __init void *get_cert_list(u8 *key, unsigned long keylen, uint64_t *size) +{ + int rc; + void *db; + + rc = secvar_ops->get(key, keylen, NULL, size); + if (rc) { + pr_err("Couldn't get size: %d\n", rc); + return NULL; + } + + db = kmalloc(*size, GFP_KERNEL); + if (!db) + return NULL; + + rc = secvar_ops->get(key, keylen, db, size); + if (rc) { + kfree(db); + pr_err("Error reading db var: %d\n", rc); + return NULL; + } + + return db; +} + +/* + * Load the certs contained in the keys databases into the platform trusted + * keyring and the blacklisted X.509 cert SHA256 hashes into the blacklist + * keyring. 
+ */
+static int __init load_powerpc_certs(void)
+{
+	void *db = NULL, *dbx = NULL;
+	uint64_t dbsize = 0, dbxsize = 0;
+	int rc = 0;
+
+	if (!secvar_ops)
+		return -ENODEV;
+
+	/* Get db, and dbx.  They might not exist, so it isn't
+	 * an error if we can't get them.
+	 */
+	db = get_cert_list("db", 3, &dbsize);
+	if (!db) {
+		pr_err("Couldn't get db list from firmware\n");
+	} else {
+		rc = parse_efi_signature_list("powerpc:db", db, dbsize,
+					      get_handler_for_db);
+		if (rc)
+			pr_err("Couldn't parse db signatures: %d\n", rc);
+		kfree(db);
+	}
+
+	dbx = get_cert_list("dbx", 3, &dbxsize);
+	if (!dbx) {
+		pr_info("Couldn't get dbx list from firmware\n");
+	} else {
+		rc =
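The get_cert_list() helper in this patch uses a two-call pattern: ask for the size with a NULL buffer, allocate, then fetch the data. A userspace sketch of that pattern follows. It is an illustration only: sketch_get_fn, sketch_mock_get() and sketch_get_blob() are hypothetical names, the mock backend stands in for firmware, and malloc()/free() replace kmalloc()/kfree().

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical getter type, shaped like the get() callback in this series. */
typedef int (*sketch_get_fn)(const char *key, uint64_t key_len,
			     uint8_t *data, uint64_t *data_size);

/* Toy backend: a single variable named "db" holding four bytes. */
static int sketch_mock_get(const char *key, uint64_t key_len,
			   uint8_t *data, uint64_t *data_size)
{
	static const uint8_t value[] = { 0xde, 0xad, 0xbe, 0xef };

	if (key_len != 3 || memcmp(key, "db", 3) != 0)
		return -1;		/* unknown variable */
	if (data) {
		if (*data_size < sizeof(value))
			return -1;	/* caller's buffer is too small */
		memcpy(data, value, sizeof(value));
	}
	*data_size = sizeof(value);
	return 0;
}

/* Query the size with a NULL buffer, allocate, then fetch the data. */
static uint8_t *sketch_get_blob(sketch_get_fn get, const char *key,
				uint64_t key_len, uint64_t *size)
{
	uint8_t *buf;

	if (get(key, key_len, NULL, size))
		return NULL;

	buf = malloc(*size);
	if (!buf)
		return NULL;

	if (get(key, key_len, buf, size)) {
		free(buf);
		return NULL;
	}

	return buf;	/* caller frees */
}
```

As in the patch, the key length passed for "db" is 3, so it counts the terminating NUL.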
[PATCH v5 3/4] x86/efi: move common keyring handler functions to new file
The handlers to add the keys to the .platform keyring and blacklisted hashes to the .blacklist keyring is common for both the uefi and powerpc mechanisms of loading the keys/hashes from the firmware. This patch moves the common code from load_uefi.c to keyring_handler.c Signed-off-by: Nayna Jain Acked-by: Mimi Zohar --- security/integrity/Makefile | 3 +- .../platform_certs/keyring_handler.c | 80 +++ .../platform_certs/keyring_handler.h | 32 security/integrity/platform_certs/load_uefi.c | 67 +--- 4 files changed, 115 insertions(+), 67 deletions(-) create mode 100644 security/integrity/platform_certs/keyring_handler.c create mode 100644 security/integrity/platform_certs/keyring_handler.h diff --git a/security/integrity/Makefile b/security/integrity/Makefile index 35e6ca773734..351c9662994b 100644 --- a/security/integrity/Makefile +++ b/security/integrity/Makefile @@ -11,7 +11,8 @@ integrity-$(CONFIG_INTEGRITY_SIGNATURE) += digsig.o integrity-$(CONFIG_INTEGRITY_ASYMMETRIC_KEYS) += digsig_asymmetric.o integrity-$(CONFIG_INTEGRITY_PLATFORM_KEYRING) += platform_certs/platform_keyring.o integrity-$(CONFIG_LOAD_UEFI_KEYS) += platform_certs/efi_parser.o \ - platform_certs/load_uefi.o + platform_certs/load_uefi.o \ + platform_certs/keyring_handler.o integrity-$(CONFIG_LOAD_IPL_KEYS) += platform_certs/load_ipl_s390.o obj-$(CONFIG_IMA) += ima/ diff --git a/security/integrity/platform_certs/keyring_handler.c b/security/integrity/platform_certs/keyring_handler.c new file mode 100644 index ..c5ba695c10e3 --- /dev/null +++ b/security/integrity/platform_certs/keyring_handler.c @@ -0,0 +1,80 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include +#include +#include +#include +#include +#include +#include +#include +#include "../integrity.h" + +static efi_guid_t efi_cert_x509_guid __initdata = EFI_CERT_X509_GUID; +static efi_guid_t efi_cert_x509_sha256_guid __initdata = + EFI_CERT_X509_SHA256_GUID; +static efi_guid_t efi_cert_sha256_guid __initdata = EFI_CERT_SHA256_GUID; + +/* + * 
Blacklist a hash. + */ +static __init void uefi_blacklist_hash(const char *source, const void *data, + size_t len, const char *type, + size_t type_len) +{ + char *hash, *p; + + hash = kmalloc(type_len + len * 2 + 1, GFP_KERNEL); + if (!hash) + return; + p = memcpy(hash, type, type_len); + p += type_len; + bin2hex(p, data, len); + p += len * 2; + *p = 0; + + mark_hash_blacklisted(hash); + kfree(hash); +} + +/* + * Blacklist an X509 TBS hash. + */ +static __init void uefi_blacklist_x509_tbs(const char *source, + const void *data, size_t len) +{ + uefi_blacklist_hash(source, data, len, "tbs:", 4); +} + +/* + * Blacklist the hash of an executable. + */ +static __init void uefi_blacklist_binary(const char *source, +const void *data, size_t len) +{ + uefi_blacklist_hash(source, data, len, "bin:", 4); +} + +/* + * Return the appropriate handler for particular signature list types found in + * the UEFI db and MokListRT tables. + */ +__init efi_element_handler_t get_handler_for_db(const efi_guid_t *sig_type) +{ + if (efi_guidcmp(*sig_type, efi_cert_x509_guid) == 0) + return add_to_platform_keyring; + return 0; +} + +/* + * Return the appropriate handler for particular signature list types found in + * the UEFI dbx and MokListXRT tables. 
+ */ +__init efi_element_handler_t get_handler_for_dbx(const efi_guid_t *sig_type) +{ + if (efi_guidcmp(*sig_type, efi_cert_x509_sha256_guid) == 0) + return uefi_blacklist_x509_tbs; + if (efi_guidcmp(*sig_type, efi_cert_sha256_guid) == 0) + return uefi_blacklist_binary; + return 0; +} diff --git a/security/integrity/platform_certs/keyring_handler.h b/security/integrity/platform_certs/keyring_handler.h new file mode 100644 index ..2462bfa08fe3 --- /dev/null +++ b/security/integrity/platform_certs/keyring_handler.h @@ -0,0 +1,32 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#ifndef PLATFORM_CERTS_INTERNAL_H +#define PLATFORM_CERTS_INTERNAL_H + +#include + +void blacklist_hash(const char *source, const void *data, + size_t len, const char *type, + size_t type_len); + +/* + * Blacklist an X509 TBS hash. + */ +void blacklist_x509_tbs(const char *source, const void *data, size_t len); + +/* + * Blacklist the hash of an executable. + */ +void blacklist_binary(const char *source, const void *data, size_t len); + +/* + * Return the handler for particular signature list types found in the db. + */ +efi_element_handler_t
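The string that uefi_blacklist_hash() hands to mark_hash_blacklisted() is the type prefix ("tbs:" or "bin:") followed by the hash in hex. A userspace sketch of that construction is below, assuming lowercase hex digits as produced by the kernel's bin2hex(); sketch_blacklist_entry() is a hypothetical name and uses malloc() where the patch uses kmalloc().

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: build "<type><lowercase hex of data>" in a freshly
 * allocated buffer of type_len + len * 2 + 1 bytes; the caller frees it.
 */
static char *sketch_blacklist_entry(const char *type, const uint8_t *data,
				    size_t len)
{
	static const char hex[] = "0123456789abcdef";
	size_t type_len = strlen(type);
	char *out = malloc(type_len + len * 2 + 1);
	char *p;
	size_t i;

	if (!out)
		return NULL;

	memcpy(out, type, type_len);
	p = out + type_len;
	for (i = 0; i < len; i++) {
		*p++ = hex[data[i] >> 4];	/* high nibble */
		*p++ = hex[data[i] & 0x0f];	/* low nibble */
	}
	*p = '\0';

	return out;
}
```

The buffer size matches the kmalloc() call in the patch: prefix length, two hex characters per byte, and one byte for the terminator.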
[PATCH v5 2/4] powerpc: expose secure variables to userspace via sysfs
PowerNV secure variables, which store the keys used for OS kernel verification, are managed by the firmware. These secure variables need to be accessed by the userspace for addition/deletion of the certificates. This patch adds the sysfs interface to expose secure variables for PowerNV secureboot. The users shall use this interface for manipulating the keys stored in the secure variables. Signed-off-by: Nayna Jain Reviewed-by: Greg Kroah-Hartman --- Documentation/ABI/testing/sysfs-secvar | 39 + arch/powerpc/Kconfig | 11 ++ arch/powerpc/kernel/Makefile | 1 + arch/powerpc/kernel/secvar-sysfs.c | 228 + 4 files changed, 279 insertions(+) create mode 100644 Documentation/ABI/testing/sysfs-secvar create mode 100644 arch/powerpc/kernel/secvar-sysfs.c diff --git a/Documentation/ABI/testing/sysfs-secvar b/Documentation/ABI/testing/sysfs-secvar new file mode 100644 index ..bc0bedf2b662 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-secvar @@ -0,0 +1,39 @@ +What: /sys/firmware/secvar +Date: August 2019 +Contact: Nayna Jain +Description: This directory is created if the POWER firmware supports OS + secureboot, thereby secure variables. It exposes interface + for reading/writing the secure variables + +What: /sys/firmware/secvar/vars +Date: August 2019 +Contact: Nayna Jain +Description: This directory lists all the secure variables that are supported + by the firmware. + +What: /sys/firmware/secvar/vars/ +Date: August 2019 +Contact: Nayna Jain +Description: Each secure variable is represented as a directory named as + . The variable name is unique and is in ASCII + representation. The data and size can be determined by reading + their respective attribute files. + +What: /sys/firmware/secvar/vars//size +Date: August 2019 +Contact: Nayna Jain +Description: An integer representation of the size of the content of the + variable. In other words, it represents the size of the data. 
+ +What: /sys/firmware/secvar/vars//data +Date: August 2019 +Contact: Nayna Jain h +Description: A read-only file containing the value of the variable. The size + of the file represents the maximum size of the variable data. + +What: /sys/firmware/secvar/vars//update +Date: August 2019 +Contact: Nayna Jain +Description: A write-only file that is used to submit the new value for the + variable. The size of the file represents the maximum size of + the variable data that can be written. diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index c795039bdc73..949e747bc8c2 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -945,6 +945,17 @@ config PPC_SECURE_BOOT to enable OS secure boot on systems that have firmware support for it. If in doubt say N. +config PPC_SECVAR_SYSFS + tristate "Enable sysfs interface for POWER secure variables" + default y + depends on PPC_SECURE_BOOT + depends on SYSFS + help + POWER secure variables are managed and controlled by firmware. + These variables are exposed to userspace via sysfs to enable + read/write operations on these variables. Say Y if you have + secure boot enabled and want to expose variables to userspace. 
+
 endmenu
 
 config ISA_DMA_API
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 3cf26427334f..b216e9f316ee 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -162,6 +162,7 @@ obj-y				+= ucall.o
 endif
 
 obj-$(CONFIG_PPC_SECURE_BOOT)	+= secure_boot.o ima_arch.o secvar-ops.o
+obj-$(CONFIG_PPC_SECVAR_SYSFS)	+= secvar-sysfs.o
 
 # Disable GCOV, KCOV & sanitizers in odd or sensitive code
 GCOV_PROFILE_prom_init.o := n
diff --git a/arch/powerpc/kernel/secvar-sysfs.c b/arch/powerpc/kernel/secvar-sysfs.c
new file mode 100644
index 000000000000..f0c4950649e0
--- /dev/null
+++ b/arch/powerpc/kernel/secvar-sysfs.c
@@ -0,0 +1,228 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2019 IBM Corporation
+ *
+ * This code exposes secure variables to user via sysfs
+ */
+
+#define pr_fmt(fmt) "secvar-sysfs: "fmt
+
+#include
+#include
+#include
+#include
+#include
+#include
+
+#define NAME_MAX_SIZE	   1024
+
+static struct kobject *secvar_kobj;
+static struct kset *secvar_kset;
+
+static ssize_t size_show(struct kobject *kobj, struct kobj_attribute *attr,
+			 char *buf)
+{
+	uint64_t dsize;
+	int rc;
+
+	rc = secvar_ops->get(kobj->name, strlen(kobj->name) + 1, NULL, &dsize);
+	if (rc) {
+		pr_err("Error retrieving variable size %d\n", rc);
+		return rc;
+
[PATCH v5 1/4] powerpc/powernv: Add OPAL API interface to access secure variable
The X.509 certificates trusted by the platform and required to secure boot the OS kernel are wrapped in secure variables, which are controlled by OPAL. This patch adds firmware/kernel interface to read and write OPAL secure variables based on the unique key. This support can be enabled using CONFIG_OPAL_SECVAR. Signed-off-by: Claudio Carvalho Signed-off-by: Nayna Jain --- arch/powerpc/include/asm/opal-api.h | 5 +- arch/powerpc/include/asm/opal.h | 7 + arch/powerpc/include/asm/secvar.h| 35 + arch/powerpc/kernel/Makefile | 2 +- arch/powerpc/kernel/secvar-ops.c | 16 +++ arch/powerpc/platforms/powernv/Makefile | 2 +- arch/powerpc/platforms/powernv/opal-call.c | 3 + arch/powerpc/platforms/powernv/opal-secvar.c | 140 +++ arch/powerpc/platforms/powernv/opal.c| 3 + 9 files changed, 210 insertions(+), 3 deletions(-) create mode 100644 arch/powerpc/include/asm/secvar.h create mode 100644 arch/powerpc/kernel/secvar-ops.c create mode 100644 arch/powerpc/platforms/powernv/opal-secvar.c diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h index 378e3997845a..c1f25a760eb1 100644 --- a/arch/powerpc/include/asm/opal-api.h +++ b/arch/powerpc/include/asm/opal-api.h @@ -211,7 +211,10 @@ #define OPAL_MPIPL_UPDATE 173 #define OPAL_MPIPL_REGISTER_TAG174 #define OPAL_MPIPL_QUERY_TAG 175 -#define OPAL_LAST 175 +#define OPAL_SECVAR_GET176 +#define OPAL_SECVAR_GET_NEXT 177 +#define OPAL_SECVAR_ENQUEUE_UPDATE 178 +#define OPAL_LAST 178 #define QUIESCE_HOLD 1 /* Spin all calls at entry */ #define QUIESCE_REJECT 2 /* Fail all calls with OPAL_BUSY */ diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h index a0cf8fba4d12..9986ac34b8e2 100644 --- a/arch/powerpc/include/asm/opal.h +++ b/arch/powerpc/include/asm/opal.h @@ -298,6 +298,13 @@ int opal_sensor_group_clear(u32 group_hndl, int token); int opal_sensor_group_enable(u32 group_hndl, int token, bool enable); int opal_nx_coproc_init(uint32_t chip_id, uint32_t ct); +int 
opal_secvar_get(const char *key, uint64_t key_len, u8 *data, + uint64_t *data_size); +int opal_secvar_get_next(const char *key, uint64_t *key_len, +uint64_t key_buf_size); +int opal_secvar_enqueue_update(const char *key, uint64_t key_len, u8 *data, + uint64_t data_size); + s64 opal_mpipl_update(enum opal_mpipl_ops op, u64 src, u64 dest, u64 size); s64 opal_mpipl_register_tag(enum opal_mpipl_tags tag, u64 addr); s64 opal_mpipl_query_tag(enum opal_mpipl_tags tag, u64 *addr); diff --git a/arch/powerpc/include/asm/secvar.h b/arch/powerpc/include/asm/secvar.h new file mode 100644 index ..4cc35b58b986 --- /dev/null +++ b/arch/powerpc/include/asm/secvar.h @@ -0,0 +1,35 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (C) 2019 IBM Corporation + * Author: Nayna Jain + * + * PowerPC secure variable operations. + */ +#ifndef SECVAR_OPS_H +#define SECVAR_OPS_H + +#include +#include + +extern const struct secvar_operations *secvar_ops; + +struct secvar_operations { + int (*get)(const char *key, uint64_t key_len, u8 *data, + uint64_t *data_size); + int (*get_next)(const char *key, uint64_t *key_len, + uint64_t keybufsize); + int (*set)(const char *key, uint64_t key_len, u8 *data, + uint64_t data_size); +}; + +#ifdef CONFIG_PPC_SECURE_BOOT + +extern void set_secvar_ops(const struct secvar_operations *ops); + +#else + +static inline void set_secvar_ops(const struct secvar_operations *ops) { } + +#endif + +#endif diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile index e8eb2955b7d5..3cf26427334f 100644 --- a/arch/powerpc/kernel/Makefile +++ b/arch/powerpc/kernel/Makefile @@ -161,7 +161,7 @@ ifneq ($(CONFIG_PPC_POWERNV)$(CONFIG_PPC_SVM),) obj-y += ucall.o endif -obj-$(CONFIG_PPC_SECURE_BOOT) += secure_boot.o ima_arch.o +obj-$(CONFIG_PPC_SECURE_BOOT) += secure_boot.o ima_arch.o secvar-ops.o # Disable GCOV, KCOV & sanitizers in odd or sensitive code GCOV_PROFILE_prom_init.o := n diff --git a/arch/powerpc/kernel/secvar-ops.c 
b/arch/powerpc/kernel/secvar-ops.c new file mode 100644 index ..4cfa7dbd8850 --- /dev/null +++ b/arch/powerpc/kernel/secvar-ops.c @@ -0,0 +1,16 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2019 IBM Corporation + * Author: Nayna Jain + * + * This file initializes secvar operations for PowerPC Secureboot + */ + +#include + +const struct secvar_operations *secvar_ops; + +void set_secvar_ops(const struct secvar_operations *ops) +{ + secvar_ops = ops;
[PATCH v5 0/4] powerpc: expose secure variables to the kernel and userspace
In order to verify the OS kernel on PowerNV systems, secure boot requires X.509 certificates trusted by the platform. These are stored in secure variables controlled by OPAL, called OPAL secure variables. In order to enable users to manage the keys, the secure variables need to be exposed to userspace. OPAL provides the runtime services for the kernel to be able to access the secure variables[1]. This patchset defines the kernel interface for the OPAL APIs. These APIs are used by the hooks, which load these variables to the keyring and expose them to the userspace for reading/writing. The previous version[2] of the patchset added support only for the sysfs interface. This patch adds two more patches that involves loading of the firmware trusted keys to the kernel keyring. Overall, this patchset adds the following support: * expose secure variables to the kernel via OPAL Runtime API interface * expose secure variables to the userspace via kernel sysfs interface * load kernel verification and revocation keys to .platform and .blacklist keyring respectively. The secure variables can be read/written using simple linux utilities cat/hexdump. For example: Path to the secure variables is: /sys/firmware/secvar/vars Each secure variable is listed as directory. $ ls -l total 0 drwxr-xr-x. 2 root root 0 Aug 20 21:20 db drwxr-xr-x. 2 root root 0 Aug 20 21:20 KEK drwxr-xr-x. 2 root root 0 Aug 20 21:20 PK The attributes of each of the secure variables are(for example: PK): [db]$ ls -l total 0 -r--r--r--. 1 root root 4096 Oct 1 15:10 data -r--r--r--. 1 root root 65536 Oct 1 15:10 size --w---. 1 root root 4096 Oct 1 15:12 update The "data" is used to read the existing variable value using hexdump. The data is stored in ESL format. The "update" is used to write a new value using cat. The update is to be submitted as AUTH file. [1] Depends on skiboot OPAL API changes which removes metadata from the API. https://lists.ozlabs.org/pipermail/skiboot/2019-September/015203.html. 
[2] https://lkml.org/lkml/2019/6/13/1644 Changelog: v5: * rebased to v5.4-rc3 * includes Oliver's feedbacks * changed OPAL API as platform driver * sysfs are made default enabled and dependent on PPC_SECURE_BOOT * fixed code specific changes in both OPAL API and sysfs * reading size of the "data" and "update" file from device-tree. * fixed sysfs documentation to also reflect the data and update file size interpretation * This patchset is no more dependent on ima-arch/blacklist patchset v4: * rebased to v5.4-rc1 * uses __BIN_ATTR_WO macro to create binary attribute as suggested by Greg * removed email id from the file header * renamed argument keysize to keybufsize in get_next() function * updated default binary file sizes to 0, as firmware handles checking against the maximum size * fixed minor formatting issues in Patch 4/4 * added Greg's and Mimi's Reviewed-by and Ack-by v3: * includes Greg's feedbacks: * fixes in Patch 2/4 * updates the Documentation. * fixes code feedbacks * adds SYSFS Kconfig dependency for SECVAR_SYSFS * fixes mixed tabs and spaces * removes "name" attribute for each of the variable name based directories * fixes using __ATTR_RO() and __BIN_ATTR_RO() and statics and const * fixes the racing issue by using kobj_type default groups. Also, fixes the kobject leakage. * removes extra print messages * updates patch description for Patch 3/4 * removes file name from Patch 4/4 file header comment and removed def_bool y from the LOAD_PPC_KEYS Kconfig * includes Oliver's feedbacks: * fixes Patch 1/2 * moves OPAL API wrappers after opal_nx_proc_init(), fixed the naming, types and removed extern. 
* fixes spaces * renames get_variable() to get(), get_next_variable() to get_next() and set_variable() to set() * removed get_secvar_ops() and defined secvar_ops as global * fixes consts and statics * removes generic secvar_init() and defined platform specific opal_secar_init() * updates opal_secvar_supported() to check for secvar support even before checking the OPAL APIs support and also fixed the error codes. * addes function that converts OPAL return codes to linux errno * moves secvar check support in the opal_secvar_init() and defined its prototype in opal.h * fixes Patch 2/2 * fixes static/const * defines macro for max name size * replaces OPAL error codes with linux errno and also updated error handling * moves secvar support check before creating sysfs kobjects in secvar_sysfs_init() * fixes spaces v2: * removes complete efi-sms from the sysfs implementation and is simplified * includes Greg's and Oliver's feedbacks: * adds sysfs documentation * moves sysfs code to arch/powerpc * other code related feedbacks. * adds two new patches to load keys to .platform and .blacklist keyring. These patches are added to this series as they are also dependent on OPAL APIs. Nayna Jain (4):
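For readers who want to script against the sysfs layout described in this cover letter (a directory per variable under /sys/firmware/secvar/vars, each with data, size and update attributes), building the attribute path can be sketched as below. This is a sketch only: sketch_secvar_path() is a hypothetical helper and nothing in the series provides it.

```c
#include <stddef.h>
#include <stdio.h>

/* Base directory as given in the cover letter. */
#define SKETCH_SECVAR_DIR "/sys/firmware/secvar/vars"

/*
 * Hypothetical helper: write "<dir>/<var>/<attr>" into buf.
 * Returns 0 on success, -1 if the result does not fit in len bytes.
 */
static int sketch_secvar_path(char *buf, size_t len, const char *var,
			      const char *attr)
{
	int n = snprintf(buf, len, "%s/%s/%s", SKETCH_SECVAR_DIR, var, attr);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

The resulting path for variable "PK" and attribute "size" could then be opened with fopen() and parsed as an integer, on a system where the firmware actually exposes that variable.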
Re: [PATCH] powerpc/boot: Fix the initrd being overwritten under qemu
On 25/10/2019 04:45, Segher Boessenkool wrote:
> On Thu, Oct 24, 2019 at 12:31:24PM +1100, Alexey Kardashevskiy wrote:
>>
>>
>> On 23/10/2019 22:21, Segher Boessenkool wrote:
>>> On Wed, Oct 23, 2019 at 12:36:35PM +1100, Oliver O'Halloran wrote:
>>>> When booting under OF the zImage expects the initrd address and size to be
>>>> passed to it using registers r3 and r4. SLOF (guest firmware used by QEMU)
>>>> currently doesn't do this so the zImage is not aware of the initrd
>>>> location. This can result in initrd corruption either through the zImage
>>>> extracting the vmlinux over the initrd, or by the vmlinux overwriting the
>>>> initrd when relocating itself.
>>>>
>>>> QEMU does put the linux,initrd-start and linux,initrd-end properties into
>>>> the devicetree for vmlinux to find the initrd. We can work around the SLOF
>>>> bug by also looking for those properties in the zImage.
>>>
>>> This is not a bug. What boot protocol requires passing the initrd start
>>> and size in GPR3, GPR4?
>>
>> So far I was unable to identify it...
>
> Maybe this comes from yaboot?
> https://git.ozlabs.org/?p=yaboot.git;a=blob;f=second/yaboot.c;h=9b66ab44e1be0ee82b88e386a5d0358428766e73;hb=HEAD#l1186

I asked around, a "common practice" was the response :) It's been like this
for ages and it did not come from any OF/PPC binding.

It was also noted that we do not use zImage right - the whole idea was that it
is a single binary blob with vmlinux _and_ initramdisk to point OF at, as at
the time it could only deal with single blobs. So having separate zImage and
initrd is out of zImage design scope (some disagreed here).

>>> The CHRP binding (what SLOF implements) requires passing two zeroes here.
>>> And ePAPR requires passing the address of a device tree and a zero, plus
>>> something in GPR6 to allow distinguishing what it does.
>>>
>>> As Alexey says, initramfs works just fine, so please use that? initrd was
>>> deprecated when this code was written already.
>>
>> I did not say about anything working fine :)
>
> Yeah, I read that from your words, wrong it seems. Sorry. I often used
> INITRAMFS_SOURCE for kernels for use with SLOF, it's just so convenient.
>
>> In my case I was using a new QEMU which does full FDT on client-arch-support
>> and that thing would put the original linux,initrd-start/end to the FDT even
>> though the initrd was unpacked and the properties were changed in SLOF. With
>> that fixed, this is an alternative fix for SLOF but I am not pushing it out
>> as I have no idea about the bindings and this also breaks "vmlinux".
>>
>>
>> diff --git a/slof/fs/client.fs b/slof/fs/client.fs
>> index 8a7f6ac4326d..138177e4c2a3 100644
>> --- a/slof/fs/client.fs
>> +++ b/slof/fs/client.fs
>> @@ -45,6 +45,17 @@ VARIABLE client-callback \ Address of client's callback function
>>    >r ciregs >r7 ! ciregs >r6 ! client-entry-point @ ciregs >r5 !
>>    \ Initialise client-stack-pointer
>>    cistack ciregs >r1 !
>> +
>> +  s" linux,initrd-end" get-chosen IF decode-int -rot 2drop ELSE 0 THEN
>> +  s" linux,initrd-start" get-chosen IF decode-int -rot 2drop ELSE 0 THEN
>> +  2dup - dup IF
>> +    ciregs >r4 !
>> +    ciregs >r3 !
>> +    drop
>> +  ELSE
>> +    3drop
>> +  THEN
>
> Something like that should work fine. Do it in go-32 and go-64 though?
> Or is that the wrong spot?

Nah, I was trying a different initramdisk which complained about my test
kernel being too old; after fixing that, it works. I'll post a patch.

Thanks,

-- 
Alexey
Re: [PATCH 0/7] towards QE support on ARM
On Tue, Oct 22, 2019 at 9:54 PM Qiang Zhao wrote: > > On 22/10/2019 18:18, Rasmus Villemoes wrote: > > -Original Message- > > From: Rasmus Villemoes > > Sent: 2019年10月22日 18:18 > > To: Qiang Zhao ; Leo Li > > Cc: Timur Tabi ; Greg Kroah-Hartman > > ; linux-ker...@vger.kernel.org; > > linux-ser...@vger.kernel.org; Jiri Slaby ; > > linuxppc-dev@lists.ozlabs.org; linux-arm-ker...@lists.infradead.org > > Subject: Re: [PATCH 0/7] towards QE support on ARM > > > > On 22/10/2019 04.24, Qiang Zhao wrote: > > > On Mon, Oct 22, 2019 at 6:11 AM Leo Li wrote > > > > >> Right. I'm really interested in getting this applied to my tree and > > >> make it upstream. Zhao Qiang, can you help to review Rasmus's > > >> patches and comment? > > > > > > As you know, I maintained a similar patchset removing PPC, and someone > > told me qe_ic should moved into drivers/irqchip/. > > > I also thought qe_ic is a interrupt control driver, should be moved into > > > dir > > irqchip. > > > > Yes, and I also plan to do that at some point. However, that's orthogonal to > > making the driver build on ARM, so I don't want to mix the two. Making it > > usable on ARM is my/our priority currently. > > > > I'd appreciate your input on my patches. > > Yes, we can put this patchset in first place, ensure it can build and work on > ARM, then push another patchset to move qe_ic. Right. I would only accept a patch series that can really build and work on ARM. At least the current out-of-tree patches can make it work on ARM. If we accept partial changes, there is no way to make it work on the latest kernel on ARM then. Regards, Leo
Re: [PATCH v9 7/8] ima: check against blacklisted hashes for files with modsig
On 10/23/2019 8:47 PM, Nayna Jain wrote:

+/*
+ * ima_check_blacklist - determine if the binary is blacklisted.
+ *
+ * Add the hash of the blacklisted binary to the measurement list, based
+ * on policy.
+ *
+ * Returns -EPERM if the hash is blacklisted.
+ */
+int ima_check_blacklist(struct integrity_iint_cache *iint,
+			const struct modsig *modsig, int pcr)
+{
+	enum hash_algo hash_algo;
+	const u8 *digest = NULL;
+	u32 digestsize = 0;
+	int rc = 0;
+
+	if (!(iint->flags & IMA_CHECK_BLACKLIST))
+		return 0;
+
+	if (iint->flags & IMA_MODSIG_ALLOWED && modsig) {
+		ima_get_modsig_digest(modsig, &hash_algo, &digest, &digestsize);
+
+		rc = is_binary_blacklisted(digest, digestsize);
+		if ((rc == -EPERM) && (iint->flags & IMA_MEASURE))
+			process_buffer_measurement(digest, digestsize,
+						   "blacklisted-hash", NONE,
+						   pcr);
+	}

The enum value "NONE" is being passed to process_buffer_measurement to indicate that the check for required action based on ima policy is already done by ima_check_blacklist. Not sure, but this can cause confusion in the future when someone updates process_buffer_measurement.

Would it instead be better to add another parameter to process_buffer_measurement to indicate the above condition?

-lakshmi
Re: [PATCH 1/2] asm-generic: Make msi.h a mandatory include/asm header
On Thu, 24 Oct 2019, Michal Simek wrote: > msi.h is generic for all architectures expect of x86 which has own version. > Enabling MSI by including msi.h to architecture Kbuild is just additional > step which doesn't need to be done. > The patch was created based on request to enable MSI for Microblaze. > > Suggested-by: Christoph Hellwig > Signed-off-by: Michal Simek > --- > > https://lore.kernel.org/linux-riscv/20191008154604.ga7...@infradead.org/ [ ... ] > diff --git a/arch/riscv/include/asm/Kbuild b/arch/riscv/include/asm/Kbuild > index 16970f246860..1efaeddf1e4b 100644 > --- a/arch/riscv/include/asm/Kbuild > +++ b/arch/riscv/include/asm/Kbuild > @@ -22,7 +22,6 @@ generic-y += kvm_para.h > generic-y += local.h > generic-y += local64.h > generic-y += mm-arch-hooks.h > -generic-y += msi.h > generic-y += percpu.h > generic-y += preempt.h > generic-y += sections.h Acked-by: Paul Walmsley # arch/riscv Tested-by: Paul Walmsley # build only, rv32/rv64 Thanks Michał, - Paul
Re: [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
On 23.10.19 09:26, David Hildenbrand wrote: On 22.10.19 23:54, Dan Williams wrote: Hi David, Thanks for tackling this! Thanks for having a look :) [...] I am probably a little bit too careful (but I don't want to break things). In most places (besides KVM and vfio that are nuts), the pfn_to_online_page() check could most probably be avoided by a is_zone_device_page() check. However, I usually get suspicious when I see a pfn_valid() check (especially after I learned that people mmap parts of /dev/mem into user space, including memory without memmaps. Also, people could memmap offline memory blocks this way :/). As long as this does not hurt performance, I think we should rather do it the clean way. I'm concerned about using is_zone_device_page() in places that are not known to already have a reference to the page. Here's an audit of current usages, and the ones I think need to cleaned up. The "unsafe" ones do not appear to have any protections against the device page being removed (get_dev_pagemap()). Yes, some of these were added by me. The "unsafe? HMM" ones need HMM eyes because HMM leaks device pages into anonymous memory paths and I'm not up to speed on how it guarantees 'struct page' validity vs device shutdown without using get_dev_pagemap(). smaps_pmd_entry(): unsafe put_devmap_managed_page(): safe, page reference is held is_device_private_page(): safe? gpu driver manages private page lifetime is_pci_p2pdma_page(): safe, page reference is held uncharge_page(): unsafe? HMM add_to_kill(): safe, protected by get_dev_pagemap() and dax_lock_page() soft_offline_page(): unsafe remove_migration_pte(): unsafe? HMM move_to_new_page(): unsafe? HMM migrate_vma_pages() and helpers: unsafe? HMM try_to_unmap_one(): unsafe? HMM __put_page(): safe release_pages(): safe I'm hoping all the HMM ones can be converted to is_device_private_page() directlly and have that routine grow a nice comment about how it knows it can always safely de-reference its @page argument. 
For the rest I'd like to propose that we add a facility to determine ZONE_DEVICE by pfn rather than page. The most straightforward why I can think of would be to just add another bitmap to mem_section_usage to indicate if a subsection is ZONE_DEVICE or not. (it's a somewhat unrelated bigger discussion, but we can start discussing it in this thread) I dislike this for three reasons a) It does not protect against any races, really, it does not improve things. b) We do have the exact same problem with pfn_to_online_page(). As long as we don't hold the memory hotplug lock, memory can get offlined and remove any time. Racy. c) We mix in ZONE specific stuff into the core. It should be "just another zone" What I propose instead (already discussed in https://lkml.org/lkml/2019/10/10/87) 1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE 2. Convert SECTION_IS_ACTIVE to a subsection bitmap 3. Introduce pfn_active() that checks against the subsection bitmap 4. Once the memmap was initialized / prepared, set the subsection active (similar to SECTION_IS_ONLINE in the buddy right now) 5. Before the memmap gets invalidated, set the subsection inactive (similar to SECTION_IS_ONLINE in the buddy right now) 5. pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE 6. pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE Dan, I am suspecting that you want a pfn_to_zone() that will not touch the memmap, because it could potentially (altmap) lie on slow memory, right? A modification might make this possible (but I am not yet sure if we want a less generic MM implementation just to fine tune slow memmap access here) 1. Keep SECTION_IS_ONLINE as it is with the same semantics 2. Introduce a subsection bitmap to record active ("initialized memmap") PFNs. E.g., also set it when setting sections online. 3. Introduce pfn_active() that checks against the subsection bitmap 4. 
Once the memmap was initialized / prepared, set the subsection active (similar to SECTION_IS_ONLINE in the buddy right now) 5. Before the memmap gets invalidated, set the subsection inactive (similar to SECTION_IS_ONLINE in the buddy right now) 5. pfn_to_online_page() = pfn_active() && section == SECTION_IS_ONLINE (or keep it as is, depends on the RCU locking we eventually implement) 6. pfn_to_device_page() = pfn_active() && section != SECTION_IS_ONLINE 7. use pfn_active() whenever we don't care about the zone. Again, not really a friend of that, it hardcodes ZONE_DEVICE vs. !ZONE_DEVICE. When we do a random "pfn_to_page()" (e.g., a pfn walker) we really want to touch the memmap right away either way. So we can also directly read the zone from it. I really do prefer right now a more generic implementation. -- Thanks, David / dhildenb
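The first proposal above (pfn_active() backed by a subsection bitmap, with the zone read from the memmap) can be sketched in userspace as follows. This is only a toy model under stated assumptions: every toy_* name, the 64-PFN address space and the 8-PFN subsection size are made up for illustration, and the sketch does nothing about the hotplug races discussed in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model, not kernel code: all names and sizes are hypothetical. */
#define TOY_NR_PFNS		64u	/* PFNs covered by the model */
#define TOY_PFNS_PER_SUBSEC	8u	/* PFNs per "subsection" */

enum toy_zone { TOY_ZONE_NORMAL, TOY_ZONE_DEVICE };

struct toy_page {
	enum toy_zone zone;
};

static struct toy_page toy_memmap[TOY_NR_PFNS];
static uint32_t toy_active_bitmap;	/* one bit per subsection */

/* Steps 4 and 5: set after the memmap is initialized, clear before it goes away. */
static void toy_set_subsection_active(unsigned long pfn, bool active)
{
	unsigned int bit;

	assert(pfn < TOY_NR_PFNS);
	bit = pfn / TOY_PFNS_PER_SUBSEC;
	if (active)
		toy_active_bitmap |= 1u << bit;
	else
		toy_active_bitmap &= ~(1u << bit);
}

/* Step 3: true only if the memmap for this PFN may be touched. */
static bool toy_pfn_active(unsigned long pfn)
{
	if (pfn >= TOY_NR_PFNS)
		return false;
	return (toy_active_bitmap & (1u << (pfn / TOY_PFNS_PER_SUBSEC))) != 0;
}

/* pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE */
static struct toy_page *toy_pfn_to_online_page(unsigned long pfn)
{
	if (!toy_pfn_active(pfn) || toy_memmap[pfn].zone == TOY_ZONE_DEVICE)
		return NULL;
	return &toy_memmap[pfn];
}

/* pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE */
static struct toy_page *toy_pfn_to_device_page(unsigned long pfn)
{
	if (!toy_pfn_active(pfn) || toy_memmap[pfn].zone != TOY_ZONE_DEVICE)
		return NULL;
	return &toy_memmap[pfn];
}
```

Note that both lookups read the zone from the memmap entry, which is exactly the access the second variant tries to avoid for memmaps that live on slow memory.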
[PATCH v1 10/10] mm/usercopy.c: Update comment in check_page_span() regarding ZONE_DEVICE
ZONE_DEVICE (a.k.a. device memory) is no longer marked PG_reserved. Update
the comment. While at it, make it match what the code is actually doing
(reject vs. accept).

Cc: Kees Cook
Cc: Andrew Morton
Cc: "Isaac J. Manjarres"
Cc: "Matthew Wilcox (Oracle)"
Cc: Qian Cai
Cc: Thomas Gleixner
Signed-off-by: David Hildenbrand
---
 mm/usercopy.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/usercopy.c b/mm/usercopy.c
index 660717a1ea5c..80f254024c97 100644
--- a/mm/usercopy.c
+++ b/mm/usercopy.c
@@ -199,9 +199,9 @@ static inline void check_page_span(const void *ptr, unsigned long n,
 		return;
 
 	/*
-	 * Reject if range is entirely either Reserved (i.e. special or
-	 * device memory), or CMA. Otherwise, reject since the object spans
-	 * several independently allocated pages.
+	 * Accept if the range is entirely either Reserved ("special") or
+	 * CMA. Otherwise, reject since the object spans several independently
+	 * allocated pages.
 	 */
 	is_reserved = PageReserved(page);
 	is_cma = is_migrate_cma_page(page);
-- 
2.21.0
[PATCH v1 09/10] mm/memory_hotplug: Don't mark pages PG_reserved when initializing the memmap
Everything should be prepared to stop setting pages PG_reserved when initializing the memmap on memory hotplug. Most importantly, we stop marking ZONE_DEVICE pages PG_reserved.

a) We made sure that any code that relied on PG_reserved to detect ZONE_DEVICE memory will no longer rely on PG_reserved (especially, by relying on pfn_to_online_page() for now). Details can be found below.
b) We made sure that memory blocks with holes cannot be offlined and therefore also not onlined. We have quite some code that relies on memory holes being marked PG_reserved. This is now not an issue anymore.

generic_online_page() still calls __free_pages_core(), which performs __ClearPageReserved(p). AFAIKS, this should not hurt.

It is worth noting that the users of online_page_callback_t might see a change. E.g., until now, pages not freed to the buddy by the HyperV balloon were set PG_reserved until freed via generic_online_page(). Now, they would look like ordinarily allocated pages (refcount == 1). This callback is used by the XEN balloon and the HyperV balloon. To not introduce any silent errors, keep marking the pages PG_reserved. We can most probably stop doing that, but have to double check if there are issues (e.g., offlining code aborts right away in has_unmovable_pages() when it runs into a PageReserved(page)).

Update the documentation at various places in the MM core.

There are three PageReserved() users that might be affected by this change.
- drivers/staging/gasket/gasket_page_table.c:gasket_release_page()
  -> We might (unlikely) set SetPageDirty() on a ZONE_DEVICE page
  -> I assume "we don't care"
- drivers/staging/kpc2000/kpc_dma/fileops.c:transfer_complete_cb()
  -> We might (unlikely) set SetPageDirty() on a ZONE_DEVICE page
  -> I assume "we don't care"
- mm/usercopy.c: check_page_span()
  -> According to Dan, non-HMM ZONE_DEVICE usage excluded this code since
     commit 52f476a323f9 ("libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY
     overhead")
  -> It is unclear whether we really cared about ZONE_DEVICE here (HMM) or
     simply about "PG_reserved". The worst thing that could happen is a
     false negative with CONFIG_HARDENED_USERCOPY we should be able to
     identify easily.
  -> There is a discussion to rip out that code completely
  -> I assume "not relevant" / "we don't care"

I audited the other PageReserved() users. They don't affect ZONE_DEVICE:
- mm/page_owner.c:pagetypeinfo_showmixedcount_print()
  -> Never called for ZONE_DEVICE, (+ pfn_to_online_page(pfn))
- mm/page_owner.c:init_pages_in_zone()
  -> Never called for ZONE_DEVICE (!populated_zone(zone))
- mm/page_ext.c:free_page_ext()
  -> Only a BUG_ON(PageReserved(page)), not relevant
- mm/page_ext.c:has_unmovable_pages()
  -> Not relevant for ZONE_DEVICE
- mm/page_ext.c:pfn_range_valid_contig()
  -> pfn_to_online_page() already guards us
- mm/mempolicy.c:queue_pages_pte_range()
  -> vm_normal_page() checks against pte_devmap()
- mm/memory-failure.c:hwpoison_user_mappings()
  -> Not reached via memory_failure() due to pfn_to_online_page()
  -> Also not reached indirectly via memory_failure_hugetlb()
- mm/hugetlb.c:gather_bootmem_prealloc()
  -> Only a WARN_ON(PageReserved(page)), not relevant
- kernel/power/snapshot.c:saveable_highmem_page()
  -> pfn_to_online_page() already guards us
- kernel/power/snapshot.c:saveable_page()
  -> pfn_to_online_page() already guards us
- fs/proc/task_mmu.c:can_gather_numa_stats()
  -> vm_normal_page() checks against pte_devmap()
- fs/proc/task_mmu.c:can_gather_numa_stats_pmd()
  -> vm_normal_page_pmd() checks against pte_devmap()
- fs/proc/page.c:stable_page_flags()
  -> The reserved bit is simply copied, irrelevant
- drivers/firmware/memmap.c:release_firmware_map_entry()
  -> really only a check to detect bootmem. Not relevant for ZONE_DEVICE
- arch/ia64/kernel/mca_drv.c
- arch/mips/mm/init.c
- arch/mips/mm/ioremap.c
- arch/nios2/mm/ioremap.c
- arch/parisc/mm/ioremap.c
- arch/sparc/mm/tlb.c
- arch/xtensa/mm/cache.c
  -> No ZONE_DEVICE support
- arch/powerpc/mm/init_64.c:vmemmap_free()
  -> Special-cases memmap on altmap
  -> Only a check for bootmem
- arch/x86/kernel/alternative.c:__text_poke()
  -> Only a WARN_ON(!PageReserved(pages[0])) to verify it is bootmem
- arch/x86/mm/init_64.c
  -> Only a check for bootmem

Cc: "K. Y. Srinivasan"
Cc: Haiyang Zhang
Cc: Stephen Hemminger
Cc: Sasha Levin
Cc: Boris Ostrovsky
Cc: Juergen Gross
Cc: Stefano Stabellini
Cc: Andrew Morton
Cc: Alexander Duyck
Cc: Pavel Tatashin
Cc: Vlastimil Babka
Cc: Johannes Weiner
Cc: Anthony Yznaga
Cc: Michal Hocko
Cc: Oscar Salvador
Cc: Dan Williams
Cc: Mel Gorman
Cc: Mike Rapoport
Cc: Anshuman Khandual
Cc: Matt Sickler
Cc: Kees Cook
Suggested-by: Michal Hocko
Signed-off-by: David Hildenbrand
---
 drivers/hv/hv_balloon.c    | 6 ++
 drivers/xen/balloon.c      | 7 +++
 include/linux/page-flags.h | 8 +---
[PATCH v1 08/10] x86/mm: Prepare __ioremap_check_ram() for PG_reserved changes
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to change that. Rewrite __ioremap_check_ram() to make sure the function produces the same result once we stop setting ZONE_DEVICE pages PG_reserved. Cc: Dave Hansen Cc: Andy Lutomirski Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: "H. Peter Anvin" Signed-off-by: David Hildenbrand --- arch/x86/mm/ioremap.c | 13 ++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c index a39dcdb5ae34..db6913b48edf 100644 --- a/arch/x86/mm/ioremap.c +++ b/arch/x86/mm/ioremap.c @@ -77,10 +77,17 @@ static unsigned int __ioremap_check_ram(struct resource *res) start_pfn = (res->start + PAGE_SIZE - 1) >> PAGE_SHIFT; stop_pfn = (res->end + 1) >> PAGE_SHIFT; if (stop_pfn > start_pfn) { - for (i = 0; i < (stop_pfn - start_pfn); ++i) - if (pfn_valid(start_pfn + i) && - !PageReserved(pfn_to_page(start_pfn + i))) + for (i = 0; i < (stop_pfn - start_pfn); ++i) { + struct page *page; +/* + * We treat any pages that are not online (not managed + * by the buddy) as not being RAM. This includes + * ZONE_DEVICE pages. + */ + page = pfn_to_online_page(start_pfn + i); + if (page && !PageReserved(page)) return IORES_MAP_SYSTEM_RAM; + } } return 0; -- 2.21.0
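The start_pfn/stop_pfn arithmetic in __ioremap_check_ram() only considers pages that lie completely inside the resource: the start is rounded up and the (inclusive) end is rounded down. A small userspace sketch of just that arithmetic, assuming 4 KiB pages; toy_full_pages() and the TOY_* constants are hypothetical names, not kernel API:

```c
#include <assert.h>
#include <stdint.h>

/* Toy constants, assuming 4 KiB pages. */
#define TOY_PAGE_SHIFT	12
#define TOY_PAGE_SIZE	(1ull << TOY_PAGE_SHIFT)

struct toy_pfn_range {
	uint64_t start_pfn;	/* first fully covered PFN */
	uint64_t stop_pfn;	/* one past the last fully covered PFN */
};

/* res_start and res_end are inclusive byte addresses, as in struct resource. */
static struct toy_pfn_range toy_full_pages(uint64_t res_start, uint64_t res_end)
{
	struct toy_pfn_range r;

	assert(res_end >= res_start);
	/* Round the start up to the next page boundary. */
	r.start_pfn = (res_start + TOY_PAGE_SIZE - 1) >> TOY_PAGE_SHIFT;
	/* Round the exclusive end down to a page boundary. */
	r.stop_pfn = (res_end + 1) >> TOY_PAGE_SHIFT;
	return r;
}
```

If stop_pfn is not greater than start_pfn, the resource contains no complete page and the loop in the patch is skipped.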
[PATCH v1 07/10] powerpc/mm: Prepare maybe_pte_to_page() for PG_reserved changes
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to change that. Rewrite maybe_pte_to_page() to make sure the function produces the same result once we stop setting ZONE_DEVICE pages PG_reserved. Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Michael Ellerman Cc: Christophe Leroy Cc: "Aneesh Kumar K.V" Cc: Allison Randal Cc: Nicholas Piggin Cc: Thomas Gleixner Signed-off-by: David Hildenbrand --- arch/powerpc/mm/pgtable.c | 10 ++ 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c index e3759b69f81b..613c98fa7dc0 100644 --- a/arch/powerpc/mm/pgtable.c +++ b/arch/powerpc/mm/pgtable.c @@ -55,10 +55,12 @@ static struct page *maybe_pte_to_page(pte_t pte) unsigned long pfn = pte_pfn(pte); struct page *page; - if (unlikely(!pfn_valid(pfn))) - return NULL; - page = pfn_to_page(pfn); - if (PageReserved(page)) + /* +* We reject any pages that are not online (not managed by the buddy). +* This includes ZONE_DEVICE pages. +*/ + page = pfn_to_online_page(pfn); + if (unlikely(!page || PageReserved(page))) return NULL; return page; } -- 2.21.0
[PATCH v1 06/10] powerpc/64s: Prepare hash_page_do_lazy_icache() for PG_reserved changes
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

Rewrite hash_page_do_lazy_icache() to make sure the function produces the
same result once we stop setting ZONE_DEVICE pages PG_reserved.

Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: Michael Ellerman
Cc: "Aneesh Kumar K.V"
Cc: Christophe Leroy
Cc: Nicholas Piggin
Cc: Andrew Morton
Cc: Mike Rapoport
Cc: YueHaibing
Signed-off-by: David Hildenbrand
---
 arch/powerpc/mm/book3s64/hash_utils.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
index 6c123760164e..a1566039e747 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -1084,13 +1084,15 @@ void hash__early_init_mmu_secondary(void)
  */
 unsigned int hash_page_do_lazy_icache(unsigned int pp, pte_t pte, int trap)
 {
-	struct page *page;
+	struct page *page = pfn_to_online_page(pte_pfn(pte));
 
-	if (!pfn_valid(pte_pfn(pte)))
+	/*
+	 * We ignore any pages that are not online (not managed by the buddy).
+	 * This includes ZONE_DEVICE pages.
+	 */
+	if (!page)
 		return pp;
 
-	page = pte_page(pte);
-
 	/* page is dirty */
 	if (!test_bit(PG_arch_1, &page->flags) && !PageReserved(page)) {
 		if (trap == 0x400) {
-- 
2.21.0
[PATCH v1 05/10] powerpc/book3s: Prepare kvmppc_book3s_instantiate_page() for PG_reserved changes
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to change that. KVM has this weird use case that you can map anything from /dev/mem into the guest. pfn_valid() is not a reliable check whether the memmap was initialized and can be touched. pfn_to_online_page() makes sure that we have an initialized memmap (and don't have ZONE_DEVICE memory). Rewrite kvmppc_book3s_instantiate_page() similar to kvm_is_reserved_pfn() to make sure the function produces the same result once we stop setting ZONE_DEVICE pages PG_reserved. Cc: Paul Mackerras Cc: Benjamin Herrenschmidt Cc: Michael Ellerman Signed-off-by: David Hildenbrand --- arch/powerpc/kvm/book3s_64_mmu_radix.c | 14 -- 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c index 2d415c36a61d..05397c0561fc 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c @@ -801,12 +801,14 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu, writing, upgrade_p); if (is_error_noslot_pfn(pfn)) return -EFAULT; - page = NULL; - if (pfn_valid(pfn)) { - page = pfn_to_page(pfn); - if (PageReserved(page)) - page = NULL; - } + /* +* We treat any pages that are not online (not managed by the +* buddy) as reserved - this includes ZONE_DEVICE pages and +* pages without a memmap (e.g., mapped via /dev/mem). +*/ + page = pfn_to_online_page(pfn); + if (page && PageReserved(page)) + page = NULL; } /* -- 2.21.0
[PATCH v1 04/10] vfio/type1: Prepare is_invalid_reserved_pfn() for PG_reserved changes
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to change that. KVM has this weird use case that you can map anything from /dev/mem into the guest. pfn_valid() is not a reliable check whether the memmap was initialized and can be touched. pfn_to_online_page() makes sure that we have an initialized memmap (and don't have ZONE_DEVICE memory). Rewrite is_invalid_reserved_pfn() similar to kvm_is_reserved_pfn() to make sure the function produces the same result once we stop setting ZONE_DEVICE pages PG_reserved. Cc: Alex Williamson Cc: Cornelia Huck Signed-off-by: David Hildenbrand --- drivers/vfio/vfio_iommu_type1.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index 2ada8e6cdb88..f8ce8c408ba8 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -299,9 +299,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async) */ static bool is_invalid_reserved_pfn(unsigned long pfn) { - if (pfn_valid(pfn)) - return PageReserved(pfn_to_page(pfn)); + struct page *page = pfn_to_online_page(pfn); + /* +* We treat any pages that are not online (not managed by the buddy) +* as reserved - this includes ZONE_DEVICE pages and pages without +* a memmap (e.g., mapped via /dev/mem). +*/ + if (page) + return PageReserved(page); return true; } -- 2.21.0
[PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to change that. KVM has this weird use case that you can map anything from /dev/mem into the guest. pfn_valid() is not a reliable check whether the memmap was initialized and can be touched. pfn_to_online_page() makes sure that we have an initialized memmap (and don't have ZONE_DEVICE memory). Rewrite kvm_is_reserved_pfn() to make sure the function produces the same result once we stop setting ZONE_DEVICE pages PG_reserved. Cc: Paolo Bonzini Cc: "Radim Krčmář" Cc: Michal Hocko Cc: Dan Williams Cc: KarimAllah Ahmed Signed-off-by: David Hildenbrand --- virt/kvm/kvm_main.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index e9eb666eb6e8..9d18cc67d124 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -151,9 +151,15 @@ __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm, bool kvm_is_reserved_pfn(kvm_pfn_t pfn) { - if (pfn_valid(pfn)) - return PageReserved(pfn_to_page(pfn)); + struct page *page = pfn_to_online_page(pfn); + /* +* We treat any pages that are not online (not managed by the buddy) +* as reserved - this includes ZONE_DEVICE pages and pages without +* a memmap (e.g., mapped via /dev/mem). +*/ + if (page) + return PageReserved(page); return true; } -- 2.21.0
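To see why the rewrite keeps the result stable, it may help to compare the old and new checks on a toy page descriptor. This is a sketch only: struct toy_page and both helpers are hypothetical, and "valid && online" merely stands in for pfn_to_online_page() returning a page.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-PFN state for the model. */
struct toy_page {
	bool valid;	/* pfn_valid() would be true */
	bool online;	/* pfn_to_online_page() would return a page */
	bool reserved;	/* PG_reserved is set */
};

/* Old logic: trust PG_reserved whenever a memmap exists. */
static bool toy_old_is_reserved_pfn(const struct toy_page *p)
{
	assert(p);
	if (p->valid)
		return p->reserved;
	return true;
}

/* New logic: only online pages are inspected, everything else is reserved. */
static bool toy_new_is_reserved_pfn(const struct toy_page *p)
{
	assert(p);
	if (p->valid && p->online)
		return p->reserved;
	return true;
}
```

For a ZONE_DEVICE page (valid, not online) the old check flips from true to false once PG_reserved is no longer set, while the new check returns true in both cases.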
Re: [PATCH v9 4/8] powerpc/ima: define trusted boot policy
On 10/23/2019 8:47 PM, Nayna Jain wrote: +/* + * The "secure_and_trusted_rules" contains rules for both the secure boot and + * trusted boot. The "template=ima-modsig" option includes the appended + * signature, when available, in the IMA measurement list. + */ +static const char *const secure_and_trusted_rules[] = { + "measure func=KEXEC_KERNEL_CHECK template=ima-modsig", + "measure func=MODULE_CHECK template=ima-modsig", + "appraise func=KEXEC_KERNEL_CHECK appraise_type=imasig|modsig", +#ifndef CONFIG_MODULE_SIG_FORCE + "appraise func=MODULE_CHECK appraise_type=imasig|modsig", +#endif + NULL +}; Same comment as earlier - any way to avoid using conditional compilation in C file? -lakshmi
Re: [PATCH v9 3/8] powerpc: detect the trusted boot state of the system
On 10/23/2019 8:47 PM, Nayna Jain wrote: +bool is_ppc_trustedboot_enabled(void) +{ + struct device_node *node; + bool enabled = false; + + node = get_ppc_fw_sb_node(); + enabled = of_property_read_bool(node, "trusted-enabled"); Can get_ppc_fw_sb_node return NULL? Would of_property_read_bool handle the case when node is NULL? -lakshmi
Re: [PATCH v9 2/8] powerpc/ima: add support to initialize ima policy rules
On 10/23/2019 8:47 PM, Nayna Jain wrote: +/* + * The "secure_rules" are enabled only on "secureboot" enabled systems. + * These rules verify the file signatures against known good values. + * The "appraise_type=imasig|modsig" option allows the known good signature + * to be stored as an xattr or as an appended signature. + * + * To avoid duplicate signature verification as much as possible, the IMA + * policy rule for module appraisal is added only if CONFIG_MODULE_SIG_FORCE + * is not enabled. + */ +static const char *const secure_rules[] = { + "appraise func=KEXEC_KERNEL_CHECK appraise_type=imasig|modsig", +#ifndef CONFIG_MODULE_SIG_FORCE + "appraise func=MODULE_CHECK appraise_type=imasig|modsig", +#endif + NULL +}; Is there any way to not use conditional compilation in the above array definition? Maybe define different functions to get "secure_rules" for when CONFIG_MODULE_SIG_FORCE is defined and when it is not defined. Just a suggestion. -lakshmi
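One possible shape of the suggestion above, sketched in plain userspace C: keep two complete rule tables and pick one through a small accessor, so the array definitions themselves carry no #ifndef. All toy_* names are hypothetical, and the boolean parameter merely stands in for the CONFIG_MODULE_SIG_FORCE setting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Rules used when module signatures are not already enforced elsewhere. */
static const char *const toy_rules_with_module_check[] = {
	"appraise func=KEXEC_KERNEL_CHECK appraise_type=imasig|modsig",
	"appraise func=MODULE_CHECK appraise_type=imasig|modsig",
	NULL
};

/* Rules used when module signature checking is forced by other means. */
static const char *const toy_rules_kexec_only[] = {
	"appraise func=KEXEC_KERNEL_CHECK appraise_type=imasig|modsig",
	NULL
};

/* module_sig_forced stands in for CONFIG_MODULE_SIG_FORCE being enabled. */
static const char *const *toy_get_secure_rules(bool module_sig_forced)
{
	return module_sig_forced ? toy_rules_kexec_only
				 : toy_rules_with_module_check;
}

/* Count the entries of a NULL-terminated rule table. */
static size_t toy_count_rules(const char *const *rules)
{
	size_t n = 0;

	assert(rules != NULL);
	while (rules[n])
		n++;
	return n;
}
```

The cost is that the shared rule string is duplicated. In the kernel, the selector could presumably be written with IS_ENABLED(CONFIG_MODULE_SIG_FORCE) instead of a runtime flag, but that is an assumption about how the series would apply it, not something the patch does.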
Re: [PATCH v9 1/8] powerpc: detect the secure boot mode of the system
On 10/23/2019 8:47 PM, Nayna Jain wrote: This patch defines a function to detect the secure boot state of a PowerNV system. +bool is_ppc_secureboot_enabled(void) +{ + struct device_node *node; + bool enabled = false; + + node = of_find_compatible_node(NULL, NULL, "ibm,secvar-v1"); + if (!of_device_is_available(node)) { + pr_err("Cannot find secure variable node in device tree; failing to secure state\n"); + goto out; Related to "goto out;" above: Would of_find_compatible_node return NULL if the given node is not found? If of_device_is_available returns false (say, because node is NULL or it does not find the specified node) would it be correct to call of_node_put? + +out: + of_node_put(node); -lakshmi
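The question boils down to what the "goto out" pattern needs from its helpers: both the availability check and the put must tolerate a NULL node. A userspace sketch with a toy refcount, where every toy_* name is hypothetical; whether the real of_device_is_available() and of_node_put() behave like this should be confirmed against the kernel source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a device tree node. */
struct toy_node {
	int refcount;
	bool available;
	bool secure;
};

/* Lookup takes a reference on success and returns NULL when nothing is found. */
static struct toy_node *toy_find_node(struct toy_node *n)
{
	if (n)
		n->refcount++;
	return n;
}

/* Dropping a reference on NULL is a no-op, so "out:" needs no extra check. */
static void toy_node_put(struct toy_node *n)
{
	if (!n)
		return;
	assert(n->refcount > 0);
	n->refcount--;
}

/* A missing node is simply reported as not available. */
static bool toy_node_is_available(const struct toy_node *n)
{
	return n && n->available;
}

static bool toy_secureboot_enabled(struct toy_node *tree_node)
{
	struct toy_node *node;
	bool enabled = false;

	node = toy_find_node(tree_node);
	if (!toy_node_is_available(node))
		goto out;

	enabled = node->secure;
out:
	toy_node_put(node);	/* safe for NULL and for unavailable nodes */
	return enabled;
}
```

With helpers of this shape, the single exit path is correct both when the node is missing and when it exists but is disabled, since the reference taken by the lookup is dropped in either case.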
Re: [PATCH v9 5/8] ima: make process_buffer_measurement() generic
On 10/23/19 8:47 PM, Nayna Jain wrote:

Hi Nayna,

+void process_buffer_measurement(const void *buf, int size,
+				const char *eventname, enum ima_hooks func,
+				int pcr)
 {
 	int ret = 0;
 	struct ima_template_entry *entry = NULL;

+	if (func) {
+		security_task_getsecid(current, &secid);
+		action = ima_get_action(NULL, current_cred(), secid, 0, func,
+					&pcr, &template);
+		if (!(action & IMA_MEASURE))
+			return;
+	}

In your change set, process_buffer_measurement is called with NONE for the parameter func, so the ima_get_action call (the if block above) will not be executed.

Wouldn't it be better to update ima_get_action (and related functions) to handle the ima policy (func param)?

thanks,
-lakshmi
[PATCH v1 02/10] KVM: x86/mmu: Prepare kvm_is_mmio_pfn() for PG_reserved changes
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to change that. KVM has this weird use case that you can map anything from /dev/mem into the guest. pfn_valid() is not a reliable check whether the memmap was initialized and can be touched. pfn_to_online_page() makes sure that we have an initialized memmap (and don't have ZONE_DEVICE memory). Rewrite kvm_is_mmio_pfn() to make sure the function produces the same result once we stop setting ZONE_DEVICE pages PG_reserved. Cc: Paolo Bonzini Cc: "Radim Krčmář" Cc: Sean Christopherson Cc: Vitaly Kuznetsov Cc: Wanpeng Li Cc: Jim Mattson Cc: Joerg Roedel Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: KarimAllah Ahmed Cc: Michal Hocko Cc: Dan Williams Signed-off-by: David Hildenbrand --- arch/x86/kvm/mmu.c | 29 + 1 file changed, 17 insertions(+), 12 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 24c23c66b226..f03089a336de 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2962,20 +2962,25 @@ static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn, static bool kvm_is_mmio_pfn(kvm_pfn_t pfn) { + struct page *page = pfn_to_online_page(pfn); + + /* +* ZONE_DEVICE pages are never online. Online pages that are reserved +* either indicate the zero page or MMIO pages. +*/ + if (page) + return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn)); + + /* +* Anything with a valid (but not online) memmap could be ZONE_DEVICE. +* Treat only UC/UC-/WC pages as MMIO. +*/ if (pfn_valid(pfn)) - return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn)) && - /* -* Some reserved pages, such as those from NVDIMM -* DAX devices, are not for MMIO, and can be mapped -* with cached memory type for better performance. -* However, the above check misconceives those pages -* as MMIO, and results in KVM mapping them with UC -* memory type, which would hurt the performance. 
-* Therefore, we check the host memory type in addition -* and only treat UC/UC-/WC pages as MMIO. -*/ - (!pat_enabled() || pat_pfn_immune_to_uc_mtrr(pfn)); + return !pat_enabled() || pat_pfn_immune_to_uc_mtrr(pfn); + /* +* Any RAM that has no memmap (e.g., mapped via /dev/mem) is not MMIO. +*/ return !e820__mapped_raw_any(pfn_to_hpa(pfn), pfn_to_hpa(pfn + 1) - 1, E820_TYPE_RAM); -- 2.21.0
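The rewritten kvm_is_mmio_pfn() is a three-way decision: online pages, pages with a memmap that are not online, and PFNs without any memmap. A userspace sketch of that decision table, where struct toy_pfn_info and its fields are hypothetical stand-ins for the kernel checks named in the comments:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of what the kernel helpers would report for one PFN. */
struct toy_pfn_info {
	bool online;		/* pfn_to_online_page() returns a page */
	bool valid;		/* pfn_valid() */
	bool zero_page;		/* is_zero_pfn() */
	bool reserved;		/* PageReserved() */
	bool pat_enabled;	/* pat_enabled() */
	bool uc_like;		/* stands in for pat_pfn_immune_to_uc_mtrr() */
	bool e820_ram;		/* range is E820_TYPE_RAM */
};

static bool toy_is_mmio_pfn(const struct toy_pfn_info *i)
{
	assert(i);

	/* Online page: reserved means zero page or MMIO. */
	if (i->online)
		return !i->zero_page && i->reserved;

	/* Memmap exists but not online (could be ZONE_DEVICE): check memory type. */
	if (i->valid)
		return !i->pat_enabled || i->uc_like;

	/* No memmap at all: anything that is not RAM is MMIO. */
	return !i->e820_ram;
}
```

In this model a cached DAX page (valid, not online, PAT enabled, not UC-like) is not treated as MMIO whether or not PG_reserved is set, which is the behaviour the patch wants to preserve.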
[PATCH v1 01/10] mm/memory_hotplug: Don't allow to online/offline memory blocks with holes
Our onlining/offlining code is unnecessarily complicated. Only memory blocks added during boot can have holes (a range that is not IORESOURCE_SYSTEM_RAM). Hotplugged memory never has holes (e.g., see add_memory_resource()). All boot memory is already online.

Therefore, when we stop allowing to offline memory blocks with holes, we implicitly no longer have to deal with onlining memory blocks with holes.

This allows us to simplify the code. For example, we no longer have to worry about marking pages that fall into memory holes PG_reserved when onlining memory. We can stop setting pages PG_reserved.

Offlining memory blocks added during boot is usually not guaranteed to work either way (unmovable data might have easily ended up on that memory during boot). So stopping to do that should not really hurt (+ people are not even aware of a setup where that used to work and that the existing code still works correctly with memory holes). For the use case of offlining memory to unplug DIMMs, we should see no change. (holes on DIMMs would be weird).

Please note that hardware errors (PG_hwpoison) are not memory holes and not affected by this change when offlining.
Cc: Andrew Morton Cc: Michal Hocko Cc: Oscar Salvador Cc: Pavel Tatashin Cc: Dan Williams Cc: Anshuman Khandual Signed-off-by: David Hildenbrand --- mm/memory_hotplug.c | 26 -- 1 file changed, 24 insertions(+), 2 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 561371ead39a..8d81730cf036 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1447,10 +1447,19 @@ static void node_states_clear_node(int node, struct memory_notify *arg) node_clear_state(node, N_MEMORY); } +static int count_system_ram_pages_cb(unsigned long start_pfn, +unsigned long nr_pages, void *data) +{ + unsigned long *nr_system_ram_pages = data; + + *nr_system_ram_pages += nr_pages; + return 0; +} + static int __ref __offline_pages(unsigned long start_pfn, unsigned long end_pfn) { - unsigned long pfn, nr_pages; + unsigned long pfn, nr_pages = 0; unsigned long offlined_pages = 0; int ret, node, nr_isolate_pageblock; unsigned long flags; @@ -1461,6 +1470,20 @@ static int __ref __offline_pages(unsigned long start_pfn, mem_hotplug_begin(); + /* +* Don't allow to offline memory blocks that contain holes. +* Consecuently, memory blocks with holes can never get onlined +* (hotplugged memory has no holes and all boot memory is online). +* This allows to simplify the onlining/offlining code quite a lot. +*/ + walk_system_ram_range(start_pfn, end_pfn - start_pfn, _pages, + count_system_ram_pages_cb); + if (nr_pages != end_pfn - start_pfn) { + ret = -EINVAL; + reason = "memory holes"; + goto failed_removal; + } + /* This makes hotplug much easier...and readable. we assume this for now. .*/ if (!test_pages_in_a_zone(start_pfn, end_pfn, _start, @@ -1472,7 +1495,6 @@ static int __ref __offline_pages(unsigned long start_pfn, zone = page_zone(pfn_to_page(valid_start)); node = zone_to_nid(zone); - nr_pages = end_pfn - start_pfn; /* set above range as isolated */ ret = start_isolate_page_range(start_pfn, end_pfn, -- 2.21.0
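The hole check in this patch boils down to counting the system RAM pages inside the block and comparing the total with the block size. A minimal userspace sketch of that idea, assuming a hypothetical table of non-overlapping RAM ranges in place of walk_system_ram_range() (struct ram_range, count_ram_pages() and block_has_holes() are made-up names):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one IORESOURCE_SYSTEM_RAM range: [start_pfn, end_pfn). */
struct ram_range {
	unsigned long start_pfn;
	unsigned long end_pfn;
};

/* Sum the pages of all RAM ranges that fall inside [start_pfn, end_pfn).
 * Assumes the ranges in the table do not overlap each other. */
static unsigned long count_ram_pages(const struct ram_range *ram, size_t n,
				     unsigned long start_pfn,
				     unsigned long end_pfn)
{
	unsigned long total = 0;

	for (size_t i = 0; i < n; i++) {
		unsigned long lo = ram[i].start_pfn > start_pfn ?
				   ram[i].start_pfn : start_pfn;
		unsigned long hi = ram[i].end_pfn < end_pfn ?
				   ram[i].end_pfn : end_pfn;

		if (lo < hi)
			total += hi - lo;
	}
	return total;
}

/* A block has a hole if fewer RAM pages were found than the block spans. */
static bool block_has_holes(const struct ram_range *ram, size_t n,
			    unsigned long start_pfn, unsigned long end_pfn)
{
	return count_ram_pages(ram, n, start_pfn, end_pfn) !=
	       end_pfn - start_pfn;
}
```

In the patch, a mismatch makes __offline_pages() fail with -EINVAL and the reason "memory holes" before any isolation work starts.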
[PATCH v1 00/10] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
This is the result of a recent discussion with Michal ([1], [2]). Right now we set all pages PG_reserved when initializing hotplugged memmaps. This includes ZONE_DEVICE memory. In case of system memory, PG_reserved is cleared again when onlining the memory; in case of ZONE_DEVICE memory, never. In ancient times, we needed PG_reserved, because there was no way to tell whether the memmap was already properly initialized. We now have SECTION_IS_ONLINE for that in the case of !ZONE_DEVICE memory. ZONE_DEVICE memory is already initialized deferred, and there shouldn't be a visible change in that regard. One of the biggest fears was side effects. I went ahead and audited all users of PageReserved(). The details can be found in "mm/memory_hotplug: Don't mark pages PG_reserved when initializing the memmap". This patch set adapts all relevant users of PageReserved() to keep the existing behavior with respect to ZONE_DEVICE pages. The biggest part that needs changes is KVM, to keep the existing behavior (that's all I care about in this series). Note that this series is able to rely completely on pfn_to_online_page(). No new is_zone_device_page() calls are introduced (as requested by Dan). We are currently discussing a way to mark also ZONE_DEVICE memmaps as active/initialized - pfn_active() - and lightweight locking to make sure memmaps remain active (e.g., using RCU). We might later be able to convert some users of pfn_to_online_page() to pfn_active(). Details can be found in [3], however, this represents yet another cleanup/fix we'll perform on top of this cleanup. I only gave it a quick test with DIMMs on x86-64, but didn't test the ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Also, I didn't test the KVM parts (especially with ZONE_DEVICE pages or no memmap at all). Compile-tested on x86-64 and PPC. Based on next/master.
The current version (kept updated) can be found at: https://github.com/davidhildenbrand/linux.git online_reserved_cleanup RFC -> v1: - Dropped "staging/gasket: Prepare gasket_release_page() for PG_reserved changes" - Dropped "staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes" - Converted "mm/usercopy.c: Prepare check_page_span() for PG_reserved changes" to "mm/usercopy.c: Update comment in check_page_span() regarding ZONE_DEVICE" - No new users of is_zone_device_page() are introduced. - Rephrased comments and patch descriptions. [1] https://lkml.org/lkml/2019/10/21/736 [2] https://lkml.org/lkml/2019/10/21/1034 [3] https://www.spinics.net/lists/linux-mm/msg194112.html Cc: Michal Hocko Cc: Dan Williams Cc: kvm-...@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: k...@vger.kernel.org Cc: linux-hyp...@vger.kernel.org Cc: de...@driverdev.osuosl.org Cc: xen-de...@lists.xenproject.org Cc: x...@kernel.org Cc: Alexander Duyck David Hildenbrand (10): mm/memory_hotplug: Don't allow to online/offline memory blocks with holes KVM: x86/mmu: Prepare kvm_is_mmio_pfn() for PG_reserved changes KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes vfio/type1: Prepare is_invalid_reserved_pfn() for PG_reserved changes powerpc/book3s: Prepare kvmppc_book3s_instantiate_page() for PG_reserved changes powerpc/64s: Prepare hash_page_do_lazy_icache() for PG_reserved changes powerpc/mm: Prepare maybe_pte_to_page() for PG_reserved changes x86/mm: Prepare __ioremap_check_ram() for PG_reserved changes mm/memory_hotplug: Don't mark pages PG_reserved when initializing the memmap mm/usercopy.c: Update comment in check_page_span() regarding ZONE_DEVICE arch/powerpc/kvm/book3s_64_mmu_radix.c | 14 + arch/powerpc/mm/book3s64/hash_utils.c | 10 +++--- arch/powerpc/mm/pgtable.c | 10 +++--- arch/x86/kvm/mmu.c | 29 ++--- arch/x86/mm/ioremap.c | 13 ++-- drivers/hv/hv_balloon.c| 6 drivers/vfio/vfio_iommu_type1.c| 10 -- drivers/xen/balloon.c | 7 + include/linux/page-flags.h | 8 
+ mm/memory_hotplug.c| 43 +++--- mm/page_alloc.c| 11 --- mm/usercopy.c | 6 ++-- virt/kvm/kvm_main.c| 10 -- 13 files changed, 111 insertions(+), 66 deletions(-) -- 2.21.0
Re: [PATCH v7 2/3] Documentation: dt: binding: fsl: Add 'little-endian' and update Chassis define
On Mon, 2019-10-21 at 11:49 +0800, Ran Wang wrote: > By default, QorIQ SoC's RCPM register block is Big Endian. But > there are some exceptions, such as LS1088A and LS2088A, are > Little Endian. So add this optional property to help identify > them. > > Actually LS2021A and other Layerscapes won't totally follow Chassis > 2.1, so separate them from powerpc SoC. Did you mean LS1021A and "don't" instead of "won't", given the change to the examples? > Change in v5: > - Add 'Reviewed-by: Rob Herring ' to commit message. > - Rename property 'fsl,#rcpm-wakeup-cells' to '#fsl,rcpm-wakeup- > cells'. > please see https://lore.kernel.org/patchwork/patch/1101022/ I'm not sure why Rob considers this the "correct form" -- there are other examples of the current form, such as ibm,#dma-address-cells and ti,#tlb- entries, and the current form makes more logical sense (# is part of the property name, not the vendor). Oh well. > Required properites: >- reg : Offset and length of the register set of the RCPM block. > - - fsl,#rcpm-wakeup-cells : The number of IPPDEXPCR register cells in the > + - #fsl,rcpm-wakeup-cells : The number of IPPDEXPCR register cells in the > fsl,rcpm-wakeup property. >- compatible : Must contain a chip-specific RCPM block compatible string > and (if applicable) may contain a chassis-version RCPM compatible > @@ -20,6 +20,7 @@ Required properites: > * "fsl,qoriq-rcpm-1.0": for chassis 1.0 rcpm > * "fsl,qoriq-rcpm-2.0": for chassis 2.0 rcpm > * "fsl,qoriq-rcpm-2.1": for chassis 2.1 rcpm > + * "fsl,qoriq-rcpm-2.1+": for chassis 2.1+ rcpm Is there something actually called "2.1+"? It looks a bit like an attempt to claim compatibility with all future versions. If the former, is it a name that comes from the hardware side with an intent for it to describe a stable interface, or are we later going to see a patch changing some by-then-existing device trees from "2.1+" to "2.1++" when some new incompatibility is found? 
Perhaps it would be better to bind to the specific chip compatibles. -Scott
Re: [PATCH] powerpc/boot: Fix the initrd being overwritten under qemu
On Thu, Oct 24, 2019 at 12:31:24PM +1100, Alexey Kardashevskiy wrote: > > > On 23/10/2019 22:21, Segher Boessenkool wrote: > > On Wed, Oct 23, 2019 at 12:36:35PM +1100, Oliver O'Halloran wrote: > >> When booting under OF the zImage expects the initrd address and size to be > >> passed to it using registers r3 and r4. SLOF (guest firmware used by QEMU) > >> currently doesn't do this so the zImage is not aware of the initrd > >> location. This can result in initrd corruption either though the zImage > >> extracting the vmlinux over the initrd, or by the vmlinux overwriting the > >> initrd when relocating itself. > >> > >> QEMU does put the linux,initrd-start and linux,initrd-end properties into > >> the devicetree to vmlinux to find the initrd. We can work around the SLOF > >> bug by also looking those properties in the zImage. > > > > This is not a bug. What boot protocol requires passing the initrd start > > and size in GPR3, GPR4? > > So far I was unable to identify it... Maybe this comes from yaboot? https://git.ozlabs.org/?p=yaboot.git;a=blob;f=second/yaboot.c;h=9b66ab44e1be0ee82b88e386a5d0358428766e73;hb=HEAD#l1186 > > The CHRP binding (what SLOF implements) requires passing two zeroes here. > > And ePAPR requires passing the address of a device tree and a zero, plus > > something in GPR6 to allow distinguishing what it does. > > > > As Alexey says, initramfs works just fine, so please use that? initrd was > > deprecated when this code was written already. > > I did not say about anything working fine :) Yeah, I read that from your words, wrong it seems. Sorry. I often used INITRAMFS_SOURCE for kernels for use with SLOF, it's just so convenient. > In my case I was using a new QEMU which does full FDT on client-arch-support > and that thing would put the original > linux,initrd-start/end to the FDT even though the initrd was unpacked and the > properties were changes in SLOF. 
With that > fixed, this is an alternative fix for SLOF but I am not pushing it out as I > have no idea about the bindings and this > also breaks "vmlinux". > > > diff --git a/slof/fs/client.fs b/slof/fs/client.fs > index 8a7f6ac4326d..138177e4c2a3 100644 > --- a/slof/fs/client.fs > +++ b/slof/fs/client.fs > @@ -45,6 +45,17 @@ VARIABLE client-callback \ Address of client's callback > function >>r ciregs >r7 ! ciregs >r6 ! client-entry-point @ ciregs >r5 ! >\ Initialise client-stack-pointer >cistack ciregs >r1 ! > + > + s" linux,initrd-end" get-chosen IF decode-int -rot 2drop ELSE 0 THEN > + s" linux,initrd-start" get-chosen IF decode-int -rot 2drop ELSE 0 THEN > + 2dup - dup IF > +ciregs >r4 ! > +ciregs >r3 ! > +drop > + ELSE > +3drop > + THEN Something like that should work fine. Do it in go-32 and go-64 though? Or is that the wrong spot? Segher
Re: [PATCH] powerpc/tools: Don't quote $objdump in scripts
On Thu, Oct 24, 2019 at 11:47:30AM +1100, Michael Ellerman wrote: > Some of our scripts are passed $objdump and then call it as > "$objdump". This doesn't work if it contains spaces because we're > using ccache, for example you get errors such as: > > ./arch/powerpc/tools/relocs_check.sh: line 48: ccache ppc64le-objdump: No > such file or directory > ./arch/powerpc/tools/unrel_branch_check.sh: line 26: ccache > ppc64le-objdump: No such file or directory > > Fix it by not quoting the string when we expand it, allowing the shell > to do the right thing for us. This breaks things for people with spaces in their paths. Why doesn't your user use something like alias objdump="ccache ppc64le-objdump" , instead? Segher
[PATCH RFC 11/11] PCI: hotplug: movable bus numbers: compact the gaps in numbering
If bus numbers are distributed sparsely and there are a lot of devices in the tree, hotplugging a bridge into the end of the tree may fail even if it has fewer slots than the total number of unused bus numbers. Thus, the feature of bus renaming relies on the continuity of bus numbers, so if a bridge was unplugged, the gap in bus numbers must be compacted. Let's densify the bus numbering at the beginning of the next PCI rescan. Signed-off-by: Sergey Miroshnichenko --- drivers/pci/probe.c | 27 +++ 1 file changed, 27 insertions(+) diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index fe9bf012ef33..0c91b9d453dd 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -1319,6 +1319,30 @@ static bool pci_new_bus_needed(struct pci_bus *bus, const struct pci_dev *self) return true; } +static void pci_compact_bus_numbers(const int domain, const struct resource *valid_range) +{ + int busnr_p1 = valid_range->start; + + while (busnr_p1 < valid_range->end) { + int busnr_p2 = busnr_p1 + 1; + struct pci_bus *bus_p2; + int delta; + + while (busnr_p2 <= valid_range->end && + !(bus_p2 = pci_find_bus(domain, busnr_p2))) + ++busnr_p2; + + if (!bus_p2 || busnr_p2 > valid_range->end) + break; + + delta = busnr_p1 - busnr_p2 + 1; + if (delta) + pci_move_buses(domain, busnr_p2, delta, valid_range); + + ++busnr_p1; + } +} + static unsigned int pci_scan_child_bus_extend(struct pci_bus *bus, unsigned int available_buses); /** @@ -3691,6 +3715,9 @@ unsigned int pci_rescan_bus(struct pci_bus *bus) pci_bus_update_immovable_range(root); pci_bus_release_root_bridge_resources(root); + pci_compact_bus_numbers(pci_domain_nr(bus), + >busn_res); + max = pci_scan_child_bus(root); pci_reassign_root_bus_resources(root); -- 2.23.0
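The compaction loop above can be tried out in userspace with a simple model. The sketch below represents the bus numbers of one domain as a boolean array, which is purely illustrative; move_buses() is a hypothetical stand-in for pci_move_buses(), and it assumes that moving shifts the given bus and every higher-numbered bus by the same delta.

```c
#include <stdbool.h>

#define MAX_BUSES 256

/* Hypothetical model of pci_move_buses(): shift every present bus numbered
 * first..end by delta. Only non-positive deltas occur during compaction. */
static void move_buses(bool present[MAX_BUSES], int first, int delta, int end)
{
	for (int nr = first; nr <= end; nr++) {
		if (present[nr]) {
			present[nr] = false;
			present[nr + delta] = true;
		}
	}
}

/* Same loop structure as pci_compact_bus_numbers() in the patch. */
static void compact_bus_numbers(bool present[MAX_BUSES], int start, int end)
{
	int p1 = start;

	while (p1 < end) {
		int p2 = p1 + 1;
		int delta;

		/* Find the next bus number that is actually in use. */
		while (p2 <= end && !present[p2])
			++p2;
		if (p2 > end)
			break;

		/* Close the gap between p1 and p2, if there is one. */
		delta = p1 - p2 + 1;
		if (delta)
			move_buses(present, p2, delta, end);

		++p1;
	}
}
```

With buses 0, 1, 5, 6 and 9 present in the range 0..15, the loop first moves 5, 6 and 9 down by three and then moves the remaining straggler down by two, leaving buses 0 through 4 in use.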
[PATCH RFC 10/11] PCI: hotplug: movable bus numbers: rename proc and sysfs entries
Changing the number of a bus (therefore changing addresses of this bus, of its children and all the buses next in the tree) invalidates entries in /sys/devices/pci*, /proc/bus/pci/* and symlinks in /sys/bus/pci/devices/* for all the renamed devices and buses. Remove the affected proc and sysfs entries and symlinks before renaming the bus, then created them back. Signed-off-by: Sergey Miroshnichenko --- drivers/pci/probe.c | 105 +++- 1 file changed, 104 insertions(+), 1 deletion(-) diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index be9e5754cac7..fe9bf012ef33 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -1096,12 +1096,99 @@ static void pci_enable_crs(struct pci_dev *pdev) PCI_EXP_RTCTL_CRSSVE); } +static void pci_buses_remove_sysfs(int domain, int busnr, int max_bus_number) +{ + struct pci_bus *bus; + struct pci_dev *dev = NULL; + + bus = pci_find_bus(domain, busnr); + if (!bus) + return; + + if (busnr < max_bus_number) + pci_buses_remove_sysfs(domain, busnr + 1, max_bus_number); + + list_for_each_entry(dev, >devices, bus_list) { + device_remove_class_symlinks(>dev); + pci_remove_sysfs_dev_files(dev); + pci_proc_detach_device(dev); + bus_disconnect_device(>dev); + } + + device_remove_class_symlinks(>dev); + pci_proc_detach_bus(bus); +} + +static void pci_buses_create_sysfs(int domain, int busnr, int max_bus_number) +{ + struct pci_bus *bus; + struct pci_dev *dev = NULL; + + bus = pci_find_bus(domain, busnr); + if (!bus) + return; + + device_add_class_symlinks(>dev); + + list_for_each_entry(dev, >devices, bus_list) { + bus_add_device(>dev); + if (pci_dev_is_added(dev)) { + pci_proc_attach_device(dev); + pci_create_sysfs_dev_files(dev); + device_add_class_symlinks(>dev); + } + } + + if (busnr < max_bus_number) + pci_buses_create_sysfs(domain, busnr + 1, max_bus_number); +} + +static void pci_rename_bus(struct pci_bus *bus, const char *new_bus_name) +{ + struct class *class; + int err; + + class = bus->dev.class; + bus->dev.class = NULL; + err 
= device_rename(>dev, new_bus_name); + bus->dev.class = class; +} + +static void pci_rename_bus_devices(struct pci_bus *bus, const int domain, + const int new_busnr) +{ + struct pci_dev *dev = NULL; + + list_for_each_entry(dev, >devices, bus_list) { + char old_name[64]; + char new_name[64]; + struct class *class; + int err; + int i; + + strncpy(old_name, dev_name(>dev), sizeof(old_name)); + sprintf(new_name, "%04x:%02x:%02x.%d", domain, new_busnr, + PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn)); + class = dev->dev.class; + dev->dev.class = NULL; + err = device_rename(>dev, new_name); + dev->dev.class = class; + + for (i = 0; i < PCI_BRIDGE_RESOURCES; i++) + dev->resource[i].name = pci_name(dev); + } +} + static void pci_do_move_buses(const int domain, int busnr, int first_moved_busnr, int delta, const struct resource *valid_range) { struct pci_bus *bus; - int subordinate; + int subordinate, old_primary; u32 old_buses, buses; + char old_bus_name[64]; + char new_bus_name[64]; + struct resource old_res; + int new_busnr = busnr + delta; if (busnr < valid_range->start || busnr > valid_range->end) return; @@ -1110,11 +1197,21 @@ static void pci_do_move_buses(const int domain, int busnr, int first_moved_busnr if (!bus) return; + old_primary = bus->primary; + strncpy(old_bus_name, dev_name(>dev), sizeof(old_bus_name)); + sprintf(new_bus_name, "%04x:%02x", domain, new_busnr); + if (delta > 0) { pci_do_move_buses(domain, busnr + 1, first_moved_busnr, delta, valid_range); + pci_rename_bus_devices(bus, domain, new_busnr); + pci_rename_bus(bus, new_bus_name); + } else { + pci_rename_bus(bus, new_bus_name); + pci_rename_bus_devices(bus, domain, new_busnr); } + memcpy(_res, >busn_res, sizeof(old_res)); bus->number += delta; bus->busn_res.start += delta; @@ -1132,6 +1229,10 @@ static void pci_do_move_buses(const int domain, int busnr, int first_moved_busnr buses |= (unsigned int)(subordinate << 16); pci_write_config_dword(bus->self,
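For reference, the device names rebuilt by this patch follow the usual domain:bus:slot.function pattern. A small standalone sketch of that formatting is shown below; format_pci_name() is a hypothetical helper, not part of the patch, and it uses snprintf() so that a short buffer is truncated rather than overrun. The PCI_SLOT()/PCI_FUNC() macros are redefined locally with the same bit layout as the kernel's.

```c
#include <stdio.h>

/* Local copies of the kernel's devfn decoding: 5 bits slot, 3 bits function. */
#define PCI_SLOT(devfn) (((devfn) >> 3) & 0x1f)
#define PCI_FUNC(devfn) ((devfn) & 0x07)

/* Hypothetical helper: write "dddd:bb:ss.f" into buf, always NUL-terminated
 * when size > 0. Returns the length the full name needs, as snprintf() does. */
static int format_pci_name(char *buf, size_t size, unsigned int domain,
			   unsigned int busnr, unsigned int devfn)
{
	return snprintf(buf, size, "%04x:%02x:%02x.%d", domain, busnr,
			PCI_SLOT(devfn), (int)PCI_FUNC(devfn));
}
```

A return value greater than or equal to the buffer size tells the caller that the name did not fit, which sprintf() and strncpy() into fixed 64-byte buffers cannot report.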
[PATCH RFC 09/11] PCI: hotplug: Add initial support for movable bus numbers
Currently, hot-adding a bridge requires enough bus numbers to be reserved on the slot. Choosing a favorable number of reserved buses per slot is relatively simple for predictable cases, but it gets trickier when bridges can be hot-plugged into hot-plugged bridges: there may be either not enough buses in a slot for a new big bridge, or all the 255 possible numbers will be depleted. So hot-add may fail even though unused bus numbers remain somewhere in the PCI topology. Instead of reserving, the bus numbers can be allocated contiguously, and when hot-adding a bridge in the middle of the PCI tree, the conflicting buses can increment their numbers, creating a gap for the new bridge. Before moving, ensure there is enough space to move into, and that there will be no conflicts with other buses, taking into consideration that there may be more than one root bridge in the domain (e.g. on some Intel Xeons one root has buses 00-7f, and the second one - 80-ff). The feature is disabled by default to not break the ABI, and can be enabled by the "pci=movable_buses" command line argument, if all risks are accepted. The following set of parameters provides a safe activation of the feature: pci=realloc,pcie_bus_peer2peer,movable_buses On x86, the "pci=assign-busses" is also required: pci=realloc,pcie_bus_peer2peer,movable_buses,assign-busses This series is the second half of the work started by the "Movable BARs" patches, and relies on fixes made there. Following patches will resolve the introduced issues: - fix desynchronization in /sys/devices/pci*, /sys/bus/pci/devices/* and /proc/bus/pci/* after changes in PCI topology; - compact gaps in numbering, which may appear after removing a bridge, to maintain the number continuity.
Signed-off-by: Sergey Miroshnichenko --- .../admin-guide/kernel-parameters.txt | 3 + drivers/pci/pci.c | 3 + drivers/pci/pci.h | 2 + drivers/pci/probe.c | 153 +- 4 files changed, 156 insertions(+), 5 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index c6243aaed0c9..1bf8dea1f08a 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3529,6 +3529,9 @@ force_floating [S390] Force usage of floating interrupts. nomio [S390] Do not use MIO instructions. no_movable_bars Don't allow BARs to be moved during hotplug + movable_buses Prefer bus renaming over the number reserving. This + inflicts the deleting+recreating of sysfs and procfs + entries. pcie_aspm= [PCIE] Forcibly enable or disable PCIe Active State Power Management. diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 6ec1b70e4a96..9b2dcaa268e8 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -79,6 +79,7 @@ int pci_domains_supported = 1; #endif bool pci_can_move_bars = true; +bool pci_movable_buses; #define DEFAULT_CARDBUS_IO_SIZE(256) #define DEFAULT_CARDBUS_MEM_SIZE (64*1024*1024) @@ -6335,6 +6336,8 @@ static int __init pci_setup(char *str) disable_acs_redir_param = str + 18; } else if (!strncmp(str, "no_movable_bars", 15)) { pci_can_move_bars = false; + } else if (!strncmp(str, "movable_buses", 13)) { + pci_movable_buses = true; } else { pr_err("PCI: Unknown option `%s'\n", str); } diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 9b5164d10499..804176bb1d1b 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -289,6 +289,8 @@ void pci_bus_put(struct pci_bus *bus); bool pci_dev_bar_movable(struct pci_dev *dev, struct resource *res); +extern bool pci_movable_buses; + int assign_fixed_resource_on_bus(struct pci_bus *b, struct resource *r); /* PCIe link information */ diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index 
3494b5d265d5..be9e5754cac7 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -1096,6 +1096,126 @@ static void pci_enable_crs(struct pci_dev *pdev) PCI_EXP_RTCTL_CRSSVE); } +static void pci_do_move_buses(const int domain, int busnr, int first_moved_busnr, + int delta, const struct resource *valid_range) +{ + struct pci_bus *bus; + int subordinate; + u32 old_buses, buses; + + if (busnr < valid_range->start || busnr > valid_range->end) + return; + + bus = pci_find_bus(domain, busnr); + if (!bus) + return; + + if (delta > 0) { +
[PATCH RFC 06/11] powerpc/pci: Enable assigning bus numbers instead of reading them from DT
If the firmware indicates support of reassigning bus numbers via the PHB's "ibm,supported-movable-bdfs" property in DT, PowerNV will not depend on PCI topology info from DT anymore. This makes it possible to re-enumerate the fabric, assign new bus numbers and switch from the pnv_php module to the standard pciehp driver for PCI hotplug functionality. Signed-off-by: Sergey Miroshnichenko --- arch/powerpc/kernel/pci_dn.c | 5 + arch/powerpc/platforms/powernv/eeh-powernv.c | 3 ++- 2 files changed, 7 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c index ad0ecf48e943..b9b7518eb2b4 100644 --- a/arch/powerpc/kernel/pci_dn.c +++ b/arch/powerpc/kernel/pci_dn.c @@ -559,6 +559,11 @@ void pci_devs_phb_init_dynamic(struct pci_controller *phb) phb->pci_data = pdn; } + if (of_get_property(dn, "ibm,supported-movable-bdfs", NULL)) { + pci_add_flags(PCI_REASSIGN_ALL_BUS); + return; + } + /* Update dn->phb ptrs for new phb and children devices */ pci_traverse_device_nodes(dn, add_pdn, phb); } diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c index 6bc24a47e9ef..6c126aa2a6b7 100644 --- a/arch/powerpc/platforms/powernv/eeh-powernv.c +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c @@ -42,7 +42,8 @@ void pnv_pcibios_bus_add_device(struct pci_dev *pdev) { struct pci_dn *pdn = pci_get_pdn(pdev); - if (eeh_has_flag(EEH_FORCE_DISABLED)) + if (eeh_has_flag(EEH_FORCE_DISABLED) || + !pci_has_flag(PCI_REASSIGN_ALL_BUS)) return; dev_dbg(>dev, "EEH: Setting up device\n"); -- 2.23.0
[PATCH RFC 08/11] PCI: Allow expanding the bridges
When hotplugging a bridge, the parent bus may not have [enough] reserved bus numbers. So before rescanning the bus, set its subordinate number to the maximum possible value: it is 255 when there is only one root bridge in the domain. During the PCI rescan, the subordinate bus number of every bus will be contracted to the actual value. Signed-off-by: Sergey Miroshnichenko --- drivers/pci/probe.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index 539f5d39bb6d..3494b5d265d5 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -3195,20 +3195,22 @@ static unsigned int pci_dev_count_res_mask(struct pci_dev *dev) return res_mask; } -static void pci_bus_rescan_prepare(struct pci_bus *bus) +static void pci_bus_rescan_prepare(struct pci_bus *bus, int last_bus_number) { struct pci_dev *dev; if (bus->self) pci_config_pm_runtime_get(bus->self); + bus->busn_res.end = last_bus_number; + list_for_each_entry(dev, >devices, bus_list) { struct pci_bus *child = dev->subordinate; dev->res_mask = pci_dev_count_res_mask(dev); if (child) - pci_bus_rescan_prepare(child); + pci_bus_rescan_prepare(child, last_bus_number); if (dev->driver && dev->driver->rescan_prepare) @@ -3439,7 +3441,7 @@ unsigned int pci_rescan_bus(struct pci_bus *bus) if (pci_can_move_bars) { pcibios_root_bus_rescan_prepare(root); - pci_bus_rescan_prepare(root); + pci_bus_rescan_prepare(root, root->busn_res.end); pci_bus_update_immovable_range(root); pci_bus_release_root_bridge_resources(root); -- 2.23.0
[PATCH RFC 05/11] drivers: base: Add bus_disconnect_device()
Add bus_disconnect_device(), which is similar to bus_remove_device(), but it doesn't detach the device from its driver, so it can be reconnected to the same or another bus later. This is yet another preparation for hotplugging large PCIe bridges, which may entail changes in BDF addresses of working devices due to movable bus numbers. Changed addresses require rebuilding the affected entries in /sys/bus/pci and /proc/bus/pci. Using bus_disconnect_device()+bus_add_device() during PCI rescan allows the drivers to keep working with their devices without interruption, regardless of changes in PCI addresses. Signed-off-by: Sergey Miroshnichenko --- drivers/base/bus.c | 36 include/linux/device.h | 1 + 2 files changed, 37 insertions(+) diff --git a/drivers/base/bus.c b/drivers/base/bus.c index 8f3445cc533e..52d77fb90218 100644 --- a/drivers/base/bus.c +++ b/drivers/base/bus.c @@ -497,6 +497,42 @@ void bus_probe_device(struct device *dev) mutex_unlock(>p->mutex); } +/** + * bus_disconnect_device - disconnect device from bus, + * but don't detach it from driver + * @dev: device to be disconnected + * + * - Remove device from all interfaces. + * - Remove symlink from bus' directory. + * - Delete device from bus's list.
+ */ +void bus_disconnect_device(struct device *dev) +{ + struct bus_type *bus = dev->bus; + struct subsys_interface *sif; + + if (!bus) + return; + + mutex_lock(>p->mutex); + list_for_each_entry(sif, >p->interfaces, node) + if (sif->remove_dev) + sif->remove_dev(dev, sif); + mutex_unlock(>p->mutex); + + sysfs_remove_link(>kobj, "subsystem"); + sysfs_remove_link(>bus->p->devices_kset->kobj, + dev_name(dev)); + device_remove_groups(dev, dev->bus->dev_groups); + if (klist_node_attached(>p->knode_bus)) + klist_del(>p->knode_bus); + + pr_debug("bus: '%s': remove device %s\n", +dev->bus->name, dev_name(dev)); + bus_put(dev->bus); +} +EXPORT_SYMBOL_GPL(bus_disconnect_device); + /** * bus_remove_device - remove device from bus * @dev: device to be removed diff --git a/include/linux/device.h b/include/linux/device.h index 420228ab9c4b..9f098c32a4ad 100644 --- a/include/linux/device.h +++ b/include/linux/device.h @@ -268,6 +268,7 @@ void bus_sort_breadthfirst(struct bus_type *bus, int (*compare)(const struct device *a, const struct device *b)); extern int bus_add_device(struct device *dev); +extern void bus_disconnect_device(struct device *dev); extern int device_add_class_symlinks(struct device *dev); extern void device_remove_class_symlinks(struct device *dev); -- 2.23.0
[PATCH RFC 07/11] powerpc/pci: Don't reduce the host bridge bus range
Currently the last possible bus number of the PHB is set to the last used bus number during the boot. So when hotplugging a bridge later, no new buses can be allocated because they are limited by this value. Let the host bridge contain any number of buses up to 255. Signed-off-by: Sergey Miroshnichenko --- arch/powerpc/kernel/pci-common.c | 1 - 1 file changed, 1 deletion(-) diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c index 1c448cf25506..5877ef7a39a0 100644 --- a/arch/powerpc/kernel/pci-common.c +++ b/arch/powerpc/kernel/pci-common.c @@ -1631,7 +1631,6 @@ void pcibios_scan_phb(struct pci_controller *hose) if (mode == PCI_PROBE_NORMAL) { pci_bus_update_busn_res_end(bus, 255); hose->last_busno = pci_scan_child_bus(bus); - pci_bus_update_busn_res_end(bus, hose->last_busno); } /* Platform gets a chance to do some global fixups before -- 2.23.0
[PATCH RFC 04/11] drivers: base: Make device_{add|remove}_class_symlinks() public
When updating the /sys/devices/pci* entries affected by changes in the PCI topology, their symlinks in /sys/bus/pci/devices/* must also be rebuilt. Moving device_add_class_symlinks() and device_remove_class_symlinks() to a public API allows the PCI subsystem to update the sysfs without destroying the working affected devices. Signed-off-by: Sergey Miroshnichenko --- drivers/base/core.c| 6 -- include/linux/device.h | 2 ++ 2 files changed, 6 insertions(+), 2 deletions(-) diff --git a/drivers/base/core.c b/drivers/base/core.c index 7bd9cd366d41..23e689fc8478 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -1922,7 +1922,7 @@ static void cleanup_glue_dir(struct device *dev, struct kobject *glue_dir) mutex_unlock(_mutex); } -static int device_add_class_symlinks(struct device *dev) +int device_add_class_symlinks(struct device *dev) { struct device_node *of_node = dev_of_node(dev); int error; @@ -1973,8 +1973,9 @@ static int device_add_class_symlinks(struct device *dev) sysfs_remove_link(>kobj, "of_node"); return error; } +EXPORT_SYMBOL_GPL(device_add_class_symlinks); -static void device_remove_class_symlinks(struct device *dev) +void device_remove_class_symlinks(struct device *dev) { if (dev_of_node(dev)) sysfs_remove_link(>kobj, "of_node"); @@ -1991,6 +1992,7 @@ static void device_remove_class_symlinks(struct device *dev) #endif sysfs_delete_link(>class->p->subsys.kobj, >kobj, dev_name(dev)); } +EXPORT_SYMBOL_GPL(device_remove_class_symlinks); /** * dev_set_name - set a device name diff --git a/include/linux/device.h b/include/linux/device.h index 4d8bbc8ae73d..420228ab9c4b 100644 --- a/include/linux/device.h +++ b/include/linux/device.h @@ -268,6 +268,8 @@ void bus_sort_breadthfirst(struct bus_type *bus, int (*compare)(const struct device *a, const struct device *b)); extern int bus_add_device(struct device *dev); +extern int device_add_class_symlinks(struct device *dev); +extern void device_remove_class_symlinks(struct device *dev); /* * Bus notifiers: 
Get notified of addition/removal of devices -- 2.23.0
[PATCH RFC 03/11] drivers: base: Make bus_add_device() public
Move the bus_add_device() to a public API, so it can be applied to devices which are temporarily detached from their buses without being destroyed. This will be used after changes in PCI topology after hotplugging a bridge: buses may get their numbers changed, so their child devices must be reattached and their sysfs and proc files recreated. Signed-off-by: Sergey Miroshnichenko --- drivers/base/base.h| 1 - drivers/base/bus.c | 1 + include/linux/device.h | 2 ++ 3 files changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/base/base.h b/drivers/base/base.h index 0d32544b6f91..c93d302e6345 100644 --- a/drivers/base/base.h +++ b/drivers/base/base.h @@ -110,7 +110,6 @@ extern void container_dev_init(void); struct kobject *virtual_device_parent(struct device *dev); -extern int bus_add_device(struct device *dev); extern void bus_probe_device(struct device *dev); extern void bus_remove_device(struct device *dev); diff --git a/drivers/base/bus.c b/drivers/base/bus.c index a1d1e8256324..8f3445cc533e 100644 --- a/drivers/base/bus.c +++ b/drivers/base/bus.c @@ -471,6 +471,7 @@ int bus_add_device(struct device *dev) bus_put(dev->bus); return error; } +EXPORT_SYMBOL_GPL(bus_add_device); /** * bus_probe_device - probe drivers for a new device diff --git a/include/linux/device.h b/include/linux/device.h index 297239a08bb7..4d8bbc8ae73d 100644 --- a/include/linux/device.h +++ b/include/linux/device.h @@ -267,6 +267,8 @@ int bus_for_each_drv(struct bus_type *bus, struct device_driver *start, void bus_sort_breadthfirst(struct bus_type *bus, int (*compare)(const struct device *a, const struct device *b)); +extern int bus_add_device(struct device *dev); + /* * Bus notifiers: Get notified of addition/removal of devices * and binding/unbinding of drivers to devices. -- 2.23.0
[PATCH RFC 02/11] PCI: proc: Nullify a freed pointer
A PCI device may be detached from /proc/bus/pci/devices not only when it is
removed, but also when its bus has changed its number: in this case the proc
entry must be recreated to reflect the new PCI topology.

Nullify freed pointers to mark them as valid for allocating again.

Signed-off-by: Sergey Miroshnichenko
---
 drivers/pci/proc.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/pci/proc.c b/drivers/pci/proc.c
index 5495537c60c2..c85654dd315b 100644
--- a/drivers/pci/proc.c
+++ b/drivers/pci/proc.c
@@ -443,6 +443,7 @@ int pci_proc_detach_device(struct pci_dev *dev)
 int pci_proc_detach_bus(struct pci_bus *bus)
 {
 	proc_remove(bus->procdir);
+	bus->procdir = NULL;
 	return 0;
 }
 
-- 
2.23.0
[PATCH RFC 01/11] PCI: sysfs: Nullify freed pointers
After hotplugging a bridge the PCI topology will be changed: buses may have
their numbers changed. In this case all the affected sysfs entries/symlinks
must be recreated, because they have the BDF address in their names.

Set the freed pointers to NULL, so the !NULL checks will be satisfied when
it's time to recreate the sysfs entries.

Signed-off-by: Sergey Miroshnichenko
---
 drivers/pci/pci-sysfs.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 793412954529..a238935c1193 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -1129,12 +1129,14 @@ static void pci_remove_resource_files(struct pci_dev *pdev)
 		if (res_attr) {
 			sysfs_remove_bin_file(&pdev->dev.kobj, res_attr);
 			kfree(res_attr);
+			pdev->res_attr[i] = NULL;
 		}
 
 		res_attr = pdev->res_attr_wc[i];
 		if (res_attr) {
 			sysfs_remove_bin_file(&pdev->dev.kobj, res_attr);
 			kfree(res_attr);
+			pdev->res_attr_wc[i] = NULL;
 		}
 	}
 }
@@ -1175,8 +1177,11 @@ static int pci_create_attr(struct pci_dev *pdev, int num, int write_combine)
 	res_attr->size = pci_resource_len(pdev, num);
 	res_attr->private = (void *)(unsigned long)num;
 	retval = sysfs_create_bin_file(&pdev->dev.kobj, res_attr);
-	if (retval)
+	if (retval) {
 		kfree(res_attr);
+		if (pdev->res_attr[num] == res_attr)
+			pdev->res_attr[num] = NULL;
+	}
 
 	return retval;
 }
-- 
2.23.0
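For readers outside the PCI code, the reason for clearing the pointers can be illustrated with a small userspace sketch. It is not kernel code: struct fake_dev, NUM_SLOTS and both helpers are hypothetical stand-ins, and the "skip if the slot is already set" check is an assumption made for the illustration. The point is only that a slot which is freed but not cleared can neither be safely freed again nor recreated.

```c
#include <assert.h>
#include <stdlib.h>

#define NUM_SLOTS 6	/* hypothetical number of per-device attribute slots */

/* Hypothetical stand-in for a device owning lazily created attributes. */
struct fake_dev {
	int *res_attr[NUM_SLOTS];
};

/*
 * Create the attribute for slot i unless the slot is already populated.
 * Returns 1 if created, 0 if skipped, -1 on allocation failure.
 */
static int fake_create_attr(struct fake_dev *d, int i)
{
	if (d->res_attr[i])
		return 0;	/* slot looks occupied, nothing is created */

	d->res_attr[i] = malloc(sizeof(int));
	if (!d->res_attr[i])
		return -1;

	*d->res_attr[i] = i;
	return 1;
}

/* Free every attribute and clear its slot so it can be created again. */
static void fake_remove_attrs(struct fake_dev *d)
{
	for (int i = 0; i < NUM_SLOTS; i++) {
		free(d->res_attr[i]);	/* free(NULL) is a no-op */
		d->res_attr[i] = NULL;	/* otherwise the slot would dangle */
	}
}
```

With the assignment to NULL removed from fake_remove_attrs(), a second create call would be skipped and a second remove call would free the same memory twice.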
[PATCH RFC 00/11] PCI: hotplug: Movable bus numbers
To allow hotplugging bridges, the kernel or BIOS/bootloader/firmware add extra bus numbers per slot, but this range may be not enough for a large bridge and/or nested bridges when hot-adding a chassis full of devices. This patchset proposes an approach similar to movable BARs: bus numbers are not reserved anymore, instead the kernel moves the "tail" of the PCI tree by one, when needed a new bus. When something like this is going to happen: *LARGE* +-[0020:00]---00.0-[01-20]--+-00.0-[02-08]--+-00.0-[03]-- <-- *NESTED* | | +-01.0-[04]--*BRIDGE* | | +-02.0-[05]-- | | +-03.0-[06]-- | | +-04.0-[07]-- | | \-05.0-[08]-- ... , this will result into the following: +-[0020:00]---00.0-[01-22]--+-00.0-[02-22]--+-00.0-[03-1d]04.0-[04-1d]--+-00.0-[05]-- | | | +-04.0-[06]-- | | | +-09.0-[07]-- | | | +-0c.0-[08-19]00.0-[09-19]--+-01.0-[0a]-- | | | | ... | | | | \-11.0-[19]-- | | | ... | | | \-15.0-[1d]-- | | +-01.0-[1e]-- <-- Renamed from 04 | | +-02.0-[1f]-- <-- Renamed from 05 | | +-03.0-[20]-- <-- Renamed from 06 | | +-04.0-[21]-- <-- Renamed from 07 | | \-05.0-[22]-- <-- Renamed from 08 ... This looks to be safe in the kernel, because drivers don't use the raw PCI BDF ID, and we've tested that on our x86 and PowerNV machines: mass storage with roots and network adapters just continue their work while their bus numbers had moved. But here comes the userspace: - procfs entries: % ls -la /proc/bus/pci/* /proc/bus/pci/00: 00.0 02.0 ... 1f.4 1f.6 /proc/bus/pci/04: 00.0 /proc/bus/pci/40: 00.0 - sysfs entries: % ls -la /sys/devices/pci:00/ :00:00.0 :00:02.0 ... :00:1f.3 :00:1f.4 :00:1f.6 % ls -la /sys/devices/pci:00/:00:1c.6/:04:00.0/driver driver -> ../../../../bus/pci/drivers/iwlwifi - sysfs symlinks: % ls -la /sys/bus/pci/devices :00:00.0 -> ../../../devices/pci:00/:00:00.0 :00:02.0 -> ../../../devices/pci:00/:00:02.0 ... 
:04:00.0 -> ../../../devices/pci:00/:00:1c.6/:04:00.0 :40:00.0 -> ../../../devices/pci:00/:00:1d.2/:40:00.0 These patches alter the kernel public API and some internals to be able to remove these files before changing a bus number, and create new versions of them after device has changed its BDF. On one hand, this makes the hotplug predictable, independent of non-kernel program components (BIOS, bootloader, etc.) and cross-platform, but this is also a severe ABI violation. Probably, the udev should have a new action like "rename" in addition to "add" and "remove". Is it feasible to have this feature disabled by default, but with a chance to enable by a kernel command line argument like this: pci=realloc,movable_buses ? This code is follow-up of the "PCI: Allow BAR movement during hotplug" series (v6). Sergey Miroshnichenko (11): PCI: sysfs: Nullify freed pointers PCI: proc: Nullify a freed pointer drivers: base: Make bus_add_device() public drivers: base: Make device_{add|remove}_class_symlinks() public drivers: base: Add bus_disconnect_device() powerpc/pci: Enable assigning bus numbers instead of reading them from DT powerpc/pci: Don't reduce the host bridge bus range PCI: Allow expanding the bridges PCI: hotplug: Add initial support for movable bus numbers PCI: hotplug: movable bus numbers: rename proc and sysfs entries PCI: hotplug: movable bus numbers: compact the gaps in numbering .../admin-guide/kernel-parameters.txt | 3 + arch/powerpc/kernel/pci-common.c | 1 - arch/powerpc/kernel/pci_dn.c | 5 + arch/powerpc/platforms/powernv/eeh-powernv.c | 3 +- drivers/base/base.h | 1 - drivers/base/bus.c| 37 +++ drivers/base/core.c | 6 +- drivers/pci/pci-sysfs.c
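The "move the tail of the PCI tree" idea from the cover letter can be modelled in a few lines of plain C. This is only a sketch under simplifying assumptions: the buses are a flat array of numbers rather than a tree, shift_bus_tail() is a hypothetical helper that does not exist in the series, and it does not check for running past bus number 0xff.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Shift every bus number greater than 'after' up by 'count', which frees
 * 'count' consecutive numbers right behind 'after' for a new bridge.
 * Returns how many buses were renamed.
 */
static size_t shift_bus_tail(unsigned char *busnr, size_t n,
			     unsigned char after, unsigned char count)
{
	size_t renamed = 0;

	for (size_t i = 0; i < n; i++) {
		if (busnr[i] > after) {
			busnr[i] = (unsigned char)(busnr[i] + count);
			renamed++;
		}
	}

	return renamed;
}
```

Applied to the example above, shifting everything behind bus 03 by 0x1a turns buses 04..08 into 1e..22, which matches the "Renamed from" annotations in the second tree.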
[PATCH v6 29/30] PCI: pciehp: movable BARs: Trigger a domain rescan on hp events
With movable BARs, adding a hotplugged device is not local to its bridge
anymore, but it affects the whole domain: BARs, bridge windows and bus
numbers can be substantially rearranged. So instead of trying to fit the new
devices into preallocated reserved gaps, initiate a full domain rescan.

The pci_rescan_bus() covers all the operations of the replaced functions:
 - assigning new bus numbers, as the pci_hp_add_bridge() does it;
 - allocating BARs (pci_assign_unassigned_bridge_resources());
 - configuring MPS settings (pcie_bus_configure_settings());
 - binding devices to their drivers (pci_bus_add_devices()).

CC: Lukas Wunner
Signed-off-by: Sergey Miroshnichenko
---
 drivers/pci/hotplug/pciehp_pci.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/pci/hotplug/pciehp_pci.c b/drivers/pci/hotplug/pciehp_pci.c
index d17f3bf36f70..6d4c1ef38210 100644
--- a/drivers/pci/hotplug/pciehp_pci.c
+++ b/drivers/pci/hotplug/pciehp_pci.c
@@ -58,6 +58,11 @@ int pciehp_configure_device(struct controller *ctrl)
 		goto out;
 	}
 
+	if (pci_can_move_bars) {
+		pci_rescan_bus(parent);
+		goto out;
+	}
+
 	for_each_pci_bridge(dev, parent)
 		pci_hp_add_bridge(dev);
 
-- 
2.23.0
[PATCH v6 30/30] Revert "powerpc/powernv/pci: Work around races in PCI bridge enabling"
This reverts commit db2173198b9513f7add8009f225afa1f1c79bcc6.

The root cause of this bug is fixed by the following two commits:

 1. "PCI: Fix race condition in pci_enable/disable_device()"
 2. "PCI: Enable bridge's I/O and MEM access for hotplugged devices"

The x86 is also affected by this bug if a PCIe bridge has been hotplugged
without pre-enabling by the BIOS.

CC: Benjamin Herrenschmidt
Signed-off-by: Sergey Miroshnichenko
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 37 -------------------------------------
 1 file changed, 37 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 33d5ed8c258f..f12f3a49d3bb 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -3119,49 +3119,12 @@ static void pnv_pci_ioda_create_dbgfs(void)
 #endif /* CONFIG_DEBUG_FS */
 }
 
-static void pnv_pci_enable_bridge(struct pci_bus *bus)
-{
-	struct pci_dev *dev = bus->self;
-	struct pci_bus *child;
-
-	/* Empty bus ? bail */
-	if (list_empty(&bus->devices))
-		return;
-
-	/*
-	 * If there's a bridge associated with that bus enable it. This works
-	 * around races in the generic code if the enabling is done during
-	 * parallel probing. This can be removed once those races have been
-	 * fixed.
-	 */
-	if (dev) {
-		int rc = pci_enable_device(dev);
-		if (rc)
-			pci_err(dev, "Error enabling bridge (%d)\n", rc);
-		pci_set_master(dev);
-	}
-
-	/* Perform the same to child busses */
-	list_for_each_entry(child, &bus->children, node)
-		pnv_pci_enable_bridge(child);
-}
-
-static void pnv_pci_enable_bridges(void)
-{
-	struct pci_controller *hose;
-
-	list_for_each_entry(hose, &hose_list, list_node)
-		pnv_pci_enable_bridge(hose->bus);
-}
-
 static void pnv_pci_ioda_fixup(void)
 {
 	pnv_pci_ioda_setup_PEs();
 	pnv_pci_ioda_setup_iommu_api();
 	pnv_pci_ioda_create_dbgfs();
 
-	pnv_pci_enable_bridges();
-
 #ifdef CONFIG_EEH
 	pnv_eeh_post_init();
 #endif
-- 
2.23.0
[PATCH v6 27/30] nvme-pci: Handle movable BARs
Hotplugged devices can affect the existing ones by moving their BARs. The PCI subsystem will inform the NVME driver about this by invoking the .rescan_prepare() and .rescan_done() hooks, so the BARs can by re-mapped. Tested under the "randrw" mode of the fio tool. Before the hotplugging: % sudo cat /proc/iomem ... 3fe8-3fe8007f : PCI Bus 0020:0b 3fe8-3fe8007f : PCI Bus 0020:18 3fe8-3fe8000f : 0020:18:00.0 3fe8-3fe8000f : nvme 3fe80010-3fe80017 : 0020:18:00.0 ... , then another NVME drive was hot-added, so BARs of the 0020:18:00.0 are moved: % sudo cat /proc/iomem ... 3fe8-3fe800ff : PCI Bus 0020:0b 3fe8-3fe8007f : PCI Bus 0020:10 3fe8-3fe83fff : 0020:10:00.0 3fe8-3fe83fff : nvme 3fe80001-3fe80001 : 0020:10:00.0 3fe80080-3fe800ff : PCI Bus 0020:18 3fe80080-3fe8008f : 0020:18:00.0 3fe80080-3fe8008f : nvme 3fe80090-3fe80097 : 0020:18:00.0 ... During the rescanning, both READ and WRITE speeds drop to zero for a while due to driver's pause, then restore. Also tested with an NVME as a system drive. 
Cc: linux-n...@lists.infradead.org Cc: Christoph Hellwig Signed-off-by: Sergey Miroshnichenko --- drivers/nvme/host/pci.c | 21 - 1 file changed, 20 insertions(+), 1 deletion(-) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 869f462e6b6e..5f162ea5a5f1 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -1650,7 +1650,7 @@ static int nvme_remap_bar(struct nvme_dev *dev, unsigned long size) { struct pci_dev *pdev = to_pci_dev(dev->dev); - if (size <= dev->bar_mapped_size) + if (dev->bar && size <= dev->bar_mapped_size) return 0; if (size > pci_resource_len(pdev, 0)) return -ENOMEM; @@ -3059,6 +3059,23 @@ static void nvme_error_resume(struct pci_dev *pdev) flush_work(>ctrl.reset_work); } +static void nvme_rescan_prepare(struct pci_dev *pdev) +{ + struct nvme_dev *dev = pci_get_drvdata(pdev); + + nvme_dev_disable(dev, false); + nvme_dev_unmap(dev); + dev->bar = NULL; +} + +static void nvme_rescan_done(struct pci_dev *pdev) +{ + struct nvme_dev *dev = pci_get_drvdata(pdev); + + nvme_dev_map(dev); + nvme_reset_ctrl_sync(>ctrl); +} + static const struct pci_error_handlers nvme_err_handler = { .error_detected = nvme_error_detected, .slot_reset = nvme_slot_reset, @@ -3135,6 +3152,8 @@ static struct pci_driver nvme_driver = { #endif .sriov_configure = pci_sriov_configure_simple, .err_handler= _err_handler, + .rescan_prepare = nvme_rescan_prepare, + .rescan_done= nvme_rescan_done, }; static int __init nvme_init(void) -- 2.23.0
[PATCH v6 28/30] PCI/portdrv: Declare support of movable BARs
Switch's BARs are not used by the portdrv driver, but they are still
considered as immovable until the .rescan_prepare() and .rescan_done() hooks
are added. Add these hooks to increase chances to allocate new BARs.

Signed-off-by: Sergey Miroshnichenko
---
 drivers/pci/pcie/portdrv_pci.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
index 0a87091a0800..9dbddc7faaa7 100644
--- a/drivers/pci/pcie/portdrv_pci.c
+++ b/drivers/pci/pcie/portdrv_pci.c
@@ -197,6 +197,14 @@ static const struct pci_error_handlers pcie_portdrv_err_handler = {
 	.resume = pcie_portdrv_err_resume,
 };
 
+static void pcie_portdrv_rescan_prepare(struct pci_dev *pdev)
+{
+}
+
+static void pcie_portdrv_rescan_done(struct pci_dev *pdev)
+{
+}
+
 static struct pci_driver pcie_portdriver = {
 	.name		= "pcieport",
 	.id_table	= &port_pci_ids[0],
@@ -207,6 +215,9 @@ static struct pci_driver pcie_portdriver = {
 
 	.err_handler	= &pcie_portdrv_err_handler,
 
+	.rescan_prepare	= pcie_portdrv_rescan_prepare,
+	.rescan_done	= pcie_portdrv_rescan_done,
+
 	.driver.pm	= PCIE_PORTDRV_PM_OPS,
 };
 
-- 
2.23.0
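The effect of adding empty hooks may be easier to see in a reduced model. The sketch below is hypothetical userspace C, not the series' implementation: struct fake_driver, struct fake_pdev and fake_bars_movable() are invented names, and treating an unbound device as movable is an assumption. It only captures the rule stated above, that a bound driver opts in to BAR movement by providing both hooks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct fake_pdev;

/* Hypothetical driver descriptor reduced to the two rescan hooks. */
struct fake_driver {
	void (*rescan_prepare)(struct fake_pdev *pdev);
	void (*rescan_done)(struct fake_pdev *pdev);
};

struct fake_pdev {
	const struct fake_driver *driver;	/* NULL when no driver is bound */
};

/*
 * Assumption for this sketch: BARs of an unbound device may move freely,
 * BARs of a bound device may move only if its driver has both hooks.
 */
static bool fake_bars_movable(const struct fake_pdev *pdev)
{
	const struct fake_driver *drv = pdev->driver;

	if (!drv)
		return true;

	return drv->rescan_prepare && drv->rescan_done;
}

/* Empty hook, in the spirit of the portdrv hooks: nothing to pause or remap. */
static void noop_hook(struct fake_pdev *pdev)
{
	(void)pdev;
}
```

A driver that keeps no mapping of its BARs can therefore declare support with two empty functions, while a driver such as nvme has to stop I/O and remap in its hooks.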
[PATCH v6 26/30] PCI: hotplug: movable BARs: Enable the feature by default
This is the last patch in the series which implements the essentials of the
Movable BARs feature, so it is turned on by default now.

Tested on:
 - x86_64 with "pci=realloc,pcie_bus_peer2peer" command line argument;
 - POWER8 PowerNV+PHB3 ppc64le with "pci=realloc,pcie_bus_peer2peer".

In case of problems it can still be overridden by the following command
line option:

  pcie_movable_bars=off

CC: Oliver O'Halloran
Signed-off-by: Sergey Miroshnichenko
---
 drivers/pci/pci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 85014c6b2817..6ec1b70e4a96 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -78,7 +78,7 @@ static void pci_dev_d3_sleep(struct pci_dev *dev)
 int pci_domains_supported = 1;
 #endif
 
-bool pci_can_move_bars;
+bool pci_can_move_bars = true;
 
 #define DEFAULT_CARDBUS_IO_SIZE		(256)
 #define DEFAULT_CARDBUS_MEM_SIZE	(64*1024*1024)
-- 
2.23.0
[PATCH v6 25/30] PNP: Don't reserve BARs for PCI when enabled movable BARs
When the Movable BARs feature is supported, the PCI subsystem is able to distribute existing BARs and allocate the new ones itself, without need to reserve gaps by BIOS. CC: Rafael J. Wysocki Signed-off-by: Sergey Miroshnichenko --- drivers/pnp/system.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/pnp/system.c b/drivers/pnp/system.c index 6950503741eb..5977bd11f4d4 100644 --- a/drivers/pnp/system.c +++ b/drivers/pnp/system.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include @@ -58,6 +59,9 @@ static void reserve_resources_of_dev(struct pnp_dev *dev) struct resource *res; int i; + if (pci_can_move_bars) + return; + for (i = 0; (res = pnp_get_resource(dev, IORESOURCE_IO, i)); i++) { if (res->flags & IORESOURCE_DISABLED) continue; -- 2.23.0
[PATCH v6 23/30] powerpc/pci: hotplug: Add support for movable BARs
Add pcibios_root_bus_rescan_prepare()/_done() hooks for the powerpc, so it can reassign the PE numbers (which depend on BAR sizes and locations) and update the EEH address cache during a PCI rescan. New PE numbers are assigned during pci_setup_bridges(root) after the rescan is done. CC: Oliver O'Halloran CC: Sam Bobroff Signed-off-by: Sergey Miroshnichenko --- arch/powerpc/kernel/pci-hotplug.c | 43 +++ drivers/pci/probe.c | 10 +++ include/linux/pci.h | 3 +++ 3 files changed, 56 insertions(+) diff --git a/arch/powerpc/kernel/pci-hotplug.c b/arch/powerpc/kernel/pci-hotplug.c index fc62c4bc47b1..42847f5b0f08 100644 --- a/arch/powerpc/kernel/pci-hotplug.c +++ b/arch/powerpc/kernel/pci-hotplug.c @@ -16,6 +16,7 @@ #include #include #include +#include static struct pci_bus *find_bus_among_children(struct pci_bus *bus, struct device_node *dn) @@ -151,3 +152,45 @@ void pci_hp_add_devices(struct pci_bus *bus) pcibios_finish_adding_to_bus(bus); } EXPORT_SYMBOL_GPL(pci_hp_add_devices); + +static void pci_hp_bus_rescan_prepare(struct pci_bus *bus) +{ + struct pci_dev *dev; + + list_for_each_entry(dev, >devices, bus_list) { + struct pci_bus *child = dev->subordinate; + + if (child) + pci_hp_bus_rescan_prepare(child); + + iommu_del_device(>dev); + } + + list_for_each_entry(dev, >devices, bus_list) { + pcibios_release_device(dev); + } +} + +static void pci_hp_bus_rescan_done(struct pci_bus *bus) +{ + struct pci_dev *dev; + + list_for_each_entry(dev, >devices, bus_list) { + struct pci_bus *child = dev->subordinate; + + pcibios_bus_add_device(dev); + + if (child) + pci_hp_bus_rescan_done(child); + } +} + +void pcibios_root_bus_rescan_prepare(struct pci_bus *root) +{ + pci_hp_bus_rescan_prepare(root); +} + +void pcibios_root_bus_rescan_done(struct pci_bus *root) +{ + pci_hp_bus_rescan_done(root); +} diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index 73452aa81417..539f5d39bb6d 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -3235,6 +3235,14 @@ static void 
pci_bus_rescan_done(struct pci_bus *bus) pci_config_pm_runtime_put(bus->self); } +void __weak pcibios_root_bus_rescan_prepare(struct pci_bus *root) +{ +} + +void __weak pcibios_root_bus_rescan_done(struct pci_bus *root) +{ +} + static void pci_setup_bridges(struct pci_bus *bus) { struct pci_dev *dev; @@ -3430,6 +3438,7 @@ unsigned int pci_rescan_bus(struct pci_bus *bus) root = root->parent; if (pci_can_move_bars) { + pcibios_root_bus_rescan_prepare(root); pci_bus_rescan_prepare(root); pci_bus_update_immovable_range(root); pci_bus_release_root_bridge_resources(root); @@ -3440,6 +3449,7 @@ unsigned int pci_rescan_bus(struct pci_bus *bus) pci_setup_bridges(root); pci_bus_rescan_done(root); + pcibios_root_bus_rescan_done(root); } else { max = pci_scan_child_bus(bus); pci_assign_unassigned_bus_resources(bus); diff --git a/include/linux/pci.h b/include/linux/pci.h index e1edcb3fad31..b5821134bdae 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -1275,6 +1275,9 @@ unsigned int pci_rescan_bus(struct pci_bus *bus); void pci_lock_rescan_remove(void); void pci_unlock_rescan_remove(void); +void pcibios_root_bus_rescan_prepare(struct pci_bus *root); +void pcibios_root_bus_rescan_done(struct pci_bus *root); + /* Vital Product Data routines */ ssize_t pci_read_vpd(struct pci_dev *dev, loff_t pos, size_t count, void *buf); ssize_t pci_write_vpd(struct pci_dev *dev, loff_t pos, size_t count, const void *buf); -- 2.23.0
[PATCH v6 24/30] powerpc/powernv/pci: Suppress an EEH error when reading an empty slot
Reading an empty slot returns all ones, which triggers a false EEH error event on PowerNV. A rescan is performed after all the PEs have been unmapped, so the reserved PE index is used for unfreezing. CC: Oliver O'Halloran CC: Sam Bobroff Signed-off-by: Sergey Miroshnichenko --- arch/powerpc/platforms/powernv/pci.c | 13 ++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c index ffd546cf9204..e1b45dc96474 100644 --- a/arch/powerpc/platforms/powernv/pci.c +++ b/arch/powerpc/platforms/powernv/pci.c @@ -768,9 +768,16 @@ static int pnv_pci_read_config(struct pci_bus *bus, *val = 0x; pdn = pci_get_pdn_by_devfn(bus, devfn); - if (!pdn) - return pnv_pci_cfg_read_raw(phb->opal_id, bus->number, devfn, - where, size, val); + if (!pdn) { + ret = pnv_pci_cfg_read_raw(phb->opal_id, bus->number, devfn, + where, size, val); + + if (!ret && (*val == EEH_IO_ERROR_VALUE(size)) && phb->unfreeze_pe) + phb->unfreeze_pe(phb, phb->ioda.reserved_pe_idx, +OPAL_EEH_ACTION_CLEAR_FREEZE_ALL); + + return ret; + } if (!pnv_pci_cfg_check(pdn)) return PCIBIOS_DEVICE_NOT_FOUND; -- 2.23.0
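The "all ones" test in the hunk above depends on the access width. A standalone sketch of that comparison follows; all_ones() is a local helper written for this illustration (it mirrors what the patch expects from EEH_IO_ERROR_VALUE(size), but is not copied from the kernel), and it is only meaningful for sizes of 1, 2 and 4 bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* All-ones pattern for a 1-, 2- or 4-byte config space read. */
static uint32_t all_ones(int size)
{
	return UINT32_MAX >> ((4 - size) * 8);
}

/*
 * A read that succeeded but returned all ones is what an empty slot
 * looks like; a failed read is not treated as an empty slot here.
 */
static bool looks_like_empty_slot(int ret, uint32_t val, int size)
{
	return ret == 0 && val == all_ones(size);
}
```

For a 2-byte vendor ID read, 0xffff is reported as an empty slot while a real ID such as 0x8086 is not.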
[PATCH v6 22/30] powerpc/pci: Create pci_dn on demand
If a struct pci_dn hasn't yet been created for the PCIe device (there was no DT node for it), allocate this structure and fill with info read from the device directly. CC: Oliver O'Halloran CC: Sam Bobroff Signed-off-by: Sergey Miroshnichenko --- arch/powerpc/kernel/pci_dn.c | 88 ++-- 1 file changed, 74 insertions(+), 14 deletions(-) diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c index 9524009ca1ae..ad0ecf48e943 100644 --- a/arch/powerpc/kernel/pci_dn.c +++ b/arch/powerpc/kernel/pci_dn.c @@ -20,6 +20,9 @@ #include #include +static struct pci_dn *pci_create_pdn_from_dev(struct pci_dev *pdev, + struct pci_dn *parent); + /* * The function is used to find the firmware data of one * specific PCI device, which is attached to the indicated @@ -52,6 +55,9 @@ static struct pci_dn *pci_bus_to_pdn(struct pci_bus *bus) dn = pci_bus_to_OF_node(pbus); pdn = dn ? PCI_DN(dn) : NULL; + if (!pdn && pbus->self) + pdn = pbus->self->dev.archdata.pci_data; + return pdn; } @@ -61,10 +67,13 @@ struct pci_dn *pci_get_pdn_by_devfn(struct pci_bus *bus, struct device_node *dn = NULL; struct pci_dn *parent, *pdn; struct pci_dev *pdev = NULL; + bool pdev_found = false; /* Fast path: fetch from PCI device */ list_for_each_entry(pdev, >devices, bus_list) { if (pdev->devfn == devfn) { + pdev_found = true; + if (pdev->dev.archdata.pci_data) return pdev->dev.archdata.pci_data; @@ -73,6 +82,9 @@ struct pci_dn *pci_get_pdn_by_devfn(struct pci_bus *bus, } } + if (!pdev_found) + pdev = NULL; + /* Fast path: fetch from device node */ pdn = dn ? 
PCI_DN(dn) : NULL; if (pdn) @@ -85,9 +97,12 @@ struct pci_dn *pci_get_pdn_by_devfn(struct pci_bus *bus, list_for_each_entry(pdn, >child_list, list) { if (pdn->busno == bus->number && -pdn->devfn == devfn) -return pdn; -} + pdn->devfn == devfn) { + if (pdev) + pdev->dev.archdata.pci_data = pdn; + return pdn; + } + } return NULL; } @@ -117,17 +132,17 @@ struct pci_dn *pci_get_pdn(struct pci_dev *pdev) list_for_each_entry(pdn, >child_list, list) { if (pdn->busno == pdev->bus->number && - pdn->devfn == pdev->devfn) + pdn->devfn == pdev->devfn) { + pdev->dev.archdata.pci_data = pdn; return pdn; + } } - return NULL; + return pci_create_pdn_from_dev(pdev, parent); } -#ifdef CONFIG_PCI_IOV -static struct pci_dn *add_one_dev_pci_data(struct pci_dn *parent, - int vf_index, - int busno, int devfn) +static struct pci_dn *pci_alloc_pdn(struct pci_dn *parent, + int busno, int devfn) { struct pci_dn *pdn; @@ -143,7 +158,6 @@ static struct pci_dn *add_one_dev_pci_data(struct pci_dn *parent, pdn->parent = parent; pdn->busno = busno; pdn->devfn = devfn; - pdn->vf_index = vf_index; pdn->pe_number = IODA_INVALID_PE; INIT_LIST_HEAD(>child_list); INIT_LIST_HEAD(>list); @@ -151,7 +165,51 @@ static struct pci_dn *add_one_dev_pci_data(struct pci_dn *parent, return pdn; } -#endif + +static struct pci_dn *pci_create_pdn_from_dev(struct pci_dev *pdev, + struct pci_dn *parent) +{ + struct pci_dn *pdn = NULL; + u32 class_code; + u16 device_id; + u16 vendor_id; + + if (!parent) + return NULL; + + pdn = pci_alloc_pdn(parent, pdev->bus->busn_res.start, pdev->devfn); + pci_info(pdev, "Create a new pdn for devfn %2x\n", pdev->devfn / 8); + + if (!pdn) { + pci_err(pdev, "%s: Failed to allocate pdn\n", __func__); + return NULL; + } + + #ifdef CONFIG_EEH + if (!eeh_dev_init(pdn)) { + kfree(pdn); + pci_err(pdev, "%s: Failed to allocate edev\n", __func__); + return NULL; + } + #endif /* CONFIG_EEH */ + + pci_bus_read_config_word(pdev->bus, pdev->devfn, +PCI_VENDOR_ID, _id); + pdn->vendor_id = vendor_id; 
+ + pci_bus_read_config_word(pdev->bus, pdev->devfn, +PCI_DEVICE_ID, _id); + pdn->device_id = device_id; + + pci_bus_read_config_dword(pdev->bus, pdev->devfn, + PCI_CLASS_REVISION, _code); + class_code >>=
[PATCH v6 21/30] powerpc/pci: Access PCI config space directly w/o pci_dn
To fetch an updated DT for the newly hotplugged device, OS must explicitly request it from the firmware via the pnv_php driver. If pnv_php wasn't triggered/loaded, it is still possible to discover new devices if PCIe I/O will not stop in absence of the pci_dn structure. Reviewed-by: Oliver O'Halloran Signed-off-by: Sergey Miroshnichenko --- arch/powerpc/kernel/rtas_pci.c | 97 +++- arch/powerpc/platforms/powernv/pci.c | 64 -- 2 files changed, 109 insertions(+), 52 deletions(-) diff --git a/arch/powerpc/kernel/rtas_pci.c b/arch/powerpc/kernel/rtas_pci.c index ae5e43eaca48..912da28b3737 100644 --- a/arch/powerpc/kernel/rtas_pci.c +++ b/arch/powerpc/kernel/rtas_pci.c @@ -42,10 +42,26 @@ static inline int config_access_valid(struct pci_dn *dn, int where) return 0; } -int rtas_read_config(struct pci_dn *pdn, int where, int size, u32 *val) +static int rtas_read_raw_config(unsigned long buid, int busno, unsigned int devfn, + int where, int size, u32 *val) { int returnval = -1; - unsigned long buid, addr; + unsigned long addr = rtas_config_addr(busno, devfn, where); + int ret; + + if (buid) { + ret = rtas_call(ibm_read_pci_config, 4, 2, , + addr, BUID_HI(buid), BUID_LO(buid), size); + } else { + ret = rtas_call(read_pci_config, 2, 2, , addr, size); + } + *val = returnval; + + return ret; +} + +int rtas_read_config(struct pci_dn *pdn, int where, int size, u32 *val) +{ int ret; if (!pdn) @@ -58,16 +74,8 @@ int rtas_read_config(struct pci_dn *pdn, int where, int size, u32 *val) return PCIBIOS_SET_FAILED; #endif - addr = rtas_config_addr(pdn->busno, pdn->devfn, where); - buid = pdn->phb->buid; - if (buid) { - ret = rtas_call(ibm_read_pci_config, 4, 2, , - addr, BUID_HI(buid), BUID_LO(buid), size); - } else { - ret = rtas_call(read_pci_config, 2, 2, , addr, size); - } - *val = returnval; - + ret = rtas_read_raw_config(pdn->phb->buid, pdn->busno, pdn->devfn, + where, size, val); if (ret) return PCIBIOS_DEVICE_NOT_FOUND; @@ -85,18 +93,44 @@ static int rtas_pci_read_config(struct 
pci_bus *bus, pdn = pci_get_pdn_by_devfn(bus, devfn); - /* Validity of pdn is checked in here */ - ret = rtas_read_config(pdn, where, size, val); - if (*val == EEH_IO_ERROR_VALUE(size) && - eeh_dev_check_failure(pdn_to_eeh_dev(pdn))) - return PCIBIOS_DEVICE_NOT_FOUND; + if (pdn) { + /* Validity of pdn is checked in here */ + ret = rtas_read_config(pdn, where, size, val); + + if (*val == EEH_IO_ERROR_VALUE(size) && + eeh_dev_check_failure(pdn_to_eeh_dev(pdn))) + ret = PCIBIOS_DEVICE_NOT_FOUND; + } else { + struct pci_controller *phb = pci_bus_to_host(bus); + + ret = rtas_read_raw_config(phb->buid, bus->number, devfn, + where, size, val); + } return ret; } +static int rtas_write_raw_config(unsigned long buid, int busno, unsigned int devfn, +int where, int size, u32 val) +{ + unsigned long addr = rtas_config_addr(busno, devfn, where); + int ret; + + if (buid) { + ret = rtas_call(ibm_write_pci_config, 5, 1, NULL, addr, + BUID_HI(buid), BUID_LO(buid), size, (ulong)val); + } else { + ret = rtas_call(write_pci_config, 3, 1, NULL, addr, size, (ulong)val); + } + + if (ret) + return PCIBIOS_DEVICE_NOT_FOUND; + + return PCIBIOS_SUCCESSFUL; +} + int rtas_write_config(struct pci_dn *pdn, int where, int size, u32 val) { - unsigned long buid, addr; int ret; if (!pdn) @@ -109,15 +143,8 @@ int rtas_write_config(struct pci_dn *pdn, int where, int size, u32 val) return PCIBIOS_SET_FAILED; #endif - addr = rtas_config_addr(pdn->busno, pdn->devfn, where); - buid = pdn->phb->buid; - if (buid) { - ret = rtas_call(ibm_write_pci_config, 5, 1, NULL, addr, - BUID_HI(buid), BUID_LO(buid), size, (ulong) val); - } else { - ret = rtas_call(write_pci_config, 3, 1, NULL, addr, size, (ulong)val); - } - + ret = rtas_write_raw_config(pdn->phb->buid, pdn->busno, pdn->devfn, + where, size, val); if (ret) return PCIBIOS_DEVICE_NOT_FOUND; @@ -128,12 +155,20 @@ static int rtas_pci_write_config(struct pci_bus *bus, unsigned int devfn, int where, int size,
[PATCH v6 20/30] powerpc/pci: Fix crash with enabled movable BARs
Add a check for the UNSET resource flag to skip the released BARs.

CC: Alexey Kardashevskiy
CC: Oliver O'Halloran
CC: Sam Bobroff
Signed-off-by: Sergey Miroshnichenko
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index c28d0d9b7ee0..33d5ed8c258f 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2976,7 +2976,8 @@ static void pnv_ioda_setup_pe_res(struct pnv_ioda_pe *pe,
 	int index;
 	int64_t rc;
 
-	if (!res || !res->flags || res->start > res->end)
+	if (!res || !res->flags || res->start > res->end ||
+	    (res->flags & IORESOURCE_UNSET))
 		return;
 
 	if (res->flags & IORESOURCE_IO) {
-- 
2.23.0
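Why the extra flag test matters can be shown with a reduced userspace model. The struct and the FAKE_* constants below are local stand-ins defined for this sketch, not the kernel's struct resource or its flag definitions. A released BAR can still carry a type flag and a start/end pair that looks sane, so the older three conditions alone would let it through.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for resource flags, defined only for this sketch. */
#define FAKE_IORESOURCE_MEM	0x00000200ul
#define FAKE_IORESOURCE_UNSET	0x20000000ul

struct fake_resource {
	uint64_t start;
	uint64_t end;
	unsigned long flags;
};

/* Same shape as the condition in the patch, inverted into "usable". */
static bool fake_res_usable(const struct fake_resource *res)
{
	if (!res || !res->flags || res->start > res->end)
		return false;

	if (res->flags & FAKE_IORESOURCE_UNSET)
		return false;	/* released BAR: the address is meaningless */

	return true;
}
```

A resource spanning 0x0..0xfff with the MEM and UNSET flags set passes the first test but is rejected by the second one.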
[PATCH v6 14/30] PCI: Make sure bridge windows include their fixed BARs
When the time comes to select a start address for the bridge window during the root bus rescan, it should be not just a lowest possible address: this window must cover all the underlying fixed and immovable BARs. The lowest address that satisfies this requirement is the .realloc_range field of struct pci_bus, which is calculated during the preparation to the rescan. Signed-off-by: Sergey Miroshnichenko --- drivers/pci/bus.c | 2 +- drivers/pci/setup-res.c | 31 +-- 2 files changed, 30 insertions(+), 3 deletions(-) diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c index 8e40b3e6da77..a1efa87e31b9 100644 --- a/drivers/pci/bus.c +++ b/drivers/pci/bus.c @@ -192,7 +192,7 @@ static int pci_bus_alloc_from_region(struct pci_bus *bus, struct resource *res, * this is an already-configured bridge window, its start * overrides "min". */ - if (avail.start) + if (min_used < avail.start) min_used = avail.start; max = avail.end; diff --git a/drivers/pci/setup-res.c b/drivers/pci/setup-res.c index a1657a8bf93d..1570bbd620cd 100644 --- a/drivers/pci/setup-res.c +++ b/drivers/pci/setup-res.c @@ -248,9 +248,23 @@ static int __pci_assign_resource(struct pci_bus *bus, struct pci_dev *dev, struct resource *res = dev->resource + resno; resource_size_t min; int ret; + resource_size_t start = (resource_size_t)-1; + resource_size_t end = 0; min = (res->flags & IORESOURCE_IO) ? PCIBIOS_MIN_IO : PCIBIOS_MIN_MEM; + if (dev->subordinate && resno >= PCI_BRIDGE_RESOURCES) { + struct pci_bus *child_bus = dev->subordinate; + int b_resno = resno - PCI_BRIDGE_RESOURCES; + struct resource *immovable_range = _bus->immovable_range[b_resno]; + + if (immovable_range->start < immovable_range->end) { + start = immovable_range->start; + end = immovable_range->end; + min = child_bus->realloc_range[b_resno].start; + } + } + /* * First, try exact prefetching match. 
Even if a 64-bit * prefetchable bridge window is below 4GB, we can't put a 32-bit @@ -262,7 +276,7 @@ static int __pci_assign_resource(struct pci_bus *bus, struct pci_dev *dev, IORESOURCE_PREFETCH | IORESOURCE_MEM_64, pcibios_align_resource, dev); if (ret == 0) - return 0; + goto check_fixed; /* * If the prefetchable window is only 32 bits wide, we can put @@ -274,7 +288,7 @@ static int __pci_assign_resource(struct pci_bus *bus, struct pci_dev *dev, IORESOURCE_PREFETCH, pcibios_align_resource, dev); if (ret == 0) - return 0; + goto check_fixed; } /* @@ -287,6 +301,19 @@ static int __pci_assign_resource(struct pci_bus *bus, struct pci_dev *dev, ret = pci_bus_alloc_resource(bus, res, size, align, min, 0, pcibios_align_resource, dev); +check_fixed: + if (ret == 0 && start < end) { + if (res->start > start || res->end < end) { + dev_err(>dev, "fixed area 0x%llx-0x%llx for %s doesn't fit in the allocated %pR (0x%llx-0x%llx)", + (unsigned long long)start, (unsigned long long)end, + dev_name(>dev), + res, (unsigned long long)res->start, + (unsigned long long)res->end); + release_resource(res); + return -1; + } + } + return ret; } -- 2.23.0
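The check_fixed step above boils down to an interval containment test. A minimal userspace sketch, with struct range and window_covers_fixed() as hypothetical names that do not appear in the patch: an empty fixed range (start not below end) means there is nothing to cover, otherwise the allocated window must contain the whole fixed area.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Inclusive address range, a simplified stand-in for a resource. */
struct range {
	uint64_t start;
	uint64_t end;
};

/*
 * Returns true if 'win' is acceptable for a bridge whose children hold
 * the fixed area 'fixed'. Mirrors the "start < end" guard and the
 * "res->start > start || res->end < end" failure test from the hunk.
 */
static bool window_covers_fixed(const struct range *win,
				const struct range *fixed)
{
	if (fixed->start >= fixed->end)
		return true;	/* no fixed area recorded */

	return win->start <= fixed->start && win->end >= fixed->end;
}
```

In the patch a failing test releases the just-allocated resource and returns an error, so that a window that was placed too high or too low is not silently accepted.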
[PATCH v6 16/30] PCI: hotplug: movable BARs: Assign fixed and immovable BARs before others
Reassign resources during rescan in two steps: first the fixed/immovable BARs and bridge windows that have fixed areas, so the movable ones will not steal these reserved areas; then the rest - so the movable BARs will divide the rest of the space. With this change, pci_assign_resource() is now able to assign all types of BARs, so the pdev_assign_fixed_resources() became unused and thus removed. Signed-off-by: Sergey Miroshnichenko --- drivers/pci/pci.h | 2 ++ drivers/pci/setup-bus.c | 78 - drivers/pci/setup-res.c | 7 ++-- 3 files changed, 53 insertions(+), 34 deletions(-) diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 7cd108885598..9b5164d10499 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -289,6 +289,8 @@ void pci_bus_put(struct pci_bus *bus); bool pci_dev_bar_movable(struct pci_dev *dev, struct resource *res); +int assign_fixed_resource_on_bus(struct pci_bus *b, struct resource *r); + /* PCIe link information */ #define PCIE_SPEED2STR(speed) \ ((speed) == PCIE_SPEED_16_0GT ? 
"16 GT/s" : \ diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c index c7365998fbd6..675a612236d7 100644 --- a/drivers/pci/setup-bus.c +++ b/drivers/pci/setup-bus.c @@ -38,6 +38,15 @@ struct pci_dev_resource { unsigned long flags; }; +enum assign_step { + assign_fixed_resources, + assign_float_resources, +}; + +static void _assign_requested_resources_sorted(struct list_head *head, + struct list_head *fail_head, + enum assign_step step); + static void free_list(struct list_head *head) { struct pci_dev_resource *dev_res, *tmp; @@ -278,19 +287,47 @@ static void reassign_resources_sorted(struct list_head *realloc_head, */ static void assign_requested_resources_sorted(struct list_head *head, struct list_head *fail_head) +{ + _assign_requested_resources_sorted(head, fail_head, assign_fixed_resources); + _assign_requested_resources_sorted(head, fail_head, assign_float_resources); +} + +static void _assign_requested_resources_sorted(struct list_head *head, + struct list_head *fail_head, + enum assign_step step) { struct resource *res; struct pci_dev_resource *dev_res; int idx; list_for_each_entry(dev_res, head, list) { + bool is_fixed = false; + if (!pci_dev_bars_enabled(dev_res->dev)) continue; res = dev_res->res; + if (!resource_size(res)) + continue; + idx = res - _res->dev->resource[0]; - if (resource_size(res) && - pci_assign_resource(dev_res->dev, idx)) { + + if (idx < PCI_BRIDGE_RESOURCES) { + is_fixed = !pci_dev_bar_movable(dev_res->dev, res); + } else { + int b_res_idx = pci_get_bridge_resource_idx(res); + struct resource *fixed_res = + _res->dev->subordinate->immovable_range[b_res_idx]; + + is_fixed = (fixed_res->start < fixed_res->end); + } + + if (assign_fixed_resources == step && !is_fixed) + continue; + else if (assign_float_resources == step && is_fixed) + continue; + + if (pci_assign_resource(dev_res->dev, idx)) { if (fail_head) { /* * If the failed resource is a ROM BAR and @@ -1335,7 +1372,7 @@ void pci_bus_size_bridges(struct pci_bus *bus) } 
EXPORT_SYMBOL(pci_bus_size_bridges); -static void assign_fixed_resource_on_bus(struct pci_bus *b, struct resource *r) +int assign_fixed_resource_on_bus(struct pci_bus *b, struct resource *r) { int i; struct resource *parent_r; @@ -1352,35 +1389,14 @@ static void assign_fixed_resource_on_bus(struct pci_bus *b, struct resource *r) !(r->flags & IORESOURCE_PREFETCH)) continue; - if (resource_contains(parent_r, r)) - request_resource(parent_r, r); - } -} - -/* - * Try to assign any resources marked as IORESOURCE_PCI_FIXED, as they are - * skipped by pbus_assign_resources_sorted(). - */ -static void pdev_assign_fixed_resources(struct pci_dev *dev) -{ - int i; - - for (i = 0; i < PCI_NUM_RESOURCES; i++) { - struct pci_bus *b; - struct resource *r = >resource[i]; - - if (r->parent || !(r->flags & IORESOURCE_PCI_FIXED) || - !(r->flags & (IORESOURCE_IO | IORESOURCE_MEM))) - continue; - - b = dev->bus;
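The two-step order introduced by this patch (fixed/immovable first, floating second) can be illustrated outside the kernel. The following is only a sketch under stated assumptions: struct bar, place() and assign_in_two_steps() are hypothetical names, the allocator is a naive first-fit, and the window start is assumed to be aligned to every BAR size. It is not the kernel's allocator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified BAR request; not the kernel's struct resource. */
struct bar {
	uint64_t start;   /* required address if fixed, result if floating */
	uint64_t size;    /* power of two, doubles as the alignment */
	bool fixed;
	bool assigned;
};

static bool overlaps(uint64_t a, uint64_t a_size, uint64_t b, uint64_t b_size)
{
	return a < b + b_size && b < a + a_size;
}

/* True if [start, start + size) does not hit any already assigned BAR. */
static bool range_is_free(const struct bar *bars, size_t n,
			  uint64_t start, uint64_t size)
{
	for (size_t i = 0; i < n; i++)
		if (bars[i].assigned &&
		    overlaps(start, size, bars[i].start, bars[i].size))
			return false;
	return true;
}

/* Fixed BARs keep their address; floating BARs take the first free slot. */
static bool place(struct bar *bars, size_t n, size_t idx,
		  uint64_t win_start, uint64_t win_size)
{
	struct bar *b = &bars[idx];
	uint64_t win_end = win_start + win_size;	/* exclusive */

	assert(b->size > 0 && !b->assigned);

	if (b->fixed) {
		if (b->start < win_start || b->start + b->size > win_end ||
		    !range_is_free(bars, n, b->start, b->size))
			return false;
		b->assigned = true;
		return true;
	}

	for (uint64_t addr = win_start; addr + b->size <= win_end;
	     addr += b->size) {
		if (range_is_free(bars, n, addr, b->size)) {
			b->start = addr;
			b->assigned = true;
			return true;
		}
	}
	return false;
}

/* Step 0 handles only fixed BARs, step 1 only floating ones. */
static size_t assign_in_two_steps(struct bar *bars, size_t n,
				  uint64_t win_start, uint64_t win_size)
{
	size_t placed = 0;

	for (int step = 0; step < 2; step++) {
		bool want_fixed = (step == 0);

		for (size_t i = 0; i < n; i++)
			if (bars[i].fixed == want_fixed &&
			    place(bars, n, i, win_start, win_size))
				placed++;
	}
	return placed;
}
```

With a window at 0x1000 of size 0x2000, a floating 0x1000 BAR listed before a BAR fixed at 0x1000 would take 0x1000 in a single pass and leave the fixed one unassignable. In two steps the fixed BAR is placed first and the floating one lands at 0x2000.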
[PATCH v6 19/30] PCI: hotplug: movable BARs: Ignore the MEM BAR offsets from bootloader
BAR allocation by BIOS/UEFI/bootloader/firmware may be non-optimal and
it may even clash with the kernel's BAR assignment algorithm.

For example, if no space was reserved for SR-IOV BARs, and the bridge
window is packed between immovable BARs (so it is unable to extend),
and if this window can't be moved, the next PCI rescan will fail, as
the kernel tries to find space for all the BARs, including SR-IOV.

With this patch the kernel will use its own methods of BAR allocation
when possible, increasing the chances of a successful hotplug.

Also add a workaround for implicitly used video BARs on x86.

Signed-off-by: Sergey Miroshnichenko
---
 drivers/pci/probe.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 94bbdf9b9dc1..73452aa81417 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -305,6 +305,16 @@ int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
 			 pos, (unsigned long long)region.start);
 	}
 
+	if (pci_can_move_bars &&
+	    !(res->flags & IORESOURCE_IO) &&
+	    (dev->class >> 8) != PCI_CLASS_DISPLAY_VGA) {
+		pci_warn(dev, "ignore the current offset of BAR %llx-%llx\n",
+			 l64, l64 + sz64 - 1);
+		res->start = 0;
+		res->end = sz64 - 1;
+		res->flags |= IORESOURCE_SIZEALIGN;
+	}
+
 	goto out;

-- 
2.23.0
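The condition added to __pci_read_base() can be restated as a small stand-alone function. This is a hedged sketch only: struct bar_res, discard_firmware_offset() and the RES_* / CLASS_* constants are hypothetical stand-ins chosen for illustration (the values are meant to mirror the kernel's IORESOURCE_* and PCI_CLASS_DISPLAY_VGA definitions, but treat them as assumptions).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative constants; assumed to mirror the kernel's definitions. */
#define RES_IO            0x00000100UL
#define RES_MEM           0x00000200UL
#define RES_SIZEALIGN     0x00040000UL
#define CLASS_DISPLAY_VGA 0x0300U

/* Hypothetical stand-in for struct resource (inclusive bounds). */
struct bar_res {
	uint64_t start;
	uint64_t end;
	unsigned long flags;
};

/*
 * Drop the firmware-assigned offset of a memory BAR so that the kernel
 * can place it on its own. I/O BARs and VGA devices are left alone,
 * following the condition in the patch. dev_class is the 24-bit class
 * code (class, subclass, prog-if).
 */
static bool discard_firmware_offset(struct bar_res *res, uint64_t size,
				    unsigned int dev_class, bool can_move_bars)
{
	if (!can_move_bars)
		return false;
	if (res->flags & RES_IO)
		return false;
	if ((dev_class >> 8) == CLASS_DISPLAY_VGA)
		return false;

	res->start = 0;
	res->end = size - 1;
	res->flags |= RES_SIZEALIGN;	/* alignment is now derived from the size */
	return true;
}
```

For example, a 1 MiB memory BAR of an NVMe controller (class code 0x010802) that firmware placed at 0xfe000000 would be reset to the range 0x0-0xfffff, while a BAR of a VGA device (class code 0x030000) keeps its firmware address.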
[PATCH v6 18/30] PCI: hotplug: Configure MPS for hot-added bridges during bus rescan
Ensure that MPS settings are configured for bridges which are discovered
during a manually triggered rescan via sysfs.

This sequence of bridge init (using pci_rescan_bus()) will be used for
pciehp hot-add events when BARs are movable.

Signed-off-by: Sergey Miroshnichenko
---
 drivers/pci/probe.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index d0d00cb3e965..94bbdf9b9dc1 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -3414,7 +3414,7 @@ static void pci_reassign_root_bus_resources(struct pci_bus *root)
 unsigned int pci_rescan_bus(struct pci_bus *bus)
 {
 	unsigned int max;
-	struct pci_bus *root = bus;
+	struct pci_bus *root = bus, *child;
 
 	while (!pci_is_root_bus(root))
 		root = root->parent;
@@ -3435,6 +3435,9 @@ unsigned int pci_rescan_bus(struct pci_bus *bus)
 		pci_assign_unassigned_bus_resources(bus);
 	}
 
+	list_for_each_entry(child, >children, node)
+		pcie_bus_configure_settings(child);
+
 	pci_bus_add_devices(bus);
 
 	return max;
-- 
2.23.0
[PATCH v6 17/30] PCI: hotplug: movable BARs: Don't reserve IO/mem bus space
A hotplugged bridge with many hotplug-capable ports may request reserving
more IO space than the machine has. This could be overridden with the
"hpiosize=" kernel argument though.

But when BARs are movable, there is no need to reserve space anymore: new
BARs are allocated not from reserved gaps, but by rearranging the existing
BARs. Requesting a precise amount of space for bridge windows increases the
chances of adding the new bridge successfully.

Signed-off-by: Sergey Miroshnichenko
---
 drivers/pci/setup-bus.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 675a612236d7..a68ec726010e 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -1285,7 +1285,7 @@ void __pci_bus_size_bridges(struct pci_bus *bus, struct list_head *realloc_head)
 
 	case PCI_HEADER_TYPE_BRIDGE:
 		pci_bridge_check_ranges(bus);
-		if (bus->self->is_hotplug_bridge) {
+		if (bus->self->is_hotplug_bridge && !pci_can_move_bars) {
 			additional_io_size  = pci_hotplug_io_size;
 			additional_mem_size = pci_hotplug_mem_size;
 		}
-- 
2.23.0
[PATCH v6 15/30] PCI: Fix assigning the fixed prefetchable resources
Allow matching IORESOURCE_PCI_FIXED prefetchable BARs to non-prefetchable
windows, so they follow the same rules as immovable BARs.

Signed-off-by: Sergey Miroshnichenko
---
 drivers/pci/setup-bus.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 653ba4d5f191..c7365998fbd6 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -1339,15 +1339,20 @@ static void assign_fixed_resource_on_bus(struct pci_bus *b, struct resource *r)
 {
 	int i;
 	struct resource *parent_r;
-	unsigned long mask = IORESOURCE_IO | IORESOURCE_MEM |
-			     IORESOURCE_PREFETCH;
+	unsigned long mask = IORESOURCE_TYPE_BITS;
 
 	pci_bus_for_each_resource(b, parent_r, i) {
 		if (!parent_r)
 			continue;
 
-		if ((r->flags & mask) == (parent_r->flags & mask) &&
-		    resource_contains(parent_r, r))
+		if ((r->flags & mask) != (parent_r->flags & mask))
+			continue;
+
+		if (parent_r->flags & IORESOURCE_PREFETCH &&
+		    !(r->flags & IORESOURCE_PREFETCH))
+			continue;
+
+		if (resource_contains(parent_r, r))
 			request_resource(parent_r, r);
 	}
 }
-- 
2.23.0
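The matching rule after this change can be summarised as: same resource type, window contains the BAR, and a non-prefetchable BAR never goes into a prefetchable window (the reverse is allowed). A minimal user-space sketch follows; struct res, fixed_res_matches_window() and the RES_* constants are hypothetical names, with values assumed to mirror the kernel's IORESOURCE_* flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative constants; assumed to mirror the kernel's IORESOURCE_* values. */
#define RES_TYPE_BITS 0x00001f00UL
#define RES_IO        0x00000100UL
#define RES_MEM       0x00000200UL
#define RES_PREFETCH  0x00002000UL

/* Hypothetical stand-in for struct resource (inclusive bounds). */
struct res {
	unsigned long flags;
	uint64_t start;
	uint64_t end;
};

/* Can the fixed resource r be requested from the given bridge window? */
static bool fixed_res_matches_window(const struct res *r,
				     const struct res *window)
{
	/* IO must go to IO, MEM to MEM. */
	if ((r->flags & RES_TYPE_BITS) != (window->flags & RES_TYPE_BITS))
		return false;

	/* A non-prefetchable BAR must not sit in a prefetchable window. */
	if ((window->flags & RES_PREFETCH) && !(r->flags & RES_PREFETCH))
		return false;

	/* The window has to fully contain the fixed resource. */
	return window->start <= r->start && window->end >= r->end;
}
```

Under these assumptions a prefetchable fixed BAR matches both a prefetchable and a non-prefetchable memory window, which is the behaviour the commit message describes.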
[PATCH v6 13/30] PCI: hotplug: movable BARs: Compute limits for relocated bridge windows
With enabled movable BARs, bridge windows are recalculated during each pci rescan. Some of the BARs below the bridge may be fixed/immovable: these areas are represented by the .immovable_range field in struct pci_bus. If a bridge window size is equal to its immovable range, it can only be assigned to the start of this range. But if a bridge window size is larger, and this difference in size is denoted as "delta", the window can start from (immovable_range.start - delta) to (immovable_range.start), and it can end from (immovable_range.end) to (immovable_range.end + delta). This range (the new .realloc_range field in struct pci_bus) must then be compared with immovable ranges of neighbouring bridges to guarantee no intersections. This patch only calculates valid ranges for reallocated bridges during pci rescan, and the next one will make use of these values during allocation. Signed-off-by: Sergey Miroshnichenko --- drivers/pci/setup-bus.c | 67 + include/linux/pci.h | 6 2 files changed, 73 insertions(+) diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c index a7546e02ea7c..653ba4d5f191 100644 --- a/drivers/pci/setup-bus.c +++ b/drivers/pci/setup-bus.c @@ -1819,6 +1819,72 @@ static enum enable_type pci_realloc_detect(struct pci_bus *bus, } #endif +/* + * Calculate the address margins where the bridge windows may be allocated to fit all + * the fixed and immovable BARs beneath. 
+ */ +static void pci_bus_update_realloc_range(struct pci_bus *bus) +{ + struct pci_dev *dev; + struct pci_bus *parent = bus->parent; + int idx; + + list_for_each_entry(dev, >devices, bus_list) + if (dev->subordinate) + pci_bus_update_realloc_range(dev->subordinate); + + if (!parent || !bus->self) + return; + + for (idx = 0; idx < PCI_BRIDGE_RESOURCE_NUM; ++idx) { + struct resource *immovable_range = >immovable_range[idx]; + resource_size_t window_size = resource_size(bus->resource[idx]); + resource_size_t realloc_start, realloc_end; + + bus->realloc_range[idx].start = 0; + bus->realloc_range[idx].end = 0; + + /* Check if there any immovable BARs under the bridge */ + if (immovable_range->start >= immovable_range->end) + continue; + + /* The lowest possible address where the bridge window can start */ + realloc_start = immovable_range->end - window_size + 1; + /* The highest possible address where the bridge window can end */ + realloc_end = immovable_range->start + window_size - 1; + + if (realloc_start > immovable_range->start) + realloc_start = immovable_range->start; + + if (realloc_end < immovable_range->end) + realloc_end = immovable_range->end; + + /* +* Check that realloc range doesn't intersect with hard fixed ranges +* of neighboring bridges +*/ + list_for_each_entry(dev, >devices, bus_list) { + struct pci_bus *neighbor = dev->subordinate; + struct resource *n_imm_range; + + if (!neighbor || neighbor == bus) + continue; + + n_imm_range = >immovable_range[idx]; + + if (n_imm_range->start >= n_imm_range->end) + continue; + + if (n_imm_range->end < immovable_range->start && + n_imm_range->end > realloc_start) + realloc_start = n_imm_range->end; + } + + bus->realloc_range[idx].start = realloc_start; + bus->realloc_range[idx].end = realloc_end; + } +} + /* * First try will not touch PCI bridge res. * Second and later try will clear small leaf bridge res. 
@@ -1838,6 +1904,7 @@ void pci_assign_unassigned_root_bus_resources(struct pci_bus *bus) if (pci_can_move_bars) { __pci_bus_size_bridges(bus, NULL); + pci_bus_update_realloc_range(bus); __pci_bus_assign_resources(bus, NULL, NULL); goto dump; diff --git a/include/linux/pci.h b/include/linux/pci.h index ef41be0ce082..e1edcb3fad31 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -587,6 +587,12 @@ struct pci_bus { */ struct resource immovable_range[PCI_BRIDGE_RESOURCE_NUM]; + /* +* Acceptable address range, where the bridge window may reside, considering its +* size, so it will cover all the fixed and immovable BARs below. +*/ + struct resource realloc_range[PCI_BRIDGE_RESOURCE_NUM]; + struct pci_ops *ops; /* Configuration access functions */
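The arithmetic of pci_bus_update_realloc_range() is easier to follow with concrete numbers. The sketch below restates it in user space; struct range, realloc_range() and clamp_to_neighbor() are hypothetical names, and the neighbour clamp deliberately mirrors the patch as posted (it sets the start to the neighbour's end address itself).

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for struct resource; both bounds are inclusive. */
struct range {
	uint64_t start;
	uint64_t end;
};

/*
 * Lowest possible window start and highest possible window end such that
 * a window of window_size still covers the whole immovable range.
 * Returns {0, 0} when there is no immovable range, as the patch does.
 */
static struct range realloc_range(struct range immovable, uint64_t window_size)
{
	struct range r = { 0, 0 };

	if (immovable.start >= immovable.end)
		return r;

	r.start = immovable.end - window_size + 1;
	r.end = immovable.start + window_size - 1;

	/* A window smaller than the immovable range cannot shrink the range. */
	if (r.start > immovable.start)
		r.start = immovable.start;
	if (r.end < immovable.end)
		r.end = immovable.end;

	return r;
}

/* Keep the window from sliding down over a neighbour's immovable range. */
static struct range clamp_to_neighbor(struct range realloc,
				      struct range immovable,
				      struct range neighbor)
{
	if (neighbor.start >= neighbor.end)
		return realloc;

	if (neighbor.end < immovable.start && neighbor.end > realloc.start)
		realloc.start = neighbor.end;

	return realloc;
}
```

For an immovable range 0x10000-0x10fff and a window of size 0x3000, delta is 0x2000, so the window may start as low as 0xe000 and end as high as 0x12fff. A neighbouring bridge with an immovable range 0xf000-0xf7ff raises the lowest start to 0xf7ff.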
[PATCH v6 12/30] PCI: hotplug: movable BARs: Calculate immovable parts of bridge windows
When movable BARs are enabled, and if a bridge contains a device with fixed
(IORESOURCE_PCI_FIXED) or immovable BARs, the corresponding windows can't be
moved too far away from their original positions - they must still contain
all the fixed/immovable BARs, like this:

1) Window position before a bus rescan:

  | <--                    root bridge window                    --> |
  |                                                                  |
  | | <--     bridge window    --> |                                 |
  | | movable BARs | **fixed BAR** |                                 |

2) Possible valid outcome after rescan and move:

  | <--                    root bridge window                    --> |
  |                                                                  |
  |                 | <--     bridge window    --> |                 |
  |                 | **fixed BAR** | Movable BARs |                 |

An immovable area of a bridge (separate for IO, MEM and MEM64 window types)
is a range that covers all the fixed and immovable BARs of direct children,
and all the fixed areas of child bridges:

  | <--                    root bridge window                    --> |
  |                                                                  |
  | | <--               bridge window level 1                --> |   |
  | |            immovable area of this bridge window            |   |
  | |                                                            |   |
  | | **fixed BAR** | <--    bridge window level 2    --> | BARs |   |
  | |               | *    fixed area of this bridge    * |      |   |
  | |               |                                     |      |   |
  | |               | ***fixed BAR*** | | ***fixed BAR*** |      |   |

To store these areas, the .immovable_range field has been added to struct
pci_bus. It is filled recursively from leaves to the root before a rescan.

Also make pbus_size_io() and pbus_size_mem() return their usual result OR the
size of the immovable range of the corresponding type, whichever is larger.
Signed-off-by: Sergey Miroshnichenko --- drivers/pci/pci.h | 14 +++ drivers/pci/probe.c | 88 + drivers/pci/setup-bus.c | 17 include/linux/pci.h | 6 +++ 4 files changed, 125 insertions(+) diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 55344f2c55bf..7cd108885598 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -401,6 +401,20 @@ static inline bool pci_dev_is_disconnected(const struct pci_dev *dev) return dev->error_state == pci_channel_io_perm_failure; } +static inline int pci_get_bridge_resource_idx(struct resource *r) +{ + int idx = 1; + + if (r->flags & IORESOURCE_IO) + idx = 0; + else if (!(r->flags & IORESOURCE_PREFETCH)) + idx = 1; + else if (r->flags & IORESOURCE_MEM_64) + idx = 2; + + return idx; +} + /* pci_dev priv_flags */ #define PCI_DEV_ADDED 0 #define PCI_DEV_DISABLED_BARS 1 diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index 2d1157493e6a..d0d00cb3e965 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -545,6 +545,7 @@ void pci_read_bridge_bases(struct pci_bus *child) static struct pci_bus *pci_alloc_bus(struct pci_bus *parent) { struct pci_bus *b; + int idx; b = kzalloc(sizeof(*b), GFP_KERNEL); if (!b) @@ -561,6 +562,11 @@ static struct pci_bus *pci_alloc_bus(struct pci_bus *parent) if (parent) b->domain_nr = parent->domain_nr; #endif + for (idx = 0; idx < PCI_BRIDGE_RESOURCE_NUM; ++idx) { + b->immovable_range[idx].start = 0; + b->immovable_range[idx].end = 0; + } + return b; } @@ -3238,6 +3244,87 @@ static void pci_setup_bridges(struct pci_bus *bus) pci_setup_bridge(bus); } +static void pci_bus_update_immovable_range(struct pci_bus *bus) +{ + struct pci_dev *dev; + int idx; + resource_size_t start, end; + + for (idx = 0; idx < PCI_BRIDGE_RESOURCE_NUM; ++idx) { + bus->immovable_range[idx].start = 0; + bus->immovable_range[idx].end = 0; + } + + list_for_each_entry(dev, >devices, bus_list) + if (dev->subordinate) + pci_bus_update_immovable_range(dev->subordinate); + + list_for_each_entry(dev, >devices, bus_list) { + 
int i; + struct pci_bus *child = dev->subordinate; + + for (i = 0; i < PCI_BRIDGE_RESOURCES; ++i) { + struct resource *r = >resource[i]; + + if (!r->flags || (r->flags & IORESOURCE_UNSET) || !r->parent) + continue; + + if (!pci_dev_bar_movable(dev, r)) { + idx =
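The diff above is cut off in this archive, so the exact accumulation loop is not visible. As a rough illustration of the idea from the commit message (an immovable range that covers every fixed/immovable BAR, and a window size that is never smaller than that range), here is a sketch with hypothetical helpers range_cover(), range_size() and window_size(); it is an assumption about the shape of the logic, not a copy of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for struct resource; both bounds are inclusive. */
struct range {
	uint64_t start;
	uint64_t end;
};

/* The patch treats start >= end as "no immovable BARs here". */
static bool range_is_empty(const struct range *r)
{
	return r->start >= r->end;
}

/* Grow acc so that it also covers [start, end]. */
static void range_cover(struct range *acc, uint64_t start, uint64_t end)
{
	if (range_is_empty(acc)) {
		acc->start = start;
		acc->end = end;
		return;
	}
	if (start < acc->start)
		acc->start = start;
	if (end > acc->end)
		acc->end = end;
}

static uint64_t range_size(const struct range *r)
{
	return range_is_empty(r) ? 0 : r->end - r->start + 1;
}

/* The usual size estimate or the immovable span, whichever is larger. */
static uint64_t window_size(uint64_t usual_size, const struct range *immovable)
{
	uint64_t imm_size = range_size(immovable);

	return usual_size > imm_size ? usual_size : imm_size;
}
```

Two fixed BARs at 0x2000-0x2fff and 0x5000-0x50ff give an immovable range of 0x2000-0x50ff (size 0x3100). A bridge whose BARs add up to only 0x1100 bytes would still need a window of at least 0x3100 bytes to keep both in place.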
[PATCH v6 11/30] PCI: hotplug: movable BARs: Try to assign unassigned resources only once
When BAR movement is enabled, BARs and bridge windows can only be assigned to
their direct parents, so there is only one possible resource tree. Every
retry within pci_assign_unassigned_root_bus_resources() would therefore
produce the same tree, and it is enough to try just once.

In case of failure, pci_reassign_root_bus_resources() disables the BARs of
one of the hotplugged devices and tries the assignment again.

Signed-off-by: Sergey Miroshnichenko
---
 drivers/pci/setup-bus.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index cf325daae1b1..3deb1c343e89 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -1819,6 +1819,13 @@ void pci_assign_unassigned_root_bus_resources(struct pci_bus *bus)
 	int pci_try_num = 1;
 	enum enable_type enable_local;
 
+	if (pci_can_move_bars) {
+		__pci_bus_size_bridges(bus, NULL);
+		__pci_bus_assign_resources(bus, NULL, NULL);
+
+		goto dump;
+	}
+
 	/* Don't realloc if asked to do so */
 	enable_local = pci_realloc_detect(bus, pci_realloc_enable);
 	if (pci_realloc_enabled(enable_local)) {
-- 
2.23.0
[PATCH v6 10/30] PCI: Prohibit assigning BARs and bridge windows to non-direct parents
When movable BARs are enabled, the resource relocation feature from
commit 2bbc6942273b5 ("PCI : ability to relocate assigned pci-resources")
is not used. Instead, the inability to assign a resource is used as a signal
to retry BAR assignment with another configuration of bridge windows.

Signed-off-by: Sergey Miroshnichenko
---
 drivers/pci/setup-bus.c |  2 ++
 drivers/pci/setup-res.c | 12 ++++++++++++
 2 files changed, 14 insertions(+)

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index ff33b47b1bb7..cf325daae1b1 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -1355,6 +1355,8 @@ static void pdev_assign_fixed_resources(struct pci_dev *dev)
 		while (b && !r->parent) {
 			assign_fixed_resource_on_bus(b, r);
 			b = b->parent;
+			if (!r->parent && pci_can_move_bars)
+				break;
 		}
 	}
 }
diff --git a/drivers/pci/setup-res.c b/drivers/pci/setup-res.c
index d8ca40a97693..a1657a8bf93d 100644
--- a/drivers/pci/setup-res.c
+++ b/drivers/pci/setup-res.c
@@ -298,6 +298,18 @@ static int _pci_assign_resource(struct pci_dev *dev, int resno,
 
 	bus = dev->bus;
 	while ((ret = __pci_assign_resource(bus, dev, resno, size, min_align))) {
+		if (pci_can_move_bars) {
+			if (resno >= PCI_BRIDGE_RESOURCES &&
+			    resno <= PCI_BRIDGE_RESOURCE_END) {
+				struct resource *res = dev->resource + resno;
+
+				res->start = 0;
+				res->end = 0;
+				res->flags = 0;
+			}
+			break;
+		}
+
 		if (!bus->parent || !bus->self->transparent)
 			break;
 		bus = bus->parent;
-- 
2.23.0
[PATCH v6 09/30] PCI: Include fixed and immovable BARs into the bus size calculating
The only difference between fixed/immovable and movable BARs is that the
former preserve their size and offset after they are released (the
corresponding struct resource is detached from a bridge window for a while
during a bus rescan).

Include fixed/immovable BARs in the result of pbus_size_mem() and prohibit
assigning them to non-direct parents.

Signed-off-by: Sergey Miroshnichenko
---
 drivers/pci/setup-bus.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 4b538d132958..ff33b47b1bb7 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -1011,12 +1011,20 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
 			struct resource *r = &dev->resource[i];
 			resource_size_t r_size;
 
-			if (r->parent || (r->flags & IORESOURCE_PCI_FIXED) ||
+			if (r->parent ||
 			    ((r->flags & mask) != type &&
 			     (r->flags & mask) != type2 &&
 			     (r->flags & mask) != type3))
 				continue;
 			r_size = resource_size(r);
+
+			if (!pci_dev_bar_movable(dev, r)) {
+				if (pci_can_move_bars)
+					size += r_size;
+
+				continue;
+			}
+
 #ifdef CONFIG_PCI_IOV
 			/* Put SRIOV requested res to the optional list */
 			if (realloc_head && i >= PCI_IOV_RESOURCES &&
-- 
2.23.0
[PATCH v6 06/30] PCI: hotplug: movable BARs: Recalculate all bridge windows during rescan
When the movable BARs feature is enabled and a rescan has been requested,
release all the bridge windows and recalculate them from scratch, taking
into account all kinds of BARs: fixed, immovable, movable, new.

This increases the chances of finding memory space to fit BARs for newly
hotplugged devices, especially if no/not enough gaps were reserved by the
BIOS/bootloader/firmware.

The last step of writing the recalculated windows to the bridges is done
by the new pci_setup_bridges() function.

Signed-off-by: Sergey Miroshnichenko
---
 drivers/pci/pci.h       |  1 +
 drivers/pci/probe.c     | 22 ++++++++++++++++++++++
 drivers/pci/setup-bus.c | 16 ++++++++++++++++
 3 files changed, 39 insertions(+)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 19bc50597d12..4a3f2b69285b 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -280,6 +280,7 @@ void __pci_bus_assign_resources(const struct pci_bus *bus,
 				struct list_head *realloc_head,
 				struct list_head *fail_head);
 bool pci_bus_clip_resource(struct pci_dev *dev, int idx);
+void pci_bus_release_root_bridge_resources(struct pci_bus *bus);
 
 void pci_reassigndev_resource_alignment(struct pci_dev *dev);
 void pci_disable_bridge_window(struct pci_dev *dev);
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 3d8c0f653378..d2dbec51c4df 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -3200,6 +3200,25 @@ static void pci_bus_rescan_done(struct pci_bus *bus)
 	pci_config_pm_runtime_put(bus->self);
 }
 
+static void pci_setup_bridges(struct pci_bus *bus)
+{
+	struct pci_dev *dev;
+
+	list_for_each_entry(dev, &bus->devices, bus_list) {
+		struct pci_bus *child;
+
+		if (!pci_dev_is_added(dev))
+			continue;
+
+		child = dev->subordinate;
+		if (child)
+			pci_setup_bridges(child);
+	}
+
+	if (bus->self)
+		pci_setup_bridge(bus);
+}
+
 /**
  * pci_rescan_bus - Scan a PCI bus for devices
  * @bus: PCI bus to scan
@@ -3221,8 +3240,11 @@ unsigned int pci_rescan_bus(struct pci_bus *bus)
 		pci_bus_rescan_prepare(root);
 
 		max = pci_scan_child_bus(root);
+
+		pci_bus_release_root_bridge_resources(root);
 		pci_assign_unassigned_root_bus_resources(root);
 
+		pci_setup_bridges(root);
 		pci_bus_rescan_done(root);
 	} else {
 		max = pci_scan_child_bus(bus);
diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index f2f02e6c9000..075e8185b936 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -1635,6 +1635,22 @@ static void pci_bus_release_bridge_resources(struct pci_bus *bus,
 	pci_bridge_release_resources(bus, type);
 }
 
+void pci_bus_release_root_bridge_resources(struct pci_bus *root_bus)
+{
+	int i;
+	struct resource *r;
+
+	pci_bus_release_bridge_resources(root_bus, IORESOURCE_IO, whole_subtree);
+	pci_bus_release_bridge_resources(root_bus, IORESOURCE_MEM, whole_subtree);
+	pci_bus_release_bridge_resources(root_bus,
+					 IORESOURCE_MEM_64 | IORESOURCE_PREFETCH,
+					 whole_subtree);
+
+	pci_bus_for_each_resource(root_bus, r, i) {
+		pci_release_child_resources(root_bus, r);
+	}
+}
+
 static void pci_bus_dump_res(struct pci_bus *bus)
 {
 	struct resource *res;
-- 
2.23.0
[PATCH v6 08/30] PCI: hotplug: movable BARs: Don't allow added devices to steal resources
When movable BARs are enabled, the PCI subsystem first releases all the
bridge windows and then attempts to assign resources both to previously
working devices and to the newly hotplugged ones, with the same priority.

If a hotplugged device gets its BARs first, this may lead to a lack of space
for already working devices, which is unacceptable. If that happens, mark one
of the new devices with the newly introduced flag PCI_DEV_DISABLED_BARS (if
it is not yet marked) and retry the BAR recalculation.

The worst case would be no BARs for hotplugged devices, while all the rest
just continue working.

The algorithm is simple and it doesn't retry different subsets of hot-added
devices in case of a failure, e.g. if there is no space to allocate BARs for
both hotplugged devices A and B, but there is enough for just A, A will be
marked with PCI_DEV_DISABLED_BARS first, then (after the next failure) B.
As a result, A will not get BARs although it could. This issue is only
relevant when hotplugging two or more devices simultaneously.

Add a new res_mask bitmask to struct pci_dev for storing the indices of
assigned BARs.
Signed-off-by: Sergey Miroshnichenko --- drivers/pci/pci.h | 11 + drivers/pci/probe.c | 102 ++-- drivers/pci/setup-bus.c | 15 ++ include/linux/pci.h | 1 + 4 files changed, 126 insertions(+), 3 deletions(-) diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 4a3f2b69285b..55344f2c55bf 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -403,6 +403,7 @@ static inline bool pci_dev_is_disconnected(const struct pci_dev *dev) /* pci_dev priv_flags */ #define PCI_DEV_ADDED 0 +#define PCI_DEV_DISABLED_BARS 1 static inline void pci_dev_assign_added(struct pci_dev *dev, bool added) { @@ -414,6 +415,16 @@ static inline bool pci_dev_is_added(const struct pci_dev *dev) return test_bit(PCI_DEV_ADDED, >priv_flags); } +static inline void pci_dev_disable_bars(struct pci_dev *dev) +{ + assign_bit(PCI_DEV_DISABLED_BARS, >priv_flags, true); +} + +static inline bool pci_dev_bars_enabled(const struct pci_dev *dev) +{ + return !test_bit(PCI_DEV_DISABLED_BARS, >priv_flags); +} + #ifdef CONFIG_PCIEAER #include diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index d2dbec51c4df..2d1157493e6a 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -3162,6 +3162,23 @@ bool pci_dev_bar_movable(struct pci_dev *dev, struct resource *res) return pci_dev_movable(dev, res->child); } +static unsigned int pci_dev_count_res_mask(struct pci_dev *dev) +{ + unsigned int res_mask = 0; + int i; + + for (i = 0; i < PCI_BRIDGE_RESOURCES; i++) { + struct resource *r = >resource[i]; + + if (!r->flags || (r->flags & IORESOURCE_UNSET) || !r->parent) + continue; + + res_mask |= (1 << i); + } + + return res_mask; +} + static void pci_bus_rescan_prepare(struct pci_bus *bus) { struct pci_dev *dev; @@ -3172,6 +3189,8 @@ static void pci_bus_rescan_prepare(struct pci_bus *bus) list_for_each_entry(dev, >devices, bus_list) { struct pci_bus *child = dev->subordinate; + dev->res_mask = pci_dev_count_res_mask(dev); + if (child) pci_bus_rescan_prepare(child); @@ -3207,7 +3226,7 @@ static void 
pci_setup_bridges(struct pci_bus *bus) list_for_each_entry(dev, >devices, bus_list) { struct pci_bus *child; - if (!pci_dev_is_added(dev)) + if (!pci_dev_is_added(dev) || !pci_dev_bars_enabled(dev)) continue; child = dev->subordinate; @@ -3219,6 +3238,83 @@ static void pci_setup_bridges(struct pci_bus *bus) pci_setup_bridge(bus); } +static struct pci_dev *pci_find_next_new_device(struct pci_bus *bus) +{ + struct pci_dev *dev; + + if (!bus) + return NULL; + + list_for_each_entry(dev, >devices, bus_list) { + struct pci_bus *child_bus = dev->subordinate; + + if (!pci_dev_is_added(dev) && pci_dev_bars_enabled(dev)) + return dev; + + if (child_bus) { + struct pci_dev *next_new_dev; + + next_new_dev = pci_find_next_new_device(child_bus); + if (next_new_dev) + return next_new_dev; + } + } + + return NULL; +} + +static bool pci_bus_check_all_bars_reassigned(struct pci_bus *bus) +{ + struct pci_dev *dev; + bool ret = true; + + if (!bus) + return false; + + list_for_each_entry(dev, >devices, bus_list) { + struct pci_bus *child = dev->subordinate; + unsigned int res_mask = pci_dev_count_res_mask(dev); + + if (!pci_dev_bars_enabled(dev)) +
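The retry strategy from the commit message (disable the BARs of one new device per failed attempt, in discovery order) and its documented limitation with devices A and B can be reproduced with a toy model. This is a sketch under loud assumptions: struct dev, try_assign(), find_next_new_device() and rescan_assign() are hypothetical, and "assignment" is reduced to comparing a sum of sizes against a capacity, which is far simpler than real BAR placement.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy device: only the total BAR size and two state bits matter here. */
struct dev {
	unsigned int size;	/* address space needed by all its BARs */
	bool is_new;		/* hot-added during this rescan */
	bool bars_disabled;	/* models PCI_DEV_DISABLED_BARS */
};

/* Toy allocator: succeeds when all enabled devices fit into capacity. */
static bool try_assign(const struct dev *devs, size_t n, unsigned int capacity)
{
	unsigned int need = 0;

	for (size_t i = 0; i < n; i++)
		if (!devs[i].bars_disabled)
			need += devs[i].size;

	return need <= capacity;
}

/* First hot-added device whose BARs are still enabled, or NULL. */
static struct dev *find_next_new_device(struct dev *devs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (devs[i].is_new && !devs[i].bars_disabled)
			return &devs[i];

	return NULL;
}

/* Retry until the assignment succeeds or no new device is left to disable. */
static bool rescan_assign(struct dev *devs, size_t n, unsigned int capacity)
{
	while (!try_assign(devs, n, capacity)) {
		struct dev *victim = find_next_new_device(devs, n);

		if (!victim)
			return false;	/* even the old devices alone don't fit */

		victim->bars_disabled = true;
	}
	return true;
}
```

With a capacity of 10, an existing device of size 6, and hot-added devices A (size 4) and B (size 5), the first attempt fails, A is disabled, the second attempt (6 + 5) fails as well, and B is disabled. The final state has both A and B without BARs, although A alone would have fitted, which is exactly the limitation the commit message points out.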
[PATCH v6 05/30] PCI: hotplug: movable BARs: Fix reassigning the released bridge windows
When a bridge window is temporarily released during the rescan, its old size
is not relevant anymore - it will be recreated from pbus_size_*(), so its
start value should be zero.

If such a window can't be reassigned, don't apply reset_resource(), so the
next retry may succeed.

Signed-off-by: Sergey Miroshnichenko
---
 drivers/pci/setup-bus.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 2c02eb1acf5d..f2f02e6c9000 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -295,7 +295,8 @@ static void assign_requested_resources_sorted(struct list_head *head,
 						    0 /* don't care */,
 						    0 /* don't care */);
 			}
-			reset_resource(res);
+			if (!pci_can_move_bars)
+				reset_resource(res);
 		}
 	}
 }
@@ -1579,8 +1580,8 @@ static void pci_bridge_release_resources(struct pci_bus *bus,
 		type = old_flags = r->flags & PCI_RES_TYPE_MASK;
 		pci_info(dev, "resource %d %pR released\n",
 			 PCI_BRIDGE_RESOURCES + idx, r);
-		/* Keep the old size */
-		r->end = resource_size(r) - 1;
+		/* Don't keep the old size if the bridge will be recalculated */
+		r->end = pci_can_move_bars ? 0 : (resource_size(r) - 1);
 		r->start = 0;
 		r->flags = 0;
 
-- 
2.23.0
[PATCH v6 07/30] PCI: hotplug: movable BARs: Don't disable the released bridge windows
On a hotplug event with BAR movement enabled, calculating the new bridge
windows takes some time. During this procedure, the structures that represent
these windows are released - marked for recalculation. When the new bridge
windows are ready, they are written to the registers of every bridge via
pci_setup_bridges().

Currently, a bridge's registers are updated immediately after releasing a
window to disable it. But if a driver doesn't yet support movable BARs, it
doesn't stop MEM transactions during the hotplug, so disabled bridge windows
will break them.

Let the bridge windows remain operational after being released, as they will
be updated to the new values at the end of a hotplug event.

Signed-off-by: Sergey Miroshnichenko
---
 drivers/pci/setup-bus.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 075e8185b936..381ce964cb20 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -1588,7 +1588,8 @@ static void pci_bridge_release_resources(struct pci_bus *bus,
 		/* Avoiding touch the one without PREF */
 		if (type & IORESOURCE_PREFETCH)
 			type = IORESOURCE_PREFETCH;
-		__pci_setup_bridge(bus, type);
+		if (!pci_can_move_bars)
+			__pci_setup_bridge(bus, type);
 		/* For next child res under same bridge */
 		r->flags = old_flags;
 	}
-- 
2.23.0
[PATCH v6 04/30] PCI: Define PCI-specific version of the release_child_resources()
If the bridge resources are released with the standard
release_child_resources(), it resets the .start field of the children's BARs
to zero but leaves the STARTALIGN flag set, which makes the resource invalid
for reassignment.

Some resources must preserve their offset and size: those marked with
PCI_FIXED and the immovable ones, which are bound by drivers without support
for the movable BARs feature.

Add pci_release_child_resources() to replace release_child_resources() in
handling the described PCI-specific cases.

Signed-off-by: Sergey Miroshnichenko
---
 drivers/pci/setup-bus.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 53 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index e7dbe21705ba..2c02eb1acf5d 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -1482,6 +1482,54 @@ static void __pci_bridge_assign_resources(const struct pci_dev *bridge,
 	(IORESOURCE_IO | IORESOURCE_MEM | IORESOURCE_PREFETCH |\
 	 IORESOURCE_MEM_64)
 
+/*
+ * Similar to generic release_child_resources(), but aware of immovable BARs and
+ * PCI_FIXED and STARTALIGN flags
+ */
+static void pci_release_child_resources(struct pci_bus *bus, struct resource *r)
+{
+	struct pci_dev *dev;
+
+	if (!bus || !r)
+		return;
+
+	if (r->flags & IORESOURCE_PCI_FIXED)
+		return;
+
+	r->child = NULL;
+
+	list_for_each_entry(dev, &bus->devices, bus_list) {
+		int i;
+
+		for (i = 0; i < PCI_NUM_RESOURCES; i++) {
+			struct resource *tmp = &dev->resource[i];
+			resource_size_t size = resource_size(tmp);
+
+			if (!tmp->flags || tmp->parent != r)
+				continue;
+
+			tmp->parent = NULL;
+			tmp->sibling = NULL;
+
+			pci_release_child_resources(dev->subordinate, tmp);
+
+			tmp->flags &= ~IORESOURCE_STARTALIGN;
+			tmp->flags |= IORESOURCE_SIZEALIGN;
+
+			if (!pci_dev_bar_movable(dev, tmp)) {
+				pci_dbg(dev, "release immovable %pR (%s), keep its flags, base and size\n",
+					tmp, tmp->name);
+				continue;
+			}
+
+			pci_dbg(dev, "release %pR (%s)\n", tmp, tmp->name);
+
+			tmp->start = 0;
+			tmp->end = size - 1;
+		}
+	}
+}
+
 static void pci_bridge_release_resources(struct pci_bus *bus,
 					  unsigned long type)
 {
@@ -1522,7 +1570,11 @@ static void pci_bridge_release_resources(struct pci_bus *bus,
 		return;
 
 	/* If there are children, release them all */
-	release_child_resources(r);
+	if (pci_can_move_bars)
+		pci_release_child_resources(bus, r);
+	else
+		release_child_resources(r);
+
 	if (!release_resource(r)) {
 		type = old_flags = r->flags & PCI_RES_TYPE_MASK;
 		pci_info(dev, "resource %d %pR released\n",
-- 
2.23.0
[PATCH v6 00/30] PCI: Allow BAR movement during hotplug
Currently, PCI hotplug works on top of resources that are usually reserved
not by the kernel, but by the BIOS, bootloader, firmware, etc. These
resources are gaps in the address space where BARs of new devices may fit,
and extra bus numbers per port, so that bridges can be hot-added.

This series addresses the former problem: it teaches the kernel how to
redistribute the address space at run time, so that hotplug becomes
predictable and cross-platform. A follow-up patchset will propose a
solution for bus numbers.

If the memory is arranged in a way that doesn't provide enough space for
the BARs of a new hotplugged device, the kernel can pause the drivers of
the "obstructing" devices and move their BARs, so that the new BARs can fit
into the freed space.

To rearrange the BARs and bridge windows, these patches release all of
them after a rescan and re-assign them in the same way as during the
initial PCIe topology scan at system boot.

When a driver is un-paused by the kernel after the PCIe rescan, it should
ioremap() the new addresses of its BARs.

Drivers indicate their support of the feature by implementing the new
hooks .rescan_prepare() and .rescan_done() in the struct pci_driver. If a
driver doesn't yet support the feature, the BARs of its devices are
considered immovable (by checking pci_dev_movable_bars_supported(dev)) and
handled in the same way as resources with the IORESOURCE_PCI_FIXED flag, so
they are guaranteed to remain untouched.

Tested on:
 - x86_64 with "pci=pcie_bus_peer2peer"
 - POWER8 PowerNV+OPAL+PHB3 ppc64le with "pci=pcie_bus_peer2peer".

This patchset is a part of our work on adding support for hotplugging
bridges full of other bridges, NVME drives, SAS HBAs and GPUs without
special requirements such as a Hot-Plug Controller or reservation of bus
numbers or memory regions by firmware.
Changes since v5:
 - Simplified the disable flag, now it is "pci=no_movable_bars";
 - More deliberate marking of the BARs as immovable;
 - Mark as immovable BARs which are used by unbound drivers;
 - Ignoring BAR assignment by non-kernel program components, so the kernel
   is now able to distribute BARs in an optimal and predictable way;
 - Move here PowerNV-specific patches from the older "powerpc/powernv/pci:
   Make hotplug self-sufficient, independent of FW and DT" series;
 - Fix EEH cache rebuilding and PE allocation for PowerNV during rescan.

Changes since v4:
 - Feature is enabled by default (turned on by one of the latest patches);
 - Add pci_dev_movable_bars_supported(dev) instead of marking the immovable
   BARs with the IORESOURCE_PCI_FIXED flag;
 - Set up PCIe bridges during rescan via sysfs, so MPS settings are now
   configured not only during system boot or pcihp events;
 - Allow movement of switch's BARs if claimed by portdrv;
 - Update EEH address caches after rescan for powerpc;
 - Don't completely disable hot-added devices whose BARs can't be fit;
   just disable their BARs, so they are still visible in lspci etc;
 - Clearer names: fixed_range_hard -> immovable_range, fixed_range_soft ->
   realloc_range;
 - Drop the patch for pci_restore_config_space() - fixed by properly using
   the runtime PM.

Changes since v3:
 - Rebased to the upstream, so the patches apply cleanly again.

Changes since v2:
 - Fixed double-assignment of bridge windows;
 - Fixed assignment of fixed prefetched resources;
 - Fixed releasing of fixed resources;
 - Fixed a debug message;
 - Removed auto-enabling the movable BARs for x86 - let's rely on the
   "pcie_movable_bars=force" option for now;
 - Reordered the patches - bugfixes first.

Changes since v1:
 - Add a "pcie_movable_bars={ off | force }" command line argument;
 - Handle the IORESOURCE_PCI_FIXED flag properly;
 - Don't move BARs of devices which don't support the feature;
 - Guarantee that new hotplugged devices will not steal memory from working
   devices by ignoring the failing new devices with the new PCI_DEV_IGNORE
   flag;
 - Add rescan_prepare()+rescan_done() to the struct pci_driver instead of
   using the reset_prepare()+reset_done() from struct pci_error_handlers;
 - Add a bugfix of a race condition;
 - Fixed hotplug in a non-pre-enabled (by BIOS/firmware) bridge;
 - Fix the compatibility of the feature with pm_runtime and D3-state;
 - Hotplug events from pciehp also can move BARs;
 - Add support of the feature to the NVME driver.

Sergey Miroshnichenko (30):
  PCI: Fix race condition in pci_enable/disable_device()
  PCI: Enable bridge's I/O and MEM access for hotplugged devices
  PCI: hotplug: Add a flag for the movable BARs feature
  PCI: Define PCI-specific version of the release_child_resources()
  PCI: hotplug: movable BARs: Fix reassigning the released bridge windows
  PCI: hotplug: movable BARs: Recalculate all bridge windows during rescan
  PCI: hotplug: movable BARs: Don't disable the released bridge windows
  PCI: hotplug: movable BARs: Don't allow added devices to steal resources
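
The driver-side contract described in the cover letter (pause and unmap in .rescan_prepare(), remap and resume in .rescan_done(), BARs treated as immovable when a bound driver lacks the hooks) can be pictured with a small userspace model. Everything here is an assumption made for illustration: struct toy_driver, struct toy_dev and the toy_* functions are hypothetical and are not the kernel's struct pci_driver or the series' real helpers. The model only covers devices that are bound to a driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_dev;

/* Hypothetical miniature of a driver with the two new hooks. */
struct toy_driver {
	void (*rescan_prepare)(struct toy_dev *dev);
	void (*rescan_done)(struct toy_dev *dev);
};

struct toy_dev {
	const struct toy_driver *driver;
	uint64_t bar_start;	/* where BAR 0 currently lives */
	uint64_t mapped_start;	/* what the driver has mapped, 0 if unmapped */
	bool paused;
};

/* A bound driver supports movable BARs only if it provides both hooks. */
static bool toy_movable_bars_supported(const struct toy_dev *dev)
{
	return dev->driver && dev->driver->rescan_prepare &&
	       dev->driver->rescan_done;
}

/* Stands in for "stop all activity and iounmap() every used BAR". */
static void toy_rescan_prepare(struct toy_dev *dev)
{
	dev->paused = true;
	dev->mapped_start = 0;
}

/* Stands in for "ioremap() the new BAR addresses and resume". */
static void toy_rescan_done(struct toy_dev *dev)
{
	dev->mapped_start = dev->bar_start;
	dev->paused = false;
}

static const struct toy_driver toy_movable_driver = {
	.rescan_prepare = toy_rescan_prepare,
	.rescan_done = toy_rescan_done,
};

static const struct toy_driver toy_legacy_driver = {
	.rescan_prepare = NULL,
	.rescan_done = NULL,
};

/* Try to move the BAR; returns false and touches nothing if immovable. */
static bool toy_rescan_move(struct toy_dev *dev, uint64_t new_start)
{
	assert(dev);

	if (!toy_movable_bars_supported(dev))
		return false;

	dev->driver->rescan_prepare(dev);
	dev->bar_start = new_start;
	dev->driver->rescan_done(dev);
	return true;
}
```

A device bound to toy_movable_driver ends up with bar_start and mapped_start both at the new address, while a device bound to toy_legacy_driver is left exactly where it was.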
[PATCH v6 03/30] PCI: hotplug: Add a flag for the movable BARs feature
When hot-adding a device, the bridge may have windows that are not big
enough (or are too fragmented) for the newly requested BARs to fit in.
Expanding these bridge windows may be impossible because they are blocked
by "neighboring" BARs and bridge windows.

Still, it may be possible to allocate a memory region for new BARs with the
following procedure:

1) notify all the drivers which support movable BARs to pause and release
   the BARs; the rest of the drivers are guaranteed that their devices will
   not get BARs moved;

2) release all the bridge windows and movable BARs;

3) try to recalculate new bridge windows that will fit all the BAR types:
   - fixed;
   - immovable;
   - movable;
   - newly requested by hot-added devices;

4) if the previous step fails, disable BARs for one of the hot-added
   devices and retry from step 3;

5) notify the drivers, so they remap BARs and resume.

If bridge calculation and BAR assignment fail with hot-added devices, the
BARs of these devices will be disabled, falling back to the same number and
size of BARs as before the hotplug event. The kernel succeeded in assigning
them then, so the same algorithm will provide the same results again.

This makes the prior reservation of memory by BIOS/bootloader/firmware no
longer required for PCI hotplug.

Drivers indicate their support of movable BARs by implementing the new
.rescan_prepare() and .rescan_done() hooks in the struct pci_driver. All of
the device's activity must be paused during a rescan, and
iounmap()+ioremap() must be applied to every used BAR. If a device is not
bound to a driver, its BARs are considered movable.

For a higher probability of successful BAR reassignment, all the BARs and
bridge windows should be released before the rescan, not only those with
higher addresses.
One example when it is needed, BAR(I) is moved to free a gap for the new
BAR(II):

Before:

    ==================== parent bridge window ====================
     ---------------- hotplug bridge window ----------------
    |   BAR(I)   |  fixed BAR  |  fixed BAR  |  fixed BAR  |
          ^
          |
     new BAR(II)

After:

    ========================= parent bridge window =========================
     --------------------- hotplug bridge window ---------------------
    | new BAR(II) |  fixed BAR  |  fixed BAR  |  fixed BAR  |  BAR(I)  |

Another example is a fragmented bridge window jammed between fixed BARs:

Before:

    ============================ parent bridge window ============================
     ------------------------ hotplug bridge window ------------------------
    | fixed BAR |   | BAR(I) |    | BAR(II) |    | BAR(III) | fixed BAR |
                  ^
                  |
             new BAR(IV)

After:

    ============================ parent bridge window ============================
     ------------------------ hotplug bridge window ------------------------
    | fixed BAR | BAR(I) | BAR(II) | BAR(III) | new BAR(IV) | fixed BAR |

This patch is a preparation for future patches with actual implementation,
and for now it just does the following:
 - declares the feature;
 - defines the bool pci_can_move_bars and bool pci_dev_bar_movable(dev);
 - invokes the .rescan_prepare() and .rescan_done() driver notifiers;
 - disables the feature for the powerpc/pseries.

The feature is disabled by default until the final patch of the series.
It can be overridden per-arch using the pci_can_move_bars=false flag or by
the following command line option:

    pci=no_movable_bars

CC: Sam Bobroff
CC: Rajat Jain
CC: Lukas Wunner
CC: Oliver O'Halloran
CC: David Laight
Signed-off-by: Sergey Miroshnichenko
---
 .../admin-guide/kernel-parameters.txt   |  1 +
 arch/powerpc/platforms/pseries/setup.c  |  2 +
 drivers/pci/pci.c                       |  4 +
 drivers/pci/pci.h                       |  2 +
 drivers/pci/probe.c                     | 85 ++-
 include/linux/pci.h                     |  4 +
 6 files changed, 96 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index a84a83f8881e..c6243aaed0c9 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3528,6 +3528,7 @@
 				may put more devices in an IOMMU group.
 		force_floating	[S390] Force usage of floating interrupts.
 		nomio		[S390] Do not use MIO instructions.
+		no_movable_bars	Don't allow BARs to be moved during hotplug
 
 	pcie_aspm=	[PCIE] Forcibly enable or disable PCIe Active State Power
 			Management.
diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index 0a40201f315f..7cd12c5a2deb 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -920,6 +920,8 @@ static void __init pseries_init(void)
 {
 	pr_debug(" ->
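
Steps 3 and 4 of the procedure in the commit message (try to fit everything, and on failure drop the BARs of one hot-added device and retry) can be sketched as a toy loop. This is a deliberately crude model, not the series' algorithm: it assumes a single window, ignores alignment and fragmentation entirely (a BAR set "fits" when the sizes sum to no more than the window), and struct toy_bar, toy_fits() and toy_reassign() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one BAR competing for the window. */
struct toy_bar {
	uint64_t size;
	bool hot_added;		/* requested by a newly hot-added device */
	bool enabled;		/* false once we gave up on this BAR */
};

/* Step 3, grossly simplified: do all enabled BARs fit by total size? */
static bool toy_fits(const struct toy_bar *bars, size_t n, uint64_t window)
{
	uint64_t need = 0;

	for (size_t i = 0; i < n; i++)
		if (bars[i].enabled)
			need += bars[i].size;
	return need <= window;
}

/*
 * Steps 3-4: retry, disabling one hot-added BAR per failed attempt.
 * Returns how many hot-added BARs were disabled, or -1 if even the
 * pre-existing BARs do not fit.
 */
static int toy_reassign(struct toy_bar *bars, size_t n, uint64_t window)
{
	int disabled = 0;

	assert(bars || n == 0);

	while (!toy_fits(bars, n, window)) {
		bool found = false;

		/* Pick the last still-enabled hot-added BAR as the victim. */
		for (size_t i = n; i > 0 && !found; i--) {
			if (bars[i - 1].hot_added && bars[i - 1].enabled) {
				bars[i - 1].enabled = false;
				found = true;
			}
		}
		if (!found)
			return -1;
		disabled++;
	}
	return disabled;
}
```

The point of the sketch is the fallback guarantee: once every hot-added BAR is disabled, only the set of BARs that existed before the hotplug event remains, which is the configuration the commit message argues can always be assigned again.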
[PATCH v6 02/30] PCI: Enable bridge's I/O and MEM access for hotplugged devices
The PCI_COMMAND_IO and PCI_COMMAND_MEMORY bits of the bridge must be
updated not only when enabling the bridge for the first time, but also if a
hotplugged device requests these types of resources.

Originally these bits were set only by pci_enable_device_flags(), which
exits early if the bridge is already pci_is_enabled(). So if the bridge was
empty initially (an edge case), hotplugged devices fail to access I/O and
memory.

Signed-off-by: Sergey Miroshnichenko
---
 drivers/pci/pci.c | 8
 1 file changed, 8 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 44d0d12c80cf..e85dc63c73fd 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1650,6 +1650,14 @@ static void pci_enable_bridge(struct pci_dev *dev)
 		pci_enable_bridge(bridge);
 
 	if (pci_is_enabled(dev)) {
+		int i, bars = 0;
+
+		for (i = PCI_BRIDGE_RESOURCES; i < DEVICE_COUNT_RESOURCE; i++) {
+			if (dev->resource[i].flags & (IORESOURCE_MEM | IORESOURCE_IO))
+				bars |= (1 << i);
+		}
+		do_pci_enable_device(dev, bars);
+
 		if (!dev->is_busmaster)
 			pci_set_master(dev);
 		mutex_unlock(&dev->enable_mutex);
-- 
2.23.0
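
The mask computed in the hunk above can be tried out in isolation. The following is a userspace sketch, assuming toy constants: TOY_BRIDGE_RESOURCES and TOY_DEVICE_COUNT_RESOURCE are illustrative index values (the real ones depend on the kernel configuration), and toy_bridge_bars_mask() is a hypothetical helper, not a kernel function. It sets bit i for every bridge window slot whose flags request I/O or memory.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for IORESOURCE_IO / IORESOURCE_MEM. */
#define TOY_IORESOURCE_IO	0x100ul
#define TOY_IORESOURCE_MEM	0x200ul

/* Illustrative slot indices; the kernel's values depend on its config. */
#define TOY_BRIDGE_RESOURCES		7
#define TOY_DEVICE_COUNT_RESOURCE	11

/* Build the bitmask of bridge window slots that carry I/O or MEM. */
static int toy_bridge_bars_mask(const unsigned long *flags, size_t n)
{
	int bars = 0;

	assert(n >= TOY_DEVICE_COUNT_RESOURCE);

	for (int i = TOY_BRIDGE_RESOURCES; i < TOY_DEVICE_COUNT_RESOURCE; i++)
		if (flags[i] & (TOY_IORESOURCE_MEM | TOY_IORESOURCE_IO))
			bars |= 1 << i;

	return bars;
}
```

Slots below the bridge window range are skipped on purpose, so a memory BAR in slot 0 does not contribute to the mask; only the windows do.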
[PATCH v6 01/30] PCI: Fix race condition in pci_enable/disable_device()
This is yet another approach to fix an old [1-2] concurrency issue, which
occurs when:
 - two or more devices are being hot-added into a bridge which was
   initially empty;
 - a bridge with two or more devices is being hot-added;
 - during boot, if BIOS/bootloader/firmware doesn't pre-enable bridges.

The problem is that a bridge is reported as enabled before the MEM/IO bits
are actually written to the PCI_COMMAND register, so another driver thread
starts memory requests through the not-yet-enabled bridge:

 CPU0                                        CPU1

 pci_enable_device_mem()                     pci_enable_device_mem()
   pci_enable_bridge()                         pci_enable_bridge()
     pci_is_enabled()
       return false;
     atomic_inc_return(enable_cnt)
     Start actual enabling the bridge
     ...                                         pci_is_enabled()
     ...                                           return true;
     ...                                       Start memory requests <-- FAIL
     ...
     Set the PCI_COMMAND_MEMORY bit <-- Must wait for this

Protect pci_enable/disable_device() and pci_enable_bridge() with a lock.
This is similar to the previous solution from commit 40f11adc7cd9 ("PCI:
Avoid race while enabling upstream bridges"), but adds per-device mutexes
and prevents dev->enable_cnt from being incremented early.

CC: Srinath Mannam
CC: Marta Rybczynska
Signed-off-by: Sergey Miroshnichenko

[1] https://lore.kernel.org/linux-pci/1501858648-8-1-git-send-email-srinath.man...@broadcom.com/T/#u
    [RFC PATCH v3] pci: Concurrency issue during pci enable bridge

[2] https://lore.kernel.org/linux-pci/744877924.5841545.1521630049567.javamail.zim...@kalray.eu/T/#u
    [RFC PATCH] nvme: avoid race-conditions when enabling devices
---
 drivers/pci/pci.c   | 26 ++
 drivers/pci/probe.c |  1 +
 include/linux/pci.h |  1 +
 3 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index a97e2571a527..44d0d12c80cf 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1643,6 +1643,8 @@ static void pci_enable_bridge(struct pci_dev *dev)
 	struct pci_dev *bridge;
 	int retval;
 
+	mutex_lock(&dev->enable_mutex);
+
 	bridge = pci_upstream_bridge(dev);
 	if (bridge)
 		pci_enable_bridge(bridge);
@@ -1650,6 +1652,7 @@ static void pci_enable_bridge(struct pci_dev *dev)
 	if (pci_is_enabled(dev)) {
 		if (!dev->is_busmaster)
 			pci_set_master(dev);
+		mutex_unlock(&dev->enable_mutex);
 		return;
 	}
 
@@ -1658,11 +1661,14 @@ static void pci_enable_bridge(struct pci_dev *dev)
 		pci_err(dev, "Error enabling bridge (%d), continuing\n",
 			retval);
 	pci_set_master(dev);
+	mutex_unlock(&dev->enable_mutex);
 }
 
 static int pci_enable_device_flags(struct pci_dev *dev, unsigned long flags)
 {
 	struct pci_dev *bridge;
+	/* Enable-locking of bridges is performed within the pci_enable_bridge() */
+	bool need_lock = !dev->subordinate;
 	int err;
 	int i, bars = 0;
 
@@ -1678,8 +1684,13 @@ static int pci_enable_device_flags(struct pci_dev *dev, unsigned long flags)
 		dev->current_state = (pmcsr & PCI_PM_CTRL_STATE_MASK);
 	}
 
-	if (atomic_inc_return(&dev->enable_cnt) > 1)
+	if (need_lock)
+		mutex_lock(&dev->enable_mutex);
+	if (pci_is_enabled(dev)) {
+		if (need_lock)
+			mutex_unlock(&dev->enable_mutex);
 		return 0;		/* already enabled */
+	}
 
 	bridge = pci_upstream_bridge(dev);
 	if (bridge)
@@ -1694,8 +1705,10 @@ static int pci_enable_device_flags(struct pci_dev *dev, unsigned long flags)
 			bars |= (1 << i);
 
 	err = do_pci_enable_device(dev, bars);
-	if (err < 0)
-		atomic_dec(&dev->enable_cnt);
+	if (err >= 0)
+		atomic_inc(&dev->enable_cnt);
+	if (need_lock)
+		mutex_unlock(&dev->enable_mutex);
 	return err;
 }
 
@@ -1939,15 +1952,20 @@ void pci_disable_device(struct pci_dev *dev)
 	if (dr)
 		dr->enabled = 0;
 
+	mutex_lock(&dev->enable_mutex);
 	dev_WARN_ONCE(&dev->dev, atomic_read(&dev->enable_cnt) <= 0,
 		      "disabling already-disabled device");
 
-	if (atomic_dec_return(&dev->enable_cnt) != 0)
+	if (atomic_dec_return(&dev->enable_cnt) != 0) {
+		mutex_unlock(&dev->enable_mutex);
 		return;
+	}
 
 	do_pci_disable_device(dev);
 
 	dev->is_busmaster = 0;
+
+	mutex_unlock(&dev->enable_mutex);
 }
 EXPORT_SYMBOL(pci_disable_device);
 
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 3d5271a7a849..d4f21e413638 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2158,6 +2158,7 @@ struct pci_dev *pci_alloc_dev(struct pci_bus *bus)
 	INIT_LIST_HEAD(&dev->bus_list);
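
The ordering idea behind the fix (raise the counter that pci_is_enabled() reads only after the command bits are written, with the whole sequence under a per-device mutex) can be modelled in userspace. This is a sketch under assumptions and not the patch's exact logic: struct toy_pdev and the toy_* functions are hypothetical, a pthread mutex and a C11 atomic stand in for the kernel primitives, and the reference counting for repeated enable calls is a simplification of mine.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical miniature of the relevant struct pci_dev fields. */
struct toy_pdev {
	pthread_mutex_t enable_mutex;
	atomic_int enable_cnt;	/* what toy_is_enabled() reports */
	bool command_mem;	/* models the PCI_COMMAND_MEMORY bit */
	int hw_writes;		/* how often the "hardware" was programmed */
};

#define TOY_PDEV_INIT { PTHREAD_MUTEX_INITIALIZER, 0, false, 0 }

static bool toy_is_enabled(struct toy_pdev *d)
{
	return atomic_load(&d->enable_cnt) > 0;
}

static int toy_enable(struct toy_pdev *d)
{
	pthread_mutex_lock(&d->enable_mutex);

	if (!toy_is_enabled(d)) {
		/* Program the "hardware" first ... */
		d->command_mem = true;
		d->hw_writes++;
	}
	/* ... and only then let toy_is_enabled() start returning true. */
	atomic_fetch_add(&d->enable_cnt, 1);

	pthread_mutex_unlock(&d->enable_mutex);
	return 0;
}

static void toy_disable(struct toy_pdev *d)
{
	pthread_mutex_lock(&d->enable_mutex);

	assert(toy_is_enabled(d));
	/* The last user turns the "hardware" off again. */
	if (atomic_fetch_sub(&d->enable_cnt, 1) == 1)
		d->command_mem = false;

	pthread_mutex_unlock(&d->enable_mutex);
}
```

A second caller of toy_enable() either blocks on the mutex until the first one has set command_mem, or arrives later and sees a non-zero count; in neither case can it observe "enabled" while the bit is still clear, which is the window shown in the CPU0/CPU1 timeline of the commit message.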
Re: [PATCH 0/2] Enabling MSI for Microblaze
On 10/24/19 6:13 AM, Michal Simek wrote:
> Hi,
>
> these two patches come from discussion with Christoph, Bjorn, Palmer and
> Waiman. The first patch was suggestion by Christoph here
> https://lore.kernel.org/linux-riscv/20191008154604.ga7...@infradead.org/
> The second part was discussed
> https://lore.kernel.org/linux-pci/mhng-5d9bcb53-225e-441f-86cc-b335624b3e7c@palmer-si-x1e/
> and
> https://lore.kernel.org/linux-pci/20191017181937.7004-1-pal...@sifive.com/
>
> Thanks,
> Michal
>
>
> Michal Simek (1):
>   asm-generic: Make msi.h a mandatory include/asm header
>
> Palmer Dabbelt (1):
>   pci: Default to PCI_MSI_IRQ_DOMAIN
>
>  arch/arc/include/asm/Kbuild     | 1 -
>  arch/arm/include/asm/Kbuild     | 1 -
>  arch/arm64/include/asm/Kbuild   | 1 -
>  arch/mips/include/asm/Kbuild    | 1 -
>  arch/powerpc/include/asm/Kbuild | 1 -
>  arch/riscv/include/asm/Kbuild   | 1 -
>  arch/sparc/include/asm/Kbuild   | 1 -
>  drivers/pci/Kconfig             | 2 +-
>  include/asm-generic/Kbuild      | 1 +
>  9 files changed, 2 insertions(+), 8 deletions(-)
>

That looks OK.

Acked-by: Waiman Long
Re: [PATCH V7] mm/debug: Add tests validating architecture page table helpers
> On Oct 24, 2019, at 10:50 AM, Anshuman Khandual wrote:
>
> Changes in V7:
>
> - Memory allocation and free routines for mapped pages have been droped
> - Mapped pfns are derived from standard kernel text symbol per Matthew
> - Moved debug_vm_pgtaable() after page_alloc_init_late() per Michal and Qian
> - Updated the commit message per Michal
> - Updated W=1 GCC warning problem on x86 per Qian Cai

It would be interesting to know whether you actually tested this to see if
the warning went away. As far as I can tell, GCC is quite stubborn there,
so I am not going to insist.
[PATCH 03/34 v3] powerpc: Use CONFIG_PREEMPTION
From: Thomas Gleixner

CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by CONFIG_PREEMPT_RT.
Both PREEMPT and PREEMPT_RT require the same functionality which today
depends on CONFIG_PREEMPT.

Switch the entry code over to use CONFIG_PREEMPTION.

Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: Michael Ellerman
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Thomas Gleixner
[bigeasy: +Kconfig]
Signed-off-by: Sebastian Andrzej Siewior
---
v2…v3: Don't mention die.c changes in the description.
v1…v2: Remove the changes to die.c.

 arch/powerpc/Kconfig           | 2 +-
 arch/powerpc/kernel/entry_32.S | 4 ++--
 arch/powerpc/kernel/entry_64.S | 4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 3e56c9c2f16ee..8ead8d6e1cbc8 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -106,7 +106,7 @@ config LOCKDEP_SUPPORT
 config GENERIC_LOCKBREAK
 	bool
 	default y
-	depends on SMP && PREEMPT
+	depends on SMP && PREEMPTION
 
 config GENERIC_HWEIGHT
 	bool
diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index d60908ea37fb9..e1a4c39b83b86 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -897,7 +897,7 @@ user_exc_return:		/* r10 contains MSR_KERNEL here */
 	bne-	0b
 1:
 
-#ifdef CONFIG_PREEMPT
+#ifdef CONFIG_PREEMPTION
 	/* check current_thread_info->preempt_count */
 	lwz	r0,TI_PREEMPT(r2)
 	cmpwi	0,r0,0		/* if non-zero, just restore regs and return */
@@ -921,7 +921,7 @@ user_exc_return:		/* r10 contains MSR_KERNEL here */
 	 */
 	bl	trace_hardirqs_on
 #endif
-#endif /* CONFIG_PREEMPT */
+#endif /* CONFIG_PREEMPTION */
 restore_kuap:
 	kuap_restore r1, r2, r9, r10, r0
 
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 6467bdab8d405..83733376533e8 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -840,7 +840,7 @@ _GLOBAL(ret_from_except_lite)
 	bne-	0b
 1:
 
-#ifdef CONFIG_PREEMPT
+#ifdef CONFIG_PREEMPTION
 	/* Check if we need to preempt */
 	andi.	r0,r4,_TIF_NEED_RESCHED
 	beq+	restore
@@ -871,7 +871,7 @@ _GLOBAL(ret_from_except_lite)
 	li	r10,MSR_RI
 	mtmsrd	r10,1		  /* Update machine state */
 #endif /* CONFIG_PPC_BOOK3E */
-#endif /* CONFIG_PREEMPT */
+#endif /* CONFIG_PREEMPTION */
 
 	.globl	fast_exc_return_irq
 fast_exc_return_irq:
-- 
2.23.0
Re: [PATCH 0/2] vfio pci: Add support for OpenCAPI devices
Hi Christophe,

Sorry, I didn't have time to look at your other series yet and likely
the same for this one with the upcoming KVM Forum... :-\

Anyway, for any VFIO related patch, don't forget to Cc the maintainer,
Alex Williamson.

Cheers,

--
Greg

On Thu, 24 Oct 2019 15:28:03 +0200 christophe lombard wrote:
> This series adds support for the OpenCAPI devices for vfio pci.
>
> It builds on top of the existing ocxl driver +
> http://patchwork.ozlabs.org/patch/1177999/
>
> VFIO is a Linux kernel driver framework used by QEMU to make devices
> directly assignable to virtual machines.
>
> All OpenCAPI devices on the same PCI slot will all be grouped and
> assigned to the same guest.
>
> - Assume these are the devices you want to assign
>  0007:00:00.0 Processing accelerators: IBM Device 062b
>  0007:00:00.1 Processing accelerators: IBM Device 062b
>
> - Two Devices in the group
>  $ ls /sys/bus/pci/devices/0007\:00\:00.0/iommu_group/devices/
>  0007:00:00.0  0007:00:00.1
>
> - Find vendor & device ID
>  $ lspci -n -s 0007:00:00
>  0007:00:00.0 1200: 1014:062b
>  0007:00:00.1 1200: 1014:062b
>
> - Unbind from the current ocxl device driver if already loaded
>  $ rmmod ocxl
>
> - Load vfio-pci if it's not already done.
>  $ modprobe vfio-pci
>
> - Bind to vfio-pci
>  $ echo 1014 062b > /sys/bus/pci/drivers/vfio-pci/new_id
>
>  This will result in a new device node "/dev/vfio/7", which will be
>  use by QEMU to setup the devices for passthrough.
>
> - Pass to qemu using -device vfio-pci
>  -device vfio-pci,multifunction=on,host=0007:00:00.0,addr=2.0 -device
>  vfio-pci,multifunction=on,host=0007:00:00.1,addr=2.1
>
> It has been tested in a bare-metal and QEMU environment using the memcpy
> and the AFP AFUs.
>
> christophe lombard (2):
>   powerpc/powernv: Register IOMMU group for OpenCAPI devices
>   vfio/pci: Introduce OpenCAPI devices support.
>
>  arch/powerpc/platforms/powernv/ocxl.c     | 164 ++---
>  arch/powerpc/platforms/powernv/pci-ioda.c |  19 +-
>  arch/powerpc/platforms/powernv/pci.h      |  13 +
>  drivers/vfio/pci/Kconfig                  |   7 +
>  drivers/vfio/pci/Makefile                 |   1 +
>  drivers/vfio/pci/vfio_pci.c               |  19 ++
>  drivers/vfio/pci/vfio_pci_ocxl.c          | 287 ++
>  drivers/vfio/vfio.c                       |  25 ++
>  include/linux/vfio.h                      |  13 +
>  include/uapi/linux/vfio.h                 |  22 ++
>  10 files changed, 530 insertions(+), 40 deletions(-)
>  create mode 100644 drivers/vfio/pci/vfio_pci_ocxl.c
>
Re: [PATCH 1/2] asm-generic: Make msi.h a mandatory include/asm header
On Thu, Oct 24, 2019 at 7:13 PM Michal Simek wrote:
>
> msi.h is generic for all architectures expect of x86 which has own version.

Maybe a typo? "except"

Anyway, the code looks good to me.

Reviewed-by: Masahiro Yamada

> Enabling MSI by including msi.h to architecture Kbuild is just additional
> step which doesn't need to be done.
> The patch was created based on request to enable MSI for Microblaze.
>
> Suggested-by: Christoph Hellwig
> Signed-off-by: Michal Simek
> ---
>
> https://lore.kernel.org/linux-riscv/20191008154604.ga7...@infradead.org/
> ---
>  arch/arc/include/asm/Kbuild     | 1 -
>  arch/arm/include/asm/Kbuild     | 1 -
>  arch/arm64/include/asm/Kbuild   | 1 -
>  arch/mips/include/asm/Kbuild    | 1 -
>  arch/powerpc/include/asm/Kbuild | 1 -
>  arch/riscv/include/asm/Kbuild   | 1 -
>  arch/sparc/include/asm/Kbuild   | 1 -
>  include/asm-generic/Kbuild      | 1 +
>  8 files changed, 1 insertion(+), 7 deletions(-)
>
> diff --git a/arch/arc/include/asm/Kbuild b/arch/arc/include/asm/Kbuild
> index 393d4f5e1450..1b505694691e 100644
> --- a/arch/arc/include/asm/Kbuild
> +++ b/arch/arc/include/asm/Kbuild
> @@ -17,7 +17,6 @@ generic-y += local64.h
>  generic-y += mcs_spinlock.h
>  generic-y += mm-arch-hooks.h
>  generic-y += mmiowb.h
> -generic-y += msi.h
>  generic-y += parport.h
>  generic-y += percpu.h
>  generic-y += preempt.h
> diff --git a/arch/arm/include/asm/Kbuild b/arch/arm/include/asm/Kbuild
> index 68ca86f85eb7..fa579b23b4df 100644
> --- a/arch/arm/include/asm/Kbuild
> +++ b/arch/arm/include/asm/Kbuild
> @@ -12,7 +12,6 @@ generic-y += local.h
>  generic-y += local64.h
>  generic-y += mm-arch-hooks.h
>  generic-y += mmiowb.h
> -generic-y += msi.h
>  generic-y += parport.h
>  generic-y += preempt.h
>  generic-y += seccomp.h
> diff --git a/arch/arm64/include/asm/Kbuild b/arch/arm64/include/asm/Kbuild
> index 98a5405c8558..bd23f87d6c55 100644
> --- a/arch/arm64/include/asm/Kbuild
> +++ b/arch/arm64/include/asm/Kbuild
> @@ -16,7 +16,6 @@ generic-y += local64.h
>  generic-y += mcs_spinlock.h
>  generic-y += mm-arch-hooks.h
>  generic-y += mmiowb.h
> -generic-y += msi.h
>  generic-y += qrwlock.h
>  generic-y += qspinlock.h
>  generic-y += serial.h
> diff --git a/arch/mips/include/asm/Kbuild b/arch/mips/include/asm/Kbuild
> index c8b595c60910..61b0fc2026e6 100644
> --- a/arch/mips/include/asm/Kbuild
> +++ b/arch/mips/include/asm/Kbuild
> @@ -13,7 +13,6 @@ generic-y += irq_work.h
>  generic-y += local64.h
>  generic-y += mcs_spinlock.h
>  generic-y += mm-arch-hooks.h
> -generic-y += msi.h
>  generic-y += parport.h
>  generic-y += percpu.h
>  generic-y += preempt.h
> diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild
> index 64870c7be4a3..17726f2e46de 100644
> --- a/arch/powerpc/include/asm/Kbuild
> +++ b/arch/powerpc/include/asm/Kbuild
> @@ -10,4 +10,3 @@ generic-y += local64.h
>  generic-y += mcs_spinlock.h
>  generic-y += preempt.h
>  generic-y += vtime.h
> -generic-y += msi.h
> diff --git a/arch/riscv/include/asm/Kbuild b/arch/riscv/include/asm/Kbuild
> index 16970f246860..1efaeddf1e4b 100644
> --- a/arch/riscv/include/asm/Kbuild
> +++ b/arch/riscv/include/asm/Kbuild
> @@ -22,7 +22,6 @@ generic-y += kvm_para.h
>  generic-y += local.h
>  generic-y += local64.h
>  generic-y += mm-arch-hooks.h
> -generic-y += msi.h
>  generic-y += percpu.h
>  generic-y += preempt.h
>  generic-y += sections.h
> diff --git a/arch/sparc/include/asm/Kbuild b/arch/sparc/include/asm/Kbuild
> index b6212164847b..62de2eb2773d 100644
> --- a/arch/sparc/include/asm/Kbuild
> +++ b/arch/sparc/include/asm/Kbuild
> @@ -18,7 +18,6 @@ generic-y += mcs_spinlock.h
>  generic-y += mm-arch-hooks.h
>  generic-y += mmiowb.h
>  generic-y += module.h
> -generic-y += msi.h
>  generic-y += preempt.h
>  generic-y += serial.h
>  generic-y += trace_clock.h
> diff --git a/include/asm-generic/Kbuild b/include/asm-generic/Kbuild
> index adff14fcb8e4..ddfee1bd9dc1 100644
> --- a/include/asm-generic/Kbuild
> +++ b/include/asm-generic/Kbuild
> @@ -4,4 +4,5 @@
>  # (This file is not included when SRCARCH=um since UML borrows several
>  # asm headers from the host architecutre.)
>
> +mandatory-y += msi.h
>  mandatory-y += simd.h
> --
> 2.17.1
>

--
Best Regards
Masahiro Yamada
[PATCH 03/34 v2] powerpc: Use CONFIG_PREEMPTION
From: Thomas Gleixner

CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by CONFIG_PREEMPT_RT.
Both PREEMPT and PREEMPT_RT require the same functionality which today
depends on CONFIG_PREEMPT.

Switch the entry code over to use CONFIG_PREEMPTION. Add PREEMPT_RT
output in __die().

Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: Michael Ellerman
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Thomas Gleixner
[bigeasy: +Kconfig]
Signed-off-by: Sebastian Andrzej Siewior
---
v1…v2: Remove the changes to die.c

 arch/powerpc/Kconfig           | 2 +-
 arch/powerpc/kernel/entry_32.S | 4 ++--
 arch/powerpc/kernel/entry_64.S | 4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 3e56c9c2f16ee..8ead8d6e1cbc8 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -106,7 +106,7 @@ config LOCKDEP_SUPPORT
 config GENERIC_LOCKBREAK
 	bool
 	default y
-	depends on SMP && PREEMPT
+	depends on SMP && PREEMPTION
 
 config GENERIC_HWEIGHT
 	bool
diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index d60908ea37fb9..e1a4c39b83b86 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -897,7 +897,7 @@ user_exc_return:		/* r10 contains MSR_KERNEL here */
 	bne-	0b
 1:
 
-#ifdef CONFIG_PREEMPT
+#ifdef CONFIG_PREEMPTION
 	/* check current_thread_info->preempt_count */
 	lwz	r0,TI_PREEMPT(r2)
 	cmpwi	0,r0,0		/* if non-zero, just restore regs and return */
@@ -921,7 +921,7 @@ user_exc_return:		/* r10 contains MSR_KERNEL here */
 	 */
 	bl	trace_hardirqs_on
 #endif
-#endif /* CONFIG_PREEMPT */
+#endif /* CONFIG_PREEMPTION */
 restore_kuap:
 	kuap_restore r1, r2, r9, r10, r0
 
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 6467bdab8d405..83733376533e8 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -840,7 +840,7 @@ _GLOBAL(ret_from_except_lite)
 	bne-	0b
 1:
 
-#ifdef CONFIG_PREEMPT
+#ifdef CONFIG_PREEMPTION
 	/* Check if we need to preempt */
 	andi.	r0,r4,_TIF_NEED_RESCHED
 	beq+	restore
@@ -871,7 +871,7 @@ _GLOBAL(ret_from_except_lite)
 	li	r10,MSR_RI
 	mtmsrd	r10,1		  /* Update machine state */
 #endif /* CONFIG_PPC_BOOK3E */
-#endif /* CONFIG_PREEMPT */
+#endif /* CONFIG_PREEMPTION */
 
 	.globl	fast_exc_return_irq
 fast_exc_return_irq:
-- 
2.23.0
[PATCH 2/2] vfio/pci: Introduce OpenCAPI devices support.
This patch adds new IOCTL commands to the VFIO PCI driver to support
configuration and management of OpenCAPI devices which have been passed
through from the host to QEMU VFIO.

OpenCAPI (Open Coherent Accelerator Processor Interface) is an interface
between processors and accelerators.

The main IOCTL command is:

VFIO_DEVICE_OCXL_OP
	Handles devices which support the OpenCAPI interface, using the
	ocxl pnv_* interface.

The following commands are supported, based on the hcalls defined in
ocxl/pseries.c, which implements the guest-specific callbacks.

VFIO_DEVICE_OCXL_CONFIG_ADAPTER
	Used to configure OpenCAPI adapter characteristics.

VFIO_DEVICE_OCXL_CONFIG_SPA
	Used to configure the Scheduled Process Area (SPA) table for an
	OpenCAPI device.

VFIO_DEVICE_OCXL_GET_FAULT_STATE
	Used to retrieve fault information from an OpenCAPI device.

VFIO_DEVICE_OCXL_HANDLE_FAULT
	Used to respond to an OpenCAPI fault.

The platform data is declared in the vfio_pci_ocxl_link, which is common to
all devices sharing the same domain, same bus and same slot.

The lpid value, needed to configure the process element in the Scheduled
Process Area, is not available in the QEMU environment. It therefore has to
be obtained from the host through the iommu group.
Signed-off-by: Christophe Lombard
---
 drivers/vfio/pci/Kconfig         |   7 +
 drivers/vfio/pci/Makefile        |   1 +
 drivers/vfio/pci/vfio_pci.c      |  19 ++
 drivers/vfio/pci/vfio_pci_ocxl.c | 287 +++
 drivers/vfio/vfio.c              |  25 +++
 include/linux/vfio.h             |  13 ++
 include/uapi/linux/vfio.h        |  22 +++
 7 files changed, 374 insertions(+)
 create mode 100644 drivers/vfio/pci/vfio_pci_ocxl.c

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index ac3c1dd3edef..fd3716d10ded 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -45,3 +45,10 @@ config VFIO_PCI_NVLINK2
 	depends on VFIO_PCI && PPC_POWERNV
 	help
 	  VFIO PCI support for P9 Witherspoon machine with NVIDIA V100 GPUs
+
+config VFIO_PCI_OCXL
+	depends on VFIO_PCI
+	def_bool y if OCXL_BASE
+	help
+	  VFIO PCI support for devices which handle the Open Coherent
+	  Accelerator Processor Interface.
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index f027f8a0e89c..6d55a5fee4b0 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -3,5 +3,6 @@
 vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
 vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
 vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
+vfio-pci-$(CONFIG_VFIO_PCI_OCXL) += vfio_pci_ocxl.o
 
 obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 703948c9fbe1..4f9741bbe790 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1128,6 +1128,25 @@ static long vfio_pci_ioctl(void *device_data,
 
 		return vfio_pci_ioeventfd(vdev, ioeventfd.offset,
 					  ioeventfd.data, count, ioeventfd.fd);
+	} else if (cmd == VFIO_DEVICE_OCXL_OP) {
+		struct vfio_device_ocxl_op ocxl_op;
+		int ret = 0;
+
+		minsz = offsetofend(struct vfio_device_ocxl_op, data);
+
+		if (copy_from_user(&ocxl_op, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (ocxl_op.argsz < minsz)
+			return -EINVAL;
+
+		ret = vfio_pci_ocxl_ioctl(vdev->pdev, &ocxl_op);
+
+		if (!ret) {
+			if (copy_to_user((void __user *)arg, &ocxl_op, minsz))
+				ret = -EFAULT;
+		}
+		return ret;
 	}
 
 	return -ENOTTY;
diff --git a/drivers/vfio/pci/vfio_pci_ocxl.c b/drivers/vfio/pci/vfio_pci_ocxl.c
new file mode 100644
index 000000000000..cb5cd4fb416d
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci_ocxl.c
@@ -0,0 +1,287 @@
+// SPDX-License-Identifier: GPL-2.0+
+// Copyright 2019 IBM Corp.
+
+#include
+#include
+#include
+#include
+#include
+#include
+
+struct vfio_device_ocxl_link {
+	struct list_head list;
+	int domain;
+	int bus;
+	int slot;
+	void *platform_data;
+};
+static struct list_head links_list = LIST_HEAD_INIT(links_list);
+static DEFINE_MUTEX(links_list_lock);
+
+#define VFIO_DEVICE_OCXL_CONFIG_ADAPTER			1
+#define VFIO_DEVICE_OCXL_CONFIG_ADAPTER_SETUP		1
+#define VFIO_DEVICE_OCXL_CONFIG_ADAPTER_RELEASE		2
+#define VFIO_DEVICE_OCXL_CONFIG_ADAPTER_GET_ACTAG	3
+#define VFIO_DEVICE_OCXL_CONFIG_ADAPTER_GET_PASID	4
+#define VFIO_DEVICE_OCXL_CONFIG_ADAPTER_SET_TL		5
+#define VFIO_DEVICE_OCXL_CONFIG_ADAPTER_ALLOC_IRQ	6
+#define
[PATCH 0/2] vfio pci: Add support for OpenCAPI devices
This series adds support for the OpenCAPI devices for vfio pci.

It builds on top of the existing ocxl driver +
http://patchwork.ozlabs.org/patch/1177999/

VFIO is a Linux kernel driver framework used by QEMU to make devices
directly assignable to virtual machines.

All OpenCAPI devices on the same PCI slot will be grouped and
assigned to the same guest.

- Assume these are the devices you want to assign
 0007:00:00.0 Processing accelerators: IBM Device 062b
 0007:00:00.1 Processing accelerators: IBM Device 062b

- Two Devices in the group
 $ ls /sys/bus/pci/devices/0007\:00\:00.0/iommu_group/devices/
 0007:00:00.0  0007:00:00.1

- Find vendor & device ID
 $ lspci -n -s 0007:00:00
 0007:00:00.0 1200: 1014:062b
 0007:00:00.1 1200: 1014:062b

- Unbind from the current ocxl device driver if already loaded
 $ rmmod ocxl

- Load vfio-pci if it's not already done.
 $ modprobe vfio-pci

- Bind to vfio-pci
 $ echo 1014 062b > /sys/bus/pci/drivers/vfio-pci/new_id

 This will result in a new device node "/dev/vfio/7", which will be
 used by QEMU to set up the devices for passthrough.

- Pass to qemu using -device vfio-pci
 -device vfio-pci,multifunction=on,host=0007:00:00.0,addr=2.0 -device
 vfio-pci,multifunction=on,host=0007:00:00.1,addr=2.1

It has been tested in a bare-metal and QEMU environment using the memcpy
and the AFP AFUs.

christophe lombard (2):
  powerpc/powernv: Register IOMMU group for OpenCAPI devices
  vfio/pci: Introduce OpenCAPI devices support.

 arch/powerpc/platforms/powernv/ocxl.c     | 164 ++---
 arch/powerpc/platforms/powernv/pci-ioda.c |  19 +-
 arch/powerpc/platforms/powernv/pci.h      |  13 +
 drivers/vfio/pci/Kconfig                  |   7 +
 drivers/vfio/pci/Makefile                 |   1 +
 drivers/vfio/pci/vfio_pci.c               |  19 ++
 drivers/vfio/pci/vfio_pci_ocxl.c          | 287 ++
 drivers/vfio/vfio.c                       |  25 ++
 include/linux/vfio.h                      |  13 +
 include/uapi/linux/vfio.h                 |  22 ++
 10 files changed, 530 insertions(+), 40 deletions(-)
 create mode 100644 drivers/vfio/pci/vfio_pci_ocxl.c

-- 
2.21.0
[PATCH 1/2] powerpc/powernv: Register IOMMU group for OpenCAPI devices
This patch adds group registration for the OpenCAPI devices.

A unique iommu group is registered for multiple PEs, i.e. for a set of
devices sharing the same domain, same bus and same slot.

This group registration will be used to assign an OpenCAPI device to a
guest so that it can participate in VFIO, like vfio-pci.

The release_ownership hook is used to disable the Scheduled Process Area
and clean up allocated data, if that was not already done when the ocxl
driver was unloaded.

To support multiple OpenCAPI devices on the same machine, the iommu group
and the platform data are declared in the npu_link, which is common to all
devices sharing the same domain, same bus and same slot.

Signed-off-by: Christophe Lombard
---
 arch/powerpc/platforms/powernv/ocxl.c     | 164 +-
 arch/powerpc/platforms/powernv/pci-ioda.c |  19 ++-
 arch/powerpc/platforms/powernv/pci.h      |  13 ++
 3 files changed, 156 insertions(+), 40 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/ocxl.c b/arch/powerpc/platforms/powernv/ocxl.c
index 12b146c2f855..67b2be965415 100644
--- a/arch/powerpc/platforms/powernv/ocxl.c
+++ b/arch/powerpc/platforms/powernv/ocxl.c
@@ -74,6 +74,8 @@ struct npu_link {
 	u16 fn_desired_actags[8];
 	struct actag_range fn_actags[8];
 	bool assignment_done;
+	struct iommu_group *group;
+	struct platform_data data;
 };
 static struct list_head links_list = LIST_HEAD_INIT(links_list);
 static DEFINE_MUTEX(links_list_lock);
@@ -603,54 +605,56 @@ int pnv_ocxl_platform_setup(struct pci_dev *dev, int PE_mask,
 {
 	struct pci_controller *hose = pci_bus_to_host(dev->bus);
 	struct pnv_phb *phb = hose->private_data;
-	struct platform_data *data;
+	struct npu_link *link = NULL;
 	int xsl_irq;
 	u32 bdfn;
-	int rc;
-
-	data = kzalloc(sizeof(*data), GFP_KERNEL);
-	if (!data)
-		return -ENOMEM;
+	int rc = 0;

-	rc = alloc_spa(dev, data);
-	if (rc) {
-		kfree(data);
-		return rc;
+	mutex_lock(&links_list_lock);
+	link = find_link(dev);
+	if (!link) {
+		dev_err(&dev->dev, "Failed to setup platform\n");
+		mutex_unlock(&links_list_lock);
+		return -ENODEV;
 	}

+	rc = alloc_spa(dev, &link->data);
+	if (rc)
+		goto unlock;
+
 	rc = get_xsl_irq(dev, &xsl_irq);
 	if (rc) {
-		free_spa(data);
-		kfree(data);
-		return rc;
+		free_spa(&link->data);
+		goto unlock;
 	}

-	rc = map_xsl_regs(dev, &data->dsisr, &data->dar, &data->tfc,
-			  &data->pe_handle);
+	rc = map_xsl_regs(dev, &link->data.dsisr, &link->data.dar,
+			  &link->data.tfc, &link->data.pe_handle);
 	if (rc) {
-		free_spa(data);
-		kfree(data);
-		return rc;
+		free_spa(&link->data);
+		goto unlock;
 	}

 	bdfn = (dev->bus->number << 8) | dev->devfn;
 	rc = opal_npu_spa_setup(phb->opal_id, bdfn,
-				virt_to_phys(data->spa->spa_mem),
+				virt_to_phys(link->data.spa->spa_mem),
 				PE_mask);
 	if (rc) {
 		dev_err(&dev->dev, "Can't setup Shared Process Area: %d\n", rc);
-		unmap_xsl_regs(data->dsisr, data->dar, data->tfc,
-			       data->pe_handle);
-		free_spa(data);
-		kfree(data);
-		return rc;
+		unmap_xsl_regs(link->data.dsisr, link->data.dar,
+			       link->data.tfc, link->data.pe_handle);
+		free_spa(&link->data);
+		goto unlock;
 	}
-	data->phb_opal_id = phb->opal_id;
-	data->bdfn = bdfn;
-	*platform_data = (void *) data;
+	link->data.phb_opal_id = phb->opal_id;
+	link->data.bdfn = bdfn;
 	*hwirq = xsl_irq;
-	return 0;
+	*platform_data = (void *)&link->data;
+
+unlock:
+	mutex_unlock(&links_list_lock);
+	return rc;
 }
 EXPORT_SYMBOL_GPL(pnv_ocxl_platform_setup);

@@ -682,11 +686,13 @@ void pnv_ocxl_platform_release(void *platform_data)
 	struct platform_data *data = (struct platform_data *)platform_data;
 	int rc;

-	rc = opal_npu_spa_setup(data->phb_opal_id, data->bdfn, 0, 0);
-	WARN_ON(rc);
-	unmap_xsl_regs(data->dsisr, data->dar, data->tfc, data->pe_handle);
-	free_spa(data);
-	kfree(data);
+	if (data->spa) {
+		rc = opal_npu_spa_setup(data->phb_opal_id, data->bdfn, 0, 0);
+		WARN_ON(rc);
+		unmap_xsl_regs(data->dsisr, data->dar, data->tfc,
+			       data->pe_handle);
+		free_spa(data);
+	}
 }
 EXPORT_SYMBOL_GPL(pnv_ocxl_platform_release);

@@ -837,3 +843,95 @@ int pnv_ocxl_remove_pe(void *platform_data, int pasid, u32 *pid,
 	return remove_pe_from_cache(data,
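For readers unfamiliar with the bdfn value passed to opal_npu_spa_setup() in the hunk above, here is a minimal userspace sketch of how that number is composed. The function names devfn_encode() and bdfn_encode() are hypothetical; the layout assumed is the standard PCI one, where devfn packs a 5-bit slot and a 3-bit function, and the bus number sits in the next byte, matching the (bus << 8) | devfn expression in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Pack slot (0..31) and function (0..7) into one byte: slot in bits 7:3. */
static uint8_t devfn_encode(unsigned int slot, unsigned int func)
{
	assert(slot < 32 && func < 8);
	return (uint8_t)((slot << 3) | func);
}

/* Same composition as "(dev->bus->number << 8) | dev->devfn" in the patch. */
static uint32_t bdfn_encode(uint8_t bus, uint8_t devfn)
{
	return ((uint32_t)bus << 8) | devfn;
}
```

Under these assumptions the two functions from the cover letter, 00:00.0 and 00:00.1, encode to 0x0000 and 0x0001. They differ only in the function bits, which is why per-slot state can be shared through a single npu_link.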
Re: [PATCH] powerpc/fadump: Remove duplicate message.
On Thu, Oct 24, 2019 at 04:08:08PM +0530, Hari Bathini wrote:
>
> Michal, thanks for looking into this.
>
> On 23/10/19 11:26 PM, Michal Suchanek wrote:
> > There is duplicate message about lack of support by firmware in
> > fadump_reserve_mem and setup_fadump. Due to different capitalization it
> > is clear that the one in setup_fadump is shown on boot. Remove the
> > duplicate that is not shown.
>
> Actually, the message in fadump_reserve_mem() is logged. fadump_reserve_mem()
> executes first and sets fw_dump.fadump_enabled to `0`, if fadump is not
> supported.
> So, the other message in setup_fadump() doesn't get logged anymore with recent
> changes. The right thing to do would be to remove similar message in
> setup_fadump() instead.

I need to re-check with a recent kernel build. I saw the message from
setup_fadump and not the one from fadump_reserve_mem but not sure what
the platform init code looked like in the kernel I tested with.

Thanks

Michal