Re: [PATCH v5 1/3] powerpc/eeh: Move PE state constants around
On Thu, Mar 26, 2015 at 04:42:07PM +1100, Gavin Shan wrote:
> There are two equivalent sets of PE state constants, defined in
> arch/powerpc/include/asm/eeh.h and include/uapi/linux/vfio.h. Though the
> names are different, their corresponding values are exactly the same.
> The former set is used by the EEH core and the latter by userspace.
>
> The patch moves those constants from arch/powerpc/include/asm/eeh.h to
> arch/powerpc/include/uapi/asm/eeh.h, which is expected to be used by
> userspace from now on. We can't delete the constants in vfio.h as it's
> uncertain whether they have been, or will be, used by userspace.
>
> Suggested-by: David Gibson da...@gibson.dropbear.id.au
> Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com

Reviewed-by: David Gibson da...@gibson.dropbear.id.au

--
David Gibson                   | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you.  NOT _the_ _other_
                               | _way_ _around_!
http://www.ozlabs.org/~dgibson

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] powerpc/powernv: handle OPAL_SUCCESS return in opal_sensor_read
Cedric Le Goater c...@fr.ibm.com writes:
> The sensor service in OPAL only handles one FSP request at a time and
> returns OPAL_BUSY if one is already in progress. The lock covers this
> case, but we could also remove it and return EBUSY to the driver, or
> even retry the call. That might be dangerous though.

Retrying the call should be okay. Just because the FSP wants to do
things serially doesn't mean non-FSP does :)

> Changing OPAL to handle multiple requests simultaneously does not seem
> really necessary; it won't speed up the communication with the FSP,
> and that is the main bottleneck.

Only on FSP systems though, and all of the OpenPower machines don't
have FSPs :)
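As an aside for readers following the thread, the "retry on busy" option
being discussed can be sketched in plain userspace C. The firmware call
here is a mock (the real one is opal_sensor_read() behind the OPAL calling
convention), and MOCK_SUCCESS/MOCK_BUSY are stand-ins for the real
OPAL_SUCCESS/OPAL_BUSY constants; the point is only the bounded retry loop:

```c
#include <assert.h>

/* Hypothetical status codes standing in for OPAL_SUCCESS / OPAL_BUSY. */
enum { MOCK_SUCCESS = 0, MOCK_BUSY = -2 };

/* Mock of a firmware call that is busy for the first few attempts. */
static int attempts_left;

static int mock_sensor_read(void)
{
	if (attempts_left-- > 0)
		return MOCK_BUSY;
	return MOCK_SUCCESS;
}

/* Retry while the call reports BUSY, with a bounded retry count so a
 * wedged firmware cannot hang the caller forever. */
static int read_with_retry(int max_retries)
{
	int rc;

	do {
		rc = mock_sensor_read();
	} while (rc == MOCK_BUSY && max_retries-- > 0);

	return rc;
}
```

A bounded loop like this sidesteps the "dangerous" open-ended retry the
original mail worries about: the caller still sees BUSY if the budget runs
out.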
Re: cxl: Fix a typo in ABI documentation
On Thu, 2015-26-03 at 10:46:56 UTC, Philippe Bergheaud wrote:
> Fix the attribute name of the configuration record class ID.
>
> Signed-off-by: Philippe Bergheaud fe...@linux.vnet.ibm.com
> ---
>  Documentation/ABI/testing/sysfs-class-cxl | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-class-cxl b/Documentation/ABI/testing/sysfs-class-cxl
> index 3680364..d46bba8 100644
> --- a/Documentation/ABI/testing/sysfs-class-cxl
> +++ b/Documentation/ABI/testing/sysfs-class-cxl
> @@ -100,7 +100,7 @@ Description:    read only
>                  Hexadecimal value of the device ID found in this AFU
>                  configuration record.
> -What:           /sys/class/cxl/<afu>/cr<config num>/vendor
> +What:           /sys/class/cxl/<afu>/cr<config num>/class
>  Date:           February 2015
>  Contact:        linuxppc-dev@lists.ozlabs.org
>  Description:    read only

White space is fubar. I fixed it up and applied it anyway.

cheers
Re: [PATCH v7 0/3] Generic IOMMU pooled allocator
On (03/26/15 08:05), Benjamin Herrenschmidt wrote:
> > PowerPC folks, what do you think?
>
> I'll give it another look today.
>
> Cheers,
> Ben.

Hi Ben, did you have a chance to look at this?

--Sowmini
[PATCH] t2080qds/rtc: fix rtc interrupt
RTC interrupt uses IRQ11 on T2080QDS.

Signed-off-by: Shengzhou Liu shengzhou@freescale.com
---
 arch/powerpc/boot/dts/t208xqds.dtsi | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/boot/dts/t208xqds.dtsi b/arch/powerpc/boot/dts/t208xqds.dtsi
index 5906183..024cc96 100644
--- a/arch/powerpc/boot/dts/t208xqds.dtsi
+++ b/arch/powerpc/boot/dts/t208xqds.dtsi
@@ -137,7 +137,7 @@
 			rtc@68 {
 				compatible = "dallas,ds3232";
 				reg = <0x68>;
-				interrupts = <0x1 0x1 0 0>;
+				interrupts = <0xb 0x1 0 0>;
 			};
 		};
--
2.1.0.27.g96db324
Re: [PATCH v4 2/2] powerpc/mm: Tracking vDSO remap
On 26/03/2015 19:55, Ingo Molnar wrote:
> * Laurent Dufour lduf...@linux.vnet.ibm.com wrote:
>
> > +{
> > +	unsigned long vdso_end, vdso_start;
> > +
> > +	if (!mm->context.vdso_base)
> > +		return;
> > +	vdso_start = mm->context.vdso_base;
> > +
> > +#ifdef CONFIG_PPC64
> > +	/* Calling is_32bit_task() implies that we are dealing with the
> > +	 * current process memory. If there is a call path where mm is not
> > +	 * owned by the current task, then we'll have to store the
> > +	 * vDSO size in the mm->context.
> > +	 */
> > +	BUG_ON(current->mm != mm);
> > +	if (is_32bit_task())
> > +		vdso_end = vdso_start + (vdso32_pages << PAGE_SHIFT);
> > +	else
> > +		vdso_end = vdso_start + (vdso64_pages << PAGE_SHIFT);
> > +#else
> > +	vdso_end = vdso_start + (vdso32_pages << PAGE_SHIFT);
> > +#endif
> > +	vdso_end += (1 << PAGE_SHIFT);	/* data page */
> > +
> > +	/* Check if the vDSO is in the range of the remapped area */
> > +	if ((vdso_start <= old_start && old_start < vdso_end) ||
> > +	    (vdso_start < old_end && old_end <= vdso_end) ||
> > +	    (old_start <= vdso_start && vdso_start < old_end)) {
> > +		/* Update vdso_base if the vDSO is entirely moved. */
> > +		if (old_start == vdso_start && old_end == vdso_end &&
> > +		    (old_end - old_start) == (new_end - new_start))
> > +			mm->context.vdso_base = new_start;
> > +		else
> > +			mm->context.vdso_base = 0;
> > +	}
> > +}
>
> Oh my, that really looks awfully complex, as you predicted, and right
> in every mremap() call.

I do agree, that's awfully complex ;)

> I'm fine with your original, imperfect, KISS approach. Sorry about
> this detour ...
>
> Reviewed-by: Ingo Molnar mi...@kernel.org

No problem, so let's stay on the v3 version of the patch.

Thanks for the Reviewed-by statement which, I guess, applies to the v3
too. Should I resend the v3?

Thanks,
Laurent.
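For readers trying to follow the interval logic above: the three-clause
condition is a half-open range-intersection test between the vDSO range
[vdso_start, vdso_end) and the remapped range [old_start, old_end). It can
be checked in isolation as plain C; this is an illustrative reimplementation
with the kernel context stripped away, not the patch itself:

```c
#include <assert.h>
#include <stdbool.h>

/* Does the half-open remap range [old_start, old_end) intersect the
 * vDSO range [vdso_start, vdso_end)? Mirrors the three-clause test
 * quoted in the patch above. */
static bool vdso_range_affected(unsigned long vdso_start,
				unsigned long vdso_end,
				unsigned long old_start,
				unsigned long old_end)
{
	return (vdso_start <= old_start && old_start < vdso_end) ||
	       (vdso_start < old_end && old_end <= vdso_end) ||
	       (old_start <= vdso_start && vdso_start < old_end);
}
```

The third clause catches the case where the remapped area fully covers the
vDSO, which the first two clauses alone would miss.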
[PATCH] mtd/spi: support en25s64 device
Add support for the EON en25s64 SPI device.

Signed-off-by: Shengzhou Liu shengzhou@freescale.com
---
 drivers/mtd/spi-nor/spi-nor.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/mtd/spi-nor/spi-nor.c b/drivers/mtd/spi-nor/spi-nor.c
index 0f8ec3c..f8acef7 100644
--- a/drivers/mtd/spi-nor/spi-nor.c
+++ b/drivers/mtd/spi-nor/spi-nor.c
@@ -524,6 +524,7 @@ static const struct spi_device_id spi_nor_ids[] = {
 	{ "en25q64",    INFO(0x1c3017, 0, 64 * 1024, 128, SECT_4K) },
 	{ "en25qh128",  INFO(0x1c7018, 0, 64 * 1024, 256, 0) },
 	{ "en25qh256",  INFO(0x1c7019, 0, 64 * 1024, 512, 0) },
+	{ "en25s64",    INFO(0x1c3817, 0, 64 * 1024, 128, 0) },

 	/* ESMT */
 	{ "f25l32pa", INFO(0x8c2016, 0, 64 * 1024, 64, SECT_4K) },
--
2.1.0.27.g96db324
Re: [Skiboot] [v2, 1/3] powerpc/powernv: convert codes returned by OPAL calls
On 03/27/2015 11:36 AM, Benjamin Herrenschmidt wrote:
> On Fri, 2015-03-27 at 20:59 +1100, Michael Ellerman wrote:
> > Can you put it in opal.h and give it a better name, maybe
> > opal_error_code() ?
>
> Do we want it to be inlined all the time? Feels more like something we
> should have in opal.c.
>
> Also we only want to call it when we forward the error code up the
> food chain; there are a number of cases where we look for specific
> OPAL error codes.

Yes, the forward is not systematic. opal.c looks like a better place.
-ERANGE also looks better when the return code is unexpected.

C.
Re: [Skiboot] [v2, 1/3] powerpc/powernv: convert codes returned by OPAL calls
On Fri, 2015-03-27 at 20:59 +1100, Michael Ellerman wrote:
> Can you put it in opal.h and give it a better name, maybe
> opal_error_code() ?

Do we want it to be inlined all the time? Feels more like something we
should have in opal.c.

Also we only want to call it when we forward the error code up the food
chain; there are a number of cases where we look for specific OPAL
error codes.

Cheers,
Ben.
Re: [v2,1/3] powerpc/powernv: convert codes returned by OPAL calls
On 03/27/2015 10:59 AM, Michael Ellerman wrote:
> On Thu, 2015-26-03 at 16:04:45 UTC, Cédric Le Goater wrote:
> > OPAL has its own list of return codes. The patch provides a
> > translation of such codes into errnos for the opal_sensor_read call.
> >
> > Signed-off-by: Cédric Le Goater c...@fr.ibm.com
> > ---
> >  arch/powerpc/platforms/powernv/opal-sensor.c | 37 ++-
> >  1 file changed, 36 insertions(+), 1 deletion(-)
> >
> > Index: linux.git/arch/powerpc/platforms/powernv/opal-sensor.c
> > ===
> > --- linux.git.orig/arch/powerpc/platforms/powernv/opal-sensor.c
> > +++ linux.git/arch/powerpc/platforms/powernv/opal-sensor.c
> > @@ -26,6 +26,38 @@
> > +static int convert_opal_code(int ret)
> > +{
> > +	switch (ret) {
> > +	case OPAL_SUCCESS:		return 0;
> > +	case OPAL_PARAMETER:		return -EINVAL;
> > +	case OPAL_UNSUPPORTED:		return -ENOSYS;
> > +	case OPAL_ASYNC_COMPLETION:	return -EAGAIN;
> > +	case OPAL_BUSY_EVENT:		return -EBUSY;
> > +	case OPAL_NO_MEM:		return -ENOMEM;
> > +	case OPAL_HARDWARE:		return -ENOENT;
> > +	case OPAL_INTERNAL_ERROR:	return -EIO;
> > +	default:			return -EIO;
> > +	}
> > +}
>
> That looks a bit familiar :)

Ah! I only looked in opal ...

> static int rtas_error_rc(int rtas_rc)
> {
> 	int rc;
>
> 	switch (rtas_rc) {
> 	case -1:		/* Hardware Error */
> 		rc = -EIO;
> 		break;
> 	case -3:		/* Bad indicator/domain/etc */
> 		rc = -EINVAL;
> 		break;
> 	case -9000:		/* Isolation error */
> 		rc = -EFAULT;
> 		break;
> 	case -9001:		/* Outstanding TCE/PTE */
> 		rc = -EEXIST;
> 		break;
> 	case -9002:		/* No usable slot */
> 		rc = -ENODEV;
> 		break;
> 	default:
> 		printk(KERN_ERR "%s: unexpected RTAS error %d\n",
> 		       __func__, rtas_rc);
> 		rc = -ERANGE;

This is a better default code value.

> 		break;
> 	}
> 	return rc;
> }
>
> But I guess we still should have it. Can you put it in opal.h and give
> it a better name, maybe opal_error_code() ?

Sure. I will change the name, but opal.c looks better, knowing that
opal.h is shared with skiboot.

> > /*
> >  * This will return sensor information to the driver based on the
> >  * requested sensor handle. A handle is an opaque id for the powernv,
> >  * read by the driver from the
> > @@ -46,8 +78,10 @@ int opal_get_sensor_data(u32 sensor_hndl
> >  	mutex_lock(&opal_sensor_mutex);
> >  	ret = opal_sensor_read(sensor_hndl, token, &data);
> > -	if (ret != OPAL_ASYNC_COMPLETION)
> > +	if (ret != OPAL_ASYNC_COMPLETION) {
> > +		ret = convert_opal_code(ret);
> >  		goto out_token;
> > +	}
> >
> >  	ret = opal_async_wait_response(token, &msg);
> >  	if (ret) {
> > @@ -58,6 +92,7 @@ int opal_get_sensor_data(u32 sensor_hndl
> >  	*sensor_data = be32_to_cpu(data);
> >  	ret = be64_to_cpu(msg.params[1]);
> > +	ret = convert_opal_code(ret);
>
> I'd do:
>
> 	ret = convert_opal_code(be64_to_cpu(msg.params[1]));

Yes, the double 'ret =' is ugly.

Thanks,

C.
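For readers following the thread, the conversion table under discussion can
be exercised in plain userspace C, under the opal_error_code() name Michael
suggests and with the -ERANGE default Cédric agrees is better for unexpected
codes. The OPAL_* values are transcribed here by hand and should be checked
against the in-tree definitions (asm/opal-api.h); treat this as a sketch,
not the eventual kernel helper:

```c
#include <assert.h>
#include <errno.h>

/* Firmware return codes, transcribed by hand -- verify against the
 * kernel's asm/opal-api.h before relying on the exact values. */
#define OPAL_SUCCESS		0
#define OPAL_PARAMETER		-1
#define OPAL_HARDWARE		-6
#define OPAL_UNSUPPORTED	-7
#define OPAL_NO_MEM		-9
#define OPAL_INTERNAL_ERROR	-11
#define OPAL_BUSY_EVENT		-12
#define OPAL_ASYNC_COMPLETION	-15

/* Map an OPAL return code to a Linux errno, returning -ERANGE for
 * anything unexpected (the default suggested in the review). */
static int opal_error_code(int rc)
{
	switch (rc) {
	case OPAL_SUCCESS:		return 0;
	case OPAL_PARAMETER:		return -EINVAL;
	case OPAL_UNSUPPORTED:		return -ENOSYS;
	case OPAL_ASYNC_COMPLETION:	return -EAGAIN;
	case OPAL_BUSY_EVENT:		return -EBUSY;
	case OPAL_NO_MEM:		return -ENOMEM;
	case OPAL_HARDWARE:		return -ENOENT;
	case OPAL_INTERNAL_ERROR:	return -EIO;
	default:			return -ERANGE;
	}
}
```

Keeping the mapping in one function, as in rtas_error_rc(), means callers
that forward errors up the food chain share one translation while callers
that test for specific OPAL codes can still do so before converting.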
[PATCH] powerpc/defconfig: enable CONFIG_I2C_MUX and CONFIG_I2C_MUX_PCA954x
By default we enable CONFIG_I2C_MUX and CONFIG_I2C_MUX_PCA954x, which
are needed on T2080QDS, T4240QDS, B4860QDS, etc.

Signed-off-by: Shengzhou Liu shengzhou@freescale.com
---
against 'next' branch of
git://git.kernel.org/pub/scm/linux/kernel/git/scottwood/linux.git

 arch/powerpc/configs/corenet32_smp_defconfig | 2 ++
 arch/powerpc/configs/corenet64_smp_defconfig | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/arch/powerpc/configs/corenet32_smp_defconfig b/arch/powerpc/configs/corenet32_smp_defconfig
index 51866f1..6cf323f 100644
--- a/arch/powerpc/configs/corenet32_smp_defconfig
+++ b/arch/powerpc/configs/corenet32_smp_defconfig
@@ -114,6 +114,8 @@ CONFIG_NVRAM=y
 CONFIG_I2C=y
 CONFIG_I2C_CHARDEV=y
 CONFIG_I2C_MPC=y
+CONFIG_I2C_MUX=y
+CONFIG_I2C_MUX_PCA954x=y
 CONFIG_SPI=y
 CONFIG_SPI_GPIO=y
 CONFIG_SPI_FSL_SPI=y
diff --git a/arch/powerpc/configs/corenet64_smp_defconfig b/arch/powerpc/configs/corenet64_smp_defconfig
index d6c0c81..9d8ca81 100644
--- a/arch/powerpc/configs/corenet64_smp_defconfig
+++ b/arch/powerpc/configs/corenet64_smp_defconfig
@@ -99,6 +99,8 @@ CONFIG_SERIAL_8250_RSA=y
 CONFIG_I2C=y
 CONFIG_I2C_CHARDEV=y
 CONFIG_I2C_MPC=y
+CONFIG_I2C_MUX=y
+CONFIG_I2C_MUX_PCA954x=y
 CONFIG_SPI=y
 CONFIG_SPI_GPIO=y
 CONFIG_SPI_FSL_SPI=y
--
2.1.0.27.g96db324
Re: [v2,1/3] powerpc/powernv: convert codes returned by OPAL calls
On Thu, 2015-26-03 at 16:04:45 UTC, Cédric Le Goater wrote:
> OPAL has its own list of return codes. The patch provides a
> translation of such codes into errnos for the opal_sensor_read call.
>
> Signed-off-by: Cédric Le Goater c...@fr.ibm.com
> ---
>  arch/powerpc/platforms/powernv/opal-sensor.c | 37 ++-
>  1 file changed, 36 insertions(+), 1 deletion(-)
>
> Index: linux.git/arch/powerpc/platforms/powernv/opal-sensor.c
> ===
> --- linux.git.orig/arch/powerpc/platforms/powernv/opal-sensor.c
> +++ linux.git/arch/powerpc/platforms/powernv/opal-sensor.c
> @@ -26,6 +26,38 @@
> +static int convert_opal_code(int ret)
> +{
> +	switch (ret) {
> +	case OPAL_SUCCESS:		return 0;
> +	case OPAL_PARAMETER:		return -EINVAL;
> +	case OPAL_UNSUPPORTED:		return -ENOSYS;
> +	case OPAL_ASYNC_COMPLETION:	return -EAGAIN;
> +	case OPAL_BUSY_EVENT:		return -EBUSY;
> +	case OPAL_NO_MEM:		return -ENOMEM;
> +	case OPAL_HARDWARE:		return -ENOENT;
> +	case OPAL_INTERNAL_ERROR:	return -EIO;
> +	default:			return -EIO;
> +	}
> +}

That looks a bit familiar :)

static int rtas_error_rc(int rtas_rc)
{
	int rc;

	switch (rtas_rc) {
	case -1:		/* Hardware Error */
		rc = -EIO;
		break;
	case -3:		/* Bad indicator/domain/etc */
		rc = -EINVAL;
		break;
	case -9000:		/* Isolation error */
		rc = -EFAULT;
		break;
	case -9001:		/* Outstanding TCE/PTE */
		rc = -EEXIST;
		break;
	case -9002:		/* No usable slot */
		rc = -ENODEV;
		break;
	default:
		printk(KERN_ERR "%s: unexpected RTAS error %d\n",
		       __func__, rtas_rc);
		rc = -ERANGE;
		break;
	}
	return rc;
}

But I guess we still should have it. Can you put it in opal.h and give
it a better name, maybe opal_error_code() ?

> /*
>  * This will return sensor information to the driver based on the
>  * requested sensor handle. A handle is an opaque id for the powernv,
>  * read by the driver from the
> @@ -46,8 +78,10 @@ int opal_get_sensor_data(u32 sensor_hndl
>  	mutex_lock(&opal_sensor_mutex);
>  	ret = opal_sensor_read(sensor_hndl, token, &data);
> -	if (ret != OPAL_ASYNC_COMPLETION)
> +	if (ret != OPAL_ASYNC_COMPLETION) {
> +		ret = convert_opal_code(ret);
>  		goto out_token;
> +	}
>
>  	ret = opal_async_wait_response(token, &msg);
>  	if (ret) {
> @@ -58,6 +92,7 @@ int opal_get_sensor_data(u32 sensor_hndl
>  	*sensor_data = be32_to_cpu(data);
>  	ret = be64_to_cpu(msg.params[1]);
> +	ret = convert_opal_code(ret);

I'd do:

	ret = convert_opal_code(be64_to_cpu(msg.params[1]));

cheers
Re: [V2, 06/12] selftests, powerpc: Add test for system wide DSCR default
On Tue, 2015-13-01 at 10:22:34 UTC, Anshuman Khandual wrote:
> This patch adds a test case for the system wide DSCR default value,
> which when changed through its sysfs interface must be visible to all
> threads reading DSCR either through the privilege state SPR or the
> problem state SPR. The DSCR value change should be immediate as well.

...

> +
> +/* Default DSCR access */
> +unsigned long get_default_dscr(void)
> +{
> +	int fd = -1;
> +	char buf[16];
> +	unsigned long val;
> +
> +	if (fd == -1) {
> +		fd = open(DSCR_DEFAULT, O_RDONLY);
> +		if (fd == -1) {
> +			perror("open() failed\n");
> +			exit(1);
> +		}
> +	}
> +	memset(buf, 0, sizeof(buf));
> +	lseek(fd, 0, SEEK_SET);
> +	read(fd, buf, sizeof(buf));

This and the other tests are failing to build:

In file included from dscr_default_test.c:16:0:
dscr.h: In function 'get_default_dscr':
dscr.h:93:6: error: ignoring return value of 'read', declared with attribute warn_unused_result [-Werror=unused-result]
   read(fd, buf, sizeof(buf));
      ^
dscr.h: In function 'set_default_dscr':
dscr.h:112:7: error: ignoring return value of 'write', declared with attribute warn_unused_result [-Werror=unused-result]
   write(fd, buf, strlen(buf));
       ^
cc1: all warnings being treated as errors

cheers
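For readers wondering what the -Werror=unused-result fix looks like in
practice: the build error goes away once the read() return value is
actually checked. The helper below is an illustrative, self-contained
variant of the quoted selftest code (the function name and the strtoul
parsing are assumptions of this sketch, not the patch that eventually
landed):

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Read an unsigned long from a sysfs-style file, checking the return
 * values of open() and read() so warn_unused_result is satisfied. */
static unsigned long read_ulong_from_file(const char *path)
{
	char buf[16] = { 0 };
	ssize_t n;
	int fd = open(path, O_RDONLY);

	if (fd == -1) {
		perror("open() failed");
		exit(1);
	}
	n = read(fd, buf, sizeof(buf) - 1);
	if (n < 0) {
		perror("read() failed");
		close(fd);
		exit(1);
	}
	close(fd);
	return strtoul(buf, NULL, 0);
}
```

Checking (and using) the short-read count is also more correct than
ignoring it: read() may legitimately return fewer bytes than requested.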
Re: [PATCH] powerpc/powernv: Remove powernv RTAS support
Michael Ellerman m...@ellerman.id.au writes:
> The powernv code has some conditional support for running on bare
> metal machines that have no OPAL firmware, but provide RTAS. No
> released machines ever supported that, and even in the lab it was just
> a transitional hack in the days when OPAL was still being developed.
> So remove the code.

When we select CONFIG_PPC_PSERIES, we end up selecting CONFIG_PPC_RTAS.
What is the expected behaviour there for powernv? Can we use RTAS calls
from powernv code if an rtas node is present in the device tree, or do
we want to make sure powernv always uses OPAL calls?

For example, right now we will use the rtas node's get-term-char and
put-term-char even with this patch applied.

-aneesh
RE: [PATCH] i2c/mpc: Fix ISR return value
> I can't apply the patch. There seem to be whitespace problems. Please
> fix the patch or your mail sending.

Sorry for the delayed response, and it's my bad as I didn't pass it
through checkpatch. I will send a fresh patch.

Thanks,
Amit.

---
 drivers/i2c/busses/i2c-mpc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/i2c/busses/i2c-mpc.c b/drivers/i2c/busses/i2c-mpc.c
index 0edf630..7a3136f 100644
--- a/drivers/i2c/busses/i2c-mpc.c
+++ b/drivers/i2c/busses/i2c-mpc.c
@@ -95,8 +95,9 @@ static irqreturn_t mpc_i2c_isr(int irq, void *dev_id)
 		i2c->interrupt = readb(i2c->base + MPC_I2C_SR);
 		writeb(0, i2c->base + MPC_I2C_SR);
 		wake_up(&i2c->queue);
+		return IRQ_HANDLED;
 	}
-	return IRQ_HANDLED;
+	return IRQ_NONE;
 }

 /* Sometimes 9th clock pulse isn't generated, and slave doesn't release
--
1.7.9.5
--
To unsubscribe from this list: send the line "unsubscribe linux-i2c" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH kernel v7 04/31] vfio: powerpc/spapr: Use it_page_size
This makes use of the it_page_size from the iommu_table struct, as the
page size can differ.

This replaces the missing IOMMU_PAGE_SHIFT macro in commented debug
code, as the recently introduced IOMMU_PAGE_XXX macros do not include
IOMMU_PAGE_SHIFT.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
Reviewed-by: David Gibson da...@gibson.dropbear.id.au
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 26 +-
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index f835e63..8bbee22 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -91,7 +91,7 @@ static int tce_iommu_enable(struct tce_container *container)
 	 * enforcing the limit based on the max that the guest can map.
 	 */
 	down_write(&current->mm->mmap_sem);
-	npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
+	npages = (tbl->it_size << tbl->it_page_shift) >> PAGE_SHIFT;
 	locked = current->mm->locked_vm + npages;
 	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
@@ -120,7 +120,7 @@ static void tce_iommu_disable(struct tce_container *container)
 	down_write(&current->mm->mmap_sem);
 	current->mm->locked_vm -= (container->tbl->it_size <<
-			IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
+			container->tbl->it_page_shift) >> PAGE_SHIFT;
 	up_write(&current->mm->mmap_sem);
 }

@@ -222,7 +222,7 @@ static long tce_iommu_build(struct tce_container *container,
 					tce, ret);
 			break;
 		}
-		tce += IOMMU_PAGE_SIZE_4K;
+		tce += IOMMU_PAGE_SIZE(tbl);
 	}

 	if (ret)
@@ -267,8 +267,8 @@ static long tce_iommu_ioctl(void *iommu_data,
 		if (info.argsz < minsz)
 			return -EINVAL;

-		info.dma32_window_start = tbl->it_offset << IOMMU_PAGE_SHIFT_4K;
-		info.dma32_window_size = tbl->it_size << IOMMU_PAGE_SHIFT_4K;
+		info.dma32_window_start = tbl->it_offset << tbl->it_page_shift;
+		info.dma32_window_size = tbl->it_size << tbl->it_page_shift;
 		info.flags = 0;

 		if (copy_to_user((void __user *)arg, &info, minsz))
@@ -298,8 +298,8 @@ static long tce_iommu_ioctl(void *iommu_data,
 				VFIO_DMA_MAP_FLAG_WRITE))
 			return -EINVAL;

-		if ((param.size & ~IOMMU_PAGE_MASK_4K) ||
-		    (param.vaddr & ~IOMMU_PAGE_MASK_4K))
+		if ((param.size & ~IOMMU_PAGE_MASK(tbl)) ||
+		    (param.vaddr & ~IOMMU_PAGE_MASK(tbl)))
 			return -EINVAL;

 		/* iova is checked by the IOMMU API */
@@ -314,8 +314,8 @@ static long tce_iommu_ioctl(void *iommu_data,
 			return ret;

 		ret = tce_iommu_build(container, tbl,
-				param.iova >> IOMMU_PAGE_SHIFT_4K,
-				tce, param.size >> IOMMU_PAGE_SHIFT_4K);
+				param.iova >> tbl->it_page_shift,
+				tce, param.size >> tbl->it_page_shift);

 		iommu_flush_tce(tbl);

@@ -341,17 +341,17 @@ static long tce_iommu_ioctl(void *iommu_data,
 		if (param.flags)
 			return -EINVAL;

-		if (param.size & ~IOMMU_PAGE_MASK_4K)
+		if (param.size & ~IOMMU_PAGE_MASK(tbl))
 			return -EINVAL;

 		ret = iommu_tce_clear_param_check(tbl, param.iova, 0,
-				param.size >> IOMMU_PAGE_SHIFT_4K);
+				param.size >> tbl->it_page_shift);
 		if (ret)
 			return ret;

 		ret = tce_iommu_clear(container, tbl,
-				param.iova >> IOMMU_PAGE_SHIFT_4K,
-				param.size >> IOMMU_PAGE_SHIFT_4K);
+				param.iova >> tbl->it_page_shift,
+				param.size >> tbl->it_page_shift);
 		iommu_flush_tce(tbl);

 		return ret;
--
2.0.0
[PATCH kernel v7 01/31] vfio: powerpc/spapr: Move page pinning from arch code to VFIO IOMMU driver
This moves the page pinning (get_user_pages_fast()/put_page()) code out
of the platform IOMMU code and puts it into the VFIO IOMMU driver where
it belongs, as the platform code does not deal with page pinning.

This makes iommu_take_ownership()/iommu_release_ownership() deal with
the IOMMU table bitmap only.

This removes page unpinning from iommu_take_ownership() as the actual
TCE table might contain garbage and doing put_page() on it is undefined
behaviour.

Besides the last part, the rest of the patch is mechanical.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
Changes:
v4:
* s/iommu_tce_build(tbl, entry + 1/iommu_tce_build(tbl, entry + i/
---
 arch/powerpc/include/asm/iommu.h    |  4 --
 arch/powerpc/kernel/iommu.c         | 55 --
 drivers/vfio/vfio_iommu_spapr_tce.c | 78 ++---
 3 files changed, 65 insertions(+), 72 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index f1ea597..ed69b7d 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -197,10 +197,6 @@ extern int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
 		unsigned long hwaddr, enum dma_data_direction direction);
 extern unsigned long iommu_clear_tce(struct iommu_table *tbl,
 		unsigned long entry);
-extern int iommu_clear_tces_and_put_pages(struct iommu_table *tbl,
-		unsigned long entry, unsigned long pages);
-extern int iommu_put_tce_user_mode(struct iommu_table *tbl,
-		unsigned long entry, unsigned long tce);

 extern void iommu_flush_tce(struct iommu_table *tbl);
 extern int iommu_take_ownership(struct iommu_table *tbl);
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index b054f33..1b4a178 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -991,30 +991,6 @@ unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry)
 }
 EXPORT_SYMBOL_GPL(iommu_clear_tce);

-int iommu_clear_tces_and_put_pages(struct iommu_table *tbl,
-		unsigned long entry, unsigned long pages)
-{
-	unsigned long oldtce;
-	struct page *page;
-
-	for ( ; pages; --pages, ++entry) {
-		oldtce = iommu_clear_tce(tbl, entry);
-		if (!oldtce)
-			continue;
-
-		page = pfn_to_page(oldtce >> PAGE_SHIFT);
-		WARN_ON(!page);
-		if (page) {
-			if (oldtce & TCE_PCI_WRITE)
-				SetPageDirty(page);
-			put_page(page);
-		}
-	}
-
-	return 0;
-}
-EXPORT_SYMBOL_GPL(iommu_clear_tces_and_put_pages);
-
 /*
  * hwaddr is a kernel virtual address here (0xc... bazillion),
  * tce_build converts it to a physical address.
@@ -1044,35 +1020,6 @@ int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
 }
 EXPORT_SYMBOL_GPL(iommu_tce_build);

-int iommu_put_tce_user_mode(struct iommu_table *tbl, unsigned long entry,
-		unsigned long tce)
-{
-	int ret;
-	struct page *page = NULL;
-	unsigned long hwaddr, offset = tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK;
-	enum dma_data_direction direction = iommu_tce_direction(tce);
-
-	ret = get_user_pages_fast(tce & PAGE_MASK, 1,
-			direction != DMA_TO_DEVICE, &page);
-	if (unlikely(ret != 1)) {
-		/* pr_err("iommu_tce: get_user_pages_fast failed tce=%lx ioba=%lx ret=%d\n",
-				tce, entry << tbl->it_page_shift, ret); */
-		return -EFAULT;
-	}
-	hwaddr = (unsigned long) page_address(page) + offset;
-
-	ret = iommu_tce_build(tbl, entry, hwaddr, direction);
-	if (ret)
-		put_page(page);
-
-	if (ret < 0)
-		pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%d\n",
-			__func__, entry << tbl->it_page_shift, tce, ret);
-
-	return ret;
-}
-EXPORT_SYMBOL_GPL(iommu_put_tce_user_mode);
-
 int iommu_take_ownership(struct iommu_table *tbl)
 {
 	unsigned long sz = (tbl->it_size + 7) >> 3;
@@ -1086,7 +1033,6 @@ int iommu_take_ownership(struct iommu_table *tbl)
 	}

 	memset(tbl->it_map, 0xff, sz);
-	iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);

 	/*
 	 * Disable iommu bypass, otherwise the user can DMA to all of
@@ -1104,7 +1050,6 @@ void iommu_release_ownership(struct iommu_table *tbl)
 {
 	unsigned long sz = (tbl->it_size + 7) >> 3;

-	iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
 	memset(tbl->it_map, 0, sz);

 	/* Restore bit#0 set by iommu_init_table() */
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 730b4ef..cefaf05 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -147,6 +147,66 @@ static void
[PATCH kernel v7 02/31] vfio: powerpc/spapr: Do cleanup when releasing the group
This clears the TCE table when a container is being closed, as it is a
good thing to leave the table clean before passing ownership back to
the host kernel.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index cefaf05..e9b4d7d 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -132,16 +132,24 @@ static void *tce_iommu_open(unsigned long arg)
 	return container;
 }

+static int tce_iommu_clear(struct tce_container *container,
+		struct iommu_table *tbl,
+		unsigned long entry, unsigned long pages);
+
 static void tce_iommu_release(void *iommu_data)
 {
 	struct tce_container *container = iommu_data;
+	struct iommu_table *tbl = container->tbl;

-	WARN_ON(container->tbl && !container->tbl->it_group);
+	WARN_ON(tbl && !tbl->it_group);

 	tce_iommu_disable(container);

-	if (container->tbl && container->tbl->it_group)
-		tce_iommu_detach_group(iommu_data, container->tbl->it_group);
+	if (tbl) {
+		tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
+
+		if (tbl->it_group)
+			tce_iommu_detach_group(iommu_data, tbl->it_group);
+	}

 	mutex_destroy(&container->lock);
 	kfree(container);
--
2.0.0
[PATCH kernel v7 03/31] vfio: powerpc/spapr: Check that TCE page size is equal to it_page_size
This checks that the TCE table page size is not bigger than the size of
a page we just pinned and are going to put the physical address of into
the table. Otherwise the hardware gets unwanted access to the physical
memory between the end of the actual page and the end of the aligned-up
TCE page.

Since compound_order() and compound_head() work correctly on non-huge
pages, there is no need for an additional check whether the page is
huge.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
Changes:
v6:
* the helper is simplified to one line

v4:
* s/tce_check_page_size/tce_page_is_contained/
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index e9b4d7d..f835e63 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -47,6 +47,16 @@ struct tce_container {
 	bool enabled;
 };

+static bool tce_page_is_contained(struct page *page, unsigned page_shift)
+{
+	/*
+	 * Check that the TCE table granularity is not bigger than the size of
+	 * a page we just found. Otherwise the hardware can get access to
+	 * a bigger memory chunk than it should.
+	 */
+	return (PAGE_SHIFT + compound_order(compound_head(page))) >= page_shift;
+}
+
 static int tce_iommu_enable(struct tce_container *container)
 {
 	int ret = 0;
@@ -195,6 +205,12 @@ static long tce_iommu_build(struct tce_container *container,
 			ret = -EFAULT;
 			break;
 		}
+
+		if (!tce_page_is_contained(page, tbl->it_page_shift)) {
+			ret = -EPERM;
+			break;
+		}
+
 		hva = (unsigned long) page_address(page) +
 			(tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK);
--
2.0.0
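For readers following along, the containment rule the one-line helper
encodes can be checked outside the kernel by replacing struct page and
compound_order() with an explicit order argument. This is an illustrative
reimplementation, not the kernel helper: a TCE entry of granularity
(1 << page_shift) may only point at a pinned page whose backing allocation
is at least that big.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SHIFT 12	/* 4K system pages, as on the platforms discussed */

/* order is the compound order of the pinned page's head page:
 * 0 for a plain 4K page, 12 for a 16M huge page with 4K base pages.
 * page_shift is the TCE table's granularity. */
static bool tce_page_is_contained(unsigned int order, unsigned int page_shift)
{
	return (PAGE_SHIFT + order) >= page_shift;
}
```

So a plain 4K page fits a 4K TCE but not a 64K TCE, while a 16M huge page
(order 12) fits a 16M TCE, matching the failure mode the commit message
describes.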
[PATCH kernel v7 00/31] powerpc/iommu/vfio: Enable Dynamic DMA windows
This enables the sPAPR-defined feature called Dynamic DMA windows (DDW).

Each Partitionable Endpoint (IOMMU group) has an address range on a PCI
bus where devices are allowed to do DMA. These ranges are called DMA
windows. By default, there is a single DMA window, 1 or 2GB big, mapped
at zero on a PCI bus.

Hi-speed devices may suffer from the limited size of the window. The
recent host kernels use a TCE bypass window on the POWER8 CPU which
implements direct PCI bus address range mapping (with an offset of
1 << 59) to the host memory.

For guests, PAPR defines a DDW RTAS API which allows pseries guests to
query the hypervisor about DDW support and capabilities (page size mask
for now). A pseries guest may request an additional (to the default)
DMA window using this RTAS API. The existing pseries Linux guests
request an additional window as big as the guest RAM and map the entire
guest window, which effectively creates a direct mapping of the guest
memory to a PCI bus.

The multiple DMA windows feature is supported by POWER7/POWER8 CPUs;
however this patchset only adds support for POWER8, as TCE tables are
implemented in POWER7 in a quite different way and POWER7 is not the
highest priority.

This patchset reworks the PPC64 IOMMU code and adds the necessary
structures to support big windows.

Once a Linux guest discovers the presence of DDW, it does:
1. query the hypervisor about the number of available windows and page
   size masks;
2. create a window with the biggest possible page size (today
   4K/64K/16M);
3. map the entire guest RAM via H_PUT_TCE* hypercalls;
4. switch dma_ops to direct_dma_ops on the selected PE.

Once this is done, H_PUT_TCE is not called anymore for 64bit devices
and the guest does not waste time on DMA map/unmap operations. Note
that 32bit devices won't use DDW and will keep using the default DMA
window, so KVM optimizations will be required (to be posted later).

This was pushed to g...@github.com:aik/linux.git
 + 6a0e0b7...b3f2ffe vfio-for-github -> vfio-for-github (forced update)

Changes:
v7:
* moved memory preregistration to the current process's MMU context
* added code preventing unregistration if some pages are still mapped;
  for this, a userspace view of the table is stored in iommu_table
* added locked_vm counting for DDW tables (including the userspace view
  of those)

v6:
* fixed a bunch of errors in "vfio: powerpc/spapr: Support Dynamic DMA
  windows"
* moved static IOMMU properties from iommu_table_group to
  iommu_table_group_ops

v5:
* added SPAPR_TCE_IOMMU_v2 to tell the userspace that there is a memory
  pre-registration feature
* added backward compatibility
* renamed few things (mostly powerpc_iommu -> iommu_table_group)

v4:
* moved patches around to have VFIO and PPC patches separated as much
  as possible
* now works with the existing upstream QEMU

v3:
* redesigned the whole thing
* multiple IOMMU groups per PHB -> one PHB is needed for VFIO in the
  guest -> no problems with locked_vm counting; also we save memory on
  actual tables
* guest RAM preregistration is required for DDW
* PEs (IOMMU groups) are passed to VFIO with no DMA windows at all so
  we do not bother with iommu_table::it_map anymore
* added multilevel TCE tables support to support really huge guests

v2:
* added missing __pa() in "powerpc/powernv: Release replaced TCE"
* reposted to make some noise

Alexey Kardashevskiy (31):
  vfio: powerpc/spapr: Move page pinning from arch code to VFIO IOMMU
    driver
  vfio: powerpc/spapr: Do cleanup when releasing the group
  vfio: powerpc/spapr: Check that TCE page size is equal to it_page_size
  vfio: powerpc/spapr: Use it_page_size
  vfio: powerpc/spapr: Move locked_vm accounting to helpers
  vfio: powerpc/spapr: Disable DMA mappings on disabled container
  vfio: powerpc/spapr: Moving pinning/unpinning to helpers
  vfio: powerpc/spapr: Rework groups attaching
  powerpc/powernv: Do not set read flag if direction==DMA_NONE
  powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table
  powerpc/iommu: Introduce iommu_table_alloc() helper
  powerpc/spapr: vfio: Switch from iommu_table to new iommu_table_group
  vfio: powerpc/spapr: powerpc/iommu: Rework IOMMU ownership control
  vfio: powerpc/spapr: powerpc/powernv/ioda2: Rework IOMMU ownership
    control
  powerpc/iommu: Fix IOMMU ownership control functions
  powerpc/powernv/ioda/ioda2: Rework tce_build()/tce_free()
  powerpc/iommu/powernv: Release replaced TCE
  powerpc/powernv/ioda2: Rework iommu_table creation
  powerpc/powernv/ioda2: Introduce
    pnv_pci_ioda2_create_table/pnc_pci_free_table
  powerpc/powernv/ioda2: Introduce pnv_pci_ioda2_set_window
  powerpc/iommu: Split iommu_free_table into 2 helpers
  powerpc/powernv: Implement multilevel TCE tables
  powerpc/powernv: Change prototypes to receive iommu
  powerpc/powernv/ioda: Define and implement DMA table/window management
    callbacks
  vfio: powerpc/spapr: powerpc/powernv/ioda2: Rework ownership
  powerpc/iommu: Add userspace view of TCE table
  powerpc/iommu/ioda2: Add
[PATCH kernel v7 20/31] powerpc/powernv/ioda2: Introduce pnv_pci_ioda2_set_window
This is a part of moving DMA window programming to an iommu_ops callback. This is a mechanical patch. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- arch/powerpc/platforms/powernv/pci-ioda.c | 85 --- 1 file changed, 56 insertions(+), 29 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 908863a..64b7cfe 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1391,6 +1391,57 @@ static void pnv_pci_free_table(struct iommu_table *tbl) memset(tbl, 0, sizeof(struct iommu_table)); } +static long pnv_pci_ioda2_set_window(struct pnv_ioda_pe *pe, + struct iommu_table *tbl) +{ + struct pnv_phb *phb = pe-phb; + const __be64 *swinvp; + int64_t rc; + const __u64 start_addr = tbl-it_offset tbl-it_page_shift; + const __u64 win_size = tbl-it_size tbl-it_page_shift; + + pe_info(pe, Setting up window at %llx..%llx pagesize=0x%x tablesize=0x%lx\n, + start_addr, start_addr + win_size - 1, + 1UL tbl-it_page_shift, tbl-it_size 3); + + pe-table_group.tables[0] = *tbl; + tbl = pe-table_group.tables[0]; + tbl-it_group = pe-table_group; + + /* +* Map TCE table through TVT. The TVE index is the PE number +* shifted by 1 bit for 32-bits DMA space. +*/ + rc = opal_pci_map_pe_dma_window(phb-opal_id, pe-pe_number, + pe-pe_number 1, 1, __pa(tbl-it_base), + tbl-it_size 3, 1ULL tbl-it_page_shift); + if (rc) { + pe_err(pe, Failed to configure TCE table, err %ld\n, rc); + goto fail; + } + + /* OPAL variant of PHB3 invalidated TCEs */ + swinvp = of_get_property(phb-hose-dn, ibm,opal-tce-kill, NULL); + if (swinvp) { + /* We need a couple more fields -- an address and a data +* to or. Since the bus is only printed out on table free +* errors, and on the first pass the data will be a relative +* bus number, print that out instead. 
+*/ + pe-tce_inval_reg_phys = be64_to_cpup(swinvp); + tbl-it_index = (unsigned long)ioremap(pe-tce_inval_reg_phys, + 8); + tbl-it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE); + } + + return 0; +fail: + if (pe-tce32_seg = 0) + pe-tce32_seg = -1; + + return rc; +} + static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable) { uint16_t window_id = (pe-pe_number 1 ) + 1; @@ -1463,7 +1514,6 @@ static struct iommu_table_group_ops pnv_pci_ioda2_ops = { static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe) { - const __be64 *swinvp; unsigned int end; struct iommu_table *tbl = pe-table_group.tables[0]; int64_t rc; @@ -1493,31 +1543,14 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, pe-table_group.ops = pnv_pci_ioda2_ops; #endif - /* -* Map TCE table through TVT. The TVE index is the PE number -* shifted by 1 bit for 32-bits DMA space. -*/ - rc = opal_pci_map_pe_dma_window(phb-opal_id, pe-pe_number, - pe-pe_number 1, 1, __pa(tbl-it_base), - tbl-it_size 3, 1ULL tbl-it_page_shift); + rc = pnv_pci_ioda2_set_window(pe, tbl); if (rc) { pe_err(pe, Failed to configure 32-bit TCE table, err %ld\n, rc); - goto fail; - } - - /* OPAL variant of PHB3 invalidated TCEs */ - swinvp = of_get_property(phb-hose-dn, ibm,opal-tce-kill, NULL); - if (swinvp) { - /* We need a couple more fields -- an address and a data -* to or. Since the bus is only printed out on table free -* errors, and on the first pass the data will be a relative -* bus number, print that out instead. 
-*/ - pe-tce_inval_reg_phys = be64_to_cpup(swinvp); - tbl-it_index = (unsigned long)ioremap(pe-tce_inval_reg_phys, - 8); - tbl-it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE); + pnv_pci_free_table(tbl); + if (pe-tce32_seg = 0) + pe-tce32_seg = -1; + return; } iommu_register_group(pe-table_group, phb-hose-global_number, pe-pe_number); @@ -1531,12 +1564,6 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, /* Also create a bypass window */ if (!pnv_iommu_bypass_disabled) pnv_pci_ioda2_setup_bypass_pe(phb, pe); - - return; -fail: - if
[PATCH kernel v7 15/31] powerpc/iommu: Fix IOMMU ownership control functions
This adds missing locks in iommu_take_ownership()/ iommu_release_ownership(). This marks all pages busy in iommu_table::it_map in order to catch errors if there is an attempt to use this table while ownership over it is taken. This only clears TCE content if there is no page marked busy in it_map. Clearing must be done outside of the table locks as iommu_clear_tce() called from iommu_clear_tces_and_put_pages() does this. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- Changes: v5: * do not store bit#0 value, it has to be set for zero-based table anyway * removed test_and_clear_bit --- arch/powerpc/kernel/iommu.c | 26 ++ 1 file changed, 22 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 7d6089b..068fe4ff 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -1052,17 +1052,28 @@ EXPORT_SYMBOL_GPL(iommu_tce_build); static int iommu_table_take_ownership(struct iommu_table *tbl) { - unsigned long sz = (tbl-it_size + 7) 3; + unsigned long flags, i, sz = (tbl-it_size + 7) 3; + int ret = 0; + + spin_lock_irqsave(tbl-large_pool.lock, flags); + for (i = 0; i tbl-nr_pools; i++) + spin_lock(tbl-pools[i].lock); if (tbl-it_offset == 0) clear_bit(0, tbl-it_map); if (!bitmap_empty(tbl-it_map, tbl-it_size)) { pr_err(iommu_tce: it_map is not empty); - return -EBUSY; + ret = -EBUSY; + if (tbl-it_offset == 0) + set_bit(0, tbl-it_map); + } else { + memset(tbl-it_map, 0xff, sz); } - memset(tbl-it_map, 0xff, sz); + for (i = 0; i tbl-nr_pools; i++) + spin_unlock(tbl-pools[i].lock); + spin_unlock_irqrestore(tbl-large_pool.lock, flags); return 0; } @@ -1095,7 +1106,11 @@ EXPORT_SYMBOL_GPL(iommu_take_ownership); static void iommu_table_release_ownership(struct iommu_table *tbl) { - unsigned long sz = (tbl-it_size + 7) 3; + unsigned long flags, i, sz = (tbl-it_size + 7) 3; + + spin_lock_irqsave(tbl-large_pool.lock, flags); + for (i = 0; i tbl-nr_pools; i++) + spin_lock(tbl-pools[i].lock); 
memset(tbl-it_map, 0, sz); @@ -1103,6 +1118,9 @@ static void iommu_table_release_ownership(struct iommu_table *tbl) if (tbl-it_offset == 0) set_bit(0, tbl-it_map); + for (i = 0; i tbl-nr_pools; i++) + spin_unlock(tbl-pools[i].lock); + spin_unlock_irqrestore(tbl-large_pool.lock, flags); } extern void iommu_release_ownership(struct iommu_table_group *table_group) -- 2.0.0 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
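For clarity, here is a small userspace model of the it_map logic that the fixed iommu_table_take_ownership() implements — the struct and function names are invented for this sketch, and the pool locking the patch adds is deliberately left out. Bit #0 is reserved on zero-based tables, so it is ignored during the emptiness check, and the whole map is marked busy only when no other TCE is in use:

```c
#include <assert.h>
#include <errno.h>

/* Toy userspace model of the it_map handling in the fixed
 * iommu_table_take_ownership(); one bitmap word is enough here.
 * The real function also takes the large-pool and per-pool locks,
 * which this sketch omits. */
struct toy_table {
	unsigned long it_map;	/* 1 bit per TCE entry, 1 = in use */
	unsigned long it_offset;
};

static int toy_take_ownership(struct toy_table *tbl)
{
	unsigned long map = tbl->it_map;

	/* bit#0 is reserved on zero-based tables, so ignore it */
	if (tbl->it_offset == 0)
		map &= ~1UL;

	if (map != 0)
		return -EBUSY;	/* live mappings exist, refuse ownership */

	/* mark every entry busy so stray users of the table are caught */
	tbl->it_map = ~0UL;
	return 0;
}
```

An empty table (only the reserved bit set) is taken over and fully marked busy; a table with any other bit set fails with -EBUSY and is left untouched, matching the restore-bit#0-on-failure path in the patch.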
[PATCH kernel v7 24/31] powerpc/powernv/ioda: Define and implement DMA table/window management callbacks
This extends iommu_table_group_ops by a set of callbacks to support dynamic DMA window management. query() returns IOMMU capabilities such as the default DMA window address and the supported number of DMA windows and TCE table levels. create_table() creates a TCE table with specific parameters. It receives an iommu_table_group to know the nodeid in order to allocate TCE table memory closer to the PHB. The exact format of the allocated multi-level table might also be specific to the PHB model (not the case now though). This callback calculates the DMA window offset on a PCI bus from @num and stores it in the just created table. set_window() sets the window at the specified TVT index + @num on the PHB. unset_window() unsets the window from the specified TVT. This adds a free() callback to iommu_table_ops to free the memory (potentially a tree of tables) allocated for the TCE table. create_table() and free() are supposed to be called once per VFIO container and set_window()/unset_window() are supposed to be called for every group in a container. 
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- arch/powerpc/include/asm/iommu.h| 21 +++ arch/powerpc/platforms/powernv/pci-ioda.c | 87 - arch/powerpc/platforms/powernv/pci-p5ioc2.c | 11 +++- 3 files changed, 102 insertions(+), 17 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 1e0d907..2c08c91 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -64,6 +64,8 @@ struct iommu_table_ops { long index, long npages); unsigned long (*get)(struct iommu_table *tbl, long index); void (*flush)(struct iommu_table *tbl); + + void (*free)(struct iommu_table *tbl); }; /* These are used by VIO */ @@ -150,12 +152,31 @@ struct iommu_table_group_ops { */ void (*set_ownership)(struct iommu_table_group *table_group, bool enable); + + long (*create_table)(struct iommu_table_group *table_group, + int num, + __u32 page_shift, + __u64 window_size, + __u32 levels, + struct iommu_table *tbl); + long (*set_window)(struct iommu_table_group *table_group, + int num, + struct iommu_table *tblnew); + long (*unset_window)(struct iommu_table_group *table_group, + int num); }; struct iommu_table_group { #ifdef CONFIG_IOMMU_API struct iommu_group *group; #endif + /* Some key properties of IOMMU */ + __u32 tce32_start; + __u32 tce32_size; + __u64 pgsizes; /* Bitmap of supported page sizes */ + __u32 max_dynamic_windows_supported; + __u32 max_levels; + struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES]; struct iommu_table_group_ops *ops; }; diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 80ea84d..6939402 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -25,6 +25,7 @@ #include linux/memblock.h #include linux/iommu.h #include linux/mmzone.h +#include linux/sizes.h #include asm/mmzone.h #include asm/sections.h @@ -1236,6 +1237,8 @@ static void pnv_ioda2_tce_free_vm(struct iommu_table *tbl, long 
index, pnv_pci_ioda2_tce_invalidate(tbl, index, npages, false); } +static void pnv_pci_free_table(struct iommu_table *tbl); + static struct iommu_table_ops pnv_ioda2_iommu_ops = { .set = pnv_ioda2_tce_build_vm, #ifdef CONFIG_IOMMU_API @@ -1243,6 +1246,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = { #endif .clear = pnv_ioda2_tce_free_vm, .get = pnv_tce_get, + .free = pnv_pci_free_table, }; static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb, @@ -1325,6 +1329,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb, TCE_PCI_SWINV_PAIR); } tbl-it_ops = pnv_ioda1_iommu_ops; + pe-table_group.tce32_start = tbl-it_offset tbl-it_page_shift; + pe-table_group.tce32_size = tbl-it_size tbl-it_page_shift; iommu_init_table(tbl, phb-hose-node); iommu_register_group(pe-table_group, phb-hose-global_number, pe-pe_number); @@ -1409,7 +1415,7 @@ static __be64 *pnv_alloc_tce_table(int nid, } static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group, - __u32 page_shift, __u64 window_size, __u32 levels, + int num, __u32 page_shift, __u64 window_size, __u32 levels, struct iommu_table *tbl) { struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe, @@ -1422,6 +1428,9 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group, if (!levels || (levels
[PATCH kernel v7 06/31] vfio: powerpc/spapr: Disable DMA mappings on disabled container
At the moment DMA map/unmap requests are handled irrespective to the container's state. This allows the user space to pin memory which it might not be allowed to pin. This adds checks to MAP/UNMAP that the container is enabled, otherwise -EPERM is returned. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- drivers/vfio/vfio_iommu_spapr_tce.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index 9448e39..c137bb3 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -325,6 +325,9 @@ static long tce_iommu_ioctl(void *iommu_data, struct iommu_table *tbl = container-tbl; unsigned long tce; + if (!container-enabled) + return -EPERM; + if (!tbl) return -ENXIO; @@ -369,6 +372,9 @@ static long tce_iommu_ioctl(void *iommu_data, struct vfio_iommu_type1_dma_unmap param; struct iommu_table *tbl = container-tbl; + if (!container-enabled) + return -EPERM; + if (WARN_ON(!tbl)) return -ENXIO; -- 2.0.0 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
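A minimal userspace model of the ordering this patch enforces in tce_iommu_ioctl() — the names here are invented for the sketch, not kernel code: the new enabled check fires first and returns -EPERM, before the pre-existing -ENXIO check for a missing table.

```c
#include <assert.h>
#include <errno.h>

/* Toy model of the MAP_DMA/UNMAP_DMA entry checks after this patch. */
struct toy_container {
	int enabled;	/* set by VFIO_IOMMU_ENABLE after locked_vm accounting */
	int has_table;	/* a group with an iommu_table is attached */
};

static int toy_map_dma(const struct toy_container *c)
{
	if (!c->enabled)
		return -EPERM;	/* the check this patch adds, tried first */
	if (!c->has_table)
		return -ENXIO;	/* the pre-existing check */
	return 0;		/* proceed to build/clear TCEs */
}
```

So a container that has a table attached but has not been enabled (i.e. whose locked memory has not been accounted yet) can no longer pin memory through MAP_DMA.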
[PATCH kernel v7 31/31] vfio: powerpc/spapr: Support Dynamic DMA windows
This adds create/remove window ioctls to create and remove DMA windows. sPAPR defines a Dynamic DMA windows capability which allows para-virtualized guests to create additional DMA windows on a PCI bus. Existing Linux kernels use this new window to map the entire guest memory and switch to direct DMA operations, saving time on map/unmap requests which would normally happen in big amounts. This adds 2 ioctl handlers - VFIO_IOMMU_SPAPR_TCE_CREATE and VFIO_IOMMU_SPAPR_TCE_REMOVE - to create and remove windows. Up to 2 windows are supported now by the hardware and by this driver. This changes the VFIO_IOMMU_SPAPR_TCE_GET_INFO handler to return additional information such as the number of supported windows and the maximum number of TCE table levels. DDW is added as a capability, not as a SPAPR TCE IOMMU v2 unique feature, as we still want to support v2 on platforms which cannot do DDW for the sake of TCE acceleration in KVM (coming soon). Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
Changes:
v7:
* s/VFIO_IOMMU_INFO_DDW/VFIO_IOMMU_SPAPR_INFO_DDW/
* fixed typos in and updated vfio.txt
* fixed the VFIO_IOMMU_SPAPR_TCE_GET_INFO handler
* moved ddw properties to vfio_iommu_spapr_tce_ddw_info

v6:
* added an explicit VFIO_IOMMU_INFO_DDW flag to vfio_iommu_spapr_tce_info; it used to be page mask flags from platform code
* added an explicit pgsizes field
* added cleanup if tce_iommu_create_window() failed in the middle
* added checks for callbacks in tce_iommu_create_window and removed those from tce_iommu_remove_window where it is too late to test anyway
* spapr_tce_find_free_table now returns a sensible error code
* updated the description of VFIO_IOMMU_SPAPR_TCE_CREATE/VFIO_IOMMU_SPAPR_TCE_REMOVE

v4:
* moved code to tce_iommu_create_window()/tce_iommu_remove_window() helpers
* added docs
---
 Documentation/vfio.txt              |  19 
 arch/powerpc/include/asm/iommu.h    |   2 +-
 drivers/vfio/vfio_iommu_spapr_tce.c | 196 +++-
 include/uapi/linux/vfio.h           |  61 ++-
 4 files changed, 273 insertions(+), 5
deletions(-) diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt index 7dcf2b5..8b1ec51 100644 --- a/Documentation/vfio.txt +++ b/Documentation/vfio.txt @@ -452,6 +452,25 @@ address is from pre-registered range. This separation helps in optimizing DMA for guests. +6) sPAPR specification allows guests to have an additional DMA window(s) on +a PCI bus with a variable page size. Two ioctls have been added to support +this: VFIO_IOMMU_SPAPR_TCE_CREATE and VFIO_IOMMU_SPAPR_TCE_REMOVE. +The platform has to support the functionality or error will be returned to +the userspace. The existing hardware supports up to 2 DMA windows, one is +2GB long, uses 4K pages and called default 32bit window; the other can +be as big as entire RAM, use different page size, it is optional - guests +create those in run-time if the guest driver supports 64bit DMA. + +VFIO_IOMMU_SPAPR_TCE_CREATE receives a page shift, a DMA window size and +a number of TCE table levels (if a TCE table is going to be big enough and +the kernel may not be able to allocate enough of physically contiguous memory). +It creates a new window in the available slot and returns the bus address where +the new window starts. Due to hardware limitation, the user space cannot choose +the location of DMA windows. + +VFIO_IOMMU_SPAPR_TCE_REMOVE receives the bus start address of the window +and removes it. 
+ --- [1] VFIO was originally an acronym for Virtual Function I/O in its diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 9027b9e..1db774c0 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -147,7 +147,7 @@ extern void iommu_free_table(struct iommu_table *tbl, const char *node_name); extern struct iommu_table *iommu_init_table(struct iommu_table * tbl, int nid); -#define IOMMU_TABLE_GROUP_MAX_TABLES 1 +#define IOMMU_TABLE_GROUP_MAX_TABLES 2 struct iommu_table_group; diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index 8cbd239..225af37 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -535,6 +535,20 @@ static long tce_iommu_build_v2(struct tce_container *container, return ret; } +static int spapr_tce_find_free_table(struct tce_container *container) +{ + int i; + + for (i = 0; i IOMMU_TABLE_GROUP_MAX_TABLES; ++i) { + struct iommu_table *tbl = container-tables[i]; + + if (!tbl-it_size) + return i; + } + + return -ENOSPC; +} + static long tce_iommu_create_table(struct iommu_table_group *table_group, int num, __u32 page_shift, @@ -573,11 +587,114 @@ static void tce_iommu_free_table(struct
[PATCH kernel v7 07/31] vfio: powerpc/spapr: Moving pinning/unpinning to helpers
This is a pretty mechanical patch to make next patches simpler. New tce_iommu_unuse_page() helper does put_page() now but it might skip that after the memory registering patch applied. As we are here, this removes unnecessary checks for a value returned by pfn_to_page() as it cannot possibly return NULL. This moves tce_iommu_disable() later to let tce_iommu_clear() know if the container has been enabled because if it has not been, then put_page() must not be called on TCEs from the TCE table. This situation is not yet possible but it will after KVM acceleration patchset is applied. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- Changes: v6: * tce_get_hva() returns hva via a pointer --- drivers/vfio/vfio_iommu_spapr_tce.c | 68 +++-- 1 file changed, 50 insertions(+), 18 deletions(-) diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index c137bb3..ec5ee83 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -196,7 +196,6 @@ static void tce_iommu_release(void *iommu_data) struct iommu_table *tbl = container-tbl; WARN_ON(tbl !tbl-it_group); - tce_iommu_disable(container); if (tbl) { tce_iommu_clear(container, tbl, tbl-it_offset, tbl-it_size); @@ -204,63 +203,96 @@ static void tce_iommu_release(void *iommu_data) if (tbl-it_group) tce_iommu_detach_group(iommu_data, tbl-it_group); } + + tce_iommu_disable(container); + mutex_destroy(container-lock); kfree(container); } +static void tce_iommu_unuse_page(struct tce_container *container, + unsigned long oldtce) +{ + struct page *page; + + if (!(oldtce (TCE_PCI_READ | TCE_PCI_WRITE))) + return; + + /* +* VFIO cannot map/unmap when a container is not enabled so +* we would not need this check but KVM could map/unmap and if +* this happened, we must not put pages as KVM does not get them as +* it expects memory pre-registation to do this part. 
+*/ + if (!container-enabled) + return; + + page = pfn_to_page(__pa(oldtce) PAGE_SHIFT); + + if (oldtce TCE_PCI_WRITE) + SetPageDirty(page); + + put_page(page); +} + static int tce_iommu_clear(struct tce_container *container, struct iommu_table *tbl, unsigned long entry, unsigned long pages) { unsigned long oldtce; - struct page *page; for ( ; pages; --pages, ++entry) { oldtce = iommu_clear_tce(tbl, entry); if (!oldtce) continue; - page = pfn_to_page(oldtce PAGE_SHIFT); - WARN_ON(!page); - if (page) { - if (oldtce TCE_PCI_WRITE) - SetPageDirty(page); - put_page(page); - } + tce_iommu_unuse_page(container, (unsigned long) __va(oldtce)); } return 0; } +static int tce_get_hva(unsigned long tce, unsigned long *hva) +{ + struct page *page = NULL; + enum dma_data_direction direction = iommu_tce_direction(tce); + + if (get_user_pages_fast(tce PAGE_MASK, 1, + direction != DMA_TO_DEVICE, page) != 1) + return -EFAULT; + + *hva = (unsigned long) page_address(page); + + return 0; +} + static long tce_iommu_build(struct tce_container *container, struct iommu_table *tbl, unsigned long entry, unsigned long tce, unsigned long pages) { long i, ret = 0; - struct page *page = NULL; + struct page *page; unsigned long hva; enum dma_data_direction direction = iommu_tce_direction(tce); for (i = 0; i pages; ++i) { - ret = get_user_pages_fast(tce PAGE_MASK, 1, - direction != DMA_TO_DEVICE, page); - if (unlikely(ret != 1)) { - ret = -EFAULT; + ret = tce_get_hva(tce, hva); + if (ret) break; - } + page = pfn_to_page(__pa(hva) PAGE_SHIFT); if (!tce_page_is_contained(page, tbl-it_page_shift)) { ret = -EPERM; break; } - hva = (unsigned long) page_address(page) + - (tce IOMMU_PAGE_MASK(tbl) ~PAGE_MASK); + /* Preserve offset within IOMMU page */ + hva |= tce IOMMU_PAGE_MASK(tbl) ~PAGE_MASK; ret = iommu_tce_build(tbl, entry + i, hva, direction); if (ret) { - put_page(page); + tce_iommu_unuse_page(container, hva); pr_err(iommu_tce: %s failed
[PATCH kernel v7 05/31] vfio: powerpc/spapr: Move locked_vm accounting to helpers
This moves locked pages accounting to helpers. Later they will be reused for Dynamic DMA windows (DDW). This reworks debug messages to show the current value and the limit. This stores the locked pages number in the container so the iommu table pointer won't be needed when unlocking. This does not have an effect now but it will with multiple tables per container, as then we will allow attaching/detaching groups on the fly and we may end up having a container with no group attached but with the counter incremented. While we are here, update the comment explaining why RLIMIT_MEMLOCK might be required to be bigger than the guest RAM. This also prints the pid of the current process in pr_warn/pr_debug. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
Changes:
v4:
* new helpers do nothing if @npages == 0
* tce_iommu_disable() now can decrement the counter if the group was detached (not possible now but will be in the future)
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 82 -
 1 file changed, 63 insertions(+), 19 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 8bbee22..9448e39 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -29,6 +29,51 @@ static void tce_iommu_detach_group(void *iommu_data,
 		struct iommu_group *iommu_group);
 
+static long try_increment_locked_vm(long npages)
+{
+	long ret = 0, locked, lock_limit;
+
+	if (!current || !current->mm)
+		return -ESRCH; /* process exited */
+
+	if (!npages)
+		return 0;
+
+	down_write(&current->mm->mmap_sem);
+	locked = current->mm->locked_vm + npages;
+	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	if (locked > lock_limit && !capable(CAP_IPC_LOCK))
+		ret = -ENOMEM;
+	else
+		current->mm->locked_vm += npages;
+
+	pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid,
+			npages << PAGE_SHIFT,
+			current->mm->locked_vm << PAGE_SHIFT,
+			rlimit(RLIMIT_MEMLOCK),
+			ret ? " - exceeded" : "");
+
+	up_write(&current->mm->mmap_sem);
+
+	return ret;
+}
+
+static void decrement_locked_vm(long npages)
+{
+	if (!current || !current->mm || !npages)
+		return; /* process exited */
+
+	down_write(&current->mm->mmap_sem);
+	if (npages > current->mm->locked_vm)
+		npages = current->mm->locked_vm;
+	current->mm->locked_vm -= npages;
+	pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%ld\n", current->pid,
+			npages << PAGE_SHIFT,
+			current->mm->locked_vm << PAGE_SHIFT,
+			rlimit(RLIMIT_MEMLOCK));
+	up_write(&current->mm->mmap_sem);
+}
+
 /*
  * VFIO IOMMU fd for SPAPR_TCE IOMMU implementation
  *
@@ -45,6 +90,7 @@ struct tce_container {
 	struct mutex lock;
 	struct iommu_table *tbl;
 	bool enabled;
+	unsigned long locked_pages;
 };
 
 static bool tce_page_is_contained(struct page *page, unsigned page_shift)
@@ -60,7 +106,7 @@ static bool tce_page_is_contained(struct page *page, unsigned page_shift)
 static int tce_iommu_enable(struct tce_container *container)
 {
 	int ret = 0;
-	unsigned long locked, lock_limit, npages;
+	unsigned long locked;
 	struct iommu_table *tbl = container->tbl;
 
 	if (!container->tbl)
@@ -89,21 +135,22 @@ static int tce_iommu_enable(struct tce_container *container)
 	 * Also we don't have a nice way to fail on H_PUT_TCE due to ulimits,
 	 * that would effectively kill the guest at random points, much better
 	 * enforcing the limit based on the max that the guest can map.
+	 *
+	 * Unfortunately at the moment it counts whole tables, no matter how
+	 * much memory the guest has. I.e. for 4GB guest and 4 IOMMU groups
+	 * each with 2GB DMA window, 8GB will be counted here. The reason for
+	 * this is that we cannot tell here the amount of RAM used by the guest
+	 * as this information is only available from KVM and VFIO is
+	 * KVM agnostic.
*/ - down_write(current-mm-mmap_sem); - npages = (tbl-it_size tbl-it_page_shift) PAGE_SHIFT; - locked = current-mm-locked_vm + npages; - lock_limit = rlimit(RLIMIT_MEMLOCK) PAGE_SHIFT; - if (locked lock_limit !capable(CAP_IPC_LOCK)) { - pr_warn(RLIMIT_MEMLOCK (%ld) exceeded\n, - rlimit(RLIMIT_MEMLOCK)); - ret = -ENOMEM; - } else { + locked = (tbl-it_size tbl-it_page_shift) PAGE_SHIFT; + ret = try_increment_locked_vm(locked); + if (ret) + return ret; - current-mm-locked_vm += npages; - container-enabled = true; - } - up_write(current-mm-mmap_sem); + container-locked_pages = locked; + +
[PATCH kernel v7 28/31] powerpc/mmu: Add userspace-to-physical addresses translation cache
We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory and in-kernel acceleration of DMA requests. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage a mapped pages counter; when it is zero, it is safe to unregister the region. Multiple registration of the same region is allowed, kref is used to track the number of registrations. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- arch/powerpc/include/asm/mmu-hash64.h | 3 + arch/powerpc/include/asm/mmu_context.h | 16 +++ arch/powerpc/mm/Makefile | 1 + arch/powerpc/mm/mmu_context_hash64.c | 6 + arch/powerpc/mm/mmu_context_hash64_iommu.c | 215 + 5 files changed, 241 insertions(+) create mode 100644 arch/powerpc/mm/mmu_context_hash64_iommu.c diff --git a/arch/powerpc/include/asm/mmu-hash64.h b/arch/powerpc/include/asm/mmu-hash64.h index 4f13c3e..83214c4 100644 --- a/arch/powerpc/include/asm/mmu-hash64.h +++ b/arch/powerpc/include/asm/mmu-hash64.h @@ -535,6 +535,9 @@ typedef struct { /* for 4K PTE fragment support */ void *pte_frag; #endif +#ifdef CONFIG_SPAPR_TCE_IOMMU + struct list_head iommu_group_mem_list; +#endif } mm_context_t; diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h index 73382eb..3461c91 100644 --- a/arch/powerpc/include/asm/mmu_context.h +++ b/arch/powerpc/include/asm/mmu_context.h @@ -16,6 +16,22 @@ */ extern int init_new_context(struct task_struct *tsk, struct 
mm_struct *mm); extern void destroy_context(struct mm_struct *mm); +#ifdef CONFIG_SPAPR_TCE_IOMMU +typedef struct mm_iommu_table_group_mem_t mm_iommu_table_group_mem_t; + +extern bool mm_iommu_preregistered(void); +extern long mm_iommu_alloc(unsigned long ua, unsigned long entries, + mm_iommu_table_group_mem_t **pmem); +extern mm_iommu_table_group_mem_t *mm_iommu_get(unsigned long ua, + unsigned long entries); +extern long mm_iommu_put(mm_iommu_table_group_mem_t *mem); +extern void mm_iommu_cleanup(mm_context_t *ctx); +extern mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua, + unsigned long size); +extern long mm_iommu_ua_to_hpa(mm_iommu_table_group_mem_t *mem, + unsigned long ua, unsigned long *hpa); +extern long mm_iommu_mapped_update(mm_iommu_table_group_mem_t *mem, bool inc); +#endif extern void switch_mmu_context(struct mm_struct *prev, struct mm_struct *next); extern void switch_slb(struct task_struct *tsk, struct mm_struct *mm); diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile index 438dcd3..49fbfc7 100644 --- a/arch/powerpc/mm/Makefile +++ b/arch/powerpc/mm/Makefile @@ -35,3 +35,4 @@ obj-$(CONFIG_PPC_SUBPAGE_PROT)+= subpage-prot.o obj-$(CONFIG_NOT_COHERENT_CACHE) += dma-noncoherent.o obj-$(CONFIG_HIGHMEM) += highmem.o obj-$(CONFIG_PPC_COPRO_BASE) += copro_fault.o +obj-$(CONFIG_SPAPR_TCE_IOMMU) += mmu_context_hash64_iommu.o diff --git a/arch/powerpc/mm/mmu_context_hash64.c b/arch/powerpc/mm/mmu_context_hash64.c index 178876ae..eb3080c 100644 --- a/arch/powerpc/mm/mmu_context_hash64.c +++ b/arch/powerpc/mm/mmu_context_hash64.c @@ -89,6 +89,9 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm) #ifdef CONFIG_PPC_64K_PAGES mm-context.pte_frag = NULL; #endif +#ifdef CONFIG_SPAPR_TCE_IOMMU + INIT_LIST_HEAD_RCU(mm-context.iommu_group_mem_list); +#endif return 0; } @@ -132,6 +135,9 @@ static inline void destroy_pagetable_page(struct mm_struct *mm) void destroy_context(struct mm_struct *mm) { +#ifdef 
CONFIG_SPAPR_TCE_IOMMU + mm_iommu_cleanup(mm-context); +#endif #ifdef CONFIG_PPC_ICSWX drop_cop(mm-context.acop, mm); diff --git a/arch/powerpc/mm/mmu_context_hash64_iommu.c b/arch/powerpc/mm/mmu_context_hash64_iommu.c new file mode 100644 index 000..c268c4d --- /dev/null +++ b/arch/powerpc/mm/mmu_context_hash64_iommu.c @@ -0,0 +1,215 @@ +/* + * IOMMU helpers in MMU context. + * + * Copyright (C) 2015 IBM Corp. a...@ozlabs.ru + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + * + */ + +#include linux/sched.h +#include linux/slab.h +#include
[PATCH kernel v7 27/31] powerpc/iommu/ioda2: Add get_table_size() to calculate the size of a future table
This adds a way for the IOMMU user to know how much a new table will use so it can be accounted in the locked_vm limit before allocation happens. This stores the allocated table size in pnv_pci_ioda2_create_table() so the locked_vm counter can be updated correctly when a table is being disposed. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- arch/powerpc/include/asm/iommu.h | 5 +++ arch/powerpc/platforms/powernv/pci-ioda.c | 54 +++ 2 files changed, 59 insertions(+) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index a768a4d..9027b9e 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -94,6 +94,7 @@ struct iommu_table { unsigned long it_size; /* Size of iommu table in entries */ unsigned long it_indirect_levels; unsigned long it_level_size; + unsigned long it_allocated_size; unsigned long it_offset;/* Offset into global table */ unsigned long it_base; /* mapped address of tce table */ unsigned long it_index; /* which iommu table this is */ @@ -159,6 +160,10 @@ struct iommu_table_group_ops { void (*set_ownership)(struct iommu_table_group *table_group, bool enable); + unsigned long (*get_table_size)( + __u32 page_shift, + __u64 window_size, + __u32 levels); long (*create_table)(struct iommu_table_group *table_group, int num, __u32 page_shift, diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 036f3c1..e3ee87d 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1373,6 +1373,57 @@ static void pnv_free_tce_table(unsigned long addr, unsigned size, free_pages(addr, get_order(size 3)); } +static unsigned long pnv_get_tce_table_size(unsigned shift, unsigned levels, + unsigned long *left) +{ + unsigned long ret, chunk = 1UL shift, i; + + ret = chunk; + + if (!*left) + return 0; + + --levels; + if (!levels) { + /* This is last level, actual TCEs */ + *left -= min(*left, chunk); + return 
chunk; + } + + for (i = 0; i (chunk 3); ++i) { + ret += pnv_get_tce_table_size(shift, levels, left); + if (!*left) + break; + } + + return ret; +} + +static unsigned long pnv_ioda2_get_table_size(__u32 page_shift, __u64 window_size, + __u32 levels) +{ + unsigned long tce_table_size, shift, ret; + + if (!levels || (levels POWERNV_IOMMU_MAX_LEVELS)) + return -EINVAL; + + if ((window_size memory_hotplug_max()) || !is_power_of_2(window_size)) + return -EINVAL; + + tce_table_size = (window_size page_shift) * 8; + tce_table_size = max(0x1000UL, tce_table_size); + + /* Allocate TCE table */ + shift = ROUND_UP(ilog2(window_size) - page_shift, levels) / levels; + shift += 3; + shift = max_t(unsigned, shift, IOMMU_PAGE_SHIFT_4K); + + ret = tce_table_size; /* tbl-it_userspace */ + ret += pnv_get_tce_table_size(shift, levels, tce_table_size); + + return ret; +} + static __be64 *pnv_alloc_tce_table(int nid, unsigned shift, unsigned levels, unsigned long *left) { @@ -1452,6 +1503,8 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group, return -ENOMEM; tbl-it_indirect_levels = levels - 1; + tbl-it_allocated_size = pnv_ioda2_get_table_size(page_shift, + window_size, levels); /* Setup linux iommu table */ pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, @@ -1679,6 +1732,7 @@ static long pnv_pci_ioda2_create_table_with_uas( static struct iommu_table_group_ops pnv_pci_ioda2_ops = { .set_ownership = pnv_ioda2_set_ownership, + .get_table_size = pnv_ioda2_get_table_size, .create_table = pnv_pci_ioda2_create_table_with_uas, .set_window = pnv_pci_ioda2_set_window, .unset_window = pnv_pci_ioda2_unset_window, -- 2.0.0 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
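The recursive size estimate above is hard to follow in flattened diff form. Here is a user-space model of the same walk (a sketch: it assumes, as the patch does, that an intermediate level stores 8-byte pointers and only the last level holds actual TCEs; the function name mirrors the kernel's but this is not the kernel code):

```c
#include <assert.h>

/* Model of pnv_get_tce_table_size()'s recursion. Each level is a chunk
 * of (1UL << shift) bytes; an intermediate level holds (chunk >> 3)
 * pointers to the next level; the last level holds the actual TCEs and
 * consumes up to "chunk" bytes of the remaining need tracked in *left. */
static unsigned long table_size(unsigned shift, unsigned levels,
				unsigned long *left)
{
	unsigned long ret, chunk = 1UL << shift, i;

	ret = chunk;

	if (!*left)
		return 0;

	--levels;
	if (!levels) {
		/* This is the last level, actual TCEs */
		*left -= (*left < chunk) ? *left : chunk;
		return chunk;
	}

	for (i = 0; i < (chunk >> 3); ++i) {
		ret += table_size(shift, levels, left);
		if (!*left)
			break;
	}

	return ret;
}
```

With a single level the estimate is simply one chunk; with two levels it adds one directory chunk on top of each leaf chunk actually needed, which is why the accounted size grows slightly faster than the raw TCE count.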
Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux
On 27.02.2015 03:05, Scott Wood wrote: On Thu, 2015-02-26 at 14:31 +0100, Sebastian Andrzej Siewior wrote: On 02/26/2015 02:02 PM, Paolo Bonzini wrote: On 24/02/2015 00:27, Scott Wood wrote: This isn't a host PIC driver. It's guest PIC emulation, some of which is indeed not suitable for a rawlock (in particular, openpic_update_irq which loops on the number of vcpus, with a loop body that calls IRQ_check() which loops over all pending IRQs). The question is what behavior is wanted of code that isn't quite RT-ready. What is preferred, bugs or bad latency? If the answer is bad latency (which can be avoided simply by not running KVM on a RT kernel in production), patch 1 can be applied. If the can be applied *but* makes no difference if applied or not. answer is bugs, patch 1 is not upstream material. I myself prefer to have bad latency; if something takes a spinlock in atomic context, that spinlock should be raw. If it hurts (latency), don't do it (use the affected code). The problem, that is fixed by this s/spin_lock/raw_spin_lock/, exists only in -RT. There is no change upstream. In general we fix such things in -RT first and forward the patches upstream if possible. This convert thingy would be possible. Bug fixing comes before latency no matter if RT or not. Converting every lock into a rawlock is not always the answer. Last thing I read from Scott is that he is not entirely sure if this is the right approach or not and patch #1 was not acked-by him either. So for now I wait for Scott's feedback and maybe a backtrace :) Obviously leaving it in a buggy state is not what we want -- but I lean towards a short term fix of putting depends on !PREEMPT_RT on the in-kernel MPIC emulation (which is itself just an optimization -- you can still use KVM without it). This way people don't enable it with RT without being aware of the issue, and there's more of an incentive to fix it properly. I'll let Bogdan supply the backtrace. So about the backtrace. 
Wasn't really sure how to catch this, so what I did was to start a 24 VCPU guest on a 24 CPU board, and in the guest run 24 netperf flows with an external back-to-back board of the same kind. I assumed this would provide sufficient VCPU and external interrupt load to expose an alleged culprit. With regards to measuring the latency, I thought of using ftrace, specifically the preemptirqsoff latency histogram. Unfortunately, I wasn't able to capture any major differences between running a guest with in-kernel MPIC emulation (with the openpic raw_spinlock conversion applied) vs. no in-kernel MPIC emulation. Function profiling (trace_stat) shows that in the second case there's a far greater time spent in kvm_handle_exit (100x), but overall, the maximum latencies for preemptirqsoff don't look that much different. Here are the max numbers (preemptirqsoff) for the 24 CPUs, on the host RT Linux, sorted in descending order, expressed in microseconds:

In-kernel MPIC    QEMU MPIC
3975              5105
2079              3972
1303              3557
1106              1725
 447               907
 423               853
 362               723
 343               182
 260               121
 133               116
 131               116
 118               115
 116               114
 114               114
 114               114
 114                99
 113                99
 103                98
  98                98
  95                97
  87                96
  83                83
  83                82
  80                81

I'm not sure if this captures openpic behavior or just scheduler behavior. Anyways, I'm pro adding the openpic raw_spinlock conversion along with disabling the in-kernel MPIC emulation for upstream. But just wanted to catch up with this last request from a while ago. Do you think it would be better to just submit the new patch or should I do some further testing? Do you have any suggestions regarding what else I should look at / how to test? Thank you, Bogdan P. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH kernel v7 26/31] powerpc/iommu: Add userspace view of TCE table
In order to support memory pre-registration, we need a way to track the use of every registered memory region and only allow unregistration if a region is not in use anymore. So we need a way to tell from what region the just cleared TCE was from. This adds a userspace view of the TCE table into iommu_table struct. It contains userspace address, one per TCE entry. The table is only allocated when the ownership over an IOMMU group is taken which means it is only used from outside of the powernv code (such as VFIO). Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- arch/powerpc/include/asm/iommu.h | 6 ++ arch/powerpc/kernel/iommu.c | 7 +++ arch/powerpc/platforms/powernv/pci-ioda.c | 23 ++- 3 files changed, 35 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 2c08c91..a768a4d 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -106,9 +106,15 @@ struct iommu_table { unsigned long *it_map; /* A simple allocation bitmap for now */ unsigned long it_page_shift;/* table iommu page size */ struct iommu_table_group *it_group; + unsigned long *it_userspace; /* userspace view of the table */ struct iommu_table_ops *it_ops; }; +#define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \ + ((tbl)-it_userspace ? 
\ + ((tbl)-it_userspace[(entry) - (tbl)-it_offset]) : \ + NULL) + /* Pure 2^n version of get_order */ static inline __attribute_const__ int get_iommu_order(unsigned long size, struct iommu_table *tbl) diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 0bcd988..82102d1 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -38,6 +38,7 @@ #include linux/pci.h #include linux/iommu.h #include linux/sched.h +#include linux/vmalloc.h #include asm/io.h #include asm/prom.h #include asm/iommu.h @@ -1069,6 +1070,9 @@ static int iommu_table_take_ownership(struct iommu_table *tbl) spin_unlock(tbl-pools[i].lock); spin_unlock_irqrestore(tbl-large_pool.lock, flags); + BUG_ON(tbl-it_userspace); + tbl-it_userspace = vzalloc(sizeof(*tbl-it_userspace) * tbl-it_size); + return 0; } @@ -1102,6 +1106,9 @@ static void iommu_table_release_ownership(struct iommu_table *tbl) { unsigned long flags, i, sz = (tbl-it_size + 7) 3; + vfree(tbl-it_userspace); + tbl-it_userspace = NULL; + spin_lock_irqsave(tbl-large_pool.lock, flags); for (i = 0; i tbl-nr_pools; i++) spin_lock(tbl-pools[i].lock); diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index bc36cf1..036f3c1 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -26,6 +26,7 @@ #include linux/iommu.h #include linux/mmzone.h #include linux/sizes.h +#include linux/vmalloc.h #include asm/mmzone.h #include asm/sections.h @@ -1469,6 +1470,9 @@ static void pnv_pci_free_table(struct iommu_table *tbl) if (!tbl-it_size) return; + if (tbl-it_userspace) + vfree(tbl-it_userspace); + pnv_free_tce_table(tbl-it_base, size, tbl-it_indirect_levels); iommu_reset_table(tbl, ioda2); } @@ -1656,9 +1660,26 @@ static void pnv_ioda2_set_ownership(struct iommu_table_group *table_group, pnv_pci_ioda2_set_bypass(pe, !enable); } +static long pnv_pci_ioda2_create_table_with_uas( + struct iommu_table_group *table_group, + 
int num, __u32 page_shift, __u64 window_size, __u32 levels, + struct iommu_table *tbl) +{ + long ret = pnv_pci_ioda2_create_table(table_group, num, + page_shift, window_size, levels, tbl); + + if (ret) + return ret; + + BUG_ON(tbl-it_userspace); + tbl-it_userspace = vzalloc(sizeof(*tbl-it_userspace) * tbl-it_size); + + return 0; +} + static struct iommu_table_group_ops pnv_pci_ioda2_ops = { .set_ownership = pnv_ioda2_set_ownership, - .create_table = pnv_pci_ioda2_create_table, + .create_table = pnv_pci_ioda2_create_table_with_uas, .set_window = pnv_pci_ioda2_set_window, .unset_window = pnv_pci_ioda2_unset_window, }; -- 2.0.0 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
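For readers skimming the diff, a minimal user-space model of the `IOMMU_TABLE_USERSPACE_ENTRY` lookup added above (struct trimmed to the two fields the macro touches; the `&` taking the slot's address is restored here on the assumption that the macro yields a writable slot, as the callers in the series use it):

```c
#include <assert.h>
#include <stddef.h>

/* Model of the it_userspace lookup: resolve a global TCE entry number
 * to the slot caching the userspace address it was mapped from, or
 * NULL when no userspace view has been allocated for this table. */
struct iommu_table {
	unsigned long it_offset;	/* first entry number of this table */
	unsigned long *it_userspace;	/* userspace view, one per entry */
};

#define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
	((tbl)->it_userspace ? \
	 &((tbl)->it_userspace[(entry) - (tbl)->it_offset]) : \
	 NULL)
```

The NULL branch is what makes the view optional: the table only pays for the extra array once an external user (VFIO) has taken ownership.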
[PATCH kernel v7 21/31] powerpc/iommu: Split iommu_free_table into 2 helpers
The iommu_free_table helper releases the memory it is using (the TCE table and @it_map) and releases the iommu_table struct as well. We might not want the very last step as we store iommu_table in parent structures. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- arch/powerpc/include/asm/iommu.h | 1 + arch/powerpc/kernel/iommu.c | 57 2 files changed, 35 insertions(+), 23 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index bde7ee7..8ed4648 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -127,6 +127,7 @@ static inline void *get_iommu_table_base(struct device *dev) extern struct iommu_table *iommu_table_alloc(int node); /* Frees table for an individual device node */ +extern void iommu_reset_table(struct iommu_table *tbl, const char *node_name); extern void iommu_free_table(struct iommu_table *tbl, const char *node_name); /* Initializes an iommu_table based in values set in the passed-in diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 501e8ee..0bcd988 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -721,24 +721,46 @@ struct iommu_table *iommu_table_alloc(int node) return table_group->tables[0]; } +void iommu_reset_table(struct iommu_table *tbl, const char *node_name) +{ + if (!tbl) + return; + + if (tbl->it_map) { + unsigned long bitmap_sz; + unsigned int order; + + /* +* In case we have reserved the first bit, we should not emit +* the warning below.
+*/ + if (tbl-it_offset == 0) + clear_bit(0, tbl-it_map); + + /* verify that table contains no entries */ + if (!bitmap_empty(tbl-it_map, tbl-it_size)) + pr_warn(%s: Unexpected TCEs for %s\n, __func__, + node_name); + + /* calculate bitmap size in bytes */ + bitmap_sz = BITS_TO_LONGS(tbl-it_size) * sizeof(unsigned long); + + /* free bitmap */ + order = get_order(bitmap_sz); + free_pages((unsigned long) tbl-it_map, order); + } + + memset(tbl, 0, sizeof(*tbl)); +} + void iommu_free_table(struct iommu_table *tbl, const char *node_name) { - unsigned long bitmap_sz; - unsigned int order; struct iommu_table_group *table_group = tbl-it_group; - if (!tbl || !tbl-it_map) { - printk(KERN_ERR %s: expected TCE map for %s\n, __func__, - node_name); + if (!tbl) return; - } - /* -* In case we have reserved the first bit, we should not emit -* the warning below. -*/ - if (tbl-it_offset == 0) - clear_bit(0, tbl-it_map); + iommu_reset_table(tbl, node_name); #ifdef CONFIG_IOMMU_API if (table_group-group) { @@ -747,17 +769,6 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name) } #endif - /* verify that table contains no entries */ - if (!bitmap_empty(tbl-it_map, tbl-it_size)) - pr_warn(%s: Unexpected TCEs for %s\n, __func__, node_name); - - /* calculate bitmap size in bytes */ - bitmap_sz = BITS_TO_LONGS(tbl-it_size) * sizeof(unsigned long); - - /* free bitmap */ - order = get_order(bitmap_sz); - free_pages((unsigned long) tbl-it_map, order); - /* free table */ kfree(table_group); } -- 2.0.0 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
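The shape of the split is easier to see outside diff context. A minimal sketch of the reset/free pattern (user-space stand-ins for the kernel allocators; names mirror the patch but the bodies are illustrative):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct iommu_table {
	unsigned long it_size;
	unsigned long *it_map;
};

/* Release everything the table owns and zero the struct, but leave the
 * struct itself alone so it can stay embedded in a parent structure. */
static void iommu_reset_table(struct iommu_table *tbl)
{
	if (!tbl)
		return;
	free(tbl->it_map);
	memset(tbl, 0, sizeof(*tbl));
}

/* Full teardown, for tables that were allocated standalone. */
static void iommu_free_table(struct iommu_table *tbl)
{
	if (!tbl)
		return;
	iommu_reset_table(tbl);
	free(tbl);
}
```

The point of the split is exactly the commit message's "very last step": callers that embed the table call only the reset half.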
[PATCH v3 1/3] powerpc/powernv: convert codes returned by OPAL calls
OPAL has its own list of return codes. The patch provides a translation of such codes in errnos for the opal_sensor_read call, and possibly others if needed. Signed-off-by: Cédric Le Goater c...@fr.ibm.com --- Changes since v2 : - renamed and moved the routine to opal.[ch] - changed default value to ERANGE like rtas arch/powerpc/include/asm/opal.h |2 ++ arch/powerpc/platforms/powernv/opal-sensor.c |6 -- arch/powerpc/platforms/powernv/opal.c| 17 + 3 files changed, 23 insertions(+), 2 deletions(-) Index: linux.git/arch/powerpc/platforms/powernv/opal-sensor.c === --- linux.git.orig/arch/powerpc/platforms/powernv/opal-sensor.c +++ linux.git/arch/powerpc/platforms/powernv/opal-sensor.c @@ -46,8 +46,10 @@ int opal_get_sensor_data(u32 sensor_hndl mutex_lock(opal_sensor_mutex); ret = opal_sensor_read(sensor_hndl, token, data); - if (ret != OPAL_ASYNC_COMPLETION) + if (ret != OPAL_ASYNC_COMPLETION) { + ret = opal_error_code(ret); goto out_token; + } ret = opal_async_wait_response(token, msg); if (ret) { @@ -57,7 +59,7 @@ int opal_get_sensor_data(u32 sensor_hndl } *sensor_data = be32_to_cpu(data); - ret = be64_to_cpu(msg.params[1]); + ret = opal_error_code(be64_to_cpu(msg.params[1])); out_token: mutex_unlock(opal_sensor_mutex); Index: linux.git/arch/powerpc/include/asm/opal.h === --- linux.git.orig/arch/powerpc/include/asm/opal.h +++ linux.git/arch/powerpc/include/asm/opal.h @@ -983,6 +983,8 @@ struct opal_sg_list *opal_vmalloc_to_sg_ unsigned long vmalloc_size); void opal_free_sg_list(struct opal_sg_list *sg); +extern int opal_error_code(int rc); + /* * Dump region ID range usable by the OS */ Index: linux.git/arch/powerpc/platforms/powernv/opal.c === --- linux.git.orig/arch/powerpc/platforms/powernv/opal.c +++ linux.git/arch/powerpc/platforms/powernv/opal.c @@ -894,6 +894,23 @@ void opal_free_sg_list(struct opal_sg_li } } +int opal_error_code(int rc) +{ + switch (rc) { + case OPAL_SUCCESS: return 0; + case OPAL_PARAMETER:return -EINVAL; + case OPAL_UNSUPPORTED: return 
-ENOSYS; + case OPAL_ASYNC_COMPLETION: return -EAGAIN; + case OPAL_BUSY_EVENT: return -EBUSY; + case OPAL_NO_MEM: return -ENOMEM; + case OPAL_HARDWARE: return -ENOENT; + case OPAL_INTERNAL_ERROR: return -EIO; + default: + pr_err(%s: unexpected OPAL error %d\n, __func__, rc); + return -ERANGE; + } +} + EXPORT_SYMBOL_GPL(opal_poll_events); EXPORT_SYMBOL_GPL(opal_rtc_read); EXPORT_SYMBOL_GPL(opal_rtc_write); ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
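The translation table reads more naturally outside the diff. A compilable user-space model of opal_error_code() (note: the OPAL_* numeric values below are illustrative stand-ins, not copied from the firmware headers; only the mapping shape matches the patch):

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Illustrative OPAL return codes -- values are placeholders. */
enum {
	OPAL_SUCCESS		= 0,
	OPAL_PARAMETER		= -1,
	OPAL_BUSY_EVENT		= -12,
	OPAL_ASYNC_COMPLETION	= -15,
};

/* Translate an OPAL return code into a negative errno, as the patch
 * does for opal_sensor_read() and possibly other callers. */
static int opal_error_code(int rc)
{
	switch (rc) {
	case OPAL_SUCCESS:		return 0;
	case OPAL_PARAMETER:		return -EINVAL;
	case OPAL_ASYNC_COMPLETION:	return -EAGAIN;
	case OPAL_BUSY_EVENT:		return -EBUSY;
	default:
		fprintf(stderr, "%s: unexpected OPAL error %d\n",
			__func__, rc);
		return -ERANGE;	/* same fallback the patch borrows from rtas */
	}
}
```

The centralised switch means drivers see only errnos; a new OPAL code needs exactly one new case here rather than ad-hoc translation at every call site.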
[PATCH kernel v7 23/31] powerpc/powernv: Change prototypes to receive iommu
This changes few functions to receive a iommu_table_group pointer rather than PE as they are going to be a part of upcoming iommu_table_group_ops callback set. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- arch/powerpc/platforms/powernv/pci-ioda.c | 13 - 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 74e119c..80ea84d 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1408,10 +1408,12 @@ static __be64 *pnv_alloc_tce_table(int nid, return addr; } -static long pnv_pci_ioda2_create_table(struct pnv_ioda_pe *pe, +static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group, __u32 page_shift, __u64 window_size, __u32 levels, struct iommu_table *tbl) { + struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe, + table_group); int nid = pe-phb-hose-node; void *addr; unsigned long tce_table_size, left; @@ -1462,9 +1464,11 @@ static void pnv_pci_free_table(struct iommu_table *tbl) iommu_reset_table(tbl, ioda2); } -static long pnv_pci_ioda2_set_window(struct pnv_ioda_pe *pe, +static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group, struct iommu_table *tbl) { + struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe, + table_group); struct pnv_phb *phb = pe-phb; const __be64 *swinvp; int64_t rc; @@ -1599,12 +1603,11 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, /* The PE will reserve all possible 32-bits space */ pe-tce32_seg = 0; - end = (1 ilog2(phb-ioda.m32_pci_base)); pe_info(pe, Setting up 32-bit TCE table at 0..%08x\n, end); - rc = pnv_pci_ioda2_create_table(pe, IOMMU_PAGE_SHIFT_4K, + rc = pnv_pci_ioda2_create_table(pe-table_group, IOMMU_PAGE_SHIFT_4K, phb-ioda.m32_pci_base, POWERNV_IOMMU_DEFAULT_LEVELS, tbl); if (rc) { @@ -1619,7 +1622,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, pe-table_group.ops = 
pnv_pci_ioda2_ops; #endif - rc = pnv_pci_ioda2_set_window(pe, tbl); + rc = pnv_pci_ioda2_set_window(pe-table_group, tbl); if (rc) { pe_err(pe, Failed to configure 32-bit TCE table, err %ld\n, rc); -- 2.0.0 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH kernel v7 17/31] powerpc/iommu/powernv: Release replaced TCE
At the moment, writing a new TCE value to the IOMMU table fails with EBUSY if there is a valid entry already. However, the PAPR specification allows the guest to write a new TCE value without clearing the old one first. Another problem this patch addresses is the use of pool locks for external IOMMU users such as VFIO. The pool locks protect the DMA page allocator rather than the entries, and since the host kernel does not control what pages are in use, there is no point in pool locks; exchange()+put_page(oldtce) is sufficient to avoid possible races. This adds an exchange() callback to iommu_table_ops which does the same thing as set() except that it also returns the replaced TCE and DMA direction, so the caller can release the pages afterwards. The returned old TCE value is a virtual address, as is the new TCE value; this is different from tce_clear(), which returns a physical address. This implements exchange() for P5IOC2/IODA/IODA2 and adds a requirement for a platform to have exchange() implemented in order to support VFIO. This replaces iommu_tce_build() and iommu_clear_tce() with a single iommu_tce_xchg(). This makes sure that TCE permission bits are not set in the TCE passed to the IOMMU API, as those are to be calculated by platform code from the DMA direction. This moves SetPageDirty() to the IOMMU code to make it work for both the VFIO ioctl interface and in-kernel TCE acceleration (when it becomes available later).
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- arch/powerpc/include/asm/iommu.h| 17 ++-- arch/powerpc/kernel/iommu.c | 53 +--- arch/powerpc/platforms/powernv/pci-ioda.c | 38 ++ arch/powerpc/platforms/powernv/pci-p5ioc2.c | 3 ++ arch/powerpc/platforms/powernv/pci.c| 17 arch/powerpc/platforms/powernv/pci.h| 2 + drivers/vfio/vfio_iommu_spapr_tce.c | 62 ++--- 7 files changed, 130 insertions(+), 62 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index d1f8c6c..bde7ee7 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -44,11 +44,22 @@ extern int iommu_is_off; extern int iommu_force_on; struct iommu_table_ops { + /* When called with direction==DMA_NONE, it is equal to clear() */ int (*set)(struct iommu_table *tbl, long index, long npages, unsigned long uaddr, enum dma_data_direction direction, struct dma_attrs *attrs); +#ifdef CONFIG_IOMMU_API + /* +* Exchanges existing TCE with new TCE plus direction bits; +* returns old TCE and DMA direction mask +*/ + int (*exchange)(struct iommu_table *tbl, + long index, + unsigned long *tce, + enum dma_data_direction *direction); +#endif void (*clear)(struct iommu_table *tbl, long index, long npages); unsigned long (*get)(struct iommu_table *tbl, long index); @@ -152,6 +163,8 @@ extern void iommu_register_group(struct iommu_table_group *table_group, extern int iommu_add_device(struct device *dev); extern void iommu_del_device(struct device *dev); extern int __init tce_iommu_bus_notifier_init(void); +extern long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry, + unsigned long *tce, enum dma_data_direction *direction); #else static inline void iommu_register_group(struct iommu_table_group *table_group, int pci_domain_number, @@ -231,10 +244,6 @@ extern int iommu_tce_clear_param_check(struct iommu_table *tbl, unsigned long npages); extern int iommu_tce_put_param_check(struct iommu_table *tbl, unsigned long ioba, unsigned long tce); 
-extern int iommu_tce_build(struct iommu_table *tbl, unsigned long entry, - unsigned long hwaddr, enum dma_data_direction direction); -extern unsigned long iommu_clear_tce(struct iommu_table *tbl, - unsigned long entry); extern void iommu_flush_tce(struct iommu_table *tbl); extern int iommu_take_ownership(struct iommu_table_group *table_group); diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 068fe4ff..501e8ee 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -982,9 +982,6 @@ EXPORT_SYMBOL_GPL(iommu_tce_clear_param_check); int iommu_tce_put_param_check(struct iommu_table *tbl, unsigned long ioba, unsigned long tce) { - if (!(tce (TCE_PCI_WRITE | TCE_PCI_READ))) - return -EINVAL; - if (tce ~(IOMMU_PAGE_MASK(tbl) | TCE_PCI_WRITE | TCE_PCI_READ)) return -EINVAL; @@ -1002,44 +999,20 @@ int iommu_tce_put_param_check(struct iommu_table *tbl, } EXPORT_SYMBOL_GPL(iommu_tce_put_param_check); -unsigned long iommu_clear_tce(struct iommu_table
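The essential property of the new exchange() callback — the caller atomically receives the previous TCE in the same step that installs the new one, so it can release the old page without any pool lock — can be sketched with a plain atomic swap (a model of the semantics, not the IODA hardware path):

```c
#include <assert.h>

/* Model of exchange(): write the new TCE and hand back the old one in
 * a single atomic step. Because readers of the slot always see either
 * the full old value or the full new value, no lock is needed around
 * the entry, only around whatever owns the returned page. */
static unsigned long tce_exchange(unsigned long *slot, unsigned long new_tce)
{
	return __atomic_exchange_n(slot, new_tce, __ATOMIC_SEQ_CST);
}
```

In the patch's scheme the caller then inspects the returned TCE's permission bits and, if the old entry was valid, does put_page() on the page it named — the exchange itself never touches page refcounts.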
[PATCH kernel v7 29/31] vfio: powerpc/spapr: Register memory and define IOMMU v2
The existing implementation accounts the whole DMA window in the locked_vm counter. This is going to get worse with multiple containers and huge DMA windows. Also, real-time accounting would require additional tracking of accounted pages due to the page size difference: the IOMMU uses 4K pages while the system uses 4K or 64K pages. Another issue is that actual page pinning/unpinning happens on every DMA map/unmap request. This does not affect performance much now, as we spend way too much time switching context between guest/userspace/host, but it will start to matter when we add in-kernel DMA map/unmap acceleration. This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU. The new IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces 2 new ioctls to register/unregister DMA memory - VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY - which receive a userspace address and the size of a memory region which needs to be pinned/unpinned and counted in locked_vm. The new IOMMU splits physical page pinning and TCE table updates into 2 different operations. It requires 1) guest pages to be registered first, 2) subsequent map/unmap requests to work only with pre-registered memory. For the default single window case this means that the entire guest (instead of 2GB) needs to be pinned before using VFIO. When a huge DMA window is added, no additional pinning will be required; otherwise it would be guest RAM + 2GB. The new memory registration ioctls are not supported by VFIO_SPAPR_TCE_IOMMU. Dynamic DMA windows and in-kernel acceleration will require memory to be preregistered in order to work. The accounting is done per user process. This advertises the v2 SPAPR TCE IOMMU and restricts what the userspace can do with v1 or v2 IOMMUs. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- Changes: v7: * now memory is registered per mm (i.e.
process) * moved memory registration code to powerpc/mmu * merged vfio: powerpc/spapr: Define v2 IOMMU into this * limited new ioctls to v2 IOMMU * updated doc * unsupported ioclts return -ENOTTY instead of -EPERM v6: * tce_get_hva_cached() returns hva via a pointer v4: * updated docs * s/kzmalloc/vzalloc/ * in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and replaced offset with index * renamed vfio_iommu_type_register_memory to vfio_iommu_spapr_register_memory and removed duplicating vfio_iommu_spapr_register_memory --- drivers/vfio/vfio_iommu_spapr_tce.c | 232 +++- include/uapi/linux/vfio.h | 27 + 2 files changed, 253 insertions(+), 6 deletions(-) diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index 9aeaed6..5049b4f 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -21,6 +21,7 @@ #include linux/vfio.h #include asm/iommu.h #include asm/tce.h +#include asm/mmu_context.h #define DRIVER_VERSION 0.1 #define DRIVER_AUTHOR a...@ozlabs.ru @@ -91,8 +92,58 @@ struct tce_container { struct iommu_group *grp; bool enabled; unsigned long locked_pages; + bool v2; }; +static long tce_unregister_pages(struct tce_container *container, + __u64 vaddr, __u64 size) +{ + long ret; + mm_iommu_table_group_mem_t *mem; + + if ((vaddr ~PAGE_MASK) || (size ~PAGE_MASK)) + return -EINVAL; + + mem = mm_iommu_get(vaddr, size PAGE_SHIFT); + if (!mem) + return -EINVAL; + + ret = mm_iommu_put(mem); /* undo kref_get() from mm_iommu_get() */ + if (!ret) + ret = mm_iommu_put(mem); + + return ret; +} + +static long tce_register_pages(struct tce_container *container, + __u64 vaddr, __u64 size) +{ + long ret = 0; + mm_iommu_table_group_mem_t *mem; + unsigned long entries = size PAGE_SHIFT; + + if ((vaddr ~PAGE_MASK) || (size ~PAGE_MASK) || + ((vaddr + size) vaddr)) + return -EINVAL; + + mem = mm_iommu_get(vaddr, entries); + if (!mem) { + ret = try_increment_locked_vm(entries); + if (ret) + return ret; + + ret 
= mm_iommu_alloc(vaddr, entries, mem); + if (ret) { + decrement_locked_vm(entries); + return ret; + } + } + + container-enabled = true; + + return 0; +} + static bool tce_page_is_contained(struct page *page, unsigned page_shift) { /* @@ -205,7 +256,7 @@ static void *tce_iommu_open(unsigned long arg) { struct tce_container *container; - if (arg != VFIO_SPAPR_TCE_IOMMU) { + if ((arg != VFIO_SPAPR_TCE_IOMMU) (arg != VFIO_SPAPR_TCE_v2_IOMMU)) { pr_err(tce_vfio: Wrong IOMMU type\n); return ERR_PTR(-EINVAL); } @@ -215,6 +266,7 @@ static void *tce_iommu_open(unsigned long arg)
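The argument validation in tce_register_pages()/tce_unregister_pages() above boils down to page alignment plus wrap-around protection. A user-space model (PAGE_SIZE fixed at 4K here purely for illustration; the kernel uses its configured page size):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define PAGE_SIZE 4096UL		/* illustrative, not the kernel's */
#define PAGE_MASK (~(PAGE_SIZE - 1))

/* Model of the vaddr/size checks done before registering memory:
 * both must be page aligned and the range must not wrap around the
 * top of the address space. */
static int check_register_args(uint64_t vaddr, uint64_t size)
{
	if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK) ||
	    ((vaddr + size) < vaddr))
		return -EINVAL;
	return 0;
}
```

The wrap check is the one extra condition register has over unregister in the patch, since a wrapped range could otherwise pass the alignment tests while describing nothing sensible.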
Re: [PATCH] mtd/spi: support en25s64 device
On Fri, Mar 27, 2015 at 05:38:30PM +0800, Shengzhou Liu wrote: Add support for EON en25s64 spi device. Signed-off-by: Shengzhou Liu shengzhou@freescale.com --- drivers/mtd/spi-nor/spi-nor.c | 1 + 1 file changed, 1 insertion(+) This is a MTD driver, not a SPI driver - you need to send your patch to the MTD maintainers not me. signature.asc Description: Digital signature ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH kernel v7 09/31] powerpc/powernv: Do not set read flag if direction==DMA_NONE
Normally a bitmap from the iommu_table is used to track what TCE entry is in use. Since we are going to use iommu_table without its locks and do xchg() instead, it becomes essential not to put bits which are not implied in the direction flag. This adds iommu_direction_to_tce_perm() (its counterpart is there already) and uses it for powernv's pnv_tce_build(). Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- arch/powerpc/include/asm/iommu.h | 1 + arch/powerpc/kernel/iommu.c | 15 +++ arch/powerpc/platforms/powernv/pci.c | 7 +-- 3 files changed, 17 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index ed69b7d..2af2d70 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -203,6 +203,7 @@ extern int iommu_take_ownership(struct iommu_table *tbl); extern void iommu_release_ownership(struct iommu_table *tbl); extern enum dma_data_direction iommu_tce_direction(unsigned long tce); +extern unsigned long iommu_direction_to_tce_perm(enum dma_data_direction dir); #endif /* __KERNEL__ */ #endif /* _ASM_IOMMU_H */ diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 1b4a178..029b1ea 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -871,6 +871,21 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size, } } +unsigned long iommu_direction_to_tce_perm(enum dma_data_direction dir) +{ + switch (dir) { + case DMA_BIDIRECTIONAL: + return TCE_PCI_READ | TCE_PCI_WRITE; + case DMA_FROM_DEVICE: + return TCE_PCI_WRITE; + case DMA_TO_DEVICE: + return TCE_PCI_READ; + default: + return 0; + } +} +EXPORT_SYMBOL_GPL(iommu_direction_to_tce_perm); + #ifdef CONFIG_IOMMU_API /* * SPAPR TCE API diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c index 54323d6..609f5b1 100644 --- a/arch/powerpc/platforms/powernv/pci.c +++ b/arch/powerpc/platforms/powernv/pci.c @@ -593,15 +593,10 @@ static int 
pnv_tce_build(struct iommu_table *tbl, long index, long npages, unsigned long uaddr, enum dma_data_direction direction, struct dma_attrs *attrs, bool rm) { - u64 proto_tce; + u64 proto_tce = iommu_direction_to_tce_perm(direction); __be64 *tcep, *tces; u64 rpn; - proto_tce = TCE_PCI_READ; // Read allowed - - if (direction != DMA_TO_DEVICE) - proto_tce |= TCE_PCI_WRITE; - tces = tcep = ((__be64 *)tbl-it_base) + index - tbl-it_offset; rpn = __pa(uaddr) tbl-it_page_shift; -- 2.0.0 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
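The direction-to-permission mapping the patch centralises is small enough to model completely (the TCE_PCI_* bit values here are illustrative; what matters is that DMA_NONE yields no access bits at all, which is the property the later lockless xchg() tracking relies on):

```c
#include <assert.h>

enum dma_data_direction {
	DMA_BIDIRECTIONAL,
	DMA_TO_DEVICE,
	DMA_FROM_DEVICE,
	DMA_NONE,
};

#define TCE_PCI_READ	0x1UL	/* illustrative bit values */
#define TCE_PCI_WRITE	0x2UL

/* Model of iommu_direction_to_tce_perm(): derive TCE permission bits
 * from the DMA direction; DMA_NONE (a cleared entry) maps to zero. */
static unsigned long iommu_direction_to_tce_perm(enum dma_data_direction dir)
{
	switch (dir) {
	case DMA_BIDIRECTIONAL:
		return TCE_PCI_READ | TCE_PCI_WRITE;
	case DMA_FROM_DEVICE:
		return TCE_PCI_WRITE;	/* device writes to memory */
	case DMA_TO_DEVICE:
		return TCE_PCI_READ;	/* device reads from memory */
	default:
		return 0;
	}
}
```

This is why the old pnv_tce_build() behaviour of always setting TCE_PCI_READ had to go: with xchg()-based tracking, a cleared entry must be distinguishable from a mapped one by its permission bits alone.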
[PATCH kernel v7 14/31] vfio: powerpc/spapr: powerpc/powernv/ioda2: Rework IOMMU ownership control
At the moment the iommu_table struct has a set_bypass() which enables/ disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code which calls this callback when external IOMMU users such as VFIO are about to get over a PHB. The set_bypass() callback is not really an iommu_table function but IOMMU/PE function. This introduces a iommu_table_group_ops struct and adds a set_ownership() callback to it which is called when an external user takes control over the IOMMU. This renames set_bypass() to set_ownership() as it is not necessarily just enabling bypassing, it can be something else/more so let's give it more generic name. The bool parameter is inverted. The callback is implemented for IODA2 only. Other platforms (P5IOC2, IODA1) will use the old iommu_take_ownership/iommu_release_ownership API. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- arch/powerpc/include/asm/iommu.h | 14 +- arch/powerpc/platforms/powernv/pci-ioda.c | 30 ++ drivers/vfio/vfio_iommu_spapr_tce.c | 25 + 3 files changed, 56 insertions(+), 13 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index b9e50d3..d1f8c6c 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -92,7 +92,6 @@ struct iommu_table { unsigned long it_page_shift;/* table iommu page size */ struct iommu_table_group *it_group; struct iommu_table_ops *it_ops; - void (*set_bypass)(struct iommu_table *tbl, bool enable); }; /* Pure 2^n version of get_order */ @@ -127,11 +126,24 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl, #define IOMMU_TABLE_GROUP_MAX_TABLES 1 +struct iommu_table_group; + +struct iommu_table_group_ops { + /* +* Switches ownership from the kernel itself to an external +* user. While onwership is enabled, the kernel cannot use IOMMU +* for itself. 
+*/ + void (*set_ownership)(struct iommu_table_group *table_group, + bool enable); +}; + struct iommu_table_group { #ifdef CONFIG_IOMMU_API struct iommu_group *group; #endif struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES]; + struct iommu_table_group_ops *ops; }; #ifdef CONFIG_IOMMU_API diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index a964c50..9687731 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1255,10 +1255,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb, __free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs)); } -static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable) +static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable) { - struct pnv_ioda_pe *pe = container_of(tbl-it_group, struct pnv_ioda_pe, - table_group); uint16_t window_id = (pe-pe_number 1 ) + 1; int64_t rc; @@ -1286,7 +1284,8 @@ static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable) * host side. 
*/ if (pe-pdev) - set_iommu_table_base(pe-pdev-dev, tbl); + set_iommu_table_base(pe-pdev-dev, + pe-table_group.tables[0]); else pnv_ioda_setup_bus_dma(pe, pe-pbus, false); } @@ -1302,13 +1301,27 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb, /* TVE #1 is selected by PCI address bit 59 */ pe-tce_bypass_base = 1ull 59; - /* Install set_bypass callback for VFIO */ - pe-table_group.tables[0].set_bypass = pnv_pci_ioda2_set_bypass; - /* Enable bypass by default */ - pnv_pci_ioda2_set_bypass(pe-table_group.tables[0], true); + pnv_pci_ioda2_set_bypass(pe, true); } +static void pnv_ioda2_set_ownership(struct iommu_table_group *table_group, +bool enable) +{ + struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe, + table_group); + if (enable) + iommu_take_ownership(table_group); + else + iommu_release_ownership(table_group); + + pnv_pci_ioda2_set_bypass(pe, !enable); +} + +static struct iommu_table_group_ops pnv_pci_ioda2_ops = { + .set_ownership = pnv_ioda2_set_ownership, +}; + static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe) { @@ -1376,6 +1389,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, } tbl-it_ops = pnv_iommu_ops; iommu_init_table(tbl, phb-hose-node); + pe-table_group.ops = pnv_pci_ioda2_ops;
[PATCH kernel v7 22/31] powerpc/powernv: Implement multilevel TCE tables
TCE tables might get too big in case of 4K IOMMU pages and DDW enabled on huge guests (hundreds of GB of RAM) so the kernel might be unable to allocate contiguous chunk of physical memory to store the TCE table. To address this, POWER8 CPU (actually, IODA2) supports multi-level TCE tables, up to 5 levels which splits the table into a tree of smaller subtables. This adds multi-level TCE tables support to pnv_pci_ioda2_create_table() and pnv_pci_ioda2_free_table() callbacks. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- arch/powerpc/include/asm/iommu.h | 2 + arch/powerpc/platforms/powernv/pci-ioda.c | 128 -- arch/powerpc/platforms/powernv/pci.c | 19 + 3 files changed, 123 insertions(+), 26 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 8ed4648..1e0d907 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -90,6 +90,8 @@ struct iommu_pool { struct iommu_table { unsigned long it_busno; /* Bus number this table belongs to */ unsigned long it_size; /* Size of iommu table in entries */ + unsigned long it_indirect_levels; + unsigned long it_level_size; unsigned long it_offset;/* Offset into global table */ unsigned long it_base; /* mapped address of tce table */ unsigned long it_index; /* which iommu table this is */ diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 64b7cfe..74e119c 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -47,6 +47,10 @@ #include powernv.h #include pci.h +#define POWERNV_IOMMU_DEFAULT_LEVELS 1 +#define POWERNV_IOMMU_MAX_LEVELS 5 +#define ROUND_UP(x, n) (((x) + (n) - 1u) ~((n) - 1u)) + static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level, const char *fmt, ...) 
{ @@ -1339,16 +1343,82 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb, __free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs)); } +static void pnv_free_tce_table(unsigned long addr, unsigned size, + unsigned level) +{ + addr = ~(TCE_PCI_READ | TCE_PCI_WRITE); + + if (level) { + long i; + u64 *tmp = (u64 *) addr; + + for (i = 0; i size; ++i) { + unsigned long hpa = be64_to_cpu(tmp[i]); + + if (!(hpa (TCE_PCI_READ | TCE_PCI_WRITE))) + continue; + + pnv_free_tce_table((unsigned long) __va(hpa), + size, level - 1); + } + } + + free_pages(addr, get_order(size 3)); +} + +static __be64 *pnv_alloc_tce_table(int nid, + unsigned shift, unsigned levels, unsigned long *left) +{ + struct page *tce_mem = NULL; + __be64 *addr, *tmp; + unsigned order = max_t(unsigned, shift, PAGE_SHIFT) - PAGE_SHIFT; + unsigned long chunk = 1UL shift, i; + + tce_mem = alloc_pages_node(nid, GFP_KERNEL, order); + if (!tce_mem) { + pr_err(Failed to allocate a TCE memory\n); + return NULL; + } + + if (!*left) + return NULL; + + addr = page_address(tce_mem); + memset(addr, 0, chunk); + + --levels; + if (!levels) { + /* This is last level, actual TCEs */ + *left -= min(*left, chunk); + return addr; + } + + for (i = 0; i (chunk 3); ++i) { + /* We allocated required TCEs, mark the rest page fault */ + if (!*left) { + addr[i] = cpu_to_be64(0); + continue; + } + + tmp = pnv_alloc_tce_table(nid, shift, levels, left); + addr[i] = cpu_to_be64(__pa(tmp) | + TCE_PCI_READ | TCE_PCI_WRITE); + } + + return addr; +} + static long pnv_pci_ioda2_create_table(struct pnv_ioda_pe *pe, - __u32 page_shift, __u64 window_size, + __u32 page_shift, __u64 window_size, __u32 levels, struct iommu_table *tbl) { int nid = pe-phb-hose-node; - struct page *tce_mem = NULL; void *addr; - unsigned long tce_table_size; - int64_t rc; - unsigned order; + unsigned long tce_table_size, left; + unsigned shift; + + if (!levels || (levels POWERNV_IOMMU_MAX_LEVELS)) + return -EINVAL; if ((window_size memory_hotplug_max()) || 
!is_power_of_2(window_size)) return -EINVAL; @@ -1357,16 +1427,19 @@ static long pnv_pci_ioda2_create_table(struct pnv_ioda_pe *pe, tce_table_size = max(0x1000UL, tce_table_size); /* Allocate TCE table */ - order =
[PATCH v3 2/3] powerpc/powernv: handle OPAL_SUCCESS return in opal_sensor_read
Currently, when a sensor value is read, the kernel calls OPAL, which in turn builds a message for the FSP, and waits for a message back. The new device tree for OPAL sensors [1] adds new sensors that can be read synchronously (core temperatures for instance) and that don't need to wait for a response. This patch modifies the opal call to accept an OPAL_SUCCESS return value and cover the case above. [1] https://lists.ozlabs.org/pipermail/skiboot/2015-March/000639.html Signed-off-by: Cédric Le Goater c...@fr.ibm.com --- We still uselessly reserve a token (for the response) and take a lock, which might raise the need of a new 'opal_sensor_read_sync' call. Changes since v2 : - merged the return code assignments in one call arch/powerpc/platforms/powernv/opal-sensor.c | 32 --- 1 file changed, 20 insertions(+), 12 deletions(-) Index: linux.git/arch/powerpc/platforms/powernv/opal-sensor.c === --- linux.git.orig/arch/powerpc/platforms/powernv/opal-sensor.c +++ linux.git/arch/powerpc/platforms/powernv/opal-sensor.c @@ -46,20 +46,28 @@ int opal_get_sensor_data(u32 sensor_hndl mutex_lock(opal_sensor_mutex); ret = opal_sensor_read(sensor_hndl, token, data); - if (ret != OPAL_ASYNC_COMPLETION) { - ret = opal_error_code(ret); - goto out_token; - } + switch (ret) { + case OPAL_ASYNC_COMPLETION: + ret = opal_async_wait_response(token, msg); + if (ret) { + pr_err(%s: Failed to wait for the async response, %d\n, + __func__, ret); + goto out_token; + } - ret = opal_async_wait_response(token, msg); - if (ret) { - pr_err(%s: Failed to wait for the async response, %d\n, - __func__, ret); - goto out_token; - } + ret = opal_error_code(be64_to_cpu(msg.params[1])); + *sensor_data = be32_to_cpu(data); + break; - *sensor_data = be32_to_cpu(data); - ret = opal_error_code(be64_to_cpu(msg.params[1])); + case OPAL_SUCCESS: + ret = 0; + *sensor_data = be32_to_cpu(data); + break; + + default: + ret = opal_error_code(ret); + break; + } out_token: mutex_unlock(opal_sensor_mutex); ___ Linuxppc-dev 
mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH kernel v7 16/31] powerpc/powernv/ioda/ioda2: Rework tce_build()/tce_free()
The pnv_pci_ioda_tce_invalidate() helper invalidates the TCE cache. It is supposed to be called on IODA1/2 and not on p5ioc2. It receives start and end host addresses of the TCE table. This approach makes it possible for pnv_pci_ioda_tce_invalidate() to be called unintentionally on p5ioc2. Another issue is that IODA2 needs PCI addresses to invalidate the cache; those can be calculated from host addresses, but since we are going to implement multi-level TCE tables, calculating a PCI address from a host address would get either tricky or ugly as the TCE table remains flat on the PCI bus but not in RAM. This defines separate iommu_table_ops callbacks for p5ioc2 and IODA1/2 PHBs. They all call the common pnv_tce_build/pnv_tce_free/pnv_tce_get helpers but call a PHB-specific TCE invalidation helper (when needed). This changes pnv_pci_ioda2_tce_invalidate() to receive a TCE index and number of pages, which are PCI addresses shifted by the IOMMU page shift. The patch is pretty mechanical and behaviour is not expected to change. 
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- arch/powerpc/platforms/powernv/pci-ioda.c | 92 ++--- arch/powerpc/platforms/powernv/pci-p5ioc2.c | 9 ++- arch/powerpc/platforms/powernv/pci.c| 76 +--- arch/powerpc/platforms/powernv/pci.h| 7 ++- 4 files changed, 111 insertions(+), 73 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 9687731..fd993bc 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1065,18 +1065,20 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, } } -static void pnv_pci_ioda1_tce_invalidate(struct pnv_ioda_pe *pe, -struct iommu_table *tbl, -__be64 *startp, __be64 *endp, bool rm) +static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl, + unsigned long index, unsigned long npages, bool rm) { + struct pnv_ioda_pe *pe = container_of(tbl-it_group, + struct pnv_ioda_pe, table_group); __be64 __iomem *invalidate = rm ? (__be64 __iomem *)pe-tce_inval_reg_phys : (__be64 __iomem *)tbl-it_index; unsigned long start, end, inc; const unsigned shift = tbl-it_page_shift; - start = __pa(startp); - end = __pa(endp); + start = __pa((__be64 *)tbl-it_base + index - tbl-it_offset); + end = __pa((__be64 *)tbl-it_base + index - tbl-it_offset + + npages - 1); /* BML uses this case for p6/p7/galaxy2: Shift addr and put in node */ if (tbl-it_busno) { @@ -1112,10 +1114,40 @@ static void pnv_pci_ioda1_tce_invalidate(struct pnv_ioda_pe *pe, */ } -static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe, -struct iommu_table *tbl, -__be64 *startp, __be64 *endp, bool rm) +static int pnv_ioda1_tce_build_vm(struct iommu_table *tbl, long index, + long npages, unsigned long uaddr, + enum dma_data_direction direction, + struct dma_attrs *attrs) { + long ret = pnv_tce_build(tbl, index, npages, uaddr, direction, + attrs); + + if (!ret (tbl-it_type TCE_PCI_SWINV_CREATE)) + pnv_pci_ioda1_tce_invalidate(tbl, index, npages, false); + 
+ return ret; +} + +static void pnv_ioda1_tce_free_vm(struct iommu_table *tbl, long index, + long npages) +{ + pnv_tce_free(tbl, index, npages); + + if (tbl-it_type TCE_PCI_SWINV_FREE) + pnv_pci_ioda1_tce_invalidate(tbl, index, npages, false); +} + +struct iommu_table_ops pnv_ioda1_iommu_ops = { + .set = pnv_ioda1_tce_build_vm, + .clear = pnv_ioda1_tce_free_vm, + .get = pnv_tce_get, +}; + +static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl, + unsigned long index, unsigned long npages, bool rm) +{ + struct pnv_ioda_pe *pe = container_of(tbl-it_group, + struct pnv_ioda_pe, table_group); unsigned long start, end, inc; __be64 __iomem *invalidate = rm ? (__be64 __iomem *)pe-tce_inval_reg_phys : @@ -1128,9 +1160,9 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe, end = start; /* Figure out the start, end and step */ - inc = tbl-it_offset + (((u64)startp - tbl-it_base) / sizeof(u64)); + inc = tbl-it_offset + index / sizeof(u64); start |= (inc shift); - inc = tbl-it_offset + (((u64)endp - tbl-it_base) / sizeof(u64)); + inc = tbl-it_offset + (index + npages - 1) / sizeof(u64); end |= (inc shift); inc = (0x1ull shift); mb(); @@ -1144,19 +1176,35 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe
[PATCH kernel v7 19/31] powerpc/powernv/ioda2: Introduce pnv_pci_ioda2_create_table/pnc_pci_free_table
This is a part of moving TCE table allocation into an iommu_ops callback to support multiple IOMMU groups per one VFIO container. This enforce window size to be a power of two. This is a pretty mechanical patch. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- arch/powerpc/platforms/powernv/pci-ioda.c | 85 +++ 1 file changed, 63 insertions(+), 22 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index a1e0df9..908863a 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -24,7 +24,9 @@ #include linux/msi.h #include linux/memblock.h #include linux/iommu.h +#include linux/mmzone.h +#include asm/mmzone.h #include asm/sections.h #include asm/io.h #include asm/prom.h @@ -1337,6 +1339,58 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb, __free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs)); } +static long pnv_pci_ioda2_create_table(struct pnv_ioda_pe *pe, + __u32 page_shift, __u64 window_size, + struct iommu_table *tbl) +{ + int nid = pe-phb-hose-node; + struct page *tce_mem = NULL; + void *addr; + unsigned long tce_table_size; + int64_t rc; + unsigned order; + + if ((window_size memory_hotplug_max()) || !is_power_of_2(window_size)) + return -EINVAL; + + tce_table_size = (window_size page_shift) * 8; + tce_table_size = max(0x1000UL, tce_table_size); + + /* Allocate TCE table */ + order = get_order(tce_table_size); + + tce_mem = alloc_pages_node(nid, GFP_KERNEL, order); + if (!tce_mem) { + pr_err(Failed to allocate a TCE memory, order=%d\n, order); + rc = -ENOMEM; + goto fail; + } + addr = page_address(tce_mem); + memset(addr, 0, tce_table_size); + + /* Setup linux iommu table */ + pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0, + page_shift); + + tbl-it_ops = pnv_ioda2_iommu_ops; + + return 0; +fail: + if (tce_mem) + __free_pages(tce_mem, get_order(tce_table_size)); + + return rc; +} + +static void pnv_pci_free_table(struct 
iommu_table *tbl) +{ + if (!tbl-it_size) + return; + + free_pages(tbl-it_base, get_order(tbl-it_size 3)); + memset(tbl, 0, sizeof(struct iommu_table)); +} + static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable) { uint16_t window_id = (pe-pe_number 1 ) + 1; @@ -1409,11 +1463,9 @@ static struct iommu_table_group_ops pnv_pci_ioda2_ops = { static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe) { - struct page *tce_mem = NULL; - void *addr; const __be64 *swinvp; - struct iommu_table *tbl; - unsigned int tce_table_size, end; + unsigned int end; + struct iommu_table *tbl = pe-table_group.tables[0]; int64_t rc; /* We shouldn't already have a 32-bit DMA associated */ @@ -1422,30 +1474,20 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, /* The PE will reserve all possible 32-bits space */ pe-tce32_seg = 0; + end = (1 ilog2(phb-ioda.m32_pci_base)); - tce_table_size = (end / 0x1000) * 8; pe_info(pe, Setting up 32-bit TCE table at 0..%08x\n, end); - /* Allocate TCE table */ - tce_mem = alloc_pages_node(phb-hose-node, GFP_KERNEL, - get_order(tce_table_size)); - if (!tce_mem) { - pe_err(pe, Failed to allocate a 32-bit TCE memory\n); - goto fail; + rc = pnv_pci_ioda2_create_table(pe, IOMMU_PAGE_SHIFT_4K, + phb-ioda.m32_pci_base, tbl); + if (rc) { + pe_err(pe, Failed to create 32-bit TCE table, err %ld, rc); + return; } - addr = page_address(tce_mem); - memset(addr, 0, tce_table_size); /* Setup iommu */ pe-table_group.tables[0].it_group = pe-table_group; - - /* Setup linux iommu table */ - tbl = pe-table_group.tables[0]; - pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0, - IOMMU_PAGE_SHIFT_4K); - - tbl-it_ops = pnv_ioda2_iommu_ops; iommu_init_table(tbl, phb-hose-node); #ifdef CONFIG_IOMMU_API pe-table_group.ops = pnv_pci_ioda2_ops; @@ -1494,8 +1536,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, fail: if (pe-tce32_seg = 0) pe-tce32_seg = -1; - if (tce_mem) - __free_pages(tce_mem, 
get_order(tce_table_size)); + pnv_pci_free_table(tbl); } static void pnv_ioda_setup_dma(struct pnv_phb *phb) -- 2.0.0
[PATCH kernel v7 30/31] vfio: powerpc/spapr: Support multiple groups in one container if possible
At the moment only one group per container is supported. POWER8 CPUs have a more flexible design and allow having 2 TCE tables per IOMMU group so we can relax this limitation and support multiple groups per container. This adds TCE table descriptors to a container and uses iommu_table_group_ops to create/set DMA windows on IOMMU groups so the same TCE tables will be shared between several IOMMU groups. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- Changes: v7: * updated doc --- Documentation/vfio.txt | 8 +- drivers/vfio/vfio_iommu_spapr_tce.c | 289 ++-- 2 files changed, 214 insertions(+), 83 deletions(-) diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt index 94328c8..7dcf2b5 100644 --- a/Documentation/vfio.txt +++ b/Documentation/vfio.txt @@ -289,10 +289,12 @@ PPC64 sPAPR implementation note This implementation has some specifics: -1) Only one IOMMU group per container is supported as an IOMMU group -represents the minimal entity which isolation can be guaranteed for and -groups are allocated statically, one per a Partitionable Endpoint (PE) +1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per +container is supported as an IOMMU table is allocated at the boot time, +one table per a IOMMU group which is a Partitionable Endpoint (PE) (PE is often a PCI domain but not always). +Newer systems (POWER8 with IODA2) have improved hardware design which allows +to remove this limitation and have multiple IOMMU groups per a VFIO container. 
2) The hardware supports so called DMA windows - the PCI address range within which DMA transfer is allowed, any attempt to access address space diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index 5049b4f..8cbd239 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -89,10 +89,16 @@ static void decrement_locked_vm(long npages) */ struct tce_container { struct mutex lock; - struct iommu_group *grp; bool enabled; unsigned long locked_pages; bool v2; + struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES]; + struct list_head group_list; +}; + +struct tce_iommu_group { + struct list_head next; + struct iommu_group *grp; }; static long tce_unregister_pages(struct tce_container *container, @@ -154,20 +160,20 @@ static bool tce_page_is_contained(struct page *page, unsigned page_shift) return (PAGE_SHIFT + compound_order(compound_head(page))) = page_shift; } +static inline bool tce_groups_attached(struct tce_container *container) +{ + return !list_empty(container-group_list); +} + static struct iommu_table *spapr_tce_find_table( struct tce_container *container, phys_addr_t ioba) { long i; struct iommu_table *ret = NULL; - struct iommu_table_group *table_group; - - table_group = iommu_group_get_iommudata(container-grp); - if (!table_group) - return NULL; for (i = 0; i IOMMU_TABLE_GROUP_MAX_TABLES; ++i) { - struct iommu_table *tbl = table_group-tables[i]; + struct iommu_table *tbl = container-tables[i]; unsigned long entry = ioba tbl-it_page_shift; unsigned long start = tbl-it_offset; unsigned long end = start + tbl-it_size; @@ -185,11 +191,8 @@ static int tce_iommu_enable(struct tce_container *container) { int ret = 0; unsigned long locked; - struct iommu_table *tbl; struct iommu_table_group *table_group; - - if (!container-grp) - return -ENXIO; + struct tce_iommu_group *tcegrp; if (!current-mm) return -ESRCH; /* process exited */ @@ -222,12 +225,24 @@ static int tce_iommu_enable(struct 
tce_container *container) * as this information is only available from KVM and VFIO is * KVM agnostic. */ - table_group = iommu_group_get_iommudata(container-grp); + if (!tce_groups_attached(container)) + return -ENODEV; + + tcegrp = list_first_entry(container-group_list, + struct tce_iommu_group, next); + table_group = iommu_group_get_iommudata(tcegrp-grp); if (!table_group) return -ENODEV; - tbl = table_group-tables[0]; - locked = (tbl-it_size tbl-it_page_shift) PAGE_SHIFT; + /* +* We do not allow enabling a group if no DMA-able memory was +* registered as there is no way to know how much we should +* increment the locked_vm counter. +*/ + if (!table_group-tce32_size) + return -EPERM; + + locked = table_group-tce32_size PAGE_SHIFT; ret = try_increment_locked_vm(locked); if (ret) return ret; @@ -266,6 +281,8 @@ static void *tce_iommu_open(unsigned long arg)
[PATCH kernel v7 11/31] powerpc/iommu: Introduce iommu_table_alloc() helper
This replaces multiple calls of kzalloc_node() with a new iommu_table_alloc() helper. Right now it calls kzalloc_node() but later it will be modified to allocate a iommu_table_group struct with a single iommu_table in it. Later the helper will allocate a iommu_table_group struct which embeds the iommu table(s). Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- arch/powerpc/include/asm/iommu.h | 1 + arch/powerpc/kernel/iommu.c| 9 + arch/powerpc/platforms/powernv/pci.c | 2 +- arch/powerpc/platforms/pseries/iommu.c | 12 4 files changed, 15 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index d909e2a..eb75726 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -117,6 +117,7 @@ static inline void *get_iommu_table_base(struct device *dev) return dev-archdata.dma_data.iommu_table_base; } +extern struct iommu_table *iommu_table_alloc(int node); /* Frees table for an individual device node */ extern void iommu_free_table(struct iommu_table *tbl, const char *node_name); diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index eceb214..b39d00a 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -710,6 +710,15 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid) return tbl; } +struct iommu_table *iommu_table_alloc(int node) +{ + struct iommu_table *tbl; + + tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, node); + + return tbl; +} + void iommu_free_table(struct iommu_table *tbl, const char *node_name) { unsigned long bitmap_sz; diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c index c619ec6..1c31ac8 100644 --- a/arch/powerpc/platforms/powernv/pci.c +++ b/arch/powerpc/platforms/powernv/pci.c @@ -680,7 +680,7 @@ static struct iommu_table *pnv_pci_setup_bml_iommu(struct pci_controller *hose) hose-dn-full_name); return NULL; } - tbl = kzalloc_node(sizeof(struct 
iommu_table), GFP_KERNEL, hose-node); + tbl = iommu_table_alloc(hose-node); if (WARN_ON(!tbl)) return NULL; pnv_pci_setup_iommu_table(tbl, __va(be64_to_cpup(basep)), diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c index 48d1fde..41a8b14 100644 --- a/arch/powerpc/platforms/pseries/iommu.c +++ b/arch/powerpc/platforms/pseries/iommu.c @@ -617,8 +617,7 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus) pci-phb-dma_window_size = 0x800ul; pci-phb-dma_window_base_cur = 0x800ul; - tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, - pci-phb-node); + tbl = iommu_table_alloc(pci-phb-node); iommu_table_setparms(pci-phb, dn, tbl); tbl-it_ops = iommu_table_pseries_ops; @@ -669,8 +668,7 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus) pdn-full_name, ppci-iommu_table); if (!ppci-iommu_table) { - tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, - ppci-phb-node); + tbl = iommu_table_alloc(ppci-phb-node); iommu_table_setparms_lpar(ppci-phb, pdn, tbl, dma_window); tbl-it_ops = iommu_table_lpar_multi_ops; ppci-iommu_table = iommu_init_table(tbl, ppci-phb-node); @@ -697,8 +695,7 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev) struct pci_controller *phb = PCI_DN(dn)-phb; pr_debug( -- first child, no bridge. 
Allocating iommu table.\n); - tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, - phb-node); + tbl = iommu_table_alloc(phb-node); iommu_table_setparms(phb, dn, tbl); tbl-it_ops = iommu_table_pseries_ops; PCI_DN(dn)-iommu_table = iommu_init_table(tbl, phb-node); @@ -1120,8 +1117,7 @@ static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev) pci = PCI_DN(pdn); if (!pci-iommu_table) { - tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, - pci-phb-node); + tbl = iommu_table_alloc(pci-phb-node); iommu_table_setparms_lpar(pci-phb, pdn, tbl, dma_window); tbl-it_ops = iommu_table_lpar_multi_ops; pci-iommu_table = iommu_init_table(tbl, pci-phb-node); -- 2.0.0 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH kernel v7 18/31] powerpc/powernv/ioda2: Rework iommu_table creation
This moves iommu_table creation to the beginning. This is a mechanical patch. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- arch/powerpc/platforms/powernv/pci-ioda.c | 34 --- 1 file changed, 18 insertions(+), 16 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 4d80502..a1e0df9 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1437,27 +1437,33 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, addr = page_address(tce_mem); memset(addr, 0, tce_table_size); + /* Setup iommu */ + pe-table_group.tables[0].it_group = pe-table_group; + + /* Setup linux iommu table */ + tbl = pe-table_group.tables[0]; + pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0, + IOMMU_PAGE_SHIFT_4K); + + tbl-it_ops = pnv_ioda2_iommu_ops; + iommu_init_table(tbl, phb-hose-node); +#ifdef CONFIG_IOMMU_API + pe-table_group.ops = pnv_pci_ioda2_ops; +#endif + /* * Map TCE table through TVT. The TVE index is the PE number * shifted by 1 bit for 32-bits DMA space. 
*/ rc = opal_pci_map_pe_dma_window(phb-opal_id, pe-pe_number, - pe-pe_number 1, 1, __pa(addr), - tce_table_size, 0x1000); + pe-pe_number 1, 1, __pa(tbl-it_base), + tbl-it_size 3, 1ULL tbl-it_page_shift); if (rc) { pe_err(pe, Failed to configure 32-bit TCE table, err %ld\n, rc); goto fail; } - /* Setup iommu */ - pe-table_group.tables[0].it_group = pe-table_group; - - /* Setup linux iommu table */ - tbl = pe-table_group.tables[0]; - pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0, - IOMMU_PAGE_SHIFT_4K); - /* OPAL variant of PHB3 invalidated TCEs */ swinvp = of_get_property(phb-hose-dn, ibm,opal-tce-kill, NULL); if (swinvp) { @@ -1471,16 +1477,12 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, 8); tbl-it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE); } - tbl-it_ops = pnv_ioda2_iommu_ops; - iommu_init_table(tbl, phb-hose-node); -#ifdef CONFIG_IOMMU_API - pe-table_group.ops = pnv_pci_ioda2_ops; -#endif iommu_register_group(pe-table_group, phb-hose-global_number, pe-pe_number); if (pe-pdev) - set_iommu_table_base_and_group(pe-pdev-dev, tbl); + set_iommu_table_base_and_group(pe-pdev-dev, + pe-table_group.tables[0]); else pnv_ioda_setup_bus_dma(pe, pe-pbus, true); -- 2.0.0 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH kernel v7 10/31] powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table
This adds an iommu_table_ops struct and puts a pointer to it into the iommu_table struct. This moves the tce_build/tce_free/tce_get/tce_flush callbacks from ppc_md to the new struct where they really belong. This adds the requirement for @it_ops to be initialized before calling iommu_init_table() to make sure that we do not leave any IOMMU table with iommu_table_ops uninitialized. This is not a parameter of iommu_init_table() though, as there will be cases when iommu_init_table() will not be called on TCE tables, for example - VFIO. This does s/tce_build/set/, s/tce_free/clear/ and removes the redundant tce_ prefixes. This removes the tce_xxx_rm handlers from ppc_md but does not add them to iommu_table_ops as this will be done later if we decide to support TCE hypercalls in real mode. For pSeries, this always uses tce_buildmulti_pSeriesLP/tce_freemulti_pSeriesLP. This changes the multi callbacks to fall back to tce_build_pSeriesLP/tce_free_pSeriesLP if FW_FEATURE_MULTITCE is not present. The reason for this is that we still have to support the multitce=off boot parameter in disable_multitce() and we do not want to walk through all IOMMU tables in the system and replace multi callbacks with single ones. 
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- arch/powerpc/include/asm/iommu.h| 17 +++ arch/powerpc/include/asm/machdep.h | 25 arch/powerpc/kernel/iommu.c | 46 +++-- arch/powerpc/kernel/vio.c | 5 arch/powerpc/platforms/cell/iommu.c | 8 +++-- arch/powerpc/platforms/pasemi/iommu.c | 7 +++-- arch/powerpc/platforms/powernv/pci-ioda.c | 2 ++ arch/powerpc/platforms/powernv/pci-p5ioc2.c | 1 + arch/powerpc/platforms/powernv/pci.c| 23 --- arch/powerpc/platforms/powernv/pci.h| 1 + arch/powerpc/platforms/pseries/iommu.c | 34 +++-- arch/powerpc/sysdev/dart_iommu.c| 12 12 files changed, 93 insertions(+), 88 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 2af2d70..d909e2a 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -43,6 +43,22 @@ extern int iommu_is_off; extern int iommu_force_on; +struct iommu_table_ops { + int (*set)(struct iommu_table *tbl, + long index, long npages, + unsigned long uaddr, + enum dma_data_direction direction, + struct dma_attrs *attrs); + void (*clear)(struct iommu_table *tbl, + long index, long npages); + unsigned long (*get)(struct iommu_table *tbl, long index); + void (*flush)(struct iommu_table *tbl); +}; + +/* These are used by VIO */ +extern struct iommu_table_ops iommu_table_lpar_multi_ops; +extern struct iommu_table_ops iommu_table_pseries_ops; + /* * IOMAP_MAX_ORDER defines the largest contiguous block * of dma space we can get. 
IOMAP_MAX_ORDER = 13 @@ -77,6 +93,7 @@ struct iommu_table { #ifdef CONFIG_IOMMU_API struct iommu_group *it_group; #endif + struct iommu_table_ops *it_ops; void (*set_bypass)(struct iommu_table *tbl, bool enable); }; diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h index c8175a3..2abe744 100644 --- a/arch/powerpc/include/asm/machdep.h +++ b/arch/powerpc/include/asm/machdep.h @@ -65,31 +65,6 @@ struct machdep_calls { * destroyed as well */ void(*hpte_clear_all)(void); - int (*tce_build)(struct iommu_table *tbl, -long index, -long npages, -unsigned long uaddr, -enum dma_data_direction direction, -struct dma_attrs *attrs); - void(*tce_free)(struct iommu_table *tbl, - long index, - long npages); - unsigned long (*tce_get)(struct iommu_table *tbl, - long index); - void(*tce_flush)(struct iommu_table *tbl); - - /* _rm versions are for real mode use only */ - int (*tce_build_rm)(struct iommu_table *tbl, -long index, -long npages, -unsigned long uaddr, -enum dma_data_direction direction, -struct dma_attrs *attrs); - void(*tce_free_rm)(struct iommu_table *tbl, - long index, - long npages); - void(*tce_flush_rm)(struct iommu_table *tbl); - void __iomem * (*ioremap)(phys_addr_t addr, unsigned long
[PATCH v3 3/3] powerpc/powernv: remove opal_sensor_mutex
The opal sensor mutex protects the opal_sensor_read call, which can return an OPAL_BUSY code on IBM Power systems if a previous request is in progress. This can be handled at user level with a retry. Signed-off-by: Cédric Le Goater c...@fr.ibm.com --- Changes since v2 : - removed a goto label arch/powerpc/platforms/powernv/opal-sensor.c |9 + 1 file changed, 1 insertion(+), 8 deletions(-) Index: linux.git/arch/powerpc/platforms/powernv/opal-sensor.c === --- linux.git.orig/arch/powerpc/platforms/powernv/opal-sensor.c +++ linux.git/arch/powerpc/platforms/powernv/opal-sensor.c @@ -19,13 +19,10 @@ */ #include <linux/delay.h> -#include <linux/mutex.h> #include <linux/of_platform.h> #include <asm/opal.h> #include <asm/machdep.h> -static DEFINE_MUTEX(opal_sensor_mutex); - /* * This will return sensor information to driver based on the requested sensor * handle. A handle is an opaque id for the powernv, read by the driver from the @@ -40,11 +37,9 @@ int opal_get_sensor_data(u32 sensor_hndl token = opal_async_get_token_interruptible(); if (token < 0) { pr_err("%s: Couldn't get the token, returning\n", __func__); - ret = token; - goto out; + return token; } - mutex_lock(&opal_sensor_mutex); ret = opal_sensor_read(sensor_hndl, token, &data); switch (ret) { case OPAL_ASYNC_COMPLETION: @@ -70,9 +65,7 @@ int opal_get_sensor_data(u32 sensor_hndl } out_token: - mutex_unlock(&opal_sensor_mutex); opal_async_release_token(token); -out: return ret; } EXPORT_SYMBOL_GPL(opal_get_sensor_data);
[PATCH kernel v7 12/31] powerpc/spapr: vfio: Switch from iommu_table to new iommu_table_group
Modern IBM POWERPC systems support multiple (currently two) TCE tables per IOMMU group (a.k.a. PE). This adds an iommu_table_group container for TCE tables. Right now just one table is supported.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 Documentation/vfio.txt                      |  23 ++
 arch/powerpc/include/asm/iommu.h            |  18 +++--
 arch/powerpc/kernel/iommu.c                 |  34 ++++
 arch/powerpc/platforms/powernv/pci-ioda.c   |  38 +++++
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |  17 ++--
 arch/powerpc/platforms/powernv/pci.c        |   2 +-
 arch/powerpc/platforms/powernv/pci.h        |   4 +-
 arch/powerpc/platforms/pseries/iommu.c      |   9 ++-
 drivers/vfio/vfio_iommu_spapr_tce.c         | 120 ++++++++++++
 9 files changed, 183 insertions(+), 82 deletions(-)

diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index 96978ec..94328c8 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -427,6 +427,29 @@ The code flow from the example above should be slightly changed:

+5) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/
+VFIO_IOMMU_DISABLE and implements 2 new ioctls:
+VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
+(which are unsupported in v1 IOMMU).
+
+PPC64 paravirtualized guests generate a lot of map/unmap requests,
+and the handling of those includes pinning/unpinning pages and updating
+mm::locked_vm counter to make sure we do not exceed the rlimit.
+The v2 IOMMU splits accounting and pinning into separate operations:
+
+- VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls
+receive a user space address and size of the block to be pinned.
+Bisecting is not supported and VFIO_IOMMU_UNREGISTER_MEMORY is expected to
+be called with the exact address and size used for registering
+the memory block. The userspace is not expected to call these often.
+The ranges are stored in a linked list in a VFIO container.
+ +- VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the actual +IOMMU table and do not do pinning; instead these check that the userspace +address is from pre-registered range. + +This separation helps in optimizing DMA for guests. + --- [1] VFIO was originally an acronym for Virtual Function I/O in its diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index eb75726..667aa1a 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -90,9 +90,7 @@ struct iommu_table { struct iommu_pool pools[IOMMU_NR_POOLS]; unsigned long *it_map; /* A simple allocation bitmap for now */ unsigned long it_page_shift;/* table iommu page size */ -#ifdef CONFIG_IOMMU_API - struct iommu_group *it_group; -#endif + struct iommu_table_group *it_group; struct iommu_table_ops *it_ops; void (*set_bypass)(struct iommu_table *tbl, bool enable); }; @@ -126,14 +124,24 @@ extern void iommu_free_table(struct iommu_table *tbl, const char *node_name); */ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl, int nid); + +#define IOMMU_TABLE_GROUP_MAX_TABLES 1 + +struct iommu_table_group { #ifdef CONFIG_IOMMU_API -extern void iommu_register_group(struct iommu_table *tbl, + struct iommu_group *group; +#endif + struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES]; +}; + +#ifdef CONFIG_IOMMU_API +extern void iommu_register_group(struct iommu_table_group *table_group, int pci_domain_number, unsigned long pe_num); extern int iommu_add_device(struct device *dev); extern void iommu_del_device(struct device *dev); extern int __init tce_iommu_bus_notifier_init(void); #else -static inline void iommu_register_group(struct iommu_table *tbl, +static inline void iommu_register_group(struct iommu_table_group *table_group, int pci_domain_number, unsigned long pe_num) { diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index b39d00a..fd49c8e 100644 --- a/arch/powerpc/kernel/iommu.c +++ 
b/arch/powerpc/kernel/iommu.c @@ -712,17 +712,20 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid) struct iommu_table *iommu_table_alloc(int node) { - struct iommu_table *tbl; + struct iommu_table_group *table_group; - tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, node); + table_group = kzalloc_node(sizeof(struct iommu_table_group), GFP_KERNEL, + node); + table_group-tables[0].it_group = table_group; - return tbl; + return table_group-tables[0]; } void iommu_free_table(struct iommu_table *tbl, const char *node_name) { unsigned long bitmap_sz;
[PATCH kernel v7 13/31] vfio: powerpc/spapr: powerpc/iommu: Rework IOMMU ownership control
This replaces iommu_take_ownership()/iommu_release_ownership() calls with the callback calls and it is up to the platform code to call iommu_take_ownership()/iommu_release_ownership() if needed. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- arch/powerpc/include/asm/iommu.h| 4 +-- arch/powerpc/kernel/iommu.c | 50 - drivers/vfio/vfio_iommu_spapr_tce.c | 4 +-- 3 files changed, 42 insertions(+), 16 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 667aa1a..b9e50d3 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -225,8 +225,8 @@ extern unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry); extern void iommu_flush_tce(struct iommu_table *tbl); -extern int iommu_take_ownership(struct iommu_table *tbl); -extern void iommu_release_ownership(struct iommu_table *tbl); +extern int iommu_take_ownership(struct iommu_table_group *table_group); +extern void iommu_release_ownership(struct iommu_table_group *table_group); extern enum dma_data_direction iommu_tce_direction(unsigned long tce); extern unsigned long iommu_direction_to_tce_perm(enum dma_data_direction dir); diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index fd49c8e..7d6089b 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -1050,7 +1050,7 @@ int iommu_tce_build(struct iommu_table *tbl, unsigned long entry, } EXPORT_SYMBOL_GPL(iommu_tce_build); -int iommu_take_ownership(struct iommu_table *tbl) +static int iommu_table_take_ownership(struct iommu_table *tbl) { unsigned long sz = (tbl-it_size + 7) 3; @@ -1064,19 +1064,36 @@ int iommu_take_ownership(struct iommu_table *tbl) memset(tbl-it_map, 0xff, sz); - /* -* Disable iommu bypass, otherwise the user can DMA to all of -* our physical memory via the bypass window instead of just -* the pages that has been explicitly mapped into the iommu -*/ - if (tbl-set_bypass) - tbl-set_bypass(tbl, false); + return 
0; +} + +static void iommu_table_release_ownership(struct iommu_table *tbl); + +int iommu_take_ownership(struct iommu_table_group *table_group) +{ + int i, j, rc = 0; + + for (i = 0; i IOMMU_TABLE_GROUP_MAX_TABLES; ++i) { + struct iommu_table *tbl = table_group-tables[i]; + + if (!tbl-it_map) + continue; + + rc = iommu_table_take_ownership(tbl); + if (rc) { + for (j = 0; j i; ++j) + iommu_table_release_ownership( + table_group-tables[j]); + + return rc; + } + } return 0; } EXPORT_SYMBOL_GPL(iommu_take_ownership); -void iommu_release_ownership(struct iommu_table *tbl) +static void iommu_table_release_ownership(struct iommu_table *tbl) { unsigned long sz = (tbl-it_size + 7) 3; @@ -1086,9 +1103,18 @@ void iommu_release_ownership(struct iommu_table *tbl) if (tbl-it_offset == 0) set_bit(0, tbl-it_map); - /* The kernel owns the device now, we can restore the iommu bypass */ - if (tbl-set_bypass) - tbl-set_bypass(tbl, true); +} + +extern void iommu_release_ownership(struct iommu_table_group *table_group) +{ + int i; + + for (i = 0; i IOMMU_TABLE_GROUP_MAX_TABLES; ++i) { + struct iommu_table *tbl = table_group-tables[i]; + + if (tbl-it_map) + iommu_table_release_ownership(tbl); + } } EXPORT_SYMBOL_GPL(iommu_release_ownership); diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index d61aad2..9f38351 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -535,7 +535,7 @@ static int tce_iommu_attach_group(void *iommu_data, goto unlock_exit; } - ret = iommu_take_ownership(table_group-tables[0]); + ret = iommu_take_ownership(table_group); if (!ret) container-grp = iommu_group; @@ -572,7 +572,7 @@ static void tce_iommu_detach_group(void *iommu_data, table_group = iommu_group_get_iommudata(iommu_group); BUG_ON(!table_group); - iommu_release_ownership(table_group-tables[0]); + iommu_release_ownership(table_group); unlock_exit: mutex_unlock(container-lock); -- 2.0.0 ___ Linuxppc-dev mailing list 
Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH kernel v7 08/31] vfio: powerpc/spapr: Rework groups attaching
This is to make extended ownership and multiple groups support patches simpler for review. This is a mechanical patch. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- drivers/vfio/vfio_iommu_spapr_tce.c | 38 ++--- 1 file changed, 23 insertions(+), 15 deletions(-) diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index ec5ee83..244c958 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -478,16 +478,21 @@ static int tce_iommu_attach_group(void *iommu_data, iommu_group_id(container-tbl-it_group), iommu_group_id(iommu_group)); ret = -EBUSY; - } else if (container-enabled) { + goto unlock_exit; + } + + if (container-enabled) { pr_err(tce_vfio: attaching group #%u to enabled container\n, iommu_group_id(iommu_group)); ret = -EBUSY; - } else { - ret = iommu_take_ownership(tbl); - if (!ret) - container-tbl = tbl; + goto unlock_exit; } + ret = iommu_take_ownership(tbl); + if (!ret) + container-tbl = tbl; + +unlock_exit: mutex_unlock(container-lock); return ret; @@ -505,18 +510,21 @@ static void tce_iommu_detach_group(void *iommu_data, pr_warn(tce_vfio: detaching group #%u, expected group is #%u\n, iommu_group_id(iommu_group), iommu_group_id(tbl-it_group)); - } else { - if (container-enabled) { - pr_warn(tce_vfio: detaching group #%u from enabled container, forcing disable\n, - iommu_group_id(tbl-it_group)); - tce_iommu_disable(container); - } + goto unlock_exit; + } - /* pr_debug(tce_vfio: detaching group #%u from iommu %p\n, - iommu_group_id(iommu_group), iommu_group); */ - container-tbl = NULL; - iommu_release_ownership(tbl); + if (container-enabled) { + pr_warn(tce_vfio: detaching group #%u from enabled container, forcing disable\n, + iommu_group_id(tbl-it_group)); + tce_iommu_disable(container); } + + /* pr_debug(tce_vfio: detaching group #%u from iommu %p\n, + iommu_group_id(iommu_group), iommu_group); */ + container-tbl = NULL; + iommu_release_ownership(tbl); + +unlock_exit: 
mutex_unlock(container-lock); } -- 2.0.0 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH kernel v7 25/31] vfio: powerpc/spapr: powerpc/powernv/ioda2: Rework ownership
Before the IOMMU user (VFIO) would take control over the IOMMU table belonging to a specific IOMMU group. This approach did not allow sharing tables between IOMMU groups attached to the same container. This introduces a new IOMMU ownership flavour when the user can not just control the existing IOMMU table but remove/create tables on demand. If an IOMMU implements a set_ownership() callback, this lets the user have full control over the IOMMU group. When the ownership is taken, the platform code removes all the windows so the caller must create them. Before returning the ownership back to the platform code, VFIO unprograms and removes all the tables it created. Old-style ownership is still supported allowing VFIO to run on older P5IOC2 and IODA IO controllers. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- Changes: v6: * fixed commit log that VFIO removes tables before passing ownership back to the platform code, not userspace --- arch/powerpc/platforms/powernv/pci-ioda.c | 30 +++--- drivers/vfio/vfio_iommu_spapr_tce.c | 51 --- 2 files changed, 66 insertions(+), 15 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 6939402..bc36cf1 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1626,11 +1626,33 @@ static void pnv_ioda2_set_ownership(struct iommu_table_group *table_group, { struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe, table_group); - if (enable) - iommu_take_ownership(table_group); - else - iommu_release_ownership(table_group); + if (enable) { + pnv_pci_ioda2_unset_window(pe-table_group, 0); + pnv_pci_free_table(pe-table_group.tables[0]); + } else { + struct iommu_table *tbl = pe-table_group.tables[0]; + int64_t rc; + rc = pnv_pci_ioda2_create_table(pe-table_group, 0, + IOMMU_PAGE_SHIFT_4K, + pe-phb-ioda.m32_pci_base, + POWERNV_IOMMU_DEFAULT_LEVELS, tbl); + if (rc) { + pe_err(pe, Failed to create 32-bit TCE 
table, err %ld, + rc); + return; + } + + iommu_init_table(tbl, pe-phb-hose-node); + + rc = pnv_pci_ioda2_set_window(pe-table_group, 0, tbl); + if (rc) { + pe_err(pe, Failed to configure 32-bit TCE table, err %ld\n, + rc); + pnv_pci_free_table(tbl); + return; + } + } pnv_pci_ioda2_set_bypass(pe, !enable); } diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index 7c3c215..9aeaed6 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -226,18 +226,11 @@ static int tce_iommu_clear(struct tce_container *container, static void tce_iommu_release(void *iommu_data) { struct tce_container *container = iommu_data; - struct iommu_table *tbl; - struct iommu_table_group *table_group; WARN_ON(container-grp); - if (container-grp) { - table_group = iommu_group_get_iommudata(container-grp); - tbl = table_group-tables[0]; - tce_iommu_clear(container, tbl, tbl-it_offset, tbl-it_size); - + if (container-grp) tce_iommu_detach_group(iommu_data, container-grp); - } tce_iommu_disable(container); @@ -553,14 +546,24 @@ static int tce_iommu_attach_group(void *iommu_data, if (!table_group-ops || !table_group-ops-set_ownership) { ret = iommu_take_ownership(table_group); + } else if (!table_group-ops-create_table || + !table_group-ops-set_window) { + WARN_ON_ONCE(1); + ret = -EFAULT; } else { /* * Disable iommu bypass, otherwise the user can DMA to all of * our physical memory via the bypass window instead of just * the pages that has been explicitly mapped into the iommu */ + struct iommu_table tbltmp = { 0 }, *tbl = tbltmp; + table_group-ops-set_ownership(table_group, true); - ret = 0; + ret = table_group-ops-create_table(table_group, 0, + IOMMU_PAGE_SHIFT_4K, + table_group-tce32_size, 1, tbl); + if (!ret) + ret = table_group-ops-set_window(table_group, 0, tbl); } if (ret) @@ -579,6 +582,7 @@ static void tce_iommu_detach_group(void
Re: [PATCH] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable zones
On 03/27/2015 12:28 PM, Nishanth Aravamudan wrote:
> @@ -2585,7 +2585,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
>  	for (i = 0; i <= ZONE_NORMAL; i++) {
>  		zone = &pgdat->node_zones[i];
> -		if (!populated_zone(zone))
> +		if (!populated_zone(zone) || !zone_reclaimable(zone))
>  			continue;
>  		pfmemalloc_reserve += min_wmark_pages(zone);

Do you really want zone_reclaimable()? Or do you want something more direct like zone_reclaimable_pages(zone) == 0?
[PATCH v3 1/2] powerpc/pseries: Simplify check for suspendability during suspend/migration
During a suspend/migration operation we must wait for the VASI state reported by the hypervisor to become Suspending prior to making the ibm,suspend-me RTAS call. Callers of rtas_ibm_suspend_me() pass a vasi_state variable that exposes the VASI state to the caller. This is unnecessary, as the caller only really cares about three conditions: if there is an error we should bail out; success indicates we have suspended and woken back up, so proceed to the device tree update; or we are not suspendable yet, so try calling rtas_ibm_suspend_me again shortly.

This patch removes the extraneous vasi_state variable and simply uses the return code to communicate how to proceed. We either succeed, fail, or get -EAGAIN, in which case we sleep for a second before trying to call rtas_ibm_suspend_me again. The behaviour of ppc_rtas() remains the same, but migrate_store() now returns the propagated error code on failure. Previously -1 was returned from migrate_store() in the failure case, which equates to -EPERM and was clearly wrong.
Signed-off-by: Tyrel Datwyler tyr...@linux.vnet.ibm.com Cc: Nathan Fontenont nf...@linux.vnet.ibm.com Cc: Cyril Bur cyril...@gmail.com --- Changes in v3: - Updated changelog with behaviour change of migrate_store() return code Changes in v2: - Addressed Cyril's comments as follow: - Removed unused vasi_rc variable - Kept return behavior of ppc_rtas the same in the case of VASI error - Updated rtas_ibm_suspend_me function definition for !CONFIG_PPC_PSERIES arch/powerpc/include/asm/rtas.h | 2 +- arch/powerpc/kernel/rtas.c| 26 +- arch/powerpc/platforms/pseries/mobility.c | 9 +++-- 3 files changed, 17 insertions(+), 20 deletions(-) diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h index 2e23e92..fc85eb0 100644 --- a/arch/powerpc/include/asm/rtas.h +++ b/arch/powerpc/include/asm/rtas.h @@ -327,7 +327,7 @@ extern int rtas_suspend_cpu(struct rtas_suspend_me_data *data); extern int rtas_suspend_last_cpu(struct rtas_suspend_me_data *data); extern int rtas_online_cpus_mask(cpumask_var_t cpus); extern int rtas_offline_cpus_mask(cpumask_var_t cpus); -extern int rtas_ibm_suspend_me(u64 handle, int *vasi_return); +extern int rtas_ibm_suspend_me(u64 handle); struct rtc_time; extern unsigned long rtas_get_boot_time(void); diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c index 21c45a2..b9a7b89 100644 --- a/arch/powerpc/kernel/rtas.c +++ b/arch/powerpc/kernel/rtas.c @@ -897,7 +897,7 @@ int rtas_offline_cpus_mask(cpumask_var_t cpus) } EXPORT_SYMBOL(rtas_offline_cpus_mask); -int rtas_ibm_suspend_me(u64 handle, int *vasi_return) +int rtas_ibm_suspend_me(u64 handle) { long state; long rc; @@ -919,13 +919,11 @@ int rtas_ibm_suspend_me(u64 handle, int *vasi_return) printk(KERN_ERR rtas_ibm_suspend_me: vasi_state returned %ld\n,rc); return rc; } else if (state == H_VASI_ENABLED) { - *vasi_return = RTAS_NOT_SUSPENDABLE; - return 0; + return -EAGAIN; } else if (state != H_VASI_SUSPENDING) { printk(KERN_ERR rtas_ibm_suspend_me: vasi_state 
returned state %ld\n, state); - *vasi_return = -1; - return 0; + return -EIO; } if (!alloc_cpumask_var(offline_mask, GFP_TEMPORARY)) @@ -972,7 +970,7 @@ out: return atomic_read(data.error); } #else /* CONFIG_PPC_PSERIES */ -int rtas_ibm_suspend_me(u64 handle, int *vasi_return) +int rtas_ibm_suspend_me(u64 handle) { return -ENOSYS; } @@ -1022,7 +1020,6 @@ asmlinkage int ppc_rtas(struct rtas_args __user *uargs) unsigned long flags; char *buff_copy, *errbuf = NULL; int nargs, nret, token; - int rc; if (!capable(CAP_SYS_ADMIN)) return -EPERM; @@ -1054,15 +1051,18 @@ asmlinkage int ppc_rtas(struct rtas_args __user *uargs) if (token == ibm_suspend_me_token) { /* -* rtas_ibm_suspend_me assumes args are in cpu endian, or at least the -* hcall within it requires it. +* rtas_ibm_suspend_me assumes the streamid handle is in cpu +* endian, or at least the hcall within it requires it. */ - int vasi_rc = 0; + int rc = 0; u64 handle = ((u64)be32_to_cpu(args.args[0]) 32) | be32_to_cpu(args.args[1]); - rc = rtas_ibm_suspend_me(handle, vasi_rc); - args.rets[0] = cpu_to_be32(vasi_rc); - if (rc) + rc = rtas_ibm_suspend_me(handle); + if (rc == -EAGAIN) + args.rets[0] = cpu_to_be32(RTAS_NOT_SUSPENDABLE); + else if (rc == -EIO) + args.rets[0] = cpu_to_be32(-1); + else if (rc)
[PATCH] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable zones
Based upon 675becce15 ("mm: vmscan: do not throttle based on pfmemalloc reserves if node has no ZONE_NORMAL") from Mel.

We have a system with the following topology:

(0) root @ br30p03: /root # numactl -H
available: 3 nodes (0,2-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 0 size: 28273 MB
node 0 free: 27323 MB
node 2 cpus:
node 2 size: 16384 MB
node 2 free: 0 MB
node 3 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 3 size: 30533 MB
node 3 free: 13273 MB
node distances:
node   0   2   3
  0:  10  20  20
  2:  20  10  20
  3:  20  20  10

Node 2 has no free memory, because:

# cat /sys/devices/system/node/node2/hugepages/hugepages-16777216kB/nr_hugepages
1

This leads to the following zoneinfo:

Node 2, zone      DMA
  pages free     0
        min      1840
        low      2300
        high     2760
        scanned  0
        spanned  262144
        present  262144
        managed  262144
...
  all_unreclaimable: 1

If one then attempts to allocate some normal 16M hugepages:

echo 37 > /proc/sys/vm/nr_hugepages

the echo never returns and kswapd2 consumes CPU cycles.

This is because throttle_direct_reclaim ends up calling wait_event(pfmemalloc_wait, pfmemalloc_watermark_ok...). pfmemalloc_watermark_ok() in turn checks all zones on the node to see if there are any reserves, and if so, indicates the watermarks are ok by checking for sufficient free pages.

675becce15 already added a condition for memoryless nodes. In this case, though, the node has memory; it is just all consumed (and not reclaimable). Effectively, though, the result is the same on this call to pfmemalloc_watermark_ok(), so this seems like a reasonable additional condition.

With this change, the afore-mentioned 16M hugepage allocation succeeds and correctly round-robins between nodes 0 and 3.
Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com

diff --git a/mm/vmscan.c b/mm/vmscan.c
index dcd90c8..033c2b7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2585,7 +2585,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
 	for (i = 0; i <= ZONE_NORMAL; i++) {
 		zone = &pgdat->node_zones[i];
-		if (!populated_zone(zone))
+		if (!populated_zone(zone) || !zone_reclaimable(zone))
 			continue;
 		pfmemalloc_reserve += min_wmark_pages(zone);
Re: [PATCH] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable zones
[ Sorry, typo'd anton's address ] On 27.03.2015 [12:28:50 -0700], Nishanth Aravamudan wrote: Based upon 675becce15 (mm: vmscan: do not throttle based on pfmemalloc reserves if node has no ZONE_NORMAL) from Mel. We have a system with the following topology: (0) root @ br30p03: /root # numactl -H available: 3 nodes (0,2-3) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 node 0 size: 28273 MB node 0 free: 27323 MB node 2 cpus: node 2 size: 16384 MB node 2 free: 0 MB node 3 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 node 3 size: 30533 MB node 3 free: 13273 MB node distances: node 0 2 3 0: 10 20 20 2: 20 10 20 3: 20 20 10 Node 2 has no free memory, because: # cat /sys/devices/system/node/node2/hugepages/hugepages-16777216kB/nr_hugepages 1 This leads to the following zoneinfo: Node 2, zone DMA pages free 0 min 1840 low 2300 high 2760 scanned 0 spanned 262144 present 262144 managed 262144 ... all_unreclaimable: 1 If one then attempts to allocate some normal 16M hugepages: echo 37 /proc/sys/vm/nr_hugepages The echo enver returns and kswapd2 consumes CPU cycles. This is because throttle_direct_reclaim ends up calling wait_event(pfmemalloc_wait, pfmemalloc_watermark_ok...). pfmemalloc_watermark_ok() in turn checks all zones on the node and see if the there are any reserves, and if so, then indicates the watermarks are ok, by seeing if there are sufficient free pages. 675becce15 added a condition already for memoryless nodes. In this case, though, the node has memory, it is just all consumed (and not recliamable). Effectively, though, the result is the same on this call to pfmemalloc_watermark_ok() and thus seems like a reasonable additional condition. With this change, the afore-mentioned 16M hugepage allocation succeeds and correctly round-robins between Nodes 1 and 3. 
Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com diff --git a/mm/vmscan.c b/mm/vmscan.c index dcd90c8..033c2b7 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2585,7 +2585,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat) for (i = 0; i = ZONE_NORMAL; i++) { zone = pgdat-node_zones[i]; - if (!populated_zone(zone)) + if (!populated_zone(zone) || !zone_reclaimable(zone)) continue; pfmemalloc_reserve += min_wmark_pages(zone); ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable zones
On Fri, Mar 27, 2015 at 3:28 PM, Nishanth Aravamudan n...@linux.vnet.ibm.com wrote: Based upon 675becce15 (mm: vmscan: do not throttle based on pfmemalloc reserves if node has no ZONE_NORMAL) from Mel. We have a system with the following topology: (0) root @ br30p03: /root # numactl -H available: 3 nodes (0,2-3) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 node 0 size: 28273 MB node 0 free: 27323 MB node 2 cpus: node 2 size: 16384 MB node 2 free: 0 MB node 3 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 node 3 size: 30533 MB node 3 free: 13273 MB node distances: node 0 2 3 0: 10 20 20 2: 20 10 20 3: 20 20 10 Node 2 has no free memory, because: # cat /sys/devices/system/node/node2/hugepages/hugepages-16777216kB/nr_hugepages 1 This leads to the following zoneinfo: Node 2, zone DMA pages free 0 min 1840 low 2300 high 2760 scanned 0 spanned 262144 present 262144 managed 262144 ... all_unreclaimable: 1 If one then attempts to allocate some normal 16M hugepages: echo 37 /proc/sys/vm/nr_hugepages The echo enver returns and kswapd2 consumes CPU cycles. This is because throttle_direct_reclaim ends up calling wait_event(pfmemalloc_wait, pfmemalloc_watermark_ok...). pfmemalloc_watermark_ok() in turn checks all zones on the node and see if the there are any reserves, and if so, then indicates the watermarks are ok, by seeing if there are sufficient free pages. 675becce15 added a condition already for memoryless nodes. In this case, though, the node has memory, it is just all consumed (and not recliamable). Effectively, though, the result is the same on this call to pfmemalloc_watermark_ok() and thus seems like a reasonable additional condition. With this change, the afore-mentioned 16M hugepage allocation succeeds and correctly round-robins between Nodes 1 and 3. 
Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com Reviewed-by: Dan Streetman ddstr...@ieee.org diff --git a/mm/vmscan.c b/mm/vmscan.c index dcd90c8..033c2b7 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2585,7 +2585,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat) for (i = 0; i = ZONE_NORMAL; i++) { zone = pgdat-node_zones[i]; - if (!populated_zone(zone)) + if (!populated_zone(zone) || !zone_reclaimable(zone)) continue; pfmemalloc_reserve += min_wmark_pages(zone); ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3] powerpc: Use PFN_PHYS() to avoid truncating the physical address
On Fri, 2015-03-27 at 10:45 +1100, Michael Ellerman wrote:
> On Thu, 2015-03-26 at 10:31 -0500, Emil Medve wrote:
> > Hello Kumar,
> >
> > On 03/26/2015 10:18 AM, Kumar Gala wrote:
> > > Why no commit message with what issue this change was trying to fix?
> >
> > A while back, when I attempted to remove bootmem (in favor of just plain
> > memblock, as in powerpc land bootmem was just a wrapper to memblock
> > anyway) I ran at some point into a problem with an intermediate address
> > value because of this '<< PAGE_SHIFT' on the wrong width variable. Using
> > PFN_PHYS() took care of it (it has a cast) so I decided to get this
> > defensive patch applied. Since then, I dropped my bootmem/memblock patches
> > in favor of Anton's (Blanchard) work, so my concrete issue example is
> > somewhat gone
>
> I'm not a big fan of it unless it's actually fixing an issue. It's a lot
> of churn and the end result is less readable IMHO.

It is fixing an issue -- the issue is that there are overflow errors in the code.

Some of the places Emil fixed are only for platforms that don't have physical addresses larger than pointers, or have the needed casts, or are known to be dealing with lowmem, but others aren't. E.g. page_is_ram() and devmem_is_allowed() are buggy on 32-bit with 64-bit physical. flush_dcache_icache_page() is buggy on mpc86xx with more than 4 GiB RAM -- though that would still be buggy even with this change, due to __flush_dcache_icache_phys taking unsigned long. The entire concept of that function doesn't work for sizeof(phys_addr_t) > sizeof(void *), so in this case 86xx should be using the booke code instead.

Even in the places where overflow can't happen due to the above circumstances (other than having the needed cast), it's setting a bad example that can be copied to places where it will break, or the circumstances of the code could change (e.g. currently 64-bit-only code being used on 32-bit).

-Scott
Re: [PATCH 1/3] perf/e6500: Make event translations available in sysfs
crickets. How do we make progress in this area?

(a) Can we assume Andi's json format is acceptable? We would like to know this so we don't have to reformat our data more than once.

(b) Would an acceptable interim resolution to the 'download area' problem be to take Andi's "perf: Add support for full Intel event lists v8" and change the 'download' to refer to tools/perf/event-tables/?

(c) If not, given we don't know how to get out of the current status quo, can this patch series still be applied, given the original complaint was the size of our events-list.h (whereas power7-events-list.h is almost twice the size)? If not, patch 3/3 in this series is still valid no matter what, and it should still be applied (let us know if we need to resubmit).

Thanks,
Kim

On Mon, 16 Feb 2015 10:10:45 -0600 Tom Huynh tommy.xhu...@gmail.com wrote:
> On Mon, Feb 09, 2015 at 09:40:19PM +0100, Andi Kleen wrote:
> > > I'll NAK any external 'download area' (and I told that Andi before):
> > > tools/perf/event-tables/ or so is a good enough 'download area' with
> > > fast enough update cycles.
> >
> > The proposal was to put it on kernel.org, similar to how external
> > firmware blobs are distributed.
> >
> > CPU event lists are data sheets, so are like firmware. They do not
> > follow the normal kernel code licenses. They are not source code.
> > They cannot be reviewed in the normal way.
>
> Could you provide more details about the license and review concern?
> How are the event list files different from hardware-specific
> information (e.g. reg mapping) in header files?
>
> > > If any 'update' of event descriptions is needed it can happen through
> > > the distro package mechanism, or via a simple 'git pull' if it's
> > > compiled directly. Lets not overengineer this with any dependence on
> > > an external site and with a separate update mechanism - lets just get
> > > the tables into tools/ and see it from there...
> >
> > That experiment has been already done for oprofile, didn't work very
> > well.
>
> Please excuse my ignorance, could you say exactly what didn't work well
> for oprofile? Ingo's suggestion seems good to me because these event
> files will be transparent to the users, and it's just more convenient
> not having to go to a website to look for the event file that matches
> the machine to download. The distro package or the perf make mechanism
> can put these files into the appropriate directory. The users who are
> not perf developers won't need to know about these files.
>
> - Tom
[PATCH v2] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages
On 27.03.2015 [13:17:59 -0700], Dave Hansen wrote: On 03/27/2015 12:28 PM, Nishanth Aravamudan wrote:

@@ -2585,7 +2585,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
 	for (i = 0; i <= ZONE_NORMAL; i++) {
 		zone = &pgdat->node_zones[i];
-		if (!populated_zone(zone))
+		if (!populated_zone(zone) || !zone_reclaimable(zone))
 			continue;
 		pfmemalloc_reserve += min_wmark_pages(zone);

Do you really want zone_reclaimable()? Or do you want something more direct like zone_reclaimable_pages(zone) == 0?

Yeah, I guess in my testing this worked out to be the same, since zone_reclaimable_pages(zone) is 0 and so zone_reclaimable(zone) will always be false. Thanks!

Based upon 675becce15 ("mm: vmscan: do not throttle based on pfmemalloc reserves if node has no ZONE_NORMAL") from Mel.

We have a system with the following topology:

# numactl -H
available: 3 nodes (0,2-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 0 size: 28273 MB
node 0 free: 27323 MB
node 2 cpus:
node 2 size: 16384 MB
node 2 free: 0 MB
node 3 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 3 size: 30533 MB
node 3 free: 13273 MB
node distances:
node   0   2   3
  0:  10  20  20
  2:  20  10  20
  3:  20  20  10

Node 2 has no free memory, because:

# cat /sys/devices/system/node/node2/hugepages/hugepages-16777216kB/nr_hugepages
1

This leads to the following zoneinfo:

Node 2, zone DMA
  pages free     0
        min      1840
        low      2300
        high     2760
        scanned  0
        spanned  262144
        present  262144
        managed  262144
...
  all_unreclaimable: 1

If one then attempts to allocate some normal 16M hugepages via

echo 37 > /proc/sys/vm/nr_hugepages

the echo never returns and kswapd2 consumes CPU cycles. This is because throttle_direct_reclaim ends up calling wait_event(pfmemalloc_wait, pfmemalloc_watermark_ok...). pfmemalloc_watermark_ok() in turn checks all zones on the node and, if any of them have reserves, decides whether the watermarks are ok by seeing if there are sufficient free pages.
675becce15 already added a condition for memoryless nodes. In this case, though, the node has memory; it is just all consumed (and not reclaimable). Effectively, though, the result of this call to pfmemalloc_watermark_ok() is the same, so this seems like a reasonable additional condition.

With this change, the afore-mentioned 16M hugepage allocation attempt succeeds and correctly round-robins between Nodes 1 and 3.

Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com
---
v1 -> v2: Check against zone_reclaimable_pages, rather than zone_reclaimable, based upon feedback from Dave Hansen.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e8eadd71bac..c627fa4c991f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2646,7 +2646,8 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
 	for (i = 0; i <= ZONE_NORMAL; i++) {
 		zone = &pgdat->node_zones[i];
-		if (!populated_zone(zone))
+		if (!populated_zone(zone) ||
+		    zone_reclaimable_pages(zone) == 0)
 			continue;
 		pfmemalloc_reserve += min_wmark_pages(zone);
[PATCH v4 1/2] powerpc: Add a proper syscall for switching endianness
We currently have a special syscall for switching endianness. This is syscall number 0x1ebe, which is handled explicitly in the 64-bit syscall exception entry.

That has a few problems: firstly, the syscall number is outside of the usual range, which confuses various tools. For example strace doesn't recognise the syscall at all. Secondly, it's handled explicitly as a special case in the syscall exception entry, which is complicated enough without it.

As a first step toward removing the special syscall, we need to add a regular syscall that implements the same functionality. The logic is simple: it just toggles the MSR_LE bit in the userspace MSR. This is the same as the special syscall, with the caveat that the special syscall clobbers fewer registers. This version clobbers r9-r12, XER, CTR, and CR0-1,5-7.

Signed-off-by: Michael Ellerman m...@ellerman.id.au
---
v3: Don't provide the syscall on 32-bit.
v4: No change.

 arch/powerpc/include/asm/systbl.h           |  1 +
 arch/powerpc/include/asm/unistd.h           |  2 +-
 arch/powerpc/include/uapi/asm/unistd.h      |  1 +
 arch/powerpc/kernel/entry_64.S              |  5 +
 arch/powerpc/kernel/syscalls.c              | 17 +
 arch/powerpc/kernel/systbl.S                |  2 ++
 arch/powerpc/kernel/systbl_chk.c            |  2 ++
 arch/powerpc/platforms/cell/spu_callbacks.c |  1 +
 8 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index 91062eef582f..f1863a138b4a 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -367,3 +367,4 @@ SYSCALL_SPU(getrandom)
 SYSCALL_SPU(memfd_create)
 SYSCALL_SPU(bpf)
 COMPAT_SYS(execveat)
+PPC64ONLY(switch_endian)

diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index 36b79c31eedd..f4f8b667d75b 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,7 +12,7 @@
 #include <uapi/asm/unistd.h>
-#define __NR_syscalls	363
+#define __NR_syscalls	364
 #define __NR__exit __NR_exit
 #define 
NR_syscalls __NR_syscalls

diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h
index ef5b5b1f3123..e4aa173dae62 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -385,5 +385,6 @@
 #define __NR_memfd_create	360
 #define __NR_bpf		361
 #define __NR_execveat		362
+#define __NR_switch_endian	363
 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */

diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index d180caf2d6de..afbc20019c2e 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -356,6 +356,11 @@ _GLOBAL(ppc64_swapcontext)
 	bl	sys_swapcontext
 	b	.Lsyscall_exit

+_GLOBAL(ppc_switch_endian)
+	bl	save_nvgprs
+	bl	sys_switch_endian
+	b	.Lsyscall_exit
+
 _GLOBAL(ret_from_fork)
 	bl	schedule_tail
 	REST_NVGPRS(r1)

diff --git a/arch/powerpc/kernel/syscalls.c b/arch/powerpc/kernel/syscalls.c
index b2702e87db0d..5fa92706444b 100644
--- a/arch/powerpc/kernel/syscalls.c
+++ b/arch/powerpc/kernel/syscalls.c
@@ -121,3 +121,20 @@ long ppc_fadvise64_64(int fd, int advice, u32 offset_high, u32 offset_low,
 	return sys_fadvise64(fd, (u64)offset_high << 32 | offset_low,
 			     (u64)len_high << 32 | len_low, advice);
 }
+
+long sys_switch_endian(void)
+{
+	struct thread_info *ti;
+
+	current->thread.regs->msr ^= MSR_LE;
+
+	/*
+	 * Set TIF_RESTOREALL so that r3 isn't clobbered on return to
+	 * userspace. That also has the effect of restoring the non-volatile
+	 * GPRs, so we saved them on the way in here.
+	 */
+	ti = current_thread_info();
+	ti->flags |= _TIF_RESTOREALL;
+
+	return 0;
+}

diff --git a/arch/powerpc/kernel/systbl.S b/arch/powerpc/kernel/systbl.S
index 7ab5d434e2ee..4d6b1d3a747f 100644
--- a/arch/powerpc/kernel/systbl.S
+++ b/arch/powerpc/kernel/systbl.S
@@ -22,6 +22,7 @@
 #define PPC_SYS(func)		.llong	DOTSYM(ppc_##func),DOTSYM(ppc_##func)
 #define OLDSYS(func)		.llong	DOTSYM(sys_ni_syscall),DOTSYM(sys_ni_syscall)
 #define SYS32ONLY(func)	.llong	DOTSYM(sys_ni_syscall),DOTSYM(compat_sys_##func)
+#define PPC64ONLY(func)	.llong	DOTSYM(ppc_##func),DOTSYM(sys_ni_syscall)
 #define SYSX(f, f3264, f32)	.llong	DOTSYM(f),DOTSYM(f3264)
 #else
 #define SYSCALL(func)		.long	sys_##func
@@ -29,6 +30,7 @@
 #define PPC_SYS(func)		.long	ppc_##func
 #define OLDSYS(func)		.long	sys_##func
 #define SYS32ONLY(func)	.long	sys_##func
+#define PPC64ONLY(func)	.long	sys_ni_syscall
 #define SYSX(f, f3264, f32)	.long	f32
 #endif
 #define SYSCALL_SPU(func)	SYSCALL(func)

diff --git
Re: [PATCH v3 2/2] selftests/powerpc: Add a test of the switch_endian() syscall
On Thu, 2015-03-26 at 11:54 +0530, Anshuman Khandual wrote: On 03/26/2015 06:06 AM, Michael Ellerman wrote: On Wed, 2015-03-25 at 17:02 +0530, Anshuman Khandual wrote: On 03/25/2015 10:58 AM, Michael Ellerman wrote: On Wed, 2015-03-18 at 16:04 +1100, Michael Ellerman wrote: On Tue, 2015-03-17 at 11:35 +0530, Anshuman Khandual wrote: On 03/17/2015 04:34 AM, Michael Ellerman wrote:

What are you seeing exactly?

I am running on a BE PKVM guest but compiling the test case on a different BE machine which has a newer version of the compiler.

cc (GCC) 4.8.3 20140624

cc -O2 -Wall -g -nostdlib -m64 -c -o check.o check.S
objcopy -j .text --reverse-bytes=4 -O binary check.o check-reversed.o
hexdump -v -e '/1 ".byte 0x%02X\n"' check-reversed.o > check-reversed.S
cc -O2 -Wall -g -nostdlib -m64 switch_endian_test.S check-reversed.S -o switch_endian_test

which looks very similar to the details you have provided above. Running on guest or host should not make any difference.

No it shouldn't. Can you try strace, that should give you the full result code. Also can you try gdb. You can't breakpoint in the wrong-endian region, but it looks like you're getting through that anyway. So try setting a breakpoint at line ~77, and you should be back in BE. Then you can single step and see where it errors out.

Did you try these?

Yeah. The test program is showing some strange behavior.

(1) Without strace: It just fails with a 176 return code as before.
(2) With strace: It works with return code 0 and prints everything!!

strace ./switch_endian_test
execve("./switch_endian_test", ["./switch_endian_test"], [/* 50 vars */]) = 0
SYS_363(0x, 0xaaae, 0xaaaf, 0xaab0, 0xaab1) = 6149008514797120170
write(1, "Hello wrong-endian world\n", 25Hello wrong-endian world
) = 25
SYS_363(0x19, 0x10010638, 0x19, 0xaab0, 0xaab1) = 25
write(1, "Hello right-endian world\n", 25Hello right-endian world
) = 25
write(1, "success: switch_endian_test\n", 28success: switch_endian_test
) = 28
exit(0) = ?
With GDB and breaking at line 77, it exits with a different exit code this time.

No, that's the same code: 176 == 0260 (octal).

30	cmpd	r3,r5
(gdb)
31	bne	1f
(gdb)
32	addi	r3,r15,6
(gdb)
33	cmpd	r3,r6
(gdb)
34	bne	1f
(gdb)
98	1:	li	r0, __NR_exit
(gdb)
99	sc
(gdb)
[Inferior 1 (process 6456) exited with code 0260]

And that makes sense, it's bailing because r6 doesn't match. In the setup we do:

	addi	r6, r15, 6

Where r15 is 0x, so:

	0x + 6 = 0xaab0

And when we exit the kernel masks the exit code in r3 with 0xff, so:

	0xaab0 & 0xff = 0xb0 = 176

So for some reason r6 does not contain our pattern. Can you do an info registers and see what's in r6?

Sure, here are the details.

(gdb)
98	1:	li	r0, __NR_exit
(gdb)
99	sc
(gdb) info registers
r0	0x11
r1	0x3360	70368744174432
r2	0x10018670	268535408
r3	0xaab0	6149008514797120176
r4	0xaaca	6149008514797120202
r5	0xaaaf	6149008514797120175
r6	0x4000	16384	<=
r7	0x12e4	268436196
r8	0x8001d033	9223372041149796403

Sigh. This is just a ■■ ■■■ ■■ on my part. At the end of the checking code we call write(), which is a syscall, and it clobbers the register state! Duh.

I think the reason you were seeing it and I wasn't is that on my system I have audit enabled, so we *always* go through the path that restores.

New patch sent.

cheers
Re: [PATCH 1/3] perf/e6500: Make event translations available in sysfs
Thanks for supporting the JSON format too.

> (c) If not, given we don't know how to get us out of the current status quo, can this patch series still be applied, given the original complaint was the size of our events-list.h

The Intel core event lists are far larger even (and will grow even more when uncore gets added).

> (whereas power7-events-list.h is almost twice the size)? If not, patch 3/3 in this series is still valid, no matter what, and it should still be applied (let us know if we need to resubmit).

Could also just leave out the downloader for now, so that you have to get your own event file and set it up with export EVENTMAP=...

That's basically the patchkit, minus one patch.

-Andi
[PATCH v4 2/2] selftests/powerpc: Add a test of the switch_endian() syscall
This adds a test of the switch_endian() syscall we added in the previous commit.

We test it by calling the endian switch syscall, and then executing some code in the other endian to check everything went as expected. That code checks that the registers we expect to be maintained are. If the endian switch failed to happen that code sequence will be illegal and cause the test to abort.

We then switch back to the original endian, do the same checks and finally write a success message and exit(0).

Signed-off-by: Michael Ellerman m...@ellerman.id.au
---
v3: Have the test switch back to the original endian.
v4: Add .gitignore. Drop the message write in the checking code - it clobbers some regs and breaks the second check.

 tools/testing/selftests/powerpc/Makefile               |  2 +-
 .../selftests/powerpc/switch_endian/.gitignore         |  2 +
 .../selftests/powerpc/switch_endian/Makefile           | 23 +
 .../selftests/powerpc/switch_endian/check.S            | 98 ++
 .../selftests/powerpc/switch_endian/common.h           |  6 ++
 .../powerpc/switch_endian/switch_endian_test.S         | 82 ++
 6 files changed, 212 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/powerpc/switch_endian/.gitignore
 create mode 100644 tools/testing/selftests/powerpc/switch_endian/Makefile
 create mode 100644 tools/testing/selftests/powerpc/switch_endian/check.S
 create mode 100644 tools/testing/selftests/powerpc/switch_endian/common.h
 create mode 100644 tools/testing/selftests/powerpc/switch_endian/switch_endian_test.S

diff --git a/tools/testing/selftests/powerpc/Makefile b/tools/testing/selftests/powerpc/Makefile
index 1d5e7ad2c460..85c24a2210b5 100644
--- a/tools/testing/selftests/powerpc/Makefile
+++ b/tools/testing/selftests/powerpc/Makefile
@@ -13,7 +13,7 @@
 CFLAGS := -Wall -O2 -flto -Wall -Werror -DGIT_VERSION='"$(GIT_VERSION)"' -I$(CURDIR)

 export CC CFLAGS

-TARGETS = pmu copyloops mm tm primitives stringloops
+TARGETS = pmu copyloops mm tm primitives stringloops switch_endian

 endif

diff --git 
a/tools/testing/selftests/powerpc/switch_endian/.gitignore b/tools/testing/selftests/powerpc/switch_endian/.gitignore
new file mode 100644
index ..89e762eab676
--- /dev/null
+++ b/tools/testing/selftests/powerpc/switch_endian/.gitignore
@@ -0,0 +1,2 @@
+switch_endian_test
+check-reversed.S

diff --git a/tools/testing/selftests/powerpc/switch_endian/Makefile b/tools/testing/selftests/powerpc/switch_endian/Makefile
new file mode 100644
index ..c7fefbf880b5
--- /dev/null
+++ b/tools/testing/selftests/powerpc/switch_endian/Makefile
@@ -0,0 +1,23 @@
+PROGS := switch_endian_test
+
+ASFLAGS += -O2 -Wall -g -nostdlib -m64
+
+all: $(PROGS)
+
+switch_endian_test: check-reversed.S
+
+check-reversed.o: check.o
+	objcopy -j .text --reverse-bytes=4 -O binary $< $@
+
+check-reversed.S: check-reversed.o
+	hexdump -v -e '/1 ".byte 0x%02X\n"' $< > $@
+
+run_tests: all
+	@-for PROG in $(PROGS); do \
+		./$$PROG; \
+	done;
+
+clean:
+	rm -f $(PROGS) *.o check-reversed.S
+
+.PHONY: all run_tests clean

diff --git a/tools/testing/selftests/powerpc/switch_endian/check.S b/tools/testing/selftests/powerpc/switch_endian/check.S
new file mode 100644
index ..026bd151a16b
--- /dev/null
+++ b/tools/testing/selftests/powerpc/switch_endian/check.S
@@ -0,0 +1,98 @@
+#include "common.h"
+
+/*
+ * Checks that registers contain what we expect, ie. they were not clobbered by
+ * the syscall.
+ *
+ * r15: pattern to check registers against.
+ *
+ * At the end r3 == 0 if everything's OK.
+ */
+	nop			# guaranteed to be illegal in reverse-endian
+	cmpd	r15,r3		# check r3
+	bne	1f
+	addi	r9,r15,4	# check r4
+	cmpd	r9,r4
+	bne	1f
+	lis	r9,0x00FF	# check CR
+	ori	r9,r9,0xF000
+	mfcr	r10
+	and	r10,r10,r9
+	cmpw	r9,r10
+	addi	r9,r15,34
+	bne	1f
+	addi	r9,r15,32	# check LR
+	mflr	r10
+	cmpd	r9,r10
+	bne	1f
+	addi	r9,r15,5	# check r5
+	cmpd	r9,r5
+	bne	1f
+	addi	r9,r15,6	# check r6
+	cmpd	r9,r6
+	bne	1f
+	addi	r9,r15,7	# check r7
+	cmpd	r9,r7
+	bne	1f
+	addi	r9,r15,8	# check r8
+	cmpd	r9,r8
+	bne	1f
+	addi	r9,r15,13	# check r13
+	cmpd	r9,r13
+	bne	1f
+	addi	r9,r15,14	# check r14
+	cmpd	r9,r14
+	bne	1f
+	addi	r9,r15,16	# check r16
+	cmpd	r9,r16
+	bne	1f
+	addi	r9,r15,17	# check r17
+	cmpd	r9,r17
+	bne	1f
+	addi	r9,r15,18	# check r18
+	cmpd	r9,r18
+	bne	1f
+	addi	r9,r15,19	# check r19
+	cmpd	r9,r19
+	bne	1f
+	addi