[PATCH V2 0/2] arm64/mm: Enable memory hot remove
This series enables memory hot remove on arm64 after fixing a memblock removal ordering problem in generic __remove_memory(). This is based on the following arm64 working tree:

git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/core

Testing:

Tested hot remove on arm64 for all 4K, 16K, 64K page config options with all possible VA_BITS and PGTABLE_LEVELS combinations. Build tested on non-arm64 platforms.

Changes in V2:

- Added all received review and ack tags
- Split the series from ZONE_DEVICE enablement for better review
- Moved memblock re-order patch to the front as per Robin Murphy
- Updated commit message on memblock re-order patch per Michal Hocko
- Dropped [pmd|pud]_large() definitions
- Used existing [pmd|pud]_sect() instead of earlier [pmd|pud]_large()
- Removed __meminit and __ref tags as per Oscar Salvador
- Dropped unnecessary 'ret' init in arch_add_memory() per Robin Murphy
- Skipped calling into pgtable_page_dtor() for linear mapping page table pages and updated all relevant functions

Changes in V1: (https://lkml.org/lkml/2019/4/3/28)

Anshuman Khandual (2):
  mm/hotplug: Reorder arch_remove_memory() call in __remove_memory()
  arm64/mm: Enable memory hot remove

 arch/arm64/Kconfig               |   3 +
 arch/arm64/include/asm/pgtable.h |   2 +
 arch/arm64/mm/mmu.c              | 221 ++-
 mm/memory_hotplug.c              |   3 +-
 4 files changed, 225 insertions(+), 4 deletions(-)

-- 
2.7.4
[PATCH 3/3] ARM: mvebu: add SPDX license identifier
The license is clearly identified as GPL V2 - so just add in the appropriate SPDX license identifier. Signed-off-by: Nicholas Mc Guire --- Problem reported by checkpatch WARNING: Missing or malformed SPDX-License-Identifier tag in line 1 #1: FILE: arch/arm/mach-mvebu/board-v7.c:1: +/* Patch is against 5.1-rc4 (localversion-next is 20190412) arch/arm/mach-mvebu/board-v7.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/arm/mach-mvebu/board-v7.c b/arch/arm/mach-mvebu/board-v7.c index 28fd256..0e021c9 100644 --- a/arch/arm/mach-mvebu/board-v7.c +++ b/arch/arm/mach-mvebu/board-v7.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * Device Tree support for Armada 370 and XP platforms. * -- 2.1.4
[PATCH 1/3 RFC] ARM: mvebu: at least warn on kzalloc failure
Although it is very unlikely that the allocation during init would fail, any such failure should point to the original cause rather than waiting for a null-pointer dereference to splat. Signed-off-by: Nicholas Mc Guire --- Problem located with experimental coccinelle script While this will not really help much, kzalloc failures should not go unhandled. Patch was compile-tested: mvebu_v7_defconfig (implies MACH_MVEBU_ANY=y) (with some unrelated sparse warnings about missing syscalls) Patch is against 5.1-rc4 (localversion-next is 20190412) arch/arm/mach-mvebu/board-v7.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/arm/mach-mvebu/board-v7.c b/arch/arm/mach-mvebu/board-v7.c index 0b10acd..37f8cb6 100644 --- a/arch/arm/mach-mvebu/board-v7.c +++ b/arch/arm/mach-mvebu/board-v7.c @@ -128,6 +128,7 @@ static void __init i2c_quirk(void) struct property *new_compat; new_compat = kzalloc(sizeof(*new_compat), GFP_KERNEL); + WARN_ON(!new_compat); new_compat->name = kstrdup("compatible", GFP_KERNEL); new_compat->length = sizeof("marvell,mv78230-a0-i2c"); -- 2.1.4
[PATCH 2/3] ARM: mvebu: drop return from void function
The return statement is unnecessary here - so drop it. Signed-off-by: Nicholas Mc Guire --- Problem reported by checkpatch WARNING: void function return statements are not generally useful #141: FILE: arch/arm/mach-mvebu/board-v7.c:141: + return; +} Patch was compile-tested: mvebu_v7_defconfig (implies MACH_MVEBU_ANY=y) (with some unrelated sparse warnings about missing syscalls) Patch is against 5.1-rc4 (localversion-next is 20190412) arch/arm/mach-mvebu/board-v7.c | 1 - 1 file changed, 1 deletion(-) diff --git a/arch/arm/mach-mvebu/board-v7.c b/arch/arm/mach-mvebu/board-v7.c index 37f8cb6..28fd256 100644 --- a/arch/arm/mach-mvebu/board-v7.c +++ b/arch/arm/mach-mvebu/board-v7.c @@ -137,7 +137,6 @@ static void __init i2c_quirk(void) of_update_property(np, new_compat); } - return; } static void __init mvebu_dt_init(void) -- 2.1.4
Re: INFO: task hung in do_exit
syzbot has bisected this bug to: commit 430e48ecf31f4f897047f22e02abdfa75730cad8 Author: Amitoj Kaur Chawla Date: Thu Aug 10 16:28:09 2017 + leds: lm3533: constify attribute_group structure bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=15f4cee320 start commit: 8ee15f32 Merge tag 'dma-mapping-5.1-1' of git://git.infrad.. git tree: upstream final crash: https://syzkaller.appspot.com/x/report.txt?x=17f4cee320 console output: https://syzkaller.appspot.com/x/log.txt?x=13f4cee320 kernel config: https://syzkaller.appspot.com/x/.config?x=4fb64439e07a1ec0 dashboard link: https://syzkaller.appspot.com/bug?extid=9880e421ec82313d6527 syz repro: https://syzkaller.appspot.com/x/repro.syz?x=149b89af20 Reported-by: syzbot+9880e421ec82313d6...@syzkaller.appspotmail.com Fixes: 430e48ecf31f ("leds: lm3533: constify attribute_group structure") For information about bisection process see: https://goo.gl/tpsmEJ#bisection
Re: [PATCH 5/6] iio: imx7d_adc: Use imx7d_adc_resume() in imx7d_adc_probe()
On Sun, Apr 7, 2019 at 4:15 AM Jonathan Cameron wrote: > > On Wed, 3 Apr 2019 00:03:24 -0700 > Andrey Smirnov wrote: > > > Initialization sequence performed in imx7d_adc_resume() is exactly > > the same as what's being done in imx7d_adc_probe(). Make use of the > > former in the latter to avoid code duplication. > > > > Signed-off-by: Andrey Smirnov > > Cc: Jonathan Cameron > > Cc: Hartmut Knaack > > Cc: Lars-Peter Clausen > > Cc: Peter Meerwald-Stadler > > Cc: Chris Healy > > Cc: linux-...@vger.kernel.org > > Cc: linux-kernel@vger.kernel.org > I'm not going to apply this now, due to the ordering issue with the earlier > and later patches. However, it is fine in and of itself. > > I do wonder if we want to rename resume to something else to reflect > its new usage? imx7d_adc_enable perhaps? > Similar question for suspend in the next patch. Sure, will rename in v2. Thanks, Andrey Smirnov
Re: [PATCH 3/6] iio: imx7d_adc: Use devm_iio_device_register()
On Sun, Apr 7, 2019 at 4:07 AM Jonathan Cameron wrote: > > On Wed, 3 Apr 2019 00:03:22 -0700 > Andrey Smirnov wrote: > > > Use devm_iio_device_register() and drop explicit call to > > iio_device_unregister(). > > > > Signed-off-by: Andrey Smirnov > > Cc: Jonathan Cameron > > Cc: Hartmut Knaack > > Cc: Lars-Peter Clausen > > Cc: Peter Meerwald-Stadler > > Cc: Chris Healy > > Cc: linux-...@vger.kernel.org > > Cc: linux-kernel@vger.kernel.org > No to this one. The thing to think about is the resulting order > of the unwinding that happens in remove. > > Do take a look at the code flow, but in short what happens is: > > 1. driver.remove() > 2. Devm release functions run in the opposite order to that in which they were >called during setup. > > The upshot of the change you just made here is that we turn the power > off to the device before we remove the userspace interfaces, giving > potentially interesting failures.. > > There are two options to avoid this: > > 1. Make everything use devm_ calls (often using devm_add_action_or_reset > for the ones that don't have their own versions). > > 2. Stop using devm managed functions at the first non devm_ call that needs > unwinding. From that point onwards in probe / remove you have to do > everything > manually. > Yeah, missed the ordering problem, sorry about that. I'll re-order changes such that this conversion happens last to avoid said problem in v2. Thanks, Andrey Smirnov
Re: [PATCH] new flag COPY_FILE_RANGE_FILESIZE for copy_file_range()
On Sat, Apr 13, 2019 at 03:54:39PM -0500, Shawn Landden wrote: /me pulls out his close-reading glasses and the copy_file_range manpage... > If flags includes COPY_FILE_RANGE_FILESIZE then the length > copied is the length of the file. off_in and off_out are > ignored. len must be 0 or the file size. They're ignored? As in the copy operation reads the number of bytes in the file referenced by fd_in from fd_in at its current position and writes that out to fd_out at its current position? I don't see why I would want such an operation... ...but I can see how people could make use of a CFR_ENTIRE_FILE that would check that both file descriptors are actually regular files, and if so copy the entire contents of the fd_in file into the same position in the fd_out file, and then set the fd_out file's length to match. If @off_in or @off_out are non-NULL then they'll be updated to the new EOFs if the copy completes successfully and @len can be anything. Also: please update the manual page and the xfstests regression test for this syscall. > This implementation saves a call to stat() in the common case > of copying files. It does not fix any race conditions, but that > is possible in the future with this interface. > > EAGAIN: If COPY_FILE_RANGE_FILESIZE was passed and len is not 0 > or the file size. The values are invalid, so why would we tell userspace to try again instead of the EINVAL that we usually use?
> Signed-off-by: Shawn Landden > CC: > --- > fs/read_write.c | 14 +- > include/uapi/linux/stat.h | 4 > 2 files changed, 17 insertions(+), 1 deletion(-) > > diff --git a/fs/read_write.c b/fs/read_write.c > index 61b43ad7608e..6d06361f0856 100644 > --- a/fs/read_write.c > +++ b/fs/read_write.c > @@ -1557,7 +1557,7 @@ ssize_t vfs_copy_file_range(struct file *file_in, > loff_t pos_in, > struct inode *inode_out = file_inode(file_out); > ssize_t ret; > > - if (flags != 0) > + if ((flags & ~COPY_FILE_RANGE_FILESIZE) != 0) FWIW you might as well shorten the prefix to "CFR_" since nobody else is using it. --D > return -EINVAL; > > if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode)) > @@ -1565,6 +1565,18 @@ ssize_t vfs_copy_file_range(struct file *file_in, > loff_t pos_in, > if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode)) > return -EINVAL; > > + if (flags & COPY_FILE_RANGE_FILESIZE) { > + struct kstat stat; > + int error; > + error = vfs_getattr(&file_in->f_path, &stat, > + STATX_SIZE, 0); > + if (error < 0) > + return error; > + if (!(len == 0 || len == stat.size)) > + return -EAGAIN; > + len = stat.size; > + } > + > ret = rw_verify_area(READ, file_in, &pos_in, len); > if (unlikely(ret)) > return ret; > diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h > index 7b35e98d3c58..1075aa4666ef 100644 > --- a/include/uapi/linux/stat.h > +++ b/include/uapi/linux/stat.h > @@ -170,5 +170,9 @@ struct statx { > > #define STATX_ATTR_AUTOMOUNT 0x1000 /* Dir: Automount trigger */ > > +/* > + * Flags for copy_file_range() > + */ > +#define COPY_FILE_RANGE_FILESIZE 0x0001 /* Copy the full length of > the input file */ > > #endif /* _UAPI_LINUX_STAT_H */ > -- > 2.20.1 >
Re: perf tools:Is there any tools to found out the max latency by irq or cpu idle
On 4/13/19 2:01 AM, Linhaifeng wrote: > Sorry, the value 131081408 is just for example. Actually the result is like > this: > sqrt 2019-04-10 23:53:50: 43968 > sqrt 2019-04-10 23:53:51: 44060 > sqrt 2019-04-10 23:53:52: 49012 > sqrt 2019-04-10 23:53:53: 38172 > sqrt 2019-04-10 23:53:54: 131081408 > sqrt 2019-04-10 23:53:55: 43600 > sqrt 2019-04-10 23:53:56: 46704 > sqrt 2019-04-10 23:53:57: 46880 > sqrt 2019-04-10 23:53:58: 44332 > …… > sqrt 2019-04-10 02:17:15: 136345876 > …… > sqrt 2019-04-10 04:40:35: 136335644 > …… Hi, Still it would be interesting to look at the two raw values used to compute the delta to better diagnose what is going on. -Will > > -Original Message- > From: William Cohen [mailto:wco...@redhat.com] > Sent: 2019年4月12日 22:06 > To: Linhaifeng ; linux-perf-us...@vger.kernel.org; > linux-kernel@vger.kernel.org > Subject: Re: perf tools:Is there any tools to found out the max latency by > irq or cpu idle > > On 4/11/19 8:57 PM, Linhaifeng wrote: >> Hi, >> I have a single thread application like this: >> >> While (1) { >> start = rdtsc(); >> sqrt (1024); >> end = rdtsc(); >> cycles = end – start; >> printf("cycles: %d-%02d-%02d %02d:%02d:%02d: %lu\n", >> 1900+timeinfo->tm_year, 1+timeinfo->tm_mon, timeinfo->tm_mday, >> timeinfo->tm_hour, timeinfo->tm_min, timeinfo->tm_sec, >> cycles); >> } >> It print the cycles of sqrt every second and run with taskset –c 1 ./sqrt. >> The result of test is: >> >> sqrt 2019-04-10 23:53:50: 43968 >> sqrt 2019-04-10 23:53:51: 44060 >> sqrt 2019-04-10 23:53:52: 49012 >> sqrt 2019-04-10 23:53:53: 38172 >> sqrt 2019-04-10 23:53:54: 131081408 >> sqrt 2019-04-10 23:53:55: 43600 >> sqrt 2019-04-10 23:53:56: 46704 >> sqrt 2019-04-10 23:53:57: 46880 >> sqrt 2019-04-10 23:53:58: 44332 >> …… >> sqrt 2019-04-10 02:17:15: 131081408 >> …… >> sqrt 2019-04-10 04:40:35: 131081408 >> …… >> >> Every 2hour23min there would be a large cycles. I use perf sched not found >> any sched_switch events.
> > Hi, > > The fact that it is the same value 131081408 every 2 hours 23 minutes looks > suspect. One would expect some variation in the counts. It looks like there > is some rollover or overflow issue. It would be helpful to print out the > start and end values to see what is going on with the raw rdstc values. > Maybe the thread is being moved between processors and the TSC are out of > sync. What particular processor model was this running on? Was this running > on physical hardware or inside a kvm guest? > > According to the Intel 64 and IA-32 Architectures Software Developer's Manual > Volume 3 (325384-sdm-vol-3abcd.pdf): > > The RDTSC instruction reads the time-stamp counter and is guaranteed to > return a monotonically increasing unique value whenever executed, except for > a 64-bit counter wraparound. Intel guarantees that the time-stamp counter > will not wraparound within 10 years after being reset. The period for counter > wrap is longer for Pentium 4, Intel Xeon, P6 family, and Pentium processors. > > -Will Cohen > > >> >> L2GW_2680:/home/fsp/zn # perf sched record -C 6-11 -o perf.sched ^C[ perf record: Woken up 64 times to write data ] [ perf record: Captured and wrote 204.878 MB perf.sched (1911189 samples) ] >> >> L2GW_2680:/home/fsp/zn # perf sched latency -i perf.sched >> >> ----- >> Task | Runtime ms | Switches | Average delay ms | Maximum delay ms | Maximum delay at | >> ----- >> ----- >> TOTAL: | 0.000 ms | 0 | >> --- >> >> >> >> Is there any other tools of perf to found out the max latency by irq or cpu idle ? >> >> >
Re: [PATCH 4/4] ARM: imx legacy: add an SPDX license identifier
On Sat, Apr 13, 2019 at 11:11:05AM -0300, Fabio Estevam wrote: > On Sat, Apr 13, 2019 at 4:30 AM Nicholas Mc Guire wrote: > > > > The header clearly identifies this code as GPL V2 or later - so pop > > in the SPDX license identifier. > > > > Signed-off-by: Nicholas Mc Guire > > --- > > arch/arm/mach-imx/mach-mx27ads.c | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/arch/arm/mach-imx/mach-mx27ads.c > > b/arch/arm/mach-imx/mach-mx27ads.c > > index c83fdd3..3f68972 100644 > > --- a/arch/arm/mach-imx/mach-mx27ads.c > > +++ b/arch/arm/mach-imx/mach-mx27ads.c > > @@ -1,3 +1,4 @@ > > +// SPDX-License-Identifier: GPL-2.0+ > > As you are adding the SPDX line you could also remove the legal text below. hmmm - as I'm not the copyright holder I do not think I should be doing that - it has more information in it than only GPL V2+ thx! hofrat
Re: [PATCH v5 4/5] xtensa: Replace CONFIG_DEBUG_KERNEL with CONFIG_DEBUG_MISC
On Sat, Apr 13, 2019 at 3:44 PM Sinan Kaya wrote: > > CONFIG_DEBUG_KERNEL should not impact code generation. Use the newly > defined CONFIG_DEBUG_MISC instead to keep the current code. > > Signed-off-by: Sinan Kaya > Reviewed-by: Josh Triplett > --- > arch/xtensa/include/asm/irqflags.h | 2 +- > arch/xtensa/kernel/smp.c | 2 +- > 2 files changed, 2 insertions(+), 2 deletions(-) Acked-by: Max Filippov -- Thanks. -- Max
[PATCH v5 3/5] mips: Replace CONFIG_DEBUG_KERNEL with CONFIG_DEBUG_MISC
CONFIG_DEBUG_KERNEL should not impact code generation. Use the newly defined CONFIG_DEBUG_MISC instead to keep the current code. Signed-off-by: Sinan Kaya Reviewed-by: Josh Triplett --- arch/mips/kernel/setup.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/mips/kernel/setup.c b/arch/mips/kernel/setup.c index 8d1dc6c71173..9fc8fadb8418 100644 --- a/arch/mips/kernel/setup.c +++ b/arch/mips/kernel/setup.c @@ -563,7 +563,7 @@ static void __init bootmem_init(void) offset = __pa_symbol(_text) - __pa_symbol(VMLINUX_LOAD_ADDRESS); memblock_free(__pa_symbol(VMLINUX_LOAD_ADDRESS), offset); -#if defined(CONFIG_DEBUG_KERNEL) && defined(CONFIG_DEBUG_INFO) +#if defined(CONFIG_DEBUG_MISC) && defined(CONFIG_DEBUG_INFO) /* * This information is necessary when debugging the kernel * But is a security vulnerability otherwise! -- 2.21.0
[PATCH v5 4/5] xtensa: Replace CONFIG_DEBUG_KERNEL with CONFIG_DEBUG_MISC
CONFIG_DEBUG_KERNEL should not impact code generation. Use the newly defined CONFIG_DEBUG_MISC instead to keep the current code. Signed-off-by: Sinan Kaya Reviewed-by: Josh Triplett --- arch/xtensa/include/asm/irqflags.h | 2 +- arch/xtensa/kernel/smp.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/xtensa/include/asm/irqflags.h b/arch/xtensa/include/asm/irqflags.h index 9b5e8526afe5..12890681587b 100644 --- a/arch/xtensa/include/asm/irqflags.h +++ b/arch/xtensa/include/asm/irqflags.h @@ -27,7 +27,7 @@ static inline unsigned long arch_local_irq_save(void) { unsigned long flags; #if XTENSA_FAKE_NMI -#if defined(CONFIG_DEBUG_KERNEL) && (LOCKLEVEL | TOPLEVEL) >= XCHAL_DEBUGLEVEL +#if defined(CONFIG_DEBUG_MISC) && (LOCKLEVEL | TOPLEVEL) >= XCHAL_DEBUGLEVEL unsigned long tmp; asm volatile("rsr %0, ps\t\n" diff --git a/arch/xtensa/kernel/smp.c b/arch/xtensa/kernel/smp.c index 3699d6d3e479..83b244ce61ee 100644 --- a/arch/xtensa/kernel/smp.c +++ b/arch/xtensa/kernel/smp.c @@ -126,7 +126,7 @@ void secondary_start_kernel(void) init_mmu(); -#ifdef CONFIG_DEBUG_KERNEL +#ifdef CONFIG_DEBUG_MISC if (boot_secondary_processors == 0) { pr_debug("%s: boot_secondary_processors:%d; Hanging cpu:%d\n", __func__, boot_secondary_processors, cpu); -- 2.21.0
[PATCH v5 0/5] init: Do not select DEBUG_KERNEL by default
CONFIG_DEBUG_KERNEL has been designed to just enable Kconfig options. Kernel code generation should not depend on CONFIG_DEBUG_KERNEL. Proposed alternative plan: let's add a new symbol, something like DEBUG_MISC ("Miscellaneous debug code that should be under a more specific debug option but isn't"), make it depend on DEBUG_KERNEL and be "default DEBUG_KERNEL" but allow itself to be turned off, and then mechanically change the small handful of "#ifdef CONFIG_DEBUG_KERNEL" to "#ifdef CONFIG_DEBUG_MISC". Diff from v4: - collect reviewed-by - collect acked-by - fix nit on 1/5 Sinan Kaya (5): init: Introduce DEBUG_MISC option powerpc: Replace CONFIG_DEBUG_KERNEL with CONFIG_DEBUG_MISC mips: Replace CONFIG_DEBUG_KERNEL with CONFIG_DEBUG_MISC xtensa: Replace CONFIG_DEBUG_KERNEL with CONFIG_DEBUG_MISC net: Replace CONFIG_DEBUG_KERNEL with CONFIG_DEBUG_MISC arch/mips/kernel/setup.c | 2 +- arch/powerpc/kernel/sysfs.c| 8 arch/xtensa/include/asm/irqflags.h | 2 +- arch/xtensa/kernel/smp.c | 2 +- lib/Kconfig.debug | 9 + net/netfilter/core.c | 2 +- 6 files changed, 17 insertions(+), 8 deletions(-) -- 2.21.0
[PATCH v5 1/5] init: Introduce DEBUG_MISC option
Introduce DEBUG_MISC ("Miscellaneous debug code that should be under a more specific debug option but isn't"), make it depend on DEBUG_KERNEL and be "default DEBUG_KERNEL" but allow itself to be turned off, and then mechanically change the small handful of "#ifdef CONFIG_DEBUG_KERNEL" to "#ifdef CONFIG_DEBUG_MISC". Signed-off-by: Sinan Kaya Reviewed-by: Josh Triplett --- lib/Kconfig.debug | 9 + 1 file changed, 9 insertions(+) diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 0d9e81779e37..0103a092ce3d 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -438,6 +438,15 @@ config DEBUG_KERNEL Say Y here if you are developing drivers or trying to debug and identify kernel problems. +config DEBUG_MISC + bool "Miscellaneous debug code" + default DEBUG_KERNEL + depends on DEBUG_KERNEL + help + Say Y here if you need to enable miscellaneous debug code that should + be under a more specific debug option but isn't. + + menu "Memory Debugging" source "mm/Kconfig.debug" -- 2.21.0
[PATCH v5 2/5] powerpc: Replace CONFIG_DEBUG_KERNEL with CONFIG_DEBUG_MISC
CONFIG_DEBUG_KERNEL should not impact code generation. Use the newly defined CONFIG_DEBUG_MISC instead to keep the current code. Signed-off-by: Sinan Kaya Reviewed-by: Josh Triplett --- arch/powerpc/kernel/sysfs.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c index e8e93c2c7d03..7a1708875d27 100644 --- a/arch/powerpc/kernel/sysfs.c +++ b/arch/powerpc/kernel/sysfs.c @@ -610,7 +610,7 @@ SYSFS_PMCSETUP(pa6t_pmc2, SPRN_PA6T_PMC2); SYSFS_PMCSETUP(pa6t_pmc3, SPRN_PA6T_PMC3); SYSFS_PMCSETUP(pa6t_pmc4, SPRN_PA6T_PMC4); SYSFS_PMCSETUP(pa6t_pmc5, SPRN_PA6T_PMC5); -#ifdef CONFIG_DEBUG_KERNEL +#ifdef CONFIG_DEBUG_MISC SYSFS_SPRSETUP(hid0, SPRN_HID0); SYSFS_SPRSETUP(hid1, SPRN_HID1); SYSFS_SPRSETUP(hid4, SPRN_HID4); @@ -639,7 +639,7 @@ SYSFS_SPRSETUP(tsr0, SPRN_PA6T_TSR0); SYSFS_SPRSETUP(tsr1, SPRN_PA6T_TSR1); SYSFS_SPRSETUP(tsr2, SPRN_PA6T_TSR2); SYSFS_SPRSETUP(tsr3, SPRN_PA6T_TSR3); -#endif /* CONFIG_DEBUG_KERNEL */ +#endif /* CONFIG_DEBUG_MISC */ #endif /* HAS_PPC_PMC_PA6T */ #ifdef HAS_PPC_PMC_IBM @@ -680,7 +680,7 @@ static struct device_attribute pa6t_attrs[] = { __ATTR(pmc3, 0600, show_pa6t_pmc3, store_pa6t_pmc3), __ATTR(pmc4, 0600, show_pa6t_pmc4, store_pa6t_pmc4), __ATTR(pmc5, 0600, show_pa6t_pmc5, store_pa6t_pmc5), -#ifdef CONFIG_DEBUG_KERNEL +#ifdef CONFIG_DEBUG_MISC __ATTR(hid0, 0600, show_hid0, store_hid0), __ATTR(hid1, 0600, show_hid1, store_hid1), __ATTR(hid4, 0600, show_hid4, store_hid4), @@ -709,7 +709,7 @@ static struct device_attribute pa6t_attrs[] = { __ATTR(tsr1, 0600, show_tsr1, store_tsr1), __ATTR(tsr2, 0600, show_tsr2, store_tsr2), __ATTR(tsr3, 0600, show_tsr3, store_tsr3), -#endif /* CONFIG_DEBUG_KERNEL */ +#endif /* CONFIG_DEBUG_MISC */ }; #endif /* HAS_PPC_PMC_PA6T */ #endif /* HAS_PPC_PMC_CLASSIC */ -- 2.21.0
Re: [GIT PULL] Please pull NFS client bugfixes
The pull request you sent on Sat, 13 Apr 2019 14:56:35 +: > git://git.linux-nfs.org/projects/trondmy/linux-nfs.git tags/nfs-for-5.1-4 has been merged into torvalds/linux.git: https://git.kernel.org/torvalds/c/b60bc0665e6af8c55b946b67ea8cb235823bb74e Thank you! -- Deet-doot-dot, I am a bot. https://korg.wiki.kernel.org/userdoc/prtracker
Re: [GIT PULL] clk fixes for v5.1-rc4
The pull request you sent on Fri, 12 Apr 2019 16:21:07 -0700: > https://git.kernel.org/pub/scm/linux/kernel/git/clk/linux.git > tags/clk-fixes-for-linus has been merged into torvalds/linux.git: https://git.kernel.org/torvalds/c/09bad0df3974ef9e682447f48bb7fd55c48513f3 Thank you! -- Deet-doot-dot, I am a bot. https://korg.wiki.kernel.org/userdoc/prtracker
Re: [GIT PULL] PCI fixes for v5.1
The pull request you sent on Fri, 12 Apr 2019 10:38:38 -0500: > git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git > tags/pci-v5.1-fixes-2 has been merged into torvalds/linux.git: https://git.kernel.org/torvalds/c/a3b84248628df52c39c8724fb69da1ea0f2c0dc7 Thank you! -- Deet-doot-dot, I am a bot. https://korg.wiki.kernel.org/userdoc/prtracker
Re: [PATCH] new flag COPY_FILE_RANGE_FILESIZE for copy_file_range()
> On Apr 13, 2019, at 1:49 PM, Shawn Landden wrote: > > If flags includes COPY_FILE_RANGE_FILESIZE then the length > copied is the length of the file. off_in and off_out are > ignored. len must be 0 or the file size. > > This implementation saves a call to stat() in the common case > of copying files. It does not fix any race conditions, but that > is possible in the future with this interface. > > EAGAIN: If COPY_FILE_RANGE_FILESIZE was passed and len is not 0 > or the file size. I think you’re asking for trouble here. I assume you have some kind of race prevention in mind here. The trouble is that passing zero means copy the whole thing, but the size-checking behavior is only available for nonzero sizes. This means that anyone who passes their idea of the size needs to account for inconsistent behavior if the size is zero. Also, what happens if the file size changes mid copy? I assume the result is more or less arbitrary, but you should document what behavior is allowed. The docs should cover the case where you race against an O_APPEND write.
> > Signed-off-by: Shawn Landden > CC: > --- > fs/read_write.c | 14 +- > include/uapi/linux/stat.h | 4 > 2 files changed, 17 insertions(+), 1 deletion(-) > > diff --git a/fs/read_write.c b/fs/read_write.c > index 61b43ad7608e..6d06361f0856 100644 > --- a/fs/read_write.c > +++ b/fs/read_write.c > @@ -1557,7 +1557,7 @@ ssize_t vfs_copy_file_range(struct file *file_in, > loff_t pos_in, >struct inode *inode_out = file_inode(file_out); >ssize_t ret; > > -if (flags != 0) > +if ((flags & ~COPY_FILE_RANGE_FILESIZE) != 0) >return -EINVAL; > >if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode)) > @@ -1565,6 +1565,18 @@ ssize_t vfs_copy_file_range(struct file *file_in, > loff_t pos_in, >if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode)) >return -EINVAL; > > +if (flags & COPY_FILE_RANGE_FILESIZE) { > +struct kstat stat; > +int error; > +error = vfs_getattr(&file_in->f_path, &stat, > +STATX_SIZE, 0); > +if (error < 0) > +return error; > +if (!(len == 0 || len == stat.size)) > +return -EAGAIN; > +len = stat.size; > +} > + >ret = rw_verify_area(READ, file_in, &pos_in, len); >if (unlikely(ret)) >return ret; > diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h > index 7b35e98d3c58..1075aa4666ef 100644 > --- a/include/uapi/linux/stat.h > +++ b/include/uapi/linux/stat.h > @@ -170,5 +170,9 @@ struct statx { > > #define STATX_ATTR_AUTOMOUNT0x1000 /* Dir: Automount trigger */ > > +/* > + * Flags for copy_file_range() > + */ > +#define COPY_FILE_RANGE_FILESIZE0x0001 /* Copy the full length of > the input file */ > > #endif /* _UAPI_LINUX_STAT_H */ > -- > 2.20.1 >
Re: Adding plain accesses and detecting data races in the LKMM
On Tue, Apr 09, 2019 at 08:01:32AM -0700, Paul E. McKenney wrote: > On Tue, Apr 09, 2019 at 03:36:18AM +0200, Andrea Parri wrote: > > > > The formula was more along the line of "do not assume either of these > > > > cases to hold; use barrier() if you need an unconditional barrier..." > > > > AFAICT, all current implementations of smp_mb__{before,after}_atomic() > > > > provide a compiler barrier with either barrier() or "memory" clobber. > > > > > > Well, we have two reasonable choices: Say that > > > smp_mb__{before,after}_atomic will always provide a compiler barrier, > > > or don't say this. I see no point in saying that the combination of > > > Before-atomic followed by RMW provides a barrier. > > > > ;-/ I'm fine with the first choice. I don't see how the second choice > > (this proposal/patch) would be consistent with some documentation and > > with the current implementations; for example, > > > > 1) Documentation/atomic_t.txt says: > > > > Thus: > > > > atomic_fetch_add(); > > > > is equivalent to: > > > > smp_mb__before_atomic(); > > atomic_fetch_add_relaxed(); > > smp_mb__after_atomic(); > > > > [...] > > > > 2) Some implementations of the _relaxed() variants do not provide any > > compiler barrier currently. > > But don't all implementations of smp_mb__before_atomic() and > smp_mb__after_atomic() currently supply a compiler barrier? Yes, AFAICS, all implementations of smp_mb__{before,after}_atomic() currently supply a compiler barrier. Nevertheless, there's a difference between: (1) Specify that these barriers supply a compiler barrier, (2) Specify that (certain) combinations of these barriers and RMWs supply a compiler barrier, and (3) This patch... ;-) FWIW, I'm not aware of current/informal documentation following (the arguably simpler but slightly stronger) (1). But again (amending my last remark): (1) and (2) both make sense to me. Thanx, Andrea > > Thanx, Paul >
[tip:x86/fpu] x86/pkeys: Add PKRU value to init_fpstate
Commit-ID: a5eff7259790d5314eff10563d6e59d358cce482 Gitweb: https://git.kernel.org/tip/a5eff7259790d5314eff10563d6e59d358cce482 Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:56 +0200 Committer: Borislav Petkov CommitDate: Fri, 12 Apr 2019 20:21:10 +0200 x86/pkeys: Add PKRU value to init_fpstate The task's initial PKRU value is set partly for fpu__clear()/ copy_init_pkru_to_fpregs(). It is not part of init_fpstate.xsave and instead it is set explicitly. If the user removes the PKRU state from XSAVE in the signal handler then __fpu__restore_sig() will restore the missing bits from `init_fpstate' and initialize the PKRU value to 0. Add the `init_pkru_value' to `init_fpstate' so it is set to the init value in such a case. In theory copy_init_pkru_to_fpregs() could be removed because restoring the PKRU at return-to-userland should be enough. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andi Kleen Cc: Andy Lutomirski Cc: "Chang S. Bae" Cc: Dominik Brodowski Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: "Jason A. 
Donenfeld" Cc: Konrad Rzeszutek Wilk Cc: kvm ML Cc: Paolo Bonzini Cc: Pavel Tatashin Cc: Peter Zijlstra Cc: Radim Krčmář Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-28-bige...@linutronix.de --- arch/x86/kernel/cpu/common.c | 5 + arch/x86/mm/pkeys.c | 6 ++ 2 files changed, 11 insertions(+) diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index cb28e98a0659..352fa19e6311 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -372,6 +372,8 @@ static bool pku_disabled; static __always_inline void setup_pku(struct cpuinfo_x86 *c) { + struct pkru_state *pk; + /* check the boot processor, plus compile options for PKU: */ if (!cpu_feature_enabled(X86_FEATURE_PKU)) return; @@ -382,6 +384,9 @@ static __always_inline void setup_pku(struct cpuinfo_x86 *c) return; cr4_set_bits(X86_CR4_PKE); + pk = get_xsave_addr(&init_fpstate.xsave, XFEATURE_PKRU); + if (pk) + pk->pkru = init_pkru_value; /* * Seting X86_CR4_PKE will cause the X86_FEATURE_OSPKE * cpuid bit to be set. We need to ensure that we diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c index 2ecbf4155f98..1dcfc91c8f0c 100644 --- a/arch/x86/mm/pkeys.c +++ b/arch/x86/mm/pkeys.c @@ -18,6 +18,7 @@ #include /* boot_cpu_has, ...*/ #include /* vma_pkey() */ +#include /* init_fpstate */ int __execute_only_pkey(struct mm_struct *mm) { @@ -161,6 +162,7 @@ static ssize_t init_pkru_read_file(struct file *file, char __user *user_buf, static ssize_t init_pkru_write_file(struct file *file, const char __user *user_buf, size_t count, loff_t *ppos) { + struct pkru_state *pk; char buf[32]; ssize_t len; u32 new_init_pkru; @@ -183,6 +185,10 @@ static ssize_t init_pkru_write_file(struct file *file, return -EINVAL; WRITE_ONCE(init_pkru_value, new_init_pkru); + pk = get_xsave_addr(&init_fpstate.xsave, XFEATURE_PKRU); + if (!pk) + return -EINVAL; + pk->pkru = new_init_pkru; return count; }
[tip:x86/fpu] x86/fpu: Restore regs in copy_fpstate_to_sigframe() in order to use the fastpath
Commit-ID: 06b251dff78704c7d122bd109384d970a7dbe94d Gitweb: https://git.kernel.org/tip/06b251dff78704c7d122bd109384d970a7dbe94d Author: Sebastian Andrzej Siewior AuthorDate: Fri, 12 Apr 2019 20:16:15 +0200 Committer: Borislav Petkov CommitDate: Fri, 12 Apr 2019 20:16:15 +0200 x86/fpu: Restore regs in copy_fpstate_to_sigframe() in order to use the fastpath If a task is scheduled out and receives a signal then it won't be able to take the fastpath because the registers aren't available. The slowpath is more expensive compared to XRSTOR + XSAVE which usually succeeds. Here are some clock_gettime() numbers from a bigger box with AVX512 during bootup: - __fpregs_load_activate() takes 140ns - 350ns. If it was the most recent FPU context on the CPU then the optimisation in __fpregs_load_activate() will skip the load (which was disabled during the test). - copy_fpregs_to_sigframe() takes 200ns - 450ns if it succeeds. On a pagefault it is 1.8us - 3us usually in the 2.6us area. - The slowpath takes 1.5us - 6us. Usually in the 2.6us area. My testcases (including lat_sig) take the fastpath without __fpregs_load_activate(). I expect this to be the majority. Since the slowpath is in the >1us area it makes sense to load the registers and attempt to save them directly. The direct save may fail but should only happen on the first invocation or after fork() while the page is read-only. [ bp: Massage a bit. ] Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A. 
Donenfeld" Cc: kvm ML Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-27-bige...@linutronix.de --- arch/x86/kernel/fpu/signal.c | 25 + 1 file changed, 13 insertions(+), 12 deletions(-) diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c index 3c3167576216..7026f1c4e5e3 100644 --- a/arch/x86/kernel/fpu/signal.c +++ b/arch/x86/kernel/fpu/signal.c @@ -175,20 +175,21 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size) (struct _fpstate_32 __user *) buf) ? -1 : 1; /* -* If we do not need to load the FPU registers at return to userspace -* then the CPU has the current state. Try to save it directly to -* userland's stack frame if it does not cause a pagefault. If it does, -* try the slowpath. +* Load the FPU registers if they are not valid for the current task. +* With a valid FPU state we can attempt to save the state directly to +* userland's stack frame which will likely succeed. If it does not, do +* the slowpath. */ fpregs_lock(); - if (!test_thread_flag(TIF_NEED_FPU_LOAD)) { - pagefault_disable(); - ret = copy_fpregs_to_sigframe(buf_fx); - pagefault_enable(); - if (ret) - copy_fpregs_to_fpstate(fpu); - set_thread_flag(TIF_NEED_FPU_LOAD); - } + if (test_thread_flag(TIF_NEED_FPU_LOAD)) + __fpregs_load_activate(); + + pagefault_disable(); + ret = copy_fpregs_to_sigframe(buf_fx); + pagefault_enable(); + if (ret && !test_thread_flag(TIF_NEED_FPU_LOAD)) + copy_fpregs_to_fpstate(fpu); + set_thread_flag(TIF_NEED_FPU_LOAD); fpregs_unlock(); if (ret) {
[tip:x86/fpu] x86/fpu: Add a fastpath to copy_fpstate_to_sigframe()
Commit-ID: da2f32fb8dc7cbd9433cb2e990693734b30a2465 Gitweb: https://git.kernel.org/tip/da2f32fb8dc7cbd9433cb2e990693734b30a2465 Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:54 +0200 Committer: Borislav Petkov CommitDate: Fri, 12 Apr 2019 20:05:36 +0200 x86/fpu: Add a fastpath to copy_fpstate_to_sigframe() Try to save the FPU registers directly to the userland stack frame if the CPU holds the FPU registers for the current task. This has to be done with the pagefault disabled because we can't fault (while the FPU registers are locked) and therefore the operation might fail. If it fails try the slowpath which can handle faults. [ bp: Massage a bit. ] Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A. Donenfeld" Cc: kvm ML Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-26-bige...@linutronix.de --- arch/x86/kernel/fpu/signal.c | 34 ++ 1 file changed, 22 insertions(+), 12 deletions(-) diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c index a1bd7be70206..3c3167576216 100644 --- a/arch/x86/kernel/fpu/signal.c +++ b/arch/x86/kernel/fpu/signal.c @@ -144,8 +144,10 @@ static inline int copy_fpregs_to_sigframe(struct xregs_state __user *buf) * buf == buf_fx for 64-bit frames and 32-bit fsave frame. * buf != buf_fx for 32-bit frames with fxstate. * - * Save the state to task's fpu->state and then copy it to the user frame - * pointed to by the aligned pointer 'buf_fx'. + * Try to save it directly to the user frame with disabled page fault handler. + * If this fails then do the slow path where the FPU state is first saved to + * task's fpu->state and then copy it to the user frame pointed to by the + * aligned pointer 'buf_fx'. 
* * If this is a 32-bit frame with fxstate, put a fsave header before * the aligned state at 'buf_fx'. @@ -159,6 +161,7 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size) struct xregs_state *xsave = >state.xsave; struct task_struct *tsk = current; int ia32_fxstate = (buf != buf_fx); + int ret = -EFAULT; ia32_fxstate &= (IS_ENABLED(CONFIG_X86_32) || IS_ENABLED(CONFIG_IA32_EMULATION)); @@ -173,23 +176,30 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size) /* * If we do not need to load the FPU registers at return to userspace -* then the CPU has the current state and we need to save it. Otherwise, -* it has already been done and we can skip it. +* then the CPU has the current state. Try to save it directly to +* userland's stack frame if it does not cause a pagefault. If it does, +* try the slowpath. */ fpregs_lock(); if (!test_thread_flag(TIF_NEED_FPU_LOAD)) { - copy_fpregs_to_fpstate(fpu); + pagefault_disable(); + ret = copy_fpregs_to_sigframe(buf_fx); + pagefault_enable(); + if (ret) + copy_fpregs_to_fpstate(fpu); set_thread_flag(TIF_NEED_FPU_LOAD); } fpregs_unlock(); - if (using_compacted_format()) { - if (copy_xstate_to_user(buf_fx, xsave, 0, size)) - return -1; - } else { - fpstate_sanitize_xstate(fpu); - if (__copy_to_user(buf_fx, xsave, fpu_user_xstate_size)) - return -1; + if (ret) { + if (using_compacted_format()) { + if (copy_xstate_to_user(buf_fx, xsave, 0, size)) + return -1; + } else { + fpstate_sanitize_xstate(fpu); + if (__copy_to_user(buf_fx, xsave, fpu_user_xstate_size)) + return -1; + } } /* Save the fsave header for the 32-bit frames. */
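Stripped of the x86 specifics, the control flow this patch introduces is "attempt the cheap direct save with faults disabled; only on failure, save to kernel memory and copy out through the faulting path". A minimal userspace sketch of that shape follows; all names are invented stand-ins, the "fault" is simulated by a flag, and no kernel API is involved.

```c
#include <assert.h>
#include <string.h>

static int fault_on_write;            /* simulates a user page that is not mapped yet */
static unsigned char user_frame[64];  /* stands in for the user-space signal frame */
static unsigned char kernel_copy[64]; /* stands in for the task's fpu->state buffer */

/* Fastpath: direct save; fails on the first touch, like a store done
 * with the pagefault handler disabled would. */
static int save_direct(const unsigned char *regs)
{
	if (fault_on_write) {
		fault_on_write = 0;  /* the slowpath's copy-out faults the page in */
		return -1;
	}
	memcpy(user_frame, regs, sizeof(user_frame));
	return 0;
}

/* The pattern: try the direct save first, fall back through kernel
 * memory only when it failed.  Returns 0 for the fastpath, 1 for the
 * slowpath. */
static int copy_state_to_frame(const unsigned char *regs)
{
	if (save_direct(regs) == 0)
		return 0;
	memcpy(kernel_copy, regs, sizeof(kernel_copy));
	memcpy(user_frame, kernel_copy, sizeof(user_frame));
	return 1;
}

/* First call hits the simulated fault and takes the slowpath; the
 * second call finds the "page" present and takes the fastpath. */
static int demo(void)
{
	unsigned char regs[64];

	memset(regs, 7, sizeof(regs));
	fault_on_write = 1;
	int slow = copy_state_to_frame(regs);
	int fast = copy_state_to_frame(regs);
	return slow == 1 && fast == 0 && user_frame[0] == 7;
}
```

As in the patch, the fallback is expected to be rare: it triggers on the first invocation or after fork(), when the frame's page is not yet writable.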
[tip:x86/fpu] x86/fpu: Add a fastpath to __fpu__restore_sig()
Commit-ID: 1d731e731c4cd7cbd3b1aa295f0932e7610da82f Gitweb: https://git.kernel.org/tip/1d731e731c4cd7cbd3b1aa295f0932e7610da82f Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:53 +0200 Committer: Borislav Petkov CommitDate: Fri, 12 Apr 2019 20:04:49 +0200 x86/fpu: Add a fastpath to __fpu__restore_sig() The previous commits refactor the restoration of the FPU registers so that they can be loaded from in-kernel memory. This overhead can be avoided if the load can be performed without a pagefault. Attempt to restore FPU registers by invoking copy_user_to_fpregs_zeroing(). If it fails try the slowpath which can handle pagefaults. [ bp: Add a comment over the fastpath to be able to find one's way around the function. ] Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A. Donenfeld" Cc: kvm ML Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-25-bige...@linutronix.de --- arch/x86/kernel/fpu/signal.c | 23 +-- 1 file changed, 21 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c index 6df1f15e0cd5..a1bd7be70206 100644 --- a/arch/x86/kernel/fpu/signal.c +++ b/arch/x86/kernel/fpu/signal.c @@ -242,10 +242,10 @@ sanitize_restored_xstate(union fpregs_state *state, /* * Restore the extended state if present. Otherwise, restore the FP/SSE state. 
*/ -static inline int copy_user_to_fpregs_zeroing(void __user *buf, u64 xbv, int fx_only) +static int copy_user_to_fpregs_zeroing(void __user *buf, u64 xbv, int fx_only) { if (use_xsave()) { - if ((unsigned long)buf % 64 || fx_only) { + if (fx_only) { u64 init_bv = xfeatures_mask & ~XFEATURE_MASK_FPSSE; copy_kernel_to_xregs(&init_fpstate.xsave, init_bv); return copy_user_to_fxregs(buf); @@ -327,8 +327,27 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size) if (ret) goto err_out; envp = &env; + } else { + /* +* Attempt to restore the FPU registers directly from user +* memory. For that to succeed, the user access cannot cause +* page faults. If it does, fall back to the slow path below, +* going through the kernel buffer with the enabled pagefault +* handler. +*/ + fpregs_lock(); + pagefault_disable(); + ret = copy_user_to_fpregs_zeroing(buf_fx, xfeatures, fx_only); + pagefault_enable(); + if (!ret) { + fpregs_mark_activate(); + fpregs_unlock(); + return 0; + } + fpregs_unlock(); } + if (use_xsave() && !fx_only) { u64 init_bv = xfeatures_mask & ~xfeatures;
[tip:x86/fpu] x86/fpu: Defer FPU state load until return to userspace
Commit-ID: 5f409e20b794565e2d60ad333e79334630a6c798 Gitweb: https://git.kernel.org/tip/5f409e20b794565e2d60ad333e79334630a6c798 Author: Rik van Riel AuthorDate: Wed, 3 Apr 2019 18:41:52 +0200 Committer: Borislav Petkov CommitDate: Fri, 12 Apr 2019 19:34:47 +0200 x86/fpu: Defer FPU state load until return to userspace Defer loading of FPU state until return to userspace. This gives the kernel the potential to skip loading FPU state for tasks that stay in kernel mode, or for tasks that end up with repeated invocations of kernel_fpu_begin() & kernel_fpu_end(). The fpregs_lock/unlock() section ensures that the registers remain unchanged. Otherwise a context switch or a bottom half could save the registers to its FPU context and the processor's FPU registers would become random if modified at the same time. KVM swaps the host/guest registers on the entry/exit path. This flow has been kept as is. First it ensures that the registers are loaded and then saves the current (host) state before it loads the guest's registers. The swap is done at the very end with disabled interrupts so it should not change anymore before the guest is entered. The read/save version seems to be cheaper compared to memcpy() in a micro benchmark. Each thread gets TIF_NEED_FPU_LOAD set as part of fork() / fpu__copy(). For kernel threads, this flag never gets cleared, which avoids saving / restoring the FPU state for kernel threads and during in-kernel usage of the FPU registers. [ bp: Correct and update commit message and fix checkpatch warnings. s/register/registers/ where it is used in plural. minor comment corrections. remove unused trace_x86_fpu_activate_state() TP. ] Signed-off-by: Rik van Riel Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: Aubrey Li Cc: Babu Moger Cc: "Chang S. Bae" Cc: Dmitry Safonov Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A. 
Donenfeld" Cc: Joerg Roedel Cc: Konrad Rzeszutek Wilk Cc: kvm ML Cc: Nicolai Stange Cc: Paolo Bonzini Cc: "Radim Krčmář" Cc: Tim Chen Cc: Waiman Long Cc: x86-ml Cc: Yi Wang Link: https://lkml.kernel.org/r/20190403164156.19645-24-bige...@linutronix.de --- arch/x86/entry/common.c | 10 +++- arch/x86/include/asm/fpu/api.h | 22 +++- arch/x86/include/asm/fpu/internal.h | 27 + arch/x86/include/asm/trace/fpu.h| 10 ++-- arch/x86/kernel/fpu/core.c | 106 +++- arch/x86/kernel/fpu/signal.c| 49 ++--- arch/x86/kernel/process.c | 2 +- arch/x86/kernel/process_32.c| 5 +- arch/x86/kernel/process_64.c| 5 +- arch/x86/kvm/x86.c | 20 +-- 10 files changed, 184 insertions(+), 72 deletions(-) diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index 7bc105f47d21..51beb8d29123 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -25,12 +25,13 @@ #include #include #include +#include #include #include #include -#include #include +#include #define CREATE_TRACE_POINTS #include @@ -196,6 +197,13 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs) if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS)) exit_to_usermode_loop(regs, cached_flags); + /* Reload ti->flags; we may have rescheduled above. */ + cached_flags = READ_ONCE(ti->flags); + + fpregs_assert_state_consistent(); + if (unlikely(cached_flags & _TIF_NEED_FPU_LOAD)) + switch_fpu_return(); + #ifdef CONFIG_COMPAT /* * Compat syscalls set TS_COMPAT. Make sure we clear it before diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h index 73e684160f35..b774c52e5411 100644 --- a/arch/x86/include/asm/fpu/api.h +++ b/arch/x86/include/asm/fpu/api.h @@ -10,7 +10,7 @@ #ifndef _ASM_X86_FPU_API_H #define _ASM_X86_FPU_API_H -#include +#include /* * Use kernel_fpu_begin/end() if you intend to use FPU in kernel context. 
It @@ -22,17 +22,37 @@ extern void kernel_fpu_begin(void); extern void kernel_fpu_end(void); extern bool irq_fpu_usable(void); +extern void fpregs_mark_activate(void); +/* + * Use fpregs_lock() while editing CPU's FPU registers or fpu->state. + * A context switch will (and softirq might) save CPU's FPU registers to + * fpu->state and set TIF_NEED_FPU_LOAD leaving CPU's FPU registers in + * a random state. + */ static inline void fpregs_lock(void) { preempt_disable(); + local_bh_disable(); } static inline void fpregs_unlock(void) { + local_bh_enable(); preempt_enable(); } +#ifdef CONFIG_X86_DEBUG_FPU +extern void fpregs_assert_state_consistent(void); +#else +static inline void fpregs_assert_state_consistent(void) { } +#endif + +/* + * Load the task FPU state before returning to userspace. + */
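The new fpregs_lock()/fpregs_unlock() pair encodes a nesting rule: preemption is disabled before bottom halves, and the two are released in the reverse order of acquisition. A userspace sketch with counters in place of the real primitives (all names here are stand-ins, not the kernel functions):

```c
#include <assert.h>

static int preempt_depth;  /* stands in for the preempt count */
static int bh_depth;       /* stands in for the softirq-disable count */

static void preempt_disable(void)  { preempt_depth++; }
static void preempt_enable(void)   { preempt_depth--; }
static void local_bh_disable(void) { bh_depth++; }

/* Release in the reverse order of acquisition; the assert merely
 * documents the nesting discipline (BHs re-enabled while preemption is
 * still off). */
static void local_bh_enable(void)
{
	assert(preempt_depth > 0);
	bh_depth--;
}

static void fpregs_lock(void)   { preempt_disable(); local_bh_disable(); }
static void fpregs_unlock(void) { local_bh_enable(); preempt_enable(); }

/* Both counts rise inside the critical section and return to zero
 * after the paired unlock. */
static int lock_unlock_balanced(void)
{
	fpregs_lock();
	int held = (preempt_depth == 1 && bh_depth == 1);
	fpregs_unlock();
	return held && preempt_depth == 0 && bh_depth == 0;
}
```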
misuse of fget_raw() in perf_event_get()
What's the point of using fget_raw(), if you do _not_ accept O_PATH descriptors? That should be fget()...
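For readers unfamiliar with the distinction: fget_raw() also resolves O_PATH descriptors, which only name a file and cannot be used for I/O. A small Linux-specific userspace illustration of why code that intends to read from the descriptor must filter such fds out (behaviour as documented in the open(2) man page):

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Open "/" as an O_PATH descriptor and show that read() on it fails
 * with EBADF: the descriptor identifies a location, nothing more. */
static int opath_read_fails(void)
{
	char buf[8];
	int fd = open("/", O_PATH);

	if (fd < 0)
		return -1;
	int r = (int)read(fd, buf, sizeof(buf));
	int err = errno;
	close(fd);
	return r == -1 && err == EBADF;
}
```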
[tip:x86/fpu] x86/fpu: Merge the two code paths in __fpu__restore_sig()
Commit-ID: c2ff9e9a3d9d6c019394a22989a228d02970a8b1 Gitweb: https://git.kernel.org/tip/c2ff9e9a3d9d6c019394a22989a228d02970a8b1 Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:51 +0200 Committer: Borislav Petkov CommitDate: Fri, 12 Apr 2019 15:41:25 +0200 x86/fpu: Merge the two code paths in __fpu__restore_sig() The ia32_fxstate case (32bit with fxsr) and the other (64bit frames or 32bit frames without fxsr) restore both from kernel memory and sanitize the content. The !ia32_fxstate version restores missing xstates from "init state" while the ia32_fxstate doesn't and skips it. Merge the two code paths and keep the !ia32_fxstate one. Copy only the user_i387_ia32_struct data structure in the ia32_fxstate. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A. Donenfeld" Cc: kvm ML Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-23-bige...@linutronix.de --- arch/x86/kernel/fpu/signal.c | 139 +-- 1 file changed, 54 insertions(+), 85 deletions(-) diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c index 9ea1eaa4c9b1..b13e86b29426 100644 --- a/arch/x86/kernel/fpu/signal.c +++ b/arch/x86/kernel/fpu/signal.c @@ -263,12 +263,17 @@ static inline int copy_user_to_fpregs_zeroing(void __user *buf, u64 xbv, int fx_ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size) { + struct user_i387_ia32_struct *envp = NULL; + int state_size = fpu_kernel_xstate_size; int ia32_fxstate = (buf != buf_fx); struct task_struct *tsk = current; struct fpu *fpu = >thread.fpu; - int state_size = fpu_kernel_xstate_size; + struct user_i387_ia32_struct env; + union fpregs_state *state; u64 xfeatures = 0; int fx_only = 0; + int ret = 0; + void *tmp; ia32_fxstate &= (IS_ENABLED(CONFIG_X86_32) || 
IS_ENABLED(CONFIG_IA32_EMULATION)); @@ -303,105 +308,69 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size) } } + tmp = kzalloc(sizeof(*state) + fpu_kernel_xstate_size + 64, GFP_KERNEL); + if (!tmp) + return -ENOMEM; + state = PTR_ALIGN(tmp, 64); + + if ((unsigned long)buf_fx % 64) + fx_only = 1; + + /* +* For 32-bit frames with fxstate, copy the fxstate so it can be +* reconstructed later. +*/ if (ia32_fxstate) { - /* -* For 32-bit frames with fxstate, copy the user state to the -* thread's fpu state, reconstruct fxstate from the fsave -* header. Validate and sanitize the copied state. -*/ - struct user_i387_ia32_struct env; - union fpregs_state *state; - int err = 0; - void *tmp; + ret = __copy_from_user(&env, buf, sizeof(env)); + if (ret) + goto err_out; + envp = &env; + } - tmp = kzalloc(sizeof(*state) + fpu_kernel_xstate_size + 64, GFP_KERNEL); - if (!tmp) - return -ENOMEM; - state = PTR_ALIGN(tmp, 64); + if (use_xsave() && !fx_only) { + u64 init_bv = xfeatures_mask & ~xfeatures; if (using_compacted_format()) { - err = copy_user_to_xstate(&state->xsave, buf_fx); + ret = copy_user_to_xstate(&state->xsave, buf_fx); } else { - err = __copy_from_user(&state->xsave, buf_fx, state_size); - if (!err && state_size > offsetof(struct xregs_state, header)) - err = validate_xstate_header(&state->xsave.header); + ret = __copy_from_user(&state->xsave, buf_fx, state_size); + + if (!ret && state_size > offsetof(struct xregs_state, header)) + ret = validate_xstate_header(&state->xsave.header); } + if (ret) + goto err_out; - if (err || __copy_from_user(&env, buf, sizeof(env))) { - err = -1; - } else { - sanitize_restored_xstate(state, &env, xfeatures, fx_only); - copy_kernel_to_fpregs(state); - } - - kfree(tmp); - return err; - } else { - union fpregs_state *state; - void *tmp; - int ret; - - tmp = kzalloc(sizeof(*state) + fpu_kernel_xstate_size + 64, GFP_KERNEL); - if (!tmp) - return -ENOMEM; -
[tip:x86/fpu] x86/fpu: Restore from kernel memory on the 64-bit path too
Commit-ID: 926b21f37b072ae4c117052de45a975c6d468fec Gitweb: https://git.kernel.org/tip/926b21f37b072ae4c117052de45a975c6d468fec Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:50 +0200 Committer: Borislav Petkov CommitDate: Fri, 12 Apr 2019 15:02:41 +0200 x86/fpu: Restore from kernel memory on the 64-bit path too The 64-bit case (both 64-bit and 32-bit frames) loads the new state from user memory. However, doing this is not desired if the FPU state is going to be restored on return to userland: it would be required to disable preemption in order to avoid a context switch which would set TIF_NEED_FPU_LOAD. If this happens before the restore operation then the loaded registers would become volatile. Furthermore, disabling preemption while accessing user memory requires to disable the pagefault handler. An error during FXRSTOR would then mean that either a page fault occurred (and it would have to be retried with enabled page fault handler) or a #GP occurred because the xstate is bogus (after all, the signal handler can modify it). In order to avoid that mess, copy the FPU state from userland, validate it and then load it. The copy_kernel_…() helpers are basically just like the old helpers except that they operate on kernel memory and the fault handler just sets the error value and the caller handles it. copy_user_to_fpregs_zeroing() and its helpers remain and will be used later for a fastpath optimisation. [ bp: Clarify commit message. ] Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: Aubrey Li Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A. 
Donenfeld" Cc: kvm ML Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-22-bige...@linutronix.de --- arch/x86/include/asm/fpu/internal.h | 43 + arch/x86/kernel/fpu/signal.c| 62 + 2 files changed, 92 insertions(+), 13 deletions(-) diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h index da75d7b3e37d..2cf04fbcba5d 100644 --- a/arch/x86/include/asm/fpu/internal.h +++ b/arch/x86/include/asm/fpu/internal.h @@ -121,6 +121,21 @@ extern void fpstate_sanitize_xstate(struct fpu *fpu); err;\ }) +#define kernel_insn_err(insn, output, input...) \ +({ \ + int err;\ + asm volatile("1:" #insn "\n\t" \ +"2:\n" \ +".section .fixup,\"ax\"\n" \ +"3: movl $-1,%[err]\n"\ +"jmp 2b\n"\ +".previous\n" \ +_ASM_EXTABLE(1b, 3b) \ +: [err] "=r" (err), output \ +: "0"(0), input); \ + err;\ +}) + #define kernel_insn(insn, output, input...)\ asm volatile("1:" #insn "\n\t" \ "2:\n" \ @@ -149,6 +164,14 @@ static inline void copy_kernel_to_fxregs(struct fxregs_state *fx) kernel_insn(fxrstorq %[fx], "=m" (*fx), [fx] "m" (*fx)); } +static inline int copy_kernel_to_fxregs_err(struct fxregs_state *fx) +{ + if (IS_ENABLED(CONFIG_X86_32)) + return kernel_insn_err(fxrstor %[fx], "=m" (*fx), [fx] "m" (*fx)); + else + return kernel_insn_err(fxrstorq %[fx], "=m" (*fx), [fx] "m" (*fx)); +} + static inline int copy_user_to_fxregs(struct fxregs_state __user *fx) { if (IS_ENABLED(CONFIG_X86_32)) @@ -162,6 +185,11 @@ static inline void copy_kernel_to_fregs(struct fregs_state *fx) kernel_insn(frstor %[fx], "=m" (*fx), [fx] "m" (*fx)); } +static inline int copy_kernel_to_fregs_err(struct fregs_state *fx) +{ + return kernel_insn_err(frstor %[fx], "=m" (*fx), [fx] "m" (*fx)); +} + static inline int copy_user_to_fregs(struct fregs_state __user *fx) { return user_insn(frstor %[fx], "=m" (*fx), [fx] "m" (*fx)); @@ -361,6 +389,21 @@ static inline int copy_user_to_xregs(struct xregs_state __user *buf, u64 mask) return err; 
} +/* + * Restore xstate from kernel space xsave area, return an error code instead of + * an exception. + */ +static inline int copy_kernel_to_xregs_err(struct xregs_state
[tip:x86/fpu] x86/fpu: Inline copy_user_to_fpregs_zeroing()
Commit-ID: e0d3602f933367881bddfff310a744e6e61c284c Gitweb: https://git.kernel.org/tip/e0d3602f933367881bddfff310a744e6e61c284c Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:49 +0200 Committer: Borislav Petkov CommitDate: Thu, 11 Apr 2019 20:45:20 +0200 x86/fpu: Inline copy_user_to_fpregs_zeroing() Start refactoring __fpu__restore_sig() by inlining copy_user_to_fpregs_zeroing(). The original function remains and will be used to restore from userland memory if possible. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A. Donenfeld" Cc: kvm ML Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-21-bige...@linutronix.de --- arch/x86/kernel/fpu/signal.c | 20 +++- 1 file changed, 19 insertions(+), 1 deletion(-) diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c index 9b9dfdc96285..c2ff43fbbd07 100644 --- a/arch/x86/kernel/fpu/signal.c +++ b/arch/x86/kernel/fpu/signal.c @@ -337,11 +337,29 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size) kfree(tmp); return err; } else { + int ret; + /* * For 64-bit frames and 32-bit fsave frames, restore the user * state to the registers directly (with exceptions handled). 
*/ - if (copy_user_to_fpregs_zeroing(buf_fx, xfeatures, fx_only)) { + if (use_xsave()) { + if ((unsigned long)buf_fx % 64 || fx_only) { + u64 init_bv = xfeatures_mask & ~XFEATURE_MASK_FPSSE; + copy_kernel_to_xregs(&init_fpstate.xsave, init_bv); + ret = copy_user_to_fxregs(buf_fx); + } else { + u64 init_bv = xfeatures_mask & ~xfeatures; + if (unlikely(init_bv)) + copy_kernel_to_xregs(&init_fpstate.xsave, init_bv); + ret = copy_user_to_xregs(buf_fx, xfeatures); + } + } else if (use_fxsr()) { + ret = copy_user_to_fxregs(buf_fx); + } else + ret = copy_user_to_fregs(buf_fx); + + if (ret) { fpu__clear(fpu); return -1; }
[tip:x86/fpu] x86/entry: Add TIF_NEED_FPU_LOAD
Commit-ID: 383c252545edcc708128e2028a2318b05c45ede4 Gitweb: https://git.kernel.org/tip/383c252545edcc708128e2028a2318b05c45ede4 Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:45 +0200 Committer: Borislav Petkov CommitDate: Thu, 11 Apr 2019 16:21:51 +0200 x86/entry: Add TIF_NEED_FPU_LOAD Add TIF_NEED_FPU_LOAD. This flag is used for loading the FPU registers before returning to userland. It must not be set on systems without a FPU. If this flag is cleared, the CPU's FPU registers hold the latest, up-to-date content of the current task's (current()) FPU registers. The in-memory copy (union fpregs_state) is not valid. If this flag is set, then all of CPU's FPU registers may hold a random value (except for PKRU) and it is required to load the content of the FPU registers on return to userland. Introduce it now as a preparatory change before adding the main feature. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: Aubrey Li Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A. Donenfeld" Cc: Konrad Rzeszutek Wilk Cc: kvm ML Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Rik van Riel Cc: Tim Chen Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-17-bige...@linutronix.de --- arch/x86/include/asm/fpu/internal.h | 8 arch/x86/include/asm/thread_info.h | 2 ++ 2 files changed, 10 insertions(+) diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h index 6eb4a0b1ad0e..da75d7b3e37d 100644 --- a/arch/x86/include/asm/fpu/internal.h +++ b/arch/x86/include/asm/fpu/internal.h @@ -508,6 +508,14 @@ static inline void __fpregs_load_activate(struct fpu *fpu, int cpu) * - switch_fpu_finish() restores the new state as *necessary. * + * If TIF_NEED_FPU_LOAD is cleared then the CPU's FPU registers + * are saved in the current thread's FPU register state. 
+ * + * If TIF_NEED_FPU_LOAD is set then CPU's FPU registers may not + * hold current()'s FPU registers. It is required to load the + * registers before returning to userland or using the content + * otherwise. + * * The FPU context is only stored/restored for a user task and * ->mm is used to distinguish between kernel and user threads. */ diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h index e0eccbcb8447..f9453536f9bb 100644 --- a/arch/x86/include/asm/thread_info.h +++ b/arch/x86/include/asm/thread_info.h @@ -88,6 +88,7 @@ struct thread_info { #define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */ #define TIF_UPROBE 12 /* breakpointed or singlestepping */ #define TIF_PATCH_PENDING 13 /* pending live patching update */ +#define TIF_NEED_FPU_LOAD 14 /* load FPU on return to userspace */ #define TIF_NOCPUID15 /* CPUID is not accessible in userland */ #define TIF_NOTSC 16 /* TSC is not accessible in userland */ #define TIF_IA32 17 /* IA32 compatibility process */ @@ -117,6 +118,7 @@ struct thread_info { #define _TIF_USER_RETURN_NOTIFY(1 << TIF_USER_RETURN_NOTIFY) #define _TIF_UPROBE(1 << TIF_UPROBE) #define _TIF_PATCH_PENDING (1 << TIF_PATCH_PENDING) +#define _TIF_NEED_FPU_LOAD (1 << TIF_NEED_FPU_LOAD) #define _TIF_NOCPUID (1 << TIF_NOCPUID) #define _TIF_NOTSC (1 << TIF_NOTSC) #define _TIF_IA32 (1 << TIF_IA32)
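The TIF_NEED_FPU_LOAD definition above follows the usual kernel convention of a bit number plus a derived mask. A self-contained sketch of that convention and the test/set/clear helpers it supports (userspace stand-ins, not the kernel's thread_info accessors):

```c
#include <assert.h>

#define TIF_NEED_FPU_LOAD	14
#define _TIF_NEED_FPU_LOAD	(1UL << TIF_NEED_FPU_LOAD)

static unsigned long ti_flags;  /* stands in for thread_info.flags */

static void set_thread_flag(int flag)   { ti_flags |= 1UL << flag; }
static void clear_thread_flag(int flag) { ti_flags &= ~(1UL << flag); }
static int  test_thread_flag(int flag)  { return !!(ti_flags & (1UL << flag)); }

/* Setting the flag makes both the helper and the mask form observe it;
 * clearing it returns the word to its previous state. */
static int flag_roundtrip(void)
{
	set_thread_flag(TIF_NEED_FPU_LOAD);
	int on = test_thread_flag(TIF_NEED_FPU_LOAD) &&
		 (ti_flags & _TIF_NEED_FPU_LOAD) != 0;
	clear_thread_flag(TIF_NEED_FPU_LOAD);
	return on && !test_thread_flag(TIF_NEED_FPU_LOAD);
}
```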
[tip:x86/fpu] x86/fpu: Update xstate's PKRU value on write_pkru()
Commit-ID: 0d714dba162620fd8b9f5b3104a487e041353c4d Gitweb: https://git.kernel.org/tip/0d714dba162620fd8b9f5b3104a487e041353c4d Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:48 +0200 Committer: Borislav Petkov CommitDate: Thu, 11 Apr 2019 20:33:29 +0200 x86/fpu: Update xstate's PKRU value on write_pkru() During the context switch the xstate is loaded which also includes the PKRU value. If xstate is restored on return to userland it is required that the PKRU value in xstate is the same as the one in the CPU. Save the PKRU in xstate during modification. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andi Kleen Cc: Andy Lutomirski Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: "Jason A. Donenfeld" Cc: Joerg Roedel Cc: Juergen Gross Cc: "Kirill A. Shutemov" Cc: kvm ML Cc: Michal Hocko Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-20-bige...@linutronix.de --- arch/x86/include/asm/pgtable.h | 21 +++-- 1 file changed, 19 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 9beb371b1adf..5cfbbb6d458d 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -23,6 +23,8 @@ #ifndef __ASSEMBLY__ #include +#include +#include extern pgd_t early_top_pgt[PTRS_PER_PGD]; int __init __early_make_pgtable(unsigned long address, pmdval_t pmd); @@ -133,8 +135,23 @@ static inline u32 read_pkru(void) static inline void write_pkru(u32 pkru) { - if (boot_cpu_has(X86_FEATURE_OSPKE)) - __write_pkru(pkru); + struct pkru_state *pk; + + if (!boot_cpu_has(X86_FEATURE_OSPKE)) + return; + + pk = get_xsave_addr(>thread.fpu.state.xsave, XFEATURE_PKRU); + + /* +* The PKRU value in xstate needs to be in sync with the value that is +* written to the CPU. The FPU restore on return to userland would +* otherwise load the previous value again. 
+*/ + fpregs_lock(); + if (pk) + pk->pkru = pkru; + __write_pkru(pkru); + fpregs_unlock(); } static inline int pte_young(pte_t pte)
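The invariant this patch enforces generalizes to any cached copy of hardware state: a write must update both the in-memory copy and the "register", or a later restore resurrects the stale value. A toy model of that failure mode and its fix (all names invented, no real PKRU involved):

```c
#include <assert.h>

static unsigned int hw_pkru;      /* stands in for the CPU's PKRU register */
static unsigned int xstate_pkru;  /* stands in for the PKRU slot in the xsave area */

/* Write both copies, mirroring what the patched write_pkru() does. */
static void write_pkru_synced(unsigned int val)
{
	xstate_pkru = val;
	hw_pkru = val;
}

/* A context "restore" loads the register from the saved state. */
static void restore_from_xstate(void)
{
	hw_pkru = xstate_pkru;
}

/* Even if the "register" is clobbered between write and restore, the
 * restore reinstates the value written, because the saved copy was
 * kept in sync. */
static int no_stale_value_after_restore(void)
{
	write_pkru_synced(0x55);
	hw_pkru = 0;
	restore_from_xstate();
	return hw_pkru == 0x55;
}
```

Had write_pkru_synced() updated only hw_pkru, the restore would have brought back the previous xstate value, which is exactly the bug the patch closes.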
[tip:x86/fpu] x86/fpu: Prepare copy_fpstate_to_sigframe() for TIF_NEED_FPU_LOAD
Commit-ID: a352a3b7b7920212ee4c45a41500c66826318e92 Gitweb: https://git.kernel.org/tip/a352a3b7b7920212ee4c45a41500c66826318e92 Author: Rik van Riel AuthorDate: Wed, 3 Apr 2019 18:41:47 +0200 Committer: Borislav Petkov CommitDate: Thu, 11 Apr 2019 18:20:04 +0200 x86/fpu: Prepare copy_fpstate_to_sigframe() for TIF_NEED_FPU_LOAD The FPU registers need only to be saved if TIF_NEED_FPU_LOAD is not set. Otherwise this has been already done and can be skipped. [ bp: Massage a bit. ] Signed-off-by: Rik van Riel Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A. Donenfeld" Cc: kvm ML Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-19-bige...@linutronix.de --- arch/x86/kernel/fpu/signal.c | 12 +++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c index 8f23f5237218..9b9dfdc96285 100644 --- a/arch/x86/kernel/fpu/signal.c +++ b/arch/x86/kernel/fpu/signal.c @@ -171,7 +171,17 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size) sizeof(struct user_i387_ia32_struct), NULL, (struct _fpstate_32 __user *) buf) ? -1 : 1; - copy_fpregs_to_fpstate(fpu); + /* +* If we do not need to load the FPU registers at return to userspace +* then the CPU has the current state and we need to save it. Otherwise, +* it has already been done and we can skip it. +*/ + fpregs_lock(); + if (!test_thread_flag(TIF_NEED_FPU_LOAD)) { + copy_fpregs_to_fpstate(fpu); + set_thread_flag(TIF_NEED_FPU_LOAD); + } + fpregs_unlock(); if (using_compacted_format()) { if (copy_xstate_to_user(buf_fx, xsave, 0, size))
[tip:x86/fpu] x86/fpu: Always store the registers in copy_fpstate_to_sigframe()
Commit-ID: 69277c98f5eef0d9839699b7825923c3985f665f Gitweb: https://git.kernel.org/tip/69277c98f5eef0d9839699b7825923c3985f665f Author: Rik van Riel AuthorDate: Wed, 3 Apr 2019 18:41:46 +0200 Committer: Borislav Petkov CommitDate: Thu, 11 Apr 2019 18:08:57 +0200 x86/fpu: Always store the registers in copy_fpstate_to_sigframe() copy_fpstate_to_sigframe() stores the registers directly to user space. This is okay because the FPU registers are valid and saving them directly avoids saving them into kernel memory and making a copy. However, this cannot be done anymore if the FPU registers are going to be restored on the return to userland. It is possible that the FPU registers will be invalidated in the middle of the save operation and this should be done with disabled preemption / BH. Save the FPU registers to the task's FPU struct and copy them to the user memory later on. Signed-off-by: Rik van Riel Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A. Donenfeld" Cc: kvm ML Cc: Paolo Bonzini Cc: Radim Krčmář Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-18-bige...@linutronix.de --- arch/x86/kernel/fpu/signal.c | 19 ++- 1 file changed, 14 insertions(+), 5 deletions(-) diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c index 155f4552413e..8f23f5237218 100644 --- a/arch/x86/kernel/fpu/signal.c +++ b/arch/x86/kernel/fpu/signal.c @@ -144,8 +144,8 @@ static inline int copy_fpregs_to_sigframe(struct xregs_state __user *buf) * buf == buf_fx for 64-bit frames and 32-bit fsave frame. * buf != buf_fx for 32-bit frames with fxstate. * - * Save the state directly to the user frame pointed by the aligned pointer - * 'buf_fx'. + * Save the state to task's fpu->state and then copy it to the user frame + * pointed to by the aligned pointer 'buf_fx'. 
* * If this is a 32-bit frame with fxstate, put a fsave header before * the aligned state at 'buf_fx'. @@ -155,6 +155,8 @@ static inline int copy_fpregs_to_sigframe(struct xregs_state __user *buf) */ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size) { + struct fpu *fpu = &current->thread.fpu; + struct xregs_state *xsave = &fpu->state.xsave; struct task_struct *tsk = current; int ia32_fxstate = (buf != buf_fx); @@ -169,9 +171,16 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size) sizeof(struct user_i387_ia32_struct), NULL, (struct _fpstate_32 __user *) buf) ? -1 : 1; - /* Save the live registers state to the user frame directly. */ - if (copy_fpregs_to_sigframe(buf_fx)) - return -1; + copy_fpregs_to_fpstate(fpu); + + if (using_compacted_format()) { + if (copy_xstate_to_user(buf_fx, xsave, 0, size)) + return -1; + } else { + fpstate_sanitize_xstate(fpu); + if (__copy_to_user(buf_fx, xsave, fpu_user_xstate_size)) + return -1; + } /* Save the fsave header for the 32-bit frames. */ if ((ia32_fxstate || !use_fxsr()) && save_fsave_header(tsk, buf))
[tip:x86/fpu] x86/fpu: Eager switch PKRU state
Commit-ID: 0cecca9d03c964abbd2b7927d0670eb70db4ebf2 Gitweb: https://git.kernel.org/tip/0cecca9d03c964abbd2b7927d0670eb70db4ebf2 Author: Rik van Riel AuthorDate: Wed, 3 Apr 2019 18:41:44 +0200 Committer: Borislav Petkov CommitDate: Thu, 11 Apr 2019 15:57:10 +0200 x86/fpu: Eager switch PKRU state While most of a task's FPU state is only needed in user space, the protection keys need to be in place immediately after a context switch. The reason is that any access to userspace memory while running in kernel mode also needs to abide by the memory permissions specified in the protection keys. The "eager switch" is a preparation for loading the FPU state on return to userland. Instead of decoupling PKRU state from xstate, update PKRU within xstate on write operations by the kernel. For user tasks the PKRU should be always read from the xsave area and it should not change anything because the PKRU value was loaded as part of FPU restore. For kernel threads the default "init_pkru_value" will be written. Before this commit, the kernel thread would end up with a random value which it inherited from the previous user task. [ bigeasy: save pkru to xstate, no cache, don't use __raw_xsave_addr() ] [ bp: update commit message, sort headers properly in asm/fpu/xstate.h ] Signed-off-by: Rik van Riel Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andi Kleen Cc: Andy Lutomirski Cc: Aubrey Li Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A. Donenfeld" Cc: Joerg Roedel Cc: Juergen Gross Cc: "Kirill A. 
Shutemov" Cc: kvm ML Cc: Michal Hocko Cc: Paolo Bonzini Cc: Peter Zijlstra Cc: Radim Krčmář Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-16-bige...@linutronix.de --- arch/x86/include/asm/fpu/internal.h | 24 ++-- arch/x86/include/asm/fpu/xstate.h | 4 +++- arch/x86/include/asm/pgtable.h | 6 ++ arch/x86/mm/pkeys.c | 1 - 4 files changed, 31 insertions(+), 4 deletions(-) diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h index 3e0c2c496f2d..6eb4a0b1ad0e 100644 --- a/arch/x86/include/asm/fpu/internal.h +++ b/arch/x86/include/asm/fpu/internal.h @@ -14,6 +14,7 @@ #include #include #include +#include #include #include @@ -534,8 +535,27 @@ switch_fpu_prepare(struct fpu *old_fpu, int cpu) */ static inline void switch_fpu_finish(struct fpu *new_fpu, int cpu) { - if (static_cpu_has(X86_FEATURE_FPU)) - __fpregs_load_activate(new_fpu, cpu); + u32 pkru_val = init_pkru_value; + struct pkru_state *pk; + + if (!static_cpu_has(X86_FEATURE_FPU)) + return; + + __fpregs_load_activate(new_fpu, cpu); + + if (!cpu_feature_enabled(X86_FEATURE_OSPKE)) + return; + + /* +* PKRU state is switched eagerly because it needs to be valid before we +* return to userland e.g. for a copy_to_user() operation. 
+*/ + if (current->mm) { + pk = get_xsave_addr(&new_fpu->state.xsave, XFEATURE_PKRU); + if (pk) + pkru_val = pk->pkru; + } + __write_pkru(pkru_val); } /* diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h index fbe41f808e5d..7e42b285c856 100644 --- a/arch/x86/include/asm/fpu/xstate.h +++ b/arch/x86/include/asm/fpu/xstate.h @@ -2,9 +2,11 @@ #ifndef __ASM_X86_XSAVE_H #define __ASM_X86_XSAVE_H +#include #include + #include -#include +#include /* Bit 63 of XCR0 is reserved for future expansion */ #define XFEATURE_MASK_EXTEND (~(XFEATURE_MASK_FPSSE | (1ULL << 63))) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index e8875ca75623..9beb371b1adf 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -1355,6 +1355,12 @@ static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd) #define PKRU_WD_BIT 0x2 #define PKRU_BITS_PER_PKEY 2 +#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS +extern u32 init_pkru_value; +#else +#define init_pkru_value 0 +#endif + static inline bool __pkru_allows_read(u32 pkru, u16 pkey) { int pkru_pkey_bits = pkey * PKRU_BITS_PER_PKEY; diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c index 50f65fc1b9a3..2ecbf4155f98 100644 --- a/arch/x86/mm/pkeys.c +++ b/arch/x86/mm/pkeys.c @@ -126,7 +126,6 @@ int __arch_override_mprotect_pkey(struct vm_area_struct *vma, int prot, int pkey * in the process's lifetime will not accidentally get access * to data which is pkey-protected later on. */ -static u32 init_pkru_value = PKRU_AD_KEY( 1) | PKRU_AD_KEY( 2) | PKRU_AD_KEY( 3) | PKRU_AD_KEY( 4) | PKRU_AD_KEY( 5) | PKRU_AD_KEY( 6) | PKRU_AD_KEY( 7) | PKRU_AD_KEY( 8) | PKRU_AD_KEY( 9) |
[tip:x86/fpu] x86/pkeys: Don't check if PKRU is zero before writing it
Commit-ID: 0556cbdc2fbcb3068e5b924a8b3d5386ae0dd27d Gitweb: https://git.kernel.org/tip/0556cbdc2fbcb3068e5b924a8b3d5386ae0dd27d Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:43 +0200 Committer: Borislav Petkov CommitDate: Thu, 11 Apr 2019 15:41:05 +0200 x86/pkeys: Don't check if PKRU is zero before writing it __write_pkru() now checks whether the current value is the same as the value about to be written and skips the write whenever the two match, which covers more than the old special case of both values being zero. Remove the zero check of PKRU; __write_pkru() provides such a check now. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: "Jason A. Donenfeld" Cc: kvm ML Cc: Paolo Bonzini Cc: Peter Zijlstra Cc: Radim Krčmář Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-15-bige...@linutronix.de --- arch/x86/mm/pkeys.c | 7 --- 1 file changed, 7 deletions(-) diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c index 05bb9a44eb1c..50f65fc1b9a3 100644 --- a/arch/x86/mm/pkeys.c +++ b/arch/x86/mm/pkeys.c @@ -142,13 +142,6 @@ u32 init_pkru_value = PKRU_AD_KEY( 1) | PKRU_AD_KEY( 2) | PKRU_AD_KEY( 3) | void copy_init_pkru_to_fpregs(void) { u32 init_pkru_value_snapshot = READ_ONCE(init_pkru_value); - /* -* Any write to PKRU takes it out of the XSAVE 'init -* state' which increases context switch cost. Avoid -* writing 0 when PKRU was already 0. -*/ - if (!init_pkru_value_snapshot && !read_pkru()) - return; /* * Override the PKRU state that came from 'init_fpstate' * with the baseline from the process.
[tip:x86/fpu] x86/fpu: Use a feature number instead of mask in two more helpers
Commit-ID: abd16d68d65229e5acafdadc32704239131bf2ea Gitweb: https://git.kernel.org/tip/abd16d68d65229e5acafdadc32704239131bf2ea Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:40 +0200 Committer: Borislav Petkov CommitDate: Wed, 10 Apr 2019 18:20:27 +0200 x86/fpu: Use a feature number instead of mask in two more helpers After changing the argument of __raw_xsave_addr() from a mask to number Dave suggested to check if it makes sense to do the same for get_xsave_addr(). As it turns out it does. Only get_xsave_addr() needs the mask to check if the requested feature is part of what is supported/saved and then uses the number again. The shift operation is cheaper compared to fls64() (find last bit set). Also, the feature number uses less opcode space compared to the mask. :) Make the get_xsave_addr() argument a xfeature number instead of a mask and fix up its callers. Furthermore, use xfeature_nr and xfeature_mask consistently. This results in the following changes to the kvm code: feature -> xfeature_mask index -> xfeature_nr Suggested-by: Dave Hansen Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: "Eric W. Biederman" Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A. 
Donenfeld" Cc: kvm ML Cc: Masami Hiramatsu Cc: Paolo Bonzini Cc: Peter Zijlstra Cc: "Radim Krčmář" Cc: Rasmus Villemoes Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: Siarhei Liakh Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-12-bige...@linutronix.de --- arch/x86/include/asm/fpu/xstate.h | 4 ++-- arch/x86/kernel/fpu/xstate.c | 22 ++ arch/x86/kernel/traps.c | 2 +- arch/x86/kvm/x86.c| 28 ++-- arch/x86/mm/mpx.c | 6 +++--- 5 files changed, 30 insertions(+), 32 deletions(-) diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h index 48581988d78c..fbe41f808e5d 100644 --- a/arch/x86/include/asm/fpu/xstate.h +++ b/arch/x86/include/asm/fpu/xstate.h @@ -46,8 +46,8 @@ extern void __init update_regset_xstate_info(unsigned int size, u64 xstate_mask); void fpu__xstate_clear_all_cpu_caps(void); -void *get_xsave_addr(struct xregs_state *xsave, int xstate); -const void *get_xsave_field_ptr(int xstate_field); +void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr); +const void *get_xsave_field_ptr(int xfeature_nr); int using_compacted_format(void); int copy_xstate_to_kernel(void *kbuf, struct xregs_state *xsave, unsigned int offset, unsigned int size); int copy_xstate_to_user(void __user *ubuf, struct xregs_state *xsave, unsigned int offset, unsigned int size); diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c index 4f7f3c5d0d0c..9c459fd1d38e 100644 --- a/arch/x86/kernel/fpu/xstate.c +++ b/arch/x86/kernel/fpu/xstate.c @@ -830,15 +830,14 @@ static void *__raw_xsave_addr(struct xregs_state *xsave, int xfeature_nr) * * Inputs: * xstate: the thread's storage area for all FPU data - * xstate_feature: state which is defined in xsave.h (e.g. - * XFEATURE_MASK_FP, XFEATURE_MASK_SSE, etc...) + * xfeature_nr: state which is defined in xsave.h (e.g. XFEATURE_FP, + * XFEATURE_SSE, etc...) * Output: * address of the state in the xsave area, or NULL if the * field is not present in the xsave buffer. 
*/ -void *get_xsave_addr(struct xregs_state *xsave, int xstate_feature) +void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr) { - int xfeature_nr; /* * Do we even *have* xsave state? */ @@ -850,11 +849,11 @@ void *get_xsave_addr(struct xregs_state *xsave, int xstate_feature) * have not enabled. Remember that pcntxt_mask is * what we write to the XCR0 register. */ - WARN_ONCE(!(xfeatures_mask & xstate_feature), + WARN_ONCE(!(xfeatures_mask & BIT_ULL(xfeature_nr)), "get of unsupported state"); /* * This assumes the last 'xsave*' instruction to -* have requested that 'xstate_feature' be saved. +* have requested that 'xfeature_nr' be saved. * If it did not, we might be seeing and old value * of the field in the buffer. * @@ -863,10 +862,9 @@ void *get_xsave_addr(struct xregs_state *xsave, int xstate_feature) * or because the "init optimization" caused it * to not be saved. */ - if (!(xsave->header.xfeatures & xstate_feature)) + if (!(xsave->header.xfeatures & BIT_ULL(xfeature_nr))) return NULL; - xfeature_nr = fls64(xstate_feature) - 1; return __raw_xsave_addr(xsave, xfeature_nr); } EXPORT_SYMBOL_GPL(get_xsave_addr); @@ -882,13 +880,13 @@ EXPORT_SYMBOL_GPL(get_xsave_addr); * Note that this only works
[tip:x86/fpu] x86/fpu: Only write PKRU if it is different from current
Commit-ID: 577ff465f5a6a5a0866d75a033844810baca20a0 Gitweb: https://git.kernel.org/tip/577ff465f5a6a5a0866d75a033844810baca20a0 Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:42 +0200 Committer: Borislav Petkov CommitDate: Thu, 11 Apr 2019 15:41:05 +0200 x86/fpu: Only write PKRU if it is different from current According to Dave Hansen, WRPKRU is more expensive than RDPKRU. It has a higher cycle cost and it's also practically a (light) speculation barrier. As an optimisation read the current PKRU value and only write the new one if it is different. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: "Jason A. Donenfeld" Cc: Juergen Gross Cc: kvm ML Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-14-bige...@linutronix.de --- arch/x86/include/asm/special_insns.h | 7 +++ 1 file changed, 7 insertions(+) diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h index 34897e2b52c9..0a3c4cab39db 100644 --- a/arch/x86/include/asm/special_insns.h +++ b/arch/x86/include/asm/special_insns.h @@ -121,6 +121,13 @@ static inline void wrpkru(u32 pkru) static inline void __write_pkru(u32 pkru) { + /* +* WRPKRU is relatively expensive compared to RDPKRU. +* Avoid WRPKRU when it would not change the value. +*/ + if (pkru == rdpkru()) + return; + wrpkru(pkru); }
[tip:x86/fpu] x86/pkeys: Provide *pkru() helpers
Commit-ID: c806e88734b9e9aea260bf2261c129aa23968fca Gitweb: https://git.kernel.org/tip/c806e88734b9e9aea260bf2261c129aa23968fca Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:41 +0200 Committer: Borislav Petkov CommitDate: Thu, 11 Apr 2019 15:40:58 +0200 x86/pkeys: Provide *pkru() helpers Dave Hansen asked for __read_pkru() and __write_pkru() to be symmetrical. As part of the series __write_pkru() will read back the value and only write it if it is different. In order to make both functions symmetrical, move the function containing only the opcode asm into a function called like the instruction itself. __write_pkru() will just invoke wrpkru() but in a follow-up patch will also read back the value. [ bp: Convert asm opcode wrapper names to rd/wrpkru(). ] Suggested-by: Dave Hansen Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andi Kleen Cc: Andy Lutomirski Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: "Jason A. Donenfeld" Cc: Joerg Roedel Cc: Juergen Gross Cc: "Kirill A. 
Shutemov" Cc: kvm ML Cc: Michal Hocko Cc: Paolo Bonzini Cc: "Radim Krčmář" Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-13-bige...@linutronix.de --- arch/x86/include/asm/pgtable.h | 2 +- arch/x86/include/asm/special_insns.h | 12 +--- arch/x86/kvm/vmx/vmx.c | 2 +- 3 files changed, 11 insertions(+), 5 deletions(-) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 2779ace16d23..e8875ca75623 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -127,7 +127,7 @@ static inline int pte_dirty(pte_t pte) static inline u32 read_pkru(void) { if (boot_cpu_has(X86_FEATURE_OSPKE)) - return __read_pkru(); + return rdpkru(); return 0; } diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h index 43c029cdc3fe..34897e2b52c9 100644 --- a/arch/x86/include/asm/special_insns.h +++ b/arch/x86/include/asm/special_insns.h @@ -92,7 +92,7 @@ static inline void native_write_cr8(unsigned long val) #endif #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS -static inline u32 __read_pkru(void) +static inline u32 rdpkru(void) { u32 ecx = 0; u32 edx, pkru; @@ -107,7 +107,7 @@ static inline u32 __read_pkru(void) return pkru; } -static inline void __write_pkru(u32 pkru) +static inline void wrpkru(u32 pkru) { u32 ecx = 0, edx = 0; @@ -118,8 +118,14 @@ static inline void __write_pkru(u32 pkru) asm volatile(".byte 0x0f,0x01,0xef\n\t" : : "a" (pkru), "c"(ecx), "d"(edx)); } + +static inline void __write_pkru(u32 pkru) +{ + wrpkru(pkru); +} + #else -static inline u32 __read_pkru(void) +static inline u32 rdpkru(void) { return 0; } diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index ab432a930ae8..dd1e1eea4f0b 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -6501,7 +6501,7 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu) */ if (static_cpu_has(X86_FEATURE_PKU) && kvm_read_cr4_bits(vcpu, X86_CR4_PKE)) { - vcpu->arch.pkru = __read_pkru(); + 
vcpu->arch.pkru = rdpkru(); if (vcpu->arch.pkru != vmx->host_pkru) __write_pkru(vmx->host_pkru); }
[PATCH] new flag COPY_FILE_RANGE_FILESIZE for copy_file_range()
If flags includes COPY_FILE_RANGE_FILESIZE then the length copied is the length of the file. off_in and off_out are ignored. len must be 0 or the file size. This implementation saves a call to stat() in the common case of copying files. It does not fix any race conditions, but that is possible in the future with this interface. EAGAIN: If COPY_FILE_RANGE_FILESIZE was passed and len is not 0 or the file size. Signed-off-by: Shawn Landden CC: --- fs/read_write.c | 14 +- include/uapi/linux/stat.h | 4 2 files changed, 17 insertions(+), 1 deletion(-) diff --git a/fs/read_write.c b/fs/read_write.c index 61b43ad7608e..6d06361f0856 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -1557,7 +1557,7 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, struct inode *inode_out = file_inode(file_out); ssize_t ret; - if (flags != 0) + if ((flags & ~COPY_FILE_RANGE_FILESIZE) != 0) return -EINVAL; if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode)) @@ -1565,6 +1565,18 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode)) return -EINVAL; + if (flags & COPY_FILE_RANGE_FILESIZE) { + struct kstat stat; + int error; + error = vfs_getattr(&file_in->f_path, &stat, + STATX_SIZE, 0); + if (error < 0) + return error; + if (!(len == 0 || len == stat.size)) + return -EAGAIN; + len = stat.size; + } + ret = rw_verify_area(READ, file_in, &pos_in, len); if (unlikely(ret)) return ret; diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h index 7b35e98d3c58..1075aa4666ef 100644 --- a/include/uapi/linux/stat.h +++ b/include/uapi/linux/stat.h @@ -170,5 +170,9 @@ struct statx { #define STATX_ATTR_AUTOMOUNT 0x1000 /* Dir: Automount trigger */ +/* + * Flags for copy_file_range() + */ +#define COPY_FILE_RANGE_FILESIZE 0x0001 /* Copy the full length of the input file */ #endif /* _UAPI_LINUX_STAT_H */ -- 2.20.1
[tip:x86/fpu] x86/fpu: Make __raw_xsave_addr() use a feature number instead of mask
Commit-ID: 07baeb04f37c952981d63359ff840118ce8f5434 Gitweb: https://git.kernel.org/tip/07baeb04f37c952981d63359ff840118ce8f5434 Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:39 +0200 Committer: Borislav Petkov CommitDate: Wed, 10 Apr 2019 16:33:45 +0200 x86/fpu: Make __raw_xsave_addr() use a feature number instead of mask Most users of __raw_xsave_addr() use a feature number, shift it to a mask and then __raw_xsave_addr() shifts it back to the feature number. Make __raw_xsave_addr() use the feature number as an argument. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: "Jason A. Donenfeld" Cc: kvm ML Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-11-bige...@linutronix.de --- arch/x86/kernel/fpu/xstate.c | 22 +++--- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c index 8bfcc5b9e04b..4f7f3c5d0d0c 100644 --- a/arch/x86/kernel/fpu/xstate.c +++ b/arch/x86/kernel/fpu/xstate.c @@ -805,20 +805,18 @@ void fpu__resume_cpu(void) } /* - * Given an xstate feature mask, calculate where in the xsave + * Given an xstate feature nr, calculate where in the xsave * buffer the state is. Callers should ensure that the buffer * is valid. 
*/ -static void *__raw_xsave_addr(struct xregs_state *xsave, int xstate_feature_mask) +static void *__raw_xsave_addr(struct xregs_state *xsave, int xfeature_nr) { - int feature_nr = fls64(xstate_feature_mask) - 1; - - if (!xfeature_enabled(feature_nr)) { + if (!xfeature_enabled(xfeature_nr)) { WARN_ON_FPU(1); return NULL; } - return (void *)xsave + xstate_comp_offsets[feature_nr]; + return (void *)xsave + xstate_comp_offsets[xfeature_nr]; } /* * Given the xsave area and a state inside, this function returns the @@ -840,6 +838,7 @@ static void *__raw_xsave_addr(struct xregs_state *xsave, int xstate_feature_mask */ void *get_xsave_addr(struct xregs_state *xsave, int xstate_feature) { + int xfeature_nr; /* * Do we even *have* xsave state? */ @@ -867,7 +866,8 @@ void *get_xsave_addr(struct xregs_state *xsave, int xstate_feature) if (!(xsave->header.xfeatures & xstate_feature)) return NULL; - return __raw_xsave_addr(xsave, xstate_feature); + xfeature_nr = fls64(xstate_feature) - 1; + return __raw_xsave_addr(xsave, xfeature_nr); } EXPORT_SYMBOL_GPL(get_xsave_addr); @@ -1014,7 +1014,7 @@ int copy_xstate_to_kernel(void *kbuf, struct xregs_state *xsave, unsigned int of * Copy only in-use xstates: */ if ((header.xfeatures >> i) & 1) { - void *src = __raw_xsave_addr(xsave, 1 << i); + void *src = __raw_xsave_addr(xsave, i); offset = xstate_offsets[i]; size = xstate_sizes[i]; @@ -1100,7 +1100,7 @@ int copy_xstate_to_user(void __user *ubuf, struct xregs_state *xsave, unsigned i * Copy only in-use xstates: */ if ((header.xfeatures >> i) & 1) { - void *src = __raw_xsave_addr(xsave, 1 << i); + void *src = __raw_xsave_addr(xsave, i); offset = xstate_offsets[i]; size = xstate_sizes[i]; @@ -1157,7 +1157,7 @@ int copy_kernel_to_xstate(struct xregs_state *xsave, const void *kbuf) u64 mask = ((u64)1 << i); if (hdr.xfeatures & mask) { - void *dst = __raw_xsave_addr(xsave, 1 << i); + void *dst = __raw_xsave_addr(xsave, i); offset = xstate_offsets[i]; size = xstate_sizes[i]; @@ -1211,7 
+1211,7 @@ int copy_user_to_xstate(struct xregs_state *xsave, const void __user *ubuf) u64 mask = ((u64)1 << i); if (hdr.xfeatures & mask) { - void *dst = __raw_xsave_addr(xsave, 1 << i); + void *dst = __raw_xsave_addr(xsave, i); offset = xstate_offsets[i]; size = xstate_sizes[i];
bug disabling NX (noexec=off)
Hi, Intel Core i3-2120 + kernel-5.0.7 x86_64 from Fedora: [0.00] microcode: microcode updated early to revision 0x2e, date = 2018-04-10 [0.00] Linux version 5.0.7-300.fc30.x86_64 (mockbu...@bkernel04.phx2.fedoraproject.org) (gcc version 9.0.1 20190312 (Red Hat 9.0.1-0.10) (GCC)) #1 SMP Mon Apr 8 18:28:09 UTC 2019 [0.00] Command line: BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-5.0.7-300.fc30.x86_64 root=UUID=ea93877a-9487-4416-9f32-9d1fba2a639a ro quiet audit=0 usb-storage.delay_use=0 net.ifnames=0 plymouth.enable=0 pti=off spectre_v2=off spectre_v2_user=off spec_store_bypass_disable=off l1tf=off noexec=off [...] [0.00] NX (Execute Disable) protection: disabled by kernel command line option [0.00] [ cut here ] [0.00] attempted to set unsupported pgprot: 8163 bits: 8000 supported: 7fff [0.00] WARNING: CPU: 0 PID: 0 at arch/x86/include/asm/pgtable.h:537 __early_set_fixmap+0xa2/0xff [0.00] Modules linked in: [0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 5.0.7-300.fc30.x86_64 #1 [0.00] RIP: 0010:__early_set_fixmap+0xa2/0xff [0.00] Code: ed be ff 49 21 cc 4c 39 e2 74 21 80 3d b8 a2 bb ff 00 75 18 4c 31 e2 48 c7 c7 68 b6 09 ac c6 05 a5 a2 bb ff 01 e8 ae ba 95 fe <0f> 0b 4c 89 e7 4c 09 ef ff 14 25 98 e2 22 ac 48 89 c6 48 89 df ff [0.00] RSP: :ac203de8 EFLAGS: 00010082 ORIG_RAX: [0.00] RAX: RBX: ac972000 RCX: 0078 [0.00] RDX: 0001 RSI: 0092 RDI: 0047 [0.00] RBP: ff20 R08: 0001 R09: 0024 [0.00] R10: 0c1c R11: 0003 R12: 0163 [0.00] R13: 000f R14: 0001 R15: [0.00] FS: () GS:ac73a000() knlGS: [0.00] CS: 0010 DS: ES: CR0: 80050033 [0.00] CR2: 888046212ff8 CR3: 467de000 CR4: 000406a0 [0.00] Call Trace: [0.00] ? __early_ioremap+0x10b/0x189 [0.00] ? dmi_scan_machine+0xfe/0x209 [0.00] ? setup_arch+0x436/0xc9f [0.00] ? start_kernel+0x65/0x519 [0.00] ? secondary_startup_64+0xa4/0xb0 [0.00] random: get_random_bytes called from print_oops_end_marker+0x26/0x40 with crng_init=0 [0.00] ---[ end trace ]--- [0.00] SMBIOS 2.6 present. [...] The same happens in a T61 with an Intel Core 2 Duo T7300. 
Thanks.
[tip:x86/fpu] x86/fpu: Add an __fpregs_load_activate() internal helper
Commit-ID: 4ee91519e1dccc175665fe24bb20a47c6053575c Gitweb: https://git.kernel.org/tip/4ee91519e1dccc175665fe24bb20a47c6053575c Author: Rik van Riel AuthorDate: Wed, 3 Apr 2019 18:41:38 +0200 Committer: Borislav Petkov CommitDate: Wed, 10 Apr 2019 16:23:14 +0200 x86/fpu: Add an __fpregs_load_activate() internal helper Add a helper function that ensures the floating point registers for the current task are active. Use with preemption disabled. While at it, add fpregs_lock/unlock() helpers too, to be used in later patches. [ bp: Add a comment about its intended usage. ] Signed-off-by: Rik van Riel Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: Aubrey Li Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A. Donenfeld" Cc: kvm ML Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-10-bige...@linutronix.de --- arch/x86/include/asm/fpu/api.h | 11 +++ arch/x86/include/asm/fpu/internal.h | 22 ++ 2 files changed, 25 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h index b56d504af654..73e684160f35 100644 --- a/arch/x86/include/asm/fpu/api.h +++ b/arch/x86/include/asm/fpu/api.h @@ -10,6 +10,7 @@ #ifndef _ASM_X86_FPU_API_H #define _ASM_X86_FPU_API_H +#include /* * Use kernel_fpu_begin/end() if you intend to use FPU in kernel context. It @@ -22,6 +23,16 @@ extern void kernel_fpu_begin(void); extern void kernel_fpu_end(void); extern bool irq_fpu_usable(void); +static inline void fpregs_lock(void) +{ + preempt_disable(); +} + +static inline void fpregs_unlock(void) +{ + preempt_enable(); +} + /* * Query the presence of one or more xfeatures. Works on any legacy CPU as well. 
* diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h index 54f70cae2f15..3e0c2c496f2d 100644 --- a/arch/x86/include/asm/fpu/internal.h +++ b/arch/x86/include/asm/fpu/internal.h @@ -484,6 +484,18 @@ static inline void fpregs_activate(struct fpu *fpu) trace_x86_fpu_regs_activated(fpu); } +/* + * Internal helper, do not use directly. Use switch_fpu_return() instead. + */ +static inline void __fpregs_load_activate(struct fpu *fpu, int cpu) +{ + if (!fpregs_state_valid(fpu, cpu)) { + if (current->mm) + copy_kernel_to_fpregs(&fpu->state); + fpregs_activate(fpu); + } +} + /* * FPU state switching for scheduling. * @@ -522,14 +534,8 @@ switch_fpu_prepare(struct fpu *old_fpu, int cpu) */ static inline void switch_fpu_finish(struct fpu *new_fpu, int cpu) { - if (static_cpu_has(X86_FEATURE_FPU)) { - if (!fpregs_state_valid(new_fpu, cpu)) { - if (current->mm) - copy_kernel_to_fpregs(&new_fpu->state); - } - - fpregs_activate(new_fpu); - } + if (static_cpu_has(X86_FEATURE_FPU)) + __fpregs_load_activate(new_fpu, cpu); } /*
[tip:x86/fpu] x86/fpu: Remove user_fpu_begin()
Commit-ID: 0169f53e0d97bb675075506810494bd86b8c934e Gitweb: https://git.kernel.org/tip/0169f53e0d97bb675075506810494bd86b8c934e Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:37 +0200 Committer: Borislav Petkov CommitDate: Wed, 10 Apr 2019 15:58:44 +0200 x86/fpu: Remove user_fpu_begin() user_fpu_begin() sets fpu_fpregs_owner_ctx to task's fpu struct. This is always the case since there is no lazy FPU anymore. fpu_fpregs_owner_ctx is used during context switch to decide if it needs to load the saved registers or if the currently loaded registers are valid. It could be skipped during a taskA -> kernel thread -> taskA switch because the switch to the kernel thread would not alter the CPU's FPU state. Since this field is always updated during context switch and never invalidated, setting it manually (in user context) makes no difference. A kernel thread with kernel_fpu_begin() block could set fpu_fpregs_owner_ctx to NULL but a kernel thread does not use user_fpu_begin(). This is a leftover from the lazy-FPU time. Remove user_fpu_begin(), it does not change fpu_fpregs_owner_ctx's content. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: Aubrey Li Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A. 
Donenfeld" Cc: kvm ML Cc: Nicolai Stange Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-9-bige...@linutronix.de --- arch/x86/include/asm/fpu/internal.h | 17 - arch/x86/kernel/fpu/core.c | 4 +--- arch/x86/kernel/fpu/signal.c| 1 - 3 files changed, 1 insertion(+), 21 deletions(-) diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h index 04042eacc852..54f70cae2f15 100644 --- a/arch/x86/include/asm/fpu/internal.h +++ b/arch/x86/include/asm/fpu/internal.h @@ -532,23 +532,6 @@ static inline void switch_fpu_finish(struct fpu *new_fpu, int cpu) } } -/* - * Needs to be preemption-safe. - * - * NOTE! user_fpu_begin() must be used only immediately before restoring - * the save state. It does not do any saving/restoring on its own. In - * lazy FPU mode, it is just an optimization to avoid a #NM exception, - * the task can lose the FPU right after preempt_enable(). - */ -static inline void user_fpu_begin(void) -{ - struct fpu *fpu = &current->thread.fpu; - - preempt_disable(); - fpregs_activate(fpu); - preempt_enable(); -} - /* * MXCSR and XCR definitions: */ diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c index 97e27de2b7c0..739ca3ae2bdc 100644 --- a/arch/x86/kernel/fpu/core.c +++ b/arch/x86/kernel/fpu/core.c @@ -335,10 +335,8 @@ void fpu__clear(struct fpu *fpu) * Make sure fpstate is cleared and initialized. 
*/ fpu__initialize(fpu); - if (static_cpu_has(X86_FEATURE_FPU)) { - user_fpu_begin(); + if (static_cpu_has(X86_FEATURE_FPU)) copy_init_fpstate_to_fpregs(); - } } /* diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c index 34989d2a8893..155f4552413e 100644 --- a/arch/x86/kernel/fpu/signal.c +++ b/arch/x86/kernel/fpu/signal.c @@ -322,7 +322,6 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size) * For 64-bit frames and 32-bit fsave frames, restore the user * state to the registers directly (with exceptions handled). */ - user_fpu_begin(); if (copy_user_to_fpregs_zeroing(buf_fx, xfeatures, fx_only)) { fpu__clear(fpu); return -1;
[tip:x86/fpu] x86/fpu: Remove fpu->initialized
Commit-ID: 2722146eb78451b30e4717a267a3a2b44e4ad317 Gitweb: https://git.kernel.org/tip/2722146eb78451b30e4717a267a3a2b44e4ad317 Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:36 +0200 Committer: Borislav Petkov CommitDate: Wed, 10 Apr 2019 15:42:40 +0200 x86/fpu: Remove fpu->initialized The struct fpu.initialized member is always set to one for user tasks and zero for kernel tasks. This avoids saving/restoring the FPU registers for kernel threads. The ->initialized = 0 case for user tasks has been removed in previous changes, for instance, by doing an explicit unconditional init at fork() time for FPU-less systems which was otherwise delayed until the emulated opcode. The context switch code (switch_fpu_prepare() + switch_fpu_finish()) can't unconditionally save/restore registers for kernel threads. Not only would it slow down the switch but also load a zeroed xcomp_bv for XSAVES. For kernel_fpu_begin() (+end) the situation is similar: EFI with runtime services uses this before alternatives_patched is true. Which means that this function is used too early and it wasn't the case before. For those two cases, use current->mm to distinguish between user and kernel thread. For kernel_fpu_begin() skip save/restore of the FPU registers. During the context switch into a kernel thread don't do anything. There is no reason to save the FPU state of a kernel thread. The reordering in __switch_to() is important because the current() pointer needs to be valid before switch_fpu_finish() is invoked so ->mm is seen of the new task instead the old one. N.B.: fpu__save() doesn't need to check ->mm because it is called by user tasks only. [ bp: Massage. ] Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: Aubrey Li Cc: Babu Moger Cc: "Chang S. Bae" Cc: Dmitry Safonov Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A. 
Donenfeld" Cc: Joerg Roedel Cc: kvm ML Cc: Masami Hiramatsu Cc: Mathieu Desnoyers Cc: Nicolai Stange Cc: Paolo Bonzini Cc: Peter Zijlstra Cc: Radim Krčmář Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: Will Deacon Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-8-bige...@linutronix.de --- arch/x86/ia32/ia32_signal.c | 17 - arch/x86/include/asm/fpu/internal.h | 18 +- arch/x86/include/asm/fpu/types.h | 9 - arch/x86/include/asm/trace/fpu.h | 5 +-- arch/x86/kernel/fpu/core.c | 70 +++-- arch/x86/kernel/fpu/init.c | 2 -- arch/x86/kernel/fpu/regset.c | 19 +++--- arch/x86/kernel/fpu/xstate.c | 2 -- arch/x86/kernel/process_32.c | 4 +-- arch/x86/kernel/process_64.c | 4 +-- arch/x86/kernel/signal.c | 17 - arch/x86/mm/pkeys.c | 7 +--- 12 files changed, 53 insertions(+), 121 deletions(-) diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c index 321fe5f5d0e9..6eeb3249f22f 100644 --- a/arch/x86/ia32/ia32_signal.c +++ b/arch/x86/ia32/ia32_signal.c @@ -216,8 +216,7 @@ static void __user *get_sigframe(struct ksignal *ksig, struct pt_regs *regs, size_t frame_size, void __user **fpstate) { - struct fpu *fpu = &current->thread.fpu; - unsigned long sp; + unsigned long sp, fx_aligned, math_size; /* Default to using normal stack */ sp = regs->sp; @@ -231,15 +230,11 @@ static void __user *get_sigframe(struct ksignal *ksig, struct pt_regs *regs, ksig->ka.sa.sa_restorer) sp = (unsigned long) ksig->ka.sa.sa_restorer; - if (fpu->initialized) { - unsigned long fx_aligned, math_size; - - sp = fpu__alloc_mathframe(sp, 1, &fx_aligned, &math_size); - *fpstate = (struct _fpstate_32 __user *) sp; - if (copy_fpstate_to_sigframe(*fpstate, (void __user *)fx_aligned, - math_size) < 0) - return (void __user *) -1L; - } + sp = fpu__alloc_mathframe(sp, 1, &fx_aligned, &math_size); + *fpstate = (struct _fpstate_32 __user *) sp; + if (copy_fpstate_to_sigframe(*fpstate, (void __user *)fx_aligned, + math_size) < 0) + return (void __user *) -1L; sp -= frame_size; /* Align the stack pointer according to the i386
ABI, diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h index 70ecb7c032cb..04042eacc852 100644 --- a/arch/x86/include/asm/fpu/internal.h +++ b/arch/x86/include/asm/fpu/internal.h @@ -494,11 +494,14 @@ static inline void fpregs_activate(struct fpu *fpu) * * - switch_fpu_finish() restores the new state as *necessary. + * + * The FPU context is only stored/restored for a user task and + * ->mm is used to
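The current->mm test the commit message describes can be modeled in userspace with a toy context switch. Everything below (the struct names, the plain int standing in for FPU register state) is illustrative and not the kernel's actual code; it only shows the dispatch rule: a task whose mm is NULL is a kernel thread, so its FPU registers are neither saved nor restored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model of the ->mm-based dispatch. */
struct mm_struct { int dummy; };

struct task {
	struct mm_struct *mm;   /* NULL for kernel threads */
	int fpu_state;          /* stand-in for the saved FPU context */
};

static int hw_fpu_regs;         /* stand-in for the CPU's live FPU registers */

static void switch_fpu_prepare(struct task *prev)
{
	if (prev->mm)                   /* user task: save live registers */
		prev->fpu_state = hw_fpu_regs;
	/* kernel thread: nothing to save */
}

static void switch_fpu_finish(struct task *next)
{
	if (next->mm)                   /* user task: restore its registers */
		hw_fpu_regs = next->fpu_state;
	/* kernel thread: leave the registers alone */
}
```

In this model a switch to a kernel thread touches neither the hardware registers nor any saved state, which is exactly the behavior the patch wants for XSAVES (never loading a zeroed xcomp_bv).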
[tip:x86/fpu] x86/fpu: Remove fpu->initialized usage in copy_fpstate_to_sigframe()
Commit-ID: fbcc9e0c37ba3186c41b5eb1aee2f7f3b711bc1e Gitweb: https://git.kernel.org/tip/fbcc9e0c37ba3186c41b5eb1aee2f7f3b711bc1e Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:34 +0200 Committer: Borislav Petkov CommitDate: Tue, 9 Apr 2019 20:48:11 +0200 x86/fpu: Remove fpu->initialized usage in copy_fpstate_to_sigframe() With lazy-FPU support the (now named variable) ->initialized was set to true if the CPU's FPU registers were holding a valid state of the FPU registers for the active process. If it was set to false then the FPU state was saved in fpu->state and the FPU was deactivated. With lazy-FPU gone, ->initialized is always true for user threads, and kernel threads never call this function, so ->initialized is always true in copy_fpstate_to_sigframe(). The using_compacted_format() check is also a leftover from the lazy-FPU time. In the ->initialized == false case copy_to_user() would copy the compacted buffer while userland would expect the non-compacted format instead. So in order to save the FPU state in the non-compacted form it issues XSAVE to save the *current* FPU state. If the FPU is not enabled, the attempt raises the FPU trap, the trap restores the FPU contents and re-enables the FPU, and XSAVE is invoked again and succeeds. *This* no longer works since commit bef8b6da9522 ("x86/fpu: Handle #NM without FPU emulation as an error"). Remove the check for ->initialized because it is always true, and remove the false condition. Update the comment to reflect that the state is always live. [ bp: Massage. ] Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A.
Donenfeld" Cc: kvm ML Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-6-bige...@linutronix.de --- arch/x86/kernel/fpu/signal.c | 35 --- 1 file changed, 8 insertions(+), 27 deletions(-) diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c index 7296a9bb78e7..c1a5999affa0 100644 --- a/arch/x86/kernel/fpu/signal.c +++ b/arch/x86/kernel/fpu/signal.c @@ -144,9 +144,8 @@ static inline int copy_fpregs_to_sigframe(struct xregs_state __user *buf) * buf == buf_fx for 64-bit frames and 32-bit fsave frame. * buf != buf_fx for 32-bit frames with fxstate. * - * If the fpu, extended register state is live, save the state directly - * to the user frame pointed by the aligned pointer 'buf_fx'. Otherwise, - * copy the thread's fpu state to the user frame starting at 'buf_fx'. + * Save the state directly to the user frame pointed by the aligned pointer + * 'buf_fx'. * * If this is a 32-bit frame with fxstate, put a fsave header before * the aligned state at 'buf_fx'. @@ -157,7 +156,6 @@ static inline int copy_fpregs_to_sigframe(struct xregs_state __user *buf) int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size) { struct fpu *fpu = &current->thread.fpu; - struct xregs_state *xsave = &fpu->state.xsave; struct task_struct *tsk = current; int ia32_fxstate = (buf != buf_fx); @@ -172,29 +170,12 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size) sizeof(struct user_i387_ia32_struct), NULL, (struct _fpstate_32 __user *) buf) ? -1 : 1; - if (fpu->initialized || using_compacted_format()) { - /* Save the live register state to the user directly. */ - if (copy_fpregs_to_sigframe(buf_fx)) - return -1; - /* Update the thread's fxstate to save the fsave header. */ - if (ia32_fxstate) - copy_fxregs_to_kernel(fpu); - } else { - /* -* It is a *bug* if kernel uses compacted-format for xsave -* area and we copy it out directly to a signal frame.
It -* should have been handled above by saving the registers -* directly. -*/ - if (boot_cpu_has(X86_FEATURE_XSAVES)) { - WARN_ONCE(1, "x86/fpu: saving compacted-format xsave area to a signal frame!\n"); - return -1; - } - - fpstate_sanitize_xstate(fpu); - if (__copy_to_user(buf_fx, xsave, fpu_user_xstate_size)) - return -1; - } + /* Save the live registers state to the user frame directly. */ + if (copy_fpregs_to_sigframe(buf_fx)) + return -1; + /* Update the thread's fxstate to save the fsave header. */ + if (ia32_fxstate) + copy_fxregs_to_kernel(fpu); /* Save the fsave header for the 32-bit frames. */ if ((ia32_fxstate || !use_fxsr()) &&
[tip:x86/fpu] x86/fpu: Don't save fxregs for ia32 frames in copy_fpstate_to_sigframe()
Commit-ID: 39388e80f9b0c3788bfb6efe3054bdce0c3ead45 Gitweb: https://git.kernel.org/tip/39388e80f9b0c3788bfb6efe3054bdce0c3ead45 Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:35 +0200 Committer: Borislav Petkov CommitDate: Wed, 10 Apr 2019 14:46:35 +0200 x86/fpu: Don't save fxregs for ia32 frames in copy_fpstate_to_sigframe() In commit 72a671ced66db ("x86, fpu: Unify signal handling code paths for x86 and x86_64 kernels") the 32bit and 64bit paths of the signal delivery code were merged. The 32bit version: int save_i387_xstate_ia32(void __user *buf) … if (cpu_has_xsave) return save_i387_xsave(fp); if (cpu_has_fxsr) return save_i387_fxsave(fp); The 64bit version: int save_i387_xstate(void __user *buf) … if (user_has_fpu()) { if (use_xsave()) err = xsave_user(buf); else err = fxsave_user(buf); if (unlikely(err)) { __clear_user(buf, xstate_size); return err; The merge: int save_xstate_sig(void __user *buf, void __user *buf_fx, int size) … if (user_has_fpu()) { /* Save the live register state to the user directly. */ if (save_user_xstate(buf_fx)) return -1; /* Update the thread's fxstate to save the fsave header. */ if (ia32_fxstate) fpu_fxsave(&tsk->thread.fpu); I don't think that we needed to save the FPU registers to ->thread.fpu because the registers were stored in buf_fx. Today the state will be restored from buf_fx after the signal was handled (I assume that this was also the case with lazy-FPU). Since commit 66463db4fc560 ("x86, fpu: shift drop_init_fpu() from save_xstate_sig() to handle_signal()") it is ensured that the signal handler starts with a clear/fresh set of FPU registers, which means that the previous store is futile. Remove the copy_fxregs_to_kernel() call because the task's FPU state is cleared later in handle_signal() via fpu__clear(). Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: "H.
Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A. Donenfeld" Cc: kvm ML Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-7-bige...@linutronix.de --- arch/x86/kernel/fpu/signal.c | 4 ---- 1 file changed, 4 deletions(-) diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c index c1a5999affa0..34989d2a8893 100644 --- a/arch/x86/kernel/fpu/signal.c +++ b/arch/x86/kernel/fpu/signal.c @@ -155,7 +155,6 @@ static inline int copy_fpregs_to_sigframe(struct xregs_state __user *buf) */ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size) { - struct fpu *fpu = &current->thread.fpu; struct task_struct *tsk = current; int ia32_fxstate = (buf != buf_fx); @@ -173,9 +172,6 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size) /* Save the live registers state to the user frame directly. */ if (copy_fpregs_to_sigframe(buf_fx)) return -1; - /* Update the thread's fxstate to save the fsave header. */ - if (ia32_fxstate) - copy_fxregs_to_kernel(fpu); /* Save the fsave header for the 32-bit frames. */ if ((ia32_fxstate || !use_fxsr()) && save_fsave_header(tsk, buf))
[PATCH] new flag COPY_FILE_RANGE_FILESIZE for copy_file_range()
If flags includes COPY_FILE_RANGE_FILESIZE then the length copied is the length of the file. off_in and off_out are ignored. len must be 0 or the file size. This implementation saves a call to stat() in the common case of copying files. It does not fix any race conditions, but that is possible in the future with this interface. EAGAIN is returned if COPY_FILE_RANGE_FILESIZE was passed and len is neither 0 nor the file size. Signed-off-by: Shawn Landden CC: --- fs/read_write.c | 14 +- include/uapi/linux/stat.h | 4 ++++ 2 files changed, 17 insertions(+), 1 deletion(-) diff --git a/fs/read_write.c b/fs/read_write.c index 61b43ad7608e..6d06361f0856 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -1557,7 +1557,7 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, struct inode *inode_out = file_inode(file_out); ssize_t ret; - if (flags != 0) + if ((flags & ~COPY_FILE_RANGE_FILESIZE) != 0) return -EINVAL; if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode)) @@ -1565,6 +1565,18 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode)) return -EINVAL; + if (flags & COPY_FILE_RANGE_FILESIZE) { + struct kstat stat; + int error; + error = vfs_getattr(&file_in->f_path, &stat, + STATX_SIZE, 0); + if (error < 0) + return error; + if (!(len == 0 || len == stat.size)) + return -EAGAIN; + len = stat.size; + } + ret = rw_verify_area(READ, file_in, &pos_in, len); if (unlikely(ret)) return ret; diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h index 7b35e98d3c58..1075aa4666ef 100644 --- a/include/uapi/linux/stat.h +++ b/include/uapi/linux/stat.h @@ -170,5 +170,9 @@ struct statx { #define STATX_ATTR_AUTOMOUNT 0x1000 /* Dir: Automount trigger */ +/* + * Flags for copy_file_range() + */ +#define COPY_FILE_RANGE_FILESIZE 0x0001 /* Copy the full length of the input file */ #endif /* _UAPI_LINUX_STAT_H */ -- 2.20.1
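For context, here is the userspace idiom the proposed COPY_FILE_RANGE_FILESIZE flag aims to shorten: fstat() the source to learn its length, then loop copy_file_range() until that many bytes have moved. The flag itself is only proposed in this patch, so the sketch below uses the plain syscall as it exists today (Linux >= 4.5, glibc >= 2.27); the fstat() marked in the comment is the extra call the flag would remove.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Copy src to dst in whole, the way userspace has to do it without
 * COPY_FILE_RANGE_FILESIZE.  Minimal sketch: error paths leak the fds. */
static int copy_whole_file(const char *src, const char *dst)
{
	int in = open(src, O_RDONLY);
	int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	struct stat st;
	ssize_t n;

	if (in < 0 || out < 0)
		return -1;
	if (fstat(in, &st) < 0)   /* the extra syscall the flag removes */
		return -1;

	for (off_t left = st.st_size; left > 0; left -= n) {
		n = copy_file_range(in, NULL, out, NULL, left, 0);
		if (n <= 0)
			return -1;
	}
	close(in);
	close(out);
	return 0;
}
```

With the proposed flag, the fstat() and the explicit length would collapse into a single call with len == 0 and flags == COPY_FILE_RANGE_FILESIZE.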
[tip:x86/fpu] x86/fpu: Always init the state in fpu__clear()
Commit-ID: 88f5260a3bf9bfb276b5b4aac2e81587e425a1d7 Gitweb: https://git.kernel.org/tip/88f5260a3bf9bfb276b5b4aac2e81587e425a1d7 Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:33 +0200 Committer: Borislav Petkov CommitDate: Tue, 9 Apr 2019 19:28:06 +0200 x86/fpu: Always init the state in fpu__clear() fpu__clear() only initializes the state if the CPU has FPU support. This initialisation is also required for FPU-less systems and takes place in math_emulate(). Since fpu__initialize() only performs the initialization if ->initialized is zero it does not matter that it is invoked each time an opcode is emulated. It makes the removal of ->initialized easier if the struct is also initialized in the FPU-less case at the same time. Move fpu__initialize() before the FPU feature check so it is also performed in the FPU-less case too. [ bp: Massage a bit. ] Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: Aubrey Li Cc: Bill Metzenthen Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A. 
Donenfeld" Cc: kvm ML Cc: Nicolai Stange Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-5-bige...@linutronix.de --- arch/x86/include/asm/fpu/internal.h | 1 - arch/x86/kernel/fpu/core.c | 5 ++--- arch/x86/math-emu/fpu_entry.c | 3 --- 3 files changed, 2 insertions(+), 7 deletions(-) diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h index 75a1d5f712ee..70ecb7c032cb 100644 --- a/arch/x86/include/asm/fpu/internal.h +++ b/arch/x86/include/asm/fpu/internal.h @@ -24,7 +24,6 @@ /* * High level FPU state handling functions: */ -extern void fpu__initialize(struct fpu *fpu); extern void fpu__prepare_read(struct fpu *fpu); extern void fpu__prepare_write(struct fpu *fpu); extern void fpu__save(struct fpu *fpu); diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c index 1940319268ae..e43296854e37 100644 --- a/arch/x86/kernel/fpu/core.c +++ b/arch/x86/kernel/fpu/core.c @@ -223,7 +223,7 @@ int fpu__copy(struct fpu *dst_fpu, struct fpu *src_fpu) * Activate the current task's in-memory FPU context, * if it has not been used before: */ -void fpu__initialize(struct fpu *fpu) +static void fpu__initialize(struct fpu *fpu) { WARN_ON_FPU(fpu != &current->thread.fpu); @@ -236,7 +236,6 @@ void fpu__initialize(struct fpu *fpu) fpu->initialized = 1; } } -EXPORT_SYMBOL_GPL(fpu__initialize); /* * This function must be called before we read a task's fpstate. */ @@ -365,8 +364,8 @@ void fpu__clear(struct fpu *fpu) /* * Make sure fpstate is cleared and initialized.
*/ + fpu__initialize(fpu); if (static_cpu_has(X86_FEATURE_FPU)) { - fpu__initialize(fpu); user_fpu_begin(); copy_init_fpstate_to_fpregs(); } diff --git a/arch/x86/math-emu/fpu_entry.c b/arch/x86/math-emu/fpu_entry.c index 9e2ba7e667f6..a873da6b46d6 100644 --- a/arch/x86/math-emu/fpu_entry.c +++ b/arch/x86/math-emu/fpu_entry.c @@ -113,9 +113,6 @@ void math_emulate(struct math_emu_info *info) unsigned long code_base = 0; unsigned long code_limit = 0; /* Initialized to stop compiler warnings */ struct desc_struct code_descriptor; - struct fpu *fpu = &current->thread.fpu; - - fpu__initialize(fpu); #ifdef RE_ENTRANT_CHECKING if (emulating) {
[tip:x86/fpu] x86/fpu: Remove preempt_disable() in fpu__clear()
Commit-ID: 60e528d6ce3f60a058bbb64f8acb2a07f84b172a Gitweb: https://git.kernel.org/tip/60e528d6ce3f60a058bbb64f8acb2a07f84b172a Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:32 +0200 Committer: Borislav Petkov CommitDate: Tue, 9 Apr 2019 19:27:46 +0200 x86/fpu: Remove preempt_disable() in fpu__clear() The preempt_disable() section was introduced in commit a10b6a16cdad8 ("x86/fpu: Make the fpu state change in fpu__clear() scheduler-atomic") and it was said to be temporary. fpu__initialize() initializes the FPU struct to its initial value and then sets ->initialized to 1. The last part is the important one. The content of the state does not matter because it gets set via copy_init_fpstate_to_fpregs(). A preemption here has little meaning because the registers will always be set to the same content after copy_init_fpstate_to_fpregs(). A softirq with a kernel_fpu_begin() could also force to save FPU's registers after fpu__initialize() without changing the outcome here. Remove the preempt_disable() section in fpu__clear(), preemption here does not hurt. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: "Jason A. Donenfeld" Cc: kvm ML Cc: Nicolai Stange Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-4-bige...@linutronix.de --- arch/x86/kernel/fpu/core.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c index 1d3ae7988f7f..1940319268ae 100644 --- a/arch/x86/kernel/fpu/core.c +++ b/arch/x86/kernel/fpu/core.c @@ -366,11 +366,9 @@ void fpu__clear(struct fpu *fpu) * Make sure fpstate is cleared and initialized. */ if (static_cpu_has(X86_FEATURE_FPU)) { - preempt_disable(); fpu__initialize(fpu); user_fpu_begin(); copy_init_fpstate_to_fpregs(); - preempt_enable(); } }
[tip:x86/fpu] x86/fpu: Remove fpu__restore()
Commit-ID: 6dd677a044e606fd343e31c2108b13d74aec1ca5 Gitweb: https://git.kernel.org/tip/6dd677a044e606fd343e31c2108b13d74aec1ca5 Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:31 +0200 Committer: Borislav Petkov CommitDate: Tue, 9 Apr 2019 19:27:42 +0200 x86/fpu: Remove fpu__restore() There are no users of fpu__restore() so it is time to remove it. The comment regarding fpu__restore() and TS bit is stale since commit b3b0870ef3ffe ("i387: do not preload FPU state at task switch time") and has no meaning since. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: Aubrey Li Cc: Babu Moger Cc: "Chang S. Bae" Cc: Dmitry Safonov Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A. Donenfeld" Cc: Joerg Roedel Cc: Jonathan Corbet Cc: kvm ML Cc: linux-...@vger.kernel.org Cc: Nicolai Stange Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-3-bige...@linutronix.de --- Documentation/preempt-locking.txt | 1 - arch/x86/include/asm/fpu/internal.h | 1 - arch/x86/kernel/fpu/core.c | 24 arch/x86/kernel/process_32.c| 4 +--- arch/x86/kernel/process_64.c| 4 +--- 5 files changed, 2 insertions(+), 32 deletions(-) diff --git a/Documentation/preempt-locking.txt b/Documentation/preempt-locking.txt index 509f5a422d57..dce336134e54 100644 --- a/Documentation/preempt-locking.txt +++ b/Documentation/preempt-locking.txt @@ -52,7 +52,6 @@ preemption must be disabled around such regions. Note, some FPU functions are already explicitly preempt safe. For example, kernel_fpu_begin and kernel_fpu_end will disable and enable preemption. -However, fpu__restore() must be called with preemption disabled. 
RULE #3: Lock acquire and release must be performed by same task diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h index fb04a3ded7dd..75a1d5f712ee 100644 --- a/arch/x86/include/asm/fpu/internal.h +++ b/arch/x86/include/asm/fpu/internal.h @@ -28,7 +28,6 @@ extern void fpu__initialize(struct fpu *fpu); extern void fpu__prepare_read(struct fpu *fpu); extern void fpu__prepare_write(struct fpu *fpu); extern void fpu__save(struct fpu *fpu); -extern void fpu__restore(struct fpu *fpu); extern int fpu__restore_sig(void __user *buf, int ia32_frame); extern void fpu__drop(struct fpu *fpu); extern int fpu__copy(struct fpu *dst_fpu, struct fpu *src_fpu); diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c index 2e5003fef51a..1d3ae7988f7f 100644 --- a/arch/x86/kernel/fpu/core.c +++ b/arch/x86/kernel/fpu/core.c @@ -303,30 +303,6 @@ void fpu__prepare_write(struct fpu *fpu) } } -/* - * 'fpu__restore()' is called to copy FPU registers from - * the FPU fpstate to the live hw registers and to activate - * access to the hardware registers, so that FPU instructions - * can be used afterwards. - * - * Must be called with kernel preemption disabled (for example - * with local interrupts disabled, as it is in the case of - * do_device_not_available()). - */ -void fpu__restore(struct fpu *fpu) -{ - fpu__initialize(fpu); - - /* Avoid __kernel_fpu_begin() right after fpregs_activate() */ - kernel_fpu_disable(); - trace_x86_fpu_before_restore(fpu); - fpregs_activate(fpu); - copy_kernel_to_fpregs(&fpu->state); - trace_x86_fpu_after_restore(fpu); - kernel_fpu_enable(); -} -EXPORT_SYMBOL_GPL(fpu__restore); - /* * Drops current FPU state: deactivates the fpregs and * the fpstate.
NOTE: it still leaves previous contents diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c index e471d8e6f0b2..7888a41a03cd 100644 --- a/arch/x86/kernel/process_32.c +++ b/arch/x86/kernel/process_32.c @@ -267,9 +267,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p) /* * Leave lazy mode, flushing any hypercalls made here. * This must be done before restoring TLS segments so -* the GDT and LDT are properly updated, and must be -* done before fpu__restore(), so the TS bit is up -* to date. +* the GDT and LDT are properly updated. */ arch_end_context_switch(next_p); diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c index 6a62f4af9fcf..e1983b3a16c4 100644 --- a/arch/x86/kernel/process_64.c +++ b/arch/x86/kernel/process_64.c @@ -538,9 +538,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p) /* * Leave lazy mode, flushing any hypercalls made here. This * must be done after loading TLS entries in the GDT but before -* loading segments that might reference them, and and it must -* be done before fpu__restore(), so the TS bit is up to -* date. +* loading segments
[tip:x86/fpu] x86/fpu: Remove fpu->initialized usage in __fpu__restore_sig()
Commit-ID: 39ea9baffda91df8bfee9b45610242a3191ea1ec Gitweb: https://git.kernel.org/tip/39ea9baffda91df8bfee9b45610242a3191ea1ec Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:30 +0200 Committer: Borislav Petkov CommitDate: Tue, 9 Apr 2019 19:27:29 +0200 x86/fpu: Remove fpu->initialized usage in __fpu__restore_sig() This is a preparation for the removal of the ->initialized member in the fpu struct. __fpu__restore_sig() is deactivating the FPU via fpu__drop() and then setting manually ->initialized followed by fpu__restore(). The result is that it is possible to manipulate fpu->state and the state of registers won't be saved/restored on a context switch which would overwrite fpu->state: fpu__drop(fpu): ... fpu->initialized = 0; preempt_enable(); <--- context switch Don't access the fpu->state while the content is read from user space and examined/sanitized. Use a temporary kmalloc() buffer for the preparation of the FPU registers and once the state is considered okay, load it. Should something go wrong, return with an error and without altering the original FPU registers. The removal of fpu__initialize() is a nop because fpu->initialized is already set for the user task. [ bp: Massage a bit. ] Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Acked-by: Borislav Petkov Cc: Andy Lutomirski Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A. 
Donenfeld" Cc: kvm ML Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-2-bige...@linutronix.de --- arch/x86/include/asm/fpu/signal.h | 2 +- arch/x86/kernel/fpu/regset.c | 5 ++--- arch/x86/kernel/fpu/signal.c | 40 +++ 3 files changed, 18 insertions(+), 29 deletions(-) diff --git a/arch/x86/include/asm/fpu/signal.h b/arch/x86/include/asm/fpu/signal.h index 44bbc39a57b3..7fb516b6893a 100644 --- a/arch/x86/include/asm/fpu/signal.h +++ b/arch/x86/include/asm/fpu/signal.h @@ -22,7 +22,7 @@ int ia32_setup_frame(int sig, struct ksignal *ksig, extern void convert_from_fxsr(struct user_i387_ia32_struct *env, struct task_struct *tsk); -extern void convert_to_fxsr(struct task_struct *tsk, +extern void convert_to_fxsr(struct fxregs_state *fxsave, const struct user_i387_ia32_struct *env); unsigned long diff --git a/arch/x86/kernel/fpu/regset.c b/arch/x86/kernel/fpu/regset.c index bc02f5144b95..5dbc099178a8 100644 --- a/arch/x86/kernel/fpu/regset.c +++ b/arch/x86/kernel/fpu/regset.c @@ -269,11 +269,10 @@ convert_from_fxsr(struct user_i387_ia32_struct *env, struct task_struct *tsk) memcpy(&to[i], &from[i], sizeof(to[0])); } -void convert_to_fxsr(struct task_struct *tsk, +void convert_to_fxsr(struct fxregs_state *fxsave, const struct user_i387_ia32_struct *env) { - struct fxregs_state *fxsave = &tsk->thread.fpu.state.fxsave; struct _fpreg *from = (struct _fpreg *) &env->st_space[0]; struct _fpxreg *to = (struct _fpxreg *) &fxsave->st_space[0]; int i; @@ -350,7 +349,7 @@ int fpregs_set(struct task_struct *target, const struct user_regset *regset, ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf, &env, 0, -1); if (!ret) - convert_to_fxsr(target, &env); + convert_to_fxsr(&target->thread.fpu.state.fxsave, &env); /* * update the header bit in the xsave header, indicating the diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c index 55b80de13ea5..7296a9bb78e7 100644 --- a/arch/x86/kernel/fpu/signal.c +++ b/arch/x86/kernel/fpu/signal.c @@ -207,11 +207,11 @@ int
copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size) } static inline void -sanitize_restored_xstate(struct task_struct *tsk, +sanitize_restored_xstate(union fpregs_state *state, struct user_i387_ia32_struct *ia32_env, u64 xfeatures, int fx_only) { - struct xregs_state *xsave = &tsk->thread.fpu.state.xsave; + struct xregs_state *xsave = &state->xsave; struct xstate_header *header = &xsave->header; if (use_xsave()) { @@ -238,7 +238,7 @@ sanitize_restored_xstate(struct task_struct *tsk, */ xsave->i387.mxcsr &= mxcsr_feature_mask; - convert_to_fxsr(tsk, ia32_env); + convert_to_fxsr(&state->fxsave, ia32_env); } } @@ -284,8 +284,6 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size) if (!access_ok(buf, size)) return -EACCES; - fpu__initialize(fpu); - if (!static_cpu_has(X86_FEATURE_FPU)) return fpregs_soft_set(current, NULL, 0, sizeof(struct user_i387_ia32_struct), @@ -315,40 +313,32 @@ static int __fpu__restore_sig(void __user *buf, void
Re: [PATCH 5/5] leds: lm3697: Introduce the lm3697 driver
On Mon 2019-03-25 09:24:03, Dan Murphy wrote: > Introduce the lm3697 LED driver for > backlighting and display. > > Datasheet location: > http://www.ti.com/lit/ds/symlink/lm3697.pdf > > Signed-off-by: Dan Murphy > --- > drivers/leds/Kconfig | 8 +- > drivers/leds/Makefile | 1 + > drivers/leds/leds-lm3697.c | 401 + > 3 files changed, 409 insertions(+), 1 deletion(-) > create mode 100644 drivers/leds/leds-lm3697.c > > diff --git a/drivers/leds/Kconfig b/drivers/leds/Kconfig > index 735009e73414..688bb9a6f275 100644 > --- a/drivers/leds/Kconfig > +++ b/drivers/leds/Kconfig > @@ -776,9 +776,15 @@ config LEDS_NIC78BX > To compile this driver as a module, choose M here: the module > will be called leds-nic78bx. > > +config LEDS_LM3697 > + tristate "LED driver for LM3697" > + depends on LEDS_TI_LMU_COMMON > + help > + Say Y to enable the LM3697 LED driver for TI LMU devices. > + This supports the LED device LM3697. > + > config LEDS_TI_LMU_COMMON > tristate "LED driver for TI LMU" > - depends on REGMAP > help >Say Y to enable the LED driver for TI LMU devices. >This supports common features between the TI LM3532, LM3631, > LM3632, Is deleting "depends on REGMAP" intentional? AFAICT you are using it. Plus we'd normally expect "COMMON" first and then specific driver. Not sure if Kconfig can handle it out-of-order... 
> +static int lm3697_init(struct lm3697 *priv) > +{ > + struct lm3697_led *led; > + int i, ret; > + > + if (priv->enable_gpio) { > + gpiod_direction_output(priv->enable_gpio, 1); > + } else { > + ret = regmap_write(priv->regmap, LM3697_RESET, LM3697_SW_RESET); > + if (ret) { > + dev_err(&priv->client->dev, "Cannot reset the device\n"); > + goto out; > + } > + } > + > + ret = regmap_write(priv->regmap, LM3697_CTRL_ENABLE, 0x0); > + if (ret) { > + dev_err(&priv->client->dev, "Cannot write ctrl enable\n"); > + goto out; > + } > + > + ret = regmap_write(priv->regmap, LM3697_OUTPUT_CONFIG, priv->bank_cfg); > + if (ret) > + dev_err(&priv->client->dev, "Cannot write OUTPUT config\n"); Missing goto out? > + for (i = 0; i < LM3697_MAX_CONTROL_BANKS; i++) { > + led = &priv->leds[i]; > + ret = ti_lmu_common_set_ramp(&led->lmu_data); > + if (ret) > + dev_err(&priv->client->dev, "Setting the ramp rate failed\n"); > + } > +out: > + return ret; > +} Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
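The "Missing goto out?" comment above is about a general error-handling pattern: every register write in an init sequence should branch to the common exit label on failure, otherwise a later step runs against a device left in an unknown state. A minimal userspace sketch of that pattern, with regmap_write() stubbed out and placeholder register names (none of this is the actual lm3697 driver):

```c
#include <assert.h>

enum { REG_RESET, REG_CTRL_ENABLE, REG_OUTPUT_CONFIG };

static int fail_at = -1;        /* test hook: which register write fails */

/* Stub standing in for the kernel's regmap_write(). */
static int regmap_write(int reg, int val)
{
	(void)val;
	return reg == fail_at ? -5 /* -EIO */ : 0;
}

static int chip_init(void)
{
	int ret;

	ret = regmap_write(REG_RESET, 1);
	if (ret)
		goto out;
	ret = regmap_write(REG_CTRL_ENABLE, 0);
	if (ret)
		goto out;
	ret = regmap_write(REG_OUTPUT_CONFIG, 0x7);
	if (ret)
		goto out;       /* the branch the review found missing */
out:
	return ret;
}
```

The point of the pattern is that a failure at any step short-circuits to the exit label, so the first error code is what the caller sees.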
Re: [PATCH 1/2] vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files
On Sat, Apr 13, 2019 at 10:27:00AM -0700, Linus Torvalds wrote: > On Sat, Apr 13, 2019 at 9:55 AM Kirill Smelkov wrote: > > > > --- a/fs/read_write.c > > +++ b/fs/read_write.c > > @@ -371,7 +371,7 @@ int rw_verify_area(int read_write, struct file *file, > > const loff_t *ppos, size_t > > inode = file_inode(file); > > if (unlikely((ssize_t) count < 0)) > > return retval; > > - pos = *ppos; > > + pos = (ppos ? *ppos : 0); > > if (unlikely(pos < 0)) { > > if (!unsigned_offsets(file)) > > return retval; > > This part looks silly. We should just avoid all the position overflow > games when we don't have a position at all (ie streaming). You can't > overflow what you don't use. > > Similarly, you can't use ranged mandatory locking on a stream, so the > mandatory locking thing seems dependent on pos too. > > So I think that with a NULL ppos being possible, we should just change > the code to just do all of that conditionally on having a position, > rather than saying that the position of a stream is always 0. Linus, thanks for the feedback. After checking a bit of the history of rw_verify_area and seeing what it does, I agree with you. I was initially confused because rw_verify_area calls security_file_permission() at its tail, and so I thought we could not skip rw_verify_area for the !ppos case. After studying it a bit I see now that there are several things going on and that moving the ranged-lock checking under `if (ppos != NULL)` indeed makes sense. This code is new to me; I apologize for not getting it right initially. > That said, this whole "let's make it possible to not have a position > at all" is a big change, and there's no way I'll apply these before > the 5.2 merge window. Ok. > And I'd really like to have people (Al?) look at this and go "yeah, > makes sense".
I do think that moving to a model where we either have a > (properly locked) file position or no pos pointer at all is the right > model (ie I'd really like to get rid of the mixed case), but there > might be some practical problem that makes it impractical. > > Because the *real* problem with the mixed case is not "insane people > who do bad things might get odd results". No, the real problem with > the mixed case is that it could be a security issue (ie: one process > intentionally changes the file position just as another process is > doing a 'read' and then avoids some limit test because the limit test > was done using the old 'pos' value but the actual IO was done using > the new one). > > So I suspect that we will have to either > > - get rid of the mixed case entirely (and do only properly locked > f_pos accesses or pass in a NULL f_pos) > > - continue to support the mixed case, but also continue to support > the nasty temporary 'pos' value with that file_pos_{read,write}() > thing. > > IOW, I would not be ok with passing in a shared - and unlocked - > &file->f_pos value to random drivers that *might* do odd things when a > race happens. I see. That's why I split the whole change in two: the first patch only adds the ppos=NULL semantics for FMODE_STREAM and does not change to passing &file->f_pos directly. There are signs that people may already be starting to use stream_open even now - before we do the mass conversion. That's why it is better to fix the ppos semantics for the stream case early. You said it won't happen before the next merge window - ok, but the ppos=NULL change is separate and does not depend on passing &file->f_pos directly. So let's please apply the ppos=NULL patch early when the merge window begins, independently of whether we'll switch to direct usage of &file->f_pos. Looking forward to getting review from others too. And I would appreciate it if people could help at least somehow with "getting rid of the mixed case entirely" (i.e.
always lock f_pos_lock on !FMODE_STREAM), because this transition starts to diverge too far from my particular use-case. To me it makes sense to do that transition as follows:

- convert nonseekable_open -> stream_open via stream_open.cocci;
- audit other nonseekable_open calls and convert the remaining users that truly don't depend on position to stream_open;
- extend stream_open.cocci to analyze alloc_file_pseudo as well (this will cover pipes and sockets), or maybe convert pipes and sockets to FMODE_STREAM manually;
- extend stream_open.cocci to analyze file_operations that use no_llseek or noop_llseek, but do not use nonseekable_open or alloc_file_pseudo. This might find files that have stream semantics but are opened differently;
- extend stream_open.cocci to analyze file_operations whose .read/.write do not use ppos at all (independently of how the file was opened);
- ...
- after that remove FMODE_ATOMIC_POS and always take f_pos_lock if !FMODE_STREAM;
- gather bug reports for deadlocked read/write and convert missed cases to FMODE_STREAM, probably extending stream_open.cocci along the way to catch similar cases.

i.e. always take f_pos_lock unless a file is marked FMODE_STREAM.
Re: [PATCH 0/3] 32-bit Meson: add the canvas module
Hi Martin, On Sat, Apr 13, 2019 at 7:15 PM Martin Blumenstingl wrote: > > This adds the canvas module on Meson8, Meson8b and Meson8m2. The canvas > IP is used by the video decoder hardware as well as the VPU (video > output) hardware. > > Neither the VPU nor the video decoder driver support the 32-bit SoCs > yet. However, we can still add the canvas module to have it available > once these drivers gain support for the older SoCs. > > I have tested this on my Meson8m2 board by hacking the VPU driver to > not re-initialize the VPU (and to use the configuration set by u-boot). > With that hack I could get some image out of the CVBS connector. No > changes to the canvas driver were required. > > Due to lack of hardware I could not test Meson8, but I'm following (as > always) what the Amlogic 3.10 vendor kernel uses. > Meson8b is also not tested because u-boot of my EC-100 doesn't have > video output enabled (so I couldn't use the same hack I used on my > Meson8m2 board). > > This series is meant to be applied on top of "Meson8b: add support for the > RTC on EC-100 and Odroid-C1" from [0] > > The series looks good to me; however, I wonder if we should maybe add a new compatible string? The canvas IP before the GX* generation does not handle what Amlogic calls "endianness", the field that allows doing some byte-switching to get proper NV12/NV21. So the following defines are unusable:

#define MESON_CANVAS_ENDIAN_SWAP16 0x1
#define MESON_CANVAS_ENDIAN_SWAP32 0x3
#define MESON_CANVAS_ENDIAN_SWAP64 0x7
#define MESON_CANVAS_ENDIAN_SWAP128 0xf

It wouldn't change much functionally, but we could have e.g. a warning if an m8 canvas user tries to set endianness even though it does nothing.
Maxime > [0] https://patchwork.kernel.org/cover/10899509/ > > > Martin Blumenstingl (3): > ARM: dts: meson8: add the canvas module > ARM: dts: meson8m2: update the offset of the canvas module > ARM: dts: meson8b: add the canvas module > > arch/arm/boot/dts/meson8.dtsi | 21 + > arch/arm/boot/dts/meson8b.dtsi | 21 + > arch/arm/boot/dts/meson8m2.dtsi | 10 ++ > 3 files changed, 52 insertions(+) > > -- > 2.21.0 > > > ___ > linux-amlogic mailing list > linux-amlo...@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-amlogic
Re: Alleged fix for writer stall on -rcu branch dev
On Sat, Apr 13, 2019 at 07:21:53AM -0700, Paul E. McKenney wrote: > On Wed, Apr 10, 2019 at 05:06:10PM -0700, Paul E. McKenney wrote: > > On Wed, Apr 10, 2019 at 01:33:02PM -0700, Paul E. McKenney wrote: > > > On Wed, Apr 10, 2019 at 09:19:18PM +0200, Sebastian Andrzej Siewior wrote: > > > > On 2019-04-10 11:41:05 [-0700], Paul E. McKenney wrote: > > > > > On Wed, Apr 10, 2019 at 08:15:38PM +0200, Sebastian Andrzej Siewior > > > > > wrote: > > > > > > On 2019-04-09 15:14:26 [-0700], Paul E. McKenney wrote: > > > > > > > Hello, Sebastian, > > > > > > Hi Paul, > > > > > > > > > > > > > A few commits later, I finally have something that might work. > > > > > > > ;-) > > > > > > > > > > > > Okay. I started with 33e04e4512797b5e0242f452d0027b096d43d006. The > > > > > > first > > > > > > batch of 18 completed and this > > > > > > https://breakpoint.cc/console-tree1.2.log > > > > > > > > > > Another new one on me! ;-) > > > > > > > > > > It looks like CPU 11 got stuck in the cpu_stopper_thread trying to > > > > > offline CPU 6. Unfortunately, CPU 6 is not in the list receiving an > > > > > NMI, perhaps because it looked like it was already offline. It does > > > > > eventually finish going offline quite some time later. > > > > > > > > > > Perhaps my forward-progress testing is interacting badly with CPU > > > > > hotplug, but only very occasionally? > > > > > > > > Somehow, yes. > > > > > > > > [ 8433.835292] Unregister pv shared memory for cpu 6 > > > > [ 8433.864122] smpboot: CPU 6 is now offline > > > > > > > > CPU6 is offline. > > > > > > > > [ 8434.934765] smpboot: Booting Node 0 Processor 6 APIC 0x6 > > > > [ 8434.950632] kvm-clock: cpu 6, msr 17155a181, secondary cpu clock > > > > [ 8434.989124] KVM setup async PF for cpu 6 > > > > [ 8434.990801] kvm-stealtime: cpu 6, msr 17b195380 > > > > > > > > CPU6 is booting again. 
> > > > > > > > [ 8436.035269] Unregister pv shared memory for cpu 6 > > > > > > > > going down again > > > > > > > > [ 8462.032468] rcu: INFO: rcu_preempt self-detected stall on CPU > > > > [ 8462.037385] rcu: 11-...!: (25587 ticks this GP) > > > > idle=57e/1/0x4002 softirq=418067/418067 fqs=1 > > > > last_accelerate: 2456/89e6, Nonlazy posted: .LD > > > > > > > > 25587 ticks is ~25secs with HZ=1k. And the stall occurred on CPU11, > > > > correct? > > > > > > Yes to both. > > > > > > > [ 8525.041031] smpboot: CPU 6 is now offline > > > > > > > > 63 secs later, the CPU is down. So that is too long. > > > > > > Ah, I did miss the down-then-up, thank you! > > > > > > > > Something for me to look at, anyway! > > > > > > > > There is also > > > > [ 8556.907804] WARNING: CPU: 35 PID: 833 at > > > > /home/bigeasy/linux-rt/kernel/rcu/rcutorture.c:1827 > > > > rcu_torture_fwd_prog+0x5c4/0x630 > > > > > > > > but I'm not sure what it is :) > > > > > > The forward-progress test exceeded eight seconds, but pushed fewer > > > than 100 callbacks through the system. One way that could happen is > > > if the forward-progress kthread was delayed during that eight seconds. > > > Another is if grace periods were insanely fast (which they are not). > > > > > > > > > I should have been alone on that server but I can't guarantee it. I > > > > > > try > > > > > > to throw the rcu-torture at a little smaller box and check how it > > > > > > goes. > > > > > > > > > > Even if you were not alone, I would guess that preempting a runnable > > > > > process for 92 seconds would have been a visible event on the host. > > > > > > > > 92 seconds?! Okay, that should have been visible on the host side. And I > > > > see nothing in dmesg on the host side. > > > > But why 92? 25 for RCU stall + 63 between starting to go down and > > > > finishing? > > > > > > Because I missed the momentary up-then-down that you spotted. Your 63 > > > seconds is correct. > > > > > > > > But maybe not? 
The watchdog timer is still 120 seconds? Or does your > > > > > host set it to 22 seconds? > > > > > > > > I didn't fiddle with it and host runs 4.19.12-1~bpo9+1. So it should be > > > > the default value. > > > > > > Which I believe to be 22 seconds. Hmmm... > > > > > > Either way, thank you very much for the testing!!! > > > > [ Adding LKML back -- I think we dropped it due to an attachment? ] > > > > One area of some concern is RCU_KTHREAD_YIELDING and the rcuc kthreads. > > This shouldn't be a big deal for the intended use case (real-time > > workloads with lightish load), but a two-jiffy delay could allow quite a > > few callbacks to build up when doing forward-progress testing. But not > > a problem for use_softirq runs, which is what you were running. > > > > But it might be that I need to either skip the forward-progress testing > > if !use_softirq, skip the RCU_KTHREAD_YIELDING step if there are too > > many callbacks, or some such. Either way, though, I would need at least > > a cond_resched() on each pass through the rcu_cpu_kthread() function. > > > > Thoughts
Re: [GIT PULL] arm64 fixes for -rc5
The pull request you sent on Fri, 12 Apr 2019 17:05:15 +0100: > git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git tags/arm64-fixes has been merged into torvalds/linux.git: https://git.kernel.org/torvalds/c/5ded88718aef7e92a9806f6ff4b89c7f2a4f1570 Thank you! -- Deet-doot-dot, I am a bot. https://korg.wiki.kernel.org/userdoc/prtracker
Re: [PATCH 1/2] vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files
On Sat, Apr 13, 2019 at 10:27:00AM -0700, Linus Torvalds wrote: > And I'd really like to have people (Al?) look at this and go "yeah, > makes sense". I do think that moving to a model where we either have a > (properly locked) file position or no pos pointer at all is the right > model (ie I'd really like to get rid of the mixed case), but there > might be some practical problem that makes it impractical. > > Because the *real* problem with the mixed case is not "insane people > who do bad things might get odd results". No, the real problem with > the mixed case is that it could be a security issue (ie: one process > intentionally changes the file position just as another process is > doing a 'read' and then avoids some limit test because the limit test > was done using the old 'pos' value but the actual IO was done using > the new one). > > So I suspect that we will have to either > > - get rid of the mixed case entirely (and do only properly locked > f_pos accesses or pass in a NULL f_pos) Good luck. Character devices get no exclusion, often do use position, are many, might be buried behind several layers of indirection... and are often left unmaintained for a decade or two. IOW, I don't see how you'd go about eliminating the mixed case...
Re: [PATCH 1/2] vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files
On Sat, Apr 13, 2019 at 9:55 AM Kirill Smelkov wrote: > > --- a/fs/read_write.c > +++ b/fs/read_write.c > @@ -371,7 +371,7 @@ int rw_verify_area(int read_write, struct file *file, > const loff_t *ppos, size_t > inode = file_inode(file); > if (unlikely((ssize_t) count < 0)) > return retval; > - pos = *ppos; > + pos = (ppos ? *ppos : 0); > if (unlikely(pos < 0)) { > if (!unsigned_offsets(file)) > return retval; This part looks silly. We should just avoid all the position overflow games when we don't have a position at all (ie streaming). You can't overflow what you don't use. Similarly, you can't use ranged mandatory locking on a stream, so the mandatory locking thing seems dependent on pos too. So I think that with a NULL ppos being possible, we should just change the code to do all of that conditionally on having a position, rather than saying that the position of a stream is always 0. That said, this whole "let's make it possible to not have a position at all" is a big change, and there's no way I'll apply these before the 5.2 merge window. And I'd really like to have people (Al?) look at this and go "yeah, makes sense". I do think that moving to a model where we either have a (properly locked) file position or no pos pointer at all is the right model (ie I'd really like to get rid of the mixed case), but there might be some practical problem that makes it impractical. Because the *real* problem with the mixed case is not "insane people who do bad things might get odd results". No, the real problem with the mixed case is that it could be a security issue (ie: one process intentionally changes the file position just as another process is doing a 'read' and then avoids some limit test because the limit test was done using the old 'pos' value but the actual IO was done using the new one). 
So I suspect that we will have to either - get rid of the mixed case entirely (and do only properly locked f_pos accesses or pass in a NULL f_pos) - continue to support the mixed case, but also continue to support the nasty temporary 'pos' value with that file_pos_{read,write}() thing. IOW, I would not be ok with passing in a shared - and unlocked - &file->f_pos value to random drivers that *might* do odd things when a race happens. Linus
[PATCH v4 06/16] locking/rwsem: Code cleanup after files merging
After merging all the relevant rwsem code into one single file, there are a number of optimizations and cleanups that can be done: 1) Remove all the EXPORT_SYMBOL() calls for functions that are not accessed elsewhere. 2) Remove all the __visible tags as none of the functions will be called from assembly code anymore. 3) Make all the internal functions static. 4) Remove some unneeded blank lines. That enables the compiler to do better optimization and reduce code size. The text+data size of rwsem.o on an x86-64 machine was reduced from 8945 bytes to 4651 bytes with this change. Suggested-by: Peter Zijlstra Signed-off-by: Waiman Long --- kernel/locking/rwsem.c | 50 +- 1 file changed, 10 insertions(+), 40 deletions(-) diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c index 5f06b0601eb6..c1a089ab19fd 100644 --- a/kernel/locking/rwsem.c +++ b/kernel/locking/rwsem.c @@ -207,7 +207,6 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name, osq_lock_init(&sem->osq); #endif } - EXPORT_SYMBOL(__init_rwsem); enum rwsem_waiter_type { @@ -575,19 +574,17 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state) return ERR_PTR(-EINTR); } -__visible struct rw_semaphore * __sched +static inline struct rw_semaphore * __sched rwsem_down_read_failed(struct rw_semaphore *sem) { return __rwsem_down_read_failed_common(sem, TASK_UNINTERRUPTIBLE); } -EXPORT_SYMBOL(rwsem_down_read_failed); -__visible struct rw_semaphore * __sched +static inline struct rw_semaphore * __sched rwsem_down_read_failed_killable(struct rw_semaphore *sem) { return __rwsem_down_read_failed_common(sem, TASK_KILLABLE); } -EXPORT_SYMBOL(rwsem_down_read_failed_killable); /* * Wait until we successfully acquire the write lock @@ -694,26 +691,23 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state) return ERR_PTR(-EINTR); } -__visible struct rw_semaphore * __sched +static inline struct rw_semaphore * __sched rwsem_down_write_failed(struct rw_semaphore *sem) { return 
__rwsem_down_write_failed_common(sem, TASK_UNINTERRUPTIBLE); } -EXPORT_SYMBOL(rwsem_down_write_failed); -__visible struct rw_semaphore * __sched +static inline struct rw_semaphore * __sched rwsem_down_write_failed_killable(struct rw_semaphore *sem) { return __rwsem_down_write_failed_common(sem, TASK_KILLABLE); } -EXPORT_SYMBOL(rwsem_down_write_failed_killable); /* * handle waking up a waiter on the semaphore * - up_read/up_write has decremented the active part of count if we come here */ -__visible -struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem) +static struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem) { unsigned long flags; DEFINE_WAKE_Q(wake_q); @@ -728,15 +722,13 @@ struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem) return sem; } -EXPORT_SYMBOL(rwsem_wake); /* * downgrade a write lock into a read lock * - caller incremented waiting part of count and discovered it still negative * - just wake up any readers at the front of the queue */ -__visible -struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem) +static struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem) { unsigned long flags; DEFINE_WAKE_Q(wake_q); @@ -751,7 +743,6 @@ struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem) return sem; } -EXPORT_SYMBOL(rwsem_downgrade_wake); /* * lock for reading @@ -895,7 +886,6 @@ void __sched down_read(struct rw_semaphore *sem) LOCK_CONTENDED(sem, __down_read_trylock, __down_read); } - EXPORT_SYMBOL(down_read); int __sched down_read_killable(struct rw_semaphore *sem) @@ -910,7 +900,6 @@ int __sched down_read_killable(struct rw_semaphore *sem) return 0; } - EXPORT_SYMBOL(down_read_killable); /* @@ -924,7 +913,6 @@ int down_read_trylock(struct rw_semaphore *sem) rwsem_acquire_read(&sem->dep_map, 0, 1, _RET_IP_); return ret; } - EXPORT_SYMBOL(down_read_trylock); /* @@ -934,10 +922,8 @@ void __sched down_write(struct rw_semaphore *sem) { might_sleep(); rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_); - LOCK_CONTENDED(sem, 
__down_write_trylock, __down_write); } - EXPORT_SYMBOL(down_write); /* @@ -948,14 +934,14 @@ int __sched down_write_killable(struct rw_semaphore *sem) might_sleep(); rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_); - if (LOCK_CONTENDED_RETURN(sem, __down_write_trylock, __down_write_killable)) { + if (LOCK_CONTENDED_RETURN(sem, __down_write_trylock, + __down_write_killable)) { rwsem_release(&sem->dep_map, 1, _RET_IP_); return -EINTR; } return 0; } - EXPORT_SYMBOL(down_write_killable); /* @@ -970,7 +956,6 @@ int down_write_trylock(struct rw_semaphore *sem) return
[PATCH v4 05/16] locking/rwsem: Merge rwsem.h and rwsem-xadd.c into rwsem.c
Now we only have one implementation of rwsem. Even though we still use xadd to handle reader locking, we use cmpxchg for writer instead. So the filename rwsem-xadd.c is not strictly correct. Also no one outside of the rwsem code need to know the internal implementation other than function prototypes for two internal functions that are called directly from percpu-rwsem.c. So the rwsem-xadd.c and rwsem.h files are now merged into rwsem.c in the following order: The rwsem.h file now contains only 2 function declarations for __up_read() and __down_read(). This is a code relocation patch with no code change at all except making __up_read() and __down_read() non-static functions so they can be used by percpu-rwsem.c. Suggested-by: Peter Zijlstra Signed-off-by: Waiman Long --- kernel/locking/Makefile | 2 +- kernel/locking/rwsem-xadd.c | 608 - kernel/locking/rwsem.c | 870 kernel/locking/rwsem.h | 283 +--- 4 files changed, 877 insertions(+), 886 deletions(-) delete mode 100644 kernel/locking/rwsem-xadd.c diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile index 6fe2f333aecb..45452facff3b 100644 --- a/kernel/locking/Makefile +++ b/kernel/locking/Makefile @@ -3,7 +3,7 @@ # and is generally not a function of system call inputs. KCOV_INSTRUMENT:= n -obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o rwsem-xadd.o +obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o ifdef CONFIG_FUNCTION_TRACER CFLAGS_REMOVE_lockdep.o = $(CC_FLAGS_FTRACE) diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c deleted file mode 100644 index 92f7d7b6bfa3.. --- a/kernel/locking/rwsem-xadd.c +++ /dev/null @@ -1,608 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0 -/* rwsem.c: R/W semaphores: contention handling functions - * - * Written by David Howells (dhowe...@redhat.com). - * Derived from arch/i386/kernel/semaphore.c - * - * Writer lock-stealing by Alex Shi - * and Michel Lespinasse - * - * Optimistic spinning by Tim Chen - * and Davidlohr Bueso . Based on mutexes. 
- * - * Rwsem count bit fields re-definition by Waiman Long . - */ -#include -#include -#include -#include -#include -#include -#include -#include - -#include "rwsem.h" - -/* - * Guide to the rw_semaphore's count field. - * - * When the RWSEM_WRITER_LOCKED bit in count is set, the lock is owned - * by a writer. - * - * The lock is owned by readers when - * (1) the RWSEM_WRITER_LOCKED isn't set in count, - * (2) some of the reader bits are set in count, and - * (3) the owner field has RWSEM_READ_OWNED bit set. - * - * Having some reader bits set is not enough to guarantee a readers owned - * lock as the readers may be in the process of backing out from the count - * and a writer has just released the lock. So another writer may steal - * the lock immediately after that. - */ - -/* - * Initialize an rwsem: - */ -void __init_rwsem(struct rw_semaphore *sem, const char *name, - struct lock_class_key *key) -{ -#ifdef CONFIG_DEBUG_LOCK_ALLOC - /* -* Make sure we are not reinitializing a held semaphore: -*/ - debug_check_no_locks_freed((void *)sem, sizeof(*sem)); - lockdep_init_map(&sem->dep_map, name, key, 0); -#endif - atomic_long_set(&sem->count, RWSEM_UNLOCKED_VALUE); - raw_spin_lock_init(&sem->wait_lock); - INIT_LIST_HEAD(&sem->wait_list); - sem->owner = NULL; -#ifdef CONFIG_RWSEM_SPIN_ON_OWNER - osq_lock_init(&sem->osq); -#endif -} - -EXPORT_SYMBOL(__init_rwsem); - -enum rwsem_waiter_type { - RWSEM_WAITING_FOR_WRITE, - RWSEM_WAITING_FOR_READ -}; - -struct rwsem_waiter { - struct list_head list; - struct task_struct *task; - enum rwsem_waiter_type type; -}; - -enum rwsem_wake_type { - RWSEM_WAKE_ANY, /* Wake whatever's at head of wait list */ - RWSEM_WAKE_READERS, /* Wake readers only */ - RWSEM_WAKE_READ_OWNED /* Waker thread holds the read lock */ -}; - -/* - * handle the lock release when processes blocked on it that can now run - * - if we come here from up_xxxx(), then the RWSEM_FLAG_WAITERS bit must - * have been set. 
- * - there must be someone on the queue - * - the wait_lock must be held by the caller - * - tasks are marked for wakeup, the caller must later invoke wake_up_q() - * to actually wakeup the blocked task(s) and drop the reference count, - * preferably when the wait_lock is released - * - woken process blocks are discarded from the list after having task zeroed - * - writers are only marked woken if downgrading is false - */ -static void __rwsem_mark_wake(struct rw_semaphore *sem, - enum rwsem_wake_type wake_type, - struct wake_q_head *wake_q) -{ - struct rwsem_waiter *waiter, *tmp; - long oldcount, woken = 0, adjustment = 0; - - /* -* Take a peek at the queue head waiter such
Re: [PATCH] pci/switchtec: fix stream_open.cocci warnings (fwd)
Hello everyone, On Sat, Apr 13, 2019 at 06:50:57PM +0200, Julia Lawall wrote: > Hello, > > Kirill will explain about this issue. pci/switchtec switching to stream_open is already queued for the merge window and it was acked by Logan Gunthorpe: https://lore.kernel.org/lkml/CAHk-=wgqgN5j1ZWnyVLqqoyU=ccwtyoko3mdyu8l_5e21kv...@mail.gmail.com/ https://lab.nexedi.com/kirr/linux/commit/edaeb4101860 (there are too many Cc's on that patch, and the email with it and replies-all to it did not get into the mailing list, probably because they were considered spam) stream_open.cocci was issuing only a warning for pci/switchtec, but after 8a29a3bae2a2 ("pci/switchtec: Don't use completion's wait queue") they started to use wait_event_* inside the read method and, since stream_open.cocci considers wait_event_* as blocking, the warning became an error. Previously completions were used there, but I added support only for wait events for simplicity. I can handle pci/switchtec switching via the big nonseekable_open -> stream_open change at the next merge window, or, alternatively, please feel free to switch pci/switchtec now on its own. The change is correct - I manually reviewed all the changes that stream_open.cocci produces and pci/switchtec was there. Kirill > julia > > -- Forwarded message -- > Date: Sat, 13 Apr 2019 11:22:51 +0800 > From: kbuild test robot > To: kbu...@01.org > Cc: Julia Lawall > Subject: [PATCH] pci/switchtec: fix stream_open.cocci warnings > > CC: kbuild-...@01.org > TO: Sebastian Andrzej Siewior > CC: Kurt Schwemmer > CC: Logan Gunthorpe > CC: Bjorn Helgaas > CC: linux-...@vger.kernel.org > CC: linux-kernel@vger.kernel.org > > From: kbuild test robot > > drivers/pci/switch/switchtec.c:395:1-17: ERROR: switchtec_fops: .read() can > deadlock .write(); change nonseekable_open -> stream_open to fix. 
> > Generated by: scripts/coccinelle/api/stream_open.cocci > > Fixes: 8a29a3bae2a2 ("pci/switchtec: Don't use completion's wait queue") > Signed-off-by: kbuild test robot > --- > > tree: https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git > linux-5.0.y-rt-rebase > head: 794c294ae4483c240429c25a0d18e272e92c94de > commit: 8a29a3bae2a2dfb0116cd8791d9700515d6e765e [154/311] pci/switchtec: > Don't use completion's wait queue > :: branch date: 7 hours ago > :: commit date: 7 hours ago > > Please take the patch only if it's a positive warning. Thanks! > > switchtec.c |2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > --- a/drivers/pci/switch/switchtec.c > +++ b/drivers/pci/switch/switchtec.c > @@ -392,7 +392,7 @@ static int switchtec_dev_open(struct ino > return PTR_ERR(stuser); > > filp->private_data = stuser; > - nonseekable_open(inode, filp); > + stream_open(inode, filp); > > dev_dbg(>dev, "%s: %p\n", __func__, stuser);
[PATCH v4 11/16] locking/rwsem: Enable readers spinning on writer
This patch enables readers to optimistically spin on an rwsem when it is owned by a writer instead of going to sleep directly. The rwsem_can_spin_on_owner() function is extracted out of rwsem_optimistic_spin() and is called directly by __rwsem_down_read_failed_common() and __rwsem_down_write_failed_common(). With a locking microbenchmark running on a 5.1-based kernel, the total locking rates (in kops/s) on an 8-socket IvyBridge-EX system with equal numbers of readers and writers before and after the patch were as follows:

  # of Threads   Pre-patch   Post-patch
  ------------   ---------   ----------
        4          1,674       1,684
        8          1,062       1,074
       16            924         900
       32            300         458
       64            195         208
      128            164         168
      240            149         143

The performance change wasn't significant in this case, but this change is required by a follow-on patch. Signed-off-by: Waiman Long --- kernel/locking/lock_events_list.h | 1 + kernel/locking/rwsem.c| 91 +++ 2 files changed, 80 insertions(+), 12 deletions(-) diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h index 29e5c52197fa..333ed5fda333 100644 --- a/kernel/locking/lock_events_list.h +++ b/kernel/locking/lock_events_list.h @@ -56,6 +56,7 @@ LOCK_EVENT(rwsem_sleep_reader)/* # of reader sleeps */ LOCK_EVENT(rwsem_sleep_writer) /* # of writer sleeps */ LOCK_EVENT(rwsem_wake_reader) /* # of reader wakeups */ LOCK_EVENT(rwsem_wake_writer) /* # of writer wakeups */ +LOCK_EVENT(rwsem_opt_rlock)/* # of read locks opt-spin acquired*/ LOCK_EVENT(rwsem_opt_wlock)/* # of write locks opt-spin acquired */ LOCK_EVENT(rwsem_opt_fail) /* # of failed opt-spinnings*/ LOCK_EVENT(rwsem_rlock)/* # of read locks acquired */ diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c index cf0a90d251aa..3cf8355252d1 100644 --- a/kernel/locking/rwsem.c +++ b/kernel/locking/rwsem.c @@ -103,9 +103,12 @@ RWSEM_FLAG_HANDOFF) #define RWSEM_COUNT_LOCKED(c) ((c) & RWSEM_LOCK_MASK) +#define RWSEM_COUNT_WLOCKED(c) ((c) & RWSEM_WRITER_MASK) #define RWSEM_COUNT_HANDOFF(c) ((c) & RWSEM_FLAG_HANDOFF) #define RWSEM_COUNT_LOCKED_OR_HANDOFF(c) 
\ ((c) & (RWSEM_LOCK_MASK|RWSEM_FLAG_HANDOFF)) +#define RWSEM_COUNT_WLOCKED_OR_HANDOFF(c) \ + ((c) & (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF)) /* * All writes to owner are protected by WRITE_ONCE() to make sure that @@ -436,6 +439,30 @@ static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem, } #ifdef CONFIG_RWSEM_SPIN_ON_OWNER +/* + * Try to acquire read lock before the reader is put on wait queue. + * Lock acquisition isn't allowed if the rwsem is locked or a writer handoff + * is ongoing. + */ +static inline bool rwsem_try_read_lock_unqueued(struct rw_semaphore *sem) +{ + long count = atomic_long_read(&sem->count); + + if (RWSEM_COUNT_WLOCKED_OR_HANDOFF(count)) + return false; + + count = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS, &sem->count); + if (!RWSEM_COUNT_WLOCKED_OR_HANDOFF(count)) { + rwsem_set_reader_owned(sem); + lockevent_inc(rwsem_opt_rlock); + return true; + } + + /* Back out the change */ + atomic_long_add(-RWSEM_READER_BIAS, &sem->count); + return false; +} + /* * Try to acquire write lock before the writer has been put on wait queue. */ @@ -470,9 +497,12 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem) BUILD_BUG_ON(!rwsem_has_anonymous_owner(RWSEM_OWNER_UNKNOWN)); - if (need_resched()) + if (need_resched()) { + lockevent_inc(rwsem_opt_fail); return false; + } + preempt_disable(); rcu_read_lock(); owner = READ_ONCE(sem->owner); if (owner) { @@ -480,6 +510,9 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem) owner_on_cpu(owner); } rcu_read_unlock(); + preempt_enable(); + + lockevent_cond_inc(rwsem_opt_fail, !ret); return ret; } @@ -549,7 +582,7 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem) return !owner ? 
OWNER_NULL : OWNER_READER; } -static bool rwsem_optimistic_spin(struct rw_semaphore *sem) +static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock) { bool taken = false; bool is_rt_task = rt_task(current); @@ -558,9 +591,6 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem) preempt_disable(); /* sem->wait_lock should not be held when doing optimistic spinning */ - if (!rwsem_can_spin_on_owner(sem)) -
[PATCH v4 10/16] locking/rwsem: Wake up almost all readers in wait queue
When the front of the wait queue is a reader, other readers immediately following the first reader will also be woken up at the same time. However, if there is a writer in between, those readers behind the writer will not be woken up. Because of optimistic spinning, the lock acquisition order is not FIFO anyway. The lock handoff mechanism will ensure that lock starvation will not happen. Assuming that the lock hold times of the other readers still in the queue will be about the same as the readers that are being woken up, there is really not much additional cost other than the additional latency due to the wakeup of additional tasks by the waker. Therefore all the readers in the queue, up to a maximum of 256, are woken up when the first waiter is a reader, to improve reader throughput. With a locking microbenchmark running on a 5.1-based kernel, the total locking rates (in kops/s) on an 8-socket IvyBridge-EX system with equal numbers of readers and writers before and after this patch were as follows:

  # of Threads   Pre-patch   Post-patch
  ------------   ---------   ----------
        4          1,641       1,674
        8            731       1,062
       16            564         924
       32             78         300
       64             38         195
      240             50         149

There is no performance gain at low contention level. At high contention level, however, this patch gives a pretty decent performance boost. Signed-off-by: Waiman Long --- kernel/locking/rwsem.c | 21 + 1 file changed, 17 insertions(+), 4 deletions(-) diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c index 8e19b5141595..cf0a90d251aa 100644 --- a/kernel/locking/rwsem.c +++ b/kernel/locking/rwsem.c @@ -260,6 +260,13 @@ enum writer_wait_state { */ #define RWSEM_WAIT_TIMEOUT (HZ/250) +/* + * We limit the maximum number of readers that can be woken up for a + * wake-up call to avoid penalizing the waking thread for spending too + * much time doing it. 
+ */ +#define MAX_READERS_WAKEUP 0x100 + /* * handle the lock release when processes blocked on it that can now run * - if we come here from up_(), then the RWSEM_FLAG_WAITERS bit must @@ -332,16 +339,16 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem, } /* -* Grant an infinite number of read locks to the readers at the front -* of the queue. We know that woken will be at least 1 as we accounted -* for above. Note we increment the 'active part' of the count by the +* Grant up to MAX_READERS_WAKEUP read locks to all the readers in the +* queue. We know that woken will be at least 1 as we accounted for +* above. Note we increment the 'active part' of the count by the * number of readers before waking any processes up. */ list_for_each_entry_safe(waiter, tmp, >wait_list, list) { struct task_struct *tsk; if (waiter->type == RWSEM_WAITING_FOR_WRITE) - break; + continue; woken++; tsk = waiter->task; @@ -360,6 +367,12 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem, * after setting the reader waiter to nil. */ wake_q_add_safe(wake_q, tsk); + + /* +* Limit # of readers that can be woken up per wakeup call. +*/ + if (woken >= MAX_READERS_WAKEUP) + break; } adjustment = woken * RWSEM_READER_BIAS - adjustment; -- 2.18.1
[PATCH 2/2] vfs: use >f_pos directly on files that have position
Long ago vfs read/write operations were passing ppos=&file->f_pos directly to .read / .write file_operations methods. That changed in 2004 in 55f09ec0087c ("read/write: pass down a copy of f_pos, not f_pos itself.") which started to pass ppos=&pos_var trying to avoid simultaneous read/write/lseek stepping onto each other's toes and overwriting file->f_pos racily. That measure was not complete and in 2014 commit 9c225f2655e36 ("vfs: atomic f_pos accesses as per POSIX") added file->f_pos_lock to completely disable simultaneous read/write/lseek runs. After f_pos_lock was introduced the reason to avoid passing ppos=&file->f_pos directly due to concurrency vanished. Linus explains[1]: In fact, we *used* to (long ago) pass in the address of "file->f_pos" itself to the low-level read/write routines. We then changed it to do that indirection through a local copy of pos (and file_pos_read/file_pos_write) because we didn't do the proper locking, so different read/write versions could mess with each other (and with lseek). But one of the things that commit 9c225f2655e36 ("vfs: atomic f_pos accesses as per POSIX") did was to add the proper locking at least for the cases that we care about deeply, so we *could* say that we have three cases: - FMODE_ATOMIC_POS: properly locked, - FMODE_STREAM: no pos at all - otherwise a "mostly don't care - don't mix!" and so we could go back to not copying the pos at all, and instead do something like loff_t *ppos = f.file->f_mode & FMODE_STREAM ? NULL : &f.file->f_pos; ret = vfs_write(f.file, buf, count, ppos); and perhaps have a long-term plan to try to get rid of the "don't mix" case entirely (ie "if you use f_pos, then we'll do the proper locking") (The above is obviously surrounded by the fdget_pos()/fdput_pos() that implements the locking decision). Currently for regular files we always set FMODE_ATOMIC_POS and change that to FMODE_STREAM if stream_open is used explicitly on open. That leaves other files, like e.g. 
sockets and pipes, for the "mostly don't care - don't mix!" case. Sockets, for example, always check that on read/write the initial pos they receive is 0 and don't update it; if it is !0 they return -ESPIPE. That suggests that we can do the switch to passing &file->f_pos directly now, and incrementally convert to FMODE_STREAM the files that were doing the stream-like checking manually in their low-level .read/.write handlers.

Note: it is theoretically possible that a driver updates *ppos internally even if read/write returns an error. For such cases the conversion will change the IO semantics a bit. The semantic that is changing here was introduced in 2013 in commit 5faf153ebf61 ("don't call file_pos_write() if vfs_{read,write}{,v}() fails").

[1] https://lore.kernel.org/linux-fsdevel/CAHk-=whjtzt52snhbgrnmnuxfn3ge9x_e02x8bpxtkqrfyz...@mail.gmail.com/

Suggested-by: Linus Torvalds
Signed-off-by: Kirill Smelkov
---
 fs/read_write.c | 36 ++++--------------------------------
 1 file changed, 4 insertions(+), 32 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index d62556be6848..13550b65cb2c 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -571,14 +571,7 @@ ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos, *ppos = file_ppos(f.file);
-		if (ppos) {
-			pos = *ppos;
-			ppos = &pos;
-		}
-		ret = vfs_read(f.file, buf, count, ppos);
-		if (ret >= 0 && ppos)
-			f.file->f_pos = pos;
+		ret = vfs_read(f.file, buf, count, file_ppos(f.file));
 		fdput_pos(f);
 	}
 	return ret;
@@ -595,14 +588,7 @@ ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos, *ppos = file_ppos(f.file);
-		if (ppos) {
-			pos = *ppos;
-			ppos = &pos;
-		}
-		ret = vfs_write(f.file, buf, count, ppos);
-		if (ret >= 0 && ppos)
-			f.file->f_pos = pos;
+		ret = vfs_write(f.file, buf, count, file_ppos(f.file));
 		fdput_pos(f);
 	}
@@ -1018,14 +1004,7 @@ static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec,
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos, *ppos = file_ppos(f.file);
-		if (ppos) {
-			pos = *ppos;
-			ppos = &pos;
-		}
-		ret = vfs_readv(f.file, vec, vlen, ppos, flags);
-		if (ret >= 0 && ppos)
-			f.file->f_pos = pos;
+		ret = vfs_readv(f.file, vec, vlen,
[PATCH 1/2] vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files
This amends commit 10dce8af3422 ("fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock") in how position is passed into .read()/.write() handlers for stream-like files:

Rasmus noticed that we currently pass 0 as position and ignore any position change if that is done by a file implementation. This papers over bugs if ppos is used in files that declare themselves as being stream-like, as such bugs will go unnoticed. Even if a file implementation is correctly converted into using stream_open, its read/write could later be changed to use ppos, and even though that won't be working correctly, that bug might go unnoticed without someone doing a wrong-behaviour analysis. It is thus better to pass ppos=NULL into read/write for stream-like files, as that doesn't give any chance for ppos usage bugs: it will oops if ppos is ever used inside .read() or .write().

Note 1: rw_verify_area and new_sync_{read,write} need to be updated because they are called by vfs_read/vfs_write & friends before file_operations .read/.write .

Note 2: if a file backend uses new-style .read_iter/.write_iter, position is still passed into there as a non-pointer kiocb.ki_pos . Currently stream_open.cocci (the semantic patch added by 10dce8af3422) ignores files whose file_operations has *_iter methods.

Suggested-by: Rasmus Villemoes
Signed-off-by: Kirill Smelkov
---
 fs/open.c       |  5 ++--
 fs/read_write.c | 75 +++++++++++++++++++++++++++++++++------------------------
 2 files changed, 47 insertions(+), 33 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index a00350018a47..9c7d724a6f67 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1219,8 +1219,9 @@ EXPORT_SYMBOL(nonseekable_open);
 /*
  * stream_open is used by subsystems that want stream-like file descriptors.
  * Such file descriptors are not seekable and don't have notion of position
- * (file.f_pos is always 0). Contrary to file descriptors of other regular
- * files, .read() and .write() can run simultaneously.
+ * (file.f_pos is always 0 and ppos passed to .read()/.write() is always NULL).
+ * Contrary to file descriptors of other regular files, .read() and .write()
+ * can run simultaneously.
  *
  * stream_open never fails and is marked to return int so that it could be
  * directly used as file_operations.open .
diff --git a/fs/read_write.c b/fs/read_write.c
index 61b43ad7608e..d62556be6848 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -371,7 +371,7 @@ int rw_verify_area(int read_write, struct file *file, const loff_t *ppos, size_t
 	inode = file_inode(file);
 	if (unlikely((ssize_t) count < 0))
 		return retval;
-	pos = *ppos;
+	pos = (ppos ? *ppos : 0);
 	if (unlikely(pos < 0)) {
 		if (!unsigned_offsets(file))
 			return retval;
@@ -400,12 +400,13 @@ static ssize_t new_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
 	ssize_t ret;
 
 	init_sync_kiocb(&kiocb, filp);
-	kiocb.ki_pos = *ppos;
+	kiocb.ki_pos = (ppos ? *ppos : 0);
 	iov_iter_init(&iter, READ, &iov, 1, len);
 
 	ret = call_read_iter(filp, &kiocb, &iter);
 	BUG_ON(ret == -EIOCBQUEUED);
-	*ppos = kiocb.ki_pos;
+	if (ppos)
+		*ppos = kiocb.ki_pos;
 	return ret;
 }
 
@@ -468,12 +469,12 @@ static ssize_t new_sync_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos)
 	ssize_t ret;
 
 	init_sync_kiocb(&kiocb, filp);
-	kiocb.ki_pos = *ppos;
+	kiocb.ki_pos = (ppos ? *ppos : 0);
 	iov_iter_init(&iter, WRITE, &iov, 1, len);
 
 	ret = call_write_iter(filp, &kiocb, &iter);
 	BUG_ON(ret == -EIOCBQUEUED);
-	if (ret > 0)
+	if (ret > 0 && ppos)
 		*ppos = kiocb.ki_pos;
 	return ret;
 }
@@ -558,15 +559,10 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *ppos)
 	return ret;
 }
 
-static inline loff_t file_pos_read(struct file *file)
-{
-	return file->f_mode & FMODE_STREAM ? 0 : file->f_pos;
-}
-
-static inline void file_pos_write(struct file *file, loff_t pos)
+/* file_ppos returns &file->f_pos or NULL if file is stream */
+static inline loff_t *file_ppos(struct file *file)
 {
-	if ((file->f_mode & FMODE_STREAM) == 0)
-		file->f_pos = pos;
+	return file->f_mode & FMODE_STREAM ?
		NULL : &file->f_pos;
 }
 
 ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
@@ -575,10 +571,14 @@ ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos = file_pos_read(f.file);
-		ret = vfs_read(f.file, buf, count, &pos);
-		if (ret >= 0)
-			file_pos_write(f.file, pos);
+		loff_t pos, *ppos = file_ppos(f.file);
+		if (ppos) {
+			pos = *ppos;
+			ppos = &pos;
+		}
+		ret = vfs_read(f.file, buf, count, ppos);
+		if (ret >=
[PATCH v4 09/16] locking/rwsem: Ensure an RT task will not spin on reader
An RT task can do optimistic spinning only if the lock holder is actually running. If the state of the lock holder isn't known, there is a possibility that the high priority of the RT task may block forward progress of the lock holder if it happens to reside on the same CPU. This will lead to deadlock. So we have to make sure that an RT task will not spin on a reader-owned rwsem.

When the owner is temporarily set to NULL, it is more tricky to decide if an RT task should stop spinning, as it may be a temporary state where another writer may have just stolen the lock, causing the task's trylock attempt to fail. So one more retry is allowed to make sure that the lock is not spinnable by an RT task.

When testing on an 8-socket IvyBridge-EX system, the one additional retry seems to improve locking performance of RT write-locking threads under heavy contention. The table below shows the locking rates (in kops/s) with various numbers of write-locking threads before and after the patch.

    Locking threads    Pre-patch    Post-patch
    ---------------    ---------    ----------
           4             2,753        2,608
           8             2,529        2,520
          16             1,727        1,918
          32             1,263        1,956
          64               889        1,343

Signed-off-by: Waiman Long
---
 kernel/locking/rwsem.c | 36 ++++++++++++++++++++++++++++++-------
 1 file changed, 29 insertions(+), 7 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 2d6850c3e77b..8e19b5141595 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -539,6 +539,8 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 {
 	bool taken = false;
+	bool is_rt_task = rt_task(current);
+	int prev_owner_state = OWNER_NULL;
 
 	preempt_disable();
 
@@ -556,7 +558,12 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 	 * 2) readers own the lock as we can't determine if they are
 	 *    actively running or not.
 	 */
-	while (rwsem_spin_on_owner(sem) & OWNER_SPINNABLE) {
+	for (;;) {
+		enum owner_state owner_state = rwsem_spin_on_owner(sem);
+
+		if (!(owner_state & OWNER_SPINNABLE))
+			break;
+
 		/*
 		 * Try to acquire the lock
 		 */
@@ -566,13 +573,28 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 		}
 
 		/*
-		 * When there's no owner, we might have preempted between the
-		 * owner acquiring the lock and setting the owner field. If
-		 * we're an RT task that will live-lock because we won't let
-		 * the owner complete.
+		 * An RT task cannot do optimistic spinning if it cannot
+		 * be sure the lock holder is running or live-lock may
+		 * happen if the current task and the lock holder happen
+		 * to run on the same CPU.
+		 *
+		 * When there's no owner or the lock is reader-owned, an
+		 * RT task will stop spinning if the owner state was not
+		 * a writer at the previous iteration of the loop. This
+		 * allows the RT task to recheck if the task that steals
+		 * the lock is a spinnable writer. If so, it can keep on
+		 * spinning.
+		 *
+		 * If the owner is a writer, the need_resched() check is
+		 * done inside rwsem_spin_on_owner(). If the owner is not
+		 * a writer, the need_resched() check needs to be done here.
 		 */
-		if (!sem->owner && (need_resched() || rt_task(current)))
-			break;
+		if (owner_state != OWNER_WRITER) {
+			if (need_resched())
+				break;
+			if (is_rt_task && (prev_owner_state != OWNER_WRITER))
+				break;
+		}
+		prev_owner_state = owner_state;
 
 		/*
 		 * The cpu_relax() call is a compiler barrier which forces
-- 
2.18.1
[PATCH v4 04/16] locking/rwsem: Implement a new locking scheme
The current way of using various reader, writer and waiting biases in the rwsem code is confusing and hard to understand. I have to reread the rwsem count guide in the rwsem-xadd.c file from time to time to remind myself how this whole thing works. It also makes the rwsem code harder to optimize.

To make rwsem more sane, a new locking scheme similar to the one in qrwlock is now being used. The atomic long count has the following bit definitions:

    Bit  0    - writer locked bit
    Bit  1    - waiters present bit
    Bits 2-7  - reserved for future extension
    Bits 8-X  - reader count (24/56 bits)

The cmpxchg instruction is now used to acquire the write lock. The read lock is still acquired with the xadd instruction, so there is no change here. This scheme will allow up to 16M/64P active readers, which should be more than enough. We can always use some more reserved bits if necessary.

With that change, we can deterministically know if a rwsem has been write-locked. Looking at the count alone, however, one cannot determine for certain if a rwsem is owned by readers or not, as the readers that set the reader count bits may be in the process of backing out. So we still need the reader-owned bit in the owner field to be sure.
With a locking microbenchmark running on a 5.1 based kernel, the total locking rates (in kops/s) of the benchmark on an 8-socket 120-core IvyBridge-EX system before and after the patch were as follows:

                   Before Patch        After Patch
    # of Threads   wlock    rlock    wlock    rlock
    ------------   ------   ------   ------   ------
         1         30,659   31,341   31,055   31,283
         2          8,909   16,457    9,884   17,659
         4          9,028   15,823    8,933   20,233
         8          8,410   14,212    7,230   17,140
        16          8,217   25,240    7,479   24,607

The locking rates of the benchmark on a Power8 system were as follows:

                   Before Patch        After Patch
    # of Threads   wlock    rlock    wlock    rlock
    ------------   ------   ------   ------   ------
         1         12,963   13,647   13,275   13,601
         2          7,570   11,569    7,902   10,829
         4          5,232    5,516    5,466    5,435
         8          5,233    3,386    5,467    3,168

The locking rates of the benchmark on a 2-socket ARM64 system were as follows:

                   Before Patch        After Patch
    # of Threads   wlock    rlock    wlock    rlock
    ------------   ------   ------   ------   ------
         1         21,495   21,046   21,524   21,074
         2          5,293   10,502    5,333   10,504
         4          5,325   11,463    5,358   11,631
         8          5,391   11,712    5,470   11,680

The performance is roughly the same before and after the patch. There are run-to-run variations in performance. Runs with higher variances usually have higher throughput.

Signed-off-by: Waiman Long
---
 kernel/locking/rwsem-xadd.c | 147 ++++++++++++++------------------------------
 kernel/locking/rwsem.h      |  74 +++++++++++++++++-----
 2 files changed, 86 insertions(+), 135 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 98de7f0cfedd..92f7d7b6bfa3 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -9,6 +9,8 @@
  *
  * Optimistic spinning by Tim Chen <tim.c.chen@intel.com>
  * and Davidlohr Bueso <davidlohr@hp.com>. Based on mutexes.
+ *
+ * Rwsem count bit fields re-definition by Waiman Long <longman@redhat.com>.
  */
 #include <linux/rwsem.h>
 #include <linux/init.h>
@@ -22,52 +24,20 @@
 #include "rwsem.h"
 
 /*
- * Guide to the rw_semaphore's count field for common values.
- * (32-bit case illustrated, similar for 64-bit)
- *
- * 0x0000000X	(1) X readers active or attempting lock, no writer waiting
- *		    X = #active_readers + #readers attempting to lock
- *		    (X*ACTIVE_BIAS)
- *
- * 0x00000000	rwsem is unlocked, and no one is waiting for the lock or
- *		attempting to read lock or write lock.
- *
- * 0xffff000X	(1) X readers active or attempting lock, with waiters for lock
- *		    X = #active readers + # readers attempting lock
- *		    (X*ACTIVE_BIAS + WAITING_BIAS)
- *		(2) 1 writer attempting lock, no waiters for lock
- *		    X-1 = #active readers + #readers attempting lock
- *		    ((X-1)*ACTIVE_BIAS + ACTIVE_WRITE_BIAS)
- *		(3) 1 writer active, no waiters for lock
- *		    X-1 = #active readers + #readers attempting lock
- *		    ((X-1)*ACTIVE_BIAS + ACTIVE_WRITE_BIAS)
- *
- * 0xffff0001	(1) 1 reader active or attempting lock, waiters for lock
- *		    (WAITING_BIAS + ACTIVE_BIAS)
- *		(2) 1 writer active or attempting lock, no waiters for lock
- *		    (ACTIVE_WRITE_BIAS)
+ * Guide to the rw_semaphore's count field.
  *
- * 0xffff0000	(1) There are writers or readers queued but none active
- *		    or in the process of attempting lock.
- *		    (WAITING_BIAS)
- *		Note: writer
[PATCH v4 15/16] locking/rwsem: Merge owner into count on x86-64
With separate count and owner, there are timing windows where the two values are inconsistent. That can cause problems when trying to figure out the exact state of the rwsem. For instance, an RT task will stop optimistic spinning if the lock is acquired by a writer but the owner field isn't set yet. That can be solved by combining the count and owner together in a single atomic value.

On 32-bit architectures, there aren't enough bits to hold both. 64-bit architectures, however, can have enough bits to do that. For x86-64, the physical address can use up to 52 bits. That is 4PB of memory. That leaves 12 bits available for other use. The task structure pointer is aligned to the L1 cache size. That means another 6 bits (64-byte cacheline) will be available. Reserving 2 bits for status flags, we will have 16 bits for the reader count and the read fail bit. That supports up to (32k-1) readers.

The owner value will still be duplicated in the owner field, as that will ease debugging when looking at a core dump.

This change is currently enabled for x86-64 only. Other 64-bit architectures may be enabled in the future if the need arises.

With a locking microbenchmark running on a 5.1 based kernel, the total locking rates (in kops/s) on an 8-socket IvyBridge-EX system with writer-only locking threads and then equal numbers of readers and writers (mixed) before this patch and after this and subsequent related patches were as follows:

                   Before Patch        After Patch
    # of Threads   wlock    mixed    wlock    mixed
    ------------   ------   ------   ------   ------
         1         30,422   31,034   30,323   30,379
         2          6,427    6,684    7,804    9,436
         4          6,742    6,738    7,568    8,268
         8          7,092    7,222    5,679    7,041
        16          6,882    7,163    6,848    7,652
        32          7,458    7,316    7,975    2,189
        64          7,906      520    8,269      534
       128          1,680      425    8,047      448

In the single-thread case, the more complex write-locking operation does introduce a little bit of overhead (about 0.3%). For the contended cases, except for some anomalies in the data, there is no evidence that this change will adversely impact performance.
When running the same microbenchmark with RT locking threads instead, we got the following results:

                   Before Patch        After Patch
    # of Threads   wlock    mixed    wlock    mixed
    ------------   ------   ------   ------   ------
         2          4,065    3,642    4,756    5,062
         4          2,254    1,907    3,460    2,496
         8          2,386      964    3,012    1,964
        16          2,095    1,596    3,083    1,862
        32          2,388      530    3,717      359
        64          1,424      322    4,060      401
       128          1,642      510    4,488      628

It is obvious that RT tasks can benefit pretty significantly from this set of patches.

Signed-off-by: Waiman Long
---
 kernel/locking/rwsem.c | 112 ++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 103 insertions(+), 9 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index f37ab6358fe0..27219abb8bb6 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -73,7 +73,41 @@
 #endif
 
 /*
- * On 64-bit architectures, the bit definitions of the count are:
+ * Enable the merging of owner into count for x86-64 only.
+ */
+#ifdef CONFIG_X86_64
+#define RWSEM_MERGE_OWNER_TO_COUNT
+#endif
+
+/*
+ * With separate count and owner, there are timing windows where the two
+ * values are inconsistent. That can cause problems when trying to figure
+ * out the exact state of the rwsem. That can be solved by combining
+ * the count and owner together in a single atomic value.
+ *
+ * On 64-bit architectures, the owner task structure pointer can be
+ * compressed and combined with reader count and other status flags.
+ * A simple compression method is to map the virtual address back to
+ * the physical address by subtracting PAGE_OFFSET. On 32-bit
+ * architectures, the long integer value just isn't big enough for
+ * combining owner and count. So they remain separate.
+ *
+ * For x86-64, the physical address can use up to 52 bits. That is 4PB
+ * of memory. That leaves 12 bits available for other use. The task
+ * structure pointer is also aligned to the L1 cache size. That means
+ * another 6 bits (64-byte cacheline) will be available. Reserving
+ * 2 bits for status flags, we will have 16 bits for the reader count
+ * and the read fail bit.
+ * That supports up to (32k-1) active readers.
+ *
+ * On x86-64, the bit definitions of the count are:
+ *
+ * Bit 0      - waiters present bit
+ * Bit 1      - lock handoff bit
+ * Bits 2-47  - compressed task structure pointer
+ * Bits 48-62 - 15-bit reader count
+ * Bit 63     - read fail bit
+ *
+ * On other 64-bit architectures, the bit definitions are:
  *
  * Bit 0      - writer locked bit
  * Bit 1      - waiters present bit
@@
[PATCH v4 14/16] locking/rwsem: Guard against making count negative
The upper bits of the count field are used as the reader count. When a sufficient number of active readers are present, the most significant bit will be set and the count becomes negative. If the number of active readers keeps piling up, we may eventually overflow the reader count. This is not likely to happen unless the number of bits reserved for the reader count is reduced because those bits are needed for other purposes.

To prevent this count overflow from happening, the most significant bit is now treated as a guard bit (RWSEM_FLAG_READFAIL). Read-lock attempts will now fail for both the fast and optimistic spinning paths whenever this bit is set. So all those extra readers will be put to sleep in the wait queue. Wakeup will not happen until the reader count reaches 0.

Signed-off-by: Waiman Long
---
 kernel/locking/rwsem.c | 84 ++++++++++++++++++++++++++++++++++----------------
 1 file changed, 64 insertions(+), 20 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index ab26aba65371..f37ab6358fe0 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -73,13 +73,28 @@
 #endif
 
 /*
- * The definition of the atomic counter in the semaphore:
+ * On 64-bit architectures, the bit definitions of the count are:
  *
- * Bit 0    - writer locked bit
- * Bit 1    - waiters present bit
- * Bit 2    - lock handoff bit
- * Bits 3-7 - reserved
- * Bits 8-X - 24-bit (32-bit) or 56-bit reader count
+ * Bit 0     - writer locked bit
+ * Bit 1     - waiters present bit
+ * Bit 2     - lock handoff bit
+ * Bits 3-7  - reserved
+ * Bits 8-62 - 55-bit reader count
+ * Bit 63    - read fail bit
+ *
+ * On 32-bit architectures, the bit definitions of the count are:
+ *
+ * Bit 0     - writer locked bit
+ * Bit 1     - waiters present bit
+ * Bit 2     - lock handoff bit
+ * Bits 3-7  - reserved
+ * Bits 8-30 - 23-bit reader count
+ * Bit 31    - read fail bit
+ *
+ * It is not likely that the most significant bit (read fail bit) will ever
+ * be set.
+ * This guard bit is still checked anyway in the down_read() fastpath
+ * just in case we need to use up more of the reader bits for other purposes
+ * in the future.
  *
  * atomic_long_fetch_add() is used to obtain reader lock, whereas
  * atomic_long_cmpxchg() will be used to obtain writer lock.
@@ -96,6 +111,7 @@
 #define RWSEM_WRITER_LOCKED	(1UL << 0)
 #define RWSEM_FLAG_WAITERS	(1UL << 1)
 #define RWSEM_FLAG_HANDOFF	(1UL << 2)
+#define RWSEM_FLAG_READFAIL	(1UL << (BITS_PER_LONG - 1))
 
 #define RWSEM_READER_SHIFT	8
 #define RWSEM_READER_BIAS	(1UL << RWSEM_READER_SHIFT)
@@ -103,7 +119,7 @@
 #define RWSEM_WRITER_MASK	RWSEM_WRITER_LOCKED
 #define RWSEM_LOCK_MASK		(RWSEM_WRITER_MASK|RWSEM_READER_MASK)
 #define RWSEM_READ_FAILED_MASK	(RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS|\
-				 RWSEM_FLAG_HANDOFF)
+				 RWSEM_FLAG_HANDOFF|RWSEM_FLAG_READFAIL)
 
 #define RWSEM_COUNT_LOCKED(c)	((c) & RWSEM_LOCK_MASK)
 #define RWSEM_COUNT_WLOCKED(c)	((c) & RWSEM_WRITER_MASK)
@@ -315,7 +331,8 @@ enum writer_wait_state {
 /*
  * We limit the maximum number of readers that can be woken up for a
  * wake-up call to not penalizing the waking thread for spending too
- * much time doing it.
+ * much time doing it as well as the unlikely possibility of overflowing
+ * the reader count.
  */
 #define MAX_READERS_WAKEUP	0x100
 
@@ -799,12 +816,35 @@ rwsem_waiter_is_first(struct rw_semaphore *sem, struct rwsem_waiter *waiter)
  * Wait for the read lock to be granted
  */
 static inline struct rw_semaphore __sched *
-__rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
+__rwsem_down_read_failed_common(struct rw_semaphore *sem, int state, long count)
 {
-	long count, adjustment = -RWSEM_READER_BIAS;
+	long adjustment = -RWSEM_READER_BIAS;
 	struct rwsem_waiter waiter;
 	DEFINE_WAKE_Q(wake_q);
 
+	if (unlikely(count < 0)) {
+		/*
+		 * The sign bit has been set meaning that too many active
+		 * readers are present. We need to decrement the reader count
+		 * and enter the wait queue immediately to avoid overflowing
+		 * the reader count.
+		 *
+		 * As preemption is not disabled, there is a remote
+		 * possibility that preemption can happen in the narrow
+		 * timing window between incrementing and decrementing
+		 * the reader count and the task is put to sleep for a
+		 * considerable amount of time. If a sufficient number
+		 * of such unfortunate sequences of events happen, we
+		 * may still overflow the reader count. It is extremely
+		 * unlikely, though. If this is a concern, we should
+		 * consider disabling preemption during this timing window
+		 * to make sure that such an unfortunate event will not
+		 * happen.
+		 */
+
[PATCH v4 08/16] locking/rwsem: Make rwsem_spin_on_owner() return owner state
This patch modifies rwsem_spin_on_owner() to return four possible values to better reflect the state of the lock holder, which enables us to make a better decision of what to do next.

In the special case that there is no active lock and the handoff bit is set, optimistic spinning has to be stopped.

Signed-off-by: Waiman Long
---
 kernel/locking/rwsem.c | 45 ++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 38 insertions(+), 7 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index aaab546a890d..2d6850c3e77b 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -156,6 +156,11 @@ static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
 	return !((unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED);
 }
 
+static inline bool is_rwsem_owner_reader(struct task_struct *owner)
+{
+	return (unsigned long)owner & RWSEM_READER_OWNED;
+}
+
 /*
  * Return true if rwsem is owned by an anonymous writer or readers.
  */
@@ -466,14 +471,30 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 }
 
 /*
- * Return true only if we can still spin on the owner field of the rwsem.
+ * Return the following 4 values depending on the lock owner state.
+ *   OWNER_NULL  : owner is currently NULL
+ *   OWNER_WRITER: when owner changes and is a writer
+ *   OWNER_READER: when owner changes and the new owner may be a reader.
+ *   OWNER_NONSPINNABLE:
+ *		   when optimistic spinning has to stop because either the
+ *		   owner stops running, is unknown, or its timeslice has
+ *		   been used up.
 */
-static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
+enum owner_state {
+	OWNER_NULL		= 1 << 0,
+	OWNER_WRITER		= 1 << 1,
+	OWNER_READER		= 1 << 2,
+	OWNER_NONSPINNABLE	= 1 << 3,
+};
+#define OWNER_SPINNABLE		(OWNER_NULL | OWNER_WRITER)
+
+static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 {
 	struct task_struct *owner = READ_ONCE(sem->owner);
+	long count;
 
 	if (!is_rwsem_owner_spinnable(owner))
-		return false;
+		return OWNER_NONSPINNABLE;
 
 	rcu_read_lock();
 	while (owner && (READ_ONCE(sem->owner) == owner)) {
@@ -491,7 +512,7 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
 		 */
 		if (need_resched() || !owner_on_cpu(owner)) {
 			rcu_read_unlock();
-			return false;
+			return OWNER_NONSPINNABLE;
 		}
 
 		cpu_relax();
@@ -500,9 +521,19 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
 
 	/*
 	 * If there is a new owner or the owner is not set, we continue
-	 * spinning.
+	 * spinning except when there are no active locks and the handoff
+	 * bit is set. In this case, we have to stop spinning.
 	 */
-	return is_rwsem_owner_spinnable(READ_ONCE(sem->owner));
+	owner = READ_ONCE(sem->owner);
+	if (!is_rwsem_owner_spinnable(owner))
+		return OWNER_NONSPINNABLE;
+	if (owner && !is_rwsem_owner_reader(owner))
+		return OWNER_WRITER;
+
+	count = atomic_long_read(&sem->count);
+	if (RWSEM_COUNT_HANDOFF(count) && !RWSEM_COUNT_LOCKED(count))
+		return OWNER_NONSPINNABLE;
+	return !owner ? OWNER_NULL : OWNER_READER;
 }
 
 static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
@@ -525,7 +556,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 	 * 2) readers own the lock as we can't determine if they are
 	 *    actively running or not.
 	 */
-	while (rwsem_spin_on_owner(sem)) {
+	while (rwsem_spin_on_owner(sem) & OWNER_SPINNABLE) {
 		/*
 		 * Try to acquire the lock
 		 */
-- 
2.18.1
[PATCH v4 13/16] locking/rwsem: Add more rwsem owner access helpers
Before combining owner and count, we add two new helpers for accessing the owner value in the rwsem:

 1) struct task_struct *rwsem_get_owner(struct rw_semaphore *sem)
 2) bool is_rwsem_reader_owned(struct rw_semaphore *sem)

Signed-off-by: Waiman Long
---
 kernel/locking/rwsem.c | 50 +++++++++++++++++++++++++++++++++++-------------
 1 file changed, 37 insertions(+), 13 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 8b23009e6b2c..ab26aba65371 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -130,6 +130,11 @@ static inline void rwsem_clear_owner(struct rw_semaphore *sem)
 	WRITE_ONCE(sem->owner, NULL);
 }
 
+static inline struct task_struct *rwsem_get_owner(struct rw_semaphore *sem)
+{
+	return READ_ONCE(sem->owner);
+}
+
 /*
  * The task_struct pointer of the last owning reader will be left in
  * the owner field.
  */
@@ -174,6 +179,23 @@ static inline bool is_rwsem_spinnable(struct rw_semaphore *sem)
 	return is_rwsem_owner_spinnable(READ_ONCE(sem->owner));
 }
 
+/*
+ * Return true if the rwsem is owned by a reader.
+ */
+static inline bool is_rwsem_reader_owned(struct rw_semaphore *sem)
+{
+#ifdef CONFIG_DEBUG_RWSEMS
+	/*
+	 * Check the count to see if it is write-locked.
+	 */
+	long count = atomic_long_read(&sem->count);
+
+	if (count & RWSEM_WRITER_MASK)
+		return false;
+#endif
+	return (unsigned long)sem->owner & RWSEM_READER_OWNED;
+}
+
 /*
  * Return true if rwsem is owned by an anonymous writer or readers.
 */
@@ -193,6 +215,7 @@ static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
 {
 	unsigned long val = (unsigned long)current | RWSEM_READER_OWNED
 					| RWSEM_ANONYMOUSLY_OWNED;
+
 	if (READ_ONCE(sem->owner) == (struct task_struct *)val)
 		cmpxchg_relaxed((unsigned long *)&sem->owner, val,
 				RWSEM_READER_OWNED | RWSEM_ANONYMOUSLY_OWNED);
@@ -530,7 +553,7 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 
 	preempt_disable();
 	rcu_read_lock();
-	owner = READ_ONCE(sem->owner);
+	owner = rwsem_get_owner(sem);
 	if (owner) {
 		ret = is_rwsem_owner_spinnable(owner) &&
 		      (is_rwsem_owner_reader(owner) || owner_on_cpu(owner));
@@ -562,15 +585,21 @@ enum owner_state {
 
 static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 {
-	struct task_struct *owner = READ_ONCE(sem->owner);
+	struct task_struct *owner = rwsem_get_owner(sem);
 	long count;
 
 	if (!is_rwsem_owner_spinnable(owner))
 		return OWNER_NONSPINNABLE;
 
 	rcu_read_lock();
-	while (owner && !is_rwsem_owner_reader(owner)
-	       && (READ_ONCE(sem->owner) == owner)) {
+	while (owner && !is_rwsem_owner_reader(owner)) {
+		struct task_struct *new_owner = rwsem_get_owner(sem);
+
+		if (new_owner != owner) {
+			owner = new_owner;
+			break;	/* The owner has changed */
+		}
+
 		/*
 		 * Ensure we emit the owner->on_cpu, dereference _after_
 		 * checking sem->owner still matches owner, if that fails,
@@ -597,7 +626,6 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 	 * spinning except when there are no active locks and the handoff
 	 * bit is set. In this case, we have to stop spinning.
 	 */
-	owner = READ_ONCE(sem->owner);
 	if (!is_rwsem_owner_spinnable(owner))
 		return OWNER_NONSPINNABLE;
 	if (owner && !is_rwsem_owner_reader(owner))
 		return OWNER_WRITER;
@@ -1093,8 +1121,7 @@ inline void __down_read(struct rw_semaphore *sem)
 	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
 			&sem->count) & RWSEM_READ_FAILED_MASK)) {
 		rwsem_down_read_failed(sem);
-		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
-					RWSEM_READER_OWNED), sem);
+		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	} else {
 		rwsem_set_reader_owned(sem);
 	}
@@ -1106,8 +1133,7 @@ static inline int __down_read_killable(struct rw_semaphore *sem)
 			&sem->count) & RWSEM_READ_FAILED_MASK)) {
 		if (IS_ERR(rwsem_down_read_failed_killable(sem)))
 			return -EINTR;
-		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
-					RWSEM_READER_OWNED), sem);
+		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	} else {
 		rwsem_set_reader_owned(sem);
 	}
@@ -1174,8 +1200,7 @@ inline void __up_read(struct rw_semaphore *sem)
 {
 	long tmp;
 
 	DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
[PATCH v4 16/16] locking/rwsem: Remove redundant computation of writer lock word
On 64-bit architectures, each rwsem writer will have its own unique lock word for acquiring the lock. Right now, the writer code recomputes the lock word every time it tries to acquire the lock. This is a waste of time. The lock word is now cached and reused when it is needed.

On 32-bit architectures, the extra constant argument to rwsem_try_write_lock() and rwsem_try_write_lock_unqueued() should be optimized out by the compiler.

Signed-off-by: Waiman Long
---
 kernel/locking/rwsem.c | 22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 27219abb8bb6..2c8187690c7c 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -561,6 +561,7 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
  * bit is set or the lock is acquired.
  */
 static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem,
+					const long wlock,
 					enum writer_wait_state wstate)
 {
 	long new;
@@ -581,7 +582,7 @@ static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem,
 	if ((wstate == WRITER_NOT_FIRST) && RWSEM_COUNT_HANDOFF(count))
 		return false;
 
-	new = (count & ~RWSEM_FLAG_HANDOFF) + RWSEM_WRITER_LOCKED -
+	new = (count & ~RWSEM_FLAG_HANDOFF) + wlock -
 	      (list_is_singular(&sem->wait_list) ? RWSEM_FLAG_WAITERS : 0);
 
 	if (atomic_long_try_cmpxchg_acquire(&sem->count, &count, new)) {
@@ -623,13 +624,14 @@ static inline bool rwsem_try_read_lock_unqueued(struct rw_semaphore *sem)
 
 /*
  * Try to acquire write lock before the writer has been put on wait queue.
*/ -static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem) +static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem, +const long wlock) { long count = atomic_long_read(&sem->count); while (!RWSEM_COUNT_LOCKED_OR_HANDOFF(count)) { if (atomic_long_try_cmpxchg_acquire(&sem->count, &count, - count + RWSEM_WRITER_LOCKED)) { + count + wlock)) { rwsem_set_owner(sem); lockevent_inc(rwsem_opt_wlock); return true; @@ -779,7 +781,7 @@ static inline u64 rwsem_rspin_threshold(struct rw_semaphore *sem) : 25 * NSEC_PER_USEC); } -static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock) +static bool rwsem_optimistic_spin(struct rw_semaphore *sem, const long wlock) { bool taken = false; bool is_rt_task = rt_task(current); @@ -808,7 +810,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock) /* * Try to acquire the lock */ - taken = wlock ? rwsem_try_write_lock_unqueued(sem) + taken = wlock ? rwsem_try_write_lock_unqueued(sem, wlock) : rwsem_try_read_lock_unqueued(sem); if (taken) @@ -887,7 +889,8 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem) return false; } -static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock) +static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem, +const long wlock) { return false; } @@ -944,7 +947,7 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state, long count) */ atomic_long_add(-RWSEM_READER_BIAS, &sem->count); adjustment = 0; - if (rwsem_optimistic_spin(sem, false)) { + if (rwsem_optimistic_spin(sem, 0)) { unsigned long flags; /* @@ -1058,10 +1061,11 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state) struct rwsem_waiter waiter; struct rw_semaphore *ret = sem; DEFINE_WAKE_Q(wake_q); + const long wlock = RWSEM_WRITER_LOCKED; /* do optimistic spinning and steal lock if possible */ if (rwsem_can_spin_on_owner(sem) && - rwsem_optimistic_spin(sem, true)) + rwsem_optimistic_spin(sem, wlock)) return sem; /* @@ 
-1120,7 +1124,7 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state) /* wait until we successfully acquire the lock */ set_current_state(state); while (true) { - if (rwsem_try_write_lock(count, sem, wstate)) + if (rwsem_try_write_lock(count, sem, wlock, wstate)) break; raw_spin_unlock_irq(&sem->wait_lock); -- 2.18.1
[PATCH v4 12/16] locking/rwsem: Enable time-based spinning on reader-owned rwsem
When the rwsem is owned by readers, writers stop optimistic spinning simply because there is no easy way to figure out whether all the readers are actively running or not. However, there are scenarios where the readers are unlikely to sleep and optimistic spinning can help performance. This patch provides a simple mechanism for a writer to spin on a reader-owned rwsem. It is a time-threshold-based spin where the allowable spinning time can vary from 10us to 25us depending on the condition of the rwsem. When the time threshold is exceeded, a bit will be set in the owner field to indicate that no more optimistic spinning will be allowed on this rwsem until it becomes writer owned again. For fairness, not even readers are allowed to acquire the reader-locked rwsem by optimistic spinning. The time taken for each iteration of the reader-owned rwsem spinning loop varies. Below are sample minimum elapsed times for 16 iterations of the loop.

  System                       Time for 16 Iterations
  ------                       ----------------------
  1-socket Skylake                    ~800ns
  4-socket Broadwell                  ~300ns
  2-socket ThunderX2 (arm64)          ~250ns

When the lock cacheline is contended, we can see up to almost a 10X increase in elapsed time. So 25us corresponds to at most 500, 1300 and 1600 iterations respectively for each of the above systems. With a locking microbenchmark running on a 5.1 based kernel, the total locking rates (in kops/s) on an 8-socket IvyBridge-EX system with equal numbers of readers and writers before and after this patch were as follows:

  # of Threads   Pre-patch   Post-patch
  ------------   ---------   ----------
       2           1,759       6,684
       4           1,684       6,738
       8           1,074       7,222
      16             900       7,163
      32             458       7,316
      64             208         520
     128             168         425
     240             143         474

This patch gives a big boost in performance for mixed reader/writer workloads. 
With 32 locking threads, the rwsem lock event data were: rwsem_opt_fail=79850 rwsem_opt_nospin=5069 rwsem_opt_rlock=597484 rwsem_opt_wlock=957339 rwsem_sleep_reader=57782 rwsem_sleep_writer=55663 With 64 locking threads, the data looked like: rwsem_opt_fail=346723 rwsem_opt_nospin=6293 rwsem_opt_rlock=1127119 rwsem_opt_wlock=1400628 rwsem_sleep_reader=308201 rwsem_sleep_writer=72281 So a lot more threads acquired the lock in the slowpath and more threads went to sleep. Signed-off-by: Waiman Long --- kernel/locking/lock_events_list.h | 1 + kernel/locking/rwsem.c| 121 ++ 2 files changed, 107 insertions(+), 15 deletions(-) diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h index 333ed5fda333..f3550aa5866a 100644 --- a/kernel/locking/lock_events_list.h +++ b/kernel/locking/lock_events_list.h @@ -59,6 +59,7 @@ LOCK_EVENT(rwsem_wake_writer) /* # of writer wakeups */ LOCK_EVENT(rwsem_opt_rlock)/* # of read locks opt-spin acquired*/ LOCK_EVENT(rwsem_opt_wlock)/* # of write locks opt-spin acquired */ LOCK_EVENT(rwsem_opt_fail) /* # of failed opt-spinnings*/ +LOCK_EVENT(rwsem_opt_nospin) /* # of disabled reader opt-spinnings */ LOCK_EVENT(rwsem_rlock)/* # of read locks acquired */ LOCK_EVENT(rwsem_rlock_fast) /* # of fast read locks acquired*/ LOCK_EVENT(rwsem_rlock_fail) /* # of failed read lock acquisitions */ diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c index 3cf8355252d1..8b23009e6b2c 100644 --- a/kernel/locking/rwsem.c +++ b/kernel/locking/rwsem.c @@ -22,6 +22,7 @@ #include #include #include +#include #include #include #include @@ -35,18 +36,20 @@ * - RWSEM_READER_OWNED (bit 0): The rwsem is owned by readers * - RWSEM_ANONYMOUSLY_OWNED (bit 1): The rwsem is anonymously owned, *i.e. the owner(s) cannot be readily determined. It can be reader - *owned or the owning writer is indeterminate. + *owned or the owning writer is indeterminate. Optimistic spinning + *should be disabled if this flag is set. 
* * When a writer acquires a rwsem, it puts its task_struct pointer - * into the owner field. It is cleared after an unlock. + * into the owner field or the count itself (64-bit only). It should + * be cleared after an unlock. * * When a reader acquires a rwsem, it will also puts its task_struct - * pointer into the owner field with both the RWSEM_READER_OWNED and - * RWSEM_ANONYMOUSLY_OWNED bits set. On unlock, the owner field will - * largely be left untouched. So for a free or reader-owned rwsem, - * the owner value may contain information about the last reader that - * acquires the rwsem. The anonymous bit is set because that particular - * reader may or may not still own the lock. + * pointer into the owner field with the RWSEM_READER_OWNED bit set. + * On
[PATCH v4 00/16] locking/rwsem: Rwsem rearchitecture part 2
v4:
 - Fix the missing initialization bug with !CONFIG_RWSEM_SPIN_ON_OWNER in patch 2.
 - Move the "Remove rwsem_wake() wakeup optimization" patch before the "Implement a new locking scheme" patch.
 - Add two new patches to merge the relevant content of rwsem.h and rwsem-xadd.c into rwsem.c as suggested by PeterZ.
 - Refactor the lock handoff patch to make all setting and clearing of the handoff bit serialized by wait_lock to ensure correctness.
 - Adapt the rest of the patches to the new code base.

v3:
 - Add 2 more patches in front to fix build and testing issues found. Patch 1 can actually be merged on top of the patch "locking/rwsem: Enhance DEBUG_RWSEMS_WARN_ON() macro" in part 1.
 - Change the handoff patch (now patch 4) to set the handoff bit immediately after wakeup for RT writers. The timeout limit is also tightened to 4ms.
 - There are no code changes in the other patches other than resolving conflicts with patches 1, 2 and 4.

v2:
 - Move the negative reader count checking patch (patch 12->10) forward to before the merge-owner-to-count patch as suggested by Linus, and expand the comment.
 - Change the reader-owned rwsem spinning from count based to time based to have better control of the max time allowed.

This is part 2 of a 3-part (0/1/2) series to rearchitect the internal operation of rwsem.

part 0: merged into tip
part 1: https://lore.kernel.org/lkml/20190404174320.22416-1-long...@redhat.com/

This patchset revamps the current rwsem-xadd implementation to make it saner and easier to work with. It also implements the following 3 new features:

 1) Waiter lock handoff
 2) Reader optimistic spinning
 3) Store write-lock owner in the atomic count (x86-64 only)

Waiter lock handoff is similar to the mechanism currently in the mutex code. This ensures that lock starvation won't happen. Reader optimistic spinning enables readers to acquire the lock more quickly. 
So workloads that use a mix of readers and writers should see an increase in performance as long as the reader critical sections are short. Finally, storing the write-lock owner in the count allows optimistic spinners to get to the lock holder's task structure more quickly, and it eliminates the timing gap where the write lock is acquired but the owner isn't known yet. This is important for RT tasks, where spinning on a lock with an unknown owner is not allowed. Because multiple readers can share the same lock, there is a natural preference for readers when measuring in terms of locking throughput, as more readers are likely to get into the locking fast path than writers. With waiter lock handoff, we are not going to starve the writers. On an 8-socket 120-core 240-thread IvyBridge-EX system with 120 reader and writer locking threads, the min/mean/max locking operations done in a 5-second testing window before the patchset were:

 120 readers, Iterations Min/Mean/Max = 399/400/401
 120 writers, Iterations Min/Mean/Max = 400/33,389/211,359

After the patchset, they became:

 120 readers, Iterations Min/Mean/Max = 584/10,266/26,609
 120 writers, Iterations Min/Mean/Max = 22,080/29,016/38,728

So it was much fairer to readers. With fewer locking threads, readers were preferred over writers.

Patch 1 fixes a testing issue with the locking selftest introduced by the patch "locking/rwsem: Enhance DEBUG_RWSEMS_WARN_ON() macro" in part 1.

Patch 2 makes owner a permanent member of the rw_semaphore structure and sets it irrespective of CONFIG_RWSEM_SPIN_ON_OWNER.

Patch 3 removes the rwsem_wake() wakeup optimization as it doesn't work with lock handoff.

Patch 4 implements a new rwsem locking scheme similar to what qrwlock is currently doing. Write lock is done by atomic_cmpxchg() while read lock is still done by atomic_add().

Patch 5 merges the content of rwsem.h and rwsem-xadd.c into rwsem.c, just like the mutex. 
The rwsem-xadd.c file is removed and a bare-bones rwsem.h is left for the internal function declarations needed by percpu-rwsem.c.

Patch 6 optimizes the merged rwsem.c file to generate a smaller object file.

Patch 7 implements lock handoff to prevent lock starvation. It is expected that throughput will be lower on workloads with highly contended rwsems, as a trade-off for better fairness.

Patch 8 makes rwsem_spin_on_owner() return the owner state.

Patch 9 disallows RT tasks from spinning on a rwsem with an unknown owner.

Patch 10 makes reader wakeup wake up almost all the readers in the wait queue instead of just those at the front.

Patch 11 enables a reader to spin on a writer-owned rwsem.

Patch 12 enables a writer to spin on a reader-owned rwsem for at most 25us.

Patch 13 adds some new rwsem owner access helper functions.

Patch 14 handles the case of too many readers by reserving the sign bit to designate that a reader lock attempt will fail and the locking reader will be put to sleep. This ensures that we will not overflow the reader count.

Patch 15 merges the write-lock owner task
[PATCH v4 03/16] locking/rwsem: Remove rwsem_wake() wakeup optimization
With commit 59aabfc7e959 ("locking/rwsem: Reduce spinlock contention in wakeup after up_read()/up_write()"), rwsem_wake() forgoes doing a wakeup if the wait_lock cannot be directly acquired and an optimistic spinning locker is present. This can help performance by avoiding spinning on the wait_lock when it is contended. With the later commit 133e89ef5ef3 ("locking/rwsem: Enable lockless waiter wakeup(s)"), the performance advantage of the above optimization diminishes as the average wait_lock hold time becomes much shorter. With a later patch that supports rwsem lock handoff, we can no longer rely on the fact that the presence of an optimistic spinning locker will ensure that the lock will be acquired by a task soon and that rwsem_wake() will be called later on to wake up the waiters. This can lead to a missed wakeup and an application hang. So commit 59aabfc7e959 ("locking/rwsem: Reduce spinlock contention in wakeup after up_read()/up_write()") will have to be reverted. Signed-off-by: Waiman Long --- kernel/locking/rwsem-xadd.c | 72 - 1 file changed, 72 deletions(-) diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c index 7fd4f1de794a..98de7f0cfedd 100644 --- a/kernel/locking/rwsem-xadd.c +++ b/kernel/locking/rwsem-xadd.c @@ -395,25 +395,11 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem) lockevent_cond_inc(rwsem_opt_fail, !taken); return taken; } - -/* - * Return true if the rwsem has active spinner - */ -static inline bool rwsem_has_spinner(struct rw_semaphore *sem) -{ - return osq_is_locked(&sem->osq); -} - #else static bool rwsem_optimistic_spin(struct rw_semaphore *sem) { return false; } - -static inline bool rwsem_has_spinner(struct rw_semaphore *sem) -{ - return false; -} #endif /* @@ -635,65 +621,7 @@ struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem) unsigned long flags; DEFINE_WAKE_Q(wake_q); - /* - * __rwsem_down_write_failed_common(sem) - * rwsem_optimistic_spin(sem) - * osq_unlock(sem->osq) - * ... 
- * atomic_long_add_return(&sem->count) - * - * - VS - - * - * __up_write() - *if (atomic_long_sub_return_release(&sem->count) < 0) - * rwsem_wake(sem) - *osq_is_locked(&sem->osq) - * - * And __up_write() must observe !osq_is_locked() when it observes the - * atomic_long_add_return() in order to not miss a wakeup. - * - * This boils down to: - * - * [S.rel] X = 1[RmW] r0 = (Y += 0) - * MB RMB - * [RmW] Y += 1 [L] r1 = X - * - * exists (r0=1 /\ r1=0) - */ - smp_rmb(); - - /* -* If a spinner is present, it is not necessary to do the wakeup. -* Try to do wakeup only if the trylock succeeds to minimize -* spinlock contention which may introduce too much delay in the -* unlock operation. -* -*spinning writer up_write/up_read caller -*--- --- -* [S] osq_unlock() [L] osq -* MB RMB -* [RmW] rwsem_try_write_lock() [RmW] spin_trylock(wait_lock) -* -* Here, it is important to make sure that there won't be a missed -* wakeup while the rwsem is free and the only spinning writer goes -* to sleep without taking the rwsem. Even when the spinning writer -* is just going to break out of the waiting loop, it will still do -* a trylock in rwsem_down_write_failed() before sleeping. IOW, if -* rwsem_has_spinner() is true, it will guarantee at least one -* trylock attempt on the rwsem later on. -*/ - if (rwsem_has_spinner(sem)) { - /* -* The smp_rmb() here is to make sure that the spinner -* state is consulted before reading the wait_lock. -*/ - smp_rmb(); - if (!raw_spin_trylock_irqsave(&sem->wait_lock, flags)) - return sem; - goto locked; - } raw_spin_lock_irqsave(&sem->wait_lock, flags); -locked: if (!list_empty(&sem->wait_list)) __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q); -- 2.18.1
[PATCH v4 01/16] locking/rwsem: Prevent unneeded warning during locking selftest
Disable the DEBUG_RWSEMS check when locking selftest is running with debug_locks_silent flag set. Signed-off-by: Waiman Long --- kernel/locking/rwsem.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h index 37db17890e36..64877f5294e3 100644 --- a/kernel/locking/rwsem.h +++ b/kernel/locking/rwsem.h @@ -30,7 +30,8 @@ #ifdef CONFIG_DEBUG_RWSEMS # define DEBUG_RWSEMS_WARN_ON(c, sem) do {\ - if (WARN_ONCE(c, "DEBUG_RWSEMS_WARN_ON(%s): count = 0x%lx, owner = 0x%lx, curr 0x%lx, list %sempty\n",\ + if (!debug_locks_silent && \ + WARN_ONCE(c, "DEBUG_RWSEMS_WARN_ON(%s): count = 0x%lx, owner = 0x%lx, curr 0x%lx, list %sempty\n",\ #c, atomic_long_read(&(sem)->count),\ (long)((sem)->owner), (long)current,\ list_empty(&(sem)->wait_list) ? "" : "not ")) \ -- 2.18.1
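The effect of the patch can be modelled in userspace: a warn-once style check that stays quiet while a "selftest running" flag is set, so deliberately provoked failures don't spam the log. All names below are illustrative stand-ins for debug_locks_silent and DEBUG_RWSEMS_WARN_ON(), not the kernel definitions.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-in for debug_locks_silent: set while the locking selftest
 * intentionally triggers failures. */
static int debug_locks_silent;
/* Counts how many warnings were actually emitted. */
static int warn_count;

/* Warn at most once per call site, and not at all while silent --
 * mirroring the "!debug_locks_silent &&" guard added by the patch. */
#define WARN_ON_ONCE_UNLESS_SILENT(cond) do {			\
	static int warned;					\
	if (!debug_locks_silent && (cond) && !warned) {		\
		warned = 1;					\
		warn_count++;					\
		fprintf(stderr, "warning: %s\n", #cond);	\
	}							\
} while (0)
```

The per-site `static int warned` mimics WARN_ONCE(): the condition may keep firing, but only the first hit produces output.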
[PATCH v4 02/16] locking/rwsem: Make owner available even if !CONFIG_RWSEM_SPIN_ON_OWNER
The owner field in the rw_semaphore structure is used primarily for optimistic spinning. However, identifying the rwsem owner can also be helpful in debugging as well as tracing locking related issues when analyzing crash dump. The owner field may also store state information that can be important to the operation of the rwsem. So the owner field is now made a permanent member of the rw_semaphore structure irrespective of CONFIG_RWSEM_SPIN_ON_OWNER. Signed-off-by: Waiman Long --- include/linux/rwsem.h | 9 + kernel/locking/rwsem-xadd.c | 2 +- kernel/locking/rwsem.h | 23 --- lib/Kconfig.debug | 8 4 files changed, 10 insertions(+), 32 deletions(-) diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h index 2ea18a3def04..148983e21d47 100644 --- a/include/linux/rwsem.h +++ b/include/linux/rwsem.h @@ -34,12 +34,12 @@ */ struct rw_semaphore { atomic_long_t count; -#ifdef CONFIG_RWSEM_SPIN_ON_OWNER /* -* Write owner. Used as a speculative check to see -* if the owner is running on the cpu. +* Write owner or one of the read owners. Can be used as a +* speculative check to see if the owner is running on the cpu. 
*/ struct task_struct *owner; +#ifdef CONFIG_RWSEM_SPIN_ON_OWNER struct optimistic_spin_queue osq; /* spinner MCS lock */ #endif raw_spinlock_t wait_lock; @@ -73,13 +73,14 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem) #endif #ifdef CONFIG_RWSEM_SPIN_ON_OWNER -#define __RWSEM_OPT_INIT(lockname) , .osq = OSQ_LOCK_UNLOCKED, .owner = NULL +#define __RWSEM_OPT_INIT(lockname) , .osq = OSQ_LOCK_UNLOCKED #else #define __RWSEM_OPT_INIT(lockname) #endif #define __RWSEM_INITIALIZER(name) \ { __RWSEM_INIT_COUNT(name), \ + .owner = NULL,\ .wait_list = LIST_HEAD_INIT((name).wait_list),\ .wait_lock = __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock) \ __RWSEM_OPT_INIT(name)\ diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c index 6b3ee9948bf1..7fd4f1de794a 100644 --- a/kernel/locking/rwsem-xadd.c +++ b/kernel/locking/rwsem-xadd.c @@ -86,8 +86,8 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name, atomic_long_set(&sem->count, RWSEM_UNLOCKED_VALUE); raw_spin_lock_init(&sem->wait_lock); INIT_LIST_HEAD(&sem->wait_list); -#ifdef CONFIG_RWSEM_SPIN_ON_OWNER sem->owner = NULL; +#ifdef CONFIG_RWSEM_SPIN_ON_OWNER osq_lock_init(&sem->osq); #endif } diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h index 64877f5294e3..eb9c8534299b 100644 --- a/kernel/locking/rwsem.h +++ b/kernel/locking/rwsem.h @@ -61,7 +61,6 @@ #define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS #define RWSEM_ACTIVE_WRITE_BIAS(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS) -#ifdef CONFIG_RWSEM_SPIN_ON_OWNER /* * All writes to owner are protected by WRITE_ONCE() to make sure that * store tearing can't happen as optimistic spinners may read and use @@ -126,7 +125,6 @@ static inline bool rwsem_has_anonymous_owner(struct task_struct *owner) * real owner or one of the real owners. The only exception is when the * unlock is done by up_read_non_owner(). 
*/ -#define rwsem_clear_reader_owned rwsem_clear_reader_owned static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem) { unsigned long val = (unsigned long)current | RWSEM_READER_OWNED @@ -135,28 +133,7 @@ static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem) cmpxchg_relaxed((unsigned long *)&sem->owner, val, RWSEM_READER_OWNED | RWSEM_ANONYMOUSLY_OWNED); } -#endif - #else -static inline void rwsem_set_owner(struct rw_semaphore *sem) -{ -} - -static inline void rwsem_clear_owner(struct rw_semaphore *sem) -{ -} - -static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem, - struct task_struct *owner) -{ -} - -static inline void rwsem_set_reader_owned(struct rw_semaphore *sem) -{ -} -#endif - -#ifndef rwsem_clear_reader_owned static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem) { } diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 0d9e81779e37..2047f3884540 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1067,7 +1067,7 @@ config PROVE_LOCKING select DEBUG_SPINLOCK select DEBUG_MUTEXES select DEBUG_RT_MUTEXES if RT_MUTEXES - select DEBUG_RWSEMS if RWSEM_SPIN_ON_OWNER + select DEBUG_RWSEMS select DEBUG_WW_MUTEX_SLOWPATH select DEBUG_LOCK_ALLOC select TRACE_IRQFLAGS @@ -1171,10 +1171,10 @@ config DEBUG_WW_MUTEX_SLOWPATH config DEBUG_RWSEMS bool "RW Semaphore debugging: basic checks" - depends on DEBUG_KERNEL && RWSEM_SPIN_ON_OWNER +
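The owner-field convention this code manipulates — a task pointer with low flag bits OR'ed in, which works because task_struct pointers are always at least word-aligned — can be sketched as follows. The struct and helper names are invented for illustration; only the flag roles mirror the kernel's RWSEM_READER_OWNED and RWSEM_ANONYMOUSLY_OWNED bits.

```c
#include <assert.h>
#include <stdint.h>

/* Low-bit flags packed into the owner pointer (illustrative values
 * mirroring the roles of the kernel's flags). */
#define READER_OWNED		(1UL << 0)
#define ANONYMOUSLY_OWNED	(1UL << 1)
#define OWNER_FLAG_MASK		(READER_OWNED | ANONYMOUSLY_OWNED)

/* Stand-in for task_struct; its alignment guarantees the low two
 * bits of any pointer to it are zero, leaving room for the flags. */
struct task { int dummy; };

/* Tag a task pointer as "last reader to acquire the lock". */
static uintptr_t make_reader_owner(struct task *t)
{
	return (uintptr_t)t | READER_OWNED;
}

/* Strip the flag bits to recover the task pointer. */
static struct task *owner_task(uintptr_t owner)
{
	return (struct task *)(owner & ~OWNER_FLAG_MASK);
}

static int owner_is_reader(uintptr_t owner)
{
	return !!(owner & READER_OWNED);
}
```

This is why rwsem_clear_reader_owned() above can compare and swap the whole owner word: the pointer and the flags travel together in one value.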
[PATCH v4 07/16] locking/rwsem: Implement lock handoff to prevent lock starvation
Because of writer lock stealing, it is possible that a constant stream of incoming writers will cause a waiting writer or reader to wait indefinitely, leading to lock starvation. This patch implements a lock handoff mechanism to disable lock stealing and force lock handoff to the first waiter or waiters (for readers) in the queue after at least a 4ms waiting period, unless it is an RT writer task, which doesn't need to wait. The waiting period is used so that lock stealing is not discouraged so much that performance suffers. The setting and clearing of the handoff bit is serialized by the wait_lock, so racing is not possible. A rwsem microbenchmark was run for 5 seconds on a 2-socket 40-core 80-thread Skylake system with a v5.1 based kernel and 240 write_lock threads with a 5us sleep critical section. Before the patch, the min/mean/max numbers of locking operations for the locking threads were 1/7,792/173,696. After the patch, the figures became 5,842/6,542/7,458. It can be seen that the rwsem became much fairer, though there was a drop of about 16% in the mean locking operations done, which was a trade-off for having better fairness. Making the waiter set the handoff bit right after the first wakeup can impact performance, especially with a mixed reader/writer workload. 
With the same microbenchmark with short critical section and equal number of reader and writer threads (40/40), the reader/writer locking operation counts with the current patch were: 40 readers, Iterations Min/Mean/Max = 1,793/1,794/1,796 40 writers, Iterations Min/Mean/Max = 1,793/34,956/86,081 By making waiter set handoff bit immediately after wakeup: 40 readers, Iterations Min/Mean/Max = 43/44/46 40 writers, Iterations Min/Mean/Max = 43/1,263/3,191 Signed-off-by: Waiman Long --- kernel/locking/lock_events_list.h | 2 + kernel/locking/rwsem.c| 205 +++--- 2 files changed, 164 insertions(+), 43 deletions(-) diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h index ad7668cfc9da..29e5c52197fa 100644 --- a/kernel/locking/lock_events_list.h +++ b/kernel/locking/lock_events_list.h @@ -61,7 +61,9 @@ LOCK_EVENT(rwsem_opt_fail)/* # of failed opt-spinnings */ LOCK_EVENT(rwsem_rlock)/* # of read locks acquired */ LOCK_EVENT(rwsem_rlock_fast) /* # of fast read locks acquired*/ LOCK_EVENT(rwsem_rlock_fail) /* # of failed read lock acquisitions */ +LOCK_EVENT(rwsem_rlock_handoff)/* # of read lock handoffs */ LOCK_EVENT(rwsem_rtrylock) /* # of read trylock calls */ LOCK_EVENT(rwsem_wlock)/* # of write locks acquired */ LOCK_EVENT(rwsem_wlock_fail) /* # of failed write lock acquisitions */ +LOCK_EVENT(rwsem_wlock_handoff)/* # of write lock handoffs */ LOCK_EVENT(rwsem_wtrylock) /* # of write trylock calls */ diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c index c1a089ab19fd..aaab546a890d 100644 --- a/kernel/locking/rwsem.c +++ b/kernel/locking/rwsem.c @@ -74,22 +74,38 @@ * * Bit 0 - writer locked bit * Bit 1 - waiters present bit - * Bits 2-7 - reserved + * Bit 2 - lock handoff bit + * Bits 3-7 - reserved * Bits 8-X - 24-bit (32-bit) or 56-bit reader count * * atomic_long_fetch_add() is used to obtain reader lock, whereas * atomic_long_cmpxchg() will be used to obtain writer lock. 
+ * + * There are three places where the lock handoff bit may be set or cleared. + * 1) __rwsem_mark_wake() for readers. + * 2) rwsem_try_write_lock() for writers. + * 3) Error path of __rwsem_down_write_failed_common(). + * + * For all the above cases, wait_lock will be held. A writer must also + * be the first one in the wait_list to be eligible for setting the handoff + * bit. So concurrent setting/clearing of handoff bit is not possible. */ #define RWSEM_WRITER_LOCKED(1UL << 0) #define RWSEM_FLAG_WAITERS (1UL << 1) +#define RWSEM_FLAG_HANDOFF (1UL << 2) + #define RWSEM_READER_SHIFT 8 #define RWSEM_READER_BIAS (1UL << RWSEM_READER_SHIFT) #define RWSEM_READER_MASK (~(RWSEM_READER_BIAS - 1)) #define RWSEM_WRITER_MASK RWSEM_WRITER_LOCKED #define RWSEM_LOCK_MASK(RWSEM_WRITER_MASK|RWSEM_READER_MASK) -#define RWSEM_READ_FAILED_MASK (RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS) +#define RWSEM_READ_FAILED_MASK (RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS|\ +RWSEM_FLAG_HANDOFF) #define RWSEM_COUNT_LOCKED(c) ((c) & RWSEM_LOCK_MASK) +#define RWSEM_COUNT_HANDOFF(c) ((c) & RWSEM_FLAG_HANDOFF) +#define RWSEM_COUNT_LOCKED_OR_HANDOFF(c) \ + ((c) & (RWSEM_LOCK_MASK|RWSEM_FLAG_HANDOFF)) /* * All writes to owner are protected by WRITE_ONCE() to make sure that @@ -218,6 +234,7 @@ struct rwsem_waiter { struct list_head list; struct task_struct *task; enum rwsem_waiter_type type; + unsigned
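The count-word layout documented in the hunk above can be exercised in a standalone userspace model. The macro values are copied from the patch; the two helper functions are illustrative, not the kernel's.

```c
#include <assert.h>

/* Bit layout from the patch: bit 0 = writer locked, bit 1 = waiters
 * present, bit 2 = lock handoff, readers counted from bit 8 up. */
#define RWSEM_WRITER_LOCKED	(1UL << 0)
#define RWSEM_FLAG_WAITERS	(1UL << 1)
#define RWSEM_FLAG_HANDOFF	(1UL << 2)
#define RWSEM_READER_SHIFT	8
#define RWSEM_READER_BIAS	(1UL << RWSEM_READER_SHIFT)
#define RWSEM_READER_MASK	(~(RWSEM_READER_BIAS - 1))
#define RWSEM_WRITER_MASK	RWSEM_WRITER_LOCKED
#define RWSEM_LOCK_MASK		(RWSEM_WRITER_MASK | RWSEM_READER_MASK)
#define RWSEM_READ_FAILED_MASK	(RWSEM_WRITER_MASK | RWSEM_FLAG_WAITERS | \
				 RWSEM_FLAG_HANDOFF)

/* Extract the number of readers encoded in a count value. */
static unsigned long reader_count(unsigned long c)
{
	return (c & RWSEM_READER_MASK) >> RWSEM_READER_SHIFT;
}

/* Illustrative check: a reader may keep the lock on the fast path
 * only when no writer holds it and neither the waiters bit nor the
 * handoff bit is set -- so the handoff bit alone pushes newcomers
 * into the slowpath, which is what prevents stealing. */
static int read_fast_path_ok(unsigned long c)
{
	return !(c & RWSEM_READ_FAILED_MASK);
}
```

Note how adding RWSEM_FLAG_HANDOFF to RWSEM_READ_FAILED_MASK is the whole trick: a single bit flips every fast-path attempt into a queued slowpath acquisition.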
[PATCH 1/3] ARM: dts: meson8: add the canvas module
Add the canvas module to Meson8 because it's required for the VPU (video output) and video decoders. The canvas module is located inside the "DMC bus" (where some of the memory controller registers are also located). The "DMC bus" itself is part of the so-called "MMC bus". Amlogic's vendor kernel has an explicit #define for the "DMC" register range on Meson8m2, while there's no such #define for Meson8. However, the canvas and memory controller registers on Meson8 are all expressed as "0x6000 + actual offset", while Meson8m2 uses "DMC + actual offset". Thus it's safe to assume that the DMC bus exists on both SoCs even though the registers inside are slightly different. Signed-off-by: Martin Blumenstingl --- arch/arm/boot/dts/meson8.dtsi | 21 + 1 file changed, 21 insertions(+) diff --git a/arch/arm/boot/dts/meson8.dtsi b/arch/arm/boot/dts/meson8.dtsi index 7ef442462ea4..6b5c90bb960b 100644 --- a/arch/arm/boot/dts/meson8.dtsi +++ b/arch/arm/boot/dts/meson8.dtsi @@ -228,6 +228,27 @@ }; }; + mmcbus: bus@c800 { + compatible = "simple-bus"; + reg = <0xc800 0x8000>; + #address-cells = <1>; + #size-cells = <1>; + ranges = <0x0 0xc800 0x8000>; + + dmcbus: bus@6000 { + compatible = "simple-bus"; + reg = <0x6000 0x400>; + #address-cells = <1>; + #size-cells = <1>; + ranges = <0x0 0x6000 0x400>; + + canvas: video-lut@20 { + compatible = "amlogic,canvas"; + reg = <0x20 0x14>; + }; + }; + }; + apb: bus@d000 { compatible = "simple-bus"; reg = <0xd000 0x20>; -- 2.21.0
[PATCH 2/3] ARM: dts: meson8m2: update the offset of the canvas module
With the Meson8m2 SoC the canvas module was moved from offset 0x20 (Meson8) to offset 0x48 (same as on Meson8b). The offsets inside the canvas module are identical. Correct the offset so the driver uses the correct registers. Signed-off-by: Martin Blumenstingl --- arch/arm/boot/dts/meson8m2.dtsi | 10 ++ 1 file changed, 10 insertions(+) diff --git a/arch/arm/boot/dts/meson8m2.dtsi b/arch/arm/boot/dts/meson8m2.dtsi index bb87b251e16d..e88c36d47e5c 100644 --- a/arch/arm/boot/dts/meson8m2.dtsi +++ b/arch/arm/boot/dts/meson8m2.dtsi @@ -14,6 +14,16 @@ compatible = "amlogic,meson8m2-clkc", "amlogic,meson8-clkc"; }; + { + /* the offset of the canvas registers has changed compared to Meson8 */ + /delete-node/ video-lut@20; + + canvas: video-lut@48 { + compatible = "amlogic,canvas"; + reg = <0x48 0x14>; + }; +}; + { compatible = "amlogic,meson8m2-dwmac", "snps,dwmac"; reg = <0xc941 0x1 -- 2.21.0
[PATCH 3/3] ARM: dts: meson8b: add the canvas module
Add the canvas module to Meson8b because it's required for the VPU (video output) and video decoders. The canvas module is located inside the "DMC bus" (where also some of the memory controller registers are located). The "DMC bus" itself is part of the so-called "MMC bus". Signed-off-by: Martin Blumenstingl --- arch/arm/boot/dts/meson8b.dtsi | 21 + 1 file changed, 21 insertions(+) diff --git a/arch/arm/boot/dts/meson8b.dtsi b/arch/arm/boot/dts/meson8b.dtsi index 800cd65fc50a..f4e832638e9c 100644 --- a/arch/arm/boot/dts/meson8b.dtsi +++ b/arch/arm/boot/dts/meson8b.dtsi @@ -205,6 +205,27 @@ }; }; + mmcbus: bus@c800 { + compatible = "simple-bus"; + reg = <0xc800 0x8000>; + #address-cells = <1>; + #size-cells = <1>; + ranges = <0x0 0xc800 0x8000>; + + dmcbus: bus@6000 { + compatible = "simple-bus"; + reg = <0x6000 0x400>; + #address-cells = <1>; + #size-cells = <1>; + ranges = <0x0 0x6000 0x400>; + + canvas: video-lut@48 { + compatible = "amlogic,canvas"; + reg = <0x48 0x14>; + }; + }; + }; + apb: bus@d000 { compatible = "simple-bus"; reg = <0xd000 0x20>; -- 2.21.0
[PATCH 0/3] 32-bit Meson: add the canvas module
This adds the canvas module on Meson8, Meson8b and Meson8m2. The canvas IP is used by the video decoder hardware as well as the VPU (video output) hardware. Neither the VPU nor the video decoder driver supports the 32-bit SoCs yet. However, we can still add the canvas module to have it available once these drivers gain support for the older SoCs. I have tested this on my Meson8m2 board by hacking the VPU driver to not re-initialize the VPU (and to use the configuration set by u-boot). With that hack I could get some image out of the CVBS connector. No changes to the canvas driver were required. Due to lack of hardware I could not test Meson8, but I'm following (as always) what the Amlogic 3.10 vendor kernel uses. Meson8b is also not tested because u-boot of my EC-100 doesn't have video output enabled (so I couldn't use the same hack I used on my Meson8m2 board). This series is meant to be applied on top of "Meson8b: add support for the RTC on EC-100 and Odroid-C1" from [0] [0] https://patchwork.kernel.org/cover/10899509/ Martin Blumenstingl (3): ARM: dts: meson8: add the canvas module ARM: dts: meson8m2: update the offset of the canvas module ARM: dts: meson8b: add the canvas module arch/arm/boot/dts/meson8.dtsi | 21 + arch/arm/boot/dts/meson8b.dtsi | 21 + arch/arm/boot/dts/meson8m2.dtsi | 10 ++ 3 files changed, 52 insertions(+) -- 2.21.0
Re: [PATCH 1/3] fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock
On Fri, Apr 12, 2019 at 03:41:44PM +0300, Kirill Smelkov wrote: > On Thu, Apr 11, 2019 at 09:22:56AM -0700, Linus Torvalds wrote: > > On Thu, Apr 11, 2019 at 5:38 AM Kirill Smelkov wrote: > > > > > > However file->f_pos writing is still there and it will bug under race > > > detector, e.g. under KTSAN because read and write can be running > > > simultaneously. Maybe it is not only race bug, but also a bit of > > > slowdown as read and write code paths write to the same memory thus > > > needing inter-cpu synchronization if read and write handlers are on > > > different cpus. However for this I'm not sure. > > > > I doubt it's noticeable, but it would probably be good to simply not > > do the write for streams. > > > > That *could* be done somewhat similarly, with just changing the address > > to be updated, of course. > > > > In fact, we *used* to (long ago) pass in the address of "file->f_pos" > > itself to the low-level read/write routines. We then changed it to do > > that indirection through a local copy of pos (and > > file_pos_read/file_pos_write) because we didn't do the proper locking, > > so different read/write versions could mess with each other (and with > > lseek). > > > > But one of the things that commit 9c225f2655e36 ("vfs: atomic f_pos > > accesses as per POSIX") did was to add the proper locking at least for > > the cases that we care about deeply, so we *could* say that we have > > three cases: > > > > - FMODE_ATOMIC_POS: properly locked, > > - FMODE_STREAM: no pos at all > > - otherwise a "mostly don't care - don't mix!" > > > > and so we could go back to not copying the pos at all, and instead do > > something like > > > > loff_t *ppos = f.file->f_mode & FMODE_STREAM ? 
NULL : &f.file->f_pos; > > ret = vfs_write(f.file, buf, count, ppos); > > > > and perhaps have a long-term plan to try to get rid of the "don't mix" > > case entirely (ie "if you use f_pos, then we'll do the proper > > locking") > > > > (The above is obviously surrounded by the fdget_pos()/fdput_pos() that > > implements the locking decision). > > Ok Linus, thanks for the feedback. Please consider applying or queueing the > following patches. I am resending those patches one by one in separate emails, in case having both patches inline was problematic to handle...
Re: [PATCH 1/2] driver core: printk to pr_*
On Fri, 2019-04-12 at 16:47 +0100, Willy Wolff wrote: > Signed-off-by: Willy Wolff Most people seem to prefer some type of commit message before your sign-off. And trivia: > diff --git a/drivers/base/dd.c b/drivers/base/dd.c [] > @@ -602,8 +601,8 @@ static int really_probe_debug(struct device *dev, struct > device_driver *drv) > ret = really_probe(dev, drv); > rettime = ktime_get(); > delta = ktime_sub(rettime, calltime); > - printk(KERN_DEBUG "probe of %s returned %d after %lld usecs\n", > -dev_name(dev), ret, (s64) ktime_to_us(delta)); > + pr_debug("probe of %s returned %d after %lld usecs\n", > + dev_name(dev), ret, (s64) ktime_to_us(delta)); While I think this particular change is fine (and you may already know this), converting a printk(KERN_DEBUG ...) to pr_debug(...) changes both the object code and the output. The first is always emitted; the second is not emitted unless DEBUG is defined or CONFIG_DYNAMIC_DEBUG is enabled. Your commit message should show this understanding.
Re: [PATCH] pci/switchtec: fix stream_open.cocci warnings (fwd)
On 2019-04-13 10:50 a.m., Julia Lawall wrote: > Hello, > > Kirill will explain about this issue. I'm aware of this effort and Acked Kirill's original monolithic patch. So, for this patch: Acked-by: Logan Gunthorpe julia > > -- Forwarded message -- > Date: Sat, 13 Apr 2019 11:22:51 +0800 > From: kbuild test robot > To: kbu...@01.org > Cc: Julia Lawall > Subject: [PATCH] pci/switchtec: fix stream_open.cocci warnings > > CC: kbuild-...@01.org > TO: Sebastian Andrzej Siewior > CC: Kurt Schwemmer > CC: Logan Gunthorpe > CC: Bjorn Helgaas > CC: linux-...@vger.kernel.org > CC: linux-kernel@vger.kernel.org > > From: kbuild test robot > > drivers/pci/switch/switchtec.c:395:1-17: ERROR: switchtec_fops: .read() can > deadlock .write(); change nonseekable_open -> stream_open to fix. > > Generated by: scripts/coccinelle/api/stream_open.cocci > > Fixes: 8a29a3bae2a2 ("pci/switchtec: Don't use completion's wait queue") > Signed-off-by: kbuild test robot > --- > > tree: https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git > linux-5.0.y-rt-rebase > head: 794c294ae4483c240429c25a0d18e272e92c94de > commit: 8a29a3bae2a2dfb0116cd8791d9700515d6e765e [154/311] pci/switchtec: > Don't use completion's wait queue > :: branch date: 7 hours ago > :: commit date: 7 hours ago > > Please take the patch only if it's a positive warning. Thanks! > > switchtec.c |2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > --- a/drivers/pci/switch/switchtec.c > +++ b/drivers/pci/switch/switchtec.c > @@ -392,7 +392,7 @@ static int switchtec_dev_open(struct ino > return PTR_ERR(stuser); > > filp->private_data = stuser; > - nonseekable_open(inode, filp); > + stream_open(inode, filp); > > dev_dbg(>dev, "%s: %p\n", __func__, stuser); >
[PATCH] pci/switchtec: fix stream_open.cocci warnings (fwd)
Hello, Kirill will explain about this issue. julia -- Forwarded message -- Date: Sat, 13 Apr 2019 11:22:51 +0800 From: kbuild test robot To: kbu...@01.org Cc: Julia Lawall Subject: [PATCH] pci/switchtec: fix stream_open.cocci warnings CC: kbuild-...@01.org TO: Sebastian Andrzej Siewior CC: Kurt Schwemmer CC: Logan Gunthorpe CC: Bjorn Helgaas CC: linux-...@vger.kernel.org CC: linux-kernel@vger.kernel.org From: kbuild test robot drivers/pci/switch/switchtec.c:395:1-17: ERROR: switchtec_fops: .read() can deadlock .write(); change nonseekable_open -> stream_open to fix. Generated by: scripts/coccinelle/api/stream_open.cocci Fixes: 8a29a3bae2a2 ("pci/switchtec: Don't use completion's wait queue") Signed-off-by: kbuild test robot --- tree: https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git linux-5.0.y-rt-rebase head: 794c294ae4483c240429c25a0d18e272e92c94de commit: 8a29a3bae2a2dfb0116cd8791d9700515d6e765e [154/311] pci/switchtec: Don't use completion's wait queue :: branch date: 7 hours ago :: commit date: 7 hours ago Please take the patch only if it's a positive warning. Thanks! switchtec.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/drivers/pci/switch/switchtec.c +++ b/drivers/pci/switch/switchtec.c @@ -392,7 +392,7 @@ static int switchtec_dev_open(struct ino return PTR_ERR(stuser); filp->private_data = stuser; - nonseekable_open(inode, filp); + stream_open(inode, filp); dev_dbg(&stdev->dev, "%s: %p\n", __func__, stuser);
[PATCH 0/3] Meson8b: add support for the RTC on EC-100 and Odroid-C1
This adds support for the RTC on the Meson8b EC-100 and Odroid-C1 boards. Example kernel log snippet while booting my EC-100: [5.713750] meson-rtc c8100740.rtc: setting system clock to 2019-04-13T16:21:48 UTC (1555172508) I am only 99% sure about the naming of the clock in patch 2 and 3. The public S805 datasheet from Hardkernel shows (on page 24 for example) that clk81 can use "XTAL/32khz" as clock input. That "XTAL/32khz" clock is described as a mux between 24MHz (our main XTAL) and 32kHz ("that other XTAL"). I believe that this other 32kHz XTAL is NOT the RTC32K crystal because: - schematics of the EC-100 and Odroid-C1 clearly show that the SoC input for the RTC32K clock is labeled RTC32K_XI / RTC32K_XO - GPIOAO_6 has a CLK_32KIN function (shows in EC-100 and Odroid-C1 schematics as well as the public S805 datasheet) - Always On domain PWR_CNTL0[11:10] (public S805 datasheet page 19) describes it as "Alternate 32khz input clock select from GPIO pad" Thus I believe that the naming of the RTC32K clock is correct, but I wanted to point out that I'm only 99% (instead of 100%) sure. Jianxin, please let me know if you disagree with my findings. Martin Blumenstingl (3): ARM: dts: meson: add support for the RTC ARM: dts: meson8b: ec100: enable the RTC ARM: dts: meson8b: odroid-c1: prepare support for the RTC arch/arm/boot/dts/meson.dtsi | 9 + arch/arm/boot/dts/meson8.dtsi | 5 + arch/arm/boot/dts/meson8b-ec100.dts| 14 ++ arch/arm/boot/dts/meson8b-odroidc1.dts | 14 ++ arch/arm/boot/dts/meson8b.dtsi | 5 + 5 files changed, 47 insertions(+) -- 2.21.0
[PATCH 2/3] ARM: dts: meson8b: ec100: enable the RTC
The RTC is always enabled on this board since the battery is already connected in the factory. According to the schematics the VCC_RTC regulator (which is either powered by the internal 3.3V or a battery) is connected to the 0.9V RTC_VDD input of the SoC. Signed-off-by: Martin Blumenstingl --- arch/arm/boot/dts/meson8b-ec100.dts | 14 ++ 1 file changed, 14 insertions(+) diff --git a/arch/arm/boot/dts/meson8b-ec100.dts b/arch/arm/boot/dts/meson8b-ec100.dts index 3ca9638fad09..9bf4249cb60d 100644 --- a/arch/arm/boot/dts/meson8b-ec100.dts +++ b/arch/arm/boot/dts/meson8b-ec100.dts @@ -88,6 +88,14 @@ }; }; + rtc32k_xtal: rtc32k-xtal-clk { + /* X2 in the schematics */ + compatible = "fixed-clock"; + clock-frequency = <32768>; + clock-output-names = "RTC32K"; + #clock-cells = <0>; + }; + usb_vbus: regulator-usb-vbus { /* * Silergy SY6288CCAC-GP 2A Power Distribution Switch. @@ -347,6 +355,12 @@ clock-names = "clkin0"; }; +&rtc { + status = "okay"; + clocks = <&rtc32k_xtal>; + vdd-supply = <&vcc_rtc>; +}; + /* exposed through the pin headers labeled "URDUG1" on the top of the PCB */ &uart_AO { status = "okay"; -- 2.21.0
[PATCH 1/3] ARM: dts: meson: add support for the RTC
The 32-bit Meson SoCs have an RTC block in the AO (always on) area. The RTC requires an external 32.768 kHz oscillator to work properly. Whether or not this crystal exists depends on the board, so it has to be added for each board.dts (instead of adding it somewhere in a generic .dtsi). Signed-off-by: Martin Blumenstingl --- arch/arm/boot/dts/meson.dtsi | 9 + arch/arm/boot/dts/meson8.dtsi | 5 + arch/arm/boot/dts/meson8b.dtsi | 5 + 3 files changed, 19 insertions(+) diff --git a/arch/arm/boot/dts/meson.dtsi b/arch/arm/boot/dts/meson.dtsi index 6f54a8897574..8841783aceec 100644 --- a/arch/arm/boot/dts/meson.dtsi +++ b/arch/arm/boot/dts/meson.dtsi @@ -252,6 +252,15 @@ #size-cells = <0>; status = "disabled"; }; + + rtc: rtc@740 { + compatible = "amlogic,meson6-rtc"; + reg = <0x740 0x14>; + interrupts = ; + #address-cells = <1>; + #size-cells = <1>; + status = "disabled"; + }; }; usb0: usb@c904 { diff --git a/arch/arm/boot/dts/meson8.dtsi b/arch/arm/boot/dts/meson8.dtsi index d2ec4af82cc5..7ef442462ea4 100644 --- a/arch/arm/boot/dts/meson8.dtsi +++ b/arch/arm/boot/dts/meson8.dtsi @@ -541,6 +541,11 @@ compatible = "amlogic,meson8-pwm", "amlogic,meson8b-pwm"; }; +&rtc { + compatible = "amlogic,meson8-rtc"; + resets = <&reset RESET_RTC>; +}; + &saradc { compatible = "amlogic,meson8-saradc", "amlogic,meson-saradc"; clocks = <&clkc CLKID_XTAL>, diff --git a/arch/arm/boot/dts/meson8b.dtsi b/arch/arm/boot/dts/meson8b.dtsi index df42e48f1cc1..800cd65fc50a 100644 --- a/arch/arm/boot/dts/meson8b.dtsi +++ b/arch/arm/boot/dts/meson8b.dtsi @@ -510,6 +510,11 @@ compatible = "amlogic,meson8b-pwm"; }; +&rtc { + compatible = "amlogic,meson8b-rtc"; + resets = <&reset RESET_RTC>; +}; + &saradc { compatible = "amlogic,meson8b-saradc", "amlogic,meson-saradc"; clocks = <&clkc CLKID_XTAL>, -- 2.21.0
[PATCH 3/3] ARM: dts: meson8b: odroid-c1: prepare support for the RTC
The Odroid-C1 has the 32.768 kHz oscillator (X3 in the schematics) which is required for the RTC. A battery can be connected separately (to the BT1 header) - then the "rtc" node can be enabled manually. By default the RTC is disabled because the boards typically come without the RTC battery. Signed-off-by: Martin Blumenstingl --- arch/arm/boot/dts/meson8b-odroidc1.dts | 14 ++ 1 file changed, 14 insertions(+) diff --git a/arch/arm/boot/dts/meson8b-odroidc1.dts b/arch/arm/boot/dts/meson8b-odroidc1.dts index 0157646e3a89..f3ad9397f670 100644 --- a/arch/arm/boot/dts/meson8b-odroidc1.dts +++ b/arch/arm/boot/dts/meson8b-odroidc1.dts @@ -124,6 +124,14 @@ io-channels = <&saradc 8>; }; + rtc32k_xtal: rtc32k-xtal-clk { + /* X3 in the schematics */ + compatible = "fixed-clock"; + clock-frequency = <32768>; + clock-output-names = "RTC32K"; + #clock-cells = <0>; + }; + vcc_1v8: regulator-vcc-1v8 { /* * RICHTEK RT9179 configured for a fixed output voltage of @@ -345,6 +353,12 @@ clock-names = "clkin0"; }; +&rtc { + /* needs to be enabled manually when a battery is connected */ + clocks = <&rtc32k_xtal>; + vdd-supply = <&vcc_rtc>; +}; + &uart_AO { status = "okay"; pinctrl-0 = <&uart_ao_a_pins>; -- 2.21.0
[PATCH 1/1] arm: mm: Export __sync_icache_dcache() for xen-privcmd
This patch avoids ERROR: "__sync_icache_dcache" [drivers/xen/xen-privcmd.ko] undefined! observed when compiling v4.19.34. The xen-privcmd driver, which can be modular, calls set_pte_at() which in turn may call __sync_icache_dcache(). Cc: sta...@vger.kernel.org Signed-off-by: Heinrich Schuchardt --- arch/arm/mm/flush.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/arm/mm/flush.c b/arch/arm/mm/flush.c index 58469623b015..5345f86c56d2 100644 --- a/arch/arm/mm/flush.c +++ b/arch/arm/mm/flush.c @@ -295,6 +295,7 @@ void __sync_icache_dcache(pte_t pteval) if (pte_exec(pteval)) __flush_icache_all(); } +EXPORT_SYMBOL_GPL(__sync_icache_dcache); #endif /* -- 2.20.1
[PATCH] staging: axis-fifo: Add elaborate description in Kconfig
- The Xilinx AXI-Stream FIFO IP core driver description is elaborated. - References: Xilinx PG080 document, axis-fifo.txt Signed-off-by: Moses Christopher --- drivers/staging/axis-fifo/Kconfig | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/staging/axis-fifo/Kconfig b/drivers/staging/axis-fifo/Kconfig index d9725888af6f..ecb7811fd387 100644 --- a/drivers/staging/axis-fifo/Kconfig +++ b/drivers/staging/axis-fifo/Kconfig @@ -6,5 +6,7 @@ config XIL_AXIS_FIFO depends on OF default n help - This adds support for the Xilinx AXI-Stream - FIFO IP core driver. + This adds support for the Xilinx AXI-Stream FIFO IP core driver. + The AXI Streaming FIFO allows memory mapped access to an AXI Streaming + interface. The Xilinx AXI-Stream FIFO IP core can be used to interface + to the AXI Ethernet without the need to use DMA. -- 2.17.1
Re: [PATCH v3 00/26] Add support for PCIe RC and EP mode in TI's AM654 SoC
On Fri, Apr 12, 2019 at 04:48:36PM +0100, Lorenzo Pieralisi wrote: > On Mon, Mar 25, 2019 at 03:09:21PM +0530, Kishon Vijay Abraham I wrote: > > Add PCIe RC support for TI's AM654 SoC. The PCIe controller in AM654 > > uses Synopsys core revision 4.90a and uses the same TI wrapper as used > > in keystone2 with certain modification. Hence AM654 will use the same > > pci wrapper driver pci-keystone.c > > > > This series was initially part of [1]. This series only includes patches > > that has to be merged via Lorenzo's tree. The PHY patches and dt patches > > will be sent separately. > > > > This series is created over keystone MSI cleanup series [2]. > > > > This series: > > *) Cleanup pci-keystone driver so that both RC mode and EP mode of > >AM654 can be supported > > *) Modify epc-core to support allocation of aligned buffers required for > >AM654 > > *) Fix ATU unroll identification > > *) Add support for both host mode and device mode in AM654 > > > > Changes from v2: > > *) Missed updating "Reviewed-by: Rob Herring " tags > >in the version that was sent to list. > > *) Add const qualifier to struct dw_pcie_ep_ops in pci-layerscape-ep.c > > > > Changes from v1: > > *) Support for legacy interrupt in AM654 is removed (see background here > >[3]) > > *) Allow of_pci_get_max_link_speed to be used by Endpoint controller > > driver > > *) Add support to set max-link-speed from DT in pci-keystone driver > > *) Update "Reviewed-by: Rob Herring " tags. 
> > > > [1] -> https://lore.kernel.org/patchwork/cover/989487/ > > [2] -> https://lkml.org/lkml/2019/3/21/193 > > [3] -> https://lkml.org/lkml/2019/3/19/235 > > > > Kishon Vijay Abraham I (26): > > PCI: keystone: Add start_link/stop_link dw_pcie_ops > > PCI: keystone: Cleanup error_irq configuration > > dt-bindings: PCI: keystone: Add "reg-names" binding information > > PCI: keystone: Perform host initialization in a single function > > PCI: keystone: Use platform_get_resource_byname to get memory > > resources > > PCI: keystone: Move initializations to appropriate places > > dt-bindings: PCI: Add dt-binding to configure PCIe mode > > PCI: keystone: Explicitly set the PCIe mode > > dt-bindings: PCI: Document "atu" reg-names > > PCI: dwc: Enable iATU unroll for endpoint too > > PCI: dwc: Fix ATU identification for designware version >= 4.80 > > PCI: keystone: Prevent ARM32 specific code to be compiled for ARM64 > > dt-bindings: PCI: Add PCI RC dt binding documentation for AM654 > > PCI: keystone: Add support for PCIe RC in AM654x Platforms > > PCI: keystone: Invoke phy_reset API before enabling PHY > > PCI: OF: Allow of_pci_get_max_link_speed() to be used by PCI Endpoint > > drivers > > PCI: keystone: Add support to set the max link speed from DT > > PCI: endpoint: Add support to allocate aligned buffers to be mapped in > > BARs > > PCI: dwc: Add const qualifier to struct dw_pcie_ep_ops > > PCI: dwc: Fix dw_pcie_ep_find_capability to return correct capability > > offset > > PCI: dwc: Add callbacks for accessing dbi2 address space > > PCI: keystone: Add support for PCIe EP in AM654x Platforms > > PCI: designware-ep: Configure RESBAR to advertise the smallest size > > PCI: designware-ep: Use aligned ATU window for raising MSI interrupts > > misc: pci_endpoint_test: Add support to test PCI EP in AM654x > > misc: pci_endpoint_test: Fix test_reg_bar to be updated in > > pci_endpoint_test > > > > .../bindings/pci/designware-pcie.txt | 7 +- > > 
.../devicetree/bindings/pci/pci-keystone.txt | 14 +- > > drivers/misc/pci_endpoint_test.c | 18 + > > drivers/pci/Makefile | 2 +- > > drivers/pci/controller/dwc/Kconfig| 25 +- > > drivers/pci/controller/dwc/pci-dra7xx.c | 2 +- > > drivers/pci/controller/dwc/pci-keystone.c | 577 +++--- > > .../pci/controller/dwc/pci-layerscape-ep.c| 2 +- > > drivers/pci/controller/dwc/pcie-artpec6.c | 2 +- > > .../pci/controller/dwc/pcie-designware-ep.c | 55 +- > > .../pci/controller/dwc/pcie-designware-host.c | 19 - > > .../pci/controller/dwc/pcie-designware-plat.c | 2 +- > > drivers/pci/controller/dwc/pcie-designware.c | 52 ++ > > drivers/pci/controller/dwc/pcie-designware.h | 15 +- > > drivers/pci/endpoint/functions/pci-epf-test.c | 5 +- > > drivers/pci/endpoint/pci-epf-core.c | 10 +- > > drivers/pci/of.c | 44 +- > > include/linux/pci-epc.h | 2 + > > include/linux/pci-epf.h | 3 +- > > 19 files changed, 683 insertions(+), 173 deletions(-) > > Hi Kishon, > > I have applied the series, after rewriting the commit logs we > discussed, branch: pci/keystone, please have a look and let me know > if that's good to go. It all looks good to me; I responded to a few things I noticed while writing the notes for merging the branch. I don't
Re: [PATCH v2 23/26] PCI: designware-ep: Configure RESBAR to advertise the smallest size
On Mon, Mar 25, 2019 at 02:04:58PM +0530, Kishon Vijay Abraham I wrote: > Configure RESBAR capability to advertise the smallest size (1MB) for > couple of reasons. A) Host side resource allocation of BAR fails for > larger sizes. B) Endpoint function driver does not allocate memory > for all supported sizes in RESBAR capability. > If and when there is a usecase required to add more flexibility using > RESBAR, this can be revisited. The #define used in the code below is "REBAR"; maybe spell it out once and then use REBAR instead of RESBAR? > Signed-off-by: Kishon Vijay Abraham I > --- > .../pci/controller/dwc/pcie-designware-ep.c | 34 +++ > 1 file changed, 34 insertions(+) > > diff --git a/drivers/pci/controller/dwc/pcie-designware-ep.c > b/drivers/pci/controller/dwc/pcie-designware-ep.c > index 74477ad7467f..0c208b9bda43 100644 > --- a/drivers/pci/controller/dwc/pcie-designware-ep.c > +++ b/drivers/pci/controller/dwc/pcie-designware-ep.c > @@ -504,10 +504,32 @@ void dw_pcie_ep_exit(struct dw_pcie_ep *ep) > pci_epc_mem_exit(epc); > } > > +static unsigned int dw_pcie_ep_find_ext_capability(struct dw_pcie *pci, int > cap) > +{ > + u32 header; > + int pos = PCI_CFG_SPACE_SIZE; > + > + while (pos) { > + header = dw_pcie_readl_dbi(pci, pos); > + if (PCI_EXT_CAP_ID(header) == cap) > + return pos; > + > + pos = PCI_EXT_CAP_NEXT(header); > + if (!pos) > + break; > + } > + > + return 0; > +} > + > int dw_pcie_ep_init(struct dw_pcie_ep *ep) > { > + int i; > int ret; > + u32 reg; > void *addr; > + unsigned int nbars; > + unsigned int offset; > struct pci_epc *epc; > struct dw_pcie *pci = to_dw_pcie_from_ep(ep); > struct device *dev = pci->dev; > @@ -591,6 +613,18 @@ int dw_pcie_ep_init(struct dw_pcie_ep *ep) > > ep->msix_cap = dw_pcie_ep_find_capability(pci, PCI_CAP_ID_MSIX); > > + offset = dw_pcie_ep_find_ext_capability(pci, PCI_EXT_CAP_ID_REBAR); > + if (offset) { > + reg = dw_pcie_readl_dbi(pci, offset + PCI_REBAR_CTRL); > + nbars = (reg & PCI_REBAR_CTRL_NBAR_MASK) >> > + 
PCI_REBAR_CTRL_NBAR_SHIFT; > + > + dw_pcie_dbi_ro_wr_en(pci); > + for (i = 0; i < nbars; i++, offset += PCI_REBAR_CTRL) > + dw_pcie_writel_dbi(pci, offset + PCI_REBAR_CAP, 0x0); > + dw_pcie_dbi_ro_wr_dis(pci); > + } > + > dw_pcie_setup(pci); > > return 0; > -- > 2.17.1 >
[PATCH] RDMA/cxgb4: fix null pointer dereference on alloc_skb failure
From: Colin Ian King Currently if alloc_skb fails to allocate the skb a null skb is passed to t4_set_arp_err_handler and this ends up dereferencing the null skb. Avoid the null pointer dereference by checking for a null skb and returning early. Addresses-Coverity: ("Dereference null return") Fixes: b38a0ad8ec11 ("RDMA/cxgb4: Set arp error handler for PASS_ACCEPT_RPL messages") Signed-off-by: Colin Ian King --- drivers/infiniband/hw/cxgb4/cm.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c index 1e68d87b663d..0f3b1193d5f8 100644 --- a/drivers/infiniband/hw/cxgb4/cm.c +++ b/drivers/infiniband/hw/cxgb4/cm.c @@ -460,6 +460,8 @@ static struct sk_buff *get_skb(struct sk_buff *skb, int len, gfp_t gfp) skb_reset_transport_header(skb); } else { skb = alloc_skb(len, gfp); + if (!skb) + return NULL; } t4_set_arp_err_handler(skb, NULL, NULL); return skb; -- 2.20.1
Re: [PATCH v3 24/26] PCI: designware-ep: Use aligned ATU window for raising MSI interrupts
On Mon, Mar 25, 2019 at 03:09:45PM +0530, Kishon Vijay Abraham I wrote: > Certain platforms like K2G reguires the outbound ATU window to be s/reguires/require/ (note two changes: "g->q" and drop s)
Re: [PATCH v3 20/26] PCI: dwc: Fix dw_pcie_ep_find_capability to return correct capability offset
On Mon, Mar 25, 2019 at 03:09:41PM +0530, Kishon Vijay Abraham I wrote: > commit beb4641a787df79a ("PCI: dwc: Add MSI-X callbacks handler") while > adding MSI-X callback handler, introduced dw_pcie_ep_find_capability and > __dw_pcie_ep_find_next_cap for finding the MSI and MSIX capability. > > However if MSI or MSIX capability is the last capability (i.e there are > no additional items in the capabilities list and the Next Capability > Pointer is set to '0'), __dw_pcie_ep_find_next_cap will return '0' > even though MSI or MSIX capability may be present. This is because of > incorrect ordering of "next_cap_ptr" check. Fix it here. Nice fix. It looks like dw_pcie_ep_find_capability() is currently only used for MSI and MSI-X, but IIUC, it *could* be used for any capability, so this commit message seems a little too specific. It would be slightly less confusing if beb4641a787d ("PCI: dwc: Add MSI-X callbacks handler") had been split into a commit that added the general-purpose dw_pcie_ep_find_capability() and a commit that used it for MSI and MSI-X, but obviously we can't change history. > Fixes: beb4641a787df79a142 ("PCI: dwc: Add MSI-X callbacks handler") These commit references are longer than necessary (and different - 19-char SHA1 here and 16 chars above). 
I typically use 12 chars via gsr is aliased to `git --no-pager show -s --abbrev-commit --abbrev=12 --pretty=format:"%h (\"%s\")%n"' > Signed-off-by: Kishon Vijay Abraham I > --- > drivers/pci/controller/dwc/pcie-designware-ep.c | 10 +- > 1 file changed, 5 insertions(+), 5 deletions(-) > > diff --git a/drivers/pci/controller/dwc/pcie-designware-ep.c > b/drivers/pci/controller/dwc/pcie-designware-ep.c > index dc6a4bbd3ace..74477ad7467f 100644 > --- a/drivers/pci/controller/dwc/pcie-designware-ep.c > +++ b/drivers/pci/controller/dwc/pcie-designware-ep.c > @@ -46,16 +46,19 @@ static u8 __dw_pcie_ep_find_next_cap(struct dw_pcie *pci, > u8 cap_ptr, > u8 cap_id, next_cap_ptr; > u16 reg; > > + if (!cap_ptr) > + return 0; > + > reg = dw_pcie_readw_dbi(pci, cap_ptr); > - next_cap_ptr = (reg & 0xff00) >> 8; > cap_id = (reg & 0x00ff); > > - if (!next_cap_ptr || cap_id > PCI_CAP_ID_MAX) > + if (cap_id > PCI_CAP_ID_MAX) > return 0; > > if (cap_id == cap) > return cap_ptr; > > + next_cap_ptr = (reg & 0xff00) >> 8; > return __dw_pcie_ep_find_next_cap(pci, next_cap_ptr, cap); > } > > @@ -67,9 +70,6 @@ static u8 dw_pcie_ep_find_capability(struct dw_pcie *pci, > u8 cap) > reg = dw_pcie_readw_dbi(pci, PCI_CAPABILITY_LIST); > next_cap_ptr = (reg & 0x00ff); > > - if (!next_cap_ptr) > - return 0; > - > return __dw_pcie_ep_find_next_cap(pci, next_cap_ptr, cap); > } > > -- > 2.17.1 >
[PATCH v2 2/3] RISC-V: Add interrupt related SCAUSE defines in asm/encoding.h
This patch adds SCAUSE interrupt flag and SCAUSE interrupt cause related defines to asm/encoding.h. We also use these defines in arch/riscv/kernel/irq.c and express the SIE/SIP flags in terms of interrupt causes. Signed-off-by: Anup Patel --- arch/riscv/include/asm/encoding.h | 25 + arch/riscv/kernel/irq.c | 9 + 2 files changed, 26 insertions(+), 8 deletions(-) diff --git a/arch/riscv/include/asm/encoding.h b/arch/riscv/include/asm/encoding.h index 29699705dc36..4f187854fd8b 100644 --- a/arch/riscv/include/asm/encoding.h +++ b/arch/riscv/include/asm/encoding.h @@ -44,10 +44,22 @@ #define SATP_MODE SATP_MODE_39 #endif -/* Interrupt Enable and Interrupt Pending flags */ -#define SIE_SSIE _AC(0x0002, UL) /* Software Interrupt Enable */ -#define SIE_STIE _AC(0x0020, UL) /* Timer Interrupt Enable */ -#define SIE_SEIE _AC(0x0200, UL) /* External Interrupt Enable */ +/* SCAUSE */ +#ifdef CONFIG_64BIT +#define SCAUSE_IRQ_FLAG _AC(0x8000000000000000, UL) +#else +#define SCAUSE_IRQ_FLAG _AC(0x80000000, UL) +#endif + +#define IRQ_U_SOFT 0 +#define IRQ_S_SOFT 1 +#define IRQ_M_SOFT 3 +#define IRQ_U_TIMER 4 +#define IRQ_S_TIMER 5 +#define IRQ_M_TIMER 7 +#define IRQ_U_EXT 8 +#define IRQ_S_EXT 9 +#define IRQ_M_EXT 11 #define EXC_INST_MISALIGNED 0 #define EXC_INST_ACCESS 1 @@ -59,4 +71,9 @@ #define EXC_LOAD_PAGE_FAULT 13 #define EXC_STORE_PAGE_FAULT 15 +/* SIE (Interrupt Enable) and SIP (Interrupt Pending) flags */ +#define SIE_SSIE (_AC(0x1, UL) << IRQ_S_SOFT) +#define SIE_STIE (_AC(0x1, UL) << IRQ_S_TIMER) +#define SIE_SEIE (_AC(0x1, UL) << IRQ_S_EXT) + #endif /* _ASM_RISCV_ENCODING_H */ diff --git a/arch/riscv/kernel/irq.c b/arch/riscv/kernel/irq.c index 48e6b7db83a1..22b8183ae8d4 100644 --- a/arch/riscv/kernel/irq.c +++ b/arch/riscv/kernel/irq.c @@ -9,14 +9,15 @@ #include #include #include +#include #include /* * Possible interrupt causes: */ -#define INTERRUPT_CAUSE_SOFTWARE 1 -#define INTERRUPT_CAUSE_TIMER 5 -#define INTERRUPT_CAUSE_EXTERNAL 9 +#define INTERRUPT_CAUSE_SOFTWARE IRQ_S_SOFT +#define 
INTERRUPT_CAUSE_TIMER IRQ_S_TIMER +#define INTERRUPT_CAUSE_EXTERNAL IRQ_S_EXT /* * The high order bit of the trap cause register is always set for @@ -24,7 +25,7 @@ * quickly. The INTERRUPT_CAUSE_* macros don't contain that bit, so we * need to mask it off. */ -#define INTERRUPT_CAUSE_FLAG (1UL << (__riscv_xlen - 1)) +#define INTERRUPT_CAUSE_FLAG SCAUSE_IRQ_FLAG int arch_show_interrupts(struct seq_file *p, int prec) { -- 2.17.1
[PATCH v2 3/3] RISC-V: Access CSRs using CSR numbers
We should prefer accessing CSRs using their CSR numbers because: 1. It compiles fine with older toolchains. 2. We can use latest CSR names in #define macro names of CSR numbers as-per RISC-V spec. 3. We can access newly added CSRs even if toolchain does not recognize newly addes CSRs by name. Signed-off-by: Anup Patel --- arch/riscv/include/asm/csr.h | 15 --- arch/riscv/include/asm/encoding.h| 16 arch/riscv/include/asm/irqflags.h| 10 +- arch/riscv/include/asm/mmu_context.h | 7 +-- arch/riscv/kernel/entry.S| 22 +++--- arch/riscv/kernel/head.S | 12 ++-- arch/riscv/kernel/perf_event.c | 4 ++-- arch/riscv/kernel/smp.c | 2 +- arch/riscv/kernel/traps.c| 6 +++--- arch/riscv/mm/fault.c| 6 +- 10 files changed, 54 insertions(+), 46 deletions(-) diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h index 8cf698e39463..6bf5652d3565 100644 --- a/arch/riscv/include/asm/csr.h +++ b/arch/riscv/include/asm/csr.h @@ -14,6 +14,7 @@ #ifndef _ASM_RISCV_CSR_H #define _ASM_RISCV_CSR_H +#include #include #ifndef __ASSEMBLY__ @@ -21,7 +22,7 @@ #define csr_swap(csr, val) \ ({ \ unsigned long __v = (unsigned long)(val); \ - __asm__ __volatile__ ("csrrw %0, " #csr ", %1" \ + __asm__ __volatile__ ("csrrw %0, " __ASM_STR(csr) ", %1"\ : "=r" (__v) : "rK" (__v) \ : "memory"); \ __v;\ @@ -30,7 +31,7 @@ #define csr_read(csr) \ ({ \ register unsigned long __v; \ - __asm__ __volatile__ ("csrr %0, " #csr \ + __asm__ __volatile__ ("csrr %0, " __ASM_STR(csr)\ : "=r" (__v) :\ : "memory"); \ __v;\ @@ -39,7 +40,7 @@ #define csr_write(csr, val)\ ({ \ unsigned long __v = (unsigned long)(val); \ - __asm__ __volatile__ ("csrw " #csr ", %0" \ + __asm__ __volatile__ ("csrw " __ASM_STR(csr) ", %0" \ : : "rK" (__v)\ : "memory"); \ }) @@ -47,7 +48,7 @@ #define csr_read_set(csr, val) \ ({ \ unsigned long __v = (unsigned long)(val); \ - __asm__ __volatile__ ("csrrs %0, " #csr ", %1" \ + __asm__ __volatile__ ("csrrs %0, " __ASM_STR(csr) ", %1"\ : "=r" (__v) : "rK" (__v) \ : "memory"); \ __v;\ @@ 
-56,7 +57,7 @@ #define csr_set(csr, val) \ ({ \ unsigned long __v = (unsigned long)(val); \ - __asm__ __volatile__ ("csrs " #csr ", %0" \ + __asm__ __volatile__ ("csrs " __ASM_STR(csr) ", %0" \ : : "rK" (__v)\ : "memory"); \ }) @@ -64,7 +65,7 @@ #define csr_read_clear(csr, val) \ ({ \ unsigned long __v = (unsigned long)(val); \ - __asm__ __volatile__ ("csrrc %0, " #csr ", %1" \ + __asm__ __volatile__ ("csrrc %0, " __ASM_STR(csr) ", %1"\ : "=r" (__v) : "rK" (__v) \ : "memory"); \ __v;\ @@ -73,7 +74,7 @@ #define csr_clear(csr, val)\ ({ \ unsigned long __v = (unsigned long)(val); \ - __asm__ __volatile__ ("csrc " #csr ", %0" \ + __asm__ __volatile__ ("csrc " __ASM_STR(csr) ", %0" \ : : "rK" (__v)\ : "memory"); \ }) diff --git a/arch/riscv/include/asm/encoding.h b/arch/riscv/include/asm/encoding.h index 4f187854fd8b..717b823ac110 100644 ---