[PATCH V2 0/2] arm64/mm: Enable memory hot remove
This series enables memory hot remove on arm64 after fixing a memblock removal ordering problem in generic __remove_memory(). This is based on the following arm64 working tree:

git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/core

Testing:

Tested hot remove on arm64 for all 4K, 16K, 64K page config options with all possible VA_BITS and PGTABLE_LEVELS combinations. Build tested on non-arm64 platforms.

Changes in V2:

- Added all received review and ack tags
- Split the series from ZONE_DEVICE enablement for better review
- Moved memblock re-order patch to the front as per Robin Murphy
- Updated commit message on memblock re-order patch per Michal Hocko
- Dropped [pmd|pud]_large() definitions
- Used existing [pmd|pud]_sect() instead of earlier [pmd|pud]_large()
- Removed __meminit and __ref tags as per Oscar Salvador
- Dropped unnecessary 'ret' init in arch_add_memory() per Robin Murphy
- Skipped calling into pgtable_page_dtor() for linear mapping page table pages and updated all relevant functions

Changes in V1: (https://lkml.org/lkml/2019/4/3/28)

Anshuman Khandual (2):
  mm/hotplug: Reorder arch_remove_memory() call in __remove_memory()
  arm64/mm: Enable memory hot remove

 arch/arm64/Kconfig               |   3 +
 arch/arm64/include/asm/pgtable.h |   2 +
 arch/arm64/mm/mmu.c              | 221 ++-
 mm/memory_hotplug.c              |   3 +-
 4 files changed, 225 insertions(+), 4 deletions(-)

-- 
2.7.4
[PATCH 3/3] ARM: mvebu: add SPDX license identifier
The license is clearly identified as GPL V2 - so just add in the appropriate SPDX license identifier. Signed-off-by: Nicholas Mc Guire --- Problem reported by checkpatch WARNING: Missing or malformed SPDX-License-Identifier tag in line 1 #1: FILE: arch/arm/mach-mvebu/board-v7.c:1: +/* Patch is against 5.1-rc4 (localversion-next is 20190412) arch/arm/mach-mvebu/board-v7.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/arm/mach-mvebu/board-v7.c b/arch/arm/mach-mvebu/board-v7.c index 28fd256..0e021c9 100644 --- a/arch/arm/mach-mvebu/board-v7.c +++ b/arch/arm/mach-mvebu/board-v7.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * Device Tree support for Armada 370 and XP platforms. * -- 2.1.4
[PATCH 1/3 RFC] ARM: mvebu: at least warn on kzalloc failure
Although it is very unlikely that the allocation during init would fail, any such failure should point to the original cause rather than waiting for a null-pointer dereference to splat. Signed-off-by: Nicholas Mc Guire --- Problem located with experimental coccinelle script While this will not really help much, kzalloc failures should not go unhandled. Patch was compile-tested: mvebu_v7_defconfig (implies MACH_MVEBU_ANY=y) (with some unrelated sparse warnings about missing syscalls) Patch is against 5.1-rc4 (localversion-next is 20190412) arch/arm/mach-mvebu/board-v7.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/arm/mach-mvebu/board-v7.c b/arch/arm/mach-mvebu/board-v7.c index 0b10acd..37f8cb6 100644 --- a/arch/arm/mach-mvebu/board-v7.c +++ b/arch/arm/mach-mvebu/board-v7.c @@ -128,6 +128,7 @@ static void __init i2c_quirk(void) struct property *new_compat; new_compat = kzalloc(sizeof(*new_compat), GFP_KERNEL); + WARN_ON(!new_compat); new_compat->name = kstrdup("compatible", GFP_KERNEL); new_compat->length = sizeof("marvell,mv78230-a0-i2c"); -- 2.1.4
[PATCH 2/3] ARM: mvebu: drop return from void function
The return statement is unnecessary here - so drop it. Signed-off-by: Nicholas Mc Guire --- Problem reported by checkpatch WARNING: void function return statements are not generally useful #141: FILE: arch/arm/mach-mvebu/board-v7.c:141: + return; +} Patch was compile-tested: mvebu_v7_defconfig (implies MACH_MVEBU_ANY=y) (with some unrelated sparse warnings about missing syscalls) Patch is against 5.1-rc4 (localversion-next is 20190412) arch/arm/mach-mvebu/board-v7.c | 1 - 1 file changed, 1 deletion(-) diff --git a/arch/arm/mach-mvebu/board-v7.c b/arch/arm/mach-mvebu/board-v7.c index 37f8cb6..28fd256 100644 --- a/arch/arm/mach-mvebu/board-v7.c +++ b/arch/arm/mach-mvebu/board-v7.c @@ -137,7 +137,6 @@ static void __init i2c_quirk(void) of_update_property(np, new_compat); } - return; } static void __init mvebu_dt_init(void) -- 2.1.4
Re: INFO: task hung in do_exit
syzbot has bisected this bug to: commit 430e48ecf31f4f897047f22e02abdfa75730cad8 Author: Amitoj Kaur Chawla Date: Thu Aug 10 16:28:09 2017 + leds: lm3533: constify attribute_group structure bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=15f4cee320 start commit: 8ee15f32 Merge tag 'dma-mapping-5.1-1' of git://git.infrad.. git tree: upstream final crash: https://syzkaller.appspot.com/x/report.txt?x=17f4cee320 console output: https://syzkaller.appspot.com/x/log.txt?x=13f4cee320 kernel config: https://syzkaller.appspot.com/x/.config?x=4fb64439e07a1ec0 dashboard link: https://syzkaller.appspot.com/bug?extid=9880e421ec82313d6527 syz repro: https://syzkaller.appspot.com/x/repro.syz?x=149b89af20 Reported-by: syzbot+9880e421ec82313d6...@syzkaller.appspotmail.com Fixes: 430e48ecf31f ("leds: lm3533: constify attribute_group structure") For information about bisection process see: https://goo.gl/tpsmEJ#bisection
Re: [PATCH 5/6] iio: imx7d_adc: Use imx7d_adc_resume() in imx7d_adc_probe()
On Sun, Apr 7, 2019 at 4:15 AM Jonathan Cameron wrote: > > On Wed, 3 Apr 2019 00:03:24 -0700 > Andrey Smirnov wrote: > > > Initialization sequence performed in imx7d_adc_resume() is exactly > > the same as what's being done in imx7d_adc_probe(). Make use of the > > former in the latter to avoid code duplication. > > > > Signed-off-by: Andrey Smirnov > > Cc: Jonathan Cameron > > Cc: Hartmut Knaack > > Cc: Lars-Peter Clausen > > Cc: Peter Meerwald-Stadler > > Cc: Chris Healy > > Cc: linux-...@vger.kernel.org > > Cc: linux-kernel@vger.kernel.org > I'm not going to apply this now, due to the ordering issue with the earlier > and later patches. However, it is fine in and of itself. > > I do wonder if we want to rename resume to something else to reflect > its new usage? imx7d_adc_enable perhaps? > Similar question for suspend in the next patch. Sure, will rename in v2. Thanks, Andrey Smirnov
Re: [PATCH 3/6] iio: imx7d_adc: Use devm_iio_device_register()
On Sun, Apr 7, 2019 at 4:07 AM Jonathan Cameron wrote: > > On Wed, 3 Apr 2019 00:03:22 -0700 > Andrey Smirnov wrote: > > > Use devm_iio_device_register() and drop explicit call to > > iio_device_unregister(). > > > > Signed-off-by: Andrey Smirnov > > Cc: Jonathan Cameron > > Cc: Hartmut Knaack > > Cc: Lars-Peter Clausen > > Cc: Peter Meerwald-Stadler > > Cc: Chris Healy > > Cc: linux-...@vger.kernel.org > > Cc: linux-kernel@vger.kernel.org > No to this one. The thing to think about is the resulting order > of the unwinding that happens in remove. > > Do take a look at the code flow, but in short what happens is: > > 1. driver.remove() > 2. Devm release functions run in the opposite order to that in which they were >called during setup. > > The upshot of the change you just made here is that we turn the power > off to the device before we remove the userspace interfaces, giving > potentially interesting failures.. > > There are two options to avoid this: > > 1. Make everything use devm_ calls (often using devm_add_action_or_reset > for the ones that don't have their own versions). > > 2. Stop using devm managed functions at the first non devm_ call that needs > unwinding. From that point onwards in probe / remove you have to do > everything > manually. > Yeah, missed the ordering problem, sorry about that. I'll re-order changes such that this conversion happens last to avoid said problem in v2. Thanks, Andrey Smirnov
Re: [PATCH] new flag COPY_FILE_RANGE_FILESIZE for copy_file_range()
On Sat, Apr 13, 2019 at 03:54:39PM -0500, Shawn Landden wrote: /me pulls out his close-reading glasses and the copy_file_range manpage... > If flags includes COPY_FILE_RANGE_FILESIZE then the length > copied is the length of the file. off_in and off_out are > ignored. len must be 0 or the file size. They're ignored? As in the copy operation reads the number of bytes in the file referenced by fd_in from fd_in at its current position and writes that out to fd_out at its current position? I don't see why I would want such an operation... ...but I can see how people could make use of a CFR_ENTIRE_FILE that would check that both file descriptors are actually regular files, and if so copy the entire contents of the fd_in file into the same position in the fd_out file, and then set the fd_out file's length to match. If @off_in or @off_out are non-NULL then they'll be updated to the new EOFs if the copy completes successfully and @len can be anything. Also: please update the manual page and the xfstests regression test for this syscall. > This implementation saves a call to stat() in the common case > of copying files. It does not fix any race conditions, but that > is possible in the future with this interface. > > EAGAIN: If COPY_FILE_RANGE_FILESIZE was passed and len is not 0 > or the file size. The values are invalid, so why would we tell userspace to try again instead of the EINVAL that we usually use?
> Signed-off-by: Shawn Landden > CC: > --- > fs/read_write.c | 14 +- > include/uapi/linux/stat.h | 4 > 2 files changed, 17 insertions(+), 1 deletion(-) > > diff --git a/fs/read_write.c b/fs/read_write.c > index 61b43ad7608e..6d06361f0856 100644 > --- a/fs/read_write.c > +++ b/fs/read_write.c > @@ -1557,7 +1557,7 @@ ssize_t vfs_copy_file_range(struct file *file_in, > loff_t pos_in, > struct inode *inode_out = file_inode(file_out); > ssize_t ret; > > - if (flags != 0) > + if ((flags & ~COPY_FILE_RANGE_FILESIZE) != 0) FWIW you might as well shorten the prefix to "CFR_" since nobody else is using it. --D > return -EINVAL; > > if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode)) > @@ -1565,6 +1565,18 @@ ssize_t vfs_copy_file_range(struct file *file_in, > loff_t pos_in, > if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode)) > return -EINVAL; > > + if (flags & COPY_FILE_RANGE_FILESIZE) { > + struct kstat stat; > + int error; > + error = vfs_getattr(&file_in->f_path, &stat, > + STATX_SIZE, 0); > + if (error < 0) > + return error; > + if (!(len == 0 || len == stat.size)) > + return -EAGAIN; > + len = stat.size; > + } > + > ret = rw_verify_area(READ, file_in, &pos_in, len); > if (unlikely(ret)) > return ret; > diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h > index 7b35e98d3c58..1075aa4666ef 100644 > --- a/include/uapi/linux/stat.h > +++ b/include/uapi/linux/stat.h > @@ -170,5 +170,9 @@ struct statx { > > #define STATX_ATTR_AUTOMOUNT 0x1000 /* Dir: Automount trigger */ > > +/* > + * Flags for copy_file_range() > + */ > +#define COPY_FILE_RANGE_FILESIZE 0x0001 /* Copy the full length of > the input file */ > > #endif /* _UAPI_LINUX_STAT_H */ > -- > 2.20.1 >
Re: perf tools:Is there any tools to found out the max latency by irq or cpu idle
On 4/13/19 2:01 AM, Linhaifeng wrote: > Sorry, the value 131081408 is just for example. Actually the result is like > this: > sqrt 2019-04-10 23:53:50: 43968 > sqrt 2019-04-10 23:53:51: 44060 > sqrt 2019-04-10 23:53:52: 49012 > sqrt 2019-04-10 23:53:53: 38172 > sqrt 2019-04-10 23:53:54: 131081408 > sqrt 2019-04-10 23:53:55: 43600 > sqrt 2019-04-10 23:53:56: 46704 > sqrt 2019-04-10 23:53:57: 46880 > sqrt 2019-04-10 23:53:58: 44332 > …… > sqrt 2019-04-10 02:17:15: 136345876 > …… > sqrt 2019-04-10 04:40:35: 136335644 > …… Hi, Still it would be interesting to look at the two raw values used to compute the delta to better diagnose what is going on. -Will > > -Original Message- > From: William Cohen [mailto:wco...@redhat.com] > Sent: 2019年4月12日 22:06 > To: Linhaifeng ; linux-perf-us...@vger.kernel.org; > linux-kernel@vger.kernel.org > Subject: Re: perf tools:Is there any tools to found out the max latency by > irq or cpu idle > > On 4/11/19 8:57 PM, Linhaifeng wrote: >> Hi, >> I have a single thread application like this: >> >> While (1) { >> start = rdtsc(); >> sqrt (1024); >> end = rdtsc(); >> cycles = end – start; >> printf("cycles: %d-%02d-%02d %02d:%02d:%02d: %lu\n", >> 1900+timeinfo->tm_year, 1+timeinfo->tm_mon, timeinfo->tm_mday, >> timeinfo->tm_hour, timeinfo->tm_min, timeinfo->tm_sec, >> cycles); >> } >> It print the cycles of sqrt every second and run with taskset –c 1 ./sqrt. >> The result of test is: >> >> sqrt 2019-04-10 23:53:50: 43968 >> sqrt 2019-04-10 23:53:51: 44060 >> sqrt 2019-04-10 23:53:52: 49012 >> sqrt 2019-04-10 23:53:53: 38172 >> sqrt 2019-04-10 23:53:54: 131081408 >> sqrt 2019-04-10 23:53:55: 43600 >> sqrt 2019-04-10 23:53:56: 46704 >> sqrt 2019-04-10 23:53:57: 46880 >> sqrt 2019-04-10 23:53:58: 44332 >> …… >> sqrt 2019-04-10 02:17:15: 131081408 >> …… >> sqrt 2019-04-10 04:40:35: 131081408 >> …… >> >> Every 2hour23min there would be a large cycles. I use perf sched not found >> any sched_switch events.
> > Hi, > > The fact that it is the same value 131081408 every 2 hours 23 minutes looks > suspect. One would expect some variation in the counts. It looks like there > is some rollover or overflow issue. It would be helpful to print out the > start and end values to see what is going on with the raw rdstc values. > Maybe the thread is being moved between processors and the TSC are out of > sync. What particular processor model was this running on? Was this running > on physical hardware or inside a kvm guest? > > According to the Intel 64 and IA-32 Architectures Software Developer's Manual > Volume 3 (325384-sdm-vol-3abcd.pdf): > > The RDTSC instruction reads the time-stamp counter and is guaranteed to > return a monotonically increasing unique value whenever executed, except for > a 64-bit counter wraparound. Intel guarantees that the time-stamp counter > will not wraparound within 10 years after being reset. The period for counter > wrap is longer for Pentium 4, Intel Xeon, P6 family, and Pentium processors. > > -Will Cohen > > >> >> L2GW_2680:/home/fsp/zn # perf sched record -C 6-11 -o perf.sched ^C[ perf record: Woken up 64 times to write data ] [ perf record: Captured and wrote 204.878 MB perf.sched (1911189 samples) ] >> >> L2GW_2680:/home/fsp/zn # perf sched latency -i perf.sched >> >> ----- >> Task | Runtime ms | Switches | Average delay ms | Maximum delay ms | Maximum delay at | >> ----- >> ----- >> TOTAL: | 0.000 ms | 0 | >> --- >> >> >> >> Is there any other tools of perf to found out the max latency by irq or cpu idle ? >> >> >
Re: [PATCH 4/4] ARM: imx legacy: add an SPDX license identifier
On Sat, Apr 13, 2019 at 11:11:05AM -0300, Fabio Estevam wrote: > On Sat, Apr 13, 2019 at 4:30 AM Nicholas Mc Guire wrote: > > > > The header clearly identifies this code as GPL V2 or later - so pop > > in the SPDX license identifier. > > > > Signed-off-by: Nicholas Mc Guire > > --- > > arch/arm/mach-imx/mach-mx27ads.c | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/arch/arm/mach-imx/mach-mx27ads.c > > b/arch/arm/mach-imx/mach-mx27ads.c > > index c83fdd3..3f68972 100644 > > --- a/arch/arm/mach-imx/mach-mx27ads.c > > +++ b/arch/arm/mach-imx/mach-mx27ads.c > > @@ -1,3 +1,4 @@ > > +// SPDX-License-Identifier: GPL-2.0+ > > As you are adding the SPDX line you could also remove the legal text below. hmmm - as I'm not the copyright holder I do not think I should be doing that - it has more information in it than only GPL V2+ thx! hofrat
Re: [PATCH v5 4/5] xtensa: Replace CONFIG_DEBUG_KERNEL with CONFIG_DEBUG_MISC
On Sat, Apr 13, 2019 at 3:44 PM Sinan Kaya wrote: > > CONFIG_DEBUG_KERNEL should not impact code generation. Use the newly > defined CONFIG_DEBUG_MISC instead to keep the current code. > > Signed-off-by: Sinan Kaya > Reviewed-by: Josh Triplett > --- > arch/xtensa/include/asm/irqflags.h | 2 +- > arch/xtensa/kernel/smp.c | 2 +- > 2 files changed, 2 insertions(+), 2 deletions(-) Acked-by: Max Filippov -- Thanks. -- Max
[PATCH v5 3/5] mips: Replace CONFIG_DEBUG_KERNEL with CONFIG_DEBUG_MISC
CONFIG_DEBUG_KERNEL should not impact code generation. Use the newly defined CONFIG_DEBUG_MISC instead to keep the current code. Signed-off-by: Sinan Kaya Reviewed-by: Josh Triplett --- arch/mips/kernel/setup.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/mips/kernel/setup.c b/arch/mips/kernel/setup.c index 8d1dc6c71173..9fc8fadb8418 100644 --- a/arch/mips/kernel/setup.c +++ b/arch/mips/kernel/setup.c @@ -563,7 +563,7 @@ static void __init bootmem_init(void) offset = __pa_symbol(_text) - __pa_symbol(VMLINUX_LOAD_ADDRESS); memblock_free(__pa_symbol(VMLINUX_LOAD_ADDRESS), offset); -#if defined(CONFIG_DEBUG_KERNEL) && defined(CONFIG_DEBUG_INFO) +#if defined(CONFIG_DEBUG_MISC) && defined(CONFIG_DEBUG_INFO) /* * This information is necessary when debugging the kernel * But is a security vulnerability otherwise! -- 2.21.0
[PATCH v5 4/5] xtensa: Replace CONFIG_DEBUG_KERNEL with CONFIG_DEBUG_MISC
CONFIG_DEBUG_KERNEL should not impact code generation. Use the newly defined CONFIG_DEBUG_MISC instead to keep the current code. Signed-off-by: Sinan Kaya Reviewed-by: Josh Triplett --- arch/xtensa/include/asm/irqflags.h | 2 +- arch/xtensa/kernel/smp.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/xtensa/include/asm/irqflags.h b/arch/xtensa/include/asm/irqflags.h index 9b5e8526afe5..12890681587b 100644 --- a/arch/xtensa/include/asm/irqflags.h +++ b/arch/xtensa/include/asm/irqflags.h @@ -27,7 +27,7 @@ static inline unsigned long arch_local_irq_save(void) { unsigned long flags; #if XTENSA_FAKE_NMI -#if defined(CONFIG_DEBUG_KERNEL) && (LOCKLEVEL | TOPLEVEL) >= XCHAL_DEBUGLEVEL +#if defined(CONFIG_DEBUG_MISC) && (LOCKLEVEL | TOPLEVEL) >= XCHAL_DEBUGLEVEL unsigned long tmp; asm volatile("rsr %0, ps\t\n" diff --git a/arch/xtensa/kernel/smp.c b/arch/xtensa/kernel/smp.c index 3699d6d3e479..83b244ce61ee 100644 --- a/arch/xtensa/kernel/smp.c +++ b/arch/xtensa/kernel/smp.c @@ -126,7 +126,7 @@ void secondary_start_kernel(void) init_mmu(); -#ifdef CONFIG_DEBUG_KERNEL +#ifdef CONFIG_DEBUG_MISC if (boot_secondary_processors == 0) { pr_debug("%s: boot_secondary_processors:%d; Hanging cpu:%d\n", __func__, boot_secondary_processors, cpu); -- 2.21.0
[PATCH v5 0/5] init: Do not select DEBUG_KERNEL by default
CONFIG_DEBUG_KERNEL has been designed to just enable Kconfig options. Kernel code generation should not depend on CONFIG_DEBUG_KERNEL. Proposed alternative plan: let's add a new symbol, something like DEBUG_MISC ("Miscellaneous debug code that should be under a more specific debug option but isn't"), make it depend on DEBUG_KERNEL and be "default DEBUG_KERNEL" but allow itself to be turned off, and then mechanically change the small handful of "#ifdef CONFIG_DEBUG_KERNEL" to "#ifdef CONFIG_DEBUG_MISC". Diff from v4: - collect reviewed-by - collect acked-by - fix nit on 1/5 Sinan Kaya (5): init: Introduce DEBUG_MISC option powerpc: Replace CONFIG_DEBUG_KERNEL with CONFIG_DEBUG_MISC mips: Replace CONFIG_DEBUG_KERNEL with CONFIG_DEBUG_MISC xtensa: Replace CONFIG_DEBUG_KERNEL with CONFIG_DEBUG_MISC net: Replace CONFIG_DEBUG_KERNEL with CONFIG_DEBUG_MISC arch/mips/kernel/setup.c | 2 +- arch/powerpc/kernel/sysfs.c| 8 arch/xtensa/include/asm/irqflags.h | 2 +- arch/xtensa/kernel/smp.c | 2 +- lib/Kconfig.debug | 9 + net/netfilter/core.c | 2 +- 6 files changed, 17 insertions(+), 8 deletions(-) -- 2.21.0
[PATCH v5 1/5] init: Introduce DEBUG_MISC option
Introduce DEBUG_MISC ("Miscellaneous debug code that should be under a more specific debug option but isn't"), make it depend on DEBUG_KERNEL and be "default DEBUG_KERNEL" but allow itself to be turned off, and then mechanically change the small handful of "#ifdef CONFIG_DEBUG_KERNEL" to "#ifdef CONFIG_DEBUG_MISC". Signed-off-by: Sinan Kaya Reviewed-by: Josh Triplett --- lib/Kconfig.debug | 9 + 1 file changed, 9 insertions(+) diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 0d9e81779e37..0103a092ce3d 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -438,6 +438,15 @@ config DEBUG_KERNEL Say Y here if you are developing drivers or trying to debug and identify kernel problems. +config DEBUG_MISC + bool "Miscellaneous debug code" + default DEBUG_KERNEL + depends on DEBUG_KERNEL + help + Say Y here if you need to enable miscellaneous debug code that should + be under a more specific debug option but isn't. + + menu "Memory Debugging" source "mm/Kconfig.debug" -- 2.21.0
[PATCH v5 2/5] powerpc: Replace CONFIG_DEBUG_KERNEL with CONFIG_DEBUG_MISC
CONFIG_DEBUG_KERNEL should not impact code generation. Use the newly defined CONFIG_DEBUG_MISC instead to keep the current code. Signed-off-by: Sinan Kaya Reviewed-by: Josh Triplett --- arch/powerpc/kernel/sysfs.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c index e8e93c2c7d03..7a1708875d27 100644 --- a/arch/powerpc/kernel/sysfs.c +++ b/arch/powerpc/kernel/sysfs.c @@ -610,7 +610,7 @@ SYSFS_PMCSETUP(pa6t_pmc2, SPRN_PA6T_PMC2); SYSFS_PMCSETUP(pa6t_pmc3, SPRN_PA6T_PMC3); SYSFS_PMCSETUP(pa6t_pmc4, SPRN_PA6T_PMC4); SYSFS_PMCSETUP(pa6t_pmc5, SPRN_PA6T_PMC5); -#ifdef CONFIG_DEBUG_KERNEL +#ifdef CONFIG_DEBUG_MISC SYSFS_SPRSETUP(hid0, SPRN_HID0); SYSFS_SPRSETUP(hid1, SPRN_HID1); SYSFS_SPRSETUP(hid4, SPRN_HID4); @@ -639,7 +639,7 @@ SYSFS_SPRSETUP(tsr0, SPRN_PA6T_TSR0); SYSFS_SPRSETUP(tsr1, SPRN_PA6T_TSR1); SYSFS_SPRSETUP(tsr2, SPRN_PA6T_TSR2); SYSFS_SPRSETUP(tsr3, SPRN_PA6T_TSR3); -#endif /* CONFIG_DEBUG_KERNEL */ +#endif /* CONFIG_DEBUG_MISC */ #endif /* HAS_PPC_PMC_PA6T */ #ifdef HAS_PPC_PMC_IBM @@ -680,7 +680,7 @@ static struct device_attribute pa6t_attrs[] = { __ATTR(pmc3, 0600, show_pa6t_pmc3, store_pa6t_pmc3), __ATTR(pmc4, 0600, show_pa6t_pmc4, store_pa6t_pmc4), __ATTR(pmc5, 0600, show_pa6t_pmc5, store_pa6t_pmc5), -#ifdef CONFIG_DEBUG_KERNEL +#ifdef CONFIG_DEBUG_MISC __ATTR(hid0, 0600, show_hid0, store_hid0), __ATTR(hid1, 0600, show_hid1, store_hid1), __ATTR(hid4, 0600, show_hid4, store_hid4), @@ -709,7 +709,7 @@ static struct device_attribute pa6t_attrs[] = { __ATTR(tsr1, 0600, show_tsr1, store_tsr1), __ATTR(tsr2, 0600, show_tsr2, store_tsr2), __ATTR(tsr3, 0600, show_tsr3, store_tsr3), -#endif /* CONFIG_DEBUG_KERNEL */ +#endif /* CONFIG_DEBUG_MISC */ }; #endif /* HAS_PPC_PMC_PA6T */ #endif /* HAS_PPC_PMC_CLASSIC */ -- 2.21.0
Re: [GIT PULL] Please pull NFS client bugfixes
The pull request you sent on Sat, 13 Apr 2019 14:56:35 +: > git://git.linux-nfs.org/projects/trondmy/linux-nfs.git tags/nfs-for-5.1-4 has been merged into torvalds/linux.git: https://git.kernel.org/torvalds/c/b60bc0665e6af8c55b946b67ea8cb235823bb74e Thank you! -- Deet-doot-dot, I am a bot. https://korg.wiki.kernel.org/userdoc/prtracker
Re: [GIT PULL] clk fixes for v5.1-rc4
The pull request you sent on Fri, 12 Apr 2019 16:21:07 -0700: > https://git.kernel.org/pub/scm/linux/kernel/git/clk/linux.git > tags/clk-fixes-for-linus has been merged into torvalds/linux.git: https://git.kernel.org/torvalds/c/09bad0df3974ef9e682447f48bb7fd55c48513f3 Thank you! -- Deet-doot-dot, I am a bot. https://korg.wiki.kernel.org/userdoc/prtracker
Re: [GIT PULL] PCI fixes for v5.1
The pull request you sent on Fri, 12 Apr 2019 10:38:38 -0500: > git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git > tags/pci-v5.1-fixes-2 has been merged into torvalds/linux.git: https://git.kernel.org/torvalds/c/a3b84248628df52c39c8724fb69da1ea0f2c0dc7 Thank you! -- Deet-doot-dot, I am a bot. https://korg.wiki.kernel.org/userdoc/prtracker
Re: [PATCH] new flag COPY_FILE_RANGE_FILESIZE for copy_file_range()
> On Apr 13, 2019, at 1:49 PM, Shawn Landden wrote: > > If flags includes COPY_FILE_RANGE_FILESIZE then the length > copied is the length of the file. off_in and off_out are > ignored. len must be 0 or the file size. > > This implementation saves a call to stat() in the common case > of copying files. It does not fix any race conditions, but that > is possible in the future with this interface. > > EAGAIN: If COPY_FILE_RANGE_FILESIZE was passed and len is not 0 > or the file size. I think you’re asking for trouble here. I assume you have some kind of race prevention in mind here. The trouble is that passing zero means copy the whole thing, but the size-checking behavior is only available for nonzero sizes. This means that anyone who passes their idea of the size needs to account for inconsistent behavior if the size is zero. Also, what happens if the file size changes mid copy? I assume the result is more or less arbitrary, but you should document what behavior is allowed. The docs should cover the case where you race against an O_APPEND write.
> > Signed-off-by: Shawn Landden > CC: > --- > fs/read_write.c | 14 +- > include/uapi/linux/stat.h | 4 > 2 files changed, 17 insertions(+), 1 deletion(-) > > diff --git a/fs/read_write.c b/fs/read_write.c > index 61b43ad7608e..6d06361f0856 100644 > --- a/fs/read_write.c > +++ b/fs/read_write.c > @@ -1557,7 +1557,7 @@ ssize_t vfs_copy_file_range(struct file *file_in, > loff_t pos_in, >struct inode *inode_out = file_inode(file_out); >ssize_t ret; > > -if (flags != 0) > +if ((flags & ~COPY_FILE_RANGE_FILESIZE) != 0) >return -EINVAL; > >if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode)) > @@ -1565,6 +1565,18 @@ ssize_t vfs_copy_file_range(struct file *file_in, > loff_t pos_in, >if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode)) >return -EINVAL; > > +if (flags & COPY_FILE_RANGE_FILESIZE) { > +struct kstat stat; > +int error; > +error = vfs_getattr(&file_in->f_path, &stat, > +STATX_SIZE, 0); > +if (error < 0) > +return error; > +if (!(len == 0 || len == stat.size)) > +return -EAGAIN; > +len = stat.size; > +} > + >ret = rw_verify_area(READ, file_in, &pos_in, len); >if (unlikely(ret)) >return ret; > diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h > index 7b35e98d3c58..1075aa4666ef 100644 > --- a/include/uapi/linux/stat.h > +++ b/include/uapi/linux/stat.h > @@ -170,5 +170,9 @@ struct statx { > > #define STATX_ATTR_AUTOMOUNT0x1000 /* Dir: Automount trigger */ > > +/* > + * Flags for copy_file_range() > + */ > +#define COPY_FILE_RANGE_FILESIZE0x0001 /* Copy the full length of > the input file */ > > #endif /* _UAPI_LINUX_STAT_H */ > -- > 2.20.1 >
Re: Adding plain accesses and detecting data races in the LKMM
On Tue, Apr 09, 2019 at 08:01:32AM -0700, Paul E. McKenney wrote: > On Tue, Apr 09, 2019 at 03:36:18AM +0200, Andrea Parri wrote: > > > > The formula was more along the line of "do not assume either of these > > > > cases to hold; use barrier() if you need an unconditional barrier..." > > > > AFAICT, all current implementations of smp_mb__{before,after}_atomic() > > > > provide a compiler barrier with either barrier() or "memory" clobber. > > > > > > Well, we have two reasonable choices: Say that > > > smp_mb__{before,after}_atomic will always provide a compiler barrier, > > > or don't say this. I see no point in saying that the combination of > > > Before-atomic followed by RMW provides a barrier. > > > > ;-/ I'm fine with the first choice. I don't see how the second choice > > (this proposal/patch) would be consistent with some documentation and > > with the current implementations; for example, > > > > 1) Documentation/atomic_t.txt says: > > > > Thus: > > > > atomic_fetch_add(); > > > > is equivalent to: > > > > smp_mb__before_atomic(); > > atomic_fetch_add_relaxed(); > > smp_mb__after_atomic(); > > > > [...] > > > > 2) Some implementations of the _relaxed() variants do not provide any > > compiler barrier currently. > > But don't all implementations of smp_mb__before_atomic() and > smp_mb__after_atomic() currently supply a compiler barrier? Yes, AFAICS, all implementations of smp_mb__{before,after}_atomic() currently supply a compiler barrier. Nevertheless, there's a difference between: (1) Specify that these barriers supply a compiler barrier, (2) Specify that (certain) combinations of these barriers and RMWs supply a compiler barrier, and (3) This patch... ;-) FWIW, I'm not aware of current/informal documentation following (the arguably simpler but slightly stronger) (1). But again (amending my last remark): (1) and (2) both make sense to me. Thanx, Andrea > > Thanx, Paul >
[tip:x86/fpu] x86/pkeys: Add PKRU value to init_fpstate
Commit-ID: a5eff7259790d5314eff10563d6e59d358cce482 Gitweb: https://git.kernel.org/tip/a5eff7259790d5314eff10563d6e59d358cce482 Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:56 +0200 Committer: Borislav Petkov CommitDate: Fri, 12 Apr 2019 20:21:10 +0200 x86/pkeys: Add PKRU value to init_fpstate The task's initial PKRU value is set partly for fpu__clear()/ copy_init_pkru_to_fpregs(). It is not part of init_fpstate.xsave and instead it is set explicitly. If the user removes the PKRU state from XSAVE in the signal handler then __fpu__restore_sig() will restore the missing bits from `init_fpstate' and initialize the PKRU value to 0. Add the `init_pkru_value' to `init_fpstate' so it is set to the init value in such a case. In theory copy_init_pkru_to_fpregs() could be removed because restoring the PKRU at return-to-userland should be enough. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andi Kleen Cc: Andy Lutomirski Cc: "Chang S. Bae" Cc: Dominik Brodowski Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: "Jason A. 
Donenfeld" Cc: Konrad Rzeszutek Wilk Cc: kvm ML Cc: Paolo Bonzini Cc: Pavel Tatashin Cc: Peter Zijlstra Cc: Radim Krčmář Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-28-bige...@linutronix.de --- arch/x86/kernel/cpu/common.c | 5 + arch/x86/mm/pkeys.c | 6 ++ 2 files changed, 11 insertions(+) diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index cb28e98a0659..352fa19e6311 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -372,6 +372,8 @@ static bool pku_disabled; static __always_inline void setup_pku(struct cpuinfo_x86 *c) { + struct pkru_state *pk; + /* check the boot processor, plus compile options for PKU: */ if (!cpu_feature_enabled(X86_FEATURE_PKU)) return; @@ -382,6 +384,9 @@ static __always_inline void setup_pku(struct cpuinfo_x86 *c) return; cr4_set_bits(X86_CR4_PKE); + pk = get_xsave_addr(&init_fpstate.xsave, XFEATURE_PKRU); + if (pk) + pk->pkru = init_pkru_value; /* * Seting X86_CR4_PKE will cause the X86_FEATURE_OSPKE * cpuid bit to be set. We need to ensure that we diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c index 2ecbf4155f98..1dcfc91c8f0c 100644 --- a/arch/x86/mm/pkeys.c +++ b/arch/x86/mm/pkeys.c @@ -18,6 +18,7 @@ #include /* boot_cpu_has, ...*/ #include /* vma_pkey() */ +#include /* init_fpstate */ int __execute_only_pkey(struct mm_struct *mm) { @@ -161,6 +162,7 @@ static ssize_t init_pkru_read_file(struct file *file, char __user *user_buf, static ssize_t init_pkru_write_file(struct file *file, const char __user *user_buf, size_t count, loff_t *ppos) { + struct pkru_state *pk; char buf[32]; ssize_t len; u32 new_init_pkru; @@ -183,6 +185,10 @@ static ssize_t init_pkru_write_file(struct file *file, return -EINVAL; WRITE_ONCE(init_pkru_value, new_init_pkru); + pk = get_xsave_addr(&init_fpstate.xsave, XFEATURE_PKRU); + if (!pk) + return -EINVAL; + pk->pkru = new_init_pkru; return count; }
[tip:x86/fpu] x86/fpu: Restore regs in copy_fpstate_to_sigframe() in order to use the fastpath
Commit-ID: 06b251dff78704c7d122bd109384d970a7dbe94d Gitweb: https://git.kernel.org/tip/06b251dff78704c7d122bd109384d970a7dbe94d Author: Sebastian Andrzej Siewior AuthorDate: Fri, 12 Apr 2019 20:16:15 +0200 Committer: Borislav Petkov CommitDate: Fri, 12 Apr 2019 20:16:15 +0200 x86/fpu: Restore regs in copy_fpstate_to_sigframe() in order to use the fastpath If a task is scheduled out and receives a signal then it won't be able to take the fastpath because the registers aren't available. The slowpath is more expensive compared to XRSTOR + XSAVE which usually succeeds. Here are some clock_gettime() numbers from a bigger box with AVX512 during bootup: - __fpregs_load_activate() takes 140ns - 350ns. If it was the most recent FPU context on the CPU then the optimisation in __fpregs_load_activate() will skip the load (which was disabled during the test). - copy_fpregs_to_sigframe() takes 200ns - 450ns if it succeeds. On a pagefault it is 1.8us - 3us usually in the 2.6us area. - The slowpath takes 1.5us - 6us. Usually in the 2.6us area. My testcases (including lat_sig) take the fastpath without __fpregs_load_activate(). I expect this to be the majority. Since the slowpath is in the >1us area it makes sense to load the registers and attempt to save them directly. The direct save may fail but should only happen on the first invocation or after fork() while the page is read-only. [ bp: Massage a bit. ] Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A. 
Donenfeld" Cc: kvm ML Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-27-bige...@linutronix.de --- arch/x86/kernel/fpu/signal.c | 25 + 1 file changed, 13 insertions(+), 12 deletions(-) diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c index 3c3167576216..7026f1c4e5e3 100644 --- a/arch/x86/kernel/fpu/signal.c +++ b/arch/x86/kernel/fpu/signal.c @@ -175,20 +175,21 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size) (struct _fpstate_32 __user *) buf) ? -1 : 1; /* -* If we do not need to load the FPU registers at return to userspace -* then the CPU has the current state. Try to save it directly to -* userland's stack frame if it does not cause a pagefault. If it does, -* try the slowpath. +* Load the FPU registers if they are not valid for the current task. +* With a valid FPU state we can attempt to save the state directly to +* userland's stack frame which will likely succeed. If it does not, do +* the slowpath. */ fpregs_lock(); - if (!test_thread_flag(TIF_NEED_FPU_LOAD)) { - pagefault_disable(); - ret = copy_fpregs_to_sigframe(buf_fx); - pagefault_enable(); - if (ret) - copy_fpregs_to_fpstate(fpu); - set_thread_flag(TIF_NEED_FPU_LOAD); - } + if (test_thread_flag(TIF_NEED_FPU_LOAD)) + __fpregs_load_activate(); + + pagefault_disable(); + ret = copy_fpregs_to_sigframe(buf_fx); + pagefault_enable(); + if (ret && !test_thread_flag(TIF_NEED_FPU_LOAD)) + copy_fpregs_to_fpstate(fpu); + set_thread_flag(TIF_NEED_FPU_LOAD); fpregs_unlock(); if (ret) {
[tip:x86/fpu] x86/fpu: Add a fastpath to copy_fpstate_to_sigframe()
Commit-ID: da2f32fb8dc7cbd9433cb2e990693734b30a2465 Gitweb: https://git.kernel.org/tip/da2f32fb8dc7cbd9433cb2e990693734b30a2465 Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:54 +0200 Committer: Borislav Petkov CommitDate: Fri, 12 Apr 2019 20:05:36 +0200 x86/fpu: Add a fastpath to copy_fpstate_to_sigframe() Try to save the FPU registers directly to the userland stack frame if the CPU holds the FPU registers for the current task. This has to be done with the pagefault disabled because we can't fault (while the FPU registers are locked) and therefore the operation might fail. If it fails try the slowpath which can handle faults. [ bp: Massage a bit. ] Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A. Donenfeld" Cc: kvm ML Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-26-bige...@linutronix.de --- arch/x86/kernel/fpu/signal.c | 34 ++ 1 file changed, 22 insertions(+), 12 deletions(-) diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c index a1bd7be70206..3c3167576216 100644 --- a/arch/x86/kernel/fpu/signal.c +++ b/arch/x86/kernel/fpu/signal.c @@ -144,8 +144,10 @@ static inline int copy_fpregs_to_sigframe(struct xregs_state __user *buf) * buf == buf_fx for 64-bit frames and 32-bit fsave frame. * buf != buf_fx for 32-bit frames with fxstate. * - * Save the state to task's fpu->state and then copy it to the user frame - * pointed to by the aligned pointer 'buf_fx'. + * Try to save it directly to the user frame with disabled page fault handler. + * If this fails then do the slow path where the FPU state is first saved to + * task's fpu->state and then copy it to the user frame pointed to by the + * aligned pointer 'buf_fx'. 
* * If this is a 32-bit frame with fxstate, put a fsave header before * the aligned state at 'buf_fx'. @@ -159,6 +161,7 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size) struct xregs_state *xsave = >state.xsave; struct task_struct *tsk = current; int ia32_fxstate = (buf != buf_fx); + int ret = -EFAULT; ia32_fxstate &= (IS_ENABLED(CONFIG_X86_32) || IS_ENABLED(CONFIG_IA32_EMULATION)); @@ -173,23 +176,30 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size) /* * If we do not need to load the FPU registers at return to userspace -* then the CPU has the current state and we need to save it. Otherwise, -* it has already been done and we can skip it. +* then the CPU has the current state. Try to save it directly to +* userland's stack frame if it does not cause a pagefault. If it does, +* try the slowpath. */ fpregs_lock(); if (!test_thread_flag(TIF_NEED_FPU_LOAD)) { - copy_fpregs_to_fpstate(fpu); + pagefault_disable(); + ret = copy_fpregs_to_sigframe(buf_fx); + pagefault_enable(); + if (ret) + copy_fpregs_to_fpstate(fpu); set_thread_flag(TIF_NEED_FPU_LOAD); } fpregs_unlock(); - if (using_compacted_format()) { - if (copy_xstate_to_user(buf_fx, xsave, 0, size)) - return -1; - } else { - fpstate_sanitize_xstate(fpu); - if (__copy_to_user(buf_fx, xsave, fpu_user_xstate_size)) - return -1; + if (ret) { + if (using_compacted_format()) { + if (copy_xstate_to_user(buf_fx, xsave, 0, size)) + return -1; + } else { + fpstate_sanitize_xstate(fpu); + if (__copy_to_user(buf_fx, xsave, fpu_user_xstate_size)) + return -1; + } } /* Save the fsave header for the 32-bit frames. */
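Stripped of the x86 specifics, the control flow this patch introduces is "attempt the cheap direct save with faults disabled; only on failure, save to kernel memory and copy out through the faulting path". A minimal userspace sketch of that shape follows; all names are invented stand-ins, the "fault" is simulated by a flag, and no kernel API is involved.

```c
#include <assert.h>
#include <string.h>

static int fault_on_write;            /* simulates a user page that is not mapped yet */
static unsigned char user_frame[64];  /* stands in for the user-space signal frame */
static unsigned char kernel_copy[64]; /* stands in for the task's fpu->state buffer */

/* Fastpath: direct save; fails on the first touch, like a store done
 * with the pagefault handler disabled would. */
static int save_direct(const unsigned char *regs)
{
	if (fault_on_write) {
		fault_on_write = 0;  /* the slowpath's copy-out faults the page in */
		return -1;
	}
	memcpy(user_frame, regs, sizeof(user_frame));
	return 0;
}

/* The pattern: try the direct save first, fall back through kernel
 * memory only when it failed.  Returns 0 for the fastpath, 1 for the
 * slowpath. */
static int copy_state_to_frame(const unsigned char *regs)
{
	if (save_direct(regs) == 0)
		return 0;
	memcpy(kernel_copy, regs, sizeof(kernel_copy));
	memcpy(user_frame, kernel_copy, sizeof(user_frame));
	return 1;
}

/* First call hits the simulated fault and takes the slowpath; the
 * second call finds the "page" present and takes the fastpath. */
static int demo(void)
{
	unsigned char regs[64];

	memset(regs, 7, sizeof(regs));
	fault_on_write = 1;
	int slow = copy_state_to_frame(regs);
	int fast = copy_state_to_frame(regs);
	return slow == 1 && fast == 0 && user_frame[0] == 7;
}
```

As in the patch, the fallback is expected to be rare: it triggers on the first invocation or after fork(), when the frame's page is not yet writable.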
[tip:x86/fpu] x86/fpu: Add a fastpath to __fpu__restore_sig()
Commit-ID: 1d731e731c4cd7cbd3b1aa295f0932e7610da82f Gitweb: https://git.kernel.org/tip/1d731e731c4cd7cbd3b1aa295f0932e7610da82f Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:53 +0200 Committer: Borislav Petkov CommitDate: Fri, 12 Apr 2019 20:04:49 +0200 x86/fpu: Add a fastpath to __fpu__restore_sig() The previous commits refactor the restoration of the FPU registers so that they can be loaded from in-kernel memory. This overhead can be avoided if the load can be performed without a pagefault. Attempt to restore FPU registers by invoking copy_user_to_fpregs_zeroing(). If it fails try the slowpath which can handle pagefaults. [ bp: Add a comment over the fastpath to be able to find one's way around the function. ] Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A. Donenfeld" Cc: kvm ML Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-25-bige...@linutronix.de --- arch/x86/kernel/fpu/signal.c | 23 +-- 1 file changed, 21 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c index 6df1f15e0cd5..a1bd7be70206 100644 --- a/arch/x86/kernel/fpu/signal.c +++ b/arch/x86/kernel/fpu/signal.c @@ -242,10 +242,10 @@ sanitize_restored_xstate(union fpregs_state *state, /* * Restore the extended state if present. Otherwise, restore the FP/SSE state. 
*/ -static inline int copy_user_to_fpregs_zeroing(void __user *buf, u64 xbv, int fx_only) +static int copy_user_to_fpregs_zeroing(void __user *buf, u64 xbv, int fx_only) { if (use_xsave()) { - if ((unsigned long)buf % 64 || fx_only) { + if (fx_only) { u64 init_bv = xfeatures_mask & ~XFEATURE_MASK_FPSSE; copy_kernel_to_xregs(&init_fpstate.xsave, init_bv); return copy_user_to_fxregs(buf); @@ -327,8 +327,27 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size) if (ret) goto err_out; envp = &env; + } else { + /* +* Attempt to restore the FPU registers directly from user +* memory. For that to succeed, the user access cannot cause +* page faults. If it does, fall back to the slow path below, +* going through the kernel buffer with the enabled pagefault +* handler. +*/ + fpregs_lock(); + pagefault_disable(); + ret = copy_user_to_fpregs_zeroing(buf_fx, xfeatures, fx_only); + pagefault_enable(); + if (!ret) { + fpregs_mark_activate(); + fpregs_unlock(); + return 0; + } + fpregs_unlock(); } + if (use_xsave() && !fx_only) { u64 init_bv = xfeatures_mask & ~xfeatures;
[tip:x86/fpu] x86/fpu: Defer FPU state load until return to userspace
Commit-ID: 5f409e20b794565e2d60ad333e79334630a6c798 Gitweb: https://git.kernel.org/tip/5f409e20b794565e2d60ad333e79334630a6c798 Author: Rik van Riel AuthorDate: Wed, 3 Apr 2019 18:41:52 +0200 Committer: Borislav Petkov CommitDate: Fri, 12 Apr 2019 19:34:47 +0200 x86/fpu: Defer FPU state load until return to userspace Defer loading of FPU state until return to userspace. This gives the kernel the potential to skip loading FPU state for tasks that stay in kernel mode, or for tasks that end up with repeated invocations of kernel_fpu_begin() & kernel_fpu_end(). The fpregs_lock/unlock() section ensures that the registers remain unchanged. Otherwise a context switch or a bottom half could save the registers to its FPU context and the processor's FPU registers would become random if modified at the same time. KVM swaps the host/guest registers on the entry/exit path. This flow has been kept as is. First it ensures that the registers are loaded and then saves the current (host) state before it loads the guest's registers. The swap is done at the very end with disabled interrupts so it should not change anymore before the guest is entered. The read/save version seems to be cheaper compared to memcpy() in a micro benchmark. Each thread gets TIF_NEED_FPU_LOAD set as part of fork() / fpu__copy(). For kernel threads, this flag never gets cleared, which avoids saving / restoring the FPU state for kernel threads and during in-kernel usage of the FPU registers. [ bp: Correct and update commit message and fix checkpatch warnings. s/register/registers/ where it is used in plural. minor comment corrections. remove unused trace_x86_fpu_activate_state() TP. ] Signed-off-by: Rik van Riel Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: Aubrey Li Cc: Babu Moger Cc: "Chang S. Bae" Cc: Dmitry Safonov Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A. 
Donenfeld" Cc: Joerg Roedel Cc: Konrad Rzeszutek Wilk Cc: kvm ML Cc: Nicolai Stange Cc: Paolo Bonzini Cc: "Radim Krčmář" Cc: Tim Chen Cc: Waiman Long Cc: x86-ml Cc: Yi Wang Link: https://lkml.kernel.org/r/20190403164156.19645-24-bige...@linutronix.de --- arch/x86/entry/common.c | 10 +++- arch/x86/include/asm/fpu/api.h | 22 +++- arch/x86/include/asm/fpu/internal.h | 27 + arch/x86/include/asm/trace/fpu.h| 10 ++-- arch/x86/kernel/fpu/core.c | 106 +++- arch/x86/kernel/fpu/signal.c| 49 ++--- arch/x86/kernel/process.c | 2 +- arch/x86/kernel/process_32.c| 5 +- arch/x86/kernel/process_64.c| 5 +- arch/x86/kvm/x86.c | 20 +-- 10 files changed, 184 insertions(+), 72 deletions(-) diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index 7bc105f47d21..51beb8d29123 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -25,12 +25,13 @@ #include #include #include +#include #include #include #include -#include #include +#include #define CREATE_TRACE_POINTS #include @@ -196,6 +197,13 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs) if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS)) exit_to_usermode_loop(regs, cached_flags); + /* Reload ti->flags; we may have rescheduled above. */ + cached_flags = READ_ONCE(ti->flags); + + fpregs_assert_state_consistent(); + if (unlikely(cached_flags & _TIF_NEED_FPU_LOAD)) + switch_fpu_return(); + #ifdef CONFIG_COMPAT /* * Compat syscalls set TS_COMPAT. Make sure we clear it before diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h index 73e684160f35..b774c52e5411 100644 --- a/arch/x86/include/asm/fpu/api.h +++ b/arch/x86/include/asm/fpu/api.h @@ -10,7 +10,7 @@ #ifndef _ASM_X86_FPU_API_H #define _ASM_X86_FPU_API_H -#include +#include /* * Use kernel_fpu_begin/end() if you intend to use FPU in kernel context. 
It @@ -22,17 +22,37 @@ extern void kernel_fpu_begin(void); extern void kernel_fpu_end(void); extern bool irq_fpu_usable(void); +extern void fpregs_mark_activate(void); +/* + * Use fpregs_lock() while editing CPU's FPU registers or fpu->state. + * A context switch will (and softirq might) save CPU's FPU registers to + * fpu->state and set TIF_NEED_FPU_LOAD leaving CPU's FPU registers in + * a random state. + */ static inline void fpregs_lock(void) { preempt_disable(); + local_bh_disable(); } static inline void fpregs_unlock(void) { + local_bh_enable(); preempt_enable(); } +#ifdef CONFIG_X86_DEBUG_FPU +extern void fpregs_assert_state_consistent(void); +#else +static inline void fpregs_assert_state_consistent(void) { } +#endif + +/* + * Load the task FPU state before returning to userspace. + */
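The new fpregs_lock()/fpregs_unlock() pair encodes a nesting rule: preemption is disabled before bottom halves, and the two are released in the reverse order of acquisition. A userspace sketch with counters in place of the real primitives (all names here are stand-ins, not the kernel functions):

```c
#include <assert.h>

static int preempt_depth;  /* stands in for the preempt count */
static int bh_depth;       /* stands in for the softirq-disable count */

static void preempt_disable(void)  { preempt_depth++; }
static void preempt_enable(void)   { preempt_depth--; }
static void local_bh_disable(void) { bh_depth++; }

/* Release in the reverse order of acquisition; the assert merely
 * documents the nesting discipline (BHs re-enabled while preemption is
 * still off). */
static void local_bh_enable(void)
{
	assert(preempt_depth > 0);
	bh_depth--;
}

static void fpregs_lock(void)   { preempt_disable(); local_bh_disable(); }
static void fpregs_unlock(void) { local_bh_enable(); preempt_enable(); }

/* Both counts rise inside the critical section and return to zero
 * after the paired unlock. */
static int lock_unlock_balanced(void)
{
	fpregs_lock();
	int held = (preempt_depth == 1 && bh_depth == 1);
	fpregs_unlock();
	return held && preempt_depth == 0 && bh_depth == 0;
}
```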
misuse of fget_raw() in perf_event_get()
What's the point of using fget_raw(), if you do _not_ accept O_PATH descriptors? That should be fget()...
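For readers unfamiliar with the distinction: fget_raw() also resolves O_PATH descriptors, which only name a file and cannot be used for I/O. A small Linux-specific userspace illustration of why code that intends to read from the descriptor must filter such fds out (behaviour as documented in the open(2) man page):

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Open "/" as an O_PATH descriptor and show that read() on it fails
 * with EBADF: the descriptor identifies a location, nothing more. */
static int opath_read_fails(void)
{
	char buf[8];
	int fd = open("/", O_PATH);

	if (fd < 0)
		return -1;
	int r = (int)read(fd, buf, sizeof(buf));
	int err = errno;
	close(fd);
	return r == -1 && err == EBADF;
}
```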
[tip:x86/fpu] x86/fpu: Merge the two code paths in __fpu__restore_sig()
Commit-ID: c2ff9e9a3d9d6c019394a22989a228d02970a8b1 Gitweb: https://git.kernel.org/tip/c2ff9e9a3d9d6c019394a22989a228d02970a8b1 Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:51 +0200 Committer: Borislav Petkov CommitDate: Fri, 12 Apr 2019 15:41:25 +0200 x86/fpu: Merge the two code paths in __fpu__restore_sig() The ia32_fxstate case (32bit with fxsr) and the other (64bit frames or 32bit frames without fxsr) restore both from kernel memory and sanitize the content. The !ia32_fxstate version restores missing xstates from "init state" while the ia32_fxstate doesn't and skips it. Merge the two code paths and keep the !ia32_fxstate one. Copy only the user_i387_ia32_struct data structure in the ia32_fxstate. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A. Donenfeld" Cc: kvm ML Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-23-bige...@linutronix.de --- arch/x86/kernel/fpu/signal.c | 139 +-- 1 file changed, 54 insertions(+), 85 deletions(-) diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c index 9ea1eaa4c9b1..b13e86b29426 100644 --- a/arch/x86/kernel/fpu/signal.c +++ b/arch/x86/kernel/fpu/signal.c @@ -263,12 +263,17 @@ static inline int copy_user_to_fpregs_zeroing(void __user *buf, u64 xbv, int fx_ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size) { + struct user_i387_ia32_struct *envp = NULL; + int state_size = fpu_kernel_xstate_size; int ia32_fxstate = (buf != buf_fx); struct task_struct *tsk = current; struct fpu *fpu = >thread.fpu; - int state_size = fpu_kernel_xstate_size; + struct user_i387_ia32_struct env; + union fpregs_state *state; u64 xfeatures = 0; int fx_only = 0; + int ret = 0; + void *tmp; ia32_fxstate &= (IS_ENABLED(CONFIG_X86_32) || 
IS_ENABLED(CONFIG_IA32_EMULATION)); @@ -303,105 +308,69 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size) } } + tmp = kzalloc(sizeof(*state) + fpu_kernel_xstate_size + 64, GFP_KERNEL); + if (!tmp) + return -ENOMEM; + state = PTR_ALIGN(tmp, 64); + + if ((unsigned long)buf_fx % 64) + fx_only = 1; + + /* +* For 32-bit frames with fxstate, copy the fxstate so it can be +* reconstructed later. +*/ if (ia32_fxstate) { - /* -* For 32-bit frames with fxstate, copy the user state to the -* thread's fpu state, reconstruct fxstate from the fsave -* header. Validate and sanitize the copied state. -*/ - struct user_i387_ia32_struct env; - union fpregs_state *state; - int err = 0; - void *tmp; + ret = __copy_from_user(&env, buf, sizeof(env)); + if (ret) + goto err_out; + envp = &env; + } - tmp = kzalloc(sizeof(*state) + fpu_kernel_xstate_size + 64, GFP_KERNEL); - if (!tmp) - return -ENOMEM; - state = PTR_ALIGN(tmp, 64); + if (use_xsave() && !fx_only) { + u64 init_bv = xfeatures_mask & ~xfeatures; if (using_compacted_format()) { - err = copy_user_to_xstate(&state->xsave, buf_fx); + ret = copy_user_to_xstate(&state->xsave, buf_fx); } else { - err = __copy_from_user(&state->xsave, buf_fx, state_size); - if (!err && state_size > offsetof(struct xregs_state, header)) - err = validate_xstate_header(&state->xsave.header); + ret = __copy_from_user(&state->xsave, buf_fx, state_size); + + if (!ret && state_size > offsetof(struct xregs_state, header)) + ret = validate_xstate_header(&state->xsave.header); } + if (ret) + goto err_out; - if (err || __copy_from_user(&env, buf, sizeof(env))) { - err = -1; - } else { - sanitize_restored_xstate(state, &env, xfeatures, fx_only); - copy_kernel_to_fpregs(state); - } - - kfree(tmp); - return err; - } else { - union fpregs_state *state; - void *tmp; - int ret; - - tmp = kzalloc(sizeof(*state) + fpu_kernel_xstate_size + 64, GFP_KERNEL); - if (!tmp) - return -ENOMEM; -
[tip:x86/fpu] x86/fpu: Restore from kernel memory on the 64-bit path too
Commit-ID: 926b21f37b072ae4c117052de45a975c6d468fec Gitweb: https://git.kernel.org/tip/926b21f37b072ae4c117052de45a975c6d468fec Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:50 +0200 Committer: Borislav Petkov CommitDate: Fri, 12 Apr 2019 15:02:41 +0200 x86/fpu: Restore from kernel memory on the 64-bit path too The 64-bit case (both 64-bit and 32-bit frames) loads the new state from user memory. However, doing this is not desired if the FPU state is going to be restored on return to userland: it would be required to disable preemption in order to avoid a context switch which would set TIF_NEED_FPU_LOAD. If this happens before the restore operation then the loaded registers would become volatile. Furthermore, disabling preemption while accessing user memory requires to disable the pagefault handler. An error during FXRSTOR would then mean that either a page fault occurred (and it would have to be retried with enabled page fault handler) or a #GP occurred because the xstate is bogus (after all, the signal handler can modify it). In order to avoid that mess, copy the FPU state from userland, validate it and then load it. The copy_kernel_…() helpers are basically just like the old helpers except that they operate on kernel memory and the fault handler just sets the error value and the caller handles it. copy_user_to_fpregs_zeroing() and its helpers remain and will be used later for a fastpath optimisation. [ bp: Clarify commit message. ] Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: Aubrey Li Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A. 
Donenfeld" Cc: kvm ML Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-22-bige...@linutronix.de --- arch/x86/include/asm/fpu/internal.h | 43 + arch/x86/kernel/fpu/signal.c| 62 + 2 files changed, 92 insertions(+), 13 deletions(-) diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h index da75d7b3e37d..2cf04fbcba5d 100644 --- a/arch/x86/include/asm/fpu/internal.h +++ b/arch/x86/include/asm/fpu/internal.h @@ -121,6 +121,21 @@ extern void fpstate_sanitize_xstate(struct fpu *fpu); err;\ }) +#define kernel_insn_err(insn, output, input...) \ +({ \ + int err;\ + asm volatile("1:" #insn "\n\t" \ +"2:\n" \ +".section .fixup,\"ax\"\n" \ +"3: movl $-1,%[err]\n"\ +"jmp 2b\n"\ +".previous\n" \ +_ASM_EXTABLE(1b, 3b) \ +: [err] "=r" (err), output \ +: "0"(0), input); \ + err;\ +}) + #define kernel_insn(insn, output, input...)\ asm volatile("1:" #insn "\n\t" \ "2:\n" \ @@ -149,6 +164,14 @@ static inline void copy_kernel_to_fxregs(struct fxregs_state *fx) kernel_insn(fxrstorq %[fx], "=m" (*fx), [fx] "m" (*fx)); } +static inline int copy_kernel_to_fxregs_err(struct fxregs_state *fx) +{ + if (IS_ENABLED(CONFIG_X86_32)) + return kernel_insn_err(fxrstor %[fx], "=m" (*fx), [fx] "m" (*fx)); + else + return kernel_insn_err(fxrstorq %[fx], "=m" (*fx), [fx] "m" (*fx)); +} + static inline int copy_user_to_fxregs(struct fxregs_state __user *fx) { if (IS_ENABLED(CONFIG_X86_32)) @@ -162,6 +185,11 @@ static inline void copy_kernel_to_fregs(struct fregs_state *fx) kernel_insn(frstor %[fx], "=m" (*fx), [fx] "m" (*fx)); } +static inline int copy_kernel_to_fregs_err(struct fregs_state *fx) +{ + return kernel_insn_err(frstor %[fx], "=m" (*fx), [fx] "m" (*fx)); +} + static inline int copy_user_to_fregs(struct fregs_state __user *fx) { return user_insn(frstor %[fx], "=m" (*fx), [fx] "m" (*fx)); @@ -361,6 +389,21 @@ static inline int copy_user_to_xregs(struct xregs_state __user *buf, u64 mask) return err; 
} +/* + * Restore xstate from kernel space xsave area, return an error code instead of + * an exception. + */ +static inline int copy_kernel_to_xregs_err(struct xregs_state
[tip:x86/fpu] x86/fpu: Inline copy_user_to_fpregs_zeroing()
Commit-ID: e0d3602f933367881bddfff310a744e6e61c284c Gitweb: https://git.kernel.org/tip/e0d3602f933367881bddfff310a744e6e61c284c Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:49 +0200 Committer: Borislav Petkov CommitDate: Thu, 11 Apr 2019 20:45:20 +0200 x86/fpu: Inline copy_user_to_fpregs_zeroing() Start refactoring __fpu__restore_sig() by inlining copy_user_to_fpregs_zeroing(). The original function remains and will be used to restore from userland memory if possible. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A. Donenfeld" Cc: kvm ML Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-21-bige...@linutronix.de --- arch/x86/kernel/fpu/signal.c | 20 +++- 1 file changed, 19 insertions(+), 1 deletion(-) diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c index 9b9dfdc96285..c2ff43fbbd07 100644 --- a/arch/x86/kernel/fpu/signal.c +++ b/arch/x86/kernel/fpu/signal.c @@ -337,11 +337,29 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size) kfree(tmp); return err; } else { + int ret; + /* * For 64-bit frames and 32-bit fsave frames, restore the user * state to the registers directly (with exceptions handled). 
*/ - if (copy_user_to_fpregs_zeroing(buf_fx, xfeatures, fx_only)) { + if (use_xsave()) { + if ((unsigned long)buf_fx % 64 || fx_only) { + u64 init_bv = xfeatures_mask & ~XFEATURE_MASK_FPSSE; + copy_kernel_to_xregs(&init_fpstate.xsave, init_bv); + ret = copy_user_to_fxregs(buf_fx); + } else { + u64 init_bv = xfeatures_mask & ~xfeatures; + if (unlikely(init_bv)) + copy_kernel_to_xregs(&init_fpstate.xsave, init_bv); + ret = copy_user_to_xregs(buf_fx, xfeatures); + } + } else if (use_fxsr()) { + ret = copy_user_to_fxregs(buf_fx); + } else + ret = copy_user_to_fregs(buf_fx); + + if (ret) { fpu__clear(fpu); return -1; }
[tip:x86/fpu] x86/entry: Add TIF_NEED_FPU_LOAD
Commit-ID: 383c252545edcc708128e2028a2318b05c45ede4 Gitweb: https://git.kernel.org/tip/383c252545edcc708128e2028a2318b05c45ede4 Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:45 +0200 Committer: Borislav Petkov CommitDate: Thu, 11 Apr 2019 16:21:51 +0200 x86/entry: Add TIF_NEED_FPU_LOAD Add TIF_NEED_FPU_LOAD. This flag is used for loading the FPU registers before returning to userland. It must not be set on systems without a FPU. If this flag is cleared, the CPU's FPU registers hold the latest, up-to-date content of the current task's (current()) FPU registers. The in-memory copy (union fpregs_state) is not valid. If this flag is set, then all of CPU's FPU registers may hold a random value (except for PKRU) and it is required to load the content of the FPU registers on return to userland. Introduce it now as a preparatory change before adding the main feature. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: Aubrey Li Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A. Donenfeld" Cc: Konrad Rzeszutek Wilk Cc: kvm ML Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Rik van Riel Cc: Tim Chen Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-17-bige...@linutronix.de --- arch/x86/include/asm/fpu/internal.h | 8 arch/x86/include/asm/thread_info.h | 2 ++ 2 files changed, 10 insertions(+) diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h index 6eb4a0b1ad0e..da75d7b3e37d 100644 --- a/arch/x86/include/asm/fpu/internal.h +++ b/arch/x86/include/asm/fpu/internal.h @@ -508,6 +508,14 @@ static inline void __fpregs_load_activate(struct fpu *fpu, int cpu) * - switch_fpu_finish() restores the new state as *necessary. * + * If TIF_NEED_FPU_LOAD is cleared then the CPU's FPU registers + * are saved in the current thread's FPU register state. 
+ * + * If TIF_NEED_FPU_LOAD is set then CPU's FPU registers may not + * hold current()'s FPU registers. It is required to load the + * registers before returning to userland or using the content + * otherwise. + * * The FPU context is only stored/restored for a user task and * ->mm is used to distinguish between kernel and user threads. */ diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h index e0eccbcb8447..f9453536f9bb 100644 --- a/arch/x86/include/asm/thread_info.h +++ b/arch/x86/include/asm/thread_info.h @@ -88,6 +88,7 @@ struct thread_info { #define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */ #define TIF_UPROBE 12 /* breakpointed or singlestepping */ #define TIF_PATCH_PENDING 13 /* pending live patching update */ +#define TIF_NEED_FPU_LOAD 14 /* load FPU on return to userspace */ #define TIF_NOCPUID15 /* CPUID is not accessible in userland */ #define TIF_NOTSC 16 /* TSC is not accessible in userland */ #define TIF_IA32 17 /* IA32 compatibility process */ @@ -117,6 +118,7 @@ struct thread_info { #define _TIF_USER_RETURN_NOTIFY(1 << TIF_USER_RETURN_NOTIFY) #define _TIF_UPROBE(1 << TIF_UPROBE) #define _TIF_PATCH_PENDING (1 << TIF_PATCH_PENDING) +#define _TIF_NEED_FPU_LOAD (1 << TIF_NEED_FPU_LOAD) #define _TIF_NOCPUID (1 << TIF_NOCPUID) #define _TIF_NOTSC (1 << TIF_NOTSC) #define _TIF_IA32 (1 << TIF_IA32)
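The TIF_NEED_FPU_LOAD definition above follows the usual kernel convention of a bit number plus a derived mask. A self-contained sketch of that convention and the test/set/clear helpers it supports (userspace stand-ins, not the kernel's thread_info accessors):

```c
#include <assert.h>

#define TIF_NEED_FPU_LOAD	14
#define _TIF_NEED_FPU_LOAD	(1UL << TIF_NEED_FPU_LOAD)

static unsigned long ti_flags;  /* stands in for thread_info.flags */

static void set_thread_flag(int flag)   { ti_flags |= 1UL << flag; }
static void clear_thread_flag(int flag) { ti_flags &= ~(1UL << flag); }
static int  test_thread_flag(int flag)  { return !!(ti_flags & (1UL << flag)); }

/* Setting the flag makes both the helper and the mask form observe it;
 * clearing it returns the word to its previous state. */
static int flag_roundtrip(void)
{
	set_thread_flag(TIF_NEED_FPU_LOAD);
	int on = test_thread_flag(TIF_NEED_FPU_LOAD) &&
		 (ti_flags & _TIF_NEED_FPU_LOAD) != 0;
	clear_thread_flag(TIF_NEED_FPU_LOAD);
	return on && !test_thread_flag(TIF_NEED_FPU_LOAD);
}
```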
[tip:x86/fpu] x86/fpu: Update xstate's PKRU value on write_pkru()
Commit-ID: 0d714dba162620fd8b9f5b3104a487e041353c4d Gitweb: https://git.kernel.org/tip/0d714dba162620fd8b9f5b3104a487e041353c4d Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:48 +0200 Committer: Borislav Petkov CommitDate: Thu, 11 Apr 2019 20:33:29 +0200 x86/fpu: Update xstate's PKRU value on write_pkru() During the context switch the xstate is loaded which also includes the PKRU value. If xstate is restored on return to userland it is required that the PKRU value in xstate is the same as the one in the CPU. Save the PKRU in xstate during modification. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andi Kleen Cc: Andy Lutomirski Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: "Jason A. Donenfeld" Cc: Joerg Roedel Cc: Juergen Gross Cc: "Kirill A. Shutemov" Cc: kvm ML Cc: Michal Hocko Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-20-bige...@linutronix.de --- arch/x86/include/asm/pgtable.h | 21 +++-- 1 file changed, 19 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 9beb371b1adf..5cfbbb6d458d 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -23,6 +23,8 @@ #ifndef __ASSEMBLY__ #include +#include +#include extern pgd_t early_top_pgt[PTRS_PER_PGD]; int __init __early_make_pgtable(unsigned long address, pmdval_t pmd); @@ -133,8 +135,23 @@ static inline u32 read_pkru(void) static inline void write_pkru(u32 pkru) { - if (boot_cpu_has(X86_FEATURE_OSPKE)) - __write_pkru(pkru); + struct pkru_state *pk; + + if (!boot_cpu_has(X86_FEATURE_OSPKE)) + return; + + pk = get_xsave_addr(>thread.fpu.state.xsave, XFEATURE_PKRU); + + /* +* The PKRU value in xstate needs to be in sync with the value that is +* written to the CPU. The FPU restore on return to userland would +* otherwise load the previous value again. 
+*/ + fpregs_lock(); + if (pk) + pk->pkru = pkru; + __write_pkru(pkru); + fpregs_unlock(); } static inline int pte_young(pte_t pte)
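The invariant this patch enforces generalizes to any cached copy of hardware state: a write must update both the in-memory copy and the "register", or a later restore resurrects the stale value. A toy model of that failure mode and its fix (all names invented, no real PKRU involved):

```c
#include <assert.h>

static unsigned int hw_pkru;      /* stands in for the CPU's PKRU register */
static unsigned int xstate_pkru;  /* stands in for the PKRU slot in the xsave area */

/* Write both copies, mirroring what the patched write_pkru() does. */
static void write_pkru_synced(unsigned int val)
{
	xstate_pkru = val;
	hw_pkru = val;
}

/* A context "restore" loads the register from the saved state. */
static void restore_from_xstate(void)
{
	hw_pkru = xstate_pkru;
}

/* Even if the "register" is clobbered between write and restore, the
 * restore reinstates the value written, because the saved copy was
 * kept in sync. */
static int no_stale_value_after_restore(void)
{
	write_pkru_synced(0x55);
	hw_pkru = 0;
	restore_from_xstate();
	return hw_pkru == 0x55;
}
```

Had write_pkru_synced() updated only hw_pkru, the restore would have brought back the previous xstate value, which is exactly the bug the patch closes.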
[tip:x86/fpu] x86/fpu: Prepare copy_fpstate_to_sigframe() for TIF_NEED_FPU_LOAD
Commit-ID: a352a3b7b7920212ee4c45a41500c66826318e92 Gitweb: https://git.kernel.org/tip/a352a3b7b7920212ee4c45a41500c66826318e92 Author: Rik van Riel AuthorDate: Wed, 3 Apr 2019 18:41:47 +0200 Committer: Borislav Petkov CommitDate: Thu, 11 Apr 2019 18:20:04 +0200 x86/fpu: Prepare copy_fpstate_to_sigframe() for TIF_NEED_FPU_LOAD The FPU registers need only to be saved if TIF_NEED_FPU_LOAD is not set. Otherwise this has been already done and can be skipped. [ bp: Massage a bit. ] Signed-off-by: Rik van Riel Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A. Donenfeld" Cc: kvm ML Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-19-bige...@linutronix.de --- arch/x86/kernel/fpu/signal.c | 12 +++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c index 8f23f5237218..9b9dfdc96285 100644 --- a/arch/x86/kernel/fpu/signal.c +++ b/arch/x86/kernel/fpu/signal.c @@ -171,7 +171,17 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size) sizeof(struct user_i387_ia32_struct), NULL, (struct _fpstate_32 __user *) buf) ? -1 : 1; - copy_fpregs_to_fpstate(fpu); + /* +* If we do not need to load the FPU registers at return to userspace +* then the CPU has the current state and we need to save it. Otherwise, +* it has already been done and we can skip it. +*/ + fpregs_lock(); + if (!test_thread_flag(TIF_NEED_FPU_LOAD)) { + copy_fpregs_to_fpstate(fpu); + set_thread_flag(TIF_NEED_FPU_LOAD); + } + fpregs_unlock(); if (using_compacted_format()) { if (copy_xstate_to_user(buf_fx, xsave, 0, size))
[tip:x86/fpu] x86/fpu: Always store the registers in copy_fpstate_to_sigframe()
Commit-ID: 69277c98f5eef0d9839699b7825923c3985f665f Gitweb: https://git.kernel.org/tip/69277c98f5eef0d9839699b7825923c3985f665f Author: Rik van Riel AuthorDate: Wed, 3 Apr 2019 18:41:46 +0200 Committer: Borislav Petkov CommitDate: Thu, 11 Apr 2019 18:08:57 +0200 x86/fpu: Always store the registers in copy_fpstate_to_sigframe() copy_fpstate_to_sigframe() stores the registers directly to user space. This is okay because the FPU registers are valid and saving them directly avoids saving them into kernel memory and making a copy. However, this cannot be done anymore if the FPU registers are going to be restored on the return to userland. It is possible that the FPU registers will be invalidated in the middle of the save operation and this should be done with disabled preemption / BH. Save the FPU registers to the task's FPU struct and copy them to the user memory later on. Signed-off-by: Rik van Riel Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A. Donenfeld" Cc: kvm ML Cc: Paolo Bonzini Cc: Radim Krčmář Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-18-bige...@linutronix.de --- arch/x86/kernel/fpu/signal.c | 19 ++- 1 file changed, 14 insertions(+), 5 deletions(-) diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c index 155f4552413e..8f23f5237218 100644 --- a/arch/x86/kernel/fpu/signal.c +++ b/arch/x86/kernel/fpu/signal.c @@ -144,8 +144,8 @@ static inline int copy_fpregs_to_sigframe(struct xregs_state __user *buf) * buf == buf_fx for 64-bit frames and 32-bit fsave frame. * buf != buf_fx for 32-bit frames with fxstate. * - * Save the state directly to the user frame pointed by the aligned pointer - * 'buf_fx'. + * Save the state to task's fpu->state and then copy it to the user frame + * pointed to by the aligned pointer 'buf_fx'. 
* * If this is a 32-bit frame with fxstate, put a fsave header before * the aligned state at 'buf_fx'. @@ -155,6 +155,8 @@ static inline int copy_fpregs_to_sigframe(struct xregs_state __user *buf) */ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size) { + struct fpu *fpu = &current->thread.fpu; + struct xregs_state *xsave = &fpu->state.xsave; struct task_struct *tsk = current; int ia32_fxstate = (buf != buf_fx); @@ -169,9 +171,16 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size) sizeof(struct user_i387_ia32_struct), NULL, (struct _fpstate_32 __user *) buf) ? -1 : 1; - /* Save the live registers state to the user frame directly. */ - if (copy_fpregs_to_sigframe(buf_fx)) - return -1; + copy_fpregs_to_fpstate(fpu); + + if (using_compacted_format()) { + if (copy_xstate_to_user(buf_fx, xsave, 0, size)) + return -1; + } else { + fpstate_sanitize_xstate(fpu); + if (__copy_to_user(buf_fx, xsave, fpu_user_xstate_size)) + return -1; + } /* Save the fsave header for the 32-bit frames. */ if ((ia32_fxstate || !use_fxsr()) && save_fsave_header(tsk, buf))
[tip:x86/fpu] x86/fpu: Eager switch PKRU state
Commit-ID: 0cecca9d03c964abbd2b7927d0670eb70db4ebf2 Gitweb: https://git.kernel.org/tip/0cecca9d03c964abbd2b7927d0670eb70db4ebf2 Author: Rik van Riel AuthorDate: Wed, 3 Apr 2019 18:41:44 +0200 Committer: Borislav Petkov CommitDate: Thu, 11 Apr 2019 15:57:10 +0200 x86/fpu: Eager switch PKRU state While most of a task's FPU state is only needed in user space, the protection keys need to be in place immediately after a context switch. The reason is that any access to userspace memory while running in kernel mode also needs to abide by the memory permissions specified in the protection keys. The "eager switch" is a preparation for loading the FPU state on return to userland. Instead of decoupling PKRU state from xstate, update PKRU within xstate on write operations by the kernel. For user tasks the PKRU should be always read from the xsave area and it should not change anything because the PKRU value was loaded as part of FPU restore. For kernel threads the default "init_pkru_value" will be written. Before this commit, the kernel thread would end up with a random value which it inherited from the previous user task. [ bigeasy: save pkru to xstate, no cache, don't use __raw_xsave_addr() ] [ bp: update commit message, sort headers properly in asm/fpu/xstate.h ] Signed-off-by: Rik van Riel Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andi Kleen Cc: Andy Lutomirski Cc: Aubrey Li Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A. Donenfeld" Cc: Joerg Roedel Cc: Juergen Gross Cc: "Kirill A. 
Shutemov" Cc: kvm ML Cc: Michal Hocko Cc: Paolo Bonzini Cc: Peter Zijlstra Cc: Radim Krčmář Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-16-bige...@linutronix.de --- arch/x86/include/asm/fpu/internal.h | 24 ++-- arch/x86/include/asm/fpu/xstate.h | 4 +++- arch/x86/include/asm/pgtable.h | 6 ++ arch/x86/mm/pkeys.c | 1 - 4 files changed, 31 insertions(+), 4 deletions(-) diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h index 3e0c2c496f2d..6eb4a0b1ad0e 100644 --- a/arch/x86/include/asm/fpu/internal.h +++ b/arch/x86/include/asm/fpu/internal.h @@ -14,6 +14,7 @@ #include #include #include +#include #include #include @@ -534,8 +535,27 @@ switch_fpu_prepare(struct fpu *old_fpu, int cpu) */ static inline void switch_fpu_finish(struct fpu *new_fpu, int cpu) { - if (static_cpu_has(X86_FEATURE_FPU)) - __fpregs_load_activate(new_fpu, cpu); + u32 pkru_val = init_pkru_value; + struct pkru_state *pk; + + if (!static_cpu_has(X86_FEATURE_FPU)) + return; + + __fpregs_load_activate(new_fpu, cpu); + + if (!cpu_feature_enabled(X86_FEATURE_OSPKE)) + return; + + /* +* PKRU state is switched eagerly because it needs to be valid before we +* return to userland e.g. for a copy_to_user() operation. 
+*/ + if (current->mm) { + pk = get_xsave_addr(&new_fpu->state.xsave, XFEATURE_PKRU); + if (pk) + pkru_val = pk->pkru; + } + __write_pkru(pkru_val); } /* diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h index fbe41f808e5d..7e42b285c856 100644 --- a/arch/x86/include/asm/fpu/xstate.h +++ b/arch/x86/include/asm/fpu/xstate.h @@ -2,9 +2,11 @@ #ifndef __ASM_X86_XSAVE_H #define __ASM_X86_XSAVE_H +#include #include + #include -#include +#include /* Bit 63 of XCR0 is reserved for future expansion */ #define XFEATURE_MASK_EXTEND (~(XFEATURE_MASK_FPSSE | (1ULL << 63))) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index e8875ca75623..9beb371b1adf 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -1355,6 +1355,12 @@ static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd) #define PKRU_WD_BIT 0x2 #define PKRU_BITS_PER_PKEY 2 +#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS +extern u32 init_pkru_value; +#else +#define init_pkru_value 0 +#endif + static inline bool __pkru_allows_read(u32 pkru, u16 pkey) { int pkru_pkey_bits = pkey * PKRU_BITS_PER_PKEY; diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c index 50f65fc1b9a3..2ecbf4155f98 100644 --- a/arch/x86/mm/pkeys.c +++ b/arch/x86/mm/pkeys.c @@ -126,7 +126,6 @@ int __arch_override_mprotect_pkey(struct vm_area_struct *vma, int prot, int pkey * in the process's lifetime will not accidentally get access * to data which is pkey-protected later on. */ -static u32 init_pkru_value = PKRU_AD_KEY( 1) | PKRU_AD_KEY( 2) | PKRU_AD_KEY( 3) | PKRU_AD_KEY( 4) | PKRU_AD_KEY( 5) | PKRU_AD_KEY( 6) | PKRU_AD_KEY( 7) | PKRU_AD_KEY( 8) | PKRU_AD_KEY( 9) |
[tip:x86/fpu] x86/pkeys: Don't check if PKRU is zero before writing it
Commit-ID: 0556cbdc2fbcb3068e5b924a8b3d5386ae0dd27d Gitweb: https://git.kernel.org/tip/0556cbdc2fbcb3068e5b924a8b3d5386ae0dd27d Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:43 +0200 Committer: Borislav Petkov CommitDate: Thu, 11 Apr 2019 15:41:05 +0200 x86/pkeys: Don't check if PKRU is zero before writing it __write_pkru() now checks whether the current value is the same as the value about to be written and skips the write whenever the two match, which covers more than the old special case of both values being zero. Remove the zero check of PKRU; __write_pkru() provides such a check now. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: "Jason A. Donenfeld" Cc: kvm ML Cc: Paolo Bonzini Cc: Peter Zijlstra Cc: Radim Krčmář Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-15-bige...@linutronix.de --- arch/x86/mm/pkeys.c | 7 --- 1 file changed, 7 deletions(-) diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c index 05bb9a44eb1c..50f65fc1b9a3 100644 --- a/arch/x86/mm/pkeys.c +++ b/arch/x86/mm/pkeys.c @@ -142,13 +142,6 @@ u32 init_pkru_value = PKRU_AD_KEY( 1) | PKRU_AD_KEY( 2) | PKRU_AD_KEY( 3) | void copy_init_pkru_to_fpregs(void) { u32 init_pkru_value_snapshot = READ_ONCE(init_pkru_value); - /* -* Any write to PKRU takes it out of the XSAVE 'init -* state' which increases context switch cost. Avoid -* writing 0 when PKRU was already 0. -*/ - if (!init_pkru_value_snapshot && !read_pkru()) - return; /* * Override the PKRU state that came from 'init_fpstate' * with the baseline from the process.
[tip:x86/fpu] x86/fpu: Use a feature number instead of mask in two more helpers
Commit-ID: abd16d68d65229e5acafdadc32704239131bf2ea Gitweb: https://git.kernel.org/tip/abd16d68d65229e5acafdadc32704239131bf2ea Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:40 +0200 Committer: Borislav Petkov CommitDate: Wed, 10 Apr 2019 18:20:27 +0200 x86/fpu: Use a feature number instead of mask in two more helpers After changing the argument of __raw_xsave_addr() from a mask to number Dave suggested to check if it makes sense to do the same for get_xsave_addr(). As it turns out it does. Only get_xsave_addr() needs the mask to check if the requested feature is part of what is supported/saved and then uses the number again. The shift operation is cheaper compared to fls64() (find last bit set). Also, the feature number uses less opcode space compared to the mask. :) Make the get_xsave_addr() argument a xfeature number instead of a mask and fix up its callers. Furthermore, use xfeature_nr and xfeature_mask consistently. This results in the following changes to the kvm code: feature -> xfeature_mask index -> xfeature_nr Suggested-by: Dave Hansen Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: "Eric W. Biederman" Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A. 
Donenfeld" Cc: kvm ML Cc: Masami Hiramatsu Cc: Paolo Bonzini Cc: Peter Zijlstra Cc: "Radim Krčmář" Cc: Rasmus Villemoes Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: Siarhei Liakh Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-12-bige...@linutronix.de --- arch/x86/include/asm/fpu/xstate.h | 4 ++-- arch/x86/kernel/fpu/xstate.c | 22 ++ arch/x86/kernel/traps.c | 2 +- arch/x86/kvm/x86.c| 28 ++-- arch/x86/mm/mpx.c | 6 +++--- 5 files changed, 30 insertions(+), 32 deletions(-) diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h index 48581988d78c..fbe41f808e5d 100644 --- a/arch/x86/include/asm/fpu/xstate.h +++ b/arch/x86/include/asm/fpu/xstate.h @@ -46,8 +46,8 @@ extern void __init update_regset_xstate_info(unsigned int size, u64 xstate_mask); void fpu__xstate_clear_all_cpu_caps(void); -void *get_xsave_addr(struct xregs_state *xsave, int xstate); -const void *get_xsave_field_ptr(int xstate_field); +void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr); +const void *get_xsave_field_ptr(int xfeature_nr); int using_compacted_format(void); int copy_xstate_to_kernel(void *kbuf, struct xregs_state *xsave, unsigned int offset, unsigned int size); int copy_xstate_to_user(void __user *ubuf, struct xregs_state *xsave, unsigned int offset, unsigned int size); diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c index 4f7f3c5d0d0c..9c459fd1d38e 100644 --- a/arch/x86/kernel/fpu/xstate.c +++ b/arch/x86/kernel/fpu/xstate.c @@ -830,15 +830,14 @@ static void *__raw_xsave_addr(struct xregs_state *xsave, int xfeature_nr) * * Inputs: * xstate: the thread's storage area for all FPU data - * xstate_feature: state which is defined in xsave.h (e.g. - * XFEATURE_MASK_FP, XFEATURE_MASK_SSE, etc...) + * xfeature_nr: state which is defined in xsave.h (e.g. XFEATURE_FP, + * XFEATURE_SSE, etc...) * Output: * address of the state in the xsave area, or NULL if the * field is not present in the xsave buffer. 
*/ -void *get_xsave_addr(struct xregs_state *xsave, int xstate_feature) +void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr) { - int xfeature_nr; /* * Do we even *have* xsave state? */ @@ -850,11 +849,11 @@ void *get_xsave_addr(struct xregs_state *xsave, int xstate_feature) * have not enabled. Remember that pcntxt_mask is * what we write to the XCR0 register. */ - WARN_ONCE(!(xfeatures_mask & xstate_feature), + WARN_ONCE(!(xfeatures_mask & BIT_ULL(xfeature_nr)), "get of unsupported state"); /* * This assumes the last 'xsave*' instruction to -* have requested that 'xstate_feature' be saved. +* have requested that 'xfeature_nr' be saved. * If it did not, we might be seeing and old value * of the field in the buffer. * @@ -863,10 +862,9 @@ void *get_xsave_addr(struct xregs_state *xsave, int xstate_feature) * or because the "init optimization" caused it * to not be saved. */ - if (!(xsave->header.xfeatures & xstate_feature)) + if (!(xsave->header.xfeatures & BIT_ULL(xfeature_nr))) return NULL; - xfeature_nr = fls64(xstate_feature) - 1; return __raw_xsave_addr(xsave, xfeature_nr); } EXPORT_SYMBOL_GPL(get_xsave_addr); @@ -882,13 +880,13 @@ EXPORT_SYMBOL_GPL(get_xsave_addr); * Note that this only works
[tip:x86/fpu] x86/fpu: Only write PKRU if it is different from current
Commit-ID: 577ff465f5a6a5a0866d75a033844810baca20a0 Gitweb: https://git.kernel.org/tip/577ff465f5a6a5a0866d75a033844810baca20a0 Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:42 +0200 Committer: Borislav Petkov CommitDate: Thu, 11 Apr 2019 15:41:05 +0200 x86/fpu: Only write PKRU if it is different from current According to Dave Hansen, WRPKRU is more expensive than RDPKRU. It has a higher cycle cost and it's also practically a (light) speculation barrier. As an optimisation read the current PKRU value and only write the new one if it is different. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: "Jason A. Donenfeld" Cc: Juergen Gross Cc: kvm ML Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-14-bige...@linutronix.de --- arch/x86/include/asm/special_insns.h | 7 +++ 1 file changed, 7 insertions(+) diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h index 34897e2b52c9..0a3c4cab39db 100644 --- a/arch/x86/include/asm/special_insns.h +++ b/arch/x86/include/asm/special_insns.h @@ -121,6 +121,13 @@ static inline void wrpkru(u32 pkru) static inline void __write_pkru(u32 pkru) { + /* +* WRPKRU is relatively expensive compared to RDPKRU. +* Avoid WRPKRU when it would not change the value. +*/ + if (pkru == rdpkru()) + return; + wrpkru(pkru); }
[tip:x86/fpu] x86/pkeys: Provide *pkru() helpers
Commit-ID: c806e88734b9e9aea260bf2261c129aa23968fca Gitweb: https://git.kernel.org/tip/c806e88734b9e9aea260bf2261c129aa23968fca Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:41 +0200 Committer: Borislav Petkov CommitDate: Thu, 11 Apr 2019 15:40:58 +0200 x86/pkeys: Provide *pkru() helpers Dave Hansen asked for __read_pkru() and __write_pkru() to be symmetrical. As part of the series __write_pkru() will read back the value and only write it if it is different. In order to make both functions symmetrical, move the function containing only the opcode asm into a function called like the instruction itself. __write_pkru() will just invoke wrpkru() but in a follow-up patch will also read back the value. [ bp: Convert asm opcode wrapper names to rd/wrpkru(). ] Suggested-by: Dave Hansen Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andi Kleen Cc: Andy Lutomirski Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: "Jason A. Donenfeld" Cc: Joerg Roedel Cc: Juergen Gross Cc: "Kirill A. 
Shutemov" Cc: kvm ML Cc: Michal Hocko Cc: Paolo Bonzini Cc: "Radim Krčmář" Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-13-bige...@linutronix.de --- arch/x86/include/asm/pgtable.h | 2 +- arch/x86/include/asm/special_insns.h | 12 +--- arch/x86/kvm/vmx/vmx.c | 2 +- 3 files changed, 11 insertions(+), 5 deletions(-) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 2779ace16d23..e8875ca75623 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -127,7 +127,7 @@ static inline int pte_dirty(pte_t pte) static inline u32 read_pkru(void) { if (boot_cpu_has(X86_FEATURE_OSPKE)) - return __read_pkru(); + return rdpkru(); return 0; } diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h index 43c029cdc3fe..34897e2b52c9 100644 --- a/arch/x86/include/asm/special_insns.h +++ b/arch/x86/include/asm/special_insns.h @@ -92,7 +92,7 @@ static inline void native_write_cr8(unsigned long val) #endif #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS -static inline u32 __read_pkru(void) +static inline u32 rdpkru(void) { u32 ecx = 0; u32 edx, pkru; @@ -107,7 +107,7 @@ static inline u32 __read_pkru(void) return pkru; } -static inline void __write_pkru(u32 pkru) +static inline void wrpkru(u32 pkru) { u32 ecx = 0, edx = 0; @@ -118,8 +118,14 @@ static inline void __write_pkru(u32 pkru) asm volatile(".byte 0x0f,0x01,0xef\n\t" : : "a" (pkru), "c"(ecx), "d"(edx)); } + +static inline void __write_pkru(u32 pkru) +{ + wrpkru(pkru); +} + #else -static inline u32 __read_pkru(void) +static inline u32 rdpkru(void) { return 0; } diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index ab432a930ae8..dd1e1eea4f0b 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -6501,7 +6501,7 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu) */ if (static_cpu_has(X86_FEATURE_PKU) && kvm_read_cr4_bits(vcpu, X86_CR4_PKE)) { - vcpu->arch.pkru = __read_pkru(); + 
vcpu->arch.pkru = rdpkru(); if (vcpu->arch.pkru != vmx->host_pkru) __write_pkru(vmx->host_pkru); }
[PATCH] new flag COPY_FILE_RANGE_FILESIZE for copy_file_range()
If flags includes COPY_FILE_RANGE_FILESIZE then the length copied is the length of the file. off_in and off_out are ignored. len must be 0 or the file size. This implementation saves a call to stat() in the common case of copying files. It does not fix any race conditions, but that is possible in the future with this interface. EAGAIN: If COPY_FILE_RANGE_FILESIZE was passed and len is not 0 or the file size. Signed-off-by: Shawn Landden CC: --- fs/read_write.c | 14 +- include/uapi/linux/stat.h | 4 2 files changed, 17 insertions(+), 1 deletion(-) diff --git a/fs/read_write.c b/fs/read_write.c index 61b43ad7608e..6d06361f0856 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -1557,7 +1557,7 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, struct inode *inode_out = file_inode(file_out); ssize_t ret; - if (flags != 0) + if ((flags & ~COPY_FILE_RANGE_FILESIZE) != 0) return -EINVAL; if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode)) @@ -1565,6 +1565,18 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode)) return -EINVAL; + if (flags & COPY_FILE_RANGE_FILESIZE) { + struct kstat stat; + int error; + error = vfs_getattr(&file_in->f_path, &stat, + STATX_SIZE, 0); + if (error < 0) + return error; + if (!(len == 0 || len == stat.size)) + return -EAGAIN; + len = stat.size; + } + ret = rw_verify_area(READ, file_in, &pos_in, len); if (unlikely(ret)) return ret; diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h index 7b35e98d3c58..1075aa4666ef 100644 --- a/include/uapi/linux/stat.h +++ b/include/uapi/linux/stat.h @@ -170,5 +170,9 @@ struct statx { #define STATX_ATTR_AUTOMOUNT 0x1000 /* Dir: Automount trigger */ +/* + * Flags for copy_file_range() + */ +#define COPY_FILE_RANGE_FILESIZE 0x0001 /* Copy the full length of the input file */ #endif /* _UAPI_LINUX_STAT_H */ -- 2.20.1
[tip:x86/fpu] x86/fpu: Make __raw_xsave_addr() use a feature number instead of mask
Commit-ID: 07baeb04f37c952981d63359ff840118ce8f5434 Gitweb: https://git.kernel.org/tip/07baeb04f37c952981d63359ff840118ce8f5434 Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:39 +0200 Committer: Borislav Petkov CommitDate: Wed, 10 Apr 2019 16:33:45 +0200 x86/fpu: Make __raw_xsave_addr() use a feature number instead of mask Most users of __raw_xsave_addr() use a feature number, shift it to a mask and then __raw_xsave_addr() shifts it back to the feature number. Make __raw_xsave_addr() use the feature number as an argument. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: "Jason A. Donenfeld" Cc: kvm ML Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-11-bige...@linutronix.de --- arch/x86/kernel/fpu/xstate.c | 22 +++--- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c index 8bfcc5b9e04b..4f7f3c5d0d0c 100644 --- a/arch/x86/kernel/fpu/xstate.c +++ b/arch/x86/kernel/fpu/xstate.c @@ -805,20 +805,18 @@ void fpu__resume_cpu(void) } /* - * Given an xstate feature mask, calculate where in the xsave + * Given an xstate feature nr, calculate where in the xsave * buffer the state is. Callers should ensure that the buffer * is valid. 
*/ -static void *__raw_xsave_addr(struct xregs_state *xsave, int xstate_feature_mask) +static void *__raw_xsave_addr(struct xregs_state *xsave, int xfeature_nr) { - int feature_nr = fls64(xstate_feature_mask) - 1; - - if (!xfeature_enabled(feature_nr)) { + if (!xfeature_enabled(xfeature_nr)) { WARN_ON_FPU(1); return NULL; } - return (void *)xsave + xstate_comp_offsets[feature_nr]; + return (void *)xsave + xstate_comp_offsets[xfeature_nr]; } /* * Given the xsave area and a state inside, this function returns the @@ -840,6 +838,7 @@ static void *__raw_xsave_addr(struct xregs_state *xsave, int xstate_feature_mask */ void *get_xsave_addr(struct xregs_state *xsave, int xstate_feature) { + int xfeature_nr; /* * Do we even *have* xsave state? */ @@ -867,7 +866,8 @@ void *get_xsave_addr(struct xregs_state *xsave, int xstate_feature) if (!(xsave->header.xfeatures & xstate_feature)) return NULL; - return __raw_xsave_addr(xsave, xstate_feature); + xfeature_nr = fls64(xstate_feature) - 1; + return __raw_xsave_addr(xsave, xfeature_nr); } EXPORT_SYMBOL_GPL(get_xsave_addr); @@ -1014,7 +1014,7 @@ int copy_xstate_to_kernel(void *kbuf, struct xregs_state *xsave, unsigned int of * Copy only in-use xstates: */ if ((header.xfeatures >> i) & 1) { - void *src = __raw_xsave_addr(xsave, 1 << i); + void *src = __raw_xsave_addr(xsave, i); offset = xstate_offsets[i]; size = xstate_sizes[i]; @@ -1100,7 +1100,7 @@ int copy_xstate_to_user(void __user *ubuf, struct xregs_state *xsave, unsigned i * Copy only in-use xstates: */ if ((header.xfeatures >> i) & 1) { - void *src = __raw_xsave_addr(xsave, 1 << i); + void *src = __raw_xsave_addr(xsave, i); offset = xstate_offsets[i]; size = xstate_sizes[i]; @@ -1157,7 +1157,7 @@ int copy_kernel_to_xstate(struct xregs_state *xsave, const void *kbuf) u64 mask = ((u64)1 << i); if (hdr.xfeatures & mask) { - void *dst = __raw_xsave_addr(xsave, 1 << i); + void *dst = __raw_xsave_addr(xsave, i); offset = xstate_offsets[i]; size = xstate_sizes[i]; @@ -1211,7 
+1211,7 @@ int copy_user_to_xstate(struct xregs_state *xsave, const void __user *ubuf) u64 mask = ((u64)1 << i); if (hdr.xfeatures & mask) { - void *dst = __raw_xsave_addr(xsave, 1 << i); + void *dst = __raw_xsave_addr(xsave, i); offset = xstate_offsets[i]; size = xstate_sizes[i];
bug disabling NX (noexec=off)
Hi, Intel Core i3-2120 + kernel-5.0.7 x86_64 from Fedora: [0.00] microcode: microcode updated early to revision 0x2e, date = 2018-04-10 [0.00] Linux version 5.0.7-300.fc30.x86_64 (mockbu...@bkernel04.phx2.fedoraproject.org) (gcc version 9.0.1 20190312 (Red Hat 9.0.1-0.10) (GCC)) #1 SMP Mon Apr 8 18:28:09 UTC 2019 [0.00] Command line: BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-5.0.7-300.fc30.x86_64 root=UUID=ea93877a-9487-4416-9f32-9d1fba2a639a ro quiet audit=0 usb-storage.delay_use=0 net.ifnames=0 plymouth.enable=0 pti=off spectre_v2=off spectre_v2_user=off spec_store_bypass_disable=off l1tf=off noexec=off [...] [0.00] NX (Execute Disable) protection: disabled by kernel command line option [0.00] [ cut here ] [0.00] attempted to set unsupported pgprot: 8163 bits: 8000 supported: 7fff [0.00] WARNING: CPU: 0 PID: 0 at arch/x86/include/asm/pgtable.h:537 __early_set_fixmap+0xa2/0xff [0.00] Modules linked in: [0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 5.0.7-300.fc30.x86_64 #1 [0.00] RIP: 0010:__early_set_fixmap+0xa2/0xff [0.00] Code: ed be ff 49 21 cc 4c 39 e2 74 21 80 3d b8 a2 bb ff 00 75 18 4c 31 e2 48 c7 c7 68 b6 09 ac c6 05 a5 a2 bb ff 01 e8 ae ba 95 fe <0f> 0b 4c 89 e7 4c 09 ef ff 14 25 98 e2 22 ac 48 89 c6 48 89 df ff [0.00] RSP: :ac203de8 EFLAGS: 00010082 ORIG_RAX: [0.00] RAX: RBX: ac972000 RCX: 0078 [0.00] RDX: 0001 RSI: 0092 RDI: 0047 [0.00] RBP: ff20 R08: 0001 R09: 0024 [0.00] R10: 0c1c R11: 0003 R12: 0163 [0.00] R13: 000f R14: 0001 R15: [0.00] FS: () GS:ac73a000() knlGS: [0.00] CS: 0010 DS: ES: CR0: 80050033 [0.00] CR2: 888046212ff8 CR3: 467de000 CR4: 000406a0 [0.00] Call Trace: [0.00] ? __early_ioremap+0x10b/0x189 [0.00] ? dmi_scan_machine+0xfe/0x209 [0.00] ? setup_arch+0x436/0xc9f [0.00] ? start_kernel+0x65/0x519 [0.00] ? secondary_startup_64+0xa4/0xb0 [0.00] random: get_random_bytes called from print_oops_end_marker+0x26/0x40 with crng_init=0 [0.00] ---[ end trace ]--- [0.00] SMBIOS 2.6 present. [...] The same happens in a T61 with an Intel Core 2 Duo T7300. 
Thanks.
[tip:x86/fpu] x86/fpu: Add an __fpregs_load_activate() internal helper
Commit-ID: 4ee91519e1dccc175665fe24bb20a47c6053575c Gitweb: https://git.kernel.org/tip/4ee91519e1dccc175665fe24bb20a47c6053575c Author: Rik van Riel AuthorDate: Wed, 3 Apr 2019 18:41:38 +0200 Committer: Borislav Petkov CommitDate: Wed, 10 Apr 2019 16:23:14 +0200 x86/fpu: Add an __fpregs_load_activate() internal helper Add a helper function that ensures the floating point registers for the current task are active. Use with preemption disabled. While at it, add fpregs_lock/unlock() helpers too, to be used in later patches. [ bp: Add a comment about its intended usage. ] Signed-off-by: Rik van Riel Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: Aubrey Li Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A. Donenfeld" Cc: kvm ML Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-10-bige...@linutronix.de --- arch/x86/include/asm/fpu/api.h | 11 +++ arch/x86/include/asm/fpu/internal.h | 22 ++ 2 files changed, 25 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h index b56d504af654..73e684160f35 100644 --- a/arch/x86/include/asm/fpu/api.h +++ b/arch/x86/include/asm/fpu/api.h @@ -10,6 +10,7 @@ #ifndef _ASM_X86_FPU_API_H #define _ASM_X86_FPU_API_H +#include /* * Use kernel_fpu_begin/end() if you intend to use FPU in kernel context. It @@ -22,6 +23,16 @@ extern void kernel_fpu_begin(void); extern void kernel_fpu_end(void); extern bool irq_fpu_usable(void); +static inline void fpregs_lock(void) +{ + preempt_disable(); +} + +static inline void fpregs_unlock(void) +{ + preempt_enable(); +} + /* * Query the presence of one or more xfeatures. Works on any legacy CPU as well. 
* diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h index 54f70cae2f15..3e0c2c496f2d 100644 --- a/arch/x86/include/asm/fpu/internal.h +++ b/arch/x86/include/asm/fpu/internal.h @@ -484,6 +484,18 @@ static inline void fpregs_activate(struct fpu *fpu) trace_x86_fpu_regs_activated(fpu); } +/* + * Internal helper, do not use directly. Use switch_fpu_return() instead. + */ +static inline void __fpregs_load_activate(struct fpu *fpu, int cpu) +{ + if (!fpregs_state_valid(fpu, cpu)) { + if (current->mm) + copy_kernel_to_fpregs(&fpu->state); + fpregs_activate(fpu); + } +} + /* * FPU state switching for scheduling. * @@ -522,14 +534,8 @@ switch_fpu_prepare(struct fpu *old_fpu, int cpu) */ static inline void switch_fpu_finish(struct fpu *new_fpu, int cpu) { - if (static_cpu_has(X86_FEATURE_FPU)) { - if (!fpregs_state_valid(new_fpu, cpu)) { - if (current->mm) - copy_kernel_to_fpregs(&new_fpu->state); - } - - fpregs_activate(new_fpu); - } + if (static_cpu_has(X86_FEATURE_FPU)) + __fpregs_load_activate(new_fpu, cpu); } /*
[tip:x86/fpu] x86/fpu: Remove user_fpu_begin()
Commit-ID: 0169f53e0d97bb675075506810494bd86b8c934e Gitweb: https://git.kernel.org/tip/0169f53e0d97bb675075506810494bd86b8c934e Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:37 +0200 Committer: Borislav Petkov CommitDate: Wed, 10 Apr 2019 15:58:44 +0200 x86/fpu: Remove user_fpu_begin() user_fpu_begin() sets fpu_fpregs_owner_ctx to task's fpu struct. This is always the case since there is no lazy FPU anymore. fpu_fpregs_owner_ctx is used during context switch to decide if it needs to load the saved registers or if the currently loaded registers are valid. It could be skipped during a taskA -> kernel thread -> taskA switch because the switch to the kernel thread would not alter the CPU's FPU state. Since this field is always updated during context switch and never invalidated, setting it manually (in user context) makes no difference. A kernel thread with kernel_fpu_begin() block could set fpu_fpregs_owner_ctx to NULL but a kernel thread does not use user_fpu_begin(). This is a leftover from the lazy-FPU time. Remove user_fpu_begin(), it does not change fpu_fpregs_owner_ctx's content. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: Aubrey Li Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A. 
Donenfeld" Cc: kvm ML Cc: Nicolai Stange Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-9-bige...@linutronix.de --- arch/x86/include/asm/fpu/internal.h | 17 - arch/x86/kernel/fpu/core.c | 4 +--- arch/x86/kernel/fpu/signal.c| 1 - 3 files changed, 1 insertion(+), 21 deletions(-) diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h index 04042eacc852..54f70cae2f15 100644 --- a/arch/x86/include/asm/fpu/internal.h +++ b/arch/x86/include/asm/fpu/internal.h @@ -532,23 +532,6 @@ static inline void switch_fpu_finish(struct fpu *new_fpu, int cpu) } } -/* - * Needs to be preemption-safe. - * - * NOTE! user_fpu_begin() must be used only immediately before restoring - * the save state. It does not do any saving/restoring on its own. In - * lazy FPU mode, it is just an optimization to avoid a #NM exception, - * the task can lose the FPU right after preempt_enable(). - */ -static inline void user_fpu_begin(void) -{ - struct fpu *fpu = &current->thread.fpu; - - preempt_disable(); - fpregs_activate(fpu); - preempt_enable(); -} - /* * MXCSR and XCR definitions: */ diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c index 97e27de2b7c0..739ca3ae2bdc 100644 --- a/arch/x86/kernel/fpu/core.c +++ b/arch/x86/kernel/fpu/core.c @@ -335,10 +335,8 @@ void fpu__clear(struct fpu *fpu) * Make sure fpstate is cleared and initialized. 
*/ fpu__initialize(fpu); - if (static_cpu_has(X86_FEATURE_FPU)) { - user_fpu_begin(); + if (static_cpu_has(X86_FEATURE_FPU)) copy_init_fpstate_to_fpregs(); - } } /* diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c index 34989d2a8893..155f4552413e 100644 --- a/arch/x86/kernel/fpu/signal.c +++ b/arch/x86/kernel/fpu/signal.c @@ -322,7 +322,6 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size) * For 64-bit frames and 32-bit fsave frames, restore the user * state to the registers directly (with exceptions handled). */ - user_fpu_begin(); if (copy_user_to_fpregs_zeroing(buf_fx, xfeatures, fx_only)) { fpu__clear(fpu); return -1;
[tip:x86/fpu] x86/fpu: Remove fpu->initialized
Commit-ID: 2722146eb78451b30e4717a267a3a2b44e4ad317 Gitweb: https://git.kernel.org/tip/2722146eb78451b30e4717a267a3a2b44e4ad317 Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:36 +0200 Committer: Borislav Petkov CommitDate: Wed, 10 Apr 2019 15:42:40 +0200 x86/fpu: Remove fpu->initialized The struct fpu.initialized member is always set to one for user tasks and zero for kernel tasks. This avoids saving/restoring the FPU registers for kernel threads. The ->initialized = 0 case for user tasks has been removed in previous changes, for instance, by doing an explicit unconditional init at fork() time for FPU-less systems which was otherwise delayed until the emulated opcode. The context switch code (switch_fpu_prepare() + switch_fpu_finish()) can't unconditionally save/restore registers for kernel threads. Not only would it slow down the switch but also load a zeroed xcomp_bv for XSAVES. For kernel_fpu_begin() (+end) the situation is similar: EFI with runtime services uses this before alternatives_patched is true. Which means that this function is used too early and it wasn't the case before. For those two cases, use current->mm to distinguish between user and kernel thread. For kernel_fpu_begin() skip save/restore of the FPU registers. During the context switch into a kernel thread don't do anything. There is no reason to save the FPU state of a kernel thread. The reordering in __switch_to() is important because the current() pointer needs to be valid before switch_fpu_finish() is invoked so ->mm is seen of the new task instead the old one. N.B.: fpu__save() doesn't need to check ->mm because it is called by user tasks only. [ bp: Massage. ] Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: Aubrey Li Cc: Babu Moger Cc: "Chang S. Bae" Cc: Dmitry Safonov Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A. 
Donenfeld" Cc: Joerg Roedel Cc: kvm ML Cc: Masami Hiramatsu Cc: Mathieu Desnoyers Cc: Nicolai Stange Cc: Paolo Bonzini Cc: Peter Zijlstra Cc: Radim Krčmář Cc: Rik van Riel Cc: Sergey Senozhatsky Cc: Will Deacon Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-8-bige...@linutronix.de --- arch/x86/ia32/ia32_signal.c | 17 - arch/x86/include/asm/fpu/internal.h | 18 +- arch/x86/include/asm/fpu/types.h | 9 - arch/x86/include/asm/trace/fpu.h | 5 +-- arch/x86/kernel/fpu/core.c | 70 +++-- arch/x86/kernel/fpu/init.c | 2 -- arch/x86/kernel/fpu/regset.c | 19 +++--- arch/x86/kernel/fpu/xstate.c | 2 -- arch/x86/kernel/process_32.c | 4 +-- arch/x86/kernel/process_64.c | 4 +-- arch/x86/kernel/signal.c | 17 - arch/x86/mm/pkeys.c | 7 +--- 12 files changed, 53 insertions(+), 121 deletions(-) diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c index 321fe5f5d0e9..6eeb3249f22f 100644 --- a/arch/x86/ia32/ia32_signal.c +++ b/arch/x86/ia32/ia32_signal.c @@ -216,8 +216,7 @@ static void __user *get_sigframe(struct ksignal *ksig, struct pt_regs *regs, size_t frame_size, void __user **fpstate) { - struct fpu *fpu = &current->thread.fpu; - unsigned long sp; + unsigned long sp, fx_aligned, math_size; /* Default to using normal stack */ sp = regs->sp; @@ -231,15 +230,11 @@ static void __user *get_sigframe(struct ksignal *ksig, struct pt_regs *regs, ksig->ka.sa.sa_restorer) sp = (unsigned long) ksig->ka.sa.sa_restorer; - if (fpu->initialized) { - unsigned long fx_aligned, math_size; - - sp = fpu__alloc_mathframe(sp, 1, &fx_aligned, &math_size); - *fpstate = (struct _fpstate_32 __user *) sp; - if (copy_fpstate_to_sigframe(*fpstate, (void __user *)fx_aligned, - math_size) < 0) - return (void __user *) -1L; - } + sp = fpu__alloc_mathframe(sp, 1, &fx_aligned, &math_size); + *fpstate = (struct _fpstate_32 __user *) sp; + if (copy_fpstate_to_sigframe(*fpstate, (void __user *)fx_aligned, + math_size) < 0) + return (void __user *) -1L; sp -= frame_size; /* Align the stack pointer according to the i386
ABI, diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h index 70ecb7c032cb..04042eacc852 100644 --- a/arch/x86/include/asm/fpu/internal.h +++ b/arch/x86/include/asm/fpu/internal.h @@ -494,11 +494,14 @@ static inline void fpregs_activate(struct fpu *fpu) * * - switch_fpu_finish() restores the new state as *necessary. + * + * The FPU context is only stored/restored for a user task and + * ->mm is used to
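The current->mm test the commit message describes can be modeled in userspace with a toy context switch. Everything below (the struct names, the plain int standing in for FPU register state) is illustrative and not the kernel's actual code; it only shows the dispatch rule: a task whose mm is NULL is a kernel thread, so its FPU registers are neither saved nor restored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model of the ->mm-based dispatch. */
struct mm_struct { int dummy; };

struct task {
	struct mm_struct *mm;   /* NULL for kernel threads */
	int fpu_state;          /* stand-in for the saved FPU context */
};

static int hw_fpu_regs;         /* stand-in for the CPU's live FPU registers */

static void switch_fpu_prepare(struct task *prev)
{
	if (prev->mm)                   /* user task: save live registers */
		prev->fpu_state = hw_fpu_regs;
	/* kernel thread: nothing to save */
}

static void switch_fpu_finish(struct task *next)
{
	if (next->mm)                   /* user task: restore its registers */
		hw_fpu_regs = next->fpu_state;
	/* kernel thread: leave the registers alone */
}
```

In this model a switch to a kernel thread touches neither the hardware registers nor any saved state, which is exactly the behavior the patch wants for XSAVES (never loading a zeroed xcomp_bv).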
[tip:x86/fpu] x86/fpu: Remove fpu->initialized usage in copy_fpstate_to_sigframe()
Commit-ID: fbcc9e0c37ba3186c41b5eb1aee2f7f3b711bc1e Gitweb: https://git.kernel.org/tip/fbcc9e0c37ba3186c41b5eb1aee2f7f3b711bc1e Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:34 +0200 Committer: Borislav Petkov CommitDate: Tue, 9 Apr 2019 20:48:11 +0200 x86/fpu: Remove fpu->initialized usage in copy_fpstate_to_sigframe() With lazy-FPU support the (now named variable) ->initialized was set to true if the CPU's FPU registers were holding a valid state of the FPU registers for the active process. If it was set to false then the FPU state was saved in fpu->state and the FPU was deactivated. With lazy-FPU gone, ->initialized is always true for user threads, and kernel threads never call this function, so ->initialized is always true in copy_fpstate_to_sigframe(). The using_compacted_format() check is also a leftover from the lazy-FPU time. In the ->initialized == false case copy_to_user() would copy the compacted buffer while userland would expect the non-compacted format instead. So in order to save the FPU state in the non-compacted form it issues XSAVE to save the *current* FPU state. If the FPU is not enabled, the attempt raises the FPU trap, the trap restores the FPU contents and re-enables the FPU, and XSAVE is invoked again and succeeds. *This* no longer works since commit bef8b6da9522 ("x86/fpu: Handle #NM without FPU emulation as an error"). Remove the check for ->initialized because it is always true, and remove the false condition. Update the comment to reflect that the state is always live. [ bp: Massage. ] Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A.
Donenfeld" Cc: kvm ML Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-6-bige...@linutronix.de --- arch/x86/kernel/fpu/signal.c | 35 --- 1 file changed, 8 insertions(+), 27 deletions(-) diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c index 7296a9bb78e7..c1a5999affa0 100644 --- a/arch/x86/kernel/fpu/signal.c +++ b/arch/x86/kernel/fpu/signal.c @@ -144,9 +144,8 @@ static inline int copy_fpregs_to_sigframe(struct xregs_state __user *buf) * buf == buf_fx for 64-bit frames and 32-bit fsave frame. * buf != buf_fx for 32-bit frames with fxstate. * - * If the fpu, extended register state is live, save the state directly - * to the user frame pointed by the aligned pointer 'buf_fx'. Otherwise, - * copy the thread's fpu state to the user frame starting at 'buf_fx'. + * Save the state directly to the user frame pointed by the aligned pointer + * 'buf_fx'. * * If this is a 32-bit frame with fxstate, put a fsave header before * the aligned state at 'buf_fx'. @@ -157,7 +156,6 @@ static inline int copy_fpregs_to_sigframe(struct xregs_state __user *buf) int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size) { struct fpu *fpu = &current->thread.fpu; - struct xregs_state *xsave = &fpu->state.xsave; struct task_struct *tsk = current; int ia32_fxstate = (buf != buf_fx); @@ -172,29 +170,12 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size) sizeof(struct user_i387_ia32_struct), NULL, (struct _fpstate_32 __user *) buf) ? -1 : 1; - if (fpu->initialized || using_compacted_format()) { - /* Save the live register state to the user directly. */ - if (copy_fpregs_to_sigframe(buf_fx)) - return -1; - /* Update the thread's fxstate to save the fsave header. */ - if (ia32_fxstate) - copy_fxregs_to_kernel(fpu); - } else { - /* -* It is a *bug* if kernel uses compacted-format for xsave -* area and we copy it out directly to a signal frame.
It -* should have been handled above by saving the registers -* directly. -*/ - if (boot_cpu_has(X86_FEATURE_XSAVES)) { - WARN_ONCE(1, "x86/fpu: saving compacted-format xsave area to a signal frame!\n"); - return -1; - } - - fpstate_sanitize_xstate(fpu); - if (__copy_to_user(buf_fx, xsave, fpu_user_xstate_size)) - return -1; - } + /* Save the live registers state to the user frame directly. */ + if (copy_fpregs_to_sigframe(buf_fx)) + return -1; + /* Update the thread's fxstate to save the fsave header. */ + if (ia32_fxstate) + copy_fxregs_to_kernel(fpu); /* Save the fsave header for the 32-bit frames. */ if ((ia32_fxstate || !use_fxsr()) &&
[tip:x86/fpu] x86/fpu: Don't save fxregs for ia32 frames in copy_fpstate_to_sigframe()
Commit-ID: 39388e80f9b0c3788bfb6efe3054bdce0c3ead45 Gitweb: https://git.kernel.org/tip/39388e80f9b0c3788bfb6efe3054bdce0c3ead45 Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:35 +0200 Committer: Borislav Petkov CommitDate: Wed, 10 Apr 2019 14:46:35 +0200 x86/fpu: Don't save fxregs for ia32 frames in copy_fpstate_to_sigframe() In commit 72a671ced66db ("x86, fpu: Unify signal handling code paths for x86 and x86_64 kernels") the 32bit and 64bit paths of the signal delivery code were merged. The 32bit version: int save_i387_xstate_ia32(void __user *buf) … if (cpu_has_xsave) return save_i387_xsave(fp); if (cpu_has_fxsr) return save_i387_fxsave(fp); The 64bit version: int save_i387_xstate(void __user *buf) … if (user_has_fpu()) { if (use_xsave()) err = xsave_user(buf); else err = fxsave_user(buf); if (unlikely(err)) { __clear_user(buf, xstate_size); return err; The merge: int save_xstate_sig(void __user *buf, void __user *buf_fx, int size) … if (user_has_fpu()) { /* Save the live register state to the user directly. */ if (save_user_xstate(buf_fx)) return -1; /* Update the thread's fxstate to save the fsave header. */ if (ia32_fxstate) fpu_fxsave(&tsk->thread.fpu); I don't think that we needed to save the FPU registers to ->thread.fpu because the registers were stored in buf_fx. Today the state will be restored from buf_fx after the signal was handled (I assume that this was also the case with lazy-FPU). Since commit 66463db4fc560 ("x86, fpu: shift drop_init_fpu() from save_xstate_sig() to handle_signal()") it is ensured that the signal handler starts with a clear/fresh set of FPU registers, which means that the previous store is futile. Remove the copy_fxregs_to_kernel() call because the task's FPU state is cleared later in handle_signal() via fpu__clear(). Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: "H.
Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A. Donenfeld" Cc: kvm ML Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-7-bige...@linutronix.de --- arch/x86/kernel/fpu/signal.c | 4 ---- 1 file changed, 4 deletions(-) diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c index c1a5999affa0..34989d2a8893 100644 --- a/arch/x86/kernel/fpu/signal.c +++ b/arch/x86/kernel/fpu/signal.c @@ -155,7 +155,6 @@ static inline int copy_fpregs_to_sigframe(struct xregs_state __user *buf) */ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size) { - struct fpu *fpu = &current->thread.fpu; struct task_struct *tsk = current; int ia32_fxstate = (buf != buf_fx); @@ -173,9 +172,6 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size) /* Save the live registers state to the user frame directly. */ if (copy_fpregs_to_sigframe(buf_fx)) return -1; - /* Update the thread's fxstate to save the fsave header. */ - if (ia32_fxstate) - copy_fxregs_to_kernel(fpu); /* Save the fsave header for the 32-bit frames. */ if ((ia32_fxstate || !use_fxsr()) && save_fsave_header(tsk, buf))
[PATCH] new flag COPY_FILE_RANGE_FILESIZE for copy_file_range()
If flags includes COPY_FILE_RANGE_FILESIZE then the length copied is the length of the file. off_in and off_out are ignored. len must be 0 or the file size. This implementation saves a call to stat() in the common case of copying files. It does not fix any race conditions, but that is possible in the future with this interface. EAGAIN is returned if COPY_FILE_RANGE_FILESIZE was passed and len is neither 0 nor the file size. Signed-off-by: Shawn Landden CC: --- fs/read_write.c | 14 +- include/uapi/linux/stat.h | 4 ++++ 2 files changed, 17 insertions(+), 1 deletion(-) diff --git a/fs/read_write.c b/fs/read_write.c index 61b43ad7608e..6d06361f0856 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -1557,7 +1557,7 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, struct inode *inode_out = file_inode(file_out); ssize_t ret; - if (flags != 0) + if ((flags & ~COPY_FILE_RANGE_FILESIZE) != 0) return -EINVAL; if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode)) @@ -1565,6 +1565,18 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode)) return -EINVAL; + if (flags & COPY_FILE_RANGE_FILESIZE) { + struct kstat stat; + int error; + error = vfs_getattr(&file_in->f_path, &stat, + STATX_SIZE, 0); + if (error < 0) + return error; + if (!(len == 0 || len == stat.size)) + return -EAGAIN; + len = stat.size; + } + ret = rw_verify_area(READ, file_in, &pos_in, len); if (unlikely(ret)) return ret; diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h index 7b35e98d3c58..1075aa4666ef 100644 --- a/include/uapi/linux/stat.h +++ b/include/uapi/linux/stat.h @@ -170,5 +170,9 @@ struct statx { #define STATX_ATTR_AUTOMOUNT 0x1000 /* Dir: Automount trigger */ +/* + * Flags for copy_file_range() + */ +#define COPY_FILE_RANGE_FILESIZE 0x0001 /* Copy the full length of the input file */ #endif /* _UAPI_LINUX_STAT_H */ -- 2.20.1
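For context, here is the userspace idiom the proposed COPY_FILE_RANGE_FILESIZE flag aims to shorten: fstat() the source to learn its length, then loop copy_file_range() until that many bytes have moved. The flag itself is only proposed in this patch, so the sketch below uses the plain syscall as it exists today (Linux >= 4.5, glibc >= 2.27); the fstat() marked in the comment is the extra call the flag would remove.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Copy src to dst in whole, the way userspace has to do it without
 * COPY_FILE_RANGE_FILESIZE.  Minimal sketch: error paths leak the fds. */
static int copy_whole_file(const char *src, const char *dst)
{
	int in = open(src, O_RDONLY);
	int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	struct stat st;
	ssize_t n;

	if (in < 0 || out < 0)
		return -1;
	if (fstat(in, &st) < 0)   /* the extra syscall the flag removes */
		return -1;

	for (off_t left = st.st_size; left > 0; left -= n) {
		n = copy_file_range(in, NULL, out, NULL, left, 0);
		if (n <= 0)
			return -1;
	}
	close(in);
	close(out);
	return 0;
}
```

With the proposed flag, the fstat() and the explicit length would collapse into a single call with len == 0 and flags == COPY_FILE_RANGE_FILESIZE.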
[tip:x86/fpu] x86/fpu: Always init the state in fpu__clear()
Commit-ID: 88f5260a3bf9bfb276b5b4aac2e81587e425a1d7 Gitweb: https://git.kernel.org/tip/88f5260a3bf9bfb276b5b4aac2e81587e425a1d7 Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:33 +0200 Committer: Borislav Petkov CommitDate: Tue, 9 Apr 2019 19:28:06 +0200 x86/fpu: Always init the state in fpu__clear() fpu__clear() only initializes the state if the CPU has FPU support. This initialisation is also required for FPU-less systems and takes place in math_emulate(). Since fpu__initialize() only performs the initialization if ->initialized is zero it does not matter that it is invoked each time an opcode is emulated. It makes the removal of ->initialized easier if the struct is also initialized in the FPU-less case at the same time. Move fpu__initialize() before the FPU feature check so it is also performed in the FPU-less case too. [ bp: Massage a bit. ] Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: Aubrey Li Cc: Bill Metzenthen Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A. 
Donenfeld" Cc: kvm ML Cc: Nicolai Stange Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-5-bige...@linutronix.de --- arch/x86/include/asm/fpu/internal.h | 1 - arch/x86/kernel/fpu/core.c | 5 ++--- arch/x86/math-emu/fpu_entry.c | 3 --- 3 files changed, 2 insertions(+), 7 deletions(-) diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h index 75a1d5f712ee..70ecb7c032cb 100644 --- a/arch/x86/include/asm/fpu/internal.h +++ b/arch/x86/include/asm/fpu/internal.h @@ -24,7 +24,6 @@ /* * High level FPU state handling functions: */ -extern void fpu__initialize(struct fpu *fpu); extern void fpu__prepare_read(struct fpu *fpu); extern void fpu__prepare_write(struct fpu *fpu); extern void fpu__save(struct fpu *fpu); diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c index 1940319268ae..e43296854e37 100644 --- a/arch/x86/kernel/fpu/core.c +++ b/arch/x86/kernel/fpu/core.c @@ -223,7 +223,7 @@ int fpu__copy(struct fpu *dst_fpu, struct fpu *src_fpu) * Activate the current task's in-memory FPU context, * if it has not been used before: */ -void fpu__initialize(struct fpu *fpu) +static void fpu__initialize(struct fpu *fpu) { WARN_ON_FPU(fpu != &current->thread.fpu); @@ -236,7 +236,6 @@ void fpu__initialize(struct fpu *fpu) fpu->initialized = 1; } } -EXPORT_SYMBOL_GPL(fpu__initialize); /* * This function must be called before we read a task's fpstate. */ @@ -365,8 +364,8 @@ void fpu__clear(struct fpu *fpu) /* * Make sure fpstate is cleared and initialized.
*/ + fpu__initialize(fpu); if (static_cpu_has(X86_FEATURE_FPU)) { - fpu__initialize(fpu); user_fpu_begin(); copy_init_fpstate_to_fpregs(); } diff --git a/arch/x86/math-emu/fpu_entry.c b/arch/x86/math-emu/fpu_entry.c index 9e2ba7e667f6..a873da6b46d6 100644 --- a/arch/x86/math-emu/fpu_entry.c +++ b/arch/x86/math-emu/fpu_entry.c @@ -113,9 +113,6 @@ void math_emulate(struct math_emu_info *info) unsigned long code_base = 0; unsigned long code_limit = 0; /* Initialized to stop compiler warnings */ struct desc_struct code_descriptor; - struct fpu *fpu = &current->thread.fpu; - - fpu__initialize(fpu); #ifdef RE_ENTRANT_CHECKING if (emulating) {
[tip:x86/fpu] x86/fpu: Remove preempt_disable() in fpu__clear()
Commit-ID: 60e528d6ce3f60a058bbb64f8acb2a07f84b172a Gitweb: https://git.kernel.org/tip/60e528d6ce3f60a058bbb64f8acb2a07f84b172a Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:32 +0200 Committer: Borislav Petkov CommitDate: Tue, 9 Apr 2019 19:27:46 +0200 x86/fpu: Remove preempt_disable() in fpu__clear() The preempt_disable() section was introduced in commit a10b6a16cdad8 ("x86/fpu: Make the fpu state change in fpu__clear() scheduler-atomic") and it was said to be temporary. fpu__initialize() initializes the FPU struct to its initial value and then sets ->initialized to 1. The last part is the important one. The content of the state does not matter because it gets set via copy_init_fpstate_to_fpregs(). A preemption here has little meaning because the registers will always be set to the same content after copy_init_fpstate_to_fpregs(). A softirq with a kernel_fpu_begin() could also force to save FPU's registers after fpu__initialize() without changing the outcome here. Remove the preempt_disable() section in fpu__clear(), preemption here does not hurt. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: "Jason A. Donenfeld" Cc: kvm ML Cc: Nicolai Stange Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-4-bige...@linutronix.de --- arch/x86/kernel/fpu/core.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c index 1d3ae7988f7f..1940319268ae 100644 --- a/arch/x86/kernel/fpu/core.c +++ b/arch/x86/kernel/fpu/core.c @@ -366,11 +366,9 @@ void fpu__clear(struct fpu *fpu) * Make sure fpstate is cleared and initialized. */ if (static_cpu_has(X86_FEATURE_FPU)) { - preempt_disable(); fpu__initialize(fpu); user_fpu_begin(); copy_init_fpstate_to_fpregs(); - preempt_enable(); } }
[tip:x86/fpu] x86/fpu: Remove fpu__restore()
Commit-ID: 6dd677a044e606fd343e31c2108b13d74aec1ca5 Gitweb: https://git.kernel.org/tip/6dd677a044e606fd343e31c2108b13d74aec1ca5 Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:31 +0200 Committer: Borislav Petkov CommitDate: Tue, 9 Apr 2019 19:27:42 +0200 x86/fpu: Remove fpu__restore() There are no users of fpu__restore() so it is time to remove it. The comment regarding fpu__restore() and TS bit is stale since commit b3b0870ef3ffe ("i387: do not preload FPU state at task switch time") and has no meaning since. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andy Lutomirski Cc: Aubrey Li Cc: Babu Moger Cc: "Chang S. Bae" Cc: Dmitry Safonov Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A. Donenfeld" Cc: Joerg Roedel Cc: Jonathan Corbet Cc: kvm ML Cc: linux-...@vger.kernel.org Cc: Nicolai Stange Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-3-bige...@linutronix.de --- Documentation/preempt-locking.txt | 1 - arch/x86/include/asm/fpu/internal.h | 1 - arch/x86/kernel/fpu/core.c | 24 arch/x86/kernel/process_32.c| 4 +--- arch/x86/kernel/process_64.c| 4 +--- 5 files changed, 2 insertions(+), 32 deletions(-) diff --git a/Documentation/preempt-locking.txt b/Documentation/preempt-locking.txt index 509f5a422d57..dce336134e54 100644 --- a/Documentation/preempt-locking.txt +++ b/Documentation/preempt-locking.txt @@ -52,7 +52,6 @@ preemption must be disabled around such regions. Note, some FPU functions are already explicitly preempt safe. For example, kernel_fpu_begin and kernel_fpu_end will disable and enable preemption. -However, fpu__restore() must be called with preemption disabled. 
RULE #3: Lock acquire and release must be performed by same task diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h index fb04a3ded7dd..75a1d5f712ee 100644 --- a/arch/x86/include/asm/fpu/internal.h +++ b/arch/x86/include/asm/fpu/internal.h @@ -28,7 +28,6 @@ extern void fpu__initialize(struct fpu *fpu); extern void fpu__prepare_read(struct fpu *fpu); extern void fpu__prepare_write(struct fpu *fpu); extern void fpu__save(struct fpu *fpu); -extern void fpu__restore(struct fpu *fpu); extern int fpu__restore_sig(void __user *buf, int ia32_frame); extern void fpu__drop(struct fpu *fpu); extern int fpu__copy(struct fpu *dst_fpu, struct fpu *src_fpu); diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c index 2e5003fef51a..1d3ae7988f7f 100644 --- a/arch/x86/kernel/fpu/core.c +++ b/arch/x86/kernel/fpu/core.c @@ -303,30 +303,6 @@ void fpu__prepare_write(struct fpu *fpu) } } -/* - * 'fpu__restore()' is called to copy FPU registers from - * the FPU fpstate to the live hw registers and to activate - * access to the hardware registers, so that FPU instructions - * can be used afterwards. - * - * Must be called with kernel preemption disabled (for example - * with local interrupts disabled, as it is in the case of - * do_device_not_available()). - */ -void fpu__restore(struct fpu *fpu) -{ - fpu__initialize(fpu); - - /* Avoid __kernel_fpu_begin() right after fpregs_activate() */ - kernel_fpu_disable(); - trace_x86_fpu_before_restore(fpu); - fpregs_activate(fpu); - copy_kernel_to_fpregs(&fpu->state); - trace_x86_fpu_after_restore(fpu); - kernel_fpu_enable(); -} -EXPORT_SYMBOL_GPL(fpu__restore); - /* * Drops current FPU state: deactivates the fpregs and * the fpstate.
NOTE: it still leaves previous contents diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c index e471d8e6f0b2..7888a41a03cd 100644 --- a/arch/x86/kernel/process_32.c +++ b/arch/x86/kernel/process_32.c @@ -267,9 +267,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p) /* * Leave lazy mode, flushing any hypercalls made here. * This must be done before restoring TLS segments so -* the GDT and LDT are properly updated, and must be -* done before fpu__restore(), so the TS bit is up -* to date. +* the GDT and LDT are properly updated. */ arch_end_context_switch(next_p); diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c index 6a62f4af9fcf..e1983b3a16c4 100644 --- a/arch/x86/kernel/process_64.c +++ b/arch/x86/kernel/process_64.c @@ -538,9 +538,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p) /* * Leave lazy mode, flushing any hypercalls made here. This * must be done after loading TLS entries in the GDT but before -* loading segments that might reference them, and and it must -* be done before fpu__restore(), so the TS bit is up to -* date. +* loading segments
[tip:x86/fpu] x86/fpu: Remove fpu->initialized usage in __fpu__restore_sig()
Commit-ID: 39ea9baffda91df8bfee9b45610242a3191ea1ec Gitweb: https://git.kernel.org/tip/39ea9baffda91df8bfee9b45610242a3191ea1ec Author: Sebastian Andrzej Siewior AuthorDate: Wed, 3 Apr 2019 18:41:30 +0200 Committer: Borislav Petkov CommitDate: Tue, 9 Apr 2019 19:27:29 +0200 x86/fpu: Remove fpu->initialized usage in __fpu__restore_sig() This is a preparation for the removal of the ->initialized member in the fpu struct. __fpu__restore_sig() is deactivating the FPU via fpu__drop() and then setting manually ->initialized followed by fpu__restore(). The result is that it is possible to manipulate fpu->state and the state of registers won't be saved/restored on a context switch which would overwrite fpu->state: fpu__drop(fpu): ... fpu->initialized = 0; preempt_enable(); <--- context switch Don't access the fpu->state while the content is read from user space and examined/sanitized. Use a temporary kmalloc() buffer for the preparation of the FPU registers and once the state is considered okay, load it. Should something go wrong, return with an error and without altering the original FPU registers. The removal of fpu__initialize() is a nop because fpu->initialized is already set for the user task. [ bp: Massage a bit. ] Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Borislav Petkov Reviewed-by: Dave Hansen Reviewed-by: Thomas Gleixner Acked-by: Borislav Petkov Cc: Andy Lutomirski Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jann Horn Cc: "Jason A. 
Donenfeld" Cc: kvm ML Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Rik van Riel Cc: x86-ml Link: https://lkml.kernel.org/r/20190403164156.19645-2-bige...@linutronix.de --- arch/x86/include/asm/fpu/signal.h | 2 +- arch/x86/kernel/fpu/regset.c | 5 ++--- arch/x86/kernel/fpu/signal.c | 40 +++ 3 files changed, 18 insertions(+), 29 deletions(-) diff --git a/arch/x86/include/asm/fpu/signal.h b/arch/x86/include/asm/fpu/signal.h index 44bbc39a57b3..7fb516b6893a 100644 --- a/arch/x86/include/asm/fpu/signal.h +++ b/arch/x86/include/asm/fpu/signal.h @@ -22,7 +22,7 @@ int ia32_setup_frame(int sig, struct ksignal *ksig, extern void convert_from_fxsr(struct user_i387_ia32_struct *env, struct task_struct *tsk); -extern void convert_to_fxsr(struct task_struct *tsk, +extern void convert_to_fxsr(struct fxregs_state *fxsave, const struct user_i387_ia32_struct *env); unsigned long diff --git a/arch/x86/kernel/fpu/regset.c b/arch/x86/kernel/fpu/regset.c index bc02f5144b95..5dbc099178a8 100644 --- a/arch/x86/kernel/fpu/regset.c +++ b/arch/x86/kernel/fpu/regset.c @@ -269,11 +269,10 @@ convert_from_fxsr(struct user_i387_ia32_struct *env, struct task_struct *tsk) memcpy(&to[i], &from[i], sizeof(to[0])); } -void convert_to_fxsr(struct task_struct *tsk, +void convert_to_fxsr(struct fxregs_state *fxsave, const struct user_i387_ia32_struct *env) { - struct fxregs_state *fxsave = &tsk->thread.fpu.state.fxsave; struct _fpreg *from = (struct _fpreg *) &env->st_space[0]; struct _fpxreg *to = (struct _fpxreg *) &fxsave->st_space[0]; int i; @@ -350,7 +349,7 @@ int fpregs_set(struct task_struct *target, const struct user_regset *regset, ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf, &env, 0, -1); if (!ret) - convert_to_fxsr(target, &env); + convert_to_fxsr(&target->thread.fpu.state.fxsave, &env); /* * update the header bit in the xsave header, indicating the diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c index 55b80de13ea5..7296a9bb78e7 100644 --- a/arch/x86/kernel/fpu/signal.c +++ b/arch/x86/kernel/fpu/signal.c @@ -207,11 +207,11 @@ int
copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size) } static inline void -sanitize_restored_xstate(struct task_struct *tsk, +sanitize_restored_xstate(union fpregs_state *state, struct user_i387_ia32_struct *ia32_env, u64 xfeatures, int fx_only) { - struct xregs_state *xsave = &tsk->thread.fpu.state.xsave; + struct xregs_state *xsave = &state->xsave; struct xstate_header *header = &xsave->header; if (use_xsave()) { @@ -238,7 +238,7 @@ sanitize_restored_xstate(struct task_struct *tsk, */ xsave->i387.mxcsr &= mxcsr_feature_mask; - convert_to_fxsr(tsk, ia32_env); + convert_to_fxsr(&state->fxsave, ia32_env); } } @@ -284,8 +284,6 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size) if (!access_ok(buf, size)) return -EACCES; - fpu__initialize(fpu); - if (!static_cpu_has(X86_FEATURE_FPU)) return fpregs_soft_set(current, NULL, 0, sizeof(struct user_i387_ia32_struct), @@ -315,40 +313,32 @@ static int __fpu__restore_sig(void __user *buf, void
Re: [PATCH 5/5] leds: lm3697: Introduce the lm3697 driver
On Mon 2019-03-25 09:24:03, Dan Murphy wrote: > Introduce the lm3697 LED driver for > backlighting and display. > > Datasheet location: > http://www.ti.com/lit/ds/symlink/lm3697.pdf > > Signed-off-by: Dan Murphy > --- > drivers/leds/Kconfig | 8 +- > drivers/leds/Makefile | 1 + > drivers/leds/leds-lm3697.c | 401 + > 3 files changed, 409 insertions(+), 1 deletion(-) > create mode 100644 drivers/leds/leds-lm3697.c > > diff --git a/drivers/leds/Kconfig b/drivers/leds/Kconfig > index 735009e73414..688bb9a6f275 100644 > --- a/drivers/leds/Kconfig > +++ b/drivers/leds/Kconfig > @@ -776,9 +776,15 @@ config LEDS_NIC78BX > To compile this driver as a module, choose M here: the module > will be called leds-nic78bx. > > +config LEDS_LM3697 > + tristate "LED driver for LM3697" > + depends on LEDS_TI_LMU_COMMON > + help > + Say Y to enable the LM3697 LED driver for TI LMU devices. > + This supports the LED device LM3697. > + > config LEDS_TI_LMU_COMMON > tristate "LED driver for TI LMU" > - depends on REGMAP > help >Say Y to enable the LED driver for TI LMU devices. >This supports common features between the TI LM3532, LM3631, > LM3632, Is deleting "depends on REGMAP" intentional? AFAICT you are using it. Plus we'd normally expect "COMMON" first and then specific driver. Not sure if Kconfig can handle it out-of-order... 
> +static int lm3697_init(struct lm3697 *priv) > +{ > + struct lm3697_led *led; > + int i, ret; > + > + if (priv->enable_gpio) { > + gpiod_direction_output(priv->enable_gpio, 1); > + } else { > + ret = regmap_write(priv->regmap, LM3697_RESET, LM3697_SW_RESET); > + if (ret) { > + dev_err(&priv->client->dev, "Cannot reset the device\n"); > + goto out; > + } > + } > + > + ret = regmap_write(priv->regmap, LM3697_CTRL_ENABLE, 0x0); > + if (ret) { > + dev_err(&priv->client->dev, "Cannot write ctrl enable\n"); > + goto out; > + } > + > + ret = regmap_write(priv->regmap, LM3697_OUTPUT_CONFIG, priv->bank_cfg); > + if (ret) > + dev_err(&priv->client->dev, "Cannot write OUTPUT config\n"); Missing goto out? > + for (i = 0; i < LM3697_MAX_CONTROL_BANKS; i++) { > + led = &priv->leds[i]; > + ret = ti_lmu_common_set_ramp(&led->lmu_data); > + if (ret) > + dev_err(&priv->client->dev, "Setting the ramp rate failed\n"); > + } > +out: > + return ret; > +} Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
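The "Missing goto out?" comment above is about a general error-handling pattern: every register write in an init sequence should branch to the common exit label on failure, otherwise a later step runs against a device left in an unknown state. A minimal userspace sketch of that pattern, with regmap_write() stubbed out and placeholder register names (none of this is the actual lm3697 driver):

```c
#include <assert.h>

enum { REG_RESET, REG_CTRL_ENABLE, REG_OUTPUT_CONFIG };

static int fail_at = -1;        /* test hook: which register write fails */

/* Stub standing in for the kernel's regmap_write(). */
static int regmap_write(int reg, int val)
{
	(void)val;
	return reg == fail_at ? -5 /* -EIO */ : 0;
}

static int chip_init(void)
{
	int ret;

	ret = regmap_write(REG_RESET, 1);
	if (ret)
		goto out;
	ret = regmap_write(REG_CTRL_ENABLE, 0);
	if (ret)
		goto out;
	ret = regmap_write(REG_OUTPUT_CONFIG, 0x7);
	if (ret)
		goto out;       /* the branch the review found missing */
out:
	return ret;
}
```

The point of the pattern is that a failure at any step short-circuits to the exit label, so the first error code is what the caller sees.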
Re: [PATCH 1/2] vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files
On Sat, Apr 13, 2019 at 10:27:00AM -0700, Linus Torvalds wrote: > On Sat, Apr 13, 2019 at 9:55 AM Kirill Smelkov wrote: > > > > --- a/fs/read_write.c > > +++ b/fs/read_write.c > > @@ -371,7 +371,7 @@ int rw_verify_area(int read_write, struct file *file, > > const loff_t *ppos, size_t > > inode = file_inode(file); > > if (unlikely((ssize_t) count < 0)) > > return retval; > > - pos = *ppos; > > + pos = (ppos ? *ppos : 0); > > if (unlikely(pos < 0)) { > > if (!unsigned_offsets(file)) > > return retval; > > This part looks silly. We should just avoid all the position overflow > games when we don't have a position at all (ie streaming). You can't > overflow what you don't use. > > Similarly, you can't use ranged mandatory locking on a stream, so the > mandatory locking thing seems dependent on pos too. > > So I think that with a NULL ppos being possible, we should just change > the code to just do all of that conditionally on having a position, > rather than saying that the position of a stream is always 0. Linus, thanks for the feedback. After checking a bit of the history of rw_verify_area and seeing what it does, I agree with you. I was initially confused because rw_verify_area calls security_file_permission() at its tail, and so I thought we could not skip rw_verify_area for the !ppos case. After studying it a bit I see now that there are several things going on and that moving the ranged-lock checking under `if (ppos != NULL)` indeed makes sense. This code is new to me; I apologize for not getting it right initially. > That said, this whole "let's make it possible to not have a position > at all" is a big change, and there's no way I'll apply these before > the 5.2 merge window. Ok. > And I'd really like to have people (Al?) look at this and go "yeah, > makes sense".
I do think that moving to a model where we either have a > (properly locked) file position or no pos pointer at all is the right > model (ie I'd really like to get rid of the mixed case), but there > might be some practical problem that makes it impractical. > > Because the *real* problem with the mixed case is not "insane people > who do bad things might get odd results". No, the real problem with > the mixed case is that it could be a security issue (ie: one process > intentionally changes the file position just as another process is > doing a 'read' and then avoids some limit test because the limit test > was done using the old 'pos' value but the actual IO was done using > the new one). > > So I suspect that we will have to either > > - get rid of the mixed case entirely (and do only properly locked > f_pos accesses or pass in a NULL f_pos) > > - continue to support the mixed case, but also continue to support > the nasty temporary 'pos' value with that file_pos_{read,write}() > thing. > > IOW, I would not be ok with passing in a shared - and unlocked - > &file->f_pos value to random drivers that *might* do odd things when a > race happens. I see. That's why I split the whole change in two: the first patch only adds the ppos=NULL semantics for FMODE_STREAM and does not change to passing &file->f_pos directly. There are signs that people may already be starting to use stream_open even now - before we do the mass conversion. That's why it is better to fix the ppos semantics for the stream case early. You said it won't happen before the next merge window - ok, but the ppos=NULL change is separate and does not depend on passing &file->f_pos directly. So let's please apply the ppos=NULL patch early when the merge window begins, independently of whether we'll switch to direct usage of &file->f_pos. Looking forward to getting review from others too. And I would appreciate it if people could help at least somehow with "getting rid of the mixed case entirely" (i.e.
always lock f_pos_lock on !FMODE_STREAM), because this transition starts to diverge too far from my particular use-case. To me it makes sense to do that transition as follows:

- convert nonseekable_open -> stream_open via stream_open.cocci;
- audit other nonseekable_open calls and convert the remaining users that truly don't depend on position to stream_open;
- extend stream_open.cocci to analyze alloc_file_pseudo as well (this will cover pipes and sockets), or maybe convert pipes and sockets to FMODE_STREAM manually;
- extend stream_open.cocci to analyze file_operations that use no_llseek or noop_llseek, but do not use nonseekable_open or alloc_file_pseudo. This might find files that have stream semantics but are opened differently;
- extend stream_open.cocci to analyze file_operations whose .read/.write do not use ppos at all (independently of how the file was opened);
- ...
- after that remove FMODE_ATOMIC_POS and always take f_pos_lock if !FMODE_STREAM;
- gather bug reports for deadlocked read/write and convert missed cases to FMODE_STREAM, probably extending stream_open.cocci along the way to catch similar cases.

i.e. always take f_pos_lock unless a file is marked FMODE_STREAM.
Re: [PATCH 0/3] 32-bit Meson: add the canvas module
Hi Martin, On Sat, Apr 13, 2019 at 7:15 PM Martin Blumenstingl wrote: > > This adds the canvas module on Meson8, Meson8b and Meson8m2. The canvas > IP is used by the video decoder hardware as well as the VPU (video > output) hardware. > > Neither the VPU nor the video decoder driver support the 32-bit SoCs > yet. However, we can still add the canvas module to have it available > once these drivers gain support for the older SoCs. > > I have tested this on my Meson8m2 board by hacking the VPU driver to > not re-initialize the VPU (and to use the configuration set by u-boot). > With that hack I could get some image out of the CVBS connector. No > changes to the canvas driver were required. > > Due to lack of hardware I could not test Meson8, but I'm following (as > always) what the Amlogic 3.10 vendor kernel uses. > Meson8b is also not tested because u-boot of my EC-100 doesn't have > video output enabled (so I couldn't use the same hack I used on my > Meson8m2 board). > > This series is meant to be applied on top of "Meson8b: add support for the > RTC on EC-100 and Odroid-C1" from [0] > > The series looks good to me; however, I wonder if we should maybe add a new compatible string? The canvas IP before the GX* generation does not handle what Amlogic calls "endianness", the field that allows doing some byte-switching to get proper NV12/NV21. So the following defines are unusable:

#define MESON_CANVAS_ENDIAN_SWAP16 0x1
#define MESON_CANVAS_ENDIAN_SWAP32 0x3
#define MESON_CANVAS_ENDIAN_SWAP64 0x7
#define MESON_CANVAS_ENDIAN_SWAP128 0xf

It wouldn't change much functionally, but we could have e.g. a warning if an m8 canvas user tries to set endianness even though it does nothing.
Maxime > [0] https://patchwork.kernel.org/cover/10899509/ > > > Martin Blumenstingl (3): > ARM: dts: meson8: add the canvas module > ARM: dts: meson8m2: update the offset of the canvas module > ARM: dts: meson8b: add the canvas module > > arch/arm/boot/dts/meson8.dtsi | 21 + > arch/arm/boot/dts/meson8b.dtsi | 21 + > arch/arm/boot/dts/meson8m2.dtsi | 10 ++ > 3 files changed, 52 insertions(+) > > -- > 2.21.0 > > > ___ > linux-amlogic mailing list > linux-amlo...@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-amlogic
Re: Alleged fix for writer stall on -rcu branch dev
On Sat, Apr 13, 2019 at 07:21:53AM -0700, Paul E. McKenney wrote: > On Wed, Apr 10, 2019 at 05:06:10PM -0700, Paul E. McKenney wrote: > > On Wed, Apr 10, 2019 at 01:33:02PM -0700, Paul E. McKenney wrote: > > > On Wed, Apr 10, 2019 at 09:19:18PM +0200, Sebastian Andrzej Siewior wrote: > > > > On 2019-04-10 11:41:05 [-0700], Paul E. McKenney wrote: > > > > > On Wed, Apr 10, 2019 at 08:15:38PM +0200, Sebastian Andrzej Siewior > > > > > wrote: > > > > > > On 2019-04-09 15:14:26 [-0700], Paul E. McKenney wrote: > > > > > > > Hello, Sebastian, > > > > > > Hi Paul, > > > > > > > > > > > > > A few commits later, I finally have something that might work. > > > > > > > ;-) > > > > > > > > > > > > Okay. I started with 33e04e4512797b5e0242f452d0027b096d43d006. The > > > > > > first > > > > > > batch of 18 completed and this > > > > > > https://breakpoint.cc/console-tree1.2.log > > > > > > > > > > Another new one on me! ;-) > > > > > > > > > > It looks like CPU 11 got stuck in the cpu_stopper_thread trying to > > > > > offline CPU 6. Unfortunately, CPU 6 is not in the list receiving an > > > > > NMI, perhaps because it looked like it was already offline. It does > > > > > eventually finish going offline quite some time later. > > > > > > > > > > Perhaps my forward-progress testing is interacting badly with CPU > > > > > hotplug, but only very occasionally? > > > > > > > > Somehow, yes. > > > > > > > > [ 8433.835292] Unregister pv shared memory for cpu 6 > > > > [ 8433.864122] smpboot: CPU 6 is now offline > > > > > > > > CPU6 is offline. > > > > > > > > [ 8434.934765] smpboot: Booting Node 0 Processor 6 APIC 0x6 > > > > [ 8434.950632] kvm-clock: cpu 6, msr 17155a181, secondary cpu clock > > > > [ 8434.989124] KVM setup async PF for cpu 6 > > > > [ 8434.990801] kvm-stealtime: cpu 6, msr 17b195380 > > > > > > > > CPU6 is booting again. 
> > > > > > > > [ 8436.035269] Unregister pv shared memory for cpu 6 > > > > > > > > going down again > > > > > > > > [ 8462.032468] rcu: INFO: rcu_preempt self-detected stall on CPU > > > > [ 8462.037385] rcu: 11-...!: (25587 ticks this GP) > > > > idle=57e/1/0x4002 softirq=418067/418067 fqs=1 > > > > last_accelerate: 2456/89e6, Nonlazy posted: .LD > > > > > > > > 25587 ticks is ~25secs with HZ=1k. And the stall occurred on CPU11, > > > > correct? > > > > > > Yes to both. > > > > > > > [ 8525.041031] smpboot: CPU 6 is now offline > > > > > > > > 63 secs later, the CPU is down. So that is too long. > > > > > > Ah, I did miss the down-then-up, thank you! > > > > > > > > Something for me to look at, anyway! > > > > > > > > There is also > > > > [ 8556.907804] WARNING: CPU: 35 PID: 833 at > > > > /home/bigeasy/linux-rt/kernel/rcu/rcutorture.c:1827 > > > > rcu_torture_fwd_prog+0x5c4/0x630 > > > > > > > > but I'm not sure what it is :) > > > > > > The forward-progress test exceeded eight seconds, but pushed fewer > > > than 100 callbacks through the system. One way that could happen is > > > if the forward-progress kthread was delayed during that eight seconds. > > > Another is if grace periods were insanely fast (which they are not). > > > > > > > > > I should have been alone on that server but I can't guarantee it. I > > > > > > try > > > > > > to throw the rcu-torture at a little smaller box and check how it > > > > > > goes. > > > > > > > > > > Even if you were not alone, I would guess that preempting a runnable > > > > > process for 92 seconds would have been a visible event on the host. > > > > > > > > 92 seconds?! Okay, that should have been visible on the host side. And I > > > > see nothing in dmesg on the host side. > > > > But why 92? 25 for RCU stall + 63 between starting to go down and > > > > finishing? > > > > > > Because I missed the momentary up-then-down that you spotted. Your 63 > > > seconds is correct. > > > > > > > > But maybe not? 
The watchdog timer is still 120 seconds? Or does your > > > > > host set it to 22 seconds? > > > > > > > > I didn't fiddle with it and host runs 4.19.12-1~bpo9+1. So it should be > > > > the default value. > > > > > > Which I believe to be 22 seconds. Hmmm... > > > > > > Either way, thank you very much for the testing!!! > > > > [ Adding LKML back -- I think we dropped it due to an attachment? ] > > > > One area of some concern is RCU_KTHREAD_YIELDING and the rcuc kthreads. > > This shouldn't be a big deal for the intended use case (real-time > > workloads with lightish load), but a two-jiffy delay could allow quite a > > few callbacks to build up when doing forward-progress testing. But not > > a problem for use_softirq runs, which is what you were running. > > > > But it might be that I need to either skip the forward-progress testing > > if !use_softirq, skip the RCU_KTHREAD_YIELDING step if there are too > > many callbacks, or some such. Either way, though, I would need at least > > a cond_resched() on each pass through the rcu_cpu_kthread() function. > > > > Thoughts
Re: [GIT PULL] arm64 fixes for -rc5
The pull request you sent on Fri, 12 Apr 2019 17:05:15 +0100: > git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git tags/arm64-fixes has been merged into torvalds/linux.git: https://git.kernel.org/torvalds/c/5ded88718aef7e92a9806f6ff4b89c7f2a4f1570 Thank you! -- Deet-doot-dot, I am a bot. https://korg.wiki.kernel.org/userdoc/prtracker
Re: [PATCH 1/2] vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files
On Sat, Apr 13, 2019 at 10:27:00AM -0700, Linus Torvalds wrote: > And I'd really like to have people (Al?) look at this and go "yeah, > makes sense". I do think that moving to a model where we either have a > (properly locked) file position or no pos pointer at all is the right > model (ie I'd really like to get rid of the mixed case), but there > might be some practical problem that makes it impractical. > > Because the *real* problem with the mixed case is not "insane people > who do bad things might get odd results". No, the real problem with > the mixed case is that it could be a security issue (ie: one process > intentionally changes the file position just as another process is > doing a 'read' and then avoids some limit test because the limit test > was done using the old 'pos' value but the actual IO was done using > the new one). > > So I suspect that we will have to either > > - get rid of the mixed case entirely (and do only properly locked > f_pos accesses or pass in a NULL f_pos) Good luck. Character devices get no exclusion, often do use position, are many, might be buried behind several layers of indirection... and are often left unmaintained for a decade or two. IOW, I don't see how you'd go about eliminating the mixed case...
Re: [PATCH 1/2] vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files
On Sat, Apr 13, 2019 at 9:55 AM Kirill Smelkov wrote: > > --- a/fs/read_write.c > +++ b/fs/read_write.c > @@ -371,7 +371,7 @@ int rw_verify_area(int read_write, struct file *file, > const loff_t *ppos, size_t > inode = file_inode(file); > if (unlikely((ssize_t) count < 0)) > return retval; > - pos = *ppos; > + pos = (ppos ? *ppos : 0); > if (unlikely(pos < 0)) { > if (!unsigned_offsets(file)) > return retval; This part looks silly. We should just avoid all the position overflow games when we don't have a position at all (ie streaming). You can't overflow what you don't use. Similarly, you can't use ranged mandatory locking on a stream, so the mandatory locking thing seems dependent on pos too. So I think that with a NULL ppos being possible, we should just change the code to do all of that conditionally on having a position, rather than saying that the position of a stream is always 0. That said, this whole "let's make it possible to not have a position at all" is a big change, and there's no way I'll apply these before the 5.2 merge window. And I'd really like to have people (Al?) look at this and go "yeah, makes sense". I do think that moving to a model where we either have a (properly locked) file position or no pos pointer at all is the right model (ie I'd really like to get rid of the mixed case), but there might be some practical problem that makes it impractical. Because the *real* problem with the mixed case is not "insane people who do bad things might get odd results". No, the real problem with the mixed case is that it could be a security issue (ie: one process intentionally changes the file position just as another process is doing a 'read' and then avoids some limit test because the limit test was done using the old 'pos' value but the actual IO was done using the new one). 
So I suspect that we will have to either - get rid of the mixed case entirely (and do only properly locked f_pos accesses or pass in a NULL f_pos) - continue to support the mixed case, but also continue to support the nasty temporary 'pos' value with that file_pos_{read,write}() thing. IOW, I would not be ok with passing in a shared - and unlocked - &file->f_pos value to random drivers that *might* do odd things when a race happens. Linus
[PATCH v4 06/16] locking/rwsem: Code cleanup after files merging
After merging all the relevant rwsem code into one single file, there are a number of optimizations and cleanups that can be done: 1) Remove all the EXPORT_SYMBOL() calls for functions that are not accessed elsewhere. 2) Remove all the __visible tags as none of the functions will be called from assembly code anymore. 3) Make all the internal functions static. 4) Remove some unneeded blank lines. That enables the compiler to do better optimization and reduce code size. The text+data size of rwsem.o on an x86-64 machine was reduced from 8945 bytes to 4651 bytes with this change. Suggested-by: Peter Zijlstra Signed-off-by: Waiman Long --- kernel/locking/rwsem.c | 50 +- 1 file changed, 10 insertions(+), 40 deletions(-) diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c index 5f06b0601eb6..c1a089ab19fd 100644 --- a/kernel/locking/rwsem.c +++ b/kernel/locking/rwsem.c @@ -207,7 +207,6 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name, osq_lock_init(&sem->osq); #endif } - EXPORT_SYMBOL(__init_rwsem); enum rwsem_waiter_type { @@ -575,19 +574,17 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state) return ERR_PTR(-EINTR); } -__visible struct rw_semaphore * __sched +static inline struct rw_semaphore * __sched rwsem_down_read_failed(struct rw_semaphore *sem) { return __rwsem_down_read_failed_common(sem, TASK_UNINTERRUPTIBLE); } -EXPORT_SYMBOL(rwsem_down_read_failed); -__visible struct rw_semaphore * __sched +static inline struct rw_semaphore * __sched rwsem_down_read_failed_killable(struct rw_semaphore *sem) { return __rwsem_down_read_failed_common(sem, TASK_KILLABLE); } -EXPORT_SYMBOL(rwsem_down_read_failed_killable); /* * Wait until we successfully acquire the write lock @@ -694,26 +691,23 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state) return ERR_PTR(-EINTR); } -__visible struct rw_semaphore * __sched +static inline struct rw_semaphore * __sched rwsem_down_write_failed(struct rw_semaphore *sem) { return 
__rwsem_down_write_failed_common(sem, TASK_UNINTERRUPTIBLE); } -EXPORT_SYMBOL(rwsem_down_write_failed); -__visible struct rw_semaphore * __sched +static inline struct rw_semaphore * __sched rwsem_down_write_failed_killable(struct rw_semaphore *sem) { return __rwsem_down_write_failed_common(sem, TASK_KILLABLE); } -EXPORT_SYMBOL(rwsem_down_write_failed_killable); /* * handle waking up a waiter on the semaphore * - up_read/up_write has decremented the active part of count if we come here */ -__visible -struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem) +static struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem) { unsigned long flags; DEFINE_WAKE_Q(wake_q); @@ -728,15 +722,13 @@ struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem) return sem; } -EXPORT_SYMBOL(rwsem_wake); /* * downgrade a write lock into a read lock * - caller incremented waiting part of count and discovered it still negative * - just wake up any readers at the front of the queue */ -__visible -struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem) +static struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem) { unsigned long flags; DEFINE_WAKE_Q(wake_q); @@ -751,7 +743,6 @@ struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem) return sem; } -EXPORT_SYMBOL(rwsem_downgrade_wake); /* * lock for reading @@ -895,7 +886,6 @@ void __sched down_read(struct rw_semaphore *sem) LOCK_CONTENDED(sem, __down_read_trylock, __down_read); } - EXPORT_SYMBOL(down_read); int __sched down_read_killable(struct rw_semaphore *sem) @@ -910,7 +900,6 @@ int __sched down_read_killable(struct rw_semaphore *sem) return 0; } - EXPORT_SYMBOL(down_read_killable); /* @@ -924,7 +913,6 @@ int down_read_trylock(struct rw_semaphore *sem) rwsem_acquire_read(&sem->dep_map, 0, 1, _RET_IP_); return ret; } - EXPORT_SYMBOL(down_read_trylock); /* @@ -934,10 +922,8 @@ void __sched down_write(struct rw_semaphore *sem) { might_sleep(); rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_); - LOCK_CONTENDED(sem, 
__down_write_trylock, __down_write); } - EXPORT_SYMBOL(down_write); /* @@ -948,14 +934,14 @@ int __sched down_write_killable(struct rw_semaphore *sem) might_sleep(); rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_); - if (LOCK_CONTENDED_RETURN(sem, __down_write_trylock, __down_write_killable)) { + if (LOCK_CONTENDED_RETURN(sem, __down_write_trylock, + __down_write_killable)) { rwsem_release(&sem->dep_map, 1, _RET_IP_); return -EINTR; } return 0; } - EXPORT_SYMBOL(down_write_killable); /* @@ -970,7 +956,6 @@ int down_write_trylock(struct rw_semaphore *sem) return
[PATCH v4 05/16] locking/rwsem: Merge rwsem.h and rwsem-xadd.c into rwsem.c
Now we only have one implementation of rwsem. Even though we still use xadd to handle reader locking, we use cmpxchg for writer instead. So the filename rwsem-xadd.c is not strictly correct. Also no one outside of the rwsem code need to know the internal implementation other than function prototypes for two internal functions that are called directly from percpu-rwsem.c. So the rwsem-xadd.c and rwsem.h files are now merged into rwsem.c in the following order: The rwsem.h file now contains only 2 function declarations for __up_read() and __down_read(). This is a code relocation patch with no code change at all except making __up_read() and __down_read() non-static functions so they can be used by percpu-rwsem.c. Suggested-by: Peter Zijlstra Signed-off-by: Waiman Long --- kernel/locking/Makefile | 2 +- kernel/locking/rwsem-xadd.c | 608 - kernel/locking/rwsem.c | 870 kernel/locking/rwsem.h | 283 +--- 4 files changed, 877 insertions(+), 886 deletions(-) delete mode 100644 kernel/locking/rwsem-xadd.c diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile index 6fe2f333aecb..45452facff3b 100644 --- a/kernel/locking/Makefile +++ b/kernel/locking/Makefile @@ -3,7 +3,7 @@ # and is generally not a function of system call inputs. KCOV_INSTRUMENT:= n -obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o rwsem-xadd.o +obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o ifdef CONFIG_FUNCTION_TRACER CFLAGS_REMOVE_lockdep.o = $(CC_FLAGS_FTRACE) diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c deleted file mode 100644 index 92f7d7b6bfa3.. --- a/kernel/locking/rwsem-xadd.c +++ /dev/null @@ -1,608 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0 -/* rwsem.c: R/W semaphores: contention handling functions - * - * Written by David Howells (dhowe...@redhat.com). - * Derived from arch/i386/kernel/semaphore.c - * - * Writer lock-stealing by Alex Shi - * and Michel Lespinasse - * - * Optimistic spinning by Tim Chen - * and Davidlohr Bueso . Based on mutexes. 
- * - * Rwsem count bit fields re-definition by Waiman Long . - */ -#include -#include -#include -#include -#include -#include -#include -#include - -#include "rwsem.h" - -/* - * Guide to the rw_semaphore's count field. - * - * When the RWSEM_WRITER_LOCKED bit in count is set, the lock is owned - * by a writer. - * - * The lock is owned by readers when - * (1) the RWSEM_WRITER_LOCKED isn't set in count, - * (2) some of the reader bits are set in count, and - * (3) the owner field has RWSEM_READ_OWNED bit set. - * - * Having some reader bits set is not enough to guarantee a readers owned - * lock as the readers may be in the process of backing out from the count - * and a writer has just released the lock. So another writer may steal - * the lock immediately after that. - */ - -/* - * Initialize an rwsem: - */ -void __init_rwsem(struct rw_semaphore *sem, const char *name, - struct lock_class_key *key) -{ -#ifdef CONFIG_DEBUG_LOCK_ALLOC - /* -* Make sure we are not reinitializing a held semaphore: -*/ - debug_check_no_locks_freed((void *)sem, sizeof(*sem)); - lockdep_init_map(&sem->dep_map, name, key, 0); -#endif - atomic_long_set(&sem->count, RWSEM_UNLOCKED_VALUE); - raw_spin_lock_init(&sem->wait_lock); - INIT_LIST_HEAD(&sem->wait_list); - sem->owner = NULL; -#ifdef CONFIG_RWSEM_SPIN_ON_OWNER - osq_lock_init(&sem->osq); -#endif -} - -EXPORT_SYMBOL(__init_rwsem); - -enum rwsem_waiter_type { - RWSEM_WAITING_FOR_WRITE, - RWSEM_WAITING_FOR_READ -}; - -struct rwsem_waiter { - struct list_head list; - struct task_struct *task; - enum rwsem_waiter_type type; -}; - -enum rwsem_wake_type { - RWSEM_WAKE_ANY, /* Wake whatever's at head of wait list */ - RWSEM_WAKE_READERS, /* Wake readers only */ - RWSEM_WAKE_READ_OWNED /* Waker thread holds the read lock */ -}; - -/* - * handle the lock release when processes blocked on it that can now run - * - if we come here from up_xxxx(), then the RWSEM_FLAG_WAITERS bit must - * have been set. 
- * - there must be someone on the queue - * - the wait_lock must be held by the caller - * - tasks are marked for wakeup, the caller must later invoke wake_up_q() - * to actually wakeup the blocked task(s) and drop the reference count, - * preferably when the wait_lock is released - * - woken process blocks are discarded from the list after having task zeroed - * - writers are only marked woken if downgrading is false - */ -static void __rwsem_mark_wake(struct rw_semaphore *sem, - enum rwsem_wake_type wake_type, - struct wake_q_head *wake_q) -{ - struct rwsem_waiter *waiter, *tmp; - long oldcount, woken = 0, adjustment = 0; - - /* -* Take a peek at the queue head waiter such
Re: [PATCH] pci/switchtec: fix stream_open.cocci warnings (fwd)
Hello everyone, On Sat, Apr 13, 2019 at 06:50:57PM +0200, Julia Lawall wrote: > Hello, > > Kirill will explain about this issue. pci/switchtec switching to stream_open is already queued for the merge window and it was acked by Logan Gunthorpe: https://lore.kernel.org/lkml/CAHk-=wgqgN5j1ZWnyVLqqoyU=ccwtyoko3mdyu8l_5e21kv...@mail.gmail.com/ https://lab.nexedi.com/kirr/linux/commit/edaeb4101860 (there are too many Cc's on that patch, and the email with it and replies-all to it did not get into the mailing list, probably because they were considered spam) stream_open.cocci was issuing only a warning for pci/switchtec, but after 8a29a3bae2a2 ("pci/switchtec: Don't use completion's wait queue") they started to use wait_event_* inside the read method and, since stream_open.cocci considers wait_event_* as blocking, the warning became an error. Previously completions were used there, but I added support only for wait events for simplicity. I can handle pci/switchtec switching via the big nonseekable_open -> stream_open change at the next merge window, or, alternatively, please feel free to switch pci/switchtec now on its own. The change is correct - I manually reviewed all the changes that stream_open.cocci produces and pci/switchtec was there. Kirill > julia > > -- Forwarded message -- > Date: Sat, 13 Apr 2019 11:22:51 +0800 > From: kbuild test robot > To: kbu...@01.org > Cc: Julia Lawall > Subject: [PATCH] pci/switchtec: fix stream_open.cocci warnings > > CC: kbuild-...@01.org > TO: Sebastian Andrzej Siewior > CC: Kurt Schwemmer > CC: Logan Gunthorpe > CC: Bjorn Helgaas > CC: linux-...@vger.kernel.org > CC: linux-kernel@vger.kernel.org > > From: kbuild test robot > > drivers/pci/switch/switchtec.c:395:1-17: ERROR: switchtec_fops: .read() can > deadlock .write(); change nonseekable_open -> stream_open to fix. 
> > Generated by: scripts/coccinelle/api/stream_open.cocci > > Fixes: 8a29a3bae2a2 ("pci/switchtec: Don't use completion's wait queue") > Signed-off-by: kbuild test robot > --- > > tree: https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git > linux-5.0.y-rt-rebase > head: 794c294ae4483c240429c25a0d18e272e92c94de > commit: 8a29a3bae2a2dfb0116cd8791d9700515d6e765e [154/311] pci/switchtec: > Don't use completion's wait queue > :: branch date: 7 hours ago > :: commit date: 7 hours ago > > Please take the patch only if it's a positive warning. Thanks! > > switchtec.c |2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > --- a/drivers/pci/switch/switchtec.c > +++ b/drivers/pci/switch/switchtec.c > @@ -392,7 +392,7 @@ static int switchtec_dev_open(struct ino > return PTR_ERR(stuser); > > filp->private_data = stuser; > - nonseekable_open(inode, filp); > + stream_open(inode, filp); > > dev_dbg(>dev, "%s: %p\n", __func__, stuser);
[PATCH v4 11/16] locking/rwsem: Enable readers spinning on writer
This patch enables readers to optimistically spin on an rwsem when it is owned by a writer instead of going to sleep directly. The rwsem_can_spin_on_owner() function is extracted out of rwsem_optimistic_spin() and is called directly by __rwsem_down_read_failed_common() and __rwsem_down_write_failed_common(). With a locking microbenchmark running on a 5.1-based kernel, the total locking rates (in kops/s) on an 8-socket IvyBridge-EX system with equal numbers of readers and writers before and after the patch were as follows:

  # of Threads   Pre-patch   Post-patch
  ------------   ---------   ----------
        4          1,674       1,684
        8          1,062       1,074
       16            924         900
       32            300         458
       64            195         208
      128            164         168
      240            149         143

The performance change wasn't significant in this case, but this change is required by a follow-on patch. Signed-off-by: Waiman Long --- kernel/locking/lock_events_list.h | 1 + kernel/locking/rwsem.c| 91 +++ 2 files changed, 80 insertions(+), 12 deletions(-) diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h index 29e5c52197fa..333ed5fda333 100644 --- a/kernel/locking/lock_events_list.h +++ b/kernel/locking/lock_events_list.h @@ -56,6 +56,7 @@ LOCK_EVENT(rwsem_sleep_reader)/* # of reader sleeps */ LOCK_EVENT(rwsem_sleep_writer) /* # of writer sleeps */ LOCK_EVENT(rwsem_wake_reader) /* # of reader wakeups */ LOCK_EVENT(rwsem_wake_writer) /* # of writer wakeups */ +LOCK_EVENT(rwsem_opt_rlock)/* # of read locks opt-spin acquired*/ LOCK_EVENT(rwsem_opt_wlock)/* # of write locks opt-spin acquired */ LOCK_EVENT(rwsem_opt_fail) /* # of failed opt-spinnings*/ LOCK_EVENT(rwsem_rlock)/* # of read locks acquired */ diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c index cf0a90d251aa..3cf8355252d1 100644 --- a/kernel/locking/rwsem.c +++ b/kernel/locking/rwsem.c @@ -103,9 +103,12 @@ RWSEM_FLAG_HANDOFF) #define RWSEM_COUNT_LOCKED(c) ((c) & RWSEM_LOCK_MASK) +#define RWSEM_COUNT_WLOCKED(c) ((c) & RWSEM_WRITER_MASK) #define RWSEM_COUNT_HANDOFF(c) ((c) & RWSEM_FLAG_HANDOFF) #define RWSEM_COUNT_LOCKED_OR_HANDOFF(c) 
\ ((c) & (RWSEM_LOCK_MASK|RWSEM_FLAG_HANDOFF)) +#define RWSEM_COUNT_WLOCKED_OR_HANDOFF(c) \ + ((c) & (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF)) /* * All writes to owner are protected by WRITE_ONCE() to make sure that @@ -436,6 +439,30 @@ static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem, } #ifdef CONFIG_RWSEM_SPIN_ON_OWNER +/* + * Try to acquire read lock before the reader is put on wait queue. + * Lock acquisition isn't allowed if the rwsem is locked or a writer handoff + * is ongoing. + */ +static inline bool rwsem_try_read_lock_unqueued(struct rw_semaphore *sem) +{ + long count = atomic_long_read(&sem->count); + + if (RWSEM_COUNT_WLOCKED_OR_HANDOFF(count)) + return false; + + count = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS, &sem->count); + if (!RWSEM_COUNT_WLOCKED_OR_HANDOFF(count)) { + rwsem_set_reader_owned(sem); + lockevent_inc(rwsem_opt_rlock); + return true; + } + + /* Back out the change */ + atomic_long_add(-RWSEM_READER_BIAS, &sem->count); + return false; +} + /* * Try to acquire write lock before the writer has been put on wait queue. */ @@ -470,9 +497,12 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem) BUILD_BUG_ON(!rwsem_has_anonymous_owner(RWSEM_OWNER_UNKNOWN)); - if (need_resched()) + if (need_resched()) { + lockevent_inc(rwsem_opt_fail); return false; + } + preempt_disable(); rcu_read_lock(); owner = READ_ONCE(sem->owner); if (owner) { @@ -480,6 +510,9 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem) owner_on_cpu(owner); } rcu_read_unlock(); + preempt_enable(); + + lockevent_cond_inc(rwsem_opt_fail, !ret); return ret; } @@ -549,7 +582,7 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem) return !owner ? 
OWNER_NULL : OWNER_READER; } -static bool rwsem_optimistic_spin(struct rw_semaphore *sem) +static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock) { bool taken = false; bool is_rt_task = rt_task(current); @@ -558,9 +591,6 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem) preempt_disable(); /* sem->wait_lock should not be held when doing optimistic spinning */ - if (!rwsem_can_spin_on_owner(sem)) -
[PATCH v4 10/16] locking/rwsem: Wake up almost all readers in wait queue
When the front of the wait queue is a reader, other readers immediately following the first reader will also be woken up at the same time. However, if there is a writer in between, those readers behind the writer will not be woken up. Because of optimistic spinning, the lock acquisition order is not FIFO anyway. The lock handoff mechanism will ensure that lock starvation will not happen. Assuming that the lock hold times of the other readers still in the queue will be about the same as the readers that are being woken up, there is really not much additional cost other than the additional latency due to the wakeup of additional tasks by the waker. Therefore all the readers in the queue, up to a maximum of 256, are woken up when the first waiter is a reader, to improve reader throughput. With a locking microbenchmark running on a 5.1-based kernel, the total locking rates (in kops/s) on an 8-socket IvyBridge-EX system with equal numbers of readers and writers before and after this patch were as follows:

  # of Threads   Pre-patch   Post-patch
  ------------   ---------   ----------
        4          1,641       1,674
        8            731       1,062
       16            564         924
       32             78         300
       64             38         195
      240             50         149

There is no performance gain at low contention level. At high contention level, however, this patch gives a pretty decent performance boost. Signed-off-by: Waiman Long --- kernel/locking/rwsem.c | 21 + 1 file changed, 17 insertions(+), 4 deletions(-) diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c index 8e19b5141595..cf0a90d251aa 100644 --- a/kernel/locking/rwsem.c +++ b/kernel/locking/rwsem.c @@ -260,6 +260,13 @@ enum writer_wait_state { */ #define RWSEM_WAIT_TIMEOUT (HZ/250) +/* + * We limit the maximum number of readers that can be woken up for a + * wake-up call to avoid penalizing the waking thread for spending too + * much time doing it. 
+ */ +#define MAX_READERS_WAKEUP 0x100 + /* * handle the lock release when processes blocked on it that can now run * - if we come here from up_(), then the RWSEM_FLAG_WAITERS bit must @@ -332,16 +339,16 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem, } /* -* Grant an infinite number of read locks to the readers at the front -* of the queue. We know that woken will be at least 1 as we accounted -* for above. Note we increment the 'active part' of the count by the +* Grant up to MAX_READERS_WAKEUP read locks to all the readers in the +* queue. We know that woken will be at least 1 as we accounted for +* above. Note we increment the 'active part' of the count by the * number of readers before waking any processes up. */ list_for_each_entry_safe(waiter, tmp, >wait_list, list) { struct task_struct *tsk; if (waiter->type == RWSEM_WAITING_FOR_WRITE) - break; + continue; woken++; tsk = waiter->task; @@ -360,6 +367,12 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem, * after setting the reader waiter to nil. */ wake_q_add_safe(wake_q, tsk); + + /* +* Limit # of readers that can be woken up per wakeup call. +*/ + if (woken >= MAX_READERS_WAKEUP) + break; } adjustment = woken * RWSEM_READER_BIAS - adjustment; -- 2.18.1
[PATCH 2/2] vfs: use >f_pos directly on files that have position
Long ago vfs read/write operations were passing ppos=&file->f_pos directly to .read / .write file_operations methods. That changed in 2004 in 55f09ec0087c ("read/write: pass down a copy of f_pos, not f_pos itself.") which started to pass ppos=&pos_var trying to avoid simultaneous read/write/lseek stepping onto each other's toes and overwriting file->f_pos racily. That measure was not complete and in 2014 commit 9c225f2655e36 ("vfs: atomic f_pos accesses as per POSIX") added file->f_pos_lock to completely disable simultaneous read/write/lseek runs. After f_pos_lock was introduced the reason to avoid passing ppos=&file->f_pos directly due to concurrency vanished. Linus explains[1]: In fact, we *used* to (long ago) pass in the address of "file->f_pos" itself to the low-level read/write routines. We then changed it to do that indirection through a local copy of pos (and file_pos_read/file_pos_write) because we didn't do the proper locking, so different read/write versions could mess with each other (and with lseek). But one of the things that commit 9c225f2655e36 ("vfs: atomic f_pos accesses as per POSIX") did was to add the proper locking at least for the cases that we care about deeply, so we *could* say that we have three cases: - FMODE_ATOMIC_POS: properly locked, - FMODE_STREAM: no pos at all - otherwise a "mostly don't care - don't mix!" and so we could go back to not copying the pos at all, and instead do something like loff_t *ppos = f.file->f_mode & FMODE_STREAM ? NULL : &f.file->f_pos; ret = vfs_write(f.file, buf, count, ppos); and perhaps have a long-term plan to try to get rid of the "don't mix" case entirely (ie "if you use f_pos, then we'll do the proper locking") (The above is obviously surrounded by the fdget_pos()/fdput_pos() that implements the locking decision). Currently for regular files we always set FMODE_ATOMIC_POS and change that to FMODE_STREAM if stream_open is used explicitly on open. That leaves other files, like e.g. 
sockets and pipes, for the "mostly don't care - don't mix!" case. Sockets, for example, always check that on read/write the initial pos they receive is 0 and don't update it; if it is !0 they return -ESPIPE. That suggests that we can do the switch to passing &file->f_pos directly now, and incrementally convert to FMODE_STREAM the files that were doing the stream-like checking manually in their low-level .read/.write handlers.

Note: it is theoretically possible that a driver updates *ppos internally even if read/write returns an error. For such cases the conversion will change the IO semantics a bit. The semantic that is changing here was introduced in 2013 in commit 5faf153ebf61 ("don't call file_pos_write() if vfs_{read,write}{,v}() fails").

[1] https://lore.kernel.org/linux-fsdevel/CAHk-=whjtzt52snhbgrnmnuxfn3ge9x_e02x8bpxtkqrfyz...@mail.gmail.com/

Suggested-by: Linus Torvalds
Signed-off-by: Kirill Smelkov
---
 fs/read_write.c | 36 ++++--------------------------------
 1 file changed, 4 insertions(+), 32 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index d62556be6848..13550b65cb2c 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -571,14 +571,7 @@ ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos, *ppos = file_ppos(f.file);
-		if (ppos) {
-			pos = *ppos;
-			ppos = &pos;
-		}
-		ret = vfs_read(f.file, buf, count, ppos);
-		if (ret >= 0 && ppos)
-			f.file->f_pos = pos;
+		ret = vfs_read(f.file, buf, count, file_ppos(f.file));
 		fdput_pos(f);
 	}
 	return ret;
@@ -595,14 +588,7 @@ ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos, *ppos = file_ppos(f.file);
-		if (ppos) {
-			pos = *ppos;
-			ppos = &pos;
-		}
-		ret = vfs_write(f.file, buf, count, ppos);
-		if (ret >= 0 && ppos)
-			f.file->f_pos = pos;
+		ret = vfs_write(f.file, buf, count, file_ppos(f.file));
 		fdput_pos(f);
 	}
@@ -1018,14 +1004,7 @@ static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec,
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos, *ppos = file_ppos(f.file);
-		if (ppos) {
-			pos = *ppos;
-			ppos = &pos;
-		}
-		ret = vfs_readv(f.file, vec, vlen, ppos, flags);
-		if (ret >= 0 && ppos)
-			f.file->f_pos = pos;
+		ret = vfs_readv(f.file, vec, vlen,
[PATCH 1/2] vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files
This amends commit 10dce8af3422 ("fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock") in how position is passed into .read()/.write() handlers for stream-like files:

Rasmus noticed that we currently pass 0 as position and ignore any position change if that is done by a file implementation. This papers over bugs if ppos is used in files that declare themselves as being stream-like, as such bugs will go unnoticed. Even if a file implementation is correctly converted into using stream_open, its read/write could later be changed to use ppos, and even though that won't be working correctly, that bug might go unnoticed without someone doing a wrong-behaviour analysis. It is thus better to pass ppos=NULL into read/write for stream-like files, as that doesn't give any chance for ppos usage bugs: it will oops if ppos is ever used inside .read() or .write().

Note 1: rw_verify_area and new_sync_{read,write} need to be updated because they are called by vfs_read/vfs_write & friends before file_operations .read/.write .

Note 2: if a file backend uses new-style .read_iter/.write_iter, position is still passed into there as a non-pointer kiocb.ki_pos . Currently stream_open.cocci (the semantic patch added by 10dce8af3422) ignores files whose file_operations has *_iter methods.

Suggested-by: Rasmus Villemoes
Signed-off-by: Kirill Smelkov
---
 fs/open.c       |  5 ++--
 fs/read_write.c | 75 +++++++++++++++++++++++++++++++++------------------------
 2 files changed, 47 insertions(+), 33 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index a00350018a47..9c7d724a6f67 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1219,8 +1219,9 @@ EXPORT_SYMBOL(nonseekable_open);
 /*
  * stream_open is used by subsystems that want stream-like file descriptors.
  * Such file descriptors are not seekable and don't have notion of position
- * (file.f_pos is always 0). Contrary to file descriptors of other regular
- * files, .read() and .write() can run simultaneously.
+ * (file.f_pos is always 0 and ppos passed to .read()/.write() is always NULL).
+ * Contrary to file descriptors of other regular files, .read() and .write()
+ * can run simultaneously.
  *
  * stream_open never fails and is marked to return int so that it could be
  * directly used as file_operations.open .
diff --git a/fs/read_write.c b/fs/read_write.c
index 61b43ad7608e..d62556be6848 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -371,7 +371,7 @@ int rw_verify_area(int read_write, struct file *file, const loff_t *ppos, size_t
 	inode = file_inode(file);
 	if (unlikely((ssize_t) count < 0))
 		return retval;
-	pos = *ppos;
+	pos = (ppos ? *ppos : 0);
 	if (unlikely(pos < 0)) {
 		if (!unsigned_offsets(file))
 			return retval;
@@ -400,12 +400,13 @@ static ssize_t new_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
 	ssize_t ret;
 
 	init_sync_kiocb(&kiocb, filp);
-	kiocb.ki_pos = *ppos;
+	kiocb.ki_pos = (ppos ? *ppos : 0);
 	iov_iter_init(&iter, READ, &iov, 1, len);
 
 	ret = call_read_iter(filp, &kiocb, &iter);
 	BUG_ON(ret == -EIOCBQUEUED);
-	*ppos = kiocb.ki_pos;
+	if (ppos)
+		*ppos = kiocb.ki_pos;
 	return ret;
 }
 
@@ -468,12 +469,12 @@ static ssize_t new_sync_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos)
 	ssize_t ret;
 
 	init_sync_kiocb(&kiocb, filp);
-	kiocb.ki_pos = *ppos;
+	kiocb.ki_pos = (ppos ? *ppos : 0);
 	iov_iter_init(&iter, WRITE, &iov, 1, len);
 
 	ret = call_write_iter(filp, &kiocb, &iter);
 	BUG_ON(ret == -EIOCBQUEUED);
-	if (ret > 0)
+	if (ret > 0 && ppos)
 		*ppos = kiocb.ki_pos;
 	return ret;
 }
@@ -558,15 +559,10 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *ppos)
 	return ret;
 }
 
-static inline loff_t file_pos_read(struct file *file)
-{
-	return file->f_mode & FMODE_STREAM ? 0 : file->f_pos;
-}
-
-static inline void file_pos_write(struct file *file, loff_t pos)
+/* file_ppos returns &file->f_pos or NULL if file is stream */
+static inline loff_t *file_ppos(struct file *file)
 {
-	if ((file->f_mode & FMODE_STREAM) == 0)
-		file->f_pos = pos;
+	return file->f_mode & FMODE_STREAM ?
		NULL : &file->f_pos;
 }
 
 ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
@@ -575,10 +571,14 @@ ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos = file_pos_read(f.file);
-		ret = vfs_read(f.file, buf, count, &pos);
-		if (ret >= 0)
-			file_pos_write(f.file, pos);
+		loff_t pos, *ppos = file_ppos(f.file);
+		if (ppos) {
+			pos = *ppos;
+			ppos = &pos;
+		}
+		ret = vfs_read(f.file, buf, count, ppos);
+		if (ret >=
[PATCH v4 09/16] locking/rwsem: Ensure an RT task will not spin on reader
An RT task can do optimistic spinning only if the lock holder is actually running. If the state of the lock holder isn't known, there is a possibility that the high priority of the RT task may block forward progress of the lock holder if it happens to reside on the same CPU. This will lead to deadlock. So we have to make sure that an RT task will not spin on a reader-owned rwsem.

When the owner is temporarily set to NULL, it is more tricky to decide if an RT task should stop spinning, as it may be a temporary state where another writer may have just stolen the lock, causing the task's trylock attempt to fail. So one more retry is allowed to make sure that the lock is not spinnable by an RT task.

When testing on an 8-socket IvyBridge-EX system, the one additional retry seems to improve locking performance of RT write-locking threads under heavy contention. The table below shows the locking rates (in kops/s) with various numbers of write-locking threads before and after the patch.

    Locking threads    Pre-patch    Post-patch
    ---------------    ---------    ----------
           4             2,753        2,608
           8             2,529        2,520
          16             1,727        1,918
          32             1,263        1,956
          64               889        1,343

Signed-off-by: Waiman Long
---
 kernel/locking/rwsem.c | 36 ++++++++++++++++++++++++++++++-------
 1 file changed, 29 insertions(+), 7 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 2d6850c3e77b..8e19b5141595 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -539,6 +539,8 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 {
 	bool taken = false;
+	bool is_rt_task = rt_task(current);
+	int prev_owner_state = OWNER_NULL;
 
 	preempt_disable();
 
@@ -556,7 +558,12 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 	 * 2) readers own the lock as we can't determine if they are
 	 *    actively running or not.
 	 */
-	while (rwsem_spin_on_owner(sem) & OWNER_SPINNABLE) {
+	for (;;) {
+		enum owner_state owner_state = rwsem_spin_on_owner(sem);
+
+		if (!(owner_state & OWNER_SPINNABLE))
+			break;
+
 		/*
 		 * Try to acquire the lock
 		 */
@@ -566,13 +573,28 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 		}
 
 		/*
-		 * When there's no owner, we might have preempted between the
-		 * owner acquiring the lock and setting the owner field. If
-		 * we're an RT task that will live-lock because we won't let
-		 * the owner complete.
+		 * An RT task cannot do optimistic spinning if it cannot
+		 * be sure the lock holder is running or live-lock may
+		 * happen if the current task and the lock holder happen
+		 * to run on the same CPU.
+		 *
+		 * When there's no owner or the lock is reader-owned, an
+		 * RT task will stop spinning if the owner state was not
+		 * a writer at the previous iteration of the loop. This
+		 * allows the RT task to recheck if the task that steals
+		 * the lock is a spinnable writer. If so, it can keep on
+		 * spinning.
+		 *
+		 * If the owner is a writer, the need_resched() check is
+		 * done inside rwsem_spin_on_owner(). If the owner is not
+		 * a writer, the need_resched() check needs to be done here.
 		 */
-		if (!sem->owner && (need_resched() || rt_task(current)))
-			break;
+		if (owner_state != OWNER_WRITER) {
+			if (need_resched())
+				break;
+			if (is_rt_task && (prev_owner_state != OWNER_WRITER))
+				break;
+		}
+		prev_owner_state = owner_state;
 
 		/*
 		 * The cpu_relax() call is a compiler barrier which forces
-- 
2.18.1
[PATCH v4 04/16] locking/rwsem: Implement a new locking scheme
The current way of using various reader, writer and waiting biases in the rwsem code is confusing and hard to understand. I have to reread the rwsem count guide in the rwsem-xadd.c file from time to time to remind myself how this whole thing works. It also makes the rwsem code harder to optimize.

To make rwsem more sane, a new locking scheme similar to the one in qrwlock is now being used. The atomic long count has the following bit definitions:

    Bit  0    - writer locked bit
    Bit  1    - waiters present bit
    Bits 2-7  - reserved for future extension
    Bits 8-X  - reader count (24/56 bits)

The cmpxchg instruction is now used to acquire the write lock. The read lock is still acquired with the xadd instruction, so there is no change here. This scheme will allow up to 16M/64P active readers, which should be more than enough. We can always use some more reserved bits if necessary.

With that change, we can deterministically know if a rwsem has been write-locked. Looking at the count alone, however, one cannot determine for certain if a rwsem is owned by readers or not, as the readers that set the reader count bits may be in the process of backing out. So we still need the reader-owned bit in the owner field to be sure.
With a locking microbenchmark running on a 5.1 based kernel, the total locking rates (in kops/s) of the benchmark on an 8-socket 120-core IvyBridge-EX system before and after the patch were as follows:

                   Before Patch        After Patch
    # of Threads   wlock    rlock    wlock    rlock
    ------------   ------   ------   ------   ------
         1         30,659   31,341   31,055   31,283
         2          8,909   16,457    9,884   17,659
         4          9,028   15,823    8,933   20,233
         8          8,410   14,212    7,230   17,140
        16          8,217   25,240    7,479   24,607

The locking rates of the benchmark on a Power8 system were as follows:

                   Before Patch        After Patch
    # of Threads   wlock    rlock    wlock    rlock
    ------------   ------   ------   ------   ------
         1         12,963   13,647   13,275   13,601
         2          7,570   11,569    7,902   10,829
         4          5,232    5,516    5,466    5,435
         8          5,233    3,386    5,467    3,168

The locking rates of the benchmark on a 2-socket ARM64 system were as follows:

                   Before Patch        After Patch
    # of Threads   wlock    rlock    wlock    rlock
    ------------   ------   ------   ------   ------
         1         21,495   21,046   21,524   21,074
         2          5,293   10,502    5,333   10,504
         4          5,325   11,463    5,358   11,631
         8          5,391   11,712    5,470   11,680

The performance is roughly the same before and after the patch. There are run-to-run variations in performance. Runs with higher variances usually have higher throughput.

Signed-off-by: Waiman Long
---
 kernel/locking/rwsem-xadd.c | 147 ++++++++++++++------------------------------
 kernel/locking/rwsem.h      |  74 +++++++++++++++++-----
 2 files changed, 86 insertions(+), 135 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 98de7f0cfedd..92f7d7b6bfa3 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -9,6 +9,8 @@
  *
  * Optimistic spinning by Tim Chen <tim.c.chen@intel.com>
  * and Davidlohr Bueso <davidlohr@hp.com>. Based on mutexes.
+ *
+ * Rwsem count bit fields re-definition by Waiman Long <longman@redhat.com>.
  */
 #include <linux/rwsem.h>
 #include <linux/init.h>
@@ -22,52 +24,20 @@
 #include "rwsem.h"
 
 /*
- * Guide to the rw_semaphore's count field for common values.
- * (32-bit case illustrated, similar for 64-bit)
- *
- * 0x0000000X	(1) X readers active or attempting lock, no writer waiting
- *		    X = #active_readers + #readers attempting to lock
- *		    (X*ACTIVE_BIAS)
- *
- * 0x00000000	rwsem is unlocked, and no one is waiting for the lock or
- *		attempting to read lock or write lock.
- *
- * 0xffff000X	(1) X readers active or attempting lock, with waiters for lock
- *		    X = #active readers + # readers attempting lock
- *		    (X*ACTIVE_BIAS + WAITING_BIAS)
- *		(2) 1 writer attempting lock, no waiters for lock
- *		    X-1 = #active readers + #readers attempting lock
- *		    ((X-1)*ACTIVE_BIAS + ACTIVE_WRITE_BIAS)
- *		(3) 1 writer active, no waiters for lock
- *		    X-1 = #active readers + #readers attempting lock
- *		    ((X-1)*ACTIVE_BIAS + ACTIVE_WRITE_BIAS)
- *
- * 0xffff0001	(1) 1 reader active or attempting lock, waiters for lock
- *		    (WAITING_BIAS + ACTIVE_BIAS)
- *		(2) 1 writer active or attempting lock, no waiters for lock
- *		    (ACTIVE_WRITE_BIAS)
+ * Guide to the rw_semaphore's count field.
  *
- * 0xffff0000	(1) There are writers or readers queued but none active
- *		    or in the process of attempting lock.
- *		    (WAITING_BIAS)
- *		Note: writer
[PATCH v4 15/16] locking/rwsem: Merge owner into count on x86-64
With separate count and owner, there are timing windows where the two values are inconsistent. That can cause problems when trying to figure out the exact state of the rwsem. For instance, an RT task will stop optimistic spinning if the lock is acquired by a writer but the owner field isn't set yet. That can be solved by combining the count and owner together in a single atomic value.

On 32-bit architectures, there aren't enough bits to hold both. 64-bit architectures, however, can have enough bits to do that. For x86-64, the physical address can use up to 52 bits. That is 4PB of memory. That leaves 12 bits available for other use. The task structure pointer is aligned to the L1 cache size. That means another 6 bits (64-byte cacheline) will be available. Reserving 2 bits for status flags, we will have 16 bits for the reader count and the read fail bit. That supports up to (32k-1) readers.

The owner value will still be duplicated in the owner field, as that will ease debugging when looking at a core dump.

This change is currently enabled for x86-64 only. Other 64-bit architectures may be enabled in the future if the need arises.

With a locking microbenchmark running on a 5.1 based kernel, the total locking rates (in kops/s) on an 8-socket IvyBridge-EX system with writer-only locking threads and then equal numbers of readers and writers (mixed) before this patch and after this and subsequent related patches were as follows:

                   Before Patch        After Patch
    # of Threads   wlock    mixed    wlock    mixed
    ------------   ------   ------   ------   ------
         1         30,422   31,034   30,323   30,379
         2          6,427    6,684    7,804    9,436
         4          6,742    6,738    7,568    8,268
         8          7,092    7,222    5,679    7,041
        16          6,882    7,163    6,848    7,652
        32          7,458    7,316    7,975    2,189
        64          7,906      520    8,269      534
       128          1,680      425    8,047      448

In the single-thread case, the more complex write-locking operation does introduce a little bit of overhead (about 0.3%). For the contended cases, except for some anomalies in the data, there is no evidence that this change will adversely impact performance.
When running the same microbenchmark with RT locking threads instead, we got the following results:

                   Before Patch        After Patch
    # of Threads   wlock    mixed    wlock    mixed
    ------------   ------   ------   ------   ------
         2          4,065    3,642    4,756    5,062
         4          2,254    1,907    3,460    2,496
         8          2,386      964    3,012    1,964
        16          2,095    1,596    3,083    1,862
        32          2,388      530    3,717      359
        64          1,424      322    4,060      401
       128          1,642      510    4,488      628

It is obvious that RT tasks can benefit pretty significantly from this set of patches.

Signed-off-by: Waiman Long
---
 kernel/locking/rwsem.c | 112 ++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 103 insertions(+), 9 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index f37ab6358fe0..27219abb8bb6 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -73,7 +73,41 @@
 #endif
 
 /*
- * On 64-bit architectures, the bit definitions of the count are:
+ * Enable the merging of owner into count for x86-64 only.
+ */
+#ifdef CONFIG_X86_64
+#define RWSEM_MERGE_OWNER_TO_COUNT
+#endif
+
+/*
+ * With separate count and owner, there are timing windows where the two
+ * values are inconsistent. That can cause problems when trying to figure
+ * out the exact state of the rwsem. That can be solved by combining
+ * the count and owner together in a single atomic value.
+ *
+ * On 64-bit architectures, the owner task structure pointer can be
+ * compressed and combined with reader count and other status flags.
+ * A simple compression method is to map the virtual address back to
+ * the physical address by subtracting PAGE_OFFSET. On 32-bit
+ * architectures, the long integer value just isn't big enough for
+ * combining owner and count. So they remain separate.
+ *
+ * For x86-64, the physical address can use up to 52 bits. That is 4PB
+ * of memory. That leaves 12 bits available for other use. The task
+ * structure pointer is also aligned to the L1 cache size. That means
+ * another 6 bits (64-byte cacheline) will be available. Reserving
+ * 2 bits for status flags, we will have 16 bits for the reader count
+ * and the read fail bit.
+ * That supports up to (32k-1) active readers.
+ *
+ * On x86-64, the bit definitions of the count are:
+ *
+ * Bit 0      - waiters present bit
+ * Bit 1      - lock handoff bit
+ * Bits 2-47  - compressed task structure pointer
+ * Bits 48-62 - 15-bit reader count
+ * Bit 63     - read fail bit
+ *
+ * On other 64-bit architectures, the bit definitions are:
  *
  * Bit 0      - writer locked bit
  * Bit 1      - waiters present bit
@@
[PATCH v4 14/16] locking/rwsem: Guard against making count negative
The upper bits of the count field are used as the reader count. When a sufficient number of active readers are present, the most significant bit will be set and the count becomes negative. If the number of active readers keeps piling up, we may eventually overflow the reader count. This is not likely to happen unless the number of bits reserved for the reader count is reduced because those bits are needed for other purposes.

To prevent this count overflow from happening, the most significant bit is now treated as a guard bit (RWSEM_FLAG_READFAIL). Read-lock attempts will now fail for both the fast and optimistic spinning paths whenever this bit is set. So all those extra readers will be put to sleep in the wait queue. Wakeup will not happen until the reader count reaches 0.

Signed-off-by: Waiman Long
---
 kernel/locking/rwsem.c | 84 ++++++++++++++++++++++++++++++++++----------------
 1 file changed, 64 insertions(+), 20 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index ab26aba65371..f37ab6358fe0 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -73,13 +73,28 @@
 #endif
 
 /*
- * The definition of the atomic counter in the semaphore:
+ * On 64-bit architectures, the bit definitions of the count are:
  *
- * Bit 0    - writer locked bit
- * Bit 1    - waiters present bit
- * Bit 2    - lock handoff bit
- * Bits 3-7 - reserved
- * Bits 8-X - 24-bit (32-bit) or 56-bit reader count
+ * Bit 0     - writer locked bit
+ * Bit 1     - waiters present bit
+ * Bit 2     - lock handoff bit
+ * Bits 3-7  - reserved
+ * Bits 8-62 - 55-bit reader count
+ * Bit 63    - read fail bit
+ *
+ * On 32-bit architectures, the bit definitions of the count are:
+ *
+ * Bit 0     - writer locked bit
+ * Bit 1     - waiters present bit
+ * Bit 2     - lock handoff bit
+ * Bits 3-7  - reserved
+ * Bits 8-30 - 23-bit reader count
+ * Bit 31    - read fail bit
+ *
+ * It is not likely that the most significant bit (read fail bit) will ever
+ * be set.
+ * This guard bit is still checked anyway in the down_read() fastpath
+ * just in case we need to use up more of the reader bits for other purposes
+ * in the future.
  *
  * atomic_long_fetch_add() is used to obtain reader lock, whereas
  * atomic_long_cmpxchg() will be used to obtain writer lock.
@@ -96,6 +111,7 @@
 #define RWSEM_WRITER_LOCKED	(1UL << 0)
 #define RWSEM_FLAG_WAITERS	(1UL << 1)
 #define RWSEM_FLAG_HANDOFF	(1UL << 2)
+#define RWSEM_FLAG_READFAIL	(1UL << (BITS_PER_LONG - 1))
 
 #define RWSEM_READER_SHIFT	8
 #define RWSEM_READER_BIAS	(1UL << RWSEM_READER_SHIFT)
@@ -103,7 +119,7 @@
 #define RWSEM_WRITER_MASK	RWSEM_WRITER_LOCKED
 #define RWSEM_LOCK_MASK		(RWSEM_WRITER_MASK|RWSEM_READER_MASK)
 #define RWSEM_READ_FAILED_MASK	(RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS|\
-				 RWSEM_FLAG_HANDOFF)
+				 RWSEM_FLAG_HANDOFF|RWSEM_FLAG_READFAIL)
 
 #define RWSEM_COUNT_LOCKED(c)	((c) & RWSEM_LOCK_MASK)
 #define RWSEM_COUNT_WLOCKED(c)	((c) & RWSEM_WRITER_MASK)
@@ -315,7 +331,8 @@ enum writer_wait_state {
 /*
  * We limit the maximum number of readers that can be woken up for a
  * wake-up call to not penalizing the waking thread for spending too
- * much time doing it.
+ * much time doing it as well as the unlikely possibility of overflowing
+ * the reader count.
  */
 #define MAX_READERS_WAKEUP	0x100
 
@@ -799,12 +816,35 @@ rwsem_waiter_is_first(struct rw_semaphore *sem, struct rwsem_waiter *waiter)
  * Wait for the read lock to be granted
  */
 static inline struct rw_semaphore __sched *
-__rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
+__rwsem_down_read_failed_common(struct rw_semaphore *sem, int state, long count)
 {
-	long count, adjustment = -RWSEM_READER_BIAS;
+	long adjustment = -RWSEM_READER_BIAS;
 	struct rwsem_waiter waiter;
 	DEFINE_WAKE_Q(wake_q);
 
+	if (unlikely(count < 0)) {
+		/*
+		 * The sign bit has been set meaning that too many active
+		 * readers are present. We need to decrement the reader count
+		 * and enter the wait queue immediately to avoid overflowing
+		 * the reader count.
+		 *
+		 * As preemption is not disabled, there is a remote
+		 * possibility that preemption can happen in the narrow
+		 * timing window between incrementing and decrementing
+		 * the reader count and the task is put to sleep for a
+		 * considerable amount of time. If a sufficient number
+		 * of such unfortunate sequences of events happen, we
+		 * may still overflow the reader count. It is extremely
+		 * unlikely, though. If this is a concern, we should
+		 * consider disabling preemption during this timing window
+		 * to make sure that such an unfortunate event will not
+		 * happen.
+		 */
+
[PATCH v4 08/16] locking/rwsem: Make rwsem_spin_on_owner() return owner state
This patch modifies rwsem_spin_on_owner() to return four possible values to better reflect the state of the lock holder, which enables us to make a better decision of what to do next.

In the special case that there is no active lock and the handoff bit is set, optimistic spinning has to be stopped.

Signed-off-by: Waiman Long
---
 kernel/locking/rwsem.c | 45 ++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 38 insertions(+), 7 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index aaab546a890d..2d6850c3e77b 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -156,6 +156,11 @@ static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
 	return !((unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED);
 }
 
+static inline bool is_rwsem_owner_reader(struct task_struct *owner)
+{
+	return (unsigned long)owner & RWSEM_READER_OWNED;
+}
+
 /*
  * Return true if rwsem is owned by an anonymous writer or readers.
  */
@@ -466,14 +471,30 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 }
 
 /*
- * Return true only if we can still spin on the owner field of the rwsem.
+ * Return the following 4 values depending on the lock owner state.
+ *   OWNER_NULL  : owner is currently NULL
+ *   OWNER_WRITER: when owner changes and is a writer
+ *   OWNER_READER: when owner changes and the new owner may be a reader.
+ *   OWNER_NONSPINNABLE:
+ *		   when optimistic spinning has to stop because either the
+ *		   owner stops running, is unknown, or its timeslice has
+ *		   been used up.
 */
-static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
+enum owner_state {
+	OWNER_NULL		= 1 << 0,
+	OWNER_WRITER		= 1 << 1,
+	OWNER_READER		= 1 << 2,
+	OWNER_NONSPINNABLE	= 1 << 3,
+};
+#define OWNER_SPINNABLE		(OWNER_NULL | OWNER_WRITER)
+
+static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 {
 	struct task_struct *owner = READ_ONCE(sem->owner);
+	long count;
 
 	if (!is_rwsem_owner_spinnable(owner))
-		return false;
+		return OWNER_NONSPINNABLE;
 
 	rcu_read_lock();
 	while (owner && (READ_ONCE(sem->owner) == owner)) {
@@ -491,7 +512,7 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
 		 */
 		if (need_resched() || !owner_on_cpu(owner)) {
 			rcu_read_unlock();
-			return false;
+			return OWNER_NONSPINNABLE;
 		}
 
 		cpu_relax();
@@ -500,9 +521,19 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
 
 	/*
 	 * If there is a new owner or the owner is not set, we continue
-	 * spinning.
+	 * spinning except when there are no active locks and the handoff
+	 * bit is set. In this case, we have to stop spinning.
 	 */
-	return is_rwsem_owner_spinnable(READ_ONCE(sem->owner));
+	owner = READ_ONCE(sem->owner);
+	if (!is_rwsem_owner_spinnable(owner))
+		return OWNER_NONSPINNABLE;
+	if (owner && !is_rwsem_owner_reader(owner))
+		return OWNER_WRITER;
+
+	count = atomic_long_read(&sem->count);
+	if (RWSEM_COUNT_HANDOFF(count) && !RWSEM_COUNT_LOCKED(count))
+		return OWNER_NONSPINNABLE;
+	return !owner ? OWNER_NULL : OWNER_READER;
 }
 
 static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
@@ -525,7 +556,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 	 * 2) readers own the lock as we can't determine if they are
 	 *    actively running or not.
 	 */
-	while (rwsem_spin_on_owner(sem)) {
+	while (rwsem_spin_on_owner(sem) & OWNER_SPINNABLE) {
 		/*
 		 * Try to acquire the lock
 		 */
-- 
2.18.1
[PATCH v4 13/16] locking/rwsem: Add more rwsem owner access helpers
Before combining owner and count, we add two new helpers for accessing the owner value in the rwsem:

 1) struct task_struct *rwsem_get_owner(struct rw_semaphore *sem)
 2) bool is_rwsem_reader_owned(struct rw_semaphore *sem)

Signed-off-by: Waiman Long
---
 kernel/locking/rwsem.c | 50 +++++++++++++++++++++++++++++++++++-------------
 1 file changed, 37 insertions(+), 13 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 8b23009e6b2c..ab26aba65371 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -130,6 +130,11 @@ static inline void rwsem_clear_owner(struct rw_semaphore *sem)
 	WRITE_ONCE(sem->owner, NULL);
 }
 
+static inline struct task_struct *rwsem_get_owner(struct rw_semaphore *sem)
+{
+	return READ_ONCE(sem->owner);
+}
+
 /*
  * The task_struct pointer of the last owning reader will be left in
  * the owner field.
  */
@@ -174,6 +179,23 @@ static inline bool is_rwsem_spinnable(struct rw_semaphore *sem)
 	return is_rwsem_owner_spinnable(READ_ONCE(sem->owner));
 }
 
+/*
+ * Return true if the rwsem is owned by a reader.
+ */
+static inline bool is_rwsem_reader_owned(struct rw_semaphore *sem)
+{
+#ifdef CONFIG_DEBUG_RWSEMS
+	/*
+	 * Check the count to see if it is write-locked.
+	 */
+	long count = atomic_long_read(&sem->count);
+
+	if (count & RWSEM_WRITER_MASK)
+		return false;
+#endif
+	return (unsigned long)sem->owner & RWSEM_READER_OWNED;
+}
+
 /*
  * Return true if rwsem is owned by an anonymous writer or readers.
 */
@@ -193,6 +215,7 @@ static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
 {
 	unsigned long val = (unsigned long)current | RWSEM_READER_OWNED
 					| RWSEM_ANONYMOUSLY_OWNED;
+
 	if (READ_ONCE(sem->owner) == (struct task_struct *)val)
 		cmpxchg_relaxed((unsigned long *)&sem->owner, val,
 				RWSEM_READER_OWNED | RWSEM_ANONYMOUSLY_OWNED);
@@ -530,7 +553,7 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 
 	preempt_disable();
 	rcu_read_lock();
-	owner = READ_ONCE(sem->owner);
+	owner = rwsem_get_owner(sem);
 	if (owner) {
 		ret = is_rwsem_owner_spinnable(owner) &&
 		      (is_rwsem_owner_reader(owner) || owner_on_cpu(owner));
@@ -562,15 +585,21 @@ enum owner_state {
 
 static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 {
-	struct task_struct *owner = READ_ONCE(sem->owner);
+	struct task_struct *owner = rwsem_get_owner(sem);
 	long count;
 
 	if (!is_rwsem_owner_spinnable(owner))
 		return OWNER_NONSPINNABLE;
 
 	rcu_read_lock();
-	while (owner && !is_rwsem_owner_reader(owner)
-	       && (READ_ONCE(sem->owner) == owner)) {
+	while (owner && !is_rwsem_owner_reader(owner)) {
+		struct task_struct *new_owner = rwsem_get_owner(sem);
+
+		if (new_owner != owner) {
+			owner = new_owner;
+			break;	/* The owner has changed */
+		}
+
 		/*
 		 * Ensure we emit the owner->on_cpu, dereference _after_
 		 * checking sem->owner still matches owner, if that fails,
@@ -597,7 +626,6 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 	 * spinning except when there are no active locks and the handoff
 	 * bit is set. In this case, we have to stop spinning.
 	 */
-	owner = READ_ONCE(sem->owner);
 	if (!is_rwsem_owner_spinnable(owner))
 		return OWNER_NONSPINNABLE;
 	if (owner && !is_rwsem_owner_reader(owner))
 		return OWNER_WRITER;
@@ -1093,8 +1121,7 @@ inline void __down_read(struct rw_semaphore *sem)
 	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
 			&sem->count) & RWSEM_READ_FAILED_MASK)) {
 		rwsem_down_read_failed(sem);
-		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
-					RWSEM_READER_OWNED), sem);
+		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	} else {
 		rwsem_set_reader_owned(sem);
 	}
@@ -1106,8 +1133,7 @@ static inline int __down_read_killable(struct rw_semaphore *sem)
 			&sem->count) & RWSEM_READ_FAILED_MASK)) {
 		if (IS_ERR(rwsem_down_read_failed_killable(sem)))
 			return -EINTR;
-		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
-					RWSEM_READER_OWNED), sem);
+		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	} else {
 		rwsem_set_reader_owned(sem);
 	}
@@ -1174,8 +1200,7 @@ inline void __up_read(struct rw_semaphore *sem)
 {
 	long tmp;
 
 	DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
[PATCH v4 16/16] locking/rwsem: Remove redundant computation of writer lock word
On 64-bit architectures, each rwsem writer will have its own unique lock word for acquiring the lock. Right now, the writer code recomputes the lock word every time it tries to acquire the lock. This is a waste of time. The lock word is now cached and reused when it is needed.

On 32-bit architectures, the extra constant argument to rwsem_try_write_lock() and rwsem_try_write_lock_unqueued() should be optimized out by the compiler.

Signed-off-by: Waiman Long
---
 kernel/locking/rwsem.c | 22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 27219abb8bb6..2c8187690c7c 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -561,6 +561,7 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
  * bit is set or the lock is acquired.
  */
 static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem,
+					const long wlock,
 					enum writer_wait_state wstate)
 {
 	long new;
@@ -581,7 +582,7 @@ static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem,
 	if ((wstate == WRITER_NOT_FIRST) && RWSEM_COUNT_HANDOFF(count))
 		return false;
 
-	new = (count & ~RWSEM_FLAG_HANDOFF) + RWSEM_WRITER_LOCKED -
+	new = (count & ~RWSEM_FLAG_HANDOFF) + wlock -
 	      (list_is_singular(&sem->wait_list) ? RWSEM_FLAG_WAITERS : 0);
 
 	if (atomic_long_try_cmpxchg_acquire(&sem->count, &count, new)) {
@@ -623,13 +624,14 @@ static inline bool rwsem_try_read_lock_unqueued(struct rw_semaphore *sem)
 
 /*
  * Try to acquire write lock before the writer has been put on wait queue.
*/ -static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem) +static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem, +const long wlock) { long count = atomic_long_read(&sem->count); while (!RWSEM_COUNT_LOCKED_OR_HANDOFF(count)) { if (atomic_long_try_cmpxchg_acquire(&sem->count, &count, - count + RWSEM_WRITER_LOCKED)) { + count + wlock)) { rwsem_set_owner(sem); lockevent_inc(rwsem_opt_wlock); return true; @@ -779,7 +781,7 @@ static inline u64 rwsem_rspin_threshold(struct rw_semaphore *sem) : 25 * NSEC_PER_USEC); } -static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock) +static bool rwsem_optimistic_spin(struct rw_semaphore *sem, const long wlock) { bool taken = false; bool is_rt_task = rt_task(current); @@ -808,7 +810,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock) /* * Try to acquire the lock */ - taken = wlock ? rwsem_try_write_lock_unqueued(sem) + taken = wlock ? rwsem_try_write_lock_unqueued(sem, wlock) : rwsem_try_read_lock_unqueued(sem); if (taken) @@ -887,7 +889,8 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem) return false; } -static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock) +static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem, +const long wlock) { return false; } @@ -944,7 +947,7 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state, long count) */ atomic_long_add(-RWSEM_READER_BIAS, &sem->count); adjustment = 0; - if (rwsem_optimistic_spin(sem, false)) { + if (rwsem_optimistic_spin(sem, 0)) { unsigned long flags; /* @@ -1058,10 +1061,11 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state) struct rwsem_waiter waiter; struct rw_semaphore *ret = sem; DEFINE_WAKE_Q(wake_q); + const long wlock = RWSEM_WRITER_LOCKED; /* do optimistic spinning and steal lock if possible */ if (rwsem_can_spin_on_owner(sem) && - rwsem_optimistic_spin(sem, true)) + rwsem_optimistic_spin(sem, wlock)) return sem; /* @@ 
-1120,7 +1124,7 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state) /* wait until we successfully acquire the lock */ set_current_state(state); while (true) { - if (rwsem_try_write_lock(count, sem, wstate)) + if (rwsem_try_write_lock(count, sem, wlock, wstate)) break; raw_spin_unlock_irq(&sem->wait_lock); -- 2.18.1
[PATCH v4 12/16] locking/rwsem: Enable time-based spinning on reader-owned rwsem
When the rwsem is owned by readers, writers stop optimistic spinning simply because there is no easy way to figure out whether all the readers are actively running or not. However, there are scenarios where the readers are unlikely to sleep and optimistic spinning can help performance. This patch provides a simple mechanism for a writer to spin on a reader-owned rwsem. It is a time-threshold-based spin where the allowable spinning time can vary from 10us to 25us depending on the condition of the rwsem. When the time threshold is exceeded, a bit will be set in the owner field to indicate that no more optimistic spinning will be allowed on this rwsem until it becomes writer owned again. For fairness, not even readers are allowed to acquire the reader-locked rwsem by optimistic spinning. The time taken for each iteration of the reader-owned rwsem spinning loop varies. Below are sample minimum elapsed times for 16 iterations of the loop.

  System                       Time for 16 Iterations
  ------                       ----------------------
  1-socket Skylake                    ~800ns
  4-socket Broadwell                  ~300ns
  2-socket ThunderX2 (arm64)          ~250ns

When the lock cacheline is contended, we can see up to almost a 10X increase in elapsed time. So 25us corresponds to at most 500, 1300 and 1600 iterations respectively for each of the above systems. With a locking microbenchmark running on a 5.1 based kernel, the total locking rates (in kops/s) on an 8-socket IvyBridge-EX system with equal numbers of readers and writers before and after this patch were as follows:

  # of Threads   Pre-patch   Post-patch
  ------------   ---------   ----------
       2           1,759       6,684
       4           1,684       6,738
       8           1,074       7,222
      16             900       7,163
      32             458       7,316
      64             208         520
     128             168         425
     240             143         474

This patch gives a big boost in performance for mixed reader/writer workloads. 
With 32 locking threads, the rwsem lock event data were: rwsem_opt_fail=79850 rwsem_opt_nospin=5069 rwsem_opt_rlock=597484 rwsem_opt_wlock=957339 rwsem_sleep_reader=57782 rwsem_sleep_writer=55663 With 64 locking threads, the data looked like: rwsem_opt_fail=346723 rwsem_opt_nospin=6293 rwsem_opt_rlock=1127119 rwsem_opt_wlock=1400628 rwsem_sleep_reader=308201 rwsem_sleep_writer=72281 So a lot more threads acquired the lock in the slowpath and more threads went to sleep. Signed-off-by: Waiman Long --- kernel/locking/lock_events_list.h | 1 + kernel/locking/rwsem.c| 121 ++ 2 files changed, 107 insertions(+), 15 deletions(-) diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h index 333ed5fda333..f3550aa5866a 100644 --- a/kernel/locking/lock_events_list.h +++ b/kernel/locking/lock_events_list.h @@ -59,6 +59,7 @@ LOCK_EVENT(rwsem_wake_writer) /* # of writer wakeups */ LOCK_EVENT(rwsem_opt_rlock)/* # of read locks opt-spin acquired*/ LOCK_EVENT(rwsem_opt_wlock)/* # of write locks opt-spin acquired */ LOCK_EVENT(rwsem_opt_fail) /* # of failed opt-spinnings*/ +LOCK_EVENT(rwsem_opt_nospin) /* # of disabled reader opt-spinnings */ LOCK_EVENT(rwsem_rlock)/* # of read locks acquired */ LOCK_EVENT(rwsem_rlock_fast) /* # of fast read locks acquired*/ LOCK_EVENT(rwsem_rlock_fail) /* # of failed read lock acquisitions */ diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c index 3cf8355252d1..8b23009e6b2c 100644 --- a/kernel/locking/rwsem.c +++ b/kernel/locking/rwsem.c @@ -22,6 +22,7 @@ #include #include #include +#include #include #include #include @@ -35,18 +36,20 @@ * - RWSEM_READER_OWNED (bit 0): The rwsem is owned by readers * - RWSEM_ANONYMOUSLY_OWNED (bit 1): The rwsem is anonymously owned, *i.e. the owner(s) cannot be readily determined. It can be reader - *owned or the owning writer is indeterminate. + *owned or the owning writer is indeterminate. Optimistic spinning + *should be disabled if this flag is set. 
* * When a writer acquires a rwsem, it puts its task_struct pointer - * into the owner field. It is cleared after an unlock. + * into the owner field or the count itself (64-bit only). It should + * be cleared after an unlock. * * When a reader acquires a rwsem, it will also puts its task_struct - * pointer into the owner field with both the RWSEM_READER_OWNED and - * RWSEM_ANONYMOUSLY_OWNED bits set. On unlock, the owner field will - * largely be left untouched. So for a free or reader-owned rwsem, - * the owner value may contain information about the last reader that - * acquires the rwsem. The anonymous bit is set because that particular - * reader may or may not still own the lock. + * pointer into the owner field with the RWSEM_READER_OWNED bit set. + * On
[PATCH v4 00/16] locking/rwsem: Rwsem rearchitecture part 2
v4:
 - Fix the missing initialization bug with !CONFIG_RWSEM_SPIN_ON_OWNER in patch 2.
 - Move the "Remove rwsem_wake() wakeup optimization" patch before the "Implement a new locking scheme" patch.
 - Add two new patches to merge the relevant content of rwsem.h and rwsem-xadd.c into rwsem.c as suggested by PeterZ.
 - Refactor the lock handoff patch to make all setting and clearing of the handoff bit serialized by wait_lock to ensure correctness.
 - Adapt the rest of the patches to the new code base.

v3:
 - Add 2 more patches in front to fix build and testing issues found. Patch 1 can actually be merged on top of the patch "locking/rwsem: Enhance DEBUG_RWSEMS_WARN_ON() macro" in part 1.
 - Change the handoff patch (now patch 4) to set the handoff bit immediately after wakeup for RT writers. The timeout limit is also tightened to 4ms.
 - There are no code changes in the other patches other than resolving conflicts with patches 1, 2 and 4.

v2:
 - Move the negative reader count checking patch (patch 12->10) forward to before the merge-owner-to-count patch as suggested by Linus, and expand the comment.
 - Change the reader-owned rwsem spinning from count based to time based to have better control of the max time allowed.

This is part 2 of a 3-part (0/1/2) series to rearchitect the internal operation of rwsem.

part 0: merged into tip
part 1: https://lore.kernel.org/lkml/20190404174320.22416-1-long...@redhat.com/

This patchset revamps the current rwsem-xadd implementation to make it saner and easier to work with. It also implements the following 3 new features:

 1) Waiter lock handoff
 2) Reader optimistic spinning
 3) Store write-lock owner in the atomic count (x86-64 only)

Waiter lock handoff is similar to the mechanism currently in the mutex code. This ensures that lock starvation won't happen. Reader optimistic spinning enables readers to acquire the lock more quickly. 
So workloads that use a mix of readers and writers should see an increase in performance as long as the reader critical sections are short. Finally, storing the write-lock owner in the count allows optimistic spinners to get to the lock holder's task structure more quickly, and it eliminates the timing gap where the write lock is acquired but the owner isn't known yet. This is important for RT tasks, where spinning on a lock with an unknown owner is not allowed. Because multiple readers can share the same lock, there is a natural preference for readers when measuring in terms of locking throughput, as more readers are likely to get into the locking fast path than writers. With waiter lock handoff, we are not going to starve the writers. On an 8-socket 120-core 240-thread IvyBridge-EX system with 120 reader and writer locking threads, the min/mean/max locking operations done in a 5-second testing window before the patchset were:

 120 readers, Iterations Min/Mean/Max = 399/400/401
 120 writers, Iterations Min/Mean/Max = 400/33,389/211,359

After the patchset, they became:

 120 readers, Iterations Min/Mean/Max = 584/10,266/26,609
 120 writers, Iterations Min/Mean/Max = 22,080/29,016/38,728

So it was much fairer to readers. With fewer locking threads, readers were preferred over writers.

Patch 1 fixes a testing issue with the locking selftest introduced by the patch "locking/rwsem: Enhance DEBUG_RWSEMS_WARN_ON() macro" in part 1.

Patch 2 makes owner a permanent member of the rw_semaphore structure and sets it irrespective of CONFIG_RWSEM_SPIN_ON_OWNER.

Patch 3 removes the rwsem_wake() wakeup optimization as it doesn't work with lock handoff.

Patch 4 implements a new rwsem locking scheme similar to what qrwlock is currently doing. Write lock is done by atomic_cmpxchg() while read lock is still done by atomic_add().

Patch 5 merges the content of rwsem.h and rwsem-xadd.c into rwsem.c, just like the mutex. 
The rwsem-xadd.c file is removed and a bare-bones rwsem.h is left for the internal function declarations needed by percpu-rwsem.c.

Patch 6 optimizes the merged rwsem.c file to generate a smaller object file.

Patch 7 implements lock handoff to prevent lock starvation. It is expected that throughput will be lower on workloads with highly contended rwsems, as a trade-off for better fairness.

Patch 8 makes rwsem_spin_on_owner() return the owner state.

Patch 9 disallows RT tasks from spinning on a rwsem with an unknown owner.

Patch 10 makes reader wakeup wake up almost all the readers in the wait queue instead of just those at the front.

Patch 11 enables a reader to spin on a writer-owned rwsem.

Patch 12 enables a writer to spin on a reader-owned rwsem for at most 25us.

Patch 13 adds some new rwsem owner access helper functions.

Patch 14 handles the case of too many readers by reserving the sign bit to designate that a reader lock attempt will fail and the locking reader will be put to sleep. This ensures that we will not overflow the reader count.

Patch 15 merges the write-lock owner task
[PATCH v4 03/16] locking/rwsem: Remove rwsem_wake() wakeup optimization
With commit 59aabfc7e959 ("locking/rwsem: Reduce spinlock contention in wakeup after up_read()/up_write()"), rwsem_wake() forgoes doing a wakeup if the wait_lock cannot be directly acquired and an optimistic spinning locker is present. This can help performance by avoiding spinning on the wait_lock when it is contended. With the later commit 133e89ef5ef3 ("locking/rwsem: Enable lockless waiter wakeup(s)"), the performance advantage of the above optimization diminishes as the average wait_lock hold time becomes much shorter. With a later patch that supports rwsem lock handoff, we can no longer rely on the fact that the presence of an optimistic spinning locker will ensure that the lock will be acquired by a task soon and that rwsem_wake() will be called later on to wake up the waiters. This can lead to a missed wakeup and an application hang. So commit 59aabfc7e959 ("locking/rwsem: Reduce spinlock contention in wakeup after up_read()/up_write()") will have to be reverted. Signed-off-by: Waiman Long --- kernel/locking/rwsem-xadd.c | 72 - 1 file changed, 72 deletions(-) diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c index 7fd4f1de794a..98de7f0cfedd 100644 --- a/kernel/locking/rwsem-xadd.c +++ b/kernel/locking/rwsem-xadd.c @@ -395,25 +395,11 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem) lockevent_cond_inc(rwsem_opt_fail, !taken); return taken; } - -/* - * Return true if the rwsem has active spinner - */ -static inline bool rwsem_has_spinner(struct rw_semaphore *sem) -{ - return osq_is_locked(&sem->osq); -} - #else static bool rwsem_optimistic_spin(struct rw_semaphore *sem) { return false; } - -static inline bool rwsem_has_spinner(struct rw_semaphore *sem) -{ - return false; -} #endif /* @@ -635,65 +621,7 @@ struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem) unsigned long flags; DEFINE_WAKE_Q(wake_q); - /* - * __rwsem_down_write_failed_common(sem) - * rwsem_optimistic_spin(sem) - * osq_unlock(sem->osq) - * ... 
- * atomic_long_add_return(&sem->count) - * - * - VS - - * - * __up_write() - *if (atomic_long_sub_return_release(&sem->count) < 0) - * rwsem_wake(sem) - *osq_is_locked(&sem->osq) - * - * And __up_write() must observe !osq_is_locked() when it observes the - * atomic_long_add_return() in order to not miss a wakeup. - * - * This boils down to: - * - * [S.rel] X = 1[RmW] r0 = (Y += 0) - * MB RMB - * [RmW] Y += 1 [L] r1 = X - * - * exists (r0=1 /\ r1=0) - */ - smp_rmb(); - - /* -* If a spinner is present, it is not necessary to do the wakeup. -* Try to do wakeup only if the trylock succeeds to minimize -* spinlock contention which may introduce too much delay in the -* unlock operation. -* -*spinning writer up_write/up_read caller -*--- --- -* [S] osq_unlock() [L] osq -* MB RMB -* [RmW] rwsem_try_write_lock() [RmW] spin_trylock(wait_lock) -* -* Here, it is important to make sure that there won't be a missed -* wakeup while the rwsem is free and the only spinning writer goes -* to sleep without taking the rwsem. Even when the spinning writer -* is just going to break out of the waiting loop, it will still do -* a trylock in rwsem_down_write_failed() before sleeping. IOW, if -* rwsem_has_spinner() is true, it will guarantee at least one -* trylock attempt on the rwsem later on. -*/ - if (rwsem_has_spinner(sem)) { - /* -* The smp_rmb() here is to make sure that the spinner -* state is consulted before reading the wait_lock. -*/ - smp_rmb(); - if (!raw_spin_trylock_irqsave(&sem->wait_lock, flags)) - return sem; - goto locked; - } raw_spin_lock_irqsave(&sem->wait_lock, flags); -locked: if (!list_empty(&sem->wait_list)) __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q); -- 2.18.1
[PATCH v4 01/16] locking/rwsem: Prevent unneeded warning during locking selftest
Disable the DEBUG_RWSEMS check when locking selftest is running with debug_locks_silent flag set. Signed-off-by: Waiman Long --- kernel/locking/rwsem.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h index 37db17890e36..64877f5294e3 100644 --- a/kernel/locking/rwsem.h +++ b/kernel/locking/rwsem.h @@ -30,7 +30,8 @@ #ifdef CONFIG_DEBUG_RWSEMS # define DEBUG_RWSEMS_WARN_ON(c, sem) do {\ - if (WARN_ONCE(c, "DEBUG_RWSEMS_WARN_ON(%s): count = 0x%lx, owner = 0x%lx, curr 0x%lx, list %sempty\n",\ + if (!debug_locks_silent && \ + WARN_ONCE(c, "DEBUG_RWSEMS_WARN_ON(%s): count = 0x%lx, owner = 0x%lx, curr 0x%lx, list %sempty\n",\ #c, atomic_long_read(&(sem)->count),\ (long)((sem)->owner), (long)current,\ list_empty(&(sem)->wait_list) ? "" : "not ")) \ -- 2.18.1
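The effect of the patch can be modelled in userspace: a warn-once style check that stays quiet while a "selftest running" flag is set, so deliberately provoked failures don't spam the log. All names below are illustrative stand-ins for debug_locks_silent and DEBUG_RWSEMS_WARN_ON(), not the kernel definitions.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-in for debug_locks_silent: set while the locking selftest
 * intentionally triggers failures. */
static int debug_locks_silent;
/* Counts how many warnings were actually emitted. */
static int warn_count;

/* Warn at most once per call site, and not at all while silent --
 * mirroring the "!debug_locks_silent &&" guard added by the patch. */
#define WARN_ON_ONCE_UNLESS_SILENT(cond) do {			\
	static int warned;					\
	if (!debug_locks_silent && (cond) && !warned) {		\
		warned = 1;					\
		warn_count++;					\
		fprintf(stderr, "warning: %s\n", #cond);	\
	}							\
} while (0)
```

The per-site `static int warned` mimics WARN_ONCE(): the condition may keep firing, but only the first hit produces output.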
[PATCH v4 02/16] locking/rwsem: Make owner available even if !CONFIG_RWSEM_SPIN_ON_OWNER
The owner field in the rw_semaphore structure is used primarily for optimistic spinning. However, identifying the rwsem owner can also be helpful in debugging as well as tracing locking related issues when analyzing crash dump. The owner field may also store state information that can be important to the operation of the rwsem. So the owner field is now made a permanent member of the rw_semaphore structure irrespective of CONFIG_RWSEM_SPIN_ON_OWNER. Signed-off-by: Waiman Long --- include/linux/rwsem.h | 9 + kernel/locking/rwsem-xadd.c | 2 +- kernel/locking/rwsem.h | 23 --- lib/Kconfig.debug | 8 4 files changed, 10 insertions(+), 32 deletions(-) diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h index 2ea18a3def04..148983e21d47 100644 --- a/include/linux/rwsem.h +++ b/include/linux/rwsem.h @@ -34,12 +34,12 @@ */ struct rw_semaphore { atomic_long_t count; -#ifdef CONFIG_RWSEM_SPIN_ON_OWNER /* -* Write owner. Used as a speculative check to see -* if the owner is running on the cpu. +* Write owner or one of the read owners. Can be used as a +* speculative check to see if the owner is running on the cpu. 
*/ struct task_struct *owner; +#ifdef CONFIG_RWSEM_SPIN_ON_OWNER struct optimistic_spin_queue osq; /* spinner MCS lock */ #endif raw_spinlock_t wait_lock; @@ -73,13 +73,14 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem) #endif #ifdef CONFIG_RWSEM_SPIN_ON_OWNER -#define __RWSEM_OPT_INIT(lockname) , .osq = OSQ_LOCK_UNLOCKED, .owner = NULL +#define __RWSEM_OPT_INIT(lockname) , .osq = OSQ_LOCK_UNLOCKED #else #define __RWSEM_OPT_INIT(lockname) #endif #define __RWSEM_INITIALIZER(name) \ { __RWSEM_INIT_COUNT(name), \ + .owner = NULL,\ .wait_list = LIST_HEAD_INIT((name).wait_list),\ .wait_lock = __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock) \ __RWSEM_OPT_INIT(name)\ diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c index 6b3ee9948bf1..7fd4f1de794a 100644 --- a/kernel/locking/rwsem-xadd.c +++ b/kernel/locking/rwsem-xadd.c @@ -86,8 +86,8 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name, atomic_long_set(&sem->count, RWSEM_UNLOCKED_VALUE); raw_spin_lock_init(&sem->wait_lock); INIT_LIST_HEAD(&sem->wait_list); -#ifdef CONFIG_RWSEM_SPIN_ON_OWNER sem->owner = NULL; +#ifdef CONFIG_RWSEM_SPIN_ON_OWNER osq_lock_init(&sem->osq); #endif } diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h index 64877f5294e3..eb9c8534299b 100644 --- a/kernel/locking/rwsem.h +++ b/kernel/locking/rwsem.h @@ -61,7 +61,6 @@ #define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS #define RWSEM_ACTIVE_WRITE_BIAS(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS) -#ifdef CONFIG_RWSEM_SPIN_ON_OWNER /* * All writes to owner are protected by WRITE_ONCE() to make sure that * store tearing can't happen as optimistic spinners may read and use @@ -126,7 +125,6 @@ static inline bool rwsem_has_anonymous_owner(struct task_struct *owner) * real owner or one of the real owners. The only exception is when the * unlock is done by up_read_non_owner(). 
*/ -#define rwsem_clear_reader_owned rwsem_clear_reader_owned static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem) { unsigned long val = (unsigned long)current | RWSEM_READER_OWNED @@ -135,28 +133,7 @@ static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem) cmpxchg_relaxed((unsigned long *)&sem->owner, val, RWSEM_READER_OWNED | RWSEM_ANONYMOUSLY_OWNED); } -#endif - #else -static inline void rwsem_set_owner(struct rw_semaphore *sem) -{ -} - -static inline void rwsem_clear_owner(struct rw_semaphore *sem) -{ -} - -static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem, - struct task_struct *owner) -{ -} - -static inline void rwsem_set_reader_owned(struct rw_semaphore *sem) -{ -} -#endif - -#ifndef rwsem_clear_reader_owned static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem) { } diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 0d9e81779e37..2047f3884540 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1067,7 +1067,7 @@ config PROVE_LOCKING select DEBUG_SPINLOCK select DEBUG_MUTEXES select DEBUG_RT_MUTEXES if RT_MUTEXES - select DEBUG_RWSEMS if RWSEM_SPIN_ON_OWNER + select DEBUG_RWSEMS select DEBUG_WW_MUTEX_SLOWPATH select DEBUG_LOCK_ALLOC select TRACE_IRQFLAGS @@ -1171,10 +1171,10 @@ config DEBUG_WW_MUTEX_SLOWPATH config DEBUG_RWSEMS bool "RW Semaphore debugging: basic checks" - depends on DEBUG_KERNEL && RWSEM_SPIN_ON_OWNER +
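The owner-field convention this code manipulates — a task pointer with low flag bits OR'ed in, which works because task_struct pointers are always at least word-aligned — can be sketched as follows. The struct and helper names are invented for illustration; only the flag roles mirror the kernel's RWSEM_READER_OWNED and RWSEM_ANONYMOUSLY_OWNED bits.

```c
#include <assert.h>
#include <stdint.h>

/* Low-bit flags packed into the owner pointer (illustrative values
 * mirroring the roles of the kernel's flags). */
#define READER_OWNED		(1UL << 0)
#define ANONYMOUSLY_OWNED	(1UL << 1)
#define OWNER_FLAG_MASK		(READER_OWNED | ANONYMOUSLY_OWNED)

/* Stand-in for task_struct; its alignment guarantees the low two
 * bits of any pointer to it are zero, leaving room for the flags. */
struct task { int dummy; };

/* Tag a task pointer as "last reader to acquire the lock". */
static uintptr_t make_reader_owner(struct task *t)
{
	return (uintptr_t)t | READER_OWNED;
}

/* Strip the flag bits to recover the task pointer. */
static struct task *owner_task(uintptr_t owner)
{
	return (struct task *)(owner & ~OWNER_FLAG_MASK);
}

static int owner_is_reader(uintptr_t owner)
{
	return !!(owner & READER_OWNED);
}
```

This is why rwsem_clear_reader_owned() above can compare and swap the whole owner word: the pointer and the flags travel together in one value.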
[PATCH v4 07/16] locking/rwsem: Implement lock handoff to prevent lock starvation
Because of writer lock stealing, it is possible that a constant stream of incoming writers will cause a waiting writer or reader to wait indefinitely, leading to lock starvation. This patch implements a lock handoff mechanism to disable lock stealing and force lock handoff to the first waiter or waiters (for readers) in the queue after at least a 4ms waiting period, unless it is an RT writer task, which doesn't need to wait. The waiting period is used so that lock stealing is not discouraged so much that performance suffers. The setting and clearing of the handoff bit is serialized by the wait_lock, so racing is not possible. A rwsem microbenchmark was run for 5 seconds on a 2-socket 40-core 80-thread Skylake system with a v5.1 based kernel and 240 write_lock threads with a 5us sleep critical section. Before the patch, the min/mean/max numbers of locking operations for the locking threads were 1/7,792/173,696. After the patch, the figures became 5,842/6,542/7,458. It can be seen that the rwsem became much fairer, though there was a drop of about 16% in the mean locking operations done, which was a trade-off for having better fairness. Making the waiter set the handoff bit right after the first wakeup can impact performance, especially with a mixed reader/writer workload. 
With the same microbenchmark with short critical section and equal number of reader and writer threads (40/40), the reader/writer locking operation counts with the current patch were: 40 readers, Iterations Min/Mean/Max = 1,793/1,794/1,796 40 writers, Iterations Min/Mean/Max = 1,793/34,956/86,081 By making waiter set handoff bit immediately after wakeup: 40 readers, Iterations Min/Mean/Max = 43/44/46 40 writers, Iterations Min/Mean/Max = 43/1,263/3,191 Signed-off-by: Waiman Long --- kernel/locking/lock_events_list.h | 2 + kernel/locking/rwsem.c| 205 +++--- 2 files changed, 164 insertions(+), 43 deletions(-) diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h index ad7668cfc9da..29e5c52197fa 100644 --- a/kernel/locking/lock_events_list.h +++ b/kernel/locking/lock_events_list.h @@ -61,7 +61,9 @@ LOCK_EVENT(rwsem_opt_fail)/* # of failed opt-spinnings */ LOCK_EVENT(rwsem_rlock)/* # of read locks acquired */ LOCK_EVENT(rwsem_rlock_fast) /* # of fast read locks acquired*/ LOCK_EVENT(rwsem_rlock_fail) /* # of failed read lock acquisitions */ +LOCK_EVENT(rwsem_rlock_handoff)/* # of read lock handoffs */ LOCK_EVENT(rwsem_rtrylock) /* # of read trylock calls */ LOCK_EVENT(rwsem_wlock)/* # of write locks acquired */ LOCK_EVENT(rwsem_wlock_fail) /* # of failed write lock acquisitions */ +LOCK_EVENT(rwsem_wlock_handoff)/* # of write lock handoffs */ LOCK_EVENT(rwsem_wtrylock) /* # of write trylock calls */ diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c index c1a089ab19fd..aaab546a890d 100644 --- a/kernel/locking/rwsem.c +++ b/kernel/locking/rwsem.c @@ -74,22 +74,38 @@ * * Bit 0 - writer locked bit * Bit 1 - waiters present bit - * Bits 2-7 - reserved + * Bit 2 - lock handoff bit + * Bits 3-7 - reserved * Bits 8-X - 24-bit (32-bit) or 56-bit reader count * * atomic_long_fetch_add() is used to obtain reader lock, whereas * atomic_long_cmpxchg() will be used to obtain writer lock. 
+ * + * There are three places where the lock handoff bit may be set or cleared. + * 1) __rwsem_mark_wake() for readers. + * 2) rwsem_try_write_lock() for writers. + * 3) Error path of __rwsem_down_write_failed_common(). + * + * For all the above cases, wait_lock will be held. A writer must also + * be the first one in the wait_list to be eligible for setting the handoff + * bit. So concurrent setting/clearing of handoff bit is not possible. */ #define RWSEM_WRITER_LOCKED(1UL << 0) #define RWSEM_FLAG_WAITERS (1UL << 1) +#define RWSEM_FLAG_HANDOFF (1UL << 2) + #define RWSEM_READER_SHIFT 8 #define RWSEM_READER_BIAS (1UL << RWSEM_READER_SHIFT) #define RWSEM_READER_MASK (~(RWSEM_READER_BIAS - 1)) #define RWSEM_WRITER_MASK RWSEM_WRITER_LOCKED #define RWSEM_LOCK_MASK(RWSEM_WRITER_MASK|RWSEM_READER_MASK) -#define RWSEM_READ_FAILED_MASK (RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS) +#define RWSEM_READ_FAILED_MASK (RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS|\ +RWSEM_FLAG_HANDOFF) #define RWSEM_COUNT_LOCKED(c) ((c) & RWSEM_LOCK_MASK) +#define RWSEM_COUNT_HANDOFF(c) ((c) & RWSEM_FLAG_HANDOFF) +#define RWSEM_COUNT_LOCKED_OR_HANDOFF(c) \ + ((c) & (RWSEM_LOCK_MASK|RWSEM_FLAG_HANDOFF)) /* * All writes to owner are protected by WRITE_ONCE() to make sure that @@ -218,6 +234,7 @@ struct rwsem_waiter { struct list_head list; struct task_struct *task; enum rwsem_waiter_type type; + unsigned
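The count-word layout documented in the hunk above can be exercised in a standalone userspace model. The macro values are copied from the patch; the two helper functions are illustrative, not the kernel's.

```c
#include <assert.h>

/* Bit layout from the patch: bit 0 = writer locked, bit 1 = waiters
 * present, bit 2 = lock handoff, readers counted from bit 8 up. */
#define RWSEM_WRITER_LOCKED	(1UL << 0)
#define RWSEM_FLAG_WAITERS	(1UL << 1)
#define RWSEM_FLAG_HANDOFF	(1UL << 2)
#define RWSEM_READER_SHIFT	8
#define RWSEM_READER_BIAS	(1UL << RWSEM_READER_SHIFT)
#define RWSEM_READER_MASK	(~(RWSEM_READER_BIAS - 1))
#define RWSEM_WRITER_MASK	RWSEM_WRITER_LOCKED
#define RWSEM_LOCK_MASK		(RWSEM_WRITER_MASK | RWSEM_READER_MASK)
#define RWSEM_READ_FAILED_MASK	(RWSEM_WRITER_MASK | RWSEM_FLAG_WAITERS | \
				 RWSEM_FLAG_HANDOFF)

/* Extract the number of readers encoded in a count value. */
static unsigned long reader_count(unsigned long c)
{
	return (c & RWSEM_READER_MASK) >> RWSEM_READER_SHIFT;
}

/* Illustrative check: a reader may keep the lock on the fast path
 * only when no writer holds it and neither the waiters bit nor the
 * handoff bit is set -- so the handoff bit alone pushes newcomers
 * into the slowpath, which is what prevents stealing. */
static int read_fast_path_ok(unsigned long c)
{
	return !(c & RWSEM_READ_FAILED_MASK);
}
```

Note how adding RWSEM_FLAG_HANDOFF to RWSEM_READ_FAILED_MASK is the whole trick: a single bit flips every fast-path attempt into a queued slowpath acquisition.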
[PATCH 1/3] ARM: dts: meson8: add the canvas module
Add the canvas module to Meson8 because it's required for the VPU (video output) and video decoders. The canvas module is located inside the "DMC bus" (where some of the memory controller registers are also located). The "DMC bus" itself is part of the so-called "MMC bus". Amlogic's vendor kernel has an explicit #define for the "DMC" register range on Meson8m2, while there's no such #define for Meson8. However, the canvas and memory controller registers on Meson8 are all expressed as "0x6000 + actual offset", while Meson8m2 uses "DMC + actual offset". Thus it's safe to assume that the DMC bus exists on both SoCs even though the registers inside are slightly different. Signed-off-by: Martin Blumenstingl --- arch/arm/boot/dts/meson8.dtsi | 21 + 1 file changed, 21 insertions(+) diff --git a/arch/arm/boot/dts/meson8.dtsi b/arch/arm/boot/dts/meson8.dtsi index 7ef442462ea4..6b5c90bb960b 100644 --- a/arch/arm/boot/dts/meson8.dtsi +++ b/arch/arm/boot/dts/meson8.dtsi @@ -228,6 +228,27 @@ }; }; + mmcbus: bus@c800 { + compatible = "simple-bus"; + reg = <0xc800 0x8000>; + #address-cells = <1>; + #size-cells = <1>; + ranges = <0x0 0xc800 0x8000>; + + dmcbus: bus@6000 { + compatible = "simple-bus"; + reg = <0x6000 0x400>; + #address-cells = <1>; + #size-cells = <1>; + ranges = <0x0 0x6000 0x400>; + + canvas: video-lut@20 { + compatible = "amlogic,canvas"; + reg = <0x20 0x14>; + }; + }; + }; + apb: bus@d000 { compatible = "simple-bus"; reg = <0xd000 0x20>; -- 2.21.0
[PATCH 2/3] ARM: dts: meson8m2: update the offset of the canvas module
With the Meson8m2 SoC the canvas module was moved from offset 0x20 (Meson8) to offset 0x48 (same as on Meson8b). The offsets inside the canvas module are identical. Correct the offset so the driver uses the correct registers. Signed-off-by: Martin Blumenstingl --- arch/arm/boot/dts/meson8m2.dtsi | 10 ++ 1 file changed, 10 insertions(+) diff --git a/arch/arm/boot/dts/meson8m2.dtsi b/arch/arm/boot/dts/meson8m2.dtsi index bb87b251e16d..e88c36d47e5c 100644 --- a/arch/arm/boot/dts/meson8m2.dtsi +++ b/arch/arm/boot/dts/meson8m2.dtsi @@ -14,6 +14,16 @@ compatible = "amlogic,meson8m2-clkc", "amlogic,meson8-clkc"; }; + { + /* the offset of the canvas registers has changed compared to Meson8 */ + /delete-node/ video-lut@20; + + canvas: video-lut@48 { + compatible = "amlogic,canvas"; + reg = <0x48 0x14>; + }; +}; + { compatible = "amlogic,meson8m2-dwmac", "snps,dwmac"; reg = <0xc941 0x1 -- 2.21.0
[PATCH 3/3] ARM: dts: meson8b: add the canvas module
Add the canvas module to Meson8b because it's required for the VPU (video output) and video decoders. The canvas module is located inside the "DMC bus" (where also some of the memory controller registers are located). The "DMC bus" itself is part of the so-called "MMC bus". Signed-off-by: Martin Blumenstingl --- arch/arm/boot/dts/meson8b.dtsi | 21 + 1 file changed, 21 insertions(+) diff --git a/arch/arm/boot/dts/meson8b.dtsi b/arch/arm/boot/dts/meson8b.dtsi index 800cd65fc50a..f4e832638e9c 100644 --- a/arch/arm/boot/dts/meson8b.dtsi +++ b/arch/arm/boot/dts/meson8b.dtsi @@ -205,6 +205,27 @@ }; }; + mmcbus: bus@c800 { + compatible = "simple-bus"; + reg = <0xc800 0x8000>; + #address-cells = <1>; + #size-cells = <1>; + ranges = <0x0 0xc800 0x8000>; + + dmcbus: bus@6000 { + compatible = "simple-bus"; + reg = <0x6000 0x400>; + #address-cells = <1>; + #size-cells = <1>; + ranges = <0x0 0x6000 0x400>; + + canvas: video-lut@48 { + compatible = "amlogic,canvas"; + reg = <0x48 0x14>; + }; + }; + }; + apb: bus@d000 { compatible = "simple-bus"; reg = <0xd000 0x20>; -- 2.21.0
[PATCH 0/3] 32-bit Meson: add the canvas module
This adds the canvas module on Meson8, Meson8b and Meson8m2. The canvas IP is used by the video decoder hardware as well as the VPU (video output) hardware. Neither the VPU nor the video decoder driver supports the 32-bit SoCs yet. However, we can still add the canvas module to have it available once these drivers gain support for the older SoCs. I have tested this on my Meson8m2 board by hacking the VPU driver to not re-initialize the VPU (and to use the configuration set by u-boot). With that hack I could get some image out of the CVBS connector. No changes to the canvas driver were required. Due to lack of hardware I could not test Meson8, but I'm following (as always) what the Amlogic 3.10 vendor kernel uses. Meson8b is also not tested because u-boot of my EC-100 doesn't have video output enabled (so I couldn't use the same hack I used on my Meson8m2 board). This series is meant to be applied on top of "Meson8b: add support for the RTC on EC-100 and Odroid-C1" from [0] [0] https://patchwork.kernel.org/cover/10899509/ Martin Blumenstingl (3): ARM: dts: meson8: add the canvas module ARM: dts: meson8m2: update the offset of the canvas module ARM: dts: meson8b: add the canvas module arch/arm/boot/dts/meson8.dtsi | 21 + arch/arm/boot/dts/meson8b.dtsi | 21 + arch/arm/boot/dts/meson8m2.dtsi | 10 ++ 3 files changed, 52 insertions(+) -- 2.21.0
Re: [PATCH 1/3] fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock
On Fri, Apr 12, 2019 at 03:41:44PM +0300, Kirill Smelkov wrote: > On Thu, Apr 11, 2019 at 09:22:56AM -0700, Linus Torvalds wrote: > > On Thu, Apr 11, 2019 at 5:38 AM Kirill Smelkov wrote: > > > > > > However file->f_pos writing is still there and it will bug under race > > > detector, e.g. under KTSAN because read and write can be running > > > simultaneously. Maybe it is not only race bug, but also a bit of > > > slowdown as read and write code paths write to the same memory thus > > > needing inter-cpu synchronization if read and write handlers are on > > > different cpus. However for this I'm not sure. > > > > I doubt it's noticeable, but it would probably be good to simply not > > do the write for streams. > > > > That *could* be done somewhat similarly, with just changing the address > > to be updated, of course. > > > > In fact, we *used* to (long ago) pass in the address of "file->f_pos" > > itself to the low-level read/write routines. We then changed it to do > > that indirection through a local copy of pos (and > > file_pos_read/file_pos_write) because we didn't do the proper locking, > > so different read/write versions could mess with each other (and with > > lseek). > > > > But one of the things that commit 9c225f2655e36 ("vfs: atomic f_pos > > accesses as per POSIX") did was to add the proper locking at least for > > the cases that we care about deeply, so we *could* say that we have > > three cases: > > > > - FMODE_ATOMIC_POS: properly locked, > > - FMODE_STREAM: no pos at all > > - otherwise a "mostly don't care - don't mix!" > > > > and so we could go back to not copying the pos at all, and instead do > > something like > > > > loff_t *ppos = f.file->f_mode & FMODE_STREAM ? 
NULL : &f.file->f_pos; > > ret = vfs_write(f.file, buf, count, ppos); > > > > and perhaps have a long-term plan to try to get rid of the "don't mix" > > case entirely (ie "if you use f_pos, then we'll do the proper > > locking") > > > > (The above is obviously surrounded by the fdget_pos()/fdput_pos() that > > implements the locking decision). > > Ok Linus, thanks for the feedback. Please consider applying or queueing the > following patches. I am resending those patches one by one in separate emails, in case having both patches inline was problematic to handle...
Re: [PATCH 1/2] driver core: printk to pr_*
On Fri, 2019-04-12 at 16:47 +0100, Willy Wolff wrote: > Signed-off-by: Willy Wolff Most people seem to prefer some type of commit message before your sign-off. And trivia: > diff --git a/drivers/base/dd.c b/drivers/base/dd.c [] > @@ -602,8 +601,8 @@ static int really_probe_debug(struct device *dev, struct > device_driver *drv) > ret = really_probe(dev, drv); > rettime = ktime_get(); > delta = ktime_sub(rettime, calltime); > - printk(KERN_DEBUG "probe of %s returned %d after %lld usecs\n", > -dev_name(dev), ret, (s64) ktime_to_us(delta)); > + pr_debug("probe of %s returned %d after %lld usecs\n", > + dev_name(dev), ret, (s64) ktime_to_us(delta)); While I think this particular change is fine (and you may already know this), converting a printk(KERN_DEBUG ...) to pr_debug(...) changes both the object code and the output. The first is always emitted; the second is not emitted unless DEBUG is defined or CONFIG_DYNAMIC_DEBUG is enabled. Your commit message should show this understanding.
Re: [PATCH] pci/switchtec: fix stream_open.cocci warnings (fwd)
On 2019-04-13 10:50 a.m., Julia Lawall wrote: > Hello, > > Kirill will explain about this issue. I'm aware of this effort and Acked Kirill's original monolithic patch. So, for this patch: Acked-by: Logan Gunthorpe julia > > -- Forwarded message -- > Date: Sat, 13 Apr 2019 11:22:51 +0800 > From: kbuild test robot > To: kbu...@01.org > Cc: Julia Lawall > Subject: [PATCH] pci/switchtec: fix stream_open.cocci warnings > > CC: kbuild-...@01.org > TO: Sebastian Andrzej Siewior > CC: Kurt Schwemmer > CC: Logan Gunthorpe > CC: Bjorn Helgaas > CC: linux-...@vger.kernel.org > CC: linux-kernel@vger.kernel.org > > From: kbuild test robot > > drivers/pci/switch/switchtec.c:395:1-17: ERROR: switchtec_fops: .read() can > deadlock .write(); change nonseekable_open -> stream_open to fix. > > Generated by: scripts/coccinelle/api/stream_open.cocci > > Fixes: 8a29a3bae2a2 ("pci/switchtec: Don't use completion's wait queue") > Signed-off-by: kbuild test robot > --- > > tree: https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git > linux-5.0.y-rt-rebase > head: 794c294ae4483c240429c25a0d18e272e92c94de > commit: 8a29a3bae2a2dfb0116cd8791d9700515d6e765e [154/311] pci/switchtec: > Don't use completion's wait queue > :: branch date: 7 hours ago > :: commit date: 7 hours ago > > Please take the patch only if it's a positive warning. Thanks! > > switchtec.c |2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > --- a/drivers/pci/switch/switchtec.c > +++ b/drivers/pci/switch/switchtec.c > @@ -392,7 +392,7 @@ static int switchtec_dev_open(struct ino > return PTR_ERR(stuser); > > filp->private_data = stuser; > - nonseekable_open(inode, filp); > + stream_open(inode, filp); > > dev_dbg(>dev, "%s: %p\n", __func__, stuser); >
[PATCH] pci/switchtec: fix stream_open.cocci warnings (fwd)
Hello, Kirill will explain about this issue. julia -- Forwarded message -- Date: Sat, 13 Apr 2019 11:22:51 +0800 From: kbuild test robot To: kbu...@01.org Cc: Julia Lawall Subject: [PATCH] pci/switchtec: fix stream_open.cocci warnings CC: kbuild-...@01.org TO: Sebastian Andrzej Siewior CC: Kurt Schwemmer CC: Logan Gunthorpe CC: Bjorn Helgaas CC: linux-...@vger.kernel.org CC: linux-kernel@vger.kernel.org From: kbuild test robot drivers/pci/switch/switchtec.c:395:1-17: ERROR: switchtec_fops: .read() can deadlock .write(); change nonseekable_open -> stream_open to fix. Generated by: scripts/coccinelle/api/stream_open.cocci Fixes: 8a29a3bae2a2 ("pci/switchtec: Don't use completion's wait queue") Signed-off-by: kbuild test robot --- tree: https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git linux-5.0.y-rt-rebase head: 794c294ae4483c240429c25a0d18e272e92c94de commit: 8a29a3bae2a2dfb0116cd8791d9700515d6e765e [154/311] pci/switchtec: Don't use completion's wait queue :: branch date: 7 hours ago :: commit date: 7 hours ago Please take the patch only if it's a positive warning. Thanks! switchtec.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/drivers/pci/switch/switchtec.c +++ b/drivers/pci/switch/switchtec.c @@ -392,7 +392,7 @@ static int switchtec_dev_open(struct ino return PTR_ERR(stuser); filp->private_data = stuser; - nonseekable_open(inode, filp); + stream_open(inode, filp); dev_dbg(&stdev->dev, "%s: %p\n", __func__, stuser);
[PATCH 0/3] Meson8b: add support for the RTC on EC-100 and Odroid-C1
This adds support for the RTC on the Meson8b EC-100 and Odroid-C1 boards. Example kernel log snippet while booting my EC-100: [5.713750] meson-rtc c8100740.rtc: setting system clock to 2019-04-13T16:21:48 UTC (1555172508) I am only 99% sure about the naming of the clock in patch 2 and 3. The public S805 datasheet from Hardkernel shows (on page 24 for example) that clk81 can use "XTAL/32khz" as clock input. That "XTAL/32khz" clock is described as a mux between 24MHz (our main XTAL) and 32kHz ("that other XTAL"). I believe that this other 32kHz XTAL is NOT the RTC32K crystal because: - schematics of the EC-100 and Odroid-C1 clearly show that the SoC input for the RTC32K clock is labeled RTC32K_XI / RTC32K_XO - GPIOAO_6 has a CLK_32KIN function (shows in EC-100 and Odroid-C1 schematics as well as the public S805 datasheet) - Always On domain PWR_CNTL0[11:10] (public S805 datasheet page 19) describes it as "Alternate 32khz input clock select from GPIO pad" Thus I believe that the naming of the RTC32K clock is correct, but I wanted to point out that I'm only 99% (instead of 100%) sure. Jianxin, please let me know if you disagree with my findings. Martin Blumenstingl (3): ARM: dts: meson: add support for the RTC ARM: dts: meson8b: ec100: enable the RTC ARM: dts: meson8b: odroid-c1: prepare support for the RTC arch/arm/boot/dts/meson.dtsi | 9 + arch/arm/boot/dts/meson8.dtsi | 5 + arch/arm/boot/dts/meson8b-ec100.dts| 14 ++ arch/arm/boot/dts/meson8b-odroidc1.dts | 14 ++ arch/arm/boot/dts/meson8b.dtsi | 5 + 5 files changed, 47 insertions(+) -- 2.21.0
[PATCH 2/3] ARM: dts: meson8b: ec100: enable the RTC
The RTC is always enabled on this board since the battery is already connected in the factory. According to the schematics the VCC_RTC regulator (which is either powered by the internal 3.3V or a battery) is connected to the 0.9V RTC_VDD input of the SoC. Signed-off-by: Martin Blumenstingl --- arch/arm/boot/dts/meson8b-ec100.dts | 14 ++ 1 file changed, 14 insertions(+) diff --git a/arch/arm/boot/dts/meson8b-ec100.dts b/arch/arm/boot/dts/meson8b-ec100.dts index 3ca9638fad09..9bf4249cb60d 100644 --- a/arch/arm/boot/dts/meson8b-ec100.dts +++ b/arch/arm/boot/dts/meson8b-ec100.dts @@ -88,6 +88,14 @@ }; }; + rtc32k_xtal: rtc32k-xtal-clk { + /* X2 in the schematics */ + compatible = "fixed-clock"; + clock-frequency = <32768>; + clock-output-names = "RTC32K"; + #clock-cells = <0>; + }; + usb_vbus: regulator-usb-vbus { /* * Silergy SY6288CCAC-GP 2A Power Distribution Switch. @@ -347,6 +355,12 @@ clock-names = "clkin0"; }; +&rtc { + status = "okay"; + clocks = <&rtc32k_xtal>; + vdd-supply = <&vcc_rtc>; +}; + /* exposed through the pin headers labeled "URDUG1" on the top of the PCB */ &uart_AO { status = "okay"; -- 2.21.0
[PATCH 1/3] ARM: dts: meson: add support for the RTC
The 32-bit Meson SoCs have an RTC block in the AO (always on) area. The RTC requires an external 32.768 kHz oscillator to work properly. Whether or not this crystal exists depends on the board, so it has to be added for each board.dts (instead of adding it somewhere in a generic .dtsi). Signed-off-by: Martin Blumenstingl --- arch/arm/boot/dts/meson.dtsi | 9 + arch/arm/boot/dts/meson8.dtsi | 5 + arch/arm/boot/dts/meson8b.dtsi | 5 + 3 files changed, 19 insertions(+) diff --git a/arch/arm/boot/dts/meson.dtsi b/arch/arm/boot/dts/meson.dtsi index 6f54a8897574..8841783aceec 100644 --- a/arch/arm/boot/dts/meson.dtsi +++ b/arch/arm/boot/dts/meson.dtsi @@ -252,6 +252,15 @@ #size-cells = <0>; status = "disabled"; }; + + rtc: rtc@740 { + compatible = "amlogic,meson6-rtc"; + reg = <0x740 0x14>; + interrupts = ; + #address-cells = <1>; + #size-cells = <1>; + status = "disabled"; + }; }; usb0: usb@c904 { diff --git a/arch/arm/boot/dts/meson8.dtsi b/arch/arm/boot/dts/meson8.dtsi index d2ec4af82cc5..7ef442462ea4 100644 --- a/arch/arm/boot/dts/meson8.dtsi +++ b/arch/arm/boot/dts/meson8.dtsi @@ -541,6 +541,11 @@ compatible = "amlogic,meson8-pwm", "amlogic,meson8b-pwm"; }; +&rtc { + compatible = "amlogic,meson8-rtc"; + resets = <&reset RESET_RTC>; +}; + &saradc { compatible = "amlogic,meson8-saradc", "amlogic,meson-saradc"; clocks = <&clkc CLKID_XTAL>, diff --git a/arch/arm/boot/dts/meson8b.dtsi b/arch/arm/boot/dts/meson8b.dtsi index df42e48f1cc1..800cd65fc50a 100644 --- a/arch/arm/boot/dts/meson8b.dtsi +++ b/arch/arm/boot/dts/meson8b.dtsi @@ -510,6 +510,11 @@ compatible = "amlogic,meson8b-pwm"; }; +&rtc { + compatible = "amlogic,meson8b-rtc"; + resets = <&reset RESET_RTC>; +}; + &saradc { compatible = "amlogic,meson8b-saradc", "amlogic,meson-saradc"; clocks = <&clkc CLKID_XTAL>, -- 2.21.0
[PATCH 3/3] ARM: dts: meson8b: odroid-c1: prepare support for the RTC
The Odroid-C1 has the 32.768 kHz oscillator (X3 in the schematics) which is required for the RTC. A battery can be connected separately (to the BT1 header) - then the "rtc" node can be enabled manually. By default the RTC is disabled because the boards typically come without the RTC battery. Signed-off-by: Martin Blumenstingl --- arch/arm/boot/dts/meson8b-odroidc1.dts | 14 ++ 1 file changed, 14 insertions(+) diff --git a/arch/arm/boot/dts/meson8b-odroidc1.dts b/arch/arm/boot/dts/meson8b-odroidc1.dts index 0157646e3a89..f3ad9397f670 100644 --- a/arch/arm/boot/dts/meson8b-odroidc1.dts +++ b/arch/arm/boot/dts/meson8b-odroidc1.dts @@ -124,6 +124,14 @@ io-channels = <&saradc 8>; }; + rtc32k_xtal: rtc32k-xtal-clk { + /* X3 in the schematics */ + compatible = "fixed-clock"; + clock-frequency = <32768>; + clock-output-names = "RTC32K"; + #clock-cells = <0>; + }; + vcc_1v8: regulator-vcc-1v8 { /* * RICHTEK RT9179 configured for a fixed output voltage of @@ -345,6 +353,12 @@ clock-names = "clkin0"; }; +&rtc { + /* needs to be enabled manually when a battery is connected */ + clocks = <&rtc32k_xtal>; + vdd-supply = <&vcc_rtc>; +}; + &uart_AO { status = "okay"; pinctrl-0 = <&uart_ao_a_pins>; -- 2.21.0
[PATCH 1/1] arm: mm: Export __sync_icache_dcache() for xen-privcmd
This patch avoids ERROR: "__sync_icache_dcache" [drivers/xen/xen-privcmd.ko] undefined! observed when compiling v4.19.34. The xen-privcmd driver, which can be modular, calls set_pte_at() which in turn may call __sync_icache_dcache(). Cc: sta...@vger.kernel.org Signed-off-by: Heinrich Schuchardt --- arch/arm/mm/flush.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/arm/mm/flush.c b/arch/arm/mm/flush.c index 58469623b015..5345f86c56d2 100644 --- a/arch/arm/mm/flush.c +++ b/arch/arm/mm/flush.c @@ -295,6 +295,7 @@ void __sync_icache_dcache(pte_t pteval) if (pte_exec(pteval)) __flush_icache_all(); } +EXPORT_SYMBOL_GPL(__sync_icache_dcache); #endif /* -- 2.20.1
[PATCH] staging: axis-fifo: Add elaborate description in Kconfig
- The Xilinx AXI-Stream FIFO IP core driver description is elaborated. - References: Xilinx PG080 document, axis-fifo.txt Signed-off-by: Moses Christopher --- drivers/staging/axis-fifo/Kconfig | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/staging/axis-fifo/Kconfig b/drivers/staging/axis-fifo/Kconfig index d9725888af6f..ecb7811fd387 100644 --- a/drivers/staging/axis-fifo/Kconfig +++ b/drivers/staging/axis-fifo/Kconfig @@ -6,5 +6,7 @@ config XIL_AXIS_FIFO depends on OF default n help - This adds support for the Xilinx AXI-Stream - FIFO IP core driver. + This adds support for the Xilinx AXI-Stream FIFO IP core driver. + The AXI Streaming FIFO allows memory mapped access to an AXI Streaming + interface. The Xilinx AXI-Stream FIFO IP core can be used to interface + to the AXI Ethernet without the need to use DMA. -- 2.17.1
Re: [PATCH v3 00/26] Add support for PCIe RC and EP mode in TI's AM654 SoC
On Fri, Apr 12, 2019 at 04:48:36PM +0100, Lorenzo Pieralisi wrote: > On Mon, Mar 25, 2019 at 03:09:21PM +0530, Kishon Vijay Abraham I wrote: > > Add PCIe RC support for TI's AM654 SoC. The PCIe controller in AM654 > > uses Synopsys core revision 4.90a and uses the same TI wrapper as used > > in keystone2 with certain modification. Hence AM654 will use the same > > pci wrapper driver pci-keystone.c > > > > This series was initially part of [1]. This series only includes patches > > that has to be merged via Lorenzo's tree. The PHY patches and dt patches > > will be sent separately. > > > > This series is created over keystone MSI cleanup series [2]. > > > > This series: > > *) Cleanup pci-keystone driver so that both RC mode and EP mode of > >AM654 can be supported > > *) Modify epc-core to support allocation of aligned buffers required for > >AM654 > > *) Fix ATU unroll identification > > *) Add support for both host mode and device mode in AM654 > > > > Changes from v2: > > *) Missed updating "Reviewed-by: Rob Herring " tags > >in the version that was sent to list. > > *) Add const qualifier to struct dw_pcie_ep_ops in pci-layerscape-ep.c > > > > Changes from v1: > > *) Support for legacy interrupt in AM654 is removed (see background here > >[3]) > > *) Allow of_pci_get_max_link_speed to be used by Endpoint controller > > driver > > *) Add support to set max-link-speed from DT in pci-keystone driver > > *) Update "Reviewed-by: Rob Herring " tags. 
> > > > [1] -> https://lore.kernel.org/patchwork/cover/989487/ > > [2] -> https://lkml.org/lkml/2019/3/21/193 > > [3] -> https://lkml.org/lkml/2019/3/19/235 > > > > Kishon Vijay Abraham I (26): > > PCI: keystone: Add start_link/stop_link dw_pcie_ops > > PCI: keystone: Cleanup error_irq configuration > > dt-bindings: PCI: keystone: Add "reg-names" binding information > > PCI: keystone: Perform host initialization in a single function > > PCI: keystone: Use platform_get_resource_byname to get memory > > resources > > PCI: keystone: Move initializations to appropriate places > > dt-bindings: PCI: Add dt-binding to configure PCIe mode > > PCI: keystone: Explicitly set the PCIe mode > > dt-bindings: PCI: Document "atu" reg-names > > PCI: dwc: Enable iATU unroll for endpoint too > > PCI: dwc: Fix ATU identification for designware version >= 4.80 > > PCI: keystone: Prevent ARM32 specific code to be compiled for ARM64 > > dt-bindings: PCI: Add PCI RC dt binding documentation for AM654 > > PCI: keystone: Add support for PCIe RC in AM654x Platforms > > PCI: keystone: Invoke phy_reset API before enabling PHY > > PCI: OF: Allow of_pci_get_max_link_speed() to be used by PCI Endpoint > > drivers > > PCI: keystone: Add support to set the max link speed from DT > > PCI: endpoint: Add support to allocate aligned buffers to be mapped in > > BARs > > PCI: dwc: Add const qualifier to struct dw_pcie_ep_ops > > PCI: dwc: Fix dw_pcie_ep_find_capability to return correct capability > > offset > > PCI: dwc: Add callbacks for accessing dbi2 address space > > PCI: keystone: Add support for PCIe EP in AM654x Platforms > > PCI: designware-ep: Configure RESBAR to advertise the smallest size > > PCI: designware-ep: Use aligned ATU window for raising MSI interrupts > > misc: pci_endpoint_test: Add support to test PCI EP in AM654x > > misc: pci_endpoint_test: Fix test_reg_bar to be updated in > > pci_endpoint_test > > > > .../bindings/pci/designware-pcie.txt | 7 +- > > 
.../devicetree/bindings/pci/pci-keystone.txt | 14 +- > > drivers/misc/pci_endpoint_test.c | 18 + > > drivers/pci/Makefile | 2 +- > > drivers/pci/controller/dwc/Kconfig| 25 +- > > drivers/pci/controller/dwc/pci-dra7xx.c | 2 +- > > drivers/pci/controller/dwc/pci-keystone.c | 577 +++--- > > .../pci/controller/dwc/pci-layerscape-ep.c| 2 +- > > drivers/pci/controller/dwc/pcie-artpec6.c | 2 +- > > .../pci/controller/dwc/pcie-designware-ep.c | 55 +- > > .../pci/controller/dwc/pcie-designware-host.c | 19 - > > .../pci/controller/dwc/pcie-designware-plat.c | 2 +- > > drivers/pci/controller/dwc/pcie-designware.c | 52 ++ > > drivers/pci/controller/dwc/pcie-designware.h | 15 +- > > drivers/pci/endpoint/functions/pci-epf-test.c | 5 +- > > drivers/pci/endpoint/pci-epf-core.c | 10 +- > > drivers/pci/of.c | 44 +- > > include/linux/pci-epc.h | 2 + > > include/linux/pci-epf.h | 3 +- > > 19 files changed, 683 insertions(+), 173 deletions(-) > > Hi Kishon, > > I have applied the series, after rewriting the commit logs we > discussed, branch: pci/keystone, please have a look and let me know > if that's good to go. It all looks good to me; I responded to a few things I noticed while writing the notes for merging the branch. I don't
Re: [PATCH v2 23/26] PCI: designware-ep: Configure RESBAR to advertise the smallest size
On Mon, Mar 25, 2019 at 02:04:58PM +0530, Kishon Vijay Abraham I wrote: > Configure RESBAR capability to advertise the smallest size (1MB) for > couple of reasons. A) Host side resource allocation of BAR fails for > larger sizes. B) Endpoint function driver does not allocate memory > for all supported sizes in RESBAR capability. > If and when there is a usecase required to add more flexibility using > RESBAR, this can be revisited. The #define used in the code below is "REBAR"; maybe spell it out once and then use REBAR instead of RESBAR? > Signed-off-by: Kishon Vijay Abraham I > --- > .../pci/controller/dwc/pcie-designware-ep.c | 34 +++ > 1 file changed, 34 insertions(+) > > diff --git a/drivers/pci/controller/dwc/pcie-designware-ep.c > b/drivers/pci/controller/dwc/pcie-designware-ep.c > index 74477ad7467f..0c208b9bda43 100644 > --- a/drivers/pci/controller/dwc/pcie-designware-ep.c > +++ b/drivers/pci/controller/dwc/pcie-designware-ep.c > @@ -504,10 +504,32 @@ void dw_pcie_ep_exit(struct dw_pcie_ep *ep) > pci_epc_mem_exit(epc); > } > > +static unsigned int dw_pcie_ep_find_ext_capability(struct dw_pcie *pci, int > cap) > +{ > + u32 header; > + int pos = PCI_CFG_SPACE_SIZE; > + > + while (pos) { > + header = dw_pcie_readl_dbi(pci, pos); > + if (PCI_EXT_CAP_ID(header) == cap) > + return pos; > + > + pos = PCI_EXT_CAP_NEXT(header); > + if (!pos) > + break; > + } > + > + return 0; > +} > + > int dw_pcie_ep_init(struct dw_pcie_ep *ep) > { > + int i; > int ret; > + u32 reg; > void *addr; > + unsigned int nbars; > + unsigned int offset; > struct pci_epc *epc; > struct dw_pcie *pci = to_dw_pcie_from_ep(ep); > struct device *dev = pci->dev; > @@ -591,6 +613,18 @@ int dw_pcie_ep_init(struct dw_pcie_ep *ep) > > ep->msix_cap = dw_pcie_ep_find_capability(pci, PCI_CAP_ID_MSIX); > > + offset = dw_pcie_ep_find_ext_capability(pci, PCI_EXT_CAP_ID_REBAR); > + if (offset) { > + reg = dw_pcie_readl_dbi(pci, offset + PCI_REBAR_CTRL); > + nbars = (reg & PCI_REBAR_CTRL_NBAR_MASK) >> > + 
PCI_REBAR_CTRL_NBAR_SHIFT; > + > + dw_pcie_dbi_ro_wr_en(pci); > + for (i = 0; i < nbars; i++, offset += PCI_REBAR_CTRL) > + dw_pcie_writel_dbi(pci, offset + PCI_REBAR_CAP, 0x0); > + dw_pcie_dbi_ro_wr_dis(pci); > + } > + > dw_pcie_setup(pci); > > return 0; > -- > 2.17.1 >
[PATCH] RDMA/cxgb4: fix null pointer dereference on alloc_skb failure
From: Colin Ian King Currently if alloc_skb fails to allocate the skb a null skb is passed to t4_set_arp_err_handler and this ends up dereferencing the null skb. Avoid the null pointer dereference by checking for a null skb and returning early. Addresses-Coverity: ("Dereference null return") Fixes: b38a0ad8ec11 ("RDMA/cxgb4: Set arp error handler for PASS_ACCEPT_RPL messages") Signed-off-by: Colin Ian King --- drivers/infiniband/hw/cxgb4/cm.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c index 1e68d87b663d..0f3b1193d5f8 100644 --- a/drivers/infiniband/hw/cxgb4/cm.c +++ b/drivers/infiniband/hw/cxgb4/cm.c @@ -460,6 +460,8 @@ static struct sk_buff *get_skb(struct sk_buff *skb, int len, gfp_t gfp) skb_reset_transport_header(skb); } else { skb = alloc_skb(len, gfp); + if (!skb) + return NULL; } t4_set_arp_err_handler(skb, NULL, NULL); return skb; -- 2.20.1
Re: [PATCH v3 24/26] PCI: designware-ep: Use aligned ATU window for raising MSI interrupts
On Mon, Mar 25, 2019 at 03:09:45PM +0530, Kishon Vijay Abraham I wrote: > Certain platforms like K2G reguires the outbound ATU window to be s/reguires/require/ (note two changes: "g->q" and drop s)
Re: [PATCH v3 20/26] PCI: dwc: Fix dw_pcie_ep_find_capability to return correct capability offset
On Mon, Mar 25, 2019 at 03:09:41PM +0530, Kishon Vijay Abraham I wrote: > commit beb4641a787df79a ("PCI: dwc: Add MSI-X callbacks handler") while > adding MSI-X callback handler, introduced dw_pcie_ep_find_capability and > __dw_pcie_ep_find_next_cap for finding the MSI and MSIX capability. > > However if MSI or MSIX capability is the last capability (i.e there are > no additional items in the capabilities list and the Next Capability > Pointer is set to '0'), __dw_pcie_ep_find_next_cap will return '0' > even though MSI or MSIX capability may be present. This is because of > incorrect ordering of "next_cap_ptr" check. Fix it here. Nice fix. It looks like dw_pcie_ep_find_capability() is currently only used for MSI and MSI-X, but IIUC, it *could* be used for any capability, so this commit message seems a little too specific. It would be slightly less confusing if beb4641a787d ("PCI: dwc: Add MSI-X callbacks handler") had been split into a commit that added the general-purpose dw_pcie_ep_find_capability() and a commit that used it for MSI and MSI-X, but obviously we can't change history. > Fixes: beb4641a787df79a142 ("PCI: dwc: Add MSI-X callbacks handler") These commit references are longer than necessary (and different - 19-char SHA1 here and 16 chars above). 
I typically use 12 chars via gsr is aliased to `git --no-pager show -s --abbrev-commit --abbrev=12 --pretty=format:"%h (\"%s\")%n"' > Signed-off-by: Kishon Vijay Abraham I > --- > drivers/pci/controller/dwc/pcie-designware-ep.c | 10 +- > 1 file changed, 5 insertions(+), 5 deletions(-) > > diff --git a/drivers/pci/controller/dwc/pcie-designware-ep.c > b/drivers/pci/controller/dwc/pcie-designware-ep.c > index dc6a4bbd3ace..74477ad7467f 100644 > --- a/drivers/pci/controller/dwc/pcie-designware-ep.c > +++ b/drivers/pci/controller/dwc/pcie-designware-ep.c > @@ -46,16 +46,19 @@ static u8 __dw_pcie_ep_find_next_cap(struct dw_pcie *pci, > u8 cap_ptr, > u8 cap_id, next_cap_ptr; > u16 reg; > > + if (!cap_ptr) > + return 0; > + > reg = dw_pcie_readw_dbi(pci, cap_ptr); > - next_cap_ptr = (reg & 0xff00) >> 8; > cap_id = (reg & 0x00ff); > > - if (!next_cap_ptr || cap_id > PCI_CAP_ID_MAX) > + if (cap_id > PCI_CAP_ID_MAX) > return 0; > > if (cap_id == cap) > return cap_ptr; > > + next_cap_ptr = (reg & 0xff00) >> 8; > return __dw_pcie_ep_find_next_cap(pci, next_cap_ptr, cap); > } > > @@ -67,9 +70,6 @@ static u8 dw_pcie_ep_find_capability(struct dw_pcie *pci, > u8 cap) > reg = dw_pcie_readw_dbi(pci, PCI_CAPABILITY_LIST); > next_cap_ptr = (reg & 0x00ff); > > - if (!next_cap_ptr) > - return 0; > - > return __dw_pcie_ep_find_next_cap(pci, next_cap_ptr, cap); > } > > -- > 2.17.1 >
[PATCH v2 2/3] RISC-V: Add interrupt related SCAUSE defines in asm/encoding.h
This patch adds SCAUSE interrupt flag and SCAUSE interrupt cause related defines to asm/encoding.h. We also use these defines in arch/riscv/kernel/irq.c and express the SIE/SIP flags in terms of interrupt causes. Signed-off-by: Anup Patel --- arch/riscv/include/asm/encoding.h | 25 + arch/riscv/kernel/irq.c | 9 + 2 files changed, 26 insertions(+), 8 deletions(-) diff --git a/arch/riscv/include/asm/encoding.h b/arch/riscv/include/asm/encoding.h index 29699705dc36..4f187854fd8b 100644 --- a/arch/riscv/include/asm/encoding.h +++ b/arch/riscv/include/asm/encoding.h @@ -44,10 +44,22 @@ #define SATP_MODE SATP_MODE_39 #endif -/* Interrupt Enable and Interrupt Pending flags */ -#define SIE_SSIE _AC(0x0002, UL) /* Software Interrupt Enable */ -#define SIE_STIE _AC(0x0020, UL) /* Timer Interrupt Enable */ -#define SIE_SEIE _AC(0x0200, UL) /* External Interrupt Enable */ +/* SCAUSE */ +#ifdef CONFIG_64BIT +#define SCAUSE_IRQ_FLAG _AC(0x8000000000000000, UL) +#else +#define SCAUSE_IRQ_FLAG _AC(0x80000000, UL) +#endif + +#define IRQ_U_SOFT 0 +#define IRQ_S_SOFT 1 +#define IRQ_M_SOFT 3 +#define IRQ_U_TIMER 4 +#define IRQ_S_TIMER 5 +#define IRQ_M_TIMER 7 +#define IRQ_U_EXT 8 +#define IRQ_S_EXT 9 +#define IRQ_M_EXT 11 #define EXC_INST_MISALIGNED 0 #define EXC_INST_ACCESS 1 @@ -59,4 +71,9 @@ #define EXC_LOAD_PAGE_FAULT 13 #define EXC_STORE_PAGE_FAULT 15 +/* SIE (Interrupt Enable) and SIP (Interrupt Pending) flags */ +#define SIE_SSIE (_AC(0x1, UL) << IRQ_S_SOFT) +#define SIE_STIE (_AC(0x1, UL) << IRQ_S_TIMER) +#define SIE_SEIE (_AC(0x1, UL) << IRQ_S_EXT) + #endif /* _ASM_RISCV_ENCODING_H */ diff --git a/arch/riscv/kernel/irq.c b/arch/riscv/kernel/irq.c index 48e6b7db83a1..22b8183ae8d4 100644 --- a/arch/riscv/kernel/irq.c +++ b/arch/riscv/kernel/irq.c @@ -9,14 +9,15 @@ #include #include #include +#include #include /* * Possible interrupt causes: */ -#define INTERRUPT_CAUSE_SOFTWARE 1 -#define INTERRUPT_CAUSE_TIMER 5 -#define INTERRUPT_CAUSE_EXTERNAL 9 +#define INTERRUPT_CAUSE_SOFTWARE IRQ_S_SOFT +#define 
INTERRUPT_CAUSE_TIMER IRQ_S_TIMER +#define INTERRUPT_CAUSE_EXTERNAL IRQ_S_EXT /* * The high order bit of the trap cause register is always set for @@ -24,7 +25,7 @@ * quickly. The INTERRUPT_CAUSE_* macros don't contain that bit, so we * need to mask it off. */ -#define INTERRUPT_CAUSE_FLAG (1UL << (__riscv_xlen - 1)) +#define INTERRUPT_CAUSE_FLAG SCAUSE_IRQ_FLAG int arch_show_interrupts(struct seq_file *p, int prec) { -- 2.17.1
[PATCH v2 3/3] RISC-V: Access CSRs using CSR numbers
We should prefer accessing CSRs using their CSR numbers because: 1. It compiles fine with older toolchains. 2. We can use latest CSR names in #define macro names of CSR numbers as-per RISC-V spec. 3. We can access newly added CSRs even if toolchain does not recognize newly addes CSRs by name. Signed-off-by: Anup Patel --- arch/riscv/include/asm/csr.h | 15 --- arch/riscv/include/asm/encoding.h| 16 arch/riscv/include/asm/irqflags.h| 10 +- arch/riscv/include/asm/mmu_context.h | 7 +-- arch/riscv/kernel/entry.S| 22 +++--- arch/riscv/kernel/head.S | 12 ++-- arch/riscv/kernel/perf_event.c | 4 ++-- arch/riscv/kernel/smp.c | 2 +- arch/riscv/kernel/traps.c| 6 +++--- arch/riscv/mm/fault.c| 6 +- 10 files changed, 54 insertions(+), 46 deletions(-) diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h index 8cf698e39463..6bf5652d3565 100644 --- a/arch/riscv/include/asm/csr.h +++ b/arch/riscv/include/asm/csr.h @@ -14,6 +14,7 @@ #ifndef _ASM_RISCV_CSR_H #define _ASM_RISCV_CSR_H +#include #include #ifndef __ASSEMBLY__ @@ -21,7 +22,7 @@ #define csr_swap(csr, val) \ ({ \ unsigned long __v = (unsigned long)(val); \ - __asm__ __volatile__ ("csrrw %0, " #csr ", %1" \ + __asm__ __volatile__ ("csrrw %0, " __ASM_STR(csr) ", %1"\ : "=r" (__v) : "rK" (__v) \ : "memory"); \ __v;\ @@ -30,7 +31,7 @@ #define csr_read(csr) \ ({ \ register unsigned long __v; \ - __asm__ __volatile__ ("csrr %0, " #csr \ + __asm__ __volatile__ ("csrr %0, " __ASM_STR(csr)\ : "=r" (__v) :\ : "memory"); \ __v;\ @@ -39,7 +40,7 @@ #define csr_write(csr, val)\ ({ \ unsigned long __v = (unsigned long)(val); \ - __asm__ __volatile__ ("csrw " #csr ", %0" \ + __asm__ __volatile__ ("csrw " __ASM_STR(csr) ", %0" \ : : "rK" (__v)\ : "memory"); \ }) @@ -47,7 +48,7 @@ #define csr_read_set(csr, val) \ ({ \ unsigned long __v = (unsigned long)(val); \ - __asm__ __volatile__ ("csrrs %0, " #csr ", %1" \ + __asm__ __volatile__ ("csrrs %0, " __ASM_STR(csr) ", %1"\ : "=r" (__v) : "rK" (__v) \ : "memory"); \ __v;\ @@ 
-56,7 +57,7 @@ #define csr_set(csr, val) \ ({ \ unsigned long __v = (unsigned long)(val); \ - __asm__ __volatile__ ("csrs " #csr ", %0" \ + __asm__ __volatile__ ("csrs " __ASM_STR(csr) ", %0" \ : : "rK" (__v)\ : "memory"); \ }) @@ -64,7 +65,7 @@ #define csr_read_clear(csr, val) \ ({ \ unsigned long __v = (unsigned long)(val); \ - __asm__ __volatile__ ("csrrc %0, " #csr ", %1" \ + __asm__ __volatile__ ("csrrc %0, " __ASM_STR(csr) ", %1"\ : "=r" (__v) : "rK" (__v) \ : "memory"); \ __v;\ @@ -73,7 +74,7 @@ #define csr_clear(csr, val)\ ({ \ unsigned long __v = (unsigned long)(val); \ - __asm__ __volatile__ ("csrc " #csr ", %0" \ + __asm__ __volatile__ ("csrc " __ASM_STR(csr) ", %0" \ : : "rK" (__v)\ : "memory"); \ }) diff --git a/arch/riscv/include/asm/encoding.h b/arch/riscv/include/asm/encoding.h index 4f187854fd8b..717b823ac110 100644 ---