Re: [RFC][PATCH v2 1/2] printk: Make printk() completely async
Sergey Senozhatsky wrote: > printk() is expected to work under different conditions and in different > scenarios, including corner cases of OOM when all of the workers are busy > (e.g. allocating memory). Thus by default printk() uses its own dedicated > workqueue with WQ_MEM_RECLAIM bit set. It falls back to system_long_wq > (console_unlock() is time unbound) only if it has failed to allocate > printk_wq. Another thing to mention, is that deferred printk() messages > may appear before printk_wq is allocated, so in the very beginning we > have to printk deferred messages the old way -- in IRQ context. I think we should not depend on system_long_wq which does not have WQ_MEM_RECLAIM bit. If workqueue allocation upon boot fails (due to ENOMEM), such systems won't be able to start userspace programs. Moreover, I don't like use of a workqueue even if printk_wq was allocated with WQ_MEM_RECLAIM bit. As you can see in the discussion of the OOM reaper, the OOM reaper chose a dedicated kernel thread rather than a workqueue ( http://lkml.kernel.org/r/1454505240-23446-2-git-send-email-mho...@kernel.org ). Blocking actual printing until ongoing workqueue item calls schedule_timeout_*() is not nice (see commit 373ccbe59270 and 564e81a57f97). Use of WQ_MEM_RECLAIM means we add a task_struct for that workqueue. Thus, using a kernel thread does not change total number of task_struct compared to WQ_MEM_RECLAIM approach. I think actual printing should occur as soon as possible rather than randomly deferred until workqueue item sleeps. > +static void printing_work_func(struct work_struct *work) > +{ > + console_lock(); > + console_unlock(); > +} Is this safe? If somebody invokes the OOM killer between console_lock() and console_unlock(), won't this cause OOM killer messages not printed?
Re: [RFC][PATCH v2 1/2] printk: Make printk() completely async
Sergey Senozhatsky wrote: > printk() is expected to work under different conditions and in different > scenarios, including corner cases of OOM when all of the workers are busy > (e.g. allocating memory). Thus by default printk() uses its own dedicated > workqueue with WQ_MEM_RECLAIM bit set. It falls back to system_long_wq > (console_unlock() is time unbound) only if it has failed to allocate > printk_wq. Another thing to mention, is that deferred printk() messages > may appear before printk_wq is allocated, so in the very beginning we > have to printk deferred messages the old way -- in IRQ context. I think we should not depend on system_long_wq which does not have WQ_MEM_RECLAIM bit. If workqueue allocation upon boot fails (due to ENOMEM), such systems won't be able to start userspace programs. Moreover, I don't like use of a workqueue even if printk_wq was allocated with WQ_MEM_RECLAIM bit. As you can see in the discussion of the OOM reaper, the OOM reaper chose a dedicated kernel thread rather than a workqueue ( http://lkml.kernel.org/r/1454505240-23446-2-git-send-email-mho...@kernel.org ). Blocking actual printing until ongoing workqueue item calls schedule_timeout_*() is not nice (see commit 373ccbe59270 and 564e81a57f97). Use of WQ_MEM_RECLAIM means we add a task_struct for that workqueue. Thus, using a kernel thread does not change total number of task_struct compared to WQ_MEM_RECLAIM approach. I think actual printing should occur as soon as possible rather than randomly deferred until workqueue item sleeps. > +static void printing_work_func(struct work_struct *work) > +{ > + console_lock(); > + console_unlock(); > +} Is this safe? If somebody invokes the OOM killer between console_lock() and console_unlock(), won't this cause OOM killer messages not printed?
arch/mips/vdso/gettimeofday.c:1:0: error: '-march=r3900' requires '-mfp32'
Hi Guenter, First bad commit (maybe != root cause): tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master head: 67944024c1cdd897e49a09b0d6af3ea38d1388ca commit: 398c7500a1f5f74e207bd2edca1b1721b3cc1f1e MIPS: VDSO: Fix build error with binutils 2.24 and earlier date: 10 weeks ago config: mips-jmr3927_defconfig (attached as .config) reproduce: wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross git checkout 398c7500a1f5f74e207bd2edca1b1721b3cc1f1e # save the attached .config to linux build tree make.cross ARCH=mips All errors (new ones prefixed by >>): >> arch/mips/vdso/gettimeofday.c:1:0: error: '-march=r3900' requires '-mfp32' /* ^ vim +1 arch/mips/vdso/gettimeofday.c a7f4df4e Alex Smith 2015-10-21 @1 /* a7f4df4e Alex Smith 2015-10-21 2 * Copyright (C) 2015 Imagination Technologies a7f4df4e Alex Smith 2015-10-21 3 * Author: Alex Smitha7f4df4e Alex Smith 2015-10-21 4 * a7f4df4e Alex Smith 2015-10-21 5 * This program is free software; you can redistribute it and/or modify it a7f4df4e Alex Smith 2015-10-21 6 * under the terms of the GNU General Public License as published by the a7f4df4e Alex Smith 2015-10-21 7 * Free Software Foundation; either version 2 of the License, or (at your a7f4df4e Alex Smith 2015-10-21 8 * option) any later version. a7f4df4e Alex Smith 2015-10-21 9 */ :: The code at line 1 was first introduced by commit :: a7f4df4e21dd8a8dab96e88acd2c9c5017b83fc6 MIPS: VDSO: Add implementations of gettimeofday() and clock_gettime() :: TO: Alex Smith :: CC: Ralf Baechle --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: Binary data
arch/mips/vdso/gettimeofday.c:1:0: error: '-march=r3900' requires '-mfp32'
Hi Guenter, First bad commit (maybe != root cause): tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master head: 67944024c1cdd897e49a09b0d6af3ea38d1388ca commit: 398c7500a1f5f74e207bd2edca1b1721b3cc1f1e MIPS: VDSO: Fix build error with binutils 2.24 and earlier date: 10 weeks ago config: mips-jmr3927_defconfig (attached as .config) reproduce: wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross git checkout 398c7500a1f5f74e207bd2edca1b1721b3cc1f1e # save the attached .config to linux build tree make.cross ARCH=mips All errors (new ones prefixed by >>): >> arch/mips/vdso/gettimeofday.c:1:0: error: '-march=r3900' requires '-mfp32' /* ^ vim +1 arch/mips/vdso/gettimeofday.c a7f4df4e Alex Smith 2015-10-21 @1 /* a7f4df4e Alex Smith 2015-10-21 2 * Copyright (C) 2015 Imagination Technologies a7f4df4e Alex Smith 2015-10-21 3 * Author: Alex Smith a7f4df4e Alex Smith 2015-10-21 4 * a7f4df4e Alex Smith 2015-10-21 5 * This program is free software; you can redistribute it and/or modify it a7f4df4e Alex Smith 2015-10-21 6 * under the terms of the GNU General Public License as published by the a7f4df4e Alex Smith 2015-10-21 7 * Free Software Foundation; either version 2 of the License, or (at your a7f4df4e Alex Smith 2015-10-21 8 * option) any later version. a7f4df4e Alex Smith 2015-10-21 9 */ :: The code at line 1 was first introduced by commit :: a7f4df4e21dd8a8dab96e88acd2c9c5017b83fc6 MIPS: VDSO: Add implementations of gettimeofday() and clock_gettime() :: TO: Alex Smith :: CC: Ralf Baechle --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: Binary data
arch/mips/vdso/elf.S:1:0: error: '-march=r3900' requires '-mfp32'
Hi Alex, FYI, the error/warning still remains. tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master head: 67944024c1cdd897e49a09b0d6af3ea38d1388ca commit: ebb5e78cc63417a35254a791de66e1cc84f963cc MIPS: Initial implementation of a VDSO date: 4 months ago config: mips-jmr3927_defconfig (attached as .config) reproduce: wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross git checkout ebb5e78cc63417a35254a791de66e1cc84f963cc # save the attached .config to linux build tree make.cross ARCH=mips All errors (new ones prefixed by >>): >> arch/mips/vdso/elf.S:1:0: error: '-march=r3900' requires '-mfp32' /* ^ -- >> arch/mips/vdso/sigreturn.S:1:0: error: '-march=r3900' requires '-mfp32' /* ^ vim +1 arch/mips/vdso/elf.S > 1 /* 2 * Copyright (C) 2015 Imagination Technologies 3 * Author: Alex Smith4 * --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: Binary data
arch/mips/vdso/elf.S:1:0: error: '-march=r3900' requires '-mfp32'
Hi Alex, FYI, the error/warning still remains. tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master head: 67944024c1cdd897e49a09b0d6af3ea38d1388ca commit: ebb5e78cc63417a35254a791de66e1cc84f963cc MIPS: Initial implementation of a VDSO date: 4 months ago config: mips-jmr3927_defconfig (attached as .config) reproduce: wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross git checkout ebb5e78cc63417a35254a791de66e1cc84f963cc # save the attached .config to linux build tree make.cross ARCH=mips All errors (new ones prefixed by >>): >> arch/mips/vdso/elf.S:1:0: error: '-march=r3900' requires '-mfp32' /* ^ -- >> arch/mips/vdso/sigreturn.S:1:0: error: '-march=r3900' requires '-mfp32' /* ^ vim +1 arch/mips/vdso/elf.S > 1 /* 2 * Copyright (C) 2015 Imagination Technologies 3 * Author: Alex Smith 4 * --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: Binary data
Re: [RFC][PATCH v2 1/2] printk: Make printk() completely async
On (03/05/16 19:55), Sergey Senozhatsky wrote: [..] > +static int __init init_printk_workqueue(void) > +{ > + if (printk_sync) > + return 0; > + > + printk_wq = alloc_workqueue("printk_wq", WQ_MEM_RECLAIM, 0); > + /* > + * Fallback to one of system-wide workqueues if we failed to > + * allocate a dedicated one. console_unlock() may take some > + * time to flush all of the messages, so we better use > + * system_long_wq. > + */ > + if (!printk_wq) { > + pr_err("printk: Fallback to system_long workqueue\n"); > + printk_wq = system_long_wq; > + } > + > + return 0; > +} > +early_initcall(init_printk_workqueue); ^^ better be late_initcall(). init_workqueues(), which creates pwq_cache kmem_cache, is early initcall. ===8<===8<===8< >From d9cf33958febd863ceb25c59d447aa9500427c30 Mon Sep 17 00:00:00 2001 From: Jan KaraSubject: [PATCH 1/2] printk: Make printk() completely async Currently, printk() sometimes waits for message to be printed to console and sometimes it does not (when console_sem is held by some other process). In case printk() grabs console_sem and starts printing to console, it prints messages from kernel printk buffer until the buffer is empty. When serial console is attached, printing is slow and thus other CPUs in the system have plenty of time to append new messages to the buffer while one CPU is printing. Thus the CPU can spend unbounded amount of time doing printing in console_unlock(). This is especially serious problem if the printk() calling console_unlock() was called with interrupts disabled. In practice users have observed a CPU can spend tens of seconds printing in console_unlock() (usually during boot when hundreds of SCSI devices are discovered) resulting in RCU stalls (CPU doing printing doesn't reach quiescent state for a long time), softlockup reports (IPIs for the printing CPU don't get served and thus other CPUs are spinning waiting for the printing CPU to process IPIs), and eventually a machine death (as messages from stalls and lockups append to printk buffer faster than we are able to print). So these machines are unable to boot with serial console attached. Another observed issue is that due to slow printk, hardware discovery is slow and udev times out before kernel manages to discover all the attached HW. Also during artificial stress testing SATA disk disappears from the system because its interrupts aren't served for too long. This patch makes printk() completely asynchronous (similar to what printk_deferred() did until now). It appends message to the kernel printk buffer and queues work to do the printing to console. This has the advantage that printing always happens from a schedulable contex and thus we don't lockup any particular CPU or even interrupts. Also it has the advantage that printk() is fast and thus kernel booting is not slowed down by slow serial console. Disadvantage of this method is that in case of crash there is higher chance that important messages won't appear in console output (we may need working scheduling to print message to console). We somewhat mitigate this risk by switching printk to the original method of immediate printing to console if oops is in progress. Also for debugging purposes we provide printk.synchronous kernel parameter which resorts to the original printk behavior. printk() is expected to work under different conditions and in different scenarios, including corner cases of OOM when all of the workers are busy (e.g. allocating memory). Thus by default printk() uses its own dedicated workqueue with WQ_MEM_RECLAIM bit set. It falls back to system_long_wq (console_unlock() is time unbound) only if it has failed to allocate printk_wq. Another thing to mention, is that deferred printk() messages may appear before printk_wq is allocated, so in the very beginning we have to printk deferred messages the old way -- in IRQ context. Signed-off-by: Jan Kara Signed-off-by: Sergey Senozhatsky --- Documentation/kernel-parameters.txt | 10 ++ kernel/printk/printk.c | 194 +--- 2 files changed, 146 insertions(+), 58 deletions(-) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index e0a21e4..5b01941 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -3108,6 +3108,16 @@ bytes respectively. Such letter suffixes can also be entirely omitted. printk.time=Show timing data prefixed to each printk message line Format: (1/Y/y=enable, 0/N/n=disable) + printk.synchronous= + By default kernel messages are printed to console + asynchronously (except during early boot or when oops + is happening). That avoids kernel stalling behind slow +
Re: [RFC][PATCH v2 1/2] printk: Make printk() completely async
On (03/05/16 19:55), Sergey Senozhatsky wrote: [..] > +static int __init init_printk_workqueue(void) > +{ > + if (printk_sync) > + return 0; > + > + printk_wq = alloc_workqueue("printk_wq", WQ_MEM_RECLAIM, 0); > + /* > + * Fallback to one of system-wide workqueues if we failed to > + * allocate a dedicated one. console_unlock() may take some > + * time to flush all of the messages, so we better use > + * system_long_wq. > + */ > + if (!printk_wq) { > + pr_err("printk: Fallback to system_long workqueue\n"); > + printk_wq = system_long_wq; > + } > + > + return 0; > +} > +early_initcall(init_printk_workqueue); ^^ better be late_initcall(). init_workqueues(), which creates pwq_cache kmem_cache, is early initcall. ===8<===8<===8< >From d9cf33958febd863ceb25c59d447aa9500427c30 Mon Sep 17 00:00:00 2001 From: Jan Kara Subject: [PATCH 1/2] printk: Make printk() completely async Currently, printk() sometimes waits for message to be printed to console and sometimes it does not (when console_sem is held by some other process). In case printk() grabs console_sem and starts printing to console, it prints messages from kernel printk buffer until the buffer is empty. When serial console is attached, printing is slow and thus other CPUs in the system have plenty of time to append new messages to the buffer while one CPU is printing. Thus the CPU can spend unbounded amount of time doing printing in console_unlock(). This is especially serious problem if the printk() calling console_unlock() was called with interrupts disabled. In practice users have observed a CPU can spend tens of seconds printing in console_unlock() (usually during boot when hundreds of SCSI devices are discovered) resulting in RCU stalls (CPU doing printing doesn't reach quiescent state for a long time), softlockup reports (IPIs for the printing CPU don't get served and thus other CPUs are spinning waiting for the printing CPU to process IPIs), and eventually a machine death (as messages from stalls and lockups append to printk buffer faster than we are able to print). So these machines are unable to boot with serial console attached. Another observed issue is that due to slow printk, hardware discovery is slow and udev times out before kernel manages to discover all the attached HW. Also during artificial stress testing SATA disk disappears from the system because its interrupts aren't served for too long. This patch makes printk() completely asynchronous (similar to what printk_deferred() did until now). It appends message to the kernel printk buffer and queues work to do the printing to console. This has the advantage that printing always happens from a schedulable contex and thus we don't lockup any particular CPU or even interrupts. Also it has the advantage that printk() is fast and thus kernel booting is not slowed down by slow serial console. Disadvantage of this method is that in case of crash there is higher chance that important messages won't appear in console output (we may need working scheduling to print message to console). We somewhat mitigate this risk by switching printk to the original method of immediate printing to console if oops is in progress. Also for debugging purposes we provide printk.synchronous kernel parameter which resorts to the original printk behavior. printk() is expected to work under different conditions and in different scenarios, including corner cases of OOM when all of the workers are busy (e.g. allocating memory). Thus by default printk() uses its own dedicated workqueue with WQ_MEM_RECLAIM bit set. It falls back to system_long_wq (console_unlock() is time unbound) only if it has failed to allocate printk_wq. Another thing to mention, is that deferred printk() messages may appear before printk_wq is allocated, so in the very beginning we have to printk deferred messages the old way -- in IRQ context. Signed-off-by: Jan Kara Signed-off-by: Sergey Senozhatsky --- Documentation/kernel-parameters.txt | 10 ++ kernel/printk/printk.c | 194 +--- 2 files changed, 146 insertions(+), 58 deletions(-) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index e0a21e4..5b01941 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -3108,6 +3108,16 @@ bytes respectively. Such letter suffixes can also be entirely omitted. printk.time=Show timing data prefixed to each printk message line Format: (1/Y/y=enable, 0/N/n=disable) + printk.synchronous= + By default kernel messages are printed to console + asynchronously (except during early boot or when oops + is happening). That avoids kernel stalling behind slow + serial console and thus avoids softlockups, interrupt +
Re: [PATCH v5 5/8] kbuild: add fine grained build dependencies for exported symbols
On Sat, 5 Mar 2016, Nicolas Pitre wrote: > On Sat, 5 Mar 2016, Michal Marek wrote: > > > I reproduced the SIGBUS after a few iterations, and it crashes in > > parse_dep_file(). I'm now testing this > > > > diff --git a/scripts/Kbuild.include b/scripts/Kbuild.include > > index 0be7e09ba381..4fdd8348acf6 100644 > > --- a/scripts/Kbuild.include > > +++ b/scripts/Kbuild.include > > @@ -270,10 +270,12 @@ else > > > > # Filter out exported kernel symbol names from the preprocessor output. > > # See also __KSYM_DEPS__ in include/linux/export.h. > > +# We disable the depfile generation here, so as not to overwrite the > > existing > > +# depfile while fixdep is parsing it > > ksym_dep_filter = > > \ > > case "$(1)" in \ > > - cc_*_c) $(CPP) $(c_flags) -D__KSYM_DEPS__ $< ;; \ > > - as_*_S) $(CPP) $(a_flags) -D__KSYM_DEPS__ $< ;; \ > > + cc_*_c) $(CPP) $(filter-out -Wp$(comma)-M%, $(c_flags)) -D__KSYM_DEPS__ > > $< ;; \ > > + as_*_S) $(CPP) $(filter-out -Wp$(comma)-M%, $(a_flags)) -D__KSYM_DEPS__ > > $< ;; \ > > cpp_lds_S) : ;; \ > > *) echo "Don't know how to preprocess $(1)"; false ;;\ > > esac | sed -rn 's/^.*=== __KSYM_(.*) ===.*$$/KSYM_\1/p' > > This makes perfect sense even if I can't reproduce on my side. I folded the following patch into my tree and pushed it out. diff --git a/scripts/Kbuild.include b/scripts/Kbuild.include index 0be7e09ba3..2c14a27e39 100644 --- a/scripts/Kbuild.include +++ b/scripts/Kbuild.include @@ -270,14 +270,17 @@ else # Filter out exported kernel symbol names from the preprocessor output. # See also __KSYM_DEPS__ in include/linux/export.h. +# We disable the depfile generation here, so as not to overwrite the existing +# depfile while fixdep is parsing it +flags_nodeps = $(filter-out -Wp$(comma)-M%, $($(1))) ksym_dep_filter =\ case "$(1)" in \ - cc_*_c) $(CPP) $(c_flags) -D__KSYM_DEPS__ $< ;; \ - as_*_S) $(CPP) $(a_flags) -D__KSYM_DEPS__ $< ;; \ + cc_*_c) $(CPP) $(call flags_nodeps,c_flags) -D__KSYM_DEPS__ $< ;;\ + as_*_S) $(CPP) $(call flags_nodeps,a_flags) -D__KSYM_DEPS__ $< ;;\ cpp_lds_S) : ;; \ - *) echo "Don't know how to preprocess $(1)"; false ;;\ + *) echo "Don't know how to preprocess $(1)" >&2; false ;;\ esac | sed -rn 's/^.*=== __KSYM_(.*) ===.*$$/KSYM_\1/p' - + cmd_and_fixdep = \ $(echo-cmd) $(cmd_$(1)); \ $(ksym_dep_filter) | \ Nicolas
Re: [PATCH v5 5/8] kbuild: add fine grained build dependencies for exported symbols
On Sat, 5 Mar 2016, Nicolas Pitre wrote: > On Sat, 5 Mar 2016, Michal Marek wrote: > > > I reproduced the SIGBUS after a few iterations, and it crashes in > > parse_dep_file(). I'm now testing this > > > > diff --git a/scripts/Kbuild.include b/scripts/Kbuild.include > > index 0be7e09ba381..4fdd8348acf6 100644 > > --- a/scripts/Kbuild.include > > +++ b/scripts/Kbuild.include > > @@ -270,10 +270,12 @@ else > > > > # Filter out exported kernel symbol names from the preprocessor output. > > # See also __KSYM_DEPS__ in include/linux/export.h. > > +# We disable the depfile generation here, so as not to overwrite the > > existing > > +# depfile while fixdep is parsing it > > ksym_dep_filter = > > \ > > case "$(1)" in \ > > - cc_*_c) $(CPP) $(c_flags) -D__KSYM_DEPS__ $< ;; \ > > - as_*_S) $(CPP) $(a_flags) -D__KSYM_DEPS__ $< ;; \ > > + cc_*_c) $(CPP) $(filter-out -Wp$(comma)-M%, $(c_flags)) -D__KSYM_DEPS__ > > $< ;; \ > > + as_*_S) $(CPP) $(filter-out -Wp$(comma)-M%, $(a_flags)) -D__KSYM_DEPS__ > > $< ;; \ > > cpp_lds_S) : ;; \ > > *) echo "Don't know how to preprocess $(1)"; false ;;\ > > esac | sed -rn 's/^.*=== __KSYM_(.*) ===.*$$/KSYM_\1/p' > > This makes perfect sense even if I can't reproduce on my side. I folded the following patch into my tree and pushed it out. diff --git a/scripts/Kbuild.include b/scripts/Kbuild.include index 0be7e09ba3..2c14a27e39 100644 --- a/scripts/Kbuild.include +++ b/scripts/Kbuild.include @@ -270,14 +270,17 @@ else # Filter out exported kernel symbol names from the preprocessor output. # See also __KSYM_DEPS__ in include/linux/export.h. +# We disable the depfile generation here, so as not to overwrite the existing +# depfile while fixdep is parsing it +flags_nodeps = $(filter-out -Wp$(comma)-M%, $($(1))) ksym_dep_filter =\ case "$(1)" in \ - cc_*_c) $(CPP) $(c_flags) -D__KSYM_DEPS__ $< ;; \ - as_*_S) $(CPP) $(a_flags) -D__KSYM_DEPS__ $< ;; \ + cc_*_c) $(CPP) $(call flags_nodeps,c_flags) -D__KSYM_DEPS__ $< ;;\ + as_*_S) $(CPP) $(call flags_nodeps,a_flags) -D__KSYM_DEPS__ $< ;;\ cpp_lds_S) : ;; \ - *) echo "Don't know how to preprocess $(1)"; false ;;\ + *) echo "Don't know how to preprocess $(1)" >&2; false ;;\ esac | sed -rn 's/^.*=== __KSYM_(.*) ===.*$$/KSYM_\1/p' - + cmd_and_fixdep = \ $(echo-cmd) $(cmd_$(1)); \ $(ksym_dep_filter) | \ Nicolas
[PATCH v2 01/10] selftests/x86: In syscall_nt, test NT|TF as well
Setting TF prevents fastpath returns in most cases, which causes the test to fail on 32-bit kernels because 32-bit kernels do not, in fact, handle NT correctly on SYSENTER entries. The next patch will fix 32-bit kernels. Signed-off-by: Andy Lutomirski--- tools/testing/selftests/x86/syscall_nt.c | 57 +++- 1 file changed, 49 insertions(+), 8 deletions(-) diff --git a/tools/testing/selftests/x86/syscall_nt.c b/tools/testing/selftests/x86/syscall_nt.c index 60c06af4646a..43fcab367fb0 100644 --- a/tools/testing/selftests/x86/syscall_nt.c +++ b/tools/testing/selftests/x86/syscall_nt.c @@ -17,6 +17,9 @@ #include #include +#include +#include +#include #include #include @@ -26,6 +29,8 @@ # define WIDTH "l" #endif +static unsigned int nerrs; + static unsigned long get_eflags(void) { unsigned long eflags; @@ -39,16 +44,52 @@ static void set_eflags(unsigned long eflags) : : "rm" (eflags) : "flags"); } -int main() +static void sethandler(int sig, void (*handler)(int, siginfo_t *, void *), + int flags) { - printf("[RUN]\tSet NT and issue a syscall\n"); - set_eflags(get_eflags() | X86_EFLAGS_NT); + struct sigaction sa; + memset(, 0, sizeof(sa)); + sa.sa_sigaction = handler; + sa.sa_flags = SA_SIGINFO | flags; + sigemptyset(_mask); + if (sigaction(sig, , 0)) + err(1, "sigaction"); +} + +static void sigtrap(int sig, siginfo_t *si, void *ctx_void) +{ +} + +static void do_it(unsigned long extraflags) +{ + unsigned long flags; + + set_eflags(get_eflags() | extraflags); syscall(SYS_getpid); - if (get_eflags() & X86_EFLAGS_NT) { - printf("[OK]\tThe syscall worked and NT is still set\n"); - return 0; + flags = get_eflags(); + if ((flags & extraflags) == extraflags) { + printf("[OK]\tThe syscall worked and flags are still set\n"); } else { - printf("[FAIL]\tThe syscall worked but NT was cleared\n"); - return 1; + printf("[FAIL]\tThe syscall worked but flags were cleared (flags = 0x%lx but expected 0x%lx set)\n", + flags, extraflags); + nerrs++; } } + +int main(void) +{ + printf("[RUN]\tSet NT and issue a syscall\n"); + do_it(X86_EFLAGS_NT); + + /* +* Now try it again with TF set -- TF forces returns via IRET in all +* cases except non-ptregs-using 64-bit full fast path syscalls. +*/ + + sethandler(SIGTRAP, sigtrap, 0); + + printf("[RUN]\tSet NT|TF and issue a syscall\n"); + do_it(X86_EFLAGS_NT | X86_EFLAGS_TF); + + return nerrs == 0 ? 0 : 1; +} -- 2.5.0
[PATCH v2 04/10] x86/entry/32: Restore FLAGS on SYSEXIT
We weren't restoring FLAGS at all on SYSEXIT. Apparently no one cared. With this patch applied, native kernels should always honor task_pt_regs()->flags, which opens the door for some sys_iopl cleanups. I'll do those as a separate series, though, since getting it right will involve tweaking some paravirt ops. (The short version is that, before this patch, sys_iopl, invoked via SYSENTER, wasn't guaranteed to ever transfer the updated regs->flags, so sys_iopl had to change the hardware flags register as well.) Reported-by: Brian GerstSigned-off-by: Andy Lutomirski --- arch/x86/entry/entry_32.S | 9 + 1 file changed, 9 insertions(+) diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S index 289a17bf0c71..a8c3424c3392 100644 --- a/arch/x86/entry/entry_32.S +++ b/arch/x86/entry/entry_32.S @@ -343,6 +343,15 @@ sysenter_past_esp: popl%eax/* pt_regs->ax */ /* +* Restore all flags except IF (we restore IF separately because +* STI gives a one-instruction window in which we won't be interrupted, +* whereas POPF does not. +*/ + addl$PT_EFLAGS-PT_DS, %esp /* point esp at pt_regs->flags */ + btr $X86_EFLAGS_IF_BIT, (%esp) + popfl + + /* * Return back to the vDSO, which will pop ecx and edx. * Don't bother with DS and ES (they already contain __USER_DS). */ -- 2.5.0
[PATCH v2 01/10] selftests/x86: In syscall_nt, test NT|TF as well
Setting TF prevents fastpath returns in most cases, which causes the test to fail on 32-bit kernels because 32-bit kernels do not, in fact, handle NT correctly on SYSENTER entries. The next patch will fix 32-bit kernels. Signed-off-by: Andy Lutomirski --- tools/testing/selftests/x86/syscall_nt.c | 57 +++- 1 file changed, 49 insertions(+), 8 deletions(-) diff --git a/tools/testing/selftests/x86/syscall_nt.c b/tools/testing/selftests/x86/syscall_nt.c index 60c06af4646a..43fcab367fb0 100644 --- a/tools/testing/selftests/x86/syscall_nt.c +++ b/tools/testing/selftests/x86/syscall_nt.c @@ -17,6 +17,9 @@ #include #include +#include +#include +#include #include #include @@ -26,6 +29,8 @@ # define WIDTH "l" #endif +static unsigned int nerrs; + static unsigned long get_eflags(void) { unsigned long eflags; @@ -39,16 +44,52 @@ static void set_eflags(unsigned long eflags) : : "rm" (eflags) : "flags"); } -int main() +static void sethandler(int sig, void (*handler)(int, siginfo_t *, void *), + int flags) { - printf("[RUN]\tSet NT and issue a syscall\n"); - set_eflags(get_eflags() | X86_EFLAGS_NT); + struct sigaction sa; + memset(, 0, sizeof(sa)); + sa.sa_sigaction = handler; + sa.sa_flags = SA_SIGINFO | flags; + sigemptyset(_mask); + if (sigaction(sig, , 0)) + err(1, "sigaction"); +} + +static void sigtrap(int sig, siginfo_t *si, void *ctx_void) +{ +} + +static void do_it(unsigned long extraflags) +{ + unsigned long flags; + + set_eflags(get_eflags() | extraflags); syscall(SYS_getpid); - if (get_eflags() & X86_EFLAGS_NT) { - printf("[OK]\tThe syscall worked and NT is still set\n"); - return 0; + flags = get_eflags(); + if ((flags & extraflags) == extraflags) { + printf("[OK]\tThe syscall worked and flags are still set\n"); } else { - printf("[FAIL]\tThe syscall worked but NT was cleared\n"); - return 1; + printf("[FAIL]\tThe syscall worked but flags were cleared (flags = 0x%lx but expected 0x%lx set)\n", + flags, extraflags); + nerrs++; } } + +int main(void) +{ + printf("[RUN]\tSet NT and issue a syscall\n"); + do_it(X86_EFLAGS_NT); + + /* +* Now try it again with TF set -- TF forces returns via IRET in all +* cases except non-ptregs-using 64-bit full fast path syscalls. +*/ + + sethandler(SIGTRAP, sigtrap, 0); + + printf("[RUN]\tSet NT|TF and issue a syscall\n"); + do_it(X86_EFLAGS_NT | X86_EFLAGS_TF); + + return nerrs == 0 ? 0 : 1; +} -- 2.5.0
[PATCH v2 04/10] x86/entry/32: Restore FLAGS on SYSEXIT
We weren't restoring FLAGS at all on SYSEXIT. Apparently no one cared. With this patch applied, native kernels should always honor task_pt_regs()->flags, which opens the door for some sys_iopl cleanups. I'll do those as a separate series, though, since getting it right will involve tweaking some paravirt ops. (The short version is that, before this patch, sys_iopl, invoked via SYSENTER, wasn't guaranteed to ever transfer the updated regs->flags, so sys_iopl had to change the hardware flags register as well.) Reported-by: Brian Gerst Signed-off-by: Andy Lutomirski --- arch/x86/entry/entry_32.S | 9 + 1 file changed, 9 insertions(+) diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S index 289a17bf0c71..a8c3424c3392 100644 --- a/arch/x86/entry/entry_32.S +++ b/arch/x86/entry/entry_32.S @@ -343,6 +343,15 @@ sysenter_past_esp: popl%eax/* pt_regs->ax */ /* +* Restore all flags except IF (we restore IF separately because +* STI gives a one-instruction window in which we won't be interrupted, +* whereas POPF does not. +*/ + addl$PT_EFLAGS-PT_DS, %esp /* point esp at pt_regs->flags */ + btr $X86_EFLAGS_IF_BIT, (%esp) + popfl + + /* * Return back to the vDSO, which will pop ecx and edx. * Don't bother with DS and ES (they already contain __USER_DS). */ -- 2.5.0
[PATCH v2 08/10] x86/entry: Only allocate space for SYSENTER_stack if needed
The SYSENTER stack is only used on 32-bit kernels. Remove it in 64-bit kernels. (We may end up using it down the road on 64-bit kernels. If so, we'll re-enable it for CONFIG_IA32_EMULATION.) Signed-off-by: Andy Lutomirski--- arch/x86/include/asm/processor.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index ecb410310e70..7cd01b71b5bd 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -297,10 +297,12 @@ struct tss_struct { */ unsigned long io_bitmap[IO_BITMAP_LONGS + 1]; +#ifdef CONFIG_X86_32 /* * Space for the temporary SYSENTER stack: */ unsigned long SYSENTER_stack[64]; +#endif } cacheline_aligned; -- 2.5.0
[PATCH v2 08/10] x86/entry: Only allocate space for SYSENTER_stack if needed
The SYSENTER stack is only used on 32-bit kernels. Remove it in 64-bit kernels. (We may end up using it down the road on 64-bit kernels. If so, we'll re-enable it for CONFIG_IA32_EMULATION.) Signed-off-by: Andy Lutomirski --- arch/x86/include/asm/processor.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index ecb410310e70..7cd01b71b5bd 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -297,10 +297,12 @@ struct tss_struct { */ unsigned long io_bitmap[IO_BITMAP_LONGS + 1]; +#ifdef CONFIG_X86_32 /* * Space for the temporary SYSENTER stack: */ unsigned long SYSENTER_stack[64]; +#endif } cacheline_aligned; -- 2.5.0
[PATCH v2 09/10] x86/entry/32: Simplify and fix up the SYSENTER stack #DB/NMI fixup
Right after SYSENTER, we can get a #DB or NMI. On x86_32, there's no IST, so the exception handler is invoked on the temporary SYSENTER stack. Because the SYSENTER stack is very small, we have a fixup to switch off the stack quickly when this happens. The old fixup had several issues: 1. It checked the interrupt frame's CS and EIP. This wasn't obviously correct on Xen or if vm86 mode was in use [1]. 2. In the NMI handler, it did some frightening digging into the stack frame. I'm not convinced this digging was correct. 3. The fixup didn't switch stacks and then switch back. Instead, it synthesized a brand new stack frame that would redirect the IRET back to the SYSENTER code. That frame was highly questionable. For one thing, if NMI nested inside #DB, we would effectively abort the #DB prologue, which was probably safe but was frightening. For another, the code used PUSHFL to write the FLAGS portion of the frame, which was simply bogus -- by the time PUSHFL was called, at least TF, NT, VM, and all of the arithmetic flags were clobbered. Simplify this considerably. Instead of looking at the saved frame to see where we came from, check the hardware ESP register against the SYSENTER stack directly. Malicious user code cannot spoof the kernel ESP register, and by moving the check after SAVE_ALL, we can use normal PER_CPU accesses to find all the relevant addresses. With this patch applied, the improved syscall_nt_32 test finally passes on 32-bit kernels. [1] It isn't obviously correct, but it is nonetheless safe from vm86 shenanigans as far as I can tell. A user can't point EIP at entry_SYSENTER_32 while in vm86 mode because entry_SYSENTER_32, like all kernel addresses, is greater than 0x and would thus violate the CS segment limit. Signed-off-by: Andy Lutomirski--- arch/x86/entry/entry_32.S| 114 ++- arch/x86/kernel/asm-offsets_32.c | 5 ++ 2 files changed, 56 insertions(+), 63 deletions(-) diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S index 7700cf695d8c..ad9a85e62128 100644 --- a/arch/x86/entry/entry_32.S +++ b/arch/x86/entry/entry_32.S @@ -987,51 +987,48 @@ error_code: jmp ret_from_exception END(page_fault) -/* - * Debug traps and NMI can happen at the one SYSENTER instruction - * that sets up the real kernel stack. Check here, since we can't - * allow the wrong stack to be used. - * - * "TSS_sysenter_sp0+12" is because the NMI/debug handler will have - * already pushed 3 words if it hits on the sysenter instruction: - * eflags, cs and eip. - * - * We just load the right stack, and push the three (known) values - * by hand onto the new stack - while updating the return eip past - * the instruction that would have done it for sysenter. - */ -.macro FIX_STACK offset ok label - cmpw$__KERNEL_CS, 4(%esp) - jne \ok -\label: - movlTSS_sysenter_sp0 + \offset(%esp), %esp - pushfl - pushl $__KERNEL_CS - pushl $sysenter_past_esp -.endm - ENTRY(debug) + /* +* #DB can happen at the first instruction of +* entry_SYSENTER_32 or in Xen's SYSENTER prologue. If this +* happens, then we will be running on a very small stack. We +* need to detect this condition and switch to the thread +* stack before calling any C code at all. +* +* If you edit this code, keep in mind that NMIs can happen in here. +*/ ASM_CLAC - cmpl$entry_SYSENTER_32, (%esp) - jne debug_stack_correct - FIX_STACK 12, debug_stack_correct, debug_esp_fix_insn -debug_stack_correct: pushl $-1 # mark this as an int SAVE_ALL - TRACE_IRQS_OFF xorl%edx, %edx # error code 0 movl%esp, %eax # pt_regs pointer + + /* Are we currently on the SYSENTER stack? */ + PER_CPU(cpu_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx) + subl%eax, %ecx /* ecx = (end of SYENTER_stack) - esp */ + cmpl$SIZEOF_SYSENTER_stack, %ecx + jb .Ldebug_from_sysenter_stack + + TRACE_IRQS_OFF + calldo_debug + jmp ret_from_exception + +.Ldebug_from_sysenter_stack: + /* We're on the SYSENTER stack. Switch off. */ + movl%esp, %ebp + movlPER_CPU_VAR(cpu_current_top_of_stack), %esp + TRACE_IRQS_OFF calldo_debug + movl%ebp, %esp jmp ret_from_exception END(debug) /* - * NMI is doubly nasty. It can happen _while_ we're handling - * a debug fault, and the debug fault hasn't yet been able to - * clear up the stack. So we first check whether we got an - * NMI on the sysenter entry path, but after that we need to - * check whether we got an NMI on the debug path where the debug - * fault happened on the sysenter
[PATCH v2 10/10] x86/entry/32: Add and check a stack canary for the SYSENTER stack
Signed-off-by: Andy Lutomirski--- arch/x86/include/asm/processor.h | 3 ++- arch/x86/kernel/process.c| 3 +++ arch/x86/kernel/traps.c | 8 3 files changed, 13 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index 7cd01b71b5bd..50a6dc871cc0 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -299,8 +299,9 @@ struct tss_struct { #ifdef CONFIG_X86_32 /* -* Space for the temporary SYSENTER stack: +* Space for the temporary SYSENTER stack. */ + unsigned long SYSENTER_stack_canary; unsigned long SYSENTER_stack[64]; #endif diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index 9f7c21c22477..ee9a9792caeb 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -57,6 +57,9 @@ __visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss) = { */ .io_bitmap = { [0 ... IO_BITMAP_LONGS] = ~0 }, #endif +#ifdef CONFIG_X86_32 + .SYSENTER_stack_canary = STACK_END_MAGIC, +#endif }; EXPORT_PER_CPU_SYMBOL(cpu_tss); diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 80928ea78373..590110119e6a 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -713,6 +713,14 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code) debug_stack_usage_dec(); exit: +#if defined(CONFIG_X86_32) + /* +* This is the most likely code path that involves non-trivial use +* of the SYSENTER stack. Check that we haven't overrun it. +*/ + WARN(this_cpu_read(cpu_tss.SYSENTER_stack_canary) != STACK_END_MAGIC, +"Overran or corrupted SYSENTER stack\n"); +#endif ist_exit(regs); } NOKPROBE_SYMBOL(do_debug); -- 2.5.0
[PATCH v2 09/10] x86/entry/32: Simplify and fix up the SYSENTER stack #DB/NMI fixup
Right after SYSENTER, we can get a #DB or NMI. On x86_32, there's no IST, so the exception handler is invoked on the temporary SYSENTER stack. Because the SYSENTER stack is very small, we have a fixup to switch off the stack quickly when this happens. The old fixup had several issues: 1. It checked the interrupt frame's CS and EIP. This wasn't obviously correct on Xen or if vm86 mode was in use [1]. 2. In the NMI handler, it did some frightening digging into the stack frame. I'm not convinced this digging was correct. 3. The fixup didn't switch stacks and then switch back. Instead, it synthesized a brand new stack frame that would redirect the IRET back to the SYSENTER code. That frame was highly questionable. For one thing, if NMI nested inside #DB, we would effectively abort the #DB prologue, which was probably safe but was frightening. For another, the code used PUSHFL to write the FLAGS portion of the frame, which was simply bogus -- by the time PUSHFL was called, at least TF, NT, VM, and all of the arithmetic flags were clobbered. Simplify this considerably. Instead of looking at the saved frame to see where we came from, check the hardware ESP register against the SYSENTER stack directly. Malicious user code cannot spoof the kernel ESP register, and by moving the check after SAVE_ALL, we can use normal PER_CPU accesses to find all the relevant addresses. With this patch applied, the improved syscall_nt_32 test finally passes on 32-bit kernels. [1] It isn't obviously correct, but it is nonetheless safe from vm86 shenanigans as far as I can tell. A user can't point EIP at entry_SYSENTER_32 while in vm86 mode because entry_SYSENTER_32, like all kernel addresses, is greater than 0x and would thus violate the CS segment limit. Signed-off-by: Andy Lutomirski --- arch/x86/entry/entry_32.S| 114 ++- arch/x86/kernel/asm-offsets_32.c | 5 ++ 2 files changed, 56 insertions(+), 63 deletions(-) diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S index 7700cf695d8c..ad9a85e62128 100644 --- a/arch/x86/entry/entry_32.S +++ b/arch/x86/entry/entry_32.S @@ -987,51 +987,48 @@ error_code: jmp ret_from_exception END(page_fault) -/* - * Debug traps and NMI can happen at the one SYSENTER instruction - * that sets up the real kernel stack. Check here, since we can't - * allow the wrong stack to be used. - * - * "TSS_sysenter_sp0+12" is because the NMI/debug handler will have - * already pushed 3 words if it hits on the sysenter instruction: - * eflags, cs and eip. - * - * We just load the right stack, and push the three (known) values - * by hand onto the new stack - while updating the return eip past - * the instruction that would have done it for sysenter. - */ -.macro FIX_STACK offset ok label - cmpw$__KERNEL_CS, 4(%esp) - jne \ok -\label: - movlTSS_sysenter_sp0 + \offset(%esp), %esp - pushfl - pushl $__KERNEL_CS - pushl $sysenter_past_esp -.endm - ENTRY(debug) + /* +* #DB can happen at the first instruction of +* entry_SYSENTER_32 or in Xen's SYSENTER prologue. If this +* happens, then we will be running on a very small stack. We +* need to detect this condition and switch to the thread +* stack before calling any C code at all. +* +* If you edit this code, keep in mind that NMIs can happen in here. +*/ ASM_CLAC - cmpl$entry_SYSENTER_32, (%esp) - jne debug_stack_correct - FIX_STACK 12, debug_stack_correct, debug_esp_fix_insn -debug_stack_correct: pushl $-1 # mark this as an int SAVE_ALL - TRACE_IRQS_OFF xorl%edx, %edx # error code 0 movl%esp, %eax # pt_regs pointer + + /* Are we currently on the SYSENTER stack? */ + PER_CPU(cpu_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx) + subl%eax, %ecx /* ecx = (end of SYENTER_stack) - esp */ + cmpl$SIZEOF_SYSENTER_stack, %ecx + jb .Ldebug_from_sysenter_stack + + TRACE_IRQS_OFF + calldo_debug + jmp ret_from_exception + +.Ldebug_from_sysenter_stack: + /* We're on the SYSENTER stack. Switch off. */ + movl%esp, %ebp + movlPER_CPU_VAR(cpu_current_top_of_stack), %esp + TRACE_IRQS_OFF calldo_debug + movl%ebp, %esp jmp ret_from_exception END(debug) /* - * NMI is doubly nasty. It can happen _while_ we're handling - * a debug fault, and the debug fault hasn't yet been able to - * clear up the stack. So we first check whether we got an - * NMI on the sysenter entry path, but after that we need to - * check whether we got an NMI on the debug path where the debug - * fault happened on the sysenter path. + * NMI is
[PATCH v2 10/10] x86/entry/32: Add and check a stack canary for the SYSENTER stack
Signed-off-by: Andy Lutomirski --- arch/x86/include/asm/processor.h | 3 ++- arch/x86/kernel/process.c| 3 +++ arch/x86/kernel/traps.c | 8 3 files changed, 13 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index 7cd01b71b5bd..50a6dc871cc0 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -299,8 +299,9 @@ struct tss_struct { #ifdef CONFIG_X86_32 /* -* Space for the temporary SYSENTER stack: +* Space for the temporary SYSENTER stack. */ + unsigned long SYSENTER_stack_canary; unsigned long SYSENTER_stack[64]; #endif diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index 9f7c21c22477..ee9a9792caeb 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -57,6 +57,9 @@ __visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss) = { */ .io_bitmap = { [0 ... IO_BITMAP_LONGS] = ~0 }, #endif +#ifdef CONFIG_X86_32 + .SYSENTER_stack_canary = STACK_END_MAGIC, +#endif }; EXPORT_PER_CPU_SYMBOL(cpu_tss); diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 80928ea78373..590110119e6a 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -713,6 +713,14 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code) debug_stack_usage_dec(); exit: +#if defined(CONFIG_X86_32) + /* +* This is the most likely code path that involves non-trivial use +* of the SYSENTER stack. Check that we haven't overrun it. +*/ + WARN(this_cpu_read(cpu_tss.SYSENTER_stack_canary) != STACK_END_MAGIC, +"Overran or corrupted SYSENTER stack\n"); +#endif ist_exit(regs); } NOKPROBE_SYMBOL(do_debug); -- 2.5.0
[PATCH v2 05/10] x86/traps: Clear TIF_BLOCKSTEP on all debug exceptions
The SDM says that debug exceptions clear BTF, and we need to keep TIF_BLOCKSTEP in sync with BTF. Clear it unconditionally and improve the comment. I suspect that the fact that kmemcheck could cause TIF_BLOCKSTEP not to be cleared was just an oversight. Signed-off-by: Andy Lutomirski--- arch/x86/kernel/traps.c | 12 +++- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index dd2c2e66c2e1..19e6cfa501e3 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -598,6 +598,13 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code) dr6 &= ~DR6_RESERVED; /* +* The SDM says "The processor clears the BTF flag when it +* generates a debug exception." Clear TIF_BLOCKSTEP to keep +* TIF_BLOCKSTEP in sync with the hardware BTF flag. +*/ + clear_tsk_thread_flag(tsk, TIF_BLOCKSTEP); + + /* * If dr6 has no reason to give us about the origin of this trap, * then it's very likely the result of an icebp/int01 trap. * User wants a sigtrap for that. @@ -612,11 +619,6 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code) /* DR6 may or may not be cleared by the CPU */ set_debugreg(0, 6); - /* -* The processor cleared BTF, so don't mark that we need it set. -*/ - clear_tsk_thread_flag(tsk, TIF_BLOCKSTEP); - /* Store the virtualized DR6 value */ tsk->thread.debugreg6 = dr6; -- 2.5.0
[PATCH v2 07/10] x86/entry: Vastly simplify SYSENTER TF handling
Due to a blatant design error, SYSENTER doesn't clear TF. As a result, if a user does SYSENTER with TF set, we will single-step through the kernel until something clears TF. There is absolutely nothing we can do to prevent this short of turning off SYSENTER [1]. Simplify the handling considerably with two changes: 1. We already sanitize EFLAGS in SYSENTER to clear NT and AC. We can add TF to that list of flags to sanitize with no overhead whatsoever. 2. Teach do_debug to ignore single-step traps in the SYSENTER prologue. That's all we need to do. Don't get too excited -- our handling is still buggy on 32-bit kernels. There's nothing wrong with the SYSENTER code itself, but the #DB prologue has a clever fixup for traps on the very first instruction of entry_SYSENTER_32, and the fixup doesn't work quite correctly. The next two patches will fix that. [1] We could probably prevent it by forcing BTF on at all times and making sure we clear TF before any branches in the SYSENTER code. Needless to say, this is a bad idea. Signed-off-by: Andy Lutomirski--- arch/x86/entry/entry_32.S| 42 ++-- arch/x86/entry/entry_64_compat.S | 9 ++- arch/x86/include/asm/proto.h | 15 ++-- arch/x86/kernel/traps.c | 52 +--- 4 files changed, 94 insertions(+), 24 deletions(-) diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S index a8c3424c3392..7700cf695d8c 100644 --- a/arch/x86/entry/entry_32.S +++ b/arch/x86/entry/entry_32.S @@ -287,7 +287,26 @@ need_resched: END(resume_kernel) #endif - # SYSENTER call handler stub +GLOBAL(__begin_SYSENTER_singlestep_region) +/* + * All code from here through __end_SYSENTER_singlestep_region is subject + * to being single-stepped if a user program sets TF and executes SYSENTER. + * There is absolutely nothing that we can do to prevent this from happening + * (thanks Intel!). To keep our handling of this situation as simple as + * possible, we handle TF just like AC and NT, except that our #DB handler + * will ignore all of the single-step traps generated in this range. + */ + +#ifdef CONFIG_XEN +/* + * Xen doesn't set %esp to be precisely what the normal SYSENTER + * entry point expects, so fix it up before using the normal path. + */ +ENTRY(xen_sysenter_target) + addl$5*4, %esp /* remove xen-provided frame */ + jmp sysenter_past_esp +#endif + ENTRY(entry_SYSENTER_32) movlTSS_sysenter_sp0(%esp), %esp sysenter_past_esp: @@ -301,19 +320,25 @@ sysenter_past_esp: SAVE_ALL pt_regs_ax=$-ENOSYS/* save rest */ /* -* SYSENTER doesn't filter flags, so we need to clear NT and AC -* ourselves. To save a few cycles, we can check whether +* SYSENTER doesn't filter flags, so we need to clear NT, AC +* and TF ourselves. To save a few cycles, we can check whether * either was set instead of doing an unconditional popfq. * This needs to happen before enabling interrupts so that * we don't get preempted with NT set. * +* If TF is set, we will single-step all the way to here -- do_debug +* will ignore all the traps. (Yes, this is slow, but so is +* single-stepping in general. This allows us to avoid having +* a more complicated code to handle the case where a user program +* forces us to single-step through the SYSENTER entry code.) +* * NB.: .Lsysenter_fix_flags is a label with the code under it moved * out-of-line as an optimization: NT is unlikely to be set in the * majority of the cases and instead of polluting the I$ unnecessarily, * we're keeping that code behind a branch which will predict as * not-taken and therefore its instructions won't be fetched. */ - testl $X86_EFLAGS_NT|X86_EFLAGS_AC, PT_EFLAGS(%esp) + testl $X86_EFLAGS_NT|X86_EFLAGS_AC|X86_EFLAGS_TF, PT_EFLAGS(%esp) jnz .Lsysenter_fix_flags .Lsysenter_flags_fixed: @@ -369,6 +394,7 @@ sysenter_past_esp: pushl $X86_EFLAGS_FIXED popfl jmp .Lsysenter_flags_fixed +GLOBAL(__end_SYSENTER_singlestep_region) ENDPROC(entry_SYSENTER_32) # system call handler stub @@ -662,14 +688,6 @@ ENTRY(spurious_interrupt_bug) END(spurious_interrupt_bug) #ifdef CONFIG_XEN -/* - * Xen doesn't set %esp to be precisely what the normal SYSENTER - * entry point expects, so fix it up before using the normal path. - */ -ENTRY(xen_sysenter_target) - addl$5*4, %esp /* remove xen-provided frame */ - jmp sysenter_past_esp - ENTRY(xen_hypervisor_callback) pushl $-1 /* orig_ax = -1 => not a system call */ SAVE_ALL diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S index
[PATCH v2 06/10] x86/traps: Clear DR6 early in do_debug and improve the comment
Leaving any bits set in DR6 on return from a debug exception is asking for trouble. Prevent it by writing zero right away and clarify the comment. Signed-off-by: Andy Lutomirski--- arch/x86/kernel/traps.c | 15 --- 1 file changed, 12 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 19e6cfa501e3..6dddc220e3ed 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -593,6 +593,18 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code) ist_enter(regs); get_debugreg(dr6, 6); + /* +* The Intel SDM says: +* +* Certain debug exceptions may clear bits 0-3. The remaining +* contents of the DR6 register are never cleared by the +* processor. To avoid confusion in identifying debug +* exceptions, debug handlers should clear the register before +* returning to the interrupted task. +* +* Keep it simple: clear DR6 immediately. +*/ + set_debugreg(0, 6); /* Filter out all the reserved bits which are preset to 1 */ dr6 &= ~DR6_RESERVED; @@ -616,9 +628,6 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code) if ((dr6 & DR_STEP) && kmemcheck_trap(regs)) goto exit; - /* DR6 may or may not be cleared by the CPU */ - set_debugreg(0, 6); - /* Store the virtualized DR6 value */ tsk->thread.debugreg6 = dr6; -- 2.5.0
[PATCH v2 03/10] x86/entry/32: Filter NT and speed up AC filtering in SYSENTER
This makes the 32-bit code work just like the 64-bit code. It should speed up syscalls on 32-bit kernels on Skylake by something like 20 cycles (by analogy to the 64-bit compat case). It also cleans up NT just like we do for the 64-bit case. Signed-off-by: Andy Lutomirski--- arch/x86/entry/entry_32.S | 23 ++- 1 file changed, 22 insertions(+), 1 deletion(-) diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S index ab710eee4308..289a17bf0c71 100644 --- a/arch/x86/entry/entry_32.S +++ b/arch/x86/entry/entry_32.S @@ -294,7 +294,6 @@ sysenter_past_esp: pushl $__USER_DS /* pt_regs->ss */ pushl %ebp/* pt_regs->sp (stashed in bp) */ pushfl /* pt_regs->flags (except IF = 0) */ - ASM_CLAC/* Clear AC after saving FLAGS */ orl $X86_EFLAGS_IF, (%esp) /* Fix IF */ pushl $__USER_CS /* pt_regs->cs */ pushl $0 /* pt_regs->ip = 0 (placeholder) */ @@ -302,6 +301,23 @@ sysenter_past_esp: SAVE_ALL pt_regs_ax=$-ENOSYS/* save rest */ /* +* SYSENTER doesn't filter flags, so we need to clear NT and AC +* ourselves. To save a few cycles, we can check whether +* either was set instead of doing an unconditional popfq. +* This needs to happen before enabling interrupts so that +* we don't get preempted with NT set. +* +* NB.: .Lsysenter_fix_flags is a label with the code under it moved +* out-of-line as an optimization: NT is unlikely to be set in the +* majority of the cases and instead of polluting the I$ unnecessarily, +* we're keeping that code behind a branch which will predict as +* not-taken and therefore its instructions won't be fetched. +*/ + testl $X86_EFLAGS_NT|X86_EFLAGS_AC, PT_EFLAGS(%esp) + jnz .Lsysenter_fix_flags +.Lsysenter_flags_fixed: + + /* * User mode is traced as though IRQs are on, and SYSENTER * turned them off. */ @@ -339,6 +355,11 @@ sysenter_past_esp: .popsection _ASM_EXTABLE(1b, 2b) PTGS_TO_GS_EX + +.Lsysenter_fix_flags: + pushl $X86_EFLAGS_FIXED + popfl + jmp .Lsysenter_flags_fixed ENDPROC(entry_SYSENTER_32) # system call handler stub -- 2.5.0
[PATCH v2 05/10] x86/traps: Clear TIF_BLOCKSTEP on all debug exceptions
The SDM says that debug exceptions clear BTF, and we need to keep TIF_BLOCKSTEP in sync with BTF. Clear it unconditionally and improve the comment. I suspect that the fact that kmemcheck could cause TIF_BLOCKSTEP not to be cleared was just an oversight. Signed-off-by: Andy Lutomirski --- arch/x86/kernel/traps.c | 12 +++- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index dd2c2e66c2e1..19e6cfa501e3 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -598,6 +598,13 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code) dr6 &= ~DR6_RESERVED; /* +* The SDM says "The processor clears the BTF flag when it +* generates a debug exception." Clear TIF_BLOCKSTEP to keep +* TIF_BLOCKSTEP in sync with the hardware BTF flag. +*/ + clear_tsk_thread_flag(tsk, TIF_BLOCKSTEP); + + /* * If dr6 has no reason to give us about the origin of this trap, * then it's very likely the result of an icebp/int01 trap. * User wants a sigtrap for that. @@ -612,11 +619,6 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code) /* DR6 may or may not be cleared by the CPU */ set_debugreg(0, 6); - /* -* The processor cleared BTF, so don't mark that we need it set. -*/ - clear_tsk_thread_flag(tsk, TIF_BLOCKSTEP); - /* Store the virtualized DR6 value */ tsk->thread.debugreg6 = dr6; -- 2.5.0
[PATCH v2 07/10] x86/entry: Vastly simplify SYSENTER TF handling
Due to a blatant design error, SYSENTER doesn't clear TF. As a result, if a user does SYSENTER with TF set, we will single-step through the kernel until something clears TF. There is absolutely nothing we can do to prevent this short of turning off SYSENTER [1]. Simplify the handling considerably with two changes: 1. We already sanitize EFLAGS in SYSENTER to clear NT and AC. We can add TF to that list of flags to sanitize with no overhead whatsoever. 2. Teach do_debug to ignore single-step traps in the SYSENTER prologue. That's all we need to do. Don't get too excited -- our handling is still buggy on 32-bit kernels. There's nothing wrong with the SYSENTER code itself, but the #DB prologue has a clever fixup for traps on the very first instruction of entry_SYSENTER_32, and the fixup doesn't work quite correctly. The next two patches will fix that. [1] We could probably prevent it by forcing BTF on at all times and making sure we clear TF before any branches in the SYSENTER code. Needless to say, this is a bad idea. Signed-off-by: Andy Lutomirski --- arch/x86/entry/entry_32.S| 42 ++-- arch/x86/entry/entry_64_compat.S | 9 ++- arch/x86/include/asm/proto.h | 15 ++-- arch/x86/kernel/traps.c | 52 +--- 4 files changed, 94 insertions(+), 24 deletions(-) diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S index a8c3424c3392..7700cf695d8c 100644 --- a/arch/x86/entry/entry_32.S +++ b/arch/x86/entry/entry_32.S @@ -287,7 +287,26 @@ need_resched: END(resume_kernel) #endif - # SYSENTER call handler stub +GLOBAL(__begin_SYSENTER_singlestep_region) +/* + * All code from here through __end_SYSENTER_singlestep_region is subject + * to being single-stepped if a user program sets TF and executes SYSENTER. + * There is absolutely nothing that we can do to prevent this from happening + * (thanks Intel!). To keep our handling of this situation as simple as + * possible, we handle TF just like AC and NT, except that our #DB handler + * will ignore all of the single-step traps generated in this range. + */ + +#ifdef CONFIG_XEN +/* + * Xen doesn't set %esp to be precisely what the normal SYSENTER + * entry point expects, so fix it up before using the normal path. + */ +ENTRY(xen_sysenter_target) + addl$5*4, %esp /* remove xen-provided frame */ + jmp sysenter_past_esp +#endif + ENTRY(entry_SYSENTER_32) movlTSS_sysenter_sp0(%esp), %esp sysenter_past_esp: @@ -301,19 +320,25 @@ sysenter_past_esp: SAVE_ALL pt_regs_ax=$-ENOSYS/* save rest */ /* -* SYSENTER doesn't filter flags, so we need to clear NT and AC -* ourselves. To save a few cycles, we can check whether +* SYSENTER doesn't filter flags, so we need to clear NT, AC +* and TF ourselves. To save a few cycles, we can check whether * either was set instead of doing an unconditional popfq. * This needs to happen before enabling interrupts so that * we don't get preempted with NT set. * +* If TF is set, we will single-step all the way to here -- do_debug +* will ignore all the traps. (Yes, this is slow, but so is +* single-stepping in general. This allows us to avoid having +* a more complicated code to handle the case where a user program +* forces us to single-step through the SYSENTER entry code.) +* * NB.: .Lsysenter_fix_flags is a label with the code under it moved * out-of-line as an optimization: NT is unlikely to be set in the * majority of the cases and instead of polluting the I$ unnecessarily, * we're keeping that code behind a branch which will predict as * not-taken and therefore its instructions won't be fetched. */ - testl $X86_EFLAGS_NT|X86_EFLAGS_AC, PT_EFLAGS(%esp) + testl $X86_EFLAGS_NT|X86_EFLAGS_AC|X86_EFLAGS_TF, PT_EFLAGS(%esp) jnz .Lsysenter_fix_flags .Lsysenter_flags_fixed: @@ -369,6 +394,7 @@ sysenter_past_esp: pushl $X86_EFLAGS_FIXED popfl jmp .Lsysenter_flags_fixed +GLOBAL(__end_SYSENTER_singlestep_region) ENDPROC(entry_SYSENTER_32) # system call handler stub @@ -662,14 +688,6 @@ ENTRY(spurious_interrupt_bug) END(spurious_interrupt_bug) #ifdef CONFIG_XEN -/* - * Xen doesn't set %esp to be precisely what the normal SYSENTER - * entry point expects, so fix it up before using the normal path. - */ -ENTRY(xen_sysenter_target) - addl$5*4, %esp /* remove xen-provided frame */ - jmp sysenter_past_esp - ENTRY(xen_hypervisor_callback) pushl $-1 /* orig_ax = -1 => not a system call */ SAVE_ALL diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S index 8d3728131809..26469c11796d
[PATCH v2 06/10] x86/traps: Clear DR6 early in do_debug and improve the comment
Leaving any bits set in DR6 on return from a debug exception is asking for trouble. Prevent it by writing zero right away and clarify the comment. Signed-off-by: Andy Lutomirski --- arch/x86/kernel/traps.c | 15 --- 1 file changed, 12 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 19e6cfa501e3..6dddc220e3ed 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -593,6 +593,18 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code) ist_enter(regs); get_debugreg(dr6, 6); + /* +* The Intel SDM says: +* +* Certain debug exceptions may clear bits 0-3. The remaining +* contents of the DR6 register are never cleared by the +* processor. To avoid confusion in identifying debug +* exceptions, debug handlers should clear the register before +* returning to the interrupted task. +* +* Keep it simple: clear DR6 immediately. +*/ + set_debugreg(0, 6); /* Filter out all the reserved bits which are preset to 1 */ dr6 &= ~DR6_RESERVED; @@ -616,9 +628,6 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code) if ((dr6 & DR_STEP) && kmemcheck_trap(regs)) goto exit; - /* DR6 may or may not be cleared by the CPU */ - set_debugreg(0, 6); - /* Store the virtualized DR6 value */ tsk->thread.debugreg6 = dr6; -- 2.5.0
[PATCH v2 03/10] x86/entry/32: Filter NT and speed up AC filtering in SYSENTER
This makes the 32-bit code work just like the 64-bit code. It should speed up syscalls on 32-bit kernels on Skylake by something like 20 cycles (by analogy to the 64-bit compat case). It also cleans up NT just like we do for the 64-bit case. Signed-off-by: Andy Lutomirski --- arch/x86/entry/entry_32.S | 23 ++- 1 file changed, 22 insertions(+), 1 deletion(-) diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S index ab710eee4308..289a17bf0c71 100644 --- a/arch/x86/entry/entry_32.S +++ b/arch/x86/entry/entry_32.S @@ -294,7 +294,6 @@ sysenter_past_esp: pushl $__USER_DS /* pt_regs->ss */ pushl %ebp/* pt_regs->sp (stashed in bp) */ pushfl /* pt_regs->flags (except IF = 0) */ - ASM_CLAC/* Clear AC after saving FLAGS */ orl $X86_EFLAGS_IF, (%esp) /* Fix IF */ pushl $__USER_CS /* pt_regs->cs */ pushl $0 /* pt_regs->ip = 0 (placeholder) */ @@ -302,6 +301,23 @@ sysenter_past_esp: SAVE_ALL pt_regs_ax=$-ENOSYS/* save rest */ /* +* SYSENTER doesn't filter flags, so we need to clear NT and AC +* ourselves. To save a few cycles, we can check whether +* either was set instead of doing an unconditional popfq. +* This needs to happen before enabling interrupts so that +* we don't get preempted with NT set. +* +* NB.: .Lsysenter_fix_flags is a label with the code under it moved +* out-of-line as an optimization: NT is unlikely to be set in the +* majority of the cases and instead of polluting the I$ unnecessarily, +* we're keeping that code behind a branch which will predict as +* not-taken and therefore its instructions won't be fetched. +*/ + testl $X86_EFLAGS_NT|X86_EFLAGS_AC, PT_EFLAGS(%esp) + jnz .Lsysenter_fix_flags +.Lsysenter_flags_fixed: + + /* * User mode is traced as though IRQs are on, and SYSENTER * turned them off. */ @@ -339,6 +355,11 @@ sysenter_past_esp: .popsection _ASM_EXTABLE(1b, 2b) PTGS_TO_GS_EX + +.Lsysenter_fix_flags: + pushl $X86_EFLAGS_FIXED + popfl + jmp .Lsysenter_flags_fixed ENDPROC(entry_SYSENTER_32) # system call handler stub -- 2.5.0
[PATCH 00/10] x86: Various SYSENTER/SYSEXIT/#DB fixes and cleanups
hpa asked me to get rid of the ASM_CLAC at the beginning of the SYSENTER path. Little did he know... This series makes the observed behavior of SYSENTER wrt flags the same for all sane flags and kernel bitnesses. That is, SYSENTER preserves flags now unless you do a syscall that explicitly changes flags, and the HW flags that the syscall executes with are sanitized. This includes NT, TF, AC and all arithmetic flags. Prior to this series, 32-bit kernels clobbered TF and the arithmetic flags and behaved highly erratically if NT was set. (If IF is cleared by evil userspace when SYSENTER starts, IF will be set again on return. There's nothing the kernel can do about this -- SYSENTER inherently forgets the state of IF.) This series speeds up SYSENTER on all kernels by a surprisingly large amount on Skylake because it eliminates an unconditional CLAC. While SYSENTER used to handle TF correctly as far as I can tell on 64-bit kernels, the means by which it did so was heavily tangled up in the ptrace single-step logic. It now works just like all the other kernel entries except insofar as do_debug has a simple special case for it. Relatedly, the bizarre and poorly explained old fixup in do_debug is now hidden behind a WARN_ON_ONCE in preparation for deleting it at some point. The code that fixed up NMI and #DB early in SYSENTER in 32-bit kernels used to be both terrifying and incorrect. (It doesn't appear to have been exploitably bad, but the reason for that is subtle, and the code was certainy more fragile than it deserved to me.) We still need a special fixup, but it's much simpler now. While I was doing all this, I also noticed that DR6 and BTF handling in do_debug was a bit off. Two of the patches in here try to fix it up. Have fun! tl;dr: Cleanups and sanity fixes here, but no security fixes, and I don't think anything needs to be backported or put in x86/urgent. This series applies to the result of merging tip:x86/asm and tip:x86/urgent. I've been testing on a somewhat bastardized base, because tip currently doesn't work on my laptop in 32-bit mode. (That bug is fixed in Linus' tree.) Changes from v1: - s/Sysenter/SYSENTER in two places (Borislav) - int main() -> int main(void) (Borislav) Andy Lutomirski (10): selftests/x86: In syscall_nt, test NT|TF as well x86/entry/compat: In SYSENTER, sink AC clearing below the existing FLAGS test x86/entry/32: Filter NT and speed up AC filtering in SYSENTER x86/entry/32: Restore FLAGS on SYSEXIT x86/traps: Clear TIF_BLOCKSTEP on all debug exceptions x86/traps: Clear DR6 early in do_debug and improve the comment x86/entry: Vastly simplify SYSENTER TF handling x86/entry: Only allocate space for SYSENTER_stack if needed x86/entry/32: Simplify and fix up the SYSENTER stack #DB/NMI fixup x86/entry/32: Add and check a stack canary for the SYSENTER stack arch/x86/entry/entry_32.S| 182 ++- arch/x86/entry/entry_64_compat.S | 15 ++- arch/x86/include/asm/processor.h | 5 +- arch/x86/include/asm/proto.h | 15 ++- arch/x86/kernel/asm-offsets_32.c | 5 + arch/x86/kernel/process.c| 3 + arch/x86/kernel/traps.c | 87 --- tools/testing/selftests/x86/syscall_nt.c | 57 -- 8 files changed, 263 insertions(+), 106 deletions(-) -- 2.5.0
[PATCH v2 02/10] x86/entry/compat: In SYSENTER, sink AC clearing below the existing FLAGS test
CLAC is slow, and the SYSENTER code already has an unlikely path that runs if unusual flags are set. Drop the CLAC and instead rely on the unlikely path to clear AC. This seems to save ~24 cycles on my Skylake laptop. (Hey, Intel, make this faster please!) Signed-off-by: Andy Lutomirski--- arch/x86/entry/entry_64_compat.S | 8 +++- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S index 89bcb4979e7a..8d3728131809 100644 --- a/arch/x86/entry/entry_64_compat.S +++ b/arch/x86/entry/entry_64_compat.S @@ -66,8 +66,6 @@ ENTRY(entry_SYSENTER_compat) */ pushfq /* pt_regs->flags (except IF = 0) */ orl $X86_EFLAGS_IF, (%rsp) /* Fix saved flags */ - ASM_CLAC/* Clear AC after saving FLAGS */ - pushq $__USER32_CS/* pt_regs->cs */ xorq%r8,%r8 pushq %r8 /* pt_regs->ip = 0 (placeholder) */ @@ -90,9 +88,9 @@ ENTRY(entry_SYSENTER_compat) cld /* -* Sysenter doesn't filter flags, so we need to clear NT +* SYSENTER doesn't filter flags, so we need to clear NT and AC * ourselves. To save a few cycles, we can check whether -* NT was set instead of doing an unconditional popfq. +* either was set instead of doing an unconditional popfq. * This needs to happen before enabling interrupts so that * we don't get preempted with NT set. * @@ -102,7 +100,7 @@ ENTRY(entry_SYSENTER_compat) * we're keeping that code behind a branch which will predict as * not-taken and therefore its instructions won't be fetched. */ - testl $X86_EFLAGS_NT, EFLAGS(%rsp) + testl $X86_EFLAGS_NT|X86_EFLAGS_AC, EFLAGS(%rsp) jnz .Lsysenter_fix_flags .Lsysenter_flags_fixed: -- 2.5.0
[PATCH 00/10] x86: Various SYSENTER/SYSEXIT/#DB fixes and cleanups
hpa asked me to get rid of the ASM_CLAC at the beginning of the SYSENTER path. Little did he know... This series makes the observed behavior of SYSENTER wrt flags the same for all sane flags and kernel bitnesses. That is, SYSENTER preserves flags now unless you do a syscall that explicitly changes flags, and the HW flags that the syscall executes with are sanitized. This includes NT, TF, AC and all arithmetic flags. Prior to this series, 32-bit kernels clobbered TF and the arithmetic flags and behaved highly erratically if NT was set. (If IF is cleared by evil userspace when SYSENTER starts, IF will be set again on return. There's nothing the kernel can do about this -- SYSENTER inherently forgets the state of IF.) This series speeds up SYSENTER on all kernels by a surprisingly large amount on Skylake because it eliminates an unconditional CLAC. While SYSENTER used to handle TF correctly as far as I can tell on 64-bit kernels, the means by which it did so was heavily tangled up in the ptrace single-step logic. It now works just like all the other kernel entries except insofar as do_debug has a simple special case for it. Relatedly, the bizarre and poorly explained old fixup in do_debug is now hidden behind a WARN_ON_ONCE in preparation for deleting it at some point. The code that fixed up NMI and #DB early in SYSENTER in 32-bit kernels used to be both terrifying and incorrect. (It doesn't appear to have been exploitably bad, but the reason for that is subtle, and the code was certainy more fragile than it deserved to me.) We still need a special fixup, but it's much simpler now. While I was doing all this, I also noticed that DR6 and BTF handling in do_debug was a bit off. Two of the patches in here try to fix it up. Have fun! tl;dr: Cleanups and sanity fixes here, but no security fixes, and I don't think anything needs to be backported or put in x86/urgent. This series applies to the result of merging tip:x86/asm and tip:x86/urgent. I've been testing on a somewhat bastardized base, because tip currently doesn't work on my laptop in 32-bit mode. (That bug is fixed in Linus' tree.) Changes from v1: - s/Sysenter/SYSENTER in two places (Borislav) - int main() -> int main(void) (Borislav) Andy Lutomirski (10): selftests/x86: In syscall_nt, test NT|TF as well x86/entry/compat: In SYSENTER, sink AC clearing below the existing FLAGS test x86/entry/32: Filter NT and speed up AC filtering in SYSENTER x86/entry/32: Restore FLAGS on SYSEXIT x86/traps: Clear TIF_BLOCKSTEP on all debug exceptions x86/traps: Clear DR6 early in do_debug and improve the comment x86/entry: Vastly simplify SYSENTER TF handling x86/entry: Only allocate space for SYSENTER_stack if needed x86/entry/32: Simplify and fix up the SYSENTER stack #DB/NMI fixup x86/entry/32: Add and check a stack canary for the SYSENTER stack arch/x86/entry/entry_32.S| 182 ++- arch/x86/entry/entry_64_compat.S | 15 ++- arch/x86/include/asm/processor.h | 5 +- arch/x86/include/asm/proto.h | 15 ++- arch/x86/kernel/asm-offsets_32.c | 5 + arch/x86/kernel/process.c| 3 + arch/x86/kernel/traps.c | 87 --- tools/testing/selftests/x86/syscall_nt.c | 57 -- 8 files changed, 263 insertions(+), 106 deletions(-) -- 2.5.0
[PATCH v2 02/10] x86/entry/compat: In SYSENTER, sink AC clearing below the existing FLAGS test
CLAC is slow, and the SYSENTER code already has an unlikely path that runs if unusual flags are set. Drop the CLAC and instead rely on the unlikely path to clear AC. This seems to save ~24 cycles on my Skylake laptop. (Hey, Intel, make this faster please!) Signed-off-by: Andy Lutomirski --- arch/x86/entry/entry_64_compat.S | 8 +++- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S index 89bcb4979e7a..8d3728131809 100644 --- a/arch/x86/entry/entry_64_compat.S +++ b/arch/x86/entry/entry_64_compat.S @@ -66,8 +66,6 @@ ENTRY(entry_SYSENTER_compat) */ pushfq /* pt_regs->flags (except IF = 0) */ orl $X86_EFLAGS_IF, (%rsp) /* Fix saved flags */ - ASM_CLAC/* Clear AC after saving FLAGS */ - pushq $__USER32_CS/* pt_regs->cs */ xorq%r8,%r8 pushq %r8 /* pt_regs->ip = 0 (placeholder) */ @@ -90,9 +88,9 @@ ENTRY(entry_SYSENTER_compat) cld /* -* Sysenter doesn't filter flags, so we need to clear NT +* SYSENTER doesn't filter flags, so we need to clear NT and AC * ourselves. To save a few cycles, we can check whether -* NT was set instead of doing an unconditional popfq. +* either was set instead of doing an unconditional popfq. * This needs to happen before enabling interrupts so that * we don't get preempted with NT set. * @@ -102,7 +100,7 @@ ENTRY(entry_SYSENTER_compat) * we're keeping that code behind a branch which will predict as * not-taken and therefore its instructions won't be fetched. */ - testl $X86_EFLAGS_NT, EFLAGS(%rsp) + testl $X86_EFLAGS_NT|X86_EFLAGS_AC, EFLAGS(%rsp) jnz .Lsysenter_fix_flags .Lsysenter_flags_fixed: -- 2.5.0
Re: multipath: I/O hanging forever
On Fri, Mar 04, 2016 at 10:30:44AM -0700, Andrea Righi wrote: > On Sun, Feb 28, 2016 at 08:46:16PM -0700, Andrea Righi wrote: > > On Sun, Feb 28, 2016 at 06:53:33PM -0700, Andrea Righi wrote: > > ... > > > I'm using 4.5.0-rc5+, from Linus' git. I'll try to do a git bisect > > > later, I'm pretty sure this problem has been introduced recently (i.e., > > > I've never seen this issue with 4.1.x). > > > > I confirm, just tested kernel 4.1 and this problem doesn't happen. > > Alright, I had some spare time to bisect this problem and I found that > the commit that introduced this issue is c66a14d. > > So, I tried to revert the commit (with some changes to fix conflicts and > ABI changes) and now multipath seems to work fine for me (no hung task). Is it hanging on first IO, first large IO, or just randomly?
Re: multipath: I/O hanging forever
On Fri, Mar 04, 2016 at 10:30:44AM -0700, Andrea Righi wrote: > On Sun, Feb 28, 2016 at 08:46:16PM -0700, Andrea Righi wrote: > > On Sun, Feb 28, 2016 at 06:53:33PM -0700, Andrea Righi wrote: > > ... > > > I'm using 4.5.0-rc5+, from Linus' git. I'll try to do a git bisect > > > later, I'm pretty sure this problem has been introduced recently (i.e., > > > I've never seen this issue with 4.1.x). > > > > I confirm, just tested kernel 4.1 and this problem doesn't happen. > > Alright, I had some spare time to bisect this problem and I found that > the commit that introduced this issue is c66a14d. > > So, I tried to revert the commit (with some changes to fix conflicts and > ABI changes) and now multipath seems to work fine for me (no hung task). Is it hanging on first IO, first large IO, or just randomly?
[PATCH v2] sched/cputime: steal_account_process_tick() should return jiffies
The callers of steal_account_process_tick() expect it to return whether a jiffy should be considered stolen or not. Currently the return value of steal_account_process_tick() is in units of cputime, which vary between either jiffies or nsecs depending on CONFIG_VIRT_CPU_ACCOUNTING_GEN. If cputime has nsecs granularity and there is a tiny amount of stolen time (a few nsecs, say) then we will consider the entire tick stolen and will not account the tick on user/system/idle, causing /proc/stats to show invalid data. The fix is to change steal_account_process_tick() to accumulate the stolen time and only account it once it's worth a jiffy. (Thanks to Frederic Weisbecker for suggestions to fix a bug in my first version of the patch.) Signed-off-by: Chris Friesen--- kernel/sched/cputime.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index b2ab2ff..ab2b5fb 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -262,21 +262,21 @@ static __always_inline bool steal_account_process_tick(void) #ifdef CONFIG_PARAVIRT if (static_key_false(_steal_enabled)) { u64 steal; - cputime_t steal_ct; + unsigned long steal_jiffies; steal = paravirt_steal_clock(smp_processor_id()); steal -= this_rq()->prev_steal_time; /* -* cputime_t may be less precise than nsecs (eg: if it's -* based on jiffies). Lets cast the result to cputime +* steal is in nsecs but our caller is expecting steal +* time in jiffies. Lets cast the result to jiffies * granularity and account the rest on the next rounds. */ - steal_ct = nsecs_to_cputime(steal); - this_rq()->prev_steal_time += cputime_to_nsecs(steal_ct); + steal_jiffies = nsecs_to_jiffies(steal); + this_rq()->prev_steal_time += jiffies_to_nsecs(steal_jiffies); - account_steal_time(steal_ct); - return steal_ct; + account_steal_time(jiffies_to_cputime(steal_jiffies)); + return steal_jiffies; } #endif return false;
[PATCH v2] sched/cputime: steal_account_process_tick() should return jiffies
The callers of steal_account_process_tick() expect it to return whether a jiffy should be considered stolen or not. Currently the return value of steal_account_process_tick() is in units of cputime, which vary between either jiffies or nsecs depending on CONFIG_VIRT_CPU_ACCOUNTING_GEN. If cputime has nsecs granularity and there is a tiny amount of stolen time (a few nsecs, say) then we will consider the entire tick stolen and will not account the tick on user/system/idle, causing /proc/stats to show invalid data. The fix is to change steal_account_process_tick() to accumulate the stolen time and only account it once it's worth a jiffy. (Thanks to Frederic Weisbecker for suggestions to fix a bug in my first version of the patch.) Signed-off-by: Chris Friesen --- kernel/sched/cputime.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index b2ab2ff..ab2b5fb 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -262,21 +262,21 @@ static __always_inline bool steal_account_process_tick(void) #ifdef CONFIG_PARAVIRT if (static_key_false(_steal_enabled)) { u64 steal; - cputime_t steal_ct; + unsigned long steal_jiffies; steal = paravirt_steal_clock(smp_processor_id()); steal -= this_rq()->prev_steal_time; /* -* cputime_t may be less precise than nsecs (eg: if it's -* based on jiffies). Lets cast the result to cputime +* steal is in nsecs but our caller is expecting steal +* time in jiffies. Lets cast the result to jiffies * granularity and account the rest on the next rounds. */ - steal_ct = nsecs_to_cputime(steal); - this_rq()->prev_steal_time += cputime_to_nsecs(steal_ct); + steal_jiffies = nsecs_to_jiffies(steal); + this_rq()->prev_steal_time += jiffies_to_nsecs(steal_jiffies); - account_steal_time(steal_ct); - return steal_ct; + account_steal_time(jiffies_to_cputime(steal_jiffies)); + return steal_jiffies; } #endif return false;
[RESEND PATCH v2 1/2] dt/bindings: Add bindings for PIC32 SPI peripheral
Signed-off-by: Purna Chandra Mandal--- Changes in v2: - fix indentation - add space after comma - moved 'cs-gpios' section under 'required' properties. .../bindings/spi/microchip,spi-pic32.txt | 34 ++ 1 file changed, 34 insertions(+) create mode 100644 Documentation/devicetree/bindings/spi/microchip,spi-pic32.txt diff --git a/Documentation/devicetree/bindings/spi/microchip,spi-pic32.txt b/Documentation/devicetree/bindings/spi/microchip,spi-pic32.txt new file mode 100644 index 000..79de379f --- /dev/null +++ b/Documentation/devicetree/bindings/spi/microchip,spi-pic32.txt @@ -0,0 +1,34 @@ +Microchip PIC32 SPI Master controller + +Required properties: +- compatible: Should be "microchip,pic32mzda-spi". +- reg: Address and length of register space for the device. +- interrupts: Should contain all three spi interrupts in sequence + of , , . +- interrupt-names: Should be "fault", "rx", "tx" in order. +- clocks: Phandle of the clock generating SPI clock on the bus. +- clock-names: Should be "mck0". +- cs-gpios: Specifies the gpio pins to be used for chipselects. +See: Documentation/devicetree/bindings/spi/spi-bus.txt + +Optional properties: +- dmas: Two or more DMA channel specifiers following the convention outlined +in Documentation/devicetree/bindings/dma/dma.txt +- dma-names: Names for the dma channels. There must be at least one channel + named "spi-tx" for transmit and named "spi-rx" for receive. + +Example: + +spi1: spi@1f821000 { +compatible = "microchip,pic32mzda-spi"; +reg = <0x1f821000 0x200>; +interrupts = <109 IRQ_TYPE_LEVEL_HIGH>, + <110 IRQ_TYPE_LEVEL_HIGH>, + <111 IRQ_TYPE_LEVEL_HIGH>; +interrupt-names = "fault", "rx", "tx"; +clocks = <>; +clock-names = "mck0"; +cs-gpios = < 4 GPIO_ACTIVE_LOW>; +dmas = < 134>, < 135>; +dma-names = "spi-rx", "spi-tx"; +}; -- 1.8.3.1
[RESEND PATCH v2 1/2] dt/bindings: Add bindings for PIC32 SPI peripheral
Signed-off-by: Purna Chandra Mandal --- Changes in v2: - fix indentation - add space after comma - moved 'cs-gpios' section under 'required' properties. .../bindings/spi/microchip,spi-pic32.txt | 34 ++ 1 file changed, 34 insertions(+) create mode 100644 Documentation/devicetree/bindings/spi/microchip,spi-pic32.txt diff --git a/Documentation/devicetree/bindings/spi/microchip,spi-pic32.txt b/Documentation/devicetree/bindings/spi/microchip,spi-pic32.txt new file mode 100644 index 000..79de379f --- /dev/null +++ b/Documentation/devicetree/bindings/spi/microchip,spi-pic32.txt @@ -0,0 +1,34 @@ +Microchip PIC32 SPI Master controller + +Required properties: +- compatible: Should be "microchip,pic32mzda-spi". +- reg: Address and length of register space for the device. +- interrupts: Should contain all three spi interrupts in sequence + of , , . +- interrupt-names: Should be "fault", "rx", "tx" in order. +- clocks: Phandle of the clock generating SPI clock on the bus. +- clock-names: Should be "mck0". +- cs-gpios: Specifies the gpio pins to be used for chipselects. +See: Documentation/devicetree/bindings/spi/spi-bus.txt + +Optional properties: +- dmas: Two or more DMA channel specifiers following the convention outlined +in Documentation/devicetree/bindings/dma/dma.txt +- dma-names: Names for the dma channels. There must be at least one channel + named "spi-tx" for transmit and named "spi-rx" for receive. + +Example: + +spi1: spi@1f821000 { +compatible = "microchip,pic32mzda-spi"; +reg = <0x1f821000 0x200>; +interrupts = <109 IRQ_TYPE_LEVEL_HIGH>, + <110 IRQ_TYPE_LEVEL_HIGH>, + <111 IRQ_TYPE_LEVEL_HIGH>; +interrupt-names = "fault", "rx", "tx"; +clocks = <>; +clock-names = "mck0"; +cs-gpios = < 4 GPIO_ACTIVE_LOW>; +dmas = < 134>, < 135>; +dma-names = "spi-rx", "spi-tx"; +}; -- 1.8.3.1
[PATCH] target: Drop incorrect ABORT_TASK put for completed commands
From: Nicholas BellingerThis patch fixes a recent ABORT_TASK regression associated with commit febe562c, where a left-over target_put_sess_cmd() would still be called when __target_check_io_state() detected a command has already been completed, and explicit ABORT must be avoided. Note commit febe562c dropped the local kref_get_unless_zero() check in core_tmr_abort_task(), but did not drop this extra corresponding target_put_sess_cmd() in the failure path. So go ahead and drop this now bogus target_put_sess_cmd(), and avoid this potential use-after-free. Reported-by: Dan Lane Cc: Quinn Tran Cc: Himanshu Madhani Cc: Sagi Grimberg Cc: Christoph Hellwig Cc: Hannes Reinecke Cc: Andy Grover Cc: Mike Christie Signed-off-by: Nicholas Bellinger --- drivers/target/target_core_tmr.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/target/target_core_tmr.c b/drivers/target/target_core_tmr.c index 82a663b..4f229e7 100644 --- a/drivers/target/target_core_tmr.c +++ b/drivers/target/target_core_tmr.c @@ -177,7 +177,6 @@ void core_tmr_abort_task( if (!__target_check_io_state(se_cmd, se_sess, 0)) { spin_unlock_irqrestore(_sess->sess_cmd_lock, flags); - target_put_sess_cmd(se_cmd); goto out; } list_del_init(_cmd->se_cmd_list); -- 1.9.1
[PATCH] target: Drop incorrect ABORT_TASK put for completed commands
From: Nicholas Bellinger This patch fixes a recent ABORT_TASK regression associated with commit febe562c, where a left-over target_put_sess_cmd() would still be called when __target_check_io_state() detected a command has already been completed, and explicit ABORT must be avoided. Note commit febe562c dropped the local kref_get_unless_zero() check in core_tmr_abort_task(), but did not drop this extra corresponding target_put_sess_cmd() in the failure path. So go ahead and drop this now bogus target_put_sess_cmd(), and avoid this potential use-after-free. Reported-by: Dan Lane Cc: Quinn Tran Cc: Himanshu Madhani Cc: Sagi Grimberg Cc: Christoph Hellwig Cc: Hannes Reinecke Cc: Andy Grover Cc: Mike Christie Signed-off-by: Nicholas Bellinger --- drivers/target/target_core_tmr.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/target/target_core_tmr.c b/drivers/target/target_core_tmr.c index 82a663b..4f229e7 100644 --- a/drivers/target/target_core_tmr.c +++ b/drivers/target/target_core_tmr.c @@ -177,7 +177,6 @@ void core_tmr_abort_task( if (!__target_check_io_state(se_cmd, se_sess, 0)) { spin_unlock_irqrestore(_sess->sess_cmd_lock, flags); - target_put_sess_cmd(se_cmd); goto out; } list_del_init(_cmd->se_cmd_list); -- 1.9.1
Re: [PATCH] steal_account_process_tick() should return jiffies
On 03/05/2016 07:19 AM, Frederic Weisbecker wrote: On Sat, Mar 05, 2016 at 11:27:01AM +0100, Thomas Gleixner wrote: Chris, On Fri, 4 Mar 2016, Chris Friesen wrote: First of all the subject line should contain a subsystem prefix, i.e. "sched/cputime:" The callers of steal_account_process_tick() expect it to return whether the last jiffy was stolen or not. Currently the return value of steal_account_process_tick() is in units of cputime, which vary between either jiffies or nsecs depending on CONFIG_VIRT_CPU_ACCOUNTING_GEN. Sure, but what is the actual problem? The return value is boolean and tells whether there was stolen time accounted or not. Indeed the changelog should better explain the problem. So I think the issue is that if the cputime has nsecs granularity and we have a tiny stolen time to account (lets say a few nanosecs, in fact anything that is below a jiffy), we are not going to account the tick on user/system. Yes, this is exactly it. Because of this, if CONFIG_VIRT_CPU_ACCOUNTING_GEN is enabled in a guest then the idle/system/user stats in /proc/stat can show odd values, and "top" shows nothing for user/system even if CPU hogs are running. But the fix doesn't look right to me because we are still accounting the steal time if it is lower than a jiffy and that steal time will never be substracted to user/system time if it never reach a jiffy. Instead the fix should accumulate the steal time and account it only once it's worth a jiffy and then substract it from system/user time accordingly. Yes, on reflection you are correct, and the patch looks pretty close, except that account_steal_time() is still expecting units of cputime. I'll send a followup patch. > Something like that: diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index b2ab2ff..d38e25f 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -262,7 +262,7 @@ static __always_inline bool steal_account_process_tick(void) #ifdef CONFIG_PARAVIRT if (static_key_false(_steal_enabled)) { u64 steal; - cputime_t steal_ct; + unsigned long steal_jiffies; steal = paravirt_steal_clock(smp_processor_id()); steal -= this_rq()->prev_steal_time; @@ -272,11 +272,11 @@ static __always_inline bool steal_account_process_tick(void) * based on jiffies). Lets cast the result to cputime * granularity and account the rest on the next rounds. */ - steal_ct = nsecs_to_cputime(steal); - this_rq()->prev_steal_time += cputime_to_nsecs(steal_ct); + steal_jiffies = nsecs_to_jiffies(steal); + this_rq()->prev_steal_time += jiffies_to_nsecs(steal_jiffies); account_steal_time(steal_ct); - return steal_ct; + return steal_jiffies; } #endif return false;
Re: [PATCH] steal_account_process_tick() should return jiffies
On 03/05/2016 07:19 AM, Frederic Weisbecker wrote: On Sat, Mar 05, 2016 at 11:27:01AM +0100, Thomas Gleixner wrote: Chris, On Fri, 4 Mar 2016, Chris Friesen wrote: First of all the subject line should contain a subsystem prefix, i.e. "sched/cputime:" The callers of steal_account_process_tick() expect it to return whether the last jiffy was stolen or not. Currently the return value of steal_account_process_tick() is in units of cputime, which vary between either jiffies or nsecs depending on CONFIG_VIRT_CPU_ACCOUNTING_GEN. Sure, but what is the actual problem? The return value is boolean and tells whether there was stolen time accounted or not. Indeed the changelog should better explain the problem. So I think the issue is that if the cputime has nsecs granularity and we have a tiny stolen time to account (lets say a few nanosecs, in fact anything that is below a jiffy), we are not going to account the tick on user/system. Yes, this is exactly it. Because of this, if CONFIG_VIRT_CPU_ACCOUNTING_GEN is enabled in a guest then the idle/system/user stats in /proc/stat can show odd values, and "top" shows nothing for user/system even if CPU hogs are running. But the fix doesn't look right to me because we are still accounting the steal time if it is lower than a jiffy and that steal time will never be substracted to user/system time if it never reach a jiffy. Instead the fix should accumulate the steal time and account it only once it's worth a jiffy and then substract it from system/user time accordingly. Yes, on reflection you are correct, and the patch looks pretty close, except that account_steal_time() is still expecting units of cputime. I'll send a followup patch. > Something like that: diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index b2ab2ff..d38e25f 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -262,7 +262,7 @@ static __always_inline bool steal_account_process_tick(void) #ifdef CONFIG_PARAVIRT if (static_key_false(_steal_enabled)) { u64 steal; - cputime_t steal_ct; + unsigned long steal_jiffies; steal = paravirt_steal_clock(smp_processor_id()); steal -= this_rq()->prev_steal_time; @@ -272,11 +272,11 @@ static __always_inline bool steal_account_process_tick(void) * based on jiffies). Lets cast the result to cputime * granularity and account the rest on the next rounds. */ - steal_ct = nsecs_to_cputime(steal); - this_rq()->prev_steal_time += cputime_to_nsecs(steal_ct); + steal_jiffies = nsecs_to_jiffies(steal); + this_rq()->prev_steal_time += jiffies_to_nsecs(steal_jiffies); account_steal_time(steal_ct); - return steal_ct; + return steal_jiffies; } #endif return false;
Re: [PATCH v2] sparc64: Add support for Application Data Integrity (ADI)
From: Khalid AzizDate: Wed, 2 Mar 2016 13:39:37 -0700 > In this > first implementation I am enabling ADI for hugepages only > since these pages are locked in memory and hence avoid the > issue of saving and restoring tags. This makes the feature almost entire useless. Non-hugepages must be in the initial implementation. > + PR_ENABLE_SPARC_ADI - Enable ADI checking in all pages in the address > + range specified. The pages in the range must be already > + locked. This operation enables the TTE.mcd bit for the > + pages specified. arg2 is the starting address for address > + range and must be page aligned. arg3 is the length of > + memory address range and must be a multiple of page size. I strongly dislike this interface, and it makes the prtctl cases look extremely ugly and hide to the casual reader what the code is actually doing. This is an mprotect() operation, so add a new flag bit and implement this via mprotect please. Then since you are guarenteed to have a consistent ADI setting for every single VMA region, you never "lose" the ADI state when you swap out. It's implicit in the VMA itself, because you'll store in the VMA that this is an ADI region. I also want this enabled unconditionally, without any Kconfig knobs. Thanks.
Re: [PATCH v2] sparc64: Add support for Application Data Integrity (ADI)
From: Khalid Aziz Date: Wed, 2 Mar 2016 13:39:37 -0700 > In this > first implementation I am enabling ADI for hugepages only > since these pages are locked in memory and hence avoid the > issue of saving and restoring tags. This makes the feature almost entire useless. Non-hugepages must be in the initial implementation. > + PR_ENABLE_SPARC_ADI - Enable ADI checking in all pages in the address > + range specified. The pages in the range must be already > + locked. This operation enables the TTE.mcd bit for the > + pages specified. arg2 is the starting address for address > + range and must be page aligned. arg3 is the length of > + memory address range and must be a multiple of page size. I strongly dislike this interface, and it makes the prtctl cases look extremely ugly and hide to the casual reader what the code is actually doing. This is an mprotect() operation, so add a new flag bit and implement this via mprotect please. Then since you are guarenteed to have a consistent ADI setting for every single VMA region, you never "lose" the ADI state when you swap out. It's implicit in the VMA itself, because you'll store in the VMA that this is an ADI region. I also want this enabled unconditionally, without any Kconfig knobs. Thanks.
include/linux/kprobes.h:361:2: error: invalid use of undefined type 'struct kprobe_ctlblk'
tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master head: 67944024c1cdd897e49a09b0d6af3ea38d1388ca commit: b2c0b2cbb282f0cf42518ffacbe197e6f2884168 nmi: create generic NMI backtrace implementation date: 8 months ago config: mn10300-allmodconfig (attached as .config) reproduce: wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross git checkout b2c0b2cbb282f0cf42518ffacbe197e6f2884168 # save the attached .config to linux build tree make.cross ARCH=mn10300 All errors (new ones prefixed by >>): In file included from lib/nmi_backtrace.c:17:0: include/linux/kprobes.h: In function 'get_kprobe_ctlblk': >> include/linux/kprobes.h:361:2: error: invalid use of undefined type 'struct >> kprobe_ctlblk' return this_cpu_ptr(_ctlblk); ^ vim +361 include/linux/kprobes.h e6584523 Ananth N Mavinakayanahalli 2005-11-07 355 { b76834bc Christoph Lameter 2010-12-06 356 __this_cpu_write(current_kprobe, NULL); e6584523 Ananth N Mavinakayanahalli 2005-11-07 357 } e6584523 Ananth N Mavinakayanahalli 2005-11-07 358 e6584523 Ananth N Mavinakayanahalli 2005-11-07 359 static inline struct kprobe_ctlblk *get_kprobe_ctlblk(void) e6584523 Ananth N Mavinakayanahalli 2005-11-07 360 { bdffd893 Christoph Lameter 2014-04-29 @361 return this_cpu_ptr(_ctlblk); e6584523 Ananth N Mavinakayanahalli 2005-11-07 362 } e6584523 Ananth N Mavinakayanahalli 2005-11-07 363 ^1da177e Linus Torvalds 2005-04-16 364 int register_kprobe(struct kprobe *p); :: The code at line 361 was first introduced by commit :: bdffd893a0e9c431304142d12d9a0a21d365c502 tracing: Replace __get_cpu_var uses with this_cpu_ptr :: TO: Christoph Lameter:: CC: Steven Rostedt --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: Binary data
include/linux/kprobes.h:361:2: error: invalid use of undefined type 'struct kprobe_ctlblk'
tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master head: 67944024c1cdd897e49a09b0d6af3ea38d1388ca commit: b2c0b2cbb282f0cf42518ffacbe197e6f2884168 nmi: create generic NMI backtrace implementation date: 8 months ago config: mn10300-allmodconfig (attached as .config) reproduce: wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross git checkout b2c0b2cbb282f0cf42518ffacbe197e6f2884168 # save the attached .config to linux build tree make.cross ARCH=mn10300 All errors (new ones prefixed by >>): In file included from lib/nmi_backtrace.c:17:0: include/linux/kprobes.h: In function 'get_kprobe_ctlblk': >> include/linux/kprobes.h:361:2: error: invalid use of undefined type 'struct >> kprobe_ctlblk' return this_cpu_ptr(_ctlblk); ^ vim +361 include/linux/kprobes.h e6584523 Ananth N Mavinakayanahalli 2005-11-07 355 { b76834bc Christoph Lameter 2010-12-06 356 __this_cpu_write(current_kprobe, NULL); e6584523 Ananth N Mavinakayanahalli 2005-11-07 357 } e6584523 Ananth N Mavinakayanahalli 2005-11-07 358 e6584523 Ananth N Mavinakayanahalli 2005-11-07 359 static inline struct kprobe_ctlblk *get_kprobe_ctlblk(void) e6584523 Ananth N Mavinakayanahalli 2005-11-07 360 { bdffd893 Christoph Lameter 2014-04-29 @361 return this_cpu_ptr(_ctlblk); e6584523 Ananth N Mavinakayanahalli 2005-11-07 362 } e6584523 Ananth N Mavinakayanahalli 2005-11-07 363 ^1da177e Linus Torvalds 2005-04-16 364 int register_kprobe(struct kprobe *p); :: The code at line 361 was first introduced by commit :: bdffd893a0e9c431304142d12d9a0a21d365c502 tracing: Replace __get_cpu_var uses with this_cpu_ptr :: TO: Christoph Lameter :: CC: Steven Rostedt --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: Binary data
drivers/vhost/vhost.c:718:3: error: call to '__compiletime_assert_718' declared with attribute error: BUILD_BUG_ON failed: __alignof__ *vq->avail > VRING_AVAIL_ALIGN_SIZE
Hi Michael, FYI, the error/warning still remains. tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master head: 67944024c1cdd897e49a09b0d6af3ea38d1388ca commit: 5d9a07b0de512b77bf28d2401e5fe3351f00a240 vhost: relax used address alignment date: 1 year, 2 months ago config: openrisc-allyesconfig (attached as .config) reproduce: wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross git checkout 5d9a07b0de512b77bf28d2401e5fe3351f00a240 # save the attached .config to linux build tree make.cross ARCH=openrisc All errors (new ones prefixed by >>): drivers/vhost/vhost.c: In function 'vhost_vring_ioctl': >> drivers/vhost/vhost.c:718:3: error: call to '__compiletime_assert_718' >> declared with attribute error: BUILD_BUG_ON failed: __alignof__ *vq->avail > >> VRING_AVAIL_ALIGN_SIZE vim +/__compiletime_assert_718 +718 drivers/vhost/vhost.c 712 (u64)(unsigned long)a.avail_user_addr != a.avail_user_addr) { 713 r = -EFAULT; 714 break; 715 } 716 717 /* Make sure it's safe to cast pointers to vring types. */ > 718 BUILD_BUG_ON(__alignof__ *vq->avail > > VRING_AVAIL_ALIGN_SIZE); 719 BUILD_BUG_ON(__alignof__ *vq->used > VRING_USED_ALIGN_SIZE); 720 if ((a.avail_user_addr & (VRING_AVAIL_ALIGN_SIZE - 1)) || 721 (a.used_user_addr & (VRING_USED_ALIGN_SIZE - 1)) || --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: Binary data
drivers/vhost/vhost.c:718:3: error: call to '__compiletime_assert_718' declared with attribute error: BUILD_BUG_ON failed: __alignof__ *vq->avail > VRING_AVAIL_ALIGN_SIZE
Hi Michael, FYI, the error/warning still remains. tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master head: 67944024c1cdd897e49a09b0d6af3ea38d1388ca commit: 5d9a07b0de512b77bf28d2401e5fe3351f00a240 vhost: relax used address alignment date: 1 year, 2 months ago config: openrisc-allyesconfig (attached as .config) reproduce: wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross git checkout 5d9a07b0de512b77bf28d2401e5fe3351f00a240 # save the attached .config to linux build tree make.cross ARCH=openrisc All errors (new ones prefixed by >>): drivers/vhost/vhost.c: In function 'vhost_vring_ioctl': >> drivers/vhost/vhost.c:718:3: error: call to '__compiletime_assert_718' >> declared with attribute error: BUILD_BUG_ON failed: __alignof__ *vq->avail > >> VRING_AVAIL_ALIGN_SIZE vim +/__compiletime_assert_718 +718 drivers/vhost/vhost.c 712 (u64)(unsigned long)a.avail_user_addr != a.avail_user_addr) { 713 r = -EFAULT; 714 break; 715 } 716 717 /* Make sure it's safe to cast pointers to vring types. */ > 718 BUILD_BUG_ON(__alignof__ *vq->avail > > VRING_AVAIL_ALIGN_SIZE); 719 BUILD_BUG_ON(__alignof__ *vq->used > VRING_USED_ALIGN_SIZE); 720 if ((a.avail_user_addr & (VRING_AVAIL_ALIGN_SIZE - 1)) || 721 (a.used_user_addr & (VRING_USED_ALIGN_SIZE - 1)) || --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: Binary data
Re: [PATCH 09/13] efi/arm64: Check for h/w support before booting a >4 KB granule kernel
On 17 February 2016 at 13:36, Matt Flemingwrote: > From: Ard Biesheuvel > > A kernel built with support for a page size that is not supported by the > hardware it runs on cannot boot to a state where it can inform the user > about it. > > If we happen to be booting via UEFI, we can fail gracefully so check > if the currently configured page size is supported by the hardware before > entering the kernel proper. Note that UEFI mandates support for 4 KB pages, > so in that case, no check is needed. > > Reviewed-by: Jeremy Linton > Tested-by: Suzuki K Poulose > Signed-off-by: Ard Biesheuvel > Acked-by: Mark Rutland > Signed-off-by: Matt Fleming Hi Matt, This patch turned up in -next with 'granule' replaced with 'granular', both in the commit log and in the patch itself. The term 'granule' is part of the idiom used by the ARM Architecture Reference Manual, and so changing it silently to 'granular' is not entirely appropriate here (although harmless in practice, obviously). In general, I would appreciate it if in the future, such changes were not made silently somewhere in the merge pipeline. Thanks, Ard. > --- > drivers/firmware/efi/libstub/arm64-stub.c | 20 > 1 file changed, 20 insertions(+) > > diff --git a/drivers/firmware/efi/libstub/arm64-stub.c > b/drivers/firmware/efi/libstub/arm64-stub.c > index 9e0342745e4f..aef04ad60e0d 100644 > --- a/drivers/firmware/efi/libstub/arm64-stub.c > +++ b/drivers/firmware/efi/libstub/arm64-stub.c > @@ -12,6 +12,26 @@ > #include > #include > #include > +#include > + > +efi_status_t check_platform_features(efi_system_table_t *sys_table_arg) > +{ > + u64 tg; > + > + /* UEFI mandates support for 4 KB granule, no need to check */ > + if (IS_ENABLED(CONFIG_ARM64_4K_PAGES)) > + return EFI_SUCCESS; > + > + tg = (read_cpuid(ID_AA64MMFR0_EL1) >> ID_AA64MMFR0_TGRAN_SHIFT) & 0xf; > + if (tg != ID_AA64MMFR0_TGRAN_SUPPORTED) { > + if (IS_ENABLED(CONFIG_ARM64_64K_PAGES)) > + pr_efi_err(sys_table_arg, "This 64 KB granule kernel > is not supported by your CPU\n"); > + else > + pr_efi_err(sys_table_arg, "This 16 KB granule kernel > is not supported by your CPU\n"); > + return EFI_UNSUPPORTED; > + } > + return EFI_SUCCESS; > +} > > efi_status_t handle_kernel_image(efi_system_table_t *sys_table_arg, > unsigned long *image_addr, > -- > 2.6.2 >
Re: [PATCH 09/13] efi/arm64: Check for h/w support before booting a >4 KB granule kernel
On 17 February 2016 at 13:36, Matt Fleming wrote: > From: Ard Biesheuvel > > A kernel built with support for a page size that is not supported by the > hardware it runs on cannot boot to a state where it can inform the user > about it. > > If we happen to be booting via UEFI, we can fail gracefully so check > if the currently configured page size is supported by the hardware before > entering the kernel proper. Note that UEFI mandates support for 4 KB pages, > so in that case, no check is needed. > > Reviewed-by: Jeremy Linton > Tested-by: Suzuki K Poulose > Signed-off-by: Ard Biesheuvel > Acked-by: Mark Rutland > Signed-off-by: Matt Fleming Hi Matt, This patch turned up in -next with 'granule' replaced with 'granular', both in the commit log and in the patch itself. The term 'granule' is part of the idiom used by the ARM Architecture Reference Manual, and so changing it silently to 'granular' is not entirely appropriate here (although harmless in practice, obviously). In general, I would appreciate it if in the future, such changes were not made silently somewhere in the merge pipeline. Thanks, Ard. > --- > drivers/firmware/efi/libstub/arm64-stub.c | 20 > 1 file changed, 20 insertions(+) > > diff --git a/drivers/firmware/efi/libstub/arm64-stub.c > b/drivers/firmware/efi/libstub/arm64-stub.c > index 9e0342745e4f..aef04ad60e0d 100644 > --- a/drivers/firmware/efi/libstub/arm64-stub.c > +++ b/drivers/firmware/efi/libstub/arm64-stub.c > @@ -12,6 +12,26 @@ > #include > #include > #include > +#include > + > +efi_status_t check_platform_features(efi_system_table_t *sys_table_arg) > +{ > + u64 tg; > + > + /* UEFI mandates support for 4 KB granule, no need to check */ > + if (IS_ENABLED(CONFIG_ARM64_4K_PAGES)) > + return EFI_SUCCESS; > + > + tg = (read_cpuid(ID_AA64MMFR0_EL1) >> ID_AA64MMFR0_TGRAN_SHIFT) & 0xf; > + if (tg != ID_AA64MMFR0_TGRAN_SUPPORTED) { > + if (IS_ENABLED(CONFIG_ARM64_64K_PAGES)) > + pr_efi_err(sys_table_arg, "This 64 KB granule kernel > is not supported by your CPU\n"); > + else > + pr_efi_err(sys_table_arg, "This 16 KB granule kernel > is not supported by your CPU\n"); > + return EFI_UNSUPPORTED; > + } > + return EFI_SUCCESS; > +} > > efi_status_t handle_kernel_image(efi_system_table_t *sys_table_arg, > unsigned long *image_addr, > -- > 2.6.2 >
[PATCH v2] serial: mvebu-uart: fix platform_no_drv_owner.cocci warnings
No need to set .owner here. The core will do it. Generated by: scripts/coccinelle/api/platform_no_drv_owner.cocci CC: Wilson DingSigned-off-by: Fengguang Wu Signed-off-by: Julia Lawall --- v2: No change to patch, correct misspelled email mvebu-uart.c |1 - 1 file changed, 1 deletion(-) --- a/drivers/tty/serial/mvebu-uart.c +++ b/drivers/tty/serial/mvebu-uart.c @@ -615,7 +615,6 @@ static struct platform_driver mvebu_uart .probe = mvebu_uart_probe, .remove = mvebu_uart_remove, .driver = { - .owner = THIS_MODULE, .name = "mvebu-uart", .of_match_table = of_match_ptr(mvebu_uart_of_match), },
[PATCH v2] serial: mvebu-uart: fix platform_no_drv_owner.cocci warnings
No need to set .owner here. The core will do it. Generated by: scripts/coccinelle/api/platform_no_drv_owner.cocci CC: Wilson Ding Signed-off-by: Fengguang Wu Signed-off-by: Julia Lawall --- v2: No change to patch, correct misspelled email mvebu-uart.c |1 - 1 file changed, 1 deletion(-) --- a/drivers/tty/serial/mvebu-uart.c +++ b/drivers/tty/serial/mvebu-uart.c @@ -615,7 +615,6 @@ static struct platform_driver mvebu_uart .probe = mvebu_uart_probe, .remove = mvebu_uart_remove, .driver = { - .owner = THIS_MODULE, .name = "mvebu-uart", .of_match_table = of_match_ptr(mvebu_uart_of_match), },
Linux 4.1.19
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 I'm announcing the release of the 4.1.19 kernel. All users of the 4.1 kernel series must upgrade. The updated 4.1.y git tree can be found at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-4.1.y and can be browsed at the normal kernel.org git web browser: http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary - Linux 4.1.19 - Alexander Duyck (2): flow_dissector: Fix unaligned access in __skb_flow_dissector when used by eth_get_headlen net: Copy inner L3 and L4 headers as unaligned on GRE TEB Alexandra Yates (1): ahci: Intel DNV device IDs SATA Alexei Potashnik (5): qla2xxx: delay plogi/prli ack until existing sessions are deleted qla2xxx: drop cmds/tmrs arrived while session is being deleted qla2xxx: Abort stale cmds on qla_tgt_wq when plogi arrives qla2xxx: added sess generations to detect RSCN update races qla2xxx: terminate exchange when command is aborted by LIO Amir Vadai (1): net/mlx4_en: Count HW buffer overrun only once Andrey Konovalov (1): ALSA: usb-audio: avoid freeing umidi object twice Andy Shevchenko (1): dmaengine: dw: disable BLOCK IRQs for non-cyclic xfer Anton Protopopov (2): cifs: fix erroneous return value rtnl: RTM_GETNETCONF: fix wrong return value Arnd Bergmann (1): tracing: Fix freak link error caused by branch tracer Bard Liao (1): ASoC: rt5645: fix the shift bit of IN1 boost Bart Van Assche (1): target: Remove first argument of target_{get,put}_sess_cmd() Bjørn Mork (1): qmi_wwan: add "4G LTE usb-modem U901" CQ Tang (1): iommu/vt-d: Fix 64-bit accesses to 32-bit DMAR_GSTS_REG Cyrille Pitchen (1): crypto: atmel-sha - remove calls of clk_prepare() from atomic contexts Dan Carpenter (1): intel_scu_ipcutil: underflow in scu_reg_access() Daniel Borkmann (1): bpf: fix branch offset adjustment on backjumps after patching ctx expansion David Henningsson (1): ALSA: hda - Fix static checker warning in patch_hdmi.c David Sterba (1): btrfs: properly set the termination value of ctx->pos in readdir Davidlohr Bueso (2): ipc,shm: move BUG_ON check into shm_lock ipc: convert invalid scenarios to use WARN_ON Dmitry Torokhov (1): Input: vmmouse - fix absolute device registration Dmitry V. Levin (1): unix_diag: fix incorrect sign extension in unix_lookup_by_ino Eric Dumazet (5): tcp: fix NULL deref in tcp_v4_send_ack() af_unix: fix struct pid memory leak tcp: beware of alignments in tcp_get_info() ipv6: fix a lockdep splat ipv4: fix memory leaks in ip_cmsg_send() callers Eryu Guan (1): ext4: don't read blocks from disk after extents being swapped Eugenia Emantayev (2): net/mlx4_en: Choose time-stamping shift value according to HW frequency net/mlx4_en: Avoid changing dev->features directly in run-time Filipe Manana (1): Btrfs: fix hang on extent buffer lock caused by the inode_paths ioctl Gavin Shan (2): powerpc/eeh: Fix stale cached primary bus powerpc/eeh: Fix build error caused by pci_dn Gerd Hoffmann (1): drm/qxl: use kmalloc_array to alloc reloc_info in qxl_process_single_command Guillaume Nault (1): pppoe: fix reference counting in PPPoE proxy Hangbin Liu (1): net/ipv6: add sysctl option accept_ra_min_hop_limit Hannes Frederic Sowa (2): pptp: fix illegal memory access caused by multiple bind()s unix: correctly track in-flight fds in sending process user_struct Hannes Reinecke (2): scsi_dh_rdac: always retry MODE SELECT on command lock violation bio: return EINTR if copying to user space got interrupted Hans Westgaard Ry (1): net:Add sysctl_max_skb_frags Herbert Xu (4): Backport fix for crypto: algif_skcipher - Require setkey before accept(2) Backport fix for crypto: algif_skcipher - Add nokey compatibility path Backport fix for crypto: algif_skcipher - Remove custom release parent function Backport fix for crypto: algif_skcipher - Fix race condition in skcipher_check_key Herton R. Krzesinski (2): pty: fix possible use after free of tty->driver_data pty: make sure super_block is still valid in final /dev/tty close Ido Schimmel (1): switchdev: Require RTNL mutex to be held when sending FDB notifications Insu Yun (1): ext4: fix potential integer overflow James Bottomley (1): klist: fix starting point removed bug in klist iterators James Hogan (1): MIPS: Fix buffer overflow in syscall_get_arguments() Jan Kara (1): ext4: fix crashes in dioread_nolock mode Jani Nikula (2): drm/i915/dsi: defend gpio table against out of bounds access drm/i915/dsi: don't pass
Linux 4.1.19
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 I'm announcing the release of the 4.1.19 kernel. All users of the 4.1 kernel series must upgrade. The updated 4.1.y git tree can be found at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-4.1.y and can be browsed at the normal kernel.org git web browser: http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary - Linux 4.1.19 - Alexander Duyck (2): flow_dissector: Fix unaligned access in __skb_flow_dissector when used by eth_get_headlen net: Copy inner L3 and L4 headers as unaligned on GRE TEB Alexandra Yates (1): ahci: Intel DNV device IDs SATA Alexei Potashnik (5): qla2xxx: delay plogi/prli ack until existing sessions are deleted qla2xxx: drop cmds/tmrs arrived while session is being deleted qla2xxx: Abort stale cmds on qla_tgt_wq when plogi arrives qla2xxx: added sess generations to detect RSCN update races qla2xxx: terminate exchange when command is aborted by LIO Amir Vadai (1): net/mlx4_en: Count HW buffer overrun only once Andrey Konovalov (1): ALSA: usb-audio: avoid freeing umidi object twice Andy Shevchenko (1): dmaengine: dw: disable BLOCK IRQs for non-cyclic xfer Anton Protopopov (2): cifs: fix erroneous return value rtnl: RTM_GETNETCONF: fix wrong return value Arnd Bergmann (1): tracing: Fix freak link error caused by branch tracer Bard Liao (1): ASoC: rt5645: fix the shift bit of IN1 boost Bart Van Assche (1): target: Remove first argument of target_{get,put}_sess_cmd() Bjørn Mork (1): qmi_wwan: add "4G LTE usb-modem U901" CQ Tang (1): iommu/vt-d: Fix 64-bit accesses to 32-bit DMAR_GSTS_REG Cyrille Pitchen (1): crypto: atmel-sha - remove calls of clk_prepare() from atomic contexts Dan Carpenter (1): intel_scu_ipcutil: underflow in scu_reg_access() Daniel Borkmann (1): bpf: fix branch offset adjustment on backjumps after patching ctx expansion David Henningsson (1): ALSA: hda - Fix static checker warning in patch_hdmi.c David Sterba (1): btrfs: properly set the termination value of ctx->pos in readdir Davidlohr Bueso (2): ipc,shm: move BUG_ON check into shm_lock ipc: convert invalid scenarios to use WARN_ON Dmitry Torokhov (1): Input: vmmouse - fix absolute device registration Dmitry V. Levin (1): unix_diag: fix incorrect sign extension in unix_lookup_by_ino Eric Dumazet (5): tcp: fix NULL deref in tcp_v4_send_ack() af_unix: fix struct pid memory leak tcp: beware of alignments in tcp_get_info() ipv6: fix a lockdep splat ipv4: fix memory leaks in ip_cmsg_send() callers Eryu Guan (1): ext4: don't read blocks from disk after extents being swapped Eugenia Emantayev (2): net/mlx4_en: Choose time-stamping shift value according to HW frequency net/mlx4_en: Avoid changing dev->features directly in run-time Filipe Manana (1): Btrfs: fix hang on extent buffer lock caused by the inode_paths ioctl Gavin Shan (2): powerpc/eeh: Fix stale cached primary bus powerpc/eeh: Fix build error caused by pci_dn Gerd Hoffmann (1): drm/qxl: use kmalloc_array to alloc reloc_info in qxl_process_single_command Guillaume Nault (1): pppoe: fix reference counting in PPPoE proxy Hangbin Liu (1): net/ipv6: add sysctl option accept_ra_min_hop_limit Hannes Frederic Sowa (2): pptp: fix illegal memory access caused by multiple bind()s unix: correctly track in-flight fds in sending process user_struct Hannes Reinecke (2): scsi_dh_rdac: always retry MODE SELECT on command lock violation bio: return EINTR if copying to user space got interrupted Hans Westgaard Ry (1): net:Add sysctl_max_skb_frags Herbert Xu (4): Backport fix for crypto: algif_skcipher - Require setkey before accept(2) Backport fix for crypto: algif_skcipher - Add nokey compatibility path Backport fix for crypto: algif_skcipher - Remove custom release parent function Backport fix for crypto: algif_skcipher - Fix race condition in skcipher_check_key Herton R. Krzesinski (2): pty: fix possible use after free of tty->driver_data pty: make sure super_block is still valid in final /dev/tty close Ido Schimmel (1): switchdev: Require RTNL mutex to be held when sending FDB notifications Insu Yun (1): ext4: fix potential integer overflow James Bottomley (1): klist: fix starting point removed bug in klist iterators James Hogan (1): MIPS: Fix buffer overflow in syscall_get_arguments() Jan Kara (1): ext4: fix crashes in dioread_nolock mode Jani Nikula (2): drm/i915/dsi: defend gpio table against out of bounds access drm/i915/dsi: don't pass
Linux 3.18.28
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 I'm announcing the release of the 3.18.28 kernel. All users of the 3.18 kernel series must upgrade. The updated 3.18.y git tree can be found at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-3.18.y and can be browsed at the normal kernel.org git web browser: http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary - Linux 3.18.28 - Alexandra Yates (1): ahci: Intel DNV device IDs SATA Alexei Potashnik (5): qla2xxx: delay plogi/prli ack until existing sessions are deleted qla2xxx: drop cmds/tmrs arrived while session is being deleted qla2xxx: Abort stale cmds on qla_tgt_wq when plogi arrives qla2xxx: added sess generations to detect RSCN update races qla2xxx: terminate exchange when command is aborted by LIO Andrey Konovalov (1): ALSA: usb-audio: avoid freeing umidi object twice Andy Shevchenko (1): dmaengine: dw: disable BLOCK IRQs for non-cyclic xfer Anton Protopopov (1): cifs: fix erroneous return value Arnd Bergmann (1): tracing: Fix freak link error caused by branch tracer Axel Lin (1): phy: core: Fixup return value of phy_exit when !pm_runtime_enabled Bard Liao (1): ASoC: rt5645: fix the shift bit of IN1 boost Bart Van Assche (1): target: Remove first argument of target_{get,put}_sess_cmd() Bruno Prémont (1): qla2xxx: fix busy wait regression CQ Tang (1): iommu/vt-d: Fix 64-bit accesses to 32-bit DMAR_GSTS_REG Chris Mason (1): fs-writeback: unplug before cond_resched in writeback_sb_inodes Dan Carpenter (1): intel_scu_ipcutil: underflow in scu_reg_access() David Henningsson (1): ALSA: hda - Fix static checker warning in patch_hdmi.c David Sterba (1): btrfs: properly set the termination value of ctx->pos in readdir Dmitry Monakhov (1): ext4: move_extent improve bh vanishing success factor Eryu Guan (1): ext4: don't read blocks from disk after extents being swapped Filipe Manana (1): Btrfs: fix hang on extent buffer lock caused by the inode_paths ioctl Gavin Shan (1): powerpc/powernv: Shorten EEH function names Gerd Hoffmann (1): drm/qxl: use kmalloc_array to alloc reloc_info in qxl_process_single_command Hannes Reinecke (1): scsi_dh_rdac: always retry MODE SELECT on command lock violation Herton R. Krzesinski (2): pty: fix possible use after free of tty->driver_data pty: make sure super_block is still valid in final /dev/tty close Insu Yun (1): ext4: fix potential integer overflow James Bottomley (1): klist: fix starting point removed bug in klist iterators James Hogan (1): MIPS: Fix buffer overflow in syscall_get_arguments() Jan Kara (1): ext4: fix crashes in dioread_nolock mode Jani Nikula (2): drm/i915/dsi: defend gpio table against out of bounds access drm/i915/dsi: don't pass arbitrary data to sideband Jeremy McNicoll (1): tty: Add support for PCIe WCH382 2S multi-IO card Linus Walleij (2): ARM: 8517/1: ICST: avoid arithmetic overflow in icst_hz() ARM: 8519/1: ICST: try other dividends than 1 Mathias Krause (1): crypto: user - lock crypto_alg_list on alg dump Mika Westerberg (1): SCSI: Add Marvell Console to VPD blacklist Nicholas Bellinger (1): target: Fix LUN_RESET active TMR descriptor handling Nicolai Hähnle (1): drm/radeon: hold reference to fences in radeon_sa_bo_new Quinn Tran (1): qla2xxx: Use pci_enable_msix_range() instead of pci_enable_msix() Rasmus Villemoes (1): drm/radeon: use post-decrement in error handling Roland Dreier (1): qla2xxx: kill sessions/log out initiator on RSCN and port down events Ryan Ware (1): EVM: Use crypto_memneq() for digest comparisons Sasha Levin (1): Linux 3.18.28 Saurav Kashyap (1): qla2xxx: Mark port lost when we receive an RSCN for it. Sebastian Andrzej Siewior (1): PCI/AER: Flush workqueue on device remove to avoid use-after-free Sergej Pupykin (2): parport: Add support for the WCH382 2S/1P multi-IO card tty: Add support for the WCH384 4S multi-IO card Shawn Lin (1): phy: core: fix wrong err handle for phy_power_on Stefan Haberland (2): s390/dasd: prevent incorrect length error under z/VM after PAV changes s390/dasd: fix refcount for PAV reassignment Steven Rostedt (Red Hat) (1): tracepoints: Do not trace when cpu is offline Swapnil Nagle (1): qla2xxx: cleanup cmd in qla workqueue before processing TMR Takashi Iwai (11): ALSA: hda - Fix speaker output from VAIO AiO machines ALSA: dummy: Implement timer backend switching more safely ALSA: timer: Fix wrong instance passed to slave callbacks ALSA: timer: Fix race between stop
Linux 3.18.28
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 I'm announcing the release of the 3.18.28 kernel. All users of the 3.18 kernel series must upgrade. The updated 3.18.y git tree can be found at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-3.18.y and can be browsed at the normal kernel.org git web browser: http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary - Linux 3.18.28 - Alexandra Yates (1): ahci: Intel DNV device IDs SATA Alexei Potashnik (5): qla2xxx: delay plogi/prli ack until existing sessions are deleted qla2xxx: drop cmds/tmrs arrived while session is being deleted qla2xxx: Abort stale cmds on qla_tgt_wq when plogi arrives qla2xxx: added sess generations to detect RSCN update races qla2xxx: terminate exchange when command is aborted by LIO Andrey Konovalov (1): ALSA: usb-audio: avoid freeing umidi object twice Andy Shevchenko (1): dmaengine: dw: disable BLOCK IRQs for non-cyclic xfer Anton Protopopov (1): cifs: fix erroneous return value Arnd Bergmann (1): tracing: Fix freak link error caused by branch tracer Axel Lin (1): phy: core: Fixup return value of phy_exit when !pm_runtime_enabled Bard Liao (1): ASoC: rt5645: fix the shift bit of IN1 boost Bart Van Assche (1): target: Remove first argument of target_{get,put}_sess_cmd() Bruno Prémont (1): qla2xxx: fix busy wait regression CQ Tang (1): iommu/vt-d: Fix 64-bit accesses to 32-bit DMAR_GSTS_REG Chris Mason (1): fs-writeback: unplug before cond_resched in writeback_sb_inodes Dan Carpenter (1): intel_scu_ipcutil: underflow in scu_reg_access() David Henningsson (1): ALSA: hda - Fix static checker warning in patch_hdmi.c David Sterba (1): btrfs: properly set the termination value of ctx->pos in readdir Dmitry Monakhov (1): ext4: move_extent improve bh vanishing success factor Eryu Guan (1): ext4: don't read blocks from disk after extents being swapped Filipe Manana (1): Btrfs: fix hang on extent buffer lock caused by the inode_paths ioctl Gavin Shan (1): powerpc/powernv: Shorten EEH function names Gerd Hoffmann (1): drm/qxl: use kmalloc_array to alloc reloc_info in qxl_process_single_command Hannes Reinecke (1): scsi_dh_rdac: always retry MODE SELECT on command lock violation Herton R. Krzesinski (2): pty: fix possible use after free of tty->driver_data pty: make sure super_block is still valid in final /dev/tty close Insu Yun (1): ext4: fix potential integer overflow James Bottomley (1): klist: fix starting point removed bug in klist iterators James Hogan (1): MIPS: Fix buffer overflow in syscall_get_arguments() Jan Kara (1): ext4: fix crashes in dioread_nolock mode Jani Nikula (2): drm/i915/dsi: defend gpio table against out of bounds access drm/i915/dsi: don't pass arbitrary data to sideband Jeremy McNicoll (1): tty: Add support for PCIe WCH382 2S multi-IO card Linus Walleij (2): ARM: 8517/1: ICST: avoid arithmetic overflow in icst_hz() ARM: 8519/1: ICST: try other dividends than 1 Mathias Krause (1): crypto: user - lock crypto_alg_list on alg dump Mika Westerberg (1): SCSI: Add Marvell Console to VPD blacklist Nicholas Bellinger (1): target: Fix LUN_RESET active TMR descriptor handling Nicolai Hähnle (1): drm/radeon: hold reference to fences in radeon_sa_bo_new Quinn Tran (1): qla2xxx: Use pci_enable_msix_range() instead of pci_enable_msix() Rasmus Villemoes (1): drm/radeon: use post-decrement in error handling Roland Dreier (1): qla2xxx: kill sessions/log out initiator on RSCN and port down events Ryan Ware (1): EVM: Use crypto_memneq() for digest comparisons Sasha Levin (1): Linux 3.18.28 Saurav Kashyap (1): qla2xxx: Mark port lost when we receive an RSCN for it. Sebastian Andrzej Siewior (1): PCI/AER: Flush workqueue on device remove to avoid use-after-free Sergej Pupykin (2): parport: Add support for the WCH382 2S/1P multi-IO card tty: Add support for the WCH384 4S multi-IO card Shawn Lin (1): phy: core: fix wrong err handle for phy_power_on Stefan Haberland (2): s390/dasd: prevent incorrect length error under z/VM after PAV changes s390/dasd: fix refcount for PAV reassignment Steven Rostedt (Red Hat) (1): tracepoints: Do not trace when cpu is offline Swapnil Nagle (1): qla2xxx: cleanup cmd in qla workqueue before processing TMR Takashi Iwai (11): ALSA: hda - Fix speaker output from VAIO AiO machines ALSA: dummy: Implement timer backend switching more safely ALSA: timer: Fix wrong instance passed to slave callbacks ALSA: timer: Fix race between stop
Re: [PATCH 4/5] usb: gadget: f_midi: cleanups and typos fixes
On Sat, Mar 05, 2016 at 11:53:35PM +, Felipe Ferreri Tonello wrote: > Hi Greg, > > On March 5, 2016 7:39:13 PM GMT+00:00, Greg KHwrote: > >On Sat, Mar 05, 2016 at 11:28:45AM -0500, Michal Nazarewicz wrote: > >> >> On Wed, Mar 02 2016, Felipe F. Tonello wrote: > >> >>> @@ -16,7 +16,7 @@ > >> >>> * Copyright (C) 2006 Thumtronics Pty Ltd. > >> >>> * Ben Williamson > >> >>> * > >> >>> - * Licensed under the GPL-2 or later. > >> >>> + * Licensed under the GPLv2. > >> > >> > On March 4, 2016 7:17:31 PM GMT+00:00, Michal Nazarewicz > > wrote: > >> >> Any particular reason to do that? > >> > >> On Fri, Mar 04 2016, Felipe Ferreri Tonello wrote: > >> > Because the kernel is v2 only and not later. > >> > >> Linux as a whole is GPLv2 only, but that doesn’t necessarily mean > >that > >> parts of it cannot be dual licensed (or GPLv2+). It’s safer to leave > >> copyright noticed clear unless you explicitly want your contribution > >be > >> GPLv2 only which brings the whole file GPLv2 only. > > > >But you can't change the license of someone else's code, which is what > >is happening here. Felipe T, you can't do that at all unless you want > >to get into big trouble, please consult a lawyer for all of the gory > >details. > > Thanks for letting me know. TBH, I had no idea about it. Never change a copyright or a license if you don't know exactly what you are doing, or why you are doing it, and have consulted with a lawyer beforehand. The issues here are real, don't take them lightly. greg k-h
Re: [PATCH 4/5] usb: gadget: f_midi: cleanups and typos fixes
On Sat, Mar 05, 2016 at 11:53:35PM +, Felipe Ferreri Tonello wrote: > Hi Greg, > > On March 5, 2016 7:39:13 PM GMT+00:00, Greg KH wrote: > >On Sat, Mar 05, 2016 at 11:28:45AM -0500, Michal Nazarewicz wrote: > >> >> On Wed, Mar 02 2016, Felipe F. Tonello wrote: > >> >>> @@ -16,7 +16,7 @@ > >> >>> * Copyright (C) 2006 Thumtronics Pty Ltd. > >> >>> * Ben Williamson > >> >>> * > >> >>> - * Licensed under the GPL-2 or later. > >> >>> + * Licensed under the GPLv2. > >> > >> > On March 4, 2016 7:17:31 PM GMT+00:00, Michal Nazarewicz > > wrote: > >> >> Any particular reason to do that? > >> > >> On Fri, Mar 04 2016, Felipe Ferreri Tonello wrote: > >> > Because the kernel is v2 only and not later. > >> > >> Linux as a whole is GPLv2 only, but that doesn’t necessarily mean > >that > >> parts of it cannot be dual licensed (or GPLv2+). It’s safer to leave > >> copyright noticed clear unless you explicitly want your contribution > >be > >> GPLv2 only which brings the whole file GPLv2 only. > > > >But you can't change the license of someone else's code, which is what > >is happening here. Felipe T, you can't do that at all unless you want > >to get into big trouble, please consult a lawyer for all of the gory > >details. > > Thanks for letting me know. TBH, I had no idea about it. Never change a copyright or a license if you don't know exactly what you are doing, or why you are doing it, and have consulted with a lawyer beforehand. The issues here are real, don't take them lightly. greg k-h
Re: [PATCH] mailbox: sti: fix check of error code returned by devm_ioremap_resource()
On Sun, 06 Mar 2016, Vladimir Zapolskiy wrote: > The change fixes potential oops while accessing iomem on invalid > address, if devm_ioremap_resource() fails due to some reason. > > The devm_ioremap_resource() function returns ERR_PTR() and never > returns NULL, which makes useless a following check for NULL. > > Signed-off-by: Vladimir Zapolskiy> Fixes: 9ef4546cbd7e ("mailbox: Add support for ST's Mailbox IP") > --- > drivers/mailbox/mailbox-sti.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) Acked-by: Lee Jones > diff --git a/drivers/mailbox/mailbox-sti.c b/drivers/mailbox/mailbox-sti.c > index 2394cfe..a334db5 100644 > --- a/drivers/mailbox/mailbox-sti.c > +++ b/drivers/mailbox/mailbox-sti.c > @@ -430,8 +430,8 @@ static int sti_mbox_probe(struct platform_device *pdev) > > res = platform_get_resource(pdev, IORESOURCE_MEM, 0); > mdev->base = devm_ioremap_resource(>dev, res); > - if (!mdev->base) > - return -ENOMEM; > + if (IS_ERR(mdev->base)) > + return PTR_ERR(mdev->base); > > ret = of_property_read_string(np, "mbox-name", >name); > if (ret) -- Lee Jones Linaro STMicroelectronics Landing Team Lead Linaro.org │ Open source software for ARM SoCs Follow Linaro: Facebook | Twitter | Blog
Re: [PATCH] serial: 8250_ingenic: Remove global variable
On Sun, Mar 06, 2016 at 12:55:55AM +0100, Paul Cercueil wrote: > Signed-off-by: Paul Cercueil> --- > drivers/tty/serial/8250/8250_ingenic.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) I can't take patches without a changelog entry :(
Re: [PATCH] mailbox: sti: fix check of error code returned by devm_ioremap_resource()
On Sun, 06 Mar 2016, Vladimir Zapolskiy wrote: > The change fixes potential oops while accessing iomem on invalid > address, if devm_ioremap_resource() fails due to some reason. > > The devm_ioremap_resource() function returns ERR_PTR() and never > returns NULL, which makes useless a following check for NULL. > > Signed-off-by: Vladimir Zapolskiy > Fixes: 9ef4546cbd7e ("mailbox: Add support for ST's Mailbox IP") > --- > drivers/mailbox/mailbox-sti.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) Acked-by: Lee Jones > diff --git a/drivers/mailbox/mailbox-sti.c b/drivers/mailbox/mailbox-sti.c > index 2394cfe..a334db5 100644 > --- a/drivers/mailbox/mailbox-sti.c > +++ b/drivers/mailbox/mailbox-sti.c > @@ -430,8 +430,8 @@ static int sti_mbox_probe(struct platform_device *pdev) > > res = platform_get_resource(pdev, IORESOURCE_MEM, 0); > mdev->base = devm_ioremap_resource(>dev, res); > - if (!mdev->base) > - return -ENOMEM; > + if (IS_ERR(mdev->base)) > + return PTR_ERR(mdev->base); > > ret = of_property_read_string(np, "mbox-name", >name); > if (ret) -- Lee Jones Linaro STMicroelectronics Landing Team Lead Linaro.org │ Open source software for ARM SoCs Follow Linaro: Facebook | Twitter | Blog
Re: [PATCH] serial: 8250_ingenic: Remove global variable
On Sun, Mar 06, 2016 at 12:55:55AM +0100, Paul Cercueil wrote: > Signed-off-by: Paul Cercueil > --- > drivers/tty/serial/8250/8250_ingenic.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) I can't take patches without a changelog entry :(
Re: soft lockup when passing vvar address to write(2)
On Mar 5, 2016 1:04 AM, "Ingo Molnar"wrote: > > > * Thomas Gleixner wrote: > > > On Fri, 4 Mar 2016, Andy Lutomirski wrote: > > > Thomas, I still think we should consider just deleting the HPET vclock > > > code and accept the syscall overhead on systems that are stuck using > > > HPET. If fast syscalls are available (which should include every > > > system with HPET, unless there are some 32-bit AMD systems lying > > > around), then the overhead in a syscall is *tiny* compared to the code > > > of the HPET read itself. > > > > No objection from my side, really. > > Seconded. HPET hardware overhead is typically horrifically large in any case, > no > need to memory map it and expose hardware breakages to user-space ... I'll do it for 4.7. > > It's also a (mild) security hole: a well-known HPET address can be abused as a > statistical trampoline periodically cycling through 'dangerous' instruction > values. That weakness has closed for quite a while -- it's mapped NX and it's randomized. I'm also not planning to revert the mapping security improvement -- even if we remove the HPET code, it still applies to kvmclock and to anything else that gets added in the future. It's also very little code. --Andy
Re: soft lockup when passing vvar address to write(2)
On Mar 5, 2016 1:04 AM, "Ingo Molnar" wrote: > > > * Thomas Gleixner wrote: > > > On Fri, 4 Mar 2016, Andy Lutomirski wrote: > > > Thomas, I still think we should consider just deleting the HPET vclock > > > code and accept the syscall overhead on systems that are stuck using > > > HPET. If fast syscalls are available (which should include every > > > system with HPET, unless there are some 32-bit AMD systems lying > > > around), then the overhead in a syscall is *tiny* compared to the code > > > of the HPET read itself. > > > > No objection from my side, really. > > Seconded. HPET hardware overhead is typically horrifically large in any case, > no > need to memory map it and expose hardware breakages to user-space ... I'll do it for 4.7. > > It's also a (mild) security hole: a well-known HPET address can be abused as a > statistical trampoline periodically cycling through 'dangerous' instruction > values. That weakness has closed for quite a while -- it's mapped NX and it's randomized. I'm also not planning to revert the mapping security improvement -- even if we remove the HPET code, it still applies to kvmclock and to anything else that gets added in the future. It's also very little code. --Andy
Re: Applied "regulator: max8973: add support for junction thermal warning" to the regulator tree
On Sat, Mar 05, 2016 at 09:25:49PM +0900, Mark Brown wrote: > The patch > >regulator: max8973: add support for junction thermal warning > > has been applied to the regulator tree at > >git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator.git > > All being well this means that it will be integrated into the linux-next > tree (usually sometime in the next 24 hours) and sent to Linus during > the next merge window (or sooner if it is a bug fix), however if > problems are discovered then the patch may be dropped or reverted. ...and reverted because the 0day bot found similar build failures to the last time :( signature.asc Description: PGP signature
Re: Applied "regulator: max8973: add support for junction thermal warning" to the regulator tree
On Sat, Mar 05, 2016 at 09:25:49PM +0900, Mark Brown wrote: > The patch > >regulator: max8973: add support for junction thermal warning > > has been applied to the regulator tree at > >git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator.git > > All being well this means that it will be integrated into the linux-next > tree (usually sometime in the next 24 hours) and sent to Linus during > the next merge window (or sooner if it is a bug fix), however if > problems are discovered then the patch may be dropped or reverted. ...and reverted because the 0day bot found similar build failures to the last time :( signature.asc Description: PGP signature
Re: [PATCH v5 5/8] kbuild: add fine grained build dependencies for exported symbols
On Sat, 5 Mar 2016, Michal Marek wrote: > On Fri, Mar 04, 2016 at 11:53:53PM +0100, Michal Marek wrote: > > Dne 4.3.2016 v 23:51 Michal Marek napsal(a): > > > Dne 4.3.2016 v 06:40 Nicolas Pitre napsal(a): > > >> +cmd_and_fixdep = > > >> \ > > >> +$(echo-cmd) $(cmd_$(1)); > > >> \ > > >> +$(ksym_dep_filter) | > > >> \ > > >> +scripts/basic/fixdep -e $(depfile) $@ '$(make-cmd)' > > >> \ > > >> +> $(dot-target).tmp; > > >> \ > > >> +rm -f $(depfile); > > >> \ > > >> +mv -f $(dot-target).tmp $(dot-target).cmd; > > > > > > While trying this, I got a SIGBUS from fixdep once. My theory is that > > > the depfile is mmap()ed by fixdep and modified by the preprocesor run at > > > the same time. I could not reproduce this so far (still trying). But if > > > it's really this race, the fix would be to disable dependency generation > > > in the preprocessor by passing -Wp,MD,/dev/null or somesuch. But we > > > never had this problem with genksyms, which is weird. It could as well > > > be that my build machine's memory is faulty :(. > > > > Actually, genksyms does not ran in parallel. neither before nor after > run > > this patch. > > I reproduced the SIGBUS after a few iterations, and it crashes in > parse_dep_file(). I'm now testing this > > diff --git a/scripts/Kbuild.include b/scripts/Kbuild.include > index 0be7e09ba381..4fdd8348acf6 100644 > --- a/scripts/Kbuild.include > +++ b/scripts/Kbuild.include > @@ -270,10 +270,12 @@ else > > # Filter out exported kernel symbol names from the preprocessor output. > # See also __KSYM_DEPS__ in include/linux/export.h. > +# We disable the depfile generation here, so as not to overwrite the existing > +# depfile while fixdep is parsing it > ksym_dep_filter = > \ > case "$(1)" in \ > - cc_*_c) $(CPP) $(c_flags) -D__KSYM_DEPS__ $< ;; \ > - as_*_S) $(CPP) $(a_flags) -D__KSYM_DEPS__ $< ;; \ > + cc_*_c) $(CPP) $(filter-out -Wp$(comma)-M%, $(c_flags)) -D__KSYM_DEPS__ > $< ;; \ > + as_*_S) $(CPP) $(filter-out -Wp$(comma)-M%, $(a_flags)) -D__KSYM_DEPS__ > $< ;; \ > cpp_lds_S) : ;; \ > *) echo "Don't know how to preprocess $(1)"; false ;;\ > esac | sed -rn 's/^.*=== __KSYM_(.*) ===.*$$/KSYM_\1/p' This makes perfect sense even if I can't reproduce on my side. Are you willing to fold this patch in, or do you prefer that I repost? Nicolas
Re: [PATCH v5 5/8] kbuild: add fine grained build dependencies for exported symbols
On Sat, 5 Mar 2016, Michal Marek wrote: > On Fri, Mar 04, 2016 at 11:53:53PM +0100, Michal Marek wrote: > > Dne 4.3.2016 v 23:51 Michal Marek napsal(a): > > > Dne 4.3.2016 v 06:40 Nicolas Pitre napsal(a): > > >> +cmd_and_fixdep = > > >> \ > > >> +$(echo-cmd) $(cmd_$(1)); > > >> \ > > >> +$(ksym_dep_filter) | > > >> \ > > >> +scripts/basic/fixdep -e $(depfile) $@ '$(make-cmd)' > > >> \ > > >> +> $(dot-target).tmp; > > >> \ > > >> +rm -f $(depfile); > > >> \ > > >> +mv -f $(dot-target).tmp $(dot-target).cmd; > > > > > > While trying this, I got a SIGBUS from fixdep once. My theory is that > > > the depfile is mmap()ed by fixdep and modified by the preprocesor run at > > > the same time. I could not reproduce this so far (still trying). But if > > > it's really this race, the fix would be to disable dependency generation > > > in the preprocessor by passing -Wp,MD,/dev/null or somesuch. But we > > > never had this problem with genksyms, which is weird. It could as well > > > be that my build machine's memory is faulty :(. > > > > Actually, genksyms does not ran in parallel. neither before nor after > run > > this patch. > > I reproduced the SIGBUS after a few iterations, and it crashes in > parse_dep_file(). I'm now testing this > > diff --git a/scripts/Kbuild.include b/scripts/Kbuild.include > index 0be7e09ba381..4fdd8348acf6 100644 > --- a/scripts/Kbuild.include > +++ b/scripts/Kbuild.include > @@ -270,10 +270,12 @@ else > > # Filter out exported kernel symbol names from the preprocessor output. > # See also __KSYM_DEPS__ in include/linux/export.h. > +# We disable the depfile generation here, so as not to overwrite the existing > +# depfile while fixdep is parsing it > ksym_dep_filter = > \ > case "$(1)" in \ > - cc_*_c) $(CPP) $(c_flags) -D__KSYM_DEPS__ $< ;; \ > - as_*_S) $(CPP) $(a_flags) -D__KSYM_DEPS__ $< ;; \ > + cc_*_c) $(CPP) $(filter-out -Wp$(comma)-M%, $(c_flags)) -D__KSYM_DEPS__ > $< ;; \ > + as_*_S) $(CPP) $(filter-out -Wp$(comma)-M%, $(a_flags)) -D__KSYM_DEPS__ > $< ;; \ > cpp_lds_S) : ;; \ > *) echo "Don't know how to preprocess $(1)"; false ;;\ > esac | sed -rn 's/^.*=== __KSYM_(.*) ===.*$$/KSYM_\1/p' This makes perfect sense even if I can't reproduce on my side. Are you willing to fold this patch in, or do you prefer that I repost? Nicolas
Re: Kernel docs: muddying the waters a bit
Em Fri, 04 Mar 2016 15:09:09 +0100 Johannes Stezenbachescreveu: > On Fri, Mar 04, 2016 at 09:59:50AM -0300, Mauro Carvalho Chehab wrote: > > > > 3) I tried to use a .. cssclass, as Johannes suggested, but > > I was not able to include the CSS file. I suspect that this is > > easy to fix, but I want to see if the cssclass will also work for > > the pdf output as well. > > "cssclass" was (I think) a custom role defined in the example, > unless you also have defined a custom role you can use plain "class". > I have not looked deeper into the theming and template stuff. Well, it accepted cssclass for html (well, it didn't find the templates - so I guess it is just me failing to understand how to tell sphinx to get the stylesheet), but it rejects it for latexPDF. > > > 4) It seems that it can't produce nested tables in pdf: > > > > Markup is unsupported in LaTeX: > > v4l-table-within-table:: nested tables are not yet implemented. > > Makefile:115: recipe for target 'latexpdf' failed > > This: > http://www.sphinx-doc.org/en/stable/markup/misc.html#tables > > suggests you need to add the tabularcolumns directive > for complex tables. > > BTW, as an alternative to the ASCII-art input > there is also support for CSV and list tables: > http://docutils.sourceforge.net/docs/ref/rst/directives.html#table I converted one of the big tables to CSV. At least now it recognized it as a table. Yet, the table was very badly formated: https://mchehab.fedorapeople.org/media-kabi-docs-test/rst_tests/packed-rgb.html This is how this table should look like: https://linuxtv.org/downloads/v4l-dvb-apis/packed-rgb.html Also, as this table has merged cells at the legend. I've no idea how to tell sphinx to do that on csv format. The RST files are on this git tree: https://git.linuxtv.org/mchehab/v4l2-docs-poc.git/ Regards, Mauro
Re: Kernel docs: muddying the waters a bit
Em Fri, 04 Mar 2016 15:09:09 +0100 Johannes Stezenbach escreveu: > On Fri, Mar 04, 2016 at 09:59:50AM -0300, Mauro Carvalho Chehab wrote: > > > > 3) I tried to use a .. cssclass, as Johannes suggested, but > > I was not able to include the CSS file. I suspect that this is > > easy to fix, but I want to see if the cssclass will also work for > > the pdf output as well. > > "cssclass" was (I think) a custom role defined in the example, > unless you also have defined a custom role you can use plain "class". > I have not looked deeper into the theming and template stuff. Well, it accepted cssclass for html (well, it didn't find the templates - so I guess it is just me failing to understand how to tell sphinx to get the stylesheet), but it rejects it for latexPDF. > > > 4) It seems that it can't produce nested tables in pdf: > > > > Markup is unsupported in LaTeX: > > v4l-table-within-table:: nested tables are not yet implemented. > > Makefile:115: recipe for target 'latexpdf' failed > > This: > http://www.sphinx-doc.org/en/stable/markup/misc.html#tables > > suggests you need to add the tabularcolumns directive > for complex tables. > > BTW, as an alternative to the ASCII-art input > there is also support for CSV and list tables: > http://docutils.sourceforge.net/docs/ref/rst/directives.html#table I converted one of the big tables to CSV. At least now it recognized it as a table. Yet, the table was very badly formated: https://mchehab.fedorapeople.org/media-kabi-docs-test/rst_tests/packed-rgb.html This is how this table should look like: https://linuxtv.org/downloads/v4l-dvb-apis/packed-rgb.html Also, as this table has merged cells at the legend. I've no idea how to tell sphinx to do that on csv format. The RST files are on this git tree: https://git.linuxtv.org/mchehab/v4l2-docs-poc.git/ Regards, Mauro
Re: [PATCH v2 2/2] spi: spi-pic32: Add PIC32 SPI master driver
On Sat, Mar 05, 2016 at 08:45:22PM +0530, Purna Chandra Mandal wrote: > Sorry Mark. > I have missed adding you in CC. Please find it from: > https://lkml.org/lkml/2016/3/4/401 No, I can't review or apply patches from a web link - please resend. signature.asc Description: PGP signature
Re: [PATCH v2 2/2] spi: spi-pic32: Add PIC32 SPI master driver
On Sat, Mar 05, 2016 at 08:45:22PM +0530, Purna Chandra Mandal wrote: > Sorry Mark. > I have missed adding you in CC. Please find it from: > https://lkml.org/lkml/2016/3/4/401 No, I can't review or apply patches from a web link - please resend. signature.asc Description: PGP signature
Re: [PATCH v2 6/10] cpufreq: Support for fast frequency switching
On Sat, Mar 5, 2016 at 5:49 PM, Peter Zijlstrawrote: > On Sat, Mar 05, 2016 at 12:18:54AM +0100, Rafael J. Wysocki wrote: > >> >>> Even if there are platforms which may change the CPU frequency behind >> >>> cpufreq's back, breaking the transition notifiers, I'm worried about the >> >>> addition of an interface which itself breaks them. The platforms which >> >>> do change CPU frequency on their own have probably evolved to live with >> >>> or work around this behavior. As other platforms migrate to fast >> >>> frequency switching they might be surprised when things don't work as >> >>> advertised. > > There's only 43 sites of cpufreq_register_notifier in 37 files, that > should be fairly simple to audit. > >> >>> I'm not sure what the easiest way to deal with this is. I see the >> >>> transition notifiers are the srcu type, which I understand to be >> >>> blocking. Going through the tree and reworking everyone's callbacks and >> >>> changing the type to atomic is obviously not realistic. >> >> >> >> Right. > > Even if it was (and per the above it looks entirely feasible), that's > just not going to happen. We're not ever going to call random notifier > crap from this deep within the scheduler. > >> >>> How about modifying cpufreq_register_notifier to return an error if the >> >>> driver has a fast_switch callback installed and an attempt to register a >> >>> transition notifier is made? >> >> >> >> That sounds like a good idea. > > Agreed, fail the stuff hard. > > Simply make cpufreq_register_notifier a __must_check function and add > error handling to all call sites. Quite frankly, I don't see a compelling reason to do anything about the notifications at this point. The ACPI driver is the only one that will support fast switching for the time being and on practically all platforms that can use the ACPI driver the transition notifications cannot be relied on anyway for a few reasons. First, if intel_pstate or HWP is in use, they won't be coming at all. Second, anything turbo will just change frequency at will without notifying (like HWP). Finally, if they are coming, whoever receives them is notified about the frequency that is requested and not the real one, which is misleading, because (a) the request may just make the CPU go into the turbo range and then see above or (b) if the CPU is in a platform-coordinated package, its request will only be granted if it's the winning one. >> > I guess what might be done would be to spawn a work item to carry out >> > a notify when the frequency changes. >> >> In fact, the mechanism may be relatively simple if I'm not mistaken. >> >> In the "fast switch" case, the governor may spawn a work item that >> will just execute cpufreq_get() on policy->cpu. That will notice that >> policy->cur is different from the real current frequency and will >> re-adjust. >> >> Of course, cpufreq_driver_fast_switch() will need to be modified so it >> doesn't update policy->cur then perhaps with a comment that the >> governor using it will be responsible for that. > > No no no, that's just horrible. Why would you want to keep this > notification stuff alive? If your platform can change frequency 'fast' > you don't want notifiers. I'm not totally sure about that. > > What's the point of a notification that says: "At some point in the > random past my frequency has changed, and it likely has changed again > since then, do 'something'." > > That's pointless. If you have dependent clock domains or whatever, you > simply _cannot_ be fast. > What about thermal? They don't need to get very accurate information, but they need to be updated on a regular basis. It would do if they get averages instead of momentary values (and may be better even).
Re: [PATCH v2 6/10] cpufreq: Support for fast frequency switching
On Sat, Mar 5, 2016 at 5:49 PM, Peter Zijlstra wrote: > On Sat, Mar 05, 2016 at 12:18:54AM +0100, Rafael J. Wysocki wrote: > >> >>> Even if there are platforms which may change the CPU frequency behind >> >>> cpufreq's back, breaking the transition notifiers, I'm worried about the >> >>> addition of an interface which itself breaks them. The platforms which >> >>> do change CPU frequency on their own have probably evolved to live with >> >>> or work around this behavior. As other platforms migrate to fast >> >>> frequency switching they might be surprised when things don't work as >> >>> advertised. > > There's only 43 sites of cpufreq_register_notifier in 37 files, that > should be fairly simple to audit. > >> >>> I'm not sure what the easiest way to deal with this is. I see the >> >>> transition notifiers are the srcu type, which I understand to be >> >>> blocking. Going through the tree and reworking everyone's callbacks and >> >>> changing the type to atomic is obviously not realistic. >> >> >> >> Right. > > Even if it was (and per the above it looks entirely feasible), that's > just not going to happen. We're not ever going to call random notifier > crap from this deep within the scheduler. > >> >>> How about modifying cpufreq_register_notifier to return an error if the >> >>> driver has a fast_switch callback installed and an attempt to register a >> >>> transition notifier is made? >> >> >> >> That sounds like a good idea. > > Agreed, fail the stuff hard. > > Simply make cpufreq_register_notifier a __must_check function and add > error handling to all call sites. Quite frankly, I don't see a compelling reason to do anything about the notifications at this point. The ACPI driver is the only one that will support fast switching for the time being and on practically all platforms that can use the ACPI driver the transition notifications cannot be relied on anyway for a few reasons. First, if intel_pstate or HWP is in use, they won't be coming at all. Second, anything turbo will just change frequency at will without notifying (like HWP). Finally, if they are coming, whoever receives them is notified about the frequency that is requested and not the real one, which is misleading, because (a) the request may just make the CPU go into the turbo range and then see above or (b) if the CPU is in a platform-coordinated package, its request will only be granted if it's the winning one. >> > I guess what might be done would be to spawn a work item to carry out >> > a notify when the frequency changes. >> >> In fact, the mechanism may be relatively simple if I'm not mistaken. >> >> In the "fast switch" case, the governor may spawn a work item that >> will just execute cpufreq_get() on policy->cpu. That will notice that >> policy->cur is different from the real current frequency and will >> re-adjust. >> >> Of course, cpufreq_driver_fast_switch() will need to be modified so it >> doesn't update policy->cur then perhaps with a comment that the >> governor using it will be responsible for that. > > No no no, that's just horrible. Why would you want to keep this > notification stuff alive? If your platform can change frequency 'fast' > you don't want notifiers. I'm not totally sure about that. > > What's the point of a notification that says: "At some point in the > random past my frequency has changed, and it likely has changed again > since then, do 'something'." > > That's pointless. If you have dependent clock domains or whatever, you > simply _cannot_ be fast. > What about thermal? They don't need to get very accurate information, but they need to be updated on a regular basis. It would do if they get averages instead of momentary values (and may be better even).
Re: SCSI sr driver: parallel writes to optical serialized which hurts performance (sr_mutex)
Johan de Jong wrote: > Hi Wakko, > > If I remember correctly I did see you commenting on discussions on > either the Otto Meta patch, or another that proposed to remove the > mutex entirely. I was unaware of any others. I received the last set of patches from Tim more than a year ago. I wasn't using a system with multiple drives other than my 3.3.0 box. Last year around november I gathered some more drives and put them in another system and I decided to test the patches. I sent a report back to the list back in november 2015 (maybe october) about the success. > Do you have more information on why this never resulted in a succesful > concerted effort to get a patch in the kernel tree? Do the patches > have drawbacks or have they never been submitted properly? If the > latter, we might endeavor it? No, sorry. I'm just a user. I haven't had any crashes with the patches. I'd like to see the patches go in, I'm tired of patching every new kernel. I'm running the ones from Tim on 2 machines. One machine has the dvd burners (exported as iscsi targets) and the other machine connects to it. The patches have to be on both machines for it to work (I assume). I'm able to burn at 16x to 3 burners over iscsi at the same time. I've burned many discs this way with out any issues (other than bad media). Just FYI: The computer with the burners is running 4.4.1 and the iscsi initiator is 3.12.52. -- Microsoft has beaten Volkswagen's world record. Volkswagen only created 22 million bugs.
Re: SCSI sr driver: parallel writes to optical serialized which hurts performance (sr_mutex)
Johan de Jong wrote: > Hi Wakko, > > If I remember correctly I did see you commenting on discussions on > either the Otto Meta patch, or another that proposed to remove the > mutex entirely. I was unaware of any others. I received the last set of patches from Tim more than a year ago. I wasn't using a system with multiple drives other than my 3.3.0 box. Last year around november I gathered some more drives and put them in another system and I decided to test the patches. I sent a report back to the list back in november 2015 (maybe october) about the success. > Do you have more information on why this never resulted in a succesful > concerted effort to get a patch in the kernel tree? Do the patches > have drawbacks or have they never been submitted properly? If the > latter, we might endeavor it? No, sorry. I'm just a user. I haven't had any crashes with the patches. I'd like to see the patches go in, I'm tired of patching every new kernel. I'm running the ones from Tim on 2 machines. One machine has the dvd burners (exported as iscsi targets) and the other machine connects to it. The patches have to be on both machines for it to work (I assume). I'm able to burn at 16x to 3 burners over iscsi at the same time. I've burned many discs this way with out any issues (other than bad media). Just FYI: The computer with the burners is running 4.4.1 and the iscsi initiator is 3.12.52. -- Microsoft has beaten Volkswagen's world record. Volkswagen only created 22 million bugs.
[GIT PULL] Ceph fixes for -rc7
Hi Linus, Please pull the following Ceph patch from git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus This is a final commit we missed to align the protocol compatibility with the feature bits. It decodes a few extra fields in two different messages and reports EIO when they are used (not yet supported). Thanks! sage Yan, Zheng (1): ceph: initial CEPH_FEATURE_FS_FILE_LAYOUT_V2 support fs/ceph/addr.c | 4 fs/ceph/caps.c | 27 --- fs/ceph/inode.c| 2 ++ fs/ceph/mds_client.c | 16 fs/ceph/mds_client.h | 1 + fs/ceph/super.h| 1 + include/linux/ceph/ceph_features.h | 1 + 7 files changed, 49 insertions(+), 3 deletions(-)
[GIT PULL] Ceph fixes for -rc7
Hi Linus, Please pull the following Ceph patch from git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus This is a final commit we missed to align the protocol compatibility with the feature bits. It decodes a few extra fields in two different messages and reports EIO when they are used (not yet supported). Thanks! sage Yan, Zheng (1): ceph: initial CEPH_FEATURE_FS_FILE_LAYOUT_V2 support fs/ceph/addr.c | 4 fs/ceph/caps.c | 27 --- fs/ceph/inode.c| 2 ++ fs/ceph/mds_client.c | 16 fs/ceph/mds_client.h | 1 + fs/ceph/super.h| 1 + include/linux/ceph/ceph_features.h | 1 + 7 files changed, 49 insertions(+), 3 deletions(-)
Re: [RFT PATCH 1/2] usb: dwc2: Add a 10 ms delay to dwc2_core_reset()
Hi, On Sat, Mar 5, 2016 at 12:41 PM, Michael Niewoehnerwrote: > Hi Douglas, > Hi John, > > Am 05.03.2016 um 01:33 schrieb Doug Anderson : > >> Michael, >> >> On Fri, Mar 4, 2016 at 4:09 PM, Michael Niewoehner >> wrote: > From testing and trying to make sense of the documentation, it appears > that a 10 ms delay is needed after resetting the core to make sure that > everything is stable and consistent. Let's add it. > > In my testing (on rk3288) this allows us to revert commit > 192cb07f7928 ("usb: dwc2: Fix probe problem on bcm2835"). Though I > could never reproduce the problems on my board, this might also allow us > to revert commit bd84f4ae9986 ("usb: dwc2: Add extra delay when forcing > dr_mode"). > > Signed-off-by: Douglas Anderson Tested-by: Michael Niewoehner >> >> Thanks! That's great news! >> >> I’m a bit confused since git log says bd84f4ae9986 has been merged in 62718e304aa6 but looking at drivers/usb/dwc2/core.c it seems the patch has not been applied anyways ... However, I tested you your two patches with „magically reverted“ bd84f4ae9986 (msleep 50) on rk3188. The sdcard keeps being detected and boots just fine. >>> I meant usb stick of course… too much sdcards in my head today \o/. >> >> Odd. It looks to be there for me... >> >> $ git checkout 62718e304aa6 >> HEAD is now at 62718e304aa6... Merge tag 'usb-4.5-rc6' of >> git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb >> >> $ git grep -C3 "NOTE: This is required" -- drivers/usb/dwc2/core.c >> drivers/usb/dwc2/core.c-} >> drivers/usb/dwc2/core.c- >> drivers/usb/dwc2/core.c-/* >> drivers/usb/dwc2/core.c: * NOTE: This is required for some >> rockchip soc based >> drivers/usb/dwc2/core.c- * platforms. >> drivers/usb/dwc2/core.c- */ >> drivers/usb/dwc2/core.c-msleep(50); > > I unfortunately have bad news. > After some more testing it turned out that usb does NOT work (always) on > rk3188 when reverting bd84f4ae9986. > It looks like that is dependent on which device / vendor is plugged in. > The usb stick I tested yesterday worked once but today just blinks shortly > and then stops working. > Another usb stick I tested today doesn’t blink or work at all. Maybe I should > have tested booting some more times :-( Just to clarify based on IRC conversation on #linux-rockchip: * My two patches work fine as per Michael (c0d3z3r0) and another person (mrjay). * My two patches _don't_ also allow us to revert to "50 ms" commit bd84f4ae9986 ("usb: dwc2: Add extra delay when forcing dr_mode"). This is Michael's "bad news". That means we could apply my two patches and the continue to work separately to figure out how to revert commit bd84f4ae9986 ("usb: dwc2: Add extra delay when forcing dr_mode"). Thanks! -Doug
Re: [RFT PATCH 1/2] usb: dwc2: Add a 10 ms delay to dwc2_core_reset()
Hi, On Sat, Mar 5, 2016 at 12:41 PM, Michael Niewoehner wrote: > Hi Douglas, > Hi John, > > Am 05.03.2016 um 01:33 schrieb Doug Anderson : > >> Michael, >> >> On Fri, Mar 4, 2016 at 4:09 PM, Michael Niewoehner >> wrote: > From testing and trying to make sense of the documentation, it appears > that a 10 ms delay is needed after resetting the core to make sure that > everything is stable and consistent. Let's add it. > > In my testing (on rk3288) this allows us to revert commit > 192cb07f7928 ("usb: dwc2: Fix probe problem on bcm2835"). Though I > could never reproduce the problems on my board, this might also allow us > to revert commit bd84f4ae9986 ("usb: dwc2: Add extra delay when forcing > dr_mode"). > > Signed-off-by: Douglas Anderson Tested-by: Michael Niewoehner >> >> Thanks! That's great news! >> >> I’m a bit confused since git log says bd84f4ae9986 has been merged in 62718e304aa6 but looking at drivers/usb/dwc2/core.c it seems the patch has not been applied anyways ... However, I tested you your two patches with „magically reverted“ bd84f4ae9986 (msleep 50) on rk3188. The sdcard keeps being detected and boots just fine. >>> I meant usb stick of course… too much sdcards in my head today \o/. >> >> Odd. It looks to be there for me... >> >> $ git checkout 62718e304aa6 >> HEAD is now at 62718e304aa6... Merge tag 'usb-4.5-rc6' of >> git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb >> >> $ git grep -C3 "NOTE: This is required" -- drivers/usb/dwc2/core.c >> drivers/usb/dwc2/core.c-} >> drivers/usb/dwc2/core.c- >> drivers/usb/dwc2/core.c-/* >> drivers/usb/dwc2/core.c: * NOTE: This is required for some >> rockchip soc based >> drivers/usb/dwc2/core.c- * platforms. >> drivers/usb/dwc2/core.c- */ >> drivers/usb/dwc2/core.c-msleep(50); > > I unfortunately have bad news. > After some more testing it turned out that usb does NOT work (always) on > rk3188 when reverting bd84f4ae9986. > It looks like that is dependent on which device / vendor is plugged in. > The usb stick I tested yesterday worked once but today just blinks shortly > and then stops working. > Another usb stick I tested today doesn’t blink or work at all. Maybe I should > have tested booting some more times :-( Just to clarify based on IRC conversation on #linux-rockchip: * My two patches work fine as per Michael (c0d3z3r0) and another person (mrjay). * My two patches _don't_ also allow us to revert to "50 ms" commit bd84f4ae9986 ("usb: dwc2: Add extra delay when forcing dr_mode"). This is Michael's "bad news". That means we could apply my two patches and the continue to work separately to figure out how to revert commit bd84f4ae9986 ("usb: dwc2: Add extra delay when forcing dr_mode"). Thanks! -Doug
[PATCH 2/3] crypto: af_alg - add AEAD operation type
We need to allow the user to set the authentication type. This adds a new operation that sets IPSec or TLS authentication mode. Signed-off-by: Tadeusz Struk--- crypto/af_alg.c |6 ++ include/crypto/if_alg.h |1 + include/uapi/linux/if_alg.h |4 3 files changed, 11 insertions(+) diff --git a/crypto/af_alg.c b/crypto/af_alg.c index f5e18c2..cc5f2b0 100644 --- a/crypto/af_alg.c +++ b/crypto/af_alg.c @@ -464,6 +464,12 @@ int af_alg_cmsg_send(struct msghdr *msg, struct af_alg_control *con) con->op = *(u32 *)CMSG_DATA(cmsg); break; + case ALG_SET_AEAD_TYPE: + if (cmsg->cmsg_len < CMSG_LEN(sizeof(u32))) + return -EINVAL; + con->op_type = *(u32 *)CMSG_DATA(cmsg); + break; + case ALG_SET_AEAD_ASSOCLEN: if (cmsg->cmsg_len < CMSG_LEN(sizeof(u32))) return -EINVAL; diff --git a/include/crypto/if_alg.h b/include/crypto/if_alg.h index a2bfd78..d76ea0c 100644 --- a/include/crypto/if_alg.h +++ b/include/crypto/if_alg.h @@ -45,6 +45,7 @@ struct af_alg_completion { struct af_alg_control { struct af_alg_iv *iv; int op; + int op_type; unsigned int aead_assoclen; }; diff --git a/include/uapi/linux/if_alg.h b/include/uapi/linux/if_alg.h index f2acd2f..cef00de 100644 --- a/include/uapi/linux/if_alg.h +++ b/include/uapi/linux/if_alg.h @@ -34,9 +34,13 @@ struct af_alg_iv { #define ALG_SET_OP 3 #define ALG_SET_AEAD_ASSOCLEN 4 #define ALG_SET_AEAD_AUTHSIZE 5 +#define ALG_SET_AEAD_TYPE 6 /* Operations */ #define ALG_OP_DECRYPT 0 #define ALG_OP_ENCRYPT 1 +/* AEAD operation type */ +#define ALG_AEAD_IPSEC 0 /* First encrypt then authenticate */ +#define ALG_AEAD_TLS 1 /* First authenticate then encrypt */ #endif /* _LINUX_IF_ALG_H */
[PATCH 3/3] crypto: algif_aead - modify algif aead interface to work with encauth
Updates to algif_aead to allow it to work with the new TLS authentication mode. This patch is generated on top of the algif_aead async patch: https://patchwork.kernel.org/patch/8182971/ Signed-off-by: Tadeusz Struk--- crypto/algif_aead.c | 93 +++ 1 file changed, 79 insertions(+), 14 deletions(-) diff --git a/crypto/algif_aead.c b/crypto/algif_aead.c index 47d4f71..2d53054 100644 --- a/crypto/algif_aead.c +++ b/crypto/algif_aead.c @@ -26,7 +26,7 @@ struct aead_sg_list { unsigned int cur; - struct scatterlist sg[ALG_MAX_PAGES]; + struct scatterlist sg[ALG_MAX_PAGES + 1]; }; struct aead_async_rsgl { @@ -40,6 +40,7 @@ struct aead_async_req { struct list_head list; struct kiocb *iocb; unsigned int tsgls; + bool padded; char iv[]; }; @@ -49,6 +50,7 @@ struct aead_ctx { struct list_head list; void *iv; + void *padd; struct af_alg_completion completion; @@ -58,6 +60,7 @@ struct aead_ctx { bool more; bool merge; bool enc; + bool type; size_t aead_assoclen; struct aead_request aead_req; @@ -88,7 +91,7 @@ static void aead_reset_ctx(struct aead_ctx *ctx) { struct aead_sg_list *sgl = >tsgl; - sg_init_table(sgl->sg, ALG_MAX_PAGES); + sg_init_table(sgl->sg, ALG_MAX_PAGES + 1); sgl->cur = 0; ctx->used = 0; ctx->more = 0; @@ -191,6 +194,7 @@ static int aead_sendmsg(struct socket *sock, struct msghdr *msg, size_t size) struct af_alg_control con = {}; long copied = 0; bool enc = 0; + bool type = 0; bool init = 0; int err = -EINVAL; @@ -211,6 +215,15 @@ static int aead_sendmsg(struct socket *sock, struct msghdr *msg, size_t size) return -EINVAL; } + switch (con.op_type) { + case ALG_AEAD_IPSEC: + case ALG_AEAD_TLS: + type = con.op_type; + break; + default: + return -EINVAL; + } + if (con.iv && con.iv->ivlen != ivsize) return -EINVAL; } @@ -221,6 +234,7 @@ static int aead_sendmsg(struct socket *sock, struct msghdr *msg, size_t size) if (init) { ctx->enc = enc; + ctx->type = type; if (con.iv) memcpy(ctx->iv, con.iv->iv, ivsize); @@ -399,7 +413,8 @@ static void aead_async_cb(struct crypto_async_request *_req, int err) for (i = 0; i < areq->tsgls; i++) put_page(sg_page(sg + i)); - sock_kfree_s(sk, areq->tsgl, sizeof(*areq->tsgl) * areq->tsgls); + sock_kfree_s(sk, areq->tsgl, sizeof(*areq->tsgl) * +(areq->tsgls + areq->padded)); sock_kfree_s(sk, req, reqlen); __sock_put(sk); iocb->ki_complete(iocb, err, err); @@ -417,11 +432,14 @@ static int aead_recvmsg_async(struct socket *sock, struct msghdr *msg, struct aead_sg_list *sgl = >tsgl; struct aead_async_rsgl *last_rsgl = NULL, *rsgl; unsigned int as = crypto_aead_authsize(tfm); + unsigned int bs = crypto_aead_blocksize(tfm); unsigned int i, reqlen = GET_REQ_SIZE(tfm); int err = -ENOMEM; unsigned long used; size_t outlen; size_t usedpages = 0; + size_t paddlen = 0; + char *paddbuf; lock_sock(sk); if (ctx->more) { @@ -451,17 +469,28 @@ static int aead_recvmsg_async(struct socket *sock, struct msghdr *msg, aead_async_cb, sk); used -= ctx->aead_assoclen + (ctx->enc ? as : 0); + if (ctx->enc && ctx->type == ALG_AEAD_TLS) + paddlen = bs - ((used + as) % bs); + + outlen += paddlen; + areq->padded = !!paddlen; + /* take over all tx sgls from ctx */ - areq->tsgl = sock_kmalloc(sk, sizeof(*areq->tsgl) * sgl->cur, - GFP_KERNEL); + areq->tsgl = sock_kmalloc(sk, sizeof(*areq->tsgl) * + (sgl->cur + !!paddlen), GFP_KERNEL); if (unlikely(!areq->tsgl)) goto free; - sg_init_table(areq->tsgl, sgl->cur); + sg_init_table(areq->tsgl, sgl->cur + !!paddlen); for (i = 0; i < sgl->cur; i++) sg_set_page(>tsgl[i], sg_page(>sg[i]), sgl->sg[i].length, sgl->sg[i].offset); + if (paddlen) { + paddbuf = areq->iv + crypto_aead_ivsize(tfm); + sg_set_buf(>tsgl[sgl->cur], paddbuf, paddlen); + } + areq->tsgls = sgl->cur; /* create rx sgls */ @@ -530,7 +559,8 @@ free: sock_kfree_s(sk, rsgl, sizeof(*rsgl)); } if (areq->tsgl) - sock_kfree_s(sk, areq->tsgl,
[PATCH 1/3] crypto: authenc - add TLS type encryption
This patch adds a new authentication mode for TLS type encryption. During encrypt it generates auth data + padding and then the plaintext || authdata || padding is encrypted. This requires the user to provide extra space for the cipher text. The required space can be calculated as outlen = assoc len + plaintext len + hash size + cipher block size On decrypt first the whole buffer is decrypted, and then verification of the authdata and padding is performed. Signed-off-by: Tadeusz Struk--- crypto/Makefile |2 crypto/encauth.c | 510 ++ 2 files changed, 511 insertions(+), 1 deletion(-) create mode 100644 crypto/encauth.c diff --git a/crypto/Makefile b/crypto/Makefile index 4f4ef7e..a372335 100644 --- a/crypto/Makefile +++ b/crypto/Makefile @@ -103,7 +103,7 @@ obj-$(CONFIG_CRYPTO_MICHAEL_MIC) += michael_mic.o obj-$(CONFIG_CRYPTO_CRC32C) += crc32c_generic.o obj-$(CONFIG_CRYPTO_CRC32) += crc32_generic.o obj-$(CONFIG_CRYPTO_CRCT10DIF) += crct10dif_common.o crct10dif_generic.o -obj-$(CONFIG_CRYPTO_AUTHENC) += authenc.o authencesn.o +obj-$(CONFIG_CRYPTO_AUTHENC) += authenc.o authencesn.o encauth.o obj-$(CONFIG_CRYPTO_LZO) += lzo.o obj-$(CONFIG_CRYPTO_LZ4) += lz4.o obj-$(CONFIG_CRYPTO_LZ4HC) += lz4hc.o diff --git a/crypto/encauth.c b/crypto/encauth.c new file mode 100644 index 000..3c0ee1a --- /dev/null +++ b/crypto/encauth.c @@ -0,0 +1,510 @@ +/* + * Encauth: Simple AEAD wrapper for TLS. + * Derived from authenc.c + * + * Copyright (c) 2016 Intel Corp. + * + * Author: Tadeusz Struk + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License as published by the Free + * Software Foundation; either version 2 of the License, or (at your option) + * any later version. + * + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +struct encauth_instance_ctx { + struct crypto_ahash_spawn auth; + struct crypto_skcipher_spawn enc; + unsigned int reqoff; +}; + +struct crypto_encauth_ctx { + struct crypto_ahash *auth; + struct crypto_ablkcipher *enc; + struct crypto_blkcipher *null; +}; + +struct encauth_request_ctx { + struct scatterlist src[2]; + struct scatterlist dst[2]; + int padd_err; + u8 paddlen; + char tail[]; +}; + +static void encauth_request_complete(struct aead_request *req, int err) +{ + if (err != -EINPROGRESS) + aead_request_complete(req, err); +} + +static int crypto_encauth_setkey(struct crypto_aead *encauth, const u8 *key, +unsigned int keylen) +{ + struct crypto_encauth_ctx *ctx = crypto_aead_ctx(encauth); + struct crypto_ahash *auth = ctx->auth; + struct crypto_ablkcipher *enc = ctx->enc; + struct crypto_authenc_keys keys; + int err = -EINVAL; + + if (crypto_authenc_extractkeys(, key, keylen) != 0) + goto badkey; + + crypto_ahash_clear_flags(auth, CRYPTO_TFM_REQ_MASK); + crypto_ahash_set_flags(auth, crypto_aead_get_flags(encauth) & + CRYPTO_TFM_REQ_MASK); + err = crypto_ahash_setkey(auth, keys.authkey, keys.authkeylen); + crypto_aead_set_flags(encauth, crypto_ahash_get_flags(auth) & + CRYPTO_TFM_RES_MASK); + + if (err) + goto out; + + crypto_ablkcipher_clear_flags(enc, CRYPTO_TFM_REQ_MASK); + crypto_ablkcipher_set_flags(enc, crypto_aead_get_flags(encauth) & + CRYPTO_TFM_REQ_MASK); + err = crypto_ablkcipher_setkey(enc, keys.enckey, keys.enckeylen); + crypto_aead_set_flags(encauth, crypto_ablkcipher_get_flags(enc) & + CRYPTO_TFM_RES_MASK); + +out: + return err; + +badkey: + crypto_aead_set_flags(encauth, CRYPTO_TFM_RES_BAD_KEY_LEN); + goto out; +} + +static int crypto_encauth_copy_assoc(struct aead_request *req) +{ + struct crypto_aead *encauth = crypto_aead_reqtfm(req); + struct crypto_encauth_ctx *ctx = crypto_aead_ctx(encauth); + struct blkcipher_desc desc = { + .tfm = ctx->null, + }; + + return crypto_blkcipher_encrypt(, req->dst, req->src, + req->assoclen); +} + +static void encauth_encrypt_done(struct crypto_async_request *req, int err) +{ + struct aead_request *areq = req->data; + + encauth_request_complete(areq, err); +} + +static int crypto_encauth_encrypt(struct aead_request *req) +{ + struct crypto_aead *tfm = crypto_aead_reqtfm(req); + struct aead_instance *inst = aead_alg_instance(tfm); + struct crypto_encauth_ctx *ctx = crypto_aead_ctx(tfm); + struct encauth_instance_ctx *ictx = aead_instance_ctx(inst); + struct
[PATCH 2/3] crypto: af_alg - add AEAD operation type
We need to allow the user to set the authentication type. This adds a new operation that sets IPSec or TLS authentication mode. Signed-off-by: Tadeusz Struk --- crypto/af_alg.c |6 ++ include/crypto/if_alg.h |1 + include/uapi/linux/if_alg.h |4 3 files changed, 11 insertions(+) diff --git a/crypto/af_alg.c b/crypto/af_alg.c index f5e18c2..cc5f2b0 100644 --- a/crypto/af_alg.c +++ b/crypto/af_alg.c @@ -464,6 +464,12 @@ int af_alg_cmsg_send(struct msghdr *msg, struct af_alg_control *con) con->op = *(u32 *)CMSG_DATA(cmsg); break; + case ALG_SET_AEAD_TYPE: + if (cmsg->cmsg_len < CMSG_LEN(sizeof(u32))) + return -EINVAL; + con->op_type = *(u32 *)CMSG_DATA(cmsg); + break; + case ALG_SET_AEAD_ASSOCLEN: if (cmsg->cmsg_len < CMSG_LEN(sizeof(u32))) return -EINVAL; diff --git a/include/crypto/if_alg.h b/include/crypto/if_alg.h index a2bfd78..d76ea0c 100644 --- a/include/crypto/if_alg.h +++ b/include/crypto/if_alg.h @@ -45,6 +45,7 @@ struct af_alg_completion { struct af_alg_control { struct af_alg_iv *iv; int op; + int op_type; unsigned int aead_assoclen; }; diff --git a/include/uapi/linux/if_alg.h b/include/uapi/linux/if_alg.h index f2acd2f..cef00de 100644 --- a/include/uapi/linux/if_alg.h +++ b/include/uapi/linux/if_alg.h @@ -34,9 +34,13 @@ struct af_alg_iv { #define ALG_SET_OP 3 #define ALG_SET_AEAD_ASSOCLEN 4 #define ALG_SET_AEAD_AUTHSIZE 5 +#define ALG_SET_AEAD_TYPE 6 /* Operations */ #define ALG_OP_DECRYPT 0 #define ALG_OP_ENCRYPT 1 +/* AEAD operation type */ +#define ALG_AEAD_IPSEC 0 /* First encrypt then authenticate */ +#define ALG_AEAD_TLS 1 /* First authenticate then encrypt */ #endif /* _LINUX_IF_ALG_H */
[PATCH 3/3] crypto: algif_aead - modify algif aead interface to work with encauth
Updates to algif_aead to allow it to work with the new TLS authentication mode. This patch is generated on top of the algif_aead async patch: https://patchwork.kernel.org/patch/8182971/ Signed-off-by: Tadeusz Struk --- crypto/algif_aead.c | 93 +++ 1 file changed, 79 insertions(+), 14 deletions(-) diff --git a/crypto/algif_aead.c b/crypto/algif_aead.c index 47d4f71..2d53054 100644 --- a/crypto/algif_aead.c +++ b/crypto/algif_aead.c @@ -26,7 +26,7 @@ struct aead_sg_list { unsigned int cur; - struct scatterlist sg[ALG_MAX_PAGES]; + struct scatterlist sg[ALG_MAX_PAGES + 1]; }; struct aead_async_rsgl { @@ -40,6 +40,7 @@ struct aead_async_req { struct list_head list; struct kiocb *iocb; unsigned int tsgls; + bool padded; char iv[]; }; @@ -49,6 +50,7 @@ struct aead_ctx { struct list_head list; void *iv; + void *padd; struct af_alg_completion completion; @@ -58,6 +60,7 @@ struct aead_ctx { bool more; bool merge; bool enc; + bool type; size_t aead_assoclen; struct aead_request aead_req; @@ -88,7 +91,7 @@ static void aead_reset_ctx(struct aead_ctx *ctx) { struct aead_sg_list *sgl = >tsgl; - sg_init_table(sgl->sg, ALG_MAX_PAGES); + sg_init_table(sgl->sg, ALG_MAX_PAGES + 1); sgl->cur = 0; ctx->used = 0; ctx->more = 0; @@ -191,6 +194,7 @@ static int aead_sendmsg(struct socket *sock, struct msghdr *msg, size_t size) struct af_alg_control con = {}; long copied = 0; bool enc = 0; + bool type = 0; bool init = 0; int err = -EINVAL; @@ -211,6 +215,15 @@ static int aead_sendmsg(struct socket *sock, struct msghdr *msg, size_t size) return -EINVAL; } + switch (con.op_type) { + case ALG_AEAD_IPSEC: + case ALG_AEAD_TLS: + type = con.op_type; + break; + default: + return -EINVAL; + } + if (con.iv && con.iv->ivlen != ivsize) return -EINVAL; } @@ -221,6 +234,7 @@ static int aead_sendmsg(struct socket *sock, struct msghdr *msg, size_t size) if (init) { ctx->enc = enc; + ctx->type = type; if (con.iv) memcpy(ctx->iv, con.iv->iv, ivsize); @@ -399,7 +413,8 @@ static void aead_async_cb(struct crypto_async_request *_req, int err) for (i = 0; i < areq->tsgls; i++) put_page(sg_page(sg + i)); - sock_kfree_s(sk, areq->tsgl, sizeof(*areq->tsgl) * areq->tsgls); + sock_kfree_s(sk, areq->tsgl, sizeof(*areq->tsgl) * +(areq->tsgls + areq->padded)); sock_kfree_s(sk, req, reqlen); __sock_put(sk); iocb->ki_complete(iocb, err, err); @@ -417,11 +432,14 @@ static int aead_recvmsg_async(struct socket *sock, struct msghdr *msg, struct aead_sg_list *sgl = >tsgl; struct aead_async_rsgl *last_rsgl = NULL, *rsgl; unsigned int as = crypto_aead_authsize(tfm); + unsigned int bs = crypto_aead_blocksize(tfm); unsigned int i, reqlen = GET_REQ_SIZE(tfm); int err = -ENOMEM; unsigned long used; size_t outlen; size_t usedpages = 0; + size_t paddlen = 0; + char *paddbuf; lock_sock(sk); if (ctx->more) { @@ -451,17 +469,28 @@ static int aead_recvmsg_async(struct socket *sock, struct msghdr *msg, aead_async_cb, sk); used -= ctx->aead_assoclen + (ctx->enc ? as : 0); + if (ctx->enc && ctx->type == ALG_AEAD_TLS) + paddlen = bs - ((used + as) % bs); + + outlen += paddlen; + areq->padded = !!paddlen; + /* take over all tx sgls from ctx */ - areq->tsgl = sock_kmalloc(sk, sizeof(*areq->tsgl) * sgl->cur, - GFP_KERNEL); + areq->tsgl = sock_kmalloc(sk, sizeof(*areq->tsgl) * + (sgl->cur + !!paddlen), GFP_KERNEL); if (unlikely(!areq->tsgl)) goto free; - sg_init_table(areq->tsgl, sgl->cur); + sg_init_table(areq->tsgl, sgl->cur + !!paddlen); for (i = 0; i < sgl->cur; i++) sg_set_page(>tsgl[i], sg_page(>sg[i]), sgl->sg[i].length, sgl->sg[i].offset); + if (paddlen) { + paddbuf = areq->iv + crypto_aead_ivsize(tfm); + sg_set_buf(>tsgl[sgl->cur], paddbuf, paddlen); + } + areq->tsgls = sgl->cur; /* create rx sgls */ @@ -530,7 +559,8 @@ free: sock_kfree_s(sk, rsgl, sizeof(*rsgl)); } if (areq->tsgl) - sock_kfree_s(sk, areq->tsgl, sizeof(*areq->tsgl) *
[PATCH 1/3] crypto: authenc - add TLS type encryption
This patch adds a new authentication mode for TLS type encryption. During encrypt it generates auth data + padding and then the plaintext || authdata || padding is encrypted. This requires the user to provide extra space for the cipher text. The required space can be calculated as outlen = assoc len + plaintext len + hash size + cipher block size On decrypt first the whole buffer is decrypted, and then verification of the authdata and padding is performed. Signed-off-by: Tadeusz Struk --- crypto/Makefile |2 crypto/encauth.c | 510 ++ 2 files changed, 511 insertions(+), 1 deletion(-) create mode 100644 crypto/encauth.c diff --git a/crypto/Makefile b/crypto/Makefile index 4f4ef7e..a372335 100644 --- a/crypto/Makefile +++ b/crypto/Makefile @@ -103,7 +103,7 @@ obj-$(CONFIG_CRYPTO_MICHAEL_MIC) += michael_mic.o obj-$(CONFIG_CRYPTO_CRC32C) += crc32c_generic.o obj-$(CONFIG_CRYPTO_CRC32) += crc32_generic.o obj-$(CONFIG_CRYPTO_CRCT10DIF) += crct10dif_common.o crct10dif_generic.o -obj-$(CONFIG_CRYPTO_AUTHENC) += authenc.o authencesn.o +obj-$(CONFIG_CRYPTO_AUTHENC) += authenc.o authencesn.o encauth.o obj-$(CONFIG_CRYPTO_LZO) += lzo.o obj-$(CONFIG_CRYPTO_LZ4) += lz4.o obj-$(CONFIG_CRYPTO_LZ4HC) += lz4hc.o diff --git a/crypto/encauth.c b/crypto/encauth.c new file mode 100644 index 000..3c0ee1a --- /dev/null +++ b/crypto/encauth.c @@ -0,0 +1,510 @@ +/* + * Encauth: Simple AEAD wrapper for TLS. + * Derived from authenc.c + * + * Copyright (c) 2016 Intel Corp. + * + * Author: Tadeusz Struk + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License as published by the Free + * Software Foundation; either version 2 of the License, or (at your option) + * any later version. + * + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +struct encauth_instance_ctx { + struct crypto_ahash_spawn auth; + struct crypto_skcipher_spawn enc; + unsigned int reqoff; +}; + +struct crypto_encauth_ctx { + struct crypto_ahash *auth; + struct crypto_ablkcipher *enc; + struct crypto_blkcipher *null; +}; + +struct encauth_request_ctx { + struct scatterlist src[2]; + struct scatterlist dst[2]; + int padd_err; + u8 paddlen; + char tail[]; +}; + +static void encauth_request_complete(struct aead_request *req, int err) +{ + if (err != -EINPROGRESS) + aead_request_complete(req, err); +} + +static int crypto_encauth_setkey(struct crypto_aead *encauth, const u8 *key, +unsigned int keylen) +{ + struct crypto_encauth_ctx *ctx = crypto_aead_ctx(encauth); + struct crypto_ahash *auth = ctx->auth; + struct crypto_ablkcipher *enc = ctx->enc; + struct crypto_authenc_keys keys; + int err = -EINVAL; + + if (crypto_authenc_extractkeys(, key, keylen) != 0) + goto badkey; + + crypto_ahash_clear_flags(auth, CRYPTO_TFM_REQ_MASK); + crypto_ahash_set_flags(auth, crypto_aead_get_flags(encauth) & + CRYPTO_TFM_REQ_MASK); + err = crypto_ahash_setkey(auth, keys.authkey, keys.authkeylen); + crypto_aead_set_flags(encauth, crypto_ahash_get_flags(auth) & + CRYPTO_TFM_RES_MASK); + + if (err) + goto out; + + crypto_ablkcipher_clear_flags(enc, CRYPTO_TFM_REQ_MASK); + crypto_ablkcipher_set_flags(enc, crypto_aead_get_flags(encauth) & + CRYPTO_TFM_REQ_MASK); + err = crypto_ablkcipher_setkey(enc, keys.enckey, keys.enckeylen); + crypto_aead_set_flags(encauth, crypto_ablkcipher_get_flags(enc) & + CRYPTO_TFM_RES_MASK); + +out: + return err; + +badkey: + crypto_aead_set_flags(encauth, CRYPTO_TFM_RES_BAD_KEY_LEN); + goto out; +} + +static int crypto_encauth_copy_assoc(struct aead_request *req) +{ + struct crypto_aead *encauth = crypto_aead_reqtfm(req); + struct crypto_encauth_ctx *ctx = crypto_aead_ctx(encauth); + struct blkcipher_desc desc = { + .tfm = ctx->null, + }; + + return crypto_blkcipher_encrypt(, req->dst, req->src, + req->assoclen); +} + +static void encauth_encrypt_done(struct crypto_async_request *req, int err) +{ + struct aead_request *areq = req->data; + + encauth_request_complete(areq, err); +} + +static int crypto_encauth_encrypt(struct aead_request *req) +{ + struct crypto_aead *tfm = crypto_aead_reqtfm(req); + struct aead_instance *inst = aead_alg_instance(tfm); + struct crypto_encauth_ctx *ctx = crypto_aead_ctx(tfm); + struct encauth_instance_ctx *ictx = aead_instance_ctx(inst); + struct encauth_request_ctx *areq_ctx = aead_request_ctx(req);
[PATCH 0/3] crypto: af_alg - add TLS type encryption
Hi, The following series adds TLS type authentication. To do this a new template, encauth, is introduced. It is derived from the existing authenc template and modified to work in "first auth then encrypt" mode. The algif interface is also changed to work with the new authentication type. --- Tadeusz Struk (3): crypto: authenc - add TLS type encryption crypto: af_alg - add AEAD operation type crypto: algif_aead - modify algif aead interface to work with encauth crypto/Makefile |2 crypto/af_alg.c |6 + crypto/algif_aead.c | 93 +++- crypto/encauth.c| 510 +++ include/crypto/if_alg.h |1 include/uapi/linux/if_alg.h |4 6 files changed, 601 insertions(+), 15 deletions(-) create mode 100644 crypto/encauth.c --
[PATCH 0/3] crypto: af_alg - add TLS type encryption
Hi, The following series adds TLS type authentication. To do this a new template, encauth, is introduced. It is derived from the existing authenc template and modified to work in "first auth then encrypt" mode. The algif interface is also changed to work with the new authentication type. --- Tadeusz Struk (3): crypto: authenc - add TLS type encryption crypto: af_alg - add AEAD operation type crypto: algif_aead - modify algif aead interface to work with encauth crypto/Makefile |2 crypto/af_alg.c |6 + crypto/algif_aead.c | 93 +++- crypto/encauth.c| 510 +++ include/crypto/if_alg.h |1 include/uapi/linux/if_alg.h |4 6 files changed, 601 insertions(+), 15 deletions(-) create mode 100644 crypto/encauth.c --
[PATCH] mailbox: sti: fix check of error code returned by devm_ioremap_resource()
The change fixes potential oops while accessing iomem on invalid address, if devm_ioremap_resource() fails due to some reason. The devm_ioremap_resource() function returns ERR_PTR() and never returns NULL, which makes useless a following check for NULL. Signed-off-by: Vladimir ZapolskiyFixes: 9ef4546cbd7e ("mailbox: Add support for ST's Mailbox IP") --- drivers/mailbox/mailbox-sti.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/mailbox/mailbox-sti.c b/drivers/mailbox/mailbox-sti.c index 2394cfe..a334db5 100644 --- a/drivers/mailbox/mailbox-sti.c +++ b/drivers/mailbox/mailbox-sti.c @@ -430,8 +430,8 @@ static int sti_mbox_probe(struct platform_device *pdev) res = platform_get_resource(pdev, IORESOURCE_MEM, 0); mdev->base = devm_ioremap_resource(>dev, res); - if (!mdev->base) - return -ENOMEM; + if (IS_ERR(mdev->base)) + return PTR_ERR(mdev->base); ret = of_property_read_string(np, "mbox-name", >name); if (ret) -- 2.1.4
[PATCH] mailbox: sti: fix check of error code returned by devm_ioremap_resource()
The change fixes potential oops while accessing iomem on invalid address, if devm_ioremap_resource() fails due to some reason. The devm_ioremap_resource() function returns ERR_PTR() and never returns NULL, which makes useless a following check for NULL. Signed-off-by: Vladimir Zapolskiy Fixes: 9ef4546cbd7e ("mailbox: Add support for ST's Mailbox IP") --- drivers/mailbox/mailbox-sti.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/mailbox/mailbox-sti.c b/drivers/mailbox/mailbox-sti.c index 2394cfe..a334db5 100644 --- a/drivers/mailbox/mailbox-sti.c +++ b/drivers/mailbox/mailbox-sti.c @@ -430,8 +430,8 @@ static int sti_mbox_probe(struct platform_device *pdev) res = platform_get_resource(pdev, IORESOURCE_MEM, 0); mdev->base = devm_ioremap_resource(>dev, res); - if (!mdev->base) - return -ENOMEM; + if (IS_ERR(mdev->base)) + return PTR_ERR(mdev->base); ret = of_property_read_string(np, "mbox-name", >name); if (ret) -- 2.1.4
vgacon.c:undefined reference to `screen_info'
Hi Chen, It's probably a bug fix that unveils the link errors. tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master head: 67944024c1cdd897e49a09b0d6af3ea38d1388ca commit: f69405ce6c0fc9f4a039011007371b31f80b470d openrisc: include: asm: Kbuild: add default "vga.h" date: 2 years, 4 months ago config: openrisc-alldefconfig (attached as .config) reproduce: wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross git checkout f69405ce6c0fc9f4a039011007371b31f80b470d # save the attached .config to linux build tree make.cross ARCH=openrisc All errors (new ones prefixed by >>): drivers/built-in.o: In function `vgacon_save_screen': >> vgacon.c:(.text+0x20e0): undefined reference to `screen_info' vgacon.c:(.text+0x20e8): undefined reference to `screen_info' drivers/built-in.o: In function `vgacon_init': vgacon.c:(.text+0x284c): undefined reference to `screen_info' vgacon.c:(.text+0x2850): undefined reference to `screen_info' drivers/built-in.o: In function `vgacon_startup': vgacon.c:(.text+0x28d8): undefined reference to `screen_info' drivers/built-in.o:vgacon.c:(.text+0x28f0): more undefined references to `screen_info' follow --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: Binary data
vgacon.c:undefined reference to `screen_info'
Hi Chen, It's probably a bug fix that unveils the link errors. tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master head: 67944024c1cdd897e49a09b0d6af3ea38d1388ca commit: f69405ce6c0fc9f4a039011007371b31f80b470d openrisc: include: asm: Kbuild: add default "vga.h" date: 2 years, 4 months ago config: openrisc-alldefconfig (attached as .config) reproduce: wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross git checkout f69405ce6c0fc9f4a039011007371b31f80b470d # save the attached .config to linux build tree make.cross ARCH=openrisc All errors (new ones prefixed by >>): drivers/built-in.o: In function `vgacon_save_screen': >> vgacon.c:(.text+0x20e0): undefined reference to `screen_info' vgacon.c:(.text+0x20e8): undefined reference to `screen_info' drivers/built-in.o: In function `vgacon_init': vgacon.c:(.text+0x284c): undefined reference to `screen_info' vgacon.c:(.text+0x2850): undefined reference to `screen_info' drivers/built-in.o: In function `vgacon_startup': vgacon.c:(.text+0x28d8): undefined reference to `screen_info' drivers/built-in.o:vgacon.c:(.text+0x28f0): more undefined references to `screen_info' follow --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: Binary data
drivers/mfd/syscon.c:89:2: error: implicit declaration of function 'iounmap'
Hi Rob, FYI, the error/warning still remains. tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master head: 67944024c1cdd897e49a09b0d6af3ea38d1388ca commit: 0166dc11be911213e0b1b764488c671be4c48cf3 of: make CONFIG_OF user selectable date: 9 months ago config: um-allmodconfig (attached as .config) reproduce: git checkout 0166dc11be911213e0b1b764488c671be4c48cf3 # save the attached .config to linux build tree make ARCH=um All errors (new ones prefixed by >>): drivers/mfd/syscon.c: In function 'of_syscon_register': >> drivers/mfd/syscon.c:89:2: error: implicit declaration of function 'iounmap' >> [-Werror=implicit-function-declaration] iounmap(base); ^ cc1: some warnings being treated as errors vim +/iounmap +89 drivers/mfd/syscon.c bdb0066d Pankaj Dubey 2014-09-30 73if (IS_ERR(regmap)) { bdb0066d Pankaj Dubey 2014-09-30 74pr_err("regmap init failed\n"); bdb0066d Pankaj Dubey 2014-09-30 75ret = PTR_ERR(regmap); bdb0066d Pankaj Dubey 2014-09-30 76goto err_regmap; bdb0066d Pankaj Dubey 2014-09-30 77} bdb0066d Pankaj Dubey 2014-09-30 78 bdb0066d Pankaj Dubey 2014-09-30 79syscon->regmap = regmap; bdb0066d Pankaj Dubey 2014-09-30 80syscon->np = np; bdb0066d Pankaj Dubey 2014-09-30 81 bdb0066d Pankaj Dubey 2014-09-30 82spin_lock(_list_slock); bdb0066d Pankaj Dubey 2014-09-30 83list_add_tail(>list, _list); bdb0066d Pankaj Dubey 2014-09-30 84spin_unlock(_list_slock); 87d68730 Dong Aisheng 2012-09-05 85 bdb0066d Pankaj Dubey 2014-09-30 86return syscon; bdb0066d Pankaj Dubey 2014-09-30 87 bdb0066d Pankaj Dubey 2014-09-30 88 err_regmap: bdb0066d Pankaj Dubey 2014-09-30 @89iounmap(base); bdb0066d Pankaj Dubey 2014-09-30 90 err_map: bdb0066d Pankaj Dubey 2014-09-30 91kfree(syscon); bdb0066d Pankaj Dubey 2014-09-30 92return ERR_PTR(ret); 87d68730 Dong Aisheng 2012-09-05 93 } 87d68730 Dong Aisheng 2012-09-05 94 87d68730 Dong Aisheng 2012-09-05 95 struct regmap *syscon_node_to_regmap(struct device_node *np) 87d68730 Dong Aisheng 2012-09-05 96 { bdb0066d Pankaj Dubey 2014-09-30 97struct syscon *entry, *syscon = NULL; :: The code at line 89 was first introduced by commit :: bdb0066df96e74a4002125467ebe459feff1ebef mfd: syscon: Decouple syscon interface from platform devices :: TO: Pankaj Dubey:: CC: Lee Jones --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: Binary data
drivers/mfd/syscon.c:89:2: error: implicit declaration of function 'iounmap'
Hi Rob, FYI, the error/warning still remains. tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master head: 67944024c1cdd897e49a09b0d6af3ea38d1388ca commit: 0166dc11be911213e0b1b764488c671be4c48cf3 of: make CONFIG_OF user selectable date: 9 months ago config: um-allmodconfig (attached as .config) reproduce: git checkout 0166dc11be911213e0b1b764488c671be4c48cf3 # save the attached .config to linux build tree make ARCH=um All errors (new ones prefixed by >>): drivers/mfd/syscon.c: In function 'of_syscon_register': >> drivers/mfd/syscon.c:89:2: error: implicit declaration of function 'iounmap' >> [-Werror=implicit-function-declaration] iounmap(base); ^ cc1: some warnings being treated as errors vim +/iounmap +89 drivers/mfd/syscon.c bdb0066d Pankaj Dubey 2014-09-30 73if (IS_ERR(regmap)) { bdb0066d Pankaj Dubey 2014-09-30 74pr_err("regmap init failed\n"); bdb0066d Pankaj Dubey 2014-09-30 75ret = PTR_ERR(regmap); bdb0066d Pankaj Dubey 2014-09-30 76goto err_regmap; bdb0066d Pankaj Dubey 2014-09-30 77} bdb0066d Pankaj Dubey 2014-09-30 78 bdb0066d Pankaj Dubey 2014-09-30 79syscon->regmap = regmap; bdb0066d Pankaj Dubey 2014-09-30 80syscon->np = np; bdb0066d Pankaj Dubey 2014-09-30 81 bdb0066d Pankaj Dubey 2014-09-30 82spin_lock(_list_slock); bdb0066d Pankaj Dubey 2014-09-30 83list_add_tail(>list, _list); bdb0066d Pankaj Dubey 2014-09-30 84spin_unlock(_list_slock); 87d68730 Dong Aisheng 2012-09-05 85 bdb0066d Pankaj Dubey 2014-09-30 86return syscon; bdb0066d Pankaj Dubey 2014-09-30 87 bdb0066d Pankaj Dubey 2014-09-30 88 err_regmap: bdb0066d Pankaj Dubey 2014-09-30 @89iounmap(base); bdb0066d Pankaj Dubey 2014-09-30 90 err_map: bdb0066d Pankaj Dubey 2014-09-30 91kfree(syscon); bdb0066d Pankaj Dubey 2014-09-30 92return ERR_PTR(ret); 87d68730 Dong Aisheng 2012-09-05 93 } 87d68730 Dong Aisheng 2012-09-05 94 87d68730 Dong Aisheng 2012-09-05 95 struct regmap *syscon_node_to_regmap(struct device_node *np) 87d68730 Dong Aisheng 2012-09-05 96 { bdb0066d Pankaj Dubey 2014-09-30 97struct syscon *entry, *syscon = NULL; :: The code at line 89 was first introduced by commit :: bdb0066df96e74a4002125467ebe459feff1ebef mfd: syscon: Decouple syscon interface from platform devices :: TO: Pankaj Dubey :: CC: Lee Jones --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: Binary data
Re: THP-enabled filesystem vs. FALLOC_FL_PUNCH_HOLE
On Sun, Mar 06, 2016 at 09:38:11AM +1100, Dave Chinner wrote: > On Sat, Mar 05, 2016 at 02:24:12AM +0300, Kirill A. Shutemov wrote: > > On Sat, Mar 05, 2016 at 10:05:48AM +1100, Dave Chinner wrote: > > > On Fri, Mar 04, 2016 at 11:38:47AM -0800, Hugh Dickins wrote: > > > > On Fri, 4 Mar 2016, Dave Hansen wrote: > > > > > On 03/04/2016 03:26 AM, Kirill A. Shutemov wrote: > > > > > > On Thu, Mar 03, 2016 at 07:51:50PM +0300, Kirill A. Shutemov wrote: > > > > > >> Truncate and punch hole that only cover part of THP range is > > > > > >> implemented > > > > > >> by zero out this part of THP. > > > > > >> > > > > > >> This have visible effect on fallocate(FALLOC_FL_PUNCH_HOLE) > > > > > >> behaviour. > > > > > >> As we don't really create hole in this case, lseek(SEEK_HOLE) may > > > > > >> have > > > > > >> inconsistent results depending what pages happened to be allocated. > > > > > >> Not sure if it should be considered ABI break or not. > > > > > > > > > > > > Looks like this shouldn't be a problem. man 2 fallocate: > > > > > > > > > > > > Within the specified range, partial filesystem blocks are > > > > > > zeroed, > > > > > > and whole filesystem blocks are removed from the file. After a > > > > > > successful call, subsequent reads from this range will return > > > > > > zeroes. > > > > > > > > > > > > It means we effectively have 2M filesystem block size. > > > > > > > > > > The question is still whether this will case problems for apps. > > > > > > > > > > Isn't 2MB a quote unusual block size? Wouldn't some files on a tmpfs > > > > > filesystem act like they have a 2M blocksize and others like they have > > > > > 4k? Would that confuse apps? > > > > > > > > At risk of addressing the tip of an iceberg, before diving down to > > > > scope out the rest of the iceberg... > > > > > > > > > > (Though in the case of my huge tmpfs, it's the reverse: the small hole > > > > punch splits the hugepage; but it's natural that Kirill's way would try > > > > to hold on to its compound pages for longer than I do, and that's fine > > > > so long as it's all consistent.) > > > > > > > Ah, but suppose someone holepunches out most of each 2M page: they would > > > > expect the memcg not to be charged for those holes (just as when they > > > > munmap most of an anonymous THP) - that does suggest splitting is > > > > needed. > > > > > > I think filesystems will expect splitting to happen. They call > > > truncate_pagecache_range() on the region that the hole is being > > > punched out of, and they expect page cache pages over this range to > > > be unmapped, invalidated and then removed from the mapping tree as a > > > result. Also, most filesystems think the page cache only contains > > > PAGE_CACHE_SIZE mappings, so they are completely unaware of the > > > limitations THP might have when it comes to invalidation. > > > > > > IOWs, if this range is not aligned to huge page boundaries, then it > > > implies the huge page is either split into PAGE_SIZE mappings and > > > then the range is invalidated as expected, or it is completely > > > invalidated and then refaulted on future accesses which determine if > > > THP or normal pages are used for the page being faulted > > > > The filesystem in question is tmpfs and complete invalidation is not > > always an option. > > Then your two options are: splitting the page and rerunning the hole > punch, or simply zeroing the sections of the THP rather than trying > to punch out the backing store. The second option is implemented at the moment as splitting can fail. > > For other filesystems it also can be unavailable > > immediately if the page is dirty (the dirty flag is tracked on per-THP > > basis at the moment). > > Filesystems with persistent storage flush the range being punched > first to ensure that partial blocks are correctly written before we > start freeing the backing store. This is needed on XFS to ensure > hole punch plays nicely with delayed allocation and other extent > based operations. Hence we know that we have clean pages over the > hole we are about to punch and so there is no reason the > invalidation should *ever* fail. Okay. It means we have other option to consider on THP-enabling for a filesystem with persistent storage. > tmpfs is a special snowflake when it comes to these fallocate based > filesystem layout manipulation functions - it does not have > persistent storage, so you have to do things very differently to > ensure that data is not lost. > > > Would it be acceptable for fallocate(FALLOC_FL_PUNCH_HOLE) to return > > -EBUSY (or other errno on your choice), if we cannot split the page > > right away? > > Which means THP are not transparent any more. What does an > application do when it gets an EBUSY, anyway? I guess it's reasonable to expect from an application to handle EOPNOTSUPP as FALLOC_FL_PUNCH_HOLE is not supported by some filesystems. Although, non-consistent result from the same fd
Re: THP-enabled filesystem vs. FALLOC_FL_PUNCH_HOLE
On Sun, Mar 06, 2016 at 09:38:11AM +1100, Dave Chinner wrote: > On Sat, Mar 05, 2016 at 02:24:12AM +0300, Kirill A. Shutemov wrote: > > On Sat, Mar 05, 2016 at 10:05:48AM +1100, Dave Chinner wrote: > > > On Fri, Mar 04, 2016 at 11:38:47AM -0800, Hugh Dickins wrote: > > > > On Fri, 4 Mar 2016, Dave Hansen wrote: > > > > > On 03/04/2016 03:26 AM, Kirill A. Shutemov wrote: > > > > > > On Thu, Mar 03, 2016 at 07:51:50PM +0300, Kirill A. Shutemov wrote: > > > > > >> Truncate and punch hole that only cover part of THP range is > > > > > >> implemented > > > > > >> by zero out this part of THP. > > > > > >> > > > > > >> This have visible effect on fallocate(FALLOC_FL_PUNCH_HOLE) > > > > > >> behaviour. > > > > > >> As we don't really create hole in this case, lseek(SEEK_HOLE) may > > > > > >> have > > > > > >> inconsistent results depending what pages happened to be allocated. > > > > > >> Not sure if it should be considered ABI break or not. > > > > > > > > > > > > Looks like this shouldn't be a problem. man 2 fallocate: > > > > > > > > > > > > Within the specified range, partial filesystem blocks are > > > > > > zeroed, > > > > > > and whole filesystem blocks are removed from the file. After a > > > > > > successful call, subsequent reads from this range will return > > > > > > zeroes. > > > > > > > > > > > > It means we effectively have 2M filesystem block size. > > > > > > > > > > The question is still whether this will case problems for apps. > > > > > > > > > > Isn't 2MB a quote unusual block size? Wouldn't some files on a tmpfs > > > > > filesystem act like they have a 2M blocksize and others like they have > > > > > 4k? Would that confuse apps? > > > > > > > > At risk of addressing the tip of an iceberg, before diving down to > > > > scope out the rest of the iceberg... > > > > > > > > > > (Though in the case of my huge tmpfs, it's the reverse: the small hole > > > > punch splits the hugepage; but it's natural that Kirill's way would try > > > > to hold on to its compound pages for longer than I do, and that's fine > > > > so long as it's all consistent.) > > > > > > > Ah, but suppose someone holepunches out most of each 2M page: they would > > > > expect the memcg not to be charged for those holes (just as when they > > > > munmap most of an anonymous THP) - that does suggest splitting is > > > > needed. > > > > > > I think filesystems will expect splitting to happen. They call > > > truncate_pagecache_range() on the region that the hole is being > > > punched out of, and they expect page cache pages over this range to > > > be unmapped, invalidated and then removed from the mapping tree as a > > > result. Also, most filesystems think the page cache only contains > > > PAGE_CACHE_SIZE mappings, so they are completely unaware of the > > > limitations THP might have when it comes to invalidation. > > > > > > IOWs, if this range is not aligned to huge page boundaries, then it > > > implies the huge page is either split into PAGE_SIZE mappings and > > > then the range is invalidated as expected, or it is completely > > > invalidated and then refaulted on future accesses which determine if > > > THP or normal pages are used for the page being faulted > > > > The filesystem in question is tmpfs and complete invalidation is not > > always an option. > > Then your two options are: splitting the page and rerunning the hole > punch, or simply zeroing the sections of the THP rather than trying > to punch out the backing store. The second option is implemented at the moment as splitting can fail. > > For other filesystems it also can be unavailable > > immediately if the page is dirty (the dirty flag is tracked on per-THP > > basis at the moment). > > Filesystems with persistent storage flush the range being punched > first to ensure that partial blocks are correctly written before we > start freeing the backing store. This is needed on XFS to ensure > hole punch plays nicely with delayed allocation and other extent > based operations. Hence we know that we have clean pages over the > hole we are about to punch and so there is no reason the > invalidation should *ever* fail. Okay. It means we have other option to consider on THP-enabling for a filesystem with persistent storage. > tmpfs is a special snowflake when it comes to these fallocate based > filesystem layout manipulation functions - it does not have > persistent storage, so you have to do things very differently to > ensure that data is not lost. > > > Would it be acceptable for fallocate(FALLOC_FL_PUNCH_HOLE) to return > > -EBUSY (or other errno on your choice), if we cannot split the page > > right away? > > Which means THP are not transparent any more. What does an > application do when it gets an EBUSY, anyway? I guess it's reasonable to expect from an application to handle EOPNOTSUPP as FALLOC_FL_PUNCH_HOLE is not supported by some filesystems. Although, non-consistent result from the same fd
/bin/bash: line 0: [: -ge: unary operator expected
Hi Huacai, FYI, the error/warning still remains. tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master head: a7c9b603cf2371edacb054abc35597e810c1e5fd commit: 5188129b8c9f58ba089bfd3809a163a8c087c797 MIPS: Loongson-3: Improve -march option and move it to Platform date: 6 weeks ago config: mips-jz4740 (attached as .config) reproduce: wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross git checkout 5188129b8c9f58ba089bfd3809a163a8c087c797 # save the attached .config to linux build tree make.cross ARCH=mips All errors (new ones prefixed by >>): >> /bin/bash: line 0: [: -ge: unary operator expected -- awk: scripts/ld-version.sh: line 4: regular expression compile failed (missing '(') .*) >> /bin/sh: line 0: [: -ge: unary operator expected awk: scripts/ld-version.sh: line 4: regular expression compile failed (missing '(') .*) >> /bin/sh: line 0: [: -ge: unary operator expected awk: scripts/ld-version.sh: line 4: regular expression compile failed (missing '(') .*) /bin/sh: line 0: [: -lt: unary operator expected awk: scripts/ld-version.sh: line 4: regular expression compile failed (missing '(') .*) >> /bin/sh: line 0: [: -ge: unary operator expected -- awk: scripts/ld-version.sh: line 4: regular expression compile failed (missing '(') .*) >> /bin/sh: line 0: [: -ge: unary operator expected -- awk: scripts/ld-version.sh: line 4: regular expression compile failed (missing '(') .*) >> /bin/sh: line 0: [: -ge: unary operator expected awk: scripts/ld-version.sh: line 4: regular expression compile failed (missing '(') .*) >> /bin/sh: line 0: [: -ge: unary operator expected awk: scripts/ld-version.sh: line 4: regular expression compile failed (missing '(') .*) >> /bin/sh: line 0: [: -ge: unary operator expected --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: Binary data
/bin/bash: line 0: [: -ge: unary operator expected
Hi Huacai, FYI, the error/warning still remains. tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master head: a7c9b603cf2371edacb054abc35597e810c1e5fd commit: 5188129b8c9f58ba089bfd3809a163a8c087c797 MIPS: Loongson-3: Improve -march option and move it to Platform date: 6 weeks ago config: mips-jz4740 (attached as .config) reproduce: wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross git checkout 5188129b8c9f58ba089bfd3809a163a8c087c797 # save the attached .config to linux build tree make.cross ARCH=mips All errors (new ones prefixed by >>): >> /bin/bash: line 0: [: -ge: unary operator expected -- awk: scripts/ld-version.sh: line 4: regular expression compile failed (missing '(') .*) >> /bin/sh: line 0: [: -ge: unary operator expected awk: scripts/ld-version.sh: line 4: regular expression compile failed (missing '(') .*) >> /bin/sh: line 0: [: -ge: unary operator expected awk: scripts/ld-version.sh: line 4: regular expression compile failed (missing '(') .*) /bin/sh: line 0: [: -lt: unary operator expected awk: scripts/ld-version.sh: line 4: regular expression compile failed (missing '(') .*) >> /bin/sh: line 0: [: -ge: unary operator expected -- awk: scripts/ld-version.sh: line 4: regular expression compile failed (missing '(') .*) >> /bin/sh: line 0: [: -ge: unary operator expected -- awk: scripts/ld-version.sh: line 4: regular expression compile failed (missing '(') .*) >> /bin/sh: line 0: [: -ge: unary operator expected awk: scripts/ld-version.sh: line 4: regular expression compile failed (missing '(') .*) >> /bin/sh: line 0: [: -ge: unary operator expected awk: scripts/ld-version.sh: line 4: regular expression compile failed (missing '(') .*) >> /bin/sh: line 0: [: -ge: unary operator expected --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: Binary data
RE: [PATCH RESEND] net:fec:Fix error checking in the function fec_enet_init
From: Nicholas KrauseSent: Saturday, March 05, 2016 4:00 AM > To: da...@davemloft.net > Cc: b38...@freescale.com; and...@lunn.ch; fabio.este...@freescale.com; > l.st...@pengutronix.de; rmk+ker...@arm.linux.org.uk; trem...@gmail.com; > johan...@sipsolutions.net; u.kleine-koe...@pengutronix.de; > haoke...@gmail.com; net...@vger.kernel.org; linux-kernel@vger.kernel.org > Subject: [PATCH RESEND] net:fec:Fix error checking in the function > fec_enet_init > > This fixes error checking in the function fec_enet_init to properly check if > the > internal call to the function fec_enet_alloc_queue fails and if so immediately > return the error code to the caller for it to handle it's own intended error > paths. > > Signed-off-by: Nicholas Krause > --- > drivers/net/ethernet/freescale/fec_main.c | 5 - > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/drivers/net/ethernet/freescale/fec_main.c > b/drivers/net/ethernet/freescale/fec_main.c > index b349e6f..18c625f 100644 > --- a/drivers/net/ethernet/freescale/fec_main.c > +++ b/drivers/net/ethernet/freescale/fec_main.c > @@ -3123,6 +3123,7 @@ static int fec_enet_init(struct net_device *ndev) > dma_addr_t bd_dma; > int bd_size; > unsigned int i; > + int ret; > > #if defined(CONFIG_ARM) > fep->rx_align = 0xf; > @@ -3132,7 +3133,9 @@ static int fec_enet_init(struct net_device *ndev) > fep->tx_align = 0x3; > #endif > > - fec_enet_alloc_queue(ndev); > + ret = fec_enet_alloc_queue(ndev); > + if (ret) > + return ret; > > if (fep->bufdesc_ex) > fep->bufdesc_size = sizeof(struct bufdesc_ex); > -- > 2.1.4 Thanks. Acked-by: Fugang Duan
RE: [PATCH RESEND] net:fec:Fix error checking in the function fec_enet_init
From: Nicholas Krause Sent: Saturday, March 05, 2016 4:00 AM > To: da...@davemloft.net > Cc: b38...@freescale.com; and...@lunn.ch; fabio.este...@freescale.com; > l.st...@pengutronix.de; rmk+ker...@arm.linux.org.uk; trem...@gmail.com; > johan...@sipsolutions.net; u.kleine-koe...@pengutronix.de; > haoke...@gmail.com; net...@vger.kernel.org; linux-kernel@vger.kernel.org > Subject: [PATCH RESEND] net:fec:Fix error checking in the function > fec_enet_init > > This fixes error checking in the function fec_enet_init to properly check if > the > internal call to the function fec_enet_alloc_queue fails and if so immediately > return the error code to the caller for it to handle it's own intended error > paths. > > Signed-off-by: Nicholas Krause > --- > drivers/net/ethernet/freescale/fec_main.c | 5 - > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/drivers/net/ethernet/freescale/fec_main.c > b/drivers/net/ethernet/freescale/fec_main.c > index b349e6f..18c625f 100644 > --- a/drivers/net/ethernet/freescale/fec_main.c > +++ b/drivers/net/ethernet/freescale/fec_main.c > @@ -3123,6 +3123,7 @@ static int fec_enet_init(struct net_device *ndev) > dma_addr_t bd_dma; > int bd_size; > unsigned int i; > + int ret; > > #if defined(CONFIG_ARM) > fep->rx_align = 0xf; > @@ -3132,7 +3133,9 @@ static int fec_enet_init(struct net_device *ndev) > fep->tx_align = 0x3; > #endif > > - fec_enet_alloc_queue(ndev); > + ret = fec_enet_alloc_queue(ndev); > + if (ret) > + return ret; > > if (fep->bufdesc_ex) > fep->bufdesc_size = sizeof(struct bufdesc_ex); > -- > 2.1.4 Thanks. Acked-by: Fugang Duan