Re: Improving documentation of parent-ID field in /proc/PID/mountinfo
On Mon, Nov 13, 2017 at 07:02:21AM +0100, Michael Kerrisk (man-pages) wrote:
> Hello Ram,
>
> Long ago (2.6.29) you added the /proc/PID/mountinfo file and
> associated documentation in Documentation/filesystems/proc.txt. Later,
> I pasted much of that documentation into the proc(5) manual page.
>
> That documentation says of the second field in the file:
>
> [[
> This file contains lines of the form:
>
> 36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue
> (1)(2)(3)   (4)   (5)      (6)      (7)   (8) (9)   (10)         (11)
>
> (1) mount ID: unique identifier of the mount (may be reused after umount)
> (2) parent ID: ID of parent (or of self for the top of the mount tree)
> ...
> ]]
>
> The last piece of the description of field (2) doesn't seem to be
> correct, or is at least rather unclear. I take this to be saying
> that for the root mount point, /, field (2) will have the same value
> as field (1). I never actually looked at this detail closely, but
> Alexander pointed out that this is obviously not so, as one can
> immediately verify:
>
> $ grep '/ / ' /proc/$$/mountinfo
> 65 0 8:2 / / rw,relatime shared:1 - ext4 /dev/sda2 rw,seclabel,data=order
>
> I dug around in the kernel source for a bit. I do not have an exact
> handle on the details, but I can see roughly what is going on.
> Internally, there seems to be one ("hidden") mount ID reserved to each
> mount namespace, and that ID is the parent of the root mount point.
>
> Looking through the (4.14) kernel source, mount IDs are allocated by
> mnt_alloc_id() (in fs/namespace.c), which is in turn called by
> alloc_vfsmnt() which is in turn called by clone_mnt().
>
> A new mount namespace is created by the kernel function copy_mnt_ns()
> (in fs/namespace.c, called by create_new_namespaces() in
> kernel/nsproxy.c). The copy_mnt_ns() function calls copy_tree() (in
> fs/namespace.c), and copy_tree() calls clone_mnt() in *two* places.
> The first of these is the call that creates the "hidden" mount ID that
> becomes the parent of the root mount point. (I verified this by
> instrumenting the kernel with a few printk() calls to display the
> IDs.) The second place where copy_tree() calls clone_mnt() is in a
> loop that replicates each of the mount points (including the root
> mount point) in the source mount namespace.

We used to report that mount, once upon a time. Something has changed
the behavior since then and it's not reported any more, thus making it
hidden.

>
> With these details in mind, I propose to patch the man page to read as
> below. Perhaps you have some corrections or improvements to suggest
> for this text?
>
> [[
> (2) parent ID: the ID of the parent mount. For the root
>     mount point, the ID shown here is a hidden mount ID
>     associated with the mount namespace. That ID is
>     distinct from any of the IDs shown in field (1) of the
>     records shown in the mountinfo file, and does not
>     appear in field (1) in the mountinfo file in any other
>     mount namespace. (In the initial mount namespace,
>     this hidden ID has the value 0.)

It captures the current semantics correctly.

RP
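An illustrative sketch, not from the thread: the proposed wording for field (2) can be checked from user space by looking for a parent ID that never occurs as a mount ID. The function name `find_hidden_parent_id` and the `MAX_MOUNTS` cap are hypothetical, chosen only for this example; the function works on mountinfo-formatted text held in memory, so it makes no assumption about how the text was read.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_MOUNTS 1024 /* arbitrary cap for this sketch */

/*
 * Return the parent ID (field 2) of the first record whose parent ID does
 * not occur as a mount ID (field 1) anywhere in `text`, or -1 if every
 * parent ID is listed. `text` holds mountinfo-formatted lines.
 */
int find_hidden_parent_id(const char *text)
{
	int ids[MAX_MOUNTS], parents[MAX_MOUNTS];
	int n = 0;

	assert(text != NULL);

	/* Pass 1: collect fields (1) and (2) of every well-formed line. */
	for (const char *p = text; *p != '\0' && n < MAX_MOUNTS; ) {
		size_t len = strcspn(p, "\n");
		size_t n_copy;
		char head[64];
		int id, parent;

		/* Only the start of the line is needed for the two IDs. */
		n_copy = len < sizeof(head) - 1 ? len : sizeof(head) - 1;
		memcpy(head, p, n_copy);
		head[n_copy] = '\0';

		if (sscanf(head, "%d %d", &id, &parent) == 2) {
			ids[n] = id;
			parents[n] = parent;
			n++;
		}

		p += len;
		if (*p == '\n')
			p++;
	}

	/* Pass 2: report the first parent ID that is not a mount ID. */
	for (int i = 0; i < n; i++) {
		int listed = 0;

		for (int j = 0; j < n; j++) {
			if (ids[j] == parents[i]) {
				listed = 1;
				break;
			}
		}
		if (!listed)
			return parents[i];
	}
	return -1;
}
```

Fed the single record quoted above (mount ID 65, parent ID 0), the function returns 0, which matches the statement that the hidden ID is 0 in the initial mount namespace. A record whose parent ID equals its own mount ID, as the old "or of self" wording describes, would yield -1.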
[GIT PULL] RAS updates for v4.15
Linus,

Please pull the latest ras-core-for-linus git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git ras-core-for-linus

   # HEAD: 783ca517bfd62ca516178712775e4b273292d5b1 x86/MCE/AMD: Fix mce_severity_amd_smca() signature

Two minor updates to AMD SMCA support, plus a timer_setup() conversion.

 Thanks,

	Ingo

-->
Kees Cook (1):
      x86/mce: Convert timers to use timer_setup()

Yazen Ghannam (2):
      x86/MCE/AMD: Always give panic severity for UC errors in kernel context
      x86/MCE/AMD: Fix mce_severity_amd_smca() signature


 arch/x86/kernel/cpu/mcheck/mce-severity.c |  9 ++++-----
 arch/x86/kernel/cpu/mcheck/mce.c          | 13 +++++--------
 2 files changed, 9 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce-severity.c b/arch/x86/kernel/cpu/mcheck/mce-severity.c
index 87cc9ab7a13c..4ca632a06e0b 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-severity.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-severity.c
@@ -204,7 +204,7 @@ static int error_context(struct mce *m)
 	return IN_KERNEL;
 }
 
-static int mce_severity_amd_smca(struct mce *m, int err_ctx)
+static int mce_severity_amd_smca(struct mce *m, enum context err_ctx)
 {
 	u32 addr = MSR_AMD64_SMCA_MCx_CONFIG(m->bank);
 	u32 low, high;
@@ -245,6 +245,9 @@ static int mce_severity_amd(struct mce *m, int tolerant, char **msg, bool is_exc
 
 	if (m->status & MCI_STATUS_UC) {
 
+		if (ctx == IN_KERNEL)
+			return MCE_PANIC_SEVERITY;
+
 		/*
 		 * On older systems where overflow_recov flag is not present, we
 		 * should simply panic if an error overflow occurs. If
@@ -255,10 +258,6 @@ static int mce_severity_amd(struct mce *m, int tolerant, char **msg, bool is_exc
 		if (mce_flags.smca)
 			return mce_severity_amd_smca(m, ctx);
 
-		/* software can try to contain */
-		if (!(m->mcgstatus & MCG_STATUS_RIPV) && (ctx == IN_KERNEL))
-			return MCE_PANIC_SEVERITY;
-
 		/* kill current process */
 		return MCE_AR_SEVERITY;
 	} else {
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 3b413065c613..b1d616d08eee 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1367,13 +1367,12 @@ static void __start_timer(struct timer_list *t, unsigned long interval)
 	local_irq_restore(flags);
 }
 
-static void mce_timer_fn(unsigned long data)
+static void mce_timer_fn(struct timer_list *t)
 {
-	struct timer_list *t = this_cpu_ptr(&mce_timer);
-	int cpu = smp_processor_id();
+	struct timer_list *cpu_t = this_cpu_ptr(&mce_timer);
 	unsigned long iv;
 
-	WARN_ON(cpu != data);
+	WARN_ON(cpu_t != t);
 
 	iv = __this_cpu_read(mce_next_interval);
 
@@ -1763,17 +1762,15 @@ static void mce_start_timer(struct timer_list *t)
 static void __mcheck_cpu_setup_timer(void)
 {
 	struct timer_list *t = this_cpu_ptr(&mce_timer);
-	unsigned int cpu = smp_processor_id();
 
-	setup_pinned_timer(t, mce_timer_fn, cpu);
+	timer_setup(t, mce_timer_fn, TIMER_PINNED);
 }
 
 static void __mcheck_cpu_init_timer(void)
 {
 	struct timer_list *t = this_cpu_ptr(&mce_timer);
-	unsigned int cpu = smp_processor_id();
 
-	setup_pinned_timer(t, mce_timer_fn, cpu);
+	timer_setup(t, mce_timer_fn, TIMER_PINNED);
 	mce_start_timer(t);
 }
RE: [PATCH v5 0/5] fw_cfg: add DMA operations & etc/vmcoreinfo support
Marc-Andre,

It looks to me that the 4th and 5th patches somehow have not been sent.
Could you send them, too? I'd like them to actually build and run
the kernel for testing.

> -----Original Message-----
> From: linux-kernel-ow...@vger.kernel.org
> [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Marc-Andre Lureau
> Sent: Wednesday, November 8, 2017 1:24 AM
> To: linux-kernel@vger.kernel.org
> Cc: so...@cmu.edu; qemu-de...@nongnu.org; m...@redhat.com; Marc-André Lureau
> Subject: [PATCH v5 0/5] fw_cfg: add DMA operations & etc/vmcoreinfo support
>
> Hi,
>
> This series adds DMA operations support to the qemu fw_cfg kernel
> module and populates "etc/vmcoreinfo" with vmcoreinfo location
> details.
>
> Note: the support for this entry handling has been merged for next
> qemu release (2.11)
>
> v5:
> - resent to CC kdump people on the paddr_vmcoreinfo_note() export patch
>
> v4:
> - export paddr_vmcoreinfo_note() to fix fw_cfg.ko build
> - fix build with !CONFIG_CRASH_CORE
> - replace the unbounded yield() loop with a usleep_range() loop and a
>   200ms timeout
> - do not write vmcoreinfo entry when running the kdump kernel (D. Hatayama)
> - drop the experimental sysfs write support patch from this series
>
> v3: (thanks kbuild)
> - add "fw_cfg: fix the command line module name" patch
> - fix build of "fw_cfg: add DMA register" with CONFIG_FW_CFG_SYSFS_CMDLINE=y
> - fix 'Wshift-count-overflow'
>
> v2:
> - use platform device for dma mapping
> - add etc/vmcoreinfo patch
> - some code cleanups
>
> Marc-André Lureau (5):
>   fw_cfg: fix the command line module name
>   fw_cfg: add DMA register
>   fw_cfg: do DMA read operation
>   crash: export paddr_vmcoreinfo_note()
>   fw_cfg: write vmcoreinfo details
>
>  drivers/firmware/qemu_fw_cfg.c | 292
> +
>  kernel/crash_core.c            |   1 +
>  2 files changed, 264 insertions(+), 29 deletions(-)
>
> --
> 2.15.0.125.g8f49766d64
>
>
[GIT PULL] perf updates for v4.15
Linus,

Please pull the latest perf-core-for-linus git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git perf-core-for-linus

   # HEAD: fcdfafcb73be8fa45909327bbddca46fb362a675 kprobes: Don't spam the build log with deprecation warnings

The main changes in this cycle were:

 - kprobes updates: use better W^X patterns for code modifications,
   improve optprobes, remove jprobes. (Masami Hiramatsu, Kees Cook)

 - core fixes: event timekeeping (enabled/running times statistics)
   fixes, perf_event_read() locking fixes and cleanups, etc. (Peter
   Zijlstra)

 - Extend x86 Intel free-running PEBS support and support x86
   user-register sampling in perf record and perf script. (Andi Kleen)

 - Tooling updates:

    - Completely rework the way inline frames are handled. Instead of
      querying for the inline nodes on-demand in the individual tools,
      we now create proper callchain nodes for inlined frames. (Milian
      Wolff)

    - 'perf trace' updates (Arnaldo Carvalho de Melo)

    - Implement a way to print formatted output to per-event files in
      'perf script' to facilitate generating flamegraphs, eliminating
      the need to write scripts to do that separation (yuzhoujian,
      Arnaldo Carvalho de Melo)

    - Update vendor events JSON metrics for Intel's Broadwell, Broadwell
      Server, Haswell, Haswell Server, IvyBridge, IvyTown, JakeTown,
      Sandy Bridge, Skylake, SkyLake Server - and Goldmont Plus V1 (Andi
      Kleen, Kan Liang)

    - Multithread the synthesizing of PERF_RECORD_ events for
      pre-existing threads in 'perf top', speeding up that phase,
      greatly improving the user experience in systems such as Intel's
      Knights Mill (Kan Liang)

    - Introduce the concept of weak groups in 'perf stat': try to set up
      a group, but if it's not schedulable fall back to not using a
      group. That gives us the best of both worlds: groups if they work,
      but still a usable fallback if they don't. E.g: (Andi Kleen)

    - perf sched timehist enhancements (David Ahern)

    - ... various other enhancements, updates, cleanups and fixes.
 Thanks,

	Ingo

-->
Alexander Shishkin (1):
      perf/core: Explain perf_sched_mutex

Andi Kleen (40):
      perf tools: Support weak groups in 'perf stat'
      perf vendor events: Support metric_group and no event name in JSON parser
      perf stat: Factor out generic metric printing
      perf stat: Print generic metric header even for failed expressions
      perf pmu: Extract function to get JSON alias map
      perf stat: Support JSON metrics in perf stat
      perf list: Add metric groups to perf list
      perf stat: Don't use ctx for saved values lookup
      perf stat: Support duration_time for metrics
      perf stat: Hide internal duration_time counter
      perf stat: Update walltime_nsecs_stats in interval mode
      perf record: Support direct --user-regs arguments
      perf script: Support user regs
      perf stat: Fall weak group back even for EBADF
      perf vendor events: Add JSON metrics for Broadwell
      perf vendor events: Add JSON metrics for Skylake
      perf vendor events: Add JSON metrics for Sandy Bridge
      perf vendor events: Add JSON metrics for Sandy Bridge EP
      perf vendor events: Add JSON metrics for Ivy Bridge
      perf vendor events: Add JSON metrics for Haswell
      perf vendor events: Add JSON metrics for Ivy Town
      perf vendor events: Add JSON metrics for Haswell EP
      perf vendor events: Add JSON metrics for Broadwell Server
      perf vendor events: Add JSON metrics for Broadwell DE
      perf vendor events: Add JSON metrics for Skylake server
      perf pmu: Improve error messages for missing PMUs
      perf stat: Fix adding multiple event groups
      perf/x86: Enable free running PEBS for REGS_USER/INTR
      perf vendor events: Update JSON metrics for Broadwell
      perf vendor events: Update JSON metrics for Broadwell Server
      perf vendor events: Update JSON metrics for Haswell
      perf vendor events: Update JSON metrics for Haswell Server
      perf vendor events: Update JSON metrics for IvyBridge
      perf vendor events: Update JSON metrics for IvyTown
      perf vendor events: Update JSON metrics for JakeTown
      perf vendor events: Update JSON metrics for Sandy Bridge
      perf vendor events: Update JSON metrics for Skylake
      perf vendor events: Update JSON metrics for Skylake Server
      perf list: Fix group description in the man page
      perf vendor events: Fix incorrect cmask syntax for some Intel metrics

Arnaldo Carvalho de Melo (25):
      perf tools: Make copyfile_offset() static
      perf machine: Optimize a bit the machine__findnew_thread() methods
      perf trace beauty madvise: Generate 'behavior' string table from kernel headers
      tools: Update asm-generic/mman-common.h copy from the kernel
      perf tools: Get all of tools/{arch,include}/ in the MANIFEST
[PATCH] perf tool: Fix build failure when NO_AUXTRACE=1
Perf tool fails with the following build failure when AUXTRACE is not set:

  $ make NO_AUXTRACE=1
  builtin-script.c: In function 'perf_script__process_auxtrace_info':
  util/auxtrace.h:608:44: error: called object is not a function or function pointer
   #define perf_event__process_auxtrace_info  0
                                              ^

Fix it by guarding the function under HAVE_AUXTRACE_SUPPORT.

Fixes: 47e5a26a916b ("perf script: Fix --per-event-dump for auxtrace synth evsels")
Signed-off-by: Ravi Bangoria
---
 tools/perf/builtin-script.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index ad6404dcf91c..9b43bda45a41 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -2848,6 +2848,7 @@ int process_cpu_map_event(struct perf_tool *tool __maybe_unused,
 	return set_maps(script);
 }
 
+#ifdef HAVE_AUXTRACE_SUPPORT
 static int perf_script__process_auxtrace_info(struct perf_tool *tool,
 					      union perf_event *event,
 					      struct perf_session *session)
@@ -2862,6 +2863,9 @@ static int perf_script__process_auxtrace_info(struct perf_tool *tool,
 
 	return ret;
 }
+#else
+#define perf_script__process_auxtrace_info 0
+#endif
 
 int cmd_script(int argc, const char **argv)
 {
-- 
2.13.6
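An illustrative sketch, not part of the patch: the failure above comes from a handler name that is stubbed out as the constant 0 when a feature is disabled. Such a stub is fine as a null entry in a callback table, but any wrapper that calls it no longer compiles, so the wrapper needs the same guard. All names here (`HAVE_FEATURE`, `feature_process_event`, `script_process_event`, `struct tool`, `dispatch`) are hypothetical stand-ins, and treating a null handler as "nothing to do" is an assumption of this sketch rather than a statement about perf.

```c
#include <assert.h>
#include <stddef.h>

/*
 * HAVE_FEATURE is a hypothetical stand-in for HAVE_AUXTRACE_SUPPORT. It is
 * left undefined here, which corresponds to building with the feature off.
 */

typedef int (*event_handler_t)(int event);

#ifdef HAVE_FEATURE
/* Feature-on build: a real handler exists. */
static int feature_process_event(int event)
{
	return event + 1;
}
#else
/* Feature-off build: the name collapses to a null pointer constant. */
#define feature_process_event 0
#endif

#ifdef HAVE_FEATURE
/* The wrapper calls the handler, so it must only exist when that is a function. */
static int script_process_event(int event)
{
	return feature_process_event(event);
}
#else
/* Same trick as the patch: stub the wrapper out as well. */
#define script_process_event 0
#endif

struct tool {
	event_handler_t handler; /* may be NULL when the feature is off */
};

struct tool script_tool = { .handler = script_process_event };

/* Call the handler if there is one; report 0 ("nothing to do") otherwise. */
int dispatch(const struct tool *t, int event)
{
	assert(t != NULL);
	if (t->handler == NULL)
		return 0;
	return t->handler(event);
}
```

Dropping the `#ifdef` around `script_process_event` while `HAVE_FEATURE` is undefined would make its body expand to `0(event)`, which gives the same kind of "called object is not a function" error quoted above.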
Re: [PATCH] perf/core: fast breakpoint modification via _IOC_MODIFY_BREAKPOINT
On Sun, Nov 12, 2017 at 11:09:23AM -0800, Milind Chabbi wrote:
> ,
>
> On Thu, Nov 9, 2017 at 10:59 AM, Milind Chabbi wrote:
> > SNIP
> >
> > On Thu, Nov 9, 2017 at 5:12 AM, Jiri Olsa wrote:
> >>
> >>
> >> how about something like below (untested)
> >>
> >> looks like there's no irq caller for modify_user_hw_breakpoint,
> >> so we should be fine with locking nr_bp_mutex
> >>
> >> jirka
> >>
> >>
> >> ---
> >> diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
> >> index 3f8cb1e14588..f062b68399ea 100644
> >> --- a/kernel/events/hw_breakpoint.c
> >> +++ b/kernel/events/hw_breakpoint.c
> >> @@ -448,6 +448,8 @@ int modify_user_hw_breakpoint(struct perf_event *bp,
> >> struct perf_event_attr *att
> >>  	else
> >>  		perf_event_disable(bp);
> >>
> >> +	release_bp_slot(bp);
> >> +
> >>  	bp->attr.bp_addr = attr->bp_addr;
> >>  	bp->attr.bp_type = attr->bp_type;
> >>  	bp->attr.bp_len = attr->bp_len;
> >> @@ -455,9 +457,9 @@ int modify_user_hw_breakpoint(struct perf_event *bp,
> >> struct perf_event_attr *att
> >>  	if (attr->disabled)
> >>  		goto end;
> >>
> >> -	err = validate_hw_breakpoint(bp);
> >> +	err = reserve_bp_slot(bp);
> >>  	if (!err)
> >> -		perf_event_enable(bp);
> >> +		err = validate_hw_breakpoint(bp);
> >>
> >>  	if (err) {
> >>  		bp->attr.bp_addr = old_addr;
> >> @@ -469,6 +471,7 @@ int modify_user_hw_breakpoint(struct perf_event *bp,
> >> struct perf_event_attr *att
> >>  		return err;
> >>  	}
> >>
> >> +	perf_event_enable(bp);
> >>  end:
> >>  	bp->attr.disabled = attr->disabled;
> >>
> >
> > We can do this accounting only if bp->attr.bp_type != attr->bp_type.
> >
> > -Milind
>
>
> Jirka,
>
> Neither of us seems to fully understand the convoluted logic used in
> breakpoint counting.

yea, I was hoping some of the guys would take over ;-)

the problem I have with the patch above is that we could
fail to reserve the slot at the end, which is not what the
caller might expect

>
> I tested the following sequence on an x86 machine, which has four
> debug registers (without your suggested patch for counting
> correction).
>
> fd1 = perf_event_open(...); //BP_TYPE= HW_BREAKPOINT_RW @ ADDR1
> fd2 = perf_event_open(...); //BP_TYPE= HW_BREAKPOINT_RW @ ADDR2
> fd3 = perf_event_open(...); //BP_TYPE= HW_BREAKPOINT_RW @ ADDR3
> fd4 = perf_event_open(...); //BP_TYPE= HW_BREAKPOINT_RW @ ADDR4
> ioctl(fd4, MODIFY, ...); // change fd4 to BP_TYPE= HW_BREAKPOINT_X @ ADDR5
> close(fd4);
> fd5 = perf_event_open(); //BP_TYPE=RW @ ADDR6
>
> We expected fd5 to fail because four BP_TYPE=TYPE_DATA are in use as
> per the accounting, but in reality, fd5 was successfully opened.

but you closed fd4 before opening fd5..?

>
> Is the accounting accidentally working on x86?
> Is there another architecture where TYPE_DATA and TYPE_INS are counted
> differently?

[jolsa@krava linux-perf]$ grep -r HAVE_MIXED_BREAKPOINTS_REGS arch/*
arch/Kconfig:config HAVE_MIXED_BREAKPOINTS_REGS
arch/sh/Kconfig:select HAVE_MIXED_BREAKPOINTS_REGS
arch/x86/Kconfig: select HAVE_MIXED_BREAKPOINTS_REGS

I'll try to check on it this week

jirka
Crypto Update for 4.15
Hi Linus:

Here is the crypto update for 4.15:

API:

- Disambiguate EBUSY when queueing crypto request by adding ENOSPC.
  This change touches code outside the crypto API.
- Reset settings when empty string is written to rng_current.

Algorithms:

- Add OSCCA SM3 secure hash.

Drivers:

- Remove old mv_cesa driver (replaced by marvell/cesa).
- Enable rfc3686/ecb/cfb/ofb AES in crypto4xx.
- Add ccm/gcm AES in crypto4xx.
- Add support for BCM7278 in iproc-rng200.
- Add hash support on Exynos in s5p-sss.
- Fix fallback-induced error in vmx.
- Fix output IV in atmel-aes.
- Fix empty GCM hash in mediatek.

Others:

- Fix DoS potential in lib/mpi.
- Fix potential out-of-order issues with padata.

Please note that there may be a conflict with the tips tree due
to the timer_setup patch being applied in both cryptodev and the
tips tree. The version in the tips tree also touches the mv_cesa
driver which just happens to have been removed in this cycle in
cryptodev. Any changes to mv_cesa may be safely discarded.

Please pull from

git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6.git linus

Allen (1):
      crypto: omap - return -ENOMEM on allocation failure.

Arnd Bergmann (1):
      crypto: axis - hide an unused variable

Arvind Yadav (11):
      crypto: nx - constify vio_device_id
      crypto: nx-842 - constify vio_device_id
      hwrng: pseries - constify vio_device_id
      crypto: padlock-aes - constify x86_cpu_id
      crypto: padlock-sha - constify x86_cpu_id
      hwrng: core - pr_err() strings should end with newlines
      crypto: omap-aes - pr_err() strings should end with newlines
      crypto: virtio - pr_err() strings should end with newlines
      crypto: chelsio - pr_err() strings should end with newlines
      crypto: qat - pr_err() strings should end with newlines
      crypto: bcm - pr_err() strings should end with newlines

Boris BREZILLON (5):
      crypto: marvell - Add a platform_device_id table
      ARM: configs: Stop selecting the old CESA driver
      crypto: marvell - Remove the old mv_cesa driver
      crypto: marvell - Switch cipher algs to the skcipher interface
      crypto: marvell - Add a NULL entry at the end of mv_cesa_plat_id_table[]

Christian Lamparter (25):
      crypto: crypto4xx - remove bad list_del
      crypto: crypto4xx - remove unused definitions and write-only variables
      crypto: crypto4xx - set CRYPTO_ALG_KERN_DRIVER_ONLY flag
      crypto: crypto4xx - remove extern statement before function declaration
      crypto: crypto4xx - remove double assignment of pd_uinfo->state
      crypto: crypto4xx - fix dynamic_sa_ctl's sa_contents declaration
      crypto: crypto4xx - move and refactor dynamic_contents helpers
      crypto: crypto4xx - enable AES RFC3686, ECB, CFB and OFB offloads
      crypto: crypto4xx - refactor crypto4xx_copy_pkt_to_dst()
      crypto: crypto4xx - replace crypto4xx_dev's scatter_buffer_size with constant
      crypto: crypto4xx - fix crypto4xx_build_pdr, crypto4xx_build_sdr leak
      crypto: crypto4xx - pointer arithmetic overhaul
      crypto: crypto4xx - wire up hmac_mc to hmac_muting
      crypto: crypto4xx - fix off-by-one AES-OFB
      crypto: crypto4xx - fix type mismatch compiler error
      crypto: crypto4xx - increase context and scatter ring buffer elements
      crypto: crypto4xx - add backlog queue support
      crypto: crypto4xx - use the correct LE32 format for IV and key defs
      crypto: crypto4xx - overhaul crypto4xx_build_pd()
      crypto: crypto4xx - fix various warnings
      crypto: crypto4xx - fix stalls under heavy load
      crypto: crypto4xx - simplify sa and state context acquisition
      crypto: crypto4xx - prepare for AEAD support
      crypto: crypto4xx - add aes-ccm support
      crypto: crypto4xx - add aes-gcm support

Christophe Jaillet (2):
      crypto: lrw - Fix an error handling path in 'create()'
      crypto: lrw - Check for incorrect cipher name

Colin Ian King (5):
      crypto: aesni - make arrays aesni_simd_skciphers and aesni_simd_skciphers2 static
      crypto: algboss - remove redundant setting of len to zero
      crypto: cavium - clean up clang warning on unread variable offset
      crypto: ccp - remove unused variable qim
      crypto: qat - remove unused and redundant pointer vf_info

Corentin LABBE (14):
      crypto: gcm - add GCM IV size constant
      crypto: caam - Use GCM IV size constant
      crypto: ccp - Use GCM IV size constant
      crypto: nx - Use GCM IV size constant
      crypto: atmel - Use GCM IV size constant
      crypto: bcm - Use GCM IV size constant
      crypto: mediatek - Use GCM IV size constant
      crypto: chelsio - Use GCM IV size constant
      crypto: omap - Use GCM IV size constant
      crypto: gcm - Use GCM IV size constant
      crypto: aesni - Use GCM IV size constant
      crypto: stm32 - use of_device_get_match_data
      crypto: omap - use of_device_get_match_data
      crypto: bcm - use of_device_get_match_data

Eric Biggers
[PATCH] spi: spi-fsl-dspi: add SPI_LSB_FIRST to driver capabilities
The driver as well as the controller support the SPI lsb first mode.
However, it's not possible to configure it e.g. when using spidev.
Adding this flag to mode_bits resolves the issue and lsb first mode
can be used.

Signed-off-by: Kurt Kanzenbach
---
 drivers/spi/spi-fsl-dspi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/spi/spi-fsl-dspi.c b/drivers/spi/spi-fsl-dspi.c
index f652f70cb8db..02d3ed7f2558 100644
--- a/drivers/spi/spi-fsl-dspi.c
+++ b/drivers/spi/spi-fsl-dspi.c
@@ -980,7 +980,7 @@ static int dspi_probe(struct platform_device *pdev)
 	master->dev.of_node = pdev->dev.of_node;
 	master->cleanup = dspi_cleanup;
-	master->mode_bits = SPI_CPOL | SPI_CPHA;
+	master->mode_bits = SPI_CPOL | SPI_CPHA | SPI_LSB_FIRST;
 	master->bits_per_word_mask = SPI_BPW_MASK(4) | SPI_BPW_MASK(8) |
 					SPI_BPW_MASK(16);
--
2.11.0
Re: [PATCH v4 4/4] ARM64: dts: meson: drop "sana" clock from SAR ADC
Hi Kevin & others

I'd like to just re-send patch [4/4] (while leaving the others [1-3/4]
unchanged), to have separate DT patches for the 32-bit / 64-bit platforms.
Is this ok for you?

On 11/12/17 09:33, Martin Blumenstingl wrote:
> Hi Yixun,
>
> On Tue, Nov 7, 2017 at 3:10 PM, Yixun Lan wrote:
>> From: Xingyu Chen
>>
>> The SAR ADC modules doesn't require The "sana" clock.
>>
>> Singed-off-by: Xingyu Chen
>> Signed-off-by: Yixun Lan
>> ---
>>  arch/arm/boot/dts/meson8.dtsi  | 5 ++---
>>  arch/arm/boot/dts/meson8b.dtsi | 5 ++---
> these two should go into a separate patch (with "ARM: dts: ..."
> prefix) - the ARM maintainers want separate pull requests for the
> 32-bit and 64-bit .dts changes, so patches should also follow that
> schema
>
> with that fixed, you can add my ACK on both (32-bit and 64-bit) .dts patches:
> Acked-by: Martin Blumenstingl
>

thanks, I will send a separate patch for this, and I will add your 'Acked-by'

>>  arch/arm64/boot/dts/amlogic/meson-gxbb.dtsi | 3 +--
>>  arch/arm64/boot/dts/amlogic/meson-gxl.dtsi  | 3 +--
>>  4 files changed, 6 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/arm/boot/dts/meson8.dtsi b/arch/arm/boot/dts/meson8.dtsi
>> index b98d44fde6b6..f93d6cf6e094 100644
>> --- a/arch/arm/boot/dts/meson8.dtsi
>> +++ b/arch/arm/boot/dts/meson8.dtsi
>> @@ -289,9 +289,8 @@
>>  {
>>     compatible = "amlogic,meson8-saradc", "amlogic,meson-saradc";
>>     clocks = < CLKID_XTAL>,
>> -       < CLKID_SAR_ADC>,
>> -       < CLKID_SANA>;
>> -   clock-names = "clkin", "core", "sana";
>> +       < CLKID_SAR_ADC>;
>> +   clock-names = "clkin", "core";
>>  };
>>
>>  {
>> diff --git a/arch/arm/boot/dts/meson8b.dtsi b/arch/arm/boot/dts/meson8b.dtsi
>> index bc278da7df0d..4aa444284f0c 100644
>> --- a/arch/arm/boot/dts/meson8b.dtsi
>> +++ b/arch/arm/boot/dts/meson8b.dtsi
>> @@ -185,9 +185,8 @@
>>  {
>>     compatible = "amlogic,meson8b-saradc", "amlogic,meson-saradc";
>>     clocks = < CLKID_XTAL>,
>> -       < CLKID_SAR_ADC>,
>> -       < CLKID_SANA>;
>> -   clock-names = "clkin", "core", "sana";
>> +       < CLKID_SAR_ADC>;
>> +   clock-names = "clkin", "core";
>>  };
>>
>>  _AO {
>> diff --git a/arch/arm64/boot/dts/amlogic/meson-gxbb.dtsi
>> b/arch/arm64/boot/dts/amlogic/meson-gxbb.dtsi
>> index af834cdbba79..b77f2593cdc3 100644
>> --- a/arch/arm64/boot/dts/amlogic/meson-gxbb.dtsi
>> +++ b/arch/arm64/boot/dts/amlogic/meson-gxbb.dtsi
>> @@ -686,10 +686,9 @@
>>     compatible = "amlogic,meson-gxbb-saradc", "amlogic,meson-saradc";
>>     clocks = <>,
>>         < CLKID_SAR_ADC>,
>> -       < CLKID_SANA>,
>>         < CLKID_SAR_ADC_CLK>,
>>         < CLKID_SAR_ADC_SEL>;
>> -   clock-names = "clkin", "core", "sana", "adc_clk", "adc_sel";
>> +   clock-names = "clkin", "core", "adc_clk", "adc_sel";
>>  };
>>
>>  _emmc_a {
>> diff --git a/arch/arm64/boot/dts/amlogic/meson-gxl.dtsi
>> b/arch/arm64/boot/dts/amlogic/meson-gxl.dtsi
>> index d8dd3298b15c..07805a3b4db0 100644
>> --- a/arch/arm64/boot/dts/amlogic/meson-gxl.dtsi
>> +++ b/arch/arm64/boot/dts/amlogic/meson-gxl.dtsi
>> @@ -628,10 +628,9 @@
>>     compatible = "amlogic,meson-gxl-saradc", "amlogic,meson-saradc";
>>     clocks = <>,
>>         < CLKID_SAR_ADC>,
>> -       < CLKID_SANA>,
>>         < CLKID_SAR_ADC_CLK>,
>>         < CLKID_SAR_ADC_SEL>;
>> -   clock-names = "clkin", "core", "sana", "adc_clk", "adc_sel";
>> +   clock-names = "clkin", "core", "adc_clk", "adc_sel";
>>  };
>>
>>  _emmc_a {
>> --
>> 2.14.1
>>
>
> .
>
RE: [PATCH 1/2] mm: drop migrate type checks from has_unmovable_pages
Hello Michal,

> Date: Fri, 13 Oct 2017 14:00:12 +0200
>
> From: Michal Hocko
>
> Michael has noticed that the memory offline tries to migrate kernel code
> pages when doing echo 0 > /sys/devices/system/memory/memory0/online
>
> The current implementation will fail the operation after several failed page
> migration attempts but we shouldn't even attempt to migrate that memory
> and fail right away because this memory is clearly not migrateable. This will
> become a real problem when we drop the retry loop counter resp. timeout.
>
> The real problem is in has_unmovable_pages in fact. We should fail if there
> are any non migrateable pages in the area. In orther to guarantee that
> remove the migrate type checks because MIGRATE_MOVABLE is not
> guaranteed to contain only migrateable pages. It is merely a heuristic.
> Similarly MIGRATE_CMA does guarantee that the page allocator doesn't
> allocate any non-migrateable pages from the block but CMA allocations
> themselves are unlikely to migrateable. Therefore remove both checks.
>
> Reported-by: Michael Ellerman
> Signed-off-by: Michal Hocko
> Tested-by: Michael Ellerman
> Acked-by: Vlastimil Babka
> ---
>  mm/page_alloc.c | 3 ---
>  1 file changed, 3 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3badcedf96a7..ad0294ab3e4f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7355,9 +7355,6 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
>          */
>         if (zone_idx(zone) == ZONE_MOVABLE)
>                 return false;
> -       mt = get_pageblock_migratetype(page);
> -       if (mt == MIGRATE_MOVABLE || is_migrate_cma(mt))
> -               return false;

Removing this check causes the DWC3 USB controller to fail on initialization
with Layerscape processors (such as LS1043A), as shown below:

[2.701437] xhci-hcd xhci-hcd.0.auto: new USB bus registered, assigned bus number 1
[2.710949] cma: cma_alloc: alloc failed, req-size: 1 pages, ret: -16
[2.717411] xhci-hcd xhci-hcd.0.auto: can't setup: -12
[2.727940] xhci-hcd xhci-hcd.0.auto: USB bus 1 deregistered
[2.733607] xhci-hcd: probe of xhci-hcd.0.auto failed with error -12
[2.739978] xhci-hcd xhci-hcd.1.auto: xHCI Host Controller

I also noticed that someone recently reported to you that DWC2 is affected
as well, so do you have a solution now?

Best regards
Ran

>
>         pfn = page_to_pfn(page);
>         for (found = 0, iter = 0; iter < pageblock_nr_pages; iter++) {
[GIT PULL] locking changes for v4.15
Linus,

Please pull the latest locking-core-for-linus git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git locking-core-for-linus

   # HEAD: 450cbdd0125cfa5d7bbf9e2a6b6961cc48d29730 locking/x86: Use LOCK ADD for smp_mb() instead of MFENCE

The main changes in this cycle are:

 - Another attempt at enabling cross-release lockdep dependency tracking
   (automatically part of CONFIG_PROVE_LOCKING=y), this time with better
   performance and fewer false positives. (Byungchul Park)

 - Introduce lockdep_assert_irqs_enabled()/disabled() and convert
   open-coded equivalents to lockdep variants. (Frederic Weisbecker)

 - Add down_read_killable() and use it in the VFS's iterate_dir() method.
   (Kirill Tkhai)

 - Convert remaining uses of ACCESS_ONCE() to READ_ONCE()/WRITE_ONCE().
   Most of the conversion was Coccinelle driven. (Mark Rutland, Paul E. McKenney)

 - Get rid of lockless_dereference(), by strengthening Alpha atomics,
   strengthening READ_ONCE() with smp_read_barrier_depends() and thus
   being able to convert users of lockless_dereference() to READ_ONCE().
   (Will Deacon)

 - Various micro-optimizations:

     - better PV qspinlocks (Waiman Long),
     - better x86 barriers (Michael S. Tsirkin)
     - better x86 refcounts (Kees Cook)

 - ... plus other fixes and enhancements. (Borislav Petkov, Juergen Gross,
   Miguel Bernal Marin)

 Thanks,

	Ingo

-->
Borislav Petkov (1):
      locking/static_keys: Improve uninitialized key warning

Byungchul Park (8):
      locking/lockdep: Provide empty lockdep_map structure for !CONFIG_LOCKDEP
      locking/lockdep, sched/completions: Change the prefix of lock name for completion variables
      locking/lockdep: Add a boot parameter allowing unwind in cross-release and disable it by default
      locking/lockdep: Remove the BROKEN flag from CONFIG_LOCKDEP_CROSSRELEASE and CONFIG_LOCKDEP_COMPLETIONS
      locking/lockdep: Introduce CONFIG_BOOTPARAM_LOCKDEP_CROSSRELEASE_FULLSTACK=y
      sched/completions: Add support for initializing completions with lockdep_map
      workqueue: Remove now redundant lock acquisitions wrt. workqueue flushes
      block, locking/lockdep: Assign a lock_class per gendisk used for wait_for_completion()

Cheng Jian (1):
      locking/rwlocks: Fix comments

Christoph Hellwig (1):
      block: Use DECLARE_COMPLETION_ONSTACK() in submit_bio_wait()

Dou Liyang (1):
      x86/paravirt: Set up the virt_spin_lock_key after static keys get initialized

Frederic Weisbecker (14):
      locking/lockdep: Add IRQs disabled/enabled assertion APIs: lockdep_assert_irqs_enabled()/disabled()
      irq/softirqs: Use lockdep to assert IRQs are disabled/enabled
      workqueue: Use lockdep to assert IRQs are disabled/enabled
      timers/nohz: Use lockdep to assert IRQs are disabled/enabled
      timers/hrtimer: Use lockdep to assert IRQs are disabled/enabled
      smp/core: Use lockdep to assert IRQs are disabled/enabled
      x86: Use lockdep to assert IRQs are disabled/enabled
      perf/core: Use lockdep to assert IRQs are disabled/enabled
      irq/timings: Use lockdep to assert IRQs are disabled/enabled
      irq_work: Use lockdep to assert IRQs are disabled/enabled
      sched/clock, sched/cputime: Use lockdep to assert IRQs are disabled/enabled
      timers/posix-cpu-timers: Use lockdep to assert IRQs are disabled/enabled
      netpoll: Use lockdep to assert IRQs are disabled/enabled
      rcu: Use lockdep to assert IRQs are disabled/enabled

Juergen Gross (2):
      locking/paravirt: Use new static key for controlling call of virt_spin_lock()
      locking/spinlocks, paravirt, xen: Correct the xen_nopvspin case

Kees Cook (2):
      locking/refcounts, x86/asm: Use unique .text section for refcount exceptions
      locking/refcounts, x86/asm: Enable CONFIG_ARCH_HAS_REFCOUNT

Kirill Tkhai (6):
      locking/arch, alpha: Add __down_read_killable()
      locking/arch, ia64: Add __down_read_killable()
      locking/arch, s390: Add __down_read_killable()
      locking/arch, x86: Add __down_read_killable()
      locking/rwsem: Add down_read_killable()
      locking/rwsem, fs: Use killable down_read() in iterate_dir()

Mark Rutland (14):
      locking/atomics, dm-integrity: Convert ACCESS_ONCE() to READ_ONCE()/WRITE_ONCE()
      locking/atomics, EDAC/altera: Convert ACCESS_ONCE() to READ_ONCE()/WRITE_ONCE()
      locking/atomics, firmware/ivc: Convert ACCESS_ONCE() to READ_ONCE()/WRITE_ONCE()
      locking/atomics, fs/dcache: Convert ACCESS_ONCE() to READ_ONCE()/WRITE_ONCE()
      locking/atomics, fs/ncpfs: Convert ACCESS_ONCE() to READ_ONCE()/WRITE_ONCE()
      locking/atomics, media/dvb_ringbuffer: Convert ACCESS_ONCE() to READ_ONCE()/WRITE_ONCE()
      locking/atomics, net/netlink/netfilter: Convert ACCESS_ONCE() to READ_ONCE()/WRITE_ONCE()
      locking/atomics, net/ipv4/tcp_input.c: Convert ACCESS_ONCE() to READ_ONCE()/WRITE_ONCE()
      locking/atomics, net/average:
[GIT PULL] usercopy whitelisting for v4.15-rc1
Hi,

Please pull these hardened usercopy whitelisting changes for v4.15-rc1. This significantly narrows the areas of memory that can be copied to/from userspace in the face of usercopy bugs.

Thanks!

-Kees

The following changes since commit 9e66317d3c92ddaab330c125dfe9d06eee268aff:

  Linux 4.14-rc3 (2017-10-01 14:54:54 -0700)

are available in the git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git tags/usercopy-v4.15-rc1

for you to fetch changes up to 3889a28c449c01cebe166e413a58742002c2352b:

  lkdtm: Update usercopy tests for whitelisting (2017-11-08 15:40:04 -0800)

Currently, hardened usercopy performs dynamic bounds checking on slab cache objects. This is good, but still leaves a lot of kernel memory available to be copied to/from userspace in the face of bugs. To further restrict what memory is available for copying, this creates a way to whitelist specific areas of a given slab cache object for copying to/from userspace, allowing much finer granularity of access control. Slab caches that are never exposed to userspace can declare no whitelist for their objects, thereby keeping them unavailable to userspace via dynamic copy operations. (Note, an implicit form of whitelisting is the use of constant sizes in usercopy operations and get_user()/put_user(); these bypass hardened usercopy checks since these sizes cannot change at runtime.)

David Windsor (23):
      usercopy: Prepare for usercopy whitelisting
      usercopy: Enforce slab cache usercopy region boundaries
      usercopy: Mark kmalloc caches as usercopy caches
      dcache: Define usercopy region in dentry_cache slab cache
      vfs: Define usercopy region in names_cache slab caches
      vfs: Copy struct mount.mnt_id to userspace using put_user()
      ext4: Define usercopy region in ext4_inode_cache slab cache
      ext2: Define usercopy region in ext2_inode_cache slab cache
      jfs: Define usercopy region in jfs_ip slab cache
      befs: Define usercopy region in befs_inode_cache slab cache
      exofs: Define usercopy region in exofs_inode_cache slab cache
      orangefs: Define usercopy region in orangefs_inode_cache slab cache
      ufs: Define usercopy region in ufs_inode_cache slab cache
      vxfs: Define usercopy region in vxfs_inode slab cache
      cifs: Define usercopy region in cifs_request slab cache
      scsi: Define usercopy region in scsi_sense_cache slab cache
      net: Define usercopy region in struct proto slab cache
      ip: Define usercopy region in IP proto slab cache
      caif: Define usercopy region in caif proto slab cache
      sctp: Define usercopy region in SCTP proto slab cache
      sctp: Copy struct sctp_sock.autoclose to userspace using put_user()
      fork: Define usercopy region in mm_struct slab caches
      fork: Define usercopy region in thread_stack slab caches

Kees Cook (8):
      net: Restrict unwhitelisted proto caches to size 0
      fork: Provide usercopy whitelisting for task_struct
      x86: Implement thread_struct whitelist for hardened usercopy
      arm64: Implement thread_struct whitelist for hardened usercopy
      arm: Implement thread_struct whitelist for hardened usercopy
      usercopy: Allow for temporary fallback for non-whitelisted usercopy
      usercopy: Restrict non-usercopy caches to size 0
      lkdtm: Update usercopy tests for whitelisting

Paolo Bonzini (2):
      kvm: whitelist struct kvm_vcpu_arch
      kvm: x86: fix KVM_XEN_HVM_CONFIG ioctl

 arch/Kconfig                       | 11 +
 arch/arm/Kconfig                   |  1 +
 arch/arm/include/asm/processor.h   |  7 +++
 arch/arm64/Kconfig                 |  1 +
 arch/arm64/include/asm/processor.h |  8
 arch/x86/Kconfig                   |  1 +
 arch/x86/include/asm/processor.h   |  8
 arch/x86/kvm/x86.c                 |  7 +--
 drivers/misc/lkdtm.h               |  4 +-
 drivers/misc/lkdtm_core.c          |  4 +-
 drivers/misc/lkdtm_usercopy.c      | 88 +-
 drivers/scsi/scsi_lib.c            |  9 ++--
 fs/befs/linuxvfs.c                 | 14 +++---
 fs/cifs/cifsfs.c                   | 10 +++--
 fs/dcache.c                        |  9 ++--
 fs/exofs/super.c                   |  7 ++-
 fs/ext2/super.c                    | 12 +++---
 fs/ext4/super.c                    | 12 +++---
 fs/fhandle.c                       |  3 +-
 fs/freevxfs/vxfs_super.c           |  8 +++-
 fs/jfs/super.c                     |  8 ++--
 fs/orangefs/super.c                | 15 ---
 fs/ufs/super.c                     | 13 +++---
 include/linux/sched/task.h         | 14 ++
 include/linux/slab.h               | 27 +---
 include/linux/slab_def.h           |  3 ++
 include/linux/slub_def.h           |  3 ++
 include/linux/stddef.h             |  2 +
 include/net/sctp/structs.h         |  9 +++-
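
The per-cache whitelisting described in this pull request can be pictured with a small userspace model. This is only a sketch under stated assumptions: struct toy_cache, its useroffset and usersize fields, and toy_usercopy_allowed() are hypothetical names chosen for illustration, and the check is not taken from the series itself. It shows one way a copy could be accepted only when it lies entirely inside a declared region of an object.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a slab cache; all names here are hypothetical. */
struct toy_cache {
    size_t object_size; /* size of each object in the cache */
    size_t useroffset;  /* start of the whitelisted region within an object */
    size_t usersize;    /* length of the whitelisted region; 0 means none */
};

/*
 * Return true if copying n bytes starting at byte 'offset' of an object
 * stays entirely inside the cache's whitelisted region.
 */
static bool toy_usercopy_allowed(const struct toy_cache *c,
                                 size_t offset, size_t n)
{
    /* The region itself must fit inside the object. */
    assert(c->useroffset <= c->object_size);
    assert(c->usersize <= c->object_size - c->useroffset);

    if (offset < c->useroffset)
        return false;
    /* Compare differences rather than sums so offset + n cannot overflow. */
    if (offset - c->useroffset > c->usersize)
        return false;
    return n <= c->usersize - (offset - c->useroffset);
}
```

With a 64-byte object whose whitelisted region covers bytes 16 to 47, a 32-byte copy at offset 16 passes, while a 33-byte copy at the same offset, or any copy starting at offset 8, is rejected. A cache with usersize 0 rejects every non-empty copy, which corresponds to the "declare no whitelist" case in the description.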
Re: [PATCH 4/4] kbuild: optimize object directory creation for incremental build
Hi Cao,

2017-11-10 19:58 GMT+09:00 Cao jin:
> Masahiro-san
>
> On 11/09/2017 11:41 PM, Masahiro Yamada wrote:
>> The previous commit largely optimized the object directory creation.
>> We can optimize it more for incremental build.
>>
>> There are already *.cmd files in the output directory. The existing
>> *.cmd files have been picked up by $(wildcard ...). Obviously,
>> directories containing them exist too, so we can skip "mkdir -p".
>>
>> With this, Kbuild runs almost zero "mkdir -p" in incremental building.
>>
>> Signed-off-by: Masahiro Yamada
>> ---
>>
>>  scripts/Makefile.build | 5 +++++
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/scripts/Makefile.build b/scripts/Makefile.build
>> index 89ac180..90ea7a5 100644
>> --- a/scripts/Makefile.build
>> +++ b/scripts/Makefile.build
>> @@ -583,8 +583,13 @@ endif
>>  ifneq ($(KBUILD_SRC),)
>>  # Create directories for object files if directory does not exist
>>  obj-dirs := $(sort $(obj) $(patsubst %/,%, $(dir $(targets))))
>> +# If cmd_files exist, their directories apparently exist. Skip mkdir.
>> +exist-dirs := $(sort $(patsubst %/,%, $(dir $(cmd_files))))
>> +obj-dirs := $(strip $(filter-out . $(exist-dirs), $(obj-dirs)))
>
> First I am not sure if the dot "." here is necessary, because I guess
> kbuild always descend into subdir do recursive make, so, very
> $(cmd_files) should have at least 1 level dir.

The top-level Makefile descends into ./Kbuild:

prepare0: archprepare gcc-plugins
	$(Q)$(MAKE) $(build)=.

So, it is possible to have a 0-level dir.

> Second, Assuming that "." probably exists,

Right, "." always exists. That's why I filtered it out.

> would it be "./"? because it
> is what "dir" function returns.

No. You missed $(patsubst %/,%, ...), which strips the trailing slash.

Having said that, "." generally comes from phony targets and I think I can fix it in a more correct way. I will remove "." from v2.

> --
> Sincerely,
> Cao jin
>
>> +ifneq ($(obj-dirs),)
>>  $(shell mkdir -p $(obj-dirs))
>>  endif
>> +endif
>>

--
Best Regards
Masahiro Yamada
Re: [PATCH 2/3] Input: twl6040-vibra: fix child-node lookup
On 2017-11-11 17:43, Johan Hovold wrote:
> Fix child-node lookup during probe, which ended up searching the whole
> device tree depth-first starting at parent rather than just matching on
> its children.
>
> Later sanity checks on node properties (which would likely be missing)
> should prevent this from causing much trouble however, especially as the
> original premature free of the parent node has already been fixed
> separately (but that "fix" was apparently never backported to stable).
>
> Fixes: e7ec014a47e4 ("Input: twl6040-vibra - update for device tree support")
> Fixes: c52c545ead97 ("Input: twl6040-vibra - fix DT node memory management")
> Cc: stable # 3.6
> Cc: Peter Ujfalusi
> Cc: H. Nikolaus Schaller
> Signed-off-by: Johan Hovold

Acked-by: Peter Ujfalusi

> ---
>  drivers/input/misc/twl6040-vibra.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/input/misc/twl6040-vibra.c b/drivers/input/misc/twl6040-vibra.c
> index 5690eb7ff954..15e0d352c4cc 100644
> --- a/drivers/input/misc/twl6040-vibra.c
> +++ b/drivers/input/misc/twl6040-vibra.c
> @@ -248,8 +248,7 @@ static int twl6040_vibra_probe(struct platform_device *pdev)
>  	int vddvibr_uV = 0;
>  	int error;
>
> -	of_node_get(twl6040_core_dev->of_node);
> -	twl6040_core_node = of_find_node_by_name(twl6040_core_dev->of_node,
> +	twl6040_core_node = of_get_child_by_name(twl6040_core_dev->of_node,
>  						 "vibra");
>  	if (!twl6040_core_node) {
>  		dev_err(&pdev->dev, "parent of node is missing?\n");
>

- Péter

Texas Instruments Finland Oy, Porkkalankatu 22, 00180 Helsinki.
Y-tunnus/Business ID: 0615521-4. Kotipaikka/Domicile: Helsinki
Re: [PATCH 1/3] Input: twl4030-vibra: fix sibling-node lookup
On 2017-11-11 17:43, Johan Hovold wrote:
> A helper purported to look up a child node based on its name was using
> the wrong of-helper and ended up prematurely freeing the parent of-node
> while searching the whole device tree depth-first starting at the parent
> node.
>
> Fixes: 64b9e4d803b1 ("input: twl4030-vibra: Support for DT booted kernel")
> Fixes: e661d0a04462 ("Input: twl4030-vibra - fix ERROR: Bad of_node_put() warning")
> Cc: stable # 3.7
> Cc: Peter Ujfalusi
> Cc: Marek Belisko
> Signed-off-by: Johan Hovold

Acked-by: Peter Ujfalusi

> ---
>  drivers/input/misc/twl4030-vibra.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/input/misc/twl4030-vibra.c b/drivers/input/misc/twl4030-vibra.c
> index 6c51d404874b..c37aea9ac272 100644
> --- a/drivers/input/misc/twl4030-vibra.c
> +++ b/drivers/input/misc/twl4030-vibra.c
> @@ -178,12 +178,14 @@ static SIMPLE_DEV_PM_OPS(twl4030_vibra_pm_ops,
>  			 twl4030_vibra_suspend, twl4030_vibra_resume);
>
>  static bool twl4030_vibra_check_coexist(struct twl4030_vibra_data *pdata,
> -					struct device_node *node)
> +					struct device_node *parent)
>  {
> +	struct device_node *node;
> +
>  	if (pdata && pdata->coexist)
>  		return true;
>
> -	node = of_find_node_by_name(node, "codec");
> +	node = of_get_child_by_name(parent, "codec");
>  	if (node) {
>  		of_node_put(node);
>  		return true;
>

- Péter

Texas Instruments Finland Oy, Porkkalankatu 22, 00180 Helsinki.
Y-tunnus/Business ID: 0615521-4. Kotipaikka/Domicile: Helsinki
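
The difference between the two lookups discussed in this patch (and in the twl6040 one) can be illustrated with a toy tree. The sketch below does not use the kernel's OF API: struct toy_node and the toy_* functions are hypothetical stand-ins, and they model only the two properties the commit message relies on, namely that the tree-wide helper keeps searching past the children of the starting node and drops a reference on that starting node, while the child-only helper does neither.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy device-tree node; every name in this sketch is hypothetical. */
struct toy_node {
    const char *name;
    int refcount;
    struct toy_node *parent;
    struct toy_node *child;   /* first child */
    struct toy_node *sibling; /* next sibling */
};

/* Next node in depth-first order over the whole tree, or NULL at the end. */
static struct toy_node *toy_next(struct toy_node *n)
{
    if (n->child)
        return n->child;
    /* No child: climb until some ancestor (or n itself) has a sibling. */
    while (n && !n->sibling)
        n = n->parent;
    return n ? n->sibling : NULL;
}

/*
 * Tree-wide search: visits every node after 'from' in depth-first order,
 * not only its children, and drops one reference on 'from' before returning.
 */
static struct toy_node *toy_find_by_name(struct toy_node *from,
                                         const char *name)
{
    struct toy_node *np;

    assert(from != NULL);
    for (np = toy_next(from); np; np = toy_next(np)) {
        if (strcmp(np->name, name) == 0) {
            np->refcount++; /* the caller owns a reference on the result */
            break;
        }
    }
    from->refcount--;
    return np;
}

/* Child-only search: looks at direct children and leaves 'parent' alone. */
static struct toy_node *toy_get_child_by_name(struct toy_node *parent,
                                              const char *name)
{
    struct toy_node *c;

    assert(parent != NULL);
    for (c = parent->child; c; c = c->sibling) {
        if (strcmp(c->name, name) == 0) {
            c->refcount++; /* the caller owns a reference on the result */
            return c;
        }
    }
    return NULL;
}
```

In a toy tree where "twl" has a single child "vibra" and an unrelated sibling "audio" has a child "codec", toy_get_child_by_name(twl, "codec") returns NULL and leaves twl's reference count untouched. toy_find_by_name(twl, "codec") instead returns the unrelated codec node and decrements twl's count, which is the combination of wrong match and premature release that the patch description refers to.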
Re: [PATCH] KVM: x86: inject exceptions produced by x86_decode_insn
2017-11-10 17:49 GMT+08:00 Paolo Bonzini:
> Sometimes, a processor might execute an instruction while another
> processor is updating the page tables for that instruction's code page,
> but before the TLB shootdown completes. The interesting case happens
> if the page is in the TLB.
>
> In general, the processor will succeed in executing the instruction and
> nothing bad happens. However, what if the instruction is an MMIO access?
> If *that* happens, KVM invokes the emulator, and the emulator gets the
> updated page tables. If the update side had marked the code page as non
> present, the page table walk then will fail and so will x86_decode_insn.
>
> Unfortunately, even though kvm_fetch_guest_virt is correctly returning
> X86EMUL_PROPAGATE_FAULT, x86_decode_insn's caller treats the failure as
> a fatal error if the instruction cannot simply be reexecuted (as is the
> case for MMIO). And this in fact happened sometimes when rebooting
> Windows 2012r2 guests. Just checking ctxt->have_exception and injecting
> the exception if true is enough to fix the case.

I found that the only place that can set ctxt->have_exception is x86_emulate_insn(); x86_decode_insn() will not set ctxt->have_exception even if kvm_fetch_guest_virt() returns X86EMUL_PROPAGATE_FAULT.

Regards,
Wanpeng Li

>
> Thanks to Eduardo Habkost for helping in the debugging of this issue.
>
> Reported-by: Yanan Fu
> Cc: Eduardo Habkost
> Cc: sta...@vger.kernel.org
> Signed-off-by: Paolo Bonzini
> ---
>  arch/x86/kvm/x86.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 34c85aa2e2d1..6dbed9022797 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5722,6 +5722,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
>  		if (reexecute_instruction(vcpu, cr2,
>  					write_fault_to_spt,
>  					emulation_type))
>  			return EMULATE_DONE;
> +		if (ctxt->have_exception && inject_emulated_exception(vcpu))
> +			return EMULATE_DONE;
>  		if (emulation_type & EMULTYPE_SKIP)
>  			return EMULATE_FAIL;
>  		return handle_emulation_failure(vcpu);
> --
> 1.8.3.1
>
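
The order of checks in the patched decode-failure path, and the point raised in the reply, can be traced with a toy model. This is only a sketch: struct toy_ctxt, enum toy_result and toy_decode_failed() are hypothetical, the boolean fields merely stand in for the results of the kernel helpers named in the diff, and what the real inject_emulated_exception() return value signifies is deliberately not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcomes mirroring the three exits in the quoted hunk. */
enum toy_result { TOY_DONE, TOY_FAIL, TOY_FAILURE_HANDLER };

/* Hypothetical inputs; each flag stands in for one condition in the hunk. */
struct toy_ctxt {
    bool can_reexecute;  /* reexecute_instruction() would return true */
    bool have_exception; /* the decoder recorded a pending exception */
    bool inject_result;  /* value the injection helper would return */
    bool skip;           /* EMULTYPE_SKIP is set in emulation_type */
    bool injected;       /* set here when the injection helper is "called" */
};

/* Same branch order as the patched code after a failed decode. */
static enum toy_result toy_decode_failed(struct toy_ctxt *c)
{
    assert(c != NULL);

    if (c->can_reexecute)
        return TOY_DONE;
    /* && short-circuits: the helper runs only if have_exception is set. */
    if (c->have_exception) {
        c->injected = true;
        if (c->inject_result)
            return TOY_DONE;
    }
    if (c->skip)
        return TOY_FAIL;
    return TOY_FAILURE_HANDLER;
}
```

In this model, a context with have_exception cleared never reaches the injection step and falls through to the failure handling exactly as before the patch. That is the situation the reply asks about: if the decode stage does not set have_exception, the new branch cannot take effect.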
[GIT PULL] RCU updates for v4.15
Linus,

Please pull the latest core-rcu-for-linus git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git core-rcu-for-linus

   # HEAD: 72bc286b81d21404cdfecddf76b64c7163aac764 Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu

The main changes in this cycle are:

 - Documentation updates
 - RCU CPU stall-warning updates
 - Torture-test updates
 - Miscellaneous fixes

Size wise the biggest updates are to documentation. Excluding documentation the diffstat becomes:

   18 files changed, 205 insertions(+), 79 deletions(-)

... and most of the code increase comes from a single commit which expands debugging:

   9b9500da8150: rcu: Make RCU CPU stall warnings check for irq-disabled CPUs

 Thanks,

	Ingo

-->
Alan Stern (1):
      memory-barriers: Rework multicopy-atomicity section

Guilherme G. Piccoli (1):
      doc: Rewrite confusing statement about memory barriers

Neeraj Upadhyay (1):
      rcu: Fix up pending cbs check in rcu_prepare_for_idle

Paul E. McKenney (18):
      documentation: RCU grace-period memory ordering guarantees
      documentation: Long-running irq handlers can stall RCU grace periods
      documentation: Slow systems can stall RCU grace periods
      documentation: Update RCU CPU stall warning messages
      memory-barriers: Replace uses of "transitive"
      rcu: Create call_rcu_tasks() kthread at boot time
      irq_work: Map irq_work_on_queue() to irq_work_on() in !SMP
      sched: Make resched_cpu() unconditional
      sched,rcu: Make cond_resched() provide RCU quiescent state
      rcu: Make RCU CPU stall warnings check for irq-disabled CPUs
      rcu: Turn off tracing before dumping trace
      rcu: Suppress RCU CPU stall warnings while dumping trace
      rcutorture: Add interrupt-disable capability to stall-warning tests
      rcutorture: Dump writer stack if stalled
      torture: Provide TMPDIR environment variable to specify tmpdir
      rcu: Suppress lockdep false-positive ->boost_mtx complaints
      rcu: Add extended-quiescent-state testing advice
      srcu: Add parameters to SRCU docbook comments

Scott Tsai (1):
      memory-barriers.txt: Fix typo in pairing example

Sebastian Andrzej Siewior (2):
      rcu: Do not include rtmutex_common.h unconditionally
      rcu/segcblist: Include rcupdate.h

 .../Design/Memory-Ordering/Tree-RCU-Diagram.html   |    9 +
 .../Memory-Ordering/Tree-RCU-Memory-Ordering.html  |  707 +++
 .../TreeRCU-callback-invocation.svg                |  486 ++
 .../Memory-Ordering/TreeRCU-callback-registry.svg  |  655 +++
 .../RCU/Design/Memory-Ordering/TreeRCU-dyntick.svg |  700 +++
 .../Design/Memory-Ordering/TreeRCU-gp-cleanup.svg  | 1126 +
 .../RCU/Design/Memory-Ordering/TreeRCU-gp-fqs.svg  | 1309 +
 .../Design/Memory-Ordering/TreeRCU-gp-init-1.svg   |  656 +++
 .../Design/Memory-Ordering/TreeRCU-gp-init-2.svg   |  656 +++
 .../Design/Memory-Ordering/TreeRCU-gp-init-3.svg   |  632 +++
 .../RCU/Design/Memory-Ordering/TreeRCU-gp.svg      | 5135
 .../RCU/Design/Memory-Ordering/TreeRCU-hotplug.svg |  775 +++
 .../RCU/Design/Memory-Ordering/TreeRCU-qs.svg      | 1095 +
 .../RCU/Design/Memory-Ordering/rcu_node-lock.svg   |  229 +
 Documentation/RCU/stallwarn.txt                    |  200 +-
 Documentation/admin-guide/kernel-parameters.txt    |    3 +
 Documentation/memory-barriers.txt                  |  197 +-
 include/linux/irq_work.h                           |    3 -
 kernel/irq_work.c                                  |    9 +-
 kernel/rcu/rcu.h                                   |   21 +-
 kernel/rcu/rcu_segcblist.c                         |    1 +
 kernel/rcu/rcutorture.c                            |   24 +-
 kernel/rcu/tree.c                                  |  159 +-
 kernel/rcu/tree.h                                  |    5 +
 kernel/rcu/tree_plugin.h                           |   14 +-
 kernel/rcu/update.c                                |   25 +-
 kernel/sched/core.c                                |    5 +-
 .../selftests/rcutorture/bin/config_override.sh    |    2 +-
 .../selftests/rcutorture/bin/configcheck.sh        |    2 +-
 .../testing/selftests/rcutorture/bin/configinit.sh |    2 +-
 .../testing/selftests/rcutorture/bin/kvm-build.sh  |    2 +-
 .../selftests/rcutorture/bin/kvm-test-1-run.sh     |    2 +-
 tools/testing/selftests/rcutorture/bin/kvm.sh      |    4 +-
 .../selftests/rcutorture/bin/parse-build.sh        |    2 +-
 .../selftests/rcutorture/bin/parse-torture.sh      |    2 +-
 35 files changed, 14598 insertions(+), 256 deletions(-)
 create mode 100644 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Diagram.html
 create mode 100644 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html
 create mode 100644 Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-invocation.svg
 create mode 100644 Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-registry.svg
Re: [PATCHv4 3/6] powerpc64: Add .opd based function descriptor dereference
* Sergey Senozhatsky wrote (on 2017-11-10 08:48:27 +0900): > We are moving towards separate kernel and module function descriptor > dereference callbacks. This patch enables it for powerpc64. > > For pointers that belong to the kernel > - Added __start_opd and __end_opd pointers, to track the kernel >.opd section address range; > > - Added dereference_kernel_function_descriptor(). Now we >will dereference only function pointers that are within >[__start_opd, __end_opd); > > For pointers that belong to a module > - Added dereference_module_function_descriptor() to handle module >function descriptor dereference. Now we will dereference only >pointers that are within [module->opd.start, module->opd.end). > > Signed-off-by: Sergey Senozhatsky > --- > arch/powerpc/include/asm/module.h | 3 +++ > arch/powerpc/include/asm/sections.h | 12 > arch/powerpc/kernel/module_64.c | 14 ++ > arch/powerpc/kernel/vmlinux.lds.S | 2 ++ > 4 files changed, 31 insertions(+) > Looks good on powerpc. If you wish: Tested-by: Santosh Sivaraj # for powerpc Thanks, Santosh > diff --git a/arch/powerpc/include/asm/module.h > b/arch/powerpc/include/asm/module.h > index 6c0132c7212f..7e28442827f1 100644 > --- a/arch/powerpc/include/asm/module.h > +++ b/arch/powerpc/include/asm/module.h > @@ -45,6 +45,9 @@ struct mod_arch_specific { > unsigned long tramp; > #endif > > + /* For module function descriptor dereference */ > + unsigned long start_opd; > + unsigned long end_opd; > #else /* powerpc64 */ > /* Indices of PLT sections within module. 
*/ > unsigned int core_plt_section; > diff --git a/arch/powerpc/include/asm/sections.h > b/arch/powerpc/include/asm/sections.h > index 82bec63bbd4f..e335a8f846af 100644 > --- a/arch/powerpc/include/asm/sections.h > +++ b/arch/powerpc/include/asm/sections.h > @@ -66,6 +66,9 @@ static inline int overlaps_kvm_tmp(unsigned long start, > unsigned long end) > } > > #ifdef PPC64_ELF_ABI_v1 > + > +#define HAVE_DEREFERENCE_FUNCTION_DESCRIPTOR 1 > + > #undef dereference_function_descriptor > static inline void *dereference_function_descriptor(void *ptr) > { > @@ -76,6 +79,15 @@ static inline void *dereference_function_descriptor(void > *ptr) > ptr = p; > return ptr; > } > + > +#undef dereference_kernel_function_descriptor > +static inline void *dereference_kernel_function_descriptor(void *ptr) > +{ > + if (ptr < (void *)__start_opd || ptr >= (void *)__end_opd) > + return ptr; > + > + return dereference_function_descriptor(ptr); > +} > #endif /* PPC64_ELF_ABI_v1 */ > > #endif > diff --git a/arch/powerpc/kernel/module_64.c b/arch/powerpc/kernel/module_64.c > index 759104b99f9f..218971ac7e04 100644 > --- a/arch/powerpc/kernel/module_64.c > +++ b/arch/powerpc/kernel/module_64.c > @@ -93,6 +93,15 @@ static unsigned int local_entry_offset(const Elf64_Sym > *sym) > { > return 0; > } > + > +void *dereference_module_function_descriptor(struct module *mod, void *ptr) > +{ > + if (ptr < (void *)mod->arch.start_opd || > + ptr >= (void *)mod->arch.end_opd) > + return ptr; > + > + return dereference_function_descriptor(ptr); > +} > #endif > > #define STUB_MAGIC 0x73747562 /* stub */ > @@ -344,6 +353,11 @@ int module_frob_arch_sections(Elf64_Ehdr *hdr, > else if (strcmp(secstrings+sechdrs[i].sh_name,"__versions")==0) > dedotify_versions((void *)hdr + sechdrs[i].sh_offset, > sechdrs[i].sh_size); > + else if (!strcmp(secstrings + sechdrs[i].sh_name, ".opd")) { > + me->arch.start_opd = sechdrs[i].sh_addr; > + me->arch.end_opd = sechdrs[i].sh_addr + > +sechdrs[i].sh_size; > + } > > /* We don't 
handle .init for the moment: rename to _init */ > while ((p = strstr(secstrings + sechdrs[i].sh_name, ".init"))) > diff --git a/arch/powerpc/kernel/vmlinux.lds.S > b/arch/powerpc/kernel/vmlinux.lds.S > index 0494e1566ee2..5dac5ab22fa2 100644 > --- a/arch/powerpc/kernel/vmlinux.lds.S > +++ b/arch/powerpc/kernel/vmlinux.lds.S > @@ -278,7 +278,9 @@ SECTIONS > } > > .opd : AT(ADDR(.opd) - LOAD_OFFSET) { > + __start_opd = .; > *(.opd) > + __end_opd = .; > } > > . = ALIGN(256); --
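Editorial sketch, not part of the patch: the check that both new helpers share is a half-open range test, [start, end), applied before the pointer is treated as a function descriptor. The standalone C below illustrates only that idea, assuming a simplified two-field descriptor. `struct func_desc`, `fake_opd`, `deref_in_range()` and `self_check()` are hypothetical names invented for the illustration; the kernel code in the quoted diff uses `__start_opd`/`__end_opd` and `mod->arch.start_opd`/`mod->arch.end_opd` instead.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a ppc64 ELFv1 function descriptor. */
struct func_desc {
	void *entry;	/* address of the function's first instruction */
	void *toc;	/* TOC base, unused in this sketch */
};

/* Dummy objects standing in for function text. */
static int code_a, code_b;

/* Hypothetical stand-in for an .opd section holding two descriptors. */
static struct func_desc fake_opd[2] = {
	{ &code_a, 0 },
	{ &code_b, 0 },
};

/*
 * Dereference ptr only if it lies inside [start, end); any other
 * pointer is returned unchanged, as in the helpers added by the patch.
 */
static void *deref_in_range(void *ptr, const struct func_desc *start,
			    const struct func_desc *end)
{
	uintptr_t p = (uintptr_t)ptr;

	if (p < (uintptr_t)start || p >= (uintptr_t)end)
		return ptr;

	return ((struct func_desc *)ptr)->entry;
}

/* Small self-check covering the in-range and out-of-range cases. */
static void self_check(void)
{
	/* Inside the section: the descriptor is resolved to its entry. */
	assert(deref_in_range(&fake_opd[1], fake_opd, fake_opd + 2) == (void *)&code_b);
	/* Outside the section: the pointer comes back untouched. */
	assert(deref_in_range(&code_a, fake_opd, fake_opd + 2) == (void *)&code_a);
	/* The end address itself is excluded (half-open range). */
	assert(deref_in_range(fake_opd + 2, fake_opd, fake_opd + 2) == (void *)(fake_opd + 2));
}
```

The half-open upper bound matters: the address one past the last descriptor may belong to an unrelated object, so it must not be dereferenced.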
Re: [kernel-hardening] [PATCH v4] scripts: add leaking_addresses.pl
On Mon, Nov 13, 2017 at 11:38 AM, Tobin C. Harding wrote: > On Mon, Nov 13, 2017 at 11:16:28AM +0530, kaiwan.billimo...@gmail.com wrote: >> On Mon, 2017-11-13 at 09:21 +1100, Tobin C. Harding wrote: >> > On Fri, Nov 10, 2017 at 07:26:34PM +0530, kaiwan.billimo...@gmail.com >> > wrote: >> > > On Tue, 2017-11-07 at 21:32 +1100, Tobin C. Harding wrote: >> > > > Currently we are leaking addresses from the kernel to user space. >> > > > This ... > > So, Linus has requested that I set up a tree for the development of > this. I have to work out the details of how to do that and then I'll > email you so you can get the pull the current version. I can then take > your patch via LKML as per usual. > Super. Thanks.
[PATCH] KVM: X86: Avoid to handle first-time write when updating the pv stuffs each time
From: Wanpeng Li

There is logic to handle the first-time write every time the
pvclock/wall clock/steal time shared memory pages are updated. We
should instead do this during setup of the pv features, if we suspect
that the guest cannot be guaranteed to initialize the version field to
an even number. This patch handles the first-time write of
pvclock/steal time during setup, since those updates are frequent, and
leaves the wall clock as it is, since it is rarely updated.

Cc: Paolo Bonzini
Cc: Radim Krčmář
Cc: Liran Alon
Signed-off-by: Wanpeng Li
---
 arch/x86/kvm/x86.c | 29 ++++++++++++++++++++++-------
 1 file changed, 22 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4552427..19311e0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1833,9 +1833,6 @@ static void kvm_setup_pvclock_page(struct kvm_vcpu *v)
 	 */
 	BUILD_BUG_ON(offsetof(struct pvclock_vcpu_time_info, version) != 0);
 
-	if (guest_hv_clock.version & 1)
-		++guest_hv_clock.version;  /* first time write, random junk */
-
 	vcpu->hv_clock.version = guest_hv_clock.version + 1;
 	kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
 				&vcpu->hv_clock,
@@ -2126,9 +2123,6 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
 
 	vcpu->arch.st.steal.preempted = 0;
 
-	if (vcpu->arch.st.steal.version & 1)
-		vcpu->arch.st.steal.version += 1;  /* first time write, random junk */
-
 	vcpu->arch.st.steal.version += 1;
 
 	kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
@@ -2256,8 +2250,19 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		     &vcpu->arch.pv_time, data & ~1ULL,
 		     sizeof(struct pvclock_vcpu_time_info)))
 			vcpu->arch.pv_time_enabled = false;
-		else
+		else {
+			struct pvclock_vcpu_time_info guest_hv_clock;
+
 			vcpu->arch.pv_time_enabled = true;
+			if (unlikely(kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.pv_time,
+				&guest_hv_clock, sizeof(guest_hv_clock))))
+				break;
+			if (guest_hv_clock.version & 1)
+				++guest_hv_clock.version;  /* first time write, random junk */
+			kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.pv_time,
+				&vcpu->arch.hv_clock,
+				sizeof(vcpu->arch.hv_clock.version));
+		}
 
 		break;
 	}
@@ -2283,6 +2288,16 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		if (!(data & KVM_MSR_ENABLED))
 			break;
 
+		if (unlikely(kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
+			&vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
+			break;
+
+		if (vcpu->arch.st.steal.version & 1)
+			vcpu->arch.st.steal.version += 1;  /* first time write, random junk */
+
+		kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
+			&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
+
 		kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
 
 		break;
-- 
2.7.4
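Editorial sketch, not part of the patch: the version field discussed above follows an even/odd convention, where an odd value means an update is in progress and an even value means the record is stable. The standalone C below shows the idea the patch relies on: normalize a possibly odd (uninitialized) version once at setup, so that each later update can simply bump the counter twice. `struct shared_rec`, `rec_setup()`, `rec_update()`, `rec_is_stable()` and `self_check()` are hypothetical names; the sketch deliberately leaves out the memory barriers and the guest-memory accessors that the real KVM code needs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical shared record; the real structures live in guest memory. */
struct shared_rec {
	uint32_t version;	/* odd: update in progress, even: stable */
	uint64_t payload;
};

/* One-time setup: the guest may have left random junk in version. */
static void rec_setup(struct shared_rec *r)
{
	if (r->version & 1)
		r->version += 1;	/* force an even starting value */
}

/* Frequent path: no first-time check needed once setup has run. */
static void rec_update(struct shared_rec *r, uint64_t value)
{
	r->version += 1;	/* now odd: readers must retry */
	/* a real implementation needs a write barrier here */
	r->payload = value;
	/* ... and another write barrier here */
	r->version += 1;	/* even again: the record is consistent */
}

/* Reader-side test for a stable record. */
static int rec_is_stable(const struct shared_rec *r)
{
	return (r->version & 1) == 0;
}

/* Small self-check starting from an odd, "junk" version. */
static void self_check(void)
{
	struct shared_rec r = { 7, 0 };

	rec_setup(&r);
	assert(r.version == 8);

	rec_update(&r, 42);
	assert(r.version == 10);
	assert(r.payload == 42);
	assert(rec_is_stable(&r));
}
```

Moving the parity check into setup keeps the frequent update path to two unconditional increments, which is the trade-off the commit message argues for.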
[GIT PULL] s390 updates for v4.15
Hello Linus, since Martin is on vacation you get the s390 pull request for the v4.15 merge window this time from me. Please pull from the 'for-linus' branch of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git for-linus to receive the following updates: Besides a lot of cleanups and bug fixes these are the most important changes: - A new regset for runtime instrumentation registers - Hardware accelerated AES-GCM support for the aes_s390 module - Support for the new CEX6S crypto cards - Support for FORTIFY_SOURCE - Addition of missing z13 and new z14 instructions to the in-kernel disassembler - Generate opcode tables for the in-kernel disassembler out of a simple text file instead of having to manually maintain those tables - Fast memset16, memset32 and memset64 implementations - Removal of named saved segment support - Hardware counter support for z14 - Queued spinlocks and queued rwlocks implementations for s390 - Use the stack_depth tracking feature for s390 BPF JIT - A new s390_sthyi system call which emulates the sthyi (store hypervisor information) instruction - Removal of the old KVM virtio transport - An s390 specific CPU alternatives implementation which is used in the new spinlock code You will see two trivial merge conflicts caused by the "SPDX GPL-2.0 license" commit and the removal of s390 specific header files: arch/s390/include/asm/rwsem.h arch/s390/include/uapi/asm/kvm_virtio.h The resolution is to simply remove those files. 
Thanks, Heiko Alice Frosi (2): s390/runtime_instrumentation: clean up struct runtime_instr_cb s390/ptrace: add runtime instrumention register get/set Arnd Bergmann (1): s390/dasd: avoid calling do_gettimeofday() Christian Borntraeger (2): s390/pci: do not require AIS facility s390/virtio: remove unused header file kvm_virtio.h Cornelia Huck (1): MAINTAINERS: add virtio-ccw.h to virtio/s390 section Dong Jia Shi (2): vfio: ccw: bypass bad idaw address when fetching IDAL ccws vfio: ccw: validate the count field of a ccw before pinning Elena Reshetova (1): vmur: convert urdev.ref_count from atomic_t to refcount_t Harald Freudenberger (7): s390/zcrypt: Explicitly check input data length. s390/crypto: add s390 platform specific aes gcm support. s390/zcrypt: CEX6S exploitation s390/zcrypt: Enable special header file flag for AU CPRP s390/zcrypt: Introduce QACT support for AP bus devices. s390/archrandom: Reconsider s390 arch random implementation s390/zcrypt: Rework struct ap_qact_ap_info. Heiko Carstens (31): s390: convert release_thread() into a static inline function s390/runtime instrumention: fix possible memory corruption s390/runtime instrumentation: simplify task exit handling s390/guarded storage: fix possible memory corruption s390/ptrace: fix guarded storage regset handling s390/guarded storage: simplify task exit handling s390: get rid of exit_thread() s390: add support for FORTIFY_SOURCE s390/cpumf: remove superfluous nr_cpumask_bits check s390/virtio: simplify Makefile s390/disassembler: add missing end marker for e7 table s390/disassembler: fix LRDFU format s390/disassembler: remove double instructions s390/disassembler: add sthyi instruction s390/disassembler: add missing z13 instructions s390/disassembler: add new z14 instructions s390: use generic rwsem implementation s390: implement memset16, memset32 & memset64 s390/mm: use memset64 instead of clear_table s390: optimize memset implementation s390: cleanup string ops prototypes s390/kprobes: remove 
KPROBE_SWAP_INST state s390/debug: adjust coding style s390: remove named saved segment support s390/disassembler: remove insn_to_mnemonic() s390/disassembler: generate opcode tables from text file s390: avoid undefined behaviour s390: simplify transactional execution elf hwcap handling Merge tag 'vfio-ccw-20171109' of git://git.kernel.org/.../kvms390/vfio-ccw into features s390: fix transactional execution control register handling s390/noexec: execute kexec datamover without DAT Hendrik Brueckner (1): s390/cpum_cf: add hardware counter support for IBM z14 Himanshu Jha (1): s390/sclp: Use setup_timer and mod_timer Jason J. Herne (1): s390: vfio-ccw: Do not attempt to free no-op, test and tic cda. Jean Delvare (1): s390/char: fix cdev_add usage Johannes Thumshirn (1): samples/kprobes: Add s390 case in kprobe example module Julian Wiedmann (1): s390/ccwgroup: tie a ccwgroup driver to its ccw driver Luc Van Oostenryck (1): s390: pass endianness info to sparse Martin Schwidefsky (14): s390/topology: add detection of dedicated vs shared CPUs s390/spinlock: use the cpu number +1 as spinlock value s390/spinlock: introduce
drivers/firmware/google/vpd.c: duplicate sysfs file
sysfs: cannot create duplicate filename '/devices/platform/vpd'

on the second load of this driver. I.e.,

modprobe vpd-sysfs
rmmod vpd-sysfs
modprobe vpd-sysfs
[boom]

on 4.14-rc8

-- 
~Randy
[PATCH IMPROVEMENT/BUGFIX 0/4] block, bfq: increase sustainable IOPS and fix a bug
Hi,
these patches address the following issue, raised and discussed in [1].

BFQ provides a proportional share policy for the blkio controller. In
this respect, BFQ updates the I/O accounting related to its policy,
i.e., the statistics contained in the special files blkio.bfq.* in
blkio groups (these files are the bfq counterpart of the blkio.*
statistic files updated by CFQ). To update these statistics, BFQ
invokes some blkg_*stats_* functions. We have found out that these
functions take a considerable percentage, about 40%, of the total
execution time of BFQ.

This patch series contains two patches to address this issue, namely
the patches anticipated and discussed in their main aspects in [1].
The first of these two patches is patch 3/4 in this series: it enables
BFQ to execute the above blkg_*stats_* functions, where possible, in
parallel with the rest of the code of the scheduler. With this
improvement, the maximum request-processing rate sustainable with BFQ
grows by 25%-30%, depending on the CPU. For instance, it grows from
250 to 310 KIOPS on an Intel i7-4850HQ. These results, and the others
reported in this letter, have been obtained and can be reproduced very
easily with the script [2].

Unfortunately, even after the above improvement, blkg_*stats_*
functions still cause a noticeable loss of sustainable throughput. To
give an idea, on an Intel i7-4850HQ, if the update of blkio.bfq.*
statistics is not performed at all, then the sustainable throughput
grows from 310 to 400 KIOPS.

This issue has already been discussed in [1] as well. In brief, we
agreed to make a further commit, which introduces the possibility to
disable/re-enable, at boot or at module-loading time, the updating of
all blkio statistics for proportional-share policies, i.e., of both
those updated by BFQ and those updated by CFQ. We are already working
on that commit, but finalizing it will take some time.

Fortunately, following a suggestion/recommendation of Tejun in the
same thread [1], it is already possible to drastically increase BFQ
performance when no blkio-debugging information is needed. Tejun's
suggestion/recommendation is to move most blkio.bfq.* statistics
behind an already existing config option, CONFIG_DEBUG_BLK_CGROUP.
Patch 4/4 in this series does that. Thanks to this change, if
CONFIG_DEBUG_BLK_CGROUP is not set, then bfq does attain a further
boost in sustainable throughput, which ranges from +30% to +45%,
depending on the CPU (some figures in the documentation).

The above two patches are preceded by two preliminary patches. The
first updates the conservative range of IOPS (sustainable with BFQ)
that was previously reported in the documentation. The patch replaces
this piece of information with the actual, much higher limits that we
have measured while working on the above two commits. The second
preliminary patch fixes a functional bug, related to the update of the
above statistics.

We waited for one week of testing from bfq users before submitting
these patches. We hope we are still in time for having these
improvements and fixes considered for 4.15.

NOTE. Two definitions of empty functions in patch 4/4 trigger the
following checkpatch error: "open brace '{' following function
definitions go on the next line". Unfortunately, following this
recommendation does seem to worsen code in our case: in addition to
making these two definitions slightly harder to read, it would break
symmetry with respect to all other definitions of empty functions,
both those already present in the base code, and those added by the
patch itself. In particular, in all those other definitions, the empty
body of the functions is on the same line as the prototypes of the
functions. Oddly, the latter definitions do not cause the same error
report.

Thanks,
Paolo

[1] https://www.spinics.net/lists/linux-block/msg18943.html
[2] https://github.com/Algodev-github/IOSpeed

Luca Miccio (2):
  block, bfq: add missing invocations of bfqg_stats_update_io_add/remove
  block, bfq: move debug blkio stats behind CONFIG_DEBUG_BLK_CGROUP

Paolo Valente (2):
  doc, block, bfq: update max IOPS sustainable with BFQ
  block, bfq: update blkio stats outside the scheduler lock

 Documentation/block/bfq-iosched.txt |  43 +--
 block/bfq-cgroup.c                  | 148
 block/bfq-iosched.c                 | 117 ++--
 block/bfq-iosched.h                 |   4 +-
 block/bfq-wf2q.c                    |   1 -
 5 files changed, 233 insertions(+), 80 deletions(-)

-- 
2.10.0
[PATCH BUGFIX/IMPROVEMENT 2/4] block, bfq: add missing invocations of bfqg_stats_update_io_add/remove
From: Luca Miccio

bfqg_stats_update_io_add and bfqg_stats_update_io_remove are to be
invoked, respectively, when an I/O request enters and when an I/O
request exits the scheduler. Unfortunately, bfq does not fully comply
with this scheme, because it does not invoke these functions for
requests that are inserted into or extracted from its priority
dispatch list. This commit fixes this mistake.

Tested-by: Lee Tibbert
Tested-by: Oleksandr Natalenko
Signed-off-by: Paolo Valente
Signed-off-by: Luca Miccio
---
 block/bfq-iosched.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 889a854..91703eb 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -1359,7 +1359,6 @@ static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd,
 		bfqq->ttime.last_end_request +
 		bfqd->bfq_slice_idle * 3;
 
-	bfqg_stats_update_io_add(bfqq_group(RQ_BFQQ(rq)), bfqq, rq->cmd_flags);
 
 	/*
 	 * bfqq deserves to be weight-raised if:
@@ -1633,7 +1632,6 @@ static void bfq_remove_request(struct request_queue *q,
 	if (rq->cmd_flags & REQ_META)
 		bfqq->meta_pending--;
 
-	bfqg_stats_update_io_remove(bfqq_group(bfqq), rq->cmd_flags);
 }
 
 static bool bfq_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio)
@@ -1746,6 +1744,7 @@ static void bfq_requests_merged(struct request_queue *q, struct request *rq,
 		bfqq->next_rq = rq;
 
 	bfq_remove_request(q, next);
+	bfqg_stats_update_io_remove(bfqq_group(bfqq), next->cmd_flags);
 
 	spin_unlock_irq(&bfqq->bfqd->lock);
 end:
@@ -3700,6 +3699,9 @@ static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
 
 	spin_lock_irq(&bfqd->lock);
 	rq = __bfq_dispatch_request(hctx);
+	if (rq && RQ_BFQQ(rq))
+		bfqg_stats_update_io_remove(bfqq_group(RQ_BFQQ(rq)),
+					    rq->cmd_flags);
 	spin_unlock_irq(&bfqd->lock);
 
 	return rq;
@@ -4224,6 +4226,7 @@ static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 {
 	struct request_queue *q = hctx->queue;
 	struct bfq_data *bfqd = q->elevator->elevator_data;
+	struct bfq_queue *bfqq = RQ_BFQQ(rq);
 
 	spin_lock_irq(&bfqd->lock);
 	if (blk_mq_sched_try_insert_merge(q, rq)) {
@@ -4243,6 +4246,12 @@ static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 		list_add_tail(&rq->queuelist, &bfqd->dispatch);
 	} else {
 		__bfq_insert_request(bfqd, rq);
+		/*
+		 * Update bfqq, because, if a queue merge has occurred
+		 * in __bfq_insert_request, then rq has been
+		 * redirected into a new queue.
+		 */
+		bfqq = RQ_BFQQ(rq);
 
 		if (rq_mergeable(rq)) {
 			elv_rqhash_add(q, rq);
@@ -4251,6 +4260,9 @@ static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 		}
 	}
 
+	if (bfqq)
+		bfqg_stats_update_io_add(bfqq_group(bfqq), bfqq, rq->cmd_flags);
+
 	spin_unlock_irq(&bfqd->lock);
 }
 
@@ -4428,8 +4440,11 @@ static void bfq_finish_request(struct request *rq)
 		 * lock is held.
 		 */
 
-		if (!RB_EMPTY_NODE(&rq->rb_node))
+		if (!RB_EMPTY_NODE(&rq->rb_node)) {
 			bfq_remove_request(rq->q, rq);
+			bfqg_stats_update_io_remove(bfqq_group(bfqq),
+						    rq->cmd_flags);
+		}
 
 		bfq_put_rq_priv_body(bfqq);
 	}
-- 
2.10.0
[PATCH IMPROVEMENT/BUGFIX 0/4] block, bfq: increase sustainable IOPS and fix a bug
Hi,
these patches address the following issue, raised and discussed in
[1]. BFQ provides a proportional share policy for the blkio
controller. In this respect, BFQ updates the I/O accounting related to
its policy, i.e., the statistics contained in the special files
blkio.bfq.* in blkio groups (these files are the bfq counterpart of
the blkio.* statistic files updated by CFQ). To update these
statistics, BFQ invokes some blkg_*stats_* functions. We have found
out that these functions take a considerable percentage, about 40%, of
the total execution time of BFQ.

This patch series contains two patches to address this issue, namely
the patches anticipated and discussed in their main aspects in [1].

The first of these two patches is patch 3/4 in this series: it enables
BFQ to execute the above blkg_*stats_* functions, where possible, in
parallel with the rest of the code of the scheduler. With this
improvement, the maximum request-processing rate sustainable with BFQ
grows by 25%-30%, depending on the CPU. For instance, it grows from
250 to 310 KIOPS on an Intel i7-4850HQ. These results, and the others
reported in this letter, have been obtained and can be reproduced very
easily with the script [2].

Unfortunately, even after the above improvement, blkg_*stats_*
functions still cause a noticeable loss of sustainable throughput. To
give an idea, on an Intel i7-4850HQ, if the update of blkio.bfq.*
statistics is not performed at all, then the sustainable throughput
grows from 310 to 400 KIOPS.

This issue has been already discussed in [1] as well. In brief, we
agreed to make a further commit, which introduces the possibility to
disable/re-enable at boot, or at module-loading time, the updating of
all blkio statistics for proportional-share policies, i.e., of both
those updated by BFQ and those updated by CFQ. We are already working
on that commit, but finalizing it will take some time.

Fortunately, following a suggestion/recommendation of Tejun in the
same thread [1], it is already possible to drastically increase BFQ
performance, when no blkio-debugging information is needed. Tejun's
suggestion/recommendation is to move most blkio.bfq.* statistics
behind an already existing config option, CONFIG_DEBUG_BLK_CGROUP.
Patch 4/4 in this series does that. Thanks to this change, if
CONFIG_DEBUG_BLK_CGROUP is not set, then bfq does attain a further
boost in sustainable throughput, which ranges from +30% to +45%,
depending on the CPU (some figures in the documentation).

The above two patches are preceded by two preliminary patches. The
first updates the conservative range of IOPS (sustainable with BFQ)
that was previously reported in the documentation. The patch replaces
this piece of information with the actual, much higher limits that we
have measured while working on the above two commits. The second
preliminary patch fixes a functional bug, related to the update of the
above statistics.

We waited for one week of testing from bfq users before submitting
these patches. We hope we are still in time for having these
improvements and fixes considered for 4.15.

NOTE. Two definitions of empty functions in patch 4/4 trigger the
following checkpatch error: "open brace '{' following function
definitions go on the next line". Unfortunately, following this
recommendation does seem to worsen code in our case: in addition to
making these two definitions slightly harder to read, it would break
symmetry with respect to all other definitions of empty functions,
both those already present in the base code, and those added by the
patch itself. In particular, in all those other definitions, the empty
body of the functions is on the same line as the prototypes of the
functions. Oddly, the latter definitions do not cause the same error
report.

Thanks,
Paolo

[1] https://www.spinics.net/lists/linux-block/msg18943.html
[2] https://github.com/Algodev-github/IOSpeed

Luca Miccio (2):
  block, bfq: add missing invocations of bfqg_stats_update_io_add/remove
  block, bfq: move debug blkio stats behind CONFIG_DEBUG_BLK_CGROUP

Paolo Valente (2):
  doc, block, bfq: update max IOPS sustainable with BFQ
  block, bfq: update blkio stats outside the scheduler lock

 Documentation/block/bfq-iosched.txt | 43 +--
 block/bfq-cgroup.c | 148
 block/bfq-iosched.c | 117 ++--
 block/bfq-iosched.h | 4 +-
 block/bfq-wf2q.c| 1 -
 5 files changed, 233 insertions(+), 80 deletions(-)

-- 
2.10.0
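The percentages quoted in the cover letter can be checked from the KIOPS figures it gives for the Intel i7-4850HQ (250, 310 and 400). The helper below is only an illustrative calculation, with a hypothetical name (gain_percent); it is not part of the patch series.

```c
#include <assert.h>

/*
 * Relative throughput gain, in percent, rounded to the nearest
 * integer. Inputs are sustainable throughputs in KIOPS.
 */
int gain_percent(int before_kiops, int after_kiops)
{
	assert(before_kiops > 0);
	return (100 * (after_kiops - before_kiops) + before_kiops / 2)
		/ before_kiops;
}
```

With the letter's figures, gain_percent(250, 310) gives 24 (quoted as roughly 25% for patch 3/4), gain_percent(310, 400) gives 29 (quoted as roughly 30% for patch 4/4), and the two steps together, gain_percent(250, 400), give 60.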
[PATCH BUGFIX/IMPROVEMENT 4/4] block, bfq: move debug blkio stats behind CONFIG_DEBUG_BLK_CGROUP
From: Luca Miccio

BFQ currently creates, and updates, its own instance of the whole
set of blkio statistics that cfq creates. Yet, from the comments
of Tejun Heo in [1], it turned out that most of these statistics are
meant/useful only for debugging. This commit makes BFQ create the
latter, debugging statistics only if the option
CONFIG_DEBUG_BLK_CGROUP is set.

By doing so, this commit also enables BFQ to enjoy a high performance
boost. The reason is that, if CONFIG_DEBUG_BLK_CGROUP is not set, then
BFQ has to update far fewer statistics, and, in particular, not the
heaviest to update. To give an idea of the benefits, if
CONFIG_DEBUG_BLK_CGROUP is not set, then, on an Intel i7-4850HQ, and
with 8 threads doing random I/O in parallel on null_blk (configured
with 0 latency), the throughput of BFQ grows from 310 to 400 KIOPS
(+30%). We have measured similar or even much higher boosts with other
CPUs: e.g., +45% with an ARM CortexTM-A53 Octa-core. Our results have
been obtained and can be reproduced very easily with the script in
[1].

[1] https://www.spinics.net/lists/linux-block/msg18943.html

Suggested-by: Tejun Heo
Suggested-by: Ulf Hansson
Tested-by: Lee Tibbert
Tested-by: Oleksandr Natalenko
Signed-off-by: Luca Miccio
Signed-off-by: Paolo Valente
---
 Documentation/block/bfq-iosched.txt | 38 +++--
 block/bfq-cgroup.c | 148
 block/bfq-iosched.c | 14 ++--
 block/bfq-iosched.h | 4 +-
 4 files changed, 125 insertions(+), 79 deletions(-)

diff --git a/Documentation/block/bfq-iosched.txt b/Documentation/block/bfq-iosched.txt
index 7fad6c0..8d8d8f0 100644
--- a/Documentation/block/bfq-iosched.txt
+++ b/Documentation/block/bfq-iosched.txt
@@ -20,12 +20,22 @@ for that device, by setting low_latency to 0. See Section 3 for
 details on how to configure BFQ for the desired tradeoff between
 latency and throughput, or on how to maximize throughput.
 
-BFQ has a non-null overhead, which limits the maximum IOPS that the
-CPU can process for a device scheduled with BFQ. To give an idea of
-the limits on slow or average CPUs, here are BFQ limits for three
-different CPUs, on, respectively, an average laptop, an old desktop,
-and a cheap embedded system, in case full hierarchical support is
-enabled (i.e., CONFIG_BFQ_GROUP_IOSCHED is set):
+BFQ has a non-null overhead, which limits the maximum IOPS that a CPU
+can process for a device scheduled with BFQ. To give an idea of the
+limits on slow or average CPUs, here are, first, the limits of BFQ for
+three different CPUs, on, respectively, an average laptop, an old
+desktop, and a cheap embedded system, in case full hierarchical
+support is enabled (i.e., CONFIG_BFQ_GROUP_IOSCHED is set), but
+CONFIG_DEBUG_BLK_CGROUP is not set (Section 4-2):
+- Intel i7-4850HQ: 400 KIOPS
+- AMD A8-3850: 250 KIOPS
+- ARM CortexTM-A53 Octa-core: 80 KIOPS
+
+If CONFIG_DEBUG_BLK_CGROUP is set (and of course full hierarchical
+support is enabled), then the sustainable throughput with BFQ
+decreases, because all blkio.bfq* statistics are created and updated
+(Section 4-2). For BFQ, this leads to the following maximum
+sustainable throughputs, on the same systems as above:
 - Intel i7-4850HQ: 310 KIOPS
 - AMD A8-3850: 200 KIOPS
 - ARM CortexTM-A53 Octa-core: 56 KIOPS
@@ -505,6 +515,22 @@ BFQ-specific files is "blkio.bfq." or "io.bfq." For example, the group
 parameter to set the weight of a group with BFQ is blkio.bfq.weight
 or io.bfq.weight.
 
+As for cgroups-v1 (blkio controller), the exact set of stat files
+created, and kept up-to-date by bfq, depends on whether
+CONFIG_DEBUG_BLK_CGROUP is set. If it is set, then bfq creates all
+the stat files documented in
+Documentation/cgroup-v1/blkio-controller.txt. If, instead,
+CONFIG_DEBUG_BLK_CGROUP is not set, then bfq creates only the files
+blkio.bfq.io_service_bytes
+blkio.bfq.io_service_bytes_recursive
+blkio.bfq.io_serviced
+blkio.bfq.io_serviced_recursive
+
+The value of CONFIG_DEBUG_BLK_CGROUP greatly influences the maximum
+throughput sustainable with bfq, because updating the blkio.bfq.*
+stats is rather costly, especially for some of the stats enabled by
+CONFIG_DEBUG_BLK_CGROUP.
+
 Parameters to set
 -
 
diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index ceefb9a..da1525e 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -24,7 +24,7 @@
 
 #include "bfq-iosched.h"
 
-#ifdef CONFIG_BFQ_GROUP_IOSCHED
+#if defined(CONFIG_BFQ_GROUP_IOSCHED) && defined(CONFIG_DEBUG_BLK_CGROUP)
 
 /* bfqg stats flags */
 enum bfqg_stats_flags {
@@ -152,6 +152,57 @@ void bfqg_stats_update_avg_queue_size(struct bfq_group *bfqg)
 	bfqg_stats_update_group_wait_time(stats);
 }
 
+void bfqg_stats_update_io_add(struct bfq_group *bfqg, struct bfq_queue *bfqq,
+			      unsigned int op)
+{
+	blkg_rwstat_add(&bfqg->stats.queued, op, 1);
+	bfqg_stats_end_empty_time(&bfqg->stats);
+	if (!(bfqq ==
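The technique this patch relies on, compiling the expensive statistics updates only when a debug option is set and replacing them with empty stubs otherwise, can be sketched in user space as follows. This is a minimal sketch under stated assumptions: DEMO_DEBUG_STATS is a hypothetical macro standing in for the combination of CONFIG_BFQ_GROUP_IOSCHED and CONFIG_DEBUG_BLK_CGROUP, and the demo_* names are illustrative, not bfq symbols.

```c
#include <assert.h>

/* Hypothetical statistics: one debug-only counter, one always kept. */
struct demo_stats {
	unsigned long queued;	/* debug-only */
	unsigned long serviced;	/* always maintained */
};

#ifdef DEMO_DEBUG_STATS
/* Debug build: the costly statistic is really updated. */
void demo_stats_update_io_add(struct demo_stats *s) { s->queued++; }
#else
/* Non-debug build: empty stub, so callers need no #ifdef of their own. */
void demo_stats_update_io_add(struct demo_stats *s) { (void)s; }
#endif

/* This statistic is maintained regardless of the debug option. */
void demo_stats_update_serviced(struct demo_stats *s) { s->serviced++; }

/* Call sites stay identical in both configurations. */
void demo_insert_request(struct demo_stats *s)
{
	demo_stats_update_io_add(s);
}

void demo_complete_request(struct demo_stats *s)
{
	demo_stats_update_serviced(s);
}
```

Building with -DDEMO_DEBUG_STATS enables the debug counter; without it, the stub compiles to nothing and only the always-on counter is touched. The empty one-line stub is also the form of definition that the cover letter's checkpatch note refers to.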
[PATCH BUGFIX/IMPROVEMENT 1/4] doc, block, bfq: update max IOPS sustainable with BFQ
We have investigated more deeply the performance of BFQ, in terms of
number of IOPS that can be processed by the CPU when BFQ is used as
I/O scheduler. In more detail, using the script [1], we have measured
the number of IOPS reached on top of a null block device configured
with zero latency, as a function of the workload (sequential read,
sequential write, random read, random write) and of the system (we
considered desktops, laptops and embedded systems).

Based on the resulting figures, with this commit we update the
current, conservative IOPS range reported in BFQ documentation. In
particular, the documentation now reports, for each of three different
systems, the lowest number of IOPS obtained for that system with the
above test (namely, the value obtained with the workload leading to
the lowest IOPS).

[1] https://github.com/Algodev-github/IOSpeed

Reviewed-by: Lee Tibbert
Signed-off-by: Paolo Valente
Signed-off-by: Luca Miccio
---
 Documentation/block/bfq-iosched.txt | 17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/Documentation/block/bfq-iosched.txt b/Documentation/block/bfq-iosched.txt
index 3d6951d..7a93615 100644
--- a/Documentation/block/bfq-iosched.txt
+++ b/Documentation/block/bfq-iosched.txt
@@ -20,12 +20,17 @@ for that device, by setting low_latency to 0. See Section 3 for
 details on how to configure BFQ for the desired tradeoff between
 latency and throughput, or on how to maximize throughput.
 
-On average CPUs, the current version of BFQ can handle devices
-performing at most ~30K IOPS; at most ~50 KIOPS on faster CPUs. As a
-reference, 30-50 KIOPS correspond to very high bandwidths with
-sequential I/O (e.g., 8-12 GB/s if I/O requests are 256 KB large), and
-to 120-200 MB/s with 4KB random I/O. BFQ is currently being tested on
-multi-queue devices too.
+BFQ has a non-null overhead, which limits the maximum IOPS that the
+CPU can process for a device scheduled with BFQ. To give an idea of
+the limits on slow or average CPUs, here are BFQ limits for three
+different CPUs, on, respectively, an average laptop, an old desktop,
+and a cheap embedded system, in case full hierarchical support is
+enabled (i.e., CONFIG_BFQ_GROUP_IOSCHED is set):
+- Intel i7-4850HQ: 250 KIOPS
+- AMD A8-3850: 170 KIOPS
+- ARM CortexTM-A53 Octa-core: 45 KIOPS
+
+BFQ works for multi-queue devices too.
 
 The table of contents follow. Impatients can just jump to Section 3.
 
-- 
2.10.0
[PATCH BUGFIX/IMPROVEMENT 3/4] block, bfq: update blkio stats outside the scheduler lock
bfq invokes various blkg_*stats_* functions to update the statistics
contained in the special files blkio.bfq.* in the blkio controller
groups, i.e., the I/O accounting related to the proportional-share
policy provided by bfq. The execution of these functions takes a
considerable percentage, about 40%, of the total per-request execution
time of bfq (i.e., of the sum of the execution time of all the bfq
functions that have to be executed to process an I/O request from its
creation to its destruction). This reduces the request-processing
rate sustainable by bfq noticeably, even on a multicore CPU. In fact,
the bfq functions that invoke blkg_*stats_* functions cannot be
executed in parallel with the rest of the code of bfq, because both
are executed under the same per-device scheduler lock.

To reduce this slowdown, this commit moves, wherever possible, the
invocation of these functions (more precisely, of the bfq functions
that invoke blkg_*stats_* functions) outside the critical sections
protected by the scheduler lock.

With this change, and with all blkio.bfq.* statistics enabled, the
throughput grows, e.g., from 250 to 310 KIOPS (+25%) on an Intel
i7-4850HQ, in case of 8 threads doing random I/O in parallel on
null_blk, with the latter configured with 0 latency. We obtained the
same or higher throughput boosts, up to +30%, with other processors
(some figures are reported in the documentation). For our tests, we
used the script [1], with which our results can be easily reproduced.

NOTE. This commit still protects the invocation of blkg_*stats_*
functions with the request_queue lock, because the group these
functions are invoked on may otherwise disappear before or while these
functions are executed. Fortunately, tests without even this lock
show, by difference, that the serialization caused by this lock has
little impact (at most ~5% of throughput reduction).

[1] https://github.com/Algodev-github/IOSpeed

Tested-by: Lee Tibbert
Tested-by: Oleksandr Natalenko
Signed-off-by: Paolo Valente
Signed-off-by: Luca Miccio
---
 Documentation/block/bfq-iosched.txt | 6 +-
 block/bfq-iosched.c | 110
 block/bfq-wf2q.c| 1 -
 3 files changed, 102 insertions(+), 15 deletions(-)

diff --git a/Documentation/block/bfq-iosched.txt b/Documentation/block/bfq-iosched.txt
index 7a93615..7fad6c0 100644
--- a/Documentation/block/bfq-iosched.txt
+++ b/Documentation/block/bfq-iosched.txt
@@ -26,9 +26,9 @@ the limits on slow or average CPUs, here are BFQ limits for three
 different CPUs, on, respectively, an average laptop, an old desktop,
 and a cheap embedded system, in case full hierarchical support is
 enabled (i.e., CONFIG_BFQ_GROUP_IOSCHED is set):
-- Intel i7-4850HQ: 250 KIOPS
-- AMD A8-3850: 170 KIOPS
-- ARM CortexTM-A53 Octa-core: 45 KIOPS
+- Intel i7-4850HQ: 310 KIOPS
+- AMD A8-3850: 200 KIOPS
+- ARM CortexTM-A53 Octa-core: 56 KIOPS
 
 BFQ works for multi-queue devices too.
 
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 91703eb..69e05f8 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -2228,7 +2228,6 @@ static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
 				       struct bfq_queue *bfqq)
 {
 	if (bfqq) {
-		bfqg_stats_update_avg_queue_size(bfqq_group(bfqq));
 		bfq_clear_bfqq_fifo_expire(bfqq);
 
 		bfqd->budgets_assigned = (bfqd->budgets_assigned * 7 + 256) / 8;
@@ -3469,7 +3468,6 @@ static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
 				 */
 				bfq_clear_bfqq_wait_request(bfqq);
 				hrtimer_try_to_cancel(&bfqd->idle_slice_timer);
-				bfqg_stats_update_idle_time(bfqq_group(bfqq));
 			}
 			goto keep_queue;
 		}
@@ -3695,15 +3693,67 @@ static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
 {
 	struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
 	struct request *rq;
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+	struct bfq_queue *in_serv_queue, *bfqq;
+	bool waiting_rq, idle_timer_disabled;
+#endif
 
 	spin_lock_irq(&bfqd->lock);
 
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+	in_serv_queue = bfqd->in_service_queue;
+	waiting_rq = in_serv_queue && bfq_bfqq_wait_request(in_serv_queue);
+
+	rq = __bfq_dispatch_request(hctx);
+
+	idle_timer_disabled =
+		waiting_rq && !bfq_bfqq_wait_request(in_serv_queue);
+
+#else
 	rq = __bfq_dispatch_request(hctx);
-	if (rq && RQ_BFQQ(rq))
-		bfqg_stats_update_io_remove(bfqq_group(RQ_BFQQ(rq)),
-					    rq->cmd_flags);
+#endif
 	spin_unlock_irq(&bfqd->lock);
 
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+	bfqq = rq ? RQ_BFQQ(rq) : NULL;
+	if (!idle_timer_disabled && !bfqq)
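The pattern described in this commit message (record under the scheduler lock what happened, then update the statistics after dropping that lock, under a separate and less contended lock) can be sketched in user space with POSIX mutexes. This is a minimal sketch under stated assumptions, not the bfq implementation: struct demo_sched and its functions are hypothetical, "lock" plays the role of the per-device scheduler lock, and "stats_lock" plays the role of the request_queue lock mentioned in the NOTE.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

struct demo_sched {
	pthread_mutex_t lock;		/* stands in for the scheduler lock */
	pthread_mutex_t stats_lock;	/* stands in for the request_queue lock */
	int pending;			/* requests waiting to be dispatched */
	unsigned long dispatched_stat;	/* statistic updated outside "lock" */
};

void demo_sched_init(struct demo_sched *sd, int pending)
{
	pthread_mutex_init(&sd->lock, NULL);
	pthread_mutex_init(&sd->stats_lock, NULL);
	sd->pending = pending;
	sd->dispatched_stat = 0;
}

/* Returns true if a request was dispatched. */
bool demo_dispatch(struct demo_sched *sd)
{
	bool got;

	/* Critical section: only the scheduling decision is made here. */
	pthread_mutex_lock(&sd->lock);
	got = sd->pending > 0;
	if (got)
		sd->pending--;
	pthread_mutex_unlock(&sd->lock);

	/*
	 * The statistics update uses the outcome saved in "got" and runs
	 * after the scheduler lock is released, so other threads can enter
	 * the scheduler while this thread does the accounting.
	 */
	if (got) {
		pthread_mutex_lock(&sd->stats_lock);
		sd->dispatched_stat++;
		pthread_mutex_unlock(&sd->stats_lock);
	}

	return got;
}
```

The local variable plays the same role as waiting_rq and idle_timer_disabled in the hunk above: state needed for the accounting is captured while the scheduler lock is held, and consumed after it is released.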
linux-next: Tree for Nov 13
Hi all, Please do not add any v4.16 material to your linux-next included trees until v4.15-rc1 has been released. Changes since 20171110: The powerpc tree still had its build failure for which I applied a patch The keys tree gained a build failure so I used the version from next-20171110. Non-merge commits (relative to Linus' tree): 12048 11498 files changed, 555327 insertions(+), 266135 deletions(-) I have created today's linux-next tree at git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git (patches at http://www.kernel.org/pub/linux/kernel/next/ ). If you are tracking the linux-next tree using git, you should not use "git pull" to do so as that will try to merge the new linux-next release with the old one. You should use "git fetch" and checkout or reset to the new master. You can see which trees have been included by looking in the Next/Trees file in the source. There are also quilt-import.log and merge.log files in the Next directory. Between each merge, the tree was built with a ppc64_defconfig for powerpc, an allmodconfig for x86_64, a multi_v7_defconfig for arm and a native build of tools/perf. After the final fixups (if any), I do an x86_64 modules_install followed by builds for x86_64 allnoconfig, powerpc allnoconfig (32 and 64 bit), ppc44x_defconfig, allyesconfig and pseries_le_defconfig and i386, sparc and sparc64 defconfig. And finally, a simple boot test of the powerpc pseries_le_defconfig kernel in qemu (with and without kvm enabled). Below is a summary of the state of the merge. I am currently merging 272 trees (counting Linus' and 42 trees of bug fix patches pending for the current merge release). Stats about the size of the tree over time can be seen at http://neuling.org/linux-next-size.html . Status of my local build tests will be at http://kisskb.ellerman.id.au/linux-next . If maintainers want to give advice about cross compilers/configs that work, we are always open to add more builds. 
Thanks to Randy Dunlap for doing many randconfig builds. And to Paul Gortmaker for triage and bug fixes.

-- 
Cheers,
Stephen Rothwell

$ git checkout master
$ git reset --hard stable
Merging origin/master (bebc6082da0a Linux 4.14)
Merging fixes/master (820bf5c419e4 Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi)
Merging kbuild-current/fixes (bb3f38c3c5b7 kbuild: clang: fix build failures with sparse check)
Merging arc-current/for-curr (92d44128241f ARCv2: Accomodate HS48 MMUv5 by relaxing MMU ver checking)
Merging arm-current/fixes (b9dd05c7002e ARM: 8720/1: ensure dump_instr() checks addr_limit)
Merging m68k-current/for-linus (558d5ad276c9 m68k/mac: Avoid soft-lockup warning after mach_power_off)
Merging metag-fixes/fixes (b884a190afce metag/usercopy: Add missing fixups)
Merging powerpc-fixes/fixes (7ecb37f62fe5 powerpc/perf: Fix core-imc hotplug callback failure during imc initialization)
Merging sparc/master (23198ddffb6c sparc32: Add cmpxchg64().)
Merging fscrypt-current/for-stable (42d97eb0ade3 fscrypt: fix renaming and linking special files)
Merging net/master (b39545684a90 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net)
Merging ipsec/master (c9f3f813d462 xfrm: Fix stack-out-of-bounds read in xfrm_state_find.)
Merging netfilter/master (7400bb4b5800 netfilter: nf_reject_ipv4: Fix use-after-free in send_reset)
Merging ipvs/master (f7fb77fc1235 netfilter: nft_compat: check extension hook mask only if set)
Merging wireless-drivers/master (a6127b4440d1 Merge tag 'iwlwifi-for-kalle-2017-10-06' of git://git.kernel.org/pub/scm/linux/kernel/git/iwlwifi/iwlwifi-fixes)
Merging mac80211/master (9618aec3349b Merge tag 'mac80211-for-davem-2017-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211)
Merging sound-current/for-linus (75ee94b20b46 ALSA: hda - fix headset mic problem for Dell machines with alc274)
Merging pci-current/for-linus (6b7be529634b MAINTAINERS: Add Lorenzo Pieralisi for PCI host bridge drivers)
Merging driver-core.current/driver-core-linus (39dae59d66ac Linux 4.14-rc8)
Merging tty.current/tty-linus (8a5776a5f498 Linux 4.14-rc4)
Merging usb.current/usb-linus (bb176f67090c Linux 4.14-rc6)
Merging usb-gadget-fixes/fixes (7c80f9e4a588 usb: usbtest: fix NULL pointer dereference)
Merging usb-serial-fixes/usb-linus (0b07194bb55e Linux 4.14-rc7)
Merging usb-chipidea-fixes/ci-for-usb-stable (cbb22ebcfb99 usb: chipidea: core: check before accessing ci_role in ci_role_show)
Merging phy/fixes (2fb850092fd9 phy: rockchip-typec: Check for errors from tcphy_phy_init())
Merging staging.current/staging-linus (bb176f67090c Linux 4.14-rc6)
Merging char-misc.current/char-misc-linus (bb176f67090c Linux 4.14-rc6)
Merging input-current/for-linus (26dd633e437d Input: synaptics-rmi4 - RMI4 can also use SMBUS version 3)
Merging crypto-current/master (441f99c90497 crypto: ccm - preserve the IV buffer)
Merging ide/master
Re: linux-next: manual merge of the arm64 tree with Linus' tree
Hi all, On Wed, 1 Nov 2017 07:57:23 +1100 Stephen Rothwell wrote: > > Today's linux-next merge of the arm64 tree got a conflict in: > > drivers/acpi/arm64/iort.c > > between commit: > > 37f6b42e9c29 ("ACPI/IORT: Fix PCI ACS enablement") > > from Linus' tree and commit: > > 896dd2c32484 ("ACPI/IORT: Make platform devices initialization code SMMU > agnostic") > > from the arm64 tree. > > I fixed it up (see below) and can carry the fix as necessary. This > is now fixed as far as linux-next is concerned, but any non trivial > conflicts should be mentioned to your upstream maintainer when your tree > is submitted for merging. You may also want to consider cooperating > with the maintainer of the conflicting tree to minimise any particularly > complex conflicts. > > -- > Cheers, > Stephen Rothwell > > diff --cc drivers/acpi/arm64/iort.c > index de56394dd161,7dc964f4d8f1.. > --- a/drivers/acpi/arm64/iort.c > +++ b/drivers/acpi/arm64/iort.c > @@@ -1215,7 -1326,7 +1357,8 @@@ static void __init iort_init_platform_d > struct acpi_table_iort *iort; > struct fwnode_handle *fwnode; > int i, ret; > +bool acs_enabled = false; > + const struct iort_dev_config *ops; > > /* >* iort_table and iort both point to the start of IORT table, but > @@@ -1235,12 -1346,8 +1378,11 @@@ > return; > } > > +if (!acs_enabled) > +acs_enabled = iort_enable_acs(iort_node); > + > - if ((iort_node->type == ACPI_IORT_NODE_SMMU) || > - (iort_node->type == ACPI_IORT_NODE_SMMU_V3)) { > - > + ops = iort_get_dev_cfg(iort_node); > + if (ops) { > fwnode = acpi_alloc_fwnode_static(); > if (!fwnode) > return; Just a reminder that this conflict still exists. -- Cheers, Stephen Rothwell
Re: [kernel-hardening] [PATCH v4] scripts: add leaking_addresses.pl
On Mon, Nov 13, 2017 at 11:16:28AM +0530, kaiwan.billimo...@gmail.com wrote:
> On Mon, 2017-11-13 at 09:21 +1100, Tobin C. Harding wrote:
> > On Fri, Nov 10, 2017 at 07:26:34PM +0530, kaiwan.billimo...@gmail.com wrote:
> > > On Tue, 2017-11-07 at 21:32 +1100, Tobin C. Harding wrote:
> > > > Currently we are leaking addresses from the kernel to user space. This
> > > > script is an attempt to find some of those leakages. Script parses
> > > > `dmesg` output and /proc and /sys files for hex strings that look like
> > > > kernel addresses.
> > > >
> > > > Only works for 64 bit kernels, the reason being that kernel addresses
> > > > on 64 bit kernels have '' as the leading bit pattern making greping
> > > > possible. On 32 kernels we don't have this luxury.
> > >
> > > Tobin C. Harding wrote:
> > > > Only works for 64 bit kernels, the reason being that kernel addresses
> > > > on 64 bit kernels have '' as the leading bit pattern making greping
> > > > possible. On 32 kernels we don't have this luxury.
> > >
> > > [RFC] leaking_addresses.pl - enhance it to work for 32-bit kernels as well
> > >
> > > (Firstly, apologies if I've got the protocol horribly wrong- should this
> > > be a new thread altogether?)
> >
> > I think this patch will need to wait until the patch set that is
> > currently in flight is either merged or dropped.
> >
> Thanks for looking at it!
> Okay; blocking on merge || drop... :-)

So, Linus has requested that I set up a tree for the development of this. I have to work out the details of how to do that and then I'll email you so you can pull the current version. I can then take your patch via LKML as per usual.

> > We can work this out pragmatically, Perl can give us an architecture
> > string then a few regexs can ascertain which architecture we are running
> > on. This is in the inflight patch set.
> >
> > > The patch below does Not take into account (yet) stuff like:
> > > - exactly which files & dirs should be skipped on 32-bit (will it be
> > > identical to 64-bit?; unsure..)
> >
> > As per discussion later in this thread we may need to consider
> > architecture specific lists for files/directories to skip.
> Right
> >
> > > - it currently hard-codes a global 'PAGE_OFFSET_32BIT=0xc000', just
> > > so I can test quickly; must figure whether to query it or pass it;
> > > Suggestions?
> >
> > Perhaps we should have a command line option for this.
> >
> > --kernel-base-address
>
> Why not just detect it programatically? We could devise a series of
> fallbacks; something like:
> - if .config exists in the kernel source tree root, grep it for
> PAGE_OFFSET
> - if not, grep the arch-specific (arch//configs/)
> for the same
> - if for some reason we don't have enough info regarding specific
> platform and thus the defconfig filename (could happen for ARM, PPC?),
> we then fail and request the user to pass it as a parameter.
>
> > > - the 'false positives'; again, what differs for 32-bit?

Sounds good to me.

thanks,
Tobin.
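The fallback chain proposed above can be sketched roughly as follows. This is only an illustration written in Python, not the script's actual Perl code. The function names are hypothetical, and the config symbol is assumed to appear as a line of the form CONFIG_PAGE_OFFSET=0x... (the thread only says "grep it for PAGE_OFFSET"). The caller supplies the candidate file paths, since the thread leaves the exact defconfig path open.

```python
import re

# Assumption: the symbol shows up as "CONFIG_PAGE_OFFSET=0x..." on its own line.
_PAGE_OFFSET_RE = re.compile(
    r"^CONFIG_PAGE_OFFSET=(0[xX][0-9a-fA-F]+)\s*$", re.MULTILINE
)


def page_offset_from_text(text):
    """Return PAGE_OFFSET as an int if the config text defines it, else None."""
    match = _PAGE_OFFSET_RE.search(text)
    return int(match.group(1), 16) if match else None


def detect_page_offset(candidate_paths, user_supplied=None):
    """Try each config file in order, then fall back to a user-supplied value.

    candidate_paths is an ordered list such as [".config", "<some defconfig>"];
    user_supplied is a hex string, e.g. from a command line option.
    """
    for path in candidate_paths:
        try:
            with open(path) as config:
                value = page_offset_from_text(config.read())
        except OSError:
            continue  # missing or unreadable file: move on to the next fallback
        if value is not None:
            return value
    if user_supplied is not None:
        return int(user_supplied, 16)
    raise ValueError("PAGE_OFFSET not found; please pass it as a parameter")
```

Raising an error at the end mirrors the last step of the proposal: when neither config file yields a value, the user is asked to pass one explicitly.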
Improving documentation of parent-ID field in /proc/PID/mountinfo
Hello Ram,

Long ago (2.6.29) you added the /proc/PID/mountinfo file and associated documentation in Documentation/filesystems/proc.txt. Later, I pasted much of that documentation into the proc(5) manual page.

That documentation says of the second field in the file:

[[
This file contains lines of the form:

36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue
(1)(2)(3)   (4)   (5)      (6)      (7)   (8) (9)   (10)         (11)

(1) mount ID: unique identifier of the mount (may be reused after umount)
(2) parent ID: ID of parent (or of self for the top of the mount tree)
...
]]

The last piece of the description of field (2) doesn't seem to be correct, or is at least rather unclear. I take this to be saying that for the root mount point, /, field (2) will have the same value as field (1). I never actually looked at this detail closely, but Alexander pointed out that this is obviously not so, as one can immediately verify:

$ grep '/ / ' /proc/$$/mountinfo
65 0 8:2 / / rw,relatime shared:1 - ext4 /dev/sda2 rw,seclabel,data=order

I dug around in the kernel source for a bit. I do not have an exact handle on the details, but I can see roughly what is going on. Internally, there seems to be one ("hidden") mount ID reserved to each mount namespace, and that ID is the parent of the root mount point.

Looking through the (4.14) kernel source, mount IDs are allocated by mnt_alloc_id() (in fs/namespace.c), which is in turn called by alloc_vfsmnt() which is in turn called by clone_mnt().

A new mount namespace is created by the kernel function copy_mnt_ns() (in fs/namespace.c, called by create_new_namespaces() in kernel/nsproxy.c). The copy_mnt_ns() function calls copy_tree() (in fs/namespace.c), and copy_tree() calls clone_mnt() in *two* places. The first of these is the call that creates the "hidden" mount ID that becomes the parent of the root mount point. (I verified this by instrumenting the kernel with a few printk() calls to display the IDs.) The second place where copy_tree() calls clone_mnt() is in a loop that replicates each of the mount points (including the root mount point) in the source mount namespace.

With these details in mind, I propose to patch the man page to read as below. Perhaps you have some corrections or improvements to suggest for this text?

[[
(2) parent ID: the ID of the parent mount. For the root mount point, the ID shown here is a hidden mount ID associated with the mount namespace. That ID is distinct from any of the IDs shown in field (1) of the records shown in the mountinfo file, and does not appear in field (1) in the mountinfo file in any other mount namespace. (In the initial mount namespace, this hidden ID has the value 0.)
]]

With best regards,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
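The observation about field (2) can be checked from user space with a small parser. The following is a minimal Python sketch, not part of the proposed man-page text. The function names are hypothetical, and it assumes every record follows the 11-field layout quoted above, with " - " as the separator field. It reports parent IDs that never occur as a mount ID among the given records, which is how the hidden ID shows up for the root mount point.

```python
def parse_mountinfo_line(line):
    """Extract a few fields from one /proc/PID/mountinfo record."""
    # Field (8) is a lone "-" that ends the variable-length optional fields.
    before, separator, after = line.strip().partition(" - ")
    if not separator:
        raise ValueError("no ' - ' separator in mountinfo record")
    fields = before.split()
    return {
        "mount_id": int(fields[0]),   # field (1)
        "parent_id": int(fields[1]),  # field (2)
        "mount_point": fields[4],     # field (5)
        "fstype": after.split()[0],   # field (9)
    }


def hidden_parent_ids(lines):
    """Return parent IDs that do not occur as a mount ID in the same records."""
    records = [parse_mountinfo_line(line) for line in lines if line.strip()]
    mount_ids = {record["mount_id"] for record in records}
    parent_ids = {record["parent_id"] for record in records}
    return sorted(parent_ids - mount_ids)
```

On a Linux system, passing the lines read from /proc/self/mountinfo to hidden_parent_ids() would be expected to return the single hidden ID in the common case, though a process whose view of the mount tree is restricted (for example by chroot) may see other unmatched parent IDs.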
Re: linux-next: manual merge of the tip tree with the net-next tree
Hi all, On Mon, 30 Oct 2017 20:55:47 + Mark Brown wrote: > > Today's linux-next merge of the tip tree got a conflict in: > > net/ipv4/tcp_output.c > > between commit: > > 6aa7de059173a ("locking/atomics: COCCINELLE/treewide: Convert trivial > ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()") > > in the tip tree and some change in the net-next tree. > > I fixed it up (see below) and can carry the fix as necessary. This > is now fixed as far as linux-next is concerned, but any non trivial > conflicts should be mentioned to your upstream maintainer when your tree > is submitted for merging. You may also want to consider cooperating > with the maintainer of the conflicting tree to minimise any particularly > complex conflicts. > > diff --cc net/ipv4/tcp_output.c > index a69a34f57330,48531da1aba6.. > --- a/net/ipv4/tcp_output.c > +++ b/net/ipv4/tcp_output.c > @@@ -1978,7 -1908,7 +1978,7 @@@ static bool tcp_tso_should_defer(struc > if ((skb != tcp_write_queue_tail(sk)) && (limit >= skb->len)) > goto send_now; > > - win_divisor = > ACCESS_ONCE(sock_net(sk)->ipv4.sysctl_tcp_tso_win_divisor); > -win_divisor = READ_ONCE(sysctl_tcp_tso_win_divisor); > ++win_divisor = READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_tso_win_divisor); > if (win_divisor) { > u32 chunk = min(tp->snd_wnd, tp->snd_cwnd * tp->mss_cache); > Just a reminder that this conflict still exists. -- Cheers, Stephen Rothwell
Re: linux-next: manual merge of the devicetree tree with the drm tree
Hi all, On Mon, 30 Oct 2017 20:37:56 + Mark Brown wrote: > > Today's linux-next merge of the devicetree tree got a conflict in: > > drivers/gpu/drm/tilcdc/tilcdc_slave_compat.c > > between commit: > > 44cd3939c111b7 ("drm/tilcdc: Remove redundant OF_DETACHED flag setting") > > from the drm tree and commit: > > f948d6d8b792bb ("of: overlay: avoid race condition between applying > multiple overlays") > > from the devicetree tree. > > I fixed it up (see below) and can carry the fix as necessary. This > is now fixed as far as linux-next is concerned, but any non trivial > conflicts should be mentioned to your upstream maintainer when your tree > is submitted for merging. You may also want to consider cooperating > with the maintainer of the conflicting tree to minimise any particularly > complex conflicts. > > diff --cc drivers/gpu/drm/tilcdc/tilcdc_slave_compat.c > index 482299a6f3b0,54025af534d4.. > --- a/drivers/gpu/drm/tilcdc/tilcdc_slave_compat.c > +++ b/drivers/gpu/drm/tilcdc/tilcdc_slave_compat.c > @@@ -163,12 -162,8 +162,6 @@@ static struct device_node * __init tilc > return NULL; > } > > - ret = of_resolve_phandles(overlay); > - if (ret) { > - pr_err("%s: Failed to resolve phandles: %d\n", __func__, ret); > - return NULL; > - } > -of_node_set_flag(overlay, OF_DETACHED); > -- > return overlay; > } > Just a reminder that this conflict still exists. -- Cheers, Stephen Rothwell
Re: linux-next: manual merge of the ext4 tree with the fscrypt tree
Hi all,

On Mon, 30 Oct 2017 14:48:04 + Mark Brown wrote:
>
> Today's linux-next merge of the ext4 tree got a conflict in:
>
>   fs/ext4/inode.c
>
> between commit:
>
>   2ee6a576be564272 ("fs, fscrypt: add an S_ENCRYPTED inode flag")
>
> from the fscrypt tree and commit:
>
>   d4e50e6d43b2620f ("ext4: add ext4_should_use_dax()")
>
> from the ext4 tree.
>
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging. You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.
>
> diff --cc fs/ext4/inode.c
> index 617c7feced24,9f836e2ec18c..
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@@ -4572,6 -4610,21 +4610,23 @@@ int ext4_get_inode_loc(struct inode *in
>   		!ext4_test_inode_state(inode, EXT4_STATE_XATTR));
>   }
>   
> + static bool ext4_should_use_dax(struct inode *inode)
> + {
> ++	unsigned int flags = EXT4_I(inode)->i_flags;
> ++
> + 	if (!test_opt(inode->i_sb, DAX))
> + 		return false;
> + 	if (!S_ISREG(inode->i_mode))
> + 		return false;
> + 	if (ext4_should_journal_data(inode))
> + 		return false;
> + 	if (ext4_has_inline_data(inode))
> + 		return false;
>  -	if (ext4_encrypted_inode(inode))
> ++	if (flags & EXT4_ENCRYPT_FL)
> + 		return false;
> + 	return true;
> + }
> + 
>   void ext4_set_inode_flags(struct inode *inode)
>   {
>   	unsigned int flags = EXT4_I(inode)->i_flags;
> @@@ -4587,15 -4640,10 +4642,13 @@@
>   		new_fl |= S_NOATIME;
>   	if (flags & EXT4_DIRSYNC_FL)
>   		new_fl |= S_DIRSYNC;
> - 	if (test_opt(inode->i_sb, DAX) && S_ISREG(inode->i_mode) &&
> - 	    !ext4_should_journal_data(inode) && !ext4_has_inline_data(inode) &&
> - 	    !(flags & EXT4_ENCRYPT_FL))
> + 	if (ext4_should_use_dax(inode))
>   		new_fl |= S_DAX;
>  +	if (flags & EXT4_ENCRYPT_FL)
>  +		new_fl |= S_ENCRYPTED;
>   	inode_set_flags(inode, new_fl,
>  -			S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC|S_DAX);
>  +			S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC|S_DAX|
>  +			S_ENCRYPTED);
>   }
>   
>   static blkcnt_t ext4_inode_blocks(struct ext4_inode *raw_inode,

Just a reminder that this conflict still exists.

-- 
Cheers,
Stephen Rothwell
linux-next: build warning after merge of the akpm-current tree
Hi Andrew,

After merging the akpm-current tree, today's linux-next build (x86_64
allmodconfig) produced this warning:

In file included from include/linux/printk.h:7:0,
                 from include/linux/kernel.h:14,
                 from lib/test_find_bit.c:28:
lib/test_find_bit.c: In function 'test_find_first_bit':
include/linux/kern_levels.h:5:18: warning: format '%ld' expects argument of type 'long int', but argument 2 has type 'cycles_t {aka long long unsigned int}' [-Wformat=]
 #define KERN_SOH "\001" /* ASCII Start Of Header */
                  ^
include/linux/kern_levels.h:11:18: note: in expansion of macro 'KERN_SOH'
 #define KERN_ERR KERN_SOH "3" /* error conditions */
                  ^
include/linux/printk.h:300:9: note: in expansion of macro 'KERN_ERR'
  printk(KERN_ERR pr_fmt(fmt), ##__VA_ARGS__)
         ^
lib/test_find_bit.c:54:2: note: in expansion of macro 'pr_err'
  pr_err("find_first_bit:\t\t%ld cycles,\t%ld iterations\n", cycles, cnt);
  ^
lib/test_find_bit.c: In function 'test_find_next_bit':
include/linux/kern_levels.h:5:18: warning: format '%ld' expects argument of type 'long int', but argument 2 has type 'cycles_t {aka long long unsigned int}' [-Wformat=]
 #define KERN_SOH "\001" /* ASCII Start Of Header */
                  ^
include/linux/kern_levels.h:11:18: note: in expansion of macro 'KERN_SOH'
 #define KERN_ERR KERN_SOH "3" /* error conditions */
                  ^
include/linux/printk.h:300:9: note: in expansion of macro 'KERN_ERR'
  printk(KERN_ERR pr_fmt(fmt), ##__VA_ARGS__)
         ^
lib/test_find_bit.c:68:2: note: in expansion of macro 'pr_err'
  pr_err("find_next_bit:\t\t%ld cycles,\t%ld iterations\n", cycles, cnt);
  ^
lib/test_find_bit.c: In function 'test_find_next_zero_bit':
include/linux/kern_levels.h:5:18: warning: format '%ld' expects argument of type 'long int', but argument 2 has type 'cycles_t {aka long long unsigned int}' [-Wformat=]
 #define KERN_SOH "\001" /* ASCII Start Of Header */
                  ^
include/linux/kern_levels.h:11:18: note: in expansion of macro 'KERN_SOH'
 #define KERN_ERR KERN_SOH "3" /* error conditions */
                  ^
include/linux/printk.h:300:9: note: in expansion of macro 'KERN_ERR'
  printk(KERN_ERR pr_fmt(fmt), ##__VA_ARGS__)
         ^
lib/test_find_bit.c:82:2: note: in expansion of macro 'pr_err'
  pr_err("find_next_zero_bit:\t%ld cycles,\t%ld iterations\n",
  ^
lib/test_find_bit.c: In function 'test_find_last_bit':
include/linux/kern_levels.h:5:18: warning: format '%ld' expects argument of type 'long int', but argument 2 has type 'cycles_t {aka long long unsigned int}' [-Wformat=]
 #define KERN_SOH "\001" /* ASCII Start Of Header */
                  ^
include/linux/kern_levels.h:11:18: note: in expansion of macro 'KERN_SOH'
 #define KERN_ERR KERN_SOH "3" /* error conditions */
                  ^
include/linux/printk.h:300:9: note: in expansion of macro 'KERN_ERR'
  printk(KERN_ERR pr_fmt(fmt), ##__VA_ARGS__)
         ^
lib/test_find_bit.c:102:2: note: in expansion of macro 'pr_err'
  pr_err("find_last_bit:\t\t%ld cycles,\t%ld iterations\n", cycles, cnt);
  ^

Introduced by commit

  09588b1f1d58 ("lib: test module for find_*_bit() functions")

-- 
Cheers,
Stephen Rothwell
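For readers unfamiliar with this class of warning: '%ld' expects a long, while on this configuration cycles_t is an unsigned long long. One common way to make such a print portable is to cast the value to unsigned long long and use '%llu'. The sketch below is a userspace illustration only, not the fix that was applied to lib/test_find_bit.c. The cycles_t typedef, the helper name format_cycles() and the assumption that cnt is an unsigned long are all hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

/* Stand-in for the kernel type; its real width is arch-dependent (assumption). */
typedef unsigned long long cycles_t;

/*
 * Hypothetical helper: format one timing line into buf.
 * The explicit cast keeps the argument type in sync with %llu no matter
 * how cycles_t happens to be defined on a given architecture.
 * Returns the snprintf() result (number of characters that were needed).
 */
int format_cycles(char *buf, size_t len, const char *name,
                  cycles_t cycles, unsigned long cnt)
{
    return snprintf(buf, len, "%s:\t\t%llu cycles,\t%lu iterations\n",
                    name, (unsigned long long)cycles, cnt);
}
```

A caller would pass a buffer and then hand it to whatever logging function is in use; in kernel code the same cast-plus-%llu pattern would sit directly in the pr_err() call.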
[PATCH v2] coccinelle: orplus: reorganize to improve performance
Adding two #define constants is less common than performing & and |
operations on them, so put the addition first to reduce the set of cases
that have to be considered in detail. At the same time, add & and |
patterns for both arguments of +, to account for commutativity and obtain
more results.

Running time is divided by 3 when applying this to the whole kernel on my
laptop with an Intel i5-6200U CPU.

Signed-off-by: Julia Lawall
---

v2: added SOB and fixed typos in the commit message

diff --git a/scripts/coccinelle/misc/orplus.cocci b/scripts/coccinelle/misc/orplus.cocci
index 81fabf3..08de5be 100644
--- a/scripts/coccinelle/misc/orplus.cocci
+++ b/scripts/coccinelle/misc/orplus.cocci
@@ -14,7 +14,19 @@ virtual report
 virtual context
 
 @r@
-constant c;
+constant c,c1;
+identifier i,i1;
+position p;
+@@
+
+(
+ c1 + c - 1
+|
+ c1@i1 +@p c@i
+)
+
+@s@
+constant r.c, r.c1;
 identifier i;
 expression e;
 @@
@@ -27,28 +39,31 @@ e & c@i
 e |= c@i
 |
 e &= c@i
+|
+e | c1@i
+|
+e & c1@i
+|
+e |= c1@i
+|
+e &= c1@i
 )
 
-@s@
-constant r.c,c1;
-identifier i1;
-position p;
+@depends on s@
+position r.p;
+constant c1,c2;
 @@
 
-(
- c1 + c - 1
-|
-*c1@i1 +@p c
-)
+* c1 +@p c2
 
-@script:python depends on org@
-p << s.p;
+@script:python depends on s && org@
+p << r.p;
 @@
 
 cocci.print_main("sum of probable bitmasks, consider |",p)
 
-@script:python depends on report@
-p << s.p;
+@script:python depends on s && report@
+p << r.p;
 @@
 
 msg = "WARNING: sum of probable bitmasks, consider |"
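As background on what the semantic patch is looking for: '+' and '|' give the same result only when the two operands share no set bits, which is why a sum of constants that are elsewhere used as bitmasks is reported with "consider |". The following is a small illustrative sketch; the flag names FLAG_READ and FLAG_WRITE and both helper functions are hypothetical and do not come from the kernel or from orplus.cocci.

```c
/* Hypothetical single-bit flags, chosen only for illustration. */
#define FLAG_READ  0x1u
#define FLAG_WRITE 0x2u

/* Combining masks with bitwise OR is idempotent: a bit set twice stays set. */
unsigned int combine_or(unsigned int a, unsigned int b)
{
    return a | b;
}

/* Combining masks with addition carries into the next bit on overlap. */
unsigned int combine_plus(unsigned int a, unsigned int b)
{
    return a + b;
}
```

With disjoint flags the two helpers agree, but combining FLAG_READ with itself yields FLAG_READ under '|' and FLAG_WRITE under '+', which is the kind of silent bug the warning is meant to catch.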
[PATCH] timekeeping: Eliminate the useless declaration of ktime_get_raw_and_real_ts64()
Commit:

  ba26621e63ce ("time: Remove duplicated code in ktime_get_raw_and_real()")

... got rid of ktime_get_raw_and_real_ts64(), but left its declaration
behind. Remove it.

Signed-off-by: Dou Liyang
---
 include/linux/timekeeping.h | 6 --
 1 file changed, 6 deletions(-)

diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index c198ab4..b17bcce 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -143,12 +143,6 @@ extern bool timekeeping_rtc_skipresume(void);
 extern void timekeeping_inject_sleeptime64(struct timespec64 *delta);
 
 /*
- * PPS accessor
- */
-extern void ktime_get_raw_and_real_ts64(struct timespec64 *ts_raw,
-					struct timespec64 *ts_real);
-
-/*
  * struct system_time_snapshot - simultaneous raw/real time capture with
  * counter value
  * @cycles:	Clocksource counter value to produce the system times
-- 
2.5.5
Re: [kernel-hardening] [PATCH v4] scripts: add leaking_addresses.pl
On Mon, 2017-11-13 at 09:21 +1100, Tobin C. Harding wrote:
> On Fri, Nov 10, 2017 at 07:26:34PM +0530, kaiwan.billimo...@gmail.com wrote:
> > On Tue, 2017-11-07 at 21:32 +1100, Tobin C. Harding wrote:
> > > Currently we are leaking addresses from the kernel to user space. This
> > > script is an attempt to find some of those leakages. Script parses
> > > `dmesg` output and /proc and /sys files for hex strings that look like
> > > kernel addresses.
> > >
> > > Only works for 64 bit kernels, the reason being that kernel addresses
> > > on 64 bit kernels have '' as the leading bit pattern making greping
> > > possible. On 32 kernels we don't have this luxury.
> >
> > Tobin C. Harding wrote:
> > > Only works for 64 bit kernels, the reason being that kernel addresses
> > > on 64 bit kernels have '' as the leading bit pattern making greping
> > > possible. On 32 kernels we don't have this luxury.
> >
> > [RFC] leaking_addresses.pl - enhance it to work for 32-bit kernels as well
> >
> > (Firstly, apologies if I've got the protocol horribly wrong- should this
> > be a new thread altogether?)
>
> I think this patch will need to wait until the patch set that is
> currently in flight is either merged or dropped.
> Thanks for looking at it!

Okay; blocking on merge || drop... :-)

>
>
> We can work this out pragmatically, Perl can give us an architecture
> string then a few regexs can ascertain which architecture we are running
> on. This is in the inflight patch set.
>
> > The patch below does Not take into account (yet) stuff like:
> > - exactly which files & dirs should be skipped on 32-bit (will it be
> >   identical to 64-bit?; unsure..)
>
> As per discussion later in this thread we may need to consider
> architecture specific lists for files/directories to skip.

Right

>
> > - it currently hard-codes a global 'PAGE_OFFSET_32BIT=0xc000', just
> >   so I can test quickly; must figure whether to query it or pass it;
> >   Suggestions?
>
> Perhaps we should have a command line option for this.
>
>	--kernel-base-address

Why not just detect it programmatically?
We could devise a series of fallbacks; something like:
- if .config exists in the kernel source tree root, grep it for
  PAGE_OFFSET
- if not, grep the arch-specific (arch//configs/) for the same
- if for some reason we don't have enough info regarding specific
  platform and thus the defconfig filename (could happen for ARM, PPC?),
  we then fail and request the user to pass it as a parameter.

> > - the 'false positives'; again, what differs for 32-bit?
> >   (BTW, shouldn't the dmesg 'root=UUID=<...>' line be a false positive
> >   & skipped?).
>
> We could probably do with architecture specific false
> positives. Inflight patch set refactors false_positive() so adding to
> this should be easy.

Sure.

>
> > Also, I must point out that I'm a complete newbie to Perl :-) so, pl excuse
> > my highly inadequate perl-foo; I rely on you perl gurus out there to fix
> > and optimize :)
>
> I'm no Perl guru but following are a few tips I have picked up over the
> last month.

Thanks, will fix the issues you point out..

>
>
>
> Conceptually your ideas look good to me. If there is some reason this
> approach won't work hopefully someone else will jump in and say so.
>
> Nice work, thanks for putting in effort to get 32 bit machines
> supported. Let's see what happens with the inflight patch set then work
> on getting these ideas in.
>

Thanks! yes..

> thanks,
> Tobin.
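The first fallback suggested above (grep .config for PAGE_OFFSET) could look roughly like the sketch below. leaking_addresses.pl itself is written in Perl; this is only a C illustration of the parsing step, under the assumption that the relevant line has the form CONFIG_PAGE_OFFSET=<hex value>. The function name find_page_offset() is hypothetical, and the caller is assumed to have already read the config file into memory.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Scan config text line by line for "CONFIG_PAGE_OFFSET=<value>".
 * On success store the parsed value in *out and return 0.
 * Return -1 if the key is absent or its value is not a number, so the
 * caller can move on to the next fallback (defconfig, then a user option).
 */
int find_page_offset(const char *config_text, unsigned long long *out)
{
    static const char key[] = "CONFIG_PAGE_OFFSET=";
    const size_t key_len = sizeof(key) - 1;
    const char *line = config_text;

    while (line && *line) {
        if (strncmp(line, key, key_len) == 0) {
            const char *val = line + key_len;
            char *end;
            /* Base 0 lets strtoull() accept the usual 0x... notation. */
            unsigned long long v = strtoull(val, &end, 0);

            if (end == val)
                return -1;
            *out = v;
            return 0;
        }
        /* Advance to the character after the next newline, if any. */
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}
```

Matching only at the start of a line means that a commented-out entry such as "# CONFIG_PAGE_OFFSET is not set" is ignored, which is the behaviour a fallback chain would want.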
linux-next: build warning after merge of the akpm-current tree
Hi Andrew,

After merging the akpm-current tree, today's linux-next build (powerpc
ppc64_defconfig) produced this warning:

In file included from include/linux/mmzone.h:17:0,
                 from include/linux/mempolicy.h:10,
                 from mm/mempolicy.c:70:
mm/mempolicy.c: In function 'mpol_to_str':
include/linux/nodemask.h:107:41: warning: the address of 'nodes' will always evaluate as 'true' [-Waddress]
 #define nodemask_pr_args(maskp) (maskp) ? MAX_NUMNODES : 0, (maskp) ? (maskp)->bits : NULL
                                         ^
mm/mempolicy.c:2817:11: note: in expansion of macro 'nodemask_pr_args'
           nodemask_pr_args());
           ^
include/linux/nodemask.h:107:69: warning: the address of 'nodes' will always evaluate as 'true' [-Waddress]
 #define nodemask_pr_args(maskp) (maskp) ? MAX_NUMNODES : 0, (maskp) ? (maskp)->bits : NULL
                                                                     ^
mm/mempolicy.c:2817:11: note: in expansion of macro 'nodemask_pr_args'
           nodemask_pr_args());
           ^

Introduced by commit

  b2c1ed23bdc1 ("mm: simplify nodemask printing")

-- 
Cheers,
Stephen Rothwell
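The warning arises because the macro expands the NULL test in place, so when the argument is the address of a local variable the compiler can see that the test is always true. One possible way to keep the NULL check without tripping -Waddress is to move each test into a small inline function, where the pointer is an ordinary parameter. This is a sketch under stated assumptions, not necessarily what was merged: the helper names pr_numnodes() and pr_bits() are hypothetical, and nodemask_t and MAX_NUMNODES are simplified stand-ins for the kernel definitions.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel definitions (assumptions). */
#define MAX_NUMNODES 64
typedef struct { unsigned long bits[1]; } nodemask_t;

/* The NULL test now applies to a function parameter, not to '&nodes'. */
static inline unsigned int pr_numnodes(const nodemask_t *m)
{
    return m ? MAX_NUMNODES : 0;
}

static inline const unsigned long *pr_bits(const nodemask_t *m)
{
    return m ? m->bits : NULL;
}

/* Still expands to the two arguments a width-plus-pointer format expects. */
#define nodemask_pr_args(maskp) pr_numnodes(maskp), pr_bits(maskp)
```

Call sites stay unchanged: the macro still supplies a field width followed by a pointer to the bitmap, and a NULL mask still prints as an empty one.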
[PATCH] powerpc/perf: Add debugfs interface for imc-mode and imc-command
The In-Memory Collection (IMC) counter PMU driver controls the ucode's
execution state. At system boot, the IMC perf driver pauses the ucode.
The ucode state is changed to "running" only when any of the nest units
are monitored or profiled using the perf tool.

Nest units support only a limited set of hardware counters, and the ucode
is always programmed in the "production" ("accumulation") mode. This mode
is configured to provide key performance metric data for most of the nest
units.

But the ucode also supports other modes, which can be used for "debug" to
drill down into specific nest units. That is, the ucode, when switched to
the "powerbus" debug mode (for example), will dynamically reconfigure the
nest counters to target only "powerbus" related events in the hardware
counters. This allows the IMC nest unit to focus on powerbus related
transactions in the system in more detail. At this point, production mode
events may or may not be counted.

IMC nest counters have both in-band (ucode) and out-of-band access. Since
not all nest counter configurations are supported by the ucode,
out-of-band tools are used to characterize other nest counter
configurations.

This patch provides an interface via debugfs to enable switching of ucode
modes in the system. To switch the ucode mode, one has to first pause the
microcode (imc_cmd), and then write the target mode value to the
"imc_mode" file.

Proposed Approach
=================

In the proposed approach, the function (export_imc_mode_and_cmd) which
creates the debugfs interface for imc mode and command is implemented in
opal-imc.c. Thus we can use imc_get_mem_addr() to get the homer base
address for each chip.

The interface to expose imc mode and command is required only if we have
nest pmu units registered. Employing the existing data structures to
track whether we have any nest units registered would require extending
data from the perf side to opal-imc.c. Instead, an integer is introduced
to hold that information by counting successful nest unit registrations.
The debugfs interface is removed based on that count.

Example for the interface:

root@:/sys/kernel/debug/imc# ls
imc_cmd_0  imc_cmd_8  imc_mode_0  imc_mode_8

Signed-off-by: Anju T Sudhakar
---
 arch/powerpc/include/asm/imc-pmu.h        |  7 +++
 arch/powerpc/platforms/powernv/opal-imc.c | 74 ++-
 2 files changed, 79 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/imc-pmu.h b/arch/powerpc/include/asm/imc-pmu.h
index 7f74c28..317002d 100644
--- a/arch/powerpc/include/asm/imc-pmu.h
+++ b/arch/powerpc/include/asm/imc-pmu.h
@@ -40,6 +40,13 @@
 #define THREAD_IMC_ENABLE 0x8000ULL
 
 /*
+ * For debugfs interface for imc-mode and imc-command
+ */
+#define IMC_CNTL_BLK_OFFSET		0x3FC00
+#define IMC_CNTL_BLK_CMD_OFFSET		8
+#define IMC_CNTL_BLK_MODE_OFFSET	32
+
+/*
  * Structure to hold memory address information for imc units.
  */
 struct imc_mem_info {
diff --git a/arch/powerpc/platforms/powernv/opal-imc.c b/arch/powerpc/platforms/powernv/opal-imc.c
index 21f6531..a88ddab 100644
--- a/arch/powerpc/platforms/powernv/opal-imc.c
+++ b/arch/powerpc/platforms/powernv/opal-imc.c
@@ -21,6 +21,70 @@
 #include
 #include
 #include
+#include
+
+static struct dentry *parent;
+
+/* Helpers to export imc command and status via debugfs */
+static int debugfs_imc_mem_get(void *data, u64 *val)
+{
+	*val = cpu_to_be64(*(u64 *)data);
+	return 0;
+}
+
+static int debugfs_imc_mem_set(void *data, u64 val)
+{
+	*(u64 *)data = cpu_to_be64(val);
+	return 0;
+}
+DEFINE_DEBUGFS_ATTRIBUTE(fops_imc_x64, debugfs_imc_mem_get, debugfs_imc_mem_set,
+
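To make the three new offsets more concrete, here is a userspace sketch of how the command and mode fields might be located relative to a per-chip base address, and of storing a 64-bit value in big-endian byte order as the get/set helpers do with cpu_to_be64(). The quoted diff is cut off before the offsets are used, so the address arithmetic (base + control block offset + field offset) is an assumption, and the function names imc_cmd_addr(), imc_mode_addr(), store_be64() and load_be64() are hypothetical.

```c
#include <stdint.h>

/* Offsets as defined in the patch to imc-pmu.h. */
#define IMC_CNTL_BLK_OFFSET      0x3FC00
#define IMC_CNTL_BLK_CMD_OFFSET  8
#define IMC_CNTL_BLK_MODE_OFFSET 32

/* Assumed layout: the control block sits at a fixed offset from the base. */
uint64_t imc_cmd_addr(uint64_t base)
{
    return base + IMC_CNTL_BLK_OFFSET + IMC_CNTL_BLK_CMD_OFFSET;
}

uint64_t imc_mode_addr(uint64_t base)
{
    return base + IMC_CNTL_BLK_OFFSET + IMC_CNTL_BLK_MODE_OFFSET;
}

/* Write v to p most significant byte first, regardless of host endianness. */
void store_be64(unsigned char *p, uint64_t v)
{
    for (int i = 7; i >= 0; i--) {
        p[i] = (unsigned char)(v & 0xff);
        v >>= 8;
    }
}

/* Read a big-endian 64-bit value back into host representation. */
uint64_t load_be64(const unsigned char *p)
{
    uint64_t v = 0;

    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}
```

Under these assumptions, each imc_cmd_N and imc_mode_N debugfs file would correspond to one such pair of addresses for chip N.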
Re: [PATCH] powerpc/perf: Add debugfs interface for imc run-mode and run-status
Hi, Kindly ignore this version Thanks, Anju On Monday 13 November 2017 11:06 AM, Anju T Sudhakar wrote: In memory Collection (IMC) counter pmu driver controls the ucode's execution state. At the system boot, IMC perf driver pause the ucode. Ucode state is changed to "running" only when any of the nest units are monitored or profiled using perf tool. Nest units support only limited set of hardware counters and ucode is always programmed in the "production mode" ("accumulation") mode. This mode is configured to provide key performance metric data for most of the nest units. But ucode also supports other modes which would be used for "debug" to drill down specific nest units. That is, ucode when switched to "powerbus" debug mode (for example), will dynamically reconfigure the nest counters to target only "powerbus" related events in the hardware counters. This allows the IMC nest unit to focus on powerbus related transactions in the system in more detail. At this point, production mode events may or may not be counted. IMC nest counters has both in-band (ucode access) and out of band access to it. Since not all nest counter configurations are supported by ucode, out of band tools are used to characterize other nest counter configurations. Patch provides an interface via "debugfs" to enable the switching of ucode modes in the system. To switch ucode mode, one has to first pause the microcode (imc_cmd), and then write the target mode value to the "imc_mode" file. Proposed Approach === In the proposed approach, the function (export_imc_mode_and_cmd) which creates the debugfs interface for imc mode and command is implemented in opal-imc.c. Thus we can use imc_get_mem_addr() to get the homer base address for each chip. The interface to expose imc mode and command is required only if we have nest pmu units registered. Employing the existing data structures to track whether we have any nest units registered will require to extend data from perf side to opal-imc.c. 
Instead an integer is introduced to hold that information by counting successful nest unit registration. Debugfs interface is removed based on the integer count. Example for the interface: root@:/sys/kernel/debug/imc# ls imc_cmd_0 imc_cmd_8 imc_mode_0 imc_mode_8 Signed-off-by: Anju T Sudhakar--- arch/powerpc/include/asm/imc-pmu.h| 7 +++ arch/powerpc/platforms/powernv/opal-imc.c | 74 ++- 2 files changed, 79 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/imc-pmu.h b/arch/powerpc/include/asm/imc-pmu.h index 7f74c28..317002d 100644 --- a/arch/powerpc/include/asm/imc-pmu.h +++ b/arch/powerpc/include/asm/imc-pmu.h @@ -40,6 +40,13 @@ #define THREAD_IMC_ENABLE 0x8000ULL /* + * For debugfs interface for imc-mode and imc-command + */ +#define IMC_CNTL_BLK_OFFSET0x3FC00 +#define IMC_CNTL_BLK_CMD_OFFSET8 +#define IMC_CNTL_BLK_MODE_OFFSET 32 + +/* * Structure to hold memory address information for imc units. */ struct imc_mem_info { diff --git a/arch/powerpc/platforms/powernv/opal-imc.c b/arch/powerpc/platforms/powernv/opal-imc.c index 21f6531..a88ddab 100644 --- a/arch/powerpc/platforms/powernv/opal-imc.c +++ b/arch/powerpc/platforms/powernv/opal-imc.c @@ -21,6 +21,70 @@ #include #include #include +#include + +static struct dentry *parent; + +/* Helpers to export imc command and status via debugfs */ +static int debugfs_imc_mem_get(void *data, u64 *val) +{ + *val = cpu_to_be64(*(u64 *)data); + return 0; +} + +static int debugfs_imc_mem_set(void *data, u64 val) +{ + *(u64 *)data = cpu_to_be64(val); + return 0; +} +DEFINE_DEBUGFS_ATTRIBUTE(fops_imc_x64, debugfs_imc_mem_get, debugfs_imc_mem_set, + "0x%016llx\n"); + +static struct dentry *debugfs_create_imc_x64(const char *name, umode_t mode, + struct dentry *parent, u64 *value) +{ + return debugfs_create_file_unsafe(name, mode, parent, value, _imc_x64); +} + +/* + * export_imc_mode_and_cmd: Create a debugfs interface + * for
Re: linux-next: manual merge of the powerpc tree with Linus' tree
Hi all,

On Mon, 30 Oct 2017 12:51:33 + Mark Brown wrote:
>
> Hi all,
>
> Today's linux-next merge of the powerpc tree got a conflict in:
>
>   arch/powerpc/kvm/powerpc.c
>
> between commit:
>
>   ac64115a66c1 ("KVM: PPC: Fix oops when checking KVM_CAP_PPC_HTM")
>
> from Linus' tree and commit:
>
>   2a3d6553cbd7 ("KVM: PPC: Tie KVM_CAP_PPC_HTM to the user-visible TM feature")
>
> from the powerpc tree.
>
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging. You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.
>
> diff --cc arch/powerpc/kvm/powerpc.c
> index ee279c7f4802,a3746b98ec11..
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@@ -644,7 -644,8 +644,8 @@@ int kvm_vm_ioctl_check_extension(struc
>   		break;
>   #endif
>   	case KVM_CAP_PPC_HTM:
> - 		r = cpu_has_feature(CPU_FTR_TM_COMP) && hv_enabled;
>  -		r = is_kvmppc_hv_enabled(kvm) &&
> ++		r = hv_enabled &&
> + 		    (cur_cpu_spec->cpu_user_features2 & PPC_FEATURE2_HTM_COMP);
>   		break;
>   	default:
>   		r = 0;

Just a reminder that this conflict still exists.

-- 
Cheers,
Stephen Rothwell
[PATCH] powerpc/perf: Add debugfs interface for imc run-mode and run-status
In-Memory Collection (IMC) counter pmu driver controls the ucode's
execution state. At system boot, the IMC perf driver pauses the ucode.
The ucode state is changed to "running" only when any of the nest units
are monitored or profiled using the perf tool.

Nest units support only a limited set of hardware counters, and the ucode
is always programmed in the "production" ("accumulation") mode. This mode
is configured to provide key performance metric data for most of the nest
units.

But the ucode also supports other modes, which would be used for "debug"
to drill down into specific nest units. That is, the ucode, when switched
to "powerbus" debug mode (for example), will dynamically reconfigure the
nest counters to target only "powerbus" related events in the hardware
counters. This allows the IMC nest unit to focus on powerbus related
transactions in the system in more detail. At this point, production mode
events may or may not be counted.

IMC nest counters have both in-band (ucode access) and out-of-band
access. Since not all nest counter configurations are supported by the
ucode, out-of-band tools are used to characterize other nest counter
configurations.

The patch provides an interface via "debugfs" to enable the switching of
ucode modes in the system. To switch the ucode mode, one has to first
pause the microcode (imc_cmd), and then write the target mode value to
the "imc_mode" file.

Proposed Approach
=================

In the proposed approach, the function (export_imc_mode_and_cmd) which
creates the debugfs interface for imc mode and command is implemented in
opal-imc.c. Thus we can use imc_get_mem_addr() to get the homer base
address for each chip.

The interface to expose imc mode and command is required only if we have
nest pmu units registered. Employing the existing data structures to
track whether we have any nest units registered would require extending
data from the perf side to opal-imc.c. Instead, an integer is introduced
to hold that information by counting successful nest unit registrations.
The debugfs interface is removed based on the integer count.

Example for the interface:

root@:/sys/kernel/debug/imc# ls
imc_cmd_0  imc_cmd_8  imc_mode_0  imc_mode_8

Signed-off-by: Anju T Sudhakar
---
 arch/powerpc/include/asm/imc-pmu.h        |  7 +++
 arch/powerpc/platforms/powernv/opal-imc.c | 74 ++-
 2 files changed, 79 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/imc-pmu.h b/arch/powerpc/include/asm/imc-pmu.h
index 7f74c28..317002d 100644
--- a/arch/powerpc/include/asm/imc-pmu.h
+++ b/arch/powerpc/include/asm/imc-pmu.h
@@ -40,6 +40,13 @@
 #define THREAD_IMC_ENABLE 0x8000ULL
 
 /*
+ * For debugfs interface for imc-mode and imc-command
+ */
+#define IMC_CNTL_BLK_OFFSET		0x3FC00
+#define IMC_CNTL_BLK_CMD_OFFSET		8
+#define IMC_CNTL_BLK_MODE_OFFSET	32
+
+/*
  * Structure to hold memory address information for imc units.
  */
 struct imc_mem_info {
diff --git a/arch/powerpc/platforms/powernv/opal-imc.c b/arch/powerpc/platforms/powernv/opal-imc.c
index 21f6531..a88ddab 100644
--- a/arch/powerpc/platforms/powernv/opal-imc.c
+++ b/arch/powerpc/platforms/powernv/opal-imc.c
@@ -21,6 +21,70 @@
 #include
 #include
 #include
+#include
+
+static struct dentry *parent;
+
+/* Helpers to export imc command and status via debugfs */
+static int debugfs_imc_mem_get(void *data, u64 *val)
+{
+	*val = cpu_to_be64(*(u64 *)data);
+	return 0;
+}
+
+static int debugfs_imc_mem_set(void *data, u64 val)
+{
+	*(u64 *)data = cpu_to_be64(val);
+	return 0;
+}
+DEFINE_DEBUGFS_ATTRIBUTE(fops_imc_x64, debugfs_imc_mem_get, debugfs_imc_mem_set,
+
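
For readers following the patch, here is a minimal userspace sketch of the two ideas visible in the hunks above: locating the command and mode words inside the control block, and storing a 64-bit value in big-endian byte order regardless of host endianness. It assumes the three offsets are simply added to a per-chip base address, which the visible part of the patch does not confirm. The function names (imc_cmd_addr, imc_mode_addr, store_be64, load_be64) are hypothetical and do not appear in the patch.

```c
#include <stdint.h>

/* Offsets copied from the imc-pmu.h hunk of the patch. */
#define IMC_CNTL_BLK_OFFSET       0x3FC00
#define IMC_CNTL_BLK_CMD_OFFSET   8
#define IMC_CNTL_BLK_MODE_OFFSET  32

/*
 * Hypothetical helpers: assume the control block sits at a fixed offset
 * from the per-chip base, with the command and mode words inside it.
 */
static uint64_t imc_cmd_addr(uint64_t chip_base)
{
	return chip_base + IMC_CNTL_BLK_OFFSET + IMC_CNTL_BLK_CMD_OFFSET;
}

static uint64_t imc_mode_addr(uint64_t chip_base)
{
	return chip_base + IMC_CNTL_BLK_OFFSET + IMC_CNTL_BLK_MODE_OFFSET;
}

/* Store v most-significant byte first, independent of host byte order. */
static void store_be64(uint8_t *p, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		p[i] = (uint8_t)(v >> (56 - 8 * i));
}

/* Read back a value written by store_be64(). */
static uint64_t load_be64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}
```

The byte-wise helpers play the role that cpu_to_be64() plays in the kernel code: on a little-endian host the bytes are swapped, on a big-endian host they are left alone.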
Re: linux-next: manual merge of the integrity tree with the jc-docs tree
Hi all,

On Wed, 18 Oct 2017 11:50:25 +0100 Mark Brown wrote:
>
> Today's linux-next merge of the integrity tree got a conflict in:
>
>    Documentation/ABI/testing/evm
>
> between commit:
>
>    c7f66400f504fd5 ("Documentation: fix security related doc refs")
>
> from the jc-docs tree and commit:
>
>    cbad39d632b7c18 ("EVM: Allow userspace to signal an RSA key has been loaded")
>
> from the integrity tree.
>
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging. You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.
>
> diff --cc Documentation/ABI/testing/evm
> index ca622c9aa24c,a0bbccb00736..
> --- a/Documentation/ABI/testing/evm
> +++ b/Documentation/ABI/testing/evm
> @@@ -7,17 -7,36 +7,36 @@@ Description
>   		HMAC-sha1 value across the extended attributes, storing the
>   		value as the extended attribute 'security.evm'.
>   
> - 		EVM depends on the Kernel Key Retention System to provide it
> - 		with a trusted/encrypted key for the HMAC-sha1 operation.
> - 		The key is loaded onto the root's keyring using keyctl. Until
> - 		EVM receives notification that the key has been successfully
> - 		loaded onto the keyring (echo 1 > /evm), EVM
> - 		can not create or validate the 'security.evm' xattr, but
> - 		returns INTEGRITY_UNKNOWN. Loading the key and signaling EVM
> - 		should be done as early as possible. Normally this is done
> - 		in the initramfs, which has already been measured as part
> - 		of the trusted boot. For more information on creating and
> - 		loading existing trusted/encrypted keys, refer to:
> - 		Documentation/security/keys/trusted-encrypted.rst. (A sample
> - 		dracut patch, which loads the trusted/encrypted key and enables
> - 		EVM, is available from http://linux-ima.sourceforge.net/#EVM.)
> + 		EVM supports two classes of security.evm. The first is
> + 		an HMAC-sha1 generated locally with a
> + 		trusted/encrypted key stored in the Kernel Key
> + 		Retention System. The second is a digital signature
> + 		generated either locally or remotely using an
> + 		asymmetric key. These keys are loaded onto root's
> + 		keyring using keyctl, and EVM is then enabled by
> + 		echoing a value to /evm:
> + 
> + 		1: enable HMAC validation and creation
> + 		2: enable digital signature validation
> + 		3: enable HMAC and digital signature validation and HMAC
> + 		   creation
> + 
> + 		Further writes will be blocked if HMAC support is enabled or
> + 		if bit 32 is set:
> + 
> + 		echo 0x8002 >/evm
> + 
> + 		will enable digital signature validation and block
> + 		further writes to /evm.
> + 
> + 		Until this is done, EVM can not create or validate the
> + 		'security.evm' xattr, but returns INTEGRITY_UNKNOWN.
> + 		Loading keys and signaling EVM should be done as early
> + 		as possible. Normally this is done in the initramfs,
> + 		which has already been measured as part of the trusted
> + 		boot. For more information on creating and loading
> + 		existing trusted/encrypted keys, refer to:
>  -		Documentation/keys-trusted-encrypted.txt. Both dracut
> ++		Documentation/security/keys/trusted-encrypted.rst. Both dracut
> + 		(via 97masterkey and 98integrity) and systemd (via
> + 		core/ima-setup) have support for loading keys at boot
> + 		time.

Just a reminder that this conflict still exists (and is now relevant to
the security tree).

-- 
Cheers,
Stephen Rothwell
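
As a reading aid for the value table in the quoted documentation, here is a small sketch in C of how such a value could be decoded. It is not the kernel's implementation. The macro and function names are hypothetical, and it assumes that "bit 32" means the most significant bit of a 32-bit value (0x80000000); the example value in the archived text looks shortened, so that reading is a guess.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; the values 1 and 2 follow the quoted table. */
#define EVM_SKETCH_HMAC  UINT32_C(0x1)
#define EVM_SKETCH_SIG   UINT32_C(0x2)
/* Assumption: "bit 32" is the top bit of a 32-bit value. */
#define EVM_SKETCH_LOCK  UINT32_C(0x80000000)

/* HMAC validation and creation requested (values 1 and 3). */
static bool evm_sketch_hmac(uint32_t v)
{
	return (v & EVM_SKETCH_HMAC) != 0;
}

/* Digital signature validation requested (values 2 and 3). */
static bool evm_sketch_sig(uint32_t v)
{
	return (v & EVM_SKETCH_SIG) != 0;
}

/*
 * Per the quoted text, further writes are blocked if HMAC support
 * is enabled or if the lock bit is set.
 */
static bool evm_sketch_writes_blocked(uint32_t v)
{
	return evm_sketch_hmac(v) || (v & EVM_SKETCH_LOCK) != 0;
}
```

Under these assumptions, writing 2 alone leaves the file writable, while 1, 3, or 2 combined with the lock bit would block further writes.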
Re: [kernel-hardening] Re: [PATCH v4] scripts: add leaking_addresses.pl
On Mon, Nov 13, 2017 at 10:05 AM, Tobin C. Harding wrote:
> On Mon, Nov 13, 2017 at 06:37:28AM +0300, Kirill A. Shutemov wrote:
>> On Mon, Nov 13, 2017 at 10:06:46AM +1100, Tobin C. Harding wrote:
>> > On Sun, Nov 12, 2017 at 02:10:07AM +0300, Kirill A. Shutemov wrote:
...
>> >
>> > Thanks for the link. So it looks like we need to refactor the kernel
>> > address regular expression into a function that takes into account the
>> > machine architecture and the number of page table levels. We will need
>> > to add this to the false positive checks also.
>> >
>> > > Not sure if we care. It won't work either for other 64-bit
>> > > architectures that have more than 256TB of virtual address space.
>> >
>> > Is this because of the virtual memory map?
>>
>> On x86 direct mapping is the nearest thing we have to userspace.
>>
>> > Did you mean 512TB?
>>
>> No, I mean 256TB.
>>
>> You have all kernel memory in the range from 0x to
>> 0x if you have 256 TB of virtual address space. If you
>> have more, something might be outside the range.
>
> Doesn't 4-level paging already limit a system to 64TB of memory? So any
> system better equipped than this will use 5-level paging right? If I am
> totally talking rubbish please ignore, I'm appreciative that you pointed
> out the limitation already. Perhaps we can add a comment to the script
>
> # Script may miss some addresses on machines with more than 256TB of
> # memory.

I think the 256TB is wrt *virtual* address space, not physical RAM.

Also, IMHO, the script should 'transparently' take into account the
number of paging levels (instead of the user needing to pass a
parameter). IOW it should be able to detect this itself (say, from the
.config file) and act accordingly, in the sense that the regexes and
associated logic would differ accordingly.
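
To make the dependence on paging levels concrete, here is a minimal sketch of the range check, written in C rather than the script's Perl. It assumes x86_64, with 48 virtual address bits under 4-level paging (256TB in total) and 57 bits under 5-level paging. The function names are hypothetical and are not part of leaking_addresses.pl.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Lowest address of the upper (kernel) half of the canonical address
 * space. Assumption: x86_64 with 48 virtual address bits for 4-level
 * paging and 57 bits for 5-level paging.
 */
static uint64_t kernel_half_start(int paging_levels)
{
	int va_bits = (paging_levels == 5) ? 57 : 48;

	/* In the upper half, every bit from (va_bits - 1) upward is set. */
	return ~UINT64_C(0) << (va_bits - 1);
}

/* True if addr falls in the kernel half for the given paging depth. */
static bool looks_like_kernel_addr(uint64_t addr, int paging_levels)
{
	return addr >= kernel_half_start(paging_levels);
}
```

A check written only for the 4-level case would miss addresses that lie below its lower bound but are still kernel addresses with 5 levels, which is the case raised earlier in the thread. A script could pick the level count from the kernel configuration, as suggested above, and build its regex from the resulting bound.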
Re: linux-next: manual merge of the tip tree with the FIXME tree
Hi Mark,

On Wed, 11 Oct 2017 17:10:35 +0100 Mark Brown wrote:
>
> Today's linux-next merge of the tip tree got a conflict in:
>
>    arch/s390/include/asm/spinlock.h
>
> between a series of commits adding wait queuing to s390 spinlocks
> from the s390 tree:
>
>    eb3b7b848fb3dd00f7a57d633 s390/rwlock: introduce rwlock wait queueing
>    b96f7d881ad94203e997cd2aa s390/spinlock: introduce spinlock wait queueing
>    8153380379ecc8381f6d55f64 s390/spinlock: use the cpu number +1 as spinlock value
>
> and Will's series of commits removing dummy implementations of spinlock
> related things from the tip tree:
>
>    a4c1887d4c1462b0ec5a8989f locking/arch: Remove dummy arch_{read,spin,write}_lock_flags() implementations
>    0160fb177d484367e041ac251 locking/arch: Remove dummy arch_{read,spin,write}_relax() implementations
>    a8a217c22116eff6c120d753c locking/core: Remove {read,spin,write}_can_lock()
>
> I don't feel confident I can resolve this conflict sensibly without
> taking too long so I've used the tip tree from yesterday.

Just a reminder that this conflict still exists.

-- 
Cheers,
Stephen Rothwell
RE: [PATCH] mm/hugetlb: Implement ASLR and topdown for hugetlb mappings
Hi, Russell,

Do you have any time to check this patch?

I ran into this gap in my work: the application cannot mmap a large
hugepage region (about 360MB) because the legacy bottom-up search from
the default TASK_UNMAPPED_BASE finds no sufficiently large contiguous
virtual address range.
We need this patch to fix this issue.

Could you please help check this patch?

Thanks!

BR, Shile

-----Original Message-----
From: Shile Zhang [mailto:shile.zh...@nokia-sbell.com]
Sent: Friday, November 03, 2017 5:19 PM
To: Russell King
Cc: linux-kernel@vger.kernel.org; Zhang, Shile (NSB - CN/Hangzhou)
Subject: [PATCH] mm/hugetlb: Implement ASLR and topdown for hugetlb mappings

merge from arch/x86

Signed-off-by: Shile Zhang
---
 arch/arm/include/asm/page.h |  1 +
 arch/arm/mm/hugetlbpage.c   | 85 +
 2 files changed, 86 insertions(+)

diff --git a/arch/arm/include/asm/page.h b/arch/arm/include/asm/page.h
index 4355f0e..994630f 100644
--- a/arch/arm/include/asm/page.h
+++ b/arch/arm/include/asm/page.h
@@ -144,6 +144,7 @@ extern void copy_page(void *to, const void *from);
 
 #ifdef CONFIG_KUSER_HELPERS
 #define __HAVE_ARCH_GATE_AREA 1
+#define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
 #endif
 
 #ifdef CONFIG_ARM_LPAE
diff --git a/arch/arm/mm/hugetlbpage.c b/arch/arm/mm/hugetlbpage.c
index fcafb52..46ed0c8 100644
--- a/arch/arm/mm/hugetlbpage.c
+++ b/arch/arm/mm/hugetlbpage.c
@@ -45,3 +45,88 @@ int pmd_huge(pmd_t pmd)
 {
 	return pmd_val(pmd) && !(pmd_val(pmd) & PMD_TABLE_BIT);
 }
+
+#ifdef CONFIG_HUGETLB_PAGE
+static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
+		unsigned long addr, unsigned long len,
+		unsigned long pgoff, unsigned long flags)
+{
+	struct hstate *h = hstate_file(file);
+	struct vm_unmapped_area_info info;
+
+	info.flags = 0;
+	info.length = len;
+	info.low_limit = current->mm->mmap_legacy_base;
+	info.high_limit = TASK_SIZE;
+	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
+	info.align_offset = 0;
+	return vm_unmapped_area(&info);
+}
+
+static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
+		unsigned long addr0, unsigned long len,
+		unsigned long pgoff, unsigned long flags)
+{
+	struct hstate *h = hstate_file(file);
+	struct vm_unmapped_area_info info;
+	unsigned long addr;
+
+	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
+	info.length = len;
+	info.low_limit = PAGE_SIZE;
+	info.high_limit = current->mm->mmap_base;
+	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
+	info.align_offset = 0;
+	addr = vm_unmapped_area(&info);
+
+	/*
+	 * A failed mmap() very likely causes application failure,
+	 * so fall back to the bottom-up function here. This scenario
+	 * can happen with large stack limits and large mmap()
+	 * allocations.
+	 */
+	if (addr & ~PAGE_MASK) {
+		VM_BUG_ON(addr != -ENOMEM);
+		info.flags = 0;
+		info.low_limit = TASK_UNMAPPED_BASE;
+		info.high_limit = TASK_SIZE;
+		addr = vm_unmapped_area(&info);
+	}
+
+	return addr;
+}
+
+unsigned long
+hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
+		unsigned long len, unsigned long pgoff, unsigned long flags)
+{
+	struct hstate *h = hstate_file(file);
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma;
+
+	if (len & ~huge_page_mask(h))
+		return -EINVAL;
+	if (len > TASK_SIZE)
+		return -ENOMEM;
+
+	if (flags & MAP_FIXED) {
+		if (prepare_hugepage_range(file, addr, len))
+			return -EINVAL;
+		return addr;
+	}
+
+	if (addr) {
+		addr = ALIGN(addr, huge_page_size(h));
+		vma = find_vma(mm, addr);
+		if (TASK_SIZE - len >= addr &&
+		    (!vma || addr + len <= vma->vm_start))
+			return addr;
+	}
+	if (mm->get_unmapped_area == arch_get_unmapped_area)
+		return hugetlb_get_unmapped_area_bottomup(file, addr, len,
+				pgoff, flags);
+	else
+		return hugetlb_get_unmapped_area_topdown(file, addr, len,
+				pgoff, flags);
+}
+#endif /* CONFIG_HUGETLB_PAGE */
-- 
2.6.2
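
The length check and the ALIGN() step in hugetlb_get_unmapped_area() come down to simple mask arithmetic. Below is a minimal userspace sketch of that arithmetic. The helper names are hypothetical, and the 2MiB huge page size used in the usage note is an assumption, not something the patch states.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mask that clears the offset bits inside a huge page (size is a power of two). */
static uint64_t sketch_huge_mask(uint64_t hpage_size)
{
	return ~(hpage_size - 1);
}

/* Mirrors the "len & ~huge_page_mask(h)" test: true if len is a whole number of huge pages. */
static bool sketch_len_ok(uint64_t len, uint64_t hpage_size)
{
	return (len & ~sketch_huge_mask(hpage_size)) == 0;
}

/* Mirrors ALIGN(addr, huge_page_size(h)): round addr up to a huge page boundary. */
static uint64_t sketch_align_up(uint64_t addr, uint64_t hpage_size)
{
	return (addr + hpage_size - 1) & sketch_huge_mask(hpage_size);
}
```

With a 2MiB huge page, a 360MB request is a whole number of huge pages and passes the length check, while a hint address such as 0x40001000 would be rounded up to 0x40200000 before the VMA lookup.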
Re: linux-next: manual merge of the tip tree with the s390 tree
Hi all,

On Wed, 11 Oct 2017 16:51:45 +0100 Mark Brown wrote:
>
> Today's linux-next merge of the tip tree got a conflict in:
>
>   arch/s390/include/asm/rwsem.h
>
> between commit:
>
>   91a1fad759ffd ("s390: use generic rwsem implementation")
>
> from the s390 tree and commit:
>
>   a61ba2c8a48f1 ("locking/arch, s390: Add __down_read_killable()")
>
> from the tip tree.
>
> I fixed it up by re-deleting the file and can carry the fix as
> necessary. This is now fixed as far as linux-next is concerned, but any
> non trivial conflicts should be mentioned to your upstream maintainer
> when your tree is submitted for merging. You may also want to consider
> cooperating with the maintainer of the conflicting tree to minimise any
> particularly complex conflicts.

Just a reminder that this conflict still exists.

-- 
Cheers,
Stephen Rothwell
Re: linux-next: manual merge of the drivers-x86 tree with the net-next tree
Hi all,

On Mon, 9 Oct 2017 18:56:33 +0100 Mark Brown wrote:
>
> Today's linux-next merge of the drivers-x86 tree got a conflict in:
>
>   Documentation/admin-guide/thunderbolt.rst
>
> between commit:
>
>   e69b6c02b4c3b ("net: Add support for networking over Thunderbolt cable")
>
> from the net-next tree and commit:
>
>   ce6a90027c10f ("platform/x86: Add driver to force WMI Thunderbolt controller power status")
>
> from the drivers-x86 tree.
>
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging. You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.
>
> diff --cc Documentation/admin-guide/thunderbolt.rst
> index 5c62d11d77e8,dadcd66ee12f..
> --- a/Documentation/admin-guide/thunderbolt.rst
> +++ b/Documentation/admin-guide/thunderbolt.rst
> @@@ -198,26 -198,17 +198,41 @@@ information is missing
>   To recover from this mode, one needs to flash a valid NVM image to the
>   host host controller in the same way it is done in the previous chapter.
>   
>  +Networking over Thunderbolt cable
>  +---------------------------------
>  +Thunderbolt technology allows software communication across two hosts
>  +connected by a Thunderbolt cable.
>  +
>  +It is possible to tunnel any kind of traffic over Thunderbolt link but
>  +currently we only support Apple ThunderboltIP protocol.
>  +
>  +If the other host is running Windows or macOS only thing you need to
>  +do is to connect Thunderbolt cable between the two hosts, the
>  +``thunderbolt-net`` is loaded automatically. If the other host is also
>  +Linux you should load ``thunderbolt-net`` manually on one host (it does
>  +not matter which one)::
>  +
>  +  # modprobe thunderbolt-net
>  +
>  +This triggers module load on the other host automatically. If the driver
>  +is built-in to the kernel image, there is no need to do anything.
>  +
>  +The driver will create one virtual ethernet interface per Thunderbolt
>  +port which are named like ``thunderbolt0`` and so on. From this point
>  +you can either use standard userspace tools like ``ifconfig`` to
>  +configure the interface or let your GUI to handle it automatically.
> ++
> + Forcing power
> + -------------
> + Many OEMs include a method that can be used to force the power of a
> + thunderbolt controller to an "On" state even if nothing is connected.
> + If supported by your machine this will be exposed by the WMI bus with
> + a sysfs attribute called "force_power".
> + 
> + For example the intel-wmi-thunderbolt driver exposes this attribute in:
> +   /sys/devices/platform/PNP0C14:00/wmi_bus/wmi_bus-PNP0C14:00/86CCFD48-205E-4A77-9C48-2021CBEDE341/force_power
> + 
> +   To force the power to on, write 1 to this attribute file.
> +   To disable force power, write 0 to this attribute file.
> + 
> + Note: it's currently not possible to query the force power state of a platform.

Just a reminder that this conflict still exists.

-- 
Cheers,
Stephen Rothwell
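The "write 1 / write 0" instruction in the quoted force_power documentation can also be followed from a small C program instead of a shell redirect. This is a minimal sketch under assumptions: set_force_power() is a hypothetical helper name, and the caller supplies the attribute path, which depends on the machine and on the driver being loaded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: write "1" (force power on) or "0" (stop forcing)
 * to a force_power-style attribute file.
 * Returns 0 on success and -1 if the file cannot be opened or written.
 */
static int set_force_power(const char *attr_path, int on)
{
	FILE *f;
	int ok;

	assert(attr_path != NULL);

	f = fopen(attr_path, "w");
	if (!f)
		return -1;

	/* fputs() returns a non-negative value on success. */
	ok = fputs(on ? "1" : "0", f) >= 0;

	/* Buffered data is flushed at fclose(), so check it as well. */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

On a machine that exposes the attribute, the path argument would be the force_power file quoted above, and the call normally needs root privileges. As the quoted text notes, the state cannot be read back, so the return value only reports whether the write itself succeeded.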
Re: [PATCH 2/3] X86/kdump: crashkernel=X try to reserve below 896M first then below 4G and MAXMEM
On 10/24/17 at 01:31pm, Dave Young wrote:
> Now crashkernel=X will fail if there's not enough memory at low region
> (below 896M) when trying to reserve large memory size. One can use
> crashkernel=xM,high to reserve it at high region (>4G) but it is more
> convenient to improve crashkernel=X to:
>
>  - First try to reserve X below 896M (for being compatible with old
>    kexec-tools).
>  - If fails, try to reserve X below 4G (swiotlb need to stay below 4G).
>  - If fails, try to reserve X from MAXMEM top down.
>
> It's more transparent and user-friendly.
>
> If crashkernel is large and the reserved is beyond 896M, old kexec-tools
> is not compatible with new kernel because old kexec-tools can not load
> kernel at high memory region, there was an old discussion below:
> https://lkml.org/lkml/2013/10/15/601
>
> But actually the behavior is consistent during my test. Suppose
> old kernel fail to reserve memory at low areas, kdump does not
> work because no memory reserved. With this patch, suppose new kernel
> successfully reserved memory at high areas, old kexec-tools still fail
> to load kdump kernel (tested 2.0.2), so it is acceptable, no need to
> worry about the compatibility.
>
> Here is the test result (kexec-tools 2.0.2, no high memory load
> support):
> Crashkernel over 4G:
> # cat /proc/iomem|grep Crash
>   be00-cdff : Crash kernel
>   21300-21eff : Crash kernel
> # ./kexec -p /boot/vmlinuz-`uname -r`
> Memory for crashkernel is not reserved
> Please reserve memory by passing "crashkernel=X@Y" parameter to the kernel
> Then try loading kdump kernel
>
> crashkernel: 896M-4G:
> # cat /proc/iomem|grep Crash
>   9600-cdef : Crash kernel
> # ./kexec -p /boot/vmlinuz-4.14.0-rc4+
> ELF core (kcore) parse failed
> Cannot load /boot/vmlinuz-4.14.0-rc4+
>
> Signed-off-by: Dave Young
> ---
>  arch/x86/kernel/setup.c |   16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
>
> --- linux-x86.orig/arch/x86/kernel/setup.c
> +++ linux-x86/arch/x86/kernel/setup.c
> @@ -568,6 +568,22 @@ static void __init reserve_crashkernel(v
>  						    high ? CRASH_ADDR_HIGH_MAX
>  							 : CRASH_ADDR_LOW_MAX,
>  						    crash_size, CRASH_ALIGN);
> +#ifdef CONFIG_X86_64
> +		/*
> +		 * crashkernel=X reserve below 896M fails? Try below 4G
> +		 */
> +		if (!high && !crash_base)
> +			crash_base = memblock_find_in_range(CRASH_ALIGN,
> +						(1ULL << 32),
> +						crash_size, CRASH_ALIGN);
> +		/*
> +		 * crashkernel=X reserve below 4G fails? Try MAXMEM
> +		 */
> +		if (!high && !crash_base)
> +			crash_base = memblock_find_in_range(CRASH_ALIGN,
> +						CRASH_ADDR_HIGH_MAX,
> +						crash_size, CRASH_ALIGN);
> +#endif
>  		if (!crash_base) {
>  			pr_info("crashkernel reservation failed - No suitable area found.\n");
>  			return;
>

Andrew, this patch is good to have. Could you take it into your tree?

The other two patches may need more discussion, so I will drop them for
now.

Thanks
Dave
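The three-step order described in the quoted changelog (below 896M, then below 4G, then up to MAXMEM) amounts to trying a list of upper limits until one search succeeds. Here is a user-space sketch of that control flow only, not the kernel implementation. The names reserve_with_fallback(), toy_find() and example_limits are hypothetical, the single free region in toy_find() is made up for illustration, and the 64TB value standing in for MAXMEM is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical search callback: returns the base of a free range of
 * `size` bytes inside [start, end), or 0 if there is none. The patch
 * relies on the same "0 means failure" convention for crash_base.
 */
typedef uint64_t (*find_range_fn)(uint64_t start, uint64_t end,
				  uint64_t size, uint64_t align);

/* Illustrative upper limits: 896M, 4G, and an assumed 64TB MAXMEM. */
static const uint64_t example_limits[3] = {
	896ULL << 20, 4ULL << 30, 1ULL << 46,
};

/* Try each upper limit in turn and return the first successful base. */
static uint64_t reserve_with_fallback(find_range_fn find, uint64_t align,
				      const uint64_t *limits, size_t nlimits,
				      uint64_t size)
{
	for (size_t i = 0; i < nlimits; i++) {
		uint64_t base = find(align, limits[i], size, align);

		if (base)
			return base;
	}
	return 0;
}

/*
 * Toy allocator with one made-up free region [4G, 5G). It searches
 * top-down and returns an aligned base, or 0 when nothing fits.
 */
static uint64_t toy_find(uint64_t start, uint64_t end,
			 uint64_t size, uint64_t align)
{
	const uint64_t free_start = 4ULL << 30, free_end = 5ULL << 30;
	uint64_t lo = start > free_start ? start : free_start;
	uint64_t hi = end < free_end ? end : free_end;
	uint64_t base;

	assert(align != 0 && (align & (align - 1)) == 0);
	if (hi <= lo || hi - lo < size)
		return 0;
	base = (hi - size) & ~(align - 1);	/* round down to alignment */
	return base >= lo ? base : 0;
}
```

In this toy setup a 256M request with 16M alignment fails under the 896M and 4G limits and succeeds on the third attempt, while a 2G request fails at every step and yields 0, which corresponds to the "reservation failed" path in the patch.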