Re: [PATCH 06/16] arch: remove tile port
On 3/14/2018 10:36 AM, Arnd Bergmann wrote:
> The Tile architecture port was added by Chris Metcalf in 2010, and
> maintained until early 2018 when he orphaned it due to his departure
> from Mellanox, and nobody else stepped up to maintain it. The product
> line is still around in the form of the BlueField SoC, but no longer
> uses the Tile architecture.
>
> There are also still products for sale with Tile-GX SoCs, notably the
> Mikrotik CCR router family. The products all use old (linux-3.3)
> kernels with lots of patches and won't be upgraded by their
> manufacturers. There have been efforts to port both OpenWRT and Debian
> to these, but both projects have stalled and are very unlikely to be
> continued in the future.
>
> Given that we are reasonably sure that nobody is still using the port
> with an upstream kernel any more, it seems better to remove it now
> while the port is in good shape than to let it bitrot for a few years
> first.

Arnd, thanks for dealing with this.

There are a number of tile-specific driver files that are mostly called
out in the MAINTAINERS file. I would expect you should also delete
those:

-F:	drivers/char/tile-srom.c
-F:	drivers/edac/tile_edac.c
-F:	drivers/net/ethernet/tile/
-F:	drivers/rtc/rtc-tile.c
-F:	drivers/tty/hvc/hvc_tile.c
-F:	drivers/tty/serial/tilegx.c
-F:	drivers/usb/host/*-tilegx.c
-F:	include/linux/usb/tilegx.h

Chris
[GIT PULL] arch/tile "bugfix" for 4.15-rc3
Linus,

This is not exactly a bugfix, but this is my last week at Mellanox and
I am stepping down as arch/tile maintainer, as described in a bit more
detail in my email here:

https://lkml.kernel.org/r/1512402760-12694-1-git-send-email-cmetc...@mellanox.com

So, please pull this one last tile commit to remove me as maintainer,
and to tag the tile architecture as orphaned:

  git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git HEAD

It's been a pleasure working with you and the rest of the Linux
community since 2010 and I hope to continue to do more in the years to
come.

Chris Metcalf (1):
      arch/tile: mark as orphaned

 MAINTAINERS | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com
Re: linux-next: remove the tile tree?
On 12/4/2017 3:25 PM, Stephen Rothwell wrote:
> Hi Chris,
>
> Given commit 8ee5ad1d4c0b ("arch/tile: mark as orphaned") in Linus'
> tree, should I remove the tile tree from linux-next?

Yes, that would make sense.  Good catch!

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com
[PATCH] arch/tile: mark as orphaned
The chip family of TILEPro and TILE-Gx was developed by Tilera, which
was eventually acquired by Mellanox. The tile architecture was added to
the kernel in 2010 and first appeared in 2.6.36.

Now at Mellanox we are developing new chips based on the ARM64
architecture; our last TILE-Gx chip (the Gx72) was released in 2013,
and our customers using tile architecture products are not, as far as
we know, looking to upgrade to newer kernel releases. In the absence of
someone in the community stepping up to take over maintainership, this
commit marks the architecture as orphaned.

Cc: Chris Metcalf <metc...@alum.mit.edu>
Signed-off-by: Chris Metcalf <cmetc...@mellanox.com>
---
This is my last week at Mellanox, and in the absence of customer
engagements, it doesn't seem to make sense to transition the tile
architecture maintainer role over to some other Mellanox employee. It
would be great if someone in the community were interested in taking
over!

I'm also open to a community consensus suggesting that I just "git rm"
the tile-related code instead of tagging it as orphaned, but my sense
is that that's something the community can address later if no one
steps up over a period of several releases to take over ownership.

Note the Cc: tag on this commit; further kernel work (in particular the
task-isolation patch series, which sprang out of some early Tilera
work) will continue to come from that email address.

 MAINTAINERS | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 2d3d750b19c0..67cf1db6cde4 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13458,10 +13458,8 @@ F:	drivers/net/wireless/ti/
 F:	include/linux/wl12xx.h
 
 TILE ARCHITECTURE
-M:	Chris Metcalf <cmetc...@mellanox.com>
 W:	http://www.mellanox.com/repository/solutions/tile-scm/
-T:	git git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git
-S:	Supported
+S:	Orphan
 F:	arch/tile/
 F:	drivers/char/tile-srom.c
 F:	drivers/edac/tile_edac.c
-- 
2.1.2
Re: [PATCH v16 00/13] support "task_isolation" mode
On 11/7/2017 12:10 PM, Christopher Lameter wrote:
> On Mon, 6 Nov 2017, Chris Metcalf wrote:
>> On 11/6/2017 10:38 AM, Christopher Lameter wrote:
>>> What about that d*mn 1 Hz clock?
>>
>> It's still there, so this code still requires some further work
>> before it can actually get a process into long-term task isolation
>> (without the obvious one-line kernel hack). Frederic suggested a
>> while ago forcing updates on cpustats was required as the last
>> gating factor; do we think that is still true? Christoph was working
>> on this at one point - any progress from your point of view?
>>
>>> Well if you still have the 1 HZ clock then you can simply defer the
>>> numa remote page cleanup of the page allocator to the time you
>>> execute that tick.
>>
>> We have to get rid of the 1 Hz tick, so we don't want to tie
>> anything else to it...
>
> Yes we want to get rid of the 1 HZ tick but the work on that could
> also include dealing with the remote page cleanup issue that we have
> deferred. Presumably we have another context where we may be able to
> call into the cleanup code with interrupts enabled.

Right now for task isolation we run with interrupts enabled during the
initial sys_prctl() call, and call quiet_vmstat_sync() there, which
currently calls refresh_cpu_vm_stats(false). In fact we could certainly
pass "true" there instead (and probably should), since we can handle
dealing with the pagesets at this time.

As we return to userspace we will test that nothing surprising happened
with vmstat; if so we jam an EAGAIN into the syscall result value, but
if not, we will be in userspace and won't need to touch the vmstat
counters until we next go back into the kernel.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com
Re: [PATCH v16 10/13] arch/arm: enable task isolation functionality
On 11/3/2017 1:23 PM, Russell King - ARM Linux wrote:
> Since we're potentially about to start the merge window for 4.15 this
> weekend, the timing of this doesn't work well either.

With the start of the merge window now delayed for a week, I'm sure
everyone can distract themselves and help make the last week of -rc8
pass more quickly by digging into this patch series!  :-)

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com
Re: [PATCH v16 09/13] arch/arm64: enable task isolation functionality
On 11/3/2017 1:32 PM, Mark Rutland wrote:
> Hi Chris,
>
> On Fri, Nov 03, 2017 at 01:04:48PM -0400, Chris Metcalf wrote:
>> In do_notify_resume(), call task_isolation_start() for
>> TIF_TASK_ISOLATION tasks. Add _TIF_TASK_ISOLATION to _TIF_WORK_MASK,
>> and define a local NOTIFY_RESUME_LOOP_FLAGS to check in the loop,
>> since we don't clear _TIF_TASK_ISOLATION in the loop.
>>
>> We tweak syscall_trace_enter() slightly to carry the "flags" value
>> from current_thread_info()->flags for each of the tests, rather than
>> doing a volatile read from memory for each one. This avoids a small
>> overhead for each test, and in particular avoids that overhead for
>> TIF_NOHZ when TASK_ISOLATION is not enabled.
>>
>> We instrument the smp_send_reschedule() routine so that it checks
>> for isolated tasks and generates a suitable warning if needed.
>>
>> Finally, report on page faults in task-isolation processes in
>> do_page_faults().
>
> I don't have much context for this (I only received patches 9, 10,
> and 12), and this commit message doesn't help me to understand why
> these changes are necessary.

Sorry, I missed having you on the cover letter. I'll fix that for the
next spin. The cover letter (and rest of the series) is here:

https://lkml.org/lkml/2017/11/3/589

The core piece of the patch is here:

https://lkml.org/lkml/2017/11/3/598

> Here we add to _TIF_WORK_MASK...
>
> [...]
>
> ... and here we open-code the *old* _TIF_WORK_MASK. Can we drop both
> in , building one in terms of the other:
>
> #define _TIF_WORK_NOISOLATION_MASK \
> 	(_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_RESUME | \
> 	 _TIF_FOREIGN_FPSTATE | _TIF_UPROBE | _TIF_FSCHECK)
>
> #define _TIF_WORK_MASK \
> 	(_TIF_WORK_NOISOLATION_MASK | _TIF_TASK_ISOLATION)
>
> ... that avoids duplication, ensuring the two are kept in sync, and
> makes it a little easier to understand.

We certainly could do that. I based my approach on the x86 model, which
defines _TIF_ALLWORK_MASK in thread_info.h, and then a local
EXIT_TO_USERMODE_WORK_FLAGS above exit_to_usermode_loop().

If you'd prefer to avoid the duplication, perhaps names more like this?

  _TIF_WORK_LOOP_MASK (without TIF_TASK_ISOLATION)
  _TIF_WORK_MASK as _TIF_WORK_LOOP_MASK | _TIF_TASK_ISOLATION

That keeps the names reflective of the function (entry only vs loop).

>> @@ -818,6 +819,7 @@ void arch_send_call_function_single_ipi(int cpu)
>>  #ifdef CONFIG_ARM64_ACPI_PARKING_PROTOCOL
>>  void arch_send_wakeup_ipi_mask(const struct cpumask *mask)
>>  {
>> +	task_isolation_remote_cpumask(mask, "wakeup IPI");
>
> What exactly does this do? Is it some kind of a tracepoint?

It is intended to generate a diagnostic for a remote task that is
trying to run isolated from the kernel (NOHZ_FULL on steroids, more or
less), if the kernel is about to interrupt it. Similarly, the
task_isolation_interrupt() hooks are diagnostics for the current task.
The intent is that by hooking a little deeper in the call path, you get
actionable diagnostics for processes that are about to be signalled
because they have lost task isolation for some reason.

>> @@ -495,6 +496,10 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
>>  	 */
>>  	if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP | VM_FAULT_BADACCESS)))) {
>> +		/* No signal was generated, but notify task-isolation tasks. */
>> +		if (user_mode(regs))
>> +			task_isolation_interrupt("page fault at %#lx", addr);
>
> What exactly does the task receive here? Are these strings ABI? Do we
> need to do this for *every* exception?

The strings are diagnostic messages; the process itself just gets a
SIGKILL (or user-defined signal if requested). To provide better
diagnosis we emit a log message that can be examined to see what
exactly caused the signal to be generated.

Thanks!

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com
[PATCH] arch/tile: Implement ->set_state_oneshot_stopped()
set_state_oneshot_stopped() is called by the clkevt core when the next
event is required at an expiry time of 'KTIME_MAX'. This normally
happens with NO_HZ_{IDLE|FULL} in both LOWRES/HIGHRES modes.

This patch makes the clockevent device stop on such an event, to avoid
spurious interrupts, as explained by: commit 8fff52fd5093
("clockevents: Introduce CLOCK_EVT_STATE_ONESHOT_STOPPED state").

Signed-off-by: Chris Metcalf <cmetc...@mellanox.com>
---
 arch/tile/kernel/time.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/tile/kernel/time.c b/arch/tile/kernel/time.c
index 6643ffbc0615..f95d65f3162b 100644
--- a/arch/tile/kernel/time.c
+++ b/arch/tile/kernel/time.c
@@ -162,6 +162,7 @@ static DEFINE_PER_CPU(struct clock_event_device, tile_timer) = {
 	.set_next_event = tile_timer_set_next_event,
 	.set_state_shutdown = tile_timer_shutdown,
 	.set_state_oneshot = tile_timer_shutdown,
+	.set_state_oneshot_stopped = tile_timer_shutdown,
 	.tick_resume = tile_timer_shutdown,
 };
-- 
2.1.2
Re: [PATCH v16 12/13] arm, tile: turn off timer tick for oneshot_stopped state
On 11/3/2017 1:18 PM, Mark Rutland wrote:
> Hi Chris,
>
> On Fri, Nov 03, 2017 at 01:04:51PM -0400, Chris Metcalf wrote:
>> diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
>> index fd4b7f684bd0..61ea7f907c56 100644
>> --- a/drivers/clocksource/arm_arch_timer.c
>> +++ b/drivers/clocksource/arm_arch_timer.c
>> @@ -722,6 +722,8 @@ static void __arch_timer_setup(unsigned type,
>>  		}
>>  	}
>>
>> +	clk->set_state_oneshot_stopped = clk->set_state_shutdown;
>
> AFAICT, we've set up this callback since commit:
>
>   cf8c5009ee37d25c ("clockevents/drivers/arm_arch_timer: Implement
>   ->set_state_oneshot_stopped()")
>
> ... so I don't believe this is necessary, and I think this change can
> be dropped.

Thanks, I will drop it. I missed the semantic merge conflict there.

I extracted the arch/tile specific part of the change and just pushed
it through the tile tree.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com
[GIT PULL] arch/tile bugfixes for 4.14-rcN
Linus,

Please pull the following two commits for 4.14 from:

  git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git master

These are both one-line bug fixes.

Chris Metcalf (1):
      arch/tile: Implement ->set_state_oneshot_stopped()

Luc Van Oostenryck (1):
      tile: pass machine size to sparse

 arch/tile/Makefile      | 2 ++
 arch/tile/kernel/time.c | 1 +
 2 files changed, 3 insertions(+)

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com
Re: [PATCH v16 10/13] arch/arm: enable task isolation functionality
On 11/3/2017 1:23 PM, Russell King - ARM Linux wrote:
> On Fri, Nov 03, 2017 at 01:04:49PM -0400, Chris Metcalf wrote:
>> From: Francis Giraldeau <francis.girald...@gmail.com>
>>
>> This patch is a port of the task isolation functionality to the arm
>> 32-bit architecture. The task isolation needs an additional thread
>> flag that requires changing the entry assembly code to accept a
>> bitfield larger than one byte. The constants _TIF_SYSCALL_WORK and
>> _TIF_WORK_MASK are now defined in the literal pool. The rest of the
>> patch is straightforward and reflects what is done on other
>> architectures.
>>
>> To avoid problems with the tst instruction in the v7m build, we
>> renumber TIF_SECCOMP to bit 8 and let TIF_TASK_ISOLATION use bit 7.
>
> After a bit of digging (which could've been saved if our patch format
> contained information about what kernel version this patch was
> generated against) it turns out that this patch will not apply since
> commit 73ac5d6a2b6ac ("arm/syscalls: Check address limit on user-mode
> return") has been applied, which means the TIF numbers have changed
> as well as the assembly code that your patch touches.
>
> My guess is that this patch was generated from a 4.13 kernel, so
> misses the 4.14-rc1 changes. Since we're potentially about to start
> the merge window for 4.15 this weekend, the timing of this doesn't
> work well either.

What patch failure did you see? The patch is based against 4.14-rc4,
so while it's a few weeks out of date, it does include the commit you
reference.

> Once 4.15-rc1 has been published, please rebase against that version
> and resend.

Sure. I was hoping to eke out a little bit of attention from kernel
developers before the merge window actually opens :)

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com
[PATCH v16 07/13] Add task isolation hooks to arch-independent code
This commit adds task isolation hooks as follows:

- __handle_domain_irq() generates an isolation warning for the local
  task

- irq_work_queue_on() generates an isolation warning for the remote
  task being interrupted for irq_work

- generic_exec_single() generates a remote isolation warning for the
  remote cpu being IPI'd

- smp_call_function_many() generates a remote isolation warning for
  the set of remote cpus being IPI'd

Calls to task_isolation_remote() or task_isolation_interrupt() can be
placed in the platform-independent code like this when doing so results
in fewer lines of code changes, as for example is true of the users of
the arch_send_call_function_*() APIs. Or, they can be placed in the
per-architecture code when there are many callers, as for example is
true of the smp_send_reschedule() call.

A further cleanup might be to create an intermediate layer, so that for
example smp_send_reschedule() is a single generic function that just
calls arch_smp_send_reschedule(), allowing generic code to be called
every time smp_send_reschedule() is invoked. But for now, we just
update either callers or callees as makes most sense.

Signed-off-by: Chris Metcalf <cmetc...@mellanox.com>
---
 kernel/irq/irqdesc.c | 5 +++++
 kernel/irq_work.c    | 5 ++++-
 kernel/smp.c         | 6 +++++-
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 82afb7ed369f..1b114c6b7ab8 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "internals.h"
 
@@ -633,6 +634,10 @@ int __handle_domain_irq(struct irq_domain *domain, unsigned int hwirq,
 	irq = irq_find_mapping(domain, hwirq);
 #endif
 
+	task_isolation_interrupt((irq == hwirq) ?
+				 "irq %d (%s)" : "irq %d (%s hwirq %d)",
+				 irq, domain ? domain->name : "", hwirq);
+
 	/*
 	 * Some hardware gives randomly wrong interrupts.  Rather
 	 * than crashing, do something sensible.
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index bcf107ce0854..cde49f1f31f7 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -75,8 +76,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
 	if (!irq_work_claim(work))
 		return false;
 
-	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+		task_isolation_remote(cpu, "irq_work");
 		arch_send_call_function_single_ipi(cpu);
+	}
 
 	return true;
 }
diff --git a/kernel/smp.c b/kernel/smp.c
index c94dd85c8d41..44252aa650ac 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "smpboot.h"
 
@@ -175,8 +176,10 @@ static int generic_exec_single(int cpu, call_single_data_t *csd,
 	 * locking and barrier primitives. Generic code isn't really
 	 * equipped to do the right thing...
 	 */
-	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
+	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu))) {
+		task_isolation_remote(cpu, "IPI function");
 		arch_send_call_function_single_ipi(cpu);
+	}
 
 	return 0;
 }
@@ -458,6 +461,7 @@ void smp_call_function_many(const struct cpumask *mask,
 	}
 
 	/* Send a message to all CPUs in the map */
+	task_isolation_remote_cpumask(cfd->cpumask_ipi, "IPI function");
 	arch_send_call_function_ipi_mask(cfd->cpumask_ipi);
 
 	if (wait) {
-- 
2.1.2
[PATCH v16 07/13] Add task isolation hooks to arch-independent code
This commit adds task isolation hooks as follows:

- __handle_domain_irq() generates an isolation warning for the local task
- irq_work_queue_on() generates an isolation warning for the remote task
  being interrupted for irq_work
- generic_exec_single() generates a remote isolation warning for the
  remote cpu being IPI'd
- smp_call_function_many() generates a remote isolation warning for the
  set of remote cpus being IPI'd

Calls to task_isolation_remote() or task_isolation_interrupt() can be
placed in the platform-independent code like this when doing so results
in fewer changed lines of code, as is true, for example, of the users of
the arch_send_call_function_*() APIs.  Or, they can be placed in the
per-architecture code when there are many callers, as is true, for
example, of the smp_send_reschedule() call.

A further cleanup might be to create an intermediate layer, so that, for
example, smp_send_reschedule() becomes a single generic function that
just calls arch_smp_send_reschedule(), allowing generic code to run
every time smp_send_reschedule() is invoked.  But for now, we just
update either callers or callees, whichever makes the most sense.

Signed-off-by: Chris Metcalf <cmetc...@mellanox.com>
---
 kernel/irq/irqdesc.c | 5 +
 kernel/irq_work.c    | 5 -
 kernel/smp.c         | 6 +-
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 82afb7ed369f..1b114c6b7ab8 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -16,6 +16,7 @@
 #include
 #include
 #include
+#include

 #include "internals.h"

@@ -633,6 +634,10 @@ int __handle_domain_irq(struct irq_domain *domain, unsigned int hwirq,
 	irq = irq_find_mapping(domain, hwirq);
 #endif

+	task_isolation_interrupt((irq == hwirq) ?
+				 "irq %d (%s)" : "irq %d (%s hwirq %d)",
+				 irq, domain ? domain->name : "", hwirq);
+
 	/*
 	 * Some hardware gives randomly wrong interrupts.  Rather
 	 * than crashing, do something sensible.

diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index bcf107ce0854..cde49f1f31f7 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -17,6 +17,7 @@
 #include
 #include
 #include
+#include

 #include

@@ -75,8 +76,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
 	if (!irq_work_claim(work))
 		return false;

-	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+		task_isolation_remote(cpu, "irq_work");
 		arch_send_call_function_single_ipi(cpu);
+	}

 	return true;
 }

diff --git a/kernel/smp.c b/kernel/smp.c
index c94dd85c8d41..44252aa650ac 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -19,6 +19,7 @@
 #include
 #include
 #include
+#include

 #include "smpboot.h"

@@ -175,8 +176,10 @@ static int generic_exec_single(int cpu, call_single_data_t *csd,
 	 * locking and barrier primitives. Generic code isn't really
 	 * equipped to do the right thing...
 	 */
-	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
+	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu))) {
+		task_isolation_remote(cpu, "IPI function");
 		arch_send_call_function_single_ipi(cpu);
+	}

 	return 0;
 }

@@ -458,6 +461,7 @@ void smp_call_function_many(const struct cpumask *mask,
 	}

 	/* Send a message to all CPUs in the map */
+	task_isolation_remote_cpumask(cfd->cpumask_ipi, "IPI function");
 	arch_send_call_function_ipi_mask(cfd->cpumask_ipi);

 	if (wait) {
--
2.1.2
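The "intermediate layer" cleanup suggested in the commit message can be sketched in plain userspace C. Everything below is an illustrative stand-in; only the names smp_send_reschedule() and arch_smp_send_reschedule() come from the commit message itself, and the two counters model machinery the real kernel does differently.

```c
#include <stdio.h>

static int isolation_warnings;	/* stand-in for the real diagnostic path */
static int ipis_sent;		/* stand-in for the arch IPI machinery */

static void task_isolation_remote(int cpu, const char *what)
{
	/* the real hook would check whether the target cpu runs an
	 * isolated task before warning; here we just count the call */
	isolation_warnings++;
	printf("cpu %d would be interrupted by %s\n", cpu, what);
}

static void arch_smp_send_reschedule(int cpu)
{
	/* a real architecture would raise a reschedule IPI here */
	(void)cpu;
	ipis_sent++;
}

/* Generic wrapper: every caller now gets the isolation check for free,
 * instead of each architecture having to add the hook by hand. */
static void smp_send_reschedule(int cpu)
{
	task_isolation_remote(cpu, "reschedule IPI");
	arch_smp_send_reschedule(cpu);
}
```

The design point is that the hook lives in exactly one place, so new callers of smp_send_reschedule() cannot forget it.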
[PATCH v16 10/13] arch/arm: enable task isolation functionality
From: Francis Giraldeau <francis.girald...@gmail.com>

This patch is a port of the task isolation functionality to the arm
32-bit architecture.  Task isolation needs an additional thread flag,
which requires changing the entry assembly code to accept a bitfield
larger than one byte; the constants _TIF_SYSCALL_WORK and
_TIF_WORK_MASK are now defined in the literal pool.  The rest of the
patch is straightforward and reflects what is done on other
architectures.

To avoid problems with the tst instruction in the v7m build, we
renumber TIF_SECCOMP to bit 8 and let TIF_TASK_ISOLATION use bit 7.

Signed-off-by: Francis Giraldeau <francis.girald...@gmail.com>
Signed-off-by: Chris Metcalf <cmetc...@mellanox.com> [with modifications]
---
 arch/arm/Kconfig                   |  1 +
 arch/arm/include/asm/thread_info.h | 10 +++---
 arch/arm/kernel/entry-common.S     | 12
 arch/arm/kernel/ptrace.c           | 10 ++
 arch/arm/kernel/signal.c           | 10 +-
 arch/arm/kernel/smp.c              |  4
 arch/arm/mm/fault.c                |  8 +++-
 7 files changed, 46 insertions(+), 9 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 7888c9803eb0..3423c655a32b 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -48,6 +48,7 @@ config ARM
 	select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU
 	select HAVE_ARCH_MMAP_RND_BITS if MMU
 	select HAVE_ARCH_SECCOMP_FILTER if (AEABI && !OABI_COMPAT)
+	select HAVE_ARCH_TASK_ISOLATION
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARM_SMCCC if CPU_V7
 	select HAVE_EBPF_JIT if !CPU_ENDIAN_BE32

diff --git a/arch/arm/include/asm/thread_info.h b/arch/arm/include/asm/thread_info.h
index 776757d1604a..a7b76ac9543d 100644
--- a/arch/arm/include/asm/thread_info.h
+++ b/arch/arm/include/asm/thread_info.h
@@ -142,7 +142,8 @@ extern int vfp_restore_user_hwstate(struct user_vfp __user *,
 #define TIF_SYSCALL_TRACE	4	/* syscall trace active */
 #define TIF_SYSCALL_AUDIT	5	/* syscall auditing active */
 #define TIF_SYSCALL_TRACEPOINT	6	/* syscall tracepoint instrumentation */
-#define TIF_SECCOMP		7	/* seccomp syscall filtering active */
+#define TIF_TASK_ISOLATION	7	/* task isolation active */
+#define TIF_SECCOMP		8	/* seccomp syscall filtering active */

 #define TIF_NOHZ		12	/* in adaptive nohz mode */
 #define TIF_USING_IWMMXT	17

@@ -156,18 +157,21 @@ extern int vfp_restore_user_hwstate(struct user_vfp __user *,
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
 #define _TIF_SYSCALL_TRACEPOINT	(1 << TIF_SYSCALL_TRACEPOINT)
+#define _TIF_TASK_ISOLATION	(1 << TIF_TASK_ISOLATION)
 #define _TIF_SECCOMP		(1 << TIF_SECCOMP)
 #define _TIF_USING_IWMMXT	(1 << TIF_USING_IWMMXT)

 /* Checks for any syscall work in entry-common.S */
 #define _TIF_SYSCALL_WORK (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
-			   _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP)
+			   _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
+			   _TIF_TASK_ISOLATION)

 /*
  * Change these and you break ASM code in entry-common.S
  */
 #define _TIF_WORK_MASK		(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
-				 _TIF_NOTIFY_RESUME | _TIF_UPROBE)
+				 _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
+				 _TIF_TASK_ISOLATION)

 #endif /* __KERNEL__ */
 #endif /* __ASM_ARM_THREAD_INFO_H */

diff --git a/arch/arm/kernel/entry-common.S b/arch/arm/kernel/entry-common.S
index 99c908226065..9ae3ef2dbc1e 100644
--- a/arch/arm/kernel/entry-common.S
+++ b/arch/arm/kernel/entry-common.S
@@ -53,7 +53,8 @@ ret_fast_syscall:
 	cmp	r2, #TASK_SIZE
 	blne	addr_limit_check_failed
 	ldr	r1, [tsk, #TI_FLAGS]		@ re-check for syscall tracing
-	tst	r1, #_TIF_SYSCALL_WORK | _TIF_WORK_MASK
+	ldr	r2, =_TIF_SYSCALL_WORK | _TIF_WORK_MASK
+	tst	r1, r2
 	bne	fast_work_pending

@@ -83,7 +84,8 @@ ret_fast_syscall:
 	cmp	r2, #TASK_SIZE
 	blne	addr_limit_check_failed
 	ldr	r1, [tsk, #TI_FLAGS]		@ re-check for syscall tracing
-	tst	r1, #_TIF_SYSCALL_WORK | _TIF_WORK_MASK
+	ldr	r2, =_TIF_SYSCALL_WORK | _TIF_WORK_MASK
+	tst	r1, r2
 	beq	no_work_pending
 UNWIND(.fnend		)
 ENDPROC(ret_fast_syscall)

@@ -91,7 +93,8 @@ ENDPROC(ret_fast_syscall)
 	/* Slower path - fall through to work_pending */
 #endif

-	tst	r1, #_TIF_SYSCALL_WORK
+	ldr	r2, =_TIF_SYSCALL_WORK
+	tst	r1, r2
 	bne	__sys_trace_return_nosave
 slow_work_pending:
 	mov	r0, sp				@ 'regs'
 	@
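The reason the masks had to move to the literal pool: a classic ARM data-processing immediate (as in "tst r1, #imm") can only encode an 8-bit value rotated right by an even amount, so a flag mask whose set bits span more than one byte no longer fits and must instead be loaded with "ldr r2, =mask". (Thumb-2, used by v7m, has its own different immediate encodings; the checker below models only the classic ARM rule, for illustration.)

```c
#include <stdint.h>

/* Returns 1 if v is encodable as a classic ARM modified immediate:
 * an 8-bit value rotated right by an even amount (0, 2, ..., 30). */
static int arm_valid_immediate(uint32_t v)
{
	int rot;

	for (rot = 0; rot < 32; rot += 2) {
		/* rotating left by 'rot' undoes a rotate-right encoding;
		 * rot == 0 is special-cased to avoid a shift by 32 */
		uint32_t r = rot ? ((v << rot) | (v >> (32 - rot))) : v;

		if (r <= 0xff)
			return 1;
	}
	return 0;
}
```

For example, a mask covering bits 0 through 8 (nine contiguous bits, as you get once a thread flag lands in bit 8 alongside flags in bits 0-7) is not encodable, which is the practical effect the commit message works around.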
[PATCH v16 09/13] arch/arm64: enable task isolation functionality
In do_notify_resume(), call task_isolation_start() for
TIF_TASK_ISOLATION tasks.  Add _TIF_TASK_ISOLATION to _TIF_WORK_MASK,
and define a local NOTIFY_RESUME_LOOP_FLAGS to check in the loop, since
we don't clear _TIF_TASK_ISOLATION in the loop.

We tweak syscall_trace_enter() slightly to carry the "flags" value from
current_thread_info()->flags across each of the tests, rather than doing
a volatile read from memory for each one.  This avoids a small overhead
for each test, and in particular avoids that overhead for TIF_NOHZ when
TASK_ISOLATION is not enabled.

We instrument the smp_send_reschedule() routine so that it checks for
isolated tasks and generates a suitable warning if needed.

Finally, report on page faults in task-isolation processes in
do_page_fault().

Signed-off-by: Chris Metcalf <cmetc...@mellanox.com>
---
 arch/arm64/Kconfig                   |  1 +
 arch/arm64/include/asm/thread_info.h |  5 -
 arch/arm64/kernel/ptrace.c           | 18 +++---
 arch/arm64/kernel/signal.c           | 10 +-
 arch/arm64/kernel/smp.c              |  7 +++
 arch/arm64/mm/fault.c                |  5 +
 6 files changed, 41 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 0df64a6a56d4..d77ecdb29765 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -73,6 +73,7 @@ config ARM64
 	select HAVE_ARCH_MMAP_RND_BITS
 	select HAVE_ARCH_MMAP_RND_COMPAT_BITS if COMPAT
 	select HAVE_ARCH_SECCOMP_FILTER
+	select HAVE_ARCH_TASK_ISOLATION
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select HAVE_ARCH_VMAP_STACK

diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index ddded6497a8a..9c749eca7384 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -82,6 +82,7 @@ void arch_setup_new_exec(void);
 #define TIF_FOREIGN_FPSTATE	3	/* CPU's FP state is not current's */
 #define TIF_UPROBE		4	/* uprobe breakpoint or singlestep */
 #define TIF_FSCHECK		5	/* Check FS is USER_DS on return */
+#define TIF_TASK_ISOLATION	6
 #define TIF_NOHZ		7
 #define TIF_SYSCALL_TRACE	8
 #define TIF_SYSCALL_AUDIT	9

@@ -97,6 +98,7 @@ void arch_setup_new_exec(void);
 #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
 #define _TIF_FOREIGN_FPSTATE	(1 << TIF_FOREIGN_FPSTATE)
+#define _TIF_TASK_ISOLATION	(1 << TIF_TASK_ISOLATION)
 #define _TIF_NOHZ		(1 << TIF_NOHZ)
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)

@@ -108,7 +110,8 @@ void arch_setup_new_exec(void);

 #define _TIF_WORK_MASK		(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
				 _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
-				 _TIF_UPROBE | _TIF_FSCHECK)
+				 _TIF_UPROBE | _TIF_FSCHECK | \
+				 _TIF_TASK_ISOLATION)

 #define _TIF_SYSCALL_WORK	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
				 _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \

diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index 9cbb6123208f..e5c0d7cdaf4e 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -38,6 +38,7 @@
 #include
 #include
 #include
+#include

 #include
 #include

@@ -1371,14 +1372,25 @@ static void tracehook_report_syscall(struct pt_regs *regs,

 asmlinkage int syscall_trace_enter(struct pt_regs *regs)
 {
-	if (test_thread_flag(TIF_SYSCALL_TRACE))
+	unsigned long work = READ_ONCE(current_thread_info()->flags);
+
+	if (work & _TIF_SYSCALL_TRACE)
 		tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER);

-	/* Do the secure computing after ptrace; failures should be fast. */
+	/*
+	 * In task isolation mode, we may prevent the syscall from
+	 * running, and if so we also deliver a signal to the process.
+	 */
+	if (work & _TIF_TASK_ISOLATION) {
+		if (task_isolation_syscall(regs->syscallno) == -1)
+			return -1;
+	}
+
+	/* Do the secure computing check early; failures should be fast. */
 	if (secure_computing(NULL) == -1)
 		return -1;

-	if (test_thread_flag(TIF_SYSCALL_TRACEPOINT))
+	if (work & _TIF_SYSCALL_TRACEPOINT)
 		trace_sys_enter(regs, regs->syscallno);

 	audit_syscall_entry(regs->syscallno, regs->orig_x0, regs->regs[1],

diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 0bdc96c61bc0..d8f4904e992f 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -30,6 +30,7 @@
 #include
 #include
 #include
+#include
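The "read the flags once" optimization described in the commit message can be demonstrated in isolation. The accessor below stands in for the volatile load of current_thread_info()->flags; none of these names are the real kernel symbols, and the bit values are arbitrary.

```c
static unsigned long flags_value;	/* stand-in for thread-info flags */
static int reads;			/* counts simulated volatile loads */

static unsigned long read_flags(void)	/* models one volatile read */
{
	reads++;
	return flags_value;
}

/* Old style: one volatile load per flag test. */
static int check_naive(void)
{
	int hits = 0;

	if (read_flags() & 0x1) hits++;
	if (read_flags() & 0x2) hits++;
	if (read_flags() & 0x4) hits++;
	return hits;
}

/* Patched style: load once, then test bits on the local copy. */
static int check_once(void)
{
	unsigned long work = read_flags();	/* single read */
	int hits = 0;

	if (work & 0x1) hits++;
	if (work & 0x2) hits++;
	if (work & 0x4) hits++;
	return hits;
}
```

Both functions compute the same answer, but check_once() touches memory once no matter how many flags are tested, which is exactly why the patch hoists the read.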
[PATCH v16 08/13] arch/x86: enable task isolation functionality
In prepare_exit_to_usermode(), call task_isolation_start() for
TIF_TASK_ISOLATION tasks.  In syscall_trace_enter_phase1(), add the
necessary support for reporting syscalls for task-isolation processes.

Add task_isolation_remote() calls for the kernel exception types that
do not result in signals, namely non-signalling page faults and
non-signalling MPX fixups.

Signed-off-by: Chris Metcalf <cmetc...@mellanox.com>
---
 arch/x86/Kconfig                   |  1 +
 arch/x86/entry/common.c            | 14 ++
 arch/x86/include/asm/apic.h        |  3 +++
 arch/x86/include/asm/thread_info.h |  8 +---
 arch/x86/kernel/smp.c              |  2 ++
 arch/x86/kernel/traps.c            |  3 +++
 arch/x86/mm/fault.c                |  5 +
 7 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 971feac13506..45967840b81a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -114,6 +114,7 @@ config X86
 	select HAVE_ARCH_MMAP_RND_COMPAT_BITS	if MMU && COMPAT
 	select HAVE_ARCH_COMPAT_MMAP_BASES	if MMU && COMPAT
 	select HAVE_ARCH_SECCOMP_FILTER
+	select HAVE_ARCH_TASK_ISOLATION
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD if X86_64

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 03505ffbe1b6..2c70b915d1f2 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -24,6 +24,7 @@
 #include
 #include
 #include
+#include

 #include
 #include

@@ -87,6 +88,16 @@ static long syscall_trace_enter(struct pt_regs *regs)
 	if (emulated)
 		return -1L;

+	/*
+	 * In task isolation mode, we may prevent the syscall from
+	 * running, and if so we also deliver a signal to the process.
+	 */
+	if (work & _TIF_TASK_ISOLATION) {
+		if (task_isolation_syscall(regs->orig_ax) == -1)
+			return -1L;
+		work &= ~_TIF_TASK_ISOLATION;
+	}
+
 #ifdef CONFIG_SECCOMP
 	/*
 	 * Do seccomp after ptrace, to catch any tracer changes.

@@ -196,6 +207,9 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
 	if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
 		exit_to_usermode_loop(regs, cached_flags);

+	if (cached_flags & _TIF_TASK_ISOLATION)
+		task_isolation_start();
+
 #ifdef CONFIG_COMPAT
 	/*
 	 * Compat syscalls set TS_COMPAT.  Make sure we clear it before

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 5f01671c68f2..c70cb9cacfc0 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -2,6 +2,7 @@
 #define _ASM_X86_APIC_H

 #include
+#include
 #include
 #include

@@ -618,6 +619,7 @@ extern void irq_exit(void);

 static inline void entering_irq(void)
 {
+	task_isolation_interrupt("irq");
 	irq_enter();
 }

@@ -629,6 +631,7 @@ static inline void entering_ack_irq(void)
 static inline void ipi_entering_ack_irq(void)
 {
+	task_isolation_interrupt("ack irq");
 	irq_enter();
 	ack_APIC_irq();
 }

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 89e7eeb5cec1..aa9d9d817f8b 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -85,6 +85,7 @@ struct thread_info {
 #define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
 #define TIF_UPROBE		12	/* breakpointed or singlestepping */
 #define TIF_PATCH_PENDING	13	/* pending live patching update */
+#define TIF_TASK_ISOLATION	14	/* task isolation enabled for task */
 #define TIF_NOCPUID		15	/* CPUID is not accessible in userland */
 #define TIF_NOTSC		16	/* TSC is not accessible in userland */
 #define TIF_IA32		17	/* IA32 compatibility process */

@@ -111,6 +112,7 @@ struct thread_info {
 #define _TIF_USER_RETURN_NOTIFY	(1 << TIF_USER_RETURN_NOTIFY)
 #define _TIF_UPROBE		(1 << TIF_UPROBE)
 #define _TIF_PATCH_PENDING	(1 << TIF_PATCH_PENDING)
+#define _TIF_TASK_ISOLATION	(1 << TIF_TASK_ISOLATION)
 #define _TIF_NOCPUID		(1 << TIF_NOCPUID)
 #define _TIF_NOTSC		(1 << TIF_NOTSC)
 #define _TIF_IA32		(1 << TIF_IA32)

@@ -132,15 +134,15 @@ struct thread_info {
 #define _TIF_WORK_SYSCALL_ENTRY	\
 	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT |	\
 	 _TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT |	\
-	 _TIF_NOHZ)
+	 _TIF_NOHZ | _TIF_TASK_ISOLATION)

 /* work to do on any return to user space */
 #define _TIF_ALLWORK_MASK	\
 	(_TIF_SYSCALL_TRACE | _TIF_NOTIFY_RESUME | _TIF_SIGPENDING |
[PATCH v16 05/13] Add try_stop_full_tick() API for NO_HZ_FULL
This API checks whether the scheduler tick can be stopped and, if so,
stops it and returns 0; otherwise it returns an error.  This is
intended for use with task isolation, where we will want to be able to
stop the tick synchronously when returning to userspace.

Signed-off-by: Chris Metcalf <cmetc...@mellanox.com>
---
 include/linux/tick.h     |  1 +
 kernel/time/tick-sched.c | 18 ++
 2 files changed, 19 insertions(+)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index fe01e68bf520..078ff2464b00 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -234,6 +234,7 @@ static inline void tick_dep_clear_signal(struct signal_struct *signal,
 extern void tick_nohz_full_kick_cpu(int cpu);
 extern void __tick_nohz_task_switch(void);
+extern int try_stop_full_tick(void);
 #else
 static inline int housekeeping_any_cpu(void)
 {

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c7a899c5ce64..c026145eba2f 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -861,6 +861,24 @@ static void tick_nohz_full_update_tick(struct tick_sched *ts)
 #endif
 }

+#ifdef CONFIG_TASK_ISOLATION
+int try_stop_full_tick(void)
+{
+	int cpu = smp_processor_id();
+	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
+
+	/* For an unstable clock, we should return a permanent error code. */
+	if (atomic_read(&tick_dep_mask) & TICK_DEP_MASK_CLOCK_UNSTABLE)
+		return -EINVAL;
+
+	if (!can_stop_full_tick(cpu, ts))
+		return -EAGAIN;
+
+	tick_nohz_stop_sched_tick(ts, ktime_get(), cpu);
+	return 0;
+}
+#endif
+
 static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
 {
 	/*
--
2.1.2
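A sketch of how a task-isolation exit path might consume this API's return values: 0 means the tick is stopped, -EAGAIN means "retry later", and -EINVAL is a permanent failure (unstable clock). The try_stop_full_tick() below is a userspace stub standing in for the kernel function, and isolation_tick_check() is a hypothetical caller, not code from this series.

```c
#include <errno.h>

static int fake_tick_result = -EAGAIN;	/* what the stub reports */

static int try_stop_full_tick(void)	/* stub, not the real kernel code */
{
	return fake_tick_result;
}

/* Returns 1 if isolation can proceed, 0 if the caller should retry,
 * and -1 on a permanent failure such as an unstable clock. */
static int isolation_tick_check(void)
{
	int ret = try_stop_full_tick();

	if (ret == 0)
		return 1;	/* tick stopped; safe to enter isolation */
	if (ret == -EAGAIN)
		return 0;	/* transient: something still needs the tick */
	return -1;		/* e.g. -EINVAL: give up rather than retry */
}
```

The distinction between the two error codes matters: a caller can usefully loop on -EAGAIN, but looping on -EINVAL would spin forever.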
[PATCH v16 06/13] task_isolation: userspace hard isolation from kernel
The existing nohz_full mode is designed as a "soft" isolation mode that makes tradeoffs to minimize userspace interruptions while still attempting to avoid overheads in the kernel entry/exit path, to provide 100% kernel semantics, etc. However, some applications require a "hard" commitment from the kernel to avoid interruptions, in particular userspace device driver style applications, such as high-speed networking code. This change introduces a framework to allow applications to elect to have the "hard" semantics as needed, specifying prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so. The kernel must be built with the new TASK_ISOLATION Kconfig flag to enable this mode, and the kernel booted with an appropriate "nohz_full=CPULIST isolcpus=CPULIST" boot argument to enable nohz_full and isolcpus. The "task_isolation" state is then indicated by setting a new task struct field, task_isolation_flag, to the value passed by prctl(), and also setting a TIF_TASK_ISOLATION bit in the thread_info flags. When the kernel is returning to userspace from the prctl() call and sees TIF_TASK_ISOLATION set, it calls the new task_isolation_start() routine to arrange for the task to avoid being interrupted in the future. With interrupts disabled, task_isolation_start() ensures that kernel subsystems that might cause a future interrupt are quiesced. If it doesn't succeed, it adjusts the syscall return value to indicate that fact, and userspace can retry as desired. In addition to stopping the scheduler tick, the code takes any actions that might avoid a future interrupt to the core, such as a worker thread being scheduled that could be quiesced now (e.g. the vmstat worker) or a future IPI to the core to clean up some state that could be cleaned up now (e.g. the mm lru per-cpu cache). Once the task has returned to userspace after issuing the prctl(), if it enters the kernel again via system call, page fault, or any other exception or irq, the kernel will kill it with SIGKILL. 
In addition to sending a signal, the code supports a kernel command-line "task_isolation_debug" flag which causes a stack backtrace to be generated whenever a task loses isolation. To allow the state to be entered and exited, the syscall checking test ignores the prctl(PR_TASK_ISOLATION) syscall so that we can clear the bit again later, and ignores exit/exit_group to allow exiting the task without a pointless signal being delivered. The prctl() API allows for specifying a signal number to use instead of the default SIGKILL, to allow for catching the notification signal; for example, in a production environment, it might be helpful to log information to the application logging mechanism before exiting. Or, the signal handler might choose to reset the program counter back to the code segment intended to be run isolated via prctl() to continue execution. In a number of cases we can tell on a remote cpu that we are going to be interrupting the cpu, e.g. via an IPI or a TLB flush. In that case we generate the diagnostic (and optional stack dump) on the remote core to be able to deliver better diagnostics. If the interrupt is not something caught by Linux (e.g. a hypervisor interrupt) we can also request a reschedule IPI to be sent to the remote core so it can be sure to generate a signal to notify the process. Separate patches that follow provide these changes for x86, tile, arm, and arm64. 
Signed-off-by: Chris Metcalf <cmetc...@mellanox.com> --- Documentation/admin-guide/kernel-parameters.txt | 6 + include/linux/isolation.h | 175 +++ include/linux/sched.h | 4 + include/uapi/linux/prctl.h | 6 + init/Kconfig| 28 ++ kernel/Makefile | 1 + kernel/context_tracking.c | 2 + kernel/isolation.c | 402 kernel/signal.c | 2 + kernel/sys.c| 6 + 10 files changed, 631 insertions(+) create mode 100644 include/linux/isolation.h create mode 100644 kernel/isolation.c diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 05496622b4ef..aaf278f2cfc3 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4025,6 +4025,12 @@ neutralize any effect of /proc/sys/kernel/sysrq. Useful for debugging. + task_isolation_debug[KNL] + In kernels built with CONFIG_TASK_ISOLATION, this + setting will generate console backtraces to + accompany the diagnostics generated about + interrupting tasks running with task isolation.
[PATCH v16 03/13] Revert "sched/core: Drop the unused try_get_task_struct() helper function"
This reverts commit f11cc0760b8397e0d230122606421b6a96e9f869. We do need this function for try_get_task_struct_on_cpu(). Signed-off-by: Chris Metcalf <cmetc...@mellanox.com> --- include/linux/sched/task.h | 2 ++ kernel/exit.c | 13 + 2 files changed, 15 insertions(+) diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h index 79a2a744648d..270ff76d43d9 100644 --- a/include/linux/sched/task.h +++ b/include/linux/sched/task.h @@ -96,6 +96,8 @@ static inline void put_task_struct(struct task_struct *t) } struct task_struct *task_rcu_dereference(struct task_struct **ptask); +struct task_struct *try_get_task_struct(struct task_struct **ptask); + #ifdef CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT extern int arch_task_struct_size __read_mostly; diff --git a/kernel/exit.c b/kernel/exit.c index f2cd53e92147..e2a3e7458d0f 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -318,6 +318,19 @@ void rcuwait_wake_up(struct rcuwait *w) rcu_read_unlock(); } +struct task_struct *try_get_task_struct(struct task_struct **ptask) +{ + struct task_struct *task; + + rcu_read_lock(); + task = task_rcu_dereference(ptask); + if (task) + get_task_struct(task); + rcu_read_unlock(); + + return task; +} + /* * Determine if a process group is "orphaned", according to the POSIX * definition in 2.2.2.52. Orphaned process groups are not to be affected -- 2.1.2
[PATCH v16 12/13] arm, tile: turn off timer tick for oneshot_stopped state
When the schedule tick is disabled in tick_nohz_stop_sched_tick(), we call hrtimer_cancel(), which eventually calls down into __remove_hrtimer() and thus into hrtimer_force_reprogram(). That function's call to tick_program_event() detects that we are trying to set the expiration to KTIME_MAX and calls clockevents_switch_state() to set the state to ONESHOT_STOPPED, and returns. See commit 8fff52fd5093 ("clockevents: Introduce CLOCK_EVT_STATE_ONESHOT_STOPPED state") for more background. However, by default the internal __clockevents_switch_state() code doesn't have a "set_state_oneshot_stopped" function pointer for the arm_arch_timer or tile clock_event_device structures, so that code returns -ENOSYS, and we end up not setting the state, and more importantly, we don't actually turn off the hardware timer. As a result, the timer tick we were waiting for before is still queued, and fires shortly afterwards, only to discover there was nothing for it to do, at which point it quiesces. The fix is to provide that function pointer field, and like the other function pointers, have it just turn off the timer interrupt. Any call to set a new timer interval will properly re-enable it. This fix avoids a small performance hiccup for regular applications, but for TASK_ISOLATION code, it fixes a potentially serious kernel timer interruption to the time-sensitive application. 
Signed-off-by: Chris Metcalf <cmetc...@mellanox.com> Acked-by: Daniel Lezcano <daniel.lezc...@linaro.org> --- arch/tile/kernel/time.c | 1 + drivers/clocksource/arm_arch_timer.c | 2 ++ 2 files changed, 3 insertions(+) diff --git a/arch/tile/kernel/time.c b/arch/tile/kernel/time.c index f74f10d827fa..afca6fe496c8 100644 --- a/arch/tile/kernel/time.c +++ b/arch/tile/kernel/time.c @@ -163,6 +163,7 @@ static DEFINE_PER_CPU(struct clock_event_device, tile_timer) = { .set_next_event = tile_timer_set_next_event, .set_state_shutdown = tile_timer_shutdown, .set_state_oneshot = tile_timer_shutdown, + .set_state_oneshot_stopped = tile_timer_shutdown, .tick_resume = tile_timer_shutdown, }; diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c index fd4b7f684bd0..61ea7f907c56 100644 --- a/drivers/clocksource/arm_arch_timer.c +++ b/drivers/clocksource/arm_arch_timer.c @@ -722,6 +722,8 @@ static void __arch_timer_setup(unsigned type, } } + clk->set_state_oneshot_stopped = clk->set_state_shutdown; + clk->set_state_shutdown(clk); clockevents_config_and_register(clk, arch_timer_rate, 0xf, 0x7fff); -- 2.1.2
[PATCH v16 13/13] task_isolation self test
This code tests various aspects of task_isolation. Signed-off-by: Chris Metcalf <cmetc...@mellanox.com> --- tools/testing/selftests/Makefile | 1 + tools/testing/selftests/task_isolation/Makefile| 6 + tools/testing/selftests/task_isolation/config | 1 + tools/testing/selftests/task_isolation/isolation.c | 643 + 4 files changed, 651 insertions(+) create mode 100644 tools/testing/selftests/task_isolation/Makefile create mode 100644 tools/testing/selftests/task_isolation/config create mode 100644 tools/testing/selftests/task_isolation/isolation.c diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile index ff805643b5f7..ab781b99d3c9 100644 --- a/tools/testing/selftests/Makefile +++ b/tools/testing/selftests/Makefile @@ -30,6 +30,7 @@ TARGETS += splice TARGETS += static_keys TARGETS += sync TARGETS += sysctl +TARGETS += task_isolation ifneq (1, $(quicktest)) TARGETS += timers endif diff --git a/tools/testing/selftests/task_isolation/Makefile b/tools/testing/selftests/task_isolation/Makefile new file mode 100644 index ..74d060b493f9 --- /dev/null +++ b/tools/testing/selftests/task_isolation/Makefile @@ -0,0 +1,6 @@ +CFLAGS += -O2 -g -W -Wall +LDFLAGS += -pthread + +TEST_GEN_PROGS := isolation + +include ../lib.mk diff --git a/tools/testing/selftests/task_isolation/config b/tools/testing/selftests/task_isolation/config new file mode 100644 index ..34edfbca0423 --- /dev/null +++ b/tools/testing/selftests/task_isolation/config @@ -0,0 +1 @@ +CONFIG_TASK_ISOLATION=y diff --git a/tools/testing/selftests/task_isolation/isolation.c b/tools/testing/selftests/task_isolation/isolation.c new file mode 100644 index ..9c0b49619b40 --- /dev/null +++ b/tools/testing/selftests/task_isolation/isolation.c @@ -0,0 +1,643 @@ +/* + * This test program tests the features of task isolation. + * + * - Makes sure enabling task isolation fails if you are unaffinitized + * or on a non-task-isolation cpu. 
+ * + * - Validates that various synchronous exceptions are fatal in isolation + * mode: + * + * * Page fault + * * System call + * * TLB invalidation from another thread [1] + * * Unaligned access [2] + * + * - Tests that taking a user-defined signal for the above faults works. + * + * - Tests that you can prctl(PR_TASK_ISOLATION, 0) to turn isolation off. + * + * - Tests that receiving a signal turns isolation off. + * + * - Tests that having another process schedule into the core where the + * isolation process is running correctly kills the isolation process. + * + * [1] TLB invalidations do not cause IPIs on some platforms, e.g. arm64 + * [2] Unaligned access only causes exceptions on some platforms, e.g. tile + * + * + * You must be running under a kernel configured with TASK_ISOLATION. + * + * You must have booted with e.g. "nohz_full=1-15 isolcpus=1-15" to + * enable some task-isolation cores. If you get interrupt reports, you + * can also add the boot argument "task_isolation_debug" to learn more. + * If you get jitter but no reports, define DEBUG_TASK_ISOLATION to add + * isolation checks in every user_exit() call. + * + * NOTE: you must disable the code in tick_nohz_stop_sched_tick() + * that limits the tick delta to the maximum scheduler deferment + * by making it conditional not just on "!ts->inidle" but also + * on !current->task_isolation_flags. This is around line 756 + * in kernel/time/tick-sched.c (as of kernel 4.14). + * + * + * To compile the test program, run "make". + * + * Run the program as "./isolation" and if you want to run the + * jitter-detection loop for longer than 10 giga-cycles, specify the + * number of giga-cycles to run it for as a command-line argument. 
+ */ + +#define _GNU_SOURCE +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "../kselftest.h" + +#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0])) +#define READ_ONCE(x) (*(volatile typeof(x) *)&(x)) +#define WRITE_ONCE(x, val) (*(volatile typeof(x) *)&(x) = (val)) + +#ifndef PR_TASK_ISOLATION /* Not in system headers yet? */ +# define PR_TASK_ISOLATION 48 +# define PR_TASK_ISOLATION_ENABLE (1 << 0) +# define PR_TASK_ISOLATION_SET_SIG(sig)(((sig) & 0x7f) << 8) +# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f) +#endif + +/* The cpu we are using for isolation tests. */ +static int task_isolation_cpu; + +/* Overall status, maintained as tests run. */ +static int exit_status = KSFT_PASS; + +/* Data shared between parent and children. */ +static struct { + /* Set to true when the parent's isolation prctl is successful. */ + bool parent_isolated; +} *shared; + +/* Set affinity to a single cpu or die if trying to do
[PATCH v16 11/13] arch/tile: enable task isolation functionality
We add the necessary call to task_isolation_start() in the prepare_exit_to_usermode() routine. We already unconditionally call into this routine if TIF_NOHZ is set, since that's where we do the user_enter() call. We add calls to task_isolation_interrupt() in places where exceptions may not generate signals to the application. Signed-off-by: Chris Metcalf <cmetc...@mellanox.com> --- arch/tile/Kconfig | 1 + arch/tile/include/asm/thread_info.h | 2 ++ arch/tile/kernel/hardwall.c | 2 ++ arch/tile/kernel/irq.c | 3 +++ arch/tile/kernel/messaging.c| 4 arch/tile/kernel/process.c | 4 arch/tile/kernel/ptrace.c | 10 ++ arch/tile/kernel/single_step.c | 7 +++ arch/tile/kernel/smp.c | 21 +++-- arch/tile/kernel/time.c | 2 ++ arch/tile/kernel/unaligned.c| 4 arch/tile/mm/fault.c| 13 - arch/tile/mm/homecache.c| 11 +++ 13 files changed, 73 insertions(+), 11 deletions(-) diff --git a/arch/tile/Kconfig b/arch/tile/Kconfig index 4583c0320059..2d644138f2eb 100644 --- a/arch/tile/Kconfig +++ b/arch/tile/Kconfig @@ -16,6 +16,7 @@ config TILE select GENERIC_STRNCPY_FROM_USER select GENERIC_STRNLEN_USER select HAVE_ARCH_SECCOMP_FILTER + select HAVE_ARCH_TASK_ISOLATION select HAVE_ARCH_TRACEHOOK select HAVE_CONTEXT_TRACKING select HAVE_DEBUG_BUGVERBOSE diff --git a/arch/tile/include/asm/thread_info.h b/arch/tile/include/asm/thread_info.h index b7659b8f1117..3e298bd43d11 100644 --- a/arch/tile/include/asm/thread_info.h +++ b/arch/tile/include/asm/thread_info.h @@ -126,6 +126,7 @@ extern void _cpu_idle(void); #define TIF_SYSCALL_TRACEPOINT 9 /* syscall tracepoint instrumentation */ #define TIF_POLLING_NRFLAG 10 /* idle is polling for TIF_NEED_RESCHED */ #define TIF_NOHZ 11 /* in adaptive nohz mode */ +#define TIF_TASK_ISOLATION 12 /* in task isolation mode */ #define _TIF_SIGPENDING(1<<TIF_SIGPENDING) #define _TIF_NEED_RESCHED (1<<TIF_NEED_RESCHED) @@ -139,6 +140,7 @@ extern void _cpu_idle(void); #define _TIF_SYSCALL_TRACEPOINT(1<<TIF_SYSCALL_TRACEPOINT) #define 
_TIF_POLLING_NRFLAG(1<<TIF_POLLING_NRFLAG) #define _TIF_NOHZ (1<<TIF_NOHZ) +#define _TIF_TASK_ISOLATION(1<<TIF_TASK_ISOLATION) /* Work to do as we loop to exit to user space. */ #define _TIF_WORK_MASK \ diff --git a/arch/tile/kernel/hardwall.c b/arch/tile/kernel/hardwall.c index 2fd1694ac1d0..9559f04d1c2a 100644 --- a/arch/tile/kernel/hardwall.c +++ b/arch/tile/kernel/hardwall.c @@ -23,6 +23,7 @@ #include #include #include +#include #include #include #include @@ -328,6 +329,7 @@ void __kprobes do_hardwall_trap(struct pt_regs* regs, int fault_num) int found_processes; struct pt_regs *old_regs = set_irq_regs(regs); + task_isolation_interrupt("hardwall trap"); irq_enter(); /* Figure out which network trapped. */ diff --git a/arch/tile/kernel/irq.c b/arch/tile/kernel/irq.c index 22044fc691ef..0b1b24b9c496 100644 --- a/arch/tile/kernel/irq.c +++ b/arch/tile/kernel/irq.c @@ -18,6 +18,7 @@ #include #include #include +#include #include #include #include @@ -100,6 +101,8 @@ void tile_dev_intr(struct pt_regs *regs, int intnum) /* Track time spent here in an interrupt context. 
*/ old_regs = set_irq_regs(regs); + + task_isolation_interrupt("IPI: IRQ mask %#lx", remaining_irqs); irq_enter(); #ifdef CONFIG_DEBUG_STACKOVERFLOW diff --git a/arch/tile/kernel/messaging.c b/arch/tile/kernel/messaging.c index 7475af3aacec..1cf1630215f0 100644 --- a/arch/tile/kernel/messaging.c +++ b/arch/tile/kernel/messaging.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include #include @@ -86,6 +87,7 @@ void hv_message_intr(struct pt_regs *regs, int intnum) tag = message[0]; #ifdef CONFIG_SMP + task_isolation_interrupt("SMP message %d", tag); evaluate_message(message[0]); #else panic("Received IPI message %d in UP mode", tag); @@ -94,6 +96,8 @@ void hv_message_intr(struct pt_regs *regs, int intnum) HV_IntrMsg *him = (HV_IntrMsg *)message; struct hv_driver_cb *cb = (struct hv_driver_cb *)him->intarg; + task_isolation_interrupt("interrupt message %#lx(%#lx)", + him->intarg, him->intdata); cb->callback(cb, him->intdata); __this_cpu_inc(irq_stat.irq_hv_msg_count); } diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c index f0a0e18e4dfb..ac22e971dc1d 100644 --- a/arch/tile/kernel/process.c +++ b/arch/tile/kernel/process.c @@ -32,6 +32,7 @@ #include #include #include +#include #include #include #include @@ -516,6 +517,9 @@ void prepare_exit_to_usermode(struct pt_regs *regs, u32 thread_info_flags) #endif } + if (thread_info_flags & _TIF_TASK_ISOLATION) + task_isolation_start(); + user_enter(); } diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c index e1a078e6828e..908d57d3d2cf 100644 --- a/arch/tile/kernel/ptrace.c +++ b/arch/tile/kernel/ptrace.c @@ -24,6 +24,7 @@ #include #include
[PATCH v16 04/13] Add try_get_task_struct_on_cpu() to scheduler for task isolation
Task isolation wants to be able to verify that a remote core is running an isolated task to determine if it should generate a diagnostic, and also possibly interrupt it. This API returns a pointer to the task_struct of the task that was running on the specified core at the moment of the request; it uses try_get_task_struct() to increment the ref count on the returned task_struct so that the caller can examine it even if the actual remote task has already exited by that point. Signed-off-by: Chris Metcalf <cmetc...@mellanox.com> --- include/linux/sched/task.h | 1 + kernel/sched/core.c| 11 +++ 2 files changed, 12 insertions(+) diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h index 270ff76d43d9..6785db926857 100644 --- a/include/linux/sched/task.h +++ b/include/linux/sched/task.h @@ -97,6 +97,7 @@ static inline void put_task_struct(struct task_struct *t) struct task_struct *task_rcu_dereference(struct task_struct **ptask); struct task_struct *try_get_task_struct(struct task_struct **ptask); +struct task_struct *try_get_task_struct_on_cpu(int cpu); #ifdef CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT diff --git a/kernel/sched/core.c b/kernel/sched/core.c index d17c5da523a0..2728154057ae 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -670,6 +670,17 @@ bool sched_can_stop_tick(struct rq *rq) } #endif /* CONFIG_NO_HZ_FULL */ +/* + * Return a pointer to the task_struct for the task that is running on + * the specified cpu at the time of the call (note that the task may have + * exited by the time the caller inspects the resulting task_struct). + * Caller must put_task_struct() with the pointer when finished with it. + */ +struct task_struct *try_get_task_struct_on_cpu(int cpu) +{ + return try_get_task_struct(_rq(cpu)->curr); +} + void sched_avg_update(struct rq *rq) { s64 period = sched_avg_period(); -- 2.1.2
[PATCH v16 02/13] vmstat: add vmstat_idle function
This function checks to see if a vmstat worker is not running, and the vmstat diffs don't require an update. The function is called from the task-isolation code to see if we need to actually do some work to quiet vmstat. Acked-by: Christoph Lameter <c...@linux.com> Signed-off-by: Chris Metcalf <cmetc...@mellanox.com> --- include/linux/vmstat.h | 2 ++ mm/vmstat.c| 10 ++ 2 files changed, 12 insertions(+) diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index e0b504594593..80212a952448 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -265,6 +265,7 @@ extern void __dec_node_state(struct pglist_data *, enum node_stat_item); void quiet_vmstat(void); void quiet_vmstat_sync(void); +bool vmstat_idle(void); void cpu_vm_stats_fold(int cpu); void refresh_zone_stat_thresholds(void); @@ -368,6 +369,7 @@ static inline void refresh_zone_stat_thresholds(void) { } static inline void cpu_vm_stats_fold(int cpu) { } static inline void quiet_vmstat(void) { } static inline void quiet_vmstat_sync(void) { } +static inline bool vmstat_idle(void) { return true; } static inline void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset) { } diff --git a/mm/vmstat.c b/mm/vmstat.c index 8ad1b84ca9cf..8b13a6ca494c 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1846,6 +1846,16 @@ void quiet_vmstat_sync(void) } /* + * Report on whether vmstat processing is quiesced on the core currently: + * no vmstat worker running and no vmstat updates to perform. + */ +bool vmstat_idle(void) +{ + return !delayed_work_pending(this_cpu_ptr(_work)) && + !need_update(smp_processor_id()); +} + +/* * Shepherd worker thread that checks the * differentials of processors that have their worker * threads for vm statistics updates disabled because of -- 2.1.2
[PATCH v16 01/13] vmstat: add quiet_vmstat_sync function
In commit f01f17d3705b ("mm, vmstat: make quiet_vmstat lighter") the quiet_vmstat() function became asynchronous, in the sense that the vmstat work was still scheduled to run on the core when the function returned. For task isolation, we need a synchronous version of the function that guarantees that the vmstat worker will not run on the core on return from the function. Add a quiet_vmstat_sync() function with that semantic. Signed-off-by: Chris Metcalf <cmetc...@mellanox.com> --- include/linux/vmstat.h | 2 ++ mm/vmstat.c| 9 + 2 files changed, 11 insertions(+) diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index ade7cb5f1359..e0b504594593 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -264,6 +264,7 @@ extern void __dec_zone_state(struct zone *, enum zone_stat_item); extern void __dec_node_state(struct pglist_data *, enum node_stat_item); void quiet_vmstat(void); +void quiet_vmstat_sync(void); void cpu_vm_stats_fold(int cpu); void refresh_zone_stat_thresholds(void); @@ -366,6 +367,7 @@ static inline void __dec_node_page_state(struct page *page, static inline void refresh_zone_stat_thresholds(void) { } static inline void cpu_vm_stats_fold(int cpu) { } static inline void quiet_vmstat(void) { } +static inline void quiet_vmstat_sync(void) { } static inline void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset) { } diff --git a/mm/vmstat.c b/mm/vmstat.c index 4bb13e72ac97..8ad1b84ca9cf 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1837,6 +1837,15 @@ void quiet_vmstat(void) } /* + * Synchronously quiet vmstat so the work is guaranteed not to run on return. + */ +void quiet_vmstat_sync(void) +{ + cancel_delayed_work_sync(this_cpu_ptr(_work)); + refresh_cpu_vm_stats(false); +} + +/* * Shepherd worker thread that checks the * differentials of processors that have their worker * threads for vm statistics updates disabled because of -- 2.1.2
[PATCH v16 00/13] support "task_isolation" mode
is doesn't seem to ever happen. What about using a per-cpu flag to stop doing new deferred work? Andy also suggested we could structure the code to have the prctl() set a per-cpu flag to stop adding new future work (e.g. vmstat per-cpu data, or lru page cache). Then, we could flush those structures right from the sys_prctl() call, and when we were returning to user space, we'd be confident that there wasn't going to be any new work added. With the current set of things that we are disabling for task isolation, though, it didn't seem necessary. Quiescing the vmstat shepherd seems like it is generally pretty safe since we will likely be able to sync up the per-cpu cache and kill the deferred work with high probability, with no expectation that additional work will show up. And since we can flush the LRU page cache with interrupts disabled, that turns out not to be an issue either. I could imagine that if we have to deal with some new kind of deferred work, we might find the per-cpu flag becomes a good solution, but for now we don't have a good use case for that approach. How about stopping the dyn tick? Right now we try to stop it on return to userspace, but if we can't, we just return EAGAIN to userspace. In practice, what I see is that usually the tick stops immediately, but occasionally it doesn't; in this case I've always seen that nr_running is >1, presumably with some temporary kernel worker threads, and the user code just needs to call prctl() until those threads are done. We could structure things with a completion that we wait for, which is set by the timer code when it finally does stop the tick, but this may be overkill, particularly since we'll only be running this prctl() loop from userspace on cores where we have no other useful work that we're trying to run anyway. What about TLB flushing? We talked about this at Plumbers and some of the email discussion also was about TLB flushing. 
I haven't tried to add it to this patch set, because I really want to avoid scope creep; in any case, I think I managed to convince Andy that he was going to work on it himself. :) Paul McKenney already contributed some framework for such a patch, in commit b8c17e6664c4 ("rcu: Maintain special bits at bottom of ->dynticks counter"). What about that d*mn 1 Hz clock? It's still there, so this code still requires some further work before it can actually get a process into long-term task isolation (without the obvious one-line kernel hack). Frederic suggested a while ago forcing updates on cpustats was required as the last gating factor; do we think that is still true? Christoph was working on this at one point - any progress from your point of view? Chris Metcalf (12): vmstat: add quiet_vmstat_sync function vmstat: add vmstat_idle function Revert "sched/core: Drop the unused try_get_task_struct() helper function" Add try_get_task_struct_on_cpu() to scheduler for task isolation Add try_stop_full_tick() API for NO_HZ_FULL task_isolation: userspace hard isolation from kernel Add task isolation hooks to arch-independent code arch/x86: enable task isolation functionality arch/arm64: enable task isolation functionality arch/tile: enable task isolation functionality arm, tile: turn off timer tick for oneshot_stopped state task_isolation self test Francis Giraldeau (1): arch/arm: enable task isolation functionality Documentation/admin-guide/kernel-parameters.txt| 6 + arch/arm/Kconfig | 1 + arch/arm/include/asm/thread_info.h | 10 +- arch/arm/kernel/entry-common.S | 12 +- arch/arm/kernel/ptrace.c | 10 + arch/arm/kernel/signal.c | 10 +- arch/arm/kernel/smp.c | 4 + arch/arm/mm/fault.c| 8 +- arch/arm64/Kconfig | 1 + arch/arm64/include/asm/thread_info.h | 5 +- arch/arm64/kernel/ptrace.c | 18 +- arch/arm64/kernel/signal.c | 10 +- arch/arm64/kernel/smp.c| 7 + arch/arm64/mm/fault.c | 5 + arch/tile/Kconfig | 1 + arch/tile/include/asm/thread_info.h| 2 + arch/tile/kernel/hardwall.c| 2 + 
arch/tile/kernel/irq.c | 3 + arch/tile/kernel/messaging.c | 4 + arch/tile/kernel/process.c | 4 + arch/tile/kernel/ptrace.c | 10 + arch/tile/kernel/single_step.c | 7 + arch/tile/kernel/smp.c | 21 +- arch/tile/kernel/time.c| 3 + arch/tile/kernel/unaligned.c | 4 + arch/tile/mm/fault.c
Re: [GIT PULL] Introduce housekeeping subsystem v4
On 10/20/2017 10:29 AM, Frederic Weisbecker wrote: 2017-10-20 10:17 UTC+02:00, Ingo Molnar <mi...@kernel.org>: I mean code like: triton:~/tip> git grep on_each_cpu mm mm/page_alloc.c: * cpu to drain that CPU pcps and on_each_cpu_mask mm/slab.c: on_each_cpu(do_drain, cachep, 1); mm/slub.c: on_each_cpu_cond(has_cpu_slab, flush_cpu_slab, s, 1, GFP_ATOMIC); mm/vmstat.c:err = schedule_on_each_cpu(refresh_vm_stats); is something we want to execute on 'housekeeping CPUs' as well, to not disturb the isolated CPUs, right? I see, so indeed that's the kind of thing we want to also confine to housekeeping as well whenever possible but these cases require special treatment that need to be handled by the subsystem in charge. For example vmstat has the vmstat_shepherd thing which allows to drive those timers adaptively on demand to make sure that userspace isn't interrupted. The others will likely need some similar treatment. For now I only see vmstat having such a feature and it acts transparently. There is also the LRU flush (IIRC) which needs to be called for example before returning to userspace to avoid IPIs. Such things may indeed need special treatment. With the current patchset it could be a housekeeping flag. I have been working to update the task isolation support the last few days and though it's not quite ready to post (probably will be Monday or Tuesday), I have sorted out those issues from task isolation's perspective. It turns out that you can both quiesce the vmstat_shepherd, as well as drain the LRU per-cpu pages, while interrupts are disabled on the way back to userspace. Whether shifting this work to housekeeping cores at all times makes sense seems like a much more open question. The idea of task isolation is to provide a harder guarantee of isolation, and in particular to shift work to the moment that you return to userspace, rather than allowing it to happen later. 
It does seem likely that there are some things you'd want to do on the core itself most of the time, and just suppress for true task isolation if requested, rather than trying to move them to the housekeeping cores. But, it's certainly worth looking at both options and seeing how it plays out. The less complicated the task isolation return-to-user path is, the better. (The idea of task isolation seems like a win no matter what, to allow ensuring kernel isolation when you absolutely require it.) The current task isolation tree is in the "dataplane" branch at git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git -- Chris Metcalf, Mellanox Technologies http://www.mellanox.com
[PATCH] firmware: bluefield: add boot control driver
The Mellanox BlueField SoC firmware supports a safe upgrade mode as part of the flow where users put new firmware on the secondary eMMC boot partition (the one not currently in use), tell the eMMC to make the secondary boot partition primary, and reset. This driver is used to request that the firmware start the ARM watchdog after the next reset, and also request that the firmware swap the eMMC boot partition back again on the reset after that (the second reset). This means that if anything goes wrong, the watchdog will fire, the system will reset, and the firmware will switch back to the original boot partition. If the boot is successful, the user will use this driver to put the firmware back into the state where it doesn't touch the eMMC boot partition at reset, and turn off the ARM watchdog. The firmware allows for more configurability than that, as can be seen in the code, but the use case above is what the driver primarily supports. It is structured as a simple sysfs driver that is loaded based on an ACPI table entry, and allows reading/writing text strings to various /sys/bus/platform/drivers/mlx-bootctl/* files. Signed-off-by: Chris Metcalf <cmetc...@mellanox.com> --- Ingo, since there isn't an overall maintainer for drivers/firmware, does it make sense for this to go through your tree? Thanks! drivers/firmware/Kconfig | 12 +++ drivers/firmware/Makefile | 1 + drivers/firmware/mlx-bootctl.c | 222 + drivers/firmware/mlx-bootctl.h | 103 +++ 4 files changed, 338 insertions(+) create mode 100644 drivers/firmware/mlx-bootctl.c create mode 100644 drivers/firmware/mlx-bootctl.h diff --git a/drivers/firmware/Kconfig b/drivers/firmware/Kconfig index 6e4ed5a9c6fd..1f2adbcc5acc 100644 --- a/drivers/firmware/Kconfig +++ b/drivers/firmware/Kconfig @@ -230,6 +230,18 @@ config TI_SCI_PROTOCOL This protocol library is used by client drivers to use the features provided by the system controller. 
+config MLX_BOOTCTL + tristate "Mellanox BlueField Firmware Boot Control" + depends on ARM64 + help + The Mellanox BlueField firmware implements functionality to + request swapping the primary and alternate eMMC boot + partition, and to set up a watchdog that can undo that swap + if the system does not boot up correctly. This driver + provides sysfs access to the firmware, to be used in + conjunction with the eMMC device driver to do any necessary + initial swap of the boot partition. + config HAVE_ARM_SMCCC bool diff --git a/drivers/firmware/Makefile b/drivers/firmware/Makefile index a37f12e8d137..4f4cad1eb9dd 100644 --- a/drivers/firmware/Makefile +++ b/drivers/firmware/Makefile @@ -22,6 +22,7 @@ obj-$(CONFIG_QCOM_SCM_64) += qcom_scm-64.o obj-$(CONFIG_QCOM_SCM_32) += qcom_scm-32.o CFLAGS_qcom_scm-32.o :=$(call as-instr,.arch armv7-a\n.arch_extension sec,-DREQUIRES_SEC=1) -march=armv7-a obj-$(CONFIG_TI_SCI_PROTOCOL) += ti_sci.o +obj-$(CONFIG_MLX_BOOTCTL) += mlx-bootctl.o obj-y += broadcom/ obj-y += meson/ diff --git a/drivers/firmware/mlx-bootctl.c b/drivers/firmware/mlx-bootctl.c new file mode 100644 index ..7fe942e9d7bb --- /dev/null +++ b/drivers/firmware/mlx-bootctl.c @@ -0,0 +1,222 @@ +/* + * Mellanox boot control driver + * This driver provides a sysfs interface for systems management + * software to manage reset-time actions. + * + * Copyright (C) 2017 Mellanox Technologies. All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License v2.0 as published by + * the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ */ + +#include +#include +#include +#include +#include "mlx-bootctl.h" + +#define DRIVER_NAME"mlx-bootctl" +#define DRIVER_VERSION "1.1" +#define DRIVER_DESCRIPTION "Mellanox boot control driver" + +struct boot_name { + int value; + const char name[12]; +}; + +static struct boot_name boot_names[] = { + { MLNX_BOOT_EXTERNAL, "external" }, + { MLNX_BOOT_EMMC, "emmc" }, + { MLNX_BOOT_SWAP_EMMC, "swap_emmc" }, + { MLNX_BOOT_EMMC_LEGACY,"emmc_legacy" }, + { MLNX_BOOT_NONE, "none" }, + { -1, "" } +}; + +/* The SMC calls in question are atomic, so we don't have to lock here. */ +static int smc_call1(uns
[PATCH] firmware: bluefield: add boot control driver
The Mellanox BlueField SoC firmware supports a safe upgrade mode as part
of the flow where users put new firmware on the secondary eMMC boot
partition (the one not currently in use), tell the eMMC to make the
secondary boot partition primary, and reset.

This driver is used to request that the firmware start the ARM watchdog
after the next reset, and also request that the firmware swap the eMMC
boot partition back again on the reset after that (the second reset).
This means that if anything goes wrong, the watchdog will fire, the
system will reset, and the firmware will switch back to the original
boot partition.  If the boot is successful, the user will use this
driver to put the firmware back into the state where it doesn't touch
the eMMC boot partition at reset, and turn off the ARM watchdog.

The firmware allows for more configurability than that, as can be seen
in the code, but the use case above is what the driver primarily
supports.  It is structured as a simple sysfs driver that is loaded
based on an ACPI table entry, and allows reading/writing text strings
to various /sys/bus/platform/drivers/mlx-bootctl/* files.

Signed-off-by: Chris Metcalf
---
Ingo, since there isn't an overall maintainer for drivers/firmware,
does it make sense for this to go through your tree?  Thanks!

 drivers/firmware/Kconfig       |  12 +++
 drivers/firmware/Makefile      |   1 +
 drivers/firmware/mlx-bootctl.c | 222 +
 drivers/firmware/mlx-bootctl.h | 103 +++
 4 files changed, 338 insertions(+)
 create mode 100644 drivers/firmware/mlx-bootctl.c
 create mode 100644 drivers/firmware/mlx-bootctl.h

diff --git a/drivers/firmware/Kconfig b/drivers/firmware/Kconfig
index 6e4ed5a9c6fd..1f2adbcc5acc 100644
--- a/drivers/firmware/Kconfig
+++ b/drivers/firmware/Kconfig
@@ -230,6 +230,18 @@ config TI_SCI_PROTOCOL
 	  This protocol library is used by client drivers to use the features
 	  provided by the system controller.
 
+config MLX_BOOTCTL
+	tristate "Mellanox BlueField Firmware Boot Control"
+	depends on ARM64
+	help
+	  The Mellanox BlueField firmware implements functionality to
+	  request swapping the primary and alternate eMMC boot
+	  partition, and to set up a watchdog that can undo that swap
+	  if the system does not boot up correctly.  This driver
+	  provides sysfs access to the firmware, to be used in
+	  conjunction with the eMMC device driver to do any necessary
+	  initial swap of the boot partition.
+
 config HAVE_ARM_SMCCC
 	bool

diff --git a/drivers/firmware/Makefile b/drivers/firmware/Makefile
index a37f12e8d137..4f4cad1eb9dd 100644
--- a/drivers/firmware/Makefile
+++ b/drivers/firmware/Makefile
@@ -22,6 +22,7 @@ obj-$(CONFIG_QCOM_SCM_64) += qcom_scm-64.o
 obj-$(CONFIG_QCOM_SCM_32) += qcom_scm-32.o
 CFLAGS_qcom_scm-32.o := $(call as-instr,.arch armv7-a\n.arch_extension sec,-DREQUIRES_SEC=1) -march=armv7-a
 obj-$(CONFIG_TI_SCI_PROTOCOL) += ti_sci.o
+obj-$(CONFIG_MLX_BOOTCTL) += mlx-bootctl.o
 obj-y += broadcom/
 obj-y += meson/

diff --git a/drivers/firmware/mlx-bootctl.c b/drivers/firmware/mlx-bootctl.c
new file mode 100644
index ..7fe942e9d7bb
--- /dev/null
+++ b/drivers/firmware/mlx-bootctl.c
@@ -0,0 +1,222 @@
+/*
+ * Mellanox boot control driver
+ * This driver provides a sysfs interface for systems management
+ * software to manage reset-time actions.
+ *
+ * Copyright (C) 2017 Mellanox Technologies.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License v2.0 as published by
+ * the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include
+#include
+#include
+#include
+#include "mlx-bootctl.h"
+
+#define DRIVER_NAME		"mlx-bootctl"
+#define DRIVER_VERSION		"1.1"
+#define DRIVER_DESCRIPTION	"Mellanox boot control driver"
+
+struct boot_name {
+	int value;
+	const char name[12];
+};
+
+static struct boot_name boot_names[] = {
+	{ MLNX_BOOT_EXTERNAL,	"external" },
+	{ MLNX_BOOT_EMMC,	"emmc" },
+	{ MLNX_BOOT_SWAP_EMMC,	"swap_emmc" },
+	{ MLNX_BOOT_EMMC_LEGACY, "emmc_legacy" },
+	{ MLNX_BOOT_NONE,	"none" },
+	{ -1,			"" }
+};
+
+/* The SMC calls in question are atomic, so we don't have to lock here. */
+static int smc_call1(unsigned int smc_o
[GIT PULL] arch/tile bugfixes for 4.14-rc2
Linus,

Please pull the following two changes for 4.14-rc2 from:

  git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git master

These are a code cleanup and config cleanup, respectively.

Dan Carpenter (1):
      tile: array underflow in setup_maxnodemem()

Krzysztof Kozlowski (1):
      tile: defconfig: Cleanup from old Kconfig options

 arch/tile/configs/tilegx_defconfig  | 1 -
 arch/tile/configs/tilepro_defconfig | 2 --
 arch/tile/kernel/setup.c            | 2 +-
 3 files changed, 1 insertion(+), 4 deletions(-)

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com
Re: [RFC PATCH 0/9] Introduce housekeeping subsystem
On 8/11/2017 11:35 AM, Christopher Lameter wrote: Ah, Chris since you are here: What is happening with the dataplane patches? Work has been crazy and I keep expecting to have a chunk of time to work on it and it doesn't happen. September is looking relatively good though for my having time to work on it. I really would like to get out a new spin. Fingers crossed. -- Chris Metcalf, Mellanox Technologies http://www.mellanox.com
Re: [RFC PATCH 0/9] Introduce housekeeping subsystem
On 8/11/2017 2:36 AM, Mike Galbraith wrote: On Thu, 2017-08-10 at 09:57 -0400, Chris Metcalf wrote: On 8/10/2017 8:54 AM, Frederic Weisbecker wrote: But perhaps I should add a new NO_HZ_FULL_BUT_HOUSEKEEPING option. Otherwise we'll change the meaning of NO_HZ_FULL_ALL way too much, to the point that its default behaviour will be the exact opposite of the current one: by default every CPU is housekeeping, so NO_HZ_FULL_ALL would have no effect anymore if we don't set housekeeping boot option. Maybe a CONFIG_HOUSEKEEPING_BOOT_ONLY as a way to restrict housekeeping by default to just the boot cpu. In conjunction with NOHZ_FULL_ALL you would then get the expected semantics. A big box with only the boot cpu for housekeeping is likely screwed. Fair point - this kind of configuration would be primarily useful for dedicated systems that were running a high-traffic-rate networking application on many cores, for example. In this mode you don't end up putting a lot of burden on the housekeeping core. In any case, probably not worth adding an additional kernel config for. -- Chris Metcalf, Mellanox Technologies http://www.mellanox.com
Re: [RFC PATCH 0/9] Introduce housekeeping subsystem
On 8/10/2017 8:54 AM, Frederic Weisbecker wrote: But perhaps I should add a new NO_HZ_FULL_BUT_HOUSEKEEPING option. Otherwise we'll change the meaning of NO_HZ_FULL_ALL way too much, to the point that its default behaviour will be the exact opposite of the current one: by default every CPU is housekeeping, so NO_HZ_FULL_ALL would have no effect anymore if we don't set housekeeping boot option. Maybe a CONFIG_HOUSEKEEPING_BOOT_ONLY as a way to restrict housekeeping by default to just the boot cpu. In conjunction with NOHZ_FULL_ALL you would then get the expected semantics. Also I plan to add a housekeeping option to offload the residual 1Hz tick from nohz_full CPUs. So having "housekeeping=0,tick_offload" would make CPU 0 the housekeeper, make the other CPUs nohz_full and handle their 1hz tick from CPU 0. It does seem like that might be implied by requesting NOHZ_FULL on the core... or maybe it's just implied by TASK_ISOLATION. I've done a bad job of finding time to work on that since last year's Plumbers, but September looks good :) -- Chris Metcalf, Mellanox Technologies http://www.mellanox.com
Re: [PATCH 09/11] tile/topology: Remove the unused parent_node() macro
On 7/26/2017 9:34 AM, Dou Liyang wrote: Commit a7be6e5a7f8d ("mm: drop useless local parameters of __register_one_node()") removes the last user of parent_node(). The parent_node() macro in tile platform is unnecessary. Remove it for cleanup. Reported-by: Michael Ellerman<m...@ellerman.id.au> Signed-off-by: Dou Liyang<douly.f...@cn.fujitsu.com> Cc: Chris Metcalf<cmetc...@mellanox.com> --- arch/tile/include/asm/topology.h | 6 -- 1 file changed, 6 deletions(-) Acked-by: Chris Metcalf <cmetc...@mellanox.com> -- Chris Metcalf, Mellanox Technologies http://www.mellanox.com
Re: [PATCH] tile: array underflow in setup_maxnodemem()
On 7/22/2017 3:33 AM, Dan Carpenter wrote: My static checker correctly complains that we should have a lower bound on "node" to prevent an array underflow. Fixes: 867e359b97c9 ("arch/tile: core support for Tilera 32-bit chips.") Signed-off-by: Dan Carpenter<dan.carpen...@oracle.com> Thanks, taken into the tile tree. -- Chris Metcalf, Mellanox Technologies http://www.mellanox.com
Re: [RFC PATCH 0/9] Introduce housekeeping subsystem
On 7/21/2017 9:21 AM, Frederic Weisbecker wrote: I'm leaving for two weeks so this is food for thoughts in the meantime :) We have a design issue with nohz_full: it drives the isolation features through the *housekeeping*() functions: kthreads, unpinned timers, watchdog, ... But things should work the other way around because the tick is just an isolation feature among others. So we need a housekeeping subsystem to drive all these isolation features, including nohz full in a later iteration. For now this is a basic draft. In the long run this subsystem should also drive the tick offloading (remove residual 1Hz) and all unbound kthreads. git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git nohz/0hz HEAD: 68e3af1de5db228bf6c2a5e721bce59a02cfc4e1 For the series: Reviewed-by: Chris Metcalf <cmetc...@mellanox.com> I spotted a few typos that you should grep for and fix for your next version: "watchog", "Lets/lets" instead of "Let's/let's", "overriden" (should have two d's). The new housekeeping=MASK boot option seems like it might make it a little irritating to specify nohz_full=MASK as well. I guess if setting NO_HZ_FULL_ALL implied "all but housekeeping", it becomes a reasonably tidy solution. To make this work right you might have to make the housekeeping option early_param instead so its value is available early enough. -- Chris Metcalf, Mellanox Technologies http://www.mellanox.com
Re: [RESEND PATCH] tile: defconfig: Cleanup from old Kconfig options
On 7/20/2017 1:05 AM, Krzysztof Kozlowski wrote: Remove old, dead Kconfig options (in order appearing in this commit): - CRYPTO_ZLIB: commit 110492183c4b ("crypto: compress - remove unused pcomp interface"); - IP_NF_TARGET_ULOG: commit d4da843e6fad ("netfilter: kill remnants of ulog targets"); Signed-off-by: Krzysztof Kozlowski<k...@kernel.org> --- arch/tile/configs/tilegx_defconfig | 1 - arch/tile/configs/tilepro_defconfig | 2 -- 2 files changed, 3 deletions(-) Thanks! Taken into the tile tree. -- Chris Metcalf, Mellanox Technologies http://www.mellanox.com
Re: [PATCH] lib/strscpy: avoid KASAN false positive
On 7/18/2017 6:04 PM, Andrew Morton wrote: On Wed, 19 Jul 2017 00:31:36 +0300 Andrey Ryabinin <aryabi...@virtuozzo.com> wrote: On 07/18/2017 11:26 PM, Linus Torvalds wrote: On Tue, Jul 18, 2017 at 1:15 PM, Andrey Ryabinin <aryabi...@virtuozzo.com> wrote: No, it does warn about valid users. The report that Dave posted wasn't about wrong strscpy() usage it was about reading 8-bytes from 5-bytes source string. It wasn't about buggy 'count' at all. So KASAN will warn for perfectly valid code like this: char dest[16]; strscpy(dest, "12345", sizeof(dest)): Ugh, ok, yes. For strscpy() that would mean making the *whole* read from 'src' buffer unchecked by KASAN. So we do have that READ_ONCE_NOCHECK(), but could we perhaps have something that doesn't do a NOCHECK but a partial check and is simply ok with "this is an optimistc longer access" This can be dont, I think. Something like this: static inline unsigned long read_partial_nocheck(unsigned long *x) { unsigned long ret = READ_ONCE_NOCHECK(x); kasan_check_partial(x, sizeof(unsigned long)); return ret; } (Cc Chris) We could just remove all that word-at-a-time logic. Do we have any evidence that this would harm anything? The word-at-a-time logic was part of the initial commit since I wanted to ensure that strscpy could be used to replace strlcpy or strncpy without serious concerns about performance. It seems unfortunate to remove it unconditionally to support KASAN, but I haven't looked deeply at the tradeoffs here. -- Chris Metcalf, Mellanox Technologies http://www.mellanox.com
[GIT PULL] arch/tile changes for 4.13
Linus,

Please pull the following changes for 4.13 from:

  git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git master

This adds support for an <arch/intreg.h> header to help with removing
__need_xxx #defines from glibc, and removes some dead code in
arch/tile/mm/init.c.

Chris Metcalf (1):
      tile: prefer <arch/intreg.h> to __need_int_reg_t

Michal Hocko (1):
      mm, tile: drop arch_{add,remove}_memory

 arch/tile/include/uapi/arch/abi.h    | 49 +++--
 arch/tile/include/uapi/arch/intreg.h | 70 
 arch/tile/mm/init.c                  | 30 
 3 files changed, 74 insertions(+), 75 deletions(-)
 create mode 100644 arch/tile/include/uapi/arch/intreg.h

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com
Re: [RFC][PATCH] atomic: Fix atomic_set_release() for 'funny' architectures
On 6/9/2017 7:05 AM, Peter Zijlstra wrote: Subject: atomic: Fix atomic_set_release() for 'funny' architectures Those architectures that have a special atomic_set implementation also need a special atomic_set_release(), because for the very same reason WRITE_ONCE() is broken for them, smp_store_release() is too. The vast majority is architectures that have spinlock hash based atomic implementation except hexagon which seems to have a hardware 'feature'. The spinlock based atomics should be SC, that is, none of them appear to place extra barriers in atomic_cmpxchg() or any of the other SC atomic primitives and therefore seem to rely on their spinlock implementation being SC (I did not fully validate all that). Therefore, the normal atomic_set() is SC and can be used at atomic_set_release(). Acked-by: Chris Metcalf <cmetc...@mellanox.com> [for tile] -- Chris Metcalf, Mellanox Technologies http://www.mellanox.com
Re: Updating kernel.org cross compilers?
On 05/09/2017 10:59 AM, Andre Przywara wrote: On 30/04/17 06:29, Segher Boessenkool wrote: On Wed, Apr 26, 2017 at 03:14:16PM +0100, Andre Przywara wrote: It seems that many people (even outside the Linux kernel community) use the cross compilers provided at kernel.org/pub/tools/crosstool. The latest compiler I find there is 4.9.0, which celebrated its third birthday at the weekend, also has been superseded by 4.9.4 meanwhile. So I took Segher's buildall scripts from [1] and threw binutils 2.28 and GCC 6.3.0 at them. I am belatedly catching up on this thread. It sounds like the tilegx/tilepro issues were sorted out -- as someone noted, you need to have the kernel headers available to build glibc. However, if there are any outstanding tile issues, please feel free to loop me in! -- Chris Metcalf, Mellanox Technologies http://www.mellanox.com
Re: [PATCH 2/6] mm, tile: drop arch_{add,remove}_memory
On 3/30/2017 7:54 AM, Michal Hocko wrote: From: Michal Hocko<mho...@suse.com> these functions are unreachable because tile doesn't support memory hotplug becasuse it doesn't select ARCH_ENABLE_MEMORY_HOTPLUG nor it supports SPARSEMEM. This code hasn't been compiled for a while obviously because nobody has noticed that __add_pages has a different signature since 2009. Cc: Chris Metcalf<cmetc...@mellanox.com> Signed-off-by: Michal Hocko<mho...@suse.com> --- arch/tile/mm/init.c | 30 -- 1 file changed, 30 deletions(-) Thanks - taken into the tile tree. -- Chris Metcalf, Mellanox Technologies http://www.mellanox.com
[PATCH] tile: prefer <arch/intreg.h> to __need_int_reg_t
As part of some work in glibc to move away from the "__need" prefix,
this commit breaks away the definitions of __int_reg_t, __uint_reg_t,
__INT_REG_BITS, and __INT_REG_FMT to a separate "microheader",
<arch/intreg.h>.  It is then included from <arch/abi.h> to preserve
the semantics of the previous header.

For now, we continue to preserve the __need_int_reg_t semantics in
<arch/abi.h> as well, but anticipate that after a few years we can
obsolete it.
---
 arch/tile/include/uapi/arch/abi.h    | 49 +++--
 arch/tile/include/uapi/arch/intreg.h | 70 
 2 files changed, 74 insertions(+), 45 deletions(-)
 create mode 100644 arch/tile/include/uapi/arch/intreg.h

diff --git a/arch/tile/include/uapi/arch/abi.h b/arch/tile/include/uapi/arch/abi.h
index c55a3d432644..328e62260272 100644
--- a/arch/tile/include/uapi/arch/abi.h
+++ b/arch/tile/include/uapi/arch/abi.h
@@ -20,58 +20,17 @@
 
 #ifndef __ARCH_ABI_H__
 
-#if !defined __need_int_reg_t && !defined __DOXYGEN__
-# define __ARCH_ABI_H__
-# include <arch/chip.h>
-#endif
-
-/* Provide the basic machine types. */
-#ifndef __INT_REG_BITS
-
-/** Number of bits in a register. */
-#if defined __tilegx__
-# define __INT_REG_BITS 64
-#elif defined __tilepro__
-# define __INT_REG_BITS 32
-#elif !defined __need_int_reg_t
+#ifndef __tile__ /* support uncommon use of arch headers in non-tile builds */
 # include <arch/chip.h>
 # define __INT_REG_BITS CHIP_WORD_SIZE()
-#else
-# error Unrecognized architecture with __need_int_reg_t
-#endif
-
-#if __INT_REG_BITS == 64
-
-#ifndef __ASSEMBLER__
-/** Unsigned type that can hold a register. */
-typedef unsigned long long __uint_reg_t;
-
-/** Signed type that can hold a register. */
-typedef long long __int_reg_t;
-#endif
-
-/** String prefix to use for printf(). */
-#define __INT_REG_FMT "ll"
-
-#else
-
-#ifndef __ASSEMBLER__
-/** Unsigned type that can hold a register. */
-typedef unsigned long __uint_reg_t;
-
-/** Signed type that can hold a register. */
-typedef long __int_reg_t;
-#endif
-
-/** String prefix to use for printf(). */
-#define __INT_REG_FMT "l"
-
 #endif
 
-#endif /* __INT_REG_BITS */
+#include <arch/intreg.h>
 
+/* __need_int_reg_t is deprecated: just include <arch/intreg.h> */
 #ifndef __need_int_reg_t
 
+#define __ARCH_ABI_H__
 
 #ifndef __ASSEMBLER__
 
 /** Unsigned type that can hold a register. */

diff --git a/arch/tile/include/uapi/arch/intreg.h b/arch/tile/include/uapi/arch/intreg.h
new file mode 100644
index ..1cf2fbf74306
--- /dev/null
+++ b/arch/tile/include/uapi/arch/intreg.h
@@ -0,0 +1,70 @@
+/*
+ * Copyright 2017 Tilera Corporation. All Rights Reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation, version 2.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
+ * NON INFRINGEMENT.  See the GNU General Public License for
+ * more details.
+ */
+
+/**
+ * @file
+ *
+ * Provide types and defines for the type that can hold a register,
+ * in the implementation namespace.
+ */
+
+#ifndef __ARCH_INTREG_H__
+#define __ARCH_INTREG_H__
+
+/*
+ * Get number of bits in a register.  __INT_REG_BITS may be defined
+ * prior to including this header to force a particular bit width.
+ */
+
+#ifndef __INT_REG_BITS
+# if defined __tilegx__
+#  define __INT_REG_BITS 64
+# elif defined __tilepro__
+#  define __INT_REG_BITS 32
+# else
+#  error Unrecognized architecture
+# endif
+#endif
+
+#if __INT_REG_BITS == 64
+
+# ifndef __ASSEMBLER__
+/** Unsigned type that can hold a register. */
+typedef unsigned long long __uint_reg_t;
+
+/** Signed type that can hold a register. */
+typedef long long __int_reg_t;
+# endif
+
+/** String prefix to use for printf(). */
+# define __INT_REG_FMT "ll"
+
+#elif __INT_REG_BITS == 32
+
+# ifndef __ASSEMBLER__
+/** Unsigned type that can hold a register. */
+typedef unsigned long __uint_reg_t;
+
+/** Signed type that can hold a register. */
+typedef long __int_reg_t;
+# endif
+
+/** String prefix to use for printf(). */
+# define __INT_REG_FMT "l"
+
+#else
+# error Unrecognized value of __INT_REG_BITS
+#endif
+
+#endif /* !__ARCH_INTREG_H__ */
-- 
2.7.2
[PATCH] tile: prefer to __need_int_reg_t
As part of some work in glibc to move away from the "__need" prefix, this commit breaks away the definitions of __int_reg_t, __uint_reg_t, __INT_REG_BITS, and __INT_REG_FMT to a separate "microheader". It is then included from the original header to preserve its previous semantics. For now, we continue to preserve the __need_int_reg_t semantics there as well, but anticipate that after a few years we can obsolete it.
---
 arch/tile/include/uapi/arch/abi.h    | 49 +++--
 arch/tile/include/uapi/arch/intreg.h | 70
 2 files changed, 74 insertions(+), 45 deletions(-)
 create mode 100644 arch/tile/include/uapi/arch/intreg.h

diff --git a/arch/tile/include/uapi/arch/abi.h b/arch/tile/include/uapi/arch/abi.h
index c55a3d432644..328e62260272 100644
--- a/arch/tile/include/uapi/arch/abi.h
+++ b/arch/tile/include/uapi/arch/abi.h
@@ -20,58 +20,17 @@
 #ifndef __ARCH_ABI_H__
-#if !defined __need_int_reg_t && !defined __DOXYGEN__
-# define __ARCH_ABI_H__
-# include
-#endif
-
-/* Provide the basic machine types. */
-#ifndef __INT_REG_BITS
-
-/** Number of bits in a register. */
-#if defined __tilegx__
-# define __INT_REG_BITS 64
-#elif defined __tilepro__
-# define __INT_REG_BITS 32
-#elif !defined __need_int_reg_t
+#ifndef __tile__ /* support uncommon use of arch headers in non-tile builds */
 # include
 # define __INT_REG_BITS CHIP_WORD_SIZE()
-#else
-# error Unrecognized architecture with __need_int_reg_t
-#endif
-
-#if __INT_REG_BITS == 64
-
-#ifndef __ASSEMBLER__
-/** Unsigned type that can hold a register. */
-typedef unsigned long long __uint_reg_t;
-
-/** Signed type that can hold a register. */
-typedef long long __int_reg_t;
-#endif
-
-/** String prefix to use for printf(). */
-#define __INT_REG_FMT "ll"
-
-#else
-
-#ifndef __ASSEMBLER__
-/** Unsigned type that can hold a register. */
-typedef unsigned long __uint_reg_t;
-
-/** Signed type that can hold a register. */
-typedef long __int_reg_t;
-#endif
-
-/** String prefix to use for printf(). */
-#define __INT_REG_FMT "l"
-
 #endif
-#endif /* __INT_REG_BITS */
+#include
+/* __need_int_reg_t is deprecated: just include */

 #ifndef __need_int_reg_t
+#define __ARCH_ABI_H__

 #ifndef __ASSEMBLER__
 /** Unsigned type that can hold a register. */

diff --git a/arch/tile/include/uapi/arch/intreg.h b/arch/tile/include/uapi/arch/intreg.h
new file mode 100644
index ..1cf2fbf74306
--- /dev/null
+++ b/arch/tile/include/uapi/arch/intreg.h
@@ -0,0 +1,70 @@
+/*
+ * Copyright 2017 Tilera Corporation. All Rights Reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation, version 2.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
+ * NON INFRINGEMENT.  See the GNU General Public License for
+ * more details.
+ */
+
+/**
+ * @file
+ *
+ * Provide types and defines for the type that can hold a register,
+ * in the implementation namespace.
+ */
+
+#ifndef __ARCH_INTREG_H__
+#define __ARCH_INTREG_H__
+
+/*
+ * Get number of bits in a register.  __INT_REG_BITS may be defined
+ * prior to including this header to force a particular bit width.
+ */
+
+#ifndef __INT_REG_BITS
+# if defined __tilegx__
+#  define __INT_REG_BITS 64
+# elif defined __tilepro__
+#  define __INT_REG_BITS 32
+# else
+#  error Unrecognized architecture
+# endif
+#endif
+
+#if __INT_REG_BITS == 64
+
+# ifndef __ASSEMBLER__
+/** Unsigned type that can hold a register. */
+typedef unsigned long long __uint_reg_t;
+
+/** Signed type that can hold a register. */
+typedef long long __int_reg_t;
+# endif
+
+/** String prefix to use for printf(). */
+# define __INT_REG_FMT "ll"
+
+#elif __INT_REG_BITS == 32
+
+# ifndef __ASSEMBLER__
+/** Unsigned type that can hold a register. */
+typedef unsigned long __uint_reg_t;
+
+/** Signed type that can hold a register. */
+typedef long __int_reg_t;
+# endif
+
+/** String prefix to use for printf(). */
+# define __INT_REG_FMT "l"
+
+#else
+# error Unrecognized value of __INT_REG_BITS
+#endif
+
+#endif /* !__ARCH_INTREG_H__ */
--
2.7.2
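The register-width plumbing the microheader provides can be exercised outside the kernel. Below is a hedged stand-in that mirrors the header's names for a 64-bit (tilegx-like) target; the type and macro definitions are taken from the patch, while the `fmt_reg()` helper is purely illustrative.

```c
#include <stdio.h>

/* Hypothetical stand-in for the new microheader, hard-wired here to the
 * 64-bit (tilegx) case; only the names mirror the real header. */
#define __INT_REG_BITS 64

#if __INT_REG_BITS == 64
typedef unsigned long long __uint_reg_t;
typedef long long __int_reg_t;
# define __INT_REG_FMT "ll"
#else
typedef unsigned long __uint_reg_t;
typedef long __int_reg_t;
# define __INT_REG_FMT "l"
#endif

/* Callers splice __INT_REG_FMT into format strings so the printf length
 * modifier always matches the register type's width. */
static int fmt_reg(char *buf, size_t n, __uint_reg_t reg)
{
    return snprintf(buf, n, "%" __INT_REG_FMT "x", reg);
}
```

Because `__INT_REG_FMT` travels with the typedefs, a consumer never has to guess whether `%lx` or `%llx` is correct for the current configuration.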
Re: [PATCH 1/3] futex: remove duplicated code
On 3/3/2017 7:27 AM, Jiri Slaby wrote: There is code duplicated over all architecture's headers for futex_atomic_op_inuser. Namely op decoding, access_ok check for uaddr, and comparison of the result. Remove this duplication and leave up to the arches only the needed assembly which is now in arch_futex_atomic_op_inuser. Note that s390 removed access_ok check in d12a29703 ("s390/uaccess: remove pointless access_ok() checks") as access_ok there returns true. We introduce it back to the helper for the sake of simplicity (it gets optimized away anyway). Signed-off-by: Jiri Slaby<jsl...@suse.cz> Acked-by: Chris Metcalf <cmetc...@mellanox.com> [for tile] -- Chris Metcalf, Mellanox Technologies http://www.mellanox.com
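The shape of the refactor — the shared decode and comparison hoisted into one generic helper, with only a small atomic hook left per architecture — can be sketched in userspace. The names and the op encoding below are simplified stand-ins, not the exact kernel API, and the access_ok/pagefault handling the real helper performs is omitted.

```c
#define FUTEX_OP_SET  0
#define FUTEX_OP_ADD  1
#define FUTEX_OP_OR   2

/* Stand-in for the small per-arch hook: in the kernel this is the only
 * part each architecture still implements, as an atomic sequence. */
static int arch_futex_atomic_op(int op, int oparg, int *oldval, int *uaddr)
{
    *oldval = *uaddr;
    switch (op) {
    case FUTEX_OP_SET: *uaddr = oparg;  break;
    case FUTEX_OP_ADD: *uaddr += oparg; break;
    case FUTEX_OP_OR:  *uaddr |= oparg; break;
    default: return -1;
    }
    return 0;
}

/* Generic wrapper: op decoding (and, in the kernel, the access_ok check)
 * lives here once instead of being copied into every arch header. */
static int futex_atomic_op_inuser(unsigned int encoded_op, int *uaddr)
{
    int op = (encoded_op >> 28) & 0xf;
    int oparg = (encoded_op >> 12) & 0xfff;
    int oldval;

    if (arch_futex_atomic_op(op, oparg, &oldval, uaddr))
        return -1;
    return oldval;
}
```

The point of the patch is exactly this division of labor: duplicated C logic moves up, arch-specific assembly stays down.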
Re: [PATCH v15 04/13] task_isolation: add initial support
On 2/2/2017 11:13 AM, Eugene Syromiatnikov wrote:

 	case PR_GET_FP_MODE:
 		error = GET_FP_MODE(me);
 		break;
+#ifdef CONFIG_TASK_ISOLATION
+	case PR_SET_TASK_ISOLATION:
+		error = task_isolation_set(arg2);
+		break;
+	case PR_GET_TASK_ISOLATION:
+		error = me->task_isolation_flags;
+		break;
+#endif
 	default:
 		error = -EINVAL;
 		break;

It is not a very good idea to ignore the values of unused arguments; it prevents their future usage, as user space can pass some garbage values here. Check out the code for newer prctl handlers, like PR_SET_NO_NEW_PRIVS, PR_SET_THP_DISABLE, or PR_MPX_ENABLE_MANAGEMENT (PR_[SG]ET_FP_MODE is an unfortunate recent omission). The other thing is the usage of #ifdef's, which is generally avoided there. Also, the patch for man-pages, describing the new prctl calls, is missing.

Thanks, I appreciate the feedback. I'll fold this into the next spin of the series!

-- Chris Metcalf, Mellanox Technologies http://www.mellanox.com
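The argument hygiene Eugene asks for amounts to a couple of lines per handler: reject nonzero garbage in the arguments the option does not consume, so those slots remain usable for future extensions. A minimal sketch, assuming a hypothetical handler that only consumes arg2 (the name and semantics are illustrative, not the real series):

```c
#include <errno.h>

/* Hypothetical prctl-option handler: only arg2 carries meaning today,
 * so arg3..arg5 must be zero or the call is rejected with -EINVAL. */
static long set_task_isolation(unsigned long arg2, unsigned long arg3,
                               unsigned long arg4, unsigned long arg5)
{
    if (arg3 || arg4 || arg5)
        return -EINVAL;   /* unused arguments must be zero */

    /* ... act on arg2 (e.g. record the isolation flags) ... */
    (void)arg2;
    return 0;
}
```

Without the check, user space that passes garbage in arg3..arg5 today would break if a later kernel started interpreting those values.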
Re: [PATCH] tile: fix build failure
On 1/24/2017 11:39 AM, Sudip Mukherjee wrote: From: Sudip Mukherjee <sudipm.mukher...@gmail.com>

The build of tilegx allmodconfig was failing with errors like:

../arch/tile/include/asm/div64.h:5:15: error: unknown type name 'u64'
 static inline u64 mul_u32_u32(u32 a, u32 b)
               ^~~
../arch/tile/include/asm/div64.h:5:31: error: unknown type name 'u32'
 static inline u64 mul_u32_u32(u32 a, u32 b)
                               ^~~
../arch/tile/include/asm/div64.h:5:38: error: unknown type name 'u32'
 static inline u64 mul_u32_u32(u32 a, u32 b)
                                      ^~~
In file included from ../fs/ubifs/ubifs.h:26:0,
                 from ../fs/ubifs/shrinker.c:42:
../include/linux/math64.h: In function 'mul_u64_u32_shr':
../arch/tile/include/asm/div64.h:9:21: error: implicit declaration of function 'mul_u32_u32' [-Werror=implicit-function-declaration]

The simplest solution was to include the types header file.

Fixes: 9e3d6223d209 ("math64, timers: Fix 32bit mul_u64_u32_shr() and friends")
Cc: Peter Zijlstra <pet...@infradead.org>
Signed-off-by: Sudip Mukherjee <sudip.mukher...@codethink.co.uk>
---
build log is at: https://travis-ci.org/sudipm-mukherjee/parport/jobs/194717687

 arch/tile/include/asm/div64.h | 1 +
 1 file changed, 1 insertion(+)

Acked-by: Chris Metcalf <cmetc...@mellanox.com>

-- Chris Metcalf, Mellanox Technologies http://www.mellanox.com
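The failure mode is easy to reproduce outside the kernel: asm/div64.h defined mul_u32_u32() in terms of u64/u32 without itself pulling in the header that defines those types, so it only compiled when some earlier include happened to supply them. A sketch with stdint stand-ins for the kernel's types header:

```c
/* Stand-ins for what the kernel's types header provides; the fix in the
 * patch is precisely to add that include to asm/div64.h so the typedefs
 * below are guaranteed to exist before the helper is parsed. */
#include <stdint.h>
typedef uint64_t u64;
typedef uint32_t u32;

/* The helper that failed to compile when the types were missing: a
 * 32x32->64 multiply with no truncation of the result. */
static inline u64 mul_u32_u32(u32 a, u32 b)
{
    return (u64)a * b;
}
```

Headers should be self-sufficient: each one includes what it uses, rather than relying on include order at every call site.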
[GIT PULL] arch/tile bugfix for 4.10-rc6
Linus,

Please pull the following change from:

  git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git stable

This avoids an issue with short userspace reads for regset via ptrace.

Dave Martin (1):
      tile/ptrace: Preserve previous registers for short regset write

 arch/tile/kernel/ptrace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-- Chris Metcalf, Mellanox Technologies http://www.mellanox.com
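The bug class behind Dave Martin's fix is generic to regset "set" handlers: when userspace supplies fewer bytes than the full register set, the handler must overlay them on the task's current registers rather than on a zeroed buffer, or the unwritten tail gets clobbered. A hedged userspace model of that pattern (names illustrative, not the tile ptrace code):

```c
#include <string.h>

struct regs { unsigned long r[8]; };

/* Model of a regset "set" handler: seed the working copy with the task's
 * current registers first, so a short write preserves the tail. */
static void regset_set(struct regs *task_regs, const void *ubuf, size_t len)
{
    struct regs tmp = *task_regs;        /* preserve previous registers */

    if (len > sizeof(tmp))
        len = sizeof(tmp);
    memcpy(&tmp, ubuf, len);             /* overlay only what userspace sent */
    *task_regs = tmp;
}
```

Skipping the initial copy is the one-line class of mistake the pulled commit corrects.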
Re: Questions on the task isolation patches
On 12/20/2016 4:27 AM, Paolo Bonzini wrote: On 16/12/2016 22:00, Chris Metcalf wrote: Sorry, I think I wasn't clear. Normally when you are running task isolated and you enter the kernel, you will get a fatal signal. The exception is if you call prctl itself (or exit), the kernel tolerates it without a signal, since obviously that's how you need to cleanly tell the kernel you are done with task isolation. Running in a guest is pretty much the same as running in userspace. Would it be possible to exclude the KVM_RUN ioctl as well? QEMU would still have to run prctl when a CPU goes to sleep, and KVM_RUN would have to enable/disable isolated mode when a VM executes HLT (which should never happen anyway in NFV scenarios). I think that probably makes sense. The flow would be that qemu executes first the prctl() for task isolation, then the KVM_RUN ioctl. We obviously can't do it in the other order, so we'd need to make task isolation tolerate KVM_RUN. I won't try to do it for my next patch series (based on 4.10) though, since I'd like to get the basic support upstreamed before trying to extend it. -- Chris Metcalf, Mellanox Technologies http://www.mellanox.com
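One way to picture the proposal is as widening the small set of kernel entries that task isolation tolerates without raising the fatal signal. The allow-list below is purely a model of the discussion (today's set is prctl and exit; the KVM_RUN entry is the extension being proposed) — it is not real kernel code from the series:

```c
#include <string.h>

/* Model of the isolation policy under discussion: prctl() and exit are
 * tolerated today; the proposal adds the KVM_RUN ioctl to that set. */
static const char *tolerated[] = { "prctl", "exit", "ioctl(KVM_RUN)" };

/* Returns 1 if this kernel entry is tolerated under task isolation;
 * anything else would deliver the fatal signal described above. */
static int isolation_tolerates(const char *entry)
{
    for (size_t i = 0; i < sizeof(tolerated) / sizeof(tolerated[0]); i++)
        if (strcmp(entry, tolerated[i]) == 0)
            return 1;
    return 0;
}
```

With that widening, qemu's vCPU thread could call prctl() to enter isolation and then issue KVM_RUN without immediately killing itself.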
Re: Questions on the task isolation patches
Sorry for the slow response - I have been busy with some other things. On 12/6/2016 4:43 PM, yunhong jiang wrote: On Fri, 2 Dec 2016 13:58:08 -0500 Chris Metcalf <cmetc...@mellanox.com> wrote: On 12/1/2016 5:28 PM, yunhong jiang wrote: a) If the task isolation need prctl to mark itself as isolated, possibly the vCPU thread can't achieve it. First, the vCPU thread may need system service during OS booting time, also it's the application, instead of the vCPU thread to decide if the vCPU thread should be isolated. So possibly we need a mechanism so that another process can set the vCPU thread's task isolation? These are good questions. I think that we would probably want to add a KVM mode that did the prctl() before transitioning back to the Would prctl() when back to guest be too heavy? It's a good question; it can be heavy. But the design for task isolation is that the task isolated process is always running in userspace anyway. If you are transitioning in and out of the guest or host kernels frequently, you probably should not be using task isolation, but just regular NOHZ_FULL. guest. But then, in the same way that we currently allow another prctl() from a task-isolated userspace process, we'd probably need to You mean currently in your patch we already can do the prctl from 3rd party process to task-isolate a userspace process? Sorry that I didn't notice that part. Sorry, I think I wasn't clear. Normally when you are running task isolated and you enter the kernel, you will get a fatal signal. The exception is if you call prctl itself (or exit), the kernel tolerates it without a signal, since obviously that's how you need to cleanly tell the kernel you are done with task isolation. My point in the previous email was that we might need to similarly tolerate a guest exit without causing a fatal signal to the userspace process. 
But as I think about it, that's probably not true; we probably would want to notify the guest kernel of the task isolation violation and have it kill the userspace process just as if it had entered the guest kernel. Perhaps the way to drive this is to have task isolation be triggered from the guest's prctl up to the host, so there's some kind of KVM exit to the host that indicates that the guest has a userspace process that wants to run task isolated, at which point qemu invokes task isolation on behalf of the guest then returns to the guest to set up its own virtualized task isolation. It does get confusing! -- Chris Metcalf, Mellanox Technologies http://www.mellanox.com
[GIT PULL] arch/tile changes for 4.10
Linus,

Please pull the following changes for 4.10 from:

  git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git master

Another grab-bag of miscellaneous changes.

Chris Metcalf (2):
      tile: remove #pragma unroll from finv_buffer_remote()
      tile: use __ro_after_init instead of tile-specific __write_once

Colin Ian King (1):
      tile/pci_gx: fix spelling mistake: "delievered" -> "delivered"

Markus Elfring (2):
      tile-module: Use kmalloc_array() in module_alloc()
      tile-module: Rename jump labels in module_alloc()

Paul Gortmaker (1):
      tile: migrate exception table users off module.h and onto extable.h

 arch/tile/include/asm/cache.h    |  7 ++-
 arch/tile/include/asm/sections.h |  3 ---
 arch/tile/kernel/module.c        | 11 +--
 arch/tile/kernel/pci.c           |  2 +-
 arch/tile/kernel/pci_gx.c        |  2 +-
 arch/tile/kernel/setup.c         | 18 +-
 arch/tile/kernel/smp.c           |  2 +-
 arch/tile/kernel/time.c          |  4 ++--
 arch/tile/kernel/unaligned.c     |  2 +-
 arch/tile/lib/cacheflush.c       |  8 +---
 arch/tile/mm/extable.c           |  2 +-
 arch/tile/mm/fault.c             |  2 +-
 arch/tile/mm/homecache.c         |  2 +-
 arch/tile/mm/init.c              | 10 +-
 14 files changed, 31 insertions(+), 44 deletions(-)

-- Chris Metcalf, Mellanox Technologies http://www.mellanox.com
Re: [patch 5/6] [RFD] timekeeping: Provide optional 128bit math
On 12/9/2016 5:18 AM, Peter Zijlstra wrote: On Fri, Dec 09, 2016 at 07:38:47AM +0100, Peter Zijlstra wrote: Turns out using GCC-6.2.1 we have the same problem on i386, GCC doesn't recognise the 32x32 mults and generates crap. This used to work :/ I tried:

  gcc-4.4: good
  gcc-4.6, gcc-4.8, gcc-5.4, gcc-6.2: bad

I also found 4.4 was good on tilegx at recognizing the 32x32, and bad on the later versions I tested; I don't recall which specific later versions I tried, though.

-- Chris Metcalf, Mellanox Technologies http://www.mellanox.com
Re: [patch 5/6] [RFD] timekeeping: Provide optional 128bit math
On 12/9/2016 3:30 AM, Peter Zijlstra wrote: On Fri, Dec 09, 2016 at 07:38:47AM +0100, Peter Zijlstra wrote: On Fri, Dec 09, 2016 at 06:26:38AM +0100, Peter Zijlstra wrote: Just for giggles, on tilegx the branch is actually slower than doing the mult unconditionally. The problem is that the two multiplies would otherwise completely pipeline, whereas with the conditional you serialize them. On my Haswell laptop the unconditional version is faster too. Only when using x86_64 instructions, once I fixed the i386 variant it was slower, probably due to register pressure and the like. (came to light while talking about why the mul_u64_u32_shr() fallback didn't work right for them, which was a combination of the above issue and the fact that their compiler 'lost' the fact that these are 32x32->64 mults and did 64x64 ones instead). Turns out using GCC-6.2.1 we have the same problem on i386, GCC doesn't recognise the 32x32 mults and generates crap. This used to work :/ Do we want something like so?

---
 arch/tile/include/asm/Kbuild  |  1 -
 arch/tile/include/asm/div64.h | 14 ++
 arch/x86/include/asm/div64.h  | 10 ++
 include/linux/math64.h        | 26 ++
 4 files changed, 42 insertions(+), 9 deletions(-)

Untested, but I looked at it closely, and it seems like a decent idea.

Acked-by: Chris Metcalf <cmetc...@mellanox.com> [for tile]

Of course if this is pushed up, it will then probably be too tempting for me not to add the tilegx-specific mul_u64_u32_shr() to take advantage of pipelining the two 32x32->64 multiplies :-)

-- Chris Metcalf, Mellanox Technologies http://www.mellanox.com
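The pipelining point is visible in the generic fallback for mul_u64_u32_shr(), which splits the 64-bit operand and issues two 32x32->64 multiplies. The sketch below mirrors the shape of the include/linux/math64.h fallback, written with the high-half multiply done unconditionally — the variant Chris notes pipelines well on tilegx — and assumes shift is in 1..31:

```c
#include <stdint.h>
typedef uint64_t u64;
typedef uint32_t u32;

/* 32x32->64 multiply with no truncation; this is the form the compiler
 * needs to recognise to emit a single widening multiply instruction. */
static inline u64 mul_u32_u32(u32 a, u32 b)
{
    return (u64)a * b;
}

/* (a * mul) >> shift, built from two independent 32x32->64 multiplies.
 * shift is assumed to be in 1..31, as in the kernel fallback. */
static inline u64 mul_u64_u32_shr(u64 a, u32 mul, unsigned int shift)
{
    u32 ah = a >> 32, al = (u32)a;

    /* the two multiplies below have no data dependence and can pipeline */
    return (mul_u32_u32(al, mul) >> shift) +
           (mul_u32_u32(ah, mul) << (32 - shift));
}
```

The branch Peter measured is a conditional `if (ah)` guard around the second multiply; on tilegx, doing both multiplies unconditionally wins because they overlap in the pipeline.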