Re: [PATCH 06/16] arch: remove tile port

2018-03-15 Thread Chris Metcalf

On 3/14/2018 10:36 AM, Arnd Bergmann wrote:

The Tile architecture port was added by Chris Metcalf in 2010, and
maintained until early 2018 when he orphaned it due to his departure
from Mellanox, and nobody else stepped up to maintain it. The product
line is still around in the form of the BlueField SoC, but no longer
uses the Tile architecture.

There are also still products for sale with Tile-GX SoCs, notably the
Mikrotik CCR router family. The products all use old (linux-3.3) kernels
with lots of patches and won't be upgraded by their manufacturers. There
have been efforts to port both OpenWRT and Debian to these, but both
projects have stalled and are very unlikely to be continued in the future.

Given that we are reasonably sure that nobody is still using the port
with an upstream kernel any more, it seems better to remove it now while
the port is in a good shape than to let it bitrot for a few years first.


Arnd, thanks for dealing with this.

There are a number of tile-specific driver files that are mostly called out
in the MAINTAINERS file.  I would expect you to delete those as well.

-F:    drivers/char/tile-srom.c
-F:    drivers/edac/tile_edac.c
-F:    drivers/net/ethernet/tile/
-F:    drivers/rtc/rtc-tile.c
-F:    drivers/tty/hvc/hvc_tile.c
-F:    drivers/tty/serial/tilegx.c
-F:    drivers/usb/host/*-tilegx.c
-F:    include/linux/usb/tilegx.h

Chris


[GIT PULL] arch/tile "bugfix" for 4.15-rc3

2017-12-05 Thread Chris Metcalf

Linus,

This is not exactly a bugfix, but this is my last week at Mellanox and
I am stepping down as arch/tile maintainer, as described in a bit more
detail in my email here:

https://lkml.kernel.org/r/1512402760-12694-1-git-send-email-cmetc...@mellanox.com

So, please pull this one last tile commit to remove me as maintainer,
and to tag the tile architecture as orphaned:

git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git HEAD

It's been a pleasure working with you and the rest of the Linux
community since 2010 and I hope to continue to do more in the
years to come.

Chris Metcalf (1):
  arch/tile: mark as orphaned

 MAINTAINERS | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: linux-next: remove the tile tree?

2017-12-04 Thread Chris Metcalf

On 12/4/2017 3:25 PM, Stephen Rothwell wrote:

Hi Chris,

Given commit

   8ee5ad1d4c0b ("arch/tile: mark as orphaned")

in Linus' tree, should I remove the tile tree from linux-next?


Yes, that would make sense.  Good catch!

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



[PATCH] arch/tile: mark as orphaned

2017-12-04 Thread Chris Metcalf
The chip family of TILEPro and TILE-Gx was developed by Tilera,
which was eventually acquired by Mellanox.  The tile architecture
was added to the kernel in 2010 and first appeared in 2.6.36.
Now at Mellanox we are developing new chips based on the ARM64
architecture; our last TILE-Gx chip (the Gx72) was released in 2013,
and our customers using tile architecture products are not, as far
as we know, looking to upgrade to newer kernel releases.  In the
absence of someone in the community stepping up to take over
maintainership, this commit marks the architecture as orphaned.

Cc: Chris Metcalf <metc...@alum.mit.edu>
Signed-off-by: Chris Metcalf <cmetc...@mellanox.com>
---

This is my last week at Mellanox, and in the absence of customer
engagements, it doesn't seem to make sense to transition the tile
architecture maintainer role over to some other Mellanox employee.
It would be great if someone in the community were interested in
taking over!

I'm also open to a community consensus suggesting that I just
"git rm" the tile-related code instead of tagging it as orphaned,
but my sense is that that's something the community can address
later if no one steps up over a period of several releases to take
over ownership.

Note the Cc: tag on this commit; further kernel work (in particular
the task-isolation patch series, which sprang out of some early
Tilera work) will continue to come from that email address.

 MAINTAINERS | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 2d3d750b19c0..67cf1db6cde4 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13458,10 +13458,8 @@ F: drivers/net/wireless/ti/
 F: include/linux/wl12xx.h
 
 TILE ARCHITECTURE
-M: Chris Metcalf <cmetc...@mellanox.com>
 W: http://www.mellanox.com/repository/solutions/tile-scm/
-T: git git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git
-S: Supported
+S: Orphan
 F: arch/tile/
 F: drivers/char/tile-srom.c
 F: drivers/edac/tile_edac.c
-- 
2.1.2



Re: [PATCH v16 00/13] support "task_isolation" mode

2017-11-07 Thread Chris Metcalf

On 11/7/2017 12:10 PM, Christopher Lameter wrote:

On Mon, 6 Nov 2017, Chris Metcalf wrote:


On 11/6/2017 10:38 AM, Christopher Lameter wrote:

What about that d*mn 1 Hz clock?

It's still there, so this code still requires some further work before
it can actually get a process into long-term task isolation (without
the obvious one-line kernel hack).  Frederic suggested a while ago
forcing updates on cpustats was required as the last gating factor; do
we think that is still true?  Christoph was working on this at one
point - any progress from your point of view?

Well if you still have the 1 HZ clock then you can simply defer the numa
remote page cleanup of the page allocator to the time you execute
that tick.

We have to get rid of the 1 Hz tick, so we don't want to tie anything
else to it...

Yes we want to get rid of the 1 HZ tick but the work on that could also
include dealing with the remote page cleanup issue that we have deferred.

Presumably we have another context where we may be able to call into
the cleanup code with interrupts enabled.


Right now for task isolation we run with interrupts enabled during the
initial sys_prctl() call, and call quiet_vmstat_sync() there, which currently
calls refresh_cpu_vm_stats(false).  In fact we could certainly pass "true"
there instead (and probably should) since we can handle dealing with
the pagesets at this time.  As we return to userspace we will test that
nothing surprising happened with vmstat; if so we jam an EAGAIN into
the syscall result value, but if not, we will be in userspace and won't need
to touch the vmstat counters until we next go back into the kernel.

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [PATCH v16 10/13] arch/arm: enable task isolation functionality

2017-11-06 Thread Chris Metcalf

On 11/3/2017 1:23 PM, Russell King - ARM Linux wrote:

Since we're potentially about to start
the merge window for 4.15 this weekend, the timing of this doesn't
work well either.


With the start of the merge window now delayed for a week, I'm sure
everyone can distract themselves and help make the last week of -rc8
pass more quickly by digging into this patch series!  :-)

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [PATCH v16 09/13] arch/arm64: enable task isolation functionality

2017-11-03 Thread Chris Metcalf

On 11/3/2017 1:32 PM, Mark Rutland wrote:

Hi Chris,

On Fri, Nov 03, 2017 at 01:04:48PM -0400, Chris Metcalf wrote:

In do_notify_resume(), call task_isolation_start() for
TIF_TASK_ISOLATION tasks.  Add _TIF_TASK_ISOLATION to _TIF_WORK_MASK,
and define a local NOTIFY_RESUME_LOOP_FLAGS to check in the loop,
since we don't clear _TIF_TASK_ISOLATION in the loop.

We tweak syscall_trace_enter() slightly to carry the "flags"
value from current_thread_info()->flags for each of the tests,
rather than doing a volatile read from memory for each one.  This
avoids a small overhead for each test, and in particular avoids
that overhead for TIF_NOHZ when TASK_ISOLATION is not enabled.

We instrument the smp_send_reschedule() routine so that it checks for
isolated tasks and generates a suitable warning if needed.

Finally, report on page faults in task-isolation processes in
do_page_faults().

I don't have much context for this (I only received patches 9, 10, and
12), and this commit message doesn't help me to understand why these
changes are necessary.


Sorry, I missed having you on the cover letter.  I'll fix that for the 
next spin.

The cover letter (and rest of the series) is here:

https://lkml.org/lkml/2017/11/3/589

The core piece of the patch is here:

https://lkml.org/lkml/2017/11/3/598


Here we add to _TIF_WORK_MASK...
[...]
... and here we open-code the *old* _TIF_WORK_MASK.

Can we drop both in <asm/thread_info.h>, building one in terms of the
other:

#define _TIF_WORK_NOISOLATION_MASK  \
(_TIF_NEED_RESCHED | _TIF_SIGPENDING |  _TIF_NOTIFY_RESUME |\
 _TIF_FOREIGN_FPSTATE | _TIF_UPROBE | _TIF_FSCHECK)

#define _TIF_WORK_MASK  \
(_TIF_WORK_NOISOLATION_MASK | _TIF_TASK_ISOLATION)

... that avoids duplication, ensuring the two are kept in sync, and
makes it a little easier to understand.


We certainly could do that.  I based my approach on the x86 model,
which defines _TIF_ALLWORK_MASK in thread_info.h, and then a local
EXIT_TO_USERMODE_WORK_FLAGS above exit_to_usermode_loop().

If you'd prefer to avoid the duplication, perhaps names more like this?

_TIF_WORK_LOOP_MASK (without TIF_TASK_ISOLATION)
_TIF_WORK_MASK as _TIF_WORK_LOOP_MASK | _TIF_TASK_ISOLATION

That keeps the names reflective of the function (entry only vs loop).


@@ -818,6 +819,7 @@ void arch_send_call_function_single_ipi(int cpu)
  #ifdef CONFIG_ARM64_ACPI_PARKING_PROTOCOL
  void arch_send_wakeup_ipi_mask(const struct cpumask *mask)
  {
+   task_isolation_remote_cpumask(mask, "wakeup IPI");

What exactly does this do? Is it some kind of a tracepoint?


It is intended to generate a diagnostic for a remote task that is
trying to run isolated from the kernel (NOHZ_FULL on steroids, more
or less), if the kernel is about to interrupt it.

Similarly, the task_isolation_interrupt() hooks are diagnostics for
the current task.  The intent is that by hooking a little deeper in
the call path, you get actionable diagnostics for processes that are
about to be signalled because they have lost task isolation for some
reason.


@@ -495,6 +496,10 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
 	 */
 	if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP |
 			      VM_FAULT_BADACCESS)))) {
+   /* No signal was generated, but notify task-isolation tasks. */
+   if (user_mode(regs))
+   task_isolation_interrupt("page fault at %#lx", addr);

What exactly does the task receive here? Are these strings ABI?

Do we need to do this for *every* exception?


The strings are diagnostic messages; the process itself just gets
a SIGKILL (or user-defined signal if requested).  To provide better
diagnosis we emit a log message that can be examined to see
what exactly caused the signal to be generated.

Thanks!

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



[PATCH] arch/tile: Implement ->set_state_oneshot_stopped()

2017-11-03 Thread Chris Metcalf
set_state_oneshot_stopped() is called by the clkevt core, when the
next event is required at an expiry time of 'KTIME_MAX'. This normally
happens with NO_HZ_{IDLE|FULL} in both LOWRES/HIGHRES modes.

This patch makes the clockevent device stop on such an event, to
avoid spurious interrupts, as explained by commit 8fff52fd5093
("clockevents: Introduce CLOCK_EVT_STATE_ONESHOT_STOPPED state").

Signed-off-by: Chris Metcalf <cmetc...@mellanox.com>
---
 arch/tile/kernel/time.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/tile/kernel/time.c b/arch/tile/kernel/time.c
index 6643ffbc0615..f95d65f3162b 100644
--- a/arch/tile/kernel/time.c
+++ b/arch/tile/kernel/time.c
@@ -162,6 +162,7 @@ static DEFINE_PER_CPU(struct clock_event_device, tile_timer) = {
.set_next_event = tile_timer_set_next_event,
.set_state_shutdown = tile_timer_shutdown,
.set_state_oneshot = tile_timer_shutdown,
+   .set_state_oneshot_stopped = tile_timer_shutdown,
.tick_resume = tile_timer_shutdown,
 };
 
-- 
2.1.2



Re: [PATCH v16 12/13] arm, tile: turn off timer tick for oneshot_stopped state

2017-11-03 Thread Chris Metcalf

On 11/3/2017 1:18 PM, Mark Rutland wrote:

Hi Chris,

On Fri, Nov 03, 2017 at 01:04:51PM -0400, Chris Metcalf wrote:

diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
index fd4b7f684bd0..61ea7f907c56 100644
--- a/drivers/clocksource/arm_arch_timer.c
+++ b/drivers/clocksource/arm_arch_timer.c
@@ -722,6 +722,8 @@ static void __arch_timer_setup(unsigned type,
}
}
  
+	clk->set_state_oneshot_stopped = clk->set_state_shutdown;

AFAICT, we've set up this callback since commit:

   cf8c5009ee37d25c ("clockevents/drivers/arm_arch_timer: Implement ->set_state_oneshot_stopped()")

... so I don't believe this is necessary, and I think this change can be
dropped.


Thanks, I will drop it.  I missed the semantic merge conflict there.

I extracted the arch/tile specific part of the change and just pushed it
through the tile tree.

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



[GIT PULL] arch/tile bugfixes for 4.14-rcN

2017-11-03 Thread Chris Metcalf

Linus,

Please pull the following two commits for 4.14 from:

  git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git master

These are both one-line bug fixes.

Chris Metcalf (1):
  arch/tile: Implement ->set_state_oneshot_stopped()

Luc Van Oostenryck (1):
  tile: pass machine size to sparse

 arch/tile/Makefile  | 2 ++
 arch/tile/kernel/time.c | 1 +
 2 files changed, 3 insertions(+)

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [PATCH v16 10/13] arch/arm: enable task isolation functionality

2017-11-03 Thread Chris Metcalf

On 11/3/2017 1:23 PM, Russell King - ARM Linux wrote:

On Fri, Nov 03, 2017 at 01:04:49PM -0400, Chris Metcalf wrote:

From: Francis Giraldeau <francis.girald...@gmail.com>

This patch is a port of the task isolation functionality to the arm 32-bit
architecture. The task isolation needs an additional thread flag that
requires to change the entry assembly code to accept a bitfield larger than
one byte.  The constants _TIF_SYSCALL_WORK and _TIF_WORK_MASK are now
defined in the literal pool. The rest of the patch is straightforward and
reflects what is done on other architectures.

To avoid problems with the tst instruction in the v7m build, we renumber
TIF_SECCOMP to bit 8 and let TIF_TASK_ISOLATION use bit 7.

After a bit of digging (which could've been saved if our patch format
contained information about what kernel version this patch was
generated against) it turns out that this patch will not apply since
commit 73ac5d6a2b6ac ("arm/syscalls: Check address limit on user-mode
return") has been applied, which means the TIF numbers have changed
as well as the assembly code that your patch touches.

My guess is that this patch was generated from a 4.13 kernel, so
misses the 4.14-rc1 changes.  Since we're potentially about to start
the merge window for 4.15 this weekend, the timing of this doesn't
work well either.


What patch failure did you see?  The patch is based against 4.14-rc4, so while
it's a few weeks out of date, it does include the commit you reference.


Once 4.15-rc1 has been published, please rebase against that version
and resend.


Sure.  I was hoping to eke out a little bit of attention from kernel developers
before the merge window actually opens :)

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



[PATCH v16 07/13] Add task isolation hooks to arch-independent code

2017-11-03 Thread Chris Metcalf
This commit adds task isolation hooks as follows:

- __handle_domain_irq() generates an isolation warning for the
  local task

- irq_work_queue_on() generates an isolation warning for the remote
  task being interrupted for irq_work

- generic_exec_single() generates a remote isolation warning for
  the remote cpu being IPI'd

- smp_call_function_many() generates a remote isolation warning for
  the set of remote cpus being IPI'd

Calls to task_isolation_remote() or task_isolation_interrupt() can
be placed in the platform-independent code like this when doing so
results in fewer lines of code changes, as for example is true of
the users of the arch_send_call_function_*() APIs.  Or, they can
be placed in the per-architecture code when there are many callers,
as for example is true of the smp_send_reschedule() call.

A further cleanup might be to create an intermediate layer, so that
for example smp_send_reschedule() is a single generic function that
just calls arch_smp_send_reschedule(), allowing generic code to be
called every time smp_send_reschedule() is invoked.  But for now,
we just update either callers or callees as makes most sense.

Signed-off-by: Chris Metcalf <cmetc...@mellanox.com>
---
 kernel/irq/irqdesc.c | 5 +++++
 kernel/irq_work.c    | 5 ++++-
 kernel/smp.c         | 6 +++++-
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 82afb7ed369f..1b114c6b7ab8 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "internals.h"
 
@@ -633,6 +634,10 @@ int __handle_domain_irq(struct irq_domain *domain, unsigned int hwirq,
irq = irq_find_mapping(domain, hwirq);
 #endif
 
+   task_isolation_interrupt((irq == hwirq) ?
+"irq %d (%s)" : "irq %d (%s hwirq %d)",
+irq, domain ? domain->name : "", hwirq);
+
/*
 * Some hardware gives randomly wrong interrupts.  Rather
 * than crashing, do something sensible.
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index bcf107ce0854..cde49f1f31f7 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 
@@ -75,8 +76,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
if (!irq_work_claim(work))
return false;
 
-   if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+   if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+   task_isolation_remote(cpu, "irq_work");
arch_send_call_function_single_ipi(cpu);
+   }
 
return true;
 }
diff --git a/kernel/smp.c b/kernel/smp.c
index c94dd85c8d41..44252aa650ac 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "smpboot.h"
 
@@ -175,8 +176,10 @@ static int generic_exec_single(int cpu, call_single_data_t *csd,
 * locking and barrier primitives. Generic code isn't really
 * equipped to do the right thing...
 */
-   if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
+   if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu))) {
+   task_isolation_remote(cpu, "IPI function");
arch_send_call_function_single_ipi(cpu);
+   }
 
return 0;
 }
@@ -458,6 +461,7 @@ void smp_call_function_many(const struct cpumask *mask,
}
 
/* Send a message to all CPUs in the map */
+   task_isolation_remote_cpumask(cfd->cpumask_ipi, "IPI function");
arch_send_call_function_ipi_mask(cfd->cpumask_ipi);
 
if (wait) {
-- 
2.1.2



[PATCH v16 10/13] arch/arm: enable task isolation functionality

2017-11-03 Thread Chris Metcalf
From: Francis Giraldeau <francis.girald...@gmail.com>

This patch is a port of the task isolation functionality to the arm 32-bit
architecture. The task isolation needs an additional thread flag that
requires to change the entry assembly code to accept a bitfield larger than
one byte.  The constants _TIF_SYSCALL_WORK and _TIF_WORK_MASK are now
defined in the literal pool. The rest of the patch is straightforward and
reflects what is done on other architectures.

To avoid problems with the tst instruction in the v7m build, we renumber
TIF_SECCOMP to bit 8 and let TIF_TASK_ISOLATION use bit 7.

Signed-off-by: Francis Giraldeau <francis.girald...@gmail.com>
Signed-off-by: Chris Metcalf <cmetc...@mellanox.com> [with modifications]
---
 arch/arm/Kconfig   |  1 +
 arch/arm/include/asm/thread_info.h | 10 +++---
 arch/arm/kernel/entry-common.S | 12 
 arch/arm/kernel/ptrace.c   | 10 ++
 arch/arm/kernel/signal.c   | 10 +-
 arch/arm/kernel/smp.c  |  4 
 arch/arm/mm/fault.c|  8 +++-
 7 files changed, 46 insertions(+), 9 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 7888c9803eb0..3423c655a32b 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -48,6 +48,7 @@ config ARM
select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU
select HAVE_ARCH_MMAP_RND_BITS if MMU
select HAVE_ARCH_SECCOMP_FILTER if (AEABI && !OABI_COMPAT)
+   select HAVE_ARCH_TASK_ISOLATION
select HAVE_ARCH_TRACEHOOK
select HAVE_ARM_SMCCC if CPU_V7
select HAVE_EBPF_JIT if !CPU_ENDIAN_BE32
diff --git a/arch/arm/include/asm/thread_info.h b/arch/arm/include/asm/thread_info.h
index 776757d1604a..a7b76ac9543d 100644
--- a/arch/arm/include/asm/thread_info.h
+++ b/arch/arm/include/asm/thread_info.h
@@ -142,7 +142,8 @@ extern int vfp_restore_user_hwstate(struct user_vfp __user *,
 #define TIF_SYSCALL_TRACE  4   /* syscall trace active */
 #define TIF_SYSCALL_AUDIT  5   /* syscall auditing active */
 #define TIF_SYSCALL_TRACEPOINT 6   /* syscall tracepoint instrumentation */
-#define TIF_SECCOMP7   /* seccomp syscall filtering active */
+#define TIF_TASK_ISOLATION 7   /* task isolation active */
+#define TIF_SECCOMP8   /* seccomp syscall filtering active */
 
 #define TIF_NOHZ   12  /* in adaptive nohz mode */
 #define TIF_USING_IWMMXT   17
@@ -156,18 +157,21 @@ extern int vfp_restore_user_hwstate(struct user_vfp __user *,
 #define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE)
 #define _TIF_SYSCALL_AUDIT (1 << TIF_SYSCALL_AUDIT)
 #define _TIF_SYSCALL_TRACEPOINT(1 << TIF_SYSCALL_TRACEPOINT)
+#define _TIF_TASK_ISOLATION(1 << TIF_TASK_ISOLATION)
 #define _TIF_SECCOMP   (1 << TIF_SECCOMP)
 #define _TIF_USING_IWMMXT  (1 << TIF_USING_IWMMXT)
 
 /* Checks for any syscall work in entry-common.S */
 #define _TIF_SYSCALL_WORK (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
-  _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP)
+  _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
+  _TIF_TASK_ISOLATION)
 
 /*
  * Change these and you break ASM code in entry-common.S
  */
 #define _TIF_WORK_MASK (_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
-_TIF_NOTIFY_RESUME | _TIF_UPROBE)
+_TIF_NOTIFY_RESUME | _TIF_UPROBE | \
+_TIF_TASK_ISOLATION)
 
 #endif /* __KERNEL__ */
 #endif /* __ASM_ARM_THREAD_INFO_H */
diff --git a/arch/arm/kernel/entry-common.S b/arch/arm/kernel/entry-common.S
index 99c908226065..9ae3ef2dbc1e 100644
--- a/arch/arm/kernel/entry-common.S
+++ b/arch/arm/kernel/entry-common.S
@@ -53,7 +53,8 @@ ret_fast_syscall:
cmp r2, #TASK_SIZE
blneaddr_limit_check_failed
ldr r1, [tsk, #TI_FLAGS]@ re-check for syscall tracing
-   tst r1, #_TIF_SYSCALL_WORK | _TIF_WORK_MASK
+   ldr r2, =_TIF_SYSCALL_WORK | _TIF_WORK_MASK
+   tst r1, r2
bne fast_work_pending
 
 
@@ -83,7 +84,8 @@ ret_fast_syscall:
cmp r2, #TASK_SIZE
blneaddr_limit_check_failed
ldr r1, [tsk, #TI_FLAGS]@ re-check for syscall tracing
-   tst r1, #_TIF_SYSCALL_WORK | _TIF_WORK_MASK
+   ldr r2, =_TIF_SYSCALL_WORK | _TIF_WORK_MASK
+   tst r1, r2
beq no_work_pending
  UNWIND(.fnend )
 ENDPROC(ret_fast_syscall)
@@ -91,7 +93,8 @@ ENDPROC(ret_fast_syscall)
/* Slower path - fall through to work_pending */
 #endif
 
-   tst r1, #_TIF_SYSCALL_WORK
+   ldr r2, =_TIF_SYSCALL_WORK
+   tst r1, r2
bne __sys_trace_return_nosave
 slow_work_pending:
mov r0, sp  @ 'regs'
@@ -238,7 +241,8 @@ local_restart:
ldr r10, [tsk, #TI_FLAGS]   @ check for syscall tracing
  

[PATCH v16 09/13] arch/arm64: enable task isolation functionality

2017-11-03 Thread Chris Metcalf
In do_notify_resume(), call task_isolation_start() for
TIF_TASK_ISOLATION tasks.  Add _TIF_TASK_ISOLATION to _TIF_WORK_MASK,
and define a local NOTIFY_RESUME_LOOP_FLAGS to check in the loop,
since we don't clear _TIF_TASK_ISOLATION in the loop.

We tweak syscall_trace_enter() slightly to carry the "flags"
value from current_thread_info()->flags for each of the tests,
rather than doing a volatile read from memory for each one.  This
avoids a small overhead for each test, and in particular avoids
that overhead for TIF_NOHZ when TASK_ISOLATION is not enabled.

We instrument the smp_send_reschedule() routine so that it checks for
isolated tasks and generates a suitable warning if needed.

Finally, report on page faults in task-isolation processes in
do_page_fault().

Signed-off-by: Chris Metcalf <cmetc...@mellanox.com>
---
 arch/arm64/Kconfig   |  1 +
 arch/arm64/include/asm/thread_info.h |  5 -
 arch/arm64/kernel/ptrace.c   | 18 +++---
 arch/arm64/kernel/signal.c   | 10 +-
 arch/arm64/kernel/smp.c  |  7 +++
 arch/arm64/mm/fault.c|  5 +
 6 files changed, 41 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 0df64a6a56d4..d77ecdb29765 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -73,6 +73,7 @@ config ARM64
select HAVE_ARCH_MMAP_RND_BITS
select HAVE_ARCH_MMAP_RND_COMPAT_BITS if COMPAT
select HAVE_ARCH_SECCOMP_FILTER
+   select HAVE_ARCH_TASK_ISOLATION
select HAVE_ARCH_TRACEHOOK
select HAVE_ARCH_TRANSPARENT_HUGEPAGE
select HAVE_ARCH_VMAP_STACK
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index ddded6497a8a..9c749eca7384 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -82,6 +82,7 @@ void arch_setup_new_exec(void);
 #define TIF_FOREIGN_FPSTATE3   /* CPU's FP state is not current's */
 #define TIF_UPROBE 4   /* uprobe breakpoint or singlestep */
 #define TIF_FSCHECK5   /* Check FS is USER_DS on return */
+#define TIF_TASK_ISOLATION 6
 #define TIF_NOHZ   7
 #define TIF_SYSCALL_TRACE  8
 #define TIF_SYSCALL_AUDIT  9
@@ -97,6 +98,7 @@ void arch_setup_new_exec(void);
 #define _TIF_NEED_RESCHED  (1 << TIF_NEED_RESCHED)
 #define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
 #define _TIF_FOREIGN_FPSTATE   (1 << TIF_FOREIGN_FPSTATE)
+#define _TIF_TASK_ISOLATION(1 << TIF_TASK_ISOLATION)
 #define _TIF_NOHZ  (1 << TIF_NOHZ)
 #define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE)
 #define _TIF_SYSCALL_AUDIT (1 << TIF_SYSCALL_AUDIT)
@@ -108,7 +110,8 @@ void arch_setup_new_exec(void);
 
 #define _TIF_WORK_MASK (_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
 _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
-_TIF_UPROBE | _TIF_FSCHECK)
+_TIF_UPROBE | _TIF_FSCHECK | \
+_TIF_TASK_ISOLATION)
 
 #define _TIF_SYSCALL_WORK  (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
 _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index 9cbb6123208f..e5c0d7cdaf4e 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -38,6 +38,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1371,14 +1372,25 @@ static void tracehook_report_syscall(struct pt_regs *regs,
 
 asmlinkage int syscall_trace_enter(struct pt_regs *regs)
 {
-   if (test_thread_flag(TIF_SYSCALL_TRACE))
+   unsigned long work = READ_ONCE(current_thread_info()->flags);
+
+   if (work & _TIF_SYSCALL_TRACE)
tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER);
 
-   /* Do the secure computing after ptrace; failures should be fast. */
+   /*
+* In task isolation mode, we may prevent the syscall from
+* running, and if so we also deliver a signal to the process.
+*/
+   if (work & _TIF_TASK_ISOLATION) {
+   if (task_isolation_syscall(regs->syscallno) == -1)
+   return -1;
+   }
+
+   /* Do the secure computing check early; failures should be fast. */
if (secure_computing(NULL) == -1)
return -1;
 
-   if (test_thread_flag(TIF_SYSCALL_TRACEPOINT))
+   if (work & _TIF_SYSCALL_TRACEPOINT)
trace_sys_enter(regs, regs->syscallno);
 
audit_syscall_entry(regs->syscallno, regs->orig_x0, regs->regs[1],
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 0bdc96c61bc0..d8f4904e992f 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 

[PATCH v16 08/13] arch/x86: enable task isolation functionality

2017-11-03 Thread Chris Metcalf
In prepare_exit_to_usermode(), call task_isolation_start() for
TIF_TASK_ISOLATION tasks.

In syscall_trace_enter_phase1(), add the necessary support for
reporting syscalls for task-isolation processes.

Add task_isolation_remote() calls for the kernel exception types
that do not result in signals, namely non-signalling page faults
and non-signalling MPX fixups.

Signed-off-by: Chris Metcalf <cmetc...@mellanox.com>
---
 arch/x86/Kconfig   |  1 +
 arch/x86/entry/common.c| 14 ++
 arch/x86/include/asm/apic.h|  3 +++
 arch/x86/include/asm/thread_info.h |  8 +---
 arch/x86/kernel/smp.c  |  2 ++
 arch/x86/kernel/traps.c|  3 +++
 arch/x86/mm/fault.c|  5 +
 7 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 971feac13506..45967840b81a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -114,6 +114,7 @@ config X86
select HAVE_ARCH_MMAP_RND_COMPAT_BITS   if MMU && COMPAT
select HAVE_ARCH_COMPAT_MMAP_BASES  if MMU && COMPAT
select HAVE_ARCH_SECCOMP_FILTER
+   select HAVE_ARCH_TASK_ISOLATION
select HAVE_ARCH_TRACEHOOK
select HAVE_ARCH_TRANSPARENT_HUGEPAGE
select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD if X86_64
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 03505ffbe1b6..2c70b915d1f2 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -87,6 +88,16 @@ static long syscall_trace_enter(struct pt_regs *regs)
if (emulated)
return -1L;
 
+   /*
+* In task isolation mode, we may prevent the syscall from
+* running, and if so we also deliver a signal to the process.
+*/
+   if (work & _TIF_TASK_ISOLATION) {
+   if (task_isolation_syscall(regs->orig_ax) == -1)
+   return -1L;
+   work &= ~_TIF_TASK_ISOLATION;
+   }
+
 #ifdef CONFIG_SECCOMP
/*
 * Do seccomp after ptrace, to catch any tracer changes.
@@ -196,6 +207,9 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
exit_to_usermode_loop(regs, cached_flags);
 
+   if (cached_flags & _TIF_TASK_ISOLATION)
+   task_isolation_start();
+
 #ifdef CONFIG_COMPAT
/*
 * Compat syscalls set TS_COMPAT.  Make sure we clear it before
diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 5f01671c68f2..c70cb9cacfc0 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -2,6 +2,7 @@
 #define _ASM_X86_APIC_H
 
 #include 
+#include 
 
 #include 
 #include 
@@ -618,6 +619,7 @@ extern void irq_exit(void);
 
 static inline void entering_irq(void)
 {
+   task_isolation_interrupt("irq");
irq_enter();
 }
 
@@ -629,6 +631,7 @@ static inline void entering_ack_irq(void)
 
 static inline void ipi_entering_ack_irq(void)
 {
+   task_isolation_interrupt("ack irq");
irq_enter();
ack_APIC_irq();
 }
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 89e7eeb5cec1..aa9d9d817f8b 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -85,6 +85,7 @@ struct thread_info {
 #define TIF_USER_RETURN_NOTIFY 11  /* notify kernel of userspace return */
 #define TIF_UPROBE 12  /* breakpointed or singlestepping */
 #define TIF_PATCH_PENDING  13  /* pending live patching update */
+#define TIF_TASK_ISOLATION 14  /* task isolation enabled for task */
 #define TIF_NOCPUID15  /* CPUID is not accessible in userland */
 #define TIF_NOTSC  16  /* TSC is not accessible in userland */
 #define TIF_IA32   17  /* IA32 compatibility process */
@@ -111,6 +112,7 @@ struct thread_info {
 #define _TIF_USER_RETURN_NOTIFY(1 << TIF_USER_RETURN_NOTIFY)
 #define _TIF_UPROBE(1 << TIF_UPROBE)
 #define _TIF_PATCH_PENDING (1 << TIF_PATCH_PENDING)
+#define _TIF_TASK_ISOLATION(1 << TIF_TASK_ISOLATION)
 #define _TIF_NOCPUID   (1 << TIF_NOCPUID)
 #define _TIF_NOTSC (1 << TIF_NOTSC)
 #define _TIF_IA32  (1 << TIF_IA32)
@@ -132,15 +134,15 @@ struct thread_info {
 #define _TIF_WORK_SYSCALL_ENTRY\
(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT |   \
 _TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT |   \
-_TIF_NOHZ)
+_TIF_NOHZ | _TIF_TASK_ISOLATION)
 
 /* work to do on any return to user space */
 #define _TIF_ALLWORK_MASK  \
(_TIF_SYSCALL_TRACE | _TIF_NOTIFY_RESUME | _TIF_SIGPENDING |\
 _TIF_NEED_RESCHED | _

[PATCH v16 05/13] Add try_stop_full_tick() API for NO_HZ_FULL

2017-11-03 Thread Chris Metcalf
This API checks to see if the scheduler tick can be stopped,
and if so, stops it and returns 0; otherwise it returns an error.
This is intended for use with task isolation, where we will want to
be able to stop the tick synchronously when returning to userspace.

Signed-off-by: Chris Metcalf <cmetc...@mellanox.com>
---
 include/linux/tick.h |  1 +
 kernel/time/tick-sched.c | 18 ++
 2 files changed, 19 insertions(+)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index fe01e68bf520..078ff2464b00 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -234,6 +234,7 @@ static inline void tick_dep_clear_signal(struct 
signal_struct *signal,
 
 extern void tick_nohz_full_kick_cpu(int cpu);
 extern void __tick_nohz_task_switch(void);
+extern int try_stop_full_tick(void);
 #else
 static inline int housekeeping_any_cpu(void)
 {
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c7a899c5ce64..c026145eba2f 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -861,6 +861,24 @@ static void tick_nohz_full_update_tick(struct tick_sched 
*ts)
 #endif
 }
 
+#ifdef CONFIG_TASK_ISOLATION
+int try_stop_full_tick(void)
+{
+   int cpu = smp_processor_id();
+   struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
+
+   /* For an unstable clock, we should return a permanent error code. */
+   if (atomic_read(&tick_dep_mask) & TICK_DEP_MASK_CLOCK_UNSTABLE)
+   return -EINVAL;
+
+   if (!can_stop_full_tick(cpu, ts))
+   return -EAGAIN;
+
+   tick_nohz_stop_sched_tick(ts, ktime_get(), cpu);
+   return 0;
+}
+#endif
+
 static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
 {
/*
-- 
2.1.2



[PATCH v16 06/13] task_isolation: userspace hard isolation from kernel

2017-11-03 Thread Chris Metcalf
The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device driver
style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.

The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate
"nohz_full=CPULIST isolcpus=CPULIST" boot argument to enable
nohz_full and isolcpus.  The "task_isolation" state is then indicated
by setting a new task struct field, task_isolation_flag, to the
value passed by prctl(), and also setting a TIF_TASK_ISOLATION
bit in the thread_info flags.  When the kernel is returning to
userspace from the prctl() call and sees TIF_TASK_ISOLATION set,
it calls the new task_isolation_start() routine to arrange for
the task to avoid being interrupted in the future.

With interrupts disabled, task_isolation_start() ensures that kernel
subsystems that might cause a future interrupt are quiesced.  If it
doesn't succeed, it adjusts the syscall return value to indicate that
fact, and userspace can retry as desired.  In addition to stopping
the scheduler tick, the code takes any actions that might avoid
a future interrupt to the core, such as a worker thread being
scheduled that could be quiesced now (e.g. the vmstat worker)
or a future IPI to the core to clean up some state that could be
cleaned up now (e.g. the mm lru per-cpu cache).

Once the task has returned to userspace after issuing the prctl(),
if it enters the kernel again via system call, page fault, or any
other exception or irq, the kernel will kill it with SIGKILL.
In addition to sending a signal, the code supports a kernel
command-line "task_isolation_debug" flag which causes a stack
backtrace to be generated whenever a task loses isolation.

To allow the state to be entered and exited, the syscall checking
test ignores the prctl(PR_TASK_ISOLATION) syscall so that we can
clear the bit again later, and ignores exit/exit_group to allow
exiting the task without a pointless signal being delivered.

The prctl() API allows for specifying a signal number to use instead
of the default SIGKILL, to allow for catching the notification
signal; for example, in a production environment, it might be
helpful to log information to the application logging mechanism
before exiting.  Or, the signal handler might choose to reset the
program counter back to the code segment intended to be run isolated
via prctl() to continue execution.

In a number of cases we can tell on a remote cpu that we are
going to be interrupting the cpu, e.g. via an IPI or a TLB flush.
In that case we generate the diagnostic (and optional stack dump)
on the remote core to be able to deliver better diagnostics.
If the interrupt is not something caught by Linux (e.g. a
hypervisor interrupt) we can also request a reschedule IPI to
be sent to the remote core so it can be sure to generate a
signal to notify the process.

Separate patches that follow provide these changes for x86, tile,
arm, and arm64.

Signed-off-by: Chris Metcalf <cmetc...@mellanox.com>
---
 Documentation/admin-guide/kernel-parameters.txt |   6 +
 include/linux/isolation.h   | 175 +++
 include/linux/sched.h   |   4 +
 include/uapi/linux/prctl.h  |   6 +
 init/Kconfig|  28 ++
 kernel/Makefile |   1 +
 kernel/context_tracking.c   |   2 +
 kernel/isolation.c  | 402 
 kernel/signal.c |   2 +
 kernel/sys.c|   6 +
 10 files changed, 631 insertions(+)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 05496622b4ef..aaf278f2cfc3 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4025,6 +4025,12 @@
neutralize any effect of /proc/sys/kernel/sysrq.
Useful for debugging.
 
+   task_isolation_debug[KNL]
+   In kernels built with CONFIG_TASK_ISOLATION, this
+   setting will generate console backtraces to
+   accompany the diagnostics generated about
+   interrupting tasks running with task isolation.

[PATCH v16 03/13] Revert "sched/core: Drop the unused try_get_task_struct() helper function"

2017-11-03 Thread Chris Metcalf
This reverts commit f11cc0760b8397e0d230122606421b6a96e9f869.
We do need this function for try_get_task_struct_on_cpu().

Signed-off-by: Chris Metcalf <cmetc...@mellanox.com>
---
 include/linux/sched/task.h |  2 ++
 kernel/exit.c  | 13 +
 2 files changed, 15 insertions(+)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 79a2a744648d..270ff76d43d9 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -96,6 +96,8 @@ static inline void put_task_struct(struct task_struct *t)
 }
 
 struct task_struct *task_rcu_dereference(struct task_struct **ptask);
+struct task_struct *try_get_task_struct(struct task_struct **ptask);
+
 
 #ifdef CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT
 extern int arch_task_struct_size __read_mostly;
diff --git a/kernel/exit.c b/kernel/exit.c
index f2cd53e92147..e2a3e7458d0f 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -318,6 +318,19 @@ void rcuwait_wake_up(struct rcuwait *w)
rcu_read_unlock();
 }
 
+struct task_struct *try_get_task_struct(struct task_struct **ptask)
+{
+   struct task_struct *task;
+
+   rcu_read_lock();
+   task = task_rcu_dereference(ptask);
+   if (task)
+   get_task_struct(task);
+   rcu_read_unlock();
+
+   return task;
+}
+
 /*
  * Determine if a process group is "orphaned", according to the POSIX
  * definition in 2.2.2.52.  Orphaned process groups are not to be affected
-- 
2.1.2



[PATCH v16 12/13] arm, tile: turn off timer tick for oneshot_stopped state

2017-11-03 Thread Chris Metcalf
When the schedule tick is disabled in tick_nohz_stop_sched_tick(),
we call hrtimer_cancel(), which eventually calls down into
__remove_hrtimer() and thus into hrtimer_force_reprogram().
That function's call to tick_program_event() detects that
we are trying to set the expiration to KTIME_MAX and calls
clockevents_switch_state() to set the state to ONESHOT_STOPPED,
and returns.  See commit 8fff52fd5093 ("clockevents: Introduce
CLOCK_EVT_STATE_ONESHOT_STOPPED state") for more background.

However, by default the internal __clockevents_switch_state() code
doesn't have a "set_state_oneshot_stopped" function pointer for
the arm_arch_timer or tile clock_event_device structures, so that
code returns -ENOSYS, and we end up not setting the state, and more
importantly, we don't actually turn off the hardware timer.
As a result, the timer tick we were waiting for before is still
queued, and fires shortly afterwards, only to discover there was
nothing for it to do, at which point it quiesces.

The fix is to provide that function pointer field, and like the
other function pointers, have it just turn off the timer interrupt.
Any call to set a new timer interval will properly re-enable it.

This fix avoids a small performance hiccup for regular applications,
but for TASK_ISOLATION code, it fixes a potentially serious
kernel timer interruption to the time-sensitive application.

Signed-off-by: Chris Metcalf <cmetc...@mellanox.com>
Acked-by: Daniel Lezcano <daniel.lezc...@linaro.org>
---
 arch/tile/kernel/time.c  | 1 +
 drivers/clocksource/arm_arch_timer.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/arch/tile/kernel/time.c b/arch/tile/kernel/time.c
index f74f10d827fa..afca6fe496c8 100644
--- a/arch/tile/kernel/time.c
+++ b/arch/tile/kernel/time.c
@@ -163,6 +163,7 @@ static DEFINE_PER_CPU(struct clock_event_device, 
tile_timer) = {
.set_next_event = tile_timer_set_next_event,
.set_state_shutdown = tile_timer_shutdown,
.set_state_oneshot = tile_timer_shutdown,
+   .set_state_oneshot_stopped = tile_timer_shutdown,
.tick_resume = tile_timer_shutdown,
 };
 
diff --git a/drivers/clocksource/arm_arch_timer.c 
b/drivers/clocksource/arm_arch_timer.c
index fd4b7f684bd0..61ea7f907c56 100644
--- a/drivers/clocksource/arm_arch_timer.c
+++ b/drivers/clocksource/arm_arch_timer.c
@@ -722,6 +722,8 @@ static void __arch_timer_setup(unsigned type,
}
}
 
+   clk->set_state_oneshot_stopped = clk->set_state_shutdown;
+
clk->set_state_shutdown(clk);
 
clockevents_config_and_register(clk, arch_timer_rate, 0xf, 0x7fff);
-- 
2.1.2



[PATCH v16 13/13] task_isolation self test

2017-11-03 Thread Chris Metcalf
This code tests various aspects of task_isolation.

Signed-off-by: Chris Metcalf 
---
 tools/testing/selftests/Makefile   |   1 +
 tools/testing/selftests/task_isolation/Makefile|   6 +
 tools/testing/selftests/task_isolation/config  |   1 +
 tools/testing/selftests/task_isolation/isolation.c | 643 +
 4 files changed, 651 insertions(+)
 create mode 100644 tools/testing/selftests/task_isolation/Makefile
 create mode 100644 tools/testing/selftests/task_isolation/config
 create mode 100644 tools/testing/selftests/task_isolation/isolation.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index ff805643b5f7..ab781b99d3c9 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -30,6 +30,7 @@ TARGETS += splice
 TARGETS += static_keys
 TARGETS += sync
 TARGETS += sysctl
+TARGETS += task_isolation
 ifneq (1, $(quicktest))
 TARGETS += timers
 endif
diff --git a/tools/testing/selftests/task_isolation/Makefile 
b/tools/testing/selftests/task_isolation/Makefile
new file mode 100644
index ..74d060b493f9
--- /dev/null
+++ b/tools/testing/selftests/task_isolation/Makefile
@@ -0,0 +1,6 @@
+CFLAGS += -O2 -g -W -Wall
+LDFLAGS += -pthread
+
+TEST_GEN_PROGS := isolation
+
+include ../lib.mk
diff --git a/tools/testing/selftests/task_isolation/config 
b/tools/testing/selftests/task_isolation/config
new file mode 100644
index ..34edfbca0423
--- /dev/null
+++ b/tools/testing/selftests/task_isolation/config
@@ -0,0 +1 @@
+CONFIG_TASK_ISOLATION=y
diff --git a/tools/testing/selftests/task_isolation/isolation.c 
b/tools/testing/selftests/task_isolation/isolation.c
new file mode 100644
index ..9c0b49619b40
--- /dev/null
+++ b/tools/testing/selftests/task_isolation/isolation.c
@@ -0,0 +1,643 @@
+/*
+ * This test program tests the features of task isolation.
+ *
+ * - Makes sure enabling task isolation fails if you are unaffinitized
+ *   or on a non-task-isolation cpu.
+ *
+ * - Validates that various synchronous exceptions are fatal in isolation
+ *   mode:
+ *
+ *   * Page fault
+ *   * System call
+ *   * TLB invalidation from another thread [1]
+ *   * Unaligned access [2]
+ *
+ * - Tests that taking a user-defined signal for the above faults works.
+ *
+ * - Tests that you can prctl(PR_TASK_ISOLATION, 0) to turn isolation off.
+ *
+ * - Tests that receiving a signal turns isolation off.
+ *
+ * - Tests that having another process schedule into the core where the
+ *   isolation process is running correctly kills the isolation process.
+ *
+ * [1] TLB invalidations do not cause IPIs on some platforms, e.g. arm64
+ * [2] Unaligned access only causes exceptions on some platforms, e.g. tile
+ *
+ *
+ * You must be running under a kernel configured with TASK_ISOLATION.
+ *
+ * You must have booted with e.g. "nohz_full=1-15 isolcpus=1-15" to
+ * enable some task-isolation cores.  If you get interrupt reports, you
+ * can also add the boot argument "task_isolation_debug" to learn more.
+ * If you get jitter but no reports, define DEBUG_TASK_ISOLATION to add
+ * isolation checks in every user_exit() call.
+ *
+ * NOTE: you must disable the code in tick_nohz_stop_sched_tick()
+ * that limits the tick delta to the maximum scheduler deferment
+ * by making it conditional not just on "!ts->inidle" but also
+ * on !current->task_isolation_flags.  This is around line 756
+ * in kernel/time/tick-sched.c (as of kernel 4.14).
+ *
+ *
+ * To compile the test program, run "make".
+ *
+ * Run the program as "./isolation" and if you want to run the
+ * jitter-detection loop for longer than 10 giga-cycles, specify the
+ * number of giga-cycles to run it for as a command-line argument.
+ */
+
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "../kselftest.h"
+
+#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
+#define READ_ONCE(x) (*(volatile typeof(x) *)&(x))
+#define WRITE_ONCE(x, val) (*(volatile typeof(x) *)&(x) = (val))
+
+#ifndef PR_TASK_ISOLATION   /* Not in system headers yet? */
+# define PR_TASK_ISOLATION 48
+# define PR_TASK_ISOLATION_ENABLE  (1 << 0)
+# define PR_TASK_ISOLATION_SET_SIG(sig)(((sig) & 0x7f) << 8)
+# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
+#endif
+
+/* The cpu we are using for isolation tests. */
+static int task_isolation_cpu;
+
+/* Overall status, maintained as tests run. */
+static int exit_status = KSFT_PASS;
+
+/* Data shared between parent and children. */
+static struct {
+   /* Set to true when the parent's isolation prctl is successful. */
+   bool parent_isolated;
+} *shared;
+
+/* Set affinity to a single cpu or die if trying to do so fails. */
+static void set

[PATCH v16 11/13] arch/tile: enable task isolation functionality

2017-11-03 Thread Chris Metcalf
We add the necessary call to task_isolation_start() in the
prepare_exit_to_usermode() routine.  We already unconditionally
call into this routine if TIF_NOHZ is set, since that's where
we do the user_enter() call.

We add calls to task_isolation_interrupt() in places where exceptions
may not generate signals to the application.

Signed-off-by: Chris Metcalf <cmetc...@mellanox.com>
---
 arch/tile/Kconfig   |  1 +
 arch/tile/include/asm/thread_info.h |  2 ++
 arch/tile/kernel/hardwall.c |  2 ++
 arch/tile/kernel/irq.c  |  3 +++
 arch/tile/kernel/messaging.c|  4 
 arch/tile/kernel/process.c  |  4 
 arch/tile/kernel/ptrace.c   | 10 ++
 arch/tile/kernel/single_step.c  |  7 +++
 arch/tile/kernel/smp.c  | 21 +++--
 arch/tile/kernel/time.c |  2 ++
 arch/tile/kernel/unaligned.c|  4 
 arch/tile/mm/fault.c| 13 -
 arch/tile/mm/homecache.c| 11 +++
 13 files changed, 73 insertions(+), 11 deletions(-)

diff --git a/arch/tile/Kconfig b/arch/tile/Kconfig
index 4583c0320059..2d644138f2eb 100644
--- a/arch/tile/Kconfig
+++ b/arch/tile/Kconfig
@@ -16,6 +16,7 @@ config TILE
select GENERIC_STRNCPY_FROM_USER
select GENERIC_STRNLEN_USER
select HAVE_ARCH_SECCOMP_FILTER
+   select HAVE_ARCH_TASK_ISOLATION
select HAVE_ARCH_TRACEHOOK
select HAVE_CONTEXT_TRACKING
select HAVE_DEBUG_BUGVERBOSE
diff --git a/arch/tile/include/asm/thread_info.h 
b/arch/tile/include/asm/thread_info.h
index b7659b8f1117..3e298bd43d11 100644
--- a/arch/tile/include/asm/thread_info.h
+++ b/arch/tile/include/asm/thread_info.h
@@ -126,6 +126,7 @@ extern void _cpu_idle(void);
 #define TIF_SYSCALL_TRACEPOINT 9   /* syscall tracepoint instrumentation */
 #define TIF_POLLING_NRFLAG 10  /* idle is polling for TIF_NEED_RESCHED 
*/
 #define TIF_NOHZ   11  /* in adaptive nohz mode */
+#define TIF_TASK_ISOLATION 12  /* in task isolation mode */
 
 #define _TIF_SIGPENDING(1<<TIF_SIGPENDING)
 #define _TIF_NEED_RESCHED  (1<<TIF_NEED_RESCHED)
@@ -139,6 +140,7 @@ extern void _cpu_idle(void);
 #define _TIF_SYSCALL_TRACEPOINT(1<<TIF_SYSCALL_TRACEPOINT)
 #define _TIF_POLLING_NRFLAG(1<<TIF_POLLING_NRFLAG)
 #define _TIF_NOHZ  (1<<TIF_NOHZ)
+#define _TIF_TASK_ISOLATION(1<<TIF_TASK_ISOLATION)
 
 /* Work to do as we loop to exit to user space. */
 #define _TIF_WORK_MASK \
diff --git a/arch/tile/kernel/hardwall.c b/arch/tile/kernel/hardwall.c
index 2fd1694ac1d0..9559f04d1c2a 100644
--- a/arch/tile/kernel/hardwall.c
+++ b/arch/tile/kernel/hardwall.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include <linux/isolation.h>
 #include 
 #include 
 #include 
@@ -328,6 +329,7 @@ void __kprobes do_hardwall_trap(struct pt_regs* regs, int 
fault_num)
int found_processes;
struct pt_regs *old_regs = set_irq_regs(regs);
 
+   task_isolation_interrupt("hardwall trap");
irq_enter();
 
/* Figure out which network trapped. */
diff --git a/arch/tile/kernel/irq.c b/arch/tile/kernel/irq.c
index 22044fc691ef..0b1b24b9c496 100644
--- a/arch/tile/kernel/irq.c
+++ b/arch/tile/kernel/irq.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include <linux/isolation.h>
 #include 
 #include 
 #include 
@@ -100,6 +101,8 @@ void tile_dev_intr(struct pt_regs *regs, int intnum)
 
/* Track time spent here in an interrupt context. */
old_regs = set_irq_regs(regs);
+
+   task_isolation_interrupt("IPI: IRQ mask %#lx", remaining_irqs);
irq_enter();
 
 #ifdef CONFIG_DEBUG_STACKOVERFLOW
diff --git a/arch/tile/kernel/messaging.c b/arch/tile/kernel/messaging.c
index 7475af3aacec..1cf1630215f0 100644
--- a/arch/tile/kernel/messaging.c
+++ b/arch/tile/kernel/messaging.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include <linux/isolation.h>
 #include 
 #include 
 #include 
@@ -86,6 +87,7 @@ void hv_message_intr(struct pt_regs *regs, int intnum)
 
tag = message[0];
 #ifdef CONFIG_SMP
+   task_isolation_interrupt("SMP message %d", tag);
evaluate_message(message[0]);
 #else
panic("Received IPI message %d in UP mode", tag);
@@ -94,6 +96,8 @@ void hv_message_intr(struct pt_regs *regs, int intnum)
HV_IntrMsg *him = (HV_IntrMsg *)message;
struct hv_driver_cb *cb =
(struct hv_driver_cb *)him->intarg;
+   task_isolation_interrupt("interrupt message %#lx(%#lx)",
+  him->intarg, him->intdata);
cb->callback(cb, him->intdata);
__this_cpu_inc(irq_stat.irq_hv_msg_count);
}
diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index f0a0e18e4dfb..ac22e971dc1d 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include 
+#include <linux/isolation.h>
 #include 
 #include 
 #include 
@@ -516,6 +517,9 @@ void prepare_exit_to_usermode(struct pt_regs *regs, u32 thread_info_flags)
 #endif
 	}
 
+	if (thread_info_flags & _TIF_TASK_ISOLATION)
+		task_isolation_start();
+
 	user_enter();
 }
 
diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c
index e1a078e6828e..908d57d3d2cf 100644
--- a/arch/tile/kernel/ptrace.c
+++ b/arch/tile/kernel/ptrace.c
@@ -24,6 +24,7 @@
 #include 
 #include


[PATCH v16 04/13] Add try_get_task_struct_on_cpu() to scheduler for task isolation

2017-11-03 Thread Chris Metcalf
Task isolation wants to be able to verify that a remote core is
running an isolated task to determine if it should generate a
diagnostic, and also possibly interrupt it.

This API returns a pointer to the task_struct of the task that
was running on the specified core at the moment of the request;
it uses try_get_task_struct() to increment the ref count on the
returned task_struct so that the caller can examine it even if
the actual remote task has already exited by that point.

Signed-off-by: Chris Metcalf <cmetc...@mellanox.com>
---
 include/linux/sched/task.h |  1 +
 kernel/sched/core.c| 11 +++
 2 files changed, 12 insertions(+)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 270ff76d43d9..6785db926857 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -97,6 +97,7 @@ static inline void put_task_struct(struct task_struct *t)
 
 struct task_struct *task_rcu_dereference(struct task_struct **ptask);
 struct task_struct *try_get_task_struct(struct task_struct **ptask);
+struct task_struct *try_get_task_struct_on_cpu(int cpu);
 
 
 #ifdef CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d17c5da523a0..2728154057ae 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -670,6 +670,17 @@ bool sched_can_stop_tick(struct rq *rq)
 }
 #endif /* CONFIG_NO_HZ_FULL */
 
+/*
+ * Return a pointer to the task_struct for the task that is running on
+ * the specified cpu at the time of the call (note that the task may have
+ * exited by the time the caller inspects the resulting task_struct).
+ * Caller must put_task_struct() with the pointer when finished with it.
+ */
+struct task_struct *try_get_task_struct_on_cpu(int cpu)
+{
+   return try_get_task_struct(&cpu_rq(cpu)->curr);
+}
+
 void sched_avg_update(struct rq *rq)
 {
s64 period = sched_avg_period();
-- 
2.1.2




[PATCH v16 02/13] vmstat: add vmstat_idle function

2017-11-03 Thread Chris Metcalf
This function checks to see if a vmstat worker is not running,
and the vmstat diffs don't require an update.  The function is
called from the task-isolation code to see if we need to
actually do some work to quiet vmstat.

Acked-by: Christoph Lameter <c...@linux.com>
Signed-off-by: Chris Metcalf <cmetc...@mellanox.com>
---
 include/linux/vmstat.h |  2 ++
 mm/vmstat.c| 10 ++
 2 files changed, 12 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index e0b504594593..80212a952448 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -265,6 +265,7 @@ extern void __dec_node_state(struct pglist_data *, enum 
node_stat_item);
 
 void quiet_vmstat(void);
 void quiet_vmstat_sync(void);
+bool vmstat_idle(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -368,6 +369,7 @@ static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
 static inline void quiet_vmstat(void) { }
 static inline void quiet_vmstat_sync(void) { }
+static inline bool vmstat_idle(void) { return true; }
 
 static inline void drain_zonestat(struct zone *zone,
struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 8ad1b84ca9cf..8b13a6ca494c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1846,6 +1846,16 @@ void quiet_vmstat_sync(void)
 }
 
 /*
+ * Report on whether vmstat processing is quiesced on the core currently:
+ * no vmstat worker running and no vmstat updates to perform.
+ */
+bool vmstat_idle(void)
+{
+   return !delayed_work_pending(this_cpu_ptr(&vmstat_work)) &&
+   !need_update(smp_processor_id());
+}
+
+/*
  * Shepherd worker thread that checks the
  * differentials of processors that have their worker
  * threads for vm statistics updates disabled because of
-- 
2.1.2




[PATCH v16 01/13] vmstat: add quiet_vmstat_sync function

2017-11-03 Thread Chris Metcalf
In commit f01f17d3705b ("mm, vmstat: make quiet_vmstat lighter")
the quiet_vmstat() function became asynchronous, in the sense that
the vmstat work was still scheduled to run on the core when the
function returned.  For task isolation, we need a synchronous
version of the function that guarantees that the vmstat worker
will not run on the core on return from the function.  Add a
quiet_vmstat_sync() function with that semantic.

Signed-off-by: Chris Metcalf <cmetc...@mellanox.com>
---
 include/linux/vmstat.h | 2 ++
 mm/vmstat.c| 9 +
 2 files changed, 11 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index ade7cb5f1359..e0b504594593 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -264,6 +264,7 @@ extern void __dec_zone_state(struct zone *, enum 
zone_stat_item);
 extern void __dec_node_state(struct pglist_data *, enum node_stat_item);
 
 void quiet_vmstat(void);
+void quiet_vmstat_sync(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -366,6 +367,7 @@ static inline void __dec_node_page_state(struct page *page,
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
 static inline void quiet_vmstat(void) { }
+static inline void quiet_vmstat_sync(void) { }
 
 static inline void drain_zonestat(struct zone *zone,
struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4bb13e72ac97..8ad1b84ca9cf 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1837,6 +1837,15 @@ void quiet_vmstat(void)
 }
 
 /*
+ * Synchronously quiet vmstat so the work is guaranteed not to run on return.
+ */
+void quiet_vmstat_sync(void)
+{
+   cancel_delayed_work_sync(this_cpu_ptr(&vmstat_work));
+   refresh_cpu_vm_stats(false);
+}
+
+/*
  * Shepherd worker thread that checks the
  * differentials of processors that have their worker
  * threads for vm statistics updates disabled because of
-- 
2.1.2





[PATCH v16 00/13] support "task_isolation" mode

2017-11-03 Thread Chris Metcalf
this doesn't
seem to ever happen.


What about using a per-cpu flag to stop doing new deferred work?

Andy also suggested we could structure the code to have the prctl()
set a per-cpu flag to stop adding new future work (e.g. vmstat per-cpu
data, or lru page cache).  Then, we could flush those structures right
from the sys_prctl() call, and when we were returning to user space,
we'd be confident that there wasn't going to be any new work added.

With the current set of things that we are disabling for task
isolation, though, it didn't seem necessary.  Quiescing the vmstat
shepherd seems like it is generally pretty safe since we will likely
be able to sync up the per-cpu cache and kill the deferred work with
high probability, with no expectation that additional work will show
up.  And since we can flush the LRU page cache with interrupts
disabled, that turns out not to be an issue either.

I could imagine that if we have to deal with some new kind of deferred
work, we might find the per-cpu flag becomes a good solution, but for
now we don't have a good use case for that approach.


How about stopping the dyn tick?

Right now we try to stop it on return to userspace, but if we can't,
we just return EAGAIN to userspace.  In practice, what I see is that
usually the tick stops immediately, but occasionally it doesn't; in
this case I've always seen that nr_running is >1, presumably with some
temporary kernel worker threads, and the user code just needs to call
prctl() until those threads are done.  We could structure things with
a completion that we wait for, which is set by the timer code when it
finally does stop the tick, but this may be overkill, particularly
since we'll only be running this prctl() loop from userspace on cores
where we have no other useful work that we're trying to run anyway.


What about TLB flushing?

We talked about this at Plumbers and some of the email discussion also
was about TLB flushing.  I haven't tried to add it to this patch set,
because I really want to avoid scope creep; in any case, I think I
managed to convince Andy that he was going to work on it himself. :)
Paul McKenney already contributed some framework for such a patch, in
commit b8c17e6664c4 ("rcu: Maintain special bits at bottom of
->dynticks counter").

What about that d*mn 1 Hz clock?

It's still there, so this code still requires some further work before
it can actually get a process into long-term task isolation (without
the obvious one-line kernel hack).  Frederic suggested a while ago
forcing updates on cpustats was required as the last gating factor; do
we think that is still true?  Christoph was working on this at one
point - any progress from your point of view?


Chris Metcalf (12):
  vmstat: add quiet_vmstat_sync function
  vmstat: add vmstat_idle function
  Revert "sched/core: Drop the unused try_get_task_struct() helper
function"
  Add try_get_task_struct_on_cpu() to scheduler for task isolation
  Add try_stop_full_tick() API for NO_HZ_FULL
  task_isolation: userspace hard isolation from kernel
  Add task isolation hooks to arch-independent code
  arch/x86: enable task isolation functionality
  arch/arm64: enable task isolation functionality
  arch/tile: enable task isolation functionality
  arm, tile: turn off timer tick for oneshot_stopped state
  task_isolation self test

Francis Giraldeau (1):
  arch/arm: enable task isolation functionality

 Documentation/admin-guide/kernel-parameters.txt|   6 +
 arch/arm/Kconfig   |   1 +
 arch/arm/include/asm/thread_info.h |  10 +-
 arch/arm/kernel/entry-common.S |  12 +-
 arch/arm/kernel/ptrace.c   |  10 +
 arch/arm/kernel/signal.c   |  10 +-
 arch/arm/kernel/smp.c  |   4 +
 arch/arm/mm/fault.c|   8 +-
 arch/arm64/Kconfig |   1 +
 arch/arm64/include/asm/thread_info.h   |   5 +-
 arch/arm64/kernel/ptrace.c |  18 +-
 arch/arm64/kernel/signal.c |  10 +-
 arch/arm64/kernel/smp.c|   7 +
 arch/arm64/mm/fault.c  |   5 +
 arch/tile/Kconfig  |   1 +
 arch/tile/include/asm/thread_info.h|   2 +
 arch/tile/kernel/hardwall.c|   2 +
 arch/tile/kernel/irq.c |   3 +
 arch/tile/kernel/messaging.c   |   4 +
 arch/tile/kernel/process.c |   4 +
 arch/tile/kernel/ptrace.c  |  10 +
 arch/tile/kernel/single_step.c |   7 +
 arch/tile/kernel/smp.c |  21 +-
 arch/tile/kernel/time.c|   3 +
 arch/tile/kernel/unaligned.c   |   4 +
 arch/tile/mm/fault.c 


Re: [GIT PULL] Introduce housekeeping subsystem v4

2017-10-21 Thread Chris Metcalf

On 10/20/2017 10:29 AM, Frederic Weisbecker wrote:

2017-10-20 10:17 UTC+02:00, Ingo Molnar <mi...@kernel.org>:

I mean code like:

  triton:~/tip> git grep on_each_cpu mm
  mm/page_alloc.c: * cpu to drain that CPU pcps and on_each_cpu_mask
  mm/slab.c:  on_each_cpu(do_drain, cachep, 1);
  mm/slub.c:  on_each_cpu_cond(has_cpu_slab, flush_cpu_slab, s, 1,
GFP_ATOMIC);
  mm/vmstat.c:err = schedule_on_each_cpu(refresh_vm_stats);

is something we want to execute on 'housekeeping CPUs' as well, to not
disturb the
isolated CPUs, right?

I see, so indeed that's the kind of thing we want to also confine to
housekeeping as
well whenever possible but these cases require special treatment that need to be
handled by the subsystem in charge. For example vmstat has the vmstat_shepherd
thing which allows to drive those timers adaptively on demand to make sure that
userspace isn't interrupted. The others will likely need some similar treatment.

For now I only see vmstat having such a feature and it acts
transparently. There is
also the LRU flush (IIRC) which needs to be called for example before
returning to
userspace to avoid IPIs. Such things may indeed need special treatment. With the
current patchset it could be a housekeeping flag.


I have been working to update the task isolation support the last few
days and though it's not quite ready to post (probably will be Monday
or Tuesday), I have sorted out those issues from task isolation's
perspective.  It turns out that you can both quiesce the
vmstat_shepherd, as well as drain the LRU per-cpu pages, while
interrupts are disabled on the way back to userspace.

Whether shifting this work to housekeeping cores at all times makes
sense seems like a much more open question.  The idea of task
isolation is to provide a harder guarantee of isolation, and in
particular to shift work to the moment that you return to userspace,
rather than allowing it to happen later.  It does seem likely that
there are some things you'd want to do on the core itself most of the
time, and just suppress for true task isolation if requested, rather
than trying to move them to the housekeeping cores.

But, it's certainly worth looking at both options and seeing how it
plays out.  The less complicated the task isolation return-to-user
path is, the better.  (The idea of task isolation seems like a win no
matter what, to allow ensuring kernel isolation when you absolutely
require it.)

The current task isolation tree is in the "dataplane" branch at
git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com




[PATCH] firmware: bluefield: add boot control driver

2017-10-09 Thread Chris Metcalf
The Mellanox BlueField SoC firmware supports a safe upgrade mode as
part of the flow where users put new firmware on the secondary eMMC
boot partition (the one not currently in use), tell the eMMC to make
the secondary boot partition primary, and reset.  This driver is
used to request that the firmware start the ARM watchdog after the
next reset, and also request that the firmware swap the eMMC boot
partition back again on the reset after that (the second reset).
This means that if anything goes wrong, the watchdog will fire, the
system will reset, and the firmware will switch back to the original
boot partition.  If the boot is successful, the user will use this
driver to put the firmware back into the state where it doesn't touch
the eMMC boot partition at reset, and turn off the ARM watchdog.

The firmware allows for more configurability than that, as can
be seen in the code, but the use case above is what the driver
primarily supports.

It is structured as a simple sysfs driver that is loaded based on
an ACPI table entry, and allows reading/writing text strings to
various /sys/bus/platform/drivers/mlx-bootctl/* files.

Signed-off-by: Chris Metcalf <cmetc...@mellanox.com>
---
Ingo, since there isn't an overall maintainer for drivers/firmware,
does it make sense for this to go through your tree?  Thanks!

 drivers/firmware/Kconfig   |  12 +++
 drivers/firmware/Makefile  |   1 +
 drivers/firmware/mlx-bootctl.c | 222 +
 drivers/firmware/mlx-bootctl.h | 103 +++
 4 files changed, 338 insertions(+)
 create mode 100644 drivers/firmware/mlx-bootctl.c
 create mode 100644 drivers/firmware/mlx-bootctl.h

diff --git a/drivers/firmware/Kconfig b/drivers/firmware/Kconfig
index 6e4ed5a9c6fd..1f2adbcc5acc 100644
--- a/drivers/firmware/Kconfig
+++ b/drivers/firmware/Kconfig
@@ -230,6 +230,18 @@ config TI_SCI_PROTOCOL
  This protocol library is used by client drivers to use the features
  provided by the system controller.
 
+config MLX_BOOTCTL
+   tristate "Mellanox BlueField Firmware Boot Control"
+   depends on ARM64
+   help
+ The Mellanox BlueField firmware implements functionality to
+ request swapping the primary and alternate eMMC boot
+ partition, and to set up a watchdog that can undo that swap
+ if the system does not boot up correctly.  This driver
+ provides sysfs access to the firmware, to be used in
+ conjunction with the eMMC device driver to do any necessary
+ initial swap of the boot partition.
+
 config HAVE_ARM_SMCCC
bool
 
diff --git a/drivers/firmware/Makefile b/drivers/firmware/Makefile
index a37f12e8d137..4f4cad1eb9dd 100644
--- a/drivers/firmware/Makefile
+++ b/drivers/firmware/Makefile
@@ -22,6 +22,7 @@ obj-$(CONFIG_QCOM_SCM_64) += qcom_scm-64.o
 obj-$(CONFIG_QCOM_SCM_32)  += qcom_scm-32.o
 CFLAGS_qcom_scm-32.o :=$(call as-instr,.arch armv7-a\n.arch_extension 
sec,-DREQUIRES_SEC=1) -march=armv7-a
 obj-$(CONFIG_TI_SCI_PROTOCOL)  += ti_sci.o
+obj-$(CONFIG_MLX_BOOTCTL)  += mlx-bootctl.o
 
 obj-y  += broadcom/
 obj-y  += meson/
diff --git a/drivers/firmware/mlx-bootctl.c b/drivers/firmware/mlx-bootctl.c
new file mode 100644
index ..7fe942e9d7bb
--- /dev/null
+++ b/drivers/firmware/mlx-bootctl.c
@@ -0,0 +1,222 @@
+/*
+ *  Mellanox boot control driver
+ *  This driver provides a sysfs interface for systems management
+ *  software to manage reset-time actions.
+ *
+ *  Copyright (C) 2017 Mellanox Technologies.  All rights reserved.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License v2.0 as published by
+ *  the Free Software Foundation.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include "mlx-bootctl.h"
+
+#define DRIVER_NAME"mlx-bootctl"
+#define DRIVER_VERSION "1.1"
+#define DRIVER_DESCRIPTION "Mellanox boot control driver"
+
+struct boot_name {
+   int value;
+   const char name[12];
+};
+
+static struct boot_name boot_names[] = {
+   { MLNX_BOOT_EXTERNAL,   "external"  },
+   { MLNX_BOOT_EMMC,   "emmc"  },
+   { MLNX_BOOT_SWAP_EMMC,  "swap_emmc" },
+   { MLNX_BOOT_EMMC_LEGACY,"emmc_legacy"   },
+   { MLNX_BOOT_NONE,   "none"  },
+   { -1,   ""  }
+};
+
+/* The SMC calls in question are atomic, so we don't have to lock here. */
+static int smc_call1(uns

[GIT PULL] arch/tile bugfixes for 4.14-rc2

2017-09-22 Thread Chris Metcalf

Linus,

Please pull the following two changes for 4.14-rc2 from:

git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git
master

These are a code cleanup and config cleanup, respectively.

Dan Carpenter (1):
  tile: array underflow in setup_maxnodemem()

Krzysztof Kozlowski (1):
  tile: defconfig: Cleanup from old Kconfig options

 arch/tile/configs/tilegx_defconfig  | 1 -
 arch/tile/configs/tilepro_defconfig | 2 --
 arch/tile/kernel/setup.c| 2 +-
 3 files changed, 1 insertion(+), 4 deletions(-)

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [RFC PATCH 0/9] Introduce housekeeping subsystem

2017-08-11 Thread Chris Metcalf

On 8/11/2017 11:35 AM, Christopher Lameter wrote:

Ah, Chris since you are here: What is happening with the dataplane
patches?


Work has been crazy and I keep expecting to have a chunk of time to work
on it and it doesn't happen.

September is looking relatively good though for my having time to work
on it.  I really would like to get out a new spin.  Fingers crossed.

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [RFC PATCH 0/9] Introduce housekeeping subsystem

2017-08-11 Thread Chris Metcalf

On 8/11/2017 2:36 AM, Mike Galbraith wrote:

On Thu, 2017-08-10 at 09:57 -0400, Chris Metcalf wrote:

On 8/10/2017 8:54 AM, Frederic Weisbecker wrote:

But perhaps I should add a new NO_HZ_FULL_BUT_HOUSEKEEPING option.
Otherwise we'll change the meaning of NO_HZ_FULL_ALL way too much, to the point
that its default behaviour will be the exact opposite of the current one: by 
default
every CPU is housekeeping, so NO_HZ_FULL_ALL would have no effect anymore if we
don't set housekeeping boot option.

Maybe a CONFIG_HOUSEKEEPING_BOOT_ONLY as a way to restrict housekeeping
by default to just the boot cpu.  In conjunction with NOHZ_FULL_ALL you would
then get the expected semantics.

A big box with only the boot cpu for housekeeping is likely screwed.


Fair point - this kind of configuration would be primarily useful for
dedicated systems that were running a high-traffic-rate networking
application on many cores, for example.  In this mode you don't end up
putting a lot of burden on the housekeeping core.  In any case,
probably not worth adding an additional kernel config for.

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [RFC PATCH 0/9] Introduce housekeeping subsystem

2017-08-10 Thread Chris Metcalf

On 8/10/2017 8:54 AM, Frederic Weisbecker wrote:

But perhaps I should add a new NO_HZ_FULL_BUT_HOUSEKEEPING option.
Otherwise we'll change the meaning of NO_HZ_FULL_ALL way too much, to the point
that its default behaviour will be the exact opposite of the current one: by 
default
every CPU is housekeeping, so NO_HZ_FULL_ALL would have no effect anymore if we
don't set housekeeping boot option.


Maybe a CONFIG_HOUSEKEEPING_BOOT_ONLY as a way to restrict housekeeping
by default to just the boot cpu.  In conjunction with NOHZ_FULL_ALL you would
then get the expected semantics.


Also I plan to add a housekeeping option to offload the residual 1Hz tick from
nohz_full CPUs. So having "housekeeping=0,tick_offload" would make CPU 0 the
housekeeper, make the other CPUs nohz_full and handle their 1hz tick from CPU 0.


It does seem like that might be implied by requesting NOHZ_FULL on the core...
or maybe it's just implied by TASK_ISOLATION.  I've done a bad job of finding
time to work on that since last year's Plumbers, but September looks good :)

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [PATCH 09/11] tile/topology: Remove the unused parent_node() macro

2017-07-26 Thread Chris Metcalf

On 7/26/2017 9:34 AM, Dou Liyang wrote:

Commit a7be6e5a7f8d ("mm: drop useless local parameters of
__register_one_node()") removes the last user of parent_node().

The parent_node() macro in tile platform is unnecessary.

Remove it for cleanup.

Reported-by: Michael Ellerman<m...@ellerman.id.au>
Signed-off-by: Dou Liyang<douly.f...@cn.fujitsu.com>
Cc: Chris Metcalf<cmetc...@mellanox.com>
---
  arch/tile/include/asm/topology.h | 6 --
  1 file changed, 6 deletions(-)


Acked-by: Chris Metcalf <cmetc...@mellanox.com>

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [PATCH] tile: array underflow in setup_maxnodemem()

2017-07-24 Thread Chris Metcalf

On 7/22/2017 3:33 AM, Dan Carpenter wrote:

My static checker correctly complains that we should have a lower bound
on "node" to prevent an array underflow.

Fixes: 867e359b97c9 ("arch/tile: core support for Tilera 32-bit chips.")
Signed-off-by: Dan Carpenter<dan.carpen...@oracle.com>


Thanks, taken into the tile tree.

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [RFC PATCH 0/9] Introduce housekeeping subsystem

2017-07-21 Thread Chris Metcalf

On 7/21/2017 9:21 AM, Frederic Weisbecker wrote:

I'm leaving for two weeks so this is food for thoughts in the meantime :)

We have a design issue with nohz_full: it drives the isolation features
through the *housekeeping*() functions: kthreads, unpinned timers,
watchdog, ...

But things should work the other way around because the tick is just an
isolation feature among others.

So we need a housekeeping subsystem to drive all these isolation
features, including nohz full in a later iteration. For now this is a
basic draft. In the long run this subsystem should also drive the tick
offloading (remove residual 1Hz) and all unbound kthreads.

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
nohz/0hz

HEAD: 68e3af1de5db228bf6c2a5e721bce59a02cfc4e1


For the series:

Reviewed-by: Chris Metcalf <cmetc...@mellanox.com>

I spotted a few typos that you should grep for and fix for your next 
version:
"watchog", "Lets/lets" instead of "Let's/let's", "overriden" (should 
have two d's).


The new housekeeping=MASK boot option seems like it might make it a little
irritating to specify nohz_full=MASK as well.  I guess if setting
NO_HZ_FULL_ALL implied "all but housekeeping", it becomes a reasonably tidy
solution.  To make this work right you might have to make the housekeeping
option early_param instead so its value is available early enough.

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [RESEND PATCH] tile: defconfig: Cleanup from old Kconfig options

2017-07-20 Thread Chris Metcalf

On 7/20/2017 1:05 AM, Krzysztof Kozlowski wrote:

Remove old, dead Kconfig options (in order appearing in this commit):
  - CRYPTO_ZLIB: commit 110492183c4b ("crypto: compress - remove unused
pcomp interface");
  - IP_NF_TARGET_ULOG: commit d4da843e6fad ("netfilter: kill remnants of
ulog targets");

Signed-off-by: Krzysztof Kozlowski<k...@kernel.org>
---
  arch/tile/configs/tilegx_defconfig  | 1 -
  arch/tile/configs/tilepro_defconfig | 2 --
  2 files changed, 3 deletions(-)


Thanks! Taken into the tile tree.

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [PATCH] lib/strscpy: avoid KASAN false positive

2017-07-19 Thread Chris Metcalf

On 7/18/2017 6:04 PM, Andrew Morton wrote:

On Wed, 19 Jul 2017 00:31:36 +0300 Andrey Ryabinin <aryabi...@virtuozzo.com> 
wrote:


On 07/18/2017 11:26 PM, Linus Torvalds wrote:

On Tue, Jul 18, 2017 at 1:15 PM, Andrey Ryabinin
<aryabi...@virtuozzo.com> wrote:

No, it does warn about valid users. The report that Dave posted wasn't about
wrong strscpy() usage; it was about reading 8 bytes from a 5-byte source
string. It wasn't about a buggy 'count' at all.
So KASAN will warn for perfectly valid code like this:
 char dest[16];
 strscpy(dest, "12345", sizeof(dest));

Ugh, ok, yes.


For strscpy() that would mean making the *whole* read from 'src' buffer 
unchecked by KASAN.

So we do have that READ_ONCE_NOCHECK(), but could we perhaps have
something that doesn't do a NOCHECK but a partial check and is simply
ok with "this is an optimistic longer access"


This can be done, I think.

Something like this:
static inline unsigned long read_partial_nocheck(unsigned long *x)
{
	unsigned long ret = READ_ONCE_NOCHECK(*x);

	kasan_check_partial(x, sizeof(unsigned long));
	return ret;
}


(Cc Chris)

We could just remove all that word-at-a-time logic.  Do we have any
evidence that this would harm anything?


The word-at-a-time logic was part of the initial commit since I wanted
to ensure that strscpy could be used to replace strlcpy or strncpy without
serious concerns about performance.  It seems unfortunate to remove it
unconditionally to support KASAN, but I haven't looked deeply at the
tradeoffs here.

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



[GIT PULL] arch/tile changes for 4.13

2017-07-14 Thread Chris Metcalf

Linus,

Please pull the following changes for 4.13 from:

git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git 
master


This adds support for an <arch/intreg.h> header to help with removing __need_xxx
#defines from glibc, and removes some dead code in arch/tile/mm/init.c.

Chris Metcalf (1):
  tile: prefer <arch/intreg.h> to __need_int_reg_t

Michal Hocko (1):
  mm, tile: drop arch_{add,remove}_memory

 arch/tile/include/uapi/arch/abi.h| 49 +++--
 arch/tile/include/uapi/arch/intreg.h | 70 
 arch/tile/mm/init.c  | 30 
 3 files changed, 74 insertions(+), 75 deletions(-)
 create mode 100644 arch/tile/include/uapi/arch/intreg.h

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [RFC][PATCH] atomic: Fix atomic_set_release() for 'funny' architectures

2017-06-09 Thread Chris Metcalf

On 6/9/2017 7:05 AM, Peter Zijlstra wrote:

Subject: atomic: Fix atomic_set_release() for 'funny' architectures

Those architectures that have a special atomic_set implementation also
need a special atomic_set_release(), because for the very same reason
WRITE_ONCE() is broken for them, smp_store_release() is too.

The vast majority are architectures that have a spinlock-hash based atomic
implementation, except hexagon, which seems to have a hardware 'feature'.

The spinlock based atomics should be SC, that is, none of them appear to
place extra barriers in atomic_cmpxchg() or any of the other SC atomic
primitives and therefore seem to rely on their spinlock implementation
being SC (I did not fully validate all that).

Therefore, the normal atomic_set() is SC and can be used at
atomic_set_release().


Acked-by: Chris Metcalf <cmetc...@mellanox.com> [for tile]

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: Updating kernel.org cross compilers?

2017-05-23 Thread Chris Metcalf

On 05/09/2017 10:59 AM, Andre Przywara wrote:

On 30/04/17 06:29, Segher Boessenkool wrote:

On Wed, Apr 26, 2017 at 03:14:16PM +0100, Andre Przywara wrote:

It seems that many people (even outside the Linux kernel community) use
the cross compilers provided at kernel.org/pub/tools/crosstool.
The latest compiler I find there is 4.9.0, which celebrated its third
birthday at the weekend and has meanwhile been superseded by 4.9.4.

So I took Segher's buildall scripts from [1] and threw binutils 2.28 and
GCC 6.3.0 at them.


I am belatedly catching up on this thread.  It sounds like the
tilegx/tilepro issues were sorted out -- as someone noted, you need
to have the kernel headers available to build glibc.  However, if
there are any outstanding tile issues, please feel free to loop me in!

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


Re: [PATCH 2/6] mm, tile: drop arch_{add,remove}_memory

2017-03-30 Thread Chris Metcalf

On 3/30/2017 7:54 AM, Michal Hocko wrote:

From: Michal Hocko<mho...@suse.com>

these functions are unreachable because tile doesn't support memory
hotplug: it doesn't select ARCH_ENABLE_MEMORY_HOTPLUG, nor does it
support SPARSEMEM.

This code hasn't been compiled for a while obviously because nobody has
noticed that __add_pages has a different signature since 2009.

Cc: Chris Metcalf<cmetc...@mellanox.com>
Signed-off-by: Michal Hocko<mho...@suse.com>
---
  arch/tile/mm/init.c | 30 --
  1 file changed, 30 deletions(-)


Thanks - taken into the tile tree.

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



[PATCH] tile: prefer <arch/intreg.h> to __need_int_reg_t

2017-03-27 Thread Chris Metcalf
As part of some work in glibc to move away from the "__need" prefix,
this commit breaks away the definitions of __int_reg_t, __uint_reg_t,
__INT_REG_BITS, and __INT_REG_FMT to a separate <arch/intreg.h>
"microheader".  It is then included from <arch/abi.h> to preserve
the semantics of the previous header.

For now, we continue to preserve the __need_int_reg_t semantics
in <arch/abi.h> as well, but anticipate that after a few years
we can obsolete it.
---
 arch/tile/include/uapi/arch/abi.h| 49 +++--
 arch/tile/include/uapi/arch/intreg.h | 70 
 2 files changed, 74 insertions(+), 45 deletions(-)
 create mode 100644 arch/tile/include/uapi/arch/intreg.h

diff --git a/arch/tile/include/uapi/arch/abi.h 
b/arch/tile/include/uapi/arch/abi.h
index c55a3d432644..328e62260272 100644
--- a/arch/tile/include/uapi/arch/abi.h
+++ b/arch/tile/include/uapi/arch/abi.h
@@ -20,58 +20,17 @@
 
 #ifndef __ARCH_ABI_H__
 
-#if !defined __need_int_reg_t && !defined __DOXYGEN__
-# define __ARCH_ABI_H__
-# include <arch/chip.h>
-#endif
-
-/* Provide the basic machine types. */
-#ifndef __INT_REG_BITS
-
-/** Number of bits in a register. */
-#if defined __tilegx__
-# define __INT_REG_BITS 64
-#elif defined __tilepro__
-# define __INT_REG_BITS 32
-#elif !defined __need_int_reg_t
#ifndef __tile__   /* support uncommon use of arch headers in non-tile builds */
 # include <arch/chip.h>
 # define __INT_REG_BITS CHIP_WORD_SIZE()
-#else
-# error Unrecognized architecture with __need_int_reg_t
-#endif
-
-#if __INT_REG_BITS == 64
-
-#ifndef __ASSEMBLER__
-/** Unsigned type that can hold a register. */
-typedef unsigned long long __uint_reg_t;
-
-/** Signed type that can hold a register. */
-typedef long long __int_reg_t;
-#endif
-
-/** String prefix to use for printf(). */
-#define __INT_REG_FMT "ll"
-
-#else
-
-#ifndef __ASSEMBLER__
-/** Unsigned type that can hold a register. */
-typedef unsigned long __uint_reg_t;
-
-/** Signed type that can hold a register. */
-typedef long __int_reg_t;
-#endif
-
-/** String prefix to use for printf(). */
-#define __INT_REG_FMT "l"
-
 #endif
-#endif /* __INT_REG_BITS */
 
+#include <arch/intreg.h>
 
+/* __need_int_reg_t is deprecated: just include <arch/intreg.h> */
 #ifndef __need_int_reg_t
 
+#define __ARCH_ABI_H__
 
 #ifndef __ASSEMBLER__
 /** Unsigned type that can hold a register. */
diff --git a/arch/tile/include/uapi/arch/intreg.h 
b/arch/tile/include/uapi/arch/intreg.h
new file mode 100644
index ..1cf2fbf74306
--- /dev/null
+++ b/arch/tile/include/uapi/arch/intreg.h
@@ -0,0 +1,70 @@
+/*
+ * Copyright 2017 Tilera Corporation. All Rights Reserved.
+ *
+ *   This program is free software; you can redistribute it and/or
+ *   modify it under the terms of the GNU General Public License
+ *   as published by the Free Software Foundation, version 2.
+ *
+ *   This program is distributed in the hope that it will be useful, but
+ *   WITHOUT ANY WARRANTY; without even the implied warranty of
+ *   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
+ *   NON INFRINGEMENT.  See the GNU General Public License for
+ *   more details.
+ */
+
+/**
+ * @file
+ *
+ * Provide types and defines for the type that can hold a register,
+ * in the implementation namespace.
+ */
+
+#ifndef __ARCH_INTREG_H__
+#define __ARCH_INTREG_H__
+
+/*
+ * Get number of bits in a register.  __INT_REG_BITS may be defined
+ * prior to including this header to force a particular bit width.
+ */
+
+#ifndef __INT_REG_BITS
+# if defined __tilegx__
+#  define __INT_REG_BITS 64
+# elif defined __tilepro__
+#  define __INT_REG_BITS 32
+# else
+#  error Unrecognized architecture
+# endif
+#endif
+
+#if __INT_REG_BITS == 64
+
+# ifndef __ASSEMBLER__
+/** Unsigned type that can hold a register. */
+typedef unsigned long long __uint_reg_t;
+
+/** Signed type that can hold a register. */
+typedef long long __int_reg_t;
+# endif
+
+/** String prefix to use for printf(). */
+# define __INT_REG_FMT "ll"
+
+#elif __INT_REG_BITS == 32
+
+# ifndef __ASSEMBLER__
+/** Unsigned type that can hold a register. */
+typedef unsigned long __uint_reg_t;
+
+/** Signed type that can hold a register. */
+typedef long __int_reg_t;
+# endif
+
+/** String prefix to use for printf(). */
+# define __INT_REG_FMT "l"
+
+#else
+# error Unrecognized value of __INT_REG_BITS
+#endif
+
+#endif /* !__ARCH_INTREG_H__ */
-- 
2.7.2



Re: [PATCH 1/3] futex: remove duplicated code

2017-03-03 Thread Chris Metcalf

On 3/3/2017 7:27 AM, Jiri Slaby wrote:

There is code duplicated over all architecture's headers for
futex_atomic_op_inuser. Namely op decoding, access_ok check for uaddr,
and comparison of the result.

Remove this duplication and leave up to the arches only the needed
assembly which is now in arch_futex_atomic_op_inuser.

Note that s390 removed access_ok check in d12a29703 ("s390/uaccess:
remove pointless access_ok() checks") as access_ok there returns true.
We introduce it back to the helper for the sake of simplicity (it gets
optimized away anyway).

Signed-off-by: Jiri Slaby<jsl...@suse.cz>


Acked-by: Chris Metcalf <cmetc...@mellanox.com> [for tile]

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [PATCH v15 04/13] task_isolation: add initial support

2017-02-02 Thread Chris Metcalf

On 2/2/2017 11:13 AM, Eugene Syromiatnikov wrote:

case PR_GET_FP_MODE:
error = GET_FP_MODE(me);
break;
+#ifdef CONFIG_TASK_ISOLATION
+   case PR_SET_TASK_ISOLATION:
+   error = task_isolation_set(arg2);
+   break;
+   case PR_GET_TASK_ISOLATION:
+   error = me->task_isolation_flags;
+   break;
+#endif
default:
error = -EINVAL;
break;

It is not a very good idea to ignore the values of unused arguments; it
prevents their future use, as user space can pass some garbage values
here. Check out the code for newer prctl handlers, like
PR_SET_NO_NEW_PRIVS, PR_SET_THP_DISABLE, or PR_MPX_ENABLE_MANAGEMENT
(PR_[SG]_FP_MODE is an unfortunate recent omission).

The other thing is the usage of #ifdef's, which is generally avoided
there. Also, the patch for man-pages, describing the new prctl calls, is
missing.


Thanks, I appreciate the feedback.  I'll fold this into the next spin of the series!

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [PATCH] tile: fix build failure

2017-01-24 Thread Chris Metcalf

On 1/24/2017 11:39 AM, Sudip Mukherjee wrote:

From: Sudip Mukherjee <sudipm.mukher...@gmail.com>

The build of tilegx allmodconfig was failing with errors like:
../arch/tile/include/asm/div64.h:5:15: error: unknown type name 'u64'
  static inline u64 mul_u32_u32(u32 a, u32 b)
^~~
../arch/tile/include/asm/div64.h:5:31: error: unknown type name 'u32'
  static inline u64 mul_u32_u32(u32 a, u32 b)
^~~
../arch/tile/include/asm/div64.h:5:38: error: unknown type name 'u32'
  static inline u64 mul_u32_u32(u32 a, u32 b)
   ^~~
In file included from ../fs/ubifs/ubifs.h:26:0,
  from ../fs/ubifs/shrinker.c:42:
../include/linux/math64.h: In function 'mul_u64_u32_shr':
../arch/tile/include/asm/div64.h:9:21: error: implicit declaration of
function 'mul_u32_u32' [-Werror=implicit-function-declaration]

The simplest solution was to include the types header file.

Fixes: 9e3d6223d209 ("math64, timers: Fix 32bit mul_u64_u32_shr() and friends")
Cc: Peter Zijlstra <pet...@infradead.org>
Signed-off-by: Sudip Mukherjee <sudip.mukher...@codethink.co.uk>
---

build log is at:
https://travis-ci.org/sudipm-mukherjee/parport/jobs/194717687

  arch/tile/include/asm/div64.h | 1 +
  1 file changed, 1 insertion(+)


Acked-by: Chris Metcalf <cmetc...@mellanox.com>

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



[GIT PULL] arch/tile bugfix for 4.10-rc6

2017-01-23 Thread Chris Metcalf

Linus,

Please pull the following change from:

git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git stable

This avoids an issue with short userspace reads for regset via ptrace.

Dave Martin (1):
  tile/ptrace: Preserve previous registers for short regset write

 arch/tile/kernel/ptrace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: Questions on the task isolation patches

2016-12-22 Thread Chris Metcalf

On 12/20/2016 4:27 AM, Paolo Bonzini wrote:

On 16/12/2016 22:00, Chris Metcalf wrote:

Sorry, I think I wasn't clear.  Normally when you are running task
isolated and you enter the kernel, you will get a fatal signal.  The
exception is if you call prctl itself (or exit), the kernel tolerates
it without a signal, since obviously that's how you need to cleanly
tell the kernel you are done with task isolation.

Running in a guest is pretty much the same as running in userspace.
Would it be possible to exclude the KVM_RUN ioctl as well?  QEMU would
still have to run prctl when a CPU goes to sleep, and KVM_RUN would have
to enable/disable isolated mode when a VM executes HLT (which should
never happen anyway in NFV scenarios).


I think that probably makes sense.  The flow would be that qemu executes
first the prctl() for task isolation, then the KVM_RUN ioctl.  We obviously can't
do it in the other order, so we'd need to make task isolation tolerate KVM_RUN.

I won't try to do it for my next patch series (based on 4.10) though, since I'd
like to get the basic support upstreamed before trying to extend it.

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: Questions on the task isolation patches

2016-12-16 Thread Chris Metcalf

Sorry for the slow response - I have been busy with some other things.

On 12/6/2016 4:43 PM, yunhong jiang wrote:

On Fri, 2 Dec 2016 13:58:08 -0500
Chris Metcalf <cmetc...@mellanox.com> wrote:


On 12/1/2016 5:28 PM, yunhong jiang wrote:

a) If the task isolation need prctl to mark itself as isolated,
possibly the vCPU thread can't achieve it. First, the vCPU thread
may need system service during OS booting time, also it's the
application, instead of the vCPU thread to decide if the vCPU
thread should be isolated. So possibly we need a mechanism so that
another process can set the vCPU thread's task isolation?

These are good questions.  I think that we would probably want to
add a KVM mode that did the prctl() before transitioning back to the

Would prctl() when back to guest be too heavy?


It's a good question; it can be heavy.  But the design for task isolation is that
the task-isolated process is always running in userspace anyway.  If you are
transitioning in and out of the guest or host kernels frequently, you probably
should not be using task isolation, but just regular NOHZ_FULL.


guest.  But then, in the same way that we currently allow another
prctl() from a task-isolated userspace process, we'd probably need to

You mean currently in your patch we can already do the prctl from a 3rd party
process to task-isolate a userspace process? Sorry that I didn't notice that
part.


Sorry, I think I wasn't clear.  Normally when you are running task isolated
and you enter the kernel, you will get a fatal signal.  The exception is if you
call prctl itself (or exit): the kernel tolerates it without a signal, since obviously
that's how you need to cleanly tell the kernel you are done with task isolation.

My point in the previous email was that we might need to similarly tolerate
a guest exit without causing a fatal signal to the userspace process.  But as
I think about it, that's probably not true; we probably would want to notify
the guest kernel of the task isolation violation and have it kill the userspace
process just as if it had entered the guest kernel.

Perhaps the way to drive this is to have task isolation be triggered from
the guest's prctl up to the host, so there's some kind of KVM exit to
the host that indicates that the guest has a userspace process that
wants to run task isolated, at which point qemu invokes task isolation
on behalf of the guest then returns to the guest to set up its own
virtualized task isolation.  It does get confusing!

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



[GIT PULL] arch/tile changes for 4.10

2016-12-16 Thread Chris Metcalf

Linus,

Please pull the following changes for 4.10 from:

   git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git master

Another grab-bag of miscellaneous changes.

Chris Metcalf (2):
  tile: remove #pragma unroll from finv_buffer_remote()
  tile: use __ro_after_init instead of tile-specific __write_once

Colin Ian King (1):
  tile/pci_gx: fix spelling mistake: "delievered" -> "delivered"

Markus Elfring (2):
  tile-module: Use kmalloc_array() in module_alloc()
  tile-module: Rename jump labels in module_alloc()

Paul Gortmaker (1):
  tile: migrate exception table users off module.h and onto extable.h

 arch/tile/include/asm/cache.h|  7 ++-
 arch/tile/include/asm/sections.h |  3 ---
 arch/tile/kernel/module.c| 11 +--
 arch/tile/kernel/pci.c   |  2 +-
 arch/tile/kernel/pci_gx.c|  2 +-
 arch/tile/kernel/setup.c | 18 +-
 arch/tile/kernel/smp.c   |  2 +-
 arch/tile/kernel/time.c  |  4 ++--
 arch/tile/kernel/unaligned.c |  2 +-
 arch/tile/lib/cacheflush.c   |  8 +---
 arch/tile/mm/extable.c   |  2 +-
 arch/tile/mm/fault.c |  2 +-
 arch/tile/mm/homecache.c |  2 +-
 arch/tile/mm/init.c  | 10 +-
 14 files changed, 31 insertions(+), 44 deletions(-)

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [patch 5/6] [RFD] timekeeping: Provide optional 128bit math

2016-12-09 Thread Chris Metcalf

On 12/9/2016 5:18 AM, Peter Zijlstra wrote:

On Fri, Dec 09, 2016 at 07:38:47AM +0100, Peter Zijlstra wrote:


Turns out using GCC-6.2.1 we have the same problem on i386, GCC doesn't
recognise the 32x32 mults and generates crap.

This used to work :/

I tried:

gcc-4.4: good
gcc-4.6, gcc-4.8, gcc-5.4, gcc-6.2: bad


I also found 4.4 was good on tilegx at recognizing the 32x32, and bad on
the later versions I tested; I don't recall which specific later versions I
tried, though.

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [patch 5/6] [RFD] timekeeping: Provide optional 128bit math

2016-12-09 Thread Chris Metcalf

On 12/9/2016 3:30 AM, Peter Zijlstra wrote:

On Fri, Dec 09, 2016 at 07:38:47AM +0100, Peter Zijlstra wrote:

On Fri, Dec 09, 2016 at 06:26:38AM +0100, Peter Zijlstra wrote:

Just for giggles, on tilegx the branch is actually slower than doing the
mult unconditionally.

The problem is that the two multiplies would otherwise completely
pipeline, whereas with the conditional you serialize them.

On my Haswell laptop the unconditional version is faster too.

Only when using x86_64 instructions, once I fixed the i386 variant it
was slower, probably due to register pressure and the like.


(came to light while talking about why the mul_u64_u32_shr() fallback
didn't work right for them, which was a combination of the above issue
and the fact that their compiler 'lost' the fact that these are
32x32->64 mults and did 64x64 ones instead).

Turns out using GCC-6.2.1 we have the same problem on i386, GCC doesn't
recognise the 32x32 mults and generates crap.

This used to work :/

Do we want something like so?

---
  arch/tile/include/asm/Kbuild  |  1 -
  arch/tile/include/asm/div64.h | 14 ++
  arch/x86/include/asm/div64.h  | 10 ++
  include/linux/math64.h| 26 ++
  4 files changed, 42 insertions(+), 9 deletions(-)


Untested, but I looked at it closely, and it seems like a decent idea.

Acked-by: Chris Metcalf <cmetc...@mellanox.com> [for tile]

Of course if this is pushed up, it will then probably be too tempting for me not
to add the tilegx-specific mul_u64_u32_shr() to take advantage of pipelining
the two 32x32->64 multiplies :-)

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


