Re: [PATCH v16 00/13] support "task_isolation" mode

2018-03-07 Thread Yury Norov
Hi Chris,

(CC Cavium people)

Thanks for your series.

On Fri, Nov 03, 2017 at 01:04:39PM -0400, Chris Metcalf wrote:
> Here, finally, is a new spin of the task isolation work (v16), with
> changes based on the issues that were raised at last year's Linux
> Plumbers Conference and in the email discussion that followed.
> 
> This version of the patch series cleans up a number of areas that were
> a little dodgy in the previous patch series.
> 
> - It no longer loops in the final code that prepares to return to
>   userspace; instead, it sets things up in the prctl() and then
>   validates when preparing to return to userspace, adjusting the
>   syscall return value to -EAGAIN at that point if something doesn't
>   line up quite correctly.
> 
> - We no longer support the NOSIG mode that let you freely call into
>   the kernel multiple times while in task isolation.  This was always
>   a little odd, since you really should be in sufficient control of
>   task isolation code that you can explicitly stop isolation with a
>   "prctl(PR_TASK_ISOLATION, 0)" before using the kernel for anything
>   else.  It also made the semantics of migrating the task confusing.
>   More importantly, removing that support means that the only path
>   that sets up task isolation is the return from prctl(), which allows
>   us to make the simplification above.
> 
> - We no longer try to signal the task isolation process from a remote
>   core when we detect that we are about to violate its isolation.
>   Instead, we just print a message there (and optionally dump stack),
>   and rely on the eventual interrupt on the core itself to trigger the
>   signal.  We are always in a safe context to generate a signal when
>   we enter the kernel, unlike when we are deep in a call stack sending
>   an SMP IPI or whatever.
> 
> - We notice the case of an unstable scheduler clock and return
>   EINVAL rather than spinning forever with EAGAIN (suggestion from
>   Francis Giraldeau).
> 
> - The prctl() call requires zeros for arg3/4/5 (suggestion from
>   Eugene Syromiatnikov).
> 
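For orientation, here is a minimal user-space sketch of how the interface described in the list above might be used.  The PR_TASK_ISOLATION* constants are the ones proposed by this series and do not exist in mainline <sys/prctl.h>; the fallback values below are simply the ones the series used at the time, and the retry-on-EAGAIN loop reflects the new "validate on return to userspace" behavior:

/*
 * Usage sketch only: PR_TASK_ISOLATION and PR_TASK_ISOLATION_ENABLE are
 * proposed by this series; the fallback values are those the series used.
 */
#include <sys/prctl.h>
#include <errno.h>
#include <stdio.h>

#ifndef PR_TASK_ISOLATION
#define PR_TASK_ISOLATION        48
#define PR_TASK_ISOLATION_ENABLE (1 << 0)
#endif

static int enter_isolation(void)
{
        /* arg3/4/5 must be zero. */
        while (prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE, 0, 0, 0) != 0) {
                if (errno == EAGAIN)
                        continue;   /* core not fully quiesced yet; retry */
                return -1;          /* e.g. EINVAL for an unstable sched clock */
        }
        return 0;
}

int main(void)
{
        if (enter_isolation() != 0) {
                perror("prctl(PR_TASK_ISOLATION)");
                return 1;
        }

        /* ... isolated, kernel-free userspace work happens here ... */

        /* Explicitly leave isolation before using the kernel again. */
        prctl(PR_TASK_ISOLATION, 0, 0, 0, 0);
        return 0;
}
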
> The kernel internal isolation API is also now cleaner, and I have
> included kerneldoc APIs for all the interfaces so that it should be
> easier to port it to additional architectures; in fact looking at
> include/linux/isolation.h is a good place to start understanding the
> overall patch set.
> 
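To make the "set things up in prctl(), validate on the way out, jam -EAGAIN if needed" shape concrete, here is a rough illustrative sketch.  The helper names and the TIF_TASK_ISOLATION bit are hypothetical placeholders, not the series' actual interfaces (those are the kerneldoc'd ones in include/linux/isolation.h); tick_nohz_tick_stopped() and syscall_set_return_value() are existing kernel helpers used only as example checks:

/*
 * Illustrative sketch only: task_isolation_prepare(),
 * task_isolation_check_return() and TIF_TASK_ISOLATION are hypothetical
 * stand-ins for the mechanism described above, not the series' API.
 */
#include <linux/sched.h>
#include <linux/thread_info.h>
#include <linux/ptrace.h>
#include <linux/tick.h>
#include <linux/errno.h>
#include <asm/syscall.h>

#define TIF_TASK_ISOLATION      22      /* hypothetical flag bit for this sketch */

/* prctl() time, with interrupts still enabled: do the heavyweight quiescing. */
static int task_isolation_prepare(void)
{
        /* e.g. drain LRU pagevecs, flush deferred vmstat work, ... */
        set_thread_flag(TIF_TASK_ISOLATION);
        return 0;
}

/*
 * Exit-to-userspace path: check once rather than looping.  If the core is
 * not quiet, fail the pending prctl() with -EAGAIN so userspace can retry.
 */
static void task_isolation_check_return(struct pt_regs *regs)
{
        if (!test_thread_flag(TIF_TASK_ISOLATION))
                return;

        if (!tick_nohz_tick_stopped()) {        /* one example "not quiet" test */
                clear_thread_flag(TIF_TASK_ISOLATION);
                syscall_set_return_value(current, regs, -EAGAIN, 0);
        }
}
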
> I removed Catalin's Reviewed-by for arm64, and Christoph's Tested-by
> for x86, since this version is sufficiently different to merit
> re-review and re-testing.
> 
> Note that this is not rebased on top of Frederic's recent housekeeping
> patch series, although it is largely orthogonal right now.  After
> Frederic's patch series lands, task isolation is enabled with
> "isolcpus=nohz,domain,CPUS".  We could add some shorthand for that
> ("isolcpus=full,CPUS"?) or just use it as-is.
> 
> The previous (v15) patch series is here:
> 
> https://lkml.kernel.org/r/1471382376-5443-1-git-send-email-cmetc...@mellanox.com
> 
> This patch series is available at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane
> 
> Some folks raised some good points at the LPC discussion and then in
> email discussions that followed.  Rather than trying to respond to
> everyone in a flurry of emails, I'll answer some questions here:
> 
> 
> Why not just instrument user_exit() to raise the isolation-lost signal?
> 
> Andy pointed me in this direction.  The advantage is that you catch
> *everything*, by definition.  There is a hook that can do this in the
> current patch set, but you have to #define DEBUG_TASK_ISOLATION
> manually to take advantage of it, because as written it has two issues:
> 
> 1. You can't actually exit the kernel with prctl(PR_TASK_ISOLATION,0)
>because the user_exit hook kills you first.
> 2. You lose the ability to get much better diagnostics by waiting
>until you are further into kernel entry and know what you're doing.
> 
> You could work around #2 in several ways, but #1 is harder.  I looked
> at x86 for a while, and although you could imagine this, you really
> want to generate a lost-isolation signal on any syscall that isn't
> that exact prctl(), and it's awkward to try to do all of that checking
> before user_exit().  Since in any case we do want to have the more
> specific probes at the various kernel entry points where we generate
> the diagnostics, I felt like it wasn't the right approach to enable
> as a compilation-time default.  I'm open to discussion on this though!
> 
> 
> Can't we do all the exit-to-userspace work with irqs disabled?
> 
> In fact, it turns out that you can do lru_add_drain() with irqs
> disabled, so that's what we're doing in the patch series now.
> 
> However, it doesn't seem possible to do the synchronous cancellation of
> the vmstat deferred work with irqs disabled, though if there's a way,
> it would be a little cleaner to do that; Christoph?  We can certainly
> update the statistics with 

Re: [PATCH v16 00/13] support "task_isolation" mode

2017-11-07 Thread Christopher Lameter
On Tue, 7 Nov 2017, Chris Metcalf wrote:

> > Presumably we have another context where we may be able to call into
> > the cleanup code with interrupts enabled.
>
> Right now for task isolation we run with interrupts enabled during the
> initial sys_prctl() call, and call quiet_vmstat_sync() there, which currently
> calls refresh_cpu_vm_stats(false).  In fact we could certainly pass "true"
> there instead (and probably should) since we can handle dealing with
> the pagesets at this time.  As we return to userspace we will test whether
> anything surprising happened with vmstat; if so, we jam an EAGAIN into
> the syscall result value, but if not, we will be in userspace and won't need
> to touch the vmstat counters until we next go back into the kernel.

If you do it too early and there is another page allocator action, then it
may repopulate the caches. Since this is only draining the
caches for remote node allocation, it may be rare and not that important.


Re: [PATCH v16 00/13] support "task_isolation" mode

2017-11-07 Thread Chris Metcalf

On 11/7/2017 12:10 PM, Christopher Lameter wrote:
> On Mon, 6 Nov 2017, Chris Metcalf wrote:
>
> > On 11/6/2017 10:38 AM, Christopher Lameter wrote:
> > > > What about that d*mn 1 Hz clock?
> > > >
> > > > It's still there, so this code still requires some further work before
> > > > it can actually get a process into long-term task isolation (without
> > > > the obvious one-line kernel hack).  Frederic suggested a while ago
> > > > forcing updates on cpustats was required as the last gating factor; do
> > > > we think that is still true?  Christoph was working on this at one
> > > > point - any progress from your point of view?
> > > Well if you still have the 1 HZ clock then you can simply defer the numa
> > > remote page cleanup of the page allocator to the time you execute
> > > that tick.
> >
> > We have to get rid of the 1 Hz tick, so we don't want to tie anything
> > else to it...
>
> Yes, we want to get rid of the 1 HZ tick, but the work on that could also
> include dealing with the remote page cleanup issue that we have deferred.
>
> Presumably we have another context where we may be able to call into
> the cleanup code with interrupts enabled.

Right now for task isolation we run with interrupts enabled during the
initial sys_prctl() call, and call quiet_vmstat_sync() there, which currently
calls refresh_cpu_vm_stats(false).  In fact we could certainly pass "true"
there instead (and probably should) since we can handle dealing with
the pagesets at this time.  As we return to userspace we will test whether
anything surprising happened with vmstat; if so, we jam an EAGAIN into
the syscall result value, but if not, we will be in userspace and won't need
to touch the vmstat counters until we next go back into the kernel.

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [PATCH v16 00/13] support "task_isolation" mode

2017-11-07 Thread Christopher Lameter
On Mon, 6 Nov 2017, Chris Metcalf wrote:

> On 11/6/2017 10:38 AM, Christopher Lameter wrote:
> > > What about that d*mn 1 Hz clock?
> > >
> > > It's still there, so this code still requires some further work before
> > > it can actually get a process into long-term task isolation (without
> > > the obvious one-line kernel hack).  Frederic suggested a while ago
> > > forcing updates on cpustats was required as the last gating factor; do
> > > we think that is still true?  Christoph was working on this at one
> > > point - any progress from your point of view?
> > Well if you still have the 1 HZ clock then you can simply defer the numa
> > remote page cleanup of the page allocator to the time you execute
> > that tick.
>
> We have to get rid of the 1 Hz tick, so we don't want to tie anything
> else to it...

Yes, we want to get rid of the 1 HZ tick, but the work on that could also
include dealing with the remote page cleanup issue that we have deferred.

Presumably we have another context where we may be able to call into
the cleanup code with interrupts enabled.



Re: [PATCH v16 00/13] support "task_isolation" mode

2017-11-06 Thread Christopher Lameter
On Fri, 3 Nov 2017, Chris Metcalf wrote:

> However, it doesn't seem possible to do the synchronous cancellation of
> the vmstat deferred work with irqs disabled, though if there's a way,
> it would be a little cleaner to do that; Christoph?  We can certainly
> update the statistics with interrupts disabled via
> refresh_cpu_vm_stats(false), but that's not sufficient.  For now, I

Calling quiet_vmstat() gives you the folding of the statistics without
the numa page allocator page cleanup. Leaving some pages in the queues
should not be that much of an issue.
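For readers not steeped in mm/vmstat.c, a short sketch of the distinction being drawn here; quiet_vmstat() is the existing mainline interface, refresh_cpu_vm_stats() is internal to mm/vmstat.c (described only in the comment), and quiet_vmstat_sync() mentioned earlier in the thread is the synchronous variant this series adds:

#include <linux/vmstat.h>

/*
 * Sketch of the trade-off described above (illustrative only):
 *
 *  - quiet_vmstat() folds this CPU's per-cpu vm counter diffs into the
 *    global counters but does not drain the per-cpu pagesets, so a few
 *    remote-node pages may stay queued -- usually tolerable.
 *
 *  - refresh_cpu_vm_stats(true), which is internal to mm/vmstat.c, would
 *    also drain those pagesets; that is what the prctl() path wants if it
 *    can afford to do it while interrupts are still enabled.
 */
static void quiesce_vmstat_for_isolation(void)
{
        /* Good enough when leaving a few queued remote pages is acceptable. */
        quiet_vmstat();
}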

> What about that d*mn 1 Hz clock?
>
> It's still there, so this code still requires some further work before
> it can actually get a process into long-term task isolation (without
> the obvious one-line kernel hack).  Frederic suggested a while ago
> forcing updates on cpustats was required as the last gating factor; do
> we think that is still true?  Christoph was working on this at one
> point - any progress from your point of view?

Well if you still have the 1 HZ clock then you can simply defer the numa
remote page cleanup of the page allocator to the time you execute
that tick.



