Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode

2015-05-15 Thread Chris Metcalf

On 05/12/2015 06:23 PM, Andy Lutomirski wrote:
> On May 13, 2015 6:06 AM, "Chris Metcalf" wrote:
>> On 05/11/2015 06:28 PM, Andy Lutomirski wrote:
>>> On Mon, May 11, 2015 at 12:13 PM, Chris Metcalf wrote:
>>>> In this case, killing the task is appropriate, since that's exactly
>>>> the semantics that have been asked for - it's like on architectures
>>>> that don't natively support unaligned accesses, but fake it relatively
>>>> slowly in the kernel, and in development you just say "give me a
>>>> SIGBUS when that happens" and in production you might say
>>>> "fix it up and let's try to keep going".
>>>
>>> I think more control is needed.  I also think that, if we go this
>>> route, we should distinguish syscalls, synchronous non-syscall
>>> entries, and asynchronous non-syscall entries.  They're quite
>>> different.
>>
>> I don't think it's necessary to distinguish the types.  As long as we
>> have a PC pointing to the instruction that triggered the problem,
>> we can see if it's a system call instruction, a memory write that
>> caused a page fault, a trap instruction, etc.
>
> Not true.  PC right after a syscall insn could be any type of kernel
> entry, and you can't even reliably tell whether the syscall insn was
> executed or, on x86, whether it was a syscall at all.  (x86 insns
> can't be reliably decoded backwards.)
>
> PC pointing at a load could be a page fault or an IPI.


All that we are trying to do with this API, though, is distinguish
synchronous faults.  So IPIs, etc., should not be happening
(they would be bugs), and hopefully we are mostly just
distinguishing different types of synchronous program entries.
That said, I did add a si_info flag to differentiate syscalls from other
synchronous entries, and I'm open to looking at more such if
it seems useful.

> Again, though, I think we really do need to distinguish at least MCE
> and NMI (on x86) from the others.


Yes, those are both interesting cases, and I'm not entirely
sure what the right way to handle them is - for example,
you would likely disable STRICT if you are running with perf enabled.

I look forward to hearing more when you're back next week!

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode

2015-05-12 Thread Andy Lutomirski
On May 13, 2015 6:06 AM, "Chris Metcalf"  wrote:
>
> On 05/11/2015 06:28 PM, Andy Lutomirski wrote:
>>
>> [add peterz due to perf stuff]
>>
>> On Mon, May 11, 2015 at 12:13 PM, Chris Metcalf  wrote:
>>>
>>> Patch 6/6 proposes a mechanism to track down times when the
>>> kernel screws up and delivers an IRQ to a userspace-only task.
>>> Here, we're just trying to identify the times when an application
>>> screws itself up out of cluelessness, and provide a mechanism
>>> that allows the developer to easily figure out why and fix it.
>>>
>>> In particular, /proc/interrupts won't show syscalls or page faults,
>>> which are two easy ways applications can screw themselves
>>> when they think they're in userspace-only mode.  Also, they don't
>>> provide sufficient precision to make it clear what part of the
>>> application caused the undesired kernel entry.
>>
>> Perf does, though, complete with context.
>
>
> The perf_event suggestions are interesting, but I think it's plausible
> for this to be an alternate way to debug the issues that STRICT
> addresses.
>
>
>>> In this case, killing the task is appropriate, since that's exactly
>>> the semantics that have been asked for - it's like on architectures
>>> that don't natively support unaligned accesses, but fake it relatively
>>> slowly in the kernel, and in development you just say "give me a
>>> SIGBUS when that happens" and in production you might say
>>> "fix it up and let's try to keep going".
>>
>> I think more control is needed.  I also think that, if we go this
>> route, we should distinguish syscalls, synchronous non-syscall
>> entries, and asynchronous non-syscall entries.  They're quite
>> different.
>
>
> I don't think it's necessary to distinguish the types.  As long as we
> have a PC pointing to the instruction that triggered the problem,
> we can see if it's a system call instruction, a memory write that
> caused a page fault, a trap instruction, etc.

Not true.  PC right after a syscall insn could be any type of kernel
entry, and you can't even reliably tell whether the syscall insn was
executed or, on x86, whether it was a syscall at all.  (x86 insns
can't be reliably decoded backwards.)

PC pointing at a load could be a page fault or an IPI.

> We certainly could
> add infrastructure to capture syscall numbers, fault/signal numbers,
> etc etc, but I think it's overkill if it adds kernel overhead on
> entry/exit.
>

None of these should add overhead.

>
>>> A better implementation, I think, is to put the tests for "you
>>> screwed up and synchronously entered the kernel" in
>>> the syscall_trace_enter() code, which TIF_NOHZ already
>>> gets us into;
>>
>> No, not unless you're planning on using that to distinguish syscalls
>> from other stuff *and* people think that's justified.
>
>
> So, the question is how we separate synchronous entries
> from IRQs?  At a high level, IRQs are kernel bugs (for cpu-isolated
> tasks), and synchronous entries are application bugs.  We'd
> like to deliver a signal for the latter, and do some kind of
> kernel diagnostics for the former.  So we can't just add the
> test in the context tracking code, which doesn't actually know
> why we're entering or exiting.

Synchronous entries could be VM bugs, too.

>
> That's why I was thinking that the syscall_trace_enter and
> exception_enter paths were the best choices.  I'm fairly sure
> that exception_enter is only done for synchronous traps,
> page faults, etc.

Maybe.  Doing it through the actual entry/exit slow paths would be
overhead-free, although I'm not sure that IRQs have real slow paths
for entry.

>
> Certainly on the tile architecture we include the trap number
> in the pt_regs, so it's possible to just examine the pt_regs and
> know why you entered or are exiting the kernel, but I don't
> think we can rely on that for all architectures.

x86 can't do this.

> I'll put out a v2 of my patch that does both the things you
> advise against :-) just so we can have a strawman to think
> about how to do it better - unless you have a suggestion
> offhand as to how we can better differentiate sync and async
> entries into the kernel in a platform-independent way.
>
> I could imagine modifying user_exit() and exception_enter()
> to pass an identifier into the context system saying why they
> were changing contexts, so we could have syscalls, trap
> numbers, fault numbers, etc., and some way to query as
> to whether they were synchronous or asynchronous, and
> build this scheme on top of that, but I'm not sure the extra
> infrastructure is worthwhile.
>

I'll take a look.

Again, though, I think we really do need to distinguish at least MCE
and NMI (on x86) from the others.
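
A minimal userspace sketch of the classification being argued over here, assuming the entry paths could pass a reason identifier as Chris suggests. Every name below (entry_reason, entry_class, classify_entry, entry_should_signal_task) is hypothetical and does not come from the patch series; the policy encoded is only the one stated in the thread: synchronous entries are treated as application bugs and signalled, asynchronous ones as kernel bugs, with NMI and MCE kept apart.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical identifiers: none of these names come from the patch series. */
enum entry_reason {
    ENTRY_SYSCALL,
    ENTRY_PAGE_FAULT,
    ENTRY_TRAP,
    ENTRY_IRQ,
    ENTRY_IPI,
    ENTRY_NMI,
    ENTRY_MCE,
};

enum entry_class {
    CLASS_SYSCALL,   /* synchronous, via a syscall instruction */
    CLASS_SYNC,      /* synchronous, non-syscall (faults, traps) */
    CLASS_ASYNC,     /* asynchronous (IRQ, IPI) */
    CLASS_NMI_LIKE,  /* NMI and MCE, kept separate as asked for in the thread */
};

static enum entry_class classify_entry(enum entry_reason reason)
{
    switch (reason) {
    case ENTRY_SYSCALL:
        return CLASS_SYSCALL;
    case ENTRY_PAGE_FAULT:
    case ENTRY_TRAP:
        return CLASS_SYNC;
    case ENTRY_IRQ:
    case ENTRY_IPI:
        return CLASS_ASYNC;
    case ENTRY_NMI:
    case ENTRY_MCE:
        return CLASS_NMI_LIKE;
    }
    assert(!"unknown entry reason");
    return CLASS_ASYNC;
}

/*
 * Synchronous entries are application bugs and would get a signal;
 * everything else would go to kernel-side diagnostics instead.
 */
static bool entry_should_signal_task(enum entry_reason reason)
{
    enum entry_class c = classify_entry(reason);

    return c == CLASS_SYSCALL || c == CLASS_SYNC;
}
```

Recording the reason at entry time, rather than inferring it from the PC afterwards, is what sidesteps the ambiguity described above.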

>
>> What if we added a mode to perf where delivery of a sample
>> synchronously (or semi-synchronously by catching it on the next exit
>> to userspace) freezes the delivering task?  It would be like debugger
>> support via perf.
>>
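>> peterz, do you think this would be a sensible thing to add to perf?
>> It would only make sense for some types of events (tracepoints and
>> hw_breakpoints mostly, I think).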

Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode

2015-05-12 Thread Chris Metcalf

On 05/11/2015 06:28 PM, Andy Lutomirski wrote:
> [add peterz due to perf stuff]
>
> On Mon, May 11, 2015 at 12:13 PM, Chris Metcalf wrote:
>> Patch 6/6 proposes a mechanism to track down times when the
>> kernel screws up and delivers an IRQ to a userspace-only task.
>> Here, we're just trying to identify the times when an application
>> screws itself up out of cluelessness, and provide a mechanism
>> that allows the developer to easily figure out why and fix it.
>>
>> In particular, /proc/interrupts won't show syscalls or page faults,
>> which are two easy ways applications can screw themselves
>> when they think they're in userspace-only mode.  Also, they don't
>> provide sufficient precision to make it clear what part of the
>> application caused the undesired kernel entry.
>
> Perf does, though, complete with context.


The perf_event suggestions are interesting, but I think it's plausible
for this to be an alternate way to debug the issues that STRICT
addresses.


>> In this case, killing the task is appropriate, since that's exactly
>> the semantics that have been asked for - it's like on architectures
>> that don't natively support unaligned accesses, but fake it relatively
>> slowly in the kernel, and in development you just say "give me a
>> SIGBUS when that happens" and in production you might say
>> "fix it up and let's try to keep going".
>
> I think more control is needed.  I also think that, if we go this
> route, we should distinguish syscalls, synchronous non-syscall
> entries, and asynchronous non-syscall entries.  They're quite
> different.


I don't think it's necessary to distinguish the types.  As long as we
have a PC pointing to the instruction that triggered the problem,
we can see if it's a system call instruction, a memory write that
caused a page fault, a trap instruction, etc.  We certainly could
add infrastructure to capture syscall numbers, fault/signal numbers,
etc etc, but I think it's overkill if it adds kernel overhead on
entry/exit.


>> A better implementation, I think, is to put the tests for "you
>> screwed up and synchronously entered the kernel" in
>> the syscall_trace_enter() code, which TIF_NOHZ already
>> gets us into;
>
> No, not unless you're planning on using that to distinguish syscalls
> from other stuff *and* people think that's justified.


So, the question is how we separate synchronous entries
from IRQs?  At a high level, IRQs are kernel bugs (for cpu-isolated
tasks), and synchronous entries are application bugs.  We'd
like to deliver a signal for the latter, and do some kind of
kernel diagnostics for the former.  So we can't just add the
test in the context tracking code, which doesn't actually know
why we're entering or exiting.

That's why I was thinking that the syscall_trace_enter and
exception_enter paths were the best choices.  I'm fairly sure
that exception_enter is only done for synchronous traps,
page faults, etc.

Certainly on the tile architecture we include the trap number
in the pt_regs, so it's possible to just examine the pt_regs and
know why you entered or are exiting the kernel, but I don't
think we can rely on that for all architectures.


> It's far too easy to just make a tiny change to the entry code.  Add a
> tiny trivial change here, a few lines of asm (that's you, audit!)
> there, some weird written-in-asm scheduling code over here, and you
> end up with the truly awful mess that we currently have.
>
> If it really makes sense for this stuff to go with context tracking,
> then fine, but we should *fix* the context tracking first rather than
> kludging around it.  I already have a prototype patch for the relevant
> part of that.
>
>> there, we can test if the dataplane "strict" bit is
>> set and the syscall is not prctl(), then we generate the error.
>> (We'd exclude exit and exit_group here too, since we don't
>> need to shoot down a task that's just trying to kill itself.)
>> This needs a bit of platform-specific code for each platform,
>> but that doesn't seem like too big a problem.
>
> I'd rather avoid that, too.  This feature isn't really arch-specific,
> so let's avoid the arch stuff if at all possible.
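
The syscall exemption described in the quoted text can be sketched in userspace C as below. This is an illustration under stated assumptions, not the patch's code: the function names are hypothetical, and the only facts taken from the thread are that prctl(), exit and exit_group would not trigger the error while STRICT is set. The syscall numbers come from <sys/syscall.h>.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/syscall.h>

/* Hypothetical helper: syscalls that STRICT mode would let through. */
static bool strict_syscall_exempt(long nr)
{
    return nr == SYS_prctl || nr == SYS_exit || nr == SYS_exit_group;
}

/* Hypothetical helper: would this syscall count as a STRICT violation? */
static bool strict_syscall_violation(bool strict_enabled, long nr)
{
    return strict_enabled && !strict_syscall_exempt(nr);
}

/* Usage: a write() under STRICT is a violation, a prctl() is not. */
static void strict_syscall_demo(void)
{
    assert(strict_syscall_violation(true, SYS_write));
    assert(!strict_syscall_violation(true, SYS_prctl));
}
```

Because the check needs only the syscall number, nothing in it is inherently architecture-specific, which is the point of Andy's objection.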


I'll put out a v2 of my patch that does both the things you
advise against :-) just so we can have a strawman to think
about how to do it better - unless you have a suggestion
offhand as to how we can better differentiate sync and async
entries into the kernel in a platform-independent way.

I could imagine modifying user_exit() and exception_enter()
to pass an identifier into the context system saying why they
were changing contexts, so we could have syscalls, trap
numbers, fault numbers, etc., and some way to query as
to whether they were synchronous or asynchronous, and
build this scheme on top of that, but I'm not sure the extra
infrastructure is worthwhile.


>> Likewise we can test in exception_enter() since that's only
>> called for all the synchronous user entries like page faults.
>
> Let's try to generalize a bit.  There's also irq_entry and ist_enter,
> and some of the exception_enter cases are for synchronous entries
> while (IIRC -- could be

Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode

2015-05-12 Thread Paul E. McKenney
On Tue, May 12, 2015 at 11:38:58AM +0200, Peter Zijlstra wrote:
> On Fri, May 08, 2015 at 01:58:46PM -0400, Chris Metcalf wrote:
> > +++ b/kernel/time/tick-sched.c
> > @@ -436,6 +436,20 @@ static void dataplane_quiesce(void)
> >                         (jiffies - start));
> >                 dump_stack();
> >         }
> > +
> > +       /*
> > +        * Kill the process if it violates STRICT mode.  Note that this
> > +        * code also results in killing the task if a kernel bug causes an
> > +        * irq to be delivered to this core.
> > +        */
> > +       if ((task->dataplane_flags & (PR_DATAPLANE_STRICT|PR_DATAPLANE_PRCTL))
> > +           == PR_DATAPLANE_STRICT) {
> > +               pr_warn("Dataplane STRICT mode violated; process killed.\n");
> > +               dump_stack();
> > +               task->dataplane_flags &= ~PR_DATAPLANE_QUIESCE;
> > +               local_irq_enable();
> > +               do_group_exit(SIGKILL);
> > +       }
> >  }
> 
> So while I'm all for hard fails like this, can we not provide a wee bit
> more information in the siginfo ? And maybe use a slightly less fatal
> signal, such that userspace can actually catch it and dump state in
> debug modes?

Agreed, a bit more debug state would be helpful.

Thanx, Paul
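
For readers following the hunk quoted above, its condition fires only when STRICT is set and the internal PRCTL bit is clear. A minimal sketch of that mask test in plain C follows; the numeric flag values and the function names are assumptions made for illustration, since the series defines the real values elsewhere.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed bit values, for illustration only. */
#define PR_DATAPLANE_QUIESCE (1u << 0)
#define PR_DATAPLANE_STRICT  (1u << 1)
#define PR_DATAPLANE_PRCTL   (1u << 2)  /* internal: set by the prctl() call itself */

/* Same shape as the test in the quoted hunk: STRICT set, PRCTL clear. */
static bool strict_mode_violated(unsigned int dataplane_flags)
{
    const unsigned int mask = PR_DATAPLANE_STRICT | PR_DATAPLANE_PRCTL;

    return (dataplane_flags & mask) == PR_DATAPLANE_STRICT;
}

/* Usage: returning from the prctl() that enabled STRICT is not a violation. */
static void strict_mode_demo(void)
{
    assert(strict_mode_violated(PR_DATAPLANE_STRICT));
    assert(!strict_mode_violated(PR_DATAPLANE_STRICT | PR_DATAPLANE_PRCTL));
}
```

Comparing the masked value against STRICT alone, rather than testing the STRICT bit on its own, is what lets the task return from the prctl() call that enabled the mode without being killed.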


Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode

2015-05-12 Thread Peter Zijlstra
On Fri, May 08, 2015 at 01:58:46PM -0400, Chris Metcalf wrote:
> +++ b/kernel/time/tick-sched.c
> @@ -436,6 +436,20 @@ static void dataplane_quiesce(void)
>                         (jiffies - start));
>                 dump_stack();
>         }
> +
> +       /*
> +        * Kill the process if it violates STRICT mode.  Note that this
> +        * code also results in killing the task if a kernel bug causes an
> +        * irq to be delivered to this core.
> +        */
> +       if ((task->dataplane_flags & (PR_DATAPLANE_STRICT|PR_DATAPLANE_PRCTL))
> +           == PR_DATAPLANE_STRICT) {
> +               pr_warn("Dataplane STRICT mode violated; process killed.\n");
> +               dump_stack();
> +               task->dataplane_flags &= ~PR_DATAPLANE_QUIESCE;
> +               local_irq_enable();
> +               do_group_exit(SIGKILL);
> +       }
>  }

So while I'm all for hard fails like this, can we not provide a wee bit
more information in the siginfo ? And maybe use a slightly less fatal
signal, such that userspace can actually catch it and dump state in
debug modes?
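
What a catchable signal with siginfo would look like from the application side can be sketched as follows. This is a userspace illustration only: SIGUSR1 and the integer payload are stand-ins chosen here, the thread does not settle on a signal or on siginfo contents, and sigqueue() to the process itself merely simulates the kernel's delivery. The handler and helper names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Filled in by the handler; sig_atomic_t keeps the stores signal-safe. */
static volatile sig_atomic_t last_signo;
static volatile sig_atomic_t last_code;
static volatile sig_atomic_t last_value;

/* SA_SIGINFO handler: record what the sender reported. */
static void on_violation(int signo, siginfo_t *info, void *ucontext)
{
    (void)ucontext;
    last_signo = signo;
    last_code = info->si_code;
    last_value = info->si_value.sival_int;
}

static int install_violation_handler(int signo)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_violation;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Stand-in for the kernel: queue the signal to ourselves with a payload. */
static int simulate_violation(int signo, int payload)
{
    union sigval v;

    v.sival_int = payload;
    return sigqueue(getpid(), signo, v);
}

/* Usage: install, trigger once, then inspect the recorded siginfo. */
static void violation_demo(void)
{
    int rc = install_violation_handler(SIGUSR1);

    assert(rc == 0);
    rc = simulate_violation(SIGUSR1, 42);
    assert(rc == 0);
    (void)rc;
    assert(last_signo == SIGUSR1);
    assert(last_code == SI_QUEUE);
    assert(last_value == 42);
}
```

A debug build could dump state from such a handler, which SIGKILL rules out because it cannot be caught.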

Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode

2015-05-11 Thread Andy Lutomirski
[add peterz due to perf stuff]

On Mon, May 11, 2015 at 12:13 PM, Chris Metcalf  wrote:
> On 05/09/2015 03:28 AM, Andy Lutomirski wrote:
>>
>> On May 8, 2015 11:44 PM, "Chris Metcalf"  wrote:
>>>
>>> With QUIESCE mode, the task is in principle guaranteed not to be
>>> interrupted by the kernel, but only if it behaves.  In particular,
>>> if it enters the kernel via system call, page fault, or any of
>>> a number of other synchronous traps, it may be unexpectedly
>>> exposed to long latencies.  Add a simple flag that puts the process
>>> into a state where any such kernel entry is fatal.
>>>
>>> To allow the state to be entered and exited, we add an internal
>>> bit to current->dataplane_flags that is set when prctl() sets the
>>> flags.  That way, when we are exiting the kernel after calling
>>> prctl() to forbid future kernel exits, we don't get immediately
>>> killed.
>>
>> Is there any reason this can't already be addressed in userspace using
>> /proc/interrupts or perf_events?  ISTM the real goal here is to detect
>> when we screw up and fail to avoid an interrupt, and killing the task
>> seems like overkill to me.
>
>
> Patch 6/6 proposes a mechanism to track down times when the
> kernel screws up and delivers an IRQ to a userspace-only task.
> Here, we're just trying to identify the times when an application
> screws itself up out of cluelessness, and provide a mechanism
> that allows the developer to easily figure out why and fix it.
>
> In particular, /proc/interrupts won't show syscalls or page faults,
> which are two easy ways applications can screw themselves
> when they think they're in userspace-only mode.  Also, they don't
> provide sufficient precision to make it clear what part of the
> application caused the undesired kernel entry.

Perf does, though, complete with context.
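
As a rough illustration of the perf route, here is a minimal user-space sketch that counts the page faults a piece of code takes by opening a per-thread software counter with perf_event_open(2). The helper names (open_page_fault_counter, count_page_faults, touch_fresh_page) are made up for this example and are not part of the patch series; the counter may be unavailable depending on perf_event_paranoid or sandboxing, in which case the helper reports -1.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open a software page-fault counter for the calling thread; fd or -1. */
int open_page_fault_counter(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_SOFTWARE;
    attr.config = PERF_COUNT_SW_PAGE_FAULTS;
    attr.disabled = 1;        /* start stopped; enabled explicitly below */
    attr.exclude_kernel = 1;  /* only count faults taken from user mode */
    attr.exclude_hv = 1;

    /* pid = 0, cpu = -1: this thread, on whichever CPU it runs. */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
}

/* Run fn(arg); return the page faults counted meanwhile, or -1 on error. */
long long count_page_faults(void (*fn)(void *), void *arg)
{
    uint64_t count = 0;
    int fd = open_page_fault_counter();

    if (fd < 0)
        return -1;
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    fn(arg);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count)) {
        close(fd);
        return -1;
    }
    close(fd);
    return (long long)count;
}

/* Example workload: map one anonymous page and write to it once. */
void touch_fresh_page(void *unused)
{
    char *p;

    (void)unused;
    p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p != MAP_FAILED) {
        *(volatile char *)p = 1;  /* first write faults the page in */
        munmap(p, 4096);
    }
}
```

Calling count_page_faults(touch_fresh_page, NULL) around a supposedly kernel-free loop gives a count rather than a kill; getting the faulting context as well would need sampling mode, which this sketch does not attempt.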

>
> In this case, killing the task is appropriate, since that's exactly
> the semantics that have been asked for - it's like on architectures
> that don't natively support unaligned accesses, but fake it relatively
> slowly in the kernel, and in development you just say "give me a
> SIGBUS when that happens" and in production you might say
> "fix it up and let's try to keep going".
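
The unaligned-access analogy corresponds to an existing interface: prctl(2) with PR_SET_UNALIGN, which selects between PR_UNALIGN_SIGBUS and PR_UNALIGN_NOPRINT (silent fix-up). The sketch below wraps it in a hypothetical helper, set_unaligned_policy; only some architectures implement the option, so an EINVAL return is expected elsewhere (x86, for example).

```c
#include <errno.h>
#include <sys/prctl.h>

/*
 * strict != 0: ask for SIGBUS on unaligned user accesses (development).
 * strict == 0: ask the kernel to fix them up silently (production).
 * Returns 0 on success, or the errno value from prctl() on failure.
 */
int set_unaligned_policy(int strict)
{
    unsigned long mode = strict ? PR_UNALIGN_SIGBUS : PR_UNALIGN_NOPRINT;

    if (prctl(PR_SET_UNALIGN, mode, 0, 0, 0) == 0)
        return 0;
    return errno;
}
```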

I think more control is needed.  I also think that, if we go this
route, we should distinguish syscalls, synchronous non-syscall
entries, and asynchronous non-syscall entries.  They're quite
different.
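
One way to picture the three-way distinction is a small classification plus a policy function. Everything here is hypothetical (the enum, its names, and the policy that only task-caused entries count as violations); it only sketches the idea and is not code from the series.

```c
#include <stdbool.h>

/* Hypothetical classification of how a task ended up in the kernel. */
enum kernel_entry_kind {
    ENTRY_SYSCALL,     /* the task executed a system call */
    ENTRY_SYNC_FAULT,  /* page fault or trap caused by the task's own code */
    ENTRY_ASYNC,       /* IRQ, IPI, NMI, machine check: not the task's doing */
};

/*
 * Assumed policy: only entries the task caused itself are violations of
 * strict mode; asynchronous entries are left to other diagnostics.
 */
bool entry_violates_strict_mode(enum kernel_entry_kind kind)
{
    switch (kind) {
    case ENTRY_SYSCALL:
    case ENTRY_SYNC_FAULT:
        return true;
    case ENTRY_ASYNC:
        return false;
    }
    return false;
}
```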

>
> You can argue that this is something that can be done by ftrace,
> but certainly you'd want to have a way to programmatically
> turn on ftrace at the moment when you're entering userspace-only
> mode, so we'd want some API around that anyway.  And honestly,
> it's so easy to test a task state bit in a couple of places and
> generate the failure on the spot, vs. the relative complexity
> of setting up and understanding ftrace, that I think it merits
> inclusion on that basis alone.

perf_event, not ftrace.

>
>> Also, can we please stop further torturing the exit paths?  We have a
>> disaster of assembly code that calls into syscall_trace_leave and
>> do_notify_resume.  Those functions, in turn, *both* call user_enter
>> (WTF?), and on very brief inspection user_enter makes it into the nohz
>> code through multiple levels of indirection, which, with these
>> patches, has yet another conditionally enabled helper, which does this
>> new stuff.  It's getting to be impossible to tell what happens when we
>> exit to user space any more.
>>
>> Also, I think your code is buggy.  There's no particular guarantee
>> that user_enter is only called once between sys_prctl and the final
>> exit to user mode (see the above WTF), so you might spuriously kill
>> the process.
>
>
> This is a good point; I also find the x86 kernel entry and exit
> paths confusing, although I've reviewed them a bunch of times.
> The tile architecture paths are a little easier to understand.
>
> That said, I think the answer here is to avoid non-idempotent
> actions in the dataplane code, such as clearing a syscall bit.
>
> A better implementation, I think, is to put the tests for "you
> screwed up and synchronously entered the kernel" in
> the syscall_trace_enter() code, which TIF_NOHZ already
> gets us into;

No, not unless you're planning on using that to distinguish syscalls
from other stuff *and* people think that's justified.

It's far too easy to just make a tiny change to the entry code.  Add a
tiny trivial change here, a few lines of asm (that's you, audit!)
there, some weird written-in-asm scheduling code over here, and you
end up with the truly awful mess that we currently have.

If it really makes sense for this stuff to go with context tracking,
then fine, but we should *fix* the context tracking first rather than
kludging around it.  I already have a prototype patch for the relevant
part of that.

> there, we can test if the dataplane "strict" bit is
> set and the syscall is not prctl(), then we generate the error.
> (We'd exclude exit and exit_group here too, 

Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode

2015-05-11 Thread Chris Metcalf

On 05/09/2015 03:28 AM, Andy Lutomirski wrote:

> On May 8, 2015 11:44 PM, "Chris Metcalf"  wrote:
>>
>> With QUIESCE mode, the task is in principle guaranteed not to be
>> interrupted by the kernel, but only if it behaves.  In particular,
>> if it enters the kernel via system call, page fault, or any of
>> a number of other synchronous traps, it may be unexpectedly
>> exposed to long latencies.  Add a simple flag that puts the process
>> into a state where any such kernel entry is fatal.
>>
>> To allow the state to be entered and exited, we add an internal
>> bit to current->dataplane_flags that is set when prctl() sets the
>> flags.  That way, when we are exiting the kernel after calling
>> prctl() to forbid future kernel exits, we don't get immediately
>> killed.
>
> Is there any reason this can't already be addressed in userspace using
> /proc/interrupts or perf_events?  ISTM the real goal here is to detect
> when we screw up and fail to avoid an interrupt, and killing the task
> seems like overkill to me.


Patch 6/6 proposes a mechanism to track down times when the
kernel screws up and delivers an IRQ to a userspace-only task.
Here, we're just trying to identify the times when an application
screws itself up out of cluelessness, and provide a mechanism
that allows the developer to easily figure out why and fix it.

In particular, /proc/interrupts won't show syscalls or page faults,
which are two easy ways applications can screw themselves
when they think they're in userspace-only mode.  Also, they don't
provide sufficient precision to make it clear what part of the
application caused the undesired kernel entry.

In this case, killing the task is appropriate, since that's exactly
the semantics that have been asked for - it's like on architectures
that don't natively support unaligned accesses, but fake it relatively
slowly in the kernel, and in development you just say "give me a
SIGBUS when that happens" and in production you might say
"fix it up and let's try to keep going".

You can argue that this is something that can be done by ftrace,
but certainly you'd want to have a way to programmatically
turn on ftrace at the moment when you're entering userspace-only
mode, so we'd want some API around that anyway.  And honestly,
it's so easy to test a task state bit in a couple of places and
generate the failure on the spot, vs. the relative complexity
of setting up and understanding ftrace, that I think it merits
inclusion on that basis alone.


> Also, can we please stop further torturing the exit paths?  We have a
> disaster of assembly code that calls into syscall_trace_leave and
> do_notify_resume.  Those functions, in turn, *both* call user_enter
> (WTF?), and on very brief inspection user_enter makes it into the nohz
> code through multiple levels of indirection, which, with these
> patches, has yet another conditionally enabled helper, which does this
> new stuff.  It's getting to be impossible to tell what happens when we
> exit to user space any more.
>
> Also, I think your code is buggy.  There's no particular guarantee
> that user_enter is only called once between sys_prctl and the final
> exit to user mode (see the above WTF), so you might spuriously kill
> the process.


This is a good point; I also find the x86 kernel entry and exit
paths confusing, although I've reviewed them a bunch of times.
The tile architecture paths are a little easier to understand.

That said, I think the answer here is to avoid non-idempotent
actions in the dataplane code, such as clearing a syscall bit.

A better implementation, I think, is to put the tests for "you
screwed up and synchronously entered the kernel" in
the syscall_trace_enter() code, which TIF_NOHZ already
gets us into; there, we can test if the dataplane "strict" bit is
set and the syscall is not prctl(), then we generate the error.
(We'd exclude exit and exit_group here too, since we don't
need to shoot down a task that's just trying to kill itself.)
This needs a bit of platform-specific code for each platform,
but that doesn't seem like too big a problem.
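
The check described above could look roughly like the following user-space model. The real test would sit in each architecture's syscall_trace_enter() path; the function name and the DATAPLANE_STRICT_FLAG value are placeholders invented for this sketch, not the values used by the patch.

```c
#include <stdbool.h>
#include <sys/syscall.h>

/* Hypothetical stand-in for the PR_DATAPLANE_STRICT bit. */
#define DATAPLANE_STRICT_FLAG 0x2u

/*
 * Return true if making syscall `nr` should count as a strict-mode
 * violation for a task whose dataplane flags are `dataplane_flags`.
 */
bool strict_syscall_violation(unsigned int dataplane_flags, long nr)
{
    if (!(dataplane_flags & DATAPLANE_STRICT_FLAG))
        return false;          /* strict mode not requested */

    switch (nr) {
    case SYS_prctl:            /* needed to leave strict mode again */
    case SYS_exit:             /* the task is going away anyway */
    case SYS_exit_group:
        return false;
    default:
        return true;
    }
}
```

Because the function only reads its inputs, calling it more than once on the same entry gives the same answer, which is the idempotence property argued for earlier in this message.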

Likewise we can test in exception_enter() since that's only
called for all the synchronous user entries like page faults.


> Also, I think that most users will be quite surprised if "strict
> dataplane" code causes any machine check on the system to kill your
> dataplane task.


Fair point, and avoided by testing as described above instead.
(Though presumably in development it's not such a big deal,
and as I said you'd likely turn it off in production.)


> Similarly, a user accidentally running perf record -a
> probably should have some reasonable semantics.


Yes, also avoided by doing this as above, though I'd argue we
could also just say that running perf disables this mode.
But it's not as clean as the above suggestion.

On 05/09/2015 06:37 AM, Gilad Ben Yossef wrote:

> So, I don't know if it is a practical suggestion or not, but would it be
> better/easier to mark a pending signal on kernel entry for this case?
> The upside I see is that the user gets her notification

RE: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode

2015-05-09 Thread Gilad Ben Yossef
> From: Andy Lutomirski [mailto:l...@amacapital.net]
> Sent: Saturday, May 09, 2015 10:29 AM
> To: Chris Metcalf
> Cc: Srivatsa S. Bhat; Paul E. McKenney; Frederic Weisbecker; Ingo Molnar;
> Rik van Riel; linux-...@vger.kernel.org; Andrew Morton; linux-
> ker...@vger.kernel.org; Thomas Gleixner; Tejun Heo; Peter Zijlstra; Steven
> Rostedt; Christoph Lameter; Gilad Ben Yossef; Linux API
> Subject: Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode
> 
> On May 8, 2015 11:44 PM, "Chris Metcalf"  wrote:
> >
> > With QUIESCE mode, the task is in principle guaranteed not to be
> > interrupted by the kernel, but only if it behaves.  In particular,
> > if it enters the kernel via system call, page fault, or any of
> > a number of other synchronous traps, it may be unexpectedly
> > exposed to long latencies.  Add a simple flag that puts the process
> > into a state where any such kernel entry is fatal.
> >
> > To allow the state to be entered and exited, we add an internal
> > bit to current->dataplane_flags that is set when prctl() sets the
> > flags.  That way, when we are exiting the kernel after calling
> > prctl() to forbid future kernel exits, we don't get immediately
> > killed.
> 
> Is there any reason this can't already be addressed in userspace using
> /proc/interrupts or perf_events?  ISTM the real goal here is to detect
> when we screw up and fail to avoid an interrupt, and killing the task
> seems like overkill to me.
> 
> Also, can we please stop further torturing the exit paths?  
So, I don't know if it is a practical suggestion or not, but would it be
better/easier to mark a pending signal on kernel entry for this case?
The upside I see is that the user gets her notification (killing the task or
just logging the event in a signal handler) and hopefully since return to 
userspace with a pending signal is already handled we don't need new code in 
the exit path?
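
From the application side, "just logging the event in a signal handler" could look like the sketch below. Which signal the kernel would send and what it would put in siginfo is not settled in this thread, so the signal number is left as a parameter and the helper names (install_violation_handler, violations_seen) are invented for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t violation_count;
static volatile sig_atomic_t last_si_code;

/* Async-signal-safe handler: count the event and write a fixed message. */
static void on_violation(int signo, siginfo_t *info, void *ucontext)
{
    static const char msg[] = "dataplane: unexpected kernel entry\n";
    ssize_t n;

    (void)signo;
    (void)ucontext;
    violation_count++;
    last_si_code = info->si_code;  /* whatever detail the sender supplied */
    n = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)n;
}

/* Install the handler for `signo`; returns 0 on success, -1 on error. */
int install_violation_handler(int signo)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_violation;
    sa.sa_flags = SA_SIGINFO;      /* ask for the siginfo_t argument */
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Number of notifications received so far. */
int violations_seen(void)
{
    return (int)violation_count;
}
```

A debug build could install this for whatever catchable signal ends up being chosen, while a production build leaves the default disposition in place so the task still dies.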

Gilad

Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode

2015-05-09 Thread Andy Lutomirski
On May 8, 2015 11:44 PM, "Chris Metcalf"  wrote:
>
> With QUIESCE mode, the task is in principle guaranteed not to be
> interrupted by the kernel, but only if it behaves.  In particular,
> if it enters the kernel via system call, page fault, or any of
> a number of other synchronous traps, it may be unexpectedly
> exposed to long latencies.  Add a simple flag that puts the process
> into a state where any such kernel entry is fatal.
>
> To allow the state to be entered and exited, we add an internal
> bit to current->dataplane_flags that is set when prctl() sets the
> flags.  That way, when we are exiting the kernel after calling
> prctl() to forbid future kernel exits, we don't get immediately
> killed.

Is there any reason this can't already be addressed in userspace using
/proc/interrupts or perf_events?  ISTM the real goal here is to detect
when we screw up and fail to avoid an interrupt, and killing the task
seems like overkill to me.

Also, can we please stop further torturing the exit paths?  We have a
disaster of assembly code that calls into syscall_trace_leave and
do_notify_resume.  Those functions, in turn, *both* call user_enter
(WTF?), and on very brief inspection user_enter makes it into the nohz
code through multiple levels of indirection, which, with these
patches, has yet another conditionally enabled helper, which does this
new stuff.  It's getting to be impossible to tell what happens when we
exit to user space any more.

Also, I think your code is buggy.  There's no particular guarantee
that user_enter is only called once between sys_prctl and the final
exit to user mode (see the above WTF), so you might spuriously kill
the process.
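
To make the failure mode concrete, here is a toy user-space model of the scheme as described in the patch text: prctl() sets an internal bit, and the exit hook consumes it. The struct, flag names, and functions are all hypothetical and greatly simplified; the point is only that a hook which clears state on its first call misfires if the exit path runs it twice.

```c
#include <stdbool.h>

#define FLAG_STRICT        0x1u  /* hypothetical: strict mode requested */
#define FLAG_PRCTL_PENDING 0x2u  /* hypothetical internal bit set by prctl() */

struct task_model {
    unsigned int flags;
    bool killed;
};

/* Non-idempotent exit hook: forgives one return to user space, then kills. */
void user_enter_hook(struct task_model *t)
{
    if (!(t->flags & FLAG_STRICT))
        return;
    if (t->flags & FLAG_PRCTL_PENDING)
        t->flags &= ~FLAG_PRCTL_PENDING;  /* first call: this was the prctl() */
    else
        t->killed = true;                 /* later call: looks like a violation */
}

/* Model one return from prctl() on which the hook runs `calls` times. */
bool killed_after_prctl(int calls)
{
    struct task_model t = { FLAG_STRICT | FLAG_PRCTL_PENDING, false };

    for (int i = 0; i < calls; i++)
        user_enter_hook(&t);
    return t.killed;
}
```

With one hook invocation the task survives its own prctl(); with two invocations on the same exit path it is killed even though it did nothing wrong.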

Also, I think that most users will be quite surprised if "strict
dataplane" code causes any machine check on the system to kill your
dataplane task.  Similarly, a user accidentally running perf record -a
probably should have some reasonable semantics.  /proc/interrupts gets
that right as is.  Sure, MCEs will hurt your RT performance, but Intel
screwed up the way that MCEs work, so we should make do.

--Andy

