Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-24 Thread Ingo Molnar

* Jiri Kosina  wrote:

> On Sat, 21 Feb 2015, Ingo Molnar wrote:
> 
> > (It does have some other requirements, such as making 
> > all syscalls interruptible to a 'special' signalling 
> > method that only live patching triggers - even syscalls 
> > that are under the normal ABI uninterruptible, such as 
> > sys_sync().)
> 
> BTW I didn't really understand this -- could you please 
> elaborate what exactly do you propose to do here in your 
> "simplified" patching method (i.e. serializing everybody 
> at the kernel boundary) for TASK_UNINTERRUPTIBLE 
> processess?

So I'd try to separate out the two main categories of 
uninterruptible sleepers:

 - those who just serialize with other local CPUs/tasks 
   relatively quickly

 - those who are waiting for some potentially very long and
   open ended request. [such as IO, potentially network IO.]

I'd only touch the latter: a prominent example would be 
sys_sync(). I'd leave alone the myriads of other 
uninterruptible sleepers.

> But I didn't understand your claims regarding 
> uninterruptible sleeps in your paragraph above. 
> sys_sync() is one thing, that's just waiting 
> uninterruptibly for completion. But how about all the 
> mutex waitiers in TASK_UNINTERRUPTIBLE, for example?

I'd not touch those - unless they are waiting for something 
that will not be done by the time we park all tasks: for 
example NFS might have uninterruptible sleeps, and 
sys_sync() will potentially do IO for minutes.

I think it would be the exception, not the rule - but it 
would give us an approach that allows us to touch 'any' 
kernel code if its wait times are unreasonably long or open 
ended.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-24 Thread Ingo Molnar

* Jiri Kosina jkos...@suse.cz wrote:

 On Sat, 21 Feb 2015, Ingo Molnar wrote:
 
  (It does have some other requirements, such as making 
  all syscalls interruptible to a 'special' signalling 
  method that only live patching triggers - even syscalls 
  that are under the normal ABI uninterruptible, such as 
  sys_sync().)
 
 BTW I didn't really understand this -- could you please 
 elaborate what exactly do you propose to do here in your 
 simplified patching method (i.e. serializing everybody 
 at the kernel boundary) for TASK_UNINTERRUPTIBLE 
 processess?

So I'd try to separate out the two main categories of 
uninterruptible sleepers:

 - those who just serialize with other local CPUs/tasks 
   relatively quickly

 - those who are waiting for some potentially very long and
   open ended request. [such as IO, potentially network IO.]

I'd only touch the latter: a prominent example would be 
sys_sync(). I'd leave alone the myriads of other 
uninterruptible sleepers.

 But I didn't understand your claims regarding 
 uninterruptible sleeps in your paragraph above. 
 sys_sync() is one thing, that's just waiting 
 uninterruptibly for completion. But how about all the 
 mutex waitiers in TASK_UNINTERRUPTIBLE, for example?

I'd not touch those - unless they are waiting for something 
that will not be done by the time we park all tasks: for 
example NFS might have uninterruptible sleeps, and 
sys_sync() will potentially do IO for minutes.

I think it would be the exception, not the rule - but it 
would give us an approach that allows us to touch 'any' 
kernel code if its wait times are unreasonably long or open 
ended.

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-23 Thread Jiri Kosina
On Sat, 21 Feb 2015, Ingo Molnar wrote:

> (It does have some other requirements, such as making all 
> syscalls interruptible to a 'special' signalling method 
> that only live patching triggers - even syscalls that are 
> under the normal ABI uninterruptible, such as sys_sync().)

BTW I didn't really understand this -- could you please elaborate what 
exactly do you propose to do here in your "simplified" patching method 
(i.e. serializing everybody at the kernel boundary) for 
TASK_UNINTERRUPTIBLE processess?

That actually seems to be the most crucial problem to me in this respect. 
Other things are rather implementation details; no matter whether we are 
sending normal SIGCONT or SIGPATCHING with special semantics you have 
described above, at the end of the day we end up calling kick_process() 
for the task in question, and that makes both interruptible sleepers and 
CPU hogs go through the "checkpoint". SIGPATCHING would then be "just" an 
improvement of this, making sure that EINTR doesn't spuriously get leaked 
to userspace.

But I didn't understand your claims regarding uninterruptible sleeps in 
your paragraph above. sys_sync() is one thing, that's just waiting 
uninterruptibly for completion. But how about all the mutex waitiers in 
TASK_UNINTERRUPTIBLE, for example?

Thanks a lot,

-- 
Jiri Kosina
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-23 Thread Jiri Kosina
On Sat, 21 Feb 2015, Ingo Molnar wrote:

 (It does have some other requirements, such as making all 
 syscalls interruptible to a 'special' signalling method 
 that only live patching triggers - even syscalls that are 
 under the normal ABI uninterruptible, such as sys_sync().)

BTW I didn't really understand this -- could you please elaborate what 
exactly do you propose to do here in your simplified patching method 
(i.e. serializing everybody at the kernel boundary) for 
TASK_UNINTERRUPTIBLE processess?

That actually seems to be the most crucial problem to me in this respect. 
Other things are rather implementation details; no matter whether we are 
sending normal SIGCONT or SIGPATCHING with special semantics you have 
described above, at the end of the day we end up calling kick_process() 
for the task in question, and that makes both interruptible sleepers and 
CPU hogs go through the checkpoint. SIGPATCHING would then be just an 
improvement of this, making sure that EINTR doesn't spuriously get leaked 
to userspace.

But I didn't understand your claims regarding uninterruptible sleeps in 
your paragraph above. sys_sync() is one thing, that's just waiting 
uninterruptibly for completion. But how about all the mutex waitiers in 
TASK_UNINTERRUPTIBLE, for example?

Thanks a lot,

-- 
Jiri Kosina
SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-22 Thread Jiri Kosina

[ live-patching@ ML added to CC here as well ]

On Sun, 22 Feb 2015, Ingo Molnar wrote:

> > BTW how exactly do you envision this will work? Do I understand your 
> > proposal correctly that EINTR will be "handled" somewhere in the "live 
> > patching special signal handler" and then have the interrupted syscall 
> > restarted?
> 
> If you want to think about it in signal handling terms then it's a new 
> automatic in-kernel handler, which does not actually return back to 
> user-mode at all.
> 
> We can do it via the signals machinery (mainly to reuse all the existing 
> signal_pending() code in various syscalls), or via new TIF flags like 
> the user work machinery: the principle is the same: interrupt out of 
> syscall functions into a central place and restart them, and return to 
> user-space later on as if a single call had been performed.
> 
> This necessarily means some changes to syscalls, but not insurmountable 
> ones - and checkpoint/restore support would want to have similar changes 
> in any case so we can hit two birds with the same stone.

Understood. I think this by itself really would be a huge improvement on 
how to force tasks to converge to the new universe without confusing the 
userspace via artificial signals (i.e. even with the lazy migration 
aproaches, without full serialization checkpoint). As I said yesterday, I 
am pretty sure we'll experiment with this in kgraft. Thanks for bringing 
this idea up,

-- 
Jiri Kosina
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-22 Thread Ingo Molnar

* Jiri Kosina  wrote:

> On Sat, 21 Feb 2015, Ingo Molnar wrote:
> 
> > > Plus a lot of processes would see EINTR, causing more 
> > > havoc.
> > 
> > Parking threads safely in user mode does not require 
> > the propagation of syscall interruption to user-space.
> 
> BTW how exactly do you envision this will work? Do I 
> understand your proposal correctly that EINTR will be 
> "handled" somewhere in the "live patching special signal 
> handler" and then have the interrupted syscall restarted?

If you want to think about it in signal handling terms then 
it's a new automatic in-kernel handler, which does not 
actually return back to user-mode at all.

We can do it via the signals machinery (mainly to reuse all 
the existing signal_pending() code in various syscalls), or 
via new TIF flags like the user work machinery: the 
principle is the same: interrupt out of syscall functions 
into a central place and restart them, and return to 
user-space later on as if a single call had been performed.

This necessarily means some changes to syscalls, but not 
insurmountable ones - and checkpoint/restore support would 
want to have similar changes in any case so we can hit two 
birds with the same stone.

> Even without EINTR propagation to userspace, this would 
> make a lot of new syscall restarts that were not there 
> before, [...]

That's only a problem if you do system call restarts by 
restarting them via user-space system call restart handler 
- I'm not proposing that.

I'm suggesting a completely user-space transparent way to 
execute long lasting system calls in a smarter way. I.e. it 
would not be observable via strace either.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-22 Thread Jiri Kosina
On Sun, 22 Feb 2015, Ingo Molnar wrote:

> I am making specific technical arguments, but you attempted to redirect 
> my very specific arguments towards 'differences in philosophy' and 
> 'where to draw the line'. Lets discuss the merits and brush them aside 
> as 'philosophical differences' or a made up category of 'consistency 
> models'.
> 
> Anyway, let me try to reboot this discussion back to technological 
> details by summing up my arguments in another mail.

Sounds good, thanks. Don't get me wrong -- I am not really opposing your 
solution (because it's of course quite close to what kgraft is doing :p ), 
but I'd like to make sure we understand each other well.

I still have to think a little bit more about how to handle kthreads 
properly even in your proposed solution (i.e. how exactly is it superior 
to what kgraft is currently doing in this respect). We'd have to go 
through all them anyway, and make them parkable, right?

Thanks,

-- 
Jiri Kosina
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-22 Thread Jiri Kosina
On Sat, 21 Feb 2015, Ingo Molnar wrote:

> > Plus a lot of processes would see EINTR, causing more havoc.
> 
> Parking threads safely in user mode does not require the propagation of 
> syscall interruption to user-space.

BTW how exactly do you envision this will work? Do I understand your 
proposal correctly that EINTR will be "handled" somewhere in the "live 
patching special signal handler" and then have the interrupted syscall 
restarted?

Even without EINTR propagation to userspace, this would make a lot of new 
syscall restarts that were not there before, and I am still to be 
convinced that this is something we are not going to cause a lot of new 
user-visible breakage with.

Yes, the breakage would be caused kernel bugs (I mostly envision drivers 
to be problematic in this area) that would be nice to have fixed, but the 
user experience that will come out of it will be just completely horrible.

Or did I misunderstand what you are really proposing?

Thanks,

-- 
Jiri Kosina
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-22 Thread Ingo Molnar

* Jiri Kosina  wrote:

> > > Or think of kernel that has some 3rd party vendor 
> > > module loaded, and this module spawning a ktrehad 
> > > that is not capable of parking itself.
> > 
> > The kernel will refuse to live patch until the module 
> > is fixed. It is a win by any measure.
> 
> Depends on your point of view. If you are an 
> administrator of the vulnerable system, you have no to 
> very little clue why patching your system is not possible 
> (and what you should to to have it fixed in the coming 
> days, before your system is taken down by attackers, for 
> example), and would be really sad to be forced to run 
> with unpatched system.

I don't see what precise argument you are making here. That 
we are unable to fix possible bugs in binary only modules? 
News at 11.

Or are you making the argument that we should design our 
core kernel capabilities to be deficient, just to 
accomodate hypothetical scenarios with buggy third party 
modules that create unparkable kernel threads? That would 
be a crazy proposition.

> > Why would __notrace__ functions be a problem in the 
> > 'simple' method? Live patching with this method will 
> > work even if ftrace is not built in, we can just patch 
> > out the function in the simplest possible fashion, 
> > because we do it atomically and don't have to worry 
> > about per task 'transition state' - like kGraft does.
> > 
> > It's a massive simplification and there's no need to 
> > rely on ftrace's mcount trick. (Sorry Steve!)
> 
> This would indeed be possible iff we take the "global 
> synchronizing point" aproach you are proposing. [...]

Yes.

> [...] Still technicalities to be solved (what happens if 
> you are patching that already has ftrace on it, etc), but 
> probably nothing principal.

Correct.

> > > So it's not black and white, it's really a rather 
> > > philosophical question where to draw the line (and 
> > > make sure that everyone is aware of where the line is 
> > > and what it means).
> > 
> > Out of the three examples you mentioned, two are 
> > actually an advantage to begin with - so I'd suggest 
> > you stop condescending me ...
> 
> Ugh, if you feel my e-mails like attempts to condescend 
> you, I am really surprised; I thought we are discussing 
> technical details. If you feel otherwise, we should 
> probably just stop discussing then.

I am making specific technical arguments, but you attempted 
to redirect my very specific arguments towards 'differences 
in philosophy' and 'where to draw the line'. Lets discuss 
the merits and brush them aside as 'philosophical 
differences' or a made up category of 'consistency models'.

Anyway, let me try to reboot this discussion back to 
technological details by summing up my arguments in another 
mail.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-22 Thread Ingo Molnar

* Jiri Kosina jkos...@suse.cz wrote:

   Or think of kernel that has some 3rd party vendor 
   module loaded, and this module spawning a ktrehad 
   that is not capable of parking itself.
  
  The kernel will refuse to live patch until the module 
  is fixed. It is a win by any measure.
 
 Depends on your point of view. If you are an 
 administrator of the vulnerable system, you have no to 
 very little clue why patching your system is not possible 
 (and what you should to to have it fixed in the coming 
 days, before your system is taken down by attackers, for 
 example), and would be really sad to be forced to run 
 with unpatched system.

I don't see what precise argument you are making here. That 
we are unable to fix possible bugs in binary only modules? 
News at 11.

Or are you making the argument that we should design our 
core kernel capabilities to be deficient, just to 
accomodate hypothetical scenarios with buggy third party 
modules that create unparkable kernel threads? That would 
be a crazy proposition.

  Why would __notrace__ functions be a problem in the 
  'simple' method? Live patching with this method will 
  work even if ftrace is not built in, we can just patch 
  out the function in the simplest possible fashion, 
  because we do it atomically and don't have to worry 
  about per task 'transition state' - like kGraft does.
  
  It's a massive simplification and there's no need to 
  rely on ftrace's mcount trick. (Sorry Steve!)
 
 This would indeed be possible iff we take the global 
 synchronizing point aproach you are proposing. [...]

Yes.

 [...] Still technicalities to be solved (what happens if 
 you are patching that already has ftrace on it, etc), but 
 probably nothing principal.

Correct.

   So it's not black and white, it's really a rather 
   philosophical question where to draw the line (and 
   make sure that everyone is aware of where the line is 
   and what it means).
  
  Out of the three examples you mentioned, two are 
  actually an advantage to begin with - so I'd suggest 
  you stop condescending me ...
 
 Ugh, if you feel my e-mails like attempts to condescend 
 you, I am really surprised; I thought we are discussing 
 technical details. If you feel otherwise, we should 
 probably just stop discussing then.

I am making specific technical arguments, but you attempted 
to redirect my very specific arguments towards 'differences 
in philosophy' and 'where to draw the line'. Lets discuss 
the merits and brush them aside as 'philosophical 
differences' or a made up category of 'consistency models'.

Anyway, let me try to reboot this discussion back to 
technological details by summing up my arguments in another 
mail.

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-22 Thread Ingo Molnar

* Jiri Kosina jkos...@suse.cz wrote:

 On Sat, 21 Feb 2015, Ingo Molnar wrote:
 
   Plus a lot of processes would see EINTR, causing more 
   havoc.
  
  Parking threads safely in user mode does not require 
  the propagation of syscall interruption to user-space.
 
 BTW how exactly do you envision this will work? Do I 
 understand your proposal correctly that EINTR will be 
 handled somewhere in the live patching special signal 
 handler and then have the interrupted syscall restarted?

If you want to think about it in signal handling terms then 
it's a new automatic in-kernel handler, which does not 
actually return back to user-mode at all.

We can do it via the signals machinery (mainly to reuse all 
the existing signal_pending() code in various syscalls), or 
via new TIF flags like the user work machinery: the 
principle is the same: interrupt out of syscall functions 
into a central place and restart them, and return to 
user-space later on as if a single call had been performed.

This necessarily means some changes to syscalls, but not 
insurmountable ones - and checkpoint/restore support would 
want to have similar changes in any case so we can hit two 
birds with the same stone.

 Even without EINTR propagation to userspace, this would 
 make a lot of new syscall restarts that were not there 
 before, [...]

That's only a problem if you do system call restarts by 
restarting them via user-space system call restart handler 
- I'm not proposing that.

I'm suggesting a completely user-space transparent way to 
execute long lasting system calls in a smarter way. I.e. it 
would not be observable via strace either.

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-22 Thread Jiri Kosina

[ live-patching@ ML added to CC here as well ]

On Sun, 22 Feb 2015, Ingo Molnar wrote:

  BTW how exactly do you envision this will work? Do I understand your 
  proposal correctly that EINTR will be handled somewhere in the live 
  patching special signal handler and then have the interrupted syscall 
  restarted?
 
 If you want to think about it in signal handling terms then it's a new 
 automatic in-kernel handler, which does not actually return back to 
 user-mode at all.
 
 We can do it via the signals machinery (mainly to reuse all the existing 
 signal_pending() code in various syscalls), or via new TIF flags like 
 the user work machinery: the principle is the same: interrupt out of 
 syscall functions into a central place and restart them, and return to 
 user-space later on as if a single call had been performed.
 
 This necessarily means some changes to syscalls, but not insurmountable 
 ones - and checkpoint/restore support would want to have similar changes 
 in any case so we can hit two birds with the same stone.

Understood. I think this by itself really would be a huge improvement on 
how to force tasks to converge to the new universe without confusing the 
userspace via artificial signals (i.e. even with the lazy migration 
aproaches, without full serialization checkpoint). As I said yesterday, I 
am pretty sure we'll experiment with this in kgraft. Thanks for bringing 
this idea up,

-- 
Jiri Kosina
SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-22 Thread Jiri Kosina
On Sun, 22 Feb 2015, Ingo Molnar wrote:

 I am making specific technical arguments, but you attempted to redirect 
 my very specific arguments towards 'differences in philosophy' and 
 'where to draw the line'. Lets discuss the merits and brush them aside 
 as 'philosophical differences' or a made up category of 'consistency 
 models'.
 
 Anyway, let me try to reboot this discussion back to technological 
 details by summing up my arguments in another mail.

Sounds good, thanks. Don't get me wrong -- I am not really opposing your 
solution (because it's of course quite close to what kgraft is doing :p ), 
but I'd like to make sure we understand each other well.

I still have to think a little bit more about how to handle kthreads 
properly even in your proposed solution (i.e. how exactly is it superior 
to what kgraft is currently doing in this respect). We'd have to go 
through all them anyway, and make them parkable, right?

Thanks,

-- 
Jiri Kosina
SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-22 Thread Jiri Kosina
On Sat, 21 Feb 2015, Ingo Molnar wrote:

  Plus a lot of processes would see EINTR, causing more havoc.
 
 Parking threads safely in user mode does not require the propagation of 
 syscall interruption to user-space.

BTW how exactly do you envision this will work? Do I understand your 
proposal correctly that EINTR will be handled somewhere in the live 
patching special signal handler and then have the interrupted syscall 
restarted?

Even without EINTR propagation to userspace, this would make a lot of new 
syscall restarts that were not there before, and I am still to be 
convinced that this is something we are not going to cause a lot of new 
user-visible breakage with.

Yes, the breakage would be caused kernel bugs (I mostly envision drivers 
to be problematic in this area) that would be nice to have fixed, but the 
user experience that will come out of it will be just completely horrible.

Or did I misunderstand what you are really proposing?

Thanks,

-- 
Jiri Kosina
SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-21 Thread Jiri Kosina
To make sure that this thread doesn't conclude in void, here's my take on 
it:

- what's currently alredy there is the simplest-of-simplest methods; it
  allows you to apply context-less patches (such as adding bounds checking 
  to the beginning of syscall, etc), which turns out to cover vast portion 
  of applicable CVEs

- it can always be made more clever; patch author always has to know the 
  version of the kernel he's preparing the patch for anyway (the live 
  patch and the kernel is closely tied together)

- the proposal to force sleeping or CPU-hogging tasks through a defined 
  safe checkpoint using a fake sort-of signal without any other 
  sideeffects might be useful even for kGraft and also for other proposed 
  aproaches. I think we'll try to implement this as an optimization for 
  kGraft and we'll see how (a) fast (b) non-intrusive we would be able to 
  make it. If it turns out to be successful, we can then just reuse it in 
  the upstream solution (whatever that would be)

Thanks,

-- 
Jiri Kosina
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-21 Thread Jiri Kosina
On Sat, 21 Feb 2015, Ingo Molnar wrote:

> > I see the difference, but I am afraid you are simplifying 
> > the situation a litle bit too much.
> > 
> > There will always be properties of patches that will make 
> > them unapplicable in a "live patching" way by design. 
> > Think of data structure layout changes (*).
> 
> Yes.

The (*) note described that it's actually in theory possible, but the 
price to pay is very high.

So it's possible to write a patch that does this, and have it use a 
different consistency model, which would take care of data structure 
layout change (and the fact that this particular consistency model is very 
expensive, would be a clear fact).

Your "simple one-method-to-rule-them-all aproach" is not sufficient for 
this.

Even such a stretched case doesn't justify existency of consistency models 
for you?

> > Or think of kernel that has some 3rd party vendor module loaded, and 
> > this module spawning a ktrehad that is not capable of parking itself.
> 
> The kernel will refuse to live patch until the module is fixed. It is a 
> win by any measure.

Depends on your point of view. If you are an administrator of the 
vulnerable system, you have no to very little clue why patching your 
system is not possible (and what you should to to have it fixed in the 
coming days, before your system is taken down by attackers, for example), 
and would be really sad to be forced to run with unpatched system.

I see your point that this would be a good lever we'd have to 3rd party 
vendors, but OTOH we'd be taking OS users as hostages in some sense.

> > Or think of patching __notrace__ functions.
> 
> Why would __notrace__ functions be a problem in the 'simple' method? 
> Live patching with this method will work even if ftrace is not built in, 
> we can just patch out the function in the simplest possible fashion, 
> because we do it atomically and don't have to worry about per task 
> 'transition state' - like kGraft does.
> 
> It's a massive simplification and there's no need to rely on ftrace's 
> mcount trick. (Sorry Steve!)

This would indeed be possible iff we take the "global synchronizing point" 
aproach you are proposing. Still technicalities to be solved (what happens 
if you are patching that already has ftrace on it, etc), but probably 
nothing principal.

> > So it's not black and white, it's really a rather philosophical 
> > question where to draw the line (and make sure that everyone is aware 
> > of where the line is and what it means).
> 
> Out of the three examples you mentioned, two are actually an advantage 
> to begin with - so I'd suggest you stop condescending me ...

Ugh, if you feel my e-mails like attempts to condescend you, I am really 
surprised; I thought we are discussing technical details. If you feel 
otherwise, we should probably just stop discussing then.

Thanks,

-- 
Jiri Kosina
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-21 Thread Ingo Molnar

* Jiri Kosina  wrote:

> On Sat, 21 Feb 2015, Ingo Molnar wrote:
> 
> > > But admittedly, if we reserve a special sort-of 
> > > signal for making the tasks pass through a safe 
> > > checkpoint (and make them queue there (your solution) 
> > > or make them just pass through it and continue 
> > > (current kGraft)), it might reduce the time this 
> > > effort needs considerably.
> > 
> > Well, I think the 'simple' method has another 
> > advantage: it can only work if all those problems 
> > (kthreads, parking machinery) are solved, because the 
> > patching will occur only everything is quiescent.
> > 
> > So no shortcuts are allowed, by design. It starts from 
> > a fundamentally safe, robust base, while all the other 
> > approaches I've seen were developed in a 'lets get the 
> > patching to work, then iteratively try to make it 
> > safer' which really puts the cart before the horse.
> > 
> > So to put it more bluntly: I don't subscribe to the 
> > whole 'consistency model' nonsense: that's just crazy 
> > talk IMHO.
> > 
> > Make it fundamentally safe from the very beginning, the 
> > 'simple method' I suggested _won't live patch the 
> > kernel_ if the mechanism has a bug and some kthread or 
> > task does not park. See the difference?
> 
> I see the difference, but I am afraid you are simplifying 
> the situation a litle bit too much.
> 
> There will always be properties of patches that will make 
> them unapplicable in a "live patching" way by design. 
> Think of data structure layout changes (*).

Yes.

> Or think of kernel that has some 3rd party vendor module 
> loaded, and this module spawning a ktrehad that is not 
> capable of parking itself.

The kernel will refuse to live patch until the module is 
fixed. It is a win by any measure.

> Or think of patching __notrace__ functions.

Why would __notrace__ functions be a problem in the 
'simple' method? Live patching with this method will work 
even if ftrace is not built in, we can just patch out the 
function in the simplest possible fashion, because we do it 
atomically and don't have to worry about per task 
'transition state' - like kGraft does.

It's a massive simplification and there's no need to rely 
on ftrace's mcount trick. (Sorry Steve!)

> Etc.
> 
> So it's not black and white, it's really a rather 
> philosophical question where to draw the line (and make 
> sure that everyone is aware of where the line is and what 
> it means).

Out of the three examples you mentioned, two are actually 
an advantage to begin with - so I'd suggest you stop 
condescending me ...

> This is exactly why we came up with consistency models -- 
> it allows you to draw the line at well-defined places.

To still be blunt: they are snake oil, a bit like 'security 
modules': allowing upstream merges by consensus between 
competing pieces of crap, instead of coming up with a 
single good design that we can call the Linux kernel live 
patching method ...

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-21 Thread Jiri Kosina
On Sat, 21 Feb 2015, Ingo Molnar wrote:

> > But admittedly, if we reserve a special sort-of signal 
> > for making the tasks pass through a safe checkpoint (and 
> > make them queue there (your solution) or make them just 
> > pass through it and continue (current kGraft)), it might 
> > reduce the time this effort needs considerably.
> 
> Well, I think the 'simple' method has another advantage: it can only 
> work if all those problems (kthreads, parking machinery) are solved, 
> because the patching will occur only everything is quiescent.
> 
> So no shortcuts are allowed, by design. It starts from a fundamentally 
> safe, robust base, while all the other approaches I've seen were 
> developed in a 'lets get the patching to work, then iteratively try to 
> make it safer' which really puts the cart before the horse.
> 
> So to put it more bluntly: I don't subscribe to the whole 'consistency 
> model' nonsense: that's just crazy talk IMHO.
> 
> Make it fundamentally safe from the very beginning, the 'simple method' 
> I suggested _won't live patch the kernel_ if the mechanism has a bug and 
> some kthread or task does not park. See the difference?

I see the difference, but I am afraid you are simplifying the situation a 
litle bit too much. 

There will always be properties of patches that will make them 
unapplicable in a "live patching" way by design. Think of data structure 
layout changes (*).

Or think of kernel that has some 3rd party vendor module loaded, and this 
module spawning a ktrehad that is not capable of parking itself.

Or think of patching __notrace__ functions.

Etc.

So it's not black and white, it's really a rather philosophical question 
where to draw the line (and make sure that everyone is aware of where the 
line is and what it means).
This is exactly why we came up with consistency models -- it allows you to 
draw the line at well-defined places.

(*) there are some rather crazy ideas how to make this work, but the price 
you pay is basically always unacceptable slowdown

Thanks,

-- 
Jiri Kosina
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-21 Thread Ingo Molnar

* Jiri Kosina  wrote:

> On Sat, 21 Feb 2015, Ingo Molnar wrote:
> 
> > > This means that each and every sleeping task in the 
> > > system has to be woken up in some way (sending a 
> > > signal ...) to exit from a syscall it is sleeping in. 
> > > Same for CPU hogs. All kernel threads need to be 
> > > parked.
> > 
> > Yes - although I'd not use signals for this, signals 
> > have side effects - but yes, something functionally 
> > equivalent.
> 
> This is similar to my proposal I came up with not too 
> long time ago; a fake signal (analogically to, but not 
> exactly the same, what freezer is using), that will just 
> make tasks cycle through userspace/kernelspace boundary 
> without other side-effects.

Yeah.

> > > This is exactly what you need to do for kGraft to 
> > > complete patching.
> > 
> > My understanding of kGraft is that by default you allow 
> > tasks to continue 'in the new universe' after they are 
> > patched. Has this changed or have I misunderstood the 
> > concept?
> 
> What Vojtech meant here, I believe, is that the effort 
> you have to make to force all tasks to queue themselves 
> to park them on a safe place and then restart their 
> execution is exactly the same as the effort you have to 
> make to make kGraft converge and succeed.

Yes - with the difference that in the 'simple' method I 
suggested we'd have kpatch's patching robustness (all or 
nothing atomic patching - no intermediate patching state, 
no reliance on mcount entries, no doubt about which version 
of the function is working - sans kpatch's stack trace 
logic), combined with kGraft's task parking robustness.

> But admittedly, if we reserve a special sort-of signal 
> for making the tasks pass through a safe checkpoint (and 
> make them queue there (your solution) or make them just 
> pass through it and continue (current kGraft)), it might 
> reduce the time this effort needs considerably.

Well, I think the 'simple' method has another advantage: it 
can only work if all those problems (kthreads, parking 
machinery) are solved, because the patching will occur only 
everything is quiescent.

So no shortcuts are allowed, by design. It starts from a 
fundamentally safe, robust base, while all the other 
approaches I've seen were developed in a 'lets get the 
patching to work, then iteratively try to make it safer' 
which really puts the cart before the horse.

So to put it more bluntly: I don't subscribe to the whole 
'consistency model' nonsense: that's just crazy talk IMHO. 

Make it fundamentally safe from the very beginning, the 
'simple method' I suggested _won't live patch the kernel_ 
if the mechanism has a bug and some kthread or task does 
not park. See the difference?

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-21 Thread Jiri Kosina
On Sat, 21 Feb 2015, Ingo Molnar wrote:

> > This means that each and every sleeping task in the system has to be 
> > woken up in some way (sending a signal ...) to exit from a syscall it 
> > is sleeping in. Same for CPU hogs. All kernel threads need to be 
> > parked.
> 
> Yes - although I'd not use signals for this, signals have 
> side effects - but yes, something functionally equivalent.

This is similar to my proposal I came up with not too long time ago; a 
fake signal (analogically to, but not exactly the same, what freezer is 
using), that will just make tasks cycle through userspace/kernelspace 
boundary without other side-effects.

> > This is exactly what you need to do for kGraft to complete patching.
> 
> My understanding of kGraft is that by default you allow tasks to 
> continue 'in the new universe' after they are patched. Has this changed 
> or have I misunderstood the concept?

What Vojtech meant here, I believe, is that the effort you have to make to 
force all tasks to queue themselves to park them on a safe place and then 
restart their execution is exactly the same as the effort you have to make 
to make kGraft converge and succeed.

But admittedly, if we reserve a special sort-of signal for making the 
tasks pass through a safe checkpoint (and make them queue there (your 
solution) or make them just pass through it and continue (current 
kGraft)), it might reduce the time this effort needs considerably.

-- 
Jiri Kosina
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-21 Thread Ingo Molnar

* Josh Poimboeuf  wrote:

> On Fri, Feb 20, 2015 at 10:46:13PM +0100, Vojtech Pavlik wrote:
> > On Fri, Feb 20, 2015 at 08:49:01PM +0100, Ingo Molnar wrote:
> >
> > > I.e. it's in essence the strong stop-all atomic 
> > > patching model of 'kpatch', combined with the 
> > > reliable avoidance of kernel stacks that 'kgraft' 
> > > uses.
> > 
> > > That should be the starting point, because it's the 
> > > most reliable method.
> > 
> > In the consistency models discussion, this was marked 
> > the "LEAVE_KERNEL+SWITCH_KERNEL" model. It's indeed the 
> > strongest model of all, but also comes at the highest 
> > cost in terms of impact on running tasks. It's so high 
> > (the interruption may be seconds or more) that it was 
> > deemed not worth implementing.
> 
> Yeah, this is way too disruptive to the user.
> 
> Even the comparatively tiny latency caused by kpatch's 
> use of stop_machine() was considered unacceptable by 
> some.

Unreliable, unrobust patching is even more disruptive...

What I think makes it long term fragile is that we combine 
two unrobust, unlikely mechanisms: the chance that a task 
just happens to execute a patched function, with the chance 
that debug information is unreliable.

For example tracing patching got debugged to a fair degree 
because we rely on the patching for actual tracing 
functionality. Even with that relatively robust usage model 
we had our crises ...

I just don't see how a stack backtrace based live patching 
method can become robust in the long run.

> Plus a lot of processes would see EINTR, causing more 
> havoc.

Parking threads safely in user mode does not require the 
propagation of syscall interruption to user-space.

(It does have some other requirements, such as making all 
syscalls interruptible to a 'special' signalling method 
that only live patching triggers - even syscalls that are 
under the normal ABI uninterruptible, such as sys_sync().)

On the other hand, if it's too slow, people will work on 
improving signal propagation latencies: making syscalls 
more readily interruptible and more seemlessly restartable 
has various other advantages beyond live kernel patching.

I.e. it's a win-win scenario and will improve various areas 
of the kernel in terms of syscall interruptability 
latencies.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-21 Thread Ingo Molnar

* Vojtech Pavlik  wrote:

> On Fri, Feb 20, 2015 at 08:49:01PM +0100, Ingo Molnar wrote:
> 
> > > ... the choice the sysadmins have here is either have 
> > > the system running in an intermediate state, or have 
> > > the system completely dead for the *same time*. 
> > > Because to finish the transition successfully, all 
> > > the tasks have to be woken up in any case.
> > 
> > That statement is false: an 'intermediate state' system 
> > where 'new' tasks are still running is still running 
> > and will interfere with the resolution of 'old' tasks.
> 
> Can you suggest a way how they would interfere? The 
> transition happens on entering or returning from a 
> syscall, there is no influence between individual tasks.

Well, a 'new' task does not stop executing after returning 
from the syscall, right? If it's stopped (until all 
patching is totally complete) then you are right and I 
concede your point.

If it's allowed to continue its workload then my point 
stands: subsequent execution of 'new' tasks can interfere 
with, slow down, interact with 'old' tasks trying to get 
patched.

> > I think you misunderstood: the 'simple' method I 
> > outlined does not just 'synchronize', it actually 
> > executes the live patching atomically, once all tasks 
> > are gathered and we know they are _all_ in a safe 
> > state.
> 
> The 'simple' method has to catch and freeze all tasks one 
> by one in syscall entry/exit, at the kernel/userspace 
> boundary, until all are frozen and then patch the system 
> atomically.

Correct.

> This means that each and every sleeping task in the 
> system has to be woken up in some way (sending a signal 
> ...) to exit from a syscall it is sleeping in. Same for 
> CPU hogs. All kernel threads need to be parked.

Yes - although I'd not use signals for this, signals have 
side effects - but yes, something functionally equivalent.

> This is exactly what you need to do for kGraft to 
> complete patching.

My understanding of kGraft is that by default you allow 
tasks to continue 'in the new universe' after they are 
patched. Has this changed or have I misunderstood the 
concept?

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-21 Thread Ingo Molnar

* Vojtech Pavlik vojt...@suse.com wrote:

 On Fri, Feb 20, 2015 at 08:49:01PM +0100, Ingo Molnar wrote:
 
   ... the choice the sysadmins have here is either have 
   the system running in an intermediate state, or have 
   the system completely dead for the *same time*. 
   Because to finish the transition successfully, all 
   the tasks have to be woken up in any case.
  
  That statement is false: an 'intermediate state' system 
  where 'new' tasks are still running is still running 
  and will interfere with the resolution of 'old' tasks.
 
 Can you suggest a way how they would interfere? The 
 transition happens on entering or returning from a 
 syscall, there is no influence between individual tasks.

Well, a 'new' task does not stop executing after returning 
from the syscall, right? If it's stopped (until all 
patching is totally complete) then you are right and I 
concede your point.

If it's allowed to continue its workload then my point 
stands: subsequent execution of 'new' tasks can interfere 
with, slow down, interact with 'old' tasks trying to get 
patched.

  I think you misunderstood: the 'simple' method I 
  outlined does not just 'synchronize', it actually 
  executes the live patching atomically, once all tasks 
  are gathered and we know they are _all_ in a safe 
  state.
 
 The 'simple' method has to catch and freeze all tasks one 
 by one in syscall entry/exit, at the kernel/userspace 
 boundary, until all are frozen and then patch the system 
 atomically.

Correct.

 This means that each and every sleeping task in the 
 system has to be woken up in some way (sending a signal 
 ...) to exit from a syscall it is sleeping in. Same for 
 CPU hogs. All kernel threads need to be parked.

Yes - although I'd not use signals for this, signals have 
side effects - but yes, something functionally equivalent.

 This is exactly what you need to do for kGraft to 
 complete patching.

My understanding of kGraft is that by default you allow 
tasks to continue 'in the new universe' after they are 
patched. Has this changed or have I misunderstood the 
concept?

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-21 Thread Ingo Molnar

* Josh Poimboeuf jpoim...@redhat.com wrote:

 On Fri, Feb 20, 2015 at 10:46:13PM +0100, Vojtech Pavlik wrote:
  On Fri, Feb 20, 2015 at 08:49:01PM +0100, Ingo Molnar wrote:
 
   I.e. it's in essence the strong stop-all atomic 
   patching model of 'kpatch', combined with the 
   reliable avoidance of kernel stacks that 'kgraft' 
   uses.
  
   That should be the starting point, because it's the 
   most reliable method.
  
  In the consistency models discussion, this was marked 
  the LEAVE_KERNEL+SWITCH_KERNEL model. It's indeed the 
  strongest model of all, but also comes at the highest 
  cost in terms of impact on running tasks. It's so high 
  (the interruption may be seconds or more) that it was 
  deemed not worth implementing.
 
 Yeah, this is way too disruptive to the user.
 
 Even the comparatively tiny latency caused by kpatch's 
 use of stop_machine() was considered unacceptable by 
 some.

Unreliable, unrobust patching is even more disruptive...

What I think makes it long term fragile is that we combine 
two unrobust, unlikely mechanisms: the chance that a task 
just happens to execute a patched function, with the chance 
that debug information is unreliable.

For example tracing patching got debugged to a fair degree 
because we rely on the patching for actual tracing 
functionality. Even with that relatively robust usage model 
we had our crises ...

I just don't see how a stack backtrace based live patching 
method can become robust in the long run.

 Plus a lot of processes would see EINTR, causing more 
 havoc.

Parking threads safely in user mode does not require the 
propagation of syscall interruption to user-space.

(It does have some other requirements, such as making all 
syscalls interruptible to a 'special' signalling method 
that only live patching triggers - even syscalls that are 
under the normal ABI uninterruptible, such as sys_sync().)

On the other hand, if it's too slow, people will work on 
improving signal propagation latencies: making syscalls 
more readily interruptible and more seemlessly restartable 
has various other advantages beyond live kernel patching.

I.e. it's a win-win scenario and will improve various areas 
of the kernel in terms of syscall interruptability 
latencies.

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-21 Thread Jiri Kosina
To make sure that this thread doesn't conclude in void, here's my take on 
it:

- what's currently alredy there is the simplest-of-simplest methods; it
  allows you to apply context-less patches (such as adding bounds checking 
  to the beginning of syscall, etc), which turns out to cover vast portion 
  of applicable CVEs

- it can always be made more clever; patch author always has to know the 
  version of the kernel he's preparing the patch for anyway (the live 
  patch and the kernel is closely tied together)

- the proposal to force sleeping or CPU-hogging tasks through a defined 
  safe checkpoint using a fake sort-of signal without any other 
  sideeffects might be useful even for kGraft and also for other proposed 
  aproaches. I think we'll try to implement this as an optimization for 
  kGraft and we'll see how (a) fast (b) non-intrusive we would be able to 
  make it. If it turns out to be successful, we can then just reuse it in 
  the upstream solution (whatever that would be)

Thanks,

-- 
Jiri Kosina
SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-21 Thread Jiri Kosina
On Sat, 21 Feb 2015, Ingo Molnar wrote:

  This means that each and every sleeping task in the system has to be 
  woken up in some way (sending a signal ...) to exit from a syscall it 
  is sleeping in. Same for CPU hogs. All kernel threads need to be 
  parked.
 
 Yes - although I'd not use signals for this, signals have 
 side effects - but yes, something functionally equivalent.

This is similar to my proposal I came up with not too long time ago; a 
fake signal (analogically to, but not exactly the same, what freezer is 
using), that will just make tasks cycle through userspace/kernelspace 
boundary without other side-effects.

  This is exactly what you need to do for kGraft to complete patching.
 
 My understanding of kGraft is that by default you allow tasks to 
 continue 'in the new universe' after they are patched. Has this changed 
 or have I misunderstood the concept?

What Vojtech meant here, I believe, is that the effort you have to make to 
force all tasks to queue themselves to park them on a safe place and then 
restart their execution is exactly the same as the effort you have to make 
to make kGraft converge and succeed.

But admittedly, if we reserve a special sort-of signal for making the 
tasks pass through a safe checkpoint (and make them queue there (your 
solution) or make them just pass through it and continue (current 
kGraft)), it might reduce the time this effort needs considerably.

-- 
Jiri Kosina
SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-21 Thread Jiri Kosina
On Sat, 21 Feb 2015, Ingo Molnar wrote:

  I see the difference, but I am afraid you are simplifying 
  the situation a litle bit too much.
  
  There will always be properties of patches that will make 
  them unapplicable in a live patching way by design. 
  Think of data structure layout changes (*).
 
 Yes.

The (*) note described that it's actually in theory possible, but the 
price to pay is very high.

So it's possible to write a patch that does this, and have it use a 
different consistency model, which would take care of data structure 
layout change (and the fact that this particular consistency model is very 
expensive, would be a clear fact).

Your simple one-method-to-rule-them-all aproach is not sufficient for 
this.

Even such a stretched case doesn't justify existency of consistency models 
for you?

  Or think of kernel that has some 3rd party vendor module loaded, and 
  this module spawning a ktrehad that is not capable of parking itself.
 
 The kernel will refuse to live patch until the module is fixed. It is a 
 win by any measure.

Depends on your point of view. If you are an administrator of the 
vulnerable system, you have no to very little clue why patching your 
system is not possible (and what you should to to have it fixed in the 
coming days, before your system is taken down by attackers, for example), 
and would be really sad to be forced to run with unpatched system.

I see your point that this would be a good lever we'd have to 3rd party 
vendors, but OTOH we'd be taking OS users as hostages in some sense.

  Or think of patching __notrace__ functions.
 
 Why would __notrace__ functions be a problem in the 'simple' method? 
 Live patching with this method will work even if ftrace is not built in, 
 we can just patch out the function in the simplest possible fashion, 
 because we do it atomically and don't have to worry about per task 
 'transition state' - like kGraft does.
 
 It's a massive simplification and there's no need to rely on ftrace's 
 mcount trick. (Sorry Steve!)

This would indeed be possible iff we take the global synchronizing point 
aproach you are proposing. Still technicalities to be solved (what happens 
if you are patching that already has ftrace on it, etc), but probably 
nothing principal.

  So it's not black and white, it's really a rather philosophical 
  question where to draw the line (and make sure that everyone is aware 
  of where the line is and what it means).
 
 Out of the three examples you mentioned, two are actually an advantage 
 to begin with - so I'd suggest you stop condescending me ...

Ugh, if you feel my e-mails like attempts to condescend you, I am really 
surprised; I thought we are discussing technical details. If you feel 
otherwise, we should probably just stop discussing then.

Thanks,

-- 
Jiri Kosina
SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-21 Thread Ingo Molnar

* Jiri Kosina jkos...@suse.cz wrote:

 On Sat, 21 Feb 2015, Ingo Molnar wrote:
 
   This means that each and every sleeping task in the 
   system has to be woken up in some way (sending a 
   signal ...) to exit from a syscall it is sleeping in. 
   Same for CPU hogs. All kernel threads need to be 
   parked.
  
  Yes - although I'd not use signals for this, signals 
  have side effects - but yes, something functionally 
  equivalent.
 
 This is similar to my proposal I came up with not too 
 long time ago; a fake signal (analogically to, but not 
 exactly the same, what freezer is using), that will just 
 make tasks cycle through userspace/kernelspace boundary 
 without other side-effects.

Yeah.

   This is exactly what you need to do for kGraft to 
   complete patching.
  
  My understanding of kGraft is that by default you allow 
  tasks to continue 'in the new universe' after they are 
  patched. Has this changed or have I misunderstood the 
  concept?
 
 What Vojtech meant here, I believe, is that the effort 
 you have to make to force all tasks to queue themselves 
 to park them on a safe place and then restart their 
 execution is exactly the same as the effort you have to 
 make to make kGraft converge and succeed.

Yes - with the difference that in the 'simple' method I 
suggested we'd have kpatch's patching robustness (all or 
nothing atomic patching - no intermediate patching state, 
no reliance on mcount entries, no doubt about which version 
of the function is working - sans kpatch's stack trace 
logic), combined with kGraft's task parking robustness.

 But admittedly, if we reserve a special sort-of signal 
 for making the tasks pass through a safe checkpoint (and 
 make them queue there (your solution) or make them just 
 pass through it and continue (current kGraft)), it might 
 reduce the time this effort needs considerably.

Well, I think the 'simple' method has another advantage: it 
can only work if all those problems (kthreads, parking 
machinery) are solved, because the patching will occur only 
everything is quiescent.

So no shortcuts are allowed, by design. It starts from a 
fundamentally safe, robust base, while all the other 
approaches I've seen were developed in a 'lets get the 
patching to work, then iteratively try to make it safer' 
which really puts the cart before the horse.

So to put it more bluntly: I don't subscribe to the whole 
'consistency model' nonsense: that's just crazy talk IMHO. 

Make it fundamentally safe from the very beginning, the 
'simple method' I suggested _won't live patch the kernel_ 
if the mechanism has a bug and some kthread or task does 
not park. See the difference?

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-21 Thread Jiri Kosina
On Sat, 21 Feb 2015, Ingo Molnar wrote:

  But admittedly, if we reserve a special sort-of signal 
  for making the tasks pass through a safe checkpoint (and 
  make them queue there (your solution) or make them just 
  pass through it and continue (current kGraft)), it might 
  reduce the time this effort needs considerably.
 
 Well, I think the 'simple' method has another advantage: it can only 
 work if all those problems (kthreads, parking machinery) are solved, 
 because the patching will occur only everything is quiescent.
 
 So no shortcuts are allowed, by design. It starts from a fundamentally 
 safe, robust base, while all the other approaches I've seen were 
 developed in a 'lets get the patching to work, then iteratively try to 
 make it safer' which really puts the cart before the horse.
 
 So to put it more bluntly: I don't subscribe to the whole 'consistency 
 model' nonsense: that's just crazy talk IMHO.
 
 Make it fundamentally safe from the very beginning, the 'simple method' 
 I suggested _won't live patch the kernel_ if the mechanism has a bug and 
 some kthread or task does not park. See the difference?

I see the difference, but I am afraid you are simplifying the situation a 
litle bit too much. 

There will always be properties of patches that will make them 
unapplicable in a live patching way by design. Think of data structure 
layout changes (*).

Or think of kernel that has some 3rd party vendor module loaded, and this 
module spawning a ktrehad that is not capable of parking itself.

Or think of patching __notrace__ functions.

Etc.

So it's not black and white, it's really a rather philosophical question 
where to draw the line (and make sure that everyone is aware of where the 
line is and what it means).
This is exactly why we came up with consistency models -- it allows you to 
draw the line at well-defined places.

(*) there are some rather crazy ideas how to make this work, but the price 
you pay is basically always unacceptable slowdown

Thanks,

-- 
Jiri Kosina
SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-21 Thread Ingo Molnar

* Jiri Kosina jkos...@suse.cz wrote:

 On Sat, 21 Feb 2015, Ingo Molnar wrote:
 
   But admittedly, if we reserve a special sort-of 
   signal for making the tasks pass through a safe 
   checkpoint (and make them queue there (your solution) 
   or make them just pass through it and continue 
   (current kGraft)), it might reduce the time this 
   effort needs considerably.
  
  Well, I think the 'simple' method has another 
  advantage: it can only work if all those problems 
  (kthreads, parking machinery) are solved, because the 
  patching will occur only everything is quiescent.
  
  So no shortcuts are allowed, by design. It starts from 
  a fundamentally safe, robust base, while all the other 
  approaches I've seen were developed in a 'lets get the 
  patching to work, then iteratively try to make it 
  safer' which really puts the cart before the horse.
  
  So to put it more bluntly: I don't subscribe to the 
  whole 'consistency model' nonsense: that's just crazy 
  talk IMHO.
  
  Make it fundamentally safe from the very beginning, the 
  'simple method' I suggested _won't live patch the 
  kernel_ if the mechanism has a bug and some kthread or 
  task does not park. See the difference?
 
 I see the difference, but I am afraid you are simplifying 
 the situation a litle bit too much.
 
 There will always be properties of patches that will make 
 them unapplicable in a live patching way by design. 
 Think of data structure layout changes (*).

Yes.

 Or think of kernel that has some 3rd party vendor module 
 loaded, and this module spawning a ktrehad that is not 
 capable of parking itself.

The kernel will refuse to live patch until the module is 
fixed. It is a win by any measure.

 Or think of patching __notrace__ functions.

Why would __notrace__ functions be a problem in the 
'simple' method? Live patching with this method will work 
even if ftrace is not built in, we can just patch out the 
function in the simplest possible fashion, because we do it 
atomically and don't have to worry about per task 
'transition state' - like kGraft does.

It's a massive simplification and there's no need to rely 
on ftrace's mcount trick. (Sorry Steve!)

 Etc.
 
 So it's not black and white, it's really a rather 
 philosophical question where to draw the line (and make 
 sure that everyone is aware of where the line is and what 
 it means).

Out of the three examples you mentioned, two are actually 
an advantage to begin with - so I'd suggest you stop 
condescending me ...

 This is exactly why we came up with consistency models -- 
 it allows you to draw the line at well-defined places.

To still be blunt: they are snake oil, a bit like 'security 
modules': allowing upstream merges by consensus between 
competing pieces of crap, instead of coming up with a 
single good design that we can call the Linux kernel live 
patching method ...

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-20 Thread Josh Poimboeuf
On Fri, Feb 20, 2015 at 10:46:13PM +0100, Vojtech Pavlik wrote:
> On Fri, Feb 20, 2015 at 08:49:01PM +0100, Ingo Molnar wrote:
> > I.e. it's in essence the strong stop-all atomic patching 
> > model of 'kpatch', combined with the reliable avoidance of 
> > kernel stacks that 'kgraft' uses.
> 
> > That should be the starting point, because it's the most 
> > reliable method.
> 
> In the consistency models discussion, this was marked the
> "LEAVE_KERNEL+SWITCH_KERNEL" model. It's indeed the strongest model of
> all, but also comes at the highest cost in terms of impact on running
> tasks. It's so high (the interruption may be seconds or more) that it
> was deemed not worth implementing.

Yeah, this is way too disruptive to the user.

Even the comparatively tiny latency caused by kpatch's use of
stop_machine() was considered unacceptable by some.

Plus a lot of processes would see EINTR, causing more havoc.

-- 
Josh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-20 Thread Vojtech Pavlik
On Fri, Feb 20, 2015 at 08:49:01PM +0100, Ingo Molnar wrote:

> > ... the choice the sysadmins have here is either have the 
> > system running in an intermediate state, or have the 
> > system completely dead for the *same time*. Because to 
> > finish the transition successfully, all the tasks have to 
> > be woken up in any case.
> 
> That statement is false: an 'intermediate state' system 
> where 'new' tasks are still running is still running and 
> will interfere with the resolution of 'old' tasks.

Can you suggest a way how they would interfere? The transition happens
on entering or returning from a syscall, there is no influence between
individual tasks.

If you mean that the patch author has to consider the fact that both old
and new code will be running simultaneously, then yes, they have to.

> > But I do get your point; what you are basically saying is 
> > that your preference is what kgraft is doing, and option 
> > to allow for a global synchronization point should be 
> > added in parallel to the gradual lazy migration.
> 
> I think you misunderstood: the 'simple' method I outlined 
> does not just 'synchronize', it actually executes the live 
> patching atomically, once all tasks are gathered and we 
> know they are _all_ in a safe state.

The 'simple' method has to catch and freeze all tasks one by one in
syscall entry/exit, at the kernel/userspace boundary, until all are
frozen and then patch the system atomically. 

This means that each and every sleeping task in the system has to be
woken up in some way (sending a signal ...) to exit from a syscall it is
sleeping in. Same for CPU hogs. All kernel threads need to be parked.

This is exactly what you need to do for kGraft to complete patching.

This may take a significant amount of time to achieve and you won't be
able to use a userspace script to send the signals, because it'd be
frozen immediately.

> I.e. it's in essence the strong stop-all atomic patching 
> model of 'kpatch', combined with the reliable avoidance of 
> kernel stacks that 'kgraft' uses.

> That should be the starting point, because it's the most 
> reliable method.

In the consistency models discussion, this was marked the
"LEAVE_KERNEL+SWITCH_KERNEL" model. It's indeed the strongest model of
all, but also comes at the highest cost in terms of impact on running
tasks. It's so high (the interruption may be seconds or more) that it
was deemed not worth implementing.

-- 
Vojtech Pavlik
Director SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-20 Thread Josh Poimboeuf
On Fri, Feb 20, 2015 at 09:08:49PM +0100, Ingo Molnar wrote:
> * Josh Poimboeuf  wrote:
> > On Fri, Feb 20, 2015 at 10:50:03AM +0100, Ingo Molnar wrote:
> > > * Jiri Kosina  wrote:
> > > 
> > > > Alright, so to sum it up:
> > > > 
> > > > - current stack dumping (even looking at /proc//stack) is not 
> > > >   guaranteed to yield "correct" results in case the task is running at 
> > > > the 
> > > >   time the stack is being examined
> > > 
> > > Don't even _think_ about trying to base something as 
> > > dangerous as live patching the kernel image on the concept 
> > > of:
> > > 
> > >   'We can make all stack backtraces reliably correct all 
> > >the time, with no false positives, with no false
> > >negatives, 100% of the time, and quickly discover and
> > >fix bugs in that'.
> > > 
> > > It's not going to happen:
> > > 
> > >  - The correctness of stacktraces partially depends on
> > >tooling and we don't control those.
> > 
> > What tooling are you referring to?
> > 
> > Sure, we rely on the compiler to produce the correct 
> > assembly for putting frame pointers on the stack. [...]
> 
> We also rely on all assembly code to have valid frames all 
> the time - and this is far from a given, we've had many 
> bugs in that area.

Are you talking about kernel bugs or compiler bugs?  Either way, would
those bugs not be caught by the stack unwinding validation improvements
I proposed?

> > [...]  But that's pretty straightforward. The kernel 
> > already relies on the compiler to do plenty of things 
> > which are much more complex than that.
> 
> So this claim scares me because it's such nonsense:
> 
> We rely on the compiler to create _functional code_ for us.  
> If the compiler messes up then functionality, actual code 
> breaks.
> 
> 'Valid stack frame' is not functional code, it's in essence 
> debug info.

Only because we _treat_ it like debug info.  I'm proposing that we treat
it like functional code.  So that yes, if the compiler messes up frame
pointers, actual code breaks.  (But we could detect that before we do
the patch, so it doesn't destabilize the system.)

If gcc can't do frame pointers right, it's a *major* gcc bug.  It's
nothing more than:

   push   %rbp
   mov%rsp,%rbp

at the beginning of the function, and:

   pop%rbp

right before the function returns.

If you really think the compiler could possibly mess that up, I can
probably write a simple script, executed by the Makefile, which
validates that every C function in the kernel has the above pattern.

> If a backtrace is wrong in some rare, racy case 
> then that won't break the kernel, it just gives slightly 
> misleading debug info in most cases.
> 
> Now you want to turn 'debug info' to have functional 
> relevance, for something as intrusive as patching the 
> kernel code.
> 
> You really need to accept this distinction!

Yes, I already realize that this would be the first time the kernel
relies on stack traces being 100% correct.  Peter and I had already some
discussion about it already elsewhere in this thread.

Just because we haven't relied on its correctness in the past doesn't
mean we can't rely on it in the future.

> > >  - More importantly, there's no strong force that ensures
> > >we can rely on stack backtraces: correcting bad 
> > >stack traces depends on people hitting those 
> > >functions and situations that generate them, seeing 
> > >a bad stack trace, noticing that it's weird and 
> > >correcting whatever code or tooling quirk causes the 
> > >stack entry to be incorrect.
> > >
> > > Essentially unlike other kernel code which breaks stuff 
> > > if it's incorrect, there's no _functional_ dependence 
> > > on stack traces, so live patching would be the first 
> > > (and pretty much only) thing that breaks on bad stack 
> > > traces ...
> > 
> > First, there are several things we do to avoid anomalies:
> > 
> > - we don't patch asm functions
> 
> the problem with asm functions isn't that we might want to 
> patch them (although sometimes we might want to: especially 
> the more interesting security fixes tend to be in assembly 
> code), it's that asm functions can (easily) mess up debug 
> info, without that being apparent in any functional way.

Again, that's why I proposed the stack unwinding validation
improvements.

> > - we only walk the stack of sleeping tasks
> 
> sleeping tasks are affected by debug info bugs too.
> 
> > Second, currently the stack unwinding code is rather crude and
> > doesn't do much in the way of error handling.
> 
> That's a big red flag in my book ...

I wasn't talking about my patch set.  I was talking about the existing
stack dumping code in the kernel.

-- 
Josh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-20 Thread Ingo Molnar

* Josh Poimboeuf  wrote:

> On Fri, Feb 20, 2015 at 10:50:03AM +0100, Ingo Molnar wrote:
> > 
> > * Jiri Kosina  wrote:
> > 
> > > Alright, so to sum it up:
> > > 
> > > - current stack dumping (even looking at /proc//stack) is not 
> > >   guaranteed to yield "correct" results in case the task is running at 
> > > the 
> > >   time the stack is being examined
> > 
> > Don't even _think_ about trying to base something as 
> > dangerous as live patching the kernel image on the concept 
> > of:
> > 
> >   'We can make all stack backtraces reliably correct all 
> >the time, with no false positives, with no false
> >negatives, 100% of the time, and quickly discover and
> >fix bugs in that'.
> > 
> > It's not going to happen:
> > 
> >  - The correctness of stacktraces partially depends on
> >tooling and we don't control those.
> 
> What tooling are you referring to?
> 
> Sure, we rely on the compiler to produce the correct 
> assembly for putting frame pointers on the stack. [...]

We also rely on all assembly code to have valid frames all 
the time - and this is far from a given, we've had many 
bugs in that area.

> [...]  But that's pretty straightforward. The kernel 
> already relies on the compiler to do plenty of things 
> which are much more complex than that.

So this claim scares me because it's such nonsense:

We rely on the compiler to create _functional code_ for us.  
If the compiler messes up then functionality, actual code 
breaks.

'Valid stack frame' is not functional code, it's in essence 
debug info. If a backtrace is wrong in some rare, racy case 
then that won't break the kernel, it just gives slightly 
misleading debug info in most cases.

Now you want to turn 'debug info' to have functional 
relevance, for something as intrusive as patching the 
kernel code.

You really need to accept this distinction!

> >  - More importantly, there's no strong force that ensures
> >we can rely on stack backtraces: correcting bad 
> >stack traces depends on people hitting those 
> >functions and situations that generate them, seeing 
> >a bad stack trace, noticing that it's weird and 
> >correcting whatever code or tooling quirk causes the 
> >stack entry to be incorrect.
> >
> > Essentially unlike other kernel code which breaks stuff 
> > if it's incorrect, there's no _functional_ dependence 
> > on stack traces, so live patching would be the first 
> > (and pretty much only) thing that breaks on bad stack 
> > traces ...
> 
> First, there are several things we do to avoid anomalies:
> 
> - we don't patch asm functions

the problem with asm functions isn't that we might want to 
patch them (although sometimes we might want to: especially 
the more interesting security fixes tend to be in assembly 
code), it's that asm functions can (easily) mess up debug 
info, without that being apparent in any functional way.

> - we only walk the stack of sleeping tasks

sleeping tasks are affected by debug info bugs too.

> Second, currently the stack unwinding code is rather 
> crude and doesn't do much in the way of error handling.

That's a big red flag in my book ...

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-20 Thread Ingo Molnar

* Jiri Kosina  wrote:

> > All fundamental pieces of the simple method are 
> > necessary to get guaranteed time transition from the 
> > complicated method: task tracking and transparent 
> > catching of them, handling kthreads, etc.
> > 
> > My argument is that the simple method should be 
> > implemented first and foremost.
> > 
> > Then people can do add-on features to possibly spread 
> > out the new function versions in a more complicated way 
> > if they want to avoid the stop-all-tasks transition - 
> > although I'm not convinced about it: I'm sure sure many 
> > sysadmins would like the bug patching to be over with 
> > quickly and not have their systems in an intermediate 
> > state like kgraft does it.
> 
> ... the choice the sysadmins have here is either have the 
> system running in an intermediate state, or have the 
> system completely dead for the *same time*. Because to 
> finish the transition successfully, all the tasks have to 
> be woken up in any case.

That statement is false: an 'intermediate state' system 
where 'new' tasks are still running is still running and 
will interfere with the resolution of 'old' tasks.

> But I do get your point; what you are basically saying is 
> that your preference is what kgraft is doing, and option 
> to allow for a global synchronization point should be 
> added in parallel to the gradual lazy migration.

I think you misunderstood: the 'simple' method I outlined 
does not just 'synchronize', it actually executes the live 
patching atomically, once all tasks are gathered and we 
know they are _all_ in a safe state.

I.e. it's in essence the strong stop-all atomic patching 
model of 'kpatch', combined with the reliable avoidance of 
kernel stacks that 'kgraft' uses.

That should be the starting point, because it's the most 
reliable method.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-20 Thread Josh Poimboeuf
On Fri, Feb 20, 2015 at 09:49:32AM +0100, Jiri Kosina wrote:
> Alright, so to sum it up:
> 
> - current stack dumping (even looking at /proc//stack) is not 
>   guaranteed to yield "correct" results in case the task is running at the 
>   time the stack is being examined
> 
> - the only fool-proof way is to send IPI-NMI to all CPUs, and synchronize 
>   the handlers between each other (to make sure that reschedule doesn't 
>   happen in between on some CPU and other task doesn't start running in 
>   the interim). 
>   The NMI handler dumps its current stack in case it's running in context 
>   of the process whose stack is to be dumped. Otherwise, one of the NMI 
>   handlers looks up the required task_struct, and dumps it if it's not 
>   running on any CPU
> 
> - For live patching use-case, the stack has to be analyzed (and decision 
>   on what to do based on the analysis) in the NMI handler itself, 
>   otherwise it gets racy again
> 
> Converting /proc//stack to this mechanism seems like a correct thing 
> to do in any case, as it's slow path anyway.
> 
> The original intent seemed to have been to make this fast path for the 
> live patching case, but that's probably not possible, so it seems like the 
> price that will have to be paid for being able to finish live-patching of 
> CPU-bound processess is the cost of IPI-NMI broadcast.

Hm, syncing IPI's among CPUs sounds pretty disruptive.

This is really two different issues, so I'll separate them:

1. /proc/pid/stack for running tasks

I haven't heard anybody demanding that /proc//stack should actually
print the stack for running tasks.  My suggestion was just that we avoid
the possibility of printing garbage.

Today's behavior for a running task is (usually): 

  # cat /proc/802/stack
  [] 0x

How about, when we detecting a running task, just always show that?
That would give us today's behavior, except without occasionally
printing garbage, while avoiding all the overhead of syncing IPI's.

2. live patching of running tasks

I don't see why we would need to sync IPI's to patch CPU-bound
processes.  Why not use context tracking or the TIF_USERSPACE flag like
I mentioned before?

-- 
Josh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-20 Thread Josh Poimboeuf
On Fri, Feb 20, 2015 at 10:50:03AM +0100, Ingo Molnar wrote:
> 
> * Jiri Kosina  wrote:
> 
> > Alright, so to sum it up:
> > 
> > - current stack dumping (even looking at /proc//stack) is not 
> >   guaranteed to yield "correct" results in case the task is running at the 
> >   time the stack is being examined
> 
> Don't even _think_ about trying to base something as 
> dangerous as live patching the kernel image on the concept 
> of:
> 
>   'We can make all stack backtraces reliably correct all 
>the time, with no false positives, with no false
>negatives, 100% of the time, and quickly discover and
>fix bugs in that'.
> 
> It's not going to happen:
> 
>  - The correctness of stacktraces partially depends on
>tooling and we don't control those.

What tooling are you referring to?

Sure, we rely on the compiler to produce the correct assembly for
putting frame pointers on the stack.  But that's pretty straightforward.
The kernel already relies on the compiler to do plenty of things which
are much more complex than that.

>  - More importantly, there's no strong force that ensures
>we can rely on stack backtraces: correcting bad stack
>traces depends on people hitting those functions and
>situations that generate them, seeing a bad stack trace,
>noticing that it's weird and correcting whatever code or
>tooling quirk causes the stack entry to be incorrect.
>
> Essentially unlike other kernel code which breaks stuff if 
> it's incorrect, there's no _functional_ dependence on stack 
> traces, so live patching would be the first (and pretty 
> much only) thing that breaks on bad stack traces ...

First, there are several things we do to avoid anomalies:

- we don't patch asm functions
- we only walk the stack of sleeping tasks

Second, currently the stack unwinding code is rather crude and doesn't
do much in the way of error handling.  There are several error
conditions we could easily check for programmatically:

- make sure it starts from a __schedule() call at the top (I've only
  proposed walking the stacks of sleeping tasks)
- make sure we actually reach the bottom of the stack
- make sure each stack frame's return address is immediately after a
  call instruction

If any of these checks fail, we can either delay the patching of the
task until later or we can cancel the patching process, with no harm
done.  Either way we can WARN the user so that we get reports of these
anomalies.

> If you think you can make something like dwarf annotations 
> work reliably to base kernel live patching on that, 
> reconsider.

No, we would rely on CONFIG_FRAME_POINTER.

> Even with frame pointer backtraces can go bad sometimes, I 
> wouldn't base live patching even on _that_,

Other than hand-coded assembly, can you provide more details about how
frame pointer stack traces can go bad?

> and that's a very simple concept with a performance cost that most
> distros don't want to pay.

Hm, CONFIG_FRAME_POINTER is enabled on all the distros I use.

> So if your design is based on being able to discover 'live' 
> functions in the kernel stack dump of all tasks in the 
> system, I think you need a serious reboot of the whole 
> approach and get rid of that fragility before any of that 
> functionality gets upstream!

Just to clarify a couple of things:

1. Despite the email subject, this discussion is really about another
   RFC patch set [1].  It hasn't been merged, and there's still a lot of
   ongoing discussion about it.

2. We don't dump the stack of _all_ tasks.  Only sleeping ones.


[1] https://lkml.org/lkml/2015/2/9/475

-- 
Josh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-20 Thread Jiri Kosina
On Fri, 20 Feb 2015, Ingo Molnar wrote:

>   - the complicated method spread out over time: uses the 
> same essential mechanism plus the ftrace patching 
> machinery to detect whether all tasks have transitioned 
> through a version flip. [this is what kgraft does in 
> part.]

The only difference of this to what kgraft does is that alive-enough tasks 
are not put in this kind of "out of kernel 'freezer'", but keep running. 
Modifying kgraft to (optionally) add the synchronization barrier and then 
flip the switch should be a rather trivial task, and can indeed be added 
as a simple option to the patch author / sysadmin. However ...

> All fundamental pieces of the simple method are necessary to get 
> guaranteed time transition from the complicated method: task tracking 
> and transparent catching of them, handling kthreads, etc.
> 
> My argument is that the simple method should be implemented first and 
> foremost.
> 
> Then people can do add-on features to possibly spread out the new 
> function versions in a more complicated way if they want to avoid the 
> stop-all-tasks transition - although I'm not convinced about it: I'm 
> sure sure many sysadmins would like the bug patching to be over with 
> quickly and not have their systems in an intermediate state like kgraft 
> does it.

... the choice the sysadmins have here is either have the system running 
in an intermediate state, or have the system completely dead for the *same 
time*. Because to finish the transition successfully, all the tasks have 
to be woken up in any case.

(please note that this is different to suspend/resume task freezing, 
because we don't care about sleeping tasks there).

But I do get your point; what you are basically saying is that your 
preference is what kgraft is doing, and option to allow for a global 
synchronization point should be added in parallel to the gradual lazy 
migration.

> In any case, as per my arguments above, examining the kernel stack is 
> superfluous (so we won't be exposed to the fragility of it either): 
> there's no need to examine it and writing such patches is misguided...
> 
> Thanks,
> 
>   Ingo
> 

-- 
Jiri Kosina
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-20 Thread Ingo Molnar

* Jiri Kosina  wrote:

> On Fri, 20 Feb 2015, Ingo Molnar wrote:
> 
> > So if your design is based on being able to discover 
>
> > 'live' functions in the kernel stack dump of all tasks 
> > in the system, I think you need a serious reboot of the 
> > whole approach and get rid of that fragility before any 
> > of that functionality gets upstream!
> 
> So let me repeat again, just to make sure that no more 
> confusion is being spread around -- there are aproaches 
> which do rely on stack contents, and aproaches which 
> don't. kpatch (the Red Hat solution) and ksplice (the 
> Oracle solution) contains stack analysis as a conceptual 
> design step, kgraft (the SUSE solution) doesn't.

So just to make my position really clear: any talk about 
looking at the kernel stack for backtraces is just crazy 
talk, considering how stack backtrace technology stands 
today and in the reasonable near future!

With that out of the way, the only safe mechanism to live 
patch the kernel (for sufficiently simple sets of changes 
to singular functions) I'm aware of at the moment is:

 - forcing all user space tasks out of kernel mode and
   intercepting them in a safe state. I.e. making sure that 
   no kernel code is executed, no kernel stack state is 
   used (modulo code closely related to the live
   patching mechanism and kernel threads in safe state, 
   lets ignore them for this argument)

There's two variants of this concept, which deals with the 
timing of how user-space tasks are forced into user mode:

  - the simple method: force all user-space tasks out of 
kernel mode, stop the machine for a brief moment and be 
done with the patching safely and essentially 
atomically.

  - the complicated method spread out over time: uses the 
same essential mechanism plus the ftrace patching 
machinery to detect whether all tasks have transitioned 
through a version flip. [this is what kgraft does in 
part.]

All fundamental pieces of the simple method are necessary 
to get guaranteed time transition from the complicated 
method: task tracking and transparent catching of them, 
handling kthreads, etc.

My argument is that the simple method should be implemented 
first and foremost.

Then people can do add-on features to possibly spread out 
the new function versions in a more complicated way if they 
want to avoid the stop-all-tasks transition - although I'm 
not convinced about it: I'm sure sure many sysadmins would 
like the bug patching to be over with quickly and not have 
their systems in an intermediate state like kgraft does it.

In any case, as per my arguments above, examining the 
kernel stack is superfluous (so we won't be exposed to the 
fragility of it either): there's no need to examine it and 
writing such patches is misguided...

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-20 Thread Jiri Kosina
On Fri, 20 Feb 2015, Ingo Molnar wrote:

> So if your design is based on being able to discover 'live' functions in 
> the kernel stack dump of all tasks in the system, I think you need a 
> serious reboot of the whole approach and get rid of that fragility 
> before any of that functionality gets upstream!

So let me repeat again, just to make sure that no more confusion is being 
spread around -- there are aproaches which do rely on stack contents, and 
aproaches which don't. kpatch (the Red Hat solution) and ksplice (the 
Oracle solution) contains stack analysis as a conceptual design step, 
kgraft (the SUSE solution) doesn't.

Also the limited (in features -- consistency models) common infrastructure 
that is currently upstream doesn't care about stack contents.

We are now figuring out which consistency models make sense for upstream, 
and how they can be achieved. Josh's patchset is one of the aproaches that 
is currently being discussed, but it's not the only available option.

Thanks,

-- 
Jiri Kosina
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-20 Thread Ingo Molnar

* Jiri Kosina  wrote:

> Alright, so to sum it up:
> 
> - current stack dumping (even looking at /proc//stack) is not 
>   guaranteed to yield "correct" results in case the task is running at the 
>   time the stack is being examined

Don't even _think_ about trying to base something as 
dangerous as live patching the kernel image on the concept 
of:

  'We can make all stack backtraces reliably correct all 
   the time, with no false positives, with no false
   negatives, 100% of the time, and quickly discover and
   fix bugs in that'.

It's not going to happen:

 - The correctness of stacktraces partially depends on
   tooling and we don't control those.

 - More importantly, there's no strong force that ensures
   we can rely on stack backtraces: correcting bad stack
   traces depends on people hitting those functions and
   situations that generate them, seeing a bad stack trace,
   noticing that it's weird and correcting whatever code or
   tooling quirk causes the stack entry to be incorrect.

Essentially unlike other kernel code which breaks stuff if 
it's incorrect, there's no _functional_ dependence on stack 
traces, so live patching would be the first (and pretty 
much only) thing that breaks on bad stack traces ...

If you think you can make something like dwarf annotations 
work reliably to base kernel live patching on that, 
reconsider.

Even with frame pointer backtraces can go bad sometimes, I 
wouldn't base live patching even on _that_, and that's a 
very simple concept with a performance cost that most 
distros don't want to pay.

So if your design is based on being able to discover 'live' 
functions in the kernel stack dump of all tasks in the 
system, I think you need a serious reboot of the whole 
approach and get rid of that fragility before any of that 
functionality gets upstream!

> - For live patching use-case, the stack has to be 
>   analyzed (and decision on what to do based on the 
>   analysis) in the NMI handler itself, otherwise it gets 
>   racy again

You simply cannot reliably determine from the kernel stack 
whether a function is used by a task or not, and actually 
modify the kernel image, from a stack backtrace, as things 
stand today. Full stop.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-20 Thread Jiri Kosina
Alright, so to sum it up:

- current stack dumping (even looking at /proc//stack) is not 
  guaranteed to yield "correct" results in case the task is running at the 
  time the stack is being examined

- the only fool-proof way is to send IPI-NMI to all CPUs, and synchronize 
  the handlers between each other (to make sure that reschedule doesn't 
  happen in between on some CPU and other task doesn't start running in 
  the interim). 
  The NMI handler dumps its current stack in case it's running in context 
  of the process whose stack is to be dumped. Otherwise, one of the NMI 
  handlers looks up the required task_struct, and dumps it if it's not 
  running on any CPU

- For live patching use-case, the stack has to be analyzed (and decision 
  on what to do based on the analysis) in the NMI handler itself, 
  otherwise it gets racy again

Converting /proc//stack to this mechanism seems like a correct thing 
to do in any case, as it's slow path anyway.

The original intent seemed to have been to make this fast path for the 
live patching case, but that's probably not possible, so it seems like the 
price that will have to be paid for being able to finish live-patching of 
CPU-bound processess is the cost of IPI-NMI broadcast.

Agreed?

Thanks,

-- 
Jiri Kosina
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-20 Thread Ingo Molnar

* Jiri Kosina jkos...@suse.cz wrote:

 On Fri, 20 Feb 2015, Ingo Molnar wrote:
 
  So if your design is based on being able to discover 

  'live' functions in the kernel stack dump of all tasks 
  in the system, I think you need a serious reboot of the 
  whole approach and get rid of that fragility before any 
  of that functionality gets upstream!
 
 So let me repeat again, just to make sure that no more 
 confusion is being spread around -- there are aproaches 
 which do rely on stack contents, and aproaches which 
 don't. kpatch (the Red Hat solution) and ksplice (the 
 Oracle solution) contains stack analysis as a conceptual 
 design step, kgraft (the SUSE solution) doesn't.

So just to make my position really clear: any talk about 
looking at the kernel stack for backtraces is just crazy 
talk, considering how stack backtrace technology stands 
today and in the reasonable near future!

With that out of the way, the only safe mechanism to live 
patch the kernel (for sufficiently simple sets of changes 
to singular functions) I'm aware of at the moment is:

 - forcing all user space tasks out of kernel mode and
   intercepting them in a safe state. I.e. making sure that 
   no kernel code is executed, no kernel stack state is 
   used (modulo code closely related to the live
   patching mechanism and kernel threads in safe state, 
   lets ignore them for this argument)

There's two variants of this concept, which deals with the 
timing of how user-space tasks are forced into user mode:

  - the simple method: force all user-space tasks out of 
kernel mode, stop the machine for a brief moment and be 
done with the patching safely and essentially 
atomically.

  - the complicated method spread out over time: uses the 
same essential mechanism plus the ftrace patching 
machinery to detect whether all tasks have transitioned 
through a version flip. [this is what kgraft does in 
part.]

All fundamental pieces of the simple method are necessary 
to get guaranteed time transition from the complicated 
method: task tracking and transparent catching of them, 
handling kthreads, etc.

My argument is that the simple method should be implemented 
first and foremost.

Then people can do add-on features to possibly spread out 
the new function versions in a more complicated way if they 
want to avoid the stop-all-tasks transition - although I'm 
not convinced about it: I'm sure sure many sysadmins would 
like the bug patching to be over with quickly and not have 
their systems in an intermediate state like kgraft does it.

In any case, as per my arguments above, examining the 
kernel stack is superfluous (so we won't be exposed to the 
fragility of it either): there's no need to examine it and 
writing such patches is misguided...

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-20 Thread Jiri Kosina
On Fri, 20 Feb 2015, Ingo Molnar wrote:

   - the complicated method spread out over time: uses the 
 same essential mechanism plus the ftrace patching 
 machinery to detect whether all tasks have transitioned 
 through a version flip. [this is what kgraft does in 
 part.]

The only difference of this to what kgraft does is that alive-enough tasks 
are not put in this kind of out of kernel 'freezer', but keep running. 
Modifying kgraft to (optionally) add the synchronization barrier and then 
flip the switch should be a rather trivial task, and can indeed be added 
as a simple option to the patch author / sysadmin. However ...

 All fundamental pieces of the simple method are necessary to get 
 guaranteed time transition from the complicated method: task tracking 
 and transparent catching of them, handling kthreads, etc.
 
 My argument is that the simple method should be implemented first and 
 foremost.
 
 Then people can do add-on features to possibly spread out the new 
 function versions in a more complicated way if they want to avoid the 
 stop-all-tasks transition - although I'm not convinced about it: I'm 
 sure sure many sysadmins would like the bug patching to be over with 
 quickly and not have their systems in an intermediate state like kgraft 
 does it.

... the choice the sysadmins have here is either have the system running 
in an intermediate state, or have the system completely dead for the *same 
time*. Because to finish the transition successfully, all the tasks have 
to be woken up in any case.

(please note that this is different to suspend/resume task freezing, 
because we don't care about sleeping tasks there).

But I do get your point; what you are basically saying is that your 
preference is what kgraft is doing, and option to allow for a global 
synchronization point should be added in parallel to the gradual lazy 
migration.

 In any case, as per my arguments above, examining the kernel stack is 
 superfluous (so we won't be exposed to the fragility of it either): 
 there's no need to examine it and writing such patches is misguided...
 
 Thanks,
 
   Ingo
 

-- 
Jiri Kosina
SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-20 Thread Jiri Kosina
Alright, so to sum it up:

- current stack dumping (even looking at /proc/pid/stack) is not 
  guaranteed to yield correct results in case the task is running at the 
  time the stack is being examined

- the only fool-proof way is to send IPI-NMI to all CPUs, and synchronize 
  the handlers between each other (to make sure that reschedule doesn't 
  happen in between on some CPU and other task doesn't start running in 
  the interim). 
  The NMI handler dumps its current stack in case it's running in context 
  of the process whose stack is to be dumped. Otherwise, one of the NMI 
  handlers looks up the required task_struct, and dumps it if it's not 
  running on any CPU

- For live patching use-case, the stack has to be analyzed (and decision 
  on what to do based on the analysis) in the NMI handler itself, 
  otherwise it gets racy again

Converting /proc/pid/stack to this mechanism seems like a correct thing 
to do in any case, as it's slow path anyway.

The original intent seemed to have been to make this fast path for the 
live patching case, but that's probably not possible, so it seems like the 
price that will have to be paid for being able to finish live-patching of 
CPU-bound processess is the cost of IPI-NMI broadcast.

Agreed?

Thanks,

-- 
Jiri Kosina
SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-20 Thread Jiri Kosina
On Fri, 20 Feb 2015, Ingo Molnar wrote:

 So if your design is based on being able to discover 'live' functions in 
 the kernel stack dump of all tasks in the system, I think you need a 
 serious reboot of the whole approach and get rid of that fragility 
 before any of that functionality gets upstream!

So let me repeat again, just to make sure that no more confusion is being 
spread around -- there are aproaches which do rely on stack contents, and 
aproaches which don't. kpatch (the Red Hat solution) and ksplice (the 
Oracle solution) contains stack analysis as a conceptual design step, 
kgraft (the SUSE solution) doesn't.

Also the limited (in features -- consistency models) common infrastructure 
that is currently upstream doesn't care about stack contents.

We are now figuring out which consistency models make sense for upstream, 
and how they can be achieved. Josh's patchset is one of the aproaches that 
is currently being discussed, but it's not the only available option.

Thanks,

-- 
Jiri Kosina
SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-20 Thread Ingo Molnar

* Jiri Kosina jkos...@suse.cz wrote:

 Alright, so to sum it up:
 
 - current stack dumping (even looking at /proc/pid/stack) is not 
   guaranteed to yield correct results in case the task is running at the 
   time the stack is being examined

Don't even _think_ about trying to base something as 
dangerous as live patching the kernel image on the concept 
of:

  'We can make all stack backtraces reliably correct all 
   the time, with no false positives, with no false
   negatives, 100% of the time, and quickly discover and
   fix bugs in that'.

It's not going to happen:

 - The correctness of stacktraces partially depends on
   tooling and we don't control those.

 - More importantly, there's no strong force that ensures
   we can rely on stack backtraces: correcting bad stack
   traces depends on people hitting those functions and
   situations that generate them, seeing a bad stack trace,
   noticing that it's weird and correcting whatever code or
   tooling quirk causes the stack entry to be incorrect.

Essentially unlike other kernel code which breaks stuff if 
it's incorrect, there's no _functional_ dependence on stack 
traces, so live patching would be the first (and pretty 
much only) thing that breaks on bad stack traces ...

If you think you can make something like dwarf annotations 
work reliably to base kernel live patching on that, 
reconsider.

Even with frame pointer backtraces can go bad sometimes, I 
wouldn't base live patching even on _that_, and that's a 
very simple concept with a performance cost that most 
distros don't want to pay.

So if your design is based on being able to discover 'live' 
functions in the kernel stack dump of all tasks in the 
system, I think you need a serious reboot of the whole 
approach and get rid of that fragility before any of that 
functionality gets upstream!

 - For live patching use-case, the stack has to be 
   analyzed (and decision on what to do based on the 
   analysis) in the NMI handler itself, otherwise it gets 
   racy again

You simply cannot reliably determine from the kernel stack 
whether a function is used by a task or not, and actually 
modify the kernel image, from a stack backtrace, as things 
stand today. Full stop.

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-20 Thread Vojtech Pavlik
On Fri, Feb 20, 2015 at 08:49:01PM +0100, Ingo Molnar wrote:

  ... the choice the sysadmins have here is either have the 
  system running in an intermediate state, or have the 
  system completely dead for the *same time*. Because to 
  finish the transition successfully, all the tasks have to 
  be woken up in any case.
 
 That statement is false: an 'intermediate state' system 
 where 'new' tasks are still running is still running and 
 will interfere with the resolution of 'old' tasks.

Can you suggest a way how they would interfere? The transition happens
on entering or returning from a syscall, there is no influence between
individual tasks.

If you mean that the patch author has to consider the fact that both old
and new code will be running simultaneously, then yes, they have to.

  But I do get your point; what you are basically saying is 
  that your preference is what kgraft is doing, and option 
  to allow for a global synchronization point should be 
  added in parallel to the gradual lazy migration.
 
 I think you misunderstood: the 'simple' method I outlined 
 does not just 'synchronize', it actually executes the live 
 patching atomically, once all tasks are gathered and we 
 know they are _all_ in a safe state.

The 'simple' method has to catch and freeze all tasks one by one in
syscall entry/exit, at the kernel/userspace boundary, until all are
frozen and then patch the system atomically. 

This means that each and every sleeping task in the system has to be
woken up in some way (sending a signal ...) to exit from a syscall it is
sleeping in. Same for CPU hogs. All kernel threads need to be parked.

This is exactly what you need to do for kGraft to complete patching.

This may take a significant amount of time to achieve and you won't be
able to use a userspace script to send the signals, because it'd be
frozen immediately.

 I.e. it's in essence the strong stop-all atomic patching 
 model of 'kpatch', combined with the reliable avoidance of 
 kernel stacks that 'kgraft' uses.

 That should be the starting point, because it's the most 
 reliable method.

In the consistency models discussion, this was marked the
LEAVE_KERNEL+SWITCH_KERNEL model. It's indeed the strongest model of
all, but also comes at the highest cost in terms of impact on running
tasks. It's so high (the interruption may be seconds or more) that it
was deemed not worth implementing.

-- 
Vojtech Pavlik
Director SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-20 Thread Josh Poimboeuf
On Fri, Feb 20, 2015 at 10:46:13PM +0100, Vojtech Pavlik wrote:
 On Fri, Feb 20, 2015 at 08:49:01PM +0100, Ingo Molnar wrote:
  I.e. it's in essence the strong stop-all atomic patching 
  model of 'kpatch', combined with the reliable avoidance of 
  kernel stacks that 'kgraft' uses.
 
  That should be the starting point, because it's the most 
  reliable method.
 
 In the consistency models discussion, this was marked the
 LEAVE_KERNEL+SWITCH_KERNEL model. It's indeed the strongest model of
 all, but also comes at the highest cost in terms of impact on running
 tasks. It's so high (the interruption may be seconds or more) that it
 was deemed not worth implementing.

Yeah, this is way too disruptive to the user.

Even the comparatively tiny latency caused by kpatch's use of
stop_machine() was considered unacceptable by some.

Plus a lot of processes would see EINTR, causing more havoc.

-- 
Josh
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-20 Thread Josh Poimboeuf
On Fri, Feb 20, 2015 at 09:08:49PM +0100, Ingo Molnar wrote:
 * Josh Poimboeuf jpoim...@redhat.com wrote:
  On Fri, Feb 20, 2015 at 10:50:03AM +0100, Ingo Molnar wrote:
   * Jiri Kosina jkos...@suse.cz wrote:
   
Alright, so to sum it up:

- current stack dumping (even looking at /proc/pid/stack) is not 
  guaranteed to yield correct results in case the task is running at 
the 
  time the stack is being examined
   
   Don't even _think_ about trying to base something as 
   dangerous as live patching the kernel image on the concept 
   of:
   
 'We can make all stack backtraces reliably correct all 
  the time, with no false positives, with no false
  negatives, 100% of the time, and quickly discover and
  fix bugs in that'.
   
   It's not going to happen:
   
- The correctness of stacktraces partially depends on
  tooling and we don't control those.
  
  What tooling are you referring to?
  
  Sure, we rely on the compiler to produce the correct 
  assembly for putting frame pointers on the stack. [...]
 
 We also rely on all assembly code to have valid frames all 
 the time - and this is far from a given, we've had many 
 bugs in that area.

Are you talking about kernel bugs or compiler bugs?  Either way, would
those bugs not be caught by the stack unwinding validation improvements
I proposed?

  [...]  But that's pretty straightforward. The kernel 
  already relies on the compiler to do plenty of things 
  which are much more complex than that.
 
 So this claim scares me because it's such nonsense:
 
 We rely on the compiler to create _functional code_ for us.  
 If the compiler messes up then functionality, actual code 
 breaks.
 
 'Valid stack frame' is not functional code, it's in essence 
 debug info.

Only because we _treat_ it like debug info.  I'm proposing that we treat
it like functional code.  So that yes, if the compiler messes up frame
pointers, actual code breaks.  (But we could detect that before we do
the patch, so it doesn't destabilize the system.)

If gcc can't do frame pointers right, it's a *major* gcc bug.  It's
nothing more than:

   push   %rbp
   mov%rsp,%rbp

at the beginning of the function, and:

   pop%rbp

right before the function returns.

If you really think the compiler could possibly mess that up, I can
probably write a simple script, executed by the Makefile, which
validates that every C function in the kernel has the above pattern.

 If a backtrace is wrong in some rare, racy case 
 then that won't break the kernel, it just gives slightly 
 misleading debug info in most cases.
 
 Now you want to turn 'debug info' to have functional 
 relevance, for something as intrusive as patching the 
 kernel code.
 
 You really need to accept this distinction!

Yes, I already realize that this would be the first time the kernel
relies on stack traces being 100% correct.  Peter and I had already some
discussion about it already elsewhere in this thread.

Just because we haven't relied on its correctness in the past doesn't
mean we can't rely on it in the future.

- More importantly, there's no strong force that ensures
  we can rely on stack backtraces: correcting bad 
  stack traces depends on people hitting those 
  functions and situations that generate them, seeing 
  a bad stack trace, noticing that it's weird and 
  correcting whatever code or tooling quirk causes the 
  stack entry to be incorrect.
  
   Essentially unlike other kernel code which breaks stuff 
   if it's incorrect, there's no _functional_ dependence 
   on stack traces, so live patching would be the first 
   (and pretty much only) thing that breaks on bad stack 
   traces ...
  
  First, there are several things we do to avoid anomalies:
  
  - we don't patch asm functions
 
 the problem with asm functions isn't that we might want to 
 patch them (although sometimes we might want to: especially 
 the more interesting security fixes tend to be in assembly 
 code), it's that asm functions can (easily) mess up debug 
 info, without that being apparent in any functional way.

Again, that's why I proposed the stack unwinding validation
improvements.

  - we only walk the stack of sleeping tasks
 
 sleeping tasks are affected by debug info bugs too.
 
  Second, currently the stack unwinding code is rather crude and
  doesn't do much in the way of error handling.
 
 That's a big red flag in my book ...

I wasn't talking about my patch set.  I was talking about the existing
stack dumping code in the kernel.

-- 
Josh
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-20 Thread Josh Poimboeuf
On Fri, Feb 20, 2015 at 10:50:03AM +0100, Ingo Molnar wrote:
 
 * Jiri Kosina jkos...@suse.cz wrote:
 
  Alright, so to sum it up:
  
  - current stack dumping (even looking at /proc/pid/stack) is not 
guaranteed to yield correct results in case the task is running at the 
time the stack is being examined
 
 Don't even _think_ about trying to base something as 
 dangerous as live patching the kernel image on the concept 
 of:
 
   'We can make all stack backtraces reliably correct all 
the time, with no false positives, with no false
negatives, 100% of the time, and quickly discover and
fix bugs in that'.
 
 It's not going to happen:
 
  - The correctness of stacktraces partially depends on
tooling and we don't control those.

What tooling are you referring to?

Sure, we rely on the compiler to produce the correct assembly for
putting frame pointers on the stack.  But that's pretty straightforward.
The kernel already relies on the compiler to do plenty of things which
are much more complex than that.

  - More importantly, there's no strong force that ensures
we can rely on stack backtraces: correcting bad stack
traces depends on people hitting those functions and
situations that generate them, seeing a bad stack trace,
noticing that it's weird and correcting whatever code or
tooling quirk causes the stack entry to be incorrect.

 Essentially unlike other kernel code which breaks stuff if 
 it's incorrect, there's no _functional_ dependence on stack 
 traces, so live patching would be the first (and pretty 
 much only) thing that breaks on bad stack traces ...

First, there are several things we do to avoid anomalies:

- we don't patch asm functions
- we only walk the stack of sleeping tasks

Second, currently the stack unwinding code is rather crude and doesn't
do much in the way of error handling.  There are several error
conditions we could easily check for programmatically:

- make sure it starts from a __schedule() call at the top (I've only
  proposed walking the stacks of sleeping tasks)
- make sure we actually reach the bottom of the stack
- make sure each stack frame's return address is immediately after a
  call instruction

If any of these checks fail, we can either delay the patching of the
task until later or we can cancel the patching process, with no harm
done.  Either way we can WARN the user so that we get reports of these
anomalies.

 If you think you can make something like dwarf annotations 
 work reliably to base kernel live patching on that, 
 reconsider.

No, we would rely on CONFIG_FRAME_POINTER.

 Even with frame pointer backtraces can go bad sometimes, I 
 wouldn't base live patching even on _that_,

Other than hand-coded assembly, can you provide more details about how
frame pointer stack traces can go bad?

 and that's a very simple concept with a performance cost that most
 distros don't want to pay.

Hm, CONFIG_FRAME_POINTER is enabled on all the distros I use.

 So if your design is based on being able to discover 'live' 
 functions in the kernel stack dump of all tasks in the 
 system, I think you need a serious reboot of the whole 
 approach and get rid of that fragility before any of that 
 functionality gets upstream!

Just to clarify a couple of things:

1. Despite the email subject, this discussion is really about another
   RFC patch set [1].  It hasn't been merged, and there's still a lot of
   ongoing discussion about it.

2. We don't dump the stack of _all_ tasks.  Only sleeping ones.


[1] https://lkml.org/lkml/2015/2/9/475

-- 
Josh
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-20 Thread Josh Poimboeuf
On Fri, Feb 20, 2015 at 09:49:32AM +0100, Jiri Kosina wrote:
 Alright, so to sum it up:
 
 - current stack dumping (even looking at /proc/pid/stack) is not 
   guaranteed to yield correct results in case the task is running at the 
   time the stack is being examined
 
 - the only fool-proof way is to send IPI-NMI to all CPUs, and synchronize 
   the handlers between each other (to make sure that reschedule doesn't 
   happen in between on some CPU and other task doesn't start running in 
   the interim). 
   The NMI handler dumps its current stack in case it's running in context 
   of the process whose stack is to be dumped. Otherwise, one of the NMI 
   handlers looks up the required task_struct, and dumps it if it's not 
   running on any CPU
 
 - For live patching use-case, the stack has to be analyzed (and decision 
   on what to do based on the analysis) in the NMI handler itself, 
   otherwise it gets racy again
 
 Converting /proc/pid/stack to this mechanism seems like a correct thing 
 to do in any case, as it's slow path anyway.
 
 The original intent seemed to have been to make this fast path for the 
 live patching case, but that's probably not possible, so it seems like the 
 price that will have to be paid for being able to finish live-patching of 
 CPU-bound processess is the cost of IPI-NMI broadcast.

Hm, syncing IPI's among CPUs sounds pretty disruptive.

This is really two different issues, so I'll separate them:

1. /proc/pid/stack for running tasks

I haven't heard anybody demanding that /proc/pid/stack should actually
print the stack for running tasks.  My suggestion was just that we avoid
the possibility of printing garbage.

Today's behavior for a running task is (usually): 

  # cat /proc/802/stack
  [] 0x

How about, when we detecting a running task, just always show that?
That would give us today's behavior, except without occasionally
printing garbage, while avoiding all the overhead of syncing IPI's.

2. live patching of running tasks

I don't see why we would need to sync IPI's to patch CPU-bound
processes.  Why not use context tracking or the TIF_USERSPACE flag like
I mentioned before?

-- 
Josh
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-20 Thread Ingo Molnar

* Jiri Kosina jkos...@suse.cz wrote:

  All fundamental pieces of the simple method are 
  necessary to get guaranteed time transition from the 
  complicated method: task tracking and transparent 
  catching of them, handling kthreads, etc.
  
  My argument is that the simple method should be 
  implemented first and foremost.
  
  Then people can do add-on features to possibly spread 
  out the new function versions in a more complicated way 
  if they want to avoid the stop-all-tasks transition - 
  although I'm not convinced about it: I'm sure sure many 
  sysadmins would like the bug patching to be over with 
  quickly and not have their systems in an intermediate 
  state like kgraft does it.
 
 ... the choice the sysadmins have here is either have the 
 system running in an intermediate state, or have the 
 system completely dead for the *same time*. Because to 
 finish the transition successfully, all the tasks have to 
 be woken up in any case.

That statement is false: an 'intermediate state' system 
where 'new' tasks are still running is still running and 
will interfere with the resolution of 'old' tasks.

 But I do get your point; what you are basically saying is 
 that your preference is what kgraft is doing, and option 
 to allow for a global synchronization point should be 
 added in parallel to the gradual lazy migration.

I think you misunderstood: the 'simple' method I outlined 
does not just 'synchronize', it actually executes the live 
patching atomically, once all tasks are gathered and we 
know they are _all_ in a safe state.

I.e. it's in essence the strong stop-all atomic patching 
model of 'kpatch', combined with the reliable avoidance of 
kernel stacks that 'kgraft' uses.

That should be the starting point, because it's the most 
reliable method.

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-20 Thread Ingo Molnar

* Josh Poimboeuf jpoim...@redhat.com wrote:

 On Fri, Feb 20, 2015 at 10:50:03AM +0100, Ingo Molnar wrote:
  
  * Jiri Kosina jkos...@suse.cz wrote:
  
   Alright, so to sum it up:
   
   - current stack dumping (even looking at /proc/pid/stack) is not 
 guaranteed to yield correct results in case the task is running at 
   the 
 time the stack is being examined
  
  Don't even _think_ about trying to base something as 
  dangerous as live patching the kernel image on the concept 
  of:
  
'We can make all stack backtraces reliably correct all 
 the time, with no false positives, with no false
 negatives, 100% of the time, and quickly discover and
 fix bugs in that'.
  
  It's not going to happen:
  
   - The correctness of stacktraces partially depends on
 tooling and we don't control those.
 
 What tooling are you referring to?
 
 Sure, we rely on the compiler to produce the correct 
 assembly for putting frame pointers on the stack. [...]

We also rely on all assembly code to have valid frames all 
the time - and this is far from a given, we've had many 
bugs in that area.

 [...]  But that's pretty straightforward. The kernel 
 already relies on the compiler to do plenty of things 
 which are much more complex than that.

So this claim scares me because it's such nonsense:

We rely on the compiler to create _functional code_ for us.  
If the compiler messes up then functionality, actual code 
breaks.

'Valid stack frame' is not functional code, it's in essence 
debug info. If a backtrace is wrong in some rare, racy case 
then that won't break the kernel, it just gives slightly 
misleading debug info in most cases.

Now you want to turn 'debug info' to have functional 
relevance, for something as intrusive as patching the 
kernel code.

You really need to accept this distinction!

   - More importantly, there's no strong force that ensures
 we can rely on stack backtraces: correcting bad 
 stack traces depends on people hitting those 
 functions and situations that generate them, seeing 
 a bad stack trace, noticing that it's weird and 
 correcting whatever code or tooling quirk causes the 
 stack entry to be incorrect.
 
  Essentially unlike other kernel code which breaks stuff 
  if it's incorrect, there's no _functional_ dependence 
  on stack traces, so live patching would be the first 
  (and pretty much only) thing that breaks on bad stack 
  traces ...
 
 First, there are several things we do to avoid anomalies:
 
 - we don't patch asm functions

the problem with asm functions isn't that we might want to 
patch them (although sometimes we might want to: especially 
the more interesting security fixes tend to be in assembly 
code), it's that asm functions can (easily) mess up debug 
info, without that being apparent in any functional way.

 - we only walk the stack of sleeping tasks

sleeping tasks are affected by debug info bugs too.

 Second, currently the stack unwinding code is rather 
 crude and doesn't do much in the way of error handling.

That's a big red flag in my book ...

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-19 Thread Jiri Kosina
On Thu, 19 Feb 2015, Josh Poimboeuf wrote:

> So I've looked at kgr_needs_lazy_migration(), but I still have no idea
> how it works.
> 
> First of all, I think reading the stack while its being written to could
> give you some garbage values, and a completely wrong nr_entries value
> from save_stack_trace_tsk().

I believe we've already been discussing this some time ago ...

I agree that this is a very crude optimization that should probably be 
either removed (which would only cause slower convergence in the presence 
of CPU-bound tasks), or rewritten to perform IPI-based stack dumping 
(probably on a voluntarily-configurable basis).

Reading garbage values could only happen if the task would be running in 
kernelspace. nr_entries would then be at least 2.

But I agree that relying on this very specific behavior is not really 
safe in general in case someone changes the stack dumping implementation 
in the future in an unpredictable way.

> But also, how would you walk a stack without knowing its stack pointer? 
> That function relies on the saved stack pointer in 
> task_struct.thread.sp, which, AFAICT, was last saved during the last 
> call to schedule().  Since then, the stack could have been completely 
> rewritten, with different size stack frames, before the task exited the 
> kernel.

Same argument holds here as well, I believe.

Thanks,

-- 
Jiri Kosina
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-19 Thread Josh Poimboeuf
On Thu, Feb 19, 2015 at 10:26:09PM +0100, Jiri Kosina wrote:
> On Thu, 19 Feb 2015, Josh Poimboeuf wrote:
> 
> > How about with a TIF_IN_USERSPACE thread flag?  It could be cleared/set
> > right at the border.  Then for running tasks it's as simple as:
> > 
> > if (test_tsk_thread_flag(task, TIF_IN_USERSPACE))
> > klp_switch_task_universe(task);
> 
> That's in principle what CONTEXT_TRACKING is doing, i.e. the condition 
> we'd be interested in would be
> 
>   __this_cpu_read(context_tracking.state) == IN_USER
> 
> But it has overhead.

Yeah, that does seem to pretty much do what we want.  Unfortunately it
has a much higher overhead than just setting a thread flag.

And from the Kconfig description for CONTEXT_TRACKING_FORCE, which would
enable it on all CPUs during boot, "this option brings an overhead that
you don't want in production."

Maybe that code needs a rewrite to rely on a thread flag instead.  Then
we could use it too.

-- 
Josh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-19 Thread Josh Poimboeuf
On Thu, Feb 19, 2015 at 09:40:36PM +0100, Vojtech Pavlik wrote:
> On Thu, Feb 19, 2015 at 11:32:55AM -0600, Josh Poimboeuf wrote:
> > On Thu, Feb 19, 2015 at 06:19:29PM +0100, Vojtech Pavlik wrote:
> > > On Thu, Feb 19, 2015 at 11:03:53AM -0600, Josh Poimboeuf wrote:
> > > > On Thu, Feb 19, 2015 at 05:33:59PM +0100, Vojtech Pavlik wrote:
> > > > > On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote:
> > > > > 
> > > > > > > No, these tasks will _never_ make syscalls. So you need to 
> > > > > > > guarantee
> > > > > > > they don't accidentally enter the kernel while you flip them. 
> > > > > > > Something
> > > > > > > like so should do.
> > > > > > > 
> > > > > > > You set TIF_ENTER_WAIT on them, check they're still in userspace, 
> > > > > > > flip
> > > > > > > them then clear TIF_ENTER_WAIT.
> > > > > > 
> > > > > > Ah, that's a good idea.  But how do we check if they're in user 
> > > > > > space?
> > > > > 
> > > > > I don't see the benefit in holding them in a loop - you can just as 
> > > > > well
> > > > > flip them from the syscall code as kGraft does.
> > > > 
> > > > But we were talking specifically about HPC tasks which never make
> > > > syscalls.
> > > 
> > > Yes. I'm saying that rather than guaranteeing they don't enter the
> > > kernel (by having them spin) you can flip them in case they try to do
> > > that instead. That solves the race condition just as well.
> > 
> > Ok, gotcha.
> > 
> > We'd still need a safe way to check if they're in user space though.
> 
> Having a safe way would be very nice and actually quite useful in other
> cases, too.
> 
> For this specific purpose, however, we don't need a very safe way,
> though. We don't require atomicity in any way, we don't mind even if it
> creates false negatives, only false positives would be bad.
> 
> kGraft looks at the stacktrace of CPU hogs and if it finds no kernel
> addresses there, it assumes userspace. Not very nice, but does the job.

So I've looked at kgr_needs_lazy_migration(), but I still have no idea
how it works.

First of all, I think reading the stack while its being written to could
give you some garbage values, and a completely wrong nr_entries value
from save_stack_trace_tsk().

But also, how would you walk a stack without knowing its stack pointer?
That function relies on the saved stack pointer in
task_struct.thread.sp, which, AFAICT, was last saved during the last
call to schedule().  Since then, the stack could have been completely
rewritten, with different size stack frames, before the task exited the
kernel.

Am I missing something?

-- 
Josh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-19 Thread Jiri Kosina
On Thu, 19 Feb 2015, Jiri Kosina wrote:

> > How about with a TIF_IN_USERSPACE thread flag?  It could be cleared/set
> > right at the border.  Then for running tasks it's as simple as:
> > 
> > if (test_tsk_thread_flag(task, TIF_IN_USERSPACE))
> > klp_switch_task_universe(task);
> 
> That's in principle what CONTEXT_TRACKING is doing, i.e. the condition 
> we'd be interested in would be
> 
>   __this_cpu_read(context_tracking.state) == IN_USER

Well, more precisely

per_cpu(context_tracking.state, cpu) == IN_USER

of course.

-- 
Jiri Kosina
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-19 Thread Jiri Kosina
On Thu, 19 Feb 2015, Josh Poimboeuf wrote:

> How about with a TIF_IN_USERSPACE thread flag?  It could be cleared/set
> right at the border.  Then for running tasks it's as simple as:
> 
> if (test_tsk_thread_flag(task, TIF_IN_USERSPACE))
>   klp_switch_task_universe(task);

That's in principle what CONTEXT_TRACKING is doing, i.e. the condition 
we'd be interested in would be

__this_cpu_read(context_tracking.state) == IN_USER

But it has overhead.

-- 
Jiri Kosina
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-19 Thread Vojtech Pavlik
On Thu, Feb 19, 2015 at 11:32:55AM -0600, Josh Poimboeuf wrote:
> On Thu, Feb 19, 2015 at 06:19:29PM +0100, Vojtech Pavlik wrote:
> > On Thu, Feb 19, 2015 at 11:03:53AM -0600, Josh Poimboeuf wrote:
> > > On Thu, Feb 19, 2015 at 05:33:59PM +0100, Vojtech Pavlik wrote:
> > > > On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote:
> > > > 
> > > > > > No, these tasks will _never_ make syscalls. So you need to guarantee
> > > > > > they don't accidentally enter the kernel while you flip them. 
> > > > > > Something
> > > > > > like so should do.
> > > > > > 
> > > > > > You set TIF_ENTER_WAIT on them, check they're still in userspace, 
> > > > > > flip
> > > > > > them then clear TIF_ENTER_WAIT.
> > > > > 
> > > > > Ah, that's a good idea.  But how do we check if they're in user space?
> > > > 
> > > > I don't see the benefit in holding them in a loop - you can just as well
> > > > flip them from the syscall code as kGraft does.
> > > 
> > > But we were talking specifically about HPC tasks which never make
> > > syscalls.
> > 
> > Yes. I'm saying that rather than guaranteeing they don't enter the
> > kernel (by having them spin) you can flip them in case they try to do
> > that instead. That solves the race condition just as well.
> 
> Ok, gotcha.
> 
> We'd still need a safe way to check if they're in user space though.

Having a safe way would be very nice and actually quite useful in other
cases, too.

For this specific purpose, however, we don't need a very safe way,
though. We don't require atomicity in any way, we don't mind even if it
creates false negatives, only false positives would be bad.

kGraft looks at the stacktrace of CPU hogs and if it finds no kernel
addresses there, it assumes userspace. Not very nice, but does the job.

> How about with a TIF_IN_USERSPACE thread flag?  It could be cleared/set
> right at the border.  Then for running tasks it's as simple as:
> 
> if (test_tsk_thread_flag(task, TIF_IN_USERSPACE))
>   klp_switch_task_universe(task);

-- 
Vojtech Pavlik
Director SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-19 Thread Vojtech Pavlik
On February 19, 2015 6:32:55 PM CET, Josh Poimboeuf  wrote:

>> Yes. I'm saying that rather than guaranteeing they don't enter the
>> kernel (by having them spin) you can flip them in case they try to do
>> that instead. That solves the race condition just as well.
>
>Ok, gotcha.
>
>We'd still need a safe way to check if they're in user space though.
>
>How about with a TIF_IN_USERSPACE thread flag?  It could be cleared/set
>right at the border.  Then for running tasks it's as simple as:
>
>if (test_tsk_thread_flag(task, TIF_IN_USERSPACE))
>   klp_switch_task_universe(task);

The s390x arch used to have a TIF_SYSCALL, which was doing exactly that (well, 
negated). I think it would work well, but it isn't entirely for free: two 
instructions per syscall and an extra TIF bit, which are (near) exhausted on 
some archs.

-- 
Vojtech Pavlik
Director SuSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-19 Thread Josh Poimboeuf
On Thu, Feb 19, 2015 at 06:19:29PM +0100, Vojtech Pavlik wrote:
> On Thu, Feb 19, 2015 at 11:03:53AM -0600, Josh Poimboeuf wrote:
> > On Thu, Feb 19, 2015 at 05:33:59PM +0100, Vojtech Pavlik wrote:
> > > On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote:
> > > 
> > > > > No, these tasks will _never_ make syscalls. So you need to guarantee
> > > > > they don't accidentally enter the kernel while you flip them. 
> > > > > Something
> > > > > like so should do.
> > > > > 
> > > > > You set TIF_ENTER_WAIT on them, check they're still in userspace, flip
> > > > > them then clear TIF_ENTER_WAIT.
> > > > 
> > > > Ah, that's a good idea.  But how do we check if they're in user space?
> > > 
> > > I don't see the benefit in holding them in a loop - you can just as well
> > > flip them from the syscall code as kGraft does.
> > 
> > But we were talking specifically about HPC tasks which never make
> > syscalls.
> 
> Yes. I'm saying that rather than guaranteeing they don't enter the
> kernel (by having them spin) you can flip them in case they try to do
> that instead. That solves the race condition just as well.

Ok, gotcha.

We'd still need a safe way to check if they're in user space though.

How about with a TIF_IN_USERSPACE thread flag?  It could be cleared/set
right at the border.  Then for running tasks it's as simple as:

if (test_tsk_thread_flag(task, TIF_IN_USERSPACE))
klp_switch_task_universe(task);

-- 
Josh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-19 Thread Vojtech Pavlik
On Thu, Feb 19, 2015 at 11:03:53AM -0600, Josh Poimboeuf wrote:
> On Thu, Feb 19, 2015 at 05:33:59PM +0100, Vojtech Pavlik wrote:
> > On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote:
> > 
> > > > No, these tasks will _never_ make syscalls. So you need to guarantee
> > > > they don't accidentally enter the kernel while you flip them. Something
> > > > like so should do.
> > > > 
> > > > You set TIF_ENTER_WAIT on them, check they're still in userspace, flip
> > > > them then clear TIF_ENTER_WAIT.
> > > 
> > > Ah, that's a good idea.  But how do we check if they're in user space?
> > 
> > I don't see the benefit in holding them in a loop - you can just as well
> > flip them from the syscall code as kGraft does.
> 
> But we were talking specifically about HPC tasks which never make
> syscalls.

Yes. I'm saying that rather than guaranteeing they don't enter the
kernel (by having them spin) you can flip them in case they try to do
that instead. That solves the race condition just as well.

-- 
Vojtech Pavlik
Director SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-19 Thread Jiri Kosina
On Thu, 19 Feb 2015, Josh Poimboeuf wrote:

> > > > No, these tasks will _never_ make syscalls. So you need to guarantee
> > > > they don't accidentally enter the kernel while you flip them. Something
> > > > like so should do.
> > > > 
> > > > You set TIF_ENTER_WAIT on them, check they're still in userspace, flip
> > > > them then clear TIF_ENTER_WAIT.
> > > 
> > > Ah, that's a good idea.  But how do we check if they're in user space?
> > 
> > I don't see the benefit in holding them in a loop - you can just as well
> > flip them from the syscall code as kGraft does.
> 
> But we were talking specifically about HPC tasks which never make
> syscalls.

Yeah. And getting answer to "is this task_struct currently running in 
userspace?" is not easy in a non-disruptive way.

Piggy-backing on CONFIG_CONTEXT_TRACKING is one option, but I guess the 
number of systems that have this option turned on will be rather limited 
(especially in HPC space).

Sending IPIs around to check whether the task is running in user-space or 
kernel-space would work, but it's disruptive to the running task, which 
might be unwanted as well.

-- 
Jiri Kosina
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-19 Thread Josh Poimboeuf
On Thu, Feb 19, 2015 at 05:33:59PM +0100, Vojtech Pavlik wrote:
> On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote:
> 
> > > No, these tasks will _never_ make syscalls. So you need to guarantee
> > > they don't accidentally enter the kernel while you flip them. Something
> > > like so should do.
> > > 
> > > You set TIF_ENTER_WAIT on them, check they're still in userspace, flip
> > > them then clear TIF_ENTER_WAIT.
> > 
> > Ah, that's a good idea.  But how do we check if they're in user space?
> 
> I don't see the benefit in holding them in a loop - you can just as well
> flip them from the syscall code as kGraft does.

But we were talking specifically about HPC tasks which never make
syscalls.

-- 
Josh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-19 Thread Vojtech Pavlik
On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote:

> > No, these tasks will _never_ make syscalls. So you need to guarantee
> > they don't accidentally enter the kernel while you flip them. Something
> > like so should do.
> > 
> > You set TIF_ENTER_WAIT on them, check they're still in userspace, flip
> > them then clear TIF_ENTER_WAIT.
> 
> Ah, that's a good idea.  But how do we check if they're in user space?

I don't see the benefit in holding them in a loop - you can just as well
flip them from the syscall code as kGraft does.

> > I still absolutely hate you need to disturb userspace like that. Signals
> > are quite visible and perturb userspace state.
> 
> Yeah, SIGSTOP on a sleeping task can be intrusive to user space if it
> results in EINTR being returned from a system call.  It's not ideal, but
> it's much less intrusive than something like suspend.
> 
> But anyway we leave it up to the user to decide whether they want to
> take that risk, or wait for the task to wake up on its own, or cancel
> the patch.
> 
> > Also, you cannot SIGCONT a task that was SIGSTOP'ed by userspace for
> > what they thought was a good reason. You'd wreck their state.
> 
> Hm, that's a good point.  Maybe use the freezer instead of signals?
> 
> (Again this would only be for those user tasks which are sleeping on a
> patched function)
> 
> > > But now I'm thinking that kthreads will almost never be a problem.  Most
> > > kthreads are basically this:
> > 
> > You guys are way too optimistic; maybe its because I've worked on
> > realtime stuff too much, but I'm always looking at worst cases. If you
> > can't handle those, I feel you might as well not bother :-)
> 
> Well, I think we're already resigned to the fact that live patching
> won't work for every patch, every time.  And that the patch author must
> know what they're doing, and must do it carefully.
> 
> Our target worst case is that the patching fails gracefully and the user
> can't patch their system with that particular patch.  Or that the system
> remains in a partially patched state forever, if the user is ok with
> that.
> 
> > > Patching thread_fn wouldn't be possible unless we killed the thread.
> > 
> > It is, see kthread_park().
> 
> When the kthread returns from kthread_parkme(), it'll still be running
> the old thread_fn code, regardless of whether we flipped the task's
> patch state.

We could instrument kthread_parkme() to be able to return to a different
function, but it'd be a bit crazy indeed.

-- 
Vojtech Pavlik
Director SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-19 Thread Josh Poimboeuf
On Thu, Feb 19, 2015 at 11:16:07AM +0100, Peter Zijlstra wrote:
> On Wed, Feb 18, 2015 at 10:17:53PM -0600, Josh Poimboeuf wrote:
> > On Thu, Feb 19, 2015 at 01:20:58AM +0100, Peter Zijlstra wrote:
> > > On Wed, Feb 18, 2015 at 11:12:56AM -0600, Josh Poimboeuf wrote:
> 
> > > > The next line of attack is patching tasks when exiting the kernel to
> > > > user space (system calls, interrupts, signals), to catch all CPU-bound
> > > > and some I/O-bound tasks.  That's done in patch 9 [1] of the consistency
> > > > model patch set.
> > > 
> > > So the HPC people are really into userspace that does for (;;) ; and
> > > isolate that on CPUs and have the tick interrupt stopped and all that.
> > > 
> > > You'll not catch those threads on the sysexit path.
> > > 
> > > And I'm fairly sure they'll not want to SIGSTOP/CONT their stuff either.
> > > 
> > > Now its fairly easy to also handle this; just mark those tasks with a
> > > _TIF_WORK_SYSCALL_ENTRY flag, have that slowpath wait for the flag to
> > > go-away, then flip their state and clear the flag.
> > 
> > I guess you mean patch the task when it makes a syscall?  I'm doing that
> > already on syscall exit with a bit in _TIF_ALLWORK_MASK and
> > _TIF_DO_NOTIFY_MASK.
> 
> No, these tasks will _never_ make syscalls. So you need to guarantee
> they don't accidentally enter the kernel while you flip them. Something
> like so should do.
> 
> You set TIF_ENTER_WAIT on them, check they're still in userspace, flip
> them then clear TIF_ENTER_WAIT.

Ah, that's a good idea.  But how do we check if they're in user space?

> > > > As a last resort, if there are still any tasks which are sleeping on a
> > > > to-be-patched function, the user can send them SIGSTOP and SIGCONT to
> > > > force them to be patched.
> > > 
> > > You typically cannot SIGSTOP/SIGCONT kernel threads. Also
> > > TASK_UNINTERRUPTIBLE sleeps are unaffected by signals.
> > > 
> > > Bit pesky that.. needs pondering.
> 
> I still absolutely hate you need to disturb userspace like that. Signals
> are quite visible and perturb userspace state.

Yeah, SIGSTOP on a sleeping task can be intrusive to user space if it
results in EINTR being returned from a system call.  It's not ideal, but
it's much less intrusive than something like suspend.

But anyway we leave it up to the user to decide whether they want to
take that risk, or wait for the task to wake up on its own, or cancel
the patch.

> Also, you cannot SIGCONT a task that was SIGSTOP'ed by userspace for
> what they thought was a good reason. You'd wreck their state.

Hm, that's a good point.  Maybe use the freezer instead of signals?

(Again this would only be for those user tasks which are sleeping on a
patched function)

> > But now I'm thinking that kthreads will almost never be a problem.  Most
> > kthreads are basically this:
> 
> You guys are way too optimistic; maybe its because I've worked on
> realtime stuff too much, but I'm always looking at worst cases. If you
> can't handle those, I feel you might as well not bother :-)

Well, I think we're already resigned to the fact that live patching
won't work for every patch, every time.  And that the patch author must
know what they're doing, and must do it carefully.

Our target worst case is that the patching fails gracefully and the user
can't patch their system with that particular patch.  Or that the system
remains in a partially patched state forever, if the user is ok with
that.

> > Patching thread_fn wouldn't be possible unless we killed the thread.
> 
> It is, see kthread_park().

When the kthread returns from kthread_parkme(), it'll still be running
the old thread_fn code, regardless of whether we flipped the task's
patch state.

-- 
Josh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-19 Thread Peter Zijlstra
On Wed, Feb 18, 2015 at 10:17:53PM -0600, Josh Poimboeuf wrote:
> On Thu, Feb 19, 2015 at 01:20:58AM +0100, Peter Zijlstra wrote:
> > On Wed, Feb 18, 2015 at 11:12:56AM -0600, Josh Poimboeuf wrote:

> > > The next line of attack is patching tasks when exiting the kernel to
> > > user space (system calls, interrupts, signals), to catch all CPU-bound
> > > and some I/O-bound tasks.  That's done in patch 9 [1] of the consistency
> > > model patch set.
> > 
> > So the HPC people are really into userspace that does for (;;) ; and
> > isolate that on CPUs and have the tick interrupt stopped and all that.
> > 
> > You'll not catch those threads on the sysexit path.
> > 
> > And I'm fairly sure they'll not want to SIGSTOP/CONT their stuff either.
> > 
> > Now its fairly easy to also handle this; just mark those tasks with a
> > _TIF_WORK_SYSCALL_ENTRY flag, have that slowpath wait for the flag to
> > go-away, then flip their state and clear the flag.
> 
> I guess you mean patch the task when it makes a syscall?  I'm doing that
> already on syscall exit with a bit in _TIF_ALLWORK_MASK and
> _TIF_DO_NOTIFY_MASK.

No, these tasks will _never_ make syscalls. So you need to guarantee
they don't accidentally enter the kernel while you flip them. Something
like so should do.

You set TIF_ENTER_WAIT on them, check they're still in userspace, flip
them then clear TIF_ENTER_WAIT.

---
 arch/x86/include/asm/thread_info.h | 4 +++-
 arch/x86/kernel/entry_64.S | 2 ++
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/thread_info.h 
b/arch/x86/include/asm/thread_info.h
index e82e95abc92b..baa836f13536 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -90,6 +90,7 @@ struct thread_info {
 #define TIF_SYSCALL_TRACEPOINT 28  /* syscall tracepoint instrumentation */
 #define TIF_ADDR32 29  /* 32-bit address space on 64 bits */
 #define TIF_X3230  /* 32-bit native x86-64 binary 
*/
+#define TIF_ENTER_WAIT 31
 
 #define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
@@ -113,12 +114,13 @@ struct thread_info {
 #define _TIF_SYSCALL_TRACEPOINT(1 << TIF_SYSCALL_TRACEPOINT)
 #define _TIF_ADDR32(1 << TIF_ADDR32)
 #define _TIF_X32   (1 << TIF_X32)
+#define _TIF_ENTER_WAIT(1 << TIF_ENTER_WAIT)
 
 /* work to do in syscall_trace_enter() */
 #define _TIF_WORK_SYSCALL_ENTRY\
(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT |   \
 _TIF_SECCOMP | _TIF_SINGLESTEP | _TIF_SYSCALL_TRACEPOINT | \
-_TIF_NOHZ)
+_TIF_NOHZ | _TIF_ENTER_WAIT)
 
 /* work to do in syscall_trace_leave() */
 #define _TIF_WORK_SYSCALL_EXIT \
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index db13655c3a2a..735566b35903 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -387,6 +387,8 @@ GLOBAL(system_call_after_swapgs)
 
/* Do syscall tracing */
 tracesys:
+   andl $_TIF_ENTER_WAIT,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
+   jnz tracesys;
leaq -REST_SKIP(%rsp), %rdi
movq $AUDIT_ARCH_X86_64, %rsi
call syscall_trace_enter_phase1

> > > As a last resort, if there are still any tasks which are sleeping on a
> > > to-be-patched function, the user can send them SIGSTOP and SIGCONT to
> > > force them to be patched.
> > 
> > You typically cannot SIGSTOP/SIGCONT kernel threads. Also
> > TASK_UNINTERRUPTIBLE sleeps are unaffected by signals.
> > 
> > Bit pesky that.. needs pondering.

I still absolutely hate you need to disturb userspace like that. Signals
are quite visible and perturb userspace state.

Also, you cannot SIGCONT a task that was SIGSTOP'ed by userspace for
what they thought was a good reason. You'd wreck their state.

> But now I'm thinking that kthreads will almost never be a problem.  Most
> kthreads are basically this:

You guys are way too optimistic; maybe its because I've worked on
realtime stuff too much, but I'm always looking at worst cases. If you
can't handle those, I feel you might as well not bother :-)

> Patching thread_fn wouldn't be possible unless we killed the thread.

It is, see kthread_park().
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-19 Thread Vojtech Pavlik
On Thu, Feb 19, 2015 at 11:32:55AM -0600, Josh Poimboeuf wrote:
 On Thu, Feb 19, 2015 at 06:19:29PM +0100, Vojtech Pavlik wrote:
  On Thu, Feb 19, 2015 at 11:03:53AM -0600, Josh Poimboeuf wrote:
   On Thu, Feb 19, 2015 at 05:33:59PM +0100, Vojtech Pavlik wrote:
On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote:

  No, these tasks will _never_ make syscalls. So you need to guarantee
  they don't accidentally enter the kernel while you flip them. 
  Something
  like so should do.
  
  You set TIF_ENTER_WAIT on them, check they're still in userspace, 
  flip
  them then clear TIF_ENTER_WAIT.
 
 Ah, that's a good idea.  But how do we check if they're in user space?

I don't see the benefit in holding them in a loop - you can just as well
flip them from the syscall code as kGraft does.
   
   But we were talking specifically about HPC tasks which never make
   syscalls.
  
  Yes. I'm saying that rather than guaranteeing they don't enter the
  kernel (by having them spin) you can flip them in case they try to do
  that instead. That solves the race condition just as well.
 
 Ok, gotcha.
 
 We'd still need a safe way to check if they're in user space though.

Having a safe way would be very nice and actually quite useful in other
cases, too.

For this specific purpose, however, we don't need a very safe way,
though. We don't require atomicity in any way, we don't mind even if it
creates false negatives, only false positives would be bad.

kGraft looks at the stacktrace of CPU hogs and if it finds no kernel
addresses there, it assumes userspace. Not very nice, but does the job.

 How about with a TIF_IN_USERSPACE thread flag?  It could be cleared/set
 right at the border.  Then for running tasks it's as simple as:
 
 if (test_tsk_thread_flag(task, TIF_IN_USERSPACE))
   klp_switch_task_universe(task);

-- 
Vojtech Pavlik
Director SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-19 Thread Peter Zijlstra
On Wed, Feb 18, 2015 at 10:17:53PM -0600, Josh Poimboeuf wrote:
 On Thu, Feb 19, 2015 at 01:20:58AM +0100, Peter Zijlstra wrote:
  On Wed, Feb 18, 2015 at 11:12:56AM -0600, Josh Poimboeuf wrote:

   The next line of attack is patching tasks when exiting the kernel to
   user space (system calls, interrupts, signals), to catch all CPU-bound
   and some I/O-bound tasks.  That's done in patch 9 [1] of the consistency
   model patch set.
  
  So the HPC people are really into userspace that does for (;;) ; and
  isolate that on CPUs and have the tick interrupt stopped and all that.
  
  You'll not catch those threads on the sysexit path.
  
  And I'm fairly sure they'll not want to SIGSTOP/CONT their stuff either.
  
  Now its fairly easy to also handle this; just mark those tasks with a
  _TIF_WORK_SYSCALL_ENTRY flag, have that slowpath wait for the flag to
  go-away, then flip their state and clear the flag.
 
 I guess you mean patch the task when it makes a syscall?  I'm doing that
 already on syscall exit with a bit in _TIF_ALLWORK_MASK and
 _TIF_DO_NOTIFY_MASK.

No, these tasks will _never_ make syscalls. So you need to guarantee
they don't accidentally enter the kernel while you flip them. Something
like so should do.

You set TIF_ENTER_WAIT on them, check they're still in userspace, flip
them then clear TIF_ENTER_WAIT.

---
 arch/x86/include/asm/thread_info.h | 4 +++-
 arch/x86/kernel/entry_64.S | 2 ++
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/thread_info.h 
b/arch/x86/include/asm/thread_info.h
index e82e95abc92b..baa836f13536 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -90,6 +90,7 @@ struct thread_info {
 #define TIF_SYSCALL_TRACEPOINT 28  /* syscall tracepoint instrumentation */
 #define TIF_ADDR32 29  /* 32-bit address space on 64 bits */
 #define TIF_X3230  /* 32-bit native x86-64 binary 
*/
+#define TIF_ENTER_WAIT 31
 
 #define _TIF_SYSCALL_TRACE (1  TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME (1  TIF_NOTIFY_RESUME)
@@ -113,12 +114,13 @@ struct thread_info {
 #define _TIF_SYSCALL_TRACEPOINT(1  TIF_SYSCALL_TRACEPOINT)
 #define _TIF_ADDR32(1  TIF_ADDR32)
 #define _TIF_X32   (1  TIF_X32)
+#define _TIF_ENTER_WAIT(1  TIF_ENTER_WAIT)
 
 /* work to do in syscall_trace_enter() */
 #define _TIF_WORK_SYSCALL_ENTRY\
(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT |   \
 _TIF_SECCOMP | _TIF_SINGLESTEP | _TIF_SYSCALL_TRACEPOINT | \
-_TIF_NOHZ)
+_TIF_NOHZ | _TIF_ENTER_WAIT)
 
 /* work to do in syscall_trace_leave() */
 #define _TIF_WORK_SYSCALL_EXIT \
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index db13655c3a2a..735566b35903 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -387,6 +387,8 @@ GLOBAL(system_call_after_swapgs)
 
/* Do syscall tracing */
 tracesys:
+   andl $_TIF_ENTER_WAIT,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
+   jnz tracesys;
leaq -REST_SKIP(%rsp), %rdi
movq $AUDIT_ARCH_X86_64, %rsi
call syscall_trace_enter_phase1

   As a last resort, if there are still any tasks which are sleeping on a
   to-be-patched function, the user can send them SIGSTOP and SIGCONT to
   force them to be patched.
  
  You typically cannot SIGSTOP/SIGCONT kernel threads. Also
  TASK_UNINTERRUPTIBLE sleeps are unaffected by signals.
  
  Bit pesky that.. needs pondering.

I still absolutely hate you need to disturb userspace like that. Signals
are quite visible and perturb userspace state.

Also, you cannot SIGCONT a task that was SIGSTOP'ed by userspace for
what they thought was a good reason. You'd wreck their state.

 But now I'm thinking that kthreads will almost never be a problem.  Most
 kthreads are basically this:

You guys are way too optimistic; maybe its because I've worked on
realtime stuff too much, but I'm always looking at worst cases. If you
can't handle those, I feel you might as well not bother :-)

 Patching thread_fn wouldn't be possible unless we killed the thread.

It is, see kthread_park().
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-19 Thread Jiri Kosina
On Thu, 19 Feb 2015, Josh Poimboeuf wrote:

 How about with a TIF_IN_USERSPACE thread flag?  It could be cleared/set
 right at the border.  Then for running tasks it's as simple as:
 
 if (test_tsk_thread_flag(task, TIF_IN_USERSPACE))
   klp_switch_task_universe(task);

That's in principle what CONTEXT_TRACKING is doing, i.e. the condition 
we'd be interested in would be

__this_cpu_read(context_tracking.state) == IN_USER

But it has overhead.

-- 
Jiri Kosina
SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-19 Thread Jiri Kosina
On Thu, 19 Feb 2015, Jiri Kosina wrote:

  How about with a TIF_IN_USERSPACE thread flag?  It could be cleared/set
  right at the border.  Then for running tasks it's as simple as:
  
  if (test_tsk_thread_flag(task, TIF_IN_USERSPACE))
  klp_switch_task_universe(task);
 
 That's in principle what CONTEXT_TRACKING is doing, i.e. the condition 
 we'd be interested in would be
 
   __this_cpu_read(context_tracking.state) == IN_USER

Well, more precisely

per_cpu(context_tracking.state, cpu) == IN_USER

of course.

-- 
Jiri Kosina
SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-19 Thread Josh Poimboeuf
On Thu, Feb 19, 2015 at 09:40:36PM +0100, Vojtech Pavlik wrote:
 On Thu, Feb 19, 2015 at 11:32:55AM -0600, Josh Poimboeuf wrote:
  On Thu, Feb 19, 2015 at 06:19:29PM +0100, Vojtech Pavlik wrote:
   On Thu, Feb 19, 2015 at 11:03:53AM -0600, Josh Poimboeuf wrote:
On Thu, Feb 19, 2015 at 05:33:59PM +0100, Vojtech Pavlik wrote:
 On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote:
 
   No, these tasks will _never_ make syscalls. So you need to 
   guarantee
   they don't accidentally enter the kernel while you flip them. 
   Something
   like so should do.
   
   You set TIF_ENTER_WAIT on them, check they're still in userspace, 
   flip
   them then clear TIF_ENTER_WAIT.
  
  Ah, that's a good idea.  But how do we check if they're in user 
  space?
 
 I don't see the benefit in holding them in a loop - you can just as 
 well
 flip them from the syscall code as kGraft does.

But we were talking specifically about HPC tasks which never make
syscalls.
   
   Yes. I'm saying that rather than guaranteeing they don't enter the
   kernel (by having them spin) you can flip them in case they try to do
   that instead. That solves the race condition just as well.
  
  Ok, gotcha.
  
  We'd still need a safe way to check if they're in user space though.
 
 Having a safe way would be very nice and actually quite useful in other
 cases, too.
 
 For this specific purpose, however, we don't need a very safe way,
 though. We don't require atomicity in any way, we don't mind even if it
 creates false negatives, only false positives would be bad.
 
 kGraft looks at the stacktrace of CPU hogs and if it finds no kernel
 addresses there, it assumes userspace. Not very nice, but does the job.

So I've looked at kgr_needs_lazy_migration(), but I still have no idea
how it works.

First of all, I think reading the stack while its being written to could
give you some garbage values, and a completely wrong nr_entries value
from save_stack_trace_tsk().

But also, how would you walk a stack without knowing its stack pointer?
That function relies on the saved stack pointer in
task_struct.thread.sp, which, AFAICT, was last saved during the last
call to schedule().  Since then, the stack could have been completely
rewritten, with different size stack frames, before the task exited the
kernel.

Am I missing something?

-- 
Josh
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-19 Thread Jiri Kosina
On Thu, 19 Feb 2015, Josh Poimboeuf wrote:

 So I've looked at kgr_needs_lazy_migration(), but I still have no idea
 how it works.
 
 First of all, I think reading the stack while its being written to could
 give you some garbage values, and a completely wrong nr_entries value
 from save_stack_trace_tsk().

I believe we've already been discussing this some time ago ...

I agree that this is a very crude optimization that should probably be 
either removed (which would only cause slower convergence in the presence 
of CPU-bound tasks), or rewritten to perform IPI-based stack dumping 
(probably on a voluntarily-configurable basis).

Reading garbage values could only happen if the task would be running in 
kernelspace. nr_entries would then be at least 2.

But I agree that relying on this very specific behavior is not really 
safe in general in case someone changes the stack dumping implementation 
in the future in an unpredictable way.

 But also, how would you walk a stack without knowing its stack pointer? 
 That function relies on the saved stack pointer in 
 task_struct.thread.sp, which, AFAICT, was last saved during the last 
 call to schedule().  Since then, the stack could have been completely 
 rewritten, with different size stack frames, before the task exited the 
 kernel.

Same argument holds here as well, I believe.

Thanks,

-- 
Jiri Kosina
SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-19 Thread Josh Poimboeuf
On Thu, Feb 19, 2015 at 11:16:07AM +0100, Peter Zijlstra wrote:
 On Wed, Feb 18, 2015 at 10:17:53PM -0600, Josh Poimboeuf wrote:
  On Thu, Feb 19, 2015 at 01:20:58AM +0100, Peter Zijlstra wrote:
   On Wed, Feb 18, 2015 at 11:12:56AM -0600, Josh Poimboeuf wrote:
 
The next line of attack is patching tasks when exiting the kernel to
user space (system calls, interrupts, signals), to catch all CPU-bound
and some I/O-bound tasks.  That's done in patch 9 [1] of the consistency
model patch set.
   
   So the HPC people are really into userspace that does for (;;) ; and
   isolate that on CPUs and have the tick interrupt stopped and all that.
   
   You'll not catch those threads on the sysexit path.
   
   And I'm fairly sure they'll not want to SIGSTOP/CONT their stuff either.
   
   Now its fairly easy to also handle this; just mark those tasks with a
   _TIF_WORK_SYSCALL_ENTRY flag, have that slowpath wait for the flag to
   go-away, then flip their state and clear the flag.
  
  I guess you mean patch the task when it makes a syscall?  I'm doing that
  already on syscall exit with a bit in _TIF_ALLWORK_MASK and
  _TIF_DO_NOTIFY_MASK.
 
 No, these tasks will _never_ make syscalls. So you need to guarantee
 they don't accidentally enter the kernel while you flip them. Something
 like so should do.
 
 You set TIF_ENTER_WAIT on them, check they're still in userspace, flip
 them then clear TIF_ENTER_WAIT.

Ah, that's a good idea.  But how do we check if they're in user space?

As a last resort, if there are still any tasks which are sleeping on a
to-be-patched function, the user can send them SIGSTOP and SIGCONT to
force them to be patched.
   
   You typically cannot SIGSTOP/SIGCONT kernel threads. Also
   TASK_UNINTERRUPTIBLE sleeps are unaffected by signals.
   
   Bit pesky that.. needs pondering.
 
 I still absolutely hate you need to disturb userspace like that. Signals
 are quite visible and perturb userspace state.

Yeah, SIGSTOP on a sleeping task can be intrusive to user space if it
results in EINTR being returned from a system call.  It's not ideal, but
it's much less intrusive than something like suspend.

But anyway we leave it up to the user to decide whether they want to
take that risk, or wait for the task to wake up on its own, or cancel
the patch.

 Also, you cannot SIGCONT a task that was SIGSTOP'ed by userspace for
 what they thought was a good reason. You'd wreck their state.

Hm, that's a good point.  Maybe use the freezer instead of signals?

(Again this would only be for those user tasks which are sleeping on a
patched function)

  But now I'm thinking that kthreads will almost never be a problem.  Most
  kthreads are basically this:
 
 You guys are way too optimistic; maybe its because I've worked on
 realtime stuff too much, but I'm always looking at worst cases. If you
 can't handle those, I feel you might as well not bother :-)

Well, I think we're already resigned to the fact that live patching
won't work for every patch, every time.  And that the patch author must
know what they're doing, and must do it carefully.

Our target worst case is that the patching fails gracefully and the user
can't patch their system with that particular patch.  Or that the system
remains in a partially patched state forever, if the user is ok with
that.

  Patching thread_fn wouldn't be possible unless we killed the thread.
 
 It is, see kthread_park().

When the kthread returns from kthread_parkme(), it'll still be running
the old thread_fn code, regardless of whether we flipped the task's
patch state.

-- 
Josh
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-19 Thread Vojtech Pavlik
On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote:

  No, these tasks will _never_ make syscalls. So you need to guarantee
  they don't accidentally enter the kernel while you flip them. Something
  like so should do.
  
  You set TIF_ENTER_WAIT on them, check they're still in userspace, flip
  them then clear TIF_ENTER_WAIT.
 
 Ah, that's a good idea.  But how do we check if they're in user space?

I don't see the benefit in holding them in a loop - you can just as well
flip them from the syscall code as kGraft does.

  I still absolutely hate you need to disturb userspace like that. Signals
  are quite visible and perturb userspace state.
 
 Yeah, SIGSTOP on a sleeping task can be intrusive to user space if it
 results in EINTR being returned from a system call.  It's not ideal, but
 it's much less intrusive than something like suspend.
 
 But anyway we leave it up to the user to decide whether they want to
 take that risk, or wait for the task to wake up on its own, or cancel
 the patch.
 
  Also, you cannot SIGCONT a task that was SIGSTOP'ed by userspace for
  what they thought was a good reason. You'd wreck their state.
 
 Hm, that's a good point.  Maybe use the freezer instead of signals?
 
 (Again this would only be for those user tasks which are sleeping on a
 patched function)
 
   But now I'm thinking that kthreads will almost never be a problem.  Most
   kthreads are basically this:
  
  You guys are way too optimistic; maybe its because I've worked on
  realtime stuff too much, but I'm always looking at worst cases. If you
  can't handle those, I feel you might as well not bother :-)
 
 Well, I think we're already resigned to the fact that live patching
 won't work for every patch, every time.  And that the patch author must
 know what they're doing, and must do it carefully.
 
 Our target worst case is that the patching fails gracefully and the user
 can't patch their system with that particular patch.  Or that the system
 remains in a partially patched state forever, if the user is ok with
 that.
 
   Patching thread_fn wouldn't be possible unless we killed the thread.
  
  It is, see kthread_park().
 
 When the kthread returns from kthread_parkme(), it'll still be running
 the old thread_fn code, regardless of whether we flipped the task's
 patch state.

We could instrument kthread_parkme() to be able to return to a different
function, but it'd be a bit crazy indeed.

-- 
Vojtech Pavlik
Director SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-19 Thread Jiri Kosina
On Thu, 19 Feb 2015, Josh Poimboeuf wrote:

No, these tasks will _never_ make syscalls. So you need to guarantee
they don't accidentally enter the kernel while you flip them. Something
like so should do.

You set TIF_ENTER_WAIT on them, check they're still in userspace, flip
them then clear TIF_ENTER_WAIT.
   
   Ah, that's a good idea.  But how do we check if they're in user space?
  
  I don't see the benefit in holding them in a loop - you can just as well
  flip them from the syscall code as kGraft does.
 
 But we were talking specifically about HPC tasks which never make
 syscalls.

Yeah. And getting answer to is this task_struct currently running in 
userspace? is not easy in a non-disruptive way.

Piggy-backing on CONFIG_CONTEXT_TRACKING is one option, but I guess the 
number of systems that have this option turned on will be rather limited 
(especially in HPC space).

Sending IPIs around to check whether the task is running in user-space or 
kernel-space would work, but it's disruptive to the running task, which 
might be unwanted as well.

-- 
Jiri Kosina
SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-19 Thread Vojtech Pavlik
On Thu, Feb 19, 2015 at 11:03:53AM -0600, Josh Poimboeuf wrote:
 On Thu, Feb 19, 2015 at 05:33:59PM +0100, Vojtech Pavlik wrote:
  On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote:
  
No, these tasks will _never_ make syscalls. So you need to guarantee
they don't accidentally enter the kernel while you flip them. Something
like so should do.

You set TIF_ENTER_WAIT on them, check they're still in userspace, flip
them then clear TIF_ENTER_WAIT.
   
   Ah, that's a good idea.  But how do we check if they're in user space?
  
  I don't see the benefit in holding them in a loop - you can just as well
  flip them from the syscall code as kGraft does.
 
 But we were talking specifically about HPC tasks which never make
 syscalls.

Yes. I'm saying that rather than guaranteeing they don't enter the
kernel (by having them spin) you can flip them in case they try to do
that instead. That solves the race condition just as well.

-- 
Vojtech Pavlik
Director SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-19 Thread Josh Poimboeuf
On Thu, Feb 19, 2015 at 06:19:29PM +0100, Vojtech Pavlik wrote:
 On Thu, Feb 19, 2015 at 11:03:53AM -0600, Josh Poimboeuf wrote:
  On Thu, Feb 19, 2015 at 05:33:59PM +0100, Vojtech Pavlik wrote:
   On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote:
   
 No, these tasks will _never_ make syscalls. So you need to guarantee
 they don't accidentally enter the kernel while you flip them. 
 Something
 like so should do.
 
 You set TIF_ENTER_WAIT on them, check they're still in userspace, flip
 them then clear TIF_ENTER_WAIT.

Ah, that's a good idea.  But how do we check if they're in user space?
   
   I don't see the benefit in holding them in a loop - you can just as well
   flip them from the syscall code as kGraft does.
  
  But we were talking specifically about HPC tasks which never make
  syscalls.
 
 Yes. I'm saying that rather than guaranteeing they don't enter the
 kernel (by having them spin) you can flip them in case they try to do
 that instead. That solves the race condition just as well.

Ok, gotcha.

We'd still need a safe way to check if they're in user space though.

How about with a TIF_IN_USERSPACE thread flag?  It could be cleared/set
right at the border.  Then for running tasks it's as simple as:

if (test_tsk_thread_flag(task, TIF_IN_USERSPACE))
klp_switch_task_universe(task);

-- 
Josh
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-19 Thread Josh Poimboeuf
On Thu, Feb 19, 2015 at 05:33:59PM +0100, Vojtech Pavlik wrote:
 On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote:
 
   No, these tasks will _never_ make syscalls. So you need to guarantee
   they don't accidentally enter the kernel while you flip them. Something
   like so should do.
   
   You set TIF_ENTER_WAIT on them, check they're still in userspace, flip
   them then clear TIF_ENTER_WAIT.
  
  Ah, that's a good idea.  But how do we check if they're in user space?
 
 I don't see the benefit in holding them in a loop - you can just as well
 flip them from the syscall code as kGraft does.

But we were talking specifically about HPC tasks which never make
syscalls.

-- 
Josh
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-19 Thread Vojtech Pavlik
On February 19, 2015 6:32:55 PM CET, Josh Poimboeuf jpoim...@redhat.com wrote:

 Yes. I'm saying that rather than guaranteeing they don't enter the
 kernel (by having them spin) you can flip them in case they try to do
 that instead. That solves the race condition just as well.

Ok, gotcha.

We'd still need a safe way to check if they're in user space though.

How about with a TIF_IN_USERSPACE thread flag?  It could be cleared/set
right at the border.  Then for running tasks it's as simple as:

if (test_tsk_thread_flag(task, TIF_IN_USERSPACE))
   klp_switch_task_universe(task);

The s390x arch used to have a TIF_SYSCALL, which was doing exactly that (well, 
negated). I think it would work well, but it isn't entirely for free: two 
instructions per syscall and an extra TIF bit, which are (near) exhausted on 
some archs.

-- 
Vojtech Pavlik
Director SuSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-19 Thread Josh Poimboeuf
On Thu, Feb 19, 2015 at 10:26:09PM +0100, Jiri Kosina wrote:
 On Thu, 19 Feb 2015, Josh Poimboeuf wrote:
 
  How about with a TIF_IN_USERSPACE thread flag?  It could be cleared/set
  right at the border.  Then for running tasks it's as simple as:
  
  if (test_tsk_thread_flag(task, TIF_IN_USERSPACE))
  klp_switch_task_universe(task);
 
 That's in principle what CONTEXT_TRACKING is doing, i.e. the condition 
 we'd be interested in would be
 
   __this_cpu_read(context_tracking.state) == IN_USER
 
 But it has overhead.

Yeah, that does seem to pretty much do what we want.  Unfortunately it
has a much higher overhead than just setting a thread flag.

And from the Kconfig description for CONTEXT_TRACKING_FORCE, which would
enable it on all CPUs during boot, this option brings an overhead that
you don't want in production.

Maybe that code needs a rewrite to rely on a thread flag instead.  Then
we could use it too.

-- 
Josh
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-18 Thread Josh Poimboeuf
On Thu, Feb 19, 2015 at 01:20:58AM +0100, Peter Zijlstra wrote:
> On Wed, Feb 18, 2015 at 11:12:56AM -0600, Josh Poimboeuf wrote:
> > > So uhm, what happens if your target task is running? When will you
> > > retry? The problem I see is that if you do a sample approach you might
> > > never hit an opportune moment.
> > 
> > We attack it from multiple angles.
> > 
> > First we check the stack of all sleeping tasks.  That patches the
> > majority of tasks immediately.  If necessary, we also do that
> > periodically in a workqueue to catch any stragglers.
> 
> So not only do you need an 'atomic' stack save, you need to analyze and
> flip its state in the same atomic region. The task cannot start running
> again after the save and start using old functions while you analyze the
> stack and flip it.

Yes, exactly.

> > The next line of attack is patching tasks when exiting the kernel to
> > user space (system calls, interrupts, signals), to catch all CPU-bound
> > and some I/O-bound tasks.  That's done in patch 9 [1] of the consistency
> > model patch set.
> 
> So the HPC people are really into userspace that does for (;;) ; and
> isolate that on CPUs and have the tick interrupt stopped and all that.
> 
> You'll not catch those threads on the sysexit path.
> 
> And I'm fairly sure they'll not want to SIGSTOP/CONT their stuff either.
> 
> Now its fairly easy to also handle this; just mark those tasks with a
> _TIF_WORK_SYSCALL_ENTRY flag, have that slowpath wait for the flag to
> go-away, then flip their state and clear the flag.

I guess you mean patch the task when it makes a syscall?  I'm doing that
already on syscall exit with a bit in _TIF_ALLWORK_MASK and
_TIF_DO_NOTIFY_MASK.

> > As a last resort, if there are still any tasks which are sleeping on a
> > to-be-patched function, the user can send them SIGSTOP and SIGCONT to
> > force them to be patched.
> 
> You typically cannot SIGSTOP/SIGCONT kernel threads. Also
> TASK_UNINTERRUPTIBLE sleeps are unaffected by signals.
> 
> Bit pesky that.. needs pondering.

I did have a scheme for patching kthreads which are sleeping on
to-be-patched functions.

But now I'm thinking that kthreads will almost never be a problem.  Most
kthreads are basically this:

void thread_fn()
{
while (1) {
/* do some stuff */

schedule();

/* do other stuff */
}
}

So a kthread would typically only fail the stack check if we're trying
to patch either schedule() or the top-level thread_fn.

Patching thread_fn wouldn't be possible unless we killed the thread.

And I'd guess we can probably live without being able to patch
schedule() for now :-)

-- 
Josh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-18 Thread Peter Zijlstra
On Wed, Feb 18, 2015 at 11:12:56AM -0600, Josh Poimboeuf wrote:

> > > a) spend the time to ensure the unwinding code is correct and resilient
> > >to errors;
> > > 
> > > b) leave the consistency model compiled code out if !FRAME_POINTER and
> > >allow users to patch without one (similar to the livepatch code
> > >that's already in the livepatch tree today); or
> > 
> > Which has a much more limited set of patches it can do, right?
> 
> If we're talking security patches, roughly 9 out of 10 of them don't
> change function prototypes, data, or data semantics.  If the patch
> author is careful, he or she doesn't need a consistency model in those
> cases.
> 
> So it's not _overly_ limited.  If somebody's not happy about those
> limitations, they can put in the work for option a :-)

OK. So all this is really only about really funny bits.

> > So uhm, what happens if your target task is running? When will you
> > retry? The problem I see is that if you do a sample approach you might
> > never hit an opportune moment.
> 
> We attack it from multiple angles.
> 
> First we check the stack of all sleeping tasks.  That patches the
> majority of tasks immediately.  If necessary, we also do that
> periodically in a workqueue to catch any stragglers.

So not only do you need an 'atomic' stack save, you need to analyze and
flip its state in the same atomic region. The task cannot start running
again after the save and start using old functions while you analyze the
stack and flip it.

> The next line of attack is patching tasks when exiting the kernel to
> user space (system calls, interrupts, signals), to catch all CPU-bound
> and some I/O-bound tasks.  That's done in patch 9 [1] of the consistency
> model patch set.

So the HPC people are really into userspace that does for (;;) ; and
isolate that on CPUs and have the tick interrupt stopped and all that.

You'll not catch those threads on the sysexit path.

And I'm fairly sure they'll not want to SIGSTOP/CONT their stuff either.

Now its fairly easy to also handle this; just mark those tasks with a
_TIF_WORK_SYSCALL_ENTRY flag, have that slowpath wait for the flag to
go-away, then flip their state and clear the flag.

> As a last resort, if there are still any tasks which are sleeping on a
> to-be-patched function, the user can send them SIGSTOP and SIGCONT to
> force them to be patched.

You typically cannot SIGSTOP/SIGCONT kernel threads. Also
TASK_UNINTERRUPTIBLE sleeps are unaffected by signals.

Bit pesky that.. needs pondering.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-18 Thread Josh Poimboeuf
On Wed, Feb 18, 2015 at 04:21:00PM +0100, Peter Zijlstra wrote:
> On Tue, Feb 17, 2015 at 03:25:32PM -0600, Josh Poimboeuf wrote:
> > > And I'm assuming you're hard relying on CONFIG_FRAMEPOINTER here,
> > > because otherwise x86 stacks are a mess too.
> > 
> > Yeah, it'll rely on CONFIG_FRAME_POINTER.  IIUC, the arches we care
> > about now (x86, power, s390, arm64) all have frame pointer support, and
> > the stack formats are all sane, AFAICT.
> > 
> > If we ever port livepatch to a more obscure arch for which walking the
> > stack is more difficult, we'll have several options:
> > 
> > a) spend the time to ensure the unwinding code is correct and resilient
> >to errors;
> > 
> > b) leave the consistency model compiled code out if !FRAME_POINTER and
> >allow users to patch without one (similar to the livepatch code
> >that's already in the livepatch tree today); or
> 
> Which has a much more limited set of patches it can do, right?

If we're talking security patches, roughly 9 out of 10 of them don't
change function prototypes, data, or data semantics.  If the patch
author is careful, he or she doesn't need a consistency model in those
cases.

So it's not _overly_ limited.  If somebody's not happy about those
limitations, they can put in the work for option a :-)

> > > And then hope you can get a better trace next time around? Or will you
> > > fall-back to an alternative method of patching?
> > 
> > Yeah, on second thought, we wouldn't have to cancel the patch.  We could
> > defer to check the task's stack again at a later time.  If it's stuck
> > there, the user can try sending it a signal to unstick it, or cancel the
> > patching process.  Those mechanisms are already in place with my
> > consistency model patch set.
> > 
> > I'd also do a WARN_ON_ONCE and a dump of the full stack data, since I'm
> > guessing it would either indicate an error in the unwinding code or
> > point us to an unexpected stack condition.
> 
> So uhm, what happens if your target task is running? When will you
> retry? The problem I see is that if you do a sample approach you might
> never hit an opportune moment.

We attack it from multiple angles.

First we check the stack of all sleeping tasks.  That patches the
majority of tasks immediately.  If necessary, we also do that
periodically in a workqueue to catch any stragglers.

The next line of attack is patching tasks when exiting the kernel to
user space (system calls, interrupts, signals), to catch all CPU-bound
and some I/O-bound tasks.  That's done in patch 9 [1] of the consistency
model patch set.

As a last resort, if there are still any tasks which are sleeping on a
to-be-patched function, the user can send them SIGSTOP and SIGCONT to
force them to be patched.

If none of those work, the user has the option of either canceling the
patch or just waiting to see if the patching process eventually
completes.

> > > I'm further thinking we don't actually need 2 (or more) different means
> > > of live patching in the kernel. So you all had better sit down (again)
> > > and come up with something you all agree on.
> > 
> > Yeah, I also _really_ want to avoid multiple consistency models.
> > 
> > In fact, that's a big motivation behind my consistency model patch set.
> > It's heavily inspired by a suggestion from Vojtech [1].  It combines
> > kpatch (backtrace checking) with kGraft (per-thread consistency).  It
> > has several advantages over either of the individual approaches.  See
> > http://lwn.net/Articles/632582/ where I describe its pros over both
> > kpatch and kGraft.
> > 
> > Jiri, would you and Vojtech agree that the proposed consistency model is
> > all we need?  Or do you still want to do the multiple consistency model
> > thing?
> 
> Skimmed that thread; you all mostly seem to agree that one would be good
> but not quite agree on which one.

Well, I think we at least agreed that LEAVE_PATCHED_SET + SWITCH_THREAD
is the most interesting combination.  That's what my patches implement.
Let's wait to see what Jiri and Vojtech say.

> And I note, not all seem to require this stack walking stuff.

True.  But without stack walking, you have bigger problems:

1. You have to signal _all_ sleeping tasks to force them to be patched.
   That's much more disruptive to the system.

2. There's no way to convert kthreads without either:

   a) hacking up all the various kthread loops to call
  klp_update_task_universe(); or

   b) converting all kthreads' code to be freezeable.  Then we could
  freeze all tasks to patch them.  Not only would it be a lot of
  work to make all kthreads freezable, but it would encode within
  the kernel the dangerous and non-obvious assumption that the
  freeze point is a "safe" place for patching.


[1] https://lkml.org/lkml/2015/2/9/478


-- 
Josh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  

Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-18 Thread Peter Zijlstra
On Tue, Feb 17, 2015 at 03:25:32PM -0600, Josh Poimboeuf wrote:
> > And I'm assuming you're hard relying on CONFIG_FRAMEPOINTER here,
> > because otherwise x86 stacks are a mess too.
> 
> Yeah, it'll rely on CONFIG_FRAME_POINTER.  IIUC, the arches we care
> about now (x86, power, s390, arm64) all have frame pointer support, and
> the stack formats are all sane, AFAICT.
> 
> If we ever port livepatch to a more obscure arch for which walking the
> stack is more difficult, we'll have several options:
> 
> a) spend the time to ensure the unwinding code is correct and resilient
>to errors;
> 
> b) leave the consistency model compiled code out if !FRAME_POINTER and
>allow users to patch without one (similar to the livepatch code
>that's already in the livepatch tree today); or

Which has a much more limited set of patches it can do, right?

> c) not support that arch.

Which would be sad of course.

> > And then hope you can get a better trace next time around? Or will you
> > fall-back to an alternative method of patching?
> 
> Yeah, on second thought, we wouldn't have to cancel the patch.  We could
> defer to check the task's stack again at a later time.  If it's stuck
> there, the user can try sending it a signal to unstick it, or cancel the
> patching process.  Those mechanisms are already in place with my
> consistency model patch set.
> 
> I'd also do a WARN_ON_ONCE and a dump of the full stack data, since I'm
> guessing it would either indicate an error in the unwinding code or
> point us to an unexpected stack condition.

So uhm, what happens if your target task is running? When will you
retry? The problem I see is that if you do a sample approach you might
never hit an opportune moment.

> > I'm further thinking we don't actually need 2 (or more) different means
> > of live patching in the kernel. So you all had better sit down (again)
> > and come up with something you all agree on.
> 
> Yeah, I also _really_ want to avoid multiple consistency models.
> 
> In fact, that's a big motivation behind my consistency model patch set.
> It's heavily inspired by a suggestion from Vojtech [1].  It combines
> kpatch (backtrace checking) with kGraft (per-thread consistency).  It
> has several advantages over either of the individual approaches.  See
> http://lwn.net/Articles/632582/ where I describe its pros over both
> kpatch and kGraft.
> 
> Jiri, would you and Vojtech agree that the proposed consistency model is
> all we need?  Or do you still want to do the multiple consistency model
> thing?

Skimmed that thread; you all mostly seem to agree that one would be good
but not quite agree on which one.

And I note, not all seem to require this stack walking stuff.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-18 Thread Josh Poimboeuf
On Wed, Feb 18, 2015 at 04:21:00PM +0100, Peter Zijlstra wrote:
 On Tue, Feb 17, 2015 at 03:25:32PM -0600, Josh Poimboeuf wrote:
   And I'm assuming you're hard relying on CONFIG_FRAMEPOINTER here,
   because otherwise x86 stacks are a mess too.
  
  Yeah, it'll rely on CONFIG_FRAME_POINTER.  IIUC, the arches we care
  about now (x86, power, s390, arm64) all have frame pointer support, and
  the stack formats are all sane, AFAICT.
  
  If we ever port livepatch to a more obscure arch for which walking the
  stack is more difficult, we'll have several options:
  
  a) spend the time to ensure the unwinding code is correct and resilient
 to errors;
  
  b) leave the consistency model compiled code out if !FRAME_POINTER and
 allow users to patch without one (similar to the livepatch code
 that's already in the livepatch tree today); or
 
 Which has a much more limited set of patches it can do, right?

If we're talking security patches, roughly 9 out of 10 of them don't
change function prototypes, data, or data semantics.  If the patch
author is careful, he or she doesn't need a consistency model in those
cases.

So it's not _overly_ limited.  If somebody's not happy about those
limitations, they can put in the work for option a :-)

   And then hope you can get a better trace next time around? Or will you
   fall-back to an alternative method of patching?
  
  Yeah, on second thought, we wouldn't have to cancel the patch.  We could
  defer to check the task's stack again at a later time.  If it's stuck
  there, the user can try sending it a signal to unstick it, or cancel the
  patching process.  Those mechanisms are already in place with my
  consistency model patch set.
  
  I'd also do a WARN_ON_ONCE and a dump of the full stack data, since I'm
  guessing it would either indicate an error in the unwinding code or
  point us to an unexpected stack condition.
 
 So uhm, what happens if your target task is running? When will you
 retry? The problem I see is that if you do a sample approach you might
 never hit an opportune moment.

We attack it from multiple angles.

First we check the stack of all sleeping tasks.  That patches the
majority of tasks immediately.  If necessary, we also do that
periodically in a workqueue to catch any stragglers.

The next line of attack is patching tasks when exiting the kernel to
user space (system calls, interrupts, signals), to catch all CPU-bound
and some I/O-bound tasks.  That's done in patch 9 [1] of the consistency
model patch set.

As a last resort, if there are still any tasks which are sleeping on a
to-be-patched function, the user can send them SIGSTOP and SIGCONT to
force them to be patched.

If none of those work, the user has the option of either canceling the
patch or just waiting to see if the patching process eventually
completes.

   I'm further thinking we don't actually need 2 (or more) different means
   of live patching in the kernel. So you all had better sit down (again)
   and come up with something you all agree on.
  
  Yeah, I also _really_ want to avoid multiple consistency models.
  
  In fact, that's a big motivation behind my consistency model patch set.
  It's heavily inspired by a suggestion from Vojtech [1].  It combines
  kpatch (backtrace checking) with kGraft (per-thread consistency).  It
  has several advantages over either of the individual approaches.  See
  http://lwn.net/Articles/632582/ where I describe its pros over both
  kpatch and kGraft.
  
  Jiri, would you and Vojtech agree that the proposed consistency model is
  all we need?  Or do you still want to do the multiple consistency model
  thing?
 
 Skimmed that thread; you all mostly seem to agree that one would be good
 but not quite agree on which one.

Well, I think we at least agreed that LEAVE_PATCHED_SET + SWITCH_THREAD
is the most interesting combination.  That's what my patches implement.
Let's wait to see what Jiri and Vojtech say.

 And I note, not all seem to require this stack walking stuff.

True.  But without stack walking, you have bigger problems:

1. You have to signal _all_ sleeping tasks to force them to be patched.
   That's much more disruptive to the system.

2. There's no way to convert kthreads without either:

   a) hacking up all the various kthread loops to call
  klp_update_task_universe(); or

   b) converting all kthreads' code to be freezeable.  Then we could
  freeze all tasks to patch them.  Not only would it be a lot of
  work to make all kthreads freezable, but it would encode within
  the kernel the dangerous and non-obvious assumption that the
  freeze point is a safe place for patching.


[1] https://lkml.org/lkml/2015/2/9/478


-- 
Josh
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-18 Thread Josh Poimboeuf
On Thu, Feb 19, 2015 at 01:20:58AM +0100, Peter Zijlstra wrote:
 On Wed, Feb 18, 2015 at 11:12:56AM -0600, Josh Poimboeuf wrote:
   So uhm, what happens if your target task is running? When will you
   retry? The problem I see is that if you do a sample approach you might
   never hit an opportune moment.
  
  We attack it from multiple angles.
  
  First we check the stack of all sleeping tasks.  That patches the
  majority of tasks immediately.  If necessary, we also do that
  periodically in a workqueue to catch any stragglers.
 
 So not only do you need an 'atomic' stack save, you need to analyze and
 flip its state in the same atomic region. The task cannot start running
 again after the save and start using old functions while you analyze the
 stack and flip it.

Yes, exactly.

  The next line of attack is patching tasks when exiting the kernel to
  user space (system calls, interrupts, signals), to catch all CPU-bound
  and some I/O-bound tasks.  That's done in patch 9 [1] of the consistency
  model patch set.
 
 So the HPC people are really into userspace that does for (;;) ; and
 isolate that on CPUs and have the tick interrupt stopped and all that.
 
 You'll not catch those threads on the sysexit path.
 
 And I'm fairly sure they'll not want to SIGSTOP/CONT their stuff either.
 
 Now its fairly easy to also handle this; just mark those tasks with a
 _TIF_WORK_SYSCALL_ENTRY flag, have that slowpath wait for the flag to
 go-away, then flip their state and clear the flag.

I guess you mean patch the task when it makes a syscall?  I'm doing that
already on syscall exit with a bit in _TIF_ALLWORK_MASK and
_TIF_DO_NOTIFY_MASK.

  As a last resort, if there are still any tasks which are sleeping on a
  to-be-patched function, the user can send them SIGSTOP and SIGCONT to
  force them to be patched.
 
 You typically cannot SIGSTOP/SIGCONT kernel threads. Also
 TASK_UNINTERRUPTIBLE sleeps are unaffected by signals.
 
 Bit pesky that.. needs pondering.

I did have a scheme for patching kthreads which are sleeping on
to-be-patched functions.

But now I'm thinking that kthreads will almost never be a problem.  Most
kthreads are basically this:

void thread_fn()
{
while (1) {
/* do some stuff */

schedule();

/* do other stuff */
}
}

So a kthread would typically only fail the stack check if we're trying
to patch either schedule() or the top-level thread_fn.

Patching thread_fn wouldn't be possible unless we killed the thread.

And I'd guess we can probably live without being able to patch
schedule() for now :-)

-- 
Josh
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-18 Thread Peter Zijlstra
On Tue, Feb 17, 2015 at 03:25:32PM -0600, Josh Poimboeuf wrote:
  And I'm assuming you're hard relying on CONFIG_FRAMEPOINTER here,
  because otherwise x86 stacks are a mess too.
 
 Yeah, it'll rely on CONFIG_FRAME_POINTER.  IIUC, the arches we care
 about now (x86, power, s390, arm64) all have frame pointer support, and
 the stack formats are all sane, AFAICT.
 
 If we ever port livepatch to a more obscure arch for which walking the
 stack is more difficult, we'll have several options:
 
 a) spend the time to ensure the unwinding code is correct and resilient
to errors;
 
 b) leave the consistency model compiled code out if !FRAME_POINTER and
allow users to patch without one (similar to the livepatch code
that's already in the livepatch tree today); or

Which has a much more limited set of patches it can do, right?

 c) not support that arch.

Which would be sad of course.

  And then hope you can get a better trace next time around? Or will you
  fall-back to an alternative method of patching?
 
 Yeah, on second thought, we wouldn't have to cancel the patch.  We could
 defer to check the task's stack again at a later time.  If it's stuck
 there, the user can try sending it a signal to unstick it, or cancel the
 patching process.  Those mechanisms are already in place with my
 consistency model patch set.
 
 I'd also do a WARN_ON_ONCE and a dump of the full stack data, since I'm
 guessing it would either indicate an error in the unwinding code or
 point us to an unexpected stack condition.

So uhm, what happens if your target task is running? When will you
retry? The problem I see is that if you do a sample approach you might
never hit an opportune moment.

  I'm further thinking we don't actually need 2 (or more) different means
  of live patching in the kernel. So you all had better sit down (again)
  and come up with something you all agree on.
 
 Yeah, I also _really_ want to avoid multiple consistency models.
 
 In fact, that's a big motivation behind my consistency model patch set.
 It's heavily inspired by a suggestion from Vojtech [1].  It combines
 kpatch (backtrace checking) with kGraft (per-thread consistency).  It
 has several advantages over either of the individual approaches.  See
 http://lwn.net/Articles/632582/ where I describe its pros over both
 kpatch and kGraft.
 
 Jiri, would you and Vojtech agree that the proposed consistency model is
 all we need?  Or do you still want to do the multiple consistency model
 thing?

Skimmed that thread; you all mostly seem to agree that one would be good
but not quite agree on which one.

And I note, not all seem to require this stack walking stuff.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-18 Thread Peter Zijlstra
On Wed, Feb 18, 2015 at 11:12:56AM -0600, Josh Poimboeuf wrote:

   a) spend the time to ensure the unwinding code is correct and resilient
  to errors;
   
   b) leave the consistency model compiled code out if !FRAME_POINTER and
  allow users to patch without one (similar to the livepatch code
  that's already in the livepatch tree today); or
  
  Which has a much more limited set of patches it can do, right?
 
 If we're talking security patches, roughly 9 out of 10 of them don't
 change function prototypes, data, or data semantics.  If the patch
 author is careful, he or she doesn't need a consistency model in those
 cases.
 
 So it's not _overly_ limited.  If somebody's not happy about those
 limitations, they can put in the work for option a :-)

OK. So all this is really only about really funny bits.

  So uhm, what happens if your target task is running? When will you
  retry? The problem I see is that if you do a sample approach you might
  never hit an opportune moment.
 
 We attack it from multiple angles.
 
 First we check the stack of all sleeping tasks.  That patches the
 majority of tasks immediately.  If necessary, we also do that
 periodically in a workqueue to catch any stragglers.

So not only do you need an 'atomic' stack save, you need to analyze and
flip its state in the same atomic region. The task cannot start running
again after the save and start using old functions while you analyze the
stack and flip it.

 The next line of attack is patching tasks when exiting the kernel to
 user space (system calls, interrupts, signals), to catch all CPU-bound
 and some I/O-bound tasks.  That's done in patch 9 [1] of the consistency
 model patch set.

So the HPC people are really into userspace that does for (;;) ; and
isolate that on CPUs and have the tick interrupt stopped and all that.

You'll not catch those threads on the sysexit path.

And I'm fairly sure they'll not want to SIGSTOP/CONT their stuff either.

Now its fairly easy to also handle this; just mark those tasks with a
_TIF_WORK_SYSCALL_ENTRY flag, have that slowpath wait for the flag to
go-away, then flip their state and clear the flag.

 As a last resort, if there are still any tasks which are sleeping on a
 to-be-patched function, the user can send them SIGSTOP and SIGCONT to
 force them to be patched.

You typically cannot SIGSTOP/SIGCONT kernel threads. Also
TASK_UNINTERRUPTIBLE sleeps are unaffected by signals.

Bit pesky that.. needs pondering.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-17 Thread Josh Poimboeuf
On Tue, Feb 17, 2015 at 07:15:41PM +0100, Peter Zijlstra wrote:
> On Tue, Feb 17, 2015 at 08:12:11AM -0600, Josh Poimboeuf wrote:
> > On Tue, Feb 17, 2015 at 10:24:50AM +0100, Peter Zijlstra wrote:
> > > So far stack unwinding has basically been a best effort debug output
> > > kind of thing, you're wanting to make the integrity of the kernel depend
> > > on it.
> > > 
> > > You require an absolute 100% correctness of the stack unwinder -- where
> > > today it is; as stated above; a best effort debug output thing.
> > > 
> > > That is a _big_ change.
> > 
> > True, this does seem to be the first attempt to rely on the correctness
> > of the stack walking code.
> > 
> > > Has this been properly considered; has all the asm of the relevant
> > > architectures been audited? Are you planning on maintaining that level
> > > of audit for all new patches?
> > 
> > I agree, the unwinder needs to be 100% correct.  Currently we only
> > support x86_64.  Luckily, in general, stacks are very well defined and
> > walking the stack of a sleeping task should be straightforward.  I don't
> > think it would be too hard to ensure the stack unwinder is right for
> > other architectures as we port them.
> 
> I would not be too sure about that, the x86 framepointer thing is not
> universal. IIRC some have to have some form of in-kernel dwarf
> unwinding.
> 
> And I'm assuming you're hard relying on CONFIG_FRAMEPOINTER here,
> because otherwise x86 stacks are a mess too.

Yeah, it'll rely on CONFIG_FRAME_POINTER.  IIUC, the arches we care
about now (x86, power, s390, arm64) all have frame pointer support, and
the stack formats are all sane, AFAICT.

If we ever port livepatch to a more obscure arch for which walking the
stack is more difficult, we'll have several options:

a) spend the time to ensure the unwinding code is correct and resilient
   to errors;

b) leave the consistency model compiled code out if !FRAME_POINTER and
   allow users to patch without one (similar to the livepatch code
   that's already in the livepatch tree today); or

c) not support that arch.

> So even with CONFIG_FRAMEPOINTER, things like copy_to/from_user, which
> are implemented in asm, don't honour that. So if one of those faults and
> the fault handler sleeps, you'll miss say your
> 'copy_user_enhanced_fast_string' entry.
> 
> Then again, those asm functions don't have function trace bits either,
> so you can't patch them to begin with I suppose.
> 
> Here's to hoping you don't have to..

Right.  We can't patch asm functions because we rely on ftrace, which
needs the -mfentry prologue.

> > > Because the way you propose to do things, we'll end up with silent but
> > > deadly fail if the unwinder is less than 100% correct. No way to easily
> > > debug that, no warns, just silent corruption.
> > 
> > That's a good point.  There's definitely room for additional error
> > checking in the x86 stack walking code.  A couple of ideas:
> > 
> > - make sure it starts from a __schedule() call at the top
> > - make sure we actually reach the bottom of the stack
> > - make sure each stack frame's return address is immediately after a
> >   call instruction
> > 
> > If we hit any of those errors, we can bail out, unregister with ftrace
> > and restore the system to its original state.
> 
> And then hope you can get a better trace next time around? Or will you
> fall-back to an alternative method of patching?

Yeah, on second thought, we wouldn't have to cancel the patch.  We could
defer to check the task's stack again at a later time.  If it's stuck
there, the user can try sending it a signal to unstick it, or cancel the
patching process.  Those mechanisms are already in place with my
consistency model patch set.

I'd also do a WARN_ON_ONCE and a dump of the full stack data, since I'm
guessing it would either indicate an error in the unwinding code or
point us to an unexpected stack condition.

> > > Are you really really sure you want to go do this?
> > 
> > Basically, yes.  We've had a lot of conversations about many different
> > variations of how to do this over the past year, and this is by far the
> > best approach we've come up with.
> 
> For some reason I'm thinking jikos is going to disagree with you on that
> :-)
> 
> I'm further thinking we don't actually need 2 (or more) different means
> of live patching in the kernel. So you all had better sit down (again)
> and come up with something you all agree on.

Yeah, I also _really_ want to avoid multiple consistency models.

In fact, that's a big motivation behind my consistency model patch set.
It's heavily inspired by a suggestion from Vojtech [1].  It combines
kpatch (backtrace checking) with kGraft (per-thread consistency).  It
has several advantages over either of the individual approaches.  See
http://lwn.net/Articles/632582/ where I describe its pros over both
kpatch and kGraft.

Jiri, would you and Vojtech agree that the proposed consistency model is
all we need?  Or do you still want to do the 

Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-17 Thread Peter Zijlstra
On Tue, Feb 17, 2015 at 08:12:11AM -0600, Josh Poimboeuf wrote:
> On Tue, Feb 17, 2015 at 10:24:50AM +0100, Peter Zijlstra wrote:
> > So far stack unwinding has basically been a best effort debug output
> > kind of thing, you're wanting to make the integrity of the kernel depend
> > on it.
> > 
> > You require an absolute 100% correctness of the stack unwinder -- where
> > today it is; as stated above; a best effort debug output thing.
> > 
> > That is a _big_ change.
> 
> True, this does seem to be the first attempt to rely on the correctness
> of the stack walking code.
> 
> > Has this been properly considered; has all the asm of the relevant
> > architectures been audited? Are you planning on maintaining that level
> > of audit for all new patches?
> 
> I agree, the unwinder needs to be 100% correct.  Currently we only
> support x86_64.  Luckily, in general, stacks are very well defined and
> walking the stack of a sleeping task should be straightforward.  I don't
> think it would be too hard to ensure the stack unwinder is right for
> other architectures as we port them.

I would not be too sure about that, the x86 framepointer thing is not
universal. IIRC some have to have some form of in-kernel dwarf
unwinding.

And I'm assuming you're hard relying on CONFIG_FRAMEPOINTER here,
because otherwise x86 stacks are a mess too.

So even with CONFIG_FRAMEPOINTER, things like copy_to/from_user, which
are implemented in asm, don't honour that. So if one of those faults and
the fault handler sleeps, you'll miss say your
'copy_user_enhanced_fast_string' entry.

Then again, those asm functions don't have function trace bits either,
so you can't patch them to begin with I suppose.

Here's to hoping you don't have to..

> > Because the way you propose to do things, we'll end up with silent but
> > deadly fail if the unwinder is less than 100% correct. No way to easily
> > debug that, no warns, just silent corruption.
> 
> That's a good point.  There's definitely room for additional error
> checking in the x86 stack walking code.  A couple of ideas:
> 
> - make sure it starts from a __schedule() call at the top
> - make sure we actually reach the bottom of the stack
> - make sure each stack frame's return address is immediately after a
>   call instruction
> 
> If we hit any of those errors, we can bail out, unregister with ftrace
> and restore the system to its original state.

And then hope you can get a better trace next time around? Or will you
fall-back to an alternative method of patching?

> > Are you really really sure you want to go do this?
> 
> Basically, yes.  We've had a lot of conversations about many different
> variations of how to do this over the past year, and this is by far the
> best approach we've come up with.

For some reason I'm thinking jikos is going to disagree with you on that
:-)

I'm further thinking we don't actually need 2 (or more) different means
of live patching in the kernel. So you all had better sit down (again)
and come up with something you all agree on.

> FWIW, we've been doing something similar with kpatch and stop_machine()
> over the last 1+ years, and have run into zero problems with that
> approach.

Could be you've just been lucky...

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-17 Thread Josh Poimboeuf
On Tue, Feb 17, 2015 at 10:24:50AM +0100, Peter Zijlstra wrote:
> On Mon, Feb 16, 2015 at 04:05:05PM -0600, Josh Poimboeuf wrote:
> > Yeah, I can understand that.  I definitely want to avoid touching the
> > scheduler code.  Basically I'm trying to find a way to atomically do the
> > following:
> > 
> > if (task is sleeping) {
> > walk the stack
> > if (certain set of functions isn't on the stack)
> > set (or clear) a thread flag for the task
> > }
> > 
> > Any ideas on how I can achieve that?  So far my ideas are:
> 
> So far stack unwinding has basically been a best effort debug output
> kind of thing, you're wanting to make the integrity of the kernel depend
> on it.
> 
> You require an absolute 100% correctness of the stack unwinder -- where
> today it is; as stated above; a best effort debug output thing.
> 
> That is a _big_ change.

True, this does seem to be the first attempt to rely on the correctness
of the stack walking code.

> Has this been properly considered; has all the asm of the relevant
> architectures been audited? Are you planning on maintaining that level
> of audit for all new patches?

I agree, the unwinder needs to be 100% correct.  Currently we only
support x86_64.  Luckily, in general, stacks are very well defined and
walking the stack of a sleeping task should be straightforward.  I don't
think it would be too hard to ensure the stack unwinder is right for
other architectures as we port them.

> Because the way you propose to do things, we'll end up with silent but
> deadly fail if the unwinder is less than 100% correct. No way to easily
> debug that, no warns, just silent corruption.

That's a good point.  There's definitely room for additional error
checking in the x86 stack walking code.  A couple of ideas:

- make sure it starts from a __schedule() call at the top
- make sure we actually reach the bottom of the stack
- make sure each stack frame's return address is immediately after a
  call instruction

If we hit any of those errors, we can bail out, unregister with ftrace
and restore the system to its original state.

> Are you really really sure you want to go do this?

Basically, yes.  We've had a lot of conversations about many different
variations of how to do this over the past year, and this is by far the
best approach we've come up with.

FWIW, we've been doing something similar with kpatch and stop_machine()
over the last 1+ years, and have run into zero problems with that
approach.

-- 
Josh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-17 Thread Peter Zijlstra
On Mon, Feb 16, 2015 at 04:05:05PM -0600, Josh Poimboeuf wrote:
> Yeah, I can understand that.  I definitely want to avoid touching the
> scheduler code.  Basically I'm trying to find a way to atomically do the
> following:
> 
> if (task is sleeping) {
>   walk the stack
>   if (certain set of functions isn't on the stack)
>   set (or clear) a thread flag for the task
> }
> 
> Any ideas on how I can achieve that?  So far my ideas are:

So far stack unwinding has basically been a best effort debug output
kind of thing, you're wanting to make the integrity of the kernel depend
on it.

You require an absolute 100% correctness of the stack unwinder -- where
today it is; as stated above; a best effort debug output thing.

That is a _big_ change.

Has this been properly considered; has all the asm of the relevant
architectures been audited? Are you planning on maintaining that level
of audit for all new patches?

Because the way you propose to do things, we'll end up with silent but
deadly fail if the unwinder is less than 100% correct. No way to easily
debug that, no warns, just silent corruption.

Are you really really sure you want to go do this?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-17 Thread Josh Poimboeuf
On Tue, Feb 17, 2015 at 10:24:50AM +0100, Peter Zijlstra wrote:
 On Mon, Feb 16, 2015 at 04:05:05PM -0600, Josh Poimboeuf wrote:
  Yeah, I can understand that.  I definitely want to avoid touching the
  scheduler code.  Basically I'm trying to find a way to atomically do the
  following:
  
  if (task is sleeping) {
  walk the stack
  if (certain set of functions isn't on the stack)
  set (or clear) a thread flag for the task
  }
  
  Any ideas on how I can achieve that?  So far my ideas are:
 
 So far stack unwinding has basically been a best effort debug output
 kind of thing, you're wanting to make the integrity of the kernel depend
 on it.
 
 You require an absolute 100% correctness of the stack unwinder -- where
 today it is; as stated above; a best effort debug output thing.
 
 That is a _big_ change.

True, this does seem to be the first attempt to rely on the correctness
of the stack walking code.

 Has this been properly considered; has all the asm of the relevant
 architectures been audited? Are you planning on maintaining that level
 of audit for all new patches?

I agree, the unwinder needs to be 100% correct.  Currently we only
support x86_64.  Luckily, in general, stacks are very well defined and
walking the stack of a sleeping task should be straightforward.  I don't
think it would be too hard to ensure the stack unwinder is right for
other architectures as we port them.

 Because the way you propose to do things, we'll end up with silent but
 deadly fail if the unwinder is less than 100% correct. No way to easily
 debug that, no warns, just silent corruption.

That's a good point.  There's definitely room for additional error
checking in the x86 stack walking code.  A couple of ideas:

- make sure it starts from a __schedule() call at the top
- make sure we actually reach the bottom of the stack
- make sure each stack frame's return address is immediately after a
  call instruction

If we hit any of those errors, we can bail out, unregister with ftrace
and restore the system to its original state.

 Are you really really sure you want to go do this?

Basically, yes.  We've had a lot of conversations about many different
variations of how to do this over the past year, and this is by far the
best approach we've come up with.

FWIW, we've been doing something similar with kpatch and stop_machine()
over the last 1+ years, and have run into zero problems with that
approach.

-- 
Josh
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-17 Thread Peter Zijlstra
On Tue, Feb 17, 2015 at 08:12:11AM -0600, Josh Poimboeuf wrote:
 On Tue, Feb 17, 2015 at 10:24:50AM +0100, Peter Zijlstra wrote:
  So far stack unwinding has basically been a best effort debug output
  kind of thing, you're wanting to make the integrity of the kernel depend
  on it.
  
  You require an absolute 100% correctness of the stack unwinder -- where
  today it is; as stated above; a best effort debug output thing.
  
  That is a _big_ change.
 
 True, this does seem to be the first attempt to rely on the correctness
 of the stack walking code.
 
  Has this been properly considered; has all the asm of the relevant
  architectures been audited? Are you planning on maintaining that level
  of audit for all new patches?
 
 I agree, the unwinder needs to be 100% correct.  Currently we only
 support x86_64.  Luckily, in general, stacks are very well defined and
 walking the stack of a sleeping task should be straightforward.  I don't
 think it would be too hard to ensure the stack unwinder is right for
 other architectures as we port them.

I would not be too sure about that, the x86 framepointer thing is not
universal. IIRC some have to have some form of in-kernel dwarf
unwinding.

And I'm assuming you're hard relying on CONFIG_FRAMEPOINTER here,
because otherwise x86 stacks are a mess too.

So even with CONFIG_FRAMEPOINTER, things like copy_to/from_user, which
are implemented in asm, don't honour that. So if one of those faults and
the fault handler sleeps, you'll miss say your
'copy_user_enhanced_fast_string' entry.

Then again, those asm functions don't have function trace bits either,
so you can't patch them to begin with I suppose.

Here's to hoping you don't have to..

  Because the way you propose to do things, we'll end up with silent but
  deadly fail if the unwinder is less than 100% correct. No way to easily
  debug that, no warns, just silent corruption.
 
 That's a good point.  There's definitely room for additional error
 checking in the x86 stack walking code.  A couple of ideas:
 
 - make sure it starts from a __schedule() call at the top
 - make sure we actually reach the bottom of the stack
 - make sure each stack frame's return address is immediately after a
   call instruction
 
 If we hit any of those errors, we can bail out, unregister with ftrace
 and restore the system to its original state.

And then hope you can get a better trace next time around? Or will you
fall-back to an alternative method of patching?

  Are you really really sure you want to go do this?
 
 Basically, yes.  We've had a lot of conversations about many different
 variations of how to do this over the past year, and this is by far the
 best approach we've come up with.

For some reason I'm thinking jikos is going to disagree with you on that
:-)

I'm further thinking we don't actually need 2 (or more) different means
of live patching in the kernel. So you all had better sit down (again)
and come up with something you all agree on.

 FWIW, we've been doing something similar with kpatch and stop_machine()
 over the last 1+ years, and have run into zero problems with that
 approach.

Could be you've just been lucky...

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-17 Thread Josh Poimboeuf
On Tue, Feb 17, 2015 at 07:15:41PM +0100, Peter Zijlstra wrote:
 On Tue, Feb 17, 2015 at 08:12:11AM -0600, Josh Poimboeuf wrote:
  On Tue, Feb 17, 2015 at 10:24:50AM +0100, Peter Zijlstra wrote:
   So far stack unwinding has basically been a best effort debug output
   kind of thing, you're wanting to make the integrity of the kernel depend
   on it.
   
   You require an absolute 100% correctness of the stack unwinder -- where
   today it is; as stated above; a best effort debug output thing.
   
   That is a _big_ change.
  
  True, this does seem to be the first attempt to rely on the correctness
  of the stack walking code.
  
   Has this been properly considered; has all the asm of the relevant
   architectures been audited? Are you planning on maintaining that level
   of audit for all new patches?
  
  I agree, the unwinder needs to be 100% correct.  Currently we only
  support x86_64.  Luckily, in general, stacks are very well defined and
  walking the stack of a sleeping task should be straightforward.  I don't
  think it would be too hard to ensure the stack unwinder is right for
  other architectures as we port them.
 
 I would not be too sure about that, the x86 framepointer thing is not
 universal. IIRC some have to have some form of in-kernel dwarf
 unwinding.
 
 And I'm assuming you're hard relying on CONFIG_FRAMEPOINTER here,
 because otherwise x86 stacks are a mess too.

Yeah, it'll rely on CONFIG_FRAME_POINTER.  IIUC, the arches we care
about now (x86, power, s390, arm64) all have frame pointer support, and
the stack formats are all sane, AFAICT.

If we ever port livepatch to a more obscure arch for which walking the
stack is more difficult, we'll have several options:

a) spend the time to ensure the unwinding code is correct and resilient
   to errors;

b) leave the consistency model compiled code out if !FRAME_POINTER and
   allow users to patch without one (similar to the livepatch code
   that's already in the livepatch tree today); or

c) not support that arch.

 So even with CONFIG_FRAMEPOINTER, things like copy_to/from_user, which
 are implemented in asm, don't honour that. So if one of those faults and
 the fault handler sleeps, you'll miss say your
 'copy_user_enhanced_fast_string' entry.
 
 Then again, those asm functions don't have function trace bits either,
 so you can't patch them to begin with I suppose.
 
 Here's to hoping you don't have to..

Right.  We can't patch asm functions because we rely on ftrace, which
needs the -mfentry prologue.

   Because the way you propose to do things, we'll end up with silent but
   deadly fail if the unwinder is less than 100% correct. No way to easily
   debug that, no warns, just silent corruption.
  
  That's a good point.  There's definitely room for additional error
  checking in the x86 stack walking code.  A couple of ideas:
  
  - make sure it starts from a __schedule() call at the top
  - make sure we actually reach the bottom of the stack
  - make sure each stack frame's return address is immediately after a
call instruction
  
  If we hit any of those errors, we can bail out, unregister with ftrace
  and restore the system to its original state.
 
 And then hope you can get a better trace next time around? Or will you
 fall-back to an alternative method of patching?

Yeah, on second thought, we wouldn't have to cancel the patch.  We could
defer to check the task's stack again at a later time.  If it's stuck
there, the user can try sending it a signal to unstick it, or cancel the
patching process.  Those mechanisms are already in place with my
consistency model patch set.

I'd also do a WARN_ON_ONCE and a dump of the full stack data, since I'm
guessing it would either indicate an error in the unwinding code or
point us to an unexpected stack condition.

   Are you really really sure you want to go do this?
  
  Basically, yes.  We've had a lot of conversations about many different
  variations of how to do this over the past year, and this is by far the
  best approach we've come up with.
 
 For some reason I'm thinking jikos is going to disagree with you on that
 :-)
 
 I'm further thinking we don't actually need 2 (or more) different means
 of live patching in the kernel. So you all had better sit down (again)
 and come up with something you all agree on.

Yeah, I also _really_ want to avoid multiple consistency models.

In fact, that's a big motivation behind my consistency model patch set.
It's heavily inspired by a suggestion from Vojtech [1].  It combines
kpatch (backtrace checking) with kGraft (per-thread consistency).  It
has several advantages over either of the individual approaches.  See
http://lwn.net/Articles/632582/ where I describe its pros over both
kpatch and kGraft.

Jiri, would you and Vojtech agree that the proposed consistency model is
all we need?  Or do you still want to do the multiple consistency model
thing?

  FWIW, we've been doing something similar with kpatch and stop_machine()
  over the 

Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-17 Thread Peter Zijlstra
On Mon, Feb 16, 2015 at 04:05:05PM -0600, Josh Poimboeuf wrote:
 Yeah, I can understand that.  I definitely want to avoid touching the
 scheduler code.  Basically I'm trying to find a way to atomically do the
 following:
 
 if (task is sleeping) {
   walk the stack
   if (certain set of functions isn't on the stack)
   set (or clear) a thread flag for the task
 }
 
 Any ideas on how I can achieve that?  So far my ideas are:

So far stack unwinding has basically been a best effort debug output
kind of thing, you're wanting to make the integrity of the kernel depend
on it.

You require an absolute 100% correctness of the stack unwinder -- where
today it is; as stated above; a best effort debug output thing.

That is a _big_ change.

Has this been properly considered; has all the asm of the relevant
architectures been audited? Are you planning on maintaining that level
of audit for all new patches?

Because the way you propose to do things, we'll end up with silent but
deadly fail if the unwinder is less than 100% correct. No way to easily
debug that, no warns, just silent corruption.

Are you really really sure you want to go do this?
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-16 Thread Josh Poimboeuf
On Mon, Feb 16, 2015 at 09:44:36PM +0100, Peter Zijlstra wrote:
> On Mon, Feb 16, 2015 at 12:52:34PM -0600, Josh Poimboeuf wrote:
> > +++ b/kernel/sched/core.c
> > @@ -1338,6 +1338,23 @@ void kick_process(struct task_struct *p)
> >  EXPORT_SYMBOL_GPL(kick_process);
> >  #endif /* CONFIG_SMP */
> >  
> > +/***
> > + * sched_task_call - call a function with a task's state locked
> > + *
> > + * The task is guaranteed to remain either active or inactive during the
> > + * function call.
> > + */
> > +void sched_task_call(sched_task_call_func_t func, struct task_struct *p,
> > +void *data)
> > +{
> > +   unsigned long flags;
> > +   struct rq *rq;
> > +
> > +   rq = task_rq_lock(p, );
> > +   func(p, data);
> > +   task_rq_unlock(rq, p, );
> > +}
> 
> Yeah, I think not. We're so not going to allow running random code under
> rq->lock and p->pi_lock.

Yeah, I can understand that.  I definitely want to avoid touching the
scheduler code.  Basically I'm trying to find a way to atomically do the
following:

if (task is sleeping) {
walk the stack
if (certain set of functions isn't on the stack)
set (or clear) a thread flag for the task
}

Any ideas on how I can achieve that?  So far my ideas are:

1. Use task_rq_lock() -- but rq_lock is internal to sched code.

2. Use wait_task_inactive() -- I could call it twice, with the stack
   checking in between, and use ncsw to ensure that it didn't reschedule
   in the mean time.  But this still seems racy.  i.e., I think the task
   could start running after the second call to wait_task_inactive()
   returns but before setting the thread flag.  Not sure if that's a
   realistic race condition or not.

3. Use set_cpus_allowed() to temporarily pin the task to its current
   CPU, and then call smp_call_function_single() to run the above
   critical section on that CPU.  I'm not sure if there's a race-free
   way to do it but it's a lot more disruptive than I'd like...

Any ideas or guidance would be greatly appreciated!

-- 
Josh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-16 Thread Peter Zijlstra
On Mon, Feb 16, 2015 at 12:52:34PM -0600, Josh Poimboeuf wrote:
> +++ b/kernel/sched/core.c
> @@ -1338,6 +1338,23 @@ void kick_process(struct task_struct *p)
>  EXPORT_SYMBOL_GPL(kick_process);
>  #endif /* CONFIG_SMP */
>  
> +/***
> + * sched_task_call - call a function with a task's state locked
> + *
> + * The task is guaranteed to remain either active or inactive during the
> + * function call.
> + */
> +void sched_task_call(sched_task_call_func_t func, struct task_struct *p,
> +  void *data)
> +{
> + unsigned long flags;
> + struct rq *rq;
> +
> + rq = task_rq_lock(p, );
> + func(p, data);
> + task_rq_unlock(rq, p, );
> +}

Yeah, I think not. We're so not going to allow running random code under
rq->lock and p->pi_lock.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


  1   2   >