I have a proposition for you; it will be of mutual benefit to both of us.
-- Dear Sir, I am David Williams. I have a proposition involving an investment initiative in your country to discuss with you. It will be of mutual benefit to both of us, and I believe we can handle it together once we have a common understanding and mutual co-operation in the execution of the modalities. I am a staff member at a commercial bank, and I discovered a dormant account belonging to the late Mr. Richard Cousin. If you are interested, please e-mail me back at officei...@citromail.hu for more details. If you received this message in your spam or junk folder, that is due to your network provider. I await your earliest response. Yours sincerely, David Williams E-mail: officei...@citromail.hu
Mutual Benefit
Hi, I am contacting you with regard to using your last name to claim long-dormant funds belonging to a late depositor. Let me know if you are interested in more details. Best regards, David Lin
My name is Francis Aaron. I have a financial proposal which will benefit both of us. If you are interested, get back to me for more details. Thanks.
Mutual Benefit
I am Sir Donald Hopkins, working at Fortune Wine Company UK. I have a profitable business proposal for you, as one of the products my company uses in manufacturing wine is found in India. I will give you more details on the business transaction once I get a response from you. Best regards, Donald Hopkins, Department Manager, Fortune Wine Company UK
Re: [PATCH] rcu: Benefit from expedited grace period in __wait_rcu_gp
On Tue, Oct 23, 2018 at 03:13:43PM +, Raslan, KarimAllah wrote: > On Fri, 2018-10-19 at 13:21 -0700, Paul E. McKenney wrote: > > On Fri, Oct 19, 2018 at 07:45:51PM +, Raslan, KarimAllah wrote: > > > > > > On Fri, 2018-10-19 at 05:31 -0700, Paul E. McKenney wrote: > > > > > > > > On Fri, Oct 19, 2018 at 02:49:05AM +0200, KarimAllah Ahmed wrote: > > > > > > > > > > > > > > > When expedited grace-period is set, both synchronize_sched > > > > > synchronize_rcu_bh can be optimized to have a significantly lower > > > > > latency. > > > > > > > > > > Improve wait_rcu_gp handling to also account for expedited > > > > > grace-period. > > > > > The downside is that wait_rcu_gp will not wait anymore for all RCU > > > > > variants > > > > > concurrently when an expedited grace-period is set, however, given the > > > > > improved latency it does not really matter. > > > > > > > > > > Cc: Paul E. McKenney > > > > > Cc: Josh Triplett > > > > > Cc: Steven Rostedt > > > > > Cc: Mathieu Desnoyers > > > > > Cc: Lai Jiangshan > > > > > Cc: linux-kernel@vger.kernel.org > > > > > Signed-off-by: KarimAllah Ahmed > > > > > > > > Cute! > > > > > > > > Unfortunately, there are a few problems with this patch: > > > > > > > > 1. I will be eliminating synchronize_rcu_mult() due to the fact > > > > that > > > > the upcoming RCU flavor consolidation eliminates its sole > > > > caller. > > > > See 5fc9d4e000b1 ("rcu: Eliminate synchronize_rcu_mult()") > > > > in my -rcu tree. This would of course also eliminate the > > > > effects > > > > of this patch. > > > > > > Your patch covers our use-case already, but I still think that the > > > semantics > > > for wait_rcu_gp is not clear to me. > > > > > > The problem for us was that sched_cpu_deactivate would call > > > synchronize_rcu_mult which does not check for "expedited" at all. So even > > > though we are already using rcu_expedited sysctl variable, > > > synchronize_rcu_mult > > > was just ignoring it. 
> > > > > > That being said, I indeed overlooked rcu_normal and that it takes > > > precedence > > > over expedited and I did not notice at all the deadlock you mentioned > > > below! > > > > > > That can however be easily fixed by also checking for !rcu_gp_is_normal. > > > > ??? > > > > The aforementioned 5fc9d4e000b1 commit replaces the synchronize_rcu_mult() > > with synchronize_rcu(), which really would be subject to the sysfs > > variables. Of course, this is not yet in mainline, so it perhaps cannot > > solve your immediate problem, which probably involve older kernels in > > any case. More on this below... > > > > > > > > > > > > > 2. The real-time guys' users are not going to be at all happy > > > > with the IPIs resulting from the _expedited() API members. > > > > Yes, they can boot with rcupdate.rcu_normal=1, but they don't > > > > always need that big a hammer, and use of this kernel parameter > > > > can slow down boot, hibernation, suspend, network configuration, > > > > and much else besides. We therefore don't want them to have to > > > > use rcupdate.rcu_normal=1 unless absolutely necessary. > > > > > > I might be missing something here. Why would they need to "explicitly" > > > use > > > rcu_normal? If rcu_expedited is set, would not the expected behavior is > > > to call > > > into the expedited version? > > > > > > My patch should only activate *expedited* if and only if it is set. > > > > You are right, I was confused. However... > > > > > > > > I think I might be misunderstanding the expected behavior > > > from synchronize_rcu_mult. My understanding is that something like: > > > > > > synchronize_rcu_mult(call_rcu_sched) and synchronize_rcu() should have an > > > identical behavior, right? > > > > You would clearly prefer that it did, and the commit log does seem to > > read that way, but synchronize_rcu_mult() is going away anyway, so there > > isn't a whole lot of point in arguing about what it should have done. 
> > And the eventual implementation (with 5fc9d4e000b1 or its successor) > > will act as you want. > > > > > > > > At least in this commit: > > > > > > commit d7d34d5e46140 ("sched: Rely on synchronize_rcu_mult() > > > de-duplication") > > > > > > .. the change clearly gives the impression that they can be used > > > interchangeably. The problem is that this is not true when you look at > > > the > > > implementation. One of them (i.e. synchronize_rcu) will respect the > > > expedite_rcu flag set by sysfs while the other (i.e. > > > synchronize_rcu_mult) > > > simply ignores it. > > > > > > So my patch is about making sure that both of the variants actually > > > respect > > > it. > > > > I am guessing that you need to make an older version of the kernel > > expedite the CPU-hotplug grace periods. I am also guessing that you can > > carry patches to your kernels. If so, I suggest the
Re: [PATCH] rcu: Benefit from expedited grace period in __wait_rcu_gp
On Fri, 2018-10-19 at 13:21 -0700, Paul E. McKenney wrote: > On Fri, Oct 19, 2018 at 07:45:51PM +, Raslan, KarimAllah wrote: > > > > On Fri, 2018-10-19 at 05:31 -0700, Paul E. McKenney wrote: > > > > > > On Fri, Oct 19, 2018 at 02:49:05AM +0200, KarimAllah Ahmed wrote: > > > > > > > > > > > > When expedited grace-period is set, both synchronize_sched > > > > synchronize_rcu_bh can be optimized to have a significantly lower > > > > latency. > > > > > > > > Improve wait_rcu_gp handling to also account for expedited grace-period. > > > > The downside is that wait_rcu_gp will not wait anymore for all RCU > > > > variants > > > > concurrently when an expedited grace-period is set, however, given the > > > > improved latency it does not really matter. > > > > > > > > Cc: Paul E. McKenney > > > > Cc: Josh Triplett > > > > Cc: Steven Rostedt > > > > Cc: Mathieu Desnoyers > > > > Cc: Lai Jiangshan > > > > Cc: linux-kernel@vger.kernel.org > > > > Signed-off-by: KarimAllah Ahmed > > > > > > Cute! > > > > > > Unfortunately, there are a few problems with this patch: > > > > > > 1.I will be eliminating synchronize_rcu_mult() due to the fact > > > that > > > the upcoming RCU flavor consolidation eliminates its sole caller. > > > See 5fc9d4e000b1 ("rcu: Eliminate synchronize_rcu_mult()") > > > in my -rcu tree. This would of course also eliminate the effects > > > of this patch. > > > > Your patch covers our use-case already, but I still think that the > > semantics > > for wait_rcu_gp is not clear to me. > > > > The problem for us was that sched_cpu_deactivate would call > > synchronize_rcu_mult which does not check for "expedited" at all. So even > > though we are already using rcu_expedited sysctl variable, > > synchronize_rcu_mult > > was just ignoring it. > > > > That being said, I indeed overlooked rcu_normal and that it takes > > precedence > > over expedited and I did not notice at all the deadlock you mentioned below! 
> > > > That can however be easily fixed by also checking for !rcu_gp_is_normal. > > ??? > > The aforementioned 5fc9d4e000b1 commit replaces the synchronize_rcu_mult() > with synchronize_rcu(), which really would be subject to the sysfs > variables. Of course, this is not yet in mainline, so it perhaps cannot > solve your immediate problem, which probably involve older kernels in > any case. More on this below... > > > > > > > > > 2.The real-time guys' users are not going to be at all happy > > > with the IPIs resulting from the _expedited() API members. > > > Yes, they can boot with rcupdate.rcu_normal=1, but they don't > > > always need that big a hammer, and use of this kernel parameter > > > can slow down boot, hibernation, suspend, network configuration, > > > and much else besides. We therefore don't want them to have to > > > use rcupdate.rcu_normal=1 unless absolutely necessary. > > > > I might be missing something here. Why would they need to "explicitly" use > > rcu_normal? If rcu_expedited is set, would not the expected behavior is to > > call > > into the expedited version? > > > > My patch should only activate *expedited* if and only if it is set. > > You are right, I was confused. However... > > > > > I think I might be misunderstanding the expected behavior > > from synchronize_rcu_mult. My understanding is that something like: > > > > synchronize_rcu_mult(call_rcu_sched) and synchronize_rcu() should have an > > identical behavior, right? > > You would clearly prefer that it did, and the commit log does seem to > read that way, but synchronize_rcu_mult() is going away anyway, so there > isn't a whole lot of point in arguing about what it should have done. > And the eventual implementation (with 5fc9d4e000b1 or its successor) > will act as you want. > > > > > At least in this commit: > > > > commit d7d34d5e46140 ("sched: Rely on synchronize_rcu_mult() > > de-duplication") > > > > .. 
> > the change clearly gives the impression that they can be used
> > interchangeably. The problem is that this is not true when you look at the
> > implementation. One of them (i.e. synchronize_rcu) will respect the
> > expedite_rcu flag set by sysfs while the other (i.e. synchronize_rcu_mult)
> > simply ignores it.
> >
> > So my patch is about making sure that both of the variants actually respect
> > it.
>
> I am guessing that you need to make an older version of the kernel
> expedite the CPU-hotplug grace periods. I am also guessing that you can
> carry patches to your kernels. If so, I suggest the following simpler
> change to sched_cpu_deactivate() in kernel/sched/core.c:
>
> 	if (rcu_gp_is_expedited()) {
> 		synchronize_sched_expedited();
> 		if (IS_ENABLED(CONFIG_PREEMPT))
> 			synchronize_rcu_expedited();
> 	} else {
> 		synchronize_rcu_mult(call_rcu, call_rcu_sched);
> 	}
>
> As soon as this patch conflicts due to the
Re: [PATCH] rcu: Benefit from expedited grace period in __wait_rcu_gp
On Fri, Oct 19, 2018 at 07:45:51PM +, Raslan, KarimAllah wrote: > On Fri, 2018-10-19 at 05:31 -0700, Paul E. McKenney wrote: > > On Fri, Oct 19, 2018 at 02:49:05AM +0200, KarimAllah Ahmed wrote: > > > > > > When expedited grace-period is set, both synchronize_sched > > > synchronize_rcu_bh can be optimized to have a significantly lower latency. > > > > > > Improve wait_rcu_gp handling to also account for expedited grace-period. > > > The downside is that wait_rcu_gp will not wait anymore for all RCU > > > variants > > > concurrently when an expedited grace-period is set, however, given the > > > improved latency it does not really matter. > > > > > > Cc: Paul E. McKenney > > > Cc: Josh Triplett > > > Cc: Steven Rostedt > > > Cc: Mathieu Desnoyers > > > Cc: Lai Jiangshan > > > Cc: linux-kernel@vger.kernel.org > > > Signed-off-by: KarimAllah Ahmed > > > > Cute! > > > > Unfortunately, there are a few problems with this patch: > > > > 1. I will be eliminating synchronize_rcu_mult() due to the fact that > > the upcoming RCU flavor consolidation eliminates its sole caller. > > See 5fc9d4e000b1 ("rcu: Eliminate synchronize_rcu_mult()") > > in my -rcu tree. This would of course also eliminate the effects > > of this patch. > > Your patch covers our use-case already, but I still think that the semantics > for wait_rcu_gp is not clear to me. > > The problem for us was that sched_cpu_deactivate would call > synchronize_rcu_mult which does not check for "expedited" at all. So even > though we are already using rcu_expedited sysctl variable, > synchronize_rcu_mult > was just ignoring it. > > That being said, I indeed overlooked rcu_normal and that it takes precedence > over expedited and I did not notice at all the deadlock you mentioned below! > > That can however be easily fixed by also checking for !rcu_gp_is_normal. ??? The aforementioned 5fc9d4e000b1 commit replaces the synchronize_rcu_mult() with synchronize_rcu(), which really would be subject to the sysfs variables. 
Of course, this is not yet in mainline, so it perhaps cannot solve your immediate problem, which probably involve older kernels in any case. More on this below... > > 2. The real-time guys' users are not going to be at all happy > > with the IPIs resulting from the _expedited() API members. > > Yes, they can boot with rcupdate.rcu_normal=1, but they don't > > always need that big a hammer, and use of this kernel parameter > > can slow down boot, hibernation, suspend, network configuration, > > and much else besides. We therefore don't want them to have to > > use rcupdate.rcu_normal=1 unless absolutely necessary. > > I might be missing something here. Why would they need to "explicitly" use > rcu_normal? If rcu_expedited is set, would not the expected behavior is to > call > into the expedited version? > > My patch should only activate *expedited* if and only if it is set. You are right, I was confused. However... > I think I might be misunderstanding the expected behavior > from synchronize_rcu_mult. My understanding is that something like: > > synchronize_rcu_mult(call_rcu_sched) and synchronize_rcu() should have an > identical behavior, right? You would clearly prefer that it did, and the commit log does seem to read that way, but synchronize_rcu_mult() is going away anyway, so there isn't a whole lot of point in arguing about what it should have done. And the eventual implementation (with 5fc9d4e000b1 or its successor) will act as you want. > At least in this commit: > > commit d7d34d5e46140 ("sched: Rely on synchronize_rcu_mult() de-duplication") > > .. the change clearly gives the impression that they can be used > interchangeably. The problem is that this is not true when you look at the > implementation. One of them (i.e. synchronize_rcu) will respect the > expedite_rcu flag set by sysfs while the other (i.e. synchronize_rcu_mult) > simply ignores it. > > So my patch is about making sure that both of the variants actually respect > it. 
I am guessing that you need to make an older version of the kernel
expedite the CPU-hotplug grace periods. I am also guessing that you can
carry patches to your kernels. If so, I suggest the following simpler
change to sched_cpu_deactivate() in kernel/sched/core.c:

	if (rcu_gp_is_expedited()) {
		synchronize_sched_expedited();
		if (IS_ENABLED(CONFIG_PREEMPT))
			synchronize_rcu_expedited();
	} else {
		synchronize_rcu_mult(call_rcu, call_rcu_sched);
	}

As soon as this patch conflicts due to the synchronize_rcu_mult() becoming
synchronize_rcu(), you can drop the patch. And this is the only use of
synchronize_rcu_mult(), so this approach loses no generality.

Longer term, this patch might possibly be the backport of 5fc9d4e000b1 back
to v4.14, but at the end of the day this is up to the various -stable
maintainers.

Hmmm... If
Re: [PATCH] rcu: Benefit from expedited grace period in __wait_rcu_gp
On Fri, 2018-10-19 at 05:31 -0700, Paul E. McKenney wrote:
> On Fri, Oct 19, 2018 at 02:49:05AM +0200, KarimAllah Ahmed wrote:
> >
> > When expedited grace-period is set, both synchronize_sched
> > synchronize_rcu_bh can be optimized to have a significantly lower latency.
> >
> > Improve wait_rcu_gp handling to also account for expedited grace-period.
> > The downside is that wait_rcu_gp will not wait anymore for all RCU variants
> > concurrently when an expedited grace-period is set, however, given the
> > improved latency it does not really matter.
> >
> > Cc: Paul E. McKenney
> > Cc: Josh Triplett
> > Cc: Steven Rostedt
> > Cc: Mathieu Desnoyers
> > Cc: Lai Jiangshan
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: KarimAllah Ahmed
>
> Cute!
>
> Unfortunately, there are a few problems with this patch:
>
> 1.	I will be eliminating synchronize_rcu_mult() due to the fact that
> 	the upcoming RCU flavor consolidation eliminates its sole caller.
> 	See 5fc9d4e000b1 ("rcu: Eliminate synchronize_rcu_mult()")
> 	in my -rcu tree. This would of course also eliminate the effects
> 	of this patch.

Your patch covers our use-case already, but I still think that the semantics
of wait_rcu_gp are not clear to me.

The problem for us was that sched_cpu_deactivate would call
synchronize_rcu_mult, which does not check for "expedited" at all. So even
though we were already using the rcu_expedited sysctl variable,
synchronize_rcu_mult was just ignoring it.

That being said, I indeed overlooked rcu_normal and that it takes precedence
over expedited, and I did not notice at all the deadlock you mentioned below!

That can however be easily fixed by also checking for !rcu_gp_is_normal.

> 2.	The real-time guys' users are not going to be at all happy
> 	with the IPIs resulting from the _expedited() API members.
> 	Yes, they can boot with rcupdate.rcu_normal=1, but they don't
> 	always need that big a hammer, and use of this kernel parameter
> 	can slow down boot, hibernation, suspend, network configuration,
> 	and much else besides. We therefore don't want them to have to
> 	use rcupdate.rcu_normal=1 unless absolutely necessary.

I might be missing something here. Why would they need to "explicitly" use
rcu_normal? If rcu_expedited is set, would not the expected behavior be to
call into the expedited version?

My patch should only activate *expedited* if and only if it is set.

I think I might be misunderstanding the expected behavior of
synchronize_rcu_mult. My understanding is that something like
synchronize_rcu_mult(call_rcu_sched) and synchronize_rcu() should have
identical behavior, right?

At least in this commit:

commit d7d34d5e46140 ("sched: Rely on synchronize_rcu_mult() de-duplication")

.. the change clearly gives the impression that they can be used
interchangeably. The problem is that this is not true when you look at the
implementation. One of them (i.e. synchronize_rcu) will respect the
expedite_rcu flag set by sysfs while the other (i.e. synchronize_rcu_mult)
simply ignores it.

So my patch is about making sure that both of the variants actually respect
it.

> 3.	If the real-time guys' users were to have booted with
> 	rcupdate.rcu_normal=1, then synchronize_sched_expedited()
> 	would invoke _synchronize_rcu_expedited, which would invoke
> 	wait_rcu_gp(), which would invoke _wait_rcu_gp() which would
> 	invoke __wait_rcu_gp(), which, given your patch, would in turn
> 	invoke synchronize_sched_expedited(). This situation could
> 	well prevent their systems from meeting their response-time
> 	requirements.
>
> So I cannot accept this patch nor for that matter any similar patch.
>
> But what were you really trying to get done here? If you were thinking
> of adding another synchronize_rcu_mult(), the flavor consolidation will
> make that unnecessary in most cases.
If you are trying to speed up > CPU-hotplug operations, I suggest using the rcu_expedited sysctl variable > when taking a CPU offline. If something else, please let me know what > it is so that we can work out how the problem might best be solved. > > Thanx, Paul > > > > > --- > > kernel/rcu/update.c | 34 -- > > 1 file changed, 28 insertions(+), 6 deletions(-) > > > > diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c > > index 68fa19a..44b8817 100644 > > --- a/kernel/rcu/update.c > > +++ b/kernel/rcu/update.c > > @@ -392,13 +392,27 @@ void __wait_rcu_gp(bool checktiny, int n, > > call_rcu_func_t *crcu_array, > > might_sleep(); > > continue; > > } > > - init_rcu_head_on_stack(_array[i].head); > > - init_completion(_array[i].completion); > > + > > for (j = 0; j < i; j++) > > if (crcu_array[j] == crcu_array[i]) > >
Re: [PATCH] rcu: Benefit from expedited grace period in __wait_rcu_gp
On Fri, Oct 19, 2018 at 02:49:05AM +0200, KarimAllah Ahmed wrote:
> When expedited grace-period is set, both synchronize_sched and
> synchronize_rcu_bh can be optimized to have a significantly lower latency.
>
> Improve wait_rcu_gp handling to also account for expedited grace-period.
> The downside is that wait_rcu_gp will not wait anymore for all RCU variants
> concurrently when an expedited grace-period is set; however, given the
> improved latency it does not really matter.
>
> Cc: Paul E. McKenney
> Cc: Josh Triplett
> Cc: Steven Rostedt
> Cc: Mathieu Desnoyers
> Cc: Lai Jiangshan
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: KarimAllah Ahmed

Cute!

Unfortunately, there are a few problems with this patch:

1.	I will be eliminating synchronize_rcu_mult() because the upcoming
	RCU flavor consolidation eliminates its sole caller.  See
	5fc9d4e000b1 ("rcu: Eliminate synchronize_rcu_mult()") in my
	-rcu tree.  This would of course also eliminate the effects of
	this patch.

2.	The real-time guys' users are not going to be at all happy with
	the IPIs resulting from the _expedited() API members.  Yes, they
	can boot with rcupdate.rcu_normal=1, but they don't always need
	that big a hammer, and use of this kernel parameter can slow down
	boot, hibernation, suspend, network configuration, and much else
	besides.  We therefore don't want them to have to use
	rcupdate.rcu_normal=1 unless absolutely necessary.

3.	If the real-time guys' users were to have booted with
	rcupdate.rcu_normal=1, then synchronize_sched_expedited() would
	invoke _synchronize_rcu_expedited(), which would invoke
	wait_rcu_gp(), which would invoke _wait_rcu_gp(), which would
	invoke __wait_rcu_gp(), which, given your patch, would in turn
	invoke synchronize_sched_expedited().  This situation could well
	prevent their systems from meeting their response-time
	requirements.

So I cannot accept this patch, nor for that matter any similar patch.

But what were you really trying to get done here?  If you were thinking
of adding another synchronize_rcu_mult(), the flavor consolidation will
make that unnecessary in most cases.  If you are trying to speed up
CPU-hotplug operations, I suggest using the rcu_expedited sysctl variable
when taking a CPU offline.  If something else, please let me know what
it is so that we can work out how the problem might best be solved.

							Thanx, Paul

> ---
>  kernel/rcu/update.c | 34 ++++++++++++++++++++++++++++------
>  1 file changed, 28 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> index 68fa19a..44b8817 100644
> --- a/kernel/rcu/update.c
> +++ b/kernel/rcu/update.c
> @@ -392,13 +392,27 @@ void __wait_rcu_gp(bool checktiny, int n, call_rcu_func_t *crcu_array,
>  			might_sleep();
>  			continue;
>  		}
> -		init_rcu_head_on_stack(&rs_array[i].head);
> -		init_completion(&rs_array[i].completion);
> +
>  		for (j = 0; j < i; j++)
>  			if (crcu_array[j] == crcu_array[i])
>  				break;
> -		if (j == i)
> -			(crcu_array[i])(&rs_array[i].head, wakeme_after_rcu);
> +		if (j != i)
> +			continue;
> +
> +		if ((crcu_array[i] == call_rcu_sched ||
> +		     crcu_array[i] == call_rcu_bh)
> +		    && rcu_gp_is_expedited()) {
> +			if (crcu_array[i] == call_rcu_sched)
> +				synchronize_sched_expedited();
> +			else
> +				synchronize_rcu_bh_expedited();
> +
> +			continue;
> +		}
> +
> +		init_rcu_head_on_stack(&rs_array[i].head);
> +		init_completion(&rs_array[i].completion);
> +		(crcu_array[i])(&rs_array[i].head, wakeme_after_rcu);
>  	}
>
>  	/* Wait for all callbacks to be invoked. */
> @@ -407,11 +421,19 @@ void __wait_rcu_gp(bool checktiny, int n, call_rcu_func_t *crcu_array,
>  		    (crcu_array[i] == call_rcu ||
>  		     crcu_array[i] == call_rcu_bh))
>  			continue;
> +
> +		if ((crcu_array[i] == call_rcu_sched ||
> +		     crcu_array[i] == call_rcu_bh)
> +		    && rcu_gp_is_expedited())
> +			continue;
> +
>  		for (j = 0; j < i; j++)
>  			if (crcu_array[j] == crcu_array[i])
>  				break;
> -		if (j == i)
> -			wait_for_completion(&rs_array[i].completion);
> +		if (j != i)
> +			continue;
> +
> +		wait_for_completion(&rs_array[i].completion);
>  		destroy_rcu_head_on_stack(&rs_array[i].head);
>
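Paul's suggested alternative — flipping the rcu_expedited sysctl only around the hotplug operation — looks roughly like this from userspace.  This is an illustrative admin fragment, not something runnable outside a privileged host; the CPU number is made up, and the paths are the standard sysfs locations.

```shell
# Expedite RCU grace periods only for the duration of the hotplug operation.
echo 1 > /sys/kernel/rcu_expedited

# Take the (illustrative) CPU 3 offline and bring it back.
echo 0 > /sys/devices/system/cpu/cpu3/online
echo 1 > /sys/devices/system/cpu/cpu3/online

# Restore normal (non-IPI) grace periods for everyone else.
echo 0 > /sys/kernel/rcu_expedited
```

Restoring the knob afterwards matters: leaving rcu_expedited set system-wide imposes the IPI cost described in point 2 on all RCU users, not just the hotplug path.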
[PATCH] rcu: Benefit from expedited grace period in __wait_rcu_gp
When expedited grace-period is set, both synchronize_sched and
synchronize_rcu_bh can be optimized to have a significantly lower latency.

Improve wait_rcu_gp handling to also account for expedited grace-period.
The downside is that wait_rcu_gp will not wait anymore for all RCU variants
concurrently when an expedited grace-period is set; however, given the
improved latency it does not really matter.

Cc: Paul E. McKenney
Cc: Josh Triplett
Cc: Steven Rostedt
Cc: Mathieu Desnoyers
Cc: Lai Jiangshan
Cc: linux-kernel@vger.kernel.org
Signed-off-by: KarimAllah Ahmed
---
 kernel/rcu/update.c | 34 ++++++++++++++++++++++++++++------
 1 file changed, 28 insertions(+), 6 deletions(-)

diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 68fa19a..44b8817 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -392,13 +392,27 @@ void __wait_rcu_gp(bool checktiny, int n, call_rcu_func_t *crcu_array,
 			might_sleep();
 			continue;
 		}
-		init_rcu_head_on_stack(&rs_array[i].head);
-		init_completion(&rs_array[i].completion);
+
 		for (j = 0; j < i; j++)
 			if (crcu_array[j] == crcu_array[i])
 				break;
-		if (j == i)
-			(crcu_array[i])(&rs_array[i].head, wakeme_after_rcu);
+		if (j != i)
+			continue;
+
+		if ((crcu_array[i] == call_rcu_sched ||
+		     crcu_array[i] == call_rcu_bh)
+		    && rcu_gp_is_expedited()) {
+			if (crcu_array[i] == call_rcu_sched)
+				synchronize_sched_expedited();
+			else
+				synchronize_rcu_bh_expedited();
+
+			continue;
+		}
+
+		init_rcu_head_on_stack(&rs_array[i].head);
+		init_completion(&rs_array[i].completion);
+		(crcu_array[i])(&rs_array[i].head, wakeme_after_rcu);
 	}

 	/* Wait for all callbacks to be invoked. */
@@ -407,11 +421,19 @@ void __wait_rcu_gp(bool checktiny, int n, call_rcu_func_t *crcu_array,
 		    (crcu_array[i] == call_rcu ||
 		     crcu_array[i] == call_rcu_bh))
 			continue;
+
+		if ((crcu_array[i] == call_rcu_sched ||
+		     crcu_array[i] == call_rcu_bh)
+		    && rcu_gp_is_expedited())
+			continue;
+
 		for (j = 0; j < i; j++)
 			if (crcu_array[j] == crcu_array[i])
 				break;
-		if (j == i)
-			wait_for_completion(&rs_array[i].completion);
+		if (j != i)
+			continue;
+
+		wait_for_completion(&rs_array[i].completion);
 		destroy_rcu_head_on_stack(&rs_array[i].head);
 	}
 }
-- 
2.7.4
[PATCH v3 tip/core/rcu 24/40] srcu: Move combining-tree definitions for SRCU's benefit
This commit moves the C preprocessor code that defines the default shape
of the rcu_node combining tree to a new include/linux/rcu_node_tree.h
file as a first step towards enabling SRCU to create its own combining
tree, which in turn enables SRCU to implement per-CPU callback handling,
thus avoiding contention on the lock currently guarding the single list
of callbacks.  Note that users of SRCU still need to know the size of
the srcu_struct structure, hence include/linux rather than kernel/rcu.

This commit is code-movement only.

Signed-off-by: Paul E. McKenney
---
 include/linux/rcu_node_tree.h | 102 ++++++++++++++++++++++++++++++++++++++++++
 kernel/rcu/tree.h             |  71 +-----------------------------------------
 2 files changed, 103 insertions(+), 70 deletions(-)
 create mode 100644 include/linux/rcu_node_tree.h

diff --git a/include/linux/rcu_node_tree.h b/include/linux/rcu_node_tree.h
new file mode 100644
index 000000000000..b7eb97096b1c
--- /dev/null
+++ b/include/linux/rcu_node_tree.h
@@ -0,0 +1,102 @@
+/*
+ * RCU node combining tree definitions.  These are used to compute
+ * global attributes while avoiding common-case global contention.  A key
+ * property that these computations rely on is a tournament-style approach
+ * where only one of the tasks contending a lower level in the tree need
+ * advance to the next higher level.  If properly configured, this allows
+ * unlimited scalability while maintaining a constant level of contention
+ * on the root node.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, you can access it online at
+ * http://www.gnu.org/licenses/gpl-2.0.html.
+ *
+ * Copyright IBM Corporation, 2017
+ *
+ * Author: Paul E. McKenney
+ */
+
+#ifndef __LINUX_RCU_NODE_TREE_H
+#define __LINUX_RCU_NODE_TREE_H
+
+/*
+ * Define shape of hierarchy based on NR_CPUS, CONFIG_RCU_FANOUT, and
+ * CONFIG_RCU_FANOUT_LEAF.
+ * In theory, it should be possible to add more levels straightforwardly.
+ * In practice, this did work well going from three levels to four.
+ * Of course, your mileage may vary.
+ */
+
+#ifdef CONFIG_RCU_FANOUT
+#define RCU_FANOUT CONFIG_RCU_FANOUT
+#else /* #ifdef CONFIG_RCU_FANOUT */
+# ifdef CONFIG_64BIT
+# define RCU_FANOUT 64
+# else
+# define RCU_FANOUT 32
+# endif
+#endif /* #else #ifdef CONFIG_RCU_FANOUT */
+
+#ifdef CONFIG_RCU_FANOUT_LEAF
+#define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF
+#else /* #ifdef CONFIG_RCU_FANOUT_LEAF */
+#define RCU_FANOUT_LEAF 16
+#endif /* #else #ifdef CONFIG_RCU_FANOUT_LEAF */
+
+#define RCU_FANOUT_1	(RCU_FANOUT_LEAF)
+#define RCU_FANOUT_2	(RCU_FANOUT_1 * RCU_FANOUT)
+#define RCU_FANOUT_3	(RCU_FANOUT_2 * RCU_FANOUT)
+#define RCU_FANOUT_4	(RCU_FANOUT_3 * RCU_FANOUT)
+
+#if NR_CPUS <= RCU_FANOUT_1
+# define RCU_NUM_LVLS	1
+# define NUM_RCU_LVL_0	1
+# define NUM_RCU_NODES	NUM_RCU_LVL_0
+# define NUM_RCU_LVL_INIT	{ NUM_RCU_LVL_0 }
+# define RCU_NODE_NAME_INIT	{ "rcu_node_0" }
+# define RCU_FQS_NAME_INIT	{ "rcu_node_fqs_0" }
+#elif NR_CPUS <= RCU_FANOUT_2
+# define RCU_NUM_LVLS	2
+# define NUM_RCU_LVL_0	1
+# define NUM_RCU_LVL_1	DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
+# define NUM_RCU_NODES	(NUM_RCU_LVL_0 + NUM_RCU_LVL_1)
+# define NUM_RCU_LVL_INIT	{ NUM_RCU_LVL_0, NUM_RCU_LVL_1 }
+# define RCU_NODE_NAME_INIT	{ "rcu_node_0", "rcu_node_1" }
+# define RCU_FQS_NAME_INIT	{ "rcu_node_fqs_0", "rcu_node_fqs_1" }
+#elif NR_CPUS <= RCU_FANOUT_3
+# define RCU_NUM_LVLS	3
+# define NUM_RCU_LVL_0	1
+# define NUM_RCU_LVL_1	DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
+# define NUM_RCU_LVL_2	DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
+# define NUM_RCU_NODES	(NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2)
+# define NUM_RCU_LVL_INIT	{ NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 }
+# define RCU_NODE_NAME_INIT	{ "rcu_node_0", "rcu_node_1", "rcu_node_2" }
+# define RCU_FQS_NAME_INIT	{ "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" }
+#elif NR_CPUS <= RCU_FANOUT_4
+# define RCU_NUM_LVLS	4
+# define NUM_RCU_LVL_0	1
+# define NUM_RCU_LVL_1	DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3)
+# define NUM_RCU_LVL_2	DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
+# define NUM_RCU_LVL_3	DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
+# define NUM_RCU_NODES	(NUM_RCU_LVL_0
[PATCH v2 tip/core/rcu 24/39] srcu: Move combining-tree definitions for SRCU's benefit
This commit moves the C preprocessor code that defines the default shape
of the rcu_node combining tree to a new include/linux/rcu_node_tree.h
file as a first step towards enabling SRCU to create its own combining
tree, which in turn enables SRCU to implement per-CPU callback handling,
thus avoiding contention on the lock currently guarding the single list
of callbacks.  Note that users of SRCU still need to know the size of
the srcu_struct structure, hence include/linux rather than kernel/rcu.

This commit is code-movement only.

Signed-off-by: Paul E. McKenney
---
 include/linux/rcu_node_tree.h | 102 ++
 kernel/rcu/tree.h             |  71 +
 2 files changed, 103 insertions(+), 70 deletions(-)
 create mode 100644 include/linux/rcu_node_tree.h

diff --git a/include/linux/rcu_node_tree.h b/include/linux/rcu_node_tree.h
new file mode 100644
index ..b7eb97096b1c
--- /dev/null
+++ b/include/linux/rcu_node_tree.h
@@ -0,0 +1,102 @@
+/*
+ * RCU node combining tree definitions.  These are used to compute
+ * global attributes while avoiding common-case global contention.  A key
+ * property that these computations rely on is a tournament-style approach
+ * where only one of the tasks contending a lower level in the tree need
+ * advance to the next higher level.  If properly configured, this allows
+ * unlimited scalability while maintaining a constant level of contention
+ * on the root node.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, you can access it online at
+ * http://www.gnu.org/licenses/gpl-2.0.html.
+ *
+ * Copyright IBM Corporation, 2017
+ *
+ * Author: Paul E. McKenney
+ */
+
+#ifndef __LINUX_RCU_NODE_TREE_H
+#define __LINUX_RCU_NODE_TREE_H
+
+/*
+ * Define shape of hierarchy based on NR_CPUS, CONFIG_RCU_FANOUT, and
+ * CONFIG_RCU_FANOUT_LEAF.
+ * In theory, it should be possible to add more levels straightforwardly.
+ * In practice, this did work well going from three levels to four.
+ * Of course, your mileage may vary.
+ */
+
+#ifdef CONFIG_RCU_FANOUT
+#define RCU_FANOUT CONFIG_RCU_FANOUT
+#else /* #ifdef CONFIG_RCU_FANOUT */
+# ifdef CONFIG_64BIT
+# define RCU_FANOUT 64
+# else
+# define RCU_FANOUT 32
+# endif
+#endif /* #else #ifdef CONFIG_RCU_FANOUT */
+
+#ifdef CONFIG_RCU_FANOUT_LEAF
+#define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF
+#else /* #ifdef CONFIG_RCU_FANOUT_LEAF */
+#define RCU_FANOUT_LEAF 16
+#endif /* #else #ifdef CONFIG_RCU_FANOUT_LEAF */
+
+#define RCU_FANOUT_1 (RCU_FANOUT_LEAF)
+#define RCU_FANOUT_2 (RCU_FANOUT_1 * RCU_FANOUT)
+#define RCU_FANOUT_3 (RCU_FANOUT_2 * RCU_FANOUT)
+#define RCU_FANOUT_4 (RCU_FANOUT_3 * RCU_FANOUT)
+
+#if NR_CPUS <= RCU_FANOUT_1
+# define RCU_NUM_LVLS 1
+# define NUM_RCU_LVL_0 1
+# define NUM_RCU_NODES NUM_RCU_LVL_0
+# define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0 }
+# define RCU_NODE_NAME_INIT { "rcu_node_0" }
+# define RCU_FQS_NAME_INIT { "rcu_node_fqs_0" }
+#elif NR_CPUS <= RCU_FANOUT_2
+# define RCU_NUM_LVLS 2
+# define NUM_RCU_LVL_0 1
+# define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
+# define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1)
+# define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1 }
+# define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1" }
+# define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1" }
+#elif NR_CPUS <= RCU_FANOUT_3
+# define RCU_NUM_LVLS 3
+# define NUM_RCU_LVL_0 1
+# define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
+# define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
+# define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2)
+# define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 }
+# define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2" }
+# define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" }
+#elif NR_CPUS <= RCU_FANOUT_4
+# define RCU_NUM_LVLS 4
+# define NUM_RCU_LVL_0 1
+# define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3)
+# define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
+# define NUM_RCU_LVL_3 DIV_ROUND_UP(NR_CPUS,
[PATCH v2 tip/core/rcu 24/39] srcu: Move combining-tree definitions for SRCU's benefit
[PATCH tip/core/rcu 27/40] srcu: Move combining-tree definitions for SRCU's benefit
BENEFIT
You are a recipient to Mrs Julie Leach Donation of $3 million USD. Contact(julieleac...@gmail.com) for claims.
Grant Benefit
You are a recipient to Mrs Julie Leach Donation of $2 million USD. Contact (julieleach...@gmail.com) for claims
Benefit!
Mutual benefit for the both of us from Camp Stanley, get back to me for more info.
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Jun 30, 2016 2:25 PM, "Dave Hansen" wrote:
>
> On 06/30/2016 10:36 AM, Andy Lutomirski wrote:
> >>> We make baseline_pkru a process-wide baseline and store it in
> >>> mm->context.  That way, no matter which thread gets interrupted for a
> >>> signal, they see consistent values.  We only write to it when an app
> >>> _specifically_ asks for it to be updated with a special flag to
> >>> sys_pkey_set().
> >>>
> >>> When an app uses the execute-only support, we implicitly set the
> >>> read-disable bit in baseline_pkru for the execute-only pkey.
> ...
> > Looking at your git tree, which I assume is a reasonably approximation
> > of your current patches, this seems to be unimplemented.  I, at least,
> > would be nervous about using PKRU for protection of critical data if
> > signal handlers are unconditionally exempt.
>
> I actually went along and implemented this using an extra 'flag' for
> pkey_get/set().  I just left it out of this stage since I'm having
> enough problems getting it in with the existing set of features. :)
>
> I'm confident we can add this later with the flags we can pass to
> pkey_get() and pkey_set().
>
> > Also, the lazily allocated no-read key for execute-only is done in the
> > name of performance, but it results in odd semantics.  How much of a
> > performance win is preserving the init optimization of PKRU in
> > practice?  (I.e. how much faster are XSAVE and XRSTOR?)  I can't test
> > because even my Skylake laptop doesn't have PKRU.
>
> This is admittedly not the most realistic benchmark because everything
> is cache-warm, but I ran Ingo's FPU "measure.c" code on XSAVES/XRSTORS.
> This runs things in pretty tight loops where everything is cache hot.
>
> The XSAVE instructions are monsters and I'm not super-confident in my
> measurements, but I'm seeing in the neighborhood of XSAVES/XRSTORS
> getting 20-30 cycles when PKRU is in play vs. not.  This is with
> completely cache-hot data, though.

That's surprisingly bad, albeit negligible in the grand scheme of
context switches.

But maybe we could optimize differently.  When switching states, mask
out PKRU if it matches between the two states.  This could be messy,
but, if we switch to using WRPKRU directly some day, then the init
optimization becomes moot and this optimization becomes easy.

Hmm.  If we switch to WRPKRU directly and *always* mask out PKRU, maybe
a bunch of your code gets simpler because you won't have to poke around
in the saved state.

Looking forward, I think the xstate partitions into at least three kinds:

1. Pure user state (FPU, etc)

2. User-affecting XSAVES-only state (the CET mess, etc).

3. User accessible state that is needed in user mode *and* kernel mode.
   PKRU is like this.

Type 3 state needs to be switched eagerly.  Types 1 and 2 need to be
*saved* eagerly but not necessarily loaded eagerly.  I still want to
lazy-load it some day.

Do you happen to have WRPKRU cycle counts you can share?

--Andy
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On 06/30/2016 10:36 AM, Andy Lutomirski wrote:
>>> We make baseline_pkru a process-wide baseline and store it in
>>> mm->context.  That way, no matter which thread gets interrupted for a
>>> signal, they see consistent values.  We only write to it when an app
>>> _specifically_ asks for it to be updated with a special flag to
>>> sys_pkey_set().
>>>
>>> When an app uses the execute-only support, we implicitly set the
>>> read-disable bit in baseline_pkru for the execute-only pkey.
...
> Looking at your git tree, which I assume is a reasonably approximation
> of your current patches, this seems to be unimplemented.  I, at least,
> would be nervous about using PKRU for protection of critical data if
> signal handlers are unconditionally exempt.

I actually went along and implemented this using an extra 'flag' for
pkey_get/set().  I just left it out of this stage since I'm having
enough problems getting it in with the existing set of features. :)

I'm confident we can add this later with the flags we can pass to
pkey_get() and pkey_set().

> Also, the lazily allocated no-read key for execute-only is done in the
> name of performance, but it results in odd semantics.  How much of a
> performance win is preserving the init optimization of PKRU in
> practice?  (I.e. how much faster are XSAVE and XRSTOR?)  I can't test
> because even my Skylake laptop doesn't have PKRU.

This is admittedly not the most realistic benchmark because everything
is cache-warm, but I ran Ingo's FPU "measure.c" code on XSAVES/XRSTORS.
This runs things in pretty tight loops where everything is cache hot.

The XSAVE instructions are monsters and I'm not super-confident in my
measurements, but I'm seeing in the neighborhood of XSAVES/XRSTORS
getting 20-30 cycles when PKRU is in play vs. not.  This is with
completely cache-hot data, though.
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Mon, Dec 21, 2015 at 3:07 PM, Andy Lutomirski wrote:
> On Mon, Dec 21, 2015 at 3:04 PM, Dave Hansen
> wrote:
>> On 12/18/2015 02:28 PM, Andy Lutomirski wrote:
>> ... I could imagine that some kernel person would want to use even
>> more keys, but I think two fixed keys are kind of the minimal we'd
>> want to use.
>>>
>>> I imagine we'd reserve key 0 for normal page and key 1 for deny-read.
>>> Let me be a bit more concrete about what I'm suggesting:
>>>
>>> We'd have thread_struct.baseline_pkru.  It would start with key 0
>>> allowing all access and key 1 denying reads.
>>
>> Are you sure thread_struct is the right place for this?  I think of
>> signal handlers as a process-wide thing, and it seems a bit goofy if we
>> have the PKRU value in a signal handler depend on the PKRU of the thread
>> that got interrupted.
>
> I think you're right.  mmu_context_t might be a better choice.
>
>>
>>> We'd have a syscall like set_protection_key that could allocate unused
>>> keys and change the values of keys that have been allocated.  Those
>>> changes would be reflected in baseline_pkru.  Changes to keys 0 and 1
>>> in baseline_pkru would not be allowed.
>>
>> FWIW, I think we can do this without *actually* dedicating key 1 to
>> execute-only.  But that's a side issue.
>>
>>> Signal delivery would load baseline_pkru into the PKRU register.
>>> Signal restore would restore PKRU to its previous value.
>>
>> Do you really mean "its previous value" or are you OK with the existing
>> behavior which restores PKRU from the XSAVE buffer in the sigcontext?
>
> By "its previous value" I meant the value in the XSAVE buffer in the
> sigcontext.  So I think I'm okay with that :)
>
>>
>>> WRPKRU would, of course, override baseline_pkru, but it wouldn't
>>> change baseline_pkru.  The set_protection_key syscall would modify
>>> *both* real PKRU and baseline_pkru.
>>
>> How about this:
>>
>> We make baseline_pkru a process-wide baseline and store it in
>> mm->context.  That way, no matter which thread gets interrupted for a
>> signal, they see consistent values.  We only write to it when an app
>> _specifically_ asks for it to be updated with a special flag to
>> sys_pkey_set().
>>
>> When an app uses the execute-only support, we implicitly set the
>> read-disable bit in baseline_pkru for the execute-only pkey.
>
> Sounds good, I think.

Resurrecting an old thread, but:

Looking at your git tree, which I assume is a reasonable approximation
of your current patches, this seems to be unimplemented.  I, at least,
would be nervous about using PKRU for protection of critical data if
signal handlers are unconditionally exempt.

Also, the lazily allocated no-read key for execute-only is done in the
name of performance, but it results in odd semantics.  How much of a
performance win is preserving the init optimization of PKRU in
practice?  (I.e. how much faster are XSAVE and XRSTOR?)  I can't test
because even my Skylake laptop doesn't have PKRU.

--Andy
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On 12/18/2015 01:45 PM, Linus Torvalds wrote:
> On Fri, Dec 18, 2015 at 1:12 PM, Dave Hansen wrote:
>>
>> But, if we are picking out an execute-only pkey more dynamically, we've
>> got to keep the default value for the entire process somewhere.
>
> How dynamic do we want to make this, though?

Right now, all I plan to do is make it a one-way trip: if a process does
a prot=PROT_EXEC mapping, we dedicate a key local to that process, and
it gets 14 usable keys.  If it doesn't use prot=PROT_EXEC, then it gets
15 usable keys.

> I haven't looked at the details, and perhaps more importantly, I don't
> know what exactly are the requirements you've gotten from the people
> who are expected to actually use this.
>
> I think we might want to hardcode a couple of keys as "kernel
> reserved".  And I'd rather reserve them up-front than have some user
> program be unhappy later when we want to use them.

The one constant I've heard from the folks that are going to use this is
that 15 keys is not enough.  That's why I'm hesitant to remove _any_ more.

> But I do think we might want to have that "no read access" as a real
> fixed key too, because I think the kernel itself would want to use it:
>
> (a) to make sure that it gets the right fault when user space passes
> in an execute-only address to a system call.

Having a dedicated or static key for execute-only doesn't really change
this code.  We just have one extra step in the fault code to go look in
mm->context and see which pkey (if any) is assigned to be execute-only.

> (b) for much more efficient PAGEALLOC_DEBUG for kernel mappings.

The current hardware only applies the keys on _PAGE_USER mappings, so we
can't use it for kernel mappings.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
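[Editor's note: Dave's "one-way trip" amounts to a lazily assigned per-process key. A sketch of that state machine follows; the struct, function name, and the concrete key number 15 are all hypothetical, standing in for the real kernel's allocation path:]

```c
#include <assert.h>

#define PKEY_NONE (-1)

/* Hypothetical per-process state: which pkey, if any, has been
 * permanently dedicated to execute-only mappings. */
struct mm_ctx {
	int execute_only_pkey;	/* PKEY_NONE until the first PROT_EXEC-only mmap */
};

/* One-way trip: the first PROT_EXEC-only mapping dedicates a key for
 * the life of the process (costing one of the 15 usable keys); every
 * later execute-only mapping reuses it. */
static int get_execute_only_pkey(struct mm_ctx *ctx)
{
	if (ctx->execute_only_pkey == PKEY_NONE)
		ctx->execute_only_pkey = 15;	/* would be allocated for real */
	return ctx->execute_only_pkey;
}
```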
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Mon, Dec 21, 2015 at 3:04 PM, Dave Hansen wrote:
> On 12/18/2015 02:28 PM, Andy Lutomirski wrote:
> ...
>>> I could imagine that some kernel person would want to use even more
>>> keys, but I think two fixed keys are kind of the minimal we'd want to
>>> use.
>>
>> I imagine we'd reserve key 0 for normal page and key 1 for deny-read.
>> Let me be a bit more concrete about what I'm suggesting:
>>
>> We'd have thread_struct.baseline_pkru.  It would start with key 0
>> allowing all access and key 1 denying reads.
>
> Are you sure thread_struct is the right place for this?  I think of
> signal handlers as a process-wide thing, and it seems a bit goofy if we
> have the PKRU value in a signal handler depend on the PKRU of the thread
> that got interrupted.

I think you're right.  mmu_context_t might be a better choice.

>> We'd have a syscall like set_protection_key that could allocate unused
>> keys and change the values of keys that have been allocated.  Those
>> changes would be reflected in baseline_pkru.  Changes to keys 0 and 1
>> in baseline_pkru would not be allowed.
>
> FWIW, I think we can do this without *actually* dedicating key 1 to
> execute-only.  But that's a side issue.
>
>> Signal delivery would load baseline_pkru into the PKRU register.
>> Signal restore would restore PKRU to its previous value.
>
> Do you really mean "its previous value" or are you OK with the existing
> behavior which restores PKRU from the XSAVE buffer in the sigcontext?

By "its previous value" I meant the value in the XSAVE buffer in the
sigcontext.  So I think I'm okay with that :)

>> WRPKRU would, of course, override baseline_pkru, but it wouldn't
>> change baseline_pkru.  The set_protection_key syscall would modify
>> *both* real PKRU and baseline_pkru.
>
> How about this:
>
> We make baseline_pkru a process-wide baseline and store it in
> mm->context.  That way, no matter which thread gets interrupted for a
> signal, they see consistent values.  We only write to it when an app
> _specifically_ asks for it to be updated with a special flag to
> sys_pkey_set().
>
> When an app uses the execute-only support, we implicitly set the
> read-disable bit in baseline_pkru for the execute-only pkey.

Sounds good, I think.

--Andy
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On 12/21/2015 03:02 PM, Andy Lutomirski wrote:
> On Mon, Dec 21, 2015 at 3:00 PM, Dave Hansen wrote:
>> On 12/21/2015 02:52 PM, Andy Lutomirski wrote:
>>> Perhaps this is silly, but what if the default were changed to deny
>>> reads and writes for unallocated keys?  Is there a use case that
>>> breaks?
>>
>> It's probably a reasonable debugging feature.
>>
>> But, anything that takes an XSAVE feature out of its "init state" has
>> the potential to do a bit of harm because it increases the potential
>> size of writes during XSAVE.  XSAVEOPT will _help_ here, but we probably
>> don't want to go out of our way to take things out of the init state
>> when we're unsure of the benefits.
>
> Aren't you already doing that with your magic execute-only thing?

Yep, but that's with a concrete benefit in mind.

> Also, if we ever do the deferred-xstate-restore thing that Rik was
> playing with awhile back, then we'll want to switch to using rdpkru
> and wrpkru in-kernel directly, and we'll explicitly mask PKRU out of
> the XRSTOR and XSAVEOPT state, and this particular issue will become
> irrelevant.

Yep, agreed.
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On 12/18/2015 02:28 PM, Andy Lutomirski wrote:
...
>> I could imagine that some kernel person would want to use even more
>> keys, but I think two fixed keys are kind of the minimal we'd want to
>> use.
>
> I imagine we'd reserve key 0 for normal page and key 1 for deny-read.
> Let me be a bit more concrete about what I'm suggesting:
>
> We'd have thread_struct.baseline_pkru.  It would start with key 0
> allowing all access and key 1 denying reads.

Are you sure thread_struct is the right place for this?  I think of
signal handlers as a process-wide thing, and it seems a bit goofy if we
have the PKRU value in a signal handler depend on the PKRU of the thread
that got interrupted.

> We'd have a syscall like set_protection_key that could allocate unused
> keys and change the values of keys that have been allocated.  Those
> changes would be reflected in baseline_pkru.  Changes to keys 0 and 1
> in baseline_pkru would not be allowed.

FWIW, I think we can do this without *actually* dedicating key 1 to
execute-only.  But that's a side issue.

> Signal delivery would load baseline_pkru into the PKRU register.
> Signal restore would restore PKRU to its previous value.

Do you really mean "its previous value" or are you OK with the existing
behavior which restores PKRU from the XSAVE buffer in the sigcontext?

> WRPKRU would, of course, override baseline_pkru, but it wouldn't
> change baseline_pkru.  The set_protection_key syscall would modify
> *both* real PKRU and baseline_pkru.

How about this:

We make baseline_pkru a process-wide baseline and store it in
mm->context.  That way, no matter which thread gets interrupted for a
signal, they see consistent values.  We only write to it when an app
_specifically_ asks for it to be updated with a special flag to
sys_pkey_set().

When an app uses the execute-only support, we implicitly set the
read-disable bit in baseline_pkru for the execute-only pkey.
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Mon, Dec 21, 2015 at 3:00 PM, Dave Hansen wrote:
> On 12/21/2015 02:52 PM, Andy Lutomirski wrote:
>> Perhaps this is silly, but what if the default were changed to deny
>> reads and writes for unallocated keys?  Is there a use case that
>> breaks?
>
> It's probably a reasonable debugging feature.
>
> But, anything that takes an XSAVE feature out of its "init state" has
> the potential to do a bit of harm because it increases the potential
> size of writes during XSAVE.  XSAVEOPT will _help_ here, but we probably
> don't want to go out of our way to take things out of the init state
> when we're unsure of the benefits.

Aren't you already doing that with your magic execute-only thing?

Also, if we ever do the deferred-xstate-restore thing that Rik was
playing with awhile back, then we'll want to switch to using rdpkru
and wrpkru in-kernel directly, and we'll explicitly mask PKRU out of
the XRSTOR and XSAVEOPT state, and this particular issue will become
irrelevant.

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On 12/21/2015 02:52 PM, Andy Lutomirski wrote:
> Perhaps this is silly, but what if the default were changed to deny
> reads and writes for unallocated keys?  Is there a use case that
> breaks?

It's probably a reasonable debugging feature.

But, anything that takes an XSAVE feature out of its "init state" has
the potential to do a bit of harm because it increases the potential
size of writes during XSAVE.  XSAVEOPT will _help_ here, but we probably
don't want to go out of our way to take things out of the init state
when we're unsure of the benefits.
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Dec 22, 2015 2:04 AM, "Dave Hansen" wrote:
>
> On 12/18/2015 03:16 PM, Andy Lutomirski wrote:
> > Hrm.  We might also want an option to change pkru and/or baseline_pkru
> > in all threads in the current mm.  That's optional but it could be
> > handy.  Maybe it would be as simple as having the allocate-a-pkey call
> > have an option to set an initial baseline value and an option to
> > propagate that initial value to pre-existing threads.
>
> Do you mean actively going in and changing PKRU in other threads?  I
> fear that will be dangerous.
>
> IMNHO, whatever we do, I think we need to ensure that _raw_ PKRU calls
> are allowed (somehow).  Raw in this case would mean a thread calling
> WRPKRU without a system call and without checking in with what any other
> threads are doing.
>
> Let's say baseline_pkru=0x004 (we're access-disabling PKEY[1] and using
> it for execute-only).  Now, a thread is trying to do this:
>
>	pkey2 = sys_pkey_alloc(); // now pkey2=2
>	tmp = rdpkru();           // 0x004
>	tmp |= 0x10;              // set PKRU[2].AD=1
>	wrpkru(tmp);
>
> While another thread does:
>
>	pkey4 = pkey_alloc(); // pkey4=4
>	sys_pkey_set(pkey4, ACCESS_DISABLE, SET_BASELINE_ALL_THREADS);
>
> Without some kind of locking, that's going to race.  We could do all the
> locking in the kernel, but that requires that the kernel do all the PKRU
> writing, which I'd really like to avoid.
>
> I think the closest we can get reasonably is to have the kernel track
> the baseline_pkru and then allow userspace to query it in case userspace
> decides that thread needs to update its thread-local PKRU from the
> baseline.

Yeah, fair point.  Let's skip the modify-other-threads thing.

Perhaps this is silly, but what if the default were changed to deny
reads and writes for unallocated keys?  Is there a use case that
breaks?

--Andy
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On 12/18/2015 03:16 PM, Andy Lutomirski wrote:
> Hrm.  We might also want an option to change pkru and/or baseline_pkru
> in all threads in the current mm.  That's optional but it could be
> handy.  Maybe it would be as simple as having the allocate-a-pkey call
> have an option to set an initial baseline value and an option to
> propagate that initial value to pre-existing threads.

Do you mean actively going in and changing PKRU in other threads?  I
fear that will be dangerous.

IMNHO, whatever we do, I think we need to ensure that _raw_ PKRU calls
are allowed (somehow).  Raw in this case would mean a thread calling
WRPKRU without a system call and without checking in with what any other
threads are doing.

Let's say baseline_pkru=0x004 (we're access-disabling PKEY[1] and using
it for execute-only).  Now, a thread is trying to do this:

	pkey2 = sys_pkey_alloc(); // now pkey2=2
	tmp = rdpkru();           // 0x004
	tmp |= 0x10;              // set PKRU[2].AD=1
	wrpkru(tmp);

While another thread does:

	pkey4 = pkey_alloc(); // pkey4=4
	sys_pkey_set(pkey4, ACCESS_DISABLE, SET_BASELINE_ALL_THREADS);

Without some kind of locking, that's going to race.  We could do all the
locking in the kernel, but that requires that the kernel do all the PKRU
writing, which I'd really like to avoid.

I think the closest we can get reasonably is to have the kernel track
the baseline_pkru and then allow userspace to query it in case userspace
decides that thread needs to update its thread-local PKRU from the
baseline.
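[Editor's note: the race Dave describes is a classic lost update on a read-modify-write of the PKRU register. Replaying the two threads' steps in one deterministic interleaving makes it concrete; PKRU is simulated with a plain variable here since WRPKRU needs pkeys hardware:]

```c
#include <assert.h>
#include <stdint.h>

#define PKRU_AD_BIT(pkey) (1u << ((pkey) * 2))

/* Replays the interleaving from the mail above with a simulated PKRU. */
static uint32_t race_interleaving(void)
{
	uint32_t pkru = PKRU_AD_BIT(1);	/* baseline_pkru = 0x004 */

	/* Thread A begins its read-modify-write to disable key 2: */
	uint32_t tmp = pkru;		/* rdpkru() -> 0x004 */

	/* Before thread A writes back, the kernel (acting on the other
	 * thread's SET_BASELINE_ALL_THREADS request) disables key 4 in
	 * this thread's PKRU as well: */
	pkru |= PKRU_AD_BIT(4);		/* pkru -> 0x104 */

	/* Thread A writes back its stale value plus its own bit, and
	 * key 4's access-disable bit is silently clobbered: */
	pkru = tmp | PKRU_AD_BIT(2);	/* wrpkru(0x014) */

	return pkru;
}
```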
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 3:16 PM, Andy Lutomirski wrote:
>
> Yes, I think.  If I'm using protection keys to protect some critical
> data structure (important stuff in shared memory, important memory
> mapped files, pmem, etc), then I'll allocate a protection key and set
> PKRU to deny writes.  The problem is that I really, really want writes
> denied except when explicitly enabled in narrow regions of code that
> use wrpkru to enable them, and I don't want an asynchronous signal
> delivered in those narrow regions of code or newly cloned threads to
> pick up the write-allow value.  So I want baseline_pkru to have the
> deny writes entry.

Hmm.  Ok, that does sound like a valid and interesting usage case.
Fair enough.

             Linus
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 3:08 PM, Linus Torvalds wrote:
> On Fri, Dec 18, 2015 at 2:28 PM, Andy Lutomirski wrote:
>>
>> Apps that don't want to use the baseline_pkru mechanism could use
>> syscalls to claim ownership of protection keys but then manage them
>> purely with WRPKRU directly.  We could optionally disallow
>> mprotect_key on keys that weren't allocated in advance.
>>
>> Does that seem sane?
>
> So everything seems sane except for the need for that baseline_pkru.
>
> I'm not seeing why it couldn't just be a fixed value.  Is there any
> real downside to it?

Yes, I think.  If I'm using protection keys to protect some critical
data structure (important stuff in shared memory, important memory
mapped files, pmem, etc), then I'll allocate a protection key and set
PKRU to deny writes.  The problem is that I really, really want writes
denied except when explicitly enabled in narrow regions of code that
use wrpkru to enable them, and I don't want an asynchronous signal
delivered in those narrow regions of code or newly cloned threads to
pick up the write-allow value.  So I want baseline_pkru to have the
deny writes entry.

I think I would do exactly this in my production code here if my server
supported it.  Some day...

Hrm.  We might also want an option to change pkru and/or baseline_pkru
in all threads in the current mm.  That's optional but it could be
handy.  Maybe it would be as simple as having the allocate-a-pkey call
have an option to set an initial baseline value and an option to
propagate that initial value to pre-existing threads.

--Andy
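[Editor's note: Andy's "narrow write window" pattern looks roughly like the sketch below. The PKRU register is simulated with a variable, since WRPKRU requires pkeys hardware, and the helper names (`open_write_window`, `update_protected`, etc.) are hypothetical illustrations, not an existing API:]

```c
#include <assert.h>
#include <stdint.h>

#define PKRU_WD_BIT(pkey) (1u << ((pkey) * 2 + 1))

/* Simulated per-thread PKRU register; real code would issue
 * RDPKRU/WRPKRU here. */
static uint32_t fake_pkru;

static void open_write_window(int pkey)  { fake_pkru &= ~PKRU_WD_BIT(pkey); }
static void close_write_window(int pkey) { fake_pkru |=  PKRU_WD_BIT(pkey); }

/* The pattern from the mail: data tagged with `pkey` is writable only
 * inside this narrow region; the worry in the thread is a signal or a
 * clone() landing between open and close. */
static int update_protected(int pkey, int *protected_val, int new_val)
{
	open_write_window(pkey);
	*protected_val = new_val;	/* would fault outside the window */
	close_write_window(pkey);
	/* report whether writes are denied again on exit */
	return (fake_pkru & PKRU_WD_BIT(pkey)) != 0;
}
```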
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 2:28 PM, Andy Lutomirski wrote: > > Apps that don't want to use the baseline_pkru mechanism could use > syscalls to claim ownership of protection keys but then manage them > purely with WRPKRU directly. We could optionally disallow > mprotect_key on keys that weren't allocated in advance. > > Does that seem sane? So everything seems sane except for the need for that baseline_pkru. I'm not seeing why it couldn't just be a fixed value. Is there any real downside to it? Linus
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 1:45 PM, Linus Torvalds wrote: > On Fri, Dec 18, 2015 at 1:12 PM, Dave Hansen > wrote: >> >> But, if we are picking out an execute-only pkey more dynamically, we've >> got to keep the default value for the entire process somewhere. > > How dynamic do we want to make this, though? > > I haven't looked at the details, and perhaps more importantly, I don't > know what exactly are the requirements you've gotten from the people > who are expected to actually use this. > > I think we might want to hardcode a couple of keys as "kernel > reserved". And I'd rather reserve them up-front than have some user > program be unhappy later when we want to use them. > > I guess we want to leave key #0 for "normal page", so my suggesting to > use that for the execute-only was probably misguided. > > But I do think we might want to have that "no read access" as a real > fixed key too, because I think the kernel itself would want to use it: > > (a) to make sure that it gets the right fault when user space passes > in a execute-only address to a system call. > > (b) for much more efficient PAGEALLOC_DEBUG for kernel mappings. > > so I do think that we'd want to reserve two of the 16 keys up front. > > Would it be ok for the expected users to have those keys simply be > fixed? With key 0 being used for all default pages, and key 1 being > used for all execute-only pages? And then defaulting PKRU to 4, > disallowing access to that key #1? > > I could imagine that some kernel person would want to use even more > keys, but I think two fixed keys are kind of the minimal we'd want to > use. I imagine we'd reserve key 0 for normal page and key 1 for deny-read. Let me be a bit more concrete about what I'm suggesting: We'd have thread_struct.baseline_pkru. It would start with key 0 allowing all access and key 1 denying reads. We'd have a syscall like set_protection_key that could allocate unused keys and change the values of keys that have been allocated. 
Those changes would be reflected in baseline_pkru. Changes to keys 0 and 1 in baseline_pkru would not be allowed. Signal delivery would load baseline_pkru into the PKRU register. Signal restore would restore PKRU to its previous value. WRPKRU would, of course, override baseline_pkru, but it wouldn't change baseline_pkru. The set_protection_key syscall would modify *both* real PKRU and baseline_pkru. Apps that don't want to use the baseline_pkru mechanism could use syscalls to claim ownership of protection keys but then manage them purely with WRPKRU directly. We could optionally disallow mprotect_key on keys that weren't allocated in advance. Does that seem sane? --Andy
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 1:12 PM, Dave Hansen wrote: > > But, if we are picking out an execute-only pkey more dynamically, we've > got to keep the default value for the entire process somewhere. How dynamic do we want to make this, though? I haven't looked at the details, and perhaps more importantly, I don't know what exactly are the requirements you've gotten from the people who are expected to actually use this. I think we might want to hardcode a couple of keys as "kernel reserved". And I'd rather reserve them up-front than have some user program be unhappy later when we want to use them. I guess we want to leave key #0 for "normal page", so my suggestion to use that for the execute-only was probably misguided. But I do think we might want to have that "no read access" as a real fixed key too, because I think the kernel itself would want to use it: (a) to make sure that it gets the right fault when user space passes in an execute-only address to a system call. (b) for much more efficient PAGEALLOC_DEBUG for kernel mappings. so I do think that we'd want to reserve two of the 16 keys up front. Would it be ok for the expected users to have those keys simply be fixed? With key 0 being used for all default pages, and key 1 being used for all execute-only pages? And then defaulting PKRU to 4, disallowing access to that key #1? I could imagine that some kernel person would want to use even more keys, but I think two fixed keys are kind of the minimum we'd want to use. Linus
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On 12/18/2015 01:04 PM, Linus Torvalds wrote: > On Fri, Dec 18, 2015 at 12:49 PM, Andy Lutomirski wrote: >> >> IOW, I like my idea in which signal delivery always sets PKRU to the >> application-requested-by-syscall values and sigreturn restores it. > > So I don't mind that, as long as the whole "sigreturn restores it" is > part of things. > > Your original email with the suggestion to *not* restore PKRU I didn't > like. Setting it and restoring it is fine. > > I do wonder if you need an explicit value, though. I think it's > reasonable to say that PKRU value 0 is special. It's what we'd start > processes with, and why not just say that it's what we run signal > handlers in? > > Would any other value ever make sense, really? Having a PKRU with the execute-only permissions set is the only one I can think of. For a system with a _dedicated_ PKEY for execute-only, this is easy and could even be made a part of init_fpstate with no other code changes. But, if we are picking out an execute-only pkey more dynamically, we've got to keep the default value for the entire process somewhere.
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 1:04 PM, Linus Torvalds wrote: > > I do wonder if you need an explicit value, though. I think it's > reasonable to say that PKRU value 0 is special. It's what we'd start > processes with, and why not just say that it's what we run signal > handlers in? > > Would any other value ever make sense, really? Ahh. Your point about the PROT_EXEC handling means that maybe we don't want to default to zero. Maybe we want to make the default PKRU startup value be 1 instead, setting the access-disable bit for key 0? Linus
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On 12/18/2015 01:02 PM, Andy Lutomirski wrote: > On Fri, Dec 18, 2015 at 12:58 PM, H. Peter Anvin wrote: >> > On 12/18/2015 12:49 PM, Andy Lutomirski wrote: >>> >> >>> >> IOW, I like my idea in which signal delivery always sets PKRU to the >>> >> application-requested-by-syscall values and sigreturn restores it. >>> >> Kinda like sigaltstack, but applies to all signals and affects PKRU >>> >> instead of RSP. >>> >> >> > I think this is the only sensible option, with the default being all zero. >> > > Or not quite all zero if we do Dave's experimental PROT_EXEC thing. > > Actually, I want to introduce a set of per-mm "incompatible" bits. By > default, they'd be zero. We can, as needed, define bits that do > something nice but break old code. I want one of the bits to turn > vsyscalls off entirely. Another bit could say that the kernel is > allowed to steal a protection key for PROT_EXEC. That really only makes sense if we have userspace that expects all the protection keys to be available. If we go the route of having pkey_alloc/free() syscalls, then the kernel can easily tell userspace to keep its mitts off a particular pkey by not ever returning it from a pkey_alloc().
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 12:49 PM, Andy Lutomirski wrote: > > IOW, I like my idea in which signal delivery always sets PKRU to the > application-requested-by-syscall values and sigreturn restores it. So I don't mind that, as long as the whole "sigreturn restores it" is part of things. Your original email with the suggestion to *not* restore PKRU I didn't like. Setting it and restoring it is fine. I do wonder if you need an explicit value, though. I think it's reasonable to say that PKRU value 0 is special. It's what we'd start processes with, and why not just say that it's what we run signal handlers in? Would any other value ever make sense, really? Linus
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 12:58 PM, H. Peter Anvin wrote: > On 12/18/2015 12:49 PM, Andy Lutomirski wrote: >> >> IOW, I like my idea in which signal delivery always sets PKRU to the >> application-requested-by-syscall values and sigreturn restores it. >> Kinda like sigaltstack, but applies to all signals and affects PKRU >> instead of RSP. >> > > I think this is the only sensible option, with the default being all zero. > Or not quite all zero if we do Dave's experimental PROT_EXEC thing. Actually, I want to introduce a set of per-mm "incompatible" bits. By default, they'd be zero. We can, as needed, define bits that do something nice but break old code. I want one of the bits to turn vsyscalls off entirely. Another bit could say that the kernel is allowed to steal a protection key for PROT_EXEC. These bits would be read and written by prctl, but there could also be an ELF note mechanism to initialize them on execve without a syscall. --Andy
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On 12/18/2015 12:49 PM, Andy Lutomirski wrote: > > IOW, I like my idea in which signal delivery always sets PKRU to the > application-requested-by-syscall values and sigreturn restores it. > Kinda like sigaltstack, but applies to all signals and affects PKRU > instead of RSP. > I think this is the only sensible option, with the default being all zero. -hpa
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 12:37 PM, Linus Torvalds wrote: > On Fri, Dec 18, 2015 at 12:07 PM, Dave Hansen > wrote: >> >> But, I think it's a small one. Basically, RSP would have to pointing at >> a place which was allowed by protection keys for all of the sigframe >> setup. > > Note that the whole "stack is special" is not at all a new issue. > > It's the main reason why sigaltstack() and SS_ONSTACK exists. This is > in no way new to PKRU, people have had to handle the issue of > stack-related SIGSEGV faults for a long time. > > So any application that uses PKRU and may play games that affects the > stack, will always have to have a separate "safe stack" that it uses > for signal handling. But that is in no way PKRU-specific, it's been > the case for a lot of other memory management faults. > The trouble is that this isn't just limited to special "safe stack" SA_ONSTACK aware code. It even affects normally innocuous things. The thing that's new is that, before PKRU, the HW stack might have been unsafe in the sense that accessing it would fault and we need a different backup stack for some signals. With PKRU, we might be in a transient extra-privileged (permissive PKRU) state, and we want to tighten permissions for *all* signal deliveries so that, in case the current stack is erroneous, we don't corrupt it. Admittedly, ending up with permissive PKRU in some short sequence and corrupt RSP is likely to corrupt things no matter what the kernel does. But I don't think we want to deliver SIGALRM to a library while the main program is in a specially permissive PKRU region. IOW, I like my idea in which signal delivery always sets PKRU to the application-requested-by-syscall values and sigreturn restores it. Kinda like sigaltstack, but applies to all signals and affects PKRU instead of RSP. 
--Andy
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 12:07 PM, Dave Hansen wrote: > > But, I think it's a small one. Basically, RSP would have to be pointing at > a place which was allowed by protection keys for all of the sigframe > setup. Note that the whole "stack is special" is not at all a new issue. It's the main reason why sigaltstack() and SS_ONSTACK exist. This is in no way new to PKRU, people have had to handle the issue of stack-related SIGSEGV faults for a long time. So any application that uses PKRU and may play games that affect the stack will always have to have a separate "safe stack" that it uses for signal handling. But that is in no way PKRU-specific, it's been the case for a lot of other memory management faults. Linus
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 12:07 PM, Dave Hansen wrote: > On 12/18/2015 11:21 AM, Andy Lutomirski wrote: >> On Fri, Dec 18, 2015 at 10:42 AM, Dave Hansen >> wrote: >>> On 12/18/2015 08:04 AM, Andy Lutomirski wrote: 1b. If the app malfunctions such that RSP points to pmem, the kernel MUST NOT clobber the pmem space. I think that this basically mandates that PKRU needs to have some safe state (i.e. definitely not the init state) on signal delivery: the kernel is going to write a signal frame at the address identified by RSP, and that address is in pmem, so those writes need to fail. >>> >>> The kernel is writing the signal frame using normal old copy_to_user(). >>> Those are writing through mappings with _PAGE_USER set and should be >>> subject to the PKRU state of the thread before the signal started to be >>> delivered. >>> >>> We don't do the fpu__clear() until after this copy, so I think pkeys >>> enforcement is being done properly for this today. >> >> True, but I think only in a very limited sense. Your average signal >> handler is reasonably like to execute "push $rbp" as its very first >> instruction, at which point we're immediately screwed with the current >> arrangement. > > I completely agree that there's a window for corruption. > > But, I think it's a small one. Basically, RSP would have to pointing at > a place which was allowed by protection keys for all of the sigframe > setup. Then, _just_ happened to be at a place which was denied by > protection keys when it enters the signal handler back in userspace. > It's possible, but it's a small window. That's true. OTOH, I think that the signal in the middle of the wrpkru-allowed window is also a compelling case. So maybe my new preference is option B, where we save and restore PKRU like all the other xfeatures but, rather than clearing it in the signal context, we set it to the syscall-specified value. 
If we do this, then we'd have to clearly separate the syscall-specified value from the xstate value, and we'd probably want one of the syscalls to offer an option to force user PKRU to match the syscall value. Maybe your code already does this -- I didn't check. This has the nice property that sigreturn isn't affected at all except insofar as we should give a little bit of thought to the ordering of restoring PKRU relative to any other uaccess. I'd imagine that we'd just restore PKRU last, in case the stack we're restoring from has access disallowed by the saved PKRU value. --Andy
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On 12/18/2015 11:21 AM, Andy Lutomirski wrote: > On Fri, Dec 18, 2015 at 10:42 AM, Dave Hansen > wrote: >> On 12/18/2015 08:04 AM, Andy Lutomirski wrote: >>> 1b. If the app malfunctions such that RSP points to pmem, the kernel >>> MUST NOT clobber the pmem space. I think that this basically mandates >>> that PKRU needs to have some safe state (i.e. definitely not the init >>> state) on signal delivery: the kernel is going to write a signal frame >>> at the address identified by RSP, and that address is in pmem, so >>> those writes need to fail. >> >> The kernel is writing the signal frame using normal old copy_to_user(). >> Those are writing through mappings with _PAGE_USER set and should be >> subject to the PKRU state of the thread before the signal started to be >> delivered. >> >> We don't do the fpu__clear() until after this copy, so I think pkeys >> enforcement is being done properly for this today. > > True, but I think only in a very limited sense. Your average signal > handler is reasonably likely to execute "push %rbp" as its very first > instruction, at which point we're immediately screwed with the current > arrangement. I completely agree that there's a window for corruption. But, I think it's a small one. Basically, RSP would have to be pointing at a place which was allowed by protection keys for all of the sigframe setup, then _just_ happen to be at a place which was denied by protection keys when execution enters the signal handler back in userspace. It's possible, but it's a small window.
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 10:42 AM, Dave Hansen wrote: > On 12/18/2015 08:04 AM, Andy Lutomirski wrote: >> 1b. If the app malfunctions such that RSP points to pmem, the kernel >> MUST NOT clobber the pmem space. I think that this basically mandates >> that PKRU needs to have some safe state (i.e. definitely not the init >> state) on signal delivery: the kernel is going to write a signal frame >> at the address identified by RSP, and that address is in pmem, so >> those writes need to fail. > > The kernel is writing the signal frame using normal old copy_to_user(). > Those are writing through mappings with _PAGE_USER set and should be > subject to the PKRU state of the thread before the signal started to be > delivered. > > We don't do the fpu__clear() until after this copy, so I think pkeys > enforcement is being done properly for this today. True, but I think only in a very limited sense. Your average signal handler is reasonably likely to execute "push %rbp" as its very first instruction, at which point we're immediately screwed with the current arrangement. --Andy -- Andy Lutomirski AMA Capital Management, LLC
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On 12/18/2015 08:04 AM, Andy Lutomirski wrote: > 1b. If the app malfunctions such that RSP points to pmem, the kernel > MUST NOT clobber the pmem space. I think that this basically mandates > that PKRU needs to have some safe state (i.e. definitely not the init > state) on signal delivery: the kernel is going to write a signal frame > at the address identified by RSP, and that address is in pmem, so > those writes need to fail. The kernel is writing the signal frame using normal old copy_to_user(). Those are writing through mappings with _PAGE_USER set and should be subject to the PKRU state of the thread before the signal started to be delivered. We don't do the fpu__clear() until after this copy, so I think pkeys enforcement is being done properly for this today.
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On 12/18/2015 08:04 AM, Andy Lutomirski wrote: > Which reminds me: __get_user, etc all respect PKRU because the SDM > says that PKRU affects all user *and* kernel accesses to user memory > (i.e. anything with _PAGE_USER). Should get_user_pages also respect > PKRU? It does. There are a couple of parts to it, but the most important are in these patches: > http://git.kernel.org/cgit/linux/kernel/git/daveh/x86-pkeys.git/commit/?h=pkeys-v018&id=c64acbe036487d07968ea6a5b9090169d6e40160 > http://git.kernel.org/cgit/linux/kernel/git/daveh/x86-pkeys.git/commit/?h=pkeys-v018&id=331c62eb789f969e187964e28ac80b1e0e26b69d
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Thu, Dec 17, 2015 at 10:43 PM, H. Peter Anvin wrote: > On December 17, 2015 9:29:21 PM PST, Andy Lutomirski > wrote: >>On Dec 17, 2015 6:53 PM, "Dave Hansen" >>wrote: >>> >>> On 12/17/2015 06:32 PM, Andy Lutomirski wrote: >>> > On Thu, Dec 17, 2015 at 6:13 PM, Dave Hansen >> wrote: >>> >> But what about the register state when delivering a signal? Don't >>we >>> >> set the registers to the init state? Do we need to preserve PKRU >>state >>> >> instead of init'ing it? The init state _is_ nice here because it >>is >>> >> permissive allows us to do something useful no matter what PKRU >>gets set to. >>> > >>> > I think we leave the extended regs alone. Don't we? >>> > >>> > I think that, for most signals, we want to leave PKRU as is, >>> > especially for things that aren't SIGSEGV. For SIGSEGV, maybe we >>want >>> > an option to reset PKRU on delivery (and then set the flag to >>restore >>> > on return?). >>> >>> Is there some precedent for doing the state differently for different >>> signals? >> >>Yes, to a very limited extent: SA_ONSTACK. >> >>> >>> >> Well, the signal handler isn't necessarily going to clobber it, >>but the >>> >> delivery code already clobbers it when going to the init state. >>> > >>> > Can you point to that code? >>> >>> handle_signal() -> fpu__clear() >>> >>> The comment around it says: >>> >>> "Ensure the signal handler starts with the new fpu state." >>> >> >>You win this round :) >> >>So maybe we should have a mask of xfeatures that aren't cleared on >>signal delivery (e.g. PKRU, perhaps) and that are, by default, >>preserved across signal returns. >> [...] > > I find the notion of PKRU not being initialized at the entry to a signal > handler a bit odd. Saving/restoring it seems like the right thing to me. > > What am I missing? 
On Fri, Dec 18, 2015 at 12:32 AM, Ingo Molnar wrote: > > So the principle is this: signals are conceptually like creating a new thread > of > execution, and that's a very powerful programming concept, like fork() or > pthread_create() are powerful concepts. So we definitely want to keep that > default > behavior, and I don't think we want to deviate from that for typical new > extended > CPU context features, even if signal delivery slows down as a result. I think that PKRU is in almost no respect a typical new extended CPU context feature. It's a control register, not a data register. Let's try a concrete example. One big use case for PKRU will be in conjunction with DAX-style pmem. Imagine some program (database server, perhaps) that mmaps 1 TB of pmem MAP_SHARED, PROT_READ | PROT_WRITE with protection key 1. It sets PKRU so that key 1 can't be written. All of its accesses to the pmem store are in tight code regions that do wrpkru; mov; wrpkru (just like what the kernel does with stac/clac on SMAP systems). From the app's perspective, it's imperative that stray writes that coincidentally hit the pmem region fail. (1 TB is big enough that a random write to canonical user memory hits it almost 1% of the time.) This means: 1. If a signal is delivered, the app wants PKRU to be set to some safe value as early as possible. The app can do it by itself if it doesn't use libraries that interfere, but it would be IMO much, much nicer if the kernel made this happen automatically. 1b. If the app malfunctions such that RSP points to pmem, the kernel MUST NOT clobber the pmem space. I think that this basically mandates that PKRU needs to have some safe state (i.e. definitely not the init state) on signal delivery: the kernel is going to write a signal frame at the address identified by RSP, and that address is in pmem, so those writes need to fail. 2. clone with shared VM should preserve PKRU. So in this regard, I think signals are kind of like new threads after all.
Now that I think about it more, there are at least two possibilities that work for this use case. Option A: signal delivery preserves PKRU. If we go this route, I think we should decide whether PKRU is preserved on exit. Given that we're talking about adding syscalls to adjust PKRU, it seems a bit odd to me that sigreturn would, by default, undo whatever those syscalls do. Option B: signal delivery resets PKRU to user-specified values. We're talking about having syscalls to write PKRU. We could have clone and signal entry use the values set by syscall instead of the values in the actual PKRU register. That way:

  set_protection_key(1, no writes);
  ...
  wrpkru(allow writes to key 1)
  mov something          <<< fault or async signal here
  wrpkru(disallow writes to key 1)

leaves key 1 protected in the signal handler. If we go that route, I think we must restore PKRU just like any other xstate on sigreturn, so in some sense it's the simplest of all to implement. We'll need to change the signal delivery stuff to do the fpu__clear and the automatic PKRU write before trying to set up the signal frame so that the frame is written with the syscall-specified protections.
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 12:59:14AM -0800, Christoph Hellwig wrote: > Stupid question, but what the heck is PKRU? A grep of the kernel tree > shows no results, and a web search returns mostly Thai language results. That should explain it: https://lkml.kernel.org/r/20151214190542.39c48...@viggo.jf.intel.com -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply.
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Thu, Dec 17, 2015 at 05:48:56PM -0800, Andy Lutomirski wrote: > Hi all- > > I think that, for PKRU in particular, we want the default signal > handling behavior to be a bit unusual. Stupid question, but what the heck is PKRU? A grep of the kernel tree shows no results, and a web search returns mostly Thai language results.
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
* Andy Lutomirski wrote: > > >> But what about the register state when delivering a signal? Don't we > > >> set the registers to the init state? Do we need to preserve PKRU state > > >> instead of init'ing it? The init state _is_ nice here because it is > > >> permissive allows us to do something useful no matter what PKRU gets set > > >> to. > > > > > > I think we leave the extended regs alone. Don't we? > > > > > > I think that, for most signals, we want to leave PKRU as is, > > > especially for things that aren't SIGSEGV. For SIGSEGV, maybe we want > > > an option to reset PKRU on delivery (and then set the flag to restore > > > on return?). > > > > Is there some precedent for doing the state differently for different > > signals? > > Yes, to a very limited extent: SA_ONSTACK. > > > > > >> Well, the signal handler isn't necessarily going to clobber it, but the > > >> delivery code already clobbers it when going to the init state. > > > > > > Can you point to that code? > > > > handle_signal() -> fpu__clear() > > > > The comment around it says: > > > > "Ensure the signal handler starts with the new fpu state." > > > > You win this round :) > > So maybe we should have a mask of xfeatures that aren't cleared on > signal delivery (e.g. PKRU, perhaps) and that are, by default, > preserved across signal returns. So the principle is this: signals are conceptually like creating a new thread of execution, and that's a very powerful programming concept, like fork() or pthread_create() are powerful concepts. So we definitely want to keep that default behavior, and I don't think we want to deviate from that for typical new extended CPU context features, even if signal delivery slows down as a result. But we've been arguing about 'lightweight signals' for up to two decades that I can remember. (The first such suggestion was to not save the FPU state, back when FPU saves were ridiculously slow compared to other parts of saving/restoring a context.) 
So having a well-enumerated, extensible opt-in mask (which defaults to 'all state saved') that allows smart signal handlers to skip the save/restore of certain CPU context components would be acceptable I think. But I'd still expect this to be limited to tightly coded, specialized signal handlers - as the trend goes against this type of specialization: compilers and runtime environments do take advantage of new CPU features, so if you want to have an 'easy to use' signal handler, you should use the default one. I'd not be surprised if large-scale signal users like Valgrind could benefit. > > I'm sure we can preserve it, we just need to be _careful_. > > Right. > > How much does XSAVEOPT help here? IOW if we're careful to save to the same > place we restored from and we don't modify the state in the mean time, how > much > of the time do we save? In the best case, I guess we save the memory writes > but > not the reads? So I'd not design new signal interfaces around current behavior, I'd design them for the existing patterns (which center around programming ease of use) - with opt-in, performance-enhancing specializations. Thanks, Ingo
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Thu, Dec 17, 2015 at 05:48:56PM -0800, Andy Lutomirski wrote:
> Hi all-
>
> I think that, for PKRU in particular, we want the default signal
> handling behavior to be a bit unusual.

Stupid question, but what the heck is PKRU? A grep of the kernel tree
shows no results, and a web search returns mostly Thai language results.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
* Andy Lutomirski <l...@amacapital.net> wrote:

> > >> But what about the register state when delivering a signal? Don't we
> > >> set the registers to the init state? Do we need to preserve PKRU state
> > >> instead of init'ing it? The init state _is_ nice here because it is
> > >> permissive and allows us to do something useful no matter what PKRU
> > >> gets set to.
> > >
> > > I think we leave the extended regs alone. Don't we?
> > >
> > > I think that, for most signals, we want to leave PKRU as is,
> > > especially for things that aren't SIGSEGV. For SIGSEGV, maybe we want
> > > an option to reset PKRU on delivery (and then set the flag to restore
> > > on return?).
> >
> > Is there some precedent for doing the state differently for different
> > signals?
>
> Yes, to a very limited extent: SA_ONSTACK.
>
> > >> Well, the signal handler isn't necessarily going to clobber it, but the
> > >> delivery code already clobbers it when going to the init state.
> > >
> > > Can you point to that code?
> >
> > handle_signal() -> fpu__clear()
> >
> > The comment around it says:
> >
> > "Ensure the signal handler starts with the new fpu state."
>
> You win this round :)
>
> So maybe we should have a mask of xfeatures that aren't cleared on
> signal delivery (e.g. PKRU, perhaps) and that are, by default,
> preserved across signal returns.

So the principle is this: signals are conceptually like creating a new thread
of execution, and that's a very powerful programming concept, like fork() or
pthread_create() are powerful concepts. So we definitely want to keep that
default behavior, and I don't think we want to deviate from that for typical
new extended CPU context features, even if signal delivery slows down as a
result.

But we've been arguing about 'lightweight signals' for up to two decades that
I can remember. (The first such suggestion was to not save the FPU state, back
when FPU saves were ridiculously slow compared to other parts of
saving/restoring a context.)

So having a well-enumerated, extensible opt-in mask (which defaults to 'all
state saved') that allows smart signal handlers to skip the save/restore of
certain CPU context components would be acceptable, I think. But I'd still
expect this to be limited to closely coded, specialized signal handlers - as
the trend goes against this type of specialization: compilers and runtime
environments do take advantage of new CPU features, so if you want an 'easy to
use' signal handler, you should use the default one. I'd not be surprised if
large-scale signal users like Valgrind could benefit, though.

> > I'm sure we can preserve it, we just need to be _careful_.
>
> Right.
>
> How much does XSAVEOPT help here? IOW if we're careful to save to the same
> place we restored from and we don't modify the state in the mean time, how
> much of the time do we save? In the best case, I guess we save the memory
> writes but not the reads?

So I'd not design new signal interfaces around current behavior, I'd design
them for the existing patterns (which center around programming ease of use) -
with opt-in, performance-enhancing specializations.

Thanks,

	Ingo
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 12:59:14AM -0800, Christoph Hellwig wrote:
> Stupid question, but what the heck is PKRU? A grep of the kernel tree
> shows no results, and a web search returns mostly Thai language results.

That should explain it:

https://lkml.kernel.org/r/20151214190542.39c48...@viggo.jf.intel.com

--
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Thu, Dec 17, 2015 at 10:43 PM, H. Peter Anvin wrote:
> On December 17, 2015 9:29:21 PM PST, Andy Lutomirski wrote:
>> On Dec 17, 2015 6:53 PM, "Dave Hansen" wrote:
>>>
>>> On 12/17/2015 06:32 PM, Andy Lutomirski wrote:
>>> > On Thu, Dec 17, 2015 at 6:13 PM, Dave Hansen wrote:
>>> >> But what about the register state when delivering a signal? Don't we
>>> >> set the registers to the init state? Do we need to preserve PKRU state
>>> >> instead of init'ing it? The init state _is_ nice here because it is
>>> >> permissive and allows us to do something useful no matter what PKRU
>>> >> gets set to.
>>> >
>>> > I think we leave the extended regs alone. Don't we?
>>> >
>>> > I think that, for most signals, we want to leave PKRU as is,
>>> > especially for things that aren't SIGSEGV. For SIGSEGV, maybe we want
>>> > an option to reset PKRU on delivery (and then set the flag to restore
>>> > on return?).
>>>
>>> Is there some precedent for doing the state differently for different
>>> signals?
>>
>> Yes, to a very limited extent: SA_ONSTACK.
>>
>>> >> Well, the signal handler isn't necessarily going to clobber it, but
>>> >> the delivery code already clobbers it when going to the init state.
>>> >
>>> > Can you point to that code?
>>>
>>> handle_signal() -> fpu__clear()
>>>
>>> The comment around it says:
>>>
>>> "Ensure the signal handler starts with the new fpu state."
>>
>> You win this round :)
>>
>> So maybe we should have a mask of xfeatures that aren't cleared on
>> signal delivery (e.g. PKRU, perhaps) and that are, by default,
>> preserved across signal returns.
>> [...]
>
> I find the notion of PKRU not being initialized at the entry to a signal
> handler a bit odd. Saving/restoring it seems like the right thing to me.
>
> What am I missing?
On Fri, Dec 18, 2015 at 12:32 AM, Ingo Molnar wrote:
>
> So the principle is this: signals are conceptually like creating a new
> thread of execution, and that's a very powerful programming concept, like
> fork() or pthread_create() are powerful concepts. So we definitely want to
> keep that default behavior, and I don't think we want to deviate from that
> for typical new extended CPU context features, even if signal delivery
> slows down as a result.

I think that PKRU is in almost no respect a typical new extended CPU
context feature. It's a control register, not a data register.

Let's try a concrete example. One big use case for PKRU will be in
conjunction with DAX-style pmem. Imagine some program (database server,
perhaps) that mmaps 1 TB of pmem MAP_SHARED, PROT_READ | PROT_WRITE with
protection key 1. It sets PKRU so that key 1 can't be written. All of its
accesses to the pmem store are in tight code regions that do wrpkru; mov;
wrpkru (just like what the kernel does with stac/clac on SMAP systems).

From the app's perspective, it's imperative that stray writes that
coincidentally hit the pmem region fail. (1 TB is big enough that a random
write to canonical user memory hits it almost 1% of the time.)

This means:

1. If a signal is delivered, the app wants PKRU to be set to some safe
value as early as possible. The app can do it by itself if it doesn't use
libraries that interfere, but it would be IMO much, much nicer if the
kernel made this happen automatically.

1b. If the app malfunctions such that RSP points to pmem, the kernel MUST
NOT clobber the pmem space. I think that this basically mandates that PKRU
needs to have some safe state (i.e. definitely not the init state) on
signal delivery: the kernel is going to write a signal frame at the
address identified by RSP, and that address is in pmem, so those writes
need to fail.

2. clone with shared VM should preserve PKRU. So in this regard, I think
signals are kind of like new threads after all.
Now that I think about it more, there are at least two possibilities that
work for this use case.

Option A: signal delivery preserves PKRU. If we go this route, I think we
should decide whether PKRU is preserved on exit. Given that we're talking
about adding syscalls to adjust PKRU, it seems a bit odd to me that
sigreturn would, by default, undo whatever those syscalls do.

Option B: signal delivery resets PKRU to user-specified values. We're
talking about having syscalls to write PKRU. We could have clone and
signal entry use the values set by syscall instead of the values in the
actual PKRU register. That way:

set_protection_key(1, no writes);
...
wrpkru(allow writes to key 1)
mov something                      <<< fault or async signal here
wrpkru(disallow writes to key 1)

leaves key 1 protected in the signal handler.

If we go that route, I think we must restore PKRU just like any other
xstate on sigreturn, so in some sense it's the simplest of all to
implement. We'll need to change the signal delivery stuff to do the
fpu__clear and the automatic PKRU
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On 12/18/2015 08:04 AM, Andy Lutomirski wrote:
> Which reminds me: __get_user, etc all respect PKRU because the SDM
> says that PKRU affects all user *and* kernel accesses to user memory
> (i.e. anything with _PAGE_USER). Should get_user_pages also respect
> PKRU?

It does. There are a couple of parts to it, but the most important are in
these patches:

> http://git.kernel.org/cgit/linux/kernel/git/daveh/x86-pkeys.git/commit/?h=pkeys-v018&id=c64acbe036487d07968ea6a5b9090169d6e40160
> http://git.kernel.org/cgit/linux/kernel/git/daveh/x86-pkeys.git/commit/?h=pkeys-v018&id=331c62eb789f969e187964e28ac80b1e0e26b69d
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On 12/18/2015 08:04 AM, Andy Lutomirski wrote:
> 1b. If the app malfunctions such that RSP points to pmem, the kernel
> MUST NOT clobber the pmem space. I think that this basically mandates
> that PKRU needs to have some safe state (i.e. definitely not the init
> state) on signal delivery: the kernel is going to write a signal frame
> at the address identified by RSP, and that address is in pmem, so
> those writes need to fail.

The kernel is writing the signal frame using normal old copy_to_user().
Those are writing through mappings with _PAGE_USER set and should be
subject to the PKRU state of the thread before the signal started to be
delivered.

We don't do the fpu__clear() until after this copy, so I think pkeys
enforcement is being done properly for this today.
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 12:07 PM, Dave Hansen wrote:
>
> But, I think it's a small one. Basically, RSP would have to be pointing
> at a place which was allowed by protection keys for all of the sigframe
> setup.

Note that the whole "stack is special" is not at all a new issue. It's
the main reason why sigaltstack() and SS_ONSTACK exist. This is in no way
new to PKRU; people have had to handle the issue of stack-related SIGSEGV
faults for a long time.

So any application that uses PKRU and may play games that affect the
stack will always have to have a separate "safe stack" that it uses for
signal handling. But that is in no way PKRU-specific, it's been the case
for a lot of other memory management faults.

             Linus
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 10:42 AM, Dave Hansen wrote:
> On 12/18/2015 08:04 AM, Andy Lutomirski wrote:
>> 1b. If the app malfunctions such that RSP points to pmem, the kernel
>> MUST NOT clobber the pmem space. I think that this basically mandates
>> that PKRU needs to have some safe state (i.e. definitely not the init
>> state) on signal delivery: the kernel is going to write a signal frame
>> at the address identified by RSP, and that address is in pmem, so
>> those writes need to fail.
>
> The kernel is writing the signal frame using normal old copy_to_user().
> Those are writing through mappings with _PAGE_USER set and should be
> subject to the PKRU state of the thread before the signal started to be
> delivered.
>
> We don't do the fpu__clear() until after this copy, so I think pkeys
> enforcement is being done properly for this today.

True, but I think only in a very limited sense. Your average signal
handler is reasonably likely to execute "push %rbp" as its very first
instruction, at which point we're immediately screwed with the current
arrangement.

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 12:07 PM, Dave Hansen wrote:
> On 12/18/2015 11:21 AM, Andy Lutomirski wrote:
>> On Fri, Dec 18, 2015 at 10:42 AM, Dave Hansen wrote:
>>> On 12/18/2015 08:04 AM, Andy Lutomirski wrote:
>>>> 1b. If the app malfunctions such that RSP points to pmem, the kernel
>>>> MUST NOT clobber the pmem space. I think that this basically mandates
>>>> that PKRU needs to have some safe state (i.e. definitely not the init
>>>> state) on signal delivery: the kernel is going to write a signal frame
>>>> at the address identified by RSP, and that address is in pmem, so
>>>> those writes need to fail.
>>>
>>> The kernel is writing the signal frame using normal old copy_to_user().
>>> Those are writing through mappings with _PAGE_USER set and should be
>>> subject to the PKRU state of the thread before the signal started to be
>>> delivered.
>>>
>>> We don't do the fpu__clear() until after this copy, so I think pkeys
>>> enforcement is being done properly for this today.
>>
>> True, but I think only in a very limited sense. Your average signal
>> handler is reasonably likely to execute "push %rbp" as its very first
>> instruction, at which point we're immediately screwed with the current
>> arrangement.
>
> I completely agree that there's a window for corruption.
>
> But, I think it's a small one. Basically, RSP would have to be pointing
> at a place which was allowed by protection keys for all of the sigframe
> setup. Then, it _just_ happened to be at a place which was denied by
> protection keys when it enters the signal handler back in userspace.
> It's possible, but it's a small window.

That's true. OTOH, I think that the signal in the middle of the
wrpkru-allowed window is also a compelling case. So maybe my new
preference is option B, where we save and restore PKRU like all the other
xfeatures but, rather than clearing it in the signal context, we set it to
the syscall-specified value.

If we do this, then we'd have to clearly separate the syscall-specified
value from the xstate value, and we'd probably want one of the syscalls to
offer an option to force user PKRU to match the syscall value. Maybe your
code already does this -- I didn't check.

This has the nice property that sigreturn isn't affected at all except
insofar as we should give a little bit of thought to the ordering of
restoring PKRU relative to any other uaccess. I'd imagine that we'd just
restore PKRU last, in case the stack we're restoring from has access
disallowed by the saved PKRU value.

--Andy
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On 12/18/2015 11:21 AM, Andy Lutomirski wrote:
> On Fri, Dec 18, 2015 at 10:42 AM, Dave Hansen wrote:
>> On 12/18/2015 08:04 AM, Andy Lutomirski wrote:
>>> 1b. If the app malfunctions such that RSP points to pmem, the kernel
>>> MUST NOT clobber the pmem space. I think that this basically mandates
>>> that PKRU needs to have some safe state (i.e. definitely not the init
>>> state) on signal delivery: the kernel is going to write a signal frame
>>> at the address identified by RSP, and that address is in pmem, so
>>> those writes need to fail.
>>
>> The kernel is writing the signal frame using normal old copy_to_user().
>> Those are writing through mappings with _PAGE_USER set and should be
>> subject to the PKRU state of the thread before the signal started to be
>> delivered.
>>
>> We don't do the fpu__clear() until after this copy, so I think pkeys
>> enforcement is being done properly for this today.
>
> True, but I think only in a very limited sense. Your average signal
> handler is reasonably likely to execute "push %rbp" as its very first
> instruction, at which point we're immediately screwed with the current
> arrangement.

I completely agree that there's a window for corruption.

But, I think it's a small one. Basically, RSP would have to be pointing at
a place which was allowed by protection keys for all of the sigframe
setup. Then, it _just_ happened to be at a place which was denied by
protection keys when it enters the signal handler back in userspace. It's
possible, but it's a small window.
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 3:08 PM, Linus Torvalds wrote:
> On Fri, Dec 18, 2015 at 2:28 PM, Andy Lutomirski wrote:
>>
>> Apps that don't want to use the baseline_pkru mechanism could use
>> syscalls to claim ownership of protection keys but then manage them
>> purely with WRPKRU directly. We could optionally disallow
>> mprotect_key on keys that weren't allocated in advance.
>>
>> Does that seem sane?
>
> So everything seems sane except for the need for that baseline_pkru.
>
> I'm not seeing why it couldn't just be a fixed value. Is there any
> real downside to it?

Yes, I think. If I'm using protection keys to protect some critical data
structure (important stuff in shared memory, important memory mapped
files, pmem, etc), then I'll allocate a protection key and set PKRU to
deny writes. The problem is that I really, really want writes denied
except when explicitly enabled in narrow regions of code that use wrpkru
to enable them, and I don't want an asynchronous signal delivered in those
narrow regions of code, or newly cloned threads, to pick up the
write-allow value. So I want baseline_pkru to have the deny-writes entry.
I think I would do exactly this in my production code here if my server
supported it. Some day...

Hrm. We might also want an option to change pkru and/or baseline_pkru in
all threads in the current mm. That's optional but it could be handy.
Maybe it would be as simple as having the allocate-a-pkey call have an
option to set an initial baseline value and an option to propagate that
initial value to pre-existing threads.

--Andy
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 3:16 PM, Andy Lutomirski wrote:
>
> Yes, I think. If I'm using protection keys to protect some critical
> data structure (important stuff in shared memory, important memory
> mapped files, pmem, etc), then I'll allocate a protection key and set
> PKRU to deny writes. The problem is that I really, really want writes
> denied except when explicitly enabled in narrow regions of code that
> use wrpkru to enable them, and I don't want an asynchronous signal
> delivered in those narrow regions of code or newly cloned threads to
> pick up the write-allow value. So I want baseline_pkru to have the
> deny-writes entry.

Hmm. Ok, that does sound like a valid and interesting usage case. Fair
enough.

             Linus
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 2:28 PM, Andy Lutomirski wrote:
>
> Apps that don't want to use the baseline_pkru mechanism could use
> syscalls to claim ownership of protection keys but then manage them
> purely with WRPKRU directly. We could optionally disallow
> mprotect_key on keys that weren't allocated in advance.
>
> Does that seem sane?

So everything seems sane except for the need for that baseline_pkru.

I'm not seeing why it couldn't just be a fixed value. Is there any real
downside to it?

             Linus
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On 12/18/2015 12:49 PM, Andy Lutomirski wrote:
>
> IOW, I like my idea in which signal delivery always sets PKRU to the
> application-requested-by-syscall values and sigreturn restores it.
> Kinda like sigaltstack, but applies to all signals and affects PKRU
> instead of RSP.

I think this is the only sensible option, with the default being all zero.

	-hpa
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 1:45 PM, Linus Torvalds wrote:
> On Fri, Dec 18, 2015 at 1:12 PM, Dave Hansen wrote:
>>
>> But, if we are picking out an execute-only pkey more dynamically, we've
>> got to keep the default value for the entire process somewhere.
>
> How dynamic do we want to make this, though?
>
> I haven't looked at the details, and perhaps more importantly, I don't
> know what exactly are the requirements you've gotten from the people
> who are expected to actually use this.
>
> I think we might want to hardcode a couple of keys as "kernel
> reserved". And I'd rather reserve them up-front than have some user
> program be unhappy later when we want to use them.
>
> I guess we want to leave key #0 for "normal page", so my suggestion to
> use that for the execute-only was probably misguided.
>
> But I do think we might want to have that "no read access" as a real
> fixed key too, because I think the kernel itself would want to use it:
>
>  (a) to make sure that it gets the right fault when user space passes
> in an execute-only address to a system call.
>
>  (b) for much more efficient PAGEALLOC_DEBUG for kernel mappings.
>
> so I do think that we'd want to reserve two of the 16 keys up front.
>
> Would it be ok for the expected users to have those keys simply be
> fixed? With key 0 being used for all default pages, and key 1 being
> used for all execute-only pages? And then defaulting PKRU to 4,
> disallowing access to that key #1?
>
> I could imagine that some kernel person would want to use even more
> keys, but I think two fixed keys are kind of the minimal we'd want to
> use.

I imagine we'd reserve key 0 for normal pages and key 1 for deny-read.

Let me be a bit more concrete about what I'm suggesting:

We'd have thread_struct.baseline_pkru. It would start with key 0 allowing
all access and key 1 denying reads.

We'd have a syscall like set_protection_key that could allocate unused
keys and change the values of keys that have been allocated. Those changes
would be reflected in baseline_pkru. Changes to keys 0 and 1 in
baseline_pkru would not be allowed.

Signal delivery would load baseline_pkru into the PKRU register. Signal
restore would restore PKRU to its previous value.

WRPKRU would, of course, override baseline_pkru, but it wouldn't change
baseline_pkru. The set_protection_key syscall would modify *both* real
PKRU and baseline_pkru.

Apps that don't want to use the baseline_pkru mechanism could use syscalls
to claim ownership of protection keys but then manage them purely with
WRPKRU directly. We could optionally disallow mprotect_key on keys that
weren't allocated in advance.

Does that seem sane?

--Andy
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On Fri, Dec 18, 2015 at 12:49 PM, Andy Lutomirski wrote:
>
> IOW, I like my idea in which signal delivery always sets PKRU to the
> application-requested-by-syscall values and sigreturn restores it.

So I don't mind that, as long as the whole "sigreturn restores it" is
part of things.

Your original email with the suggestion to *not* restore PKRU I didn't
like. Setting it and restoring it is fine.

I do wonder if you need an explicit value, though. I think it's
reasonable to say that PKRU value 0 is special. It's what we'd start
processes with, and why not just say that it's what we run signal
handlers in?

Would any other value ever make sense, really?

             Linus
Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
On 12/18/2015 01:04 PM, Linus Torvalds wrote:
> On Fri, Dec 18, 2015 at 12:49 PM, Andy Lutomirski wrote:
>>
>> IOW, I like my idea in which signal delivery always sets PKRU to the
>> application-requested-by-syscall values and sigreturn restores it.
>
> So I don't mind that, as long as the whole "sigreturn restores it" is
> part of things.
>
> Your original email with the suggestion to *not* restore PKRU I didn't
> like. Setting it and restoring it is fine.
>
> I do wonder if you need an explicit value, though. I think it's
> reasonable to say that PKRU value 0 is special. It's what we'd start
> processes with, and why not just say that it's what we run signal
> handlers in?
>
> Would any other value ever make sense, really?

Having a PKRU with the execute-only permissions set is the only one I can
think of.

For a system with a _dedicated_ pkey for execute-only, this is easy and
could even be made a part of init_fpstate with no other code changes.

But, if we are picking out an execute-only pkey more dynamically, we've
got to keep the default value for the entire process somewhere.